The standard normal distribution is a special case of the normal distribution. For the standard normal distribution, the value of the mean is equal to zero ($\mu = 0$), and the value of the standard deviation is equal to 1 ($\sigma = 1$).

Thus, by plugging $\mu = 0$ and $\sigma = 1$ into the PDF of the normal distribution, the equation simplifies to

\begin{align} f(x)& = \frac{1}{\sigma \sqrt{2 \pi}}e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} \\ & =\frac{1}{1 \times \sqrt{2 \pi}}e^{-\frac{1}{2}\left(\frac{x-0}{1}\right)^2} \\ & = \frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}x^2} \end{align}
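As a quick sanity check of the simplified PDF, we can evaluate it at $x = 0$, where it attains its maximum $1/\sqrt{2\pi} \approx 0.3989$, and compare against scipy.stats (a minimal sketch, assuming numpy/scipy are installed as in the cells below):

```python
import math
import scipy.stats as stats

# Simplified standard normal PDF: f(x) = exp(-x^2 / 2) / sqrt(2 * pi)
def f(x):
    return math.exp(-0.5 * x**2) / math.sqrt(2 * math.pi)

print(f(0))               # maximum of the curve, 1/sqrt(2*pi) ~ 0.3989
print(stats.norm.pdf(0))  # same value from scipy.stats
```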

The random variable that possesses the standard normal distribution is denoted by $z$. Consequently, the units on the standard normal curve are denoted by $z$ and are called $z$-values or $z$-scores. They are also called standard units or standard scores.

The cumulative distribution function (CDF) of the standard normal distribution, corresponding to the area under the curve for the interval $(-\infty, z]$ and usually denoted by the Greek letter $\phi$, is given by

$$F(z) = \phi (z) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{z}e^{-\frac{1}{2}x^2}\,dx$$

where $e \approx 2.71828$ and $\pi \approx 3.14159$.
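The integral above has no closed-form antiderivative, but it can be expressed via the error function as $\phi(z) = \frac{1}{2}\left(1 + \operatorname{erf}\left(z/\sqrt{2}\right)\right)$. A minimal sketch, comparing this identity against scipy.stats:

```python
import math
import scipy.stats as stats

def phi(z):
    """Standard normal CDF via the error function:
    phi(z) = (1 + erf(z / sqrt(2))) / 2"""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Both approaches agree for any z
for z in (-1.96, 0.0, 1.96):
    print(z, phi(z), stats.norm.cdf(z))
```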


Basic Properties of the Standard Normal Curve

The standard normal curve is a special case of the normal distribution, and thus itself a probability density curve. Therefore, the basic properties of the normal distribution hold for the standard normal curve as well (Weiss 2010).

  1. The total area under the standard normal curve is 1 (this property is shared by all density curves).
  2. The standard normal curve extends indefinitely in both directions, approaching, but never touching, the horizontal axis as it does so.
  3. The standard normal curve is bell shaped and centered at $z=0$. Almost all the area under the standard normal curve lies between $z=-3$ and $z=3$.
In [2]:
# First, let's import all the needed libraries.
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
In [3]:
mu = 0
sigma = 1
cut_a = -4
cut_b = 0.5

x = np.arange(-4, 4.01, 0.01)
yy = stats.norm.pdf(x, mu, sigma)

plt.figure(figsize=(10, 5))
plt.plot(x, yy, color="darkblue")

plt.title("The probability density function of the standard normal distribution")
plt.xticks(
    [
        mu - 3 * sigma,
        mu - 2 * sigma,
        mu - sigma,
        mu,
        0.5,
        mu + sigma,
        mu + 2 * sigma,
        mu + 3 * sigma,
    ],
    [-3, -2, -1, 0, "z", 1, 2, 3],
)

plt.fill_between(
    x=x,
    y1=yy,
    where=(x <= 0.5),
    color="red",
    edgecolor="black",
    alpha=0.75,
)


xpos = [-3, -2, -1, 0, 1, 2, 3]
ypos = [0.005, 0.05, 0.25, max(yy), 0.25, 0.05, 0.005]
for px, py in zip(xpos, ypos):
    plt.vlines(x=mu + px, ymin=0, ymax=py, color="blue", linestyle="--")

plt.text(
    2.5, 0.3, "$f(x) = \\frac{1}{\\sqrt{2\\pi}}e^{-\\frac{1}{2} x^2}$", fontsize=16
)

plt.arrow(
    2.2,
    0.26,
    -0.9,
    -0.08,
    length_includes_head=True,
    head_width=0.02,
    head_length=0.1,
    color="black",
)

plt.arrow(
    -2.2,
    0.28,
    1.3,
    -0.1,
    length_includes_head=True,
    head_width=0.02,
    head_length=0.1,
    color="black",
)

plt.text(-2.5, 0.3, "$\\phi(z)$", fontsize=16)
plt.xlabel("z score")
plt.ylabel("f(x)")
plt.show()

The $z$-values on the right side of the mean are positive and those on the left side are negative. The $z$-value for a point on the horizontal axis gives the distance between the mean ($z=0$) and that point in terms of the standard deviation. For example, a point with a value of $z=2$ is two standard deviations to the right of the mean. Similarly, a point with a value of $z=-2$ is two standard deviations to the left of the mean.
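Any normally distributed value $x$ can be converted to such a $z$-score via $z = (x - \mu)/\sigma$. A minimal sketch; the distribution parameters below ($\mu = 100$, $\sigma = 15$) are hypothetical and chosen only for illustration:

```python
# Standardizing a raw value x to a z-score: z = (x - mu) / sigma.
# mu = 100 and sigma = 15 are hypothetical example parameters.
def z_score(x, mu, sigma):
    return (x - mu) / sigma

mu, sigma = 100, 15
print(z_score(130, mu, sigma))  # 2.0 -> two standard deviations above the mean
print(z_score(70, mu, sigma))   # -2.0 -> two standard deviations below the mean
```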

In [4]:
mu = 0
sigma = 1
cut_a = -4
cut_b = 0.5

x = np.arange(-4, 4.01, 0.01)
yy = stats.norm.cdf(x, mu, sigma)

plt.figure(figsize=(10, 5))
plt.plot(x, yy, color="darkblue")

plt.title(
    "The cumulative distribution function\nof the standard normal distribution"
)
plt.xlabel("z score")
plt.ylabel("$\\phi(z)$")
plt.xticks(
    [
        mu - 3 * sigma,
        mu - 2 * sigma,
        mu - sigma,
        mu,
        0.5,
        mu + sigma,
        mu + 2 * sigma,
        mu + 3 * sigma,
    ],
    [-3, -2, -1, 0, "z", 1, 2, 3],
)

plt.fill_between(
    x=x,
    y1=yy,
    where=(x <= 0.5),
    color="red",
    edgecolor="black",
    alpha=0.75,
)


xpos = [-3, -2, -1, 0, 1, 2, 3]
ypos = [0.0001, 0.01, 0.15, 0.5, 0.83, 0.97, max(yy)]
for px, py in zip(xpos, ypos):
    plt.vlines(x=mu + px, ymin=0, ymax=py, color="blue", linestyle="--")

plt.arrow(
    -1,
    0.7,
    1.2,
    -0.3,
    length_includes_head=True,
    head_width=0.02,
    head_length=0.1,
    color="black",
)

plt.text(
    -2,
    0.8,
    "$\\phi (z) = \\frac{1}{\\sqrt{2\\pi}} \\int_{-\\infty}^{z}e^{-\\frac{1}{2}x^2}dx$",
    fontsize=16,
)

plt.show()

The concept of determining probabilities by calculating the area under the standard normal curve is extensively applied. That is why probability tables exist for looking up the area for a particular $z$-value. However, Python is such a powerful tool that we can calculate the area under the curve for any particular $z$ score directly.

To calculate the area under the curve for a standard normal distribution we apply the cdf method of the norm object from the scipy.stats module. The scipy.stats.norm.cdf() function is defined as cdf(x, loc=0, scale=1). The location (loc) keyword specifies the mean, while the scale keyword specifies the standard deviation. The defaults for the mean and the standard deviation are $0$ and $1$, respectively. Thus, applied to the standard normal distribution, the call simplifies to stats.norm.cdf(x). We calculate the area under the curve for $z = -3, -2, -1, 0, 1, 2, 3$, or written more formally:

$$P(x\le z) \qquad \text{for } z \in (-3, -2, -1, 0, 1, 2, 3)$$
In [5]:
stats.norm.cdf(-3)
stats.norm.cdf(-2)
## and so on...
Out[5]:
0.022750131948179195
In [6]:
## ... or simplified in a loop:
z = [-3, -2, -1, 0, 1, 2, 3]
for i in z:
    print(i, "->", stats.norm.cdf(i))
-3 -> 0.0013498980316300933
-2 -> 0.022750131948179195
-1 -> 0.15865525393145707
0 -> 0.5
1 -> 0.8413447460685429
2 -> 0.9772498680518208
3 -> 0.9986501019683699

Perfect! We confirmed some of the properties of the standard normal curve stated above. We calculated the area under the curve for the interval $(-\infty, z]$. Calling stats.norm.cdf(-3) yields a very small number: only about 0.135% of the total area under the curve is found to the left of $z=-3$, which corresponds to a distance of 3 standard deviations from the mean. Moreover, stats.norm.cdf(0) yields 0.5. Thus, we conclude that the area under the curve for the interval $(-\infty, 0]$ is the same as the area under the curve for the interval $[0, \infty)$, and that the total area under the curve sums to $1$. Again, we confirmed one of the properties stated above. Finally, calling stats.norm.cdf(3) yields a number close to 1: approximately 99.865% of the area under the curve is found in the interval $(-\infty, 3]$, leaving only very little for the area beyond $z = 3$.
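The lookup also works in the opposite direction: given an area, stats.norm.ppf() (the percent point function, the inverse of cdf()) returns the corresponding $z$-score. A minimal sketch:

```python
import scipy.stats as stats

# ppf() is the inverse of cdf(): given an area p,
# it returns the z with P(x <= z) = p.
p = stats.norm.cdf(2)   # area to the left of z = 2
z = stats.norm.ppf(p)   # recovers z = 2 from that area
print(p, z)

# A well-known quantile: the 97.5 % quantile is z ~ 1.96
print(stats.norm.ppf(0.975))
```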

Recall, that we may explicitly calculate the area under the curve for any interval of interest:

\begin{align} P(a \le z \le b) & = P(z \le b) - P(z \le a) \\ & =\int_{a}^{b}f(z)dz \\ & = \int_{-\infty}^{b}f(z)dz - \int_{-\infty}^{a}f(z)dz \end{align}

Let us calculate the area under the curve for the following intervals: $[-1,1], [-2,2], [-3,3]$. Or in words, let us determine the area under the curve for $\pm 1$ standard deviation, for $\pm 2$ standard deviations, and for $\pm 3$ standard deviations.

In [7]:
# 1st standard deviation
stats.norm.cdf(1) - stats.norm.cdf(-1)
Out[7]:
0.6826894921370859
In [8]:
# 2nd standard deviations
stats.norm.cdf(2) - stats.norm.cdf(-2)
Out[8]:
0.9544997361036416
In [9]:
# 3rd standard deviation
stats.norm.cdf(3) - stats.norm.cdf(-3)
Out[9]:
0.9973002039367398

Awesome, we just confirmed the empirical rule, also known as the 68-95-99.7 rule, which is related to Chebyshev's theorem. For a bell-shaped distribution, the rule states that approximately

  • 68% of the observations lie within one standard deviation of the mean,
  • 95% of the observations lie within two standard deviations of the mean, and
  • 99.7% of the observations lie within three standard deviations of the mean.
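The three bullet points can be restated in one compact loop, mirroring the calculations from the previous cells (a minimal sketch using scipy.stats):

```python
import scipy.stats as stats

# Coverage within k standard deviations: P(-k <= z <= k) = cdf(k) - cdf(-k)
for k, rule in zip((1, 2, 3), (0.68, 0.95, 0.997)):
    coverage = stats.norm.cdf(k) - stats.norm.cdf(-k)
    print(f"within {k} standard deviation(s): {coverage:.4f} (rule of thumb: {rule})")
```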

To strengthen our intuition, the empirical rule is visualized below.

In [10]:
mu = 0
sigma = 1
cut_a = -1
cut_b = 1

x = np.arange(-4, 4.01, 0.01)
yy = stats.norm.pdf(x, mu, sigma)

plt.figure(figsize=(10, 5))
plt.plot(x, yy, color="black")

plt.title("The area under the curve for the interval z = [-1,1]")
plt.yticks([])

plt.fill_between(
    x=x,
    y1=yy,
    where=(x >= -1) & (x <= 1),
    color="red",
    alpha=0.75,
)


plt.text(
    2.5,
    0.3,
    "$\\phi(z) =\\int_{-1}^{1}f(z)dz = P(z \\leq 1) - P(z \\leq -1)$",
    fontsize=14,
)

plt.text(
    -0.5,
    0.15,
    "$\\phi(z) \\approx 0.68$",
    fontsize=14,
)

plt.axhline(-0.001, color="black")
plt.xlabel("z score")

plt.show()
In [11]:
mu = 0
sigma = 1
cut_a = -2
cut_b = 2

x = np.arange(-4, 4.01, 0.01)
yy = stats.norm.pdf(x, mu, sigma)

plt.figure(figsize=(10, 5))
plt.plot(x, yy, color="black")

plt.title("The area under the curve for the interval z = [-2,2]")
plt.yticks([])

plt.fill_between(
    x=x,
    y1=yy,
    where=(x >= -2) & (x <= 2),
    color="red",
    alpha=0.75,
)


plt.text(
    2.5,
    0.3,
    "$\\phi(z) =\\int_{-2}^{2}f(z)dz = P(z \\leq 2) - P(z \\leq -2)$",
    fontsize=14,
)


plt.text(
    -0.5,
    0.15,
    "$\\phi(z) \\approx 0.95$",
    fontsize=14,
)

plt.axhline(-0.001, color="black")
plt.xlabel("z score")

plt.show()
In [12]:
mu = 0
sigma = 1
cut_a = -3
cut_b = 3

x = np.arange(-4, 4.01, 0.01)
yy = stats.norm.pdf(x, mu, sigma)

plt.figure(figsize=(10, 5))
plt.plot(x, yy, color="black")

plt.title("The area under the curve for the interval z = [-3,3]")
plt.yticks([])

plt.fill_between(
    x=x,
    y1=yy,
    where=(x >= -3) & (x <= 3),
    color="red",
    alpha=0.75,
)


plt.text(
    2.5,
    0.3,
    "$\\phi(z) =\\int_{-3}^{3}f(z)dz = P(z \\leq 3) - P(z \\leq -3)$",
    fontsize=14,
)

plt.text(
    -0.5,
    0.15,
    "$\\phi(z) \\approx 0.997$",
    fontsize=14,
)

plt.axhline(-0.001, color="black")
plt.xlabel("z score")

plt.show()

Citation

The E-Learning project SOGA-Py was developed at the Department of Earth Sciences by Annette Rudolph, Joachim Krois and Kai Hartmann. You can reach us via mail by soga[at]zedat.fu-berlin.de.

Creative Commons License
You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follows: Rudolph, A., Krois, J., Hartmann, K. (2023): Statistics and Geodata Analysis using Python (SOGA-Py). Department of Earth Sciences, Freie Universitaet Berlin.