The standard normal distribution is a special case of the normal distribution. For the standard normal distribution, the value of the mean is equal to zero ($\mu = 0$), and the value of the standard deviation is equal to 1 ($\sigma = 1$).
Thus, by plugging $\mu = 0$ and $\sigma = 1$ into the PDF of the normal distribution, the equation simplifies to
\begin{align} f(x)& = \frac{1}{\sigma \sqrt{2 \pi}}e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} \\ & =\frac{1}{1 \times \sqrt{2 \pi}}e^{-\frac{1}{2}\left(\frac{x-0}{1}\right)^2} \\ & = \frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}x^2} \end{align}

A random variable that follows the standard normal distribution is denoted by $z$. Consequently, the units of the standard normal distribution curve are denoted by $z$ and are called $z$-values or $z$-scores. They are also called standard units or standard scores.
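As a quick numerical sanity check of this simplified density (a minimal sketch, assuming only that numpy and scipy are installed), we can evaluate $\frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}x^2}$ at $x = 0$ and compare it with the value returned by scipy.stats.norm.pdf():
# A minimal sanity check: evaluate the simplified standard normal PDF
# at x = 0 by hand and compare it with scipy's implementation.
import numpy as np
import scipy.stats as stats

x = 0
f_manual = np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)
f_scipy = stats.norm.pdf(x)  # defaults: loc=0 (mean), scale=1 (standard deviation)
print(f_manual, f_scipy)  # both are approximately 0.3989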
The cumulative distribution function (CDF) of the standard normal distribution, corresponding to the area under the curve over the interval $(-\infty, z]$ and usually denoted by the Greek letter $\phi$, is given by
$$\phi (z) = P(x \le z) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{z}e^{-\frac{1}{2}x^2}dx,$$ where $e \approx 2.71828$ and $\pi \approx 3.14159$.
The standard normal curve is a special case of the normal distribution and thus itself a probability distribution curve. Therefore, the basic properties of the normal distribution hold for the standard normal curve as well (Weiss 2010).
# First, let's import all the needed libraries.
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
mu = 0
sigma = 1
cut_a = -4
cut_b = 0.5
x = np.arange(-4, 4.01, 0.01)
yy = stats.norm.pdf(x, mu, sigma)
plt.figure(figsize=(10, 5))
plt.plot(x, yy, color="darkblue")
plt.title("The probability density function of the standard normal distribution")
plt.xticks(
[
mu - 3 * sigma,
mu - 2 * sigma,
mu - sigma,
mu,
0.5,
mu + sigma,
mu + 2 * sigma,
mu + 3 * sigma,
],
[-3, -2, -1, 0, "z", 1, 2, 3],
)
plt.fill_between(
x=x,
y1=yy,
where=(x <= 0.5),
color="red",
edgecolor="black",
alpha=0.75,
)
xpos = [-3, -2, -1, 0, 1, 2, 3]
ypos = [0.005, 0.05, 0.25, max(yy), 0.25, 0.05, 0.005]
for px, py in zip(xpos, ypos):
    plt.vlines(x=mu + px, ymin=0, ymax=py, color="blue", linestyle="--")
plt.text(
2.5, 0.3, "$f(x) = \\frac{1}{\\sqrt{2\\pi}}e^{-\\frac{1}{2} x^2}$", fontsize=16
)
plt.arrow(
2.2,
0.26,
-0.9,
-0.08,
length_includes_head=True,
head_width=0.02,
head_length=0.1,
color="black",
)
plt.arrow(
-2.2,
0.28,
1.3,
-0.1,
length_includes_head=True,
head_width=0.02,
head_length=0.1,
color="black",
)
plt.text(-2.5, 0.3, "$\\phi(z)$", fontsize=16)
plt.xlabel("z score")
plt.ylabel("f(x)")
plt.show()
The $z$-values on the right side of the mean are positive and those on the left side are negative. The $z$-value for a point on the horizontal axis gives the distance between the mean ($z=0$) and that point in terms of the standard deviation. For example, a point with a value of $z=2$ is two standard deviations to the right of the mean. Similarly, a point with a value of $z=-2$ is two standard deviations to the left of the mean.
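To make this explicit, the following minimal sketch computes the distance of the points $z = -2$ and $z = 2$ from the mean in units of the standard deviation:
# The z-score measures the distance of a point from the mean in units of
# the standard deviation: (x - mu) / sigma. For the standard normal
# distribution (mu = 0, sigma = 1) this distance equals the z-value itself.
mu = 0
sigma = 1
for point in [-2, 2]:
    print(point, "->", (point - mu) / sigma, "standard deviations from the mean")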
mu = 0
sigma = 1
cut_a = -4
cut_b = 0.5
x = np.arange(-4, 4.01, 0.01)
yy = stats.norm.cdf(x, mu, sigma)
plt.figure(figsize=(10, 5))
plt.plot(x, yy, color="darkblue")
plt.title(
"The cummulative probability density function\nof the standard normal distribution"
)
plt.xlabel("z score")
plt.ylabel("$\\phi(z)$")
plt.xticks(
[
mu - 3 * sigma,
mu - 2 * sigma,
mu - sigma,
mu,
0.5,
mu + sigma,
mu + 2 * sigma,
mu + 3 * sigma,
],
[-3, -2, -1, 0, "z", 1, 2, 3],
)
plt.fill_between(
x=x,
y1=yy,
where=(x <= 0.5),
color="red",
edgecolor="black",
alpha=0.75,
)
xpos = [-3, -2, -1, 0, 1, 2, 3]
ypos = [0.0001, 0.01, 0.15, 0.5, 0.83, 0.97, max(yy)]
for px, py in zip(xpos, ypos):
    plt.vlines(x=mu + px, ymin=0, ymax=py, color="blue", linestyle="--")
plt.arrow(
-1,
0.7,
1.2,
-0.3,
length_includes_head=True,
head_width=0.02,
head_length=0.1,
color="black",
)
plt.text(
-2,
0.8,
"$\\phi (z) = \\frac{1}{\\sqrt{2\\pi}} \\int_{-\\infty}^{z}e^{-\\frac{1}{2}x^2}dx$",
fontsize=16,
)
plt.show()
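As a quick numerical cross-check of the CDF definition, the shaded area in the figure above, i.e. $\phi(0.5)$, can also be obtained by integrating the density directly. This is a minimal sketch; scipy.integrate.quad is used here purely for illustration:
# A minimal sketch: approximate phi(0.5) by numerical integration of the
# standard normal PDF and compare it with scipy's cdf() implementation.
import numpy as np
import scipy.stats as stats
from scipy.integrate import quad

z = 0.5  # the same cut-off that is shaded in the figure above
area, _ = quad(lambda x: np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi), -np.inf, z)
print(area, stats.norm.cdf(z))  # both are approximately 0.6915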
The concept of determining probabilities by calculating the area under the standard normal curve is extensively applied. That is why probability tables exist to look up the area for a particular $z$-value. However, Python is such a powerful tool that we can calculate the area under the curve for any particular $z$-score directly.
To calculate the area under the curve of a standard normal distribution we apply the `cdf()` method of the `norm` distribution object from the `scipy.stats` module. The `scipy.stats.norm.cdf()` function is defined as `cdf(x, loc=0, scale=1)`. The location (`loc`) keyword specifies the mean, while the `scale` keyword specifies the standard deviation. Further, we see that the defaults for the mean and the standard deviation are $0$ and $1$, respectively. Thus, the `cdf()` function, applied to the standard normal distribution, simplifies to `stats.norm.cdf(x)`. We calculate the area under the curve for $z = -3, -2, -1, 0, 1, 2, 3$, or written more formally:
stats.norm.cdf(-3)
stats.norm.cdf(-2)
## and so on...
0.022750131948179195
## ... or simplified in a loop:
z = [-3, -2, -1, 0, 1, 2, 3]
for i in z:
    print(i, "->", stats.norm.cdf(i))
-3 -> 0.0013498980316300933
-2 -> 0.022750131948179195
-1 -> 0.15865525393145707
0 -> 0.5
1 -> 0.8413447460685429
2 -> 0.9772498680518208
3 -> 0.9986501019683699
Perfect! We confirmed some of the above stated properties of a standard normal curve, as we calculated the area under the curve for the interval $(-\infty, z]$. Calling `stats.norm.cdf(-3)` yields a very low number: only about 0.135% of the total area under the curve is found to the left of $z=-3$, which corresponds to a distance of 3 standard deviations from the mean. Moreover, `stats.norm.cdf(0)` yields exactly $0.5$. Awesome! Thus, we conclude that the area under the curve for the interval $(-\infty, 0]$ is the same as the area under the curve for the interval $[0, \infty)$, and that the total area under the curve sums to $1$. Again, we confirmed one of the above stated properties of a standard normal curve. And finally, calling `stats.norm.cdf(3)` yields a high number close to 1: approximately 99.865% of the area under the curve lies in the interval $(-\infty, 3]$, leaving only very little area beyond $z = 3$.
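The observation that the two halves of the curve enclose equal areas reflects the symmetry of the standard normal distribution, $\phi(-z) = 1 - \phi(z)$. A minimal sketch of this relation (only scipy.stats is needed):
# Symmetry of the standard normal distribution: phi(-z) = 1 - phi(z)
import scipy.stats as stats

for z in [0.5, 1, 2, 3]:
    print(z, "->", stats.norm.cdf(-z), "=", 1 - stats.norm.cdf(z))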
Recall that we may explicitly calculate the area under the curve for any interval of interest:
\begin{align} P(a \le z \le b) & = \int_{a}^{b}f(z)dz \\ & = \int_{-\infty}^{b}f(z)dz - \int_{-\infty}^{a}f(z)dz \\ & = P(z \le b) - P(z \le a) \end{align}

Let us calculate the area under the curve for the following intervals: $[-1,1], [-2,2], [-3,3]$. Or in words, let us determine the area under the curve for $\pm 1$ standard deviation, for $\pm 2$ standard deviations, and for $\pm 3$ standard deviations.
# 1st standard deviation
stats.norm.cdf(1) - stats.norm.cdf(-1)
0.6826894921370859
# 2nd standard deviations
stats.norm.cdf(2) - stats.norm.cdf(-2)
0.9544997361036416
# 3rd standard deviation
stats.norm.cdf(3) - stats.norm.cdf(-3)
0.9973002039367398
Awesome, we just confirmed the Empirical Rule, also known as the 68-95-99.7 rule, which is related to Chebyshev's theorem. For a bell-shaped distribution the three rules state that approximately 68% of the observations lie within one standard deviation of the mean, approximately 95% lie within two standard deviations of the mean, and approximately 99.7% lie within three standard deviations of the mean.
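As a complementary, simulation-based check of the Empirical Rule, we can draw a random sample from the standard normal distribution and count the share of observations within one, two, and three standard deviations of the mean. This is a minimal sketch; the sample size and the seed are arbitrary choices:
# Empirical check of the 68-95-99.7 rule with a simulated standard normal sample.
import numpy as np

rng = np.random.default_rng(seed=10)
sample = rng.normal(loc=0, scale=1, size=100_000)

for k in [1, 2, 3]:
    share = np.mean(np.abs(sample) <= k)
    print(f"within {k} standard deviation(s): {share:.4f}")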
To strengthen our intuition, the Empirical Rule is visualized below.
mu = 0
sigma = 1
cut_a = -1
cut_b = 1
x = np.arange(-4, 4.01, 0.01)
yy = stats.norm.pdf(x, mu, sigma)
plt.figure(figsize=(10, 5))
plt.plot(x, yy, color="black")
plt.title("The area between the interval z = [-1,1]")
plt.yticks([])
plt.fill_between(
x=x,
y1=yy,
where=(x >= -1) & (x <= 1),
color="red",
alpha=0.75,
)
plt.text(
2.5,
0.3,
"$\\phi(z) =\\int_{-1}^{1}f(z)dz = P(z \\leq 1) - P(z \\leq -1)$",
fontsize=14,
)
plt.text(
-0.5,
0.15,
"$\\phi(z) \\approx 0.68$",
fontsize=14,
)
plt.axhline(-0.001, color="black")
plt.xlabel("z score")
plt.show()
mu = 0
sigma = 1
cut_a = -2
cut_b = 2
x = np.arange(-4, 4.01, 0.01)
yy = stats.norm.pdf(x, mu, sigma)
plt.figure(figsize=(10, 5))
plt.plot(x, yy, color="black")
plt.title("The area between the interval z = [-2,2]")
plt.yticks([])
plt.fill_between(
x=x,
y1=yy,
where=(x >= -2) & (x <= 2),
color="red",
alpha=0.75,
)
plt.text(
2.5,
0.3,
"$\\phi(z) =\\int_{-2}^{2}f(z)dz = P(z \\leq 2) - P(z \\leq -2)$",
fontsize=14,
)
plt.text(
-0.5,
0.15,
"$\\phi(z) \\approx 0.95$",
fontsize=14,
)
plt.axhline(-0.001, color="black")
plt.xlabel("z score")
plt.show()
mu = 0
sigma = 1
cut_a = -3
cut_b = 3
x = np.arange(-4, 4.01, 0.01)
yy = stats.norm.pdf(x, mu, sigma)
plt.figure(figsize=(10, 5))
plt.plot(x, yy, color="black")
plt.title("The area between the interval z = [-3,3]")
plt.yticks([])
plt.fill_between(
x=x,
y1=yy,
where=(x >= -3) & (x <= 3),
color="red",
alpha=0.75,
)
plt.text(
2.5,
0.3,
"$\\phi(z) =\\int_{-3}^{3}f(z)dz = P(z \\leq 3) - P(z \\leq -3)$",
fontsize=14,
)
plt.text(
-0.5,
0.15,
"$\\phi(z) \\approx 0.97$",
fontsize=14,
)
plt.axhline(-0.001, color="black")
plt.xlabel("z score")
plt.show()
Citation
The E-Learning project SOGA-Py was developed at the Department of Earth Sciences by Annette Rudolph, Joachim Krois and Kai Hartmann. You can reach us via e-mail at soga[at]zedat.fu-berlin.de.
Please cite as follows: Rudolph, A., Krois, J., Hartmann, K. (2023): Statistics and Geodata Analysis using Python (SOGA-Py). Department of Earth Sciences, Freie Universitaet Berlin.