The standard normal distribution is a special case of the normal distribution. For the standard normal distribution, the mean is equal to zero (μ = 0) and the standard deviation is equal to one (σ = 1).
Thus, by plugging μ = 0 and σ = 1 into the PDF of the normal distribution, the equation simplifies to
$$
\begin{aligned}
f(x) &= \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} \\
     &= \frac{1}{1 \times \sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-0}{1}\right)^2} \\
     &= \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}x^2}
\end{aligned}
$$

The random variable that possesses the standard normal distribution is denoted by z. Consequently, the units of the standard normal distribution curve are denoted by z and are called z-values or z-scores. They are also called standard units or standard scores.
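As a quick sanity check, the simplified formula can be compared with the PDF that SciPy computes for its default parameters. This is only a minimal sketch; the helper name standard_normal_pdf and the chosen grid of x values are assumptions made for illustration and are not part of the original material.

```python
import numpy as np
from scipy import stats


def standard_normal_pdf(x):
    """Closed-form PDF of the standard normal distribution (mu = 0, sigma = 1)."""
    return np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)


# Hypothetical grid of evaluation points, chosen only for illustration
x = np.linspace(-4, 4, 9)

manual = standard_normal_pdf(x)
library = stats.norm.pdf(x)  # defaults: loc=0 (mean), scale=1 (standard deviation)

# True if the hand-written formula agrees with SciPy on every grid point
print(np.allclose(manual, library))

# Height of the curve at the mean, 1 / sqrt(2 * pi), roughly 0.3989
print(standard_normal_pdf(0.0))
```

If both computations agree, the simplification from the general normal PDF to the standard normal PDF was carried out correctly.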
The cumulative distribution function (CDF) of the standard normal distribution, corresponding to the area under the curve for the interval (−∞, z], is denoted here by the Greek letter ϕ (many textbooks write the capital Φ) and is given by

$$
P(x \le z) = \phi(z) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{z} e^{-\frac{1}{2}x^2}\,dx
$$

where e ≈ 2.71828 and π ≈ 3.14159.
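The integral in the CDF has no closed-form solution, but it can be approximated numerically. The sketch below integrates the standard normal PDF from −∞ to z with scipy.integrate.quad and compares the result with stats.norm.cdf. The helper name cdf_by_integration and the example z values are assumptions chosen for illustration.

```python
import numpy as np
from scipy import integrate, stats


def cdf_by_integration(z):
    """Approximate the standard normal CDF at z by numerical integration."""

    def pdf(x):
        return np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)

    # quad returns the integral estimate and an estimate of the absolute error
    area, _abs_err = integrate.quad(pdf, -np.inf, z)
    return area


# Hypothetical z values, chosen only for illustration
for z in (-1.0, 0.0, 0.5, 2.0):
    print(f"z = {z:5.2f}  quad: {cdf_by_integration(z):.6f}  scipy: {stats.norm.cdf(z):.6f}")
```

Both columns should agree to the printed precision, which illustrates that the CDF is nothing more than the accumulated area under the PDF.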
The standard normal curve is a special case of the normal distribution curve and is therefore also a probability distribution curve. Consequently, the basic properties of the normal distribution hold true for the standard normal curve as well (Weiss 2010).
# First, let's import all the needed libraries.
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
mu = 0
sigma = 1
cut_a = -4
cut_b = 0.5
x = np.arange(-4, 4.01, 0.01)
yy = stats.norm.pdf(x, mu, sigma)
plt.figure(figsize=(10, 5))
plt.plot(x, yy, color="darkblue")
plt.title("The probability density function of the standard normal distribution")
plt.xticks(
[
mu - 3 * sigma,
mu - 2 * sigma,
mu - sigma,
mu,
0.5,
mu + sigma,
mu + 2 * sigma,
mu + 3 * sigma,
],
[-3, -2, -1, 0, "z", 1, 2, 3],
)
plt.fill_between(
x=x,
y1=yy,
where=(x <= 0.5),
color="red",
edgecolor="black",
alpha=0.75,
)
xpos = [-3, -2, -1, 0, 1, 2, 3]
ypos = [0.005, 0.05, 0.25, max(yy), 0.25, 0.05, 0.005]
for px, py in zip(xpos, ypos):
    plt.vlines(x=mu + px, ymin=0, ymax=py, color="blue", linestyle="--")
plt.text(
2.5, 0.3, "$f(x) = \\frac{1}{\\sqrt{2\\pi}}e^{-\\frac{1}{2} x^2}$", fontsize=16
)
plt.arrow(
2.2,
0.26,
-0.9,
-0.08,
length_includes_head=True,
head_width=0.02,
head_length=0.1,
color="black",
)
plt.arrow(
-2.2,
0.28,
1.3,
-0.1,
length_includes_head=True,
head_width=0.02,
head_length=0.1,
color="black",
)
plt.text(-2.5, 0.3, "$\\phi(z)$", fontsize=16)
plt.xlabel("z score")
plt.ylabel("f(x)")
plt.show()
The z-values on the right side of the mean are positive and those on the left side are negative. The z-value for a point on the horizontal axis gives the distance between the mean (z = 0) and that point in terms of the standard deviation. For example, a point with a value of z = 2 is two standard deviations to the right of the mean. Similarly, a point with a value of z = −2 is two standard deviations to the left of the mean.
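To make the idea of "distance in units of the standard deviation" concrete, the sketch below converts raw values into z-scores using the usual standardization z = (x − μ) / σ. This formula is not spelled out in the text above, and the mean of 100 and standard deviation of 15 are purely hypothetical numbers chosen for illustration.

```python
import numpy as np


def z_score(x, mu, sigma):
    """Distance of x from the mean mu, measured in standard deviations sigma."""
    return (np.asarray(x, dtype=float) - mu) / sigma


# Hypothetical normal distribution with mean 100 and standard deviation 15
mu_example = 100.0
sigma_example = 15.0
values = [70.0, 100.0, 130.0]

# 70 lies two standard deviations below the mean, 130 two above, 100 at the mean
print(z_score(values, mu_example, sigma_example))
```

Negative results correspond to points left of the mean, positive results to points right of the mean.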
mu = 0
sigma = 1
cut_a = -4
cut_b = 0.5
x = np.arange(-4, 4.01, 0.01)
yy = stats.norm.cdf(x, mu, sigma)
plt.figure(figsize=(10, 5))
plt.plot(x, yy, color="darkblue")
plt.title(
"The cumulative distribution function\nof the standard normal distribution"
)
plt.xlabel("z score")
plt.ylabel("$\\phi(z)$")
plt.xticks(
[
mu - 3 * sigma,
mu - 2 * sigma,
mu - sigma,
mu,
0.5,
mu + sigma,
mu + 2 * sigma,
mu + 3 * sigma,
],
[-3, -2, -1, 0, "z", 1, 2, 3],
)
plt.fill_between(
x=x,
y1=yy,
where=(x <= 0.5),
color="red",
edgecolor="black",
alpha=0.75,
)
xpos = [-3, -2, -1, 0, 1, 2, 3]
ypos = [0.0001, 0.01, 0.15, 0.5, 0.83, 0.97, max(yy)]
for px, py in zip(xpos, ypos):
    plt.vlines(x=mu + px, ymin=0, ymax=py, color="blue", linestyle="--")
plt.arrow(
-1,
0.7,
1.2,
-0.3,
length_includes_head=True,
head_width=0.02,
head_length=0.1,
color="black",
)
plt.text(
-2,
0.8,
"$\\phi (z) = \\frac{1}{\\sqrt{2\\pi}} \\int_{-\\infty}^{z}e^{-\\frac{1}{2}x^2}dx$",
fontsize=16,
)
plt.show()
The concept of determining probabilities by calculating the area under the standard normal curve is applied extensively. That is why probability tables exist for looking up the area for a particular z-value. However, Python is such a powerful tool that we can calculate the area under the curve for any particular z-score directly.
To calculate the area under the curve for a standard normal distribution, we apply the cdf method of the norm object from the scipy.stats module. The scipy.stats.norm.cdf() function is defined as cdf(x, loc=0, scale=1). The location (loc) keyword specifies the mean, while the scale keyword specifies the standard deviation. Further, we see that the defaults for the mean and the standard deviation are 0 and 1, respectively. Thus, the cdf() function, applied to the standard normal distribution, simplifies to stats.norm.cdf(x). We calculate the area under the curve for z = −3, −2, −1, 0, 1, 2, 3:
stats.norm.cdf(-3)
stats.norm.cdf(-2)
## and so on...
0.022750131948179195
## ... or simplified in a loop:
z = [-3, -2, -1, 0, 1, 2, 3]
for i in z:
    print(i, "->", stats.norm.cdf(i))
-3 -> 0.0013498980316300933
-2 -> 0.022750131948179195
-1 -> 0.15865525393145707
0 -> 0.5
1 -> 0.8413447460685429
2 -> 0.9772498680518208
3 -> 0.9986501019683699
Perfect! We calculated the area under the curve for the interval (−∞, z] and thereby confirmed some of the above-stated properties of a standard normal curve. Calling stats.norm.cdf(-3) yields a very low number: only about 0.135% of the total area under the curve is found to the left of z = −3, which corresponds to a distance of 3 standard deviations from the mean. Moreover, stats.norm.cdf(0) yields 0.5, or 50%. Awesome! Because the total area under the curve sums to 1, we conclude that the area under the curve for the interval (−∞, 0] is the same as the area under the curve for the interval [0, ∞). Again, we confirmed one of the above-stated properties of a standard normal curve. And finally, calling stats.norm.cdf(3) yields a high number close to 1. Thus, approximately 99.865% of the area under the curve is found in the interval (−∞, 3], which leaves only about 0.135% for the area beyond z = 3.
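The symmetry argument can also be checked directly. The sketch below uses stats.norm.sf, SciPy's survival function (defined as 1 − cdf), to obtain the area to the right of a z-value; the variable names are assumptions chosen for illustration.

```python
from scipy import stats

z = 3

lower_tail = stats.norm.cdf(-z)  # area to the left of z = -3
upper_tail = stats.norm.sf(z)    # area to the right of z = 3, i.e. 1 - cdf(3)

# Symmetry about the mean: both tails hold the same area (about 0.00135)
print(lower_tail, upper_tail)

# Half of the total area lies on each side of the mean
print(stats.norm.cdf(0), stats.norm.sf(0))

# Area to the left plus area to the right of any z adds up to the total area
print(stats.norm.cdf(z) + stats.norm.sf(z))
```

Using sf for upper tails is a design choice: it avoids the loss of precision that 1 - cdf(z) can suffer when z is far out in the right tail.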
Recall that we may explicitly calculate the area under the curve for any interval of interest:
$$
\begin{aligned}
P(a \le z \le b) &= P(z \le b) - P(z \le a) \\
                 &= \int_{a}^{b} f(z)\,dz \\
                 &= \int_{-\infty}^{b} f(z)\,dz - \int_{-\infty}^{a} f(z)\,dz
\end{aligned}
$$

Let us calculate the area under the curve for the following intervals: [−1, 1], [−2, 2], [−3, 3]. In words, let us determine the area under the curve within ±1 standard deviation, ±2 standard deviations, and ±3 standard deviations of the mean.
# within 1 standard deviation of the mean
stats.norm.cdf(1) - stats.norm.cdf(-1)
0.6826894921370859
# within 2 standard deviations of the mean
stats.norm.cdf(2) - stats.norm.cdf(-2)
0.9544997361036416
# within 3 standard deviations of the mean
stats.norm.cdf(3) - stats.norm.cdf(-3)
0.9973002039367398
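Awesome, we just confirmed the Empirical Rule, also known as the 68-95-99.7 rule, which is related to Chebyshev's theorem. For a bell-shaped distribution, the three rules state that approximately 68% of the observations lie within one standard deviation of the mean, approximately 95% lie within two standard deviations, and approximately 99.7% lie within three standard deviations.
To strengthen our intuition, the Empirical Rule is visualized below.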
mu = 0
sigma = 1
cut_a = -1
cut_b = 1
x = np.arange(-4, 4.01, 0.01)
yy = stats.norm.pdf(x, mu, sigma)
plt.figure(figsize=(10, 5))
plt.plot(x, yy, color="black")
plt.title("The area under the curve for the interval z = [-1, 1]")
plt.yticks([])
plt.fill_between(
x=x,
y1=yy,
where=(x >= -1) & (x <= 1),
color="red",
alpha=0.75,
)
plt.text(
2.5,
0.3,
"$\\int_{-1}^{1}f(z)dz = P(z \\leq 1) - P(z \\leq -1)$",
fontsize=14,
)
plt.text(
-0.5,
0.15,
"area $\\approx 0.68$",
fontsize=14,
)
plt.axhline(-0.001, color="black")
plt.xlabel("z score")
plt.show()
mu = 0
sigma = 1
cut_a = -2
cut_b = 2
x = np.arange(-4, 4.01, 0.01)
yy = stats.norm.pdf(x, mu, sigma)
plt.figure(figsize=(10, 5))
plt.plot(x, yy, color="black")
plt.title("The area under the curve for the interval z = [-2, 2]")
plt.yticks([])
plt.fill_between(
x=x,
y1=yy,
where=(x >= -2) & (x <= 2),
color="red",
alpha=0.75,
)
plt.text(
2.5,
0.3,
"$\\int_{-2}^{2}f(z)dz = P(z \\leq 2) - P(z \\leq -2)$",
fontsize=14,
)
plt.text(
-0.5,
0.15,
"area $\\approx 0.95$",
fontsize=14,
)
plt.axhline(-0.001, color="black")
plt.xlabel("z score")
plt.show()
mu = 0
sigma = 1
cut_a = -3
cut_b = 3
x = np.arange(-4, 4.01, 0.01)
yy = stats.norm.pdf(x, mu, sigma)
plt.figure(figsize=(10, 5))
plt.plot(x, yy, color="black")
plt.title("The area under the curve for the interval z = [-3, 3]")
plt.yticks([])
plt.fill_between(
x=x,
y1=yy,
where=(x >= -3) & (x <= 3),
color="red",
alpha=0.75,
)
plt.text(
2.5,
0.3,
"$\\int_{-3}^{3}f(z)dz = P(z \\leq 3) - P(z \\leq -3)$",
fontsize=14,
)
plt.text(
-0.5,
0.15,
"area $\\approx 0.997$",
fontsize=14,
)
plt.axhline(-0.001, color="black")
plt.xlabel("z score")
plt.show()
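Since the Empirical Rule was mentioned alongside Chebyshev's theorem, it may help to put the two side by side. The sketch below compares the area within k standard deviations of the standard normal curve with Chebyshev's lower bound 1 − 1/k², which holds for any distribution with finite variance. The helper names area_within and chebyshev_lower_bound are assumptions made for illustration.

```python
from scipy import stats


def area_within(k):
    """Area under the standard normal curve on the interval [-k, k]."""
    return stats.norm.cdf(k) - stats.norm.cdf(-k)


def chebyshev_lower_bound(k):
    """Chebyshev's minimum share of observations within k standard deviations."""
    return 1 - 1 / k**2


for k in (1, 2, 3):
    print(
        f"k = {k}: normal area = {area_within(k):.4f}, "
        f"Chebyshev bound = {chebyshev_lower_bound(k):.4f}"
    )
```

For k = 1 the Chebyshev bound is 0 and therefore uninformative, while for k = 2 and k = 3 the normal distribution concentrates considerably more area around the mean than the general bound guarantees.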
Citation
The E-Learning project SOGA-Py was developed at the Department of Earth Sciences by Annette Rudolph, Joachim Krois and Kai Hartmann. You can reach us by e-mail at soga[at]zedat.fu-berlin.de.
Please cite as follows: Rudolph, A., Krois, J., Hartmann, K. (2023): Statistics and Geodata Analysis using Python (SOGA-Py). Department of Earth Sciences, Freie Universitaet Berlin.