The **standard normal distribution** is a special case of the normal distribution. For the standard normal distribution, the value of the mean is equal to zero ($\mu = 0$), and the value of the standard deviation is equal to 1 ($\sigma = 1$).

Thus, by plugging $\mu = 0$ and $\sigma = 1$ into the PDF of the normal distribution, the equation simplifies to

\begin{align} f(x)& = \frac{1}{\sigma \sqrt{2 \pi}}e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} \\ & =\frac{1}{1 \times \sqrt{2 \pi}}e^{-\frac{1}{2}\left(\frac{x-0}{1}\right)^2} \\ & = \frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}x^2} \end{align}

The random variable that possesses the standard normal distribution is denoted by $z$. Consequently, the units of the standard normal distribution curve are denoted by $z$ and are called **$z$-values** or **$z$-scores**. They are also called **standard units** or **standard scores**.
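As a quick numerical check of the simplified density, the following sketch evaluates the closed-form expression directly and compares it with `scipy.stats.norm.pdf`. The helper name `standard_normal_pdf` is ours and is introduced only for illustration.

```python
import numpy as np
from scipy import stats


def standard_normal_pdf(x):
    """Closed-form PDF of the standard normal distribution (mu = 0, sigma = 1)."""
    x = np.asarray(x, dtype=float)
    return np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)


x_check = np.linspace(-4, 4, 9)
closed_form = standard_normal_pdf(x_check)
library = stats.norm.pdf(x_check)  # loc=0 and scale=1 are the defaults

# Both results should agree up to floating-point error.
print(np.allclose(closed_form, library))
```

The density peaks at $x = 0$, where it takes the value $1/\sqrt{2\pi} \approx 0.3989$.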

The **cumulative distribution function (CDF)** of the standard normal distribution, corresponding to the area under the curve for the interval $(-\infty, z]$ and usually denoted by the capital Greek letter $\Phi$, is given by

\begin{align} \Phi(z) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{z} e^{-\frac{1}{2}x^2}dx \end{align}

where $e \approx 2.71828$ and $\pi \approx 3.14159$.
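Because the CDF is the integral of the PDF from $-\infty$ up to $z$, it can be approximated by numerical integration. The sketch below does this with `scipy.integrate.quad` and compares the result with `scipy.stats.norm.cdf`. The helper name `cdf_by_integration` is an illustrative assumption, not part of SciPy.

```python
import numpy as np
from scipy import integrate, stats


def cdf_by_integration(z):
    """Integrate the standard normal PDF from -infinity to z."""
    area, _abs_error = integrate.quad(stats.norm.pdf, -np.inf, z)
    return area


for z_value in (-1.0, 0.0, 0.5, 2.0):
    numeric = cdf_by_integration(z_value)
    library = stats.norm.cdf(z_value)
    print(z_value, numeric, library)
```

In practice one would call `stats.norm.cdf` directly. The integration is shown only to make the link between the density and the cumulative area explicit.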

The standard normal curve is a special case of the normal distribution curve and is therefore also a probability distribution curve. Hence, the basic properties of the normal distribution hold for the standard normal curve as well (Weiss 2010).

- The total area under the standard normal curve is 1 (this property is shared by all density curves).
- The standard normal curve extends indefinitely in both directions, approaching, but never touching, the horizontal axis as it does so.
- The standard normal curve is bell-shaped and centered at $z=0$. Almost all the area under the standard normal curve lies between $z=-3$ and $z=3$.
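The properties listed above can be checked numerically. The following is a minimal sketch, with variable names chosen only for illustration: it integrates the density over the whole real line, evaluates the density far out in the tail, and tests the symmetry around $z=0$.

```python
import numpy as np
from scipy import integrate, stats

# Total area under the curve: integrate the PDF over (-inf, inf).
total_area, _abs_error = integrate.quad(stats.norm.pdf, -np.inf, np.inf)

# Far out in the tail the density is tiny but still positive.
tail_height = stats.norm.pdf(10)

# Symmetry around z = 0: f(z) equals f(-z).
z_grid = np.linspace(0, 4, 41)
is_symmetric = np.allclose(stats.norm.pdf(z_grid), stats.norm.pdf(-z_grid))

print(total_area, tail_height, is_symmetric)
```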

In [2]:

```
# First, let's import all the needed libraries.
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
```

In [3]:

```
mu = 0
sigma = 1
cut_a = -4
cut_b = 0.5
x = np.arange(-4, 4.01, 0.01)
yy = stats.norm.pdf(x, mu, sigma)
plt.figure(figsize=(10, 5))
plt.plot(x, yy, color="darkblue")
plt.title("The probability density function of the standard normal distribution")
plt.xticks(
[
mu - 3 * sigma,
mu - 2 * sigma,
mu - sigma,
mu,
0.5,
mu + sigma,
mu + 2 * sigma,
mu + 3 * sigma,
],
[-3, -2, -1, 0, "z", 1, 2, 3],
)
plt.fill_between(
x=x,
y1=yy,
where=(x <= 0.5),
color="red",
edgecolor="black",
alpha=0.75,
)
xpos = [-3, -2, -1, 0, 1, 2, 3]
ypos = [0.005, 0.05, 0.25, max(yy), 0.25, 0.05, 0.005]
# Draw a dashed vertical line at each whole standard deviation.
for px, py in zip(xpos, ypos):
    plt.vlines(x=mu + px, ymin=0, ymax=py, color="blue", linestyle="--")
plt.text(
2.5, 0.3, "$f(x) = \\frac{1}{\\sqrt{2\\pi}}e^{-\\frac{1}{2} x^2}$", fontsize=16
)
plt.arrow(
2.2,
0.26,
-0.9,
-0.08,
length_includes_head=True,
head_width=0.02,
head_length=0.1,
color="black",
)
plt.arrow(
-2.2,
0.28,
1.3,
-0.1,
length_includes_head=True,
head_width=0.02,
head_length=0.1,
color="black",
)
plt.text(-2.5, 0.3, "$\\phi(z)$", fontsize=16)
plt.xlabel("z score")
plt.ylabel("f(x)")
plt.show()
```

In [4]:

```
mu = 0
sigma = 1
cut_a = -4
cut_b = 0.5
x = np.arange(-4, 4.01, 0.01)
yy = stats.norm.cdf(x, mu, sigma)
plt.figure(figsize=(10, 5))
plt.plot(x, yy, color="darkblue")
plt.title(
    "The cumulative distribution function\nof the standard normal distribution"
)
plt.xlabel("z score")
plt.ylabel("$\\phi(z)$")
plt.xticks(
[
mu - 3 * sigma,
mu - 2 * sigma,
mu - sigma,
mu,
0.5,
mu + sigma,
mu + 2 * sigma,
mu + 3 * sigma,
],
[-3, -2, -1, 0, "z", 1, 2, 3],
)
plt.fill_between(
x=x,
y1=yy,
where=(x <= 0.5),
color="red",
edgecolor="black",
alpha=0.75,
)
xpos = [-3, -2, -1, 0, 1, 2, 3]
ypos = [0.0001, 0.01, 0.15, 0.5, 0.83, 0.97, max(yy)]
# Draw a dashed vertical line at each whole standard deviation.
for px, py in zip(xpos, ypos):
    plt.vlines(x=mu + px, ymin=0, ymax=py, color="blue", linestyle="--")
plt.arrow(
-1,
0.7,
1.2,
-0.3,
length_includes_head=True,
head_width=0.02,
head_length=0.1,
color="black",
)
plt.text(
-2,
0.8,
"$\\phi (z) = \\frac{1}{\\sqrt{2\\pi}} \\int_{-\\infty}^{z}e^{-\\frac{1}{2}x^2}dx$",
fontsize=16,
)
plt.show()
```

The concept of determining probabilities by calculating the area under the standard normal curve is extensively applied. That is why there exist probability tables to look up the area for a particular $z$-value. However, Python is such a powerful tool that we can calculate the area under the curve for any particular $z$-score directly.
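To illustrate what such a probability table contains, here is a minimal sketch that rebuilds a small fragment of one with `scipy.stats.norm.cdf`. The layout (rows for the first decimal of $z$, columns for the second decimal) follows the common textbook convention. The variable names and the chosen range are assumptions made for this example.

```python
import numpy as np
from scipy import stats

# Rows hold z to one decimal place, columns add the second decimal place.
row_values = np.arange(5) / 10    # 0.0, 0.1, 0.2, 0.3, 0.4
col_values = np.arange(3) / 100   # 0.00, 0.01, 0.02

# Broadcasting builds every combination row + column in one call.
z_table = stats.norm.cdf(row_values[:, None] + col_values[None, :])

print("z     " + "  ".join(f"{c:.2f}  " for c in col_values))
for row, areas in zip(row_values, z_table):
    print(f"{row:.1f}   " + "  ".join(f"{a:.4f}" for a in areas))
```

Each entry is the area under the curve to the left of the corresponding $z$-value, which is exactly what a printed table lists.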

To calculate the area under the curve for a standard normal distribution, we apply the `cdf` method of the `norm` object from the `scipy.stats` module. The `scipy.stats.norm.cdf()` function is defined as `cdf(x, loc=0, scale=1)`. The location (`loc`) keyword specifies the mean, while the `scale` keyword specifies the standard deviation. Further, we see that the defaults for the mean and the standard deviation are $0$ and $1$, respectively. Thus, the `cdf()` function, applied to the standard normal distribution, simplifies to `stats.norm.cdf(x)`. We calculate the area under the curve for $z = -3, -2, -1, 0, 1, 2, 3$ or, written more formally, $P(z \le -3), P(z \le -2), \ldots, P(z \le 3)$:

In [5]:

```
stats.norm.cdf(-3)
stats.norm.cdf(-2)
## and so on...
```

Out[5]:

0.022750131948179195

In [6]:

```
## ... or simplified in a loop:
z = [-3, -2, -1, 0, 1, 2, 3]
for i in z:
    print(i, "->", stats.norm.cdf(i))
```

Perfect! We calculated the area under the curve for the interval $(-\infty, z]$ and thereby confirmed some of the above stated properties of the standard normal curve. Calling `stats.norm.cdf(-3)` yields a very low number: only about 0.135% of the total area under the curve is found to the left of $z=-3$, which corresponds to a distance of 3 standard deviations from the mean. Moreover, `stats.norm.cdf(0)` yields 0.5. Awesome! Thus, we conclude that the area under the curve for the interval $(-\infty, 0]$ is the same as the area under the curve for the interval $[0, \infty)$, which is consistent with a total area under the curve of $1$. Again, we confirmed one of the above stated properties of the standard normal curve. And finally, calling `stats.norm.cdf(3)` yields a high number close to 1. Thus, approximately 99.865% of the area under the curve can be found in the interval $(-\infty, 3]$. Only little is left for the area beyond $z = 3$.
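The symmetry argument can also be checked directly: the area to the left of $-z$ should equal the area to the right of $+z$. The sketch below compares both tails. It additionally uses the survival function `stats.norm.sf`, which returns `1 - cdf`. The variable names are chosen for illustration only.

```python
import numpy as np
from scipy import stats

z_values = np.array([0.5, 1.0, 2.0, 3.0])

left_tail = stats.norm.cdf(-z_values)       # area to the left of -z
right_tail = 1 - stats.norm.cdf(z_values)   # area to the right of +z
right_tail_sf = stats.norm.sf(z_values)     # survival function, 1 - cdf

# Both comparisons should hold up to floating-point error.
print(np.allclose(left_tail, right_tail))
print(np.allclose(left_tail, right_tail_sf))
```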

Recall, that we may explicitly calculate the area under the curve for any interval of interest:

\begin{align} P(a \le z \le b) & = P(z \le b) - P(z \le a) \\ & =\int_{a}^{b}f(z)dz \\ & = \int_{-\infty}^{b}f(z)dz - \int_{-\infty}^{a}f(z)dz \end{align}

Let us calculate the area under the curve for the following intervals: $[-1,1], [-2,2], [-3,3]$. Or in words, let us determine the area under the curve for $\pm 1$ standard deviation, for $\pm 2$ standard deviations, and for $\pm 3$ standard deviations.

In [7]:

```
# Area within 1 standard deviation of the mean
stats.norm.cdf(1) - stats.norm.cdf(-1)
```

Out[7]:

0.6826894921370859

In [8]:

```
# Area within 2 standard deviations of the mean
stats.norm.cdf(2) - stats.norm.cdf(-2)
```

Out[8]:

0.9544997361036416

In [9]:

```
# Area within 3 standard deviations of the mean
stats.norm.cdf(3) - stats.norm.cdf(-3)
```

Out[9]:

0.9973002039367398

Awesome, we just confirmed the Empirical Rule, also known as the **68-95-99.7 rule**, which is related to Chebyshev's theorem. For a bell-shaped distribution, the rule states that approximately

- 68% of the observations lie within one standard deviation of the mean,
- 95% of the observations lie within two standard deviations of the mean, and
- 99.7% of the observations lie within three standard deviations of the mean.
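To make the relation to Chebyshev's theorem concrete: the theorem guarantees that, for any distribution with finite variance, at least $1 - 1/k^2$ of the observations lie within $k$ standard deviations of the mean. The following sketch compares this general lower bound with the exact areas of the standard normal distribution. The dictionary `coverage` is a name introduced only for this example.

```python
from scipy import stats

coverage = {}
for k in (1, 2, 3):
    # Exact area of the standard normal curve within k standard deviations.
    normal_area = stats.norm.cdf(k) - stats.norm.cdf(-k)
    # Chebyshev's lower bound; for k = 1 it is 0 and therefore uninformative.
    chebyshev_bound = 1 - 1 / k**2
    coverage[k] = (normal_area, chebyshev_bound)
    print(f"k = {k}: normal = {normal_area:.4f}, Chebyshev bound = {chebyshev_bound:.4f}")
```

The normal distribution concentrates far more area around the mean than the bound requires, since Chebyshev's theorem has to hold for every distribution shape.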

To strengthen our intuition, the Empirical Rule is visualized below.

In [10]:

```
mu = 0
sigma = 1
cut_a = -1
cut_b = 1
x = np.arange(-4, 4.01, 0.01)
yy = stats.norm.pdf(x, mu, sigma)
plt.figure(figsize=(10, 5))
plt.plot(x, yy, color="black")
plt.title("The area under the curve for the interval z = [-1,1]")
plt.yticks([])
# Shade the area between the two cut-off values.
plt.fill_between(
    x=x,
    y1=yy,
    where=(x >= cut_a) & (x <= cut_b),
    color="red",
    alpha=0.75,
)
plt.text(
2.5,
0.3,
"$\\phi(z) =\\int_{-1}^{1}f(z)dz = P(z \\leq 1) - P(z \\leq -1)$",
fontsize=14,
)
plt.text(
-0.5,
0.15,
"$\\phi(z) \\approx 0.68$",
fontsize=14,
)
plt.axhline(-0.001, color="black")
plt.xlabel("z score")
plt.show()
```

In [11]:

```
mu = 0
sigma = 1
cut_a = -2
cut_b = 2
x = np.arange(-4, 4.01, 0.01)
yy = stats.norm.pdf(x, mu, sigma)
plt.figure(figsize=(10, 5))
plt.plot(x, yy, color="black")
plt.title("The area under the curve for the interval z = [-2,2]")
plt.yticks([])
# Shade the area between the two cut-off values.
plt.fill_between(
    x=x,
    y1=yy,
    where=(x >= cut_a) & (x <= cut_b),
    color="red",
    alpha=0.75,
)
plt.text(
2.5,
0.3,
"$\\phi(z) =\\int_{-2}^{2}f(z)dz = P(z \\leq 2) - P(z \\leq -2)$",
fontsize=14,
)
plt.text(
-0.5,
0.15,
"$\\phi(z) \\approx 0.95$",
fontsize=14,
)
plt.axhline(-0.001, color="black")
plt.xlabel("z score")
plt.show()
```

In [12]:

```
mu = 0
sigma = 1
cut_a = -3
cut_b = 3
x = np.arange(-4, 4.01, 0.01)
yy = stats.norm.pdf(x, mu, sigma)
plt.figure(figsize=(10, 5))
plt.plot(x, yy, color="black")
plt.title("The area under the curve for the interval z = [-3,3]")
plt.yticks([])
# Shade the area between the two cut-off values.
plt.fill_between(
    x=x,
    y1=yy,
    where=(x >= cut_a) & (x <= cut_b),
    color="red",
    alpha=0.75,
)
plt.text(
2.5,
0.3,
"$\\phi(z) =\\int_{-3}^{3}f(z)dz = P(z \\leq 3) - P(z \\leq -3)$",
fontsize=14,
)
plt.text(
    -0.5,
    0.15,
    "$\\phi(z) \\approx 0.997$",
    fontsize=14,
)
plt.axhline(-0.001, color="black")
plt.xlabel("z score")
plt.show()
```

**Citation**

The E-Learning project SOGA-Py was developed at the Department of Earth Sciences by Annette Rudolph, Joachim Krois and Kai Hartmann. You can reach us via mail by soga[at]zedat.fu-berlin.de.

You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follows: *Rudolph, A., Krois, J., Hartmann, K. (2023): Statistics and Geodata Analysis using Python (SOGA-Py). Department of Earth Sciences, Freie Universitaet Berlin.*