20433_The_Standard_Normal_Distribution_-_An_Example_in

The students data set consists of 8239 rows, each of them representing a particular student, and 16 columns, each of them corresponding to a variable/feature related to that particular student. These self-explaining variables are: [stud.id, name, gender, age, height, weight, religion, nc.score, semester, major, minor, score1, score2, online.tutorial, graduated, salary].

In this sections we use the height variable to exercise what we have discussed so far.

First, we want to assure that we are dealing with normally distributed data. If a variable is normally distributed, then, for a large sample, a histogram of the observations should be roughly bell shaped.

In [5]:

plt.figure(figsize=(10, 5))
plt.hist(students["height"], bins="sturges", color="lightgreen", edgecolor="grey")
plt.xlabel("Height in cm")
plt.ylabel("Frequency")
plt.show()

By inspecting the plot we may conclude that the height variable is normally distributed, however, especially for small samples, ascertaining a clear shape in a histogram and, in particular, whether it is bell shaped is often difficult. Thus, a more sensitive graphical technique is required for assessing normality. Normal probability plots provide such a technique. The idea behind a normal probability plot is simple: Compare the observed values of the variable to the observations expected for a normally distributed variable. More precisely, a normal probability plot is a plot of the observed values of the variable versus the normal scores of the observations expected for a variable having the standard normal distribution. If the variable is normally distributed, the normal probability plot should be roughly linear (i.e., fall roughly in a straight line) (Weiss 2010).

When using a normal probability plot to assess the normality of a variable, we must remember two things:

The decision of whether a normal probability plot is roughly linear is a subjective one, and
that we are using only a limited number of observations of that particular variable to make a judgment about all possible observations of the variable.

In Python we may employ the statsmodels.api library, which provides a function for plotting normal probability plots often referred to as Q-Q plots. For plotting a complete Q-Q-Plot, we make use of the qqplot function.

In [6]:

import statsmodels.api as sm

plt.figure(figsize=(12, 5))
sm.qqplot(students["height"], line="r")

plt.show()

<Figure size 1200x500 with 0 Axes>

By inspecting the plot we see that there is some divergence for the sample quantiles compared to the theoretic quantiles at the lower and upper tails. This fact needs a little more attention! What might be the reason for the departure at the upper and lower tail of the distribution? Any guess?

What about gender? Honestly, it is seems natural that the mean height for males and females differs. Let us plot a histogram of the height of males and females.

In [7]:

## subset by gender
males = students.loc[students["gender"] == "Male"]
females = students.loc[students["gender"] == "Female"]

## plot by gender
plt.figure(figsize=(10, 5))
plt.hist(males["height"], bins="sturges", color="lightblue", edgecolor="grey")
plt.hist(females["height"], bins="sturges", color="lightgreen", edgecolor="grey")
plt.xlabel("Height in cm")
plt.ylabel("Frequency")
plt.title("Females and Males", fontsize=16)
plt.show()

There it is! Obviously, the two groups have different means, and thus, putting them together into one group causes the left an right tails of the resulting distribution to extend further, than expected for a normally distributed variable. In order to continue, we thus take only the height of female students into considerations. For the matter of clarity we once again plot the normal probability plot of the height variable to assure that our target variables are normally distributed.

In [8]:

plt.figure(figsize=(12, 5))
sm.qqplot(females["height"], line="r")

plt.show()

<Figure size 1200x500 with 0 Axes>

Before we actually start with the exercises we first calculate the mean, $\bar x$, and the standard deviation, $s$, of the target variable. Moreover, we standardize the variable to get a standard normal distribution with $\bar x =0$ and $s = 1$ and assign it to an appropriate variable name.

In [9]:

# Heights
height_mean = np.mean(females["height"])
height_sd = np.std(females["height"])
height_mean, height_sd

Out[9]:

(163.65328467153284, 7.918762263149209)

In [10]:

# z-transformation
height_z = (females["height"] - height_mean) / height_sd

The height variable has a mean of <IPython.core.display.Javascript object> cm of and a standard deviation of <IPython.core.display.Javascript object> cm.

Finding the area to the left of a specified $z$-score or $x$-score¶

Question 1
What is the probability of a randomly picked female student from the student data set with a height less or equal to 168 cm? That means that we are looking for $P(x \le 168)$.

First, we calculate the probability for the standardized variable. Therefore, we have to transform our value of interest (168 cm) into a $z$-score. $$z = \frac{x-\mu}{\sigma} = \frac{ 168- 163.7}{7.9} = 0.55$$

Then we have to calculate the area under the curve left to the obtained $z$-value. Recall, that the area under the curve of a normal distributed variable can be calculated by applying the norm.cdf function implemented in the scipy.stats module. The norm.cdf function is written as norm.cdf(x, loc=0, scale=1). For this particular example we can accept all default argument values.

In [11]:

x = 168  # height in cm
x_z = (x - height_mean) / height_sd  # z-transformation
stats.norm.cdf(x_z)

Out[11]:

0.7084675855873308

Awesome, we have a result: $P(z \le 0.55) \approx 0.70$

Now, we do the same calculation, however this time we skip the step of standardization. Thanks to the power of Python we do not need to rely on tables, but can we can easily put the sample mean, $\bar x$, and the sample standard deviation, $s$, into the norm.cdf() function.

In [12]:

x = 168  # height in cm
stats.norm.cdf(x, loc=height_mean, scale=height_sd)

Out[12]:

0.7084675855873308

Perfect! The numbers match: $P(x \le 168) \approx 0.70$. To make sure we realize what is going on, both the area under the curve for the standardized variable in $z$-values (left panel) as well as the area for the non-standardized variable in cm (right panel) are visualized below.

In [13]:

mean = 0
sigma = 1
cut_a = (x - height_mean) / height_sd

x = np.arange(-3, 3.01, 0.01)
yy = stats.norm.pdf(x)

fig, ax = plt.subplots(1, 2, figsize=(10, 5))
ax[0].plot(x, yy, color="black")
ax[0].set_title("$P(z \leq 0.55)$")
ax[0].set_yticks([])
ax[0].set_xlabel("z value")

ax[0].fill_between(
    x=x,
    y1=yy,
    where=(x <= 0.55),
    color="red",
    alpha=0.75,
)

ax[0].text(
    -1.5,
    0.1,
    "$area \\approx 0.71$",
    fontsize=14,
)

x_vals = np.arange(min(females["height"]), max(females["height"]), 0.01)
yy2 = stats.norm.pdf(x_vals, height_mean, height_sd)
ax[1].plot(x_vals, yy2, color="black")
ax[1].set_title("$P(z \leq 0.168)$")
ax[1].set_yticks([])
ax[1].set_xlabel("x height (cm)")

ax[1].fill_between(
    x=x_vals,
    y1=yy2,
    where=(x_vals <= 168),
    color="red",
    alpha=0.75,
)
ax[1].text(
    148,
    0.005,
    "$area \\approx 0.71$",
    fontsize=14,
)


plt.show()

Finding the area to the right of a specified $x$-score¶

Question 2
What is the probability of a randomly picked female student from the students data set with a height higher or equal to 175 cm? We are looking for $P(x \ge 175)$. In order to obtain the area under the curve right of the value of interest, we need to set the norm.cdf function to 1 - norm.cdf in order to get the upper tail of the distribution.

In [14]:

x = 175  # height in cm
1 - stats.norm.cdf(x, loc=height_mean, scale=height_sd)

Out[14]:

0.07594463650610317

Answer: : $P(x \ge 175) \approx 0.076$

Finding the area between two specified $x$-scores¶

In order to find the area under a curve for an interval $[a,b]$ we make use of the equation

$$ P(a \le x \le b) = \int_{a}^{b}f(x)dx = P(x \le b) - P(x \le a)\text{.} $$

Question 3
What is the probability of a randomly picked female student from the student data set with a height between 155 and 165 cm, $P(155 \le x \le 165)$?.

In [15]:

x_lower = 155  # height in cm
x_upper = 165  # height in cm


cdf_upper = stats.norm.cdf(x_upper, loc=height_mean, scale=height_sd)
cdf_lower = stats.norm.cdf(x_lower, loc=height_mean, scale=height_sd)
cdf_upper - cdf_lower

Out[15]:

0.4302708345408188

Answer: : $P(155 \le x \le 165) \approx 0.43`$

Question 4
What is the probability of a randomly picked female student from the students data set with a height between 170 and 180 cm, i.e. $P(170 \le x \le 180)$?

In [16]:

x_lower = 170  # height in cm
x_upper = 180  # height in cm


cdf_upper = stats.norm.cdf(x_upper, loc=height_mean, scale=height_sd)
cdf_lower = stats.norm.cdf(x_lower, loc=height_mean, scale=height_sd)
cdf_upper - cdf_lower

Out[16]:

0.1919328752280871

Answer: : $P(170 \le x \le 180) \approx 0.19$

Finding $z_\alpha$¶

Question 5
What is the height of female students in our students data set that corresponds to a probability of 0.60? In other words, if we randomly pick a number of $n$ students of the students data set, what height divides the sample in 60% of $n$ students being lower and 40% of $n$ students being higher than that particular height. Thus, we are looking for $P(X<?)=0.60$.

In order to solve $P(X<?)=0.60$ we will take two approaches. The first makes use of $z$-score, and the second one uses R to make the standardization step obsolete. For both approaches we apply the qnorm function, which is written as follows: qnorm(p, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE).

For the first approach we need to rearrange the equation for standardization from above and solve it for $x$.

$$z = \frac{x-\mu}{\sigma} \implies x = z \sigma + \mu$$

In order to calculate $x$ we need the mean (height_mean) and the standard deviation (height_sd) for the height variable, being <IPython.core.display.Javascript object> cm and <IPython.core.display.Javascript object> cm, respectively. In addition, we need to obtain a $z$-score for the given probability of 0.60. We may look up that $z$-score in a table, or we may apply the norm.ppf()function in Python. In order to do so, we want that particular $z$-score, where the area left to that $z$-score corresponds to 0.60; recall we are looking for $P(X<?)=0.60$.

In [17]:

z = stats.norm.ppf(0.6, loc=0, scale=1)
z

Out[17]:

0.2533471031357997

Now that we know $z$ we can plug into the equation from above

\begin{align} x & = z \sigma + \mu \\ & = 0.25 \times 7.9 + 163.7 \\ & \approx 165.66 \end{align}

In [18]:

out = round(z * height_sd + height_mean, 2)

Perfect, we are done: $P(X<165.66)=0.60$

Now we go through the second approach, where we skip the step of the $z$-score calculation. All we have to do is to feed the norm.ppf() function with the mean and standard deviation of our height variable.

In [19]:

x = stats.norm.ppf(0.6, loc=height_mean, scale=height_sd)
x

Out[19]:

165.65948015132278

No surprise, the numbers match: $P(X<165.66)=0.60$

Citation

The E-Learning project SOGA-Py was developed at the Department of Earth Sciences by Annette Rudolph, Joachim Krois and Kai Hartmann. You can reach us via mail by soga[at]zedat.fu-berlin.de.

You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follow: Rudolph, A., Krois, J., Hartmann, K. (2023): Statistics and Geodata Analysis using Python (SOGA-Py). Department of Earth Sciences, Freie Universitaet Berlin.

Data preparation¶

Finding the area to the left of a specified $z$-score or $x$-score¶

Finding the area to the right of a specified $x$-score¶

Finding the area between two specified $x$-scores¶

Finding $z_\alpha$¶