Now we are ready to do some exercises. First, we load the students
data set students.csv
, assign a proper name to the data set and take a look at the shape.
# First, let's import all the needed libraries.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.stats as stats
students = pd.read_csv(
"https://userpage.fu-berlin.de/soga/200/2010_data_sets/students.csv"
)
count_row = students.shape[0] # number of rows
count_col = students.shape[1] # number of columns
column_names = students.columns # column names
The students data set consists of 8239 rows, each of them representing a particular student, and 16 columns, each of them corresponding to a variable/feature related to that particular student. These self-explaining variables are: [stud.id, name, gender, age, height, weight, religion, nc.score, semester, major, minor, score1, score2, online.tutorial, graduated, salary].
In this sections we use the height
variable to exercise what we have discussed so far.
First, we want to assure that we are dealing with normally distributed data. If a variable is normally distributed, then, for a large sample, a histogram of the observations should be roughly bell shaped.
plt.figure(figsize=(10, 5))
plt.hist(students["height"], bins="sturges", color="lightgreen", edgecolor="grey")
plt.xlabel("Height in cm")
plt.ylabel("Frequency")
plt.show()
By inspecting the plot we may conclude that the height
variable is normally distributed, however, especially for small samples, ascertaining a clear shape in a histogram and, in particular, whether it is bell shaped is often difficult. Thus, a more sensitive graphical technique is required for assessing normality. Normal probability plots provide such a technique. The idea behind a normal probability plot is simple: Compare the observed values of the variable to the observations expected for a normally distributed variable. More precisely, a normal probability plot is a plot of the observed values of the variable versus the normal scores of the observations expected for a variable having the standard normal distribution. If the variable is normally distributed, the normal probability plot should be roughly linear (i.e., fall roughly in a straight line) (Weiss 2010).
When using a normal probability plot to assess the normality of a variable, we must remember two things:
In Python we may employ the statsmodels.api
library, which provides a function for plotting normal probability plots often referred to as Q-Q plots. For plotting a complete Q-Q-Plot, we make use of the qqplot
function.
import statsmodels.api as sm
plt.figure(figsize=(12, 5))
sm.qqplot(students["height"], line="r")
plt.show()
<Figure size 1200x500 with 0 Axes>
By inspecting the plot we see that there is some divergence for the sample quantiles compared to the theoretic quantiles at the lower and upper tails. This fact needs a little more attention! What might be the reason for the departure at the upper and lower tail of the distribution? Any guess?
What about gender? Honestly, it is seems natural that the mean height for males and females differs. Let us plot a histogram of the height of males and females.
## subset by gender
males = students.loc[students["gender"] == "Male"]
females = students.loc[students["gender"] == "Female"]
## plot by gender
plt.figure(figsize=(10, 5))
plt.hist(males["height"], bins="sturges", color="lightblue", edgecolor="grey")
plt.hist(females["height"], bins="sturges", color="lightgreen", edgecolor="grey")
plt.xlabel("Height in cm")
plt.ylabel("Frequency")
plt.title("Females and Males", fontsize=16)
plt.show()
There it is! Obviously, the two groups have different means, and thus, putting them together into one group causes the left an right tails of the resulting distribution to extend further, than expected for a normally distributed variable. In order to continue, we thus take only the height of female students into considerations. For the matter of clarity we once again plot the normal probability plot of the height variable to assure that our target variables are normally distributed.
plt.figure(figsize=(12, 5))
sm.qqplot(females["height"], line="r")
plt.show()
<Figure size 1200x500 with 0 Axes>
Before we actually start with the exercises we first calculate the mean, $\bar x$, and the standard deviation, $s$, of the target variable. Moreover, we standardize the variable to get a standard normal distribution with $\bar x =0$ and $s = 1$ and assign it to an appropriate variable name.
# Heights
height_mean = np.mean(females["height"])
height_sd = np.std(females["height"])
height_mean, height_sd
(163.65328467153284, 7.918762263149209)
# z-transformation
height_z = (females["height"] - height_mean) / height_sd
The height variable has a mean of <IPython.core.display.Javascript object> cm of and a standard deviation of <IPython.core.display.Javascript object> cm.
Question 1
What is the probability of a randomly picked female student from the student
data set with a height less or equal to 168 cm? That means that we are looking for $P(x \le 168)$.
First, we calculate the probability for the standardized variable. Therefore, we have to transform our value of interest (168 cm) into a $z$-score. $$z = \frac{x-\mu}{\sigma} = \frac{ 168- 163.7}{7.9} = 0.55$$
Then we have to calculate the area under the curve left to the obtained $z$-value. Recall, that the area under the curve of a normal distributed variable can be calculated by applying the norm.cdf
function implemented in the scipy.stats
module. The norm.cdf
function is written as norm.cdf(x, loc=0, scale=1)
. For this particular example we can accept all default argument values.
x = 168 # height in cm
x_z = (x - height_mean) / height_sd # z-transformation
stats.norm.cdf(x_z)
0.7084675855873308
Awesome, we have a result: $P(z \le 0.55) \approx 0.70$
Now, we do the same calculation, however this time we skip the step of standardization. Thanks to the power of Python we do not need to rely on tables, but can we can easily put the sample mean, $\bar x$, and the sample standard deviation, $s$, into the norm.cdf()
function.
x = 168 # height in cm
stats.norm.cdf(x, loc=height_mean, scale=height_sd)
0.7084675855873308
Perfect! The numbers match: $P(x \le 168) \approx 0.70$. To make sure we realize what is going on, both the area under the curve for the standardized variable in $z$-values (left panel) as well as the area for the non-standardized variable in cm (right panel) are visualized below.
mean = 0
sigma = 1
cut_a = (x - height_mean) / height_sd
x = np.arange(-3, 3.01, 0.01)
yy = stats.norm.pdf(x)
fig, ax = plt.subplots(1, 2, figsize=(10, 5))
ax[0].plot(x, yy, color="black")
ax[0].set_title("$P(z \leq 0.55)$")
ax[0].set_yticks([])
ax[0].set_xlabel("z value")
ax[0].fill_between(
x=x,
y1=yy,
where=(x <= 0.55),
color="red",
alpha=0.75,
)
ax[0].text(
-1.5,
0.1,
"$area \\approx 0.71$",
fontsize=14,
)
x_vals = np.arange(min(females["height"]), max(females["height"]), 0.01)
yy2 = stats.norm.pdf(x_vals, height_mean, height_sd)
ax[1].plot(x_vals, yy2, color="black")
ax[1].set_title("$P(z \leq 0.168)$")
ax[1].set_yticks([])
ax[1].set_xlabel("x height (cm)")
ax[1].fill_between(
x=x_vals,
y1=yy2,
where=(x_vals <= 168),
color="red",
alpha=0.75,
)
ax[1].text(
148,
0.005,
"$area \\approx 0.71$",
fontsize=14,
)
plt.show()
Question 2
What is the probability of a randomly picked female student from the students data set with a height higher or equal to 175 cm? We are looking for $P(x \ge 175)$. In order to obtain the area under the curve right of the value of interest, we need to set the norm.cdf
function to 1 - norm.cdf
in order to get the upper tail of the distribution.
x = 175 # height in cm
1 - stats.norm.cdf(x, loc=height_mean, scale=height_sd)
0.07594463650610317
Answer: : $P(x \ge 175) \approx 0.076$
In order to find the area under a curve for an interval $[a,b]$ we make use of the equation
$$ P(a \le x \le b) = \int_{a}^{b}f(x)dx = P(x \le b) - P(x \le a)\text{.} $$Question 3
What is the probability of a randomly picked female student from the student data set with a height between 155 and 165 cm, $P(155 \le x \le 165)$?.
x_lower = 155 # height in cm
x_upper = 165 # height in cm
cdf_upper = stats.norm.cdf(x_upper, loc=height_mean, scale=height_sd)
cdf_lower = stats.norm.cdf(x_lower, loc=height_mean, scale=height_sd)
cdf_upper - cdf_lower
0.4302708345408188
Answer: : $P(155 \le x \le 165) \approx 0.43`$
Question 4
What is the probability of a randomly picked female student from the students data set with a height between 170 and 180 cm, i.e. $P(170 \le x \le 180)$?
x_lower = 170 # height in cm
x_upper = 180 # height in cm
cdf_upper = stats.norm.cdf(x_upper, loc=height_mean, scale=height_sd)
cdf_lower = stats.norm.cdf(x_lower, loc=height_mean, scale=height_sd)
cdf_upper - cdf_lower
0.1919328752280871
Answer: : $P(170 \le x \le 180) \approx 0.19$
Question 5
What is the height of female students in our students data set that corresponds to a probability of 0.60? In other words, if we randomly pick a number of $n$ students of the students data set, what height divides the sample in 60% of $n$ students being lower and 40% of $n$ students being higher than that particular height. Thus, we are looking for $P(X<?)=0.60$.
In order to solve $P(X<?)=0.60$ we will take two approaches. The first makes use of $z$-score, and the second one uses R to make the standardization step obsolete. For both approaches we apply the qnorm
function, which is written as follows: qnorm(p, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
.
For the first approach we need to rearrange the equation for standardization from above and solve it for $x$.
$$z = \frac{x-\mu}{\sigma} \implies x = z \sigma + \mu$$In order to calculate $x$ we need the mean (height_mean
) and the standard deviation (height_sd
) for the height variable, being <IPython.core.display.Javascript object> cm and <IPython.core.display.Javascript object> cm, respectively. In addition, we need to obtain a $z$-score for the given probability of 0.60. We may look up that $z$-score in a table, or we may apply the norm.ppf()
function in Python. In order to do so, we want that particular $z$-score, where the area left to that $z$-score corresponds to 0.60; recall we are looking for $P(X<?)=0.60$.
z = stats.norm.ppf(0.6, loc=0, scale=1)
z
0.2533471031357997
Now that we know $z$ we can plug into the equation from above
\begin{align} x & = z \sigma + \mu \\ & = 0.25 \times 7.9 + 163.7 \\ & \approx 165.66 \end{align}out = round(z * height_sd + height_mean, 2)
Perfect, we are done: $P(X<165.66)=0.60$
Now we go through the second approach, where we skip the step of the $z$-score calculation. All we have to do is to feed the norm.ppf()
function with the mean and standard deviation of our height
variable.
x = stats.norm.ppf(0.6, loc=height_mean, scale=height_sd)
x
165.65948015132278
No surprise, the numbers match: $P(X<165.66)=0.60$
Citation
The E-Learning project SOGA-Py was developed at the Department of Earth Sciences by Annette Rudolph, Joachim Krois and Kai Hartmann. You can reach us via mail by soga[at]zedat.fu-berlin.de.
Please cite as follow: Rudolph, A., Krois, J., Hartmann, K. (2023): Statistics and Geodata Analysis using Python (SOGA-Py). Department of Earth Sciences, Freie Universitaet Berlin.