In this section, we perform a hypothesis test for the means of two populations. We assume that the standard deviations of the two populations are equal but unknown. If, however, we knew $\sigma$ and the difference of the sample means ($\bar {x}_{1} - \bar {x}_{2}$), the test statistic could be written as follows:
$$z = \frac {(\bar {x}_{1} - \bar {x}_{2}) - (\mu_{1} - \mu_{2})} {\sigma \sqrt {\frac {1} {n_{1}} + \frac {1} {n_{2}}}}$$

However, in almost all real-life applications we do not know $\sigma$, so we must estimate it first. The best way to do that is to consider the sample variances, $s^{2}_{1}$ and $s^{2}_{2}$, as two estimates of $\sigma^{2}$. By pooling the two sample variances and weighting them according to sample size, the estimate of $\sigma^{2}$ is given by:

$$s^{2}_{p} = \frac {(n_{1} - 1) s^{2}_{1} + (n_{2} - 1) s^{2}_{2}} {n_{1} + n_{2} - 2},$$

and by taking the square root, we get:

$$s_{p} = \sqrt {\frac {(n_{1} - 1) s^{2}_{1} + (n_{2} - 1) s^{2}_{2}} {n_{1} + n_{2} - 2}}.$$

The quantity $s_{p}$ is called the pooled sample standard deviation, where the subscript $p$ stands for pooled.
The replacement of $\sigma$ in the equation above with its estimate $s_{p}$ results in:
$$t = \frac {(\bar {x}_{1} - \bar {x}_{2}) - (\mu_{1} - \mu_{2})} {s_{p} \sqrt {\frac {1} {n_{1}} + \frac {1} {n_{2}}}}$$

The denominator, $s_{p} \sqrt {\frac {1} {n_{1}} + \frac {1} {n_{2}}}$, is the estimator of the standard deviation of $\bar {x}_{1} - \bar {x}_{2}$ and can be written as:

$$s_{\bar {x}_{1} - \bar {x}_{2}} = s_{p} \sqrt {\frac {1} {n_{1}} + \frac {1} {n_{2}}}$$

Please note that this test statistic $t$ follows a t-distribution. The degrees of freedom ($df$) are given by:

$$df = n_{1} + n_{2} - 2$$

The $100(1−\alpha)\ \%$ confidence interval for $\mu_{1} - \mu_{2}$ is given by:

$$(\bar {x}_{1} - \bar {x}_{2}) \pm t \times s_{p} \sqrt {\frac {1} {n_{1}} + \frac {1} {n_{2}}},$$

where the value of $t$ is obtained from the t-distribution for the given confidence level and $n_{1} + n_{2} - 2$ degrees of freedom.
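As an illustration (a sketch using made-up samples, not the students data discussed below), the confidence interval above can be computed with `scipy.stats.t.ppf`:

```python
import numpy as np
from scipy.stats import t


def pooled_ci(x1, x2, conf=0.99):
    """Pooled-variance confidence interval for mu1 - mu2."""
    x1, x2 = np.asarray(x1, float), np.asarray(x2, float)
    n1, n2 = len(x1), len(x2)
    # Pooled sample standard deviation s_p.
    sp = np.sqrt(((n1 - 1) * x1.var(ddof=1) + (n2 - 1) * x2.var(ddof=1))
                 / (n1 + n2 - 2))
    se = sp * np.sqrt(1 / n1 + 1 / n2)
    # Critical t-value for the given confidence level and df = n1 + n2 - 2.
    t_crit = t.ppf(1 - (1 - conf) / 2, df=n1 + n2 - 2)
    diff = x1.mean() - x2.mean()
    return diff - t_crit * se, diff + t_crit * se


# Hypothetical samples for illustration only.
rng = np.random.default_rng(42)
lo, hi = pooled_ci(rng.normal(48000, 9500, 50), rng.normal(36000, 7200, 50))
print(f"99% CI: [{lo:.2f}, {hi:.2f}]")
```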
Python allows us to conduct a pooled t-test using the `ttest_ind()` function from the `stats` module of the `scipy` package.
Note: For this example, we will assume that the variances of the two samples are equal. Therefore, we will set the argument `equal_var` within the function call to `True`.
In order to practice the pooled t-test, we load the students data set. You may download the `students.csv` file here and import it from your local file system, or load it directly as a web resource. In either case, you import the data set into Python as a `pandas` `DataFrame` object by using the `read_csv()` function:
```python
import pandas as pd
import numpy as np

students = pd.read_csv("https://userpage.fu-berlin.de/soga/data/raw-data/students.csv")
```
The students data set consists of 8239 rows, each representing a particular student, and 16 columns, each corresponding to a variable/feature related to that particular student. The variable names are largely self-explanatory.
In order to showcase the pooled t-test, we examine the mean annual salary (in Euro) of graduates. Therefore, we will compare the populations of male and female graduates for a statistically significant difference. The question is whether there is a difference in the mean annual salary of graduates related to gender.
We start with the data preparation, which includes the following steps:

- Subset the data set to students who graduated.
- Split the graduates by gender.
- Draw a random sample of size $n = 50$ from each group, `sample_males` and `sample_females`. The relevant information is stored in the column `salary`.

```python
graduated_students = students.loc[students.graduated == 1]
males_graduated = graduated_students.loc[graduated_students.gender == "Male"]
females_graduated = graduated_students.loc[graduated_students.gender == "Female"]

n = 50
sample_males = males_graduated.sample(n, random_state=9)["salary"]
sample_females = females_graduated.sample(n, random_state=9)["salary"]

np.mean(sample_males)
```
47262.417382791995

```python
np.mean(sample_females)
```
35825.54641661087
Further, we test the normality assumption by plotting a normal probability plot, often referred to as a Q-Q plot. If the variable is normally distributed, the normal probability plot should be roughly linear.
```python
import matplotlib.pyplot as plt
import scipy.stats as stats

plt.figure(figsize=(12, 5))

ax = plt.subplot(1, 2, 1)
stats.probplot(sample_males, dist="norm", plot=plt)
ax.set_title("Q-Q plot for males (sample)")
ax.set_ylabel("Sample quantiles")

ax = plt.subplot(1, 2, 2)
stats.probplot(sample_females, dist="norm", plot=plt)
ax.set_title("Q-Q plot for females (sample)")
ax.set_ylabel("Sample quantiles")

plt.show()
```
We see that the sample data is somewhat noisy, but still roughly normally distributed. The deviations from the straight line in the upper and lower tails suggest that the probability distribution is slightly skewed.
Further, we check whether the standard deviations of the two populations are roughly equal. As a rule of thumb, the condition of equal population standard deviations is met if the ratio of the larger to the smaller sample standard deviation is less than 2 (Weiss, 2010). Let us assume that the students data set is a good approximation of the population.

```python
np.std(sample_males, ddof=1)
```
9576.164209488374

```python
np.std(sample_females, ddof=1)
```
7200.9719569940535

```python
np.std(sample_males, ddof=1) / np.std(sample_females, ddof=1)
```
1.3298432859729967
The ratio is approximately $1.33$, and thus we conclude that the equal population standard deviations criterion is fulfilled. A simple visualization technique for evaluating the spread of a variable is a box plot.
```python
import seaborn as sns

plt.figure(figsize=(11, 5))
sample_df = pd.melt(pd.DataFrame({"male": sample_males.values,
                                  "female": sample_females.values}))
sample_df.columns = ["Gender", "Annual salary in EUR"]
sns.boxplot(x="Gender", y="Annual salary in EUR", data=sample_df).set(title="Sample data")
plt.show()
```
We conduct the pooled t-test by following the step-wise implementation procedure for hypothesis testing.
Step 1: State the null hypothesis $H_{0}$ and alternative hypothesis $H_{A}$
The null hypothesis states that the average annual salary of male graduates ($\mu_1$) is equal to the average annual salary of female graduates ($\mu_2$).
$$H_{0}: \quad \mu_{1} = \mu_{2}$$

Recall that the formulation of the alternative hypothesis dictates whether we apply a two-sided, a left-tailed, or a right-tailed hypothesis test. We want to test whether the average annual salary of male graduates ($\mu_{1}$) is higher than that of female graduates ($\mu_{2}$). Thus, the alternative hypothesis is formulated as follows:

$$H_{A}: \quad \mu_{1} > \mu_{2}$$

This formulation results in a right-tailed hypothesis test.
Step 2: Decide on the significance level, $\alpha$
$$\alpha = 0.01$$

```python
alpha = 0.01
```
Step 3 and 4: Compute the value of the test statistic and the p-value
For illustration purposes, we compute the test statistic manually in Python. Recall the equation for the test statistic from above:
$$t = \frac {(\bar {x}_{1} - \bar {x}_{2}) - (\mu_{1} - \mu_{2})} {s_{p} \sqrt {\frac {1} {n_{1}} + \frac {1} {n_{2}}}}$$

If $H_{0}$ is true, then $\mu_{1} - \mu_{2} = 0$ and the equation simplifies to:

$$t = \frac {\bar {x}_{1} - \bar {x}_{2}} {s_{p} \sqrt {\frac {1} {n_{1}} + \frac {1} {n_{2}}}},$$

where $s_{p}$ is:

$$s_{p} = \sqrt {\frac {(n_{1} - 1) s^{2}_{1} + (n_{2} - 1) s^{2}_{2}} {n_{1} + n_{2} - 2}}$$

Let's calculate the test statistic manually in Python:
```python
n1 = n2 = n
std_1 = np.std(sample_males, ddof=1)
std_2 = np.std(sample_females, ddof=1)
mean_1 = np.mean(sample_males)
mean_2 = np.mean(sample_females)

sp = np.sqrt(((n1 - 1) * std_1**2 + (n2 - 1) * std_2**2) / (n1 + n2 - 2))
t_value = (mean_1 - mean_2) / (sp * np.sqrt((1 / n1) + (1 / n2)))
t_value
```
6.749633014283106

The numerical value of the test statistic is $t \approx 6.749633$.
In order to calculate the p-value, we apply the `t.cdf()` function from the `scipy.stats` module to calculate the probability of occurrence for the test statistic based on the t-distribution. To do so, we also need the degrees of freedom. Recall how to calculate the degrees of freedom:

```python
from scipy.stats import t

df = n1 + n2 - 2
df
```
98

```python
p_value = 1 - t.cdf(t_value, df=df)
p_value
```
5.25205323675948e-10
$p = 5.25205323675948 \times 10^{-10}$
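As a side note (not part of the original workflow), `scipy.stats` also offers the survival function `t.sf()`, which computes the upper-tail probability $P(T > t)$ directly and is numerically more robust than `1 - t.cdf(...)` for very small p-values:

```python
from scipy.stats import t

t_value = 6.749633014283106
df = 98

p_cdf = 1 - t.cdf(t_value, df=df)  # upper tail via the CDF
p_sf = t.sf(t_value, df=df)        # survival function: P(T > t) directly
print(p_cdf, p_sf)
```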
Step 5: If $p \le \alpha$, reject $H_{0}$; otherwise, do not reject $H_{0}$
```python
# reject H0?
p_value <= alpha
```
True
The p-value is less than the specified significance level of 0.01; we reject $H_{0}$. The test results are statistically significant at the 1 % level and provide very strong evidence against the null hypothesis.
Step 6: Interpret the result of the hypothesis test
At the 1 % significance level, the data provides very strong evidence to conclude that the average salary of male graduates is higher than the average salary of female graduates.
Pooled t-test with scipy

We just manually completed a pooled t-test in Python. However, please note that we can use the full power of Python's package universe to obtain the same result as above in just one line of code! For this, we apply the `ttest_ind()` function from the `stats` module of the `scipy` package. We have to provide the following information as arguments to the function:
- the two samples, `sample_males` and `sample_females`
- `equal_var = True`, since we assume equal population variances
- `alternative = "greater"`, since we conduct a right-tailed test
```python
from scipy import stats

test_result = stats.ttest_ind(sample_males, sample_females,
                              equal_var=True,
                              alternative="greater")
test_result
```
Ttest_indResult(statistic=6.7496330142831065, pvalue=5.25205313471519e-10)
The `ttest_ind()` function returns an object which provides the test statistic as well as the corresponding *p*-value of the test result. Those values can be retrieved via the following properties:

- `<object>.statistic` holds the test statistic and represents the empirical t-value.
- `<object>.pvalue` represents the *p*-value of the performed t-test.

Consequently, the test statistic is retrieved via:

```python
test_result.statistic
```
6.7496330142831065

The *p*-value is retrieved via:

```python
test_result.pvalue
```
5.25205313471519e-10
Great success! Compare the output of the `ttest_ind()` function with our result from above: they match perfectly! Again, we may conclude that at the 1 % significance level, the data provides very strong evidence that the average salary of male graduates is higher than the average salary of female graduates.
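As a related sketch (with made-up numbers, not the students data), `scipy.stats` also provides `ttest_ind_from_stats()`, which performs the same pooled t-test from summary statistics alone. This is useful when only means, standard deviations, and sample sizes are available, e.g. from a published table:

```python
import numpy as np
from scipy import stats

# Hypothetical raw samples for illustration only.
a = np.array([12.1, 11.4, 13.0, 12.7, 11.9])
b = np.array([10.2, 10.9, 11.1, 10.5, 10.8])

# Raw-data version of the pooled t-test.
res_raw = stats.ttest_ind(a, b, equal_var=True, alternative="greater")

# Summary-statistics version: needs only means, sample sds (ddof=1) and sizes.
res_sum = stats.ttest_ind_from_stats(
    mean1=a.mean(), std1=a.std(ddof=1), nobs1=len(a),
    mean2=b.mean(), std2=b.std(ddof=1), nobs2=len(b),
    equal_var=True, alternative="greater",
)

print(res_raw.statistic, res_sum.statistic)
```

Both calls yield identical test statistics and p-values, since the pooled t-test depends on the data only through these summary statistics.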
Citation
The E-Learning project SOGA-Py was developed at the Department of Earth Sciences by Annette Rudolph, Joachim Krois and Kai Hartmann. You can reach us via mail by soga[at]zedat.fu-berlin.de.
Please cite as follows: Rudolph, A., Krois, J., Hartmann, K. (2023): Statistics and Geodata Analysis using Python (SOGA-Py). Department of Earth Sciences, Freie Universitaet Berlin.