In this section, we perform a hypothesis test for the means of two populations. We assume that the standard deviations of the two populations are equal but unknown. If, however, we knew $\sigma$ and the difference of the sample means ($\bar {x}_{1} - \bar {x}_{2}$), the test statistic could be written as follows:

$$z = \frac {(\bar {x}_{1} - \bar {x}_{2}) - (\mu_{1} - \mu_{2})} {\sigma \sqrt {\frac {1} {n_{1}} + \frac {1} {n_{2}}}}$$

However, in almost all real-life applications we do not know $\sigma$, so we must estimate it from the data. The best way to do that is to consider the sample variances, $s^{2}_{1}$ and $s^{2}_{2}$, as two estimates for $\sigma^{2}$. By pooling the two sample variances and weighting them according to the sample sizes, the estimate for $\sigma^{2}$ is given by:

$$s^{2}_{p} = \frac {(n_{1} - 1) s^{2}_{1} + (n_{2} - 1) s^{2}_{2}} {n_{1} + n_{2} - 2},$$

Taking the square root, we get:

$$s_{p} = \sqrt {\frac {(n_{1} - 1) s^{2}_{1} + (n_{2} - 1) s^{2}_{2}} {n_{1} + n_{2} - 2}}.$$

The quantity $s_{p}$ is called the pooled sample standard deviation, where the subscript $p$ stands for pooled.
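As a quick numerical illustration, the pooled sample standard deviation can be computed directly with numpy. The two samples below are made-up values chosen solely for demonstration:

```python
import numpy as np

# Two small made-up samples (illustrative values only)
x1 = np.array([4.2, 5.1, 6.3, 5.8, 4.9])
x2 = np.array([3.9, 4.4, 5.0, 4.1])

n1, n2 = len(x1), len(x2)
s1 = np.std(x1, ddof=1)  # sample standard deviation of sample 1
s2 = np.std(x2, ddof=1)  # sample standard deviation of sample 2

# Pooled sample standard deviation
sp = np.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
print(sp)
```

Since $s^{2}_{p}$ is a weighted average of the two sample variances, $s_{p}$ always lies between the smaller and the larger sample standard deviation.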

Replacing $\sigma$ in the equation for $z$ above with its estimate $s_{p}$ results in:

$$t = \frac {(\bar {x}_{1} - \bar {x}_{2}) - (\mu_{1} - \mu_{2})} {s_{p} \sqrt {\frac {1} {n_{1}} + \frac {1} {n_{2}}}}$$

The denominator of the equation, $s_{p} \sqrt {\frac {1} {n_{1}} + \frac {1} {n_{2}}}$, is the estimator of the standard deviation of $\bar {x}_{1} - \bar {x}_{2}$, which can be written as:

$$s_{\bar {x}_{1} - \bar {x}_{2}} = s_{p} \sqrt {\frac {1} {n_{1}} + \frac {1} {n_{2}}}$$

Please note that the test statistic $t$ follows a t-distribution. The degrees of freedom ($df$) are given by:

$$df = n_{1} + n_{2} - 2$$

Interval Estimation of $\mu_{1} - \mu_{2}$

The $100(1-\alpha)\ \%$ confidence interval for $\mu_{1} - \mu_{2}$ is given by:

$$(\bar {x}_{1} - \bar {x}_{2}) \pm t \times s_{p} \sqrt {\frac {1} {n_{1}} + \frac {1} {n_{2}}},$$

where the value of $t$ is obtained from the t-distribution for the given confidence level and $n_{1} + n_{2} - 2$ degrees of freedom.
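As a sketch of how this interval is computed in practice, the following snippet uses made-up summary statistics (all numbers are illustrative assumptions) and obtains the critical value of $t$ from scipy's t.ppf() function:

```python
import numpy as np
from scipy.stats import t

# Made-up summary statistics for two samples (illustrative values only)
xbar1, s1, n1 = 47.0, 9.5, 50
xbar2, s2, n2 = 36.0, 7.2, 50

conf_level = 0.99
alpha = 1 - conf_level

df = n1 + n2 - 2
sp = np.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / df)

# Two-sided critical t-value for the given confidence level
t_crit = t.ppf(1 - alpha / 2, df)

margin = t_crit * sp * np.sqrt(1 / n1 + 1 / n2)
lower = (xbar1 - xbar2) - margin
upper = (xbar1 - xbar2) + margin
print((lower, upper))
```

The interval is centered on the observed difference of the sample means, $\bar{x}_{1} - \bar{x}_{2}$, and widens as the confidence level increases or the sample sizes decrease.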

The pooled t-test: An example

Python allows us to conduct a pooled t-test using the ttest_ind() function from the stats module of the scipy package.

Note: For this example, we will assume that the variances between the two samples are equal. Therefore, we will set the argument equal_var within the function call to True.
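As a side note, scipy also provides the ttest_ind_from_stats() function, which performs the same pooled t-test when only summary statistics (means, standard deviations and sample sizes) are available instead of the raw observations. A minimal sketch with made-up summary values (the alternative argument requires a reasonably recent scipy version):

```python
from scipy import stats

# Made-up summary statistics (illustrative values, not from the students data set)
mean1, std1, nobs1 = 47000.0, 9500.0, 50
mean2, std2, nobs2 = 36000.0, 7200.0, 50

res = stats.ttest_ind_from_stats(mean1, std1, nobs1,
                                 mean2, std2, nobs2,
                                 equal_var=True,          # pooled t-test
                                 alternative="greater")   # right-tailed test
print(res.statistic, res.pvalue)
```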

In order to practice the pooled t-test, we load the students data set. You may download the students.csv file and import it from your local file system, or you may load it directly as a web resource. In either case, you import the data set into Python as a pandas DataFrame object by using the read_csv() method:

In [1]:
import pandas as pd
import numpy as np

students = pd.read_csv("https://userpage.fu-berlin.de/soga/data/raw-data/students.csv")

The students data set consists of 8239 rows, each representing a particular student, and 16 columns, each corresponding to a variable/feature related to that particular student. These self-explanatory variables are:

  • stud.id
  • name
  • gender
  • age
  • height
  • weight
  • religion
  • nc.score
  • semester
  • major
  • minor
  • score1
  • score2
  • online.tutorial
  • graduated
  • salary

In order to showcase the pooled t-test, we examine the mean annual salary (in Euro) of graduates. We will test the following two populations for a statistically significant difference:

  1. male students
  2. female students

The question is whether there is a difference in the mean annual salary of graduates related to gender.


Data preparation

We start with the data preparation, which includes the following steps:

  • We subset the data set based on the binary graduated variable, which indicates if the student has graduated yet. The integer 1 stands for graduated, and 0 indicates that the student did not graduate yet.
  • We split the data set based on gender (male and female).
  • We sample 50 students from each subset and extract the annual salary (in Euro) as the variable of interest. We store the salary information for the males and females in dedicated variables called sample_males and sample_females. The relevant information is stored in the column salary.
In [2]:
graduated_students = students.loc[students.graduated == 1]
males_graduated = graduated_students.loc[graduated_students.gender == "Male"]
females_graduated = graduated_students.loc[graduated_students.gender == "Female"]

n = 50

sample_males = males_graduated.sample(n, random_state = 9)["salary"]
sample_females = females_graduated.sample(n, random_state = 9)["salary"]
In [3]:
np.mean(sample_males)
Out[3]:
47262.417382791995
In [4]:
np.mean(sample_females)
Out[4]:
35825.54641661087

Further, we check the normality assumption by creating a normal probability plot, often referred to as a Q-Q plot. If the variable is normally distributed, the normal probability plot should be roughly linear.

In [5]:
import matplotlib.pyplot as plt
import scipy.stats as stats

plt.figure(figsize=(12,5))

ax = plt.subplot(1, 2, 1)
qq = stats.probplot(sample_males, dist="norm", plot = plt)
ax.set_title("Q-Q plot for males (sample)")
ax.set_ylabel("Sample quantiles")

ax = plt.subplot(1, 2, 2)
qq = stats.probplot(sample_females, dist="norm", plot = plt)
ax.set_title("Q-Q plot for females (sample)")
ax.set_ylabel("Sample quantiles")
Out[5]:
Text(0, 0.5, 'Sample quantiles')

We see that the sample data is somewhat noisy, but it is still roughly normally distributed. The deviations from the straight line in the upper and lower parts suggest that the probability distribution is slightly skewed.

Further, we check whether the standard deviations of the two populations are roughly equal. As a rule of thumb, the condition of equal population standard deviations is met if the ratio of the larger to the smaller sample standard deviation is less than 2 (Weiss, 2010). Let us assume that the students data set is a good approximation for the population.

In [6]:
np.std(sample_males, ddof = 1)
Out[6]:
9576.164209488374
In [7]:
np.std(sample_females, ddof = 1)
Out[7]:
7200.9719569940535
In [8]:
np.std(sample_males, ddof = 1) / np.std(sample_females, ddof = 1)
Out[8]:
1.3298432859729967

The ratio is approximately $1.33$; thus, we conclude that the equal population standard deviations criterion is fulfilled. A simple visualization technique for evaluating the spread of a variable is a box plot.

In [9]:
import seaborn as sns

plt.figure(figsize=(11,5))

sample_df = pd.melt(pd.DataFrame({'male': sample_males.values,
                                  'female': sample_females.values}))
sample_df.columns = ["Gender", "Annual salary in EUR"]

sns.boxplot(x='Gender', y='Annual salary in EUR', data=sample_df).set(title='Sample data')
Out[9]:
[Text(0.5, 1.0, 'Sample data')]

Hypothesis testing

We conduct the pooled t-test by following the step-wise implementation procedure for hypothesis testing.

Step 1: State the null hypothesis $H_{0}$ and alternative hypothesis $H_{A}$

The null hypothesis states that the average annual salary of male graduates ($\mu_1$) is equal to the average annual salary of female graduates ($\mu_2$).

$$H_{0}: \quad \mu_{1} = \mu_{2}$$

Recall that the formulation of the alternative hypothesis dictates whether we apply a two-sided, a left-tailed or a right-tailed hypothesis test. We want to test whether the average annual salary of male graduates ($\mu_{1}$) is higher than the average annual salary of female graduates ($\mu_{2}$). Thus, the alternative hypothesis is formulated as follows:

$$H_{A}: \quad \mu_{1} > \mu_{2} $$

This formulation results in a right-tailed hypothesis test.


Step 2: Decide on the significance level, $\alpha$

$$\alpha = 0.01$$
In [10]:
alpha = 0.01

Step 3 and 4: Compute the value of the test statistic and the p-value

For illustration purposes, we first compute the test statistic manually in Python. Recall the equation for the test statistic from above:

$$t = \frac {(\bar {x}_{1} - \bar {x}_{2}) - (\mu_{1} - \mu_{2})} {s_{p} \sqrt {\frac {1} {n_{1}} + \frac {1} {n_{2}}}}$$

If $H_{0}$ is true, then $\mu_{1} - \mu_{2} = 0$ and thus, the equation simplifies to:

$$t = \frac {(\bar {x}_{1} - \bar {x}_{2})} {s_{p} \sqrt {\frac {1} {n_{1}} + \frac {1} {n_{2}}}},$$

where $s_{p}$ is:

$$s_{p} = \sqrt {\frac {(n_{1} - 1) s^{2}_{1} + (n_{2} - 1) s^{2}_{2}} {n_{1} + n_{2} - 2}}$$

Let's calculate the test statistic manually in Python:

In [11]:
n1 = n2 = n
std_1 = np.std(sample_males, ddof = 1) 
std_2 = np.std(sample_females, ddof = 1)

mean_1 = np.mean(sample_males)
mean_2 = np.mean(sample_females)

sp = np.sqrt(((n1 - 1) * std_1**2 + (n2 - 1) * std_2**2) / (n1 + n2 - 2))

t_value = (mean_1 - mean_2) / (sp * np.sqrt((1/n1) + (1/n2)))

t_value
Out[11]:
6.749633014283106

The numerical value of the test statistic is 6.749633.

In order to calculate the p-value, we apply the t.cdf() function provided by the scipy package to compute the probability of observing such a test statistic under the t-distribution. To do so, we also need the degrees of freedom. Recall how to calculate the degrees of freedom:

$$df = n_{1} + n_{2} - 2 = 50 + 50 - 2 = 98$$
In [12]:
from scipy.stats import t

df = n1 + n2 - 2
df

p_value = 1 - t.cdf(t_value, df = df)
p_value
Out[12]:
5.25205323675948e-10

$p = 5.25205323675948 \times 10^{-10}$


Step 5: If $p \le \alpha$, reject $H_{0}$; otherwise, do not reject $H_{0}$

In [13]:
# reject H0?
p_value <= alpha
Out[13]:
True

The p-value is less than the specified significance level of 0.01; we reject $H_{0}$. The test results are statistically significant at the 1 % level and provide very strong evidence against the null hypothesis.


Step 6: Interpret the result of the hypothesis test

At the 1 % significance level, the data provides very strong evidence to conclude that the average salary of male graduates is higher than the average salary of female graduates.


Hypothesis testing in Python with scipy

We just completed a pooled t-test manually in Python. However, please note that we can use the full power of Python's package ecosystem to obtain the same result in just one line of code! For that, we apply the ttest_ind() function from the stats module of the scipy package. We have to provide the following arguments to the function:

  • the male observations via sample_males
  • the female observations via sample_females
  • whether the variances of both samples are assumed to be equal, via equal_var = True
  • which alternative hypothesis shall be tested, via alternative = "greater"
In [15]:
from scipy import stats

test_result = stats.ttest_ind(sample_males, sample_females, 
                              equal_var = True, 
                              alternative = "greater")

test_result
Out[15]:
Ttest_indResult(statistic=6.7496330142831065, pvalue=5.25205313471519e-10)

The ttest_ind() function returns an object which provides the test statistic as well as the corresponding *p*-value of the test result. Those values can be retrieved via the following properties:

  • <object>.statistic holds the test statistic and represents the empirical t-value.
  • <object>.pvalue represents the p-value of the performed t-test.

Consequently, the test statistic is retrieved via:

In [16]:
test_result.statistic
Out[16]:
6.7496330142831065

The *p*-value is retrieved via:

In [17]:
test_result.pvalue
Out[17]:
5.25205313471519e-10

Great success! Compare the output of the ttest_ind() function with our result from above. They match perfectly! Again, we may conclude that at the 1 % significance level, the data provides very strong evidence to conclude that the average salary of male graduates is higher than the average salary of female graduates.


Citation

The E-Learning project SOGA-Py was developed at the Department of Earth Sciences by Annette Rudolph, Joachim Krois and Kai Hartmann. You can reach us via mail by soga[at]zedat.fu-berlin.de.

Creative Commons License
You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follows: Rudolph, A., Krois, J., Hartmann, K. (2023): Statistics and Geodata Analysis using Python (SOGA-Py). Department of Earth Sciences, Freie Universitaet Berlin.