In this section, we perform a hypothesis test for the means of two populations. We assume that the standard deviations of the two populations are equal but unknown. If, however, we knew $\sigma$ and the difference of the sample means ($\bar {x}_{1} - \bar {x}_{2}$), the test statistic could be written as follows:

$$z = \frac {(\bar {x}_{1} - \bar {x}_{2}) - (\mu_{1} - \mu_{2})} {\sigma \sqrt {\frac {1} {n_{1}} + \frac {1} {n_{2}}}}$$

However, we do not know $\sigma$ in almost all real-life applications, so we must estimate it first. The best way to do that is to consider the sample variances, $s^{2}_{1}$ and $s^{2}_{2}$, as two estimates for $\sigma^{2}$. By pooling the two sample variances and weighting them according to the sample sizes, the estimate for $\sigma^{2}$ is given by:

$$s^{2}_{p} = \frac {(n_{1} - 1) s^{2}_{1} + (n_{2} - 1) s^{2}_{2}} {n_{1} + n_{2} - 2},$$

and by taking the square root, we get:

$$s_{p} = \sqrt {\frac {(n_{1} - 1) s^{2}_{1} + (n_{2} - 1) s^{2}_{2}} {n_{1} + n_{2} - 2}}.$$

The quantity $s_{p}$ is called the pooled sample standard deviation, where the subscript $p$ stands for **pooled**.
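As a quick numeric illustration, the pooled standard deviation is simply a degrees-of-freedom-weighted average of the two sample variances. A minimal sketch with made-up sample statistics ($n_1 = 30$, $s_1 = 4$, $n_2 = 40$, $s_2 = 5$ are arbitrary illustration values, not taken from the data used later in this section):

```python
import numpy as np

# Hypothetical summary statistics (arbitrary illustration values)
n1, n2 = 30, 40
s1, s2 = 4.0, 5.0

# Pooled sample standard deviation: weight each variance by its degrees of freedom
sp = np.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
print(round(sp, 4))  # lies between s1 and s2, pulled toward s2 (the larger sample)
```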

The replacement of $\sigma$ in the equation above with its estimate $s_{p}$ results in:

$$t = \frac {(\bar {x}_{1} - \bar {x}_{2}) - (\mu_{1} - \mu_{2})} {s_{p} \sqrt {\frac {1} {n_{1}} + \frac {1} {n_{2}}}}$$

The denominator of the equation, $s_{p} \sqrt {\frac {1} {n_{1}} + \frac {1} {n_{2}}}$, is the estimator of the standard deviation of $\bar {x}_{1} - \bar {x}_{2}$, which can be written as:

$$s_{\bar {x}_{1} - \bar {x}_{2}} = s_{p} \sqrt {\frac {1} {n_{1}} + \frac {1} {n_{2}}}$$

Please note that the test statistic $t$ follows a *t*-distribution. The degrees of freedom ($df$) are given by:

$$df = n_{1} + n_{2} - 2$$

The $100(1−\alpha)\ \%$ confidence interval for $\mu_{1} - \mu_{2}$ is given by:

$$(\bar {x}_{1} - \bar {x}_{2}) \pm t \times s_{p} \sqrt {\frac {1} {n_{1}} + \frac {1} {n_{2}}},$$

where the value of $t$ is obtained from the *t*-distribution for the given confidence level and $n_{1} + n_{2} - 2$ degrees of freedom.
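The critical *t*-value can be obtained with `scipy.stats.t.ppf()`. Below is a minimal sketch of the confidence interval computation; the summary statistics are made-up illustration values:

```python
import numpy as np
from scipy.stats import t

# Hypothetical summary statistics (illustration only)
n1, n2 = 50, 50
mean1, mean2 = 47000.0, 36000.0
s1, s2 = 9500.0, 7200.0
alpha = 0.05  # for a 95 % confidence interval

# Pooled standard deviation and standard error of the mean difference
sp = np.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
se = sp * np.sqrt(1/n1 + 1/n2)

# Two-sided critical t-value with n1 + n2 - 2 degrees of freedom
t_crit = t.ppf(1 - alpha/2, df=n1 + n2 - 2)

lower = (mean1 - mean2) - t_crit * se
upper = (mean1 - mean2) + t_crit * se
print(lower, upper)
```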

Python allows us to conduct a **pooled t-test** using the `ttest_ind()` function from the `stats` module of the `scipy` package.

Note: For this example, we assume that the variances of the two samples are equal. Therefore, we set the argument `equal_var` within the function call to `True`.
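If only summary statistics (means, standard deviations, sample sizes) are available rather than raw observations, `scipy.stats` also offers `ttest_ind_from_stats()`, which performs the same pooled test. A sketch with made-up summary values:

```python
from scipy.stats import ttest_ind_from_stats

# Hypothetical summary statistics (illustration only)
res = ttest_ind_from_stats(mean1=47000, std1=9500, nobs1=50,
                           mean2=36000, std2=7200, nobs2=50,
                           equal_var=True, alternative="greater")
print(res.statistic, res.pvalue)
```

Note that the `alternative` argument requires `scipy` version 1.6 or later.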

In order to practice the pooled t-test, we load the `students` *data set*. You may download the `students.csv` file here and import it from your local file system, or you may load it directly as a web resource. In either case, you import the data set into Python as a `pandas` `DataFrame` object by using the `read_csv` method:

In [1]:

```
import pandas as pd
import numpy as np
students = pd.read_csv("https://userpage.fu-berlin.de/soga/data/raw-data/students.csv")
```

The *students* data set consists of 8239 rows, each representing a particular student, and 16 columns, each corresponding to a variable/feature related to that student. These self-explanatory variables are:

- stud.id
- name
- gender
- age
- height
- weight
- religion
- nc.score
- semester
- major
- minor
- score1
- score2
- online.tutorial
- graduated
- salary

In order to showcase the pooled *t*-test, we examine the **mean annual salary** (in *Euro*) of graduates. We compare the following two populations for a statistically significant difference:

- male students
- female students

**The question is whether there is a difference in the mean annual salary of graduates related to gender.**

We start with the data preparation, which includes the following steps:

- We subset the data set based on the binary `graduated` variable, which indicates whether the student has graduated yet: the integer 1 stands for graduated, and 0 indicates that the student has not graduated yet.
- We split the data set based on gender (male and female).
- We sample 50 female and 50 male students from the respective subsets and extract the annual salary (in *Euro*) as the variable of interest. We store the salary information for the males and females in dedicated variables called `sample_males` and `sample_females`. The relevant information is stored in the column `salary`.

In [2]:

```
# keep only graduated students, then split by gender
graduated_students = students.loc[students.graduated == 1]
males_graduated = graduated_students.loc[graduated_students.gender == "Male"]
females_graduated = graduated_students.loc[graduated_students.gender == "Female"]

# draw 50 students from each subset and keep the salary column
n = 50
sample_males = males_graduated.sample(n, random_state=9)["salary"]
sample_females = females_graduated.sample(n, random_state=9)["salary"]
```

In [3]:

```
np.mean(sample_males)
```

Out[3]:

47262.417382791995

In [4]:

```
np.mean(sample_females)
```

Out[4]:

35825.54641661087

In [5]:

```
import matplotlib.pyplot as plt
import scipy.stats as stats
plt.figure(figsize=(12,5))
ax = plt.subplot(1, 2, 1)
qq = stats.probplot(sample_males, dist="norm", plot = plt)
ax.set_title("Q-Q plot for males (sample)")
ax.set_ylabel("Sample quantiles")
ax = plt.subplot(1, 2, 2)
qq = stats.probplot(sample_females, dist="norm", plot = plt)
ax.set_title("Q-Q plot for females (sample)")
ax.set_ylabel("Sample quantiles")
```

Out[5]:

Text(0, 0.5, 'Sample quantiles')

We see that the sample data is somewhat noisy, but it is still roughly normally distributed. The deviations from the straight line in the upper and lower parts suggest that the probability distribution is slightly skewed.

Further, we check whether the standard deviations of the two populations are roughly equal. As a rule of thumb, the condition of equal population standard deviations is met if the ratio of the larger to the smaller sample standard deviation is less than 2 (Weiss, 2010). Let us assume that the data of the `students` *data set* is a good approximation for the population.

In [6]:

```
np.std(sample_males, ddof = 1)
```

Out[6]:

9576.164209488374

In [7]:

```
np.std(sample_females, ddof = 1)
```

Out[7]:

7200.9719569940535

In [8]:

```
np.std(sample_males, ddof = 1) / np.std(sample_females, ddof = 1)
```

Out[8]:

1.3298432859729967

In [9]:

```
import seaborn as sns

plt.figure(figsize=(11, 5))
sample_df = pd.melt(pd.DataFrame({"male": sample_males.values,
                                  "female": sample_females.values}))
sample_df.columns = ["Gender", "Annual salary in EUR"]
sns.boxplot(x="Gender", y="Annual salary in EUR", data=sample_df).set(title="Sample data")
```

Out[9]:

[Text(0.5, 1.0, 'Sample data')]

We conduct the **pooled t-test** by following the step-wise implementation procedure for hypothesis testing.

**Step 1: State the null hypothesis $H_{0}$ and alternative hypothesis $H_{A}$**

The null hypothesis states that the average annual salary of male graduates ($\mu_1$) is equal to the average annual salary of female graduates ($\mu_2$).

$$H_{0}: \quad \mu_{1} = \mu_{2}$$

Recall that the formulation of the alternative hypothesis dictates whether we apply a two-sided, a left-tailed or a right-tailed hypothesis test. We want to test if the average annual salary of male graduates ($\mu_{1}$) is higher than the average annual salary of female graduates ($\mu_{2}$). Thus, the alternative hypothesis is formulated as follows:

$$H_{A}: \quad \mu_{1} > \mu_{2}$$

This formulation results in a right-tailed hypothesis test.

**Step 2: Decide on the significance level, $\alpha$**

In [10]:

```
alpha = 0.01
```

**Step 3 and 4: Compute the value of the test statistic and the p-value**

For illustration purposes, we compute the test statistic manually in Python. Recall the equation for the test statistic from above:

$$t = \frac {(\bar {x}_{1} - \bar {x}_{2}) - (\mu_{1} - \mu_{2})} {s_{p} \sqrt {\frac {1} {n_{1}} + \frac {1} {n_{2}}}}$$

If $H_{0}$ is true, then $\mu_{1} - \mu_{2} = 0$ and thus the equation simplifies to:

$$t = \frac {(\bar {x}_{1} - \bar {x}_{2})} {s_{p} \sqrt {\frac {1} {n_{1}} + \frac {1} {n_{2}}}},$$

where $s_{p}$ is:

$$s_{p} = \sqrt {\frac {(n_{1} - 1) s^{2}_{1} + (n_{2} - 1) s^{2}_{2}} {n_{1} + n_{2} - 2}}$$

Let's calculate the test statistic manually in Python:

In [11]:

```
n1 = n2 = n
std_1 = np.std(sample_males, ddof = 1)
std_2 = np.std(sample_females, ddof = 1)
mean_1 = np.mean(sample_males)
mean_2 = np.mean(sample_females)
sp = np.sqrt(((n1 - 1) * std_1**2 + (n2 - 1) * std_2**2) / (n1 + n2 - 2))
t_value = (mean_1 - mean_2) / (sp * np.sqrt((1/n1) + (1/n2)))
t_value
```

Out[11]:

6.749633014283106

The numerical value of the test statistic is 6.749633.

In order to calculate the *p*-value, we apply the `t.cdf()` function from the `scipy.stats` module to compute the probability of the observed test statistic under the *t*-distribution. To do so, we also need the *degrees of freedom*; recall that they are given by $df = n_{1} + n_{2} - 2$:

In [12]:

```
from scipy.stats import t
df = n1 + n2 - 2
df
p_value = 1 - t.cdf(t_value, df = df)
p_value
```

Out[12]:

5.25205323675948e-10

$p = 5.25205323675948 \times 10^{-10}$
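As a side note, for very small right-tail probabilities the survival function `t.sf()` is the numerically more stable equivalent of `1 - t.cdf()`:

```python
from scipy.stats import t

t_value = 6.749633014283106  # test statistic computed above
df = 98                      # n1 + n2 - 2

p_value = t.sf(t_value, df=df)  # survival function: P(T > t_value)
print(p_value)
```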

**Step 5: If $p \le \alpha$, reject $H_{0}$; otherwise, do not reject $H_{0}$**

In [13]:

```
# reject H0?
p_value <= alpha
```

Out[13]:

True

The *p*-value is less than the specified significance level of 0.01; we reject $H_{0}$. The test results are statistically significant at the 1 % level and provide very strong evidence against the null hypothesis.

**Step 6: Interpret the result of the hypothesis test**

At the 1 % significance level, the data provides very strong evidence to conclude that the average salary of male graduates is higher than the average salary of female graduates.


**The pooled t-test with `scipy`**

We just manually completed a **pooled t-test** in Python. However, please note that we can use the full power of Python's package universe to obtain the same result in just one line of code! For this, we apply the `ttest_ind()` function from the `stats` module of the `scipy` package. We have to provide the following arguments to the function:

- the male observations via `sample_males`
- the female observations via `sample_females`
- whether the variances of both samples are assumed to be equal, via `equal_var = True`
- which alternative hypothesis shall be tested, via `alternative = "greater"`

In [15]:

```
from scipy import stats
test_result = stats.ttest_ind(sample_males, sample_females,
equal_var = True,
alternative = "greater")
test_result
```

Out[15]:

Ttest_indResult(statistic=6.7496330142831065, pvalue=5.25205313471519e-10)

The `ttest_ind()` function returns an object that provides the **test statistic** as well as the corresponding ***p*-value** of the test result. Those values can be retrieved via the following properties:

- `<object>.statistic` holds the test statistic and represents the empirical *t*-value.
- `<object>.pvalue` represents the *p*-value of the performed t-test.

Consequently, the test statistic is retrieved via:

In [16]:

```
test_result.statistic
```

Out[16]:

6.7496330142831065

The ***p*-value** is retrieved via:

In [17]:

```
test_result.pvalue
```

Out[17]:

5.25205313471519e-10

Compare the output of the `ttest_ind()` function with our manual result from above. They match perfectly! Again, we may conclude that at the 1 % significance level, the data provides very strong evidence that the average salary of male graduates is higher than the average salary of female graduates.

**Citation**

The E-Learning project SOGA-Py was developed at the Department of Earth Sciences by Annette Rudolph, Joachim Krois and Kai Hartmann. You can reach us via mail by soga[at]zedat.fu-berlin.de.

You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follows: *Rudolph, A., Krois, J., Hartmann, K. (2023): Statistics and Geodata Analysis using Python (SOGA-Py). Department of Earth Sciences, Freie Universitaet Berlin.*