When we want to test for a difference between two population means and the standard deviations of the two populations differ, the so-called non-pooled t-test, also known as Welch's t-test, is applied.

The non-pooled t-test is very similar to the pooled t-test, except for the test statistic $t$ and the calculation of the degrees of freedom ($df$). The test statistic does not involve $s_{p}$, the pooled standard deviation, and is written as:

$$t = \frac {\bar {x}_{1} - \bar {x}_{2}} { \sqrt {\frac {s^{2}_{1}} {n_{1}} + \frac {s^{2}_{2}} {n_{2} } } }$$

The denominator of the equation above is the estimate of the standard deviation of $\bar {x}_{1} - \bar {x}_{2}$, given by:

$$s_{\bar x_{1} - \bar x_{2}} = \sqrt{ \frac{s^{2}_{1}}{n_{1}} + \frac{s^{2}_{2}}{n_{2}} }\text{.}$$

The test statistic $t$ has a t-distribution and the degrees of freedom ($df$) are given by:

$$df=\frac{\left(\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}\right)^2}{\frac{\left(\frac{s_1^2}{n_1}\right)^2}{n_1-1}+\frac{\left(\frac{s_2^2}{n_2}\right)^2}{n_2-1}}\text{.}$$

When using look-up tables, round down the degrees of freedom to the nearest integer!
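To make the rounding rule concrete, here is a small sketch (using made-up sample statistics, not the students data introduced later) that evaluates the degrees-of-freedom formula above and rounds the result down for a table look-up:

```python
import math

# Hypothetical sample statistics (illustrative values only)
s1, n1 = 4.2, 25   # sample 1: standard deviation and sample size
s2, n2 = 7.9, 30   # sample 2: standard deviation and sample size

# Welch-Satterthwaite approximation of the degrees of freedom
numerator = (s1**2 / n1 + s2**2 / n2) ** 2
denominator = (s1**2 / n1) ** 2 / (n1 - 1) + (s2**2 / n2) ** 2 / (n2 - 1)
df = numerator / denominator

# for look-up tables, round DOWN to the nearest integer
df_table = math.floor(df)
```

Note that the resulting $df$ is generally not an integer, which is why the rounding step is needed when working with printed tables; software such as scipy uses the fractional value directly.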

The non-pooled t-test is robust to moderate violations of the normal population assumption, but it is less robust regarding outliers (Weiss, 2010).

Interval Estimation of $\mu_{1} - \mu_{2}$

The $100(1-\alpha)\ \%$ confidence interval for $\mu_{1} - \mu_{2}$ is:

$$(\bar {x_{1}} - \bar {x_{2}}) \pm t^{*} \times \sqrt { \frac {s^{2}_{1}} {n_{1}} + \frac {s^{2}_{2}} {n_{2}}}$$

where the value of $t^{*}$ is obtained from the t-distribution for the given confidence level. The degrees of freedom ($df$) are obtained using the equation above.
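The interval can be computed with `scipy.stats.t.ppf`; the sample statistics below are illustrative placeholders, not taken from the example that follows:

```python
import numpy as np
from scipy.stats import t

# Illustrative sample statistics (hypothetical values only)
mean1, s1, n1 = 52000.0, 8000.0, 50
mean2, s2, n2 = 48000.0, 6000.0, 50
alpha = 0.05

# estimated standard deviation of the difference of the sample means
se = np.sqrt(s1**2 / n1 + s2**2 / n2)

# Welch-Satterthwaite degrees of freedom
df = (s1**2 / n1 + s2**2 / n2) ** 2 / (
    (s1**2 / n1) ** 2 / (n1 - 1) + (s2**2 / n2) ** 2 / (n2 - 1)
)

# critical value t* for a 100(1 - alpha) % confidence interval
t_star = t.ppf(1 - alpha / 2, df)

# confidence interval for mu_1 - mu_2
ci = ((mean1 - mean2) - t_star * se, (mean1 - mean2) + t_star * se)
```

If the resulting interval does not contain 0, this agrees with rejecting $H_{0}: \mu_{1} = \mu_{2}$ at the corresponding significance level.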

The Non-Pooled t-Test: An Example

In order to get some hands-on experience, we apply the non-pooled t-test in an exercise. Therefore, we load the students data set. You may download the students.csv file here and import it from your local file system, or load it directly as a web resource. In either case, you import the data set into Python as a pandas DataFrame object by using the read_csv function:

Note: Ensure pandas and numpy are installed in your mamba environment!

In [1]:
import pandas as pd
import numpy as np

students = pd.read_csv("https://userpage.fu-berlin.de/soga/data/raw-data/students.csv")

The students data set consists of 8239 rows, each representing a particular student, and 16 columns, each corresponding to a variable/feature related to that particular student. These self-explanatory variables are:

  • stud.id
  • name
  • gender
  • age
  • height
  • weight
  • religion
  • nc.score
  • semester
  • major
  • minor
  • score1
  • score2
  • online.tutorial
  • graduated
  • salary

In order to showcase the non-pooled t-test, we examine the mean annual salary (in Euro) of female graduates with respect to their major study subject. We want to investigate whether there is a statistically significant difference in the mean salary of the following populations:

  • female students with their major in Political Science.
  • female students with their major in Social Sciences.

Data preparation

Starting with the data preparation, we will:

  • subset the students data set based on the variables gender and graduated,
  • split the subset into graduates of Political Science and Social Sciences; this distinction is stored in the variable major,
  • draw a sample of 50 students from each group and extract the annual salary (in Euro) as the variable of interest. The relevant information is stored in the column salary; we store it for the two groups in dedicated variables called sample_polt_sc and sample_social_sc.
In [2]:
female_graduated_students = students.loc[(students.graduated == 1) & (students.gender == "Female")]
political_sc = female_graduated_students.loc[female_graduated_students.major == "Political Science"]
social_sc = female_graduated_students.loc[female_graduated_students.major == "Social Sciences"]

n = 50

sample_polt_sc = political_sc.sample(n, random_state = 9)["salary"]
sample_social_sc = social_sc.sample(n, random_state = 9)["salary"]

Further, we check whether the data is normally distributed by plotting Q-Q plots. You can quickly generate a good-looking Q-Q plot in Python with the probplot() function provided by the stats module of the scipy package.

Note: Ensure matplotlib and scipy are installed in your mamba environment!

In [3]:
import matplotlib.pyplot as plt
import scipy.stats as stats

plt.figure(figsize=(12,5))

ax = plt.subplot(1, 2, 1)
qq = stats.probplot(sample_polt_sc, dist="norm", plot = plt)
ax.set_title("Q-Q plot for female graduates of \nPolitical Science (sample data)")
ax.set_ylabel("Sample quantiles")

ax = plt.subplot(1, 2, 2)
qq = stats.probplot(sample_social_sc, dist="norm", plot = plt)
ax.set_title("Q-Q plot for female graduates of \n Social Sciences (sample data)")
ax.set_ylabel("Sample quantiles")
Out[3]:
Text(0, 0.5, 'Sample quantiles')

The data of both samples falls mostly onto a straight line, so we assume that the underlying populations are approximately normally distributed. Next, we check visually whether the standard deviations of the two populations actually differ from one another by plotting a box plot.

Note: We want to provide a nice-looking box plot using the boxplot() function from the seaborn package. Please ensure seaborn is part of your mamba environment!

In [4]:
# resample as full DataFrames: we also need the column "major" for the box plot
sample_polt_sc = political_sc.sample(n, random_state = 9)
sample_social_sc = social_sc.sample(n, random_state = 9)
In [5]:
import seaborn as sns

plt.figure(figsize=(11,5))

df = pd.DataFrame({'salary' : np.concatenate([sample_social_sc["salary"].values, sample_polt_sc["salary"].values]),
                   'major'  : np.concatenate([sample_social_sc["major"].values, sample_polt_sc["major"].values])},
                  columns = ['salary', 'major'])

sns.boxplot(
    data=df, 
    x="salary", 
    y="major"
).set(
    title='Population data',
    xlabel='Annual salary in EUR',
    ylabel=''
)
Out[5]:
[Text(0.5, 1.0, 'Population data'),
 Text(0.5, 0, 'Annual salary in EUR'),
 Text(0, 0.5, '')]

Based on the graphical evaluation we conclude that the data is roughly normally distributed and that the standard deviations differ from each other.
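If you prefer a numerical complement to the graphical checks, scipy.stats also offers formal tests: shapiro() for normality and levene() for equality of variances. The sketch below uses synthetic stand-in data so it is self-contained; with the actual samples you would pass sample_polt_sc["salary"] and sample_social_sc["salary"] instead:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(9)

# synthetic stand-ins for the two salary samples (illustrative only)
group_a = rng.normal(loc=50000, scale=8000, size=50)
group_b = rng.normal(loc=46000, scale=5000, size=50)

# Shapiro-Wilk test: H0 = the sample comes from a normal distribution
shapiro_a = stats.shapiro(group_a)
shapiro_b = stats.shapiro(group_b)

# Levene's test: H0 = the two groups have equal variances
levene = stats.levene(group_a, group_b)
```

A small p-value in Levene's test would support the choice of the non-pooled over the pooled t-test; a small p-value in the Shapiro-Wilk test would cast doubt on the normality assumption.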

Hypothesis testing

Recall the research question: Do the data provide sufficient evidence to conclude that the mean annual salary of female graduates with a major in Political Science differs from the mean annual salary of female graduates with a major in Social Sciences?

In order to conduct the non-pooled t-test, we follow the step-wise implementation procedure for hypothesis testing.

Step 1: State the null hypothesis $H_{0}$ and alternative hypothesis $H_{A}$

The null hypothesis states that the average annual salary of female graduates with a major in Political Science ($\mu_{1}$) is equal to the average annual salary of female graduates with a major in Social Sciences ($\mu_{2}$):

$$H_{0}: \quad \mu_{1} = \mu_{2}$$

Alternative hypothesis:

$$H_{A}: \quad \mu_{1} \ne \mu_{2} $$

This formulation results in a two-sided hypothesis test.


Step 2: Decide on the significance level, $\alpha$

$$\alpha = 0.05$$
In [6]:
alpha = 0.05

Step 3 and 4: Compute the value of the test statistic and the p-value

For illustration purposes we manually compute the test statistic in Python. Recall the equation for the test statistic from above:

$$t = \frac {\bar {x}_{1} - \bar {x}_{2}} { \sqrt {\frac {s^{2}_{1}} {n_{1}} + \frac {s^{2}_{2}} {n_{2} } } }$$
In [7]:
mean_1 = np.mean(sample_polt_sc["salary"])
mean_2 = np.mean(sample_social_sc["salary"])

std_1 = np.std(sample_polt_sc["salary"], ddof = 1)
std_2 = np.std(sample_social_sc["salary"], ddof = 1)

t_value = (mean_1 - mean_2) / (np.sqrt((std_1**2 / n) + (std_2**2 / n)))

t_value
Out[7]:
2.6231818381404737

The numerical value of the test statistic is 2.6232.

In order to calculate the p-value, we apply the t.cdf function provided by the scipy package to calculate the probability of occurrence for the test statistic based on the t-distribution. To do so, we also need the degrees of freedom. Recall how to calculate the degrees of freedom:

$$df=\frac{\left(\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}\right)^2}{\frac{\left(\frac{s_1^2}{n_1}\right)^2}{n_1-1}+\frac{\left(\frac{s_2^2}{n_2}\right)^2}{n_2-1}}\text{.}$$
In [8]:
df_numerator = ((std_1**2 / n) + (std_2**2 / n)) ** 2
df_denominator = (((std_1**2 / n) ** 2) / (n - 1)) + (((std_2**2 / n) ** 2) / (n - 1))
df = df_numerator / df_denominator
print(df)
88.98193607766795
In [9]:
from scipy.stats import t

lower = t.cdf(-abs(t_value), df = df)
upper = 1 - t.cdf(abs(t_value), df = df)

p = lower + upper
print(p)
0.010249087170337384

$p = 0.0102490871$
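Because the t-distribution is symmetric, the same two-sided p-value can be obtained more compactly with the survival function t.sf. The test statistic and degrees of freedom computed above are hard-coded here so the sketch is self-contained:

```python
from scipy.stats import t

# values computed in the steps above
t_value = 2.6231818381404737
df = 88.98193607766795

# two-sided p-value: twice the upper-tail probability of |t|
p = 2 * t.sf(abs(t_value), df)
```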


Step 5: If $p \le \alpha$, reject $H_{0}$; otherwise, do not reject $H_{0}$

In [10]:
# reject H0?
p <= alpha
Out[10]:
True

The p-value is less than the specified significance level of 0.05; we reject $H_{0}$. The test results are statistically significant at the 5 % level and provide strong evidence against the null hypothesis.


Step 6: Interpret the result of the hypothesis test

At the 5 % significance level the data provides strong evidence to conclude that the average annual salary of female graduates of Political Science differs from the average annual salary of female graduates of Social Sciences.

Hypothesis testing in Python with scipy

We just manually completed a non-pooled t-test in Python. However, please note that we can use the full power of Python's package universe to obtain the same result as above in just one line of code!

Exercise: Repeat the above example by applying the ttest_ind() function from the stats module of the scipy package to conduct a non-pooled t-test in Python!

Hint: You will need to provide sample_polt_sc as well as sample_social_sc as observations. Furthermore, you must set the argument equal_var to False. You can find additional information on the function's usage in scipy's documentation.

In [11]:
### your solution
In [12]:
from scipy import stats

test_result = stats.ttest_ind(sample_polt_sc["salary"], sample_social_sc["salary"], equal_var = False)

print("t-value:", round(test_result.statistic, 5))
print("p-value:", round(test_result.pvalue, 5))
t-value: 2.62318
p-value: 0.01025

Super powerful! Compare the output of the ttest_ind() function with our result from above. They match perfectly! Again, we may conclude that at the 5 % significance level the data provides strong evidence to conclude that the average annual salary of female graduates of Political Science differs from the average annual salary of female graduates of Social Sciences.


Citation

The E-Learning project SOGA-Py was developed at the Department of Earth Sciences by Annette Rudolph, Joachim Krois and Kai Hartmann. You can reach us via mail by soga[at]zedat.fu-berlin.de.

Creative Commons License
You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follows: Rudolph, A., Krois, J., Hartmann, K. (2023): Statistics and Geodata Analysis using Python (SOGA-Py). Department of Earth Sciences, Freie Universitaet Berlin.