Let us now turn to a hypothesis testing procedure for the difference between two population means when the samples are **dependent**. If, for example, two data values are collected from the same source (or element), these are called **paired** or **matched samples**.

These procedures are often applied in **Before-After-Control-Impact (BACI)** analyses. Imagine you are asked to evaluate the effectiveness of a filtering system in removing air pollutants released by a factory. One population consists of air quality measurements taken before the filtering system is installed or renewed; the other population consists of air quality measurements taken after the new filter system is in place. You are dealing with paired samples because the two data sets are collected from the same source, i.e. the factory.

In paired samples, the difference between the data values of the two samples is denoted by $d$, often called the **paired difference**. Note that the sample size $n$ is the same for both samples. The mean of the paired differences for the samples is denoted as $\bar{d}$:

$$\bar{d} = \frac{\sum d}{n}$$

The standard deviation of paired differences for two samples, $s_{d}$, is calculated as:

$$s_{d} = \sqrt{\frac{\sum d^{2} - \frac{(\sum d)^{2}}{n}}{n-1}}$$

Suppose that the paired-difference variable $d$ is normally distributed. Then the paired $t$-statistic is expressed as:

$$t = \frac{\bar{d} - (\mu_{1} - \mu_{2})}{\frac{s_{d}}{\sqrt{n}}}$$

which simplifies to:

$$t = \frac{\bar{d}}{\frac{s_{d}}{\sqrt{n}}}$$

if $\mu_{1} - \mu_{2} = 0$. The test statistic $t$ for paired samples follows a *t*-distribution with $df = n - 1$.
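To see that this formula is indeed what a paired *t*-test computes, we can compare the manual calculation against `scipy.stats.ttest_rel` on a small set of hypothetical before/after scores (toy data, not taken from the students data set):

```python
import numpy as np
from scipy import stats

# hypothetical paired scores for six subjects (illustration only)
before = np.array([70.0, 68.0, 75.0, 80.0, 66.0, 72.0])
after = np.array([74.0, 67.0, 78.0, 85.0, 70.0, 73.0])

d = before - after
n = len(d)

# manual paired t-statistic: d-bar / (s_d / sqrt(n))
t_manual = np.mean(d) / (np.std(d, ddof=1) / np.sqrt(n))

# scipy's paired t-test returns the same statistic
t_scipy = stats.ttest_rel(before, after).statistic
print(round(t_manual, 6), round(t_scipy, 6))
```

Both values agree, confirming that `ttest_rel` implements exactly this paired statistic.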

The $100(1 - \alpha)\ \%$ confidence interval for $\mu_{d}$ is

$$\bar{d} \pm t \times \frac{s_{d}}{\sqrt{n}}$$

where the value of $t$ is obtained from the *t*-distribution for the given confidence level and $n - 1$ degrees of freedom.
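As a sketch of how this interval could be computed in Python, using hypothetical toy differences and the critical value from `scipy.stats.t.ppf`:

```python
import numpy as np
from scipy.stats import t

# hypothetical paired differences (toy data)
d = np.array([-3.0, 1.0, -4.0, -2.0, 0.0, -5.0, -1.0])
n = len(d)
alpha = 0.05

d_bar = np.mean(d)
s_d = np.std(d, ddof=1)

# two-sided critical value for n - 1 degrees of freedom
t_crit = t.ppf(1 - alpha / 2, df=n - 1)

lower = d_bar - t_crit * s_d / np.sqrt(n)
upper = d_bar + t_crit * s_d / np.sqrt(n)
print(round(lower, 3), round(upper, 3))
```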

In order to practice the **paired t-test**, we load the `students.csv` file. Either download it and import it from your local file system, or load it directly as a web resource. In either case, you import the data set into Python as a `pandas` `DataFrame` object by using the `read_csv` method:

Note: Ensure `pandas` and `numpy` are installed in your `mamba` environment!

In [1]:

```
import pandas as pd
import numpy as np
students = pd.read_csv("https://userpage.fu-berlin.de/soga/data/raw-data/students.csv")
```

The *students* data set consists of 8239 rows, each representing a particular student, and 16 columns, each corresponding to a variable/feature related to that particular student. These self-explanatory variables are:

- stud.id
- name
- gender
- age
- height
- weight
- religion
- nc.score
- semester
- major
- minor
- score1
- score2
- online.tutorial
- graduated
- salary
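To verify the dimensions and variable names yourself, the `shape` and `columns` attributes of a `pandas` dataframe do the job. The pattern is sketched below on a small hypothetical frame; applied to the `students` dataframe loaded above, `students.shape` returns `(8239, 16)`:

```python
import pandas as pd

# small hypothetical frame to illustrate the pattern
df = pd.DataFrame({"stud.id": [1, 2], "name": ["A", "B"], "age": [21, 23]})

print(df.shape)          # (number of rows, number of columns)
print(list(df.columns))  # variable names
```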

In order to showcase the paired *t*-test for dependent samples, **we are interested in whether an online statistics learning tutorial helps students improve their grades**.

There are three variables of interest in the `students` data set:

- The variable `online.tutorial` is a binary variable: `1` if the student completed the online statistics learning tutorial, `0` otherwise.
- The variables `score1` and `score2` show the grades (0-100) for two exams on mathematics and statistics. The higher the value, the better the particular student performed. Please note that the first exam takes place before the students attended the online statistics learning tutorial. Participation in the online statistics learning tutorial is not mandatory; however, the two exams are obligatory for all students. The first exam (`score1`) takes place at the beginning of the 3rd semester, the second exam (`score2`) at the end of the 3rd semester.

There are two research questions of interest:

1. **We want to examine whether the group of students who attended the online statistics learning tutorial performed better on the second exam compared to the first exam.**
2. **We test how the group of students that did not join the online statistics learning tutorial performed on both exams.**

We start with the first research question and focus on those students that attended the online statistics learning tutorial.

For data preparation, we subset the data set based on the variable `online.tutorial`, which indicates whether the student took the tutorial ($1=\text{yes}, \,0=\text{no}$). Then, we randomly sample 65 students from the data set and extract the two variables of interest, `score1` and `score2`. We store the sample as a `DataFrame` object called `sample`.

In [2]:

```
n = 65
# keep only students who completed the online tutorial
subset = students.loc[students["online.tutorial"] == 1]
# draw a reproducible random sample of n students with both exam scores
sample = subset.sample(n, random_state = 9)[["score1", "score2"]]
```

Now, we compute the paired differences, $d$, and plot them:

Note: Ensure `matplotlib` and `seaborn` are installed in your `mamba` environment!

In [3]:

```
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(9, 6))
# paired differences: score1 - score2
d = sample["score1"] - sample["score2"]
data = pd.DataFrame({"index": np.arange(1, n + 1),
                     "Paired differences": d},
                    columns=["index", "Paired differences"])
sns.barplot(data=data, x="index", y="Paired differences", color="darkgrey")
plt.axhline(y=0, color="orangered")
ax = plt.gca()
ax.get_xaxis().set_visible(False)
```

The plot looks as expected. Some students perform better on the first exam than the second and vice versa.
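To quantify this impression, we can count signs: since $d = \text{score1} - \text{score2}$, a negative difference means the student scored higher on the second exam. A minimal sketch with hypothetical toy differences (with the real data you would apply it to `d` from above):

```python
import numpy as np

# hypothetical paired differences (toy data)
d_toy = np.array([-3.0, 2.0, -1.0, -4.0, 0.0, -2.0])

improved = np.sum(d_toy < 0)    # better on the second exam
worsened = np.sum(d_toy > 0)    # better on the first exam
unchanged = np.sum(d_toy == 0)  # same score on both exams
print(improved, worsened, unchanged)
```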

In order to check the normality assumption, we again rely on a visual inspection of a Q-Q plot. The Q-Q plot should be roughly linear if the variable is normally distributed. You can quickly generate a good-looking Q-Q plot in Python with the `probplot()` function provided by the `stats` module of the `scipy` package.

Note: Ensure the `scipy` package is part of your `mamba` environment!

In [4]:

```
import matplotlib.pyplot as plt
import scipy.stats as stats

# create figure and axes in one call; a separate plt.figure() call would
# leave an unused empty figure behind
fig, ax = plt.subplots(figsize=(12, 5))
qq = stats.probplot(d, dist="norm", plot=plt)
ax.set_title("Q-Q plot for differences in exam scores")
ax.set_ylabel("Sample quantiles")
```

Out[4]:

Text(0, 0.5, 'Sample quantiles')

The points are a bit noisy, but they follow the reference line closely enough: the data seems to be roughly normally distributed.
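If you prefer a numerical complement to the visual inspection, the Shapiro-Wilk test implemented in `scipy.stats.shapiro` is one common option (it is not part of the original workflow here, which relies on the Q-Q plot alone). A sketch on simulated normal data standing in for the real differences:

```python
import numpy as np
from scipy import stats

# simulated, roughly normal differences (stand-in for the real d)
rng = np.random.default_rng(42)
d_sim = rng.normal(loc=0, scale=5, size=65)

stat, p = stats.shapiro(d_sim)
# a large p-value gives no evidence against normality
print(round(stat, 4), round(p, 4))
```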

We further calculate $\bar {d}$, the mean of the paired differences by:

$$\bar{d} = \frac{\sum d}{n}$$

and $s_{d}$, the standard deviation of the paired differences for two samples:

$$s_{d} = \sqrt{\frac{\sum d^{2} - \frac{(\sum d)^{2}}{n}}{n - 1}}$$

In [5]:

```
# mean of the paired differences
diff_mean = np.mean(d)
# standard deviation of the paired differences (shortcut formula)
diff_std = np.sqrt((np.sum(d**2) - ((np.sum(d)**2) / n)) / (n - 1))
```

Now we are ready to apply the **paired t-test**. Recall our first research question: Did the students who attended the online statistics learning tutorial perform better on the second exam than on the first?

We follow the step-wise implementation procedure for hypothesis testing.

**Step 1: State the null hypothesis $H_{0}$ and alternative hypothesis $H_{A}$**

The null hypothesis states that there is no difference in the mean of the exam grades of one exam compared to the other:

$$H_{0}: \quad \mu_{1} = \mu_{2}$$

Recall that the formulation of the alternative hypothesis dictates whether we apply a two-sided, a left-tailed or a right-tailed hypothesis test.

Alternative hypothesis:

$$H_{A}: \quad \mu_{1} < \mu_{2}$$

This formulation results in a left-tailed hypothesis test and states that, on average, the students perform better on the second exam.

**Step 2: Decide on the significance level, $\alpha$**

In [6]:

```
alpha = 0.05
```

**Steps 3 and 4: Compute the value of the test statistic and the p-value**

For illustration purposes we manually compute the test statistic in Python. Recall the equation from above:

$$t = \frac{\bar{d} - (\mu_{1} - \mu_{2})}{\frac{s_{d}}{\sqrt{n}}}$$

If $H_{0}$ is true, then $\mu_{1} - \mu_{2} = 0$ and thus, the equation simplifies to

$$t = \frac{\bar{d}}{\frac{s_{d}}{\sqrt{n}}}$$

In [7]:

```
t_value = diff_mean / (diff_std / np.sqrt(n))
t_value
```

Out[7]:

-1.8474277357017477

The numerical value of the test statistic is -1.84743.

In order to calculate the *p*-value, we apply the `t.cdf` function provided by the `scipy` package to calculate the probability of occurrence of the test statistic under the *t*-distribution. For this we also need the *degrees of freedom*; recall that for paired samples $df = n - 1$:

In [8]:

```
from scipy.stats import t

df = n - 1
# left-tailed test: probability of observing a value <= t_value
p = t.cdf(t_value, df = df)
p
```

Out[8]:

0.034653989114893445

$p = 0.03465398911$

**Step 5: If $p \le \alpha$, reject $H_{0}$; otherwise, do not reject $H_{0}$**

In [9]:

```
# reject H0?
p <= alpha
```

Out[9]:

True

The *p*-value is less than the specified significance level of 0.05; we reject $H_{0}$. The test results are statistically significant at the 5 % level and provide strong evidence against the null hypothesis.

**Step 6: Interpret the result of the hypothesis test**

At the 5 % significance level, the data provide strong evidence to conclude that students' exam grades improve after taking an online statistics learning tutorial.

**The paired t-test with `scipy`**

We just manually completed a **paired t-test** in Python. However, please note that we can use the full power of Python's package universe to obtain the same result as above in just one line of code!

Exercise: Repeat the above example by applying the `ttest_rel()` function from the `stats` module of the `scipy` package to conduct a paired *t*-test in Python!

Hint: You will need to provide `sample["score1"]` as well as `sample["score2"]` as observations. Furthermore, you must adapt the default of the `alternative` argument accordingly. You can find additional information on the function's usage in `scipy`'s documentation.

In [ ]:

```
### your solution
```

In [10]:

```
from scipy import stats
test_result = stats.ttest_rel(sample["score1"], sample["score2"], alternative = "less")
print("t-value:", round(test_result.statistic, 5))
print("p-value:", round(test_result.pvalue, 5))
```

t-value: -1.84743
p-value: 0.03465

Awesome! Compare the results of the method's output with our result from above. They match perfectly! Again, we may conclude that at the 5 % significance level, the data provides strong evidence that the exam grades of students improve after taking the online statistics learning tutorial.

Before we continue, there is still one research question to be answered. What if there are other reasons for better grades on the second exam? What if the second exam was simply easier? What if the students had an excellent lecturer and thus improved during the semester? We check this by conducting a **paired t-test** explicitly for those students who did not take the online statistics learning tutorial. Now, as we are fully aware of the powerful `scipy` package, we conduct the test directly with `ttest_rel()`.

In [11]:

```
sample = students.loc[students["online.tutorial"] == 0].dropna().sample(n, random_state = 10)[["score1", "score2"]]
test_result = stats.ttest_rel(sample["score1"], sample["score2"], alternative = "less")
print("t-value:", round(test_result.statistic, 5))
print("p-value:", round(test_result.pvalue, 5))
```

t-value: 0.68109
p-value: 0.75086

The *p*-value is greater than the specified significance level of 0.05; we do not reject $H_{0}$. The test results are not statistically significant at the 5 % level and do not provide sufficient evidence against the null hypothesis.

At the 5 % significance level, the data does not provide sufficient evidence to conclude that the exam grades of students, who did not attend the online tutorial, improved.

**Citation**

The E-Learning project SOGA-Py was developed at the Department of Earth Sciences by Annette Rudolph, Joachim Krois and Kai Hartmann. You can reach us via mail by soga[at]zedat.fu-berlin.de.

You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follows: *Rudolph, A., Krois, J., Hartmann, K. (2023): Statistics and Geodata Analysis using Python (SOGA-Py). Department of Earth Sciences, Freie Universitaet Berlin.*