The **regression t-test** is applied to test if the slope, $\beta$, of the population regression line equals $0$. Based on that test, we may decide whether $x$ is a useful (linear) predictor of $y$.

The test statistic follows a t-distribution with $df = n - 2$ and can be written as

$$t = \frac{\beta}{s_{b}} = \frac{\beta}{s_{e} / \sqrt{\sum(x - \bar{x})^{2}}},$$

where $\beta$ corresponds to the sample regression coefficient and $s_{e}$ to the **residual standard error**, with $s_{e} = \sqrt{\frac{SSE}{n - 2}}$ and $SSE = \sum_{i = 1}^{n} \epsilon_i^{2}$.

The $100(1 - \alpha)\ \%$ confidence interval for $\beta$ is given by

$$\beta \pm t_{\alpha / 2} \times \frac{s_{e}}{\sqrt{\sum(x - \bar{x})^{2}}},$$

where $s_{e}$ corresponds to the **residual standard error** (also known as the **standard error of the estimate**).

The value of $t$ is obtained from the *t*-distribution for the given confidence level and $n - 2$ degrees of freedom.
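The confidence interval formula above can be translated into a few lines of code. As a minimal sketch with a small hypothetical sample (the values below are made up purely for illustration):

```python
import numpy as np
from scipy.stats import t

# Hypothetical paired observations (e.g., heights and weights)
x = np.array([160.0, 165.0, 170.0, 175.0, 180.0, 185.0])
y = np.array([55.0, 60.0, 62.0, 69.0, 72.0, 78.0])
n = len(x)

# Least-squares slope and intercept
beta = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
alpha_hat = y.mean() - beta * x.mean()

# Residual standard error: s_e = sqrt(SSE / (n - 2))
resid = y - (alpha_hat + beta * x)
se = np.sqrt(np.sum(resid**2) / (n - 2))

# 95 % confidence interval for the slope
t_crit = t.ppf(1 - 0.05 / 2, df=n - 2)
half_width = t_crit * se / np.sqrt(np.sum((x - x.mean()) ** 2))
ci = (beta - half_width, beta + half_width)
```

The interval `ci` brackets the sample slope; if it excludes $0$, the two-sided test at the corresponding $\alpha$ rejects $H_0$.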

In order to practice the **regression t-test**, we load the `students.csv` file, either by importing it from your local file system or by loading it directly as a web resource. In either case, we import the data set into Python as a `pandas` `DataFrame` object by using the `read_csv` method:

Note: Make sure the `numpy` and `pandas` packages are part of your `mamba` environment!

In [1]:

```
import pandas as pd
import numpy as np
students = pd.read_csv("https://userpage.fu-berlin.de/soga/data/raw-data/students.csv")
```
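`read_csv` works the same way on a URL, a local file path, or any file-like object. A minimal sketch with an in-memory CSV (hypothetical values, mimicking two of the `students` columns):

```python
import io
import pandas as pd

# Hypothetical two-column CSV text standing in for a local students.csv extract
csv_text = """height,weight
171,65.2
180,78.5
162,56.9
"""

# read_csv accepts a URL, a local path, or any file-like object
small = pd.read_csv(io.StringIO(csv_text))
```

This yields a regular `DataFrame`, so everything shown below applies unchanged to locally loaded data.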

The *students* data set consists of 8239 rows, each representing a particular student, and 16 columns, each corresponding to a variable/feature related to that particular student. These self-explanatory variables are:

- stud.id
- name
- gender
- age
- height
- weight
- religion
- nc.score
- semester
- major
- minor
- score1
- score2
- online.tutorial
- graduated
- salary

In order to showcase the regression *t*-test, we examine the relationship between two variables: the height of students as the predictor variable and the weight of students as the response variable. **The question is whether the predictor variable height is useful for predicting students' weight.**

For data preparation, we randomly sample 12 students from the data set and build a data frame with the two variables of interest (`height` and `weight`). Further, we plot the data as a scatter plot to visualize the underlying linear relationship between the two variables.

In [2]:

```
import seaborn as sns
n = 12
sample = students.sample(n, random_state = 9)[["height", "weight"]]
sample
sns.scatterplot(data = sample, x = "height", y = "weight")
```

Out[2]:

<Axes: xlabel='height', ylabel='weight'>

The visual inspection supports our assumption that the relationship between the height and the weight variable is roughly linear. In other words, with increasing height, the individual student tends to have a higher weight.

In order to conduct the **regression t-test**, we follow the step-wise implementation procedure for hypothesis testing:

- State the null hypothesis $H_{0}$ and alternative hypothesis $H_{A}$.
- Decide on the significance level, $\alpha$.
- Compute the value of the test statistic.
- Determine the *p*-value.
- If $p \le \alpha$, reject $H_{0}$; otherwise, do not reject $H_{0}$.
- Interpret the result of the hypothesis test.

**Step 1: State the null hypothesis $H_{0}$ and alternative hypothesis $H_{A}$**

The null hypothesis states that there is no linear relationship between the `height` and the `weight` of the individuals in the `students` data set:

$$H_{0}: \beta = 0\ \text{(predictor variable is not useful for making predictions)}$$

Alternative hypothesis:

$$H_{A}: \beta \ne 0\ \text{(predictor variable is useful for making predictions)}$$

**Step 2: Decide on the significance level, $\alpha$**

In [3]:

```
alpha = 0.01
```

**Steps 3 and 4: Compute the value of the test statistic and the p-value**

For illustration purposes, we manually compute the test statistic in Python first. Recall the equation for the test statistic from above:

$$t = \frac{\beta}{s_{b}} = \frac{\beta}{s_{e} / \sqrt{\sum(x - \bar{x})^{2}}},$$

where $\beta = \frac{cov(x,y)}{var(x)}$, and

$$s_{e} = \sqrt{\frac{SSE}{n - 2}},$$

where $SSE = \sum_{i = 1}^{n} \epsilon_{i}^{2} = \sum_{i = 1}^{n} (y_{i} - \hat y_{i})^{2}$.

The test statistic follows a t-distribution with $df = n - 2$. In order to calculate $\hat y = \alpha + \beta x$ we need to know $\alpha$, defined as $\alpha = \bar y -\beta \bar {x}$.

We do one step after another to avoid getting confused by the different computational steps.

- Build the linear model by calculating the intercept $(\alpha)$ and the regression coefficient $(\beta)$:

In [4]:

```
weight_mean = np.mean(sample["weight"])
height_mean = np.mean(sample["height"])
lm_beta = np.cov(sample.height, sample.weight)[0, 1] / (np.std(sample.height, ddof = 1)**2)
lm_beta
```

Out[4]:

0.7477660695468916

In [5]:

```
lm_alpha = weight_mean - lm_beta * height_mean
lm_alpha
```

Out[5]:

-54.99202318229719

- Calculate the sum of squares errors $(SSE)$ and the residual standard error $(s_{e})$:

In [6]:

```
y_hat = lm_alpha + sample["height"] * lm_beta
SSE = np.sum((sample["weight"] - y_hat) ** 2)
SSE
```

Out[6]:

115.9337734457324

In [7]:

```
se = np.sqrt(SSE / (n - 2))
se
```

Out[7]:

3.4049048950849183

- Compute the value of the test statistic:

In [8]:

```
t_value = lm_beta / (se / np.sqrt(np.sum((sample["height"] - height_mean)**2)))
t_value
```

Out[8]:

8.734101810239357

The numerical value of the test statistic is 8.73410181.

In order to calculate the *p*-value, we apply the `t.cdf` function from the `scipy` package to calculate the probability of occurrence for the test statistic based on the *t*-distribution. To do so, we also need the *degrees of freedom*; recall that for this test $df = n - 2$.

Note: Make sure the `scipy` package is part of your `mamba` environment!

In [9]:

```
from scipy.stats import t
df = n - 2
p_value_lower = t.cdf(-np.abs(t_value), df)
p_value_upper = 1 - t.cdf(t_value, df)
p_value = p_value_lower + p_value_upper
p_value
```

Out[9]:

5.414310239956427e-06

$p = 5.41431024 \times 10^{-6}$
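Because the *t*-distribution is symmetric, the two tail probabilities summed above are equal, and the two-sided *p*-value can be written more compactly as twice the upper-tail probability via the survival function `t.sf`. A small sketch, reusing the test statistic and degrees of freedom computed above:

```python
from scipy.stats import t

# Test statistic and degrees of freedom from the worked example above (n = 12)
t_value = 8.734101810239357
df = 12 - 2

# Two-sided p-value: twice the upper-tail probability of |t|
p_value = 2 * t.sf(abs(t_value), df)
```

`t.sf(x, df)` is equivalent to `1 - t.cdf(x, df)` but numerically more accurate in the far tail.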

**Step 5: If $p \le \alpha$, reject $H_{0}$; otherwise, do not reject $H_{0}$**

In [10]:

```
# reject H0?
p_value < alpha
```

Out[10]:

True

The *p*-value is smaller than the specified significance level of 0.01; we reject $H_{0}$. The test results are statistically significant at the 1 % level and provide very strong evidence against the null hypothesis.

**Step 6: Interpret the result of the hypothesis test**

At the 1 % significance level, the data provide very strong evidence to conclude that the height variable is a useful predictor of the weight of students.

**The regression t-test with `scipy`**

We just computed a **regression t-test** in Python manually. We can do the same with just one line of code by using the power of Python's package universe, namely the `scipy` package! In order to conduct a **regression t-test** in Python via the `stats` module of the `scipy` package, we apply the `linregress()` function. We only have to provide our data as two separate array-like objects (e.g., `numpy` arrays or `pandas` columns). Furthermore, you can specify the test method via the `alternative` argument. Because the default is already set to `two-sided`, we do not have to adapt it. Additional information regarding the function's usage can be found directly in the `scipy` documentation.

In [11]:

```
from scipy.stats import linregress
test_result = linregress(sample["height"], sample["weight"])
test_result
```

Out[11]:

LinregressResult(slope=0.7477660695468914, intercept=-54.99202318229713, rvalue=0.9402682465821417, pvalue=5.414310239986678e-06, stderr=0.08561453550613006, intercept_stderr=15.028988955403943)

The `linregress()` function returns an `object` which provides all relevant information regarding the performed **regression t-test**. The `object` has the following properties:

- `<object>.slope` represents $\beta$.
- `<object>.intercept` represents $\alpha$.
- `<object>.rvalue` stores the Pearson correlation coefficient.
- `<object>.pvalue` represents the *p*-value of the performed significance test.
- `<object>.stderr` represents the standard error for $\beta$.
- `<object>.intercept_stderr` stores the standard error for $\alpha$.

Consequently, the **p-value** is retrieved via:

In [12]:

```
test_result.pvalue
```

Out[12]:

5.414310239986678e-06

The slope and the intercept of the linear model based on the given data are retrieved via:

In [13]:

```
test_result.slope
```

Out[13]:

0.7477660695468914

In [14]:

```
test_result.intercept
```

Out[14]:

-54.99202318229713

Compare the output to our results from above. They match perfectly!
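The `linregress()` output also lets us reproduce the manual calculation directly: the test statistic is `slope / stderr` with $df = n - 2$. A short sketch on a synthetic sample (the data below are generated purely for illustration, not drawn from `students.csv`):

```python
import numpy as np
from scipy.stats import linregress, t

# Synthetic paired sample standing in for the height/weight columns
rng = np.random.default_rng(0)
x = rng.normal(170, 10, size=12)
y = 0.75 * x - 55 + rng.normal(0, 3, size=12)

res = linregress(x, y)

# The reported p-value is exactly the two-sided regression t-test:
# t = slope / stderr, evaluated on a t-distribution with df = n - 2
t_value = res.slope / res.stderr
p_manual = 2 * t.sf(abs(t_value), df=len(x) - 2)
```

Here `p_manual` agrees with `res.pvalue`, confirming that `linregress()` performs the same test we carried out step by step above.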

**Citation**

The E-Learning project SOGA-Py was developed at the Department of Earth Sciences by Annette Rudolph, Joachim Krois and Kai Hartmann. You can reach us via mail by soga[at]zedat.fu-berlin.de.

You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follows: *Rudolph, A., Krois, J., Hartmann, K. (2023): Statistics and Geodata Analysis using Python (SOGA-Py). Department of Earth Sciences, Freie Universitaet Berlin.*