The regression t-test is applied to test if the slope, $\beta$, of the population regression line equals $0$. Based on that test, we may decide whether $x$ is a useful (linear) predictor of $y$.
The test statistic follows a t-distribution with $df = n - 2$ and can be written as
$$t = \frac{\beta}{s_{b}} = \frac{\beta}{s_{e} / \sqrt{\sum(x - \bar{x})^{2}}}$$

where $\beta$ corresponds to the sample regression coefficient and $s_{e}$ to the residual standard error, with $s_{e} = \sqrt{\frac{SSE}{n - 2}}$ and $SSE = \sum_{i = 1}^{n} \epsilon_i^{2}$.
The $100(1 - \alpha)\ \%$ confidence interval for $\beta$ is given by
$$\beta \pm t_{\alpha / 2} \times \frac{s_{e}}{\sqrt{\sum(x - \bar{x})^{2}}}$$

where $s_{e}$ corresponds to the residual standard error (also known as the standard error of the estimate).
The value of $t_{\alpha / 2}$ is obtained from the t-distribution for the given confidence level and $n - 2$ degrees of freedom.
In order to practice the regression t-test, we load the students data set. You may download the students.csv file here and import it from your local file system, or you may load it directly as a web resource. In either case, you import the data set into Python as a pandas DataFrame object by using the read_csv method:
Note: Make sure the numpy and pandas packages are part of your mamba environment!
import pandas as pd
import numpy as np
students = pd.read_csv("https://userpage.fu-berlin.de/soga/data/raw-data/students.csv")
The students data set consists of 8239 rows, each representing a particular student, and 16 columns, each corresponding to a variable/feature related to that particular student. The variable names are self-explanatory.
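To verify the import, a quick look at the shape attribute of the data frame should confirm the stated dimensions:

students.shape

(8239, 16)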
In order to showcase the regression t-test, we examine the relationship between two variables: the height of students as the predictor variable and the weight of students as the response variable. The question is whether the predictor variable height is useful for predicting the students' weight.
For data preparation, we randomly sample 12 students from the data set and build a data frame with the two variables of interest (height and weight). Further, we plot the data as a scatter plot to visualize the underlying linear relationship between the two variables.
import seaborn as sns
n = 12
sample = students.sample(n, random_state = 9)[["height", "weight"]]
sample
sns.scatterplot(data = sample, x = "height", y = "weight")
<Axes: xlabel='height', ylabel='weight'>
The visual inspection supports our assumption that the relationship between the height and the weight variable is roughly linear. In other words, with increasing height, the individual student tends to have a higher weight.
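As an additional visual check, seaborn can overlay the least-squares fit directly on the scatter plot via the regplot function; the plotted line corresponds to the model we derive manually below (setting ci = None suppresses the confidence band):

sns.regplot(data = sample, x = "height", y = "weight", ci = None)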
In order to conduct the regression t-test, we follow the same step-wise implementation procedure for hypothesis testing as discussed in the previous sections:
Step 1: State the null hypothesis $H_{0}$ and alternative hypothesis $H_{A}$
The null hypothesis states that there is no linear relationship between the height and the weight of the individuals in the students data set:

$$H_{0}: \beta = 0\ \text{(predictor variable is not useful for making predictions)}$$

Alternative hypothesis:

$$H_{A}: \beta \ne 0\ \text{(predictor variable is useful for making predictions)}$$

Step 2: Decide on the significance level, $\alpha$
$$\alpha = 0.01$$

alpha = 0.01
Steps 3 and 4: Compute the value of the test statistic and the p-value
For illustration purposes, we manually compute the test statistic in Python first. Recall the equation for the test statistic from above:
$$t = \frac{\beta}{s_{b}} = \frac{\beta}{s_{e} / \sqrt{\sum(x - \bar{x})^{2}}}$$

where $\beta = \frac{cov(x,y)}{var(x)}$, and

$$s_{e} = \sqrt{\frac{SSE}{n - 2}}$$

where $SSE = \sum_{i = 1}^{n} \epsilon_{i}^{2} = \sum_{i = 1}^{n} (y_i - \hat y_i)^{2}$.
The test statistic follows a t-distribution with $df = n - 2$. In order to calculate $\hat y = \alpha + \beta x$, we need to know $\alpha$, defined as $\alpha = \bar y - \beta \bar{x}$.
We proceed one step at a time to keep the different computational steps clear.
weight_mean = np.mean(sample["weight"])
height_mean = np.mean(sample["height"])
# slope: beta = cov(x, y) / var(x)
lm_beta = np.cov(sample.height, sample.weight)[0, 1] / (np.std(sample.height, ddof = 1)**2)
lm_beta
0.7477660695468916
# intercept: alpha = mean(y) - beta * mean(x)
lm_alpha = weight_mean - lm_beta * height_mean
lm_alpha
-54.99202318229719
# fitted values and sum of squared errors
y_hat = lm_alpha + sample["height"] * lm_beta
SSE = np.sum((sample["weight"] - y_hat) ** 2)
SSE
115.9337734457324
# residual standard error
se = np.sqrt(SSE / (n - 2))
se
3.4049048950849183
# test statistic: t = beta / s_b
t_value = lm_beta / (se / np.sqrt(np.sum((sample["height"] - height_mean)**2)))
t_value
8.734101810239357
The numerical value of the test statistic is 8.73410181.
In order to calculate the p-value, we apply the t.cdf function provided by the scipy package to calculate the probability of occurrence for the test statistic based on the t-distribution. To do so, we also need the degrees of freedom. Recall that for the regression t-test, $df = n - 2$:
Note: Make sure the scipy package is part of your mamba environment!
from scipy.stats import t
df = n - 2
p_value_lower = t.cdf(-np.abs(t_value), df)
p_value_upper = 1 - t.cdf(np.abs(t_value), df)
p_value = p_value_lower + p_value_upper
p_value
5.414310239956427e-06
$p = 5.41431024 \times 10^{-6}$
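Since the t-distribution is symmetric around zero, the same two-sided p-value can be obtained more compactly with the survival function t.sf (equivalent to 1 - t.cdf):

p_value = 2 * t.sf(np.abs(t_value), df)

This yields the same value as the two-tail computation above.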
Step 5: If $p \le \alpha$, reject $H_{0}$; otherwise, do not reject $H_{0}$
# reject H0?
p_value < alpha
True
The p-value is smaller than the specified significance level of 0.01; we reject $H_{0}$. The test results are statistically significant at the 1 % level and provide very strong evidence against the null hypothesis.
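Equivalently, the decision can be made by comparing the absolute value of the test statistic to the critical value of the t-distribution; a quick sketch using the variables defined above:

# critical value for a two-sided test at alpha = 0.01
t_crit = t.ppf(1 - alpha / 2, df)
np.abs(t_value) > t_crit

True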
Step 6: Interpret the result of the hypothesis test
At the 1 % significance level, the data provide very strong evidence to conclude that the height variable is a useful predictor of the weight of students.
scipy

We just computed a regression t-test in Python manually. We can do the same with just one line of code by using the power of Python's package universe, namely the scipy package!
In order to conduct a regression t-test in Python via the stats module of the scipy package, we apply the linregress() function. We only have to provide our data as two separate array-like objects. Furthermore, you can specify the test method via the alternative argument. Because the default is already set to two-sided, we do not have to adapt it. Additional information regarding the function's usage can be found directly in the scipy documentation.
from scipy.stats import linregress
test_result = linregress(sample["height"], sample["weight"])
test_result
LinregressResult(slope=0.7477660695468914, intercept=-54.99202318229713, rvalue=0.9402682465821417, pvalue=5.414310239986678e-06, stderr=0.08561453550613006, intercept_stderr=15.028988955403943)
The linregress() function returns an object which provides all relevant information regarding the performed regression t-test. This includes the slope and the intercept as well as the corresponding p-value of the test result. In detail, the object consists of the following properties:

- <object>.slope represents $\beta$.
- <object>.intercept represents $\alpha$.
- <object>.rvalue stores the Pearson correlation coefficient.
- <object>.pvalue represents the p-value of the performed significance test.
- <object>.stderr represents the standard error for $\beta$.
- <object>.intercept_stderr stores the standard error for $\alpha$.

Consequently, the p-value is retrieved via:
test_result.pvalue
5.414310239986678e-06
The slope and the intercept of the linear model based on the given data are retrieved via:
test_result.slope
0.7477660695468914
test_result.intercept
-54.99202318229713
Compare the output to our results from above. They match perfectly!
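Note that test_result.stderr is exactly the denominator $s_b = s_{e} / \sqrt{\sum(x - \bar{x})^{2}}$ of our manually computed test statistic. With it, we can also compute the $100(1 - \alpha)\ \%$ confidence interval for $\beta$ from the formula given at the beginning of this section; a sketch using the objects defined above:

# standard error of the slope; matches test_result.stderr up to floating-point precision
s_b = se / np.sqrt(np.sum((sample["height"] - height_mean)**2))
# critical t-value for the 99 % confidence level and df = n - 2
t_crit = t.ppf(1 - alpha / 2, df)
# 99 % confidence interval for the slope
(test_result.slope - t_crit * test_result.stderr, test_result.slope + t_crit * test_result.stderr)

Since this interval does not contain zero, it is consistent with rejecting $H_{0}$ at the 1 % level.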
Citation
The E-Learning project SOGA-Py was developed at the Department of Earth Sciences by Annette Rudolph, Joachim Krois and Kai Hartmann. You can reach us via mail at soga[at]zedat.fu-berlin.de.
Please cite as follows: Rudolph, A., Krois, J., Hartmann, K. (2023): Statistics and Geodata Analysis using Python (SOGA-Py). Department of Earth Sciences, Freie Universitaet Berlin.