The regression t-test is applied to test if the slope, $\beta$, of the population regression line equals $0$. Based on that test, we may decide whether $x$ is a useful (linear) predictor of $y$.
The test statistic follows a t-distribution with $df = n - 2$ and can be written as
$$t = \frac{\beta}{s_{b}} = \frac{\beta}{s_{e} / \sqrt{\sum(x - \bar{x})^{2}}},$$ where $\beta$ corresponds to the sample regression coefficient (the estimate of the population slope) and $s_{e}$ to the residual standard error, with $s_{e} = \sqrt{\frac{SSE}{n - 2}}$ and $SSE = \sum_{i = 1}^{n} \epsilon_{i}^{2}$.
The $100(1 - \alpha)\ \%$ confidence interval for $\beta$ is given by
$$\beta \pm t_{\alpha / 2} \times \frac{s_{e}}{\sqrt{\sum(x - \bar{x})^{2}}},$$ where $s_{e}$ corresponds to the residual standard error (also known as the standard error of the estimate).
The value of $t$ is obtained from the t-distribution for the given confidence level and $n - 2$ degrees of freedom.
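To make the formula concrete, here is a minimal sketch in Python. The slope b, the residual standard error se, and the sum of squared deviations sxx are hypothetical placeholder values (no data has been introduced yet); the critical value is obtained from the t-distribution quantile function scipy.stats.t.ppf:

from scipy.stats import t

# hypothetical placeholder values, for illustration only
b = 0.75      # sample slope
se = 3.4      # residual standard error
sxx = 1580.0  # sum of squared deviations of x from its mean
n = 12        # sample size
alpha = 0.01  # 1 - confidence level

# two-sided critical value with n - 2 degrees of freedom
t_crit = t.ppf(1 - alpha / 2, n - 2)
margin = t_crit * se / sxx ** 0.5
(b - margin, b + margin)

An interval that does not contain $0$ points to the same conclusion as rejecting $H_{0}$ in the corresponding two-sided regression t-test.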
In order to practice the regression t-test, we load the students data set. You may download the students.csv file here and import it from your local file system, or load it directly as a web resource. In either case, you import the data set into Python as a pandas DataFrame object by using the read_csv() function:
Note: Make sure the numpy and pandas packages are part of your mamba environment!
import pandas as pd
import numpy as np
students = pd.read_csv("https://userpage.fu-berlin.de/soga/data/raw-data/students.csv")
The students data set consists of 8239 rows, each representing a particular student, and 16 columns, each corresponding to a variable/feature related to that particular student. The variable names are self-explanatory.
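As a quick sanity check, you may verify these dimensions via the DataFrame's shape attribute:

students.shape

(8239, 16)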
In order to showcase the regression t-test, we examine the relationship between two variables: the height of students as the predictor variable and the weight of students as the response variable. The question is whether the predictor variable height is useful for predicting students' weight.
For data preparation, we randomly sample 12 students from the data set and build a data frame with the two variables of interest (height and weight). Further, we plot the data as a scatter plot to visualize the underlying linear relationship between the two variables.
import seaborn as sns
n = 12
sample = students.sample(n, random_state = 9)[["height", "weight"]]
sample
sns.scatterplot(data = sample, x = "height", y = "weight")
[Scatter plot of weight against height for the 12 sampled students]
The visual inspection supports our assumption that the relationship between the height and the weight variables is roughly linear. In other words, with increasing height, the individual student tends to have a higher weight.
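To complement the visual inspection numerically, we may compute the Pearson correlation coefficient of the two variables using pandas' corr() method; it should match the rvalue reported by scipy's linregress() further below (≈ 0.94):

# Pearson correlation between height and weight in the sample
sample["height"].corr(sample["weight"])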
In order to conduct the regression t-test, we follow the same step-wise implementation procedure for hypothesis testing as discussed in the previous sections:
Step 1: State the null hypothesis $H_{0}$ and alternative hypothesis $H_{A}$
The null hypothesis states that there is no linear relationship between the height and the weight of the individuals in the students data set:

$$H_{0}: \beta = 0\ \text{(predictor variable is not useful for making predictions)}$$
Alternative hypothesis:
$$H_{A}: \beta \ne 0\ \text{(predictor variable is useful for making predictions)}$$

Step 2: Decide on the significance level, $\alpha$
$$\alpha = 0.01$$

alpha = 0.01
Steps 3 and 4: Compute the value of the test statistic and the p-value
For illustration purposes, we manually compute the test statistic in Python first. Recall the equation for the test statistic from above:
$$t = \frac{\beta}{s_{b}} = \frac{\beta}{s_{e} / \sqrt{\sum(x - \bar{x})^{2}}},$$ where $\beta = \frac{\mathrm{cov}(x, y)}{\mathrm{var}(x)}$, and

$$s_{e} = \sqrt{\frac{SSE}{n - 2}},$$ where $SSE = \sum_{i = 1}^{n} \epsilon_{i}^{2} = \sum_{i = 1}^{n} (y_{i} - \hat{y}_{i})^{2}$.
The test statistic follows a t-distribution with $df = n - 2$. In order to calculate $\hat y = \alpha + \beta x$, we need to know the intercept $\alpha$ (not to be confused with the significance level), defined as $\alpha = \bar y - \beta \bar{x}$.
We proceed one step at a time to avoid confusion between the different computational steps.
# sample means of the response (weight) and the predictor (height)
weight_mean = np.mean(sample["weight"])
height_mean = np.mean(sample["height"])
# slope estimate: sample covariance of height and weight divided by sample variance of height
lm_beta = np.cov(sample.height, sample.weight)[0, 1] / (np.std(sample.height, ddof = 1)**2)
lm_beta
0.7477660695468916
# intercept estimate: mean(y) - beta * mean(x)
lm_alpha = weight_mean - lm_beta * height_mean
lm_alpha
-54.99202318229719
# fitted values and sum of squared errors
y_hat = lm_alpha + sample["height"] * lm_beta
SSE = np.sum((sample["weight"] - y_hat) ** 2)
SSE
115.9337734457324
# residual standard error
se = np.sqrt(SSE / (n - 2))
se
3.4049048950849183
# test statistic: beta / (se / sqrt(sum of squared deviations of x))
t_value = lm_beta / (se / np.sqrt(np.sum((sample["height"] - height_mean)**2)))
t_value
8.734101810239357
The numerical value of the test statistic is 8.73410181.
In order to calculate the p-value, we apply the t.cdf function provided by the scipy package to calculate the probability of observing a test statistic at least as extreme as the one obtained, based on the t-distribution. To do so, we also need the degrees of freedom. Recall how to calculate the degrees of freedom:
Note: Make sure the scipy package is part of your mamba environment!
from scipy.stats import t
df = n - 2  # degrees of freedom
# two-sided p-value: probability mass in both tails of the t-distribution
p_value_lower = t.cdf(-np.abs(t_value), df)
p_value_upper = 1 - t.cdf(np.abs(t_value), df)  # use |t| so the formula also works for negative t
p_value = p_value_lower + p_value_upper
p_value
5.414310239956427e-06
$p = 5.41431024 \times 10^{-6}$
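Since the t-distribution is symmetric around zero, both tail probabilities are equal, so the same two-sided p-value can also be obtained in a single line via the survival function t.sf:

# equivalent one-liner: twice the upper-tail area beyond |t|
2 * t.sf(np.abs(t_value), df)

This reproduces the p-value computed above.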
Step 5: If $p \le \alpha$, reject $H_{0}$; otherwise, do not reject $H_{0}$
# reject H0?
p_value < alpha
True
The p-value is smaller than the specified significance level of 0.01; we reject $H_{0}$. The test results are statistically significant at the 1 % level and provide very strong evidence against the null hypothesis.
Step 6: Interpret the result of the hypothesis test
At the 1 % significance level, the data provide very strong evidence to conclude that the height variable is a useful predictor of the weight of students.
scipy

We just computed a regression t-test in Python manually. We can do the same with just one line of code by using the power of Python's package universe, namely the scipy package!
In order to conduct a regression t-test in Python via the stats module from the scipy package, we apply the linregress() function. We only have to provide our x and y data separately as array-like objects. Furthermore, you can specify the test method via the alternative argument. Because the default is already set to two-sided, we do not have to adapt it. Additional information regarding the function's usage can be found directly in the scipy documentation.
from scipy.stats import linregress
test_result = linregress(sample["height"], sample["weight"])
test_result
LinregressResult(slope=0.7477660695468914, intercept=-54.99202318229713, rvalue=0.9402682465821417, pvalue=5.414310239986678e-06, stderr=0.08561453550613006, intercept_stderr=15.028988955403943)
The linregress() function returns an object, which provides all relevant information regarding the performed regression t-test. This also includes the slope and the intercept as well as the corresponding p-value of the test result. In detail, the object consists of the following properties:
<object>.slope represents $\beta$.
<object>.intercept represents $\alpha$.
<object>.rvalue stores the Pearson correlation coefficient.
<object>.pvalue represents the p-value of the performed significance test.
<object>.stderr represents the standard error of $\beta$.
<object>.intercept_stderr stores the standard error of $\alpha$.

Consequently, the p-value is retrieved via:
test_result.pvalue
5.414310239986678e-06
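As an aside, because rvalue stores the Pearson correlation coefficient, squaring it yields the coefficient of determination $R^{2}$ of the simple linear model, i.e., the share of the variance in weight accounted for by height (here roughly 0.88):

# coefficient of determination R^2 = r^2
test_result.rvalue ** 2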
The slope and the intercept of the linear model based on the given data are retrieved via:
test_result.slope
0.7477660695468914
test_result.intercept
-54.99202318229713
Compare the output to our results from above. They match perfectly!
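As a final cross-check, we may plug the reported standard error into the confidence interval formula from the beginning of this section; the following small sketch reuses alpha and df from above and obtains the critical value from scipy's t.ppf:

# 99 % confidence interval for the slope
t_crit = t.ppf(1 - alpha / 2, df)
(test_result.slope - t_crit * test_result.stderr,
 test_result.slope + t_crit * test_result.stderr)

The resulting interval of roughly (0.48, 1.02) does not contain 0, which is consistent with rejecting $H_{0}$ at the 1 % significance level.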
Citation
The E-Learning project SOGA-Py was developed at the Department of Earth Sciences by Annette Rudolph, Joachim Krois and Kai Hartmann. You can reach us via mail at soga[at]zedat.fu-berlin.de.

Please cite as follows: Rudolph, A., Krois, J., Hartmann, K. (2023): Statistics and Geodata Analysis using Python (SOGA-Py). Department of Earth Sciences, Freie Universitaet Berlin.