The regression t-test is applied to test if the slope, $\beta$, of the population regression line equals $0$. Based on that test, we may decide whether $x$ is a useful (linear) predictor of $y$.
The test statistic follows a t-distribution with $df = n - 2$ and can be written as
$$t = \frac{\beta}{s_{b}} = \frac{\beta}{s_{e} / \sqrt{\sum(x - \bar{x})^{2}}}$$

where $\beta$ corresponds to the sample regression coefficient and $s_{e}$ to the residual standard error, with $s_{e} = \sqrt{\frac{SSE}{n - 2}}$ and $SSE = \sum_{i = 1}^{n} \epsilon_i^{2}$.
The $100(1 - \alpha)\ \%$ confidence interval for $\beta$ is given by
$$\beta \pm t_{\alpha / 2} \times \frac{s_{e}}{\sqrt{\sum(x - \bar{x})^{2}}}$$

where $s_{e}$ corresponds to the residual standard error (also known as the standard error of the estimate).
The value of $t_{\alpha / 2}$ is obtained from the t-distribution for the given confidence level and $n - 2$ degrees of freedom.
In order to practice the regression t-test, we load the students data set. You may download the students.csv file here and import it from your local file system, or you may load it directly as a web resource. In either case, you import the data set into Python as a pandas DataFrame object by using the read_csv method:
Note: Make sure the numpy and pandas packages are part of your mamba environment!
import pandas as pd
import numpy as np
students = pd.read_csv("https://userpage.fu-berlin.de/soga/data/raw-data/students.csv")
The students data set consists of 8239 rows, each representing a particular student, and 16 columns, each corresponding to a variable/feature related to that particular student. The variable names are self-explanatory.
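To verify the import, a quick look at the shape attribute of the data frame should confirm the stated dimensions:

students.shape

(8239, 16)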
In order to showcase the regression t-test, we examine the relationship between two variables: the height of students as the predictor variable and the weight of students as the response variable. The question is whether the predictor variable height is useful for predicting the students' weight.
For data preparation, we randomly sample 12 students from the data set and build a data frame with the two variables of interest (height and weight). Further, we plot the data as a scatter plot to visualize the underlying linear relationship between the two variables.
import seaborn as sns
n = 12
sample = students.sample(n, random_state = 9)[["height", "weight"]]
sample
sns.scatterplot(data = sample, x = "height", y = "weight")
<Axes: xlabel='height', ylabel='weight'>
The visual inspection supports our assumption that the relationship between the height and the weight variable is roughly linear. In other words, with increasing height, the individual student tends to have a higher weight.
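As an additional visual check, seaborn can overlay the least-squares fit directly on the scatter plot via the regplot function; the plotted line corresponds to the model we derive manually below (setting ci = None suppresses the confidence band):

sns.regplot(data = sample, x = "height", y = "weight", ci = None)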
In order to conduct the regression t-test, we follow the same step-wise implementation procedure for hypothesis testing as discussed in the previous sections:
Step 1: State the null hypothesis $H_{0}$ and alternative hypothesis $H_{A}$
The null hypothesis states that there is no linear relationship between the height and the weight of the individuals in the students data set:

$$H_{0}: \beta = 0\ \text{(predictor variable is not useful for making predictions)}$$

Alternative hypothesis:

$$H_{A}: \beta \ne 0\ \text{(predictor variable is useful for making predictions)}$$

Step 2: Decide on the significance level, $\alpha$
$$\alpha = 0.01$$

alpha = 0.01
Steps 3 and 4: Compute the value of the test statistic and the p-value
For illustration purposes, we manually compute the test statistic in Python first. Recall the equation for the test statistic from above:
$$t = \frac{\beta}{s_{b}} = \frac{\beta}{s_{e} / \sqrt{\sum(x - \bar{x})^{2}}}$$

where $\beta = \frac{cov(x,y)}{var(x)}$, and

$$s_{e} = \sqrt{\frac{SSE}{n - 2}}$$

where $SSE = \sum_{i = 1}^{n} \epsilon_{i}^{2} = \sum_{i = 1}^{n} (y_i - \hat y_i)^{2}$.
The test statistic follows a t-distribution with $df = n - 2$. In order to calculate $\hat y = \alpha + \beta x$, we need to know $\alpha$, defined as $\alpha = \bar y - \beta \bar{x}$.
We proceed one step at a time to keep the different computational steps clear.
weight_mean = np.mean(sample["weight"])
height_mean = np.mean(sample["height"])
# slope: beta = cov(x, y) / var(x)
lm_beta = np.cov(sample.height, sample.weight)[0, 1] / (np.std(sample.height, ddof = 1)**2)
lm_beta
0.7477660695468916
# intercept: alpha = mean(y) - beta * mean(x)
lm_alpha = weight_mean - lm_beta * height_mean
lm_alpha
-54.99202318229719
# fitted values and sum of squared errors
y_hat = lm_alpha + sample["height"] * lm_beta
SSE = np.sum((sample["weight"] - y_hat) ** 2)
SSE
115.9337734457324
# residual standard error
se = np.sqrt(SSE / (n - 2))
se
3.4049048950849183
# test statistic: t = beta / s_b
t_value = lm_beta / (se / np.sqrt(np.sum((sample["height"] - height_mean)**2)))
t_value
8.734101810239357
The numerical value of the test statistic is 8.73410181.
In order to calculate the p-value, we apply the t.cdf function provided by the scipy package to calculate the probability of occurrence for the test statistic based on the t-distribution. To do so, we also need the degrees of freedom. Recall that for the regression t-test, $df = n - 2$:
Note: Make sure the scipy package is part of your mamba environment!
from scipy.stats import t
df = n - 2
p_value_lower = t.cdf(-np.abs(t_value), df)
p_value_upper = 1 - t.cdf(np.abs(t_value), df)
p_value = p_value_lower + p_value_upper
p_value
5.414310239956427e-06
$p = 5.41431024 \times 10^{-6}$
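Since the t-distribution is symmetric around zero, the same two-sided p-value can be obtained more compactly with the survival function t.sf (equivalent to 1 - t.cdf):

p_value = 2 * t.sf(np.abs(t_value), df)

This yields the same value as the two-tail computation above.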
Step 5: If $p \le \alpha$, reject $H_{0}$; otherwise, do not reject $H_{0}$
# reject H0?
p_value < alpha
True
The p-value is smaller than the specified significance level of 0.01; we reject $H_{0}$. The test results are statistically significant at the 1 % level and provide very strong evidence against the null hypothesis.
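Equivalently, the decision can be made by comparing the absolute value of the test statistic to the critical value of the t-distribution; a quick sketch using the variables defined above:

# critical value for a two-sided test at alpha = 0.01
t_crit = t.ppf(1 - alpha / 2, df)
np.abs(t_value) > t_crit

True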
Step 6: Interpret the result of the hypothesis test
At the 1 % significance level, the data provide very strong evidence to conclude that the height variable is a useful predictor of the weight of students.
scipy

We just computed a regression t-test in Python manually. We can do the same with just one line of code by using the power of Python's package universe, namely the scipy package!
In order to conduct a regression t-test in Python via the stats module of the scipy package, we apply the linregress() function. We only have to provide our data as two separate array-like objects. Furthermore, you can specify the test method via the alternative argument. Because the default is already set to two-sided, we do not have to adapt it. Additional information regarding the function's usage can be found directly in the scipy documentation.
from scipy.stats import linregress
test_result = linregress(sample["height"], sample["weight"])
test_result
LinregressResult(slope=0.7477660695468914, intercept=-54.99202318229713, rvalue=0.9402682465821417, pvalue=5.414310239986678e-06, stderr=0.08561453550613006, intercept_stderr=15.028988955403943)
The linregress() function returns an object which provides all relevant information regarding the performed regression t-test. This includes the slope and the intercept as well as the corresponding p-value of the test result. In detail, the object consists of the following properties:

- <object>.slope represents $\beta$.
- <object>.intercept represents $\alpha$.
- <object>.rvalue stores the Pearson correlation coefficient.
- <object>.pvalue represents the p-value of the performed significance test.
- <object>.stderr represents the standard error for $\beta$.
- <object>.intercept_stderr stores the standard error for $\alpha$.

Consequently, the p-value is retrieved via:
test_result.pvalue
5.414310239986678e-06
The slope and the intercept of the linear model based on the given data are retrieved via:
test_result.slope
0.7477660695468914
test_result.intercept
-54.99202318229713
Compare the output to our results from above. They match perfectly!
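Note that test_result.stderr is exactly the denominator $s_b = s_{e} / \sqrt{\sum(x - \bar{x})^{2}}$ of our manually computed test statistic. With it, we can also compute the $100(1 - \alpha)\ \%$ confidence interval for $\beta$ from the formula given at the beginning of this section; a sketch using the objects defined above:

# standard error of the slope; matches test_result.stderr up to floating-point precision
s_b = se / np.sqrt(np.sum((sample["height"] - height_mean)**2))
# critical t-value for the 99 % confidence level and df = n - 2
t_crit = t.ppf(1 - alpha / 2, df)
# 99 % confidence interval for the slope
(test_result.slope - t_crit * test_result.stderr, test_result.slope + t_crit * test_result.stderr)

Since this interval does not contain zero, it is consistent with rejecting $H_{0}$ at the 1 % level.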
Citation
The E-Learning project SOGA-Py was developed at the Department of Earth Sciences by Annette Rudolph, Joachim Krois and Kai Hartmann. You can reach us via mail at soga[at]zedat.fu-berlin.de.
Please cite as follows: Rudolph, A., Krois, J., Hartmann, K. (2023): Statistics and Geodata Analysis using Python (SOGA-Py). Department of Earth Sciences, Freie Universitaet Berlin.