The regression t-test is applied to test if the slope, \(\beta\), of the population regression line equals \(0\). Based on that test we may decide whether \(x\) is a useful (linear) predictor of \(y\).
The test statistic follows a t-distribution with \(df = n - 2\) and can be written as
\[t =\frac{\beta}{s_b}= \frac{\beta}{s_e/\sqrt{\sum(x- \bar x)^2}}\text{,} \] where \(\beta\) in the formula is the slope estimated from the sample (the sample regression coefficient), \(s_b\) is its standard error, and \(s_e\) is the residual standard error (\(s_e=\sqrt{\frac{SSE}{n-2}}\) with \(SSE = \sum_{i=1}^n \epsilon_i^2\), the sum of the squared residuals).
The \(100(1-\alpha)\) % confidence interval for \(\beta\) is given by
\[\beta \pm t_{\alpha/2} \times \frac{s_e}{\sqrt{\sum(x- \bar x)^2}}\text{,}\] where \(s_e\) corresponds to the residual standard error (also known as the standard error of the estimate).
The value of \(t_{\alpha/2}\) is obtained from the t-distribution for the given confidence level and \(n-2\) degrees of freedom.
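The interval can be computed in a few lines of R. The following is a minimal sketch, assuming two small made-up vectors `x` and `y` (hypothetical values, not taken from the students data set) and a confidence level of 95 %. `qt()` supplies the critical value \(t_{\alpha/2}\), and `confint()` serves as a cross-check.

```r
# hypothetical illustration data (not from the students data set)
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2)
n <- length(x)
conf_level <- 0.95  # assumed confidence level, i.e. alpha = 0.05

# slope, intercept and residual standard error
b <- cov(x, y) / var(x)
a <- mean(y) - b * mean(x)
s_e <- sqrt(sum((y - (a + b * x))^2) / (n - 2))

# standard error of the slope and critical value t_{alpha/2}
s_b <- s_e / sqrt(sum((x - mean(x))^2))
t_crit <- qt(1 - (1 - conf_level) / 2, df = n - 2)

# confidence interval for the slope
ci_manual <- c(lower = b - t_crit * s_b, upper = b + t_crit * s_b)
ci_manual

# cross-check with the built-in extractor
ci_lm <- confint(lm(y ~ x), parm = "x", level = conf_level)
ci_lm
```

Both results should agree up to floating-point precision, since `confint()` applies the same formula to the fitted model.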
To practice the regression t-test we load the students data set. You may download the file students.csv or import the data set directly into R:
students <- read.csv("https://userpage.fu-berlin.de/soga/data/raw-data/students.csv")
The students data set consists of 8239 rows, each of them representing a particular student, and 16 columns, each of them corresponding to a variable/feature related to that particular student. These self-explanatory variables are: stud.id, name, gender, age, height, weight, religion, nc.score, semester, major, minor, score1, score2, online.tutorial, graduated, salary.
To showcase the regression t-test we examine the relationship between two variables: the height of students as the predictor variable and the weight of students as the response variable. The question is whether the predictor variable height is useful for making predictions of the weight of students.
For data preparation we randomly sample 12 students from the data set and build a data frame with the two variables of interest (height and weight). Further, we plot the data as a scatter plot to visualize the relationship between the two variables.
n <- 12
sample_idx <- sample(1:nrow(students), n)
data <- students[sample_idx, c("height", "weight")]
plot(data$height, data$weight)
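Because `sample()` draws rows at random, every run yields a different sample, so the numbers reported in the following steps will not be reproduced exactly. A minimal sketch of a repeatable draw, assuming an arbitrary seed (`123` is a hypothetical choice, not the seed behind the printed results) and a small stand-in data frame `students_demo` in place of the downloaded data:

```r
# hypothetical stand-in for the students data frame
students_demo <- data.frame(
  height = c(160, 165, 170, 172, 175, 178, 180, 183, 185, 190),
  weight = c(55, 60, 66, 67, 70, 73, 75, 78, 80, 86)
)

n_demo <- 5

# fixing the seed makes the random draw repeatable
set.seed(123)  # arbitrary seed, an assumption for this sketch
idx_first <- sample(seq_len(nrow(students_demo)), n_demo)

set.seed(123)
idx_second <- sample(seq_len(nrow(students_demo)), n_demo)

# both draws select the same rows
identical(idx_first, idx_second)

data_demo <- students_demo[idx_first, c("height", "weight")]
```

The same pattern, a `set.seed()` call placed directly before `sample()`, could be applied to the full students data frame.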
The visual inspection supports our assumption that the relationship between the height and the weight variable is roughly linear. In other words, with increasing height the individual student tends to have a higher weight.
To conduct the regression t-test we follow the same step-wise procedure for hypothesis testing as discussed in the previous sections:
\[ \begin{array}{ll} \hline \text{Step 1} & \text{State the null hypothesis } H_0 \text{ and alternative hypothesis } H_A \text{.}\\ \text{Step 2} & \text{Decide on the significance level, } \alpha\text{.} \\ \text{Step 3} & \text{Compute the value of the test statistic.} \\ \text{Step 4} &\text{Determine the p-value.} \\ \text{Step 5} & \text{If } p \le \alpha \text{, reject }H_0 \text{; otherwise, do not reject } H_0 \text{.} \\ \text{Step 6} &\text{Interpret the result of the hypothesis test.} \\ \hline \end{array} \]
Step 1: State the null hypothesis \(H_0\) and alternative hypothesis \(H_A\)
The null hypothesis states that there is no linear relationship between the height and the weight of the individuals in the students data set:
\[H_0: \beta = 0\text{ (predictor variable is not useful for making predictions)}\]
Alternative hypothesis:
\[H_A: \beta \ne 0\text{ (predictor variable is useful for making predictions)}\]
Step 2: Decide on the significance level, \(\alpha\)
\[\alpha = 0.01\]
alpha <- 0.01
Steps 3 and 4: Compute the value of the test statistic and the p-value
For illustration purposes we manually compute the test statistic in R. Recall the equation from above:
\[t =\frac{\beta}{s_b}= \frac{\beta}{s_e/\sqrt{\sum(x- \bar x)^2}}\text{,}\] where \(\beta = \frac{cov(x,y)}{var(x)}\), and
\[s_e = \sqrt{\frac{SSE}{n-2}}\text{,}\]
where \(SSE = \sum_{i=1}^n \epsilon_i^2 = \sum_{i=1}^n (y_i - \hat y_i)^2\). The test statistic follows a t-distribution with \(df = n - 2\). In order to calculate \(\hat y = \alpha + \beta x\) we need to know \(\alpha\), which is defined as \(\alpha = \bar y -\beta \bar x\). Note that \(\alpha\) denotes the intercept of the regression line here, not the significance level chosen in Step 2.
To keep the computation easy to follow, we proceed one step at a time.
y_bar <- mean(data$weight)
x_bar <- mean(data$height)
# linear model
lm_beta <- cov(data$height, data$weight) / var(data$height)
lm_beta
## [1] 0.605819
lm_alpha <- y_bar - lm_beta * x_bar
lm_alpha
## [1] -29.9842
# sum of squares errors SSE
y_hat <- lm_alpha + lm_beta * data$height
SSE <- sum((data$weight - y_hat)^2)
SSE
## [1] 55.07382
# residual standard error
se <- sqrt(SSE / (n - 2))
se
## [1] 2.346781
# test statistic
t <- lm_beta / (se / sqrt(sum((data$height - x_bar)^2)))
t
## [1] 11.12139
The numerical value of the test statistic is 11.1213944.
In order to calculate the p-value we apply the pt() function. Recall how to calculate the degrees of freedom:
\[df = n - 2= 10\]
# compute the p-value
df <- length(data$height) - 2
# two-sided test
p_upper <- pt(abs(t), df = df, lower.tail = FALSE)
p_lower <- pt(-abs(t), df = df, lower.tail = TRUE)
p <- p_upper + p_lower
p
## [1] 5.952194e-07
\(p = 5.9521935\times 10^{-7}\).
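Because the t-distribution is symmetric around zero, the two tail areas are equal, so the two-sided p-value can also be written as twice one tail. A short sketch, assuming the test statistic and degrees of freedom reported above are typed in by hand (`t_stat`, `df_res`, and `t_crit` are names introduced here for illustration):

```r
t_stat <- 11.12139  # test statistic as printed above
df_res <- 10        # n - 2 with n = 12

# sum of both tails, as in the step-by-step calculation
p_sum <- pt(abs(t_stat), df = df_res, lower.tail = FALSE) +
  pt(-abs(t_stat), df = df_res, lower.tail = TRUE)

# symmetric shortcut: twice the upper tail
p_short <- 2 * pt(abs(t_stat), df = df_res, lower.tail = FALSE)

all.equal(p_sum, p_short)

# equivalent decision rule: compare |t| with the critical value
alpha <- 0.01
t_crit <- qt(1 - alpha / 2, df = df_res)
abs(t_stat) > t_crit
```

The last comparison is the critical-value form of the same decision: \(|t|\) exceeds \(t_{\alpha/2}\) exactly when \(p < \alpha\).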
Step 5: If \(p \le \alpha\), reject \(H_0\); otherwise, do not reject \(H_0\)
p <= alpha
## [1] TRUE
The p-value is smaller than the specified significance level of 0.01; we reject \(H_0\). The test results are statistically significant at the 1 % level and provide very strong evidence against the null hypothesis.
Step 6: Interpret the result of the hypothesis test
At the 1 % significance level the data provide very strong evidence that the slope of the population regression line is not zero, that is, that the height variable is a useful linear predictor for the weight of students.
We just computed a regression t-test in R manually. That is fine, but we can do the same in R in just a few lines of code!
To do so, we apply the lm() function to our response variable weight and our predictor variable height:
lin_model <- lm(weight ~ height, data = data)
To inspect the fitted lm object we apply the summary() function:
summary(lin_model)
##
## Call:
## lm(formula = weight ~ height, data = data)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -3.3283 -0.9208 -0.4376  1.2132  4.6902
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -29.98420    9.39387  -3.192  0.00962 **
## height        0.60582    0.05447  11.121 5.95e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.347 on 10 degrees of freedom
## Multiple R-squared: 0.9252, Adjusted R-squared: 0.9177
## F-statistic: 123.7 on 1 and 10 DF, p-value: 5.952e-07
The summary() function returns a number of model properties, which are discussed in detail in the linear regression section. A nice overview of the output of a simple linear model in R is provided in a blog entry by Felipe Rego.
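In a simple linear regression with a single predictor, the F-statistic in the last line of the summary is the square of the slope's t-value, and both carry the same p-value (here \(11.121^2 \approx 123.7\)). A brief sketch of that relationship, assuming a small made-up data frame `demo` (hypothetical values, not the sampled students):

```r
# hypothetical illustration data
demo <- data.frame(
  height = c(158, 163, 167, 170, 174, 177, 181, 186),
  weight = c(52.0, 58.5, 61.0, 66.5, 68.0, 74.5, 76.0, 83.5)
)

fit <- lm(weight ~ height, data = demo)
fit_summary <- summary(fit)

# t-value and p-value of the slope from the coefficient table
t_slope <- coef(fit_summary)["height", "t value"]
p_slope <- coef(fit_summary)["height", "Pr(>|t|)"]

# F-statistic with its numerator and denominator degrees of freedom
f_stat <- fit_summary$fstatistic
p_f <- pf(f_stat["value"], f_stat["numdf"], f_stat["dendf"],
          lower.tail = FALSE)

# F equals t squared, and the p-values coincide
all.equal(unname(f_stat["value"]), t_slope^2)
all.equal(unname(p_f), p_slope)
```

This identity holds only for models with one predictor; with several predictors the F-test assesses all slopes jointly.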
For comparison with the manually calculated values we extract \(\alpha\) and \(\beta\) by applying the extractor function coef() to the lm object:
coef(lin_model)
## (Intercept)      height
##  -29.984195    0.605819
Further, we extract the sum of squared errors \((SSE)\) by applying the residuals() function and the residual standard error \((s_e)\) by applying the sigma() function to the lm object.
# sum of squares errors
sum(residuals(lin_model)^2)
## [1] 55.07382
# residual standard error
sigma(lin_model)
## [1] 2.346781
Finally, we directly access the t-test statistic and the p-value by indexing the coefficient table of the summary of the lm object:
coef(summary(lin_model))[, "t value"][2]
## height
## 11.12139
coef(summary(lin_model))[, "Pr(>|t|)"][2]
## height
## 5.952194e-07
Compare the output to our results from above. They match perfectly!
Citation
The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us by e-mail at soga[at]zedat.fu-berlin.de.
Please cite as follows: Hartmann, K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.