The regression t-test is applied to test whether the slope, \(\beta_1\), of the population regression line equals \(0\). Based on this test we may decide whether \(x\) is a useful (linear) predictor of \(y\).

The test statistic follows a t-distribution with \(df = n - 2\) and can be written as

\[t =\frac{\beta_1}{s_b}= \frac{\beta_1}{s_e/\sqrt{\sum(x- \bar x)^2}}\text{,} \] where \(\beta_1\) corresponds to the sample regression coefficient and \(s_e\) to the residual standard error, with \(s_e=\sqrt{\frac{SSE}{n-2}}\) and \(SSE = \sum_{i=1}^n e_i^2\).


Interval Estimation of \(\beta_1\)

The \(100(1-\alpha)\)% confidence interval for \(\beta_1\) is given by

\[\beta_1 \pm t_{\alpha/2} \times \frac{s_e}{\sqrt{\sum(x- \bar x)^2}}\text{,}\] where \(s_e\) corresponds to the residual standard error (also known as the standard error of the estimate).

The value of \(t_{\alpha/2}\) is obtained from the t-distribution for the given confidence level and \(n-2\) degrees of freedom.
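
As a minimal sketch of this computation in R (assuming numeric vectors x and y hold the predictor and response, and alpha holds the chosen significance level; these names are placeholders, made concrete in the example below):

n <- length(x)                          # sample size
beta1 <- cov(x, y) / var(x)             # sample slope
beta0 <- mean(y) - beta1 * mean(x)      # intercept
se <- sqrt(sum((y - (beta0 + beta1 * x))^2) / (n - 2))  # residual standard error
t.crit <- qt(1 - alpha / 2, df = n - 2) # t-quantile for the confidence level
beta1 + c(-1, 1) * t.crit * se / sqrt(sum((x - mean(x))^2))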


The Regression t-Test: An Example

In order to practice the regression t-test we load the students data set. You may download the students.csv file from the URL below. Import the data set and assign a proper name to it.

students <- read.csv("https://userpage.fu-berlin.de/soga/200/2010_data_sets/students.csv")

The students data set consists of 8239 rows, each representing a particular student, and 16 columns, each corresponding to a variable/feature related to that student. These self-explanatory variables are: stud.id, name, gender, age, height, weight, religion, nc.score, semester, major, minor, score1, score2, online.tutorial, graduated, salary.
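
A quick way to verify these dimensions and variable names in R (output omitted):

dim(students)    # number of rows and columns: 8239 and 16
names(students)  # the 16 variable names listed above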

In order to showcase the regression t-test we examine the relationship between two variables: the height of students as the predictor variable and the weight of students as the response variable. The question is whether the predictor variable height is useful for making predictions about the weight of students.


Data preparation

For data preparation we randomly sample 12 students from the data set and build a data frame with the two variables of interest (height and weight). Further, we plot the data in the form of a scatter plot to visualize the underlying linear relationship between the two variables.

n <- 12  # sample size

# draw a random sample of n students; without set.seed() the sample,
# and thus the numbers below, will vary from run to run
sample.idx <- sample(1:nrow(students), n)
data <- students[sample.idx, c('height', 'weight')]

plot(data$height, data$weight, xlab = 'height', ylab = 'weight')

The visual inspection confirms our assumption that the relationship between the height and the weight variable is roughly linear. In other words, with increasing height the individual student tends to have a higher weight.
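
As an optional visual aid, we may overlay the least-squares line on the active scatter plot (this anticipates the lm() fit applied later on):

# add the fitted regression line to the scatter plot
abline(lm(weight ~ height, data = data), col = 'red')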


Hypothesis testing

In order to conduct the regression t-test we follow the same step-wise implementation procedure for hypothesis testing as discussed in the previous sections.

\[ \begin{array}{ll} \hline \text{Step 1} & \text{State the null hypothesis } H_0 \text{ and alternative hypothesis } H_A \text{.}\\ \text{Step 2} & \text{Decide on the significance level, } \alpha\text{.} \\ \text{Step 3} & \text{Compute the value of the test statistic.} \\ \text{Step 4} & \text{Determine the p-value.} \\ \text{Step 5} & \text{If } p \le \alpha \text{, reject } H_0 \text{; otherwise, do not reject } H_0 \text{.} \\ \text{Step 6} & \text{Interpret the result of the hypothesis test.} \\ \hline \end{array} \]

Step 1: State the null hypothesis \(H_0\) and alternative hypothesis \(H_A\)

The null hypothesis states that the slope of the population regression line is zero; in other words, there is no linear relationship between the height and the weight of the individuals in the students data set.

\[H_0: \beta_1 = 0\text{ (predictor variable is not useful for making predictions)}\]

Alternative hypothesis \[H_A: \beta_1 \ne 0\text{ (predictor variable is useful for making predictions)}\]


Step 2: Decide on the significance level, \(\alpha\)

\[\alpha = 0.01\]

alpha <- 0.01

Steps 3 and 4: Compute the value of the test statistic and the p-value.

For illustration purposes we manually compute the test statistic in R. Recall the equation from above:

\[t =\frac{\beta_1}{s_b}= \frac{\beta_1}{s_e/\sqrt{\sum(x- \bar x)^2}}\text{,}\] where \(\beta_1 = \frac{\text{cov}(x,y)}{\text{var}(x)}\), and

\[s_e = \sqrt{\frac{SSE}{n-2}}\text{,}\]

where \(SSE = \sum_{i=1}^n e_i^2 = \sum_{i=1}^n (y_i - \hat y_i)^2\). The test statistic follows a t-distribution with \(df = n - 2\). In order to calculate \(\hat y = \beta_0+\beta_1x\) we need to know \(\beta_0\), which is calculated as \(\beta_0 = \bar y -\beta_1 \bar x\).

In order not to get confused by the different computational steps, we proceed one step at a time.

y.bar <- mean(data$weight)
x.bar <- mean(data$height)

# Linear model
beta1 <- cov(data$height, data$weight) / var(data$height)
beta1
## [1] 0.740068
beta0 <- y.bar - beta1 * x.bar
beta0
## [1] -53.77354
# Sum of squared errors SSE
y.hat <- beta0 + beta1 * data$height
SSE <- sum((data$weight - y.hat)^2)
SSE
## [1] 95.77876
# Residual standard error
se <- sqrt(SSE/(n-2))
se
## [1] 3.094814
# Compute the value of the test statistic
t <- beta1 / (se/sqrt(sum((data$height-x.bar)^2)))
t
## [1] 8.859711

The numerical value of the test statistic is 8.859711.

In order to calculate the p-value we apply the pt() function. Recall how to calculate the degrees of freedom.

\[df = n - 2 = 10\]

# Compute the p-value
df <- length(data$height) - 2

# two-sided test
p.upper <- pt(abs(t), df = df, lower.tail = FALSE)
p.lower <- pt(-abs(t), df = df, lower.tail = TRUE)
p <- p.upper + p.lower
p
## [1] 4.764806e-06
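
Since the t-distribution is symmetric around zero, the same two-sided p-value can be obtained in a single line:

# equivalent one-liner, exploiting the symmetry of the t-distribution
2 * pt(abs(t), df = df, lower.tail = FALSE)
## [1] 4.764806e-06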

Step 5: If \(p \le \alpha\), reject \(H_0\); otherwise, do not reject \(H_0\).

p <= alpha
## [1] TRUE

The p-value is less than the specified significance level of 0.01; we reject \(H_0\). The test results are statistically significant at the 1% level and provide very strong evidence against the null hypothesis.
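
Equivalently, we may base the decision on the critical value: reject \(H_0\) if \(|t|\) exceeds the \(t_{\alpha/2}\) quantile with \(n - 2\) degrees of freedom.

# equivalent decision rule based on the critical value
t.crit <- qt(1 - alpha / 2, df = df)
abs(t) > t.crit
## [1] TRUE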


Step 6: Interpret the result of the hypothesis test.

\(p = 4.764806\times 10^{-6}\). At the 1% significance level, the data provides very strong evidence to conclude that the height variable is a useful (linear) predictor of the weight of students.


Hypothesis testing in R

We just computed the regression t-test in R manually. That is fine, but we can achieve the same with just a few lines of code!

To do so, we apply the lm() function, with weight as the response variable and height as the predictor variable.

lin.model <- lm(weight ~ height, data=data)

In order to inspect the fitted model object we apply the summary() function.

summary(lin.model)
## 
## Call:
## lm(formula = weight ~ height, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.8185 -1.0788  0.6116  1.8912  3.2024 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -53.77354   14.61756  -3.679  0.00426 ** 
## height        0.74007    0.08353   8.860 4.76e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.095 on 10 degrees of freedom
## Multiple R-squared:  0.887,  Adjusted R-squared:  0.8757 
## F-statistic: 78.49 on 1 and 10 DF,  p-value: 4.765e-06

The summary() function returns a number of model properties, which are discussed in detail in the linear regression section. A nice overview of the output of a simple linear model in R is provided in a blog entry by Felipe Rego.

For the purpose of comparison with the manually calculated values we extract the intercept \(\beta_0\) and the slope \(\beta_1\) by applying the extractor function coef() to the lm object.

coef(lin.model)
## (Intercept)      height 
##  -53.773543    0.740068

Further, we extract the sum of squared errors \((SSE)\) by squaring and summing the output of the residuals() function, and the residual standard error \((s_e)\) by applying the sigma() function to the lm object.

sum(residuals(lin.model)^2)
## [1] 95.77876
sigma(lin.model)
## [1] 3.094814

Finally, we directly access the t-test statistic and the p-value by indexing the coefficient matrix returned by summary().

coef(summary(lin.model))[,'t value'][2]
##   height 
## 8.859711
coef(summary(lin.model))[,'Pr(>|t|)'][2]
##       height 
## 4.764806e-06
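
To connect back to the interval estimation section above, R can also compute the confidence interval for \(\beta_1\) directly; a minimal sketch (the resulting limits should agree with the formula given earlier):

# 99% confidence interval for the coefficients (matching alpha = 0.01)
confint(lin.model, level = 0.99)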

Compare the output with our results from above. They match perfectly!