Inferences About the Slope: The Regression t-Test

The regression t-test is applied to test if the slope, \(\beta\), of the population regression line equals \(0\). Based on that test we may decide whether \(x\) is a useful (linear) predictor of \(y\).

The test statistic follows a t-distribution with \(df = n - 2\) and can be written as

\[t =\frac{\beta}{s_b}= \frac{\beta}{s_e/\sqrt{\sum(x- \bar x)^2}}\text{,} \] where \(\beta\) in the formula denotes the sample regression coefficient (the estimate of the population slope), \(s_b\) is the standard error of that estimate, and \(s_e\) is the residual standard error, \(s_e=\sqrt{\frac{SSE}{n-2}}\) with \(SSE = \sum_{i=1}^n \epsilon_i^2\).

Interval Estimation of \(\beta\)

The \(100(1-\alpha)\) % confidence interval for \(\beta\) is given by

\[\beta \pm t_{\alpha/2} \times \frac{s_e}{\sqrt{\sum(x- \bar x)^2}}\text{,}\] where \(s_e\) corresponds to the residual standard error (also known as the standard error of the estimate).

The value of \(t\) is obtained from the t-distribution for the given confidence level and \(n-2\) degrees of freedom.
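The interval formula above can be sketched in a few lines of R. This is a minimal illustration under stated assumptions, not part of the original example: it uses R's built-in cars data set (stopping distance versus speed) as a stand-in, and the names conf_level, t_crit and sb are chosen here for illustration. The manual interval is cross-checked against confint().

```r
# Built-in 'cars' data, used only as a stand-in for illustration
x <- cars$speed
y <- cars$dist
n <- length(x)

b  <- cov(x, y) / var(x)                        # sample slope
a  <- mean(y) - b * mean(x)                     # sample intercept
se <- sqrt(sum((y - (a + b * x))^2) / (n - 2))  # residual standard error
sb <- se / sqrt(sum((x - mean(x))^2))           # standard error of the slope

conf_level <- 0.95                              # assumed confidence level
t_crit <- qt(1 - (1 - conf_level) / 2, df = n - 2)

# slope estimate plus/minus critical value times standard error
ci_manual <- c(b - t_crit * sb, b + t_crit * sb)
ci_manual

# cross-check against the interval computed by confint()
ci_builtin <- confint(lm(dist ~ speed, data = cars), "speed", level = conf_level)
ci_builtin
```

If the interval does not contain 0, the two-sided regression t-test at the matching significance level rejects the null hypothesis of a zero slope.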

The Regression t-Test: An Example

In order to practice the regression t-test we load the students data set. You may download the students.csv file here or import the data set directly into R:

students <- read.csv("https://userpage.fu-berlin.de/soga/data/raw-data/students.csv")

The students data set consists of 8239 rows, each of them representing a particular student, and 16 columns, each of them corresponding to a variable/feature related to that particular student. These self-explaining variables are: stud.id, name, gender, age, height, weight, religion, nc.score, semester, major, minor, score1, score2, online.tutorial, graduated, salary.

In order to showcase the regression t-test we examine the relationship between two variables: the height of students as the predictor variable and the weight of students as the response variable. The question is whether the predictor variable height is useful for making predictions of the weight of students.

Data preparation

For data preparation we randomly sample 12 students from the data set and build a data frame with the two variables of interest (height and weight). Further, we plot the data in the form of a scatter plot to visualize the relationship between the two variables.

n <- 12

sample_idx <- sample(1:nrow(students), n)
data <- students[sample_idx, c("height", "weight")]

plot(data$height, data$weight)
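Because sample() draws at random, the numerical results reported in this section will differ from run to run unless a seed is fixed. The following is a minimal sketch of how a seed makes the draw repeatable. The seed value 42 and the toy data frame are assumptions made for illustration; they are not the settings used to produce the output in this section.

```r
# Toy stand-in for the students data (hypothetical values, not the real data set)
toy <- data.frame(height = 150:199, weight = 50:99)

set.seed(42)                       # arbitrary seed, chosen for illustration
idx_a <- sample(1:nrow(toy), 12)

set.seed(42)                       # resetting the same seed ...
idx_b <- sample(1:nrow(toy), 12)   # ... reproduces the same draw

identical(idx_a, idx_b)
```

Calling set.seed() once before the sampling step is enough to make a whole analysis repeatable.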

The visual inspection supports our assumption that the relationship between the height and the weight variable is roughly linear. In other words, with increasing height the individual student tends to have a higher weight.

Hypothesis testing

In order to conduct the regression t-test we follow the same step-wise procedure for hypothesis testing as discussed in the previous sections:

\[ \begin{array}{ll} \hline \text{Step 1} & \text{State the null hypothesis } H_0 \text{ and alternative hypothesis } H_A \text{.}\\ \text{Step 2} & \text{Decide on the significance level, } \alpha\text{.} \\ \text{Step 3} & \text{Compute the value of the test statistic.} \\ \text{Step 4} &\text{Determine the p-value.} \\ \text{Step 5} & \text{If } p \le \alpha \text{, reject }H_0 \text{; otherwise, do not reject } H_0 \text{.} \\ \text{Step 6} &\text{Interpret the result of the hypothesis test.} \\ \hline \end{array} \]

Step 1: State the null hypothesis \(H_0\) and alternative hypothesis \(H_A\)

The null hypothesis states that there is no linear relationship between the height and the weight of the individuals in the students data set:

\[H_0: \beta = 0\text{ (predictor variable is not useful for making predictions)}\]

Alternative hypothesis:

\[H_A: \beta \ne 0\text{ (predictor variable is useful for making predictions)}\]

Step 2: Decide on the significance level, \(\alpha\)

\[\alpha = 0.01\]

alpha <- 0.01

Step 3 and 4: Compute the value of the test statistic and the p-value

For illustration purposes we manually compute the test statistic in R. Recall the equation from above:

\[t =\frac{\beta}{s_b}= \frac{\beta}{s_e/\sqrt{\sum(x- \bar x)^2}}\text{,}\] where \(\beta = \frac{cov(x,y)}{var(x)}\), and

\[s_e = \sqrt{\frac{SSE}{n-2}}\text{,}\]

where \(SSE = \sum_{i=1}^n \epsilon_i^2 = \sum_{i=1}^n (y_i - \hat y_i)^2\). The test statistic follows a t-distribution with \(df = n - 2\). In order to calculate \(\hat y = \alpha + \beta x\) we need to know \(\alpha\), which is defined as \(\alpha = \bar y -\beta \bar x\). Note that \(\alpha\) here denotes the intercept of the regression line, not the significance level chosen in Step 2.

To keep the computation easy to follow, we proceed one step at a time.

Build the linear model by calculating the intercept \((\alpha)\) and the regression coefficient \((\beta)\):

y_bar <- mean(data$weight)
x_bar <- mean(data$height)

# linear model
lm_beta <- cov(data$height, data$weight) / var(data$height)
lm_beta

## [1] 0.605819

lm_alpha <- y_bar - lm_beta * x_bar
lm_alpha

## [1] -29.9842

Calculate the sum of squared errors \((SSE)\) and the residual standard error \((s_e)\):

# sum of squared errors (SSE)
y_hat <- lm_alpha + lm_beta * data$height
SSE <- sum((data$weight - y_hat)^2)
SSE

## [1] 55.07382

# residual standard error
se <- sqrt(SSE / (n - 2))
se

## [1] 2.346781

Compute the value of the test statistic:

t <- lm_beta / (se / sqrt(sum((data$height - x_bar)^2)))
t

## [1] 11.12139

The numerical value of the test statistic is 11.1213944.

In order to calculate the p-value we apply the pt() function. Recall how the degrees of freedom are calculated:

\[df = n - 2= 10\]

# compute the p-value
df <- length(data$height) - 2

# two-sided test
p_upper <- pt(abs(t), df = df, lower.tail = FALSE)
p_lower <- pt(-abs(t), df = df, lower.tail = TRUE)
p <- p_upper + p_lower
p

## [1] 5.952194e-07

\(p = 5.9521935\times 10^{-7}\).
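Because the t-distribution is symmetric around zero, the two tail probabilities added above are equal, so the two-sided p-value can also be obtained by doubling one tail. A minimal sketch, assuming the test statistic and degrees of freedom reported above; the variable names t_stat, p_two_tails and p_doubled are chosen here for illustration:

```r
t_stat <- 11.1213944   # test statistic reported above
df <- 10               # n - 2 with n = 12

# sum of both tails, as in the computation above
p_two_tails <- pt(abs(t_stat), df = df, lower.tail = FALSE) +
  pt(-abs(t_stat), df = df, lower.tail = TRUE)

# symmetry shortcut: double the lower tail
p_doubled <- 2 * pt(-abs(t_stat), df = df)

all.equal(p_two_tails, p_doubled)
```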

Step 5: If \(p \le \alpha\), reject \(H_0\); otherwise, do not reject \(H_0\)

p <= alpha

## [1] TRUE

The p-value is smaller than the specified significance level of 0.01; we reject \(H_0\). The test results are statistically significant at the 1 % level and provide very strong evidence against the null hypothesis.
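The same decision can be reached with the critical-value approach: reject \(H_0\) if the absolute value of the test statistic exceeds the upper \(\alpha/2\) quantile of the t-distribution with \(n-2\) degrees of freedom. The following is a minimal sketch of that alternative, using the values reported above; the names t_stat and t_crit are chosen here for illustration.

```r
alpha <- 0.01          # significance level from Step 2
df <- 10               # n - 2 with n = 12
t_stat <- 11.1213944   # test statistic reported above

# upper critical value of the two-sided test
t_crit <- qt(1 - alpha / 2, df = df)
t_crit

# TRUE means the test statistic falls in the rejection region
abs(t_stat) > t_crit
```

With 10 degrees of freedom the critical value is roughly 3.17, far below the observed test statistic, which agrees with the p-value approach.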

Step 6: Interpret the result of the hypothesis test

At the 1 % significance level the data provide very strong evidence that the slope of the population regression line differs from zero, that is, that the height variable is a useful linear predictor of the weight of students.

Hypothesis testing in R

We just computed a regression t-test in R manually. That is fine, but we can do the same in R in just a few lines of code!

To do so, we apply the lm() function to our response variable weight and our predictor variable height:

lin_model <- lm(weight ~ height, data = data)

In order to inspect the fitted lm object we apply the summary() function:

summary(lin_model)

## 
## Call:
## lm(formula = weight ~ height, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.3283 -0.9208 -0.4376  1.2132  4.6902 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -29.98420    9.39387  -3.192  0.00962 ** 
## height        0.60582    0.05447  11.121 5.95e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.347 on 10 degrees of freedom
## Multiple R-squared:  0.9252, Adjusted R-squared:  0.9177 
## F-statistic: 123.7 on 1 and 10 DF,  p-value: 5.952e-07

The summary() function returns a number of model properties, which are discussed in detail in the linear regression section. A nice overview of the output of a simple linear model in R is provided in a blog entry by Felipe Rego.

For the purpose of comparison to the manually calculated values we extract \(\alpha\) and \(\beta\) by applying the extractor function coef() on the lm object:

coef(lin_model)

## (Intercept)      height 
##  -29.984195    0.605819

Further, we extract the sum of squared errors \((SSE)\) by applying the residuals() function and the residual standard error \((s_e)\) by applying the sigma() function on the lm object.

# sum of squared errors
sum(residuals(lin_model)^2)

## [1] 55.07382

# residual standard error
sigma(lin_model)

## [1] 2.346781

Finally, we directly access the t-test statistic and the p-value by indexing the coefficients of the lm object:

coef(summary(lin_model))[, "t value"][2]

##   height 
## 11.12139

coef(summary(lin_model))[, "Pr(>|t|)"][2]

##       height 
## 5.952194e-07

Compare the output to our results from above. They match perfectly!
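The summary output above also reports an F-statistic with the same p-value as the slope. In a simple linear regression with a single predictor, the F-statistic equals the square of the slope's t-statistic (here \(11.121^2 \approx 123.7\)). The following minimal sketch checks this relationship. It uses R's built-in cars data set as a stand-in, which is an assumption made for illustration and not part of the original example.

```r
# Built-in 'cars' data, used only as a stand-in for illustration
fit <- lm(dist ~ speed, data = cars)
s <- summary(fit)

# t-statistic of the slope and overall F-statistic of the model
t_slope <- coef(s)["speed", "t value"]
f_stat <- unname(s$fstatistic["value"])

all.equal(t_slope^2, f_stat)
```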

Citation

The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us by e-mail at soga[at]zedat.fu-berlin.de.

You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follows: Hartmann, K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.