The **regression t-test** is applied to test whether the slope, \(\beta_1\), of the population regression line equals \(0\). Based on that test we may decide whether \(x\) is a useful (linear) predictor of \(y\).

The test statistic follows a t-distribution with \(df = n - 2\) and can be written as

\[t =\frac{\beta_1}{s_b}= \frac{\beta_1}{s_e/\sqrt{\sum(x- \bar x)^2}}\text{,} \] where \(\beta_1\) corresponds to the sample regression coefficient and \(s_e\) to the **residual standard error**, with \(s_e=\sqrt{\frac{SSE}{n-2}}\) and \(SSE = \sum_{i=1}^n e_i^2\).

The \(100(1-\alpha)\)% confidence interval for \(\beta_1\) is given by

\[\beta_1 \pm t_{\alpha/2} \times \frac{s_e}{\sqrt{\sum(x- \bar x)^2}}\text{,}\] where \(s_e\) corresponds to the **residual standard error** (also known as the **standard error of the estimate**).

The value of \(t_{\alpha/2}\) is obtained from the *t*-distribution for the given confidence level and \(n-2\) degrees of freedom.
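As a sketch of how this confidence interval can be computed in R, consider the following self-contained example. The data set here is made up purely for illustration; the manual interval is compared against `confint()`, base R's extractor for coefficient confidence intervals.

```r
# illustrative, made-up data for this sketch
x <- c(160, 165, 170, 172, 175, 178, 180, 183, 185, 190)
y <- c(55, 58, 63, 64, 68, 70, 72, 75, 77, 82)
n <- length(x)

fit <- lm(y ~ x)
beta1 <- coef(fit)[2]                  # sample slope
se <- sigma(fit)                       # residual standard error
sb <- se / sqrt(sum((x - mean(x))^2))  # standard error of the slope

alpha <- 0.01
t.crit <- qt(1 - alpha/2, df = n - 2)
ci.manual <- beta1 + c(-1, 1) * t.crit * sb
ci.manual
confint(fit, "x", level = 0.99)  # agrees with ci.manual
```

Both approaches return the same 99% interval, since `confint()` applies exactly the formula above.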

In order to practice the **regression t-test** we load the `students.csv` file. Import the data set and assign a proper name to it:

`students <- read.csv("https://userpage.fu-berlin.de/soga/200/2010_data_sets/students.csv")`

The *students* data set consists of 8239 rows, each of them representing a particular student, and 16 columns, each of them corresponding to a variable/feature related to that particular student. These self-explanatory variables are: *stud.id, name, gender, age, height, weight, religion, nc.score, semester, major, minor, score1, score2, online.tutorial, graduated, salary*.

In order to showcase the regression *t*-test we examine the relationship between two variables: the height of students as the predictor variable and the weight of students as the response variable. **The question is whether the predictor variable height is useful for making predictions about the weight of students.**

For data preparation we randomly sample 12 students from the data set and build a data frame with the two variables of interest (`height` and `weight`). Further, we plot the data in the form of a scatter plot to visualize the underlying linear relationship between the two variables. Note that since no random seed is set, the numerical results below depend on the particular sample drawn.

```
n <- 12
# draw a random sample of n row indices
sample.idx <- sample(1:nrow(students), n)
data <- students[sample.idx, c('height', 'weight')]
# scatter plot of the two variables of interest
plot(data$height, data$weight)
```

The visual inspection confirms our assumption that the relationship between the height and the weight variable is roughly linear. In other words, with increasing height the individual student tends to have a higher weight.

In order to conduct the **regression t-test** we follow the step-wise implementation procedure for hypothesis testing:

\[ \begin{array}{ll} \hline \text{Step 1} & \text{State the null hypothesis } H_0 \text{ and alternative hypothesis } H_A \text{.}\\ \text{Step 2} & \text{Decide on the significance level, } \alpha\text{.} \\ \text{Step 3} & \text{Compute the value of the test statistic.} \\ \text{Step 4} & \text{Determine the p-value.} \\ \text{Step 5} & \text{If } p \le \alpha \text{, reject } H_0 \text{; otherwise, do not reject } H_0 \text{.} \\ \text{Step 6} & \text{Interpret the result of the hypothesis test.} \\ \hline \end{array} \]

**Step 1: State the null hypothesis \(H_0\) and alternative hypothesis \(H_A\)**

The null hypothesis states that there is no linear relationship between the height and the weight of the individuals in the *students* data set.

**Null hypothesis** \[H_0: \beta_1 = 0\text{ (predictor variable is not useful for making predictions)}\]

**Alternative hypothesis** \[H_A: \beta_1 \ne 0\text{ (predictor variable is useful for making predictions)}\]

**Step 2: Decide on the significance level, \(\alpha\)**

\[\alpha = 0.01\]

`alpha <- 0.01`

**Step 3 and 4: Compute the value of the test statistic and the p-value.**

For illustration purposes we manually compute the test statistic in R. Recall the equation from above:

\[t =\frac{\beta_1}{s_b}= \frac{\beta_1}{s_e/\sqrt{\sum(x- \bar x)^2}}\] where \(\beta_1 = \frac{cov(x,y)}{var(x)}\), and

\[s_e = \sqrt{\frac{SSE}{n-2}}\text{,}\]

where \(SSE = \sum_{i=1}^n e_i^2 = \sum_{i=1}^n (y_i - \hat y_i)^2\). The test statistic follows a t-distribution with \(df = n - 2\). In order to calculate \(\hat y = \beta_0+\beta_1x\) we need to know \(\beta_0\), which is calculated as \(\beta_0 = \bar y -\beta_1 \bar x\).

To keep the different computational steps from becoming confusing, we proceed one step at a time.

- Build the linear model by calculating the intercept \((\beta_0)\) and the regression coefficient \((\beta_1)\).

```
y.bar <- mean(data$weight)
x.bar <- mean(data$height)
# Linear model
beta1 <- cov(data$height, data$weight) / var(data$height)
beta1
```

`## [1] 0.740068`

```
beta0 <- y.bar - beta1 * x.bar
beta0
```

`## [1] -53.77354`

- Calculate the sum of squared errors \((SSE)\) and the residual standard error \((s_e)\).

```
# Sum of squared errors SSE
y.hat <- beta0 + beta1 * data$height
SSE <- sum((data$weight - y.hat)^2)
SSE
```

`## [1] 95.77876`

```
# Residual standard error
se <- sqrt(SSE/(n-2))
se
```

`## [1] 3.094814`

- Compute the value of the test statistic.

```
# Compute the value of the test statistic
t <- beta1 / (se/sqrt(sum((data$height-x.bar)^2)))
t
```

`## [1] 8.859711`

The numerical value of the test statistic is approximately 8.8597.

In order to calculate the *p*-value we apply the `pt()` function. Recall how to calculate the degrees of freedom:

\[df = n - 2= 10\]

```
# Compute the p-value
df <- length(data$height) - 2
# two-sided test
p.upper <- pt(abs(t), df = df, lower.tail = FALSE)
p.lower <- pt(-abs(t), df = df, lower.tail = TRUE)
p <- p.upper + p.lower
p
```

`## [1] 4.764806e-06`

**Step 5: If \(p \le \alpha\), reject \(H_0\); otherwise, do not reject \(H_0\).**

`p <= alpha`

`## [1] TRUE`

The *p*-value is less than the specified significance level of 0.01; we reject \(H_0\). The test results are statistically significant at the 1% level and provide very strong evidence against the null hypothesis.

**Step 6: Interpret the result of the hypothesis test.**

\(p = 4.764806\times 10^{-6}\). At the 1% significance level, the data provides very strong evidence to conclude that the height variable is a useful predictor of the weight of students.

We just computed a **regression t-test** in R manually. That is fine, but we can do the same in R with just a few lines of code!

Therefore, we apply the `lm()` function to our response variable `weight` and our predictor variable `height`.

`lin.model <- lm(weight ~ height, data=data)`

In order to access the `lm` object we apply the `summary()` function.

`summary(lin.model)`

```
## 
## Call:
## lm(formula = weight ~ height, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.8185 -1.0788  0.6116  1.8912  3.2024 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -53.77354   14.61756  -3.679  0.00426 ** 
## height        0.74007    0.08353   8.860 4.76e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.095 on 10 degrees of freedom
## Multiple R-squared:  0.887,  Adjusted R-squared:  0.8757 
## F-statistic: 78.49 on 1 and 10 DF,  p-value: 4.765e-06
```

The `summary()` function returns a number of model properties, which are discussed in detail in the *linear regression* section. A nice overview of the output of a simple linear model in R is provided in a blog entry by Felipe Rego.
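As a side note, in a simple linear regression with a single predictor, the overall *F*-statistic reported by `summary()` equals the squared *t*-statistic of the slope. We can verify this numerically with the rounded values from the output above:

```r
# For simple linear regression: F-statistic = (t value of the slope)^2
t.value <- 8.859711   # t value of 'height' from the summary output
F.value <- 78.49      # F-statistic from the summary output
t.value^2             # approximately 78.49
```

This is also why the *p*-value of the slope's *t*-test coincides with the *p*-value of the overall *F*-test (4.765e-06) in this model.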

For the purpose of comparison to the manually calculated values we extract the intercept \(\beta_0\) and the slope \(\beta_1\) by applying the extractor function `coef()` to the `lm` object.

`coef(lin.model)`

```
## (Intercept) height
## -53.773543 0.740068
```

Further, we compute the sum of squared errors \((SSE)\) from the output of the `residuals()` function and extract the residual standard error \((s_e)\) by applying the `sigma()` function to the `lm` object.

`sum(residuals(lin.model)^2)`

`## [1] 95.77876`

`sigma(lin.model)`

`## [1] 3.094814`

Finally, we directly access the *t*-test statistic and the *p*-value by indexing the coefficient table returned by `coef(summary(lin.model))`:

`coef(summary(lin.model))[,'t value'][2]`

```
## height
## 8.859711
```

`coef(summary(lin.model))[,'Pr(>|t|)'][2]`

```
## height
## 4.764806e-06
```

Compare the output with our results from above. They match perfectly!
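The whole manual-versus-`lm()` comparison can be condensed into one short, self-contained check. Since the random sample above varies from run to run, the data set here is made up purely for illustration; the structure of the computation is exactly the one used in this section.

```r
# made-up illustrative data
x <- c(160, 165, 170, 172, 175, 178, 180, 183, 185, 190, 192, 195)
y <- c(55, 57, 62, 65, 67, 71, 72, 76, 77, 81, 84, 86)
n <- length(x)

# manual computation, as in the step-wise procedure above
beta1 <- cov(x, y) / var(x)
beta0 <- mean(y) - beta1 * mean(x)
se <- sqrt(sum((y - (beta0 + beta1 * x))^2) / (n - 2))
t.manual <- beta1 / (se / sqrt(sum((x - mean(x))^2)))
p.manual <- 2 * pt(abs(t.manual), df = n - 2, lower.tail = FALSE)

# lm() computation
fit <- lm(y ~ x)
t.lm <- coef(summary(fit))['x', 't value']
p.lm <- coef(summary(fit))['x', 'Pr(>|t|)']

c(t.manual = t.manual, t.lm = t.lm)
c(p.manual = p.manual, p.lm = p.lm)
```

Both pairs of values agree exactly, confirming that `lm()` performs the very same regression *t*-test we implemented by hand.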