The **regression t-test** is applied to test if
the slope, \(\beta\), of the population
regression line equals \(0\). Based on
that test we may decide whether \(x\)
is a useful (linear) predictor of \(y\).

Under the null hypothesis the test statistic follows a t-distribution with \(df = n - 2\) and can be written as

\[t =\frac{\beta}{s_b}=
\frac{\beta}{s_e/\sqrt{\sum(x- \bar x)^2}}\text{,} \] where \(\beta\) corresponds to the sample
regression coefficient, \(s_b\) to its standard error, and \(s_e\) to
the **residual standard error**, with \(s_e=\sqrt{\frac{SSE}{n-2}}\) and \(SSE = \sum_{i=1}^n \epsilon_i^2\).

The \(100(1-\alpha)\) % confidence interval for \(\beta\) is given by

\[\beta \pm t_{\alpha/2}
\times \frac{s_e}{\sqrt{\sum(x- \bar x)^2}}\text{,}\] where
\(s_e\) corresponds to the
**residual standard error** (also known as the
**standard error of the estimate**).

The value of \(t_{\alpha/2}\) is obtained from
the *t*-distribution for the given confidence level and \(n-2\) degrees of freedom.
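
The interval formula can be sketched in R as follows. This is only an illustration under stated assumptions: it uses R's built-in `women` data set (height and weight of 15 women) as a stand-in for the students data, a 99 % confidence level is assumed, and the variable names (`b`, `a`, `s_e`, `s_b`, `t_crit`) are chosen here for readability.

```r
# Illustrative stand-in data: R's built-in `women` data set (not the students data)
x <- women$height
y <- women$weight
n <- length(x)
conf_level <- 0.99                       # assumed confidence level, 1 - alpha

# slope and intercept estimates
b <- cov(x, y) / var(x)
a <- mean(y) - b * mean(x)

# residual standard error s_e and standard error of the slope s_b
s_e <- sqrt(sum((y - (a + b * x))^2) / (n - 2))
s_b <- s_e / sqrt(sum((x - mean(x))^2))

# critical value t_{alpha/2} with n - 2 degrees of freedom
t_crit <- qt(1 - (1 - conf_level) / 2, df = n - 2)

# confidence interval for the slope
ci_manual <- c(lower = b - t_crit * s_b, upper = b + t_crit * s_b)
ci_manual

# cross-check against the built-in interval for the same model
fit <- lm(weight ~ height, data = women)
ci_builtin <- confint(fit, "height", level = conf_level)
ci_builtin
```

If the resulting interval does not contain \(0\), the two-sided regression t-test at the corresponding significance level rejects \(H_0: \beta = 0\).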

In order to practice the **regression
t-test** we load the `students.csv` data set. You may download the file
or import the data set directly into R:

`students <- read.csv("https://userpage.fu-berlin.de/soga/data/raw-data/students.csv")`

The *students* data set consists of 8239 rows, each of them
representing a particular student, and 16 columns, each of them
corresponding to a variable/feature related to that particular student.
These self-explanatory variables are: *stud.id, name, gender, age,
height, weight, religion, nc.score, semester, major, minor, score1,
score2, online.tutorial, graduated, salary*.

In order to showcase the regression *t*-test we examine the
relationship between two variables: the height of students as the
predictor variable and the weight of students as the response variable.
**The question is whether the predictor variable
height is useful for making predictions of the weight of
students.**

For data preparation we randomly sample 12 students from the data set
and build a data frame with the two variables of interest
(`height` and `weight`). Further, we plot the data
in the form of a scatter plot to visualize the underlying linear
relationship between the two variables.

```
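n <- 12
# draw n random row indices; no seed is set, so every run yields a different sample
sample_idx <- sample(seq_len(nrow(students)), n)
# keep only the two variables of interest
data <- students[sample_idx, c("height", "weight")]
# scatter plot of weight against height
plot(data$height, data$weight)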
```

The visual inspection supports our assumption that the relationship between the height and the weight variable is roughly linear. In other words, with increasing height the individual student tends to have a higher weight.
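
Because the 12 students are drawn at random, the numbers reported in the following steps depend on the particular sample. A minimal sketch of how a fixed seed could make such a draw repeatable is given below. The data frame `students_demo` is a simulated, hypothetical stand-in (not the real students data), and the seed value 42 is arbitrary.

```r
# Hypothetical stand-in for the students data (values are simulated, not real)
set.seed(1)
students_demo <- data.frame(
  height = round(rnorm(100, mean = 171, sd = 11)),
  weight = round(rnorm(100, mean = 73, sd = 9), 1)
)

n <- 12

# fixing the seed right before sampling makes the drawn rows repeatable
set.seed(42)                              # arbitrary seed value
idx_first <- sample(seq_len(nrow(students_demo)), n)

set.seed(42)
idx_second <- sample(seq_len(nrow(students_demo)), n)

# same seed, same sample
identical(idx_first, idx_second)
```

With a different seed, or without one, a different set of students is drawn and the test statistic and *p*-value will differ from the values shown in this section.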

In order to conduct the **regression t-test**
we follow the step-wise implementation procedure for hypothesis testing:

\[ \begin{array}{ll} \hline \text{Step 1} & \text{State the null hypothesis } H_0 \text{ and alternative hypothesis } H_A \text{.}\\ \text{Step 2} & \text{Decide on the significance level, } \alpha\text{.} \\ \text{Step 3} & \text{Compute the value of the test statistic.} \\ \text{Step 4} &\text{Determine the p-value.} \\ \text{Step 5} & \text{If } p \le \alpha \text{, reject }H_0 \text{; otherwise, do not reject } H_0 \text{.} \\ \text{Step 6} &\text{Interpret the result of the hypothesis test.} \\ \hline \end{array} \]

**Step 1: State the null hypothesis \(H_0\) and alternative hypothesis \(H_A\)**

The null hypothesis states that there is no linear relationship
between the height and the weight of the individuals in the
*students* data set:

\[H_0: \beta = 0\text{ (predictor variable is not useful for making predictions)}\]

Alternative hypothesis:

\[H_A: \beta \ne 0\text{ (predictor variable is useful for making predictions)}\]

**Step 2: Decide on the significance level, \(\alpha\)**

\[\alpha = 0.01\]

`alpha <- 0.01`

**Steps 3 and 4: Compute the value of the test statistic and the
p-value**

For illustration purposes we manually compute the test statistic in R. Recall the equation from above:

\[t =\frac{\beta}{s_b}= \frac{\beta}{s_e/\sqrt{\sum(x- \bar x)^2}}\text{,}\] where \(\beta = \frac{cov(x,y)}{var(x)}\), and

\[s_e = \sqrt{\frac{SSE}{n-2}}\text{,}\]

where \(SSE = \sum_{i=1}^n \epsilon_i^2 = \sum_{i=1}^n (y_i - \hat y_i)^2\). The test statistic follows a t-distribution with \(df = n - 2\). In order to calculate \(\hat y = \alpha + \beta x\) we need to know the intercept \(\alpha\) (not to be confused with the significance level \(\alpha\) from Step 2), which is defined as \(\alpha = \bar y -\beta \bar x\).

In order not to get confused by the different computational steps we do one step after another.

- Build the linear model by calculating the intercept \((\alpha)\) and the regression coefficient \((\beta)\):

```
y_bar <- mean(data$weight)
x_bar <- mean(data$height)
# linear model
lm_beta <- cov(data$height, data$weight) / var(data$height)
lm_beta
```

`## [1] 0.605819`

```
lm_alpha <- y_bar - lm_beta * x_bar
lm_alpha
```

`## [1] -29.9842`

- Calculate the sum of squared errors \((SSE)\) and the residual standard error \((s_e)\):

```
# sum of squared errors SSE
y_hat <- lm_alpha + lm_beta * data$height
SSE <- sum((data$weight - y_hat)^2)
SSE
```

`## [1] 55.07382`

```
# residual standard error
se <- sqrt(SSE / (n - 2))
se
```

`## [1] 2.346781`

- Compute the value of the test statistic:

```
t <- lm_beta / (se / sqrt(sum((data$height - x_bar)^2)))
t
```

`## [1] 11.12139`

The numerical value of the test statistic is 11.1213944.

In order to calculate the *p*-value we apply the
`pt()` function. Recall how to calculate the degrees of
freedom:

\[df = n - 2= 10\]

```
# compute the p-value
df <- length(data$height) - 2
# two-sided test
p_upper <- pt(abs(t), df = df, lower.tail = FALSE)
p_lower <- pt(-abs(t), df = df, lower.tail = TRUE)
p <- p_upper + p_lower
p
```

`## [1] 5.952194e-07`

\(p = 5.9521935\times 10^{-7}\).
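
As a cross-check, the two-sided *p*-value can also be obtained by doubling one tail, and the same decision can be reached by comparing the test statistic with a critical value. The following sketch plugs in the values reported above (\(t = 11.12139\), \(df = 10\), \(\alpha = 0.01\)); the variable names are chosen here for illustration and are not part of the original workflow.

```r
# values reported above for the sample of 12 students
t_stat <- 11.12139
df_res <- 10
alpha  <- 0.01

# two-sided p-value as the sum of both tails ...
p_two_tails <- pt(abs(t_stat), df = df_res, lower.tail = FALSE) +
  pt(-abs(t_stat), df = df_res, lower.tail = TRUE)

# ... and, by symmetry of the t-distribution, as twice one tail
p_doubled <- 2 * pt(abs(t_stat), df = df_res, lower.tail = FALSE)
all.equal(p_two_tails, p_doubled)

# critical-value approach: reject H0 if |t| exceeds t_{alpha/2}
t_crit <- qt(1 - alpha / 2, df = df_res)
t_crit
abs(t_stat) > t_crit
```

Both approaches are equivalent: \(|t|\) exceeds the critical value exactly when \(p \le \alpha\).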

**Step 5: If \(p \le \alpha\),
reject \(H_0\); otherwise, do not
reject \(H_0\)**

`p <= alpha`

`## [1] TRUE`

The *p*-value is smaller than the specified significance level
of 0.01; we reject \(H_0\). The test
results are statistically significant at the 1 % level and provide very
strong evidence against the null hypothesis.

**Step 6: Interpret the result of the hypothesis
test**

At the 1 % significance level the data provide very strong evidence that the height variable is a useful linear predictor of the weight of students.

We just computed a **regression t-test** in R
manually. That is fine, but we can do the same in R in just a few lines
of code!

To do so, we apply the `lm()` function to our
response variable `weight` and our predictor variable
`height`:

`lin_model <- lm(weight ~ height, data = data)`

In order to inspect the `lm` object we apply the
`summary()` function:

`summary(lin_model)`

```
## 
## Call:
## lm(formula = weight ~ height, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.3283 -0.9208 -0.4376  1.2132  4.6902 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -29.98420    9.39387  -3.192  0.00962 ** 
## height        0.60582    0.05447  11.121 5.95e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.347 on 10 degrees of freedom
## Multiple R-squared:  0.9252, Adjusted R-squared:  0.9177 
## F-statistic: 123.7 on 1 and 10 DF,  p-value: 5.952e-07
```

The `summary()` function returns a number of model
properties, which are discussed in detail in the *linear
regression* section. A nice overview of the output of a simple
linear model in R is provided in a blog entry by Felipe Rego.

For the purpose of comparison to the manually calculated values we
extract \(\alpha\) and \(\beta\) by applying the extractor function
`coef()` to the `lm` object:

`coef(lin_model)`

```
## (Intercept)      height 
##  -29.984195    0.605819 
```

Further, we compute the sum of squared errors \((SSE)\) from the output of the
`residuals()` function and extract the residual standard error \((s_e)\) by applying the
`sigma()` function to the `lm` object.

```
# sum of squared errors
sum(residuals(lin_model)^2)
```

`## [1] 55.07382`

```
# residual standard error
sigma(lin_model)
```

`## [1] 2.346781`

Finally, we directly access the *t*-test statistic and the
*p*-value by indexing the coefficient table of the summary of the `lm` object:

`coef(summary(lin_model))[, "t value"][2]`

```
##   height 
## 11.12139 
```

`coef(summary(lin_model))[, "Pr(>|t|)"][2]`

```
##       height 
## 5.952194e-07 
```

Compare the output to our results from above. They match perfectly!
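
Note that the `summary()` output above also reports an F-statistic with the same *p*-value as the slope. For a model with a single predictor the F-statistic equals the squared t-statistic of the slope (here \(11.121^2 \approx 123.7\)), and the t-test for Pearson's correlation coefficient yields the same statistic. The sketch below checks both relations; it assumes R's built-in `women` data set as a stand-in for the students sample, and the object names are illustrative.

```r
# Illustrative stand-in data: R's built-in `women` data set (not the students sample)
fit <- lm(weight ~ height, data = women)
fit_summary <- summary(fit)

# t-statistic and p-value of the slope
t_slope <- coef(fit_summary)["height", "t value"]
p_slope <- coef(fit_summary)["height", "Pr(>|t|)"]

# F-statistic of the overall model; equals t^2 when there is one predictor
f_stat <- unname(fit_summary$fstatistic["value"])
all.equal(t_slope^2, f_stat)

# the t-test for Pearson's correlation gives the same statistic and p-value
ct <- cor.test(women$height, women$weight)
all.equal(unname(ct$statistic), t_slope)
all.equal(ct$p.value, p_slope)
```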

**Citation**

The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us via e-mail at soga[at]zedat.fu-berlin.de.

You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follows: *Hartmann,
K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis
using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.*