In order to test whether two variables are linearly correlated, in other words, whether there is a linear relationship between the two variables, we may apply the so-called **correlation t-test**. The **population linear correlation coefficient**, \(\rho\), measures the linear correlation of two variables in the same manner as the **sample linear correlation coefficient**, \(r\), measures the linear correlation of two variables in a sample of pairs. Both \(\rho\) and \(r\) describe the strength of the linear relationship between two variables; however, \(r\) is an estimate of \(\rho\) obtained from sample data.

The linear correlation coefficient of two variables lies between \(-1\) and \(1\). If \(\rho = 0\), the variables are linearly uncorrelated; thus there is no linear relationship between the variables. If \(\rho \ne 0\), the variables are linearly correlated. If \(\rho > 0\), the variables are positively linearly correlated, and if \(\rho < 0\), the variables are negatively linearly correlated.
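These three cases are easy to reproduce with synthetic data. The following sketch (toy data, not from the *students* data set) generates a positive, a negative and an uncorrelated pair of variables and checks their sample correlation coefficients:

```r
# Illustration with synthetic data (assumed for this example):
# three samples with a positive, a negative and no linear relationship
set.seed(1)                       # for reproducibility
x <- 1:100
y.pos <- x + rnorm(100, sd = 5)   # strong positive linear relationship
y.neg <- -x + rnorm(100, sd = 5)  # strong negative linear relationship
y.none <- rnorm(100)              # no linear relationship

cor(x, y.pos)    # close to  1
cor(x, y.neg)    # close to -1
cor(x, y.none)   # close to  0
```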

A commonly used statistic to calculate the linear relationship between quantitative variables is the **Pearson product moment correlation coefficient**. It is given by

\[r = \frac{\sum_{i=1}^n(x_i- \bar x)(y_i - \bar y)}{\sqrt{\sum_{i=1}^n(x_i- \bar x)^2}\sqrt{\sum_{i=1}^n(y_i- \bar y)^2}}=\frac{s_{xy}}{s_x s_y}\text{,}\]

where \(s_{xy}\) is the covariance of \(x\) and \(y\) and \(s_x\) and \(s_y\) are the standard deviations of \(x\) and \(y\), respectively.
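The second form of the equation translates directly into R: the coefficient is the sample covariance divided by the product of the sample standard deviations, which matches the built-in `cor()` function. A minimal sketch on toy data (values assumed for illustration):

```r
# Pearson coefficient as covariance over the product of
# standard deviations (toy data, assumed for illustration)
x <- c(2, 4, 6, 8, 10)
y <- c(1, 3, 7, 9, 12)
r.manual <- cov(x, y) / (sd(x) * sd(y))
r.builtin <- cor(x, y)
all.equal(r.manual, r.builtin)   # TRUE
```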

As the sample linear correlation coefficient, \(r\), is an estimate of the population linear correlation coefficient, \(\rho\), we may use \(r\) for a hypothesis test for \(\rho\). The test statistic for a **correlation test** follows, under the null hypothesis \(\rho = 0\), a t-distribution with \(n-2\) degrees of freedom and may be written as

\[t= \frac{r}{\sqrt{\frac{1-r^2}{n-2}}}\text{.}\]
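As a quick worked example (with values assumed purely for illustration), a sample correlation of \(r = 0.5\) based on \(n = 27\) pairs yields:

```r
# Worked example (r and n assumed for illustration)
r <- 0.5
n <- 27
t <- r / sqrt((1 - r^2) / (n - 2))
t   # 2.886751
```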

In order to practice the **correlation t-test** we load the `students.csv` file. Import the data set and assign a proper name to it.

`students <- read.csv("https://userpage.fu-berlin.de/soga/200/2010_data_sets/students.csv")`

The *students* data set consists of 8239 rows, each of them representing a particular student, and 16 columns, each of them corresponding to a variable/feature related to that particular student. These self-explanatory variables are: *stud.id, name, gender, age, height, weight, religion, nc.score, semester, major, minor, score1, score2, online.tutorial, graduated, salary*.

In order to showcase the correlation *t*-test we examine the relationship between the variables `score1` and `score2`, which contain the results of two mandatory statistics exams. **The question is whether there is a linear relationship between the grades of two consecutive statistics exams.**

We start with data preparation.

- We subset the data set based on the `score1` and `score2` variables. By applying the `complete.cases()` function, we omit any `NA` values in the data set.
- Then we sample 50 students from the subset and extract the variables of interest.

```
complete <- students[complete.cases(students), ]
n <- 50
sample.idx <- sample(1:nrow(complete), n)
score1 <- complete[sample.idx, "score1"]
score2 <- complete[sample.idx, "score2"]
```

For the purpose of visual inspection we plot the random sample in the form of a scatter plot.

`plot(score1, score2)`

The visual inspection indicates an existing positive linear relationship between the variables `score1` and `score2`.

In order to conduct the **correlation t-test** we follow the step-wise implementation procedure for hypothesis testing.

**Step 1: State the null hypothesis \(H_0\) and alternative hypothesis \(H_A\)**

The null hypothesis states that there is no linear relationship between the grades of two consecutive statistics exams.

\[H_0: \rho = 0\]

Recall that the formulation of the alternative hypothesis dictates whether we apply a two-sided, a left-tailed or a right-tailed hypothesis test.

**Alternative hypothesis** \[H_A: \rho \ne 0 \] This formulation results in a two-sided hypothesis test.

**Step 2: Decide on the significance level, \(\alpha\)**

\[\alpha = 0.01\]

`alpha <- 0.01`

**Step 3 and 4: Compute the value of the test statistic and the p-value.**

For illustration purposes we manually compute the test statistic in R. Recall the equation from above:

\[t= \frac{r}{\sqrt{\frac{1-r^2}{n-2}}}\]

```
n <- length(score1)
# Compute the value of the test statistic
# pearson correlation coefficient r
r <- cor(score1, score2)
#test statistic
t <- r / sqrt((1-r^2)/(n-2))
t
```

`## [1] 16.76656`

The numerical value of the test statistic is 16.7665597.

In order to calculate the *p*-value we apply the `pt()` function. Recall how to calculate the degrees of freedom:

\[df = n - 2= 48\]

```
# Compute the p-value
df <- length(score1) - 2
# two-sided test
p.upper <- pt(abs(t), df = df, lower.tail = FALSE)
p.lower <- pt(-abs(t), df = df, lower.tail = TRUE)
p <- p.upper + p.lower
p
```

`## [1] 1.059701e-21`
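Because the t-distribution is symmetric about zero, the two tail probabilities are equal, so the two-sided *p*-value may equivalently be computed as twice the upper-tail probability:

```r
# Equivalent shortcut: by symmetry of the t-distribution,
# the two-sided p-value is twice the upper-tail probability
t <- 16.76656   # value of the test statistic from above
df <- 48
p <- 2 * pt(abs(t), df = df, lower.tail = FALSE)
p
```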

**Step 5: If \(p \le \alpha\), reject \(H_0\); otherwise, do not reject \(H_0\).**

`p <= alpha`

`## [1] TRUE`

The *p*-value is less than the specified significance level of 0.01; we reject \(H_0\). The test results are statistically significant at the 1% level and provide very strong evidence against the null hypothesis.

**Step 6: Interpret the result of the hypothesis test.**

\(p = 1.0597013\times 10^{-21}\). At the 1% significance level, the data provides very strong evidence to conclude that the exam grades of students are linearly correlated.

We just ran a **correlation t-test** in R manually. We can do the same in R with just one line of code!

Therefore we apply the `cor.test()` function. For the function we provide two vectors as data input, namely `score1` and `score2`.

`cor.test(score1, score2)`

```
##
## Pearson's product-moment correlation
##
## data: score1 and score2
## t = 16.767, df = 48, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.8695496 0.9564944
## sample estimates:
## cor
## 0.9242053
```

Perfect! Compare the output of the `cor.test()` function with our result from above. In addition, the function output returns the 95% confidence interval and the Pearson correlation coefficient for the sample data. Based on the output of the `cor.test()` function we may conclude that at the 1% significance level, the data provides very strong evidence to conclude that the exam grades of students are linearly correlated.
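For further processing it is often convenient to access the individual components of the test result. `cor.test()` returns an object of class `"htest"`, which is a list whose elements can be extracted directly; a small sketch on toy data (values assumed for illustration):

```r
# Sketch: accessing the components of a cor.test() result
# (toy data, assumed for illustration)
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2, 1, 4, 3, 6, 5)
res <- cor.test(x, y)

res$estimate    # sample correlation coefficient r
res$statistic   # value of the t test statistic
res$p.value     # p-value of the test
res$conf.int    # 95% confidence interval for rho
```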