To test whether two variables are linearly correlated, in other words whether there is a linear relationship between them, we may apply the so-called correlation t-test. The population linear correlation coefficient, $$\rho$$, measures the linear correlation of two variables in the same manner as the sample linear correlation coefficient, $$r$$, measures the linear correlation of the variables in a sample of pairs. Both $$\rho$$ and $$r$$ describe the strength of the linear relationship between two variables; however, $$r$$ is an estimate of $$\rho$$ obtained from sample data.

The linear correlation coefficient of two variables lies between $$-1$$ and $$1$$. If $$\rho = 0$$ the variables are linearly uncorrelated, thus there is no linear relationship between the variables. If $$\rho \ne 0$$ the variables are linearly correlated. If $$\rho > 0$$, the variables are positively linearly correlated, and if $$\rho < 0$$ the variables are negatively linearly correlated.
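These cases can be illustrated with simulated data. The sketch below uses made-up variables (they are not part of any data set used in this section) to produce sample correlations close to $$1$$, $$-1$$ and $$0$$:

```r
set.seed(10)  # arbitrary seed, chosen only for reproducibility

x <- rnorm(1000)

cor(x,  2 * x + rnorm(1000, sd = 0.5))  # strong positive linear relationship, r close to 1
cor(x, -2 * x + rnorm(1000, sd = 0.5))  # strong negative linear relationship, r close to -1
cor(x, rnorm(1000))                     # no linear relationship, r close to 0
```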

A commonly used statistic to measure the linear relationship between quantitative variables is the Pearson product moment correlation coefficient. It is given by

$r = \frac{\sum_{i=1}^n(x_i- \bar x)(y_i - \bar y)}{\sqrt{\sum_{i=1}^n(x_i- \bar x)^2}\sqrt{\sum_{i=1}^n(y_i- \bar y)^2}}=\frac{s_{xy}}{s_x s_y}\text{,}$

where $$s_{xy}$$ is the covariance of $$x$$ and $$y$$ and $$s_x$$ and $$s_y$$ are the standard deviations of $$x$$ and $$y$$, respectively.
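The identity $$r = s_{xy}/(s_x s_y)$$ is easy to verify in R: `cov()` and `sd()` both use the $$n-1$$ denominator, so their ratio reproduces the result of `cor()` exactly. A minimal sketch with made-up numbers:

```r
# illustrative data (hypothetical values)
x <- c(2, 4, 6, 8, 10)
y <- c(1, 3, 5, 9, 11)

# r as the covariance divided by the product of the standard deviations
r.manual <- cov(x, y) / (sd(x) * sd(y))

# built-in Pearson correlation coefficient for comparison
all.equal(r.manual, cor(x, y))
## [1] TRUE
```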

As the sample linear correlation coefficient, $$r$$ , is an estimate of the population linear correlation coefficient, $$\rho$$, we may use $$r$$ for a hypothesis test for $$\rho$$. The test statistic for a correlation test has a t-distribution with $$n-2$$ degrees of freedom and may be written as

$t= \frac{r}{\sqrt{\frac{1-r^2}{n-2}}}\text{.}$

#### The Correlation t-Test: An Example

In order to practice the correlation t-test we load the students data set. You may download the students.csv file here. Import the data set and assign a proper name to it.

students <- read.csv("https://userpage.fu-berlin.de/soga/200/2010_data_sets/students.csv")

The students data set consists of 8239 rows, each of them representing a particular student, and 16 columns, each of them corresponding to a variable/feature related to that particular student. These self-explanatory variables are: stud.id, name, gender, age, height, weight, religion, nc.score, semester, major, minor, score1, score2, online.tutorial, graduated, salary.

In order to showcase the correlation t-test we examine the relationship between the variables score1 and score2, which give the results of two mandatory statistics exams. The question is whether there is a linear relationship between the grades of two consecutive statistics exams.

#### Data preparation

• We remove all rows containing missing values by applying the complete.cases() function to the data set.
• Then we sample 50 students from the complete data set and extract the two variables of interest, score1 and score2.
complete <- students[complete.cases(students),]

n <- 50
sample.idx <- sample(1:nrow(complete),n)

score1 <- complete[sample.idx, 'score1']
score2 <- complete[sample.idx, 'score2'] 
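
Note that sample() draws a different random subset on each run, so your numbers will differ slightly from the ones shown below. If you want a reproducible sample, fix the random seed before sampling; a short sketch:

```r
set.seed(42)  # any fixed seed works; 42 is arbitrary
idx.a <- sample(1:100, 10)

set.seed(42)  # resetting the seed reproduces the same draw
idx.b <- sample(1:100, 10)

identical(idx.a, idx.b)
## [1] TRUE
```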

For the purpose of visual inspection we plot the random sample in the form of a scatter plot.

plot(score1, score2)

The visual inspection indicates an existing positive linear relationship between the variables score1 and score2.

#### Hypothesis testing

In order to conduct the correlation t-test we follow the step-wise implementation procedure for hypothesis testing discussed in the previous sections. $\begin{array}{ll} \hline \text{Step 1} & \text{State the null hypothesis } H_0 \text{ and alternative hypothesis } H_A \text{.}\\ \text{Step 2} & \text{Decide on the significance level, } \alpha\text{.} \\ \text{Step 3} & \text{Compute the value of the test statistic.} \\ \text{Step 4} & \text{Determine the p-value.} \\ \text{Step 5} & \text{If } p \le \alpha \text{, reject }H_0 \text{; otherwise, do not reject } H_0 \text{.} \\ \text{Step 6} & \text{Interpret the result of the hypothesis test.} \\ \hline \end{array}$

Step 1: State the null hypothesis $$H_0$$ and alternative hypothesis $$H_A$$

The null hypothesis states that there is no linear relationship between the grades of two consecutive statistics exams, i.e. the population correlation coefficient is zero.

$H_0: \rho = 0$

Recall that the formulation of the alternative hypothesis dictates whether we apply a two-sided, a left-tailed or a right-tailed hypothesis test.

Alternative hypothesis $H_A: \rho \ne 0$. This formulation results in a two-sided hypothesis test.

Step 2: Decide on the significance level, $$\alpha$$

$\alpha = 0.01$

alpha <- 0.01

Steps 3 and 4: Compute the value of the test statistic and the p-value.

For illustration purposes we manually compute the test statistic in R. Recall the equation from above:

$t= \frac{r}{\sqrt{\frac{1-r^2}{n-2}}}$

n <- length(score1)

# Compute the value of the test statistic
#  pearson correlation coefficient r
r <- cor(score1, score2)

#test statistic
t <- r / sqrt((1-r^2)/(n-2))
t
## [1] 16.76656

The numerical value of the test statistic is 16.7665597.

In order to calculate the p-value we apply the pt() function. Recall how to calculate the degrees of freedom:

$df = n - 2= 48$

# Compute the p-value
df <- length(score1) - 2

# two-sided test
p.upper <- pt(abs(t), df = df, lower.tail = FALSE)
p.lower <- pt(-abs(t), df = df, lower.tail = TRUE)
p <- p.upper + p.lower
p
## [1] 1.059701e-21
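
Since the t-distribution is symmetric around zero, the two tail probabilities are equal, so the two-sided p-value can equivalently be obtained by doubling the upper-tail probability (t, df and p as computed above):

```r
# equivalent two-sided p-value, exploiting the symmetry of the t-distribution
p.two.sided <- 2 * pt(abs(t), df = df, lower.tail = FALSE)
all.equal(p, p.two.sided)
## [1] TRUE
```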

Step 5: If $$p \le \alpha$$, reject $$H_0$$; otherwise, do not reject $$H_0$$.

p <= alpha
## [1] TRUE

The p-value is less than the specified significance level of 0.01; we reject $$H_0$$. The test results are statistically significant at the 1% level and provide very strong evidence against the null hypothesis.

Step 6: Interpret the result of the hypothesis test.

$$p = 1.0597013\times 10^{-21}$$. At the 1% significance level, the data provides very strong evidence to conclude that the exam grades of students are linearly correlated.

#### Hypothesis testing in R

We just conducted a correlation t-test in R manually. We can achieve the same with just one line of code!

We apply the cor.test() function, providing the two data vectors, score1 and score2, as input.

cor.test(score1, score2)
##
##  Pearson's product-moment correlation
##
## data:  score1 and score2
## t = 16.767, df = 48, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8695496 0.9564944
## sample estimates:
##       cor
## 0.9242053

Perfect! Compare the output of the cor.test() function with our result from above. In addition, the function output returns the 95% confidence interval and the Pearson correlation coefficient for the sample data. Based on the output of the cor.test() function we may conclude that at the 1% significance level, the data provides very strong evidence to conclude that the exam grades of students are linearly correlated.
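
The value returned by cor.test() is a list object of class htest, so the individual quantities can be extracted by name for further use; a brief sketch:

```r
res <- cor.test(score1, score2)

res$statistic  # the t test statistic
res$p.value    # the p-value
res$conf.int   # the 95 percent confidence interval
res$estimate   # the sample correlation coefficient r
```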