In order to test whether two variables are linearly correlated, that is, whether there is a linear relationship between the two variables, we may apply the so-called correlation t-test. The population linear correlation coefficient, \(\rho\), measures the linear correlation of two variables in the same manner as the sample linear correlation coefficient, \(r\), measures the linear correlation in a sample of pairs. Both \(\rho\) and \(r\) describe the strength of the linear relationship between two variables; however, \(r\) is an estimate of \(\rho\) obtained from sample data.

The linear correlation coefficient of two variables lies between \(-1\) and \(1\). If \(\rho = 0\), the variables are linearly uncorrelated; there is no linear relationship between them. If \(\rho \ne 0\), the variables are linearly correlated. If \(\rho > 0\), the variables are positively linearly correlated, and if \(\rho < 0\), the variables are negatively linearly correlated.
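As a quick illustration with made-up vectors, the sign of the sample correlation coefficient returned by R's cor() function reflects the direction of the linear relationship:

```r
# Illustration with made-up data: the sign of r reflects the direction
# of the linear relationship
x     <- c(1, 2, 3, 4, 5)
y.pos <- c(2, 4, 5, 4, 6)   # tends to increase with x
y.neg <- -y.pos             # tends to decrease with x

cor(x, y.pos)   # positive, between 0 and 1
cor(x, y.neg)   # negative, same magnitude
```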

A commonly used statistic to calculate the linear relationship between quantitative variables is the Pearson product moment correlation coefficient. It is given by

\[r = \frac{\sum_{i=1}^n(x_i- \bar x)(y_i - \bar y)}{\sqrt{\sum_{i=1}^n(x_i- \bar x)^2}\sqrt{\sum_{i=1}^n(y_i- \bar y)^2}}=\frac{s_{xy}}{s_x s_y}\text{,}\]

where \(s_{xy}\) is the covariance of \(x\) and \(y\) and \(s_x\) and \(s_y\) are the standard deviations of \(x\) and \(y\), respectively.
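The identity \(r = s_{xy}/(s_x s_y)\) can be verified numerically on made-up data with R's built-in functions cov(), sd() and cor():

```r
# Made-up data to check that r = s_xy / (s_x * s_y)
x <- c(1, 3, 5, 7, 9)
y <- c(2, 3, 7, 6, 10)

r.manual  <- cov(x, y) / (sd(x) * sd(y))  # covariance over product of sds
r.builtin <- cor(x, y)                    # Pearson correlation coefficient
all.equal(r.manual, r.builtin)            # TRUE
```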

As the sample linear correlation coefficient, \(r\), is an estimate of the population linear correlation coefficient, \(\rho\), we may use \(r\) in a hypothesis test for \(\rho\). The test statistic for a correlation test has a t-distribution with \(n-2\) degrees of freedom and may be written as

\[t= \frac{r}{\sqrt{\frac{1-r^2}{n-2}}}\text{.}\]


The Correlation t-Test: An Example

In order to practice the correlation t-test we load the students data set. You may download the students.csv file here. Import the data set and assign a proper name to it.

students <- read.csv("https://userpage.fu-berlin.de/soga/200/2010_data_sets/students.csv")

The students data set consists of 8239 rows, each of them representing a particular student, and 16 columns, each of them corresponding to a variable/feature related to that particular student. These self-explanatory variables are: stud.id, name, gender, age, height, weight, religion, nc.score, semester, major, minor, score1, score2, online.tutorial, graduated, salary.

In order to showcase the correlation t-test we examine the relationship between the variables score1 and score2, which contain the results of two mandatory statistics exams. The question is whether there is a linear relationship between the grades of two consecutive statistics exams.


Data preparation

We start with data preparation.

# keep only students with no missing values
complete <- students[complete.cases(students), ]

# draw a random sample of n = 50 students
n <- 50
sample.idx <- sample(1:nrow(complete), n)

score1 <- complete[sample.idx, 'score1']
score2 <- complete[sample.idx, 'score2']
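Note that sample() draws a different random sample on each run, so the numbers in your output will differ from the ones shown below. If reproducibility is desired, one may fix the seed of the random number generator before sampling; the seed value 123 used here is arbitrary:

```r
# Fixing the seed makes a random draw reproducible
set.seed(123)                 # arbitrary seed value
draw1 <- sample(1:10, 3)

set.seed(123)                 # same seed ...
draw2 <- sample(1:10, 3)      # ... same draw
identical(draw1, draw2)       # TRUE
```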

For the purpose of visual inspection we plot the random sample in the form of a scatter plot.

plot(score1, score2)

The visual inspection indicates an existing positive linear relationship between the variables score1 and score2.


Hypothesis testing

In order to conduct the correlation t-test we follow the same step-wise hypothesis testing procedure as discussed in the previous sections. \[ \begin{array}{l} \hline \ \text{Step 1} & \text{State the null hypothesis } H_0 \text{ and alternative hypothesis } H_A \text{.}\\ \ \text{Step 2} & \text{Decide on the significance level, } \alpha\text{.} \\ \ \text{Step 3} & \text{Compute the value of the test statistic.} \\ \ \text{Step 4} &\text{Determine the p-value.} \\ \ \text{Step 5} & \text{If } p \le \alpha \text{, reject }H_0 \text{; otherwise, do not reject } H_0 \text{.} \\ \ \text{Step 6} &\text{Interpret the result of the hypothesis test.} \\ \hline \end{array} \]

Step 1: State the null hypothesis \(H_0\) and alternative hypothesis \(H_A\)

The null hypothesis states that there is no linear relationship between the grades of two consecutive statistics exams. Note that the hypotheses are statements about the population parameter \(\rho\), not about its sample estimate \(r\).

\[H_0: \rho = 0\]

Recall that the formulation of the alternative hypothesis dictates whether we apply a two-sided, a left-tailed or a right-tailed hypothesis test.

Alternative hypothesis \[H_A: \rho \ne 0 \] This formulation results in a two-sided hypothesis test.


Step 2: Decide on the significance level, \(\alpha\)

\[\alpha = 0.01\]

alpha <- 0.01

Step 3 and 4: Compute the value of the test statistic and the p-value.

For illustration purposes we manually compute the test statistic in R. Recall the equation from above:

\[t= \frac{r}{\sqrt{\frac{1-r^2}{n-2}}}\]

n <- length(score1)

# Compute the value of the test statistic
# Pearson correlation coefficient r
r <- cor(score1, score2)

# test statistic
t <- r / sqrt((1 - r^2) / (n - 2))
t
## [1] 16.76656

The numerical value of the test statistic is 16.7665597.

In order to calculate the p-value we apply the pt() function. Recall how to calculate the degrees of freedom:

\[df = n - 2= 48\]

# Compute the p-value
df <- length(score1) - 2

# two-sided test
p.upper <- pt(abs(t), df = df, lower.tail = FALSE)
p.lower <- pt(-abs(t), df = df, lower.tail = TRUE)
p <- p.upper + p.lower
p
## [1] 1.059701e-21
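Since the t-distribution is symmetric around zero, the two tail probabilities are equal, and the two-sided p-value may equivalently be computed as twice the upper tail probability. A minimal sketch, plugging in the test statistic and degrees of freedom from the worked example above:

```r
# Values from the worked example above
t  <- 16.76656
df <- 48

# Two-sided p-value via symmetry: twice the upper tail probability
p.two.sided <- 2 * pt(abs(t), df = df, lower.tail = FALSE)
p.two.sided
```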

Step 5: If \(p \le \alpha\), reject \(H_0\); otherwise, do not reject \(H_0\).

p <= alpha
## [1] TRUE

The p-value is less than the specified significance level of 0.01; we reject \(H_0\). The test results are statistically significant at the 1% level and provide very strong evidence against the null hypothesis.


Step 6: Interpret the result of the hypothesis test.

\(p = 1.0597013\times 10^{-21}\). At the 1% significance level, the data provides very strong evidence to conclude that the exam grades of students are linearly correlated.


Hypothesis testing in R

We just ran a correlation t-test in R manually. We can do the same with just one line of code!

For this we apply the cor.test() function, providing the two vectors score1 and score2 as data input.

cor.test(score1, score2)
## 
##  Pearson's product-moment correlation
## 
## data:  score1 and score2
## t = 16.767, df = 48, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8695496 0.9564944
## sample estimates:
##       cor 
## 0.9242053

Perfect! Compare the output of the cor.test() function with our result from above. In addition, the function output returns the 95% confidence interval and the Pearson correlation coefficient for the sample data. Based on the output of the cor.test() function we may conclude that at the 1% significance level, the data provides very strong evidence to conclude that the exam grades of students are linearly correlated.
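Since we decided on a significance level of \(\alpha = 0.01\), it may be more natural to request a 99% confidence interval. cor.test() accepts a conf.level argument for this (and an alternative argument for one-sided tests). A short sketch on synthetic, positively correlated data, as the sample drawn above varies from run to run:

```r
# Synthetic, positively correlated data (for illustration only)
set.seed(42)                       # arbitrary seed
x <- rnorm(50)
y <- x + rnorm(50)                 # positively correlated by construction

# request a 99 % confidence interval to match alpha = 0.01
cor.test(x, y, conf.level = 0.99)
```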