To test whether two variables are linearly correlated, in other words whether there is a linear relationship between the two variables, we may apply the so-called correlation t-test. The population linear correlation coefficient, \(\rho\), measures the linear correlation of two variables in the same manner as the sample linear correlation coefficient, \(r\), measures the linear correlation in a sample of pairs. Both \(\rho\) and \(r\) describe the strength of the linear relationship between two variables; however, \(r\) is an estimate of \(\rho\) obtained from sample data.

The linear correlation coefficient of two variables lies between \(-1\) and \(1\). If \(\rho = 0\) the variables are linearly uncorrelated, thus there is no linear relationship between the variables. If \(\rho \ne 0\) the variables are linearly correlated. If \(\rho > 0\) the variables are positively linearly correlated, and if \(\rho < 0\) the variables are negatively linearly correlated.
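As a quick illustration with made-up toy vectors, the sign of the sample correlation coefficient returned by R's cor() function reflects the direction of the linear relationship:

```r
# hypothetical toy data: y1 increases with x, y2 decreases with x
x  <- c(1, 2, 3, 4, 5)
y1 <- c(2.1, 3.9, 6.2, 8.0, 10.1)   # roughly y = 2x
y2 <- c(9.8, 8.1, 5.9, 4.2, 1.9)    # roughly y = 10 - 2x

cor(x, y1)  # close to 1: strong positive linear correlation
cor(x, y2)  # close to -1: strong negative linear correlation
```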

A commonly used statistic to quantify the linear relationship between quantitative variables is the Pearson product-moment correlation coefficient. It is given by

\[r = \frac{\sum_{i=1}^n(x_i- \bar x)(y_i - \bar y)}{\sqrt{\sum_{i=1}^n(x_i- \bar x)^2}\sqrt{\sum_{i=1}^n(y_i- \bar y)^2}}=\frac{s_{xy}}{s_x s_y}\text{,}\]

where \(s_{xy}\) is the covariance of \(x\) and \(y\) and \(s_x\) and \(s_y\) are the standard deviations of \(x\) and \(y\), respectively.
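As a sanity check (a minimal sketch with made-up numbers), we can verify that computing \(r\) from the covariance and the standard deviations agrees with R's built-in cor() function:

```r
# hypothetical toy data
x <- c(1.2, 2.4, 3.1, 4.8, 5.0)
y <- c(2.0, 2.9, 3.8, 5.1, 5.9)

r_manual  <- cov(x, y) / (sd(x) * sd(y))  # r = s_xy / (s_x * s_y)
r_builtin <- cor(x, y)

all.equal(r_manual, r_builtin)
```

Both cov() and sd() use the same \(n-1\) denominator, so the two results agree exactly.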

Since the sample linear correlation coefficient, \(r\), is an estimate of the population linear correlation coefficient, \(\rho\), we may use \(r\) for a hypothesis test for \(\rho\). The test statistic for a correlation test has a t-distribution with \(n-2\) degrees of freedom and may be written as

\[t= \frac{r}{\sqrt{\frac{1-r^2}{n-2}}}\text{.}\]
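Solving this equation for \(r\) gives the smallest sample correlation (in absolute value) that is statistically significant at a given level. The following sketch illustrates this for a two-sided test at \(\alpha = 0.05\) with \(n = 50\):

```r
n     <- 50
alpha <- 0.05
df    <- n - 2

# two-sided critical value of the t-distribution
t_crit <- qt(1 - alpha / 2, df = df)

# solve t = r / sqrt((1 - r^2) / (n - 2)) for r
r_crit <- t_crit / sqrt(t_crit^2 + df)
round(r_crit, 3)
```

For \(n = 50\) this yields a critical value of roughly 0.28; sample correlations smaller in magnitude would not be significant at the 5 % level.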


The Correlation t-Test: An Example

In order to practice the correlation t-test we load the students data set. You may download the students.csv file here or import the data set directly into R:

students <- read.csv("https://userpage.fu-berlin.de/soga/data/raw-data/students.csv")

The students data set consists of 8239 rows, each of them representing a particular student, and 16 columns, each of them corresponding to a variable/feature related to that particular student. These self-explanatory variables are: stud.id, name, gender, age, height, weight, religion, nc.score, semester, major, minor, score1, score2, online.tutorial, graduated, salary.

In order to showcase the correlation t-test we examine the relationship between the variables score1 and score2, which show the results of two mandatory statistics exams. The question is whether there is a linear relationship between the grades of two consecutive statistics exams.


Data preparation

We start with data preparation.

First, by applying the complete.cases() function, we omit all rows that contain NA values. Then we draw a random sample of 50 students and extract the two variables of interest, score1 and score2. Note that, because sample() draws a random sample, your numerical results will differ slightly from those shown below.

complete <- students[complete.cases(students), ]

n <- 50
sample_idx <- sample(1:nrow(complete), n)

score1 <- complete[sample_idx, "score1"]
score2 <- complete[sample_idx, "score2"]

For the purpose of visual inspection we plot the random sample in the form of a scatter plot:

plot(score1, score2)

The visual inspection suggests a positive linear relationship between the variables score1 and score2.


Hypothesis testing

In order to conduct the correlation t-test we follow the same step-wise procedure for hypothesis testing as discussed in the previous sections:

\[ \begin{array}{l} \hline \ \text{Step 1} & \text{State the null hypothesis } H_0 \text{ and alternative hypothesis } H_A \text{.}\\ \ \text{Step 2} & \text{Decide on the significance level, } \alpha\text{.} \\ \ \text{Step 3} & \text{Compute the value of the test statistic.} \\ \ \text{Step 4} &\text{Determine the p-value.} \\ \ \text{Step 5} & \text{If } p \le \alpha \text{, reject }H_0 \text{; otherwise, do not reject } H_0 \text{.} \\ \ \text{Step 6} &\text{Interpret the result of the hypothesis test.} \\ \hline \end{array} \]

Step 1: State the null hypothesis \(H_0\) and alternative hypothesis \(H_A\)

The null hypothesis states that there is no linear relationship between the grades of two consecutive statistics exams:

\[H_0: \rho = 0\]

Recall that the formulation of the alternative hypothesis dictates whether we apply a two-sided, a left-tailed or a right-tailed hypothesis test.

Alternative hypothesis:

\[H_A: \rho \ne 0 \]

This formulation results in a two-sided hypothesis test.


Step 2: Decide on the significance level, \(\alpha\)

\[\alpha = 0.01\]

alpha <- 0.01

Step 3 and 4: Compute the value of the test statistic and the p-value

For illustration purposes we manually compute the test statistic in R. Recall the equation from above:

\[t= \frac{r}{\sqrt{\frac{1-r^2}{n-2}}}\]

n <- length(score1)

# pearson correlation coefficient r
r <- cor(score1, score2)

# compute the value of the test statistic
t <- r / sqrt((1 - r^2) / (n - 2))
t
## [1] 13.01841

The numerical value of the test statistic is \(t \approx 13.018\).

In order to calculate the p-value we apply the pt() function. Recall how to calculate the degrees of freedom:

\[df = n - 2= 48\]

# compute the p-value
df <- length(score1) - 2

# two-sided test
p_upper <- pt(abs(t), df = df, lower.tail = FALSE)
p_lower <- pt(-abs(t), df = df, lower.tail = TRUE)
p <- p_upper + p_lower
p
## [1] 2.304383e-17

Thus, \(p \approx 2.3 \times 10^{-17}\).
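Because the t-distribution is symmetric about zero, the two tail probabilities above are equal, and the two-sided p-value can be written more compactly. The following sketch, plugging in the test statistic computed above, is equivalent to the two-step computation:

```r
t_val <- 13.01841   # the test statistic computed above
df    <- 48         # degrees of freedom, n - 2

# symmetric shortcut: p = 2 * P(T > |t|)
p <- 2 * pt(abs(t_val), df = df, lower.tail = FALSE)
p
```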


Step 5: If \(p \le \alpha\), reject \(H_0\); otherwise, do not reject \(H_0\)

p <= alpha
## [1] TRUE

The p-value is smaller than the specified significance level of 0.01; we reject \(H_0\). The test results are statistically significant at the 1 % level and provide very strong evidence against the null hypothesis.


Step 6: Interpret the result of the hypothesis test

At the 1 % significance level the data provides very strong evidence to conclude that the grades of the two consecutive statistics exams are linearly correlated.


Hypothesis testing in R

We just ran a correlation t-test in R manually. We can do the same in just one line of code!

For this we apply the cor.test() function, providing the two vectors of interest, score1 and score2, as input data:

cor.test(score1, score2)
## 
##  Pearson's product-moment correlation
## 
## data:  score1 and score2
## t = 13.018, df = 48, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8013260 0.9320899
## sample estimates:
##       cor 
## 0.8827735

Perfect! Compare the output of the cor.test() function to our result from above. In addition, the function output returns the 95 % confidence interval and the Pearson correlation coefficient for the sample data. Based on the output of the cor.test() function we may conclude that at the 1 % significance level the data provides very strong evidence to conclude that the exam grades of students are linearly correlated.
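Note that cor.test() reports a 95 % confidence interval by default; since we tested at \(\alpha = 0.01\), we could request the matching 99 % interval via the conf.level argument. The function returns an object of class htest, whose components can be accessed directly. The sketch below uses hypothetical toy vectors, since the sampled scores vary from run to run:

```r
# hypothetical toy data with a strong positive linear relationship
x <- c(12, 15, 19, 24, 30, 31, 38, 40, 45, 51)
y <- c(14, 18, 17, 26, 28, 35, 36, 43, 44, 53)

res <- cor.test(x, y, alternative = "two.sided", conf.level = 0.99)

res$statistic  # the t-value
res$parameter  # degrees of freedom (n - 2)
res$p.value    # the p-value
res$estimate   # the sample correlation coefficient r
res$conf.int   # the 99 % confidence interval for rho
```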


Citation

The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us via e-mail at soga[at]zedat.fu-berlin.de.

Creative Commons License
You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follows: Hartmann, K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.