In order to test whether two variables are linearly correlated, that is, whether there is a linear relationship between them, we may apply the so-called correlation t-test. The population linear correlation coefficient, $\rho$, measures the linear correlation of two variables in the same manner as the sample linear correlation coefficient, $r$, measures the linear correlation of two variables in a sample of pairs. Both $\rho$ and $r$ describe the strength of the linear relationship between two variables; however, $r$ is an estimate of $\rho$ obtained from sample data.
The linear correlation coefficient of two variables lies between $-1$ and $1$. If $\rho = 0$ the variables are linearly uncorrelated. Thus there is no linear relationship between the variables. If $\rho \ne 0$ the variables are linearly correlated. If $\rho > 0$, the variables are positively linearly correlated, and the variables are negatively linearly correlated if $\rho < 0$.
A commonly used statistic to calculate the linear relationship between quantitative variables is the Pearson product-moment correlation coefficient. It is given by
$$r = \frac {\sum_{i = 1}^{n} (x_{i} - \bar {x}) (y_{i} - \bar {y})} {\sqrt {\sum_{i = 1}^{n} (x_{i} - \bar {x})^{2}} \sqrt {\sum_{i = 1}^{n}(y_{i} - \bar {y})^{2}}} = \frac {s_{xy}} {s_{x} s_{y}}$$
where $s_{xy}$ is the covariance of $x$ and $y$, and $s_{x}$ and $s_{y}$ are the standard deviations of $x$ and $y$, respectively.
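As a quick sanity check, the definition above can be evaluated directly with NumPy and compared against the built-in np.corrcoef function; the data values below are made up purely for illustration:

```python
import numpy as np

# Toy data, invented for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 5.0])

# Pearson r straight from the definition
r_manual = np.sum((x - x.mean()) * (y - y.mean())) / (
    np.sqrt(np.sum((x - x.mean()) ** 2)) * np.sqrt(np.sum((y - y.mean()) ** 2))
)

# Same coefficient via NumPy's built-in correlation matrix
r_numpy = np.corrcoef(x, y)[0, 1]

print(r_manual, r_numpy)
```

Both computations agree, which confirms that np.corrcoef implements exactly the Pearson product-moment formula.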
Since the sample linear correlation coefficient, $r$, is an estimate of the population linear correlation coefficient, $\rho$, we may use $r$ for a hypothesis test for $\rho$. The test statistic for a correlation test has a t-distribution with $n - 2$ degrees of freedom and may be written as:
$$t= \frac{r} {\sqrt {\frac {1 - r^{2}} {n - 2} } }$$
In order to practice the correlation t-test, we load the students data set. You may download the students.csv file here and import it from your local file system, or you may load it directly as a web resource. In either case, you import the data set into Python as a pandas DataFrame object by using the read_csv method:
Note: Make sure the numpy and pandas packages are part of your mamba environment!
import pandas as pd
import numpy as np
students = pd.read_csv("https://userpage.fu-berlin.de/soga/data/raw-data/students.csv")
The students data set consists of 8239 rows, each representing a particular student, and 16 columns, each corresponding to a variable/feature related to that particular student. The variable names are self-explanatory.
In order to showcase the correlation t-test, we examine the relationship between the variables score1 and score2, which contain the results of two mandatory statistics exams. The question is whether a linear relationship exists between the grades of two consecutive statistics exams.
We draw a random sample of 50 students, restricted to the score1 and score2 variables, and account for any NaN value in the data set by applying the <df>.dropna() function.
n = 50
sample = students[["score1", "score2"]].dropna().sample(n, random_state = 9)
For visual inspection, we plot the random sample in the form of a scatter plot by using the scatterplot() function provided by the seaborn package:
Note: Ensure the seaborn package is part of your mamba environment!
import seaborn as sns
sns.scatterplot(data = sample, x = "score1", y = "score2")
<Axes: xlabel='score1', ylabel='score2'>
The visual inspection indicates a positive linear relationship between the variables score1 and score2.
In order to conduct the correlation t-test, we follow the same step-wise procedure for hypothesis testing as discussed in the previous sections:
Step 1: State the null hypothesis $H_{0}$ and alternative hypothesis $H_{A}$
The null hypothesis states that there is no linear relationship between the grades of two consecutive statistics exams:
$$H_{0} : \rho = 0$$
Recall that the formulation of the alternative hypothesis dictates whether we apply a two-sided, a left-tailed or a right-tailed hypothesis test.
Alternative hypothesis:
$$H_{A} : \rho \ne 0$$
This formulation results in a two-sided hypothesis test.
Step 2: Decide on the significance level, $\alpha$
$$\alpha = 0.01$$
alpha = 0.01
Steps 3 and 4: Compute the value of the test statistic and the p-value
For illustration purposes, we manually compute the test statistic in Python. Recall the equation from above:
$$t = \frac {r} {\sqrt {\frac {1 - r^{2}} {n - 2}}}$$
r = np.corrcoef(sample["score1"], sample["score2"])[0, 1]
t_value = r / np.sqrt((1 - r**2) / (n - 2))
t_value
12.453777135382685
Note: The corrcoef() function provided by the numpy package returns the correlation matrix based on the Pearson product-moment correlation coefficient. Since we have two input variables, the result is a $2 \times 2$ matrix with four elements. The Pearson product-moment correlation coefficients of interest are given within the matrix at the indices [0, 1] and [1, 0]. Additional information about the corrcoef() function is provided in numpy's function documentation.
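A minimal sketch of this matrix structure, using made-up data for illustration:

```python
import numpy as np

# Made-up data for illustration
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.5, 1.0, 3.5, 4.0])

R = np.corrcoef(x, y)

# The diagonal holds the correlation of each variable with itself (always 1);
# the symmetric off-diagonal entries hold the coefficient of interest.
print(R.shape)  # (2, 2)
print(R[0, 1] == R[1, 0])
```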
The numerical value of the test statistic is 12.4537771.
In order to calculate the p-value, we apply the t.cdf function provided by the scipy package to calculate the probability of occurrence for the test statistic based on the t-distribution. To do so, we also need the degrees of freedom. Recall how to calculate the degrees of freedom:
Note: Make sure the scipy package is part of your mamba environment!
from scipy.stats import t
df = n - 2
p_value_lower = t.cdf(-np.abs(t_value), df)
p_value_upper = 1 - t.cdf(np.abs(t_value), df)
p_value = p_value_lower + p_value_upper
p_value
6.009746324118586e-17
$p = 6.0097463241 \times 10^{-17}$
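Since the t-distribution is symmetric, the two tail probabilities are equal, so the two-sided p-value can be written more compactly with SciPy's survival function t.sf. A small sketch with a made-up test statistic:

```python
import numpy as np
from scipy.stats import t

t_value = 2.5  # hypothetical test statistic, for illustration only
df = 48        # degrees of freedom for a sample of n = 50

# Two-sided p-value: twice the upper-tail probability
p_two_sided = 2 * t.sf(np.abs(t_value), df)

# Identical to summing the lower and upper tails explicitly
p_tails = t.cdf(-np.abs(t_value), df) + (1 - t.cdf(np.abs(t_value), df))
print(np.isclose(p_two_sided, p_tails))  # True
```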
Step 5: If $p \le \alpha$, reject $H_{0}$; otherwise, do not reject $H_{0}$
# reject H0?
p_value < alpha
True
The p-value is smaller than the specified significance level of 0.01; we reject $H_{0}$. The test results are statistically significant at the 1 % level and provide very strong evidence against the null hypothesis.
Step 6: Interpret the result of the hypothesis test
At the 1 % significance level, the data provide very strong evidence to conclude that students' exam grades are linearly correlated.
Correlation t-test with scipy
We just ran a correlation t-test in Python manually. We can do the same with just one line of code by using the power of Python's package universe, namely the scipy package!
In order to conduct a correlation t-test in Python via the stats module of the scipy package, we apply the pearsonr() function. We only have to provide our data separately as numpy arrays. Furthermore, you can specify the test method via the alternative argument. Because the default is already set to two-sided, we do not have to adapt it. Additional information regarding the function's usage can be found directly in scipy's function documentation.
from scipy.stats import pearsonr
test_result = pearsonr(sample["score1"], sample["score2"])
test_result
PearsonRResult(statistic=0.8738759767952878, pvalue=1.2019492648237166e-16)
The pearsonr() function returns an object which provides all relevant information regarding the performed correlation t-test, including the Pearson product-moment correlation coefficient as well as the corresponding p-value of the test result. In detail, the object consists of the following properties:
<object>.statistic represents the Pearson product-moment correlation coefficient $r$.
<object>.pvalue represents the p-value of the performed significance test.
Consequently, the p-value is retrieved via:
test_result.pvalue
1.2019492648237166e-16
Additionally, the returned object allows calculating the confidence interval for $r$ based on a given confidence level via the confidence_interval() method. A confidence level of 0.95 (i.e., $\alpha = 0.05$) is the default:
test_result.confidence_interval()
ConfidenceInterval(low=0.7869461225871097, high=0.9267900101824088)
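As a small self-contained sketch with made-up data: passing a higher confidence_level to the method produces a wider interval, since more confidence requires covering more of the sampling distribution.

```python
import numpy as np
from scipy.stats import pearsonr

# Made-up data for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.2, 1.9, 3.4, 3.9, 5.1, 6.3])

res = pearsonr(x, y)
ci95 = res.confidence_interval()                       # default confidence level 0.95
ci99 = res.confidence_interval(confidence_level=0.99)  # wider interval

# The 99 % interval contains the 95 % interval
print(ci99.low < ci95.low and ci95.high < ci99.high)  # True
```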
Perfect! Based on the returned test result, we may conclude that, at the 1 % significance level, the data provide very strong evidence that students' exam grades are linearly correlated.
Citation
The E-Learning project SOGA-Py was developed at the Department of Earth Sciences by Annette Rudolph, Joachim Krois and Kai Hartmann. You can reach us via mail by soga[at]zedat.fu-berlin.de.
Please cite as follows: Rudolph, A., Krois, J., Hartmann, K. (2023): Statistics and Geodata Analysis using Python (SOGA-Py). Department of Earth Sciences, Freie Universitaet Berlin.