In order to test whether two variables are linearly correlated, that is, whether there is a linear relationship between them, we may apply the so-called correlation t-test. The population linear correlation coefficient, $\rho$, measures the linear correlation of two variables in the same manner as the sample linear correlation coefficient, $r$, measures the linear correlation of two variables in a sample of pairs. Both $\rho$ and $r$ describe the strength of the linear relationship between two variables; however, $r$ is an estimate of $\rho$ obtained from sample data.
The linear correlation coefficient of two variables lies between $-1$ and $1$. If $\rho = 0$ the variables are linearly uncorrelated. Thus there is no linear relationship between the variables. If $\rho \ne 0$ the variables are linearly correlated. If $\rho > 0$, the variables are positively linearly correlated, and the variables are negatively linearly correlated if $\rho < 0$.
A commonly used statistic to calculate the linear relationship between quantitative variables is the Pearson product-moment correlation coefficient. It is given by
$$r = \frac {\sum_{i = 1}^{n} (x_{i} - \bar {x}) (y_{i} - \bar {y})} {\sqrt {\sum_{i = 1}^{n} (x_{i} - \bar {x})^{2}} \sqrt {\sum_{i = 1}^{n}(y_{i} - \bar {y})^{2}}} = \frac {s_{xy}} {s_{x} s_{y}}$$
where $s_{xy}$ is the covariance of $x$ and $y$, and $s_{x}$ and $s_{y}$ are the standard deviations of $x$ and $y$, respectively.
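As a quick sanity check, the definition above can be evaluated directly with NumPy and compared against the built-in np.corrcoef function; the data values below are made up purely for illustration:

```python
import numpy as np

# Toy data, invented for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 5.0])

# Pearson r straight from the definition
r_manual = np.sum((x - x.mean()) * (y - y.mean())) / (
    np.sqrt(np.sum((x - x.mean()) ** 2)) * np.sqrt(np.sum((y - y.mean()) ** 2))
)

# Same coefficient via NumPy's built-in correlation matrix
r_numpy = np.corrcoef(x, y)[0, 1]

print(r_manual, r_numpy)
```

Both computations agree, which confirms that np.corrcoef implements exactly the Pearson product-moment formula.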
Since the sample linear correlation coefficient, $r$, is an estimate of the population linear correlation coefficient, $\rho$, we may use $r$ for a hypothesis test for $\rho$. The test statistic for a correlation test has a t-distribution with $n - 2$ degrees of freedom and may be written as:
$$t= \frac{r} {\sqrt {\frac {1 - r^{2}} {n - 2} } }$$
In order to practice the correlation t-test, we load the students data set. You may download the students.csv file here and import it from your local file system, or you may load it directly as a web resource. In either case, you import the data set into Python as a pandas DataFrame object by using the read_csv method:
Note: Make sure the numpy and pandas packages are part of your mamba environment!
import pandas as pd
import numpy as np
students = pd.read_csv("https://userpage.fu-berlin.de/soga/data/raw-data/students.csv")
The students data set consists of 8239 rows, each representing a particular student, and 16 columns, each corresponding to a variable/feature related to that particular student. The variable names are self-explanatory.
In order to showcase the correlation t-test, we examine the relationship between the variables score1 and score2, which contain the results of two mandatory statistics exams. The question is whether a linear relationship exists between the grades of two consecutive statistics exams.
We draw a random sample of 50 students, restricted to the score1 and score2 variables, and account for any NaN value in the data set by applying the <df>.dropna() function.
n = 50
sample = students[["score1", "score2"]].dropna().sample(n, random_state = 9)
For visual inspection, we plot the random sample in the form of a scatter plot by using the scatterplot() function provided by the seaborn package:
Note: Ensure the seaborn package is part of your mamba environment!
import seaborn as sns
sns.scatterplot(data = sample, x = "score1", y = "score2")
<Axes: xlabel='score1', ylabel='score2'>
The visual inspection indicates a positive linear relationship between the variables score1 and score2.
In order to conduct the correlation t-test, we follow the same step-wise procedure for hypothesis testing as discussed in the previous sections:
Step 1: State the null hypothesis $H_{0}$ and alternative hypothesis $H_{A}$
The null hypothesis states that there is no linear relationship between the grades of two consecutive statistics exams:
$$H_{0} : \rho = 0$$
Recall that the formulation of the alternative hypothesis dictates whether we apply a two-sided, a left-tailed or a right-tailed hypothesis test.
Alternative hypothesis:
$$H_{A} : \rho \ne 0$$
This formulation results in a two-sided hypothesis test.
Step 2: Decide on the significance level, $\alpha$
$$\alpha = 0.01$$
alpha = 0.01
Steps 3 and 4: Compute the value of the test statistic and the p-value
For illustration purposes, we manually compute the test statistic in Python. Recall the equation from above:
$$t = \frac {r} {\sqrt {\frac {1 - r^{2}} {n - 2}}}$$
r = np.corrcoef(sample["score1"], sample["score2"])[0, 1]
t_value = r / np.sqrt((1 - r**2) / (n - 2))
t_value
12.453777135382685
Note: The corrcoef() function provided by the numpy package returns the correlation matrix based on the Pearson product-moment correlation coefficient. Since we have two input variables, the result is a $2 \times 2$ matrix with four elements. The Pearson product-moment correlation coefficients of interest are given within the matrix at the indices [0, 1] and [1, 0]. Additional information about the corrcoef() function is provided in numpy's function documentation.
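A minimal sketch of this matrix structure, using made-up data for illustration:

```python
import numpy as np

# Made-up data for illustration
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.5, 1.0, 3.5, 4.0])

R = np.corrcoef(x, y)

# The diagonal holds the correlation of each variable with itself (always 1);
# the symmetric off-diagonal entries hold the coefficient of interest.
print(R.shape)  # (2, 2)
print(R[0, 1] == R[1, 0])
```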
The numerical value of the test statistic is 12.4537771.
In order to calculate the p-value, we apply the t.cdf function provided by the scipy package to calculate the probability of occurrence for the test statistic based on the t-distribution. To do so, we also need the degrees of freedom. Recall how to calculate the degrees of freedom:
Note: Make sure the scipy package is part of your mamba environment!
from scipy.stats import t
df = n - 2
p_value_lower = t.cdf(-np.abs(t_value), df)
p_value_upper = 1 - t.cdf(np.abs(t_value), df)
p_value = p_value_lower + p_value_upper
p_value
6.009746324118586e-17
$p = 6.0097463241 \times 10^{-17}$
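Since the t-distribution is symmetric, the two tail probabilities are equal, so the two-sided p-value can be written more compactly with SciPy's survival function t.sf. A small sketch with a made-up test statistic:

```python
import numpy as np
from scipy.stats import t

t_value = 2.5  # hypothetical test statistic, for illustration only
df = 48        # degrees of freedom for a sample of n = 50

# Two-sided p-value: twice the upper-tail probability
p_two_sided = 2 * t.sf(np.abs(t_value), df)

# Identical to summing the lower and upper tails explicitly
p_tails = t.cdf(-np.abs(t_value), df) + (1 - t.cdf(np.abs(t_value), df))
print(np.isclose(p_two_sided, p_tails))  # True
```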
Step 5: If $p \le \alpha$, reject $H_{0}$; otherwise, do not reject $H_{0}$
# reject H0?
p_value < alpha
True
The p-value is smaller than the specified significance level of 0.01; we reject $H_{0}$. The test results are statistically significant at the 1 % level and provide very strong evidence against the null hypothesis.
Step 6: Interpret the result of the hypothesis test
At the 1 % significance level, the data provide very strong evidence to conclude that students' exam grades are linearly correlated.
Correlation t-test with scipy
We just ran a correlation t-test in Python manually. We can do the same with just one line of code by using the power of Python's package universe, namely the scipy package!
In order to conduct a correlation t-test in Python via the stats module of the scipy package, we apply the pearsonr() function. We only have to provide our data separately as numpy arrays. Furthermore, you can specify the test method via the alternative argument. Because the default is already set to two-sided, we do not have to adapt it. Additional information regarding the function's usage can be found directly in scipy's function documentation.
from scipy.stats import pearsonr
test_result = pearsonr(sample["score1"], sample["score2"])
test_result
PearsonRResult(statistic=0.8738759767952878, pvalue=1.2019492648237166e-16)
The pearsonr() function returns an object which provides all relevant information regarding the performed correlation t-test, including the Pearson product-moment correlation coefficient as well as the corresponding p-value of the test result. In detail, the object consists of the following properties:
<object>.statistic represents the Pearson product-moment correlation coefficient $r$.
<object>.pvalue represents the p-value of the performed significance test.
Consequently, the p-value is retrieved via:
test_result.pvalue
1.2019492648237166e-16
Additionally, the returned object allows calculating the confidence interval for $r$ based on a given confidence level via the confidence_interval() method. A confidence level of 0.95 (i.e., $\alpha = 0.05$) is the default:
test_result.confidence_interval()
ConfidenceInterval(low=0.7869461225871097, high=0.9267900101824088)
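As a small self-contained sketch with made-up data: passing a higher confidence_level to the method produces a wider interval, since more confidence requires covering more of the sampling distribution.

```python
import numpy as np
from scipy.stats import pearsonr

# Made-up data for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.2, 1.9, 3.4, 3.9, 5.1, 6.3])

res = pearsonr(x, y)
ci95 = res.confidence_interval()                       # default confidence level 0.95
ci99 = res.confidence_interval(confidence_level=0.99)  # wider interval

# The 99 % interval contains the 95 % interval
print(ci99.low < ci95.low and ci95.high < ci99.high)  # True
```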
Perfect! Based on the returned test result, we may conclude that, at the 1 % significance level, the data provide very strong evidence that students' exam grades are linearly correlated.
Citation
The E-Learning project SOGA-Py was developed at the Department of Earth Sciences by Annette Rudolph, Joachim Krois and Kai Hartmann. You can reach us via mail by soga[at]zedat.fu-berlin.de.
Please cite as follows: Rudolph, A., Krois, J., Hartmann, K. (2023): Statistics and Geodata Analysis using Python (SOGA-Py). Department of Earth Sciences, Freie Universitaet Berlin.