Correlation analysis is a widely used method to examine the relationship between quantitative variables. The most common statistic is the linear correlation coefficient, $r$, which is also known as the Pearson product moment correlation coefficient in honor of its developer, Karl Pearson. It is given by
$$r = \frac{\sum_{i=1}^n(x_i- \bar x)(y_i - \bar y)}{\sqrt{\sum_{i=1}^n(x_i- \bar x)^2}\sqrt{\sum_{i=1}^n(y_i- \bar y)^2}}=\frac{s_{xy}}{s_x s_y}\text{,}$$where $s_{xy}$ is the covariance of $x$ and $y$, and $s_x$ and $s_y$ are the standard deviations of $x$ and $y$, respectively. By dividing by the sample standard deviations, $s_x$ and $s_y$, the linear correlation coefficient, $r$, becomes scale independent and takes values between $-1$ and $1$.
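This scale independence is easy to verify numerically: rescaling $x$ and $y$ by positive linear transformations (e.g. a change of units) leaves $r$ unchanged, as the following minimal sketch with synthetic data shows.
# A minimal sketch (synthetic data): r is invariant under positive linear rescaling.
import numpy as np

np.random.seed(0)
x = np.random.uniform(low=-1, high=1, size=50)
y = 3 * x + np.random.normal(loc=0, scale=0.5, size=50)

r_original = np.corrcoef(x, y)[0, 1]
r_rescaled = np.corrcoef(100 * x + 7, 0.5 * y - 3)[0, 1]  # e.g. different units
print(r_original, r_rescaled)  # identical up to floating-point rounding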
The linear correlation coefficient measures the strength of the linear relationship between two variables. If $r$ is close to $\pm 1$, the two variables are highly correlated and if plotted on a scatter plot, the data points cluster around a line. If $r$ is far from $\pm 1$, the data points are more widely scattered. If $r$ is near $0$, the data points are essentially scattered around a horizontal line indicating that there is almost no linear relationship between the variables.
# First, let's import all the needed libraries.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import scipy.stats as stats

# Seed NumPy's random number generator; all draws below use np.random,
# so seeding the standard library's random module would have no effect.
np.random.seed(1)
n = 100
# plot1
x1 = np.random.uniform(low=-1, high=1.0, size=n)
y1 = x1 * 3 + np.random.normal(loc=0, scale=0.15, size=n)
# plot2
x2 = np.random.uniform(low=-1, high=1.0, size=n)
y2 = x2 * -3 + np.random.normal(loc=0, scale=0.15, size=n)
# plot3
x3 = np.random.normal(loc=0, scale=1, size=n)
y3 = np.random.normal(loc=0, scale=1, size=n)
# plot4
x4 = np.random.uniform(low=-1, high=1.0, size=n)
y4 = x4 * 3 + np.random.normal(loc=0, scale=2, size=n)
# plot5
x5 = np.random.uniform(low=-1, high=1.0, size=n)
y5 = x5 * -3 + np.random.normal(loc=0, scale=2, size=n)
# plot6
x6 = np.random.uniform(low=-1, high=1.0, size=n)
y6 = x6**2 + np.random.normal(loc=0, scale=0.08, size=n)
### generate plot ###
fig, axs = plt.subplots(2, 3, figsize=(12, 7))
# plot 1
coef = np.polyfit(x1, y1, 1)
lm = np.poly1d(coef) # lm = function which takes in x and returns an estimate for y
axs[0, 0].plot(x1, y1, "bo", x1, lm(x1), "-r", linewidth=2.5)
axs[0, 0].get_xaxis().set_visible(False)
axs[0, 0].get_yaxis().set_visible(False)
axs[0, 0].title.set_text("strong positive linear correlation")
axs[0, 0].text(x=-1, y=2, s=f"r = {round(np.corrcoef(x1, y1)[0, 1], 3)}")
# plot 2
coef = np.polyfit(x2, y2, 1)
lm = np.poly1d(coef)
axs[0, 1].plot(x2, y2, "bo", x2, lm(x2), "-r", linewidth=2.5)
axs[0, 1].get_xaxis().set_visible(False)
axs[0, 1].get_yaxis().set_visible(False)
axs[0, 1].title.set_text("strong negative linear correlation")
axs[0, 1].text(x=-1, y=-2, s=f"r = {round(np.corrcoef(x2, y2)[0, 1], 3)}")
# plot 3 (no line fit needed; the data scatter around a horizontal line)
axs[0, 2].plot(x3, y3, "bo")
axs[0, 2].axhline(y=0, color="r", linestyle="-", linewidth=2.5)
axs[0, 2].get_xaxis().set_visible(False)
axs[0, 2].get_yaxis().set_visible(False)
axs[0, 2].title.set_text("no linear correlation")
axs[0, 2].text(x=1.5, y=2, s=f"r = {round(np.corrcoef(x3, y3)[0, 1], 3)}")
# plot 4
coef = np.polyfit(x4, y4, 1)
lm = np.poly1d(coef)
axs[1, 0].plot(x4, y4, "bo", x4, lm(x4), "-r", linewidth=2.5)
axs[1, 0].get_xaxis().set_visible(False)
axs[1, 0].get_yaxis().set_visible(False)
axs[1, 0].title.set_text("weak to medium positive\nlinear correlation")
axs[1, 0].text(x=-1, y=3, s=f"r = {round(np.corrcoef(x4, y4)[0, 1], 3)}")
# plot 5
coef = np.polyfit(x5, y5, 1)
lm = np.poly1d(coef)
axs[1, 1].plot(x5, y5, "bo", x5, lm(x5), "-r", linewidth=2.5)
axs[1, 1].get_xaxis().set_visible(False)
axs[1, 1].get_yaxis().set_visible(False)
axs[1, 1].title.set_text("weak to medium negative\nlinear correlation")
axs[1, 1].text(x=0.3, y=4, s=f"r = {round(np.corrcoef(x5, y5)[0, 1], 3)}")
# plot 6 (quadratic relationship; a straight-line fit would be misleading)
axs[1, 2].plot(x6, y6, "bo")
axs[1, 2].plot(np.sort(x6), np.sort(x6) ** 2, "-r", linewidth=2.5)
axs[1, 2].get_xaxis().set_visible(False)
axs[1, 2].get_yaxis().set_visible(False)
axs[1, 2].title.set_text("no linear correlation")
axs[1, 2].text(x=0, y=0.9, s=f"r = {round(np.corrcoef(x6, y6)[0, 1], 3)}")
plt.show()
An interesting property of $r$ is that its sign reflects the slope of the linear relationship between two variables. A positive value of $r$ suggests that the variables are positively linearly correlated, indicating that $y$ tends to increase linearly as $x$ increases. A negative value of $r$ suggests that the variables are negatively linearly correlated, indicating that $y$ tends to decrease linearly as $x$ increases.
There is no unambiguous classification rule for the strength of a linear relationship between two variables. However, the following table may serve as a rule of thumb for interpreting the absolute value, $|r|$, of the Pearson product moment correlation coefficient:
$$ \begin{array}{lc} \hline \ \text{Strong linear relationship} & |r| > 0.9 \\ \ \text{Medium linear relationship} & 0.7 < |r| \le 0.9\\ \ \text{Weak linear relationship} & 0.5 < |r| \le 0.7 \\ \ \text{No or doubtful linear relationship} & 0 < |r| \le 0.5 \\ \hline \end{array} $$
Pearson's correlation assumes the variables to be roughly normally distributed, and it is not robust in the presence of outliers.
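The sensitivity to outliers is easy to demonstrate. In the following minimal sketch (synthetic data), a single extreme observation is enough to push a strong positive correlation down dramatically:
# A minimal sketch (synthetic data): Pearson's r is not robust to outliers.
import numpy as np

np.random.seed(2)
x = np.random.uniform(low=-1, high=1, size=30)
y = 3 * x + np.random.normal(loc=0, scale=0.5, size=30)
print(np.corrcoef(x, y)[0, 1])  # close to 1

# add one extreme observation
x_out = np.append(x, 10)
y_out = np.append(y, -30)
print(np.corrcoef(x_out, y_out)[0, 1])  # r drops sharply and even changes sign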
In a later section on linear regression we discuss the coefficient of determination, $R^2$, a descriptive measure for the quality of linear models. There is a close relation between $R^2$ and the linear correlation coefficient, $r$. The coefficient of determination, $R^2$, equals the square of the linear correlation coefficient, $r$:
$$\text{coefficient of determination } (R^2) = r^2$$
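This identity is easy to check numerically. A minimal sketch with synthetic data, using scipy.stats.linregress(), whose rvalue attribute is Pearson's $r$:
# A minimal sketch (synthetic data): R^2 of a simple linear fit equals r squared.
import numpy as np
import scipy.stats as stats

np.random.seed(3)
x = np.random.uniform(low=-1, high=1, size=50)
y = 2 * x + np.random.normal(loc=0, scale=0.7, size=50)

r = np.corrcoef(x, y)[0, 1]
fit = stats.linregress(x, y)  # fit.rvalue is Pearson's r
print(r**2, fit.rvalue**2)  # identical up to floating-point rounding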
In order to get some intuition we calculate the Pearson product moment correlation coefficient in an example. To that end, we load the students data set into our workspace (you may download the students.csv file or read it directly from its URL using pd.read_csv()).
students = pd.read_csv(
"https://userpage.fu-berlin.de/soga/200/2010_data_sets/students.csv"
)
df_rows, df_cols = students.shape
df_colnames = students.columns
n = 37
The students data set consists of 8239 rows, each of them representing a particular student, and 16 columns, each of them corresponding to a variable/feature related to that particular student. These self-explanatory variables are: ['stud.id', 'name', 'gender', 'age', 'height', 'weight', 'religion', 'nc.score', 'semester', 'major', 'minor', 'score1', 'score2', 'online.tutorial', 'graduated', 'salary'].
In this example we assess the linear relationship between the weight and the height of students. For this we randomly pick 37 students and extract the weight and height variables from the data set.
sample = students.sample(n=n)  # draw a random sample of 37 students
weight = sample["weight"]
height = sample["height"]
plt.plot(height, weight, "o")
plt.xlabel("height")
plt.ylabel("weight")
plt.show()
The scatter plot indicates that there exists a linear relationship between the two variables under consideration.
For the sake of this exercise we calculate the linear correlation coefficient by hand at first, and then we apply the np.corrcoef() function from the NumPy package. Recall the equation from above:
x = height.copy()
y = weight.copy()
x_bar = np.mean(height)
y_bar = np.mean(weight)
np.sum((x - x_bar) * (y - y_bar)) / (
    np.sqrt(np.sum((x - x_bar) ** 2)) * np.sqrt(np.sum((y - y_bar) ** 2))
)
0.9734419536082125
As a sanity check we calculate the ratio of the covariance of $x$ and $y$ to the product of the standard deviations of $x$ and $y$:
$$r = \frac{s_{xy}}{s_x s_y}$$
# np.cov() returns the 2x2 covariance matrix, whose off-diagonal entry [0, 1]
# is the sample covariance s_xy. Since np.cov() uses the sample estimator
# (ddof=1), the standard deviations must be computed with ddof=1 as well.
np.cov(x, y)[0, 1] / (np.std(x, ddof=1) * np.std(y, ddof=1))
0.9734419536082125
Finally, we apply the built-in np.corrcoef() function:
np.corrcoef(x, y)[0, 1] ## indicating [0,1] will give the right output
0.9734419536082127
Perfect. The three calculations yield the same result, up to floating-point rounding! The linear correlation coefficient evaluates to $r \approx 0.973$. Thus, we may conclude that there is a strong linear correlation between the height and the weight of a student.
Of course, a correlation analysis is not restricted to two variables. Thanks to statistical programming languages such as Python, we are able to conduct a pairwise correlation analysis for more than two variables. Let us first prepare the data set. For a better visualization experience we draw 100 randomly picked students from the students data set. Then we select a number of variables to perform the correlation analysis on. The DataFrame.corr() method comes in handy for correlating multiple columns of a pandas DataFrame.
n = 100
sample = students.sample(n=n)  # draw a fresh random sample of 100 students
cols = [
    "height",
    "weight",
    "nc.score",
    "score1",
    "score2",
    "salary",
]  # select variables
students_sample = sample[cols]
students_sample.corr(method="pearson")
|          | height    | weight    | nc.score  | score1    | score2    | salary   |
|----------|-----------|-----------|-----------|-----------|-----------|----------|
| height   | 1.000000  | 0.958346  | 0.076327  | -0.024130 | 0.032678  | 0.495396 |
| weight   | 0.958346  | 1.000000  | 0.083525  | -0.081879 | -0.027481 | 0.456302 |
| nc.score | 0.076327  | 0.083525  | 1.000000  | -0.059677 | 0.013939  | 0.076737 |
| score1   | -0.024130 | -0.081879 | -0.059677 | 1.000000  | 0.920166  | 0.363850 |
| score2   | 0.032678  | -0.027481 | 0.013939  | 0.920166  | 1.000000  | 0.404771 |
| salary   | 0.495396  | 0.456302  | 0.076737  | 0.363850  | 0.404771  | 1.000000 |
The DataFrame.corr() method returns a nice table, also called a correlation matrix, with the pairwise Pearson correlation coefficients. A table is a nice representation for a correlation analysis, but a figure would of course improve the interpretability. The seaborn package provides the pairplot() function for easily visualizing all pairwise relationships in a data set.
g = sns.pairplot(students_sample, height=1.7, aspect=0.9)  # returns a PairGrid
plt.tight_layout()
plt.show()
Immediately, we realize that the majority of the variables do not appear to be linearly correlated. In contrast, the variable pairs height and weight, as well as score1 and score2, appear to be positively correlated.
The seaborn package also allows for more advanced pair plots. It is thus a very flexible package, which comes with many nice plotting features. The example code below defines a custom function, corr_dot, which allows the correlation coefficient to be plotted as well.
def corr_dot(*args, **kwargs):
    # Pearson correlation of the two series handed over by PairGrid.map_upper()
    corr_r = args[0].corr(args[1], "pearson")
    corr_text = f"{corr_r:2.2f}"
    ax = plt.gca()
    ax.set_axis_off()
    # a dot in the panel centre; its size and colour encode the coefficient
    marker_size = abs(corr_r) * 10000
    ax.scatter(
        [0.5],
        [0.5],
        marker_size,
        [corr_r],
        alpha=0.6,
        cmap="coolwarm",
        vmin=-1,
        vmax=1,
        transform=ax.transAxes,
    )
    # print the coefficient; stronger correlations get a larger font
    font_size = abs(corr_r) * 40 + 5
    ax.annotate(
        corr_text,
        [0.5, 0.5],
        xycoords="axes fraction",
        ha="center",
        va="center",
        fontsize=font_size,
    )
sns.set(style="white", font_scale=1.6)
g = sns.PairGrid(students_sample, aspect=1.4, diag_sharey=False)
g.map_lower(plt.scatter)  # scatter plots in the lower triangle
g.map_diag(sns.histplot, kde=True, line_kws={"color": "black"})  # histograms plus density curves
g.map_upper(corr_dot)  # correlation dots in the upper triangle
plt.show()
The code above returns a plot, which shows scatter plots, the variable histograms and the density curves as well as the correlation coefficients.
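If a more compact overview is preferred, the correlation matrix itself can be visualized. A minimal sketch using seaborn's heatmap() function on the students_sample DataFrame from above:
# A minimal sketch: the correlation matrix rendered as an annotated heatmap.
fig, ax = plt.subplots(figsize=(6, 5))
sns.heatmap(
    students_sample.corr(method="pearson"),
    annot=True,  # write the coefficients into the cells
    fmt=".2f",
    cmap="coolwarm",
    vmin=-1,
    vmax=1,
    ax=ax,
)
plt.tight_layout()
plt.show()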
Note: Correlation functions implemented in Python cover different types of correlation coefficients, such as Pearson's, Spearman's and Kendall's. The DataFrame.corr() method supports all three; to pick one particular formula you add the method argument ("pearson", "spearman" or "kendall") to the function call, Pearson's correlation coefficient being the default setting. np.corrcoef(), in contrast, computes Pearson's coefficient only, while the scipy.stats module provides the dedicated functions pearsonr(), spearmanr() and kendalltau().
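For instance, reusing the students_sample DataFrame from above, the rank-based variants of the correlation matrix are obtained as follows:
# rank-based correlation matrices via the method argument of DataFrame.corr()
students_sample.corr(method="spearman")
students_sample.corr(method="kendall")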
Spearman's rank correlation coefficient, also known as Spearman's $\rho$, is a non-parametric rank correlation coefficient developed by Charles Spearman as an alternative to the Pearson product moment correlation coefficient. It is denoted by $r_s$ for sample data and by $\rho_s$ for population data (Mann 2012). The coefficient assesses the monotonic relationship between two variables and ranges between $-1$ and $1$: it describes the linear correlation between the ranks of the data of variables $x$ and $y$. Spearman's correlation is high when the observations of the two variables have similar ranks and low when their ranks are dissimilar.
To calculate $r_s$ the data for each variable, $x$ and $y$, is ranked separately. For a given bivariate sequence $(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)$ Spearman's $r_s$ is given by
$$r_s=1-\frac{6\sum_{i=1}^n (r_{xi}-r_{yi})^2}{n(n^2-1)}\text{,}$$where $r_{xi}=\text{Rank}(x_i)$, $r_{yi}=\text{Rank}(y_i)$ and $n$ is the sample size. Note that this simplified formula is exact only when there are no ties in the data; tied values are usually assigned their average rank.
In contrast to Pearson's linear correlation coefficient, Spearman's rank correlation coefficient is appropriate for both quantitative and ordinal variables. In addition, rank-based correlations do not depend on the assumption of a normal distribution and are more resistant to outliers (Schumann 2010).
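A minimal sketch (synthetic data) illustrates the difference: for a strictly monotonic but nonlinear relationship, Spearman's coefficient equals 1 because the ranks match perfectly, while Pearson's coefficient is noticeably smaller.
# A minimal sketch (synthetic data): Spearman measures monotonic association.
import numpy as np
import scipy.stats as stats

np.random.seed(4)
x = np.random.uniform(low=0, high=4, size=50)
y = np.exp(x)  # strictly monotonic, but clearly nonlinear

print(stats.pearsonr(x, y)[0])   # noticeably below 1
print(stats.spearmanr(x, y)[0])  # exactly 1, since the ranks agree perfectly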
Let us consider an example. The population of a community along a river believes that recent increases in peak discharge rates are due to deforestation by a logging company. We calculate Spearman's rank correlation coefficient to assess whether there is a correlation between peak discharge and the fraction of deforested area in the watershed (data modified after McCuen 2003, p. 112).
def cf2m3(q):
    # convert discharge from cubic feet per second to cubic metres per second
    return q * 0.028316846592
Q = np.array([8000, 8800, 7400, 6700, 11100, 12200, 5700, 9400, 14200, 7600, 5800, 14300, 11600, 10400])
np.round(cf2m3(Q))
array([227., 249., 210., 190., 314., 345., 161., 266., 402., 215., 164., 405., 328., 294.])
Let us construct our data vectors. We assign the discharge values to the variable Q and the logging area to the variable logged.
Q = np.array([227, 249, 210, 190, 314, 345, 161, 266, 402, 215, 164, 405, 328, 294])
logged = np.array([53, 56, 57, 58, 55, 54, 51, 50, 49, 47, 46, 44, 43, 42])
First, we calculate Spearman's rank correlation coefficient by hand. Recall the equation
$$r_s=1-\frac{6\sum_{i=1}^n (r_{xi}-r_{yi})^2}{n(n^2-1)}\text{,}$$where $r_{xi}=\text{Rank}(x_i)$, $r_{yi}=\text{Rank}(y_i)$ and $n$ is the sample size.
We apply the rankdata() function from the scipy.stats module to calculate the rank of the values of each variable.
# set number of observations
n = len(Q)
# calculate rank
r_xi = stats.rankdata(Q)
r_yi = stats.rankdata(logged)
# plug into equation
rs = 1 - ((6 * np.sum((r_xi - r_yi) ** 2)) / (n * (n**2 - 1)))
rs
-0.34065934065934056
Alternatively, we may apply the spearmanr() function from the scipy.stats module.
stats.spearmanr(Q, logged)
SpearmanrResult(correlation=-0.3406593406593406, pvalue=0.23331605103682404)
Spearman's rank correlation coefficient is nothing other than Pearson's linear correlation coefficient computed on the ranked data. Thus, the following line of code should reproduce the previous results:
np.corrcoef(stats.rankdata(Q), stats.rankdata(logged))[0, 1]
-0.3406593406593406
Perfect! We got the same result using all three calculations, which yield a Spearman's rank correlation coefficient of $r_s \approx -0.341$. The result indicates that there is no more than a weak negative correlation between peak discharge and logging area. In other words, the discharge tends, if anything, to decrease when the logging area increases. Thus, the perception of the population cannot be confirmed by our statistical analysis.
Note: It is always recommended to follow up with a statistical test, in order to assess whether the result is statistically significant or whether the variation is just due to chance.
Check out the sections on Hypothesis Testing for further information!
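As a quick preview, scipy.stats already provides such a test: spearmanr() returns the coefficient together with a two-sided p-value for the null hypothesis of zero correlation.
# spearmanr() returns the coefficient and a two-sided p-value; here
# p ≈ 0.233 > 0.05, so the weak negative correlation is not significant.
rs, p_value = stats.spearmanr(Q, logged)
print(f"r_s = {rs:.3f}, p = {p_value:.3f}")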
Citation
The E-Learning project SOGA-Py was developed at the Department of Earth Sciences by Annette Rudolph, Joachim Krois and Kai Hartmann. You can reach us via e-mail at soga[at]zedat.fu-berlin.de.
Please cite as follows: Rudolph, A., Krois, J., Hartmann, K. (2023): Statistics and Geodata Analysis using Python (SOGA-Py). Department of Earth Sciences, Freie Universitaet Berlin.