Correlation is a commonly used method to examine the relationship between **quantitative variables**.
The most commonly used statistic is the **linear correlation coefficient**, $r$, which is also known as the **Pearson product moment correlation coefficient** in honor of its developer, Karl Pearson.
It is given by

$$r = \frac{s_{xy}}{s_x s_y} = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^n (y_i - \bar{y})^2}},$$

where $s_{xy}$ is the covariance of $x$ and $y$, and $s_x$ and $s_y$ are the standard deviations of $x$ and $y$, respectively. Dividing by the sample standard deviations, $s_x$ and $s_y$, makes the linear correlation coefficient, $r$, scale independent, so that it takes values between $-1$ and $1$.
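This scale independence can be checked directly. The following sketch uses made-up toy data; it verifies that a positive linear rescaling of $x$ (e.g. a change of units) leaves $r$ unchanged:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 2 * x + rng.normal(scale=0.5, size=50)

r_original = np.corrcoef(x, y)[0, 1]
# rescale x linearly; the correlation coefficient is unaffected
r_rescaled = np.corrcoef(10 * x + 5, y)[0, 1]

print(np.isclose(r_original, r_rescaled))  # True
```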

The linear correlation coefficient measures the strength of the linear relationship between two variables. If $r$ is close to $\pm 1$, the two variables are highly correlated and if plotted on a scatter plot, the data points cluster around a line. If $r$ is far from $\pm 1$, the data points are more widely scattered. If $r$ is near $0$, the data points are essentially scattered around a horizontal line indicating that there is almost no linear relationship between the variables.

In [2]:

```
# First, let's import all the needed libraries.
import random
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import scipy.stats as stats
```

In [3]:

```
np.random.seed(1)  # seed NumPy's global RNG, which np.random.uniform/normal below use
n = 100
xlim = [-1, 1]
ylim = [-3, 3]
```

In [4]:

```
# plot1
x1 = np.random.uniform(low=-1, high=1.0, size=n)
y1 = x1 * 3 + np.random.normal(loc=0, scale=0.15, size=n)
# plot2
x2 = np.random.uniform(low=-1, high=1.0, size=n)
y2 = x2 * -3 + np.random.normal(loc=0, scale=0.15, size=n)
# plot3
x3 = np.random.normal(loc=0, scale=1, size=n)
y3 = np.random.normal(loc=0, scale=1, size=n)
# plot4
x4 = np.random.uniform(low=-1, high=1.0, size=n)
y4 = x4 * 3 + np.random.normal(loc=0, scale=2, size=n)
# plot5
x5 = np.random.uniform(low=-1, high=1.0, size=n)
y5 = x5 * -3 + np.random.normal(loc=0, scale=2, size=n)
# plot6
x6 = np.random.uniform(low=-1, high=1.0, size=n)
y6 = x6**2 + np.random.normal(loc=0, scale=0.08, size=n)
```

In [5]:

```
### generate plot ###
fig, axs = plt.subplots(2, 3, figsize=(12, 7))
# plot 1
coef = np.polyfit(x1, y1, 1)
lm = np.poly1d(coef) # lm = function which takes in x and returns an estimate for y
axs[0, 0].plot(x1, y1, "bo", x1, lm(x1), "-r", linewidth=2.5)
axs[0, 0].get_xaxis().set_visible(False)
axs[0, 0].get_yaxis().set_visible(False)
axs[0, 0].title.set_text("strong positive linear correlation")
axs[0, 0].text(x=-1, y=2, s=f"r ={round(np.corrcoef(x1,y1)[0,1],3)}")
# plot 2
coef = np.polyfit(x2, y2, 1)
lm = np.poly1d(coef)
axs[0, 1].plot(x2, y2, "bo", x2, lm(x2), "-r", linewidth=2.5)
axs[0, 1].get_xaxis().set_visible(False)
axs[0, 1].get_yaxis().set_visible(False)
axs[0, 1].title.set_text("strong negative linear correlation")
axs[0, 1].text(x=-1, y=-2, s=f"r ={round(np.corrcoef(x2,y2)[0,1],3)}")
# plot 3
coef = np.polyfit(x3, y3, 1)
lm = np.poly1d(coef)
axs[0, 2].plot(x3, y3, "bo")
axs[0, 2].axhline(y=0, color="r", linestyle="-", linewidth=2.5)
axs[0, 2].get_xaxis().set_visible(False)
axs[0, 2].get_yaxis().set_visible(False)
axs[0, 2].title.set_text("no linear correlation")
axs[0, 2].text(x=1.5, y=2, s=f"r ={round(np.corrcoef(x3,y3)[0,1],3)}")
# plot 4
coef = np.polyfit(x4, y4, 1)
lm = np.poly1d(coef)
axs[1, 0].plot(x4, y4, "bo", x4, lm(x4), "-r", linewidth=2.5)
axs[1, 0].get_xaxis().set_visible(False)
axs[1, 0].get_yaxis().set_visible(False)
axs[1, 0].title.set_text("weak to medium positive\nlinear correlation")
axs[1, 0].text(x=-1, y=3, s=f"r ={round(np.corrcoef(x4,y4)[0,1],3)}")
# plot 5
coef = np.polyfit(x5, y5, 1)
lm = np.poly1d(coef)
axs[1, 1].plot(x5, y5, "bo", x5, lm(x5), "-r", linewidth=2.5)
axs[1, 1].get_xaxis().set_visible(False)
axs[1, 1].get_yaxis().set_visible(False)
axs[1, 1].title.set_text("weak to medium negative\nlinear correlation")
axs[1, 1].text(x=0.3, y=4, s=f"r ={round(np.corrcoef(x5,y5)[0,1],3)}")
# plot 6
coef = np.polyfit(x6, y6, 1)
lm = np.poly1d(coef)
axs[1, 2].plot(x6, y6, "bo")
axs[1, 2].plot(np.sort(x6), np.sort(x6) ** 2, "-r", linewidth=2.5)
axs[1, 2].get_xaxis().set_visible(False)
axs[1, 2].get_yaxis().set_visible(False)
axs[1, 2].title.set_text("no linear correlation")
axs[1, 2].text(x=0, y=0.9, s=f"r ={round(np.corrcoef(x6,y6)[0,1],3)}")
plt.show()
```

An interesting property of $r$ is that its sign reflects the slope of the linear relationship between two variables.
A **positive value** of $r$ suggests that the variables are **positively linearly correlated**, indicating that $y$ tends to increase linearly as $x$ increases.
A **negative value** of $r$ suggests that the variables are **negatively linearly correlated**, indicating that $y$ tends to decrease linearly as $x$ increases.

There is no unambiguous classification rule for the strength of a linear relationship between two variables. However, the following table may serve as a rule of thumb for interpreting the numerical values of the Pearson product moment correlation coefficient:

$$ \begin{array}{lc} \hline \ \text{Strong linear relationship} & |r| > 0.9 \\ \ \text{Medium linear relationship} & 0.7 < |r| \le 0.9\\ \ \text{Weak linear relationship} & 0.5 < |r| \le 0.7 \\ \ \text{No or doubtful linear relationship} & 0 \le |r| \le 0.5 \\ \hline \end{array} $$

Pearson's correlation assumes the variables to be roughly normally distributed, and it is not robust in the presence of outliers.
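To see this lack of robustness, the following sketch (toy data, values chosen arbitrarily for illustration) adds a single extreme outlier to an otherwise strongly positively correlated sample and recomputes $r$:

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(-1, 1, size=30)
y = 3 * x + rng.normal(scale=0.1, size=30)
r_clean = np.corrcoef(x, y)[0, 1]

# one extreme point far off the linear trend
x_out = np.append(x, 10)
y_out = np.append(y, -30)
r_outlier = np.corrcoef(x_out, y_out)[0, 1]

print(round(r_clean, 2), round(r_outlier, 2))
```

A single point is enough here to flip the sign of $r$.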

In a later section on linear regression we discuss the coefficient of determination, $R^2$, a descriptive measure for the quality of linear models. There is a close relation between $R^2$ and the linear correlation coefficient, $r$. The coefficient of determination, $R^2$, equals the square of the linear correlation coefficient, $r$:

$$\text{coefficient of determination } (R^2) = r^2$$

In order to get some intuition we calculate the Pearson product moment correlation coefficient in an example.
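Before moving on, here is a quick numerical check of this identity on simulated toy data, fitting a least-squares line with `np.polyfit()` (the identity holds exactly for simple linear regression):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=50)
y = 3 * x + rng.normal(scale=0.5, size=50)

r = np.corrcoef(x, y)[0, 1]

# coefficient of determination: R^2 = 1 - SS_res / SS_tot
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)
r_squared = 1 - np.sum(residuals**2) / np.sum((y - np.mean(y)) ** 2)

print(np.isclose(r**2, r_squared))  # True
```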
Therefore, we load the *students* data set into our workspace. (You may download the `students.csv` file here or load it directly using `pd.read_csv()`.)

In [6]:

```
students = pd.read_csv(
    "https://userpage.fu-berlin.de/soga/200/2010_data_sets/students.csv"
)
```

In [7]:

```
df_rows, df_cols = students.shape
df_colnames = students.columns
n = 37  # sample size for the following example
```

The *students* data set consists of 8239 rows, each of them representing a particular student, and 16 columns, each of them corresponding to a variable/feature related to that particular student.
These self-explaining variables are: ['stud.id', 'name', 'gender', 'age', 'height', 'weight', 'religion', 'nc.score', 'semester', 'major', 'minor', 'score1', 'score2', 'online.tutorial', 'graduated', 'salary'].

In this example we assess **the linear relationship between the weight and the height of students**.
For this we randomly pick 37 students and extract the `weight` and `height` variables from the data set.

In [8]:

```
sample_idx = students.sample(n=n)
weight = sample_idx["weight"]
height = sample_idx["height"]
plt.plot(height, weight, "o")
plt.xlabel("height")
plt.ylabel("weight")
plt.show()
```

The scatter plot indicates that there exists a linear relationship between the two variables under consideration.

For the sake of this exercise we calculate the linear correlation coefficient by hand at first, and then we apply the `np.corrcoef()` function from the `NumPy` package in Python.
Recall the equation from above:

In [9]:

```
x = height.copy()
y = weight.copy()
x_bar = np.mean(height)
y_bar = np.mean(weight)
np.sum((x - x_bar) * (y - y_bar)) / (
    np.sqrt(np.sum((x - x_bar) ** 2)) * np.sqrt(np.sum((y - y_bar) ** 2))
)
```

Out[9]:

0.9734419536082125

As a sanity check we calculate the ratio of the covariance of $x$ and $y$ to the product of the standard deviations of $x$ and $y$:

$$r = \frac{s_{xy}}{s_x s_y}$$

In [10]:

```
np.cov(x, y)[0, 1] / (
    np.std(x, ddof=1) * np.std(y, ddof=1)
)  # np.cov defaults to ddof=1, so the standard deviations must use ddof=1 as well
```

Out[10]:

0.9734419536082125

Finally, we apply the built-in `np.corrcoef()` function:

In [11]:

```
np.corrcoef(x, y)[0, 1]  # entry [0, 1] of the correlation matrix is the correlation of x and y
```

Out[11]:

0.9734419536082127

Perfect. All three calculations yield the same result: the linear correlation coefficient evaluates to $r \approx 0.9734$. Thus, we may conclude that there is a strong linear correlation between the height and the weight of a student.
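As an aside, the `scipy.stats` module imported above also provides `stats.pearsonr()`, which returns the correlation coefficient together with a p-value for the hypothesis of zero correlation. A self-contained sketch on simulated data (the height- and weight-like numbers are made up for illustration):

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(7)
height = rng.normal(loc=170, scale=10, size=37)            # fictitious heights (cm)
weight = 0.9 * height - 80 + rng.normal(scale=3, size=37)  # fictitious weights (kg)

r, p = stats.pearsonr(height, weight)
print(round(r, 2), p < 0.05)
```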

Of course, a correlation analysis is not restricted to two variables.
Thanks to statistical programming languages such as Python, we are able to conduct a pairwise correlation analysis for **more than two variables**.
Let us first prepare the data set.
For a better visualization experience we draw 100 randomly picked students from the *students* data set.
Then we select a number of variables to perform the correlation analysis on. The `corr()` method comes in handy for correlating multiple columns of a pandas `DataFrame`.

In [12]:

```
n = 100
sample_idx = students.sample(n=n)
vars = [
    "height",
    "weight",
    "nc.score",
    "score1",
    "score2",
    "salary",
]  # select variables
students_sample = sample_idx[vars]
students_sample.corr(method="pearson")
```

Out[12]:

| | height | weight | nc.score | score1 | score2 | salary |
|---|---|---|---|---|---|---|
| height | 1.000000 | 0.958346 | 0.076327 | -0.024130 | 0.032678 | 0.495396 |
| weight | 0.958346 | 1.000000 | 0.083525 | -0.081879 | -0.027481 | 0.456302 |
| nc.score | 0.076327 | 0.083525 | 1.000000 | -0.059677 | 0.013939 | 0.076737 |
| score1 | -0.024130 | -0.081879 | -0.059677 | 1.000000 | 0.920166 | 0.363850 |
| score2 | 0.032678 | -0.027481 | 0.013939 | 0.920166 | 1.000000 | 0.404771 |
| salary | 0.495396 | 0.456302 | 0.076737 | 0.363850 | 0.404771 | 1.000000 |

The `corr()` method returns a nice table, also called a **correlation matrix**, with the pairwise Pearson correlation coefficients.
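The same matrix can be obtained with NumPy alone. The sketch below, using a small made-up `DataFrame`, cross-checks `DataFrame.corr()` against `np.corrcoef()`:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
df = pd.DataFrame({"a": rng.normal(size=50), "b": rng.normal(size=50)})
df["c"] = 2 * df["a"] + rng.normal(scale=0.1, size=50)

pd_corr = df.corr(method="pearson").to_numpy()
np_corr = np.corrcoef(df.to_numpy(), rowvar=False)  # columns as variables

print(np.allclose(pd_corr, np_corr))  # True
```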

A table is a nice representation for a correlation analysis, but a figure would of course improve the interpretability.
The `seaborn` package provides the `pairplot()` function for easily plotting the pairwise relationships in a data set as a scatter plot matrix.

In [13]:

```
plt.rcParams["figure.figsize"] = (5, 3)
ax = sns.pairplot(students_sample, height=1.7, aspect=0.9)
plt.tight_layout()
plt.show()
```

Immediately, we realize that the majority of the variable pairs do not appear to be linearly correlated.
In contrast, the variable pairs `height` and `weight`, as well as `score1` and `score2`, appear to be positively correlated.

The `seaborn` package also allows for more advanced pair plots. It is thus a very flexible package, which comes with many nice plotting features. The example code below shows the custom function `corr_dot`, which allows the correlation coefficient to be plotted as well.

In [14]:

```
def corr_dot(*args, **kwargs):
    corr_r = args[0].corr(args[1], "pearson")
    corr_text = f"{corr_r:2.2f}"
    ax = plt.gca()
    ax.set_axis_off()
    marker_size = abs(corr_r) * 10000
    ax.scatter(
        [0.5],
        [0.5],
        marker_size,
        [corr_r],
        alpha=0.6,
        cmap="coolwarm",
        vmin=-1,
        vmax=1,
        transform=ax.transAxes,
    )
    font_size = abs(corr_r) * 40 + 5
    ax.annotate(
        corr_text,
        [0.5, 0.5],
        xycoords="axes fraction",
        ha="center",
        va="center",
        fontsize=font_size,
    )


sns.set(style="white", font_scale=1.6)
g = sns.PairGrid(students_sample, aspect=1.4, diag_sharey=False)
g.map_lower(plt.scatter)
g.map_diag(sns.histplot, kde_kws={"color": "black"})
g.map_upper(corr_dot)
```

Out[14]:

<seaborn.axisgrid.PairGrid at 0x2202721b430>