Correlation is a commonly used method to examine the relationship between quantitative variables. The most widely used statistic is the linear correlation coefficient, \(r\), which is also known as the Pearson product moment correlation coefficient in honor of its developer, Karl Pearson. It is given by
\[r = \frac{\sum_{i=1}^n(x_i- \bar x)(y_i - \bar y)}{\sqrt{\sum_{i=1}^n(x_i- \bar x)^2}\sqrt{\sum_{i=1}^n(y_i- \bar y)^2}}=\frac{s_{xy}}{s_x s_y}\text{,}\]
where \(s_{xy}\) is the covariance of \(x\) and \(y\), and \(s_x\) and \(s_y\) are the standard deviations of \(x\) and \(y\), respectively. By dividing by the sample standard deviations, \(s_x\) and \(s_y\), the linear correlation coefficient, \(r\), becomes scale independent and takes values between \(-1\) and \(1\).
The linear correlation coefficient measures the strength of the linear relationship between two variables. If \(r\) is close to \(\pm 1\), the two variables are highly correlated and if plotted on a scatter plot, the data points cluster about a line. If \(r\) is far from \(\pm 1\), the data points are more widely scattered. If \(r\) is near \(0\), the data points are essentially scattered about a horizontal line indicating that there is almost no linear relationship between the variables.
An interesting property of \(r\) is that its sign reflects the slope of the linear relationship between two variables. A positive value of \(r\) suggests that the variables are positively linearly correlated, indicating that \(y\) tends to increase linearly as \(x\) increases. A negative value of \(r\) suggests that the variables are negatively linearly correlated, indicating that \(y\) tends to decrease linearly as \(x\) increases.
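To build some intuition for the sign of \(r\), consider the following small sketch with simulated data (the variable names and numbers are made up for illustration and are not part of the students data set introduced below):
set.seed(1)                            # for reproducibility
x <- 1:50
y.pos <- 2 * x + rnorm(50, sd = 10)    # y tends to increase with x
y.neg <- -2 * x + rnorm(50, sd = 10)   # y tends to decrease with x
cor(x, y.pos)                          # close to +1 (positive linear correlation)
cor(x, y.neg)                          # close to -1 (negative linear correlation)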
There is no unambiguous classification rule for the strength of a linear relationship between two variables. However, the following table may serve as a rule of thumb for interpreting the numerical values of the Pearson product moment correlation coefficient.
\[ \begin{array}{lc} \hline \ \text{Strong linear relationship} & |r| > 0.9 \\ \ \text{Medium linear relationship} & 0.7 < |r| \le 0.9\\ \ \text{Weak linear relationship} & 0.5 < |r| \le 0.7 \\ \ \text{No or doubtful linear relationship} & 0 < |r| \le 0.5 \\ \hline \end{array} \] Note that Pearson’s correlation assumes the variables to be roughly normally distributed, and it is not robust in the presence of outliers.
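The lack of robustness is easy to demonstrate with simulated data. In the following sketch (made-up numbers, not the students data) a single extreme observation is enough to change \(r\) drastically:
set.seed(1)
x <- rnorm(30)
y <- x + rnorm(30, sd = 0.3)
cor(x, y)                  # strong positive linear correlation
x.out <- c(x, 10)          # append a single extreme observation
y.out <- c(y, -10)
cor(x.out, y.out)          # r changes drastically, it may even switch sign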
In a later section on linear regression we discuss the coefficient of determination, \(R^2\), a descriptive measure for the quality of linear models. There is a close relation between \(R^2\) and the linear correlation coefficient, \(r\): the coefficient of determination equals the square of the linear correlation coefficient.
\[\text{coefficient of determination }(R^2) =r^2 \]
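We can verify this identity numerically with a few lines of R (a sketch with simulated data; the lm() function used to fit the linear model is discussed in the section on linear regression):
set.seed(1)
x <- rnorm(100)
y <- 1 + 2 * x + rnorm(100)
cor(x, y)^2                     # squared linear correlation coefficient
summary(lm(y ~ x))$r.squared    # coefficient of determination R^2
# both expressions return the same value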
In order to get some intuition we calculate the Pearson product moment correlation coefficient in an example. Therefore we load the students data set into our workspace. (You may download the students.csv file here.)
students <- read.csv("https://userpage.fu-berlin.de/soga/200/2010_data_sets/students.csv")
The students data set consists of 8239 rows, each representing a particular student, and 16 columns, each corresponding to a variable/feature related to that student. These self-explanatory variables are: stud.id, name, gender, age, height, weight, religion, nc.score, semester, major, minor, score1, score2, online.tutorial, graduated, salary.
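To verify the dimensions and the variable types we may inspect the data set directly:
dim(students)    # number of rows and columns
str(students)    # variable names and data types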
In this example we assess the linear relationship between the weight and the height of students. Therefore, we pick 37 students at random and extract the weight and height variables from the data set.
n <- 37                                            # sample size
sample.idx <- sample(1:nrow(students), size = n)   # draw n random row indices
weight <- students[sample.idx, 'weight']           # weight of the sampled students
height <- students[sample.idx, 'height']           # height of the sampled students
plot(height, weight)
The scatter plot indicates that there exists a linear relationship between the two variables under consideration. For the sake of this exercise we calculate the linear correlation coefficient by hand first, and then we apply the cor() function in R. Recall the equation from above:
\[r = \frac{\sum_{i=1}^n(x_i- \bar x)(y_i - \bar y)}{\sqrt{\sum_{i=1}^n(x_i- \bar x)^2}\sqrt{\sum_{i=1}^n(y_i- \bar y)^2}}=\frac{s_{xy}}{s_x s_y}\]
x <- height
y <- weight
x.bar <- mean(height)   # sample mean of x
y.bar <- mean(weight)   # sample mean of y
# plug the values into the equation for r
sum((x - x.bar) * (y - y.bar)) / (sqrt(sum((x - x.bar)^2)) * sqrt(sum((y - y.bar)^2)))
## [1] 0.946434
As a sanity check we calculate the ratio of the covariance of \(x\) and \(y\) to the product of the standard deviations of \(x\) and \(y\).
\[r = \frac{s_{xy}}{s_x s_y}\]
cov(x,y)/(sd(x)*sd(y))
## [1] 0.946434
Finally, we apply the built-in cor() function.
cor(x,y)
## [1] 0.946434
Perfect. The three calculations yield the exact same result! The linear correlation coefficient evaluates to \(r = 0.946434\). Thus we may conclude that there is a strong linear correlation between the height and the weight of a student.
Of course a correlation analysis is not restricted to two variables. Thanks to statistical software packages such as R, we are able to conduct a pairwise correlation analysis for more than two variables. Let us first prepare the data set. For better visualization we draw 100 randomly picked students from the students data set, and then select a number of variables on which to perform the correlation analysis.
n <- 100                                           # sample size
sample.idx <- sample(1:nrow(students), size = n)   # draw n random row indices
vars <- c('height', 'weight', 'nc.score', 'score1', 'score2', 'salary')   # select variables
cor(students[sample.idx, vars])
## height weight nc.score score1 score2 salary
## height 1.00000000 0.94080941 -0.06878171 NA NA NA
## weight 0.94080941 1.00000000 -0.05861424 NA NA NA
## nc.score -0.06878171 -0.05861424 1.00000000 NA NA NA
## score1 NA NA NA 1 NA NA
## score2 NA NA NA NA 1 NA
## salary NA NA NA NA NA 1
The cor() function returns a nice table, also called a correlation matrix, with the pairwise Pearson’s correlation coefficients. Obviously some variables contain missing values, denoted as NA. We may exclude those values from the analysis by adding the argument use = 'pairwise.complete.obs' to the function call.
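Before doing so, we may check how many missing values each of the selected variables contains (a quick sketch using base R):
colSums(is.na(students[, vars]))   # number of NA values per selected variable
With the additional argument the complete correlation matrix is returned: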
cor(students[sample.idx, vars], use = 'pairwise.complete.obs')
## height weight nc.score score1 score2
## height 1.00000000 0.94080941 -0.068781709 0.168821978 0.169925657
## weight 0.94080941 1.00000000 -0.058614239 0.089633664 0.128984150
## nc.score -0.06878171 -0.05861424 1.000000000 0.006101369 -0.005755013
## score1 0.16882198 0.08963366 0.006101369 1.000000000 0.901300595
## score2 0.16992566 0.12898415 -0.005755013 0.901300595 1.000000000
## salary 0.42137954 0.32609646 0.033045107 0.369480415 0.413249496
## salary
## height 0.42137954
## weight 0.32609646
## nc.score 0.03304511
## score1 0.36948041
## score2 0.41324950
## salary 1.00000000
A table is a nice representation for a correlation analysis, but a figure would of course improve the interpretability. The software package R provides the pairs() function for plotting a matrix of pairwise scatter plots.
pairs(students[sample.idx, vars])
Immediately, we realize that the majority of the variable pairs do not appear to be linearly correlated. In contrast, the variable pairs height and weight, as well as score1 and score2, appear to be positively correlated.
Another recommendable function for pair plots is provided by the psych package. The pairs.panels() function is very flexible and ships with many nice plotting features. Visit the help page or type ?pairs.panels into your console for more information on the function.
library(psych)
pairs.panels(students[sample.idx, vars],
             smooth = FALSE,
             ellipses = FALSE,
             main = 'Pearson product moment correlation coefficient')
The function returns a plot which shows the pairwise scatter plots, the variable histograms and density curves, as well as the correlation coefficients. Please note that the function supports different types of correlation coefficients, such as Pearson’s, Spearman’s and Kendall’s correlation coefficients. To pick one in particular you add the argument method to the function call; Pearson’s correlation coefficient is the default setting.
pairs.panels(students[sample.idx, vars],
             smooth = FALSE,
             ellipses = FALSE,
             method = 'spearman',
             main = 'Spearman\'s rank correlation coefficient')
Spearman’s rank correlation coefficient, also known as Spearman’s \(\rho\), is a non-parametric rank correlation coefficient. It was developed by Charles Spearman and is an alternative to Pearson’s product moment correlation coefficient. The Spearman rank correlation coefficient is denoted by \(r_s\) for sample data and by \(\rho_s\) for population data (Mann 2012). The correlation coefficient assesses the monotonic relationship between two variables and ranges between \(-1\) and \(1\). It describes the linear correlation between the ranks of the data on variables \(x\) and \(y\): Spearman’s correlation is high when the observations have similar ranks on both variables, and low when they have dissimilar ranks.
To calculate \(r_s\), the data for each variable, \(x\) and \(y\), are ranked separately, and the difference between each pair of ranks is computed. For a given bivariate sequence \((x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\) Spearman’s \(r_s\) is given by
\[r_s=1-\frac{6\sum_{i=1}^n (r_{xi}-r_{yi})^2}{n(n^2-1)}\text{,}\] where \(r_{xi}=Rank(x_i)\) , \(r_{yi}= Rank(y_i)\) , and \(n\) is the sample size.
In contrast to Pearson’s linear correlation coefficient, Spearman’s rank correlation coefficient is appropriate for both quantitative and ordinal variables. In addition, rank-based correlations do not depend on the assumption of normality and are more resistant to outliers (Schumann 2010).
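The following sketch with simulated data (made-up numbers, analogous to the outlier example further above) contrasts the two coefficients: a single extreme observation strongly affects Pearson’s \(r\), while the rank-based \(r_s\) is much less influenced.
set.seed(1)
x <- rnorm(30)
y <- x + rnorm(30, sd = 0.3)
x.out <- c(x, 10)                          # append one extreme observation
y.out <- c(y, -10)
cor(x.out, y.out, method = "pearson")      # strongly distorted by the outlier
cor(x.out, y.out, method = "spearman")     # based on ranks, remains clearly positive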
Let us consider an example. The population of a community along a river believes that recent increases in peak discharge rates are due to deforestation by a logging company in recent years. We calculate Spearman’s rank correlation coefficient to assess whether there is a correlation between the peak discharge and the fraction of logged area in the watershed (data modified after McCuen 2003, p. 112).
\[ \begin{array}{lcc} \hline \ \text{Year} & \text{Discharge (m}^3\text{ s}^{-1}) & \text{Logging area (\%)} \\ \hline \ 1982 & 227 & 53 \\ \ 1983 & 249 & 56 \\ \ 1984 & 210 & 57 \\ \ 1985 & 190 & 58 \\ \ 1986 & 314 & 55 \\ \ 1987 & 345 & 54 \\ \ 1988 & 161 & 51 \\ \ 1988 & 266 & 50 \\ \ 1989 & 402 & 49 \\ \ 1990 & 215 & 47 \\ \ 1991 & 164 & 46 \\ \ 1992 & 405 & 44 \\ \ 1993 & 328 & 43 \\ \ 1994 & 294 & 42 \\ \hline \end{array} \]
Let us construct our data vectors. We assign the discharge values to the variable Q and the logging area to the variable logged.
Q <- c(227, 249, 210, 190, 314, 345, 161, 266, 402, 215, 164, 405, 328, 294)   # peak discharge (m^3/s)
logged <- c(53, 56, 57, 58, 55, 54, 51, 50, 49, 47, 46, 44, 43, 42)            # logging area (%)
First we calculate Spearman’s rank correlation coefficient by hand. Recall the equation
\[r_s=1-\frac{6\sum_{i=1}^n (r_{xi}-r_{yi})^2}{n(n^2-1)}\text{,}\]
where \(r_{xi}=Rank(x_i)\), \(r_{yi}= Rank(y_i)\), and \(n\) is the sample size. We apply the rank() function to calculate the rank for the values of each variable.
#set number of observations
n <- length(Q)
#calculate rank
r.xi <- rank(Q)
r.yi <- rank(logged)
#plug into equation
rs <- 1 - ((6*sum((r.xi-r.yi)^2))/(n*(n^2-1)))
rs
## [1] -0.3406593
Alternatively, we may apply the cor() function in R and add the argument method = "spearman".
cor(Q, logged, method = "spearman")
## [1] -0.3406593
Spearman’s rank correlation coefficient is nothing but Pearson’s linear correlation coefficient computed on the ranked data. Hence, the result of the following code cell should equal the previous results.
cor(rank(Q), rank(logged))
## [1] -0.3406593
Perfect, we got the same result by all three calculations, which yield a Spearman’s rank correlation coefficient of \(r_s = -0.34\). The result indicates that there is at most a weak negative correlation between peak discharge and logging area; if anything, the discharge tends to decrease as the logging area increases. Thus, the perception of the population cannot be confirmed by our statistical analysis. It is recommended to advance with a statistical test in order to assess whether the result is statistically significant, or whether the variation is just due to chance. Check out the sections on Hypothesis Testing for further information.
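As a pointer ahead, base R provides the cor.test() function, which computes the correlation coefficient together with a significance test (shown here only as a sketch; the interpretation of the test output is covered in the sections on Hypothesis Testing):
cor.test(Q, logged, method = "spearman")   # tests H0: rho_s = 0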