Correlation is a commonly used method to examine the relationship between quantitative variables. The most commonly used statistic is the linear correlation coefficient, \(r\), which is also known as the Pearson product moment correlation coefficient in honor of its developer, Karl Pearson. It is given by
\[r = \frac{\sum_{i=1}^n(x_i- \bar x)(y_i - \bar y)}{\sqrt{\sum_{i=1}^n(x_i- \bar x)^2}\sqrt{\sum_{i=1}^n(y_i- \bar y)^2}}=\frac{s_{xy}}{s_x s_y}\text{,}\]
where \(s_{xy}\) is the covariance of \(x\) and \(y\), and \(s_x\) and \(s_y\) are the standard deviations of \(x\) and \(y\), respectively. By dividing by the sample standard deviations, \(s_x\) and \(s_y\), the linear correlation coefficient, \(r\), becomes scale independent and takes values between \(-1\) and \(1\).
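To get a feeling for this scale independence, consider the following minimal sketch (simulated data, not part of the students example used below): rescaling the variables linearly leaves \(r\) unchanged.
# illustrative sketch with simulated data: r is unchanged by linear rescaling
set.seed(1)
x <- rnorm(50, mean = 170, sd = 10)   # e.g. heights in cm
y <- 0.9 * x + rnorm(50, sd = 5)      # a linearly related variable
cor(x, y)                             # r on the original scale
cor(x / 100, 2.2 * y)                 # same value of r after rescaling both variables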
The linear correlation coefficient measures the strength of the linear relationship between two variables. If \(r\) is close to \(\pm 1\), the two variables are highly correlated and if plotted on a scatter plot, the data points cluster around a line. If \(r\) is far from \(\pm 1\), the data points are more widely scattered. If \(r\) is near \(0\), the data points are essentially scattered around a horizontal line indicating that there is almost no linear relationship between the variables.
An interesting property of \(r\) is that its sign reflects the slope of the linear relationship between two variables. A positive value of \(r\) suggests that the variables are positively linearly correlated, indicating that \(y\) tends to increase linearly as \(x\) increases. A negative value of \(r\) suggests that the variables are negatively linearly correlated, indicating that \(y\) tends to decrease linearly as \(x\) increases.
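A quick toy example (simulated data, not from the students data set) illustrates how the sign of \(r\) follows the direction of the relationship:
# illustrative sketch: the sign of r reflects the direction of the linear relationship
set.seed(2)
x <- 1:30
y_up <- 2 * x + rnorm(30, sd = 5)     # y tends to increase with x
y_down <- -2 * x + rnorm(30, sd = 5)  # y tends to decrease with x
cor(x, y_up)    # positive r
cor(x, y_down)  # negative r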
There is no unambiguous rule for classifying the strength of a linear relationship between two variables. However, the following table may serve as a rule of thumb for interpreting the numerical values of the Pearson product moment correlation coefficient:
\[ \begin{array}{lc} \hline \ \text{Strong linear relationship} & r > 0.9 \\ \ \text{Medium linear relationship} & 0.7 < r \le 0.9\\ \ \text{Weak linear relationship} & 0.5 < r \le 0.7 \\ \ \text{No or doubtful linear relationship} & 0 < r \le 0.5 \\ \hline \end{array} \]
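As a small illustration, such a rule of thumb could be wrapped in a helper function. The function classify_r() below is purely hypothetical (it is not part of R or of this course's code) and applies the thresholds to the absolute value of \(r\), so that negative correlations are treated analogously:
classify_r <- function(r) {
  # hypothetical helper: thresholds taken from the rule-of-thumb table above,
  # applied to |r| so that negative correlations are covered as well
  r_abs <- abs(r)
  if (r_abs > 0.9) {
    "strong linear relationship"
  } else if (r_abs > 0.7) {
    "medium linear relationship"
  } else if (r_abs > 0.5) {
    "weak linear relationship"
  } else {
    "no or doubtful linear relationship"
  }
}
classify_r(0.97)   # "strong linear relationship"
classify_r(-0.45)  # "no or doubtful linear relationship"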
Pearson’s correlation assumes that the variables are approximately normally distributed, and it is not robust in the presence of outliers.
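The following sketch (simulated data) shows how strongly a single extreme observation can distort Pearson’s \(r\):
# illustrative sketch: a single outlier can distort Pearson's r drastically
set.seed(3)
x <- rnorm(30)
y <- x + rnorm(30, sd = 0.3)
cor(x, y)           # strong positive correlation without the outlier
x_out <- c(x, 10)   # add one extreme observation
y_out <- c(y, -10)
cor(x_out, y_out)   # r changes drastically (and may even flip its sign)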
In a later section on linear regression we discuss the coefficient of determination, \(R^2\), a descriptive measure for the quality of linear models. There is a close relation between \(R^2\) and the linear correlation coefficient, \(r\). The coefficient of determination, \(R^2\), equals the square of the linear correlation coefficient, \(r\):
\[\text{coefficient of determination }(R^2) =r^2 \]
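This relation is easy to verify in R. The following sketch (simulated data) compares the squared correlation coefficient with the \(R^2\) reported by lm():
# illustrative sketch: r^2 equals the R^2 of the corresponding simple linear model
set.seed(4)
x <- rnorm(50)
y <- 1.5 * x + rnorm(50)
cor(x, y)^2                    # squared linear correlation coefficient
summary(lm(y ~ x))$r.squared   # coefficient of determination of the fitted model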
In order to get some intuition we calculate the Pearson product moment correlation coefficient in an example. Therefore, we load the students data set into our workspace. (You may download the students.csv file or load it directly from its URL using read.csv().)
students <- read.csv("https://userpage.fu-berlin.de/soga/data/raw-data/students.csv")
The students data set consists of 8239 rows, each of them representing a particular student, and 16 columns, each of them corresponding to a variable/feature related to that particular student. These self-explaining variables are: stud.id, name, gender, age, height, weight, religion, nc.score, semester, major, minor, score1, score2, online.tutorial, graduated, salary.
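If you want to verify these numbers yourself, you may inspect the data frame directly:
dim(students)     # should return 8239 rows and 16 columns
names(students)   # the variable names listed above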
In this example we assess the linear relationship between the weight and the height of students. For this we randomly pick 37 students and extract the weight and height variables from the data set.
n <- 37
sample_idx <- sample(1:nrow(students), size = n)
weight <- students[sample_idx, "weight"]
height <- students[sample_idx, "height"]
plot(height, weight)
The scatter plot indicates that there exists a linear relationship between the two variables under consideration.
For the sake of this exercise we first calculate the linear correlation coefficient by hand, and then we apply the cor() function in R. Recall the equation from above:
\[r = \frac{\sum_{i=1}^n(x_i- \bar x)(y_i - \bar y)}{\sqrt{\sum_{i=1}^n(x_i- \bar x)^2}\sqrt{\sum_{i=1}^n(y_i- \bar y)^2}}=\frac{s_{xy}}{s_x s_y}\]
x <- height
y <- weight
x_bar <- mean(x)  # sample mean of x
y_bar <- mean(y)  # sample mean of y
# plug the sums of squares and cross products into the equation for r
sum((x - x_bar) * (y - y_bar)) / (sqrt(sum((x - x_bar)^2)) * sqrt(sum((y - y_bar)^2)))
## [1] 0.9743696
As a sanity check we calculate the ratio of the covariance of \(x\) and \(y\) to the product of their standard deviations:
\[r = \frac{s_{xy}}{s_x s_y}\]
cov(x, y) / (sd(x) * sd(y))
## [1] 0.9743696
Finally, we apply the built-in cor() function:
cor(x, y)
## [1] 0.9743696
Perfect. All three calculations yield exactly the same result! The linear correlation coefficient evaluates to \(r = 0.9743696\). Thus, we may conclude that there is a strong linear correlation between the height and the weight of a student.
Of course, a correlation analysis is not restricted to two variables. Thanks to statistical software packages such as R, we are able to conduct a pairwise correlation analysis for more than two variables. Let us first prepare the data set. For a better visualization experience we draw a random sample of 100 students from the students data set. Then we select a number of variables to perform the correlation analysis on.
n <- 100
sample_idx <- sample(1:nrow(students), size = n)
vars <- c("height", "weight", "nc.score", "score1", "score2", "salary") # select variables
cor(students[sample_idx, vars])
## height weight nc.score score1 score2 salary
## height 1.00000000 0.94526224 -0.02448925 NA NA NA
## weight 0.94526224 1.00000000 -0.04048844 NA NA NA
## nc.score -0.02448925 -0.04048844 1.00000000 NA NA NA
## score1 NA NA NA 1 NA NA
## score2 NA NA NA NA 1 NA
## salary NA NA NA NA NA 1
The cor() function returns a nice table, also called a correlation matrix, containing the pairwise Pearson correlation coefficients. Obviously, some variables contain missing values, denoted as NA. We exclude those values from the analysis by adding the argument use = "pairwise.complete.obs" to the function call.
cor(students[sample_idx, vars], use = "pairwise.complete.obs")
## height weight nc.score score1 score2
## height 1.00000000 0.94526224 -0.024489250 0.195887582 0.16769035
## weight 0.94526224 1.00000000 -0.040488441 0.241458260 0.21921260
## nc.score -0.02448925 -0.04048844 1.000000000 -0.008946945 -0.02146809
## score1 0.19588758 0.24145826 -0.008946945 1.000000000 0.87080991
## score2 0.16769035 0.21921260 -0.021468095 0.870809912 1.00000000
## salary 0.51781672 0.36597798 0.283531817 0.381895086 0.15137427
## salary
## height 0.5178167
## weight 0.3659780
## nc.score 0.2835318
## score1 0.3818951
## score2 0.1513743
## salary 1.0000000
A table is a nice representation of a correlation analysis, but a figure would of course improve the interpretability. R provides the pairs() function, which plots a matrix of pairwise scatter plots.
pairs(students[sample_idx, vars])
Immediately, we realize that the majority of the variables do not appear to be linearly correlated. In contrast, the variable pairs height and weight, as well as score1 and score2, appear to be positively correlated.
Another recommended function for pair plots is provided by the psych package. The function pairs.panels() is very flexible and comes with many nice plotting features. Visit the help page or type ?pairs.panels into your console for more information on the function.
library(psych)
pairs.panels(students[sample_idx, vars],
smooth = FALSE,
ellipses = FALSE,
main = "Pearson product moment correlation coefficient"
)
The function returns a plot, which shows scatter plots, the variable histograms and the density curves as well as the correlation coefficients.
Note: Correlation functions implemented in R, such as cor() or pairs.panels(), include different types of correlation coefficients such as Pearson’s, Spearman’s and Kendall’s correlation coefficients. To pick one particular formula you add the argument method to the function call. Pearson’s correlation coefficient is the default setting.
pairs.panels(students[sample_idx, vars],
smooth = FALSE,
ellipses = FALSE,
method = "spearman",
main = "Spearman's rank correlation coefficient"
)
Spearman’s rank correlation coefficient, also known as Spearman’s \(\rho\), is a non-parametric rank correlation coefficient. It was developed by Charles Spearman and is an alternative to the Pearson product moment correlation coefficient. Spearman’s rank correlation coefficient is denoted by \(r_s\) for sample data and by \(\rho_s\) for population data (Mann 2012). The correlation coefficient assesses the monotonic relationship between two variables and ranges between \(-1\) and \(1\). It describes the linear correlation between the ranks of the data of variables \(x\) and \(y\). Spearman’s correlation is high when the observations of the two variables have similar ranks and low when they have dissimilar ranks.
To calculate \(r_s\) the data for each variable, \(x\) and \(y\), is ranked separately. For a given bivariate sequence \((x_1, y_1), (x_2, y_2), ..., (x_n, y_n)\) Spearman’s \(r_s\) is given by
\[r_s=1-\frac{6\sum_{i=1}^n (r_{xi}-r_{yi})^2}{n(n^2-1)}\text{,}\] where \(r_{xi}=Rank(x_i)\) , \(r_{yi}= Rank(y_i)\) and \(n\) is the sample size.
In contrast to Pearson’s linear correlation coefficient, Spearman’s rank correlation coefficient is appropriate for both quantitative and ordinal variables. In addition, rank-based correlations do not depend on the assumption of a normal distribution and are more resistant to outliers (Schumann 2010).
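The following sketch (simulated data, not part of the discharge example below) illustrates both properties: \(r_s\) is unchanged by a monotonic transformation of the data and reacts much less to a single outlier than Pearson’s \(r\):
# illustrative sketch: Spearman's r_s depends only on the ranks of the data
set.seed(5)
x <- rnorm(40)
y <- x + rnorm(40, sd = 0.5)
cor(x, y, method = "spearman")        # r_s on the original data
cor(x, exp(y), method = "spearman")   # identical, since exp() preserves the ranks of y
x_out <- c(x, 10)                     # add a single extreme observation
y_out <- c(y, -10)
cor(x_out, y_out)                       # Pearson's r changes drastically
cor(x_out, y_out, method = "spearman")  # Spearman's r_s is affected far less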
Let us consider an example. The population of a community along a river believes that recent increases in peak discharge rates are due to deforestation by a logging company. We calculate Spearman’s rank correlation coefficient to assess whether there is a correlation between peak discharge and the fraction of deforested area in the watershed (data modified after McCuen 2003, p. 112).
\[ \begin{array}{lcc} \hline \text{Year} & \text{Discharge (m}^3 \text{s}^{-1}) & \text{Logging area (\%)} \\ \hline 1982 & 227 & 53 \\ 1983 & 249 & 56 \\ 1984 & 210 & 57 \\ 1985 & 190 & 58 \\ 1986 & 314 & 55 \\ 1987 & 345 & 54 \\ 1988 & 161 & 51 \\ 1988 & 266 & 50 \\ 1989 & 402 & 49 \\ 1990 & 215 & 47 \\ 1991 & 164 & 46 \\ 1992 & 405 & 44 \\ 1993 & 328 & 43 \\ 1994 & 294 & 42 \\ \hline \end{array} \]
Let us construct our data vectors. We assign the discharge values to the variable Q and the logging area to the variable logged.
Q <- c(227, 249, 210, 190, 314, 345, 161, 266, 402, 215, 164, 405, 328, 294)
logged <- c(53, 56, 57, 58, 55, 54, 51, 50, 49, 47, 46, 44, 43, 42)
First, we calculate Spearman’s rank correlation coefficient by hand. Recall the equation
\[r_s=1-\frac{6\sum_{i=1}^n (r_{xi}-r_{yi})^2}{n(n^2-1)}\text{,}\]
where \(r_{xi}=Rank(x_i)\), \(r_{yi}=Rank(y_i)\) and \(n\) is the sample size. We apply the rank() function to calculate the rank of the values of each variable.
# set number of observations
n <- length(Q)
# calculate rank
r_xi <- rank(Q)
r_yi <- rank(logged)
# plug into equation
rs <- 1 - ((6 * sum((r_xi - r_yi)^2)) / (n * (n^2 - 1)))
rs
## [1] -0.3406593
Alternatively, we may apply the cor() function in R and add the argument method = "spearman".
cor(Q, logged, method = "spearman")
## [1] -0.3406593
Spearman’s rank correlation coefficient is nothing other than Pearson’s linear correlation coefficient applied to the ranked data. Thus, the following line of code should reproduce the previous results:
cor(rank(Q), rank(logged))
## [1] -0.3406593
Perfect! All three calculations yield the same result, a Spearman’s rank correlation coefficient of \(r_s = -0.34\). The result indicates that there is at most a weak negative correlation between peak discharge and logging area; if anything, the discharge tends to decrease as the logging area increases. Thus, the community’s perception cannot be confirmed by our statistical analysis.
Note: It is always recommended to follow up with a statistical test in order to assess whether the result is statistically significant or whether the variation is just due to chance. Check out the sections on Hypothesis Testing for further information!
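As a brief preview (the details are discussed in the Hypothesis Testing sections), the cor.test() function performs such a test; applied to the discharge example it reports, among other things, a p-value for the null hypothesis that \(\rho_s = 0\):
cor.test(Q, logged, method = "spearman")   # significance test for Spearman's rank correlation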
Citation
The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us via mail by soga[at]zedat.fu-berlin.de.
Please cite as follows: Hartmann, K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.