Correlation is a commonly used method to examine the relationship between quantitative variables. The most commonly used statistic is the linear correlation coefficient, \(r\), which is also known as the Pearson product moment correlation coefficient in honor of its developer, Karl Pearson. It is given by
\[r = \frac{\sum_{i=1}^n(x_i- \bar x)(y_i - \bar y)}{\sqrt{\sum_{i=1}^n(x_i- \bar x)^2}\sqrt{\sum_{i=1}^n(y_i- \bar y)^2}}=\frac{s_{xy}}{s_x s_y}\text{,}\]
where \(s_{xy}\) is the covariance of \(x\) and \(y\), and \(s_x\) and \(s_y\) are the standard deviations of \(x\) and \(y\), respectively. By dividing by the sample standard deviations, \(s_x\) and \(s_y\), the linear correlation coefficient, \(r\), becomes scale independent and takes values between \(-1\) and \(1\).
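To get a feeling for this scale independence, consider the following minimal sketch (simulated data, not part of the students example used below): rescaling the variables linearly leaves \(r\) unchanged.
# illustrative sketch with simulated data: r is unchanged by linear rescaling
set.seed(1)
x <- rnorm(50, mean = 170, sd = 10)   # e.g. heights in cm
y <- 0.9 * x + rnorm(50, sd = 5)      # a linearly related variable
cor(x, y)                             # r on the original scale
cor(x / 100, 2.2 * y)                 # same value of r after rescaling both variables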
The linear correlation coefficient measures the strength of the linear relationship between two variables. If \(r\) is close to \(\pm 1\), the two variables are highly correlated and if plotted on a scatter plot, the data points cluster around a line. If \(r\) is far from \(\pm 1\), the data points are more widely scattered. If \(r\) is near \(0\), the data points are essentially scattered around a horizontal line indicating that there is almost no linear relationship between the variables.
An interesting property of \(r\) is that its sign reflects the slope of the linear relationship between two variables. A positive value of \(r\) suggests that the variables are positively linearly correlated, indicating that \(y\) tends to increase linearly as \(x\) increases. A negative value of \(r\) suggests that the variables are negatively linearly correlated, indicating that \(y\) tends to decrease linearly as \(x\) increases.
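A quick toy example (simulated data, not from the students data set) illustrates how the sign of \(r\) follows the direction of the relationship:
# illustrative sketch: the sign of r reflects the direction of the linear relationship
set.seed(2)
x <- 1:30
y_up <- 2 * x + rnorm(30, sd = 5)     # y tends to increase with x
y_down <- -2 * x + rnorm(30, sd = 5)  # y tends to decrease with x
cor(x, y_up)    # positive r
cor(x, y_down)  # negative r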
There is no unambiguous rule for classifying the strength of a linear relationship between two variables. However, the following table may serve as a rule of thumb for interpreting the numerical values of the Pearson product moment correlation coefficient:
\[ \begin{array}{lc} \hline \ \text{Strong linear relationship} & r > 0.9 \\ \ \text{Medium linear relationship} & 0.7 < r \le 0.9\\ \ \text{Weak linear relationship} & 0.5 < r \le 0.7 \\ \ \text{No or doubtful linear relationship} & 0 < r \le 0.5 \\ \hline \end{array} \]
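As a small illustration, such a rule of thumb could be wrapped in a helper function. The function classify_r() below is purely hypothetical (it is not part of R or of this course's code) and applies the thresholds to the absolute value of \(r\), so that negative correlations are treated analogously:
classify_r <- function(r) {
  # hypothetical helper: thresholds taken from the rule-of-thumb table above,
  # applied to |r| so that negative correlations are covered as well
  r_abs <- abs(r)
  if (r_abs > 0.9) {
    "strong linear relationship"
  } else if (r_abs > 0.7) {
    "medium linear relationship"
  } else if (r_abs > 0.5) {
    "weak linear relationship"
  } else {
    "no or doubtful linear relationship"
  }
}
classify_r(0.97)   # "strong linear relationship"
classify_r(-0.45)  # "no or doubtful linear relationship"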
Pearson’s correlation assumes that the variables are approximately normally distributed, and it is not robust in the presence of outliers.
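The following sketch (simulated data) shows how strongly a single extreme observation can distort Pearson’s \(r\):
# illustrative sketch: a single outlier can distort Pearson's r drastically
set.seed(3)
x <- rnorm(30)
y <- x + rnorm(30, sd = 0.3)
cor(x, y)           # strong positive correlation without the outlier
x_out <- c(x, 10)   # add one extreme observation
y_out <- c(y, -10)
cor(x_out, y_out)   # r changes drastically (and may even flip its sign)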
In a later section on linear regression we discuss the coefficient of determination, \(R^2\), a descriptive measure for the quality of linear models. There is a close relation between \(R^2\) and the linear correlation coefficient, \(r\). The coefficient of determination, \(R^2\), equals the square of the linear correlation coefficient, \(r\):
\[\text{coefficient of determination }(R^2) =r^2 \]
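This relation is easy to verify in R. The following sketch (simulated data) compares the squared correlation coefficient with the \(R^2\) reported by lm():
# illustrative sketch: r^2 equals the R^2 of the corresponding simple linear model
set.seed(4)
x <- rnorm(50)
y <- 1.5 * x + rnorm(50)
cor(x, y)^2                    # squared linear correlation coefficient
summary(lm(y ~ x))$r.squared   # coefficient of determination of the fitted model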
In order to get some intuition we calculate the Pearson product moment correlation coefficient in an example. Therefore, we load the students data set into our workspace. (You may download the students.csv file or load it directly from its URL using read.csv().)
students <- read.csv("https://userpage.fu-berlin.de/soga/data/raw-data/students.csv")
The students data set consists of 8239 rows, each of them representing a particular student, and 16 columns, each of them corresponding to a variable/feature related to that particular student. These self-explaining variables are: stud.id, name, gender, age, height, weight, religion, nc.score, semester, major, minor, score1, score2, online.tutorial, graduated, salary.
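If you want to verify these numbers yourself, you may inspect the data frame directly:
dim(students)     # should return 8239 rows and 16 columns
names(students)   # the variable names listed above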
In this example we assess the linear relationship between the weight and the height of students. For this we randomly pick 37 students and extract the weight and height variables from the data set.
n <- 37
sample_idx <- sample(1:nrow(students), size = n)
weight <- students[sample_idx, "weight"]
height <- students[sample_idx, "height"]
plot(height, weight)
The scatter plot indicates that there exists a linear relationship between the two variables under consideration.
For the sake of this exercise we first calculate the linear correlation coefficient by hand, and then we apply the cor() function in R. Recall the equation from above:
\[r = \frac{\sum_{i=1}^n(x_i- \bar x)(y_i - \bar y)}{\sqrt{\sum_{i=1}^n(x_i- \bar x)^2}\sqrt{\sum_{i=1}^n(y_i- \bar y)^2}}=\frac{s_{xy}}{s_x s_y}\]
x <- height
y <- weight
x_bar <- mean(x)  # sample mean of x
y_bar <- mean(y)  # sample mean of y
# plug the sums of squares and cross products into the equation for r
sum((x - x_bar) * (y - y_bar)) / (sqrt(sum((x - x_bar)^2)) * sqrt(sum((y - y_bar)^2)))
## [1] 0.9743696
As a sanity check we calculate the ratio of the covariance of \(x\) and \(y\) to the product of their standard deviations:
\[r = \frac{s_{xy}}{s_x s_y}\]
cov(x, y) / (sd(x) * sd(y))
## [1] 0.9743696
Finally, we apply the built-in cor() function:
cor(x, y)
## [1] 0.9743696
Perfect. All three calculations yield exactly the same result! The linear correlation coefficient evaluates to \(r = 0.9743696\). Thus, we may conclude that there is a strong linear correlation between the height and the weight of a student.
Of course, a correlation analysis is not restricted to two variables. Thanks to statistical software packages such as R, we are able to conduct a pairwise correlation analysis for more than two variables. Let us first prepare the data set. For a better visualization experience we draw a random sample of 100 students from the students data set. Then we select a number of variables to perform the correlation analysis on.
n <- 100
sample_idx <- sample(1:nrow(students), size = n)
vars <- c("height", "weight", "nc.score", "score1", "score2", "salary") # select variables
cor(students[sample_idx, vars])
## height weight nc.score score1 score2 salary
## height 1.00000000 0.94526224 -0.02448925 NA NA NA
## weight 0.94526224 1.00000000 -0.04048844 NA NA NA
## nc.score -0.02448925 -0.04048844 1.00000000 NA NA NA
## score1 NA NA NA 1 NA NA
## score2 NA NA NA NA 1 NA
## salary NA NA NA NA NA 1
The cor() function returns a nice table, also called a correlation matrix, containing the pairwise Pearson correlation coefficients. Obviously, some variables contain missing values, denoted as NA. We exclude those values from the analysis by adding the argument use = "pairwise.complete.obs" to the function call.
cor(students[sample_idx, vars], use = "pairwise.complete.obs")
## height weight nc.score score1 score2
## height 1.00000000 0.94526224 -0.024489250 0.195887582 0.16769035
## weight 0.94526224 1.00000000 -0.040488441 0.241458260 0.21921260
## nc.score -0.02448925 -0.04048844 1.000000000 -0.008946945 -0.02146809
## score1 0.19588758 0.24145826 -0.008946945 1.000000000 0.87080991
## score2 0.16769035 0.21921260 -0.021468095 0.870809912 1.00000000
## salary 0.51781672 0.36597798 0.283531817 0.381895086 0.15137427
## salary
## height 0.5178167
## weight 0.3659780
## nc.score 0.2835318
## score1 0.3818951
## score2 0.1513743
## salary 1.0000000
A table is a nice representation of a correlation analysis, but a figure would of course improve the interpretability. R provides the pairs() function, which plots a matrix of pairwise scatter plots.
pairs(students[sample_idx, vars])
Immediately, we realize that the majority of the variables do not appear to be linearly correlated. In contrast, the variable pairs height and weight, as well as score1 and score2, appear to be positively correlated.
Another recommended function for pair plots is provided by the psych package. The function pairs.panels() is very flexible and comes with many nice plotting features. Visit the help page or type ?pairs.panels into your console for more information on the function.
library(psych)
pairs.panels(students[sample_idx, vars],
smooth = FALSE,
ellipses = FALSE,
main = "Pearson product moment correlation coefficient"
)
The function returns a plot, which shows scatter plots, the variable histograms and the density curves as well as the correlation coefficients.
Note: Correlation functions implemented in R, such as cor() or pairs.panels(), include different types of correlation coefficients such as Pearson’s, Spearman’s and Kendall’s correlation coefficients. To pick one particular formula you add the argument method to the function call. Pearson’s correlation coefficient is the default setting.
pairs.panels(students[sample_idx, vars],
smooth = FALSE,
ellipses = FALSE,
method = "spearman",
main = "Spearman's rank correlation coefficient"
)
Spearman’s rank correlation coefficient, also known as Spearman’s \(\rho\), is a non-parametric rank correlation coefficient. It was developed by Charles Spearman and is an alternative to the Pearson product moment correlation coefficient. Spearman’s rank correlation coefficient is denoted by \(r_s\) for sample data and by \(\rho_s\) for population data (Mann 2012). The correlation coefficient assesses the monotonic relationship between two variables and ranges between \(-1\) and \(1\). It describes the linear correlation between the ranks of the data of variables \(x\) and \(y\). Spearman’s correlation is high when the observations of the two variables have similar ranks and low when they have dissimilar ranks.
To calculate \(r_s\) the data for each variable, \(x\) and \(y\), is ranked separately. For a given bivariate sequence \((x_1, y_1), (x_2, y_2), ..., (x_n, y_n)\) Spearman’s \(r_s\) is given by
\[r_s=1-\frac{6\sum_{i=1}^n (r_{xi}-r_{yi})^2}{n(n^2-1)}\text{,}\] where \(r_{xi}=Rank(x_i)\) , \(r_{yi}= Rank(y_i)\) and \(n\) is the sample size.
In contrast to Pearson’s linear correlation coefficient, Spearman’s rank correlation coefficient is appropriate for both quantitative and ordinal variables. In addition, rank-based correlations do not depend on the assumption of a normal distribution and are more resistant to outliers (Schumann 2010).
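The following sketch (simulated data, not part of the discharge example below) illustrates both properties: \(r_s\) is unchanged by a monotonic transformation of the data and reacts much less to a single outlier than Pearson’s \(r\):
# illustrative sketch: Spearman's r_s depends only on the ranks of the data
set.seed(5)
x <- rnorm(40)
y <- x + rnorm(40, sd = 0.5)
cor(x, y, method = "spearman")        # r_s on the original data
cor(x, exp(y), method = "spearman")   # identical, since exp() preserves the ranks of y
x_out <- c(x, 10)                     # add a single extreme observation
y_out <- c(y, -10)
cor(x_out, y_out)                       # Pearson's r changes drastically
cor(x_out, y_out, method = "spearman")  # Spearman's r_s is affected far less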
Let us consider an example. The population of a community along a river believes that recent increases in peak discharge rates are due to deforestation by a logging company. We calculate Spearman’s rank correlation coefficient to assess whether there is a correlation between peak discharge and the fraction of deforested area in the watershed (data modified after McCuen 2003, p. 112).
\[ \begin{array}{lcc} \hline \text{Year} & \text{Discharge (m}^3 \text{s}^{-1}) & \text{Logging area (\%)} \\ \hline 1982 & 227 & 53 \\ 1983 & 249 & 56 \\ 1984 & 210 & 57 \\ 1985 & 190 & 58 \\ 1986 & 314 & 55 \\ 1987 & 345 & 54 \\ 1988 & 161 & 51 \\ 1988 & 266 & 50 \\ 1989 & 402 & 49 \\ 1990 & 215 & 47 \\ 1991 & 164 & 46 \\ 1992 & 405 & 44 \\ 1993 & 328 & 43 \\ 1994 & 294 & 42 \\ \hline \end{array} \]
Let us construct our data vectors. We assign the discharge values to the variable Q and the logging area to the variable logged.
Q <- c(227, 249, 210, 190, 314, 345, 161, 266, 402, 215, 164, 405, 328, 294)
logged <- c(53, 56, 57, 58, 55, 54, 51, 50, 49, 47, 46, 44, 43, 42)
First, we calculate Spearman’s rank correlation coefficient by hand. Recall the equation
\[r_s=1-\frac{6\sum_{i=1}^n (r_{xi}-r_{yi})^2}{n(n^2-1)}\text{,}\]
where \(r_{xi}=Rank(x_i)\), \(r_{yi}=Rank(y_i)\) and \(n\) is the sample size. We apply the rank() function to calculate the rank of the values of each variable.
# set number of observations
n <- length(Q)
# calculate rank
r_xi <- rank(Q)
r_yi <- rank(logged)
# plug into equation
rs <- 1 - ((6 * sum((r_xi - r_yi)^2)) / (n * (n^2 - 1)))
rs
## [1] -0.3406593
Alternatively, we may apply the cor() function in R and add the argument method = "spearman".
cor(Q, logged, method = "spearman")
## [1] -0.3406593
Spearman’s rank correlation coefficient is nothing other than Pearson’s linear correlation coefficient applied to the ranked data. Thus, the following line of code should reproduce the previous results:
cor(rank(Q), rank(logged))
## [1] -0.3406593
Perfect! All three calculations yield the same result, a Spearman’s rank correlation coefficient of \(r_s = -0.34\). The result indicates that there is at most a weak negative correlation between peak discharge and logging area; if anything, the discharge tends to decrease as the logging area increases. Thus, the community’s perception cannot be confirmed by our statistical analysis.
Note: It is always recommended to follow up with a statistical test in order to assess whether the result is statistically significant or whether the variation is just due to chance. Check out the sections on Hypothesis Testing for further information!
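As a brief preview (the details are discussed in the Hypothesis Testing sections), the cor.test() function performs such a test; applied to the discharge example it reports, among other things, a p-value for the null hypothesis that \(\rho_s = 0\):
cor.test(Q, logged, method = "spearman")   # significance test for Spearman's rank correlation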
Citation
The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us via mail by soga[at]zedat.fu-berlin.de.
Please cite as follows: Hartmann, K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.