Correlation is a commonly used method to examine the relationship
between **quantitative variables**. The most commonly used
statistic is the **linear correlation coefficient**, \(r\), which is also known as the **Pearson product moment correlation
coefficient** in honor of its developer, Karl
Pearson. It is given by

\[r = \frac{\sum_{i=1}^n(x_i- \bar x)(y_i - \bar y)}{\sqrt{\sum_{i=1}^n(x_i- \bar x)^2}\sqrt{\sum_{i=1}^n(y_i- \bar y)^2}}=\frac{s_{xy}}{s_x s_y}\text{,}\]

where \(s_{xy}\) is the covariance of \(x\) and \(y\), \(s_x\) and \(s_y\) are the standard deviations of \(x\) and \(y\), respectively. By dividing by the sample standard deviations, \(s_x\) and \(s_y\), the linear correlation coefficient, \(r\), becomes scale independent and takes values between \(-1\) and \(1\).

The linear correlation coefficient measures the strength of the linear relationship between two variables. If \(r\) is close to \(\pm 1\), the two variables are highly correlated and if plotted on a scatter plot, the data points cluster around a line. If \(r\) is far from \(\pm 1\), the data points are more widely scattered. If \(r\) is near \(0\), the data points are essentially scattered around a horizontal line indicating that there is almost no linear relationship between the variables.

An interesting property of \(r\) is
that its sign reflects the slope of the linear relationship between two
variables. A **positive value** of \(r\) suggests that the variables are
**positively linearly correlated**, indicating that \(y\) tends to increase linearly as \(x\) increases. A **negative
value** of \(r\) suggests that
the variables are **negatively linearly correlated**,
indicating that \(y\) tends to decrease
linearly as \(x\) increases.

There is no unambiguous classification rule for the quantity of a linear relationship between two variables. However, the following table may serve a as rule of thumb how to address the numerical values of the Pearson product moment correlation coefficient:

\[ \begin{array}{lc} \hline \ \text{Strong linear relationship} & r > 0.9 \\ \ \text{Medium linear relationship} & 0.7 < r \le 0.9\\ \ \text{Weak linear relationship} & 0.5 < r \le 0.7 \\ \ \text{No or doubtful linear relationship} & 0 < r \le 0.5 \\ \hline \end{array} \]

Pearson’s correlation assumes the variables to be roughly normally distributed and it is not robust in the presence of outliers.

In a later section on linear regression we discuss the coefficient of determination, \(R^2\), a descriptive measure for the quality of linear models. There is a close relation between \(R^2\) and the linear correlation coefficient, \(r\). The coefficient of determination, \(R^2\), equals the square of the linear correlation coefficient, \(r\):

\[\text{coefficient of determination }(R^2) =r^2 \]

In order to get some intuition we calculate the Pearson product
moment correlation coefficient in an example. Therefore, we load the
*students* data set into our workspace (You may download the
`students.csv`

file here
or load it directly using `read.csv()`

).

`students <- read.csv("https://userpage.fu-berlin.de/soga/data/raw-data/students.csv")`

The *students* data set consists of 8239 rows, each of them
representing a particular student, and 16 columns, each of them
corresponding to a variable/feature related to that particular student.
These self-explaining variables are: *stud.id, name, gender, age,
height, weight, religion, nc.score, semester, major, minor, score1,
score2, online.tutorial, graduated, salary*.

In this example we assess **the linear relationship between the
weight and the height of students**. For this we randomly pick 37
students and extract the `weight`

and `height`

variables from the data set.

```
n <- 37
sample_idx <- sample(1:nrow(students), size = n)
weight <- students[sample_idx, "weight"]
height <- students[sample_idx, "height"]
plot(height, weight)
```

The scatter plot indicates that there exists a linear relationship between the two variables under consideration.

For the sake of this exercise we calculate the linear correlation
coefficient by hand at first and then we apply the `cor()`

function in R. Recall the equation from above:

\[r = \frac{\sum_{i=1}^n(x_i- \bar x)(y_i - \bar y)}{\sqrt{\sum_{i=1}^n(x_i- \bar x)^2}\sqrt{\sum_{i=1}^n(y_i- \bar y)^2}}=\frac{s_{xy}}{s_x s_y}\]

```
x <- height
y <- weight
x_bar <- mean(height)
y_bar <- mean(weight)
sum((x - x_bar) * (y - y_bar)) / (sqrt(sum((x - x_bar)^2)) * sqrt(sum((y - y_bar)^2)))
```

`## [1] 0.9743696`

As as sanity check we calculate the ratio of the covariance of \(x\) and \(y\) and the standard deviations of \(x\) and \(y\):

\[r = \frac{s_{xy}}{s_x s_y}\]

`cov(x, y) / (sd(x) * sd(y))`

`## [1] 0.9743696`

Finally, we apply the in-build `cor()`

function:

`cor(x, y)`

`## [1] 0.9743696`

Perfect. The three calculations yield the exact same result! The linear correlation coefficient evaluates to \(r = 0.9743696\). Thus, we may conclude that there is a strong linear correlation between the height and the weight of a student.

Of course a correlation analysis is not restricted to two variables.
Thanks to statistical software packages, such as R, we are able to
conduct a pairwise correlation analysis for **more than two
variables**. Let us first prepare the data set. For a better
visualization experience we draw 100 randomly picked students from the
*students* data set. Then we select a number of variables to
perform the correlation analysis on.

```
n <- 100
sample_idx <- sample(1:nrow(students), size = n)
vars <- c("height", "weight", "nc.score", "score1", "score2", "salary") # select variables
cor(students[sample_idx, vars])
```

```
## height weight nc.score score1 score2 salary
## height 1.00000000 0.94526224 -0.02448925 NA NA NA
## weight 0.94526224 1.00000000 -0.04048844 NA NA NA
## nc.score -0.02448925 -0.04048844 1.00000000 NA NA NA
## score1 NA NA NA 1 NA NA
## score2 NA NA NA NA 1 NA
## salary NA NA NA NA NA 1
```

The `cor()`

function returns a nice table, also called
**correlation matrix**, with the pairwise Pearson
correlation coefficients. Obviously, some variables contain missing
values, denoted as `NA`

. We exclude those values from the
analysis by adding the argument
`use = 'pairwise.complete.obs'`

to the function call.

`cor(students[sample_idx, vars], use = "pairwise.complete.obs")`

```
## height weight nc.score score1 score2
## height 1.00000000 0.94526224 -0.024489250 0.195887582 0.16769035
## weight 0.94526224 1.00000000 -0.040488441 0.241458260 0.21921260
## nc.score -0.02448925 -0.04048844 1.000000000 -0.008946945 -0.02146809
## score1 0.19588758 0.24145826 -0.008946945 1.000000000 0.87080991
## score2 0.16769035 0.21921260 -0.021468095 0.870809912 1.00000000
## salary 0.51781672 0.36597798 0.283531817 0.381895086 0.15137427
## salary
## height 0.5178167
## weight 0.3659780
## nc.score 0.2835318
## score1 0.3818951
## score2 0.1513743
## salary 1.0000000
```

A table is a nice representation for a correlation analysis, but a
figure would of course improve the interpretability. R provides the
`pairs()`

function for plotting correlation matrices.

`pairs(students[sample_idx, vars])`

Immediately, we realize that the majority of the variables does not
appear to be linearly correlated. In contrast, the variable pairs
`height`

and `weight`

, as well as
`score1`

and `score2`

appear to be positively
correlated.

Another recommendable function for pair plots is provided by the
`psych`

package. The function `pairs.panels()`

is
very flexible and comes with many nice plotting features. Visit the help
page or type `?pairs.panels`

into your console for more
information on the function.

```
library(psych)
pairs.panels(students[sample_idx, vars],
smooth = FALSE,
ellipses = FALSE,
main = "Pearson product moment correlation coefficient"
)
```

The function returns a plot, which shows scatter plots, the variable histograms and the density curves as well as the correlation coefficients.

Note:Correlation functions implemented in R, such as`cor()`

or`pairs.panels()`

, include different types of correlation coefficients such asPearson’s,Spearman’sandKendall’s correlation coefficients. To pick one particular formula you add the argument`method`

to the function call. Pearson’s correlation coefficient is the default setting.

```
pairs.panels(students[sample_idx, vars],
smooth = FALSE,
ellipses = FALSE,
method = "spearman",
main = "Spearman's rank correlation coefficient"
)
```

**Spearman’s rank correlation
coefficient**, also known as Spearman’s \(\rho\), is a non-parametric rank
correlation coefficient. It was developed by Charles Spearman and is an alternative to the
Pearson product moment correlation coefficient. The Spearman’s rank
correlation coefficient is denoted by \(r_s\) for sample data and by \(\rho_s\) for population data (Mann 2012).
The correlation coefficient assesses the monotonic relationship between
two variables and ranges between \(-1\)
and \(1\). It describes the linear
correlation between the ranks of the data of variables \(x\) and \(y\). Spearman’s correlation is high when
the variables have a similar rank and low when variables have a
dissimilar rank.

To calculate \(r_s\) the data for each variable, \(x\) and \(y\), is ranked separately. For a given bivariate sequence \((x_1, y_1), (x_2, y_2), ..., (x_n, y_n)\) Spearman’s \(r_s\) is given by

\[r_s=1-\frac{6\sum_{i=1}^n (r_{xi}-r_{yi})^2}{n(n^2-1)}\text{,}\] where \(r_{xi}=Rank(x_i)\) , \(r_{yi}= Rank(y_i)\) and \(n\) is the sample size.

In contrast to Pearson’s linear correlation coefficient Spearman’s linear correlation coefficient is appropriate for both, quantitative and ordinal, variables. In addition, rank based correlations are not dependent on the assumption of a normal distribution and are more resistant to outliers (Schumann 2010).

Let us consider an example. The population of a community along a river believes that recent increases in peak discharge rates are due to deforestation by a logging company. We calculate Spearman’s rank correlation coefficient to assess, whether there is a correlation between peak discharge and the fraction of deforestation area in the watershed (data modified after McCuen 2003, p. 112).

\[ \begin{array}{lc} \hline \ \text{Year} & \text{Discharge (m}^3 \text{s}^{-1}) & \text{Logging area (%)} \\ \hline \ 1982 & 227 & 53 \\ \ 1983 & 249 & 56 \\ \ 1984 & 210 & 57 \\ \ 1985 & 190 & 58 \\ \ 1986 & 314 & 55 \\ \ 1987 & 345 & 54 \\ \ 1988 & 161 & 51 \\ \ 1988 & 266 & 50 \\ \ 1989 & 402 & 49 \\ \ 1990 & 215 & 57 \\ \ 1991 & 164 & 46 \\ \ 1992 & 405 & 44 \\ \ 1993 & 328 & 43 \\ \ 1994 & 294 & 42 \\ \hline \end{array} \]

Let us construct our data vectors. We assign the discharge values to
the variable `Q`

and the logging area to the variable
`logged`

.

```
Q <- c(227, 249, 210, 190, 314, 345, 161, 266, 402, 215, 164, 405, 328, 294)
logged <- c(53, 56, 57, 58, 55, 54, 51, 50, 49, 47, 46, 44, 43, 42)
```

First, we calculate Spearman’s rank correlation coefficient by hand. Recall the equation

\[r_s=1-\frac{6\sum_{i=1}^n (r_{xi}-r_{yi})^2}{n(n^2-1)}\text{,}\]

where \(r_{xi}=Rank(x_i)\) , \(r_{yi}= Rank(y_i)\) and \(n\) is the sample size. We apply the
`rank()`

function to calculate the rank of the values of each
variable.

```
# set number of observations
n <- length(Q)
# calculate rank
r_xi <- rank(Q)
r_yi <- rank(logged)
# plug into equation
rs <- 1 - ((6 * sum((r_xi - r_yi)^2)) / (n * (n^2 - 1)))
rs
```

`## [1] -0.3406593`

Alternatively, we may apply the `cor()`

function in R and
add the argument `method = "spearman"`

.

`cor(Q, logged, method = "spearman")`

`## [1] -0.3406593`

Spearman’s rank correlation coefficient is nothing else than Pearson’s linear correlation coefficient on the ranked data. Thus, the following line of code should equal the previous results:

`cor(rank(Q), rank(logged))`

`## [1] -0.3406593`

Perfect! We got the same result using all three calculations, which yield a Spearman’s rank correlation coefficient of \(r_s= -0.34\). The result indicates that there is no to a weak negative correlation between peak discharge and logging area. In other words, the discharge tends to decrease when the logging area increases. Thus, the perception of the population cannot be confirmed by our statistical analysis.

Note:It is always recommended to advance with a statistical test, in order to assess whether the result is statistically significant or whether the variation is just due to chance. Check out the sections onHypothesis Testingfor further information!

**Citation**

The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us via mail by soga[at]zedat.fu-berlin.de.

You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follow: *Hartmann,
K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis
using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.*