30350_Exercises.knit

In the previous section we developed a multi-linear regression model from the climfrance data set for the mean annual temperature in France with which we were able to perform more or less accurate temperature predictions. In this section we want to apply Principle Component Analysis to explore the same data set further. Principle Component Analysis is a method for data transformation rather than modelling the data, such that interrelations between variables become clearer.

We start by downloading the data and looking at the structure of the data set.

clim <- read.csv("https://userpage.fu-berlin.de/soga/300/30100_data_sets/Climfrance.csv", sep = ";")
str(clim)

## 'data.frame':    36 obs. of  12 variables:
##  $ station          : chr  "Marseille" "Nice" "Cherbourg" "Bastia (Korsika)" ...
##  $ altitude         : chr  "3" "5" "8" "10" ...
##  $ lat              : num  43.5 43.7 49.6 42.5 47.2 ...
##  $ lon              : num  5.21 7.2 -1.63 9.48 -1.61 -1.71 -3.21 2.86 3.1 -0.7 ...
##  $ t_mean           : num  14.2 14.8 11.4 14.9 11.7 11.3 12.3 12.5 9.7 12.3 ...
##  $ t_max            : num  24.4 23.7 19.2 25 22.6 22.1 19.9 26.2 21.3 24.9 ...
##  $ t_min            : num  2.3 7.1 4.3 7.5 1.6 0.6 4.5 5.4 -0.2 0.6 ...
##  $ relhumidity      : int  68 72 80 70 82 82 89 66 82 81 ...
##  $ p_mean           : chr  "546" "862" "931" "735" ...
##  $ p_max24h         : int  86 147 62 201 62 45 81 186 49 54 ...
##  $ rainydays        : int  76 86 171 91 172 168 147 85 171 162 ...
##  $ sun_shine_hperyrs: int  2764 2775 1608 2603 1952 1805 2111 2644 1574 2052 ...

The data set consists of 36 rows (observations) and 12 columns (features/variables). The features are: altitude, latitude, longitude, mean annual temperature, maximum annual temperature, minimum annual temperature, relative humidity, mean annual precipitation, maximum precipitation in 24h, number of rainy days and number of sun shine hours per year. Before we can perform any calculations we need to transform the entries for altitude and mean annual precipitation into numerical values and delete the thousands separator.

clim[, "p_mean"] <- as.numeric(gsub(",", "", clim[, "p_mean"]))
clim[, "altitude"] <- as.numeric(gsub(",", "", clim[, "altitude"]))
str(clim)

## 'data.frame':    36 obs. of  12 variables:
##  $ station          : chr  "Marseille" "Nice" "Cherbourg" "Bastia (Korsika)" ...
##  $ altitude         : num  3 5 8 10 26 35 37 43 44 47 ...
##  $ lat              : num  43.5 43.7 49.6 42.5 47.2 ...
##  $ lon              : num  5.21 7.2 -1.63 9.48 -1.61 -1.71 -3.21 2.86 3.1 -0.7 ...
##  $ t_mean           : num  14.2 14.8 11.4 14.9 11.7 11.3 12.3 12.5 9.7 12.3 ...
##  $ t_max            : num  24.4 23.7 19.2 25 22.6 22.1 19.9 26.2 21.3 24.9 ...
##  $ t_min            : num  2.3 7.1 4.3 7.5 1.6 0.6 4.5 5.4 -0.2 0.6 ...
##  $ relhumidity      : int  68 72 80 70 82 82 89 66 82 81 ...
##  $ p_mean           : num  546 862 931 735 741 669 637 639 637 900 ...
##  $ p_max24h         : int  86 147 62 201 62 45 81 186 49 54 ...
##  $ rainydays        : int  76 86 171 91 172 168 147 85 171 162 ...
##  $ sun_shine_hperyrs: int  2764 2775 1608 2603 1952 1805 2111 2644 1574 2052 ...

Exercise 1: How many PCs are able to explain more than 80% of the total variance within the climfrance data set?

### your code here

PCA:

Show code

Let us first check the means and standard deviations of our variables.

apply(clim[2:ncol(clim)], 2, mean)

##          altitude               lat               lon            t_mean 
##        249.944444         46.187222          2.762778         11.097222 
##             t_max             t_min       relhumidity            p_mean 
##         22.366667          0.900000         77.111111        780.944444 
##          p_max24h         rainydays sun_shine_hperyrs 
##         90.027778        144.361111       2077.638889

apply(clim[2:ncol(clim)], 2, sd)

##          altitude               lat               lon            t_mean 
##        503.072102          2.406199          3.389456          2.999189 
##             t_max             t_min       relhumidity            p_mean 
##          3.453776          3.991133          4.944132        195.023067 
##          p_max24h         rainydays sun_shine_hperyrs 
##         43.348743         35.106493        332.560212

Since the variables are not on the same scale, we need to center and standardize them.

clim_pca <- prcomp(clim[2:ncol(clim)], center = TRUE, scale = TRUE)
clim_pca

## Standard deviations (1, .., p=11):
##  [1] 2.1676154 1.7154306 1.1576548 0.8840419 0.7425686 0.4883985 0.4340285
##  [8] 0.3695903 0.2786583 0.1731670 0.1203767
## 
## Rotation (n x k) = (11 x 11):
##                           PC1        PC2           PC3         PC4         PC5
## altitude           0.17210388  0.5257808 -0.0251946898  0.09725826 -0.07615375
## lat                0.31147404 -0.3267181 -0.2336254639 -0.09202134  0.37737972
## lon               -0.24657207  0.1461986 -0.4289998779 -0.56378978  0.48071686
## t_mean            -0.32004173 -0.3989083  0.1286004160  0.02594161 -0.03361975
## t_max             -0.26929880 -0.3554552  0.0001614318 -0.38964952 -0.51898942
## t_min             -0.30465109 -0.3378688  0.1778800264  0.30191875  0.34128576
## relhumidity        0.34102285 -0.2722595  0.1578985533  0.19120157  0.33218618
## p_mean             0.05222856  0.1626626  0.7536128421 -0.31610051  0.21222174
## p_max24h          -0.35017378  0.1872587  0.2914537986 -0.21769919  0.20365435
## rainydays          0.40794123 -0.1252653  0.1864210645 -0.25187906 -0.19093118
## sun_shine_hperyrs -0.36978836  0.2108617  0.0012958534  0.41622802  0.02060679
##                           PC6          PC7         PC8          PC9
## altitude           0.12071767 -0.117926505  0.14767138 -0.513287597
## lat               -0.32544552 -0.312688092 -0.41542144  0.112031531
## lon                0.28167619  0.159937833 -0.05224435 -0.238497982
## t_mean             0.20638126 -0.039478777  0.08560432 -0.121564466
## t_max              0.04031268  0.194207610 -0.14157256 -0.118003307
## t_min              0.04550189 -0.369884540  0.16582211 -0.453146229
## relhumidity        0.12243862  0.745264555  0.14433203 -0.086063470
## p_mean             0.34233833 -0.145695162 -0.27785286  0.179205261
## p_max24h          -0.74370879  0.183263387  0.24734765 -0.001318358
## rainydays         -0.25359176  0.009557541 -0.21865163 -0.609931651
## sun_shine_hperyrs -0.06545180  0.276272494 -0.73552533 -0.146695208
##                           PC10          PC11
## altitude          -0.264082054  0.5427261874
## lat               -0.195971461  0.4044199722
## lon                0.090481830 -0.1100925799
## t_mean             0.572952376  0.5705890782
## t_max             -0.539612101  0.1243482668
## t_min             -0.312638964 -0.2903335558
## relhumidity       -0.164893012  0.1090948118
## p_mean            -0.092937560  0.0393242216
## p_max24h          -0.008346965  0.1433829101
## rainydays          0.358580359 -0.2653993355
## sun_shine_hperyrs  0.044648468 -0.0009463635

Explained variance:

Show code

Now, we can calculate the explained variance:

clim_pca_var <- clim_pca$sdev^2
clim_pca_ve <- clim_pca_var / sum(clim_pca_var)
clim_pca_ve

##  [1] 0.427141489 0.267518367 0.121833158 0.071048183 0.050128017 0.021684831
##  [7] 0.017125519 0.012417909 0.007059133 0.002726073 0.001317322

The first principal component accounts for 43% of the total variance in our data set, the second principal component for 27%, the third principal component for 12% and so forth.

Visualization with scree plots:

par(mfrow = c(1, 2), mar = c(4, 5, 3, 1))
plot(clim_pca_ve,
  xlab = "Principal Component",
  ylab = "Proportion of Variance Explained",
  ylim = c(0, 1),
  type = "b",
  main = "Scree plot"
)

plot(cumsum(clim_pca_ve),
  xlab = "Principal Component",
  ylab = "Cumulative Proportion of\nVariance Explained",
  ylim = c(0, 1),
  type = "b",
  main = "Scree plot"
)

In the cumulative scree plot it is clearly visible that the first three principal components explain more than \(80\%\) of the variance in the data. Therefore we can assume that the first three PCs represent our data set adequately.

Exercise 2: How can we interprete the first two PC’s? Consider the loading matrix and the biplots.

### your code here

Loadings:

Show code

clim_pca$rotation[, 1:2]

##                           PC1        PC2
## altitude           0.17210388  0.5257808
## lat                0.31147404 -0.3267181
## lon               -0.24657207  0.1461986
## t_mean            -0.32004173 -0.3989083
## t_max             -0.26929880 -0.3554552
## t_min             -0.30465109 -0.3378688
## relhumidity        0.34102285 -0.2722595
## p_mean             0.05222856  0.1626626
## p_max24h          -0.35017378  0.1872587
## rainydays          0.40794123 -0.1252653
## sun_shine_hperyrs -0.36978836  0.2108617

By just looking at the loadings, we see that most variables seem to be equally represented in both PCs, although 24-hour precipitation and number of rainy days are better represented in PC1 and altitude is better represented in PC2. The annual mean precipitation is barely represented in either of the first two PCs. Also the variable longitude is less represented than others.

Bi-Plots:

Show code

For the bi-plots we can either use the built-in biplot() function or the autoplot() function from the ggfortify package.

library(ggfortify)

autoplot(clim_pca,
  loadings = TRUE,
  x = 1, y = 2,
  loadings.colour = "blue",
  loadings.label.size = 3,
  loadings.label = TRUE
) +
  scale_x_continuous(limits = c(-0.4, 0.4)) +
  scale_y_continuous(limits = c(-0.4, 0.75)) +
  theme_bw()

As we have already seen in the loading matrix, except for the mean precipitation, all variables seem to be fairly well represented in the first two PCs. This is visualized by the length of the vector representing a variable.
All variables seem to contribute more or less equally to both principal components, although rainy days and sunshine hours contribute more to PC1 and altitude and mean precipitation contribute more to PC2.
The angles between the vectors for each variable represent their correlation. Thus, the bi-plot visualizes a strong correlation of mean annual precipitation with altitude, a strong correlation between the number of sunshine hours, maximum precipitation within 24 hours and longitude and a strong correlation between mean maximum and minimum temperature (as one would expect). The number of rainy days and relative humidity are positively correlated with latitude and negatively correlated with sunshine hours per year, maximum precipitation within 24 hours and longitude.

Citation

The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us via mail by soga[at]zedat.fu-berlin.de.

You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follow: Hartmann, K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.