In the previous section we developed a multi-linear regression model
from the climfrance data set for the mean annual temperature in France
with which we were able to perform more or less accurate temperature
predictions. In this section we want to apply Principle Component
Analysis to explore the same data set further. Principle Component
Analysis is a method for data transformation rather than modelling the
data, such that interrelations between variables become clearer.
We start by downloading the data and looking at the structure of the data set.
clim <- read.csv("https://userpage.fu-berlin.de/soga/300/30100_data_sets/Climfrance.csv", sep = ";")
str(clim)
## 'data.frame': 36 obs. of 12 variables:
## $ station : chr "Marseille" "Nice" "Cherbourg" "Bastia (Korsika)" ...
## $ altitude : chr "3" "5" "8" "10" ...
## $ lat : num 43.5 43.7 49.6 42.5 47.2 ...
## $ lon : num 5.21 7.2 -1.63 9.48 -1.61 -1.71 -3.21 2.86 3.1 -0.7 ...
## $ t_mean : num 14.2 14.8 11.4 14.9 11.7 11.3 12.3 12.5 9.7 12.3 ...
## $ t_max : num 24.4 23.7 19.2 25 22.6 22.1 19.9 26.2 21.3 24.9 ...
## $ t_min : num 2.3 7.1 4.3 7.5 1.6 0.6 4.5 5.4 -0.2 0.6 ...
## $ relhumidity : int 68 72 80 70 82 82 89 66 82 81 ...
## $ p_mean : chr "546" "862" "931" "735" ...
## $ p_max24h : int 86 147 62 201 62 45 81 186 49 54 ...
## $ rainydays : int 76 86 171 91 172 168 147 85 171 162 ...
## $ sun_shine_hperyrs: int 2764 2775 1608 2603 1952 1805 2111 2644 1574 2052 ...
The data set consists of 36 rows (observations) and 12 columns (features/variables). The features are: altitude, latitude, longitude, mean annual temperature, maximum annual temperature, minimum annual temperature, relative humidity, mean annual precipitation, maximum precipitation in 24h, number of rainy days and number of sun shine hours per year. Before we can perform any calculations we need to transform the entries for altitude and mean annual precipitation into numerical values and delete the thousands separator.
clim[, "p_mean"] <- as.numeric(gsub(",", "", clim[, "p_mean"]))
clim[, "altitude"] <- as.numeric(gsub(",", "", clim[, "altitude"]))
str(clim)
## 'data.frame': 36 obs. of 12 variables:
## $ station : chr "Marseille" "Nice" "Cherbourg" "Bastia (Korsika)" ...
## $ altitude : num 3 5 8 10 26 35 37 43 44 47 ...
## $ lat : num 43.5 43.7 49.6 42.5 47.2 ...
## $ lon : num 5.21 7.2 -1.63 9.48 -1.61 -1.71 -3.21 2.86 3.1 -0.7 ...
## $ t_mean : num 14.2 14.8 11.4 14.9 11.7 11.3 12.3 12.5 9.7 12.3 ...
## $ t_max : num 24.4 23.7 19.2 25 22.6 22.1 19.9 26.2 21.3 24.9 ...
## $ t_min : num 2.3 7.1 4.3 7.5 1.6 0.6 4.5 5.4 -0.2 0.6 ...
## $ relhumidity : int 68 72 80 70 82 82 89 66 82 81 ...
## $ p_mean : num 546 862 931 735 741 669 637 639 637 900 ...
## $ p_max24h : int 86 147 62 201 62 45 81 186 49 54 ...
## $ rainydays : int 76 86 171 91 172 168 147 85 171 162 ...
## $ sun_shine_hperyrs: int 2764 2775 1608 2603 1952 1805 2111 2644 1574 2052 ...
Exercise 1: How many PCs are able to explain more than 80% of the total variance within the climfrance data set?
### your code here
PCA:
Let us first check the means and standard deviations of our variables.
apply(clim[2:ncol(clim)], 2, mean)
## altitude lat lon t_mean
## 249.944444 46.187222 2.762778 11.097222
## t_max t_min relhumidity p_mean
## 22.366667 0.900000 77.111111 780.944444
## p_max24h rainydays sun_shine_hperyrs
## 90.027778 144.361111 2077.638889
apply(clim[2:ncol(clim)], 2, sd)
## altitude lat lon t_mean
## 503.072102 2.406199 3.389456 2.999189
## t_max t_min relhumidity p_mean
## 3.453776 3.991133 4.944132 195.023067
## p_max24h rainydays sun_shine_hperyrs
## 43.348743 35.106493 332.560212
Since the variables are not on the same scale, we need to center and standardize them.
clim_pca <- prcomp(clim[2:ncol(clim)], center = TRUE, scale = TRUE)
clim_pca
## Standard deviations (1, .., p=11):
## [1] 2.1676154 1.7154306 1.1576548 0.8840419 0.7425686 0.4883985 0.4340285
## [8] 0.3695903 0.2786583 0.1731670 0.1203767
##
## Rotation (n x k) = (11 x 11):
## PC1 PC2 PC3 PC4 PC5
## altitude 0.17210388 0.5257808 -0.0251946898 0.09725826 -0.07615375
## lat 0.31147404 -0.3267181 -0.2336254639 -0.09202134 0.37737972
## lon -0.24657207 0.1461986 -0.4289998779 -0.56378978 0.48071686
## t_mean -0.32004173 -0.3989083 0.1286004160 0.02594161 -0.03361975
## t_max -0.26929880 -0.3554552 0.0001614318 -0.38964952 -0.51898942
## t_min -0.30465109 -0.3378688 0.1778800264 0.30191875 0.34128576
## relhumidity 0.34102285 -0.2722595 0.1578985533 0.19120157 0.33218618
## p_mean 0.05222856 0.1626626 0.7536128421 -0.31610051 0.21222174
## p_max24h -0.35017378 0.1872587 0.2914537986 -0.21769919 0.20365435
## rainydays 0.40794123 -0.1252653 0.1864210645 -0.25187906 -0.19093118
## sun_shine_hperyrs -0.36978836 0.2108617 0.0012958534 0.41622802 0.02060679
## PC6 PC7 PC8 PC9
## altitude 0.12071767 -0.117926505 0.14767138 -0.513287597
## lat -0.32544552 -0.312688092 -0.41542144 0.112031531
## lon 0.28167619 0.159937833 -0.05224435 -0.238497982
## t_mean 0.20638126 -0.039478777 0.08560432 -0.121564466
## t_max 0.04031268 0.194207610 -0.14157256 -0.118003307
## t_min 0.04550189 -0.369884540 0.16582211 -0.453146229
## relhumidity 0.12243862 0.745264555 0.14433203 -0.086063470
## p_mean 0.34233833 -0.145695162 -0.27785286 0.179205261
## p_max24h -0.74370879 0.183263387 0.24734765 -0.001318358
## rainydays -0.25359176 0.009557541 -0.21865163 -0.609931651
## sun_shine_hperyrs -0.06545180 0.276272494 -0.73552533 -0.146695208
## PC10 PC11
## altitude -0.264082054 0.5427261874
## lat -0.195971461 0.4044199722
## lon 0.090481830 -0.1100925799
## t_mean 0.572952376 0.5705890782
## t_max -0.539612101 0.1243482668
## t_min -0.312638964 -0.2903335558
## relhumidity -0.164893012 0.1090948118
## p_mean -0.092937560 0.0393242216
## p_max24h -0.008346965 0.1433829101
## rainydays 0.358580359 -0.2653993355
## sun_shine_hperyrs 0.044648468 -0.0009463635
Now, we can calculate the explained variance:
clim_pca_var <- clim_pca$sdev^2
clim_pca_ve <- clim_pca_var / sum(clim_pca_var)
clim_pca_ve
## [1] 0.427141489 0.267518367 0.121833158 0.071048183 0.050128017 0.021684831
## [7] 0.017125519 0.012417909 0.007059133 0.002726073 0.001317322
The first principal component accounts for 43% of the total variance
in our data set, the second principal component for 27%, the third
principal component for 12% and so forth.
Visualization with scree plots:
par(mfrow = c(1, 2), mar = c(4, 5, 3, 1))
plot(clim_pca_ve,
xlab = "Principal Component",
ylab = "Proportion of Variance Explained",
ylim = c(0, 1),
type = "b",
main = "Scree plot"
)
plot(cumsum(clim_pca_ve),
xlab = "Principal Component",
ylab = "Cumulative Proportion of\nVariance Explained",
ylim = c(0, 1),
type = "b",
main = "Scree plot"
)
Exercise 2: How can we interprete the first two PC’s? Consider the loading matrix and the biplots.
### your code here
Loadings:
clim_pca$rotation[, 1:2]
## PC1 PC2
## altitude 0.17210388 0.5257808
## lat 0.31147404 -0.3267181
## lon -0.24657207 0.1461986
## t_mean -0.32004173 -0.3989083
## t_max -0.26929880 -0.3554552
## t_min -0.30465109 -0.3378688
## relhumidity 0.34102285 -0.2722595
## p_mean 0.05222856 0.1626626
## p_max24h -0.35017378 0.1872587
## rainydays 0.40794123 -0.1252653
## sun_shine_hperyrs -0.36978836 0.2108617
By just looking at the loadings, we see that most variables seem to be equally represented in both PCs, although 24-hour precipitation and number of rainy days are better represented in PC1 and altitude is better represented in PC2. The annual mean precipitation is barely represented in either of the first two PCs. Also the variable longitude is less represented than others.
For the bi-plots we can either use the built-in biplot()
function or the autoplot()
function from the
ggfortify
package.
library(ggfortify)
autoplot(clim_pca,
loadings = TRUE,
x = 1, y = 2,
loadings.colour = "blue",
loadings.label.size = 3,
loadings.label = TRUE
) +
scale_x_continuous(limits = c(-0.4, 0.4)) +
scale_y_continuous(limits = c(-0.4, 0.75)) +
theme_bw()
As we have already seen in the loading matrix, except for the mean
precipitation, all variables seem to be fairly well represented in the
first two PCs. This is visualized by the length of the vector
representing a variable.
All variables seem to contribute more or less equally to both principal
components, although rainy days and sunshine hours contribute more to
PC1 and altitude and mean precipitation contribute more to PC2.
The angles between the vectors for each variable represent their
correlation. Thus, the bi-plot visualizes a strong correlation of mean
annual precipitation with altitude, a strong correlation between the
number of sunshine hours, maximum precipitation within 24 hours and
longitude and a strong correlation between mean maximum and minimum
temperature (as one would expect). The number of rainy days and relative
humidity are positively correlated with latitude and negatively
correlated with sunshine hours per year, maximum precipitation within 24
hours and longitude.
Citation
The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us via mail by soga[at]zedat.fu-berlin.de.
Please cite as follow: Hartmann, K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.