Before we start with PCA, we take a look at the mean and standard deviation of the variables in our data set.
# load data from previous sections
load("dwd_pca_30300.RData")
# Exclude variables
cols_to_drop <- c("ID",
"DWD_ID",
"STATION_NAME",
"FEDERAL_STATE",
"PERIOD") # columns to drop
rows_to_drop <- complete.cases(dwd_data_pca) # rows to drop
dwd_data_pca <- dwd_data_pca[rows_to_drop, !(colnames(dwd_data_pca) %in% cols_to_drop)]
apply(dwd_data_pca, 2, mean)
## LAT LON ALTITUDE RECORD_LENGTH
## 51.389132 10.487669 254.852941 92.274510
## MEAN_ANNUAL_AIR_TEMP MEAN_MONTHLY_MAX_TEMP MEAN_MONTHLY_MIN_TEMP MEAN_ANNUAL_WIND_SPEED
## 8.403431 12.432843 4.676471 2.500000
## MEAN_CLOUD_COVER MEAN_ANNUAL_SUNSHINE MEAN_ANNUAL_RAINFALL MAX_MONTHLY_WIND_SPEED
## 67.700980 1555.985294 745.656863 3.098039
## MAX_AIR_TEMP MAX_WIND_SPEED MAX_RAINFALL MIN_AIR_TEMP
## 31.630392 27.607353 37.588235 -14.345098
## MEAN_RANGE_AIR_TEMP
## 7.752941
apply(dwd_data_pca, 2, sd)
## LAT LON ALTITUDE RECORD_LENGTH
## 1.9907330 2.2244276 296.8507079 41.0932274
## MEAN_ANNUAL_AIR_TEMP MEAN_MONTHLY_MAX_TEMP MEAN_MONTHLY_MIN_TEMP MEAN_ANNUAL_WIND_SPEED
## 1.3520056 1.6021075 1.2299366 0.7260121
## MEAN_CLOUD_COVER MEAN_ANNUAL_SUNSHINE MEAN_ANNUAL_RAINFALL MAX_MONTHLY_WIND_SPEED
## 2.7026349 151.7691233 232.2631213 0.8654257
## MAX_AIR_TEMP MAX_WIND_SPEED MAX_RAINFALL MIN_AIR_TEMP
## 2.1149614 5.0680099 7.0972873 2.5423750
## MEAN_RANGE_AIR_TEMP
## 1.1622728
We notice that the data is not on the same scale. Thus, when calling
the prcomp() function we additionally center
(center = TRUE) and standardize (scale = TRUE)
the variables.
pca <- prcomp(dwd_data_pca, center = TRUE, scale = TRUE)
Let us check the dimensions of the principal component loading vectors and of the principal component scores.
# principal component loading vector
dim(pca$rotation)
## [1] 17 17
# principal component scores
dim(pca$x)
## [1] 204 17
All right, we got 397 observations and 15 principal components. In a next step we calculate the explained variance by each principal component.
pca_var <- pca$sdev^2
pca_ve <- pca_var / sum(pca_var)
pca_ve
## [1] 3.651465e-01 2.350028e-01 1.046496e-01 8.741310e-02 5.955536e-02 4.249315e-02 2.552711e-02
## [8] 2.391003e-02 2.105482e-02 1.461689e-02 9.801329e-03 4.004338e-03 2.852801e-03 2.688936e-03
## [15] 7.573997e-04 4.281688e-04 9.765269e-05
We see that the first principal component accounts for 36% of the total variance in our data set, the second principal component for 24%, the third principal component for 11% and so forth. We plot the proportion of explained variance by each particular principal component with their absolute values (left plot) and as cumulative sums (right plot).
par(mfrow = c(1, 2), mar = c(4, 5, 3, 1))
plot(pca_ve,
xlab = "Principal Component",
ylab = "Proportion of Variance Explained",
ylim = c(0, 1),
type = "b",
main = "Scree plot")
plot(cumsum(pca_ve),
xlab = "Principal Component",
ylab = "Cumulative Proportion of\nVariance Explained",
ylim = c(0, 1),
type = "b",
main = "Scree plot")
By inspecting the scree plots above, we see that after the third principal components there is drop off (elbow) of the proportion of variance explained. Moreover, the first three principal components account for approximately 69% of the explained variance in the data set. Based on these indicators it seems reasonable to keep the first three principal components to represent our data set.
pca_PC3 <- pca$x[, 1:3]
# save pca_PC3 for later usage
save(pca_PC3, file = "pca_PC3_30300.RData")
Finally, we take a look at the biplot. This time, we plot the biplot
by applying the autoplot() function from the
ggfortify package, which allows more tweaking and results
in more beautiful plots compared to the build-in biplot()
function. We plot the combinations PC1 vs PC2 and PC1 vs PC3.
library(ggfortify)
autoplot(pca, loadings = TRUE,
x = 1, y = 2,
loadings.colour = "blue",
loadings.label.size = 3,
loadings.label = TRUE) +
scale_x_continuous(limits = c(-0.3, 0.3)) +
scale_y_continuous(limits = c(-0.3, 0.3)) +
theme_bw()
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
Recall how to interpret a biplot: The points show the observations in the plane formed by two principal components (here PC1 and PC2). We may look for patterns, cluster, and outliers. The arrows are vectors, corresponding to the original variables from which the principal components were computed. The orientation (direction/angle) of the vector is an indicator how much the variable contributes to the principal component PC space: the more parallel to a principal component axis, the more it contributes only to that PC. The length in the space of the vector indicates how much variability of this variable is represented by the two displayed principal components (here PC1 and PC2). The angles between vectors of different variables show their correlation in this space: small angles represent high positive correlation, right angles represent lack of correlation, opposite angles represent high negative correlation (Rossiter 2014).
Based on that reasoning and by examining the biplot above we may
conclude that variables related to rainfall and the variable
ALTITUDE are highly positive correlated and negatively
correlated with variables related to temperatures. Furthermore, we
realize that these vectors plot along the axis of the first principal
component (PC1), indicating that PC1 accounts for the variability in the
data set due to weather stations located in warm and dry low altitude
regions and weather stations located in wet and cold high altitude
regions. Similar conclusions can be drawn for the second principal
component (PC2). PC2 accounts for the variability in the data set due to
weather stations characterized by high wind speed and low seasonal
variability in temperature, probably coastal regions, and weather
stations characterized by low(er) wind speed and high(er) seasonal
variability in temperature, probably regions with continental climates.
Based on that characterization we may reference each particular point,
and thus each particular weather station, and associate it with a
particular domain. Let us consider the points that plot in the lower
left quadrant; these are probably high altitude weather stations,
characterized by low temperatures, high rainfall and gusty winds.
autoplot(pca, loadings = TRUE,
x = 1, y = 3,
loadings.colour = "blue",
loadings.label.size = 3,
loadings.label = TRUE) +
scale_x_continuous(limits = c(-0.25, 0.25)) +
scale_y_continuous(limits = c(-0.15, 0.15)) +
theme_bw()
## Warning: Removed 9 rows containing missing values or values outside the scale range
## (`geom_point()`).
In this biplot we plotted the first principal component (PC1) against
the third principal component (PC3). The figure indicates the third
principal component accounts for some of the variability in the data set
related to the location in terms of longitude (LON) and
latitude (LAT). However, compared to the previous figure
this plot is somehow more difficult to interpret.
Citation
The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us via mail by soga[at]zedat.fu-berlin.de.
Please cite as follow: Hartmann, K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.