In [1]:

```
# import libraries
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.decomposition import PCA
```

In [2]:

```
# load data from previous sections
dwd_data_pca = pd.read_feather("https://userpage.fu-berlin.de/soga/data/py-data/DWD.feather")
dwd_data_pca.set_index("index", inplace = True)
dwd_data_pca.head()
```

Out[2]:

ID | LAT | LON | ALTITUDE | RECORD_LENGTH | MEAN_ANNUAL_AIR_TEMP | MEAN_MONTHLY_MAX_TEMP | MEAN_MONTHLY_MIN_TEMP | MEAN_ANNUAL_WIND_SPEED | MEAN_CLOUD_COVER | MEAN_ANNUAL_SUNSHINE | MEAN_ANNUAL_RAINFALL | MAX_MONTHLY_WIND_SPEED | MAX_RAINFALL | MIN_AIR_TEMP | MEAN_RANGE_AIR_TEMP | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

index | ||||||||||||||||

1 | 1 | 50.7827 | 6.0941 | 202.0 | 160 | 9.8 | 13.6 | 6.3 | 3.0 | 67.0 | 1531.0 | 820.0 | 3.0 | 36.0 | -10.9 | 7.3 |

2 | 2 | 52.9335 | 8.2370 | 44.0 | 45 | 9.2 | 13.2 | 5.4 | 2.0 | 67.0 | 1459.0 | 759.0 | 3.0 | 32.0 | -12.6 | 7.8 |

3 | 6 | 48.2156 | 8.9784 | 759.0 | 30 | 7.4 | 12.2 | 3.3 | 2.0 | 66.0 | 1725.0 | 919.0 | 2.0 | 43.0 | -15.5 | 8.9 |

4 | 8 | 48.6159 | 13.0506 | 340.0 | 64 | 8.4 | 13.4 | 3.9 | 1.0 | 65.0 | 1595.0 | 790.0 | 2.0 | 43.0 | -19.2 | 9.5 |

11 | 20 | 49.7273 | 8.1164 | 215.0 | 115 | 9.5 | 13.8 | 5.3 | 2.0 | 65.0 | 1597.0 | 526.0 | 2.0 | 31.0 | -13.4 | 8.6 |

In [3]:

```
# center and scale data
dwd_data_pca = (dwd_data_pca - dwd_data_pca.mean()) / dwd_data_pca.std()
```

Now we apply the PCA algorithm to our data set.

In [4]:

```
# apply PCA
pca = PCA()
pca.fit(dwd_data_pca)
```

Out[4]:

PCA()

In [5]:

```
# dimensions of principal component loading vectors
print(pca.components_.shape)
# dimensions of principal component scores
print(pca.transform(dwd_data_pca).shape)
```

(16, 16) (397, 16)

In [6]:

```
# explained variance
pca_ve = pca.explained_variance_ratio_
print(pca_ve)
```

In [7]:

```
# plot explained variance
fig, ax = plt.subplots(1, 2, figsize=(12, 4))
ax[0].plot(pca_ve)
ax[0].set_xlabel("Principal Component")
ax[0].set_ylabel("Proportion of Explained Variance")
ax[0].set_title("Proportion of Explained Variance by Principal Component")
ax[1].plot(np.cumsum(pca_ve))
ax[1].set_xlabel("Principal Component")
ax[1].set_ylabel("Cumulative Sum of Explained Variance")
ax[1].set_title("Cumulative Sum of Explained Variance by Principal Component")
plt.show()
```

*elbow*) of the proportion of variance explained. Moreover, the first three principal
components account for approximately 69% of the explained variance in the data set. Based on these
indicators it seems reasonable to keep the first three principal components to represent our data
set.

In [8]:

```
# calculate principal component scores
pca_scores = pca.transform(dwd_data_pca)
# save pca_PC3 to feather for later usage
pca_PC3 = pd.DataFrame(pca_scores[:, 0:3])
pca_PC3.columns = ["PC1", "PC2", "PC3"]
pca_PC3.to_feather("pca_PC3.feather")
```

In [25]:

```
# biplot PC1 vs PC2 with vectors and variable labels
fig=plt.figure(figsize=(15,15), dpi= 100, facecolor='w', edgecolor='k')
sns.scatterplot(x = pca_scores[:, 0], y = pca_scores[:, 1])
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("Biplot PC1 vs PC2")
for i in range(pca.components_.shape[1]):
plt.arrow(0, 0, pca.components_[0, i] * 7, pca.components_[1, i] * 4, color = "r")
plt.text(pca.components_[0, i] *7, pca.components_[1, i] * 4.5, dwd_data_pca.columns[i],
color = "k", ha = "center")
plt.show()
```

Recall how to interpret a biplot: The points show the observations in the plane formed by two
principal components (here PC1 and PC2). We may look for patterns, cluster, and outliers. The arrows
are vectors, corresponding to the original variables from which the principal components were
computed. The *orientation (direction/angle)* of the vector is an indicator how much the variable
contributes to the principal component PC space: the more parallel to a principal component axis,
the more it contributes only to that PC. The *length in the space* of the vector indicates how much
variability of this variable is represented by the two displayed principal components
(here PC1 and PC2). The *angles between vectors* of different variables show their correlation in
this space: small angles represent high positive correlation, right angles represent lack of
correlation, opposite angles represent high negative correlation (Rossiter 2014).

Based on that reasoning and by examining the biplot above we may conclude that variables related to
rainfall and the variable `ALTITUDE`

are highly positive correlated and negatively correlated with
variables related to temperatures. Further we realize that these vectors plot along the axis of the
first principal component (PC1), thus indicating that PC1 accounts for the variability in the data
set due to weather stations located in warm and dry low altitude regions and weather stations located in wet and cold high altitude regions. Similar conclusions can be drawn for the second principal component (PC2). PC2 accounts for the variability in the data set due to weather stations characterized by high wind speed and low seasonal variability in temperature, probably coastal regions, and weather stations characterized by low(er) wind speed and high(er) seasonal variability in temperature, probably regions with continental climates. Based on that characterization we may reference each particular point, and thus each particular weather station, and associate it with a particular domain. Let us consider the points that plot in the lower left quadrant; these are probably high altitude weather stations, characterized by low temperatures, high rainfall and gusty winds.

In [27]:

```
# biplot PC1 vs PC3
# biplot PC1 vs PC2 with vectors and variable labels
fig=plt.figure(figsize=(15,15), dpi= 100, facecolor='w', edgecolor='k')
sns.scatterplot(x = pca_scores[:, 0], y = pca_scores[:, 2])
plt.xlabel("PC1")
plt.ylabel("PC3")
plt.title("Biplot PC1 vs PC3")
for i in range(pca.components_.shape[1]):
plt.arrow(0, 0, pca.components_[0, i] * 4, pca.components_[2, i] * 4, color = "r")
plt.text(pca.components_[0, i] * 4.5, pca.components_[2, i] * 4.5, dwd_data_pca.columns[i],
color = "k", ha = "center")
plt.show()
```

`LON`

) and latitude (`LAT`

). However,
compared to the previous figure this plot is somehow more difficult to interpret.

**Citation**

The E-Learning project SOGA-Py was developed at the Department of Earth Sciences by Annette Rudolph, Joachim Krois and Kai Hartmann. You can reach us via mail by soga[at]zedat.fu-berlin.de.

You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follow: *Rudolph, A., Krois, J., Hartmann, K. (2023): Statistics and Geodata Analysis
using Python (SOGA-Py). Department of Earth Sciences, Freie Universitaet Berlin.*