Before we dive into PCA, we first prepare our data set. To do so, we load the dwd data set. You may download the DWD.csv file here. Import the data set and assign a proper name to it.
dwd <- read.csv("https://userpage.fu-berlin.de/soga/data/raw-data/DWD.csv",
  encoding = "latin1"
)
The dwd data set consists of 599 rows, each of them representing a particular weather station in Germany, and 22 columns, each of them corresponding to a variable/feature related to that particular weather station. These self-explanatory variables are: ID, DWD_ID, STATION_NAME, FEDERAL_STATE, LAT, LON, ALTITUDE, PERIOD, RECORD_LENGTH, MEAN_ANNUAL_AIR_TEMP, MEAN_MONTHLY_MAX_TEMP, MEAN_MONTHLY_MIN_TEMP, MEAN_ANNUAL_WIND_SPEED, MEAN_CLOUD_COVER, MEAN_ANNUAL_SUNSHINE, MEAN_ANNUAL_RAINFALL, MAX_MONTHLY_WIND_SPEED, MAX_AIR_TEMP, MAX_WIND_SPEED, MAX_RAINFALL, MIN_AIR_TEMP, MEAN_RANGE_AIR_TEMP.
The data was downloaded from the DWD (German Weather Service) data portal and preprocessed for the purpose of this tutorial. You may find a detailed description of the data set here.
By calling the str() function on the data set, we see that some features, such as DWD_ID and STATION_NAME, among others, do not carry any particularly useful information. Hence, we exclude them from the data set. In addition, we exclude observations that contain missing values (NA). Because the variable MAX_WIND_SPEED has a disproportionately large number of missing values, we drop this variable entirely before removing the incomplete rows.
str(dwd)
## 'data.frame': 599 obs. of 22 variables:
## $ ID : int 0 1 2 6 8 9 10 12 14 18 ...
## $ DWD_ID : int 1 3 44 71 73 78 91 98 116 132 ...
## $ STATION_NAME : chr "Aach" "Aachen" "Großenkneten" "Albstadt-Badkap" ...
## $ FEDERAL_STATE : chr "Baden-Württemberg" "Nordrhein-Westfalen" "Niedersachsen" "Baden-Württemberg" ...
## $ LAT : num 47.8 50.8 52.9 48.2 48.6 ...
## $ LON : num 8.85 6.09 8.24 8.98 13.05 ...
## $ ALTITUDE : num 478 202 44 759 340 65 300 780 213 750 ...
## $ PERIOD : chr "1931-1986" "1851-2011" "1971-2016" "1986-2016" ...
## $ RECORD_LENGTH : int 55 160 45 30 64 55 38 67 67 33 ...
## $ MEAN_ANNUAL_AIR_TEMP : num 8.2 9.8 9.2 7.4 8.4 9.3 8.2 5.1 8.4 5.7 ...
## $ MEAN_MONTHLY_MAX_TEMP : num 13.1 13.6 13.2 12.2 13.4 13.4 12.7 8.9 12.9 9.2 ...
## $ MEAN_MONTHLY_MIN_TEMP : num 3.5 6.3 5.4 3.3 3.9 5.2 4.1 2.2 4.2 2.7 ...
## $ MEAN_ANNUAL_WIND_SPEED: num 2 3 2 2 1 2 3 3 2 3 ...
## $ MEAN_CLOUD_COVER : num 67 67 67 66 65 67 72 72 66 64 ...
## $ MEAN_ANNUAL_SUNSHINE : num NA 1531 1459 1725 1595 ...
## $ MEAN_ANNUAL_RAINFALL : num 755 820 759 919 790 794 657 NA NA 915 ...
## $ MAX_MONTHLY_WIND_SPEED: num 2 3 3 2 2 2 3 4 3 3 ...
## $ MAX_AIR_TEMP : num 32.5 32.3 32.4 30.2 33 32.2 31.6 27.6 33.2 29 ...
## $ MAX_WIND_SPEED : num NA 30.2 29.9 NA NA NA NA NA NA NA ...
## $ MAX_RAINFALL : num 39 36 32 43 43 33 37 NA NA 40 ...
## $ MIN_AIR_TEMP : num -16.3 -10.9 -12.6 -15.5 -19.2 -13.3 -15.2 -15.7 -17.5 -17.2 ...
## $ MEAN_RANGE_AIR_TEMP : num 9.6 7.3 7.8 8.9 9.5 8.2 8.6 6.7 8.6 6.5 ...
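One way to back up the statement about missing values is to count the NA entries per column before deciding which variables to drop. The sketch below uses a small, made-up data frame called toy (an assumption for illustration, not the DWD data) so that it runs on its own; the same two calls should work on dwd once it is loaded.

```r
# Hypothetical toy data, for illustration only (not taken from DWD.csv)
toy <- data.frame(
  ALTITUDE       = c(478, 202, 44, 759, 340),
  MAX_WIND_SPEED = c(NA, 30.2, 29.9, NA, NA),
  MAX_RAINFALL   = c(39, 36, 32, NA, 43)
)

# Number of missing values per column
na_count <- colSums(is.na(toy))

# Share of missing values per column, sorted from highest to lowest
na_share <- sort(colMeans(is.na(toy)), decreasing = TRUE)

print(na_count)
print(round(na_share, 2))
```

A column that sits far above the others in such a ranking is a candidate for removal, because keeping it would force many otherwise complete rows to be discarded.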
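# Exclude variables (column names as reported by str(dwd))
cols_to_drop <- c(
  "ID", # running row index (assumed to be uninformative, like DWD_ID)
  "DWD_ID",
  "STATION_NAME",
  "FEDERAL_STATE",
  "PERIOD",
  "MAX_WIND_SPEED", # shows many missing values
  "MAX_AIR_TEMP" # the response variable is not included in PCA
)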
dwd_data_pca <- dwd[, !(colnames(dwd) %in% cols_to_drop)] # drop columns
rows_to_keep <- complete.cases(dwd_data_pca) # TRUE for rows without missing values
dwd_data_pca <- dwd_data_pca[rows_to_keep, ] # keep complete rows only
# save dwd_data_pca for later usage
save(dwd_data_pca, file = "dwd_pca_30300.RData")
str(dwd_data_pca)
## 'data.frame': 204 obs. of 22 variables:
## $ ID : int 1 2 24 29 31 37 43 45 50 55 ...
## $ DWD_ID : int 3 44 164 175 183 198 222 232 282 298 ...
## $ STATION_NAME : chr "Aachen" "Großenkneten" "Angermünde" "Ansbach" ...
## $ FEDERAL_STATE : chr "Nordrhein-Westfalen" "Niedersachsen" "Brandenburg" "Bayern" ...
## $ LAT : num 50.8 52.9 53 49.3 54.7 ...
## $ LON : num 6.09 8.24 13.99 10.58 13.43 ...
## $ ALTITUDE : num 202 44 54 413 42 164 387 461 240 3 ...
## $ PERIOD : chr "1851-2011" "1971-2016" "1908-2016" "1881-1976" ...
## $ RECORD_LENGTH : int 160 45 108 95 80 62 135 70 67 79 ...
## $ MEAN_ANNUAL_AIR_TEMP : num 9.8 9.2 8.4 7.5 8.2 9 8.1 8.3 8.9 8.2 ...
## $ MEAN_MONTHLY_MAX_TEMP : num 13.6 13.2 12.9 12.2 10.6 13.4 13.2 12.9 13.8 11.8 ...
## $ MEAN_MONTHLY_MIN_TEMP : num 6.3 5.4 4.3 3.3 6 5 4.3 4 4.2 4.8 ...
## $ MEAN_ANNUAL_WIND_SPEED: num 3 2 2 2 4 3 2 2 2 3 ...
## $ MEAN_CLOUD_COVER : num 67 67 68 65 66 69 69 67 68 67 ...
## $ MEAN_ANNUAL_SUNSHINE : num 1531 1459 1695 1657 1840 ...
## $ MEAN_ANNUAL_RAINFALL : num 820 759 531 681 543 478 803 785 638 625 ...
## $ MAX_MONTHLY_WIND_SPEED: num 3 3 3 2 5 3 2 3 2 3 ...
## $ MAX_AIR_TEMP : num 32.3 32.4 33 31.4 26.8 33.5 32.2 32.9 33.7 30.6 ...
## $ MAX_WIND_SPEED : num 30.2 29.9 28.8 22.8 35.5 29.9 23 28 23.5 29.1 ...
## $ MAX_RAINFALL : num 36 32 37 36 33 31 44 40 36 33 ...
## $ MIN_AIR_TEMP : num -10.9 -12.6 -16.1 -17.3 -9.4 -15.3 -16.6 -16.8 -17.8 -14.1 ...
## $ MEAN_RANGE_AIR_TEMP : num 7.3 7.8 8.5 9 4.6 8.4 9 9 9.6 7 ...
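The preparation steps in this section can be tried out on a small example. The following sketch uses a made-up data frame called toy and a temporary file (both are assumptions for illustration, not part of the tutorial data) to show column removal by name, row filtering with complete.cases(), and restoring a saved object with load().

```r
# Hypothetical toy data, for illustration only
toy <- data.frame(
  STATION_NAME = c("A", "B", "C", "D"),
  ALTITUDE     = c(478, 202, NA, 759),
  MAX_RAINFALL = c(39, NA, 32, 43)
)

# Drop columns by name
cols_to_drop <- c("STATION_NAME")
toy_pca <- toy[, !(colnames(toy) %in% cols_to_drop)]

# complete.cases() is TRUE for rows WITHOUT missing values, i.e. rows to keep
rows_to_keep <- complete.cases(toy_pca)
toy_pca <- toy_pca[rows_to_keep, ]

# Save to a temporary file, remove the object, and restore it again
tmp_file <- tempfile(fileext = ".RData")
save(toy_pca, file = tmp_file)
rm(toy_pca)
loaded_names <- load(tmp_file) # load() returns the names of the restored objects

print(loaded_names)
print(dim(toy_pca))
```

Note that load() restores the object under the name it was saved with, so a later section would find the prepared data as dwd_data_pca after loading the .RData file written above.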
Citation
The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us by e-mail at soga[at]zedat.fu-berlin.de.
Please cite as follows: Hartmann, K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.