Before we dive into PCA, we first prepare our data set. We use the dwd data set; you may download the DWD.csv file here. Import the data set and assign a proper name to it. Note that we use the read.csv2() function to load the data set. The functions read.csv() and read.csv2() are identical except for their defaults: read.csv2() is suited to data originating from countries that use a comma as the decimal separator and a semicolon as the field separator.
dwd <- read.csv2("https://userpage.fu-berlin.de/soga/300/30100_data_sets/DWD.csv",
                 stringsAsFactors = FALSE)
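To make the difference in defaults concrete, here is a minimal sketch that parses a small semicolon-separated text with both functions. The two-row text and the names csv_text, via_csv2 and via_csv are made up for illustration; only the two temperature values are borrowed from the first rows of the data set.

```r
# Toy input: semicolon as field separator, comma as decimal separator
csv_text <- "station;temp\nAach;8,2\nAachen;9,8"

# read.csv2() uses sep = ";" and dec = "," by default
via_csv2 <- read.csv2(text = csv_text, stringsAsFactors = FALSE)

# read.csv() gives the same result once both defaults are overridden
via_csv <- read.csv(text = csv_text, sep = ";", dec = ",",
                    stringsAsFactors = FALSE)

identical(via_csv2, via_csv)  # TRUE
is.numeric(via_csv2$temp)     # TRUE, the decimal comma was parsed as a number
```

If a numeric column unexpectedly shows up as character after an import, a mismatch between the file's separators and the function's defaults is a likely cause.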
The dwd data set consists of 599 rows, each representing a particular weather station in Germany, and 22 columns, each corresponding to a variable (feature) recorded for that weather station. These self-explanatory variables are: id, DWD_ID, STATION.NAME, FEDERAL.STATE, LAT, LON, ALTITUDE, PERIOD, RECORD.LENGTH, MEAN.ANNUAL.AIR.TEMP, MEAN.MONTHLY.MAX.TEMP, MEAN.MONTHLY.MIN.TEMP, MEAN.ANNUAL.WIND.SPEED, MEAN.CLOUD.COVER, MEAN.ANNUAL.SUNSHINE, MEAN.ANNUAL.RAINFALL, MAX.MONTHLY.WIND.SPEED, MAX.AIR.TEMP, MAX.WIND.SPEED, MAX.RAINFALL, MIN.AIR.TEMP, MEAN.RANGE.AIR.TEMP.
The data was downloaded from the DWD (German Weather Service) data portal and preprocessed for the purpose of this tutorial. You may find a detailed description of the data set here.
By calling the str() function on the data set, we see that some features, such as id, DWD_ID, and STATION.NAME, among others, do not carry any information useful for PCA. Hence, we exclude them from the data set. We also exclude the variable MAX.WIND.SPEED, as this feature has a disproportionately large number of missing values, and MAX.AIR.TEMP, which is the response variable and therefore not included in the PCA. Finally, we drop all remaining observations that contain missing values (NA).
str(dwd)
## 'data.frame': 599 obs. of 22 variables:
## $ id : int 0 1 2 6 8 9 10 12 14 18 ...
## $ DWD_ID : int 1 3 44 71 73 78 91 98 116 132 ...
## $ STATION.NAME : chr "Aach" "Aachen" "Gro\xdfenkneten" "Albstadt-Badkap" ...
## $ FEDERAL.STATE : chr "Baden-W\xfcrttemberg" "Nordrhein-Westfalen" "Niedersachsen" "Baden-W\xfcrttemberg" ...
## $ LAT : num 47.8 50.8 52.9 48.2 48.6 ...
## $ LON : num 8.85 6.09 8.24 8.98 13.05 ...
## $ ALTITUDE : num 478 202 44 759 340 65 300 780 213 750 ...
## $ PERIOD : chr "1931-1986" "1851-2011" "1971-2016" "1986-2016" ...
## $ RECORD.LENGTH : int 55 160 45 30 64 55 38 67 67 33 ...
## $ MEAN.ANNUAL.AIR.TEMP : num 8.2 9.8 9.2 7.4 8.4 9.3 8.2 5.1 8.4 5.7 ...
## $ MEAN.MONTHLY.MAX.TEMP : num 13.1 13.6 13.2 12.2 13.4 13.4 12.7 8.9 12.9 9.2 ...
## $ MEAN.MONTHLY.MIN.TEMP : num 3.5 6.3 5.4 3.3 3.9 5.2 4.1 2.2 4.2 2.7 ...
## $ MEAN.ANNUAL.WIND.SPEED: num 2 3 2 2 1 2 3 3 2 3 ...
## $ MEAN.CLOUD.COVER : num 67 67 67 66 65 67 72 72 66 64 ...
## $ MEAN.ANNUAL.SUNSHINE : num NA 1531 1459 1725 1595 ...
## $ MEAN.ANNUAL.RAINFALL : num 755 820 759 919 790 794 657 NA NA 915 ...
## $ MAX.MONTHLY.WIND.SPEED: num 2 3 3 2 2 2 3 4 3 3 ...
## $ MAX.AIR.TEMP : num 32.5 32.3 32.4 30.2 33 32.2 31.6 27.6 33.2 29 ...
## $ MAX.WIND.SPEED : num NA 30.2 29.9 NA NA NA NA NA NA NA ...
## $ MAX.RAINFALL : num 39 36 32 43 43 33 37 NA NA 40 ...
## $ MIN.AIR.TEMP : num -16.3 -10.9 -12.6 -15.5 -19.2 -13.3 -15.2 -15.7 -17.5 -17.2 ...
## $ MEAN.RANGE.AIR.TEMP : num 9.6 7.3 7.8 8.9 9.5 8.2 8.6 6.7 8.6 6.5 ...
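The share of missing values per variable can be quantified by counting NA entries column by column. The following sketch does this for the first ten values of three variables, copied from the str() output above; the data frame name first.rows is made up for illustration, and on the full data set the same calls would be applied to dwd.

```r
# First ten values of three variables, as printed by str(dwd)
first.rows <- data.frame(
  MEAN.ANNUAL.RAINFALL = c(755, 820, 759, 919, 790, 794, 657, NA, NA, 915),
  MAX.WIND.SPEED       = c(NA, 30.2, 29.9, NA, NA, NA, NA, NA, NA, NA),
  MAX.RAINFALL         = c(39, 36, 32, 43, 43, 33, 37, NA, NA, 40)
)

# Number of missing values per column
na.count <- colSums(is.na(first.rows))
na.count

# Fraction of missing values per column
na.share <- colMeans(is.na(first.rows))
na.share
```

In this small excerpt, MAX.WIND.SPEED is missing in 8 of 10 rows, whereas the two rainfall variables are missing in 2 of 10 rows each.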
# Exclude variables
cols.to.drop <- c('id',             # columns to drop
                  'DWD_ID',
                  'STATION.NAME',
                  'FEDERAL.STATE',
                  'PERIOD',
                  'MAX.WIND.SPEED', # shows many missing values
                  'MAX.AIR.TEMP')   # the response variable is not included in PCA
dwd.data.pca <- dwd[, !(colnames(dwd) %in% cols.to.drop)] # drop columns
rows.to.keep <- complete.cases(dwd.data.pca) # TRUE for rows without missing values
dwd.data.pca <- dwd.data.pca[rows.to.keep, ] # keep only complete rows
# save dwd.data.pca for later usage
save(dwd.data.pca, file = 'dwd_pca_30300.RData')
str(dwd.data.pca)
## 'data.frame': 397 obs. of 15 variables:
## $ LAT : num 50.8 52.9 48.2 48.6 49.7 ...
## $ LON : num 6.09 8.24 8.98 13.05 8.12 ...
## $ ALTITUDE : num 202 44 759 340 215 383 54 9 630 413 ...
## $ RECORD.LENGTH : int 160 45 30 64 115 103 108 77 85 95 ...
## $ MEAN.ANNUAL.AIR.TEMP : num 9.8 9.2 7.4 8.4 9.5 7.9 8.4 8.1 6.4 7.5 ...
## $ MEAN.MONTHLY.MAX.TEMP : num 13.6 13.2 12.2 13.4 13.8 12.5 12.9 13.1 10.6 12.2 ...
## $ MEAN.MONTHLY.MIN.TEMP : num 6.3 5.4 3.3 3.9 5.3 3.6 4.3 5.1 2.9 3.3 ...
## $ MEAN.ANNUAL.WIND.SPEED: num 3 2 2 1 2 1 2 3 3 2 ...
## $ MEAN.CLOUD.COVER : num 67 67 66 65 65 66 68 62 66 65 ...
## $ MEAN.ANNUAL.SUNSHINE : num 1531 1459 1725 1595 1597 ...
## $ MEAN.ANNUAL.RAINFALL : num 820 759 919 790 526 690 531 553 902 681 ...
## $ MAX.MONTHLY.WIND.SPEED: num 3 3 2 2 2 2 3 3 3 2 ...
## $ MAX.RAINFALL : num 36 32 43 43 31 36 37 32 46 36 ...
## $ MIN.AIR.TEMP : num -10.9 -12.6 -15.5 -19.2 -13.4 -17.2 -16.1 -14.8 -18 -17.3 ...
## $ MEAN.RANGE.AIR.TEMP : num 7.3 7.8 8.9 9.5 8.6 9.2 8.5 8 7.8 9 ...
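The column and row filtering used above can be traced on a small example. This is a minimal sketch on a made-up four-row data frame; the names toy, toy.drop, toy.keep and toy.pca and the placement of the NA values are hypothetical. It also shows, using a temporary file, how an object written with save() can be restored with load().

```r
# Made-up data frame with two columns to discard and some missing values
toy <- data.frame(
  id                   = 1:4,
  ALTITUDE             = c(478, 202, 44, 759),
  MEAN.ANNUAL.RAINFALL = c(755, NA, 759, 919),
  MAX.WIND.SPEED       = c(NA, 30.2, 29.9, NA)
)

# Drop columns by name
toy.drop <- c("id", "MAX.WIND.SPEED")
toy.pca <- toy[, !(colnames(toy) %in% toy.drop)]

# complete.cases() returns TRUE for rows WITHOUT missing values,
# so the logical vector marks the rows to keep
toy.keep <- complete.cases(toy.pca)
toy.pca <- toy.pca[toy.keep, ]

nrow(toy.pca)   # 3, the row with the missing rainfall value is gone
anyNA(toy.pca)  # FALSE

# Round trip through an .RData file (temporary path, for illustration)
tmp <- tempfile(fileext = ".RData")
save(toy.pca, file = tmp)
rm(toy.pca)
load(tmp)       # restores the object under its original name
dim(toy.pca)    # 3 2
```

Dropping the sparse MAX.WIND.SPEED column before calling complete.cases() matters: applied to the original toy data frame, the row filter would keep only one of the four rows.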