
Before we dive into PCA, we first prepare our data set. To this end, we load the dwd data set. You may download the DWD.csv file here. Import the data set and assign a proper name to it. Note that we use the read.csv2() function to load the data set. The read.csv() and read.csv2() functions are identical except for their defaults: read.csv2() is suited to data from countries that use a comma as the decimal separator and a semicolon as the field separator.

dwd <- read.csv2("https://userpage.fu-berlin.de/soga/300/30100_data_sets/DWD.csv",
                 stringsAsFactors = FALSE)
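
The relationship between the two reader functions can be checked on a small in-memory sample. The following is a minimal sketch: the sample text and its values are made up for illustration and are not taken from DWD.csv. It suggests that read.csv() gives the same result as read.csv2() once sep and dec are set explicitly.

```r
# Made-up sample in the "continental" CSV flavour:
# semicolon as field separator, comma as decimal separator
sample.text <- "STATION;ALTITUDE;TEMP\nA;478,0;8,2\nB;202,5;9,8"

# read.csv2() uses sep = ";" and dec = "," by default
via.csv2 <- read.csv2(text = sample.text, stringsAsFactors = FALSE)

# read.csv() needs both arguments spelled out to parse the same text
via.csv <- read.csv(text = sample.text, sep = ";", dec = ",",
                    stringsAsFactors = FALSE)

identical(via.csv2, via.csv)  # compare the two resulting data frames
str(via.csv2)                 # inspect the parsed column types
```

Calling read.csv() with its own defaults on such a file would not split the fields as intended, which is why read.csv2() is the more convenient choice here.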

The dwd data set consists of 599 rows, each of them representing a particular weather station in Germany, and 22 columns, each of them corresponding to a variable/feature related to that particular weather station. These self-explanatory variables are: id, DWD_ID, STATION.NAME, FEDERAL.STATE, LAT, LON, ALTITUDE, PERIOD, RECORD.LENGTH, MEAN.ANNUAL.AIR.TEMP, MEAN.MONTHLY.MAX.TEMP, MEAN.MONTHLY.MIN.TEMP, MEAN.ANNUAL.WIND.SPEED, MEAN.CLOUD.COVER, MEAN.ANNUAL.SUNSHINE, MEAN.ANNUAL.RAINFALL, MAX.MONTHLY.WIND.SPEED, MAX.AIR.TEMP, MAX.WIND.SPEED, MAX.RAINFALL, MIN.AIR.TEMP, MEAN.RANGE.AIR.TEMP.

The data was downloaded from the DWD (German Weather Service) data portal and preprocessed for the purpose of this tutorial. You may find a detailed description of the data set here.

By calling the str() function on the data set, we see that some features, such as id, DWD_ID, and STATION.NAME, among others, carry no information that is useful for the analysis. Hence, we exclude them from the data set. We also exclude the variable MAX.WIND.SPEED, as this feature has a disproportionately large number of missing values, and MAX.AIR.TEMP, which serves as the response variable and is therefore not included in the PCA. Finally, we exclude all remaining observations that contain missing values (NA).

str(dwd)

## 'data.frame':    599 obs. of  22 variables:
##  $ id                    : int  0 1 2 6 8 9 10 12 14 18 ...
##  $ DWD_ID                : int  1 3 44 71 73 78 91 98 116 132 ...
##  $ STATION.NAME          : chr  "Aach" "Aachen" "Gro\xdfenkneten" "Albstadt-Badkap" ...
##  $ FEDERAL.STATE         : chr  "Baden-W\xfcrttemberg" "Nordrhein-Westfalen" "Niedersachsen" "Baden-W\xfcrttemberg" ...
##  $ LAT                   : num  47.8 50.8 52.9 48.2 48.6 ...
##  $ LON                   : num  8.85 6.09 8.24 8.98 13.05 ...
##  $ ALTITUDE              : num  478 202 44 759 340 65 300 780 213 750 ...
##  $ PERIOD                : chr  "1931-1986" "1851-2011" "1971-2016" "1986-2016" ...
##  $ RECORD.LENGTH         : int  55 160 45 30 64 55 38 67 67 33 ...
##  $ MEAN.ANNUAL.AIR.TEMP  : num  8.2 9.8 9.2 7.4 8.4 9.3 8.2 5.1 8.4 5.7 ...
##  $ MEAN.MONTHLY.MAX.TEMP : num  13.1 13.6 13.2 12.2 13.4 13.4 12.7 8.9 12.9 9.2 ...
##  $ MEAN.MONTHLY.MIN.TEMP : num  3.5 6.3 5.4 3.3 3.9 5.2 4.1 2.2 4.2 2.7 ...
##  $ MEAN.ANNUAL.WIND.SPEED: num  2 3 2 2 1 2 3 3 2 3 ...
##  $ MEAN.CLOUD.COVER      : num  67 67 67 66 65 67 72 72 66 64 ...
##  $ MEAN.ANNUAL.SUNSHINE  : num  NA 1531 1459 1725 1595 ...
##  $ MEAN.ANNUAL.RAINFALL  : num  755 820 759 919 790 794 657 NA NA 915 ...
##  $ MAX.MONTHLY.WIND.SPEED: num  2 3 3 2 2 2 3 4 3 3 ...
##  $ MAX.AIR.TEMP          : num  32.5 32.3 32.4 30.2 33 32.2 31.6 27.6 33.2 29 ...
##  $ MAX.WIND.SPEED        : num  NA 30.2 29.9 NA NA NA NA NA NA NA ...
##  $ MAX.RAINFALL          : num  39 36 32 43 43 33 37 NA NA 40 ...
##  $ MIN.AIR.TEMP          : num  -16.3 -10.9 -12.6 -15.5 -19.2 -13.3 -15.2 -15.7 -17.5 -17.2 ...
##  $ MEAN.RANGE.AIR.TEMP   : num  9.6 7.3 7.8 8.9 9.5 8.2 8.6 6.7 8.6 6.5 ...
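
The str() output only shows the first few values of each column, so it merely hints at where values are missing. A per-column count makes this explicit. The sketch below uses a small, hypothetical stand-in data frame named toy with made-up values; the same two calls should apply to dwd unchanged.

```r
# Hypothetical stand-in for a few dwd columns (values are made up)
toy <- data.frame(
  MEAN.ANNUAL.RAINFALL = c(755, 820, 759, NA, 790),
  MAX.WIND.SPEED       = c(NA, 30.2, 29.9, NA, NA),
  MIN.AIR.TEMP         = c(-16.3, -10.9, -12.6, -15.5, -19.2)
)

# Number of missing values per column
na.count <- colSums(is.na(toy))

# Share of missing values per column, sorted from most to least affected
na.share <- sort(colMeans(is.na(toy)), decreasing = TRUE)

na.count
round(na.share, 2)
```

A column that sits at the top of na.share by a wide margin is a candidate for removal, because keeping it would force many otherwise complete rows to be dropped.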

# Exclude variables
cols.to.drop <- c('id',                   # columns to drop
                  'DWD_ID', 
                  'STATION.NAME', 
                  'FEDERAL.STATE', 
                  'PERIOD',
                  'MAX.WIND.SPEED', # shows many missing values
                  'MAX.AIR.TEMP') # the response variable is not included in PCA
dwd.data.pca <- dwd[, !(colnames(dwd) %in% cols.to.drop)] # drop columns

rows.to.keep <- complete.cases(dwd.data.pca)  # TRUE for rows without any NA
dwd.data.pca <- dwd.data.pca[rows.to.keep, ]  # keep only the complete rows
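
Note that complete.cases() returns TRUE for the rows that contain no missing values, so the logical vector marks the rows that are kept. The following sketch illustrates this on a hypothetical toy data frame with made-up values, and compares the result with na.omit(), which is an alternative way to drop incomplete rows.

```r
# Hypothetical toy data (values are made up)
toy <- data.frame(
  LAT      = c(47.8, 50.8, 52.9, 48.2),
  SUNSHINE = c(NA, 1531, 1459, 1725),
  RAINFALL = c(755, 820, NA, 919)
)

keep <- complete.cases(toy)   # TRUE for rows without any NA
toy.complete <- toy[keep, ]   # subset to the complete rows

keep
nrow(toy.complete)

# na.omit() removes the same rows in a single call
nrow(na.omit(toy))
```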

# save dwd.data.pca for later usage
save(dwd.data.pca, file = 'dwd_pca_30300.RData')
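
The saved object can be restored later with load(). The sketch below shows the round trip on a hypothetical toy data frame; the temporary file and the separate environment are choices made for this illustration only, so that neither the working directory nor existing objects are touched.

```r
# Hypothetical toy data (values are made up)
toy <- data.frame(LAT = c(50.8, 52.9), ALTITUDE = c(202, 44))

# Write to a temporary file so the working directory stays untouched
rdata.file <- tempfile(fileext = ".RData")
save(toy, file = rdata.file)

# Load into a separate environment to avoid overwriting existing objects
restored <- new.env()
loaded.names <- load(rdata.file, envir = restored)

loaded.names                   # names of the restored objects
identical(restored$toy, toy)   # compare the restored copy with the original
```

Because load() restores objects under the names they were saved with, calling it without envir would place the object directly into the current workspace under its original name.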

str(dwd.data.pca)

## 'data.frame':    397 obs. of  15 variables:
##  $ LAT                   : num  50.8 52.9 48.2 48.6 49.7 ...
##  $ LON                   : num  6.09 8.24 8.98 13.05 8.12 ...
##  $ ALTITUDE              : num  202 44 759 340 215 383 54 9 630 413 ...
##  $ RECORD.LENGTH         : int  160 45 30 64 115 103 108 77 85 95 ...
##  $ MEAN.ANNUAL.AIR.TEMP  : num  9.8 9.2 7.4 8.4 9.5 7.9 8.4 8.1 6.4 7.5 ...
##  $ MEAN.MONTHLY.MAX.TEMP : num  13.6 13.2 12.2 13.4 13.8 12.5 12.9 13.1 10.6 12.2 ...
##  $ MEAN.MONTHLY.MIN.TEMP : num  6.3 5.4 3.3 3.9 5.3 3.6 4.3 5.1 2.9 3.3 ...
##  $ MEAN.ANNUAL.WIND.SPEED: num  3 2 2 1 2 1 2 3 3 2 ...
##  $ MEAN.CLOUD.COVER      : num  67 67 66 65 65 66 68 62 66 65 ...
##  $ MEAN.ANNUAL.SUNSHINE  : num  1531 1459 1725 1595 1597 ...
##  $ MEAN.ANNUAL.RAINFALL  : num  820 759 919 790 526 690 531 553 902 681 ...
##  $ MAX.MONTHLY.WIND.SPEED: num  3 3 2 2 2 2 3 3 3 2 ...
##  $ MAX.RAINFALL          : num  36 32 43 43 31 36 37 32 46 36 ...
##  $ MIN.AIR.TEMP          : num  -10.9 -12.6 -15.5 -19.2 -13.4 -17.2 -16.1 -14.8 -18 -17.3 ...
##  $ MEAN.RANGE.AIR.TEMP   : num  7.3 7.8 8.9 9.5 8.6 9.2 8.5 8 7.8 9 ...
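
Since PCA operates on numeric data without missing values, it may be worth confirming both properties before moving on. The following is a minimal sketch on a hypothetical toy data frame with made-up values; the same checks should apply to dwd.data.pca.

```r
# Hypothetical toy data (values are made up)
toy <- data.frame(
  ALTITUDE      = c(202, 44, 759),
  RECORD.LENGTH = c(160L, 45L, 30L),
  MIN.AIR.TEMP  = c(-10.9, -12.6, -15.5)
)

# is.numeric() is TRUE for both integer and double columns
all.numeric <- all(sapply(toy, is.numeric))

# TRUE if no cell of the data frame is NA
no.missing <- !any(is.na(toy))

all.numeric
no.missing
dim(toy)   # number of rows and columns
```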