
Before we dive into PCA, we first prepare our data set. We use the dwd data set; you may download the DWD.csv file here. Import the data set and assign a proper name to it.

dwd <- read.csv("https://userpage.fu-berlin.de/soga/data/raw-data/DWD.csv",
  encoding = "UTF-8" # the file is UTF-8 encoded; "latin1" garbles the umlauts
)
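A quick structural check right after the import can catch surprises early, for example column names that are spelled or capitalized differently than expected. The sketch below uses a small inline CSV with made-up values (a hypothetical stand-in named toy, not the real DWD file) so that it runs without network access; on the real data the same calls would be applied to dwd.

```r
# Toy stand-in for the downloaded file (values are made up for illustration)
csv_text <- "ID,STATION_NAME,ALTITUDE,MAX_WIND_SPEED
0,Aach,478,NA
1,Aachen,202,30.2
2,Albstadt-Badkap,759,NA"

toy <- read.csv(text = csv_text, stringsAsFactors = FALSE)

dim(toy)           # number of rows and columns
colnames(toy)      # exact spelling and case of the column names
sapply(toy, class) # storage type of each column
```

Comparing colnames() against the names used later in the script is a cheap way to notice mismatches before any filtering is done.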

The dwd data set consists of 599 rows, each of them representing a particular weather station in Germany, and 22 columns, each of them corresponding to a variable/feature related to that particular weather station. These self-explanatory variables are: ID, DWD_ID, STATION_NAME, FEDERAL_STATE, LAT, LON, ALTITUDE, PERIOD, RECORD_LENGTH, MEAN_ANNUAL_AIR_TEMP, MEAN_MONTHLY_MAX_TEMP, MEAN_MONTHLY_MIN_TEMP, MEAN_ANNUAL_WIND_SPEED, MEAN_CLOUD_COVER, MEAN_ANNUAL_SUNSHINE, MEAN_ANNUAL_RAINFALL, MAX_MONTHLY_WIND_SPEED, MAX_AIR_TEMP, MAX_WIND_SPEED, MAX_RAINFALL, MIN_AIR_TEMP, MEAN_RANGE_AIR_TEMP.

The data was downloaded from the DWD (German Weather Service) data portal and preprocessed for the purpose of this tutorial. You may find a detailed description of the data set here.

By calling the str() function on the data set, we see that some features, such as ID, DWD_ID, and STATION_NAME, among others, do not carry any information that is useful for PCA. Hence, we exclude them from the data set. We also exclude the variable MAX_WIND_SPEED, as this feature contains a disproportionately large number of missing values. Finally, we exclude all remaining observations that contain missing values (NA).

str(dwd)

## 'data.frame':    599 obs. of  22 variables:
##  $ ID                    : int  0 1 2 6 8 9 10 12 14 18 ...
##  $ DWD_ID                : int  1 3 44 71 73 78 91 98 116 132 ...
##  $ STATION_NAME          : chr  "Aach" "Aachen" "Großenkneten" "Albstadt-Badkap" ...
##  $ FEDERAL_STATE         : chr  "Baden-Württemberg" "Nordrhein-Westfalen" "Niedersachsen" "Baden-Württemberg" ...
##  $ LAT                   : num  47.8 50.8 52.9 48.2 48.6 ...
##  $ LON                   : num  8.85 6.09 8.24 8.98 13.05 ...
##  $ ALTITUDE              : num  478 202 44 759 340 65 300 780 213 750 ...
##  $ PERIOD                : chr  "1931-1986" "1851-2011" "1971-2016" "1986-2016" ...
##  $ RECORD_LENGTH         : int  55 160 45 30 64 55 38 67 67 33 ...
##  $ MEAN_ANNUAL_AIR_TEMP  : num  8.2 9.8 9.2 7.4 8.4 9.3 8.2 5.1 8.4 5.7 ...
##  $ MEAN_MONTHLY_MAX_TEMP : num  13.1 13.6 13.2 12.2 13.4 13.4 12.7 8.9 12.9 9.2 ...
##  $ MEAN_MONTHLY_MIN_TEMP : num  3.5 6.3 5.4 3.3 3.9 5.2 4.1 2.2 4.2 2.7 ...
##  $ MEAN_ANNUAL_WIND_SPEED: num  2 3 2 2 1 2 3 3 2 3 ...
##  $ MEAN_CLOUD_COVER      : num  67 67 67 66 65 67 72 72 66 64 ...
##  $ MEAN_ANNUAL_SUNSHINE  : num  NA 1531 1459 1725 1595 ...
##  $ MEAN_ANNUAL_RAINFALL  : num  755 820 759 919 790 794 657 NA NA 915 ...
##  $ MAX_MONTHLY_WIND_SPEED: num  2 3 3 2 2 2 3 4 3 3 ...
##  $ MAX_AIR_TEMP          : num  32.5 32.3 32.4 30.2 33 32.2 31.6 27.6 33.2 29 ...
##  $ MAX_WIND_SPEED        : num  NA 30.2 29.9 NA NA NA NA NA NA NA ...
##  $ MAX_RAINFALL          : num  39 36 32 43 43 33 37 NA NA 40 ...
##  $ MIN_AIR_TEMP          : num  -16.3 -10.9 -12.6 -15.5 -19.2 -13.3 -15.2 -15.7 -17.5 -17.2 ...
##  $ MEAN_RANGE_AIR_TEMP   : num  9.6 7.3 7.8 8.9 9.5 8.2 8.6 6.7 8.6 6.5 ...
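
The str() output only shows the first few values of each column, so the extent of missingness is easier to judge from a per-column count. The following is a minimal sketch on a hypothetical toy data frame with made-up values; applied to dwd, the same calls would show which variables are dominated by NA.

```r
# Toy data frame (values are made up for illustration)
toy <- data.frame(
  MEAN_ANNUAL_RAINFALL = c(755, 820, 759, NA, 790),
  MAX_WIND_SPEED       = c(NA, 30.2, 29.9, NA, NA),
  MIN_AIR_TEMP         = c(-16.3, -10.9, -12.6, -15.5, -19.2)
)

# Absolute number of missing values per column
na_count <- colSums(is.na(toy))

# Share of missing values per column, sorted from most to least affected
na_share <- sort(colMeans(is.na(toy)), decreasing = TRUE)

na_count
na_share
```

A column near the top of na_share is a candidate for removal, because keeping it would cost many rows once incomplete observations are dropped.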

# Exclude variables (names must match the column names in dwd exactly)
cols_to_drop <- c("ID", # running row index
                  "DWD_ID",
                  "STATION_NAME",
                  "FEDERAL_STATE",
                  "PERIOD",
                  "MAX_WIND_SPEED", # shows many missing values
                  "MAX_AIR_TEMP") # the response variable is not included in PCA
dwd_data_pca <- dwd[, !(colnames(dwd) %in% cols_to_drop)] # drop columns

rows_to_keep <- complete.cases(dwd_data_pca) # TRUE for rows without any NA
dwd_data_pca <- dwd_data_pca[rows_to_keep, ] # keep only complete rows

# save dwd_data_pca for later usage
save(dwd_data_pca, file = "dwd_pca_30300.RData")

str(dwd_data_pca)

## 'data.frame':    204 obs. of  22 variables:
##  $ ID                    : int  1 2 24 29 31 37 43 45 50 55 ...
##  $ DWD_ID                : int  3 44 164 175 183 198 222 232 282 298 ...
##  $ STATION_NAME          : chr  "Aachen" "Großenkneten" "Angermünde" "Ansbach" ...
##  $ FEDERAL_STATE         : chr  "Nordrhein-Westfalen" "Niedersachsen" "Brandenburg" "Bayern" ...
##  $ LAT                   : num  50.8 52.9 53 49.3 54.7 ...
##  $ LON                   : num  6.09 8.24 13.99 10.58 13.43 ...
##  $ ALTITUDE              : num  202 44 54 413 42 164 387 461 240 3 ...
##  $ PERIOD                : chr  "1851-2011" "1971-2016" "1908-2016" "1881-1976" ...
##  $ RECORD_LENGTH         : int  160 45 108 95 80 62 135 70 67 79 ...
##  $ MEAN_ANNUAL_AIR_TEMP  : num  9.8 9.2 8.4 7.5 8.2 9 8.1 8.3 8.9 8.2 ...
##  $ MEAN_MONTHLY_MAX_TEMP : num  13.6 13.2 12.9 12.2 10.6 13.4 13.2 12.9 13.8 11.8 ...
##  $ MEAN_MONTHLY_MIN_TEMP : num  6.3 5.4 4.3 3.3 6 5 4.3 4 4.2 4.8 ...
##  $ MEAN_ANNUAL_WIND_SPEED: num  3 2 2 2 4 3 2 2 2 3 ...
##  $ MEAN_CLOUD_COVER      : num  67 67 68 65 66 69 69 67 68 67 ...
##  $ MEAN_ANNUAL_SUNSHINE  : num  1531 1459 1695 1657 1840 ...
##  $ MEAN_ANNUAL_RAINFALL  : num  820 759 531 681 543 478 803 785 638 625 ...
##  $ MAX_MONTHLY_WIND_SPEED: num  3 3 3 2 5 3 2 3 2 3 ...
##  $ MAX_AIR_TEMP          : num  32.3 32.4 33 31.4 26.8 33.5 32.2 32.9 33.7 30.6 ...
##  $ MAX_WIND_SPEED        : num  30.2 29.9 28.8 22.8 35.5 29.9 23 28 23.5 29.1 ...
##  $ MAX_RAINFALL          : num  36 32 37 36 33 31 44 40 36 33 ...
##  $ MIN_AIR_TEMP          : num  -10.9 -12.6 -16.1 -17.3 -9.4 -15.3 -16.6 -16.8 -17.8 -14.1 ...
##  $ MEAN_RANGE_AIR_TEMP   : num  7.3 7.8 8.5 9 4.6 8.4 9 9 9.6 7 ...
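
Before handing the data to PCA, it can be worth confirming that the filtering did what was intended. The sketch below uses a hypothetical toy data frame with made-up values and a temporary file. It drops columns by name, keeps only complete rows, and then checks that the remaining columns are all numeric, that no NA is left, and that the saved object reloads unchanged.

```r
# Toy data frame (values are made up for illustration)
toy <- data.frame(
  STATION_NAME   = c("Aach", "Aachen", "Ansbach", "Albstadt-Badkap"),
  ALTITUDE       = c(478, 202, 413, 759),
  MAX_WIND_SPEED = c(NA, 30.2, NA, NA),
  MAX_RAINFALL   = c(39, 36, NA, 43),
  stringsAsFactors = FALSE
)

cols_to_drop <- c("STATION_NAME", "MAX_WIND_SPEED")

# Names in cols_to_drop that do not exist in the data (should be empty);
# a non-empty result usually points to a spelling or case mismatch
setdiff(cols_to_drop, colnames(toy))

toy_pca <- toy[, !(colnames(toy) %in% cols_to_drop)] # drop columns
toy_pca <- toy_pca[complete.cases(toy_pca), ]        # keep complete rows

# Checks: every remaining column is numeric and no NA is left
all(sapply(toy_pca, is.numeric))
anyNA(toy_pca)

# Round trip through a temporary .RData file
tmp <- tempfile(fileext = ".RData")
save(toy_pca, file = tmp)
reloaded <- new.env()
load(tmp, envir = reloaded)
all.equal(reloaded$toy_pca, toy_pca)
```

Because %in% silently ignores names that are not present, the setdiff() check is the step most likely to reveal a column that was meant to be dropped but was not.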

Citation

The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us by e-mail at soga[at]zedat.fu-berlin.de.

You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follows: Hartmann, K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.