Before we start, we clean up the data set by excluding non-meaningful features and observations with missing values. Then we create our binary response variable, HOT.DAY, based on the variable MAX.AIR.TEMP. In a next step we split our data set at random into two distinct parts: one part is our training set, while the other part serves as our test set. The basic idea behind this grouping strategy is that we learn the model parameters on the training set, and then use the test set, a data set that was not involved in parameter estimation at all, to determine the goodness of fit of our model. In other words, by using two distinct data sets we can evaluate the generalization properties of our model.

For the sub-setting of the data set we rely on the sample.split() function from the caTools package, which splits the data into a training and a test set at a predefined ratio. Finally, we assign proper variable names to the data matrices and response vectors for further processing.
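Note that sample.split() draws a stratified split: the class ratio of the response variable is preserved in both parts. A minimal base-R sketch of the same idea, using hypothetical toy labels rather than the DWD data:

```r
set.seed(200)
# toy labels mimicking an imbalanced binary response
y <- factor(rep(c(0, 1), times = c(54, 343)))

# stratified 60/40 split: sample within each class separately
split <- logical(length(y))
for (cl in levels(y)) {
  idx <- which(y == cl)
  split[sample(idx, size = round(0.6 * length(idx)))] <- TRUE
}

table(y, split)  # class ratio is (approximately) preserved in both parts
```

This reproduces the behavior we rely on below; in practice sample.split() does this bookkeeping for us.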

# Load data set
dwd <- read.csv2("https://userpage.fu-berlin.de/soga/300/30100_data_sets/DWD.csv",
                 stringsAsFactors = FALSE)

str(dwd)
## 'data.frame':    599 obs. of  22 variables:
##  $ id                    : int  0 1 2 6 8 9 10 12 14 18 ...
##  $ DWD_ID                : int  1 3 44 71 73 78 91 98 116 132 ...
##  $ STATION.NAME          : chr  "Aach" "Aachen" "Gro\xdfenkneten" "Albstadt-Badkap" ...
##  $ FEDERAL.STATE         : chr  "Baden-W\xfcrttemberg" "Nordrhein-Westfalen" "Niedersachsen" "Baden-W\xfcrttemberg" ...
##  $ LAT                   : num  47.8 50.8 52.9 48.2 48.6 ...
##  $ LON                   : num  8.85 6.09 8.24 8.98 13.05 ...
##  $ ALTITUDE              : num  478 202 44 759 340 65 300 780 213 750 ...
##  $ PERIOD                : chr  "1931-1986" "1851-2011" "1971-2016" "1986-2016" ...
##  $ RECORD.LENGTH         : int  55 160 45 30 64 55 38 67 67 33 ...
##  $ MEAN.ANNUAL.AIR.TEMP  : num  8.2 9.8 9.2 7.4 8.4 9.3 8.2 5.1 8.4 5.7 ...
##  $ MEAN.MONTHLY.MAX.TEMP : num  13.1 13.6 13.2 12.2 13.4 13.4 12.7 8.9 12.9 9.2 ...
##  $ MEAN.MONTHLY.MIN.TEMP : num  3.5 6.3 5.4 3.3 3.9 5.2 4.1 2.2 4.2 2.7 ...
##  $ MEAN.ANNUAL.WIND.SPEED: num  2 3 2 2 1 2 3 3 2 3 ...
##  $ MEAN.CLOUD.COVER      : num  67 67 67 66 65 67 72 72 66 64 ...
##  $ MEAN.ANNUAL.SUNSHINE  : num  NA 1531 1459 1725 1595 ...
##  $ MEAN.ANNUAL.RAINFALL  : num  755 820 759 919 790 794 657 NA NA 915 ...
##  $ MAX.MONTHLY.WIND.SPEED: num  2 3 3 2 2 2 3 4 3 3 ...
##  $ MAX.AIR.TEMP          : num  32.5 32.3 32.4 30.2 33 32.2 31.6 27.6 33.2 29 ...
##  $ MAX.WIND.SPEED        : num  NA 30.2 29.9 NA NA NA NA NA NA NA ...
##  $ MAX.RAINFALL          : num  39 36 32 43 43 33 37 NA NA 40 ...
##  $ MIN.AIR.TEMP          : num  -16.3 -10.9 -12.6 -15.5 -19.2 -13.3 -15.2 -15.7 -17.5 -17.2 ...
##  $ MEAN.RANGE.AIR.TEMP   : num  9.6 7.3 7.8 8.9 9.5 8.2 8.6 6.7 8.6 6.5 ...
# Exclude variables
cols.to.drop <- c('id',                   # columns to drop
                  'DWD_ID', 
                  'STATION.NAME', 
                  'FEDERAL.STATE', 
                  'PERIOD',
                  'MAX.WIND.SPEED') # shows many missing values
                  
dwd.data <- dwd[, !(colnames(dwd) %in% cols.to.drop)] # drop columns

rows.to.keep <- complete.cases(dwd.data)  # TRUE for rows without missing values
dwd.data <- dwd.data[rows.to.keep, ] # keep only complete rows
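complete.cases() returns TRUE for every row that contains no NA, so indexing with it keeps the complete rows and discards the rest. A small illustration on a toy data frame (hypothetical values, not the DWD data):

```r
# toy data frame with missing values in rows 2 and 3
df <- data.frame(a = c(1, NA, 3), b = c("x", "y", NA))

complete.cases(df)        # TRUE FALSE FALSE
df[complete.cases(df), ]  # keeps only the first row
```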

# Create binary response variable HOT.DAY
dwd.data['HOT.DAY'] <- as.factor(ifelse(dwd.data$MAX.AIR.TEMP >= 30, 1, 0))
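ifelse() applies the threshold comparison element-wise, and as.factor() turns the resulting 0/1 codes into a categorical response. On a few hypothetical maximum temperatures:

```r
# toy maximum air temperatures (hypothetical values)
max.temp <- c(32.5, 27.6, 30.0)

# 1 = at least one hot day (>= 30 degrees C), 0 = none
hot.day <- as.factor(ifelse(max.temp >= 30, 1, 0))
hot.day  # 1 0 1, with levels 0 and 1
```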


# Split data into training and test set, 60% training and 40% test set
library(caTools)
set.seed(200) # set seed for reproducibility

# define split vector
split <- sample.split(dwd.data$HOT.DAY, SplitRatio = 0.6)

#########################
### Original data set ###
#########################

train.set <- dwd.data[split==TRUE,] # training set
test.set <- dwd.data[split==FALSE,] # test set

# response vectors
y.train <- train.set$HOT.DAY # response vector training set
y.test <- test.set$HOT.DAY # response vector test set

# exclude response variable(s) from the data matrix
idx.HOT.DAY <- which(colnames(train.set)=='HOT.DAY') 
idx.MAX.AIR.TEMP <- which(colnames(train.set)=='MAX.AIR.TEMP')

X.train <- data.matrix(train.set[, -c(idx.HOT.DAY, idx.MAX.AIR.TEMP)])
X.test <- data.matrix(test.set[, -c(idx.HOT.DAY, idx.MAX.AIR.TEMP)])

####################
### PCA data set ###
####################

# load PC3 model from previous sections
load(url("https://userpage.fu-berlin.de/soga/300/30100_data_sets/pca_PC3_30300.RData"))
     
X.train.PC3 <- pca.PC3[split==TRUE,] # training set PCA
X.test.PC3 <- pca.PC3[split==FALSE,] # test set PCA

### save data matrices and response vectors for later usage ###
save(y.train,
     y.test,
     X.train,
     X.test,
     X.train.PC3,
     X.test.PC3,
     pca.PC3,
     dwd.data,
     file = 'pca_data_30300.RData')

After preprocessing, 238 observations with 16 features remain to train our model, and 159 observations with 16 features to evaluate it.

We notice that the classes of our response variable HOT.DAY are not balanced.

tbl.hot.day <- table(dwd.data['HOT.DAY'])
tbl.hot.day
## 
##   0   1 
##  54 343

The ratio of weather stations that recorded at least one hot day to weather stations that have not yet recorded a hot day is roughly 6 to 1. When working with imbalanced data sets we have to be cautious with respect to model metrics such as accuracy. Consider a model that always predicts the more common class; in our example its accuracy would be \(343/(54 + 343) \approx 0.86\), which is quite high for a non-predictive model. Hence, we will also consider other model metrics, such as the confusion matrix and Cohen's kappa. The value of \(0.86\) will serve as our benchmark: whatever model we build, its accuracy must exceed \(0.86\).
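The baseline accuracy of this always-predict-the-majority model can be computed directly from the class counts:

```r
# class counts as returned by table(dwd.data['HOT.DAY'])
tbl <- c("0" = 54, "1" = 343)

baseline <- max(tbl) / sum(tbl)  # majority-class share = baseline accuracy
round(baseline, 2)               # 0.86
```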


However, before we continue we should remind ourselves that the data set we are working with has a spatial component: our observations are point measurements of meteorological parameters spread across Germany. Let us plot a simple map to get an impression of the spatial distribution of our observations. For plotting we rely on the raster and ggplot2 packages.

library(raster)
library(ggplot2)

# Retrieve the federal state boundaries with the getData() function from the raster package
G1 <- getData(country = "Germany", level = 1) 

# plot the map
ggplot() +
  geom_polygon(data = G1,
               aes(x = long, y = lat, group = group),
               colour = "grey10", fill = "#fff7bc") +
  geom_point(data = dwd.data,
             aes(x = LON, y = LAT, colour = HOT.DAY),
             alpha = .5,
             size = 2) +
  theme_bw() +
  xlab("Longitude") + ylab("Latitude") +
  coord_map()