So far we have focused on an example data set with just two explanatory variables. To gain some hands-on experience we now advance to a data set with more than two explanatory variables.

To showcase multiple linear regression analysis we examine the relationship between the response variable MEAN.ANNUAL.RAINFALL and the other variables in the dwd data set.

You may download the DWD.csv file here. Import the data set and assign it a proper name. Note that we use the read.csv2() function to load the data set. The read.csv() and read.csv2() functions are identical except for their defaults: read.csv2() is suited for data originating from countries that use a comma as the decimal mark and a semicolon as the field separator.

dwd <- read.csv2("https://userpage.fu-berlin.de/soga/300/30100_data_sets/DWD.csv",
                 fileEncoding = "latin1", # the file uses latin1 encoding
                 stringsAsFactors = FALSE)
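To make the difference in the defaults explicit: the call above is equivalent to calling read.csv() with the field separator and the decimal mark overridden by hand:

dwd <- read.csv("https://userpage.fu-berlin.de/soga/300/30100_data_sets/DWD.csv",
                sep = ";", dec = ",", # the read.csv2() defaults
                fileEncoding = "latin1",
                stringsAsFactors = FALSE)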

The dwd data set consists of 599 rows, each representing a particular weather station in Germany, and 22 columns, each corresponding to a variable/feature related to that particular weather station. These self-explanatory variables are: id, DWD_ID, STATION.NAME, FEDERAL.STATE, LAT, LON, ALTITUDE, PERIOD, RECORD.LENGTH, MEAN.ANNUAL.AIR.TEMP, MEAN.MONTHLY.MAX.TEMP, MEAN.MONTHLY.MIN.TEMP, MEAN.ANNUAL.WIND.SPEED, MEAN.CLOUD.COVER, MEAN.ANNUAL.SUNSHINE, MEAN.ANNUAL.RAINFALL, MAX.MONTHLY.WIND.SPEED, MAX.AIR.TEMP, MAX.WIND.SPEED, MAX.RAINFALL, MIN.AIR.TEMP, MEAN.RANGE.AIR.TEMP.

The data was downloaded from the DWD (German Weather Service) data portal on April 21, 2017. You may find a detailed description of the data set here. Please note that for the purpose of this tutorial the data set was preprocessed and the columns were renamed.


Data preparation

Before we dive into the modelling we first have to prepare our data set. Calling the str() function on the data set reveals features such as id, DWD_ID, and STATION.NAME, among others, which we know, or at least presume, do not carry any information useful for predicting the mean annual rainfall. Thus, we exclude them from the data set. In addition, we make sure to exclude observations that contain missing values (NA).

str(dwd)
## 'data.frame':    599 obs. of  22 variables:
##  $ id                    : int  0 1 2 6 8 9 10 12 14 18 ...
##  $ DWD_ID                : int  1 3 44 71 73 78 91 98 116 132 ...
##  $ STATION.NAME          : chr  "Aach" "Aachen" "Großenkneten" "Albstadt-Badkap" ...
##  $ FEDERAL.STATE         : chr  "Baden-Württemberg" "Nordrhein-Westfalen" "Niedersachsen" "Baden-Württemberg" ...
##  $ LAT                   : num  47.8 50.8 52.9 48.2 48.6 ...
##  $ LON                   : num  8.85 6.09 8.24 8.98 13.05 ...
##  $ ALTITUDE              : num  478 202 44 759 340 65 300 780 213 750 ...
##  $ PERIOD                : chr  "1931-1986" "1851-2011" "1971-2016" "1986-2016" ...
##  $ RECORD.LENGTH         : int  55 160 45 30 64 55 38 67 67 33 ...
##  $ MEAN.ANNUAL.AIR.TEMP  : num  8.2 9.8 9.2 7.4 8.4 9.3 8.2 5.1 8.4 5.7 ...
##  $ MEAN.MONTHLY.MAX.TEMP : num  13.1 13.6 13.2 12.2 13.4 13.4 12.7 8.9 12.9 9.2 ...
##  $ MEAN.MONTHLY.MIN.TEMP : num  3.5 6.3 5.4 3.3 3.9 5.2 4.1 2.2 4.2 2.7 ...
##  $ MEAN.ANNUAL.WIND.SPEED: num  2 3 2 2 1 2 3 3 2 3 ...
##  $ MEAN.CLOUD.COVER      : num  67 67 67 66 65 67 72 72 66 64 ...
##  $ MEAN.ANNUAL.SUNSHINE  : num  NA 1531 1459 1725 1595 ...
##  $ MEAN.ANNUAL.RAINFALL  : num  755 820 759 919 790 794 657 NA NA 915 ...
##  $ MAX.MONTHLY.WIND.SPEED: num  2 3 3 2 2 2 3 4 3 3 ...
##  $ MAX.AIR.TEMP          : num  32.5 32.3 32.4 30.2 33 32.2 31.6 27.6 33.2 29 ...
##  $ MAX.WIND.SPEED        : num  NA 30.2 29.9 NA NA NA NA NA NA NA ...
##  $ MAX.RAINFALL          : num  39 36 32 43 43 33 37 NA NA 40 ...
##  $ MIN.AIR.TEMP          : num  -16.3 -10.9 -12.6 -15.5 -19.2 -13.3 -15.2 -15.7 -17.5 -17.2 ...
##  $ MEAN.RANGE.AIR.TEMP   : num  9.6 7.3 7.8 8.9 9.5 8.2 8.6 6.7 8.6 6.5 ...
# Exclude variables
cols.to.drop <- c('id', 'DWD_ID', 'STATION.NAME', 
                  'FEDERAL.STATE', 'PERIOD') # columns to drop
rows.to.keep <- complete.cases(dwd)  # TRUE for rows without missing values
dwd.data <- dwd[rows.to.keep, !(colnames(dwd) %in% cols.to.drop)]

After the data cleaning steps there are 204 rows (observations) and 17 columns (features) left in our data set.

dim(dwd.data)
## [1] 204  17
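As a quick sanity check we may count how many of the original 599 rows contain at least one missing value; given the dimensions above, this should return 395:

sum(!complete.cases(dwd)) # rows with at least one NA
## [1] 395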

Moreover, we split our data set at random into two distinct parts: one part is our training set, while the other serves as our test set. The basic idea behind this data grouping strategy is that we learn the model parameters on the training set and then use the test set, a data set that was not involved at all during parameter estimation, to determine the goodness of fit of our model. In other words, by using two distinct data sets we may evaluate the generalization properties of our model.

For subsetting the data set we rely on the sample.split() function from the caTools package, which splits the data set into a training set and a test set in a predefined ratio. Finally, we assign a proper variable name to the response vector and the data matrix of each subset for further processing.

# Split data into training and test set, 80% training and 20% test set
library(caTools)
set.seed(100) # set seed for reproducibility

# define split vector
split <- sample.split(dwd.data$MEAN.ANNUAL.RAINFALL,
                      SplitRatio = 0.8)

train.set <- dwd.data[split == TRUE, ] # training set
test.set <- dwd.data[split == FALSE, ] # test set

# response vectors
y.train <- train.set$MEAN.ANNUAL.RAINFALL # response vector training set
y.test <- test.set$MEAN.ANNUAL.RAINFALL # response vector test set

# model matrices for training and test set
# select all features except the response variable
X.train <- as.matrix(subset(train.set, 
                            select = -c(MEAN.ANNUAL.RAINFALL)))
X.test <- as.matrix(subset(test.set, 
                           select = -c(MEAN.ANNUAL.RAINFALL)))

# save objects for later usage
save(dwd, 
     dwd.data, 
     train.set, 
     test.set, 
     y.train, 
     y.test, 
     X.train, 
     X.test,
     split,
     file = "dwd_30200.RData")

Finally, there remain 163 observations with 16 features to train our model, and 41 observations with 16 features to evaluate it.
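We may verify these numbers by checking the dimensions of the model matrices:

dim(X.train)
## [1] 163  16
dim(X.test)
## [1] 41 16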


Data visualization

Before we build our model we visualize the data we are working with by plotting a scatter plot matrix. The R software environment provides several functions for plotting scatter plot matrices. Just to name a few, we may use the pairs() function, the splom() function from the lattice package, the pairs.panels() function from the psych package, or the ggpairs() function from the GGally package. The particular choice is a matter of experience, the size of the data set, and personal taste. To demonstrate how difficult it becomes to visualize a high-dimensional feature space, we apply the built-in pairs() function to the data set.

pairs(dwd.data)

Arghhh! Such a plot is of no use at all! Note that we are visualizing just 17 features, which, compared to many other real-world applications, is not much at all.

A good strategy is to pick only those pairs of variables that at least somehow affect our response variable MEAN.ANNUAL.RAINFALL. Thus, we first examine the pairwise correlations between the features and then select for visualization those features that show the highest correlation. The correlations are easily calculated with the cor() function; however, the output of cor() would still flood the screen, so our situation would not improve. A fancy way to overcome this issue is provided by the corrplot package: if we feed the eponymous corrplot() function with a correlation matrix, we may visualize the correlations among the features in a quite convenient way.

library(corrplot)
## corrplot 0.84 loaded
dwd.data.cor <- cor(dwd.data)
corrplot(dwd.data.cor, method = "number", 
         type = "lower", tl.cex = 0.6, 
         number.cex = 0.5)
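Following this selection strategy, a minimal sketch for extracting the features most strongly correlated with the response and revisiting the scatter plot matrix might look as follows; the cutoff of five features is an arbitrary choice for illustration:

# correlations of all features with the response variable
r <- dwd.data.cor[, "MEAN.ANNUAL.RAINFALL"]
r <- r[names(r) != "MEAN.ANNUAL.RAINFALL"] # drop the self-correlation
# pick the five features with the highest absolute correlation
top.features <- names(sort(abs(r), decreasing = TRUE))[1:5]
# scatter plot matrix restricted to these features and the response
pairs(dwd.data[, c("MEAN.ANNUAL.RAINFALL", top.features)])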