So far we have focused on an example data set with just two explanatory variables. To gain some hands-on experience, we now move on to a data set with more than two explanatory variables.

To showcase multiple linear regression analysis we **examine the relationship between the response variable MEAN.ANNUAL.RAINFALL and the other variables in the dwd data set**.

You may download the `DWD.csv` file here. Import the data set and assign a proper name to it. Note that we use the `read.csv2()` function to load the data set. The `read.csv()` and `read.csv2()` functions are identical except for their defaults: `read.csv2()` is suitable for data originating from countries that use a comma as the decimal separator and a semicolon as the field separator.

```
dwd <- read.csv2("https://userpage.fu-berlin.de/soga/300/30100_data_sets/DWD.csv",
                 stringsAsFactors = FALSE)
```
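To make the defaults concrete, here is a minimal, self-contained sketch (using toy inline data, not the DWD file) showing that `read.csv2()` is equivalent to `read.csv()` with the separator and decimal mark set by hand:

```r
# read.csv() defaults:  sep = ",", dec = "."
# read.csv2() defaults: sep = ";", dec = ","
txt <- "STATION;RAINFALL\nAach;754,5\nAachen;820,0"  # toy example data
d1 <- read.csv2(text = txt)                          # semicolon / decimal comma
d2 <- read.csv(text = txt, sep = ";", dec = ",")     # same result via read.csv()
identical(d1, d2)  # TRUE
```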

The *dwd* data set consists of 599 rows, each representing a particular weather station in Germany, and 22 columns, each corresponding to a variable/feature related to that station. These self-explanatory variables are: id, DWD_ID, STATION.NAME, FEDERAL.STATE, LAT, LON, ALTITUDE, PERIOD, RECORD.LENGTH, MEAN.ANNUAL.AIR.TEMP, MEAN.MONTHLY.MAX.TEMP, MEAN.MONTHLY.MIN.TEMP, MEAN.ANNUAL.WIND.SPEED, MEAN.CLOUD.COVER, MEAN.ANNUAL.SUNSHINE, MEAN.ANNUAL.RAINFALL, MAX.MONTHLY.WIND.SPEED, MAX.AIR.TEMP, MAX.WIND.SPEED, MAX.RAINFALL, MIN.AIR.TEMP, MEAN.RANGE.AIR.TEMP.

The data was downloaded from the DWD (German Weather Service) data portal on April 21, 2017. You may find a detailed description of the data set here. Please note that for the purpose of this tutorial the data set was preprocessed and columns have been renamed.

Before we dive into the modelling we first have to prepare our data set. Calling the `str()` function on the data set reveals features such as `id`, `DWD_ID`, and `STATION.NAME`, among others, which we know, or at least presume, do not carry any useful information for predicting the mean annual rainfall. Thus, we exclude them from the data set. In addition, we make sure to exclude observations that contain missing values (`NA`).

`str(dwd)`

```
## 'data.frame': 599 obs. of 22 variables:
## $ id : int 0 1 2 6 8 9 10 12 14 18 ...
## $ DWD_ID : int 1 3 44 71 73 78 91 98 116 132 ...
## $ STATION.NAME : chr "Aach" "Aachen" "Gro\xdfenkneten" "Albstadt-Badkap" ...
## $ FEDERAL.STATE : chr "Baden-W\xfcrttemberg" "Nordrhein-Westfalen" "Niedersachsen" "Baden-W\xfcrttemberg" ...
## $ LAT : num 47.8 50.8 52.9 48.2 48.6 ...
## $ LON : num 8.85 6.09 8.24 8.98 13.05 ...
## $ ALTITUDE : num 478 202 44 759 340 65 300 780 213 750 ...
## $ PERIOD : chr "1931-1986" "1851-2011" "1971-2016" "1986-2016" ...
## $ RECORD.LENGTH : int 55 160 45 30 64 55 38 67 67 33 ...
## $ MEAN.ANNUAL.AIR.TEMP : num 8.2 9.8 9.2 7.4 8.4 9.3 8.2 5.1 8.4 5.7 ...
## $ MEAN.MONTHLY.MAX.TEMP : num 13.1 13.6 13.2 12.2 13.4 13.4 12.7 8.9 12.9 9.2 ...
## $ MEAN.MONTHLY.MIN.TEMP : num 3.5 6.3 5.4 3.3 3.9 5.2 4.1 2.2 4.2 2.7 ...
## $ MEAN.ANNUAL.WIND.SPEED: num 2 3 2 2 1 2 3 3 2 3 ...
## $ MEAN.CLOUD.COVER : num 67 67 67 66 65 67 72 72 66 64 ...
## $ MEAN.ANNUAL.SUNSHINE : num NA 1531 1459 1725 1595 ...
## $ MEAN.ANNUAL.RAINFALL : num 755 820 759 919 790 794 657 NA NA 915 ...
## $ MAX.MONTHLY.WIND.SPEED: num 2 3 3 2 2 2 3 4 3 3 ...
## $ MAX.AIR.TEMP : num 32.5 32.3 32.4 30.2 33 32.2 31.6 27.6 33.2 29 ...
## $ MAX.WIND.SPEED : num NA 30.2 29.9 NA NA NA NA NA NA NA ...
## $ MAX.RAINFALL : num 39 36 32 43 43 33 37 NA NA 40 ...
## $ MIN.AIR.TEMP : num -16.3 -10.9 -12.6 -15.5 -19.2 -13.3 -15.2 -15.7 -17.5 -17.2 ...
## $ MEAN.RANGE.AIR.TEMP : num 9.6 7.3 7.8 8.9 9.5 8.2 8.6 6.7 8.6 6.5 ...
```

```
# Exclude variables that carry no predictive information
cols.to.drop <- c('id', 'DWD_ID', 'STATION.NAME',
                  'FEDERAL.STATE', 'PERIOD') # columns to drop
rows.to.keep <- complete.cases(dwd) # rows without missing values
dwd.data <- dwd[rows.to.keep, !(colnames(dwd) %in% cols.to.drop)]
```

After the data cleaning steps there are 204 rows (observations) and 17 columns (features) left in our data set.

`dim(dwd.data)`

`## [1] 204 17`

Moreover, we split our data set at random into two distinct parts: one part is our **training set**, while the other serves as our **test set**. The basic idea behind this grouping strategy is that we learn the model parameters on the training set, and then use the test set, a data set that was not involved at all in parameter estimation, to determine the goodness of fit of our model. In other words, by using two distinct data sets we may evaluate the **generalization properties** of our model.

For the sub-setting of the data set we rely on the `sample.split()` function from the `caTools` package, which splits the data set into two sets (training and test set) at a predefined ratio. Finally, we assign proper variable names to the **response vector** and the **data matrix** of each subset for further processing.

```
# Split data into training and test set, 80% training and 20% test set
library(caTools)
set.seed(100) # set seed for reproducibility

# define split vector
split <- sample.split(dwd.data$MEAN.ANNUAL.RAINFALL,
                      SplitRatio = 0.8)
train.set <- dwd.data[split == TRUE, ]  # training set
test.set <- dwd.data[split == FALSE, ]  # test set

# response vectors
y.train <- train.set$MEAN.ANNUAL.RAINFALL # response vector training set
y.test <- test.set$MEAN.ANNUAL.RAINFALL   # response vector test set

# model matrices for training and test set:
# select all features except the response variable
X.train <- as.matrix(subset(train.set,
                            select = -c(MEAN.ANNUAL.RAINFALL)))
X.test <- as.matrix(subset(test.set,
                           select = -c(MEAN.ANNUAL.RAINFALL)))

# save objects for later usage
save(dwd,
     dwd.data,
     train.set,
     test.set,
     y.train,
     y.test,
     X.train,
     X.test,
     split,
     file = "dwd_30200.RData")
```

After the split, 163 observations with 16 features remain to train our model, and 41 observations with 16 features to evaluate it.
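As a quick sanity check we can confirm these sizes. This sketch assumes the objects created by the split above are still in the workspace:

```r
dim(X.train)  # 163 rows, 16 feature columns
dim(X.test)   # 41 rows, 16 feature columns
# every observation lands in exactly one of the two sets
nrow(X.train) + nrow(X.test) == nrow(dwd.data)  # TRUE
```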

Before we build our model we visualize the data we are working with by plotting a scatter plot matrix. The R software environment provides several functions for this purpose: the built-in `pairs()` function, the `splom()` function from the `lattice` package, the `pairs.panels()` function from the `psych` package, or the `ggpairs()` function from the `GGally` package, to name just a few. The particular choice is a matter of experience, the size of the data set, and personal taste. As a proof of concept that visualizing a high-dimensional feature space becomes difficult, we apply the built-in `pairs()` function to the data set.

`pairs(dwd.data)`

Arghhh! Such a plot is of no use at all! Note that we are visualizing just 17 features, which, compared to many other real-world applications, is not much at all.

A good strategy is to choose only those pairs of variables that at least somehow affect our response variable `MEAN.ANNUAL.RAINFALL`. Thus, we first examine the pairwise correlations between the features, and then select for visualization those features that show the highest correlation. The correlations can easily be calculated with the `cor()` function; however, its output would still flood the screen, so our situation would not improve. A fancy method to overcome this issue is provided by the `corrplot` package: if we feed the eponymous `corrplot()` function with a correlation matrix, we may visualize the correlations among the features in a quite convenient way.

`library(corrplot)`

`## corrplot 0.84 loaded`

```
dwd.data.cor <- cor(dwd.data)
corrplot(dwd.data.cor, method = "number",
         type = "lower", tl.cex = 0.6,
         number.cex = 0.5)
```