So far we have focused on an example data set with just two explanatory variables. To gain some hands-on experience, we now advance to a data set with more than two explanatory variables.

In order to showcase the multiple linear regression analysis we **examine the relationship between the response variable MEAN_ANNUAL_RAINFALL and the other variables in the dwd data set**.

You may download the `DWD.csv` file here. Import the data set and assign it a proper name. Note that we use the `read_table()` function of the `pandas` library to load the data set. Because `read_table()` assumes a tab-separated file by default, we explicitly set the comma (`,`) as the column separator and use the first column of the file as the row index.

In [4]:

```
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
```

In [5]:

```
dwd = pd.read_table(
    "https://userpage.fu-berlin.de/soga/data/raw-data/DWD.csv",
    index_col=0,
    sep=",",
)
dwd.head()
```

Out[5]:

ID | DWD_ID | STATION_NAME | FEDERAL_STATE | LAT | LON | ALTITUDE | PERIOD | RECORD_LENGTH | MEAN_ANNUAL_AIR_TEMP | MEAN_MONTHLY_MAX_TEMP | ... | MEAN_ANNUAL_WIND_SPEED | MEAN_CLOUD_COVER | MEAN_ANNUAL_SUNSHINE | MEAN_ANNUAL_RAINFALL | MAX_MONTHLY_WIND_SPEED | MAX_AIR_TEMP | MAX_WIND_SPEED | MAX_RAINFALL | MIN_AIR_TEMP | MEAN_RANGE_AIR_TEMP
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 1 | Aach | Baden-Württemberg | 47.8413 | 8.8493 | 478.0 | 1931-1986 | 55 | 8.2 | 13.1 | ... | 2.0 | 67.0 | NaN | 755.0 | 2.0 | 32.5 | NaN | 39.0 | -16.3 | 9.6
1 | 3 | Aachen | Nordrhein-Westfalen | 50.7827 | 6.0941 | 202.0 | 1851-2011 | 160 | 9.8 | 13.6 | ... | 3.0 | 67.0 | 1531.0 | 820.0 | 3.0 | 32.3 | 30.2 | 36.0 | -10.9 | 7.3
2 | 44 | Großenkneten | Niedersachsen | 52.9335 | 8.2370 | 44.0 | 1971-2016 | 45 | 9.2 | 13.2 | ... | 2.0 | 67.0 | 1459.0 | 759.0 | 3.0 | 32.4 | 29.9 | 32.0 | -12.6 | 7.8
6 | 71 | Albstadt-Badkap | Baden-Württemberg | 48.2156 | 8.9784 | 759.0 | 1986-2016 | 30 | 7.4 | 12.2 | ... | 2.0 | 66.0 | 1725.0 | 919.0 | 2.0 | 30.2 | NaN | 43.0 | -15.5 | 8.9
8 | 73 | Aldersbach-Kriestorf | Bayern | 48.6159 | 13.0506 | 340.0 | 1952-2016 | 64 | 8.4 | 13.4 | ... | 1.0 | 65.0 | 1595.0 | 790.0 | 2.0 | 33.0 | NaN | 43.0 | -19.2 | 9.5

5 rows × 21 columns

In [6]:

```
dwd.columns
```

Out[6]:

Index(['DWD_ID', 'STATION_NAME', 'FEDERAL_STATE', 'LAT', 'LON', 'ALTITUDE', 'PERIOD', 'RECORD_LENGTH', 'MEAN_ANNUAL_AIR_TEMP', 'MEAN_MONTHLY_MAX_TEMP', 'MEAN_MONTHLY_MIN_TEMP', 'MEAN_ANNUAL_WIND_SPEED', 'MEAN_CLOUD_COVER', 'MEAN_ANNUAL_SUNSHINE', 'MEAN_ANNUAL_RAINFALL', 'MAX_MONTHLY_WIND_SPEED', 'MAX_AIR_TEMP', 'MAX_WIND_SPEED', 'MAX_RAINFALL', 'MIN_AIR_TEMP', 'MEAN_RANGE_AIR_TEMP'], dtype='object')

The *dwd* data set consists of 599 rows, each representing a particular weather station in Germany, and 21 columns, each corresponding to a variable/feature of that station. These self-explanatory variables are:
'DWD_ID', 'STATION_NAME', 'FEDERAL_STATE', 'LAT', 'LON', 'ALTITUDE',
'PERIOD', 'RECORD_LENGTH', 'MEAN_ANNUAL_AIR_TEMP',
'MEAN_MONTHLY_MAX_TEMP', 'MEAN_MONTHLY_MIN_TEMP',
'MEAN_ANNUAL_WIND_SPEED', 'MEAN_CLOUD_COVER', 'MEAN_ANNUAL_SUNSHINE',
'MEAN_ANNUAL_RAINFALL', 'MAX_MONTHLY_WIND_SPEED', 'MAX_AIR_TEMP',
'MAX_WIND_SPEED', 'MAX_RAINFALL', 'MIN_AIR_TEMP',
'MEAN_RANGE_AIR_TEMP'.

The data was downloaded from the DWD (German Weather Service) data portal on April 21, 2017. You may find a detailed description of the data set here. Please note that for the purpose of this tutorial the data set was preprocessed and columns have been renamed.

Before we dive into the modelling we first have to prepare our data set. Calling the `info()` method on the data set reveals features such as `DWD_ID` and `STATION_NAME` that serve as identifiers rather than predictors. We drop the columns that we presume carry no information valuable for predicting the mean annual rainfall. In addition, we exclude observations that contain missing values (`NaN`s).

In [7]:

```
dwd.info()
```

In [8]:

```
dwd = dwd.drop(['DWD_ID', 'STATION_NAME', 'FEDERAL_STATE', 'PERIOD'], axis=1).dropna()
dwd.shape
```

Out[8]:

(204, 17)

After the data cleaning steps there are 204 rows (observations) and 17 columns left in our data set.
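The effect of chaining `drop()` and `dropna()` can be illustrated on a small hypothetical frame (the column names below merely mimic the dwd data, the values are made up):

```python
import numpy as np
import pandas as pd

# hypothetical toy frame standing in for the dwd data set
toy = pd.DataFrame({
    "ALTITUDE": [478.0, 202.0, np.nan, 340.0],
    "MEAN_ANNUAL_SUNSHINE": [np.nan, 1531.0, 1459.0, 1595.0],
})

# count missing values per column before dropping
print(toy.isna().sum())

# dropna() removes every row that contains at least one NaN
clean = toy.dropna()
print(clean.shape)  # (2, 2): two of the four rows survive
```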

Moreover, we split our data set at random into two distinct parts: a **training set** and a **test set**. The idea behind this grouping strategy is that we estimate the model parameters on the training set and then use the test set, a data set that was not involved in parameter estimation at all, to determine the goodness of fit of our model. By evaluating the model on data it has never seen, we can assess its **generalization properties**.
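This strategy can be sketched on a toy frame: `sample(frac=0.8)` draws the training rows at random, and dropping their index from the frame leaves the disjoint test rows (a minimal sketch with made-up data, not the dwd set):

```python
import pandas as pd

# hypothetical stand-in data
df_toy = pd.DataFrame({"x": range(10), "y": range(10, 20)})

train_toy = df_toy.sample(frac=0.8, random_state=42)  # 8 of 10 rows
test_toy = df_toy.drop(train_toy.index)               # the remaining 2 rows

# the two subsets are disjoint and together cover the whole frame
print(len(train_toy), len(test_toy))  # 8 2
```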

For the sub-setting of the data set we rely on the `.sample()` method, which draws a random fraction of the rows; dropping the sampled index from the full data set yields the complementary test set, so the two sets split the data in a predefined ratio (here 80/20). Finally, we assign the **response vector** and the **data matrix** of each subset a proper variable name for further processing.

In [9]:

```
# split at random into training (80 %) and test (20 %) parts
train = dwd.sample(frac=0.8, random_state=42)
test = dwd.drop(train.index)
# separate the response vector and the data matrix in each subset
y_train, train_set = train["MEAN_ANNUAL_RAINFALL"], train.drop("MEAN_ANNUAL_RAINFALL", axis=1)
y_test, test_set = test["MEAN_ANNUAL_RAINFALL"], test.drop("MEAN_ANNUAL_RAINFALL", axis=1)
train_set.shape
```

Out[9]:

(163, 16)

In [10]:

```
train_set.columns
```

Out[10]:

Index(['LAT', 'LON', 'ALTITUDE', 'RECORD_LENGTH', 'MEAN_ANNUAL_AIR_TEMP', 'MEAN_MONTHLY_MAX_TEMP', 'MEAN_MONTHLY_MIN_TEMP', 'MEAN_ANNUAL_WIND_SPEED', 'MEAN_CLOUD_COVER', 'MEAN_ANNUAL_SUNSHINE', 'MAX_MONTHLY_WIND_SPEED', 'MAX_AIR_TEMP', 'MAX_WIND_SPEED', 'MAX_RAINFALL', 'MIN_AIR_TEMP', 'MEAN_RANGE_AIR_TEMP'], dtype='object')

Finally there remain 163 observations with 16 features in the data set to train our model, and 41 observations with 16 features to evaluate our model.

Before we build our model we visualize the data we are working with by plotting a scatter plot matrix. The `seaborn` library provides the `pairplot()` function, which draws a grid of pairwise scatter plots, with the distribution of each variable on the diagonal, for the columns of a `DataFrame`.

In [11]:

```
sns.pairplot(train_set, vars=train_set.columns)
plt.show()
```
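With 16 explanatory variables the pair plot becomes quite large; a correlation heatmap offers a compact complement. The sketch below uses a small made-up frame so it runs stand-alone; in the tutorial you would compute `train_set.corr()` instead:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# hypothetical stand-in for train_set
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(50, 4)),
                    columns=["LAT", "LON", "ALTITUDE", "MEAN_CLOUD_COVER"])

corr = demo.corr()  # pairwise Pearson correlations

plt.figure(figsize=(6, 5))
sns.heatmap(corr, cmap="coolwarm", vmin=-1, vmax=1, annot=True, fmt=".2f")
plt.tight_layout()
plt.show()
```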