So far we have focused on an example data set with just two explanatory variables. To gain some hands-on experience we now advance to a data set with more than two explanatory variables.
To showcase the multiple linear regression analysis we examine the relationship between the response variable MEAN_ANNUAL_RAINFALL and the other variables in the dwd data set. You may download the DWD.csv file here. Import the data set and assign a proper name to it. Note that we use the read_table() function of the pandas library to load the data set. We explicitly set the column separator to ',' (the default separator of read_table() is the tab character) and use the first column as the row index.
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
dwd = pd.read_table(
"https://userpage.fu-berlin.de/soga/data/raw-data/DWD.csv",
index_col=0,
sep=',',
)
dwd.head()
| ID | DWD_ID | STATION_NAME | FEDERAL_STATE | LAT | LON | ALTITUDE | PERIOD | RECORD_LENGTH | MEAN_ANNUAL_AIR_TEMP | MEAN_MONTHLY_MAX_TEMP | ... | MEAN_ANNUAL_WIND_SPEED | MEAN_CLOUD_COVER | MEAN_ANNUAL_SUNSHINE | MEAN_ANNUAL_RAINFALL | MAX_MONTHLY_WIND_SPEED | MAX_AIR_TEMP | MAX_WIND_SPEED | MAX_RAINFALL | MIN_AIR_TEMP | MEAN_RANGE_AIR_TEMP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Aach | Baden-Württemberg | 47.8413 | 8.8493 | 478.0 | 1931-1986 | 55 | 8.2 | 13.1 | ... | 2.0 | 67.0 | NaN | 755.0 | 2.0 | 32.5 | NaN | 39.0 | -16.3 | 9.6 |
| 1 | 3 | Aachen | Nordrhein-Westfalen | 50.7827 | 6.0941 | 202.0 | 1851-2011 | 160 | 9.8 | 13.6 | ... | 3.0 | 67.0 | 1531.0 | 820.0 | 3.0 | 32.3 | 30.2 | 36.0 | -10.9 | 7.3 |
| 2 | 44 | Großenkneten | Niedersachsen | 52.9335 | 8.2370 | 44.0 | 1971-2016 | 45 | 9.2 | 13.2 | ... | 2.0 | 67.0 | 1459.0 | 759.0 | 3.0 | 32.4 | 29.9 | 32.0 | -12.6 | 7.8 |
| 6 | 71 | Albstadt-Badkap | Baden-Württemberg | 48.2156 | 8.9784 | 759.0 | 1986-2016 | 30 | 7.4 | 12.2 | ... | 2.0 | 66.0 | 1725.0 | 919.0 | 2.0 | 30.2 | NaN | 43.0 | -15.5 | 8.9 |
| 8 | 73 | Aldersbach-Kriestorf | Bayern | 48.6159 | 13.0506 | 340.0 | 1952-2016 | 64 | 8.4 | 13.4 | ... | 1.0 | 65.0 | 1595.0 | 790.0 | 2.0 | 33.0 | NaN | 43.0 | -19.2 | 9.5 |

5 rows × 21 columns
dwd.columns
Index(['DWD_ID', 'STATION_NAME', 'FEDERAL_STATE', 'LAT', 'LON', 'ALTITUDE', 'PERIOD', 'RECORD_LENGTH', 'MEAN_ANNUAL_AIR_TEMP', 'MEAN_MONTHLY_MAX_TEMP', 'MEAN_MONTHLY_MIN_TEMP', 'MEAN_ANNUAL_WIND_SPEED', 'MEAN_CLOUD_COVER', 'MEAN_ANNUAL_SUNSHINE', 'MEAN_ANNUAL_RAINFALL', 'MAX_MONTHLY_WIND_SPEED', 'MAX_AIR_TEMP', 'MAX_WIND_SPEED', 'MAX_RAINFALL', 'MIN_AIR_TEMP', 'MEAN_RANGE_AIR_TEMP'], dtype='object')
The dwd data set consists of 599 rows, each representing a particular weather station in Germany, and 21 columns, each corresponding to a variable/feature related to that weather station. These largely self-explanatory variables are: 'DWD_ID', 'STATION_NAME', 'FEDERAL_STATE', 'LAT', 'LON', 'ALTITUDE', 'PERIOD', 'RECORD_LENGTH', 'MEAN_ANNUAL_AIR_TEMP', 'MEAN_MONTHLY_MAX_TEMP', 'MEAN_MONTHLY_MIN_TEMP', 'MEAN_ANNUAL_WIND_SPEED', 'MEAN_CLOUD_COVER', 'MEAN_ANNUAL_SUNSHINE', 'MEAN_ANNUAL_RAINFALL', 'MAX_MONTHLY_WIND_SPEED', 'MAX_AIR_TEMP', 'MAX_WIND_SPEED', 'MAX_RAINFALL', 'MIN_AIR_TEMP', 'MEAN_RANGE_AIR_TEMP'.
The data was downloaded from the DWD (German Weather Service) data portal on April 21, 2017. You may find a detailed description of the data set here. Please note that for the purpose of this tutorial the data set was preprocessed and columns have been renamed.
Before we dive into the modelling we first have to prepare our data set. Calling the info() method on the data set shows that it contains identifier-like features such as DWD_ID and STATION_NAME. We will drop those columns that we presume do not carry information valuable for the prediction of the mean annual rainfall. In addition we make sure to exclude observations that contain missing values (NaNs).
dwd.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 599 entries, 0 to 1058
Data columns (total 21 columns):
 #   Column                  Non-Null Count  Dtype
---  ------                  --------------  -----
 0   DWD_ID                  599 non-null    int64
 1   STATION_NAME            599 non-null    object
 2   FEDERAL_STATE           599 non-null    object
 3   LAT                     599 non-null    float64
 4   LON                     599 non-null    float64
 5   ALTITUDE                599 non-null    float64
 6   PERIOD                  599 non-null    object
 7   RECORD_LENGTH           599 non-null    int64
 8   MEAN_ANNUAL_AIR_TEMP    598 non-null    float64
 9   MEAN_MONTHLY_MAX_TEMP   597 non-null    float64
 10  MEAN_MONTHLY_MIN_TEMP   595 non-null    float64
 11  MEAN_ANNUAL_WIND_SPEED  588 non-null    float64
 12  MEAN_CLOUD_COVER        588 non-null    float64
 13  MEAN_ANNUAL_SUNSHINE    406 non-null    float64
 14  MEAN_ANNUAL_RAINFALL    586 non-null    float64
 15  MAX_MONTHLY_WIND_SPEED  588 non-null    float64
 16  MAX_AIR_TEMP            597 non-null    float64
 17  MAX_WIND_SPEED          219 non-null    float64
 18  MAX_RAINFALL            585 non-null    float64
 19  MIN_AIR_TEMP            597 non-null    float64
 20  MEAN_RANGE_AIR_TEMP     599 non-null    float64
dtypes: float64(16), int64(2), object(3)
memory usage: 103.0+ KB
dwd = dwd.drop(['DWD_ID', 'STATION_NAME', 'FEDERAL_STATE', 'PERIOD'], axis=1).dropna()
dwd.shape
(204, 17)
After the data cleaning steps there are 204 rows (observations) and 17 columns (features) left in our data set.
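The effect of this cleaning step can be illustrated on a tiny toy DataFrame (an illustrative sketch; the column names here are made up and not part of the dwd data):

```python
import pandas as pd
import numpy as np

# toy data: two identifier-like columns plus two numeric features with gaps
toy = pd.DataFrame({
    "ID": [1, 2, 3],
    "NAME": ["A", "B", "C"],
    "TEMP": [8.2, np.nan, 9.1],
    "RAIN": [755.0, 820.0, np.nan],
})

# drop the identifier columns, then remove every row containing a NaN
clean = toy.drop(["ID", "NAME"], axis=1).dropna()
print(clean.shape)  # (1, 2): only the first row is complete
```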
Moreover, we split our data set at random into two distinct parts: a training set and a test set. The basic idea behind this strategy is that we learn the model parameters on the training set and then use the test set, which was not involved in parameter estimation at all, to assess the goodness of fit of our model. In other words, by using two distinct data sets we may evaluate the generalization properties of our model.
For the sub-setting of the data set we rely on the sample() method, which draws a random fraction of the rows; dropping the sampled row indices from the full data set yields the complementary test set. Finally, we assign the response vector and the data matrix for each subset proper variable names for further processing.
df = dwd.drop('MEAN_ANNUAL_RAINFALL', axis=1)  # data matrix (explanatory variables)
y = dwd['MEAN_ANNUAL_RAINFALL']                # response vector
train_set = df.sample(frac=0.8, random_state=42)
test_set = df.drop(train_set.index)
y_train, y_test = y.loc[train_set.index], y.loc[test_set.index]
train_set.shape
(163, 16)
train_set.columns
Index(['LAT', 'LON', 'ALTITUDE', 'RECORD_LENGTH', 'MEAN_ANNUAL_AIR_TEMP', 'MEAN_MONTHLY_MAX_TEMP', 'MEAN_MONTHLY_MIN_TEMP', 'MEAN_ANNUAL_WIND_SPEED', 'MEAN_CLOUD_COVER', 'MEAN_ANNUAL_SUNSHINE', 'MAX_MONTHLY_WIND_SPEED', 'MAX_AIR_TEMP', 'MAX_WIND_SPEED', 'MAX_RAINFALL', 'MIN_AIR_TEMP', 'MEAN_RANGE_AIR_TEMP'], dtype='object')
Finally there remain 163 observations with 16 explanatory features in the data set to train our model, and 41 observations with 16 explanatory features to evaluate our model.
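Note that the sample()/drop() idiom guarantees that the two subsets are disjoint and jointly cover the full data set. A minimal sketch on a toy DataFrame (the toy data here is made up):

```python
import pandas as pd

toy = pd.DataFrame({"x": range(10)})
train = toy.sample(frac=0.8, random_state=42)  # 80 % of the rows, reproducible
test = toy.drop(train.index)                   # the remaining 20 %

print(len(train), len(test))                    # 8 2
print(set(train.index).isdisjoint(test.index))  # True
```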
Before we build our model we visualize the data we are working with by plotting a scatter plot matrix. The seaborn library provides the pairplot() function, which plots the pairwise relationships between the columns of a DataFrame on a facet grid.
sns.pairplot(train_set, vars=train_set.columns)
plt.show()
Arghhh! Such a plot is of no use at all! Note that we are visualizing just 16 features, which is not much at all compared to many other real world applications.
A good strategy is to choose only those pairs of variables that at least somehow affect our response variable MEAN_ANNUAL_RAINFALL. Thus, we first examine the pairwise correlation between the features, and then select for visualization those features which show the highest correlation. The correlations can easily be calculated using the corr() method; however, the output would still flood the screen, so that alone does not improve our situation. A convenient way to overcome this issue is provided by the seaborn library: if we feed the heatmap() function with a correlation matrix we can visualize the correlations among the features at a glance.
corr = dwd.corr()
mask = np.triu(np.ones_like(corr, dtype=bool))
cmap = sns.diverging_palette(230, 20, as_cmap=True)
plt.subplots(figsize=(12, 10))
sns.heatmap(corr, mask=mask, cmap=cmap)
plt.show()
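A note on the mask argument: since a correlation matrix is symmetric, its upper triangle merely repeats the lower one, and the boolean mask tells heatmap() to hide those redundant cells. A small sketch of what np.triu produces:

```python
import numpy as np

# boolean mask that is True on and above the main diagonal
mask = np.triu(np.ones((3, 3), dtype=bool))
print(mask)
# [[ True  True  True]
#  [False  True  True]
#  [False False  True]]
```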
Take a look at the plot above and focus on the row/column of the response variable MEAN_ANNUAL_RAINFALL. In that particular row/column you find the Pearson correlation coefficient of the response with each feature. The Pearson correlation coefficient is a number between $-1$ and $+1$. A positive number indicates a positive relation between a pair of features, and a negative number indicates a negative relation. The more extreme the number, the stronger the correlation.
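For intuition, the Pearson correlation coefficient can also be computed directly from its definition, $r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}}$. A small sketch on made-up data (the helper function and the sample values are illustrative, not part of the dwd analysis):

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient computed from its definition."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xm, ym = x - x.mean(), y - y.mean()
    return (xm * ym).sum() / np.sqrt((xm**2).sum() * (ym**2).sum())

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
print(pearson_r(x, 2 * x + 1))     # 1.0  -> perfect positive linear relation
print(pearson_r(x, -0.5 * x + 3))  # -1.0 -> perfect negative linear relation
```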
OK, back to our strategy: how do we choose those pairs of variables that at least somehow affect our response variable MEAN_ANNUAL_RAINFALL?
By now we have all the building blocks to solve this problem. We set a threshold level, let us say $\pm 0.5$. We select all features whose correlation coefficient with the response variable exceeds that threshold in absolute value, and then we plot them in a scatter plot matrix.
corr_threshold = .5
best_features = corr[np.abs(corr['MEAN_ANNUAL_RAINFALL']) > corr_threshold].index
sns.pairplot(dwd, vars=best_features, corner=True)
plt.show()
Finally, we get a nice and clear representation of a subset of features of the dwd data set, all of them showing some kind of relation to one another. In the next section we build a number of multiple linear regression models and investigate these relationships in more detail.
However, before we continue we should remind ourselves that the data set we are working with has a spatial component: our observations are point measurements of meteorological parameters spread across Germany. Let us plot a simple map to get a feel for the spatial distribution of our observations. To do so we rely on the folium package.
import folium
loc_center = [df['LAT'].mean(), df['LON'].mean()]
germany_map = folium.Map(location=loc_center, tiles='OpenStreetMap', zoom_start=5, control_scale=True)

for index, loc in train_set.iterrows():
    folium.CircleMarker(
        [loc['LAT'], loc['LON']],
        weight=5,
        radius=2,
        color='red',
        popup='training data',
    ).add_to(germany_map)

for index, loc in test_set.iterrows():
    folium.CircleMarker(
        [loc['LAT'], loc['LON']],
        weight=5,
        radius=2,
        color='blue',
        popup='test data',
    ).add_to(germany_map)

folium.LayerControl().add_to(germany_map)
germany_map
Nice! We realize that our data is well distributed across Germany. Further, it appears that the observations in the test data set are randomly distributed across the country too.
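This visual impression can be backed up numerically by comparing summary statistics of the coordinates in both subsets, e.g. their mean latitude and longitude. The sketch below mimics train_set[['LAT', 'LON']] and test_set[['LAT', 'LON']] with synthetic coordinates so that it is self-contained:

```python
import numpy as np
import pandas as pd

# synthetic stand-in for the station coordinates of the dwd data
rng = np.random.default_rng(0)
coords = pd.DataFrame({"LAT": rng.uniform(47, 55, size=200),
                       "LON": rng.uniform(6, 15, size=200)})
train = coords.sample(frac=0.8, random_state=42)
test = coords.drop(train.index)

# a random split should leave both subsets centred on a similar location
print(train.mean().round(1))
print(test.mean().round(1))
```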
Citation
The E-Learning project SOGA-Py was developed at the Department of Earth Sciences by Annette Rudolph, Joachim Krois and Kai Hartmann. You can reach us via e-mail at soga[at]zedat.fu-berlin.de.
Please cite as follows: Rudolph, A., Krois, J., Hartmann, K. (2023): Statistics and Geodata Analysis using Python (SOGA-Py). Department of Earth Sciences, Freie Universitaet Berlin.