So far we have focused on an example data set with just two explanatory variables. To gain some hands-on experience we now advance to a data set with more than two explanatory variables.
To showcase the multiple linear regression analysis we examine the relationship between the response variable MEAN_ANNUAL_RAINFALL and the other variables in the dwd data set. You may download the DWD.csv file here. Import the data set and assign a proper name to it. Note that we use the read_table() function of the pandas library to load the data set. We explicitly set the column separator to ',' (the default separator of read_table() is the tab character) and use the first column as the row index.
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
dwd = pd.read_table(
"https://userpage.fu-berlin.de/soga/data/raw-data/DWD.csv",
index_col=0,
sep=',',
)
dwd.head()
| ID | DWD_ID | STATION_NAME | FEDERAL_STATE | LAT | LON | ALTITUDE | PERIOD | RECORD_LENGTH | MEAN_ANNUAL_AIR_TEMP | MEAN_MONTHLY_MAX_TEMP | ... | MEAN_ANNUAL_WIND_SPEED | MEAN_CLOUD_COVER | MEAN_ANNUAL_SUNSHINE | MEAN_ANNUAL_RAINFALL | MAX_MONTHLY_WIND_SPEED | MAX_AIR_TEMP | MAX_WIND_SPEED | MAX_RAINFALL | MIN_AIR_TEMP | MEAN_RANGE_AIR_TEMP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Aach | Baden-Württemberg | 47.8413 | 8.8493 | 478.0 | 1931-1986 | 55 | 8.2 | 13.1 | ... | 2.0 | 67.0 | NaN | 755.0 | 2.0 | 32.5 | NaN | 39.0 | -16.3 | 9.6 |
| 1 | 3 | Aachen | Nordrhein-Westfalen | 50.7827 | 6.0941 | 202.0 | 1851-2011 | 160 | 9.8 | 13.6 | ... | 3.0 | 67.0 | 1531.0 | 820.0 | 3.0 | 32.3 | 30.2 | 36.0 | -10.9 | 7.3 |
| 2 | 44 | Großenkneten | Niedersachsen | 52.9335 | 8.2370 | 44.0 | 1971-2016 | 45 | 9.2 | 13.2 | ... | 2.0 | 67.0 | 1459.0 | 759.0 | 3.0 | 32.4 | 29.9 | 32.0 | -12.6 | 7.8 |
| 6 | 71 | Albstadt-Badkap | Baden-Württemberg | 48.2156 | 8.9784 | 759.0 | 1986-2016 | 30 | 7.4 | 12.2 | ... | 2.0 | 66.0 | 1725.0 | 919.0 | 2.0 | 30.2 | NaN | 43.0 | -15.5 | 8.9 |
| 8 | 73 | Aldersbach-Kriestorf | Bayern | 48.6159 | 13.0506 | 340.0 | 1952-2016 | 64 | 8.4 | 13.4 | ... | 1.0 | 65.0 | 1595.0 | 790.0 | 2.0 | 33.0 | NaN | 43.0 | -19.2 | 9.5 |

5 rows × 21 columns
dwd.columns
Index(['DWD_ID', 'STATION_NAME', 'FEDERAL_STATE', 'LAT', 'LON', 'ALTITUDE', 'PERIOD', 'RECORD_LENGTH', 'MEAN_ANNUAL_AIR_TEMP', 'MEAN_MONTHLY_MAX_TEMP', 'MEAN_MONTHLY_MIN_TEMP', 'MEAN_ANNUAL_WIND_SPEED', 'MEAN_CLOUD_COVER', 'MEAN_ANNUAL_SUNSHINE', 'MEAN_ANNUAL_RAINFALL', 'MAX_MONTHLY_WIND_SPEED', 'MAX_AIR_TEMP', 'MAX_WIND_SPEED', 'MAX_RAINFALL', 'MIN_AIR_TEMP', 'MEAN_RANGE_AIR_TEMP'], dtype='object')
The dwd data set consists of 599 rows, each representing a particular weather station in Germany, and 21 columns, each corresponding to a variable/feature related to that weather station. These largely self-explanatory variables are: 'DWD_ID', 'STATION_NAME', 'FEDERAL_STATE', 'LAT', 'LON', 'ALTITUDE', 'PERIOD', 'RECORD_LENGTH', 'MEAN_ANNUAL_AIR_TEMP', 'MEAN_MONTHLY_MAX_TEMP', 'MEAN_MONTHLY_MIN_TEMP', 'MEAN_ANNUAL_WIND_SPEED', 'MEAN_CLOUD_COVER', 'MEAN_ANNUAL_SUNSHINE', 'MEAN_ANNUAL_RAINFALL', 'MAX_MONTHLY_WIND_SPEED', 'MAX_AIR_TEMP', 'MAX_WIND_SPEED', 'MAX_RAINFALL', 'MIN_AIR_TEMP', 'MEAN_RANGE_AIR_TEMP'.
The data was downloaded from the DWD (German Weather Service) data portal on April 21, 2017. You may find a detailed description of the data set here. Please note that for the purpose of this tutorial the data set was preprocessed and columns have been renamed.
Before we dive into the modelling we first have to prepare our data set. Calling the info() method on the data set shows that it contains identifier-like features such as DWD_ID and STATION_NAME. We will drop those columns that we presume do not carry information valuable for the prediction of the mean annual rainfall. In addition we make sure to exclude observations that contain missing values (NaNs).
dwd.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 599 entries, 0 to 1058
Data columns (total 21 columns):
 #   Column                  Non-Null Count  Dtype
---  ------                  --------------  -----
 0   DWD_ID                  599 non-null    int64
 1   STATION_NAME            599 non-null    object
 2   FEDERAL_STATE           599 non-null    object
 3   LAT                     599 non-null    float64
 4   LON                     599 non-null    float64
 5   ALTITUDE                599 non-null    float64
 6   PERIOD                  599 non-null    object
 7   RECORD_LENGTH           599 non-null    int64
 8   MEAN_ANNUAL_AIR_TEMP    598 non-null    float64
 9   MEAN_MONTHLY_MAX_TEMP   597 non-null    float64
 10  MEAN_MONTHLY_MIN_TEMP   595 non-null    float64
 11  MEAN_ANNUAL_WIND_SPEED  588 non-null    float64
 12  MEAN_CLOUD_COVER        588 non-null    float64
 13  MEAN_ANNUAL_SUNSHINE    406 non-null    float64
 14  MEAN_ANNUAL_RAINFALL    586 non-null    float64
 15  MAX_MONTHLY_WIND_SPEED  588 non-null    float64
 16  MAX_AIR_TEMP            597 non-null    float64
 17  MAX_WIND_SPEED          219 non-null    float64
 18  MAX_RAINFALL            585 non-null    float64
 19  MIN_AIR_TEMP            597 non-null    float64
 20  MEAN_RANGE_AIR_TEMP     599 non-null    float64
dtypes: float64(16), int64(2), object(3)
memory usage: 103.0+ KB
dwd = dwd.drop(['DWD_ID', 'STATION_NAME', 'FEDERAL_STATE', 'PERIOD'], axis=1).dropna()
dwd.shape
(204, 17)
After the data cleaning steps there are 204 rows (observations) and 17 columns (features) left in our data set.
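The effect of this cleaning step can be illustrated on a tiny toy DataFrame (an illustrative sketch; the column names here are made up and not part of the dwd data):

```python
import pandas as pd
import numpy as np

# toy data: two identifier-like columns plus two numeric features with gaps
toy = pd.DataFrame({
    "ID": [1, 2, 3],
    "NAME": ["A", "B", "C"],
    "TEMP": [8.2, np.nan, 9.1],
    "RAIN": [755.0, 820.0, np.nan],
})

# drop the identifier columns, then remove every row containing a NaN
clean = toy.drop(["ID", "NAME"], axis=1).dropna()
print(clean.shape)  # (1, 2): only the first row is complete
```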
Moreover, we split our data set at random into two distinct parts: a training set and a test set. The basic idea behind this strategy is that we learn the model parameters on the training set and then use the test set, which was not involved in parameter estimation at all, to assess the goodness of fit of our model. In other words, by using two distinct data sets we may evaluate the generalization properties of our model.
For the sub-setting of the data set we rely on the sample() method, which draws a random fraction of the rows; dropping the sampled row indices from the full data set yields the complementary test set. Finally, we assign the response vector and the data matrix for each subset proper variable names for further processing.
df = dwd.drop('MEAN_ANNUAL_RAINFALL', axis=1)  # data matrix (explanatory variables)
y = dwd['MEAN_ANNUAL_RAINFALL']                # response vector
train_set = df.sample(frac=0.8, random_state=42)
test_set = df.drop(train_set.index)
y_train, y_test = y.loc[train_set.index], y.loc[test_set.index]
train_set.shape
(163, 16)
train_set.columns
Index(['LAT', 'LON', 'ALTITUDE', 'RECORD_LENGTH', 'MEAN_ANNUAL_AIR_TEMP', 'MEAN_MONTHLY_MAX_TEMP', 'MEAN_MONTHLY_MIN_TEMP', 'MEAN_ANNUAL_WIND_SPEED', 'MEAN_CLOUD_COVER', 'MEAN_ANNUAL_SUNSHINE', 'MAX_MONTHLY_WIND_SPEED', 'MAX_AIR_TEMP', 'MAX_WIND_SPEED', 'MAX_RAINFALL', 'MIN_AIR_TEMP', 'MEAN_RANGE_AIR_TEMP'], dtype='object')
Finally there remain 163 observations with 16 explanatory features in the data set to train our model, and 41 observations with 16 explanatory features to evaluate our model.
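Note that the sample()/drop() idiom guarantees that the two subsets are disjoint and jointly cover the full data set. A minimal sketch on a toy DataFrame (the toy data here is made up):

```python
import pandas as pd

toy = pd.DataFrame({"x": range(10)})
train = toy.sample(frac=0.8, random_state=42)  # 80 % of the rows, reproducible
test = toy.drop(train.index)                   # the remaining 20 %

print(len(train), len(test))                    # 8 2
print(set(train.index).isdisjoint(test.index))  # True
```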
Before we build our model we visualize the data we are working with by plotting a scatter plot matrix. The seaborn library provides the pairplot() function, which plots the pairwise relationships between the columns of a DataFrame on a facet grid.
sns.pairplot(train_set, vars=train_set.columns)
plt.show()
Arghhh! Such a plot is of no use at all! Note that we are visualizing just 16 features, which is not much at all compared to many other real world applications.
A good strategy is to choose only those pairs of variables that at least somehow affect our response variable MEAN_ANNUAL_RAINFALL. Thus, we first examine the pairwise correlation between the features, and then select for visualization those features which show the highest correlation. The correlations can easily be calculated using the corr() method; however, the output would still flood the screen, so that alone does not improve our situation. A convenient way to overcome this issue is provided by the seaborn library: if we feed the heatmap() function with a correlation matrix we can visualize the correlations among the features at a glance.
corr = dwd.corr()
mask = np.triu(np.ones_like(corr, dtype=bool))
cmap = sns.diverging_palette(230, 20, as_cmap=True)
plt.subplots(figsize=(12, 10))
sns.heatmap(corr, mask=mask, cmap=cmap)
plt.show()
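A note on the mask argument: since a correlation matrix is symmetric, its upper triangle merely repeats the lower one, and the boolean mask tells heatmap() to hide those redundant cells. A small sketch of what np.triu produces:

```python
import numpy as np

# boolean mask that is True on and above the main diagonal
mask = np.triu(np.ones((3, 3), dtype=bool))
print(mask)
# [[ True  True  True]
#  [False  True  True]
#  [False False  True]]
```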
Take a look at the plot above and focus on the row/column of the response variable MEAN_ANNUAL_RAINFALL. In that particular row/column you find the Pearson correlation coefficient of the response with each feature. The Pearson correlation coefficient is a number between $-1$ and $+1$. A positive number indicates a positive relation between a pair of features, and a negative number indicates a negative relation. The more extreme the number, the stronger the correlation.
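For intuition, the Pearson correlation coefficient can also be computed directly from its definition, $r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}}$. A small sketch on made-up data (the helper function and the sample values are illustrative, not part of the dwd analysis):

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient computed from its definition."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xm, ym = x - x.mean(), y - y.mean()
    return (xm * ym).sum() / np.sqrt((xm**2).sum() * (ym**2).sum())

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
print(pearson_r(x, 2 * x + 1))     # 1.0  -> perfect positive linear relation
print(pearson_r(x, -0.5 * x + 3))  # -1.0 -> perfect negative linear relation
```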
OK, back to our strategy: how do we choose those pairs of variables that at least somehow affect our response variable MEAN_ANNUAL_RAINFALL?
By now we have all the building blocks to solve this problem. We set a threshold level, let us say $\pm 0.5$. We select all features whose correlation coefficient with the response variable exceeds that threshold in absolute value, and then we plot them in a scatter plot matrix.
corr_threshold = .5
best_features = corr[np.abs(corr['MEAN_ANNUAL_RAINFALL']) > corr_threshold].index
sns.pairplot(dwd, vars=best_features, corner=True)
plt.show()
Finally, we get a nice and clear representation of a subset of features of the dwd data set, all of them showing some kind of relation to one another. In the next section we build a number of multiple linear regression models and investigate these relationships in more detail.
However, before we continue we should remind ourselves that the data set we are working with has a spatial component: our observations are point measurements of meteorological parameters spread across Germany. Let us plot a simple map to get a feel for the spatial distribution of our observations. To do so we rely on the folium package.
import folium
loc_center = [df['LAT'].mean(), df['LON'].mean()]
germany_map = folium.Map(location=loc_center, tiles='OpenStreetMap', zoom_start=5, control_scale=True)

for index, loc in train_set.iterrows():
    folium.CircleMarker(
        [loc['LAT'], loc['LON']],
        weight=5,
        radius=2,
        color='red',
        popup='training data',
    ).add_to(germany_map)

for index, loc in test_set.iterrows():
    folium.CircleMarker(
        [loc['LAT'], loc['LON']],
        weight=5,
        radius=2,
        color='blue',
        popup='test data',
    ).add_to(germany_map)

folium.LayerControl().add_to(germany_map)
germany_map
Nice! We realize that our data is well distributed across Germany. Further, it appears that the observations in the test data set are randomly distributed across the country too.
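This visual impression can be backed up numerically by comparing summary statistics of the coordinates in both subsets, e.g. their mean latitude and longitude. The sketch below mimics train_set[['LAT', 'LON']] and test_set[['LAT', 'LON']] with synthetic coordinates so that it is self-contained:

```python
import numpy as np
import pandas as pd

# synthetic stand-in for the station coordinates of the dwd data
rng = np.random.default_rng(0)
coords = pd.DataFrame({"LAT": rng.uniform(47, 55, size=200),
                       "LON": rng.uniform(6, 15, size=200)})
train = coords.sample(frac=0.8, random_state=42)
test = coords.drop(train.index)

# a random split should leave both subsets centred on a similar location
print(train.mean().round(1))
print(test.mean().round(1))
```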
Citation
The E-Learning project SOGA-Py was developed at the Department of Earth Sciences by Annette Rudolph, Joachim Krois and Kai Hartmann. You can reach us via e-mail at soga[at]zedat.fu-berlin.de.
Please cite as follows: Rudolph, A., Krois, J., Hartmann, K. (2023): Statistics and Geodata Analysis using Python (SOGA-Py). Department of Earth Sciences, Freie Universitaet Berlin.