Let us recall our research question: We want to examine the relationship between the response variable MEAN ANNUAL RAINFALL and the other variables in the dwd data set.

A multiple regression model is written as:

$$ \begin{align} y_i & = \beta_0 + \sum_{j=1}^dx_{ij}\beta_j + \epsilon_i \\ & = \beta_0 + \beta_1x_{1i} + \beta_2x_{2i} +...+ \beta_dx_{di} + \epsilon_i \text{,} \quad i = 1,2,...n\text{, } x\in \mathbb R^d\text{}, \end{align} $$

where $β_0$ corresponds to the intercept, $β_1,...,β_d$ correspond to the model coefficients, $x_i$ to the observation/measurement data, and $ϵ$ to the residuals.

One of the functions to build multiple linear regression models in Python is the ols() function of statsmodels. It can be used with (R-Style) formula notation as input; thus, a linear model is specified as response ~ predictor(s). In order to increase the number of predictor variables we concatenate them by using the + sign: response ~ predictor1 + predictor2 + predictor3 + ....

The general procedure we will follow is outlined below:

  • Select features and model
  • Build a model for the training set
  • Evaluate model performance on the training set
  • Evaluate model performance on the test set

Consequently, we will showcase different options for feature selection, sometimes referred to as model selection or feature engineering. In order to evaluate a model we define a metric to assess the model performance. In regression modelling, which deals with continuous $d$-dimensional feature spaces ($x_i∈R^{d+1}$) it is quite common to use the mean squared error (MSE) metric or the root mean squared error (RMSE).

The RMSE is the square root of the sum of the squared difference between the observed and predicted values, normalized be the number of observations, $n$.

$$RMSE = \sqrt\frac{\sum_{i=1}^n{(pred_i-obs_i)^2}}{n}$$

The lower the RMSE the better the model fits the data. Please note that the difference between the observed and predicted is nothing else but the residual term of a linear model. In statsmodels, for a given RegressionResult object the residuals are easily accessed by the mse_resid method. For convenience, we wrap the equation into our own function, which calculates RMSE:

In [2]:
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
from statsmodels.graphics.api import abline_plot
from mpl_toolkits.mplot3d import axes3d

def model_rmse(model_results):
    return np.sqrt(model_results.mse_resid)

In addition to the rmse() function from above we implement a rmse2() function, which returns the RMSE as well, but this time the input is not a RegressionResult object, but just two vectors, one of them containing the observations, the other containing the model predictions.

In [3]:
import numpy as np
def diff_rmse(obs, preds):
    return np.sqrt(np.sum((obs-preds)**2)/len(obs))

We start with learning the model parameters based on the training data set. Then we evaluate the goodness of fit by calculating the RMSE. After that we evaluate the model on data, which was never seen before by the model. Therefore we calculate the RMSE of the model applied on the test data set. This out-of-sample measure provides a clue on how well our model generalizes. If the model performs well on the training data set, but performs poorly on the test data set, this is an indication for model overfitting, which is in general a bad thing. That is because we do not want our model to learn, or better call it to memorize, the training data set, but we want to find a model that accounts for the unknown data generation process generating our observations.

Feature and model selection¶

The process of feature and model selection is the first step in model development and sometimes the most time consuming and challenging one. This part in the model development pipeline is more arts than pure science. Domain knowledge, experience and a solid background in mathematics and statistics help during this endeavor. In this section we build the baseline model. Thereafter we will build increasingly more sophisticated models. We store each of our models and the associated prediction error (RMSE), both for the training set as well as the test set, in a list object for further usage.

In [4]:
import matplotlib.pyplot as plt
import pandas as pd
import statsmodels.formula.api as smf
dwd = pd.read_table(
    "https://userpage.fu-berlin.de/soga/data/raw-data/DWD.csv", 
    index_col=0,
    sep=',',
)
df = dwd.drop(['STATION_NAME', 'FEDERAL_STATE', 'PERIOD'], axis=1).dropna()
train_set = df.sample(random_state=42, frac=.8)
test_set = df.drop(train_set.index)

The baseline model¶

The baseline model is the simplest model we may build. It forms the baseline in terms of RMSE. Whatever we do in the future, we should always try to perform better than the baseline model. In our example the baseline model is just the arithmetic mean of the response variable MEAN_ANNUAL_RAINFALL. This model corresponds to a model without any explanatory variable and is encoded in formula notation by statsmodels response ~ 1.

In [5]:
baseline_model = smf.ols(formula='MEAN_ANNUAL_RAINFALL ~ 1', data=train_set)
results = baseline_model.fit()
results.params
Out[5]:
Intercept    756.907975
dtype: float64

The intercept term corresponds to a value of $\beta = 756.91$. Let us compare this value the arithmetic mean of the response variable MEAN_ANNUAL_RAINFALL.

In [6]:
np.round(np.mean(train_set["MEAN_ANNUAL_RAINFALL"]),2)
Out[6]:
756.91

As expected, the arithmetic mean yields the same number as $β_0$, the intercept of the baseline model.

We calculate the RMSE of the baseline model on the training data set and store the model, as well as the results of our model_rmse(model_results) function it in our RegressionResults object named results.

In [7]:
# calculate rmse on training set
print("RMSE on training set:", str(model_rmse(results)))

# store model object and results of rmse in the `list` object named `model.outcome`
# Note that we build a nested list-object
model_outcome = {}
model_outcome["baseline"] = {
    "model": baseline_model, 
    "results": results, 
    "rmse": pd.DataFrame.from_dict({"name": pd.Index(["baseline model"]), "train_RMSE": model_rmse(results), "test_RMSE": np.nan})
}
RMSE on training set: 243.88215150947042

It makes perfect sense that the baseline model is not extremely useful. After all we did not make any enhancements to predict the mean annual rainfall for any given weather station in Germany. To strengthen our intuition - and as long as we are dealing with low-dimensional regression models - we plot the regression line of the baseline model and the data points in our training set.

In [8]:
_, ax = plt.subplots(figsize=(12, 6))
ax.scatter(train_set.index, train_set["MEAN_ANNUAL_RAINFALL"])
ax.hlines(results.params, xmin=0, xmax=1000, colors="red", label="Intercept")
ax.grid()
ax.set_ylabel("mean annual precipitation in mm")
ax.set_xlabel("index")
plt.legend()
plt.show()

Well, looks as expected, but we can do better!

Recall our general procedure outlined above:

  • Select features and model ⇒ DONE!
  • Build a model for the training set ⇒ Baseline model, DONE!
  • Evaluate model performance on the training set ⇒ RMSE: 243.88, DONE!
  • Evaluate model performance on the test set ⇒ not yet done!

In order to wrap up the baseline model scenario let us calculate the generalization error of the baseline model. We apply predict() method and feed it with the test data set test_set. Then we calculate the RMSE using the diff_rmse(obs, preds) function defined in the section above. Finally, for the sake of completeness we store the RMSE of the baseline model on the test data set in the dictionary object named model_outcome.

In [9]:
X = sm.add_constant(test_set["MEAN_ANNUAL_RAINFALL"])
pred = results.predict(X)

# calculate RMSE for the test data set
test_rmse = diff_rmse(test_set["MEAN_ANNUAL_RAINFALL"], pred)
print("RMSE on test set:", test_rmse)
model_outcome["baseline"]["rmse"]["test_RMSE"] = test_rmse
RMSE on test set: 180.87701106302217

Before we move on let us just take a quick summarizing look at the model performance metrics of the baseline model:

In [10]:
model_outcome["baseline"]["rmse"]
Out[10]:
name train_RMSE test_RMSE
0 baseline model 243.882152 180.877011

The simple linear regression model

In simple linear regression analysis, the relationship between two variables is modeled as

$$y=f(x)=\beta_0+\beta_1x_1.$$

As for the baseline model scenario in the above section we follow the general procedure:

  • Select features and model
  • Build a model for the training set
  • Evaluate model performance on the training set
  • Evaluate model performance on the test set

Immediately the question arises, which feature to pick. A heuristic approach is to check for correlations between the response variable and the explanatory variables. We already have done that in a previous section. The analysis revealed positive correlation of the explanatory variable ALTITUDE with the response variable MEAN_ANNUAL_RAINFALL.

In [11]:
p_cor = np.corrcoef(df["ALTITUDE"], df["MEAN_ANNUAL_RAINFALL"])
print(np.round(p_cor[0,1],2))
0.76

The Pearson correlation coefficient is 0.76. Thus, we expect that by extending the baseline model with the explanatory variable ALTITUDE, we significantly increase the predictive power of the model.

OK, let us build a simple linear regression model to predict mean annual rainfall recorded at weather stations across Germany by the altitude of the weather stations.

In [12]:
simple_alt_model = smf.ols(formula='MEAN_ANNUAL_RAINFALL ~ ALTITUDE', data=train_set)
results = simple_alt_model.fit()
results.summary()
Out[12]:
OLS Regression Results
Dep. Variable: MEAN_ANNUAL_RAINFALL R-squared: 0.599
Model: OLS Adj. R-squared: 0.596
Method: Least Squares F-statistic: 240.1
Date: Sat, 24 Jun 2023 Prob (F-statistic): 9.89e-34
Time: 21:27:26 Log-Likelihood: -1052.4
No. Observations: 163 AIC: 2109.
Df Residuals: 161 BIC: 2115.
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 592.2684 16.133 36.712 0.000 560.409 624.128
ALTITUDE 0.6126 0.040 15.495 0.000 0.535 0.691
Omnibus: 22.111 Durbin-Watson: 2.077
Prob(Omnibus): 0.000 Jarque-Bera (JB): 30.099
Skew: 0.790 Prob(JB): 2.91e-07
Kurtosis: 4.392 Cond. No. 542.


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Well, by looking at the model summary we see that the explanatory variable ALTITUDE is statistically significant. Further, we see a $R^2=0.599$, which states the proportion of variation in the observed values explained by the regression model. The numbers look good, lets plot the data:

In [13]:
fig, ax = plt.subplots(figsize=(10,6))
ax.plot(train_set["ALTITUDE"], train_set["MEAN_ANNUAL_RAINFALL"], "o", label="Data")
abline_plot(model_results=results, ax=ax, color="red", label="Regression line")
ax.grid()
ax.legend(loc="best")
Out[13]:
<matplotlib.legend.Legend at 0x7fe5bb706cd0>

Well, apparently there is some truth in the regression model we built. Let us continue with our procedure and calculate the RMSE on the training data set and then on the test data. Afterwards we compare the results with the baseline model.

In [14]:
pred = results.predict(test_set)

test_rmse = diff_rmse(test_set["MEAN_ANNUAL_RAINFALL"], pred)
print("RMSE on test set:", test_rmse)

model_outcome["simple_alt"] = {
    "model": simple_alt_model, 
    "results": results, 
    "rmse": pd.DataFrame.from_dict({"name": pd.Index(["simple alt model"]), "train_RMSE": model_rmse(results), "test_RMSE": test_rmse})
}
RMSE on test set: 138.85454382261386
In [15]:
rmses = pd.concat([model_outcome["baseline"]["rmse"],model_outcome["simple_alt"]["rmse"]])

Great! We dramatically increased the predictive power of our model, both on the training data set but as well on the test data set.

Let us build one more simple regression model. This time we pick the variable MAX_RAINFALL as our explanatory variable. It seems reasonable to consider the amount of the highest rainfall event at a particular weather station as a proxy for the mean annual rainfall at that weather station.

In [16]:
max_rainfall_model = smf.ols(formula='MEAN_ANNUAL_RAINFALL ~ MAX_RAINFALL', data=train_set)
results = max_rainfall_model.fit()
results.summary()

pred = results.predict(test_set)

test_rmse = diff_rmse(test_set["MEAN_ANNUAL_RAINFALL"], pred)
print("RMSE on test set:", test_rmse)

model_outcome["max_rainfall"] = {
    "model": simple_alt_model, 
    "results": results, 
    "rmse": pd.DataFrame.from_dict({"name": pd.Index(["max rainfall model"]), "train_RMSE": model_rmse(results), "test_RMSE": test_rmse})
}
RMSE on test set: 117.43789683775168

Let us once again compare the results we obtained so far.

In [17]:
rmses = pd.concat([rmses, model_outcome["max_rainfall"]["rmse"]])
rmses
Out[17]:
name train_RMSE test_RMSE
0 baseline model 243.882152 180.877011
0 simple alt model 154.992815 138.854544
0 max rainfall model 119.953630 117.437897

Awesome! Once again we increased the predictive power of our model. Just by looking at the numbers in the table above one question should immediately come to our mind: What if we combine both simple linear regression models and try to predict the response variable MEAN_ANNUAL_RAINFALL by using both explanatory variables? … Welcome to the domain of multiple linear regression!


The multiple linear regression model¶

Recall the multiple regression model:

$$ y_i = \beta_0 + \sum_{j=1}^dx_{ij}\beta_j + \epsilon_i \text{,} \quad i = 1,2,...,n\text{, } x\in \mathbb R^d. $$

For the purpose of multiple linear regression analysis the function may be extended by concatenating additional explanatory variables by using the + sign. In order to encode the following model

$$\text{MEAN_ANNUAL_RAINFALL}=\beta_0+\beta_1\times\text{ALTITUDE}+\beta_2\times\text{MAX_RAINFALL},$$

we encode the formula expression as MEAN_ANNUAL_RAINFALL ~ ALTITUDE + MAX_RAINFALL. Let’s give it a try: First we build the regression model, then we calculate the RMSE on the training data set, then on the test data, and then we compare the models.

In [18]:
multi_alt_rain_model = smf.ols(formula='MEAN_ANNUAL_RAINFALL ~ ALTITUDE + MAX_RAINFALL', data=train_set)
results = multi_alt_rain_model.fit()

pred = results.predict(test_set)

test_rmse = diff_rmse(test_set["MEAN_ANNUAL_RAINFALL"], pred)
print("RMSE on test set:", test_rmse)

model_outcome["multi_alt_rain"] = {
    "model": simple_alt_model, 
    "results": results, 
    "rmse": pd.DataFrame.from_dict({"name": pd.Index(["multi alt rain model"]), "train_RMSE": model_rmse(results), "test_RMSE": test_rmse})
}
RMSE on test set: 113.7463628599613
In [18]:
results.summary()
Out[18]:
OLS Regression Results
Dep. Variable: MEAN_ANNUAL_RAINFALL R-squared: 0.768
Model: OLS Adj. R-squared: 0.766
Method: Least Squares F-statistic: 265.4
Date: Wed, 21 Jun 2023 Prob (F-statistic): 1.50e-51
Time: 16:07:28 Log-Likelihood: -1007.5
No. Observations: 163 AIC: 2021.
Df Residuals: 160 BIC: 2030.
Df Model: 2
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept -198.2681 74.013 -2.679 0.008 -344.437 -52.099
ALTITUDE 0.1324 0.054 2.471 0.015 0.027 0.238
MAX_RAINFALL 24.2740 2.241 10.831 0.000 19.848 28.700
Omnibus: 0.317 Durbin-Watson: 2.085
Prob(Omnibus): 0.853 Jarque-Bera (JB): 0.236
Skew: 0.093 Prob(JB): 0.889
Kurtosis: 2.995 Cond. No. 3.28e+03


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.28e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

The model summary indicates that both explanatory variables are statistically significant. The explained variance by the two explanatory variables accounts for approximately 76.8% of the total variance $(R^2=0.768)$. The numbers look fine, let’s plot the data:

In [19]:
fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, projection = '3d')

ax.scatter3D(
    test_set["ALTITUDE"].values,
    test_set["MAX_RAINFALL"].values, 
    test_set["MEAN_ANNUAL_RAINFALL"].values,
)

ax.set_xlabel("ALTITUDE")
ax.set_ylabel("MAX_RAINFALL")
ax.set_zlabel("MEAN_ANNUAL_RAINFALL")
plt.show()

Let us once again compare the results we obtained so far.

In [20]:
rmses = pd.concat([rmses, model_outcome["multi_alt_rain"]["rmse"]])
rmses
Out[20]:
name train_RMSE test_RMSE
0 baseline model 243.882152 180.877011
0 simple alt model 154.992815 138.854544
0 max rainfall model 119.953630 117.437897
0 multi alt rain model 118.095746 113.746363
In [23]:
rmses.reset_index().to_feather('30221_rmses.feather')

We see that the multiple regression model reduced the RMSE of the training set, but the RMSE on the test set was not reduced. That means that the added flexibly to the model causes the model to learn the training data better than all models developed so far, however, the higher flexibility did not improve the generalization properties of the model. Such a finding raises caution, as we should avoid to overfit the model.

Interim results I¶

Finally, Let us visualize the interim results and check which models are leading our “model selection comparison contest”.

In [19]:
_, ax = plt.subplots(figsize=(10,6))
rmses.plot(kind="bar", x="name", ax=ax)
ax.set_xlabel("")
ax.set_ylabel("RMSE in mm")
plt.show()

Citation

The E-Learning project SOGA-Py was developed at the Department of Earth Sciences by Annette Rudolph, Joachim Krois and Kai Hartmann. You can reach us via mail by soga[at]zedat.fu-berlin.de.

Creative Commons License
You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follow: Rudolph, A., Krois, J., Hartmann, K. (2023): Statistics and Geodata Analysis using Python (SOGA-Py). Department of Earth Sciences, Freie Universitaet Berlin.