Let us recall our research question: we want to examine the relationship between the response variable MEAN_ANNUAL_RAINFALL and the other variables in the dwd data set.
A multiple regression model is written as:
$$ \begin{align} y_i & = \beta_0 + \sum_{j=1}^d x_{ij}\beta_j + \epsilon_i \\ & = \beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + ... + \beta_dx_{id} + \epsilon_i \text{,} \quad i = 1,2,...,n\text{, } x_i\in \mathbb R^d, \end{align} $$where $\beta_0$ corresponds to the intercept, $\beta_1,...,\beta_d$ correspond to the model coefficients, $x_i$ to the observation/measurement data, and $\epsilon_i$ to the error (residual) term.
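For reference, the same model can also be written compactly in matrix notation (a standard reformulation, stated here for orientation only):
$$ \mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}, \quad \mathbf{y}, \boldsymbol{\epsilon} \in \mathbb{R}^{n}, \; \mathbf{X} \in \mathbb{R}^{n \times (d+1)}, \; \boldsymbol{\beta} \in \mathbb{R}^{d+1}, $$
where the first column of $\mathbf{X}$ consists of ones and accounts for the intercept $\beta_0$.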
One of the functions to build multiple linear regression models in Python is the ols() function of statsmodels. It can be used with (R-style) formula notation as input; thus, a linear model is specified as response ~ predictor(s). In order to increase the number of predictor variables we concatenate them using the + sign: response ~ predictor1 + predictor2 + predictor3 + ....
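As a minimal sketch of this notation, using hypothetical column names y, x1 and x2 (not variables from the dwd data set), a model with two predictors could be specified like this:

import pandas as pd
import statsmodels.formula.api as smf

# hypothetical toy data, only to illustrate the formula notation
toy = pd.DataFrame({
    "y": [1.0, 2.0, 3.0, 4.0],
    "x1": [0.5, 1.5, 2.0, 3.5],
    "x2": [2.0, 1.0, 4.0, 3.0],
})

# response ~ predictor1 + predictor2
toy_results = smf.ols(formula="y ~ x1 + x2", data=toy).fit()
print(toy_results.params)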
The general procedure we will follow is outlined below:

1. Build the model and learn the model parameters on the training data set.
2. Evaluate the goodness of fit by calculating the RMSE on the training data set.
3. Apply the model to the test data set and calculate the RMSE to assess how well the model generalizes.
4. Compare the result with the models built so far.
Consequently, we will showcase different options for feature selection, sometimes referred to as model selection or feature engineering. In order to evaluate a model we define a metric to assess the model performance. In regression modelling, which deals with continuous $d$-dimensional feature spaces ($x_i\in \mathbb R^d$), it is quite common to use the mean squared error (MSE) or the root mean squared error (RMSE) as metric.
The RMSE is the square root of the sum of squared differences between the observed and predicted values, normalized by the number of observations, $n$:
$$RMSE = \sqrt{\frac{\sum_{i=1}^n{(pred_i-obs_i)^2}}{n}}$$
The lower the RMSE, the better the model fits the data. Please note that the difference between the observed and predicted values is nothing else but the residual term of a linear model. In statsmodels, for a given RegressionResults object the mean squared error of the residuals is easily accessed via the mse_resid attribute. For convenience, we wrap the equation into our own function, which calculates the RMSE:
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
from statsmodels.graphics.api import abline_plot
from mpl_toolkits.mplot3d import axes3d


def model_rmse(model_results):
    # RMSE derived from the mse_resid attribute of a fitted RegressionResults object
    return np.sqrt(model_results.mse_resid)
In addition to the model_rmse() function from above we implement a diff_rmse() function, which returns the RMSE as well, but this time the input is not a RegressionResults object but two vectors: one containing the observations, the other containing the model predictions.
import numpy as np


def diff_rmse(obs, preds):
    # RMSE computed directly from observations and predictions
    return np.sqrt(np.sum((obs - preds)**2) / len(obs))
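A side note on the two helpers: in statsmodels the mse_resid attribute is the sum of squared residuals divided by the residual degrees of freedom, whereas diff_rmse() divides by the number of observations $n$; for reasonably large samples the difference is negligible. A minimal sketch illustrating the relation, assuming a fitted RegressionResults object named results:

import numpy as np


def compare_rmse_variants(results):
    # model_rmse() relies on mse_resid, i.e. ssr / df_resid
    rmse_dof = np.sqrt(results.ssr / results.df_resid)
    # diff_rmse() applied to the training data divides by n instead
    rmse_n = np.sqrt(results.ssr / results.nobs)
    return rmse_dof, rmse_n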
We start with learning the model parameters based on the training data set. Then we evaluate the goodness of fit by calculating the RMSE. After that we evaluate the model on data which was never seen before by the model, by calculating the RMSE of the model applied to the test data set. This out-of-sample measure provides a clue on how well our model generalizes. If the model performs well on the training data set but performs poorly on the test data set, this is an indication of model overfitting, which is in general a bad thing. That is because we do not want our model to learn, or rather memorize, the training data set; we want to find a model that accounts for the unknown data-generating process underlying our observations.
The process of feature and model selection is the first step in model development and sometimes the most time consuming and challenging one. This part of the model development pipeline is more art than pure science. Domain knowledge, experience and a solid background in mathematics and statistics help during this endeavor. In this section we build the baseline model. Thereafter we will build increasingly more sophisticated models. We store each of our models and the associated prediction error (RMSE), both for the training set as well as for the test set, in a dictionary object for further usage.
import matplotlib.pyplot as plt
import pandas as pd
import statsmodels.formula.api as smf

# load the DWD data set
dwd = pd.read_table(
    "https://userpage.fu-berlin.de/soga/data/raw-data/DWD.csv",
    index_col=0,
    sep=',',
)

# drop non-numeric columns and rows with missing values
df = dwd.drop(['STATION_NAME', 'FEDERAL_STATE', 'PERIOD'], axis=1).dropna()

# split the data into a training set (80 %) and a test set (20 %)
train_set = df.sample(random_state=42, frac=.8)
test_set = df.drop(train_set.index)
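Before we start modelling it is worth taking a quick look at the prepared data; a small sketch (output omitted here):

# quick sanity checks on the prepared data
print(df.columns.tolist())                    # available variables
print(train_set.shape, test_set.shape)        # sizes of the 80/20 split
print(df["MEAN_ANNUAL_RAINFALL"].describe())  # distribution of the response variable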
The baseline model is the simplest model we may build. It forms the baseline in terms of RMSE. Whatever we do in the future, we should always try to perform better than the baseline model. In our example the baseline model is just the arithmetic mean of the response variable MEAN_ANNUAL_RAINFALL. This model corresponds to a model without any explanatory variable and is encoded in statsmodels formula notation as response ~ 1.
baseline_model = smf.ols(formula='MEAN_ANNUAL_RAINFALL ~ 1', data=train_set)
results = baseline_model.fit()
results.params
Intercept    756.907975
dtype: float64
The intercept term corresponds to a value of $\beta_0 = 756.91$. Let us compare this value to the arithmetic mean of the response variable MEAN_ANNUAL_RAINFALL.
np.round(np.mean(train_set["MEAN_ANNUAL_RAINFALL"]),2)
756.91
As expected, the arithmetic mean yields the same number as $β_0$, the intercept of the baseline model.
We calculate the RMSE of the baseline model on the training data set and store the model object, the RegressionResults object named results and the output of our model_rmse(model_results) function in a dictionary named model_outcome.
# calculate the RMSE on the training set
print("RMSE on training set:", model_rmse(results))

# store the model object, the fitted results and the RMSE
# in the dictionary named model_outcome
# Note that we build a nested dictionary
model_outcome = {}
model_outcome["baseline"] = {
    "model": baseline_model,
    "results": results,
    "rmse": pd.DataFrame.from_dict({
        "name": pd.Index(["baseline model"]),
        "train_RMSE": model_rmse(results),
        "test_RMSE": np.nan,
    }),
}
RMSE on training set: 243.88215150947042
It makes perfect sense that the baseline model is not extremely useful. After all, we did not use any explanatory variable to predict the mean annual rainfall for a given weather station in Germany. To strengthen our intuition, and as long as we are dealing with low-dimensional regression models, we plot the regression line of the baseline model together with the data points in our training set.
_, ax = plt.subplots(figsize=(12, 6))
ax.scatter(train_set.index, train_set["MEAN_ANNUAL_RAINFALL"])
ax.hlines(results.params, xmin=0, xmax=1000, colors="red", label="Intercept")
ax.grid()
ax.set_ylabel("mean annual precipitation in mm")
ax.set_xlabel("index")
plt.legend()
plt.show()
Well, this looks as expected, but we can do better!
Recall our general procedure outlined above: after evaluating the model on the training data we assess its generalization error on the test data.
In order to wrap up the baseline model scenario let us calculate the generalization error of the baseline model. We apply the predict() method to the test data set test_set. Then we calculate the RMSE using the diff_rmse(obs, preds) function defined in the section above. Finally, for the sake of completeness, we store the RMSE of the baseline model on the test data set in the dictionary object named model_outcome.
# predict the response for the test data set
# (the intercept-only model predicts the fitted mean for every station)
pred = results.predict(test_set)

# calculate the RMSE for the test data set
test_rmse = diff_rmse(test_set["MEAN_ANNUAL_RAINFALL"], pred)
print("RMSE on test set:", test_rmse)

model_outcome["baseline"]["rmse"]["test_RMSE"] = test_rmse
RMSE on test set: 180.87701106302217
Before we move on let us just take a quick summarizing look at the model performance metrics of the baseline model:
model_outcome["baseline"]["rmse"]
|   | name           | train_RMSE | test_RMSE  |
|---|----------------|------------|------------|
| 0 | baseline model | 243.882152 | 180.877011 |
The simple linear regression model
In simple linear regression analysis, the relationship between two variables is modeled as
$$y=f(x)=\beta_0+\beta_1x.$$
As for the baseline model scenario in the section above we follow the general procedure.
Immediately the question arises which feature to pick. A heuristic approach is to check for correlations between the response variable and the explanatory variables. We have already done that in a previous section. The analysis revealed a positive correlation of the explanatory variable ALTITUDE with the response variable MEAN_ANNUAL_RAINFALL.
# Pearson correlation between ALTITUDE and MEAN_ANNUAL_RAINFALL
p_cor = np.corrcoef(df["ALTITUDE"], df["MEAN_ANNUAL_RAINFALL"])
print(np.round(p_cor[0, 1], 2))
0.76
The Pearson correlation coefficient is 0.76. Thus, we expect that by extending the baseline model with the explanatory variable ALTITUDE we can significantly increase the predictive power of the model.
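Instead of checking each candidate variable one by one, we may also rank all explanatory variables at once by their correlation with the response. A minimal sketch of this heuristic (computed on the training data; the exact values depend on the train/test split):

# Pearson correlation of every column with the response variable,
# sorted by absolute value (the response itself appears first with r = 1)
cor_with_response = (
    train_set.corr()["MEAN_ANNUAL_RAINFALL"]
    .sort_values(key=np.abs, ascending=False)
)
print(cor_with_response)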
OK, let us build a simple linear regression model to predict the mean annual rainfall recorded at weather stations across Germany from the altitude of the weather stations.
simple_alt_model = smf.ols(formula='MEAN_ANNUAL_RAINFALL ~ ALTITUDE', data=train_set)
results = simple_alt_model.fit()
results.summary()
|                   |                      |                     |          |
|-------------------|----------------------|---------------------|----------|
| Dep. Variable:    | MEAN_ANNUAL_RAINFALL | R-squared:          | 0.599    |
| Model:            | OLS                  | Adj. R-squared:     | 0.596    |
| Method:           | Least Squares        | F-statistic:        | 240.1    |
| Date:             | Sat, 24 Jun 2023     | Prob (F-statistic): | 9.89e-34 |
| Time:             | 21:27:26             | Log-Likelihood:     | -1052.4  |
| No. Observations: | 163                  | AIC:                | 2109.    |
| Df Residuals:     | 161                  | BIC:                | 2115.    |
| Df Model:         | 1                    |                     |          |
| Covariance Type:  | nonrobust            |                     |          |

|           | coef     | std err | t      | P>\|t\| | [0.025  | 0.975]  |
|-----------|----------|---------|--------|---------|---------|---------|
| Intercept | 592.2684 | 16.133  | 36.712 | 0.000   | 560.409 | 624.128 |
| ALTITUDE  | 0.6126   | 0.040   | 15.495 | 0.000   | 0.535   | 0.691   |

|                |        |                   |          |
|----------------|--------|-------------------|----------|
| Omnibus:       | 22.111 | Durbin-Watson:    | 2.077    |
| Prob(Omnibus): | 0.000  | Jarque-Bera (JB): | 30.099   |
| Skew:          | 0.790  | Prob(JB):         | 2.91e-07 |
| Kurtosis:      | 4.392  | Cond. No.         | 542.     |
Well, by looking at the model summary we see that the explanatory variable ALTITUDE is statistically significant. Further, we see an $R^2=0.599$, which gives the proportion of variation in the observed values that is explained by the regression model. The numbers look good, so let's plot the data:
fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(train_set["ALTITUDE"], train_set["MEAN_ANNUAL_RAINFALL"], "o", label="Data")
# add the fitted regression line to the scatter plot
abline_plot(model_results=results, ax=ax, color="red", label="Regression line")
ax.grid()
ax.legend(loc="best")
plt.show()
Well, apparently there is some truth in the regression model we built. Let us continue with our procedure and calculate the RMSE on the training data set and then on the test data. Afterwards we compare the results with the baseline model.
# predict on the test data set and calculate the RMSE
pred = results.predict(test_set)
test_rmse = diff_rmse(test_set["MEAN_ANNUAL_RAINFALL"], pred)
print("RMSE on test set:", test_rmse)

model_outcome["simple_alt"] = {
    "model": simple_alt_model,
    "results": results,
    "rmse": pd.DataFrame.from_dict({
        "name": pd.Index(["simple alt model"]),
        "train_RMSE": model_rmse(results),
        "test_RMSE": test_rmse,
    }),
}
RMSE on test set: 138.85454382261386
rmses = pd.concat([model_outcome["baseline"]["rmse"], model_outcome["simple_alt"]["rmse"]])
Great! We dramatically increased the predictive power of our model, both on the training data set and on the test data set.
Let us build one more simple regression model. This time we pick the variable MAX_RAINFALL
as our explanatory variable. It seems reasonable to consider the amount of the highest rainfall event at a particular weather station as a proxy for the mean annual rainfall at that weather station.
max_rainfall_model = smf.ols(formula='MEAN_ANNUAL_RAINFALL ~ MAX_RAINFALL', data=train_set)
results = max_rainfall_model.fit()
results.summary()

# predict on the test data set and calculate the RMSE
pred = results.predict(test_set)
test_rmse = diff_rmse(test_set["MEAN_ANNUAL_RAINFALL"], pred)
print("RMSE on test set:", test_rmse)

model_outcome["max_rainfall"] = {
    "model": max_rainfall_model,
    "results": results,
    "rmse": pd.DataFrame.from_dict({
        "name": pd.Index(["max rainfall model"]),
        "train_RMSE": model_rmse(results),
        "test_RMSE": test_rmse,
    }),
}
RMSE on test set: 117.43789683775168
Let us once again compare the results we obtained so far.
rmses = pd.concat([rmses, model_outcome["max_rainfall"]["rmse"]])
rmses
|   | name               | train_RMSE | test_RMSE  |
|---|--------------------|------------|------------|
| 0 | baseline model     | 243.882152 | 180.877011 |
| 0 | simple alt model   | 154.992815 | 138.854544 |
| 0 | max rainfall model | 119.953630 | 117.437897 |
Awesome! Once again we increased the predictive power of our model. Just by looking at the numbers in the table above, one question should immediately come to mind: what if we combine both simple linear regression models and try to predict the response variable MEAN_ANNUAL_RAINFALL by using both explanatory variables? … Welcome to the domain of multiple linear regression!
Recall the multiple regression model:
$$ y_i = \beta_0 + \sum_{j=1}^d x_{ij}\beta_j + \epsilon_i \text{,} \quad i = 1,2,...,n\text{, } x_i\in \mathbb R^d. $$For the purpose of multiple linear regression analysis the formula may be extended by concatenating additional explanatory variables using the + sign. In order to encode the model
$$\text{MEAN\_ANNUAL\_RAINFALL} = \beta_0 + \beta_1 \cdot \text{ALTITUDE} + \beta_2 \cdot \text{MAX\_RAINFALL} + \epsilon$$
we write the formula expression as MEAN_ANNUAL_RAINFALL ~ ALTITUDE + MAX_RAINFALL. Let's give it a try: first we build the regression model, then we calculate the RMSE on the training data set, then on the test data set, and then we compare the models.
multi_alt_rain_model = smf.ols(formula='MEAN_ANNUAL_RAINFALL ~ ALTITUDE + MAX_RAINFALL', data=train_set)
results = multi_alt_rain_model.fit()

# predict on the test data set and calculate the RMSE
pred = results.predict(test_set)
test_rmse = diff_rmse(test_set["MEAN_ANNUAL_RAINFALL"], pred)
print("RMSE on test set:", test_rmse)

model_outcome["multi_alt_rain"] = {
    "model": multi_alt_rain_model,
    "results": results,
    "rmse": pd.DataFrame.from_dict({
        "name": pd.Index(["multi alt rain model"]),
        "train_RMSE": model_rmse(results),
        "test_RMSE": test_rmse,
    }),
}
RMSE on test set: 113.7463628599613
results.summary()
|                   |                      |                     |          |
|-------------------|----------------------|---------------------|----------|
| Dep. Variable:    | MEAN_ANNUAL_RAINFALL | R-squared:          | 0.768    |
| Model:            | OLS                  | Adj. R-squared:     | 0.766    |
| Method:           | Least Squares        | F-statistic:        | 265.4    |
| Date:             | Wed, 21 Jun 2023     | Prob (F-statistic): | 1.50e-51 |
| Time:             | 16:07:28             | Log-Likelihood:     | -1007.5  |
| No. Observations: | 163                  | AIC:                | 2021.    |
| Df Residuals:     | 160                  | BIC:                | 2030.    |
| Df Model:         | 2                    |                     |          |
| Covariance Type:  | nonrobust            |                     |          |

|              | coef      | std err | t      | P>\|t\| | [0.025   | 0.975]  |
|--------------|-----------|---------|--------|---------|----------|---------|
| Intercept    | -198.2681 | 74.013  | -2.679 | 0.008   | -344.437 | -52.099 |
| ALTITUDE     | 0.1324    | 0.054   | 2.471  | 0.015   | 0.027    | 0.238   |
| MAX_RAINFALL | 24.2740   | 2.241   | 10.831 | 0.000   | 19.848   | 28.700  |

|                |       |                   |          |
|----------------|-------|-------------------|----------|
| Omnibus:       | 0.317 | Durbin-Watson:    | 2.085    |
| Prob(Omnibus): | 0.853 | Jarque-Bera (JB): | 0.236    |
| Skew:          | 0.093 | Prob(JB):         | 0.889    |
| Kurtosis:      | 2.995 | Cond. No.         | 3.28e+03 |
The model summary indicates that both explanatory variables are statistically significant. Together, the two explanatory variables account for approximately 76.8% of the total variance ($R^2=0.768$). The numbers look fine, so let's plot the data:
fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, projection = '3d')
ax.scatter3D(
test_set["ALTITUDE"].values,
test_set["MAX_RAINFALL"].values,
test_set["MEAN_ANNUAL_RAINFALL"].values,
)
ax.set_xlabel("ALTITUDE")
ax.set_ylabel("MAX_RAINFALL")
ax.set_zlabel("MEAN_ANNUAL_RAINFALL")
plt.show()
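To relate the point cloud to the fitted model, we may additionally draw the regression plane implied by the two coefficients. A minimal sketch, reusing the results object of the multiple regression model fitted above:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, projection="3d")
ax.scatter3D(
    test_set["ALTITUDE"].values,
    test_set["MAX_RAINFALL"].values,
    test_set["MEAN_ANNUAL_RAINFALL"].values,
)

# evaluate the fitted plane on a regular grid spanning the test data
alt_grid, rain_grid = np.meshgrid(
    np.linspace(test_set["ALTITUDE"].min(), test_set["ALTITUDE"].max(), 20),
    np.linspace(test_set["MAX_RAINFALL"].min(), test_set["MAX_RAINFALL"].max(), 20),
)
grid = pd.DataFrame({"ALTITUDE": alt_grid.ravel(), "MAX_RAINFALL": rain_grid.ravel()})
z_grid = results.predict(grid).values.reshape(alt_grid.shape)

ax.plot_surface(alt_grid, rain_grid, z_grid, alpha=0.3, color="red")
ax.set_xlabel("ALTITUDE")
ax.set_ylabel("MAX_RAINFALL")
ax.set_zlabel("MEAN_ANNUAL_RAINFALL")
plt.show()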
Let us once again compare the results we obtained so far.
rmses = pd.concat([rmses, model_outcome["multi_alt_rain"]["rmse"]])
rmses
|   | name                 | train_RMSE | test_RMSE  |
|---|----------------------|------------|------------|
| 0 | baseline model       | 243.882152 | 180.877011 |
| 0 | simple alt model     | 154.992815 | 138.854544 |
| 0 | max rainfall model   | 119.953630 | 117.437897 |
| 0 | multi alt rain model | 118.095746 | 113.746363 |
# store the comparison table for later use
rmses.reset_index().to_feather('30221_rmses.feather')
We see that the multiple regression model reduced the RMSE on the training set, whereas the RMSE on the test set improved only marginally. That means that the added flexibility causes the model to fit the training data better than all models developed so far; however, the higher flexibility did not substantially improve the generalization properties of the model. Such a finding calls for caution, as we should avoid overfitting the model.
Finally, let us visualize the interim results and check which models are leading our “model selection comparison contest”.
_, ax = plt.subplots(figsize=(10,6))
rmses.plot(kind="bar", x="name", ax=ax)
ax.set_xlabel("")
ax.set_ylabel("RMSE in mm")
plt.show()
Citation
The E-Learning project SOGA-Py was developed at the Department of Earth Sciences by Annette Rudolph, Joachim Krois and Kai Hartmann. You can reach us via mail by soga[at]zedat.fu-berlin.de.
Please cite as follows: Rudolph, A., Krois, J., Hartmann, K. (2023): Statistics and Geodata Analysis using Python (SOGA-Py). Department of Earth Sciences, Freie Universitaet Berlin.