Let us recall our research question: We want to **examine the relationship between the response variable MEAN ANNUAL RAINFALL and the other variables in the dwd data set**.

A multiple regression model is written as:

$$ \begin{align} y_i & = \beta_0 + \sum_{j=1}^dx_{ij}\beta_j + \epsilon_i \\ & = \beta_0 + \beta_1x_{1i} + \beta_2x_{2i} +...+ \beta_dx_{di} + \epsilon_i \text{,} \quad i = 1,2,...n\text{, } x\in \mathbb R^d\text{}, \end{align} $$where $β_0$ corresponds to the *intercept*, $β_1,...,β_d$ correspond to the model coefficients, $x_i$ to the observation/measurement data, and $ϵ$ to the *residuals*.

One of the functions to build multiple linear regression models in Python is the `ols()`

function of `statsmodels`

. It can be used with (R-Style) formula notation as input; thus, a linear model is specified as `response ~ predictor(s)`

. In order to increase the number of predictor variables we concatenate them by using the `+`

sign: `response ~ predictor1 + predictor2 + predictor3 + ...`

.

The general procedure we will follow is outlined below:

- Select features and model
- Build a model for the training set
- Evaluate model performance on the training set
- Evaluate model performance on the test set

Consequently, we will showcase different options for **feature selection**, sometimes referred to as **model selection** or **feature engineering**. In order to evaluate a model we define a metric to assess the model performance. In regression modelling, which deals with continuous $d$-dimensional feature spaces ($x_i∈R^{d+1}$) it is quite common to use the mean squared error (MSE) metric or the root mean squared error (RMSE).

The RMSE is the square root of the sum of the squared difference between the observed and predicted values, normalized be the number of observations, $n$.

$$RMSE = \sqrt\frac{\sum_{i=1}^n{(pred_i-obs_i)^2}}{n}$$The lower the RMSE the better the model fits the data. Please note that the difference between the observed and predicted is nothing else but the residual term of a linear model. In `statsmodels`

, for a given `RegressionResult`

object the residuals are easily accessed by the `mse_resid`

method. For convenience, we wrap the equation into our own function, which calculates RMSE:

In [2]:

```
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
from statsmodels.graphics.api import abline_plot
from mpl_toolkits.mplot3d import axes3d
def model_rmse(model_results):
return np.sqrt(model_results.mse_resid)
```

`rmse()`

function from above we implement a `rmse2()`

function, which returns the RMSE as well, but this time the input is not a `RegressionResult`

object, but just two vectors, one of them containing the observations, the other containing the model predictions.

In [3]:

```
import numpy as np
def diff_rmse(obs, preds):
return np.sqrt(np.sum((obs-preds)**2)/len(obs))
```

We start with learning the model parameters based on the training data set. Then we evaluate the goodness of fit by calculating the RMSE. After that we evaluate the model on data, which was never seen before by the model. Therefore we calculate the RMSE of the model applied on the test data set. This out-of-sample measure provides a clue on how well our model generalizes. If the model performs well on the training data set, but performs poorly on the test data set, this is an indication for model overfitting, which is in general a bad thing. That is because we do not want our model to learn, or better call it to memorize, the training data set, but we want to find a model that accounts for the unknown data generation process generating our observations.

The process of feature and model selection is the first step in model development and sometimes the most time consuming and challenging one. This part in the model development pipeline is more arts than pure science. Domain knowledge, experience and a solid background in mathematics and statistics help during this endeavor. In this section we build the **baseline model**. Thereafter we will build increasingly more sophisticated models. We store each of our models and the associated prediction error (RMSE), both for the training set as well as the test set, in a `list`

object for further usage.

In [4]:

```
import matplotlib.pyplot as plt
import pandas as pd
import statsmodels.formula.api as smf
dwd = pd.read_table(
"https://userpage.fu-berlin.de/soga/data/raw-data/DWD.csv",
index_col=0,
sep=',',
)
df = dwd.drop(['STATION_NAME', 'FEDERAL_STATE', 'PERIOD'], axis=1).dropna()
train_set = df.sample(random_state=42, frac=.8)
test_set = df.drop(train_set.index)
```

The baseline model is the simplest model we may build. It forms the baseline in terms of RMSE. Whatever we do in the future, we should always try to perform better than the baseline model. In our example the baseline model is just the arithmetic mean of the response variable `MEAN_ANNUAL_RAINFALL`

. This model corresponds to a model without any explanatory variable and is encoded in formula notation by statsmodels `response ~ 1`

.

In [5]:

```
baseline_model = smf.ols(formula='MEAN_ANNUAL_RAINFALL ~ 1', data=train_set)
results = baseline_model.fit()
results.params
```

Out[5]:

Intercept 756.907975 dtype: float64

`MEAN_ANNUAL_RAINFALL`

.

In [6]:

```
np.round(np.mean(train_set["MEAN_ANNUAL_RAINFALL"]),2)
```

Out[6]:

756.91

As expected, the arithmetic mean yields the same number as $β_0$, the intercept of the baseline model.

We calculate the RMSE of the baseline model on the training data set and store the model, as well as the results of our `model_rmse(model_results)`

function it in our `RegressionResults`

object named `results`

.

In [7]:

```
# calculate rmse on training set
print("RMSE on training set:", str(model_rmse(results)))
# store model object and results of rmse in the `list` object named `model.outcome`
# Note that we build a nested list-object
model_outcome = {}
model_outcome["baseline"] = {
"model": baseline_model,
"results": results,
"rmse": pd.DataFrame.from_dict({"name": pd.Index(["baseline model"]), "train_RMSE": model_rmse(results), "test_RMSE": np.nan})
}
```

RMSE on training set: 243.88215150947042

In [8]:

```
_, ax = plt.subplots(figsize=(12, 6))
ax.scatter(train_set.index, train_set["MEAN_ANNUAL_RAINFALL"])
ax.hlines(results.params, xmin=0, xmax=1000, colors="red", label="Intercept")
ax.grid()
ax.set_ylabel("mean annual precipitation in mm")
ax.set_xlabel("index")
plt.legend()
plt.show()
```

Well, looks as expected, but we can do better!

Recall our general procedure outlined above:

- Select features and model ⇒ DONE!
- Build a model for the training set ⇒ Baseline model, DONE!
- Evaluate model performance on the training set ⇒ RMSE: 243.88, DONE!
- Evaluate model performance on the test set ⇒ not yet done!

In order to wrap up the baseline model scenario let us calculate the generalization error of the baseline model. We apply `predict()`

method and feed it with the test data set `test_set`

. Then we calculate the RMSE using the `diff_rmse(obs, preds)`

function defined in the section above. Finally, for the sake of completeness we store the RMSE of the baseline model on the test data set in the dictionary object named `model_outcome`

.

In [9]:

```
X = sm.add_constant(test_set["MEAN_ANNUAL_RAINFALL"])
pred = results.predict(X)
# calculate RMSE for the test data set
test_rmse = diff_rmse(test_set["MEAN_ANNUAL_RAINFALL"], pred)
print("RMSE on test set:", test_rmse)
model_outcome["baseline"]["rmse"]["test_RMSE"] = test_rmse
```

RMSE on test set: 180.87701106302217

In [10]:

```
model_outcome["baseline"]["rmse"]
```

Out[10]:

name | train_RMSE | test_RMSE | |
---|---|---|---|

0 | baseline model | 243.882152 | 180.877011 |

The simple linear regression model

In simple linear regression analysis, the relationship between two variables is modeled as

$$y=f(x)=\beta_0+\beta_1x_1.$$As for the baseline model scenario in the above section we follow the general procedure:

- Select features and model
- Build a model for the training set
- Evaluate model performance on the training set
- Evaluate model performance on the test set

Immediately the question arises, which feature to pick. A heuristic approach is to check for correlations between the response variable and the explanatory variables. We already have done that in a previous section. The analysis revealed positive correlation of the explanatory variable `ALTITUDE`

with the response variable `MEAN_ANNUAL_RAINFALL`

.

In [11]:

```
p_cor = np.corrcoef(df["ALTITUDE"], df["MEAN_ANNUAL_RAINFALL"])
print(np.round(p_cor[0,1],2))
```

0.76

The Pearson correlation coefficient is 0.76. Thus, we expect that by extending the baseline model with the explanatory variable `ALTITUDE`

, we significantly increase the predictive power of the model.

OK, let us build a simple linear regression model to predict mean annual rainfall recorded at weather stations across Germany by the altitude of the weather stations.

In [12]:

```
simple_alt_model = smf.ols(formula='MEAN_ANNUAL_RAINFALL ~ ALTITUDE', data=train_set)
results = simple_alt_model.fit()
results.summary()
```

Out[12]:

Dep. Variable: | MEAN_ANNUAL_RAINFALL | R-squared: | 0.599 |
---|---|---|---|

Model: | OLS | Adj. R-squared: | 0.596 |

Method: | Least Squares | F-statistic: | 240.1 |

Date: | Sat, 24 Jun 2023 | Prob (F-statistic): | 9.89e-34 |

Time: | 21:27:26 | Log-Likelihood: | -1052.4 |

No. Observations: | 163 | AIC: | 2109. |

Df Residuals: | 161 | BIC: | 2115. |

Df Model: | 1 | ||

Covariance Type: | nonrobust |

coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|

Intercept | 592.2684 | 16.133 | 36.712 | 0.000 | 560.409 | 624.128 |

ALTITUDE | 0.6126 | 0.040 | 15.495 | 0.000 | 0.535 | 0.691 |

Omnibus: | 22.111 | Durbin-Watson: | 2.077 |
---|---|---|---|

Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 30.099 |

Skew: | 0.790 | Prob(JB): | 2.91e-07 |

Kurtosis: | 4.392 | Cond. No. | 542. |

Notes:

[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

`ALTITUDE`

is statistically significant. Further, we see a $R^2=0.599$, which states the proportion of variation in the observed values explained by the regression model. The numbers look good, lets plot the data:

In [13]:

```
fig, ax = plt.subplots(figsize=(10,6))
ax.plot(train_set["ALTITUDE"], train_set["MEAN_ANNUAL_RAINFALL"], "o", label="Data")
abline_plot(model_results=results, ax=ax, color="red", label="Regression line")
ax.grid()
ax.legend(loc="best")
```

Out[13]:

<matplotlib.legend.Legend at 0x7fe5bb706cd0>

In [14]:

```
pred = results.predict(test_set)
test_rmse = diff_rmse(test_set["MEAN_ANNUAL_RAINFALL"], pred)
print("RMSE on test set:", test_rmse)
model_outcome["simple_alt"] = {
"model": simple_alt_model,
"results": results,
"rmse": pd.DataFrame.from_dict({"name": pd.Index(["simple alt model"]), "train_RMSE": model_rmse(results), "test_RMSE": test_rmse})
}
```

RMSE on test set: 138.85454382261386

In [15]:

```
rmses = pd.concat([model_outcome["baseline"]["rmse"],model_outcome["simple_alt"]["rmse"]])
```

Great! We dramatically increased the predictive power of our model, both on the training data set but as well on the test data set.

Let us build one more simple regression model. This time we pick the variable `MAX_RAINFALL`

as our explanatory variable. It seems reasonable to consider the amount of the highest rainfall event at a particular weather station as a proxy for the mean annual rainfall at that weather station.

In [16]:

```
max_rainfall_model = smf.ols(formula='MEAN_ANNUAL_RAINFALL ~ MAX_RAINFALL', data=train_set)
results = max_rainfall_model.fit()
results.summary()
pred = results.predict(test_set)
test_rmse = diff_rmse(test_set["MEAN_ANNUAL_RAINFALL"], pred)
print("RMSE on test set:", test_rmse)
model_outcome["max_rainfall"] = {
"model": simple_alt_model,
"results": results,
"rmse": pd.DataFrame.from_dict({"name": pd.Index(["max rainfall model"]), "train_RMSE": model_rmse(results), "test_RMSE": test_rmse})
}
```

RMSE on test set: 117.43789683775168

Let us once again compare the results we obtained so far.

In [17]:

```
rmses = pd.concat([rmses, model_outcome["max_rainfall"]["rmse"]])
rmses
```

Out[17]:

name | train_RMSE | test_RMSE | |
---|---|---|---|

0 | baseline model | 243.882152 | 180.877011 |

0 | simple alt model | 154.992815 | 138.854544 |

0 | max rainfall model | 119.953630 | 117.437897 |

Awesome! Once again we increased the predictive power of our model. Just by looking at the numbers in the table above one question should immediately come to our mind: What if we combine both simple linear regression models and try to predict the response variable `MEAN_ANNUAL_RAINFALL`

by using both explanatory variables? … **Welcome to the domain of multiple linear regression!**

Recall the multiple regression model:

$$ y_i = \beta_0 + \sum_{j=1}^dx_{ij}\beta_j + \epsilon_i \text{,} \quad i = 1,2,...,n\text{, } x\in \mathbb R^d. $$For the purpose of multiple linear regression analysis the function may be extended by concatenating additional explanatory variables by using the `+`

sign. In order to encode the following model

we encode the formula expression as `MEAN_ANNUAL_RAINFALL ~ ALTITUDE + MAX_RAINFALL`

. Let’s give it a try: First we build the regression model, then we calculate the RMSE on the training data set, then on the test data, and then we compare the models.

In [18]:

```
multi_alt_rain_model = smf.ols(formula='MEAN_ANNUAL_RAINFALL ~ ALTITUDE + MAX_RAINFALL', data=train_set)
results = multi_alt_rain_model.fit()
pred = results.predict(test_set)
test_rmse = diff_rmse(test_set["MEAN_ANNUAL_RAINFALL"], pred)
print("RMSE on test set:", test_rmse)
model_outcome["multi_alt_rain"] = {
"model": simple_alt_model,
"results": results,
"rmse": pd.DataFrame.from_dict({"name": pd.Index(["multi alt rain model"]), "train_RMSE": model_rmse(results), "test_RMSE": test_rmse})
}
```

RMSE on test set: 113.7463628599613

In [18]:

```
results.summary()
```

Out[18]:

Dep. Variable: | MEAN_ANNUAL_RAINFALL | R-squared: | 0.768 |
---|---|---|---|

Model: | OLS | Adj. R-squared: | 0.766 |

Method: | Least Squares | F-statistic: | 265.4 |

Date: | Wed, 21 Jun 2023 | Prob (F-statistic): | 1.50e-51 |

Time: | 16:07:28 | Log-Likelihood: | -1007.5 |

No. Observations: | 163 | AIC: | 2021. |

Df Residuals: | 160 | BIC: | 2030. |

Df Model: | 2 | ||

Covariance Type: | nonrobust |

coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|

Intercept | -198.2681 | 74.013 | -2.679 | 0.008 | -344.437 | -52.099 |

ALTITUDE | 0.1324 | 0.054 | 2.471 | 0.015 | 0.027 | 0.238 |

MAX_RAINFALL | 24.2740 | 2.241 | 10.831 | 0.000 | 19.848 | 28.700 |

Omnibus: | 0.317 | Durbin-Watson: | 2.085 |
---|---|---|---|

Prob(Omnibus): | 0.853 | Jarque-Bera (JB): | 0.236 |

Skew: | 0.093 | Prob(JB): | 0.889 |

Kurtosis: | 2.995 | Cond. No. | 3.28e+03 |

Notes:

[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

[2] The condition number is large, 3.28e+03. This might indicate that there are

strong multicollinearity or other numerical problems.

In [19]:

```
fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, projection = '3d')
ax.scatter3D(
test_set["ALTITUDE"].values,
test_set["MAX_RAINFALL"].values,
test_set["MEAN_ANNUAL_RAINFALL"].values,
)
ax.set_xlabel("ALTITUDE")
ax.set_ylabel("MAX_RAINFALL")
ax.set_zlabel("MEAN_ANNUAL_RAINFALL")
plt.show()
```

Let us once again compare the results we obtained so far.

In [20]:

```
rmses = pd.concat([rmses, model_outcome["multi_alt_rain"]["rmse"]])
rmses
```

Out[20]:

name | train_RMSE | test_RMSE | |
---|---|---|---|

0 | baseline model | 243.882152 | 180.877011 |

0 | simple alt model | 154.992815 | 138.854544 |

0 | max rainfall model | 119.953630 | 117.437897 |

0 | multi alt rain model | 118.095746 | 113.746363 |

In [23]:

```
rmses.reset_index().to_feather('30221_rmses.feather')
```

We see that the multiple regression model reduced the RMSE of the training set, but the RMSE on the test set was not reduced. That means that the added flexibly to the model causes the model to learn the training data better than all models developed so far, however, the higher flexibility did not improve the generalization properties of the model. Such a finding raises caution, as we should avoid to overfit the model.

Finally, Let us visualize the interim results and check which models are leading our “model selection comparison contest”.

In [19]:

```
_, ax = plt.subplots(figsize=(10,6))
rmses.plot(kind="bar", x="name", ax=ax)
ax.set_xlabel("")
ax.set_ylabel("RMSE in mm")
plt.show()
```

**Citation**

The E-Learning project SOGA-Py was developed at the Department of Earth Sciences by Annette Rudolph, Joachim Krois and Kai Hartmann. You can reach us via mail by soga[at]zedat.fu-berlin.de.

You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follow: *Rudolph, A., Krois, J., Hartmann, K. (2023): Statistics and Geodata Analysis
using Python (SOGA-Py). Department of Earth Sciences, Freie Universitaet Berlin.*