Both regularization methods, ridge regression and the LASSO, are implemented in the package scikit-learn. These methods fit a generalized linear model via penalized maximum likelihood. The degree of regularization is controlled by the regularization parameter $\lambda$, which scikit-learn exposes as the alpha argument. It is therefore useful to evaluate the regression function for a sequence of $\lambda$ values; we use a logarithmic sequence of 100 values.
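Such a grid can be generated with NumPy's logspace; a minimal sketch mirroring the bounds used below:

```python
import numpy as np

# 100 lambda values, equally spaced on a log10 scale between 1e-1 and 1e5
lambdas = np.logspace(-1, 5, 100)

print(len(lambdas))             # number of candidate values
print(lambdas[0], lambdas[-1])  # endpoints of the grid
```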

Let us load and prepare the data first. We assign the response vector and the model matrix of the training set and the test set to the variables y_train, y_test, X_train, and X_test.

In [1]:
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import folium

dwd = pd.read_table(
    "https://userpage.fu-berlin.de/soga/data/raw-data/DWD.csv", 
    index_col=0,
    sep=',',
)
In [2]:
df = dwd.drop([
    'DWD_ID', 
    'RECORD_LENGTH', 
    'STATION_NAME', 
    'FEDERAL_STATE', 
    'PERIOD', 
    # 'LON', 
    # 'LAT',
], axis=1).dropna()

X = df.drop('MEAN_ANNUAL_RAINFALL', axis=1) 
y = df['MEAN_ANNUAL_RAINFALL']
scaler = StandardScaler()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 
X_train = scaler.fit_transform(X_train) 
X_test = scaler.transform(X_test)

rmses = pd.read_feather('https://userpage.fu-berlin.de/soga/data/py-data/30222_rmses.feather').set_index('index')
rmses
Out[2]:
name train_RMSE test_RMSE
index
0 baseline model 243.882152 180.877011
0 simple alt model 154.992815 138.854544
0 max rainfall model 119.953630 117.437897
0 multi alt rain model 118.095746 113.746363
4 forward model 87.392377 117.884639
5 backward model 87.236023 118.367974
6 mlxtend SFS model 91.976905 91.232370

Ridge regression in Python

We calculate ridge regression coefficients for $\alpha \in [10^{-1}, 10^{5}]$ and plot the results.

In [5]:
n_alphas = 100
alphas = np.logspace(-1, 5, n_alphas)

coefs = []
for a in alphas:
    ridge = linear_model.Ridge(alpha=a)
    ridge.fit(X_train, y_train)
    coefs.append(ridge.coef_)

ax = plt.gca()

ax.plot(alphas, coefs)
plt.xlabel(r"$\lambda$")
ax.set_xscale("log")
plt.ylabel("Coefficients")
plt.title("Ridge coefficients as a function of the regularization")
plt.axis("tight")
plt.grid()
plt.show()

Let us have a look at how the ridge coefficients differ from those of an ordinary least-squares fit. To this end we fit a simple linear regression (reg_coeff), compute the mean squared difference between the two coefficient vectors for each $\alpha$, and plot it.

In [6]:
reg_coeff = linear_model.LinearRegression().fit(X_train, y_train)

errors = []
for a in alphas:
    ridge = linear_model.Ridge(alpha=a)
    ridge.fit(X_train, y_train)
    errors.append(mean_squared_error(ridge.coef_, reg_coeff.coef_))

ax = plt.gca()
ax.plot(alphas, errors)
ax.set_xscale("log")
plt.xlabel(r"$\lambda$")
plt.ylabel("error")
plt.title("Coefficient deviation as a function of the regularization")
plt.axis("tight")
plt.grid()
plt.show()

The plot shows the ridge coefficients for the sequence of $\alpha$ values. For small values of $\alpha$ the coefficients are close to the least-squares solution. As $\alpha$ increases, the penalty grows stronger and the coefficients shrink toward $0$.
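The shrinkage behavior can be verified numerically; below is a minimal sketch on synthetic data from make_regression (not the DWD data):

```python
# Sketch (synthetic data): the L2 penalty shrinks ridge coefficients toward
# zero as alpha grows; for alpha -> 0 they approach the least-squares fit.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

ols = LinearRegression().fit(X, y)
small = Ridge(alpha=1e-4).fit(X, y)
large = Ridge(alpha=1e4).fit(X, y)

# weak penalty: essentially the least-squares solution
print(np.allclose(small.coef_, ols.coef_, atol=0.1))
# strong penalty: coefficient norm is much smaller
print(np.linalg.norm(large.coef_) < np.linalg.norm(ols.coef_))
```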

In the next step we want to find the model configuration, and thus the value of $\alpha$, that minimizes the residuals (the differences between observations and predictions). The scikit-learn package ships with a built-in cross-validation function: the class RidgeCV() performs an efficient leave-one-out cross-validation on the data set and stores the MSE values in .cv_values_ if the argument store_cv_values is set. (In newer scikit-learn releases this argument is renamed store_cv_results and the attribute .cv_results_.) We evaluate the model performance by the mean squared error (MSE) and visualize the relation between MSE and $\alpha$.

In [18]:
reg = linear_model.RidgeCV(alphas=alphas, store_cv_values=True)
reg = reg.fit(X_train, y_train)

ax = plt.gca()
ax.plot(alphas, np.mean(reg.cv_values_, axis=0))
ax.set_xscale("log")
plt.xlabel(r"$\lambda$")
plt.ylabel("mean-squared error")
plt.title("MSE as a function of the regularization")
plt.axis("tight")
plt.grid()
plt.show()

The plot shows that, beyond the optimum, the MSE increases with $\alpha$. The value of $\alpha$ that gives the minimum mean cross-validated error can be accessed via the object attribute .alpha_.
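This selection rule can be reproduced by hand; a sketch on synthetic data, assuming the default efficient leave-one-out scheme of RidgeCV:

```python
# Sketch (synthetic data): the alpha picked by RidgeCV coincides with the
# grid value whose mean leave-one-out MSE is smallest.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, RidgeCV
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = make_regression(n_samples=60, n_features=4, noise=20.0, random_state=1)
alphas = np.logspace(-2, 3, 20)

# cv=None (the default) uses an efficient leave-one-out scheme
reg = RidgeCV(alphas=alphas).fit(X, y)

# mean leave-one-out MSE for every candidate alpha, computed explicitly
loo_mse = [
    -cross_val_score(Ridge(alpha=a), X, y, cv=LeaveOneOut(),
                     scoring="neg_mean_squared_error").mean()
    for a in alphas
]
print(reg.alpha_, alphas[np.argmin(loo_mse)])
```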

In [19]:
print("Best score:", reg.best_score_)
print("Estimated regularization parameter alpha:", reg.alpha_)
print("Training score:", reg.score(X_train, y_train))
print("Test score:", reg.score(X_test, y_test))
Best score: -6440.036134991497
Estimated regularization parameter alpha: 1.6297508346206444
Training score: 0.9081942915442142
Test score: 0.8625788494567455

Finally, we use the best $\alpha$ to predict the response for the training set and the test set, respectively. Once again we calculate the RMSE for both data sets and append the model together with the corresponding RMSEs to our results table rmses.
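As a reminder, the RMSE is simply the square root of the MSE. Since the squared=False flag of mean_squared_error is deprecated in recent scikit-learn releases, np.sqrt offers a version-proof alternative; a tiny sketch with made-up rainfall values:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# hypothetical observed and predicted annual rainfall (mm)
y_obs = np.array([514.0, 749.0, 810.0, 659.0])
y_hat = np.array([530.0, 700.0, 820.0, 640.0])

rmse_manual = np.sqrt(np.mean((y_obs - y_hat) ** 2))
rmse_sklearn = np.sqrt(mean_squared_error(y_obs, y_hat))

print(np.isclose(rmse_manual, rmse_sklearn))  # both give the same value
```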

In [23]:
# prediction for the training set
m_ridge_cv_pred_train = reg.predict(X_train)

# calculate RMSE on the training set
rmse_train = mean_squared_error(y_train, m_ridge_cv_pred_train, squared=False)
print("RMSE on training set:", rmse_train)
RMSE on training set: 71.84011342174531
In [24]:
# prediction for the test set
m_ridge_cv_pred_test = reg.predict(X_test)

# calculate RMSE on the test set
rmse_test = mean_squared_error(y_test, m_ridge_cv_pred_test, squared=False)
print("RMSE on test set:", rmse_test)
RMSE on test set: 77.40882425619203
In [25]:
rmses.loc[len(rmses)] = ['Ridge regression', rmse_train, rmse_test]
rmses
Out[25]:
name train_RMSE test_RMSE
index
0 baseline model 243.882152 180.877011
0 simple alt model 154.992815 138.854544
0 max rainfall model 119.953630 117.437897
0 multi alt rain model 118.095746 113.746363
4 forward model 87.392377 117.884639
5 backward model 87.236023 118.367974
6 mlxtend SFS model 91.976905 91.232370
7 Ridge regression 71.840113 77.408824

LASSO regression in Python

The mechanics for the LASSO regression in Python are very similar to those for the ridge regression. Let us build the model and plot it.

In [26]:
n_alphas = 100
alphas = np.logspace(-2, 5, n_alphas)

coefs = []
for a in alphas:
    lasso = linear_model.Lasso(alpha=a)
    lasso.fit(X_train, y_train)
    coefs.append(lasso.coef_)

ax = plt.gca()

ax.plot(alphas, coefs)
plt.xlabel(r"$\lambda$")
ax.set_xscale("log")
plt.ylabel("Coefficients")
plt.title("LASSO coefficients as a function of the regularization")
plt.axis("tight")
plt.grid()
plt.show()

The plot shows the model coefficients for the sequence of 100 values of $\alpha$. As $\alpha$ increases, the coefficients are penalized more strongly and eventually become exactly $0$. This plot nicely visualizes variable selection via the LASSO method.
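The variable-selection property is easy to demonstrate on synthetic data (a sketch, not the DWD data): as $\alpha$ grows, more and more coefficients are set exactly to zero:

```python
# Sketch (synthetic data): unlike the L2 penalty, the L1 penalty sets
# coefficients exactly to zero, so the LASSO performs variable selection.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# only 3 of the 10 features carry signal
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

for a in [0.1, 10.0, 1000.0]:
    lasso = Lasso(alpha=a).fit(X, y)
    print(f"alpha={a}: non-zero coefficients:", int(np.sum(lasso.coef_ != 0)))
```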

In the next step we need to find the model configuration, and thus the value of $\alpha$, that minimizes the residuals (the differences between observations and predictions). The scikit-learn package ships with a built-in cross-validation class for this purpose: LassoCV performs k-fold cross-validation on the data set (we use 10 folds) and evaluates the model performance by the mean squared error (MSE). Plotting the mean cross-validated error stored in .mse_path_ against the candidate values gives a nice graphical representation of the relation between MSE and $\lambda$.

In [27]:
reg = linear_model.LassoCV(alphas=alphas, cv=10)
reg = reg.fit(X_train, y_train)

_, ax = plt.subplots(figsize=(10, 4))
ax.plot(reg.alphas_, reg.mse_path_.mean(axis=1))
ax.set_xscale("log")
plt.xlabel(r"$\lambda$")
plt.ylabel("mean-squared error")
plt.title("MSE as a function of the regularization")
plt.axis("tight")
plt.grid()
plt.show()

The plot shows how the cross-validated MSE varies with $\lambda$. Sometimes it is advisable to choose the simpler model, and thus the model with fewer parameters, even if its cross-validated error is slightly above the minimum. The value of $\lambda$ that gives the minimum mean cross-validated error is available via the object attribute .alpha_.
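How LassoCV arrives at its choice can be illustrated on synthetic data; a sketch assuming only documented attributes (.mse_path_, .alphas_, .alpha_):

```python
# Sketch (synthetic data): LassoCV stores the per-fold CV error for every
# candidate alpha in .mse_path_ (shape: n_alphas x n_folds); .alpha_ is the
# grid value whose mean CV error is smallest.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)
alphas = np.logspace(-2, 2, 50)

reg = LassoCV(alphas=alphas, cv=10).fit(X, y)

mean_cv_mse = reg.mse_path_.mean(axis=1)
# note: .alphas_ is sorted in decreasing order, unlike the input grid
print(reg.alpha_ == reg.alphas_[np.argmin(mean_cv_mse)])
```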

In [28]:
print("Estimated regularization parameter alpha:", reg.alpha_)
print("Training score:", reg.score(X_train, y_train))
print("Test score:", reg.score(X_test, y_test))
Estimated regularization parameter alpha: 1.1233240329780276
Training score: 0.9064845447705463
Test score: 0.8748521084827312
In [29]:
reg.score(X_test, y_test)
Out[29]:
0.8748521084827312
In [30]:
# prediction for the training set
lasso_pred_train = reg.predict(X_train)

# calculate RMSE on the training set
rmse_train = mean_squared_error(y_train, lasso_pred_train, squared=False)
print("RMSE on training set:", rmse_train)
RMSE on training set: 72.50598591177226
In [31]:
# prediction for the test set
lasso_pred_test = reg.predict(X_test)

# calculate RMSE on the test set
rmse_test = mean_squared_error(y_test, lasso_pred_test, squared=False)
print("RMSE on test set:", rmse_test)
RMSE on test set: 73.87125026527004
In [32]:
rmses.loc[len(rmses)] = ['Lasso regression', rmse_train, rmse_test]
rmses
Out[32]:
name train_RMSE test_RMSE
index
0 baseline model 243.882152 180.877011
0 simple alt model 154.992815 138.854544
0 max rainfall model 119.953630 117.437897
0 multi alt rain model 118.095746 113.746363
4 forward model 87.392377 117.884639
5 backward model 87.236023 118.367974
6 mlxtend SFS model 91.976905 91.232370
7 Ridge regression 71.840113 77.408824
8 Lasso regression 72.505986 73.871250

Final results

Let us visualize the final results and check which models are leading our "model selection comparison contest".

In [33]:
_, ax = plt.subplots(figsize=(10,6))
rmses.plot(kind="bar", x="name", ax=ax)
ax.set_xlabel("")
ax.set_ylabel("RMSE in mm")
plt.show()

The plot shows the RMSE of each model on the training data set and on the test data set. The best models with respect to RMSE on the test data set are those that used a stepwise parameter-selection procedure or a regularization term. For the best-performing models the RMSE on the test data set is comparably low.

Visualization of the final results

At the end we draw an interactive map of the weather stations in our data set, coloring and scaling them according to their prediction error. The benefit of using the RMSE is that it is on the same scale as the data, so it can be interpreted as the deviation of the predicted from the observed rainfall in mm.

Feel free to interact with the map: you may zoom in and out, and you may change the underlying basemap. Further, you can hover over the data points, and by clicking you get additional information about a data point. Let us look at the predictions of the ridge regression model.

In [43]:
train_set = dwd.loc[y_train.index]
train_set
Out[43]:
DWD_ID STATION_NAME FEDERAL_STATE LAT LON ALTITUDE PERIOD RECORD_LENGTH MEAN_ANNUAL_AIR_TEMP MEAN_MONTHLY_MAX_TEMP ... MEAN_ANNUAL_WIND_SPEED MEAN_CLOUD_COVER MEAN_ANNUAL_SUNSHINE MEAN_ANNUAL_RAINFALL MAX_MONTHLY_WIND_SPEED MAX_AIR_TEMP MAX_WIND_SPEED MAX_RAINFALL MIN_AIR_TEMP MEAN_RANGE_AIR_TEMP
ID
950 5825 Berge Brandenburg 52.6198 12.7867 40.0 1953-2016 63 9.1 13.8 ... 2.0 64.0 1677.0 514.0 3.0 34.5 25.6 31.0 -13.8 8.4
431 2503 Kaltennordheim Thüringen 50.6266 10.1455 487.0 1951-2003 52 6.8 10.6 ... 2.0 72.0 1427.0 749.0 3.0 29.3 27.5 35.0 -17.9 7.7
170 891 Cuxhaven Niedersachsen 53.8713 8.7058 5.0 1922-2016 94 9.1 12.0 ... 3.0 68.0 1667.0 810.0 4.0 30.0 32.7 33.0 -9.7 5.3
112 599 Bonn-Friesdorf Nordrhein-Westfalen 50.7055 7.1467 62.0 1931-1999 68 9.9 14.1 ... 2.0 71.0 1437.0 659.0 3.0 32.1 24.0 36.0 -11.9 8.1
440 2559 Kempten Bayern 47.7233 10.3348 705.0 1951-2016 65 7.2 12.5 ... 2.0 66.0 1751.0 1238.0 2.0 31.0 24.2 52.0 -20.1 9.8
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
494 2907 Leck Schleswig-Holstein 54.7903 8.9514 7.0 1974-2016 42 8.4 11.8 ... 3.0 66.0 1503.0 778.0 4.0 29.4 30.3 32.0 -13.6 7.2
74 403 Berlin-Dahlem (FU) Berlin 52.4537 13.3017 51.0 1719-2016 297 9.0 13.3 ... 2.0 65.0 1700.0 592.0 3.0 33.4 27.7 35.0 -13.9 8.1
429 2497 Kall-Sistig Nordrhein-Westfalen 50.5014 6.5264 505.0 1947-2016 69 7.6 11.6 ... 2.0 70.0 1530.0 812.0 3.0 30.2 27.0 38.0 -13.9 7.5
838 5078 Travemünde Schleswig-Holstein 53.9619 10.8892 2.0 1936-2001 65 8.5 11.8 ... 2.0 65.0 1626.0 629.0 3.0 30.3 31.1 32.0 -12.6 6.4
458 2667 Köln-Bonn Nordrhein-Westfalen 50.8646 7.1575 92.0 1957-2016 59 10.1 14.5 ... 2.0 68.0 1515.0 797.0 3.0 33.4 26.3 36.0 -13.3 9.0

163 rows × 21 columns

In [75]:
# extract response vector
y = dwd['MEAN_ANNUAL_RAINFALL']
train_set = dwd.loc[y_train.index]
test_set = dwd.loc[y_test.index]
# per-station absolute prediction error in mm
err_train = np.abs(m_ridge_cv_pred_train - y.loc[y_train.index])
err_test = np.abs(m_ridge_cv_pred_test - y.loc[y_test.index])
loc_center = [df['LAT'].mean(), df['LON'].mean()]
germany_map = folium.Map(location=loc_center, tiles='OpenStreetMap', zoom_start=5, control_scale=True)

for index, loc in train_set.iterrows():
    folium.CircleMarker(
        [loc['LAT'], loc['LON']],
        radius=np.log1p(err_train.loc[index]),  # log-scaled marker radius
        color='red',
        opacity=.7,
        fill=True,
        popup=f"training data, error: {err_train.loc[index]:.1f} mm",
        tooltip='training data',
    ).add_to(germany_map)

for index, loc in test_set.iterrows():
    folium.CircleMarker(
        [loc['LAT'], loc['LON']],
        radius=np.log1p(err_test.loc[index]),  # log-scaled marker radius
        color='blue',
        opacity=.7,
        fill=True,
        popup=f"test data, error: {err_test.loc[index]:.1f} mm",
        tooltip='test data',
    ).add_to(germany_map)
folium.LayerControl().add_to(germany_map)
germany_map
germany_map
Out[75]:
(interactive folium map of the weather stations; rendered only in a trusted notebook)

The map shows the prediction error for each data point (weather station). The spatial distribution of the errors shows that the prediction of mean annual rainfall is quite good in the North German Lowlands. In the area near Hamburg and Cuxhaven the model predictions are slightly worse. In contrast, the predictions of annual mean rainfall for the Central German Uplands, the Alpine Foreland and the Alps are much worse. Maybe the input data for those mountainous regions is too sparse to capture the spatial variability in the rainfall pattern.


Citation

The E-Learning project SOGA-Py was developed at the Department of Earth Sciences by Annette Rudolph, Joachim Krois and Kai Hartmann. You can reach us via mail by soga[at]zedat.fu-berlin.de.

Creative Commons License
You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follows: Rudolph, A., Krois, J., Hartmann, K. (2023): Statistics and Geodata Analysis using Python (SOGA-Py). Department of Earth Sciences, Freie Universitaet Berlin.