Both regularization methods, ridge regression and LASSO, are
implemented in the glmnet
R package. The workhorse of the
glmnet
package is the eponymous glmnet()
function. Type ?glmnet
to review the help page on that
function. The glmnet()
function basically fits a
generalized linear model via penalized maximum likelihood. The
alpha
argument is the so-called mixing parameter,
with \(0 \le \alpha \le 1\). If \(\alpha = 1\) the penalty term corresponds
to the LASSO regularized regression and if \(\alpha = 0\) the penalty term corresponds
to the ridge regularized regression. Values of \(\alpha\) between \(0\) and \(1\) are related to the elastic net regularization, which is discussed
elsewhere.
It is important to realize that the degree of regularization depends
on the regularization parameter \(\lambda\). Thus, it is useful to evaluate
the regression function for a sequence of \(\lambda\). By default the
glmnet()
evaluates a sequence of 100 \(\lambda\) values (note that in cases where
no change occurs the function stops earlier). The number of \(\lambda\) may be set by the
nlambda
argument and the sequence of \(\lambda\) may be specified by the
lambda
argument to glmnet()
function call.
The glmnet()
function does not work with the formula
notation, but uses matrices instead. At the begin of the session we
assigned the response vector and the model matrix of the training set
and the test set to the variables y.train
,
y.test
, X.train
, and y.train
,
respectively (see section on data preparation for details).
# load processed data set from previous section
load("dwd_30200.RData")
# load helper functions from previous section
load("helper_functions_30200.RData")
# load list object from previous section
load("model_outcome_II_30200.RData")
We start with the ridge regression (alpha = 0
) and plot
the results.
library(glmnet)
### RIDGE REGRESSION ###
m_ridge <- glmnet(X_train, y_train, alpha = 0)
# plot model coefficients vs. shrinkage parameter lambda
plot(m_ridge, xvar = "lambda", label = TRUE, xlab = expression(paste("Log(", lambda, ")")))
The plot shows for a sequence (default is 100) of \(\log\lambda\) the values for the model coefficients. If \(\lambda\) is close to \(0\) the coefficients correspond to the least square solution, however, by increasing \(\lambda\) the coefficients are penalized and converge to \(0\). The upper x-axis shows the number of coefficients of the model, which is 16 for all values of \(\lambda\).
In the next step we need to find a model configuration, and thus the
value of \(\lambda\), which minimizes
the residuals (difference between observations and prediction). The
glmnet
package ships with a build-in cross-validation function. The function
cv.glmnet()
performs by default 10-fold cross-validation on
the data set and evaluates the model performance by calculating the mean-squared error (MSE). Calling the
plot()
function on the cv.glmnet
object
provides a nice graphical representation of the relation between MSE and
\(\lambda\).
grid <- 10^seq(10, -1, length = 100) # set lambda sequence
m_ridge_cv <- cv.glmnet(X_train,
y_train,
alpha = 0,
lambda = grid)
plot(m_ridge_cv)
The plot shows that the MSE increases with \(\lambda\). The dotted vertical lines
indicate the value of \(\lambda\) that
gives the minimum mean cross-validated error and the largest value of
\(\lambda\) such that the error is
within 1 standard error of the minimum. We can assess these values of
\(\lambda\) by the object attributes
lambda.min
and lambda.1se
.
m_ridge_cv$lambda.min
## [1] 4.641589
m_ridge_cv$lambda.1se
## [1] 27.82559
Finally we may use the best \(\lambda\) to predict the response vector
y_train
and y_test
, respectively. Once again
we calculate the RMSE, for both the training data and the test data, and
store the model together with the corresponding RMSE in our
list
object called model_outcome
.
# prediction for the training set
m_ridge_cv_pred_train <- predict(m_ridge_cv,
X_train, s = "lambda.min")
# calculate RMSE on training set
print(paste("RMSE on training set:", rmse2(m_ridge_cv_pred_train,
y_train)
)
)
## [1] "RMSE on training set: 70.0437564513222"
# prediction for the test set
m_ridge_cv_pred_test <- predict(m_ridge_cv,
X_test, s = "lambda.min")
# calculate RMSE for the test data set
print(paste("RMSE on test set:", rmse2(m_ridge_cv_pred_test,
y_test)
)
)
## [1] "RMSE on test set: 86.79229987451"
# store model object and results of RMSE in the `list` object named `model_outcome`
model_outcome[["m_ridge_cv"]] <- list("model" = m_ridge_cv,
"rmse" = data.frame("name" = "ridge regression",
"train_RMSE" = rmse2(m_ridge_cv_pred_train,
y_train),
"test_RMSE" = rmse2(m_ridge_cv_pred_test,
y_test)
)
)
The mechanics for the LASSO regression in R are very similar to those
for the ridge regression. However, this time we set
alpha = 1
. Let us build the model and plot it.
library(glmnet)
### LASSO REGRESSION ###
m_lasso <- glmnet(X_train, y_train, alpha = 1)
# plot model coefficients vs. shrinkage parameter lambda
plot(m_lasso, xvar = "lambda", label = TRUE, xlab = expression(paste("Log(", lambda, ")")))
The plot shows for a sequence (default is 100) of \(\log\lambda\) the values for the model coefficients. By increasing \(\lambda\) the coefficients are penalized and eventually become \(0\). The upper x-axis shows the decreasing number of coefficients of the model with increasing values of \(\lambda\). This plot nicely visualizes variable selection via the LASSO method.
In the next step we need to find a model configuration, and thus the
value of \(\lambda\), which minimizes
the residuals (difference between observations and prediction). The
glmnet
package ships with a build-in cross-validation function. The function
cv.glmnet()
performs by default 10-fold cross-validation on
the data set and evaluates the model performance by calculating the mean-squared error (MSE). Calling the
plot()
function on the cv.glmnet
object
provides a nice graphical representation of the relation between MSE and
\(\lambda\).
m_lasso_cv <- cv.glmnet(X_train, y_train, alpha = 1)
plot(m_lasso_cv)
The plot shows that the MSE increases with \(\lambda\). The dotted vertical lines
indicate the value of \(\lambda\) that
gives the minimum mean cross-validated error and the largest value of
\(\lambda\) such that the error is
within 1 standard error of the minimum. Sometimes it is more
recommendable to choose the simpler model, and thus the model with fewer
model parameters. We can assess these values of \(\lambda\) by the object attributes
lambda.min
and lambda.1se
.
m_lasso_cv$lambda.min
## [1] 1.605312
m_lasso_cv$lambda.1se
## [1] 13.6412
Finally we may use the best \(\lambda\) to predict the response vector
y_train
and y_test
, respectively. Once again
we calculate the RMSE, for both the training data and the test data, and
store the model together with the corresponding RMSE in our
list
object called model_outcome
.
# prediction for the training set
m_lasso_cv_pred_train <- predict(m_lasso_cv, X_train, s = "lambda.min")
# calculate RMSE on training set
print(paste("RMSE on training set:",
rmse2(m_lasso_cv_pred_train,
y_train)
)
)
## [1] "RMSE on training set: 70.3480060786324"
# prediction for the test set
m_lasso_cv_pred_test <- predict(m_lasso_cv, X_test, s = "lambda.min")
# calculate RMSE for the test data set
print(paste("RMSE on test set:",
rmse2(m_lasso_cv_pred_test,
y_test)
)
)
## [1] "RMSE on test set: 85.2714357401066"
# store model object and results of RMSE in the `list` object named `model_outcome`
model_outcome[["m_lasso_cv"]] <- list("model" = m_lasso_cv,
"rmse" = data.frame("name" = "LASSO regression",
"train_RMSE" = rmse2(m_lasso_cv_pred_train, y_train),
"test_RMSE" = rmse2(m_lasso_cv_pred_test, y_test)
)
)
Let us visualize the final results and check which models are leading
our “model selection comparison contest”. The do.call()
function allows us to apply any function, in our case the
rbind()
function, to the elements in the list
object model_outcome
.
# extract matrix object
res_matrix <- do.call(rbind, model_outcome)
# extract data frame object
res_df <- do.call(rbind, res_matrix[, 2])
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.2.3
library(reshape2)
## Warning: package 'reshape2' was built under R version 4.2.3
# melt to tidy data frame
df <- melt(res_df,
id.vars = "name",
measure.vars = c("train_RMSE",
"test_RMSE")
)
# plot
ggplot(data = df, aes(x = name, y = value, fill = variable)) +
geom_bar(stat = "identity",
position = position_dodge()
) +
scale_fill_hue(name = "") +
xlab("") + ylab("RMSE") +
theme(axis.text.x = element_text(angle = 45,
hjust = 1)
)
The plot shows the RMSE of each model on the training data set and on the test data set. The best models with respect to RMSE on the test data set are those models, which used a stepwise parameter selection procedures or regularization terms. For the best performing models the RMSE on the test data set is comparable low.
res_df
## name train_RMSE test_RMSE
## baseline baseline model 218.14450 280.03880
## simple_alt simple ALT 140.29714 187.90538
## simple_rain simple RAIN 116.27653 129.93945
## multi_alt_rain multiple ALT and RAIN 113.02839 130.57758
## forward_full forward model 70.15124 87.21023
## backward_full backward model 69.73868 85.17255
## both_full hyprid model 69.73868 85.17255
## both_force_BIC brute force BIC 19.72879 396.87253
## m_ridge_cv ridge regression 70.04376 86.79230
## m_lasso_cv LASSO regression 70.34801 85.27144
At the end we draw a nice interactive map of the weather stations in our data set and color and scale them according to the RMSE. The benefit of using the RMSE is that the RMSE is on the same scale as the data, hence the RMSE can be interpreted as deviance of the predicted value from the observed value as rainfall in mm.
Feel free to interact with the map. You may zoom in and zoom out. You may change the underlying basemap. Further, you can hoover over the data points, and by clicking you get additional information about the data point.
# make a copy original data source for the ease of further processing
dwd_data_plot <- dwd_data
# use ridge regression model to predict all data in the data set
# extract model matrix
X <- as.matrix(
dwd_data_plot[, colnames(
dwd_data_plot)[!colnames(
dwd_data_plot) %in% c("MEAN.ANNUAL.RAINFALL",
"data.subset")]])
# extract response vector
y <- dwd_data_plot$MEAN.ANNUAL.RAINFALL
# calculate RMSE vector based on L2-regularized regression model (ridge regression)
preds <- predict(m_ridge_cv, X, s = "lambda.min")
rmse_vector <- (preds[, ] - y)^2 / length(y)
# add RMSE vector to data frame
dwd_data_plot["RMSE_Ridge"] <- round(rmse_vector, 2)
# restore previously dropped data columns
cols_to_restore <- c("DWD_ID",
"STATION.NAME",
"FEDERAL.STATE",
"PERIOD") # columns to restore
dwd_restored <- dwd[complete.cases(dwd),
cols_to_restore]
# join data frames
dwd_data_plot <- cbind(dwd_restored, dwd_data_plot)
# add a new column called "data_subset" to the data set
dwd_data_plot$data_subset <- ifelse(split == TRUE,
"training data",
"test data")
library(mapview)
library(sp)
coordinates(dwd_data_plot) <- ~ LON + LAT
proj4string(dwd_data_plot) <- CRS("+init=epsg:4326")
# plot data
mapView(dwd_data_plot,
zcol = c("RMSE_Ridge",
"data_subset"),
cex = "RMSE_Ridge",
legend = TRUE,
layer.name = c("RMSE Ridge",
"Data Subset")
)
The map shows the RMSE for each data point (weather station). The spatial distribution of the RMSE shows that the prediction of mean annual rainfall is quite good in the North German Lowlands. In the area near Hamburg and Cuxhaven the model predictions are slightly worse. In contrast, the predictions of annual mean rainfall for the Central German Uplands, the Alpine foreland and the Alps are much worse. Maybe the input data for those mountainous regions is too sparse for capturing the spatial variability in the rainfall pattern.
Citation
The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us via mail by soga[at]zedat.fu-berlin.de.
Please cite as follow: Hartmann, K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.