Both regularization methods, ridge regression and the LASSO, are implemented in the glmnet R package. The workhorse of the glmnet package is the eponymous glmnet() function; type ?glmnet to review its help page. The glmnet() function fits a generalized linear model via penalized maximum likelihood. The alpha argument is the so-called mixing parameter, with \(0 \le \alpha \le 1\). For \(\alpha = 1\) the penalty term corresponds to LASSO regularized regression and for \(\alpha = 0\) it corresponds to ridge regularized regression. Values of \(\alpha\) between \(0\) and \(1\) yield the elastic net regularization, which is discussed elsewhere.
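For reference, for a Gaussian response glmnet() minimizes (roughly following the notation of the glmnet documentation) the penalized objective

\[
\min_{\beta_0, \beta} \; \frac{1}{2n} \sum_{i=1}^{n} \left(y_i - \beta_0 - x_i^\top \beta\right)^2 + \lambda \left[ \frac{1 - \alpha}{2} \lVert \beta \rVert_2^2 + \alpha \lVert \beta \rVert_1 \right],
\]

so that \(\alpha = 0\) leaves only the quadratic (ridge) penalty and \(\alpha = 1\) leaves only the absolute value (LASSO) penalty.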

It is important to realize that the degree of regularization depends on the regularization parameter \(\lambda\). Thus, it is useful to evaluate the regression function for a sequence of \(\lambda\) values. By default glmnet() evaluates a sequence of 100 \(\lambda\) values (note that the function may stop earlier if the fit no longer changes). The number of \(\lambda\) values may be set via the nlambda argument, and a specific sequence of \(\lambda\) values may be supplied via the lambda argument in the glmnet() function call.
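As a quick illustration (a minimal sketch, assuming the glmnet package is loaded and the training data X_train and y_train introduced below are available), the \(\lambda\) sequence can be controlled as follows:

# let glmnet() choose 50 lambda values instead of the default 100
m_50 <- glmnet(X_train, y_train, alpha = 0, nlambda = 50)

# or supply an explicit, decreasing lambda sequence
lambda_seq <- 10^seq(3, -2, length.out = 100)
m_seq <- glmnet(X_train, y_train, alpha = 0, lambda = lambda_seq)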

The glmnet() function does not work with the formula notation, but uses matrices instead. At the beginning of the session we assigned the response vectors and the model matrices of the training set and the test set to the variables y_train, y_test, X_train, and X_test, respectively (see the section on data preparation for details).

# load processed data set from previous section
load("dwd_30200.RData")
# load helper functions from previous section
load("helper_functions_30200.RData")
# load list object from previous section
load("model_outcome_II_30200.RData")

Ridge regression in R

We start with the ridge regression (alpha = 0) and plot the results.

library(glmnet)

### RIDGE REGRESSION ###
m_ridge <- glmnet(X_train, y_train, alpha = 0)

# plot model coefficients vs. shrinkage parameter lambda
plot(m_ridge, xvar = "lambda", label = TRUE, xlab = expression(paste("Log(", lambda, ")")))

The plot shows the values of the model coefficients over a sequence (by default 100 values) of \(\log\lambda\). If \(\lambda\) is close to \(0\) the coefficients correspond to the least squares solution; as \(\lambda\) increases the coefficients are penalized and converge towards \(0\). The upper x-axis shows the number of non-zero coefficients in the model, which remains 16 for all values of \(\lambda\), as ridge regression shrinks coefficients but does not set them exactly to zero.
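To inspect the shrinkage numerically, we may extract the coefficients at particular values of \(\lambda\) (a small sketch using the end points of the fitted \(\lambda\) sequence):

# coefficients at the largest and smallest lambda of the fitted sequence
coef(m_ridge, s = max(m_ridge$lambda))  # strong penalty: coefficients shrunk towards 0
coef(m_ridge, s = min(m_ridge$lambda))  # weak penalty: close to the least squares solution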

In the next step we need to find the model configuration, and thus the value of \(\lambda\), that minimizes the residuals (the differences between observations and predictions). The glmnet package ships with a built-in cross-validation function: cv.glmnet() performs 10-fold cross-validation on the data set by default and evaluates the model performance by calculating the mean squared error (MSE). Calling the plot() function on the cv.glmnet object provides a nice graphical representation of the relation between the MSE and \(\lambda\).
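The number of folds and the loss used for evaluation can be changed if desired (a small sketch; 10 folds and the MSE are already the defaults for a Gaussian response):

# 5-fold cross-validation, evaluated with the mean absolute error instead of the MSE
m_ridge_cv5 <- cv.glmnet(X_train, y_train, alpha = 0,
                         nfolds = 5, type.measure = "mae")

For the ridge model we keep these defaults, but supply our own \(\lambda\) sequence: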

grid <- 10^seq(10, -1, length = 100) # set lambda sequence
m_ridge_cv <- cv.glmnet(X_train,
                        y_train,
                        alpha = 0,
                        lambda = grid)

plot(m_ridge_cv)

The plot shows that the MSE increases with \(\lambda\). The dotted vertical lines indicate the value of \(\lambda\) that gives the minimum mean cross-validated error and the largest value of \(\lambda\) such that the error is within 1 standard error of the minimum. We can access these values of \(\lambda\) via the object attributes lambda.min and lambda.1se.

m_ridge_cv$lambda.min
## [1] 4.641589
m_ridge_cv$lambda.1se
## [1] 27.82559
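The two values are connected by the so-called one-standard-error rule: lambda.1se is the largest \(\lambda\) whose cross-validated error is still within one standard error of the minimum. We can check this manually (a small sketch relying on the lambda, cvm and cvsd components of the cv.glmnet object):

# mean cross-validated error and its standard error at lambda.min
idx_min <- which.min(m_ridge_cv$cvm)
threshold <- m_ridge_cv$cvm[idx_min] + m_ridge_cv$cvsd[idx_min]

# largest lambda whose cross-validated error stays below that threshold
max(m_ridge_cv$lambda[m_ridge_cv$cvm <= threshold])  # should reproduce lambda.1se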

Finally we may use the best \(\lambda\) to predict the response for the training set and the test set, respectively. Once again we calculate the RMSE for both the training data and the test data, and store the model together with the corresponding RMSE in our list object called model_outcome.
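Note that rmse2() is one of the helper functions loaded at the beginning of this section. If you do not have it at hand, a minimal stand-in (under the assumption that rmse2() simply computes the root mean squared error of the predictions against the observations) could look like this:

# minimal stand-in for the rmse2() helper
# only define this if the helper from the previous section is not available,
# since it would otherwise overwrite the loaded version
rmse2 <- function(pred, obs) {
  sqrt(mean((as.numeric(pred) - as.numeric(obs))^2))
}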

# prediction for the training set
m_ridge_cv_pred_train <- predict(m_ridge_cv,
                                 X_train, s = "lambda.min")

# calculate RMSE on training set
print(paste("RMSE on training set:", rmse2(m_ridge_cv_pred_train,
                                           y_train)
            )
      )
## [1] "RMSE on training set: 70.0437564513222"
# prediction for the test set
m_ridge_cv_pred_test <- predict(m_ridge_cv,
                                X_test, s = "lambda.min")

# calculate RMSE for the test data set
print(paste("RMSE on test set:", rmse2(m_ridge_cv_pred_test,
                                       y_test)
            )
      )
## [1] "RMSE on test set: 86.79229987451"
# store model object and results of RMSE in the `list` object named `model_outcome`
model_outcome[["m_ridge_cv"]] <- list("model" = m_ridge_cv,
                                      "rmse" =  data.frame("name" = "ridge regression",
                                                           "train_RMSE" = rmse2(m_ridge_cv_pred_train,
                                                                               y_train),
                                                           "test_RMSE" = rmse2(m_ridge_cv_pred_test,
                                                                               y_test)
                                                           )
                                      )

LASSO regression in R

The mechanics for the LASSO regression in R are very similar to those for the ridge regression. However, this time we set alpha = 1. Let us build the model and plot it.

library(glmnet)

### LASSO REGRESSION ###
m_lasso <- glmnet(X_train, y_train, alpha = 1)

# plot model coefficients vs. shrinkage parameter lambda
plot(m_lasso, xvar = "lambda", label = TRUE, xlab = expression(paste("Log(", lambda, ")")))

The plot shows the values of the model coefficients over a sequence (by default 100 values) of \(\log\lambda\). As \(\lambda\) increases the coefficients are penalized and eventually become exactly \(0\). The upper x-axis shows the number of non-zero coefficients in the model, which decreases with increasing values of \(\lambda\). This plot nicely visualizes variable selection via the LASSO method.

In the next step we need to find the model configuration, and thus the value of \(\lambda\), that minimizes the residuals (the differences between observations and predictions). The glmnet package ships with a built-in cross-validation function: cv.glmnet() performs 10-fold cross-validation on the data set by default and evaluates the model performance by calculating the mean squared error (MSE). Calling the plot() function on the cv.glmnet object provides a nice graphical representation of the relation between the MSE and \(\lambda\).

m_lasso_cv <- cv.glmnet(X_train, y_train, alpha = 1)

plot(m_lasso_cv)

The plot shows that the MSE increases with \(\lambda\). The dotted vertical lines indicate the value of \(\lambda\) that gives the minimum mean cross-validated error and the largest value of \(\lambda\) such that the error is within 1 standard error of the minimum. Sometimes it is advisable to choose the simpler model, i.e. the model with fewer parameters. We can access these values of \(\lambda\) via the object attributes lambda.min and lambda.1se.

m_lasso_cv$lambda.min
## [1] 1.605312
m_lasso_cv$lambda.1se
## [1] 13.6412
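To make the variable selection explicit, we can check which coefficients are set exactly to zero at lambda.1se, i.e. for the sparser of the two candidate models (a small sketch):

# coefficients of the sparser model (lambda.1se); zero entries are dropped predictors
coefs_1se <- coef(m_lasso_cv, s = "lambda.1se")[, 1]

# predictors that the LASSO keeps (non-zero coefficients, intercept excluded)
names(coefs_1se)[coefs_1se != 0 & names(coefs_1se) != "(Intercept)"]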

Finally we may use the best \(\lambda\) to predict the response for the training set and the test set, respectively. Once again we calculate the RMSE for both the training data and the test data, and store the model together with the corresponding RMSE in our list object called model_outcome.

# prediction for the training set
m_lasso_cv_pred_train <- predict(m_lasso_cv, X_train, s = "lambda.min")

# calculate RMSE on training set
print(paste("RMSE on training set:",
            rmse2(m_lasso_cv_pred_train,
                  y_train)
            )
      )
## [1] "RMSE on training set: 70.3480060786324"
# prediction for the test set
m_lasso_cv_pred_test <- predict(m_lasso_cv, X_test, s = "lambda.min")

# calculate RMSE for the test data set
print(paste("RMSE on test set:",
            rmse2(m_lasso_cv_pred_test,
                  y_test)
            )
      )
## [1] "RMSE on test set: 85.2714357401066"
# store model object and results of RMSE in the `list` object named `model_outcome`
model_outcome[["m_lasso_cv"]] <- list("model" = m_lasso_cv,
                                     "rmse" =  data.frame("name" = "LASSO regression",
                                                          "train_RMSE" = rmse2(m_lasso_cv_pred_train, y_train),
                                                          "test_RMSE" = rmse2(m_lasso_cv_pred_test, y_test)
                                                          )
                                     )

Final results

Let us visualize the final results and check which models are leading our “model selection comparison contest”. The do.call() function allows us to apply a function, in our case rbind(), to all elements of the list object model_outcome at once.
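In other words, do.call(rbind, some_list) is equivalent to calling rbind() once with every list element as an argument (a tiny illustrative sketch with made-up data frames):

# do.call(rbind, list_of_rows) is the same as rbind(list_of_rows[[1]], list_of_rows[[2]], ...)
list_of_rows <- list(data.frame(a = 1, b = 2),
                     data.frame(a = 3, b = 4))
do.call(rbind, list_of_rows)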

# extract matrix object
res_matrix <- do.call(rbind, model_outcome)

# extract data frame object
res_df <- do.call(rbind, res_matrix[, 2])

library(ggplot2)
library(reshape2)
# melt to tidy data frame
df <- melt(res_df,
           id.vars = "name",
           measure.vars = c("train_RMSE",
                            "test_RMSE")
           )

# plot
ggplot(data = df, aes(x = name, y = value, fill = variable)) +
  geom_bar(stat = "identity",
           position = position_dodge()
           ) +
  scale_fill_hue(name = "") +
  xlab("") + ylab("RMSE") +
  theme(axis.text.x = element_text(angle = 45,
                                   hjust = 1)
        )

The plot shows the RMSE of each model on the training data set and on the test data set. The best models with respect to the RMSE on the test data set are those that used a stepwise parameter selection procedure or a regularization term. For these best-performing models the RMSE on the test data set is comparably low.

res_df
##                                 name train_RMSE test_RMSE
## baseline              baseline model  218.14450 280.03880
## simple_alt                simple ALT  140.29714 187.90538
## simple_rain              simple RAIN  116.27653 129.93945
## multi_alt_rain multiple ALT and RAIN  113.02839 130.57758
## forward_full           forward model   70.15124  87.21023
## backward_full         backward model   69.73868  85.17255
## both_full               hyprid model   69.73868  85.17255
## both_force_BIC       brute force BIC   19.72879 396.87253
## m_ridge_cv          ridge regression   70.04376  86.79230
## m_lasso_cv          LASSO regression   70.34801  85.27144
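For a programmatic check, the model with the lowest RMSE on the test set can be picked directly from res_df (a small sketch):

# model with the lowest RMSE on the test set
res_df[which.min(res_df$test_RMSE), ]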

Visualization of the final results

At the end we draw a nice interactive map of the weather stations in our data set and color and scale them according to the prediction error. The benefit of using the RMSE is that it is on the same scale as the data; hence it can be interpreted as the deviation of the predicted from the observed mean annual rainfall in mm.

Feel free to interact with the map. You may zoom in and out and change the underlying basemap. Further, you can hover over the data points, and by clicking on them you get additional information about the corresponding weather station.

# make a copy of the original data source for ease of further processing
dwd_data_plot <- dwd_data

# use ridge regression model to predict all data in the data set
# extract model matrix (all columns except the response and the data.subset indicator)
X <- as.matrix(
  dwd_data_plot[, !colnames(dwd_data_plot) %in% c("MEAN.ANNUAL.RAINFALL",
                                                  "data.subset")]
)

# extract response vector
y <- dwd_data_plot$MEAN.ANNUAL.RAINFALL

# calculate a per-station error measure from the L2-regularized regression model (ridge regression);
# the square root keeps the values on a scale comparable to the rainfall data
preds <- predict(m_ridge_cv, X, s = "lambda.min")
rmse_vector <- sqrt((preds[, ] - y)^2 / length(y))

# add RMSE vector to data frame
dwd_data_plot["RMSE_Ridge"] <- round(rmse_vector, 2)

# restore previously dropped data columns
cols_to_restore <- c("DWD_ID",
                     "STATION.NAME",
                     "FEDERAL.STATE",
                     "PERIOD") # columns to restore
dwd_restored <- dwd[complete.cases(dwd),
                    cols_to_restore]

# join data frames
dwd_data_plot <- cbind(dwd_restored, dwd_data_plot)

# add a new column called "data_subset" to the data set
dwd_data_plot$data_subset <- ifelse(split == TRUE,
                                    "training data",
                                    "test data")

library(mapview)
library(sp)

coordinates(dwd_data_plot) <- ~ LON + LAT
proj4string(dwd_data_plot) <- CRS("+init=epsg:4326")

# plot data
mapView(dwd_data_plot,
        zcol = c("RMSE_Ridge",
                 "data_subset"),
        cex = "RMSE_Ridge",
        legend = TRUE,
        layer.name = c("RMSE Ridge",
                       "Data Subset")
        )

The map shows the RMSE for each data point (weather station). The spatial distribution of the RMSE shows that the prediction of mean annual rainfall is quite good in the North German Lowlands. In the area around Hamburg and Cuxhaven the model predictions are slightly worse. In contrast, the predictions of mean annual rainfall for the Central German Uplands, the Alpine foreland and the Alps are much worse. Perhaps the input data for those mountainous regions is too sparse to capture the spatial variability of the rainfall pattern.


Citation

The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us via e-mail at soga[at]zedat.fu-berlin.de.

You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follows: Hartmann, K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.