In the previous section we saw that feature engineering, such as transformations, basis expansions, and interactions, may significantly enhance the predictive power of a model. Consequently, the question arises as to which features should be included in the model. Hence, we are looking for a criterion that allows us to assess which combination of features gives the best model performance and that, at the same time, penalizes the number of free parameters in the model in order to counteract overfitting.

There are two commonly applied model selection criteria: the Akaike information criterion (AIC) and the Bayesian information criterion (BIC).

Both criteria (AIC and BIC) are measures of the relative quality of statistical models for a given set of data: given a collection of candidate models, AIC and BIC estimate the quality of each model relative to the others. Hence, both criteria provide a means for model selection. They are defined as

\[AIC = 2k-2\ln(\mathcal L)\text{,}\]

\[BIC = \ln(n)k-2\ln(\mathcal L)\text{,}\]

where \(k\) is the number of estimated model parameters, \(n\) is the number of data points, and \(\mathcal L\) is the maximized value of the likelihood function for the model \(M\); i.e. \(\mathcal L=P(x\vert\hat \theta, M)\), where \(\hat \theta\) are the parameter values that maximize the likelihood function.
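
To make the definitions concrete, we can compute both criteria by hand for any fitted lm object and compare the results with R's built-in AIC() and BIC() functions. Here is a minimal sketch on the built-in cars data set (not our rainfall data):

# fit a simple linear model on the built-in cars data set
m <- lm(dist ~ speed, data = cars)

k  <- attr(logLik(m), "df")   # number of estimated parameters (incl. sigma)
n  <- nobs(m)                 # number of data points
ll <- as.numeric(logLik(m))   # maximized log-likelihood

2 * k - 2 * ll                # equals AIC(m)
log(n) * k - 2 * ll           # equals BIC(m)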

Under the framework of stepwise model selection a criterion (e.g. AIC or BIC) is used to weigh the choices of adding (or dropping) model parameters, taking into account the number of parameters to be fitted. At each step, the add or drop that minimizes the criterion score is performed.

In the R software stepwise model selection is conducted by the step() function. By default the step() function uses the AIC criterion for model selection, which is indicated by the default value of k = 2 in the function call. To apply the BIC we simply set the function argument k = log(n), where n is the number of observations. Type ?step in your console for more information on the step() function.
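
For instance, given a fitted model my_model built from a data frame my_data (placeholder names, not objects from our data set), switching between the two criteria is a single-argument change:

# AIC-based selection (the default penalty, k = 2)
step(my_model)

# BIC-based selection: replace the penalty 2 by log(n)
step(my_model, k = log(nrow(my_data)))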


Forward-Stepwise Selection

Forward-stepwise selection starts with the intercept, and then sequentially adds into the model the predictor that most improves the fit. Forward-stepwise selection is a greedy algorithm, producing a nested sequence of models.

In R we encode a forward-stepwise model selection procedure by applying the step() function. First, we specify a model object. In forward-stepwise selection the model starts with the intercept and nothing else; thus the model is specified as response ~ 1. In addition we provide a scope, which tells the algorithm which additional features may be included. The scope argument of the step() function expects a formula as input, thus we may apply the as.formula() function to save us from typing character strings. Moreover, we request forward-stepwise selection by explicitly adding the argument direction = "forward" to the function call.

Let’s give it a try. We start with an empty model and add a total of four additional features (ALTITUDE, MAX.RAINFALL, MEAN.CLOUD.COVER, and MEAN.ANNUAL.AIR.TEMP). The step() function evaluates by default the AIC criterion and picks the feature that decreases the AIC the most. If adding another feature does not decrease the AIC, the algorithm stops and returns the best model specification found along the nested sequence of feature combinations. Be aware that step() writes a large amount of output to the console, which we discuss just below.

# define (empty) model
test_model <- lm(formula = MEAN.ANNUAL.RAINFALL ~ 1, data = train_set)

# define scope, which features to include
test_scope <- as.formula(lm(MEAN.ANNUAL.RAINFALL ~ ALTITUDE + MAX.RAINFALL + MEAN.CLOUD.COVER + MEAN.ANNUAL.AIR.TEMP,
                            data = train_set))

# call the step function
step(test_model, scope = test_scope, direction = "forward")
## Start:  AIC=1757.56
## MEAN.ANNUAL.RAINFALL ~ 1
## 
##                        Df Sum of Sq     RSS    AIC
## + MAX.RAINFALL          1   5552887 2203798 1554.5
## + ALTITUDE              1   4548309 3208376 1615.7
## + MEAN.ANNUAL.AIR.TEMP  1   3104860 4651824 1676.2
## + MEAN.CLOUD.COVER      1    928168 6828516 1738.8
## <none>                              7756685 1757.6
## 
## Step:  AIC=1554.45
## MEAN.ANNUAL.RAINFALL ~ MAX.RAINFALL
## 
##                        Df Sum of Sq     RSS    AIC
## + MEAN.ANNUAL.AIR.TEMP  1    135089 2068709 1546.1
## + ALTITUDE              1    121405 2082393 1547.2
## + MEAN.CLOUD.COVER      1     66110 2137688 1551.5
## <none>                              2203798 1554.5
## 
## Step:  AIC=1546.14
## MEAN.ANNUAL.RAINFALL ~ MAX.RAINFALL + MEAN.ANNUAL.AIR.TEMP
## 
##                    Df Sum of Sq     RSS    AIC
## <none>                          2068709 1546.1
## + MEAN.CLOUD.COVER  1     22610 2046099 1546.3
## + ALTITUDE          1     19356 2049353 1546.6
## 
## Call:
## lm(formula = MEAN.ANNUAL.RAINFALL ~ MAX.RAINFALL + MEAN.ANNUAL.AIR.TEMP, 
##     data = train_set)
## 
## Coefficients:
##          (Intercept)          MAX.RAINFALL  MEAN.ANNUAL.AIR.TEMP  
##                86.79                 23.50                -27.32

Let us take a closer look at the output of the step() function. The output starts with the AIC value for the baseline model (MEAN.ANNUAL.RAINFALL ~ 1). It then lists all additional features provided by the scope argument and sorts them according to the AIC criterion; in other words, it evaluates the AIC that results from adding each of these features. The next step starts again with the AIC for the new model, MEAN.ANNUAL.RAINFALL ~ MAX.RAINFALL, and again all remaining features are sorted according to the AIC criterion. In the following step one more feature is added to the model (MEAN.ANNUAL.RAINFALL ~ MAX.RAINFALL + MEAN.ANNUAL.AIR.TEMP). Note that adding either of the features left to choose from (MEAN.CLOUD.COVER and ALTITUDE) does not decrease the AIC. This is indicated by the <none> entry in the list, which states that adding no further feature is the best choice. Consequently, the forward-stepwise selection algorithm stops and returns the best model configuration, which in our case is MEAN.ANNUAL.RAINFALL ~ MAX.RAINFALL + MEAN.ANNUAL.AIR.TEMP. The last lines of the output show the model call and the model coefficients.

Just for fun, let us calculate the RMSE for the training and the test data set given the suggested model.
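
We reuse the helper functions rmse() and rmse2() from the earlier sections of this tutorial. In case you skipped those, here is a minimal sketch of what the two helpers might look like:

# minimal sketch of the helper functions used below (defined earlier in
# this tutorial): rmse() returns the in-sample RMSE of a fitted model,
# rmse2() the RMSE of predictions against observed values
rmse <- function(model) {
  sqrt(mean(residuals(model)^2))
}

rmse2 <- function(y, y_pred) {
  sqrt(mean((y - y_pred)^2))
}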

# build the model
m_forward_test <- lm(formula = MEAN.ANNUAL.RAINFALL ~ MAX.RAINFALL + MEAN.ANNUAL.AIR.TEMP,
                     data = train_set)

# calculate rmse on training set
print(paste("RMSE on training set:", rmse(m_forward_test)))
## [1] "RMSE on training set: 112.65639969757"
# prediction
m_forward_test_pred <- predict(m_forward_test , newdata = test_set)

# calculate RMSE for the test data set
print(paste("RMSE on test set:", rmse2(y_test, m_forward_test_pred)))
## [1] "RMSE on test set: 124.062272949682"

Not bad! Definitely one of the better models we built so far.

However, we want more! Recall that we have 16 features in our data set. So far we have experimented with just a few of them. Let us exploit the full feature space of our data set and apply a forward-stepwise model selection for all available features. Fortunately, we do not have to write down all the feature names: in the formula notation the full stop stands for “include all features” (response ~ .). To prevent the output of the step() function from flooding our console we add the trace = 0 argument to the function call.

# define (empty) model
m_forward_full_baseline <- lm(formula = MEAN.ANNUAL.RAINFALL ~ 1, data = train_set)

# define scope, which features to include
m_forward_full_scope <- as.formula(lm(MEAN.ANNUAL.RAINFALL ~., data = train_set))

# call the step function
m_forward_full <- step(m_forward_full_baseline,
                       scope = m_forward_full_scope,
                       direction = "forward",
                       trace = 0)
m_forward_full
## 
## Call:
## lm(formula = MEAN.ANNUAL.RAINFALL ~ MAX.RAINFALL + LON + MEAN.ANNUAL.AIR.TEMP + 
##     ALTITUDE + RECORD.LENGTH + MAX.MONTHLY.WIND.SPEED + MIN.AIR.TEMP, 
##     data = train_set)
## 
## Coefficients:
##            (Intercept)            MAX.RAINFALL                     LON  
##              1156.4027                 24.3466                -40.0285  
##   MEAN.ANNUAL.AIR.TEMP                ALTITUDE           RECORD.LENGTH  
##               -82.7801                 -0.1746                 -0.2372  
## MAX.MONTHLY.WIND.SPEED            MIN.AIR.TEMP  
##               -21.3970                  5.7082

OK, that is the suggested model configuration. Let us evaluate the RMSE on the training set and on the test set, and store the model and the corresponding RMSE in our list object called model_outcome.

# calculate rmse on training set
print(paste("RMSE on training set:", rmse(m_forward_full)))
## [1] "RMSE on training set: 70.1512411541421"
# prediction
m_forward_full_test_pred <- predict(m_forward_full, newdata = test_set)

# calculate RMSE for the test data set
print(paste("RMSE on test set:", rmse2(y_test, m_forward_full_test_pred)))
## [1] "RMSE on test set: 87.2102349490595"
# store model object and results of rmse in the `list` object named `model_outcome`
model_outcome[["forward_full"]] <- list("model" = m_forward_full,
                                        "rmse" =  data.frame("name" = "forward model",
                                                             "train_RMSE" = rmse(m_forward_full),
                                                             "test_RMSE" = rmse2(y_test, m_forward_full_test_pred)))

Wow! What a reduction of the RMSE on both the training and the test data set. This seems to be a model configuration that performs well on the training data and generalizes well, too.


Backward-Stepwise Selection

The backward-stepwise model selection is very similar to the forward-stepwise model selection discussed in the previous section. Backward-stepwise selection starts with the full model and sequentially deletes the predictor that has the least impact on the fit. Backward selection can only be used when the number of observations is larger than the number of features \((n > d)\), while forward-stepwise selection can always be used (Hastie et al. 2008).

In R the backward-stepwise model selection is implemented in the step() function. To conduct the backward-stepwise model selection procedure one has to specify the direction argument as direction = "backward".

For the sake of comparison we apply the backward-stepwise model selection in the same manner as we did in the previous section. However, this time we start with a sophisticated model and exclude one feature after the other based on the AIC criterion.

# define (sophisticated) model
test_model <- lm(formula = MEAN.ANNUAL.RAINFALL ~ ALTITUDE + MAX.RAINFALL + MEAN.CLOUD.COVER + MEAN.ANNUAL.AIR.TEMP,
                 data = train_set)

# call the step function
step(test_model, direction = "backward")
## Start:  AIC=1546.84
## MEAN.ANNUAL.RAINFALL ~ ALTITUDE + MAX.RAINFALL + MEAN.CLOUD.COVER + 
##     MEAN.ANNUAL.AIR.TEMP
## 
##                        Df Sum of Sq     RSS    AIC
## - ALTITUDE              1     18786 2046099 1546.3
## - MEAN.ANNUAL.AIR.TEMP  1     20453 2047766 1546.5
## - MEAN.CLOUD.COVER      1     22040 2049353 1546.6
## <none>                              2027313 1546.8
## - MAX.RAINFALL          1   1144027 3171340 1617.8
## 
## Step:  AIC=1546.34
## MEAN.ANNUAL.RAINFALL ~ MAX.RAINFALL + MEAN.CLOUD.COVER + MEAN.ANNUAL.AIR.TEMP
## 
##                        Df Sum of Sq     RSS    AIC
## - MEAN.CLOUD.COVER      1     22610 2068709 1546.1
## <none>                              2046099 1546.3
## - MEAN.ANNUAL.AIR.TEMP  1     91588 2137688 1551.5
## - MAX.RAINFALL          1   2546618 4592718 1676.1
## 
## Step:  AIC=1546.14
## MEAN.ANNUAL.RAINFALL ~ MAX.RAINFALL + MEAN.ANNUAL.AIR.TEMP
## 
##                        Df Sum of Sq     RSS    AIC
## <none>                              2068709 1546.1
## - MEAN.ANNUAL.AIR.TEMP  1    135089 2203798 1554.5
## - MAX.RAINFALL          1   2583116 4651824 1676.2
## 
## Call:
## lm(formula = MEAN.ANNUAL.RAINFALL ~ MAX.RAINFALL + MEAN.ANNUAL.AIR.TEMP, 
##     data = train_set)
## 
## Coefficients:
##          (Intercept)          MAX.RAINFALL  MEAN.ANNUAL.AIR.TEMP  
##                86.79                 23.50                -27.32

Perfect! As expected, the backward-stepwise model selection algorithm comes to the same conclusion. The function output is interpreted in the same way as discussed in the section on forward-stepwise model selection.

Let us apply the backward-stepwise model selection on the full feature space of our data set. Further, we evaluate the RMSE of the resulting model and store the model together with the corresponding RMSE in our list object called model_outcome.

# define (full) model
m_backward_full_baseline <- lm(formula = MEAN.ANNUAL.RAINFALL ~., data = train_set)

# call the step function
m_backward_full <- step(m_backward_full_baseline,
                        direction = "backward",
                        trace = 0)
m_backward_full
## 
## Call:
## lm(formula = MEAN.ANNUAL.RAINFALL ~ LAT + LON + RECORD.LENGTH + 
##     MEAN.ANNUAL.AIR.TEMP + MEAN.MONTHLY.MIN.TEMP + MAX.MONTHLY.WIND.SPEED + 
##     MAX.RAINFALL + MEAN.RANGE.AIR.TEMP, data = train_set)
## 
## Coefficients:
##            (Intercept)                     LAT                     LON  
##                -211.28                   19.04                  -42.70  
##          RECORD.LENGTH    MEAN.ANNUAL.AIR.TEMP   MEAN.MONTHLY.MIN.TEMP  
##                  -0.22                 -115.45                   64.82  
## MAX.MONTHLY.WIND.SPEED            MAX.RAINFALL     MEAN.RANGE.AIR.TEMP  
##                 -19.28                   24.14                   33.82
# calculate rmse on training set
print(paste("RMSE on training set:", rmse(m_backward_full)))
## [1] "RMSE on training set: 69.7386767083948"
# prediction
m_backward_full_test_pred <- predict(m_backward_full, newdata = test_set)

# calculate RMSE for the test data set
print(paste("RMSE on test set:", rmse2(y_test, m_backward_full_test_pred)))
## [1] "RMSE on test set: 85.1725548463445"
# store model object and results of rmse in the `list` object named `model_outcome`
model_outcome[["backward_full"]] <- list("model" = m_backward_full,
                                         "rmse" =  data.frame("name" = "backward model",
                                                              "train_RMSE" = rmse(m_backward_full),
                                                              "test_RMSE" = rmse2(y_test, m_backward_full_test_pred)))

Nice results. We see that the “best” model configuration provided by the backward-stepwise selection procedure is not identical to the one provided by the forward-stepwise selection procedure. The RMSE of the backward-stepwise model is slightly lower, but not by enough to call a winner.


Hybrid Stepwise Selection

R also implements a hybrid stepwise-selection strategy that considers both forward and backward moves at each step and selects the “best” of the two. The step() function uses the AIC criterion for weighing the choices, which takes proper account of the number of parameters fitted; at each step the add or drop that minimizes the AIC score is performed.

In R the hybrid stepwise-selection procedure is implemented in the step() function, by setting the direction argument to direction = "both". For the sake of model comparison we apply the hybrid stepwise-selection procedure to the full feature space of the data set. We may either start with a full (all features included) model or a baseline (intercept-only) model; here we start with the full model. As in the previous sections we evaluate the RMSE of the resulting model and store the model together with the corresponding RMSE in our list object called model_outcome.

# define (full) model
m_both_full_baseline <- lm(formula = MEAN.ANNUAL.RAINFALL ~.,
                           data = train_set)

# call the step function
m_both_full <- step(m_both_full_baseline,
                    direction = "both",
                    trace = 0)
m_both_full
## 
## Call:
## lm(formula = MEAN.ANNUAL.RAINFALL ~ LAT + LON + RECORD.LENGTH + 
##     MEAN.ANNUAL.AIR.TEMP + MEAN.MONTHLY.MIN.TEMP + MAX.MONTHLY.WIND.SPEED + 
##     MAX.RAINFALL + MEAN.RANGE.AIR.TEMP, data = train_set)
## 
## Coefficients:
##            (Intercept)                     LAT                     LON  
##                -211.28                   19.04                  -42.70  
##          RECORD.LENGTH    MEAN.ANNUAL.AIR.TEMP   MEAN.MONTHLY.MIN.TEMP  
##                  -0.22                 -115.45                   64.82  
## MAX.MONTHLY.WIND.SPEED            MAX.RAINFALL     MEAN.RANGE.AIR.TEMP  
##                 -19.28                   24.14                   33.82
# calculate rmse on training set
print(paste("RMSE on training set:", rmse(m_both_full)))
## [1] "RMSE on training set: 69.7386767083948"
# prediction
m_both_full_test_pred <- predict(m_both_full, newdata = test_set)

# calculate RMSE for the test data set
print(paste("RMSE on test set:", rmse2(y_test, m_both_full_test_pred)))
## [1] "RMSE on test set: 85.1725548463445"
# store model object and results of rmse in the `list` object named `model_outcome`
model_outcome[["both_full"]] <- list("model" = m_both_full,
                                     "rmse" =  data.frame("name" = "hyprid model",
                                                          "train_RMSE" = rmse(m_both_full),
                                                          "test_RMSE" = rmse2(y_test, m_both_full_test_pred)))

Good results. The RMSE on both the training and the test data set is in the same range as for the forward-stepwise and backward-stepwise selection procedures. Again, it is too close to call a winner.


Brute force feature expansion

Before we take a look at the interim results of our “model selection comparison contest”, recall that feature engineering techniques (transformations, basis expansions, interactions, …) may considerably enhance the predictive power of a model. In this section we combine feature engineering techniques with model selection procedures.

As mentioned earlier, feature engineering is sometimes more art than science, and thus there is no exact procedure to follow in search of the “best” feature engineering approach. Feature engineering often draws on domain knowledge, expertise and experience. In this section we do not rely on any of that; instead we apply a brute-force approach.

This means that we apply transformations (taking the log()) and interactions (the product of each feature pair) to all of our features. Fortunately, in R the formula notation allows us to encode these combinations in one line of code. Providing the expression response ~ . + .^2 + .*. in the model formula includes all main effects plus all pairwise interactions. Note that in R's formula notation .^2 does not square the features but expands to all two-way interactions, so .^2 and .*. are in fact equivalent here; genuine squared terms would have to be wrapped in I(), e.g. I(ALTITUDE^2). The log-transformation has to be encoded explicitly.
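
To convince yourself of what a formula expands to, model.matrix() reveals the generated model terms. A small sketch on a toy data frame (hypothetical columns a, b and c, not part of our data set):

# toy data frame to inspect the formula expansion
d <- data.frame(y = rnorm(5), a = rnorm(5), b = rnorm(5), c = rnorm(5))

# .^2 expands to all main effects plus all two-way interactions,
# i.e. (Intercept), a, b, c, a:b, a:c, b:c (no squared terms)
colnames(model.matrix(y ~ .^2, data = d))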

For the sake of simplicity we apply the hybrid stepwise-selection strategy (direction = "both"). However, this time we will not use the AIC but the BIC criterion for model selection. Recall that the step() function uses the AIC by default; in order to use the BIC criterion we add the argument k = log(n) to the step() function call. As in the previous sections we evaluate the RMSE of the resulting model and store the model together with the corresponding RMSE in our list object called model_outcome.

#########################
### Brute force - BIC ###
#########################

# encode log-transformation for all features, excluding the response and
# MIN.AIR.TEMP (minimum air temperatures may be <= 0, where the log is undefined)
features <- colnames(train_set)[!(colnames(train_set) %in% c("MEAN.ANNUAL.RAINFALL", "MIN.AIR.TEMP"))]

# build a string of the form "log(x1) + log(x2) + ..."
log_transform <- paste(paste0("log(", features, ")"), collapse = " + ")

# encode formula
formula <- as.formula(paste0("MEAN.ANNUAL.RAINFALL ~ . +", log_transform, "+ .^2 + .*."))

# define model
m_both_force_BIC_baseline <- lm(formula = formula, data = train_set)

# call the step function
m_both_force_BIC <- step(m_both_force_BIC_baseline,
                         direction = "both",
                         trace = 0,
                         k = log(nrow(train_set)))

# calculate rmse on training set
print(paste("RMSE on training set:", rmse(m_both_force_BIC)))
## [1] "RMSE on training set: 19.7287872656248"
# prediction
m_both_force_BIC_test_pred <- predict(m_both_force_BIC, newdata = test_set)

# calculate RMSE for the test data set
print(paste("RMSE on test set:", rmse2(y_test, m_both_force_BIC_test_pred)))
## [1] "RMSE on test set: 396.872530692526"
# store model object and results of rmse in the `list` object named `model_outcome`
model_outcome[["both_force_BIC"]] <- list("model" = m_both_force_BIC,
                                          "rmse" =  data.frame("name" = "brute force BIC",
                                                               "train_RMSE" = rmse(m_both_force_BIC),
                                                               "test_RMSE" = rmse2(y_test, m_both_force_BIC_test_pred)))

# save list object for later usage
save(model_outcome, file = "model_outcome_II_30200.RData")

Done! We see that the RMSE on the training data set is very low; however, the RMSE on the test set is quite high. This is an indicator of model overfitting. No surprise at all, as the suggested model includes 118 parameters!

length(m_both_force_BIC$coefficients)
## [1] 118

Interim results II

Let us visualize the interim results and check which models are leading our “model selection comparison contest”. The do.call() function allows us to apply any function, in our case the rbind() function, to the elements in the list object model_outcome.

# extract matrix object
res_matrix <- do.call(rbind, model_outcome)

# extract data frame object
res_df <- do.call(rbind, res_matrix[, 2])

library(ggplot2)
library(reshape2)
# melt to tidy data frame
df <- melt(res_df,
           id.vars = "name",
           measure.vars = c("train_RMSE",
                            "test_RMSE")
           )

# plot
ggplot(data = df, aes(x = name, y = value, fill = variable)) +
  geom_bar(stat = "identity",
           position = position_dodge()
           ) +
  scale_fill_hue(name = "") +
  xlab("") +
  ylab("RMSE") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

The plot shows the RMSE of each model on the training data set and on the test data set. The best models with respect to RMSE on the test data set are those that used a stepwise parameter selection procedure. Note that the “brute force” model performs best on the training data set, but poorly on the test data set. Such a model overfits the data and is not expected to show high predictive power.
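
Since all fitted lm objects are stored in model_outcome, we can also rank the competing models by their information criteria directly; a short sketch, assuming the list structure used throughout this section:

# extract each stored lm object and compute its AIC and BIC
sapply(model_outcome, function(x) AIC(x$model))
sapply(model_outcome, function(x) BIC(x$model))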

