In the previous section we realized that feature engineering, such as transformations, basis expansions, and interactions may significantly enhance the predictive power of a model. Consequently, the question arises which features should be included in the model. Hence, we are looking for a criterion that allows us to assess which combination of features gives us the best model performance, and at the same time, in order to counteract overfitting issues, penalizes the number of free parameters in our model.

There are two commonly applied model selection criteria, the **Akaike information criterion (AIC)**
and the **Bayesian information criterion
(BIC)**.

Both criteria (AIC and BIC) are measures of the relative quality of statistical models for a given set of data. Given a collection of models for the data, AIC and BIC estimate the quality of each model, relative to each of the other models. Hence, both criteria provide a means for model selection.

\[AIC = 2k-2\ln(\mathcal L)\text{,}\]

\[BIC = \ln(n)k-2\ln(\mathcal L)\text{,}\]

where \(k\) is the number of estimated model parameters, \(n\) is the number of data points, and \(\mathcal L\) is the maximized value of the likelihood function for the model \(M\); i.e. \(\mathcal L=P(x\vert\hat \theta, M)\), where \(\hat \theta\) are the parameter values that maximize the likelihood function.

Under the framework of stepwise model selection a criterion (e.g. AIC or BIC) is used for weighing the choices of adding (or excluding) model parameters, taking into account the numbers of parameters to be fitted. At each step an add or drop will be performed that minimizes the criterion (e.g. AIC or BIC) score.

In the R software stew-wise model selection is conducted by the
`step()`

function. As a default the `step()`

function uses the AIC criterion for model selection, which is indicated
by the default value of `k = 2`

in the function call. To
apply the BIC we simply have to set the function argument
`k = log(n)`

. Type `?step`

in your console for
more information on the `step()`

function.

Forward-stepwise selection starts with the intercept, and then sequentially adds into the model the predictor that most improves the fit. Forward-stepwise selection is a greedy algorithm, producing a nested sequence of models.

In R we encode a forward-stepwise model selection procedure by
applying the `step()`

function. First, we specify a model
object. In the forward-stepwise selection the model starts with the
intercept and nothing else. Thus the model is specified as
`response ~ 1`

. In addition we provide a scope, which tells
the algorithm which additional features should be included. The
`scope`

argument of the `step()`

function expects
the formula-notation as an input, thus me may apply the
`as.formula()`

function to prevent us from typing character
strings. Moreover, we specify the forward-stepwise selection by
explicitly adding the argument `direction = "forward"`

to the
function call.

Let’s give it a try. We start with an empty model and add in total
four additional features (`ALTITUDE`

,
`MAX.RAINFALL`

, `MEAN.CLOUD.COVER`

, and
`MEAN.ANNUAL.AIR.TEMP`

). The `step()`

function
evaluates by default the AIC criterion and picks that particular feature
that minimizes AIC at most. If the addition of another additional
feature does not decrease the AIC criterion, the algorithm stops and
provides us the best model specifications in terms of nested feature
combinations. Be aware that the `step()`

creates a large
amount of output to the console, which we discuss just below.

```
# define (empty) model
test_model <- lm(formula = MEAN.ANNUAL.RAINFALL ~ 1, data = train_set)
# define scope, which features to include
test_scope <- as.formula(lm(MEAN.ANNUAL.RAINFALL ~ ALTITUDE + MAX.RAINFALL + MEAN.CLOUD.COVER + MEAN.ANNUAL.AIR.TEMP,
data = train_set))
# call the step function
step(test_model, scope = test_scope, direction = "forward")
```

```
## Start: AIC=1757.56
## MEAN.ANNUAL.RAINFALL ~ 1
##
## Df Sum of Sq RSS AIC
## + MAX.RAINFALL 1 5552887 2203798 1554.5
## + ALTITUDE 1 4548309 3208376 1615.7
## + MEAN.ANNUAL.AIR.TEMP 1 3104860 4651824 1676.2
## + MEAN.CLOUD.COVER 1 928168 6828516 1738.8
## <none> 7756685 1757.6
##
## Step: AIC=1554.45
## MEAN.ANNUAL.RAINFALL ~ MAX.RAINFALL
##
## Df Sum of Sq RSS AIC
## + MEAN.ANNUAL.AIR.TEMP 1 135089 2068709 1546.1
## + ALTITUDE 1 121405 2082393 1547.2
## + MEAN.CLOUD.COVER 1 66110 2137688 1551.5
## <none> 2203798 1554.5
##
## Step: AIC=1546.14
## MEAN.ANNUAL.RAINFALL ~ MAX.RAINFALL + MEAN.ANNUAL.AIR.TEMP
##
## Df Sum of Sq RSS AIC
## <none> 2068709 1546.1
## + MEAN.CLOUD.COVER 1 22610 2046099 1546.3
## + ALTITUDE 1 19356 2049353 1546.6
```

```
##
## Call:
## lm(formula = MEAN.ANNUAL.RAINFALL ~ MAX.RAINFALL + MEAN.ANNUAL.AIR.TEMP,
## data = train_set)
##
## Coefficients:
## (Intercept) MAX.RAINFALL MEAN.ANNUAL.AIR.TEMP
## 86.79 23.50 -27.32
```

Let us take a closer loot at the output of the `step()`

function. The output starts with the AIC value for the baseline model
(`MEAN.ANNUAL.RAINFALL ~ 1`

). Further it lists all additional
features, provided by the `scope`

argument and sorts them
according to the AIC criterion. In other words it evaluates the AIC for
sequentially adding these features. The next step starts again with the
output of the AIC for the new model given by
`MEAN.ANNUAL.RAINFALL ~ MAX.RAINFALL`

. Again, all additional
features are sorted according to the AIC criterion. In the next step one
more feature is added to the model
(`MEAN.ANNUAL.RAINFALL ~ MAX.RAINFALL + MEAN.ANNUAL.AIR.TEMP`

).
Again all additional features are sorted based on their effect on the
AIC. Note that adding any of the features left to choose from
(`MEAN.CLOUD.COVER`

and `ALTITUDE`

) does not
decrease AIC. This is made more clear by the `<none>`

feature in the list, which simply states that adding no more feature
would be the best choice. Consequently, the forward-stepwise selection
algorithm stops and returns the best model configuration, which in our
case is
`MEAN.ANNUAL.RAINFALL ~ MAX.RAINFALL + MEAN.ANNUAL.AIR.TEMP`

.
The last lines of the output show the model call and the model
coefficients.

Just for a matter of fun, let us calculate the RMSE for the training and test data sets given the suggested model.

```
# build the model
m_forward_test <- lm(formula = MEAN.ANNUAL.RAINFALL ~ MAX.RAINFALL + MEAN.ANNUAL.AIR.TEMP,
data = train_set)
# calculate rmse on training set
print(paste("RMSE on training set:", rmse(m_forward_test)))
```

`## [1] "RMSE on training set: 112.65639969757"`

```
# prediction
m_forward_test_pred <- predict(m_forward_test , newdata = test_set)
# calculate RMSE for the test data set
print(paste("RMSE on test set:", rmse2(y_test, m_forward_test_pred)))
```

`## [1] "RMSE on test set: 124.062272949682"`

Not bad! Definitely one of the better models we built so far.

However, we want more! Recall that we have 16 features in our data
set. So fare we experimented with just a few of them. Let us exploit the
full feature space of our data set. Let us apply a forward-stepwise
model selection for all available features. Fortunately we do not have
to write down all the names of the features in our data set. In the
formula-notation there is a representation of “include all features”,
this representation is indicated by the full-stop
(`response ~ .`

). In order to prevent that the output of the
`step()`

function floods our console we add the
`trace = 0`

argument to the function call.

```
# define (empty) model
m_forward_full_baseline <- lm(formula = MEAN.ANNUAL.RAINFALL ~ 1, data = train_set)
# define scope, which features to include
m_forward_full_scope <- as.formula(lm(MEAN.ANNUAL.RAINFALL ~., data = train_set))
# call the step function
m_forward_full <- step(m_forward_full_baseline,
scope = m_forward_full_scope,
direction = "forward",
trace = 0)
m_forward_full
```

```
##
## Call:
## lm(formula = MEAN.ANNUAL.RAINFALL ~ MAX.RAINFALL + LON + MEAN.ANNUAL.AIR.TEMP +
## ALTITUDE + RECORD.LENGTH + MAX.MONTHLY.WIND.SPEED + MIN.AIR.TEMP,
## data = train_set)
##
## Coefficients:
## (Intercept) MAX.RAINFALL LON
## 1156.4027 24.3466 -40.0285
## MEAN.ANNUAL.AIR.TEMP ALTITUDE RECORD.LENGTH
## -82.7801 -0.1746 -0.2372
## MAX.MONTHLY.WIND.SPEED MIN.AIR.TEMP
## -21.3970 5.7082
```

OK, that is the suggest model configuration. Let us evaluate the RMSE
on the training set and on the test set, and store the model and the
corresponding RMSE in our `list`

object called
`model_outcome`

.

```
# calculate rmse on training set
print(paste("RMSE on training set:", rmse(m_forward_full)))
```

`## [1] "RMSE on training set: 70.1512411541421"`

```
# prediction
m_forward_full_test_pred <- predict(m_forward_full, newdata = test_set)
# calculate RMSE for the test data set
print(paste("RMSE on test set:", rmse2(y_test, m_forward_full_test_pred)))
```

`## [1] "RMSE on test set: 87.2102349490595"`

```
# store model object and results of rmse in the `list` object named `model_outcome`
model_outcome[["forward_full"]] <- list("model" = m_forward_full,
"rmse" = data.frame("name" = "forward model",
"train_RMSE" = rmse(m_forward_full),
"test_RMSE" = rmse2(y_test, m_forward_full_test_pred)))
```

Wow! What a reduction of the RMSE for the training and the test data set. This seems to be a model configuration that performs great on the training data and generalizes as well.

The backward-stepwise model selection is very similar to the forward-stepwise model selection discussed in the previous section. Backward-stepwise selection starts with the full model, and sequentially deletes the predictor that has the least impact on the fit. Backward selection can only be used when the number of observation is larger than the number of features \((n > d)\), while forward stepwise can always be used (Hastie et al. 2008).

In R the backward-stepwise model selection is implemented in the
`step()`

function. To conduct the backward-stepwise model
selection procedure one has to specify the `direction`

argument as `direction = "backward"`

.

For the sake of comparison we apply the backward-stepwise model selection in the same manner as we did in the previous section. However, this time we start with a sophisticated model and exclude one feature after the other based on the AIC criterion.

```
# define (sophisticated) model
test_model <- lm(formula = MEAN.ANNUAL.RAINFALL ~ ALTITUDE + MAX.RAINFALL + MEAN.CLOUD.COVER + MEAN.ANNUAL.AIR.TEMP,
data = train_set)
# call the step function
step(test_model, direction = "backward")
```

```
## Start: AIC=1546.84
## MEAN.ANNUAL.RAINFALL ~ ALTITUDE + MAX.RAINFALL + MEAN.CLOUD.COVER +
## MEAN.ANNUAL.AIR.TEMP
##
## Df Sum of Sq RSS AIC
## - ALTITUDE 1 18786 2046099 1546.3
## - MEAN.ANNUAL.AIR.TEMP 1 20453 2047766 1546.5
## - MEAN.CLOUD.COVER 1 22040 2049353 1546.6
## <none> 2027313 1546.8
## - MAX.RAINFALL 1 1144027 3171340 1617.8
##
## Step: AIC=1546.34
## MEAN.ANNUAL.RAINFALL ~ MAX.RAINFALL + MEAN.CLOUD.COVER + MEAN.ANNUAL.AIR.TEMP
##
## Df Sum of Sq RSS AIC
## - MEAN.CLOUD.COVER 1 22610 2068709 1546.1
## <none> 2046099 1546.3
## - MEAN.ANNUAL.AIR.TEMP 1 91588 2137688 1551.5
## - MAX.RAINFALL 1 2546618 4592718 1676.1
##
## Step: AIC=1546.14
## MEAN.ANNUAL.RAINFALL ~ MAX.RAINFALL + MEAN.ANNUAL.AIR.TEMP
##
## Df Sum of Sq RSS AIC
## <none> 2068709 1546.1
## - MEAN.ANNUAL.AIR.TEMP 1 135089 2203798 1554.5
## - MAX.RAINFALL 1 2583116 4651824 1676.2
```

```
##
## Call:
## lm(formula = MEAN.ANNUAL.RAINFALL ~ MAX.RAINFALL + MEAN.ANNUAL.AIR.TEMP,
## data = train_set)
##
## Coefficients:
## (Intercept) MAX.RAINFALL MEAN.ANNUAL.AIR.TEMP
## 86.79 23.50 -27.32
```

Perfect! As expected the backward-stepwise model selection algorithm comes to the same conclusion. The function output is interpreted in the same manner as we discussed in the section on forward-stepwise model selection.

Let us apply the backward-stepwise model selection on the full
feature space of our data set. Further, we evaluate the RMSE of the
resulting model and store it together with the corresponding RMSE in our
`list`

object called `model_outcome`

.

```
# define (empty) model
m_backward_full_baseline <- lm(formula = MEAN.ANNUAL.RAINFALL ~., data = train_set)
# call the step function
m_backward_full <- step(m_backward_full_baseline,
direction = "backward",
trace = 0)
m_backward_full
```

```
##
## Call:
## lm(formula = MEAN.ANNUAL.RAINFALL ~ LAT + LON + RECORD.LENGTH +
## MEAN.ANNUAL.AIR.TEMP + MEAN.MONTHLY.MIN.TEMP + MAX.MONTHLY.WIND.SPEED +
## MAX.RAINFALL + MEAN.RANGE.AIR.TEMP, data = train_set)
##
## Coefficients:
## (Intercept) LAT LON
## -211.28 19.04 -42.70
## RECORD.LENGTH MEAN.ANNUAL.AIR.TEMP MEAN.MONTHLY.MIN.TEMP
## -0.22 -115.45 64.82
## MAX.MONTHLY.WIND.SPEED MAX.RAINFALL MEAN.RANGE.AIR.TEMP
## -19.28 24.14 33.82
```

```
# calculate rmse on training set
print(paste("RMSE on training set:", rmse(m_backward_full)))
```

`## [1] "RMSE on training set: 69.7386767083948"`

```
# prediction
m_backward_full_test_pred <- predict(m_backward_full, newdata = test_set)
# calculate RMSE for the test data set
print(paste("RMSE on test set:", rmse2(y_test, m_backward_full_test_pred)))
```

`## [1] "RMSE on test set: 85.1725548463445"`

```
# store model object and results of rmse in the `list` object named `model_outcome`
model_outcome[["backward_full"]] <- list("model" = m_backward_full,
"rmse" = data.frame("name" = "backward model",
"train_RMSE" = rmse(m_backward_full),
"test_RMSE" = rmse2(y_test, m_backward_full_test_pred)))
```

Nice results. We see that the “best” model configuration provided by the backward-stepwise model selection procedure is not identical with the “best” model configuration provided by the forward-stepwise model selection procedure. The RMSE of the backward-stepwise model selection procedure is slightly lower, but not that much to call out a winner.

The R software implements as well a hybrid stepwise-selection strategy that considers both forward and backward moves at each step, and selects the “best” of the two. The step function uses the AIC criterion for weighing the choices, which takes proper account of the number of parameters fit; at each step an add or drop will be performed that minimizes the AIC score.

In R the hybrid stepwise-selection procedure is implemented in the
`step()`

function, by setting the `direction`

argument to `direction = "both"`

. For matter of model
comparison we apply the hybrid stepwise-selection procedure on the full
feature space of the data set. We may either start with a full (all
features included) or a baseline (intercept only) model. As in the
previous sections we evaluate the RMSE of the resulting model and store
it together with the corresponding RMSE in our `list`

object
called `model_outcome`

.

```
# define model
m_both_full_baseline <- lm(formula = MEAN.ANNUAL.RAINFALL ~.,
data = train_set)
# call the step function
m_both_full <- step(m_both_full_baseline,
direction = "both",
trace = 0)
m_both_full
```

```
##
## Call:
## lm(formula = MEAN.ANNUAL.RAINFALL ~ LAT + LON + RECORD.LENGTH +
## MEAN.ANNUAL.AIR.TEMP + MEAN.MONTHLY.MIN.TEMP + MAX.MONTHLY.WIND.SPEED +
## MAX.RAINFALL + MEAN.RANGE.AIR.TEMP, data = train_set)
##
## Coefficients:
## (Intercept) LAT LON
## -211.28 19.04 -42.70
## RECORD.LENGTH MEAN.ANNUAL.AIR.TEMP MEAN.MONTHLY.MIN.TEMP
## -0.22 -115.45 64.82
## MAX.MONTHLY.WIND.SPEED MAX.RAINFALL MEAN.RANGE.AIR.TEMP
## -19.28 24.14 33.82
```

```
# calculate rmse on training set
print(paste("RMSE on training set:", rmse(m_both_full)))
```

`## [1] "RMSE on training set: 69.7386767083948"`

```
# prediction
m_both_full_test_pred <- predict(m_both_full, newdata = test_set)
# calculate RMSE for the test data set
print(paste("RMSE on test set:", rmse2(y_test, m_both_full_test_pred)))
```

`## [1] "RMSE on test set: 85.1725548463445"`

```
# store model object and results of rmse in the `list` object named `model_outcome`
model_outcome[["both_full"]] <- list("model" = m_both_full,
"rmse" = data.frame("name" = "hyprid model",
"train_RMSE" = rmse(m_both_full),
"test_RMSE" = rmse2(y_test, m_both_full_test_pred)))
```

Good results. The RMSE on both data sets, the training and the test data set, are in the same range as for the forward-stepwise and backward-stepwise model selection procedures. Again it is too close to call out a winner.

Before we take a look at the interim results of our “model selection comparison contest”, we recall that feature engineering techniques (transformations, basis expansions, interactions,…) considerable enhance the predictive power of our model. In this section we combine feature engineering techniques and model selection procedures.

As mentioned earlier feature engineering is sometimes more art than science, and thus there is no exact procedure to follow in terms of “best” feature engineering approaches. Feature engineering is often about domain knowledge, expertise and experience. Here in this section we do not show any of that, thus we have to apply a brute force approach.

This means that we apply transformations (taking the
`log()`

), basis expansions (power of 2), interactions
(product of each feature pair) on all of our features. Fortunately in R,
the formula-notation allows us to encode all these combinations in one
line of code. By providing the expression
`response ~ . + .^2 + .*.`

to the `step()`

function call we account for basis expansion (`.^2`

) and the
interaction (`.*.`

). The log-transformation has to be encoded
explicitly.

For the sake of simplicity we apply the hybrid stepwise-selection
strategy (`direction = "both"`

). However, this time we will
not use the AIC, but the BIC criterion for model selection. Recall that
the AIC is used by default by the `step()`

function. In order
to use the BIC criterion we add an additional argument to the
`step()`

function: `k = log(n)`

. As in the
previous sections we evaluate the RMSE of the resulting model and store
it together with the corresponding RMSE in our `list`

object
called `model.outcome`

.

```
#########################
### Brute force - BIC ###
#########################
# encode log-transformation
features <- colnames(train_set)[!(colnames(train_set) %in% c("MEAN.ANNUAL.RAINFALL", "MIN.AIR.TEMP"))]
log_transform <- paste(paste0("log(", features, ")"), collapse = " + ")
# encode formula
formula <- as.formula(paste0("MEAN.ANNUAL.RAINFALL ~ . +", log_transform, "+ .^2 + .*."))
# define model
m_both_force_BIC_baseline <- lm(formula = formula, data = train_set)
# call the step function
m_both_force_BIC <- step(m_both_force_BIC_baseline,
direction = "both",
trace = 0,
k = log(nrow(train_set)))
# calculate rmse on training set
print(paste("RMSE on training set:", rmse(m_both_force_BIC)))
```

`## [1] "RMSE on training set: 19.7287872656248"`

```
# prediction
m_both_force_BIC_test_pred <- predict(m_both_force_BIC, newdata = test_set)
# calculate RMSE for the test data set
print(paste("RMSE on test set:", rmse2(y_test, m_both_force_BIC_test_pred)))
```

`## [1] "RMSE on test set: 396.872530692526"`

```
# store model object and results of rmse in the `list` object named `model_outcome`
model_outcome[["both_force_BIC"]] <- list("model" = m_both_force_BIC,
"rmse" = data.frame("name" = "brute force BIC",
"train_RMSE" = rmse(m_both_force_BIC),
"test_RMSE" = rmse2(y_test, m_both_force_BIC_test_pred)))
# save list object for later usage
save(model_outcome, file = "model_outcome_II_30200.RData")
```

Done! We see that the RMSE on the training data set is very low, however the RMSE on the test set is quite high. This is an indicator for model overfitting. No surprise at all, as the suggested model includes 118 parameters!

`length(m_both_force_BIC$coefficients)`

`## [1] 118`

Let us visualize the interim results and check which models are
leading our “model selection comparison contest”. The
`do.call()`

function allows us to apply any function, in our
case the `rbind()`

function, to the elements in the
`list`

object `model_outcome`

.

```
# extract matrix object
res_matrix <- do.call(rbind, model_outcome)
# extract data frame object
res_df <- do.call(rbind, res_matrix[, 2])
library(ggplot2)
```

`## Warning: package 'ggplot2' was built under R version 4.2.3`

`library(reshape2)`

`## Warning: package 'reshape2' was built under R version 4.2.3`

```
# melt to tidy data frame
df <- melt(res_df,
id.vars = "name",
measure.vars = c("train_RMSE",
"test_RMSE")
)
# plot
ggplot(data = df, aes(x = name, y = value, fill = variable)) +
geom_bar(stat = "identity",
position = position_dodge()
) +
scale_fill_hue(name = "") +
xlab("") +
ylab("RMSE") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
```

The plot shows the RMSE of each model on the training data set and on the test data set. The best models with respect to RMSE on the test data set are those models, which used a stepwise parameter selection procedure. Note that the “brute force” model performs best on the training data set, but performs poorly on the test data set. Such a model is overfitting the data and is not expected to show high predictive power.

**Citation**

The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us via mail by soga[at]zedat.fu-berlin.de.

You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follow: *Hartmann,
K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis
using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.*