Let us recall our research question: We want to examine the relationship between the response variable MEAN.ANNUAL.RAINFALL and the other variables in the dwd data set.

A multiple regression model is written as:

\[ \begin{align} y_i & = \beta_0 + \sum_{j=1}^dx_{ij}\beta_j + \epsilon_i \\ & = \beta_0 + \beta_1x_{i1} + \beta_2x_{i2} +...+ \beta_dx_{id} + \epsilon_i \text{,} \quad i = 1,2,...,m \end{align} \] where \(\beta_0\) corresponds to the intercept, \(\beta_1,...,\beta_d\) correspond to the model coefficients, \(x_{ij}\) to the observed values of the explanatory variables, and \(\epsilon_i\) to the error term, which appears as the residuals of the fitted model.

One of the functions to build multiple linear regression models in R is the lm() function. The lm() function expects the formula notation as input; thus, a linear model is specified as response ~ predictor(s). To include more than one predictor variable we concatenate them with the + sign: response ~ predictor1 + predictor2 + predictor3 + ....
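
As a tiny, self-contained illustration of the formula syntax - the data frame toy and all of its variables below are made up solely for this example - we fit a model on simulated data:

# simulated data, for illustration of the formula notation only
set.seed(123)
toy <- data.frame(x1 = rnorm(50), x2 = rnorm(50), x3 = rnorm(50))
toy$y <- 1 + 2 * toy$x1 - 0.5 * toy$x2 + rnorm(50)
# response ~ predictor1 + predictor2 + predictor3
coef(lm(y ~ x1 + x2 + x3, data = toy))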

The general procedure we will follow is: first, learn the model parameters on the training data set; second, evaluate the goodness of fit by computing the RMSE on the training data; third, evaluate the generalization error by computing the RMSE on the test data set.

Along the way we will showcase different options for feature selection, sometimes referred to as model selection or feature engineering. In order to evaluate a model we define a metric to assess model performance. In regression modelling, which deals with continuous \(d\)-dimensional feature spaces (\(x_i \in \mathbb{R}^d\)), it is quite common to use the mean squared error (MSE) or the root mean squared error (RMSE).

The RMSE is the square root of the sum of the squared differences between the observed and the predicted values, normalized by the number of observations, \(n\).

\[RMSE = \sqrt{\frac{\sum_{i=1}^n{(pred_i-obs_i)^2}}{n}}\] The lower the RMSE, the better the model fits the data. Please note that the difference between the observed and the predicted values is nothing else but the residual term of a linear model. In R, for a given lm object the residuals are easily accessed by the residuals() function. For convenience, we wrap the equation into our own function, which calculates the RMSE given an lm object:

rmse <- function(model) {
  # RMSE of a fitted lm object: residuals() extracts the residuals,
  # nrow(model$model) gives the number of observations in the model frame
  sqrt(sum(residuals(model)^2) / nrow(model$model))
}

In addition to the rmse() function from above we implement an rmse2() function, which returns the RMSE as well, but this time the input is not an lm object but two vectors, one containing the observations, the other containing the model predictions.

rmse2 <- function(obs, preds) {
  # RMSE computed from a vector of observations and a vector of predictions
  sqrt(sum((obs - preds)^2) / length(obs))
}
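
As a quick sanity check we can confirm that both helper functions agree; the simulated data frame sim below exists only for this check:

# both helpers should yield the same value for a fitted lm object
set.seed(1)
sim <- data.frame(x = 1:20)
sim$y <- 3 + 2 * sim$x + rnorm(20)
fit <- lm(y ~ x, data = sim)
all.equal(rmse(fit), rmse2(sim$y, predict(fit)))  # should return TRUE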

# save function objects for later usage
save(rmse, rmse2, file = "helper_functions_30200.RData")

We start by learning the model parameters from the training data set. Then we evaluate the goodness of fit by calculating the RMSE on the training data. After that we evaluate the model on data that the model has never seen before: we calculate the RMSE of the model applied to the test data set. This out-of-sample measure provides a clue about how well our model generalizes. If the model performs well on the training data set but poorly on the test data set, this is an indication of model overfitting, which is in general a bad thing. That is because we do not want our model to learn, or rather memorize, the training data set; we want to find a model that accounts for the unknown data-generating process underlying our observations.
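
For reference, a random 70/30 train/test split could be created along the following lines; this is only a sketch with placeholder names (my_data, train_demo, test_demo), since the actual split of the dwd data was already prepared in a previous section:

# sketch of a random 70/30 train/test split on a placeholder data frame
set.seed(42)
my_data <- data.frame(x = rnorm(100), y = rnorm(100))
idx <- sample(seq_len(nrow(my_data)), size = round(0.7 * nrow(my_data)))
train_demo <- my_data[idx, ]
test_demo  <- my_data[-idx, ]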


Feature and model selection

The process of feature and model selection is the first step in model development and sometimes the most time-consuming and challenging one. This part of the model development pipeline is more art than pure science. Domain knowledge, experience and a solid background in mathematics and statistics help during this endeavor. In this section we build the baseline model. Thereafter we will build increasingly sophisticated models. We store each of our models and the associated prediction error (RMSE), for both the training set and the test set, in a list object for further usage.

# load processed data set from previous section
load("dwd_30200.RData")
# create empty list object for further usage
model_outcome <- list()

The baseline model

The baseline model is the simplest model we may build. It forms the baseline in terms of RMSE. Whatever we do in the future, we should always try to perform better than the baseline model. In our example the baseline model is just the arithmetic mean of the response variable MEAN.ANNUAL.RAINFALL. This model corresponds to a model without any explanatory variable and is encoded in R by lm(response ~ 1).

m_baseline <- lm(MEAN.ANNUAL.RAINFALL ~ 1, data = train_set)
summary(m_baseline)
## 
## Call:
## lm(formula = MEAN.ANNUAL.RAINFALL ~ 1, data = train_set)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -289.53 -153.53  -50.53   68.47  890.47 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   735.53      17.14   42.92   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 218.8 on 162 degrees of freedom

The intercept term corresponds to a value of 735.53. Let us compare this value with the arithmetic mean of the response variable MEAN.ANNUAL.RAINFALL.

mean(train_set$MEAN.ANNUAL.RAINFALL)
## [1] 735.5337

As expected, the arithmetic mean yields the same number as \(\beta_0\), the intercept of the baseline model.
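
We can verify this equivalence programmatically:

# the intercept of the intercept-only model equals the sample mean
all.equal(unname(coef(m_baseline)), mean(train_set$MEAN.ANNUAL.RAINFALL))  # should return TRUE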

We calculate the RMSE of the baseline model on the training data set and store the model, as well as the result of our rmse() function, in our list object named model_outcome (note the [[ ]] notation used to access a list element).

# calculate rmse on training set
print(paste("RMSE on training set:", rmse(m_baseline)))
## [1] "RMSE on training set: 218.144497678924"
# store model object and results of rmse in the `list` object named `model_outcome`
# Note that we build a nested list-object
model_outcome[["baseline"]] <- list("model" = m_baseline,
                                    "rmse" =  data.frame("name" = "baseline model",
                                                         "train_RMSE" = rmse(m_baseline),
                                                         "test_RMSE" = NA))

We probably anticipate that the baseline model is not particularly useful for predicting the mean annual rainfall at any given weather station in Germany. However, to strengthen our intuition - and as long as we are dealing with low-dimensional regression models - we plot the regression line of the baseline model together with the data points in our training set.

plot(train_set$MEAN.ANNUAL.RAINFALL)
# the baseline model is a constant, so we draw its fitted value as a horizontal line
abline(h = coef(m_baseline))

Well, this looks as expected, but we can do better!
Also note the asymmetric residuals, which indicate non-normality caused by the bounded scale of the response (rainfall cannot be negative).
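
A quick histogram of the residuals (a minimal sketch using base R graphics) makes this asymmetry visible:

# inspect the distribution of the baseline model residuals
hist(residuals(m_baseline),
     main = "Residuals of the baseline model", xlab = "residual")
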
However, we will not address this issue yet and continue as follows.

Recall our general procedure outlined above: fit the model on the training data, compute the RMSE on the training data, then compute the RMSE on the test data.

In order to wrap up the baseline model scenario let us calculate the generalization error of the baseline model. We apply the predict() function and feed it with the test data set. Then we calculate the RMSE using the rmse2() function defined in the section above. Finally, for the sake of completeness we store the RMSE of the baseline model on the test data set in the list object named model_outcome.

# prediction
m_baseline_pred <- predict(m_baseline, newdata = test_set)

# calculate RMSE for the test data set
print(paste("RMSE on test set:", rmse2(y_test, m_baseline_pred)))
## [1] "RMSE on test set: 280.038796947378"
# store rmse for the test data set in the proper slot of the `list` object named `model_outcome`
model_outcome[["baseline"]]$rmse$test_RMSE <- rmse2(y_test, m_baseline_pred)

Before we move on let us just take a quick summarizing look at the model performance metrics of the baseline model:

model_outcome[["baseline"]]$rmse
##             name train_RMSE test_RMSE
## 1 baseline model   218.1445  280.0388

The simple linear regression model

In a simple linear regression analysis the relationship between two variables is modeled as

\[y=f(x)=\beta_0+\beta_1x.\] As in the baseline model scenario in the section above, we follow the same general procedure.

Immediately the question arises: which feature should we pick? A heuristic approach is to check for correlations between the response variable and the explanatory variables. We have already done that in a previous section. The correlation analysis revealed that the explanatory variable ALTITUDE is positively correlated with the response variable MEAN.ANNUAL.RAINFALL.

p_cor <- cor(dwd_data$ALTITUDE, dwd_data$MEAN.ANNUAL.RAINFALL)
p_cor
## [1] 0.7585322

The Pearson correlation coefficient is \(r=\) 0.76. Thus, we expect that extending the baseline model with the explanatory variable ALTITUDE will substantially increase the predictive power of the model.
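
If we wanted a quick overview of further candidate predictors, we could rank all numeric variables by the absolute value of their correlation with the response. The sketch below assumes that dwd_data may contain additional numeric columns beyond the ones named in this section:

# rank numeric variables by their absolute Pearson correlation with the response
num_vars <- vapply(dwd_data, is.numeric, logical(1))
cors <- cor(dwd_data[, num_vars], use = "pairwise.complete.obs")[, "MEAN.ANNUAL.RAINFALL"]
sort(abs(cors), decreasing = TRUE)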

OK, let us build a simple linear regression model to predict the mean annual rainfall recorded at weather stations across Germany from the altitude of the weather stations.

m_simple_alt <- lm(MEAN.ANNUAL.RAINFALL ~ ALTITUDE,
                   data = train_set)
summary(m_simple_alt)
## 
## Call:
## lm(formula = MEAN.ANNUAL.RAINFALL ~ ALTITUDE, data = train_set)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -301.70  -99.49  -36.53  124.13  519.01 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 594.10188   14.48781   41.01   <2e-16 ***
## ALTITUDE      0.58027    0.03841   15.11   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 141.2 on 161 degrees of freedom
## Multiple R-squared:  0.5864, Adjusted R-squared:  0.5838 
## F-statistic: 228.2 on 1 and 161 DF,  p-value: < 2.2e-16

Well, by looking at the model summary we see that the explanatory variable ALTITUDE is statistically significant. Further, we see an \(R^2\) of 0.59, which is the proportion of variation in the observed values explained by the regression model. The numbers look good, so let's plot the data:

# scatter plot of the test data together with the regression line learned on the training data
plot(test_set$ALTITUDE, test_set$MEAN.ANNUAL.RAINFALL)
abline(m_simple_alt)

Well, apparently there is some truth in the regression model we built. Let us continue with our procedure and calculate the RMSE on the training data set and then on the test data. Afterwards we compare the results with the baseline model.

# calculate rmse on training set
print(paste("RMSE on training set:", rmse(m_simple_alt)))
## [1] "RMSE on training set: 140.29713805265"
# prediction
m_simple_alt_pred <- predict(m_simple_alt, newdata = test_set)
# calculate RMSE for the test data set
print(paste("RMSE on test set:", rmse2(y_test, m_simple_alt_pred)))
## [1] "RMSE on test set: 187.905378975602"
# store model object and results of rmse in the `list` object named `model_outcome`
model_outcome[["simple_alt"]] <- list("model" = m_simple_alt,
                                      "rmse" =  data.frame("name" = "simple ALT",
                                                           "train_RMSE" = rmse(m_simple_alt),
                                                           "test_RMSE" = rmse2(y_test, m_simple_alt_pred)))

Let us compare the results we obtained so far. To do so, we simply apply the rbind() function to the rmse data frame objects stored in the list object model_outcome.

rbind(model_outcome[["baseline"]]$rmse,
      model_outcome[["simple_alt"]]$rmse)
##             name train_RMSE test_RMSE
## 1 baseline model   218.1445  280.0388
## 2     simple ALT   140.2971  187.9054

Great! We dramatically increased the predictive power of our model, both on the training data set and on the test data set.

Let us build one more simple regression model. This time we pick the variable MAX.RAINFALL as our explanatory variable. It seems reasonable to consider the amount of the highest rainfall event at a particular weather station as a proxy for the mean annual rainfall at that weather station.

# build the model
m_simple_rain <- lm(MEAN.ANNUAL.RAINFALL ~ MAX.RAINFALL,
                    data = train_set)
summary(m_simple_rain)
## 
## Call:
## lm(formula = MEAN.ANNUAL.RAINFALL ~ MAX.RAINFALL, data = train_set)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -273.97  -76.21   10.93   97.29  295.44 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -269.258     50.722  -5.309 3.62e-07 ***
## MAX.RAINFALL   26.863      1.334  20.141  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 117 on 161 degrees of freedom
## Multiple R-squared:  0.7159, Adjusted R-squared:  0.7141 
## F-statistic: 405.7 on 1 and 161 DF,  p-value: < 2.2e-16
# calculate rmse on training set
print(paste("RMSE on training set:", rmse(m_simple_rain)))
## [1] "RMSE on training set: 116.27652956139"
# prediction
m_simple_rain_pred <- predict(m_simple_rain, newdata = test_set)

# calculate RMSE for the test data set
print(paste("RMSE on test set:", rmse2(y_test, m_simple_rain_pred)))
## [1] "RMSE on test set: 129.939454857187"
# store model object and results of rmse in the `list` object named `model_outcome`
model_outcome[["simple_rain"]] <- list("model" = m_simple_rain,
                                       "rmse" =  data.frame("name" = "simple RAIN",
                                                            "train_RMSE" = rmse(m_simple_rain),
                                                            "test_RMSE" = rmse2(y_test, m_simple_rain_pred)))

Let us once again compare the results we obtained so far.

rbind(model_outcome[["baseline"]]$rmse,
      model_outcome[["simple_alt"]]$rmse,
      model_outcome[["simple_rain"]]$rmse)
##             name train_RMSE test_RMSE
## 1 baseline model   218.1445  280.0388
## 2     simple ALT   140.2971  187.9054
## 3    simple RAIN   116.2765  129.9395

Awesome! Once again we increased the predictive power of our model. Just by looking at the numbers in the table above one question should immediately come to our mind: What if we combine both simple linear regression models and try to predict the response variable MEAN.ANNUAL.RAINFALL by using both explanatory variables? … Welcome to the domain of multiple linear regression!


The multiple linear regression model

Recall the multiple regression model:

\[ y_i = \beta_0 + \sum_{j=1}^dx_{ij}\beta_j + \epsilon_i \text{,} \quad i = 1,2,...,m. \]
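
In matrix notation the same model reads \(y = X\beta + \epsilon\), and the least-squares estimate is \(\hat{\beta} = (X^TX)^{-1}X^Ty\). As a small sketch, using the two predictors motivated in the previous section, these coefficients can be computed by solving the normal equations directly; the result should agree with what lm() reports below:

# solve the normal equations directly; should agree with the lm() coefficients below
X <- model.matrix(~ ALTITUDE + MAX.RAINFALL, data = train_set)
solve(t(X) %*% X, t(X) %*% train_set$MEAN.ANNUAL.RAINFALL)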

In practice, the formula supplied to the built-in lm() function is simply extended by concatenating additional explanatory variables with the + sign. In order to encode the following model

\[\text{MEAN.ANNUAL.RAINFALL}=\beta_0+\beta_1\times\text{ALTITUDE}+\beta_2\times\text{MAX.RAINFALL},\] we encode the formula expression as MEAN.ANNUAL.RAINFALL ~ ALTITUDE + MAX.RAINFALL. Let’s give it a try: First we build the regression model, then we calculate the RMSE on the training data set, then on the test data, and then we compare the models.

# build the model
m_multi_alt_rain <- lm(MEAN.ANNUAL.RAINFALL ~ ALTITUDE + MAX.RAINFALL, data = train_set)
summary(m_multi_alt_rain)
## 
## Call:
## lm(formula = MEAN.ANNUAL.RAINFALL ~ ALTITUDE + MAX.RAINFALL, 
##     data = train_set)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -290.438  -79.815    2.469   95.951  305.532 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -96.4220    75.1569  -1.283  0.20137    
## ALTITUDE       0.1658     0.0543   3.054  0.00264 ** 
## MAX.RAINFALL  21.1612     2.2751   9.301  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 114.1 on 160 degrees of freedom
## Multiple R-squared:  0.7315, Adjusted R-squared:  0.7282 
## F-statistic:   218 on 2 and 160 DF,  p-value: < 2.2e-16
# calculate rmse on training set
print(paste("RMSE on training set:", rmse(m_multi_alt_rain)))
## [1] "RMSE on training set: 113.028391095032"
# prediction
m_multi_alt_rain_pred <- predict(m_multi_alt_rain, newdata = test_set)

# calculate RMSE for the test data set
print(paste("RMSE on test set:", rmse2(y_test, m_multi_alt_rain_pred)))
## [1] "RMSE on test set: 130.577583526652"
# store model object and results of rmse in the `list` object named `model_outcome`
model_outcome[["multi_alt_rain"]] <- list("model" = m_multi_alt_rain,
                                          "rmse" = data.frame("name" = "multiple ALT and RAIN",
                                                              "train_RMSE" = rmse(m_multi_alt_rain),
                                                              "test_RMSE" = rmse2(y_test, m_multi_alt_rain_pred)))

# save list object for later usage
save(model_outcome, file = "model_outcome_I_30200.RData")

The model summary indicates that both explanatory variables are statistically significant. Together, the two explanatory variables account for approximately \(R^2=\) 73.15% of the total variance. The numbers look fine, so let's plot the data:

library(scatterplot3d)
scatter_3d <- with(test_set, scatterplot3d(ALTITUDE, MAX.RAINFALL,
                                           MEAN.ANNUAL.RAINFALL,
                             pch = 16, highlight.3d = TRUE,
                             angle = 25))
scatter_3d$plane3d(m_multi_alt_rain)

As already observed above, the skewness of our variables suggests a non-linear relation.
Thus, we should keep the scale constraints of the features in mind and consider transforming our variables up front, at least by a log or a logistic transformation.

However, let us once again compare the results we obtained so far.

rbind(model_outcome[["baseline"]]$rmse,
      model_outcome[["simple_alt"]]$rmse,
      model_outcome[["simple_rain"]]$rmse,
      model_outcome[["multi_alt_rain"]]$rmse)
##                    name train_RMSE test_RMSE
## 1        baseline model   218.1445  280.0388
## 2            simple ALT   140.2971  187.9054
## 3           simple RAIN   116.2765  129.9395
## 4 multiple ALT and RAIN   113.0284  130.5776

We see that the multiple regression model reduced the RMSE on the training set, but the RMSE on the test set was not reduced. That means that the added flexibility causes the model to fit the training data better than all models developed so far; however, the higher flexibility did not improve the generalization properties of the model. Such a finding calls for caution, as we should avoid overfitting the model.
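
As an aside, a more stable estimate of the generalization error than a single train/test split can be obtained by k-fold cross-validation. The following is only a minimal sketch on the training data; the number of folds and the seed are arbitrary choices:

# minimal 5-fold cross-validation sketch for the two-predictor model
set.seed(1)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(train_set)))
cv_rmse <- sapply(1:k, function(i) {
  fit <- lm(MEAN.ANNUAL.RAINFALL ~ ALTITUDE + MAX.RAINFALL,
            data = train_set[folds != i, ])
  rmse2(train_set$MEAN.ANNUAL.RAINFALL[folds == i],
        predict(fit, newdata = train_set[folds == i, ]))
})
mean(cv_rmse)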


Interim results I

Finally, let us visualize the interim results and check which models are leading our “model selection comparison contest”. The do.call() function allows us to apply any function, in our case the rbind() function, to the elements of the list object model_outcome.

# bind the per-model lists into a matrix (one row per model)
res_matrix <- do.call(rbind, model_outcome)
# bind the rmse data frames (second column) into a single data frame
res_df <- do.call(rbind, res_matrix[, 2])

library(ggplot2)
library(reshape2)
# melt to tidy data frame
df <- melt(res_df, id.vars = "name",
           measure.vars = c("train_RMSE", "test_RMSE"))

# plot
ggplot(data = df, aes(x = name, y = value, fill = variable)) +
  geom_bar(stat = "identity", position = position_dodge()) +
  scale_fill_hue(name = "") +
  xlab("") + ylab("RMSE") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

