Let us recall our research question: We want to examine the
relationship between the response variable
MEAN.ANNUAL.RAINFALL and the other variables in the
dwd data set.
A multiple regression model is written as:
\[ \begin{align} y_i & = \beta_0 + \sum_{j=1}^d x_{ij}\beta_j + \epsilon_i \\ & = \beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + ... + \beta_dx_{id} + \epsilon_i\text{,} \quad i = 1,2,...,m, \end{align} \] where \(\beta_0\) corresponds to the intercept, \(\beta_1,...,\beta_d\) to the model coefficients, \(x_{ij}\) to the observed values of the explanatory variables, and \(\epsilon_i\) to the residual (error) term.
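For reference, the same model can be written compactly in matrix notation, where the first column of the design matrix consists of ones and accounts for the intercept:
\[ \mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}, \qquad \mathbf{y} \in \mathbb{R}^m, \quad \mathbf{X} \in \mathbb{R}^{m \times (d+1)}, \quad \boldsymbol{\beta} \in \mathbb{R}^{d+1}. \]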
One of the functions to build multiple linear regression models in R
is the lm() function. The lm() function
expects the formula notation as input; thus, a linear model is specified
as response ~ predictor(s). In order to include more than one
predictor variable we concatenate the predictors with the +
sign:
response ~ predictor1 + predictor2 + predictor3 + ....
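As a short illustration, formula objects can be created and inspected without any data; y, x1 and x2 below are placeholder names, not columns of the dwd data set:
# illustrative formula objects (placeholder variable names)
f <- y ~ x1 + x2
class(f)     # "formula"
all.vars(f)  # "y" "x1" "x2"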
The general procedure we will follow is outlined below.
Along the way we will showcase different options for feature selection, sometimes referred to as model selection or feature engineering. In order to evaluate a model we need a metric to assess its performance. In regression modelling, which deals with continuous \(d\)-dimensional feature spaces (\(x_i \in \mathbb{R}^d\)), it is quite common to use the mean squared error (MSE) or the root mean squared error (RMSE).
The RMSE is the square root of the sum of the squared differences between the observed and the predicted values, normalized by the number of observations, \(n\):
\[RMSE = \sqrt{\frac{\sum_{i=1}^n{(pred_i-obs_i)^2}}{n}}\]
The lower the RMSE, the better the model fits the data. Please note that the difference
between the observed and the predicted values is nothing else but the residual term
of a linear model. In R, the residuals of a fitted lm object are
easily accessed via the residuals() function.
For convenience, we wrap the equation into our own function, which
calculates the RMSE for a given lm object:
rmse <- function(model) {
  # RMSE from a fitted lm object: residual sum of squares over the number of observations
  sqrt(sum(residuals(model)^2) / nrow(model$model))
}
In addition to the rmse() function above we implement an
rmse2() function, which also returns the RMSE, but this time
the input is not an lm object but two vectors: one containing
the observations, the other containing the model predictions.
rmse2 <- function(obs, preds) {
sqrt(sum((obs - preds)^2) / length(obs))
}
# save function objects for later usage
save(rmse, rmse2, file = "helper_functions_30200.RData")
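As a quick sanity check, both helper functions should agree when applied to the same fitted model; a minimal sketch using the built-in cars data set:
# both helpers yield the same RMSE for a toy model
m_toy <- lm(dist ~ speed, data = cars)
rmse(m_toy)
rmse2(cars$dist, predict(m_toy))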
We start by learning the model parameters from the training data set. Then we evaluate the goodness of fit by calculating the RMSE on that same training data. After that we evaluate the model on data it has never seen before: we calculate the RMSE of the model applied to the test data set. This out-of-sample measure gives us a clue about how well our model generalizes. If the model performs well on the training data set but poorly on the test data set, this is an indication of model overfitting, which is in general a bad thing. We do not want our model to learn, or rather memorize, the training data set; we want a model that captures the unknown data-generating process underlying our observations.
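The objects train_set, test_set and y_test used throughout this section were created in the previous section and are loaded from dwd_30200.RData below. Purely for orientation, a random split along these lines could be produced as follows (a sketch; the split proportion and seed are assumptions, not the ones actually used):
# illustrative 75/25 train/test split of the dwd data (not run here;
# this section uses the split stored in dwd_30200.RData)
set.seed(2023)
idx <- sample(seq_len(nrow(dwd_data)), size = round(0.75 * nrow(dwd_data)))
train_set <- dwd_data[idx, ]
test_set <- dwd_data[-idx, ]
y_test <- test_set$MEAN.ANNUAL.RAINFALL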
The process of feature and model selection is the first step in model
development and is sometimes the most time-consuming and challenging one.
This part of the model development pipeline is more art than pure
science. Domain knowledge, experience and a solid background in
mathematics and statistics help during this endeavor. In this section we
build the baseline model. Thereafter we will build
increasingly sophisticated models. We store each of our models and
the associated prediction error (RMSE), both for the training set and
the test set, in a list object for further
usage.
# load processed data set from previous section
load("dwd_30200.RData")
# create empty list object for further usage
model_outcome <- list()
The baseline model is the simplest model we may build. It forms the
baseline in terms of RMSE. Whatever we do in the future, we should
always try to perform better than the baseline model. In our example the
baseline model is just the arithmetic mean of the response variable
MEAN.ANNUAL.RAINFALL. This model corresponds to a model
without any explanatory variable and is encoded in R by
lm(response ~ 1).
m_baseline <- lm(MEAN.ANNUAL.RAINFALL ~ 1, data = train_set)
summary(m_baseline)
##
## Call:
## lm(formula = MEAN.ANNUAL.RAINFALL ~ 1, data = train_set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -289.53 -153.53 -50.53 68.47 890.47
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 735.53 17.14 42.92 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 218.8 on 162 degrees of freedom
The intercept term corresponds to a value of 735.53. Let us compare
this value to the arithmetic mean of the response variable
MEAN.ANNUAL.RAINFALL.
mean(train_set$MEAN.ANNUAL.RAINFALL)
## [1] 735.5337
As expected, the arithmetic mean yields the same number as \(β_0\), the intercept of the baseline model.
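We can verify this identity directly:
# the intercept of the intercept-only model equals the sample mean
all.equal(unname(coef(m_baseline)), mean(train_set$MEAN.ANNUAL.RAINFALL))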
We calculate the RMSE of the baseline model on the training data set
and store the model, together with the result of our rmse()
function, in our list object named
model_outcome (note the [[ ]] notation used to
access list elements).
# calculate rmse on training set
print(paste("RMSE on training set:", rmse(m_baseline)))
## [1] "RMSE on training set: 218.144497678924"
# store model object and results of rmse in the `list` object named `model_outcome`
# Note that we build a nested list-object
model_outcome[["baseline"]] <- list("model" = m_baseline,
"rmse" = data.frame("name" = "baseline model",
"train_RMSE" = rmse(m_baseline),
"test_RMSE" = NA))
We probably anticipate that the baseline model is not particularly useful for predicting the mean annual rainfall at any given weather station in Germany. However, to strengthen our intuition - and as long as we are dealing with low-dimensional regression models - we plot the regression line of the baseline model together with the data points of our training set.
plot(train_set$MEAN.ANNUAL.RAINFALL)
abline(h = coef(m_baseline)) # horizontal line at the intercept; abline(m_baseline) would treat the single coefficient as a slope
Well, this looks as expected, but we can do better!
Also note the asymmetric residuals, which indicate non-normality caused
by the scale constraints of the variable (rainfall cannot be negative).
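To make this asymmetry visible, a quick look at the distribution of the residuals suffices (a minimal sketch):
# residuals of the intercept-only model are just the centered observations;
# the long right tail reflects the skewness of MEAN.ANNUAL.RAINFALL
hist(residuals(m_baseline), breaks = 20,
     main = "Residuals of the baseline model", xlab = "residual")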
However, we will not address this problem yet; instead, we continue
with the general procedure outlined above.
In order to wrap up the baseline model scenario let us calculate the
generalization error of the baseline model. We apply the
predict() function and feed it with the test data set. Then
we calculate the RMSE using the rmse2() function defined in
the section above. Finally, for the sake of completeness we store the
RMSE of the baseline model on the test data set in the list object named
model_outcome.
# prediction
m_baseline_pred <- predict(m_baseline, newdata = test_set)
# calculate RMSE for the test data set
print(paste("RMSE on test set:", rmse2(y_test, m_baseline_pred)))
## [1] "RMSE on test set: 280.038796947378"
# store rmse for the test data set in the proper slot of the `list` object named `model_outcome`
model_outcome[["baseline"]]$rmse$test_RMSE <- rmse2(y_test, m_baseline_pred)
Before we move on, let us take a quick look at the performance metrics of the baseline model:
model_outcome[["baseline"]]$rmse
## name train_RMSE test_RMSE
## 1 baseline model 218.1445 280.0388
In a simple linear regression analysis the relationship between two variables is modeled as
\[y=f(x)=\beta_0+\beta_1x.\] As in the baseline model scenario above, we follow our general procedure: we fit the model on the training data, compute the training RMSE, evaluate the model on the test data, and finally compare the results with the previous models.
Immediately the question arises: Which feature should we pick? A heuristic
approach is to check the correlations between the response variable and
the explanatory variables. We have already done that in a previous
section. The correlation analysis revealed that the explanatory variable
ALTITUDE is positively correlated with the response
variable MEAN.ANNUAL.RAINFALL.
p_cor <- cor(dwd_data$ALTITUDE, dwd_data$MEAN.ANNUAL.RAINFALL)
p_cor
## [1] 0.7585322
The Pearson correlation coefficient is \(r=\) 0.76. Thus, we expect that
extending the baseline model with the explanatory variable
ALTITUDE will significantly increase the predictive power of
the model.
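As a side note, the same heuristic can be applied to all numeric columns at once; a sketch, assuming dwd_data contains further numeric variables besides the ones used here:
# correlation of every numeric variable with the response, sorted in decreasing order
num_cols <- sapply(dwd_data, is.numeric)
sort(cor(dwd_data[, num_cols], use = "complete.obs")[, "MEAN.ANNUAL.RAINFALL"],
     decreasing = TRUE)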
OK, let us build a simple linear regression model to predict mean annual rainfall recorded at weather stations across Germany by the altitude of the weather stations.
m_simple_alt <- lm(MEAN.ANNUAL.RAINFALL ~ ALTITUDE,
data = train_set)
summary(m_simple_alt)
##
## Call:
## lm(formula = MEAN.ANNUAL.RAINFALL ~ ALTITUDE, data = train_set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -301.70 -99.49 -36.53 124.13 519.01
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 594.10188 14.48781 41.01 <2e-16 ***
## ALTITUDE 0.58027 0.03841 15.11 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 141.2 on 161 degrees of freedom
## Multiple R-squared: 0.5864, Adjusted R-squared: 0.5838
## F-statistic: 228.2 on 1 and 161 DF, p-value: < 2.2e-16
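As a side note, the \(R^2\) reported in this summary can be recomputed by hand from the training residuals (a minimal sketch):
# R^2 = 1 - SSE/SST; should match summary(m_simple_alt)$r.squared
sse <- sum(residuals(m_simple_alt)^2)
sst <- sum((train_set$MEAN.ANNUAL.RAINFALL - mean(train_set$MEAN.ANNUAL.RAINFALL))^2)
1 - sse / sst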
By looking at the model summary we see that the explanatory
variable ALTITUDE is statistically significant. Further, we
see an \(R^2=\) 0.59, which is the proportion of the variation in the
observed values that is explained by the regression model. The numbers
look good, so let's plot the data:
plot(test_set$ALTITUDE, test_set$MEAN.ANNUAL.RAINFALL)
abline(m_simple_alt)
Well, apparently there is some truth in the regression model we built. Let us continue with our procedure and calculate the RMSE on the training data set and then on the test data. Afterwards we compare the results with the baseline model.
# calculate rmse on training set
print(paste("RMSE on training set:", rmse(m_simple_alt)))
## [1] "RMSE on training set: 140.29713805265"
# prediction
m_simple_alt_pred <- predict(m_simple_alt, newdata = test_set)
# calculate RMSE for the test data set
print(paste("RMSE on test set:", rmse2(y_test, m_simple_alt_pred)))
## [1] "RMSE on test set: 187.905378975602"
# store model object and results of rmse in the `list` object named `model_outcome`
model_outcome[["simple_alt"]] <- list("model" = m_simple_alt,
"rmse" = data.frame("name" = "simple ALT",
"train_RMSE" = rmse(m_simple_alt),
"test_RMSE" = rmse2(y_test, m_simple_alt_pred)))
Let us compare the results we obtained so far. To do so, we simply apply
the rbind() function to the rmse data frame
objects stored in the list object
model_outcome.
rbind(model_outcome[["baseline"]]$rmse,
model_outcome[["simple_alt"]]$rmse)
## name train_RMSE test_RMSE
## 1 baseline model 218.1445 280.0388
## 2 simple ALT 140.2971 187.9054
Great! We dramatically increased the predictive power of our model, both on the training data set and on the test data set.
Let us build one more simple regression model. This time we pick the
variable MAX.RAINFALL as our explanatory variable. It seems
reasonable to consider the amount of the highest rainfall event at a
particular weather station as a proxy for the mean annual rainfall at
that weather station.
# build the model
m_simple_rain <- lm(MEAN.ANNUAL.RAINFALL ~ MAX.RAINFALL,
data = train_set)
summary(m_simple_rain)
##
## Call:
## lm(formula = MEAN.ANNUAL.RAINFALL ~ MAX.RAINFALL, data = train_set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -273.97 -76.21 10.93 97.29 295.44
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -269.258 50.722 -5.309 3.62e-07 ***
## MAX.RAINFALL 26.863 1.334 20.141 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 117 on 161 degrees of freedom
## Multiple R-squared: 0.7159, Adjusted R-squared: 0.7141
## F-statistic: 405.7 on 1 and 161 DF, p-value: < 2.2e-16
# calculate rmse on training set
print(paste("RMSE on training set:", rmse(m_simple_rain)))
## [1] "RMSE on training set: 116.27652956139"
# prediction
m_simple_rain_pred <- predict(m_simple_rain, newdata = test_set)
# calculate RMSE for the test data set
print(paste("RMSE on test set:", rmse2(y_test, m_simple_rain_pred)))
## [1] "RMSE on test set: 129.939454857187"
# store model object and results of rmse in the `list` object named `model_outcome`
model_outcome[["simple_rain"]] <- list("model" = m_simple_rain,
"rmse" = data.frame("name" = "simple RAIN",
"train_RMSE" = rmse(m_simple_rain),
"test_RMSE" = rmse2(y_test, m_simple_rain_pred)))
Let us once again compare the results we obtained so far.
rbind(model_outcome[["baseline"]]$rmse,
model_outcome[["simple_alt"]]$rmse,
model_outcome[["simple_rain"]]$rmse)
## name train_RMSE test_RMSE
## 1 baseline model 218.1445 280.0388
## 2 simple ALT 140.2971 187.9054
## 3 simple RAIN 116.2765 129.9395
Awesome! Once again we increased the predictive power of our model.
Just by looking at the numbers in the table above, one question should
immediately come to mind: What if we combine both simple linear
regression models and try to predict the response variable
MEAN.ANNUAL.RAINFALL using both explanatory variables? …
Welcome to the domain of multiple linear
regression!
Recall the multiple regression model:
\[ y_i = \beta_0 + \sum_{j=1}^dx_{ij}\beta_j + \epsilon_i \text{,} \quad i = 1,2,...,m. \]
For multiple linear regression analysis, the formula supplied to the built-in
lm() function is extended by concatenating additional
explanatory variables with the + sign. In order to
encode the following model
\[\text{MEAN.ANNUAL.RAINFALL}=\beta_0+\beta_1\times\text{ALTITUDE}+\beta_2\times\text{MAX.RAINFALL},\]
we write the formula expression as
MEAN.ANNUAL.RAINFALL ~ ALTITUDE + MAX.RAINFALL. Let us give
it a try: first we build the regression model, then we calculate the
RMSE on the training data set and on the test data set, and finally we
compare the models.
# build the model
m_multi_alt_rain <- lm(MEAN.ANNUAL.RAINFALL ~ ALTITUDE + MAX.RAINFALL, data = train_set)
summary(m_multi_alt_rain)
##
## Call:
## lm(formula = MEAN.ANNUAL.RAINFALL ~ ALTITUDE + MAX.RAINFALL,
## data = train_set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -290.438 -79.815 2.469 95.951 305.532
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -96.4220 75.1569 -1.283 0.20137
## ALTITUDE 0.1658 0.0543 3.054 0.00264 **
## MAX.RAINFALL 21.1612 2.2751 9.301 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 114.1 on 160 degrees of freedom
## Multiple R-squared: 0.7315, Adjusted R-squared: 0.7282
## F-statistic: 218 on 2 and 160 DF, p-value: < 2.2e-16
# calculate rmse on training set
print(paste("RMSE on training set:", rmse(m_multi_alt_rain)))
## [1] "RMSE on training set: 113.028391095032"
# prediction
m_multi_alt_rain_pred <- predict(m_multi_alt_rain, newdata = test_set)
# calculate RMSE for the test data set
print(paste("RMSE on test set:", rmse2(y_test, m_multi_alt_rain_pred)))
## [1] "RMSE on test set: 130.577583526652"
# store model object and results of rmse in the `list` object named `model_outcome`
model_outcome[["multi_alt_rain"]] <- list("model" = m_multi_alt_rain,
"rmse" = data.frame("name" = "multiple ALT and RAIN",
"train_RMSE" = rmse(m_multi_alt_rain),
"test_RMSE" = rmse2(y_test, m_multi_alt_rain_pred)))
# save list object for later usage
save(model_outcome, file = "model_outcome_I_30200.RData")
The model summary indicates that both explanatory variables are statistically significant. Together, the two explanatory variables account for approximately 73% of the total variance (\(R^2=\) 0.7315). The numbers look fine, so let's plot the data:
library(scatterplot3d)
## Warning: package 'scatterplot3d' was built under R version 4.2.3
scatter_3d <- with(test_set, scatterplot3d(ALTITUDE, MAX.RAINFALL,
MEAN.ANNUAL.RAINFALL,
pch = 16, highlight.3d = TRUE,
angle = 25))
scatter_3d$plane3d(m_multi_alt_rain)
As already observed above, the skewness of our variables suggests a
non-linear relation. Thus, we should take the scale constraints of our
features into account and transform the variables up front, at least by
a log or logistic transformation!
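One way such a transformation could look is sketched below; only the strictly positive rainfall variables are log-transformed here, and this variant is not pursued further in this section:
# sketch of a partially log-transformed model; predictions would have to be
# back-transformed with exp() before computing an RMSE comparable to the models above
m_multi_log <- lm(log(MEAN.ANNUAL.RAINFALL) ~ ALTITUDE + log(MAX.RAINFALL),
                  data = train_set)
summary(m_multi_log)$r.squared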
However, let us once again compare the results we obtained so far.
rbind(model_outcome[["baseline"]]$rmse,
model_outcome[["simple_alt"]]$rmse,
model_outcome[["simple_rain"]]$rmse,
model_outcome[["multi_alt_rain"]]$rmse)
## name train_RMSE test_RMSE
## 1 baseline model 218.1445 280.0388
## 2 simple ALT 140.2971 187.9054
## 3 simple RAIN 116.2765 129.9395
## 4 multiple ALT and RAIN 113.0284 130.5776
We see that the multiple regression model reduced the RMSE on the training set, but the RMSE on the test set was not reduced. That means that the added flexibility allows the model to fit the training data better than all models developed so far; however, the higher flexibility did not improve the generalization properties of the model. Such a finding calls for caution, as we should avoid overfitting the model.
Finally, let us visualize the interim results and check which models
are leading our “model selection comparison contest”. The
do.call() function allows us to apply any function, in our
case the rbind() function, to the elements in the
list object model_outcome.
# extract matrix object
res_matrix <- do.call(rbind, model_outcome)
# extract data frame object
res_df <- do.call(rbind, res_matrix[, 2])
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.2.3
library(reshape2)
## Warning: package 'reshape2' was built under R version 4.2.3
# melt to tidy data frame
df <- melt(res_df, id.vars = "name",
measure.vars = c("train_RMSE", "test_RMSE"))
# plot
ggplot(data = df, aes(x = name, y = value, fill = variable)) +
geom_bar(stat = "identity", position = position_dodge()) +
scale_fill_hue(name = "") +
xlab("") + ylab("RMSE") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Citation
The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us via email at soga[at]zedat.fu-berlin.de.
Please cite as follows: Hartmann, K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.