So far we have focused on an example data set with just two explanatory variables. To get some hands-on experience, we now advance to a data set with more than two explanatory variables.

In order to showcase multiple linear regression analysis we
**examine the relationship between the response variable
MEAN_ANNUAL_RAINFALL and the other variables in the
dwd data set**.

You may download the `DWD.csv` file here.
Import the data set and assign it a proper name. Note that we use the
`read.csv2()` function to load the data set. The `read.csv()` and
`read.csv2()` functions are identical except for their defaults:
`read.csv2()` is suited to data originating from countries that use a
comma as the decimal separator and a semicolon as the field separator.

```
dwd <- read.csv2("https://userpage.fu-berlin.de/soga/data/raw-data/DWD.csv",
                 stringsAsFactors = FALSE, sep = ",")
```
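The difference between the two readers' defaults can be illustrated on tiny in-memory strings (a minimal sketch; the `text =` argument of `read.csv()`/`read.csv2()` lets us parse a string instead of a file):

```r
# read.csv():  defaults sep = ",", dec = "."
# read.csv2(): defaults sep = ";", dec = ","
csv_intl <- "a,b\n1.5,2.5"   # comma-separated, dot as decimal mark
csv_de   <- "a;b\n1,5;2,5"   # semicolon-separated, comma as decimal mark
d1 <- read.csv(text = csv_intl)
d2 <- read.csv2(text = csv_de)
d1$a  # 1.5
d2$a  # 1.5 -- the same numeric value, parsed from "1,5"
```

Both readers arrive at the same numeric values; only the expected file format differs.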

The *dwd* data set consists of 599 rows, each of them
representing a particular weather station in Germany, and 22 columns,
each of them corresponding to a variable/feature related to that
particular weather station. These self-explanatory variables are: ID,
DWD_ID, STATION_NAME, FEDERAL_STATE, LAT, LON, ALTITUDE, PERIOD,
RECORD_LENGTH, MEAN_ANNUAL_AIR_TEMP, MEAN_MONTHLY_MAX_TEMP,
MEAN_MONTHLY_MIN_TEMP, MEAN_ANNUAL_WIND_SPEED, MEAN_CLOUD_COVER,
MEAN_ANNUAL_SUNSHINE, MEAN_ANNUAL_RAINFALL, MAX_MONTHLY_WIND_SPEED,
MAX_AIR_TEMP, MAX_WIND_SPEED, MAX_RAINFALL, MIN_AIR_TEMP,
MEAN_RANGE_AIR_TEMP.

The data was downloaded from the DWD (German Weather Service) data portal on April 21, 2017. You may find a detailed description of the data set here. Please note that for the purpose of this tutorial the data set was preprocessed and columns have been renamed.

Before we dive into the modelling we first have to prepare our data
set. Calling the `str()` function on the data set reveals features
such as `ID`, `DWD_ID`, and `STATION_NAME`, among others, which we
know, or at least presumably know, do not carry any information useful
for predicting the mean annual rainfall. Thus, we exclude them from
the data set. In addition, we make sure to exclude observations that
contain missing values (`NA`).

`str(dwd)`

```
## 'data.frame': 599 obs. of 22 variables:
## $ ID : int 0 1 2 6 8 9 10 12 14 18 ...
## $ DWD_ID : int 1 3 44 71 73 78 91 98 116 132 ...
## $ STATION_NAME : chr "Aach" "Aachen" "Großenkneten" "Albstadt-Badkap" ...
## $ FEDERAL_STATE : chr "Baden-Württemberg" "Nordrhein-Westfalen" "Niedersachsen" "Baden-Württemberg" ...
## $ LAT : chr "47.8413" "50.7827" "52.9335" "48.2156" ...
## $ LON : chr "8.8493" "6.0941" "8.237" "8.9784" ...
## $ ALTITUDE : chr "478.0" "202.0" "44.0" "759.0" ...
## $ PERIOD : chr "1931-1986" "1851-2011" "1971-2016" "1986-2016" ...
## $ RECORD_LENGTH : int 55 160 45 30 64 55 38 67 67 33 ...
## $ MEAN_ANNUAL_AIR_TEMP : chr "8.2" "9.8" "9.2" "7.4" ...
## $ MEAN_MONTHLY_MAX_TEMP : chr "13.1" "13.6" "13.2" "12.2" ...
## $ MEAN_MONTHLY_MIN_TEMP : chr "3.5" "6.3" "5.4" "3.3" ...
## $ MEAN_ANNUAL_WIND_SPEED: chr "2.0" "3.0" "2.0" "2.0" ...
## $ MEAN_CLOUD_COVER : chr "67.0" "67.0" "67.0" "66.0" ...
## $ MEAN_ANNUAL_SUNSHINE : chr "" "1531.0" "1459.0" "1725.0" ...
## $ MEAN_ANNUAL_RAINFALL : chr "755.0" "820.0" "759.0" "919.0" ...
## $ MAX_MONTHLY_WIND_SPEED: chr "2.0" "3.0" "3.0" "2.0" ...
## $ MAX_AIR_TEMP : chr "32.5" "32.3" "32.4" "30.2" ...
## $ MAX_WIND_SPEED : chr "" "30.2" "29.9" "" ...
## $ MAX_RAINFALL : chr "39.0" "36.0" "32.0" "43.0" ...
## $ MIN_AIR_TEMP : chr "-16.3" "-10.9" "-12.6" "-15.5" ...
## $ MEAN_RANGE_AIR_TEMP : chr "9.6" "7.3" "7.8" "8.9" ...
```

```
# exclude uninformative columns
cols_to_drop <- c("ID",
                  "DWD_ID",
                  "STATION_NAME",
                  "FEDERAL_STATE",
                  "PERIOD")
# logical vector: TRUE for complete rows (the rows we keep)
rows_to_keep <- complete.cases(dwd)
dwd_data <- dwd[rows_to_keep, !(colnames(dwd) %in% cols_to_drop)]
```

After the data cleaning steps there are 599 rows (observations) and 17 columns (features) left in our data set.

`dim(dwd_data)`

`## [1] 599 17`
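Note that the number of rows did not change: `complete.cases()` only flags `NA` values, and since every column of *dwd* was read in as character (see the `str()` output above), missing entries are stored as empty strings (`""`), which pass the check. A toy data frame (hypothetical, standing in for *dwd*) makes this visible:

```r
# complete.cases() only flags NA values; empty strings in character
# columns are *not* missing from its point of view
toy <- data.frame(a = c("755.0", "", "919.0"),
                  b = c("8.2", "9.8", NA),
                  stringsAsFactors = FALSE)
complete.cases(toy)               # TRUE TRUE FALSE -- "" in row 2 passes
colSums(toy == "", na.rm = TRUE)  # number of empty strings per column
```

Converting the columns to numeric first (as we do further below) would turn those empty strings into `NA`.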

Moreover, we split our data set at random into two distinct parts: one
part is our **training set** while the other serves as our
**test set**. The basic idea behind this grouping strategy is that we
learn the model parameters on the training set and then use the test
set, a data set that was not involved at all during parameter
estimation, to determine the goodness of fit of our model. In other
words, by using two distinct data sets we may evaluate the
**generalization properties** of our model.

For the sub-setting of the data set we rely on the `sample.split()`
function from the `caTools` package, which splits the data set into a
training and a test set in a predefined ratio. Finally, we assign the
**response vector** and the **data matrix** of each subset a proper
variable name for further processing.

```
# Split data into training and test set, 80% training and 20% test set
library(caTools)
```

`## Warning: package 'caTools' was built under R version 4.2.3`

```
set.seed(100) # set seed for reproducibility
# define split vector
split <- sample.split(dwd_data$MEAN_ANNUAL_RAINFALL, SplitRatio = 0.8)
train_set <- dwd_data[split == TRUE, ] # training set
test_set <- dwd_data[split == FALSE, ] # test set
# response vectors
y_train <- train_set$MEAN_ANNUAL_RAINFALL # response vector training set
y_test <- test_set$MEAN_ANNUAL_RAINFALL # response vector test set
# model matrices for training and test set
# select all features except the response variable
X_train <- as.matrix(subset(train_set, select = -c(MEAN_ANNUAL_RAINFALL)))
X_test <- as.matrix(subset(test_set, select = -c(MEAN_ANNUAL_RAINFALL)))
# save objects for later usage
save(dwd,
     dwd_data,
     train_set,
     test_set,
     y_train,
     y_test,
     X_train,
     X_test,
     split,
     file = "dwd_30200.RData")
```

Finally, there remain 479 observations with 16 features in the data set to train our model, and 120 observations with 16 features to evaluate our model.
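The same 80/20 split can be sketched with base R alone (a hypothetical alternative shown only to illustrate the idea; unlike `sample.split()`, this sketch ignores the response vector entirely):

```r
set.seed(100)                                   # reproducibility
n <- 599                                        # observations in dwd_data
train_idx <- sample(n, size = round(0.8 * n))   # random 80 % of row indices
test_idx  <- setdiff(seq_len(n), train_idx)     # the remaining 20 %
length(train_idx)  # 479
length(test_idx)   # 120
```

With row indices in hand, the subsets would be obtained as `dwd_data[train_idx, ]` and `dwd_data[test_idx, ]`.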

Before we build our model we visualize the data we are working with by
plotting a scatter plot matrix. The R software environment provides
several functions for this purpose. Just to name a few, we may use the
`pairs()` function, the `splom()` function from the `lattice` package,
the `pairs.panels()` function from the `psych` package, or the
`ggpairs()` function from the `GGally` package. The particular choice
is a matter of experience, the size of the data set, and personal
taste. As a proof of concept that it becomes difficult to visualize a
high-dimensional feature space, we apply the built-in `pairs()`
function to the data set.

```
# convert all (character) columns to numeric before plotting
dwd_data <- as.data.frame(sapply(dwd_data, as.numeric))
pairs(dwd_data)
```

Arghhh! Such a plot is of no use at all! Note that we are visualizing just 17 features, which, compared to many other real-world applications, is not much at all.

A good strategy is to choose only those pairs of variables that at
least somehow affect our response variable `MEAN_ANNUAL_RAINFALL`.
Thus, we first examine the pairwise correlations between the features,
and then select for visualization those features that show the highest
correlation. The correlations can easily be calculated with the
`cor()` function; however, the output of `cor()` would still flood the
screen, so our situation would not improve. A fancy method to overcome
this issue is provided by the `corrplot` package. If we feed the
eponymous `corrplot()` function with a correlation matrix, we can
visualize the correlations among the features in a quite convenient
way.

```
library(corrplot)
dwd_data_cor <- cor(dwd_data, use = "complete.obs")
corrplot(dwd_data_cor,
         method = "number",
         type = "lower",
         tl.cex = 0.6,
         number.cex = 0.5)
```

Take a look at the plot above and focus on the response variable
`MEAN_ANNUAL_RAINFALL` at position (row and column) 11. In
that particular row/column you find the Pearson correlation coefficients
for each feature pair. The **Pearson correlation coefficient** is
a number between \(-1\) and \(+1\). A positive number indicates a
positive relation between the pairs of features and a negative number
indicates a negative relation between the features, respectively. The
more extreme the number the stronger the correlation.
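As a quick reminder, the coefficient that `cor()` reports by default is exactly Pearson's \(r = \sum_i (x_i-\bar x)(y_i-\bar y) \,/\, \sqrt{\sum_i (x_i-\bar x)^2 \sum_i (y_i-\bar y)^2}\). A minimal sketch on two toy vectors:

```r
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)
# Pearson's r computed from its definition ...
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))
# ... and via cor(), whose default method is "pearson"
r_cor <- cor(x, y)
round(r_manual, 4)  # 0.7746
round(r_cor, 4)     # 0.7746
```

Both computations agree: the two toy vectors are fairly strongly positively correlated.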

OK, back to our strategy: how do we choose those pairs of variables
that at least somehow affect our response variable
`MEAN_ANNUAL_RAINFALL`?

By now we have all the building blocks to solve this problem. We set a
threshold level, let us say \(\pm 0.5\), select all feature pair
combinations whose correlation coefficient lies above or below that
threshold, and then plot them in a scatter plot matrix. Here we apply
the `ggpairs()` function from the `GGally` package. This package is
built upon the `ggplot2` package and allows a lot of fine tuning.
Check out the help page (type `?ggpairs` in your console) for more
information on the `ggpairs()` function.

```
# set threshold
cor_threshold <- 0.5
# extract the names of all features whose absolute correlation with
# the response exceeds the threshold
best_features <- row.names(dwd_data_cor)[
  abs(dwd_data_cor["MEAN_ANNUAL_RAINFALL", ]) > cor_threshold
]
print(best_features)
```

```
## [1] "ALTITUDE" "MEAN_ANNUAL_AIR_TEMP" "MEAN_MONTHLY_MAX_TEMP"
## [4] "MEAN_MONTHLY_MIN_TEMP" "MEAN_ANNUAL_RAINFALL" "MAX_AIR_TEMP"
## [7] "MAX_RAINFALL"
```
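Note that `MEAN_ANNUAL_RAINFALL` itself appears in `best_features`, since every variable correlates perfectly (\(r = 1\)) with itself. For the scatter plot matrix this is convenient, because the response then shows up in the plot; if we instead wanted a pure list of candidate predictors, we could drop it (a small sketch, restating the feature names printed above):

```r
best_features <- c("ALTITUDE", "MEAN_ANNUAL_AIR_TEMP",
                   "MEAN_MONTHLY_MAX_TEMP", "MEAN_MONTHLY_MIN_TEMP",
                   "MEAN_ANNUAL_RAINFALL", "MAX_AIR_TEMP", "MAX_RAINFALL")
# remove the response variable, keeping only the candidate predictors
predictors_only <- setdiff(best_features, "MEAN_ANNUAL_RAINFALL")
length(predictors_only)  # 6
```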

```
library(ggplot2)
library(GGally)
# call ggpairs()
ggpairs(dwd_data,
        columns = best_features,
        upper = list(continuous = wrap("cor",
                                       size = 4,
                                       alignPercent = 1))) +
  theme_grey(base_size = 6.5)
```

Finally, we get a nice and clear representation of a subset of
features of the *dwd* data set, all of them related to one another in
some way.

Obviously, all variables are more or less skewed owing to physical constraints. We should thus keep in mind that our results may be biased by the lack of normality, and that the variables should be transformed prior to further analyses. For the sake of simplicity, we will ignore this important point here.
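One common remedy (not applied in this tutorial, and sketched here on simulated data only) is a log transform, which typically reduces the right skew of strictly positive variables such as rainfall:

```r
set.seed(42)
x <- rexp(1000, rate = 1 / 800)  # toy right-skewed, rainfall-like values
x_log <- log(x)
# sample skewness: standardized third central moment
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3
skewness(x)      # clearly positive (right-skewed)
skewness(x_log)  # much closer to zero
```

For exponentially distributed data the skewness is about 2 before the transform, while the log-transformed values are far less right-skewed.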

In the next section we build a bunch of multiple linear regression models and we are going to investigate these relationships in more detail.

However, before we continue we should remind ourselves that the data
set we are working with has a spatial component. Basically, our
observations are point measurements of meteorological parameters
spread across Germany. Let us plot a simple map to get a feel for the
spatial distribution of our observations. For this we rely on the
`raster` package and on the `ggplot2` package.

```
library(raster)
library(ggplot2)
# retrieve federal state boundaries via getData() from the raster package
G1 <- getData(country = "Germany", level = 1)
# add a column "data_subset" labelling each observation as
# "training data" or "test data", based on the split vector from the
# data preparation section
dwd_data$data_subset <- ifelse(split == TRUE, "training data", "test data")
# plot the map
library(mapproj)
ggplot() +
  geom_polygon(data = G1,
               aes(x = long, y = lat, group = group),
               colour = "grey10", fill = "#fff7bc") +
  geom_point(data = dwd_data,
             aes(x = LON, y = LAT, colour = data_subset),
             alpha = .5,
             size = 2) +
  theme_bw() +
  xlab("Longitude") + ylab("Latitude") +
  coord_map()
```

Nice! We see that our data is quite well distributed across Germany. Furthermore, the observations in the test set also appear to be randomly distributed across the country.

**Citation**

The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us via mail by soga[at]zedat.fu-berlin.de.

You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follows: *Hartmann,
K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis
using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.*