So far we have focused on an example data set with just two explanatory variables. To get some hands-on experience, we now advance to a data set with more than two explanatory variables.

In order to showcase multiple linear regression analysis we
**examine the relationship between the response variable
MEAN_ANNUAL_RAINFALL and the other variables in the
dwd data set**.

You may download the `DWD.csv` file here.
Import the data set and assign it a proper name. Note that we use the
`read.csv2()` function to load the data set. The `read.csv()` and
`read.csv2()` functions are identical except for their defaults:
`read.csv2()` is suited to data originating from countries that use a
comma as the decimal separator and a semicolon as the field separator.

```
dwd <- read.csv2("https://userpage.fu-berlin.de/soga/data/raw-data/DWD.csv",
                 stringsAsFactors = FALSE, sep = ",")
```
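The difference between the two readers' defaults can be illustrated on tiny in-memory strings (a minimal sketch; the `text =` argument of `read.csv()`/`read.csv2()` lets us parse a string instead of a file):

```r
# read.csv():  defaults sep = ",", dec = "."
# read.csv2(): defaults sep = ";", dec = ","
csv_intl <- "a,b\n1.5,2.5"   # comma-separated, dot as decimal mark
csv_de   <- "a;b\n1,5;2,5"   # semicolon-separated, comma as decimal mark
d1 <- read.csv(text = csv_intl)
d2 <- read.csv2(text = csv_de)
d1$a  # 1.5
d2$a  # 1.5 -- the same numeric value, parsed from "1,5"
```

Both readers arrive at the same numeric values; only the expected file format differs.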

The *dwd* data set consists of 599 rows, each of them
representing a particular weather station in Germany, and 22 columns,
each of them corresponding to a variable/feature related to that
particular weather station. These self-explanatory variables are: ID,
DWD_ID, STATION_NAME, FEDERAL_STATE, LAT, LON, ALTITUDE, PERIOD,
RECORD_LENGTH, MEAN_ANNUAL_AIR_TEMP, MEAN_MONTHLY_MAX_TEMP,
MEAN_MONTHLY_MIN_TEMP, MEAN_ANNUAL_WIND_SPEED, MEAN_CLOUD_COVER,
MEAN_ANNUAL_SUNSHINE, MEAN_ANNUAL_RAINFALL, MAX_MONTHLY_WIND_SPEED,
MAX_AIR_TEMP, MAX_WIND_SPEED, MAX_RAINFALL, MIN_AIR_TEMP,
MEAN_RANGE_AIR_TEMP.

The data was downloaded from the DWD (German Weather Service) data portal on April 21, 2017. You may find a detailed description of the data set here. Please note that for the purpose of this tutorial the data set was preprocessed and columns have been renamed.

Before we dive into the modelling we first have to prepare our data
set. Calling the `str()` function on the data set reveals features
such as `ID`, `DWD_ID`, and `STATION_NAME`, among others, which we
know, or at least presumably know, do not carry any information useful
for predicting the mean annual rainfall. Thus, we exclude them from
the data set. In addition, we make sure to exclude observations that
contain missing values (`NA`).

`str(dwd)`

```
## 'data.frame': 599 obs. of 22 variables:
## $ ID : int 0 1 2 6 8 9 10 12 14 18 ...
## $ DWD_ID : int 1 3 44 71 73 78 91 98 116 132 ...
## $ STATION_NAME : chr "Aach" "Aachen" "Großenkneten" "Albstadt-Badkap" ...
## $ FEDERAL_STATE : chr "Baden-Württemberg" "Nordrhein-Westfalen" "Niedersachsen" "Baden-Württemberg" ...
## $ LAT : chr "47.8413" "50.7827" "52.9335" "48.2156" ...
## $ LON : chr "8.8493" "6.0941" "8.237" "8.9784" ...
## $ ALTITUDE : chr "478.0" "202.0" "44.0" "759.0" ...
## $ PERIOD : chr "1931-1986" "1851-2011" "1971-2016" "1986-2016" ...
## $ RECORD_LENGTH : int 55 160 45 30 64 55 38 67 67 33 ...
## $ MEAN_ANNUAL_AIR_TEMP : chr "8.2" "9.8" "9.2" "7.4" ...
## $ MEAN_MONTHLY_MAX_TEMP : chr "13.1" "13.6" "13.2" "12.2" ...
## $ MEAN_MONTHLY_MIN_TEMP : chr "3.5" "6.3" "5.4" "3.3" ...
## $ MEAN_ANNUAL_WIND_SPEED: chr "2.0" "3.0" "2.0" "2.0" ...
## $ MEAN_CLOUD_COVER : chr "67.0" "67.0" "67.0" "66.0" ...
## $ MEAN_ANNUAL_SUNSHINE : chr "" "1531.0" "1459.0" "1725.0" ...
## $ MEAN_ANNUAL_RAINFALL : chr "755.0" "820.0" "759.0" "919.0" ...
## $ MAX_MONTHLY_WIND_SPEED: chr "2.0" "3.0" "3.0" "2.0" ...
## $ MAX_AIR_TEMP : chr "32.5" "32.3" "32.4" "30.2" ...
## $ MAX_WIND_SPEED : chr "" "30.2" "29.9" "" ...
## $ MAX_RAINFALL : chr "39.0" "36.0" "32.0" "43.0" ...
## $ MIN_AIR_TEMP : chr "-16.3" "-10.9" "-12.6" "-15.5" ...
## $ MEAN_RANGE_AIR_TEMP : chr "9.6" "7.3" "7.8" "8.9" ...
```

```
# exclude uninformative columns
cols_to_drop <- c("ID",
                  "DWD_ID",
                  "STATION_NAME",
                  "FEDERAL_STATE",
                  "PERIOD")
# logical vector: TRUE for complete rows (the rows we keep)
rows_to_keep <- complete.cases(dwd)
dwd_data <- dwd[rows_to_keep, !(colnames(dwd) %in% cols_to_drop)]
```

After the data cleaning steps there are 599 rows (observations) and 17 columns (features) left in our data set.

`dim(dwd_data)`

`## [1] 599 17`
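Note that the number of rows did not change: `complete.cases()` only flags `NA` values, and since every column of *dwd* was read in as character (see the `str()` output above), missing entries are stored as empty strings (`""`), which pass the check. A toy data frame (hypothetical, standing in for *dwd*) makes this visible:

```r
# complete.cases() only flags NA values; empty strings in character
# columns are *not* missing from its point of view
toy <- data.frame(a = c("755.0", "", "919.0"),
                  b = c("8.2", "9.8", NA),
                  stringsAsFactors = FALSE)
complete.cases(toy)               # TRUE TRUE FALSE -- "" in row 2 passes
colSums(toy == "", na.rm = TRUE)  # number of empty strings per column
```

Converting the columns to numeric first (as we do further below) would turn those empty strings into `NA`.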

Moreover, we split our data set at random into two distinct parts: one
part is our **training set** while the other serves as our
**test set**. The basic idea behind this grouping strategy is that we
learn the model parameters on the training set and then use the test
set, a data set that was not involved at all during parameter
estimation, to determine the goodness of fit of our model. In other
words, by using two distinct data sets we may evaluate the
**generalization properties** of our model.

For the sub-setting of the data set we rely on the `sample.split()`
function from the `caTools` package, which splits the data set into a
training and a test set in a predefined ratio. Finally, we assign the
**response vector** and the **data matrix** of each subset a proper
variable name for further processing.

```
# Split data into training and test set, 80% training and 20% test set
library(caTools)
```

`## Warning: package 'caTools' was built under R version 4.2.3`

```
set.seed(100) # set seed for reproducibility
# define split vector
split <- sample.split(dwd_data$MEAN_ANNUAL_RAINFALL, SplitRatio = 0.8)
train_set <- dwd_data[split == TRUE, ] # training set
test_set <- dwd_data[split == FALSE, ] # test set
# response vectors
y_train <- train_set$MEAN_ANNUAL_RAINFALL # response vector training set
y_test <- test_set$MEAN_ANNUAL_RAINFALL # response vector test set
# model matrices for training and test set
# select all features except the response variable
X_train <- as.matrix(subset(train_set, select = -c(MEAN_ANNUAL_RAINFALL)))
X_test <- as.matrix(subset(test_set, select = -c(MEAN_ANNUAL_RAINFALL)))
# save objects for later usage
save(dwd,
     dwd_data,
     train_set,
     test_set,
     y_train,
     y_test,
     X_train,
     X_test,
     split,
     file = "dwd_30200.RData")
```

Finally, there remain 479 observations with 16 features in the data set to train our model, and 120 observations with 16 features to evaluate our model.
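The same 80/20 split can be sketched with base R alone (a hypothetical alternative shown only to illustrate the idea; unlike `sample.split()`, this sketch ignores the response vector entirely):

```r
set.seed(100)                                   # reproducibility
n <- 599                                        # observations in dwd_data
train_idx <- sample(n, size = round(0.8 * n))   # random 80 % of row indices
test_idx  <- setdiff(seq_len(n), train_idx)     # the remaining 20 %
length(train_idx)  # 479
length(test_idx)   # 120
```

With row indices in hand, the subsets would be obtained as `dwd_data[train_idx, ]` and `dwd_data[test_idx, ]`.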

Before we build our model we visualize the data we are working with by
plotting a scatter plot matrix. The R software environment provides
several functions for this purpose. Just to name a few, we may use the
`pairs()` function, the `splom()` function from the `lattice` package,
the `pairs.panels()` function from the `psych` package, or the
`ggpairs()` function from the `GGally` package. The particular choice
is a matter of experience, the size of the data set, and personal
taste. As a proof of concept that it becomes difficult to visualize a
high-dimensional feature space, we apply the built-in `pairs()`
function to the data set.

```
# convert all (character) columns to numeric before plotting
dwd_data <- as.data.frame(sapply(dwd_data, as.numeric))
pairs(dwd_data)
```

Arghhh! Such a plot is of no use at all! Note that we are visualizing just 17 features, which, compared to many other real-world applications, is not much at all.

A good strategy is to choose only those pairs of variables that at
least somehow affect our response variable `MEAN_ANNUAL_RAINFALL`.
Thus, we first examine the pairwise correlations between the features,
and then select for visualization those features that show the highest
correlation. The correlations can easily be calculated with the
`cor()` function; however, the output of `cor()` would still flood the
screen, so our situation would not improve. A fancy method to overcome
this issue is provided by the `corrplot` package. If we feed the
eponymous `corrplot()` function with a correlation matrix, we can
visualize the correlations among the features in a quite convenient
way.

```
library(corrplot)
dwd_data_cor <- cor(dwd_data, use = "complete.obs")
corrplot(dwd_data_cor,
         method = "number",
         type = "lower",
         tl.cex = 0.6,
         number.cex = 0.5)
```

Take a look at the plot above and focus on the response variable
`MEAN_ANNUAL_RAINFALL` at position (row and column) 11. In
that particular row/column you find the Pearson correlation coefficients
for each feature pair. The **Pearson correlation coefficient** is
a number between \(-1\) and \(+1\). A positive number indicates a
positive relation between the pairs of features and a negative number
indicates a negative relation between the features, respectively. The
more extreme the number the stronger the correlation.
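As a quick reminder, the coefficient that `cor()` reports by default is exactly Pearson's \(r = \sum_i (x_i-\bar x)(y_i-\bar y) \,/\, \sqrt{\sum_i (x_i-\bar x)^2 \sum_i (y_i-\bar y)^2}\). A minimal sketch on two toy vectors:

```r
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)
# Pearson's r computed from its definition ...
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))
# ... and via cor(), whose default method is "pearson"
r_cor <- cor(x, y)
round(r_manual, 4)  # 0.7746
round(r_cor, 4)     # 0.7746
```

Both computations agree: the two toy vectors are fairly strongly positively correlated.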

OK, back to our strategy: how do we choose those pairs of variables
that at least somehow affect our response variable
`MEAN_ANNUAL_RAINFALL`?

By now we have all the building blocks to solve this problem. We set a
threshold level, let us say \(\pm 0.5\), select all feature pair
combinations whose correlation coefficient lies above or below that
threshold, and then plot them in a scatter plot matrix. Here we apply
the `ggpairs()` function from the `GGally` package. This package is
built upon the `ggplot2` package and allows a lot of fine tuning.
Check out the help page (type `?ggpairs` in your console) for more
information on the `ggpairs()` function.

```
# set threshold
cor_threshold <- 0.5
# extract the names of all features whose absolute correlation with
# the response exceeds the threshold
best_features <- row.names(dwd_data_cor)[
  abs(dwd_data_cor["MEAN_ANNUAL_RAINFALL", ]) > cor_threshold
]
print(best_features)
```

```
## [1] "ALTITUDE" "MEAN_ANNUAL_AIR_TEMP" "MEAN_MONTHLY_MAX_TEMP"
## [4] "MEAN_MONTHLY_MIN_TEMP" "MEAN_ANNUAL_RAINFALL" "MAX_AIR_TEMP"
## [7] "MAX_RAINFALL"
```
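Note that `MEAN_ANNUAL_RAINFALL` itself appears in `best_features`, since every variable correlates perfectly (\(r = 1\)) with itself. For the scatter plot matrix this is convenient, because the response then shows up in the plot; if we instead wanted a pure list of candidate predictors, we could drop it (a small sketch, restating the feature names printed above):

```r
best_features <- c("ALTITUDE", "MEAN_ANNUAL_AIR_TEMP",
                   "MEAN_MONTHLY_MAX_TEMP", "MEAN_MONTHLY_MIN_TEMP",
                   "MEAN_ANNUAL_RAINFALL", "MAX_AIR_TEMP", "MAX_RAINFALL")
# remove the response variable, keeping only the candidate predictors
predictors_only <- setdiff(best_features, "MEAN_ANNUAL_RAINFALL")
length(predictors_only)  # 6
```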

```
library(ggplot2)
library(GGally)
# call ggpairs()
ggpairs(dwd_data,
        columns = best_features,
        upper = list(continuous = wrap("cor",
                                       size = 4,
                                       alignPercent = 1))) +
  theme_grey(base_size = 6.5)
```

Finally, we get a nice and clear representation of a subset of
features of the *dwd* data set, all of them related to one another in
some way.

Obviously, all variables are more or less skewed owing to physical constraints. We should thus keep in mind that our results may be biased by the lack of normality, and that the variables should be transformed prior to further analyses. For the sake of simplicity, we will ignore this important point here.
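One common remedy (not applied in this tutorial, and sketched here on simulated data only) is a log transform, which typically reduces the right skew of strictly positive variables such as rainfall:

```r
set.seed(42)
x <- rexp(1000, rate = 1 / 800)  # toy right-skewed, rainfall-like values
x_log <- log(x)
# sample skewness: standardized third central moment
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3
skewness(x)      # clearly positive (right-skewed)
skewness(x_log)  # much closer to zero
```

For exponentially distributed data the skewness is about 2 before the transform, while the log-transformed values are far less right-skewed.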

In the next section we build a bunch of multiple linear regression models and we are going to investigate these relationships in more detail.

However, before we continue we should remind ourselves that the data
set we are working with has a spatial component. Basically, our
observations are point measurements of meteorological parameters
spread across Germany. Let us plot a simple map to get a feel for the
spatial distribution of our observations. For this we rely on the
`raster` package and on the `ggplot2` package.

```
library(raster)
library(ggplot2)
# retrieve federal state boundaries via getData() from the raster package
G1 <- getData(country = "Germany", level = 1)
# add a column "data_subset" labelling each observation as
# "training data" or "test data", based on the split vector from the
# data preparation section
dwd_data$data_subset <- ifelse(split == TRUE, "training data", "test data")
# plot the map
library(mapproj)
ggplot() +
  geom_polygon(data = G1,
               aes(x = long, y = lat, group = group),
               colour = "grey10", fill = "#fff7bc") +
  geom_point(data = dwd_data,
             aes(x = LON, y = LAT, colour = data_subset),
             alpha = .5,
             size = 2) +
  theme_bw() +
  xlab("Longitude") + ylab("Latitude") +
  coord_map()
```

Nice! We see that our data is quite well distributed across Germany. Furthermore, the observations in the test set also appear to be randomly distributed across the country.

**Citation**

The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us via mail by soga[at]zedat.fu-berlin.de.

You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follows: *Hartmann,
K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis
using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.*