Before we start, we clean up the data set by excluding non-meaningful
features and missing values. Then we create our binary response
variable HOT.DAY, based on the variable MAX.AIR.TEMP. In the next step
we split our data set at random into two distinct parts: one part is
our training set, while the other part serves as our test set. The
basic idea behind this data grouping strategy is that we learn the
model parameters on the training set, and then use the test set, a
data set that was not involved at all in parameter estimation, to
determine the goodness of fit of our model. In other words, by using
two distinct data sets we can evaluate the generalization properties
of our model.
For the sub-setting of the data set we rely on the sample.split()
function from the caTools package, which splits the data set into a
training and a test set in a predefined ratio. Finally, we assign the
data matrices and the response vectors proper variable names for
further processing.
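To build intuition for what sample.split() does, the core idea of a stratified split can be sketched in base R. This is a toy illustration with a made-up response vector y, not the actual caTools implementation:

```r
# Sketch of a stratified 60/40 train/test split in base R (toy example;
# the tutorial itself uses caTools::sample.split()).
set.seed(200)
y <- factor(rep(c(0, 1), times = c(20, 80)))  # imbalanced toy response

# Sample 60% of the indices within each class, so the
# class ratio is (approximately) preserved in the training set.
train_idx <- unlist(lapply(split(seq_along(y), y), function(idx) {
  sample(idx, size = round(0.6 * length(idx)))
}))

in_train <- seq_along(y) %in% train_idx
table(y[in_train])  # 12 zeros and 48 ones: the 1:4 class ratio is kept
```

Because the sampling is done per class, the imbalance of the response is reflected in both the training and the test set, which matters for the imbalanced HOT.DAY variable discussed below.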
# Load data set
dwd <- read.csv2("DWD.csv",
stringsAsFactors = FALSE)
str(dwd)
## 'data.frame': 599 obs. of 22 variables:
## $ id : int 0 1 2 6 8 9 10 12 14 18 ...
## $ DWD_ID : int 1 3 44 71 73 78 91 98 116 132 ...
## $ STATION.NAME : chr "Aach" "Aachen" "Gro\xdfenkneten" "Albstadt-Badkap" ...
## $ FEDERAL.STATE : chr "Baden-W\xfcrttemberg" "Nordrhein-Westfalen" "Niedersachsen" "Baden-W\xfcrttemberg" ...
## $ LAT : num 47.8 50.8 52.9 48.2 48.6 ...
## $ LON : num 8.85 6.09 8.24 8.98 13.05 ...
## $ ALTITUDE : num 478 202 44 759 340 65 300 780 213 750 ...
## $ PERIOD : chr "1931-1986" "1851-2011" "1971-2016" "1986-2016" ...
## $ RECORD.LENGTH : int 55 160 45 30 64 55 38 67 67 33 ...
## $ MEAN.ANNUAL.AIR.TEMP : num 8.2 9.8 9.2 7.4 8.4 9.3 8.2 5.1 8.4 5.7 ...
## $ MEAN.MONTHLY.MAX.TEMP : num 13.1 13.6 13.2 12.2 13.4 13.4 12.7 8.9 12.9 9.2 ...
## $ MEAN.MONTHLY.MIN.TEMP : num 3.5 6.3 5.4 3.3 3.9 5.2 4.1 2.2 4.2 2.7 ...
## $ MEAN.ANNUAL.WIND.SPEED: num 2 3 2 2 1 2 3 3 2 3 ...
## $ MEAN.CLOUD.COVER : num 67 67 67 66 65 67 72 72 66 64 ...
## $ MEAN.ANNUAL.SUNSHINE : num NA 1531 1459 1725 1595 ...
## $ MEAN.ANNUAL.RAINFALL : num 755 820 759 919 790 794 657 NA NA 915 ...
## $ MAX.MONTHLY.WIND.SPEED: num 2 3 3 2 2 2 3 4 3 3 ...
## $ MAX.AIR.TEMP : num 32.5 32.3 32.4 30.2 33 32.2 31.6 27.6 33.2 29 ...
## $ MAX.WIND.SPEED : num NA 30.2 29.9 NA NA NA NA NA NA NA ...
## $ MAX.RAINFALL : num 39 36 32 43 43 33 37 NA NA 40 ...
## $ MIN.AIR.TEMP : num -16.3 -10.9 -12.6 -15.5 -19.2 -13.3 -15.2 -15.7 -17.5 -17.2 ...
## $ MEAN.RANGE.AIR.TEMP : num 9.6 7.3 7.8 8.9 9.5 8.2 8.6 6.7 8.6 6.5 ...
# Exclude variables
cols_to_drop <- c("id", # columns to drop
"DWD_ID",
"STATION.NAME",
"FEDERAL.STATE",
"PERIOD",
"MAX.WIND.SPEED") # shows many missing values
dwd_data <- dwd[, !(colnames(dwd) %in% cols_to_drop)] # drop columns
rows_to_drop <- complete.cases(dwd_data) # rows to drop
dwd_data <- dwd_data[rows_to_drop, ] # drop rows
# Create binary response variable HOT.DAY
dwd_data["HOT.DAY"] <- as.factor(ifelse(dwd_data$MAX.AIR.TEMP >= 30, 1, 0))
# Split data into training and test set, 60% training and 40% test set
library(caTools)
set.seed(200) # set seed for reproducibility
# define split vector
split <- sample.split(dwd_data$HOT.DAY, SplitRatio = 0.6)
#########################
### Original data set ###
#########################
train_set <- dwd_data[split == TRUE, ] # training set
test_set <- dwd_data[split == FALSE, ] # test set
# response vectors
y_train <- train_set$HOT.DAY # response vector training set
y_test <- test_set$HOT.DAY # response vector test set
# exclude response variable(s) from the data matrix
idx_HOT_DAY <- which(colnames(train_set) == "HOT.DAY")
idx_MAX_AIR_TEMP <- which(colnames(train_set) == "MAX.AIR.TEMP")
X_train <- data.matrix(train_set[, -c(idx_HOT_DAY, idx_MAX_AIR_TEMP)])
X_test <- data.matrix(test_set[, -c(idx_HOT_DAY, idx_MAX_AIR_TEMP)])
####################
### PCA data set ###
####################
# load PC3 model from previous sections
load("pca_PC3_30300.RData")
pca_PC3 <- as.data.frame(pca_PC3)
X_train_PC3 <- pca_PC3[split == TRUE, ] # training set PCA
X_test_PC3 <- pca_PC3[split == FALSE, ] # test set PCA
### save data matrices and response vectors for later usage ###
save(y_train,
y_test,
X_train,
X_test,
X_train_PC3,
X_test_PC3,
pca_PC3,
dwd_data,
file = "pca_data_30300.RData")
Finally, 238 observations with 16 features remain in the data set to train our model, and 159 observations to evaluate it.
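As a quick sanity check, these sizes follow directly from the 397 complete cases that survive the cleaning step and the split ratio of 0.6:

```r
# Plain arithmetic check of the expected split sizes
# (397 complete cases remain after dropping rows with missing values).
n_obs   <- 397
n_train <- round(0.6 * n_obs)  # training set size
n_test  <- n_obs - n_train     # test set size
c(train = n_train, test = n_test)  # 238 and 159
```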
We notice that the classes of our response variable HOT.DAY are not balanced.
tbl_hot_day <- table(dwd_data["HOT.DAY"])
tbl_hot_day
## HOT.DAY
## 0 1
## 54 343
The ratio of weather stations that recorded at least one hot day to weather stations that have not yet recorded a hot day is roughly 6 to 1. When working with an imbalanced data set we have to be cautious with model metrics such as accuracy. Consider a model that always predicts the more common class; in our example its accuracy would be \(343/(54+343) = 0.86\), which is quite a high number for a non-predictive model. Hence, we will also consider other model metrics, such as the confusion matrix and Cohen’s kappa. The value of \(0.86\) will serve as our benchmark: whatever model we build, its accuracy must be better than \(0.86\).
However, before we continue we should remind ourselves that the data
set we are working with has a spatial component. Basically, our
observations are point measurements of meteorological parameters spread
across Germany. Let us plot a simple map to get an impression of the
spatial distribution of our observations. For plotting we rely on the
raster package and the ggplot2 package.
library(mapproj)
library(raster)
library(ggplot2)
# Retrieve Federal States by the getData() function from the raster package
G1 <- getData(country = "Germany", level = 1)
## Warning in getData(country = "Germany", level = 1): getData will be removed in a future version of raster. You can use access these data, and more, with functions from the geodata package instead
# plot the map
ggplot() +
geom_polygon(data = G1,
aes(x = long, y = lat, group = group),
colour = "grey10", fill = "#fff7bc") +
geom_point(data = dwd_data,
aes(x = LON, y = LAT, colour = HOT.DAY),
alpha = 0.5,
size = 2) +
theme_bw() +
xlab("Longitude") + ylab("Latitude") +
coord_map()
Citation
The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us via mail at soga[at]zedat.fu-berlin.de.
Please cite as follows: Hartmann, K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.