To predict the occurrence of hot days at individual weather stations in Germany, we selected a random summer day (2024-07-31) and downloaded the corresponding data from the DWD (German Weather Service) data portal. You may find a detailed description of the data set here
# Load data set
dwd <- read.csv2("https://userpage.fu-berlin.de/soga/data/raw-data/DWD_data_daily.csv",
stringsAsFactors = FALSE)
str(dwd)
## 'data.frame': 562 obs. of 24 variables:
## $ STATIONS_ID : int 11 44 73 78 90 91 96 102 125 131 ...
## $ Stationsname : chr "Donaueschingen (Landeplatz)" "Grossenkneten" "Aldersbach-Kramersepp" "Alfhausen" ...
## $ lat : num 48 52.9 48.6 52.5 50.8 ...
## $ lon : num 8.52 8.24 13.06 7.91 9.26 ...
## $ Bundesland : chr "Baden-Wuerttemberg" "Niedersachsen" "Bayern" "Niedersachsen" ...
## $ Stationshoehe : int 680 44 374 64 305 304 50 0 739 296 ...
## $ MESS_DATUM : chr "2024-07-31" "2024-07-31" "2024-07-31" "2024-07-31" ...
## $ QN_3 : int 10 NA NA NA 10 NA 10 10 10 NA ...
## $ FX_Windspitze : num 13.3 NA NA NA 7.2 NA 5.4 10.2 25.7 NA ...
## $ FM_Windgeschwindigkeit : num 1.9 NA NA NA 1.6 NA 1.1 NA 2.7 NA ...
## $ QN_4 : int NA 10 10 10 NA 10 9 3 3 10 ...
## $ RSK_Niederschlagshoehe : num NA 0 0 0 NA 0 0 NA 10.7 1.8 ...
## $ RSKF_Niederschlagsform : int NA 0 0 0 NA 0 0 NA 4 4 ...
## $ SDK_Sonnenscheindauer : num NA 14.2 NA NA NA NA NA 13.6 NA NA ...
## $ SHK_TAG_Schneehoehe : int NA NA NA NA NA NA 0 NA 0 NA ...
## $ NM_Bedeckungsgrad : num NA NA NA NA NA NA 3.4 NA 3 NA ...
## $ VPM_Dampfdruck : num NA 16.9 14.9 17.4 NA 18.2 15.5 NA 18.6 15.3 ...
## $ PM_Luftdruck : num NA NA NA NA NA ...
## $ TMK_Lufttemperatur : num NA 20.1 23.7 21.2 NA 22 21.6 18.7 22.1 23.8 ...
## $ UPM_Relative_Feuchte : int NA 75 56 70 NA 72 63 NA 73 53 ...
## $ TXK_Lufttemperatur_Max : num NA 28.1 32.8 27.3 NA 30.1 28.6 20.2 NA 31 ...
## $ TNK_Lufttemperatur_Min : num NA 13.6 13.4 15.2 NA 14.6 13.4 17.4 NA 15.4 ...
## $ TGK_Lufttemperatur_5cm_min: num NA 10.8 12.5 13.3 NA 13.5 11.6 NA NA 13 ...
## $ eor : chr "eor" "eor" "eor" "eor" ...
Since the data set is provided in German, non-German speakers can refer to the following table for an English translation of the column names.
translation_table <- data.frame(
Original = c(
"STATIONS_ID", "Stationsname", "lat", "lon", "Bundesland", "Stationshoehe",
"MESS_DATUM", "QN_3", "FX_Windspitze", "FM_Windgeschwindigkeit", "QN_4",
"RSK_Niederschlagshoehe", "RSKF_Niederschlagsform", "SDK_Sonnenscheindauer",
"SHK_TAG_Schneehoehe", "NM_Bedeckungsgrad", "VPM_Dampfdruck", "PM_Luftdruck",
"TMK_Lufttemperatur", "UPM_Relative_Feuchte", "TXK_Lufttemperatur_Max",
"TNK_Lufttemperatur_Min", "TGK_Lufttemperatur_5cm_min"
),
English = c(
"Station ID", "Station Name", "Latitude", "Longitude", "Federal State", "Station Elevation (m)",
"Reference Date (YYYY-MM-DD)", "Quality Level (Wind Data)", "Daily Max Wind Gust (m/s)",
"Daily Mean Wind Speed (m/s)", "Quality Level (Precipitation, Temperature, etc.)",
"Daily Precipitation Amount (mm)", "Precipitation Form (Code)", "Daily Sunshine Duration (h)",
"Daily Snow Depth (cm)", "Daily Mean Cloud Cover (1/8)", "Daily Mean Vapor Pressure (hPa)",
"Daily Mean Air Pressure (hPa)", "Daily Mean Air Temperature (°C)", "Daily Mean Relative Humidity (%)",
"Daily Max Air Temperature (2 m) (°C)", "Daily Min Air Temperature (2 m) (°C)",
"Daily Min Air Temperature (5 cm above ground) (°C)"
))
knitr::kable(translation_table, caption = "Mapping of DWD Columns to English Names")
Original | English |
---|---|
STATIONS_ID | Station ID |
Stationsname | Station Name |
lat | Latitude |
lon | Longitude |
Bundesland | Federal State |
Stationshoehe | Station Elevation (m) |
MESS_DATUM | Reference Date (YYYY-MM-DD) |
QN_3 | Quality Level (Wind Data) |
FX_Windspitze | Daily Max Wind Gust (m/s) |
FM_Windgeschwindigkeit | Daily Mean Wind Speed (m/s) |
QN_4 | Quality Level (Precipitation, Temperature, etc.) |
RSK_Niederschlagshoehe | Daily Precipitation Amount (mm) |
RSKF_Niederschlagsform | Precipitation Form (Code) |
SDK_Sonnenscheindauer | Daily Sunshine Duration (h) |
SHK_TAG_Schneehoehe | Daily Snow Depth (cm) |
NM_Bedeckungsgrad | Daily Mean Cloud Cover (1/8) |
VPM_Dampfdruck | Daily Mean Vapor Pressure (hPa) |
PM_Luftdruck | Daily Mean Air Pressure (hPa) |
TMK_Lufttemperatur | Daily Mean Air Temperature (°C) |
UPM_Relative_Feuchte | Daily Mean Relative Humidity (%) |
TXK_Lufttemperatur_Max | Daily Max Air Temperature (2 m) (°C) |
TNK_Lufttemperatur_Min | Daily Min Air Temperature (2 m) (°C) |
TGK_Lufttemperatur_5cm_min | Daily Min Air Temperature (5 cm above ground) (°C) |
Before we start, we clean up the data set by excluding non-meaningful
or not numerical features, removing missing values and eliminating
highly correlated variables. Then, we create our binary response
variable HOT.DAY
, based on the variable
TXK_Lufttemperatur_Max
(maximum air temperature).
# Exclude variables (columns to drop)
cols_to_drop <- c("STATIONS_ID", # Identifier – not needed for analysis
"Stationsname", # Station name – not numeric
"Bundesland", # Federal state – not numeric
"MESS_DATUM", # Date – not directly usable
"SHK_TAG_Schneehoehe", # Snow height – many missing values/not relevant for heat analysis
"RSKF_Niederschlagsform", # Type of precipitation – categorical
"QN_3", # Quality code – no important information
"QN_4", # Quality code – no substantive information
"eor", # not relevant for modeling
"TGK_Lufttemperatur_5cm_min", # Temperature at 5 cm height – possibly redundant with max temperature
"TNK_Lufttemperatur_Min", # Minimum air temperature – may be highly correlated with max temperature
"TMK_Lufttemperatur") # Mean air temperature – may be highly correlated with max temperature
dwd_data <- dwd[, !(colnames(dwd) %in% cols_to_drop)] # drop columns
rows_to_drop <- complete.cases(dwd_data) # rows to drop
dwd_data <- dwd_data[rows_to_drop, ] # drop rows
# Create binary response variable HEAT.DAY
dwd_data["HOT.DAY"] <- as.factor(ifelse(dwd_data$TXK_Lufttemperatur_Max >= 30, 1, 0))
# Final data set
str(dwd_data)
## 'data.frame': 120 obs. of 13 variables:
## $ lat : num 50.4 53 54.7 48.4 49.9 ...
## $ lon : num 7.42 13.99 13.43 10.94 10.92 ...
## $ Stationshoehe : int 75 54 42 462 240 103 362 45 15 81 ...
## $ FX_Windspitze : num 8.4 10.9 11.3 11 11.3 6.9 12.3 6 6 5.2 ...
## $ FM_Windgeschwindigkeit: num 1.6 1.8 4.2 1.8 1.4 1.9 2.2 1.7 2.5 1.3 ...
## $ RSK_Niederschlagshoehe: num 0 0 0 0 5.6 0 1.4 0 0 0 ...
## $ SDK_Sonnenscheindauer : num 5.8 12.4 14 11.4 12.1 7.8 7.2 13.4 14.1 8.8 ...
## $ NM_Bedeckungsgrad : num 5.6 4.4 1.2 2.7 2.7 6.2 5.5 4.1 2.7 5.4 ...
## $ VPM_Dampfdruck : num 19.5 16.2 16.7 18.4 17.4 17.3 20.7 15.4 17.5 19 ...
## $ PM_Luftdruck : num 1005 1008 1010 960 986 ...
## $ UPM_Relative_Feuchte : int 72 67 76 66 65 68 73 65 77 75 ...
## $ TXK_Lufttemperatur_Max: num 29.9 28.3 22.9 32.8 32.3 26.3 27.3 27.9 22.8 28.6 ...
## $ HOT.DAY : Factor w/ 2 levels "0","1": 1 1 1 2 2 1 1 1 1 1 ...
In a next step, we split our data set at random into two distinct parts. One part is our training set while the other part will serve as our test set. The basic idea behind this data grouping strategy is, that we learn the model parameters on the training set, and then we use the test data set, a data set that was not involved at all during parameter estimation, to determine the goodness of fit of our model, or in other words by using two distinct data sets we may evaluate the generalization properties of our model.
For the sub-setting of the data set we rely on the
sample.split()
function from the caTools
package, which splits the data set into two sets (training and test set)
in a predefined ratio.
We prepare two versions of the features:
Finally, we assign the data matrices and the response vectors proper variable names for further processing.
# Split data into training and test set, 60% training and 40% test set
library(caTools)
set.seed(200) # set seed for reproducibility
# define split vector
split <- sample.split(dwd_data$HOT.DAY, SplitRatio = 0.6)
#########################
### Original data set ###
#########################
train_set <- dwd_data[split == TRUE, ] # training set
test_set <- dwd_data[split == FALSE, ] # test set
# response vectors
y_train <- train_set$HOT.DAY # response vector training set
y_test <- test_set$HOT.DAY # response vector test set
# exclude response variable(s) from the data matrix
idx_HOT_DAY <- which(colnames(train_set) == "HOT.DAY")
idx_MAX_AIR_TEMP <- which(colnames(train_set) == "TXK_Lufttemperatur_Max")
X_train <- data.matrix(train_set[, -c(idx_HOT_DAY, idx_MAX_AIR_TEMP)])
X_test <- data.matrix(test_set[, -c(idx_HOT_DAY, idx_MAX_AIR_TEMP)])
####################
### PCA data set ###
####################
# prepare PCA data set
X_all <- dwd_data[, -c(which(colnames(dwd_data) == "HOT.DAY"),
which(colnames(dwd_data) == "TXK_Lufttemperatur_Max"))]
# Scale data
X_all_scaled <- scale(X_all)
# Compute PCA
pca <- prcomp(X_all_scaled, center = TRUE, scale. = TRUE)
summary(pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.9552 1.6207 1.2445 1.0740 0.80669 0.73448 0.53249
## Proportion of Variance 0.3475 0.2388 0.1408 0.1049 0.05916 0.04904 0.02578
## Cumulative Proportion 0.3475 0.5863 0.7271 0.8320 0.89112 0.94017 0.96594
## PC8 PC9 PC10 PC11
## Standard deviation 0.42485 0.38789 0.20675 0.03053
## Proportion of Variance 0.01641 0.01368 0.00389 0.00008
## Cumulative Proportion 0.98235 0.99603 0.99992 1.00000
# We selected the first 4 principal components because they explain approximately 83% of the total variance, capturing most of the information while reducing dimensionality
# Extract scores of first 4 principal components
pca_scores_3 <- pca$x[, 1:4]
# Split PCA scores into training and test sets
X_train_PC3 <- pca_scores_3[split == TRUE, ]
X_test_PC3 <- pca_scores_3[split == FALSE, ]
### save data matrices and response vectors for later usage ###
save(y_train,
y_test,
X_train,
X_test,
X_train_PC3,
X_test_PC3,
pca_scores_3,
dwd_data,
file = "pca_data_30300.RData")
Finally, there remain 72 observations with 11 features in the data set to train our model, and 48 observations with 11 features to evaluate our model.
We notice that the classes of our response variable
HOT.DAY
are not balanced.
tbl_hot_day <- table(dwd_data["HOT.DAY"])
tbl_hot_day
## HOT.DAY
## 0 1
## 76 44
The ratio of weather stations, which recorded at least one hot day and weather stations, which have not yet recorded a hot day is 5 to 3. When working with imbalanced data set we have to be cautious with respect to model metrics, such as the model accuracy. Consider a model that always predict the more common class; in our example the model accuracy would be \(76/(44+76)= 0.63\); which is a quite high number for a non-predictive model. Hence, we will consider other model metrics such as the confusion matrix and Cohen’s kappa too. The number of \(0.63\) will serve as our model benchmark. Whatever model we build, the model’s accuracy must be better than \(0.63\).
However, before we continue we should remind ourselves that the data
set we are working with has a spatial component. Basically, our
observations are point measurements of meteorological parameters spread
across Germany. Let us plot a simple map to get an impression of the
spatial distribution of our observations. For plotting we rely on the
geodata
package (which replaces the older
getData()
function from the raster
package for
spatial data retrieval) and on the ggplot2
package.
library(mapproj)
library(geodata)
library(ggplot2)
library(sf)
# Retrieve Federal States using the gadm() function from the geodata package
G1 <- geodata::gadm(country = "DEU", level = 1, tempdir())
G1_sf <- st_as_sf(G1)
#plot the map
ggplot() +
geom_sf(data = G1_sf,
colour = "grey10", fill = "#fff7bc") +
geom_point(data = dwd_data,
aes(x = lon, y = lat, colour = HOT.DAY),
alpha = 0.5,
size = 2) +
scale_colour_manual(values = c("0" = "cyan4", "1" = "coral2"))+
theme_bw() +
xlab("Longitude") + ylab("Latitude") +
coord_sf()
Citation
The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us via mail by soga[at]zedat.fu-berlin.de.
Please cite as follow: Hartmann, K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.