30341_Data-preparation

To predict the occurrence of hot days at individual weather stations in Germany, we selected a random summer day (2024-07-31) and downloaded the corresponding data from the DWD (German Weather Service) data portal. You may find a detailed description of the data set here

# Load data set
dwd <- read.csv2("https://userpage.fu-berlin.de/soga/data/raw-data/DWD_data_daily.csv",
                 stringsAsFactors = FALSE)

str(dwd)

## 'data.frame':    562 obs. of  24 variables:
##  $ STATIONS_ID               : int  11 44 73 78 90 91 96 102 125 131 ...
##  $ Stationsname              : chr  "Donaueschingen (Landeplatz)" "Grossenkneten" "Aldersbach-Kramersepp" "Alfhausen" ...
##  $ lat                       : num  48 52.9 48.6 52.5 50.8 ...
##  $ lon                       : num  8.52 8.24 13.06 7.91 9.26 ...
##  $ Bundesland                : chr  "Baden-Wuerttemberg" "Niedersachsen" "Bayern" "Niedersachsen" ...
##  $ Stationshoehe             : int  680 44 374 64 305 304 50 0 739 296 ...
##  $ MESS_DATUM                : chr  "2024-07-31" "2024-07-31" "2024-07-31" "2024-07-31" ...
##  $ QN_3                      : int  10 NA NA NA 10 NA 10 10 10 NA ...
##  $ FX_Windspitze             : num  13.3 NA NA NA 7.2 NA 5.4 10.2 25.7 NA ...
##  $ FM_Windgeschwindigkeit    : num  1.9 NA NA NA 1.6 NA 1.1 NA 2.7 NA ...
##  $ QN_4                      : int  NA 10 10 10 NA 10 9 3 3 10 ...
##  $ RSK_Niederschlagshoehe    : num  NA 0 0 0 NA 0 0 NA 10.7 1.8 ...
##  $ RSKF_Niederschlagsform    : int  NA 0 0 0 NA 0 0 NA 4 4 ...
##  $ SDK_Sonnenscheindauer     : num  NA 14.2 NA NA NA NA NA 13.6 NA NA ...
##  $ SHK_TAG_Schneehoehe       : int  NA NA NA NA NA NA 0 NA 0 NA ...
##  $ NM_Bedeckungsgrad         : num  NA NA NA NA NA NA 3.4 NA 3 NA ...
##  $ VPM_Dampfdruck            : num  NA 16.9 14.9 17.4 NA 18.2 15.5 NA 18.6 15.3 ...
##  $ PM_Luftdruck              : num  NA NA NA NA NA ...
##  $ TMK_Lufttemperatur        : num  NA 20.1 23.7 21.2 NA 22 21.6 18.7 22.1 23.8 ...
##  $ UPM_Relative_Feuchte      : int  NA 75 56 70 NA 72 63 NA 73 53 ...
##  $ TXK_Lufttemperatur_Max    : num  NA 28.1 32.8 27.3 NA 30.1 28.6 20.2 NA 31 ...
##  $ TNK_Lufttemperatur_Min    : num  NA 13.6 13.4 15.2 NA 14.6 13.4 17.4 NA 15.4 ...
##  $ TGK_Lufttemperatur_5cm_min: num  NA 10.8 12.5 13.3 NA 13.5 11.6 NA NA 13 ...
##  $ eor                       : chr  "eor" "eor" "eor" "eor" ...

Since the data set is provided in German, non-German speakers can refer to the following table for an English translation of the column names.

translation_table <- data.frame(
  Original = c(
    "STATIONS_ID", "Stationsname", "lat", "lon", "Bundesland", "Stationshoehe",
    "MESS_DATUM", "QN_3", "FX_Windspitze", "FM_Windgeschwindigkeit", "QN_4",
    "RSK_Niederschlagshoehe", "RSKF_Niederschlagsform", "SDK_Sonnenscheindauer",
    "SHK_TAG_Schneehoehe", "NM_Bedeckungsgrad", "VPM_Dampfdruck", "PM_Luftdruck",
    "TMK_Lufttemperatur", "UPM_Relative_Feuchte", "TXK_Lufttemperatur_Max",
    "TNK_Lufttemperatur_Min", "TGK_Lufttemperatur_5cm_min"
  ),
  English = c(
    "Station ID", "Station Name", "Latitude", "Longitude", "Federal State", "Station Elevation (m)",
    "Reference Date (YYYY-MM-DD)", "Quality Level (Wind Data)", "Daily Max Wind Gust (m/s)",
    "Daily Mean Wind Speed (m/s)", "Quality Level (Precipitation, Temperature, etc.)",
    "Daily Precipitation Amount (mm)", "Precipitation Form (Code)", "Daily Sunshine Duration (h)",
    "Daily Snow Depth (cm)", "Daily Mean Cloud Cover (1/8)", "Daily Mean Vapor Pressure (hPa)",
    "Daily Mean Air Pressure (hPa)", "Daily Mean Air Temperature (°C)", "Daily Mean Relative Humidity (%)",
    "Daily Max Air Temperature (2 m) (°C)", "Daily Min Air Temperature (2 m) (°C)",
    "Daily Min Air Temperature (5 cm above ground) (°C)"
  ))

knitr::kable(translation_table, caption = "Mapping of DWD Columns to English Names")

Mapping of DWD Columns to English Names
Original	English
STATIONS_ID	Station ID
Stationsname	Station Name
lat	Latitude
lon	Longitude
Bundesland	Federal State
Stationshoehe	Station Elevation (m)
MESS_DATUM	Reference Date (YYYY-MM-DD)
QN_3	Quality Level (Wind Data)
FX_Windspitze	Daily Max Wind Gust (m/s)
FM_Windgeschwindigkeit	Daily Mean Wind Speed (m/s)
QN_4	Quality Level (Precipitation, Temperature, etc.)
RSK_Niederschlagshoehe	Daily Precipitation Amount (mm)
RSKF_Niederschlagsform	Precipitation Form (Code)
SDK_Sonnenscheindauer	Daily Sunshine Duration (h)
SHK_TAG_Schneehoehe	Daily Snow Depth (cm)
NM_Bedeckungsgrad	Daily Mean Cloud Cover (1/8)
VPM_Dampfdruck	Daily Mean Vapor Pressure (hPa)
PM_Luftdruck	Daily Mean Air Pressure (hPa)
TMK_Lufttemperatur	Daily Mean Air Temperature (°C)
UPM_Relative_Feuchte	Daily Mean Relative Humidity (%)
TXK_Lufttemperatur_Max	Daily Max Air Temperature (2 m) (°C)
TNK_Lufttemperatur_Min	Daily Min Air Temperature (2 m) (°C)
TGK_Lufttemperatur_5cm_min	Daily Min Air Temperature (5 cm above ground) (°C)

Before we start, we clean up the data set by excluding non-meaningful or not numerical features, removing missing values and eliminating highly correlated variables. Then, we create our binary response variable HOT.DAY, based on the variable TXK_Lufttemperatur_Max (maximum air temperature).

# Exclude variables (columns to drop)
cols_to_drop <- c("STATIONS_ID", # Identifier – not needed for analysis
                  "Stationsname", # Station name – not numeric
                  "Bundesland", # Federal state – not numeric
                  "MESS_DATUM", # Date – not directly usable
                  "SHK_TAG_Schneehoehe", # Snow height – many missing values/not relevant for heat analysis
                  "RSKF_Niederschlagsform", # Type of precipitation – categorical
                  "QN_3", # Quality code – no important information
                  "QN_4", # Quality code – no substantive information
                  "eor", # not relevant for modeling
                  "TGK_Lufttemperatur_5cm_min", # Temperature at 5 cm height – possibly redundant with max temperature
                  "TNK_Lufttemperatur_Min", # Minimum air temperature – may be highly correlated with max temperature
                  "TMK_Lufttemperatur") # Mean air temperature – may be highly correlated with max temperature

                  

dwd_data <- dwd[, !(colnames(dwd) %in% cols_to_drop)] # drop columns

rows_to_drop <- complete.cases(dwd_data)  # rows to drop
dwd_data <- dwd_data[rows_to_drop, ] # drop rows

# Create binary response variable HEAT.DAY
dwd_data["HOT.DAY"] <- as.factor(ifelse(dwd_data$TXK_Lufttemperatur_Max >= 30, 1, 0))

# Final data set
str(dwd_data)

## 'data.frame':    120 obs. of  13 variables:
##  $ lat                   : num  50.4 53 54.7 48.4 49.9 ...
##  $ lon                   : num  7.42 13.99 13.43 10.94 10.92 ...
##  $ Stationshoehe         : int  75 54 42 462 240 103 362 45 15 81 ...
##  $ FX_Windspitze         : num  8.4 10.9 11.3 11 11.3 6.9 12.3 6 6 5.2 ...
##  $ FM_Windgeschwindigkeit: num  1.6 1.8 4.2 1.8 1.4 1.9 2.2 1.7 2.5 1.3 ...
##  $ RSK_Niederschlagshoehe: num  0 0 0 0 5.6 0 1.4 0 0 0 ...
##  $ SDK_Sonnenscheindauer : num  5.8 12.4 14 11.4 12.1 7.8 7.2 13.4 14.1 8.8 ...
##  $ NM_Bedeckungsgrad     : num  5.6 4.4 1.2 2.7 2.7 6.2 5.5 4.1 2.7 5.4 ...
##  $ VPM_Dampfdruck        : num  19.5 16.2 16.7 18.4 17.4 17.3 20.7 15.4 17.5 19 ...
##  $ PM_Luftdruck          : num  1005 1008 1010 960 986 ...
##  $ UPM_Relative_Feuchte  : int  72 67 76 66 65 68 73 65 77 75 ...
##  $ TXK_Lufttemperatur_Max: num  29.9 28.3 22.9 32.8 32.3 26.3 27.3 27.9 22.8 28.6 ...
##  $ HOT.DAY               : Factor w/ 2 levels "0","1": 1 1 1 2 2 1 1 1 1 1 ...

In a next step, we split our data set at random into two distinct parts. One part is our training set while the other part will serve as our test set. The basic idea behind this data grouping strategy is, that we learn the model parameters on the training set, and then we use the test data set, a data set that was not involved at all during parameter estimation, to determine the goodness of fit of our model, or in other words by using two distinct data sets we may evaluate the generalization properties of our model.

For the sub-setting of the data set we rely on the sample.split() function from the caTools package, which splits the data set into two sets (training and test set) in a predefined ratio.

We prepare two versions of the features:

Original data: The cleaned numerical variables are used as they are.
PCA data: We scale the numerical features and apply Principal Component Analysis (PCA) to reduce the number of variables and remove correlations. We then use the first few principal components as features. The same training-test split is applied.

Finally, we assign the data matrices and the response vectors proper variable names for further processing.

# Split data into training and test set, 60% training and 40% test set
library(caTools)
set.seed(200) # set seed for reproducibility

# define split vector
split <- sample.split(dwd_data$HOT.DAY, SplitRatio = 0.6)

#########################
### Original data set ###
#########################

train_set <- dwd_data[split == TRUE, ] # training set
test_set <- dwd_data[split == FALSE, ] # test set

# response vectors
y_train <- train_set$HOT.DAY # response vector training set
y_test <- test_set$HOT.DAY # response vector test set

# exclude response variable(s) from the data matrix
idx_HOT_DAY <- which(colnames(train_set) == "HOT.DAY")
idx_MAX_AIR_TEMP <- which(colnames(train_set) == "TXK_Lufttemperatur_Max")

X_train <- data.matrix(train_set[, -c(idx_HOT_DAY, idx_MAX_AIR_TEMP)])
X_test <- data.matrix(test_set[, -c(idx_HOT_DAY, idx_MAX_AIR_TEMP)])

####################
### PCA data set ###
####################

# prepare PCA data set
X_all <- dwd_data[, -c(which(colnames(dwd_data) == "HOT.DAY"),
                       which(colnames(dwd_data) == "TXK_Lufttemperatur_Max"))]
# Scale data
X_all_scaled <- scale(X_all)

# Compute PCA
pca <- prcomp(X_all_scaled, center = TRUE, scale. = TRUE)

summary(pca)

## Importance of components:
##                           PC1    PC2    PC3    PC4     PC5     PC6     PC7
## Standard deviation     1.9552 1.6207 1.2445 1.0740 0.80669 0.73448 0.53249
## Proportion of Variance 0.3475 0.2388 0.1408 0.1049 0.05916 0.04904 0.02578
## Cumulative Proportion  0.3475 0.5863 0.7271 0.8320 0.89112 0.94017 0.96594
##                            PC8     PC9    PC10    PC11
## Standard deviation     0.42485 0.38789 0.20675 0.03053
## Proportion of Variance 0.01641 0.01368 0.00389 0.00008
## Cumulative Proportion  0.98235 0.99603 0.99992 1.00000

# We selected the first 4 principal components because they explain approximately 83% of the total variance, capturing most of the information while reducing dimensionality

# Extract scores of first 4 principal components
pca_scores_3 <- pca$x[, 1:4]

# Split PCA scores into training and test sets
X_train_PC3 <- pca_scores_3[split == TRUE, ]
X_test_PC3  <- pca_scores_3[split == FALSE, ]

### save data matrices and response vectors for later usage ###
save(y_train,
     y_test,
     X_train,
     X_test,
     X_train_PC3,
     X_test_PC3,
     pca_scores_3,
     dwd_data,
     file = "pca_data_30300.RData")

Finally, there remain 72 observations with 11 features in the data set to train our model, and 48 observations with 11 features to evaluate our model.

We notice that the classes of our response variable HOT.DAY are not balanced.

tbl_hot_day <- table(dwd_data["HOT.DAY"])
tbl_hot_day

## HOT.DAY
##  0  1 
## 76 44

The ratio of weather stations, which recorded at least one hot day and weather stations, which have not yet recorded a hot day is 5 to 3. When working with imbalanced data set we have to be cautious with respect to model metrics, such as the model accuracy. Consider a model that always predict the more common class; in our example the model accuracy would be \(76/(44+76)= 0.63\); which is a quite high number for a non-predictive model. Hence, we will consider other model metrics such as the confusion matrix and Cohen’s kappa too. The number of \(0.63\) will serve as our model benchmark. Whatever model we build, the model’s accuracy must be better than \(0.63\).

However, before we continue we should remind ourselves that the data set we are working with has a spatial component. Basically, our observations are point measurements of meteorological parameters spread across Germany. Let us plot a simple map to get an impression of the spatial distribution of our observations. For plotting we rely on the geodata package (which replaces the older getData() function from the raster package for spatial data retrieval) and on the ggplot2 package.

library(mapproj)
library(geodata)
library(ggplot2)
library(sf)

# Retrieve Federal States using the gadm() function from the geodata package
G1 <- geodata::gadm(country = "DEU", level = 1, tempdir())
G1_sf <- st_as_sf(G1)

#plot the map
ggplot() +
  geom_sf(data = G1_sf,
             colour = "grey10", fill = "#fff7bc") +
  geom_point(data = dwd_data,
           aes(x = lon, y = lat, colour = HOT.DAY),
           alpha = 0.5,
           size = 2) +
  scale_colour_manual(values = c("0" = "cyan4", "1" = "coral2"))+
  theme_bw() +
  xlab("Longitude") + ylab("Latitude") +
  coord_sf()

Citation

The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us via mail by soga[at]zedat.fu-berlin.de.

You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follow: Hartmann, K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.