A typical machine learning approach consists of several steps that can be arranged in a kind of standard workflow. Training of the model, which we have done in the previous examples of different algorithms, is only a small part of this workflow. Data preprocessing and model evaluation and selection are more extensive tasks.

In the following, we will walk through the entire workflow using a simple but more realistic example than using toy data.

Our goal is to use weather data and predict the daily mean temperature at a measurement station from several other measurements, such as pressure, vapor content, and so on.

For this example, we will use a daily time series from the DWD weather station Berlin-Dahlem (FU), downloaded from Climate Data Center on 2022-07-22. A detailed description of the variables is available here. For the purpose of this tutorial, the dataset is provided here.

We start by downloading the data, extracting the zip file, and reading the data.

```
# Loading necessary libraries
library(tidyverse)
library(httr)
# Downloading and reading data
url <- "http://userpage.fu-berlin.de/soga/soga-py/300/307000_time_series/tageswerte_KL_00403_19500101_20211231_hist.zip"
temp_zip <- tempfile()
download.file(url, temp_zip)
unzipped_file <- unzip(temp_zip, files = "produkt_klima_tag_19500101_20211231_00403.txt", exdir = tempdir())
data_raw <- read.table(unzipped_file, sep = ";", header = TRUE, na.strings = "-999")
head(data_raw)
```

```
## STATIONS_ID MESS_DATUM QN_3 FX FM QN_4 RSK RSKF SDK SHK_TAG NM VPM PM
## 1 403 19500101 NA NA NA 5 2.2 7 NA 0 5.0 4.0 1025.6
## 2 403 19500102 NA NA NA 5 12.6 8 NA 0 8.0 6.1 1005.6
## 3 403 19500103 NA NA NA 5 0.5 1 NA 0 5.0 6.5 996.6
## 4 403 19500104 NA NA NA 5 0.5 7 NA 0 7.7 5.2 999.5
## 5 403 19500105 NA NA NA 5 10.3 7 NA 0 8.0 4.0 1001.1
## 6 403 19500106 NA NA NA 5 7.2 8 NA 12 7.3 5.6 997.5
## TMK UPM TXK TNK TGK eor
## 1 -3.2 83 -1.1 -4.9 -6.3 eor
## 2 1.0 95 2.2 -3.7 -5.3 eor
## 3 2.8 86 3.9 1.7 -1.4 eor
## 4 -0.1 85 2.1 -0.9 -2.3 eor
## 5 -2.8 79 -0.9 -3.3 -5.2 eor
## 6 2.6 79 5.0 -4.0 -4.0 eor
```

By looking at the data and consulting the variable description, we can immediately find variables that are not useful for our purpose and drop them.

**STATIONS_ID**-> There is only one station in the data.**MESS_DATUM**-> Since we are not interested in a time series analysis, this is just a numbering of the rows.**QN_3, FX, FM, eor**-> These variables do not contain any data.**QN_4**-> This value is a data quality label and has no relation to weather data.

```
data <- subset(data_raw, select = -c(STATIONS_ID, MESS_DATUM, QN_3, FX, FM, NM, QN_4, eor))
head(data)
```

```
## RSK RSKF SDK SHK_TAG VPM PM TMK UPM TXK TNK TGK
## 1 2.2 7 NA 0 4.0 1025.6 -3.2 83 -1.1 -4.9 -6.3
## 2 12.6 8 NA 0 6.1 1005.6 1.0 95 2.2 -3.7 -5.3
## 3 0.5 1 NA 0 6.5 996.6 2.8 86 3.9 1.7 -1.4
## 4 0.5 7 NA 0 5.2 999.5 -0.1 85 2.1 -0.9 -2.3
## 5 10.3 7 NA 0 4.0 1001.1 -2.8 79 -0.9 -3.3 -5.2
## 6 7.2 8 NA 12 5.6 997.5 2.6 79 5.0 -4.0 -4.0
```

To improve the readability, we rename the remaining columns.

```
data <- rename(data,
prec = RSK,
prec_type = RSKF,
sun_dur = SDK,
snow_depth = SHK_TAG,
vapor_pres = VPM,
pres = PM,
temp = TMK,
rel_humid = UPM,
temp_max = TXK,
temp_min = TNK,
temp_sfc = TGK
)
head(data)
```

```
## prec prec_type sun_dur snow_depth vapor_pres pres temp rel_humid temp_max
## 1 2.2 7 NA 0 4.0 1025.6 -3.2 83 -1.1
## 2 12.6 8 NA 0 6.1 1005.6 1.0 95 2.2
## 3 0.5 1 NA 0 6.5 996.6 2.8 86 3.9
## 4 0.5 7 NA 0 5.2 999.5 -0.1 85 2.1
## 5 10.3 7 NA 0 4.0 1001.1 -2.8 79 -0.9
## 6 7.2 8 NA 12 5.6 997.5 2.6 79 5.0
## temp_min temp_sfc
## 1 -4.9 -6.3
## 2 -3.7 -5.3
## 3 1.7 -1.4
## 4 -0.9 -2.3
## 5 -3.3 -5.2
## 6 -4.0 -4.0
```

As a next step, we check the data for missing values.

```
missing_values <- sapply(data, function(x) sum(is.na(x)))
missing_values
```

```
## prec prec_type sun_dur snow_depth vapor_pres pres temp
## 0 0 145 1 0 0 0
## rel_humid
## 1
```

As we can see, the percentage of missing values in our data is very small (147 out of the 26298 rows). Therefore, the simplest solution is to simply drop all rows with missing data. If the percentage were higher, other solutions might be more useful, such as imputing values for the missing values.

```
data <- drop_na(data)
head(data)
```

```
## prec prec_type sun_dur snow_depth vapor_pres pres temp rel_humid
## 1 1.0 1 0.0 0 7.1 1001.5 7.1 70
## 2 2.8 1 0.0 0 8.7 984.7 7.6 81
## 3 0.0 0 6.4 0 6.7 987.8 6.9 68
## 4 0.5 1 5.8 0 6.9 997.2 6.2 73
## 5 11.7 1 2.9 0 7.1 1000.7 6.4 74
## 6 1.9 1 2.5 0 8.0 1008.2 6.4 82
```

All of the data in our data set contain numerical values. However,
while they are all continuous variables, the `prec_type`

is a
categorical variable. Furthermore, it is only nominal, not ordinal. If
we leave the values as they are, the model will imply an order on the
prec_type, assuming for example that type 8 is greater than type 6. This
could lead to non-optimal results.

We will use a solution called *one-hot encoding*. This means
that we create a new feature for each of the types and assign binary
values to it.

```
data$prec_type <- as.factor(data$prec_type)
data <- model.matrix(~ . + 0, data)
data <- as.data.frame(data)
head(data)
```

```
## prec prec_type0 prec_type1 prec_type4 prec_type6 prec_type7 prec_type8
## 1 1.0 0 1 0 0 0 0
## 2 2.8 0 1 0 0 0 0
## 3 0.0 1 0 0 0 0 0
## 4 0.5 0 1 0 0 0 0
## 5 11.7 0 1 0 0 0 0
## 6 1.9 0 1 0 0 0 0
## sun_dur snow_depth vapor_pres pres temp rel_humid
## 1 0.0 0 7.1 1001.5 7.1 70
## 2 0.0 0 8.7 984.7 7.6 81
## 3 6.4 0 6.7 987.8 6.9 68
## 4 5.8 0 6.9 997.2 6.2 73
## 5 2.9 0 7.1 1000.7 6.4 74
## 6 2.5 0 8.0 1008.2 6.4 82
```

Now that all the data is preprocessed, we want to take a closer look at the different features. We will start by plotting the distributions.

```
library(ggplot2)
data %>%
select(prec, sun_dur, snow_depth, vapor_pres, pres, rel_humid, temp) %>%
gather(key = "feature", value = "value") %>%
ggplot(aes(value)) +
geom_histogram(bins = 35, fill = "blue", alpha = 0.7) +
facet_wrap(~feature, scales = "free") + # This change here makes both x and y axis free.
theme_minimal() +
labs(title = "Feature Distributions", x = NULL)
```

Next, we’ll describe the statistical properties of these features.

```
describe_data <- data %>%
select(prec, sun_dur, snow_depth, vapor_pres, pres, rel_humid, temp) %>%
summary()
describe_data
```

```
## prec sun_dur snow_depth vapor_pres
## Min. : 0.000 Min. : 0.000 Min. : 0.0000 Min. : 1.100
## 1st Qu.: 0.000 1st Qu.: 0.400 1st Qu.: 0.0000 1st Qu.: 6.400
## Median : 0.000 Median : 3.800 Median : 0.0000 Median : 8.900
## Mean : 1.605 Mean : 4.742 Mean : 0.7547 Mean : 9.599
## 3rd Qu.: 1.500 3rd Qu.: 8.000 3rd Qu.: 0.0000 3rd Qu.:12.500
## Max. :106.000 Max. :16.400 Max. :49.0000 Max. :23.900
## pres rel_humid temp
## Min. : 961 Min. : 29.00 Min. :-17.900
## 1st Qu.:1002 1st Qu.: 68.00 1st Qu.: 3.600
## Median :1008 Median : 78.00 Median : 9.500
## Mean :1008 Mean : 76.53 Mean : 9.374
## 3rd Qu.:1014 3rd Qu.: 87.00 3rd Qu.: 15.400
## Max. :1040 Max. :100.00 Max. : 29.500
```

We can see several things here. Let us start with precipitation, sunshine duration, and snow depth. All three show a large spike at 0. This indicates a problem with the data. Here, 0 means that there was no precipitation, but it could also mean that the precipitation was too low to be measured, or that there was a problem with the instrument. The same is true for sunshine duration. We cannot easily compensate for these problems, so we will drop the information. We do the same for snow depth, which is 0 most of the time.

```
data <- subset(data, select = -c(prec, sun_dur, snow_depth))
head(data)
```

```
## prec_type0 prec_type1 prec_type4 prec_type6 prec_type7 prec_type8 vapor_pres
## 1 0 1 0 0 0 0 7.1
## 2 0 1 0 0 0 0 8.7
## 3 1 0 0 0 0 0 6.7
## 4 0 1 0 0 0 0 6.9
## 5 0 1 0 0 0 0 7.1
## 6 0 1 0 0 0 0 8.0
## pres temp rel_humid
## 1 1001.5 7.1 70
## 2 984.7 7.6 81
## 3 987.8 6.9 68
## 4 997.2 6.2 73
## 5 1000.7 6.4 74
## 6 1008.2 6.4 82
```

Next, we look at the distribution for cloud cover. Cloud cover is measured in eighths, which actually makes it a discrete variable. However, here we have daily averages, so we get pseudo-continuous values. But there is still no real metric, so we convert the cloud cover to a percentage.

Finally, we see that the distributions for cloud cover, vapor pressure, pressure, relative humidity, and temperature are all on different scales and are not normally distributed. However, many learning algorithms rely on gradient descent or distance measures. This is problematic when the values are on very different scales. Furthermore, most features are bounded on one or even both sides, so that values outside these bounds are physically meaningless.

In the following, we will apply some transformations to the features. But before we do that, we need to take another important step first.

When we train a model, we need to test its performance. To do this, we need data that is similar to the training data, but independent of it. We achieve this by splitting our dataset into two parts: a training part, which we use to build the model, and a test part, which we use only for the final evaluation. Here we use 80% of our data for training and 20% for testing.

```
library(caret)
# split features and target variable
X <- data[, !(names(data) %in% "temp")] # drop the 'temp' column for features
y <- data$temp # target column
# split the data into training and test sets
set.seed(2) # to make the random split reproducible
trainIndex <- createDataPartition(y, p = 0.8, list = FALSE) # 80% for training
X_train <- X[trainIndex, ]
X_test <- X[-trainIndex, ]
y_train <- y[trainIndex]
y_test <- y[-trainIndex]
# print lengths
print(nrow(X_train))
```

`## [1] 20923`

`print(nrow(X_test))`

`## [1] 5229`

`print(length(y_train))`

`## [1] 20923`

`print(length(y_test))`

`## [1] 5229`

We have 20923 records in the training set and 5229 records in the test set. Note: The records are randomly shuffled and split to avoid any effects from the original order of the records.

Now we are ready for the promised feature scaling. The reason we did the train-test split first is that we want a pure training data set that is not influenced by the test data. If we were to determine the scaling parameters from the whole data set, information from the test data would be included in the parameters. Therefore, we perform the scaling on the training data only. Then, of course, we have to transform the test set with the same parameters. The features cloud cover, vapor pressure, pressure, relative humidity and temperature are all double constrained and we will use a logistic transformation with a prior min-max scaling.

The min-max scaling \[x′=\frac{x−x_{min}}{x_{max}−x_{min}}\] transforms the data into the range \([0,1]\). We want to avoid getting values with \(0\), because in the second step we need the logarithm of the values. Therefore, we use slightly larger boundary values \(x_{min}−1\) and \(x_{max}+1\). Thus \[x′=\frac{x−x_{min}+1}{x_{max}−x_{min}+2}\] Note that we transform the test set with the min max values from the train set

```
# columns to transform
transform_cols <- c("vapor_pres", "pres", "rel_humid")
for (i in transform_cols) {
# Scale the features for the test set
X_test[i] <- (X_test[i] - min(X_train[i]) + 1) / (max(X_train[i]) - min(X_train[i]) + 2)
# Scale the features for the training set
X_train[i] <- (X_train[i] - min(X_train[i]) + 1) / (max(X_train[i]) - min(X_train[i]) + 2)
# Apply custom min-max scaling to y_test and y_train
y_test <- (y_test - min(y_train) + 1) / (max(y_train) - min(y_train) + 2)
y_train <- (y_train - min(y_train) + 1) / (max(y_train) - min(y_train) + 2)
}
```

As a second step, we use a logistic transformation \[x''=\log\frac{x'}{1-x'}\] However, before we can start the transformation, we face a problem! Because we have transformed the test set with the values of the training set, the test set may contain values outside the range [0,1]. This will not work with the logarithm in the transformation.

```
# Print rows where any of the columns in transform_cols have values <= 0
print(X_test[rowSums(X_test[transform_cols] <= 0) > 0, ])
```

```
## [1] prec_type0 prec_type1 prec_type4 prec_type6 prec_type7 prec_type8 vapor_pres
## [8] pres rel_humid
## <0 rows> (or 0-length row.names)
```

`print(y_test[y_test <= 0])`

`## numeric(0)`

```
# Print rows in where any of the columns in transform_cols have values >= 1
print(X_test[rowSums(X_test[transform_cols] >= 1) > 0, ])
```

```
## [1] prec_type0 prec_type1 prec_type4 prec_type6 prec_type7 prec_type8 vapor_pres
## [8] pres rel_humid
## <0 rows> (or 0-length row.names)
```

`print(y_test[y_test >= 1])`

`## numeric(0)`

Fortunately, there are no values in the test set outside the range. If there were any, we had basically two options: either drop the data or impute new values. When there are only a few problematic values, we can drop them. Otherwise, we could assign the minimum or maximum value.

Let us perform the logistical transformation.

```
# Logistic transformation
X_test[transform_cols] <- log(X_test[transform_cols] / (1 - X_test[transform_cols]))
X_train[transform_cols] <- log(X_train[transform_cols] / (1 - X_train[transform_cols]))
y_test <- log(y_test / (1 - y_test))
y_train <- log(y_train / (1 - y_train))
```

Now, let’s take a look at the new distributions.

```
par(mar = c(4, 4, 0.5, 4), mfrow = c(2, 2), pin = c(2.5, 1.5))
hist(X_train$vapor_pres, main = "vapor_pres (Train)", col = "skyblue", border = "white", breaks = 35)
hist(X_train$pres, main = "pres (Train)", col = "skyblue", border = "white", breaks = 35)
hist(X_train$rel_humid, main = "rel_humid (Train)", col = "skyblue", border = "white", breaks = 35)
hist(y_train, main = "Temperature (Train)", col = "skyblue", border = "white", breaks = 35)
mtext("Distributions of transformed training data, now in real space", side = 3, outer = TRUE, line = - 1.2, cex = 1.4)
```

```
par(mar = c(4, 4, 0.5, 4), mfrow = c(2, 2), pin = c(2.5, 1.5))
hist(X_test$vapor_pres, main = "vapor_pres (Test)", col = "salmon", border = "white", breaks = 35)
hist(X_test$pres, main = "pres (Test)", col = "salmon", border = "white", breaks = 35)
hist(X_test$rel_humid, main = "rel_humid (Test)", col = "salmon", border = "white", breaks = 35)
hist(y_test, main = "Temperature (Test)", col = "salmon", border = "white", breaks = 35)
mtext("Distributions of transformed test data, now in real space", side = 3, outer = TRUE, line = -1.2, cex = 1.4)
```

All the features are now on the same scale and we can proceed to build our machine learning models.

All the necessary pre-processing is now done, and we can finally get to the actual model.

For this regression task, we’re going to use a neural network,
specifically leveraging the Keras library. We have opted for Keras over the `neuralnet`

package in R
due to performance considerations: `neuralnet`

, while useful
for smaller networks, tends to be quite slow for larger or more complex
models. Keras, on the other hand, is a high-level neural networks API,
originally developed in Python, and it can be utilized in R through
various interfaces. For advanced application of Keras we recommend Collet, Kalinowski & Allaire,2022.

Our initial model will be a fully connected network with two layers, each consisting of 9 nodes, which corresponds to the number of features in our dataset. This structure is a starting point and can be adjusted based on the model’s performance.

```
library(keras)
# Define the model
model <- keras_model_sequential() %>%
layer_dense(units = 9, activation = "relu", input_shape = dim(X_train)[2]) %>%
layer_dense(units = 9, activation = "relu") %>%
layer_dense(units = 1) # Regression output (no activation)
summary(model)
```

```
## Model: "sequential"
## ________________________________________________________________________________
## Layer (type) Output Shape Param #
## ================================================================================
## dense_2 (Dense) (None, 9) 90
## dense_1 (Dense) (None, 9) 90
## dense (Dense) (None, 1) 10
## ================================================================================
## Total params: 190
## Trainable params: 190
## Non-trainable params: 0
## ________________________________________________________________________________
```

Since our problem is of regression type, we use a linear activation function (see here). The rectified linear unit is a widely used activation function, but you should also try other functions and compare the results.

Next, we compile the model. When compiling a model in Keras, we need to specify at least two things: the loss function and the optimizer. While the loss function measures how well the model is performing, the optimizer is responsible for adjusting the weights of the network to minimize this loss. Experiment with different values of the learning rate to observe its impact on model training.

```
# Compile the model
model %>% compile(
optimizer = optimizer_adam(learning_rate = 0.0001),
loss = "mean_squared_error"
)
```

Now, we can train the model on our training data set. We choose 100 epochs, which means that the entire dataset is passed forward and backward through the neural network 100 times.

```
# Train the model
history <- model %>% fit(
as.matrix(X_train), as.matrix(y_train),
epochs = 100, batch_size = 200,
validation_split = 0, # No validation data since we already split the data
)
```

You may have thought “that’s it”, but we are only halfway through. Having a trained model is nice, but is it a good model, i.e. does it do what it is supposed to do? And how well is it doing it? And could we make the model even better? These questions are answered when we perform model evaluation and model selection.

First, let’s look at the training of the model. Within the neural network, we update the weights at the nodes to minimize a loss function. The following graph shows the value of this loss function as a function of epochs, i.e. the number of times the training data set is used.

`plot(history)`

As expected, the value of the loss function decreases as the number of epochs increases. This trend indicates that the model rapidly achieves a stable state of convergence.

`history`

```
##
## Final epoch (plot to see history):
## loss: 0.00007731
```

The low loss value of 7.7312128^{-5} indicates that our model is making very accurate predictions on the training data, with a very small average squared error between the predictions and the actual values. However, it is important to validate these results against the separate test data set we created to ensure that the model generalizes well to new data and is not just overfitting to the training set.

The `keras`

package provides an `evaluation`

function to validate the model on a test data set.

`eval <- model %>% evaluate(as.matrix(X_test), as.matrix(y_test))`

`## 164/164 - 1s - loss: 8.4872e-05 - 960ms/epoch - 6ms/step`

The model evaluation on the test dataset reveals a loss of approximately 8.4872474^{-5}. This exceptionally low loss value on the test set, similar to that observed during training, suggests that our model is not only accurately predicting on the training data but also on new, unseen data. This consistency in performance between training and testing phases is a strong indicator of the model’s robustness.

However, we might be lucky in our choice of training and test data.
To reduce the influence of this split, we perform a so-called *k-fold
cross validation*. The basic idea behind cross-validation is that we
split our data set into k folds. Then we train our model on k−1 folds
and use the resulting model to predict the values of the left-out fold.
We repeat this k times such that each of the k folds was once the
left-out fold.

```
### k-Fold Cross-Validation
## 1. define a function for creating the model
create_model <- function() {
model <- keras_model_sequential() %>%
layer_dense(units = 9, activation = "relu", input_shape = dim(X_train)[2]) %>%
layer_dense(units = 9, activation = "relu") %>%
layer_dense(units = 1) # assuming a regression
model %>% compile(
optimizer = optimizer_adam(learning_rate = 0.0001),
loss = "mean_squared_error"
)
return(model)
}
## 2. create k folds
set.seed(123) # For reproducibility
folds <- createFolds(y_train, k = 10, list = TRUE)
results <- vector("list", length(folds))
## 3. loop over all folds
for (i in seq_along(folds)) {
# Split the data
train_indices <- folds[[i]]
X_train_fold <- X_train[train_indices, ]
y_train_fold <- y_train[train_indices]
test_indices <- setdiff(seq_len(nrow(X_train)), train_indices)
X_test_fold <- X_train[test_indices, ]
y_test_fold <- y_train[test_indices]
# Create and compile the model
model <- create_model()
# Fit the model
history <- model %>% fit(
as.matrix(X_train_fold), as.matrix(y_train_fold),
epochs = 100, batch_size = 200
)
# Evaluate the model
results[[i]] <- model %>% evaluate(as.matrix(X_test_fold), as.matrix(y_test_fold))
}
```

```
# Calculate average performance
average_loss <- mean(unlist(results))
print(average_loss)
```

`## [1] 0.008842661`

An important thing to watch out for in machine learning is overfitting. This means that if we train our model well, it will learn all the tiny characteristics of the data and thus describe the training data perfectly. However, the test data may be slightly different, so the model will perform poorly on new data.

One way to detect overfitting is to run a so-called learning curve. This means that we train the model with training sets of different sizes.

```
library(ggplot2)
set.seed(123)
# Define a sequence of training sizes
train_sizes <- seq(0.1, 1, by = 0.1)
# Store errors
train_error <- numeric(length(train_sizes))
validation_error <- numeric(length(train_sizes))
i <- 1
for (size in train_sizes) {
# Sample subset of training data
index <- sample(1:nrow(X_train), size * nrow(X_train))
subset_data <- X_train[index, ]
subset_target <- y_train[index]
# Train model
model <- create_model()
history <- model %>% fit(as.matrix(subset_data), subset_target, epochs = 100, batch_size = 200, verbose = 0)
# Predict and compute error on training subset
train_error[i] <- min(unlist(history$metrics))
validation_error[i] <- model %>% evaluate(as.matrix(X_test), as.matrix(y_test))
i <- i + 1
}
```

```
# Create data frame for plotting
plot_data <- data.frame(
TrainSize = rep(train_sizes, 2),
error = c(train_error, validation_error),
DataType = factor(rep(c("Train", "Validation"), each = length(train_sizes)))
)
# Plot learning curve
ggplot(plot_data, aes(x = TrainSize * nrow(X_train), y = error, color = DataType)) +
geom_line() +
labs(
title = "Learning Curve",
x = "Training Size",
y = "Error",
color = "Data Type"
) +
theme_minimal()
```