40310_workflow.knit

A typical machine learning approach consists of several steps that can be arranged in a kind of standard workflow. Training of the model, which we have done in the previous examples of different algorithms, is only a small part of this workflow. Data preprocessing and model evaluation and selection are more extensive tasks.

In the following, we will walk through the entire workflow using a simple but more realistic example than using toy data.

Our goal is to use weather data and predict the daily mean temperature at a measurement station from several other measurements, such as pressure, vapor content, and so on.

1. Select data

For this example, we will use a daily time series from the DWD weather station Berlin-Dahlem (FU), downloaded from Climate Data Center on 2022-07-22. A detailed description of the variables is available here. For the purpose of this tutorial, the dataset is provided here.

1.1 Read dataset

We start by downloading the data, extracting the zip file, and reading the data.

# Loading necessary libraries
library(tidyverse)
library(httr)

# Downloading and reading data
url <- "http://userpage.fu-berlin.de/soga/soga-py/300/307000_time_series/tageswerte_KL_00403_19500101_20211231_hist.zip"
temp_zip <- tempfile()
download.file(url, temp_zip)
unzipped_file <- unzip(temp_zip, files = "produkt_klima_tag_19500101_20211231_00403.txt", exdir = tempdir())
data_raw <- read.table(unzipped_file, sep = ";", header = TRUE, na.strings = "-999")
head(data_raw)

##   STATIONS_ID MESS_DATUM QN_3 FX FM QN_4  RSK RSKF SDK SHK_TAG  NM VPM     PM
## 1         403   19500101   NA NA NA    5  2.2    7  NA       0 5.0 4.0 1025.6
## 2         403   19500102   NA NA NA    5 12.6    8  NA       0 8.0 6.1 1005.6
## 3         403   19500103   NA NA NA    5  0.5    1  NA       0 5.0 6.5  996.6
## 4         403   19500104   NA NA NA    5  0.5    7  NA       0 7.7 5.2  999.5
## 5         403   19500105   NA NA NA    5 10.3    7  NA       0 8.0 4.0 1001.1
## 6         403   19500106   NA NA NA    5  7.2    8  NA      12 7.3 5.6  997.5
##    TMK UPM  TXK  TNK  TGK eor
## 1 -3.2  83 -1.1 -4.9 -6.3 eor
## 2  1.0  95  2.2 -3.7 -5.3 eor
## 3  2.8  86  3.9  1.7 -1.4 eor
## 4 -0.1  85  2.1 -0.9 -2.3 eor
## 5 -2.8  79 -0.9 -3.3 -5.2 eor
## 6  2.6  79  5.0 -4.0 -4.0 eor

1.2 Remove unnecessary columns

By looking at the data and consulting the variable description, we can immediately find variables that are not useful for our purpose and drop them.

STATIONS_ID -> There is only one station in the data.
MESS_DATUM -> Since we are not interested in a time series analysis, this is just a numbering of the rows.
QN_3, FX, FM, eor -> These variables do not contain any data.
QN_4 -> This value is a data quality label and has no relation to weather data.

data <- subset(data_raw, select = -c(STATIONS_ID, MESS_DATUM, QN_3, FX, FM, NM, QN_4, eor))
head(data)

##    RSK RSKF SDK SHK_TAG VPM     PM  TMK UPM  TXK  TNK  TGK
## 1  2.2    7  NA       0 4.0 1025.6 -3.2  83 -1.1 -4.9 -6.3
## 2 12.6    8  NA       0 6.1 1005.6  1.0  95  2.2 -3.7 -5.3
## 3  0.5    1  NA       0 6.5  996.6  2.8  86  3.9  1.7 -1.4
## 4  0.5    7  NA       0 5.2  999.5 -0.1  85  2.1 -0.9 -2.3
## 5 10.3    7  NA       0 4.0 1001.1 -2.8  79 -0.9 -3.3 -5.2
## 6  7.2    8  NA      12 5.6  997.5  2.6  79  5.0 -4.0 -4.0

1.3 Rename columns for convenience

To improve the readability, we rename the remaining columns.

data <- rename(data,
  prec = RSK,
  prec_type = RSKF,
  sun_dur = SDK,
  snow_depth = SHK_TAG,
  vapor_pres = VPM,
  pres = PM,
  temp = TMK,
  rel_humid = UPM,
  temp_max = TXK,
  temp_min = TNK,
  temp_sfc = TGK
)
head(data)

##   prec prec_type sun_dur snow_depth vapor_pres   pres temp rel_humid temp_max
## 1  2.2         7      NA          0        4.0 1025.6 -3.2        83     -1.1
## 2 12.6         8      NA          0        6.1 1005.6  1.0        95      2.2
## 3  0.5         1      NA          0        6.5  996.6  2.8        86      3.9
## 4  0.5         7      NA          0        5.2  999.5 -0.1        85      2.1
## 5 10.3         7      NA          0        4.0 1001.1 -2.8        79     -0.9
## 6  7.2         8      NA         12        5.6  997.5  2.6        79      5.0
##   temp_min temp_sfc
## 1     -4.9     -6.3
## 2     -3.7     -5.3
## 3      1.7     -1.4
## 4     -0.9     -2.3
## 5     -3.3     -5.2
## 6     -4.0     -4.0

1.4 Remove correlated columns

In the next step, we see that there are four columns with temperature-related data:

TMK -> Daily Mean Temp (this is the temperature which we want to predict)
TXK -> Daily Max Temp at 2m
TNK -> Daily Min Temp at 2m
TGK -> Daily Min Temp at 5cm

The following pair plot shows that these values are highly correlated, as expected.

library(ggplot2)
library(GGally)

p <- ggpairs(data, columns = c("temp", "temp_max", "temp_min", "temp_sfc"))
p

Our goal is to predict the mean temperature (temp). If we leave the other temperature variables in the data set, our model will basically use these correlations and thus ignore all other variables. So we will drop temp_max, temp_min, and temp_sfc.

Note: If our target variable were something other than temperature, we would effectively have four correlated predictor variables. This could lead to a high influence on the model. It might then be useful to reduce this influence by dimensionally reducing the feature space. One way is to drop certain features, as we do here; another way can be a principal component analysis (PCA) where only the main components are kept in the analysis.

data <- subset(data, select = -c(temp_max, temp_min, temp_sfc))
head(data)

##   prec prec_type sun_dur snow_depth vapor_pres   pres temp rel_humid
## 1  2.2         7      NA          0        4.0 1025.6 -3.2        83
## 2 12.6         8      NA          0        6.1 1005.6  1.0        95
## 3  0.5         1      NA          0        6.5  996.6  2.8        86
## 4  0.5         7      NA          0        5.2  999.5 -0.1        85
## 5 10.3         7      NA          0        4.0 1001.1 -2.8        79
## 6  7.2         8      NA         12        5.6  997.5  2.6        79

1.5 Drop missing data

As a next step, we check the data for missing values.

missing_values <- sapply(data, function(x) sum(is.na(x)))
missing_values

##       prec  prec_type    sun_dur snow_depth vapor_pres       pres       temp 
##          0          0        145          1          0          0          0 
##  rel_humid 
##          1

As we can see, the percentage of missing values in our data is very small (147 out of the 26298 rows). Therefore, the simplest solution is to simply drop all rows with missing data. If the percentage were higher, other solutions might be more useful, such as imputing values for the missing values.

data <- drop_na(data)
head(data)

##   prec prec_type sun_dur snow_depth vapor_pres   pres temp rel_humid
## 1  1.0         1     0.0          0        7.1 1001.5  7.1        70
## 2  2.8         1     0.0          0        8.7  984.7  7.6        81
## 3  0.0         0     6.4          0        6.7  987.8  6.9        68
## 4  0.5         1     5.8          0        6.9  997.2  6.2        73
## 5 11.7         1     2.9          0        7.1 1000.7  6.4        74
## 6  1.9         1     2.5          0        8.0 1008.2  6.4        82

1.6 Handle categorical data

All of the data in our data set contain numerical values. However, while they are all continuous variables, the prec_type is a categorical variable. Furthermore, it is only nominal, not ordinal. If we leave the values as they are, the model will imply an order on the prec_type, assuming for example that type 8 is greater than type 6. This could lead to non-optimal results.

We will use a solution called one-hot encoding. This means that we create a new feature for each of the types and assign binary values to it.

data$prec_type <- as.factor(data$prec_type)
data <- model.matrix(~ . + 0, data)
data <- as.data.frame(data)
head(data)

##   prec prec_type0 prec_type1 prec_type4 prec_type6 prec_type7 prec_type8
## 1  1.0          0          1          0          0          0          0
## 2  2.8          0          1          0          0          0          0
## 3  0.0          1          0          0          0          0          0
## 4  0.5          0          1          0          0          0          0
## 5 11.7          0          1          0          0          0          0
## 6  1.9          0          1          0          0          0          0
##   sun_dur snow_depth vapor_pres   pres temp rel_humid
## 1     0.0          0        7.1 1001.5  7.1        70
## 2     0.0          0        8.7  984.7  7.6        81
## 3     6.4          0        6.7  987.8  6.9        68
## 4     5.8          0        6.9  997.2  6.2        73
## 5     2.9          0        7.1 1000.7  6.4        74
## 6     2.5          0        8.0 1008.2  6.4        82

2. Inspect data

Now that all the data is preprocessed, we want to take a closer look at the different features. We will start by plotting the distributions.

2.1 Check distributions

library(ggplot2)

data %>%
  select(prec, sun_dur, snow_depth, vapor_pres, pres, rel_humid, temp) %>%
  gather(key = "feature", value = "value") %>%
  ggplot(aes(value)) +
  geom_histogram(bins = 35, fill = "blue", alpha = 0.7) +
  facet_wrap(~feature, scales = "free") + # This change here makes both x and y axis free.
  theme_minimal() +
  labs(title = "Feature Distributions", x = NULL)

Next, we’ll describe the statistical properties of these features.

describe_data <- data %>%
  select(prec, sun_dur, snow_depth, vapor_pres, pres, rel_humid, temp) %>%
  summary()
describe_data

##       prec            sun_dur         snow_depth        vapor_pres    
##  Min.   :  0.000   Min.   : 0.000   Min.   : 0.0000   Min.   : 1.100  
##  1st Qu.:  0.000   1st Qu.: 0.400   1st Qu.: 0.0000   1st Qu.: 6.400  
##  Median :  0.000   Median : 3.800   Median : 0.0000   Median : 8.900  
##  Mean   :  1.605   Mean   : 4.742   Mean   : 0.7547   Mean   : 9.599  
##  3rd Qu.:  1.500   3rd Qu.: 8.000   3rd Qu.: 0.0000   3rd Qu.:12.500  
##  Max.   :106.000   Max.   :16.400   Max.   :49.0000   Max.   :23.900  
##       pres        rel_humid           temp        
##  Min.   : 961   Min.   : 29.00   Min.   :-17.900  
##  1st Qu.:1002   1st Qu.: 68.00   1st Qu.:  3.600  
##  Median :1008   Median : 78.00   Median :  9.500  
##  Mean   :1008   Mean   : 76.53   Mean   :  9.374  
##  3rd Qu.:1014   3rd Qu.: 87.00   3rd Qu.: 15.400  
##  Max.   :1040   Max.   :100.00   Max.   : 29.500

We can see several things here. Let us start with precipitation, sunshine duration, and snow depth. All three show a large spike at 0. This indicates a problem with the data. Here, 0 means that there was no precipitation, but it could also mean that the precipitation was too low to be measured, or that there was a problem with the instrument. The same is true for sunshine duration. We cannot easily compensate for these problems, so we will drop the information. We do the same for snow depth, which is 0 most of the time.

data <- subset(data, select = -c(prec, sun_dur, snow_depth))
head(data)

##   prec_type0 prec_type1 prec_type4 prec_type6 prec_type7 prec_type8 vapor_pres
## 1          0          1          0          0          0          0        7.1
## 2          0          1          0          0          0          0        8.7
## 3          1          0          0          0          0          0        6.7
## 4          0          1          0          0          0          0        6.9
## 5          0          1          0          0          0          0        7.1
## 6          0          1          0          0          0          0        8.0
##     pres temp rel_humid
## 1 1001.5  7.1        70
## 2  984.7  7.6        81
## 3  987.8  6.9        68
## 4  997.2  6.2        73
## 5 1000.7  6.4        74
## 6 1008.2  6.4        82

Next, we look at the distribution for cloud cover. Cloud cover is measured in eighths, which actually makes it a discrete variable. However, here we have daily averages, so we get pseudo-continuous values. But there is still no real metric, so we convert the cloud cover to a percentage.

Finally, we see that the distributions for cloud cover, vapor pressure, pressure, relative humidity, and temperature are all on different scales and are not normally distributed. However, many learning algorithms rely on gradient descent or distance measures. This is problematic when the values are on very different scales. Furthermore, most features are bounded on one or even both sides, so that values outside these bounds are physically meaningless.

In the following, we will apply some transformations to the features. But before we do that, we need to take another important step first.

3. Train-Test Split

When we train a model, we need to test its performance. To do this, we need data that is similar to the training data, but independent of it. We achieve this by splitting our dataset into two parts: a training part, which we use to build the model, and a test part, which we use only for the final evaluation. Here we use 80% of our data for training and 20% for testing.

library(caret)

# split features and target variable
X <- data[, !(names(data) %in% "temp")] # drop the 'temp' column for features
y <- data$temp # target column

# split the data into training and test sets
set.seed(2) # to make the random split reproducible
trainIndex <- createDataPartition(y, p = 0.8, list = FALSE) # 80% for training
X_train <- X[trainIndex, ]
X_test <- X[-trainIndex, ]
y_train <- y[trainIndex]
y_test <- y[-trainIndex]

# print lengths
print(nrow(X_train))

## [1] 20923

print(nrow(X_test))

## [1] 5229

print(length(y_train))

## [1] 20923

print(length(y_test))

## [1] 5229

We have 20923 records in the training set and 5229 records in the test set. Note: The records are randomly shuffled and split to avoid any effects from the original order of the records.

4. Feature Scaling

Now we are ready for the promised feature scaling. The reason we did the train-test split first is that we want a pure training data set that is not influenced by the test data. If we were to determine the scaling parameters from the whole data set, information from the test data would be included in the parameters. Therefore, we perform the scaling on the training data only. Then, of course, we have to transform the test set with the same parameters. The features cloud cover, vapor pressure, pressure, relative humidity and temperature are all double constrained and we will use a logistic transformation with a prior min-max scaling.

4.1 Min-Max Scaling

The min-max scaling \[x′=\frac{x−x_{min}}{x_{max}−x_{min}}\] transforms the data into the range \([0,1]\). We want to avoid getting values with \(0\), because in the second step we need the logarithm of the values. Therefore, we use slightly larger boundary values \(x_{min}−1\) and \(x_{max}+1\). Thus \[x′=\frac{x−x_{min}+1}{x_{max}−x_{min}+2}\] Note that we transform the test set with the min max values from the train set

# columns to transform
transform_cols <- c("vapor_pres", "pres", "rel_humid")

for (i in transform_cols) {
  # Scale the features for the test set
  X_test[i] <- (X_test[i] - min(X_train[i]) + 1) / (max(X_train[i]) - min(X_train[i]) + 2)

  # Scale the features for the training set
  X_train[i] <- (X_train[i] - min(X_train[i]) + 1) / (max(X_train[i]) - min(X_train[i]) + 2)

  # Apply custom min-max scaling to y_test and y_train
  y_test <- (y_test - min(y_train) + 1) / (max(y_train) - min(y_train) + 2)
  y_train <- (y_train - min(y_train) + 1) / (max(y_train) - min(y_train) + 2)
}

4.2 Logistic Transformation

As a second step, we use a logistic transformation \[x''=\log\frac{x'}{1-x'}\] However, before we can start the transformation, we face a problem! Because we have transformed the test set with the values of the training set, the test set may contain values outside the range [0,1]. This will not work with the logarithm in the transformation.

# Print rows where any of the columns in transform_cols have values <= 0
print(X_test[rowSums(X_test[transform_cols] <= 0) > 0, ])

## [1] prec_type0 prec_type1 prec_type4 prec_type6 prec_type7 prec_type8 vapor_pres
## [8] pres       rel_humid 
## <0 rows> (or 0-length row.names)

print(y_test[y_test <= 0])

## numeric(0)

# Print rows in where any of the columns in transform_cols have values >= 1
print(X_test[rowSums(X_test[transform_cols] >= 1) > 0, ])

## [1] prec_type0 prec_type1 prec_type4 prec_type6 prec_type7 prec_type8 vapor_pres
## [8] pres       rel_humid 
## <0 rows> (or 0-length row.names)

print(y_test[y_test >= 1])

## numeric(0)

Fortunately, there are no values in the test set outside the range. If there were any, we had basically two options: either drop the data or impute new values. When there are only a few problematic values, we can drop them. Otherwise, we could assign the minimum or maximum value.

Let us perform the logistical transformation.

# Logistic transformation
X_test[transform_cols] <- log(X_test[transform_cols] / (1 - X_test[transform_cols]))
X_train[transform_cols] <- log(X_train[transform_cols] / (1 - X_train[transform_cols]))
y_test <- log(y_test / (1 - y_test))
y_train <- log(y_train / (1 - y_train))

Inspect new distributions

Now, let’s take a look at the new distributions.

par(mar = c(4, 4, 0.5, 4), mfrow = c(2, 2), pin = c(2.5, 1.5))
hist(X_train$vapor_pres, main = "vapor_pres (Train)", col = "skyblue", border = "white", breaks = 35)
hist(X_train$pres, main = "pres (Train)", col = "skyblue", border = "white", breaks = 35)
hist(X_train$rel_humid, main = "rel_humid (Train)", col = "skyblue", border = "white", breaks = 35)
hist(y_train, main = "Temperature (Train)", col = "skyblue", border = "white", breaks = 35)

mtext("Distributions of transformed training data, now in real space", side = 3, outer = TRUE, line = - 1.2, cex = 1.4)

par(mar = c(4, 4, 0.5, 4), mfrow = c(2, 2), pin = c(2.5, 1.5))
hist(X_test$vapor_pres, main = "vapor_pres (Test)", col = "salmon", border = "white", breaks = 35)
hist(X_test$pres, main = "pres (Test)", col = "salmon", border = "white", breaks = 35)
hist(X_test$rel_humid, main = "rel_humid (Test)", col = "salmon", border = "white", breaks = 35)
hist(y_test, main = "Temperature (Test)", col = "salmon", border = "white", breaks = 35)

mtext("Distributions of transformed test data, now in real space", side = 3, outer = TRUE, line = -1.2, cex = 1.4)

All the features are now on the same scale and we can proceed to build our machine learning models.

5. Model Definition

All the necessary pre-processing is now done, and we can finally get to the actual model.

For this regression task, we’re going to use a neural network, specifically leveraging the Keras library. We have opted for Keras over the neuralnet package in R due to performance considerations: neuralnet, while useful for smaller networks, tends to be quite slow for larger or more complex models. Keras, on the other hand, is a high-level neural networks API, originally developed in Python, and it can be utilized in R through various interfaces. For advanced application of Keras we recommend Collet, Kalinowski & Allaire,2022.

Our initial model will be a fully connected network with two layers, each consisting of 9 nodes, which corresponds to the number of features in our dataset. This structure is a starting point and can be adjusted based on the model’s performance.

library(keras)

# Define the model
model <- keras_model_sequential() %>%
  layer_dense(units = 9, activation = "relu", input_shape = dim(X_train)[2]) %>%
  layer_dense(units = 9, activation = "relu") %>%
  layer_dense(units = 1) # Regression output (no activation)


summary(model)

## Model: "sequential"
## ________________________________________________________________________________
##  Layer (type)                       Output Shape                    Param #     
## ================================================================================
##  dense_2 (Dense)                    (None, 9)                       90          
##  dense_1 (Dense)                    (None, 9)                       90          
##  dense (Dense)                      (None, 1)                       10          
## ================================================================================
## Total params: 190
## Trainable params: 190
## Non-trainable params: 0
## ________________________________________________________________________________

Since our problem is of regression type, we use a linear activation function (see here). The rectified linear unit is a widely used activation function, but you should also try other functions and compare the results.

Next, we compile the model. When compiling a model in Keras, we need to specify at least two things: the loss function and the optimizer. While the loss function measures how well the model is performing, the optimizer is responsible for adjusting the weights of the network to minimize this loss. Experiment with different values of the learning rate to observe its impact on model training.

# Compile the model
model %>% compile(
  optimizer = optimizer_adam(learning_rate = 0.0001),
  loss = "mean_squared_error"
)

6. Model Training

Now, we can train the model on our training data set. We choose 100 epochs, which means that the entire dataset is passed forward and backward through the neural network 100 times.

# Train the model
history <- model %>% fit(
  as.matrix(X_train), as.matrix(y_train),
  epochs = 100, batch_size = 200,
  validation_split = 0, # No validation data since we already split the data
)

7. Model Evaluation

You may have thought “that’s it”, but we are only halfway through. Having a trained model is nice, but is it a good model, i.e. does it do what it is supposed to do? And how well is it doing it? And could we make the model even better? These questions are answered when we perform model evaluation and model selection.

7.1. Convergence of Training

First, let’s look at the training of the model. Within the neural network, we update the weights at the nodes to minimize a loss function. The following graph shows the value of this loss function as a function of epochs, i.e. the number of times the training data set is used.

plot(history)

As expected, the value of the loss function decreases as the number of epochs increases. This trend indicates that the model rapidly achieves a stable state of convergence.

history

## 
## Final epoch (plot to see history):
## loss: 0.00007731

The low loss value of 7.7312128^{-5} indicates that our model is making very accurate predictions on the training data, with a very small average squared error between the predictions and the actual values. However, it is important to validate these results against the separate test data set we created to ensure that the model generalizes well to new data and is not just overfitting to the training set.

7.2. Model Validation

The keras package provides an evaluation function to validate the model on a test data set.

eval <- model %>% evaluate(as.matrix(X_test), as.matrix(y_test))

## 164/164 - 1s - loss: 8.4872e-05 - 960ms/epoch - 6ms/step

The model evaluation on the test dataset reveals a loss of approximately 8.4872474^{-5}. This exceptionally low loss value on the test set, similar to that observed during training, suggests that our model is not only accurately predicting on the training data but also on new, unseen data. This consistency in performance between training and testing phases is a strong indicator of the model’s robustness.

However, we might be lucky in our choice of training and test data. To reduce the influence of this split, we perform a so-called k-fold cross validation. The basic idea behind cross-validation is that we split our data set into k folds. Then we train our model on k−1 folds and use the resulting model to predict the values of the left-out fold. We repeat this k times such that each of the k folds was once the left-out fold.

### k-Fold Cross-Validation

## 1. define a function for creating the model
create_model <- function() {
  model <- keras_model_sequential() %>%
    layer_dense(units = 9, activation = "relu", input_shape = dim(X_train)[2]) %>%
    layer_dense(units = 9, activation = "relu") %>%
    layer_dense(units = 1) # assuming a regression

  model %>% compile(
    optimizer = optimizer_adam(learning_rate = 0.0001),
    loss = "mean_squared_error"
  )
  return(model)
}

## 2. create k folds
set.seed(123) # For reproducibility
folds <- createFolds(y_train, k = 10, list = TRUE)
results <- vector("list", length(folds))


## 3. loop over all folds
for (i in seq_along(folds)) {
  # Split the data
  train_indices <- folds[[i]]
  X_train_fold <- X_train[train_indices, ]
  y_train_fold <- y_train[train_indices]

  test_indices <- setdiff(seq_len(nrow(X_train)), train_indices)
  X_test_fold <- X_train[test_indices, ]
  y_test_fold <- y_train[test_indices]

  # Create and compile the model
  model <- create_model()

  # Fit the model
  history <- model %>% fit(
    as.matrix(X_train_fold), as.matrix(y_train_fold),
    epochs = 100, batch_size = 200
  )

  # Evaluate the model
  results[[i]] <- model %>% evaluate(as.matrix(X_test_fold), as.matrix(y_test_fold))
}

# Calculate average performance
average_loss <- mean(unlist(results))
print(average_loss)

## [1] 0.008842661

7.3 Learning Curve

An important thing to watch out for in machine learning is overfitting. This means that if we train our model well, it will learn all the tiny characteristics of the data and thus describe the training data perfectly. However, the test data may be slightly different, so the model will perform poorly on new data.

One way to detect overfitting is to run a so-called learning curve. This means that we train the model with training sets of different sizes.

library(ggplot2)

set.seed(123)

# Define a sequence of training sizes
train_sizes <- seq(0.1, 1, by = 0.1)

# Store errors
train_error <- numeric(length(train_sizes))
validation_error <- numeric(length(train_sizes))

i <- 1
for (size in train_sizes) {
  # Sample subset of training data
  index <- sample(1:nrow(X_train), size * nrow(X_train))
  subset_data <- X_train[index, ]
  subset_target <- y_train[index]

  # Train model
  model <- create_model()
  history <- model %>% fit(as.matrix(subset_data), subset_target, epochs = 100, batch_size = 200, verbose = 0)

  # Predict and compute error on training subset
  train_error[i] <- min(unlist(history$metrics))
  validation_error[i] <- model %>% evaluate(as.matrix(X_test), as.matrix(y_test))

  i <- i + 1
}

# Create data frame for plotting
plot_data <- data.frame(
  TrainSize = rep(train_sizes, 2),
  error = c(train_error, validation_error),
  DataType = factor(rep(c("Train", "Validation"), each = length(train_sizes)))
)

# Plot learning curve
ggplot(plot_data, aes(x = TrainSize * nrow(X_train), y = error, color = DataType)) +
  geom_line() +
  labs(
    title = "Learning Curve",
    x = "Training Size",
    y = "Error",
    color = "Data Type"
  ) +
  theme_minimal()

If we assume overfitting, we would expect the training and validation curves to diverge, or at least to have an offset, because the training value would be better, i.e. smaller, than the validation value. Here, however, both lines agree for the large training set sizes. We can therefore conclude that there is no overfitting in our model.

8. A good model might not do what we think it does…

In our model, we used vapor pressure and relative humidity, among other things, to predict temperature. Let’s add some physical background here. Relative humidity is defined as the ratio of vapor pressure to saturation vapor pressure. From the relative humidity and the vapor pressure, we can calculate the saturation vapor pressure. The saturation vapor pressure, in turn, has a relationship to temperature called the Magnus formula. This is not a linear relationship, but as we know, neural networks can model non-linear relationships. Thus, the model probably models only the Magnus formula and simply ignores all other data. We can test this by removing the vapor pressure or the relative humidity from the data and train the model again.

# Drop vapor pressure
X_train_sub <- X_train[, !(names(X_train) %in% "vapor_pres")]
X_test_sub <- X_test[, !(names(X_test) %in% "vapor_pres")]

# Define and compile the model
model <- keras_model_sequential() %>%
  layer_dense(units = 8, activation = "relu", input_shape = dim(X_train_sub)[2]) %>%
  layer_dense(units = 8, activation = "relu") %>%
  layer_dense(units = 1) # Regression output (no activation)

model %>% compile(
  optimizer = optimizer_adam(learning_rate = 0.0001),
  loss = "mean_squared_error"
)

# Train the model
history <- model %>% fit(
  as.matrix(X_train_sub), as.matrix(y_train),
  epochs = 100, batch_size = 200,
  validation_split = 0,
)

# Validation
model %>% evaluate(as.matrix(X_test_sub), as.matrix(y_test))

## 164/164 - 0s - loss: 0.0042 - 497ms/epoch - 3ms/step

##        loss 
## 0.004226397

The performance drops by two orders of magnitude. This is still a good performance. However, we can conclude that our previous model did a good job, but it might have been easier to just use our physical knowledge in the first place.

Citation

The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us via mail by soga[at]zedat.fu-berlin.de.

You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follow: Hartmann, K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.