In the previous sections we built IDW models for the mean annual temperature and the mean annual rainfall in East Germany. The results looked quite reasonable, but as you have probably noticed, we used a fixed inverse distance power \(\beta\) and a fixed number of nearest observations \(n\) for modelling. It is of course not guaranteed that these somewhat arbitrarily chosen parameter values yield the best model configuration in terms of the model's predictive power.

One way of tuning model parameters, a task sometimes referred to as hyperparameter tuning or model selection, is a technique called cross-validation, or \(k\)-fold cross-validation.

The basic idea behind cross-validation is that we split our data set into \(k\) folds. We then train our model on \(k-1\) folds and use the resulting model to predict the values of the left-out fold. We repeat this \(k\) times, so that each of the \(k\) folds is left out exactly once. For each iteration we assess the predictive power of the model using an adequate model assessment metric, such as the root-mean-square error (RMSE) for continuous data or accuracy for categorical data, among others. Thus, we assess the predictive power of the model \(k\) times. Finally, we summarize the \(k\) values of the model assessment metric, either by averaging or by majority vote. This way we obtain a single value, which serves as our final model assessment metric. We then repeat this procedure for every model parameter combination of interest. After we have iteratively tried a reasonable number of parameter combinations, we compare all model configurations and select the one that performs best with respect to its predictive power.
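
To make these steps concrete, here is a minimal sketch of \(k\)-fold cross-validation in plain R. It uses a simple linear model on the built-in cars data set purely to illustrate the fold mechanics; the model and data are placeholders and not part of our IDW workflow.

# minimal k-fold cross-validation sketch (illustration only)
set.seed(42) # reproducible fold assignment
k <- 5
folds <- sample(rep(1:k, length.out = nrow(cars))) # assign each row to a fold

rmse_per_fold <- numeric(k)
for (i in 1:k) {
  train <- cars[folds != i, ] # k - 1 folds for training
  test <- cars[folds == i, ] # left-out fold for evaluation
  fit <- lm(dist ~ speed, data = train) # fit the model on the training folds
  pred <- predict(fit, newdata = test) # predict the left-out fold
  rmse_per_fold[i] <- sqrt(mean((pred - test$dist)^2)) # per-fold RMSE
}
mean(rmse_per_fold) # averaged model assessment metric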

We load the required data sets and packages:

load(url("https://userpage.fu-berlin.de/soga/data/r-data/East_Germany.RData"))
library(gstat)
library(sp)
library(sf)
library(raster)
library(tidyverse)
library(mapview)

Cross-validation for rainfall data using IDW

In this section we focus on a special type of cross-validation, the so-called leave-one-out cross-validation (LOOCV). For LOOCV, \(k\) becomes \(n\); hence we compute \(n\) models and assess the predictive power of each model on the one left-out data point.

The leave-one-out cross-validation procedure can be specified as follows:

  1. Split the data set into \(k=n\) chunks.
  2. Compute \(n\) models and evaluate each model on the one left-out data point.
  3. Compute the average model assessment metric over \(n\) iterations.
  4. Repeat step 1 to 3 for all parameter combinations of interest.
  5. Select the model with the parameter combination that performs best with respect to the chosen model assessment metric.

Steps 1–3

The gstat package provides the gstat.cv() function for cross-validation. The argument nfold is used to specify \(k\), the number of folds. By default nfold equals the number of rows in the data set, hence the default corresponds to LOOCV.

Let us build an IDW model for the mean annual rainfall in East Germany based on DWD weather station data from the period 1981–2010 (dwd_east_sp). For now we fix the inverse distance power parameter (\(\beta=1\)) and the number of nearest observations used for modelling (\(n = 71-1 = 70\), i.e. one less than the number of weather stations).

# set parameter
neighbors <- length(dwd_east_sp) - 1
beta <- 1

# build model
idw_rain <- gstat(
  formula = Rainfall ~ 1, # intercept-only model
  data = dwd_east_sp,
  nmax = neighbors,
  set = list(idp = beta)
)

crossval <- gstat.cv(idw_rain)
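
Conceptually, gstat.cv() performs something like the following manual loop: for every station, refit the IDW model on all other stations and predict the left-out one. This is only an illustrative sketch of the LOOCV idea, not how gstat.cv() is implemented internally.

# manual LOOCV sketch (for illustration only; gstat.cv() does this for us)
n_obs <- length(dwd_east_sp)
loocv_residuals <- numeric(n_obs)

for (i in 1:n_obs) {
  # refit the IDW model without the i-th station
  idw_i <- gstat(
    formula = Rainfall ~ 1,
    data = dwd_east_sp[-i, ],
    nmax = neighbors,
    set = list(idp = beta)
  )
  # predict the left-out station and store observed minus predicted
  pred_i <- predict(idw_i, newdata = dwd_east_sp[i, ], debug.level = 0)
  loocv_residuals[i] <- dwd_east_sp$Rainfall[i] - pred_i$var1.pred
}

Up to the order of the folds, loocv_residuals should match the residual column of the crossval object that we inspect next.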

The gstat.cv() function returns a SpatialPointsDataFrame object, whose data slot can be accessed using the @ operator. We take a look at the last 4 rows.

tail(crossval@data, 4)
##    var1.pred var1.var observed    residual zscore fold
## 68  584.3192       NA      584  -0.3191908     NA   68
## 69  582.9338       NA      538 -44.9338168     NA   69
## 70  645.6698       NA      605 -40.6697712     NA   70
## 71  663.0832       NA     1006 342.9168322     NA   71

In the table there are 4 columns of special interest: The fold column indicates the number of the fold \((k)\). As expected, the object crossval contains 71 rows, each of them corresponding to one particular IDW model. Each row further contains the observed value of the corresponding left-out data point (observed). In addition, each row contains the variable var1.pred, which corresponds to the model prediction for the left-out point based on the remaining \(n-1\) observations; in our example \(n-1=70\). The column residual simply corresponds to the difference between the observation (observed) and the model prediction (var1.pred).

We may easily access the model residuals using the $ operator.

crossval$residual
##  [1]  -59.8078267  -24.7743795  -44.2523271 -160.3544233  118.6375595
##  [6]  -74.2967074 -125.6696066    3.2700253   16.2607972  -23.5151190
## [11]   10.0117128  -62.2847065 -116.1016077    8.9053342  -21.1273897
## [16] 1071.8917015   55.8193893  -53.3269594 -143.7765103   99.0058357
## [21]  -54.0819889    4.9314285 -139.0689523  476.0746490  -94.2994168
## [26] -160.2745127  -34.2455384   15.5116111   19.6349615 -277.8832433
## [31] -289.5747935 -247.2612095  -48.0003701    4.5349181  535.6536613
## [36]  -80.4224348   -1.1862498 -205.9775425   22.1948181    2.5633624
## [41]  -86.4253955  -25.5148102  206.4040109 -110.7686683   -1.6539548
## [46]  -73.6453775  -40.8586997  -46.9393554  184.0806018  -64.3922790
## [51] -176.1820127  -14.0431967   14.8743140 -193.1252861  -91.5819185
## [56]   15.4326398  653.7194592  -61.2055596   30.9834143  -42.7008986
## [61] -117.6669984 -129.6844130   38.7316624  -78.5207921 -459.9688700
## [66]  415.3014081  -46.0472715   -0.3191908  -44.9338168  -40.6697712
## [71]  342.9168322
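
As a quick consistency check, each residual should simply be the observed value minus the corresponding prediction:

# the residual column should equal observed minus predicted
all.equal(crossval$residual, crossval$observed - crossval$var1.pred)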

In order to assess the predictive power of the model we compute a model assessment metric for the parameter combination \(\beta =1\) and \(n = 70\). A very common model assessment metric, when dealing with continuous data, is the root-mean-square error (RMSE), given by

\[RMSE = \sqrt{\frac{\sum_{i=1}^n(\hat y_i-y_i)^2}{n}}\text{,}\]

where \(\hat y_i\) corresponds to the model prediction and \(y_i\) to the observed value. For our convenience we implement a function to compute RMSE based on the residual column of the output of the gstat.cv() function.

Exercise: Define a function that computes the root-mean-square error from the residuals.

## Your code here...
RMSE <- function(residuals) {
  NULL
}
Show code
RMSE <- function(residuals) {
  sqrt(sum((residuals)^2) / length(residuals))
}


We give it a try and compute the RMSE for our IDW model for the annual rainfall in East Germany with \(\beta =1\) and \(n = 70\).

crossval_rmse <- RMSE(crossval$residual)
crossval_rmse
## [1] 215.0979

The RMSE for our IDW model with \(\beta = 1\) and \(n = 70\) is approximately 215 mm.
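
To put this value into perspective, we may compare it with the spread of the observed rainfall values, for instance their standard deviation (a rough point of reference only, not a formal benchmark):

# compare the cross-validated RMSE with the spread of the observations
sd(dwd_east_sp$Rainfall)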

Exercise: Build an IDW model for the mean annual temperature in East Germany based on DWD weather station data from the period 1981–2010 (dwd_east_sp) with fixed \(\beta=1\) and \(n=70\). Then, assess the predictive power of your model using the root-mean-square error.

## Your code here...

# set parameter
neighbors <- NULL
beta <- NULL

# build model
idw_temp <- NULL

crossval_temp <- NULL

# predictive power
crossval_temp_rmse <- NULL
Show code
# set parameter
neighbors <- length(dwd_east_sp) - 1
beta <- 1

# build model
idw_temp <- gstat(
  formula = Temperature ~ 1, # intercept-only model
  data = dwd_east_sp,
  nmax = neighbors,
  set = list(idp = beta)
)

crossval_temp <- gstat.cv(idw_temp)

# predictive power
crossval_temp_rmse <- RMSE(crossval_temp$residual)

The RMSE for the temperature IDW model with \(\beta = 1\) and \(n = 70\) is approximately 1 °C.



Recall the model selection procedure outlined above. So far we have completed steps 1 to 3:

  1. Split the data set into \(k = 71\) chunks.
  2. Compute \(k = 71\) models, each trained on \(k-1 = 70\) data points and evaluated on the one left-out data point.
  3. Compute a mean model assessment metric, in our case the RMSE.

For the fourth step we repeat steps 1 to 3 for all parameter combinations of interest, and in the fifth step we select the model with the parameter combination that performs best with respect to the RMSE. These last two steps involve repeated calculations, for which we write our own custom function, named cv_IDW().

The function cv_IDW() takes as inputs a SpatialPointsDataFrame object (spatialDF), a formula specifying the model (stat.formula), a sequence of numbers of neighbors to be included in the computation (seqNeighbors), a sequence of \(\beta\) values (seqBeta), the cell size of the grid on which the interpolation will be performed (evalGridSize), and optionally a spatial object defining the area for which the model will be evaluated (evalRaster). For ease of application we provide default values for the majority of the function arguments.

cv_IDW <- function(spatialDF, stat.formula = NULL,
                   seqNeighbors = NULL, seqBeta = NULL,
                   evalGridSize = NULL,
                   evalRaster = NULL,
                   verbose = TRUE) {
  
  ### LOAD LIBRARIES ###
  library(sp)
  library(sf)
  library(gstat)
  library(raster)

  ### PROVIDE DEFAULT VALUES FOR FUNCTION ARGUMENTS ###
  if (is.null(seqNeighbors)) {
    seqNeighbors <- round(seq(3, length(spatialDF), length.out = 10))
  }
  if (is.null(seqBeta)) {
    seqBeta <- c(0.1, seq(0.5, 3, 0.5))
  }
  if (is.null(evalGridSize)) {
    x.interval <- extent(spatialDF)@xmax - extent(spatialDF)@xmin
    y.interval <- extent(spatialDF)@ymax - extent(spatialDF)@ymin
    evalGridSize <- round(min(x.interval, y.interval) * 0.05)
  }
  if (is.null(stat.formula)) {
    print("Please provide a formula!!")
    return()
  }
  if (is.null(evalRaster)) {
    extent.evalGrid <- extent(spatialDF)
  } else {
    extent.evalGrid <- extent(evalRaster)
  }

  ### BUILD A GRID FOR PARAMETER COMBINATIONS ###
  cv_Grid <- expand.grid(
    Beta = seqBeta,
    Neighbors = seqNeighbors
  )
  cv_Grid$RMSE <- NA

  ### LOOP THROUGH ALL PARAMETER COMBINATIONS ###
  for (i in 1:nrow(cv_Grid)) {
    ### BUILD IDW MODEL ###
    idw <- gstat(
      formula = stat.formula,
      data = spatialDF,
      nmax = cv_Grid[i, "Neighbors"],
      set = list(idp = cv_Grid[i, "Beta"])
    )
    
    ### PERFORM LOOCV ###
    # nmax and idp are already stored in the gstat object built above
    crossval <- gstat.cv(idw, debug.level = 0)
    cv_Grid[i, "RMSE"] <- RMSE(crossval$residual)
    if (verbose) {
      print(paste("Function call", i, "out of", nrow(cv_Grid)))
      print(paste(
        "Evaluating beta =",
        cv_Grid[i, "Beta"],
        "and neighbors =",
        cv_Grid[i, "Neighbors"]
      ))
      print(paste("RMSE =", RMSE(crossval$residual)))
    }
  }

  ### GET BEST PARAMETER VALUES ###
  idx_min <- which.min(cv_Grid$RMSE)
  best_Beta <- cv_Grid$Beta[idx_min]
  best_Neighbors <- cv_Grid$Neighbors[idx_min]
  min_RMSE <- cv_Grid$RMSE[idx_min]

  ### BUILD IDW MODEL BASED ON BEST PARAMETER VALUES ###
  idw_best <- gstat(
    formula = stat.formula,
    data = spatialDF,
    nmax = best_Neighbors,
    set = list(idp = best_Beta)
  )

  ### PREPARE EVALUATION GRID ###
  grid_evalGrid <- expand.grid(
    x = seq(
      from = round(extent.evalGrid@xmin),
      to = round(extent.evalGrid@xmax),
      by = evalGridSize
    ),
    y = seq(
      from = round(extent.evalGrid@ymin),
      to = round(extent.evalGrid@ymax),
      by = evalGridSize
    )
  )

  coordinates(grid_evalGrid) <- ~ x + y
  proj4string(grid_evalGrid) <- proj4string(spatialDF)
  gridded(grid_evalGrid) <- TRUE

  ### INTERPOLATE VALUES FOR EVALUATION GRID USING THE BEST MODEL ###
  idw_best_predict <- predict(
    object = idw_best,
    newdata = grid_evalGrid,
    debug.level = 0
  )

  ### RETURN RESULTS AND OBJECTS ###
  return(list(
    "idwBestModel" = idw_best,
    "idwBestRaster" = idw_best_predict,
    "bestBeta" = best_Beta,
    "bestNeighbors" = best_Neighbors,
    "bestRMSE" = min_RMSE,
    "gridCV" = cv_Grid
  ))
}

The cv_IDW() function performs model selection based on LOOCV and returns a list object, which contains (1) the best model (idwBestModel), a gstat object; (2) the spatial interpolation based on the best model configuration (idwBestRaster), a SpatialPixelsDataFrame object; (3) the optimal \(\beta\) parameter (bestBeta); (4) the optimal number of neighbors (bestNeighbors); (5) the corresponding RMSE (bestRMSE), the latter three being numeric objects; and (6) a data.frame object (gridCV), containing each parameter combination and its corresponding RMSE.

OK, now that we have defined our custom function cv_IDW(), let us apply it to the data set dwd_east_sp in order to find the optimal IDW model for the spatial prediction of mean annual rainfall in East Germany for the period 1981 to 2010. Note that, depending on your computer resources, this can take some time.

my_IDW <- cv_IDW(
  spatialDF = dwd_east_sp,
  stat.formula = formula(Rainfall ~ 1),
  evalGridSize = 1000,
  evalRaster = east_germany_states_sp,
  verbose = TRUE
)
## [1] "Function call 1 out of 70"
## [1] "Evaluating beta = 0.1 and neighbors = 3"
## [1] "RMSE = 223.08394029491"
## [1] "Function call 2 out of 70"
## [1] "Evaluating beta = 0.5 and neighbors = 3"
## [1] "RMSE = 219.651853574751"
## [1] "Function call 3 out of 70"
## [1] "Evaluating beta = 1 and neighbors = 3"
## [1] "RMSE = 217.161008465413"
## [1] "Function call 4 out of 70"
## [1] "Evaluating beta = 1.5 and neighbors = 3"
## [1] "RMSE = 217.491397084949"
## [1] "Function call 5 out of 70"
## [1] "Evaluating beta = 2 and neighbors = 3"
## [1] "RMSE = 219.803064740303"
## [1] "Function call 6 out of 70"
## [1] "Evaluating beta = 2.5 and neighbors = 3"
## [1] "RMSE = 222.858295140589"
## [1] "Function call 7 out of 70"
## [1] "Evaluating beta = 3 and neighbors = 3"
## [1] "RMSE = 225.893166172742"
## [1] "Function call 8 out of 70"
## [1] "Evaluating beta = 0.1 and neighbors = 11"
## [1] "RMSE = 229.02767364742"
## [1] "Function call 9 out of 70"
## [1] "Evaluating beta = 0.5 and neighbors = 11"
## [1] "RMSE = 217.508802614293"
## [1] "Function call 10 out of 70"
## [1] "Evaluating beta = 1 and neighbors = 11"
## [1] "RMSE = 204.410984213919"
## [1] "Function call 11 out of 70"
## [1] "Evaluating beta = 1.5 and neighbors = 11"
## [1] "RMSE = 201.627303440551"
## [1] "Function call 12 out of 70"
## [1] "Evaluating beta = 2 and neighbors = 11"
## [1] "RMSE = 206.602122898029"
## [1] "Function call 13 out of 70"
## [1] "Evaluating beta = 2.5 and neighbors = 11"
## [1] "RMSE = 213.101502525785"
## [1] "Function call 14 out of 70"
## [1] "Evaluating beta = 3 and neighbors = 11"
## [1] "RMSE = 218.733159017494"
## [1] "Function call 15 out of 70"
## [1] "Evaluating beta = 0.1 and neighbors = 18"
## [1] "RMSE = 228.840411109831"
## [1] "Function call 16 out of 70"
## [1] "Evaluating beta = 0.5 and neighbors = 18"
## [1] "RMSE = 218.401542021908"
## [1] "Function call 17 out of 70"
## [1] "Evaluating beta = 1 and neighbors = 18"
## [1] "RMSE = 202.423600584451"
## [1] "Function call 18 out of 70"
## [1] "Evaluating beta = 1.5 and neighbors = 18"
## [1] "RMSE = 196.87653360167"
## [1] "Function call 19 out of 70"
## [1] "Evaluating beta = 2 and neighbors = 18"
## [1] "RMSE = 202.324106329969"
## [1] "Function call 20 out of 70"
## [1] "Evaluating beta = 2.5 and neighbors = 18"
## [1] "RMSE = 210.089651347729"
## [1] "Function call 21 out of 70"
## [1] "Evaluating beta = 3 and neighbors = 18"
## [1] "RMSE = 216.715787855657"
## [1] "Function call 22 out of 70"
## [1] "Evaluating beta = 0.1 and neighbors = 26"
## [1] "RMSE = 233.764718000112"
## [1] "Function call 23 out of 70"
## [1] "Evaluating beta = 0.5 and neighbors = 26"
## [1] "RMSE = 224.27342628731"
## [1] "Function call 24 out of 70"
## [1] "Evaluating beta = 1 and neighbors = 26"
## [1] "RMSE = 206.747967744414"
## [1] "Function call 25 out of 70"
## [1] "Evaluating beta = 1.5 and neighbors = 26"
## [1] "RMSE = 197.877582057941"
## [1] "Function call 26 out of 70"
## [1] "Evaluating beta = 2 and neighbors = 26"
## [1] "RMSE = 202.07393321926"
## [1] "Function call 27 out of 70"
## [1] "Evaluating beta = 2.5 and neighbors = 26"
## [1] "RMSE = 209.674859797379"
## [1] "Function call 28 out of 70"
## [1] "Evaluating beta = 3 and neighbors = 26"
## [1] "RMSE = 216.356742247241"
## [1] "Function call 29 out of 70"
## [1] "Evaluating beta = 0.1 and neighbors = 33"
## [1] "RMSE = 235.165963670567"
## [1] "Function call 30 out of 70"
## [1] "Evaluating beta = 0.5 and neighbors = 33"
## [1] "RMSE = 226.196906371644"
## [1] "Function call 31 out of 70"
## [1] "Evaluating beta = 1 and neighbors = 33"
## [1] "RMSE = 207.88272160516"
## [1] "Function call 32 out of 70"
## [1] "Evaluating beta = 1.5 and neighbors = 33"
## [1] "RMSE = 197.343665293022"
## [1] "Function call 33 out of 70"
## [1] "Evaluating beta = 2 and neighbors = 33"
## [1] "RMSE = 201.276521121449"
## [1] "Function call 34 out of 70"
## [1] "Evaluating beta = 2.5 and neighbors = 33"
## [1] "RMSE = 209.122824531364"
## [1] "Function call 35 out of 70"
## [1] "Evaluating beta = 3 and neighbors = 33"
## [1] "RMSE = 216.019491601957"
## [1] "Function call 36 out of 70"
## [1] "Evaluating beta = 0.1 and neighbors = 41"
## [1] "RMSE = 236.546317532959"
## [1] "Function call 37 out of 70"
## [1] "Evaluating beta = 0.5 and neighbors = 41"
## [1] "RMSE = 228.176901671938"
## [1] "Function call 38 out of 70"
## [1] "Evaluating beta = 1 and neighbors = 41"
## [1] "RMSE = 209.422582122709"
## [1] "Function call 39 out of 70"
## [1] "Evaluating beta = 1.5 and neighbors = 41"
## [1] "RMSE = 197.291237460884"
## [1] "Function call 40 out of 70"
## [1] "Evaluating beta = 2 and neighbors = 41"
## [1] "RMSE = 200.845026740389"
## [1] "Function call 41 out of 70"
## [1] "Evaluating beta = 2.5 and neighbors = 41"
## [1] "RMSE = 208.827335653364"
## [1] "Function call 42 out of 70"
## [1] "Evaluating beta = 3 and neighbors = 41"
## [1] "RMSE = 215.852977872709"
## [1] "Function call 43 out of 70"
## [1] "Evaluating beta = 0.1 and neighbors = 48"
## [1] "RMSE = 238.037649535155"
## [1] "Function call 44 out of 70"
## [1] "Evaluating beta = 0.5 and neighbors = 48"
## [1] "RMSE = 229.83820920258"
## [1] "Function call 45 out of 70"
## [1] "Evaluating beta = 1 and neighbors = 48"
## [1] "RMSE = 210.62368758597"
## [1] "Function call 46 out of 70"
## [1] "Evaluating beta = 1.5 and neighbors = 48"
## [1] "RMSE = 197.225822525274"
## [1] "Function call 47 out of 70"
## [1] "Evaluating beta = 2 and neighbors = 48"
## [1] "RMSE = 200.511584654991"
## [1] "Function call 48 out of 70"
## [1] "Evaluating beta = 2.5 and neighbors = 48"
## [1] "RMSE = 208.617550195501"
## [1] "Function call 49 out of 70"
## [1] "Evaluating beta = 3 and neighbors = 48"
## [1] "RMSE = 215.743781657092"
## [1] "Function call 50 out of 70"
## [1] "Evaluating beta = 0.1 and neighbors = 56"
## [1] "RMSE = 238.148968626313"
## [1] "Function call 51 out of 70"
## [1] "Evaluating beta = 0.5 and neighbors = 56"
## [1] "RMSE = 230.481583048291"
## [1] "Function call 52 out of 70"
## [1] "Evaluating beta = 1 and neighbors = 56"
## [1] "RMSE = 211.664547643457"
## [1] "Function call 53 out of 70"
## [1] "Evaluating beta = 1.5 and neighbors = 56"
## [1] "RMSE = 197.474056350474"
## [1] "Function call 54 out of 70"
## [1] "Evaluating beta = 2 and neighbors = 56"
## [1] "RMSE = 200.441554339835"
## [1] "Function call 55 out of 70"
## [1] "Evaluating beta = 2.5 and neighbors = 56"
## [1] "RMSE = 208.558324332183"
## [1] "Function call 56 out of 70"
## [1] "Evaluating beta = 3 and neighbors = 56"
## [1] "RMSE = 215.712259873626"
## [1] "Function call 57 out of 70"
## [1] "Evaluating beta = 0.1 and neighbors = 63"
## [1] "RMSE = 241.721330974318"
## [1] "Function call 58 out of 70"
## [1] "Evaluating beta = 0.5 and neighbors = 63"
## [1] "RMSE = 233.307992025684"
## [1] "Function call 59 out of 70"
## [1] "Evaluating beta = 1 and neighbors = 63"
## [1] "RMSE = 213.389066336511"
## [1] "Function call 60 out of 70"
## [1] "Evaluating beta = 1.5 and neighbors = 63"
## [1] "RMSE = 197.821977365344"
## [1] "Function call 61 out of 70"
## [1] "Evaluating beta = 2 and neighbors = 63"
## [1] "RMSE = 200.37568546295"
## [1] "Function call 62 out of 70"
## [1] "Evaluating beta = 2.5 and neighbors = 63"
## [1] "RMSE = 208.493487352153"
## [1] "Function call 63 out of 70"
## [1] "Evaluating beta = 3 and neighbors = 63"
## [1] "RMSE = 215.678017697433"
## [1] "Function call 64 out of 70"
## [1] "Evaluating beta = 0.1 and neighbors = 71"
## [1] "RMSE = 246.030184223459"
## [1] "Function call 65 out of 70"
## [1] "Evaluating beta = 0.5 and neighbors = 71"
## [1] "RMSE = 236.360249438594"
## [1] "Function call 66 out of 70"
## [1] "Evaluating beta = 1 and neighbors = 71"
## [1] "RMSE = 215.097896833295"
## [1] "Function call 67 out of 70"
## [1] "Evaluating beta = 1.5 and neighbors = 71"
## [1] "RMSE = 198.219083877568"
## [1] "Function call 68 out of 70"
## [1] "Evaluating beta = 2 and neighbors = 71"
## [1] "RMSE = 200.361701068306"
## [1] "Function call 69 out of 70"
## [1] "Evaluating beta = 2.5 and neighbors = 71"
## [1] "RMSE = 208.457407828214"
## [1] "Function call 70 out of 70"
## [1] "Evaluating beta = 3 and neighbors = 71"
## [1] "RMSE = 215.659916511797"

If this function fails, you may download its output here.

Ready! Now we can inspect the results and print the optimal \(\beta\) parameter (bestBeta), the optimal number of neighbors (bestNeighbors) and the corresponding RMSE of that particular parameter combination (bestRMSE).

my_IDW[["bestBeta"]]
## [1] 1.5
my_IDW[["bestNeighbors"]]
## [1] 18
my_IDW[["bestRMSE"]]
## [1] 196.8765

Based on these parameters we plot the interpolation raster in the form of a map. Note that the SpatialPixelsDataFrame object is also returned by the cv_IDW() function (idwBestRaster).

plot(my_IDW[["idwBestRaster"]],
  main = "IDW: Mean annual rainfall"
)
plot(east_germany_states_sp,
  add = TRUE, border = "white"
)
plot(dwd_east_sp,
  add = TRUE, pch = 19,
  cex = 0.5, col = "red"
)

Further, we take a look at the data frame containing the RMSE for all tested model parameter combinations. We consider the 10 best model configurations with respect to RMSE.

gridCV <- my_IDW[["gridCV"]]
head(gridCV[with(gridCV, order(RMSE)), ], 10)
##    Beta Neighbors     RMSE
## 18  1.5        18 196.8765
## 46  1.5        48 197.2258
## 39  1.5        41 197.2912
## 32  1.5        33 197.3437
## 53  1.5        56 197.4741
## 60  1.5        63 197.8220
## 25  1.5        26 197.8776
## 67  1.5        71 198.2191
## 68  2.0        71 200.3617
## 61  2.0        63 200.3757

Exercise: Use the function cv_IDW() to find the best parameter combination for an IDW model of the mean annual temperature in East Germany and plot the resulting interpolated data.

## Your code here...
my_IDW_temp <- NULL
Show code
my_IDW_temp <- cv_IDW(
  spatialDF = dwd_east_sp,
  stat.formula = formula(Temperature ~ 1),
  evalGridSize = 1000,
  evalRaster = east_germany_states_sp,
  verbose = TRUE
)

# best parameter combination
my_IDW_temp[["bestBeta"]]
my_IDW_temp[["bestNeighbors"]]
my_IDW_temp[["bestRMSE"]]

# plot
plot(my_IDW_temp[["idwBestRaster"]],
  main = "IDW: Mean annual temperature"
)
plot(east_germany_states_sp,
  add = TRUE, border = "white"
)
plot(dwd_east_sp,
  add = TRUE, pch = 19,
  cex = 0.5, col = "black"
)



Improving the parameter search space

The plotted results above look very reasonable and the LOOCV approach is straightforward. However, as you probably noticed, we did not specify the parameter search space explicitly but relied on the default values of the cv_IDW() function. This certainly limits the parameter search space, which we may visualize using the ggplot2 package.

ggplot(gridCV, aes(Neighbors, Beta)) +
  geom_tile(aes(fill = RMSE), colour = "black") +
  scale_fill_gradient(low = "steelblue", high = "orange") +
  theme_bw() +
  ggtitle("Parameter search space")

Each cell in the plot above corresponds to one particular parameter combination, and its color encodes the RMSE of that combination. Blue colors indicate a low RMSE, whereas orange colors indicate a high RMSE. We see a zone of bluish cells emerging for parameter combinations with the number of neighbors roughly between 30 and 50 and \(\beta\) between 1 and 2.
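
Instead of relying on visual inspection alone, we may also extract the most promising region programmatically, for instance by listing all parameter combinations whose RMSE lies within 2 % of the minimum (the 2 % threshold is an arbitrary choice for illustration):

# parameter combinations within 2 % of the best RMSE (illustrative threshold)
subset(gridCV, RMSE <= min(RMSE) * 1.02)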

Based on that intuition we apply cv_IDW() once again. This time, however, we explicitly specify the sequences for \(\beta\) (seqBeta) and for the number of neighbors (seqNeighbors).

my_IDW2 <- cv_IDW(
  spatialDF = dwd_east_sp,
  stat.formula = formula(Rainfall ~ 1),
  seqBeta = seq(from = 1, to = 2, 0.1),
  seqNeighbors = seq(from = 30, to = 50, 1),
  evalGridSize = 1000,
  evalRaster = east_germany_states_sp,
  verbose = FALSE
)

Once again we inspect the optimal model parameter combination.

my_IDW2[["bestBeta"]]
## [1] 1.6
my_IDW2[["bestNeighbors"]]
## [1] 50
my_IDW2[["bestRMSE"]]
## [1] 196.6954

Let us visualize the parameter search space once again to see if we can further improve.

gridCV2 <- my_IDW2[["gridCV"]]
ggplot(gridCV2, aes(Neighbors, Beta)) +
  geom_tile(aes(fill = RMSE), colour = "black") +
  scale_fill_gradient(low = "steelblue", high = "orange") +
  theme_bw()

It does not seem that we can improve the model performance much further. Let us therefore list the 10 best model configurations with respect to the RMSE.

head(gridCV2[with(gridCV2, order(RMSE)), ], 10)
##     Beta Neighbors     RMSE
## 227  1.6        50 196.6954
## 216  1.6        49 196.7166
## 194  1.6        47 196.8087
## 172  1.6        45 196.8454
## 183  1.6        46 196.8484
## 205  1.6        48 196.8629
## 161  1.6        44 196.9161
## 150  1.6        43 196.9450
## 228  1.7        50 196.9898
## 217  1.7        49 197.0082

Finally, we plot the results of our best IDW model for mean annual rainfall in East Germany based on weather station data for the period 1981–2010.

plot(my_IDW2[["idwBestRaster"]],
  main = "IDW model mean annual rainfall [mm]"
)
plot(east_germany_states_sp, add = TRUE, border = "white")
plot(dwd_east_sp, add = TRUE, pch = 19, cex = 0.5, col = "red")

We may also put our best IDW model into context and plot an interactive map using the mapview package.

idw_best <- raster::mask(
  raster(my_IDW2[["idwBestRaster"]]),
  east_germany_states_sp
)

mapview(idw_best,
  layer.name = "Annual rainfall",
  legend = TRUE
) + mapview(dwd_east_sp,
  label = dwd_east_sp@data$Name,
  cex = 2
)

Exercise: Improve the parameter space of the temperature model.

## Your code here...
my_IDW_temp2 <- NULL
Show code
my_IDW_temp2 <- cv_IDW(
  spatialDF = dwd_east_sp,
  stat.formula = formula(Temperature ~ 1),
  seqBeta = seq(from = 1, to = 2, 0.1),
  seqNeighbors = seq(from = 30, to = 50, 1),
  evalGridSize = 1000,
  evalRaster = east_germany_states_sp,
  verbose = FALSE
)
gridCV_temp2 <- my_IDW_temp2[["gridCV"]]
ggplot(gridCV_temp2, aes(Neighbors, Beta)) +
  geom_tile(aes(fill = RMSE), colour = "black") +
  scale_fill_gradient(low = "steelblue", high = "orange") +
  theme_bw()

head(gridCV_temp2[with(gridCV_temp2, order(RMSE)), ], 10)

# plot
idw_best2 <- raster::mask(
  raster(my_IDW_temp2[["idwBestRaster"]]),
  east_germany_states_sp
)

mapview(idw_best2,
  layer.name = "Annual temperature",
  legend = TRUE
) + mapview(dwd_east_sp,
  label = dwd_east_sp@data$Name,
  cex = 2
)



Citation

The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us via e-mail at soga[at]zedat.fu-berlin.de.

You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follows: Hartmann, K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.