Data preparation

In this section we use two data sets:

First, we load these data sets into memory:

### LOAD DWD DATA ###
load(url("https://userpage.fu-berlin.de/soga/data/r-data/dwd_sp.RData"))
####################################

### LOAD RASTER DATA ###
## build a temporary folder on disk
temp <- tempfile()
download_url <- "https://userpage.fu-berlin.de/soga/data/raw-data/spatial/"
zipfile <- "srtm_germany_ETRS89.zip"
## download the file
download.file(paste0(download_url, zipfile), temp, mode = "wb")
## unzip the file
unzip(temp)
unlink(temp)
## read in the data
library(raster)
srtm_germany_masked_ETRS89 <- raster("srtm_germany_ETRS89.tif")
####################################
plot(srtm_germany_masked_ETRS89,
  main = "SRTM DEM and DWD weather stations"
)
plot(dwd_sp,
  add = TRUE,
  pch = 16, cex = 0.2
)

Before we proceed we make sure that all of our data shares the same coordinate reference system.

library(rgdal)
newProj <- proj4string(srtm_germany_masked_ETRS89)

# re-project spatial data sets
dwd_sp <- spTransform(dwd_sp,
  CRSobj = newProj
)


# Retrieve Federal States by the the getData() function from the raster package
germany <- getData(country = "Germany", level = 1)
germany <- spTransform(germany,
  CRSobj = newProj
)

Further, we cast the RasterLayer object srtm_germany_masked_ETRS89 to a SpatialPixelsDataFrame object and rename the column in the attribute table to Altitude.

# cast to SpatialPixelsDataFrame
srtm_germany_masked_ETRS89_spdf <- as(
  srtm_germany_masked_ETRS89,
  "SpatialPixelsDataFrame"
)
colnames(srtm_germany_masked_ETRS89_spdf@data) <- "Altitude"

Sample variogram

We can explore the spatial correlation of our predictor variable by plotting a variogram cloud. The variogram cloud is obtained by plotting all possible squared differences of observation pairs \((Z(s_i)−Z(s_j))^2\) against their separation distance \(h_{ij}\). As any point in the variogram cloud refers to a pair of points in the data set, the variogram cloud is used to point us to areas with unusually high or low variability (Bivand et al. 2008).

library(gstat)
ev <- variogram(mean_annual_rainfall ~ 1, dwd_sp, cloud = TRUE)
plot(ev)

The sample variogram plot below is nothing but a plot of averages of semivariogram cloud values over distance intervals \(h\). The variogram() is an extremely versatile function and accepts a ton of additional arguments. For the purpose of this tutorial we keep the defaults. Type ?variogram into your console to review a detailed description of the variogram() function.

ev <- variogram(mean_annual_rainfall ~ 1, dwd_sp)
plot(ev)


Variogram modelling

There are different approaches to select a variogram model:

The eyefit() function from the geoR package offers the possibility to fit a function to the empirical variogram by interactively varying the function type and its parameters. Here, we do not cover this approach, however the interested reader may visit the online documentation for more details (here).

The gstat package ships with the vgm() function, which allows us to specify a wide range of potential variogram models and specific parameters. A list of available models is printed by using the vgm() command.

vgm()
##    short                                      long
## 1    Nug                              Nug (nugget)
## 2    Exp                         Exp (exponential)
## 3    Sph                           Sph (spherical)
## 4    Gau                            Gau (gaussian)
## 5    Exc        Exclass (Exponential class/stable)
## 6    Mat                              Mat (Matern)
## 7    Ste Mat (Matern, M. Stein's parameterization)
## 8    Cir                            Cir (circular)
## 9    Lin                              Lin (linear)
## 10   Bes                              Bes (bessel)
## 11   Pen                      Pen (pentaspherical)
## 12   Per                            Per (periodic)
## 13   Wav                                Wav (wave)
## 14   Hol                                Hol (hole)
## 15   Log                         Log (logarithmic)
## 16   Pow                               Pow (power)
## 17   Spl                              Spl (spline)
## 18   Leg                            Leg (Legendre)
## 19   Err                   Err (Measurement error)
## 20   Int                           Int (Intercept)

The vgm() function is specified by a range of additional arguments such as psill, which corresponds to the (partial) sill of the variogram model, model, which specifies the model type, e.g. “Exp”, “Sph”, “Gau”, “Mat” (see above), range, which specifies the range parameter of the variogram model, kappa, which is a smoothness parameter for the Matern class of variogram models, nugget, which adds the nugget component to the model, anis, a set of parameters to account for anisotropy, among others. Depending on the data set and on the model of choice we have to specify initial values for the different parameters. If we choose initial values too far off from reasonable values, the fit will not succeed.

If we consider the sample variogram from above, an exponential model appears to be a good fit. By eyeballing the sample variogram we set psill = 30000, model = "Exp", range = 100000, and nugget= 25000.

vm <- vgm(psill = 30000, model = "Exp", range = 100000, nugget = 25000)
vm
##   model psill range
## 1   Nug 25000 0e+00
## 2   Exp 30000 1e+05

To verify our parameter choice we plot the sample variogram and overlay it with the variogram model as specified above.

plot(ev, vm)

The variogram model looks reasonable, but for sure not perfect.

We may select the variogram model and its parameters statically, by minimizing some goodness-of-fit criterion. For example the fit.variogram() function from the gstat package allows us to find an appropriate variogram model.

The fit.variogram() function need as arguments at least a sample variogram (object), e.g. the output of the variogram() function, and a model, which specifies the model type, e.g. “Exp”, “Sph”, “Gau”, “Mat”, among others (see output of vgm()). The function fit.variogram() returns an object of class variogramModel extending data.frame. The data.frame object has a logical attribute singular, which indicates whether the non-linear fit converged (FALSE).

Let us apply the fit.variogram() to obtain the parameters of an exponential variogram model.

vm_exp <- fit.variogram(ev, vgm("Exp"))
vm_exp
##   model    psill    range
## 1   Nug 26296.91      0.0
## 2   Exp 31955.10 113131.6

OK, compare the result with our eyeballed parametrization. We have not been too much off.

Let us check whether the fit converges.

attr(vm_exp, "singular")
## [1] FALSE

Perfect, recall that FALSE indicates model convergence.

Note that we may specify the fitting method, used by gstat by setting the additional arguments fit.method. There are four choices implemented:

\[ \begin{array}{c|c} \mathtt{fit.method} & \text{Weight} \\ \hline 1 & N_j \\ 2 & N_j\{\gamma(h_j)\}^2 \\ 6 & 1 \\ 7 & N_j/h_j^2 \end{array} \]

The default value chosen by gstat is fit.method = 7.

Exercise: Apply the fit.variogram() fuction to obtain the parameters for the spherical and the Gaussian variogram model and check each of them for convergence.

## Your code here...
vm_sph <- NULL
vm_gau <- NULL
Show code
vm_sph <- fit.variogram(ev, vgm("Sph"))
attr(vm_exp, "singular")
## [1] FALSE
vm_gau <- fit.variogram(ev, vgm("Gau"))
attr(vm_exp, "singular")
## [1] FALSE

Nice! Both models converge.


Now, let us visualize our three different variogram models we built above.

plot(ev, vm_exp, main = "Exponential variogram model")

Exercise: Visualize the spherical and the Gaussian variogram model. Which of the three models seems to be the best fit by visual inspection?

## Your code here...
Show code
plot(ev, vm_sph, main = "Spherical variogram model")

plot(ev, vm_gau, main = "Gaussian variogram model")

Based on the visual inspection, the exponential variogram model (vm_exp) appears to be the best fit.


In addition to the fit.variogram() function the automap package provides the autofitVariogram(). The function accepts a formula (formula) and a data argument (input_data). The following expression fits a variogram model to our data:

library(automap)
vm_auto <- autofitVariogram(
  formula = mean_annual_rainfall ~ 1,
  input_data = dwd_sp
)
plot(vm_auto)

In the plot above the points show the sample empirical variogram. The numbers indicate the number of pairs assigned to the bin, and the continuous line is the fitted variogram model. In this case, the selected model was Matérn, M. Stein’s parameterization (code name “Ste”), with the four parameter values stated in the bottom-right corner of the image.

The function returns an object of type autofitVariogram. This object contains the fitted variogram model, which can be accessed with the $ operator.

vm_auto$var_model
##   model    psill    range kappa
## 1   Nug 15442.86      0.0   0.0
## 2   Ste 50425.68 283865.8   0.2

Anisotropy

So far we assumed that the spatial correlation structure is the same in all directions (isotropic). Hence, the covariance function depends only on the magnitude of the lag vector \(h\) and not on the direction. In many cases, however, the autocorrelation shows a structure that is different in different directions. This property of a spatial process is referred to as anisotropy.

The most commonly employed model for anisotropy is geometric anisotropy, with the semivariogram reaching the same sill in all directions, but at different ranges. To check for directional dependence, we use directional empirical semivariograms, which include only those values for data pairs falling within certain directional bands as well as falling within the prescribed lag limits.

With the gstat package we can account for anisotropy by specifying directional bands by an azimuthal direction. The variogram() function accepts an alpha parameter, which indicates the direction in plane in positive degrees clockwise. For example, we may specify the main directions to be \(0^\circ\), \(45^\circ\), \(90^\circ\), and \(135^\circ\) with angular tolerance of \(22.5^\circ\).

alpha <- (0:3) * 45
alpha
## [1]   0  45  90 135

If we plug the additional argument alpha into the variogram() function call and plot it, we see a panel plot, where each plot corresponds to a certain directional band. Note that directional semivariograms are often noisier due to the reduced number of data pairs used for the estimation.

ev_anis <- variogram(mean_annual_rainfall ~ 1, dwd_sp, alpha = alpha)
plot(ev_anis)

The plot does not show overwhelming evidence of anisotropy. Let us add our exponential variogram model we fitted in the previous section (vm_exp) to see if it fits the data in all four directional bands.

plot(ev_anis, vm_exp)

Not too bad. However, we can explicitly specify the geometry of the covariance structure, by adding the anis argument to the vgm() function call. Note that we plug in same parameters for nugget, partial sill and range as obtained for the exponential variogram model vm_exp, but we add the anis argument to account for anisotropy.

vm_anis <- vgm(
  psill = 31955,
  model = "Exp",
  range = 113132,
  nugget = 26297,
  anis = c(0, 0.7)
)

The first value of the anis argument is the angle of the major direction (in other words the azimuthal direction measured in degrees clockwise). The value of \(0\) is equal to the North-South direction. The second value is the anisotropy ratio, which in our case is 0.7. The anisotropy ratio indicates that the range in the minor direction (East-West) is only 0.7 of the range in the major direction. Hence, we model a more ellipsis-like spatial correlation structure.

plot(ev_anis, vm_anis)

Well the changes are small, but still recognizable. We will compare the performance of the isotropic and the anisotropic model in the next section by applying a technique referred to as cross-validation.


Model evaluation

Once we defined a variogram model we would like to evaluate its predictive power. Similar as we did in the previous section on inverse distance weighting (IDW) interpolation, we test our model using the leave-one-out cross-validation (LOOCV) approach. Again we assess the predictive power of the model by using the root-mean-square error (RMSE) as model assessment metric.

We can use the same custom function as in the section on IDW:

\[RMSE = \sqrt{\frac{\sum_{i=1}^n(\hat y_i-y_i)^2}{n}}\]

RMSE <- function(residuals) {
  sqrt(sum((residuals)^2) / length(residuals))
}

The gstat package provides the krige.cv() function for kriging cross validation, either \(n\)-fold or leave-one-out (LOOCV). The krige.cv() function takes as input a formula (formula) that defines the dependent variable as a linear model of independent variables. Suppose the dependent variable has name z, for ordinary kriging we use the formula z~1; for universal kriging, suppose z is linearly dependent on x and y, we use the formula z~x+y. In addition the function expects the locations argument, which is an object of class Spatial, and the argument model, which specifies the variogram model of the dependent variable.

Consequently, an ordinary kriging LOOCV using our dwd_sp data set is written as below. Note that we are using the isotropic exponential variogram model vm_exp.

rain_OK_LOOCV_exp <- krige.cv(
  formula = mean_annual_rainfall ~ 1,
  locations = dwd_sp,
  model = vm_exp
)

We assess the predictive power of the model by computing the RMSE.

RMSE(residuals = rain_OK_LOOCV_exp@data$residual)
## [1] 161.1598

Now let us plug in the anisotropic exponential variogram model vm_anis and check its performance.

Exercise: Evaluate the anisotropic exponential model and the the Matérn model (vm_auto). Which model is the best fit?

## Your code here...
rain_OK_LOOCV_anis <- NULL
rain_OK_LOOCV_matern <- NULL
Show code
# anisotropic exponential model
rain_OK_LOOCV_anis <- krige.cv(
  formula = mean_annual_rainfall ~ 1,
  locations = dwd_sp,
  model = vm_anis
)

RMSE(residuals = rain_OK_LOOCV_anis@data$residual)
## [1] 160.5813
# matern model
rain_OK_LOOCV_matern <- krige.cv(
  formula = mean_annual_rainfall ~ 1,
  locations = dwd_sp,
  model = vm_auto$var_model
)

RMSE(residuals = rain_OK_LOOCV_matern@data$residual)
## [1] 159.0077

All models in our case perform similarly well, with the anisotropic exponential variogram and the Matérn model performing only slightly better than the isotropic exponential variogram. The Matérn model is the best fit.


A nice visualization of the model residuals is provided by the bubble() function. Let us plug in the Matérn model.

bubble(rain_OK_LOOCV_matern,
  "residual",
  main = "Ordinary Kriging rainfall: LOOCV residuals"
)

The bubble plot above shows that the model residuals are quite low in the North German Lowlands, indicating a reasonable model performance. In contrast, the residuals are much worse in the Central German Uplands, the Alpine Foreland and the Alps.

Fortunately, our data set (dwd_sp) not only contains rainfall data, but as well the altitude of each particular weather station. Let us make a scatter plot to verify the dependence of rainfall and elevation.

rain_annual <- dwd_sp@data$mean_annual_rainfall
elevation <- dwd_sp@data$Altitude
plot(elevation, rain_annual, main = "Rainfall and Altitude")

Although there are a number of weather stations at low elevations, probably located close to the northern German coast, which receive a considerable amount of annual rainfall, it appears that in general rainfall increases with altitude.

Consequently, we build a model using universal kriging, where the predicted value (mean annual rainfall) is based on the covariate Altitude plus a weighted function of neighboring measurements.

Exercise: Use universal kriging based on the Altitude to validate the Matérn model for the mean annual rainfall, evaluate the model and visualize the residuals with a bubble plot.

## Your code here...
rain_UK_LOOCV_matern <- krige.cv(
  formula = NULL,
  locations = NULL,
  model = NULL
)
Show code
rain_UK_LOOCV_matern <- krige.cv(
  formula = mean_annual_rainfall ~ Altitude,
  locations = dwd_sp,
  model = vm_auto$var_model
)


Show code
RMSE(residuals = rain_UK_LOOCV_matern@data$residual)
## [1] 148.9089


Show code
bubble(rain_UK_LOOCV_matern,
  "residual",
  main = "Universal Kriging rainfall: LOOCV residuals", add = TRUE
)

The bubble plot above shows that the model residuals are still quite high for elevate terrain, however the RMSE is significantly lower than for ordinary kriging.


Note that in universal kriging we may incorporate more than one covariate. The bubble plot above indicates that the model predictions are worse in southern regions of Germany compared to northern regions. Hence, we may try to fit the values using the coordinates and a 2nd order polynomial.

Therefore, we first need to extract the coordinates from our Spatial object and assign them explicitly to the attribute table of dwd_sp.

lat <- dwd_sp@coords[, 1]
lon <- dwd_sp@coords[, 2]
dwd_sp@data$lat <- lat
dwd_sp@data$lon <- lon

Then we simply rewrite our formula expression and re-run the universal kriging LOOCV.

f <- formula(mean_annual_rainfall ~ Altitude + lat + lon + lat * lon + lat * lat + lon * lon)

rain_UK_LAT_LON_LOOCV <- krige.cv(
  formula = f,
  locations = dwd_sp,
  model = vm_auto$var_model
)

Again we visualize the model residuals using a bubble plot and compute the RMSE.

bubble(rain_UK_LAT_LON_LOOCV,
  "residual",
  main = "Universal Kriging rainfall: LOOCV residuals"
)

RMSE(residuals = rain_UK_LAT_LON_LOOCV@data$residual)
## [1] 148.8224

Well, the model performs only slightly better. It is doubtful that the incorporation of the polynomial term is effectively improving the model. Thus, following the law of parsimony, also known as Occam’s razor, we stick to the simpler model and discard the polynomial modelling approach.


Spatial prediction

Finally, we may use our model to predict the annual mean rainfall over the region of Germany from observations at DWD weather stations. Spatial prediction with the gstat package is fairly straight forward. We either may use the predict() function or an even simpler approach is to apply the krige() function. The krige() function is a wrapper method around the functions gstat() and predict().

The krige() function takes as input a formula (formula) that defines the dependent variable as a linear model of independent variables. In addition, the function expects the locations argument, which is an object of class Spatial, and the argument newdata, which is a data.frame or Spatial object with the prediction locations. The object should contain attribute columns with the independent variables with names as defined in locations. Finally, in order to perform kriging the function takes the argument model to specify the variogram model of dependent variable.

Consequently, the spatial prediction model can be written as below. Please note that depending on your computing resources and depending on the spatial resolution of the newdata object, this step can take some time. For your convenience we show the computing time on an Intel i7-7600U CPU, with 3.90 GHz and 8 GB RAM.

# time the function
# start time
start_time <- Sys.time()

# main function call
rain_pred <- krige(
  formula = mean_annual_rainfall ~ Altitude,
  locations = dwd_sp,
  newdata = srtm_germany_masked_ETRS89_spdf,
  model = vm_auto$var_model
)
## [using universal kriging]
# end time
end_time <- Sys.time()
print(end_time - start_time)
## Time difference of 5.903659 mins

Finally, we plot the spatial predictions using the spplot() function.

library(viridisLite)
m <- 50
max_val <- ceiling(max(rain_pred@data$var1.pred) / 100) * 100
min_val <- floor(min(rain_pred@data$var1.pred) / 100) * 100

spplot(rain_pred["var1.pred"],
  main = "Predicted annual rainfall [mm]",
  at = seq(min_val, max_val, m),
  col.regions = rev(viridis(max_val))
)

Exercise: Predict the annual mean temperature over the region of Germany from observations at DWD weather stations, stored in the variable mean_annual_air_temp of the dwd_sp data set. First, construct the variogram model by using the autofitVariogram() function.

## Your code here...
# model
temp_vm_auto <- NULL
plot(temp_vm_auto)

# main function call
temp_pred <- NULL

# plot
library(colorRamps)
m <- 0.5
max_val <- ceiling(max(temp_pred@data$var1.pred) / 10) * 10
min_val <- floor(min(temp_pred@data$var1.pred) / 10) * 10

spplot(temp_pred["var1.pred"],
  main = "Predicted annual temperature [°C]",
  at = seq(min_val, max_val, m),
  col.regions = matlab.like(200)
)
Show code
# model
temp_vm_auto <- autofitVariogram(
  formula = mean_annual_air_temp ~ 1,
  input_data = dwd_sp
)
plot(temp_vm_auto)

# main function call
temp_pred <- krige(
  formula = mean_annual_air_temp ~ Altitude,
  locations = dwd_sp,
  newdata = srtm_germany_masked_ETRS89_spdf,
  model = temp_vm_auto$var_model
)
## [using universal kriging]
# plot
library(colorRamps)
m <- 0.5

spplot(temp_pred["var1.pred"],
  main = "Predicted annual temperature [°C]",
  at = seq(-20, 20, m),
  col.regions = matlab.like(400)
)



Citation

The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us via mail by soga[at]zedat.fu-berlin.de.

Creative Commons License
You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follow: Hartmann, K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.