In this section we use two data sets:
dwd_sp
)
andsrtm_germany_masked_ETRS89
)First, we load these data sets into memory:
### LOAD DWD DATA ###
load(url("https://userpage.fu-berlin.de/soga/data/r-data/dwd_sp.RData"))
####################################
### LOAD RASTER DATA ###
## build a temporary folder on disk
temp <- tempfile()
download_url <- "https://userpage.fu-berlin.de/soga/data/raw-data/spatial/"
zipfile <- "srtm_germany_ETRS89.zip"
## download the file
download.file(paste0(download_url, zipfile), temp, mode = "wb")
## unzip the file
unzip(temp)
unlink(temp)
## read in the data
library(raster)
srtm_germany_masked_ETRS89 <- raster("srtm_germany_ETRS89.tif")
####################################
plot(srtm_germany_masked_ETRS89,
main = "SRTM DEM and DWD weather stations"
)
plot(dwd_sp,
add = TRUE,
pch = 16, cex = 0.2
)
Before we proceed we make sure that all of our data shares the same coordinate reference system.
library(rgdal)
newProj <- proj4string(srtm_germany_masked_ETRS89)
# re-project spatial data sets
dwd_sp <- spTransform(dwd_sp,
CRSobj = newProj
)
# Retrieve Federal States by the the getData() function from the raster package
germany <- getData(country = "Germany", level = 1)
germany <- spTransform(germany,
CRSobj = newProj
)
Further, we cast the RasterLayer
object
srtm_germany_masked_ETRS89
to a
SpatialPixelsDataFrame
object and rename the column in the
attribute table to Altitude
.
# cast to SpatialPixelsDataFrame
srtm_germany_masked_ETRS89_spdf <- as(
srtm_germany_masked_ETRS89,
"SpatialPixelsDataFrame"
)
colnames(srtm_germany_masked_ETRS89_spdf@data) <- "Altitude"
We can explore the spatial correlation of our predictor variable by plotting a variogram cloud. The variogram cloud is obtained by plotting all possible squared differences of observation pairs \((Z(s_i)−Z(s_j))^2\) against their separation distance \(h_{ij}\). As any point in the variogram cloud refers to a pair of points in the data set, the variogram cloud is used to point us to areas with unusually high or low variability (Bivand et al. 2008).
library(gstat)
ev <- variogram(mean_annual_rainfall ~ 1, dwd_sp, cloud = TRUE)
plot(ev)
The sample variogram plot below is nothing but a plot of averages of
semivariogram cloud values over distance intervals \(h\). The variogram()
is an
extremely versatile function and accepts a ton of additional arguments.
For the purpose of this tutorial we keep the defaults. Type
?variogram
into your console to review a detailed
description of the variogram()
function.
ev <- variogram(mean_annual_rainfall ~ 1, dwd_sp)
plot(ev)
There are different approaches to select a variogram model:
The eyefit()
function from the geoR
package offers the possibility to
fit a function to the empirical variogram by interactively varying the
function type and its parameters. Here, we do not cover this approach,
however the interested reader may visit the online documentation for
more details (here).
The gstat
package ships with the vgm()
function, which allows us to specify a wide range of potential variogram
models and specific parameters. A list of available models is printed by
using the vgm()
command.
vgm()
## short long
## 1 Nug Nug (nugget)
## 2 Exp Exp (exponential)
## 3 Sph Sph (spherical)
## 4 Gau Gau (gaussian)
## 5 Exc Exclass (Exponential class/stable)
## 6 Mat Mat (Matern)
## 7 Ste Mat (Matern, M. Stein's parameterization)
## 8 Cir Cir (circular)
## 9 Lin Lin (linear)
## 10 Bes Bes (bessel)
## 11 Pen Pen (pentaspherical)
## 12 Per Per (periodic)
## 13 Wav Wav (wave)
## 14 Hol Hol (hole)
## 15 Log Log (logarithmic)
## 16 Pow Pow (power)
## 17 Spl Spl (spline)
## 18 Leg Leg (Legendre)
## 19 Err Err (Measurement error)
## 20 Int Int (Intercept)
The vgm()
function is specified by a range of additional
arguments such as psill
, which corresponds to the (partial)
sill of the variogram model, model
, which specifies the
model type, e.g. “Exp”, “Sph”, “Gau”, “Mat” (see above),
range
, which specifies the range parameter of the variogram
model, kappa
, which is a smoothness parameter for the
Matern
class of variogram models, nugget
,
which adds the nugget component to the model, anis
, a set
of parameters to account for anisotropy, among others. Depending on the
data set and on the model of choice we have to specify initial values
for the different parameters. If we choose initial values too far off
from reasonable values, the fit will not succeed.
If we consider the sample variogram from above, an exponential model
appears to be a good fit. By eyeballing the sample variogram we set
psill = 30000
, model = "Exp"
,
range = 100000
, and nugget= 25000
.
vm <- vgm(psill = 30000, model = "Exp", range = 100000, nugget = 25000)
vm
## model psill range
## 1 Nug 25000 0e+00
## 2 Exp 30000 1e+05
To verify our parameter choice we plot the sample variogram and overlay it with the variogram model as specified above.
plot(ev, vm)
The variogram model looks reasonable, but for sure not perfect.
We may select the variogram model and its parameters statically, by
minimizing some goodness-of-fit criterion. For example the
fit.variogram()
function from the gstat
package allows us to find an appropriate variogram model.
The fit.variogram()
function need as arguments at least
a sample variogram (object
), e.g. the output of the
variogram()
function, and a model
, which
specifies the model type, e.g. “Exp”, “Sph”, “Gau”, “Mat”, among others
(see output of vgm()
). The function
fit.variogram()
returns an object of class
variogramModel
extending data.frame
. The
data.frame
object has a logical attribute
singular
, which indicates whether the non-linear fit
converged (FALSE
).
Let us apply the fit.variogram()
to obtain the
parameters of an exponential variogram model.
vm_exp <- fit.variogram(ev, vgm("Exp"))
vm_exp
## model psill range
## 1 Nug 26296.91 0.0
## 2 Exp 31955.10 113131.6
OK, compare the result with our eyeballed parametrization. We have not been too much off.
Let us check whether the fit converges.
attr(vm_exp, "singular")
## [1] FALSE
Perfect, recall that FALSE
indicates model
convergence.
Note that we may specify the fitting method, used by
gstat
by setting the additional arguments
fit.method
. There are four choices implemented:
\[ \begin{array}{c|c} \mathtt{fit.method} & \text{Weight} \\ \hline 1 & N_j \\ 2 & N_j\{\gamma(h_j)\}^2 \\ 6 & 1 \\ 7 & N_j/h_j^2 \end{array} \]
The default value chosen by gstat
is
fit.method = 7
.
Exercise: Apply the
fit.variogram()
fuction to obtain the parameters for the spherical and the Gaussian variogram model and check each of them for convergence.
## Your code here...
vm_sph <- NULL
vm_gau <- NULL
vm_sph <- fit.variogram(ev, vgm("Sph"))
attr(vm_exp, "singular")
## [1] FALSE
vm_gau <- fit.variogram(ev, vgm("Gau"))
attr(vm_exp, "singular")
## [1] FALSE
Nice! Both models converge.
Now, let us visualize our three different variogram models we built above.
plot(ev, vm_exp, main = "Exponential variogram model")
Exercise: Visualize the spherical and the Gaussian variogram model. Which of the three models seems to be the best fit by visual inspection?
## Your code here...
plot(ev, vm_sph, main = "Spherical variogram model")
plot(ev, vm_gau, main = "Gaussian variogram model")
Based on the visual inspection, the exponential variogram model
(vm_exp
) appears to be the best fit.
In addition to the fit.variogram()
function the
automap
package provides the
autofitVariogram()
. The function accepts a formula
(formula
) and a data argument (input_data
).
The following expression fits a variogram model to our data:
library(automap)
vm_auto <- autofitVariogram(
formula = mean_annual_rainfall ~ 1,
input_data = dwd_sp
)
plot(vm_auto)
In the plot above the points show the sample empirical variogram. The numbers indicate the number of pairs assigned to the bin, and the continuous line is the fitted variogram model. In this case, the selected model was Matérn, M. Stein’s parameterization (code name “Ste”), with the four parameter values stated in the bottom-right corner of the image.
The function returns an object of type autofitVariogram
.
This object contains the fitted variogram model, which can be accessed
with the $
operator.
vm_auto$var_model
## model psill range kappa
## 1 Nug 15442.86 0.0 0.0
## 2 Ste 50425.68 283865.8 0.2
So far we assumed that the spatial correlation structure is the same in all directions (isotropic). Hence, the covariance function depends only on the magnitude of the lag vector \(h\) and not on the direction. In many cases, however, the autocorrelation shows a structure that is different in different directions. This property of a spatial process is referred to as anisotropy.
The most commonly employed model for anisotropy is geometric anisotropy, with the semivariogram reaching the same sill in all directions, but at different ranges. To check for directional dependence, we use directional empirical semivariograms, which include only those values for data pairs falling within certain directional bands as well as falling within the prescribed lag limits.
With the gstat
package we can account for anisotropy by
specifying directional bands by an azimuthal direction. The
variogram()
function accepts an alpha
parameter, which indicates the direction in plane in positive degrees
clockwise. For example, we may specify the main directions to be \(0^\circ\), \(45^\circ\), \(90^\circ\), and \(135^\circ\) with angular tolerance of \(22.5^\circ\).
alpha <- (0:3) * 45
alpha
## [1] 0 45 90 135
If we plug the additional argument alpha
into the
variogram()
function call and plot it, we see a panel plot,
where each plot corresponds to a certain directional band. Note that
directional semivariograms are often noisier due to the reduced number
of data pairs used for the estimation.
ev_anis <- variogram(mean_annual_rainfall ~ 1, dwd_sp, alpha = alpha)
plot(ev_anis)
The plot does not show overwhelming evidence of anisotropy. Let us
add our exponential variogram model we fitted in the previous section
(vm_exp
) to see if it fits the data in all four directional
bands.
plot(ev_anis, vm_exp)
Not too bad. However, we can explicitly specify the geometry of the
covariance structure, by adding the anis
argument to the
vgm()
function call. Note that we plug in same parameters
for nugget, partial sill and range as obtained for the exponential
variogram model vm_exp
, but we add the anis
argument to account for anisotropy.
vm_anis <- vgm(
psill = 31955,
model = "Exp",
range = 113132,
nugget = 26297,
anis = c(0, 0.7)
)
The first value of the anis
argument is the angle of the
major direction (in other words the azimuthal direction measured in
degrees clockwise). The value of \(0\)
is equal to the North-South direction. The second value is the
anisotropy ratio, which in our case is 0.7. The anisotropy ratio
indicates that the range in the minor direction (East-West) is only 0.7
of the range in the major direction. Hence, we model a more
ellipsis-like spatial correlation structure.
plot(ev_anis, vm_anis)
Well the changes are small, but still recognizable. We will compare the performance of the isotropic and the anisotropic model in the next section by applying a technique referred to as cross-validation.
Once we defined a variogram model we would like to evaluate its predictive power. Similar as we did in the previous section on inverse distance weighting (IDW) interpolation, we test our model using the leave-one-out cross-validation (LOOCV) approach. Again we assess the predictive power of the model by using the root-mean-square error (RMSE) as model assessment metric.
We can use the same custom function as in the section on IDW:
\[RMSE = \sqrt{\frac{\sum_{i=1}^n(\hat y_i-y_i)^2}{n}}\]
RMSE <- function(residuals) {
sqrt(sum((residuals)^2) / length(residuals))
}
The gstat
package provides the krige.cv()
function for kriging cross validation, either \(n\)-fold or leave-one-out (LOOCV). The
krige.cv()
function takes as input a formula
(formula
) that defines the dependent variable as a linear
model of independent variables. Suppose the dependent variable has name
z
, for ordinary kriging we use the formula
z~1
; for universal kriging, suppose z
is
linearly dependent on x
and y
, we use the
formula z~x+y
. In addition the function expects the
locations
argument, which is an object of class
Spatial
, and the argument model
, which
specifies the variogram model of the dependent variable.
Consequently, an ordinary kriging LOOCV using our
dwd_sp
data set is written as below. Note that we are using
the isotropic exponential variogram model vm_exp
.
rain_OK_LOOCV_exp <- krige.cv(
formula = mean_annual_rainfall ~ 1,
locations = dwd_sp,
model = vm_exp
)
We assess the predictive power of the model by computing the RMSE.
RMSE(residuals = rain_OK_LOOCV_exp@data$residual)
## [1] 161.1598
Now let us plug in the anisotropic exponential variogram model
vm_anis
and check its performance.
Exercise: Evaluate the anisotropic exponential model and the the Matérn model (
vm_auto
). Which model is the best fit?
## Your code here...
rain_OK_LOOCV_anis <- NULL
rain_OK_LOOCV_matern <- NULL
# anisotropic exponential model
rain_OK_LOOCV_anis <- krige.cv(
formula = mean_annual_rainfall ~ 1,
locations = dwd_sp,
model = vm_anis
)
RMSE(residuals = rain_OK_LOOCV_anis@data$residual)
## [1] 160.5813
# matern model
rain_OK_LOOCV_matern <- krige.cv(
formula = mean_annual_rainfall ~ 1,
locations = dwd_sp,
model = vm_auto$var_model
)
RMSE(residuals = rain_OK_LOOCV_matern@data$residual)
## [1] 159.0077
All models in our case perform similarly well, with the anisotropic exponential variogram and the Matérn model performing only slightly better than the isotropic exponential variogram. The Matérn model is the best fit.
A nice visualization of the model residuals is provided by the
bubble()
function. Let us plug in the Matérn
model.
bubble(rain_OK_LOOCV_matern,
"residual",
main = "Ordinary Kriging rainfall: LOOCV residuals"
)
The bubble plot above shows that the model residuals are quite low in the North German Lowlands, indicating a reasonable model performance. In contrast, the residuals are much worse in the Central German Uplands, the Alpine Foreland and the Alps.
Fortunately, our data set (dwd_sp
) not only contains
rainfall data, but as well the altitude of each particular weather
station. Let us make a scatter plot to verify the dependence of rainfall
and elevation.
rain_annual <- dwd_sp@data$mean_annual_rainfall
elevation <- dwd_sp@data$Altitude
plot(elevation, rain_annual, main = "Rainfall and Altitude")
Although there are a number of weather stations at low elevations, probably located close to the northern German coast, which receive a considerable amount of annual rainfall, it appears that in general rainfall increases with altitude.
Consequently, we build a model using universal
kriging, where the predicted value (mean annual rainfall) is
based on the covariate Altitude
plus a weighted function of
neighboring measurements.
Exercise: Use universal kriging based on the
Altitude
to validate the Matérn model for the mean annual rainfall, evaluate the model and visualize the residuals with a bubble plot.
## Your code here...
rain_UK_LOOCV_matern <- krige.cv(
formula = NULL,
locations = NULL,
model = NULL
)
rain_UK_LOOCV_matern <- krige.cv(
formula = mean_annual_rainfall ~ Altitude,
locations = dwd_sp,
model = vm_auto$var_model
)
RMSE(residuals = rain_UK_LOOCV_matern@data$residual)
## [1] 148.9089
bubble(rain_UK_LOOCV_matern,
"residual",
main = "Universal Kriging rainfall: LOOCV residuals", add = TRUE
)
The bubble plot above shows that the model residuals are still quite high for elevate terrain, however the RMSE is significantly lower than for ordinary kriging.
Note that in universal kriging we may incorporate more than one covariate. The bubble plot above indicates that the model predictions are worse in southern regions of Germany compared to northern regions. Hence, we may try to fit the values using the coordinates and a 2nd order polynomial.
Therefore, we first need to extract the coordinates from our
Spatial
object and assign them explicitly to the attribute
table of dwd_sp
.
lat <- dwd_sp@coords[, 1]
lon <- dwd_sp@coords[, 2]
dwd_sp@data$lat <- lat
dwd_sp@data$lon <- lon
Then we simply rewrite our formula expression and re-run the universal kriging LOOCV.
f <- formula(mean_annual_rainfall ~ Altitude + lat + lon + lat * lon + lat * lat + lon * lon)
rain_UK_LAT_LON_LOOCV <- krige.cv(
formula = f,
locations = dwd_sp,
model = vm_auto$var_model
)
Again we visualize the model residuals using a bubble plot and compute the RMSE.
bubble(rain_UK_LAT_LON_LOOCV,
"residual",
main = "Universal Kriging rainfall: LOOCV residuals"
)
RMSE(residuals = rain_UK_LAT_LON_LOOCV@data$residual)
## [1] 148.8224
Well, the model performs only slightly better. It is doubtful that the incorporation of the polynomial term is effectively improving the model. Thus, following the law of parsimony, also known as Occam’s razor, we stick to the simpler model and discard the polynomial modelling approach.
Finally, we may use our model to predict the annual mean rainfall
over the region of Germany from observations at DWD weather stations.
Spatial prediction with the gstat
package is fairly
straight forward. We either may use the predict()
function
or an even simpler approach is to apply the krige()
function. The krige()
function is a wrapper method around
the functions gstat()
and predict()
.
The krige()
function takes as input a formula
(formula
) that defines the dependent variable as a linear
model of independent variables. In addition, the function expects the
locations
argument, which is an object of class
Spatial
, and the argument newdata
, which is a
data.frame
or Spatial
object with the
prediction locations. The object should contain attribute columns with
the independent variables with names as defined in
locations
. Finally, in order to perform kriging the
function takes the argument model
to specify the variogram
model of dependent variable.
Consequently, the spatial prediction model can be written as below.
Please note that depending on your computing resources and depending on
the spatial resolution of the newdata
object, this step can
take some time. For your convenience we show the computing time on an
Intel i7-7600U CPU, with 3.90 GHz and 8 GB RAM.
# time the function
# start time
start_time <- Sys.time()
# main function call
rain_pred <- krige(
formula = mean_annual_rainfall ~ Altitude,
locations = dwd_sp,
newdata = srtm_germany_masked_ETRS89_spdf,
model = vm_auto$var_model
)
## [using universal kriging]
# end time
end_time <- Sys.time()
print(end_time - start_time)
## Time difference of 5.903659 mins
Finally, we plot the spatial predictions using the
spplot()
function.
library(viridisLite)
m <- 50
max_val <- ceiling(max(rain_pred@data$var1.pred) / 100) * 100
min_val <- floor(min(rain_pred@data$var1.pred) / 100) * 100
spplot(rain_pred["var1.pred"],
main = "Predicted annual rainfall [mm]",
at = seq(min_val, max_val, m),
col.regions = rev(viridis(max_val))
)
Exercise: Predict the annual mean temperature over the region of Germany from observations at DWD weather stations, stored in the variable
mean_annual_air_temp
of thedwd_sp
data set. First, construct the variogram model by using theautofitVariogram()
function.
## Your code here...
# model
temp_vm_auto <- NULL
plot(temp_vm_auto)
# main function call
temp_pred <- NULL
# plot
library(colorRamps)
m <- 0.5
max_val <- ceiling(max(temp_pred@data$var1.pred) / 10) * 10
min_val <- floor(min(temp_pred@data$var1.pred) / 10) * 10
spplot(temp_pred["var1.pred"],
main = "Predicted annual temperature [°C]",
at = seq(min_val, max_val, m),
col.regions = matlab.like(200)
)
# model
temp_vm_auto <- autofitVariogram(
formula = mean_annual_air_temp ~ 1,
input_data = dwd_sp
)
plot(temp_vm_auto)
# main function call
temp_pred <- krige(
formula = mean_annual_air_temp ~ Altitude,
locations = dwd_sp,
newdata = srtm_germany_masked_ETRS89_spdf,
model = temp_vm_auto$var_model
)
## [using universal kriging]
# plot
library(colorRamps)
m <- 0.5
spplot(temp_pred["var1.pred"],
main = "Predicted annual temperature [°C]",
at = seq(-20, 20, m),
col.regions = matlab.like(400)
)
Citation
The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us via mail by soga[at]zedat.fu-berlin.de.
Please cite as follow: Hartmann, K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.