In this section we will learn how to apply variography and kriging in Python. We will build Variograms and then we will perform an Ordinary Kriging Approch to decide which of the models results in the lowest error.

In [2]:
# First, let's import the needed libraries.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import geopandas as gpd
import rasterio
import requests
import io
from io import StringIO
import zipfile

Data preparation¶

First we will load the two datasets we will use in this section:

  • German shapefile data (VG5000_LAN.shp)
  • DWD mean annual rainfall data for Germany (dwd_sp) and
  • SRTM DEM of Germany (srtm.germany.masked.ETRS89)

For the shapefiles of Germany we rely on shapefiles provided by the German Federal Agency for Cartography and Geodesy (under this licence). You may directly download the shapefiles here. For the purpose of this tutorial, we also provided the data here.

We load these data sets into memory:

In [3]:
### LOAD SHAPEFILE DATA ###
url = "https://daten.gdz.bkg.bund.de/produkte/vg/vg5000_0101/aktuell/vg5000_01-01.utm32s.shape.ebenen.zip"

r = requests.get(url)
z = zipfile.ZipFile(io.BytesIO(r.content))
z.extractall(path="../data")
boundary_germany = gpd.read_file(
    "../data/vg5000_01-01.utm32s.shape.ebenen/vg5000_ebenen_0101/VG5000_LAN.shp"
)

boundary_germany = boundary_germany.to_crs("epsg:3035")
In [4]:
### LOAD DWD DATA ###
# "http://www.userpage.fu-berlin.de/~soga/300/30100_data_sets/dwd_sp.geojson"
dwd_sp = gpd.read_file(
    "http://userpage.fu-berlin.de/soga/soga-py/300/30900_spatial_interpolation/dwd_sp.geojson"
)
dwd_sp
Out[4]:
MEAN.ANNUAL.RAINFALL ALTITUDE geometry
0 755.0 478.0 POINT (4234819.61426 2748192.18467)
1 820.0 202.0 POINT (4045677.70989 3081917.63415)
2 759.0 44.0 POINT (4202462.72690 3315312.39845)
3 919.0 759.0 POINT (4245047.74819 2789643.85319)
4 790.0 340.0 POINT (4545939.25005 2838265.72134)
... ... ... ...
581 657.0 308.0 POINT (4552333.97346 3109603.91660)
582 875.0 565.0 POINT (4297120.95508 2761375.13048)
583 567.0 62.0 POINT (4523829.16908 3341356.30168)
584 997.0 530.0 POINT (4074689.48322 3010038.15064)
585 734.0 29.0 POINT (4319117.24777 3382112.45555)

586 rows × 3 columns

In [5]:
####################################

### LOAD RASTER DATA ###

import requests, zipfile, io, os

url = "http://www.userpage.fu-berlin.de/~soga/300/30100_data_sets/spatial/srtm_germany_ETRS89.zip"
r = requests.get(url)
z = zipfile.ZipFile(io.BytesIO(r.content))
z.extractall("../data")

## open file
srtm_germany_masked_ETRS89 = rasterio.open(
    "../data/srtm_germany_ETRS89.tif", driver="GTiff"
)
In [6]:
# plot with rasterio.plot, which provides matplotlib functionality
from rasterio.plot import show
In [7]:
fig, ax = plt.subplots(figsize=(5, 15))

srtm_plot = show(
    srtm_germany_masked_ETRS89,
    ax=ax,
    cmap="RdYlGn_r",
    title="SRTM DEM and DWD weather stations\n",
)
dwd_sp.plot(ax=ax, markersize=1, color="black")

## add colorbar
dummy_colorbar = srtm_plot.get_images()[0]
fig.colorbar(dummy_colorbar, ax=ax, fraction=0.046, pad=0.04, label="elevation [m]")


plt.show()

Before we proceed we make sure that all of our data shares the same coordinate reference system.

In [8]:
# check coordinate systems
print(f"dwd_sp crs = {dwd_sp.crs}")
print(f"srtm_germany_masked_ETRS89 crs = {srtm_germany_masked_ETRS89.crs}")
dwd_sp crs = epsg:4326
srtm_germany_masked_ETRS89 crs = EPSG:3035
In [9]:
# reproject spatial data sets
newProj = srtm_germany_masked_ETRS89.crs
dwd_sp = dwd_sp.set_crs(newProj, allow_override=True)
boundary_germany = boundary_germany.to_crs(newProj)

Sample variogram¶

We can explore the spatial correlation of our predictor variable by plotting variogram cloud. The variogram cloud is obtained by plotting all possible squared differences of observation pairs $(Z(s_i)−Z(s_j))^2$ against their separation distance $h_{ij}$. As any point in the variogram cloud refers to a pair of points in the data set, the variogram cloud is used to point us to areas with unusual high or low variability (Bivand et al. 2008).

In Python we use the scikit-gstat library, which provides a straightforward method to calculate a variogram. The method skg.Variogram(coordinates, values) passes coordinates and values to the Variogram Class.

In [10]:
import skgstat as skg

coords = list(zip(dwd_sp.geometry.x, dwd_sp.geometry.y))
values = dwd_sp["MEAN.ANNUAL.RAINFALL"]
In [11]:
V = skg.Variogram(coords, values)

When working with the Variogram Class from scikit-gstat we have to understand how this class works. Therefore we need to know some basic things about variography. Otherwise, it can be hard to understand what we are doing in this chapter and how to interpret the resulting figures.

There are some difficulties that come with the presented method:

  • For proper usage, we have to know which underlying spatial models are suitable and fit our observations.
  • There are over 10 arguments that can be set optionally. The default values will most likely not be sufficient for the our data and requirements.

Let's start to understand a few basic thing about variograms with the help of the following figures.

In [12]:
V.distance_difference_plot()
plt.show()
C:\Users\mceck\miniconda3\envs\rasterdata\lib\site-packages\skgstat\plotting\variogram_dd_plot.py:46: UserWarning: Matplotlib is currently using module://matplotlib_inline.backend_inline, which is a non-GUI backend, so cannot show the figure.
  fig.show()

What actually is a variogram?

The Variogram tries to describe the relationship between a dependent variable to an independent variable with statistical models. This model describes spatial properties and can be further used to interpolation. The variogram relates the distance between two points to measure their similarity at that given distance. Recall the Tobler's first law of geography:

"Everything is related to everything else, but near things are more related than distant things."

In geostatistics, the independent variable is usually a measure of Euclidean distance. This can be traced in the figure above: The scattering of the pairwise distance (variance) increases wih the sepereating distance between two points. But it is not possible to estimate the behavior of the distribution at different distances. We have to group all point pairs at the same distance lag together into one group, or bin. Binning the separating distances is the most important task in variogram analysis.

If you want to learn more about variograms, the scikit-gstat documentation is a helpful ressource.

In [13]:
V.plot()

plt.show()
C:\Users\mceck\miniconda3\envs\rasterdata\lib\site-packages\skgstat\plotting\variogram_plot.py:123: UserWarning: Matplotlib is currently using module://matplotlib_inline.backend_inline, which is a non-GUI backend, so cannot show the figure.
  fig.show()

The distance calculation of the Variogram Class can be controlled by the dist_func argument. The default value is 'euclidean'. For more information see here.


Variogram modelling¶

The scikit.gstat module comes with the Variogram() function, which allows us to specify a wide range of potential variogram models and specific parameters. A list of available models is printed by using the help(skg.Variogram) command.

  • spherical
  • exponential
  • gaussian
  • cubic
  • stable
  • matern
  • nugget

A more detailed description of each model with some example can be found in the documentation of the variogram models.

In [14]:
# help(skg.Variogram)

The skg.Variogram() function is specified by a range of additional arguments such as:

  • model, which specifies the model type, e.g. "exponential", "spherical", ... (see above),
  • n_lags, which is the number of lag bins,
  • use_nugget, which adds the nugget component to the model,
  • maxlag, which is Maximum lag distance to be considered in this Variogram instance,
  • ...

Depending on the data set and on the model of choice we have to specify initial values for the different parameters. If we choose initial values too far off from reasonable values, the fit will most likely not be sufficient.

If we consider the sample variogram from above, an exponential model appears to be a good fit for the variogram until 300,000. For simplicity we will just look at that range of values. By eyeballing the sample variogram we set model = "exponential" and maxlag= 300000.

Additionally you may set the fit_method (default None), where you manually fit values for the variogram. You can set the effective range, sill and nugget either directly to the fit function, or as fit_ prefixed keyword arguments on the Variogram instance. We will set:

  • fit_range = 90000
  • fit_sill = 40000
  • fit_nugget = 15000
In [15]:
V = skg.Variogram(
    coords,
    values,
    model="exponential",
    maxlag=300000,
    fit_range=90000,
    fit_sill=40000,
    fit_nugget=15000,
)

To verify our parameter choice we plot the sample variogram (blue) and overlay it with the variogram model (green) as specified above.

In [16]:
# plot
V.plot()
plt.show()
C:\Users\mceck\miniconda3\envs\rasterdata\lib\site-packages\skgstat\plotting\variogram_plot.py:123: UserWarning: Matplotlib is currently using module://matplotlib_inline.backend_inline, which is a non-GUI backend, so cannot show the figure.
  fig.show()

The variogram model looks reasonable, but for sure not perfect.

  • Selection of the function and parameters based on a goodness-of-fit criterion

We may select the variogram model and its parameters statically, by minimizing some goodness-of-fit criterion. For example the V.parameters function allows us to find an appropriate variogram model. There are three numbers in the output, which describe the following parameters:

  • sill: upper bound the model approaches
  • effective range: 95% of the sill are approached = distance of statistically independence
  • nugget: add the semi-variance to all values (y-intercept) = semi-variance modeled on the 0-distance lag

Let us apply the V.parameters to obtain the parameters of an exponential variogram model.

In [17]:
print(V.parameters)
[183916.33172896618, 46850.44487117823, 0]

OK, compare the result with our eyeballed parametrization. We have not been too much off.

Note that we may specify the fitting method, used by skg.Variogram() by setting the additional argument fit_method. You may set the method directly to the fit function, or as fit_ prefixed keyword arguments on Variogram instantiation. The following four options are implemented:

$$ \begin{array}{c|c} \text{Abbrev.} & \text{fit_method} \\ \hline 'trf' & \text{Trust-Region Reflective (default)} \\ 'lm' & \text{Levenberg-Marquardt} \\ 'ml' & \text{Maximum Likelihood estimation} \\ 'manual' & \text{Manual fitting by setting the parameters} \end{array} $$

Now let us try out two more model types, the spherical variogram model and the Gaussian variogram model, and visualize them as well as the exponential model.

In [18]:
V_sph = skg.Variogram(
    coords,
    values,
    model="spherical",
    maxlag=300000,
    fit_range=90000,
    fit_sill=40000,
    fit_nugget=15000,
).plot()
plt.title("Spherical variogram model")
plt.show()
In [19]:
V_gau = skg.Variogram(
    coords,
    values,
    model="gaussian",
    maxlag=300000,
    fit_range=90000,
    fit_sill=40000,
    fit_nugget=15000,
).plot()
plt.title("Gaussian variogram model")
plt.show()
C:\Users\mceck\miniconda3\envs\rasterdata\lib\site-packages\skgstat\plotting\variogram_plot.py:123: UserWarning: Matplotlib is currently using module://matplotlib_inline.backend_inline, which is a non-GUI backend, so cannot show the figure.
  fig.show()
In [20]:
V_exp = skg.Variogram(
    coords,
    values,
    model="exponential",
    # use_nugget = True,
    # n_lags = 30,
    maxlag=300000,
    fit_range=90000,
    fit_sill=40000,
    fit_nugget=15000,
)
V_exp.plot()
plt.title("Exponential variogram model")
plt.show()

Based on the visual inspection the exponential variogram model (V_exp) appears to be the best fit.


Anisotropy¶

So far, we assumed that the spatial correlation structure is the same in all directions, or isotropic. Hence, the covariance function depends only on the magnitude of the lag vector, $h$, and not the direction. In many cases, however, the autocorrelation shows a structure, which different for different directions. This property of a spatial process is referred to as anisotropy.

The most commonly employed model for anisotropy is geometric anisotropy, with the semivariogram reaching the same sill in all directions, but at different ranges. To check for directional dependence we directional empirical semivariograms, which include only those values for data pairs falling within certain directional bands as well as falling within the prescribed lag limits.

With the scikit-gstat package we can account for anisotropy by specifying directional bands by an azimuthal direction. The skg.DirectionalVariogram() function accepts the azimuth parameter to indicate the direction in plane, in positive degrees clockwise. For example we may specify the main directions to be $0^\circ$, $45^\circ$, $90^\circ$, and $135^\circ$ with angular tolerance of $22.5^\circ$.

In [21]:
azimuth = np.arange(0, 4, 1) * 45
azimuth
Out[21]:
array([  0,  45,  90, 135])

We plug the additional argument azimuth and tolerance into the skg.DirectionalVariogram() function call and plot all graphs on a panel for better comparison. Each plot corresponds to a certain directional band. Note that directional semivariograms are often noisier due to the reduced number of data pairs used for estimation.

We will store our different Variogram classes in a dictionary, to be able to create them in a loop.

In [22]:
## store as dictionary
d = {}
for x in azimuth:
    d["V_{0}".format(x)] = skg.DirectionalVariogram(
        coords,
        values,
        azimuth=x,
        model="exponential",
        tolerance=22.5,
        maxlag=300000,
        fit_range=90000,
        fit_sill=40000,
        fit_nugget=15000,
    )
In [23]:
fig, axes = plt.subplots(2, 2, figsize=(8, 4), sharey=True)


for V, ax in zip(d.keys(), axes.ravel()):
    ax.plot(d[V].bins, d[V].experimental, ".", label=f"{V}")
    ax.set_title(V)
    ax.set_ylabel("semivariance")
    ax.set_xlabel("distane Lag (-)")


plt.tight_layout()
plt.show()

The plot does not show overwhelming evidence of anisotropy. Let us add our exponential variogram model we fitted in the previous section to see if it fits the data in all four directional bands.

In [24]:
import gstools as gs
In [25]:
fig, axes = plt.subplots(2, 2, figsize=(8, 4), sharey=True)

exp_model = V_exp.to_gstools()  ## extract the Variogram with gstools

for V, ax in zip(d.keys(), axes.ravel()):
    ax.plot(d[V].bins, d[V].experimental, ".", label=f"{V}")
    exp_model.plot(ax=ax)
    ax.set_title(V)
    ax.set_ylabel("semivariance")
    ax.set_xlabel("distane Lag (-)")
    ax.get_legend().remove()
    ax.set_xlim(0, 305000)

plt.tight_layout()
plt.show()
C:\Users\mceck\miniconda3\envs\rasterdata\lib\site-packages\gstools\covmodel\plot.py:114: UserWarning: Matplotlib is currently using module://matplotlib_inline.backend_inline, which is a non-GUI backend, so cannot show the figure.
  fig.show()

Not too bad! However, we can explicitly specify the geometry of the covariance structure, by adding the azimuth and tolerance into the skg.DirectionalVariogram() function call. Note that we plug in same parameters for nugget, sill and range as obtained for the exponential variogram model V_exp.

The values of the azimuth argument give the angle, or in other words the azimuthal direction measured in degrees clockwise, of the major direction, 0 (=North-South). Hence, we model a more ellipsis-like spatial correlation structure.

In [26]:
fig, axes = plt.subplots(2, 2, figsize=(8, 4), sharey=True)


for V, ax in zip(d.keys(), axes.ravel()):
    d[V].plot(axes=ax, hist=False)
    ax.set_title(V)
    ax.set_ylabel("semivariance")
    ax.set_xlabel("distane Lag (-)")


plt.tight_layout()
plt.show()

Model evaluation¶

Once we defined a variogram model we would like to evaluate its predictive power. We test our model using the leave-one-out cross-validation (LOOCV) approach. Again we assess the predictive power of the model by using the root-mean-square error (RMSE) as model assessment metric.

$$RMSE = \sqrt{\frac{\sum_{i=1}^n(\hat y_i-y_i)^2}{n}}$$

First, we will define our own rmse function:

In [27]:
def rmse(residuals):
    return np.sqrt(np.sum((residuals) ** 2) / len(residuals))

Now we are ready to perfom the kriging interpolation combined with LOOCV. Note that we are using the isotropic exponential variogram model V_exp. We assess the predictive power of the model by computing the RMSE. cross_validate() provides an ready to use implementation of the LOOCV of the variogram model by means of kriging. We can set the metric = "rmse" to get the rmse as our model assessment metric.

In [28]:
V_exp.cross_validate(metric="rmse")
Out[28]:
116.13995314666386

Now let us plug in one of the anisotropic exponential variogram model and check its performance.

In [29]:
d["V_0"].cross_validate(metric="rmse")
Out[29]:
116.29184381891159

Well, slightly worse.

Let us evaluate one more model, the Matern model. Therefore we will have to build a Variogram instance first.

In [30]:
V_matern = skg.Variogram(
    coords,
    values,
    model="matern",
    maxlag=300000,
    fit_range=90000,
    fit_sill=30000,
    fit_nugget=15000,
)

V_matern.cross_validate(metric="rmse")
Out[30]:
122.09615008314385

OK, still a bit worse! Hence, we proceed with the Exponential model.

A nice visualization of the model residuals is the bubble plot. We can produce a bubble plot with the help of the seaborn library which provides the scatterplot() function. We need to extract the residuals from the kriging with a custom function. Therefore we will built a custom function loocv_kriging(). Each point in our dataset will be subject to kriging interpolation. Then the difference between the true data value and the estimated data value will be calculated as a residual. The predictive power of the model will be estimated with the help of the RSME. We will use our custom rmse() function defined above.

In [31]:
def loocv_kriging(
    x, y, z, model="exponential", direction=False, azimuth=0, tolerance=22.5
):
    """
    Function for ordinary kriging combined with a Leave-one-out cross validation (LOOCV) approach
    Function iterates through each point of x,y,z Dataframe, each time leaving one point out.
    loocv_kriging will return the interpolated values and the residuals as arrays.
    An ordinary kriging based on specified VARIOGRAM will be performed for each point in iteration.
    Difference will be calculated
    INPUTS:
    x, y
            arrays containing coordinates of point data
    z
            array containing data values that are of interest for interpolation
    model
            Variogram model to be used, e.g. "exponential" (default), "spherical", "gaussian"
            For more information, see documentation of scikit-gstat
            https://scikit-gstat.readthedocs.io/en/latest/userguide/variogram.html#variogram-models
    direction
            wether to use a Directional Variogram
            https://scikit-gstat.readthedocs.io/en/latest/reference/directionalvariogram.html#skgstat.DirectionalVariogram
            default= False,
            if True remember to set azimuth and tolerance!
    azimuth
            default = 0
    tolerance
            default22.5
    """
    from skgstat import OrdinaryKriging

    ## Initialise two empty vectors to be filled with in a loop
    residuals = []
    rainfall_krig = []

    # combine the data array in one structe
    kriging_data = pd.DataFrame({"x_val": x, "y_val": y, "rainfall": z})

    for index, row in kriging_data.iterrows():
        # drop one point from dataset
        loocv = kriging_data.drop([index])
        # from coords and values for variogram
        loocv_coords = list(zip(loocv.x_val, loocv.y_val))
        loocv_values = loocv.rainfall
        # set up variogram model
        if direction == False:
            V = skg.Variogram(
                loocv_coords,
                loocv_values,
                model=model,
                # maxlag= 300000,
                fit_range=90000,
                fit_sill=40000,
                fit_nugget=15000,
            )
        else:
            V = skg.DirectionalVariogram(
                loocv_coords,
                loocv_values,
                model=model,
                azimuth=azimuth,
                tolerance=tolerance,
            )

        # set up kriging instance
        ok = OrdinaryKriging(V, min_points=2, max_points=10, mode="exact")
        # interpolate left out point based on kriging
        loocv_rainfall_point = ok.transform(
            row["x_val"].flatten(), row["y_val"].flatten()
        )
        # append the interpolated rainfall in one array
        rainfall_krig = np.append(rainfall_krig, loocv_rainfall_point)
        # calculate difference between interpolated and true rainfall for the respective point
        diff = loocv_rainfall_point - row["rainfall"]
        # append residuals in one array
        residuals = np.append(residuals, diff)
    return rainfall_krig, residuals
In [32]:
rainfall_krig, residuals = loocv_kriging(
    dwd_sp.geometry.x,
    dwd_sp.geometry.y,
    dwd_sp["MEAN.ANNUAL.RAINFALL"],
    model="exponential",
)
In [33]:
import seaborn as sns

## data prep for nice plotting
data = pd.DataFrame(
    {"x": dwd_sp.geometry.x, "y": dwd_sp.geometry.y, "residuals": residuals}
)
data["LOOCV_residuals"] = np.where(data["residuals"] > 0, "neg", "pos")
data_pos = data[(data["residuals"] > 0)]
data_neg = data[(data["residuals"] < 0)]

### PLOTTING ####

fig, ax = plt.subplots(figsize=(7, 6))
# seaborn bubble plot
sns.scatterplot(
    x=data["x"],
    y=data["y"],
    size=abs(data["residuals"]),
    hue=data["LOOCV_residuals"],
    sizes=(1, 300),
    legend=True,
    ax=ax,
    alpha=0.7,
)

plt.title("Ordinary Kriging rainfall: LOOCV residuals\n")
plt.xlabel("")
plt.ylabel("")
## Legend outside of the plot
plt.legend(title=" ", bbox_to_anchor=(1.01, 1), borderaxespad=0)

plt.tight_layout()
plt.show()

The bubble plot above shows that the model residuals are quite low in the North German Lowlands, indicating a reasonable model performance. In contrast, the residuals are much worse in the Central German Uplands, the Alpine foreland and the Alps.

Spatial prediction¶

Finally, we may use our model to predict the annual mean rainfall over the region of Germany from observations at DWD weather stations.

The scikit-gstat package provides the OrdinaryKriging() method for kriging. The Ordinary kriging estimator uses a given Variogram class to calculate estimations for unobserved locations. The following parameters are implemented and should be set:

  • variogram - Variogram used to build the kriging estimation (contrains the known data).
  • min_points – Minimum amount of points to be considered within the variogram’s range.
  • max_points – Maximum amount of points, that will be considered.

There are a more additional parameters, but for our application the ones named aboved are sufficient.

Consequently the spatial prediction model can be written as below. Please note that depending on your computing resources and depending on the spatial resolution of the meshgrid, this step can take some time.

First, we will have to build a grid to interpolate on. A procedure we already know from previous sections.

In [34]:
## meshgrid to interpolate on

## extent
xmin, xmax = min(dwd_sp.geometry.x), max(dwd_sp.geometry.x)
ymin, ymax = min(dwd_sp.geometry.y), max(dwd_sp.geometry.y)

# size of the grid
nx, ny = 50, 50

# generate two arrays of evenly space data between ends of previous arrays
xi = np.linspace(xmin, xmax, nx)
yi = np.linspace(ymin, ymax, ny)

# generate grid
xx, yy = np.meshgrid(xi, yi)

Set up the kriging instance.

In [35]:
from skgstat import OrdinaryKriging

ok = OrdinaryKriging(V_exp, min_points=2, max_points=10, mode="exact")

Apple the kriging instance to the meshgrid as flattened arrays.

In [36]:
kriging_grid = ok.transform(xx.flatten(), yy.flatten()).reshape(xx.shape)

# We calculate the kriging error on our grid:
sigma2d = ok.sigma.reshape(xx.shape)
In [37]:
fig, ax = plt.subplots(1, 2, figsize=(15, 10), sharey=True)


### plot 1: Interpolation grid
contour_levels = np.arange(400, 1900, 50)

contour_levels = np.arange(400, 1900, 50)
krig = ax[0].contourf(
    xx,
    yy,
    kriging_grid,
    levels=contour_levels,
    # range(150,200,5),
    cmap="viridis_r",
    alpha=0.7,
)

boundary_germany.plot(ax=ax[0], facecolor="none", edgecolor="black")
fig.colorbar(krig, ax=ax[0])
ax[0].set_title("Predicted annual rainfall [mm]")


### plot 2: sigma error
contour_levels = np.arange(1000, 35000, 50)

sigma = ax[1].contourf(xx, yy, sigma2d, cmap="hot_r", levels=contour_levels, alpha=0.7)

boundary_germany.plot(ax=ax[1], facecolor="none", edgecolor="black")
fig.colorbar(sigma, ax=ax[1])
ax[1].set_title("Kriging error $sigma^2$")

plt.show()

Introducing a Covariate¶

Fortunately, our data set (dwd_sp) not only contains rainfall data, but as well the altitude of each particular weather station. We might be able to improve our interpolation including the covariate ALTITUDE. Let us make a scatter plot to verify the dependence of rainfall and elevation as indicated by the bubble plot.

In [38]:
fig, ax = plt.subplots(figsize=(6, 4))
plt.scatter(dwd_sp["ALTITUDE"], dwd_sp["MEAN.ANNUAL.RAINFALL"], color="black", s=3)
plt.title("Rainfall and Altitude")
plt.xlabel("Elevation")
plt.ylabel("Mean Annual Rainfall")
plt.show()

Although there are a number of weather stations at low elevations, probably located close to the northern German coast, which receive a considerable amount of annual rainfall, it appears that in general rainfall increases with altitude as indicated by the bubble plot.

Consequently, we build a model using universal kriging, where the predicted value (mean annual rainfall) is based on the covariate ALTITUDE plus a weighted function of neighboring measurements. To apply universal kriging LOOCV using our dwd_sp data set import a new library PyKrige, which allows for universal kriging.

In [39]:
from pykrige.uk3d import UniversalKriging3D

Apply the kriging method with an exponential model.

In [40]:
fit_sill, fit_range, fit_nugget = 65000, 500000, 0

ok3d = UniversalKriging3D(
    dwd_sp.geometry.x,
    dwd_sp.geometry.y,
    dwd_sp.ALTITUDE,
    dwd_sp["MEAN.ANNUAL.RAINFALL"],
    variogram_model="exponential",
    variogram_parameters=[fit_sill, fit_range, fit_nugget],
    enable_plotting=True,
)

Again we visualize the model residuals using a bubble plot and compute the RMSE.

In [41]:
import seaborn as sns

## data prep for nice plotting
resid = np.append(np.nan, ok3d.delta)
data = pd.DataFrame(
    {"x": dwd_sp.geometry.x, "y": dwd_sp.geometry.y, "residuals": resid}
)
data["LOOCV_residuals"] = np.where(data["residuals"] > 0, "neg", "pos")
data_pos = data[(data["residuals"] > 0)]
data_neg = data[(data["residuals"] < 0)]

### PLOTTING ####

fig, ax = plt.subplots(figsize=(7, 6))
# seaborn bubble plot
sns.scatterplot(
    x=data["x"],
    y=data["y"],
    size=abs(data["residuals"]),
    hue=data["LOOCV_residuals"],
    sizes=(1, 300),
    legend=True,
    ax=ax,
    alpha=0.7,
)

plt.title("Ordinary Kriging rainfall: LOOCV residuals\n")
plt.xlabel("")
plt.ylabel("")
## Legend outside of the plot
plt.legend(title=" ", bbox_to_anchor=(1.01, 1), borderaxespad=0)

plt.tight_layout()
plt.show()
In [42]:
rmse(ok3d.delta)
Out[42]:
148.0164746390339

The bubble plot above shows that the model residuals are still quite high for elevate terrain.

Spatial prediction¶

Finally, we may use our model to predict the annual mean rainfall over the region of Germany from observations at DWD weather stations. Spatial prediction with the PyKrige package is fairly straight forward. We may use the execute() function on our 3d kriging instance, which calculates a kriged grid and the associated variance.

The execute() function takes as input a style (style) that specifies how to treat input kriging points. Here we specify "grid". In addition the function expects the xpoints, ypoints, zpoints arguments, which are input arrays describing the grid.

Consequently the spatial prediction model can be written as below. Please note that depending on your computing resources and depending on the spatial resolution, this step can take some time.

First, we will have to build a grid to interpolate on. A procedure we already know from previous sections.

In [43]:
xmin, xmax = min(dwd_sp.geometry.x), max(dwd_sp.geometry.x)
ymin, ymax = min(dwd_sp.geometry.y), max(dwd_sp.geometry.y)
zmin, zmax = min(dwd_sp.ALTITUDE), max(dwd_sp.ALTITUDE)
## meshgrid to interpolate on
# # size of the grid to interpolate
nx, ny = 50, 50

# generate two arrays of evenly space data between ends of previous arrays
xi = np.linspace(xmin, xmax, nx)
yi = np.linspace(ymin, ymax, ny)
zi = np.linspace(zmin, zmax, ny)


# generate grid
xx, yy, zz = np.meshgrid(xi, yi, zi)
In [44]:
interpol_values_3d, sigma3d = ok3d.execute(
    "grid", xi.flatten(), yi.flatten(), zi.flatten()
)

Finally we plot the spatial predictions using the contourf() function.


In [45]:
fig, ax = plt.subplots(1, 2, figsize=(15, 10), sharey=True)

contour_levels = np.arange(400, 1900, 50)

krig = ax[0].contourf(
    xi,
    yi,
    interpol_values_3d[0, :, :],
    cmap="viridis_r",
    levels=contour_levels,
    alpha=0.7,
)

boundary_germany.plot(ax=ax[0], facecolor="none", edgecolor="black")
fig.colorbar(krig, ax=ax[0])
ax[0].set_title("Predicted annual rainfall [mm]")


contour_levels = np.arange(1000, 35000, 50)

sigma = ax[1].contourf(
    xi, yi, sigma3d[0, :, :], cmap="hot_r", levels=contour_levels, alpha=0.7
)

boundary_germany.plot(ax=ax[1], facecolor="none", edgecolor="black")
fig.colorbar(sigma, ax=ax[1])
ax[1].set_title("Kriging error $sigma^2$")


plt.show()
In [46]:
np.mean(np.sqrt(sigma2d)), np.mean(np.sqrt(sigma3d[0, :, :]))
Out[46]:
(126.18677098523185, 95.20714054394963)

By mere observation, we can see that the introduction of a covariate has improved the kriging result. Thus, the kriging error for universal kriging is lower than for ordinary kriging. In general we can see that the kriging error is much lower in areas of higher sampling density.


Citation

The E-Learning project SOGA-Py was developed at the Department of Earth Sciences by Annette Rudolph, Joachim Krois and Kai Hartmann. You can reach us via mail by soga[at]zedat.fu-berlin.de.

Creative Commons License
You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follow: Rudolph, A., Krois, J., Hartmann, K. (2023): Statistics and Geodata Analysis using Python (SOGA-Py). Department of Earth Sciences, Freie Universitaet Berlin.