40400_geppert.knit

Geppert, Marielle; Hartmann, Kai; Kirchner, Ingo; Pfahl, Stephan; Struck, Ulrich; Riedel, Frank (2022). Precipitation Over Southern Africa: Moisture Sources and Isotopic Composition. JGR: Atmospheres, https://doi.org/10.1029/2022JD037005

Context of the Study

Southern Africa is characterized by arid and semi-arid landscapes and is particularly susceptible to extreme weather conditions. Intriguingly, over the last 100,000 years, extensive lakes have periodically formed in the central Kalahari desert, raising questions about historical changes in atmospheric circulation and precipitation patterns.

Geppert et al. conducted a study about the annual precipitation distributions throughout Southern Africa. They focused on the analysis of stable water isotope compositions, moisture transport pathways, and moisture sources.

Stable isotopes of hydrogen and oxygen (such as $^{2}H$ and $^{18}O$ ) in water molecules vary slightly based on their source and the environmental conditions they have been through. By analyzing these isotopes, it is possible to trace the origins of water sources and to understand the pathways of moisture transport.

Changes in stable isotope ratios in precipitation can reveal shifts in atmospheric circulation patterns and climate. For instance, when water evaporates from the ocean, water molecules containing heavier isotopes of oxygen ( $^{18}O$ ) and hydrogen ( $^{2}H$ or deuterium) are more likely to remain in the ocean. This results in a higher concentration of these heavier isotopes in the ocean, which is reflected in the $\delta^{2}H$ and $\delta^{18}O$ ratios of the ocean water. As the evaporated water forms precipitation and moves inland, separation continues and and further changes isotopic ratios occur. Analyzing these isotopic variations helps to reconstruct past precipitation regimes and thus provides insights into historical patterns of atmospheric circulation.

Data Collection

The study involved collecting water samples of precipitation and different surface waters in southern Africa between 2016 and 2021. The map shows the sampling locations and the sample type, which is: ocean, spring, lake, precipitation and river. Furthermore, the data of eight Global Network for Isotopes in Precipitation (GNIP) stations has been used.

Sample locations

The data set and further information about the sampling process can be found here.

Let us take a closer look at the data:

url <- "https://doi.pangaea.de/10.1594/PANGAEA.944811?format=textfile"

data <- read.table(url, skip = 268, header = TRUE, sep = "\t", fileEncoding = "UTF-8", check.names = FALSE)

head(data)

##          Event Sample ID  Latitude Longitude  Date/Time Samp type
## 1 WaterSA_SLW1      SLW1 -33.88917  18.96917 2016-08-29     River
## 2 WaterSA_SLW2      SLW2 -33.87800  19.03517 2016-08-29     River
## 3 WaterSA_SLW3      SLW3 -33.93667  19.17000 2016-08-29     River
## 4 WaterSA_SLW4      SLW4 -33.69350  19.32483 2016-08-29     River
## 5 WaterSA_SLW5      SLW5 -33.54333  19.20733 2016-08-29     River
## 6 WaterSA_SLW6      SLW6 -33.33367  19.87767 2016-08-30      Lake
##                                                                                 Sample comment
## 1                                                                               River at Pniel
## 2         River Berg; abundant with insect larvae; dam upstream; reservoir lake not accessible
## 3                                                                   Minor waterfall; iron rich
## 4                                                           River; abundant with insect larvae
## 5                                                                                   River Bree
## 6 Reservoir lake; under almost natural conditions (shore vegetation); animal sample taken here
##   δ18O H2O [‰ SMOW] δD H2O [‰ SMOW] δ18O H2O std dev [±] δD H2O std dev [±]
## 1             -3.54          -14.50                 0.09               0.64
## 2             -3.33          -13.62                 0.09               0.45
## 3             -4.44          -22.33                 0.04               0.59
## 4             -4.28          -22.70                 0.07               0.30
## 5             -4.09          -18.99                 0.04               0.34
## 6             -2.59          -18.59                 0.10               0.29

The data set contains 188 samples and the following 11 variables: Event, Sample ID, Latitude, Longitude, Date/Time, Samp type, Sample comment, δ18O H2O [‰ SMOW], δD H2O [‰ SMOW], δ18O H2O std dev [±], δD H2O std dev [±]. The isotope ratios are expressed in the conventional delta notation (δ $^{18}O$ , δ $^{2}H$ ) in per mil (‰) relative to VSMOW (Vienna Standard Mean Ocean Water).

The Random Forest Algorithm in the Study

The Random Forest (RF) algorithm is applied to assess the relative importance of various meteorological variables on the stable isotope data. Therefore, the study uses the cforest() function from the R package party. One of the main advantages of the party package is its ability to handle both categorical and continuous predictor variables.

In the following, the assessment will be showcased using the (= $\delta^{18}O$ ) isotope data.

load(url("https://userpage.fu-berlin.de/soga/data/r-data/IsoW.data_untransf.fact_lo_TOPO_all2021comb_means.RData"))

We will rename the variables mw18O to O18(= $\delta^{18} O$ ) for clarity and variables mwdD (= $\delta^{2}H$ ) and d.Excess are removed to focus on the O18 isotopes. Furthermore, we remove the country of sample origin.

# change names
names(TrajIsoLC.wmean)[names(TrajIsoLC.wmean) == 'mw18O'] <- 'O18'
names(TrajIsoLC.wmean)[names(TrajIsoLC.wmean) == 'mwdD'] <- 'H2'
names(TrajIsoLC.wmean)[names(TrajIsoLC.wmean) == 'Monat'] <- 'month'

# exclude variables
TrajIsoLC.wmean <- TrajIsoLC.wmean[,-which(colnames(TrajIsoLC.wmean)=="H2")]
TrajIsoLC.wmean <- TrajIsoLC.wmean[,-which(colnames(TrajIsoLC.wmean)=="d.Excess")]
TrajIsoLC.wmean <- TrajIsoLC.wmean[,-which(colnames(TrajIsoLC.wmean)=="Land.Ocean")]
TrajIsoLC.wmean <- TrajIsoLC.wmean[,-which(colnames(TrajIsoLC.wmean)=="Africa")]
TrajIsoLC.wmean <- TrajIsoLC.wmean[,-which(colnames(TrajIsoLC.wmean)=="Oceans")]
TrajIsoLC.wmean <- TrajIsoLC.wmean[,-which(colnames(TrajIsoLC.wmean)=="ISO")]

Then the data is subset, including only data where the explanatory fraction is greater than $0.6$ :

IsoW <- TrajIsoLC.wmean
IsoW.06 <- subset(IsoW, IsoW$expl.frac > 0.6)

Initial Model

We start with a performance test on our data for the RF algorithm by creating an initial RF model using the randomForest() function. It predicts $\delta^{18}O$ based on all other variables in IsoW.06 with 2000 trees.

# for reproduciblity
my.seed <- 196
set.seed(my.seed)

# initial RF model
library(randomForest)

model1 <- randomForest(
  formula = O18 ~ .,
  ntree = 2000,
  data = IsoW.06
)

model1

## 
## Call:
##  randomForest(formula = O18 ~ ., data = IsoW.06, ntree = 2000) 
##                Type of random forest: regression
##                      Number of trees: 2000
## No. of variables tried at each split: 18
## 
##           Mean of squared residuals: 5.413177
##                     % Var explained: 18.56

In each split of a tree, 18 variables were randomly chosen to determine the best split. About 18.56% of the variability in the $\delta^{18}O$ values can be explained by the model.

By visualizing the model performance, we can observe how error decreases with the number of trees.

plot(model1)

As shown in the figure, the mean squared residuals stabilize around 500 trees. Thus, in the following step we use ´ntree=500´and optimize the tree depth.

Hyper-parameter Tuning

Now, we will tune the parameter that determines the number of variables that are randomly sampled as candidates at each split (mtry). The numbers of trees and variables are crucial for the model performance. The authors applied a range of mtry from 1 to 52 while ntree is fixed at 10000 after Behnamian et al. 2017 (a large number of trees ensures stable variable importance). This grid represents different combinations of hyper-parameters to be tested. In our example we used the above identified threshold of ntrees=500 and mtry=2:54.

set.seed(my.seed)
ntree <- 500

# hyper parameter grid
hyper_grid <- expand.grid(
  mtry = seq(1, 54, by = 2), # mtry-max has to be smaller than max number of variables!
  ntree = seq(ntree, ntree, by = ntree / 10),
  OOB_RMSE = 0
)

# run models
library(party)
library(caret)

set.seed(my.seed)
  for(i in 1:nrow(hyper_grid)) {
    model <- party::cforest(formula = O18 ~ .,
                            data    = IsoW.06,
                            controls = cforest_unbiased(ntree = hyper_grid$ntree[i],
                                                        mtry = hyper_grid$mtry[i]))
    error <- caret:::cforestStats(model)
  # add OOB error to grid
    hyper_grid$OOB_RMSE[i] <- error[1]
  }
library(magrittr)
# top 10 performing models
  hyper_grid %>%
    dplyr::arrange(OOB_RMSE) %>%
    head(10)

##    mtry ntree OOB_RMSE
## 1    23   500 2.306998
## 2    29   500 2.310671
## 3    51   500 2.315662
## 4    41   500 2.315870
## 5    35   500 2.325206
## 6    37   500 2.325341
## 7    53   500 2.328874
## 8    45   500 2.329887
## 9    33   500 2.332528
## 10   13   500 2.333812

For each parameter combination, the Out-of-Bag (OOB) error is calculated. The best model with 500 trees has an mtry value of 23.

Final Model

Now, the final RF model with optimized parameters can be created. The varimp() function is used to calculate the importance of each variable in the model. This tells us which variables are most predictive of O18.

IsoW06.O18.seed <- my.seed
IsoW06.O18.seed

## [1] 196

my.ntree <- 500
set.seed(IsoW06.O18.seed)
IsoW06.O18.mtry <- hyper_grid$mtry[hyper_grid$OOB_RMSE == min(hyper_grid$OOB_RMSE)]

# default RF model
cf <- party::cforest(formula = O18 ~ .,
                     data    = IsoW.06,
                     controls = cforest_unbiased(ntree = my.ntree, 
                     mtry = IsoW06.O18.mtry))

#error <- caret:::cforestStats(cf)
stats <- caret:::cforestStats(cf)

library(tibble)
IsoW06.O18.var.imp.all <- party::varimp(cf, conditional = TRUE) %>% 
  sort(decreasing=TRUE) %>% 
  enframe() %>%
  dplyr::top_n(54)

IsoW06.O18.var.imp.10 <- IsoW06.O18.var.imp.all %>% 
  dplyr::top_n(10)

library(cowplot)

IsoW06.O18 <-
  ggplot(IsoW06.O18.var.imp.10, aes(value, reorder(name, value))) +
  geom_col() +
  scale_x_continuous(expand = expansion(mult = c(0, 0.00001), add = c(0, 0.02))) +
  labs(
    title = "δ¹⁸O",
    subtitle = paste("ntree = ", ntree, ", mtry =", IsoW06.O18.mtry, "; \nRMSE = ", round(error[1], digits = 2), "$R² = ", round(error[2], digits = 2))
  ) +
  xlab("Importance score") +
  theme_bw() +
  theme(
    axis.title = element_text(size = 9),
    axis.title.y = element_blank(),
    plot.title = element_text(size = 11),
    plot.subtitle = element_text(size = 8),
    axis.text = element_text(size = 7),
    text = element_text(size = 8),
    rect = element_rect(fill = "transparent"),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    panel.background = element_rect(fill = NA, color = "black"),
    panel.ontop = TRUE,
    legend.background = element_rect(fill = "transparent"),
    legend.position = "right",
    legend.title = element_text(angle = 90, size = 8, face = "bold"),
    legend.title.align = 0,
    legend.key.size = unit(1, "mm"),
    legend.spacing.x = unit(0.2, "cm"),
    legend.spacing.y = unit(0.1, "cm"),
    legend.key = element_rect(fill = "transparent"),
    legend.key.height = unit(0.4, "cm"),
    legend.text = element_text(margin = ggplot2::margin(r = 5, unit = "mm"), size = 7)
  )

# Specify fundamental plot dimensions
width.inch <- 9 / 2.54 # cm / 1 inch
height.inch <- 10 / 2.54
pointsize <- 8
x11(
  width = width.inch,
  height = height.inch,
  pointsize = pointsize
)

ggdraw(IsoW06.O18)

The variable abbreviations shown in the plot have the following meanings:

wmSrc_nocfblh: boundary layer height
Type_main: main water type
wmSnk_lon: longitude at sampling site
month: month of sampling
wmSnk_nocftp: total precipitation at sampling site
RFZraster: rainfall zone raster
type: water type
wmSrc_Prc.frac: precipitation fraction of mean precipitation per month
Artificial_clr: artificial land cover
wmSrc_sd.PS: surface pressure difference

For predicting the $\delta^{18}O$ ratio, the boundary layer height, main water type seems to be the most important variable, among main water type, longitude at sampling site and the sampling month. There are just small differences in the importance scores due to a different number of trees: ntree= 500 vs. ntree = 10000 as applied by the authors following the suggestion of Behnamian et al. 2017.

Citation

The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us via mail by soga[at]zedat.fu-berlin.de.

You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follow: Hartmann, K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.