Geppert, Marielle; Hartmann, Kai; Kirchner, Ingo; Pfahl, Stephan; Struck, Ulrich; Riedel, Frank (2022). Precipitation Over Southern Africa: Moisture Sources and Isotopic Composition. JGR: Atmospheres, https://doi.org/10.1029/2022JD037005
Southern Africa is characterized by arid and semi-arid landscapes and is particularly susceptible to extreme weather conditions. Intriguingly, over the last 100,000 years, extensive lakes have periodically formed in the central Kalahari desert, raising questions about historical changes in atmospheric circulation and precipitation patterns.
Geppert et al. conducted a study about the annual precipitation distributions throughout Southern Africa. They focused on the analysis of stable water isotope compositions, moisture transport pathways, and moisture sources.
Stable isotopes of hydrogen and oxygen (such as \(^{2}H\) and \(^{18}O\)) in water molecules vary slightly based on their source and the environmental conditions they have been through. By analyzing these isotopes, it is possible to trace the origins of water sources and to understand the pathways of moisture transport.
Changes in stable isotope ratios in precipitation can reveal shifts in atmospheric circulation patterns and climate. For instance, when water evaporates from the ocean, water molecules containing heavier isotopes of oxygen (\(^{18}O\)) and hydrogen (\(^{2}H\) or deuterium) are more likely to remain in the ocean. This results in a higher concentration of these heavier isotopes in the ocean, which is reflected in the \(\delta^{2}H\) and \(\delta^{18}O\) ratios of the ocean water. As the evaporated water forms precipitation and moves inland, separation continues and and further changes isotopic ratios occur. Analyzing these isotopic variations helps to reconstruct past precipitation regimes and thus provides insights into historical patterns of atmospheric circulation.
The study involved collecting water samples of precipitation and different surface waters in southern Africa between 2016 and 2021. The map shows the sampling locations and the sample type, which is: ocean, spring, lake, precipitation and river. Furthermore, the data of eight Global Network for Isotopes in Precipitation (GNIP) stations has been used.
The data set and further information about the sampling process can be found here.
Let us take a closer look at the data:
url <- "https://doi.pangaea.de/10.1594/PANGAEA.944811?format=textfile"
data <- read.table(url, skip = 268, header = TRUE, sep = "\t", fileEncoding = "UTF-8", check.names = FALSE)
head(data)
## Event Sample ID Latitude Longitude Date/Time Samp type
## 1 WaterSA_SLW1 SLW1 -33.88917 18.96917 2016-08-29 River
## 2 WaterSA_SLW2 SLW2 -33.87800 19.03517 2016-08-29 River
## 3 WaterSA_SLW3 SLW3 -33.93667 19.17000 2016-08-29 River
## 4 WaterSA_SLW4 SLW4 -33.69350 19.32483 2016-08-29 River
## 5 WaterSA_SLW5 SLW5 -33.54333 19.20733 2016-08-29 River
## 6 WaterSA_SLW6 SLW6 -33.33367 19.87767 2016-08-30 Lake
## Sample comment
## 1 River at Pniel
## 2 River Berg; abundant with insect larvae; dam upstream; reservoir lake not accessible
## 3 Minor waterfall; iron rich
## 4 River; abundant with insect larvae
## 5 River Bree
## 6 Reservoir lake; under almost natural conditions (shore vegetation); animal sample taken here
## δ18O H2O [‰ SMOW] δD H2O [‰ SMOW] δ18O H2O std dev [±] δD H2O std dev [±]
## 1 -3.54 -14.50 0.09 0.64
## 2 -3.33 -13.62 0.09 0.45
## 3 -4.44 -22.33 0.04 0.59
## 4 -4.28 -22.70 0.07 0.30
## 5 -4.09 -18.99 0.04 0.34
## 6 -2.59 -18.59 0.10 0.29
The data set contains 188 samples and the following 11 variables: Event, Sample ID, Latitude, Longitude, Date/Time, Samp type, Sample comment, δ18O H2O [‰ SMOW], δD H2O [‰ SMOW], δ18O H2O std dev [±], δD H2O std dev [±]. The isotope ratios are expressed in the conventional delta notation (δ\(^{18}O\), δ\(^{2}H\)) in per mil (‰) relative to VSMOW (Vienna Standard Mean Ocean Water).
The Random Forest (RF) algorithm is applied to assess the relative
importance of various meteorological variables on the stable isotope
data. Therefore, the study uses the cforest()
function from
the R package party
. One of the main advantages of the
party
package is its ability to handle both categorical and
continuous predictor variables.
In the following, the assessment will be showcased using the (=\(\delta^{18}O\)) isotope data.
load(url("https://userpage.fu-berlin.de/soga/data/r-data/IsoW.data_untransf.fact_lo_TOPO_all2021comb_means.RData"))
We will rename the variables mw18O to O18(=\(\delta^{18} O\)) for clarity and variables mwdD (=\(\delta^{2}H\)) and d.Excess are removed to focus on the O18 isotopes. Furthermore, we remove the country of sample origin.
# change names
names(TrajIsoLC.wmean)[names(TrajIsoLC.wmean) == 'mw18O'] <- 'O18'
names(TrajIsoLC.wmean)[names(TrajIsoLC.wmean) == 'mwdD'] <- 'H2'
names(TrajIsoLC.wmean)[names(TrajIsoLC.wmean) == 'Monat'] <- 'month'
# exclude variables
TrajIsoLC.wmean <- TrajIsoLC.wmean[,-which(colnames(TrajIsoLC.wmean)=="H2")]
TrajIsoLC.wmean <- TrajIsoLC.wmean[,-which(colnames(TrajIsoLC.wmean)=="d.Excess")]
TrajIsoLC.wmean <- TrajIsoLC.wmean[,-which(colnames(TrajIsoLC.wmean)=="Land.Ocean")]
TrajIsoLC.wmean <- TrajIsoLC.wmean[,-which(colnames(TrajIsoLC.wmean)=="Africa")]
TrajIsoLC.wmean <- TrajIsoLC.wmean[,-which(colnames(TrajIsoLC.wmean)=="Oceans")]
TrajIsoLC.wmean <- TrajIsoLC.wmean[,-which(colnames(TrajIsoLC.wmean)=="ISO")]
Then the data is subset, including only data where the explanatory fraction is greater than \(0.6\):
IsoW <- TrajIsoLC.wmean
IsoW.06 <- subset(IsoW, IsoW$expl.frac > 0.6)
We start with a performance test on our data for the RF algorithm by
creating an initial RF model using the randomForest()
function. It predicts \(\delta^{18}O\)
based on all other variables in IsoW.06 with 2000 trees.
# for reproduciblity
my.seed <- 196
set.seed(my.seed)
# initial RF model
library(randomForest)
model1 <- randomForest(
formula = O18 ~ .,
ntree = 2000,
data = IsoW.06
)
model1
##
## Call:
## randomForest(formula = O18 ~ ., data = IsoW.06, ntree = 2000)
## Type of random forest: regression
## Number of trees: 2000
## No. of variables tried at each split: 18
##
## Mean of squared residuals: 5.413177
## % Var explained: 18.56
In each split of a tree, 18 variables were randomly chosen to determine the best split. About 18.56% of the variability in the \(\delta^{18}O\) values can be explained by the model.
By visualizing the model performance, we can observe how error decreases with the number of trees.
plot(model1)
As shown in the figure, the mean squared residuals stabilize around 500 trees. Thus, in the following step we use ´ntree=500´and optimize the tree depth.
Now, we will tune the parameter that determines the number of
variables that are randomly sampled as candidates at each split
(mtry
). The numbers of trees and variables are crucial for
the model performance. The authors applied a range of mtry
from 1 to 52 while ntree is fixed at 10000 after Behnamian et
al. 2017 (a large number of trees ensures stable variable
importance). This grid represents different combinations of
hyper-parameters to be tested. In our example we used the above
identified threshold of ntrees=500
and
mtry=2:54
.
set.seed(my.seed)
ntree <- 500
# hyper parameter grid
hyper_grid <- expand.grid(
mtry = seq(1, 54, by = 2), # mtry-max has to be smaller than max number of variables!
ntree = seq(ntree, ntree, by = ntree / 10),
OOB_RMSE = 0
)
# run models
library(party)
library(caret)
set.seed(my.seed)
for(i in 1:nrow(hyper_grid)) {
model <- party::cforest(formula = O18 ~ .,
data = IsoW.06,
controls = cforest_unbiased(ntree = hyper_grid$ntree[i],
mtry = hyper_grid$mtry[i]))
error <- caret:::cforestStats(model)
# add OOB error to grid
hyper_grid$OOB_RMSE[i] <- error[1]
}
library(magrittr)
# top 10 performing models
hyper_grid %>%
dplyr::arrange(OOB_RMSE) %>%
head(10)
## mtry ntree OOB_RMSE
## 1 23 500 2.306998
## 2 29 500 2.310671
## 3 51 500 2.315662
## 4 41 500 2.315870
## 5 35 500 2.325206
## 6 37 500 2.325341
## 7 53 500 2.328874
## 8 45 500 2.329887
## 9 33 500 2.332528
## 10 13 500 2.333812
For each parameter combination, the Out-of-Bag (OOB) error is calculated. The best model with 500 trees has an mtry value of 23.
Now, the final RF model with optimized parameters can be created. The
varimp()
function is used to calculate the importance of
each variable in the model. This tells us which variables are most
predictive of O18.
IsoW06.O18.seed <- my.seed
IsoW06.O18.seed
## [1] 196
my.ntree <- 500
set.seed(IsoW06.O18.seed)
IsoW06.O18.mtry <- hyper_grid$mtry[hyper_grid$OOB_RMSE == min(hyper_grid$OOB_RMSE)]
# default RF model
cf <- party::cforest(formula = O18 ~ .,
data = IsoW.06,
controls = cforest_unbiased(ntree = my.ntree,
mtry = IsoW06.O18.mtry))
#error <- caret:::cforestStats(cf)
stats <- caret:::cforestStats(cf)
library(tibble)
IsoW06.O18.var.imp.all <- party::varimp(cf, conditional = TRUE) %>%
sort(decreasing=TRUE) %>%
enframe() %>%
dplyr::top_n(54)
IsoW06.O18.var.imp.10 <- IsoW06.O18.var.imp.all %>%
dplyr::top_n(10)
library(cowplot)
IsoW06.O18 <-
ggplot(IsoW06.O18.var.imp.10, aes(value, reorder(name, value))) +
geom_col() +
scale_x_continuous(expand = expansion(mult = c(0, 0.00001), add = c(0, 0.02))) +
labs(
title = "δ¹⁸O",
subtitle = paste("ntree = ", ntree, ", mtry =", IsoW06.O18.mtry, "; \nRMSE = ", round(error[1], digits = 2), "$R² = ", round(error[2], digits = 2))
) +
xlab("Importance score") +
theme_bw() +
theme(
axis.title = element_text(size = 9),
axis.title.y = element_blank(),
plot.title = element_text(size = 11),
plot.subtitle = element_text(size = 8),
axis.text = element_text(size = 7),
text = element_text(size = 8),
rect = element_rect(fill = "transparent"),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_rect(fill = NA, color = "black"),
panel.ontop = TRUE,
legend.background = element_rect(fill = "transparent"),
legend.position = "right",
legend.title = element_text(angle = 90, size = 8, face = "bold"),
legend.title.align = 0,
legend.key.size = unit(1, "mm"),
legend.spacing.x = unit(0.2, "cm"),
legend.spacing.y = unit(0.1, "cm"),
legend.key = element_rect(fill = "transparent"),
legend.key.height = unit(0.4, "cm"),
legend.text = element_text(margin = ggplot2::margin(r = 5, unit = "mm"), size = 7)
)
# Specify fundamental plot dimensions
width.inch <- 9 / 2.54 # cm / 1 inch
height.inch <- 10 / 2.54
pointsize <- 8
x11(
width = width.inch,
height = height.inch,
pointsize = pointsize
)
ggdraw(IsoW06.O18)
The variable abbreviations shown in the plot have the following meanings:
For predicting the \(\delta^{18}O\) ratio, the boundary layer height, main water type seems to be the most important variable, among main water type, longitude at sampling site and the sampling month. There are just small differences in the importance scores due to a different number of trees: ntree= 500 vs. ntree = 10000 as applied by the authors following the suggestion of Behnamian et al. 2017.
Citation
The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us via mail by soga[at]zedat.fu-berlin.de.
Please cite as follow: Hartmann, K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.