Following their suggestions, we walk through the EMMA algorithm and
perform concurrently the calculations in R using the machinery of the
EMMAgeo
package. We perform an end-member analysis on the
df.soil.samples
data set provided here. The data set consists of 45 rows,
corresponding to soil samples and 99 columns, corresponding to a
particular grain size class given on the Krumbein phi \((\phi)\) scale.
\[\phi = -\log_2 D/D_0\text{,} \] where \(D\) is the diameter of the particle or grain in millimeters, and \(D_0\) is a reference diameter, equal to \(1\) mm.
library(EMMAgeo)
# load data
load(url("http://userpage.fu-berlin.de/~soga/300/30100_data_sets/data_emma.RData"))
phi <- colnames(df.soil.samples)
X <- as.matrix(df.soil.samples) # convert data.frame object to matrix object
str(X)
## num [1:45, 1:99] 3.04e-04 2.27e-04 7.75e-05 3.80e-04 3.79e-04 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:45] "1" "2" "3" "4" ...
## ..$ : chr [1:99] "0" "0.1" "0.2" "0.31" ...
Let us plot the data:
In order to start with EMMA using the EMMAgeo
package
the data set has to fulfill certain criteria:
NA
-values.We can use R to check our data set with respect to these constraints:
The is.numeric()
function returns TRUE
if
the class is double or integer.
is.numeric(X)
## [1] TRUE
The is.na()
function indicates which elements are
missing. If we apply the function on a vector or a matrix, the function
checks each element of the vector/matrix and returns TRUE
if the element is a missing value or FALSE
if not. As
TRUE
evaluates to a numeric value of \(1\) and FALSE
to \(0\), we may combine the sum()
function and the is.na()
function to check if our matrix is
free of NA
-values.
sum(is.na(X))
## [1] 0
The function call returns 0, thus we conclude that our data is free
of NA
-values.
In order to check if the data contains any zero-only columns we make
use of the apply()
function. We set MARGIN = 2
to indicate that we want to apply the sum()
function on the
columns. Finally we sum the results for each particular column.
sum(apply(X, MARGIN = 2, FUN = sum) == 0)
## [1] 0
The function call returns 0, thus we conclude that our data does not contain any zero-only columns.
Similar to the approaches from above we use the apply()
function to check if the variables sum to constant value. However this
time, we set MARGIN = 1
to indicate that we want to apply
the sum()
function on the rows.
apply(X, MARGIN = 1, FUN = sum)
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
We examine the final constrained of our list, the value for the weight transformation limits \((l)\) in the next section.
Recall that for a stable EMMA we have to minimize the effects of
scale. Therefore a column-wise weight transformation is applied by
scaling the columns based on percentiles with lower \((l)\) and upper \((100−l)\) boundaries as weights. The EMMA
algorithm becomes unstable if the weight transformation limits are too
high. The EMMAgeo
package provides the
test.l()
function, which performs the weight transformation
of the data matrix with different weight limits to check if valid
results are yielded. It returns the maximum value for which the
transformation remains stable.
l.seq <- seq(from = 0, to = 0.5, by = 0.01)
l <- test.l(X, l = l.seq)
l
## $step
## [1] 28
##
## $l.max
## [1] 0.27
The function returns a list of objects: step
holds the
position of last valid value and l.max
contains the last
valid value of \(l\). We assign the
value of l.max
to a new variable l.max
for
further usage.
l.max <-l$l.max
l.max
## [1] 0.27
In general the number of end-member is unknown. The
EMMAgeo
package provides the test.parameters()
function, which evaluates all possible combinations of number of
end-members, \(q\), and weight
transformation limits, \(l\). In
addition the function provides a graphical output for several error
metrics. Here we use plot = "mRt"
to indicate that we are
interested in the mean relative total error (mRt). Look up the
documentations for further specifications and additional options (type
?test.parameters
in your console).
seq.q <- seq(from = 2, to = 10, by = 1) # set sequence of end-members
seq.l <- seq(from = 0, to = l.max, by= 0.02) # sequence of weight transformations
params <- test.parameters(X,
q = seq.q,
l = seq.l,
rotation = "Varimax",
plot = "mRt", # "mRt" (mean relative total error),
# "mEm" (mean absolute row-wise error),
# "mEn" (mean absolute column-wise error),
# "mRm" (mean relative row-wise error),
# "mRn" (mean relative column-wise error),
# "ol" (number of overlapping end-members).
legend = 'bottomright',
cex = 0.6)
We can request the optimal number of end-members for each specified weight transformation limit.
num.end.members <- params$q.max
num.end.members
## [1] 4 4 4 4 4 4 4 4 4 4 4 4 4 4
We realize that the maximal number of end-members is 4.
Recall that one single model result is just a scenario with a given likelihood, and that minimum and maximum end-member numbers of models with different weight transformations span the uncertainty range in EMMA. In the next section we explore how to determine the optimal combination of weight transformation (\(l\)) and number of end-members.
Before we continue, let us recall our last three steps. So far we (1)
checked the data, and bracketed (2) the weight transformation limits and
(3) the numbers of end-members. The EMMAgeo
ships with a
handy function called check.data()
, which checks the input
data matrix for consistency.
check.data(X = X, num.end.members, seq.l, c = 1)
## [1] "Data matrix passed test... OK"
## [2] "End-member vector passed test... OK"
## [3] "Weight transformation limit vector passed test... OK"
## [4] "Scaling parameter passed test... OK"
## [5] "NA-test passed... OK"
## [6] "Test for zero-only values passed... OK"
## [7] "Maximum weight transformation limit value passed test... OK"
## [8] "All samples sum up to constant sum... OK"
As several model scenarios are equally likely, we are looking for
robust end-members, which persist throughout many scenarios. The
EMMAgeo
package provides a suite of functions to find
robust end-members. One of them is the test.robustness()
function. The function tests all possible combinations of number of
end-members and weight transformation limits by iteratively performing
EMMA.
We provide to the function a series of end-members and a series of
weight transformation limits. By setting plot = TRUE
the
function returns a nice plot with robust end-member loadings and the
corresponding modes. These plots allow us to visually figure out the
number of robust end-members.
xlabs <- c(expression(paste("Grain size (", phi, ")",
sep = "")),
expression(paste("Grain size (", phi, ")",
sep = "")))
xlabs <- c("Grain-size class",
"Grain-size class")
seq.l <- seq(from = 0, to = l.max, by = 0.02)
robust.params <- test.robustness(X = X,
q = unique(num.end.members), # end-members
l = seq.l, # weight transformation limits
plot = TRUE,
colour = c('blue', 'green'),
xlab = xlabs)
In order to better distinguish between the modes we plot the
histogram from above again and call axis()
in order to
format the labels on the x-axis.
hist(robust.params$modes,
breaks = ncol(X),
main = "Mode positions",
xlab = "Class",
xaxt = 'n')
#axis(side = 1, at = 0:max(as.numeric(phi)))
axis(side = 1)
In a next step we apply the robust.loadings()
function,
which extracts those end-member loadings whose modes fall into specified
limits. Based on the plot above we may define these limits by visual
examination.
limits = cbind(c(10, 22, 40, 50), # lower limit
c(20, 30, 50, 60)) # upper limit
We run the robust.loadings()
function in order to
extract robust end-members (given by specified mode limits).
library(RColorBrewer)
rEM <- robust.loadings(em = robust.params, # samples
limits = limits,
Vqn = robust.params$Vqn, # normalised factor loadings
plot = TRUE,
legend = "topright",
cex = 0.85,
colour = brewer.pal(length(limits)/2, 'Set1'),
median = TRUE)
Vqn.mean <- rEM$Vqn$mean # robust end-members mean
Vqn.sd <- rEM$Vqn$sd # robust end-members standart deviation
In many cases the robust end-members alone are not able to explain
the entire data variance. To account for the unexplained variance in the
model we may calculate the residual end-member by
applying the residual.EM()
function.
Vqn.res <- residual.EM(Vqn.mean) # residual end-member
Plot the explained variability by the robust end-members, including the residual end-member.
plot(NA,
main = 'End-member loadings',
xlim = range(as.numeric(phi)),
ylim = range(Vqn.mean),
xlab = expression(paste("Grain size (", phi, ")", sep = "")),
ylab = 'Loadings [Vol.-%]')
# plot end-members
colors <- brewer.pal(nrow(Vqn.mean), 'Set1')
for (i in 1:nrow(Vqn.mean)){
lines(as.numeric(phi),
Vqn.mean[i,],
col = colors[i])}
# plot residual end-member
lines(as.numeric(phi),
Vqn.res, col = 'grey', lwd=2)
legend('right',legend = 'Residual', col = 'grey', lwd=2)
The plot shows, that most of the variability in the data is accounted for by the end-members. However, there are specific ranges within the grain-size class spectrum, which are less well explained by the end-members.
Robust EMMA requires one “optimal” weight transformation limit value,
in order to propagate the uncertainty associated with the loadings also
to the scores. Thus, we optimize the model with respect to different
quality criteria bychanging l
to yield an optimal EM
solution. We can use the function get.l.opt()
. The
criterium to optimize is given by the keyword quality
and
is set to "mRt"
(mean total explained variance) by default.
In case more than one weight transformation limit yield an optimal
model, it is useful to take the smallest one, i.e., the first one.
l_opt <- get.l.opt(X = X,
l = seq.l,
quality = "mRt",
Vqn = Vqn.mean,
plot = TRUE)[1]
When all parameters are known and optimized we finally may run the
EMMA()
function:
library(RColorBrewer)
EM <- EMMA(X = X, #samples
q = nrow(Vqn.mean), # number of end-members
l = l_opt, # optimal weight transformation limit
Vqn = Vqn.mean, # robust end-member loadings
plot = TRUE,
EM.ID = c("EM 1", "EM 2", "EM 3", "EM 4"),
classunits = as.numeric(phi),
xlab = c(expression(paste("Class [", phi, "]")), "Sample ID"),
cex = 0.7,
col = brewer.pal(nrow(Vqn.mean), "Set1"))
We may evaluate the end-member model by plotting the normalized
end-member loading matrix and compare it to the natural end-members.
Please note that in real life applications the natural end-members are
in general not known. In this example, however, the natural end-members
are provided in the list
object EMa
.
cols <- brewer.pal(nrow(Vqn.mean), "Set1")
## Modeled data ##
plot(phi, EM$loadings[1,],
type='l',
ylab = 'Volume percentage',
col = cols[1],
ylim = c(0,10), lwd=2)
lines(phi, EM$loadings[2,], col = cols[2], lwd = 2)
lines(phi, EM$loadings[3,], col = cols[3], lwd = 2)
lines(phi, EM$loadings[4,], col = cols[4], lwd = 2)
## natural end-members ##
lines(phi, EMa[[1]]*100, col = "grey", lwd = 2, lty = 2)
lines(phi, EMa[[2]]*100, col = "grey", lwd = 2, lty = 2)
lines(phi, EMa[[3]]*100, col = "grey", lwd = 2, lty = 2)
lines(phi, EMa[[4]]*100, col = "grey", lwd = 2, lty = 2)
legend('topright',
legend = c('EM 1','EM 2','EM 3','EM 4','natural end-members'),
lwd = 2,
lty = c(rep(1,4), 2),
col = c(cols, 'grey'),
cex = 0.85)
Let us examine the variance explained by the scores of our model.
EM$Mqs.var / 100
## [1] 0.2854860 0.2343972 0.2818878 0.1982290
The EMMAgeo
package provides a function to estimate
model uncertainty. The robust.scores()
function estimates
the influence of different end-member loadings on end-member scores by
running a Monte Carlo simulation.
Vqn.mean <- Vqn.mean # robust end-members mean
Vqn.sd <- Vqn.sd # robust end-members standart deviation
q <- max(num.end.members)
M <- robust.scores(loadings = rEM,
l = l_opt,
mc_n = 200,
plot = TRUE)
For visual inspection we plot the modeled end-member score means.
## Plot line-point graph with score means
par(mfrow = c(1,4), oma = c(0, 0, 2, 0))
cols <- brewer.pal(q, "Set1")
for(i in 1:q) {
xlims <- c(floor(min(M$mean[,i])),
ceiling(max(M$mean[,i])))
plot(M$mean[,i],
1:nrow(X),
col = cols[i],
lwd = 2,
type = 'l',
ylab = 'sampel no.',
xlab = 'scores',
ylim = rev(c(1, nrow(X))),
xlim = c(xlims[1], xlims[2]),
main = paste("EM ", i))
points(M$mean[,i], 1:nrow(X), col = cols[i])
}
mtext('Mean end-member scores', outer = TRUE, cex = 1.3)
Citation
The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us via mail by soga[at]zedat.fu-berlin.de.
Please cite as follow: Hartmann, K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.