Following their suggestions, we walk through the EMMA algorithm and perform concurrently the calculations in R using the machinery of the EMMAgeo package. We perform an end-member analysis on the df.soil.samples data set provided here. The data set consists of 45 rows, corresponding to soil samples and 99 columns, corresponding to a particular grain size class given on the Krumbein phi \((\phi)\) scale.

\[\phi = -\log_2 D/D_0\text{,} \] where \(D\) is the diameter of the particle or grain in millimeters, and \(D_0\) is a reference diameter, equal to \(1\) mm.

library(EMMAgeo)

# load data
load(url("http://userpage.fu-berlin.de/~soga/300/30100_data_sets/data_emma.RData"))

phi <- colnames(df.soil.samples)
X <- as.matrix(df.soil.samples) # convert data.frame object to matrix object
str(X)
##  num [1:45, 1:99] 3.04e-04 2.27e-04 7.75e-05 3.80e-04 3.79e-04 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:45] "1" "2" "3" "4" ...
##   ..$ : chr [1:99] "0" "0.1" "0.2" "0.31" ...

Let us plot the data:


1. Check the data

In order to start with EMMA using the EMMAgeo package the data set has to fulfill certain criteria:

We can use R to check our data set with respect to these constraints:

The is.numeric() function returns TRUE if the class is double or integer.

is.numeric(X)
## [1] TRUE

The is.na() function indicates which elements are missing. If we apply the function on a vector or a matrix, the function checks each element of the vector/matrix and returns TRUE if the element is a missing value or FALSE if not. As TRUE evaluates to a numeric value of \(1\) and FALSE to \(0\), we may combine the sum() function and the is.na() function to check if our matrix is free of NA-values.

sum(is.na(X))
## [1] 0

The function call returns 0, thus we conclude that our data is free of NA-values.

In order to check if the data contains any zero-only columns we make use of the apply() function. We set MARGIN = 2 to indicate that we want to apply the sum() function on the columns. Finally we sum the results for each particular column.

sum(apply(X, MARGIN = 2, FUN = sum) == 0)
## [1] 0

The function call returns 0, thus we conclude that our data does not contain any zero-only columns.

Similar to the approaches from above we use the apply() function to check if the variables sum to constant value. However this time, we set MARGIN = 1 to indicate that we want to apply the sum() function on the rows.

apply(X, MARGIN = 1, FUN = sum)
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 
##  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1 
## 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 
##  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1

We examine the final constrained of our list, the value for the weight transformation limits \((l)\) in the next section.


2. Determine the range of weight transformation limits

Recall that for a stable EMMA we have to minimize the effects of scale. Therefore a column-wise weight transformation is applied by scaling the columns based on percentiles with lower \((l)\) and upper \((100−l)\) boundaries as weights. The EMMA algorithm becomes unstable if the weight transformation limits are too high. The EMMAgeo package provides the test.l() function, which performs the weight transformation of the data matrix with different weight limits to check if valid results are yielded. It returns the maximum value for which the transformation remains stable.

l.seq <- seq(from = 0, to = 0.5, by = 0.01)
l <- test.l(X, l = l.seq)
l
## $step
## [1] 28
## 
## $l.max
## [1] 0.27

The function returns a list of objects: step holds the position of last valid value and l.max contains the last valid value of \(l\). We assign the value of l.max to a new variable l.max for further usage.

l.max <-l$l.max
l.max
## [1] 0.27

3. Define the range of numbers of end-members

In general the number of end-member is unknown. The EMMAgeo package provides the test.parameters() function, which evaluates all possible combinations of number of end-members, \(q\), and weight transformation limits, \(l\). In addition the function provides a graphical output for several error metrics. Here we use plot = "mRt" to indicate that we are interested in the mean relative total error (mRt). Look up the documentations for further specifications and additional options (type ?test.parameters in your console).

seq.q <- seq(from = 2, to = 10, by = 1) # set sequence of end-members
seq.l <- seq(from = 0, to = l.max, by= 0.02) # sequence of weight transformations
params <- test.parameters(X, 
                          q = seq.q, 
                          l = seq.l, 
                          rotation = "Varimax",
                          plot = "mRt", # "mRt" (mean relative total error), 
                                        # "mEm" (mean absolute row-wise error), 
                                        # "mEn" (mean absolute column-wise error), 
                                        # "mRm" (mean relative row-wise error), 
                                        # "mRn" (mean relative column-wise error), 
                                        # "ol" (number of overlapping end-members). 
                          legend = 'bottomright', 
                          cex = 0.6)

We can request the optimal number of end-members for each specified weight transformation limit.

num.end.members <- params$q.max
num.end.members
##  [1] 4 4 4 4 4 4 4 4 4 4 4 4 4 4

We realize that the maximal number of end-members is 4.

Recall that one single model result is just a scenario with a given likelihood, and that minimum and maximum end-member numbers of models with different weight transformations span the uncertainty range in EMMA. In the next section we explore how to determine the optimal combination of weight transformation (\(l\)) and number of end-members.


Before we continue, let us recall our last three steps. So far we (1) checked the data, and bracketed (2) the weight transformation limits and (3) the numbers of end-members. The EMMAgeo ships with a handy function called check.data(), which checks the input data matrix for consistency.

check.data(X = X, num.end.members, seq.l, c = 1)
## [1] "Data matrix passed test... OK"                              
## [2] "End-member vector passed test... OK"                        
## [3] "Weight transformation limit vector passed test... OK"       
## [4] "Scaling parameter passed test... OK"                        
## [5] "NA-test passed... OK"                                       
## [6] "Test for zero-only values passed... OK"                     
## [7] "Maximum weight transformation limit value passed test... OK"
## [8] "All samples sum up to constant sum... OK"

4. Calculate and extract robust end-members

As several model scenarios are equally likely, we are looking for robust end-members, which persist throughout many scenarios. The EMMAgeo package provides a suite of functions to find robust end-members. One of them is the test.robustness() function. The function tests all possible combinations of number of end-members and weight transformation limits by iteratively performing EMMA.

We provide to the function a series of end-members and a series of weight transformation limits. By setting plot = TRUE the function returns a nice plot with robust end-member loadings and the corresponding modes. These plots allow us to visually figure out the number of robust end-members.

xlabs <- c(expression(paste("Grain size (", phi, ")",
                            sep = "")),
           expression(paste("Grain size (", phi, ")",
                            sep = "")))
xlabs <- c("Grain-size class",
           "Grain-size class")

seq.l <- seq(from = 0, to = l.max, by = 0.02)
robust.params <- test.robustness(X = X,
                                 q = unique(num.end.members), # end-members
                                 l = seq.l, # weight transformation limits
                                 plot = TRUE,
                                 colour = c('blue', 'green'), 
                                 xlab = xlabs)

In order to better distinguish between the modes we plot the histogram from above again and call axis() in order to format the labels on the x-axis.

hist(robust.params$modes,
     breaks = ncol(X),
     main = "Mode positions",
     xlab = "Class", 
     xaxt = 'n')
#axis(side = 1, at = 0:max(as.numeric(phi)))
axis(side = 1)

In a next step we apply the robust.loadings() function, which extracts those end-member loadings whose modes fall into specified limits. Based on the plot above we may define these limits by visual examination.

limits = cbind(c(10, 22, 40, 50), # lower limit
               c(20, 30, 50, 60)) # upper limit

We run the robust.loadings() function in order to extract robust end-members (given by specified mode limits).

library(RColorBrewer)
rEM <- robust.loadings(em = robust.params,  # samples 
                 limits = limits,
                 Vqn = robust.params$Vqn, # normalised factor loadings
                 plot = TRUE,
                 legend = "topright",
                 cex = 0.85,
                 colour = brewer.pal(length(limits)/2, 'Set1'),
                 median = TRUE)

Vqn.mean <- rEM$Vqn$mean # robust end-members mean
Vqn.sd <- rEM$Vqn$sd # robust end-members standart deviation

In many cases the robust end-members alone are not able to explain the entire data variance. To account for the unexplained variance in the model we may calculate the residual end-member by applying the residual.EM() function.

Vqn.res <- residual.EM(Vqn.mean) # residual end-member

Plot the explained variability by the robust end-members, including the residual end-member.

plot(NA,
     main = 'End-member loadings',
     xlim = range(as.numeric(phi)),
     ylim = range(Vqn.mean),
     xlab = expression(paste("Grain size (", phi, ")", sep = "")),
     ylab = 'Loadings [Vol.-%]')
# plot end-members

colors <- brewer.pal(nrow(Vqn.mean), 'Set1')

for (i in 1:nrow(Vqn.mean)){
  lines(as.numeric(phi),
        Vqn.mean[i,],
        col = colors[i])}

# plot residual end-member
lines(as.numeric(phi), 
  Vqn.res, col = 'grey', lwd=2)
legend('right',legend = 'Residual', col = 'grey', lwd=2)

The plot shows, that most of the variability in the data is accounted for by the end-members. However, there are specific ranges within the grain-size class spectrum, which are less well explained by the end-members.

5 Find the optimal weight transformation limit

Robust EMMA requires one “optimal” weight transformation limit value, in order to propagate the uncertainty associated with the loadings also to the scores. Thus, we optimize the model with respect to different quality criteria bychanging l to yield an optimal EM solution. We can use the function get.l.opt(). The criterium to optimize is given by the keyword quality and is set to "mRt" (mean total explained variance) by default. In case more than one weight transformation limit yield an optimal model, it is useful to take the smallest one, i.e., the first one.

l_opt <- get.l.opt(X = X, 
                   l = seq.l, 
                   quality = "mRt", 
                   Vqn = Vqn.mean, 
                   plot = TRUE)[1]


6 Run EMMA with optimal parameters

When all parameters are known and optimized we finally may run the EMMA() function:

library(RColorBrewer)
EM <- EMMA(X = X, #samples
           q = nrow(Vqn.mean),  # number of end-members
           l = l_opt, # optimal weight transformation limit
           Vqn = Vqn.mean, # robust end-member loadings
           plot = TRUE,
           EM.ID = c("EM 1", "EM 2", "EM 3", "EM 4"),
           classunits = as.numeric(phi),
           xlab = c(expression(paste("Class [", phi, "]")), "Sample ID"),
           cex = 0.7,
           col = brewer.pal(nrow(Vqn.mean), "Set1"))


7. Evaluate model quality

We may evaluate the end-member model by plotting the normalized end-member loading matrix and compare it to the natural end-members. Please note that in real life applications the natural end-members are in general not known. In this example, however, the natural end-members are provided in the list object EMa.

cols <- brewer.pal(nrow(Vqn.mean), "Set1")
## Modeled data ##
plot(phi, EM$loadings[1,], 
     type='l', 
     ylab = 'Volume percentage',
     col = cols[1],
     ylim = c(0,10), lwd=2)
lines(phi, EM$loadings[2,], col = cols[2], lwd = 2)
lines(phi, EM$loadings[3,], col = cols[3], lwd = 2)
lines(phi, EM$loadings[4,], col = cols[4], lwd = 2)

## natural end-members ##
lines(phi, EMa[[1]]*100, col = "grey", lwd = 2, lty = 2)
lines(phi, EMa[[2]]*100, col = "grey", lwd = 2, lty = 2)
lines(phi, EMa[[3]]*100, col = "grey", lwd = 2, lty = 2)
lines(phi, EMa[[4]]*100, col = "grey", lwd = 2, lty = 2)

legend('topright',
       legend = c('EM 1','EM 2','EM 3','EM 4','natural end-members'),
       lwd = 2, 
       lty = c(rep(1,4), 2),
       col = c(cols, 'grey'), 
       cex = 0.85)

Let us examine the variance explained by the scores of our model.

EM$Mqs.var / 100
## [1] 0.2854860 0.2343972 0.2818878 0.1982290

8 Estimate model uncertainty

The EMMAgeo package provides a function to estimate model uncertainty. The robust.scores() function estimates the influence of different end-member loadings on end-member scores by running a Monte Carlo simulation.

Vqn.mean <- Vqn.mean # robust end-members mean
Vqn.sd <- Vqn.sd # robust end-members standart deviation
q <- max(num.end.members)

M <- robust.scores(loadings = rEM, 
                   l = l_opt, 
                   mc_n = 200,
                   plot = TRUE)

For visual inspection we plot the modeled end-member score means.

## Plot line-point graph with score means
par(mfrow = c(1,4), oma = c(0, 0, 2, 0))
cols <- brewer.pal(q, "Set1")

for(i in 1:q) {
  xlims <- c(floor(min(M$mean[,i])),
                ceiling(max(M$mean[,i])))
  plot(M$mean[,i], 
       1:nrow(X), 
       col = cols[i], 
       lwd = 2,
       type = 'l',
       ylab = 'sampel no.',
       xlab = 'scores',
       ylim = rev(c(1, nrow(X))),
       xlim = c(xlims[1], xlims[2]),
       main = paste("EM ", i))
  points(M$mean[,i], 1:nrow(X), col = cols[i])
}
mtext('Mean end-member scores', outer = TRUE, cex = 1.3)


Citation

The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us via mail by soga[at]zedat.fu-berlin.de.

Creative Commons License
You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follow: Hartmann, K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.