In this section we walk through the EMMA algorithm and carry out the corresponding calculations in R. For this purpose we use the df.soil.samples data set provided here. This synthetic toy data set was created by randomly mixing four natural end-members \((q=4)\), each corresponding to a particular transport regime (alluvial sediments, dune sediments, loess deposits and overbank sediments).

The data set consists of 45 rows, corresponding to the soil samples, and 99 columns. Each column corresponds to a particular grain size class given on the Krumbein phi \((\phi)\) scale:

\[\phi = -\log_2 \left( D/D_0 \right)\text{,} \] where \(D\) is the diameter of the particle or grain in millimeters, and \(D_0\) is a reference diameter, equal to \(1\) mm. Let us look at the structure of the data set (reduced to the first 20 samples and grain size classes):

# load the data
load(url("http://www.userpage.fu-berlin.de/~soga/300/30100_data_sets/data_emma.RData"))
str(df.soil.samples[1:20, 1:20])
## 'data.frame':    20 obs. of  20 variables:
##  $ 0   : num  3.04e-04 2.27e-04 7.75e-05 3.80e-04 3.79e-04 ...
##  $ 0.1 : num  3.36e-04 2.48e-04 9.03e-05 4.17e-04 4.16e-04 ...
##  $ 0.2 : num  0.000375 0.000271 0.000111 0.000462 0.000458 ...
##  $ 0.31: num  0.000525 0.000297 0.000151 0.00052 0.000564 ...
##  $ 0.41: num  0.00071 0.000326 0.000234 0.000879 0.000687 ...
##  $ 0.51: num  0.001038 0.000361 0.000406 0.001238 0.000887 ...
##  $ 0.61: num  0.001605 0.000407 0.000745 0.001821 0.001218 ...
##  $ 0.71: num  0.002546 0.000473 0.00138 0.002727 0.001753 ...
##  $ 0.82: num  0.004021 0.000574 0.002499 0.004064 0.002584 ...
##  $ 0.92: num  0.006204 0.000734 0.00434 0.00593 0.003813 ...
##  $ 1.02: num  0.009236 0.000993 0.007166 0.008382 0.005532 ...
##  $ 1.12: num  0.01318 0.00141 0.0112 0.01142 0.0078 ...
##  $ 1.22: num  0.01797 0.00205 0.01655 0.01495 0.01063 ...
##  $ 1.33: num  0.0234 0.00301 0.0231 0.01881 0.01393 ...
##  $ 1.43: num  0.02908 0.00439 0.03048 0.02276 0.01757 ...
##  $ 1.53: num  0.03453 0.00629 0.03807 0.02652 0.02132 ...
##  $ 1.63: num  0.03922 0.00878 0.04511 0.02983 0.02493 ...
##  $ 1.73: num  0.0427 0.0119 0.0508 0.0324 0.0282 ...
##  $ 1.84: num  0.0446 0.0155 0.0547 0.0342 0.0308 ...
##  $ 1.94: num  0.0449 0.0195 0.0565 0.0349 0.0327 ...

Let us plot the first out of 45 samples:

phi <- as.numeric(colnames(df.soil.samples)) # grain size classes on the phi scale
plot(phi, unlist(df.soil.samples[1, ]), type = "l", ylab = "")

We see a multimodal grain size distribution ranging from \(\phi = 0\) to \(\phi = 9.9\), which corresponds to grain sizes from \(1000 \; \mu \text{m}\) (coarse sand) down to roughly \(1 \; \mu \text{m}\) (clay). If we plot our 45 samples on top of the 4 natural end-members, we see that each particular soil sample is a mixture of the 4 natural end-members. The goal of EMMA is to unmix the natural end-members and thus recover the underlying transport regimes.
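To relate the \(\phi\) values to metric grain sizes we can invert the definition above, \(D = D_0 \cdot 2^{-\phi}\). A minimal sketch (the helper function phi2micron is ours, purely for illustration):

phi2micron <- function(phi) 1000 * 2^(-phi) # D0 = 1 mm = 1000 micrometers
phi2micron(c(0, 9.9)) # approximately 1000 and 1 micrometers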


Step 1 - Constant sum constraint

Transform the raw grain-size distributions to a constant sum, e.g. \(1\). The second argument of the apply function can either be 1 (row-wise application of the function sum) or 2 (column-wise application of the function sum); here we need the row sums.

X <- as.matrix(df.soil.samples) # convert data.frame object to matrix object
c <- 1 # set constant sum, can be 1, or 100 (%), or any other meaningful constant
X <- X / apply(X, 1, sum) * c
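As a quick check (not part of the original workflow), we can verify that every row now sums to the chosen constant:

summary(apply(X, 1, sum)) # all values should equal c = 1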

Step 2 - Rescaling and standardization

Here we apply a column-wise weight transformation:

\[\mathbf W = \frac{\mathbf X - \mathbf h}{\mathbf g - \mathbf h}\text{,}\]

where vectors \(\mathbf h\) and \(\mathbf g\) are defined by \(\mathbf h_j=P_l(\mathbf x_j)\) and \(\mathbf g_j=P_{100-l}(\mathbf x_j)\) for columns \(j = 1,2,..., p\).

For the sake of simplicity we set \(l = 0.05\) (lw in the code below, i.e. the 5th and 95th percentiles), however, please note that in real-life applications the optimal value \(l_{\text{opt}}\) is found by iteration.

lw <- 0.05 # set percentile
qts <- function(X, lw) quantile(X, c(lw, 1 - lw), type = 5) # define function
ls <- t(apply(X, 2, qts, lw = lw)) # apply function column-wise
W <- t((t(X) - ls[, 1]) / (ls[, 2] - ls[, 1])) # calculate weight transformation
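A small sanity check (our addition): for every grain-size class, the lw and 1 - lw quantiles of the weight-transformed data should now equal 0 and 1, respectively:

range(apply(W, 2, quantile, probs = lw, type = 5)) # should be close to 0
range(apply(W, 2, quantile, probs = 1 - lw, type = 5)) # should be close to 1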

Step 3 - Calculate eigenvector and eigenvalue matrices

The eigenvector matrix \(\mathbf V\) and eigenvalue matrix \(\mathbf \Lambda\) are extracted from the minor product matrix \(\Gamma\) given by \(\Gamma=\mathbf W^T \mathbf W\).

## Eigen space extraction
A <- t(W) %*% W # calculate minor product
EIG <- eigen(A) # calculate eigenvectors and eigenvalues

Vf <- EIG$vectors # eigenvector matrix (columns ordered by decreasing eigenvalue)

L <- EIG$values # eigenvalues, in decreasing order
Lv <- cumsum(L / sum(L)) # cumulative proportion of variance (normalized eigenvalues)
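Before moving on it can be instructive (our addition, not part of the original listing) to inspect how much of the variance of the weighted data is captured by the leading eigenvectors:

round(head(Lv, 6), 3) # cumulative variance proportion of the first eigenvectors
plot(head(Lv, 10), type = "b",
  xlab = "Number of eigenvectors",
  ylab = "Cumulative proportion of variance"
)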

Step 4 - Factor rotation

Factor rotation is applied in order to simplify the structure of the end-members and thus facilitate factor interpretation. Since we know the number of end-members, we set \(q=4\). Note, however, that in real-life applications the number of end-members is unknown and \(q\) needs to be determined by iteration.

q <- 4
Vr <- do.call(varimax, list(Vf[, 1:q])) # apply varimax function
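One simple heuristic for a first guess of \(q\) (our assumption, not the iterative procedure referred to above) is to take the smallest number of eigenvectors whose cumulative eigenvalue proportion exceeds a chosen threshold:

q.threshold <- 0.95 # hypothetical threshold, for illustration only
which(Lv >= q.threshold)[1] # smallest q capturing at least 95 % of the variance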

Step 5 - Estimate scores using linear nonnegative least squares

Normalize the eigenvector loadings \((\mathbf V)\) to ensure the nonnegativity of the rotated eigenvectors and estimate the eigenvector scores \((\mathbf M)\) by linear nonnegative least squares, using the function nnls from the R package limSolve.

library(limSolve)
# calculate the end-member loadings
Vq <- Vr$loadings[, order(seq(q, 1, -1))] # extract and sort factor loadings
Vqr <- t(t(Vq) / apply(Vq, 2, sum)) * c # rescale
Vqr <- t(Vqr)
Vqn <- t((Vqr - apply(Vqr, 1, min)) / # normalize factor loadings column-wise
  (apply(Vqr, 1, max) - apply(Vqr, 1, min)))

# calculate the end-member scores matrix by non-negative least square fitting
Mq <- matrix(nrow = nrow(X), ncol = q)
for (i in 1:nrow(X)) {
  Mq[i, ] <- nnls(Vqn, as.vector(t(W[i, ])))$X
}

This matrix contains the relative contributions of each end-member to each sample. Usually, scores can be interpreted as time series, depth series or spatial distribution patterns of end-member abundance.
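For a quick visual impression of the scores (a sketch of ours, not part of the original workflow), they can be plotted against the sample index:

matplot(Mq, type = "l", lty = 1,
  xlab = "Sample index", ylab = "Score",
  main = "End-member scores per sample"
)
legend("topright", legend = paste("EM", 1:q), lty = 1, col = 1:q, cex = 0.8)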


Step 6 - Rescale matrices and compute variance explained

Rescale the matrices to the original units of the initial data set and compute the variance explained by each end-member as the proportion of the total scores variance.

s <- (c - sum(ls[, 1])) / apply(Vqn * unname(ls[, 2] - ls[, 1]), 2, sum)
Vqs <- Vqn
for (i in 1:q) {
  Vqs[, i] <- t(s[i] * t(Vqn[, i]) * (ls[, 2] - ls[, 1]) + ls[, 1])
}
Vqsn <- t(t(Vqs) / apply(Vqs, 2, sum)) * c # rescale loading matrix
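As a quick consistency check (our addition), each column of the rescaled loading matrix should again sum to the constant c:

round(apply(Vqsn, 2, sum), 6) # should all equal c = 1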

Let us plot the rescaled loading matrix and compare the modeled end-members to the natural end-members. Recall that the end-member loadings relate to a characteristic grain-size distribution and thus are a proxy for the associated process regime.

library(RColorBrewer) # provides brewer.pal()

par(mfrow = c(1, 1))
cols <- brewer.pal(q, "Set2")

## Modeled data ##
plot(phi, Vqsn[, 1],
  type = "l",
  ylab = "Volume proportion",
  main = "Modeled loadings",
  col = cols[1], ylim = c(0, 0.10), lwd = 2
)
lines(phi, Vqsn[, 2], col = cols[2], lwd = 2)
lines(phi, Vqsn[, 3], col = cols[3], lwd = 2)
lines(phi, Vqsn[, 4], col = cols[4], lwd = 2)

## natural end-members (EMa, loaded together with df.soil.samples) ##
lines(phi, EMa[[1]], col = "grey", lwd = 2)
lines(phi, EMa[[2]], col = "grey", lwd = 2)
lines(phi, EMa[[3]], col = "grey", lwd = 2)
lines(phi, EMa[[4]], col = "grey", lwd = 2)

legend("topright",
  legend = c("EM 1", "EM 2", "EM 3", "EM 4", "natural end members"),
  lwd = 2,
  col = c(cols, "grey"),
  cex = 0.85
)

The model approximates the natural end-members quite well. However, please note that we visualized just one single model parameter configuration, which represents just one scenario with a given likelihood. In real-life applications the optimal model parameters and robust end-members are derived iteratively from an ensemble of different end-member models.

Now let us examine the variance explained by the scores of our model. Recall that scores are the relative contributions of the loadings to a sample and thus relate to the predominance of a process during the formation of the sedimentary deposit.

Mqs <- t(t(Mq) / s) / apply(t(t(Mq) / s), 1, sum) # rescale factor scores
Mqs.var <- diag(var(Mqs)) / sum(diag(var(Mqs))) # variance explained by scores
Mqs.var
## [1] 0.2457144 0.2918080 0.2703787 0.1920989

It looks like the four end-members explain roughly similar shares of the variance, between about 19 % and 29 %.
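To visualize these proportions (a small sketch of ours, not part of the original workflow):

barplot(Mqs.var * 100,
  names.arg = paste("EM", 1:q),
  ylab = "Explained variance [%]"
)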


Step 7 - Evaluate goodness of model fit

Calculate the modeled data set \(\mathbf X^*\) and the corresponding error matrix \(\mathbf E = \mathbf X - \mathbf X^*\). Evaluate the goodness of fit by calculating the mean row- and column-wise linear coefficients of determination \((r^2)\) between \(\mathbf X\) and \(\mathbf X^*\).

# calculate the modeled data set
Xm <- Mq %*% t(Vqn)

# row-wise mean of the error matrix
En <- as.vector(apply(X - Xm, 1, mean))
# row-wise linear coefficients of determinations
Rn <- diag(cor(t(X), t(Xm))^2)

# column-wise mean of the error matrix
Ep <- as.vector(apply(X - Xm, 2, mean))
# column-wise linear coefficients of determinations
Rp <- diag(cor(X, Xm)^2)
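The mean row- and column-wise coefficients of determination referred to above can be summarized directly (our small addition to the listing):

mean(Rn) # mean row-wise r^2
mean(Rp) # mean column-wise r^2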

par(mfrow = c(2, 1), mar = c(4, 6, 4, 4))
plot(Rn,
  ylab = "Explained\nvariance",
  xlab = "Observations",
  main = "Row-wise model performance"
)
plot(phi,
  Rp,
  ylab = "Explained\nvariance",
  main = "Column-wise model performance"
)

That was it. To recapitulate: we first looked at what a mixed data set looks like, then took an under-the-hood view of what exactly the end-member modeling analysis algorithm does, and finally explored some measures of the model results: loadings, scores, explained variance and goodness-of-fit criteria.


Citation

The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us via e-mail at soga[at]zedat.fu-berlin.de.

You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follows: Hartmann, K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.