Following their suggestions, we walk through the EMMA algorithm and perform concurrently the calculations in R using the machinery of the EMMAgeo package. We perform an end-member analysis on the df.soil.samples data set provided here. The data set consists of 45 rows, corresponding to soil samples and 99 columns, corresponding to a particular grain size class given on the Krumbein phi \((\phi)\) scale.

\[\phi = -\log_2 D/D_0\text{,} \] where \(D\) is the diameter of the particle or grain in millimeters, and \(D_0\) is a reference diameter, equal to \(1\) mm.

library(EMMAgeo)

# load data
load(url("http://userpage.fu-berlin.de/~soga/300/30100_data_sets/data_emma.RData"))

phi <- colnames(df.soil.samples)
X <- as.matrix(df.soil.samples) # convert data.frame object to matrix object
str(X)
##  num [1:45, 1:99] 3.04e-04 2.27e-04 7.75e-05 3.80e-04 3.79e-04 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:45] "1" "2" "3" "4" ...
##   ..$ : chr [1:99] "0" "0.1" "0.2" "0.31" ...

Let us plot the data:


1. Check the data

In order to start with EMMA using the EMMAgeo package the data set has to fulfill certain criteria:

We can use R to check our data set with respect to these constraints:

The is.numeric() function returns TRUE if the class is double or integer.

is.numeric(X)
## [1] TRUE

The is.na() function indicates which elements are missing. If we apply the function on a vector or a matrix, the function checks each element of the vector/matrix and returns TRUE if the element is a missing value or FALSE if not. As TRUE evaluates to a numeric value of \(1\) and FALSE to \(0\), we may combine the sum() function and the is.na() function to check if our matrix is free of NA-values.

sum(is.na(X))
## [1] 0

The function call returns 0, thus we conclude that our data is free of NA-values.

In order to check if the data contains any zero-only columns we make use of the apply() function. We set MARGIN = 2 to indicate that we want to apply the sum() function on the columns. Finally we sum the results for each particular column.

sum(apply(X, MARGIN = 2, FUN = sum) == 0)
## [1] 0

The function call returns 0, thus we conclude that our data does not contain any zero-only columns.

Similar to the approaches from above we use the apply() function to check if the variables sum to constant value. However this time, we set MARGIN = 1 to indicate that we want to apply the sum() function on the rows.

apply(X, MARGIN = 1, FUN = sum)
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 
##  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1 
## 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 
##  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1

We examine the final constrained of our list, the value for the weight transformation limits \((l)\) in the next section.


2. Determine the range of weight transformation limits

Recall that for a stable EMMA we have to minimize the effects of scale. Therefore a column-wise weight transformation is applied by scaling the columns based on percentiles with lower \((l)\) and upper \((100−l)\) boundaries as weights. The EMMA algorithm becomes unstable if the weight transformation limits are too high. The EMMAgeo package provides the test.l() function, which performs the weight transformation of the data matrix with different weight limits to check if valid results are yielded. It returns the maximum value for which the transformation remains stable.

l.seq <- seq(from = 0, to = 0.5, by = 0.01)
l <- test.l(X, l = l.seq)
l
## $step
## [1] 28
## 
## $l.max
## [1] 0.27

The function returns a list of objects: step holds the position of last valid value and l.max contains the last valid value of \(l\). We assign the value of l.max to a new variable l.max for further usage.

l.max <-l$l.max
l.max
## [1] 0.27

3. Define the range of numbers of end-members

In general the number of end-member is unknown. The EMMAgeo package provides the test.parameters() function, which evaluates all possible combinations of number of end-members, \(q\), and weight transformation limits, \(l\). In addition the function provides a graphical output for several error metrics. Here we use plot = "mRt" to indicate that we are interested in the mean relative total error (mRt). Look up the documentations for further specifications and additional options (type ?test.parameters in your console).

seq.q <- seq(from = 2, to = 10, by = 1) # set sequence of end-members
seq.l <- seq(from = 0, to = l.max, by= 0.02) # sequence of weight transformations
params <- test.parameters(X, 
                          q = seq.q, 
                          l = seq.l, 
                          rotation = "Varimax",
                          plot = "mRt", # "mRt" (mean relative total error), 
                                        # "mEm" (mean absolute row-wise error), 
                                        # "mEn" (mean absolute column-wise error), 
                                        # "mRm" (mean relative row-wise error), 
                                        # "mRn" (mean relative column-wise error), 
                                        # "ol" (number of overlapping end-members). 
                          legend = 'bottomright', 
                          cex = 0.6)