In order to showcase principal component analysis (PCA) for compositional data we use a data set of the geochemistry of a sediment core recovered from the eastern Juyanze palaeolake in north-western China (Hartmann and Wünnemann, 2009). We load the data set and prepare if for the analysis.
We will use the CoDa R-package compositions provided by K.G.v.d.Boogaart and R. Tolosana-Delgado.

library(compositions)

# load data set
g36 <- read.csv("https://userpage.fu-berlin.de/soga/data/raw-data/G36chemical.txt",
                sep = "\t",
                row.names = "Sample")

# exclude columns of no interest 
library(dplyr)
cols.to.drop <- c('Depth.min',
                  'Depth.max', 
                  'TC', 'TIC', 'TOC', 
                  'Soil_type', 'primST')
                  
g36 <- select(g36, -one_of(cols.to.drop))

# rename columns for better readability
colnames(g36) <- gsub('.mg.g', '', colnames(g36))

The data set consists of 177 samples (rows) and 11 variables (columns).

str(g36)
## 'data.frame':    177 obs. of  11 variables:
##  $ Ba : num  0.178 0.172 0.204 0.167 0.122 0.181 0.22 0.101 0.255 0.114 ...
##  $ Ca : num  61.3 55.2 96.4 53 36 ...
##  $ Fe : num  15.2 15.2 14 16.1 10.9 ...
##  $ K  : num  3.59 2.79 3.9 4.23 2.46 ...
##  $ Mg : num  77.3 31.5 59 55.9 97.5 ...
##  $ Mn : num  0.326 0.324 0.317 0.356 0.218 0.45 0.575 0.221 0.501 0.222 ...
##  $ Na : num  18.78 9.84 18.33 14.19 16.34 ...
##  $ PO4: num  0.865 0.785 0.976 1.002 0.667 ...
##  $ S  : num  24.6 15.4 18.2 14 22.7 ...
##  $ Sr : num  0.485 0.446 0.74 0.352 0.324 0.496 0.507 0.081 0.782 0.068 ...
##  $ Cl : num  27.6 14.9 27.4 24.7 26.1 ...

The elements Ba, Ca, Mg, Sr, Fe, Mn, K, Na, S and PO4 were measured by an inductively coupled atomic emission spectrometer (ICP-OES Aqua regia-digestion), and Chloride (Cl) was measured after Mohr on extracted fluid by titration. All elements are given in mg/g (see Hartmann and Wünnemann, 2009).


In the first step we conduct a PCA, available through the generic function princomp(). This function returns a princomp object, containing the full result of a PCA on the covariance matrix of the clr-transformed data set.

x <- acomp(g36)
pcx <- princomp(x)

In the next step we generate a scree plot, a representation of the squared diagonal elements of the diagonal matrix \(\mathbf D\). The scree plot visualizes the proportion of preserved variation for each principal component.

plot(pcx, type = "screeplot", main = 'Scree plot')

We may calculate the proportion of preserved variation for the first two principal components in R:

sum(pcx$sdev[1:2]^2)/sum(pcx$sdev^2)
## [1] 0.8139164

The computation indicates that the first three principal components account for 81% of the total variance; quite a good value.


In the next step we plot a compositional covariance biplot. Here, we make use of the optional argument xlabs to get a biplot where the data values are just represented by dots, instead of labeled with a number (the default in R).

dots = rep(".", times = nrow(x))
biplot(pcx, 
       xlabs = dots, 
       cex = 0.9)

There are 4 “rays-bundles” showing high correlation for Na and S (post-sedimentary thenardite precipitation), Ca and Sr (carbonate precipitation in a fresh water lake), Mg and Cl (possible salinity indicator) as well as for the remaining elements (all indicators for detrital components)indicating a 2D feature space, because pairwise two are bundles are negative correlated:
1. PC: detrital components vs. post-sedimentary precipitation 2. PC: lower vs. higher salinity of the former desert lake.
Thus, obeying the compositional constraints leads directly to an interpretable biplot derived from a simple PCA on clr-transformed compositions.
But, such compositional approach reveals another big advantage for multivariate data analyses:

The biplot is quite useful to spot low-variance sub-compositions, such as Na-S, Cl-Mg, Ca-Sr, and PO4-Fe-Mn-K-Ba. By comparing the metric variances of these sub-compositions with the total metric variance it becomes evident that the parts in each one of these groups have very low variances; thus they are quite proportional.

var.Na.S <- mvar(acomp(x[,c("Na","S")]))
var.Na.S/mvar(x)*100 # variance in percent
## [1] 0.8549035
var.Cl.Mg <- mvar(acomp(x[,c("Cl","Mg")]))
var.Cl.Mg/mvar(x)*100 # variance in percent
## [1] 6.167523
var.Ca.Sr <- mvar(acomp(x[,c("Ca","Sr")]))
var.Ca.Sr/mvar(x)*100 # variance in percent
## [1] 5.212784
var.PO4.Fe.Mn.K.Ba <- mvar(acomp(x[,c("PO4","Fe", "Mn", "K","Ba")]))
var.PO4.Fe.Mn.K.Ba/mvar(x)*100 # variance in percent
## [1] 4.397015

The parts in each one of these groups are quite proportional and have very low variances, respectively, 0.9, 6.2, 5.2, and 4.4% of the total variance.


The biplot is also useful to identify possible one-dimensional patterns. This means that we can find a projection of the selected variables onto a particular direction that is quasi-constant. In other words: their points in the biplot are colinear! Such sub-compositions are for instance, Sr-Ca-Fe, Na-Fe-PO4, Na-Mg-Sr, or Sr-Na-Ca.

In order to inspect visually if these sub-compositions are quasi-constant we apply the generic plot() function on an acomp object and add the additional arguments center = TRUE and pca = TRUE to the function call. In the ternary diagrams an one-dimensional pattern is shown as a line.

sub.comp <- list(acomp(x[, c('Sr', 'Ca', 'Fe')]),
                 acomp(x[, c('Na', 'Fe', 'PO4')]),
                 acomp(x[, c('Na', 'Mg', 'Sr')]),
                 acomp(x[, c('Sr', 'Na', 'Ca')]))

for (i in 1:length(sub.comp)){
  plot(sub.comp[[i]], 
     pca = TRUE, 
     center = TRUE,
     col = "gray",
     pch = 16,
     col.pca = "red")}

The biplot is obviously very useful for the exploration of the co-dependence structures of a compositional data set. Note that if we hope to find these relations by looking at all possible ternary diagrams, we would have to explore 990 plots (the different combinations of 11 elements taken 3 at a time).


Finally. let us interpret the loadings of the first two principal components, which together account for 81% of the total variance. The clr-loadings (the matrix \(\mathbf V\)) may be accessed by using the function loadings() on a princomp object.

loadings(pcx)[,1:2]
##          Comp.1       Comp.2
## Ba   0.26100501 -0.032325124
## Ca  -0.05843839 -0.370958758
## Fe   0.33132887  0.028543815
## K    0.17962710 -0.038539736
## Mg  -0.02103733  0.209503654
## Mn   0.30626986 -0.006698458
## Na  -0.46517745  0.311755502
## PO4  0.34353929  0.081493528
## S   -0.47876394  0.305361805
## Sr  -0.35603002 -0.743513290
## Cl  -0.04232299  0.255377063

The sum of the clr-coefficients should be zero.

colSums(loadings(pcx))
##        Comp.1        Comp.2        Comp.3        Comp.4        Comp.5 
##  6.661338e-16  1.006140e-15  4.336809e-15 -2.400857e-15  8.562595e-15 
##        Comp.6        Comp.7        Comp.8        Comp.9       Comp.10 
##  1.755540e-15  5.120904e-15  1.498801e-15 -2.340142e-14 -3.739717e-14

These coefficients are not free to vary in the eleven-dimensional feature space, but are linked to each another through the null sum. Hence, we cannot interpret a low loading value as an absence of dependence! For instance, in this example, we cannot conclude that the first PC is not affecting or depending on Mg, because of a very low loading \((-0.021)\). However, we can state that the PO4 vs. Mg log-ratio increases along the first PC with a slope of \(0.344 - (-0.021) = 0.365\). Or we may also state that the first PC is not affecting the ratio Cl/Ca, because their loadings difference \((-0.042 - (-0.058) = 0.016)\) is almost zero, whereas along the second PC, their log-ratio increases with a slope of \(0.255 - (-0.371) = 0.626\).

A graphical representation of these slopes can be obtained with the plot(..., type="loadings") function. The resulting plot shows the loadings scaled by the corresponding singular values. It shows that the bars of higher order PCs tend to be more similar to \(1\), and thus that the last PCs are less able to modify the composition compared to the first PCs.

library(RColorBrewer)
plot(pcx, 
     type = "loadings",
     main = "", 
     xlim = c(0,18), 
     las = 2,
     col = brewer.pal(ncol(x), 'Paired'))
# add helper lines
for (i in 1:10){
  lines(x = c(-1,13), y = c(i,i), lty=2)  
}



Citation

The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us via mail by soga[at]zedat.fu-berlin.de.

Creative Commons License
You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follow: Hartmann, K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.