In order to showcase principal component analysis (PCA) for compositional data we use a data set of the geochemistry of a sediment core recovered from the eastern Juyanze palaeolake in north-western China (Hartmann and Wünnemann, 2009). We load the data set and prepare if for the analysis.
We will use the CoDa R-package compositions provided by K.G.v.d.Boogaart and R. Tolosana-Delgado.

library(compositions)

# load data set
g36 <- read.csv("https://userpage.fu-berlin.de/soga/data/raw-data/G36chemical.txt",
                sep = "\t",
                row.names = "Sample")

# exclude columns of no interest 
library(dplyr)
cols.to.drop <- c('Depth.min',
                  'Depth.max', 
                  'TC', 'TIC', 'TOC', 
                  'Soil_type', 'primST')
                  
g36 <- select(g36, -one_of(cols.to.drop))

# rename columns for better readability
colnames(g36) <- gsub('.mg.g', '', colnames(g36))

The data set consists of 177 samples (rows) and 11 variables (columns).

str(g36)
## 'data.frame':    177 obs. of  11 variables:
##  $ Ba : num  0.178 0.172 0.204 0.167 0.122 0.181 0.22 0.101 0.255 0.114 ...
##  $ Ca : num  61.3 55.2 96.4 53 36 ...
##  $ Fe : num  15.2 15.2 14 16.1 10.9 ...
##  $ K  : num  3.59 2.79 3.9 4.23 2.46 ...
##  $ Mg : num  77.3 31.5 59 55.9 97.5 ...
##  $ Mn : num  0.326 0.324 0.317 0.356 0.218 0.45 0.575 0.221 0.501 0.222 ...
##  $ Na : num  18.78 9.84 18.33 14.19 16.34 ...
##  $ PO4: num  0.865 0.785 0.976 1.002 0.667 ...
##  $ S  : num  24.6 15.4 18.2 14 22.7 ...
##  $ Sr : num  0.485 0.446 0.74 0.352 0.324 0.496 0.507 0.081 0.782 0.068 ...
##  $ Cl : num  27.6 14.9 27.4 24.7 26.1 ...

The elements Ba, Ca, Mg, Sr, Fe, Mn, K, Na, S and PO4 were measured by an inductively coupled atomic emission spectrometer (ICP-OES Aqua regia-digestion), and Chloride (Cl) was measured after Mohr on extracted fluid by titration. All elements are given in mg/g (see Hartmann and Wünnemann, 2009).


In the first step we conduct a PCA, available through the generic function princomp(). This function returns a princomp object, containing the full result of a PCA on the covariance matrix of the clr-transformed data set.

x <- acomp(g36)
pcx <- princomp(x)

In the next step we generate a scree plot, a representation of the squared diagonal elements of the diagonal matrix \(\mathbf D\). The scree plot visualizes the proportion of preserved variation for each principal component.

plot(pcx, type = "screeplot", main = 'Scree plot')

We may calculate the proportion of preserved variation for the first two principal components in R:

sum(pcx$sdev[1:2]^2)/sum(pcx$sdev^2)
## [1] 0.8139164

The computation indicates that the first three principal components account for 81% of the total variance; quite a good value.


In the next step we plot a compositional covariance biplot. Here, we make use of the optional argument xlabs to get a biplot where the data values are just represented by dots, instead of labeled with a number (the default in R).

dots = rep(".", times = nrow(x))
biplot(pcx, 
       xlabs = dots, 
       cex = 0.9)

There are 4 “rays-bundles” showing high correlation for Na and S (post-sedimentary thenardite precipitation), Ca and Sr (carbonate precipitation in a fresh water lake), Mg and Cl (possible salinity indicator) as well as for the remaining elements (all indicators for detrital components)indicating a 2D feature space, because pairwise two are bundles are negative correlated:
1. PC: detrital components vs. post-sedimentary precipitation 2. PC: lower vs. higher salinity of the former desert lake.
Thus, obeying the compositional constraints leads directly to an interpretable biplot derived from a simple PCA on clr-transformed compositions.
But, such compositional approach reveals another big advantage for multivariate data analyses:

The biplot is quite useful to spot low-variance sub-compositions, such as Na-S, Cl-Mg, Ca-Sr, and PO4-Fe-Mn-K-Ba. By comparing the metric variances of these sub-compositions with the total metric variance it becomes evident that the parts in each one of these groups have very low variances; thus they are quite proportional.

var.Na.S <- mvar(acomp(x[,c("Na","S")]))
var.Na.S/mvar(x)*100 # variance in percent
## [1] 0.8549035
var.Cl.Mg <- mvar(acomp(x[,c("Cl","Mg")]))
var.Cl.Mg/mvar(x)*100 # variance in percent
## [1] 6.167523
var.Ca.Sr <- mvar(acomp(x[,c("Ca","Sr")]))
var.Ca.Sr/mvar(x)*100 # variance in percent
## [1] 5.212784
var.PO4.Fe.Mn.K.Ba <- mvar(acomp(x[,c("PO4","Fe", "Mn", "K","Ba")]))
var.PO4.Fe.Mn.K.Ba/mvar(x)*100 # variance in percent
## [1] 4.397015

The parts in each one of these groups are quite proportional and have very low variances, respectively, 0.9, 6.2, 5.2, and 4.4% of the total variance.


The biplot is also useful to identify possible one-dimensional patterns. This means that we can find a projection of the selected variables onto a particular direction that is quasi-constant. In other words: their points in the biplot are colinear! Such sub-compositions are for instance, Sr-Ca-Fe, Na-Fe-PO4, Na-Mg-Sr, or Sr-Na-Ca.

In order to inspect visually if these sub-compositions are quasi-constant we apply the generic plot() function on an acomp object and add the additional arguments center = TRUE and pca = TRUE to the function call. In the ternary diagrams an one-dimensional pattern is shown as a line.

sub.comp <- list(acomp(x[, c('Sr', 'Ca', 'Fe')]),
                 acomp(x[, c('Na', 'Fe', 'PO4')]),
                 acomp(x[, c('Na', 'Mg', 'Sr')]),
                 acomp(x[, c('Sr', 'Na', 'Ca')]))

for (i in 1:length(sub.comp)){
  plot(sub.comp[[i]], 
     pca = TRUE, 
     center = TRUE,
     col = "gray",
     pch = 16,
     col.pca = "red")}