In order to showcase principal component analysis (PCA) for
compositional data we use a data set of the geochemistry of a sediment
core recovered from the eastern Juyanze palaeolake in north-western
China (Hartmann and Wünnemann, 2009). We load the data set
and prepare if for the analysis.
We will use the CoDa R-package compositions
provided by K.G.v.d.Boogaart and R.
Tolosana-Delgado.
library(compositions)
# load data set
g36 <- read.csv("https://userpage.fu-berlin.de/soga/data/raw-data/G36chemical.txt",
sep = "\t",
row.names = "Sample")
# exclude columns of no interest
library(dplyr)
cols.to.drop <- c('Depth.min',
'Depth.max',
'TC', 'TIC', 'TOC',
'Soil_type', 'primST')
g36 <- select(g36, -one_of(cols.to.drop))
# rename columns for better readability
colnames(g36) <- gsub('.mg.g', '', colnames(g36))
The data set consists of 177 samples (rows) and 11 variables (columns).
str(g36)
## 'data.frame': 177 obs. of 11 variables:
## $ Ba : num 0.178 0.172 0.204 0.167 0.122 0.181 0.22 0.101 0.255 0.114 ...
## $ Ca : num 61.3 55.2 96.4 53 36 ...
## $ Fe : num 15.2 15.2 14 16.1 10.9 ...
## $ K : num 3.59 2.79 3.9 4.23 2.46 ...
## $ Mg : num 77.3 31.5 59 55.9 97.5 ...
## $ Mn : num 0.326 0.324 0.317 0.356 0.218 0.45 0.575 0.221 0.501 0.222 ...
## $ Na : num 18.78 9.84 18.33 14.19 16.34 ...
## $ PO4: num 0.865 0.785 0.976 1.002 0.667 ...
## $ S : num 24.6 15.4 18.2 14 22.7 ...
## $ Sr : num 0.485 0.446 0.74 0.352 0.324 0.496 0.507 0.081 0.782 0.068 ...
## $ Cl : num 27.6 14.9 27.4 24.7 26.1 ...
The elements Ba, Ca, Mg, Sr, Fe, Mn, K, Na, S and PO4 were measured by an inductively coupled atomic emission spectrometer (ICP-OES Aqua regia-digestion), and Chloride (Cl) was measured after Mohr on extracted fluid by titration. All elements are given in mg/g (see Hartmann and Wünnemann, 2009).
In the first step we conduct a PCA, available
through the generic function princomp()
. This function
returns a princomp
object, containing the full result of a
PCA on the covariance matrix of the clr-transformed data
set.
x <- acomp(g36)
pcx <- princomp(x)
In the next step we generate a scree plot, a representation of the squared diagonal elements of the diagonal matrix \(\mathbf D\). The scree plot visualizes the proportion of preserved variation for each principal component.
plot(pcx, type = "screeplot", main = 'Scree plot')
We may calculate the proportion of preserved variation for the first two principal components in R:
sum(pcx$sdev[1:2]^2)/sum(pcx$sdev^2)
## [1] 0.8139164
The computation indicates that the first three principal components account for 81% of the total variance; quite a good value.
In the next step we plot a compositional covariance
biplot. Here, we make use of the optional argument
xlabs
to get a biplot where the data values are just
represented by dots, instead of labeled with a number (the default in
R).
dots = rep(".", times = nrow(x))
biplot(pcx,
xlabs = dots,
cex = 0.9)
There are 4 “rays-bundles” showing high correlation for Na and S
(post-sedimentary thenardite precipitation), Ca and Sr (carbonate
precipitation in a fresh water lake), Mg and Cl (possible salinity
indicator) as well as for the remaining elements (all indicators for
detrital components)indicating a 2D feature space, because pairwise two
are bundles are negative correlated:
1. PC: detrital components vs. post-sedimentary precipitation 2. PC:
lower vs. higher salinity of the former desert lake.
Thus, obeying the compositional constraints leads directly to an
interpretable biplot derived from a simple PCA on clr-transformed
compositions.
But, such compositional approach reveals another big advantage for
multivariate data analyses:
The biplot is quite useful to spot low-variance sub-compositions, such as Na-S, Cl-Mg, Ca-Sr, and PO4-Fe-Mn-K-Ba. By comparing the metric variances of these sub-compositions with the total metric variance it becomes evident that the parts in each one of these groups have very low variances; thus they are quite proportional.
var.Na.S <- mvar(acomp(x[,c("Na","S")]))
var.Na.S/mvar(x)*100 # variance in percent
## [1] 0.8549035
var.Cl.Mg <- mvar(acomp(x[,c("Cl","Mg")]))
var.Cl.Mg/mvar(x)*100 # variance in percent
## [1] 6.167523
var.Ca.Sr <- mvar(acomp(x[,c("Ca","Sr")]))
var.Ca.Sr/mvar(x)*100 # variance in percent
## [1] 5.212784
var.PO4.Fe.Mn.K.Ba <- mvar(acomp(x[,c("PO4","Fe", "Mn", "K","Ba")]))
var.PO4.Fe.Mn.K.Ba/mvar(x)*100 # variance in percent
## [1] 4.397015
The parts in each one of these groups are quite proportional and have very low variances, respectively, 0.9, 6.2, 5.2, and 4.4% of the total variance.
The biplot is also useful to identify possible one-dimensional patterns. This means that we can find a projection of the selected variables onto a particular direction that is quasi-constant. In other words: their points in the biplot are colinear! Such sub-compositions are for instance, Sr-Ca-Fe, Na-Fe-PO4, Na-Mg-Sr, or Sr-Na-Ca.
In order to inspect visually if these sub-compositions are
quasi-constant we apply the generic plot()
function on an
acomp
object and add the additional arguments
center = TRUE
and pca = TRUE
to the function
call. In the ternary diagrams an one-dimensional pattern is shown as a
line.
sub.comp <- list(acomp(x[, c('Sr', 'Ca', 'Fe')]),
acomp(x[, c('Na', 'Fe', 'PO4')]),
acomp(x[, c('Na', 'Mg', 'Sr')]),
acomp(x[, c('Sr', 'Na', 'Ca')]))
for (i in 1:length(sub.comp)){
plot(sub.comp[[i]],
pca = TRUE,
center = TRUE,
col = "gray",
pch = 16,
col.pca = "red")}