In order to showcase principal component analysis (PCA) for
compositional data we use a data set of the geochemistry of a sediment
core recovered from the eastern Juyanze palaeolake in north-western
China (Hartmann and Wünnemann, 2009). We load the data set
and prepare if for the analysis.
We will use the CoDa R-package compositions provided by K.G.v.d.Boogaart and R.
Tolosana-Delgado.
library(compositions)
# load data set
g36 <- read.csv("https://userpage.fu-berlin.de/soga/data/raw-data/G36chemical.txt",
sep = "\t",
row.names = "Sample")
# exclude columns of no interest
library(dplyr)
cols.to.drop <- c('Depth.min',
'Depth.max',
'TC', 'TIC', 'TOC',
'Soil_type', 'primST')
g36 <- select(g36, -one_of(cols.to.drop))
# rename columns for better readability
colnames(g36) <- gsub('.mg.g', '', colnames(g36))
The data set consists of 177 samples (rows) and 11 variables (columns).
str(g36)
## 'data.frame': 177 obs. of 11 variables:
## $ Ba : num 0.178 0.172 0.204 0.167 0.122 0.181 0.22 0.101 0.255 0.114 ...
## $ Ca : num 61.3 55.2 96.4 53 36 ...
## $ Fe : num 15.2 15.2 14 16.1 10.9 ...
## $ K : num 3.59 2.79 3.9 4.23 2.46 ...
## $ Mg : num 77.3 31.5 59 55.9 97.5 ...
## $ Mn : num 0.326 0.324 0.317 0.356 0.218 0.45 0.575 0.221 0.501 0.222 ...
## $ Na : num 18.78 9.84 18.33 14.19 16.34 ...
## $ PO4: num 0.865 0.785 0.976 1.002 0.667 ...
## $ S : num 24.6 15.4 18.2 14 22.7 ...
## $ Sr : num 0.485 0.446 0.74 0.352 0.324 0.496 0.507 0.081 0.782 0.068 ...
## $ Cl : num 27.6 14.9 27.4 24.7 26.1 ...
The elements Ba, Ca, Mg, Sr, Fe, Mn, K, Na, S and PO4 were measured by an inductively coupled atomic emission spectrometer (ICP-OES Aqua regia-digestion), and Chloride (Cl) was measured after Mohr on extracted fluid by titration. All elements are given in mg/g (see Hartmann and Wünnemann, 2009).
In the first step we conduct a PCA, available
through the generic function princomp(). This function
returns a princomp object, containing the full result of a
PCA on the covariance matrix of the clr-transformed data
set.
x <- acomp(g36)
pcx <- princomp(x)
In the next step we generate a scree plot, a representation of the squared diagonal elements of the diagonal matrix \(\mathbf D\). The scree plot visualizes the proportion of preserved variation for each principal component.
plot(pcx, type = "screeplot", main = 'Scree plot')
We may calculate the proportion of preserved variation for the first two principal components in R:
sum(pcx$sdev[1:2]^2)/sum(pcx$sdev^2)
## [1] 0.8139164
The computation indicates that the first three principal components account for 81% of the total variance; quite a good value.
In the next step we plot a compositional covariance
biplot. Here, we make use of the optional argument
xlabs to get a biplot where the data values are just
represented by dots, instead of labeled with a number (the default in
R).
dots = rep(".", times = nrow(x))
biplot(pcx,
xlabs = dots,
cex = 0.9)
There are 4 “rays-bundles” showing high correlation for Na and S
(post-sedimentary thenardite precipitation), Ca and Sr (carbonate
precipitation in a fresh water lake), Mg and Cl (possible salinity
indicator) as well as for the remaining elements (all indicators for
detrital components)indicating a 2D feature space, because pairwise two
are bundles are negative correlated:
1. PC: detrital components vs. post-sedimentary precipitation 2. PC:
lower vs. higher salinity of the former desert lake.
Thus, obeying the compositional constraints leads directly to an
interpretable biplot derived from a simple PCA on clr-transformed
compositions.
But, such compositional approach reveals another big advantage for
multivariate data analyses:
The biplot is quite useful to spot low-variance sub-compositions, such as Na-S, Cl-Mg, Ca-Sr, and PO4-Fe-Mn-K-Ba. By comparing the metric variances of these sub-compositions with the total metric variance it becomes evident that the parts in each one of these groups have very low variances; thus they are quite proportional.
var.Na.S <- mvar(acomp(x[,c("Na","S")]))
var.Na.S/mvar(x)*100 # variance in percent
## [1] 0.8549035
var.Cl.Mg <- mvar(acomp(x[,c("Cl","Mg")]))
var.Cl.Mg/mvar(x)*100 # variance in percent
## [1] 6.167523
var.Ca.Sr <- mvar(acomp(x[,c("Ca","Sr")]))
var.Ca.Sr/mvar(x)*100 # variance in percent
## [1] 5.212784
var.PO4.Fe.Mn.K.Ba <- mvar(acomp(x[,c("PO4","Fe", "Mn", "K","Ba")]))
var.PO4.Fe.Mn.K.Ba/mvar(x)*100 # variance in percent
## [1] 4.397015
The parts in each one of these groups are quite proportional and have very low variances, respectively, 0.9, 6.2, 5.2, and 4.4% of the total variance.
The biplot is also useful to identify possible one-dimensional patterns. This means that we can find a projection of the selected variables onto a particular direction that is quasi-constant. In other words: their points in the biplot are colinear! Such sub-compositions are for instance, Sr-Ca-Fe, Na-Fe-PO4, Na-Mg-Sr, or Sr-Na-Ca.
In order to inspect visually if these sub-compositions are
quasi-constant we apply the generic plot() function on an
acomp object and add the additional arguments
center = TRUE and pca = TRUE to the function
call. In the ternary diagrams an one-dimensional pattern is shown as a
line.
sub.comp <- list(acomp(x[, c('Sr', 'Ca', 'Fe')]),
acomp(x[, c('Na', 'Fe', 'PO4')]),
acomp(x[, c('Na', 'Mg', 'Sr')]),
acomp(x[, c('Sr', 'Na', 'Ca')]))
for (i in 1:length(sub.comp)){
plot(sub.comp[[i]],
pca = TRUE,
center = TRUE,
col = "gray",
pch = 16,
col.pca = "red")}
The biplot is obviously very useful for the exploration of the co-dependence structures of a compositional data set. Note that if we hope to find these relations by looking at all possible ternary diagrams, we would have to explore 990 plots (the different combinations of 11 elements taken 3 at a time).
Finally. let us interpret the loadings of the first
two principal components, which together account for 81% of the total
variance. The clr-loadings (the matrix \(\mathbf V\)) may be accessed by using the
function loadings() on a princomp object.
loadings(pcx)[,1:2]
## Comp.1 Comp.2
## Ba 0.26100501 -0.032325124
## Ca -0.05843839 -0.370958758
## Fe 0.33132887 0.028543815
## K 0.17962710 -0.038539736
## Mg -0.02103733 0.209503654
## Mn 0.30626986 -0.006698458
## Na -0.46517745 0.311755502
## PO4 0.34353929 0.081493528
## S -0.47876394 0.305361805
## Sr -0.35603002 -0.743513290
## Cl -0.04232299 0.255377063
The sum of the clr-coefficients should be zero.
colSums(loadings(pcx))
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## 6.661338e-16 1.006140e-15 4.336809e-15 -2.400857e-15 8.562595e-15
## Comp.6 Comp.7 Comp.8 Comp.9 Comp.10
## 1.755540e-15 5.120904e-15 1.498801e-15 -2.340142e-14 -3.739717e-14
These coefficients are not free to vary in the eleven-dimensional feature space, but are linked to each another through the null sum. Hence, we cannot interpret a low loading value as an absence of dependence! For instance, in this example, we cannot conclude that the first PC is not affecting or depending on Mg, because of a very low loading \((-0.021)\). However, we can state that the PO4 vs. Mg log-ratio increases along the first PC with a slope of \(0.344 - (-0.021) = 0.365\). Or we may also state that the first PC is not affecting the ratio Cl/Ca, because their loadings difference \((-0.042 - (-0.058) = 0.016)\) is almost zero, whereas along the second PC, their log-ratio increases with a slope of \(0.255 - (-0.371) = 0.626\).
A graphical representation of these slopes can be obtained with the
plot(..., type="loadings") function. The resulting plot
shows the loadings scaled by the corresponding singular values. It shows
that the bars of higher order PCs tend to be more similar to \(1\), and thus that the last PCs are less
able to modify the composition compared to the first PCs.
library(RColorBrewer)
plot(pcx,
type = "loadings",
main = "",
xlim = c(0,18),
las = 2,
col = brewer.pal(ncol(x), 'Paired'))
# add helper lines
for (i in 1:10){
lines(x = c(-1,13), y = c(i,i), lty=2)
}
Citation
The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us via mail by soga[at]zedat.fu-berlin.de.
Please cite as follow: Hartmann, K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.