Descriptive analysis of a compositional data set is based on the principle of working in coordinates and of working with log ratios. The most important functions for descriptive statistics in R are exemplified with a subcomposition of the data set of the geochemistry of a sediment core recovered from the eastern Juyanze palaeolake in north-western China (Hartmann and Wünnemann, 2009). We load the data set and prepare if for the analysis.
# load data set
g36 <- read.csv("http://userpages.fu-berlin.de/soga/data/raw-data/G36chemical.txt",
sep='\t',
row.names = 'Sample')
# exclude columns of no interest
g36 <- g36[, -c(1:5, ncol(g36)-1, ncol(g36))]
# rename columns for better readability
colnames(g36) <- gsub(".mg.g", "", colnames(g36))
# define a subcomposition
library(compositions)
X = acomp(g36, c("Fe","K","Mg"))
The center or compositional mean of a data set \(\mathbf X\) with \(N\) observations and \(D\) parts is the composition
\[\bar x = \frac{1}{N}\odot \bigoplus_{n=1}^N \mathbf x_n = \text{clr}^{-1} \left(\frac{1}{N} \sum_{n=1}^N \text{clr}(\mathbf x_n)\right) = \mathscr C \left[\exp \left(\frac{1}{N}\sum_{n=1}^N\ln(\mathbf x_n)\right)\right]\text{,}\]
where \(x_n\) is the nth observation, in the nth row of \(\mathbf X\).
In R the compositional mean is computed by calling the
mean()
function on a acomp
class object.
mean(X)
## Fe K Mg
## "0.28247056" "0.05232328" "0.66520616"
## attr(,"class")
## [1] "acomp"
The compositional mean is not a robust estimator (van
den Boogaart and Tolosana-Delgado 2013). The
composition
package provides a robust improvement of the
compositional mean by computing the minimum covariance determinant (MCD) estimator.
Therefore we have to add an additional argument
robust = TRUE
to the function call
(mean(X, robust = TRUE)
).
mean(X, robust = TRUE)
## Fe K Mg
## "0.32229870" "0.05679286" "0.62090844"
## attr(,"class")
## [1] "acomp"
A global measure of spread is the metric variance, also known as total variance or generalized variance, which corresponds to the the average squared distance from the center of the data set (van den Boogaart and Tolosana-Delgado 2013).
\[\text{mvar}(\mathbf X) = \frac{1}{N-1}\sum_{n=1}^N d^2_A(\mathbf x_n, \overline{\mathbf x})\]
In R the computation of this variance type is provided by the
compositions
package by applying the mvar()
function on a acomp
class object:
mvar(X)
## [1] 0.2507803
Van den Boogaart and Tolosana-Delgado 2013 define a metric standard deviation, to support quantitative interpretation:
\[\text{msd}(X) = \sqrt{\frac{1}{D-1} \text{mvar}(X)}\]
This quantity may be interpreted as the radial standard deviation on a log scale, if the variation was the same in all directions, or as some sort of average spread.
In R this equation is implemented in the msd()
function
of the compositions
package.
msd(X)
## [1] 0.3541047
The variation of a composition among its components cannot be described by raw correlations or covariances. Instead Aitchison (1986) recommends the variation matrix. This matrix has \(D^2\) components, each defined as
\[\tau_{ij} = \text{var}\left(\ln\frac{x_i}{x_j}\right) \]
and estimated by
\[\tau_{ij}=\frac{1}{N-1}\sum_{n=1}^N\ln^2 \frac{x_{ni}}{x_{nj}} - \ln^2\frac{\bar x_i}{\bar x_j}\text{.}\]
In R this variation matrix is provided by the
compositions
package by applying the
variation()
function on a acomp
class
object:
variation(X)
## Fe K Mg
## Fe 0.00000000 0.06907321 0.3827192
## K 0.06907321 0.00000000 0.3005484
## Mg 0.38271916 0.30054839 0.0000000
The smaller a variation element is, the better the proportionality between the two components. To help in its interpretation, one may consider the transformation
\[\rho_{ij} = \exp(-\tau_{ij}^2/2)\] and interpret it as a correlation coefficient (Van den Boogaart and Tolosana-Delgado 2013).
# interpreted like the correlation coefficient
exp(-variation(X)^2/2)
## Fe K Mg
## Fe 1.0000000 0.9976173 0.9293806
## K 0.9976173 1.0000000 0.9558401
## Mg 0.9293806 0.9558401 1.0000000
In order to analyse a compositional data set in a more classical way, as a matrix of covariances, the data set must be clr-transformed. The resulting clr-variance matrix is the variance-covariance matrix of the clr-transformed data set.
The covariance estimates \(\hat{\Sigma}_{ji}\) are given by
\[\hat{\Sigma}_{ji}=\text{cov}\left(\text{clr}_i(\mathbf
x),\text{clr}_j(\mathbf x)\right) = \frac{1}{N}\sum_{n=1}^N
\ln\frac{x_{ni}}{g(\mathbf x_n)}\cdot\ln\frac{x_{ni}}{g(\mathbf
x_n)}-\ln\frac{\bar x_i}{g(\bar{\mathbf x})}\cdot\ln\frac{\bar
x_j}{g(\bar{\mathbf x})}\] In R the covariance estimates may be
computed by applying the var()
function on a
acomp
class object.
var(X)
## Fe K Mg
## Fe 0.06700404 0.01877231 -0.08577634
## K 0.01877231 0.03961378 -0.05838609
## Mg -0.08577634 -0.05838609 0.14416243
In many statistical analysis it is useful to center the data by removing the mean. In compositional data analysis, we can achieve this task by perturbing the composition by the inverse of its center:
\[\mathbf X^* = \mathbf X \ominus \bar{\mathbf x}\]
Similarly, we may scale (standardize) a compositional data set, by using perturbation and powering. A data set is centered by perturbation with the inverse of the center and scaled by powering with the inverse of the square root of the metric variance:
\[\mathbf Z = \frac{1}{\sqrt{\text{mvar}(\mathbf X)}} \odot (\mathbf X \ominus \bar{\mathbf x})\]
In R we may explicitly write
(X - mean(X)) / sqrt(mvar(X))
or we make use of the handy
scale()
function provided by the compositions
package.
library(ggtern)
grobs = list()
data = list(X,
X-mean(X),
scale(X,center=TRUE,scale=TRUE))
titles <- c('Original data',
'Centered data',
'Standardized data')
for (p in 1:3){
grobs[[p]] <- ggtern(data = as.data.frame(data[[p]]), aes(Fe, K, Mg)) +
geom_point(alpha = 0.3, size = 2, color = "black") +
theme_rgbw() + ggtitle(titles[p]) +
theme(plot.title = element_text(hjust = 0.5))
}
ggtern::grid.arrange(grobs=grobs, nrow=1)
In order to represent the center and dispersion structure of
compositional data graphically, we may draw two ellipses. One
corresponds to a p-confidence region of the mean. This
confidence region is defined by its radius r
according to a
Fisher \(F-\)distribution of \((D - 1)\) and \(N-D+1\) degrees of freedom:
\[r = \sqrt{\frac{D-1}{N-D+1}\cdot F_p(D-1, N-D+1)}\]
This is valid if the data follows a normal distribution for compositions or else if the number of observations \(N\) is large. In this last case, a compositional central limit theorem applies.
The other ellipsis corresponds to a p-probability region, defined by its radius given by the \(\chi^2\) distribution of \((D-1)\) degrees of freedom:
\[r = \sqrt{\chi^2_{\alpha}(D-1)}\]
We show this graphical procedure on a subcomposition of the data set of the geochemistry of a sediment core recovered from the eastern Juyanze palaeolake in north-western China (Hartmann and Wünnemann, 2009). We load the data set and prepare if for the analysis.
### DATA PREPARATION ###
# load data set
g36 <- read.csv("http://userpages.fu-berlin.de/soga/data/raw-data/G36chemical.txt",
sep='\t',
row.names = 'Sample')
# exclude columns of no interest
g36 <- g36[, -c(1:5, ncol(g36)-1, ncol(g36))]
# rename columns for better readability
colnames(g36) <- gsub(".mg.g", "", colnames(g36))
# define a subcomposition
X = acomp(g36, c("Fe","K","Mg"))
### COMPUTATION OF MEAN, VARIANCE AND RADII ###
# center the subcomposition
X = X - mean(X, robust = TRUE)
# get the mean:
mn = mean(X, robust = TRUE)
# compute variance and radii:
vr = var(X)
df1 = ncol(X)-1 # degree of freedom
df2 = nrow(X)-ncol(X)+1 # degree of freedom
rconf = sqrt(qf(p = 0.95, df1, df2)*df1/df2 )
rprob = sqrt(qchisq(p = 0.95, df = df1))
### PLOTTING ###
par(mar=c(0.5,0,0,0))
# plot data points:
plot.acomp(X, pch=16, cex = 0.6) # data points
plot.acomp(mn, pch = 3, cex = 3, col = 'red', add = TRUE) # center
# plot the ellipses:
ellipses(mean = mn, var = vr, r = rconf, col = "red", lwd = 2) # confidence region for the mean
ellipses(mean = mn, var = vr, r = rprob, lty = 2) # probability region for the data
legend('topleft',
legend = c('data points',
'center',
'confidence region (95%)',
'probability region (95%)'),
pch = c(16,3,NA,NA),
lty = c(NA,NA,1,2),
col = c('black','red','red','black'),
cex = 0.9)
summary()
function for a acomp
objectAs for many other object classes of R there exists as well a
summary()
function for a acomp
object. The
summary()
function returns exhaustive amount of
information, including descriptive statistics, for a particular
acomp
class object. Refer to the documentation for detailed
information on the function output
(help(summary.acomp)
)
summary(X)
## $mean
## Fe K Mg
## "0.3054736" "0.3211148" "0.3734116"
## attr(,"class")
## [1] "acomp"
##
## $mean.ratio
## Fe K Mg
## Fe 1.000000 0.9512909 0.8180615
## K 1.051203 1.0000000 0.8599488
## Mg 1.222402 1.1628600 1.0000000
##
## $variation
## Fe K Mg
## Fe 0.00000000 0.06907321 0.3827192
## K 0.06907321 0.00000000 0.3005484
## Mg 0.38271916 0.30054839 0.0000000
##
## $expsd
## Fe K Mg
## Fe 1.000000 1.300590 1.856407
## K 1.300590 1.000000 1.730176
## Mg 1.856407 1.730176 1.000000
##
## $invexpsd
## Fe K Mg
## Fe 1.0000000 0.768882 0.5386749
## K 0.7688820 1.000000 0.5779760
## Mg 0.5386749 0.577976 1.0000000
##
## $min
## Fe K Mg
## Fe 1.0000000 0.3144301 0.03582797
## K 0.5887497 1.0000000 0.11394575
## Mg 0.4380004 0.3883320 1.00000000
##
## $q1
## Fe K Mg
## Fe 1.0000000 0.7783108 0.6486084
## K 0.8677156 1.0000000 0.6785914
## Mg 0.8357302 0.8550891 1.0000000
##
## $med
## Fe K Mg
## Fe 1.00000 0.9683262 0.9604579
## K 1.03271 1.0000000 0.9420606
## Mg 1.04117 1.0615028 1.0000000
##
## $q3
## Fe K Mg
## Fe 1.000000 1.152451 1.196558
## K 1.284834 1.000000 1.169469
## Mg 1.541762 1.473641 1.000000
##
## $max
## Fe K Mg
## Fe 1.000000 1.698515 2.283103
## K 3.180357 1.000000 2.575116
## Mg 27.911152 8.776106 1.000000
##
## $missingness
## missingType
## variable NMV BDL MAR MNAR SZ Err
## Fe 177 0 0 0 0 0
## K 177 0 0 0 0 0
## Mg 177 0 0 0 0 0
##
## attr(,"class")
## [1] "summary.acomp"
Citation
The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us via mail by soga[at]zedat.fu-berlin.de.
Please cite as follow: Hartmann, K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.