Descriptive analysis of a compositional data set is based on the principle of working in coordinates and of working with log ratios. The most important functions for descriptive statistics in R are exemplified with a subcomposition of the data set of the geochemistry of a sediment core recovered from the eastern Juyanze palaeolake in north-western China (Hartmann and WÃ¼nnemann, 2009). We load the data set and prepare if for the analysis.

```
# load data set
g36 <- read.csv("http://userpages.fu-berlin.de/soga/data/raw-data/G36chemical.txt",
sep='\t',
row.names = 'Sample')
# exclude columns of no interest
g36 <- g36[, -c(1:5, ncol(g36)-1, ncol(g36))]
# rename columns for better readability
colnames(g36) <- gsub(".mg.g", "", colnames(g36))
# define a subcomposition
library(compositions)
X = acomp(g36, c("Fe","K","Mg"))
```

The *center* or *compositional mean* of a data set
\(\mathbf X\) with \(N\) observations and \(D\) parts is the composition

\[\bar x = \frac{1}{N}\odot \bigoplus_{n=1}^N \mathbf x_n = \text{clr}^{-1} \left(\frac{1}{N} \sum_{n=1}^N \text{clr}(\mathbf x_n)\right) = \mathscr C \left[\exp \left(\frac{1}{N}\sum_{n=1}^N\ln(\mathbf x_n)\right)\right]\text{,}\]

where \(x_n\) is the
*n*^{th} observation, in the *n*^{th} row
of \(\mathbf X\).

In R the compositional mean is computed by calling the
`mean()`

function on a `acomp`

class object.

`mean(X)`

```
## Fe K Mg
## "0.28247056" "0.05232328" "0.66520616"
## attr(,"class")
## [1] "acomp"
```

The compositional mean is not a robust estimator (van
den Boogaart and Tolosana-Delgado 2013). The
`composition`

package provides a robust improvement of the
compositional mean by computing the minimum covariance determinant (MCD) estimator.
Therefore we have to add an additional argument
`robust = TRUE`

to the function call
(`mean(X, robust = TRUE)`

).

`mean(X, robust = TRUE)`

```
## Fe K Mg
## "0.32229870" "0.05679286" "0.62090844"
## attr(,"class")
## [1] "acomp"
```

A global measure of spread is the *metric variance*, also
known as *total variance* or *generalized variance*, which
corresponds to the the average squared distance from the center of the
data set (van den Boogaart and Tolosana-Delgado 2013).

\[\text{mvar}(\mathbf X) = \frac{1}{N-1}\sum_{n=1}^N d^2_A(\mathbf x_n, \overline{\mathbf x})\]

In R the computation of this variance type is provided by the
`compositions`

package by applying the `mvar()`

function on a `acomp`

class object:

`mvar(X)`

`## [1] 0.2507803`

Van den Boogaart and Tolosana-Delgado 2013 define a
*metric standard deviation*, to support quantitative
interpretation:

\[\text{msd}(X) = \sqrt{\frac{1}{D-1} \text{mvar}(X)}\]

This quantity may be interpreted as the radial standard deviation on a log scale, if the variation was the same in all directions, or as some sort of average spread.

In R this equation is implemented in the `msd()`

function
of the `compositions`

package.

`msd(X)`

`## [1] 0.3541047`

The variation of a composition among its components cannot be
described by raw correlations or covariances. Instead Aitchison (1986) recommends the **variation
matrix**. This matrix has \(D^2\) components, each defined as

\[\tau_{ij} = \text{var}\left(\ln\frac{x_i}{x_j}\right) \]

and estimated by

\[\tau_{ij}=\frac{1}{N-1}\sum_{n=1}^N\ln^2 \frac{x_{ni}}{x_{nj}} - \ln^2\frac{\bar x_i}{\bar x_j}\text{.}\]

In R this variation matrix is provided by the
`compositions`

package by applying the
`variation()`

function on a `acomp`

class
object:

`variation(X)`

```
## Fe K Mg
## Fe 0.00000000 0.06907321 0.3827192
## K 0.06907321 0.00000000 0.3005484
## Mg 0.38271916 0.30054839 0.0000000
```

The smaller a variation element is, the better the proportionality between the two components. To help in its interpretation, one may consider the transformation

\[\rho_{ij} = \exp(-\tau_{ij}^2/2)\] and interpret it as a correlation coefficient (Van den Boogaart and Tolosana-Delgado 2013).

```
# interpreted like the correlation coefficient
exp(-variation(X)^2/2)
```

```
## Fe K Mg
## Fe 1.0000000 0.9976173 0.9293806
## K 0.9976173 1.0000000 0.9558401
## Mg 0.9293806 0.9558401 1.0000000
```

In order to analyse a compositional data set in a more classical way,
as a **matrix of covariances**, the data set must be
*clr*-transformed. The resulting *clr*-variance matrix is
the **variance-covariance matrix** of the
*clr*-transformed data set.

The covariance estimates \(\hat{\Sigma}_{ji}\) are given by

\[\hat{\Sigma}_{ji}=\text{cov}\left(\text{clr}_i(\mathbf
x),\text{clr}_j(\mathbf x)\right) = \frac{1}{N}\sum_{n=1}^N
\ln\frac{x_{ni}}{g(\mathbf x_n)}\cdot\ln\frac{x_{ni}}{g(\mathbf
x_n)}-\ln\frac{\bar x_i}{g(\bar{\mathbf x})}\cdot\ln\frac{\bar
x_j}{g(\bar{\mathbf x})}\] In R the covariance estimates may be
computed by applying the `var()`

function on a
`acomp`

class object.

`var(X)`

```
## Fe K Mg
## Fe 0.06700404 0.01877231 -0.08577634
## K 0.01877231 0.03961378 -0.05838609
## Mg -0.08577634 -0.05838609 0.14416243
```

In many statistical analysis it is useful to center the data by removing the mean. In compositional data analysis, we can achieve this task by perturbing the composition by the inverse of its center:

\[\mathbf X^* = \mathbf X \ominus \bar{\mathbf x}\]

Similarly, we may scale (standardize) a compositional data set, by using perturbation and powering. A data set is centered by perturbation with the inverse of the center and scaled by powering with the inverse of the square root of the metric variance:

\[\mathbf Z = \frac{1}{\sqrt{\text{mvar}(\mathbf X)}} \odot (\mathbf X \ominus \bar{\mathbf x})\]

In R we may explicitly write
`(X - mean(X)) / sqrt(mvar(X))`

or we make use of the handy
`scale()`

function provided by the `compositions`

package.

```
library(ggtern)
grobs = list()
data = list(X,
X-mean(X),
scale(X,center=TRUE,scale=TRUE))
titles <- c('Original data',
'Centered data',
'Standardized data')
for (p in 1:3){
grobs[[p]] <- ggtern(data = as.data.frame(data[[p]]), aes(Fe, K, Mg)) +
geom_point(alpha = 0.3, size = 2, color = "black") +
theme_rgbw() + ggtitle(titles[p]) +
theme(plot.title = element_text(hjust = 0.5))
}
ggtern::grid.arrange(grobs=grobs, nrow=1)
```

In order to represent the center and dispersion structure of
compositional data graphically, we may draw two ellipses. One
corresponds to a *p*-confidence region of the mean. This
confidence region is defined by its radius `r`

according to a
Fisher \(F-\)distribution of \((D - 1)\) and \(N-D+1\) degrees of freedom:

\[r = \sqrt{\frac{D-1}{N-D+1}\cdot F_p(D-1, N-D+1)}\]

This is valid if the data follows a normal distribution for
compositions or else if the number of observations \(N\) is large. In this last case, a
**compositional central limit theorem** applies.

The other ellipsis corresponds to a *p*-probability region,
defined by its radius given by the \(\chi^2\) distribution of \((D-1)\) degrees of freedom:

\[r = \sqrt{\chi^2_{\alpha}(D-1)}\]

We show this graphical procedure on a subcomposition of the data set of the geochemistry of a sediment core recovered from the eastern Juyanze palaeolake in north-western China (Hartmann and WÃ¼nnemann, 2009). We load the data set and prepare if for the analysis.

```
### DATA PREPARATION ###
# load data set
g36 <- read.csv("http://userpages.fu-berlin.de/soga/data/raw-data/G36chemical.txt",
sep='\t',
row.names = 'Sample')
# exclude columns of no interest
g36 <- g36[, -c(1:5, ncol(g36)-1, ncol(g36))]
# rename columns for better readability
colnames(g36) <- gsub(".mg.g", "", colnames(g36))
# define a subcomposition
X = acomp(g36, c("Fe","K","Mg"))
### COMPUTATION OF MEAN, VARIANCE AND RADII ###
# center the subcomposition
X = X - mean(X, robust = TRUE)
# get the mean:
mn = mean(X, robust = TRUE)
# compute variance and radii:
vr = var(X)
df1 = ncol(X)-1 # degree of freedom
df2 = nrow(X)-ncol(X)+1 # degree of freedom
rconf = sqrt(qf(p = 0.95, df1, df2)*df1/df2 )
rprob = sqrt(qchisq(p = 0.95, df = df1))
### PLOTTING ###
par(mar=c(0.5,0,0,0))
# plot data points:
plot.acomp(X, pch=16, cex = 0.6) # data points
plot.acomp(mn, pch = 3, cex = 3, col = 'red', add = TRUE) # center
# plot the ellipses:
ellipses(mean = mn, var = vr, r = rconf, col = "red", lwd = 2) # confidence region for the mean
ellipses(mean = mn, var = vr, r = rprob, lty = 2) # probability region for the data
legend('topleft',
legend = c('data points',
'center',
'confidence region (95%)',
'probability region (95%)'),
pch = c(16,3,NA,NA),
lty = c(NA,NA,1,2),
col = c('black','red','red','black'),
cex = 0.9)
```

`summary()`

function for a `acomp`

objectAs for many other object classes of R there exists as well a
`summary()`

function for a `acomp`

object. The
`summary()`

function returns exhaustive amount of
information, including descriptive statistics, for a particular
`acomp`

class object. Refer to the documentation for detailed
information on the function output
(`help(summary.acomp)`

)

`summary(X)`

```
## $mean
## Fe K Mg
## "0.3054736" "0.3211148" "0.3734116"
## attr(,"class")
## [1] "acomp"
##
## $mean.ratio
## Fe K Mg
## Fe 1.000000 0.9512909 0.8180615
## K 1.051203 1.0000000 0.8599488
## Mg 1.222402 1.1628600 1.0000000
##
## $variation
## Fe K Mg
## Fe 0.00000000 0.06907321 0.3827192
## K 0.06907321 0.00000000 0.3005484
## Mg 0.38271916 0.30054839 0.0000000
##
## $expsd
## Fe K Mg
## Fe 1.000000 1.300590 1.856407
## K 1.300590 1.000000 1.730176
## Mg 1.856407 1.730176 1.000000
##
## $invexpsd
## Fe K Mg
## Fe 1.0000000 0.768882 0.5386749
## K 0.7688820 1.000000 0.5779760
## Mg 0.5386749 0.577976 1.0000000
##
## $min
## Fe K Mg
## Fe 1.0000000 0.3144301 0.03582797
## K 0.5887497 1.0000000 0.11394575
## Mg 0.4380004 0.3883320 1.00000000
##
## $q1
## Fe K Mg
## Fe 1.0000000 0.7783108 0.6486084
## K 0.8677156 1.0000000 0.6785914
## Mg 0.8357302 0.8550891 1.0000000
##
## $med
## Fe K Mg
## Fe 1.00000 0.9683262 0.9604579
## K 1.03271 1.0000000 0.9420606
## Mg 1.04117 1.0615028 1.0000000
##
## $q3
## Fe K Mg
## Fe 1.000000 1.152451 1.196558
## K 1.284834 1.000000 1.169469
## Mg 1.541762 1.473641 1.000000
##
## $max
## Fe K Mg
## Fe 1.000000 1.698515 2.283103
## K 3.180357 1.000000 2.575116
## Mg 27.911152 8.776106 1.000000
##
## $missingness
## missingType
## variable NMV BDL MAR MNAR SZ Err
## Fe 177 0 0 0 0 0
## K 177 0 0 0 0 0
## Mg 177 0 0 0 0 0
##
## attr(,"class")
## [1] "summary.acomp"
```

**Citation**

The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us via mail by soga[at]zedat.fu-berlin.de.

You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follow: *Hartmann,
K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis
using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.*