Compositional data is a special type of non-negative data, which carries the relevant information in the ratios between the variables rather than in the actual data values. The individual parts of the composition are called components or parts. Each component has an amount, representing its importance within the whole. The sum over the amounts of all components is called the total amount, and portions are the individual amounts divided by this total amount (van den Boogaart and Tolosana-Delgado 2013).
The great challenge of compositional data is that no single component can be regarded as independent of the others. This means that if you observe a (e.g. spatial or temporal) pattern within a single component, at least part of that pattern is likely attributable to other parts of the composition rather than to the (spatial or temporal) behaviour of the variable of interest itself.
Imagine two measurements of a 2-part composition which sum up to a whole: in measurement 1, both parts are equal, each making up exactly 50% of the composition. Measurement 2 shows 70% of the dark grey component and 30% of the light grey one. How can we describe the difference? Has the dark grey component increased while the light grey remained constant, or has the light grey component decreased while the dark grey remained constant? It is impossible to decide what has happened, because both cases would show up with exactly the same pattern.
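A tiny numeric sketch (with hypothetical absolute counts, chosen to match the 50/50 and 70/30 values above) shows that both explanations produce exactly the same composition:

```r
# Two different absolute scenarios yielding the same 70/30 composition
m1 <- c(dark = 5, light = 5)      # measurement 1: 50% / 50%
sA <- c(dark = 35/3, light = 5)   # scenario A: dark increased, light constant
sB <- c(dark = 5, light = 15/7)   # scenario B: light decreased, dark constant

sA / sum(sA)   # 0.7 / 0.3
sB / sum(sB)   # 0.7 / 0.3 -- indistinguishable from scenario A
```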
Furthermore, we have to consider that all values are strictly positive and thus differences are related to ratios rather than to distances. Thus, we may state the rates of change, with \(x = \text{dark grey}\) and \(y = \text{light grey}\), as:
\[dx=\frac{x_2}{x_1}=\frac{0.7}{0.5}=1.4 \\
dy=\frac{y_2}{y_1}=\frac{0.3}{0.5}=0.6\]
Hence, 40% of the light grey part has been shifted to the dark grey one, but we still cannot decide whether \(x\) increased or \(y\) decreased.
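These rates of change are quickly verified in R, using the proportions from the example above:

```r
x1 <- 0.5; x2 <- 0.7   # dark grey part in measurements 1 and 2
y1 <- 0.5; y2 <- 0.3   # light grey part in measurements 1 and 2
dx <- x2 / x1          # rate of change of the dark grey part
dy <- y2 / y1          # rate of change of the light grey part
c(dx = dx, dy = dy)    # 1.4 and 0.6
```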
Imagine you are on a small hallig in the German Wadden Sea for a while, let's say a two-week holiday. There is absolutely nothing to do but count the boats passing by. You concentrate on two types of boats: fisher boats and ferries.
Here are the results of your counting after 10 days:
set.seed(200)
days <- 1:10
ferry <- rep(11, 10)               # constant number of ferries per day
fisher <- floor(runif(10, 5, 15))  # varying number of fisher boats per day
plot(days, fisher, type = 'l', col = 'blue', xlab = "day", ylab = "no. of boats")
lines(days, ferry, col = 'green')
title(main = "counts of fisher boats (blue) and ferries (green)")
Our plot shows a perfectly constant occurrence of ferries (green line) and a strong variation in fisher boat sightings (blue line). Let us compute some simple statistics on our observations, using relative appearance in percentages:
rel_ferry <- 100 * ferry / (ferry + fisher)
rel_fisher <- 100 * fisher / (ferry + fisher)
plot(days, rel_fisher, type = 'l', col = 'blue', xlab = "day", ylab = "type of boats [%]", ylim = c(0, 100))
lines(days, rel_ferry, col = 'green')
title(main = "relative appearance of fisher boats (blue) and ferries (green)")
Because the two series seem to vary in opposite directions, we try a scatter plot to check for correlation:
plot(rel_ferry,rel_fisher,xlab = "ferries [%]",ylab="fisher boats [%]")
A perfect negative correlation with r = -1. Consequently, we might conclude that the more ferries pass by, the fewer fisher boats can be counted.
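We can confirm this numerically. Since the two percentages always sum to 100, the correlation is \(-1\) by construction; a small sketch reproducing the data with the same seed as above:

```r
set.seed(200)
ferry <- rep(11, 10)
fisher <- floor(runif(10, 5, 15))
rel_ferry <- 100 * ferry / (ferry + fisher)
rel_fisher <- 100 * fisher / (ferry + fisher)
cor(rel_ferry, rel_fisher)   # -1, because rel_fisher = 100 - rel_ferry
```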
The problem persists even if we add a third class of boats, the sailing boats:
sailing <- floor(runif(10, min = 8, max = 24))
rel_sailing <- 100 * sailing / (ferry + fisher + sailing)
rel_ferry <- 100 * ferry / (ferry + fisher + sailing)
rel_fisher <- 100 * fisher / (ferry + fisher + sailing)
psych::pairs.panels(cbind(rel_sailing, rel_ferry, rel_fisher), ellipses = FALSE, smooth = FALSE)
Here we can observe at least four serious problems of compositional data:
1. Depending on the size of the subcomposition (here two or three classes of boats), the measures of covariance/correlation change: subcompositional incoherence.
2. Parts that are completely independent by construction become related/linearly dependent: spurious correlations.
3. Due to the constrained feature scale (here \([0,100]_{\mathbb Q}\)), we can expect neither normal nor symmetric distributions (cf. chapter on doubly constrained feature scales).
4. As a consequence of spurious correlations (cf. 2.), covariance matrices of compositions do not have full rank (important for multivariate methods such as PCA, SVD, LDA, etc.).
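Problem 4 is easy to demonstrate. A minimal sketch, assuming three independently generated positive raw parts, shows that the covariance matrix of the closed data loses one rank:

```r
set.seed(42)
n <- 20
raw <- cbind(a = runif(n), b = runif(n), d = runif(n))  # independent positive parts
comp <- raw / rowSums(raw)   # closure: each row sums to 1
S <- cov(comp)               # covariance matrix of the closed data
qr(S)$rank                   # 2 instead of 3: the constant-sum constraint removes one dimension
```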
The problem is the so-called "closure" of the variable space, meaning that:
A. all parts belonging to an observation sum up to a constant value \(\kappa\),
B. every single proportion \(x_i\) is greater than 0 and cannot exceed \(\kappa\): \(x_i\in \left]0,\kappa\right[_{\mathbb R_+}\).
Before we dive into the particular statistics of compositional data, let us build the formal frame:
First, we define a row vector, \(x = (x_1, x_2,..., x_D)\), as a D-part composition when all its components are strictly positive real numbers and carry only relative information. Further, we may rescale any vector of \(D\) positive real components so that the sum of its components is a constant \(\kappa\), such as \(1\), \(100\) (%) or \(10^{6}\) (parts per million, ppm). Such rescaled data is referred to as closed data, or the closure of a data set.
\[ \mathbf z = (z_1, z_2, ..., z_D) \in \mathbb R^D_+\text{,} \quad \text{with } z_i > 0 \text{ for all } i = 1,2,...,D\] The closure of \(\mathbf z\) is defined as
\[\mathcal C (\mathbf z) = \left[ \frac{\kappa \cdot z_1}{\sum_{i=1}^Dz_i},\frac{\kappa \cdot z_2}{\sum_{i=1}^Dz_i},..., \frac{\kappa \cdot z_D}{\sum_{i=1}^Dz_i} \right] \]
Let us give it a try in R and rescale a vector \(\mathbf z\) to a closed data set \(\mathcal C(\mathbf z)\).
z <- c(23, 34, 42, 7, 98)
k <- 100               # constant sum (kappa)
cz <- z * k / sum(z)   # closure operation; avoid naming the result "c", which shadows base::c()
cz
## [1] 11.274510 16.666667 20.588235 3.431373 48.039216
sum(cz) == k           # check for constant sum
## [1] TRUE
As seen in our examples above, closure has severe consequences for statistical data analysis. Closed data has its own vector-space and algebraic-geometric structure, which is called the Aitchison geometry, after the Scottish statistician John Aitchison. The sample space of compositional data is the simplex, also referred to as the Aitchison simplex. A brief summary of the sample space and its structure is provided, e.g., by Egozcue & Pawlowsky-Glahn (2019).
The closure of a data set corresponds to a projection of a point from the \(D\)-dimensional positive real space \(\mathbb R^D_+\) onto the simplex, \(\mathbb S\):
\[\mathbb S^D = \left\{ \mathbf x=(x_1, ..., x_D) \,\middle\vert\, x_i > 0, \sum_{i=1}^D x_i = 1 \right\}\]
The Aitchison geometry is not linear but curved in the Euclidean sense; thus, the analysis of compositional data calls for a new type of statistical methods, as most classical statistical methods are based on the usual Euclidean geometry (Filzmoser et al. 2010).
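To get a feeling for why new methods are needed, here is a base-R sketch of the Aitchison distance between two 3-part compositions, computed via the centred log-ratio (clr) transform; it generally differs from the naive Euclidean distance between the raw proportions (the example compositions are arbitrary choices for illustration):

```r
clr <- function(x) log(x) - mean(log(x))  # centred log-ratio transform

x <- c(0.1, 0.3, 0.6)
y <- c(0.2, 0.3, 0.5)

d_aitchison <- sqrt(sum((clr(x) - clr(y))^2))  # distance in the Aitchison geometry
d_euclidean <- sqrt(sum((x - y)^2))            # naive Euclidean distance
c(aitchison = d_aitchison, euclidean = d_euclidean)
```

Note that the clr-transformed values always sum to zero, which reflects the same loss of one dimension seen in the rank-deficient covariance matrices above.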
In this section we introduce the compositions package (van den Boogaart, Tolosana-Delgado, Bren). This package provides functions for the consistent analysis of compositional data (e.g. portions, concentrations, etc.). After installing the package with the install.packages("compositions") command, we load the package by calling the library() function. The clo() function of the compositions package applies the closure operation to a given compositional data set.
# install.packages("compositions")
library(compositions)
z
## [1] 23 34 42 7 98
clo(z, total = k)
## [1] 11.274510 16.666667 20.588235 3.431373 48.039216
Citation
The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us via mail by soga[at]zedat.fu-berlin.de.
Please cite as follows: Hartmann, K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.