Compositional data is a special type of non-negative data, which carries the relevant information in the ratios between the variables rather than in the actual data values. The individual parts of the composition are called components or parts. Each component has an amount, representing its importance within the whole. The sum over the amounts of all components is called the total amount, and portions are the individual amounts divided by this total amount (van den Boogaart and Tolosana-Delgado 2013).
The great challenge of compositional data is that no single component can be regarded as independent of the others. This means that if you observe a (e.g. spatial or temporal) pattern within a single component, at least part of that pattern is likely attributable to other parts of the composition rather than to the (spatial or temporal) behaviour of the variable of interest itself.
Imagine two measurements of a 2-part composition which sum up to a whole: in measurement 1, both parts are equal, each making up exactly 50% of the composition. Measurement 2 shows 70% of the dark grey component and 30% of the light grey one. How can we describe the difference? Has the dark grey component increased while the light grey remained constant, or has the light grey component decreased while the dark grey remained constant? It is impossible to decide what has happened, because both cases would show up with exactly the same pattern.
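A tiny numeric sketch (with hypothetical absolute counts, chosen to match the 50/50 and 70/30 values above) shows that both explanations produce exactly the same composition:

```r
# Two different absolute scenarios yielding the same 70/30 composition
m1 <- c(dark = 5, light = 5)      # measurement 1: 50% / 50%
sA <- c(dark = 35/3, light = 5)   # scenario A: dark increased, light constant
sB <- c(dark = 5, light = 15/7)   # scenario B: light decreased, dark constant

sA / sum(sA)   # 0.7 / 0.3
sB / sum(sB)   # 0.7 / 0.3 -- indistinguishable from scenario A
```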
Furthermore, we have to consider that all values are strictly positive and thus differences are related to ratios rather than to distances. Thus, we may state the rates of change, with \(x = \text{dark grey}\) and \(y = \text{light grey}\), as:
\[dx=\frac{x_2}{x_1}=\frac{0.7}{0.5}=1.4 \\
dy=\frac{y_2}{y_1}=\frac{0.3}{0.5}=0.6\]
Hence, 40% of the light grey part has been shifted to the dark grey one, but we still cannot decide whether \(x\) increased or \(y\) decreased.
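These rates of change are quickly verified in R, using the proportions from the example above:

```r
x1 <- 0.5; x2 <- 0.7   # dark grey part in measurements 1 and 2
y1 <- 0.5; y2 <- 0.3   # light grey part in measurements 1 and 2
dx <- x2 / x1          # rate of change of the dark grey part
dy <- y2 / y1          # rate of change of the light grey part
c(dx = dx, dy = dy)    # 1.4 and 0.6
```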
Imagine you are on a small hallig in the German Wadden Sea for a while, let's say a two-week holiday. There is absolutely nothing to do but count the boats passing by. You concentrate on two types of boats: fisher boats and ferries.
Here are the results of your counting after 10 days:
set.seed(200)
days <- 1:10
ferry <- rep(11, 10)               # constant number of ferries per day
fisher <- floor(runif(10, 5, 15))  # varying number of fisher boats per day
plot(days, fisher, type = 'l', col = 'blue', xlab = "day", ylab = "no. of boats")
lines(days, ferry, col = 'green')
title(main = "counts of fisher boats (blue) and ferries (green)")
Our plot shows a perfectly constant occurrence of ferries (green line) and a strong variation in fisher boat sightings (blue line). Let us compute some simple statistics on our observations, using relative appearance in percentages:
rel_ferry <- 100 * ferry / (ferry + fisher)
rel_fisher <- 100 * fisher / (ferry + fisher)
plot(days, rel_fisher, type = 'l', col = 'blue', xlab = "day", ylab = "type of boats [%]", ylim = c(0, 100))
lines(days, rel_ferry, col = 'green')
title(main = "relative appearance of fisher boats (blue) and ferries (green)")
Because the two series seem to vary in opposite directions, we try a scatter plot to check for correlation:
plot(rel_ferry,rel_fisher,xlab = "ferries [%]",ylab="fisher boats [%]")
A perfect negative correlation with r = -1. Consequently, we might conclude that the more ferries pass by, the fewer fisher boats can be counted.
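We can confirm this numerically. Since the two percentages always sum to 100, the correlation is \(-1\) by construction; a small sketch reproducing the data with the same seed as above:

```r
set.seed(200)
ferry <- rep(11, 10)
fisher <- floor(runif(10, 5, 15))
rel_ferry <- 100 * ferry / (ferry + fisher)
rel_fisher <- 100 * fisher / (ferry + fisher)
cor(rel_ferry, rel_fisher)   # -1, because rel_fisher = 100 - rel_ferry
```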
The problem persists even if we add a third class of boats, the sailing boats:
sailing <- floor(runif(10, min = 8, max = 24))
rel_sailing <- 100 * sailing / (ferry + fisher + sailing)
rel_ferry <- 100 * ferry / (ferry + fisher + sailing)
rel_fisher <- 100 * fisher / (ferry + fisher + sailing)
psych::pairs.panels(cbind(rel_sailing, rel_ferry, rel_fisher), ellipses = FALSE, smooth = FALSE)
Here we can observe at least four serious problems of compositional data:
1. Depending on the size of the subcomposition (here two or three classes of boats), the measures of covariance/correlation change: subcompositional incoherence.
2. Parts that are completely independent by construction become related/linearly dependent: spurious correlations.
3. Due to the constrained feature scale (here \([0,100]_{\mathbb Q}\)), we can expect neither normal nor symmetric distributions (cf. chapter on doubly constrained feature scales).
4. As a consequence of spurious correlations (cf. 2.), covariance matrices of compositions do not have full rank (important for multivariate methods such as PCA, SVD, LDA, etc.).
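Problem 4 is easy to demonstrate. A minimal sketch, assuming three independently generated positive raw parts, shows that the covariance matrix of the closed data loses one rank:

```r
set.seed(42)
n <- 20
raw <- cbind(a = runif(n), b = runif(n), d = runif(n))  # independent positive parts
comp <- raw / rowSums(raw)   # closure: each row sums to 1
S <- cov(comp)               # covariance matrix of the closed data
qr(S)$rank                   # 2 instead of 3: the constant-sum constraint removes one dimension
```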
The problem is the so-called "closure" of the variable space, meaning that:
A. all parts belonging to an observation sum up to a constant value \(\kappa\),
B. every single proportion \(x_i\) is greater than 0 and cannot exceed \(\kappa\): \(x_i\in \left]0,\kappa\right[_{\mathbb R_+}\).
Before we dive into the particular statistics of compositional data, let us build the formal frame:
First, we define a row vector, \(x = (x_1, x_2,..., x_D)\), as a D-part composition when all its components are strictly positive real numbers and carry only relative information. Further, we may rescale any vector of \(D\) positive real components so that the sum of its components is a constant \(\kappa\), such as \(1\), \(100\) (%) or \(10^{6}\) (parts per million, ppm). Such rescaled data is referred to as closed data, or the closure of a data set.
\[ \mathbf z = (z_1, z_2, ..., z_D) \in \mathbb R^D_+\text{,} \quad \text{with } z_i > 0 \text{ for all } i = 1,2,...,D\] The closure of \(\mathbf z\) is defined as
\[\mathcal C (\mathbf z) = \left[ \frac{\kappa \cdot z_1}{\sum_{i=1}^Dz_i},\frac{\kappa \cdot z_2}{\sum_{i=1}^Dz_i},..., \frac{\kappa \cdot z_D}{\sum_{i=1}^Dz_i} \right] \]
Let us give it a try in R and rescale a vector \(\mathbf z\) to a closed data set \(\mathcal C(\mathbf z)\).
z <- c(23, 34, 42, 7, 98)
k <- 100               # constant sum (kappa)
cz <- z * k / sum(z)   # closure operation; avoid naming the result "c", which shadows base::c()
cz
## [1] 11.274510 16.666667 20.588235 3.431373 48.039216
sum(cz) == k           # check for constant sum
## [1] TRUE
As seen in our examples above, closure has severe consequences for statistical data analysis. Closed data has its own vector-space and algebraic-geometric structure, which is called the Aitchison geometry, after the Scottish statistician John Aitchison. The sample space of compositional data is the simplex, also referred to as the Aitchison simplex. A brief summary of the sample space and its structure is provided, e.g., by Egozcue & Pawlowsky-Glahn (2019).
The closure of a data set corresponds to a projection of a point from the \(D\)-dimensional positive real space \(\mathbb R^D_+\) onto the simplex, \(\mathbb S\):
\[\mathbb S^D = \left\{ \mathbf x=(x_1, ..., x_D) \,\middle\vert\, x_i > 0, \sum_{i=1}^D x_i = 1 \right\}\]
The Aitchison geometry is not linear but curved in the Euclidean sense; thus, the analysis of compositional data calls for a new type of statistical methods, as most classical statistical methods are based on the usual Euclidean geometry (Filzmoser et al. 2010).
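To get a feeling for why new methods are needed, here is a base-R sketch of the Aitchison distance between two 3-part compositions, computed via the centred log-ratio (clr) transform; it generally differs from the naive Euclidean distance between the raw proportions (the example compositions are arbitrary choices for illustration):

```r
clr <- function(x) log(x) - mean(log(x))  # centred log-ratio transform

x <- c(0.1, 0.3, 0.6)
y <- c(0.2, 0.3, 0.5)

d_aitchison <- sqrt(sum((clr(x) - clr(y))^2))  # distance in the Aitchison geometry
d_euclidean <- sqrt(sum((x - y)^2))            # naive Euclidean distance
c(aitchison = d_aitchison, euclidean = d_euclidean)
```

Note that the clr-transformed values always sum to zero, which reflects the same loss of one dimension seen in the rank-deficient covariance matrices above.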
In this section we introduce the compositions package (van den Boogaart, Tolosana-Delgado, Bren). This package provides functions for the consistent analysis of compositional data (e.g. portions, concentrations, etc.). After installing the package with the install.packages("compositions") command, we load the package by calling the library() function. The clo() function of the compositions package applies the closure operation to a given compositional data set.
# install.packages("compositions")
library(compositions)
z
## [1] 23 34 42 7 98
clo(z, total = k)
## [1] 11.274510 16.666667 20.588235 3.431373 48.039216
Citation
The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us via mail by soga[at]zedat.fu-berlin.de.
Please cite as follows: Hartmann, K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.