Compositional data is a special type of non-negative data that carries the relevant information in the ratios between the variables rather than in the actual data values. The individual constituents of a composition are called components or parts. Each component has an amount, representing its importance within the whole. The sum over the amounts of all components is called the total amount, and portions are the individual amounts divided by this total amount (van den Boogaart and Tolosana-Delgado 2013).
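The terms above can be illustrated with a minimal sketch in R. The part names `A`, `B`, `C` and their amounts are made up purely for illustration and do not come from any data set.

```r
# Hypothetical amounts of a three-part composition (illustrative values only)
amounts <- c(A = 12, B = 6, C = 2)

# Total amount: the sum over the amounts of all components
total <- sum(amounts)

# Portions: each amount divided by the total amount; they always sum to 1
portions <- amounts / total
print(portions)

# Multiplying all amounts by a common factor leaves the portions unchanged,
# because only the ratios between the parts carry information
portions_rescaled <- (10 * amounts) / sum(10 * amounts)
print(all.equal(portions, portions_rescaled))
```

Note that the portions are the same whether the amounts are 12, 6 and 2 or 120, 60 and 20: the absolute scale is lost once the data are expressed relative to the total.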
The great challenge of compositional data is that
no single component can be regarded as independent of
the others. This means that if you observe a pattern (e.g. spatial or temporal)
within one single component, at least part of that pattern is
likely attributable to another part of the composition rather than to the
(spatial or temporal) behaviour of the variable of interest
itself.
Imagine two measurements of a two-part composition, each of which sums to a
whole:
In measurement 1, both parts contribute exactly 50% of the whole
composition. Measurement 2 shows 70% of the dark grey
component and 30% of the light grey one. How can we describe the
difference?
Has the dark grey component increased while the light grey one remained
constant, or has the light grey component decreased while the dark grey one
remained constant? It is impossible to decide what has
happened, because both cases produce exactly the same pattern.
Furthermore, we have to consider that all values are strictly positive,
so differences are better expressed as ratios than as distances.
We can therefore state the change of each part as a ratio, with \(x\) denoting the dark grey portion and \(y\) the light grey portion:
\[\begin{aligned}
dx &= \frac{x_2}{x_1} = \frac{0.7}{0.5} = 1.4 \\
dy &= \frac{y_2}{y_1} = \frac{0.3}{0.5} = 0.6
\end{aligned}\]
Hence, 40% of the light grey portion (20 percentage points of the whole) has shifted to the dark grey one, but
we still cannot decide whether x increased or y decreased.
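The ambiguity can be made concrete with a small sketch. The absolute amounts below are hypothetical, and `close_composition` is a helper name introduced only for this illustration: two very different changes in the amounts lead to the same 70/30 composition.

```r
# Helper (hypothetical name): divide each amount by the total to get portions
close_composition <- function(amounts) amounts / sum(amounts)

# Measurement 1: both parts have the same (assumed) absolute amount
m1 <- c(dark = 50, light = 50)

# Scenario A: only the dark grey amount increases, light grey stays constant
m2_a <- c(dark = 50 * 7 / 3, light = 50)

# Scenario B: only the light grey amount decreases, dark grey stays constant
m2_b <- c(dark = 50, light = 50 * 3 / 7)

# Both scenarios yield the same portions: 0.7 dark, 0.3 light
print(close_composition(m2_a))
print(close_composition(m2_b))

# Ratios of change relative to measurement 1: 1.4 for dark, 0.6 for light
print(close_composition(m2_a) / close_composition(m1))
```

Once the amounts are closed to portions, scenarios A and B are indistinguishable. Only the ratio between the two parts, which changes from 1 to 7/3 in both scenarios, remains interpretable.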
Imagine you are on a small Hallig in the German
Wadden Sea for a
while, let's say for a two-week holiday. There is absolutely nothing to do but
count the boats passing by. You concentrate on two types of boats:
fishing boats and ferries.
Here are the results of your counting after 10 days:
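# Simulate 10 days of boat counts (fixed seed for reproducibility)
set.seed(200)
days <- 1:10
ferry <- rep(11, 10)                 # constant number of ferries: 11 per day
fisher <- floor(runif(10, 5, 15))    # fishing boats: random integers from 5 to 14
# Plot both series; ylim covers both so the ferry line is always visible
plot(days, fisher, type = "l", col = "blue", xlab = "day", ylab = "no. of boats",
     ylim = range(fisher, ferry))
lines(days, ferry, col = "green")
title(main = "counts of fishing boats (blue) and ferries (green)")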
Our plot shows an absolutely constant number of ferries (green line)
and a strong variation in fishing boat sightings (blue line).
Let us compute some simple statistics on our observations, using relative
frequencies in percent:
# Convert the counts to percentages of the daily total
total <- ferry + fisher
rel_ferry <- 100 * ferry / total
rel_fisher <- 100 * fisher / total
plot(days, rel_fisher, type = "l", col = "blue", xlab = "day",
     ylab = "type of boats [%]", ylim = c(0, 100))
lines(days, rel_ferry, col = "green")
title(main = "relative frequency of fishing boats (blue) and ferries (green)")
Because the two curves appear to move in opposite directions, we check the
correlation with a scatter plot:
plot(rel_ferry, rel_fisher, xlab = "ferries [%]", ylab = "fishing boats [%]")
The result is a perfect negative correlation with r = -1. Taken at face value,
this would mean that the more ferries pass by, the fewer fishing boats can be
counted, although the ferry count was constant at 11 per day.
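The value of r can be checked numerically. The following sketch rebuilds the same simulated counts and uses base R's `cor()`. It also shows where the perfect correlation comes from: the two percentages sum to 100 on every day, so one is an exact linear function of the other.

```r
# Rebuild the simulated counts from above
set.seed(200)
ferry  <- rep(11, 10)
fisher <- floor(runif(10, 5, 15))

# Percentages of the daily total
rel_ferry  <- 100 * ferry  / (ferry + fisher)
rel_fisher <- 100 * fisher / (ferry + fisher)

# Pearson correlation of the two percentages (-1 up to floating-point error)
r <- cor(rel_ferry, rel_fisher)
print(r)

# The two percentages are tied together by construction
print(all.equal(rel_fisher, 100 - rel_ferry))

# The raw ferry counts do not vary at all, so they cannot explain any pattern
print(sd(ferry))
```

The correlation is therefore a property of the closure to 100%, not of the boats: it would be -1 for any two-part composition, whatever the underlying counts.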
Does this change if we add a third class of boats, the sailing boats?
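# Add sailing boats and recompute all percentages against the new daily total
sailing <- floor(runif(10, min = 8, max = 24))
total <- ferry + fisher + sailing
rel_sailing <- 100 * sailing / total
rel_ferry <- 100 * ferry / total
rel_fisher <- 100 * fisher / total
# Scatter plot matrix with pairwise correlations (requires the psych package)
psych::pairs.panels(cbind(rel_sailing, rel_ferry, rel_fisher), ellipses = FALSE, smooth = FALSE)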
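As a numeric complement to the scatter plot matrix, the correlations can also be inspected with base R alone. This is a minimal sketch that assumes the counts are generated in the same order as above, with no other random draws in between. The matrix name `rel` is introduced only for this illustration.

```r
# Rebuild the simulated counts in the same order as above
set.seed(200)
ferry   <- rep(11, 10)
fisher  <- floor(runif(10, 5, 15))
sailing <- floor(runif(10, min = 8, max = 24))

# Percentages of the daily total for all three boat types
total <- ferry + fisher + sailing
rel <- cbind(
  rel_sailing = 100 * sailing / total,
  rel_ferry   = 100 * ferry   / total,
  rel_fisher  = 100 * fisher  / total
)

# Pearson correlation matrix of the three percentages
print(round(cor(rel), 2))

# Every row of `rel` sums to 100, so each row of the covariance matrix
# sums to zero (up to floating-point error)
print(round(rowSums(cov(rel)), 10))
```

Because each row of the covariance matrix sums to zero while the variances on its diagonal are positive, every part must have a negative covariance with at least one other part. Some negative correlation is thus forced by the closure, independently of how the boats actually behave.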