The challenge of dealing with zero or missing values
is often underestimated, ignored or used as an excuse to skip the necessary
transformation of constrained data.
One of the most frequently heard excuses for avoiding transformations
in compositional or constrained data analysis is “I cannot apply
the logarithm to zeros, but my data contain a lot of
zeros!” Taken literally, this statement is true, but it
remains an excuse:
Plausibility of zeros as reported values for positive real
variables
Let us imagine a data set of daily reported precipitation. For a
certain day, zero mm of precipitation has been reported.
How can we interpret this non-existent amount of precipitation?
1. Only very little precipitation occurred over a very short time and,
by chance, our rain gauge did not catch any raindrops or snowflakes,
although its immediate surroundings did. Here, we cannot assume that there
was no precipitation; it was merely not observed, by chance. Thus, we have
to interpret the zero value as a code for a value missing at
random (MAR).
2. Imagine that on this particular day only a very fine drizzle
occurred in the morning, was caught in the rain gauge (e.g. a
graduated cylinder) and evaporated shortly afterward due to changes in
temperature, pressure or relative humidity. In this case we have to
assume a value missing not at random
(MNAR), because there is a specific reason for the missing
value.
3. Every method has its own accuracy. Thus, approaching zero we will
inevitably reach a detection limit. Below this limit, a zero
value has to be interpreted as an (erroneous) code for below
detection limit (BDL).
4. Let’s assume we have a rain gauge which is not able to sample snow.
In this case snowfall is reported as no
precipitation. Consequently, we consider this event an essential
or structural zero (SZ), cf. e.g. Lubbe et al.,
2021.
5. If we have to assume true zeros, meaning
absolutely no precipitation in our case, we cannot state 0
mm or 0 \(l/m^2\) precipitation:
where there is no event, no measurement can be applied! Only in small
samples of count compositions may zero counts occur (cf. v.d.
Boogaart & Tolosana-Delgado, 2013: chapter 7).
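The five interpretations above can be made explicit in the data instead of hiding them behind a numeric 0. The following is a minimal sketch in Python (standard library only). The names `ZeroKind`, `Reading` and `recode` are hypothetical and chosen for illustration only. Which interpretation applies to a given zero has to come from knowledge about the measurement; the code cannot infer it.

```python
from enum import Enum
from typing import NamedTuple, Optional


class ZeroKind(Enum):
    """Possible meanings of a reported zero (labels follow the list above)."""
    MAR = "missing at random"
    MNAR = "missing not at random"
    BDL = "below detection limit"
    SZ = "structural zero"
    TRUE_ZERO = "true zero (no event)"


class Reading(NamedTuple):
    value_mm: Optional[float]       # None if no positive amount was measured
    zero_kind: Optional[ZeroKind]   # interpretation of the zero, None otherwise


def recode(reported_mm: float, kind_if_zero: ZeroKind) -> Reading:
    """Replace a reported zero by an explicit code; keep positive values."""
    if reported_mm < 0:
        raise ValueError("precipitation amounts cannot be negative")
    if reported_mm > 0:
        return Reading(reported_mm, None)
    return Reading(None, kind_if_zero)


# Hypothetical reports: a measured amount, a value below the detection
# limit, and a snow day at a gauge that cannot sample snow.
readings = [
    recode(2.4, ZeroKind.BDL),
    recode(0.0, ZeroKind.BDL),
    recode(0.0, ZeroKind.SZ),
]
```

With such a coding, a zero can no longer slip into a mean or a logarithm unnoticed, because every later analysis step has to decide explicitly how to treat each kind of zero.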
Dealing with zeros in positive data depends on the context and purpose
of your analysis.
Here are some possible approaches:
A. Ignore the zeros: This is a bad
choice, because it may result in a loss of information or bias in your
results.
B. Use zeros as usual measurements: In this case you will
get an arbitrary distribution, because the non-existent or non-measured values
become part of the frequency distribution of the measured values. Even
estimating the central tendency causes serious trouble: median and
arithmetic mean shift to the left, and in the likely case \(x\in\mathbb R_+\) you get a geometric
mean of \(\bar x_{geo}=0\). All further
analyses will become nonsense.
C. Analyze separately: If the zeros represent a
distinct category or subgroup of the data, you can analyze them
separately from the positive values. For example, if the data represent
precipitation registered at a rain gauge, any non-zero number represents
a time interval of registered precipitation, whereas a zero value means
no precipitation. You may analyse your data using binary approaches for
estimating the frequency, pattern or probability of occurrence. The amount of
precipitation can be analysed in a meaningful way using only the
non-zero values.
D. Replace the zeros: If the zeros represent values
that are expected to be positive but are missing or incomplete, you can
replace them with an imputed value. Pan & Chen, 2023
empirically evaluated zero-imputation approaches for multivariate
missing data (MCAR, MAR, MNAR) in public health.
Azur et al., 2011 present
a brief explanation of and introduction to Multiple Imputation by Chained
Equations (MICE), including the R-package mice
by van Buuren et al. (recent version
3.15.0, 2022-11-17).
Lubbe et al.,
2021 focussed on several imputation methods for compositional data.
A comprehensive and useful presentation of all the above-mentioned types of
zeros and missing values can be found in v.d.
Boogaart & Tolosana-Delgado, 2013 (chapter 7).
Palarea-Albaladejo
and Martin-Fernández, 2011 focused on the handling of values below
detection limits (BDL) in compositional data, with code examples for Matlab
and R.
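The difference between approaches B and C can be illustrated with a small worked example. The following is a minimal sketch in Python (standard library only). The values in `daily_mm` are invented for illustration and are not real observations, and `geometric_mean` is a hypothetical helper written for this sketch.

```python
import math
from statistics import mean, median

# Hypothetical daily precipitation in mm; zeros are days reported
# as "no precipitation".
daily_mm = [0.0, 0.0, 1.2, 0.0, 4.8, 0.6, 0.0, 2.4]


def geometric_mean(values):
    """Geometric mean of non-negative values via the mean of logarithms.

    A single zero forces the result to 0, so it is returned directly
    instead of calling math.log(0), which raises a ValueError.
    """
    if any(v < 0 for v in values):
        raise ValueError("geometric mean requires non-negative values")
    if any(v == 0 for v in values):
        return 0.0
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Approach B: zeros treated as usual measurements.
mean_all = mean(daily_mm)
median_all = median(daily_mm)
geo_all = geometric_mean(daily_mm)    # collapses to 0.0

# Approach C: occurrence and amount analysed separately.
wet = [v for v in daily_mm if v > 0]
p_wet = len(wet) / len(daily_mm)      # relative frequency of wet days
mean_wet = mean(wet)
median_wet = median(wet)
geo_wet = geometric_mean(wet)         # geometric mean of the wet-day amounts

print(f"B: mean={mean_all:.3f} median={median_all:.3f} geo={geo_all:.3f}")
print(f"C: p_wet={p_wet:.2f} mean={mean_wet:.3f} "
      f"median={median_wet:.3f} geo={geo_wet:.3f}")
```

In this invented sample, including the zeros pulls the arithmetic mean and the median toward zero and forces the geometric mean to exactly 0. The separate analysis instead reports that half of the days were wet, with a geometric mean of about 1.70 mm on those days.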
Citation
The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us via mail by soga[at]zedat.fu-berlin.de.
Please cite as follows: Hartmann, K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.