
The challenge of dealing with zero or missing values is often underestimated, ignored, or used as an excuse to skip the transformations that constrained data require.
One of the most frequently heard excuses for avoiding transformations in compositional or constrained data analysis is “I cannot apply the logarithm to zeros, but my data contain a lot of zeros!” This statement is literally true, but it remains an excuse:

Plausibility of zeros as reported values for positive real variables

Let us imagine a data set of daily reported precipitation. For a certain day, zero [mm] precipitation has been reported.
How can we interpret this non-existent amount of precipitation?

1. Only very little precipitation occurred over a very short time, and by chance our rain gauge did not catch any raindrops or snowflakes, although its immediate surroundings did. Here, we cannot assume that there was no precipitation; it was merely not observed, by chance. Thus, we have to interpret the zero value as a code for a value missing at random (MAR).

2. Imagine that on this particular day only a very fine drizzle occurred in the morning; it was caught in the rain gauge (e.g. a graduated cylinder) but evaporated shortly afterwards owing to changes in temperature, pressure or relative humidity. In this case we have to assume a value missing not at random (MNAR), because there is a specific reason why the value is missing.

3. Every method has its own accuracy. Thus, approaching zero, we will inevitably reach a detection limit. Below this limit, a zero value has to be interpreted as an (erroneous) code for below detection limit (BDL).

4. Let’s assume we have a rain gauge that is not able to sample snow. In this case the gauge reports no precipitation even though snow has fallen. Consequently, we consider this event an essential or structural zero (SZ); cf. e.g. Lubbe et al., 2021.

5. If we have to assume true zeros, meaning absolutely no precipitation in our case, we cannot state 0 mm or 0 $l/m^2$ precipitation: since no event occurred, no measurement can be applied! Only in small samples of count compositions may zero counts occur (cf. v.d. Boogaart & Tolosana-Delgado, 2013: chapter 7).
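To keep these interpretations apart in practice, a reported zero can be recoded with an explicit reason instead of staying in the data as a number. The following is a minimal sketch in Python. The `ZeroType` categories mirror the abbreviations used above, but the `recode` helper and the example readings are hypothetical and only illustrate the idea; they are not taken from the cited literature.

```python
from collections import Counter
from enum import Enum


class ZeroType(Enum):
    """Possible meanings of a reported zero (abbreviations as in the list above)."""
    MAR = "missing at random"
    MNAR = "missing not at random"
    BDL = "below detection limit"
    SZ = "structural zero"


def recode(value, reason=None):
    """Return positive values unchanged; replace a zero by its stated reason.

    A zero without a reason raises an error, so the analyst is forced to
    decide how the zero has to be interpreted (hypothetical helper).
    """
    if value > 0:
        return value
    if reason is None:
        raise ValueError("a reported zero needs an explicit interpretation")
    return reason


# Hypothetical daily readings in mm, each zero annotated by the analyst.
readings = [(2.4, None), (0.0, ZeroType.BDL), (0.0, ZeroType.SZ),
            (0.7, None), (0.0, ZeroType.BDL)]

recoded = [recode(value, reason) for value, reason in readings]

# Count how often each interpretation occurs among the reported zeros.
zero_counts = Counter(r for r in recoded if isinstance(r, ZeroType))
```

With the zeros tagged in this way, each type can later be routed to a treatment that suits it, for instance imputation for BDL values and a separate analysis for structural zeros.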

Dealing with zeros in positive data depends on the context and purpose of your analysis.
Here are some possible approaches:

A. Ignore the zeros: This is a bad choice, because it may lead to a loss of information or bias your results.

B. Use zeros as usual measurements: This yields an arbitrary distribution, because the non-measured values become part of the frequency distribution of the measured values. Even estimating the central tendency causes serious trouble: median and arithmetic mean shift to the left, and in the likely case $x\in\mathbb R_+$ you will get a geometric mean of $\bar x_{geo}=0$. All further analyses will become nonsense.

C. Analyze separately: If the zeros represent a distinct category or subgroup of the data, you can analyze them separately from the positive values. For example, if the data represent precipitation registered at a rain gauge, any non-zero number represents a time interval with registered precipitation, whereas a zero value means no precipitation. You may analyze the occurrence of precipitation with binary approaches that estimate its frequency, pattern or probability. The amount of precipitation can then be analyzed in a meaningful way using only the non-zero values.

D. Replace the zeros: If the zeros represent values that are expected to be positive but are missing or incomplete, you can replace them with an imputed value. Pan & Chen, 2023 empirically evaluated zero-imputation approaches for multivariate missing data (MCAR, MAR, MNAR) in public health.
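The difference between approaches B and C can be illustrated with a few numbers. The sketch below, written in Python, uses a small made-up precipitation series; the values and the helper `geometric_mean` are assumptions for illustration only. It first treats the zeros as ordinary measurements (B) and then separates the occurrence of precipitation from its amount (C).

```python
import math
import statistics

# Hypothetical daily precipitation in mm; the zeros are reported values.
precip_mm = [0.0, 0.0, 1.2, 0.0, 3.5, 0.4, 0.0, 7.1, 0.0, 2.3]


def geometric_mean(values):
    """Geometric mean via logarithms; it is 0 as soon as any value is 0."""
    if any(v == 0 for v in values):
        return 0.0
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Approach B: zeros enter the statistics like any other measurement.
mean_all = statistics.mean(precip_mm)
median_all = statistics.median(precip_mm)
gmean_all = geometric_mean(precip_mm)      # collapses to 0.0

# Approach C: analyze occurrence and amount separately.
wet_days = [x for x in precip_mm if x > 0]
p_wet = len(wet_days) / len(precip_mm)     # relative frequency of wet days
mean_wet = statistics.mean(wet_days)
median_wet = statistics.median(wet_days)
gmean_wet = geometric_mean(wet_days)       # strictly positive

print(f"B: mean={mean_all:.2f}  median={median_all:.2f}  geo. mean={gmean_all:.2f}")
print(f"C: P(wet)={p_wet:.2f}  mean={mean_wet:.2f}  "
      f"median={median_wet:.2f}  geo. mean={gmean_wet:.2f}")
```

In this example the mean and median of the full series lie well below those of the wet days alone, and the geometric mean of the full series is exactly zero, whereas the two-part summary keeps the occurrence frequency and the amounts interpretable.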

Azur et al., 2011 present a brief explanation of and introduction to Multiple Imputation by Chained Equations (MICE), including the R package mice by van Buuren et al. (recent version 3.15.0, 2022-11-17).
Lubbe et al., 2021 focused on several imputation methods for compositional data. A comprehensive and useful presentation of all the above-mentioned types of zeros and missing values can be found in v.d. Boogaart & Tolosana-Delgado, 2013: chapter 7.
Palarea-Albaladejo and Martin-Fernández, 2011 focused on the handling of values below detection limits (BDL) in compositional data, with code examples for Matlab and R.
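As an illustration of what replacing BDL zeros in a composition can look like, the following Python sketch applies a simple multiplicative replacement: each zero is set to a fraction of its detection limit, and the non-zero parts are shrunk so that the total of the composition is preserved. This is a simplified stand-in and does not reproduce the code of the cited papers; the function name, the example composition, the detection limits and the fraction of 0.65 are assumptions made for this sketch.

```python
import math


def multiplicative_replacement(composition, detection_limits, frac=0.65):
    """Replace zeros by frac * detection limit and rescale the non-zero parts
    so that the total of the composition stays unchanged."""
    total = sum(composition)
    if total <= 0:
        raise ValueError("the composition needs at least one positive part")
    # Replacement value for every zero part, 0.0 for observed parts.
    deltas = [frac * dl if x == 0 else 0.0
              for x, dl in zip(composition, detection_limits)]
    # Common factor that makes room for the imputed values.
    shrink = 1.0 - sum(deltas) / total
    return [d if x == 0 else x * shrink
            for x, d in zip(composition, deltas)]


# Hypothetical 4-part composition in percent; the third part was reported as 0.
sample = [60.0, 39.5, 0.0, 0.5]
limits = [0.1, 0.1, 0.1, 0.1]    # hypothetical detection limits in percent

replaced = multiplicative_replacement(sample, limits)
log_parts = [math.log(x) for x in replaced]   # logarithms are now defined
```

Because all observed parts are multiplied by the same factor, their ratios to each other remain unchanged, which is the property that matters for subsequent log-ratio transformations.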

Citation

The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us by e-mail at soga[at]zedat.fu-berlin.de.

You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follows: Hartmann, K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.