Missing values in data sets are a well-known problem: they occur nearly
everywhere data is measured and recorded. They arise for various
reasons: values may not be measured at all, values may be measured but
get lost, or values may be measured but are considered unusable. Missing
values cause problems because subsequent data processing and analysis
steps often rely on complete data sets. Therefore, missing values need
to be replaced with reasonable values. In statistics this process is
called **imputation**.

When faced with missing values, it is important to understand the
mechanism that causes the data to be missing. Such an understanding
serves as background knowledge for selecting an appropriate imputation
strategy. The mechanisms behind missing data can be divided into three
categories: **missing completely at random (MCAR)**, **missing at
random (MAR)** and **not missing at random (NMAR)** (Moritz et
al. 2015). In MCAR, missing data points occur entirely at random: the
probability of an observation being missing depends neither on its own
value nor on any other variable. In MAR, the probability of an
observation being missing is likewise independent of the value of the
observation itself, but it depends on other observed variables. In NMAR,
observations are not missing in a random manner: the probability of
being missing depends on the unobserved value itself.
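The three mechanisms can be illustrated with a small simulation. The following is a minimal sketch in base R, not taken from Moritz et al. (2015): the variables `x` and `y`, the sample size, the 20 % missingness rate and the logistic missingness probabilities are all assumptions chosen purely for illustration.

```r
set.seed(42)  # arbitrary seed, only for reproducibility

n <- 1000
x <- rnorm(n)       # fully observed covariate (hypothetical)
y <- x + rnorm(n)   # variable that will receive missing values

# MCAR: every observation has the same 20 % chance of being missing
miss_mcar <- runif(n) < 0.2

# MAR: the chance of being missing depends only on the observed covariate x
miss_mar <- runif(n) < plogis(x - 1.5)

# NMAR: the chance of being missing depends on the value of y itself
miss_nmar <- runif(n) < plogis(y - 1.5)

y_mcar <- ifelse(miss_mcar, NA, y)
y_mar  <- ifelse(miss_mar,  NA, y)
y_nmar <- ifelse(miss_nmar, NA, y)

# Compare the mean of the observed values with the mean of the full data
c(full = mean(y),
  mcar = mean(y_mcar, na.rm = TRUE),
  mar  = mean(y_mar,  na.rm = TRUE),
  nmar = mean(y_nmar, na.rm = TRUE))
```

In this setup the observed mean under MCAR should stay close to the mean of the full data, whereas under MAR and NMAR large values are more likely to be missing, so the observed mean tends to be too low.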

Most imputation algorithms rely on inter-attribute correlations, whereas univariate time series imputation has to exploit time dependencies. The imputation of univariate time series is therefore a special case: instead of covariates, as in multivariate data sets, time dependencies have to be employed to perform an effective imputation. Techniques capable of imputing univariate time series can be roughly divided into three categories (Moritz et al. 2015):

**1. Univariate algorithms** These algorithms work with
univariate inputs, but typically do not employ the time series character
of the data set. Examples are: *mean*, *mode*,
*median*, *random sample*.
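The univariate algorithms named above can be sketched in a few lines of base R. The example series and the helper names `fill_na` and `stat_mode` are hypothetical and only serve as illustration. None of these approaches looks at the ordering of the values in time.

```r
x <- c(2, 4, NA, 4, 7, NA, 10)   # hypothetical example series

# Helper (hypothetical name): replace every NA in x with a single value
fill_na <- function(x, value) {
  x[is.na(x)] <- value
  x
}

# Base R has no function for the statistical mode, so compute it by hand
stat_mode <- function(x) {
  obs <- x[!is.na(x)]
  uniq <- unique(obs)
  uniq[which.max(tabulate(match(obs, uniq)))]
}

x_mean   <- fill_na(x, mean(x, na.rm = TRUE))     # fills with 5.4
x_median <- fill_na(x, median(x, na.rm = TRUE))   # fills with 4
x_mode   <- fill_na(x, stat_mode(x))              # fills with 4

# Random sample: draw each replacement from the observed values
set.seed(1)  # arbitrary seed, only for reproducibility
obs <- x[!is.na(x)]
x_random <- x
x_random[is.na(x)] <- sample(obs, sum(is.na(x)), replace = TRUE)
```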

**2. Univariate time series algorithms** These
algorithms also work with univariate inputs, but make use of
the time series characteristics. Examples of simple algorithms in this
category are *last observation carried forward (locf)*, *next
observation carried backward (nocb)*, *arithmetic smoothing*
and *linear interpolation*. More advanced algorithms are
based on structural time series models and can handle seasonality.
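Three of the simple algorithms named above (locf, nocb and linear interpolation) can be sketched in base R as follows. The function names `locf` and `nocb` and the example series are assumptions made for this illustration and are not a package API. Observations are assumed to be equally spaced in time.

```r
x <- c(NA, 2, NA, NA, 8, 9, NA)   # hypothetical series, equally spaced

# LOCF: replace each NA with the most recent observed value before it
locf <- function(x) {
  idx <- seq_along(x)
  last_obs <- cummax(ifelse(is.na(x), 0L, idx))  # index of last observation
  last_obs[last_obs == 0L] <- NA   # leading NAs have no previous observation
  x[last_obs]
}

# NOCB: the same idea applied to the reversed series
nocb <- function(x) rev(locf(rev(x)))

x_locf <- locf(x)   # NA 2 2 2 8 9 9
x_nocb <- nocb(x)   # 2 2 8 8 8 9 NA

# Linear interpolation between neighbouring observations; with the default
# rule = 1, approx() leaves positions outside the observed range as NA
idx <- seq_along(x)
obs <- !is.na(x)
x_lin <- approx(idx[obs], x[obs], xout = idx)$y   # NA 2 4 6 8 9 NA
```

Note that locf cannot fill leading gaps and nocb cannot fill trailing gaps, so in practice the two are sometimes combined.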

**3. Multivariate algorithms on lagged data** Usually,
multivariate algorithms cannot be applied to univariate data. But since
time is an implicit variable for time series, it is possible to add time
information as covariates, which makes it possible to apply
multivariate imputation algorithms.
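One way to add such time information is to build a data frame of lagged copies of the series. The following minimal sketch uses `embed()` from base R. The example series, the choice of two lags and the column names are assumptions for illustration only, and no particular multivariate imputation function is implied.

```r
y <- c(5, 7, 6, NA, 11, 10, 13, 12)   # hypothetical series with one gap

# embed() puts the current value in column 1 and the lagged values after it
lagged <- as.data.frame(embed(y, 3))
names(lagged) <- c("y", "lag1", "lag2")

# Explicit time index as a further covariate
lagged$t <- 3:length(y)

lagged
```

Each row now holds a value together with its two predecessors, so the univariate series looks like a multivariate data set. The single gap in `y` shows up once in each of the three value columns, which is the form a multivariate imputation algorithm would receive.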

In general, imputation is well covered in R. Moritz et al. (2015) compiled a list of R packages that offer imputation tools and functions based on different strategies and algorithms:

- missForest (imputation based on random
forests)

- mvnmle (maximum likelihood estimation)

- mtsdi (expectation maximization)

- yaImpute (nearest neighbor observation)

- BaBooN (predictive mean matching)

- CoImp (conditional copula specifications)

- mice (multivariate imputation by chained
equations)

- Hmisc (multiple imputation)

- Amelia (multiple imputation)

- imputeR (general imputation framework)

- VIM (visualization and imputation of missing
values)

- mitools (multiple imputation)

- HotDeckImputation (Hot Deck imputation)

- hot.deck (multiple Hot Deck imputation)

- miceadds (multiple imputation)

- mi (missing data imputation in an approximate
Bayesian framework)

- missMDA (missing values with multivariate data
analysis)

- ForImp (forward imputation algorithm)

- DataCombine

- and others…

**Citation**

The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us via e-mail at soga[at]zedat.fu-berlin.de.

You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follows: *Hartmann,
K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis
using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.*