307050_Dealing_with_missing

Missing values are a well-known problem in data sets, because they occur nearly everywhere data is measured and recorded. Various reasons lead to missing values: values may not be measured at all, values may be measured but get lost, or values may be measured but considered unusable. Missing values cause problems because further data processing and analysis steps often rely on complete data sets. Therefore, missing values need to be replaced with reasonable values. In statistics this process is called imputation.

When faced with missing values, it is important to understand the mechanism that causes them, because this background knowledge helps in selecting an appropriate imputation strategy. The mechanisms behind missing data can be divided into three categories: missing completely at random (MCAR), missing at random (MAR) and not missing at random (NMAR) (Moritz et al. 2015). In MCAR, missing data points occur entirely at random: the probability of an observation being missing depends neither on its own value nor on any other variable. In MAR, the probability of an observation being missing is also independent of the value of the observation itself, but it depends on other variables. In NMAR, observations are not missing in a random manner: the probability of being missing depends on the value of the observation itself (and possibly on other variables as well).
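The difference between the three mechanisms can be made concrete with a small simulation. The following is a minimal sketch in plain Python, not taken from Moritz et al. (2015): the function name `simulate_missingness`, the companion variable `z`, the median threshold and the `rate` parameter are all assumptions chosen for illustration.

```python
import random


def simulate_missingness(x, z, mechanism, rate=0.3, seed=0):
    """Return a copy of x in which some entries are replaced by None.

    x         -- values of the variable of interest
    z         -- a fully observed companion variable (same length as x)
    mechanism -- "MCAR", "MAR" or "NMAR"
    rate      -- illustrative base probability of a value being missing
    """
    rng = random.Random(seed)
    x_median = sorted(x)[len(x) // 2]
    z_median = sorted(z)[len(z) // 2]
    masked = []
    for xi, zi in zip(x, z):
        if mechanism == "MCAR":
            # Same probability for every observation.
            p = rate
        elif mechanism == "MAR":
            # Probability depends only on the other variable z.
            p = 2 * rate if zi > z_median else 0.0
        elif mechanism == "NMAR":
            # Probability depends on the value of x itself.
            p = 2 * rate if xi > x_median else 0.0
        else:
            raise ValueError("mechanism must be 'MCAR', 'MAR' or 'NMAR'")
        masked.append(None if rng.random() < p else xi)
    return masked


# Toy data (hypothetical): x is the variable of interest,
# z is a fully observed variable holding a permutation of 0..99.
x = [float(i) for i in range(100)]
z = [float((7 * i) % 100) for i in range(100)]

for mechanism in ("MCAR", "MAR", "NMAR"):
    masked = simulate_missingness(x, z, mechanism)
    print(mechanism, "missing values:", sum(v is None for v in masked))
```

In this sketch, gaps under MAR only appear where `z` is large, whereas gaps under NMAR only appear where `x` itself is large. In the NMAR case the observed values are therefore a biased sample of `x`, which is why this mechanism is the hardest to handle.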

Imputation of univariate time series

Most imputation algorithms rely on inter-attribute correlations. A univariate time series has no covariates, so its imputation is a special case: time dependencies have to be employed instead to perform an effective imputation. Techniques capable of imputing univariate time series can be roughly divided into three categories (Moritz et al. 2015):

1. Univariate algorithms: These algorithms work with univariate inputs, but typically do not employ the time series character of the data set. Examples are mean, mode, median and random sample.

2. Univariate time series algorithms: These algorithms also work with univariate inputs, but make use of the time series characteristics. Examples of simple algorithms in this category are last observation carried forward (locf), next observation carried backward (nocb), arithmetic smoothing and linear interpolation. The more advanced algorithms are based on structural time series models and can handle seasonality.

3. Multivariate algorithms on lagged data: Usually, multivariate algorithms cannot be applied to univariate data. But since time is an implicit variable in a time series, time information can be added as covariates, which makes it possible to apply multivariate imputation algorithms.
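To illustrate how these categories differ, here is a minimal sketch in plain Python of mean imputation (category 1), locf, nocb and linear interpolation (category 2), and a lagged-data layout (category 3). All function names are hypothetical and `None` stands for a missing value. This is an illustration of the ideas only, not the implementation used by any of the R packages.

```python
def impute_mean(series):
    """Category 1: replace every gap by the mean of the observed values."""
    observed = [v for v in series if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in series]


def impute_locf(series):
    """Category 2: last observation carried forward.

    Gaps before the first observation stay None.
    """
    filled, last = [], None
    for v in series:
        if v is not None:
            last = v
        filled.append(last)
    return filled


def impute_nocb(series):
    """Category 2: next observation carried backward (locf on the reversed series)."""
    return impute_locf(series[::-1])[::-1]


def impute_linear(series):
    """Category 2: linear interpolation between neighbouring observations.

    Gaps before the first or after the last observation stay None.
    """
    filled = list(series)
    known = [i for i, v in enumerate(series) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (series[right] - series[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = series[left] + step * (i - left)
    return filled


def lagged_rows(series, n_lags=2):
    """Category 3: one row (x[t-n_lags], ..., x[t-1], x[t]) per time step.

    The lagged columns can then serve as covariates for a multivariate
    imputation algorithm.
    """
    return [tuple(series[t - n_lags:t + 1]) for t in range(n_lags, len(series))]


series = [2.0, None, None, 8.0, 6.0, None, 4.0]

print(impute_mean(series))    # [2.0, 5.0, 5.0, 8.0, 6.0, 5.0, 4.0]
print(impute_locf(series))    # [2.0, 2.0, 2.0, 8.0, 6.0, 6.0, 4.0]
print(impute_nocb(series))    # [2.0, 8.0, 8.0, 8.0, 6.0, 4.0, 4.0]
print(impute_linear(series))  # [2.0, 4.0, 6.0, 8.0, 6.0, 5.0, 4.0]
print(lagged_rows(series)[0]) # (2.0, None, None)
```

Mean imputation ignores the position of a gap entirely, while the three time series methods fill each gap from its neighbours in time. The lagged rows still contain missing values. They only rearrange the series so that earlier time steps appear as additional columns.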

Missing value imputation with R

In general, imputation is well covered within R. Moritz et al. (2015) compiled a list of R packages that offer imputation tools and functions based on different strategies and algorithms:

missForest (imputation based on random forests)
mvnmle (maximum likelihood estimation)
mtsdi (expectation maximization)
yaImpute (nearest neighbor observation)
BaBooN (predictive mean matching)
CoImp (conditional copula specifications)
mice (multivariate imputation by chained equations)
Hmisc (multiple imputation)
Amelia (multiple imputation)
imputeR (general imputation framework)
VIM (visualization and imputation of missing values)
mitools (multiple imputation)
HotDeckImputation (Hot Deck imputation)
hot.deck (multiple Hot Deck imputation)
miceadds (multiple imputation)
mi (missing data imputation in an approximate Bayesian framework)
missMDA (missing values with multivariate data analysis)
ForImp (forward imputation algorithm)
DataCombine
and others…

Citation

The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us by e-mail at soga[at]zedat.fu-berlin.de.

You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follows: Hartmann, K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.