307051_Imputing_missing_values_with_zoo_and

In this section we apply different functions to perform imputation on two different data sets. The evaluation of the performance of imputation algorithms is in general flawed, since the actual values are in fact missing. Thus, a performance comparison can only be done for simulated missing data. Based on an algorithm provided by Moritz et al. 2015, we introduce missing data artificially in our data sets.

In order to evaluate the imputation algorithm we calculate the root mean squared error (RMSE) of the imputed data set with respect to the original data set.

The RMSE is given by

\[RMSE = \sqrt{\frac{\sum_{t=1}^n (\hat y_t - y_t)^2 }{n}}\]

Exercise: Write a small convenience function to calculate the RMSE.

## Your code here
RMSE <- function(sim, obs) {
  NULL
}

Show code

RMSE <- function(sim, obs) {
  sqrt(sum((sim - obs)^2) / length(sim))
}

In this section we apply the following imputation functions:

library(zoo)
library(forecast)

na.aggregate() (zoo): Generic function for replacing each NA with aggregated values. This allows imputing by the overall mean, by monthly means, etc.
na.locf() (zoo): Generic function for replacing each NA with the most recent non-NA prior to it. For each individual, missing values are replaced by the last observed value of that variable.
na.StructTS() (zoo): Generic function for filling NA values using seasonal Kalman filter.
na.interp() (forecast): Uses linear interpolation for non-seasonal series and a periodic STL-decomposition with seasonal series to replace missing values.

Imputing daily temperature data

We apply the imputation algorithms on the mean daily temperature data set from the weather station Berlin-Dahlem (FU).

load(url("https://userpage.fu-berlin.de/soga/data/r-data/NA_datasets.RData"))

The particular data sets of interest are the original data set, temp_sample, and the data set including the missing values: temp_NA. Let us plot the original data by applying the plot.ts() function.

plot.ts(temp_sample, main = "Weather station Berlin-Dahlem", ylab = "Temperature in °C")

Further, we calculate the percentage of missing values for the data set. The is.na() function in combination with the sum() function can be applied to calculate the number of NA values.

na_perc <- round(sum(is.na(temp_NA)) / length(temp_sample), 3) * 100
na_perc

## [1] 36.5

We see that the data set temp_NA consists of about 36.5% missing values. Let us plot the data for a better understanding.

plot.ts(temp_NA,
  ylab = expression("Temperature in °C"),
  main = paste("Missing values: ", na_perc, "%"),
  cex.main = 0.85,
  type = "o",
  cex = 0.3,
  pch = 16
)

OK, let us apply the imputation methods. The procedure is as follows:

First, we apply the imputation function.
Second, we calculate the RMSE, using our custom function RMSE().
Third, we plot the data to get a visual impression about the performance of the imputation algorithm.

`na.aggregate()`

A generic function for replacing each NA with aggregated values. This allows imputing by the overall mean, by monthly means, etc.

## IMPUTING BY OVERALL MEAN ##
temp_NA_imp <- na.aggregate(temp_NA, FUN = mean)

## ERROR ##
rmse_NA <- RMSE(temp_NA_imp, temp_sample)
rmse_NA

## [1] 4.392792

## PLOTTING ##
plot.ts(temp_NA_imp,
  ylab = expression("Temperature in °C"),
  main = "na.aggregate()",
  cex.main = 0.85, col = "red"
)

lines(temp_NA)
text(2013.5, -10, paste("RMSE: ", round(rmse_NA,4)), cex = 0.85)
legend("bottomright", legend = "Imputed values", lty = 1, col = "red", cex = 0.65)

## IMPUTING BY MONTHLY MEAN ##
temp_NA_imp <- na.aggregate(temp_NA, FUN = mean, by = as.yearmon) # mean per month

## ERROR ##
rmse_NA_high <- RMSE(temp_NA_imp, temp_sample)
rmse_NA_high

## [1] 2.098821

## PLOTTING ##
plot.ts(temp_NA_imp,
  ylab = "Temperature in °C",
  main = paste("na.aggregate()"),
  cex.main = 0.85,
  col = "red"
)

lines(temp_NA)

text(2013.5, -10, paste("RMSE: ", round(rmse_NA_high,4)), cex = 0.85)

legend("bottomright", legend = "Imputed values", lty = 1, col = "red", cex = 0.65)

`na.locf()`

A generic function for replacing each NA with the most recent non-NA prior to it. For each individual, missing values are replaced by the last observed value of that variable. With a value of fromLast = TRUE this corresponds to NOCB (next observation carried backward).

## IMPUTING ##
temp_NA_imp <- na.locf(temp_NA, fromLast = FALSE)

## ERROR ##
rmse_NA <- RMSE(temp_NA_imp, temp_sample)
rmse_NA

## [1] 1.639982

## PLOTTING ##
plot.ts(temp_NA_imp,
  ylab = expression("Temperature in °C"),
  main = "na.locf()",
  cex.main = 0.85, col = "red"
)

lines(temp_NA)

text(2013.5, -10,
  paste("RMSE: ", round(rmse_NA,4)),
  cex = 0.85
)

legend("bottomright",
  legend = "Imputed values",
  lty = 1, col = "red", cex = 0.65
)

`na.interp()`

The algorithm uses linear interpolation for non-seasonal series and a periodic STL-decomposition with seasonal series to replace missing values.

## IMPUTING ##
temp_NA_imp <- na.interp(temp_NA)

## ERROR ##
rmse_NA <- RMSE(temp_NA_imp, temp_sample)
rmse_NA

## [1] 1.109501

## PLOTTING ##
plot.ts(temp_NA_imp,
  ylab = expression("Temperature in °C"),
  main = "na.interp()",
  cex.main = 0.85, col = "red"
)

lines(temp_NA)

text(2013.5, -10, paste("RMSE: ", round(rmse_NA,4)), cex = 0.85)

legend("bottomright", legend = "Imputed values", lty = 1, col = "red", cex = 0.65)

In summary the imputation methods performed quite well. With respect to RMSE the last observation carried forward algorithm (na.locf()) performed best.

Imputing the carbon dioxide data set

Exercise: A friend of yours sent you a data set of carbon dioxide measurements taken at the Mauna Loa Observatory in Hawaii (co2_NA). Unfortunately the data was corrupted. Your task is to impute the missing values ind the data set.

First, let us load and plot the data:

load(url("https://userpage.fu-berlin.de/soga/data/r-data/NA_datasets.Rdata"))

plot.ts(co2_NA,
  main = "Atmospheric carbon dioxide, Mauna Loa, Hawaii",
  ylab = expression("CO"[2] * " in ppm")
)

Now, calculate the percentage of missing values, by applying the is.na() function in combination with the sum() function.

## Your code here...

Show code

na_perc_high <- round(sum(is.na(co2_NA)) / length(co2_NA), 3) * 100

The data set co2_NA consists of about 39.1% missing values.

In order to replace the missing data follow the steps outlined below:

Apply the imputation functions for at least one out of na.locf(), na.StructTS() and na.interp() from package forecaste.
Plot the data to get a visual impression about the performance of the imputation algorithm.
(Additional:) Calculate the RMSE, using our custom function RMSE() (The original data is given by co2.sample).

## Your code here...
library(forecast)
co2_NA_imp_interp <- NULL

Show code

## IMPUTING ##
library(forecast)
# na.locf
co2_NA_imp_interp <- na.interp(co2_NA)

rmse_interp <- RMSE(co2_NA_imp_interp, co2_sample)

## PLOTTING ##
plot.ts(co2_NA_imp_interp,
  ylab = "",
  main = paste("na.interp()"),
  cex.main = 0.85, col = "red"
)

lines(co2_NA)

text(2004, 400, paste("RMSE: ", round(rmse_interp, 4)), cex = 0.85)

legend("bottomright", legend = "Imputed values", lty = 1, col = "red", cex = 0.65)

## Your code here...
co2_NA_imp_StructTS <- NULL

Show code

## IMPUTING ##
# na.StructTS
co2_NA_imp_StructTS <- na.StructTS(co2_NA)

rmse_StructTS <- RMSE(co2_NA_imp_StructTS, co2_sample)

## PLOTTING ##
plot.ts(co2_NA_imp_StructTS,
  ylab = "",
  main = paste("na.StructTS()"),
  cex.main = 0.85, col = "red"
)

lines(co2_NA)

text(2004, 400, paste("RMSE: ",round(rmse_StructTS,4)), cex = 0.85)

legend("bottomright", legend = "Imputed values", lty = 1, col = "red", cex = 0.65)

## Your code here...
co2_NA_imp_interp <- NULL

Show code

## IMPUTING ##
# na.interp
co2_NA_imp_interp <- na.interp(co2_NA)

rmse_interp <- RMSE(co2_NA_imp_interp, co2_sample)

## PLOTTING ##
plot.ts(co2_NA_imp_interp,
  ylab = "",
  main = paste("na.interp()"),
  cex.main = 0.85, col = "red"
)

lines(co2_NA)

text(2004, 400, paste("RMSE: ", rmse_interp), cex = 0.85)

legend("bottomright", legend = "Imputed values", lty = 1, col = "red", cex = 0.65)

With respect to RMSE the algorithm implemented in the na.interp() function performed best. This suggests, that for data with a strong seasonal component the STL-decomposition approach seems to be well suited to replace missing values.

Citation

The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us via mail by soga[at]zedat.fu-berlin.de.

You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follow: Hartmann, K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.