In this section we apply several functions to impute missing values in two different data sets. Evaluating the performance of imputation algorithms on real gaps is inherently problematic, since the true values are unknown. A performance comparison is therefore only possible for simulated missing data. Based on an algorithm provided by Moritz et al. (2015), we introduce missing values artificially into our data sets.
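As a rough illustration of how gaps can be introduced artificially, the sketch below removes a fixed share of values completely at random from a toy series. This is not the algorithm of Moritz et al. (2015); the names `x`, `x.NA` and `frac.missing` are hypothetical and chosen only for this example.

```r
# Toy series standing in for a real data set (assumption for illustration)
set.seed(1)                      # make the random draw reproducible
x <- ts(sin(seq(0, 4 * pi, length.out = 100)))

frac.missing <- 0.2              # hypothetical share of values to remove
n.missing <- round(frac.missing * length(x))

# Draw positions without replacement and overwrite them with NA
idx <- sample(length(x), n.missing)
x.NA <- x
x.NA[idx] <- NA

sum(is.na(x.NA))                 # number of artificial gaps
```

Because the original series `x` is kept, any imputed version of `x.NA` can later be compared against it.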

To evaluate an imputation algorithm, we calculate the root mean squared error (RMSE) of the imputed data set with respect to the original data set.

The RMSE is given by

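\[RMSE = \sqrt{\frac{\sum_{t=1}^n (\hat y_t - y_t)^2 }{n}}\]

where \(\hat y_t\) is the value of the imputed data set at time \(t\), \(y_t\) is the corresponding value of the original data set, and \(n\) is the number of observations.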

Let us write a small convenience function to calculate RMSE.

# root mean squared error between the imputed (sim) and the original (obs) values
RMSE <- function(sim, obs){
  sqrt(sum((sim - obs)^2) / length(sim))
}
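To check the function against the formula, consider a case that can be computed by hand (the vectors are made up for illustration). Only one of four values deviates, by 4, so the RMSE is \(\sqrt{16/4} = 2\).

```r
RMSE <- function(sim, obs){
  sqrt(sum((sim - obs)^2) / length(sim))
}

obs <- c(1, 2, 3, 4)   # made-up original values
sim <- c(1, 2, 3, 8)   # made-up imputed values, the last one is off by 4

RMSE(sim, obs)
## [1] 2
```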

The imputation functions applied in this section are provided by the following packages:

library(zoo)
library(forecast)

Imputing daily temperature data

We apply the imputation algorithms to the mean daily temperature data set from the weather station Berlin-Dahlem (FU). We load the data sets into R via the load() function.

load(url("https://userpage.fu-berlin.de/soga/300/30100_data_sets/NA_datasets.Rdata"))

The particular data sets of interest are the original data set, temp.sample, and the data set with the missing values: temp.NA. Let us plot the original data by applying the plot.ts() function.

plot.ts(temp.sample, 
        main = 'Weather station Berlin-Dahlem', ylab = "°C")

Next, we plot the temp.NA data set to see where the gaps are located.

plot.ts(temp.NA, ylab = expression("°C"), 
        cex.main = 0.85,
        type = 'o', 
        cex = 0.3, 
        pch = 16)

Further, we calculate the percentage of missing values for the data set. The is.na() function in combination with the sum() function can be applied to calculate the number of NA values.

na.perc <- round(sum(is.na(temp.NA)) / length(temp.NA), 3) * 100
na.perc
## [1] 36.5

We see that about 36.5% of the values in the data set temp.NA are missing.
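Besides the overall share of missing values, the length of the individual gaps can matter for imputation. A minimal sketch on a made-up vector, using base R's rle() to count consecutive NA values (this step is an optional addition, not part of the original analysis):

```r
x <- c(1, NA, NA, 4, NA, 6, NA, NA, NA)  # made-up series with three gaps

mean(is.na(x)) * 100        # share of missing values in percent

runs <- rle(is.na(x))       # consecutive runs of NA / non-NA values
gap.lengths <- runs$lengths[runs$values]
gap.lengths
## [1] 2 1 3
```

The same two calls could be applied to temp.NA in place of x.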

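Let us now apply the imputation methods. The procedure is as follows: we impute the missing values, calculate the RMSE with respect to the original data set, and plot the imputed time series.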


na.aggregate()

na.aggregate() is a generic function from the zoo package that replaces each NA with an aggregated value. This allows imputing by the overall mean, by monthly means, etc.
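To make the difference between an overall mean and group-wise means concrete, here is a minimal base R sketch on made-up numbers. It only illustrates the idea and is not how zoo implements na.aggregate(); the helper name `fill.mean` is hypothetical.

```r
values <- c(1, NA, 3, 10, NA, 14)                     # made-up observations
month  <- c("Jan", "Jan", "Jan", "Feb", "Feb", "Feb") # made-up grouping

# Hypothetical helper: replace NA by the mean of the non-missing values
fill.mean <- function(v) {
  v[is.na(v)] <- mean(v, na.rm = TRUE)
  v
}

fill.mean(values)                     # overall mean: every NA becomes 7
## [1]  1  7  3 10  7 14
ave(values, month, FUN = fill.mean)   # group means: 2 for Jan, 12 for Feb
## [1]  1  2  3 10 12 14
```

Grouping by as.yearmon works in the same spirit, except that each month of each year forms its own group.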

## IMPUTING ##
# replace each NA with the mean of its year-month group
temp.NA.imp <- na.aggregate(temp.NA, by = as.yearmon, FUN = mean)
## ERROR ##
rmse.NA <- RMSE(temp.NA.imp, temp.sample)
rmse.NA
## [1] 2.098821
## PLOTTING ##
plot.ts(temp.NA.imp, ylab = expression("°C"), 
        main = "na.aggregate()",
        cex.main = 0.85, col = 'red')
points(temp.NA,
       cex = 0.3, 
       pch = 16)
text(2013.5, -10, paste('RMSE: ', round(rmse.NA,4)), cex = 0.85)
legend('bottomright', legend = 'Imputed values', 
       lty = 1, col = 'red', cex = 0.65)
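Note that the RMSE reported in the plot is computed over all time steps, including the observed ones, where the imputed series coincides with the original. A stricter variant, sketched here on made-up vectors and not part of the original analysis, restricts the error to the positions that were actually imputed.

```r
RMSE <- function(sim, obs){
  sqrt(sum((sim - obs)^2) / length(sim))
}

obs     <- c(1, 2, 3, 4)     # made-up original values
with.na <- c(1, NA, 3, NA)   # the same series with two artificial gaps
imp     <- c(1, 4, 3, 2)     # made-up imputation result

miss <- is.na(with.na)       # positions that had to be imputed

RMSE(imp, obs)               # over all positions: sqrt(8 / 4)
RMSE(imp[miss], obs[miss])   # over imputed positions only: sqrt(8 / 2)
## [1] 2
```

For the temperature data, the analogous call would be RMSE(temp.NA.imp[is.na(temp.NA)], temp.sample[is.na(temp.NA)]), assuming the objects support logical indexing in this way.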