In this section we apply different functions to perform imputation on two different data sets. Evaluating the performance of imputation algorithms on real gaps is inherently problematic, since the actual values are missing and there is nothing to compare the imputed values against. Thus, a performance comparison can only be done for simulated missing data. Based on an algorithm provided by Moritz et al. (2015), we introduce missing data artificially into our data sets.
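To make the idea of simulated missingness concrete, here is a minimal sketch that removes values completely at random from a complete series. It is a simplification for illustration and not the algorithm of Moritz et al.; the helper name `add_random_NA()` and the toy series are assumptions made up for this example.

```r
# Hypothetical helper: replace a given fraction of values with NA,
# chosen completely at random (a simplification, not Moritz et al.'s method)
add_random_NA <- function(x, frac, seed = 1) {
  set.seed(seed)                           # reproducible choice of positions
  n_missing <- round(frac * length(x))     # number of values to remove
  idx <- sample(seq_along(x), n_missing)   # positions to set to NA
  x[idx] <- NA
  x
}

# Toy series standing in for a complete data set
x_full <- sin(seq(0, 4 * pi, length.out = 100))
x_NA <- add_random_NA(x_full, frac = 0.3)

sum(is.na(x_NA))   # 30 values were removed
```

Because the complete series `x_full` is kept, any imputation of `x_NA` can afterwards be compared against the known truth.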

In order to evaluate the imputation algorithm we calculate the root mean squared error (RMSE) of the imputed data set with respect to the original data set.

The RMSE is given by

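\[RMSE = \sqrt{\frac{\sum_{t=1}^n (\hat y_t - y_t)^2 }{n}}\]

where \(\hat y_t\) is the value of the imputed series at time step \(t\), \(y_t\) is the corresponding value of the original series, and \(n\) is the number of observations. Since the sum runs over all \(n\) time steps, positions that were never missing contribute an error of zero.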

Exercise: Write a small convenience function to calculate the RMSE.

```
## Your code here
RMSE <- function(sim, obs) {
  NULL
}
```

A possible solution:

```
RMSE <- function(sim, obs) {
  # sim: imputed (simulated) series, obs: original (observed) series
  sqrt(sum((sim - obs)^2) / length(sim))
}
```
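As a quick sanity check, the function can be tried on small hand-made vectors where the result is easy to verify by hand. The vectors below are made up for illustration. The second function, `RMSE_missing()`, is a hypothetical variant that is not part of this section; it restricts the error to the positions that were actually imputed, which gives a larger value than the RMSE over the full series whenever some values were observed.

```r
RMSE <- function(sim, obs) {
  sqrt(sum((sim - obs)^2) / length(sim))
}

# Hypothetical variant: error over the imputed positions only
RMSE_missing <- function(sim, obs, with_NA) {
  idx <- is.na(with_NA)
  sqrt(mean((sim[idx] - obs[idx])^2))
}

obs     <- c(1, 2, 3, 4)     # "true" values
with_NA <- c(1, NA, 3, NA)   # same series with two gaps
sim     <- c(1, 4, 3, 4)     # gaps filled, one of them badly

RMSE(sim, obs)                    # sqrt(4 / 4) = 1
RMSE_missing(sim, obs, with_NA)   # sqrt(4 / 2), about 1.414
```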

In this section we apply the following imputation functions from the `zoo` and `forecast` packages:

```
library(zoo)
library(forecast)
```

- `na.aggregate()` (`zoo`): Generic function for replacing each `NA` with aggregated values. This allows imputing by the overall mean, by monthly means, etc.
- `na.locf()` (`zoo`): Generic function for replacing each `NA` with the most recent non-`NA` prior to it. For each individual, missing values are replaced by the last observed value of that variable.
- `na.StructTS()` (`zoo`): Generic function for filling `NA` values using a seasonal Kalman filter.
- `na.interp()` (`forecast`): Uses linear interpolation for non-seasonal series and a periodic STL decomposition for seasonal series to replace missing values.
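To get a feeling for how the two simplest functions behave, the following sketch applies `na.aggregate()` and `na.locf()` to a tiny series. The values are made up for illustration, and the example assumes the `zoo` package is installed; `na.StructTS()` and `na.interp()` need a longer (seasonal) series and are not covered here.

```r
library(zoo)

# Toy series with three gaps (values are made up)
x <- zoo(c(2, NA, NA, 5, NA))

x_mean <- na.aggregate(x)   # each NA becomes the mean of the observed values
x_locf <- na.locf(x)        # each NA becomes the last observed value before it

coredata(x_mean)   # 2.0 3.5 3.5 5.0 3.5
coredata(x_locf)   # 2 2 2 5 5
```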

We apply the imputation algorithms to the *mean daily temperature* data set from the weather station Berlin-Dahlem (FU).

`load(url("https://userpage.fu-berlin.de/soga/data/r-data/NA_datasets.RData"))`

The particular data sets of interest are the original data set, `temp_sample`, and the data set including the missing values, `temp_NA`. Let us plot the original data by applying the `plot.ts()` function.

`plot.ts(temp_sample, main = "Weather station Berlin-Dahlem", ylab = "Temperature in °C")`

Further, we calculate the percentage of missing values for the data set. The `is.na()` function in combination with the `sum()` function can be applied to calculate the number of `NA` values.

```
na_perc <- round(sum(is.na(temp_NA)) / length(temp_sample), 3) * 100
na_perc
```

`## [1] 36.5`

We see that about 36.5% of the values in the data set `temp_NA` are missing. Let us plot the data for a better understanding.

```
plot.ts(temp_NA,
ylab = expression("Temperature in °C"),
main = paste("Missing values: ", na_perc, "%"),
cex.main = 0.85,
type = "o",
cex = 0.3,
pch = 16
)
```

OK, let us apply the imputation methods. The procedure is as follows:

- First, we apply the imputation function.
- Second, we calculate the RMSE, using our custom function `RMSE()`.
- Third, we plot the data to get a visual impression of the performance of the imputation algorithm.
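Since the first two steps are repeated for every method, they could be bundled into a small wrapper. The sketch below is one possible way to do this and is not used elsewhere in this section: `evaluate_imputation()` and the stand-in imputer `fill_mean()` are hypothetical names, and the toy vectors are made up. The plotting step is omitted here.

```r
RMSE <- function(sim, obs) {
  sqrt(sum((sim - obs)^2) / length(sim))
}

# Hypothetical wrapper: impute, then score against the known truth
evaluate_imputation <- function(x_NA, x_true, impute_fun, ...) {
  x_imp <- impute_fun(x_NA, ...)                      # step 1: impute
  list(imputed = x_imp, rmse = RMSE(x_imp, x_true))   # step 2: RMSE
}

# Stand-in imputer for illustration: fill gaps with the overall mean
fill_mean <- function(x) replace(x, is.na(x), mean(x, na.rm = TRUE))

x_true <- c(1, 2, 3, 4, 5)     # complete toy series
x_NA   <- c(1, NA, 3, NA, 5)   # same series with two gaps

res <- evaluate_imputation(x_NA, x_true, fill_mean)
res$imputed   # 1 3 3 3 5
res$rmse      # sqrt(2 / 5), about 0.632
```

Any function that takes a series with `NA` values and returns a filled series of the same length could be passed as `impute_fun`.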

`na.aggregate()`

A generic function for replacing each `NA` with aggregated values. This allows imputing by the overall mean, by monthly means, etc.

```
## IMPUTING BY OVERALL MEAN ##
temp_NA_imp <- na.aggregate(temp_NA, FUN = mean)
## ERROR ##
rmse_NA <- RMSE(temp_NA_imp, temp_sample)
rmse_NA
```

`## [1] 4.392792`

```
## PLOTTING ##
plot.ts(temp_NA_imp,
ylab = expression("Temperature in °C"),
main = "na.aggregate()",
cex.main = 0.85, col = "red"
)
lines(temp_NA)
text(2013.5, -10, paste("RMSE: ", round(rmse_NA,4)), cex = 0.85)
legend("bottomright", legend = "Imputed values", lty = 1, col = "red", cex = 0.65)
```
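Imputing by the overall mean fills every gap with the same value. As mentioned in the description of `na.aggregate()`, the function can also aggregate within groups, for example by month, via its `by` argument. The sketch below illustrates the difference on a made-up daily series that is indexed by dates; it assumes the `zoo` package is installed and makes no claim about how `temp_NA` itself is indexed.

```r
library(zoo)

# Toy daily series spanning two months; values and dates are made up
dates <- as.Date("2020-01-01") + c(0, 1, 2, 31, 32, 33)   # 1-3 Jan, 1-3 Feb
z <- zoo(c(0, NA, 2, 10, NA, 14), dates)

# One mean for the whole series
z_overall <- na.aggregate(z, FUN = mean)

# One mean per year-month; `by` is applied to the time index to form groups
z_monthly <- na.aggregate(z, by = as.yearmon, FUN = mean)

coredata(z_overall)   # 0.0  6.5  2.0 10.0  6.5 14.0
coredata(z_monthly)   # 0  1  2 10 12 14
```

In this toy example the monthly means (1 and 12) sit much closer to the neighbouring observations than the overall mean (6.5) does.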