In this section we apply different functions to perform imputation on two different data sets. The evaluation of the performance of imputation algorithms is in general flawed, since the actual values are in fact missing. Thus, a performance comparison can only be done for simulated missing data. Based on an algorithm provided by Moritz et al. 2015, we introduce missing data artificially in our data sets.
# First, let's import the needed libraries.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from datetime import datetime
In order to evaluate the imputation algorithm we calculate the root mean squared error (RMSE) of the imputed data set with respect to the original data set.
The RMSE is given by
$$RMSE = \sqrt{\frac{\sum_{t=1}^n (\hat y_t - y_t)^2 }{n}}$$Exercise: Write a small convenience function to calculate the RMSE.
## Your code here
def RMSE(sim, obs):
return "Built a funtion!"
## ....
def RMSE(sim, obs):
return np.sqrt(np.sum((sim - obs) ** 2) / len(sim))
In this section we apply the following imputation functions:
We apply the imputation algorithms on the mean daily temperature data set from the weather station Berlin-Dahlem (FU).
temp_NA = pd.read_json(
"http://userpage.fu-berlin.de/soga/soga-py/300/307000_time_series/NA_datasets_temp_NA.json"
).sort_index()
temp_sample = pd.read_json(
"http://userpage.fu-berlin.de/soga/soga-py/300/307000_time_series/NA_datasets_temp_sample.json"
).sort_index()
temp_sample
temp.sample | |
---|---|
0 | 6.2 |
1 | 4.7 |
2 | 7.0 |
3 | 7.9 |
4 | 6.7 |
... | ... |
737 | 3.3 |
738 | 6.2 |
739 | 8.5 |
740 | 3.4 |
741 | 5.5 |
742 rows × 1 columns
The particular data sets of interest are the original data set, temp_sample
, and the data set including the missing values: temp_NA
. Let us plot the original data.
plt.figure(figsize=(18, 6))
plt.plot(temp_sample, color="black")
plt.title("Weather station Berlin-Dahlem", fontsize=16)
plt.ylabel("Temperature in °C", fontsize=16)
plt.xlabel("Days", fontsize=16)
plt.show()
Further, we calculate the percentage of missing values for the data set. The isnull()
function in combination with the sum()
function can be applied to calculate the number of NA
values.
na_perc = round(temp_NA.isnull().sum() / len(temp_sample), 3) * 100
na_perc
temp.NA 36.5 dtype: float64
We see that the data set temp_NA
consists of about 36.5% missing values. Let us plot the data for a better understanding.
plt.figure(figsize=(18, 6))
plt.plot(temp_NA, color="black")
plt.title(f"Missing values: {na_perc.values[0]} %", fontsize=16)
plt.ylabel("Temperature in °C", fontsize=16)
plt.xlabel("Days", fontsize=16)
plt.show()
OK, let us apply the imputation methods. The procedure is as follows:
RMSE()
.There are three methods to fill the holes in reindexed Series.
The .fillna()
function comes with two different methods: forward-fill and back-fill (note this does not fill NaNs that already were present). This way, the NA
's get filled with the last valid observation or with the next valid observation:
.fillna(method="ffill")
¶Note, that you can also specify a fill value, for example
fill_value = -9999
# forward-fill
temp_NA_ffill = temp_NA.fillna(method="ffill")
temp_NA_ffill.head(10)
temp.NA | |
---|---|
0 | 6.2 |
1 | 6.2 |
2 | 7.0 |
3 | 7.0 |
4 | 7.0 |
5 | 6.2 |
6 | 3.7 |
7 | 5.1 |
8 | 6.4 |
9 | 2.6 |
.fillna(method='bfill')
¶# back-fill
temp_NA_bfill = temp_NA.fillna(method="bfill")
temp_NA_bfill.head(10)
temp.NA | |
---|---|
0 | 6.2 |
1 | 7.0 |
2 | 7.0 |
3 | 6.2 |
4 | 6.2 |
5 | 6.2 |
6 | 3.7 |
7 | 5.1 |
8 | 6.4 |
9 | 2.6 |
## ERROR ##
rmse_NA = RMSE(temp_NA_bfill.values, temp_sample.values)
rmse_NA
1.768216699529771
plt.figure(figsize=(18, 6))
plt.plot(temp_NA_bfill, color="red", label="imputed values")
plt.plot(temp_NA, color="black")
plt.title(".fillna(method='bfill')", fontsize=16)
plt.ylabel("Temperature in °C", fontsize=16)
plt.xlabel("Days", fontsize=16)
plt.text(0, 20, f"RMSE ={round(rmse_NA,4)}", fontsize=16)
plt.legend()
plt.show()
# over-all-mean-fill
temp_NA_meanfill = temp_NA.fillna(value=temp_NA.mean())
temp_NA_meanfill.head(10)
temp.NA | |
---|---|
0 | 6.200000 |
1 | 10.315924 |
2 | 7.000000 |
3 | 10.315924 |
4 | 10.315924 |
5 | 6.200000 |
6 | 3.700000 |
7 | 5.100000 |
8 | 6.400000 |
9 | 2.600000 |
## ERROR ##
rmse_NA = RMSE(temp_NA_meanfill.values, temp_sample.values)
rmse_NA
4.392792211787441
plt.figure(figsize=(18, 6))
plt.plot(temp_NA_meanfill, color="red", label="imputed values")
plt.plot(temp_NA, color="black")
plt.title(".fillna() using overall mean", fontsize=16)
plt.ylabel("Temperature in °C", fontsize=16)
plt.xlabel("Days", fontsize=16)
plt.text(0, 20, f"RMSE ={round(rmse_NA,4)}", fontsize=16)
plt.legend()
plt.show()
NA
with aggregated values: .interpolate()
¶A more sophisticated method is the .interpolate()
function. It is generic function for replacing each NA
with aggregated values.
The argument limit
determines maximum number of consecutive NAs to be interpolated (must be >0). It is important to set the argument method
to make sure, that the function is working based on our desired mathematical methods. Here, we will use method = 'linear'
, since have no information about the time. When dealing with DatetimeIndexed data, you might want to consider method = 'time'
.
## LINEAR IMPUTING ##
temp_NA_inter = temp_NA.interpolate(method="linear").copy()
temp_NA_inter.head(10)
temp.NA | |
---|---|
0 | 6.200000 |
1 | 6.600000 |
2 | 7.000000 |
3 | 6.733333 |
4 | 6.466667 |
5 | 6.200000 |
6 | 3.700000 |
7 | 5.100000 |
8 | 6.400000 |
9 | 2.600000 |
We will now use that interpolated series to calculate the RMSE, using our custom function RMSE()
.
## ERROR ##
rmse_NA = RMSE(temp_NA_inter.values, temp_sample.values)
rmse_NA
1.1095012631275054
Finally, we plot the data to get a visual impression about the performance of the imputation algorithm.
plt.figure(figsize=(18, 6))
plt.plot(temp_NA_inter, color="red", label="imputed values")
plt.plot(temp_NA, color="black")
plt.title(".interpolate(method='linear')", fontsize=16)
plt.ylabel("Temperature in °C", fontsize=16)
plt.xlabel("Days", fontsize=16)
plt.text(0, 20, f"RMSE ={round(rmse_NA,4)}", fontsize=16)
plt.legend()
plt.show()
In summary the imputation methods performed quite well. With respect to RMSE the linear Interpolation algorithm (.interpolate()
) performed best.
Exercise: A friend of yours sent you a data set of carbon dioxide measurements taken at the Mauna Loa Observatory in Hawaii (
co2_NA
). Unfortunately the data was corrupted. Your task is to impute the missing values in the data set.
First, let us load and plot the data:
# read json files from directory
co2_NA = pd.read_json(
"http://userpage.fu-berlin.de/soga/soga-py/300/307000_time_series/NA_datasets_co2_NA.json"
).sort_index()
co2_sample = pd.read_json(
"http://userpage.fu-berlin.de/soga/soga-py/300/307000_time_series/NA_datasets_co2_sample.json"
).sort_index()
co2_sample
co2.sample | |
---|---|
0 | 369.29 |
1 | 369.54 |
2 | 370.60 |
3 | 371.82 |
4 | 371.58 |
... | ... |
187 | 398.93 |
188 | 397.63 |
189 | 398.29 |
190 | 400.16 |
191 | 401.85 |
192 rows × 1 columns
plt.figure(figsize=(18, 6))
plt.plot(co2_NA, color="black")
plt.title("Atmospheric carbon dioxide, Mauna Loa, Hawaii", fontsize=16)
plt.ylabel("$CO_2$", fontsize=16)
plt.xlabel("Time", fontsize=16)
plt.show()
Now, calculate the percentage of missing values, by applying the
is.na()
function in combination with thesum()
function.
## Your code here...
na_perc_high = round(co2_NA.isnull().sum() / len(co2_NA), 3) * 100
na_perc_high
co2.NA 39.1 dtype: float64
The data set co2.NA
consists of about 39.1% missing values.
In order to replace the missing data follow the steps outlined below:
.fillna()
and .interpolate()
.RMSE()
(The original data is given by co2_sample
).## Your code here...
.fillna(method='bfill')
¶# backward-fill
co2_NA_bfill = co2_NA.fillna(method="bfill")
## ERROR ##
index = len(co2_NA_bfill)
co2_NA_bfill_clean = np.delete(
co2_NA_bfill.values, index - 1
) ## because of the method, last element is nan
rmse_NA = RMSE(co2_NA_bfill_clean, co2_sample.values)
rmse_NA
188.1500393901449
## PLOT ##
plt.figure(figsize=(18, 6))
plt.plot(co2_NA_bfill, color="red", label="imputed values")
plt.plot(co2_NA, color="black")
plt.title(".fillna(method='bfill')", fontsize=16)
plt.ylabel("$CO_2$", fontsize=16)
plt.xlabel("Time", fontsize=16)
plt.text(0, 380, f"RMSE ={round(rmse_NA,4)}", fontsize=16)
plt.legend()
plt.show()
NA
with aggregated values: .interpolate()
¶## IMPUTING BY OVERALL MEAN ##
co2_NA_inter = co2_NA.interpolate(method="linear").copy()
co2_NA_inter.head(10)
## ERROR ##
rmse_NA = RMSE(co2_NA_inter.values, co2_sample.values)
rmse_NA
## PLOT ##
plt.figure(figsize=(18, 6))
plt.plot(co2_NA_inter, color="red", label="imputed values")
plt.plot(co2_NA, color="black")
plt.title(".interpolate(method='linear')", fontsize=16)
plt.ylabel("$CO_2$", fontsize=16)
plt.xlabel("Time", fontsize=16)
plt.text(0, 380, f"RMSE ={round(rmse_NA,4)}", fontsize=16)
plt.legend()
plt.show()
With respect to RMSE the algorithm implemented in the .interpolate()
function performed best. This suggests, that for data with a strong seasonal component the STL-decomposition approach seems to be well suited to replace missing values.
Citation
The E-Learning project SOGA-Py was developed at the Department of Earth Sciences by Annette Rudolph, Joachim Krois and Kai Hartmann. You can reach us via mail by soga[at]zedat.fu-berlin.de.
Please cite as follow: Rudolph, A., Krois, J., Hartmann, K. (2023): Statistics and Geodata Analysis using Python (SOGA-Py). Department of Earth Sciences, Freie Universitaet Berlin.