Kernel smoothing is a moving average smoother that uses a weight function, also referred to as a kernel, to average the observations. The kernel smoothing function is estimated by
$$\hat f_t = \sum_{i=1}^n w_i(t)x_i\text{,}$$where
$$w_i(t) = \frac{K \left(\frac{t-i}{b}\right)}{\sum_{j=1}^n K\left(\frac{t-j}{b}\right)}$$are the weights, $n$ is the number of observations, and $K(\cdot)$ is a kernel function, typically the normal kernel, $K(z) = \frac{1}{\sqrt{2 \pi}}\exp(-z^2/2)$. The wider the bandwidth, $b$, the smoother the result.
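To make this concrete, here is a minimal sketch that computes the smoothed value at a single time point directly from the definition above; the series, the bandwidth and the time point are made-up, purely illustrative values:
## minimal sketch of the formula above, using a tiny made-up series
import numpy as np

x_obs = np.array([1.0, 3.0, 2.0, 5.0, 4.0, 6.0])  # example observations x_1, ..., x_n
n = len(x_obs)
b = 2.0  # bandwidth
t = 3  # time point at which we smooth

i = np.arange(1, n + 1)  # observation indices 1, ..., n
K = np.exp(-(((t - i) / b) ** 2) / 2) / np.sqrt(2 * np.pi)  # normal kernel K((t - i) / b)
w = K / K.sum()  # weights w_i(t), summing to 1
f_hat_t = np.sum(w * x_obs)  # smoothed value at time t
print(f_hat_t)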
In this chapter, we will demonstrate two approaches to kernel smoothing. The first is a step-by-step implementation; the second uses the scipy.ndimage module and therefore requires much less code.
# First, let's import the needed libraries.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from datetime import datetime
The kernel specifies the shape of the function used to create the average of the neighboring points. Here we introduce the Gaussian or normal kernel, which has the shape of a normal distribution curve (the standard Gaussian kernel has a mean of 0 and a standard deviation of 1).
plt.figure(figsize=(8, 4))
x = np.arange(-5, 5, 0.1)
y = 1 / np.sqrt(2 * np.pi) * np.exp(-(x**2) / 2.0)
plt.plot(x, y, color="black")
plt.title("Gaussian Kernel")
plt.axis("off")
plt.show()
# read the Earth_Surface_Temperature data ("t_global.json"), processed in the previous section
t_global = pd.read_json(
    "http://userpage.fu-berlin.de/soga/soga-py/300/307000_time_series/t_global.json"
)
t_global["Date"] = pd.to_datetime(t_global["Date"], format="%Y-%m-%d", errors="coerce")
# subset
temp_global = t_global.set_index("Date")["1850-01-01":"2021-12-31"][
    "Monthly Anomaly_global"
]
As we already know, the shape of the Gaussian normal distribution is determined by its standard deviation, σ. Below, the kernel weights for one position of the time series are depicted.
## define our x,y data
y = temp_global.values
x = np.arange(0, len(temp_global))
sigma = 25
x_position = 100
kernel_at_pos = np.exp(-((x - x_position) ** 2) / (2 * sigma**2))  # Gaussian weights centered at x_position
kernel_at_pos = kernel_at_pos / sum(kernel_at_pos)  # normalize so the weights sum to 1
plt.plot(x, kernel_at_pos)
plt.show()
In general, smoothing processes the data point by point: for each data point, a new value is calculated as a function of the initial value at that point and of the surrounding data points (see the figure above).
In the first approach, we use the Gaussian normal distribution as the smoothing function, which gives us the weights. We then multiply these Gaussian weights by the original values and sum the results to obtain the new smoothed value for that point. Doing this for each point yields the smoothed version of our initial data. Here is a rather inefficient but straightforward implementation:
Note: A nice tutorial on the method is given here.
## example for sigma = 2
sigma = 2
smoothed_vals_2 = np.zeros(y.shape)
for x_position in x:
    kernel = np.exp(-((x - x_position) ** 2) / (2 * sigma**2))  # Gaussian weights centered at x_position
    kernel = kernel / sum(kernel)  # normalize the weights so they sum to 1
    smoothed_vals_2[x_position] = sum(y * kernel)  # weighted average = smoothed value

## example for sigma = 25
sigma = 25
smoothed_vals_25 = np.zeros(y.shape)
for x_position in x:
    kernel = np.exp(-((x - x_position) ** 2) / (2 * sigma**2))
    kernel = kernel / sum(kernel)
    smoothed_vals_25[x_position] = sum(y * kernel)
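As an optional aside, the same computation can be vectorized by building the full weight matrix in one step; the following sketch reuses the x and y arrays defined above and reproduces smoothed_vals_25 up to floating-point error:
## optional, vectorized variant of the loop above
sigma = 25
dist_sq = (x[:, None] - x[None, :]) ** 2  # squared distances between all pairs of positions
weights = np.exp(-dist_sq / (2 * sigma**2))  # Gaussian weights, one row per position
weights = weights / weights.sum(axis=1, keepdims=True)  # normalize each row so the weights sum to 1
smoothed_vals_vec = weights @ y  # weighted averages for all positions at once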
fig, ax = plt.subplots(figsize=(18, 6))
ax.plot(
    x,
    temp_global.values,
    linestyle="-",
    linewidth=0.5,
    label="Monthly Data",
    color="grey",
)
ax.plot(
    x,
    smoothed_vals_2,
    color="red",
    markersize=2,
    linestyle="-",
    label="sigma = 2",
)
ax.plot(
    x,
    smoothed_vals_25,
    color="limegreen",
    markersize=2,
    linestyle="-",
    label="sigma = 25",
)
ax.set_title("Kernel Smoothing")
ax.legend()
plt.show()
The second approach we will look at uses much less code, but is also a bit harder to understand. We apply gaussian_filter1d from the scipy.ndimage module, which is intended for linearly shaped data. The function takes an input array and a value for sigma.
Please note: The function gaussian_kde from scipy.stats does not smooth a line! It actually smooths the estimated density distribution of a dataset. If your data is in a linear x/y format, this method does not apply!
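To make the distinction concrete, here is a small, optional sketch of what gaussian_kde actually estimates for our data, namely the probability density of the anomaly values rather than a smoothed curve over time:
## aside: gaussian_kde estimates a density of the values, not a smoothed time series
from scipy.stats import gaussian_kde

vals = temp_global.dropna().values
kde = gaussian_kde(vals)  # kernel density estimate of the anomaly values
grid = np.linspace(vals.min(), vals.max(), 200)
plt.plot(grid, kde(grid), color="black")  # estimated probability density
plt.title("Density of the monthly anomalies (gaussian_kde)")
plt.show()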
from scipy.ndimage import gaussian_filter1d
## set x/y data
y = temp_global.values
x = np.arange(0, len(temp_global))
sigma = 2
x_g1d_2 = gaussian_filter1d(x, sigma)
y_g1d_2 = gaussian_filter1d(y, sigma)
sigma = 25
x_g1d_25 = gaussian_filter1d(x, sigma)
y_g1d_25 = gaussian_filter1d(y, sigma)
fig, ax = plt.subplots(figsize=(18, 6))
ax.plot(
    x,
    temp_global.values,
    linestyle="-",
    linewidth=0.5,
    label="Monthly Data",
    color="grey",
)
ax.plot(
    x_g1d_2,
    y_g1d_2,
    color="red",
    markersize=2,
    linestyle="-",
    label="sigma = 2",
)
ax.plot(
    x_g1d_25,
    y_g1d_25,
    color="limegreen",
    markersize=2,
    linestyle="-",
    label="sigma = 25",
)
ax.set_title("Kernel Smoothing")
ax.legend()
plt.show()
Perfect! Both methods yield the same result!
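If you want to verify this numerically, you can compare the two sigma = 25 results away from the series boundaries, where the two approaches differ slightly (the manual loop renormalizes the window at the edges, while gaussian_filter1d reflects the data by default):
## optional numerical check in the interior of the series
diff = np.abs(smoothed_vals_25[200:-200] - y_g1d_25[200:-200])
print(f"maximum absolute difference in the interior: {diff.max():.2e}")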
Citation
The E-Learning project SOGA-Py was developed at the Department of Earth Sciences by Annette Rudolph, Joachim Krois and Kai Hartmann. You can reach us via e-mail at soga[at]zedat.fu-berlin.de.
Please cite as follows: Rudolph, A., Krois, J., Hartmann, K. (2023): Statistics and Geodata Analysis using Python (SOGA-Py). Department of Earth Sciences, Freie Universitaet Berlin.