Spline smoothing is an extension of polynomial regression. In spline smoothing the time $t = 1, ..., n$ is divided into $k$ intervals,$[t_0=1,t_1], [t_i+1, t_2], ..., [t_{k-1}+1, t_k=n]$. The values $t_0, t_1, ...,t_k$ are called knots. Then, in each interval a polynomial regression of the form
$$f_t = \beta_0+\beta_1t+...+\beta_pt^p$$is fitted, where typically $p=3$, which is then called cubic spline. The regression is fitted by minimizing
$$\sum_{t=1}^n (x_t-f_t)^2+\lambda\int(f^{"}_t)^2df\text{,}$$where $f_t$ is a cubic spline with a knot at each $t$. This optimization results in a compromise between the fit and the degree of smoothness, which is controlled by $\lambda \ge 0$. As $\lambda \to 0$ (no smoothing), the smoothing spline converges to the interpolating spline, and as $\lambda \to \infty$ (infinite smoothing), the roughness penalty becomes paramount and the estimate converges to a linear least squares estimate (Shumway and Stoffer 2011).
# First, let's import the needed libraries.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from datetime import datetime
# read the Earth_Surface_Temperature data, processed in the previous section
# read the data from .json file
t_global = pd.read_json(
"http://userpage.fu-berlin.de/soga/soga-py/300/307000_time_series/t_global.json"
)
t_global["Date"] = pd.to_datetime(t_global["Date"], format="%Y-%m-%d", errors="coerce")
# subset
temp_global = t_global.set_index("Date")["1850-01-01":"2021-12-31"][
"Monthly Anomaly_global"
]
In Python the spline smoothing is implemented in the scipy.interpolate.UnivariateSpline
function, which fits a cubic smoothing spline to the supplied data.
k
is the degree of the smoothing spline. Must be 1 <= k <= 5. k = 3
is a cubic spline (default is 3). The smoothing parameter is called s
. This positive smoothing factor used to choose the number of knots. The number of knots will be increased until the smoothing condition is satisfied. Default is None.
## convert datetimeindex to float
temp_global.index = temp_global.index.values.astype(float)
from scipy.interpolate import UnivariateSpline
temp_spl_1 = UnivariateSpline(temp_global.index, temp_global.values, k=3)
temp_spl_2 = UnivariateSpline(temp_global.index, temp_global.values, k=3, s=200)
temp_spl_3 = UnivariateSpline(temp_global.index, temp_global.values, k=3, s=250)
# Plot daily and weekly resampled time series together
fig, ax = plt.subplots(figsize=(18, 6))
ax.plot(
temp_global,
marker=".",
linestyle="-",
linewidth=0.5,
label="Daily Data",
color="lightgrey",
)
ax.plot(
temp_global.index,
temp_spl_1(temp_global.index),
marker="o",
markersize=2,
linestyle="-",
label="s = None",
)
ax.plot(
temp_global.index,
temp_spl_2(temp_global.index),
marker="o",
markersize=2,
linestyle="-",
label="s = 200",
)
ax.plot(
temp_global.index,
temp_spl_3(temp_global.index),
marker="o",
markersize=2,
linestyle="-",
label="s = 250",
)
ax.set_title("Smoothing splines")
ax.legend()
plt.show()
Citation
The E-Learning project SOGA-Py was developed at the Department of Earth Sciences by Annette Rudolph, Joachim Krois and Kai Hartmann. You can reach us via mail by soga[at]zedat.fu-berlin.de.
Please cite as follow: Rudolph, A., Krois, J., Hartmann, K. (2023): Statistics and Geodata Analysis using Python (SOGA-Py). Department of Earth Sciences, Freie Universitaet Berlin.