Spline smoothing is an extension of polynomial regression. In spline smoothing the time $t = 1, ..., n$ is divided into $k$ intervals,$[t_0=1,t_1], [t_i+1, t_2], ..., [t_{k-1}+1, t_k=n]$. The values $t_0, t_1, ...,t_k$ are called knots. Then, in each interval a polynomial regression of the form

$$f_t = \beta_0+\beta_1t+...+\beta_pt^p$$

is fitted, where typically $p=3$, which is then called cubic spline. The regression is fitted by minimizing

$$\sum_{t=1}^n (x_t-f_t)^2+\lambda\int(f^{"}_t)^2df\text{,}$$

where $f_t$ is a cubic spline with a knot at each $t$. This optimization results in a compromise between the fit and the degree of smoothness, which is controlled by $\lambda \ge 0$. As $\lambda \to 0$ (no smoothing), the smoothing spline converges to the interpolating spline, and as $\lambda \to \infty$ (infinite smoothing), the roughness penalty becomes paramount and the estimate converges to a linear least squares estimate (Shumway and Stoffer 2011).

In [2]:
# First, let's import the needed libraries.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from datetime import datetime
In [3]:
# read the Earth_Surface_Temperature data, processed in the previous section
# read the data from .json file
t_global = pd.read_json(
    "http://userpage.fu-berlin.de/soga/soga-py/300/307000_time_series/t_global.json"
)
t_global["Date"] = pd.to_datetime(t_global["Date"], format="%Y-%m-%d", errors="coerce")

# subset
temp_global = t_global.set_index("Date")["1850-01-01":"2021-12-31"][
    "Monthly Anomaly_global"
]

In Python the spline smoothing is implemented in the scipy.interpolate.UnivariateSpline function, which fits a cubic smoothing spline to the supplied data.

k is the degree of the smoothing spline. Must be 1 <= k <= 5. k = 3 is a cubic spline (default is 3). The smoothing parameter is called s. This positive smoothing factor used to choose the number of knots. The number of knots will be increased until the smoothing condition is satisfied. Default is None.

In [4]:
## convert datetimeindex to float

temp_global.index = temp_global.index.values.astype(float)
In [5]:
from scipy.interpolate import UnivariateSpline

temp_spl_1 = UnivariateSpline(temp_global.index, temp_global.values, k=3)
temp_spl_2 = UnivariateSpline(temp_global.index, temp_global.values, k=3, s=200)
temp_spl_3 = UnivariateSpline(temp_global.index, temp_global.values, k=3, s=250)
In [6]:
# Plot daily and weekly resampled time series together
fig, ax = plt.subplots(figsize=(18, 6))


ax.plot(
    temp_global,
    marker=".",
    linestyle="-",
    linewidth=0.5,
    label="Daily Data",
    color="lightgrey",
)

ax.plot(
    temp_global.index,
    temp_spl_1(temp_global.index),
    marker="o",
    markersize=2,
    linestyle="-",
    label="s = None",
)


ax.plot(
    temp_global.index,
    temp_spl_2(temp_global.index),
    marker="o",
    markersize=2,
    linestyle="-",
    label="s = 200",
)

ax.plot(
    temp_global.index,
    temp_spl_3(temp_global.index),
    marker="o",
    markersize=2,
    linestyle="-",
    label="s = 250",
)


ax.set_title("Smoothing splines")
ax.legend()

plt.show()

Citation

The E-Learning project SOGA-Py was developed at the Department of Earth Sciences by Annette Rudolph, Joachim Krois and Kai Hartmann. You can reach us via mail by soga[at]zedat.fu-berlin.de.

Creative Commons License
You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follow: Rudolph, A., Krois, J., Hartmann, K. (2023): Statistics and Geodata Analysis using Python (SOGA-Py). Department of Earth Sciences, Freie Universitaet Berlin.