In the previous section we introduced three strategies to eliminate seasonal effects:
where $m_{kt}$ is the [0, 1] indicator for month $k$.
where $m$ corresponds to the period.
Let us reload the working data set, the subset of the Keeling Curve from 1990 to 2021, from the previous section.
# First, let's import the needed libraries.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from datetime import datetime
# open the respective json
df_co2_1990_2021 = pd.read_json(
"http://userpage.fu-berlin.de/soga/soga-py/300/307000_time_series/KeelingCurve_1990-2021.json"
)
df_co2_1990_2021 = df_co2_1990_2021.set_index("Date") # set datetimeindex
In order to build a model fitting not only the trend but also the seasonality, we got several options in Python to yield the same results. E. g. we got the pd.interpolate()
, which we already got to know in a previous section. This function is based on filling NA
values using an interpolation method.
Another option is the interp1d
class from scipy.interpolate
, which is a convenient method to create a function based on fixed data points. This class returns a function whose call method uses interpolation to find the value of new points. See the documentation of the method here.
f_cubic_1 = df_co2_1990_2021.interpolate(method="spline", order=3)
t = np.arange(0, len(df_co2_1990_2021)) ## ignore date, convert to lenght
import statsmodels.formula.api as smf
results = smf.ols(formula="t ~ f_cubic_1", data=df_co2_1990_2021).fit()
results.summary()
Dep. Variable: | t | R-squared: | 0.973 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.973 |
Method: | Least Squares | F-statistic: | 9489. |
Date: | Mon, 03 Apr 2023 | Prob (F-statistic): | 8.48e-208 |
Time: | 14:40:39 | Log-Likelihood: | -1041.2 |
No. Observations: | 264 | AIC: | 2086. |
Df Residuals: | 262 | BIC: | 2094. |
Df Model: | 1 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | -1938.8831 | 21.268 | -91.166 | 0.000 | -1980.760 | -1897.006 |
f_cubic_1 | 5.2825 | 0.054 | 97.413 | 0.000 | 5.176 | 5.389 |
Omnibus: | 44.728 | Durbin-Watson: | 0.297 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 11.441 |
Skew: | 0.137 | Prob(JB): | 0.00328 |
Kurtosis: | 2.018 | Cond. No. | 1.08e+04 |
from scipy.interpolate import interp1d
t = np.arange(0, len(df_co2_1990_2021)) ## ignore date, convert to lenght
y = df_co2_1990_2021.values.flatten()
f_cubic_2 = interp1d(t, y, kind="cubic")
results = smf.ols(formula="t ~ f_cubic_2(t)", data=df_co2_1990_2021).fit()
results.summary()
Dep. Variable: | t | R-squared: | 0.973 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.973 |
Method: | Least Squares | F-statistic: | 9489. |
Date: | Mon, 03 Apr 2023 | Prob (F-statistic): | 8.48e-208 |
Time: | 14:40:39 | Log-Likelihood: | -1041.2 |
No. Observations: | 264 | AIC: | 2086. |
Df Residuals: | 262 | BIC: | 2094. |
Df Model: | 1 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | -1938.8831 | 21.268 | -91.166 | 0.000 | -1980.760 | -1897.006 |
f_cubic_2(t) | 5.2825 | 0.054 | 97.413 | 0.000 | 5.176 | 5.389 |
Omnibus: | 44.728 | Durbin-Watson: | 0.297 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 11.441 |
Skew: | 0.137 | Prob(JB): | 0.00328 |
Kurtosis: | 2.018 | Cond. No. | 1.08e+04 |
Most of the factor effects are highly significant indicated by the very small p-values. Let us plot the modeled data and the observed data to assess the model's goodness of fit.
fig, ax = plt.subplots(1, figsize=(13, 4))
ax.plot(df_co2_1990_2021, ".", color="blue", label="observed")
f_cubic_1.plot(color="red", label="modeled", ax=ax)
plt.legend()
plt.show()
plt.figure(figsize=(13, 4))
plt.plot(t, y, ".", color="blue", label="observed")
plt.plot(t, f_cubic_2(t), color="red", label="modeled")
plt.legend()
plt.show()
As we can see using these methods will yield in a curve perfectly fitted to the data. If we want to built a more general model to the data, we will have to apply the curve_fit()
function from scipy.optimize
. First we will have to figure out what the time series consists of. Here, we are dealing with a additive time series, which consists of a trend and a seasonality. Both can be described as separate functions, which we can simply add.
The trend in this case can be described through a cubic function
$f(x)=ax^{3}+bx^{2}+cx+d$
while the seasonality can be described through a sinusoidal function
$A * sin(\omega * x + \rho) + B$
where:
$A$ = Intercept
$\omega$ = Period
$\rho$ = Phase
$B$ = Offset
Firstly, we will built a custom fitting function, that will theoretically fit our data. Then will we define an initial guess for all needed parameters, to make sure that the function will be fitted completely. Lastly, curve_fit()
will be applied and the result will be plotted (see here for further reading).
Exercise: Fit a cubic model by using
curve_fit()
.
from scipy.optimize import curve_fit
def cubic_sinus(x, A, offset, omega, phase, a, b, c, d):
"""return paramters for trend (a, b, c, d) + seasonality ( A, offset, omega, phase)
Assuming:
trend = cubic
seasonality = sinusoidal
"""
return (a * x**3 + b * x**2 + c * x + d) + (
A * np.sin(omega * x + phase) + offset
)
T = 12
## define initial guess for the parameters
def get_p0(x, y):
A0 = (max(y[0:T]) - min(y[0:T])) / 2
offset0 = y[0]
phase0 = 0
omega0 = 2.0 * np.pi / T
a = b = c = d = 1
return [A0, offset0, omega0, phase0, a, b, c, d]
from scipy.optimize import curve_fit
parameters, _ = curve_fit(cubic_sinus, t, y, p0=get_p0(t, y))
plt.figure(figsize=(13, 4))
plt.plot(t, y, color="red", linewidth=1, marker=".", label="observed")
plt.plot(
t,
cubic_sinus(t, *parameters),
linewidth=1.5,
color="blue",
label="modeled",
)
plt.legend()
plt.show()
Let us plot the residuals from the cubic and periodic model versus time to see if the strong seasonal effect is now gone and the correlogram for CO2 data with trend and period removed.
fig, ax = plt.subplots(2, 1, figsize=(12, 8))
ax[0].plot(t, df_co2_1990_2021.values.flatten() - cubic_sinus(t, *parameters))
ax[0].set_title("CO2 residuals from a cubic & sinusoidal fit")
from statsmodels.graphics.tsaplots import plot_acf
plot_acf(
df_co2_1990_2021.values.flatten() - cubic_sinus(t, *parameters), ax=ax[1], lags=48
)
ax[1].set_title("ACF-Plot for CO2 Data with Trend & Period removed")
plt.show()
From the correlogram it appears that the autocorrelations for the CO2 still follow a strong seasonality.
In Python we apply the diff()
function for differencing, setting the periods
argument to $12$ results in seasonal differencing
# cubic +sinusoidal fit
fit_resid = df_co2_1990_2021.squeeze() - cubic_sinus(t, *parameters)
seas_diff = fit_resid.diff(periods=12)
fig, ax = plt.subplots(2, 1, figsize=(12, 8))
ax[0].plot(seas_diff)
ax[0].set_title("CO2 residuals from a cubic & sinusoidal fit")
plot_acf(seas_diff.dropna(), ax=ax[1], lags=48)
ax[1].set_title("ACF-Plot for CO2 Data with Trend & Period removed")
plt.show()
The upper plot shows the residuals after linear trend removal with a cubic fit and seasonal differencing $(\Delta^{12}y_t)$. The lower plot shows the correlogram. The lag axes in the correlogram are in terms of seasons (12 months). From the correlogram it appears that the autocorrelations for the CO2 are positive for successive months up to about 8 months apart.
The plot indicates that the general pattern of seasonality was well reproduced, however, the extreme values are not well fitted. The plot still shows strong seasonal components, hence indicating that this kind of periodic regression model is not fully capable of modelling the seasonal component in the CO2 data. This is very likely due to missing information about the phase ($\phi_1$ and $\phi_2$).
$$y_t=\beta_0+\beta_1\sin((2t\pi/m)+\phi_1)+\beta_2\cos((2t\pi/m)+\phi_2) +w_t$$We cannot estimate the additional phase information by a linear model, but we may estimate phase information by e.g. Fourier analysis.
Citation
The E-Learning project SOGA-Py was developed at the Department of Earth Sciences by Annette Rudolph, Joachim Krois and Kai Hartmann. You can reach us via mail by soga[at]zedat.fu-berlin.de.
Please cite as follow: Rudolph, A., Krois, J., Hartmann, K. (2023): Statistics and Geodata Analysis using Python (SOGA-Py). Department of Earth Sciences, Freie Universitaet Berlin.