In the previous section we introduced three strategies to eliminate seasonal effects: a seasonal means model, where $m_{kt}$ is the [0, 1] indicator for month $k$; a harmonic (sinusoidal) regression model; and seasonal differencing, where $m$ corresponds to the period.
Let us reload the working data set, the subset of the Keeling Curve from 1990 to 2021, from the previous section.
# First, let's import the needed libraries.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from datetime import datetime
# load the respective JSON file from the course server
df_co2_1990_2021 = pd.read_json(
    "http://userpage.fu-berlin.de/soga/soga-py/300/307000_time_series/KeelingCurve_1990-2021.json"
)
df_co2_1990_2021 = df_co2_1990_2021.set_index("Date") # set datetimeindex
In order to build a model that fits not only the trend but also the seasonality, we have several options in Python that yield the same results. For example, there is pd.interpolate(), which we already got to know in a previous section. This function fills NA values using an interpolation method.
Another option is the interp1d class from scipy.interpolate, which is a convenient way to create a function from fixed data points. This class returns a function whose call method uses interpolation to find the values at new points. See the documentation of the method here.
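To make the idea behind interp1d concrete before we apply both options to the CO2 series, here is a minimal toy sketch with made-up data points (not part of the original workflow):
from scipy.interpolate import interp1d
import numpy as np

# toy sketch with made-up points: interp1d returns a callable
x_known = np.array([0, 1, 2, 3, 4])
y_known = np.array([0.0, 0.8, 0.9, 0.1, -0.8])
f_toy = interp1d(x_known, y_known, kind="cubic")

# the returned function can be evaluated at new, intermediate points
print(f_toy(np.linspace(0, 4, 9)))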
f_cubic_1 = df_co2_1990_2021.interpolate(method="spline", order=3)  # cubic spline interpolation via pandas
t = np.arange(0, len(df_co2_1990_2021))  # ignore the date index; use integer time steps
import statsmodels.formula.api as smf
results = smf.ols(formula="t ~ f_cubic_1", data=df_co2_1990_2021).fit()
results.summary()
| Dep. Variable: | t | R-squared: | 0.973 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.973 |
| Method: | Least Squares | F-statistic: | 9489. |
| Date: | Mon, 03 Apr 2023 | Prob (F-statistic): | 8.48e-208 |
| Time: | 14:40:39 | Log-Likelihood: | -1041.2 |
| No. Observations: | 264 | AIC: | 2086. |
| Df Residuals: | 262 | BIC: | 2094. |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -1938.8831 | 21.268 | -91.166 | 0.000 | -1980.760 | -1897.006 |
| f_cubic_1 | 5.2825 | 0.054 | 97.413 | 0.000 | 5.176 | 5.389 |

| Omnibus: | 44.728 | Durbin-Watson: | 0.297 |
|---|---|---|---|
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 11.441 |
| Skew: | 0.137 | Prob(JB): | 0.00328 |
| Kurtosis: | 2.018 | Cond. No.: | 1.08e+04 |
from scipy.interpolate import interp1d
t = np.arange(0, len(df_co2_1990_2021))  # ignore the date index; use integer time steps
y = df_co2_1990_2021.values.flatten()
f_cubic_2 = interp1d(t, y, kind="cubic")
results = smf.ols(formula="t ~ f_cubic_2(t)", data=df_co2_1990_2021).fit()
results.summary()
| Dep. Variable: | t | R-squared: | 0.973 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.973 |
| Method: | Least Squares | F-statistic: | 9489. |
| Date: | Mon, 03 Apr 2023 | Prob (F-statistic): | 8.48e-208 |
| Time: | 14:40:39 | Log-Likelihood: | -1041.2 |
| No. Observations: | 264 | AIC: | 2086. |
| Df Residuals: | 262 | BIC: | 2094. |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -1938.8831 | 21.268 | -91.166 | 0.000 | -1980.760 | -1897.006 |
| f_cubic_2(t) | 5.2825 | 0.054 | 97.413 | 0.000 | 5.176 | 5.389 |

| Omnibus: | 44.728 | Durbin-Watson: | 0.297 |
|---|---|---|---|
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 11.441 |
| Skew: | 0.137 | Prob(JB): | 0.00328 |
| Kurtosis: | 2.018 | Cond. No.: | 1.08e+04 |
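Both interpolation approaches lead to the same regression output. Instead of reading the numbers off the summary tables, we can also pull the headline statistics directly from the fitted results object; a minimal sketch using standard statsmodels result attributes:
# pull the headline statistics directly from the fitted results object
print(f"R-squared:      {results.rsquared:.3f}")
print(f"Adj. R-squared: {results.rsquared_adj:.3f}")
print("p-values:")
print(results.pvalues.round(4))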
Both coefficients are highly significant, as indicated by the very small p-values. Let us plot the modeled data and the observed data to assess the model's goodness of fit.
fig, ax = plt.subplots(1, figsize=(13, 4))
ax.plot(df_co2_1990_2021, ".", color="blue", label="observed")
f_cubic_1.plot(color="red", label="modeled", ax=ax)
plt.legend()
plt.show()
plt.figure(figsize=(13, 4))
plt.plot(t, y, ".", color="blue", label="observed")
plt.plot(t, f_cubic_2(t), color="red", label="modeled")
plt.legend()
plt.show()
As we can see, these methods yield a curve that is perfectly fitted to the data. If we want to build a more general model of the data, we will apply the curve_fit() function from scipy.optimize. First we have to figure out what the time series consists of. Here, we are dealing with an additive time series, which consists of a trend and a seasonality. Both can be described as separate functions, which we can simply add.
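As an optional illustration of this additive structure, statsmodels can decompose the series into trend, seasonal, and residual components; a minimal sketch (this decomposition is an aside and is not used in the rest of this section):
from statsmodels.tsa.seasonal import seasonal_decompose

# additive decomposition into trend + seasonal + residual, assuming monthly data (period=12)
decomposition = seasonal_decompose(df_co2_1990_2021.squeeze(), model="additive", period=12)
decomposition.plot()
plt.show()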
The trend in this case can be described by a cubic function
$f(x)=ax^{3}+bx^{2}+cx+d$
while the seasonality can be described by a sinusoidal function
$g(x) = A \sin(\omega x + \rho) + B$
where:
$A$ = amplitude
$\omega$ = angular frequency ($2\pi$ divided by the period)
$\rho$ = phase
$B$ = offset
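Adding the two components gives the combined model that we will fit below:
$y(x) = \underbrace{ax^{3}+bx^{2}+cx+d}_{\text{trend}} + \underbrace{A \sin(\omega x + \rho) + B}_{\text{seasonality}}$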
First, we will build a custom fitting function that should, in theory, describe our data. Then we will define an initial guess for all required parameters to help the optimization converge. Lastly, curve_fit() will be applied and the result will be plotted (see here for further reading).
Exercise: Fit a cubic plus sinusoidal model by using curve_fit().
from scipy.optimize import curve_fit
def cubic_sinus(x, A, offset, omega, phase, a, b, c, d):
    """Cubic trend (a, b, c, d) plus sinusoidal seasonality (A, offset, omega, phase).

    Assuming an additive model:
    trend = cubic polynomial
    seasonality = sinusoid
    """
    return (a * x**3 + b * x**2 + c * x + d) + (
        A * np.sin(omega * x + phase) + offset
    )
T = 12  # seasonal period: 12 monthly observations per year
## define initial guess for the parameters
def get_p0(x, y):
    A0 = (max(y[0:T]) - min(y[0:T])) / 2  # amplitude: half the range of the first year
    offset0 = y[0]  # offset: first observed value
    phase0 = 0  # phase: assume no shift
    omega0 = 2.0 * np.pi / T  # angular frequency for a 12-month period
    a = b = c = d = 1  # neutral starting values for the cubic trend
    return [A0, offset0, omega0, phase0, a, b, c, d]
parameters, _ = curve_fit(cubic_sinus, t, y, p0=get_p0(t, y))
plt.figure(figsize=(13, 4))
plt.plot(t, y, color="red", linewidth=1, marker=".", label="observed")
plt.plot(
    t,
    cubic_sinus(t, *parameters),
    linewidth=1.5,
    color="blue",
    label="modeled",
)
plt.legend()
plt.show()
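It is instructive to inspect the fitted parameter values; the following minimal sketch simply prints them in the order defined by cubic_sinus:
# label and print the fitted parameters returned by curve_fit
param_names = ["A", "offset", "omega", "phase", "a", "b", "c", "d"]
for name, value in zip(param_names, parameters):
    print(f"{name:>6} = {value: .6f}")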
Let us plot the residuals from the cubic and sinusoidal model against time to see whether the strong seasonal effect is now gone, together with the correlogram of the CO2 data with trend and period removed.
fig, ax = plt.subplots(2, 1, figsize=(12, 8))
ax[0].plot(t, df_co2_1990_2021.values.flatten() - cubic_sinus(t, *parameters))
ax[0].set_title("CO2 residuals from a cubic & sinusoidal fit")
from statsmodels.graphics.tsaplots import plot_acf
plot_acf(
    df_co2_1990_2021.values.flatten() - cubic_sinus(t, *parameters), ax=ax[1], lags=48
)
ax[1].set_title("ACF-Plot for CO2 Data with Trend & Period removed")
plt.show()
From the correlogram it appears that the autocorrelations of the CO2 residuals still follow a strong seasonal pattern, so we remove it by seasonal differencing.
In Python we apply the diff() function for differencing; setting the periods argument to $12$ results in seasonal differencing at a lag of one year (12 monthly observations).
# residuals from the cubic + sinusoidal fit
fit_resid = df_co2_1990_2021.squeeze() - cubic_sinus(t, *parameters)
seas_diff = fit_resid.diff(periods=12)
fig, ax = plt.subplots(2, 1, figsize=(12, 8))
ax[0].plot(seas_diff)
ax[0].set_title("Seasonally differenced CO2 residuals (cubic & sinusoidal fit)")
plot_acf(seas_diff.dropna(), ax=ax[1], lags=48)
ax[1].set_title("ACF plot for seasonally differenced CO2 residuals")
plt.show()
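To complement the visual comparison of the two correlograms, the following minimal sketch (using the acf helper from statsmodels, which is otherwise not used in this section) quantifies the lag-12 autocorrelation before and after seasonal differencing:
from statsmodels.tsa.stattools import acf

# compare the lag-12 autocorrelation of the residuals before and after seasonal differencing
acf_before = acf(fit_resid, nlags=12)
acf_after = acf(seas_diff.dropna(), nlags=12)
print(f"lag-12 autocorrelation before differencing: {acf_before[12]: .3f}")
print(f"lag-12 autocorrelation after differencing:  {acf_after[12]: .3f}")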