In the previous section we introduced three strategies to eliminate seasonal effects:

  • Fitting a linear model with indicator variables for the different months
$$y_t = \beta_1m_{1t}+\beta_2m_{2t}+...+\beta_{12}m_{12t} + w_t \text{,}$$

where $m_{kt}$ is the [0, 1] indicator for month $k$.

  • Seasonal differencing
$$\Delta^{12}y_t = y_t -y_{t-12}$$
  • Fitting a periodic function
$$y_t=\beta_0+\beta_1\sin((2\pi/m)t)+\beta_2\cos((2\pi/m)t) +w_t \text{,}$$

where $m$ corresponds to the period.


Fitting a linear model with indicator¶

Let us reload the working data set, the subset of the Keeling Curve from 1990 to 2021, from the previous section.

In [2]:
# First, let's import the needed libraries.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from datetime import datetime
In [3]:
# open the respective json
df_co2_1990_2021 = pd.read_json(
    "http://userpage.fu-berlin.de/soga/soga-py/300/307000_time_series/KeelingCurve_1990-2021.json"
)
df_co2_1990_2021 = df_co2_1990_2021.set_index("Date")  # set datetimeindex

In order to build a model fitting not only the trend but also the seasonality, we got several options in Python to yield the same results. E. g. we got the pd.interpolate(), which we already got to know in a previous section. This function is based on filling NA values using an interpolation method.

Another option is the interp1d class from scipy.interpolate, which is a convenient method to create a function based on fixed data points. This class returns a function whose call method uses interpolation to find the value of new points. See the documentation of the method here.

In [4]:
f_cubic_1 = df_co2_1990_2021.interpolate(method="spline", order=3)
t = np.arange(0, len(df_co2_1990_2021))  ## ignore date, convert to lenght
In [5]:
import statsmodels.formula.api as smf

results = smf.ols(formula="t ~ f_cubic_1", data=df_co2_1990_2021).fit()
results.summary()
Out[5]:
OLS Regression Results
Dep. Variable: t R-squared: 0.973
Model: OLS Adj. R-squared: 0.973
Method: Least Squares F-statistic: 9489.
Date: Mon, 03 Apr 2023 Prob (F-statistic): 8.48e-208
Time: 14:40:39 Log-Likelihood: -1041.2
No. Observations: 264 AIC: 2086.
Df Residuals: 262 BIC: 2094.
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept -1938.8831 21.268 -91.166 0.000 -1980.760 -1897.006
f_cubic_1 5.2825 0.054 97.413 0.000 5.176 5.389
Omnibus: 44.728 Durbin-Watson: 0.297
Prob(Omnibus): 0.000 Jarque-Bera (JB): 11.441
Skew: 0.137 Prob(JB): 0.00328
Kurtosis: 2.018 Cond. No. 1.08e+04


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.08e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
In [6]:
from scipy.interpolate import interp1d

t = np.arange(0, len(df_co2_1990_2021))  ## ignore date, convert to lenght
y = df_co2_1990_2021.values.flatten()
f_cubic_2 = interp1d(t, y, kind="cubic")
In [7]:
results = smf.ols(formula="t ~ f_cubic_2(t)", data=df_co2_1990_2021).fit()
results.summary()
Out[7]:
OLS Regression Results
Dep. Variable: t R-squared: 0.973
Model: OLS Adj. R-squared: 0.973
Method: Least Squares F-statistic: 9489.
Date: Mon, 03 Apr 2023 Prob (F-statistic): 8.48e-208
Time: 14:40:39 Log-Likelihood: -1041.2
No. Observations: 264 AIC: 2086.
Df Residuals: 262 BIC: 2094.
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept -1938.8831 21.268 -91.166 0.000 -1980.760 -1897.006
f_cubic_2(t) 5.2825 0.054 97.413 0.000 5.176 5.389
Omnibus: 44.728 Durbin-Watson: 0.297
Prob(Omnibus): 0.000 Jarque-Bera (JB): 11.441
Skew: 0.137 Prob(JB): 0.00328
Kurtosis: 2.018 Cond. No. 1.08e+04


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.08e+04. This might indicate that there are
strong multicollinearity or other numerical problems.

Most of the factor effects are highly significant indicated by the very small p-values. Let us plot the modeled data and the observed data to assess the model's goodness of fit.

In [8]:
fig, ax = plt.subplots(1, figsize=(13, 4))
ax.plot(df_co2_1990_2021, ".", color="blue", label="observed")

f_cubic_1.plot(color="red", label="modeled", ax=ax)

plt.legend()
plt.show()
In [9]:
plt.figure(figsize=(13, 4))
plt.plot(t, y, ".", color="blue", label="observed")
plt.plot(t, f_cubic_2(t), color="red", label="modeled")

plt.legend()
plt.show()

As we can see using these methods will yield in a curve perfectly fitted to the data. If we want to built a more general model to the data, we will have to apply the curve_fit() function from scipy.optimize. First we will have to figure out what the time series consists of. Here, we are dealing with a additive time series, which consists of a trend and a seasonality. Both can be described as separate functions, which we can simply add.

The trend in this case can be described through a cubic function

$f(x)=ax^{3}+bx^{2}+cx+d$

while the seasonality can be described through a sinusoidal function

$A * sin(\omega * x + \rho) + B$

where:

$A$ = Intercept
$\omega$ = Period
$\rho$ = Phase
$B$ = Offset

Firstly, we will built a custom fitting function, that will theoretically fit our data. Then will we define an initial guess for all needed parameters, to make sure that the function will be fitted completely. Lastly, curve_fit() will be applied and the result will be plotted (see here for further reading).

Exercise: Fit a cubic model by using curve_fit().

In [10]:
from scipy.optimize import curve_fit
In [11]:
def cubic_sinus(x, A, offset, omega, phase, a, b, c, d):
    """return paramters for trend (a, b, c, d) + seasonality ( A, offset, omega, phase)
    Assuming:
    trend = cubic
    seasonality = sinusoidal
    """
    return (a * x**3 + b * x**2 + c * x + d) + (
        A * np.sin(omega * x + phase) + offset
    )


T = 12

## define initial guess for the parameters
def get_p0(x, y):
    A0 = (max(y[0:T]) - min(y[0:T])) / 2
    offset0 = y[0]
    phase0 = 0
    omega0 = 2.0 * np.pi / T
    a = b = c = d = 1
    return [A0, offset0, omega0, phase0, a, b, c, d]
In [12]:
from scipy.optimize import curve_fit

parameters, _ = curve_fit(cubic_sinus, t, y, p0=get_p0(t, y))

plt.figure(figsize=(13, 4))

plt.plot(t, y, color="red", linewidth=1, marker=".", label="observed")
plt.plot(
    t,
    cubic_sinus(t, *parameters),
    linewidth=1.5,
    color="blue",
    label="modeled",
)

plt.legend()
plt.show()

Let us plot the residuals from the cubic and periodic model versus time to see if the strong seasonal effect is now gone and the correlogram for CO2 data with trend and period removed.

In [13]:
fig, ax = plt.subplots(2, 1, figsize=(12, 8))

ax[0].plot(t, df_co2_1990_2021.values.flatten() - cubic_sinus(t, *parameters))
ax[0].set_title("CO2 residuals from a cubic & sinusoidal fit")


from statsmodels.graphics.tsaplots import plot_acf


plot_acf(
    df_co2_1990_2021.values.flatten() - cubic_sinus(t, *parameters), ax=ax[1], lags=48
)
ax[1].set_title("ACF-Plot for CO2 Data with Trend & Period removed")
plt.show()

From the correlogram it appears that the autocorrelations for the CO2 still follow a strong seasonality.


Seasonal differencing¶

In Python we apply the diff() function for differencing, setting the periods argument to $12$ results in seasonal differencing

In [14]:
# cubic +sinusoidal fit
fit_resid = df_co2_1990_2021.squeeze() - cubic_sinus(t, *parameters)

seas_diff = fit_resid.diff(periods=12)
In [15]:
fig, ax = plt.subplots(2, 1, figsize=(12, 8))

ax[0].plot(seas_diff)
ax[0].set_title("CO2 residuals from a cubic & sinusoidal fit")


plot_acf(seas_diff.dropna(), ax=ax[1], lags=48)
ax[1].set_title("ACF-Plot for CO2 Data with Trend & Period removed")
plt.show()

The upper plot shows the residuals after linear trend removal with a cubic fit and seasonal differencing $(\Delta^{12}y_t)$. The lower plot shows the correlogram. The lag axes in the correlogram are in terms of seasons (12 months). From the correlogram it appears that the autocorrelations for the CO2 are positive for successive months up to about 8 months apart.

The plot indicates that the general pattern of seasonality was well reproduced, however, the extreme values are not well fitted. The plot still shows strong seasonal components, hence indicating that this kind of periodic regression model is not fully capable of modelling the seasonal component in the CO2 data. This is very likely due to missing information about the phase ($\phi_1$ and $\phi_2$).

$$y_t=\beta_0+\beta_1\sin((2t\pi/m)+\phi_1)+\beta_2\cos((2t\pi/m)+\phi_2) +w_t$$

We cannot estimate the additional phase information by a linear model, but we may estimate phase information by e.g. Fourier analysis.


Citation

The E-Learning project SOGA-Py was developed at the Department of Earth Sciences by Annette Rudolph, Joachim Krois and Kai Hartmann. You can reach us via mail by soga[at]zedat.fu-berlin.de.

Creative Commons License
You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follow: Rudolph, A., Krois, J., Hartmann, K. (2023): Statistics and Geodata Analysis using Python (SOGA-Py). Department of Earth Sciences, Freie Universitaet Berlin.