Let us reload the working data set, the subset of the Keeling Curve from 1990 to 2021, from the previous section.
# First, let's import the needed libraries.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from datetime import datetime
df_co2_1990_2021 = pd.read_json(
"http://userpage.fu-berlin.de/soga/soga-py/300/307000_time_series/KeelingCurve_1990-2021.json"
)
df_co2_1990_2021 = df_co2_1990_2021.set_index("Date") # set datetimeindex
We apply least squares estimation and fit four different models, a linear $(p=1)$, a quadratic $(p=2)$, a cubic $(p=3)$ and a quartic $(p=4)$ model, using the `numpy.polyfit()` function.
A simple linear regression can be computed directly in NumPy. The `numpy.polyfit()` function is an easy solution if we want to do linear regression in NumPy without `sklearn` or other more sophisticated packages. `numpy.polyfit()` returns the slope ($m$) and the intercept ($n$) of our regression line. Adding the argument `full=True` yields additional information, such as the sum of the squares of the fit errors. The argument `deg` sets the degree of the fitting polynomial (e.g. linear = 1, quadratic = 2, cubic = 3, ...).
Exercise: Fit four polynomial models with $(p=1)$, $(p=2)$, $(p=3)$ and $(p=4)$.
## ignore the dates; use an integer time index of the same length instead
t = np.arange(0, len(df_co2_1990_2021))
## linear fit
z = np.polyfit(t, df_co2_1990_2021.values.flatten(), 1)
sum_err = np.polyfit(t, df_co2_1990_2021.values.flatten(), 1, full=True)[1]  # sum of squared fit errors
p = np.poly1d(z)
fit1 = p(t)
## quadratic fit
z = np.polyfit(t, df_co2_1990_2021.values.flatten(), 2)
p = np.poly1d(z)
fit2 = p(t)
## cubic fit
z = np.polyfit(t, df_co2_1990_2021.values.flatten(), 3)
p = np.poly1d(z)
fit3 = p(t)
## quartic fit
z = np.polyfit(t, df_co2_1990_2021.values.flatten(), 4)
p = np.poly1d(z)
fit4 = p(t)
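To compare the four fits numerically, one option is the sum of squared errors (SSE) that `full=True` returns. Here is a sketch with a synthetic cubic-trend-plus-noise series standing in for the CO2 data (the real series would be used the same way):

```python
import numpy as np

rng = np.random.default_rng(42)
t = np.arange(264)  # same length as the monthly 1990-2021 series
# Synthetic stand-in for the CO2 values: cubic trend plus noise
y = 350 + 0.15 * t + 1e-4 * t**2 + 5e-7 * t**3 + rng.normal(0, 0.5, t.size)

# The second return value of polyfit(..., full=True) is the SSE
for deg in (1, 2, 3, 4):
    sse = np.polyfit(t, y, deg, full=True)[1][0]
    print(f"degree {deg}: SSE = {sse:.2f}")
```

Note that the SSE can only decrease as the degree grows, so it is a descriptive aid only; the residual plots below guard against overfitting.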
Since subtracting the fitted values from the data yields a detrended time series, we evaluate the goodness of fit of the models by plotting the residuals.
fig, axs = plt.subplots(2, 2, figsize=(10, 8))
## residuals = observed values minus fitted values
axs[0, 0].plot(t, df_co2_1990_2021.values.flatten() - fit1)
axs[0, 0].set_title("CO2 residuals from a linear fit")
axs[0, 1].plot(t, df_co2_1990_2021.values.flatten() - fit2)
axs[0, 1].set_title("CO2 residuals from a quadratic fit")
axs[1, 0].plot(t, df_co2_1990_2021.values.flatten() - fit3)
axs[1, 0].set_title("CO2 residuals from a cubic fit")
axs[1, 1].plot(t, df_co2_1990_2021.values.flatten() - fit4)
axs[1, 1].set_title("CO2 residuals from a quartic fit")
plt.show()
The plots of the residuals indicate that the linear fit (upper left) did not remove the entire trend in the data: there appears to be some non-linearity in the trend. The residual plot of the quadratic fit (upper right) still shows some structure. The cubic model (lower left) and the quartic model (lower right) seem to fit the data well.
Following Ockham's razor, we pick the cubic model for further analysis. A summary of the cubic model is provided by calling `summary()`.
import statsmodels.formula.api as smf
results = smf.ols(formula="t ~ fit3", data=df_co2_1990_2021).fit()
results.summary()
| Dep. Variable: | t | R-squared: | 0.998 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.998 |
| Method: | Least Squares | F-statistic: | 1.100e+05 |
| Date: | Mon, 03 Apr 2023 | Prob (F-statistic): | 0.00 |
| Time: | 14:40:35 | Log-Likelihood: | -721.04 |
| No. Observations: | 264 | AIC: | 1446. |
| Df Residuals: | 262 | BIC: | 1453. |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -1990.9915 | 6.403 | -310.934 | 0.000 | -2003.600 | -1978.383 |
| fit3 | 5.4154 | 0.016 | 331.684 | 0.000 | 5.383 | 5.448 |

| Omnibus: | 41.087 | Durbin-Watson: | 0.001 |
|---|---|---|---|
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 25.535 |
| Skew: | -0.628 | Prob(JB): | 2.85e-06 |
| Kurtosis: | 2.137 | Cond. No. | 1.09e+04 |
The p-values indicate that the estimated coefficients of the cubic fit are statistically significant.
As an alternative to least squares estimation we may apply differencing. For the purpose of demonstration we apply first order $(\Delta y_t)$, second order $(\Delta^2 y_t)$, third order $(\Delta^3 y_t)$ and fourth order $(\Delta^4 y_t)$ differencing.
To difference a series, $\Delta y_t = y_t - y_{t-1}$, in Python we apply the `diff()` function. Use `y.diff()` for $(\Delta y_t)$, `y.diff().diff()` for $(\Delta^2 y_t)$, `y.diff().diff().diff()` for $(\Delta^3 y_t)$, and so on for higher order differencing.
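A tiny illustration with a made-up quadratic series: the first difference of a quadratic trend is linear, and the second difference is constant. Note that each `diff()` call introduces one leading `NaN`:

```python
import pandas as pd

y = pd.Series([1.0, 4.0, 9.0, 16.0, 25.0])  # squares, i.e. a quadratic trend

d1 = y.diff()         # first differences:  NaN, 3, 5, 7, 9  (linear)
d2 = y.diff().diff()  # second differences: NaN, NaN, 2, 2, 2  (constant)

print(d1.tolist())
print(d2.tolist())
```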
Exercise: Apply $(\Delta y_t)$, $(\Delta^2 y_t)$, $(\Delta^3 y_t)$ and $(\Delta^4 y_t)$ differencing to the `df_co2_1990_2021` time series.
## Your code here ...
diff1 = df_co2_1990_2021.diff()
diff2 = diff1.diff()
diff3 = diff2.diff()
diff4 = diff3.diff()
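As a sanity check on arbitrary made-up numbers, chained `diff()` calls agree with the expanded second difference $\Delta^2 y_t = y_t - 2y_{t-1} + y_{t-2}$:

```python
import pandas as pd

y = pd.Series([2.0, 3.0, 5.0, 8.0, 13.0])

# Second difference via two chained diff() calls
d2_chained = y.diff().diff()

# Second difference via the expanded formula y_t - 2*y_{t-1} + y_{t-2}
d2_formula = y - 2 * y.shift(1) + y.shift(2)

print(d2_chained.dropna().tolist())  # both give [1.0, 1.0, 2.0]
print(d2_formula.dropna().tolist())
```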
Let us plot the results:
fig, axs = plt.subplots(2, 2, figsize=(18, 8))
axs[0, 0].plot(diff1)
axs[0, 0].set_title(r"Differencing: $\Delta^1 y_t$")
axs[0, 1].plot(diff2)
axs[0, 1].set_title(r"Differencing: $\Delta^2 y_t$")
axs[1, 0].plot(diff3)
axs[1, 0].set_title(r"Differencing: $\Delta^3 y_t$")
axs[1, 1].plot(diff4)
axs[1, 1].set_title(r"Differencing: $\Delta^4 y_t$")
plt.show()
The plots indicate that first order differencing already did quite a good job of removing the linear trend from the data. However, in order to base our decision on a statistical quantity, we make use of the Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test.
The null hypothesis in a KPSS test is that a time series is stationary around a deterministic trend (trend-stationary). The alternative hypothesis is that a unit root is present, which means that the process is non-stationary. In the presence of a shock, trend-stationary processes are mean-reverting. This means that the time series will converge again towards the mean, while shocks have a permanent impact on the mean of unit-root processes (i.e. no convergence over time).
Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test
$H_0: \quad \text{The time series is trend-stationary}$
$H_A: \quad \text{The time series is not trend-stationary}$
We use the `kpss()` function from the `statsmodels.tsa.stattools` module and write our own function called `kpss_test()` to generate the desired output. Please note that a warning message is generated whenever the test statistic falls outside the range of the p-value look-up table (i.e. the reported p-value is clipped at 0.01 or 0.1). We apply the KPSS test on both the original and the differenced data sets.
from statsmodels.tsa.stattools import kpss
def kpss_test(series, **kw):
    statistic, p_value, n_lags, critical_values = kpss(series, **kw)
    # Format output
    print(f"KPSS Statistic: {statistic}")
    print(f"p-value: {p_value}")
    print(f"num lags: {n_lags}")
    print("Critical Values:")
    for key, value in critical_values.items():
        print(f"   {key} : {value}")
    print(f'Result: The series is {"not " if p_value < 0.05 else ""}stationary')
kpss_test(df_co2_1990_2021)
KPSS Statistic: 2.483341384260723
p-value: 0.01
num lags: 10
Critical Values:
   10% : 0.347
   5% : 0.463
   2.5% : 0.574
   1% : 0.739
Result: The series is not stationary
InterpolationWarning: The test statistic is outside of the range of p-values available in the look-up table. The actual p-value is smaller than the p-value returned.
The p-value of the KPSS test on the original data set (`df_co2_1990_2021`) is $p<0.01$. Hence we reject $H_0$ in favor of $H_A$: the original time series is not trend-stationary.
kpss_test(diff1.dropna()) ## ignore NA values
KPSS Statistic: 0.007266415372046855
p-value: 0.1
num lags: 3
Critical Values:
   10% : 0.347
   5% : 0.463
   2.5% : 0.574
   1% : 0.739
Result: The series is stationary
InterpolationWarning: The test statistic is outside of the range of p-values available in the look-up table. The actual p-value is greater than the p-value returned.
The p-value of the KPSS test on the differenced data set (`diff1`) is $p>0.1$. Hence we do not reject $H_0$: the differenced time series is trend-stationary.
Citation
The E-Learning project SOGA-Py was developed at the Department of Earth Sciences by Annette Rudolph, Joachim Krois and Kai Hartmann. You can reach us via mail by soga[at]zedat.fu-berlin.de.
Please cite as follows: Rudolph, A., Krois, J., Hartmann, K. (2023): Statistics and Geodata Analysis using Python (SOGA-Py). Department of Earth Sciences, Freie Universitaet Berlin.