Hyndman and Athanasopoulos (2013) outline the general approach for fitting an ARIMA model to a set of time series data.

  1. Plot the data. Identify any unusual observations.

  2. If necessary, transform the data (e.g. log or power transformations) to stabilize the variance.

  3. If the data are non-stationary, take first differences until the data are stationary.

  4. Examine the ACF/PACF: Which AR($p$), MA($q$), ARMA($p, q$) or ARIMA($p, d, q$) model is appropriate?

  5. Choose a model and then use the $AICc$ as a criterion to search for a better model.

  6. Check the residuals from the chosen model by plotting the ACF of the residuals, and doing a Portmanteau test of the residuals. If they do not look like white noise, try a modified model.

  7. Once the residuals look like white noise, calculate forecasts.
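To make steps 5 through 7 concrete, here is a minimal sketch in statsmodels; the synthetic series and the order $(1, 1, 1)$ are placeholders for illustration, not a recommendation for any particular data set:

import numpy as np
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.stats.diagnostic import acorr_ljungbox

y = np.random.default_rng(0).normal(size=200).cumsum()  # synthetic series, for illustration only
fit = ARIMA(y, order=(1, 1, 1)).fit()        # p, d, q would come from the ACF/PACF
print(fit.aicc)                              # compare candidate models via the AICc
print(acorr_ljungbox(fit.resid, lags=[10]))  # Portmanteau (Ljung-Box) test on the residuals
forecast = fit.forecast(steps=10)            # forecast once the residuals resemble white noise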


In [2]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import random

The data

As an example, we will fit a model to the global annual temperature anomalies (t_global) provided by the Berkeley Earth Surface Temperature Study. Temperatures are given in degrees Celsius and are reported as anomalies relative to the January 1850 to December 2000 average. Revisit the section on data sets used to remind yourself how we downloaded and extracted the data set of interest.

In [3]:
t_global = pd.read_json(
    "http://userpage.fu-berlin.de/soga/soga-py/300/307000_time_series/t_global.json"
)
t_global["Date"] = pd.to_datetime(t_global["Date"], format="%Y-%m-%d", errors="coerce")
t_global = t_global.set_index("Date")["Monthly Anomaly_global"]

t_global
Out[3]:
Date
1750-01-01   -0.993
1750-02-01   -1.679
1750-03-01   -0.192
1750-04-01   -0.531
1750-05-01   -1.881
              ...  
2022-05-01    1.023
2022-06-01    1.315
2022-07-01    1.289
2022-08-01    1.231
2022-09-01    1.090
Name: Monthly Anomaly_global, Length: 3273, dtype: float64

We see that the data set consists of monthly mean temperature anomalies.

Exercise: Aggregate the data (t_global) to annual means using the groupby() function.

In [4]:
## Your code here ...
In [5]:
temp_global_year = t_global.groupby(t_global.index.to_period("Y")).agg("mean")
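An equivalent approach, assuming the series still carries its DatetimeIndex, is to resample to annual frequency. Note that resample yields a DatetimeIndex rather than a PeriodIndex, and that on older pandas versions the alias is "AS" instead of "YS":

temp_global_year_alt = t_global.resample("YS").mean()  # annual means, labeled at the start of each year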

Then we split the data set into two parts. One part, the training set, covers the period 1850 to 2000; from it we learn the model parameters. The other part, the test set, covers the period 2001 to 2016. This part is used to validate our model, as we compare the model forecast with the observations.

Exercise: Subset the time series (temp_global_year) to the period 1850 to 2000 and to the period 2001 to 2016.

In [6]:
## Your code here...
In [7]:
temp_global_training = temp_global_year["1850-01-01":"2000-01-01"]
temp_global_test = temp_global_year["2001-01-01":"2016-01-01"]

Let us plot the data and identify any unusual observations.

In [8]:
plt.figure(figsize=(18, 6))
plt.title("Earth Surface Temperature Anomalies", fontsize=14)
temp_global_training.plot(label="training set 1850-2000", fontsize=14)
temp_global_test.plot(label="test set 2001-2016", fontsize=14)

plt.legend()
plt.show()

It seems that there are no outliers or unusual observations in the data set.


Data transformation

Judging from the plot above, the variation does not seem to increase or decrease with the level of the time series, so no transformation is likely to be required. However, as a sanity check we apply a Box-Cox transformation using the scipy.stats.boxcox function in Python and plot the transformed data against the original data.

The Box-Cox transformation is defined as follows:

$$ w_t = \begin{cases} \log(y_t) & \text{if } \lambda = 0 \\ (y_t^{\lambda}-1)/\lambda & \text{otherwise} \end{cases} $$

The Box-Cox transformation depends on the parameter $\lambda$: if $\lambda = 0$ the natural logarithm is used, whereas if $\lambda \ne 0$ a power transformation is applied, followed by some simple scaling. If no $\lambda$ is supplied, scipy.stats.boxcox chooses the value of $\lambda$ that maximizes the log-likelihood function and applies the transformation with that value. Note that boxcox requires strictly positive input, which is why we shift the anomalies by a constant below.

In [10]:
from scipy.stats import boxcox

# boxcox requires strictly positive input, hence the constant shift of +10
boxcox_transformed_data, boxcox_lambda = boxcox(temp_global_training + 10)
boxcox_transformed_data = pd.Series(
    boxcox_transformed_data, index=temp_global_training.index
)
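As a quick check we can print the $\lambda$ that scipy selected (the exact value depends on the data):

print(f"lambda selected by boxcox: {boxcox_lambda:.3f}")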

We use the given $\lambda$ and plot the transformed time series against the original time series.

In [11]:
fig, ax = plt.subplots(2, 1, figsize=(16, 8))
temp_global_training.plot(ax=ax[0], color="black", fontsize=14)
ax[0].set_title("Original time series", fontsize=14)


boxcox_transformed_data.plot(
    ax=ax[1],
    color="grey",
)
ax[1].set_title("Box-Cox transformed time series", fontsize=14)

ax[0].grid()
ax[1].grid()

plt.tight_layout()
plt.show()

The Box-Cox transformed time series (grey line) does not differ noticeably from the original time series (black line). Hence, we continue our analysis with the original time series data.


Stationarity check

Fitting an ARIMA model requires the series to be stationary, which means that the mean, variance, and autocovariance of the series are time invariant. Based on our domain knowledge and the visualization of the time series data we are quite sure that the temperature time series is not stationary.

To confirm this impression, we apply the Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test. Recall that the null hypothesis of the KPSS test is that the time series is stationary. With its default setting (regression="c"), the statsmodels implementation tests stationarity around a constant level; to test trend-stationarity, pass regression="ct". We use the function we introduced in a previous section.

$H_0:\text{The time series is stationary.}$

$H_A:\text{The time series is not stationary.}$

In [12]:
# KPSS test
from statsmodels.tsa.stattools import kpss


def kpss_test(series, **kw):
    statistic, p_value, n_lags, critical_values = kpss(series, **kw)
    # Format Output
    print(f"KPSS Statistic: {statistic}")
    print(f"p-value: {p_value}")
    print(f"num lags: {n_lags}")
    print("Critial Values:")
    for key, value in critical_values.items():
        print(f"   {key} : {value}")
    print(f'Result: The series is {"not " if p_value < 0.05 else ""}stationary')
In [13]:
kpss_test(temp_global_training)
KPSS Statistic: 1.613241516211716
p-value: 0.01
num lags: 8
Critical Values:
   10% : 0.347
   5% : 0.463
   2.5% : 0.574
   1% : 0.739
Result: The series is not stationary
InterpolationWarning: The test statistic is outside of the range of p-values available in the look-up table. The actual p-value is smaller than the p-value returned.

The p-value of the KPSS test on the original data set (temp_global_training) is $p<0.01$. Hence we reject $H_0$ in favor of $H_A$: the original time series is not stationary.

A non-stationary series can often be corrected by a simple transformation such as differencing. Recall that the first difference is obtained by subtracting the previous period's value from the current period's value, $\Delta y_t = y_t - y_{t-1}$. Let us apply the diff() function and compute the KPSS test for the differenced time series.

Exercise: Difference the temp_global_training time series and apply a KPSS test on the differenced time series.

In [14]:
## Your code here...
In [15]:
temp_global_training_diff1 = temp_global_training.diff()  # first difference; the first value becomes NaN
In [16]:
kpss_test(temp_global_training_diff1.dropna())  ## ignore NaN for kpss
KPSS Statistic: 0.0976339172178381
p-value: 0.1
num lags: 24
Critical Values:
   10% : 0.347
   5% : 0.463
   2.5% : 0.574
   1% : 0.739
Result: The series is stationary
InterpolationWarning: The test statistic is outside of the range of p-values available in the look-up table. The actual p-value is greater than the p-value returned.

The p-value of the KPSS test on the differenced data set (temp_global_training_diff1) is $p>0.1$. Hence we fail to reject $H_0$: the differenced time series is stationary.

Let us plot the differenced data set.

In [17]:
plt.figure(figsize=(18, 6))
plt.title("Differenced data set: temp_global_training_diff1`")
temp_global_training_diff1.plot(color="black")

plt.grid()
plt.show()

Nice, the trend is gone. Let us plot the autocorrelation function to review the correlational structure of the time series. This time we use the plot_acf() function from statsmodels.graphics.tsaplots.

In [18]:
from statsmodels.graphics.tsaplots import plot_pacf, plot_acf

plt.figure(figsize=(18, 6))


plot_acf(temp_global_training_diff1.dropna())
plt.title("ACF for Differenced Series")

plt.show()
<Figure size 1800x600 with 0 Axes>