We assume that the time series values we observe are the realizations of random variables $y_1,...,y_t$, which are in turn part of a larger stochastic process $\{y_t: t \in \mathbb Z\}$.
In time series analysis, the analogs to the mean and the variance are the mean function and the autocovariance function.
The mean of a series is defined as
$$\mu_t = E(y_t)\text{.}$$

The autocovariance function is defined as
$$\gamma(s,t) = \text{cov}(y_s, y_t) = E[(y_s-\mu_s)(y_t-\mu_t)]\text{.}$$

The autocovariance measures the linear dependence between two points $(y_s, y_t)$ at different times. For smooth series the autocovariance function stays large even when $s$ and $t$ are far apart, whereas for choppy series it is close to zero for large separations.
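To make this distinction concrete, here is a minimal sketch with artificial data (the helper sample_autocov() is our own, not a library function) comparing the sample autocovariance of uncorrelated noise with that of a smoothed version of the same noise:

import numpy as np

rng = np.random.default_rng(42)

def sample_autocov(y, k):
    # biased sample autocovariance at lag k (denominator n, as is standard)
    ybar = y.mean()
    return np.sum((y[k:] - ybar) * (y[: len(y) - k] - ybar)) / len(y)

white_noise = rng.normal(size=500)  # a "choppy" series
# smoothing with a 10-point moving average yields a "smooth" series
smooth_series = np.convolve(white_noise, np.ones(10) / 10, mode="valid")

for k in (1, 5, 10):
    print(f"lag {k}: noise {sample_autocov(white_noise, k):+.3f}, "
          f"smoothed {sample_autocov(smooth_series, k):+.3f}")

For the noise, the autocovariance is close to zero at every lag; for the smoothed series it stays large for nearby lags and only dies off beyond the width of the smoothing window.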
If $s=t$ it follows that
$$\gamma(t,t) = E[(y_t-\mu_t)^2] = \text{var}(y_t)\text{.}$$

As in classical statistics, it is more convenient to work with a measure of association bounded between $-1$ and $1$. The autocorrelation function (ACF) is obtained from the autocovariance function by dividing by the standard deviations of $y_s$ and $y_t$:
$$\rho(s,t) = \frac{\gamma(s,t)}{\sqrt{\gamma(s,s)\,\gamma(t,t)}}$$

The autocorrelation, also called serial correlation, measures the internal correlation of a time series: the degree of similarity between the series and a lagged version of itself. High autocorrelation means that the future of the series is strongly correlated with its past.
The cross-covariance function measures the predictability of one series $y_t$ from another series $x_s$:

$$\gamma_{xy}(s,t) = \text{cov}(x_s, y_t) = E[(x_{s}-\mu_{xs})(y_{t}-\mu_{yt})]$$

Scaling the cross-covariance function to the interval $[-1, 1]$ yields the cross-correlation function (CCF):
$$\rho_{xy}(s,t) = \frac{\gamma_{xy}(s,t)}{\sqrt{\gamma_x(s,s)\,\gamma_y(t,t)}}$$
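As a quick illustration of the definition, here is a minimal sketch with artificial data (the helper cross_corr() is our own, not a library function): one series is a shifted, noisy copy of the other, so the cross-correlation peaks at the matching lag.

import numpy as np

def cross_corr(x, y, k):
    # sample cross-correlation between x_{t+k} and y_t
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xd, yd = x - x.mean(), y - y.mean()
    n = len(x)
    num = np.sum(xd[k:] * yd[: n - k]) / n
    return num / np.sqrt(np.var(x) * np.var(y))

rng = np.random.default_rng(1)
y = rng.normal(size=200)
x = np.roll(y, 2) + 0.1 * rng.normal(size=200)  # x is y shifted by 2 steps, plus noise

print(cross_corr(x, y, 2))  # close to 1 at the matching lag
print(cross_corr(x, y, 0))  # close to 0 elsewhere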
The true values of the mean and the autocorrelation function are in general not known and must be estimated from the sample data $y_1, y_2, \dots, y_n$. The mean function is estimated by the sample mean

$$\bar y = \frac{1}{n}\sum_{t=1}^n y_t\text{,}$$

and the theoretical autocorrelation function is estimated by the sample ACF
$$\hat{\rho}(k) = \frac{\hat{\gamma}(k)}{\hat{\gamma}(0)} = \frac{\sum_{t=1}^{n-k}(y_{t+k}-\bar y)(y_t-\bar y)}{\sum_{t=1}^n(y_t-\bar y)^2}\text{,}$$

for $k = 0, 1, \dots, n-1$.
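To see the estimator in action, here is a minimal sketch (with artificial data) that computes $\hat{\rho}(k)$ directly from the formula and cross-checks it against the acf() function from statsmodels.tsa.stattools:

import numpy as np
from statsmodels.tsa.stattools import acf

def sample_acf(y, k):
    # sample ACF at lag k, exactly as defined above
    y = np.asarray(y, dtype=float)
    ybar = y.mean()
    num = np.sum((y[k:] - ybar) * (y[: len(y) - k] - ybar))
    den = np.sum((y - ybar) ** 2)
    return num / den

rng = np.random.default_rng(7)
# a noisy series with period 12, mimicking monthly data with an annual cycle
y = np.sin(np.arange(120) * 2 * np.pi / 12) + 0.3 * rng.normal(size=120)

print(sample_acf(y, 12))                # manual estimate at lag 12
print(acf(y, nlags=12, fft=False)[12])  # statsmodels gives the same value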
One of the most useful descriptive tools in time series analysis is the correlogram, which is simply a plot of the serial correlations $\hat{\rho}(k)$ versus the lag $k$ for $k = 0, 1, \dots, M$, where $M$ is usually much smaller than the sample size $n$.
For the sake of demonstration we consider the monthly temperature time series at the weather station Berlin-Dahlem (ts_FUB_monthly) for the period 1981 to 1990.
First, we subset the original time series, and then we apply the plot_acf() function from the statsmodels package to compute and plot the autocorrelation function.
# First, let's import the needed libraries.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
# open the data set stored as .json file
ts_FUB_monthly = pd.read_json(
"http://userpage.fu-berlin.de/soga/soga-py/300/307000_time_series/ts_FUB_monthly.json"
)
ts_FUB_monthly["Date"] = pd.to_datetime(
ts_FUB_monthly["Date"], format="%Y-%m-%d", errors="coerce"
)
ts_FUB_monthly
|      | Date       | rainfall |
| ---- | ---------- | -------- |
| 0    | 1719-01-01 | 2.80     |
| 1    | 1719-02-01 | 1.10     |
| 2    | 1719-03-01 | 5.20     |
| 3    | 1719-04-01 | 9.00     |
| 4    | 1719-05-01 | 15.10    |
| ...  | ...        | ...      |
| 3631 | 2021-08-01 | 17.43    |
| 3632 | 2021-09-01 | 15.55    |
| 3633 | 2021-10-01 | 10.49    |
| 3634 | 2021-11-01 | 6.28     |
| 3635 | 2021-12-01 | 2.19     |

3636 rows × 2 columns
Let's check which columns the data set contains using the .keys() function:
ts_FUB_monthly.keys()
Index(['Date', 'rainfall'], dtype='object')
Now we can extract the desired series and filter it by the years of interest (1981 to 1990).
ts_FUB_monthly = ts_FUB_monthly.set_index("Date")
# filter
ts_FUB_monthly_1980 = ts_FUB_monthly.loc["1981-01-01":"1990-12-01"]
Next, we plot the time series together with an ACF plot.
from statsmodels.graphics.tsaplots import plot_acf
fig, ax = plt.subplots(2, 1, figsize=(18, 8))

# time series plot
ax[0].plot(ts_FUB_monthly_1980)
ax[0].set_title("Mean monthly temperatures at Berlin-Dahlem 1981 to 1990", fontsize=16)
ax[0].set_ylabel("Temperature", fontsize=16)

# ACF plot; plot_acf() expects a one-dimensional series, so we pass the column
plot_acf(ts_FUB_monthly_1980["rainfall"], lags=20, ax=ax[1])
ax[1].set_title(
    "Serial correlation of the mean monthly \ntemperatures at Berlin-Dahlem 1981 to 1990",
    fontsize=16,
)
ax[1].set_xlabel("Lag", fontsize=16)
ax[1].set_ylabel("ACF", fontsize=16)

plt.tight_layout()
plt.show()
The correlogram shows an oscillating autocorrelation structure with strong autocorrelations at a lag of 6 months and multiples thereof (negative at 6 months, positive at 12 months). This is to be expected given the annual cycle of the temperature series. The blue shaded region indicates the 95% confidence limits, which plot_acf() draws automatically.
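Under a white-noise null hypothesis, the approximate 95% limits are $\pm 1.96/\sqrt{n}$ (plot_acf() additionally widens the band at higher lags using Bartlett's formula); a quick check:

# approximate 95% confidence limits under the white-noise null hypothesis
n = len(ts_FUB_monthly_1980)  # 120 monthly values (1981 to 1990)
print(f"±{1.96 / np.sqrt(n):.3f}")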
Note, however, that trends and periodicities are typically removed from the data before the autocorrelation structure is investigated.
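For instance (a minimal sketch of one common approach, not a step from the analysis above), the annual cycle can be removed by seasonal differencing at lag 12 before the ACF is computed:

# remove the annual cycle by differencing at lag 12, then re-plot the ACF
deseasonalized = ts_FUB_monthly_1980["rainfall"].diff(12).dropna()

fig, ax = plt.subplots(figsize=(18, 4))
plot_acf(deseasonalized, lags=20, ax=ax)
ax.set_title("ACF after seasonal differencing", fontsize=16)
plt.show()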
A key idea in time series analysis is that of stationarity. Stationarity is an important precondition for the analysis of the correlation structure of a time series. A time series is considered stationary if its behavior does not change over time. This means, for example, that the values always tend to vary about the same level and that their variability is constant over time.
A time series $\{y_t: t \in \mathbb Z\}$ is said to be strictly stationary if for any $k > 0$ and any $t_1,...,t_k \in \mathbb Z$, the distribution of $(y_{t_1}, ...,y_{t_k})$ is the same as that for $(y_{t_1+u}, ...,y_{t_k+u})$ for every value of $u$. Following this definition the stochastic behavior of the process does not change through time.
A weaker definition of stationarity is second-order stationarity, also referred to as wide-sense stationarity or covariance stationarity. Here, we do not assume anything about the joint distribution of the random responses $y_{t_1}, y_{t_2}, y_{t_3}, \dots$ except that the mean is constant, $E[y_t] = \mu$, and that the covariance between two observations $y_t$ and $y_{t+k}$ depends only on the lag $k$ between the two observations and not on the point $t$ in the time series.
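In symbols, second-order stationarity requires

$$E[y_t] = \mu \quad \text{and} \quad \text{cov}(y_t, y_{t+k}) = \gamma(k) \quad \text{for all } t, k \in \mathbb Z\text{.}$$

For a stationary series the autocorrelation function therefore depends on the lag alone, $\rho(k) = \gamma(k)/\gamma(0)$, which is exactly the quantity estimated by the sample ACF above.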
Much of the theory of time series is based on the assumption of second-order stationarity. However, real-life data are often not stationary. The stationarity assumptions above only apply after any trends or seasonal effects have been removed.
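As a crude illustration (our own sketch, not part of the original analysis), one can compare the level and spread of the Berlin-Dahlem series across subperiods; for a stationary series both should be roughly constant:

# compare mean and standard deviation across the two halves of the series
halves = np.array_split(ts_FUB_monthly_1980["rainfall"].to_numpy(), 2)
for i, half in enumerate(halves, start=1):
    print(f"half {i}: mean = {half.mean():.2f}, std = {half.std():.2f}")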