We assume that the time series values we observe are the realizations of random variables $y_1,...,y_t$, which are in turn part of a larger *stochastic process* $\{y_t: t \in \mathbb Z\}$.

In time series analysis, the analogs to the *mean* and the *variance* are the **mean function** and the **autocovariance function**.

The *mean function* of a series is defined as

$$\mu_t = E[y_t]\text{.}$$

The *autocovariance function* is defined as

$$\gamma(s,t) = \text{cov}(y_s, y_t) = E[(y_s-\mu_s)(y_t-\mu_t)]\text{.}$$

The autocovariance measures the linear dependence between two points $(y_s,y_t)$ at different times. For smooth series the autocovariance function stays large even when $s$ and $t$ are far apart, whereas for choppy series the autocovariance function is close to zero for large separations.

If $s=t$ it follows that

$$\gamma(t,t) = E[(y_t-\mu_t)^2] = \text{var}[y_t] \text{.}$$

As in classical statistics, it is more convenient to deal with a measure of association between $-1$ and $1$. The **autocorrelation function (ACF)** is computed from the autocovariance function by dividing by the standard deviations of $y_{s}$ and $y_{t}$.
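This identity, that the autocovariance at lag $0$ is the variance, can be checked numerically. The snippet below is a quick sketch using simulated white noise (the series is purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)
y = rng.normal(size=200)  # simulated white noise series

# sample autocovariance at lag 0: average squared deviation from the mean
gamma0 = np.mean((y - y.mean()) ** 2)

# this equals the (biased, divisor-n) sample variance
print(np.isclose(gamma0, y.var(ddof=0)))  # True
```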

The autocorrelation, also called *serial correlation*, is a measure of the internal correlation of a time series. It is a representation of the degree of similarity between the time series and a lagged version of itself. High autocorrelation values mean that the future is strongly correlated to the past.

The *cross-covariance function*, $\gamma_{xy}(s,t) = E[(x_s-\mu_{x_s})(y_t-\mu_{y_t})]$, is a measure of the predictability of one series $y_t$ from another series $x_s$.

The cross-covariance function can be scaled to $[-1,1]$; the result is referred to as the **cross-correlation function (CCF)**.
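As a sketch of the idea, the sample cross-correlation at a single lag $k$ can be computed directly. The helper name and the synthetic series below are illustrative, not from the text:

```python
import numpy as np

def sample_ccf(x, y, k):
    """Sample cross-correlation between x_{t+k} and y_t for lag k >= 0."""
    n = len(x)
    x_c, y_c = x - x.mean(), y - y.mean()
    cross_cov = np.sum(x_c[k:] * y_c[: n - k]) / n  # sample cross-covariance
    return cross_cov / (x.std() * y.std())          # scale into [-1, 1]

rng = np.random.default_rng(0)
y = rng.normal(size=500)
x = np.roll(y, 2) + 0.1 * rng.normal(size=500)  # x echoes y two steps later

# a large value at lag 2 indicates that x is predictable from y shifted by 2,
# while at lag 0 the two white-noise series are nearly uncorrelated
print(sample_ccf(x, y, 2))
print(sample_ccf(x, y, 0))
```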

The true mean and autocorrelation functions are in general not known and must be estimated from the sample data $y_1, y_2, \ldots, y_n$.

The mean function is estimated by the sample mean

$$\bar y = \frac{1}{n}\sum_{t=1}^n y_t\text{,}$$

and the theoretical autocorrelation function is estimated by the sample ACF

$$\hat{\rho}(k) = \frac{\hat{\gamma}(k)}{\hat{\gamma}(0)} = \frac{\sum_{t=1}^{n-k}(y_{t+k}-\bar y)(y_t-\bar y)}{\sum_{t=1}^n(y_t-\bar y)^2}\text{,}$$

for $k=0,1,\ldots,n-1$.
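To make the estimator concrete, here is a direct implementation of the sample ACF formula above; the helper name and the noisy sine series are made up for illustration:

```python
import numpy as np

def sample_acf(y, k):
    """Sample autocorrelation at lag k, following the formula above."""
    y = np.asarray(y, dtype=float)
    y_c = y - y.mean()
    return np.sum(y_c[k:] * y_c[: len(y) - k]) / np.sum(y_c**2)

# a noisy sine with a 12-step period, mimicking a monthly seasonal cycle
t = np.arange(120)
rng = np.random.default_rng(7)
y = np.sin(2 * np.pi * t / 12) + 0.3 * rng.normal(size=120)

print(sample_acf(y, 0))   # 1.0 by definition
print(sample_acf(y, 12))  # large and positive: the series repeats every 12 steps
print(sample_acf(y, 6))   # large and negative: half a period out of phase
```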

One of the most useful descriptive tools in time series analysis is the **correlogram plot**, which is simply a plot of the serial correlations $\hat{\rho}(k)$ against the lag $k$ for $k = 0, 1,\ldots,M$, where $M$ is usually much less than the sample size $n$.

For the sake of demonstration we consider the monthly temperature time series at the weather station Berlin-Dahlem (`ts_FUB_monthly`) for the period 1981 to 1990.

First, we subset the original time series and then apply the `plot_acf()` function from the `statsmodels` package to compute and plot the autocorrelation function.

In [8]:

```
# First, let's import the needed libraries.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
```

In [9]:

```
# load the data set stored as a .json file
ts_FUB_monthly = pd.read_json(
    "http://userpage.fu-berlin.de/soga/soga-py/300/307000_time_series/ts_FUB_monthly.json"
)
ts_FUB_monthly["Date"] = pd.to_datetime(
    ts_FUB_monthly["Date"], format="%Y-%m-%d", errors="coerce"
)
ts_FUB_monthly
```

Out[9]:

| | Date | rainfall |
|---|---|---|
| 0 | 1719-01-01 | 2.80 |
| 1 | 1719-02-01 | 1.10 |
| 2 | 1719-03-01 | 5.20 |
| 3 | 1719-04-01 | 9.00 |
| 4 | 1719-05-01 | 15.10 |
| ... | ... | ... |
| 3631 | 2021-08-01 | 17.43 |
| 3632 | 2021-09-01 | 15.55 |
| 3633 | 2021-10-01 | 10.49 |
| 3634 | 2021-11-01 | 6.28 |
| 3635 | 2021-12-01 | 2.19 |

3636 rows × 2 columns

Let's check the column labels using the `.keys()` method:

In [10]:

```
ts_FUB_monthly.keys()
```

Out[10]:

Index(['Date', 'rainfall'], dtype='object')

In [11]:

```
ts_FUB_monthly = ts_FUB_monthly.set_index("Date")
# subset the period 1981 to 1990
ts_FUB_monthly_1980 = ts_FUB_monthly.loc["1981-01-01":"1990-12-01"]
```

Next, we plot the time series together with an ACF plot.

In [16]:

```
from statsmodels.graphics.tsaplots import plot_acf

fig, ax = plt.subplots(2, 1, figsize=(18, 8))
ax[0].plot(ts_FUB_monthly_1980)
ax[0].set_title("Mean monthly temperatures at Berlin-Dahlem 1981 to 1990", fontsize=16)
ax[0].set_ylabel("Temperature", fontsize=16)
plot_acf(ts_FUB_monthly_1980, lags=20, ax=ax[1])
ax[1].set_title(
    "Serial correlation of the mean monthly\ntemperatures at Berlin-Dahlem 1981 to 1990",
    fontsize=16,
)
ax[1].set_xlabel("Lag", fontsize=16)
ax[1].set_ylabel("ACF", fontsize=16)
plt.tight_layout()
plt.show()
```

The correlogram shows an oscillating autocorrelation structure with very strong autocorrelations at a lag of 6 months and at multiples of 6. This is to be expected given the nature of the temperature time series. The blue shaded region indicates the 95% confidence limits, which are added automatically to the ACF plot.

However, please note that trends and periodicities are typically removed from the data before its autocorrelation structure is investigated.

A key idea in time series analysis is that of **stationarity**. Stationarity is an important precondition for the analysis of the correlational structure of a time series. A time series is considered stationary if its behavior does not change over time. This means, for example, that the values always tend to vary about the same level and that their variability is constant over time.

A time series $\{y_t: t \in \mathbb Z\}$ is said to be *strictly stationary* if for any $k > 0$ and any $t_1,...,t_k \in \mathbb Z$, the distribution of $(y_{t_1}, ...,y_{t_k})$ is the same as that of $(y_{t_1+u}, ...,y_{t_k+u})$ for every value of $u$. Following this definition, the stochastic behavior of the process does not change through time.

A weaker definition of stationarity is *second-order stationarity*, also referred to as *wide-sense* or *covariance stationarity*. Here, we do not assume anything about the joint distribution of the random responses $y_{t_1}, y_{t_2}, y_{t_3},\ldots$ except that the mean is constant, $E[y_t] = \mu$, and that the covariance between two observations $y_t$ and $y_{t+k}$ depends only on the lag $k$ between them and not on the position $t$ in the time series.
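A rough, informal way to see what the constant-mean requirement rules out is to compare window means of a stationary and a non-stationary series; the simulated series below are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 400
stationary = rng.normal(size=n)               # white noise: mean constant over time
trending = stationary + 0.01 * np.arange(n)   # linear trend: mean drifts upwards

# compare the mean of the first and second half of each series
for name, series in [("stationary", stationary), ("trending", trending)]:
    first_half = series[: n // 2].mean()
    second_half = series[n // 2 :].mean()
    print(f"{name}: {first_half:.2f} vs {second_half:.2f}")
```

This comparison is only a heuristic; in practice one would apply a formal unit-root test such as the augmented Dickey-Fuller test (`statsmodels.tsa.stattools.adfuller`).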

The theory of time series analysis is based on the assumption of second-order stationarity. However, real-life data are often not stationary. The stationarity assumptions above apply only after any trends and seasonal effects have been removed.

**Citation**

The E-Learning project SOGA-Py was developed at the Department of Earth Sciences by Annette Rudolph, Joachim Krois and Kai Hartmann. You can reach us via mail by soga[at]zedat.fu-berlin.de.

You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follows: *Rudolph, A., Krois, J., Hartmann, K. (2023): Statistics and Geodata Analysis using Python (SOGA-Py). Department of Earth Sciences, Freie Universitaet Berlin.*