In the following sections we introduce basic operations for time series analysis. We discuss the following topics:

  • Subsetting and indexing
  • Summary statistics
  • Aggregation of time series data

As mentioned in the previous section, there are several ways to work with time series in Python. Hence, it is important to be aware of the object class and, correspondingly, of the data representation. This representation dictates which functions are available for loading, processing, analyzing, printing, and plotting the time series data.


Loading the sample data

For the purpose of demonstration we load the monthly (ts_FUB_monthly), daily (ts_FUB_daily) and hourly (ts_FUB_hourly) time series data for the weather station Berlin-Dahlem (FU) into Python. We can do that by using the pandas.read_json() function. Check out the previous section on data sets used to remind yourself how we processed the data.

In [2]:
# First, let's import the needed libraries.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from datetime import datetime
In [3]:
ts_FUB_monthly = pd.read_json("../data/ts_FUB_monthly.json")
ts_FUB_monthly["Date"] = pd.to_datetime(
    ts_FUB_monthly["Date"], format="%Y-%m-%d", errors="coerce"
)

ts_FUB_daily = pd.read_json("../data/ts_FUB_daily.json")
ts_FUB_daily["MESS_DATUM"] = pd.to_datetime(
    ts_FUB_daily["MESS_DATUM"], format="%Y-%m-%d", errors="coerce"
)

ts_FUB_hourly = pd.read_json("../data/ts_FUB_hourly.json")
ts_FUB_hourly["MESS_DATUM"] = pd.to_datetime(
    ts_FUB_hourly["MESS_DATUM"], format="%Y-%m-%d", errors="coerce"
)
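
Note that errors="coerce" turns every entry that does not match the given format into NaT (not a time) instead of raising an error, so it can be worth counting missing dates after loading. The following is a minimal sketch of that behaviour, assuming a small hypothetical data frame called demo rather than the station files:

```python
import pandas as pd

# Hypothetical toy data; the second date string is deliberately malformed.
demo = pd.DataFrame(
    {"Date": ["2021-01-01", "not a date", "2021-03-01"], "value": [1.0, 2.0, 3.0]}
)
demo["Date"] = pd.to_datetime(demo["Date"], format="%Y-%m-%d", errors="coerce")

# Number of entries that could not be parsed and were set to NaT.
n_unparsed = int(demo["Date"].isna().sum())
print(n_unparsed)
```

If the count is larger than expected, the format string probably does not match the way the dates are stored.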

First, we check the object classes of the three data sets:

In [4]:
print(type(ts_FUB_monthly))
print(str(ts_FUB_monthly))
<class 'pandas.core.frame.DataFrame'>
           Date  rainfall
0    1719-01-01      2.80
1    1719-02-01      1.10
2    1719-03-01      5.20
3    1719-04-01      9.00
4    1719-05-01     15.10
...         ...       ...
3631 2021-08-01     17.43
3632 2021-09-01     15.55
3633 2021-10-01     10.49
3634 2021-11-01      6.28
3635 2021-12-01      2.19

[3636 rows x 2 columns]
In [5]:
print(type(ts_FUB_daily))
print(str(ts_FUB_daily))
<class 'pandas.core.frame.DataFrame'>
      MESS_DATUM  Temp  Rain
0     1950-01-01  -3.2   2.2
1     1950-01-02   1.0  12.6
2     1950-01-03   2.8   0.5
3     1950-01-04  -0.1   0.5
4     1950-01-05  -2.8  10.3
...          ...   ...   ...
26293 2021-12-27  -3.7   0.0
26294 2021-12-28  -0.5   1.5
26295 2021-12-29   4.0   0.3
26296 2021-12-30   9.0   3.2
26297 2021-12-31  12.8   5.5

[26298 rows x 3 columns]
In [6]:
print(type(ts_FUB_hourly))
print(str(ts_FUB_hourly))
<class 'pandas.core.frame.DataFrame'>
                MESS_DATUM  rainfall
0      2002-01-28 11:00:00       0.0
1      2002-01-28 13:00:00       0.0
2      2002-01-28 15:00:00       1.7
3      2002-01-28 18:00:00       1.1
4      2002-01-28 21:00:00       0.0
...                    ...       ...
174018 2021-12-31 19:00:00       0.7
174019 2021-12-31 20:00:00       0.7
174020 2021-12-31 21:00:00       0.1
174021 2021-12-31 22:00:00       0.1
174022 2021-12-31 23:00:00       0.0

[174023 rows x 2 columns]

All three data sets are of class pandas.DataFrame.
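
The difference between a pandas.DataFrame and a pandas.Series matters for the functions used later: selecting a single column of a data frame yields a series, and setting the date column as index yields a DatetimeIndex. A minimal sketch, assuming a hypothetical toy data frame with the same layout as ts_FUB_monthly:

```python
import pandas as pd

# Hypothetical toy data with the same column names as ts_FUB_monthly.
toy = pd.DataFrame(
    {
        "Date": pd.date_range("2000-01-01", periods=3, freq="MS"),
        "rainfall": [2.8, 1.1, 5.2],
    }
)

column = toy["rainfall"]                     # one column -> Series
indexed = toy.set_index("Date")["rainfall"]  # Series with a DatetimeIndex

print(type(toy).__name__, type(column).__name__, type(indexed.index).__name__)
```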


Plotting

Now let us plot the monthly data with the plot() function.

In [7]:
plt.figure(figsize=(18, 6))
plt.plot(ts_FUB_monthly.Date, ts_FUB_monthly.rainfall)
plt.show()

Exercise: Plot the daily and hourly data sets using the plot() function.

In [8]:
## Your code here...
In [9]:
fig, ax = plt.subplots(2, 1, figsize=(18, 8))

ax[0].plot(ts_FUB_daily["Temp"])
ax[0].set_title("Temp")

ax[1].plot(ts_FUB_daily["Rain"], color="orange")
ax[1].set_title("Rain")
plt.show()
In [10]:
## Your code here...
In [11]:
plt.figure(figsize=(18, 6))
plt.plot(ts_FUB_hourly.MESS_DATUM, ts_FUB_hourly.rainfall)
plt.show()

Summary statistics

A very important and recurrent task in time series analysis is the calculation of summary statistics. To introduce summary statistics for time series analysis we revisit the monthly (ts_FUB_monthly), daily (ts_FUB_daily) and hourly (ts_FUB_hourly) data sets from the weather station Berlin-Dahlem (FU). Check out the previous section on data sets used to remind yourself how we processed the data.

For the sake of simplicity we reduce the monthly and daily data sets and focus on the 10-year period from 2000 to 2009.

In [12]:
### 10-year period from 2000 to 2009 daily data ###
daily_2000_2009 = ts_FUB_daily.set_index(["MESS_DATUM"])
daily_2000_2009 = daily_2000_2009["2000-01-01":"2009-12-31"]

### 10-year period from 2000 to 2009 monthly data ###
monthly_2000_2009 = ts_FUB_monthly.set_index(["Date"])
monthly_2000_2009 = monthly_2000_2009["2000-01-01":"2009-12-31"]
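
The slicing used above relies on the date column being the index. As a minimal sketch of how this kind of subsetting behaves, assuming a hypothetical daily data frame toy_daily with made-up values: slicing a sorted DatetimeIndex with date strings includes both end points, and a partial date string such as a year selects the whole period.

```python
import numpy as np
import pandas as pd

# Hypothetical daily data spanning 1999-2010; the values are just a counter.
dates = pd.date_range("1999-01-01", "2010-12-31", freq="D")
toy_daily = pd.DataFrame({"Temp": np.arange(len(dates), dtype=float)}, index=dates)

# Label-based slicing on a sorted DatetimeIndex includes both end points.
subset = toy_daily.loc["2000-01-01":"2009-12-31"]

# A partial date string selects a whole year.
year_2005 = toy_daily.loc["2005"]

print(len(subset), len(year_2005))
```
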
In [13]:
## Plotting ##

import matplotlib.gridspec as gridspec

plt.figure(figsize=(18, 8))

gs = gridspec.GridSpec(
    2, 2, wspace=0.1, hspace=0.2
)  # optional: width_ratios=[2, 1.5], height_ratios=[1, 1]

gs.update(wspace=0.5)
ax1 = plt.subplot(gs[:1, 0])

ax1.plot(daily_2000_2009["Temp"])
ax1.set_title("Daily temperature at Berlin-Dahlem")


ax2 = plt.subplot(gs[1:, 0])
ax2.plot(daily_2000_2009["Rain"], color="orange")
ax2.set_title("Daily rainfall at Berlin-Dahlem")

ax3 = plt.subplot(gs[0:2, 1])
ax3.plot(monthly_2000_2009, color="black")
ax3.set_title("Mean monthly temperature \nat Berlin-Dahlem")


plt.show()

To get a quick overview of the statistical characteristics of a time series we can use the describe() function. The function returns basic statistics (count, mean, standard deviation, minimum, quartiles and maximum) for each numeric column of the data set.

In [14]:
monthly_2000_2009.describe()
Out[14]:
         rainfall
count  120.000000
mean     9.937917
std      6.808907
min     -3.590000
25%      4.092500
50%      9.780000
75%     15.677500
max     23.200000

Exercise: Get the summary statistics for the daily temperature time series from 2000 to 2009.

In [15]:
## Your code here...
In [16]:
daily_2000_2009.describe()
Out[16]:
              Temp         Rain
count  3653.000000  3653.000000
mean      9.975938     1.684369
std       7.528174     4.020982
min     -15.100000     0.000000
25%       4.000000     0.000000
50%      10.300000     0.000000
75%      16.000000     1.500000
max      27.200000    63.200000
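
describe() can also be applied to a single column, and the reported percentiles can be changed with the percentiles argument. A minimal sketch, assuming a hypothetical toy data frame with made-up values rather than the station data:

```python
import pandas as pd

# Hypothetical toy values, not the station data.
toy = pd.DataFrame(
    {"Temp": [1.0, 2.0, 3.0, 4.0, 5.0], "Rain": [0.0, 0.0, 0.0, 1.0, 9.0]}
)

# Summary of one column only, with custom percentiles.
temp_stats = toy["Temp"].describe(percentiles=[0.1, 0.5, 0.9])
print(temp_stats)

# Individual statistics are also available as methods.
rain_median = toy["Rain"].median()
print(rain_median)
```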

Another useful function is month_plot() from the statsmodels.graphics.tsa module, which plots the seasonal sub-series of a monthly time series. For each month, the values of that month across all years are drawn as a line, and a horizontal bar marks the mean of that sub-series. The function expects monthly data, that is, observations that come in groups of 12. Note that the statsmodels package has to be installed to run the following cell.

In [17]:
import statsmodels.api as sm

plt.figure(figsize=(25, 4))
fig = sm.graphics.tsa.month_plot(monthly_2000_2009)

plt.xlabel("Month")
plt.ylabel("Temperature")

plt.show()

The horizontal bars represent the mean monthly temperature and the lines represent the time series for each particular sub-series (month).
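
The values marked by the horizontal bars can also be computed with pandas alone by grouping on the month of the index. A minimal sketch, assuming a hypothetical two-year monthly series toy_monthly with made-up values:

```python
import pandas as pd

# Hypothetical monthly series: 2 years, value = month number (+10 in year 2).
idx = pd.date_range("2000-01-01", periods=24, freq="MS")
values = [m + (10 if y == 2001 else 0) for y, m in zip(idx.year, idx.month)]
toy_monthly = pd.Series(values, index=idx, dtype=float)

# Mean of each monthly sub-series (all Januaries, all Februaries, ...).
monthly_means = toy_monthly.groupby(toy_monthly.index.month).mean()
print(monthly_means)
```

Replacing mean() with median() or std() gives other per-month statistics in the same way.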

Next, let us further explore the seasonality of our data with box plots, using the boxplot() function implemented in the seaborn package. The data are grouped by time period and the distribution of each group is displayed. We first group the data by month to visualize the yearly seasonality.

In [ ]:
## group the datetimeindex by month
monthly_2000_2009.index.month
In [ ]:
import seaborn as sns

plt.figure(figsize=(18, 7))

sns.boxplot(x=monthly_2000_2009.index.month, y=monthly_2000_2009.values.flatten())
plt.xlabel("Month", fontsize = 18)
plt.ylabel("Temperature", fontsize = 18)

plt.show()

The resulting plot shows at a glance which months are more variable than others. Note that a box plot is a visual summary only; whether differences between months are statistically significant has to be checked with a formal test.
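
The spread shown by each box can also be expressed as a number. A minimal sketch, assuming a hypothetical ten-year monthly series with a synthetic seasonal cycle plus random noise, that computes the interquartile range (the height of the box) for each calendar month:

```python
import numpy as np
import pandas as pd

# Hypothetical monthly data over 10 years: seasonal cycle plus random noise.
rng = np.random.default_rng(0)
idx = pd.date_range("2000-01-01", periods=120, freq="MS")
month = idx.month.to_numpy()
values = 10 - 10 * np.cos(2 * np.pi * (month - 1) / 12) + rng.normal(0, 1, 120)
toy = pd.Series(values, index=idx)

# Quartiles per calendar month; their difference is the interquartile range.
grouped = toy.groupby(toy.index.month)
iqr = grouped.quantile(0.75) - grouped.quantile(0.25)
print(iqr.round(2))
```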


Citation

The E-Learning project SOGA-Py was developed at the Department of Earth Sciences by Annette Rudolph, Joachim Krois and Kai Hartmann. You can reach us via mail by soga[at]zedat.fu-berlin.de.

Creative Commons License
You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follows: Rudolph, A., Krois, J., Hartmann, K. (2023): Statistics and Geodata Analysis using Python (SOGA-Py). Department of Earth Sciences, Freie Universitaet Berlin.