Working with time series data is relatively straightforward in Python. Because Python is an object-oriented programming language, we have to be aware of how the data is represented, that is, of its object class. This representation dictates which functions are available for loading, processing, analyzing, printing, and plotting our data.

Time series data is often stored in .csv files or other spreadsheet formats. These typically contain two columns: the date and the measured value. The pandas library comes in very handy when working with time series data sets. Load .csv files with the read_csv() function from the pandas package.
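As a minimal sketch of loading such a two-column file, the example below reads a small CSV from an in-memory string. The column names and values are made up for illustration; parse_dates and index_col are standard read_csv() parameters that convert the date column while reading and use it as the row index.

```python
import io

import pandas as pd

# Hypothetical file content (date; value), made up for illustration.
csv_text = """date;value
2021-01-01;1.5
2021-01-02;2.0
2021-01-03;1.8
"""

# parse_dates converts the named column to datetime while reading,
# index_col makes that column the row index.
ts = pd.read_csv(
    io.StringIO(csv_text),
    sep=";",
    parse_dates=["date"],
    index_col="date",
)

print(ts.index.dtype)
print(ts.head())
```

Reading the dates directly as the index saves a separate conversion step, provided the date column has a format that pandas can recognize.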

In the subsequent section we will deal mostly with the following packages associated with time series analysis:

  • pandas
  • datetime
  • statsmodels
  • scipy
  • ...

We begin this chapter by loading the meteorological data set from the Deutscher Wetterdienst DWD (German Weather Service). The data was downloaded from the Climate Data Center (German Weather Service) on 2022-07-22.

In [2]:
# First, let's import the needed libraries.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
In [3]:
# Read data
df = pd.read_csv(
    "http://userpage.fu-berlin.de/soga/soga-py/300/307000_time_series/DWD_Berlin-Dahlem_DailyData.csv",
    sep=";",
)

# Rename columns
df.columns = [
    "Date",
    "QN",
    "FX",
    "FM",
    "QN_4",
    "RSK",
    "RSKF",
    "SDK",
    "SHK_TAG",
    "NM",
    "VPM",
    "PM",
    "TMK",
    "UPM",
    "TXK",
    "TNK",
    "TGK",
]

## Drop columns we want to ignore
df = df.drop(columns=["QN", "FX", "FM", "QN_4", "SHK_TAG"])
In [4]:
# Sort the data set by date: Ascending

df = df.sort_values(by=["Date"], ascending=True)
df.head(10)
Out[4]:
Date RSK RSKF SDK NM VPM PM TMK UPM TXK TNK TGK
0 19500101 2.2 7 -999.0 5.0 4.0 1025.6 -3.2 83.0 -1.1 -4.9 -6.3
1 19500102 12.6 8 -999.0 8.0 6.1 1005.6 1.0 95.0 2.2 -3.7 -5.3
2 19500103 0.5 1 -999.0 5.0 6.5 996.6 2.8 86.0 3.9 1.7 -1.4
3 19500104 0.5 7 -999.0 7.7 5.2 999.5 -0.1 85.0 2.1 -0.9 -2.3
4 19500105 10.3 7 -999.0 8.0 4.0 1001.1 -2.8 79.0 -0.9 -3.3 -5.2
5 19500106 7.2 8 -999.0 7.3 5.6 997.5 2.6 79.0 5.0 -4.0 -4.0
6 19500107 0.4 1 -999.0 8.0 8.0 1005.1 5.7 89.0 6.4 2.5 1.1
7 19500108 0.0 1 -999.0 7.0 9.7 1011.9 7.0 93.0 8.7 5.8 2.2
8 19500109 3.7 8 -999.0 7.7 7.7 1009.1 5.8 85.0 6.9 4.5 2.6
9 19500110 4.5 8 -999.0 8.0 4.8 1021.5 -2.4 88.0 6.0 -4.5 -0.4
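In the output above, the SDK column contains -999.0 in every row shown. We assume here that this value is a placeholder for missing observations rather than a measurement; the data description does not state this explicitly. Under that assumption, a minimal sketch for turning the placeholder into NaN looks as follows. The toy data frame copies a few values from the output above.

```python
import numpy as np
import pandas as pd

# Toy frame imitating the first rows shown above.
toy = pd.DataFrame(
    {
        "Date": [19500101, 19500102, 19500103],
        "SDK": [-999.0, -999.0, -999.0],
        "TMK": [-3.2, 1.0, 2.8],
    }
)

# Assumption: -999 marks a missing value, so we replace it with NaN.
toy_clean = toy.replace(-999.0, np.nan)

print(toy_clean["SDK"].isna().sum())  # number of missing SDK values
print(toy_clean["TMK"].mean())        # TMK is unaffected
```

With NaN instead of -999, summary statistics and plots skip the missing entries instead of treating them as extreme values.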

Data info: variable descriptions from the Deutscher Wetterdienst (German Weather Service):

  • RSK: daily precipitation level; mm
  • RSKF: daily precipitation form; numeric code
  • SDK: sunshine duration (daily sum); hours
  • NM: daily average of the degree of cloud coverage; eighths
  • VPM: daily average of vapor pressure; hPa
  • PM: daily mean air pressure; hPa
  • TMK: daily mean temperature; °C
  • UPM: daily average of relative humidity; %
  • TXK: daily maximum air temperature at 2 m height; °C
  • TNK: daily minimum air temperature at 2 m height; °C
  • TGK: minimum air temperature at ground level at 5 cm height; °C

In general, daily average values are calculated as follows. From 01.04.2001 the standard was changed to:

  • Daily means are calculated from 24 hourly values.
  • If more than 3 hourly values are missing, the mean is calculated from the 4 main observation times (00, 06, 12, 18 UTC).
  • The reference time for a day usually runs from 23:51 UTC of the previous day to 23:50 UTC.
  • Only the precipitation of the previous day is measured in the morning at 05:50 UTC.

The observation times refer to the globally used time at Greenwich (GMT or UTC). The observation is always made 10 minutes before the reference time, which explains the odd-looking times. This change became necessary after the station network was largely automated.
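The averaging rule described above can be sketched in a few lines. The function daily_mean and its parameter max_missing are hypothetical names introduced here for illustration; this is not code from the Deutscher Wetterdienst. We also assume that, when at most 3 hourly values are missing, the mean is taken over the remaining values, which the description does not spell out.

```python
import numpy as np

MAIN_HOURS = [0, 6, 12, 18]  # main observation times in UTC


def daily_mean(hourly_values, max_missing=3):
    """Mean of 24 hourly values with a fallback to the 4 main hours.

    Hypothetical helper for illustration only.
    """
    values = np.asarray(hourly_values, dtype=float)
    if values.shape != (24,):
        raise ValueError("expected exactly 24 hourly values")

    n_missing = int(np.isnan(values).sum())
    if n_missing <= max_missing:
        # Enough data: average the available hourly values (assumption).
        return float(np.nanmean(values))
    # Too many gaps: average only the 4 main observation times.
    # Assumption: if one of them is missing as well, the result is NaN.
    return float(values[MAIN_HOURS].mean())


complete = np.arange(24, dtype=float)
gappy = complete.copy()
gappy[1:6] = np.nan  # five missing hours (01 to 05 UTC)

print(daily_mean(complete))  # 11.5
print(daily_mean(gappy))     # 9.0
```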

The datetime module

In Python, date and time are not built-in data types of their own, but the datetime module can be imported to work with dates as well as times.

The datetime module comes built into Python, so there is no need to install it externally.

datetime supplies classes to work with date and time. These classes provide a number of functions to deal with dates, times and time intervals. Date and datetime are objects in Python, so when you edit them, you are actually editing objects and not strings or timestamps.

The standard date format codes are given below.

$$ \begin{array}{|c|l|} \hline \text{code} & \text{value} \\ \hline \mathtt{\%d} & \text{Day of the month (number)} \\ \mathtt{\%m} & \text{Month (number)} \\ \mathtt{\%b} & \text{Month (abbreviated)} \\ \mathtt{\%B} & \text{Month (full name)} \\ \mathtt{\%y} & \text{Year (2 digit)} \\ \mathtt{\%Y} & \text{Year (4 digit)} \\ \hline \end{array} $$
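To see the format codes in action, the short sketch below formats one date in several ways with strftime() and parses a string back with strptime(). The example date is arbitrary. Month names produced by %b and %B depend on the locale, so no output is stated for that line.

```python
from datetime import date, datetime

d = date(2019, 7, 29)

print(d.strftime("%d.%m.%Y"))  # 29.07.2019
print(d.strftime("%y-%m-%d"))  # 19-07-29
print(d.strftime("%d %B %Y"))  # full month name, locale dependent

# The same codes are used for parsing a string into a date.
parsed = datetime.strptime("29.07.2019", "%d.%m.%Y").date()
print(parsed == d)  # True
```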

Relevant datetime classes:

$$ \begin{array}{|c|l|l|} \hline \text{class} & \text{description} & \text{attributes} \\ \hline \mathtt{date} & \text{calendar date, assuming the current Gregorian calendar} & \text{year, month, day} \\ \mathtt{time} & \text{time, independent of any particular day (day = 24*60*60 sec)} & \text{hour, minute, second, microsecond, tzinfo} \\ \mathtt{datetime} & \text{combination of date and time along with the respective attributes} & \\ \mathtt{tzinfo} & \text{provides time zone information objects} & \\ \hline \end{array} $$
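A brief sketch of how these classes fit together is given below. In addition to the classes in the table, it uses timezone (a ready-made tzinfo implementation) and timedelta (the result of subtracting two datetime objects), both of which are part of the same module. The concrete dates are arbitrary examples.

```python
from datetime import date, datetime, time, timedelta, timezone

d = date(2019, 7, 29)
t = time(14, 30, 40, tzinfo=timezone.utc)

# Combine a date and a time into one datetime object.
dt = datetime.combine(d, t)

print(d.year, d.month, d.day)
print(t.hour, t.minute, t.second, t.tzinfo)
print(dt.isoformat())  # 2019-07-29T14:30:40+00:00

# Subtracting two datetime objects yields a time interval (timedelta).
delta = datetime(2019, 7, 30, 14, 30, 40, tzinfo=timezone.utc) - dt
print(delta)  # 1 day, 0:00:00
```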

Example: string to datetime

Say we have a date and time stored as a string, which looks like this:

2019-07-29 14:30:40Z

This string includes:

  • date in YYYY-MM-DD format
  • time in HH:MM:SS format
  • time zone as "Z" (indicating UTC)

Read more about datetime string formatting.

First, we want to convert this string to a Python datetime object. The datetime.strptime function comes in very handy here. The function takes in a date string and formatting characters and returns a Python datetime object.

In [5]:
from datetime import datetime

date_as_string = "2019-07-29 14:30:40Z"
date_object = datetime.strptime(date_as_string, "%Y-%m-%d %H:%M:%SZ")
print("date_object =", date_object, "class = ", type(date_object))
date_object = 2019-07-29 14:30:40 class =  <class 'datetime.datetime'>
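Note that in the format string above the "Z" is matched as a literal character, so the resulting object carries no time zone information. As a minimal sketch of an alternative, the %z directive parses a UTC offset and, since Python 3.7, also accepts "Z" as UTC. The variable name aware is our own choice.

```python
from datetime import datetime, timezone

date_as_string = "2019-07-29 14:30:40Z"

# %z parses a UTC offset; a literal "Z" is read as UTC (Python 3.7+).
aware = datetime.strptime(date_as_string, "%Y-%m-%d %H:%M:%S%z")

print(aware)         # 2019-07-29 14:30:40+00:00
print(aware.tzinfo)  # UTC
```

A time zone aware object is useful when timestamps from different zones have to be compared or converted.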

Example: Working with datetime in data frames

Returning to our DWD data, we will consider the date column. Firstly, check for the data type stored in the date column.

In [6]:
df["Date"].dtype
Out[6]:
dtype('int64')

The data type is int64. We would like to transform these integers into dates. We achieve that with the to_datetime() function from pandas, to which we pass the format of the integer dates.

In [7]:
df["Date"] = pd.to_datetime(df["Date"], format="%Y%m%d")
df.head()
Out[7]:
Date RSK RSKF SDK NM VPM PM TMK UPM TXK TNK TGK
0 1950-01-01 2.2 7 -999.0 5.0 4.0 1025.6 -3.2 83.0 -1.1 -4.9 -6.3
1 1950-01-02 12.6 8 -999.0 8.0 6.1 1005.6 1.0 95.0 2.2 -3.7 -5.3
2 1950-01-03 0.5 1 -999.0 5.0 6.5 996.6 2.8 86.0 3.9 1.7 -1.4
3 1950-01-04 0.5 7 -999.0 7.7 5.2 999.5 -0.1 85.0 2.1 -0.9 -2.3
4 1950-01-05 10.3 7 -999.0 8.0 4.0 1001.1 -2.8 79.0 -0.9 -3.3 -5.2
In [8]:
## check again for the datatype
df["Date"]
Out[8]:
0       1950-01-01
1       1950-01-02
2       1950-01-03
3       1950-01-04
4       1950-01-05
           ...    
26293   2021-12-27
26294   2021-12-28
26295   2021-12-29
26296   2021-12-30
26297   2021-12-31
Name: Date, Length: 26298, dtype: datetime64[ns]

Perfect, the data type is now datetime! Next, we would like the dates to be our index instead of the running numbers. This step brings many advantages, as elaborated below.

In [9]:
### Set the date as Index:
df = df.set_index("Date")
df.head()
Out[9]:
RSK RSKF SDK NM VPM PM TMK UPM TXK TNK TGK
Date
1950-01-01 2.2 7 -999.0 5.0 4.0 1025.6 -3.2 83.0 -1.1 -4.9 -6.3
1950-01-02 12.6 8 -999.0 8.0 6.1 1005.6 1.0 95.0 2.2 -3.7 -5.3
1950-01-03 0.5 1 -999.0 5.0 6.5 996.6 2.8 86.0 3.9 1.7 -1.4
1950-01-04 0.5 7 -999.0 7.7 5.2 999.5 -0.1 85.0 2.1 -0.9 -2.3
1950-01-05 10.3 7 -999.0 8.0 4.0 1001.1 -2.8 79.0 -0.9 -3.3 -5.2
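With the dates as index, the index itself becomes a DatetimeIndex whose date components can be accessed directly. The sketch below repeats the conversion steps on a small toy data frame; the TMK values are made up for illustration and are not taken from the DWD data set.

```python
import pandas as pd

# Toy data with integer dates, as in the raw DWD file (values are made up).
toy = pd.DataFrame(
    {"Date": [19500101, 19500102, 20211231], "TMK": [-3.2, 1.0, 5.0]}
)

toy["Date"] = pd.to_datetime(toy["Date"], format="%Y%m%d")
toy = toy.set_index("Date")

print(type(toy.index).__name__)           # DatetimeIndex
print(toy.index.year.tolist())            # [1950, 1950, 2021]
print(toy.index.is_monotonic_increasing)  # True, the dates are sorted
```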

Benefits of the DatetimeIndex

1. Subsetting and slicing

A very common task in time series analysis is the subsetting of a time series. Maybe you are interested in the oldest or newest observations, or you would like to extract a certain range of data. Since we have set the date as the index, we can easily look at the data for a certain day:

In [10]:
df.loc["2018-11-13"]
Out[10]:
RSK        8.9
RSKF       6.0
SDK        1.1
NM         6.0
VPM       11.8
PM      1012.3
TMK        9.7
UPM       97.0
TXK       12.3
TNK        6.8
TGK        4.4
Name: 2018-11-13 00:00:00, dtype: float64

Or we can have a look at the data of a certain time interval. For example, how was the weather in June 2019?

In [11]:
df_June = df["2019-06-01":"2019-06-30"]
df_June
Out[11]:
RSK RSKF SDK NM VPM PM TMK UPM TXK TNK TGK
Date
2019-06-01 0.0 0 12.700 4.0 15.2 1011.15 20.3 68.00 27.1 14.0 10.3
2019-06-02 0.0 0 13.367 3.1 15.5 1007.62 22.7 60.25 31.3 14.6 10.1
2019-06-03 0.0 6 11.933 2.9 15.3 1003.78 24.8 52.83 33.2 15.6 11.4
2019-06-04 0.0 0 11.550 2.9 17.0 1002.47 23.6 61.00 30.6 17.2 13.3
2019-06-05 0.0 0 14.183 1.3 17.0 999.16 25.7 56.08 33.6 16.0 12.2
2019-06-06 5.4 6 8.367 4.8 19.0 999.29 21.0 78.29 31.7 14.5 14.1
2019-06-07 0.1 6 14.133 2.5 15.0 1004.55 19.4 68.92 25.7 12.8 10.1
2019-06-08 0.0 6 8.617 3.8 11.9 1009.54 17.9 59.25 22.6 10.4 5.9
2019-06-09 3.4 6 12.450 4.7 11.0 1013.62 18.8 53.92 25.9 8.7 4.8
2019-06-10 1.8 6 5.750 6.8 17.7 1003.88 21.1 71.96 27.2 14.5 14.0
2019-06-11 48.0 6 11.267 6.3 20.2 1000.25 23.0 74.79 29.1 16.8 14.6
2019-06-12 13.6 6 9.650 5.9 19.7 998.63 22.0 79.04 33.0 16.6 15.5
2019-06-13 0.0 6 8.900 3.6 15.1 1005.27 18.9 72.29 24.6 14.7 12.5
2019-06-14 0.0 6 12.050 4.1 14.8 1007.85 22.8 56.33 29.5 12.8 10.1
2019-06-15 0.0 0 10.800 5.4 19.7 1001.65 24.9 62.88 31.5 17.4 15.7
2019-06-16 0.0 6 5.800 6.8 14.8 1009.99 17.9 72.96 21.5 14.0 11.3
2019-06-17 0.0 0 10.300 3.2 15.2 1011.72 20.4 66.13 27.1 14.4 12.0
2019-06-18 0.0 0 15.283 2.6 14.9 1007.27 22.8 57.08 29.4 14.6 11.1
2019-06-19 0.9 6 12.767 5.3 17.8 1000.28 24.3 60.88 32.1 16.6 12.6
2019-06-20 1.9 6 6.983 6.0 19.2 1000.97 21.0 78.21 25.7 17.1 14.2
2019-06-21 0.0 0 10.033 4.8 15.5 1007.42 19.5 70.79 24.7 14.9 10.7
2019-06-22 0.0 0 12.417 2.8 14.5 1012.16 20.1 63.38 25.7 14.2 9.9
2019-06-23 0.0 0 14.550 1.4 13.2 1012.80 21.2 53.63 27.8 14.0 10.0
2019-06-24 0.0 0 15.900 3.7 12.4 1014.22 22.8 47.75 29.7 14.8 10.1
2019-06-25 0.0 0 15.367 4.5 16.1 1012.10 25.5 51.17 33.5 17.5 12.5
2019-06-26 0.0 0 15.100 2.0 19.0 1009.91 27.9 53.67 36.1 18.7 15.1
2019-06-27 0.0 0 14.367 1.7 12.5 1013.56 19.3 57.83 25.3 14.6 12.8
2019-06-28 0.0 0 10.550 3.3 12.9 1014.31 17.9 64.96 23.9 11.7 8.4
2019-06-29 0.0 0 15.683 0.6 13.5 1010.94 21.7 57.00 30.3 12.2 8.4
2019-06-30 0.0 6 15.033 1.3 15.6 1003.23 27.6 47.58 37.6 14.6 10.1
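A DatetimeIndex also supports partial date strings, so a whole month or year can be selected without spelling out the first and last day. The sketch below shows this on a toy data frame with one made-up value per day of 2019; it uses .loc, the label-based indexer of pandas.

```python
import numpy as np
import pandas as pd

# Toy daily series for 2019 (values are just a running counter).
idx = pd.date_range(start="2019-01-01", end="2019-12-31", freq="D")
toy = pd.DataFrame({"TMK": np.arange(len(idx), dtype=float)}, index=idx)

june = toy.loc["2019-06"]                  # all days of June 2019
first_half = toy.loc["2019-01":"2019-06"]  # slices include both end points

print(len(june))                # 30
print(len(first_half))          # 181
print(first_half.index.max())   # 2019-06-30 00:00:00
```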

Another nice feature of pandas is the ability to generate a new time series index with date_range(). We can define a starting date, start="6/1/2019", and an end date, end="6/1/2020", of the series.

Moreover, we can define a frequency, for example hourly ('H'), calendar daily ('D'), business daily ('B'), weekly ('W'), monthly ('M'), quarterly ('Q'), annual ('A'), and many others. Frequencies can also be specified as multiples of any of the base frequencies, for example '5D' for every five days. Note that recent pandas releases (2.2 and later) deprecate some of these aliases in favor of 'h', 'ME', 'QE' and 'YE'.

For example:

In [12]:
New_series = pd.date_range(start="6/1/2019", end="6/1/2020", freq="B")
New_series
Out[12]:
DatetimeIndex(['2019-06-03', '2019-06-04', '2019-06-05', '2019-06-06',
               '2019-06-07', '2019-06-10', '2019-06-11', '2019-06-12',
               '2019-06-13', '2019-06-14',
               ...
               '2020-05-19', '2020-05-20', '2020-05-21', '2020-05-22',
               '2020-05-25', '2020-05-26', '2020-05-27', '2020-05-28',
               '2020-05-29', '2020-06-01'],
              dtype='datetime64[ns]', length=261, freq='B')

or:

In [13]:
daily_index = pd.date_range(start="4/1/2018", end="4/30/2018", freq="D")
daily_index
Out[13]:
DatetimeIndex(['2018-04-01', '2018-04-02', '2018-04-03', '2018-04-04',
               '2018-04-05', '2018-04-06', '2018-04-07', '2018-04-08',
               '2018-04-09', '2018-04-10', '2018-04-11', '2018-04-12',
               '2018-04-13', '2018-04-14', '2018-04-15', '2018-04-16',
               '2018-04-17', '2018-04-18', '2018-04-19', '2018-04-20',
               '2018-04-21', '2018-04-22', '2018-04-23', '2018-04-24',
               '2018-04-25', '2018-04-26', '2018-04-27', '2018-04-28',
               '2018-04-29', '2018-04-30'],
              dtype='datetime64[ns]', freq='D')
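
Two further variations can be sketched in the same way: a multiple of a base frequency ('5D'), and a range defined by a number of periods instead of an end date. The variable names are our own; periods is a standard parameter of date_range().

```python
import pandas as pd

# Every five days within April 2018.
every_five_days = pd.date_range(start="2018-04-01", end="2018-04-30", freq="5D")
print(every_five_days)

# Ten business days starting on Monday, 3 June 2019.
ten_business_days = pd.date_range(start="2019-06-03", periods=10, freq="B")
print(ten_business_days[-1])  # 2019-06-14 00:00:00
```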

2. Easy plotting

Plotting the whole data set is very easy now, since we have set the Date column as the index. For example, we select one variable to be plotted, in this case the daily mean temperature (TMK):

In [14]:
plt.figure(figsize=(18, 4))
df["TMK"].plot()
plt.show()

Next, we create a new data frame in order to plot the data for a time slice together with a confidence band.

In [15]:
df_sliced = df["2019-01-01":"2019-03-31"]["TMK"].round(2)
df_sliced
Out[15]:
Date
2019-01-01     6.4
2019-01-02     1.8
2019-01-03    -0.4
2019-01-04     1.8
2019-01-05     6.0
              ... 
2019-03-27     7.3
2019-03-28     8.7
2019-03-29     9.4
2019-03-30    10.2
2019-03-31     8.0
Name: TMK, Length: 90, dtype: float64
In [16]:
print("df_sliced.shape: ", df_sliced.shape, "type(df_sliced): ", type(df_sliced))
df_sliced.shape:  (90,) type(df_sliced):  <class 'pandas.core.series.Series'>

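We created a pandas Series! In the following step, we calculate the half-width of the band and then transform our data into a pandas DataFrame to store the lower and upper bounds in additional columns. Note that the quantity computed here, 1.96 · std / mean, scales the coefficient of variation of the slice. It serves as a simple illustrative band around each value and is not the textbook confidence interval of the mean, which would use the standard error std / √n.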

In [17]:
ci = 1.96 * df_sliced.std() / df_sliced.mean()
ci
Out[17]:
1.6190937680968653
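
For comparison, a 95% confidence interval for the mean is commonly based on the standard error, std / √n, rather than on std / mean. The sketch below computes both quantities on synthetic data; the random values, the seed and all variable names are assumptions made for illustration and do not reproduce the DWD numbers. The factor 1.96 relies on a normal approximation.

```python
import numpy as np
import pandas as pd

# Synthetic daily "temperatures" for January to March 2019 (made-up data).
rng = np.random.default_rng(0)
idx = pd.date_range("2019-01-01", "2019-03-31", freq="D")
toy = pd.Series(rng.normal(loc=5.0, scale=3.0, size=len(idx)), index=idx, name="TMK")

n = toy.count()
mean = toy.mean()
sem = toy.std(ddof=1) / np.sqrt(n)  # standard error of the mean

# 95% confidence interval for the mean (normal approximation).
ci_lower = mean - 1.96 * sem
ci_upper = mean + 1.96 * sem

# The quantity used in this section, shown for comparison.
band = 1.96 * toy.std() / toy.mean()

print(f"mean = {mean:.2f}, 95% CI = [{ci_lower:.2f}, {ci_upper:.2f}]")
print(f"1.96 * std / mean = {band:.2f}")
```

The interval for the mean shrinks as the number of observations grows, whereas 1.96 · std / mean does not depend on the sample size in this way.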
In [18]:
df_sliced = pd.DataFrame(df_sliced)
df_sliced
Out[18]:
TMK
Date
2019-01-01 6.4
2019-01-02 1.8
2019-01-03 -0.4
2019-01-04 1.8
2019-01-05 6.0
... ...
2019-03-27 7.3
2019-03-28 8.7
2019-03-29 9.4
2019-03-30 10.2
2019-03-31 8.0

90 rows × 1 columns

In [19]:
df_sliced["Confidence_lower"] = df_sliced["TMK"] - ci
df_sliced["Confidence_upper"] = df_sliced["TMK"] + ci
df_sliced
Out[19]:
TMK Confidence_lower Confidence_upper
Date
2019-01-01 6.4 4.780906 8.019094
2019-01-02 1.8 0.180906 3.419094
2019-01-03 -0.4 -2.019094 1.219094
2019-01-04 1.8 0.180906 3.419094
2019-01-05 6.0 4.380906 7.619094
... ... ... ...
2019-03-27 7.3 5.680906 8.919094
2019-03-28 8.7 7.080906 10.319094
2019-03-29 9.4 7.780906 11.019094
2019-03-30 10.2 8.580906 11.819094
2019-03-31 8.0 6.380906 9.619094

90 rows × 3 columns

Now, we are able to plot the data with the respective confidence band:

In [20]:
fig, ax = plt.subplots(figsize=(18, 4))

# Draw on the axes created above instead of rebinding ax
df_sliced["TMK"].plot(ax=ax, marker="o", linestyle="-")
ax.set_ylabel("mean temperature (°C)")

ax.fill_between(
    df_sliced.index,
    df_sliced["Confidence_lower"],
    df_sliced["Confidence_upper"],
    color="b",
    alpha=0.1,
)
Out[20]:
<matplotlib.collections.PolyCollection at 0x1cb1c2ffd90>

Exercise
Was the sun shining in the summer of 2021? Plot the sunshine duration (SDK) from the DWD data set for the months May, June, July and August 2021.

In [21]:
## your code here...
In [22]:
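# Select the sunshine duration for May to August 2021
df_sliced = df["2021-05-01":"2021-08-31"]["SDK"].round(2)

# Half-width of the band, computed as in the temperature example
ci = 1.96 * np.std(df_sliced) / np.mean(df_sliced)

# Convert the Series to a DataFrame so the bounds can be stored as columns
df_sliced = df_sliced.to_frame()

df_sliced["Confidence_lower"] = df_sliced["SDK"] - ci
df_sliced["Confidence_upper"] = df_sliced["SDK"] + ci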
In [29]:
fig, ax = plt.subplots(figsize=(18, 4))

# Draw on the axes created above instead of rebinding ax
df_sliced["SDK"].plot(ax=ax, marker="o", linestyle="-")
ax.set_ylabel("sunshine duration [h]")

ax.fill_between(
    df_sliced.index,
    df_sliced["Confidence_lower"],
    df_sliced["Confidence_upper"],
    color="b",
    alpha=0.1,
)
Out[29]:
<matplotlib.collections.PolyCollection at 0x1cb216d4c70>