Working with time series data is relatively straightforward in Python. Since Python is an object-oriented programming language, we have to be aware of the data representation, referred to as the object class. This representation dictates which functions are available for loading, processing, analyzing, printing, and plotting our data.
Time series data is often stored as .csv files or other spreadsheet formats. Those typically contain two columns: date and measured value. The pandas library comes in very handy when working with time series data sets. Load .csv files by using the read_csv() function from the pandas package.
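As a minimal sketch (the file name and column names below are hypothetical), read_csv() can even parse a date column directly while loading via its parse_dates argument:

import pandas as pd

# Hypothetical file with two columns, "date" and "value";
# parse_dates converts the "date" column to datetime64 while reading.
ts = pd.read_csv("my_measurements.csv", sep=";", parse_dates=["date"])
print(ts.dtypes)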
In the subsequent sections we will deal mostly with the pandas, NumPy and matplotlib packages, together with Python's built-in datetime module.
We begin this chapter by loading a meteorological data set from the Deutscher Wetterdienst (DWD, German Weather Service). The data was downloaded from the DWD Climate Data Center on 2022-07-22.
# First, let's import the needed libraries.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
# Read data
df = pd.read_csv(
"http://userpage.fu-berlin.de/soga/soga-py/300/307000_time_series/DWD_Berlin-Dahlem_DailyData.csv",
sep=";",
)
# Rename columns
df.columns = [
"Date",
"QN",
"FX",
"FM",
"QN_4",
"RSK",
"RSKF",
"SDK",
"SHK_TAG",
"NM",
"VPM",
"PM",
"TMK",
"UPM",
"TXK",
"TNK",
"TGK",
]
## Drop columns we want to ignore
df = df.drop(columns=["QN", "FX", "FM", "QN_4", "SHK_TAG"])
# Sort the data set by date: Ascending
df = df.sort_values(by=["Date"], ascending=True)
df.head(10)
| | Date | RSK | RSKF | SDK | NM | VPM | PM | TMK | UPM | TXK | TNK | TGK |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 19500101 | 2.2 | 7 | -999.0 | 5.0 | 4.0 | 1025.6 | -3.2 | 83.0 | -1.1 | -4.9 | -6.3 |
| 1 | 19500102 | 12.6 | 8 | -999.0 | 8.0 | 6.1 | 1005.6 | 1.0 | 95.0 | 2.2 | -3.7 | -5.3 |
| 2 | 19500103 | 0.5 | 1 | -999.0 | 5.0 | 6.5 | 996.6 | 2.8 | 86.0 | 3.9 | 1.7 | -1.4 |
| 3 | 19500104 | 0.5 | 7 | -999.0 | 7.7 | 5.2 | 999.5 | -0.1 | 85.0 | 2.1 | -0.9 | -2.3 |
| 4 | 19500105 | 10.3 | 7 | -999.0 | 8.0 | 4.0 | 1001.1 | -2.8 | 79.0 | -0.9 | -3.3 | -5.2 |
| 5 | 19500106 | 7.2 | 8 | -999.0 | 7.3 | 5.6 | 997.5 | 2.6 | 79.0 | 5.0 | -4.0 | -4.0 |
| 6 | 19500107 | 0.4 | 1 | -999.0 | 8.0 | 8.0 | 1005.1 | 5.7 | 89.0 | 6.4 | 2.5 | 1.1 |
| 7 | 19500108 | 0.0 | 1 | -999.0 | 7.0 | 9.7 | 1011.9 | 7.0 | 93.0 | 8.7 | 5.8 | 2.2 |
| 8 | 19500109 | 3.7 | 8 | -999.0 | 7.7 | 7.7 | 1009.1 | 5.8 | 85.0 | 6.9 | 4.5 | 2.6 |
| 9 | 19500110 | 4.5 | 8 | -999.0 | 8.0 | 4.8 | 1021.5 | -2.4 | 88.0 | 6.0 | -4.5 | -0.4 |
Data info: variable descriptions from the Deutscher Wetterdienst (German Weather Service):
RSK: daily precipitation amount; mm
RSKF: daily precipitation form; numeric code
SDK: sunshine duration (daily sum); hours
NM: daily average of cloud cover; eighths
VPM: daily average of vapor pressure; hPa
PM: daily mean air pressure; hPa
TMK: daily mean temperature; °C
UPM: daily average of relative humidity; %
TXK: daily maximum air temperature at 2 m height; °C
TNK: daily minimum air temperature at 2 m height; °C
TGK: minimum air temperature at ground level (5 cm height); °C
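Note that the SDK column contains the value -999.0 in the early records shown above; the DWD uses -999 as a missing-value code. As an optional cleaning step (a minimal sketch, not required for the remainder of this chapter), such entries can be replaced by NaN so that summary statistics ignore them:

import numpy as np

# Replace the DWD missing-value code -999 with NaN;
# functions such as mean() and std() skip NaN by default.
df_clean = df.replace(-999.0, np.nan)
print(df_clean["SDK"].isna().sum(), "missing SDK values")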
In general, the following applies to the calculation of daily average values. From 01.04.2001 the standard was changed as follows:
Daily means are calculated from 24 hourly values; if more than 3 hourly values are missing, the mean is calculated from the 4 main observation times (00, 06, 12, 18 UTC).
The reference period for a day usually runs from 23:51 UTC of the previous day to 23:50 UTC.
Only the precipitation of the previous day is measured in the morning at 05:50 UTC.
The observation dates refer to the globally used time at Greenwich (GMT or UTC). The observation time is always 10 minutes before the reference hour (hence the odd-looking times). This change became necessary after the station network was largely automated.
In Python, date and time are not data types of their own, but the datetime module can be imported to work with both dates and times. The datetime module comes built into Python, so there is no need to install it externally. datetime supplies classes to work with date and time. These classes provide a number of functions to deal with dates, times and time intervals. Date and datetime are objects in Python, so when you manipulate them, you are actually manipulating objects and not strings or timestamps.
The standard date format codes are given below.
$$ \begin{array}{|c|l|} \hline \text{code} & \text{value} \\ \hline \mathtt{\%d} & \text{Day of the month (number)} \\ \mathtt{\%m} & \text{Month (number)} \\ \mathtt{\%b} & \text{Month (abbreviated)} \\ \mathtt{\%B} & \text{Month (full name)} \\ \mathtt{\%y} & \text{Year (2 digit)} \\ \mathtt{\%Y} & \text{Year (4 digit)} \\ \hline \end{array} $$

The relevant datetime classes include date, time, datetime, and timedelta.
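A quick illustration of these classes (a minimal sketch with arbitrary example values):

from datetime import date, time, datetime, timedelta

d = date(2019, 7, 29)                    # a calendar date
t = time(14, 30, 40)                     # a time of day
dt = datetime.combine(d, t)              # a full timestamp
one_week_later = dt + timedelta(days=7)  # date arithmetic

print(dt, "->", one_week_later)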
Example: string to datetime

Say we have a date and time given as a string, which looks like this: "2019-07-29 14:30:40Z".

This string includes the date (year, month, day), the time (hours, minutes, seconds) and a trailing "Z" indicating UTC.
Read more about datetime string formatting.
First, we want to convert this string to a Python datetime object. The datetime.strptime function comes in very handy here. The function takes in a date string and formatting characters and returns a Python datetime object.
from datetime import datetime
date_as_string = "2019-07-29 14:30:40Z"
date_object = datetime.strptime(date_as_string, "%Y-%m-%d %H:%M:%SZ")
print("date_object =", date_object, "class = ", type(date_object))
date_object = 2019-07-29 14:30:40 class = <class 'datetime.datetime'>
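The reverse direction, datetime to string, works with strftime() and the same format codes (a small sketch reusing the date_object created above):

# Format the datetime object back into a custom string,
# e.g. '29 July 2019, 14:30'.
date_back_as_string = date_object.strftime("%d %B %Y, %H:%M")
print(date_back_as_string)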
Example: working with datetime in data frames

Returning to our DWD data, we will consider the Date column. First, we check the data type stored in the Date column.
df["Date"].dtype
dtype('int64')
The data type is int64. We would like to transform these integers into dates. We achieve that by employing the pandas function to_datetime().
df["Date"] = pd.to_datetime(df["Date"], format="%Y%m%d")
df.head()
| | Date | RSK | RSKF | SDK | NM | VPM | PM | TMK | UPM | TXK | TNK | TGK |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1950-01-01 | 2.2 | 7 | -999.0 | 5.0 | 4.0 | 1025.6 | -3.2 | 83.0 | -1.1 | -4.9 | -6.3 |
| 1 | 1950-01-02 | 12.6 | 8 | -999.0 | 8.0 | 6.1 | 1005.6 | 1.0 | 95.0 | 2.2 | -3.7 | -5.3 |
| 2 | 1950-01-03 | 0.5 | 1 | -999.0 | 5.0 | 6.5 | 996.6 | 2.8 | 86.0 | 3.9 | 1.7 | -1.4 |
| 3 | 1950-01-04 | 0.5 | 7 | -999.0 | 7.7 | 5.2 | 999.5 | -0.1 | 85.0 | 2.1 | -0.9 | -2.3 |
| 4 | 1950-01-05 | 10.3 | 7 | -999.0 | 8.0 | 4.0 | 1001.1 | -2.8 | 79.0 | -0.9 | -3.3 | -5.2 |
## check again for the datatype
df["Date"]
0       1950-01-01
1       1950-01-02
2       1950-01-03
3       1950-01-04
4       1950-01-05
           ...
26293   2021-12-27
26294   2021-12-28
26295   2021-12-29
26296   2021-12-30
26297   2021-12-31
Name: Date, Length: 26298, dtype: datetime64[ns]
Perfect, the data type is now datetime64! Now we would like the dates to be our index instead of the row numbers. This step will bring many advantages, as elaborated below.
### Set the date as Index:
df = df.set_index("Date")
df.head()
| Date | RSK | RSKF | SDK | NM | VPM | PM | TMK | UPM | TXK | TNK | TGK |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1950-01-01 | 2.2 | 7 | -999.0 | 5.0 | 4.0 | 1025.6 | -3.2 | 83.0 | -1.1 | -4.9 | -6.3 |
| 1950-01-02 | 12.6 | 8 | -999.0 | 8.0 | 6.1 | 1005.6 | 1.0 | 95.0 | 2.2 | -3.7 | -5.3 |
| 1950-01-03 | 0.5 | 1 | -999.0 | 5.0 | 6.5 | 996.6 | 2.8 | 86.0 | 3.9 | 1.7 | -1.4 |
| 1950-01-04 | 0.5 | 7 | -999.0 | 7.7 | 5.2 | 999.5 | -0.1 | 85.0 | 2.1 | -0.9 | -2.3 |
| 1950-01-05 | 10.3 | 7 | -999.0 | 8.0 | 4.0 | 1001.1 | -2.8 | 79.0 | -0.9 | -3.3 | -5.2 |
DatetimeIndex

A very common task in time series analysis is the subsetting of a time series. Maybe you are interested in the oldest or newest observations, or you would like to extract a certain range of data. Since we have set the date as the index, we can easily look at the data of a certain day:
df.loc["2018-11-13"]
RSK        8.9
RSKF       6.0
SDK        1.1
NM         6.0
VPM       11.8
PM      1012.3
TMK        9.7
UPM       97.0
TXK       12.3
TNK        6.8
TGK        4.4
Name: 2018-11-13 00:00:00, dtype: float64
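A DatetimeIndex also supports partial-string indexing, so we can select a whole month or year at once (a sketch; the output is omitted here):

# All observations of November 2018 and of the year 2018:
nov_2018 = df.loc["2018-11"]
year_2018 = df.loc["2018"]
print(nov_2018.shape, year_2018.shape)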
Or we can have a look at the data of a certain time interval:
How was the weather in June 2019?
df_June = df["2019-06-01":"2019-06-30"]
df_June
| Date | RSK | RSKF | SDK | NM | VPM | PM | TMK | UPM | TXK | TNK | TGK |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2019-06-01 | 0.0 | 0 | 12.700 | 4.0 | 15.2 | 1011.15 | 20.3 | 68.00 | 27.1 | 14.0 | 10.3 |
| 2019-06-02 | 0.0 | 0 | 13.367 | 3.1 | 15.5 | 1007.62 | 22.7 | 60.25 | 31.3 | 14.6 | 10.1 |
| 2019-06-03 | 0.0 | 6 | 11.933 | 2.9 | 15.3 | 1003.78 | 24.8 | 52.83 | 33.2 | 15.6 | 11.4 |
| 2019-06-04 | 0.0 | 0 | 11.550 | 2.9 | 17.0 | 1002.47 | 23.6 | 61.00 | 30.6 | 17.2 | 13.3 |
| 2019-06-05 | 0.0 | 0 | 14.183 | 1.3 | 17.0 | 999.16 | 25.7 | 56.08 | 33.6 | 16.0 | 12.2 |
| 2019-06-06 | 5.4 | 6 | 8.367 | 4.8 | 19.0 | 999.29 | 21.0 | 78.29 | 31.7 | 14.5 | 14.1 |
| 2019-06-07 | 0.1 | 6 | 14.133 | 2.5 | 15.0 | 1004.55 | 19.4 | 68.92 | 25.7 | 12.8 | 10.1 |
| 2019-06-08 | 0.0 | 6 | 8.617 | 3.8 | 11.9 | 1009.54 | 17.9 | 59.25 | 22.6 | 10.4 | 5.9 |
| 2019-06-09 | 3.4 | 6 | 12.450 | 4.7 | 11.0 | 1013.62 | 18.8 | 53.92 | 25.9 | 8.7 | 4.8 |
| 2019-06-10 | 1.8 | 6 | 5.750 | 6.8 | 17.7 | 1003.88 | 21.1 | 71.96 | 27.2 | 14.5 | 14.0 |
| 2019-06-11 | 48.0 | 6 | 11.267 | 6.3 | 20.2 | 1000.25 | 23.0 | 74.79 | 29.1 | 16.8 | 14.6 |
| 2019-06-12 | 13.6 | 6 | 9.650 | 5.9 | 19.7 | 998.63 | 22.0 | 79.04 | 33.0 | 16.6 | 15.5 |
| 2019-06-13 | 0.0 | 6 | 8.900 | 3.6 | 15.1 | 1005.27 | 18.9 | 72.29 | 24.6 | 14.7 | 12.5 |
| 2019-06-14 | 0.0 | 6 | 12.050 | 4.1 | 14.8 | 1007.85 | 22.8 | 56.33 | 29.5 | 12.8 | 10.1 |
| 2019-06-15 | 0.0 | 0 | 10.800 | 5.4 | 19.7 | 1001.65 | 24.9 | 62.88 | 31.5 | 17.4 | 15.7 |
| 2019-06-16 | 0.0 | 6 | 5.800 | 6.8 | 14.8 | 1009.99 | 17.9 | 72.96 | 21.5 | 14.0 | 11.3 |
| 2019-06-17 | 0.0 | 0 | 10.300 | 3.2 | 15.2 | 1011.72 | 20.4 | 66.13 | 27.1 | 14.4 | 12.0 |
| 2019-06-18 | 0.0 | 0 | 15.283 | 2.6 | 14.9 | 1007.27 | 22.8 | 57.08 | 29.4 | 14.6 | 11.1 |
| 2019-06-19 | 0.9 | 6 | 12.767 | 5.3 | 17.8 | 1000.28 | 24.3 | 60.88 | 32.1 | 16.6 | 12.6 |
| 2019-06-20 | 1.9 | 6 | 6.983 | 6.0 | 19.2 | 1000.97 | 21.0 | 78.21 | 25.7 | 17.1 | 14.2 |
| 2019-06-21 | 0.0 | 0 | 10.033 | 4.8 | 15.5 | 1007.42 | 19.5 | 70.79 | 24.7 | 14.9 | 10.7 |
| 2019-06-22 | 0.0 | 0 | 12.417 | 2.8 | 14.5 | 1012.16 | 20.1 | 63.38 | 25.7 | 14.2 | 9.9 |
| 2019-06-23 | 0.0 | 0 | 14.550 | 1.4 | 13.2 | 1012.80 | 21.2 | 53.63 | 27.8 | 14.0 | 10.0 |
| 2019-06-24 | 0.0 | 0 | 15.900 | 3.7 | 12.4 | 1014.22 | 22.8 | 47.75 | 29.7 | 14.8 | 10.1 |
| 2019-06-25 | 0.0 | 0 | 15.367 | 4.5 | 16.1 | 1012.10 | 25.5 | 51.17 | 33.5 | 17.5 | 12.5 |
| 2019-06-26 | 0.0 | 0 | 15.100 | 2.0 | 19.0 | 1009.91 | 27.9 | 53.67 | 36.1 | 18.7 | 15.1 |
| 2019-06-27 | 0.0 | 0 | 14.367 | 1.7 | 12.5 | 1013.56 | 19.3 | 57.83 | 25.3 | 14.6 | 12.8 |
| 2019-06-28 | 0.0 | 0 | 10.550 | 3.3 | 12.9 | 1014.31 | 17.9 | 64.96 | 23.9 | 11.7 | 8.4 |
| 2019-06-29 | 0.0 | 0 | 15.683 | 0.6 | 13.5 | 1010.94 | 21.7 | 57.00 | 30.3 | 12.2 | 8.4 |
| 2019-06-30 | 0.0 | 6 | 15.033 | 1.3 | 15.6 | 1003.23 | 27.6 | 47.58 | 37.6 | 14.6 | 10.1 |
Another nice feature of pandas' date handling is the generation of new time series with the date_range() function. We can define a start date (start="6/1/2019") and an end date (end="6/1/2020") of the series. Moreover, we can define a frequency, for example calendar daily ('D'), business daily ('B'), hourly ('H'), weekly ('W'), monthly ('M'), quarterly ('Q'), annual ('A'), and many others. Frequencies can also be specified as multiples of any of the base frequencies, for example '5D' for every five days.
For example:
New_series = pd.date_range(start="6/1/2019", end="6/1/2020", freq="B")
New_series
DatetimeIndex(['2019-06-03', '2019-06-04', '2019-06-05', '2019-06-06', '2019-06-07', '2019-06-10', '2019-06-11', '2019-06-12', '2019-06-13', '2019-06-14', ... '2020-05-19', '2020-05-20', '2020-05-21', '2020-05-22', '2020-05-25', '2020-05-26', '2020-05-27', '2020-05-28', '2020-05-29', '2020-06-01'], dtype='datetime64[ns]', length=261, freq='B')
or:
daily_index = pd.date_range(start="4/1/2018", end="4/30/2018", freq="D")
daily_index
DatetimeIndex(['2018-04-01', '2018-04-02', '2018-04-03', '2018-04-04', '2018-04-05', '2018-04-06', '2018-04-07', '2018-04-08', '2018-04-09', '2018-04-10', '2018-04-11', '2018-04-12', '2018-04-13', '2018-04-14', '2018-04-15', '2018-04-16', '2018-04-17', '2018-04-18', '2018-04-19', '2018-04-20', '2018-04-21', '2018-04-22', '2018-04-23', '2018-04-24', '2018-04-25', '2018-04-26', '2018-04-27', '2018-04-28', '2018-04-29', '2018-04-30'], dtype='datetime64[ns]', freq='D')
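As mentioned above, frequencies can also be multiples of a base frequency; a small sketch with a five-day step (using periods instead of an end date):

# Six dates, five calendar days apart: 2018-04-01, 2018-04-06, ..., 2018-04-26
five_day_index = pd.date_range(start="4/1/2018", periods=6, freq="5D")
five_day_index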
Plotting the whole data set is very easy now, since we have set the Date column as the index. For example, we select one variable to be plotted, in this case the daily mean temperature (TMK):
plt.figure(figsize=(18, 4))
df["TMK"].plot()
plt.show()
Next, we create a new data frame in order to plot the data of a time slice together with its confidence interval.
df_sliced = df["2019-01-01":"2019-03-31"]["TMK"].round(2)
df_sliced
Date
2019-01-01     6.4
2019-01-02     1.8
2019-01-03    -0.4
2019-01-04     1.8
2019-01-05     6.0
              ...
2019-03-27     7.3
2019-03-28     8.7
2019-03-29     9.4
2019-03-30    10.2
2019-03-31     8.0
Name: TMK, Length: 90, dtype: float64
print("df_sliced.shape: ", df_sliced.shape, "type(df_sliced): ", type(df_sliced))
df_sliced.shape: (90,) type(df_sliced): <class 'pandas.core.series.Series'>
We created a pandas Series! In the following step, we will calculate the confidence interval and then transform our data into a pandas DataFrame to save the confidence interval in additional columns.
ci = 1.96 * df_sliced.std() / df_sliced.mean()
ci
1.6190937680968653
df_sliced = pd.DataFrame(df_sliced)
df_sliced
| Date | TMK |
|---|---|
| 2019-01-01 | 6.4 |
| 2019-01-02 | 1.8 |
| 2019-01-03 | -0.4 |
| 2019-01-04 | 1.8 |
| 2019-01-05 | 6.0 |
| ... | ... |
| 2019-03-27 | 7.3 |
| 2019-03-28 | 8.7 |
| 2019-03-29 | 9.4 |
| 2019-03-30 | 10.2 |
| 2019-03-31 | 8.0 |

90 rows × 1 columns
df_sliced["Confidence_lower"] = df_sliced["TMK"] - ci
df_sliced["Confidence_upper"] = df_sliced["TMK"] + ci
df_sliced
| Date | TMK | Confidence_lower | Confidence_upper |
|---|---|---|---|
| 2019-01-01 | 6.4 | 4.780906 | 8.019094 |
| 2019-01-02 | 1.8 | 0.180906 | 3.419094 |
| 2019-01-03 | -0.4 | -2.019094 | 1.219094 |
| 2019-01-04 | 1.8 | 0.180906 | 3.419094 |
| 2019-01-05 | 6.0 | 4.380906 | 7.619094 |
| ... | ... | ... | ... |
| 2019-03-27 | 7.3 | 5.680906 | 8.919094 |
| 2019-03-28 | 8.7 | 7.080906 | 10.319094 |
| 2019-03-29 | 9.4 | 7.780906 | 11.019094 |
| 2019-03-30 | 10.2 | 8.580906 | 11.819094 |
| 2019-03-31 | 8.0 | 6.380906 | 9.619094 |

90 rows × 3 columns
Now, we are able to plot the data with the respective confidence band:
fig, ax = plt.subplots(figsize=(18, 4))
ax = df_sliced["TMK"].plot(marker="o", linestyle="-")
ax.set_ylabel("mean temperature (°C)")
ax.fill_between(
df_sliced.index,
df_sliced["Confidence_lower"],
df_sliced["Confidence_upper"],
color="b",
alpha=0.1,
)
<matplotlib.collections.PolyCollection at 0x1cb1c2ffd90>
Exercise
Was the sun shining in the summer of 2021? Plot the sunshine duration (SDK) from the DWD data set for the time slice May, June, July and August 2021.
## your code here...
df_sliced = df["2021-05-01":"2021-08-31"]["SDK"].round(2)
ci = 1.96 * np.std(df_sliced) / np.mean(df_sliced)
df_sliced = df_sliced.to_frame()
df_sliced["Confidence_lower"] = df_sliced["SDK"] - ci
df_sliced["Confidence_upper"] = df_sliced["SDK"] + ci
fig, ax = plt.subplots(figsize=(18, 4))
ax = df_sliced["SDK"].plot(marker="o", linestyle="-")
ax.set_ylabel("sunshine duration [h]")
ax.fill_between(
df_sliced.index,
df_sliced["Confidence_lower"],
df_sliced["Confidence_upper"],
color="b",
alpha=0.1,
)
<matplotlib.collections.PolyCollection at 0x1cb216d4c70>
Citation
The E-Learning project SOGA-Py was developed at the Department of Earth Sciences by Annette Rudolph, Joachim Krois and Kai Hartmann. You can reach us via mail by soga[at]zedat.fu-berlin.de.
Please cite as follows: Rudolph, A., Krois, J., Hartmann, K. (2023): Statistics and Geodata Analysis using Python (SOGA-Py). Department of Earth Sciences, Freie Universitaet Berlin.