Sometimes it is necessary to reshape the time series to coarser intervals than presented. For the aggregation of time series data we may easily make use of the characteristics of pd.Series
object, as they come with an index
and values
.
So we use the .groupby()
function to group by the desired period by using the to_period
function.
to_period("D")
for daily periodto_period("W")
for weekly periodto_period("M")
for monthly periodto_period("Q")
for quarterly periodto_period("Y")
for yearly periodApplied to a time series, the function may look like this for a monthly aggregation:
<timeseries>.groupby(\<timeseries>.index.to_period("M"))
Thus, we may easily calculate a time series of monthly or annual rainfall based on daily data.
# First, let's import the needed libraries.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from datetime import datetime
# open the respective json files
ts_FUB_monthly = pd.read_json("../data/ts_FUB_monthly.json")
ts_FUB_monthly["Date"] = pd.to_datetime(
ts_FUB_monthly["Date"], format="%Y-%m-%d", errors="coerce"
)
ts_FUB_daily = pd.read_json("../data/ts_FUB_daily.json")
ts_FUB_daily["MESS_DATUM"] = pd.to_datetime(
ts_FUB_daily["MESS_DATUM"], format="%Y-%m-%d", errors="coerce"
)
ts_FUB_hourly = pd.read_json("../data/ts_FUB_hourly.json")
ts_FUB_hourly["MESS_DATUM"] = pd.to_datetime(
ts_FUB_hourly["MESS_DATUM"], format="%Y-%m-%d", errors="coerce"
)
Exercise: Extract the rainfall data for the station Berlin-Dahlem (FU) for the period 2000-2015 from the daily data set. Aggregate the data to a monthly and an annual time series, and reproduce the plots below.
# Your code here...
### DATA WRANGLING ###
### monthly time series ###
daily_rain_2000_2015 = ts_FUB_daily.set_index(["MESS_DATUM"])[
"2000-01-01":"2015-12-31"
]["Rain"]
monthly_rain_2000_2015 = daily_rain_2000_2015.groupby(
daily_rain_2000_2015.index.to_period("M")
).agg("sum")
### PLOTTING ###
import matplotlib.dates as mdates ## add library for custom x axis
plt.figure(figsize=(18, 6))
ax = monthly_rain_2000_2015.plot.bar(
y="Rain", color="blue", edgecolor="lightgrey", fontsize=16
)
ax.xaxis.set_major_locator(mdates.MonthLocator())
ax.set_title("Monthly rainfall at Berlin-Dahlem for the period 2000-2015", fontsize=16)
ax.set_ylabel("Rainfall", fontsize=16)
plt.show()
### DATA WRANGLING ###
### annual time series ###
yearly_rain_2000_2015 = daily_rain_2000_2015.groupby(
daily_rain_2000_2015.index.to_period("Y")
).agg("sum")
### PLOTTING ###
plt.figure(figsize=(18, 4))
ax = yearly_rain_2000_2015.plot.bar(
y="Rain", color="blue", edgecolor="lightgrey", fontsize=16
)
ax.set_title("Annual rainfall at Berlin-Dahlem for the period 2000-2015", fontsize=16)
ax.set_ylabel("Rainfall", fontsize=16)
plt.show()
So far we learned about summarizing and aggregating time series data. Now we combine these approaches in order to get a more sophisticated representation of our data. Therefore we incorporate basic ideas of the so called split-apply-combine strategy for data wrangling.
We elaborate this analytic strategy in two exercises using hourly rainfall data for the station Berlin-Dahlem.
ts_FUB_hourly = ts_FUB_hourly.set_index(["MESS_DATUM"])
ts_FUB_hourly
rainfall | |
---|---|
MESS_DATUM | |
2002-01-28 11:00:00 | 0.0 |
2002-01-28 13:00:00 | 0.0 |
2002-01-28 15:00:00 | 1.7 |
2002-01-28 18:00:00 | 1.1 |
2002-01-28 21:00:00 | 0.0 |
... | ... |
2021-12-31 19:00:00 | 0.7 |
2021-12-31 20:00:00 | 0.7 |
2021-12-31 21:00:00 | 0.1 |
2021-12-31 22:00:00 | 0.1 |
2021-12-31 23:00:00 | 0.0 |
174023 rows × 1 columns
plt.figure(figsize=(18, 6))
plt.plot(ts_FUB_hourly, color="blue")
plt.title("Hourly rainfall data for the station Berlin-Dahlem", fontsize=16)
plt.ylabel("Rainfall", fontsize=16)
plt.show()
print(
f"Start Date: {ts_FUB_hourly.index.min().date()}, End Date: {ts_FUB_hourly.index.max().date()}"
)
Start Date: 2002-01-28, End Date: 2021-12-31
Our goal is to calculate the mean monthly rainfall at the weather station Berlin-Dahlem for the period
2002-01-28
to2021-12-31
.
Our analytic strategy can be summarized as
The Split-Apply-Combine Approach was developed thanks to Hadley Wickham the author of "The split-apply-combine strategy for data analysis", and software developer,
For sure we could achieve that task by writing for()
loops, but we will probably spend a lot of time on bookkeeping tasks. Therefore, Pandas groupby-apply
tool is indispensable. it allows for simply equations on a data frame, working with one or multiple columns at a time. Instead of using one of the stock functions provided by Pandas (such as sum
, mean
,...) to operate on the groups we can define our own custom functions and apply them to our data using the the apply() function. In the apply step, we want to do one of the following: Aggregation, Transformation or Filtration.
A helpful guide for the Split-Apply-Combine Approach in Python is provided here.
## group data by year and month and get the sum
mean_monthly_rainfall = ts_FUB_hourly.groupby(ts_FUB_hourly.index.to_period("M")).agg(
"sum"
)
## group by month to get monthly mean
mean_monthly_rainfall = mean_monthly_rainfall.groupby(
mean_monthly_rainfall.index.month
).mean()
mean_monthly_rainfall
rainfall | |
---|---|
MESS_DATUM | |
1 | 48.165 |
2 | 31.955 |
3 | 35.805 |
4 | 25.455 |
5 | 53.425 |
6 | 58.540 |
7 | 83.295 |
8 | 62.295 |
9 | 42.060 |
10 | 47.830 |
11 | 43.715 |
12 | 41.075 |
We wrote in total two lines of code and calculated the mean monthly rainfall at the weather station Berlin-Dahlem for the period of 19 years based on 174023 data points.
Now we visualize the data using the plot()
function.
np.arange(1, 13, 1)
array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
plt.figure(figsize=(18, 4))
ax = mean_monthly_rainfall.plot.bar(y="rainfall", color="blue", edgecolor="lightgrey")
ax.set_title("Mean monthly rainfall at Berlin-Dahlem for the period")
ax.set_ylabel("Rainfall")
ax.set_ylabel("Months")
labels = [
"Jan",
"Feb",
"Mar",
"Apr",
"May",
"Jun",
"Jul",
"Aug",
"Sep",
"Oct",
"Nov",
"Dec",
]
ax.set_xticks(np.arange(0, 12, 1), labels)
plt.show()
<Figure size 1800x400 with 0 Axes>
Now it is your turn!
Exercise: Calculate the mean diurnal monthly rainfall at Berlin-Dahlem for the period 2002-01-28 to 2021-12-31. And reproduce the graph below.
## Your code here...
### DATA WRANGLING ###
## Add month and time as new columns
ts_FUB_hourly_df = pd.DataFrame(
{
"Date": ts_FUB_hourly.index,
"rain": ts_FUB_hourly.values.flatten(),
"month": ts_FUB_hourly.index.month,
"month_name": ts_FUB_hourly.index.month_name(),
"time": ts_FUB_hourly.index.hour,
}
).set_index("Date")
## group by month and time and get the sum of rain
ts_FUB_hourly_df = (
ts_FUB_hourly_df.groupby(["month", "month_name", "time"])
.agg({"rain": "sum"})
.reset_index()
)
## normalize to annual mean
ts_FUB_hourly_df.rain = ts_FUB_hourly_df.rain / 15
### PLOTTING ###
import seaborn as sns
g = sns.FacetGrid(
ts_FUB_hourly_df, col="month_name", height=3.5, aspect=0.65, col_wrap=4
)
g.map_dataframe(sns.barplot, x="time", y="rain")
g.set_axis_labels("Hour of the day", "Rain (mm)")
g.set_titles(col_template="{col_name}")
g.set(xlim=(0, 23), ylim=(0, 7), xticks=[0, 6, 12, 18, 23], yticks=[0, 2, 4, 6, 8])
<seaborn.axisgrid.FacetGrid at 0x162662abf70>
Citation
The E-Learning project SOGA-Py was developed at the Department of Earth Sciences by Annette Rudolph, Joachim Krois and Kai Hartmann. You can reach us via mail by soga[at]zedat.fu-berlin.de.
Please cite as follow: Rudolph, A., Krois, J., Hartmann, K. (2023): Statistics and Geodata Analysis using Python (SOGA-Py). Department of Earth Sciences, Freie Universitaet Berlin.