307043_Aggregation_of_time_series

Sometimes it is necessary to reshape the time series to coarser intervals than presented. Another very use bundle of functions for the aggregation of time series data is provided by the xts package. These functions are:

apply.daily(ts, func)
apply.weekly(ts, func)
apply.monthly(ts, func)
apply.quarterly(ts, func)
apply.yearly(ts, func)

This family of functions applies a function to each distinct period in a given time series object. Thus, we may easily calculate a time series of monthly or annual rainfall based on daily data.

library(xts)
library(lubridate)

# load the data
load(url("https://userpage.fu-berlin.de/soga/data/r-data/DWD_FUB.RData"))

Exercise: Extract the rainfall data for the station Berlin-Dahlem (FU) for the period 2000-2015 from the daily data set. Aggregate the data to a monthly and an annual time series, and reproduce the plots below.

Monthly time series:

# Your code here...
### monthly time series ###
monthly_rain <- NULL

Show code

### monthly time series ###
monthly_rain <- apply.monthly(ts_FUB_daily["2000/2015", "Rain"], sum)

barplot(monthly_rain,
  main = "Monthly rainfall at Berlin-Dahlem for the period 2000-2015",
  col = "blue",
  cex.main = 0.8, # font size title
  ylab = "Rainfall",
  cex.names = 0.7
) # font size x-labels

Annual time series:

# Your code here...
### annual time series ###
annual_rain <- NULL

Show code

### annual time series ###
annual_rain <- apply.yearly(ts_FUB_daily["2000/2015", "Rain"], sum)

barplot(annual_rain,
  main = "Annual rainfall at Berlin-Dahlem for the period 2000-2015",
  col = "blue",
  cex.main = 0.8,
  ylab = "Rainfall",
  cex.names = 0.7,
  las = 2
) # rotate label x-axis

Split-Apply-Combine

So far we learned about summarizing and aggregating time series data. Now we combine these approaches in order to get a more sophisticated representation of our data. Therefore we incorporate basic ideas of the so called split-apply-combine strategy for data wrangling.

We elaborate this analytic strategy in two exercises using hourly rainfall data for the station Berlin-Dahlem.

library(ggfortify)
autoplot(ts_FUB_hourly) +
  labs(y = "Rainfall") +
  ggtitle("Hourly rainfall data for the station Berlin-Dahlem") +
  theme_bw()

Our goal is to calculate the mean monthly rainfall at the weather station Berlin-Dahlem for the period 2002 to 2021.

Our analytic strategy can be summarized as

split data into pieces (into months),
apply some function to each piece (sum up the amount of rainfall),
combine the results back together again (plot the data or store the data for further usage).

For sure we could achieve that task by writing for() loops, but we will probably spend a lot of time on bookkeeping tasks. In base R there exists the family of apply functions, such as apply(), aggregate(), tapply(), and by(), among others. These are, no doubt, excellent tools, however, thanks to Hadley Wickham the author of “The split-apply-combine strategy for data analysis”, and software developer of major data analysis tools for R, we may use a collection of R packages, sometimes referred to as tidyverse packages. Out of those we pick the dplyr package, a package for data manipulation. You can learn more about the dplyr package by typing vignette("dplyr") into your console.

In addition to the functionality of the dplyr package we make use of the pipe operator (%>%). This allows us to write less code which is even more readable.

First, we load the required package:

library(dplyr)

A short digression on the usefulness of pipes:

The (silly) task is to take a vector x, square it, expand the vector by repeating it three times, take the mean, build sequence of 10 steps from negative to the positive value of the mean and plot it.

# define x
x <- c(-3.54654, 2.54, 5)

1. The long way

a <- x^2 # square it
b <- rep(a, 3) # expand the vector
c <- mean(b) # take the mean
d <- seq(from = -c, to = c, length.out = 10) #  build sequence
plot(d) # plot it

2. The short but hard to read way

plot(seq(from = -mean(rep(x^2, 3)), to = mean(rep(x^2, 3)), length.out = 10))

3. The pipe way

x %>%
  .^2 %>%
  rep(3) %>%
  mean() %>%
  seq(from = -., to = ., length.out = 10) %>%
  plot()

Learn more on the pipe operator by visiting the free online book R for Data Science by Garrett Grolemund and Hadley Wickham (Chapter 18: Pipes).

Let us go back to our example. First, we stack the time series data, which is of object class xts, into a data frame, as the dplyr package is optimized to work on data.frame objects.

# stack time series data into a data frame
df_hour <- data.frame(
  "Datetime" = index(ts_FUB_hourly),
  coredata(ts_FUB_hourly)
)
head(df_hour)

##              Datetime Rain
## 1 2002-01-28 11:00:00  0.0
## 2 2002-01-28 13:00:00  0.0
## 3 2002-01-28 15:00:00  1.7
## 4 2002-01-28 18:00:00  1.1
## 5 2002-01-28 21:00:00  0.0
## 6 2002-01-28 22:00:00  0.0

Now we run the analysis:

# initial data frame
mean_monthly_rainfall <- df_hour %>%
  #### DATA WRANGLING ####

  # generate two more columns and populate them with the year and month
  mutate(
    year = year(Datetime),
    month = month(Datetime)
  ) %>%
  # group the data by these new columns
  group_by(year, month) %>%
  # calculate the monthly rainfall for each year
  summarise(year_month_rainfall = sum(Rain)) %>%
  # group the data by the month
  group_by(month) %>%
  # calculate the mean annual monthly rainfall
  summarise(
    months_count = n(),
    mean_month_rainfall = mean(year_month_rainfall, na.rm = TRUE)
  )

mean_monthly_rainfall

## # A tibble: 12 × 3
##    month months_count mean_month_rainfall
##    <dbl>        <int>               <dbl>
##  1     1           20                48.2
##  2     2           20                32.0
##  3     3           20                35.8
##  4     4           20                25.5
##  5     5           20                53.4
##  6     6           20                58.5
##  7     7           20                83.3
##  8     8           20                62.3
##  9     9           20                42.1
## 10    10           20                47.8
## 11    11           20                43.7
## 12    12           20                41.1

We wrote in total five lines of code and calculated the mean monthly rainfall at the weather station Berlin-Dahlem for the period of ´r max(mean_monthly_rainfall$months.count)´ years based on 130214 data points.

Now we visualize the data using the ggplot() function.

mean_monthly_rainfall %>%
  #### PLOTTING ####

  # add column with month names for labeling
  mutate(label_month = month(month, label = TRUE, abbr = TRUE)) %>%
  # plot using ggplot()
  ggplot(aes(x = label_month, y = mean_month_rainfall)) +

  # design the bar plot
  geom_bar(stat = "identity", width = 0.6, color = "black") +

  # label axes
  labs(y = "Rainfall", x = "") +

  # add title
  ggtitle(paste(
    "Mean monthly rainfall at Berlin-Dahlem for the period",
    year(xts::first(ts_FUB_hourly)), # calculate start year of series
    "to",
    year(xts::last(ts_FUB_hourly))
  )) + # calculate end year of series

  # pick black and white background theme
  theme_bw()

Now it is your turn!

Exercise: Calculate the mean diurnal monthly rainfall at Berlin-Dahlem for the period 2002 to 2021. And reproduce the graph below.

Again we make use of the dplyr package and the pipe operator %>%. The code is very similar to the code from above.

## Your code here...
df_annual <- NULL

Show code

df_annual <- df_hour %>%
  ### DATA WRANGLING ###
  mutate(
    month = month(Datetime),
    hour = hour(Datetime)
  ) %>%
  group_by(month, hour) %>%
  summarize(sum_hour = sum(Rain, na.rm = TRUE)) %>%
  mutate(sum_hour = sum_hour / 15) # normalize to annual mean

df_annual %>%
  ### PLOTTING ###
  mutate(label_month = month(month, label = TRUE, abbr = TRUE)) %>%
  ggplot(aes(x = hour, y = sum_hour)) +
  geom_bar(stat = "identity", width = 0.1, color = "black") +
  facet_wrap(~label_month, ncol = 4) +

  #### THIS LINE IS NEW ####
  labs(y = "Rainfall (mm)", x = "Hour of the day") +
  ggtitle(paste(
    "Mean diurnal monthly rainfall at Berlin-Dahlem for the period",
    year(xts::first(ts_FUB_hourly)),
    "to",
    year(xts::last(ts_FUB_hourly))
  )) +
  theme_bw()

Citation

The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us via mail by soga[at]zedat.fu-berlin.de.

You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follow: Hartmann, K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.