All time series data sets for the weather station Berlin-Dahlem (FU) are provided by DWD (German Weather Service). The data was downloaded from the Climate Data Center on 2022-06-14.

Let us start with loading all necessary libraries into our R session.

library(sp)
library(mapview)
library(dplyr)
library(lubridate)
library(ggplot2)
library(xts)
library(ggfortify)
library(lattice)
library(zoo)

Monthly time series data Berlin-Dahlem (FU)

The monthly time series of the DWD weather station Berlin-Dahlem (FU) was downloaded on 2022-06-14 from Climate Data Center of DWD. A detailed variable description is available here. For the purpose of this tutorial the monthly time series data set is also made available here.

We start by downloading the data. The observatory data comes in a zip file together with several text files containing metadata regarding the DWD weather station Berlin-Dahlem (FU). In R, a zip file will be extracted as follows:

temp <- tempfile()
download_url <- "https://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/monthly/kl/historical/"
zipfile <- "monatswerte_KL_00403_17190101_20221231_hist.zip"
download.file(paste0(download_url, zipfile), temp, mode = "wb")
metadata <- unzip(temp)
unlink(temp)
metadata
##  [1] "./Metadaten_Stationsname_Betreibername_00403.html"    
##  [2] "./Metadaten_Stationsname_Betreibername_00403.txt"     
##  [3] "./Metadaten_Parameter_klima_monat_00403.html"         
##  [4] "./Metadaten_Parameter_klima_monat_00403.txt"          
##  [5] "./Metadaten_Geraete_Lufttemperatur_00403.html"        
##  [6] "./Metadaten_Geraete_Lufttemperatur_00403.txt"         
##  [7] "./Metadaten_Geraete_Lufttemperatur_Maximum_00403.html"
##  [8] "./Metadaten_Geraete_Lufttemperatur_Maximum_00403.txt" 
##  [9] "./Metadaten_Geraete_Lufttemperatur_Minimum_00403.html"
## [10] "./Metadaten_Geraete_Lufttemperatur_Minimum_00403.txt" 
## [11] "./Metadaten_Geraete_Niederschlagshoehe_00403.html"    
## [12] "./Metadaten_Geraete_Niederschlagshoehe_00403.txt"     
## [13] "./Metadaten_Geraete_Sonnenscheindauer_00403.html"     
## [14] "./Metadaten_Geraete_Sonnenscheindauer_00403.txt"      
## [15] "./Metadaten_Geographie_00403.txt"                     
## [16] "./Metadaten_Fehldaten_00403_17190101_20221231.html"   
## [17] "./Metadaten_Fehldaten_00403_17190101_20221231.txt"    
## [18] "./Metadaten_Fehlwerte_00403_17190101_20221231.txt"    
## [19] "./produkt_klima_monat_17190101_20221231_00403.txt"

Let us explore the metadata about the DWD weather station Berlin-Dahlem (FU).

data_meta <- read.table("./Metadaten_Geographie_00403.txt", sep = ";", header = TRUE)
data_meta
##   Stations_id Stationshoehe Geogr.Breite Geogr.Laenge von_datum bis_datum
## 1         403            51      52.4625      13.3000  19500101  19970711
## 2         403            51      52.4537      13.3017  19970712        NA
##         Stationsname
## 1 Berlin-Dahlem (FU)
## 2 Berlin-Dahlem (FU)

It is quite interesting that the metadata table shows two different locations associated with the same weather station Berlin-Dahlem (FU). One of the weather stations was operative from 1950-01-01 to 1997-07-11 and the other weather station is in operation since 1997-07-12.

Let us further explore this issue. We add the column Status to the data set and put the nominal variables not active and active into it.

data_meta["Status"] <- c("not active", "active")

Further, in order to explore the measurement site in more detail we plot a map. Hence, we make use of the sp package to transform the data.frame object into a SpatialPointsDataFrame object, which then can be plotted on an interactive map with the mapview package.

data_meta_spp <- SpatialPointsDataFrame(
  coords = data_meta[, c(
    "Geogr.Laenge",
    "Geogr.Breite"
  )],
  data = data_meta,
  proj4string = CRS("+init=epsg:4326")
) # WGS 84

mapview(data_meta_spp,
  zcol = "Status",
  color = c("blue", "red")
)

We get a very nice and convenient visual representation of the location of the two data points in the data set. Feel free to hoover over the map, click on the point locations or change the underling map tiles.

It is quite interesting to see that the currently operating weather station is located at Botanischer Garten, whereas before 1997 the weather station Berlin-Dahlem was located at Podbielskialle. The moving of weather stations occasionally causes problems for the analysis of long term time series, as without prior knowledge about such a moving any change in the weather record may be falsely associated with a change of weather or climatic patterns.

Now that we have some intuition about the measurement site we load the actual time series data into R. We call the str() function to get a brief idea on the structure of the data set.

dwd_raw <- read.csv2("./produkt_klima_monat_17190101_20211231_00403.txt")
str(dwd_raw)
## 'data.frame':    3510 obs. of  17 variables:
##  $ STATIONS_ID      : int  403 403 403 403 403 403 403 403 403 403 ...
##  $ MESS_DATUM_BEGINN: int  17190101 17190201 17190301 17190401 17190501 17190601 17190701 17190801 17190901 17191001 ...
##  $ MESS_DATUM_ENDE  : int  17190131 17190228 17190331 17190430 17190531 17190630 17190731 17190831 17190930 17191031 ...
##  $ QN_4             : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ MO_N             : chr  "  -999" "  -999" "  -999" "  -999" ...
##  $ MO_TT            : chr  "   2.80" "   1.10" "   5.20" "   9.00" ...
##  $ MO_TX            : chr  "    -999" "    -999" "    -999" "    -999" ...
##  $ MO_TN            : chr  "    -999" "    -999" "    -999" "    -999" ...
##  $ MO_FK            : chr  "-999" "-999" "-999" "-999" ...
##  $ MX_TX            : chr  "-999" "-999" "-999" "-999" ...
##  $ MX_FX            : chr  "-999" "-999" "-999" "-999" ...
##  $ MX_TN            : chr  "-999" "-999" "-999" "-999" ...
##  $ MO_SD_S          : chr  "-999" "-999" "-999" "-999" ...
##  $ QN_6             : int  -999 -999 -999 -999 -999 -999 -999 -999 -999 -999 ...
##  $ MO_RR            : chr  "-999" "-999" "-999" "-999" ...
##  $ MX_RS            : chr  "-999" "-999" "-999" "-999" ...
##  $ eor              : chr  "eor" "eor" "eor" "eor" ...

The data set consists of 3510 rows and 17 columns, with the following variables: STATIONS_ID, MESS_DATUM_BEGINN, MESS_DATUM_ENDE, QN_4, MO_N, MO_TT, MO_TX, MO_TN, MO_FK, MX_TX, MX_FX, MX_TN, MO_SD_S, QN_6, MO_RR, MX_RS, eor. By looking at the structure of the data set we spot several entries of values of \(-999\). This value, as indicated in variable description file (here) corresponds to missing values and should be marked as NA. There exist several strategies on how to deal with missing values. Here, in this case, we decide to reload the data set and to add the additional argument na.strings = "-999" to the function call. It is further recommendable to include the argument strip.white = TRUE as well. This avoids that white spaces interfere with the detection of missing values.

We take a look at four randomly picked row entries by calling the handy sample_n() function from the dplyr package.

dwd_raw <- read.csv2("./produkt_klima_monat_17190101_20221231_00403.txt",
  stringsAsFactors = FALSE,
  strip.white = TRUE,
  na.strings = "-999"
)
sample_n(dwd_raw, 4)
##   STATIONS_ID MESS_DATUM_BEGINN MESS_DATUM_ENDE QN_4 MO_N MO_TT MO_TX MO_TN
## 1         403          17720701        17720731    5 <NA> 17.50  <NA>  <NA>
## 2         403          17190401        17190430    5 <NA>  9.00  <NA>  <NA>
## 3         403          19481201        19481231    5 <NA>  1.60  <NA> -1.20
## 4         403          19580501        19580531    5 5.41 14.40 19.31  9.71
##   MO_FK MX_TX MX_FX MX_TN MO_SD_S QN_6 MO_RR MX_RS eor
## 1  <NA>  <NA>  <NA>  <NA>    <NA>   NA  <NA>  <NA> eor
## 2  <NA>  <NA>  <NA>  <NA>    <NA>   NA  <NA>  <NA> eor
## 3  <NA>  <NA>  <NA>  <NA>    <NA>   NA  <NA>  <NA> eor
## 4  2.52  30.0  <NA>   1.8  190.30    5  88.3  34.8 eor

In a next step we want to make sure that the data set is complete. This means that there exists a row entry in the data table for each month within the time period covered by the time series. For now this fact is important for us as we will work with the ts package for time series handling in R. The data representation of the ts class expects regularly spaced data to work properly. Please note that there are several R packages, such as xts or zoo, that can be used to handle irregularly spaced time series data.

For the purpose of this example we conduct a quick sanity check: First, we extract the year/month entry of the first row and the year/month entry of the last row in the data set using the ymd() function.

first_date <- ymd(dwd_raw$MESS_DATUM_BEGINN[1])
first_date
## [1] "1719-01-01"
last_date <- ymd(dwd_raw$MESS_DATUM_BEGINN[nrow(dwd_raw)])
last_date
## [1] "2022-12-01"

Then we construct a monthly sequence from the first to the last date and compare its length with the number of rows (observations) in the data set. If they are equal we conclude that the data set is complete.

expected_length <- length(seq(from = first_date, to = last_date, by = "month"))
expected_length
## [1] 3648
expected_length == nrow(dwd_raw)
## [1] FALSE

Obviously the data set is not complete! For period from 1719-01-01 to 2022-12-01 the data set should consist of 3648 rows. However, it consists only of 3522.

We may easily solve this problem. First, we add a column of dates to the dwd_raw data frame, which correspond to the dates provided in the MESS_DATUM_BEGINN column. Thereafter we generate a new data frame with a sequence of dates in monthly steps from 1719-01-01 to 2022-12-01. Then merge the sequence of dates with the dwd_raw data frame. The resulting data frame should then be populated with a row for each month.

# add column
dwd_raw["MESS_DATUM_BEGINN.date.format"] <- ymd(dwd_raw$MESS_DATUM_BEGINN)

# create new data frame
dwd_month <- data.frame("Date" = seq(
  from = first_date,
  to = last_date, by = "month"
))
dwd_month <- merge(dwd_month, dwd_raw,
  by.x = "Date",
  by.y = "MESS_DATUM_BEGINN.date.format",
  all.x = TRUE
)
expected_length == nrow(dwd_month)
## [1] TRUE

Nice, sanity check passed!

As for now we are only interested in the variable for mean monthly temperature. After looking into the variable description file (here) we found out that the variable of interest is encoded as MO_TT (monthly mean of the daily temperature two meters above ground).

We take a look at the first ten entries of the variable MO_TT.

dwd_month$MO_TT[1:12]
##  [1] "2.80"  "1.10"  "5.20"  "9.00"  "15.10" "19.00" "21.40" "18.80" "13.90"
## [10] "9.00"  "6.60"  "0.30"
class(dwd_month$MO_TT[1:10])
## [1] "character"

As the data is given as character we transform it to a numerical representation by calling the as.numeric() function.

dwd_month$MO_TT <- as.numeric(dwd_month$MO_TT)
dwd_month$MO_TT[1:10]
##  [1]  2.8  1.1  5.2  9.0 15.1 19.0 21.4 18.8 13.9  9.0
class(dwd_month$MO_TT)
## [1] "numeric"

Finally, we store the temperature vector in form of an xts class object by calling the xts() function. We have to feed the function a vector with the data and a vector with the corresponding dates. Recall that we already created a vector of dates beforehand when building the dwd_month data frame.

# generate xts object
ts_FUB_monthly <- xts(dwd_month$MO_TT, dwd_month$Date)
class(ts_FUB_monthly)
## [1] "xts" "zoo"

The package ggfortify allows ggplot2 to interpret ts objects. Thus, after loading ggfortify, we may use the autoplot function for plotting ts objects.

autoplot(ts_FUB_monthly)


Daily time series data Berlin-Dahlem (FU)

In this section we process the daily time series for the DWD station Berlin-Dahlem (FU). The daily time series of the DWD weather station Berlin-Dahlem (FU) was downloaded on 2022-07-22 from Climate Data Center. A detailed variable description is available here. For the purpose of this tutorial the monthly time series data set is made available here.

The preprocessing procedure is very similar to the one for monthly time series data. Please revisit the previous section if you cannot follow the outlined steps below.

## download data
temp <- tempfile()
download_url <- "https://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/daily/kl/historical/"
zipfile <- "tageswerte_KL_00403_19500101_20211231_hist.zip"
download.file(paste0(download_url, zipfile), temp, mode = "wb")
metadata <- unzip(temp)
unlink(temp)

# load data of interest
dwd_raw <- read.csv2("./produkt_klima_tag_19500101_20211231_00403.txt",
  stringsAsFactors = FALSE,
  strip.white = TRUE,
  na.strings = "-999"
)
str(dwd_raw)
## 'data.frame':    26298 obs. of  19 variables:
##  $ STATIONS_ID: int  403 403 403 403 403 403 403 403 403 403 ...
##  $ MESS_DATUM : int  19500101 19500102 19500103 19500104 19500105 19500106 19500107 19500108 19500109 19500110 ...
##  $ QN_3       : logi  NA NA NA NA NA NA ...
##  $ FX         : logi  NA NA NA NA NA NA ...
##  $ FM         : logi  NA NA NA NA NA NA ...
##  $ QN_4       : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ RSK        : chr  "2.2" "12.6" "0.5" "0.5" ...
##  $ RSKF       : int  7 8 1 7 7 8 1 1 8 8 ...
##  $ SDK        : chr  NA NA NA NA ...
##  $ SHK_TAG    : int  0 0 0 0 0 12 0 0 0 0 ...
##  $ NM         : chr  "5.0" "8.0" "5.0" "7.7" ...
##  $ VPM        : chr  "4.0" "6.1" "6.5" "5.2" ...
##  $ PM         : chr  "1025.60" "1005.60" "996.60" "999.50" ...
##  $ TMK        : chr  "-3.2" "1.0" "2.8" "-0.1" ...
##  $ UPM        : chr  "83.00" "95.00" "86.00" "85.00" ...
##  $ TXK        : chr  "-1.1" "2.2" "3.9" "2.1" ...
##  $ TNK        : chr  "-4.9" "-3.7" "1.7" "-0.9" ...
##  $ TGK        : chr  "-6.3" "-5.3" "-1.4" "-2.3" ...
##  $ eor        : chr  "eor" "eor" "eor" "eor" ...

For now we are interested in the variables for rainfall and temperature. After reviewing the variable description file (here) we found that the daily rainfall in mm is encoded in the RSK variable and daily mean temperature two meters above ground is encoded in the TMK variable. For the purpose of comprehensibility we construct a new data frame including the two variables of interest.

# construct data frame from variables of interest
df_FUB <- data.frame(
  "Temp" = as.numeric(dwd_raw$TMK),
  "Rain" = as.numeric(dwd_raw$RSK)
)

In the next step we store the temperature and rainfall variables in form of an xts class object by calling the xts() function. Recall that we have to feed the function the data, but as well a vector with dates.

# construct time series index
timestamp <- ymd(dwd_raw$MESS_DATUM)
# generate xts object
ts_FUB_daily <- xts(df_FUB, order.by = timestamp)

Let us plot the data with the autoplot function.

autoplot(ts_FUB_daily)


Hourly time series data Berlin-Dahlem (FU)

In this section we process the hourly time series of rainfall data for the DWD station Berlin-Dahlem (FU). The hourly time series of the DWD weather station Berlin-Dahlem (FU) was downloaded on 2022-07-22 from Climate Data Center. A detailed variable description is available here. For the purpose of this tutorial the hourly time series data set is made available here.

The preprocessing procedure is very similar to the one for monthly and daily data. Please revisit the previous sections if you cannot follow the outlined steps below.

## download data
temp <- tempfile()
download_url <- "https://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/hourly/precipitation/historical/"
zipfile <- "stundenwerte_RR_00403_20020128_20221231_hist.zip"
download.file(paste0(download_url, zipfile), temp, mode = "wb")
metadata <- unzip(temp)
unlink(temp)

# load data of interest
dwd_raw <- read.csv2("./produkt_rr_stunde_20020128_20211231_00403.txt",
  stringsAsFactors = FALSE,
  strip.white = TRUE,
  na.strings = "-999"
)
str(dwd_raw)
## 'data.frame':    174023 obs. of  7 variables:
##  $ STATIONS_ID: int  403 403 403 403 403 403 403 403 403 403 ...
##  $ MESS_DATUM : int  2002012811 2002012813 2002012815 2002012818 2002012821 2002012822 2002012823 2002012900 2002012901 2002012902 ...
##  $ QN_8       : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ R1         : chr  "0.0" "0.0" "1.7" "1.1" ...
##  $ RS_IND     : int  0 0 1 1 0 0 1 0 0 0 ...
##  $ WRTR       : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ eor        : chr  "eor" "eor" "eor" "eor" ...

The data set contains hourly rainfall measurements for the weather station Berlin-Dahlem (FU). After reviewing the variable description file (here) we found that the hourly rainfall in mm is encoded in the R1 variable.

# construct data frame from variables of interest
df_FUB <- data.frame("Rain" = as.numeric(dwd_raw$R1))
# fill up NA values with 0
df_FUB[is.na(df_FUB)] <- 0

In the next step we store the rainfall variable in form of an xts class object. Recall that we have to feed the xts() function the data, but as well a vector with dates.

# construct time series index
timestamp <- ymd_h(as.character(dwd_raw$MESS_DATUM))
# generate xts object
ts_FUB_hourly <- xts(df_FUB, order.by = timestamp)

In the next step we plot the data with the convenient autoplot() function.

# plot
autoplot(ts_FUB_hourly)

Finally, we store the monthly, daily and hourly time series data set in a .RData file for further processing.

save(file = "DWD_FUB.RData", ts_FUB_monthly, ts_FUB_daily, ts_FUB_hourly)

Citation

The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us via mail by soga[at]zedat.fu-berlin.de.

Creative Commons License
You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follow: Hartmann, K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.