In the previous section we downloaded a preprocessed data set of weather station data for Germany. In this section however, we download and summarize weather data provided by Deutscher Wetterdienst (DWD) (German Weather Service) on our own.

The goal is to obtain a data set of annual mean rainfall and mean annual temperature data for the 30 year period 1981-2010 for weather stations located in East Germany.

The data is available on the DWD data portal. Here, we use the rdwd package, a package that simplifies downloading data from the DWD.

First, we load the required package from CRAN.

library(rdwd)
library(tidyverse)

The rdwd package has nice features such as the metaIndex, which provides an overview of the available data and the data structure.

data(metaIndex)
sample_n(metaIndex, 10)
##    Stations_id  von_datum  bis_datum Stationshoehe geoBreite geoLaenge
## 1         3086 1996-07-02 2023-06-16            15   53.8025   10.6989
## 2         5111 2005-12-20 2023-06-16           560   48.0311   12.5396
## 3         2370 1947-01-01 2006-12-31           290   50.7638   11.6153
## 4         3388 1994-02-01 1997-01-31           542   48.0919   11.6486
## 5         5167 1978-01-01 2023-05-31           505   48.1342   12.2731
## 6         2618 1991-07-24 2011-12-31           732   50.8458   10.4803
## 7        16085 1960-07-01 1961-12-31             8   53.5874    8.6011
## 8         2704 2004-05-01 2023-06-16            76   51.7511   12.0094
## 9         2701 1945-07-01 2007-03-31           136   51.1329   11.7252
## 10        5371 1949-01-01 2023-06-15           920   50.4973    9.9427
##            Stationsname         Bundesland        res                    var
## 1    Luebeck-Blankensee Schleswig-Holstein 10_minutes                  solar
## 2             Trostberg             Bayern 10_minutes        air_temperature
## 3           Hummelshain         Thueringen      daily            more_precip
## 4      Muenchen-Perlach             Bayern      daily more_weather_phenomena
## 5  Unterreit-Wagenstatt             Bayern    monthly        climate_indices
## 6    Kleiner Inselsberg         Thueringen   subdaily                   soil
## 7            Leherheide             Bremen   subdaily             cloudiness
## 8      Koethen (Anhalt)     Sachsen-Anhalt     hourly             cloud_type
## 9           Koesen, Bad     Sachsen-Anhalt      daily      weather_phenomena
## 10          Wasserkuppe             Hessen     hourly             wind_synop
##           per hasfile
## 1      recent    TRUE
## 2  historical    TRUE
## 3      recent   FALSE
## 4      recent   FALSE
## 5      recent    TRUE
## 6      recent   FALSE
## 7  historical    TRUE
## 8  historical    TRUE
## 9  historical    TRUE
## 10     recent    TRUE

At the moment of writing this data frame lists 146,644 data sets. Here, we focus only on weather stations located in East Germany. Further, we restrict our search to weather stations, which provide data on a monthly basis.

states <- c(
  "Sachsen", "Sachsen-Anhalt", "Mecklenburg-Vorpommern",
  "Brandenburg", "Thueringen", "Berlin"
)

resolution <- "monthly"
variables <- "kl"
period <- "historical"
df_station <- metaIndex[metaIndex$Bundesland %in% states &
  metaIndex$res == resolution &
  metaIndex$per == period &
  metaIndex$var == variables &
  metaIndex$hasfile == TRUE, ]

# number of weather stations
nrow(df_station)
## [1] 279

Based on our constraints we end up with a data set of 279 weather stations. The data provided by DWD is monthly data, however we are only interested in the mean annual rainfall and mean annual temperature for the period 1981–2010. Hence, we have to do some data preprocessing. In order to gain a better intuition we first showcase the processing pipeline for one example weather station. Later we automate this pipeline, to be applied on all 279 selected weather stations.


Download example data set

Here, we focus on the weather station Berlin-Tempelhof. Detailed information on the structure of the DWD data sets can be found here. Further information on the rdwd package can be found here.

The general procedure is outlined in five steps:

  1. Select the weather station of interest based on the station identifier (id), on the temporal resolution of the record (res), on the variable type (var) and on the observation period (per).

  2. Download the data using the functionality of the rdwd package.

  3. Check the data record for completeness.

  4. Extract the statistic of interest, which in our case is the mean annual temperature and mean annual rainfall for the period 1981–2010.

  5. Repeat step 1 to 4 for all stations of interest.


Step 1

We subset the weather station of interest from the catalog of DWD stations.

sample_station <- df_station[df_station$Stationsname == "Berlin-Tempelhof", ]
sample_station
##      Stations_id  von_datum  bis_datum Stationshoehe geoBreite geoLaenge
## 9855         433 1938-01-01 2023-05-31            48   52.4675   13.4021
##          Stationsname Bundesland     res var        per hasfile
## 9855 Berlin-Tempelhof     Berlin monthly  kl historical    TRUE


Exercise: Create a second subset for the weather station in Potsdam.

## Your code here...
Potsdam_station <- NULL
Show code
Potsdam_station <- df_station[df_station$Stationsname == "Potsdam", ]
Potsdam_station
##       Stations_id  von_datum  bis_datum Stationshoehe geoBreite geoLaenge
## 85031        3987 1893-01-01 2023-05-31            81   52.3812   13.0622
##       Stationsname  Bundesland     res var        per hasfile
## 85031      Potsdam Brandenburg monthly  kl historical    TRUE


Step 2

We download the data using the functionality of the rdwd package. First, we generate the link pointing to the file in the online database by applying the selectDWD() function.
The DWD often updates or expands datasets, and when that happens, the filenames in the historical folders may change. By setting current = TRUE, we ensure that we are pointing to the latest version of the dataset, thus avoiding potential issues with outdated filenames. Another option to address the issue of changing filenames is to use rdwd::updateRdwd(). This will check the version of the rdwd package and update it to the latest development version on github. You can find more details about this topic in Berry Boessenkool’s documentation of rdwd.

link <- selectDWD(
  id = sample_station$Stations_id,
  res = sample_station$res,
  var = sample_station$var,
  per = sample_station$per,
  current = TRUE
)
link
## [1] "ftp://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/monthly/kl/historical/monatswerte_KL_00433_19380101_20231231_hist.zip"

Then we download the file by applying the dataDWD() function. Once the file is downloaded, we apply the readDWD() function, which reads the file, processes it and returns it as a data.frame object.

tdir <- tempdir()
file <- dataDWD(link, read = FALSE, dir = tdir, quiet = TRUE, force = FALSE)
clim <- readDWD(file)
## Warning: knitr::knit -> process_file -> xfun:::handle_error -> process_group -> call_block -> block_exec ->  readDWD: R package 'data.table' available for fast reading of files, but system command 'unzip' could not be found. Now reading slowly.
## See   https://bookdown.org/brry/rdwd/fread.html
## Reading 1 file with readDWD.data() and fread=FALSE ...


Exercise: Download and read the data of the weather station in Potsdam.

## Your code here...
link2 <- NULL
tdir2 <- NULL
file2 <- NULL
clim_Potsdam <- NULL
Show code
link2 <- selectDWD(
  id = Potsdam_station$Stations_id,
  res = Potsdam_station$res,
  var = Potsdam_station$var,
  per = Potsdam_station$per,
  current = TRUE
)
## Determining the content of the 1 folder(s)...
## knitr::knit -> process_file -> xfun:::handle_error -> process_group -> call_block -> block_exec ->  selectDWD -> createIndex -> indexFTP -> dirDWD: adding to directory 'C:/Users/sarah/OneDrive/Documents/UNI/Tutorium/GitHub/soga/Soga-R/300/30900_geostatistics/DWDdata'
## knitr::knit -> process_file -> xfun:::handle_error -> process_group -> call_block -> block_exec ->  selectDWD -> createIndex -> indexFTP -> newFilename: creating 1 file (1 already existed for which '_n' is appended):'C:/Users/sarah/OneDrive/Documents/UNI/Tutorium/GitHub/soga/Soga-R/300/30900_geostatistics/DWDdata/INDEX_of_DWD_monthly_kl_historical_31.txt'
## Splitting filenames... (0 secs so far)
## Extracting station IDs + time range from filenames... (0 secs so far)
## Determining if files contain meta data... (0 secs so far)
tdir2 <- tempdir()
file2 <- dataDWD(link2, read = FALSE, dir = tdir2, quiet = TRUE, force = FALSE)
clim_Potsdam <- readDWD(file2)
## Reading 1 file with readDWD.data() and fread=FALSE ...


Step 3

Now we check the data set for completeness by using standard R functions, such as summary(). A description of the variables is available here.

summary(clim)
##   STATIONS_ID    MESS_DATUM                      MESS_DATUM_BEGINN 
##  Min.   :433   Min.   :1938-01-15 00:00:00.000   Min.   :19380101  
##  1st Qu.:433   1st Qu.:1961-04-07 06:00:00.000   1st Qu.:19610376  
##  Median :433   Median :1982-03-01 00:00:00.000   Median :19820251  
##  Mean   :433   Mean   :1981-12-17 19:56:10.518   Mean   :19815313  
##  3rd Qu.:433   3rd Qu.:2003-01-22 18:00:00.000   3rd Qu.:20030126  
##  Max.   :433   Max.   :2023-12-15 00:00:00.000   Max.   :20231201  
##                                                                    
##  MESS_DATUM_ENDE         QN_4             MO_N           MO_TT       
##  Min.   :19380131   Min.   : 3.000   Min.   :2.640   Min.   :-9.400  
##  1st Qu.:19610405   1st Qu.: 5.000   1st Qu.:4.600   1st Qu.: 3.785  
##  Median :19820280   Median : 9.000   Median :5.270   Median : 9.540  
##  Mean   :19815343   Mean   : 7.335   Mean   :5.272   Mean   : 9.812  
##  3rd Qu.:20030155   3rd Qu.:10.000   3rd Qu.:5.997   3rd Qu.:16.285  
##  Max.   :20231231   Max.   :10.000   Max.   :7.440   Max.   :24.380  
##                                      NA's   :262                     
##      MO_TX            MO_TN             MO_FK           MX_TX      
##  Min.   :-6.000   Min.   :-12.700   Min.   :1.630   Min.   : 1.60  
##  1st Qu.: 6.287   1st Qu.:  1.062   1st Qu.:2.480   1st Qu.:13.78  
##  Median :13.770   Median :  5.585   Median :2.660   Median :22.15  
##  Mean   :13.592   Mean   :  5.947   Mean   :2.698   Mean   :21.58  
##  3rd Qu.:20.852   3rd Qu.: 11.675   3rd Qu.:2.870   3rd Qu.:29.40  
##  Max.   :30.300   Max.   : 17.870   Max.   :4.060   Max.   :38.50  
##  NA's   :8                          NA's   :107     NA's   :92     
##      MX_FX           MX_TN             MO_SD_S            QN_6      
##  Min.   :11.30   Min.   :-22.5000   Min.   :  8.00   Min.   : 3.00  
##  1st Qu.:16.65   1st Qu.: -5.4000   1st Qu.: 64.42   1st Qu.: 5.00  
##  Median :19.00   Median : -0.4000   Median :135.30   Median : 5.00  
##  Mean   :19.85   Mean   : -0.2711   Mean   :142.25   Mean   : 7.07  
##  3rd Qu.:22.60   3rd Qu.:  6.6000   3rd Qu.:210.65   3rd Qu.: 9.00  
##  Max.   :40.20   Max.   : 14.0000   Max.   :377.60   Max.   :10.00  
##  NA's   :393     NA's   :92         NA's   :384      NA's   :2      
##      MO_RR            MX_RS         eor      
##  Min.   :  0.60   Min.   :  0.40   eor:1004  
##  1st Qu.: 27.40   1st Qu.:  7.30             
##  Median : 43.20   Median : 11.20             
##  Mean   : 47.82   Mean   : 13.89             
##  3rd Qu.: 62.30   3rd Qu.: 17.30             
##  Max.   :193.40   Max.   :119.50             
##  NA's   :7        NA's   :99

Further, we use the xts and lubridate packages for time series analysis and plot the monthly rainfall data.

library(xts)
library(lubridate)
# create time series object of monthly rainfall
rainfall_sample <- xts(
  x = clim$MO_RR,
  order.by = as.yearmon(ymd(clim$MESS_DATUM_BEGINN))
)
# plot the data
barplot(rainfall_sample,
  ylab = "Monthly rainfall",
  main = paste0(
    "DWD Weather Station ",
    sample_station$Stationsname,
    ", ",
    sample_station$Bundesland
  )
)

The rainfall time series at Berlin-Tempelhof consists of 1004 monthly observations spanning a period from January 1938 to December 2023.

For our use case we are only interest in the period 1981 to 2010, hence we subset the data set and plot it for inspection.

barplot(rainfall_sample["1981/2010"],
  ylab = "Monthly rainfall",
  main = paste0(
    "DWD Weather Station ",
    sample_station$Stationsname,
    ", ",
    sample_station$Bundesland
  )
)


Exercise: Check for the weather station in Potsdam if the data for monthly sunshine duration is complete in the period 1981 to 2010.

## Your code here...
sunshine_Potsdam <- NULL
Show code
summary(clim_Potsdam)
##   STATIONS_ID     MESS_DATUM                      MESS_DATUM_BEGINN 
##  Min.   :3987   Min.   :1893-01-15 00:00:00.000   Min.   :18930101  
##  1st Qu.:3987   1st Qu.:1925-10-07 12:00:00.000   1st Qu.:19250976  
##  Median :3987   Median :1958-06-30 00:00:00.000   Median :19580651  
##  Mean   :3987   Mean   :1958-06-30 15:16:01.832   Mean   :19580651  
##  3rd Qu.:3987   3rd Qu.:1991-03-22 18:00:00.000   3rd Qu.:19910326  
##  Max.   :3987   Max.   :2023-12-15 00:00:00.000   Max.   :20231201  
##                                                                     
##  MESS_DATUM_ENDE         QN_4             MO_N           MO_TT        
##  Min.   :18930131   Min.   : 3.000   Min.   :2.530   Min.   :-10.860  
##  1st Qu.:19251006   1st Qu.: 5.000   1st Qu.:4.728   1st Qu.:  2.763  
##  Median :19580680   Median : 5.000   Median :5.375   Median :  8.770  
##  Mean   :19580680   Mean   : 6.202   Mean   :5.349   Mean   :  8.906  
##  3rd Qu.:19910356   3rd Qu.: 5.000   3rd Qu.:6.020   3rd Qu.: 15.350  
##  Max.   :20231231   Max.   :10.000   Max.   :7.550   Max.   : 23.560  
##                                      NA's   :28                       
##      MO_TX            MO_TN              MO_FK           MX_TX      
##  Min.   :-6.370   Min.   :-15.4300   Min.   :1.880   Min.   : 0.60  
##  1st Qu.: 5.817   1st Qu.:  0.1175   1st Qu.:2.720   1st Qu.:13.40  
##  Median :13.750   Median :  4.6350   Median :2.950   Median :22.35  
##  Mean   :13.388   Mean   :  5.0152   Mean   :3.013   Mean   :21.43  
##  3rd Qu.:20.907   3rd Qu.: 10.5725   3rd Qu.:3.250   3rd Qu.:29.40  
##  Max.   :30.530   Max.   : 16.9500   Max.   :4.790   Max.   :38.90  
##                                      NA's   :11                     
##      MX_FX           MX_TN            MO_SD_S            QN_6       
##  Min.   :12.50   Min.   :-26.800   Min.   : 11.30   Min.   : 3.000  
##  1st Qu.:18.30   1st Qu.: -6.600   1st Qu.: 69.47   1st Qu.: 5.000  
##  Median :21.00   Median : -1.500   Median :140.20   Median : 5.000  
##  Mean   :21.62   Mean   : -1.402   Mean   :144.39   Mean   : 6.324  
##  3rd Qu.:24.00   3rd Qu.:  5.500   3rd Qu.:212.38   3rd Qu.: 9.000  
##  Max.   :37.30   Max.   : 14.100   Max.   :366.60   Max.   :10.000  
##  NA's   :1018                      NA's   :2                        
##      MO_RR            MX_RS         eor      
##  Min.   :  0.20   Min.   :  0.20   eor:1572  
##  1st Qu.: 28.30   1st Qu.:  7.50             
##  Median : 43.20   Median : 11.40             
##  Mean   : 48.80   Mean   : 13.86             
##  3rd Qu.: 64.22   3rd Qu.: 16.80             
##  Max.   :202.30   Max.   :104.80             
## 
# create time series object of monthly rainfall
sunshine_Potsdam <- xts(
  x = clim_Potsdam$MO_SD_S,
  order.by = as.yearmon(ymd(clim_Potsdam$MESS_DATUM_BEGINN))
)
# plot the data
barplot(sunshine_Potsdam["1981/2010"],
  ylab = "Monthly sunshine duration",
  main = paste0(
    "DWD Weather Station ",
    Potsdam_station$Stationsname,
    ", ",
    Potsdam_station$Bundesland
  )
)


Step 4

Based on the visual inspection of the graph above we may conclude that the rainfall data is fairly complete. Next, we compute the mean annual rainfall by applying several functions from the dplyr package, which is part of the tidyverse ecosystem, such as mutate, group_by, filter and summarise.

annual_rainfall <- clim
annual_rainfall <- annual_rainfall %>%
  mutate(year = year(ymd(MESS_DATUM_BEGINN))) %>%
  group_by(year) %>%
  filter(year >= 1981 & year <= 2010) %>%
  summarise(
    # calculate the monthly rainfall for each year
    year_sum = sum(MO_RR, na.rm = TRUE)
  )

## plotting
ggplot(data = annual_rainfall, aes(x = year, y = year_sum)) +
  geom_bar(stat = "identity") +
  ggtitle(paste0(
    "Annual rainfall ",
    sample_station$Stationsname,
    ", ",
    sample_station$Bundesland
  ))

The graph shows the annual rainfall for the period 1981 to 2010. One final step is still pending: We need to take the average of the annual series to get the mean annual rainfall at the weather station Berlin-Tempelhof for the period 1981–2010.

## print result
print(paste0(
  "Mean annual rainfall at station ",
  sample_station$Stationsname,
  ": ",
  round(mean(annual_rainfall$year_sum)),
  " mm"
))
## [1] "Mean annual rainfall at station Berlin-Tempelhof: 577 mm"

Done!

Exercise: Compute the mean annual sunshine duration at the weather station in Potsdam for the period 1981–2010.

## Your code here...
annual_sunshine <- NULL
Show code
annual_sunshine <- clim_Potsdam
annual_sunshine <- annual_sunshine %>%
  mutate(year = year(ymd(MESS_DATUM_BEGINN))) %>%
  group_by(year) %>%
  filter(year >= 1981 & year <= 2010) %>%
  summarise(
    # calculate the monthly rainfall for each year
    year_sum = sum(MO_SD_S, na.rm = TRUE)
  )

## plotting
ggplot(data = annual_sunshine, aes(x = year, y = year_sum)) +
  geom_bar(stat = "identity") +
  ggtitle(paste0(
    "Annual sunshine duration ",
    Potsdam_station$Stationsname,
    ", ",
    Potsdam_station$Bundesland
  ))

## print result
print(paste0(
  "Mean annual sunshine duration at station ",
  Potsdam_station$Stationsname,
  ": ",
  round(mean(annual_sunshine$year_sum)),
  " h"
))
## [1] "Mean annual sunshine duration at station Potsdam: 1742 h"


Now, we are aware of the procedure to download data from the DWD and extract the statistic of interest. In the next section, we will automate this procedure to download all weather stations of interest and extract the statistic of interest for each of them.


Download data set of interest

In this section, we automate the procedure of downloading a data set from DWD and extracting a particular statistic of interest. Recall that the general procedure may be outlined in five steps:

  1. Select the weather station of interest.

  2. Download the data.

  3. Check the data record for completeness.

  4. Extract the statistic of interest.

  5. Repeat step 1–4 for all weather stations of interest.

In the previous, section we already defined a subset of weather stations of interest. The code below is copied from the previous section. Basically, we are interested in all weather stations in East Germany, which provide weather data on a monthly basis.

data(metaIndex)
states <- c(
  "Sachsen", "Sachsen-Anhalt", "Mecklenburg-Vorpommern",
  "Brandenburg", "Thueringen", "Berlin"
)

resolution <- "monthly"
variables <- "kl"
period <- "historical"
df_station <- metaIndex[metaIndex$Bundesland %in% states &
  metaIndex$res == resolution &
  metaIndex$per == period &
  metaIndex$var == variables &
  metaIndex$hasfile == TRUE, ]

The resulting data set consists of 279 weather stations.

In the next step we download each of these weather stations, check the data for completeness and extract the observational data for the period 1981–2010. Finally we compute the mean annual rainfall and the mean annual temperature and store this information in a data.frame object together with the weather station identifier (Station_id), the altitude (Altitude), the geographical coordinates (Lat and Lon), the name of the station (Name) and the name of the federal state Federal_State, the weather station is situated.

For our convenience we wrap the repetitive part of this procedure into a function, denoted as download_dwd_data. Go through the function defined below, read it line for line and try to figure out what is happening.

download_dwd_data <- function(df,
                              res = "monthly",
                              var = "kl",
                              per = "historical",
                              RUN = TRUE, verbose = TRUE,
                              forceDownload = FALSE) {
  ##############################################
  ### FUNCTION TO DOWNLOAD DATA FROM THE DWD ###
  ##############################################
  library(rdwd)
  library(tidyverse)
  library(lubridate)

  if (RUN) {
    ## COLUMNS TO KEEP
    df_column_names <- c(
      "Stations_id", "Stationshoehe",
      "geoBreite", "geoLaenge", "Stationsname",
      "Bundesland"
    )
    df_result <- df[, df_column_names]

    ## RENAME COLUMNS
    df_column_names_english <- c(
      "Station_id", "Altitude", "Lat", "Lon",
      "Name", "Federal_State"
    )
    colnames(df_result) <- df_column_names_english

    ## EMPTY COLUMNS TO STORE RESULTS
    df_result["Rainfall"] <- NA
    df_result["Temperature"] <- NA

    ## LOOP THROUGH STATIONS ##
    for (j in 1:length(df$Stations_id)) {
      if (verbose) {
        ## Write progress information to console
        print(paste(
          "Weather station", df[j, "Stationsname"], "--",
          j, "out of", length(df$Stations_id)
        ))
      }

      ## DOWNLOAD STATION DATA ##
      ## for detailed instructions type vignette("rdwd") into the console
      tdir <- tempdir()
      link <- selectDWD(
        id = df$Stations_id[j],
        res = res,
        var = var,
        per = per
      )
      file <- dataDWD(link,
        read = FALSE,
        dir = tdir,
        quiet = TRUE,
        force = forceDownload
      )
      clim <- readDWD(file)
      clim["Date"] <- ymd(clim$MESS_DATUM_BEGINN)

      ## LOOP THROUGH PARAMETERS ##
      params <- c(quo(MO_TT), quo(MO_RR))
      for (i in 1:length(params)) {
        # initial dataframe
        df_ <- clim %>%
          # generate one column and populate it with the year
          mutate(year = year(Date)) %>%
          # group the data by the new columns
          group_by(year) %>%
          # select only years between 1981 and 2010
          filter(year >= 1981 & year <= 2010) %>%
          summarise(
            # calculate the monthly sum for each year
            year_sum = sum(!!params[[i]], na.rm = TRUE),
            # calculate the monthly mean for each year
            year_mean = mean(!!params[[i]], na.rm = TRUE)
          ) %>%
          # Add column with the number of years in record
          mutate(year_count = n())

        ## CONTROL FLOW ##
        ## if there is no data for the period 81-10 continue
        if (identical(df_$year_count, integer(0))) {
          next
        } else {
          # if there is data, make sure we have data for each year
          if (mean(df_$year_count, na.rm = TRUE) == 30) {
            ## IF SO BUILD OUTPUT DATA FRAME ##
            ## Temperature
            if (any(grepl(pattern = "T", x = params[[i]])) == TRUE) {
              result <- round(mean(df_$year_mean, na.rm = TRUE), 1)
              df_result[j, "Temperature"] <- result
            } else {
              ## Rainfall
              result <- round(mean(df_$year_sum, na.rm = TRUE))
              df_result[j, "Rainfall"] <- result
            }
          }
        }
      }
    }
    df_result <- df_result[complete.cases(df_result), ]
    write.csv(
      x = df_result,
      file = "dwd_data_1981-2010.csv",
      row.names = FALSE
    )
    return(df_result)
  } else {
    ## DOWNLOAD PREPROCESSED DATA ##
    url <- "https://userpage.fu-berlin.de/soga/data/raw-data/"
    df_result <- read.csv(paste0(url, "dwd_data_1981-2010.csv"))

    print("########################################################")
    print("### NOTE THAT YOU ARE DOWNLOADING PREPROCESSED DATA ###")
    print("### THE ARGUMENTS TO THE FUNCTION CALL ARE DISMISSED ###")
    print("########################################################")
    return(df_result)
  }
}

For the ease of application we included the RUN argument in the function. If RUN = TRUE then the main function body will be executed, the data will be downloaded and processed accordingly. Depending on your internet connection this may take some time. However, if RUN = FALSE the function only downloads a prepared data.frame object fitting the purpose of this tutorial. This second option may be of interest, if you just want to follow along this tutorial. For sure this second approach will consume much less time.

df_result <- download_dwd_data(
  df = df_station,
  RUN = FALSE,
  verbose = TRUE,
  forceDownload = FALSE
)
## [1] "########################################################"
## [1] "### NOTE THAT YOU ARE DOWNLOADING PREPROCESSED DATA ###"
## [1] "### THE ARGUMENTS TO THE FUNCTION CALL ARE DISMISSED ###"
## [1] "########################################################"
str(df_result)
## 'data.frame':    71 obs. of  9 variables:
##  $ X            : int  3089 3198 3593 4017 4553 111897 8075 8141 8270 8947 ...
##  $ Station_id   : int  164 167 183 198 222 5825 399 400 403 430 ...
##  $ Altitude     : int  54 11 42 164 393 40 36 60 51 36 ...
##  $ Lat          : num  53 53.8 54.7 51.4 50.6 ...
##  $ Lon          : num  14 13.7 13.4 11.3 12.7 ...
##  $ Name         : chr  "Angermuende" "Anklam" "Arkona" "Artern" ...
##  $ Federal_State: chr  "Brandenburg" "Mecklenburg-Vorpommern" "Mecklenburg-Vorpommern" "Thueringen" ...
##  $ Rainfall     : int  521 562 546 490 805 515 459 579 591 550 ...
##  $ Temperature  : num  8.9 9 8.5 9.3 8.4 9.4 10.5 9.5 9.5 10 ...

The resulting data frame (df_result) contains data of 71 weather stations, with the features X, Station_id, Altitude, Lat, Lon, Name, Federal_State, Rainfall, Temperature.


Data review

Now that we downloaded the weather data of interest from DWD we may inspect the data set.

sample_n(df_result, 10)
##         X Station_id Altitude     Lat     Lon                       Name
## 1   59117       3034      496 50.4505 11.6350            Lobenstein, Bad
## 2   11925        596       15 54.0027 11.1908                Boltenhagen
## 3   69605       3552       38 52.9037 12.8072                  Neuruppin
## 4   36958       1869       56 53.3154 13.9339                    Gruenow
## 5   98261       5097       10 54.0661 12.7675                   Tribsees
## 6   94085       4878      504 51.6646 10.8811 Oberharz am Brocken-Stiege
## 7  108205       5629      105 51.8891 12.6446                 Wittenberg
## 8  104386       5424      328 51.0177 11.3544          Weimar-Schoendorf
## 9   56754       2928      138 51.3151 12.4462         Leipzig-Holzhausen
## 10   3089        164       54 53.0316 13.9908                Angermuende
##             Federal_State Rainfall Temperature
## 1              Thueringen      871         7.3
## 2  Mecklenburg-Vorpommern      592         9.0
## 3             Brandenburg      535         9.2
## 4             Brandenburg      317         9.1
## 5  Mecklenburg-Vorpommern      627         8.5
## 6          Sachsen-Anhalt      897         6.6
## 7          Sachsen-Anhalt      567         9.4
## 8              Thueringen      579         8.9
## 9                 Sachsen      629         9.7
## 10            Brandenburg      521         8.9

By applying the summary() function we print basic statistical measures of the variables Temperature and Rainfall.

summary(df_result[, c("Temperature", "Rainfall")])
##   Temperature        Rainfall     
##  Min.   : 3.500   Min.   : 317.0  
##  1st Qu.: 8.500   1st Qu.: 533.5  
##  Median : 9.100   Median : 578.0  
##  Mean   : 8.631   Mean   : 638.5  
##  3rd Qu.: 9.400   3rd Qu.: 639.0  
##  Max.   :10.500   Max.   :1879.0

By means of visual inspection, we can check for outliers or invalid data points.

library(GGally)
ggpairs(df_result[, c("Altitude", "Rainfall", "Temperature")],
  title = "",
  axisLabels = "show",
  columnLabels = c("Altitude", "Rainfall", "Temperature")
)

The graph looks fine. We notice the expected pattern: Rainfall tends to increase with altitude, whereas temperature tends to decrease with altitude.

Finally, we contextualize our data set by plotting the weather stations on an interactive map.

library(leaflet)
library(leafpop)
library(htmltools)

# plot
df_result %>% leaflet() %>% addTiles() %>% 
  addCircleMarkers(data = df_result, radius = 5,
    stroke = FALSE, fillOpacity = 0.5,
    popup = popupTable(df_result, zcol = 2:9)
  )

For the purpose of later usage we caste the spatial data sets to Spatial objects and save them to disk.

library(raster)
library(sf)

# Retrieve Federal States by the the getData() function from the raster package
germany <- getData(country = "Germany", level = 1)
germany_sf <- st_as_sf(germany)
# transform to UTM zone 33
germany_sf <- st_transform(germany_sf, 32633)
# simplify data
states <- c(
  "Sachsen", "Sachsen-Anhalt",
  "Mecklenburg-Vorpommern", "Brandenburg",
  "Thüringen", "Berlin"
)
east_germany_states <- germany_sf[germany_sf$NAME_1 %in% states, ]
east_germany <- east_germany_states %>%
  group_by(NAME_0) %>%
  summarize()

# create simple feature object from  data frame object
dwd_east_sf <- st_as_sf(df_result, coords = c("Lon", "Lat"), crs = 4326)
# transform to UTM zone 33
dwd_east_sf <- st_transform(dwd_east_sf, 32633)

# create sp objects
east_germany_sp <- as(east_germany, "Spatial")
east_germany_states_sp <- as(east_germany_states, "Spatial")
dwd_east_sp <- as(dwd_east_sf, "Spatial")

save(
  file = "East_Germany.RData",
  east_germany_sp, east_germany_states_sp, dwd_east_sp
)

Citation

The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us via mail by soga[at]zedat.fu-berlin.de.

Creative Commons License
You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follow: Hartmann, K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.