In the previous section we downloaded a preprocessed data set of weather station data for Germany. In this section however, we download and summarize weather data provided by Deutscher Wetterdienst (DWD) (German Weather Service) on our own.

The goal is to obtain a data set of annual mean rainfall and mean annual temperature data for the 30 year period 1981-2010 for weather stations located in East Germany.

The data is available on the DWD data portal. Here, we use the rdwd package, a package that simplifies downloading data from the DWD.

First, we load the required package from CRAN.

library(rdwd)
library(tidyverse)

The rdwd package has nice features such as the metaIndex, which provides an overview of the available data and the data structure.

data(metaIndex)
sample_n(metaIndex, 10)
##    Stations_id  von_datum  bis_datum Stationshoehe geoBreite geoLaenge
## 1         4745 1958-06-01 2023-04-13            75   52.9604    9.7930
## 2         7330 2005-10-01 2009-09-30           159   51.4633    7.9780
## 3         5344 1960-01-01 2006-12-31             2   53.7865    7.9096
## 4         5745 2004-04-19 2023-01-04            51   52.9663   13.3268
## 5         2532 1880-01-01 2012-12-31           231   51.2963    9.4424
## 6         5621 1991-01-01 2000-05-31           300   51.5662   11.2067
## 7         4456 1951-01-01 2003-12-31            35   53.7267   10.8242
## 8         2680 1951-01-01 2011-12-31           289   50.2840   10.4456
## 9           73 1908-01-01 2023-03-31           340   48.6159   13.0506
## 10        6102 1999-04-01 2023-04-13           397   47.5396    9.6843
##            Stationsname             Bundesland        res
## 1                Soltau          Niedersachsen     hourly
## 2       Arnsberg-Neheim    Nordrhein-Westfalen      daily
## 3            Wangerooge          Niedersachsen      daily
## 4             Zehdenick            Brandenburg   1_minute
## 5                Kassel                 Hessen     annual
## 6         Wippra/Wipper         Sachsen-Anhalt      daily
## 7           Schlagsdorf Mecklenburg-Vorpommern    monthly
## 8     Koenigshofen, Bad                 Bayern    monthly
## 9  Aldersbach-Kriestorf                 Bayern    monthly
## 10         Lindau (SWN)                 Bayern 10_minutes
##                       var        per hasfile
## 1              cloudiness historical    TRUE
## 2       weather_phenomena     recent   FALSE
## 3  more_weather_phenomena historical    TRUE
## 4           precipitation historical    TRUE
## 5         climate_indices     recent   FALSE
## 6  more_weather_phenomena     recent   FALSE
## 7             more_precip     recent   FALSE
## 8       weather_phenomena historical    TRUE
## 9             more_precip historical    TRUE
## 10                   wind     recent    TRUE

At the moment of writing this data frame lists 147,719 data sets. Here, we focus only on weather stations located in East Germany. Further, we restrict our search to weather stations, which provide data on a monthly basis.

states <- c(
  "Sachsen", "Sachsen-Anhalt", "Mecklenburg-Vorpommern",
  "Brandenburg", "Thueringen", "Berlin"
)

resolution <- "monthly"
variables <- "kl"
period <- "historical"
df_station <- metaIndex[metaIndex$Bundesland %in% states &
  metaIndex$res == resolution &
  metaIndex$per == period &
  metaIndex$var == variables &
  metaIndex$hasfile == TRUE, ]

# number of weather stations
nrow(df_station)
## [1] 277

Based on our constraints we end up with a data set of 277 weather stations. The data provided by DWD is monthly data, however we are only interested in the mean annual rainfall and mean annual temperature for the period 1981–2010. Hence, we have to do some data preprocessing. In order to gain a better intuition we first showcase the processing pipeline for one example weather station. Later we automate this pipeline, to be applied on all 277 selected weather stations.


Download example data set

Here, we focus on the weather station Berlin-Tempelhof. Detailed information on the structure of the DWD data sets can be found here. Further information on the rdwd package can be found here.

The general procedure is outlined in five steps:

  1. Select the weather station of interest based on the station identifier (id), on the temporal resolution of the record (res), on the variable type (var) and on the observation period (per).

  2. Download the data using the functionality of the rdwd package.

  3. Check the data record for completeness.

  4. Extract the statistic of interest, which in our case is the mean annual temperature and mean annual rainfall for the period 1981–2010.

  5. Repeat step 1 to 4 for all stations of interest.


Step 1

We subset the weather station of interest from the catalog of DWD stations.

sample_station <- df_station[df_station$Stationsname == "Berlin-Tempelhof", ]
sample_station
##      Stations_id  von_datum  bis_datum Stationshoehe geoBreite geoLaenge
## 9959         433 1938-01-01 2023-03-31            48   52.4675   13.4021
##          Stationsname Bundesland     res var        per hasfile
## 9959 Berlin-Tempelhof     Berlin monthly  kl historical    TRUE


Exercise: Create a second subset for the weather station in Potsdam.

## Your code here...
Potsdam_station <- NULL
Show code
Potsdam_station <- df_station[df_station$Stationsname == "Potsdam", ]
Potsdam_station
##       Stations_id  von_datum  bis_datum Stationshoehe geoBreite geoLaenge
## 86030        3987 1893-01-01 2023-03-31            81   52.3812   13.0622
##       Stationsname  Bundesland     res var        per hasfile
## 86030      Potsdam Brandenburg monthly  kl historical    TRUE


Step 2

We download the data using the functionality of the rdwd package. First, we generate the link pointing to the file in the online database by applying the selectDWD() function.

link <- selectDWD(
  id = sample_station$Stations_id,
  res = sample_station$res,
  var = sample_station$var,
  per = sample_station$per
)
link
## [1] "ftp://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/monthly/kl/historical/monatswerte_KL_00433_19380101_20221231_hist.zip"

Then we download the file by applying the dataDWD() function. Once the file is downloaded, we apply the readDWD() function, which reads the file, processes it and returns it as a data.frame object.

tdir <- tempdir()
file <- dataDWD(link, read = FALSE, dir = tdir, quiet = TRUE, force = FALSE)
clim <- readDWD(file)
## Warning: knitr::knit -> process_file -> process_group -> process_group.block -> call_block -> block_exec ->  readDWD: R package 'data.table' available for fast reading of files, but system command 'unzip' could not be found. Now reading slowly.
## See   https://bookdown.org/brry/rdwd/fread.html
## Reading 1 file with readDWD.data() and fread=FALSE ...


Exercise: Download and read the data of the weather station in Potsdam.

## Your code here...
link2 <- NULL
tdir2 <- NULL
file2 <- NULL
clim_Potsdam <- NULL
Show code
link2 <- selectDWD(
  id = Potsdam_station$Stations_id,
  res = Potsdam_station$res,
  var = Potsdam_station$var,
  per = Potsdam_station$per
)
tdir2 <- tempdir()
file2 <- dataDWD(link2, read = FALSE, dir = tdir2, quiet = TRUE, force = FALSE)
clim_Potsdam <- readDWD(file2)
## Reading 1 file with readDWD.data() and fread=FALSE ...


Step 3

Now we check the data set for completeness by using standard R functions, such as summary(). A description of the variables is available here.

summary(clim)
##   STATIONS_ID    MESS_DATUM                      MESS_DATUM_BEGINN 
##  Min.   :433   Min.   :1938-01-15 00:00:00.000   Min.   :19380101  
##  1st Qu.:433   1st Qu.:1961-01-07 06:00:00.000   1st Qu.:19607876  
##  Median :433   Median :1981-08-30 12:00:00.000   Median :19810851  
##  Mean   :433   Mean   :1981-06-17 07:50:19.355   Mean   :19810289  
##  3rd Qu.:433   3rd Qu.:2002-04-22 12:00:00.000   3rd Qu.:20020426  
##  Max.   :433   Max.   :2022-12-15 00:00:00.000   Max.   :20221201  
##                                                                    
##  MESS_DATUM_ENDE         QN_4             MO_N           MO_TT       
##  Min.   :19380131   Min.   : 3.000   Min.   :2.640   Min.   :-9.400  
##  1st Qu.:19607906   1st Qu.: 5.000   1st Qu.:4.598   1st Qu.: 3.735  
##  Median :19810880   Median : 5.000   Median :5.260   Median : 9.540  
##  Mean   :19810318   Mean   : 7.236   Mean   :5.260   Mean   : 9.791  
##  3rd Qu.:20020455   3rd Qu.:10.000   3rd Qu.:5.973   3rd Qu.:16.265  
##  Max.   :20221231   Max.   :10.000   Max.   :7.440   Max.   :24.380  
##                                      NA's   :260                     
##      MO_TX            MO_TN             MO_FK           MX_TX      
##  Min.   :-6.000   Min.   :-12.700   Min.   :1.630   Min.   : 1.60  
##  1st Qu.: 6.237   1st Qu.:  0.995   1st Qu.:2.480   1st Qu.:13.70  
##  Median :13.770   Median :  5.585   Median :2.660   Median :22.15  
##  Mean   :13.567   Mean   :  5.932   Mean   :2.699   Mean   :21.56  
##  3rd Qu.:20.850   3rd Qu.: 11.655   3rd Qu.:2.870   3rd Qu.:29.40  
##  Max.   :30.300   Max.   : 17.870   Max.   :4.060   Max.   :38.50  
##  NA's   :8                          NA's   :107     NA's   :92     
##      MX_FX           MX_TN             MO_SD_S            QN_6       
##  Min.   :11.30   Min.   :-22.5000   Min.   :  8.00   Min.   : 1.000  
##  1st Qu.:16.60   1st Qu.: -5.3250   1st Qu.: 64.42   1st Qu.: 5.000  
##  Median :19.00   Median : -0.4000   Median :135.30   Median : 5.000  
##  Mean   :19.83   Mean   : -0.2738   Mean   :142.25   Mean   : 7.075  
##  3rd Qu.:22.60   3rd Qu.:  6.5250   3rd Qu.:210.65   3rd Qu.: 9.000  
##  Max.   :40.20   Max.   : 14.0000   Max.   :377.60   Max.   :10.000  
##  NA's   :388     NA's   :92         NA's   :372      NA's   :2       
##      MO_RR            MX_RS          eor     
##  Min.   :  0.60   Min.   :  0.400   eor:992  
##  1st Qu.: 27.23   1st Qu.:  7.225            
##  Median : 43.00   Median : 11.150            
##  Mean   : 47.73   Mean   : 13.874            
##  3rd Qu.: 61.60   3rd Qu.: 17.200            
##  Max.   :193.40   Max.   :119.500            
##  NA's   :2        NA's   :94

Further, we use the xts and lubridate packages for time series analysis and plot the monthly rainfall data.

library(xts)
library(lubridate)
# create time series object of monthly rainfall
rainfall_sample <- xts(
  x = clim$MO_RR,
  order.by = as.yearmon(ymd(clim$MESS_DATUM_BEGINN))
)
# plot the data
barplot(rainfall_sample,
  ylab = "Monthly rainfall",
  main = paste0(
    "DWD Weather Station ",
    sample_station$Stationsname,
    ", ",
    sample_station$Bundesland
  )
)

The rainfall time series at Berlin-Tempelhof consists of 992 monthly observations spanning a period from January 1938 to December 2022.

For our use case we are only interest in the period 1981 to 2010, hence we subset the data set and plot it for inspection.

barplot(rainfall_sample["1981/2010"],
  ylab = "Monthly rainfall",
  main = paste0(
    "DWD Weather Station ",
    sample_station$Stationsname,
    ", ",
    sample_station$Bundesland
  )
)


Exercise: Check for the weather station in Potsdam if the data for monthly sunshine duration is complete in the period 1981 to 2010.

## Your code here...
sunshine_Potsdam <- NULL
Show code
summary(clim_Potsdam)
##   STATIONS_ID     MESS_DATUM                      MESS_DATUM_BEGINN 
##  Min.   :3987   Min.   :1893-01-15 00:00:00.000   Min.   :18930101  
##  1st Qu.:3987   1st Qu.:1925-07-07 12:00:00.000   1st Qu.:19250676  
##  Median :3987   Median :1957-12-30 12:00:00.000   Median :19575651  
##  Mean   :3987   Mean   :1957-12-30 00:20:19.462   Mean   :19575651  
##  3rd Qu.:3987   3rd Qu.:1990-06-22 12:00:00.000   3rd Qu.:19900626  
##  Max.   :3987   Max.   :2022-12-15 00:00:00.000   Max.   :20221201  
##                                                                     
##  MESS_DATUM_ENDE         QN_4             MO_N           MO_TT        
##  Min.   :18930131   Min.   : 3.000   Min.   :2.530   Min.   :-10.860  
##  1st Qu.:19250706   1st Qu.: 5.000   1st Qu.:4.720   1st Qu.:  2.728  
##  Median :19575681   Median : 5.000   Median :5.370   Median :  8.770  
##  Mean   :19575680   Mean   : 6.138   Mean   :5.343   Mean   :  8.889  
##  3rd Qu.:19900655   3rd Qu.: 5.000   3rd Qu.:6.010   3rd Qu.: 15.335  
##  Max.   :20221231   Max.   :10.000   Max.   :7.550   Max.   : 23.560  
##                                      NA's   :27                       
##      MO_TX           MO_TN             MO_FK           MX_TX      
##  Min.   :-6.37   Min.   :-15.430   Min.   :1.880   Min.   : 0.60  
##  1st Qu.: 5.80   1st Qu.:  0.090   1st Qu.:2.723   1st Qu.:13.38  
##  Median :13.75   Median :  4.635   Median :2.950   Median :22.30  
##  Mean   :13.37   Mean   :  5.000   Mean   :3.014   Mean   :21.41  
##  3rd Qu.:20.89   3rd Qu.: 10.555   3rd Qu.:3.250   3rd Qu.:29.40  
##  Max.   :30.53   Max.   : 16.950   Max.   :4.790   Max.   :38.90  
##                                    NA's   :10                     
##      MX_FX           MX_TN            MO_SD_S            QN_6       
##  Min.   :12.50   Min.   :-26.800   Min.   : 11.30   Min.   : 1.000  
##  1st Qu.:18.30   1st Qu.: -6.600   1st Qu.: 69.33   1st Qu.: 5.000  
##  Median :21.00   Median : -1.500   Median :140.20   Median : 5.000  
##  Mean   :21.62   Mean   : -1.418   Mean   :144.33   Mean   : 6.322  
##  3rd Qu.:24.00   3rd Qu.:  5.500   3rd Qu.:212.30   3rd Qu.: 9.000  
##  Max.   :37.30   Max.   : 14.100   Max.   :366.60   Max.   :10.000  
##  NA's   :1017                      NA's   :2                        
##      MO_RR            MX_RS          eor      
##  Min.   :  0.20   Min.   :  0.200   eor:1560  
##  1st Qu.: 28.30   1st Qu.:  7.475             
##  Median : 43.15   Median : 11.300             
##  Mean   : 48.66   Mean   : 13.843             
##  3rd Qu.: 63.92   3rd Qu.: 16.800             
##  Max.   :202.30   Max.   :104.800             
## 
# create time series object of monthly rainfall
sunshine_Potsdam <- xts(
  x = clim_Potsdam$MO_SD_S,
  order.by = as.yearmon(ymd(clim_Potsdam$MESS_DATUM_BEGINN))
)
# plot the data
barplot(sunshine_Potsdam["1981/2010"],
  ylab = "Monthly sunshine duration",
  main = paste0(
    "DWD Weather Station ",
    Potsdam_station$Stationsname,
    ", ",
    Potsdam_station$Bundesland
  )
)


Step 4

Based on the visual inspection of the graph above we may conclude that the rainfall data is fairly complete. Next, we compute the mean annual rainfall by applying several functions from the dplyr package, which is part of the tidyverse ecosystem, such as mutate, group_by, filter and summarise.

annual_rainfall <- clim
annual_rainfall <- annual_rainfall %>%
  mutate(year = year(ymd(MESS_DATUM_BEGINN))) %>%
  group_by(year) %>%
  filter(year >= 1981 & year <= 2010) %>%
  summarise(
    # calculate the monthly rainfall for each year
    year_sum = sum(MO_RR, na.rm = TRUE)
  )

## plotting
ggplot(data = annual_rainfall, aes(x = year, y = year_sum)) +
  geom_bar(stat = "identity") +
  ggtitle(paste0(
    "Annual rainfall ",
    sample_station$Stationsname,
    ", ",
    sample_station$Bundesland
  ))

The graph shows the annual rainfall for the period 1981 to 2010. One final step is still pending: We need to take the average of the annual series to get the mean annual rainfall at the weather station Berlin-Tempelhof for the period 1981–2010.

## print result
print(paste0(
  "Mean annual rainfall at station ",
  sample_station$Stationsname,
  ": ",
  round(mean(annual_rainfall$year_sum)),
  " mm"
))
## [1] "Mean annual rainfall at station Berlin-Tempelhof: 577 mm"

Done!

Exercise: Compute the mean annual sunshine duration at the weather station in Potsdam for the period 1981–2010.

## Your code here...
annual_sunshine <- NULL
Show code
annual_sunshine <- clim_Potsdam
annual_sunshine <- annual_sunshine %>%
  mutate(year = year(ymd(MESS_DATUM_BEGINN))) %>%
  group_by(year) %>%
  filter(year >= 1981 & year <= 2010) %>%
  summarise(
    # calculate the monthly rainfall for each year
    year_sum = sum(MO_SD_S, na.rm = TRUE)
  )

## plotting
ggplot(data = annual_sunshine, aes(x = year, y = year_sum)) +
  geom_bar(stat = "identity") +
  ggtitle(paste0(
    "Annual sunshine duration ",
    Potsdam_station$Stationsname,
    ", ",
    Potsdam_station$Bundesland
  ))

## print result
print(paste0(
  "Mean annual sunshine duration at station ",
  Potsdam_station$Stationsname,
  ": ",
  round(mean(annual_sunshine$year_sum)),
  " h"
))
## [1] "Mean annual sunshine duration at station Potsdam: 1742 h"


Now, we are aware of the procedure to download data from the DWD and extract the statistic of interest. In the next section, we will automate this procedure to download all weather stations of interest and extract the statistic of interest for each of them.


Download data set of interest

In this section, we automate the procedure of downloading a data set from DWD and extracting a particular statistic of interest. Recall that the general procedure may be outlined in five steps:

  1. Select the weather station of interest.

  2. Download the data.

  3. Check the data record for completeness.

  4. Extract the statistic of interest.

  5. Repeat step 1–4 for all weather stations of interest.

In the previous, section we already defined a subset of weather stations of interest. The code below is copied from the previous section. Basically, we are interested in all weather stations in East Germany, which provide weather data on a monthly basis.

data(metaIndex)
states <- c(
  "Sachsen", "Sachsen-Anhalt", "Mecklenburg-Vorpommern",
  "Brandenburg", "Thueringen", "Berlin"
)

resolution <- "monthly"
variables <- "kl"
period <- "historical"
df_station <- metaIndex[metaIndex$Bundesland %in% states &
  metaIndex$res == resolution &
  metaIndex$per == period &
  metaIndex$var == variables &
  metaIndex$hasfile == TRUE, ]

The resulting data set consists of 277 weather stations.

In the next step we download each of these weather stations, check the data for completeness and extract the observational data for the period 1981–2010. Finally we compute the mean annual rainfall and the mean annual temperature and store this information in a data.frame object together with the weather station identifier (Station_id), the altitude (Altitude), the geographical coordinates (Lat and Lon), the name of the station (Name) and the name of the federal state Federal_State, the weather station is situated.

For our convenience we wrap the repetitive part of this procedure into a function, denoted as download_dwd_data. Go through the function defined below, read it line for line and try to figure out what is happening.

download_dwd_data <- function(df,
                              res = "monthly",
                              var = "kl",
                              per = "historical",
                              RUN = TRUE, verbose = TRUE,
                              forceDownload = FALSE) {
  ##############################################
  ### FUNCTION TO DOWNLOAD DATA FROM THE DWD ###
  ##############################################
  library(rdwd)
  library(tidyverse)
  library(lubridate)

  if (RUN) {
    ## COLUMNS TO KEEP
    df_column_names <- c(
      "Stations_id", "Stationshoehe",
      "geoBreite", "geoLaenge", "Stationsname",
      "Bundesland"
    )
    df_result <- df[, df_column_names]

    ## RENAME COLUMNS
    df_column_names_english <- c(
      "Station_id", "Altitude", "Lat", "Lon",
      "Name", "Federal_State"
    )
    colnames(df_result) <- df_column_names_english

    ## EMPTY COLUMNS TO STORE RESULTS
    df_result["Rainfall"] <- NA
    df_result["Temperature"] <- NA

    ## LOOP THROUGH STATIONS ##
    for (j in 1:length(df$Stations_id)) {
      if (verbose) {
        ## Write progress information to console
        print(paste(
          "Weather station", df[j, "Stationsname"], "--",
          j, "out of", length(df$Stations_id)
        ))
      }

      ## DOWNLOAD STATION DATA ##
      ## for detailed instructions type vignette("rdwd") into the console
      tdir <- tempdir()
      link <- selectDWD(
        id = df$Stations_id[j],
        res = res,
        var = var,
        per = per
      )
      file <- dataDWD(link,
        read = FALSE,
        dir = tdir,
        quiet = TRUE,
        force = forceDownload
      )
      clim <- readDWD(file)
      clim["Date"] <- ymd(clim$MESS_DATUM_BEGINN)

      ## LOOP THROUGH PARAMETERS ##
      params <- c(quo(MO_TT), quo(MO_RR))
      for (i in 1:length(params)) {
        # initial dataframe
        df_ <- clim %>%
          # generate one column and populate it with the year
          mutate(year = year(Date)) %>%
          # group the data by the new columns
          group_by(year) %>%
          # select only years between 1981 and 2010
          filter(year >= 1981 & year <= 2010) %>%
          summarise(
            # calculate the monthly sum for each year
            year_sum = sum(!!params[[i]], na.rm = TRUE),
            # calculate the monthly mean for each year
            year_mean = mean(!!params[[i]], na.rm = TRUE)
          ) %>%
          # Add column with the number of years in record
          mutate(year_count = n())

        ## CONTROL FLOW ##
        ## if there is no data for the period 81-10 continue
        if (identical(df_$year_count, integer(0))) {
          next
        } else {
          # if there is data, make sure we have data for each year
          if (mean(df_$year_count, na.rm = TRUE) == 30) {
            ## IF SO BUILD OUTPUT DATA FRAME ##
            ## Temperature
            if (any(grepl(pattern = "T", x = params[[i]])) == TRUE) {
              result <- round(mean(df_$year_mean, na.rm = TRUE), 1)
              df_result[j, "Temperature"] <- result
            } else {
              ## Rainfall
              result <- round(mean(df_$year_sum, na.rm = TRUE))
              df_result[j, "Rainfall"] <- result
            }
          }
        }
      }
    }
    df_result <- df_result[complete.cases(df_result), ]
    write.csv(
      x = df_result,
      file = "dwd_data_1981-2010.csv",
      row.names = FALSE
    )
    return(df_result)
  } else {
    ## DOWNLOAD PREPROCESSED DATA ##
    url <- "https://userpage.fu-berlin.de/soga/data/raw-data/"
    df_result <- read.csv(paste0(url, "dwd_data_1981-2010.csv"))

    print("########################################################")
    print("### NOTE THAT YOU ARE DOWNLOADING PREPROCESSED DATA ###")
    print("### THE ARGUMENTS TO THE FUNCTION CALL ARE DISMISSED ###")
    print("########################################################")
    return(df_result)
  }
}

For the ease of application we included the RUN argument in the function. If RUN = TRUE then the main function body will be executed, the data will be downloaded and processed accordingly. Depending on your internet connection this may take some time. However, if RUN = FALSE the function only downloads a prepared data.frame object fitting the purpose of this tutorial. This second option may be of interest, if you just want to follow along this tutorial. For sure this second approach will consume much less time.

df_result <- download_dwd_data(
  df = df_station,
  RUN = FALSE,
  verbose = TRUE,
  forceDownload = FALSE
)
## [1] "########################################################"
## [1] "### NOTE THAT YOU ARE DOWNLOADING PREPROCESSED DATA ###"
## [1] "### THE ARGUMENTS TO THE FUNCTION CALL ARE DISMISSED ###"
## [1] "########################################################"
str(df_result)
## 'data.frame':    71 obs. of  9 variables:
##  $ X            : int  3089 3198 3593 4017 4553 111897 8075 8141 8270 8947 ...
##  $ Station_id   : int  164 167 183 198 222 5825 399 400 403 430 ...
##  $ Altitude     : int  54 11 42 164 393 40 36 60 51 36 ...
##  $ Lat          : num  53 53.8 54.7 51.4 50.6 ...
##  $ Lon          : num  14 13.7 13.4 11.3 12.7 ...
##  $ Name         : chr  "Angermuende" "Anklam" "Arkona" "Artern" ...
##  $ Federal_State: chr  "Brandenburg" "Mecklenburg-Vorpommern" "Mecklenburg-Vorpommern" "Thueringen" ...
##  $ Rainfall     : int  521 562 546 490 805 515 459 579 591 550 ...
##  $ Temperature  : num  8.9 9 8.5 9.3 8.4 9.4 10.5 9.5 9.5 10 ...

The resulting data frame (df_result) contains data of 71 weather stations, with the features X, Station_id, Altitude, Lat, Lon, Name, Federal_State, Rainfall, Temperature.


Data review

Now that we downloaded the weather data of interest from DWD we may inspect the data set.

sample_n(df_result, 10)
##         X Station_id Altitude     Lat     Lon                        Name
## 1  108205       5629      105 51.8891 12.6446                  Wittenberg
## 2   50806       2618      732 50.8458 10.4803          Kleiner Inselsberg
## 3   39854       2044      404 51.6520 11.1366                  Harzgerode
## 4    8141        400       60 52.6310 13.5021                 Berlin-Buch
## 5   17215        853      416 50.7913 12.8720                    Chemnitz
## 6    9076        433       48 52.4675 13.4021            Berlin-Tempelhof
## 7   56924       2932      131 51.4347 12.2396               Leipzig/Halle
## 8   86515       4445      609 51.7658 10.6533        Wernigerode-Schierke
## 9   11734        591       45 53.3911 10.6878                  Boizenburg
## 10  78711       4036      204 51.3895 11.5412 Querfurt-Muehle Lodersleben
##             Federal_State Rainfall Temperature
## 1          Sachsen-Anhalt      567         9.4
## 2              Thueringen     1220         6.2
## 3          Sachsen-Anhalt      445         7.5
## 4                  Berlin      579         9.5
## 5                 Sachsen      731         8.5
## 6                  Berlin      577         9.9
## 7                 Sachsen      534         9.4
## 8          Sachsen-Anhalt     1362         5.8
## 9  Mecklenburg-Vorpommern      630         8.9
## 10         Sachsen-Anhalt      541         9.3

By applying the summary() function we print basic statistical measures of the variables Temperature and Rainfall.

summary(df_result[, c("Temperature", "Rainfall")])
##   Temperature        Rainfall     
##  Min.   : 3.500   Min.   : 317.0  
##  1st Qu.: 8.500   1st Qu.: 533.5  
##  Median : 9.100   Median : 578.0  
##  Mean   : 8.631   Mean   : 638.5  
##  3rd Qu.: 9.400   3rd Qu.: 639.0  
##  Max.   :10.500   Max.   :1879.0

By means of visual inspection, we can check for outliers or invalid data points.

library(GGally)
ggpairs(df_result[, c("Altitude", "Rainfall", "Temperature")],
  title = "",
  axisLabels = "show",
  columnLabels = c("Altitude", "Rainfall", "Temperature")
)

The graph looks fine. We notice the expected pattern: Rainfall tends to increase with altitude, whereas temperature tends to decrease with altitude.

Finally, we contextualize our data set by plotting the weather stations on an interactive map.

library(leaflet)
library(leafpop)
library(htmltools)

# plot
df_result %>% leaflet() %>% addTiles() %>% 
  addCircleMarkers(data = df_result, radius = 5,
    stroke = FALSE, fillOpacity = 0.5,
    popup = popupTable(df_result, zcol = 2:9)
  )

For the purpose of later usage we caste the spatial data sets to Spatial objects and save them to disk.

library(raster)
library(sf)

# Retrieve Federal States by the the getData() function from the raster package
germany <- getData(country = "Germany", level = 1)
germany_sf <- st_as_sf(germany)
# transform to UTM zone 33
germany_sf <- st_transform(germany_sf, 32633)
# simplify data
states <- c(
  "Sachsen", "Sachsen-Anhalt",
  "Mecklenburg-Vorpommern", "Brandenburg",
  "Thüringen", "Berlin"
)
east_germany_states <- germany_sf[germany_sf$NAME_1 %in% states, ]
east_germany <- east_germany_states %>%
  group_by(NAME_0) %>%
  summarize()

# create simple feature object from  data frame object
dwd_east_sf <- st_as_sf(df_result, coords = c("Lon", "Lat"), crs = 4326)
# transform to UTM zone 33
dwd_east_sf <- st_transform(dwd_east_sf, 32633)

# create sp objects
east_germany_sp <- as(east_germany, "Spatial")
east_germany_states_sp <- as(east_germany_states, "Spatial")
dwd_east_sp <- as(dwd_east_sf, "Spatial")

save(
  file = "East_Germany.RData",
  east_germany_sp, east_germany_states_sp, dwd_east_sp
)

Citation

The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us via mail by soga[at]zedat.fu-berlin.de.

Creative Commons License
You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follow: Hartmann, K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.