In the previous section we downloaded a preprocessed data set of weather station data for Germany. In this section however, we download and summarize weather data provided by Deutscher Wetterdienst (DWD) (German Weather Service) on our own.
The goal is to obtain a data set of annual mean rainfall and mean annual temperature data for the 30 year period 1981-2010 for weather stations located in East Germany.
The data is available on the DWD data portal. Here, we use the rdwd
package, a package that simplifies
downloading data from the DWD.
First, we load the required package from CRAN.
library(rdwd)
library(tidyverse)
The rdwd
package has nice features such as the
metaIndex
, which provides an overview of the available data
and the data structure.
data(metaIndex)
sample_n(metaIndex, 10)
## Stations_id von_datum bis_datum Stationshoehe geoBreite geoLaenge
## 1 3086 1996-07-02 2023-06-16 15 53.8025 10.6989
## 2 5111 2005-12-20 2023-06-16 560 48.0311 12.5396
## 3 2370 1947-01-01 2006-12-31 290 50.7638 11.6153
## 4 3388 1994-02-01 1997-01-31 542 48.0919 11.6486
## 5 5167 1978-01-01 2023-05-31 505 48.1342 12.2731
## 6 2618 1991-07-24 2011-12-31 732 50.8458 10.4803
## 7 16085 1960-07-01 1961-12-31 8 53.5874 8.6011
## 8 2704 2004-05-01 2023-06-16 76 51.7511 12.0094
## 9 2701 1945-07-01 2007-03-31 136 51.1329 11.7252
## 10 5371 1949-01-01 2023-06-15 920 50.4973 9.9427
## Stationsname Bundesland res var
## 1 Luebeck-Blankensee Schleswig-Holstein 10_minutes solar
## 2 Trostberg Bayern 10_minutes air_temperature
## 3 Hummelshain Thueringen daily more_precip
## 4 Muenchen-Perlach Bayern daily more_weather_phenomena
## 5 Unterreit-Wagenstatt Bayern monthly climate_indices
## 6 Kleiner Inselsberg Thueringen subdaily soil
## 7 Leherheide Bremen subdaily cloudiness
## 8 Koethen (Anhalt) Sachsen-Anhalt hourly cloud_type
## 9 Koesen, Bad Sachsen-Anhalt daily weather_phenomena
## 10 Wasserkuppe Hessen hourly wind_synop
## per hasfile
## 1 recent TRUE
## 2 historical TRUE
## 3 recent FALSE
## 4 recent FALSE
## 5 recent TRUE
## 6 recent FALSE
## 7 historical TRUE
## 8 historical TRUE
## 9 historical TRUE
## 10 recent TRUE
At the moment of writing this data frame lists 146,644 data sets. Here, we focus only on weather stations located in East Germany. Further, we restrict our search to weather stations, which provide data on a monthly basis.
states <- c(
"Sachsen", "Sachsen-Anhalt", "Mecklenburg-Vorpommern",
"Brandenburg", "Thueringen", "Berlin"
)
resolution <- "monthly"
variables <- "kl"
period <- "historical"
df_station <- metaIndex[metaIndex$Bundesland %in% states &
metaIndex$res == resolution &
metaIndex$per == period &
metaIndex$var == variables &
metaIndex$hasfile == TRUE, ]
# number of weather stations
nrow(df_station)
## [1] 279
Based on our constraints we end up with a data set of 279 weather stations. The data provided by DWD is monthly data, however we are only interested in the mean annual rainfall and mean annual temperature for the period 1981–2010. Hence, we have to do some data preprocessing. In order to gain a better intuition we first showcase the processing pipeline for one example weather station. Later we automate this pipeline, to be applied on all 279 selected weather stations.
Here, we focus on the weather station Berlin-Tempelhof.
Detailed information on the structure of the DWD data sets can be found
here. Further information on the rdwd
package can be found here.
The general procedure is outlined in five steps:
Select the weather station of interest based on the station identifier (id), on the temporal resolution of the record (res), on the variable type (var) and on the observation period (per).
Download the data using the functionality of the
rdwd
package.
Check the data record for completeness.
Extract the statistic of interest, which in our case is the mean annual temperature and mean annual rainfall for the period 1981–2010.
Repeat step 1 to 4 for all stations of interest.
Step 1
We subset the weather station of interest from the catalog of DWD stations.
sample_station <- df_station[df_station$Stationsname == "Berlin-Tempelhof", ]
sample_station
## Stations_id von_datum bis_datum Stationshoehe geoBreite geoLaenge
## 9855 433 1938-01-01 2023-05-31 48 52.4675 13.4021
## Stationsname Bundesland res var per hasfile
## 9855 Berlin-Tempelhof Berlin monthly kl historical TRUE
Exercise: Create a second subset for the weather station in Potsdam.
## Your code here...
Potsdam_station <- NULL
Potsdam_station <- df_station[df_station$Stationsname == "Potsdam", ]
Potsdam_station
## Stations_id von_datum bis_datum Stationshoehe geoBreite geoLaenge
## 85031 3987 1893-01-01 2023-05-31 81 52.3812 13.0622
## Stationsname Bundesland res var per hasfile
## 85031 Potsdam Brandenburg monthly kl historical TRUE
Step 2
We download the data using the functionality of the rdwd
package. First, we generate the link pointing to the file in the online
database by applying the selectDWD()
function.
The DWD often updates or expands datasets, and when that happens, the
filenames in the historical folders may change. By setting
current = TRUE
, we ensure that we are pointing to the
latest version of the dataset, thus avoiding potential issues with
outdated filenames. Another option to address the issue of changing
filenames is to use rdwd::updateRdwd()
. This will check the
version of the rdwd
package and update it to the latest
development version on github. You can find more details about this
topic in Berry
Boessenkool’s documentation of rdwd.
link <- selectDWD(
id = sample_station$Stations_id,
res = sample_station$res,
var = sample_station$var,
per = sample_station$per,
current = TRUE
)
link
## [1] "ftp://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/monthly/kl/historical/monatswerte_KL_00433_19380101_20231231_hist.zip"
Then we download the file by applying the dataDWD()
function. Once the file is downloaded, we apply the
readDWD()
function, which reads the file, processes it and
returns it as a data.frame
object.
tdir <- tempdir()
file <- dataDWD(link, read = FALSE, dir = tdir, quiet = TRUE, force = FALSE)
clim <- readDWD(file)
## Warning: knitr::knit -> process_file -> xfun:::handle_error -> process_group -> call_block -> block_exec -> readDWD: R package 'data.table' available for fast reading of files, but system command 'unzip' could not be found. Now reading slowly.
## See https://bookdown.org/brry/rdwd/fread.html
## Reading 1 file with readDWD.data() and fread=FALSE ...
Exercise: Download and read the data of the weather station in Potsdam.
## Your code here...
link2 <- NULL
tdir2 <- NULL
file2 <- NULL
clim_Potsdam <- NULL
link2 <- selectDWD(
id = Potsdam_station$Stations_id,
res = Potsdam_station$res,
var = Potsdam_station$var,
per = Potsdam_station$per,
current = TRUE
)
## Determining the content of the 1 folder(s)...
## knitr::knit -> process_file -> xfun:::handle_error -> process_group -> call_block -> block_exec -> selectDWD -> createIndex -> indexFTP -> dirDWD: adding to directory 'C:/Users/sarah/OneDrive/Documents/UNI/Tutorium/GitHub/soga/Soga-R/300/30900_geostatistics/DWDdata'
## knitr::knit -> process_file -> xfun:::handle_error -> process_group -> call_block -> block_exec -> selectDWD -> createIndex -> indexFTP -> newFilename: creating 1 file (1 already existed for which '_n' is appended):'C:/Users/sarah/OneDrive/Documents/UNI/Tutorium/GitHub/soga/Soga-R/300/30900_geostatistics/DWDdata/INDEX_of_DWD_monthly_kl_historical_31.txt'
## Splitting filenames... (0 secs so far)
## Extracting station IDs + time range from filenames... (0 secs so far)
## Determining if files contain meta data... (0 secs so far)
tdir2 <- tempdir()
file2 <- dataDWD(link2, read = FALSE, dir = tdir2, quiet = TRUE, force = FALSE)
clim_Potsdam <- readDWD(file2)
## Reading 1 file with readDWD.data() and fread=FALSE ...
Step 3
Now we check the data set for completeness by using standard R
functions, such as summary()
. A description of the
variables is available here.
summary(clim)
## STATIONS_ID MESS_DATUM MESS_DATUM_BEGINN
## Min. :433 Min. :1938-01-15 00:00:00.000 Min. :19380101
## 1st Qu.:433 1st Qu.:1961-04-07 06:00:00.000 1st Qu.:19610376
## Median :433 Median :1982-03-01 00:00:00.000 Median :19820251
## Mean :433 Mean :1981-12-17 19:56:10.518 Mean :19815313
## 3rd Qu.:433 3rd Qu.:2003-01-22 18:00:00.000 3rd Qu.:20030126
## Max. :433 Max. :2023-12-15 00:00:00.000 Max. :20231201
##
## MESS_DATUM_ENDE QN_4 MO_N MO_TT
## Min. :19380131 Min. : 3.000 Min. :2.640 Min. :-9.400
## 1st Qu.:19610405 1st Qu.: 5.000 1st Qu.:4.600 1st Qu.: 3.785
## Median :19820280 Median : 9.000 Median :5.270 Median : 9.540
## Mean :19815343 Mean : 7.335 Mean :5.272 Mean : 9.812
## 3rd Qu.:20030155 3rd Qu.:10.000 3rd Qu.:5.997 3rd Qu.:16.285
## Max. :20231231 Max. :10.000 Max. :7.440 Max. :24.380
## NA's :262
## MO_TX MO_TN MO_FK MX_TX
## Min. :-6.000 Min. :-12.700 Min. :1.630 Min. : 1.60
## 1st Qu.: 6.287 1st Qu.: 1.062 1st Qu.:2.480 1st Qu.:13.78
## Median :13.770 Median : 5.585 Median :2.660 Median :22.15
## Mean :13.592 Mean : 5.947 Mean :2.698 Mean :21.58
## 3rd Qu.:20.852 3rd Qu.: 11.675 3rd Qu.:2.870 3rd Qu.:29.40
## Max. :30.300 Max. : 17.870 Max. :4.060 Max. :38.50
## NA's :8 NA's :107 NA's :92
## MX_FX MX_TN MO_SD_S QN_6
## Min. :11.30 Min. :-22.5000 Min. : 8.00 Min. : 3.00
## 1st Qu.:16.65 1st Qu.: -5.4000 1st Qu.: 64.42 1st Qu.: 5.00
## Median :19.00 Median : -0.4000 Median :135.30 Median : 5.00
## Mean :19.85 Mean : -0.2711 Mean :142.25 Mean : 7.07
## 3rd Qu.:22.60 3rd Qu.: 6.6000 3rd Qu.:210.65 3rd Qu.: 9.00
## Max. :40.20 Max. : 14.0000 Max. :377.60 Max. :10.00
## NA's :393 NA's :92 NA's :384 NA's :2
## MO_RR MX_RS eor
## Min. : 0.60 Min. : 0.40 eor:1004
## 1st Qu.: 27.40 1st Qu.: 7.30
## Median : 43.20 Median : 11.20
## Mean : 47.82 Mean : 13.89
## 3rd Qu.: 62.30 3rd Qu.: 17.30
## Max. :193.40 Max. :119.50
## NA's :7 NA's :99
Further, we use the xts
and lubridate
packages for time series analysis and plot the monthly rainfall
data.
library(xts)
library(lubridate)
# create time series object of monthly rainfall
rainfall_sample <- xts(
x = clim$MO_RR,
order.by = as.yearmon(ymd(clim$MESS_DATUM_BEGINN))
)
# plot the data
barplot(rainfall_sample,
ylab = "Monthly rainfall",
main = paste0(
"DWD Weather Station ",
sample_station$Stationsname,
", ",
sample_station$Bundesland
)
)
The rainfall time series at Berlin-Tempelhof consists of 1004 monthly observations spanning a period from January 1938 to December 2023.
For our use case we are only interest in the period 1981 to 2010, hence we subset the data set and plot it for inspection.
barplot(rainfall_sample["1981/2010"],
ylab = "Monthly rainfall",
main = paste0(
"DWD Weather Station ",
sample_station$Stationsname,
", ",
sample_station$Bundesland
)
)
Exercise: Check for the weather station in Potsdam if the data for monthly sunshine duration is complete in the period 1981 to 2010.
## Your code here...
sunshine_Potsdam <- NULL
summary(clim_Potsdam)
## STATIONS_ID MESS_DATUM MESS_DATUM_BEGINN
## Min. :3987 Min. :1893-01-15 00:00:00.000 Min. :18930101
## 1st Qu.:3987 1st Qu.:1925-10-07 12:00:00.000 1st Qu.:19250976
## Median :3987 Median :1958-06-30 00:00:00.000 Median :19580651
## Mean :3987 Mean :1958-06-30 15:16:01.832 Mean :19580651
## 3rd Qu.:3987 3rd Qu.:1991-03-22 18:00:00.000 3rd Qu.:19910326
## Max. :3987 Max. :2023-12-15 00:00:00.000 Max. :20231201
##
## MESS_DATUM_ENDE QN_4 MO_N MO_TT
## Min. :18930131 Min. : 3.000 Min. :2.530 Min. :-10.860
## 1st Qu.:19251006 1st Qu.: 5.000 1st Qu.:4.728 1st Qu.: 2.763
## Median :19580680 Median : 5.000 Median :5.375 Median : 8.770
## Mean :19580680 Mean : 6.202 Mean :5.349 Mean : 8.906
## 3rd Qu.:19910356 3rd Qu.: 5.000 3rd Qu.:6.020 3rd Qu.: 15.350
## Max. :20231231 Max. :10.000 Max. :7.550 Max. : 23.560
## NA's :28
## MO_TX MO_TN MO_FK MX_TX
## Min. :-6.370 Min. :-15.4300 Min. :1.880 Min. : 0.60
## 1st Qu.: 5.817 1st Qu.: 0.1175 1st Qu.:2.720 1st Qu.:13.40
## Median :13.750 Median : 4.6350 Median :2.950 Median :22.35
## Mean :13.388 Mean : 5.0152 Mean :3.013 Mean :21.43
## 3rd Qu.:20.907 3rd Qu.: 10.5725 3rd Qu.:3.250 3rd Qu.:29.40
## Max. :30.530 Max. : 16.9500 Max. :4.790 Max. :38.90
## NA's :11
## MX_FX MX_TN MO_SD_S QN_6
## Min. :12.50 Min. :-26.800 Min. : 11.30 Min. : 3.000
## 1st Qu.:18.30 1st Qu.: -6.600 1st Qu.: 69.47 1st Qu.: 5.000
## Median :21.00 Median : -1.500 Median :140.20 Median : 5.000
## Mean :21.62 Mean : -1.402 Mean :144.39 Mean : 6.324
## 3rd Qu.:24.00 3rd Qu.: 5.500 3rd Qu.:212.38 3rd Qu.: 9.000
## Max. :37.30 Max. : 14.100 Max. :366.60 Max. :10.000
## NA's :1018 NA's :2
## MO_RR MX_RS eor
## Min. : 0.20 Min. : 0.20 eor:1572
## 1st Qu.: 28.30 1st Qu.: 7.50
## Median : 43.20 Median : 11.40
## Mean : 48.80 Mean : 13.86
## 3rd Qu.: 64.22 3rd Qu.: 16.80
## Max. :202.30 Max. :104.80
##
# create time series object of monthly rainfall
sunshine_Potsdam <- xts(
x = clim_Potsdam$MO_SD_S,
order.by = as.yearmon(ymd(clim_Potsdam$MESS_DATUM_BEGINN))
)
# plot the data
barplot(sunshine_Potsdam["1981/2010"],
ylab = "Monthly sunshine duration",
main = paste0(
"DWD Weather Station ",
Potsdam_station$Stationsname,
", ",
Potsdam_station$Bundesland
)
)
Step 4
Based on the visual inspection of the graph above we may conclude
that the rainfall data is fairly complete. Next, we compute the mean
annual rainfall by applying several functions from the
dplyr
package, which is part of the tidyverse
ecosystem, such as mutate
, group_by
,
filter
and summarise
.
annual_rainfall <- clim
annual_rainfall <- annual_rainfall %>%
mutate(year = year(ymd(MESS_DATUM_BEGINN))) %>%
group_by(year) %>%
filter(year >= 1981 & year <= 2010) %>%
summarise(
# calculate the monthly rainfall for each year
year_sum = sum(MO_RR, na.rm = TRUE)
)
## plotting
ggplot(data = annual_rainfall, aes(x = year, y = year_sum)) +
geom_bar(stat = "identity") +
ggtitle(paste0(
"Annual rainfall ",
sample_station$Stationsname,
", ",
sample_station$Bundesland
))
The graph shows the annual rainfall for the period 1981 to 2010. One final step is still pending: We need to take the average of the annual series to get the mean annual rainfall at the weather station Berlin-Tempelhof for the period 1981–2010.
## print result
print(paste0(
"Mean annual rainfall at station ",
sample_station$Stationsname,
": ",
round(mean(annual_rainfall$year_sum)),
" mm"
))
## [1] "Mean annual rainfall at station Berlin-Tempelhof: 577 mm"
Done!
Exercise: Compute the mean annual sunshine duration at the weather station in Potsdam for the period 1981–2010.
## Your code here...
annual_sunshine <- NULL
annual_sunshine <- clim_Potsdam
annual_sunshine <- annual_sunshine %>%
mutate(year = year(ymd(MESS_DATUM_BEGINN))) %>%
group_by(year) %>%
filter(year >= 1981 & year <= 2010) %>%
summarise(
# calculate the monthly rainfall for each year
year_sum = sum(MO_SD_S, na.rm = TRUE)
)
## plotting
ggplot(data = annual_sunshine, aes(x = year, y = year_sum)) +
geom_bar(stat = "identity") +
ggtitle(paste0(
"Annual sunshine duration ",
Potsdam_station$Stationsname,
", ",
Potsdam_station$Bundesland
))
## print result
print(paste0(
"Mean annual sunshine duration at station ",
Potsdam_station$Stationsname,
": ",
round(mean(annual_sunshine$year_sum)),
" h"
))
## [1] "Mean annual sunshine duration at station Potsdam: 1742 h"
Now, we are aware of the procedure to download data from the DWD and extract the statistic of interest. In the next section, we will automate this procedure to download all weather stations of interest and extract the statistic of interest for each of them.
In this section, we automate the procedure of downloading a data set from DWD and extracting a particular statistic of interest. Recall that the general procedure may be outlined in five steps:
Select the weather station of interest.
Download the data.
Check the data record for completeness.
Extract the statistic of interest.
Repeat step 1–4 for all weather stations of interest.
In the previous, section we already defined a subset of weather stations of interest. The code below is copied from the previous section. Basically, we are interested in all weather stations in East Germany, which provide weather data on a monthly basis.
data(metaIndex)
states <- c(
"Sachsen", "Sachsen-Anhalt", "Mecklenburg-Vorpommern",
"Brandenburg", "Thueringen", "Berlin"
)
resolution <- "monthly"
variables <- "kl"
period <- "historical"
df_station <- metaIndex[metaIndex$Bundesland %in% states &
metaIndex$res == resolution &
metaIndex$per == period &
metaIndex$var == variables &
metaIndex$hasfile == TRUE, ]
The resulting data set consists of 279 weather stations.
In the next step we download each of these weather stations, check
the data for completeness and extract the observational data for the
period 1981–2010. Finally we compute the mean annual rainfall and the
mean annual temperature and store this information in a
data.frame
object together with the weather station
identifier (Station_id
), the altitude
(Altitude
), the geographical coordinates (Lat
and Lon
), the name of the station (Name
) and
the name of the federal state Federal_State
, the weather
station is situated.
For our convenience we wrap the repetitive part of this procedure
into a function, denoted as download_dwd_data
. Go through
the function defined below, read it line for line and try to figure out
what is happening.
download_dwd_data <- function(df,
res = "monthly",
var = "kl",
per = "historical",
RUN = TRUE, verbose = TRUE,
forceDownload = FALSE) {
##############################################
### FUNCTION TO DOWNLOAD DATA FROM THE DWD ###
##############################################
library(rdwd)
library(tidyverse)
library(lubridate)
if (RUN) {
## COLUMNS TO KEEP
df_column_names <- c(
"Stations_id", "Stationshoehe",
"geoBreite", "geoLaenge", "Stationsname",
"Bundesland"
)
df_result <- df[, df_column_names]
## RENAME COLUMNS
df_column_names_english <- c(
"Station_id", "Altitude", "Lat", "Lon",
"Name", "Federal_State"
)
colnames(df_result) <- df_column_names_english
## EMPTY COLUMNS TO STORE RESULTS
df_result["Rainfall"] <- NA
df_result["Temperature"] <- NA
## LOOP THROUGH STATIONS ##
for (j in 1:length(df$Stations_id)) {
if (verbose) {
## Write progress information to console
print(paste(
"Weather station", df[j, "Stationsname"], "--",
j, "out of", length(df$Stations_id)
))
}
## DOWNLOAD STATION DATA ##
## for detailed instructions type vignette("rdwd") into the console
tdir <- tempdir()
link <- selectDWD(
id = df$Stations_id[j],
res = res,
var = var,
per = per
)
file <- dataDWD(link,
read = FALSE,
dir = tdir,
quiet = TRUE,
force = forceDownload
)
clim <- readDWD(file)
clim["Date"] <- ymd(clim$MESS_DATUM_BEGINN)
## LOOP THROUGH PARAMETERS ##
params <- c(quo(MO_TT), quo(MO_RR))
for (i in 1:length(params)) {
# initial dataframe
df_ <- clim %>%
# generate one column and populate it with the year
mutate(year = year(Date)) %>%
# group the data by the new columns
group_by(year) %>%
# select only years between 1981 and 2010
filter(year >= 1981 & year <= 2010) %>%
summarise(
# calculate the monthly sum for each year
year_sum = sum(!!params[[i]], na.rm = TRUE),
# calculate the monthly mean for each year
year_mean = mean(!!params[[i]], na.rm = TRUE)
) %>%
# Add column with the number of years in record
mutate(year_count = n())
## CONTROL FLOW ##
## if there is no data for the period 81-10 continue
if (identical(df_$year_count, integer(0))) {
next
} else {
# if there is data, make sure we have data for each year
if (mean(df_$year_count, na.rm = TRUE) == 30) {
## IF SO BUILD OUTPUT DATA FRAME ##
## Temperature
if (any(grepl(pattern = "T", x = params[[i]])) == TRUE) {
result <- round(mean(df_$year_mean, na.rm = TRUE), 1)
df_result[j, "Temperature"] <- result
} else {
## Rainfall
result <- round(mean(df_$year_sum, na.rm = TRUE))
df_result[j, "Rainfall"] <- result
}
}
}
}
}
df_result <- df_result[complete.cases(df_result), ]
write.csv(
x = df_result,
file = "dwd_data_1981-2010.csv",
row.names = FALSE
)
return(df_result)
} else {
## DOWNLOAD PREPROCESSED DATA ##
url <- "https://userpage.fu-berlin.de/soga/data/raw-data/"
df_result <- read.csv(paste0(url, "dwd_data_1981-2010.csv"))
print("########################################################")
print("### NOTE THAT YOU ARE DOWNLOADING PREPROCESSED DATA ###")
print("### THE ARGUMENTS TO THE FUNCTION CALL ARE DISMISSED ###")
print("########################################################")
return(df_result)
}
}
For the ease of application we included the RUN
argument
in the function. If RUN = TRUE
then the main function body
will be executed, the data will be downloaded and processed accordingly.
Depending on your internet connection this may take some time. However,
if RUN = FALSE
the function only downloads a prepared
data.frame
object fitting the purpose of this tutorial.
This second option may be of interest, if you just want to follow along
this tutorial. For sure this second approach will consume much less
time.
df_result <- download_dwd_data(
df = df_station,
RUN = FALSE,
verbose = TRUE,
forceDownload = FALSE
)
## [1] "########################################################"
## [1] "### NOTE THAT YOU ARE DOWNLOADING PREPROCESSED DATA ###"
## [1] "### THE ARGUMENTS TO THE FUNCTION CALL ARE DISMISSED ###"
## [1] "########################################################"
str(df_result)
## 'data.frame': 71 obs. of 9 variables:
## $ X : int 3089 3198 3593 4017 4553 111897 8075 8141 8270 8947 ...
## $ Station_id : int 164 167 183 198 222 5825 399 400 403 430 ...
## $ Altitude : int 54 11 42 164 393 40 36 60 51 36 ...
## $ Lat : num 53 53.8 54.7 51.4 50.6 ...
## $ Lon : num 14 13.7 13.4 11.3 12.7 ...
## $ Name : chr "Angermuende" "Anklam" "Arkona" "Artern" ...
## $ Federal_State: chr "Brandenburg" "Mecklenburg-Vorpommern" "Mecklenburg-Vorpommern" "Thueringen" ...
## $ Rainfall : int 521 562 546 490 805 515 459 579 591 550 ...
## $ Temperature : num 8.9 9 8.5 9.3 8.4 9.4 10.5 9.5 9.5 10 ...
The resulting data frame (df_result
) contains data of 71
weather stations, with the features X, Station_id, Altitude, Lat,
Lon, Name, Federal_State, Rainfall, Temperature.
Now that we downloaded the weather data of interest from DWD we may inspect the data set.
sample_n(df_result, 10)
## X Station_id Altitude Lat Lon Name
## 1 59117 3034 496 50.4505 11.6350 Lobenstein, Bad
## 2 11925 596 15 54.0027 11.1908 Boltenhagen
## 3 69605 3552 38 52.9037 12.8072 Neuruppin
## 4 36958 1869 56 53.3154 13.9339 Gruenow
## 5 98261 5097 10 54.0661 12.7675 Tribsees
## 6 94085 4878 504 51.6646 10.8811 Oberharz am Brocken-Stiege
## 7 108205 5629 105 51.8891 12.6446 Wittenberg
## 8 104386 5424 328 51.0177 11.3544 Weimar-Schoendorf
## 9 56754 2928 138 51.3151 12.4462 Leipzig-Holzhausen
## 10 3089 164 54 53.0316 13.9908 Angermuende
## Federal_State Rainfall Temperature
## 1 Thueringen 871 7.3
## 2 Mecklenburg-Vorpommern 592 9.0
## 3 Brandenburg 535 9.2
## 4 Brandenburg 317 9.1
## 5 Mecklenburg-Vorpommern 627 8.5
## 6 Sachsen-Anhalt 897 6.6
## 7 Sachsen-Anhalt 567 9.4
## 8 Thueringen 579 8.9
## 9 Sachsen 629 9.7
## 10 Brandenburg 521 8.9
By applying the summary()
function we print basic
statistical measures of the variables Temperature
and
Rainfall
.
summary(df_result[, c("Temperature", "Rainfall")])
## Temperature Rainfall
## Min. : 3.500 Min. : 317.0
## 1st Qu.: 8.500 1st Qu.: 533.5
## Median : 9.100 Median : 578.0
## Mean : 8.631 Mean : 638.5
## 3rd Qu.: 9.400 3rd Qu.: 639.0
## Max. :10.500 Max. :1879.0
By means of visual inspection, we can check for outliers or invalid data points.
library(GGally)
ggpairs(df_result[, c("Altitude", "Rainfall", "Temperature")],
title = "",
axisLabels = "show",
columnLabels = c("Altitude", "Rainfall", "Temperature")
)
The graph looks fine. We notice the expected pattern: Rainfall tends to increase with altitude, whereas temperature tends to decrease with altitude.
Finally, we contextualize our data set by plotting the weather stations on an interactive map.
library(leaflet)
library(leafpop)
library(htmltools)
# plot
df_result %>% leaflet() %>% addTiles() %>%
addCircleMarkers(data = df_result, radius = 5,
stroke = FALSE, fillOpacity = 0.5,
popup = popupTable(df_result, zcol = 2:9)
)
For the purpose of later usage we caste the spatial data sets to
Spatial
objects and save them to disk.
library(raster)
library(sf)
# Retrieve Federal States by the the getData() function from the raster package
germany <- getData(country = "Germany", level = 1)
germany_sf <- st_as_sf(germany)
# transform to UTM zone 33
germany_sf <- st_transform(germany_sf, 32633)
# simplify data
states <- c(
"Sachsen", "Sachsen-Anhalt",
"Mecklenburg-Vorpommern", "Brandenburg",
"Thüringen", "Berlin"
)
east_germany_states <- germany_sf[germany_sf$NAME_1 %in% states, ]
east_germany <- east_germany_states %>%
group_by(NAME_0) %>%
summarize()
# create simple feature object from data frame object
dwd_east_sf <- st_as_sf(df_result, coords = c("Lon", "Lat"), crs = 4326)
# transform to UTM zone 33
dwd_east_sf <- st_transform(dwd_east_sf, 32633)
# create sp objects
east_germany_sp <- as(east_germany, "Spatial")
east_germany_states_sp <- as(east_germany_states, "Spatial")
dwd_east_sp <- as(dwd_east_sf, "Spatial")
save(
file = "East_Germany.RData",
east_germany_sp, east_germany_states_sp, dwd_east_sp
)
Citation
The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us via mail by soga[at]zedat.fu-berlin.de.
Please cite as follow: Hartmann, K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.