In the previous section we downloaded a preprocessed data set of weather station data for Germany. In this section however, we download and summarize weather data provided by Deutscher Wetterdienst (DWD) (German Weather Service) on our own.
The goal is to obtain a data set of annual mean rainfall and mean annual temperature data for the 30 year period 1981-2010 for weather stations located in East Germany.
The data is available on the DWD data portal. Here, we use the rdwd package, a package that simplifies
downloading data from the DWD.
First, we load the required package from CRAN.
library(rdwd)
library(tidyverse)
The rdwd package has nice features such as the
metaIndex, which provides an overview of the available data
and the data structure.
data(metaIndex)
sample_n(metaIndex, 10)
## Stations_id von_datum bis_datum Stationshoehe geoBreite geoLaenge
## 1 6099 2000-04-20 2023-06-15 449 47.7020 8.7318
## 2 5065 1937-01-01 2022-12-31 536 50.3936 11.8735
## 3 4450 1951-01-01 1993-12-31 40 52.6333 13.3833
## 4 1156 1931-01-01 1960-11-30 245 49.3833 7.3833
## 5 757 2003-03-25 2023-06-15 445 47.9625 7.9983
## 6 5369 1947-12-01 1993-12-31 443 48.0572 12.2202
## 7 15000 2011-04-01 2023-06-15 231 50.7983 6.0244
## 8 3164 1949-01-01 2011-12-31 187 50.8492 8.7745
## 9 14323 1891-01-01 2022-12-31 232 50.5218 7.7374
## 10 117 1937-01-01 2007-12-31 520 50.5605 10.7179
## Stationsname Bundesland res
## 1 Gailingen Baden-Wuerttemberg hourly
## 2 Toepen Bayern annual
## 3 Schildow Brandenburg annual
## 4 Eichelscheiderhof Rheinland-Pfalz monthly
## 5 Buchenbach Baden-Wuerttemberg 10_minutes
## 6 Wasserburg Bayern annual
## 7 Aachen-Orsbach Nordrhein-Westfalen hourly
## 8 Coelbe, Kr. Marburg-Biedenkopf Hessen monthly
## 9 Selters/Westerwald Rheinland-Pfalz annual
## 10 Altendambach Thueringen monthly
## var per hasfile
## 1 wind historical TRUE
## 2 more_precip historical TRUE
## 3 climate_indices recent FALSE
## 4 climate_indices historical TRUE
## 5 precipitation historical TRUE
## 6 kl historical TRUE
## 7 cloudiness historical TRUE
## 8 weather_phenomena recent FALSE
## 9 more_precip recent TRUE
## 10 more_precip historical TRUE
At the moment of writing this data frame lists 146,644 data sets. Here, we focus only on weather stations located in East Germany. Further, we restrict our search to weather stations, which provide data on a monthly basis.
states <- c(
"Sachsen", "Sachsen-Anhalt", "Mecklenburg-Vorpommern",
"Brandenburg", "Thueringen", "Berlin"
)
resolution <- "monthly"
variables <- "kl"
period <- "historical"
df_station <- metaIndex[metaIndex$Bundesland %in% states &
metaIndex$res == resolution &
metaIndex$per == period &
metaIndex$var == variables &
metaIndex$hasfile == TRUE, ]
# number of weather stations
nrow(df_station)
## [1] 279
Based on our constraints we end up with a data set of 279 weather stations. The data provided by DWD is monthly data, however we are only interested in the mean annual rainfall and mean annual temperature for the period 1981–2010. Hence, we have to do some data preprocessing. In order to gain a better intuition we first showcase the processing pipeline for one example weather station. Later we automate this pipeline, to be applied on all 279 selected weather stations.
Here, we focus on the weather station Berlin-Tempelhof.
Detailed information on the structure of the DWD data sets can be found
here. Further information on the rdwd
package can be found here.
The general procedure is outlined in five steps:
Select the weather station of interest based on the station identifier (id), on the temporal resolution of the record (res), on the variable type (var) and on the observation period (per).
Download the data using the functionality of the
rdwd package.
Check the data record for completeness.
Extract the statistic of interest, which in our case is the mean annual temperature and mean annual rainfall for the period 1981–2010.
Repeat step 1 to 4 for all stations of interest.
Step 1
We subset the weather station of interest from the catalog of DWD stations.
sample_station <- df_station[df_station$Stationsname == "Berlin-Tempelhof", ]
sample_station
## Stations_id von_datum bis_datum Stationshoehe geoBreite geoLaenge
## 9855 433 1938-01-01 2023-05-31 48 52.4675 13.4021
## Stationsname Bundesland res var per hasfile
## 9855 Berlin-Tempelhof Berlin monthly kl historical TRUE
Exercise: Create a second subset for the weather station in Potsdam.
## Your code here...
Potsdam_station <- NULL
Potsdam_station <- df_station[df_station$Stationsname == "Potsdam", ]
Potsdam_station
## Stations_id von_datum bis_datum Stationshoehe geoBreite geoLaenge
## 85031 3987 1893-01-01 2023-05-31 81 52.3812 13.0622
## Stationsname Bundesland res var per hasfile
## 85031 Potsdam Brandenburg monthly kl historical TRUE
Step 2
We download the data using the functionality of the rdwd
package. First, we generate the link pointing to the file in the online
database by applying the selectDWD() function.
The DWD often updates or expands datasets, and when that happens, the
filenames in the historical folders may change. By setting
current = TRUE, we ensure that we are pointing to the
latest version of the dataset, thus avoiding potential issues with
outdated filenames. Another option to address the issue of changing
filenames is to use rdwd::updateRdwd(). This will check the
version of the rdwd package and update it to the latest
development version on github. You can find more details about this
topic in Berry
Boessenkool’s documentation of rdwd.
link <- selectDWD(
id = sample_station$Stations_id,
res = sample_station$res,
var = sample_station$var,
per = sample_station$per,
current = TRUE
)
link
## [1] "ftp://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/monthly/kl/historical/monatswerte_KL_00433_19380101_20231231_hist.zip"
Then we download the file by applying the dataDWD()
function. Once the file is downloaded, we apply the
readDWD() function, which reads the file, processes it and
returns it as a data.frame object.
tdir <- tempdir()
file <- dataDWD(link, read = FALSE, dir = tdir, quiet = TRUE, force = FALSE)
clim <- readDWD(file)
## Reading 1 file with readDWD.data() and fread=TRUE ...
Exercise: Download and read the data of the weather station in Potsdam.
## Your code here...
link2 <- NULL
tdir2 <- NULL
file2 <- NULL
clim_Potsdam <- NULL
link2 <- selectDWD(
id = Potsdam_station$Stations_id,
res = Potsdam_station$res,
var = Potsdam_station$var,
per = Potsdam_station$per,
current = TRUE
)
## Determining the content of the 1 folder(s)...
## knitr::knit -> process_file -> xfun:::handle_error -> process_group -> call_block -> block_exec -> evaluate -> evaluate::evaluate -> withRestarts -> withRestartList -> withOneRestart -> doWithOneRestart -> withRestartList -> withOneRestart -> doWithOneRestart -> with_handlers -> eval -> withVisible -> eval -> selectDWD -> createIndex -> indexFTP -> dirDWD: adding to directory 'C:/Arbeit/soga/Soga-R/300/30900_geostatistics/DWDdata'
## knitr::knit -> process_file -> xfun:::handle_error -> process_group -> call_block -> block_exec -> evaluate -> evaluate::evaluate -> withRestarts -> withRestartList -> withOneRestart -> doWithOneRestart -> withRestartList -> withOneRestart -> doWithOneRestart -> with_handlers -> eval -> withVisible -> eval -> selectDWD -> createIndex -> indexFTP -> newFilename: creating 1 file (1 already existed for which '_n' is appended):'C:/Arbeit/soga/Soga-R/300/30900_geostatistics/DWDdata/INDEX_of_DWD_monthly_kl_historical_1.txt'
## Splitting filenames... (0 secs so far)
## Extracting station IDs + time range from filenames... (0 secs so far)
## Determining if files contain meta data... (0 secs so far)
tdir2 <- tempdir()
file2 <- dataDWD(link2, read = FALSE, dir = tdir2, quiet = TRUE, force = FALSE)
clim_Potsdam <- readDWD(file2)
## Reading 1 file with readDWD.data() and fread=TRUE ...
Step 3
Now we check the data set for completeness by using standard R
functions, such as summary(). A description of the
variables is available here.
summary(clim)
## STATIONS_ID MESS_DATUM MESS_DATUM_BEGINN
## Min. :433 Min. :1938-01-15 00:00:00.000 Min. :19380101
## 1st Qu.:433 1st Qu.:1961-04-07 06:00:00.000 1st Qu.:19610376
## Median :433 Median :1982-03-01 00:00:00.000 Median :19820251
## Mean :433 Mean :1981-12-17 19:56:10.518 Mean :19815313
## 3rd Qu.:433 3rd Qu.:2003-01-22 18:00:00.000 3rd Qu.:20030126
## Max. :433 Max. :2023-12-15 00:00:00.000 Max. :20231201
##
## MESS_DATUM_ENDE QN_4 MO_N MO_TT
## Min. :19380131 Min. : 3.000 Min. :2.640 Min. :-9.400
## 1st Qu.:19610405 1st Qu.: 5.000 1st Qu.:4.600 1st Qu.: 3.785
## Median :19820280 Median : 9.000 Median :5.270 Median : 9.540
## Mean :19815343 Mean : 7.335 Mean :5.272 Mean : 9.812
## 3rd Qu.:20030155 3rd Qu.:10.000 3rd Qu.:5.997 3rd Qu.:16.285
## Max. :20231231 Max. :10.000 Max. :7.440 Max. :24.380
## NA's :262
## MO_TX MO_TN MO_FK MX_TX
## Min. :-6.000 Min. :-12.700 Min. :1.630 Min. : 1.60
## 1st Qu.: 6.287 1st Qu.: 1.062 1st Qu.:2.480 1st Qu.:13.78
## Median :13.770 Median : 5.585 Median :2.660 Median :22.15
## Mean :13.592 Mean : 5.947 Mean :2.698 Mean :21.58
## 3rd Qu.:20.852 3rd Qu.: 11.675 3rd Qu.:2.870 3rd Qu.:29.40
## Max. :30.300 Max. : 17.870 Max. :4.060 Max. :38.50
## NA's :8 NA's :107 NA's :92
## MX_FX MX_TN MO_SD_S QN_6
## Min. :11.30 Min. :-22.5000 Min. : 8.00 Min. : 3.00
## 1st Qu.:16.65 1st Qu.: -5.4000 1st Qu.: 64.42 1st Qu.: 5.00
## Median :19.00 Median : -0.4000 Median :135.30 Median : 5.00
## Mean :19.85 Mean : -0.2711 Mean :142.25 Mean : 7.07
## 3rd Qu.:22.60 3rd Qu.: 6.6000 3rd Qu.:210.65 3rd Qu.: 9.00
## Max. :40.20 Max. : 14.0000 Max. :377.60 Max. :10.00
## NA's :393 NA's :92 NA's :384 NA's :2
## MO_RR MX_RS eor
## Min. : 0.60 Min. : 0.40 eor:1004
## 1st Qu.: 27.40 1st Qu.: 7.30
## Median : 43.20 Median : 11.20
## Mean : 47.82 Mean : 13.89
## 3rd Qu.: 62.30 3rd Qu.: 17.30
## Max. :193.40 Max. :119.50
## NA's :7 NA's :99
Further, we use the xts and lubridate
packages for time series analysis and plot the monthly rainfall
data.
library(xts)
library(lubridate)
# create time series object of monthly rainfall
rainfall_sample <- xts(
x = clim$MO_RR,
order.by = as.yearmon(ymd(clim$MESS_DATUM_BEGINN))
)
# plot the data
barplot(rainfall_sample,
ylab = "Monthly rainfall",
main = paste0(
"DWD Weather Station ",
sample_station$Stationsname,
", ",
sample_station$Bundesland
)
)
The rainfall time series at Berlin-Tempelhof consists of 1004 monthly observations spanning a period from January 1938 to December 2023.
For our use case we are only interest in the period 1981 to 2010, hence we subset the data set and plot it for inspection.
barplot(rainfall_sample["1981/2010"],
ylab = "Monthly rainfall",
main = paste0(
"DWD Weather Station ",
sample_station$Stationsname,
", ",
sample_station$Bundesland
)
)
Exercise: Check for the weather station in Potsdam if the data for monthly sunshine duration is complete in the period 1981 to 2010.
## Your code here...
sunshine_Potsdam <- NULL
summary(clim_Potsdam)
## STATIONS_ID MESS_DATUM MESS_DATUM_BEGINN
## Min. :3987 Min. :1893-01-15 00:00:00.000 Min. :18930101
## 1st Qu.:3987 1st Qu.:1925-10-07 12:00:00.000 1st Qu.:19250976
## Median :3987 Median :1958-06-30 00:00:00.000 Median :19580651
## Mean :3987 Mean :1958-06-30 15:16:01.832 Mean :19580651
## 3rd Qu.:3987 3rd Qu.:1991-03-22 18:00:00.000 3rd Qu.:19910326
## Max. :3987 Max. :2023-12-15 00:00:00.000 Max. :20231201
##
## MESS_DATUM_ENDE QN_4 MO_N MO_TT
## Min. :18930131 Min. : 3.000 Min. :2.530 Min. :-10.860
## 1st Qu.:19251006 1st Qu.: 5.000 1st Qu.:4.728 1st Qu.: 2.763
## Median :19580680 Median : 5.000 Median :5.375 Median : 8.770
## Mean :19580680 Mean : 6.202 Mean :5.349 Mean : 8.906
## 3rd Qu.:19910356 3rd Qu.: 5.000 3rd Qu.:6.020 3rd Qu.: 15.350
## Max. :20231231 Max. :10.000 Max. :7.550 Max. : 23.560
## NA's :28
## MO_TX MO_TN MO_FK MX_TX
## Min. :-6.370 Min. :-15.4300 Min. :1.880 Min. : 0.60
## 1st Qu.: 5.817 1st Qu.: 0.1175 1st Qu.:2.720 1st Qu.:13.40
## Median :13.750 Median : 4.6350 Median :2.950 Median :22.35
## Mean :13.388 Mean : 5.0152 Mean :3.013 Mean :21.43
## 3rd Qu.:20.907 3rd Qu.: 10.5725 3rd Qu.:3.250 3rd Qu.:29.40
## Max. :30.530 Max. : 16.9500 Max. :4.790 Max. :38.90
## NA's :11
## MX_FX MX_TN MO_SD_S QN_6
## Min. :12.50 Min. :-26.800 Min. : 11.30 Min. : 3.000
## 1st Qu.:18.30 1st Qu.: -6.600 1st Qu.: 69.47 1st Qu.: 5.000
## Median :21.00 Median : -1.500 Median :140.20 Median : 5.000
## Mean :21.62 Mean : -1.402 Mean :144.39 Mean : 6.324
## 3rd Qu.:24.00 3rd Qu.: 5.500 3rd Qu.:212.38 3rd Qu.: 9.000
## Max. :37.30 Max. : 14.100 Max. :366.60 Max. :10.000
## NA's :1018 NA's :2
## MO_RR MX_RS eor
## Min. : 0.20 Min. : 0.20 eor:1572
## 1st Qu.: 28.30 1st Qu.: 7.50
## Median : 43.20 Median : 11.40
## Mean : 48.80 Mean : 13.86
## 3rd Qu.: 64.22 3rd Qu.: 16.80
## Max. :202.30 Max. :104.80
##
# create time series object of monthly rainfall
sunshine_Potsdam <- xts(
x = clim_Potsdam$MO_SD_S,
order.by = as.yearmon(ymd(clim_Potsdam$MESS_DATUM_BEGINN))
)
# plot the data
barplot(sunshine_Potsdam["1981/2010"],
ylab = "Monthly sunshine duration",
main = paste0(
"DWD Weather Station ",
Potsdam_station$Stationsname,
", ",
Potsdam_station$Bundesland
)
)
Step 4
Based on the visual inspection of the graph above we may conclude
that the rainfall data is fairly complete. Next, we compute the mean
annual rainfall by applying several functions from the
dplyr package, which is part of the tidyverse
ecosystem, such as mutate, group_by,
filter and summarise.
annual_rainfall <- clim
annual_rainfall <- annual_rainfall %>%
mutate(year = year(ymd(MESS_DATUM_BEGINN))) %>%
group_by(year) %>%
filter(year >= 1981 & year <= 2010) %>%
summarise(
# calculate the monthly rainfall for each year
year_sum = sum(MO_RR, na.rm = TRUE)
)
## plotting
ggplot(data = annual_rainfall, aes(x = year, y = year_sum)) +
geom_bar(stat = "identity") +
ggtitle(paste0(
"Annual rainfall ",
sample_station$Stationsname,
", ",
sample_station$Bundesland
))
The graph shows the annual rainfall for the period 1981 to 2010. One final step is still pending: We need to take the average of the annual series to get the mean annual rainfall at the weather station Berlin-Tempelhof for the period 1981–2010.
## print result
print(paste0(
"Mean annual rainfall at station ",
sample_station$Stationsname,
": ",
round(mean(annual_rainfall$year_sum)),
" mm"
))
## [1] "Mean annual rainfall at station Berlin-Tempelhof: 577 mm"
Done!
Exercise: Compute the mean annual sunshine duration at the weather station in Potsdam for the period 1981–2010.
## Your code here...
annual_sunshine <- NULL
annual_sunshine <- clim_Potsdam
annual_sunshine <- annual_sunshine %>%
mutate(year = year(ymd(MESS_DATUM_BEGINN))) %>%
group_by(year) %>%
filter(year >= 1981 & year <= 2010) %>%
summarise(
# calculate the monthly rainfall for each year
year_sum = sum(MO_SD_S, na.rm = TRUE)
)
## plotting
ggplot(data = annual_sunshine, aes(x = year, y = year_sum)) +
geom_bar(stat = "identity") +
ggtitle(paste0(
"Annual sunshine duration ",
Potsdam_station$Stationsname,
", ",
Potsdam_station$Bundesland
))
## print result
print(paste0(
"Mean annual sunshine duration at station ",
Potsdam_station$Stationsname,
": ",
round(mean(annual_sunshine$year_sum)),
" h"
))
## [1] "Mean annual sunshine duration at station Potsdam: 1742 h"
Now, we are aware of the procedure to download data from the DWD and extract the statistic of interest. In the next section, we will automate this procedure to download all weather stations of interest and extract the statistic of interest for each of them.
In this section, we automate the procedure of downloading a data set from DWD and extracting a particular statistic of interest. Recall that the general procedure may be outlined in five steps:
Select the weather station of interest.
Download the data.
Check the data record for completeness.
Extract the statistic of interest.
Repeat step 1–4 for all weather stations of interest.
In the previous, section we already defined a subset of weather stations of interest. The code below is copied from the previous section. Basically, we are interested in all weather stations in East Germany, which provide weather data on a monthly basis.
data(metaIndex)
states <- c(
"Sachsen", "Sachsen-Anhalt", "Mecklenburg-Vorpommern",
"Brandenburg", "Thueringen", "Berlin"
)
resolution <- "monthly"
variables <- "kl"
period <- "historical"
df_station <- metaIndex[metaIndex$Bundesland %in% states &
metaIndex$res == resolution &
metaIndex$per == period &
metaIndex$var == variables &
metaIndex$hasfile == TRUE, ]
The resulting data set consists of 279 weather stations.
In the next step we download each of these weather stations, check
the data for completeness and extract the observational data for the
period 1981–2010. Finally we compute the mean annual rainfall and the
mean annual temperature and store this information in a
data.frame object together with the weather station
identifier (Station_id), the altitude
(Altitude), the geographical coordinates (Lat
and Lon), the name of the station (Name) and
the name of the federal state Federal_State, the weather
station is situated.
For our convenience we wrap the repetitive part of this procedure
into a function, denoted as download_dwd_data. Go through
the function defined below, read it line for line and try to figure out
what is happening.
download_dwd_data <- function(df,
res = "monthly",
var = "kl",
per = "historical",
RUN = TRUE, verbose = TRUE,
forceDownload = FALSE) {
##############################################
### FUNCTION TO DOWNLOAD DATA FROM THE DWD ###
##############################################
library(rdwd)
library(tidyverse)
library(lubridate)
if (RUN) {
## COLUMNS TO KEEP
df_column_names <- c(
"Stations_id", "Stationshoehe",
"geoBreite", "geoLaenge", "Stationsname",
"Bundesland"
)
df_result <- df[, df_column_names]
## RENAME COLUMNS
df_column_names_english <- c(
"Station_id", "Altitude", "Lat", "Lon",
"Name", "Federal_State"
)
colnames(df_result) <- df_column_names_english
## EMPTY COLUMNS TO STORE RESULTS
df_result["Rainfall"] <- NA
df_result["Temperature"] <- NA
## LOOP THROUGH STATIONS ##
for (j in 1:length(df$Stations_id)) {
if (verbose) {
## Write progress information to console
print(paste(
"Weather station", df[j, "Stationsname"], "--",
j, "out of", length(df$Stations_id)
))
}
## DOWNLOAD STATION DATA ##
## for detailed instructions type vignette("rdwd") into the console
tdir <- tempdir()
link <- selectDWD(
id = df$Stations_id[j],
res = res,
var = var,
per = per
)
file <- dataDWD(link,
read = FALSE,
dir = tdir,
quiet = TRUE,
force = forceDownload
)
clim <- readDWD(file)
clim["Date"] <- ymd(clim$MESS_DATUM_BEGINN)
## LOOP THROUGH PARAMETERS ##
params <- c(quo(MO_TT), quo(MO_RR))
for (i in 1:length(params)) {
# initial dataframe
df_ <- clim %>%
# generate one column and populate it with the year
mutate(year = year(Date)) %>%
# group the data by the new columns
group_by(year) %>%
# select only years between 1981 and 2010
filter(year >= 1981 & year <= 2010) %>%
summarise(
# calculate the monthly sum for each year
year_sum = sum(!!params[[i]], na.rm = TRUE),
# calculate the monthly mean for each year
year_mean = mean(!!params[[i]], na.rm = TRUE)
) %>%
# Add column with the number of years in record
mutate(year_count = n())
## CONTROL FLOW ##
## if there is no data for the period 81-10 continue
if (identical(df_$year_count, integer(0))) {
next
} else {
# if there is data, make sure we have data for each year
if (mean(df_$year_count, na.rm = TRUE) == 30) {
## IF SO BUILD OUTPUT DATA FRAME ##
## Temperature
if (any(grepl(pattern = "T", x = params[[i]])) == TRUE) {
result <- round(mean(df_$year_mean, na.rm = TRUE), 1)
df_result[j, "Temperature"] <- result
} else {
## Rainfall
result <- round(mean(df_$year_sum, na.rm = TRUE))
df_result[j, "Rainfall"] <- result
}
}
}
}
}
df_result <- df_result[complete.cases(df_result), ]
write.csv(
x = df_result,
file = "dwd_data_1981-2010.csv",
row.names = FALSE
)
return(df_result)
} else {
## DOWNLOAD PREPROCESSED DATA ##
url <- "https://userpage.fu-berlin.de/soga/data/raw-data/"
df_result <- read.csv(paste0(url, "dwd_data_1981-2010.csv"))
print("########################################################")
print("### NOTE THAT YOU ARE DOWNLOADING PREPROCESSED DATA ###")
print("### THE ARGUMENTS TO THE FUNCTION CALL ARE DISMISSED ###")
print("########################################################")
return(df_result)
}
}
For the ease of application we included the RUN argument
in the function. If RUN = TRUE then the main function body
will be executed, the data will be downloaded and processed accordingly.
Depending on your internet connection this may take some time. However,
if RUN = FALSE the function only downloads a prepared
data.frame object fitting the purpose of this tutorial.
This second option may be of interest, if you just want to follow along
this tutorial. For sure this second approach will consume much less
time.
df_result <- download_dwd_data(
df = df_station,
RUN = FALSE,
verbose = TRUE,
forceDownload = FALSE
)
## [1] "########################################################"
## [1] "### NOTE THAT YOU ARE DOWNLOADING PREPROCESSED DATA ###"
## [1] "### THE ARGUMENTS TO THE FUNCTION CALL ARE DISMISSED ###"
## [1] "########################################################"
str(df_result)
## 'data.frame': 71 obs. of 9 variables:
## $ X : int 3089 3198 3593 4017 4553 111897 8075 8141 8270 8947 ...
## $ Station_id : int 164 167 183 198 222 5825 399 400 403 430 ...
## $ Altitude : int 54 11 42 164 393 40 36 60 51 36 ...
## $ Lat : num 53 53.8 54.7 51.4 50.6 ...
## $ Lon : num 14 13.7 13.4 11.3 12.7 ...
## $ Name : chr "Angermuende" "Anklam" "Arkona" "Artern" ...
## $ Federal_State: chr "Brandenburg" "Mecklenburg-Vorpommern" "Mecklenburg-Vorpommern" "Thueringen" ...
## $ Rainfall : int 521 562 546 490 805 515 459 579 591 550 ...
## $ Temperature : num 8.9 9 8.5 9.3 8.4 9.4 10.5 9.5 9.5 10 ...
The resulting data frame (df_result) contains data of 71
weather stations, with the features X, Station_id, Altitude, Lat,
Lon, Name, Federal_State, Rainfall, Temperature.
Now that we downloaded the weather data of interest from DWD we may inspect the data set.
sample_n(df_result, 10)
## X Station_id Altitude Lat Lon Name
## 1 36958 1869 56 53.3154 13.9339 Gruenow
## 2 60891 3126 79 52.1029 11.5827 Magdeburg
## 3 59117 3034 496 50.4505 11.6350 Lobenstein, Bad
## 4 4553 222 393 50.5908 12.7139 Aue
## 5 94085 4878 504 51.6646 10.8811 Oberharz am Brocken-Stiege
## 6 96435 5009 33 53.7611 12.5574 Teterow
## 7 27638 1358 1213 50.4283 12.9536 Fichtelberg
## 8 3593 183 42 54.6791 13.4344 Arkona
## 9 32353 1605 35 52.3875 12.1602 Genthin
## 10 11925 596 15 54.0027 11.1908 Boltenhagen
## Federal_State Rainfall Temperature
## 1 Brandenburg 317 9.1
## 2 Sachsen-Anhalt 520 9.5
## 3 Thueringen 871 7.3
## 4 Sachsen 805 8.4
## 5 Sachsen-Anhalt 897 6.6
## 6 Mecklenburg-Vorpommern 471 8.4
## 7 Sachsen 1142 3.5
## 8 Mecklenburg-Vorpommern 546 8.5
## 9 Sachsen-Anhalt 449 9.1
## 10 Mecklenburg-Vorpommern 592 9.0
By applying the summary() function we print basic
statistical measures of the variables Temperature and
Rainfall.
summary(df_result[, c("Temperature", "Rainfall")])
## Temperature Rainfall
## Min. : 3.500 Min. : 317.0
## 1st Qu.: 8.500 1st Qu.: 533.5
## Median : 9.100 Median : 578.0
## Mean : 8.631 Mean : 638.5
## 3rd Qu.: 9.400 3rd Qu.: 639.0
## Max. :10.500 Max. :1879.0
By means of visual inspection, we can check for outliers or invalid data points.
library(GGally)
ggpairs(df_result[, c("Altitude", "Rainfall", "Temperature")],
title = "",
axisLabels = "show",
columnLabels = c("Altitude", "Rainfall", "Temperature")
)
The graph looks fine. We notice the expected pattern: Rainfall tends to increase with altitude, whereas temperature tends to decrease with altitude.
Finally, we contextualize our data set by plotting the weather stations on an interactive map.
library(leaflet)
library(leafpop)
library(htmltools)
# plot
df_result %>% leaflet() %>% addTiles() %>%
addCircleMarkers(data = df_result, radius = 5,
stroke = FALSE, fillOpacity = 0.5,
popup = popupTable(df_result, zcol = 2:9)
)
For the purpose of later usage we caste the spatial data sets to
Spatial objects and save them to disk.
library(sf)
library(geodata)
library(dplyr)
# Bundesländer Deutschlands mit geodata laden (Level 1 = Bundesländer)
germany <- geodata::gadm(country = "DEU", level = 1, path = tempdir())
# In sf-Objekt umwandeln
germany_sf <- st_as_sf(germany)
# In UTM Zone 33N transformieren (EPSG: 32633)
germany_sf <- st_transform(germany_sf, 32633)
# Auswahl der ostdeutschen Bundesländer
states <- c(
"Sachsen", "Sachsen-Anhalt",
"Mecklenburg-Vorpommern", "Brandenburg",
"Thüringen", "Berlin"
)
east_germany_states <- germany_sf[germany_sf$NAME_1 %in% states, ]
# Ostdeutschland als eine Fläche zusammenfassen
east_germany <- east_germany_states %>%
group_by(COUNTRY) %>%
summarize()
# Datensatz mit Wetterstationen als sf-Objekt erzeugen (Koordinaten-Spalten: Lon, Lat)
dwd_east_sf <- st_as_sf(df_result, coords = c("Lon", "Lat"), crs = 4326)
# In UTM Zone 33N transformieren
dwd_east_sf <- st_transform(dwd_east_sf, 32633)
# sf-Objekte in sp-Objekte konvertieren (für alte Kompatibilität)
east_germany_sp <- as(east_germany, "Spatial")
east_germany_states_sp <- as(east_germany_states, "Spatial")
dwd_east_sp <- as(dwd_east_sf, "Spatial")
# Objekte speichern
save(
file = "East_Germany.RData",
east_germany_sp, east_germany_states_sp, dwd_east_sp
)
Citation
The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us via mail by soga[at]zedat.fu-berlin.de.
Please cite as follow: Hartmann, K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.