In the previous section we downloaded a preprocessed data set of weather station data for Germany. In this section however, we download and summarize weather data provided by Deutscher Wetterdienst (DWD) (German Weather Service) on our own.
The goal is to obtain a data set of annual mean rainfall and mean annual temperature data for the 30 year period 1981-2010 for weather stations located in East Germany.
The data is available on the DWD data portal. Here, we use the rdwd
package, a package that simplifies
downloading data from the DWD.
First, we load the required package from CRAN.
library(rdwd)
library(tidyverse)
The rdwd
package has nice features such as the
metaIndex
, which provides an overview of the available data
and the data structure.
data(metaIndex)
sample_n(metaIndex, 10)
## Stations_id von_datum bis_datum Stationshoehe geoBreite geoLaenge
## 1 4745 1958-06-01 2023-04-13 75 52.9604 9.7930
## 2 7330 2005-10-01 2009-09-30 159 51.4633 7.9780
## 3 5344 1960-01-01 2006-12-31 2 53.7865 7.9096
## 4 5745 2004-04-19 2023-01-04 51 52.9663 13.3268
## 5 2532 1880-01-01 2012-12-31 231 51.2963 9.4424
## 6 5621 1991-01-01 2000-05-31 300 51.5662 11.2067
## 7 4456 1951-01-01 2003-12-31 35 53.7267 10.8242
## 8 2680 1951-01-01 2011-12-31 289 50.2840 10.4456
## 9 73 1908-01-01 2023-03-31 340 48.6159 13.0506
## 10 6102 1999-04-01 2023-04-13 397 47.5396 9.6843
## Stationsname Bundesland res
## 1 Soltau Niedersachsen hourly
## 2 Arnsberg-Neheim Nordrhein-Westfalen daily
## 3 Wangerooge Niedersachsen daily
## 4 Zehdenick Brandenburg 1_minute
## 5 Kassel Hessen annual
## 6 Wippra/Wipper Sachsen-Anhalt daily
## 7 Schlagsdorf Mecklenburg-Vorpommern monthly
## 8 Koenigshofen, Bad Bayern monthly
## 9 Aldersbach-Kriestorf Bayern monthly
## 10 Lindau (SWN) Bayern 10_minutes
## var per hasfile
## 1 cloudiness historical TRUE
## 2 weather_phenomena recent FALSE
## 3 more_weather_phenomena historical TRUE
## 4 precipitation historical TRUE
## 5 climate_indices recent FALSE
## 6 more_weather_phenomena recent FALSE
## 7 more_precip recent FALSE
## 8 weather_phenomena historical TRUE
## 9 more_precip historical TRUE
## 10 wind recent TRUE
At the moment of writing this data frame lists 147,719 data sets. Here, we focus only on weather stations located in East Germany. Further, we restrict our search to weather stations, which provide data on a monthly basis.
states <- c(
"Sachsen", "Sachsen-Anhalt", "Mecklenburg-Vorpommern",
"Brandenburg", "Thueringen", "Berlin"
)
resolution <- "monthly"
variables <- "kl"
period <- "historical"
df_station <- metaIndex[metaIndex$Bundesland %in% states &
metaIndex$res == resolution &
metaIndex$per == period &
metaIndex$var == variables &
metaIndex$hasfile == TRUE, ]
# number of weather stations
nrow(df_station)
## [1] 277
Based on our constraints we end up with a data set of 277 weather stations. The data provided by DWD is monthly data, however we are only interested in the mean annual rainfall and mean annual temperature for the period 1981–2010. Hence, we have to do some data preprocessing. In order to gain a better intuition we first showcase the processing pipeline for one example weather station. Later we automate this pipeline, to be applied on all 277 selected weather stations.
Here, we focus on the weather station Berlin-Tempelhof.
Detailed information on the structure of the DWD data sets can be found
here. Further information on the rdwd
package can be found here.
The general procedure is outlined in five steps:
Select the weather station of interest based on the station identifier (id), on the temporal resolution of the record (res), on the variable type (var) and on the observation period (per).
Download the data using the functionality of the
rdwd
package.
Check the data record for completeness.
Extract the statistic of interest, which in our case is the mean annual temperature and mean annual rainfall for the period 1981–2010.
Repeat step 1 to 4 for all stations of interest.
Step 1
We subset the weather station of interest from the catalog of DWD stations.
sample_station <- df_station[df_station$Stationsname == "Berlin-Tempelhof", ]
sample_station
## Stations_id von_datum bis_datum Stationshoehe geoBreite geoLaenge
## 9959 433 1938-01-01 2023-03-31 48 52.4675 13.4021
## Stationsname Bundesland res var per hasfile
## 9959 Berlin-Tempelhof Berlin monthly kl historical TRUE
Exercise: Create a second subset for the weather station in Potsdam.
## Your code here...
Potsdam_station <- NULL
Potsdam_station <- df_station[df_station$Stationsname == "Potsdam", ]
Potsdam_station
## Stations_id von_datum bis_datum Stationshoehe geoBreite geoLaenge
## 86030 3987 1893-01-01 2023-03-31 81 52.3812 13.0622
## Stationsname Bundesland res var per hasfile
## 86030 Potsdam Brandenburg monthly kl historical TRUE
Step 2
We download the data using the functionality of the rdwd
package. First, we generate the link pointing to the file in the online
database by applying the selectDWD()
function.
link <- selectDWD(
id = sample_station$Stations_id,
res = sample_station$res,
var = sample_station$var,
per = sample_station$per
)
link
## [1] "ftp://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/monthly/kl/historical/monatswerte_KL_00433_19380101_20221231_hist.zip"
Then we download the file by applying the dataDWD()
function. Once the file is downloaded, we apply the
readDWD()
function, which reads the file, processes it and
returns it as a data.frame
object.
tdir <- tempdir()
file <- dataDWD(link, read = FALSE, dir = tdir, quiet = TRUE, force = FALSE)
clim <- readDWD(file)
## Warning: knitr::knit -> process_file -> process_group -> process_group.block -> call_block -> block_exec -> readDWD: R package 'data.table' available for fast reading of files, but system command 'unzip' could not be found. Now reading slowly.
## See https://bookdown.org/brry/rdwd/fread.html
## Reading 1 file with readDWD.data() and fread=FALSE ...
Exercise: Download and read the data of the weather station in Potsdam.
## Your code here...
link2 <- NULL
tdir2 <- NULL
file2 <- NULL
clim_Potsdam <- NULL
link2 <- selectDWD(
id = Potsdam_station$Stations_id,
res = Potsdam_station$res,
var = Potsdam_station$var,
per = Potsdam_station$per
)
tdir2 <- tempdir()
file2 <- dataDWD(link2, read = FALSE, dir = tdir2, quiet = TRUE, force = FALSE)
clim_Potsdam <- readDWD(file2)
## Reading 1 file with readDWD.data() and fread=FALSE ...
Step 3
Now we check the data set for completeness by using standard R
functions, such as summary()
. A description of the
variables is available here.
summary(clim)
## STATIONS_ID MESS_DATUM MESS_DATUM_BEGINN
## Min. :433 Min. :1938-01-15 00:00:00.000 Min. :19380101
## 1st Qu.:433 1st Qu.:1961-01-07 06:00:00.000 1st Qu.:19607876
## Median :433 Median :1981-08-30 12:00:00.000 Median :19810851
## Mean :433 Mean :1981-06-17 07:50:19.355 Mean :19810289
## 3rd Qu.:433 3rd Qu.:2002-04-22 12:00:00.000 3rd Qu.:20020426
## Max. :433 Max. :2022-12-15 00:00:00.000 Max. :20221201
##
## MESS_DATUM_ENDE QN_4 MO_N MO_TT
## Min. :19380131 Min. : 3.000 Min. :2.640 Min. :-9.400
## 1st Qu.:19607906 1st Qu.: 5.000 1st Qu.:4.598 1st Qu.: 3.735
## Median :19810880 Median : 5.000 Median :5.260 Median : 9.540
## Mean :19810318 Mean : 7.236 Mean :5.260 Mean : 9.791
## 3rd Qu.:20020455 3rd Qu.:10.000 3rd Qu.:5.973 3rd Qu.:16.265
## Max. :20221231 Max. :10.000 Max. :7.440 Max. :24.380
## NA's :260
## MO_TX MO_TN MO_FK MX_TX
## Min. :-6.000 Min. :-12.700 Min. :1.630 Min. : 1.60
## 1st Qu.: 6.237 1st Qu.: 0.995 1st Qu.:2.480 1st Qu.:13.70
## Median :13.770 Median : 5.585 Median :2.660 Median :22.15
## Mean :13.567 Mean : 5.932 Mean :2.699 Mean :21.56
## 3rd Qu.:20.850 3rd Qu.: 11.655 3rd Qu.:2.870 3rd Qu.:29.40
## Max. :30.300 Max. : 17.870 Max. :4.060 Max. :38.50
## NA's :8 NA's :107 NA's :92
## MX_FX MX_TN MO_SD_S QN_6
## Min. :11.30 Min. :-22.5000 Min. : 8.00 Min. : 1.000
## 1st Qu.:16.60 1st Qu.: -5.3250 1st Qu.: 64.42 1st Qu.: 5.000
## Median :19.00 Median : -0.4000 Median :135.30 Median : 5.000
## Mean :19.83 Mean : -0.2738 Mean :142.25 Mean : 7.075
## 3rd Qu.:22.60 3rd Qu.: 6.5250 3rd Qu.:210.65 3rd Qu.: 9.000
## Max. :40.20 Max. : 14.0000 Max. :377.60 Max. :10.000
## NA's :388 NA's :92 NA's :372 NA's :2
## MO_RR MX_RS eor
## Min. : 0.60 Min. : 0.400 eor:992
## 1st Qu.: 27.23 1st Qu.: 7.225
## Median : 43.00 Median : 11.150
## Mean : 47.73 Mean : 13.874
## 3rd Qu.: 61.60 3rd Qu.: 17.200
## Max. :193.40 Max. :119.500
## NA's :2 NA's :94
Further, we use the xts
and lubridate
packages for time series analysis and plot the monthly rainfall
data.
library(xts)
library(lubridate)
# create time series object of monthly rainfall
rainfall_sample <- xts(
x = clim$MO_RR,
order.by = as.yearmon(ymd(clim$MESS_DATUM_BEGINN))
)
# plot the data
barplot(rainfall_sample,
ylab = "Monthly rainfall",
main = paste0(
"DWD Weather Station ",
sample_station$Stationsname,
", ",
sample_station$Bundesland
)
)
The rainfall time series at Berlin-Tempelhof consists of 992 monthly observations spanning a period from January 1938 to December 2022.
For our use case we are only interest in the period 1981 to 2010, hence we subset the data set and plot it for inspection.
barplot(rainfall_sample["1981/2010"],
ylab = "Monthly rainfall",
main = paste0(
"DWD Weather Station ",
sample_station$Stationsname,
", ",
sample_station$Bundesland
)
)
Exercise: Check for the weather station in Potsdam if the data for monthly sunshine duration is complete in the period 1981 to 2010.
## Your code here...
sunshine_Potsdam <- NULL
summary(clim_Potsdam)
## STATIONS_ID MESS_DATUM MESS_DATUM_BEGINN
## Min. :3987 Min. :1893-01-15 00:00:00.000 Min. :18930101
## 1st Qu.:3987 1st Qu.:1925-07-07 12:00:00.000 1st Qu.:19250676
## Median :3987 Median :1957-12-30 12:00:00.000 Median :19575651
## Mean :3987 Mean :1957-12-30 00:20:19.462 Mean :19575651
## 3rd Qu.:3987 3rd Qu.:1990-06-22 12:00:00.000 3rd Qu.:19900626
## Max. :3987 Max. :2022-12-15 00:00:00.000 Max. :20221201
##
## MESS_DATUM_ENDE QN_4 MO_N MO_TT
## Min. :18930131 Min. : 3.000 Min. :2.530 Min. :-10.860
## 1st Qu.:19250706 1st Qu.: 5.000 1st Qu.:4.720 1st Qu.: 2.728
## Median :19575681 Median : 5.000 Median :5.370 Median : 8.770
## Mean :19575680 Mean : 6.138 Mean :5.343 Mean : 8.889
## 3rd Qu.:19900655 3rd Qu.: 5.000 3rd Qu.:6.010 3rd Qu.: 15.335
## Max. :20221231 Max. :10.000 Max. :7.550 Max. : 23.560
## NA's :27
## MO_TX MO_TN MO_FK MX_TX
## Min. :-6.37 Min. :-15.430 Min. :1.880 Min. : 0.60
## 1st Qu.: 5.80 1st Qu.: 0.090 1st Qu.:2.723 1st Qu.:13.38
## Median :13.75 Median : 4.635 Median :2.950 Median :22.30
## Mean :13.37 Mean : 5.000 Mean :3.014 Mean :21.41
## 3rd Qu.:20.89 3rd Qu.: 10.555 3rd Qu.:3.250 3rd Qu.:29.40
## Max. :30.53 Max. : 16.950 Max. :4.790 Max. :38.90
## NA's :10
## MX_FX MX_TN MO_SD_S QN_6
## Min. :12.50 Min. :-26.800 Min. : 11.30 Min. : 1.000
## 1st Qu.:18.30 1st Qu.: -6.600 1st Qu.: 69.33 1st Qu.: 5.000
## Median :21.00 Median : -1.500 Median :140.20 Median : 5.000
## Mean :21.62 Mean : -1.418 Mean :144.33 Mean : 6.322
## 3rd Qu.:24.00 3rd Qu.: 5.500 3rd Qu.:212.30 3rd Qu.: 9.000
## Max. :37.30 Max. : 14.100 Max. :366.60 Max. :10.000
## NA's :1017 NA's :2
## MO_RR MX_RS eor
## Min. : 0.20 Min. : 0.200 eor:1560
## 1st Qu.: 28.30 1st Qu.: 7.475
## Median : 43.15 Median : 11.300
## Mean : 48.66 Mean : 13.843
## 3rd Qu.: 63.92 3rd Qu.: 16.800
## Max. :202.30 Max. :104.800
##
# create time series object of monthly rainfall
sunshine_Potsdam <- xts(
x = clim_Potsdam$MO_SD_S,
order.by = as.yearmon(ymd(clim_Potsdam$MESS_DATUM_BEGINN))
)
# plot the data
barplot(sunshine_Potsdam["1981/2010"],
ylab = "Monthly sunshine duration",
main = paste0(
"DWD Weather Station ",
Potsdam_station$Stationsname,
", ",
Potsdam_station$Bundesland
)
)
Step 4
Based on the visual inspection of the graph above we may conclude
that the rainfall data is fairly complete. Next, we compute the mean
annual rainfall by applying several functions from the
dplyr
package, which is part of the tidyverse
ecosystem, such as mutate
, group_by
,
filter
and summarise
.
annual_rainfall <- clim
annual_rainfall <- annual_rainfall %>%
mutate(year = year(ymd(MESS_DATUM_BEGINN))) %>%
group_by(year) %>%
filter(year >= 1981 & year <= 2010) %>%
summarise(
# calculate the monthly rainfall for each year
year_sum = sum(MO_RR, na.rm = TRUE)
)
## plotting
ggplot(data = annual_rainfall, aes(x = year, y = year_sum)) +
geom_bar(stat = "identity") +
ggtitle(paste0(
"Annual rainfall ",
sample_station$Stationsname,
", ",
sample_station$Bundesland
))
The graph shows the annual rainfall for the period 1981 to 2010. One final step is still pending: We need to take the average of the annual series to get the mean annual rainfall at the weather station Berlin-Tempelhof for the period 1981–2010.
## print result
print(paste0(
"Mean annual rainfall at station ",
sample_station$Stationsname,
": ",
round(mean(annual_rainfall$year_sum)),
" mm"
))
## [1] "Mean annual rainfall at station Berlin-Tempelhof: 577 mm"
Done!
Exercise: Compute the mean annual sunshine duration at the weather station in Potsdam for the period 1981–2010.
## Your code here...
annual_sunshine <- NULL
annual_sunshine <- clim_Potsdam
annual_sunshine <- annual_sunshine %>%
mutate(year = year(ymd(MESS_DATUM_BEGINN))) %>%
group_by(year) %>%
filter(year >= 1981 & year <= 2010) %>%
summarise(
# calculate the monthly rainfall for each year
year_sum = sum(MO_SD_S, na.rm = TRUE)
)
## plotting
ggplot(data = annual_sunshine, aes(x = year, y = year_sum)) +
geom_bar(stat = "identity") +
ggtitle(paste0(
"Annual sunshine duration ",
Potsdam_station$Stationsname,
", ",
Potsdam_station$Bundesland
))
## print result
print(paste0(
"Mean annual sunshine duration at station ",
Potsdam_station$Stationsname,
": ",
round(mean(annual_sunshine$year_sum)),
" h"
))
## [1] "Mean annual sunshine duration at station Potsdam: 1742 h"
Now, we are aware of the procedure to download data from the DWD and extract the statistic of interest. In the next section, we will automate this procedure to download all weather stations of interest and extract the statistic of interest for each of them.
In this section, we automate the procedure of downloading a data set from DWD and extracting a particular statistic of interest. Recall that the general procedure may be outlined in five steps:
Select the weather station of interest.
Download the data.
Check the data record for completeness.
Extract the statistic of interest.
Repeat step 1–4 for all weather stations of interest.
In the previous, section we already defined a subset of weather stations of interest. The code below is copied from the previous section. Basically, we are interested in all weather stations in East Germany, which provide weather data on a monthly basis.
data(metaIndex)
states <- c(
"Sachsen", "Sachsen-Anhalt", "Mecklenburg-Vorpommern",
"Brandenburg", "Thueringen", "Berlin"
)
resolution <- "monthly"
variables <- "kl"
period <- "historical"
df_station <- metaIndex[metaIndex$Bundesland %in% states &
metaIndex$res == resolution &
metaIndex$per == period &
metaIndex$var == variables &
metaIndex$hasfile == TRUE, ]
The resulting data set consists of 277 weather stations.
In the next step we download each of these weather stations, check
the data for completeness and extract the observational data for the
period 1981–2010. Finally we compute the mean annual rainfall and the
mean annual temperature and store this information in a
data.frame
object together with the weather station
identifier (Station_id
), the altitude
(Altitude
), the geographical coordinates (Lat
and Lon
), the name of the station (Name
) and
the name of the federal state Federal_State
, the weather
station is situated.
For our convenience we wrap the repetitive part of this procedure
into a function, denoted as download_dwd_data
. Go through
the function defined below, read it line for line and try to figure out
what is happening.
download_dwd_data <- function(df,
res = "monthly",
var = "kl",
per = "historical",
RUN = TRUE, verbose = TRUE,
forceDownload = FALSE) {
##############################################
### FUNCTION TO DOWNLOAD DATA FROM THE DWD ###
##############################################
library(rdwd)
library(tidyverse)
library(lubridate)
if (RUN) {
## COLUMNS TO KEEP
df_column_names <- c(
"Stations_id", "Stationshoehe",
"geoBreite", "geoLaenge", "Stationsname",
"Bundesland"
)
df_result <- df[, df_column_names]
## RENAME COLUMNS
df_column_names_english <- c(
"Station_id", "Altitude", "Lat", "Lon",
"Name", "Federal_State"
)
colnames(df_result) <- df_column_names_english
## EMPTY COLUMNS TO STORE RESULTS
df_result["Rainfall"] <- NA
df_result["Temperature"] <- NA
## LOOP THROUGH STATIONS ##
for (j in 1:length(df$Stations_id)) {
if (verbose) {
## Write progress information to console
print(paste(
"Weather station", df[j, "Stationsname"], "--",
j, "out of", length(df$Stations_id)
))
}
## DOWNLOAD STATION DATA ##
## for detailed instructions type vignette("rdwd") into the console
tdir <- tempdir()
link <- selectDWD(
id = df$Stations_id[j],
res = res,
var = var,
per = per
)
file <- dataDWD(link,
read = FALSE,
dir = tdir,
quiet = TRUE,
force = forceDownload
)
clim <- readDWD(file)
clim["Date"] <- ymd(clim$MESS_DATUM_BEGINN)
## LOOP THROUGH PARAMETERS ##
params <- c(quo(MO_TT), quo(MO_RR))
for (i in 1:length(params)) {
# initial dataframe
df_ <- clim %>%
# generate one column and populate it with the year
mutate(year = year(Date)) %>%
# group the data by the new columns
group_by(year) %>%
# select only years between 1981 and 2010
filter(year >= 1981 & year <= 2010) %>%
summarise(
# calculate the monthly sum for each year
year_sum = sum(!!params[[i]], na.rm = TRUE),
# calculate the monthly mean for each year
year_mean = mean(!!params[[i]], na.rm = TRUE)
) %>%
# Add column with the number of years in record
mutate(year_count = n())
## CONTROL FLOW ##
## if there is no data for the period 81-10 continue
if (identical(df_$year_count, integer(0))) {
next
} else {
# if there is data, make sure we have data for each year
if (mean(df_$year_count, na.rm = TRUE) == 30) {
## IF SO BUILD OUTPUT DATA FRAME ##
## Temperature
if (any(grepl(pattern = "T", x = params[[i]])) == TRUE) {
result <- round(mean(df_$year_mean, na.rm = TRUE), 1)
df_result[j, "Temperature"] <- result
} else {
## Rainfall
result <- round(mean(df_$year_sum, na.rm = TRUE))
df_result[j, "Rainfall"] <- result
}
}
}
}
}
df_result <- df_result[complete.cases(df_result), ]
write.csv(
x = df_result,
file = "dwd_data_1981-2010.csv",
row.names = FALSE
)
return(df_result)
} else {
## DOWNLOAD PREPROCESSED DATA ##
url <- "https://userpage.fu-berlin.de/soga/data/raw-data/"
df_result <- read.csv(paste0(url, "dwd_data_1981-2010.csv"))
print("########################################################")
print("### NOTE THAT YOU ARE DOWNLOADING PREPROCESSED DATA ###")
print("### THE ARGUMENTS TO THE FUNCTION CALL ARE DISMISSED ###")
print("########################################################")
return(df_result)
}
}
For the ease of application we included the RUN
argument
in the function. If RUN = TRUE
then the main function body
will be executed, the data will be downloaded and processed accordingly.
Depending on your internet connection this may take some time. However,
if RUN = FALSE
the function only downloads a prepared
data.frame
object fitting the purpose of this tutorial.
This second option may be of interest, if you just want to follow along
this tutorial. For sure this second approach will consume much less
time.
df_result <- download_dwd_data(
df = df_station,
RUN = FALSE,
verbose = TRUE,
forceDownload = FALSE
)
## [1] "########################################################"
## [1] "### NOTE THAT YOU ARE DOWNLOADING PREPROCESSED DATA ###"
## [1] "### THE ARGUMENTS TO THE FUNCTION CALL ARE DISMISSED ###"
## [1] "########################################################"
str(df_result)
## 'data.frame': 71 obs. of 9 variables:
## $ X : int 3089 3198 3593 4017 4553 111897 8075 8141 8270 8947 ...
## $ Station_id : int 164 167 183 198 222 5825 399 400 403 430 ...
## $ Altitude : int 54 11 42 164 393 40 36 60 51 36 ...
## $ Lat : num 53 53.8 54.7 51.4 50.6 ...
## $ Lon : num 14 13.7 13.4 11.3 12.7 ...
## $ Name : chr "Angermuende" "Anklam" "Arkona" "Artern" ...
## $ Federal_State: chr "Brandenburg" "Mecklenburg-Vorpommern" "Mecklenburg-Vorpommern" "Thueringen" ...
## $ Rainfall : int 521 562 546 490 805 515 459 579 591 550 ...
## $ Temperature : num 8.9 9 8.5 9.3 8.4 9.4 10.5 9.5 9.5 10 ...
The resulting data frame (df_result
) contains data of 71
weather stations, with the features X, Station_id, Altitude, Lat,
Lon, Name, Federal_State, Rainfall, Temperature.
Now that we downloaded the weather data of interest from DWD we may inspect the data set.
sample_n(df_result, 10)
## X Station_id Altitude Lat Lon Name
## 1 108205 5629 105 51.8891 12.6446 Wittenberg
## 2 50806 2618 732 50.8458 10.4803 Kleiner Inselsberg
## 3 39854 2044 404 51.6520 11.1366 Harzgerode
## 4 8141 400 60 52.6310 13.5021 Berlin-Buch
## 5 17215 853 416 50.7913 12.8720 Chemnitz
## 6 9076 433 48 52.4675 13.4021 Berlin-Tempelhof
## 7 56924 2932 131 51.4347 12.2396 Leipzig/Halle
## 8 86515 4445 609 51.7658 10.6533 Wernigerode-Schierke
## 9 11734 591 45 53.3911 10.6878 Boizenburg
## 10 78711 4036 204 51.3895 11.5412 Querfurt-Muehle Lodersleben
## Federal_State Rainfall Temperature
## 1 Sachsen-Anhalt 567 9.4
## 2 Thueringen 1220 6.2
## 3 Sachsen-Anhalt 445 7.5
## 4 Berlin 579 9.5
## 5 Sachsen 731 8.5
## 6 Berlin 577 9.9
## 7 Sachsen 534 9.4
## 8 Sachsen-Anhalt 1362 5.8
## 9 Mecklenburg-Vorpommern 630 8.9
## 10 Sachsen-Anhalt 541 9.3
By applying the summary()
function we print basic
statistical measures of the variables Temperature
and
Rainfall
.
summary(df_result[, c("Temperature", "Rainfall")])
## Temperature Rainfall
## Min. : 3.500 Min. : 317.0
## 1st Qu.: 8.500 1st Qu.: 533.5
## Median : 9.100 Median : 578.0
## Mean : 8.631 Mean : 638.5
## 3rd Qu.: 9.400 3rd Qu.: 639.0
## Max. :10.500 Max. :1879.0
By means of visual inspection, we can check for outliers or invalid data points.
library(GGally)
ggpairs(df_result[, c("Altitude", "Rainfall", "Temperature")],
title = "",
axisLabels = "show",
columnLabels = c("Altitude", "Rainfall", "Temperature")
)
The graph looks fine. We notice the expected pattern: Rainfall tends to increase with altitude, whereas temperature tends to decrease with altitude.
Finally, we contextualize our data set by plotting the weather stations on an interactive map.
library(leaflet)
library(leafpop)
library(htmltools)
# plot
df_result %>% leaflet() %>% addTiles() %>%
addCircleMarkers(data = df_result, radius = 5,
stroke = FALSE, fillOpacity = 0.5,
popup = popupTable(df_result, zcol = 2:9)
)
For the purpose of later usage we caste the spatial data sets to
Spatial
objects and save them to disk.
library(raster)
library(sf)
# Retrieve Federal States by the the getData() function from the raster package
germany <- getData(country = "Germany", level = 1)
germany_sf <- st_as_sf(germany)
# transform to UTM zone 33
germany_sf <- st_transform(germany_sf, 32633)
# simplify data
states <- c(
"Sachsen", "Sachsen-Anhalt",
"Mecklenburg-Vorpommern", "Brandenburg",
"Thüringen", "Berlin"
)
east_germany_states <- germany_sf[germany_sf$NAME_1 %in% states, ]
east_germany <- east_germany_states %>%
group_by(NAME_0) %>%
summarize()
# create simple feature object from data frame object
dwd_east_sf <- st_as_sf(df_result, coords = c("Lon", "Lat"), crs = 4326)
# transform to UTM zone 33
dwd_east_sf <- st_transform(dwd_east_sf, 32633)
# create sp objects
east_germany_sp <- as(east_germany, "Spatial")
east_germany_states_sp <- as(east_germany_states, "Spatial")
dwd_east_sp <- as(dwd_east_sf, "Spatial")
save(
file = "East_Germany.RData",
east_germany_sp, east_germany_states_sp, dwd_east_sp
)
Citation
The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us via mail by soga[at]zedat.fu-berlin.de.
Please cite as follow: Hartmann, K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.