In the following sections we work with weather data provided by the Deutscher Wetterdienst (DWD), the German Weather Service. We provide a preprocessed data set of DWD weather stations located across Germany. The data was downloaded from the DWD data portal on April 21, 2017. You may find a detailed description of the data set here. Please note that, for the purpose of this tutorial, the data set was preprocessed and its columns were renamed.
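You may download the DWD.csv file here. We import the data set and assign a proper name to it. Note that we use the read_csv() function from the pandas library to load the data set. The file uses semicolons as field separators and commas as decimal marks, which is why we pass the sep and decimal arguments.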
# First, let's import the needed libraries.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import geopandas as gpd
import requests
import io
from io import StringIO
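# Download the CSV as text and parse it; the file uses ";" as the field
# separator and "," as the decimal mark.
url = "http://www.userpage.fu-berlin.de/soga/300/30100_data_sets/DWD.csv"
s = requests.get(url).text
dwd = pd.read_csv(StringIO(s), sep=";", decimal=",")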
dwd.columns
Index(['id', 'DWD_ID', 'STATION NAME', 'FEDERAL STATE', 'LAT', 'LON', 'ALTITUDE', 'PERIOD', 'RECORD LENGTH', 'MEAN ANNUAL AIR TEMP', 'MEAN MONTHLY MAX TEMP', 'MEAN MONTHLY MIN TEMP', 'MEAN ANNUAL WIND SPEED', 'MEAN CLOUD COVER', 'MEAN ANNUAL SUNSHINE', 'MEAN ANNUAL RAINFALL', 'MAX MONTHLY WIND SPEED', 'MAX AIR TEMP', 'MAX WIND SPEED', 'MAX RAINFALL', 'MIN AIR TEMP', 'MEAN RANGE AIR TEMP'], dtype='object')
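To see why the sep and decimal arguments matter, here is a minimal sketch that parses a small in-memory sample instead of the real file. The station names and values are made up for illustration; only the layout (semicolon-separated fields, decimal commas) mirrors the call above.

```python
from io import StringIO

import pandas as pd

# Made-up sample in the same layout as DWD.csv: ";" separates fields,
# "," is the decimal mark. Station names and values are illustrative only.
raw = (
    "STATION NAME;LAT;LON;MEAN ANNUAL RAINFALL\n"
    "Station A;47,8413;8,8493;755,0\n"
    "Station B;50,7827;6,0941;820,0\n"
)

# Without decimal=",", the numeric columns are read as text, not numbers.
naive = pd.read_csv(StringIO(raw), sep=";")

# With decimal=",", pandas converts "47,8413" to the float 47.8413.
parsed = pd.read_csv(StringIO(raw), sep=";", decimal=",")

print(naive.dtypes)
print(parsed.dtypes)
```

Comparing the two dtype listings is a quick way to confirm that the numeric columns were really read as numbers before any statistics are computed on them.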
The dwd data set consists of 599 rows, each representing a particular weather station in Germany, and 22 columns, each corresponding to a variable (feature) of that weather station. The largely self-explanatory variables are: 'id', 'DWD_ID', 'STATION NAME', 'FEDERAL STATE', 'LAT', 'LON', 'ALTITUDE', 'PERIOD', 'RECORD LENGTH', 'MEAN ANNUAL AIR TEMP', 'MEAN MONTHLY MAX TEMP', 'MEAN MONTHLY MIN TEMP', 'MEAN ANNUAL WIND SPEED', 'MEAN CLOUD COVER', 'MEAN ANNUAL SUNSHINE', 'MEAN ANNUAL RAINFALL', 'MAX MONTHLY WIND SPEED', 'MAX AIR TEMP', 'MAX WIND SPEED', 'MAX RAINFALL', 'MIN AIR TEMP', 'MEAN RANGE AIR TEMP'.
For the purpose of this tutorial we are only interested in the variables MEAN ANNUAL RAINFALL and ALTITUDE. Hence, we subset our data set to these variables, keeping the coordinates LAT and LON so that each station can be located. Further, we exclude all rows with missing values.
dwd = dwd[["LAT", "LON", "MEAN ANNUAL RAINFALL", "ALTITUDE"]].dropna()
dwd.shape[0]
586
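The effect of selecting columns before calling dropna() can be checked on a small example. The sketch below uses a made-up table that only borrows the column names from the tutorial data; the values are not taken from DWD.csv.

```python
import numpy as np
import pandas as pd

# Made-up table with the same column names as the tutorial data;
# the values are illustrative and not taken from DWD.csv.
toy = pd.DataFrame(
    {
        "LAT": [47.8, 50.8, 52.9, 48.2],
        "LON": [8.8, 6.1, 8.2, 9.0],
        "MEAN ANNUAL RAINFALL": [755.0, np.nan, 759.0, 919.0],
        "ALTITUDE": [478.0, 202.0, np.nan, 759.0],
        "MEAN CLOUD COVER": [np.nan, 5.0, 5.5, 6.0],
    }
)

cols = ["LAT", "LON", "MEAN ANNUAL RAINFALL", "ALTITUDE"]

# Select the columns first, so that gaps in unused columns
# (here "MEAN CLOUD COVER") do not cause rows to be dropped.
subset = toy[cols]

# Count missing values per column before dropping anything.
print(subset.isna().sum())

# dropna() removes every row with at least one missing value.
clean = subset.dropna()
print(f"{len(toy) - len(clean)} of {len(toy)} rows dropped")
```

Had dropna() been applied to the full table instead, the first row would also be lost because of its missing cloud cover value, even though that column is not needed here.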
After cleaning up, 586 observations are left in our data set. In the next step we create a GeoDataFrame from the data set. Note that we pass the additional argument crs=4326 to the function call, as the coordinates in the data set are given as geographic coordinates in decimal degrees. Thereafter, we transform the GeoDataFrame object into the ETRS89/LAEA coordinate reference system (European Terrestrial Reference System 1989/Lambert Azimuthal Equal-Area projection), identified by the EPSG code 3035.
# Create a GeoDataFrame from the coordinates, then reproject it to EPSG:3035
dwd_geo = gpd.GeoDataFrame(dwd, geometry=gpd.points_from_xy(dwd.LON, dwd.LAT), crs=4326)
dwd_geo = dwd_geo.to_crs("epsg:3035")
dwd_geo.head()
| | LAT | LON | MEAN ANNUAL RAINFALL | ALTITUDE | geometry |
|---|---|---|---|---|---|
| 0 | 47.8413 | 8.8493 | 755.0 | 478.0 | POINT (4234819.614 2748192.185) |
| 1 | 50.7827 | 6.0941 | 820.0 | 202.0 | POINT (4045677.710 3081917.634) |
| 2 | 52.9335 | 8.2370 | 759.0 | 44.0 | POINT (4202462.727 3315312.398) |
| 3 | 48.2156 | 8.9784 | 919.0 | 759.0 | POINT (4245047.748 2789643.853) |
| 4 | 48.6159 | 13.0506 | 790.0 | 340.0 | POINT (4545939.250 2838265.721) |
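As a quick check of the reprojection step, the sketch below rebuilds the first two stations from the table above and inspects the coordinate reference system before and after to_crs(). It assumes geopandas (with pyproj) is installed; the variable names are chosen for illustration.

```python
import geopandas as gpd
import pandas as pd

# First two stations from the table above (coordinates in decimal degrees).
stations = pd.DataFrame({"LAT": [47.8413, 50.7827], "LON": [8.8493, 6.0941]})

# points_from_xy expects x (longitude) first, then y (latitude).
gdf = gpd.GeoDataFrame(
    stations,
    geometry=gpd.points_from_xy(stations["LON"], stations["LAT"]),
    crs=4326,
)
print(gdf.crs.to_epsg(), gdf.crs.is_geographic)

# After reprojection the coordinates are metres in ETRS89/LAEA.
gdf_laea = gdf.to_crs(epsg=3035)
print(gdf_laea.crs.to_epsg(), gdf_laea.crs.is_projected)
print(gdf_laea.geometry.x.round(1).tolist())
print(gdf_laea.geometry.y.round(1).tolist())
```

The printed x and y values should agree with the geometry column shown above to within rounding. Swapping LON and LAT in points_from_xy is a common mistake and would place the points far outside Germany.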
Before we continue, we should remind ourselves that the data set we are working with has a spatial component. Our observations are point measurements of rainfall spread across Germany. Let us plot a simple map to visualize the spatial distribution of our observations. For this we rely on the GeoPandas package. The outlines of the German federal states come from shapefiles provided by the German Federal Agency for Cartography and Geodesy (under this licence). You may directly download the shapefiles here. For the purpose of this tutorial, we also provide the data here.
# Retrieve Federal States
import zipfile
url = "https://daten.gdz.bkg.bund.de/produkte/vg/vg5000_0101/aktuell/vg5000_01-01.utm32s.shape.ebenen.zip"
r = requests.get(url)
z = zipfile.ZipFile(io.BytesIO(r.content))
z.extractall(path="../data")
G1 = gpd.read_file(
"../data/vg5000_01-01.utm32s.shape.ebenen/vg5000_ebenen_0101/VG5000_LAN.shp"
)
G1 = G1.to_crs("epsg:3035")
# plot the map
fig, ax = plt.subplots(1, 1, figsize=(8, 5))
ax.ticklabel_format(useOffset=False)
G1.plot(ax=ax, color="#fff7bc", edgecolor="black", linewidth=0.7)
dwd_geo.plot(ax=ax, facecolor="none", edgecolor="darkgrey", markersize=6)
# Both layers are in EPSG:3035, so the axes show metres rather than degrees.
ax.set_xlabel("Easting (m)")
ax.set_ylabel("Northing (m)")
plt.show()
For later use we store the spatial object dwd_geo on disk.
dwd_geo.to_file("../data/dwd_geo.geojson", driver="GeoJSON")
Citation
The E-Learning project SOGA-Py was developed at the Department of Earth Sciences by Annette Rudolph, Joachim Krois and Kai Hartmann. You can reach us by email at soga[at]zedat.fu-berlin.de.
Please cite as follows: Rudolph, A., Krois, J., Hartmann, K. (2023): Statistics and Geodata Analysis using Python (SOGA-Py). Department of Earth Sciences, Freie Universitaet Berlin.