All time series data sets for the weather station Berlin-Dahlem (FU) are provided by DWD (German Weather Service). The data was downloaded from the Climate Data Center on 2022-07-22.
# First, let's import the needed libraries.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from datetime import datetime
The monthly time series of the DWD weather station Berlin-Dahlem (FU) was downloaded on 2022-07-22 from the Climate Data Center of the DWD. A detailed variable description is available here. For the purpose of this tutorial the monthly time series data set was downloaded from the Climate Data Center, but it is also made available here.
We start by downloading the zipped data set from the server.
import requests, zipfile, io
url = "https://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/monthly/kl/historical/monatswerte_KL_00403_17190101_20211231_hist.zip"
r = requests.get(url)
z = zipfile.ZipFile(io.BytesIO(r.content))
z.extractall("../data")
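Before working with the files, we can quickly check what the archive contains, using the `ZipFile` object created above:

```python
# List the files contained in the downloaded archive
print(z.namelist())
```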
Indeed, the archive is made up of several text files. Let us explore the metadata of the DWD weather station Berlin-Dahlem (FU).
data_meta = pd.read_csv("../data/Metadaten_Geographie_00403.txt", sep=";")
data_meta
| | Stations_id | Stationshoehe | Geogr.Breite | Geogr.Laenge | von_datum | bis_datum | Stationsname |
|---|---|---|---|---|---|---|---|
0 | 403 | 51.0 | 52.4625 | 13.3000 | 19500101 | 19970711 | Berlin-Dahlem (FU) |
1 | 403 | 51.0 | 52.4537 | 13.3017 | 19970712 | NaN | Berlin-Dahlem (FU) |
It is quite interesting that the metadata table shows two different locations associated with the same weather station Berlin-Dahlem (FU). One site was operational from 1950-01-01 to 1997-07-11, and the other has been in operation since 1997-07-12.
Let us explore this issue further. We add the column `Status` to the data set and populate it with the nominal values `not active` and `active`.
data_meta["Status"] = ["not active", "active"]
Further, in order to explore the measurement site in more detail, we plot a map. We make use of the `folium` package to plot the two locations from the metadata file.
import folium
# Create a map, centered on Berlin
m = folium.Map(location=[52.45, 13.3], zoom_start=13)
# Add marker FU in Berlin - Dahlem
folium.Marker(
location=[
data_meta["Geogr.Breite"][0],
data_meta["Geogr.Laenge"][0],
], # coordinates for the marker
popup=data_meta["Status"][0], # pop-up label for the marker
).add_to(m)
folium.Marker(
location=[
data_meta["Geogr.Breite"][1],
data_meta["Geogr.Laenge"][1],
], # coordinates for the marker
popup=data_meta["Status"][1], # pop-up label for the marker
).add_to(m)
# Display m
m
We get a very nice and convenient visual representation of the two locations in the data set. Feel free to hover over the map, click on the point locations or change the underlying map tiles.
Interestingly, the currently operating weather station is located at the Botanischer Garten, whereas before 1997 the weather station Berlin-Dahlem was located at Podbielskiallee. The relocation of weather stations occasionally causes problems for the analysis of long-term time series: without prior knowledge of such a move, any change in the weather record may be falsely attributed to a change in weather or climatic patterns.
Now that we have some intuition about the measurement site, we load the actual time series data into Python. We inspect the `dtypes` attribute to get a brief idea of the structure of the data set.
filename = "../data/produkt_klima_monat_17190101_20211231_00403.txt"
dwd_raw = pd.read_csv(filename, sep=";")
dwd_raw.dtypes
STATIONS_ID            int64
MESS_DATUM_BEGINN      int64
MESS_DATUM_ENDE        int64
QN_4                   int64
MO_N                 float64
MO_TT                float64
MO_TX                float64
MO_TN                float64
MO_FK                float64
MX_TX                float64
MX_FX                float64
MX_TN                float64
MO_SD_S              float64
QN_6                   int64
MO_RR                float64
MX_RS                float64
eor                   object
dtype: object
The data set consists of 3510 rows and 17 columns, with the following variables: 'STATIONS_ID', 'MESS_DATUM_BEGINN', 'MESS_DATUM_ENDE', 'QN_4', 'MO_N', 'MO_TT', 'MO_TX', 'MO_TN', 'MO_FK', 'MX_TX', 'MX_FX', 'MX_TN', 'MO_SD_S', 'QN_6', 'MO_RR', 'MX_RS', 'eor'.
By looking at the structure of the data set we spot several entries with the value $-999$. This value, as indicated in the variable description file (here), corresponds to missing values and should be marked as `NA`.
There are several strategies for dealing with missing values. In this case, we decide to reload the data set and add the additional argument `na_values=["-999"]` to the function call. It is also advisable to include the argument `skipinitialspace=True`; this prevents whitespace from interfering with the detection of missing values.
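To see why this matters, consider the following tiny, made-up example: with a leading space in front of `-999`, the value is not recognized as missing unless `skipinitialspace=True` is set (the exact behavior may differ slightly between pandas versions).

```python
from io import StringIO

# a two-row toy CSV in which -999 is padded with a leading space
toy = "a; b\n1; -999\n"

print(pd.read_csv(StringIO(toy), sep=";", na_values=["-999"]))
print(pd.read_csv(StringIO(toy), sep=";", na_values=["-999"], skipinitialspace=True))
```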
We take a look at four randomly picked row entries by calling the handy `sample()` function.
dwd_raw = pd.read_csv(filename, sep=";", na_values=["-999"], skipinitialspace=True)
dwd_raw.sample(n=4, random_state=1)
| | STATIONS_ID | MESS_DATUM_BEGINN | MESS_DATUM_ENDE | QN_4 | MO_N | MO_TT | MO_TX | MO_TN | MO_FK | MX_TX | MX_FX | MX_TN | MO_SD_S | QN_6 | MO_RR | MX_RS | eor |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
364 | 403 | 17591101 | 17591130 | 5 | NaN | 2.7 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | eor |
136 | 403 | 17370101 | 17370131 | 5 | NaN | 2.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | eor |
99 | 403 | 17331201 | 17331231 | 5 | NaN | 4.6 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | eor |
943 | 403 | 18080201 | 18080229 | 5 | NaN | -1.3 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | eor |
In the next step we want to make sure that the data set is complete, i.e. that there is a row in the data table for each month within the period covered by the time series.
For now this is important because we will work with the `pandas` package for time series handling in Python, and the time series representation of `pandas` expects regularly spaced data to work properly. Please note that there are several methods, such as `interpolate()`, `asfreq()` or `resample()`, that can be used to convert an irregularly spaced time series to a regularly spaced one; see the small sketch below.
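As a small illustration with made-up values, `asfreq()` places an irregular series on a regular grid, inserting `NaN` where no observation exists, and `interpolate()` can then fill the gaps:

```python
# an irregular monthly series: March 2021 is missing
irregular = pd.Series(
    [1.0, 2.0, 4.0],
    index=pd.to_datetime(["2021-01-01", "2021-02-01", "2021-04-01"]),
)

regular = irregular.asfreq("MS")  # regular monthly grid; the gap becomes NaN
print(regular)
print(regular.interpolate())  # fill the gap linearly
```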
For the purpose of this example we conduct a quick sanity check: first, we extract the year/month of the first row and of the last row in the data set using the `min()` and `max()` functions.
dwd_raw["MESS_DATUM_BEGINN"] = pd.to_datetime(
dwd_raw["MESS_DATUM_BEGINN"], format="%Y%m%d"
)
first_date = min(dwd_raw["MESS_DATUM_BEGINN"])
print(first_date)
dwd_raw["MESS_DATUM_ENDE"] = pd.to_datetime(dwd_raw["MESS_DATUM_ENDE"], format="%Y%m%d")
last_date = max(dwd_raw["MESS_DATUM_BEGINN"])
print(last_date)
1719-01-01 00:00:00
2021-12-01 00:00:00
Then we construct a monthly sequence from the first to the last date and compare its length with the number of rows (observations) in the data set. We set `freq="MS"` (month start frequency) so that each generated date marks the start of a month within the date range. If the two numbers are equal, we conclude that the data set is complete.
expected_length = len(pd.date_range(first_date, last_date, freq="MS"))
expected_length == dwd_raw.shape[0]
False
Obviously the data set is not complete! For the period from 1719-01-01 to 2021-12-01 the data set should consist of 3636 rows; however, it contains only 3510.
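To see which months are actually missing, we can take the set difference between the expected monthly grid and the observed begin dates (a quick sketch):

```python
# months expected minus months present in the data set
expected_dates = pd.date_range(first_date, last_date, freq="MS")
missing_dates = expected_dates.difference(dwd_raw["MESS_DATUM_BEGINN"])
print(len(missing_dates))  # 3636 - 3510 = 126 missing months
print(missing_dates[:5])
```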
We can easily solve this problem. First, note that the `MESS_DATUM_BEGINN` column of the `dwd_raw` data frame already holds the begin date of each month as a proper datetime. We then generate a new data frame with a sequence of dates in monthly steps from 1719-01-01 to 2021-12-01 and merge it with the `dwd_raw` data frame on the `MESS_DATUM_BEGINN` column. The resulting data frame should then be populated with a row for each month.
# create new DataFrame
dwd_month = pd.DataFrame(
{"Date": pd.date_range(first_date, last_date, freq="MS")}
) ## 'MS' for the beginning (month start frequency)
dwd_month = pd.merge(
dwd_month, dwd_raw, left_on="Date", right_on="MESS_DATUM_BEGINN", how="left"
)
dwd_month.head(15)
| | Date | STATIONS_ID | MESS_DATUM_BEGINN | MESS_DATUM_ENDE | QN_4 | MO_N | MO_TT | MO_TX | MO_TN | MO_FK | MX_TX | MX_FX | MX_TN | MO_SD_S | QN_6 | MO_RR | MX_RS | eor |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 1719-01-01 | 403.0 | 1719-01-01 | 1719-01-31 | 5.0 | NaN | 2.8 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | eor |
1 | 1719-02-01 | 403.0 | 1719-02-01 | 1719-02-28 | 5.0 | NaN | 1.1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | eor |
2 | 1719-03-01 | 403.0 | 1719-03-01 | 1719-03-31 | 5.0 | NaN | 5.2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | eor |
3 | 1719-04-01 | 403.0 | 1719-04-01 | 1719-04-30 | 5.0 | NaN | 9.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | eor |
4 | 1719-05-01 | 403.0 | 1719-05-01 | 1719-05-31 | 5.0 | NaN | 15.1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | eor |
5 | 1719-06-01 | 403.0 | 1719-06-01 | 1719-06-30 | 5.0 | NaN | 19.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | eor |
6 | 1719-07-01 | 403.0 | 1719-07-01 | 1719-07-31 | 5.0 | NaN | 21.4 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | eor |
7 | 1719-08-01 | 403.0 | 1719-08-01 | 1719-08-31 | 5.0 | NaN | 18.8 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | eor |
8 | 1719-09-01 | 403.0 | 1719-09-01 | 1719-09-30 | 5.0 | NaN | 13.9 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | eor |
9 | 1719-10-01 | 403.0 | 1719-10-01 | 1719-10-31 | 5.0 | NaN | 9.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | eor |
10 | 1719-11-01 | 403.0 | 1719-11-01 | 1719-11-30 | 5.0 | NaN | 6.6 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | eor |
11 | 1719-12-01 | 403.0 | 1719-12-01 | 1719-12-31 | 5.0 | NaN | 0.3 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | eor |
12 | 1720-01-01 | 403.0 | 1720-01-01 | 1720-01-31 | 5.0 | NaN | 2.6 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | eor |
13 | 1720-02-01 | 403.0 | 1720-02-01 | 1720-02-29 | 5.0 | NaN | 1.4 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | eor |
14 | 1720-03-01 | 403.0 | 1720-03-01 | 1720-03-31 | 5.0 | NaN | 3.1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | eor |
expected_length == dwd_month.shape[0]
True
Nice, sanity check passed!
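As an additional cross-check (a small sketch), the rows inserted by the merge carry no station id; their number should equal the number of missing months:

```python
# rows added by the merge have NaN in the columns coming from dwd_raw
print(dwd_month["STATIONS_ID"].isna().sum())  # expected: 3636 - 3510 = 126
```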
For now we are only interested in the variable for the mean monthly temperature. After looking into the variable description file (here) we find that the variable of interest is encoded as `MO_TT` (monthly mean of the daily temperature two meters above ground).

We take a look at the entries at positions 1 through 9 of the variable `MO_TT`.
dwd_month["MO_TT"][1:10]
1     1.1
2     5.2
3     9.0
4    15.1
5    19.0
6    21.4
7    18.8
8    13.9
9     9.0
Name: MO_TT, dtype: float64
Finally, we store the temperature vector as a `pandas.Series` object by calling the `pd.Series()` function. We have to feed the function a vector with the data and a vector with the corresponding dates, which will be set as the index. Recall that we already created a vector of dates beforehand when building the `dwd_month` data frame.
ts_FUB_monthly = pd.Series(dwd_month["MO_TT"].values, index=dwd_month["Date"])
For a simple visualization, we apply the `plot()` function directly to the `ts_FUB_monthly` series.
plt.figure(figsize=(18, 4))
ts_FUB_monthly.plot()
plt.show()
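The monthly series is dominated by the seasonal cycle. To get a first impression of longer-term variations, one option is to overlay a centered 12-month rolling mean (a sketch; the window length is a free choice):

```python
plt.figure(figsize=(18, 4))
ts_FUB_monthly.plot(alpha=0.4, label="monthly mean")
# a centered 12-month rolling mean smooths out the seasonal cycle
ts_FUB_monthly.rolling(window=12, center=True).mean().plot(label="12-month rolling mean")
plt.ylabel("Temperature in °C")
plt.legend()
plt.show()
```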
In this section we process the daily time series for the DWD station Berlin-Dahlem (FU). The daily time series was downloaded on 2022-07-22 from the Climate Data Center. A detailed variable description is available here. For the purpose of this tutorial the daily time series data set is made available here.
The preprocessing procedure is very similar to the one for monthly time series data. Please revisit the previous section if you cannot follow the outlined steps below.
url = "https://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/daily/kl/historical/tageswerte_KL_00403_19500101_20211231_hist.zip"
r = requests.get(url)
z = zipfile.ZipFile(io.BytesIO(r.content))
z.extractall("../data")
data_raw = pd.read_csv(
"../data/produkt_klima_tag_19500101_20211231_00403.txt",
sep=";",
na_values=["-999"],
skipinitialspace=True,
)
data_raw
| | STATIONS_ID | MESS_DATUM | QN_3 | FX | FM | QN_4 | RSK | RSKF | SDK | SHK_TAG | NM | VPM | PM | TMK | UPM | TXK | TNK | TGK | eor |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 403 | 19500101 | NaN | NaN | NaN | 5 | 2.2 | 7 | NaN | 0.0 | 5.0 | 4.0 | 1025.60 | -3.2 | 83.00 | -1.1 | -4.9 | -6.3 | eor |
1 | 403 | 19500102 | NaN | NaN | NaN | 5 | 12.6 | 8 | NaN | 0.0 | 8.0 | 6.1 | 1005.60 | 1.0 | 95.00 | 2.2 | -3.7 | -5.3 | eor |
2 | 403 | 19500103 | NaN | NaN | NaN | 5 | 0.5 | 1 | NaN | 0.0 | 5.0 | 6.5 | 996.60 | 2.8 | 86.00 | 3.9 | 1.7 | -1.4 | eor |
3 | 403 | 19500104 | NaN | NaN | NaN | 5 | 0.5 | 7 | NaN | 0.0 | 7.7 | 5.2 | 999.50 | -0.1 | 85.00 | 2.1 | -0.9 | -2.3 | eor |
4 | 403 | 19500105 | NaN | NaN | NaN | 5 | 10.3 | 7 | NaN | 0.0 | 8.0 | 4.0 | 1001.10 | -2.8 | 79.00 | -0.9 | -3.3 | -5.2 | eor |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
26293 | 403 | 20211227 | NaN | NaN | NaN | 3 | 0.0 | 8 | 0.183 | 0.0 | 5.9 | 3.8 | 998.13 | -3.7 | 79.67 | -0.7 | -7.9 | -9.9 | eor |
26294 | 403 | 20211228 | NaN | NaN | NaN | 3 | 1.5 | 6 | 0.000 | 0.0 | 6.4 | 5.3 | 990.17 | -0.5 | 88.46 | 2.7 | -3.9 | -5.1 | eor |
26295 | 403 | 20211229 | NaN | NaN | NaN | 3 | 0.3 | 6 | 0.000 | 0.0 | 7.5 | 8.2 | 994.40 | 4.0 | 100.00 | 5.6 | 1.8 | 0.0 | eor |
26296 | 403 | 20211230 | NaN | NaN | NaN | 3 | 3.2 | 6 | 0.000 | 0.0 | 7.9 | 11.5 | 1001.70 | 9.0 | 98.54 | 12.7 | 4.6 | 2.3 | eor |
26297 | 403 | 20211231 | NaN | NaN | NaN | 3 | 5.5 | 6 | 0.000 | 0.0 | 7.7 | 12.5 | 1004.72 | 12.8 | 84.96 | 14.0 | 11.5 | 10.7 | eor |
26298 rows × 19 columns
For now we are interested in the variables for rainfall and temperature. After reviewing the variable description file (here) we find that the daily rainfall in mm is encoded in the `RSK` variable and the daily mean temperature two meters above ground in the `TMK` variable. For the sake of comprehensibility we construct a new data frame containing only the two variables of interest.
data_raw["MESS_DATUM"] = pd.to_datetime(data_raw["MESS_DATUM"], format="%Y%m%d")
ts_FUB_daily = pd.DataFrame(
{"Temp": data_raw["TMK"], "Rain": data_raw["RSK"]}
).set_index(data_raw["MESS_DATUM"])
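Before plotting, a quick consistency check suggests itself (a sketch): aggregating the daily temperatures to monthly means should roughly reproduce the `MO_TT` values loaded in the previous section. Small deviations are possible, since DWD derives the monthly products from quality-checked data.

```python
# monthly means computed from the daily series ...
monthly_from_daily = ts_FUB_daily["Temp"].resample("MS").mean()
# ... compared with MO_TT for the overlapping period from 1950 onwards
print((monthly_from_daily - ts_FUB_monthly.loc["1950":]).abs().max())
```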
Let us plot the data with the `plot` function, plotting `Temp` and `Rain` as subplots.
fig, ax = plt.subplots(2, 1, figsize=(18, 8))
ax[0].plot(ts_FUB_daily["Temp"])
ax[0].set_title("Temp")
ax[1].plot(ts_FUB_daily["Rain"], color="orange")
ax[1].set_title("Rain")
plt.show()
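With a complete daily index in place, aggregations become one-liners. For example, annual precipitation totals (a sketch; `min_count=1` makes years without any valid measurement yield `NaN` instead of 0):

```python
# sum the daily rainfall to annual totals
annual_rain = ts_FUB_daily["Rain"].resample("YS").sum(min_count=1)
print(annual_rain.tail())
```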
In this section we process the hourly time series of rainfall data for the DWD station Berlin-Dahlem (FU). The hourly time series was downloaded on 2022-07-22 from the Climate Data Center. A detailed variable description is available here. For the purpose of this tutorial the hourly time series data set is made available here.
The preprocessing procedure is very similar to the one for monthly and daily data. Please revisit the previous sections if you cannot follow the outlined steps below.
url = "https://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/hourly/precipitation/historical/stundenwerte_RR_00403_20020128_20211231_hist.zip"
r = requests.get(url)
z = zipfile.ZipFile(io.BytesIO(r.content))
z.extractall("../data")
data_raw = pd.read_csv(
"../data/produkt_rr_stunde_20020128_20211231_00403.txt",
sep=";",
na_values=["-999"],
skipinitialspace=True,
)
data_raw
| | STATIONS_ID | MESS_DATUM | QN_8 | R1 | RS_IND | WRTR | eor |
---|---|---|---|---|---|---|---
0 | 403 | 2002012811 | 1 | 0.0 | 0.0 | NaN | eor |
1 | 403 | 2002012813 | 1 | 0.0 | 0.0 | NaN | eor |
2 | 403 | 2002012815 | 1 | 1.7 | 1.0 | NaN | eor |
3 | 403 | 2002012818 | 1 | 1.1 | 1.0 | NaN | eor |
4 | 403 | 2002012821 | 1 | 0.0 | 0.0 | NaN | eor |
... | ... | ... | ... | ... | ... | ... | ... |
174018 | 403 | 2021123119 | 3 | 0.7 | 1.0 | 6.0 | eor |
174019 | 403 | 2021123120 | 3 | 0.7 | 1.0 | 6.0 | eor |
174020 | 403 | 2021123121 | 3 | 0.1 | 1.0 | NaN | eor |
174021 | 403 | 2021123122 | 3 | 0.1 | 1.0 | 6.0 | eor |
174022 | 403 | 2021123123 | 3 | 0.0 | 1.0 | 6.0 | eor |
174023 rows × 7 columns
The data set contains hourly rainfall measurements for the weather station Berlin-Dahlem (FU). After reviewing the variable description file (here) we find that the hourly rainfall in mm is encoded in the `R1` variable.
data_raw["MESS_DATUM"] = pd.to_datetime(data_raw["MESS_DATUM"], format="%Y%m%d%H")
ts_FUB_hourly = pd.Series(data_raw["R1"].values, index=data_raw["MESS_DATUM"])
ts_FUB_hourly.fillna(0)
MESS_DATUM
2002-01-28 11:00:00    0.0
2002-01-28 13:00:00    0.0
2002-01-28 15:00:00    1.7
2002-01-28 18:00:00    1.1
2002-01-28 21:00:00    0.0
                      ...
2021-12-31 19:00:00    0.7
2021-12-31 20:00:00    0.7
2021-12-31 21:00:00    0.1
2021-12-31 22:00:00    0.1
2021-12-31 23:00:00    0.0
Length: 174023, dtype: float64
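Note that the record is not strictly hourly everywhere: the first rows above jump from 11 to 13 to 15 o'clock. If a regular hourly grid is required, `asfreq()` inserts the missing timestamps as `NaN` (a sketch, assuming the timestamps are unique):

```python
# place the series on a strictly hourly grid; missing hours become NaN
ts_FUB_hourly_regular = ts_FUB_hourly.asfreq("h")
print(len(ts_FUB_hourly_regular) - len(ts_FUB_hourly))  # number of inserted hours
```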
In the next step we simply plot the data again with the convenient `plot()` function.
plt.figure(figsize=(18, 4))
ts_FUB_hourly.plot()
plt.ylabel("Rainfall in mm/h", fontsize=13)
plt.xlabel("Date", fontsize=13)
plt.show()
Finally, we store the monthly, daily and hourly time series as `.json` files using the `to_json()` function for further processing.
# convert the pandas series to DataFrames
# note: the monthly series holds MO_TT, i.e. temperature, not rainfall
ts_FUB_monthly_df = ts_FUB_monthly.to_frame(name="temperature").reset_index()
ts_FUB_hourly_df = ts_FUB_hourly.to_frame(name="rainfall").reset_index()
ts_FUB_daily_df = ts_FUB_daily.reset_index()
# save them as json files, remember to set date_format="iso"
ts_FUB_monthly_df.to_json("../data/ts_FUB_monthly.json", date_format="iso")
ts_FUB_hourly_df.to_json("../data/ts_FUB_hourly.json", date_format="iso")
ts_FUB_daily_df.to_json("../data/ts_FUB_daily.json", date_format="iso")
To read the `.json` files back in, use the following command and date conversion.
ts_FUB_monthly_df = pd.read_json("../data/ts_FUB_monthly.json")
ts_FUB_monthly_df["Date"] = pd.to_datetime(
    ts_FUB_monthly_df["Date"], format="%Y-%m-%d", errors="coerce"
)
ts_FUB_monthly_df
| | Date | temperature |
---|---|---
0 | 1719-01-01 00:00:00+00:00 | 2.80 |
1 | 1719-02-01 00:00:00+00:00 | 1.10 |
2 | 1719-03-01 00:00:00+00:00 | 5.20 |
3 | 1719-04-01 00:00:00+00:00 | 9.00 |
4 | 1719-05-01 00:00:00+00:00 | 15.10 |
... | ... | ... |
3631 | 2021-08-01 00:00:00+00:00 | 17.43 |
3632 | 2021-09-01 00:00:00+00:00 | 15.55 |
3633 | 2021-10-01 00:00:00+00:00 | 10.49 |
3634 | 2021-11-01 00:00:00+00:00 | 6.28 |
3635 | 2021-12-01 00:00:00+00:00 | 2.19 |
3636 rows × 2 columns
Citation
The E-Learning project SOGA-Py was developed at the Department of Earth Sciences by Annette Rudolph, Joachim Krois and Kai Hartmann. You can reach us via mail by soga[at]zedat.fu-berlin.de.
Please cite as follows: Rudolph, A., Krois, J., Hartmann, K. (2023): Statistics and Geodata Analysis using Python (SOGA-Py). Department of Earth Sciences, Freie Universitaet Berlin.