All time series data sets for the weather station Berlin-Dahlem (FU) are provided by DWD (German Weather Service). The data was downloaded from the Climate Data Center on 2022-07-22.

In [27]:
# First, let's import the needed libraries.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from datetime import datetime

Monthly time series data Berlin-Dahlem (FU)

The monthly time series of the DWD weather station Berlin-Dahlem (FU) was downloaded on 2022-07-22 from the Climate Data Center of DWD. A detailed variable description is available here. For the purpose of this tutorial the monthly time series data set was downloaded here; a copy is also made available here.

We start with downloading the zipped data set from the server.

In [28]:
import requests, zipfile, io

url = "https://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/monthly/kl/historical/monatswerte_KL_00403_17190101_20211231_hist.zip"

r = requests.get(url)
z = zipfile.ZipFile(io.BytesIO(r.content))
z.extractall("../data")
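To see which files the archive actually contains, we can list its members (an optional step; z is the ZipFile object created above):

# list the files contained in the downloaded archive
print(z.namelist())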

The archive contains several text files. Let us first explore the metadata of the DWD weather station Berlin-Dahlem (FU).

In [29]:
data_meta = pd.read_csv("../data/Metadaten_Geographie_00403.txt", sep=";")
data_meta
Out[29]:
Stations_id Stationshoehe Geogr.Breite Geogr.Laenge von_datum bis_datum Stationsname
0 403 51.0 52.4625 13.3000 19500101 19970711 Berlin-Dahlem (FU)
1 403 51.0 52.4537 13.3017 19970712 NaN Berlin-Dahlem (FU)

It is quite interesting that the metadata table shows two different locations associated with the same weather station Berlin-Dahlem (FU). One site was operational from 1950-01-01 to 1997-07-11, and the other has been in operation since 1997-07-12.

Let us explore this issue further. We add the column Status to the data set and assign the nominal values not active and active.

In [30]:
data_meta["Status"] = ["not active", "active"]

Further, in order to explore the measurement site in more detail, we plot a map. We make use of the folium package to plot the locations from the metadata file.

In [31]:
import folium

# Create a map, centered on Berlin
m = folium.Map(location=[52.45, 13.3], zoom_start=13)

# Add marker FU in Berlin - Dahlem
folium.Marker(
    location=[
        data_meta["Geogr.Breite"][0],
        data_meta["Geogr.Laenge"][0],
    ],  # coordinates for the marker
    popup=data_meta["Status"][0],  # pop-up label for the marker
).add_to(m)

folium.Marker(
    location=[
        data_meta["Geogr.Breite"][1],
        data_meta["Geogr.Laenge"][1],
    ],  # coordinates for the marker
    popup=data_meta["Status"][1],  # pop-up label for the marker
).add_to(m)


# Display m
m
Out[31]:
[Interactive folium map showing the two station locations in Berlin-Dahlem]

We get a very nice and convenient visual representation of the location of the two data points in the data set. Feel free to hover over the map, click on the point locations or change the underlying map tiles.

It is quite interesting to see that the currently operating weather station is located at the Botanischer Garten, whereas before 1997 the weather station Berlin-Dahlem was located at Podbielskiallee. The relocation of weather stations occasionally causes problems for the analysis of long-term time series: without prior knowledge of such a move, any change in the weather record may be falsely attributed to a change in weather or climatic patterns.

Now that we have some intuition about the measurement site, we load the actual time series data into Python. We inspect the dtypes attribute to get a brief idea of the structure of the data set.

In [32]:
filename = "../data/produkt_klima_monat_17190101_20211231_00403.txt"
dwd_raw = pd.read_csv(filename, sep=";")
dwd_raw.dtypes
Out[32]:
STATIONS_ID            int64
MESS_DATUM_BEGINN      int64
MESS_DATUM_ENDE        int64
QN_4                   int64
MO_N                 float64
MO_TT                float64
MO_TX                float64
MO_TN                float64
MO_FK                float64
MX_TX                float64
MX_FX                float64
MX_TN                float64
MO_SD_S              float64
QN_6                   int64
MO_RR                float64
MX_RS                float64
eor                   object
dtype: object

The data set consists of 3510 rows and 17 columns, with the following variables: 'STATIONS_ID', 'MESS_DATUM_BEGINN', 'MESS_DATUM_ENDE', 'QN_4', 'MO_N', 'MO_TT', 'MO_TX', 'MO_TN', 'MO_FK', 'MX_TX', 'MX_FX', 'MX_TN', 'MO_SD_S', 'QN_6', 'MO_RR', 'MX_RS' and 'eor'.

By looking at the data set we spot several entries with the value $-999$. As indicated in the variable description file (here), this value corresponds to missing values and should be marked as NaN.

There are several strategies for dealing with missing values. In this case we decide to reload the data set with the additional argument na_values=["-999"] in the function call. It is also advisable to include the argument skipinitialspace=True, which prevents whitespace from interfering with the detection of missing values.
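As an aside, the same outcome can be achieved after loading, by replacing the sentinel value explicitly. A minimal sketch (equivalent in effect to the na_values argument used below):

# alternative sketch: load first, then mark the sentinel value -999 as missing
dwd_check = pd.read_csv(filename, sep=";", skipinitialspace=True)
dwd_check = dwd_check.replace(-999, np.nan)
dwd_check.isna().sum()  # number of missing values per column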

We take a look at four randomly picked row entries by calling the handy sample() function.

In [33]:
dwd_raw = pd.read_csv(filename, sep=";", na_values=["-999"], skipinitialspace=True)

dwd_raw.sample(n=4, random_state=1)
Out[33]:
STATIONS_ID MESS_DATUM_BEGINN MESS_DATUM_ENDE QN_4 MO_N MO_TT MO_TX MO_TN MO_FK MX_TX MX_FX MX_TN MO_SD_S QN_6 MO_RR MX_RS eor
364 403 17591101 17591130 5 NaN 2.7 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN eor
136 403 17370101 17370131 5 NaN 2.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN eor
99 403 17331201 17331231 5 NaN 4.6 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN eor
943 403 18080201 18080229 5 NaN -1.3 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN eor

In a next step we want to make sure that the data set is complete, i.e. that there is a row in the data table for each month within the time period covered by the time series.

This is important because we will work with the pandas package for time series handling in Python, and pandas expects regularly spaced data to work properly with time series. Please note that there are several methods, such as interpolate(), asfreq() or resample(), that can be used to convert an irregularly spaced time series to regularly spaced data.
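As a toy illustration of these methods (the dates below are made up for demonstration purposes), asfreq() inserts missing time steps as NaN and interpolate() fills them:

# toy example: regularise an irregular monthly series
irregular = pd.Series(
    [1.0, 3.0, 4.0],
    index=pd.to_datetime(["2021-01-01", "2021-03-01", "2021-04-01"]),
)
regular = irregular.asfreq("MS")  # inserts the missing 2021-02-01 as NaN
filled = regular.interpolate()  # linear interpolation of the gap
print(filled)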

For the purpose of this example we conduct a quick sanity check: first, we extract the year/month entry of the first row and of the last row in the data set using the min() and max() functions.

In [34]:
dwd_raw["MESS_DATUM_BEGINN"] = pd.to_datetime(
    dwd_raw["MESS_DATUM_BEGINN"], format="%Y%m%d"
)
first_date = min(dwd_raw["MESS_DATUM_BEGINN"])
print(first_date)

dwd_raw["MESS_DATUM_ENDE"] = pd.to_datetime(dwd_raw["MESS_DATUM_ENDE"], format="%Y%m%d")
# use MESS_DATUM_BEGINN here as well, so that last_date is a month start, too
last_date = max(dwd_raw["MESS_DATUM_BEGINN"])
print(last_date)
1719-01-01 00:00:00
2021-12-01 00:00:00

Then we construct a monthly sequence from the first to the last date and compare its length with the number of rows (observations) in the data set. We set freq="MS" to generate the start of each month within the date range. If the two numbers are equal we conclude that the data set is complete.

In [35]:
expected_length = len(pd.date_range(first_date, last_date, freq="MS"))
expected_length == dwd_raw.shape[0]
Out[35]:
False

Obviously the data set is not complete! For the period from 1719-01-01 to 2021-12-01 the data set should consist of 3636 rows. However, it contains only 3510.
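Which months are missing? An optional way to find out is to compare the expected date sequence with the observed dates; assuming no duplicate rows, this should report 126 missing months:

# optional: list the months without a row in the data set
expected = pd.date_range(first_date, last_date, freq="MS")
missing = expected.difference(pd.DatetimeIndex(dwd_raw["MESS_DATUM_BEGINN"]))
print(len(missing))
print(missing[:5])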

We may easily solve this problem. First, we generate a new data frame with a sequence of dates in monthly steps from 1719-01-01 to 2021-12-01. Then we merge this date sequence with the dwd_raw data frame, joining on the MESS_DATUM_BEGINN column, which we already converted to dates above. The resulting data frame is then populated with a row for each month.

In [36]:
# create new DataFrame
dwd_month = pd.DataFrame(
    {"Date": pd.date_range(first_date, last_date, freq="MS")}
)  ## 'MS' for the beginning (month start frequency)
In [37]:
dwd_month = pd.merge(
    dwd_month, dwd_raw, left_on="Date", right_on="MESS_DATUM_BEGINN", how="left"
)
In [38]:
dwd_month.head(15)
Out[38]:
Date STATIONS_ID MESS_DATUM_BEGINN MESS_DATUM_ENDE QN_4 MO_N MO_TT MO_TX MO_TN MO_FK MX_TX MX_FX MX_TN MO_SD_S QN_6 MO_RR MX_RS eor
0 1719-01-01 403.0 1719-01-01 1719-01-31 5.0 NaN 2.8 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN eor
1 1719-02-01 403.0 1719-02-01 1719-02-28 5.0 NaN 1.1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN eor
2 1719-03-01 403.0 1719-03-01 1719-03-31 5.0 NaN 5.2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN eor
3 1719-04-01 403.0 1719-04-01 1719-04-30 5.0 NaN 9.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN eor
4 1719-05-01 403.0 1719-05-01 1719-05-31 5.0 NaN 15.1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN eor
5 1719-06-01 403.0 1719-06-01 1719-06-30 5.0 NaN 19.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN eor
6 1719-07-01 403.0 1719-07-01 1719-07-31 5.0 NaN 21.4 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN eor
7 1719-08-01 403.0 1719-08-01 1719-08-31 5.0 NaN 18.8 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN eor
8 1719-09-01 403.0 1719-09-01 1719-09-30 5.0 NaN 13.9 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN eor
9 1719-10-01 403.0 1719-10-01 1719-10-31 5.0 NaN 9.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN eor
10 1719-11-01 403.0 1719-11-01 1719-11-30 5.0 NaN 6.6 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN eor
11 1719-12-01 403.0 1719-12-01 1719-12-31 5.0 NaN 0.3 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN eor
12 1720-01-01 403.0 1720-01-01 1720-01-31 5.0 NaN 2.6 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN eor
13 1720-02-01 403.0 1720-02-01 1720-02-29 5.0 NaN 1.4 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN eor
14 1720-03-01 403.0 1720-03-01 1720-03-31 5.0 NaN 3.1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN eor
In [39]:
expected_length == dwd_month.shape[0]
Out[39]:
True

Nice, sanity check passed!

For now we are only interested in the variable for mean monthly temperature. After looking into the variable description file (here) we find that the variable of interest is encoded as MO_TT (monthly mean of the daily air temperature two meters above ground).

We take a look at rows 1 through 9 of the variable MO_TT.

In [40]:
dwd_month["MO_TT"][1:10]
Out[40]:
1     1.1
2     5.2
3     9.0
4    15.1
5    19.0
6    21.4
7    18.8
8    13.9
9     9.0
Name: MO_TT, dtype: float64

Finally, we store the temperature vector as a pandas.Series object by calling the pd.Series() function. We pass the function a vector with the data and a vector with the corresponding dates, which is set as the index. Recall that we already created a vector of dates when building the dwd_month data frame.

In [41]:
ts_FUB_monthly = pd.Series(dwd_month["MO_TT"].values, index=dwd_month["Date"])

For a simple visualization, we apply the plot() function directly to the ts_FUB_monthly series.

In [42]:
plt.figure(figsize=(18, 4))
ts_FUB_monthly.plot()
plt.show()
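Since the series is now indexed by date, pandas resampling works out of the box. As a quick optional illustration (not part of the original workflow), annual mean temperatures can be computed with resample():

# optional: aggregate the monthly series to annual means
ts_FUB_annual = ts_FUB_monthly.resample("AS").mean()  # "AS": year-start frequency ("YS" in newer pandas)
ts_FUB_annual.tail()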

Daily time series data Berlin-Dahlem (FU)

In this section we process the daily time series for the DWD station Berlin-Dahlem (FU), downloaded on 2022-07-22 from the Climate Data Center. A detailed variable description is available here. For the purpose of this tutorial the daily time series data set is made available here.

The preprocessing procedure is very similar to the one for monthly time series data. Please revisit the previous section if you cannot follow the outlined steps below.

In [43]:
url = "https://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/daily/kl/historical/tageswerte_KL_00403_19500101_20211231_hist.zip"


r = requests.get(url)
z = zipfile.ZipFile(io.BytesIO(r.content))
z.extractall("../data")
In [44]:
data_raw = pd.read_csv(
    "../data/produkt_klima_tag_19500101_20211231_00403.txt",
    sep=";",
    na_values=["-999"],
    skipinitialspace=True,
)
data_raw
Out[44]:
STATIONS_ID MESS_DATUM QN_3 FX FM QN_4 RSK RSKF SDK SHK_TAG NM VPM PM TMK UPM TXK TNK TGK eor
0 403 19500101 NaN NaN NaN 5 2.2 7 NaN 0.0 5.0 4.0 1025.60 -3.2 83.00 -1.1 -4.9 -6.3 eor
1 403 19500102 NaN NaN NaN 5 12.6 8 NaN 0.0 8.0 6.1 1005.60 1.0 95.00 2.2 -3.7 -5.3 eor
2 403 19500103 NaN NaN NaN 5 0.5 1 NaN 0.0 5.0 6.5 996.60 2.8 86.00 3.9 1.7 -1.4 eor
3 403 19500104 NaN NaN NaN 5 0.5 7 NaN 0.0 7.7 5.2 999.50 -0.1 85.00 2.1 -0.9 -2.3 eor
4 403 19500105 NaN NaN NaN 5 10.3 7 NaN 0.0 8.0 4.0 1001.10 -2.8 79.00 -0.9 -3.3 -5.2 eor
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
26293 403 20211227 NaN NaN NaN 3 0.0 8 0.183 0.0 5.9 3.8 998.13 -3.7 79.67 -0.7 -7.9 -9.9 eor
26294 403 20211228 NaN NaN NaN 3 1.5 6 0.000 0.0 6.4 5.3 990.17 -0.5 88.46 2.7 -3.9 -5.1 eor
26295 403 20211229 NaN NaN NaN 3 0.3 6 0.000 0.0 7.5 8.2 994.40 4.0 100.00 5.6 1.8 0.0 eor
26296 403 20211230 NaN NaN NaN 3 3.2 6 0.000 0.0 7.9 11.5 1001.70 9.0 98.54 12.7 4.6 2.3 eor
26297 403 20211231 NaN NaN NaN 3 5.5 6 0.000 0.0 7.7 12.5 1004.72 12.8 84.96 14.0 11.5 10.7 eor

26298 rows × 19 columns

For now we are interested in the variables for rainfall and temperature. After reviewing the variable description file (here) we find that the daily rainfall in mm is encoded in the RSK variable and the daily mean temperature two meters above ground in the TMK variable. For clarity we construct a new data frame containing the two variables of interest.

In [45]:
data_raw["MESS_DATUM"] = pd.to_datetime(data_raw["MESS_DATUM"], format="%Y%m%d")
In [46]:
ts_FUB_daily = pd.DataFrame(
    {"Temp": data_raw["TMK"], "Rain": data_raw["RSK"]}
).set_index(data_raw["MESS_DATUM"])

Let us plot the data, showing Temp and Rain as subplots.

In [49]:
fig, ax = plt.subplots(2, 1, figsize=(18, 8))

ax[0].plot(ts_FUB_daily["Temp"])
ax[0].set_title("Temp")

ax[1].plot(ts_FUB_daily["Rain"], color="orange")
ax[1].set_title("Rain")
plt.show()
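As an optional plausibility check (not part of the original workflow), the daily temperatures can be aggregated to monthly means, which should be close to the MO_TT values in the monthly data set:

# optional cross-check: monthly means computed from the daily series
monthly_from_daily = ts_FUB_daily["Temp"].resample("MS").mean()
print(monthly_from_daily.tail(3))
print(ts_FUB_monthly.tail(3))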

Hourly time series data Berlin-Dahlem (FU)

In this section we process the hourly time series of rainfall data for the DWD station Berlin-Dahlem (FU), downloaded on 2022-07-22 from the Climate Data Center. A detailed variable description is available here. For the purpose of this tutorial the hourly time series data set is made available here.

The preprocessing procedure is very similar to the one for monthly and daily data. Please revisit the previous sections if you cannot follow the outlined steps below.

In [50]:
url = "https://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/hourly/precipitation/historical/stundenwerte_RR_00403_20020128_20211231_hist.zip"


r = requests.get(url)
z = zipfile.ZipFile(io.BytesIO(r.content))
z.extractall("../data")
In [51]:
data_raw = pd.read_csv(
    "../data/produkt_rr_stunde_20020128_20211231_00403.txt",
    sep=";",
    na_values=["-999"],
    skipinitialspace=True,
)
data_raw
Out[51]:
STATIONS_ID MESS_DATUM QN_8 R1 RS_IND WRTR eor
0 403 2002012811 1 0.0 0.0 NaN eor
1 403 2002012813 1 0.0 0.0 NaN eor
2 403 2002012815 1 1.7 1.0 NaN eor
3 403 2002012818 1 1.1 1.0 NaN eor
4 403 2002012821 1 0.0 0.0 NaN eor
... ... ... ... ... ... ... ...
174018 403 2021123119 3 0.7 1.0 6.0 eor
174019 403 2021123120 3 0.7 1.0 6.0 eor
174020 403 2021123121 3 0.1 1.0 NaN eor
174021 403 2021123122 3 0.1 1.0 6.0 eor
174022 403 2021123123 3 0.0 1.0 6.0 eor

174023 rows × 7 columns

The data set contains hourly rainfall measurements for the weather station Berlin-Dahlem (FU). After reviewing the variable description file (here) we found that the hourly rainfall in mm is encoded in the R1 variable.

In [52]:
data_raw["MESS_DATUM"] = pd.to_datetime(data_raw["MESS_DATUM"], format="%Y%m%d%H")
In [53]:
ts_FUB_hourly = pd.Series(data_raw["R1"].values, index=data_raw["MESS_DATUM"])
# note the assignment: fillna() returns a new Series and does not modify in place by default
ts_FUB_hourly = ts_FUB_hourly.fillna(0)
ts_FUB_hourly
Out[53]:
MESS_DATUM
2002-01-28 11:00:00    0.0
2002-01-28 13:00:00    0.0
2002-01-28 15:00:00    1.7
2002-01-28 18:00:00    1.1
2002-01-28 21:00:00    0.0
                      ... 
2021-12-31 19:00:00    0.7
2021-12-31 20:00:00    0.7
2021-12-31 21:00:00    0.1
2021-12-31 22:00:00    0.1
2021-12-31 23:00:00    0.0
Length: 174023, dtype: float64

In the next step we simply plot the data again with the convenient plot() function.

In [54]:
plt.figure(figsize=(18, 4))
ts_FUB_hourly.plot()
plt.ylabel("Rainfall in mm/h", fontsize=13)
plt.xlabel("Date", fontsize=13)
plt.show()
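The hourly series can be aggregated in the same fashion as before. As an optional sketch, daily rainfall totals are obtained via resample():

# optional: aggregate hourly rainfall to daily totals
daily_totals = ts_FUB_hourly.resample("D").sum()
daily_totals.describe()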

Finally, we convert the monthly, daily and hourly time series to DataFrames and save each of them to a .json file using the to_json() function for further processing.

In [55]:
# convert the pandas series to DataFrames; the monthly series holds temperatures (MO_TT)
ts_FUB_monthly_df = ts_FUB_monthly.to_frame(name="temperature").reset_index()
ts_FUB_hourly_df = ts_FUB_hourly.to_frame(name="rainfall").reset_index()
ts_FUB_daily_df = ts_FUB_daily.reset_index()
In [56]:
# save them as json files, remember to set date_format="iso"
ts_FUB_monthly_df.to_json("../data/ts_FUB_monthly.json", date_format="iso")
ts_FUB_hourly_df.to_json("../data/ts_FUB_hourly.json", date_format="iso")
ts_FUB_daily_df.to_json("../data/ts_FUB_daily.json", date_format="iso")



To read the .json files back in, use the following command and date conversion.

In [57]:
ts_FUB_monthly_df = pd.read_json("../data/ts_FUB_monthly.json")
ts_FUB_monthly_df["Date"] = pd.to_datetime(ts_FUB_monthly_df["Date"], format="%Y-%m-%d", errors="coerce")
ts_FUB_monthly_df
Out[57]:
Date temperature
0 1719-01-01 00:00:00+00:00 2.80
1 1719-02-01 00:00:00+00:00 1.10
2 1719-03-01 00:00:00+00:00 5.20
3 1719-04-01 00:00:00+00:00 9.00
4 1719-05-01 00:00:00+00:00 15.10
... ... ...
3631 2021-08-01 00:00:00+00:00 17.43
3632 2021-09-01 00:00:00+00:00 15.55
3633 2021-10-01 00:00:00+00:00 10.49
3634 2021-11-01 00:00:00+00:00 6.28
3635 2021-12-01 00:00:00+00:00 2.19

3636 rows × 2 columns
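To continue working with the reloaded data as a time series, the Date column can be set as the index again (an optional step; the column name temperature matches the DataFrame created above):

# optional: restore a Series with a DatetimeIndex from the reloaded DataFrame
ts_restored = ts_FUB_monthly_df.set_index("Date")["temperature"]
ts_restored.head()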


Citation

The E-Learning project SOGA-Py was developed at the Department of Earth Sciences by Annette Rudolph, Joachim Krois and Kai Hartmann. You can reach us via e-mail at soga[at]zedat.fu-berlin.de.

Creative Commons License
You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follows: Rudolph, A., Krois, J., Hartmann, K. (2023): Statistics and Geodata Analysis using Python (SOGA-Py). Department of Earth Sciences, Freie Universitaet Berlin.