Before we start, we clean up the data set and exclude non-meaningful features as well as missing values. Then we create our binary response variable HOT_DAY, based on the variable MAX_AIR_TEMP. In the next step we split our data set at random into two distinct parts: one part is our training set, while the other part serves as our test set. The basic idea behind this data grouping strategy is that we learn the model parameters on the training set and then use the test set, a data set that was not involved at all in parameter estimation, to assess the goodness of fit of our model. In other words, by using two distinct data sets we may evaluate the generalization properties of our model.

For the sub-setting of the data set we draw a random Boolean mask with NumPy, which splits the data set into two sets (training and test set) in a predefined ratio. Finally, we assign the data matrices and the response vectors proper variable names for further processing.
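As an aside, a comparable split could be obtained with the train_test_split function from scikit-learn, which additionally offers stratified sampling to preserve the class ratio in both subsets. A minimal sketch, assuming scikit-learn is installed, where X and y are hypothetical names for the full feature matrix and response vector (the notebook builds X_train/X_test directly below):

from sklearn.model_selection import train_test_split

# 60/40 split; stratify=y keeps the HOT_DAY class ratio equal in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=30300
)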

In order to plot the data on a map, we use the Folium library.

In [3]:
# import packages

import folium
import numpy as np
import pandas as pd
In [4]:
# load data set
dwd = pd.read_csv("https://userpage.fu-berlin.de/soga/data/raw-data/DWD.csv", sep=",")
dwd.head()

# Exclude variables
cols_to_drop = [
    "DWD_ID",
    "STATION_NAME",
    "FEDERAL_STATE",
    "PERIOD",
    "MAX_WIND_SPEED", # shows many missing values
]
dwd = dwd.drop(cols_to_drop, axis=1) # drop columns

# Exclude missing values
dwd = dwd.dropna() # drop rows with missing values

# Create binary response variable HOT_DAY (1 if MAX_AIR_TEMP reached 30 °C, else 0)
dwd["HOT_DAY"] = np.where(dwd["MAX_AIR_TEMP"] >= 30, 1, 0)


# set seed for reproducibility
np.random.seed(30300)
# Split data into training and test set (approximately 60% training, 40% test)
split = np.random.rand(len(dwd)) < 0.6

# split the response vector HOT_DAY
HOT_DAY_train = dwd["HOT_DAY"][split]
HOT_DAY_test = dwd["HOT_DAY"][~split]

#########################
### Original data set ###
#########################
# training set
train_set = dwd[split]
# test set
test_set = dwd[~split]

# response vectors
y_train = HOT_DAY_train
y_test = HOT_DAY_test

# exclude the response variable and MAX_AIR_TEMP (from which HOT_DAY was derived)
X_train = train_set.drop(["HOT_DAY", "MAX_AIR_TEMP"], axis=1)
X_test = test_set.drop(["HOT_DAY", "MAX_AIR_TEMP"], axis=1)

####################
### PCA data set ###
####################

# load PC3 model from previous sections
pca_PC3 = pd.read_feather("https://userpage.fu-berlin.de/soga/data/py-data/pca_PC3.feather")
pca_PC3.head()
Out[4]:
PC1 PC2 PC3
0 -2.250742 1.188693 2.046782
1 -1.987427 0.603907 0.575597
2 2.224010 -1.298316 0.968473
3 1.173059 -2.923561 -1.130196
4 -1.992753 -1.187762 0.404263
In [6]:
# Split the data set:

# training set PCA
train_set_PC3 = pca_PC3[split == True]

# test set PCA
test_set_PC3 = pca_PC3[split == False]

# save the training and test data as well as the response vectors to feather files
# we have to reset the index, as feather does not support non-default indices
train_set_PC3.reset_index(drop=False).to_feather("train_set_PC3.feather")
test_set_PC3.reset_index(drop=False).to_feather("test_set_PC3.feather")

y_train.reset_index(drop=False).to_feather("y_train.feather")
y_test.reset_index(drop=False).to_feather("y_test.feather")

HOT_DAY_train.reset_index(drop=False).to_feather("HOT_DAY_train.feather")
HOT_DAY_test.reset_index(drop=False).to_feather("HOT_DAY_test.feather")

X_train.reset_index(drop=False).to_feather("X_train.feather")
X_test.reset_index(drop=False).to_feather("X_test.feather")

dwd.reset_index(drop=False).to_feather("dwd.feather")
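
To restore any of these objects later, one would read the feather file back and turn the saved index column into the index again; a minimal sketch, reusing a file name written above (the column is called "index" because the original index was unnamed):

# reload a feather file and restore the original index
train_set_PC3 = pd.read_feather("train_set_PC3.feather").set_index("index")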

Finally, there remain 238 observations with 16 features in the data set to train our model, and 159 observations with 16 features to evaluate our model.
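
These numbers are easily verified from the shapes of the data matrices; a quick check:

# dimensions of the training and test matrices
print(X_train.shape)  # should print (238, 16), given the split above
print(X_test.shape)   # should print (159, 16)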

We notice that the classes of our response variable HOT_DAY are not balanced.

In [7]:
dwd["HOT_DAY"].describe()
Out[7]:
count    397.000000
mean       0.863980
std        0.343243
min        0.000000
25%        1.000000
50%        1.000000
75%        1.000000
max        1.000000
Name: HOT_DAY, dtype: float64
In [8]:
dwd["HOT_DAY"].info()
<class 'pandas.core.series.Series'>
Int64Index: 397 entries, 1 to 597
Series name: HOT_DAY
Non-Null Count  Dtype
--------------  -----
397 non-null    int64
dtypes: int64(1)
memory usage: 6.2 KB
In [9]:
dwd["HOT_DAY"].value_counts()
Out[9]:
1    343
0     54
Name: HOT_DAY, dtype: int64

The ratio of weather stations that recorded at least one hot day to weather stations that have not yet recorded a hot day is roughly 6 to 1. When working with an imbalanced data set we have to be cautious with respect to model metrics such as the model accuracy. Consider a model that always predicts the more common class; in our example its accuracy would be $343/(54+343) \approx 0.86$, which is quite a high number for a non-predictive model. Hence, we will also consider other model metrics, such as the confusion matrix and Cohen's kappa. The value of $0.86$ will serve as our benchmark: whatever model we build, its accuracy must be better than $0.86$.
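
The benchmark accuracy of such a majority-class model follows directly from the class counts, and scikit-learn provides the function cohen_kappa_score for Cohen's kappa; a minimal sketch, assuming scikit-learn is installed:

from sklearn.metrics import cohen_kappa_score

# accuracy of a model that always predicts the majority class
counts = dwd["HOT_DAY"].value_counts()
print(counts.max() / counts.sum())  # 343/397, approximately 0.86

# such a model predicts 1 everywhere ...
y_naive = np.ones_like(dwd["HOT_DAY"])
# ... and scores a Cohen's kappa of 0, exposing its lack of predictive skill
print(cohen_kappa_score(dwd["HOT_DAY"], y_naive))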


However, before we continue, we should remind ourselves that the data set we are working with has a spatial component: our observations are point measurements of meteorological parameters spread across Germany. Let us plot a simple map to get an impression of the spatial distribution of our observations. For plotting we rely on the folium package.

In [10]:
# plot the map
m = folium.Map(
    location=[51.1657, 10.4515],
    zoom_start=7,
    tiles="cartodbpositron",
    width="100%",
    height="100%",
)

# add the observations as points to the map
for i in range(len(dwd)):
    folium.Circle(
        location=[dwd.iloc[i]["LAT"], dwd.iloc[i]["LON"]],
        color="crimson",
        fill=True,
        radius=100,
    ).add_to(m)

# show the map
m
Out[10]:
[Interactive folium map: the DWD weather stations plotted as points across Germany]
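
If the interactive map does not render in the notebook, it can instead be written to a standalone HTML file with folium's save method and opened in any browser (the file name here is arbitrary):

# export the interactive map as a standalone HTML file
m.save("dwd_stations_map.html")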

Citation

The E-Learning project SOGA-Py was developed at the Department of Earth Sciences by Annette Rudolph, Joachim Krois and Kai Hartmann. You can reach us via e-mail at soga[at]zedat.fu-berlin.de.

Creative Commons License
You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follows: Rudolph, A., Krois, J., Hartmann, K. (2023): Statistics and Geodata Analysis using Python (SOGA-Py). Department of Earth Sciences, Freie Universitaet Berlin.