30331_Data_preparation

Before we dive into PCA, we first prepare our data set. We use the DWD data set; you may download the DWD.csv file here. We import the data set and name it df.

In [1]:

# Import libraries
import pandas as pd

In [2]:

# Load the dwd data set
df = pd.read_csv("https://userpage.fu-berlin.de/soga/data/raw-data/DWD.csv")

The DWD data set consists of 1103 rows, each representing a particular weather station in Germany, and 21 columns, each corresponding to a variable/feature related to that weather station. These self-explanatory variables are: Stationid, Altitude, Lat, Lon, Name, Federal State, Period, Record_length, mean_annual_air_temp, mean_monthly_max_temp, mean_monthly_min_temp, mean_annual_wind_speed, mean_annual_cloud_cover, mean_annual_sunshine, mean_annual_rainfall, mean_monthly_max_wind_speed, max_air_temp, max_wind_speed, max_rainfall, min_air_temp, mean_range_air_temp.
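The dimensions and column names can be checked directly on a data frame. The following is a minimal sketch on a small stand-in data frame with made-up values (the column names and numbers are illustrative assumptions, not the DWD data); on the real data the same calls would be applied to df.

```python
import pandas as pd

# Hypothetical stand-in for df; all values are invented for illustration only
toy = pd.DataFrame(
    {
        "STATION_NAME": ["A", "B", "C"],
        "ALTITUDE": [44.0, 478.0, 3.0],
        "MEAN_ANNUAL_RAINFALL": [755.0, 1012.0, 690.0],
    }
)

# shape is a (number of rows, number of columns) tuple
n_rows, n_cols = toy.shape
column_names = list(toy.columns)

print(f"{n_rows} rows, {n_cols} columns")
print(column_names)
```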

The data was downloaded from the DWD (German Weather Service) data portal and preprocessed for the purpose of this tutorial. You may find a detailed description of the data set here.

By calling the head() method on the data set we see that some features, such as Stationid and Name, among others, do not carry any useful information for the analysis. Hence, we exclude them from the data set. We also exclude max_air_temp, because it serves as the response variable and is therefore not included in PCA. In addition, we exclude observations that contain missing values (NA). Because max_wind_speed has a disproportionately large number of missing values, we drop this variable from the data set before removing the incomplete observations.
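One way to see which features are most affected by missing values is to compute the share of missing entries per column. The sketch below uses a hypothetical helper, missing_share, on a made-up data frame; neither the helper nor the values are part of the original tutorial.

```python
import numpy as np
import pandas as pd


def missing_share(frame: pd.DataFrame) -> pd.Series:
    """Return the fraction of missing values per column, largest first."""
    return frame.isna().mean().sort_values(ascending=False)


# Invented example values; the column names are assumptions for illustration
toy = pd.DataFrame(
    {
        "ALTITUDE": [44.0, 478.0, 3.0, 120.0],
        "MAX_WIND_SPEED": [np.nan, np.nan, 31.2, np.nan],
        "MEAN_ANNUAL_RAINFALL": [755.0, np.nan, 690.0, 820.0],
    }
)

shares = missing_share(toy)
print(shares)
```

A column that stands out at the top of this ranking is a candidate for removal as a whole, since dropping its incomplete rows instead would discard many observations.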

In [3]:

# Exclude variables
cols_to_drop = [
    "DWD_ID",
    "STATION_NAME",
    "FEDERAL_STATE",
    "PERIOD",
    "MAX_WIND_SPEED", # shows many missing values
    "MAX_AIR_TEMP", # the response variable is not included in PCA
]

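# Drop the excluded columns, then every row that still contains a missing value
df = df.drop(columns=cols_to_drop)
df = df.dropna()
# Restore a default integer index; drop=False keeps the old row labels in an "index" column
df.reset_index(drop=False, inplace=True)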
df.head()

df.to_feather("DWD.feather")
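To make the effect of the cleaning steps concrete, here is a minimal sketch on a made-up data frame. It assumes the same sequence of operations as the cell above (drop columns, drop incomplete rows, reset the index); the column names and values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; row 2 has a missing altitude
toy = pd.DataFrame(
    {
        "STATION_NAME": ["A", "B", "C", "D"],
        "ALTITUDE": [44.0, 478.0, np.nan, 120.0],
        "MEAN_ANNUAL_RAINFALL": [755.0, 1012.0, 690.0, 820.0],
    }
)

cleaned = toy.drop(columns=["STATION_NAME"])  # remove a non-numeric identifier
cleaned = cleaned.dropna()  # remove rows with any missing value
cleaned = cleaned.reset_index(drop=False)  # old row labels move to an "index" column

print(cleaned)
```

With drop=False the former row labels are kept as a column named index. Passing drop=True would discard them instead, which may be preferable if that column should not enter later computations.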

Citation

The E-Learning project SOGA-Py was developed at the Department of Earth Sciences by Annette Rudolph, Joachim Krois and Kai Hartmann. You can reach us by e-mail at soga[at]zedat.fu-berlin.de.

You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follows: Rudolph, A., Krois, J., Hartmann, K. (2023): Statistics and Geodata Analysis using Python (SOGA-Py). Department of Earth Sciences, Freie Universitaet Berlin.