Before we dive into PCA, we first prepare our data. We use the DWD data set; you may download the DWD.csv file here. We import the data set and name it df.
# Import libraries
import pandas as pd
# Load the dwd data set
df = pd.read_csv("https://userpage.fu-berlin.de/soga/data/raw-data/DWD.csv")
The DWD data set consists of 1103 rows, each representing a particular weather station in Germany, and 21 columns, each corresponding to a variable/feature of that weather station. These self-explanatory variables are: Stationid, Altitude, Lat, Lon, Name, Federal State, Period, Record_length, mean_annual_air_temp, mean_monthly_max_temp, mean_monthly_min_temp, mean_annual_wind_speed, mean_annual_cloud_cover, mean_annual_sunshine, mean_annual_rainfall, mean_monthly_max_wind_speed, max_air_temp, max_wind_speed, max_rainfall, min_air_temp, mean_range_air_temp.
The data was downloaded from the DWD (German Weather Service) data portal and preprocessed for the purpose of this tutorial. You may find a detailed description of the data set here.
By calling the head() method on the data set, we see that some features, such as DWD_ID and STATION_NAME, among others, carry no information that is useful for PCA. Hence, we exclude them from the data set. We also exclude the variable MAX_WIND_SPEED, because it has a disproportionately large number of missing values, and MAX_AIR_TEMP, which serves as the response variable and is therefore not included in the PCA. Finally, we drop all remaining observations that contain missing values (NA).
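One way to see which variables are affected most by missing values is to compute the share of missing entries per column. The following is a minimal sketch on a small made-up frame: the helper name `missing_share` and all numbers are illustrative assumptions and are not taken from the DWD data.

```python
import numpy as np
import pandas as pd


def missing_share(frame: pd.DataFrame) -> pd.Series:
    """Return the fraction of missing values per column, largest first."""
    return frame.isna().mean().sort_values(ascending=False)


# Toy data for illustration only; these are not DWD values
toy = pd.DataFrame(
    {
        "ALTITUDE": [10.0, 250.0, 480.0, 35.0],
        "MAX_WIND_SPEED": [np.nan, 31.2, np.nan, np.nan],
        "MEAN_ANNUAL_RAINFALL": [650.0, np.nan, 910.0, 720.0],
    }
)

share = missing_share(toy)
print(share)
```

Calling the same helper on the imported df before any cleaning would list the columns with the largest share of missing values first, which can help decide whether to drop a column or only the affected rows.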
# Exclude variables
cols_to_drop = [
"DWD_ID",
"STATION_NAME",
"FEDERAL_STATE",
"PERIOD",
"MAX_WIND_SPEED", # shows many missing values
"MAX_AIR_TEMP", # the response variable is not included in PCA
]
df = df.drop(cols_to_drop, axis=1)
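# Drop all rows that still contain missing values
df = df.dropna(axis=0, how="any")
# Renumber the rows; drop=True keeps the old index from becoming an extra column
df = df.reset_index(drop=True)
df.head()
# Save the cleaned data set in Feather format
df.to_feather("DWD.feather")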
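The cleaning steps can be checked on a small example before applying them to the full data set. The sketch below wraps the same operations in a hypothetical helper `clean` and runs it on made-up values; the helper name and the toy frame are assumptions for illustration only.

```python
import numpy as np
import pandas as pd


def clean(frame: pd.DataFrame, cols_to_drop) -> pd.DataFrame:
    """Drop the given columns, then every row with a missing value."""
    out = frame.drop(columns=cols_to_drop)
    out = out.dropna(axis=0, how="any")
    # drop=True discards the old row labels instead of keeping them as a column
    return out.reset_index(drop=True)


# Toy data for illustration only; these are not DWD values
toy = pd.DataFrame(
    {
        "STATION_NAME": ["A", "B", "C", "D"],
        "ALTITUDE": [10.0, np.nan, 480.0, 35.0],
        "LAT": [52.5, 48.1, 50.9, np.nan],
    }
)

cleaned = clean(toy, ["STATION_NAME"])

# Sanity checks: no missing values remain and the index is consecutive
print(cleaned.isna().any().any())
print(cleaned.index.tolist())
```

Similar checks on the cleaned df, for example df.isna().any().any() and df.shape, give a quick confirmation that only complete observations and numeric features go into the PCA.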
Citation
The E-Learning project SOGA-Py was developed at the Department of Earth Sciences by Annette Rudolph, Joachim Krois and Kai Hartmann. You can reach us by e-mail at soga[at]zedat.fu-berlin.de.
Please cite as follows: Rudolph, A., Krois, J., Hartmann, K. (2023): Statistics and Geodata Analysis using Python (SOGA-Py). Department of Earth Sciences, Freie Universitaet Berlin.