To perform a PCA on a given data set, we first take a look at the data. In this section we analyse the food texture data set. This open-source data set is available here and describes texture measurements of a pastry-type food.

In [9]:
from pandas import read_csv
from sklearn.preprocessing import StandardScaler

import numpy as np
import matplotlib.pyplot as plt
In [10]:
food = read_csv("https://userpage.fu-berlin.de/soga/300/30100_data_sets/food-texture.csv", index_col=0)
food.head()
Out[10]:
Oil Density Crispy Fracture Hardness
B110 16.5 2955 10 23 97
B136 17.7 2660 14 9 139
B171 16.2 2870 12 17 143
B192 16.7 2920 10 31 95
B225 16.3 2975 11 26 143

The data set consists of 50 rows (observations) and 5 columns (features/variables); a quick check of its shape and summary statistics is shown after the list below. The features are:

  • Oil: percentage of oil in the pastry
  • Density: the product’s density (the higher the number, the denser the product)
  • Crispy: a crispiness measurement, on a scale from 7 to 15, with 15 being more crispy
  • Fracture: the angle, in degrees, through which the pastry can be slowly bent before it fractures
  • Hardness: a sharp point is used to measure the amount of force required before breakage occurs

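As a quick sanity check of the figures stated above, the dimensions and summary statistics of the data frame can be inspected; a minimal sketch (output omitted here):

In [ ]:
# Dimensions of the data set: (number of observations, number of features)
print(food.shape)

# Summary statistics (mean, standard deviation, quartiles) for each feature
food.describe()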
For the sake of comprehensibility we start with a reduced, two-dimensional toy data set obtained by extracting the columns Oil and Density. In a subsequent section we return to the full data set for our analyses.

In [11]:
pca_toy = food[["Oil", "Density"]]
pca_toy.head()
Out[11]:
Oil Density
B110 16.5 2955
B136 17.7 2660
B171 16.2 2870
B192 16.7 2920
B225 16.3 2975

We start with an exploratory data analysis and examine a scatter plot.
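One way to produce such a scatter plot is sketched below; this is a minimal version (axis labels and figure size chosen here for illustration):

In [ ]:
# Scatter plot of the two toy features
fig, ax = plt.subplots(figsize=(6, 5))
ax.scatter(pca_toy["Oil"], pca_toy["Density"])
ax.set_xlabel("Oil")
ax.set_ylabel("Density")
ax.set_title("Oil vs. Density")
plt.show()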

Scale the data

The scatter plot above indicates a relationship between the features Oil and Density. Note that the variables are not on the same scale. In general, the results of PCA depend on the scale of the variables. Therefore, each variable is typically centered and scaled to have a mean of zero and a standard deviation of one. In certain settings, however, the variables are measured in the same units, and one may skip the standardization.
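The visual impression of a relationship can also be quantified numerically; a minimal sketch using the (default) Pearson correlation of pandas (the numeric result is not reproduced here):

In [ ]:
# Pairwise Pearson correlation between Oil and Density
pca_toy.corr()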

For the sake of comprehensibility we use the StandardScaler of scikit-learn and visualize the effects of each pre-processing step. The goal is to center each column to zero mean and then scale it to have unit variance.

In [12]:
scaler = StandardScaler().fit(pca_toy)
pca_toy_standard = pca_toy.copy()
pca_toy_standard[["Oil", "Density"]] = scaler.transform(pca_toy)
pca_toy_standard.head()
Out[12]:
Oil Density
B110 -0.445430 0.790272
B136 0.315989 -1.603262
B171 -0.635784 0.100610
B192 -0.318527 0.506293
B225 -0.572333 0.952546
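The same transformation can be reproduced by hand, which makes explicit what StandardScaler does; a minimal sketch (note that StandardScaler divides by the population standard deviation, i.e. ddof=0):

In [ ]:
# Manual standardization: subtract the column mean and divide by the
# column standard deviation (ddof=0, matching StandardScaler)
manual = (pca_toy - pca_toy.mean()) / pca_toy.std(ddof=0)
manual.head()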

Note: you may need to install pyarrow (e.g. pip install pyarrow); it is required for saving the data in the Feather format at the end of the next cell.

In [16]:
# Set figure properties
fig, axs = plt.subplots(nrows=2, ncols=2, figsize=(12, 8))
fig.subplots_adjust(hspace=0.5, wspace=0.3)
axs = axs.ravel()

# Scatterplot 1
axs[0].scatter(pca_toy[["Oil"]], pca_toy[["Density"]])
axs[0].set_title("Raw data")
data_mean = np.mean(pca_toy, axis=0)
axs[0].scatter(data_mean["Oil"], data_mean["Density"], color="red", marker="o")  # mark the mean
# Boxplot 1
axs[1].boxplot(pca_toy)
axs[1].set_title("Raw data")

# Scatterplot 2
axs[2].scatter(pca_toy_standard[["Oil"]], pca_toy_standard[["Density"]])
axs[2].set_title("Scaled and centered data")
data_mean = np.mean(pca_toy_standard, axis=0)
axs[2].scatter(data_mean["Oil"], data_mean["Density"], color="red", marker="o")  # mark the mean
# Boxplot 2
axs[3].boxplot(pca_toy_standard)
axs[3].set_title("Scaled and centered data")

# Assign the pre-processed data set to a new variable
# (Feather does not support a non-default index, so reset it before saving)
pca_toy_data = pca_toy_standard.reset_index()

# save for later usage
pca_toy_data.to_feather("pca_food_toy_30300.feather")

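The saved file can be loaded again in a later session; a minimal sketch using pandas' read_feather on the file written above:

In [ ]:
from pandas import read_feather

# Reload the pre-processed toy data set
pca_toy_data = read_feather("pca_food_toy_30300.feather")
pca_toy_data.head()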
Citation

The E-Learning project SOGA-Py was developed at the Department of Earth Sciences by Annette Rudolph, Joachim Krois and Kai Hartmann. You can reach us via e-mail at soga[at]zedat.fu-berlin.de.

You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follows: Rudolph, A., Krois, J., Hartmann, K. (2023): Statistics and Geodata Analysis using Python (SOGA-Py). Department of Earth Sciences, Freie Universitaet Berlin.