To perform a PCA on a given data set, we first take a look at the data. In this section we analyse the food texture data set. This open-source data set is available here and describes texture measurements of a pastry-type food.

In [9]:
from pandas import read_csv
from sklearn.preprocessing import StandardScaler

import numpy as np
import matplotlib.pyplot as plt
In [10]:
food = read_csv("https://userpage.fu-berlin.de/soga/300/30100_data_sets/food-texture.csv", index_col=0)
food.head()
Out[10]:
Oil Density Crispy Fracture Hardness
B110 16.5 2955 10 23 97
B136 17.7 2660 14 9 139
B171 16.2 2870 12 17 143
B192 16.7 2920 10 31 95
B225 16.3 2975 11 26 143

The data set consists of 50 rows (observations) and 5 columns (features/variables); a quick check of its shape and summary statistics is shown after the list below. The features are:

  • Oil: percentage of oil in the pastry
  • Density: the product’s density (the higher the number, the denser the product)
  • Crispy: a crispiness measurement, on a scale from 7 to 15, with 15 being more crispy
  • Fracture: the angle, in degrees, through which the pastry can be slowly bent before it fractures
  • Hardness: a sharp point is used to measure the amount of force required before breakage occurs

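As a quick sanity check of the figures stated above, the dimensions and summary statistics of the data frame can be inspected; a minimal sketch (output omitted here):

In [ ]:
# Dimensions of the data set: (number of observations, number of features)
print(food.shape)

# Summary statistics (mean, standard deviation, quartiles) for each feature
food.describe()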
For the sake of comprehensibility we start with a reduced, two-dimensional toy data set obtained by extracting the columns Oil and Density. In a subsequent section we return to the full data set for our analyses.

In [11]:
pca_toy = food[["Oil", "Density"]]
pca_toy.head()
Out[11]:
Oil Density
B110 16.5 2955
B136 17.7 2660
B171 16.2 2870
B192 16.7 2920
B225 16.3 2975

We start with an exploratory data analysis and examine a scatter plot.
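One way to produce such a scatter plot is sketched below; this is a minimal version (axis labels and figure size chosen here for illustration):

In [ ]:
# Scatter plot of the two toy features
fig, ax = plt.subplots(figsize=(6, 5))
ax.scatter(pca_toy["Oil"], pca_toy["Density"])
ax.set_xlabel("Oil")
ax.set_ylabel("Density")
ax.set_title("Oil vs. Density")
plt.show()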

Scale the data

The scatter plot above indicates a relationship between the features Oil and Density. Note that the variables are not on the same scale. In general, the results of PCA depend on the scale of the variables. Therefore, each variable is typically centered and scaled to have a mean of zero and a standard deviation of one. In certain settings, however, the variables are measured in the same units, and one may skip the standardization.
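The visual impression of a relationship can also be quantified numerically; a minimal sketch using the (default) Pearson correlation of pandas (the numeric result is not reproduced here):

In [ ]:
# Pairwise Pearson correlation between Oil and Density
pca_toy.corr()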

For the sake of comprehensibility we use the StandardScaler of scikit-learn and visualize the effects of each pre-processing step. The goal is to center each column to zero mean and then scale it to have unit variance.

In [12]:
scaler = StandardScaler().fit(pca_toy)
pca_toy_standard = pca_toy.copy()
pca_toy_standard[["Oil", "Density"]] = scaler.transform(pca_toy)
pca_toy_standard.head()
Out[12]:
Oil Density
B110 -0.445430 0.790272
B136 0.315989 -1.603262
B171 -0.635784 0.100610
B192 -0.318527 0.506293
B225 -0.572333 0.952546
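The same transformation can be reproduced by hand, which makes explicit what StandardScaler does; a minimal sketch (note that StandardScaler divides by the population standard deviation, i.e. ddof=0):

In [ ]:
# Manual standardization: subtract the column mean and divide by the
# column standard deviation (ddof=0, matching StandardScaler)
manual = (pca_toy - pca_toy.mean()) / pca_toy.std(ddof=0)
manual.head()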

Note: you may need to install pyarrow (e.g. pip install pyarrow); it is required for saving the data in the Feather format at the end of the next cell.

In [16]:
# Set figure properties
fig, axs = plt.subplots(nrows=2, ncols=2, figsize=(12, 8))
fig.subplots_adjust(hspace=0.5, wspace=0.3)
axs = axs.ravel()

# Scatterplot 1
axs[0].scatter(pca_toy[["Oil"]], pca_toy[["Density"]])
axs[0].set_title("Raw data")
data_mean = np.mean(pca_toy, axis=0)
axs[0].scatter(data_mean["Oil"], data_mean["Density"], color="red", marker="o")  # mark the mean
# Boxplot 1
axs[1].boxplot(pca_toy)
axs[1].set_title("Raw data")

# Scatterplot 2
axs[2].scatter(pca_toy_standard[["Oil"]], pca_toy_standard[["Density"]])
axs[2].set_title("Scaled and centered data")
data_mean = np.mean(pca_toy_standard, axis=0)
axs[2].scatter(data_mean["Oil"], data_mean["Density"], color="red", marker="o")  # mark the mean
# Boxplot 2
axs[3].boxplot(pca_toy_standard)
axs[3].set_title("Scaled and centered data")

# Assign the pre-processed data set to a new variable
# (Feather does not support a non-default index, so reset it before saving)
pca_toy_data = pca_toy_standard.reset_index()

# save for later usage
pca_toy_data.to_feather("pca_food_toy_30300.feather")

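The saved file can be loaded again in a later session; a minimal sketch using pandas' read_feather on the file written above:

In [ ]:
from pandas import read_feather

# Reload the pre-processed toy data set
pca_toy_data = read_feather("pca_food_toy_30300.feather")
pca_toy_data.head()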
Citation

The E-Learning project SOGA-Py was developed at the Department of Earth Sciences by Annette Rudolph, Joachim Krois and Kai Hartmann. You can reach us via e-mail at soga[at]zedat.fu-berlin.de.

You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follows: Rudolph, A., Krois, J., Hartmann, K. (2023): Statistics and Geodata Analysis using Python (SOGA-Py). Department of Earth Sciences, Freie Universitaet Berlin.