Before performing a PCA on a given data set, we first take a look at the data. In this section we analyse the food texture data set. This open-source data set is available here and describes texture measurements of a pastry-type food.
from pandas import read_csv
from sklearn.preprocessing import StandardScaler
import numpy as np
import matplotlib.pyplot as plt
food = read_csv("https://userpage.fu-berlin.de/soga/300/30100_data_sets/food-texture.csv", index_col=0)
food.head()
|  | Oil | Density | Crispy | Fracture | Hardness |
|---|---|---|---|---|---|
| B110 | 16.5 | 2955 | 10 | 23 | 97 |
| B136 | 17.7 | 2660 | 14 | 9 | 139 |
| B171 | 16.2 | 2870 | 12 | 17 | 143 |
| B192 | 16.7 | 2920 | 10 | 31 | 95 |
| B225 | 16.3 | 2975 | 11 | 26 | 143 |
The data set consists of 50 rows (observations) and 6 columns: a sample identifier plus the five features Oil, Density, Crispy, Fracture and Hardness.
For the sake of comprehensibility we start with a reduced, two-dimensional toy data set obtained by extracting the columns Oil and Density. In a subsequent section we return to the full data set for our analyses.
pca_toy = food[["Oil", "Density"]]
pca_toy.head()
|  | Oil | Density |
|---|---|---|
| B110 | 16.5 | 2955 |
| B136 | 17.7 | 2660 |
| B171 | 16.2 | 2870 |
| B192 | 16.7 | 2920 |
| B225 | 16.3 | 2975 |
We start with an exploratory data analysis and examine a scatter plot.
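The plot image itself is not reproduced here; a minimal sketch of how such a scatter plot can be produced (using the five sample rows shown above as stand-in data for the full CSV, and a non-interactive backend for scripted use):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in data: the five sample rows shown above
pca_toy = pd.DataFrame(
    {"Oil": [16.5, 17.7, 16.2, 16.7, 16.3],
     "Density": [2955, 2660, 2870, 2920, 2975]},
    index=["B110", "B136", "B171", "B192", "B225"],
)

fig, ax = plt.subplots()
ax.scatter(pca_toy["Oil"], pca_toy["Density"])
ax.set_xlabel("Oil")
ax.set_ylabel("Density")
fig.savefig("oil_vs_density.png")
```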
The scatter plot above indicates a relationship between the features Oil and Density. Note that the two variables are not on the same scale. In general, the results of a PCA depend on the scale of the variables. Therefore, each variable is typically centered and scaled to have zero mean and unit standard deviation. In certain settings, however, the variables are measured in the same units, and one may skip the standardization.
We use the StandardScaler of scikit-learn and visualize the effect of each pre-processing step. The goal is to center each column to zero mean and then scale it to unit variance.
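What StandardScaler computes for each column is the z-score $z = (x - \mu)/\sigma$, using the population standard deviation (ddof=0). A quick NumPy sketch on the Oil sample values from the table above:

```python
import numpy as np

# Oil values from the five sample rows above
oil = np.array([16.5, 17.7, 16.2, 16.7, 16.3])

# Center to zero mean, scale to unit variance (ddof=0, as StandardScaler does)
oil_scaled = (oil - oil.mean()) / oil.std()

print(oil_scaled.mean())  # ~0
print(oil_scaled.std())   # ~1
```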
scaler = StandardScaler().fit(pca_toy)
pca_toy_standard = pca_toy.copy()
pca_toy_standard[["Oil", "Density"]] = scaler.transform(pca_toy)
pca_toy_standard.head()
|  | Oil | Density |
|---|---|---|
| B110 | -0.445430 | 0.790272 |
| B136 | 0.315989 | -1.603262 |
| B171 | -0.635784 | 0.100610 |
| B192 | -0.318527 | 0.506293 |
| B225 | -0.572333 | 0.952546 |
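As a sanity check, the standardized columns should have approximately zero mean and unit variance. A sketch of this check, again using the five sample rows above as stand-in data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Oil, Density sample rows from the table above (stand-in for the full data)
X = np.array([[16.5, 2955.0], [17.7, 2660.0], [16.2, 2870.0],
              [16.7, 2920.0], [16.3, 2975.0]])

X_scaled = StandardScaler().fit_transform(X)

print(X_scaled.mean(axis=0))  # both column means are ~0
print(X_scaled.std(axis=0))   # both column standard deviations are ~1
```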
Note: you may need to install pyarrow (e.g. via `pip install pyarrow`) for the Feather export at the end of this section.
# Set figure properties
fig, axs = plt.subplots(nrows=2, ncols=2, figsize=(12, 8))
fig.subplots_adjust(hspace=0.5, wspace=0.3)
axs = axs.ravel()
# Scatterplot 1
axs[0].scatter(pca_toy[["Oil"]], pca_toy[["Density"]])
axs[0].set_title("Raw data")
data_mean = np.mean(pca_toy, axis=0)
axs[0].scatter(data_mean["Oil"], data_mean["Density"], color="red", marker="o")  # mark mean
# Boxplot 1
axs[1].boxplot(pca_toy)
axs[1].set_title("Raw data")
# Scatterplot 2
axs[2].scatter(pca_toy_standard[["Oil"]], pca_toy_standard[["Density"]])
axs[2].set_title("Scaled and centered data")
data_mean = np.mean(pca_toy_standard, axis=0)
axs[2].scatter(data_mean["Oil"], data_mean["Density"], color="red", marker="o")  # mark mean
# Boxplot 2
axs[3].boxplot(pca_toy_standard)
axs[3].set_title("Scaled and centered data")
# Assign the pre-processed data set to a new variable;
# Feather cannot serialize a non-default index, so reset it first
pca_toy_data = pca_toy_standard.reset_index()
# save for later usage
pca_toy_data.to_feather("pca_food_toy_30300.feather")
Citation
The E-Learning project SOGA-Py was developed at the Department of Earth Sciences by Annette Rudolph, Joachim Krois and Kai Hartmann. You can reach us via mail by soga[at]zedat.fu-berlin.de.
Please cite as follows: Rudolph, A., Krois, J., Hartmann, K. (2023): Statistics and Geodata Analysis using Python (SOGA-Py). Department of Earth Sciences, Freie Universitaet Berlin.