In the previous section we calculated the eigenvectors of the data covariance matrix (the principal components) and plotted them together with the observations in our data set. In this section we focus on the eigenvalues of the data covariance matrix. Recall that the eigenvalues correspond to the variances of their respective principal components. In fact, the eigenvector with the largest eigenvalue is the first principal component of the data set. Consequently, once the eigenvectors of the covariance matrix are found, the next step is to order them by eigenvalue, from highest to lowest. This gives the principal components in order of significance.
In general, an $n \times d$ data matrix $\mathbf{X}$ has $\min(n-1, d)$ distinct principal components. If $n \ge d$ we may calculate $d$ eigenvectors and $d$ eigenvalues. We then pick, based on the magnitude of the eigenvalues, only the first $k$ eigenvectors. By neglecting some principal components we lose some information, but if the corresponding eigenvalues are small, we do not lose much. The goal is to find the smallest number of principal components required to get a good representation of the original data.
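Before we do so, here is a minimal NumPy sketch of this ordering-and-truncation step, using a hypothetical $50 \times 5$ data matrix (all variable names below are illustrative and not part of the food texture example):
# Minimal sketch (hypothetical data): order the eigenpairs by eigenvalue and keep the first k
import numpy as np
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))                      # hypothetical n x d data matrix
Xc = X - X.mean(axis=0)                           # center each column
C = np.cov(Xc, rowvar=False)                      # d x d covariance matrix
evals, evecs = np.linalg.eig(C)
order = np.argsort(evals)[::-1]                   # sort by eigenvalue, highest first
evals, evecs = evals[order], evecs[:, order]
k = 2                                             # number of principal components to keep
W = evecs[:, :k]                                  # d x k matrix of the first k eigenvectors
X_reduced = Xc @ W                                # n x k representation of the data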
So far we have worked with a two-dimensional toy data set. This setting is not very instructive for choosing the most relevant principal components, as there are only two choices: keep both of them ($k=2$) or drop one ($k=1$). To explore this question further, we use the full food texture data set in this section.
%load_ext autoreload
%autoreload 2
%matplotlib inline
# disable warnings
import warnings
warnings.filterwarnings("ignore")
# Import libraries and load data
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
sns.set()
# Load data
df = pd.read_csv("https://userpage.fu-berlin.de/soga/data/raw-data/food-texture.csv", index_col=0)
print(df.head())
      Oil  Density  Crispy  Fracture  Hardness
B110  16.5     2955      10        23        97
B136  17.7     2660      14         9       139
B171  16.2     2870      12        17       143
B192  16.7     2920      10        31        95
B225  16.3     2975      11        26       143
# Center and scale each column of the DataFrame
df_scaled = (df - df.mean()) / df.std()
# Save the scaled data to feather format
df_scaled.reset_index(drop=True, inplace=True)
df_scaled.to_feather("food-texture-scaled.feather")
The food texture data set consists of 50 rows (observations) and 5 columns (features/variables). These features are Oil, Density, Crispy, Fracture and Hardness.
Recall, the eigenvalues give the variances of their respective principal components and the ratio of the sum of the first $k$ eigenvalues to the sum of the variances of all $d$ original variables represents the proportion of the total variance in the original data set accounted for by the first $k$ principal components.
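Written as a formula, with $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_d$ denoting the ordered eigenvalues of the covariance matrix, the proportion of total variance explained by the first $k$ principal components reads

$$\frac{\lambda_1 + \lambda_2 + \dots + \lambda_k}{\lambda_1 + \lambda_2 + \dots + \lambda_d} = \frac{\sum_{i=1}^{k}\lambda_i}{\sum_{j=1}^{d}\lambda_j}.$$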
Let us calculate the eigenvalues for the food texture data set and the proportion of the total variance accounted for by each eigenvalue.
# Calculate the eigenvalues and eigenvectors
eig_vals, eig_vecs = np.linalg.eig(df_scaled.cov())
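Note that np.linalg.eig does not return the eigenvalues in any guaranteed order; in this example they happen to come out sorted from largest to smallest. A small defensive sketch, reusing the variables above, makes the ordering explicit:
# np.linalg.eig gives no guaranteed ordering; sort the eigenvalues (and their eigenvectors) from highest to lowest
order = np.argsort(eig_vals)[::-1]
eig_vals = eig_vals[order]
eig_vecs = eig_vecs[:, order]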
We make sure that the sum of the eigenvalues equals the total variance of the sample data.
# Calculate the total variance
total_var = df_scaled.var().sum()
print(f"Total variance: {total_var:.2f}")
# Calculate the sum of the eigenvalues
eig_vals_sum = eig_vals.sum()
print(f"Sum of eigenvalues: {eig_vals_sum:.2f}")
assert eig_vals_sum.round(2) == total_var.round(2)
Total variance: 5.00
Sum of eigenvalues: 5.00
All right!
To compute the proportion of variance explained by each principal component, we simply divide the variance explained by each principal component by the total variance explained by all principal components:
# Calculate the proportion of variance explained by each principal component
var_exp = eig_vals / total_var
print(f"Proportion of variance explained by each principal component:\n{var_exp}")
Proportion of variance explained by each principal component:
[0.60624263 0.25914115 0.06200987 0.04838402 0.02422233]
We see that the first principal component explains 61% of the variance in the data, the second 26%, the third 6%, the fourth 5% and the fifth 2%.
How many principal components are needed?
Unfortunately, there is no well-accepted objective way to decide how many principal components are enough (James et al. 2013). In fact, the answer depends on the specific area of application and the specific data set. However, there are three simple approaches which may provide guidance when deciding on the number of relevant principal components.
These are

- the scree plot,
- the proportion of variance explained, and
- the Kaiser-Guttman criterion (Kaiser's rule).
A widely applied approach is to decide on the number of principal components by examining a scree plot: we eyeball the plot and look for the point at which the proportion of variance explained by each subsequent principal component drops off. This point is often referred to as the elbow of the scree plot. Let us plot the proportion of explained variance for each principal component, both as absolute values (left plot) and as cumulative sums (right plot).
# Plot the proportion of explained variance by each principal component
pc_index = np.arange(1, len(var_exp) + 1)  # number the principal components from 1
fig, ax = plt.subplots(1, 2, figsize=(12, 4))
ax[0].plot(pc_index, var_exp, marker="o")
ax[0].set_xticks(pc_index)
ax[0].set_xlabel("Principal component")
ax[0].set_ylabel("Proportion of explained variance")
ax[0].set_title("Scree plot")
ax[1].plot(pc_index, np.cumsum(var_exp), marker="o")
ax[1].set_xticks(pc_index)
ax[1].set_xlabel("Principal component")
ax[1].set_ylabel("Cumulative sum of explained variance")
ax[1].set_title("Cumulative scree plot")
plt.show()
By looking at the plots we see a more or less pronounced drop-off (the elbow in the scree plot) after the third principal component. Thus, based on the scree plot we would decide to pick the first three principal components to represent our data set, thereby explaining 93% of the variance in the data set.
Another simple approach to decide on the number of principal components is to set a threshold, say 80%, and stop when the first $k$ components account for a percentage of total variation greater than this threshold (Jolliffe, 2002). In our example the first two components account for 87% of the variation. Thus, based on the variance-explained criterion we pick the first two principal components to represent our data set.
# Calculate the cumulative sum of the proportion of variance explained by each principal component
cum_var_exp = np.cumsum(var_exp)
print(cum_var_exp)
[0.60624263 0.86538379 0.92739365 0.97577767 1. ]
Note that the threshold is set somewhat arbitrarily; 70 to 90% are the usual sort of values, but it depends on the context of the data set and can be higher or lower (Lovric 2011).
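As a minimal sketch of applying this criterion programmatically (the variable threshold and the 80% value below are just illustrative), the smallest $k$ that reaches the threshold can be read off directly:
# Find the smallest k whose cumulative explained variance reaches the (illustrative) threshold
threshold = 0.8  # 70 to 90% are the usual sort of values
k = int(np.argmax(cum_var_exp >= threshold)) + 1  # index of the first True, converted to a count
print(f"Components needed to explain at least {threshold:.0%} of the variance: {k}")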
Kaiser’s rule (the Kaiser-Guttman criterion) is a widely used method to determine the maximum number of linear combinations to extract from the data set. According to this rule, only those principal components whose variances exceed 1 are retained. The idea behind the Kaiser-Guttman criterion is that any principal component with a variance of less than 1 contains less information than one of the original (standardized) variables and is therefore not worth retaining (Jolliffe, 2002).
Applying Kaiser’s rule to the food-texture data set results in keeping the first two principal components.
# Find the (zero-based) indices of the principal components with variances (eigenvalues) greater than 1
kaiser_rule = np.where(eig_vals > 1)[0]
print(kaiser_rule)
[0 1]
Citation
The E-Learning project SOGA-Py was developed at the Department of Earth Sciences by Annette Rudolph, Joachim Krois and Kai Hartmann. You can reach us via mail at soga[at]zedat.fu-berlin.de.
Please cite as follows: Rudolph, A., Krois, J., Hartmann, K. (2023): Statistics and Geodata Analysis using Python (SOGA-Py). Department of Earth Sciences, Freie Universitaet Berlin.