In the previous section we calculated the eigenvectors of the data covariance matrix (the principal components) and plotted them together with the observations in our data set. In this section we focus on the eigenvalues of the data covariance matrix. Recall that each eigenvalue corresponds to the variance of its respective principal component. In fact, the eigenvector with the highest eigenvalue is the first principal component of the data set. Consequently, once the eigenvectors of the covariance matrix have been found, the next step is to order them by eigenvalue, highest to lowest. This gives the principal components in order of significance.
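
As a side note, R's eigen() function already returns the eigenvalues sorted from highest to lowest, so no extra ordering step is needed. A minimal sketch on a small, hypothetical covariance matrix illustrates this:

# eigen() sorts the eigenvalues in decreasing order, so the columns of
# $vectors are already ordered by significance
C <- matrix(c(2, 1, 1, 3), nrow = 2)  # toy 2 x 2 covariance matrix
e <- eigen(C)
e$values
## [1] 3.618034 1.381966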

In general, an \(n \times d\) data matrix \(\mathbf{X}\) has \(\min(n-1, d)\) distinct principal components. If \(n \geq d\) we may calculate \(d\) eigenvectors and \(d\) eigenvalues. We then pick, based on the magnitude of the eigenvalues, only the first \(k\) eigenvectors. By neglecting some principal components we lose some information, but if the corresponding eigenvalues are small, we do not lose much. The goal is to find the smallest number of principal components required to get a good representation of the original data.
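
To make this reduction step concrete, here is a minimal sketch on a randomly generated toy data matrix (the matrix dimensions and the choice of \(k = 2\) are purely illustrative): we keep the first \(k\) eigenvectors and project the data onto them.

# project an n x d data matrix onto its first k principal components
set.seed(1)
X <- scale(matrix(rnorm(100 * 4), nrow = 100))  # toy 100 x 4 data matrix
k <- 2
V <- eigen(cov(X))$vectors       # eigenvectors, ordered by eigenvalue
scores <- X %*% V[, 1:k]         # n x k representation of the data
dim(scores)
## [1] 100   2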

So far we have worked with a two-dimensional toy data set. This setting is not very instructive for choosing the most relevant principal components, as there are just two options: keep both of them (\(k=2\)) or drop one (\(k=1\)). To explore this subject further, we work with the full food texture data set in this section.

food <- read.csv("food-texture.csv")

# load helper functions from previous sections
load("helper_functions_30300.RData")

library(dplyr)
# center and scale data set
# usage of the pipe operator %>% provided by the dplyr package
food_pca <- food[, 2:ncol(food)] %>%
  apply(MARGIN = 2, FUN = center) %>%
  apply(MARGIN = 2, FUN = scale)
dim(food_pca)
## [1] 50  5
# save food_pca for later usage
save(food_pca, file = "pca_food_30300.RData")

The food texture data set consists of 50 rows (observations) and 5 columns (features/variables). These features are Oil, Density, Crispy, Fracture and Hardness.

Recall that the eigenvalues give the variances of their respective principal components. The ratio of the sum of the first \(k\) eigenvalues to the sum of all \(d\) eigenvalues (the total variance of the \(d\) original variables) represents the proportion of the total variance in the original data set accounted for by the first \(k\) principal components.
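
Written out, with \(\lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_d\) denoting the ordered eigenvalues, the proportion of the total variance explained by the first \(k\) principal components is

\[\frac{\sum_{i=1}^{k} \lambda_i}{\sum_{i=1}^{d} \lambda_i}\,.\]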

Let us calculate the eigenvalues for the food texture data set and the proportion of the total variance accounted for by each principal component.

food_pca_eigen <- eigen(cov(food_pca))
food_pca_eigen$values
## [1] 3.0312132 1.2957058 0.3100493 0.2419201 0.1211116

We make sure that the sum of the eigenvalues equals the total variance of the sample data. Since we are comparing floating-point numbers, we test for (near) equality with all.equal() rather than with ==.

all.equal(sum(food_pca_eigen$values), sum(apply(food_pca, MARGIN = 2, FUN = var)))
## [1] TRUE

All right!
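
As a cross-check, the same identity can be verified directly on the covariance matrix, relying on the fact that the sum of the eigenvalues of a matrix equals its trace:

# the trace of the covariance matrix is the sum of its diagonal
# elements, i.e. the sum of the individual variances
all.equal(sum(food_pca_eigen$values), sum(diag(cov(food_pca))))
## [1] TRUE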

To compute the proportion of variance explained by each principal component, we divide its variance (eigenvalue) by the total variance explained by all principal components:

food_pca_ve <- food_pca_eigen$values / sum(food_pca_eigen$values)
food_pca_ve
## [1] 0.60624263 0.25914115 0.06200987 0.04838402 0.02422233

We see that the first principal component explains 61% of the variance in the data, the second principal component explains 26%, the third component 6%, the fourth component 5% and the fifth component 2% of the variance in the data.
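
These rounded percentages may also be computed directly:

round(food_pca_ve * 100)
## [1] 61 26  6  5  2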

How many principal components are needed?

Unfortunately, there is no well-accepted objective way to decide how many principal components are enough (James et al. 2013). In fact, the answer depends on the specific area of application and the specific data set. However, there are three simple approaches that may provide guidance in deciding on the number of relevant principal components.

These are

  1. the visual examination of a scree plot,
  2. the variance explained criterion, or
  3. the Kaiser rule.

The visual examination of a scree plot

A widely applied approach is to decide on the number of principal components by examining a scree plot: we eyeball the plot and look for the point at which the proportion of variance explained by each subsequent principal component drops off. This point is often referred to as an elbow in the scree plot. Let us plot the proportion of variance explained by each principal component, both as individual values (left plot) and as cumulative sums (right plot).

par(mfrow = c(1, 2), mar = c(4, 5, 3, 1))
plot(food_pca_ve,
     xlab = "Principal Component",
     ylab = "Proportion of Variance Explained",
     ylim = c(0, 1),
     type = "b",
     main = "Scree plot")

plot(cumsum(food_pca_ve),
     xlab = "Principal Component",
     ylab = "Cumulative Proportion of\nVariance Explained",
     ylim = c(0, 1),
     type = "b",
     main = "Scree plot")

Looking at the plots, we see a more or less pronounced drop-off (the elbow in the scree plot) after the third principal component. Thus, based on the scree plot we would pick the first three principal components to represent our data set, thereby explaining 93% of the variance in the data.


The variance explained criterion

Another simple approach to deciding on the number of principal components is to set a threshold, say 80%, and stop when the first \(k\) components account for a percentage of the total variation greater than this threshold (Jolliffe 2002). In our example the first two components account for 87% of the variation. Thus, based on the variance explained criterion we pick the first two principal components to represent our data set.

cumsum(food_pca_ve)
## [1] 0.6062426 0.8653838 0.9273937 0.9757777 1.0000000
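
This selection may also be automated. A minimal sketch, using the 80% threshold from above:

# pick the smallest k whose cumulative proportion of explained
# variance exceeds the chosen threshold
threshold <- 0.8
which(cumsum(food_pca_ve) > threshold)[1]
## [1] 2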

Note that the threshold is set somewhat arbitrarily; 70% to 90% are the usual sort of values, but depending on the context of the data set the threshold may be higher or lower (Lovric 2011).


Kaiser’s rule (Kaiser-Guttman criterion)

Kaiser's rule (also known as the Kaiser-Guttman criterion) is a widely used method to determine the maximum number of linear combinations to extract from a data set. According to this rule only those principal components whose variances exceed 1 are retained. The idea behind the Kaiser-Guttman criterion is that any principal component with a variance of less than 1 contains less information than one of the original variables, which each have variance 1 after standardization, and so is not worth retaining (Jolliffe 2002).

Applying Kaiser’s rule to the food-texture data set results in keeping the first two principal components.

food_pca_eigen$values[food_pca_eigen$values >= 1]
## [1] 3.031213 1.295706
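
Equivalently, we may simply count the retained components:

# number of principal components with a variance of at least 1
sum(food_pca_eigen$values >= 1)
## [1] 2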
