Suppose we have \(n\) measurements on a vector \(\mathbf x\) of \(d\) features, and we wish to reduce the dimension from \(d\) to \(k\), where \(k\) is typically much smaller than \(d\). PCA does this by finding linear combinations, called principal components, that successively have maximum variance for the data.

The core principle in PCA is that the principal components are obtained from the eigenvectors of the covariance matrix (Hastie et al. 2008). The eigenvectors (principal components) determine the directions of the new feature space, and the eigenvalues give the variances of their respective principal components. The ratio of the sum of the first \(k\) eigenvalues to the sum of the variances of all \(d\) original variables represents the proportion of the total variance in the original data set accounted for by the first \(k\) principal components.
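Writing \(\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_d\) for the eigenvalues ordered by size, this proportion of explained variance can be written as

\[\frac{\sum_{j=1}^{k}\lambda_j}{\sum_{j=1}^{d}\lambda_j},\]

since the sum of all \(d\) eigenvalues equals the total variance of the original variables.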

There are two very popular approaches to computing the principal components: one is referred to as the spectral decomposition or eigendecomposition, the other as the singular value decomposition (SVD). In this tutorial we focus on the eigendecomposition approach because it is more intuitive; however, please note that most PCA software implementations perform an SVD to improve computational efficiency.


Calculation of principal components

Let us compute the principal components of our simplified data set from the previous section (pca_toy_data). First, let us start with some mathematical notation.

The data covariance matrix is defined by

\[S= \frac{1}{N}\sum_{n=1}^{N}(x_n-\overline{x})(x_n-\overline{x})^T.\] The covariance matrix \(S\) is often denoted as \(\mathbf{\Sigma}\).
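As a side note, R's cov() function uses the unbiased denominator \(N-1\) instead of \(N\). The following minimal sketch, using a small made-up data matrix rather than the tutorial data, shows how the covariance matrix can be computed from the centered data and compared with cov():

# Sketch: compute the covariance matrix manually and compare with cov()
# (note that cov() divides by N - 1, the unbiased estimator)
set.seed(1)
X  <- matrix(rnorm(20), ncol = 2)             # hypothetical 10 x 2 data matrix
Xc <- scale(X, center = TRUE, scale = FALSE)  # subtract the column means
S_manual <- t(Xc) %*% Xc / (nrow(X) - 1)
all.equal(S_manual, cov(X), check.attributes = FALSE) # TRUE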

Eigenvectors are a special type of vector that fulfill the equation

\[\mathbf{A}\vec{\mathbf v}=\lambda\vec{\mathbf v},\] where \(\mathbf{A}\) is a square matrix, \(\vec{\mathbf v}\) is a non-zero vector (\(\vec{\mathbf v}\neq\vec{\mathbf 0}\)), called an eigenvector, and \(\lambda\) is a scalar (\(\lambda\in\mathbb{R}\)), called the corresponding eigenvalue.

The matrix \(\mathbf{A}\) is a symmetric similarity matrix such as a correlation matrix or a covariance matrix. Furthermore, the covariance matrix of scaled variables (\(\mu = 0\) and \(\sigma = 1\)) is equal to their correlation matrix. Thus we consider \[ \mathbf{A} = \mathbf{S} \ (= \mathbf{\Sigma}).\]
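As a quick sanity check, the following minimal sketch, using a small made-up symmetric matrix rather than the tutorial data, verifies the defining equation \(\mathbf{A}\vec{\mathbf v}=\lambda\vec{\mathbf v}\) with R's eigen() function:

# Sketch: verify A v = lambda v for a small symmetric example matrix
A <- matrix(c(2, 1, 1, 2), nrow = 2)
e <- eigen(A)
A %*% e$vectors[, 1]           # left-hand side: A v_1
e$values[1] * e$vectors[, 1]   # right-hand side: lambda_1 v_1 (the same values)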

Just to make sure we understand the mechanics, we compute the principal components manually in R. The data covariance matrix is computed by applying the function cov(), and for the computation of the eigenvalues and eigenvectors we apply the eigen() function. Please note that the eigen() function returns both the eigenvalues and the eigenvectors. In general, an \(n×d\) data matrix has \(\min(n−1,d)\) distinct principal components. Thus, for our specific data set we expect \(\min(n−1,d)=2\) eigenvectors and accordingly \(2\) eigenvalues.

# load data from previous section
load("pca_food_toy_30300.RData")

# Compute eigenvalues and eigenvectors of covariance matrix (= correlation matrix) of our scaled variables
pca_toy_cov <- cov(pca_toy_data)
pca_toy_eigen <- eigen(pca_toy_cov)

# Access the eigenvectors
rownames(pca_toy_eigen$vectors) <- c("Oil", "Density") # label output matrix
colnames(pca_toy_eigen$vectors) <- c("PC1", "PC2") # label output matrix
pca_toy_eigen$vectors
##                PC1        PC2
## Oil     -0.7071068 -0.7071068
## Density  0.7071068 -0.7071068
# Access the eigenvalues
pca_toy_eigen$values
## [1] 1.750024 0.249976

Hence, we found the two eigenvectors (principal components) and the two eigenvalues of the data covariance matrix. We denote them as \(\mathbf{p}_j\) and \(\lambda_j\), respectively, where \(j=1,2,\dots,\min(n−1,d)\). (Note that the principal components are only unique up to a sign change.)

\[\mathbf{p}_1=\begin{bmatrix}-0.7071068\\0.7071068\end{bmatrix}, \ \mathbf{p}_2=\begin{bmatrix}-0.7071068\\-0.7071068\end{bmatrix},\ \lambda_1=1.750024,\ \lambda_2=0.249976.\]
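As a cross-check, and assuming pca_toy_data is already centered and scaled as in the previous section, the built-in prcomp() function, which relies on the SVD mentioned above, should return the same principal components and variances, up to the sign of the eigenvectors:

# Cross-check with prcomp(), which performs the PCA via SVD internally
pca_toy_prcomp <- prcomp(pca_toy_data)
pca_toy_prcomp$rotation   # compare with pca_toy_eigen$vectors (possibly sign-flipped)
pca_toy_prcomp$sdev^2     # compare with pca_toy_eigen$values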


Geometric explanation of PCA

So what do principal components mean?

Principal components provide us with information about the patterns in the data. By taking the eigenvectors of the covariance matrix, we extract the directions along which the data varies most. Another, but equivalent, way of expressing this is to find the best-fit line, i.e. the line that explains all the observations with the minimum (orthogonal) residual error.
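To illustrate the minimum-residual view, here is a minimal sketch that reuses pca_toy_data and pca_toy_eigen from above and compares the sum of squared orthogonal residuals for the PC1 direction with an arbitrary direction, for example the Oil axis:

# Sketch: PC1 minimizes the sum of squared orthogonal (perpendicular) residuals
X <- as.matrix(pca_toy_data)
rss <- function(v) {              # v: unit-length direction vector
  proj <- X %*% v %*% t(v)        # orthogonal projection of each point onto the line
  sum((X - proj)^2)               # sum of squared perpendicular distances
}
rss(pca_toy_eigen$vectors[, 1])   # residual sum of squares for PC1 (the smallest)
rss(c(1, 0))                      # residual sum of squares for the Oil axis (larger)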

Let us plot our data points and the first principal component. The data points are given as black dots. The red line, defined by the principal component vector of the first principal component, goes in the direction of maximum variance of the projections onto the line.

# slope of the first principal component (change in Density per change in Oil)
s1 <- pca_toy_eigen$vectors[2, 1] / pca_toy_eigen$vectors[1, 1] # PC1

plot(pca_toy_data,
     asp = 1,
     pch = 16,
     xlab = "Oil",
     ylab = "Density",
     main = "First principal component")
abline(a = 0, b = s1, col = "red")
legend("topleft", legend = "PC1", col = "red", lty = 1)

By adding the second principal component to the plot, we see that the two components are perpendicular to each other. The second eigenvector captures the other, less important, pattern in the data: all the points follow the main line, but are offset from it by some amount.

# principal component slopes (change in Density per change in Oil)
s1 <- pca_toy_eigen$vectors[2, 1] / pca_toy_eigen$vectors[1, 1] # PC1
s2 <- pca_toy_eigen$vectors[2, 2] / pca_toy_eigen$vectors[1, 2] # PC2

plot(pca_toy_data,
     asp = 1,
     pch = 16,
     xlab = "Oil",
     ylab = "Density",
     main = "First principal component")
abline(a = 0, b = s1, col = "red")
abline(a = 0, b = s2, col = "blue")
legend("topleft", legend = c("PC1", "PC2"), col = c("red", "blue"), lty = 1)

Thus, the two principal components are perpendicular to each other and, in this example, are rotated by exactly 45° relative to the original coordinate axes.
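We can also confirm the perpendicularity numerically: the eigenvectors returned by eigen() are orthonormal, so the matrix product \(\mathbf{P}^T\mathbf{P}\) should be (up to floating-point rounding) the identity matrix:

# The eigenvectors are orthonormal: t(P) %*% P is (numerically) the identity matrix
crossprod(pca_toy_eigen$vectors)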

Furthermore, PC1 has a 75% higher variance than each of the original scaled variables, which have \(s^2 = 1\), but the total variance is preserved: \[\lambda_1+\lambda_2= 1.750024 + 0.249976 = 2. \]
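This can be verified directly with the objects computed above: the sum of the eigenvalues equals the sum of the variances on the diagonal of the covariance matrix, and dividing each eigenvalue by this total gives the proportion of variance explained by each principal component:

# Total variance is preserved: sum of eigenvalues = trace of the covariance matrix
sum(pca_toy_eigen$values)
sum(diag(pca_toy_cov))

# Proportion of the total variance explained by each principal component
pca_toy_eigen$values / sum(pca_toy_eigen$values)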


Citation

The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us via mail at soga[at]zedat.fu-berlin.de.

You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follows: Hartmann, K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.