Suppose we have $n$ measurements on a vector $\mathbf{x}$ of $d$ features, and we wish to reduce the dimension from $d$ to $k$, where $k$ is typically much smaller than $d$. PCA does this by finding linear combinations, called principal components, that successively have maximum variance for the data.
The core principle in PCA is that the principal components are obtained from the eigenvectors of the covariance matrix (Hastie et al. 2008). The eigenvectors (principal components) determine the directions of the new feature space, and the eigenvalues give the variances of their respective principal components. The ratio of the sum of the first $k$ eigenvalues to the sum of the variances of all $d$ original variables represents the proportion of the total variance in the original data set accounted for by the first $k$ principal components.
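Written as a formula, if $s_1^2, \ldots, s_d^2$ denote the variances of the original variables and the eigenvalues are ordered so that $\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_d$, this proportion reads
$$\frac{\lambda_1 + \ldots + \lambda_k}{s_1^2 + \ldots + s_d^2} = \frac{\sum_{j=1}^{k}\lambda_j}{\sum_{j=1}^{d}\lambda_j},$$
since the sum of all $d$ eigenvalues equals the sum of the variances of the original variables (the trace of the covariance matrix).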
There are two popular approaches to computing the principal components: one is based on the spectral decomposition (eigendecomposition) of the covariance matrix, whereas the other is based on the singular value decomposition (SVD) of the data matrix. In this tutorial we focus on the eigendecomposition approach, as it is more intuitive; however, please note that most PCA software implementations perform an SVD for reasons of numerical stability and computational efficiency.
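As a brief, optional illustration of this equivalence (using synthetic data, not the tutorial data set), the following sketch shows that the eigendecomposition of the covariance matrix and the SVD of the centered data matrix yield the same principal component directions, up to sign, and the same variances:
import numpy as np
# Illustration only: synthetic data with two correlated features
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2)) @ np.array([[2.0, 0.5], [0.5, 1.0]])
# Approach 1: eigendecomposition of the covariance matrix
S = np.cov(X, rowvar=False)
eig_vals, eig_vecs = np.linalg.eig(S)
# Approach 2: SVD of the centered data matrix
Xc = X - X.mean(axis=0)
U, sing_vals, Vt = np.linalg.svd(Xc, full_matrices=False)
# The rows of Vt span the same directions as the eigenvectors (up to sign),
# and the squared singular values divided by (n - 1) equal the eigenvalues.
print(eig_vecs, eig_vals)
print(Vt.T, sing_vals**2 / (X.shape[0] - 1))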
Let us compute the principal components of our simplified data set from the previous section (pca_toy_data). First, let us introduce some mathematical notation.
The data covariance matrix is defined by
$$S= \frac{1}{N}\sum_{n=1}^{N}(x_n-\overline{x})(x_n-\overline{x})^T.$$ The covariance matrix $S$ is often denoted by $\Sigma$. Note that sample-based implementations such as NumPy's np.cov() divide by $N-1$ instead of $N$; for the purposes of PCA this only rescales the eigenvalues, not the eigenvectors.
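To make the formula concrete, here is a minimal sketch (with an arbitrary toy matrix, purely for illustration) that computes the covariance matrix by centering the data, once with the factor $1/N$ as above and once with the factor $1/(N-1)$ used by np.cov():
import numpy as np
# Illustration only: N = 4 observations of d = 2 features
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
# Center the data by subtracting the feature means
Xc = X - X.mean(axis=0)
# Covariance with the 1/N factor (as in the formula above)
S_population = Xc.T @ Xc / X.shape[0]
# Sample covariance with the 1/(N-1) factor (as used by np.cov())
S_sample = Xc.T @ Xc / (X.shape[0] - 1)
print(S_sample)
print(np.cov(X, rowvar=False))  # identical to S_sample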
Eigenvectors are special vectors that fulfill the equation
$$\mathbf{A}\mathbf{v}=\lambda\mathbf{v},$$ where $\mathbf{A}$ is a matrix, $\mathbf{v}$ is a non-zero vector ($\mathbf{v}\neq\mathbf{0}$) called an eigenvector, and $\lambda$ is a scalar ($\lambda\in\mathbb{R}$) called an eigenvalue.
In PCA, the matrix $\mathbf{A}$ is a similarity matrix such as a correlation matrix or a covariance matrix. Furthermore, the covariance matrix of scaled variables ($\mu = 0$ and $\sigma = 1$) is equal to their correlation matrix. Thus, we consider $$\mathbf{A} = \mathbf{S}\ (= \mathbf{\Sigma}).$$
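A quick way to convince yourself of this relationship (again with arbitrary illustrative data) is to standardize the variables and compare the resulting covariance matrix with the correlation matrix:
import numpy as np
# Illustration only: two correlated features
rng = np.random.default_rng(42)
x = rng.normal(size=100)
y = 0.8 * x + rng.normal(scale=0.5, size=100)
X = np.column_stack([x, y])
# Standardize each column (mean 0, standard deviation 1)
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
# The covariance matrix of the standardized data equals the correlation matrix of X
print(np.cov(Z, rowvar=False))
print(np.corrcoef(X, rowvar=False))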
Just to make sure we understand the mechanics, we compute the principal components manually in Python. The data covariance matrix is computed with the function np.cov(), and the eigenvalues and eigenvectors are obtained with np.linalg.eig(). Please note that np.linalg.eig() returns both the eigenvalues and the eigenvectors. In general, an $n \times d$ data matrix has at most $\min(n-1, d)$ distinct principal components. Thus, for our specific data set with $d = 2$ features we expect $\min(n-1, d) = 2$ eigenvectors and, accordingly, $2$ eigenvalues.
# Load packages
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
# Load data
pca_toy_data = pd.read_feather("https://userpage.fu-berlin.de/soga/data/py-data/pca_food_toy_30300.feather")
pca_toy_data.set_index("index", inplace=True)
# Compute covariance matrix
cov_mat = np.cov(pca_toy_data, rowvar=False)
print(cov_mat)
[[ 1.02040816 -0.7653306 ]
 [-0.7653306   1.02040816]]
# Compute eigenvalues and eigenvectors
eig_vals, eig_vecs = np.linalg.eig(cov_mat)
print("Eigenvalues: \n", eig_vals)
print("Eigenvectors: \n", eig_vecs)
Eigenvalues:
 [0.25507756 1.78573877]
Eigenvectors:
 [[-0.70710678  0.70710678]
 [-0.70710678 -0.70710678]]
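To double-check the result, we can verify the defining equation $\mathbf{S}\mathbf{p}_j=\lambda_j\mathbf{p}_j$ for both eigenpairs and compute the proportion of the total variance each component accounts for (a quick sketch using the cov_mat, eig_vals and eig_vecs objects from above, in the order returned by np.linalg.eig()):
# Verify the eigenvector equation S p_j = lambda_j p_j for each eigenpair
for j in range(len(eig_vals)):
    print(np.allclose(cov_mat @ eig_vecs[:, j], eig_vals[j] * eig_vecs[:, j]))
# Proportion of the total variance accounted for by each principal component
print(eig_vals / eig_vals.sum())
The component with the larger eigenvalue accounts for about 87.5% of the total variance, the other one for the remaining 12.5%.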
Hence, we found the two eigenvectors (principal components) and the two eigenvalues of the data covariance matrix. We denote them by $\mathbf{p}_j$ and $\lambda_j$, respectively, with $j=1,2,...,\min(n-1,d)$. Note that np.linalg.eig() does not return the eigenpairs in any particular order, and that the principal components are only unique up to a sign change. Ordering the eigenpairs by decreasing eigenvalue, we obtain
$$\mathbf{p}_1=\begin{bmatrix}0.70710678\\-0.70710678\end{bmatrix}, \ \mathbf{p}_2=\begin{bmatrix}-0.70710678\\-0.70710678\end{bmatrix},\ \lambda_1=1.78573877,\ \lambda_2=0.25507756.$$
So what do principal components mean?
Principal components provide us with information about the patterns in the data. By taking the eigenvectors of the covariance matrix, we extract the directions along which the data varies most. Another, but equivalent, way of expressing this is to find the best-fit line, the line which explains all observations with the smallest possible (orthogonal) residual error.
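We can make the maximum-variance interpretation concrete by projecting the (mean-centered) observations onto each eigenvector and computing the variance of the projections; it equals the corresponding eigenvalue. A short sketch, reusing the objects defined above:
# Center the data and project it onto both eigenvectors
centered = pca_toy_data - pca_toy_data.mean()
projections = centered.to_numpy() @ eig_vecs
# The variance of the projections onto each eigenvector equals its eigenvalue
print(projections.var(axis=0, ddof=1))
print(eig_vals)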
Let us plot our data points and the first principal component. The data points are shown as black dots. The blue line, defined by the first principal component vector, points in the direction of maximum variance of the projections onto that line.
# Plot data points and first principal component
sns.set_theme(style="whitegrid")
# Plot data points
plt.scatter(pca_toy_data.iloc[:, 0], pca_toy_data.iloc[:, 1], color="black")
plt.xlabel("Oil")
plt.ylabel("Density")
# Plot first principal component
plt.arrow(0, 0, 4 * eig_vecs[0, 1], 4 * eig_vecs[1, 1], color="blue", linewidth=2, head_width=0.1, label="First principal component")
plt.axis("equal")
plt.legend()
plt.show()
By adding the second principal component to the plot, we see that the two components are perpendicular to each other. The second eigenvector captures the other, less important, pattern in the data: the points follow the main line but are offset from it to either side by some amount.
# plot principal component vectors
# Plot data points
plt.scatter(pca_toy_data.iloc[:, 0], pca_toy_data.iloc[:, 1], color="black")
plt.xlabel("Oil")
plt.ylabel("Density")
# Plot first principal component
plt.arrow(0, 0, 4 * eig_vecs[0, 1], 4 * eig_vecs[1, 1], color="blue", linewidth=2, head_width=0.1, label="First principal component")
# Plot second principal component
plt.arrow(0, 0, -4 * eig_vecs[0, 0], -4 * eig_vecs[1, 0], color="red", linewidth=2, head_width=0.1, label="Second principal component")
plt.axis("equal")
plt.legend()
plt.show()
Thus, the principal component axes are rotated by exactly 45° relative to the original coordinate axes.
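Both statements can be checked numerically: the dot product of the two eigenvectors is (numerically) zero, and the first principal component axis makes a 45° angle with the x-axis; since eigenvectors are only defined up to sign, the computed angle may come out as −45°:
# Orthogonality: the dot product of the two eigenvectors is (numerically) zero
print(np.dot(eig_vecs[:, 0], eig_vecs[:, 1]))
# Angle of the first principal component (second column of eig_vecs) with the x-axis;
# the sign of an eigenvector is arbitrary, so the result is +/- 45 degrees
print(np.degrees(np.arctan2(eig_vecs[1, 1], eig_vecs[0, 1])))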
Furthermore, PC1 has a variance about 75% higher than that of each original variable ($s^2 \approx 1.02$), but the total variance is preserved: $$\lambda_1+\lambda_2 = 1.78573877 + 0.25507756 \approx 2.0408,$$ which equals the sum of the variances of the two original variables ($2 \times 1.02040816$).
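This is a general property: the sum of the eigenvalues equals the trace of the covariance matrix, i.e. the sum of the variances of the original variables, which we can verify directly:
# The sum of the eigenvalues equals the trace (sum of the diagonal) of the covariance matrix
print(eig_vals.sum())
print(np.trace(cov_mat))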
Citation
The E-Learning project SOGA-Py was developed at the Department of Earth Sciences by Annette Rudolph, Joachim Krois and Kai Hartmann. You can reach us via e-mail at soga[at]zedat.fu-berlin.de.

Please cite as follows: Rudolph, A., Krois, J., Hartmann, K. (2023): Statistics and Geodata Analysis using Python (SOGA-Py). Department of Earth Sciences, Freie Universitaet Berlin.