In the previous sections we discussed the computation of principal components and how to pick a reasonable number of principal components to represent the data. In this section we will learn how to project the original data onto a new feature space with fewer dimensions. Recall that the principal components are the eigenvectors of the covariance matrix and that the eigenvalues correspond to the variance explained by the respective principal component.

In PCA the elements of the principal components are denoted as loadings (\(\phi\)). Written in vector form, the loadings are referred to as a loading vector, \(\mathbf{p}_j\), where \(j=1,2,...,k\).

Let us take a look at the loadings of the first two principal components, \(\mathbf{p}_1\) and \(\mathbf{p}_2\), of the food-texture data set.

# load data from previous sections
load("pca_food_30300.RData")

food_pca_eigen <- eigen(cov(food_pca)) # eigendecomposition of the covariance matrix
pca_loading <- food_pca_eigen$vectors[, 1:2] # select the first two principal components
colnames(pca_loading) <- c("PC1", "PC2") # name columns
rownames(pca_loading) <- colnames(food_pca) # name rows
pca_loading
##                 PC1        PC2
## Oil       0.4575334  0.3704389
## Density  -0.4787455 -0.3567500
## Crispy    0.5323877 -0.1976610
## Fracture -0.5044769  0.2212399
## Hardness  0.1534026 -0.8046661

The output from R can be rewritten in vector form.

\[\mathbf{p}_1=\begin{bmatrix}\phi_{11}\\\phi_{21}\\\phi_{31}\\\phi_{41}\\\phi_{51}\end{bmatrix}= \begin{bmatrix}0.46\\-0.48\\0.53\\-0.5\\0.15\end{bmatrix}, \ \mathbf{p}_2=\begin{bmatrix}\phi_{12}\\\phi_{22}\\\phi_{32}\\\phi_{42}\\\phi_{52}\end{bmatrix}= \begin{bmatrix}0.37\\-0.36\\-0.2\\0.22\\-0.8\end{bmatrix}\] Recall that the principal components are the eigenvectors of the data covariance matrix. By definition an eigenvector has length 1. Hence, the squared loadings of each loading vector should sum up to \(1\) (\(\sum_{i=1}^{d}\phi_{ij}^2=1\)). Let us check this claim.

# calculate squared sum of loadings
apply(pca_loading^2, 2, sum)
## PC1 PC2 
##   1   1

Correct, the squared loadings of each loading vector sum up to \(1\)!
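In fact, the eigenvectors of a symmetric matrix such as the covariance matrix are not only of unit length but also mutually orthogonal. As a quick sanity check (a sketch using base R's crossprod()), the cross product \(\mathbf{P}^T\mathbf{P}\) of the loading matrix should return the identity matrix, up to floating point error.

# the loading vectors should be orthonormal: t(P) %*% P = I
round(crossprod(pca_loading), 10)
##     PC1 PC2
## PC1   1   0
## PC2   0   1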

Recall that the principal components are linear combinations of the original features, where the loadings \(\phi\) serve as the coefficients. By multiplying the ith observation vector, \(\mathbf{x}_i=(x_{i1},x_{i2},...,x_{id})\), with the jth loading vector, \(\mathbf{p}_j\), we calculate the score, which represents the projection of the ith observation onto the jth principal component.

We can explicitly write the score \(z_{i1}\) of the ith observation onto the first principal component (\(j=1\)) as a linear combination of the loadings and the elements of \(\mathbf{x}_i\):

\[z_{i1}=\phi_{11}x_{i1}+\phi_{21}x_{i2}+...+\phi_{d1}x_{id}, \quad i=1,2,...,n,\] and so forth.

Applied to the food-texture data set, the equation for the first principal component reads

\[z_{i1}=0.46x_{i,oil}-0.48x_{i,density}+0.53x_{i,crispy}-0.5x_{i,fracture}+0.15x_{i,hardness}, \quad i=1,2,...,n,\] and for the second principal component

\[z_{i2}=0.37x_{i,oil}-0.36x_{i,density}-0.2x_{i,crispy}+0.22x_{i,fracture}-0.8x_{i,hardness}, \quad i=1,2,...,n.\] Knowing these equations it becomes very easy to calculate the scores of the first observation in our data set (\(i=1\)) for the first and the second principal component.

# first observation projected on PC1
obs1 <- food_pca[1, ] # extract the first observation
obs1 %*% pca_loading[, 1]
##           [,1]
## [1,] -1.383211
# first observation projected on PC2
obs1 %*% pca_loading[, 2]
##           [,1]
## [1,] 0.6194061
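As a cross-check, we can spell out the matrix products explicitly as the linear combinations from the equations above; the results should be identical.

# the same scores, written out as explicit linear combinations
sum(food_pca[1, ] * pca_loading[, 1]) # score on PC1
## [1] -1.383211
sum(food_pca[1, ] * pca_loading[, 2]) # score on PC2
## [1] 0.6194061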

These values, denoted as scores, correspond to the projection of the first data point onto the first and the second principal component, respectively. To calculate the scores of all \(n\) observations for \(k\) principal components at once we apply linear algebra and multiply the data matrix by the matrix of loading vectors. These new values, our synthetic variables, are the projection of the original data onto a new feature space.

Let \(\mathbf{X}_{n×d}\) be the data matrix with \(n\) observations and \(d\) features, and let \(\mathbf{P}_{d×k}\), often referred to as the rotation matrix, be the matrix with the \(k\) loading vectors in its columns; then the matrix \(\mathbf{Z}_{n×k}\) holds the synthetic variables (scores) in \(n\) rows and \(k\) columns.

\[\mathbf{Z}_{n×k}=\mathbf{X}_{n×d}\mathbf{P}_{d×k}\] In order to calculate the scores for our food-texture data set we write

Z <- food_pca %*% pca_loading # project the data onto the first two principal components
dim(Z)
## [1] 50  2
Z
##               PC1         PC2
##  [1,] -1.38321116  0.61940609
##  [2,]  2.79447694 -0.35372463
##  [3,]  0.23755641 -0.86145822
##  [4,] -1.93934057  1.14173664
##  [5,] -1.26793706 -0.66346458
##  [6,]  1.99745643 -1.29835426
##  [7,]  1.48818776  0.63265249
##  [8,] -0.82886358  2.38254850
##  [9,] -1.18515331 -0.35018100
## [10,] -0.99346261 -0.35082583
## [11,]  0.98484427  0.15975902
## [12,]  0.76764163 -1.19113483
## [13,]  2.19477278 -1.48435425
## [14,] -0.84520978 -2.26002157
## [15,] -2.22267282  1.58812847
## [16,]  1.86851048 -1.93684067
## [17,]  2.03938254  1.04564488
## [18,]  0.47185486 -0.63696147
## [19,]  1.56549666  0.31033679
## [20,]  2.74415668 -0.28555759
## [21,] -2.33586961 -2.06720889
## [22,] -1.54236231 -0.92311295
## [23,]  1.74090108 -0.57516377
## [24,] -1.11069766  1.46958839
## [25,]  1.26539381 -0.73525624
## [26,]  0.98346062 -0.69558506
## [27,] -1.48581846  0.67472784
## [28,] -1.13263937  0.26814430
## [29,]  0.72482386 -0.17906169
## [30,]  1.56051390  2.50713724
## [31,] -1.99196219 -0.11104397
## [32,] -1.00478309  0.45498384
## [33,] -4.17139945  0.76756233
## [34,]  0.84325076 -0.11178233
## [35,] -0.09171039 -2.12200632
## [36,]  3.60970963  1.75939081
## [37,]  1.72100778  1.75656917
## [38,]  2.73275558  0.23976542
## [39,] -0.48651579  0.76429167
## [40,] -1.64009084  0.52325484
## [41,] -1.25379402 -0.55692683
## [42,]  0.03455199 -0.08340916
## [43,] -0.72177088 -0.48681051
## [44,] -2.78394234 -0.59292164
## [45,]  0.14727359  0.22249197
## [46,] -0.77978676  0.27563061
## [47,] -0.35408994 -1.62039117
## [48,] -2.12046481  0.07959690
## [49,]  2.55856798  1.31200347
## [50,] -1.40299922  1.57820776
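For comparison, R's built-in prcomp() function should reproduce these scores, possibly with flipped signs, since the sign of an eigenvector is arbitrary. This is a sketch under the assumption that food_pca was already centered and scaled in the previous sections:

# cross-check with prcomp(); food_pca is assumed to be preprocessed already
food_prcomp <- prcomp(food_pca, center = FALSE, scale. = FALSE)
head(food_prcomp$x[, 1:2]) # scores computed by prcomp
head(Z) # our manually computed scores; columns may differ in sign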

Done! We just projected our original 5-dimensional data set onto a new 2-dimensional feature space. Plotting the two score columns shows that the data has been rotated and now aligns with the directions of the first two principal components.

plot(Z, xlab = "PC1", ylab = "PC2")
abline(h = 0, col = "blue")
abline(v = 0, col = "green")
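
Finally, recall from the beginning of this section that the eigenvalues correspond to the variance explained by the respective principal component. Consequently, the variances of the score columns of \(\mathbf{Z}\) should match the first two eigenvalues, which makes for a quick sanity check:

# the variance of the scores equals the corresponding eigenvalues
apply(Z, 2, var)
food_pca_eigen$values[1:2]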

