In the previous sections we discussed the computation of principal components and how to pick a reasonable number of principal components to represent the data. In this section we will learn how to project the original data onto a new feature space with less dimensions. Recall that the principal components are the eigenvectors of the covariance matrix and that the eigenvalues correspond to the explained variance of the respective principal component. In PCA the elements of the principal components are denoted as loadings ($ϕ$). Written in vector form the loadings are referred to as loading vector, $\mathbf{p}_j$, where $j=1,2,...,k$. Let us take a look at the loading of the first two principal components, $\mathbf{p}_1$ and $\mathbf{p}_2$, of the food-texture data set.
# import libraries
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
# load data from previous sections
food_pca = pd.read_feather('https://userpage.fu-berlin.de/soga/data/py-data/food-texture-scaled.feather')
food_pca.head()
| Oil | Density | Crispy | Fracture | Hardness | |
|---|---|---|---|---|---|
| 0 | -0.440953 | 0.782329 | -0.856063 | 0.391506 | -1.001684 |
| 1 | 0.312813 | -1.587149 | 1.396734 | -2.169748 | 0.347602 |
| 2 | -0.629394 | 0.099598 | 0.270336 | -0.706174 | 0.476105 |
| 3 | -0.315325 | 0.501205 | -0.856063 | 1.855079 | -1.065936 |
| 4 | -0.566580 | 0.942972 | -0.292864 | 0.940346 | 0.476105 |
# compute eigenvalues and eigenvectors
eigvals, eigvecs = np.linalg.eig(food_pca.cov())
# get the loadings for the first two principal components
pca_loading = eigvecs[:, :2]
print(pca_loading)
[[-0.45753343 0.37043885] [ 0.4787455 -0.35674997] [-0.53238767 -0.19766103] [ 0.50447688 0.22123992] [-0.15340262 -0.8046661 ]]
The columns of the output from Python show the vectors $$\mathbf{p}_1=\begin{bmatrix}\phi_{11}\\\phi_{12}\\\phi_{13}\\\phi_{14}\\\phi_{15}\\\end{bmatrix}$$ and $$\mathbf{p}_2=\begin{bmatrix}\phi_{21}\\\phi_{22}\\\phi_{23}\\\phi_{24}\\\phi_{25}\\\end{bmatrix}\, .$$ Recall that the principal components are the eigenvectors of the data covariance matrix. By definition the length of an eigenvector is 1. Hence, the squared sum of the loadings should sum up to $1$ ($∑kj=1ϕ2j=1$). Let us check this claim.
# calculate the squared sum of the loadings
ssl = [sum(pca_loading[:, 0]**2), sum(pca_loading[:, 1]**2)]
print(ssl)
assert round(ssl[0], 2) == 1
assert round(ssl[1], 2) == 1
[1.0, 1.0000000000000002]
Correct, the squared sum of loadings sums up to $1$!
Recall that the principal components are linear combinations of $ϕ\mathbf{x}$, where $ϕ$ corresponds to the loading vector and $\mathbf{x}$ to one particular $d$-dimensional observation vector ($x_1,x_2,...,x_d$) . By multiplying the $i$-th observation vector, $\mathbf{x}_i$, and the $j$-th loading vector $\mathbf{p}_j$ we calculate the score, which represents the projection of the $i$-th observation onto the $j$-th principal components.
We can explicitly write the score ($z_{ij}$) of the $i$-th observations onto the first principal component ($j=1$) as a linear combination of $ϕ$ and $\mathbf{x}_i$
$$z_{i1}=ϕ_{11}\mathbf{x}_{i1}+ϕ_{21}\mathbf{x}_{i2},...,ϕ_{d1}\mathbf{x}_{id} \ i=1,2,...,n,$$and so forth.
Applied to the food-texture data set the equation yields for the first principal component
$$z_{zi1}=0.46\mathbf{x}_{oil}+−0.48\mathbf{x}_{density}+0.53\mathbf{x}_{crispy}+−0.5\mathbf{x}_{fracture}+0.15\mathbf{x}_{hardness} \ i=1,2,...,n,$$and for the second principal component
$$z_{zi1}=0.37\mathbf{x}_{oil}+−0.36\mathbf{x}_{density}+−0.2\mathbf{x}_{crispy}+0.22\mathbf{x}_{fracture}+−0.8\mathbf{x}_{hardness} \ i=1,2,...,n.$$Knowing these equations it becomes very easy to calculate the scores of the first observation in our data set ($i=1$) for the first and the second principle components.
# calculate the scores for the first observation projected on PC1
score_pc1 = pca_loading[0, 0] * food_pca.iloc[0, 0] + \
pca_loading[1, 0] * food_pca.iloc[0, 1] + \
pca_loading[2, 0] * food_pca.iloc[0, 2] + \
pca_loading[3, 0] * food_pca.iloc[0, 3] + \
pca_loading[4, 0] * food_pca.iloc[0, 4]
# calculate the scores for the first observation projected on PC2
score_pc2 = pca_loading[0, 1] * food_pca.iloc[0, 0] + \
pca_loading[1, 1] * food_pca.iloc[0, 1] + \
pca_loading[2, 1] * food_pca.iloc[0, 2] + \
pca_loading[3, 1] * food_pca.iloc[0, 3] + \
pca_loading[4, 1] * food_pca.iloc[0, 4]
# print the scores
print('The score for the first observation projected on PC1 is: %.2f' % score_pc1)
print('The score for the first observation projected on PC2 is: %.2f' % score_pc2)
The score for the first observation projected on PC1 is: 1.38 The score for the first observation projected on PC2 is: 0.62
These values, denoted as scores, correspond to the projection of the first data point in our data set onto the first and the second principal component, respectively. To calculate $n$ scores of $n$ observations for $k$ principal components we apply linear algebra and multiply the original variables with the principal component to compute the scores. These new values, our synthetic variables, are the projection of the original data onto a new feature space.
Let $\mathbf{X}_{n×d}$ be the data matrix with $n$ observations and $d$ features, and let $\mathbf{P}_{d×k}$, often referred to as rotation matrix, be a matrix with $k$ principal components in the columns, then the matrix $\mathbf{Z}_{n×k}$ holds the synthetic variables (scores) in $n$ rows and $k$ columns.
$$\mathbf{Z}_{n×k}=\mathbf{X}_{n×d}\mathbf{P}_{d×k}$$In order to calculate the scores for our food-texture data set we write
# calculate the scores for the first two principal components
scores = food_pca.dot(pca_loading)
# print the scores
print(scores)
# get the dimensions of the scores
print('The dimensions of the scores are: %s' % str(scores.shape))
0 1 0 1.383211 0.619406 1 -2.794477 -0.353725 2 -0.237556 -0.861458 3 1.939341 1.141737 4 1.267937 -0.663465 5 -1.997456 -1.298354 6 -1.488188 0.632652 7 0.828864 2.382549 8 1.185153 -0.350181 9 0.993463 -0.350826 10 -0.984844 0.159759 11 -0.767642 -1.191135 12 -2.194773 -1.484354 13 0.845210 -2.260022 14 2.222673 1.588128 15 -1.868510 -1.936841 16 -2.039383 1.045645 17 -0.471855 -0.636961 18 -1.565497 0.310337 19 -2.744157 -0.285558 20 2.335870 -2.067209 21 1.542362 -0.923113 22 -1.740901 -0.575164 23 1.110698 1.469588 24 -1.265394 -0.735256 25 -0.983461 -0.695585 26 1.485818 0.674728 27 1.132639 0.268144 28 -0.724824 -0.179062 29 -1.560514 2.507137 30 1.991962 -0.111044 31 1.004783 0.454984 32 4.171399 0.767562 33 -0.843251 -0.111782 34 0.091710 -2.122006 35 -3.609710 1.759391 36 -1.721008 1.756569 37 -2.732756 0.239765 38 0.486516 0.764292 39 1.640091 0.523255 40 1.253794 -0.556927 41 -0.034552 -0.083409 42 0.721771 -0.486811 43 2.783942 -0.592922 44 -0.147274 0.222492 45 0.779787 0.275631 46 0.354090 -1.620391 47 2.120465 0.079597 48 -2.558568 1.312003 49 1.402999 1.578208 The dimensions of the scores are: (50, 2)
Done! We just projected our original 5-dimensional data set onto a new 2-dimensional feature space. By plotting the 2-dimensional features we realize that the data is rotated and aligns along the dimensions of our first two principal components.
# plot the scores and plot the 2-dimensional feature space
plt.scatter(scores.iloc[:, 0], scores.iloc[:, 1])
# plot a horizontal line through the origin
plt.axhline(y=0, color='k', linestyle='--')
# plot a vertical line through the origin
plt.axvline(x=0, color='k', linestyle='--')
# add labels to the plot
plt.title('Scores of the first two principal components')
# add labels to the axes
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.show()
Citation
The E-Learning project SOGA-Py was developed at the Department of Earth Sciences by Annette Rudolph, Joachim Krois and Kai Hartmann. You can reach us via mail by soga[at]zedat.fu-berlin.de.

Please cite as follow: Rudolph, A., Krois, J., Hartmann, K. (2023): Statistics and Geodata Analysis using Python (SOGA-Py). Department of Earth Sciences, Freie Universitaet Berlin.