Interpreting score plots¶

Let us recall what a score value is: there is one score value for each observation (row) in the data set, so there are $n$ score values for the first component, another $n$ score values for the second component, and so on. The score value of an observation is the point at which that observation projects onto the direction vector of, say, the first component. In vector terminology it is the distance from the origin, along the direction (loading vector) of the first component, to the point where that observation projects onto it.
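As a minimal, self-contained sketch of this idea (using a small random matrix rather than the food data, which is only loaded further below), the score of an observation on the first component can be reproduced by projecting the mean-centered observation onto the first loading vector:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))              # 10 observations, 3 variables

pca = PCA().fit(X)
p1 = pca.components_[0]                   # loading (direction) vector of PC1

# score of observation 0 on PC1: projection of the centered row onto p1
z_manual = (X[0] - pca.mean_) @ p1
z_sklearn = pca.transform(X)[0, 0]
print(np.isclose(z_manual, z_sklearn))    # expected: True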

An important property of PCA is that, because the matrix $\mathbf{P}$ is orthonormal, the relationships between observations that were present in $\mathbf{X}$ (their relative distances and positions) are preserved in $\mathbf{Z}$. Score plots therefore allow us to rapidly locate similar observations, clusters, outliers and time-based patterns.
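A quick way to convince ourselves of this (again a hedged sketch on random data, since the food data is only loaded in the next cell): the loading vectors are orthonormal, so $\mathbf{P}^\top\mathbf{P}=\mathbf{I}$, and when all components are kept the distance between any two observations is the same in the centered $\mathbf{X}$ and in $\mathbf{Z}$:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = rng.normal(size=(10, 4))

pca = PCA().fit(X)                 # keep all 4 components
P = pca.components_.T              # loading vectors as columns
Z = pca.transform(X)               # scores

print(np.allclose(P.T @ P, np.eye(4)))            # P is orthonormal
print(np.isclose(np.linalg.norm(X[0] - X[1]),     # pairwise distances are
                 np.linalg.norm(Z[0] - Z[1])))    # unchanged in score space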

The first two score vectors, $\mathbf{z}_1$ and $\mathbf{z}_2$, explain the greatest variation in the data, hence we usually start by looking at the $\{\mathbf{z}_1,\mathbf{z}_2\}$ scatter plot of the scores.

In [1]:
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.decomposition import PCA
In [2]:
# load data from previous sections
food = pd.read_feather("https://userpage.fu-berlin.de/soga/data/py-data/food-texture-scaled.feather")
food.head()
Out[2]:
        Oil   Density    Crispy  Fracture  Hardness
0 -0.440953  0.782329 -0.856063  0.391506 -1.001684
1  0.312813 -1.587149  1.396734 -2.169748  0.347602
2 -0.629394  0.099598  0.270336 -0.706174  0.476105
3 -0.315325  0.501205 -0.856063  1.855079 -1.065936
4 -0.566580  0.942972 -0.292864  0.940346  0.476105
In [3]:
# calculate the eigenvectors and eigenvalues
food_pca = PCA().fit(food)
food_pca_eigen = pd.DataFrame(
    food_pca.components_.T,
    columns=["PC1", "PC2", "PC3", "PC4", "PC5"],
    index=food.columns,
)

# Append the eigenvalues (the variance along each component)
food_pca_eigen["eigenvalue"] = food_pca.explained_variance_

# Compute the scores
food_pca_scores = pd.DataFrame(
    food_pca.transform(food),
    columns=["PC1", "PC2", "PC3", "PC4", "PC5"],
    index=food.index,
)

# Plot the scores
sns.scatterplot(data=food_pca_scores, x="PC1", y="PC2")
plt.axhline(0, color="blue")
plt.axvline(0, color="green")
plt.show()
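Before reading the plot, it is worth confirming how much of the total variance PC1 and PC2 actually capture. A short check on the fitted food_pca object from above (the exact numbers depend on the data):

# fraction of the total variance carried by each principal component
print(food_pca.explained_variance_ratio_)

# cumulative share captured by the first two components together
print(food_pca.explained_variance_ratio_[:2].sum())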
  • Points close to the average appear at the origin of the score plot. An observation that sits at the mean value of all $d$ variables will have a score vector $\mathbf{z}_i=[0,0,...,0]$.
  • Scores further out are either outliers or naturally extreme observations (a quick way to rank them is sketched after this list).
  • Observations in $\mathbf{X}$ that are similar to each other will also lie close together in the score plot, while observations far apart are dissimilar. It is much easier to detect this similarity in the $k$-dimensional score space than in the original $d$-dimensional space when $d \gg k$.
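As a hedged illustration of the second point, one can rank the observations by their distance from the origin in the PC1/PC2 plane; this is only a first screening of extreme points, not a formal outlier test:

# distance of each observation from the origin in the score plane
dist = np.hypot(food_pca_scores["PC1"], food_pca_scores["PC2"])

# the observations furthest from the center are candidates for closer
# inspection (extreme, but not necessarily erroneous)
print(dist.sort_values(ascending=False).head())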

Interpreting loading plots¶

The loadings plot is a plot of the direction vectors that define the model. It shows how the original variables contribute to each principal component.
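Since every loading vector has unit length, the squared entries of a column sum to one and can be read as the share each original variable contributes to that direction. A small sketch using the food_pca_eigen table computed above:

# squared loadings: share of each variable in the PC1 and PC2 directions
contrib = food_pca_eigen[["PC1", "PC2"]] ** 2
print(contrib.round(2))
print(contrib.sum())   # each column sums to 1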

In [4]:
# Plot the loadings
sns.scatterplot(data=food_pca_eigen, x="PC1", y="PC2")
plt.axhline(0, color="blue")
plt.axvline(0, color="green")
plt.xlim(-1, 1)
plt.ylim(-1, 1)

# Plot annotations
for i in range(food_pca_eigen.shape[0]):
    plt.text(
        food_pca_eigen["PC1"][i],
        food_pca_eigen["PC2"][i],
        food_pca_eigen.index[i],
        color="red",
    )

plt.show()
  • Variables which have little contribution to a direction have almost zero weight in that loading.
  • Variables which have roughly equal influence on defining a direction are correlated with each other and will have roughly equal numeric weights.
  • Variables that are strongly positively correlated will have approximately the same weight values. In a loadings plot of e.g. $\mathbf{p}_1$ vs $\mathbf{p}_2$ they appear close to each other, while negatively correlated variables appear diagonally opposite each other (a rough numerical check is sketched after this list).
  • The signs of the loadings are useful to compare within a direction vector; an entire vector may, however, be flipped by 180° (multiplied by $-1$) and still carry the same interpretation.
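A rough numerical counterpart to the statements above: compare the correlation matrix of the (already scaled) variables with the angle between their loading vectors in the PC1/PC2 plane. The helper loading_angle below is only illustrative, and the correspondence is approximate because the plane ignores the remaining components:

# correlations between the original (scaled) variables
print(food.corr().round(2))

# illustrative helper: angle (in degrees) between two variables' loading
# vectors in the PC1/PC2 plane; small angles suggest positive correlation,
# angles near 180° suggest negative correlation
def loading_angle(var_a, var_b, loadings=food_pca_eigen[["PC1", "PC2"]]):
    a = loadings.loc[var_a].to_numpy()
    b = loadings.loc[var_b].to_numpy()
    cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

print(loading_angle("Oil", "Density"))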

Interpreting Biplots¶

The biplot is a very popular way to visualize the results of a PCA, as it combines both the principal component scores and the loading vectors in a single display.

In [5]:
# Correlation Biplot
sns.scatterplot(
    data=food_pca_scores,
    x="PC1",
    y="PC2",
    hue=food_pca_scores.index,
    legend=False,
)
plt.axhline(0, color="blue")
plt.axvline(0, color="green")

# plot the variables as vectors
plt.quiver(
    np.zeros(food_pca_eigen.shape[0]),
    np.zeros(food_pca_eigen.shape[0]),
    food_pca_eigen["PC1"],
    food_pca_eigen["PC2"],
    food_pca_eigen["eigenvalue"],
    angles="xy",
    scale_units="xy",
    scale=1,
)

# Plot annotations
for i in range(food_pca_eigen.shape[0]):
    plt.text(
        food_pca_eigen["PC1"].iloc[i],
        food_pca_eigen["PC2"].iloc[i],
        food_pca_eigen.index[i],
        color="red",
    )

plt.show()

The plot shows the observations as points in the plane spanned by two principal components (synthetic variables). As in any scatter plot, we may look for patterns, clusters, and outliers.

In addition to the observations, the plot shows the original variables as vectors (arrows). They begin at the origin $[0,0]$ and extend to the coordinates given by the loading vectors (see the loading plot above). These vectors can be interpreted in three ways (Rossiter, 2022):

  • The orientation (direction) of a vector with respect to the principal component space, in particular its angle with the principal component axes: the more parallel a vector is to a principal component axis, the more it contributes to that PC alone.
  • The length of a vector in this space: the longer the vector, the more of that variable's variability is represented by the two displayed principal components; short vectors are thus better represented in the remaining dimensions (see the sketch after this list).
  • The angles between vectors of different variables show their correlation in this space: small angles represent high positive correlation, right angles represent a lack of correlation, and angles close to 180° represent high negative correlation.
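As a hedged sketch of the second point: because the full loading matrix is orthonormal, each variable's squared loadings over all five components sum to one, so the squared length of its arrow in the PC1/PC2 plane can be read as the fraction of that variable that the displayed plane represents:

# squared length of each variable's arrow in the PC1/PC2 plane:
# values near 1 mean the variable is well represented in this plane,
# values near 0 mean most of its information sits in PC3 to PC5
arrow_sq_len = (food_pca_eigen[["PC1", "PC2"]] ** 2).sum(axis=1)
print(arrow_sq_len.round(2))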
