Interpreting score plots

Let us recall what a score value is. There is one score value for each observation (row) in the data set, so there are \(n\) score values for the first component, another \(n\) score values for the second component, and so on. The score value of an observation on, say, the first component is the point where that observation projects onto the direction vector of that component. In vector terminology, it is the signed distance from the origin, measured along the direction (loading vector) of the first component, to the point where the observation projects onto that direction vector.
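The projection described above can be sketched for a single observation. This is a minimal illustration, not the tutorial's own code: it assumes the built-in `USArrests` data (centred and scaled) as a stand-in for the food data, and the names `X`, `P`, `p1`, `x1`, `z11` and `z1` are hypothetical.

```r
# Stand-in data (assumption): built-in USArrests, centred and scaled
X <- scale(USArrests)
P <- eigen(cov(X))$vectors    # columns are the loading (direction) vectors

p1 <- P[, 1]                  # direction of the first component (unit length)
x1 <- X[1, ]                  # first observation (row)

# Score = signed length of the projection of x1 onto p1 (a dot product)
z11 <- sum(x1 * p1)

# The same value appears as the first entry of the matrix product X %*% p1
z1 <- drop(X %*% p1)
all.equal(z11, unname(z1[1])) # should return TRUE
```

Because `p1` has unit length, the dot product is directly the distance along the component, with the sign indicating the side of the origin.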

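An important point with PCA is that, because the matrix \(\mathbf{P}\) is orthonormal, the transformation from \(\mathbf{X}\) to \(\mathbf{Z}\) merely re-expresses the data in a rotated coordinate system, so any relationships between observations that were present in \(\mathbf{X}\) are still present in \(\mathbf{Z}\). Score plots therefore allow us to rapidly locate similar observations, clusters, outliers and time-based patterns.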
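The orthonormality argument can be checked numerically. The sketch below again assumes the built-in `USArrests` data as a stand-in for the food data, and the object names are hypothetical. It keeps all components; when only the first two are plotted, the distances are approximated rather than reproduced exactly.

```r
# Stand-in data (assumption): built-in USArrests, centred and scaled
X <- scale(USArrests)
P <- eigen(cov(X))$vectors
Z <- X %*% P                  # scores on all components

# P'P should be the identity matrix if P is orthonormal
orthonormal <- all.equal(crossprod(P), diag(ncol(P)))

# Pairwise Euclidean distances between observations agree in X and Z
same_distances <- all.equal(as.vector(dist(X)), as.vector(dist(Z)))

orthonormal
same_distances
```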

The first two score vectors, \(\mathbf{z}_1\) and \(\mathbf{z}_2\), explain the greatest variation in the data, hence we usually start by looking at the \(\{\mathbf{z}_1,\mathbf{z}_2\}\) scatter plot of the scores.

# load data from previous sections
load("pca_food_30300.RData")

food_pca_eigen <- eigen(cov(food_pca))
pca_loading <- food_pca_eigen$vectors[, 1:2] # select the first two principal components

# Project the observations onto the loadings (as.matrix() guards against a data frame)
pca_scores <- as.matrix(food_pca) %*% pca_loading
rownames(pca_scores) <- seq_len(nrow(pca_scores))

# Plot the scores
plot(pca_scores,
     xlab = expression("PC"[1]),
     ylab = expression("PC"[2]),
     main = "Score plot")
abline(h = 0, col = "blue")
abline(v = 0, col = "green")

# Label each point with its observation number, offset slightly to the right
text(pca_scores[, 1] + 0.2,
     pca_scores[, 2],
     rownames(pca_scores),
     col = "blue", cex = 0.6)
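The statement that the first two score vectors capture the largest share of the variation can be checked from the eigenvalues, since the variance of each score vector equals the corresponding eigenvalue. A minimal sketch, assuming the built-in `USArrests` data as a stand-in for the food data and hypothetical object names:

```r
# Stand-in data (assumption): built-in USArrests, centred and scaled
X <- scale(USArrests)
eig <- eigen(cov(X))          # eigenvalues are returned in decreasing order
Z <- X %*% eig$vectors        # scores on all components

# Variance of each score vector equals the corresponding eigenvalue
score_var <- apply(Z, 2, var)

# Share of the total variance explained by each component
explained <- eig$values / sum(eig$values)

# Cumulative share captured by the first two components
cum_two <- cumsum(explained)[2]
round(explained, 3)
round(cum_two, 3)
```

If `cum_two` is close to 1, the \(\{\mathbf{z}_1,\mathbf{z}_2\}\) scatter plot shows most of the structure in the data; if it is small, later components may deserve a look as well.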


Interpreting loading plots

The loading plot is a plot of the direction vectors that define the model. It shows how the original variables contribute to each principal component.

loading_vector <- food_pca_eigen$vectors
rownames(loading_vector) <- colnames(food_pca)

# Plot the loadings of the first two components
plot(loading_vector[, 1:2],
     xlab = expression("PC"[1]),
     ylab = expression("PC"[2]),
     main = "Loading plot",
     ylim = c(-1, 1),
     xlim = c(-1, 1))
abline(h = 0, col = "blue")
abline(v = 0, col = "green")

# Label each point with its variable name, offset slightly up and to the right
text(loading_vector[, 1] + 0.1,
     loading_vector[, 2] + 0.1,
     rownames(loading_vector),
     col = "blue", cex = 1.2)
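One way to quantify how much each variable contributes to a component is to look at the squared loadings. Each loading vector has unit length, so its squared entries sum to 1 and can be read as shares. The sketch below assumes the built-in `USArrests` data as a stand-in for the food data, and `contrib_pc1` is a hypothetical name.

```r
# Stand-in data (assumption): built-in USArrests, centred and scaled
X <- scale(USArrests)
P <- eigen(cov(X))$vectors
rownames(P) <- colnames(X)

# Squared loadings of the first component: one share per original variable
contrib_pc1 <- P[, 1]^2
round(contrib_pc1, 3)

# Each column of P has unit length, so the shares sum to 1
sum(contrib_pc1)
```

The unit length also explains why axis limits of \([-1, 1]\) are always wide enough for a loading plot.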


Interpreting biplots

The biplot is a very popular way to visualize the results of a PCA, as it combines the principal component scores and the loading vectors in a single display.

# Correlation BiPlot
pca_sd <- sqrt(food_pca_eigen$values) # standard deviation of each principal component
loading_vector <- food_pca_eigen$vectors
rownames(loading_vector) <- colnames(food_pca)

# Plot
plot(pca_scores,
     xlab = expression("PC"[1]),
     ylab = expression("PC"[2]))
abline(h = 0, col = "blue")
abline(v = 0, col = "green")

# Dividing by a value below 1 lengthens the arrows so they are easier to see
# (named scale_factor to avoid masking base R's factor() function)
scale_factor <- 0.5

# Plot the variables as vectors
arrows(0, 0,
       loading_vector[, 1] * pca_sd[1] / scale_factor,
       loading_vector[, 2] * pca_sd[2] / scale_factor,
       length = 0.1,
       lwd = 2,
       angle = 20,
       col = "red")

# Label each arrow with its variable name, placed slightly beyond the arrow tip
text(loading_vector[, 1] * pca_sd[1] / scale_factor * 1.2,
     loading_vector[, 2] * pca_sd[2] / scale_factor * 1.2,
     rownames(loading_vector),
     col = "red",
     cex = 1.2)

The plot shows the observations as points in the plane formed by two principal components (synthetic variables). As with any scatterplot, we may look for patterns, clusters, and outliers.
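The label "correlation biplot" in the code can be motivated numerically. If the variables are standardized (an assumption here), the arrow coordinates, that is the loading times the standard deviation of the component, equal the correlation between each original variable and each score vector. A minimal sketch with the built-in `USArrests` data as a stand-in for the food data and hypothetical object names:

```r
# Stand-in data (assumption): built-in USArrests, centred and scaled
X <- scale(USArrests)
eig <- eigen(cov(X))
Z <- X %*% eig$vectors        # scores on all components

# Arrow coordinates before any visual rescaling: loading * sd of the component
arrow_coords <- sweep(eig$vectors, 2, sqrt(eig$values), `*`)

# Correlations between original variables (rows) and score vectors (columns)
var_score_cor <- cor(X, Z)

all.equal(unname(var_score_cor), arrow_coords) # should return TRUE
```

Note that the plotting code additionally divides the arrow coordinates by 0.5 for visibility, so the drawn arrows are twice as long as these correlations.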

In addition to the observations the plot shows the original variables as vectors (arrows). They begin at the origin \([0,0]\) and extend to coordinates given by the loading vector (see loading plot above). These vectors can be interpreted in three ways (Rossiter, 2022):


Citation

The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us by e-mail at soga[at]zedat.fu-berlin.de.

Creative Commons License
You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follows: Hartmann, K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.