PCA starts with data. In this section we analyse the food texture data set. This open source data set is available here and describes texture measurements of a pastry-type food.
food <- read.csv("food-texture.csv")
str(food)
## 'data.frame': 50 obs. of 6 variables:
## $ X : chr "B110" "B136" "B171" "B192" ...
## $ Oil : num 16.5 17.7 16.2 16.7 16.3 19.1 18.4 17.5 15.7 16.4 ...
## $ Density : int 2955 2660 2870 2920 2975 2790 2750 2770 2955 2945 ...
## $ Crispy : int 10 14 12 10 11 13 13 10 11 11 ...
## $ Fracture: int 23 9 17 31 26 16 17 26 23 24 ...
## $ Hardness: int 97 139 143 95 143 189 114 63 123 132 ...
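The data set consists of 50 rows (observations) and 6 columns. The first column, X, holds a label for each sample; the remaining five columns (Oil, Density, Crispy, Fracture and Hardness) are the measured features.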
For the sake of comprehensibility we start working with a reduced,
two-dimensional toy data set, by extracting the columns Oil and Density.
In a subsequent section we return to the full data set for our analyses.
pca_toy <- food[, c("Oil", "Density")]
We start with an exploratory data analysis and examine a scatter plot.
plot(pca_toy)
The scatter plot above indicates a relationship between the feature
Oil and the feature Density. Note that the variables are not on the
same scale. In general, the results of PCA depend on the scale of the
variables. Therefore, each variable is typically centered and scaled to
have a mean of zero and standard deviation of one. In certain settings,
however, the variables are measured in the same units, and one may skip
the standardization.
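To illustrate why the scale matters, the following minimal sketch runs prcomp() on a synthetic two-column data set, once without and once with standardization. The values are randomly generated stand-ins chosen to resemble the ranges of Oil and Density; they are an assumption for illustration and not the food texture data.

```r
# Synthetic stand-in for two features on very different scales
# (hypothetical values, NOT the food texture data)
set.seed(1)
n <- 50
oil <- rnorm(n, mean = 17, sd = 1.5)
density <- 2850 - 60 * (oil - 17) + rnorm(n, mean = 0, sd = 80)
toy <- cbind(Oil = oil, Density = density)

# PCA on centered but unscaled data
pca_raw <- prcomp(toy, center = TRUE, scale. = FALSE)
# PCA on centered and scaled (standardized) data
pca_std <- prcomp(toy, center = TRUE, scale. = TRUE)

# Absolute loadings of the first principal component
abs(pca_raw$rotation[, "PC1"])
abs(pca_std$rotation[, "PC1"])
```

Without scaling, the first principal component points almost entirely along the feature with the larger variance (here Density). After standardization both features enter the first component with equal weight, which is always the case for two standardized, correlated variables.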
For the sake of comprehensibility we write two simple functions in R
and visualize the effect of each pre-processing step. The goal is to
center each column to a mean of zero and then scale it to unit variance.
For further usage we assign a proper variable name (pca_toy_data) to the
pre-processed data set.
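# Function for centering a vector (subtract the mean)
center <- function(v) {
  v - mean(v)
}
# Function for scaling a vector (divide by the standard deviation)
# Note: this definition masks base::scale() in the current workspace
scale <- function(v) {
  v / sd(v)
}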
## save helper functions for later usage
save(center, scale, file = "helper_functions_30300.RData")
# Apply the user-defined functions to each column of the data set
pca_toy_centered <- apply(pca_toy, MARGIN = 2, FUN = center)
pca_toy_scaled <- apply(pca_toy_centered, MARGIN = 2, FUN = scale)
### Plotting ###
par(mfrow = c(3, 2), mar = c(4, 4, 3, 1))
###############
# scatterplot 1
plot(pca_toy, main = "Raw data")
# calculate mean for visualization
data_mean <- apply(pca_toy, 2, mean)
points(data_mean[1], data_mean[2], col = "red", pch = 16) # mark mean
# boxplot 1
boxplot(pca_toy, main = "Raw data")
###############
# scatterplot 2
plot(pca_toy_centered, main = "Centered data")
# calculate mean for visualization
data_mean <- apply(pca_toy_centered, 2, mean)
points(data_mean[1], data_mean[2], col = "red", pch = 16) # mark mean
# boxplot 2
boxplot(pca_toy_centered, main = "Centered data")
###############
# scatterplot 3
plot(pca_toy_scaled, main = "Scaled and centered data")
# calculate mean for visualization
data_mean <- apply(pca_toy_scaled, 2, mean)
points(data_mean[1], data_mean[2], col = "red", pch = 16) # mark mean
# boxplot 3
boxplot(pca_toy_scaled, main = "Scaled and centered data")
# assign proper variable name to pre-processed data set
pca_toy_data <- pca_toy_scaled
# save for later usage
save(pca_toy_data, file = "pca_food_toy_30300.RData")
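As a sanity check, the two-step pre-processing can be compared with R's built-in standardization. The sketch below repeats the two helper functions so that it runs on its own and uses randomly generated stand-in data (an assumption for illustration; with the CSV file at hand, pca_toy could be used instead). Because the helper named scale masks the built-in function, the built-in one is called explicitly as base::scale().

```r
# Helper functions as defined in this section (repeated so the snippet is self-contained)
center <- function(v) v - mean(v)
scale <- function(v) v / sd(v) # masks base::scale() in the workspace

# Hypothetical stand-in data with two columns on different scales
set.seed(42)
x <- cbind(Oil = rnorm(50, mean = 17, sd = 1.5),
           Density = rnorm(50, mean = 2850, sd = 125))

# Manual pre-processing: center first, then scale
manual <- apply(apply(x, MARGIN = 2, FUN = center), MARGIN = 2, FUN = scale)

# Built-in standardization for comparison
builtin <- base::scale(x, center = TRUE, scale = TRUE)

# Column means should be numerically zero, standard deviations one
colMeans(manual)
apply(manual, MARGIN = 2, FUN = sd)

# Compare values only; base::scale() attaches extra attributes to its result
all.equal(manual, builtin, check.attributes = FALSE)
```

If the last call returns TRUE, the hand-written functions reproduce the built-in standardization, so either approach could be used for the pre-processing step.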
Citation
The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us by e-mail at soga[at]zedat.fu-berlin.de.
Please cite as follows: Hartmann, K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.