
PCA starts with data. In this section we analyse the food texture data set. This open source data set is available here and describes texture measurements of a pastry-type food.

The data set consists of 50 rows (observations) and 6 columns (features/variables). The features are:

For the sake of comprehensibility we start working with a reduced, two-dimensional toy data set, by extracting the columns Oil and Density. In a subsequent section we return to the full data set for our analyses.
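Extracting the two columns might look like the following minimal sketch. The data frame name `food`, the extra column and all values are assumptions for illustration only (they are not the real measurements); the result name `pca_toy` matches the object used in the code further down.

```r
# Hypothetical stand-in for the food texture data frame.
# The values are invented for illustration, not taken from the real data set.
food <- data.frame(
  Oil     = c(15.0, 18.0, 16.0, 17.0, 19.0),
  Density = c(3000, 2700, 2900, 2800, 2600),
  Crispy  = c(9, 13, 11, 12, 14)
)

# Keep only the two columns used for the toy example
pca_toy <- food[, c("Oil", "Density")]

dim(pca_toy)    # rows and columns of the reduced data set
names(pca_toy)  # remaining column names
```

Selecting columns by name rather than by position keeps the code valid if the column order in the source file changes.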

Scale the data

The scatter plot above indicates a relationship between the feature Oil and the feature Density. Note that the variables are not on the same scale. In general, the results of PCA depend on the scale of the variables. Therefore, each variable is typically centered and scaled to have a mean of zero and standard deviation of one. In certain settings, however, the variables are measured in the same units, and one may skip the standardization.
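The scale dependence can be illustrated with a small sketch on synthetic data. The variables `small` and `large` and their values are assumptions made up for this example; only `prcomp()` from base R's stats package is used.

```r
# Synthetic example: two correlated variables on very different scales
set.seed(1)
x <- rnorm(100)
y <- 0.5 * x + rnorm(100, sd = 0.5)
toy <- cbind(small = x, large = 1000 * y)

# PCA on centered but unscaled data versus centered and scaled data
pca_raw <- prcomp(toy, center = TRUE, scale. = FALSE)
pca_std <- prcomp(toy, center = TRUE, scale. = TRUE)

# Loadings of the first principal component
pca_raw$rotation[, 1]  # dominated by the variable with the larger variance
pca_std$rotation[, 1]  # both variables contribute equally in magnitude
```

Without scaling, the first principal component points almost entirely along the variable with the larger numeric range. After standardization, both variables carry equal weight.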

To make each pre-processing step transparent, we write two simple functions in R and visualize their effects. The goal is to center each column to zero mean and then scale it to unit variance. For later use we assign the pre-processed data set a proper variable name (pca_toy_data).

# Function for centering a vector
center <- function(v) {
  v - mean(v)
}
# Function for scaling a vector to unit standard deviation
# (note: this definition masks base R's scale() in the current session)
scale <- function(v) {
  v / sd(v)
}

## save helper functions for later usage
save(center, scale, file = "helper_functions_30300.RData")

# Apply the user-defined functions to each column of the data set
pca_toy_centered <- apply(pca_toy, MARGIN = 2, FUN = center)
pca_toy_scaled <- apply(pca_toy_centered, MARGIN = 2, FUN = scale)

### Plotting ###
par(mfrow = c(3, 2), mar = c(4, 4, 3, 1))
###############
# scatterplot 1
plot(pca_toy, main = "Raw data")
# calculate mean for visualization
data_mean <- apply(pca_toy, 2, mean)
points(data_mean[1], data_mean[2], col = "red", pch = 16) # mark mean
# boxplot 1
boxplot(pca_toy, main = "Raw data")
###############

# scatterplot 2
plot(pca_toy_centered, main = "Centered data")
# calculate mean for visualization
data_mean <- apply(pca_toy_centered, 2, mean)
points(data_mean[1], data_mean[2], col = "red", pch = 16) # mark mean
# boxplot 2
boxplot(pca_toy_centered, main = "Centered data")
###############

# scatterplot 3
plot(pca_toy_scaled, main = "Scaled and centered data")
# calculate mean for visualization
data_mean <- apply(pca_toy_scaled, 2, mean)
points(data_mean[1], data_mean[2], col = "red", pch = 16) # mark mean
# boxplot 3
boxplot(pca_toy_scaled, main = "Scaled and centered data")

# assign proper variable name to pre-processed data set
pca_toy_data <- pca_toy_scaled

# save for later usage
save(pca_toy_data, file = "pca_food_toy_30300.RData")
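
As a sanity check, the two-step standardization can be compared with R's built-in `base::scale()`. The sketch below uses randomly generated stand-in data (`demo` is a hypothetical object, not the food texture measurements) and repeats the centering and scaling steps inline so that it runs on its own.

```r
# Stand-in data: random values, not the food texture measurements
set.seed(42)
demo <- cbind(Oil     = rnorm(50, mean = 17, sd = 1.5),
              Density = rnorm(50, mean = 2850, sd = 120))

# The same two steps as above, written inline
demo_centered <- apply(demo, MARGIN = 2, FUN = function(v) v - mean(v))
demo_scaled   <- apply(demo_centered, MARGIN = 2, FUN = function(v) v / sd(v))

# Column means should be numerically zero, standard deviations one
round(colMeans(demo_scaled), 10)
apply(demo_scaled, 2, sd)

# base::scale() performs both steps at once. The base:: prefix is needed
# because the user-defined scale() above masks it in the global environment.
demo_base <- base::scale(demo, center = TRUE, scale = TRUE)
all.equal(demo_scaled, demo_base, check.attributes = FALSE)
```

The `check.attributes = FALSE` argument is used because `base::scale()` attaches the column means and standard deviations as attributes to its result, which the manual version does not have.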

Citation

The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us by e-mail at soga[at]zedat.fu-berlin.de.

You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follows: Hartmann, K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.