In predictive modelling PCA is particular useful as a data pre-processing technique. PCA serves as a tool for exploratory data analysis and outlier detection, but as well for dimensionality reduction when the number of variables outnumbers the sample size $$(d > n )$$. Beyond that PCA is often applied on data sets with highly redundant variables, or in other words of highly correlated variables (problem of multicollinearity). Multicollinearity is a problem because it causes instability in regression models. The redundant information inflates the variance of the parameter estimates which can cause them to be statistically insignificant when they would have been significant otherwise (Kerns 2010).

In a subsequent sections we will apply PCA as a data pre-processing technique and combine it with a $$L_2$$-regularized logistic regression model for solving a classification problem.