In predictive modelling PCA is particularly useful as a data pre-processing technique. PCA serves as a tool for exploratory data analysis and outlier detection, as well as for dimensionality reduction when the number of variables exceeds the sample size (\(d>n\)). Beyond that, PCA is often applied to data sets with highly redundant, or in other words highly correlated, variables (the problem of multicollinearity). Multicollinearity is a problem because it causes instability in regression models: the redundant information inflates the variance of the parameter estimates, which can cause them to be statistically insignificant when they would have been significant otherwise (Kerns 2010).
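
To make the effect of redundant variables on the variance of the estimates more tangible, here is a minimal sketch in Python using only the standard library. It is an illustration and not part of the original material: it relies on the variance inflation factor (VIF), a common diagnostic that the text above does not mention. It assumes a linear regression with exactly two predictors, for which the VIF reduces to \(1/(1-r^2)\), where \(r\) is the correlation between the two predictors. The data are simulated, and the helper names `pearson` and `vif_two_predictors` are hypothetical.

```python
import math
import random


def pearson(x, y):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def vif_two_predictors(x1, x2):
    """Variance inflation factor for a linear model with exactly two predictors.

    Assumption: with two predictors, the R^2 obtained by regressing one
    predictor on the other equals r^2, so VIF = 1 / (1 - r^2).
    """
    r = pearson(x1, x2)
    return 1.0 / (1.0 - r ** 2)


# Simulated data (illustrative only, not taken from the tutorial).
random.seed(0)
n = 200
x1 = [random.gauss(0, 1) for _ in range(n)]
noise = [random.gauss(0, 1) for _ in range(n)]

# A nearly redundant copy of x1 versus an unrelated variable.
x2_redundant = [a + 0.1 * e for a, e in zip(x1, noise)]
x2_independent = noise

vif_redundant = vif_two_predictors(x1, x2_redundant)
vif_independent = vif_two_predictors(x1, x2_independent)

print(f"VIF with a redundant predictor:    {vif_redundant:.1f}")
print(f"VIF with an independent predictor: {vif_independent:.2f}")
```

A VIF close to 1 means the variance of a coefficient estimate is hardly affected by the other predictor, whereas a large value indicates that the estimate is much less precise. Because principal components are uncorrelated by construction, replacing the original variables with them removes this source of inflation.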

In a subsequent section we will apply PCA as a data pre-processing technique and combine it with an \(L_2\)-regularized logistic regression model to solve a classification problem.


Citation

The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us by email at soga[at]zedat.fu-berlin.de.

Creative Commons License
You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follows: Hartmann, K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.