Let us consider a problem with $m$ observations/measurements on a set of $n$ variables, $x_1, x_2, \ldots, x_n$. If $n$ is large, exploratory data analysis becomes challenging, as for example our ability to visualize data is limited to 2 or 3 dimensions. We may explore the data set by examining two-dimensional scatter plots, each of which contains $m$ observations and two of the $n$ variables. However, there are $\binom{n}{2} = n(n-1)/2$ such scatter plots. If a data set has $n = 15$ variables, there are 105 plots to draw! Moreover, it is very likely that none of them will be informative, since each of them contains just a small fraction of the total information present in the data set. Hence, we are looking for a low-dimensional $(d \ll n)$ representation of the data that captures as much of the information as possible.
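As a quick check of this arithmetic, the number of pairwise scatter plots can be computed directly; the snippet below is a minimal sketch using Python's built-in `math.comb` and the example value $n = 15$ from above.

```python
# Count the pairwise scatter plots needed for n variables:
# binomial coefficient C(n, 2) = n * (n - 1) / 2
from math import comb

n = 15
n_plots = comb(n, 2)
print(f"{n} variables -> {n_plots} pairwise scatter plots")  # 15 variables -> 105 pairwise scatter plots
```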
Besides exploratory data analysis, dimensionality reduction becomes important if the features of a given data set are redundant or, in other words, highly correlated (multicollinearity). Multicollinearity is a problem because it causes instability in regression models: the redundant information inflates the variance of the parameter estimates, which can cause them to be statistically insignificant when they would otherwise have been significant (Kerns 2010). Hence, we are looking for a low-dimensional representation of the data in which the features are uncorrelated with one another.
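To make the notion of redundancy concrete, the following sketch builds a small synthetic data set in which one feature is a noisy copy of another and inspects the correlation matrix; the variable names and data are purely illustrative assumptions, not part of any SOGA data set.

```python
# Minimal sketch: detecting multicollinearity with a correlation matrix
# (synthetic data for illustration only).
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)   # nearly redundant with x1
x3 = rng.normal(size=200)                   # unrelated to x1 and x2

df = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})
print(df.corr().round(2))   # the correlation of x1 and x2 is close to 1
```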
One technique that provides such a dimensionality reduction is Principal Component Analysis (PCA), which projects a high-dimensional variable space onto a new feature space. The original explanatory variables are replaced with new variables (features), derived from the original ones, that are by design uncorrelated with one another, thus eliminating the redundancy.
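The following sketch illustrates this property on synthetic data, assuming scikit-learn's `PCA` as the implementation (an assumption for illustration): the correlation matrix of the derived features is, up to numerical precision, the identity matrix.

```python
# Minimal sketch: features derived by PCA are uncorrelated by construction
# (synthetic data and scikit-learn used for illustration only).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
X = np.column_stack([x1,
                     x1 + rng.normal(scale=0.1, size=200),  # highly correlated with x1
                     rng.normal(size=200)])

scores = PCA().fit_transform(StandardScaler().fit_transform(X))
# Off-diagonal entries are (numerically) zero: the new features are uncorrelated
print(np.round(np.corrcoef(scores, rowvar=False), 2))
```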
The main idea behind PCA is that not all of the $n$ dimensions of the original data set are equally informative, where informativeness is measured by the variability along each dimension of the variable space, i.e. its variance. More precisely, PCA finds the directions of maximum variance in high-dimensional data and projects the data onto a lower-dimensional subspace while retaining most of the information. Each of the dimensions found by PCA is effectively a linear combination of the $n$ original variables.
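A minimal sketch of these two points, again assuming scikit-learn's `PCA` and synthetic data (illustrative assumptions): `explained_variance_ratio_` reports the share of total variance along each direction found by PCA, and `components_` gives each direction as a linear combination of the original variables.

```python
# Minimal sketch: variance captured per PCA direction and the directions
# themselves as linear combinations of the original variables
# (synthetic data for illustration only).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
x1 = rng.normal(size=300)
x2 = 0.8 * x1 + rng.normal(scale=0.3, size=300)  # partly redundant with x1
x3 = rng.normal(size=300)
X = StandardScaler().fit_transform(np.column_stack([x1, x2, x3]))

pca = PCA().fit(X)
print(pca.explained_variance_ratio_.round(3))  # share of total variance per component
print(pca.components_.round(3))                # rows: weights of x1, x2, x3 in each component
```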
Citation
The E-Learning project SOGA-Py was developed at the Department of Earth Sciences by Annette Rudolph, Joachim Krois and Kai Hartmann. You can reach us via e-mail at soga[at]zedat.fu-berlin.de.
Please cite as follows: Rudolph, A., Krois, J., Hartmann, K. (2023): Statistics and Geodata Analysis using Python (SOGA-Py). Department of Earth Sciences, Freie Universitaet Berlin.