Consider a problem with $$n$$ observations (measurements) on a set of $$d$$ features, $$x_1, x_2, \ldots, x_d$$. If $$d$$ is large, exploratory data analysis becomes challenging because, for example, our ability to visualize data is limited to two or three dimensions. We may explore the data set by examining two-dimensional scatter plots, each of which shows the $$n$$ observations on two of the $$d$$ features. However, there are $${d \choose 2} = d(d-1)/2$$ such scatter plots. A data set with $$d=15$$ features already requires 105 plots! Moreover, it is likely that none of them will be informative on its own, since each contains just a small fraction of the total information present in the data set. Hence, we are looking for a low-dimensional $$(k \ll d)$$ representation of the data that captures as much of the information as possible.
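To get a feeling for how quickly the number of pairwise scatter plots grows, the count $$d(d-1)/2$$ can be evaluated for a few values of $$d$$. The following is a minimal sketch; the function name `n_pairwise_scatter_plots` is made up for this illustration.

```python
from math import comb


def n_pairwise_scatter_plots(d: int) -> int:
    """Number of distinct two-feature scatter plots for d features, i.e. d choose 2."""
    return comb(d, 2)


# Print the number of plots for a few feature counts.
for d in (3, 15, 100):
    print(f"d = {d:3d} -> {n_pairwise_scatter_plots(d)} scatter plots")
```

The count grows quadratically in $$d$$, so inspecting every pair by eye quickly becomes impractical.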

Beyond exploratory data analysis, dimensionality reduction becomes important if the features of a given data set are redundant, in other words highly correlated (multicollinearity). Multicollinearity is a problem because it causes instability in regression models: the redundant information inflates the variance of the parameter estimates, which can make them statistically insignificant when they would otherwise have been significant (Kerns 2010). Hence, we are looking for a low-dimensional representation of the data in which the features are uncorrelated with one another.
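The variance inflation can be illustrated numerically. For ordinary least squares with noise variance $$\sigma^2$$, the covariance of the coefficient estimates is $$\sigma^2 (X^\top X)^{-1}$$ (with $$X$$ centered), so the diagonal of $$(X^\top X)^{-1}$$ shows how strongly each estimate's variance is scaled. The sketch below compares two simulated data sets, one with nearly independent features and one with strongly correlated features. The data, the correlation level of 0.95, and the helper name `coef_variance_factors` are assumptions chosen for this illustration only.

```python
import numpy as np


def coef_variance_factors(X: np.ndarray) -> np.ndarray:
    """Diagonal of (Xc^T Xc)^{-1} for the column-centered matrix Xc.

    Under the usual OLS assumptions, the variance of each coefficient
    estimate equals sigma^2 times the corresponding entry.
    """
    Xc = X - X.mean(axis=0)
    return np.diag(np.linalg.inv(Xc.T @ Xc))


rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible
n = 500
x1 = rng.normal(size=n)
noise = rng.normal(size=n)

# Second feature: either (nearly) independent of x1, or strongly correlated with it.
x2_independent = noise
x2_correlated = 0.95 * x1 + np.sqrt(1 - 0.95**2) * noise

var_independent = coef_variance_factors(np.column_stack([x1, x2_independent]))
var_correlated = coef_variance_factors(np.column_stack([x1, x2_correlated]))

print("variance factor for x1, independent features:", var_independent[0])
print("variance factor for x1, correlated features: ", var_correlated[0])
print("inflation ratio:", var_correlated[0] / var_independent[0])
```

For two features with correlation $$r$$, the inflation is approximately $$1/(1-r^2)$$, so a correlation of 0.95 inflates the variance roughly tenfold.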

One technique that provides such a dimensionality reduction is Principal Component Analysis (PCA), which projects a high-dimensional feature space onto a new feature space. The original explanatory variables are replaced with new variables, derived from the original ones, that are by design uncorrelated with one another, thus eliminating the redundancy.

The main idea behind PCA is that not all of the $$d$$ dimensions of the original data set are equally informative, where informativeness is measured by the variability of the data along a given direction in feature space, that is, its variance. More precisely, PCA finds the directions of maximum variance in the high-dimensional data and projects the data onto a lower-dimensional subspace while retaining most of the information. Each of the directions found by PCA (a principal component) is a linear combination of the $$d$$ features.
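The ideas above can be sketched in a few lines of NumPy. This is one common way to compute PCA, via the eigendecomposition of the sample covariance matrix, and is meant as an illustration under stated assumptions rather than a reference implementation: the function name `pca`, the simulated data, and the choice $$k=2$$ are all made up for this example.

```python
import numpy as np


def pca(X: np.ndarray, k: int):
    """Project X (n observations x d features) onto its k directions of largest variance.

    Returns the projected data (n x k), the directions (d x k), and the
    fraction of total variance captured by each of the k directions.
    """
    Xc = X - X.mean(axis=0)                 # center each feature
    cov = np.cov(Xc, rowvar=False)          # d x d sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]       # reorder to descending variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    components = eigvecs[:, :k]             # each column is a linear combination of the d features
    scores = Xc @ components                # new, lower-dimensional representation
    explained_ratio = eigvals[:k] / eigvals.sum()
    return scores, components, explained_ratio


# Simulated data (an assumption for this sketch): 5 observed features
# that are noisy linear mixtures of only 2 underlying variables.
rng = np.random.default_rng(42)
n, d = 200, 5
latent = rng.normal(size=(n, 2))
mixing = rng.normal(size=(2, d))
X = latent @ mixing + 0.1 * rng.normal(size=(n, d))

scores, components, explained_ratio = pca(X, k=2)

print("fraction of variance captured by 2 components:", explained_ratio.sum())
print("covariance of the new features:")
print(np.round(np.cov(scores, rowvar=False), 6))
```

The covariance matrix of the projected data is diagonal up to rounding error, which reflects that the new features are uncorrelated with one another, and the entries of `explained_ratio` indicate how much of the total variance each direction retains.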