In this example we perform a factor analysis using the scikit-learn library.
We assume that our data was generated by a linear transformation of a lower-dimensional data set, with an overlay of white noise. Factor analysis allows us to retrieve these underlying factors and thus to reduce the dimensionality of our data.
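In symbols, this is the standard factor model (written here only as a sketch, with $\mathbf f$ denoting the latent factors, $\Lambda$ the loading matrix and $\boldsymbol\epsilon$ the white-noise term):

$$\mathbf x = \boldsymbol\mu + \Lambda \mathbf f + \boldsymbol\epsilon, \qquad \mathbf f \sim \mathcal{N}(0, I), \qquad \boldsymbol\epsilon \sim \mathcal{N}(0, \Psi),$$

with $\Psi$ a diagonal covariance matrix, so that all dependence between the observed variables is carried by the factors.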
Let's import the needed tools and get going. We will give further explanation along the way.
from pandas import read_csv, Series, DataFrame
from sklearn.decomposition import FactorAnalysis
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import numpy as np
Let us get our hands dirty and apply a factor analysis to the food-texture data set. We already discussed this data set in the section on principal component analysis, so you are probably familiar with it.
This open source data set, available at the URL used below, describes texture measurements of a pastry-type food.
food = read_csv("https://userpage.fu-berlin.de/soga/300/30100_data_sets/food-texture.csv", index_col=0)
food.head()
| | Oil | Density | Crispy | Fracture | Hardness |
|---|---|---|---|---|---|
| B110 | 16.5 | 2955 | 10 | 23 | 97 |
| B136 | 17.7 | 2660 | 14 | 9 | 139 |
| B171 | 16.2 | 2870 | 12 | 17 | 143 |
| B192 | 16.7 | 2920 | 10 | 31 | 95 |
| B225 | 16.3 | 2975 | 11 | 26 | 143 |
The data set consists of 50 rows (observations) and 5 columns (features/variables): Oil, Density, Crispy, Fracture and Hardness. The Crispy variable, for example, is measured on a scale from 7 to 15, with 15 being more crispy.

The class FactorAnalysis of the scikit-learn package provides the functionality for a factor analysis. When instantiating the class, we can pass it the desired number of factors.
FactorAnalysis(n_components=<factors>)

Replace `<factors>` with the desired number of factors.
With this class we can perform a maximum-likelihood factor analysis on a data matrix, specifying the desired number of factors.
The additional argument `rotation` specifies a transformation of the factors: either `"varimax"` or `"quartimax"`, two types of orthogonal rotation, or `None` (default) for no rotation.
Choosing a reasonable number of factors is a tricky aspect of factor analysis. If we already have some understanding of the system that generated our data, we may make an educated guess about the number of latent variables.
If we do not know much about our data other than that the number of variables is not too large, we may simply try several values to initialize the model. In most cases, though, and to do our due diligence, we should use a more sophisticated approach and perform a principal component analysis (PCA) to get a good initial estimate of the number of factors.
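If one did want to follow that route here, a minimal sketch could look like the following; the 90 % variance threshold is an arbitrary, purely illustrative choice and not part of the original analysis.

from sklearn.decomposition import PCA

# standardize the food data (done again below for the factor analysis itself)
X_std = StandardScaler().fit_transform(food)
pca = PCA().fit(X_std)
# cumulative fraction of the total variance explained by the first k components
cum_var = np.cumsum(pca.explained_variance_ratio_)
print(cum_var.round(3))
# number of components needed to reach the (illustrative) 90 % threshold
print(int(np.searchsorted(cum_var, 0.90) + 1))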
Since we have already explained the PCA, we will not repeat it here and just make a guess, setting the number of factors to `factors = 2`. Furthermore, we run the analysis both with `rotation` set to `"varimax"` and with the default value `rotation = None`.
Let's compute this and plot it using matplotlib.
X = StandardScaler().fit_transform(food) # Standardize the data
factors = 2
# a list of 2 tuples, each containing a title and an instance of the FactorAnalysis class
fas = [
    ("FA no rotation", FactorAnalysis(n_components=factors)),
    ("FA varimax", FactorAnalysis(n_components=factors, rotation="varimax")),
]
# Let's prepare some plots on one canvas (subplots)
fig, axes = plt.subplots(ncols=len(fas), figsize=(10, 8))
'''
And loop over the variants of our analysis `fas`, zipped with the
plot axes `axes`
'''
for ax, (title, fa) in zip(axes, fas):
    # Fit the model to the standardized food data
    fa = fa.fit(X)
    # and transpose the component (loading) matrix
    factor_matrix = fa.components_.T
    # Plot the data as a heat map
    im = ax.imshow(factor_matrix, cmap="RdBu_r", vmax=1, vmin=-1)
    # and add the corresponding value to the center of each cell
    for (i, j), z in np.ndenumerate(factor_matrix):
        ax.text(j, i, str(z.round(2)), ha="center", va="center")
    # Tell matplotlib about the metadata of the plot
    ax.set_yticks(np.arange(len(food.columns)))
    if ax.get_subplotspec().is_first_col():
        ax.set_yticklabels(food.columns)
    else:
        ax.set_yticklabels([])
    ax.set_title(title)
    ax.set_xticks([0, 1])
    ax.set_xticklabels(["Factor 1", "Factor 2"])
# and squeeze the axes tight, to save space
plt.tight_layout()
# and add a colorbar
cb = fig.colorbar(im, ax=axes, location='right', label="loadings")
# show us the plot
plt.show()
Before we interpret the results of the factor analysis, let us recall the basic idea behind it. Factor analysis creates linear combinations of factors to abstract the variables' underlying communality. This reduces the dimensionality of the data set while preserving most of the variance.
This allows us to aggregate a large number of observable variables in a model to represent an underlying concept, making it easier to comprehend the data.
The variability in a data set $\mathbf X$ is given by $\Sigma$, and its estimate $\hat \Sigma$ is composed of the variability explained by a linear combination of the factors, which we call communality, and of the remaining variability that cannot be explained by a linear combination of the factors, namely the uniqueness.
$$\hat \Sigma = \underbrace{\hat\Lambda\hat\Lambda^T}_{\text{communality}} + \underbrace{\hat\Psi}_{\text{uniqueness}}$$

Now let's have a look at our results:
In the plot above we can see the loadings, which range from −1 to 1. This part is represented by the $\hat \Lambda$ in the equation above. The loadings are the contribution of each original variable to the factor. Variables with loading values further away from 0 have a larger part of their variability explained by that factor.
Now we will check the uniqueness of each variable. Note that, going forward, we will only use the `varimax` rotation method.
The next plot is generated straight from a pandas Series, using its `.plot()` method. This provides easy access to quick plots.
fa = FactorAnalysis(n_components = 2, rotation="varimax")
fa.fit(X)
uniqueness = Series(fa.noise_variance_, index=food.columns)
uniqueness.plot(
kind="bar",
ylabel="Uniqueness"
)
<AxesSubplot:ylabel='Uniqueness'>
The uniqueness ranges from 0 to 1 and is sometimes also referred to as (white) noise. It corresponds to the proportion of variability that cannot be explained by a linear combination of the factors. This part is represented by the $\hat\Psi$ in the equation above. A high uniqueness for a variable indicates that the factors do not account well for its variance.

The counterpart of the uniqueness is the communality:
# Communality
communality = Series(np.square(fa.components_.T).sum(axis=1), index=food.columns)
communality.plot(
kind="bar",
ylabel="communality"
)
<AxesSubplot:ylabel='communality'>
By summing the squared loadings of each variable we can compute the fraction of its total variance that is explained by the factors (Teetor 2011). This proportion of the variability is denoted as communality.
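Written out for a single standardized variable $i$ with loadings $\hat\lambda_{i1}, \dots, \hat\lambda_{ik}$ on the $k$ factors, this is

$$h_i^2 = \sum_{j=1}^{k} \hat\lambda_{ij}^2,$$

which is exactly what the np.square(...).sum(axis=1) expression above computes.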
Another way to calculate the uniqueness, once you have already computed the communality, is to subtract the latter from 1. An appropriate factor model results in low values for uniqueness and high values for communality. So if we see bad results for our model, we could try a different number of underlying factors (latent variables).
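One simple way to compare such candidate models, not part of the original walk-through, is to cross-validate the average log-likelihood that FactorAnalysis reports through its score() method; the range of 1 to 4 factors and the 5-fold split below are illustrative choices.

from sklearn.model_selection import cross_val_score

# average held-out log-likelihood for different numbers of latent factors;
# larger values indicate a better-fitting model
for k in range(1, 5):
    fa_k = FactorAnalysis(n_components=k, rotation="varimax")
    scores = cross_val_score(fa_k, X, cv=5)
    print(k, round(scores.mean(), 2))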
#and back to uniqueness
(1 - communality).plot(kind="bar", ylabel="uniqueness")
<AxesSubplot:ylabel='uniqueness'>
Recall the factor analysis model: $\hat \Sigma = \hat\Lambda\hat\Lambda^T +\hat\Psi$
Using our factor model `fa` we may calculate $\hat \Sigma$ and compare it to the observed correlation matrix S by simple matrix algebra.
We use numpy to perform fast and efficient math operations. Note: We can also use pandas, since it "wraps" around numpy, basically just forwarding our commands to the library.
# 'lambda' is a reserved keyword in Python, so we append an underscore to the name
lambda_ = fa.components_
psi = np.diag(uniqueness)
s = np.corrcoef(np.transpose(X))
sigma = np.matmul(lambda_.T, lambda_) + psi
residuals = (s - sigma)
We subtracted the fitted correlation matrix $\hat \Sigma$ (`sigma`) from the observed correlation matrix S. The resulting matrix is called the residual matrix. Numbers close to 0 indicate that our factor model is a good representation of the underlying system.

Now let's plot the results.
fig, ax = plt.subplots()  # new figure, so the colorbar below attaches to it
im = ax.imshow(residuals, cmap="RdBu_r", vmin=-1, vmax=1)
ax.tick_params(axis="x", bottom=False, labelbottom=False, top=True, labeltop=True)
ax.set_xticks(range(5))
ax.set_xticklabels(food.columns)
ax.set_yticks(range(5))
ax.set_yticklabels(food.columns)
# write the residual value into the center of each cell
for (i, j), z in np.ndenumerate(residuals):
    ax.text(j, i, str(z.round(3)), ha="center", va="center")
fig.colorbar(im, ax=ax, location='right')
ax.set_title("FA residual matrix")
plt.tight_layout()
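If we want to condense the residual matrix into a single number, one option (an addition of ours, not part of the original analysis) is the root mean square of the off-diagonal residuals; again, values close to 0 indicate a good fit.

# root mean square of the off-diagonal residuals
off_diag = residuals[~np.eye(residuals.shape[0], dtype=bool)]
print(np.sqrt(np.mean(np.square(off_diag))).round(3))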
The purpose of a rotation is to produce more extreme loadings. The idea behind this is to give meaning to the factors. This can help with their interpretation. From a mathematical viewpoint, there is no difference between a rotated and unrotated matrix. The fitted model is the same, the uniquenesses are the same, and the proportion of variance explained is the same.
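We can check the last claim numerically: in the following sketch (not part of the original walk-through) the communalities of an unrotated and a varimax-rotated fit agree up to numerical precision, because an orthogonal rotation leaves the row sums of squared loadings unchanged.

# communalities (sum of squared loadings per variable) are unchanged by rotation
fa_none = FactorAnalysis(n_components=2, rotation=None).fit(X)
fa_vmax = FactorAnalysis(n_components=2, rotation="varimax").fit(X)
comm_none = np.square(fa_none.components_.T).sum(axis=1)
comm_vmax = np.square(fa_vmax.components_.T).sum(axis=1)
print(np.allclose(comm_none, comm_vmax))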
Here we fit three factor models, one with no rotation, one with `varimax` rotation, and one with `quartimax` rotation. We then make a scatter plot of the first and second loadings.
methods = [
    ("FA No rotation", FactorAnalysis(n_components=2)),
    ("FA Varimax", FactorAnalysis(n_components=2, rotation="varimax")),
    ("FA Quartimax", FactorAnalysis(n_components=2, rotation="quartimax")),
]
fig, axes = plt.subplots(ncols=3, figsize=(10, 8), sharex=True, sharey=True)
for ax, (method, fa) in zip(axes, methods):
    fa = fa.fit(X)
    components = fa.components_
    vmax = np.abs(components).max()
    ax.scatter(components[0, :], components[1, :])
    ax.axhline(0, -1, 1, color='k')
    ax.axvline(0, -1, 1, color='k')
    for i, j, z in zip(components[0, :], components[1, :], food.columns):
        ax.text(i + .02, j + .02, str(z), ha="center")
    ax.set_title(str(method))
    # components_[0] holds the loadings on factor 1 (plotted on the x-axis),
    # components_[1] those on factor 2 (plotted on the y-axis)
    if ax.get_subplotspec().is_first_col():
        ax.set_ylabel("Factor 2")
    ax.set_xlabel("Factor 1")
plt.tight_layout()
plt.show()
How can I interpret these factors?
If two variables both have loadings far away from 0 for the same factor, we know they are related. We want to understand the data in order to give meaningful names to the latent variables.
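As a small helper for this kind of reading, the sketch below (an addition of ours) lists, for a freshly fitted varimax model, the variables whose absolute loading on a factor exceeds a purely illustrative cut-off of 0.5.

# group the variables by factor, keeping only |loading| > 0.5 (illustrative cut-off)
fa_varimax = FactorAnalysis(n_components=2, rotation="varimax").fit(X)
loadings = DataFrame(fa_varimax.components_.T, index=food.columns,
                     columns=["Factor 1", "Factor 2"])
for factor in loadings.columns:
    strong = loadings[factor][loadings[factor].abs() > 0.5]
    print(factor, strong.round(2).to_dict())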
Taking a look at the plot in the middle (FA Varimax) above, it appears that Factor 1 describes a variable that makes pastry soft and less crisp. This description fits flaky pastry rather well. The loadings in Factor 2, on the other hand, show high density and little oil. This would fit the classification of hot water crust pastry.
Citation
The E-Learning project SOGA-Py was developed at the Department of Earth Sciences by Annette Rudolph, Joachim Krois and Kai Hartmann. You can reach us via mail at soga[at]zedat.fu-berlin.de.

Please cite as follows: Rudolph, A., Krois, J., Hartmann, K. (2023): Statistics and Geodata Analysis using Python (SOGA-Py). Department of Earth Sciences, Freie Universitaet Berlin.