A simple example of factor analysis in Python

In this example we perform a factor analysis using the scikit-learn library.

We assume that our data was generated by a linear transformation of a lower-dimensional data set, with an overlay of white noise. Factor analysis allows us to recover these underlying factors and thus to reduce the dimensionality of our data.
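To make this assumption concrete, here is a tiny, purely illustrative simulation (all dimensions and the noise level are made up, and the names F, W and X_sim are ours) of data arising as a linear transformation of lower-dimensional factors plus white noise:

import numpy as np

rng = np.random.default_rng(0)
n_samples, n_factors, n_features = 100, 2, 5   # made-up sizes
F = rng.normal(size=(n_samples, n_factors))    # latent, lower-dimensional factors
W = rng.normal(size=(n_factors, n_features))   # linear transformation (loadings)
noise = 0.1 * rng.normal(size=(n_samples, n_features))  # overlay of white noise
X_sim = F @ W + noise                          # observed, higher-dimensional data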

Let's import the needed tools and get going. We will give further explanation along the way.

In [1]:
from pandas import read_csv, Series, DataFrame
from sklearn.decomposition import FactorAnalysis
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import numpy as np

The data set

Let us get our hands dirty and apply a factor analysis to the food-texture data set. We already discussed this data set in the section on principal component analysis, so you may already be familiar with it.

This open source data set is available here and describes texture measurements of a pastry-type food.

In [2]:
food = read_csv("https://userpage.fu-berlin.de/soga/300/30100_data_sets/food-texture.csv", index_col=0)
food.head()
Out[2]:
        Oil  Density  Crispy  Fracture  Hardness
B110   16.5     2955      10        23        97
B136   17.7     2660      14         9       139
B171   16.2     2870      12        17       143
B192   16.7     2920      10        31        95
B225   16.3     2975      11        26       143

The data set consists of 50 rows (observations) and 5 columns (features/variables). The features are:

  • Oil: percentage oil in the pastry
  • Density: the product’s density (the higher the number, the more dense the product)
  • Crispy: a crispiness measurement, on a scale from 7 to 15, with higher values indicating a crispier product.
  • Fracture: the angle, in degrees, through which the pastry can be slowly bent before it fractures.
  • Hardness: a sharp point is used to measure the amount of force required before breakage occurs.
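If you would like to double-check these numbers yourself, a quick optional inspection with pandas (using the food DataFrame loaded above) looks like this:

print(food.shape)  # expected: (50, 5)
food.describe()    # summary statistics for the five features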

The FactorAnalysis class of scikit-learn

The FactorAnalysis class of the scikit-learn package provides the functionality we need for a factor analysis. When instantiating the class, we can pass it the desired number of factors.

FactorAnalysis(n_components = <factors>)

Replace <factors> with the desired number of factors.

With this class we can perform a maximum-likelihood factor analysis on a data matrix, specifying the desired number of factors. The additional argument rotation specifies the transformation of the factors: either "varimax" or "quartimax" (two types of orthogonal rotation), or None (default) for no rotation.
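For instance, a model with two factors and a varimax rotation would be set up like this (the variable name fa_example is purely illustrative):

fa_example = FactorAnalysis(n_components=2, rotation="varimax")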

Getting a reasonable value for the number of factors is a tricky aspect of factor analysis. If we already have some understanding of the system that created our data, we may make an educated guess about the number of latent variables.

If we do not know much about our data other than that the number of variables is not too large, we may simply try several values to initialize the model. In most cases, though, and to do our due diligence, we should use a more sophisticated approach: perform a principal component analysis (PCA) to get a good initial estimate of the number of factors.
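As a rough sketch of that approach (not part of the original walkthrough), one could inspect the cumulative explained variance ratio of a PCA on the standardized data and look for the point where it levels off:

from sklearn.decomposition import PCA

#  PCA on the standardized food data; the cumulative explained variance
#  ratio hints at how many factors are worth keeping.
pca = PCA().fit(StandardScaler().fit_transform(food))
print(pca.explained_variance_ratio_.cumsum().round(2))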

Since we have already explained PCA, we will not repeat it here; we simply make a guess and set the number of factors to factors = 2. Furthermore, we try the analysis with the rotation set to varimax as well as with the default value rotation = None.

Let's compute this and plot it using matplotlib.

In [3]:
X = StandardScaler().fit_transform(food)  # Standardize the data
factors = 2
#  A list of 2 tuples, each containing a title and an instance of the FactorAnalysis class
fas = [
    ("FA no rotation", FactorAnalysis(n_components = factors)),
    ("FA varimax", FactorAnalysis(n_components = factors, rotation="varimax")),
]  

#  Let's prepare some plots on one canvas (subplots)
fig, axes = plt.subplots(ncols=len(fas), figsize=(10, 8))

#  Loop over the variants of our analysis `fas`, zipped with the
#  plot axes `axes`
for ax, (title, fa) in zip(axes, fas):
    #  Fit the model to the standardized food data
    fa = fa.fit(X)
    #  and transpose the component (loading) matrix
    factor_matrix = fa.components_.T
    #  Plot the data as a heat map
    im = ax.imshow(factor_matrix, cmap="RdBu_r", vmax=1, vmin=-1)
    #  and add the corresponding value to the center of each cell
    for (i,j), z in np.ndenumerate(factor_matrix):
        ax.text(j, i, str(z.round(2)), ha="center", va="center")
    #  Tell matplotlib about the metadata of the plot
    ax.set_yticks(np.arange(len(food.columns)))
    if ax.get_subplotspec().is_first_col():
        ax.set_yticklabels(food.columns)
    else:
        ax.set_yticklabels([])
    ax.set_title(title)
    ax.set_xticks([0, 1])
    ax.set_xticklabels(["Factor 1", "Factor 2"])
    #  and squeeze the axes tight, to save space
    plt.tight_layout()
    
#  and add a colorbar
cb = fig.colorbar(im, ax=axes, location='right', label="loadings")
#  show us the plot
plt.show()

Interpretation of the results

Before we interpret the results of the factor analysis, let us recall the basic idea behind it. Factor analysis describes the observed variables as linear combinations of a small number of latent factors, capturing the variables' shared variability (their communality). This reduces the number of dimensions of the data set while preserving most of the variance.

This allows us to aggregate a large number of observable variables in a model to represent an underlying concept, making it easier to comprehend the data.

The variability in a data set $\mathbf X$ is given by its covariance matrix $\Sigma$. Its estimate $\hat \Sigma$ is composed of the variability explained by a linear combination of the factors, which we call the communality, and of the remaining variability that cannot be explained by a linear combination of the factors, namely the uniqueness.

$$\hat \Sigma = \underbrace{\hat\Lambda\hat\Lambda^T}_{\text{communality}} + \underbrace{\hat\Psi}_{\text{uniqueness}}$$

Now let's have a look at our results:

In the plot above we can see the loadings, which range from −1 to 1. This part is represented by $\hat \Lambda$ in the equation above. The loadings are the contribution of each original variable to a factor. Variables with loading values further away from 0 have a larger part of their variability explained by that factor.
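If you prefer a labeled table over reading values off the heat map, the loading matrix can be wrapped in a DataFrame; the snippet below refits the varimax model (the name fa_vm is purely illustrative):

fa_vm = FactorAnalysis(n_components=2, rotation="varimax").fit(X)
DataFrame(fa_vm.components_.T, index=food.columns, columns=["Factor 1", "Factor 2"])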

Now we will check the uniqueness of each variable. Note that we only use the varimax rotation method going forward.

The next plot is generated straight from a pandas Series, using the .plot() method. This provides easy access to quick plots.

In [4]:
fa = FactorAnalysis(n_components = 2, rotation="varimax")
fa.fit(X)
uniqueness = Series(fa.noise_variance_, index=food.columns)
uniqueness.plot(
    kind="bar",
    ylabel="Uniqueness"
)
Out[4]:
<AxesSubplot:ylabel='Uniqueness'>

The uniqueness ranges from 0 to 1. It is sometimes also referred to as (white) noise and corresponds to the proportion of variability that cannot be explained by a linear combination of the factors. This part is represented by $\hat\Psi$ in the equation above. A high uniqueness for a variable indicates that the factors do not account well for its variance.

The counterpart of the uniqueness is the communality:

In [5]:
# Communality
communality = Series(np.square(fa.components_.T).sum(axis=1), index=food.columns)
communality.plot(
    kind="bar",
    ylabel="communality"
)
Out[5]:
<AxesSubplot:ylabel='communality'>

By squaring the loadings we can compute the fraction of each variable's total variance explained by the factors (Teetor 2011). This proportion of the variability is denoted as the communality.

A way to calculate the uniqueness, when you have already computed the communality, is to subtract it from 1. An appropriate factor model results in low values for the uniqueness and high values for the communality. So if we see bad results for our model, we could try a different number of underlying factors (latent variables).
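As a small consistency check (not part of the original notebook), we can place both routes to the uniqueness side by side; they agree only approximately, because the fitted model does not reproduce the sample variances exactly:

DataFrame({
    "1 - communality": 1 - communality,
    "noise_variance_": uniqueness,
}).round(3)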

In [6]:
# and back to uniqueness
(1 - communality).plot(kind="bar", ylabel="uniqueness")
Out[6]:
<AxesSubplot:ylabel='uniqueness'>

Recall the factor analysis model: $\hat \Sigma = \hat\Lambda\hat\Lambda^T +\hat\Psi$

Using our factor model fa we may calculate $\hat \Sigma$ and compare it to the observed correlation matrix, S, by simple matrix algebra.

We use numpy to perform fast and efficient math operations. Note: we could also use pandas here, since it "wraps" around numpy and essentially forwards such operations to it.

In [8]:
# 'lambda' is a reserved keyword in Python, so we append an underscore to the name
lambda_ = fa.components_
psi = np.diag(uniqueness)
s = np.corrcoef(np.transpose(X))
sigma = np.matmul(lambda_.T, lambda_) + psi
residuals = (s - sigma)

We subtracted the fitted correlation matrix $\hat \Sigma$ (sigma) from the observed correlation matrix S. The resulting matrix is called the residual matrix. Numbers close to 0 indicate that our factor model is a good representation of the underlying system. Now let's plot the results.
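As a quick numeric summary before plotting (an optional check we add here), we can look at the largest absolute residual; a small value already tells us the model fits well:

print(np.abs(residuals).max().round(3))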

In [9]:
fig, ax = plt.subplots()  # create a fresh figure for the residual matrix
im = ax.imshow(residuals, cmap="RdBu_r", vmin=-1, vmax=1)
ax.tick_params(axis="x", bottom=False, labelbottom=False, top=True, labeltop=True)
ax.set_xticks(range(5))
ax.set_xticklabels(food.columns)
ax.set_yticks(range(5))
ax.set_yticklabels(food.columns)
for (i,j), z in np.ndenumerate(residuals):
    ax.text(j, i, str(z.round(3)), ha="center", va="center")

fig.colorbar(im, ax=ax, location='right')
ax.set_title("FA residual matrix")
plt.tight_layout()

Interpretation of the factors

The purpose of a rotation is to produce more extreme loadings. The idea behind this is to give meaning to the factors. This can help with their interpretation. From a mathematical viewpoint, there is no difference between a rotated and unrotated matrix. The fitted model is the same, the uniquenesses are the same, and the proportion of variance explained is the same.
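To illustrate this concretely, here is a small check (not in the original notebook, the variable names are ours): the estimated uniquenesses and the per-variable communalities should coincide for the unrotated and the varimax-rotated fit.

fa_none = FactorAnalysis(n_components=2).fit(X)
fa_var = FactorAnalysis(n_components=2, rotation="varimax").fit(X)

#  The noise variances (uniquenesses) are estimated before any rotation is applied ...
print(np.allclose(fa_none.noise_variance_, fa_var.noise_variance_))
#  ... and the communalities (per-variable sums of squared loadings)
#  are invariant under an orthogonal rotation such as varimax.
print(np.allclose(np.square(fa_none.components_).sum(axis=0),
                  np.square(fa_var.components_).sum(axis=0)))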

Here we fit three factor models, one with no rotation, one with varimax rotation, and one with quartimax rotation. We then make a scatter plot of the first and second loadings.

In [10]:
methods = [
    ("FA No rotation", FactorAnalysis(2,)),
    ("FA Varimax", FactorAnalysis(2, rotation="varimax")),
    ("FA Quartimax", FactorAnalysis(2, rotation="quartimax")),
]
fig, axes = plt.subplots(ncols=3, figsize=(10, 8), sharex=True, sharey=True)

for ax, (method, fa) in zip(axes, methods):
    fa = fa.fit(X)

    components = fa.components_

    ax.scatter(components[0,:], components[1, :])
    ax.axhline(0, -1, 1, color='k')
    ax.axvline(0, -1, 1, color='k')
    for i,j, z in zip(components[0, :], components[1, :], food.columns):
        ax.text(i+.02, j+.02, str(z), ha="center")
    ax.set_title(str(method))
    if ax.get_subplotspec().is_first_col():
        ax.set_ylabel("Factor 1")
    ax.set_xlabel("Factor 2")

plt.tight_layout()
plt.show()

What does this mean?

How can we interpret these factors? If two variables both have loadings far away from 0 on the same factor, we know they are related. We want to understand the data in order to give meaningful names to the latent variables.

Taking a look at the middle plot (FA Varimax) above, it appears that Factor 1 describes a latent property that makes the pastry soft and less crisp. This description fits flaky pastry rather well.

The loadings on Factor 2, in contrast, indicate high density and little oil. This would fit the classification of hot water crust pastry.


Citation

The E-Learning project SOGA-Py was developed at the Department of Earth Sciences by Annette Rudolph, Joachim Krois and Kai Hartmann. You can reach us via e-mail at soga[at]zedat.fu-berlin.de.

You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follows: Rudolph, A., Krois, J., Hartmann, K. (2023): Statistics and Geodata Analysis using Python (SOGA-Py). Department of Earth Sciences, Freie Universitaet Berlin.