In this example we compute a factor analysis, employing the scikit-learn library.

We assume that our data was generated by a linear transformation of a lower dimensional data set, with an overlay of white noise. The factor analysis allows us to retrieve these underlying factors and thus to lower the dimensionality of our data.

Let's import the needed tools and get going. We will give further explanation along the way.

In [1]:

```
from pandas import read_csv, Series, DataFrame
from sklearn.decomposition import FactorAnalysis
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import numpy as np
```

Let us get our hands dirty and apply a factor analysis on a the food-texture data set. We already discussed the data set in the section on principal component analysis, so you are probably familiar with the data set.

This open source data set is available here and describes texture measurements of a pastry-type food.

In [2]:

```
food = read_csv("https://userpage.fu-berlin.de/soga/300/30100_data_sets/food-texture.csv", index_col=0)
food.head()
```

Out[2]:

Oil | Density | Crispy | Fracture | Hardness | |
---|---|---|---|---|---|

B110 | 16.5 | 2955 | 10 | 23 | 97 |

B136 | 17.7 | 2660 | 14 | 9 | 139 |

B171 | 16.2 | 2870 | 12 | 17 | 143 |

B192 | 16.7 | 2920 | 10 | 31 | 95 |

B225 | 16.3 | 2975 | 11 | 26 | 143 |

The data set consists of 50 rows (observations) and 5 columns (features/variables). The features are:

- Oil: percentage oil in the pastry
- Density: the product’s density (the higher the number, the more dense the product)
- Crispy: a crispiness measurement, on a scale from
`7`

to`15`

, with`15`

being more crispy. - Fracture: the angle, in degrees, through which the pasty can be slowly bent before it fractures.
- Hardness: a sharp point is used to measure the amount of force required before breakage occurs.

The class FactorAnalysis of the scikit-learn package, enables many methods around Factor analysis. When instantiating the class, we can pass it the desired number of factors.

```
FactorAnalysis(n_components = <factors>)
```

Replace `<factors>`

with the amount of desired factors.

With this class we can perform a maximum-likelihood factor analysis on a covariance matrix or data matrix, specifying the desired number of factors.
By an additional argument `rotation`

the transformation of the factors may be specified to be either `varimax`

or `quartimax`

, two types of orthogonal rotation or `None`

(default) for no rotation.

Getting a reasonable value for the amount of factors desired is a tricky aspect of factor analysis. If we already have some understanding of the system that created our data, we may make an make an educated guess about the number of latent variables.

If we do not know much about out data other than that the number of variables is not too large, we may simply try several values to initialize the model. In most cases though and to do our due dilligence, we should use a more sophisticated approach and perform a principal component analysis (PCA) and get a good initial estimate of the number of factors.

Since we have already explained the PCA, we will not repeat it here and just make a guess and set the number of factor to be `factors = 2`

. Furthermore, we try the analysis with the `rotation`

set to `varimax`

and with the default value `rotation = None`

.

Let's compute this and plot it using matplotlib.

In [3]:

```
X = StandardScaler().fit_transform(food) # Standardize the data
factors = 2
# a list of 2 tuples containing titles for and instances of or class
fas = [
("FA no rotation", FactorAnalysis(n_components = factors)),
("FA varimax", FactorAnalysis(n_components = factors, rotation="varimax")),
]
# Let's prepare some plots on one canvas (subplots)
fig, axes = plt.subplots(ncols=len(fas), figsize=(10, 8))
'''
And loop over the variants of our analysis `fas`, zipped with the
plot axes `axes`
'''
for ax, (title, fa) in zip(axes, fas):
# Fit the model to the standardized food data
fa = fa.fit(X)
# and transpose the component (loading) matrix
factor_matrix = fa.components_.T
# Plot the data as a heat map
im = ax.imshow(factor_matrix, cmap="RdBu_r", vmax=1, vmin=-1)
# and add the corresponding value to the center of each cell
for (i,j), z in np.ndenumerate(factor_matrix):
ax.text(j, i, str(z.round(2)), ha="center", va="center")
# Tell matplotlib about the metadata of the plot
ax.set_yticks(np.arange(len(food.columns)))
if ax.get_subplotspec().is_first_col():
ax.set_yticklabels(food.columns)
else:
ax.set_yticklabels([])
ax.set_title(title)
ax.set_xticks([0, 1])
ax.set_xticklabels(["Factor 1", "Factor 2"])
# and squeeze the axes tight, to save space
plt.tight_layout()
# and add a colorbar
cb = fig.colorbar(im, ax=axes, location='right', label="loadings")
# show us the plot
plt.show()
```

Before we interpret the results of the factor analysis, let us recall the basic idea behind it. Factor analysis creates **linear combinations of factors** to abstract the variable’s underlying **communality**. This reduces the amount of factors in the data set, while preserving most of the variance.

This allows us to aggregate a large number of observable variables in a model to **represent an underlying concept**, making it easier to comprehend the data.

The variability in a data set $\mathbf X$, is given by $\Sigma$, and its estimate $\hat \Sigma$ is composed of the variability explained by a **linear combination of the factors**, which we call **communality**, and of the remaining variability that can not be explained by a linear combination of the factors, namley the **uniqueness**.

Now let's have a look at our results:

In the plot above we can see the **loadings**, which range from `−1`

to `1`

. This part is represented by the $\hat \Lambda$ in the equation above. The loadings are the contribution of each original variable to the factor. Variables with loading values further away from `0`

have a larger part of their variability factor.

Now we will check the **uniqueness** for of each variable. Note that we only use the `varimax`

rotation method, going forward.

The next plot is generated straight from a `PandasSeries`

, using the `.plot()`

method. This provides easy access to quick plots.

In [4]:

```
fa = FactorAnalysis(n_components = 2, rotation="varimax")
fa.fit(X)
uniqueness = Series(fa.noise_variance_, index=food.columns)
uniqueness.plot(
kind="bar",
ylabel="Uniqueness"
)
```

Out[4]:

<AxesSubplot:ylabel='Uniqueness'>

The **uniqueness**, which ranges from `0`

to `1`

. It is, sometimes also referred to as (white) noise, corresponds to the proportion of variability that can not be explained by a linear combination of the factors. This part is represented by the $\hat\Psi$ in the equation above. A **high uniqueness** for a variable indicates that the **factors do not account well for its variance**.

Opposing the uniquess, stands the **communality**:

In [5]:

```
# Communality
communality = Series(np.square(fa.components_.T).sum(axis=1), index=food.columns)
communality.plot(
kind="bar",
ylabel="communality"
)
```

Out[5]:

<AxesSubplot:ylabel='communality'>

By squaring the loading we can compute the **fraction of the variable’s total variance explained by the factors** (Teetor 2011{target="_blank"}). This proportion of the variability is denoted as **communality**.

A way way to calculate the the uniqueness, when you already computed the communlity is to subtract it from `1`

. **An appropriate factor model results in low values for uniqueness and high values for communality**. So if we see bad results for our model, we could try a different number of underlying factors (latent variables).

In [6]:

```
#and back to uniqueness
(1 - communality).plot(kind="bar", ylabel="uniqueness")
```

Out[6]:

<AxesSubplot:ylabel='uniqueness'>

Recall the factor analysis model: $\hat \Sigma = \hat\Lambda\hat\Lambda^T +\hat\Psi$

Using our factor model `fa`

we may calculate $\hat \Sigma$ and compare it to the observed correlation matrix, `S`

, by simple matrix algebra.

We use numpy to perform fast and efficient math operations. Note: We can also use pandas, since it "wraps" around numpy, basically just forwarding our commands to the library.

In [8]:

```
# the word 'lambda' is reserved for a python-operator, so we use the underscore at the end
lambda_ = fa.components_
psi = np.diag(uniqueness)
s = np.corrcoef(np.transpose(X))
sigma = np.matmul(lambda_.T, lambda_) + psi
residuals = (s - sigma)
```

`sigma`

) from the observed correlation matrix `S`

. The resulting matrix is called the **residual matrix**. Numbers close to `0`

indicate that our factor model is a good representation of the underlying system.
Now lets plot the results.

In [9]:

```
ax = plt.axes()
im = ax.imshow(residuals, cmap="RdBu_r", vmin=-1, vmax=1)
ax.tick_params(axis="x", bottom=False, labelbottom=False, top=True, labeltop=True)
ax.set_xticks(range(5))
ax.set_xticklabels(food.columns)
ax.set_yticks(range(5))
ax.set_yticklabels(food.columns)
for (i,j), z in np.ndenumerate(residuals):
ax.text(j, i, str(z.round(3)), ha="center", va="center")
fig.colorbar(im, ax=ax, location='right')
ax.set_title("FA residual matrix")
plt.tight_layout()
```

The purpose of a **rotation** is to produce **more extreme loadings**. The idea behind this is to give meaning to the factors. This can help with their interpretation. From a mathematical viewpoint, there is no difference between a rotated and unrotated matrix. **The fitted model is the same**, the uniquenesses are the same, and the proportion of variance explained is the same.

Here we fit three factor models, one with `no rotation`

, one with `varimax`

rotation, and one with `quartimax`

rotation. We then make a `scatter plot`

of the first and second loadings.

In [10]:

```
methods = [
("FA No rotation", FactorAnalysis(2,)),
("FA Varimax", FactorAnalysis(2, rotation="varimax")),
("FA Quartimax", FactorAnalysis(2, rotation="quartimax")),
]
fig, axes = plt.subplots(ncols=3, figsize=(10, 8), sharex=True, sharey=True)
for ax, (method, fa) in zip(axes, methods):
fa = fa.fit(X)
components = fa.components_
vmax = np.abs(components).max()
ax.scatter(components[0,:], components[1, :])
ax.axhline(0, -1, 1, color='k')
ax.axvline(0, -1, 1, color='k')
for i,j, z in zip(components[0, :], components[1, :], food.columns):
ax.text(i+.02, j+.02, str(z), ha="center")
ax.set_title(str(method))
if ax.get_subplotspec().is_first_col():
ax.set_ylabel("Factor 1")
ax.set_xlabel("Factor 2")
plt.tight_layout()
plt.show()
```

How can I interpret these factors?
If two variables have loadings further away from `0`

for the same factor, we know they are related. We want to understand the data in order to give meaningful names to the latent variables.

Taking a look at the plot in the middle `FA Varimax`

above, it appears that `Factor 1`

describes a variable that makes pastry soft, less crisp. This description fits `flaky pastry`

rather well.

Whereas the loadings in `Factor 2`

show high `density`

and little `oil`

. This would fit the classification of `hot water crust pastry`

**Citation**

The E-Learning project SOGA-Py was developed at the Department of Earth Sciences by Annette Rudolph, Joachim Krois and Kai Hartmann. You can reach us via mail by soga[at]zedat.fu-berlin.de.

You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follow: *Rudolph, A., Krois, J., Hartmann, K. (2023): Statistics and Geodata Analysis
using Python (SOGA-Py). Department of Earth Sciences, Freie Universitaet Berlin.*