In practice we do not know $\mathbf \Sigma$, but we can estimate it by the sample covariance matrix $\mathbf S$ (or by the sample correlation matrix $\mathbf R$ if we want to work with standardized data). We therefore seek estimates $\hat {\mathbf\Lambda}$ and $\hat {\mathbf\Psi}$ such that $\hat{\mathbf\Lambda}\hat{\mathbf\Lambda}^T+\hat{\mathbf\Psi}$ is as close to $\mathbf{S}$ as possible. In other words, we try to make the discrepancy between the sample covariance matrix $\mathbf{S}$ and the model matrix $\mathbf\Sigma$ as small as possible.
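As a small numerical illustration (with invented numbers, not taken from the course material), consider a one-factor model for $p = 3$ standardized variables. The residual matrix $\mathbf S - (\hat{\mathbf\Lambda}\hat{\mathbf\Lambda}^T + \hat{\mathbf\Psi})$ quantifies the discrepancy we want to minimize:

```python
import numpy as np

# Hypothetical sample correlation matrix S for p = 3 variables
S = np.array([[1.00, 0.63, 0.45],
              [0.63, 1.00, 0.35],
              [0.45, 0.35, 1.00]])

# Hypothetical one-factor solution: loadings Lambda and uniquenesses Psi
Lam = np.array([[0.9], [0.7], [0.5]])      # p x m loading matrix
Psi = np.diag(1 - (Lam**2).sum(axis=1))    # diagonal uniqueness matrix

Sigma_model = Lam @ Lam.T + Psi            # model-implied covariance matrix
residual = S - Sigma_model                 # discrepancy to be made small
print(np.round(residual, 3))               # here S was built so the fit is exact
```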

There are various methods for finding these estimates. In this course we focus on the *maximum likelihood method*, a commonly used estimation method that is also implemented in the FactorAnalysis class of the scikit-learn package.
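As a minimal sketch of how such a fit looks in practice (using simulated data with hypothetical dimensions, not the course data set):

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(42)

# Simulate n = 300 observations of p = 5 variables driven by m = 2 factors
n, p, m = 300, 5, 2
Lam_true = rng.normal(size=(p, m))          # true loading matrix
F = rng.normal(size=(n, m))                 # latent factor scores
eps = rng.normal(scale=0.5, size=(n, p))    # specific (unique) errors
X = F @ Lam_true.T + eps

fa = FactorAnalysis(n_components=m).fit(X)  # maximum likelihood fit

Lam_hat = fa.components_.T        # estimated loadings, shape (p, m)
Psi_hat = fa.noise_variance_      # estimated specific variances (diagonal of Psi)
```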

Maximum likelihood estimates are obtained by iteration, a process in which $\hat{\mathbf\Lambda}$ and $\hat{\mathbf\Psi}$ are systematically adjusted until the likelihood function is maximized. This approach assumes that the vector $\mathbf x$ is distributed according to a multivariate normal distribution, that is:

$$f(\mathbf x) = \vert 2 \pi\mathbf\Sigma\vert^{-1/2}e^{-\frac{1}{2}(\mathbf{x}-\mathbf\mu)^T\mathbf\Sigma^{-1}(\mathbf{x}-\mathbf\mu)}$$

The likelihood for a sample of $n$ units is then

$$\mathcal L(\mathbf{x,\mu,\Sigma}) = \vert 2 \pi\mathbf\Sigma\vert^{-n/2}e^{-\frac{1}{2}\sum_{j=1}^n(\mathbf{x_j}-\mathbf\mu)^T\mathbf\Sigma^{-1}(\mathbf{x_j}-\mathbf\mu)}$$

and the log-likelihood

$$\ln \mathcal L(\mathbf{x,\mu,\Sigma}) = -\frac{n}{2}\ln \vert 2 \pi\mathbf\Sigma\vert -\frac{1}{2}\sum_{j=1}^n(\mathbf{x_j}-\mathbf\mu)^T\mathbf\Sigma^{-1}(\mathbf{x_j}-\mathbf\mu)$$

If the factor model holds, then $\mathbf \Sigma = \mathbf \Lambda \mathbf \Lambda^T+ \mathbf \Psi$ and the log-likelihood becomes:

$$\ln \mathcal L(\mathbf{x,\Lambda,\Psi}) = -\frac{n}{2}\ln \vert 2 \pi(\mathbf{\Lambda\Lambda}^T+\mathbf{\Psi}) \vert -\frac{n}{2} \operatorname{tr}\left[(\mathbf{\Lambda\Lambda}^T+\mathbf\Psi)^{-1}\mathbf{S}\right]$$

Note that in linear algebra $\operatorname{tr}$ denotes the trace of a square matrix, defined as the sum of the elements on the main diagonal (the diagonal from the upper left to the lower right).
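This log-likelihood is straightforward to evaluate numerically. The following sketch (the function name and arguments are our own, not part of any library) computes it for given $\mathbf\Lambda$, $\mathbf\Psi$ and $\mathbf S$:

```python
import numpy as np

def factor_loglik(Lam, psi, S, n):
    """Log-likelihood of the factor model Sigma = Lam @ Lam.T + diag(psi),
    evaluated at the sample covariance matrix S for a sample of size n."""
    Sigma = Lam @ Lam.T + np.diag(psi)
    # log|2*pi*Sigma| via slogdet for numerical stability
    _, logdet = np.linalg.slogdet(2 * np.pi * Sigma)
    # tr(Sigma^{-1} S) without explicitly inverting Sigma
    trace_term = np.trace(np.linalg.solve(Sigma, S))
    return -0.5 * n * (logdet + trace_term)
```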

Finally, the log-likelihood function must be maximized with respect to $\mathbf\Lambda$ and $\mathbf\Psi$. In case one or more elements of the estimate of $\mathbf\Psi$ become negative, known as a *Heywood case*, the elements of the estimate of $\mathbf\Psi$ are constrained to be non-negative. A major benefit of fitting the factor model by maximum likelihood is that we may evaluate the number of factors $m$ to be included in the model by applying a likelihood ratio test, whose test statistic follows a $\chi^2$ distribution with $\frac{1}{2}[(p-m)^2-(p+m)]$ degrees of freedom.
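A minimal sketch of that test (a hypothetical helper of our own, using the Bartlett-type correction factor found in standard textbooks such as Johnson & Wichern):

```python
import numpy as np
from scipy import stats

def lr_test_m_factors(Lam_hat, psi_hat, S, n):
    """Likelihood ratio test of H0: m common factors are sufficient."""
    p, m = Lam_hat.shape
    Sigma_hat = Lam_hat @ Lam_hat.T + np.diag(psi_hat)
    # Bartlett-corrected test statistic: log ratio of determinants
    correction = n - 1 - (2 * p + 4 * m + 5) / 6
    stat = correction * np.log(np.linalg.det(Sigma_hat) / np.linalg.det(S))
    df = ((p - m) ** 2 - (p + m)) / 2
    p_value = stats.chi2.sf(stat, df)   # upper-tail chi-square probability
    return stat, df, p_value
```

A small p-value suggests that $m$ factors are not sufficient and the model should be refitted with more factors.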

**Citation**

The E-Learning project SOGA-Py was developed at the Department of Earth Sciences by Annette Rudolph, Joachim Krois and Kai Hartmann. You can reach us via e-mail at soga[at]zedat.fu-berlin.de.

You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follows: *Rudolph, A., Krois, J., Hartmann, K. (2023): Statistics and Geodata Analysis
using Python (SOGA-Py). Department of Earth Sciences, Freie Universitaet Berlin.*