In statistics we generally deal with observable variables; however, there are applications where we are interested in variables that cannot be observed. Imagine we are working on a research project to reconstruct the palaeo-environmental boundary conditions in a remote region of the world. No direct measurements of these conditions are available, so we have to base our investigation on proxy data. Depending on the research area, different types of proxy data will be available, such as tree rings, pollen records, lake sediments, and geomorphological indicators. These data archives may be investigated and sampled in a variety of ways. Based on the measurements taken from these proxies, we may then infer the palaeo-environmental boundary conditions, such as temperature, the water balance, or the vegetation pattern.

In statistics, variables that are not directly observed but are instead inferred from observed variables are referred to as latent variables. Mathematical models that aim to explain observed variables in terms of latent variables are called latent variable models. Such a model provides the link between the latent variables, which cannot be observed, and the manifest variables, which can be observed.

Factor analysis is an application of latent variable models. The purpose of the analysis is to determine how many latent variables are needed to explain the correlations between the manifest variables, to interpret them and, sometimes, to predict the values of the latent variables which have given rise to the manifest variables (Lovric 2011).

The basic idea behind factor analysis is to investigate a multivariate feature space spanned by a set of observed variables and to describe the variability of, and the correlation among, these variables in terms of a potentially smaller number of unobserved variables called factors.
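One common way to write this idea down is the orthogonal factor model. The notation below is an assumption made for illustration and does not appear in the text above: $$\mathbf{x}$$ is the vector of $$p$$ observed variables with mean $$\boldsymbol{\mu}$$, $$\mathbf{f}$$ is the vector of $$m \le p$$ factors, $$\boldsymbol{\Lambda}$$ is the $$p \times m$$ matrix of factor loadings, and $$\boldsymbol{\epsilon}$$ collects the variable-specific errors.

$$
\mathbf{x} = \boldsymbol{\mu} + \boldsymbol{\Lambda}\,\mathbf{f} + \boldsymbol{\epsilon}
$$

If we further assume that the factors are uncorrelated with unit variance, that the errors are uncorrelated with each other and with the factors, and that $$\boldsymbol{\Psi}$$ denotes the diagonal covariance matrix of the errors, the covariance matrix $$\boldsymbol{\Sigma}$$ of the observed variables decomposes as

$$
\boldsymbol{\Sigma} = \boldsymbol{\Lambda}\boldsymbol{\Lambda}^{\top} + \boldsymbol{\Psi}
$$

Under these assumptions, all correlation among the observed variables is carried by the shared part $$\boldsymbol{\Lambda}\boldsymbol{\Lambda}^{\top}$$, while $$\boldsymbol{\Psi}$$ only adds variance that is specific to each variable.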

There are two main branches in factor analysis: exploratory factor analysis (EFA) and confirmatory factor analysis (CFA).

In EFA we allow all $$m$$ factors to be related to all $$p$$ observed variables; in other words, we explore which factors relate to which observed variables. In CFA, by contrast, we know or assume, based on an a priori hypothesized model, that $$k$$ observed variables are related to a particular factor (latent variable). We then try to confirm whether this set of $$k$$ observed variables is in fact related to that factor.
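To make the distinction more tangible, the following minimal sketch simulates data under a CFA-style hypothesis and looks at the resulting correlations. Everything in it is an assumption chosen for illustration: the number of variables ($$p = 4$$) and factors ($$m = 2$$), the loading values, the error standard deviation, and the helper names `simulate` and `pearson`. It is written in Python with the standard library only, so that it is self-contained, and it is not meant as a substitute for a proper factor analysis routine.

```python
import math
import random


def simulate(n=5000, seed=42):
    """Simulate p = 4 observed variables driven by m = 2 latent factors."""
    rng = random.Random(seed)
    # Hypothetical loading matrix (rows: observed variables, columns: factors).
    # The zeros encode a CFA-style hypothesis: x1 and x2 depend only on
    # factor 1, x3 and x4 only on factor 2. In EFA all entries would be free.
    loadings = [
        [0.9, 0.0],
        [0.8, 0.0],
        [0.0, 0.9],
        [0.0, 0.8],
    ]
    data = [[] for _ in loadings]
    for _ in range(n):
        # Latent factor scores: independent, mean 0, variance 1.
        f = [rng.gauss(0, 1), rng.gauss(0, 1)]
        for j, row in enumerate(loadings):
            noise = rng.gauss(0, 0.4)  # variable-specific error
            data[j].append(row[0] * f[0] + row[1] * f[1] + noise)
    return data


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


data = simulate()
p = len(data)
corr = [[pearson(data[i], data[j]) for j in range(p)] for i in range(p)]

# Print the correlation matrix of the observed variables.
for row in corr:
    print("  ".join(f"{r:6.2f}" for r in row))
```

With these assumed loadings, variables that share a factor are strongly correlated (about 0.8 in expectation), whereas variables that load on different factors are close to uncorrelated. An exploratory analysis would try to recover such a block pattern from the correlations alone, while a confirmatory analysis would start from the hypothesized zeros and test whether the data are compatible with them.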

Please note that this tutorial focuses on exploratory factor analysis (EFA).

Reyment & Jöreskog (1993) provide a comprehensive introduction to factor analysis in the natural sciences, including applications from the broad field of geosciences.

Citation

The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us by email at soga[at]zedat.fu-berlin.de.

You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follows: Hartmann, K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.