In statistics we generally deal with observable variables. However,
there are applications in which we are interested in variables that cannot
be observed. Imagine we are working on a research project to reconstruct
the palaeo-environmental boundary conditions in a remote region of the
world. No direct measurements of these conditions are available,
so we have to base our investigation on **proxy data**. Depending on the
research area, different types of proxy data will be available,
such as tree rings, pollen records, lake sediments, or geomorphological
indications. These data archives may be investigated and
sampled in a variety of ways. Based on the measurements taken from these
archives, we may then infer the palaeo-environmental boundary conditions,
such as temperature, the water balance, and/or the vegetation pattern.

In statistics, variables that are not directly observed but are instead
inferred from observed variables are referred to as **latent variables**. Mathematical
models that aim to explain observed variables in terms of latent
variables are called **latent variable models**. Such a
model provides the link between the latent variables,
which cannot be observed, and the observed variables, which are also
called manifest variables.

**Factor analysis** is an application
of latent variable models. Its purpose is to determine
how many latent variables are needed to explain the correlations between
the manifest variables, to interpret them and, sometimes, to predict the
values of the latent variables which have given rise to the manifest
variables (Lovric 2011).

The basic idea behind factor analysis is that we investigate a
multivariate feature space spanned by a set of observable variables and
describe the variability and correlation among these
variables in terms of a potentially smaller number of unobserved variables
called **factors**.
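
This idea can be written compactly. The following is a sketch in commonly used textbook notation for the orthogonal factor model; all symbols are assumptions introduced here for illustration. Let \(\mathbf{x}\) be the vector of \(p\) observed variables with mean \(\boldsymbol{\mu}\), \(\mathbf{f}\) the vector of \(m \le p\) factors, \(\boldsymbol{\Lambda}\) the \(p \times m\) matrix of factor loadings, and \(\boldsymbol{\epsilon}\) the vector of \(p\) variable-specific errors:

\[
\mathbf{x} = \boldsymbol{\mu} + \boldsymbol{\Lambda}\mathbf{f} + \boldsymbol{\epsilon}
\]

If we additionally assume that the factors have zero mean and unit variance and are uncorrelated with each other, and that the errors are uncorrelated with each other and with the factors, with diagonal covariance matrix \(\boldsymbol{\Psi}\), then the covariance matrix \(\boldsymbol{\Sigma}\) of the observed variables decomposes as

\[
\boldsymbol{\Sigma} = \boldsymbol{\Lambda}\boldsymbol{\Lambda}^{T} + \boldsymbol{\Psi}
\]

The first term describes the variability that the observed variables share through the factors, and the second term describes the variance that is unique to each observed variable.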

There are two main branches in factor analysis: **exploratory factor analysis (EFA)** and **confirmatory factor analysis (CFA)**.

In EFA we allow all \(m\) factors to
be related to all \(p\) observed
variables. Thus, one can say that we are *exploring* which
factors relate to which observed variables. In CFA, by contrast, we know, or
assume, based on an *a priori* hypothesized model, that \(k\) observed variables are related to a
particular factor (latent variable). Thus, one can say that we try to *confirm*
whether a set of \(k\) observed variables is
in fact related to a particular factor.
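
To illustrate the difference, consider a hypothetical example with \(p = 4\) observed variables and \(m = 2\) factors, where \(\lambda_{ij}\) denotes the loading of observed variable \(i\) on factor \(j\) (the notation and the numbers are assumptions chosen only for illustration). In EFA every loading is estimated freely, whereas in a CFA that assigns \(k = 2\) observed variables to each factor, the remaining loadings are fixed to zero in advance:

\[
\boldsymbol{\Lambda}_{\text{EFA}} =
\begin{pmatrix}
\lambda_{11} & \lambda_{12} \\
\lambda_{21} & \lambda_{22} \\
\lambda_{31} & \lambda_{32} \\
\lambda_{41} & \lambda_{42}
\end{pmatrix},
\qquad
\boldsymbol{\Lambda}_{\text{CFA}} =
\begin{pmatrix}
\lambda_{11} & 0 \\
\lambda_{21} & 0 \\
0 & \lambda_{32} \\
0 & \lambda_{42}
\end{pmatrix}
\]

In the CFA case the pattern of zeros encodes the *a priori* hypothesis, and the analysis assesses how well this restricted model reproduces the observed correlations.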

**Please note that in this tutorial we focus on exploratory
factor analysis (EFA).**

Reyment & Joreskog (1993) provide a comprehensive introduction to factor analysis in the natural sciences, including applications from the broad field of geosciences.

**Citation**

The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us by e-mail at soga[at]zedat.fu-berlin.de.

You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follows: *Hartmann,
K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis
using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.*