Recall the equation for the multiple regression model:

\[ y_i = \beta_0 + \sum_{j=1}^dx_{ij}\beta_j + \epsilon_i \text{,} \quad i = 1,2,...,m\text{; } x\in \mathbb R^d. \]

Further recall that in regression we try to find a set of unknown
parameters \(\beta\) such that \(y \approx f (x;\beta)\) for the data pairs
\((x, y)\). In order to find an
appropriate set of parameters \(\beta\)
we need to define an **objective function** \((\mathcal E)\), which allows us to find
“good” values for the set of unknown parameters \(\beta\). One of the most popular estimation
methods is the so-called ordinary least squares (OLS) method. In this method the
coefficients \(\beta_0,
\beta_1,\beta_2,...,\beta_{d}\) are picked by minimizing the sum
of squared residuals. The objective function for the sum of squared
residuals is written as

\[\mathcal E = \sum_{i=1}^m\epsilon_i^2=\sum_{i=1}^m(y_i - f(x_i; \beta))^2\text{.}\]

By plugging in the regression model equation from above we get

\[\mathcal E = \sum_{i=1}^m(y_i - \beta_0 - \sum_{j=1}^dx_{ij}\beta_j)^2\text{,}\] where \(m\) corresponds to the number of observations and \(d\) corresponds to the number of features of our data set.
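The objective function is straightforward to evaluate numerically. The following minimal NumPy sketch (the data set and all variable names are illustrative, not taken from the text) computes \(\mathcal E\) for a candidate set of parameters:

```python
import numpy as np

# Small synthetic data set: m = 5 observations, d = 2 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))            # m x d feature matrix
beta0, beta = 1.0, np.array([2.0, -0.5])
y = beta0 + X @ beta                   # noise-free responses for clarity

def sse(beta0, beta, X, y):
    """Sum of squared residuals: E = sum_i (y_i - beta0 - x_i^T beta)^2."""
    residuals = y - beta0 - X @ beta
    return np.sum(residuals ** 2)

print(sse(beta0, beta, X, y))          # ~0 for the true parameters
print(sse(0.0, beta, X, y))            # positive for a wrong intercept
```

Since the responses were generated without noise, the true parameters yield an objective value of (numerically) zero, and any other choice of \(\beta\) yields a larger value.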

In the previous section we introduced the matrix notation for the vector \(\mathbf y\) and the stacked matrix \(\mathbf X\) (written in bold letters to indicate that \(\mathbf{y}\) is an \(m\)-dimensional vector, and \(\mathbf{X}\) an \(m \times d\) dimensional matrix, respectively).

\[ \begin{align} \mathbf y_{m \times 1}= \begin{bmatrix} y_{1} \\ y_{2} \\ \vdots \\ y_{m} \\ \end{bmatrix} ,\qquad \mathbf X_{m \times d}= \begin{bmatrix} \mathbf x_{1}^T \\ \mathbf x_{2}^T \\ \vdots \\ \mathbf x_{m}^T \\ \end{bmatrix} = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1d} \\ x_{21} & x_{22} & \cdots & x_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ x_{m1} & x_{m2} & \cdots & x_{md} \\ \end{bmatrix} \end{align} \]

For linear regression (and classification) the intercept term \(\beta_0\) does not interact with any element in the vector \(\mathbf{x} \in \mathbb R^d\). Thus it is convenient to populate the first column of the matrix \(\mathbf X\) with ones:

\[ \begin{align} \mathbf y_{m \times 1}= \begin{bmatrix} y_{1} \\ y_{2} \\ \vdots \\ y_{m} \\ \end{bmatrix}\text{;} \quad \mathbf X_{m \times (d+1)}= \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1d} \\ 1 &x_{21} & x_{22} & \cdots & x_{2d} \\ \vdots &\vdots & \vdots & \ddots & \vdots \\ 1 & x_{m1} & x_{m2} & \cdots & x_{md} \\ \end{bmatrix}\text{;} \quad \mathbf{\beta}_{(d+1) \times 1}= \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_d \\ \end{bmatrix} \end{align}. \]
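Prepending the column of ones is a one-liner in practice. A small NumPy sketch (the \(3 \times 2\) feature matrix is a made-up example):

```python
import numpy as np

# Hypothetical m = 3, d = 2 feature matrix; we prepend a column of ones
# so that the intercept beta_0 is handled like any other coefficient.
X = np.array([[1.5, 2.0],
              [0.5, 1.0],
              [3.0, 4.0]])
m, d = X.shape
X_model = np.column_stack([np.ones(m), X])   # model matrix, shape (m, d + 1)
print(X_model.shape)                         # (3, 3)
print(X_model[:, 0])                         # [1. 1. 1.]
```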

Note that the matrix \(\mathbf X\)
is now of size \(m \times (d+1)\) and
that \(\mathbf{\beta}\in \mathbb
R^{d+1}\); thus \([\beta_0,
\beta_1,...,\beta_d]^T\) is of size \((d+1) \times 1\). The vector \(\mathbf y\) is called the **response
vector** and the matrix \(\mathbf
X\) is called the **model matrix**.

Recalling the basics of **Linear Algebra**, we know that the *dot
product*, also referred to as the **scalar
product** of a vector \(\mathbf
u_{1\times n}\) with a vector \(\mathbf
v_{n\times 1}\) results in a scalar \(w\) (\(w\)
may be written as a matrix of size \(1\times1\)). The value of the scalar \(w\) corresponds to the sum of the products
of the elements in the vectors \(\mathbf
u\) and \(\mathbf v\).

\[\mathbf u \cdot \mathbf v = \sum_{i=1}^n u_iv_i= u_1v_1+u_2v_2+...+ u_nv_n= w\]
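As a quick numerical check, the dot product can be computed with NumPy (the vectors here are arbitrary examples):

```python
import numpy as np

u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 5.0, 6.0])

# The dot product sums the element-wise products: 1*4 + 2*5 + 3*6 = 32.
w = np.dot(u, v)
print(w)   # 32.0
```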

Returning to our problem we may rewrite the original least squares objective function

\[\mathcal E = \sum_{i=1}^m(y_i - \beta_0 - \sum_{j=1}^dx_{ij}\beta_j)^2\text{,}\] by using the matrix notation as

\[\mathcal E = \sum_{i=1}^m(y_i - \mathbf {x_i^T \beta})^2\text{,}\] where each \(\mathbf x_i\) now contains a leading 1, so that the intercept \(\beta_0\) is absorbed into \(\mathbf \beta\).

The representation in matrix form allows us to further simplify the equation by plugging in the response vector \(\mathbf y\) and the model matrix \(\mathbf X\):

\[\mathcal E = \sum_{i=1}^m(y_i - \mathbf x_i^T \beta)^2 = \Vert\mathbf y - \mathbf X\beta\Vert^2 = (\mathbf y - \mathbf X\beta)^T(\mathbf y - \mathbf X\beta).\] Now, in order to find the least squares solution we take the gradient with respect to \(\beta\), set it to zero, and solve for the unique solution \(\hat \beta\):

\[\frac{\partial \mathcal E}{\partial \beta}=2\mathbf X^T\mathbf X \beta-2\mathbf X^T\mathbf y = 0 \quad \Rightarrow \quad\hat \beta = (\mathbf X^T\mathbf X)^{-1}\mathbf X^T\mathbf y.\]
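The closed-form solution is easy to verify numerically. A NumPy sketch on synthetic data (all names and the noise level are illustrative assumptions); for comparison we also use `np.linalg.lstsq`, which solves the same least squares problem with a more numerically stable factorization:

```python
import numpy as np

rng = np.random.default_rng(42)
m, d = 50, 3
X = np.column_stack([np.ones(m), rng.normal(size=(m, d))])  # model matrix
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=m)           # noisy responses

# Closed-form OLS estimate: solve the normal equations X^T X beta = X^T y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Same problem via a stable least squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta_hat, beta_lstsq))   # both routes agree
```

Solving the normal equations with `np.linalg.solve` is preferable to forming \((\mathbf X^T\mathbf X)^{-1}\) explicitly, since explicit inversion is slower and less accurate.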

The matrix \(\mathbf H = \mathbf X(\mathbf
X^T\mathbf X)^{-1}\mathbf X^T\) is sometimes called the **“hat” matrix**, because it puts the
hat on \(\mathbf y\): \(\mathbf{\hat y} = \mathbf H \mathbf y\) (Hastie et al. 2008).
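A small NumPy sketch (with made-up data) illustrates that \(\mathbf H \mathbf y\) reproduces the fitted values, and that \(\mathbf H\) is a symmetric projection matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
m = 20
X = np.column_stack([np.ones(m), rng.normal(size=(m, 2))])  # model matrix
y = rng.normal(size=m)

# H = X (X^T X)^{-1} X^T "puts the hat on y": y_hat = H y.
H = X @ np.linalg.inv(X.T @ X) @ X.T
y_hat = H @ y

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(y_hat, X @ beta_hat))     # same fitted values
print(np.allclose(H @ H, H))                # idempotent (a projection)
print(np.allclose(H, H.T))                  # symmetric
```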

Given new data \(\mathbf X_{new}\), the least squares prediction \(\mathbf{\hat y}\) is

\[\mathbf{\hat y} = \mathbf X_{new}\hat
\beta = \mathbf X_{new}(\mathbf X^T\mathbf X)^{-1}\mathbf X^T\mathbf
y.\] **Important note:** Be aware that the
calculation of \(\hat \beta = (\mathbf
X^T\mathbf X)^{-1}\mathbf X^T\mathbf y\) assumes that \((\mathbf X^T\mathbf X)^{-1}\) exists.
Therefore \(\mathbf X^T\mathbf X\) has
to be a **full rank matrix**.

A matrix \(\mathbf M_{m\times m}\)
is called **full rank** if all row (column) vectors are
linearly independent. This means that any point in \(\mathbb R^m\) can be reached by a weighted
combination of the \(m\) rows of \(\mathbf M\).

In our case, with \(\mathbf X\) of size \(m \times (d+1)\), the product \(\mathbf X^T\mathbf X\) is a \((d+1) \times (d+1)\) matrix.

Thus, if either \(m < d + 1\) or at least one of the \(d+1\) column vectors of \(\mathbf X\) can be written as a linear combination of the others, least squares fails, because \((\mathbf X^T\mathbf X)^{-1}\) does not exist. In that case we are facing an infinite number of possible solutions.
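The rank-deficient case can be demonstrated with a small NumPy sketch, assuming a made-up feature that is an exact multiple of another. The rank of the model matrix drops below \(d+1\), and distinct coefficient vectors produce identical fitted values:

```python
import numpy as np

m = 10
rng = np.random.default_rng(7)
x1 = rng.normal(size=m)
x2 = 2.0 * x1                              # x2 is a linear combination of x1
X = np.column_stack([np.ones(m), x1, x2])  # model matrix, d + 1 = 3 columns

# Rank is 2 < 3, so X^T X is singular and its inverse does not exist.
print(np.linalg.matrix_rank(X))            # 2

# Infinitely many coefficient vectors give the same fitted values:
beta_a = np.array([0.0, 1.0, 0.0])         # use x1 only
beta_b = np.array([0.0, 0.0, 0.5])         # use x2 only
print(np.allclose(X @ beta_a, X @ beta_b)) # identical predictions
```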

The latter case occurs if \(\mathbf
X\) is part of a **compositional data set**!

**Citation**

The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us via mail by soga[at]zedat.fu-berlin.de.

You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follows: *Hartmann,
K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis
using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.*