Recall the equation for the multiple regression model:

$y_i = \beta_0 + \sum_{j=1}^dx_{ij}\beta_j + \epsilon_i \text{,} \quad i = 1,2,...,m\text{; } \mathbf x_i\in \mathbb R^d.$

Further recall that in regression we try to find a set of unknown parameters $$\beta$$ such that $$y \approx f (x;\beta)$$ for the data pairs $$(x, y)$$. In order to find an appropriate set of parameters $$\beta$$ we need to define an objective function $$\mathcal E$$, which allows us to quantify how "good" a given set of parameters $$\beta$$ is. One of the most popular estimation methods is the so-called ordinary least squares (OLS) method. Here the coefficients $$\beta_0, \beta_1,\beta_2,...,\beta_{d}$$ are picked by minimizing the sum of squared residuals. The objective function for the sum of squared residuals is written as

$\mathcal E = \sum_{i=1}^m\epsilon_i^2=\sum_{i=1}^m(y_i - f(x_i; \beta))^2\text{.}$

By plugging in the regression model equation from above we get

$\mathcal E = \sum_{i=1}^m(y_i - \beta_0 - \sum_{j=1}^dx_{ij}\beta_j)^2\text{,}$ where $$m$$ corresponds to the number of observations and $$d$$ corresponds to the number of features of our data set.
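The objective function is straightforward to evaluate numerically. The following sketch computes $$\mathcal E$$ for a small hypothetical data set (the values of `X`, `y`, `beta0` and `beta` are made up for illustration), using NumPy:

```python
import numpy as np

# Hypothetical data set: m = 4 observations, d = 2 features
X = np.array([[1.0, 2.0],
              [2.0, 0.0],
              [3.0, 1.0],
              [4.0, 3.0]])
y = np.array([3.0, 2.0, 4.0, 7.0])

# Candidate parameters: intercept beta_0 and coefficients beta_1, beta_2
beta0 = 0.5
beta = np.array([1.0, 0.5])

# Sum of squared residuals: E = sum_i (y_i - beta_0 - sum_j x_ij * beta_j)^2
residuals = y - beta0 - X @ beta
E = np.sum(residuals**2)
```

OLS then amounts to searching for the `beta0` and `beta` that make `E` as small as possible.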

In the previous section we introduced the matrix notation for the vector $$\mathbf y$$ and the stacked matrix $$\mathbf X$$ (written in bold letters to indicate that $$\mathbf{y}$$ is an $$m$$-dimensional vector and $$\mathbf{X}$$ an $$m \times d$$ matrix, respectively).

\begin{align} \mathbf y_{m \times 1}= \begin{bmatrix} y_{1} \\ y_{2} \\ \vdots \\ y_{m} \\ \end{bmatrix} ,\qquad \mathbf X_{m \times d}= \begin{bmatrix} \mathbf x_{1}^T \\ \mathbf x_{2}^T \\ \vdots \\ \mathbf x_{m}^T \\ \end{bmatrix} = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1d} \\ x_{21} & x_{22} & \cdots & x_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ x_{m1} & x_{m2} & \cdots & x_{md} \\ \end{bmatrix} \end{align}

For linear regression (and classification) the intercept term $$\beta_0$$ does not interact with any element in the vector $$\mathbf{x} \in \mathbb R^d$$. Thus it is convenient to populate the first column of the matrix $$\mathbf X$$ with ones:

\begin{align} \mathbf y_{m \times 1}= \begin{bmatrix} y_{1} \\ y_{2} \\ \vdots \\ y_{m} \\ \end{bmatrix}\text{;} \quad \mathbf X_{m \times (d+1)}= \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1d} \\ 1 &x_{21} & x_{22} & \cdots & x_{2d} \\ \vdots &\vdots & \vdots & \ddots & \vdots \\ 1 & x_{m1} & x_{m2} & \cdots & x_{md} \\ \end{bmatrix}\text{;} \quad \boldsymbol{\beta}_{(d+1) \times 1}= \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_d \\ \end{bmatrix} \end{align}

Note that the matrix $$\mathbf X$$ is now of size $$m \times (d+1)$$ and that $$\boldsymbol{\beta}\in \mathbb R^{d+1}$$; thus $$[\beta_0, \beta_1,...,\beta_d]^T$$ is of size $$(d+1) \times 1$$. The vector $$\mathbf y$$ is called the response vector and the matrix $$\mathbf X$$ is called the model matrix.
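Building the model matrix from raw features is a one-liner in practice. The sketch below (with made-up feature values) prepends the column of ones that absorbs the intercept $$\beta_0$$:

```python
import numpy as np

# Raw feature matrix: m = 3 observations, d = 2 features (hypothetical values)
X = np.array([[1.0, 4.0],
              [2.0, 5.0],
              [3.0, 6.0]])
m, d = X.shape

# Prepend a column of ones so beta_0 is handled by the same matrix product
X_design = np.column_stack([np.ones(m), X])
```

After this step, `X_design @ beta` computes $$\beta_0 + \sum_j x_{ij}\beta_j$$ for every observation at once.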

Recalling the basics of linear algebra, we know that the dot product, also referred to as the scalar product, of a row vector $$\mathbf u_{1\times n}$$ with a column vector $$\mathbf v_{n\times 1}$$ results in a scalar $$w$$ ($$w$$ may be written as a matrix of size $$1\times1$$). The value of the scalar $$w$$ corresponds to the sum of the products of the elements in the vectors $$\mathbf u$$ and $$\mathbf v$$.

$\mathbf u \cdot \mathbf v = \sum_{i=1}^n u_iv_i= u_1v_1+u_2v_2+...+ u_nv_n= w$
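For concreteness, a minimal sketch of the dot product with two hypothetical vectors:

```python
import numpy as np

u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 5.0, 6.0])

# Dot product: sum of elementwise products, 1*4 + 2*5 + 3*6
w = np.dot(u, v)
```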

Returning to our problem we may rewrite the original least squares objective function

$\mathcal E = \sum_{i=1}^m(y_i - \beta_0 - \sum_{j=1}^dx_{ij}\beta_j)^2\text{,}$ by using the matrix notation as

$\mathcal E = \sum_{i=1}^m(y_i - \mathbf {x_i^T \boldsymbol\beta})^2\text{.}$

The representation in matrix form allows us to further simplify the equation by plugging in the response vector $$\mathbf y$$ and the model matrix $$\mathbf X$$:

$\mathcal E = \sum_{i=1}^m(y_i - \mathbf x_i^T \boldsymbol\beta)^2 = \Vert\mathbf y - \mathbf X\boldsymbol\beta\Vert^2 = (\mathbf y - \mathbf X\boldsymbol\beta)^T(\mathbf y - \mathbf X\boldsymbol\beta).$ Now, in order to find the least squares solution we take the gradient with respect to $$\beta$$, set it to zero, and solve for the unique solution $$\hat \beta$$:

$\frac{\partial \mathcal E}{\partial \boldsymbol\beta}=2\mathbf X^T\mathbf X \boldsymbol\beta-2\mathbf X^T\mathbf y = 0 \quad \Rightarrow \quad\hat \beta = (\mathbf X^T\mathbf X)^{-1}\mathbf X^T\mathbf y.$
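The closed-form solution can be verified numerically. The sketch below generates hypothetical data from known coefficients, solves the normal equations, and cross-checks the result against NumPy's built-in least squares routine (solving the linear system is preferred over explicitly inverting $$\mathbf X^T\mathbf X$$):

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 50, 3

# Model matrix with intercept column; hypothetical random features
X = np.column_stack([np.ones(m), rng.normal(size=(m, d))])
beta_true = np.array([2.0, -1.0, 0.5, 3.0])
y = X @ beta_true + 0.1 * rng.normal(size=m)

# Normal equations: (X^T X) beta_hat = X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against NumPy's least squares solver
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```

With only mild noise, `beta_hat` lands close to `beta_true`, and it agrees with `beta_lstsq` to machine precision.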

The matrix $$\mathbf H = \mathbf X(\mathbf X^T\mathbf X)^{-1}\mathbf X^T$$ is sometimes called the “hat” matrix because it puts the hat on $$\mathbf y$$ (Hastie et al 2008).
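The defining property of the hat matrix — that $$\mathbf H \mathbf y$$ yields the fitted values, and that $$\mathbf H$$ is a projection (so applying it twice changes nothing) — can be checked directly. A minimal sketch with hypothetical data:

```python
import numpy as np

rng = np.random.default_rng(1)
m, d = 20, 2
X = np.column_stack([np.ones(m), rng.normal(size=(m, d))])
y = rng.normal(size=m)

# Hat matrix: projects y onto the column space of X
H = X @ np.linalg.inv(X.T @ X) @ X.T
y_hat = H @ y                        # H "puts the hat on y"

# Same fitted values via the estimated coefficients
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
```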

Given new data $$\mathbf X_{new}$$, the least squares prediction $$\mathbf{\hat y}$$ is

$\mathbf{\hat y} = \mathbf X_{new}\hat \beta = \mathbf X_{new}(\mathbf X^T\mathbf X)^{-1}\mathbf X^T\mathbf y.$ Important note: Be aware that the calculation of $$\hat \beta = (\mathbf X^T\mathbf X)^{-1}\mathbf X^T\mathbf y$$ assumes that $$(\mathbf X^T\mathbf X)^{-1}$$ exists. Therefore $$\mathbf X^T\mathbf X$$ has to be a matrix of full rank.

A matrix $$\mathbf M_{m\times m}$$ is called full rank if all row (column) vectors are linearly independent. This means that any point in $$\mathbb R^m$$ can be reached by a weighted combination of the $$m$$ rows of $$\mathbf M$$.

In our case, $$\mathbf X_{m\times (d+1)}^T\mathbf X_{m\times (d+1)}$$ is a $$(d+1) \times (d+1)$$ matrix.

Thus, if either $$m < d + 1$$ or at least one of the $$d+1$$ column vectors of $$\mathbf X$$ can be written as a linear combination of the others, least squares fails, because $$(\mathbf X^T\mathbf X)^{-1}$$ does not exist. In that case we face an infinite number of possible solutions.

The latter case occurs if $$\mathbf X$$ is part of a compositional data set!

Citation

The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us via mail by soga[at]zedat.fu-berlin.de.

You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follows: Hartmann, K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.