Recall the equation for the multiple regression model:
\[ y_i = \beta_0 + \sum_{j=1}^dx_{ij}\beta_j + \epsilon_i \text{,} \quad i = 1,2,...,m\text{; } x\in \mathbb R^d. \]
Further recall that in regression we try to find a set of unknown parameters \(\beta\) such that \(y \approx f(x;\beta)\) for the data pairs \((x, y)\). In order to find an appropriate set of parameters \(\beta\) we need to define an objective function \(\mathcal E\), which allows us to judge how “good” a particular choice of the unknown parameters \(\beta\) is. One of the most popular estimation methods is the so-called ordinary least squares (OLS) method, in which the coefficients \(\beta_0, \beta_1,\beta_2,...,\beta_{d}\) are chosen by minimizing the sum of squared residuals. The objective function for the sum of squared residuals is written as
\[\mathcal E = \sum_{i=1}^m\epsilon_i^2=\sum_{i=1}^m(y_i - f(x_i; \beta))^2\text{.}\]
By plugging in the regression model equation from above we get
\[\mathcal E = \sum_{i=1}^m(y_i - \beta_0 - \sum_{j=1}^dx_{ij}\beta_j)^2\text{,}\] where \(m\) corresponds to the number of observations and \(d\) corresponds to the number of features of our data set.
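To make this concrete, here is a minimal R sketch (with simulated data and illustrative object names, not part of the original text) that evaluates the objective function \(\mathcal E\) for a candidate set of coefficients:

```r
# Minimal sketch: evaluate the OLS objective (sum of squared residuals)
# for a candidate set of coefficients on simulated data.
set.seed(1)
m <- 50; d <- 3                                   # number of observations and features
X <- matrix(rnorm(m * d), nrow = m, ncol = d)     # m x d feature matrix
beta.true <- c(2, -1, 0.5)                        # hypothetical coefficients
y <- as.numeric(1 + X %*% beta.true + rnorm(m, sd = 0.3))  # beta_0 = 1 plus noise

beta0 <- 1; beta <- beta.true                     # candidate parameters
E <- sum((y - beta0 - X %*% beta)^2)              # objective function E
E
```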
In the previous section we introduced the matrix notation for the vector \(\mathbf y\) and the stacked matrix \(\mathbf X\) (written in bold letters to indicate that \(\mathbf{y}\) is an \(m\)-dimensional vector and \(\mathbf{X}\) an \(m \times d\) matrix, respectively).
\[ \begin{align} \mathbf y_{m \times 1}= \begin{bmatrix} y_{1} \\ y_{2} \\ \vdots \\ y_{m} \\ \end{bmatrix} ,\qquad \mathbf X_{m \times d}= \begin{bmatrix} \mathbf x_{1}^T \\ \mathbf x_{2}^T \\ \vdots \\ \mathbf x_{m}^T \\ \end{bmatrix} = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1d} \\ x_{21} & x_{22} & \cdots & x_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ x_{m1} & x_{m2} & \cdots & x_{md} \\ \end{bmatrix} \end{align} \]
For linear regression (and classification) the intercept term \(\beta_0\) does not interact with any element in the vector \(\mathbf{x} \in \mathbb R^d\). Thus it is convenient to populate the first column of the matrix \(\mathbf X\) with ones:
\[ \begin{align} \mathbf y_{m \times 1}= \begin{bmatrix} y_{1} \\ y_{2} \\ \vdots \\ y_{m} \\ \end{bmatrix}\text{;} \quad \mathbf X_{m \times (d+1)}= \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1d} \\ 1 &x_{21} & x_{22} & \cdots & x_{2d} \\ \vdots &\vdots & \vdots & \ddots & \vdots \\ 1 & x_{m1} & x_{m2} & \cdots & x_{md} \\ \end{bmatrix}\text{;} \quad \mathbf{\beta}_{(d+1) \times 1}= \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_d \\ \end{bmatrix} \end{align}. \]
Note that the matrix \(\mathbf X\) is now of size \(m \times (d+1)\) and that \(\mathbf{\beta}\in \mathbb R^{d+1}\), i.e. \([\beta_0, \beta_1,...,\beta_d]^T\) is of size \((d+1) \times 1\). The vector \(\mathbf y\) is referred to as the response vector and the matrix \(\mathbf X\) is called the model matrix.
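In R, the model matrix with a leading column of ones can be built by hand with cbind() or, equivalently, with model.matrix(). The following sketch continues with the simulated data from above (object names are illustrative):

```r
# Sketch: augment the feature matrix with a column of ones so that the
# intercept is handled by the same matrix product.
X.model <- cbind(1, X)                            # m x (d+1) model matrix
colnames(X.model) <- c("(Intercept)", paste0("x", 1:d))
dim(X.model)                                      # 50 x 4

# model.matrix() constructs the same matrix from a formula:
df <- data.frame(y = y, X)
head(model.matrix(y ~ ., data = df))
```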
Recalling the basics of linear algebra, we know that the dot product, also referred to as the scalar product, of a vector \(\mathbf u_{1\times n}\) with a vector \(\mathbf v_{n\times 1}\) results in a scalar \(w\) (\(w\) may be written as a matrix of size \(1\times1\)). The value of the scalar \(w\) corresponds to the sum of the products of the corresponding elements of \(\mathbf u\) and \(\mathbf v\).
\[\mathbf u \cdot \mathbf v = \sum_{i=1}^n u_iv_i= u_1v_1+u_2v_2+...+ u_nv_n= w\]
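In R the dot product can be computed either as a sum of elementwise products or with the matrix product operator %*% (which returns a \(1\times1\) matrix), for example:

```r
# Sketch: dot product of two vectors
u <- c(1, 2, 3)
v <- c(4, 5, 6)
sum(u * v)       # 32, the sum of elementwise products
t(u) %*% v       # 1 x 1 matrix containing the same value
```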
Returning to our problem, we may rewrite the original least squares objective function
\[\mathcal E = \sum_{i=1}^m\left(y_i - \beta_0 - \sum_{j=1}^dx_{ij}\beta_j\right)^2\] in matrix notation as
\[\mathcal E = \sum_{i=1}^m(y_i - \mathbf x_i^T \beta)^2\text{,}\] where each row vector \(\mathbf x_i\) now includes the leading \(1\) for the intercept term \(\beta_0\).
The representation in matrix form allows us to further simplify the equation by plugging in the response vector \(\mathbf y\) and the model matrix \(\mathbf X\):
\[\mathcal E = \sum_{i=1}^m(y_i - \mathbf x_i^T \beta)^2 = \Vert\mathbf y - \mathbf X\beta\Vert^2 = (\mathbf y - \mathbf X\beta)^T(\mathbf y - \mathbf X\beta).\] Now, in order to find the least squares solution we take the gradient with respect to \(\beta\), set it equal to zero and solve for the unique minimizer \(\hat \beta\):
\[\frac{\partial \mathcal E}{\partial \beta}=2\mathbf X^T\mathbf X \beta-2\mathbf X^T\mathbf y = 0 \quad \Rightarrow \quad\hat \beta = (\mathbf X^T\mathbf X)^{-1}\mathbf X^T\mathbf y.\]
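A minimal R sketch of this closed-form solution, continuing with the simulated X.model and y from above and comparing the result with R's built-in lm() function:

```r
# Sketch: the closed-form OLS solution via the normal equations.
beta.hat <- solve(t(X.model) %*% X.model) %*% t(X.model) %*% y
beta.hat

# The minimized objective in matrix form, (y - X beta.hat)^T (y - X beta.hat):
drop(crossprod(y - X.model %*% beta.hat))

# lm() adds the intercept column itself and yields the same estimates:
coef(lm(y ~ X))
```

In practice, solve(crossprod(X.model), crossprod(X.model, y)) or a QR-based fit is numerically preferable to forming the inverse of \(\mathbf X^T\mathbf X\) explicitly, which is why lm() works with a QR decomposition internally.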
The matrix \(\mathbf H = \mathbf X(\mathbf X^T\mathbf X)^{-1}\mathbf X^T\) is sometimes called the “hat” matrix because it puts the hat on \(\mathbf y\) (Hastie et al. 2008).
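As a quick check, a sketch of the hat matrix for the simulated data (it reproduces the fitted values \(\mathbf X\hat\beta\)):

```r
# Sketch: the hat matrix puts the "hat" on y.
H <- X.model %*% solve(t(X.model) %*% X.model) %*% t(X.model)
y.hat <- H %*% y                                                # fitted values
all.equal(as.numeric(y.hat), as.numeric(X.model %*% beta.hat))  # TRUE
```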
Given new data \(\mathbf X_{new}\), the least squares prediction \(\mathbf{\hat y}\) is
\[\mathbf{\hat y} = \mathbf X_{new}\hat \beta = \mathbf X_{new}(\mathbf X^T\mathbf X)^{-1}\mathbf X^T\mathbf y.\] Important note: Be aware that the calculation of \(\hat \beta = (\mathbf X^T\mathbf X)^{-1}\mathbf X^T\mathbf y\) assumes that \((\mathbf X^T\mathbf X)^{-1}\) exists. Therefore \(\mathbf X^T\mathbf X\) has to be of full rank.
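Before turning to the full-rank requirement, here is a short sketch of the prediction step for hypothetical new observations X.new (augmented with the intercept column, as before):

```r
# Sketch: least squares prediction for new data.
X.new <- matrix(rnorm(5 * d), nrow = 5, ncol = d) # 5 hypothetical new observations
y.pred <- cbind(1, X.new) %*% beta.hat            # X_new %*% beta.hat
y.pred
```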
A matrix \(\mathbf M_{m\times m}\) is called full rank if all of its row (column) vectors are linearly independent. This means that any point in \(\mathbb R^m\) can be reached by a weighted combination of the \(m\) rows (columns) of \(\mathbf M\).
In our case, \(\mathbf X_{m\times (d+1)}^T\mathbf X_{m\times (d+1)}\) is a \((d+1) \times (d+1)\) matrix.
Thus, if either \(m < d + 1\) or at least one of the \(d+1\) column vectors of \(\mathbf X\) can be written as a linear combination of the others, least squares fails, because \((\mathbf X^T\mathbf X)^{-1}\) does not exist. In this case we are facing an infinite number of possible solutions.
The latter case occurs, for example, if \(\mathbf X\) contains the complete set of parts of a compositional data set, since these parts sum to a constant!
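The following self-contained sketch illustrates the failure: one predictor is an exact linear combination (here a multiple) of another, so \(\mathbf X^T\mathbf X\) is singular; solve() throws an error and lm() reports an NA coefficient.

```r
# Sketch: rank-deficient model matrix due to perfectly collinear predictors.
set.seed(2)
x1 <- rnorm(30)
x2 <- 2 * x1                        # x2 is an exact linear combination of x1
y.col <- 1 + x1 + rnorm(30)
X.def <- cbind(1, x1, x2)           # rank 2 instead of 3

# solve(t(X.def) %*% X.def)         # error: the system is exactly singular
coef(lm(y.col ~ x1 + x2))           # the coefficient of x2 is NA
```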