Recall the equation for the multiple regression model: $$ y_i = \beta_0 + \sum_{j=1}^d x_{ij}\beta_j + \epsilon_i \text{,} \quad i = 1,2,...,m\text{; } \quad x\in \mathbb R^d. $$ Further recall that in regression we try to find a set of unknown parameters $\beta$ such that $y \approx f(x;\beta)$ for the data pair $(x, y)$. In order to find an appropriate set of parameters $\beta$ we need to define an objective function $\mathcal E$, which allows us to find "good" values for the set of unknown parameters $\beta$. One of the most popular estimation methods is the so-called ordinary least squares (OLS) method, in which the coefficients $\beta_0, \beta_1, \beta_2, ..., \beta_d$ are picked by minimizing the sum of squared residuals. The objective function for the sum of squared residuals is written as $$\mathcal E = \sum_{i=1}^m\epsilon_i^2=\sum_{i=1}^m(y_i - f(x_i; \beta))^2\text{.}$$ By plugging in the regression model equation from above we get $$\mathcal E = \sum_{i=1}^m\left(y_i - \beta_0 - \sum_{j=1}^d x_{ij}\beta_j\right)^2\text{,}$$ where $m$ corresponds to the number of observations and $d$ to the number of features of our data set.

In the previous section we introduced the matrix notation for the vector $\mathbf y$ and the stacked matrix $\mathbf X$ (written in bold letters to indicate that $\mathbf{y}$ is an $m$-dimensional vector and $\mathbf{X}$ an $m \times d$ matrix, respectively):

$$
\begin{align}
\mathbf y_{m \times 1}=
\begin{bmatrix}
y_{1} \\ y_{2} \\ \vdots \\ y_{m}
\end{bmatrix},
\qquad
\mathbf X_{m \times d}=
\begin{bmatrix}
\mathbf x_{1}^T \\ \mathbf x_{2}^T \\ \vdots \\ \mathbf x_{m}^T
\end{bmatrix}
=
\begin{bmatrix}
x_{11} & x_{12} & \cdots & x_{1d} \\
x_{21} & x_{22} & \cdots & x_{2d} \\
\vdots & \vdots & \ddots & \vdots \\
x_{m1} & x_{m2} & \cdots & x_{md}
\end{bmatrix}
\end{align}
$$

For linear regression (and classification) the intercept term $\beta_0$ does not interact with any element of the vector $\mathbf{x} \in \mathbb R^d$. Thus it is convenient to populate the first column of the matrix $\mathbf X$ with ones:

$$
\begin{align}
& \mathbf y_{m \times 1}=
\begin{bmatrix}
y_{1} \\ y_{2} \\ \vdots \\ y_{m}
\end{bmatrix}\text{;} \quad \\
& \mathbf X_{m \times (d+1)}=
\begin{bmatrix}
1 & x_{11} & x_{12} & \cdots & x_{1d} \\
1 & x_{21} & x_{22} & \cdots & x_{2d} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & x_{m1} & x_{m2} & \cdots & x_{md}
\end{bmatrix}\text{;} \quad \\
& \boldsymbol{\beta}_{(d+1) \times 1}=
\begin{bmatrix}
\beta_0 \\ \beta_1 \\ \vdots \\ \beta_d
\end{bmatrix}
\end{align}
$$
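The construction of such a model matrix can be illustrated with a short sketch. The code below is only a minimal example using NumPy with made-up numbers (the array values and variable names `X`, `y` and `X_design` are purely hypothetical); it shows how a column of ones is prepended so that the intercept $\beta_0$ is handled by the same matrix product as the other coefficients.

```python
import numpy as np

# Hypothetical data set with m = 5 observations and d = 2 features
X = np.array([[2.1, 3.4],
              [1.7, 2.9],
              [3.3, 4.1],
              [2.8, 3.0],
              [1.9, 2.2]])
y = np.array([10.2, 8.7, 13.1, 11.4, 8.1])

m, d = X.shape

# Prepend a column of ones so that the intercept beta_0
# is absorbed into the matrix product X beta
X_design = np.column_stack([np.ones(m), X])

print(X_design.shape)  # (5, 3), i.e. m x (d + 1)
```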
Note that the matrix $\mathbf X$ is now of size $m \times (d+1)$ and that $\boldsymbol\beta \in \mathbb R^{d+1}$, i.e. $[\beta_0, \beta_1,...,\beta_d]^T$ is of size $(d+1) \times 1$. The vector $\mathbf y$ is referred to as the response vector and the matrix $\mathbf X$ is called the model matrix.

Recalling the basics of linear algebra, we know that the dot product, also referred to as the scalar product, of a vector $\mathbf u_{1\times n}$ with a vector $\mathbf v_{n\times 1}$ results in a scalar $w$ ($w$ may be written as a matrix of size $1\times1$). The value of the scalar $w$ corresponds to the sum of the products of the elements of the vectors $\mathbf u$ and $\mathbf v$: $$\mathbf u \cdot \mathbf v = \sum_{i=1}^n u_iv_i= u_1v_1+u_2v_2+...+ u_nv_n= w$$

Returning to our problem, we may rewrite the original least squares objective function $$\mathcal E = \sum_{i=1}^m\left(y_i - \beta_0 - \sum_{j=1}^d x_{ij}\beta_j\right)^2$$ by using the matrix notation as $$\mathcal E = \sum_{i=1}^m(y_i - \mathbf x_i^T \boldsymbol\beta)^2\text{.}$$ The representation in matrix form allows us to further simplify the equation by plugging in the response vector $\mathbf y$ and the model matrix $\mathbf X$: $$\mathcal E = \sum_{i=1}^m(y_i - \mathbf x_i^T \boldsymbol\beta)^2 = \Vert\mathbf y - \mathbf X\boldsymbol\beta\Vert^2 = (\mathbf y - \mathbf X\boldsymbol\beta)^T(\mathbf y - \mathbf X\boldsymbol\beta).$$ Now, in order to find the least squares solution we take the gradient with respect to $\boldsymbol\beta$, set it equal to zero and obtain the unique solution $\hat{\boldsymbol\beta}$: $$\frac{\partial \mathcal E}{\partial \boldsymbol\beta}=2\mathbf X^T\mathbf X \boldsymbol\beta-2\mathbf X^T\mathbf y = 0 \quad \Rightarrow \quad\hat{\boldsymbol\beta} = (\mathbf X^T\mathbf X)^{-1}\mathbf X^T\mathbf y.$$ The matrix $\mathbf H = \mathbf X(\mathbf X^T\mathbf X)^{-1}\mathbf X^T$ is sometimes called the "hat" matrix because it puts the hat on $\mathbf y$ (Hastie et al. 2008). Given new data $\mathbf X_{new}$, the least squares prediction for $\hat{\mathbf y}$ is $$\mathbf{\hat y} = \mathbf X_{new}\hat{\boldsymbol\beta} = \mathbf X_{new}(\mathbf X^T\mathbf X)^{-1}\mathbf X^T\mathbf y.$$

Important note: Be aware that the calculation of $\hat{\boldsymbol\beta} = (\mathbf X^T\mathbf X)^{-1}\mathbf X^T\mathbf y$ assumes that $(\mathbf X^T\mathbf X)^{-1}$ exists. Therefore $\mathbf X^T\mathbf X$ has to be a full rank matrix. A matrix $\mathbf M_{m\times m}$ is called full rank if all its row (column) vectors are linearly independent. This means that any point in $\mathbb R^m$ can be reached by a weighted combination of the $m$ rows of $\mathbf M$. In our case, $\mathbf X_{m\times (d+1)}^T\mathbf X_{m\times (d+1)}$ is a $(d+1) \times (d+1)$ matrix. Thus, if either $m < d + 1$ or at least one of the $d+1$ column vectors of $\mathbf X$ can be written as a linear combination of the others, least squares fails, because $(\mathbf X^T\mathbf X)^{-1}$ does not exist, and we are facing an infinite number of possible solutions. The latter case occurs if $\mathbf X$ is part of a compositional data set!
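To make the closed-form solution concrete, the following sketch continues the hypothetical `X_design` and `y` from the NumPy example above (again, names and numbers are illustrative only). It checks that $\mathbf X^T\mathbf X$ has full rank, computes $\hat{\boldsymbol\beta}$ from the normal equations and forms a prediction for new data; solving the linear system is used here instead of forming the explicit inverse, which is numerically preferable.

```python
# Normal equations: X^T X beta = X^T y
XtX = X_design.T @ X_design
Xty = X_design.T @ y

# Check that X^T X is full rank, i.e. invertible
print(np.linalg.matrix_rank(XtX))  # should equal d + 1 = 3

# Least squares estimate beta_hat (solve instead of inverting X^T X)
beta_hat = np.linalg.solve(XtX, Xty)

# Prediction for (hypothetical) new observations,
# again with a leading column of ones for the intercept
X_new = np.column_stack([np.ones(2), np.array([[2.0, 3.0],
                                               [3.0, 4.0]])])
y_hat = X_new @ beta_hat
print(y_hat)

# Cross-check against NumPy's built-in least squares routine
beta_lstsq, *_ = np.linalg.lstsq(X_design, y, rcond=None)
print(np.allclose(beta_hat, beta_lstsq))  # True
```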