
A regression model relates the dependent variable (a.k.a. response variable) $y$ to a function of the independent variables (a.k.a. explanatory or predictor variables) $x$ and unknown parameters (a.k.a. model coefficients) $\beta$. Such a regression model can be written as

$y = f(x; \beta)\text{.}$

The goal of regression is to find a function such that $y \approx f(x;\beta)$ for each data pair $(x, y)$. The function $f(x;\beta)$ is called a regression function, and its free parameters ($\beta$) are the function coefficients. A regression method is linear if the prediction function $f$ is a linear function of the unknown parameters $\beta$.
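
As a brief illustration of what "linear in the parameters" means (the two functions below are standard textbook examples, not taken from the text above), consider a single explanatory variable $x$. The polynomial model

$f(x;\beta) = \beta_0 + \beta_1 x + \beta_2 x^2$

is nonlinear in $x$ but linear in $\beta_0, \beta_1, \beta_2$, so fitting it is still a linear regression problem. In contrast, the exponential model

$f(x;\beta) = \beta_0 e^{\beta_1 x}$

is not linear in $\beta_1$ and therefore does not define a linear regression method.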

By extending the equation to a set of $m$ observations and $d$ explanatory variables, $x_1,..., x_d$, the regression model can be written as

$\begin{align} y_i & = \beta_0 + \sum_{j=1}^dx_{ij}\beta_j + \epsilon_i \\ & = \beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + \dots + \beta_dx_{id} + \epsilon_i \text{,} \quad i = 1,2,\dots,m\text{, } \mathbf x_i\in \mathbb R^d\text{, } \end{align}$

where $\beta_0$ corresponds to the intercept, sometimes referred to as bias, shift or offset, and $\epsilon_i$ corresponds to the error term of the $i$-th observation, referred to as the residual.
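
The following R snippet is a minimal sketch of how data could be generated from such a model. All numbers (`m`, `d`, `beta_0`, `beta`, the noise level) and all object names are assumptions made for illustration only. The explanatory variables are stored so that `x[i, j]` corresponds to $x_{ij}$.

```r
set.seed(42)                      # make the random draws reproducible

m <- 6                            # number of observations (illustrative)
d <- 2                            # number of explanatory variables (illustrative)

beta_0 <- 1.0                     # intercept (made-up value)
beta   <- c(2.0, -0.5)            # beta_1, ..., beta_d (made-up values)

# x[i, j] holds x_ij: explanatory variable j for observation i
x   <- matrix(runif(m * d), nrow = m, ncol = d)
eps <- rnorm(m, mean = 0, sd = 0.1)   # error terms epsilon_i

# y_i = beta_0 + sum_j x_ij * beta_j + epsilon_i, one observation at a time
y <- sapply(seq_len(m), function(i) beta_0 + sum(x[i, ] * beta) + eps[i])

# the expanded form for the first observation gives the same value
y_1 <- beta_0 + beta[1] * x[1, 1] + beta[2] * x[1, 2] + eps[1]
all.equal(y[1], y_1)
```

The explicit loop over observations mirrors the summation in the equation. In practice one would usually prefer a vectorized formulation.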

A regression model based upon $m$ observations (measurements) consists of $m$ response variables, $y_1, y_2,..., y_m$. For ease of notation we write the response variables as a one-dimensional column vector $\mathbf{y}$ of size $m \times 1$.

$\begin{align} \mathbf y_{m \times 1}= \begin{bmatrix} y_{1} \\ y_{2} \\ \vdots \\ y_{m} \\ \end{bmatrix} \end{align}$

Moreover, for each particular observation $\mathbf x_i$ $(i = 1, 2,..., m)$ we represent the $d$ associated explanatory variables as a column vector as well.

$\begin{align} \mathbf x_{i}= \begin{bmatrix} x_{i1} \\ x_{i2} \\ \vdots \\ x_{id} \\ \end{bmatrix} \end{align} \text{(e.g.)} \Rightarrow \begin{bmatrix} \text{height} \\ \text{weight} \\ \vdots \\ \text{age} \\ \end{bmatrix}$

Further, by transposing each $\mathbf{x}_i$ into a row vector, we stack the set of $m$ observation vectors into a matrix $\mathbf{X}$ of size $m \times d$:

$\begin{align} \mathbf X_{m \times d}= \begin{bmatrix} \mathbf x_{1}^T \\ \mathbf x_{2}^T \\ \vdots \\ \mathbf x_{m}^T \\ \end{bmatrix} = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1d} \\ x_{21} & x_{22} & \cdots & x_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ x_{m1} & x_{m2} & \cdots & x_{md} \\ \end{bmatrix} \end{align}.$

This matrix notation is very similar to a spreadsheet representation, where each row corresponds to an observation and each column to a feature. Please note that we assume that all features are continuous-valued $(\mathbf x \in \mathbb R^d)$ and that there are more observations than dimensions $(m > d)$ .
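
To make the notation concrete, the sketch below stacks a few observation vectors into a matrix in R. The feature names, the measurements and the response values are hypothetical and serve only to show the shapes involved ($m = 4$ observations, $d = 2$ features).

```r
# Hypothetical observation vectors x_i, each with d = 2 continuous features
x_1 <- c(height = 1.70, weight = 65)
x_2 <- c(height = 1.82, weight = 80)
x_3 <- c(height = 1.65, weight = 58)
x_4 <- c(height = 1.76, weight = 72)

# rbind() places each vector as one row, i.e. it stacks the transposed x_i
X <- rbind(x_1, x_2, x_3, x_4)

# Hypothetical responses, stored as an m x 1 column vector
y <- matrix(c(30, 45, 27, 38), ncol = 1)

dim(X)              # number of rows (m) and columns (d)
dim(y)              # m rows, 1 column
nrow(X) > ncol(X)   # checks the assumption m > d
```

Each row of `X` can be read like a spreadsheet row: `X[2, ]` returns the features of the second observation, and `X[, "weight"]` returns one feature across all observations.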

Citation

The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us by e-mail at soga[at]zedat.fu-berlin.de.

You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follows: Hartmann, K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.