A regression model relates the dependent variable (a.k.a. response variable), \(y\), to a function of independent variables (a.k.a. explanatory or predictor variables), \(x\), and unknown parameters (a.k.a. model coefficients) \(\beta\). Such a regression model can be written as

\[ y = f(x; \beta)\text{.}\]

The goal of regression is to find a function such that \(y \approx f (x;\beta)\) for each data pair \((x, y)\). The function \(f(x;\beta)\) is called a regression function, and its free parameters (\(\beta\)) are the function coefficients. A regression method is linear if the prediction function \(f\) is a linear function of the unknown parameters \(\beta\).
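
To illustrate what "linear in the unknown parameters" means, here is a minimal Python sketch. The quadratic function `f`, the helper `combine` and all numeric values are assumptions chosen for illustration only. The function is nonlinear in \(x\) but linear in \(\beta\), so it should satisfy the superposition property \(f(x; a\beta + b\gamma) = a f(x;\beta) + b f(x;\gamma)\).

```python
import math


def f(x, beta):
    """Hypothetical regression function: beta[0] + beta[1]*x + beta[2]*x**2."""
    return beta[0] + beta[1] * x + beta[2] * x ** 2


def combine(a, beta, b, gamma):
    """Return the parameter vector a*beta + b*gamma (element-wise)."""
    return [a * beta_j + b * gamma_j for beta_j, gamma_j in zip(beta, gamma)]


# Illustrative values (assumptions, not data from the text)
x = 1.5
beta = [1.0, 2.0, -0.5]
gamma = [0.5, -1.0, 3.0]
a, b = 2.0, -3.0

# Evaluate both sides of the superposition property
lhs = f(x, combine(a, beta, b, gamma))
rhs = a * f(x, beta) + b * f(x, gamma)

is_linear_in_beta = math.isclose(lhs, rhs)
print(is_linear_in_beta)
```

A function such as \(\beta_0 e^{\beta_1 x}\) would fail this check, because \(\beta_1\) enters nonlinearly.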

By extending the equation to a set of \(m\) observations and \(d\) explanatory variables, \(x_1, \dots, x_d\), a linear regression model can be written as

\[ \begin{align} y_i & = \beta_0 + \sum_{j=1}^d x_{ij}\beta_j + \epsilon_i \\ & = \beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + \dots + \beta_dx_{id} + \epsilon_i \text{,} \quad i = 1,2,\dots,m\text{, } \mathbf x_i\in \mathbb R^d\text{, } \end{align} \]

where \(\beta_0\) corresponds to the intercept, sometimes referred to as bias, shift or offset, and \(\epsilon_i\) corresponds to the error term of the \(i\)-th observation. The estimates of the error terms obtained from a fitted model are referred to as residuals.
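
The deterministic part of the model equation can be evaluated directly for a single observation. The following is a minimal sketch in Python. The function name `linear_predictor` and all numbers are hypothetical, and the error term \(\epsilon_i\) is left out because it is not observed.

```python
def linear_predictor(beta0, beta, x_i):
    """Return beta0 + sum_j x_ij * beta_j for one observation x_i."""
    if len(beta) != len(x_i):
        raise ValueError("beta and x_i must both have length d")
    return beta0 + sum(x_ij * beta_j for x_ij, beta_j in zip(x_i, beta))


# Hypothetical example with d = 3 explanatory variables
beta0 = 1.0                # intercept beta_0
beta = [0.5, -2.0, 0.25]   # coefficients beta_1, ..., beta_d
x_i = [2.0, 1.0, 4.0]      # explanatory variables x_i1, ..., x_id

y_hat_i = linear_predictor(beta0, beta, x_i)
print(y_hat_i)  # 1.0 + (1.0 - 2.0 + 1.0) = 1.0
```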

A regression model based upon \(m\) observations (measurements) contains \(m\) values of the response variable, \(y_1, y_2, \dots, y_m\). For ease of notation we write them as a one-dimensional column vector \(\mathbf{y}\) of size \(m \times 1\):

\[ \begin{align} \mathbf y_{m \times 1}= \begin{bmatrix} y_{1} \\ y_{2} \\ \vdots \\ y_{m} \\ \end{bmatrix} \end{align} \]

Moreover, for each observation \(i\) \((i = 1, 2, \dots, m)\) we represent the \(d\) associated explanatory variables as a column vector \(\mathbf{x}_i\) as well:

\[ \begin{align} \mathbf x_{i}= \begin{bmatrix} x_{i1} \\ x_{i2} \\ \vdots \\ x_{id} \\ \end{bmatrix} \end{align} \text{(e.g.)} \Rightarrow \begin{bmatrix} \text{height} \\ \text{weight} \\ \vdots \\ \text{age} \\ \end{bmatrix} \]

Further, by transposing each \(\mathbf{x}_i\) into a row vector, we stack the \(m\) observation vectors into a matrix \(\mathbf{X}\) of size \(m \times d\):

\[ \begin{align} \mathbf X_{m \times d}= \begin{bmatrix} \mathbf x_{1}^T \\ \mathbf x_{2}^T \\ \vdots \\ \mathbf x_{m}^T \\ \end{bmatrix} = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1d} \\ x_{21} & x_{22} & \cdots & x_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ x_{m1} & x_{m2} & \cdots & x_{md} \\ \end{bmatrix}\text{.} \end{align} \]

This matrix notation is very similar to a spreadsheet representation, where each row corresponds to an observation and each column to a feature. Please note that we assume that all features are continuous-valued \((\mathbf x \in \mathbb R^d)\) and that there are more observations than dimensions \((m > d)\).
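
The stacking of observation vectors into \(\mathbf{X}\) can be sketched in Python with plain lists. The feature values and the helper name `stack_observations` are made up for illustration. Each inner list plays the role of one transposed observation vector \(\mathbf{x}_i^T\).

```python
def stack_observations(observations):
    """Stack m observation vectors (each of length d) as rows of an m x d matrix."""
    d = len(observations[0])
    if any(len(x_i) != d for x_i in observations):
        raise ValueError("all observation vectors must have the same length d")
    # Copy each vector so that X does not alias the input lists
    return [list(x_i) for x_i in observations]


# Hypothetical data: m = 4 observations of d = 3 features (height, weight, age)
x_1 = [1.80, 75.0, 34.0]
x_2 = [1.65, 58.0, 27.0]
x_3 = [1.72, 80.0, 45.0]
x_4 = [1.90, 92.0, 51.0]

X = stack_observations([x_1, x_2, x_3, x_4])
m, d = len(X), len(X[0])

# One row per observation, one column per feature
second_observation = X[1]              # corresponds to x_2 transposed
age_column = [row[2] for row in X]     # third feature across all observations

print(m, d, m > d)
```

As in a spreadsheet, indexing a row returns one observation and collecting the same position from every row returns one feature. The final comparison checks the assumption \(m > d\) for this toy data set.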


The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us by e-mail at soga[at]zedat.fu-berlin.de.

Creative Commons License
You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follows: Hartmann, K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.