In this section we discuss a special type of regression called simple linear regression. In this special case of regression analysis the relationship between the response variable \(y\) and the predictor variable \(x\) is given in the form of a linear equation

\[y= \alpha + \beta x\text{,}\]

where \(\alpha\) and \(\beta\) are constants. The number \(\alpha\) is called the intercept and defines the point of intersection of the regression line and the \(y\)-axis (\(x=0\)). The number \(\beta\) is called the regression coefficient; it is a measure of the slope of the regression line. Thus, \(\beta\) indicates how much the \(y\)-value changes when the \(x\)-value increases by 1 unit. The adjective simple refers to the fact that the response variable is related to a single predictor. The model is considered a deterministic model, as it gives an exact relationship between \(x\) and \(y\).

Let us consider a simple example. Given is a population of \(n = 3\) points with Cartesian coordinates \((x_i,y_i)\) of \((1,6)\), \((2,8)\) and \((3,10)\). These points lie on a straight line and can therefore be described by a linear equation model of the form \(y = \alpha + \beta x\), with \(\alpha = 4\) and \(\beta = 2\).
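As a quick sketch (in Python, for illustration only), we can recover \(\alpha\) and \(\beta\) from the three points and confirm that every point satisfies the deterministic model exactly:

```python
# Sketch: the three population points from the example above.
xs = [1, 2, 3]
ys = [6, 8, 10]

# Slope from the first two points, intercept from the first point.
beta = (ys[1] - ys[0]) / (xs[1] - xs[0])   # (8 - 6) / (2 - 1) = 2.0
alpha = ys[0] - beta * xs[0]               # 6 - 2 * 1 = 4.0

# Every point satisfies y = alpha + beta * x exactly (deterministic model).
assert all(y == alpha + beta * x for x, y in zip(xs, ys))
print(alpha, beta)  # 4.0 2.0
```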


In many cases, however, the relationship between two variables \(x\) and \(y\) is not exact. This is because the response variable \(y\) is affected by other unknown and/or random processes that are not fully captured by the predictor variable \(x\). In such a case the data points do not line up on a straight line; however, the data may still follow an underlying linear relationship. In order to take these unknowns into account, a random error term, denoted \(\epsilon\), is added to the linear model equation. The result is, in contrast to the deterministic model above, a probabilistic model:

\[y = \alpha + \beta x + \epsilon \text{,}\]

where the error term \(\epsilon\) is assumed to consist of independent, normally distributed values, \(\epsilon \sim N(0, \sigma^2)\).
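The difference between the two models can be made tangible with a short simulation sketch (Python for illustration; the values of \(\alpha\), \(\beta\) and \(\sigma\) are hypothetical choices, not taken from the text):

```python
import random

# Simulate the probabilistic model y = alpha + beta * x + eps,
# with eps ~ N(0, sigma^2). Parameter values are illustrative only.
random.seed(1)
alpha, beta, sigma = 4.0, 2.0, 1.0

xs = [1, 2, 3, 4, 5]
ys = [alpha + beta * x + random.gauss(0.0, sigma) for x in xs]

# Unlike the deterministic example, the simulated points scatter
# around the line y = alpha + beta * x instead of lying exactly on it.
```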

In linear regression modelling the following assumptions on the model are made (Mann, 2012):

- The random error term \(\epsilon\) has a mean equal to zero.
- The errors associated with different observations are independent.
- For any given \(x\), the distribution of errors is normal.
- The distribution of the errors has the same (constant) standard deviation for all values of \(x\).

Let us consider another example. This time we take a random sample of size \(n = 8\) from a population. In order to emphasize that the values of the intercept and the slope are calculated from sample data, \(\alpha\) and \(\beta\) are denoted by \(a\) and \(b\), respectively. In addition, the error term \(\epsilon\) is denoted by \(e\). Thus, \(a\), \(b\) and \(e\) are estimates, based on sample data, of the population parameters \(\alpha\), \(\beta\) and \(\epsilon\). The fitted regression line is

\[\hat y = a + b x \text{,}\]

where \(\hat y\) is the estimated or predicted value of \(y\) for any given value of \(x\).

The error \(e_i\) for each particular pair of values \((x_i,y_i)\), also called the residual, is computed as the difference between the observed value \(y_i\) and the predicted value \(\hat y_i\):

\[e_i = y_i - \hat y_i \text{.}\]

Depending on the data, \(e_i\) is a negative number if \(y_i\) plots below the regression line, and a positive number if \(y_i\) plots above it.
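The computation of \(a\), \(b\), \(\hat y_i\) and the residuals can be sketched as follows (Python for illustration; the eight sample points are made-up values, not the sample used in the course). The least-squares estimates are \(b = \sum (x_i - \bar x)(y_i - \bar y) / \sum (x_i - \bar x)^2\) and \(a = \bar y - b \bar x\):

```python
# Hypothetical sample of n = 8 points (illustrative data only).
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [5.9, 8.3, 9.6, 12.4, 13.9, 16.2, 17.8, 20.1]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Least-squares estimates of the slope b and the intercept a.
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) \
    / sum((x - x_bar) ** 2 for x in xs)
a = y_bar - b * x_bar

# Predicted values and residuals e_i = y_i - y_hat_i.
y_hat = [a + b * x for x in xs]
residuals = [y - yh for y, yh in zip(ys, y_hat)]

# For a least-squares fit with intercept, the residuals sum to
# (numerically) zero: positive and negative deviations balance out.
assert abs(sum(residuals)) < 1e-9
```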


Citation

The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us via email at soga[at]zedat.fu-berlin.de.

You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follows: Hartmann, K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.