In this section we discuss a special type of regression called simple linear regression. In this special case of regression analysis the relationship between the response variable \(y\) and the predictor variable \(x\) is given in the form of a linear equation

\[y= \alpha + \beta x\text{,}\]

where \(\alpha\) and \(\beta\) are constants. The number \(\alpha\) is called the intercept and defines the point of intersection of the regression line and the \(y\)-axis (\(x=0\)). The number \(\beta\) is called the regression coefficient; it is a measure of the slope of the regression line. Thus, \(\beta\) indicates how much the \(y\)-value changes when the \(x\)-value increases by 1 unit. The adjective simple refers to the fact that the response variable is related to a single predictor. The model is considered a deterministic model, as it gives an exact relationship between \(x\) and \(y\).

Let us consider a simple example. Given is a population of \(n = 3\) points with Cartesian coordinates \((x_i,y_i)\) of \((1,6)\), \((2,8)\) and \((3,10)\). These points lie on a straight line and can therefore be described by a linear equation model of the form \(y = \alpha + \beta x\), with \(\alpha = 4\) and \(\beta = 2\).
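As a quick sketch (in Python, for illustration only), we can recover \(\alpha\) and \(\beta\) from the three points and confirm that every point satisfies the deterministic model exactly:

```python
# Sketch: the three population points from the example above.
xs = [1, 2, 3]
ys = [6, 8, 10]

# Slope from the first two points, intercept from the first point.
beta = (ys[1] - ys[0]) / (xs[1] - xs[0])   # (8 - 6) / (2 - 1) = 2.0
alpha = ys[0] - beta * xs[0]               # 6 - 2 * 1 = 4.0

# Every point satisfies y = alpha + beta * x exactly (deterministic model).
assert all(y == alpha + beta * x for x, y in zip(xs, ys))
print(alpha, beta)  # 4.0 2.0
```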


In many cases, however, the relationship between two variables \(x\) and \(y\) is not exact. This is because the response variable \(y\) is affected by other unknown and/or random processes that are not fully captured by the predictor variable \(x\). In such a case the data points do not line up on a straight line; however, the data may still follow an underlying linear relationship. In order to take these unknowns into account, a random error term, denoted \(\epsilon\), is added to the linear model equation. The result is, in contrast to the deterministic model above, a probabilistic model:

\[y = \alpha + \beta x + \epsilon \text{,}\]

where the error term \(\epsilon\) is assumed to consist of independent, normally distributed values, \(\epsilon \sim N(0, \sigma^2)\).
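The difference between the two models can be made tangible with a short simulation sketch (Python for illustration; the values of \(\alpha\), \(\beta\) and \(\sigma\) are hypothetical choices, not taken from the text):

```python
import random

# Simulate the probabilistic model y = alpha + beta * x + eps,
# with eps ~ N(0, sigma^2). Parameter values are illustrative only.
random.seed(1)
alpha, beta, sigma = 4.0, 2.0, 1.0

xs = [1, 2, 3, 4, 5]
ys = [alpha + beta * x + random.gauss(0.0, sigma) for x in xs]

# Unlike the deterministic example, the simulated points scatter
# around the line y = alpha + beta * x instead of lying exactly on it.
```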

In linear regression modelling the following assumptions on the model are made (Mann, 2012):

- The random error term \(\epsilon\) has a mean equal to zero.
- The errors associated with different observations are independent.
- For any given \(x\), the distribution of errors is normal.
- The distribution of the errors has the same (constant) standard deviation for all values of \(x\).

Let us consider another example. This time we take a random sample of size \(n = 8\) from a population. In order to emphasize that the values of the intercept and the slope are calculated from sample data, \(\alpha\) and \(\beta\) are denoted by \(a\) and \(b\), respectively. In addition, the error term \(\epsilon\) is denoted by \(e\). Thus, \(a\), \(b\) and \(e\) are estimates, based on sample data, of the population parameters \(\alpha\), \(\beta\) and \(\epsilon\). The fitted regression line is

\[\hat y = a + b x \text{,}\]

where \(\hat y\) is the estimated or predicted value of \(y\) for any given value of \(x\).

The error \(e_i\) for each particular pair of values \((x_i,y_i)\), also called the residual, is computed as the difference between the observed value \(y_i\) and the predicted value \(\hat y_i\):

\[e_i = y_i - \hat y_i \text{.}\]

Depending on the data, \(e_i\) is a negative number if \(y_i\) plots below the regression line, and a positive number if \(y_i\) plots above it.
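The computation of \(a\), \(b\), \(\hat y_i\) and the residuals can be sketched as follows (Python for illustration; the eight sample points are made-up values, not the sample used in the course). The least-squares estimates are \(b = \sum (x_i - \bar x)(y_i - \bar y) / \sum (x_i - \bar x)^2\) and \(a = \bar y - b \bar x\):

```python
# Hypothetical sample of n = 8 points (illustrative data only).
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [5.9, 8.3, 9.6, 12.4, 13.9, 16.2, 17.8, 20.1]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Least-squares estimates of the slope b and the intercept a.
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) \
    / sum((x - x_bar) ** 2 for x in xs)
a = y_bar - b * x_bar

# Predicted values and residuals e_i = y_i - y_hat_i.
y_hat = [a + b * x for x in xs]
residuals = [y - yh for y, yh in zip(ys, y_hat)]

# For a least-squares fit with intercept, the residuals sum to
# (numerically) zero: positive and negative deviations balance out.
assert abs(sum(residuals)) < 1e-9
```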


Citation

The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us via email at soga[at]zedat.fu-berlin.de.

You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follows: Hartmann, K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.