The linear model is given by the equation

\[y = \alpha + \beta x + \epsilon \text{,}\]

where \(\alpha\) is the intercept, \(\beta\) is the regression coefficient and \(\epsilon\) is the error term. The best regression line is found by applying the ordinary least squares method, which minimizes the sum of squared errors (SSE). This means minimizing the sum of the squared differences between the measured response values \(y_i\) and the model predictions \(\hat y_i\), given by

\[SSE = \sum_{i=1}^n \epsilon_i^2 = \sum_{i=1}^n (y_i - \hat y_i)^2\text{.}\]

Refer to the section on linear regression for more details on the linear model.
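
To make this concrete, the following sketch fits a simple linear model in R on simulated data and checks that the OLS fit minimizes the SSE. The data, the seed and the perturbation of the slope are purely illustrative assumptions, not taken from the example above.

```r
# A minimal sketch on simulated data (all values are illustrative)
set.seed(10)
x <- runif(30, 0, 10)
y <- 2 + 0.8 * x + rnorm(30)

fit <- lm(y ~ x)                      # ordinary least squares fit
SSE_ols <- sum((y - fitted(fit))^2)   # SSE of the OLS regression line

# Any other line, e.g. one with a slightly perturbed slope, yields a larger SSE
y_hat_alt <- coef(fit)[1] + (coef(fit)[2] + 0.1) * x
SSE_alt   <- sum((y - y_hat_alt)^2)

c(OLS = SSE_ols, perturbed = SSE_alt)   # SSE_ols <= SSE_alt
```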

However, we have to acknowledge that we build our models, in this case our linear regression model, on observational data. The data originates from a population whose statistical properties are generally unknown to us. Each observation we take is therefore one realization of a random variable drawn from that population.

Let us consider the example shown in the figure below, in which the population parameters are known. The population regression line is therefore given by \(y = \alpha + \beta x = 1 + 0.25x\).
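
A population like this could be simulated in R as sketched below. The parameter values \(\alpha = 1\) and \(\beta = 0.25\) follow the example, while the population size, the range of \(x\) and the error variance are assumptions chosen for illustration.

```r
# Simulate a population with known parameters alpha = 1 and beta = 0.25
# (population size, range of x and error variance are illustrative choices)
set.seed(42)
N     <- 10000
x_pop <- runif(N, 0, 10)
y_pop <- 1 + 0.25 * x_pop + rnorm(N, mean = 0, sd = 1)

plot(x_pop, y_pop, pch = ".", col = "grey")
abline(a = 1, b = 0.25, lwd = 2)   # population regression line
```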

However, if we now take a random sample from the population and build a linear model based on the sample data, the sample regression line will not coincide with the population regression line. For the figure below we draw four random samples of size 25 (blue dots). We immediately see that each sample regression line (blue dashed line) differs from the population regression line (grey line). In order to account for this variability, which is due to the random sampling process, a statistic is calculated by applying the equation

\[s_e = \sqrt{\frac{SSE}{n-2}}\text{,}\]

where \(SSE\) is the sum of squared errors and \(n\) is the sample size. The statistic \(s_e\) is known as the standard error of the estimate or the residual standard error.
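
The computation can be sketched in R by reusing the simulated population from above (the seed and the settings remain illustrative): we draw one sample of size 25, fit the model and compare the hand-computed \(s_e\) with the residual standard error reported by `summary()`.

```r
# Illustrative population as above (alpha = 1, beta = 0.25)
set.seed(42)
x_pop <- runif(10000, 0, 10)
y_pop <- 1 + 0.25 * x_pop + rnorm(10000)

# Draw one random sample of size n = 25 and fit the model
idx       <- sample(seq_along(x_pop), size = 25)
sample_df <- data.frame(x = x_pop[idx], y = y_pop[idx])
fit       <- lm(y ~ x, data = sample_df)

SSE <- sum(residuals(fit)^2)
n   <- nrow(sample_df)
s_e <- sqrt(SSE / (n - 2))                            # residual standard error by hand

c(by_hand = s_e, from_summary = summary(fit)$sigma)   # both values agree
```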

As seen above, the sample regression line varies from one sample to another; its slope is therefore a random variable. The distribution of the slope estimate is called the sampling distribution of the slope of the regression line.
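
This sampling distribution can be made visible by repeating the sampling process many times. The sketch below, again with illustrative settings, draws 5000 samples of size 25 from the simulated population and collects the estimated slopes.

```r
# Illustrative population as above (alpha = 1, beta = 0.25)
set.seed(42)
x_pop <- runif(10000, 0, 10)
y_pop <- 1 + 0.25 * x_pop + rnorm(10000)

# Draw many samples of size 25 and store the estimated slope of each fit
slopes <- replicate(5000, {
  idx <- sample(seq_along(x_pop), size = 25)
  coef(lm(y_pop[idx] ~ x_pop[idx]))[2]
})

hist(slopes, breaks = 40, main = "Sampling distribution of the slope")
abline(v = 0.25, lwd = 2)                  # true population slope
c(mean = mean(slopes), sd = sd(slopes))
```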

