In order to get some hands-on experience we apply a simple linear regression in an exercise. For this we import the students data set, which is available online:
students <- read.csv("https://userpage.fu-berlin.de/soga/data/raw-data/students.csv")
The students data set consists of 8239 rows, each representing a particular student, and 16 columns, each corresponding to a variable/feature related to that student. The self-explanatory variables are: stud.id, name, gender, age, height, weight, religion, nc.score, semester, major, minor, score1, score2, online.tutorial, graduated, salary.
In order to showcase the simple linear regression we examine the relationship between two variables: the height of students as the predictor variable and the weight of students as the response variable.
For data preparation we randomly sample 12 students from the data set and build a data frame with the two variables of interest (height and weight). Further, we plot the data in the form of a scatter plot to visualize the underlying linear relationship between the two variables.
# draw a random sample of n = 12 students
# (call set.seed() beforehand if you want a reproducible sample)
n <- 12
sample_idx <- sample(1:nrow(students), n)
# keep only the two variables of interest
data <- students[sample_idx, c("height", "weight")]
# scatter plot of the predictor (height) against the response (weight)
plot(data$height, data$weight)
The visual inspection confirms our assumption that the relationship between the height and the weight variable is roughly linear. In other words, with increasing height the individual student tends to have a higher weight.
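For an additional numerical check we may calculate the Pearson correlation coefficient using the cor() function; for this sample it is close to 1 (approximately 0.984, in line with the R-squared value reported by summary() further below), confirming a strong positive linear relationship:
cor(data$height, data$weight) # approx. 0.984 for this sample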
As shown in the previous section, the parameters \(a\) and \(b\) of a simple linear model may be calculated analytically. Recall the equation of a linear model for sample data,
\[y = a + b x + e \text{,}\]
the equation for \(b\),
\[b = \frac{\sum_{i=1}^n ((x_i- \bar x) (y_i-\bar y))}{\sum_{i=1}^n (x_i-\bar x)^2} = \frac{cov(x,y)}{var(x)}\text{,}\]
and \(a\)
\[a = \bar y - b \bar x \text{.}\]
For a better understanding we use R to calculate the individual terms.
# calculate b
x <- data$height
y <- data$weight
x_bar <- mean(x)
y_bar <- mean(y)
lm_b <- sum((x - x_bar) * (y - y_bar)) / sum((x - x_bar)^2)
lm_b
## [1] 0.8140094
The slope of the regression model is approximately 0.81. As a sanity check we calculate the ratio of the covariance of \(x\) and \(y\), applying the cov() function, to the variance of \(x\), using the var() function, and compare it to the result above:
cov(x, y) / var(x)
## [1] 0.8140094
A perfect match!
Further, we calculate \(a\):
# calculate a
lm_a <- y_bar - lm_b * x_bar
lm_a
## [1] -67.82645
The intercept \(a\) of the regression model is approximately -67.83.
Now, we can write down the regression model:
\[\text{weight} = -67.83 + 0.81 \times \text{height.} \]
Based on that equation we may estimate the weight of a student given their height. Just for fun, let us predict the weight of students with heights of 156, 178 and 192 cm (using the unrounded coefficients):
\[\text{weight}_{156} = -67.83 + 0.81 \times 156 \approx 59 \text{ kg} \]
\[\text{weight}_{178} = -67.83 + 0.81 \times 178 \approx 77\text{ kg} \]
\[\text{weight}_{192} = -67.83 + 0.81 \times 192 \approx 88\text{ kg} \]
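We may verify these numbers in R with the coefficients lm_a and lm_b calculated above:
# predict the weight for heights of 156, 178 and 192 cm
lm_a + lm_b * c(156, 178, 192)
## [1] 59.15902 77.06723 88.46336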
The lm() function in R to calculate \(a\) and \(b\)
The R software package provides a function for linear models called lm(). The lm() function expects a formula notation as input; thus, a linear model is specified as response ~ predictor.
lin_model <- lm(weight ~ height, data = data)
lin_model
##
## Call:
## lm(formula = weight ~ height, data = data)
##
## Coefficients:
## (Intercept) height
## -67.826 0.814
The default output to the R console returns the intercept and the regression coefficient. The $ notation is useful if you want to access a particular property of the model object. For example, we may access the intercept and the regression coefficient(s) for further usage:
lin_model$coefficients[1]
## (Intercept)
## -67.82645
lin_model$coefficients[2]
## height
## 0.8140094
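For example, a quick manual prediction of the weight for a (freely chosen) height of 170 cm based on the extracted coefficients:
# manual prediction: intercept + slope * height
unname(lin_model$coefficients[1] + lin_model$coefficients[2] * 170)
## [1] 70.55515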
There are many more properties of an lm object that can be accessed by the $ notation. The str() function provides an overview of the structure of the model object:
str(lin_model)
## List of 12
## $ coefficients : Named num [1:2] -67.826 0.814
## ..- attr(*, "names")= chr [1:2] "(Intercept)" "height"
## $ residuals : Named num [1:12] -0.737 -1.355 -3.569 0.437 0.885 ...
## ..- attr(*, "names")= chr [1:12] "3371" "6010" "8040" "32" ...
## $ effects : Named num [1:12] -248.405 33.709 -3.325 0.131 1.417 ...
## ..- attr(*, "names")= chr [1:12] "(Intercept)" "height" "" "" ...
## $ rank : int 2
## $ fitted.values: Named num [1:12] 81.1 70.6 71.4 88.5 62.4 ...
## ..- attr(*, "names")= chr [1:12] "3371" "6010" "8040" "32" ...
## $ assign : int [1:2] 0 1
## $ qr :List of 5
## ..$ qr : num [1:12, 1:2] -3.464 0.289 0.289 0.289 0.289 ...
## .. ..- attr(*, "dimnames")=List of 2
## .. .. ..$ : chr [1:12] "3371" "6010" "8040" "32" ...
## .. .. ..$ : chr [1:2] "(Intercept)" "height"
## .. ..- attr(*, "assign")= int [1:2] 0 1
## ..$ qraux: num [1:2] 1.29 1.1
## ..$ pivot: int [1:2] 1 2
## ..$ tol : num 1e-07
## ..$ rank : int 2
## ..- attr(*, "class")= chr "qr"
## $ df.residual : int 10
## $ xlevels : Named list()
## $ call : language lm(formula = weight ~ height, data = data)
## $ terms :Classes 'terms', 'formula' language weight ~ height
## .. ..- attr(*, "variables")= language list(weight, height)
## .. ..- attr(*, "factors")= int [1:2, 1] 0 1
## .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. ..$ : chr [1:2] "weight" "height"
## .. .. .. ..$ : chr "height"
## .. ..- attr(*, "term.labels")= chr "height"
## .. ..- attr(*, "order")= int 1
## .. ..- attr(*, "intercept")= int 1
## .. ..- attr(*, "response")= int 1
## .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
## .. ..- attr(*, "predvars")= language list(weight, height)
## .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
## .. .. ..- attr(*, "names")= chr [1:2] "weight" "height"
## $ model :'data.frame': 12 obs. of 2 variables:
## ..$ weight: num [1:12] 80.4 69.2 67.8 88.9 63.3 83 59.5 65 59.6 80.2 ...
## ..$ height: int [1:12] 183 170 171 192 160 186 156 161 158 182 ...
## ..- attr(*, "terms")=Classes 'terms', 'formula' language weight ~ height
## .. .. ..- attr(*, "variables")= language list(weight, height)
## .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
## .. .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. .. ..$ : chr [1:2] "weight" "height"
## .. .. .. .. ..$ : chr "height"
## .. .. ..- attr(*, "term.labels")= chr "height"
## .. .. ..- attr(*, "order")= int 1
## .. .. ..- attr(*, "intercept")= int 1
## .. .. ..- attr(*, "response")= int 1
## .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
## .. .. ..- attr(*, "predvars")= language list(weight, height)
## .. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
## .. .. .. ..- attr(*, "names")= chr [1:2] "weight" "height"
## - attr(*, "class")= chr "lm"
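For example, the residual degrees of freedom listed in the structure above can be accessed directly:
lin_model$df.residual
## [1] 10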
Another way to access the properties of a model object in R is to apply so-called extractor functions. Let us reconstruct the linear model from above, assign the model output to the variable model, and apply some useful extractor functions.
model <- lm(weight ~ height, data = data)
The coef() function returns the model coefficients:
coef(model)
## (Intercept) height
## -67.8264493 0.8140094
The confint.default() function returns the confidence interval for the model coefficients at a specified confidence level:
confint.default(model, level = 0.90)
## 5 % 95 %
## (Intercept) -80.8847910 -54.7681076
## height 0.7380151 0.8900038
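The confint.default() function is based on the normal approximation, i.e. the estimate plus/minus the corresponding standard normal quantile times the standard error. As a sanity check we may reconstruct the 90 % interval for the slope by hand (the standard error is extracted from the model summary, shown in full at the end of this section):
# reconstruct the 90 % interval for the slope: estimate +/- z * standard error
b_hat <- coef(model)[2]
se_b <- summary(model)$coefficients["height", "Std. Error"]
b_hat + c(-1, 1) * qnorm(0.95) * se_b
## [1] 0.7380151 0.8900038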
Another useful extractor function is the residuals() function:
residuals(model)
## 3371 6010 8040 32 7210 7165 3329
## -0.7372759 -1.3551533 -3.5691627 0.4366393 0.8849410 -0.5793041 0.3409787
## 8109 2449 6699 3210 2074
## 1.7709315 -1.1870402 -0.1232664 0.1849410 3.9327713
We may immediately check whether the sum of residuals \(\left(\sum_{i=1}^n e_i\right)\) is close to zero:
sum(residuals(model))
## [1] 3.885781e-16
Cool, close to zero! The tiny remainder is just floating-point round-off.
In order to plot the obtained regression line we apply the abline() function, which draws a line based on the intercept (\(a\)) and slope (\(b\)).
plot(data$height, data$weight)
abline(model, col = "red")
legend("topleft",
legend = c("data points", "regression line"),
cex = 0.7,
col = c("black", "red"),
lwd = c(NA, 1),
pch = c(1, NA)
)
The fitted() function returns \(\hat y_i\), the fitted value for each particular \(x_i\):
fitted(model)
## 3371 6010 8040 32 7210 7165 3329 8109
## 81.13728 70.55515 71.36916 88.46336 62.41506 83.57930 59.15902 63.22907
## 2449 6699 3210 2074
## 60.78704 80.32327 62.41506 77.06723
These fitted values fall directly on the regression line.
plot(data$height, data$weight)
abline(model, col = "red")
points(data$height, fitted(model), col = "green", pch = 15, cex = 0.7)
legend("topleft",
legend = c("data points", "regression line", "fitted values"),
cex = 0.7,
col = c("black", "red", "green"),
lwd = c(NA, 1, NA),
pch = c(1, NA, 15)
)
Another particularly interesting extractor function is the predict() function. If no new data is specified, the predict() function returns \(\hat y_i\) for each particular \(x_i\), similar to the fitted() function. In addition, however, we may provide new data, and the predict() function will predict \(\hat y_i\) for any given \(x_i\). Be aware that the new data must be a data frame or a list.
new_data <- list(height = c(165, 172, 183))
predict(model, new_data)
## 1 2 3
## 66.48511 72.18317 81.13728
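Equivalently, the new data may be passed as a data frame:
new_data_df <- data.frame(height = c(165, 172, 183))
predict(model, new_data_df)
##        1        2        3 
## 66.48511 72.18317 81.13728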
plot(data$height, data$weight, xlim = c(160, 190))
abline(model, col = "red")
points(c(165, 172, 183), predict(model, new_data), col = "blue", pch = 18, cex = 1.5)
legend("topleft",
legend = c("data points", "regression line", "predicted values"),
cex = 0.7,
col = c("black", "red", "blue"),
lwd = c(NA, 1, NA),
pch = c(1, NA, 18)
)
In addition, R provides a simple approach to construct uncertainty bands around the fitted regression line. There are two kinds of bands, referred to as the narrow and the wide limits. The narrow bands, the so-called confidence bands, reflect the uncertainty about the line itself. These bands become narrower as the number of observations increases, reflecting a well-determined line. The wide bands, the so-called prediction bands, include the uncertainty about future observations; they capture the majority of the observed points and do not collapse to a line as the number of observations increases.
In order to calculate these uncertainty bands we apply the predict() function and add the argument interval = "confidence" or interval = "prediction". This way we get the vector of predicted values augmented with lower and upper limits.
predict(model, interval = "confidence", level = 0.99)
## fit lwr upr
## 3371 81.13728 78.69992 83.57463
## 6010 70.55515 68.79248 72.31783
## 8040 71.36916 69.61767 73.12066
## 32 88.46336 84.97802 91.94870
## 7210 62.41506 59.99462 64.83550
## 7165 83.57930 80.81819 86.34042
## 3329 59.15902 56.30249 62.01555
## 8109 63.22907 60.90734 65.55079
## 2449 60.78704 58.15581 63.41827
## 6699 80.32327 77.98544 82.66110
## 3210 62.41506 59.99462 64.83550
## 2074 77.06723 75.06892 79.06553
predict(model, interval = "prediction", level = 0.99)
## fit lwr upr
## 3371 81.13728 74.60208 87.67247
## 6010 70.55515 64.24048 76.86983
## 8040 71.36916 65.05760 77.68072
## 32 88.46336 81.46939 95.45733
## 7210 62.41506 55.88616 68.94396
## 7165 83.57930 76.91659 90.24202
## 3329 59.15902 52.45620 65.86184
## 8109 63.22907 56.73611 69.72202
## 2449 60.78704 54.17709 67.39699
## 6699 80.32327 73.82454 86.82200
## 3210 62.41506 55.88616 68.94396
## 2074 77.06723 70.68277 83.45168
# sequence of heights spanning well beyond the observed range
new_values <- seq(min(data$height) * 0.5, max(data$height) * 1.5, by = 0.1)
# 99 % confidence and prediction limits along that sequence
conf <- predict(model, newdata = list("height" = new_values), interval = "confidence", level = 0.99)
pred <- predict(model, newdata = list("height" = new_values), interval = "prediction", level = 0.99)
plot(data$height, data$weight)
abline(model, col = "red")
lines(new_values, conf[, "lwr"])
lines(new_values, conf[, "upr"])
lines(new_values, pred[, "lwr"], lty = 2)
lines(new_values, pred[, "upr"], lty = 2)
legend("topleft",
legend = c("data points", "regression line", "confidence bands", "prediction bands"),
cex = 0.7,
col = c("black", "red", "black", "black"),
lwd = c(NA, 1, 1, 1),
lty = c(NA, 1, 1, 2),
pch = c(1, NA, NA, NA)
)
Finally, the most convenient function to summarize the properties of any model object in R is the summary() function:
summary(model)
##
## Call:
## lm(formula = weight ~ height, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5692 -0.8497 0.0308 0.5487 3.9328
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -67.8264 7.9389 -8.544 6.59e-06 ***
## height 0.8140 0.0462 17.619 7.38e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.913 on 10 degrees of freedom
## Multiple R-squared: 0.9688, Adjusted R-squared: 0.9657
## F-statistic: 310.4 on 1 and 10 DF, p-value: 7.383e-09
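Just as with the model object itself, the components of the summary object may be accessed via the $ notation, e.g. the coefficient of determination and the residual standard error reported above:
summary(model)$r.squared # multiple R-squared, approx. 0.9688
summary(model)$sigma # residual standard error, approx. 1.913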
We will discuss the output of the summary function for a linear model object in more detail in the next section on model diagnostics. In addition, you may visit the blog by Felipe Rego, who provides a nicely written overview of the output of a linear model in R.