To get some hands-on experience we apply simple linear regression in an exercise. For this purpose we load the students data set. You may download the students.csv file here. Import the data set and assign a proper name to it.

students <- read.csv("https://userpage.fu-berlin.de/soga/200/2010_data_sets/students.csv")

The students data set consists of 8239 rows, each representing a particular student, and 16 columns, each corresponding to a variable/feature related to that particular student. These self-explanatory variables are: stud.id, name, gender, age, height, weight, religion, nc.score, semester, major, minor, score1, score2, online.tutorial, graduated, salary.
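We may quickly verify the dimensions of the data set in R; the dim() function returns the number of rows and columns of a data frame.

dim(students)
## [1] 8239   16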

To showcase simple linear regression we examine the relationship between two variables: the height of students as the predictor variable and the weight of students as the response variable.


Data preparation

For data preparation we randomly sample 12 students from the data set and build a data frame with the two variables of interest (height and weight). Further, we plot the data in the form of a scatter plot to visualize the underlying linear relationship between the two variables. Note that sample() draws at random, so your particular sample will differ from the one shown here.

n <- 12  # sample size

sample.idx <- sample(1:nrow(students), n)            # draw n random row indices
data <- students[sample.idx, c('height', 'weight')]  # keep the two variables of interest

plot(data$height, data$weight)

The visual inspection confirms our assumption that the relationship between the height and the weight variable is roughly linear. In other words, with increasing height the individual student tends to have a higher weight.
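Beyond the visual impression we may quantify the strength of the linear relationship by the correlation coefficient, using the cor() function. Note that the exact value depends on the particular random sample drawn above.

# correlation between height and weight in our sample
cor(data$height, data$weight)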


Parameter estimation

Solving for \(\beta_0\) and \(\beta_1\) analytically in R

As shown in the previous section the parameters \(\beta_0\) and \(\beta_1\) of a simple linear model may be calculated analytically. Recall the equation for a linear model fitted to sample data,

\[y = \hat{\beta}_0 + \hat{\beta}_1 x + e \text{,}\]

with \(\hat{\beta}_1\) given by

\[\hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i- \bar x) (y_i-\bar y)}{\sum_{i=1}^n (x_i-\bar x)^2} = \frac{\text{cov}(x,y)}{\text{var}(x)}\text{,}\]

and \(\hat{\beta}_0\) by

\[\hat{\beta}_0 = \bar y -\hat{\beta}_1 \bar x\text{.}\]

For a better understanding we use R to calculate the individual terms:

# Calculate b1

x <- data$height
y <- data$weight
x.bar <- mean(x)
y.bar <- mean(y)

b1 <- sum((x-x.bar)*(y-y.bar)) / sum((x-x.bar)^2)
b1
## [1] 0.7118566

The slope of the regression model is approximately 0.71. As a sanity check we calculate the ratio of the covariance of \(x\) and \(y\), applying the cov() function, to the variance of \(x\), applying the var() function, and compare it to the result from above.

cov(x,y) / var(x)
## [1] 0.7118566

A perfect match!

Further, we calculate \(\beta_0\).

# Calculate b0
b0 <- y.bar - b1*x.bar
b0
## [1] -48.76592

The intercept \(\beta_0\) of the regression model is approximately -48.77.

Thus, we can write down the regression model:

\[\text{weight} = -48.77 + 0.71 \times \text{height} \]

Now, based on that equation, we may determine the weight of a student given their height. Just for fun, let us predict the weight of students with heights of 156, 178 and 192 cm.

\[\text{weight}_{156} = -48.77 + 0.71 \times 156 \approx 62 \text{ kg} \]

\[\text{weight}_{178} = -48.77 + 0.71 \times 178 \approx 78\text{ kg} \]

\[\text{weight}_{192} = -48.77 + 0.71 \times 192 \approx 88\text{ kg} \]
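Of course we do not have to do this arithmetic by hand. We may evaluate the regression equation directly in R, reusing the coefficients b0 and b1 computed above, which reproduces the values of approximately 62, 78 and 88 kg.

# predict the weight for heights of 156, 178 and 192 cm
heights <- c(156, 178, 192)
b0 + b1 * heights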


Using the lm() function in R to calculate \(\beta_0\) and \(\beta_1\)

The R software package provides a function for linear models called lm(). The lm() function expects the formula notation as input; thus, a linear model is specified as response ~ predictor.

lin.model <- lm(weight ~ height, data = data)
lin.model
## 
## Call:
## lm(formula = weight ~ height, data = data)
## 
## Coefficients:
## (Intercept)       height  
##    -48.7659       0.7119

The default output to the R console returns the intercept and the regression coefficient. The $ notation may be useful if you want to access a particular property of the model object. For example, we may access the intercept and the regression coefficient(s) for further usage.

lin.model$coefficients[1]
## (Intercept) 
##   -48.76592
lin.model$coefficients[2]
##    height 
## 0.7118566

There are many more properties of an lm object that can be accessed via the $ notation. The str() function provides an overview of the structure of the model object.

str(lin.model)
## List of 12
##  $ coefficients : Named num [1:2] -48.766 0.712
##   ..- attr(*, "names")= chr [1:2] "(Intercept)" "height"
##  $ residuals    : Named num [1:12] 1.14 2.479 -0.143 -1.062 -1.602 ...
##   ..- attr(*, "names")= chr [1:12] "5445" "1555" "6615" "3747" ...
##  $ effects      : Named num [1:12] -245.143 -20.07 -0.184 -1.667 -1.925 ...
##   ..- attr(*, "names")= chr [1:12] "(Intercept)" "height" "" "" ...
##  $ rank         : int 2
##  $ fitted.values: Named num [1:12] 60.9 76.5 65.8 73 69.4 ...
##   ..- attr(*, "names")= chr [1:12] "5445" "1555" "6615" "3747" ...
##  $ assign       : int [1:2] 0 1
##  $ qr           :List of 5
##   ..$ qr   : num [1:12, 1:2] -3.464 0.289 0.289 0.289 0.289 ...
##   .. ..- attr(*, "dimnames")=List of 2
##   .. .. ..$ : chr [1:12] "5445" "1555" "6615" "3747" ...
##   .. .. ..$ : chr [1:2] "(Intercept)" "height"
##   .. ..- attr(*, "assign")= int [1:2] 0 1
##   ..$ qraux: num [1:2] 1.29 1.4
##   ..$ pivot: int [1:2] 1 2
##   ..$ tol  : num 1e-07
##   ..$ rank : int 2
##   ..- attr(*, "class")= chr "qr"
##  $ df.residual  : int 10
##  $ xlevels      : Named list()
##  $ call         : language lm(formula = weight ~ height, data = data)
##  $ terms        :Classes 'terms', 'formula'  language weight ~ height
##   .. ..- attr(*, "variables")= language list(weight, height)
##   .. ..- attr(*, "factors")= int [1:2, 1] 0 1
##   .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. ..$ : chr [1:2] "weight" "height"
##   .. .. .. ..$ : chr "height"
##   .. ..- attr(*, "term.labels")= chr "height"
##   .. ..- attr(*, "order")= int 1
##   .. ..- attr(*, "intercept")= int 1
##   .. ..- attr(*, "response")= int 1
##   .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
##   .. ..- attr(*, "predvars")= language list(weight, height)
##   .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
##   .. .. ..- attr(*, "names")= chr [1:2] "weight" "height"
##  $ model        :'data.frame':   12 obs. of  2 variables:
##   ..$ weight: num [1:12] 62 79 65.7 71.9 67.8 77.7 64.9 66.7 81.4 68.6 ...
##   ..$ height: int [1:12] 154 176 161 171 166 177 163 162 183 171 ...
##   ..- attr(*, "terms")=Classes 'terms', 'formula'  language weight ~ height
##   .. .. ..- attr(*, "variables")= language list(weight, height)
##   .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
##   .. .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. .. ..$ : chr [1:2] "weight" "height"
##   .. .. .. .. ..$ : chr "height"
##   .. .. ..- attr(*, "term.labels")= chr "height"
##   .. .. ..- attr(*, "order")= int 1
##   .. .. ..- attr(*, "intercept")= int 1
##   .. .. ..- attr(*, "response")= int 1
##   .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
##   .. .. ..- attr(*, "predvars")= language list(weight, height)
##   .. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
##   .. .. .. ..- attr(*, "names")= chr [1:2] "weight" "height"
##  - attr(*, "class")= chr "lm"
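For example, the residual degrees of freedom listed above (\(n - 2 = 10\)) are directly accessible:

lin.model$df.residual
## [1] 10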

Extractor functions

Another way to access the properties of a model object in R is to apply so-called extractor functions. Let us reconstruct the linear model from above, assign the model output to the variable model, and apply some useful extractor functions.

model <- lm(weight ~ height, data = data)

The coef() function returns the model coefficients.

coef(model)
## (Intercept)      height 
## -48.7659189   0.7118566

The confint.default() function returns the confidence intervals for the model coefficients at a specified confidence level.

confint.default(model, level = 0.90)
##                     5 %        95 %
## (Intercept) -71.0082990 -26.5235388
## height        0.5795511   0.8441621
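For reference, confint.default() constructs these intervals from the coefficient estimates, their standard errors and quantiles of the normal distribution. Here is a minimal sketch reproducing the 90 % intervals by hand, using the vcov() and qnorm() functions; it yields the same table as above.

# coefficient +/- z * standard error, with z = qnorm(0.95) for a 90 % interval
se <- sqrt(diag(vcov(model)))   # standard errors of the coefficients
cbind(coef(model) - qnorm(0.95) * se,
      coef(model) + qnorm(0.95) * se)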

Another useful extractor function is the residuals() function.

residuals(model)
##       5445       1555       6615       3747       1007       2129 
##  1.1400042  2.4791592 -0.1429919 -1.0615578 -1.6022749  0.4673027 
##       4088       6867       2920       2496       7568       2072 
## -2.3667051  0.1451515 -0.1038369 -4.3615578  3.3265856  2.0807212

We may immediately check whether the sum of residuals \(\left(\sum_{i=1}^n e_i\right)\) is close to zero.

sum(residuals(model))
## [1] -1.665335e-15

Cool, close to zero! The tiny deviation from exactly zero is due to floating-point rounding.

In order to plot the obtained regression line we apply the abline() function, which draws a line based on the intercept and slope, \(\beta_0\) and \(\beta_1\), respectively.

plot(data$height, data$weight)
abline(model, col = 'red')
legend("topleft",
       legend = c('data points', 'regression line'), 
       cex = 0.7, 
       col = c('black', 'red'), 
       lwd = c(NA, 1), 
       pch = c(1, NA))

The fitted() function returns \(\hat y_i\), the fitted value for each particular \(x_i\).

fitted(model)
##     5445     1555     6615     3747     1007     2129     4088     6867 
## 60.86000 76.52084 65.84299 72.96156 69.40227 77.23270 67.26671 66.55485 
##     2920     2496     7568     2072 
## 81.50384 72.96156 73.67341 64.41928

These fitted values fall directly on the regression line.

plot(data$height, data$weight)
abline(model, col ='red')
points(data$height, fitted(model), col = 'green', pch = 15, cex = 0.7)
legend("topleft",
       legend = c('data points', 'regression line', 'fitted values'), 
       cex = 0.7, 
       col = c('black', 'red', 'green'), 
       lwd = c(NA, 1, NA), 
       pch = c(1, NA, 15))
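We may also verify this numerically: the fitted values are exactly \(\hat{\beta}_0 + \hat{\beta}_1 x_i\), up to floating-point error. Here b0, b1 and x are the objects computed in the analytical calculation above.

all.equal(as.numeric(fitted(model)), b0 + b1 * x)
## [1] TRUE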

Another particularly interesting extractor function is the predict() function. If not given any new data, the predict() function returns \(\hat y_i\) for each particular \(x_i\), similar to the fitted() function. However, in addition we may provide new data, and the predict() function will predict \(\hat y_i\) for any given \(x_i\). Be aware that the new data must be a data frame object or a list object.

new.data <- list(height=c(165, 172, 183))
predict(model, new.data)
##        1        2        3 
## 68.69042 73.67341 81.50384
plot(data$height, data$weight, xlim = c(160,190))
abline(model, col = 'red')
points(c(165, 172, 183), predict(model, new.data), col = 'blue', pch = 18, cex = 1.5)
legend("topleft",
       legend = c('data points', 'regression line', 'predicted values'), 
       cex = 0.7, 
       col = c('black', 'red', 'blue'), 
       lwd = c(NA, 1, NA), 
       pch = c(1, NA, 18))
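Equivalently, the new data may be provided as a data frame object, which yields the same predictions as the list above.

new.df <- data.frame(height = c(165, 172, 183))
predict(model, new.df)
##        1        2        3 
## 68.69042 73.67341 81.50384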