In order to get some hands-on experience we apply a simple linear regression in an exercise. For this we import the students data set, which is available online:
students <- read.csv("https://userpage.fu-berlin.de/soga/data/raw-data/students.csv")
The students data set consists of 8239 rows, each representing a particular student, and 16 columns, each corresponding to a variable/feature related to that student. The self-explanatory variables are: stud.id, name, gender, age, height, weight, religion, nc.score, semester, major, minor, score1, score2, online.tutorial, graduated, salary.
In order to showcase the simple linear regression we examine the relationship between two variables: the height of students as the predictor variable and the weight of students as the response variable.
For data preparation we randomly sample 12 students from the data set and build a data frame with the two variables of interest (height and weight). Further, we plot the data in the form of a scatter plot to visualize the underlying linear relationship between the two variables.
# draw a random sample of n = 12 students
# (call set.seed() beforehand if you want a reproducible sample)
n <- 12
sample_idx <- sample(1:nrow(students), n)
# keep only the two variables of interest
data <- students[sample_idx, c("height", "weight")]
# scatter plot of the predictor (height) against the response (weight)
plot(data$height, data$weight)
The visual inspection confirms our assumption that the relationship between the height and the weight variable is roughly linear. In other words, with increasing height the individual student tends to have a higher weight.
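For an additional numerical check we may calculate the Pearson correlation coefficient using the cor() function; for this sample it is close to 1 (approximately 0.984, in line with the R-squared value reported by summary() further below), confirming a strong positive linear relationship:
cor(data$height, data$weight) # approx. 0.984 for this sample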
As shown in the previous section, the parameters \(a\) and \(b\) of a simple linear model may be calculated analytically. Recall the equation of a linear model for sample data,
\[y = a + b x + e \text{,}\]
the equation for \(b\),
\[b = \frac{\sum_{i=1}^n ((x_i- \bar x) (y_i-\bar y))}{\sum_{i=1}^n (x_i-\bar x)^2} = \frac{cov(x,y)}{var(x)}\text{,}\]
and \(a\)
\[a = \bar y - b \bar x \text{.}\]
For a better understanding we use R to calculate the individual terms.
# calculate b
x <- data$height
y <- data$weight
x_bar <- mean(x)
y_bar <- mean(y)
lm_b <- sum((x - x_bar) * (y - y_bar)) / sum((x - x_bar)^2)
lm_b
## [1] 0.8140094
The slope of the regression model is approximately 0.81. As a sanity check we calculate the ratio of the covariance of \(x\) and \(y\), applying the cov() function, to the variance of \(x\), using the var() function, and compare it to the result above:
cov(x, y) / var(x)
## [1] 0.8140094
A perfect match!
Further, we calculate \(a\):
# calculate a
lm_a <- y_bar - lm_b * x_bar
lm_a
## [1] -67.82645
The intercept \(a\) of the regression model is approximately -67.83.
Now, we can write down the regression model:
\[\text{weight} = -67.83 + 0.81 \times \text{height.} \]
Based on that equation we may estimate the weight of a student given their height. Just for fun, let us predict the weight of students with heights of 156, 178 and 192 cm (using the unrounded coefficients):
\[\text{weight}_{156} = -67.83 + 0.81 \times 156 \approx 59 \text{ kg} \]
\[\text{weight}_{178} = -67.83 + 0.81 \times 178 \approx 77\text{ kg} \]
\[\text{weight}_{192} = -67.83 + 0.81 \times 192 \approx 88\text{ kg} \]
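We may verify these numbers in R with the coefficients lm_a and lm_b calculated above:
# predict the weight for heights of 156, 178 and 192 cm
lm_a + lm_b * c(156, 178, 192)
## [1] 59.15902 77.06723 88.46336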
The lm() function in R to calculate \(a\) and \(b\)
The R software package provides a function for linear models called lm(). The lm() function expects a formula notation as input; thus, a linear model is specified as response ~ predictor.
lin_model <- lm(weight ~ height, data = data)
lin_model
##
## Call:
## lm(formula = weight ~ height, data = data)
##
## Coefficients:
## (Intercept) height
## -67.826 0.814
The default output to the R console returns the intercept and the regression coefficient. The $ notation is useful if you want to access a particular property of the model object. For example, we may access the intercept and the regression coefficient(s) for further usage:
lin_model$coefficients[1]
## (Intercept)
## -67.82645
lin_model$coefficients[2]
## height
## 0.8140094
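For example, a quick manual prediction of the weight for a (freely chosen) height of 170 cm based on the extracted coefficients:
# manual prediction: intercept + slope * height
unname(lin_model$coefficients[1] + lin_model$coefficients[2] * 170)
## [1] 70.55515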
There are many more properties of an lm object that can be accessed by the $ notation. The str() function provides an overview of the structure of the model object:
str(lin_model)
## List of 12
## $ coefficients : Named num [1:2] -67.826 0.814
## ..- attr(*, "names")= chr [1:2] "(Intercept)" "height"
## $ residuals : Named num [1:12] -0.737 -1.355 -3.569 0.437 0.885 ...
## ..- attr(*, "names")= chr [1:12] "3371" "6010" "8040" "32" ...
## $ effects : Named num [1:12] -248.405 33.709 -3.325 0.131 1.417 ...
## ..- attr(*, "names")= chr [1:12] "(Intercept)" "height" "" "" ...
## $ rank : int 2
## $ fitted.values: Named num [1:12] 81.1 70.6 71.4 88.5 62.4 ...
## ..- attr(*, "names")= chr [1:12] "3371" "6010" "8040" "32" ...
## $ assign : int [1:2] 0 1
## $ qr :List of 5
## ..$ qr : num [1:12, 1:2] -3.464 0.289 0.289 0.289 0.289 ...
## .. ..- attr(*, "dimnames")=List of 2
## .. .. ..$ : chr [1:12] "3371" "6010" "8040" "32" ...
## .. .. ..$ : chr [1:2] "(Intercept)" "height"
## .. ..- attr(*, "assign")= int [1:2] 0 1
## ..$ qraux: num [1:2] 1.29 1.1
## ..$ pivot: int [1:2] 1 2
## ..$ tol : num 1e-07
## ..$ rank : int 2
## ..- attr(*, "class")= chr "qr"
## $ df.residual : int 10
## $ xlevels : Named list()
## $ call : language lm(formula = weight ~ height, data = data)
## $ terms :Classes 'terms', 'formula' language weight ~ height
## .. ..- attr(*, "variables")= language list(weight, height)
## .. ..- attr(*, "factors")= int [1:2, 1] 0 1
## .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. ..$ : chr [1:2] "weight" "height"
## .. .. .. ..$ : chr "height"
## .. ..- attr(*, "term.labels")= chr "height"
## .. ..- attr(*, "order")= int 1
## .. ..- attr(*, "intercept")= int 1
## .. ..- attr(*, "response")= int 1
## .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
## .. ..- attr(*, "predvars")= language list(weight, height)
## .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
## .. .. ..- attr(*, "names")= chr [1:2] "weight" "height"
## $ model :'data.frame': 12 obs. of 2 variables:
## ..$ weight: num [1:12] 80.4 69.2 67.8 88.9 63.3 83 59.5 65 59.6 80.2 ...
## ..$ height: int [1:12] 183 170 171 192 160 186 156 161 158 182 ...
## ..- attr(*, "terms")=Classes 'terms', 'formula' language weight ~ height
## .. .. ..- attr(*, "variables")= language list(weight, height)
## .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
## .. .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. .. ..$ : chr [1:2] "weight" "height"
## .. .. .. .. ..$ : chr "height"
## .. .. ..- attr(*, "term.labels")= chr "height"
## .. .. ..- attr(*, "order")= int 1
## .. .. ..- attr(*, "intercept")= int 1
## .. .. ..- attr(*, "response")= int 1
## .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
## .. .. ..- attr(*, "predvars")= language list(weight, height)
## .. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
## .. .. .. ..- attr(*, "names")= chr [1:2] "weight" "height"
## - attr(*, "class")= chr "lm"
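For example, the residual degrees of freedom listed in the structure above can be accessed directly:
lin_model$df.residual
## [1] 10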
Another way to access the properties of a model object in R is to apply so-called extractor functions. Let us reconstruct the linear model from above, assign the model output to the variable model, and apply some useful extractor functions.
model <- lm(weight ~ height, data = data)
The coef() function returns the model coefficients:
coef(model)
## (Intercept) height
## -67.8264493 0.8140094
The confint.default() function returns the confidence interval for the model coefficients at a specified confidence level:
confint.default(model, level = 0.90)
## 5 % 95 %
## (Intercept) -80.8847910 -54.7681076
## height 0.7380151 0.8900038
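The confint.default() function is based on the normal approximation, i.e. the estimate plus/minus the corresponding standard normal quantile times the standard error. As a sanity check we may reconstruct the 90 % interval for the slope by hand (the standard error is extracted from the model summary, shown in full at the end of this section):
# reconstruct the 90 % interval for the slope: estimate +/- z * standard error
b_hat <- coef(model)[2]
se_b <- summary(model)$coefficients["height", "Std. Error"]
b_hat + c(-1, 1) * qnorm(0.95) * se_b
## [1] 0.7380151 0.8900038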
Another useful extractor function is the residuals() function:
residuals(model)
## 3371 6010 8040 32 7210 7165 3329
## -0.7372759 -1.3551533 -3.5691627 0.4366393 0.8849410 -0.5793041 0.3409787
## 8109 2449 6699 3210 2074
## 1.7709315 -1.1870402 -0.1232664 0.1849410 3.9327713
We may immediately check whether the sum of residuals \(\left(\sum_{i=1}^n e_i\right)\) is close to zero:
sum(residuals(model))
## [1] 3.885781e-16
Cool, close to zero! The tiny remainder is just floating-point round-off.
In order to plot the obtained regression line we apply the abline() function, which draws a line based on the intercept (\(a\)) and slope (\(b\)).
plot(data$height, data$weight)
abline(model, col = "red")
legend("topleft",
legend = c("data points", "regression line"),
cex = 0.7,
col = c("black", "red"),
lwd = c(NA, 1),
pch = c(1, NA)
)
The fitted() function returns \(\hat y_i\), the fitted value for each particular \(x_i\):
fitted(model)
## 3371 6010 8040 32 7210 7165 3329 8109
## 81.13728 70.55515 71.36916 88.46336 62.41506 83.57930 59.15902 63.22907
## 2449 6699 3210 2074
## 60.78704 80.32327 62.41506 77.06723
These fitted values fall directly on the regression line.
plot(data$height, data$weight)
abline(model, col = "red")
points(data$height, fitted(model), col = "green", pch = 15, cex = 0.7)
legend("topleft",
legend = c("data points", "regression line", "fitted values"),
cex = 0.7,
col = c("black", "red", "green"),
lwd = c(NA, 1, NA),
pch = c(1, NA, 15)
)
Another particularly interesting extractor function is the predict() function. If no new data is specified, the predict() function returns \(\hat y_i\) for each particular \(x_i\), similar to the fitted() function. In addition, however, we may provide new data, and the predict() function will predict \(\hat y_i\) for any given \(x_i\). Be aware that the new data must be a data frame or a list.
new_data <- list(height = c(165, 172, 183))
predict(model, new_data)
## 1 2 3
## 66.48511 72.18317 81.13728
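Equivalently, the new data may be passed as a data frame:
new_data_df <- data.frame(height = c(165, 172, 183))
predict(model, new_data_df)
##        1        2        3 
## 66.48511 72.18317 81.13728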
plot(data$height, data$weight, xlim = c(160, 190))
abline(model, col = "red")
points(c(165, 172, 183), predict(model, new_data), col = "blue", pch = 18, cex = 1.5)
legend("topleft",
legend = c("data points", "regression line", "predicted values"),
cex = 0.7,
col = c("black", "red", "blue"),
lwd = c(NA, 1, NA),
pch = c(1, NA, 18)
)
In addition, R provides a simple approach to construct uncertainty bands around the fitted regression line. There are two kinds of bands, referred to as the narrow and the wide limits. The narrow bands, the so-called confidence bands, reflect the uncertainty about the line itself. These bands become narrower as the number of observations increases, reflecting a well-determined line. The wide bands, the so-called prediction bands, include the uncertainty about future observations; they capture the majority of the observed points and do not collapse to a line as the number of observations increases.
In order to calculate these uncertainty bands we apply the predict() function and add the argument interval = "confidence" or interval = "prediction". This way we get the vector of predicted values augmented with lower and upper limits.
predict(model, interval = "confidence", level = 0.99)
## fit lwr upr
## 3371 81.13728 78.69992 83.57463
## 6010 70.55515 68.79248 72.31783
## 8040 71.36916 69.61767 73.12066
## 32 88.46336 84.97802 91.94870
## 7210 62.41506 59.99462 64.83550
## 7165 83.57930 80.81819 86.34042
## 3329 59.15902 56.30249 62.01555
## 8109 63.22907 60.90734 65.55079
## 2449 60.78704 58.15581 63.41827
## 6699 80.32327 77.98544 82.66110
## 3210 62.41506 59.99462 64.83550
## 2074 77.06723 75.06892 79.06553
predict(model, interval = "prediction", level = 0.99)
## fit lwr upr
## 3371 81.13728 74.60208 87.67247
## 6010 70.55515 64.24048 76.86983
## 8040 71.36916 65.05760 77.68072
## 32 88.46336 81.46939 95.45733
## 7210 62.41506 55.88616 68.94396
## 7165 83.57930 76.91659 90.24202
## 3329 59.15902 52.45620 65.86184
## 8109 63.22907 56.73611 69.72202
## 2449 60.78704 54.17709 67.39699
## 6699 80.32327 73.82454 86.82200
## 3210 62.41506 55.88616 68.94396
## 2074 77.06723 70.68277 83.45168
# sequence of heights spanning well beyond the observed range
new_values <- seq(min(data$height) * 0.5, max(data$height) * 1.5, by = 0.1)
# 99 % confidence and prediction limits along that sequence
conf <- predict(model, newdata = list("height" = new_values), interval = "confidence", level = 0.99)
pred <- predict(model, newdata = list("height" = new_values), interval = "prediction", level = 0.99)
plot(data$height, data$weight)
abline(model, col = "red")
lines(new_values, conf[, "lwr"])
lines(new_values, conf[, "upr"])
lines(new_values, pred[, "lwr"], lty = 2)
lines(new_values, pred[, "upr"], lty = 2)
legend("topleft",
legend = c("data points", "regression line", "confidence bands", "prediction bands"),
cex = 0.7,
col = c("black", "red", "black", "black"),
lwd = c(NA, 1, 1, 1),
lty = c(NA, 1, 1, 2),
pch = c(1, NA, NA, NA)
)
Finally, the most convenient function to summarize the properties of any model object in R is the summary() function:
summary(model)
##
## Call:
## lm(formula = weight ~ height, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5692 -0.8497 0.0308 0.5487 3.9328
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -67.8264 7.9389 -8.544 6.59e-06 ***
## height 0.8140 0.0462 17.619 7.38e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.913 on 10 degrees of freedom
## Multiple R-squared: 0.9688, Adjusted R-squared: 0.9657
## F-statistic: 310.4 on 1 and 10 DF, p-value: 7.383e-09
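Just as with the model object itself, the components of the summary object may be accessed via the $ notation, e.g. the coefficient of determination and the residual standard error reported above:
summary(model)$r.squared # multiple R-squared, approx. 0.9688
summary(model)$sigma # residual standard error, approx. 1.913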
We will discuss the output of the summary function for a linear model object in more detail in the next section on model diagnostics. In addition, you may visit the blog by Felipe Rego, who provides a nicely written overview of the output of a linear model in R.