Logistic regression analysis belongs to the class of generalized linear models. In R
generalized linear models are handled by the glm()
function. The function is written as
glm(response ~ predictor, family = binomial(link = "logit"), data)
.
Since logit
is the default for binomial, we do not have to
type it explicitly. The glm()
function returns a model
object, therefore we may apply extractor functions,
such as summary()
, fitted()
or
predict
, on it. However, please note that the output
numbers are on the logit scale. To actually predict probabilities we
need to provide the predict()
function an additional
argument type = "response"
.
This example is inspired by the work of James B. Elsner and his colleagues (Elsner et al. 1996 and Kimberlain and Elsner 1998), who worked on a genetic classification of North Atlantic hurricanes based on formation and development mechanisms. The classification yields three different groups: tropical hurricanes, hurricanes under baroclinic influences and hurricanes of baroclinic initiation. The term “baroclinic” relates to the fact, that these hurricanes are influenced by outer tropics disturbances or even originate in the outer tropics. The stronger tropical hurricanes develop farther south and primarily occur in August and September. The weaker outer-tropical hurricanes occur throughout a longer season. The original data set on Genetic Hurricane Classification can be retrieved here. The analysis of by James B. Elsner can be reviewed here.
The goal of the following exercise is to build a model that predicts the group membership of a hurricane, either tropical or non-tropical, based on the latitude of formation.
We start the analysis by loading the data
set. By installing the openxlsx
package (type
install.packages("openxlsx")
) we can access the Excel file
directly by an URL:
library(openxlsx)
hurricanes <- read.xlsx("https://userpage.fu-berlin.de/soga/data/raw-data/hurricanes.xlsx")
First, we inspect the structure of the data set by applying the
str
function.
str(hurricanes)
## 'data.frame': 337 obs. of 12 variables:
## $ RowNames: chr "1" "2" "3" "4" ...
## $ Number : num 430 432 433 436 437 438 440 441 445 449 ...
## $ Name : chr "NOTNAMED" "NOTNAMED" "NOTNAMED" "NOTNAMED" ...
## $ Year : num 1944 1944 1944 1944 1944 ...
## $ Type : num 1 0 0 0 0 1 0 1 0 0 ...
## $ FirstLat: num 30.2 25.6 14.2 20.8 20 29.2 16.1 27.6 21.6 19 ...
## $ FirstLon: num -76.1 -74.9 -65.2 -58 -84.2 -55.8 -80.8 -85.6 -95.2 -56.6 ...
## $ MaxLat : num 32.1 31 16.6 26.3 20.6 38 21.9 27.6 28.6 24.9 ...
## $ MaxLon : num -74.8 -78.1 -72.2 -72.3 -84.9 -53.2 -82.9 -85.6 -96.1 -79.6 ...
## $ LastLat : num 35.1 32.6 20.6 42.1 19.1 50 28.4 31.7 29.5 28.9 ...
## $ LastLon : num -69.2 -78.2 -88.5 -71.5 -93.9 -46.5 -82.1 -79.1 -96 -81.8 ...
## $ MaxInt : num 80 80 105 120 70 85 105 100 120 120 ...
There are 337 observations and 12 variables in the data set. We are
primarily interested in the variable Type
, which is our
response variable, and the variable FirstLat
, which
corresponds to the latitude of formation, and thus is our predictor
variable. However, in order to get a sense for the data set we plot the
number of hurricanes for each year as a bar plot using the
ggplot2
package:
library(ggplot2)
ggplot(hurricanes, aes(x = Year)) +
geom_bar(stat = "count")
That is a nice plot. By adding just one argument,
fill = factor(Type)
, we may additionally color the bars
according to the Type
variable. To enhance readability we
add the scale_fill_discrete
argument to the function call
and rename the legend accordingly.
ggplot(hurricanes, aes(x = Year, fill = factor(Type))) +
geom_bar(stat = "count") +
scale_fill_discrete(
name = "Type of Hurricane",
labels = c("tropical-only", "baroclinic influences", "baroclinic initiation")
)
For a numerical representation of the hurricane classes we use the
table()
function:
hurricanes_table <- table(hurricanes$Type)
hurricanes_table
##
## 0 1 3
## 187 77 73
In class \(0\), tropical hurricanes, there are 187 observations, in class \(1\), baroclinic influences, there are 77 observations and in class \(3\), baroclinic initiation, there are 73 observations. Since we can only deal with dichotomous data in logistic regression, we re-code the classes and assign classes \(1\) and \(3\), both being influenced by the outer tropics, the label \(1\):
hurricanes$Type_new <- ifelse(test = hurricanes$Type == 0, yes = 0, no = 1)
Now we have a binary response variable of two classes: Class \(0\) corresponds to tropical
hurricanes and class \(1\) to
non-tropical hurricanes. Let us check the class distribution of
our new response variable Type_new
:
table(hurricanes$Type_new)
##
## 0 1
## 187 150
To wrap up this part on exploratory data analysis we plot the data on
an interactive map, using the leaflet
package:
options(viewer = NULL)
library(leaflet)
m <- leaflet()
m <- addTiles(m)
m <- addProviderTiles(m, "Esri.OceanBasemap")
cols <- c("red", "navy")
m <- addCircleMarkers(m,
lng = hurricanes$FirstLon,
lat = hurricanes$FirstLat,
radius = 2.5,
color = cols[hurricanes$Type_new + 1],
popup = paste("Year:", as.character(hurricanes$Year))
)
m <- addLegend(m,
"topright",
colors = cols,
labels = c("tropical", "non-tropical"),
title = "Type of Hurricane",
opacity = 1
)
m
Recall the goal of the exercise: We want to build a model that
predicts the group membership of a hurricane, either tropical \((0)\) or non-tropical \((1)\), based on the latitude of formation
of the hurricane. The response variable is the binary variable
Type_new
and the predictor variable is
FirstLat
. We build a logit model by applying the
glm()
function. For the logistic regression model we
specify family = 'binomial'
.
log_model <- glm(Type_new ~ FirstLat, data = hurricanes, family = "binomial")
We use the extractor function summary()
to review the
model properties:
summary(log_model)
##
## Call:
## glm(formula = Type_new ~ FirstLat, family = "binomial", data = hurricanes)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.08263 0.96148 -9.446 <2e-16 ***
## FirstLat 0.37283 0.03947 9.447 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 463.11 on 336 degrees of freedom
## Residual deviance: 232.03 on 335 degrees of freedom
## AIC: 236.03
##
## Number of Fisher Scoring iterations: 6
Let us take a closer look at the coefficients:
summary(log_model)$coefficients
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.0826335 0.96148212 -9.446492 3.503753e-21
## FirstLat 0.3728295 0.03946632 9.446777 3.494240e-21
Be aware, that the output of the logistic model is on link-scale (logit). Thus, the numerical output of the model corresponds to the log-odds. For convenience let us write down the logistic regression model with the calculated intercept and coefficient:
\[\phi(\eta) = \frac{1}{1+e^{-\eta}}= \frac{1}{1+e^{-\beta_0+\beta_1x_1}}= \frac{1}{1+e^{-(-9.0826335+ 0.3728295x)}} \text{.}\]
Let us first focus on the coefficient of the intercept. The coefficient of the intercept corresponds to the log-odds of observing a hurricane at a latitude of zero, which is known as the equator. To calculate the odds of observing a non-tropical hurricane when the latitude is zero, we exponentiate the log-odds \(e^{-9.0826335} = 1.1362198\times 10^{-4}\). That is a very low number! And it makes perfect sense, since it is very unlikely to observe a non-tropical hurricane at the equator.
The coefficient of the formation latitude variable
(FirstLat
) has a numerical value of 0.3728295. The positive
sign of this value indicates, that the chance of observing a
non-tropical hurricane increases with latitude. The magnitude of the
coefficient implies, that for every degree increase in formation
latitude the log-odd increases by a constant of 0.3728295 units, on
average. By taking the exponent of the coefficient value, we get the
odds ratio:
exp(coefficients(log_model)[2])
## FirstLat
## 1.451837
Thus, for every degree increase in formation latitude the odds ratio increases on average by a constant factor of 1.4518368 (or 45 %). The interpretation is valid only over the range of latitudes in the data and physically meaningless for latitudes outside the range where hurricanes occur.
The coefficient table from above includes the standard error and the p-value. The smaller the p-value, the less support there is for the null hypothesis given the data and the model. Recall, the null hypothesis states that the coefficient is \(0\), in other words, that there is no relationship between the occurrence of a non-tropical hurricane and the formation latitude. In our example the very small p-value does not provide support for the null hypothesis and allows us to accept the model.
We have learned that any point estimate should be accompanied by a level of confidence for that point estimation. Therefore, we construct a 95 % confidence interval for the estimated model coefficient. We exponentiate the result to return the odds-ratio, which is a quantity that is easier to interpret than the log-odds.
confint.default(log_model)[2, ]
## 2.5 % 97.5 %
## 0.2954770 0.4501821
exp(confint.default(log_model)[2, ])
## 2.5 % 97.5 %
## 1.343767 1.568598
Thus, we can state that we are 95 % confident, that for every degree increase in formation latitude the observation of a non-tropical hurricane becomes on average more likely by 34 to 57 %.
In the preceding section we built a logistic regression model for the relationship between the formation latitude and the type of hurricane (tropical/non-tropical). In this section we will now apply that model to make predictions about the chance of observing a non-tropical hurricane given the latitude of formation.
In order to predict the probability that a randomly picked hurricane,
which forms at a given latitude, is of type non-tropical we
apply the predict()
function. Please note, that if we add
the argument type = "response"
to the function call, the
predict()
function returns the probability
and not the log-odds. Let us evaluate the probability of observing a
non-tropical hurricane, that formed at latitudes of \(10^\circ, 23.5^\circ \text{ and
}30^\circ\):
predict(log_model, newdata = list(FirstLat = c(10, 23.5, 30)), type = "response")
## 1 2 3
## 0.004705352 0.420398055 0.891121908
The results make perfect sense. With increasing latitude the chance of observing the formation of a non-tropical hurricane increases significantly. At a latitude of \(10^\circ\) the probability of observing a non-tropical hurricane is \(0.005\), whereas at the latitude of \(30^\circ\) that probability is \(0.891\).
For a better understanding let us calculate the probability of observing a non-tropical hurricane, that formed at latitudes of \(23.5^\circ\), by hand. Recall the equation for the regression model from above:
\[ \begin{align} \ \phi(\eta) & = \frac{1}{1+e^{-\eta}} = \frac{1}{1+e^{-\beta_0+\beta_1x_1}}\\ & = \frac{1}{1+e^{-(-9.0826335+ 0.3728295 \times 23.5)}} \\ & = \frac{1}{1+e^{-(-0.3211396)}} \\ & = \frac{1}{2.378698} = 0.4203981 \end{align} \]
To visualize our results we create a plot of predictions across a
range of latitudes. Therefore, we first prepare a vector of latitudes,
lats
. We specify an increment of 0.1 degrees, so that the
resulting prediction curve is smooth. We apply the
predict()
function and set se.fit = TRUE
, in
order to return the standard error of the prediction.
Knowing the standard error allows us to calculate the margin of
error and thus the confidence interval.
Finally, we plot the data points (observations) at \(0\) and \(1\) and add the predicted values using the
lines()
function.
lats <- seq(min(hurricanes$FirstLat), max(hurricanes$FirstLat), 0.1)
probs <- predict(log_model,
newdata = data.frame(FirstLat = lats),
type = "response",
se.fit = TRUE
)
pm <- probs$fit
pu <- probs$fit + probs$se.fit * 1.96 # 95% confidence interval
pl <- probs$fit - probs$se.fit * 1.96 # 95% confidence interval
plot(hurricanes$FirstLat,
hurricanes$Type_new,
pch = 16,
cex = 1,
ylab = "Probability",
xlab = "Formation Latitude (N)"
)
grid()
polygon(c(rev(lats), lats), c(rev(pl), pu),
col = "grey90", border = NA
)
lines(lats, pm, lwd = 2)
lines(lats, pu, lwd = 2, col = "red")
lines(lats, pl, lwd = 2, col = "red")
abline(h = 0.1, lty = 2)
abline(h = 0.5, lty = 2)
abline(h = 0.9, lty = 2)
The observations of tropical and non-tropical hurricanes are shown along the \(y=0\) and \(y=1\) lines, respectively. The gray band is the 95 % point-wise confidence interval.
The model predictions show that the probability of a non-tropical hurricane is less than 10 % at latitudes south of approx. \(16^\circ N\). However, by latitude ca.\(24^\circ N\), the probability exceeds 50 % and by latitude \(30^\circ N\) the probability roughly exceeds 90 %. The odds ratio is constant, but the probability is a nonlinear function of latitude.
Citation
The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us via mail by soga[at]zedat.fu-berlin.de.
Please cite as follow: Hartmann, K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.