2093_logistic_regression_in_r_-_an

Logistic regression analysis belongs to the class of generalized linear models. In R generalized linear models are handled by the glm() function. The function is written as glm(response ~ predictor, family = binomial(link = "logit"), data). Since logit is the default for binomial, we do not have to type it explicitly. The glm() function returns a model object, therefore we may apply extractor functions, such as summary(), fitted() or predict, on it. However, please note that the output numbers are on the logit scale. To actually predict probabilities we need to provide the predict() function an additional argument type = "response".

Introduction and exploratory data analysis

This example is inspired by the work of James B. Elsner and his colleagues (Elsner et al. 1996 and Kimberlain and Elsner 1998), who worked on a genetic classification of North Atlantic hurricanes based on formation and development mechanisms. The classification yields three different groups: tropical hurricanes, hurricanes under baroclinic influences and hurricanes of baroclinic initiation. The term “baroclinic” relates to the fact, that these hurricanes are influenced by outer tropics disturbances or even originate in the outer tropics. The stronger tropical hurricanes develop farther south and primarily occur in August and September. The weaker outer-tropical hurricanes occur throughout a longer season. The original data set on Genetic Hurricane Classification can be retrieved here. The analysis of by James B. Elsner can be reviewed here.

The goal of the following exercise is to build a model that predicts the group membership of a hurricane, either tropical or non-tropical, based on the latitude of formation.

We start the analysis by loading the data set. By installing the openxlsx package (type install.packages("openxlsx")) we can access the Excel file directly by an URL:

library(openxlsx)
hurricanes <- read.xlsx("https://userpage.fu-berlin.de/soga/data/raw-data/hurricanes.xlsx")

First, we inspect the structure of the data set by applying the str function.

str(hurricanes)

## 'data.frame':    337 obs. of  12 variables:
##  $ RowNames: chr  "1" "2" "3" "4" ...
##  $ Number  : num  430 432 433 436 437 438 440 441 445 449 ...
##  $ Name    : chr  "NOTNAMED" "NOTNAMED" "NOTNAMED" "NOTNAMED" ...
##  $ Year    : num  1944 1944 1944 1944 1944 ...
##  $ Type    : num  1 0 0 0 0 1 0 1 0 0 ...
##  $ FirstLat: num  30.2 25.6 14.2 20.8 20 29.2 16.1 27.6 21.6 19 ...
##  $ FirstLon: num  -76.1 -74.9 -65.2 -58 -84.2 -55.8 -80.8 -85.6 -95.2 -56.6 ...
##  $ MaxLat  : num  32.1 31 16.6 26.3 20.6 38 21.9 27.6 28.6 24.9 ...
##  $ MaxLon  : num  -74.8 -78.1 -72.2 -72.3 -84.9 -53.2 -82.9 -85.6 -96.1 -79.6 ...
##  $ LastLat : num  35.1 32.6 20.6 42.1 19.1 50 28.4 31.7 29.5 28.9 ...
##  $ LastLon : num  -69.2 -78.2 -88.5 -71.5 -93.9 -46.5 -82.1 -79.1 -96 -81.8 ...
##  $ MaxInt  : num  80 80 105 120 70 85 105 100 120 120 ...

There are 337 observations and 12 variables in the data set. We are primarily interested in the variable Type, which is our response variable, and the variable FirstLat, which corresponds to the latitude of formation, and thus is our predictor variable. However, in order to get a sense for the data set we plot the number of hurricanes for each year as a bar plot using the ggplot2 package:

library(ggplot2)
ggplot(hurricanes, aes(x = Year)) +
  geom_bar(stat = "count")

That is a nice plot. By adding just one argument, fill = factor(Type), we may additionally color the bars according to the Type variable. To enhance readability we add the scale_fill_discrete argument to the function call and rename the legend accordingly.

ggplot(hurricanes, aes(x = Year, fill = factor(Type))) +
  geom_bar(stat = "count") +
  scale_fill_discrete(
    name = "Type of Hurricane",
    labels = c("tropical-only", "baroclinic influences", "baroclinic initiation")
  )

For a numerical representation of the hurricane classes we use the table() function:

hurricanes_table <- table(hurricanes$Type)
hurricanes_table

## 
##   0   1   3 
## 187  77  73

In class $0$ , tropical hurricanes, there are 187 observations, in class $1$ , baroclinic influences, there are 77 observations and in class $3$ , baroclinic initiation, there are 73 observations. Since we can only deal with dichotomous data in logistic regression, we re-code the classes and assign classes $1$ and $3$ , both being influenced by the outer tropics, the label $1$ :

hurricanes$Type_new <- ifelse(test = hurricanes$Type == 0, yes = 0, no = 1)

Now we have a binary response variable of two classes: Class $0$ corresponds to tropical hurricanes and class $1$ to non-tropical hurricanes. Let us check the class distribution of our new response variable Type_new:

table(hurricanes$Type_new)

## 
##   0   1 
## 187 150

To wrap up this part on exploratory data analysis we plot the data on an interactive map, using the leaflet package:

options(viewer = NULL)
library(leaflet)
m <- leaflet()
m <- addTiles(m)
m <- addProviderTiles(m, "Esri.OceanBasemap")

cols <- c("red", "navy")
m <- addCircleMarkers(m,
  lng = hurricanes$FirstLon,
  lat = hurricanes$FirstLat,
  radius = 2.5,
  color = cols[hurricanes$Type_new + 1],
  popup = paste("Year:", as.character(hurricanes$Year))
)
m <- addLegend(m,
  "topright",
  colors = cols,
  labels = c("tropical", "non-tropical"),
  title = "Type of Hurricane",
  opacity = 1
)
m

Logistic regression: Model fit

Recall the goal of the exercise: We want to build a model that predicts the group membership of a hurricane, either tropical $(0)$ or non-tropical $(1)$ , based on the latitude of formation of the hurricane. The response variable is the binary variable Type_new and the predictor variable is FirstLat. We build a logit model by applying the glm() function. For the logistic regression model we specify family = 'binomial'.

log_model <- glm(Type_new ~ FirstLat, data = hurricanes, family = "binomial")

We use the extractor function summary() to review the model properties:

summary(log_model)

## 
## Call:
## glm(formula = Type_new ~ FirstLat, family = "binomial", data = hurricanes)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -9.08263    0.96148  -9.446   <2e-16 ***
## FirstLat     0.37283    0.03947   9.447   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 463.11  on 336  degrees of freedom
## Residual deviance: 232.03  on 335  degrees of freedom
## AIC: 236.03
## 
## Number of Fisher Scoring iterations: 6

Let us take a closer look at the coefficients:

summary(log_model)$coefficients

##               Estimate Std. Error   z value     Pr(>|z|)
## (Intercept) -9.0826335 0.96148212 -9.446492 3.503753e-21
## FirstLat     0.3728295 0.03946632  9.446777 3.494240e-21

Be aware, that the output of the logistic model is on link-scale (logit). Thus, the numerical output of the model corresponds to the log-odds. For convenience let us write down the logistic regression model with the calculated intercept and coefficient:

$\phi(\eta) = \frac{1}{1+e^{-\eta}}= \frac{1}{1+e^{-\beta_0+\beta_1x_1}}= \frac{1}{1+e^{-(-9.0826335+ 0.3728295x)}} \text{.}$

Let us first focus on the coefficient of the intercept. The coefficient of the intercept corresponds to the log-odds of observing a hurricane at a latitude of zero, which is known as the equator. To calculate the odds of observing a non-tropical hurricane when the latitude is zero, we exponentiate the log-odds $e^{-9.0826335} = 1.1362198\times 10^{-4}$ . That is a very low number! And it makes perfect sense, since it is very unlikely to observe a non-tropical hurricane at the equator.

The coefficient of the formation latitude variable (FirstLat) has a numerical value of 0.3728295. The positive sign of this value indicates, that the chance of observing a non-tropical hurricane increases with latitude. The magnitude of the coefficient implies, that for every degree increase in formation latitude the log-odd increases by a constant of 0.3728295 units, on average. By taking the exponent of the coefficient value, we get the odds ratio:

exp(coefficients(log_model)[2])

## FirstLat 
## 1.451837

Thus, for every degree increase in formation latitude the odds ratio increases on average by a constant factor of 1.4518368 (or 45 %). The interpretation is valid only over the range of latitudes in the data and physically meaningless for latitudes outside the range where hurricanes occur.

The coefficient table from above includes the standard error and the p-value. The smaller the p-value, the less support there is for the null hypothesis given the data and the model. Recall, the null hypothesis states that the coefficient is $0$ , in other words, that there is no relationship between the occurrence of a non-tropical hurricane and the formation latitude. In our example the very small p-value does not provide support for the null hypothesis and allows us to accept the model.

We have learned that any point estimate should be accompanied by a level of confidence for that point estimation. Therefore, we construct a 95 % confidence interval for the estimated model coefficient. We exponentiate the result to return the odds-ratio, which is a quantity that is easier to interpret than the log-odds.

confint.default(log_model)[2, ]

##     2.5 %    97.5 % 
## 0.2954770 0.4501821

exp(confint.default(log_model)[2, ])

##    2.5 %   97.5 % 
## 1.343767 1.568598

Thus, we can state that we are 95 % confident, that for every degree increase in formation latitude the observation of a non-tropical hurricane becomes on average more likely by 34 to 57 %.

Logistic regression: Model prediction

In the preceding section we built a logistic regression model for the relationship between the formation latitude and the type of hurricane (tropical/non-tropical). In this section we will now apply that model to make predictions about the chance of observing a non-tropical hurricane given the latitude of formation.

In order to predict the probability that a randomly picked hurricane, which forms at a given latitude, is of type non-tropical we apply the predict() function. Please note, that if we add the argument type = "response" to the function call, the predict() function returns the probability and not the log-odds. Let us evaluate the probability of observing a non-tropical hurricane, that formed at latitudes of $10^\circ, 23.5^\circ \text{ and }30^\circ$ :

predict(log_model, newdata = list(FirstLat = c(10, 23.5, 30)), type = "response")

##           1           2           3 
## 0.004705352 0.420398055 0.891121908

The results make perfect sense. With increasing latitude the chance of observing the formation of a non-tropical hurricane increases significantly. At a latitude of $10^\circ$ the probability of observing a non-tropical hurricane is $0.005$ , whereas at the latitude of $30^\circ$ that probability is $0.891$ .

For a better understanding let us calculate the probability of observing a non-tropical hurricane, that formed at latitudes of $23.5^\circ$ , by hand. Recall the equation for the regression model from above:

$\begin{align} \ \phi(\eta) & = \frac{1}{1+e^{-\eta}} = \frac{1}{1+e^{-\beta_0+\beta_1x_1}}\\ & = \frac{1}{1+e^{-(-9.0826335+ 0.3728295 \times 23.5)}} \\ & = \frac{1}{1+e^{-(-0.3211396)}} \\ & = \frac{1}{2.378698} = 0.4203981 \end{align}$

To visualize our results we create a plot of predictions across a range of latitudes. Therefore, we first prepare a vector of latitudes, lats. We specify an increment of 0.1 degrees, so that the resulting prediction curve is smooth. We apply the predict() function and set se.fit = TRUE, in order to return the standard error of the prediction. Knowing the standard error allows us to calculate the margin of error and thus the confidence interval. Finally, we plot the data points (observations) at $0$ and $1$ and add the predicted values using the lines() function.

lats <- seq(min(hurricanes$FirstLat), max(hurricanes$FirstLat), 0.1)

probs <- predict(log_model,
  newdata = data.frame(FirstLat = lats),
  type = "response",
  se.fit = TRUE
)

pm <- probs$fit
pu <- probs$fit + probs$se.fit * 1.96 # 95% confidence interval
pl <- probs$fit - probs$se.fit * 1.96 # 95% confidence interval

plot(hurricanes$FirstLat,
  hurricanes$Type_new,
  pch = 16,
  cex = 1,
  ylab = "Probability",
  xlab = "Formation Latitude (N)"
)

grid()

polygon(c(rev(lats), lats), c(rev(pl), pu),
  col = "grey90", border = NA
)

lines(lats, pm, lwd = 2)
lines(lats, pu, lwd = 2, col = "red")
lines(lats, pl, lwd = 2, col = "red")

abline(h = 0.1, lty = 2)
abline(h = 0.5, lty = 2)
abline(h = 0.9, lty = 2)

The observations of tropical and non-tropical hurricanes are shown along the $y=0$ and $y=1$ lines, respectively. The gray band is the 95 % point-wise confidence interval.

The model predictions show that the probability of a non-tropical hurricane is less than 10 % at latitudes south of approx. $16^\circ N$ . However, by latitude ca. $24^\circ N$ , the probability exceeds 50 % and by latitude $30^\circ N$ the probability roughly exceeds 90 %. The odds ratio is constant, but the probability is a nonlinear function of latitude.

Citation

The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us via mail by soga[at]zedat.fu-berlin.de.

You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follow: Hartmann, K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.