Logistic regression belongs to the class of generalized linear models. In R, generalized linear models are fitted with the glm() function. The call has the form glm(response ~ predictor, family = binomial(link = "logit"), data = data). Since logit is the default link for the binomial family, we do not have to specify it explicitly. The glm() function returns a model object, so we may apply extractor functions such as summary(), fitted() or predict() to it. Please note, however, that the coefficients reported by summary() and the default output of predict() are on the logit scale. To predict probabilities, we need to pass the additional argument type = "response" to the predict() function.
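The difference between the two prediction scales can be illustrated with a minimal sketch. The data below are simulated for illustration only and are not the hurricane data; the names sim, x, y and new_data, as well as the coefficients used to generate the data, are assumptions made for this example.

```r
# Simulated data for illustration only (not the hurricane data)
set.seed(42)
n <- 200
x <- runif(n, min = 10, max = 40)        # hypothetical continuous predictor
p_true <- plogis(-8 + 0.35 * x)          # assumed true success probabilities
y <- rbinom(n, size = 1, prob = p_true)  # binary response (0/1)
sim <- data.frame(x = x, y = y)

# Logit is the default link for the binomial family
fit <- glm(y ~ x, family = binomial, data = sim)

new_data <- data.frame(x = c(15, 25, 35))

# Default: predictions on the logit (link) scale
eta <- predict(fit, newdata = new_data)

# type = "response": predictions on the probability scale
prob <- predict(fit, newdata = new_data, type = "response")

# Applying the inverse logit to the link-scale values gives the probabilities
all.equal(plogis(eta), prob)
```

In this sketch plogis() is the inverse logit from base R, so the final comparison checks that both calls to predict() describe the same fitted model on different scales.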


Introduction and exploratory data analysis

This example is inspired by the work of James B. Elsner and his colleagues (Elsner et al. 1996 and Kimberlain and Elsner 1998), who worked on a genetic classification of North Atlantic hurricanes based on formation and development mechanisms. The classification yields three different groups: tropical hurricanes, hurricanes under baroclinic influences and hurricanes of baroclinic initiation. The term “baroclinic” relates to the fact that these hurricanes are influenced by disturbances from outside the tropics or even originate outside the tropics. The stronger tropical hurricanes develop farther south and primarily occur in August and September. The weaker outer-tropical hurricanes occur throughout a longer season. The original data set on Genetic Hurricane Classification can be retrieved here. The analysis by James B. Elsner can be reviewed here.

The goal of the following exercise is to build a model that predicts the group membership of a hurricane, either tropical or non-tropical, based on the latitude of formation.

We start the analysis by loading the data set. After installing the openxlsx package (type install.packages("openxlsx")), we can read the Excel file directly from a URL:

library(openxlsx)
hurricanes <- read.xlsx("https://userpage.fu-berlin.de/soga/data/raw-data/hurricanes.xlsx")

First, we inspect the structure of the data set by applying the str() function.

str(hurricanes)
## 'data.frame':    337 obs. of  12 variables:
##  $ RowNames: chr  "1" "2" "3" "4" ...
##  $ Number  : num  430 432 433 436 437 438 440 441 445 449 ...
##  $ Name    : chr  "NOTNAMED" "NOTNAMED" "NOTNAMED" "NOTNAMED" ...
##  $ Year    : num  1944 1944 1944 1944 1944 ...
##  $ Type    : num  1 0 0 0 0 1 0 1 0 0 ...
##  $ FirstLat: num  30.2 25.6 14.2 20.8 20 29.2 16.1 27.6 21.6 19 ...
##  $ FirstLon: num  -76.1 -74.9 -65.2 -58 -84.2 -55.8 -80.8 -85.6 -95.2 -56.6 ...
##  $ MaxLat  : num  32.1 31 16.6 26.3 20.6 38 21.9 27.6 28.6 24.9 ...
##  $ MaxLon  : num  -74.8 -78.1 -72.2 -72.3 -84.9 -53.2 -82.9 -85.6 -96.1 -79.6 ...
##  $ LastLat : num  35.1 32.6 20.6 42.1 19.1 50 28.4 31.7 29.5 28.9 ...
##  $ LastLon : num  -69.2 -78.2 -88.5 -71.5 -93.9 -46.5 -82.1 -79.1 -96 -81.8 ...
##  $ MaxInt  : num  80 80 105 120 70 85 105 100 120 120 ...

There are 337 observations and 12 variables in the data set. We are primarily interested in the variable Type, which is our response variable, and the variable FirstLat, which corresponds to the latitude of formation and is thus our predictor variable. However, in order to get a sense of the data set, we plot the number of hurricanes for each year as a bar plot using the ggplot2 package:

library(ggplot2)
ggplot(hurricanes, aes(x = Year)) +
  geom_bar(stat = "count")
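
Since Type and FirstLat are the variables that enter the model, it may also help to tabulate them directly. The following sketch uses only the first ten values of Type and FirstLat as printed in the str() output above, so that it runs without downloading the file; the names first_ten, type_counts and mean_lat are chosen for this example, and the numbers it produces describe those ten rows only, not the full data set.

```r
# First ten rows of Type and FirstLat, copied from the str() output
first_ten <- data.frame(
  Type     = c(1, 0, 0, 0, 0, 1, 0, 1, 0, 0),
  FirstLat = c(30.2, 25.6, 14.2, 20.8, 20, 29.2, 16.1, 27.6, 21.6, 19)
)

# Number of observations in each class of the response
type_counts <- table(first_ten$Type)

# Mean latitude of formation within each class
mean_lat <- tapply(first_ten$FirstLat, first_ten$Type, mean)

type_counts
mean_lat
```

Assuming the data have been loaded as above, the same two calls can be applied to the full data set by replacing first_ten with hurricanes.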