Logistic regression belongs to the class of generalized linear models. In R, generalized linear models are fitted with the glm() function. The call has the form glm(response ~ predictor, family = binomial(link = "logit"), data). Note that logit is the default link for the binomial family; thus, we do not have to specify it explicitly. The glm() function returns a model object, so we may apply extractor functions such as summary(), fitted() or predict(), among others, to it. However, note that the coefficients reported by summary() and the values returned by predict() are, by default, on the logit scale. To predict probabilities we need to pass the additional argument type = "response" to the predict() function.
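The difference between the two prediction scales can be illustrated with a minimal sketch. The data frame toy, its columns x and y, and the coefficients used to simulate the data are invented for illustration only and are not part of the hurricane data set.

```r
# Simulated toy data (hypothetical): numeric predictor x, binary response y
set.seed(1)
toy <- data.frame(x = runif(200, min = 10, max = 40))
toy$y <- rbinom(200, size = 1, prob = plogis(-6 + 0.25 * toy$x))

# Fit a logistic regression; the logit link is the default for binomial
fit <- glm(y ~ x, family = binomial, data = toy)

# New predictor values for which we want predictions
new.data <- data.frame(x = c(15, 25, 35))

# Default: predictions on the logit (link) scale
eta <- predict(fit, newdata = new.data)

# type = "response": predictions as probabilities between 0 and 1
p <- predict(fit, newdata = new.data, type = "response")

# The two scales are connected by the inverse logit, plogis()
all.equal(unname(plogis(eta)), unname(p))
```

Applying plogis() to the logit-scale predictions should reproduce the probabilities returned with type = "response", which is why either form can be used as long as the scale is kept in mind.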


Introduction and exploratory data analysis

This example is inspired by the work of James B. Elsner and his colleagues (Elsner et al. 1996 and Kimberlain and Elsner 1998), who worked on a genetic classification of North Atlantic hurricanes based on formation and development mechanisms. The classification yields three different groups: tropical hurricanes, hurricanes under baroclinic influences and hurricanes of baroclinic initiation. The term “baroclinic” relates to the fact that these hurricanes are influenced by outer-tropics disturbances or even originate in the outer tropics. The stronger tropical hurricanes develop farther south and primarily occur in August and September. The weaker outer-tropical hurricanes occur throughout a longer season. The original data set on Genetic Hurricane Classification can be retrieved here. The analysis by James B. Elsner can be reviewed here.

The goal of the exercise is to build a model that predicts the group membership of a hurricane, either tropical or non-tropical, based on the latitude of formation.

We start the analysis by loading the data set. Download the data to your computer and load the file. Note that the file is an Excel spreadsheet; to read it, we install and load the readxl package beforehand. The readxl package is fairly new and under active development. At the time of writing it is not yet possible to read an Excel file directly from a URL; therefore we first download the file, then read it into memory, and finally delete the file again. Check out the GitHub repository of the readxl package for future improvements.

# set up filename (full path inside the current working directory)
my.filename <- file.path(getwd(), 'my-temporary-downloadfile.xlsx')

# download file (mode = "wb" keeps the binary file intact on Windows)
download.file(url = 'https://userpage.fu-berlin.de/soga/200/2010_data_sets/hurricanes.xlsx', 
              destfile = my.filename, 
              mode = "wb")

# read file into memory
library(readxl)
hurricanes <- read_excel(my.filename)

# delete file
file.remove(my.filename)
## [1] TRUE

First, we inspect the structure of the data set by applying the str() function.

str(hurricanes)
## Classes 'tbl_df', 'tbl' and 'data.frame':    337 obs. of  12 variables:
##  $ RowNames: chr  "1" "2" "3" "4" ...
##  $ Number  : num  430 432 433 436 437 438 440 441 445 449 ...
##  $ Name    : chr  "NOTNAMED" "NOTNAMED" "NOTNAMED" "NOTNAMED" ...
##  $ Year    : num  1944 1944 1944 1944 1944 ...
##  $ Type    : num  1 0 0 0 0 1 0 1 0 0 ...
##  $ FirstLat: num  30.2 25.6 14.2 20.8 20 29.2 16.1 27.6 21.6 19 ...
##  $ FirstLon: num  -76.1 -74.9 -65.2 -58 -84.2 -55.8 -80.8 -85.6 -95.2 -56.6 ...
##  $ MaxLat  : num  32.1 31 16.6 26.3 20.6 38 21.9 27.6 28.6 24.9 ...
##  $ MaxLon  : num  -74.8 -78.1 -72.2 -72.3 -84.9 -53.2 -82.9 -85.6 -96.1 -79.6 ...
##  $ LastLat : num  35.1 32.6 20.6 42.1 19.1 50 28.4 31.7 29.5 28.9 ...
##  $ LastLon : num  -69.2 -78.2 -88.5 -71.5 -93.9 -46.5 -82.1 -79.1 -96 -81.8 ...
##  $ MaxInt  : num  80 80 105 120 70 85 105 100 120 120 ...

There are 337 observations and 12 variables in the data set. We are primarily interested in the variable Type, which is our response variable, and the variable FirstLat, which corresponds to the latitude of formation and is thus our predictor variable. However, to get a sense of the data set we first plot the number of hurricanes for each year as a bar plot using the ggplot2 package.

library(ggplot2)
ggplot(hurricanes, aes(x = Year)) + geom_bar(stat = "count")