Logistic regression belongs to the class of generalized linear models. In R, generalized linear models are fitted with the glm() function. The call is written as
glm(response ~ predictor, family = binomial(link = "logit"), data)
Since logit is the default link for the binomial family, we do not have to specify it explicitly. The glm() function returns a model object, so we may apply extractor functions such as summary(), fitted() or predict() to it. Note, however, that the coefficients reported by summary() and the values returned by predict() are on the logit scale by default. To predict probabilities, we need to pass the additional argument type = "response" to the predict() function.
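The difference between the two prediction scales can be illustrated with a minimal sketch. The data below are simulated stand-ins, not the hurricane data; the names toy, x, y and new_obs, as well as the coefficients used for the simulation, are assumptions made purely for illustration.

```r
# Simulated stand-in data (hypothetical, NOT the hurricane data set)
set.seed(1)
n <- 200
x <- runif(n, min = 10, max = 40)     # hypothetical numeric predictor
p <- plogis(-6 + 0.25 * x)            # probabilities used only to simulate y
y <- rbinom(n, size = 1, prob = p)    # binary response (0/1)
toy <- data.frame(x = x, y = y)

# Fit the logistic regression; logit is the default link for binomial
fit <- glm(y ~ x, family = binomial, data = toy)

# Hypothetical new observations to predict for
new_obs <- data.frame(x = c(15, 25, 35))

# Default: predictions on the logit (log-odds) scale
eta <- predict(fit, newdata = new_obs)

# type = "response": predicted probabilities between 0 and 1
prob <- predict(fit, newdata = new_obs, type = "response")

# The two scales are linked by the inverse logit, plogis()
all.equal(plogis(eta), prob)
```

Applying plogis() to the logit-scale predictions should reproduce the probabilities returned with type = "response", which is a quick way to check which scale a given output is on.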
This example is inspired by the work of James B. Elsner and his colleagues (Elsner et al. 1996 and Kimberlain and Elsner 1998), who worked on a genetic classification of North Atlantic hurricanes based on formation and development mechanisms. The classification yields three groups: tropical hurricanes, hurricanes under baroclinic influences and hurricanes of baroclinic initiation. The term “baroclinic” refers to the fact that these hurricanes are influenced by disturbances from outside the tropics, or even originate outside the tropics. The stronger tropical hurricanes develop farther south and occur primarily in August and September. The weaker non-tropical hurricanes occur throughout a longer season. The original data set on Genetic Hurricane Classification can be retrieved here. The analysis by James B. Elsner can be reviewed here.
The goal of the following exercise is to build a model that predicts the group membership of a hurricane, either tropical or non-tropical, based on the latitude of formation.
We start the analysis by loading the data set. After installing the openxlsx package (type install.packages("openxlsx")), we can read the Excel file directly from a URL:
library(openxlsx)
hurricanes <- read.xlsx("https://userpage.fu-berlin.de/soga/data/raw-data/hurricanes.xlsx")
First, we inspect the structure of the data set by applying the str() function.
str(hurricanes)
## 'data.frame': 337 obs. of 12 variables:
## $ RowNames: chr "1" "2" "3" "4" ...
## $ Number : num 430 432 433 436 437 438 440 441 445 449 ...
## $ Name : chr "NOTNAMED" "NOTNAMED" "NOTNAMED" "NOTNAMED" ...
## $ Year : num 1944 1944 1944 1944 1944 ...
## $ Type : num 1 0 0 0 0 1 0 1 0 0 ...
## $ FirstLat: num 30.2 25.6 14.2 20.8 20 29.2 16.1 27.6 21.6 19 ...
## $ FirstLon: num -76.1 -74.9 -65.2 -58 -84.2 -55.8 -80.8 -85.6 -95.2 -56.6 ...
## $ MaxLat : num 32.1 31 16.6 26.3 20.6 38 21.9 27.6 28.6 24.9 ...
## $ MaxLon : num -74.8 -78.1 -72.2 -72.3 -84.9 -53.2 -82.9 -85.6 -96.1 -79.6 ...
## $ LastLat : num 35.1 32.6 20.6 42.1 19.1 50 28.4 31.7 29.5 28.9 ...
## $ LastLon : num -69.2 -78.2 -88.5 -71.5 -93.9 -46.5 -82.1 -79.1 -96 -81.8 ...
## $ MaxInt : num 80 80 105 120 70 85 105 100 120 120 ...
There are 337 observations and 12 variables in the data set. We are primarily interested in the variable Type, which is our response variable, and the variable FirstLat, which corresponds to the latitude of formation and is thus our predictor variable. First, however, to get a sense of the data set, we plot the number of hurricanes for each year as a bar plot using the ggplot2 package:
library(ggplot2)
ggplot(hurricanes, aes(x = Year)) +
  geom_bar(stat = "count")
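With stat = "count", the bar heights are simply the number of rows per year, so they can be cross-checked with base R. The following is a minimal sketch using a small hypothetical stand-in for the Year column; the data frame toy and its values are made up for illustration and are not taken from the hurricane data.

```r
# Hypothetical stand-in for the Year column (NOT the real data)
toy <- data.frame(Year = c(1944, 1944, 1944, 1945, 1945, 1947))

# Count observations per year; years without observations do not appear
counts <- table(toy$Year)
counts

# The same counts as a data frame, e.g. for further processing
counts_df <- as.data.frame(counts, stringsAsFactors = FALSE)
names(counts_df) <- c("Year", "n")
counts_df
```

Running table(hurricanes$Year) on the real data should give the values shown as bar heights in the plot.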