Logistic regression analysis belongs to the class of **generalized
linear models**. In R, generalized linear models are handled by the
`glm()` function. The function is called as
`glm(response ~ predictor, family = binomial(link = "logit"), data)`.
Since `logit` is the default link for the binomial family, we do not
have to type it explicitly. The `glm()` function returns a model
object, so we may apply **extractor functions**, such as `summary()`,
`fitted()` or `predict()`, to it. Please note, however, that the
coefficients reported by `summary()` and the default output of
`predict()` are on the logit scale, whereas `fitted()` returns
probabilities. To actually predict probabilities we need to provide the
`predict()` function with the additional argument `type = "response"`.
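
The difference between the two scales can be illustrated with a minimal sketch. The data below are simulated; the variables `x` and `y`, the data frame `toy` and the coefficients -1 and 2 are assumptions made purely for illustration and have nothing to do with the hurricane data. The code fits a logistic regression and compares the default output of `predict()` with the output for `type = "response"`.

```r
# Simulated data for illustration only (not the hurricane data)
set.seed(1)
n <- 200
x <- rnorm(n)
p <- plogis(-1 + 2 * x)              # assumed true success probabilities
y <- rbinom(n, size = 1, prob = p)
toy <- data.frame(x = x, y = y)

# family = binomial uses the logit link by default
fit <- glm(y ~ x, family = binomial, data = toy)

new_data <- data.frame(x = c(-1, 0, 1))
eta  <- predict(fit, newdata = new_data)                     # logit scale
prob <- predict(fit, newdata = new_data, type = "response")  # probabilities

# The inverse logit maps the linear predictor to probabilities
all.equal(plogis(eta), prob)

# fitted() already returns probabilities for the training data
all.equal(fitted(fit), predict(fit, type = "response"))
```

Both comparisons should evaluate to `TRUE`: applying the inverse logit `plogis()` to the default predictions gives the same values as asking for `type = "response"` directly.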

This example is inspired by the work of James B. Elsner and his colleagues (Elsner et al. 1996; Kimberlain and Elsner 1998), who worked on a
**genetic classification of North Atlantic hurricanes**
based on formation and development mechanisms. The classification yields
three groups: tropical hurricanes, hurricanes under baroclinic
influences and hurricanes of baroclinic initiation. The term
“baroclinic” refers to the fact that these hurricanes are influenced
by disturbances from outside the tropics or even originate outside the tropics.
The stronger tropical hurricanes develop farther south and occur
primarily in August and September. The weaker outer-tropical hurricanes
occur throughout a longer season. The original data set on Genetic
Hurricane Classification can be retrieved here. The analysis by James B. Elsner can be
reviewed here.

The goal of the following exercise is to **build a model that
predicts the group membership of a hurricane, either tropical or
non-tropical, based on the latitude of formation.**

We start the analysis by loading the data set. After installing the
`openxlsx` package (type `install.packages("openxlsx")`), we can read
the Excel file directly from a URL:

```
library(openxlsx)
hurricanes <- read.xlsx("https://userpage.fu-berlin.de/soga/data/raw-data/hurricanes.xlsx")
```

First, we inspect the structure of the data set by applying the
`str()` function.

`str(hurricanes)`

```
## 'data.frame': 337 obs. of 12 variables:
## $ RowNames: chr "1" "2" "3" "4" ...
## $ Number : num 430 432 433 436 437 438 440 441 445 449 ...
## $ Name : chr "NOTNAMED" "NOTNAMED" "NOTNAMED" "NOTNAMED" ...
## $ Year : num 1944 1944 1944 1944 1944 ...
## $ Type : num 1 0 0 0 0 1 0 1 0 0 ...
## $ FirstLat: num 30.2 25.6 14.2 20.8 20 29.2 16.1 27.6 21.6 19 ...
## $ FirstLon: num -76.1 -74.9 -65.2 -58 -84.2 -55.8 -80.8 -85.6 -95.2 -56.6 ...
## $ MaxLat : num 32.1 31 16.6 26.3 20.6 38 21.9 27.6 28.6 24.9 ...
## $ MaxLon : num -74.8 -78.1 -72.2 -72.3 -84.9 -53.2 -82.9 -85.6 -96.1 -79.6 ...
## $ LastLat : num 35.1 32.6 20.6 42.1 19.1 50 28.4 31.7 29.5 28.9 ...
## $ LastLon : num -69.2 -78.2 -88.5 -71.5 -93.9 -46.5 -82.1 -79.1 -96 -81.8 ...
## $ MaxInt : num 80 80 105 120 70 85 105 100 120 120 ...
```
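
To get a first impression of the two variables used later on, the values printed by `str()` can be summarized by hand. The sketch below only uses the first ten values of `Type` and `FirstLat` shown in the output above, so the numbers are not representative of the full data set; the names `first_rows`, `type_counts` and `lat_by_type` are chosen here for illustration.

```r
# First ten values of Type and FirstLat, copied from the str() output above
first_rows <- data.frame(
  Type     = c(1, 0, 0, 0, 0, 1, 0, 1, 0, 0),
  FirstLat = c(30.2, 25.6, 14.2, 20.8, 20, 29.2, 16.1, 27.6, 21.6, 19)
)

# How often does each value of Type occur in these ten rows?
type_counts <- table(first_rows$Type)
type_counts

# Mean latitude of formation for each value of Type
lat_by_type <- aggregate(FirstLat ~ Type, data = first_rows, FUN = mean)
lat_by_type
```

On the full data set, the same two calls can be applied to `hurricanes` instead of `first_rows`.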

There are 337 observations and 12 variables in the data set. We are
primarily interested in the variable `Type`, which is our response
variable, and the variable `FirstLat`, which corresponds to the latitude
of formation and thus is our predictor variable. However, in order to
get a sense of the data set, we first plot the number of hurricanes for
each year as a bar plot using the `ggplot2` package:

```
library(ggplot2)
ggplot(hurricanes, aes(x = Year)) +
  geom_bar(stat = "count")
```
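
The counting that `geom_bar(stat = "count")` performs can be reproduced in base R with `table()`. The following is a minimal sketch on a hypothetical vector `years`, which is made up for illustration and not taken from the data set; it shows the numbers behind the bar heights.

```r
# Hypothetical years, for illustration only (not taken from the data set)
years <- c(1944, 1944, 1945, 1945, 1945, 1947)

# One row per distinct year; Freq is the number of entries for that year,
# which corresponds to the bar height drawn by geom_bar(stat = "count")
year_counts <- as.data.frame(table(Year = years), stringsAsFactors = FALSE)
year_counts
```

Years without any entry (1946 in this made-up vector) do not appear in the table. With the real data, `table(hurricanes$Year)` would give the counts shown in the bar plot.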