2093_Logistic regression in python

Introduction and exploratory data analysis¶

This example is inspired by the work of James B. Elsner and his colleagues (Elsner et al. 1996 and Kimberlain and Elsner 1998), who worked on a genetic classification of North Atlantic hurricanes based on formation and development mechanisms. The classification yields three different groups: tropical hurricanes, hurricanes under baroclinic influences, and hurricanes of baroclinic initiation. The stronger tropical hurricanes develop farther south and primarily occur in August and September. The weaker outer-tropical hurricanes occur throughout a long season. The original data set on Genetic Hurricane Classification can be retrieved here. The analysis by James B. Elsner can be reviewed here.

The goal of the exercise is to build a model that predicts the group membership of a hurricane, either tropical or non-tropical, based on the latitude of formation.

We start the analysis by loading the data set. Please note that the file is an Excel-spreadsheet. Thus, to deal with Excel spreadsheets we use the Pandas library to read it into a DataFrame. We then look at the first rows of the data.

In [1]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.read_excel('https://userpage.fu-berlin.de/soga/200/2010_data_sets/hurricanes.xlsx')
df.head()

Out[1]:

	RowNames	Number	Name	Year	Type	FirstLat	FirstLon	MaxLat	MaxLon	LastLat	LastLon	MaxInt
0	1	430	NOTNAMED	1944	1	30.2	-76.1	32.1	-74.8	35.1	-69.2	80
1	2	432	NOTNAMED	1944	0	25.6	-74.9	31.0	-78.1	32.6	-78.2	80
2	3	433	NOTNAMED	1944	0	14.2	-65.2	16.6	-72.2	20.6	-88.5	105
3	4	436	NOTNAMED	1944	0	20.8	-58.0	26.3	-72.3	42.1	-71.5	120
4	5	437	NOTNAMED	1944	0	20.0	-84.2	20.6	-84.9	19.1	-93.9	70

In [2]:

df.shape

Out[2]:

(337, 12)

From the above output, we can see that are 337 observations and 12 variables in the data set. We are primarily interested in the variable Type, which is our response variable, and the variable FirstLat, which corresponds to the latitude of formation, and thus is our predictor variable. However, to get a sense of the data set we plot the number of hurricanes for each year as a bar plot using Pandas .plot()-method.

In [3]:

df.groupby('Year').size().plot(kind='bar', figsize=(18,8), ylabel='hurricanes', legend=False)
plt.show()

That is a nice plot. However, by adding just one argument, we may additionally color the bars according to the Type variable. To this end, we group by the two variables and rename the legend accordingly.

In [4]:

df.groupby(['Year','Type']).size().unstack().plot(
    kind='bar', 
    figsize=(16,6), 
    ylabel='hurricanes', 
    legend=True,
    stacked=True
)
plt.legend(["tropical-only", "baroclinic influences", "baroclinic initiation"])
plt.show()

For a numerical representation of the hurricane classes we use the size()-method.

In [5]:

df.groupby("Type").size()

Out[5]:

Type
0    187
1     77
3     73
dtype: int64

In class $0$, tropical hurricanes, there are 187 observations, in class $1$, baroclinic influences, there are 77 observations and in class $3$, baroclinic initiation, there are 73 observations. In logistic regression we deal with dichotomous data; thus, for simplicity we re-code the classes and assign the class $1$ and class $2$, both being influenced by the outer tropics, the label $1$. We use the .where()-method

In [6]:

df["Type_new"] = df["Type"].where(df["Type"] == 0, other=1)

Now we have a binary response variable of two classes: Class 0 corresponds to tropical hurricanes and class 1 to non-tropical hurricanes. Let us check the class distribution of our new response variable Type_new.

In [7]:

df.groupby("Type_new").size()

Out[7]:

Type_new
0    187
1    150
dtype: int64

To wrap up this part on exploratory data analysis we plot the data on an interactive map, using the Folium package.

In [8]:

import folium
loc_center = [df['FirstLat'].mean(), df['FirstLon'].mean()]

In [9]:

hurricane_map = folium.Map(location = loc_center, tiles='Openstreetmap', zoom_start = 5, control_scale=True)
for index, loc in df.iterrows():
    if loc['Type_new']==0:
        color = 'red'
    else:
        color = 'blue'
    folium.CircleMarker(
        [loc['FirstLat'], loc['FirstLon']], 
        weight=5,
        radius = 2.5,
        color = color,
        popup = 'Year: ' + str(loc["Year"]),
        labels = ['tropical', 'non-tropical'],
    ).add_to(hurricane_map)
folium.LayerControl().add_to(hurricane_map)
hurricane_map

Out[9]:

Make this Notebook Trusted to load map: File -> Trust Notebook

Logistic regression: Model fit¶

Recall the goal of the exercise: We want to build a model that predicts the group membership of a hurricane, either tropical $(0)$ or non-tropical $(1)$, based on the latitude of formation of the hurricane. The response variable is the binary variable Type_new and the predictor variable is FirstLat. We build a logit model by applying the Logit() function.

In [10]:

import statsmodels.formula.api as smf

model = smf.logit('Type_new ~ FirstLat', data=df)
fit = model.fit()
fit.summary()

Optimization terminated successfully.
         Current function value: 0.344258
         Iterations 7

Out[10]:

Logit Regression Results
Dep. Variable:	Type_new	No. Observations:	337
Model:	Logit	Df Residuals:	335
Method:	MLE	Df Model:	1
Date:	Wed, 05 Oct 2022	Pseudo R-squ.:	0.4990
Time:	12:21:25	Log-Likelihood:	-116.02
converged:	True	LL-Null:	-231.56
Covariance Type:	nonrobust	LLR p-value:	3.465e-52

	coef	std err	z	P>\|z\|	[0.025	0.975]
Intercept	-9.0826	0.961	-9.446	0.000	-10.967	-7.198
FirstLat	0.3728	0.039	9.447	0.000	0.295	0.450

Let us take a closer look on the coefficients:

In [11]:

fit._results.params

Out[11]:

array([-9.08263355,  0.37282953])

Be aware that the output of the logistic model is on link-scale (logit); thus, the numerical output of the model corresponds to the log-odds! For convenience let us write down the logistic regression model with the calculated intercept and coefficient.

$$\phi(\eta) = \frac{1}{1+e^{-\eta}}= \frac{1}{1+e^{-\beta_0+\beta_1x_1}}= \frac{1}{1+e^{-(-9.08263355) + 0.37282953x)}}$$

Let us first focus on the coefficient of the intercept. The coefficient of the intercept corresponds to the log-odds of observing a hurricane at a latitude of zero, which is the known as equator. To calculate the odds of observing a non-tropical hurricane when the latitude is zero we exponentiate the log-odds $e^{-9.08263355`} = 1.1362198 x 10^{-4}$. This is a very low number! And it makes perfect sense since it is very unlikely to observe a non-tropical hurricane at the equator.

The coefficient of the formation latitude variable (FirstLat) has a numerical value of 0.37282953. The positive sign of this value indicates that the chance to observe a non-tropical one increases with latitude. The magnitude of the coefficient implies that for every degree increase in formation latitude the log-odds increase by a constant 0.37282953 units, on average. By taking the exponent of the coefficient value, we get the odds ratio.

In [12]:

print('FirstLat', str(np.exp(fit._results.params[1])))

FirstLat 1.4518368266008836

Thus, for every degree increase in formation latitude the odds ratio increases on average by a constant factor of 1.4518368266008836 (or 45 %). The interpretation is valid only over the range of latitudes in the data, and physically meaningless for latitudes outside the range where hurricanes occur.

The coefficients table from above includes a standard error and the p-value. The smaller the p-value, the less support there is for the null hypothesis given the data and the model. Recall, the null hypothesis states that the coefficient is $0$, in words, the null hypothesis states that there is no relationship between the occurrence of a non-tropical hurricane and the formation latitude. The very small the p-value does not provide support for the null hypothesis and allows us to accept the model.

We have learned that any point estimate should be accompanied by a level of confidence about that point estimation. Therefore, we construct a 95% confidence interval for the estimated model coefficient. We exponentiate the result to return the odds ratio, which is a quantity that is easier to interpret than the log odds.

In [13]:

print('lower conf_int (2.5  %):', str(fit._results.conf_int()[1][0]))
print('upper conf_int (97.5 %):', str(fit._results.conf_int()[1][1]))

lower conf_int (2.5  %): 0.29547694826960713
upper conf_int (97.5 %): 0.45018211515766227

In [14]:

print('lower odds_ratio (2.5  %):', str(np.exp(fit._results.conf_int()[1][0])))
print('upper odds_ratio (97.5 %):', str(np.exp(fit._results.conf_int()[1][1])))

lower odds_ratio (2.5  %): 1.3437671132701643
upper odds_ratio (97.5 %): 1.5685978249199384

Thus, we can state that we are 95% confident that for every degree increase in formation latitude the observation of a non-tropical hurricane becomes on average more likely between 34 and 57%.

Logistic regression: Model prediction¶

In the proceeding section, we built a logistic regression model for the relationship between the formation latitude and the type of hurricane (tropical/non-tropical). In this section, we apply that model to make predictions about the chance of observing a non-tropical hurricane given the latitude of formation.

To predict the probability that a hurricane picked at random, which forms at a given latitude is of type non-tropical we apply the predict() function. Please note that if we add the argument type = "response" to the function call, the predict() function returns the probability and not the log-odds. Let us evaluate the probability to observe a non-tropical hurricane that formed at latitudes of $10^\circ, 23.5^\circ, \text{and }30^\circ$.

In [29]:

# performing predictions on the test datdaset
import statsmodels.api as sm
df_new = pd.DataFrame({'FirstLat': [10, 23.5, 30]})
df_new = sm.add_constant(df_new)

yhat = fit.predict(df_new)

# comparing original and predicted values of y
# print('Actual values', list(ytest.values))
print('Predictions:')
print(yhat)

Predictions:
0    0.004705
1    0.420398
2    0.891122
dtype: float64

The results make perfect sense. With increasing latitude, the chance of observing the formation of a non-tropical hurricane increases significantly. At a latitude of $10^\circ$ the probability of observing a non-tropical hurricane is $0.005$, whereas at the latitude of $30^\circ$ that probability is $0.891$.

For a better understanding let us calculate the probability to observe a non-tropical hurricane that formed at latitudes of $23.5^\circ$ by hand. Recall the equation for the regression model from above:

$$ \begin{align} \ \phi(\eta) & = \frac{1}{1+e^{-\eta}} = \frac{1}{1+e^{-\beta_0+\beta_1x_1}}\\ & = \frac{1}{1+e^{-(−9.0826335+0.3728295 \times 23.5)}} \\ & = \frac{1}{1+e^{-(−0.3211396)}} \\ & = \frac{1}{2.378698} = 0.4203981 \end{align} $$

Therefore, we first prepare a vector of latitudes, lats. We specify an increment of 0.1 degrees so that the resulting prediction curve is smooth. We apply the predict() method and add a column of constants for the model to work.

To visualize our results we create a plot of predictions across a range of latitudes employing the seaborn library. This includes the confidence interval. We plot the data points (observations) at $0$ and $1$ and the predicted values using the lmplot() function with the logistic argument. y_jitter spreads the data points for better visual interpretation and height determines the size of the plot.

In [51]:

import seaborn as sns
sns.lmplot(
    x='FirstLat', 
    y='Type_new', 
    data=df, 
    logistic=True, 
    y_jitter=.01, 
    height=8,
)
plt.xlabel('Formation Latitude (N)')
plt.ylabel('Probability')
plt.show()

The observations of tropical and non-tropical hurricanes are shown along the $y=0$ and $y=1$ lines, respectively. The gray band is the 95% point-wise confidence interval.

The model predictions show that the probability of a non-tropical hurricane is less than 10% at latitudes south of $20^\circ$N. However, by latitude $26^\circ$N, the probability exceeds 50% and by latitude $33^\circ$N the probability exceeds 90%. The odds ratio is constant but the probability is a nonlinear function of latitude.