**Logistic regression** analysis belongs to the
class of **generalized linear models**. We apply the
`glmnet()`

function from the eponymous `glmnet`

package to build a **\(L_2\)-regularized logistic regression
model**. You may want to revisit the introductory sections on
logistic regression and regularization methods.

Suppose the response variable takes value in \(G=\text{{0,1}}\). Denote \(y_i=I(g_i=1)\). The model

\[P(G=1|X=x)+\frac{e^{\beta_0+\beta^{T}x}}{1+e^{\beta_0+\beta^{T}x}}\] can be written in the following form:

\[\log\frac{P(G=1|X=x)}{P(G=0|X=x)}=\beta_0+\beta^{T}x,\]
the so-called logistic or *log-odds
transformation*.

\[\min_{\substack{(\beta_0\beta)\in\mathbb{R}^{p+1}}}- \begin{bmatrix} \frac{1}{N}\displaystyle\sum_{i=1}^{N}y_i(\beta_0+x_i^T\beta)-\log(1+e^{\beta_0+x_i^T\beta}) \end{bmatrix}+\lambda[(1-\alpha)||\beta||_2^2/2+\alpha||\beta||_1].\]

For further details on the computational mechanics visit the
`Glmnet`

Vignette by typing
`vignette("glmnet_beta")`

in your console.

The `glmnet()`

function expects an input matrix
`X`

and the response vector `y`

. In order to apply
a logistic regression we need to add the argument
`family = "binomial"`

to the function call. In the previous
sections we prepared the response vectors and the model matrices for the
training and test sets (`y_train`

, `y_test`

,
`X_train`

, `X_test`

, `X_train_PC3`

,
`X_test_PC3`

, and `pca_PC3`

). Let us load all the
data we need into our workspace.

```
# load model input data from previous section
load("pca_data_30300.RData")
```

The general procedure is as follows:

- Build a model for the training set.
- Evaluate model performance on the test set.

We evaluate the model performance by calculating the model accuracy,
the *confusion matrix* and *Cohenâ€™s kappa*. The confusion matrix is a
specific table that visualizes the performance of a predictive model.
Each column of the matrix represents the instances in a actual class
while each row represents the instances in an predicted class (or
sometimes *vice versa*).

Cohenâ€™s kappa coefficient, \(Îº\), is a statistic which measures inter-rater agreement for categorical items. It takes a value of \(Îº=0\) if the items agreement is due to chance. A value of \(Îº=1\) indicates a perfect model fit.

The definition of \(Îº\) is:

\[\kappa=\frac{p_0-p_e}{1-p_e},\] where \(p_0\) is the relative observed agreement (identical to accuracy), and \(p_e\) is the hypothetical probability of chance agreement, using the observed data to calculate the probabilities of each observer randomly saying each category.

In the *baseline model* we predict the response variable
`HOT.DAY`

by including all 15 features in the data set. We
use the \(L_2\)-regularized logistic
regression model (ridge regression: `alpha = 0`

) provided
by the `glmnet`

package. In particular we apply the
`cv.glmnet()`

function, which performs by default 10-fold cross-validation on the data set and evaluates the
model performance by calculating the misclassification error given the
additional argument `type.measure = "class"`

to the function
call.

`library(glmnet)`

`## Warning: Paket 'glmnet' wurde unter R Version 4.2.3 erstellt`

```
grid <- 10^seq(1, -3, length = 100) # set lambda sequence
m_log_baseline <- cv.glmnet(X_train,
y_train,
family = "binomial",
type.measure = "class", # misclassification error
alpha = 0, # ridge regression
lambda = grid)
```

Calling the `plot()`

function on the
`cv.glmnet`

object provides a nice graphical representation
of the relation between misclassification error and \(Î»\).

`plot(m_log_baseline)`

The plot shows that the misclassification error increases with \(Î»\). The dotted vertical lines indicate the
value of \(Î»\) that gives the minimum
mean cross-validated error and the largest value of \(Î»\) such that the error is within 1
standard error of the minimum. We can assess these values of \(Î»\) by the object attributes
`lambda.min`

and `lambda.1se`

.

`m_log_baseline$lambda.min`

`## [1] 0.01963041`

`m_log_baseline$lambda.1se`

`## [1] 0.04132012`

Finally, we use the test data set, `X_test`

, and the best
\(Î»\) to predict the response vector
`y_test`

. We have to add an additional argument to the
`predict()`

function in order to indicate if we want the
function to return class labels (`type = "class"`

) or
probabilities (`type = "response"`

).

```
# prediction for the test set
m_log_baseline_pred <- predict(m_log_baseline, # learned model
X_test, # new data
type = "class", # class labels
s = "lambda.min") # picks the smallest lambda
tbl <- table(m_log_baseline_pred)
tbl
```

```
## m_log_baseline_pred
## 0 1
## 18 141
```

To assess the model performance we use the functionality of the
`caret`

package. The function `confusionMatrix()`

creates a tabular form of the confusion matrix. In addition the
`confusionMatrix`

object stores many other model metrics.
Here we just ask for the accuracy and Cohenâ€™s kappa.

`library(caret)`

`## Warning: Paket 'caret' wurde unter R Version 4.2.3 erstellt`

```
cm <- confusionMatrix(data = as.factor(m_log_baseline_pred), reference = y_test)
cm$table
```

```
## Reference
## Prediction 0 1
## 0 16 2
## 1 6 135
```

`cm$overall["Accuracy"]`

```
## Accuracy
## 0.9496855
```

`cm$overall["Kappa"]`

```
## Kappa
## 0.7715517
```

The model performs very good. We perform better than our accuracy benchmark of \(0.86\). The two model metrics, accuracy and Cohenâ€™s kappa are 0.95 and 0.77, respectively. In terms of classification accuracy we count 16 true negative and 135 true positive classifications. In contrast we only misclassified 8 observations (6 false negatives and 2 false positives).

Now we build a \(L_2\)-regularized
logistic regression model to predict the response variable
`HOT.DAY`

by including just the first three principal
components scores. As in the previous section, we compute the confusion
matrix, the model accuracy and Cohenâ€™s kappa for the test set to
evaluate the goodness of fit of the model.

```
# build the model
grid <- 10^seq(1, -3, length = 100) ## set lambda sequence
m_log_PC3 <- cv.glmnet(X_train_PC3,
y_train,
family = "binomial",
type.measure = "class", # misclassification error
alpha = 0, # ridge regression
lambda = grid)
plot(m_log_PC3)
```

```
# prediction for the test set
m_log_PC3_pred <- predict(m_log_PC3, # learned model
X_test_PC3, # new data
type = "class", # class labels
s = "lambda.min") # picks the smallest lambda
cm <- confusionMatrix(data = as.factor(m_log_PC3_pred), reference = y_test)
cm$table
```

```
## Reference
## Prediction 0 1
## 0 16 4
## 1 6 133
```

`cm$overall["Accuracy"]`

```
## Accuracy
## 0.9371069
```

`cm$overall["Kappa"]`

```
## Kappa
## 0.7257675
```

The model performs very good compared the baseline model. The *PC3
model* accuracy is higher than our accuracy benchmark of \(0.86\). The two model metrics, accuracy and
Cohenâ€™s kappa are 0.94 and 0.73, respectively. We count 16 true negative
and 133 true positive classifications. In contrast we only misclassified
10 observations (6 false negatives and 4 false positives).

Please note that we projected a 15-dimensional data set onto a 3-dimensional feature space, and still the model performance is remarkable.

Let us visualize the classification results of the *PC3 model*
on a map.

```
library(raster)
library(ggplot2)
# Retrieve Federal States by the getData() function from the raster package
G1 <- getData(country = "Germany", level = 1)
```

```
## Warning in getData(country = "Germany", level = 1): getData will be removed in a future version of raster
## . You can use access these data, and more, with functions from the geodata package instead
```

```
# Predcit the occurrence of hot days at a particular weather station based on the PC3 model (m_log_PC3) for the full data set pca_PC3
m_log_PC3_full <- predict(m_log_PC3, # learned model
pca_PC3, # new data
type = "class", # class labels
s = "lambda.min") # picks the smallest lambda
# add a new column called "Classification" to the data set
# "FN" for false negativ,
# "TN" for true negativ,
# "FP" for false positive and
# "TP" for true positive
dwd_data$PC3_prediction <- m_log_PC3_full # assign prediction vector to data set
dwd_data$Classification <- NULL # initialize empty column
# populate column
dwd_data$Classification[dwd_data$HOT.DAY == 1 & dwd_data$PC3_prediction == 1] <- "TP"
dwd_data$Classification[dwd_data$HOT.DAY == 0 & dwd_data$PC3_prediction == 0] <- "TN"
dwd_data$Classification[dwd_data$HOT.DAY == 1 & dwd_data$PC3_prediction == 0] <- "FN"
dwd_data$Classification[dwd_data$HOT.DAY == 0 & dwd_data$PC3_prediction == 1] <- "FP"
#plot the map
ggplot() +
geom_polygon(data = G1,
aes(x = long, y = lat, group = group),
colour = "grey10", fill = "#FFFFFF") +
geom_point(data = dwd_data,
aes(x = LON, y = LAT, colour = Classification),
alpha = 0.7,
size = 2) +
scale_colour_manual(values = c("#D55E00",
"#CC79A7",
"#56B4E9",
"#009E73")) +
theme_bw() +
xlab("Longitude") + ylab("Latitude") +
coord_map()
```

The map shows an interesting pattern. First of all we realize that the data set is quite imbalanced, as there are a lot more positives than negatives. We see that the negatives (weather stations that did not record a hot day so far) are clustered. One cluster is located in the Alpine region and the Alpine foothills. Another cluster is found in the Central German Uplands, and another cluster is observed at the coastal regions of Germany. Though, it is very interesting, that the classification algorithm correctly predicted all observations in the southern part of Germany. In the central parts of Germany the algorithm sometimes failed to correctly predict positives, as well as negatives. However, one pattern is very striking. In the coastal regions of Northern Germany the logistic regression model performed, in general, quite well, expect for those weather stations located in the federal state of Schleswig-Holstein. In this particular coastal region nearly all class predictions turn out to be wrong. This is an interesting result, which needs more thoroughly investigation.

**Citation**

The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us via mail by soga[at]zedat.fu-berlin.de.

You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follow: *Hartmann,
K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis
using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.*