Logistic regression analysis belongs to the class of generalized linear models. We apply the glmnet() function from the eponymous glmnet package to build an \(L_2\)-regularized logistic regression model, in which large coefficients are penalized by their squared magnitude to prevent overfitting. You may want to revisit the introductory sections on logistic regression and regularization methods.

Suppose the response variable takes values in \(\mathcal{G}=\{0,1\}\). Denote \(y_i=I(g_i=1)\). The model

\[P(G=1|X=x)=\frac{e^{\beta_0+\beta^{T}x}}{1+e^{\beta_0+\beta^{T}x}}\] can be written in the following form:

\[\log\frac{P(G=1|X=x)}{P(G=0|X=x)}=\beta_0+\beta^{T}x,\] the so-called logistic or log-odds transformation.
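In R, this pair of transformations is available directly in base R: plogis() evaluates the logistic function and qlogis() the log-odds, which gives a quick numerical check of the two formulas above:

# the logistic function and its inverse (the log-odds) in base R
x <- 0.75
p <- plogis(x) # e^x / (1 + e^x)
qlogis(p)      # log(p / (1 - p)) recovers x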

glmnet fits the model by minimizing the penalized negative log-likelihood:

\[\min_{(\beta_0,\beta)\in\mathbb{R}^{p+1}}-\left[\frac{1}{N}\sum_{i=1}^{N}y_i(\beta_0+x_i^T\beta)-\log\left(1+e^{\beta_0+x_i^T\beta}\right)\right]+\lambda\left[(1-\alpha)||\beta||_2^2/2+\alpha||\beta||_1\right].\]

For \(\alpha=0\) the penalty reduces to the ridge (\(L_2\)) penalty we use below; \(\alpha=1\) yields the lasso.

For further details on the computational mechanics, see the Glmnet Vignette by typing vignette("glmnet_beta") in your console.

The glmnet() function expects an input matrix X and a response vector y. In order to fit a logistic regression, we add the argument family = "binomial" to the function call. In the previous sections we prepared the response vectors and the model matrices for the training and test sets (y_train, y_test, X_train, X_test, X_train_PC3, X_test_PC3, and pca_scores_3). Let us load all the data we need into our workspace.

# load model input data from the previous section; download the file
# first if it is not yet available locally
if (!file.exists("pca_data_30300.RData")) {
  download.file("https://userpage.fu-berlin.de/soga/data/r-data/pca_data_30300.RData",
                destfile = "pca_data_30300.RData", mode = "wb")
}
load("pca_data_30300.RData")
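As a quick, optional sanity check that the objects arrived in the expected form (a minimal sketch; the object names were defined in the previous section):

# optional check of the loaded objects
dim(X_train)    # training model matrix: observations x features
length(y_train) # training response vector
dim(X_test)     # test model matrix
length(y_test)  # test response vector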

The general procedure is as follows: we fit the model on the training set, select an appropriate \(λ\) by cross-validation, predict the response for the test set, and evaluate the predictions.

We evaluate the model performance by calculating the model accuracy, the confusion matrix and Cohen’s kappa. The confusion matrix is a table that visualizes the performance of a predictive model. Each column of the matrix represents the instances in an actual class while each row represents the instances in a predicted class (or sometimes vice versa).

Cohen’s kappa coefficient, \(κ\), is a statistic which measures inter-rater agreement for categorical items. It takes a value of \(κ=0\) if the observed agreement is no better than chance, while \(κ=1\) indicates perfect agreement between predicted and actual classes.

The definition of \(κ\) is:

\[\kappa=\frac{p_0-p_e}{1-p_e},\] where \(p_0\) is the relative observed agreement (identical to accuracy), and \(p_e\) is the hypothetical probability of chance agreement, using the observed data to calculate the probabilities of each observer randomly saying each category.
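As a minimal illustration of this formula, \(κ\) can be computed by hand from a 2×2 confusion matrix. The counts below anticipate the baseline model’s test-set results further down, so the value can be compared with the caret output there:

# Cohen's kappa computed by hand from a 2x2 confusion matrix
# (rows: predicted class, columns: actual class)
cm_counts <- matrix(c(27, 3, 1, 17), nrow = 2,
                    dimnames = list(Prediction = c("0", "1"),
                                    Reference = c("0", "1")))
n <- sum(cm_counts)
p0 <- sum(diag(cm_counts)) / n                            # observed agreement (accuracy)
pe <- sum(rowSums(cm_counts) * colSums(cm_counts)) / n^2  # chance agreement
(p0 - pe) / (1 - pe) # kappa, approx. 0.826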


The baseline model

In the baseline model we predict the response variable HOT.DAY by including all 13 features in the data set. We use the \(L_2\)-regularized logistic regression model (ridge regression: alpha = 0) provided by the glmnet package. In particular we apply the cv.glmnet() function, which by default performs 10-fold cross-validation on the data set and, given the additional argument type.measure = "class", evaluates the model performance by the misclassification error.

library(glmnet)
grid <- 10^seq(1, -3, length = 100) # set lambda sequence

set.seed(200) 
m_log_baseline <- cv.glmnet(X_train,
                            y_train,
                            family = "binomial",
                            type.measure = "class", # misclassification error
                            alpha = 0, # ridge regression
                            lambda = grid)

Calling the plot() function on the cv.glmnet object provides a nice graphical representation of the relation between misclassification error and \(λ\) (on the log scale). The log transformation is applied for visualization, because \(λ\) covers a wide numeric range. This is purely a plotting choice to make interpretation easier; all computations use the original \(λ\) values.

plot(m_log_baseline)

The plot shows that the misclassification error increases with \(λ\). The dotted vertical lines indicate the value of \(λ\) that gives the minimum mean cross-validated error and the largest value of \(λ\) such that the error is within 1 standard error of the minimum. We can access these values of \(λ\) via the object attributes lambda.min and lambda.1se.

m_log_baseline$lambda.min
## [1] 0.001204504
m_log_baseline$lambda.1se
## [1] 0.0869749
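To inspect the fitted coefficients at either of these values of \(λ\), we can call coef() on the cv.glmnet object:

# ridge coefficients at the cross-validated lambda values
coef(m_log_baseline, s = "lambda.min")
coef(m_log_baseline, s = "lambda.1se")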

Finally, we use the test data set, X_test, and the best \(λ\) to predict the response vector y_test. We have to add an additional argument to the predict() function in order to indicate whether we want the function to return class labels (type = "class") or probabilities (type = "response").

# prediction for the test set
m_log_baseline_pred <- predict(m_log_baseline, # learned model
                               X_test,  # new data
                               type = "class", # class labels
                               s = "lambda.min") # picks the smallest lambda
tbl <- table(m_log_baseline_pred)
tbl
## m_log_baseline_pred
##  0  1 
## 28 20
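For reference, setting type = "response" returns the fitted probabilities instead; for a binomial model the class labels correspond to thresholding these probabilities at 0.5. A minimal sketch:

# predicted probabilities; thresholding at 0.5 reproduces the class labels
p_hat <- predict(m_log_baseline, X_test, type = "response", s = "lambda.min")
table(ifelse(p_hat > 0.5, "1", "0"))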

To assess the model performance we use the functionality of the caret package. The function confusionMatrix() creates a tabular form of the confusion matrix. In addition, the confusionMatrix object stores many other model metrics; here we just ask for the accuracy and Cohen’s kappa. Accuracy measures the proportion of correct predictions, while Cohen’s kappa indicates how well the predicted classes agree with the actual classes after accounting for chance.

library(caret)
cm <- confusionMatrix(data = as.factor(m_log_baseline_pred), reference = y_test)
cm$table
##           Reference
## Prediction  0  1
##          0 27  1
##          1  3 17
cm$overall["Accuracy"]
##  Accuracy 
## 0.9166667
cm$overall["Kappa"]
##    Kappa 
## 0.826087

The model performs very well and beats our accuracy benchmark of \(0.63\). The two model metrics, accuracy and Cohen’s kappa, are 0.92 and 0.83, respectively. In terms of classification accuracy we count 27 true negative and 17 true positive classifications. In contrast, we only misclassified 4 observations (1 false negative and 3 false positives).
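Note that the benchmark of \(0.63\) corresponds to the proportion of the majority class in the test set. Assuming it was derived this way, it can also be read off the confusionMatrix object as the no-information rate:

# no-information rate: accuracy of always predicting the majority class
cm$overall["AccuracyNull"]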


The PC3 model

Now, we build an \(L_2\)-regularized logistic regression model to predict the response variable HOT.DAY by including just the first three principal component scores. As in the previous section, we compute the confusion matrix, the model accuracy and Cohen’s kappa for the test set to evaluate the goodness of fit of the model.

# build the model
grid <- 10^seq(1, -3, length = 100) # set lambda sequence

set.seed(200) 
m_log_PC3 <- cv.glmnet(X_train_PC3,
                       y_train,
                       family = "binomial",
                       type.measure = "class", # misclassification error
                       alpha = 0, # ridge regression
                       lambda = grid)
plot(m_log_PC3)

# prediction for the test set
m_log_PC3_pred <- predict(m_log_PC3, # learned model
                          X_test_PC3, # new data
                          type = "class", # class labels
                          s = "lambda.min") # picks the smallest lambda

cm <- confusionMatrix(data = as.factor(m_log_PC3_pred), reference = y_test)
cm$table
##           Reference
## Prediction  0  1
##          0 27  4
##          1  3 14
cm$overall["Accuracy"]
##  Accuracy 
## 0.8541667
cm$overall["Kappa"]
##     Kappa 
## 0.6853933

The model performs reasonably well compared to the baseline model. The PC3 model accuracy of 0.85 is higher than our accuracy benchmark of \(0.63\). Cohen’s kappa is 0.69, indicating substantial agreement between the predicted and actual classes beyond chance; note, however, that this metric can vary with different random data splits.

We count 27 true negative and 14 true positive classifications. In contrast, we only misclassified 7 observations (4 false negatives and 3 false positives).

Please note that we projected the full feature space onto just a 3-dimensional space of principal component scores. While the model still captures the main patterns, its performance should be interpreted with some caution given the variability of Cohen’s kappa across different data splits.


Visualisation of the results

Let us visualize the classification results of the PC3 model on a map.

library(geodata)
library(sf)
library(ggplot2)

# Retrieve Federal States using the gadm() function from the geodata package
G1 <- geodata::gadm(country = "DEU", level = 1, tempdir())
G1_sf <- st_as_sf(G1)

# Predict the occurrence of hot days at each weather station based on the PC3 model (m_log_PC3) for the full data set pca_scores_3
m_log_PC3_full <- predict(m_log_PC3, # learned model
                          pca_scores_3, # new data
                          type = "class", # class labels
                          s = "lambda.min") # picks the smallest lambda

# add a new column called "Classification" to the data set:
# "FN" for false negative,
# "TN" for true negative,
# "FP" for false positive and
# "TP" for true positive

dwd_data$PC3_prediction <- m_log_PC3_full # assign prediction vector to data set

dwd_data$Classification <- NA # initialize empty column
# populate column
dwd_data$Classification[dwd_data$HOT.DAY == 1 & dwd_data$PC3_prediction == 1] <- "TP"
dwd_data$Classification[dwd_data$HOT.DAY == 0 & dwd_data$PC3_prediction == 0] <- "TN"
dwd_data$Classification[dwd_data$HOT.DAY == 1 & dwd_data$PC3_prediction == 0] <- "FN"
dwd_data$Classification[dwd_data$HOT.DAY == 0 & dwd_data$PC3_prediction == 1] <- "FP"
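Before plotting, a quick tabulation gives an overview of how the four outcomes are distributed across all stations (the counts depend on the full data set):

# overview of the classification outcomes across all stations
table(dwd_data$Classification)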


# plot the map
ggplot() +
  geom_sf(data = G1_sf,
          colour = "grey10", fill = "#FFFFFF") +
  geom_point(data = dwd_data,
             aes(x = lon, y = lat, colour = Classification),
             alpha = 0.7,
             size = 2) +
  scale_colour_manual(values = c("FN" = "#D55E00",
                                 "FP" = "#CC79A7",
                                 "TN" = "#56B4E9",
                                 "TP" = "#009E73")) +
  theme_bw() +
  xlab("Longitude") + ylab("Latitude") +
  coord_sf()

The map shows an interesting pattern. First, we notice that the data set is slightly imbalanced: the negative class (HOT.DAY = 0, the majority class) occurs more frequently than the positive class (HOT.DAY = 1, the minority class). This imbalance leads the model to favor the majority class, which results in many true negatives but also a number of false negatives. In other words, the model achieves high overall accuracy partly by predicting the more frequent class, which is problematic because it fails to identify some of the actual hot days, exactly the cases we are most interested in.

Looking at the spatial distribution, we see that true negatives (blue) dominate in the northern and western regions, where the negative class is common and correctly predicted. True positives (green) are concentrated in the southern and southeastern regions, where hot days occur more often and the model performs better. However, false negatives (orange) are clustered in the transition zones between cooler and hotter regions, such as central Germany and the southern margins, indicating that the model often misses hot days in borderline cases. False positives (pink), although rare, also tend to appear in these transitional areas.


Citation

The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us via e-mail at soga[at]zedat.fu-berlin.de.

You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follows: Hartmann, K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.