Logistic regression analysis
belongs to the class of generalized linear models. We apply
the glmnet() function from the eponymous
glmnet package to build an \(L_2\)-regularized logistic regression
model, which means that large coefficients are penalized by their squared
size to prevent overfitting. You may want to revisit the introductory
sections on logistic regression and regularization methods.
Suppose the response variable takes values in \(G=\{0,1\}\). Denote \(y_i=I(g_i=1)\). The model
\[P(G=1|X=x)=\frac{e^{\beta_0+\beta^{T}x}}{1+e^{\beta_0+\beta^{T}x}}\] can be written in the following form:
\[\log\frac{P(G=1|X=x)}{P(G=0|X=x)}=\beta_0+\beta^{T}x,\] the so-called logistic or log-odds transformation. The model parameters are estimated by minimizing the penalized objective
\[\min_{(\beta_0,\beta)\in\mathbb{R}^{p+1}} -\left[\frac{1}{N}\sum_{i=1}^{N}y_i(\beta_0+x_i^T\beta)-\log\left(1+e^{\beta_0+x_i^T\beta}\right)\right]+\lambda\left[(1-\alpha)\|\beta\|_2^2/2+\alpha\|\beta\|_1\right],\]
which reduces to ridge-penalized logistic regression for \(\alpha=0\) and to the lasso for \(\alpha=1\).
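To make the log-odds transformation concrete, here is a minimal numeric sketch in R; the coefficients are purely illustrative, not fitted values:
beta0 <- -1.5 # illustrative intercept (not a fitted value)
beta <- c(0.8, 0.3) # illustrative coefficient vector
x <- c(1.2, -0.5) # one observation
log_odds <- beta0 + sum(beta * x) # linear predictor: beta_0 + beta^T x
p <- exp(log_odds) / (1 + exp(log_odds)) # P(G = 1 | X = x)
c(log_odds = log_odds, probability = p)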
For further details on the computational mechanics, consult the
Glmnet Vignette by typing
vignette("glmnet_beta") in your console.
The glmnet() function expects an input matrix
X and the response vector y. To fit
a logistic regression model we add the argument
family = "binomial" to the function call. In the previous
sections we prepared the response vectors and the model matrices for the
training and test sets (y_train, y_test,
X_train, X_test, X_train_PC3,
X_test_PC3, and pca_scores_3). Let us load all
the data we need into our workspace.
# download the model input data from the previous section
# (skip this step if the file is already available locally)
download.file("https://userpage.fu-berlin.de/soga/data/r-data/pca_data_30300.RData",
              destfile = "pca_data_30300.RData", mode = "wb")
# load model input data into the workspace
load("pca_data_30300.RData")
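As a quick sanity check, we can confirm that the objects arrived in the workspace (assuming the .RData file contains the objects named above):
# list the objects restored from the .RData file
ls()
# dimensions of the training model matrix (rows = stations, columns = features)
dim(X_train)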
The general procedure is as follows: we fit the model on the training set, predict the response for the test set, and evaluate the model performance by calculating the model accuracy, the confusion matrix and Cohen’s kappa. The confusion matrix is a table that visualizes the performance of a predictive model. Each column of the matrix represents the instances in an actual class, while each row represents the instances in a predicted class (or sometimes vice versa).
Cohen’s kappa coefficient, \(κ\), is a statistic which measures inter-rater agreement for categorical items. It takes a value of \(κ=0\) if the agreement is entirely due to chance. A value of \(κ=1\) indicates perfect agreement.
The definition of \(κ\) is:
\[\kappa=\frac{p_0-p_e}{1-p_e},\] where \(p_0\) is the relative observed agreement (identical to accuracy), and \(p_e\) is the hypothetical probability of chance agreement, using the observed data to calculate the probabilities of each observer randomly saying each category.
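As a quick illustration of the formula, \(κ\) can be computed by hand from any 2×2 confusion matrix; the sketch below uses counts that happen to match the baseline model’s confusion matrix derived later in this section:
# 2x2 confusion matrix: rows = predicted class, columns = actual class
tab <- matrix(c(27, 3, 1, 17), nrow = 2,
              dimnames = list(Prediction = c("0", "1"),
                              Reference = c("0", "1")))
n <- sum(tab)
p0 <- sum(diag(tab)) / n # observed agreement (identical to accuracy)
pe <- sum(rowSums(tab) * colSums(tab)) / n^2 # hypothetical chance agreement
(p0 - pe) / (1 - pe) # Cohen's kappa, approx. 0.83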
In the baseline model we predict the response variable
HOT.DAY by including all 13 features in the data set. We
use the \(L_2\)-regularized logistic
regression model (ridge regression: alpha = 0) provided
by the glmnet package. In particular, we apply the
cv.glmnet() function, which by default performs 10-fold cross-validation on the data set and,
given the additional argument type.measure = "class", evaluates the
model performance by the misclassification error.
library(glmnet)

grid <- 10^seq(1, -3, length = 100) # set lambda sequence

set.seed(200)
m_log_baseline <- cv.glmnet(X_train,
                            y_train,
                            family = "binomial",
                            type.measure = "class", # misclassification error
                            alpha = 0, # ridge regression
                            lambda = grid)
Calling the plot() function on the
cv.glmnet object provides a nice graphical representation
of the relation between misclassification error and \(λ\) (on the log scale). The log
transformation is applied for visualization, because \(λ\) covers a wide numeric range. This is
purely a plotting choice to make interpretation easier; all computations
use the original \(λ\) values.
plot(m_log_baseline)
The plot shows that the misclassification error increases with \(λ\). The dotted vertical lines indicate the
value of \(λ\) that gives the minimum
mean cross-validated error and the largest value of \(λ\) such that the error is within 1
standard error of the minimum. We can access these values of \(λ\) via the object attributes
lambda.min and lambda.1se.
m_log_baseline$lambda.min
## [1] 0.001204504
m_log_baseline$lambda.1se
## [1] 0.0869749
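Before turning to prediction, it can be instructive to inspect the ridge-shrunken coefficients at the selected \(λ\). The coef() accessor for cv.glmnet objects returns them (an optional check; the output is omitted here):
# regularized coefficients at the lambda minimizing the cross-validated error
coef(m_log_baseline, s = "lambda.min")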
Finally, we use the test data set, X_test, and the best
\(λ\) to predict the response vector
y_test. We have to pass an additional argument to the
predict() function to indicate whether it should return
class labels (type = "class") or
probabilities (type = "response").
# prediction for the test set
m_log_baseline_pred <- predict(m_log_baseline, # learned model
                               X_test, # new data
                               type = "class", # class labels
                               s = "lambda.min") # lambda minimizing the CV error
tbl <- table(m_log_baseline_pred)
tbl
## m_log_baseline_pred
## 0 1
## 28 20
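For comparison, the same call with type = "response" returns the predicted probabilities instead of class labels; thresholding them at 0.5 reproduces the class labels above (a brief sketch):
# predicted probabilities P(HOT.DAY = 1 | x) for the test set
m_log_baseline_prob <- predict(m_log_baseline,
                               X_test,
                               type = "response",
                               s = "lambda.min")
# class labels via a 0.5 probability threshold
head(ifelse(m_log_baseline_prob > 0.5, "1", "0"))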
To assess the model performance we use the functionality of the
caret package. The function confusionMatrix()
creates a tabular representation of the confusion matrix. In addition, the
confusionMatrix object stores many other model metrics.
Here we just ask for the accuracy and Cohen’s
kappa. Accuracy measures the proportion of correct predictions,
while Cohen’s kappa indicates how well the predicted classes agree with
the actual classes after accounting for chance.
library(caret)
cm <- confusionMatrix(data = as.factor(m_log_baseline_pred), reference = y_test)
cm$table
## Reference
## Prediction 0 1
## 0 27 1
## 1 3 17
cm$overall["Accuracy"]
## Accuracy
## 0.9166667
cm$overall["Kappa"]
## Kappa
## 0.826087
The model performs very well and beats our accuracy benchmark of \(0.63\). The two model metrics, accuracy and Cohen’s kappa, are 0.92 and 0.83, respectively. In terms of classification accuracy we count 27 true negative and 17 true positive classifications. In contrast, we misclassified only 4 observations (1 false negative and 3 false positives).
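As an aside, caret also stores the no-information rate, the accuracy obtained by always predicting the majority class, in the confusionMatrix object; on this test set it is close to the benchmark of \(0.63\) (a quick check):
# no-information rate: accuracy of always predicting the majority class
cm$overall["AccuracyNull"]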
Now, we build an \(L_2\)-regularized
logistic regression model to predict the response variable
HOT.DAY by including just the first three principal
component scores. As in the previous section, we compute the confusion
matrix, the model accuracy and Cohen’s kappa for the test set to
evaluate the goodness of fit of the model.
# build the model
grid <- 10^seq(1, -3, length = 100) # set lambda sequence

set.seed(200)
m_log_PC3 <- cv.glmnet(X_train_PC3,
                       y_train,
                       family = "binomial",
                       type.measure = "class", # misclassification error
                       alpha = 0, # ridge regression
                       lambda = grid)
plot(m_log_PC3)
# prediction for the test set
m_log_PC3_pred <- predict(m_log_PC3, # learned model
                          X_test_PC3, # new data
                          type = "class", # class labels
                          s = "lambda.min") # lambda minimizing the CV error
cm <- confusionMatrix(data = as.factor(m_log_PC3_pred), reference = y_test)
cm$table
## Reference
## Prediction 0 1
## 0 27 4
## 1 3 14
cm$overall["Accuracy"]
## Accuracy
## 0.8541667
cm$overall["Kappa"]
## Kappa
## 0.6853933
The model performs reasonably well compared to the baseline model. The PC3 model accuracy of 0.85 is higher than our accuracy benchmark of \(0.63\). Cohen’s kappa is 0.69, indicating substantial agreement between the predicted and the actual classes beyond chance; note, however, that this metric can vary with different random data splits.
We count 27 true negative and 14 true positive classifications. In contrast, we misclassified only 7 observations (4 false negatives and 3 false positives).
Please note that we projected the 13-dimensional data set onto a 3-dimensional feature space. While the model still captures the main patterns, its performance should be interpreted with some caution given the variability of Cohen’s kappa across different data splits.
Let us visualize the classification results of the PC3 model on a map.
library(geodata)
library(sf)
library(ggplot2)
# Retrieve Federal States using the gadm() function from the geodata package
G1 <- geodata::gadm(country = "DEU", level = 1, tempdir())
G1_sf <- st_as_sf(G1)
# Predict the occurrence of hot days at each weather station based on the PC3 model (m_log_PC3), using the full-data PC scores pca_scores_3
m_log_PC3_full <- predict(m_log_PC3, # learned model
                          pca_scores_3, # new data
                          type = "class", # class labels
                          s = "lambda.min") # lambda minimizing the CV error
# add a new column called "Classification" to the data set:
# "FN" for false negative,
# "TN" for true negative,
# "FP" for false positive and
# "TP" for true positive
dwd_data$PC3_prediction <- as.vector(m_log_PC3_full) # assign prediction vector to data set
dwd_data$Classification <- NA # initialize empty column (NULL would drop the column)
# populate column
dwd_data$Classification[dwd_data$HOT.DAY == 1 & dwd_data$PC3_prediction == 1] <- "TP"
dwd_data$Classification[dwd_data$HOT.DAY == 0 & dwd_data$PC3_prediction == 0] <- "TN"
dwd_data$Classification[dwd_data$HOT.DAY == 1 & dwd_data$PC3_prediction == 0] <- "FN"
dwd_data$Classification[dwd_data$HOT.DAY == 0 & dwd_data$PC3_prediction == 1] <- "FP"
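As a quick consistency check before plotting, we can tabulate the four outcome categories across all stations (the exact counts depend on the full data set):
# frequency of each classification outcome
table(dwd_data$Classification)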
# plot the map
ggplot() +
  geom_sf(data = G1_sf,
          colour = "grey10", fill = "#FFFFFF") +
  geom_point(data = dwd_data,
             aes(x = lon, y = lat, colour = Classification),
             alpha = 0.7,
             size = 2) +
  scale_colour_manual(values = c("#D55E00", # FN (orange)
                                 "#CC79A7", # FP (pink)
                                 "#56B4E9", # TN (blue)
                                 "#009E73")) + # TP (green)
  theme_bw() +
  xlab("Longitude") + ylab("Latitude") +
  coord_sf()
The map shows an interesting pattern. First, we notice that the data set is slightly imbalanced, with non-hot days (the majority class) occurring more frequently than hot days (the minority class). This imbalance leads the model to favor the majority class, which results in many true negatives but also many false negatives. In other words, the model achieves high overall accuracy simply by predicting the more frequent class, which is problematic because it fails to identify a significant number of actual hot days, exactly the cases we are most interested in.
Looking at the spatial distribution, we see that true negatives (blue) dominate in the northern and western regions, where non-hot days are common and correctly predicted. True positives (green) are concentrated in the southern and southeastern regions, where hot days occur more often and the model performs better. However, false negatives (orange) are clustered in the transition zones between cooler and hotter regions, such as central Germany and southern edges, indicating that the model often misses hot days in borderline cases. False positives (pink), although very rare, also seem to appear in these transitional areas.
Citation
The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us via e-mail at soga[at]zedat.fu-berlin.de.
Please cite as follows: Hartmann, K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.