In 1960, Bernard Widrow and Ted Hoff published an improved learning rule for artificial neurons. The main difference from the perceptron is the way the weights and the bias unit are updated. A schematic view of the process is shown in the following figure.


The Widrow-Hoff learning rule


Unlike the perceptron, where we defined the error based on the predicted output, here we use the result of the activation function to define a loss (or cost) function. The goal is then to find the minimum of this loss function, which corresponds to the optimal solution to our task.

One way to find this minimum is a technique called gradient descent. Depending on the number of training records used for each learning step, we distinguish between full batch, stochastic, and mini-batch gradient descent.


Full batch gradient descent

The main reason for using the activation function instead of the predicted output is that it is a differentiable function, so we can use calculus to find the minimum of the loss function. Given such a function, we can calculate its gradient, a vector that points in the direction of the steepest ascent. The negative of the gradient therefore points in the direction of the steepest descent. If we follow the negative gradient, we will find the minimum of the function (at least if the activation function is linear; otherwise we might just end up in a local minimum). This technique is called gradient descent, and it means that we update the weights and the bias unit as follows:

\[\begin{align}\mathbf{w_{new}}&:=\mathbf{w_{old}}-\eta\nabla_wL(\mathbf{w},b)\\b_{new}&:=b_{old}-\eta\frac{\partial L(\mathbf{w},b)}{\partial b}\end{align}\]

As with the perceptron, \(\eta\) is a learning rate. If \(\eta\) is too small, the descent takes many steps; if it is too high, we might overshoot the minimum. Therefore, finding an appropriate value is an important step in fine-tuning the training.

By using gradient descent, we have implicitly changed the learning step from online learning, as with the perceptron, to full batch learning. This means that the loss function is calculated from the results of the activation function for the entire training data set.
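
To make this concrete, the following sketch implements full batch gradient descent for a single linear unit (identity activation, MSE loss) on an artificial data set; the data and all variable names are purely illustrative.

set.seed(1)

# Toy data: two features and a noisy linear target
n <- 100
X <- cbind(rnorm(n), rnorm(n))
y <- 2 * X[, 1] - 1 * X[, 2] + 0.5 + rnorm(n, sd = 0.1)

w <- c(0, 0)  # initial weights
b <- 0        # initial bias unit
eta <- 0.1    # learning rate

for (epoch in 1:200) {
  z <- as.vector(X %*% w) + b   # net input for the entire training set
  error <- y - z                # errors of the (identity) activation
  # Gradient of the MSE loss with respect to w and b
  grad_w <- -2 / n * as.vector(t(X) %*% error)
  grad_b <- -2 / n * sum(error)
  w <- w - eta * grad_w         # gradient descent update
  b <- b - eta * grad_b
}
round(c(w, b), 3)  # close to the true values 2, -1 and 0.5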


Stochastic gradient descent (SGD)

An alternative to full batch learning is the so-called stochastic gradient descent, which updates the weights and the bias unit after every single training record. For the MSE loss function, the update for training record \(j\) is then:

\[\begin{align}\mathbf{w_{new}}&:=\mathbf{w_{old}}-\eta\nabla_w\left(y^{(j)}-\sigma\left(z^{(j)}\right)\right)^2\\b_{new}&:=b_{old}-\eta\frac{\partial \left(y^{(j)}-\sigma\left(z^{(j)}\right)\right)^2}{\partial b}\end{align}\]

SGD usually reaches the minimum faster than full batch gradient descent because the weights are updated more often. However, the path towards the minimum is noisier, because the loss is computed from a single training record at a time. This is actually advantageous: for nonlinear activation functions it becomes easier to escape local minima and to find the global minimum. To avoid patterns arising from the order of the training records, the training set is shuffled at the beginning of each epoch, which is where the term stochastic in SGD comes from.
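
A minimal sketch of SGD for the same linear unit as above: the weights and the bias unit are updated after every single training record, and the training set is shuffled at the beginning of each epoch (again, data and names are only illustrative).

set.seed(1)

# The same toy data as above
n <- 100
X <- cbind(rnorm(n), rnorm(n))
y <- 2 * X[, 1] - 1 * X[, 2] + 0.5 + rnorm(n, sd = 0.1)

w <- c(0, 0)
b <- 0
eta <- 0.01  # smaller learning rate, since we update after every record

for (epoch in 1:50) {
  for (j in sample(n)) {                 # shuffled order -> "stochastic"
    z_j <- sum(w * X[j, ]) + b           # net input of training record j
    error_j <- y[j] - z_j
    w <- w + eta * 2 * error_j * X[j, ]  # gradient step for record j only
    b <- b + eta * 2 * error_j
  }
}
round(c(w, b), 3)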


Mini-batch gradient descent

Defining the loss function on a small subset of the training set (a mini-batch) is a compromise between full batch and stochastic gradient descent. It reaches the minimum faster than full batch mode, and it still allows the use of vectorized operations, which improves computational efficiency.
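
The sketch below (again with illustrative toy data) computes the gradient on random mini-batches of 10 records, so each update is still a vectorized operation, but the weights change several times per epoch.

set.seed(1)

n <- 100
X <- cbind(rnorm(n), rnorm(n))
y <- 2 * X[, 1] - 1 * X[, 2] + 0.5 + rnorm(n, sd = 0.1)

w <- c(0, 0)
b <- 0
eta <- 0.05
batch_size <- 10

for (epoch in 1:100) {
  idx <- sample(n)  # shuffle once per epoch
  for (batch in split(idx, ceiling(seq_along(idx) / batch_size))) {
    Xb <- X[batch, , drop = FALSE]
    error <- y[batch] - (as.vector(Xb %*% w) + b)  # errors of the mini-batch
    w <- w + eta * 2 / length(batch) * as.vector(t(Xb) %*% error)
    b <- b + eta * 2 / length(batch) * sum(error)
  }
}
round(c(w, b), 3)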


Different regression types

Since the Widrow-Hoff rule does not specify the activation and loss functions, we can build different types of learning algorithms.

Linear regression

A common loss function is the mean squared error (MSE):

\[L(\mathbf{w},b)=\frac{1}{n}\sum_{i=1}^n\left(y^{(i)}-\sigma\left(z^{(i)}\right)\right)^2\]

Obviously, if the MSE is small, the predictions are close to the observed values. If we additionally choose the identity function as activation,

\[\sigma(z):=\mathrm{id}(z)=z=\mathbf{w}^T\mathbf{x}+b\]

the whole learning rule is nothing other than simple linear regression, fitted by gradient descent.
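
As a quick, illustrative check, the following snippet fits the linear unit by full batch gradient descent and compares the result with the coefficients that lm() computes analytically; up to the tolerance of the iterative fit, the two agree.

set.seed(1)

n <- 100
X <- cbind(x1 = rnorm(n), x2 = rnorm(n))
y <- 2 * X[, 1] - 1 * X[, 2] + 0.5 + rnorm(n, sd = 0.1)

w <- c(0, 0)
b <- 0
eta <- 0.1
for (epoch in 1:500) {
  error <- y - (as.vector(X %*% w) + b)
  w <- w + eta * 2 / n * as.vector(t(X) %*% error)
  b <- b + eta * 2 / n * sum(error)
}

# Intercept and slopes from both approaches
rbind(gradient_descent = c(b, w),
      lm = coef(lm(y ~ x1 + x2, data = data.frame(X, y))))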

Logistic regression

If we use the logistic sigmoid function

\[\sigma(z):=\frac{1}{1+e^{-z}}\]

and define the loss function as the negative log-likelihood

\[L:=-\log p(y\,|\,\mathbf{x},\mathbf{w},b)\]

we have a logistic regression. Minimizing this loss is equivalent to maximizing the likelihood of the observed classes.

NB: We can use both versions for either classification or regression. In the case of classification, we apply the threshold function to the result of the activation function after learning; in the case of regression, we simply use the outcome of the activation function as the regression result.
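
The following sketch (toy data and all names chosen purely for illustration) trains a logistic regression with stochastic gradient descent on the negative log-likelihood and then applies a 0.5 threshold to the activation to obtain class labels, in the spirit of the note above.

sigmoid <- function(z) 1 / (1 + exp(-z))

set.seed(0)
n <- 200
X <- matrix(rnorm(2 * n), ncol = 2)
y <- ifelse(X[, 1] + X[, 2] >= 0, 1, 0)  # two classes split along x2 = -x1

w <- c(0, 0)
b <- 0
eta <- 0.1

for (epoch in 1:100) {
  for (j in sample(n)) {
    p_j <- sigmoid(sum(w * X[j, ]) + b)   # activation = predicted probability
    w <- w + eta * (y[j] - p_j) * X[j, ]  # gradient step on the log-likelihood
    b <- b + eta * (y[j] - p_j)
  }
}

# Classification: threshold the activation at 0.5
pred <- ifelse(sigmoid(as.vector(X %*% w) + b) >= 0.5, 1, 0)
mean(pred == y)  # training accuracy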


Example

The following example shows how to use logistic regression as a classifier on a toy data set.

library(ggplot2)

plot_decision_boundary <- function(X, y, model) {
  # Grid of points covering the feature space
  prob_grid <- expand.grid(
    X1 = seq(min(X[, 1]) - 1, max(X[, 1]) + 1, by = 0.02),
    X2 = seq(min(X[, 2]) - 1, max(X[, 2]) + 1, by = 0.02)
  )
  # Predicted probabilities from the model on the grid
  prob_grid$pred <- predict(model, newdata = prob_grid, type = "response")

  # Plot
  ggplot() +
    geom_tile(data = prob_grid, aes(x = X1, y = X2, fill = pred), alpha = 0.3) +
    geom_point(data = data.frame(X, y), aes(x = X1, y = X2, color = factor(y))) +
    scale_fill_gradient(low = "dodgerblue", high = "firebrick1") +
    scale_color_manual(values = c("dodgerblue", "firebrick1")) +
    labs(title = "Logistic Regression", x = "Feature 1", y = "Feature 2", fill = "Prediction", color = "Class") +
    theme_minimal()
}
# Set seed for reproducibility
set.seed(0)

# Create some toy data
X <- matrix(rnorm(60), ncol = 2)
colnames(X) <- c("X1", "X2")

# Divide the data into two classes along the line x2 = -x1
y <- ifelse(X[, 1] + X[, 2] >= 0, 1, 0)
# Fit a logistic regression model
logreg_model <- glm(y ~ X1 + X2, family = binomial(), data = data.frame(X, y))

summary(logreg_model)
## 
## Call:
## glm(formula = y ~ X1 + X2, family = binomial(), data = data.frame(X, 
##     y))
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept)     14.33    8136.52   0.002    0.999
## X1             306.76  103109.80   0.003    0.998
## X2             293.76  101348.04   0.003    0.998
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4.1455e+01  on 29  degrees of freedom
## Residual deviance: 1.7156e-08  on 27  degrees of freedom
## AIC: 6
## 
## Number of Fisher Scoring iterations: 25
plot_decision_boundary(X, y, logreg_model)

In a real-world example, we would of course need to check the accuracy of our trained model. We will do this when we talk about the “standard” machine learning workflow.


Citation

The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us via e-mail at soga[at]zedat.fu-berlin.de.

You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follows: Hartmann, K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.