As a first example for machine learning algorithms, we will take a close look at artificial neuronal networks (ANN). We start with the simple building blocks, called artificial neurons, and then connect several of these neurons to form a neuronal network.


McCulloch-Pitts Neuron

The basic idea behind an artificial neuron is to mimic the signal processing in the human brain, which is based on highly interconnected cells, called neurons. Simply put, the neuron receives input signals from neighboring neurons through its dendrites. Each dendrite has a different response to the incoming signal (“the signal is weighted”). Within the soma, the signals from all the dendrites are summed. Once the combined signal is above a certain threshold, the axon sends a signal pulse to its terminal, which is connected to other neurons.

Signal flow in a neuron, Source: Egm4313.s12 (Prof. Loc Vu-Quoc), CC BY-SA 4.0


The Mathematical Model

In 1943, Waren McCulloch and Walter Pitts, described this neuron as logic gates with a binary output that depends on the weighted sum of several inputs.

In the following we will describe the model of an artificial neuron in today’s notation. Suppose we have data with \(n\) features. This data is arranged in an \(n\) dimensional vector

\[ \mathbf{x}=\begin{pmatrix}x_1\\ x_2\\ x_3\\\vdots\\ x_n\end{pmatrix}, \]

which forms the input to the neuron. The dendrites are modeled by weights, one for each feature, which are also arranged in an n dimensional vector

\[ \mathbf{w}=\begin{pmatrix}w_1\\ w_2\\ w_3\\\vdots\\ w_n\end{pmatrix}. \]

To calculate the response of the neuron, we first build the weighted sum of the input, usually called the net input \(z\)

\[ z:=\sum_{i=1}^nw_ix_i=\mathbf{w}^T\mathbf{x} \, . \]

Note, that in artificial neural networks, w and x become matrices. Therefore, we already use the matrix product here and write \(\mathbf{w}^T\mathbf{x}\), i.e. the dot product of a \((1×n)\) and an \((n×1)-\)matrix.

The net input is then fed into an activation function \(σ(z)\). There are a variety of possible functions, each with its own advantages and disadvantages. We will see some examples when we talk about different types of neurons. After the activation function, which usually provides a continuous output, a threshold function (step function) decides the output of the neuron \[ \hat{y}:=\begin{cases}1&\text{if }\sigma(z)\geq\theta\\0&\text{otherwise}\end{cases}. \] Usually the notation is further simplified by introducing a bias unit \(b:=−θ\) into the net input, which gives us

  • net input

\[ z:=\sum_{i=1}^nw_ix_i+b=\mathbf{w}^T\mathbf{x}+b \]

  • activation function

\[ \sigma(z) \]

  • threshold function

\[ \hat{y}:=\begin{cases}1&\text{if }\sigma(z)\geq 0\\0&\text{otherwise}\end{cases} \, . \]

The following figure shows the basic structure of an artificial neuron.



Perceptron Learning Rule

With the above definition of an artificial neuron, we are able to use it for a binary classification on any given input. But we have no idea if our chosen weights are optimal for the given task, let alone how to even choose them in the first place. In 1957, Frank Rosenblatt introduced the idea of the perceptron-learning-rule. This rule allows the neuron to determine the optimal set of weights and bias unit by learning from the data. For this simple version of an artificial neuron, the activation function is simply defined as the identity function \(σ(z):=id(z)=z\), which basically means that it has no effect at all.

In our training data set, we have the combination of m records for each of the n features and their corresponding m targets, which are supposed to be either \(0\) or \(1\). Note, that this means that we are doing supervised learning for a binary classification here. Let \(\mathbf{x}^{(j)}\) be the input, i.e. the values of the features for the \(j\)-th training example, and \(y^{(j)}\) be the corresponding output.

  1. Initialize the weights and bias unit to \(0\) or some small random numbers.
  2. For each epoch:
    • For each training record \((\mathbf{x}^{(j)},y^{(j)})\):
      1. Compute the output \(\hat{y}^{(j)}\)
      2. Compute the error \(err=y^{(j)}−\hat{y}^{(j)}\)
      3. Update the weight and bias unit \[\begin{align*}\mathbf{w_{new}}&:=\mathbf{w_{old}}+\eta\cdot\text{err}\cdot\mathbf{x}\\b_{new}&:=b_{ol d}+\eta\cdot\text{err}\end{align*}\]

This rule is an example of online learning, which means that the weights and bias unit are updated after each training record. An epoch is a loop over the entire training data set. The main part of the perceptron rule is updating the weights and the bias unit, which is only done if the neuron misclassifies the input. If \(y^{(j)}=0\) and \(\hat{y}^{(j)}=1\), the value of \(z\) is too high, in which case the weights and the bias unit are decreased. If \(y^{(j)}=0\) and \(\hat{y}^{(j)}=0\), the value of \(z\) is too low, in which case the weights and the bias unit are increased. The value \(η\) is the learning rate and typically a value between \(0\) and \(1\). Note that the learning rate only has an effect, if the weights are initialized with random numbers. Otherwise, \(η\) only scales the weight vector, but does not change its direction.

If we imagine the \(n\) dimensional feature space, it is divided into two parts: in one region the output of the neuron is \(0\), in the other it is \(1\). The decision boundary between the two classes is given by \(z=\mathbf{w}^T\mathbf{x}+b=0\). This describes a flat affine hyperplane, which is the reason why the perceptron is a linear classifier. The weight vector is the normal vector of the plane and the bias unit is related to the distance of the plane from the origin (to be precise: \(d=\frac{-b}{∥\mathbf{w}∥}\), where the sign corresponds to the side of the plane).

Due to the linear nature of the decision boundary, the perceptron will only converge, i.e. find a solution that satisfies all training records, if the data is linearly separable. But, if this is the case, the algorithm is guaranteed to find an optimal solution (for a proof see e.g. a lecture by Raschka). Otherwise, the algorithm will run indefinitely, so a maximum number of epochs must be specified to stop it. In this case, however, it is not guaranteed that a good solution will be obtained and other algorithms should be used.

 


A Toy Example

To demonstrate the working principle of the perceptron algorithm, we will implement one ourselves. But first, we need to define a helper function that allows us to draw the decision boundary at an arbitrary step.

library(ggplot2)
library(reshape2)

plot_decision_boundary <- function(X, y, classifier, resolution = 0.02) {
  # Define the grid
  x1_min <- min(X[, 1]) - 1
  x1_max <- max(X[, 1]) + 1
  x2_min <- min(X[, 2]) - 1
  x2_max <- max(X[, 2]) + 1
  xx1 <- seq(x1_min, x1_max, by = resolution)
  xx2 <- seq(x2_min, x2_max, by = resolution)
  grid <- expand.grid(xx1 = xx1, xx2 = xx2)

  p <- ggplot()

  if (!is.null(classifier)) {
    # For each grid point, predict the class
    lab <- classifier$predict(as.matrix(grid), classifier$weights, classifier$bias)
    grid$lab <- lab

    # Plot the decision regions
    p <- p + geom_tile(data = grid, aes(x = xx1, y = xx2, fill = factor(lab)), alpha = 0.3) +
      scale_fill_manual(values = c("blue", "orange"))
  }

  # Plot the data points
  data <- data.frame(X, Class = as.factor(y))
  p <- p + geom_point(data = data, aes(x = X[, 1], y = X[, 2], color = Class, shape = Class), alpha = 0.8, size = 3) +
    scale_color_manual(values = c("blue", "orange")) +
    scale_shape_manual(values = c(19, 15)) +
    labs(x = "Feature 1", y = "Feature 2") +
    theme_minimal()

  return(p)
}

Now, we implement a simple version of a perceptron.

SimplePerceptron <- function(epochs, eta = 1) {
  # Define the list to hold the perceptron parameters
  perceptron <- list(epochs = epochs, eta = eta)
  # Initialize weights and bias
  perceptron$weights <- rnorm(ncol(X))
  perceptron$bias <- runif(1) * 2 - 0.8


  # Define the fit function
  perceptron$fit <- function(X, y) {
    steps <- 0
    # Iterate over epochs
    for (e in 1:perceptron$epochs) {
      for (i in 1:nrow(X)) {
        xi <- X[i, ]
        yi <- y[i]
        prediction <- perceptron$predict(xi, perceptron$weights, perceptron$bias)
        error <- yi - prediction
        perceptron$weights <- perceptron$weights + perceptron$eta * error * xi
        perceptron$bias <- perceptron$bias + perceptron$eta * error

        if (error != 0) {
          steps <- steps + 1
          plot <- plot_decision_boundary(X, y, perceptron) +
            ggtitle(paste("Update No.", steps)) +
            labs(x = "Feature 1", y = "Feature 2") +
            theme(legend.position = "bottom") +
            guides(fill = FALSE)
          print(plot)
        }
      }
    }
  }

  # Define the net input function
  perceptron$net_input <- function(X, weights, bias) {
    return(as.numeric(X %*% weights + bias))
  }

  # Define the predict function
  perceptron$predict <- function(X, weights, bias) {
    return(ifelse(perceptron$net_input(X, weights, bias) >= 0, 1, 0))
  }

  return(perceptron)
}

To test our implementation, we create some toy Gaussian distributed random data with two features and divide it into two linear separable classes.

# Create toy data
set.seed(0)
X <- matrix(rnorm(60), ncol = 2)
y <- ifelse(X[, 1] + X[, 2] >= 0, 1, 0)

# Plot the data
plot_decision_boundary(X, y, NULL) +
  ggtitle("Toy data") +
  theme(legend.position = "bottom") +
  guides(fill = FALSE)

After all the preparation, we can now test our perceptron.

ppn <- SimplePerceptron(epochs = 1, eta = 0.1)
ppn$fit(X, y)