As a first example of machine learning algorithms, we will take a close look at artificial neural networks (ANNs). We start with the simple building blocks, called artificial neurons, before connecting several of these neurons to form a neural network.
The basic idea behind an artificial neuron is to mimic the signal processing in the human brain, which is based on highly interconnected cells, called neurons. Simply put, the neuron receives input signals from neighboring neurons through its dendrites. Each dendrite has a different response to the incoming signal ("the signal is weighted"). Within the soma, the signals from all the dendrites are summed. Once the combined signal is above a certain threshold, the axon sends a signal pulse to its terminal, which is connected to other neurons.
Signal flow in a neuron, Source: Egm4313.s12 (Prof. Loc Vu-Quoc), CC BY-SA 4.0.
In 1943, Warren McCulloch and Walter Pitts described this neuron as a logic gate with a binary output that depends on the weighted sum of several inputs.
In the following, we will describe the model of an artificial neuron in today's notation.
Suppose we have data with $n$ features. This data is arranged in an $n$-dimensional vector $$\mathbf{x}=\begin{pmatrix}x_1\\ x_2\\ x_3\\\vdots\\ x_n\end{pmatrix},$$ which forms the input to the neuron. The dendrites are modeled by weights, one for each feature, which are also arranged in an $n$-dimensional vector $$\mathbf{w}=\begin{pmatrix}w_1\\ w_2\\ w_3\\\vdots\\ w_n\end{pmatrix}.$$ To calculate the response of the neuron, we first build the weighted sum of the inputs, usually called the net input, $$z:=\sum_{i=1}^nw_ix_i=\mathbf{w}^T\mathbf{x} \, .$$ Note that in artificial neural networks, $\mathbf{w}$ and $\mathbf{x}$ become matrices. Therefore, we already use the matrix product here and write $\mathbf{w}^T\mathbf{x}$, i.e., the product of a $1\times n$ and an $n\times 1$ matrix, which equals the dot product of the two vectors.
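The net input can be computed in a few lines of NumPy. The following is a minimal sketch with made-up numbers (the values of `x`, `w` and `X` are assumptions chosen purely for illustration); it shows that the explicit sum and the matrix product give the same result, and how the same expression handles several records at once.

```python
import numpy as np

# Hypothetical example values: one record with n = 3 features
x = np.array([1.0, 2.0, 3.0])
w = np.array([0.5, -1.0, 0.25])   # one weight per feature

# Net input as an explicit weighted sum ...
z_sum = sum(w_i * x_i for w_i, x_i in zip(w, x))
# ... and as a (matrix) product w^T x
z_dot = w @ x

print(z_sum, z_dot)

# Several records stacked as rows of a matrix: one net input per record
X = np.array([[1.0, 2.0, 3.0],
              [0.0, 1.0, -1.0]])
z_batch = X @ w
print(z_batch)
```

Writing the records as rows of `X` is the layout also used by the implementation further below.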
The net input is then fed into an activation function $\sigma(z)$. There are a variety of possible functions, each with its own advantages and disadvantages. We will see some examples when we talk about different types of neurons. After the activation function, which usually provides a continuous output, a threshold function (step function) decides the output of the neuron $$\hat{y}:=\begin{cases}1&\text{if }\sigma(z)\geq\theta\\0&\text{otherwise}\end{cases}$$
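Usually the notation is further simplified by introducing a bias unit $b:=-\theta$ into the net input, which gives us $$z:=\mathbf{w}^T\mathbf{x}+b \quad\text{and}\quad \hat{y}:=\begin{cases}1&\text{if }\sigma(z)\geq 0\\0&\text{otherwise.}\end{cases}$$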
The following figure shows the basic structure of an artificial neuron.
With the above definition of an artificial neuron, we are able to use it for a binary classification on any given input. But we have no idea if our chosen weights are optimal for the given task, let alone how to even choose them in the first place.
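To make the definition concrete, here is a small sketch of a single neuron with hand-picked parameters. The function name `neuron_output` and the values of `w` and `theta` are assumptions for illustration only; with the identity as activation function they happen to reproduce a logical AND gate, in the spirit of the logic-gate picture by McCulloch and Pitts.

```python
import numpy as np

def neuron_output(X, w, b):
    """Binary output of a neuron with identity activation and bias b."""
    z = X @ w + b                  # net input including the bias unit
    return np.where(z >= 0, 1, 0)  # threshold at zero

# Hypothetical, hand-picked parameters
w = np.array([1.0, 1.0])
theta = 1.5
b = -theta                         # bias unit replaces the threshold

# All four combinations of two binary inputs
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)

y_hat = neuron_output(X, w, b)
# Equivalent formulation with an explicit threshold and no bias
y_hat_theta = np.where(X @ w >= theta, 1, 0)

print(y_hat)        # only the input (1, 1) is mapped to 1
print(y_hat_theta)
```

Finding such parameters by hand is feasible for two binary inputs, but quickly becomes impractical for real data with many features.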
In 1957, Frank Rosenblatt introduced the idea of the perceptron learning rule. This rule allows the neuron to determine a suitable set of weights and the bias unit by learning from the data.
For this simple version of an artificial neuron, the activation function is simply defined as the identity function $\sigma(z):=\text{id}(z)=z$, which basically means that it has no effect at all.
In our training data set, we have $m$ records, each consisting of values for the $n$ features, and their corresponding $m$ targets, which are supposed to be either 0 or 1. Note that this means that we are doing supervised learning for a binary classification here. Let $\mathbf{x}^{(j)}$ be the input, i.e., the values of the features for the $j$-th training example, and $y^{(j)}$ be the corresponding target.
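In code, such a training set is commonly stored as a matrix with one row per record plus a vector of targets. The sketch below uses invented numbers (all values are assumptions) only to illustrate the layout and the notation $\mathbf{x}^{(j)}$, $y^{(j)}$.

```python
import numpy as np

# Hypothetical training set: m = 4 records, n = 2 features
X = np.array([[ 0.5,  1.2],
              [-1.0,  0.3],
              [ 2.1, -0.7],
              [-0.4, -1.5]])
y = np.array([1, 0, 1, 0])   # one binary target per record

m, n = X.shape
print(f"m = {m} records, n = {n} features")

# Row j-1 of X holds the j-th training example (Python counts from 0)
for j, (x_j, y_j) in enumerate(zip(X, y), start=1):
    print(f"x^({j}) = {x_j}, y^({j}) = {y_j}")
```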
This rule is an example of online learning, which means that the weights and the bias unit are updated after each training record. An epoch is one pass over the entire training data set. The main part of the perceptron rule is the update of the weights and the bias unit, $\mathbf{w}:=\mathbf{w}+\eta\,(y^{(j)}-\hat{y}^{(j)})\,\mathbf{x}^{(j)}$ and $b:=b+\eta\,(y^{(j)}-\hat{y}^{(j)})$, which only changes them if the neuron misclassifies the input. If $y^{(j)}=0$ and $\hat{y}^{(j)}=1$, the value of $z$ is too high, in which case the weights and the bias unit are changed by $-\eta\,\mathbf{x}^{(j)}$ and $-\eta$, which lowers $z$. If $y^{(j)}=1$ and $\hat{y}^{(j)}=0$, the value of $z$ is too low, in which case they are changed by $+\eta\,\mathbf{x}^{(j)}$ and $+\eta$, which raises $z$. The value $\eta$ is the learning rate, typically a value between 0 and 1. Note that the learning rate only affects the result if the weights are initialized with nonzero values, e.g., random numbers. If the weights and the bias unit start at zero, $\eta$ only scales the weight vector but does not change its direction.
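The two misclassification cases and the remark about the learning rate can be checked numerically. The following sketch is not the implementation used later in this section; the helper names `predict`, `perceptron_update` and `train` as well as all numbers are assumptions made for illustration. The update itself mirrors the rule described above: weights and bias change by $\eta$ times the error $y^{(j)}-\hat{y}^{(j)}$.

```python
import numpy as np

def predict(w, b, x):
    """Threshold the net input at zero (identity activation)."""
    return np.where(np.dot(x, w) + b >= 0, 1, 0)

def perceptron_update(w, b, x, y, eta):
    """One online update; nothing changes if x is classified correctly."""
    error = y - predict(w, b, x)          # -1, 0 or +1
    return w + eta * error * x, b + eta * error

def train(X, y, w, b, eta, epochs=10):
    """Apply the update to every record, for a fixed number of epochs."""
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            w, b = perceptron_update(w, b, xi, yi, eta)
    return w, b

x = np.array([2.0, 1.0])
# Target 0 but output 1: z is too high, so the update lowers it
w_down, b_down = perceptron_update(np.array([1.0, 1.0]), 0.0, x, y=0, eta=0.5)
# Target 1 but output 0: z is too low, so the update raises it
w_up, b_up = perceptron_update(np.array([-1.0, -1.0]), 0.0, x, y=1, eta=0.5)
print(w_down, b_down)
print(w_up, b_up)

# With zero initialization, a different eta only rescales the result
rng = np.random.default_rng(seed=0)
X = rng.standard_normal((30, 2))
y = np.where(X[:, 0] + X[:, 1] >= 0, 1, 0)
w_a, b_a = train(X, y, np.zeros(2), 0.0, eta=0.5)
w_b, b_b = train(X, y, np.zeros(2), 0.0, eta=1.0)
print(np.allclose(w_b, 2 * w_a), np.isclose(b_b, 2 * b_a))
```

Because the prediction only depends on the sign of $z$, rescaling both the weights and the bias by the same positive factor leaves every prediction unchanged, which is why the two runs from zero make the same sequence of updates.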
If we imagine the $n$-dimensional feature space, it is divided into two parts: in one region the output of the neuron is 0, in the other it is 1. The decision boundary between the two classes is given by $z=\mathbf{w}^T\mathbf{x}+b=0$. This describes an affine hyperplane, which is the reason why the perceptron is a linear classifier. The weight vector is the normal vector of the hyperplane, and the bias unit is related to the distance of the hyperplane from the origin (to be precise, $d=\frac{-b}{\lVert\mathbf{w}\rVert}$, where the sign corresponds to the side of the hyperplane).
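A short numerical check may help to see the geometry. The weights and bias below are hypothetical values chosen so that the numbers come out round; the sketch computes the signed distance $d$, the boundary point closest to the origin, and a few points on the boundary line in two dimensions.

```python
import numpy as np

# Hypothetical parameters of a trained neuron with two features
w = np.array([3.0, 4.0])   # normal vector of the decision boundary
b = -10.0                  # bias unit

norm_w = np.linalg.norm(w)
d = -b / norm_w            # signed distance of the boundary from the origin
p = d * w / norm_w         # boundary point closest to the origin
print(d, p)

# p lies on the boundary, so its net input vanishes (up to rounding)
print(w @ p + b)

# In two dimensions the boundary is the line x2 = -(w1 * x1 + b) / w2
x1 = np.array([-1.0, 0.0, 1.0])
x2 = -(w[0] * x1 + b) / w[1]
print(np.column_stack([x1, x2]))
```

The last expression requires $w_2\neq 0$; for a vertical boundary one would solve for $x_1$ instead.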
Due to the linear nature of the decision boundary, the perceptron will only converge, i.e., find a solution that classifies all training records correctly, if the data is linearly separable. If this is the case, the algorithm is guaranteed to find such a solution after a finite number of updates (for a proof see, e.g., a lecture by Raschka). Otherwise, the algorithm will run indefinitely, so a maximum number of epochs must be specified to stop it. In this case, however, it is not guaranteed that a good solution will be obtained, and other algorithms should be used.
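The difference between separable and non-separable data can be illustrated with two tiny data sets. This is a sketch under stated assumptions: the function `fit_until_converged`, the zero initialization and the limit of 100 epochs are choices made for this example. The logical AND is linearly separable, whereas XOR is not, so for XOR there can never be an epoch without updates.

```python
import numpy as np

def fit_until_converged(X, y, eta=1.0, max_epochs=100):
    """Run the perceptron rule until an epoch passes without any update.

    Returns (converged, epochs_run, weights, bias).
    """
    w = np.zeros(X.shape[1])
    b = 0.0
    for epoch in range(1, max_epochs + 1):
        updates = 0
        for xi, yi in zip(X, y):
            error = yi - int(xi @ w + b >= 0)
            if error != 0:
                w = w + eta * error * xi
                b = b + eta * error
                updates += 1
        if updates == 0:       # every record classified correctly
            return True, epoch, w, b
    return False, max_epochs, w, b

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y_and = np.array([0, 0, 0, 1])   # linearly separable
y_xor = np.array([0, 1, 1, 0])   # not linearly separable

converged_and, epochs_and, w_and, b_and = fit_until_converged(X, y_and)
converged_xor, epochs_xor, w_xor, b_xor = fit_until_converged(X, y_xor)

print("AND converged:", converged_and, "after", epochs_and, "epochs")
print("XOR converged:", converged_xor, "after", epochs_xor, "epochs")
```

Raising `max_epochs` does not help in the XOR case; the loop simply keeps cycling through updates until the limit is reached.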
To demonstrate the working principle of the perceptron algorithm, we will implement one ourselves. But first, we define a helper function that allows us to draw the decision boundary at an arbitrary step.
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap


def plot_decision_boundary(X, y, classifier, resolution=0.02):
    """
    Modified version of an implementation in
    Sebastian Raschka and Vahid Mirjalili,
    Python Machine Learning,
    2nd ed., 2017, Packt Publishing
    """
    markers = ('o', 's')
    colors = ('tab:blue', 'tab:orange')
    cmap = ListedColormap(colors)

    # define the grid
    x1_min, x1_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    x2_min, x2_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, resolution),
                           np.arange(x2_min, x2_max, resolution))

    if classifier is not None:
        # for each grid point, predict the class
        lab = classifier.predict(np.array([xx1.ravel(), xx2.ravel()]).T)
        lab = lab.reshape(xx1.shape)

        # plot the decision regions
        plt.contourf(xx1, xx2, lab, alpha=0.3, cmap=cmap)
        plt.xlim(xx1.min(), xx1.max())
        plt.ylim(xx2.min(), xx2.max())

    # plot the data points
    for idx, cl in enumerate(np.unique(y)):
        plt.scatter(x=X[y == cl, 0],
                    y=X[y == cl, 1],
                    alpha=0.8,
                    c=colors[idx],
                    marker=markers[idx],
                    label=f'Class {cl}')

    plt.xlabel('feature 1')
    plt.ylabel('feature 2')
    plt.legend()
Now we implement a simple version of a perceptron.
class SimplePerceptron:
    """Perceptron classifier

    Modified version of an implementation in
    Sebastian Raschka and Vahid Mirjalili,
    Python Machine Learning,
    2nd ed., 2017, Packt Publishing
    """

    def __init__(self, epochs, eta=1):
        self.epochs = epochs  # number of passes over the training data
        self.eta = eta        # learning rate

    def fit(self, X, y):
        # initialize weights with random numbers
        rng = np.random.default_rng(seed=0)
        self.weights = rng.standard_normal(X.shape[1])
        self.bias = 0.

        steps = 0
        for e in range(self.epochs):
            for xi, yi in zip(X, y):
                # error is 0 for a correct prediction, so nothing changes
                error = yi - self.predict(xi)
                self.weights += self.eta * error * xi
                self.bias += self.eta * error

                # plot the decision boundary if it is changed
                # (the helper already sets axis labels and legend)
                if error != 0:
                    steps += 1
                    plot_decision_boundary(X, y, self)
                    plt.title(f'Update no. {steps}')
                    plt.show()

    def net_input(self, X):
        return np.dot(X, self.weights) + self.bias

    def predict(self, X):
        return np.where(self.net_input(X) >= 0, 1, 0)
To test our implementation, we create some Gaussian-distributed random toy data with two features and divide it into two linearly separable classes.
# create some toy data
rng = np.random.default_rng(seed=0)
X = rng.standard_normal((30, 2))
# divide the data into two classes along the line x2=-x1
y = np.where(X[:,0]+X[:,1] >= 0, 1, 0)
# plot the data
plot_decision_boundary(X, y, None)
plt.title('Toy data')
plt.show()
After all the preparation, we can now test our perceptron.
ppn = SimplePerceptron(epochs=1, eta=0.1)
ppn.fit(X, y)
The plots above show how the perceptron successfully updates the decision boundary. For this combination of data and parameters, it took only 7 updates to reach an optimal solution. This was achieved within the first epoch.
Of course, ready-made implementations also exist in the Python ecosystem. A standard package for all kinds of machine learning algorithms is scikit-learn. Its perceptron can be used in an analogous way to our own implementation.
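A minimal sketch of such a usage could look as follows. It is not the original notebook cell: the parameter values (`eta0=0.1`, `max_iter=1000`, `random_state=0`) are assumptions chosen to resemble the settings used above, and the toy data is regenerated so that the block runs on its own.

```python
import numpy as np
from sklearn.linear_model import Perceptron

# Same toy data as above
rng = np.random.default_rng(seed=0)
X = rng.standard_normal((30, 2))
y = np.where(X[:, 0] + X[:, 1] >= 0, 1, 0)

# Assumed settings: learning rate 0.1, generous upper limit on epochs
clf = Perceptron(eta0=0.1, max_iter=1000, random_state=0)
clf.fit(X, y)

print("weights:", clf.coef_)
print("bias:", clf.intercept_)
print("training accuracy:", clf.score(X, y))
```

Since the fitted object provides a `predict` method, it can be passed to the helper defined earlier, e.g. `plot_decision_boundary(X, y, clf)` followed by `plt.show()`, to draw its decision regions.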
As you can see, the scikit-learn perceptron also finds a solution, but it is slightly different from the one above. This is to be expected, since we can draw many decision boundaries between the data points, all of which solve the classification. Determining which one is best is the task of another algorithm, the support vector machine.
In a real example, we would of course need to check the accuracy of our trained model. We will do this when we talk about the "standard" machine learning workflow.
Citation
The E-Learning project SOGA-Py was developed at the Department of Earth Sciences by Annette Rudolph, Joachim Krois and Kai Hartmann. You can reach us via email at soga[at]zedat.fu-berlin.de.
Please cite as follows: Rudolph, A., Krois, J., Hartmann, K. (2023): Statistics and Geodata Analysis using Python (SOGA-Py). Department of Earth Sciences, Freie Universitaet Berlin.