In 1960, Bernard Widrow and Ted Hoff published an improved learning rule for artificial neurons. The main difference from the perceptron is the way the weights and the bias unit are updated. A schematic view of the process is shown in the following figure.
Unlike the perceptron, where we defined the error based on the predicted output, here we use the result of the activation function to define a loss (or cost) function. The goal of training is to find the minimum of this loss function, which corresponds to the optimal solution to our task.
One way to find this minimum is a technique called gradient descent. Depending on the number of training records used for each learning step, we distinguish between full-batch, stochastic, and minibatch gradient descent.
The main reason for using the activation function instead of the predicted output is that it is differentiable, so we can use calculus to find the minimum of the loss function. The gradient of a function is a vector that points in the direction of steepest ascent. Thus, the negative of the gradient points in the direction of steepest descent. If we repeatedly take small steps along the negative gradient, we approach a minimum of the function. With a linear activation function and a squared-error loss, the loss is convex and this minimum is the global one; otherwise we might end up in a local minimum. This technique is called gradient descent, and it means that we update the weights and the bias unit as follows: $$\begin{align}\mathbf{w_{new}}&:=\mathbf{w_{old}}-\eta\nabla_wL(\mathbf{w},b)\\b_{new}&:=b_{old}-\eta\frac{\partial L(\mathbf{w},b)}{\partial b}\end{align}$$ As with the perceptron, $\eta$ is the learning rate. If $\eta$ is too small, the descent takes many steps; if it is too large, we might overshoot the minimum. Therefore, finding an appropriate value is an important step in fine-tuning the training.
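As a minimal sketch of these update rules, the following assumes a linear activation $\sigma(z)=z$ and the mean squared error loss introduced further below. The function names, the demo data, and the values of `eta` and `n_epochs` are illustrative assumptions, not part of the original rule.

```python
import numpy as np


def mse_gradients(X, y, w, b):
    """Gradients of the MSE loss, assuming a linear activation sigma(z) = z."""
    n = X.shape[0]
    errors = y - (X @ w + b)              # y - sigma(z) for every record
    grad_w = -2.0 / n * (X.T @ errors)    # gradient with respect to the weights
    grad_b = -2.0 / n * errors.sum()      # partial derivative with respect to b
    return grad_w, grad_b


def gradient_descent(X, y, eta=0.1, n_epochs=200):
    """Full-batch gradient descent: one update per pass over all records."""
    w = np.zeros(X.shape[1])
    b = 0.0
    losses = []
    for _ in range(n_epochs):
        grad_w, grad_b = mse_gradients(X, y, w, b)
        w = w - eta * grad_w              # step against the gradient
        b = b - eta * grad_b
        losses.append(np.mean((y - (X @ w + b)) ** 2))
    return w, b, losses


# hypothetical noise-free demo data: y = 2*x1 - x2 + 0.5
rng = np.random.default_rng(seed=0)
X_demo = rng.standard_normal((50, 2))
y_demo = 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1] + 0.5

w_fit, b_fit, losses = gradient_descent(X_demo, y_demo)
print(w_fit, b_fit, losses[-1])
```

On this demo data the loop should recover weights close to (2, -1) and a bias close to 0.5. Increasing `eta` well beyond the value used here is an easy way to observe the overshooting described above.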
By using gradient descent in this form, we have implicitly changed the learning step from online learning, as with the perceptron, to full-batch learning. This means that the loss function is calculated from the results of the activation function over the entire training data set, and the weights are updated once per pass.
An alternative to full-batch learning is stochastic gradient descent (SGD), which updates the weights and the bias unit after every single training record. For the MSE loss function, the update for training record $j$ is then: $$\begin{align}\mathbf{w_{new}}&:=\mathbf{w_{old}}-\eta\nabla_w\left(y^{(j)}-\sigma\left(z^{(j)}\right)\right)^2\\b_{new}&:=b_{old}-\eta\frac{\partial \left(y^{(j)}-\sigma\left(z^{(j)}\right)\right)^2}{\partial b}\end{align}$$ SGD usually approaches the minimum in fewer passes over the data than full-batch gradient descent because the weights are updated more often. However, the descent path is noisier because the loss function is different for each training record. This noise can be an advantage: for nonlinear activation functions, it makes it easier to escape a local minimum and continue towards the global minimum. To avoid patterns arising from the order of the training records, the training set is shuffled at the beginning of each epoch; this random order is where the word stochastic in SGD comes from.
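The per-record update can be sketched as follows, again assuming a linear activation $\sigma(z)=z$ so that the gradient of the squared error for record $j$ is $-2\left(y^{(j)}-z^{(j)}\right)\mathbf{x}^{(j)}$. The function name `sgd`, the demo data, and the hyperparameter values are assumptions made for illustration.

```python
import numpy as np


def sgd(X, y, eta=0.01, n_epochs=50, seed=0):
    """Stochastic gradient descent: one update per training record."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_epochs):
        # shuffle the order of the records at the beginning of each epoch
        for j in rng.permutation(X.shape[0]):
            error = y[j] - (X[j] @ w + b)     # y - sigma(z) for record j
            w = w + 2.0 * eta * error * X[j]  # minus eta times the gradient
            b = b + 2.0 * eta * error
    return w, b


# hypothetical noise-free demo data: y = 2*x1 - x2 + 0.5
rng = np.random.default_rng(seed=0)
X_demo = rng.standard_normal((50, 2))
y_demo = 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1] + 0.5

w_sgd, b_sgd = sgd(X_demo, y_demo)
print(w_sgd, b_sgd)
```

With 50 records and 50 epochs, this sketch performs 2500 small updates, compared with one update per epoch in full-batch mode.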
Minibatch gradient descent defines the loss function on a subset of the training set and is a compromise between full-batch and stochastic gradient descent. It typically reaches the minimum faster than full-batch mode, and it allows the use of vectorized operations, which improves computational efficiency.
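A minimal sketch of this compromise, under the same assumptions as before (linear activation, MSE loss). The function name `minibatch_gd`, the batch size, and the other hyperparameters are hypothetical choices.

```python
import numpy as np


def minibatch_gd(X, y, batch_size=10, eta=0.05, n_epochs=100, seed=0):
    """Minibatch gradient descent: one vectorized update per batch."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_epochs):
        order = rng.permutation(n)                 # reshuffle every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]  # records of this batch
            errors = y[idx] - (X[idx] @ w + b)
            # MSE gradient evaluated on the batch only
            w = w + 2.0 * eta / len(idx) * (X[idx].T @ errors)
            b = b + 2.0 * eta / len(idx) * errors.sum()
    return w, b


# hypothetical noise-free demo data: y = 2*x1 - x2 + 0.5
rng = np.random.default_rng(seed=0)
X_demo = rng.standard_normal((50, 2))
y_demo = 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1] + 0.5

w_mb, b_mb = minibatch_gd(X_demo, y_demo)
print(w_mb, b_mb)
```

Setting `batch_size=1` turns this sketch into stochastic gradient descent, and setting it to the number of records turns it into full-batch gradient descent.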
Since the Widrow-Hoff rule does not prescribe a specific activation or loss function, we can build different types of learning algorithms.
A common loss function is the mean squared error (MSE): $$L(\mathbf{w},b)=\frac{1}{n}\sum_{j=1}^n\left(y^{(j)}-\sigma\left(z^{(j)}\right)\right)^2$$ If the MSE is small, the model fits the training data well. If we additionally choose the identity as the activation function, $$\sigma(z):=\mathrm{id}(z)=z=\mathbf{w}^T\mathbf{x}+b$$ the whole learning rule is nothing other than a linear regression.
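For this choice, the gradients needed in the update rule can be written out explicitly. The following is a short derivation sketch that applies the chain rule to the MSE with $\sigma(z)=z$ and $z^{(j)}=\mathbf{w}^T\mathbf{x}^{(j)}+b$:

$$\begin{align}\nabla_wL(\mathbf{w},b)&=-\frac{2}{n}\sum_{j=1}^n\left(y^{(j)}-z^{(j)}\right)\mathbf{x}^{(j)}\\\frac{\partial L(\mathbf{w},b)}{\partial b}&=-\frac{2}{n}\sum_{j=1}^n\left(y^{(j)}-z^{(j)}\right)\end{align}$$

Setting both expressions to zero gives the normal equations of ordinary least squares, which is why gradient descent converges to the linear regression solution in this case.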
If we use the logistic sigmoid function $$\sigma(z):=\frac{1}{1+e^{-z}}$$ and define the loss function as the negative log-likelihood $$L:=-\log p(y|\mathbf{x},\mathbf{w},b)$$ we have a logistic regression.
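To illustrate, here is a minimal full-batch sketch of logistic regression. It assumes labels in {0, 1} and the negative log-likelihood averaged over the $n$ training records, whose gradient with respect to the weights is $-\frac{1}{n}\sum_j\left(y^{(j)}-\sigma\left(z^{(j)}\right)\right)\mathbf{x}^{(j)}$. The names `sigmoid` and `fit_logistic`, the demo data, and the hyperparameters are assumptions for illustration.

```python
import numpy as np


def sigmoid(z):
    """Logistic sigmoid activation."""
    return 1.0 / (1.0 + np.exp(-z))


def fit_logistic(X, y, eta=0.5, n_epochs=500):
    """Full-batch gradient descent on the mean negative log-likelihood."""
    n = X.shape[0]
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_epochs):
        errors = y - sigmoid(X @ w + b)     # y - sigma(z) for every record
        w = w + eta / n * (X.T @ errors)    # minus eta times the gradient
        b = b + eta / n * errors.sum()
    return w, b


# hypothetical demo data: two classes separated by the line x2 = -x1
rng = np.random.default_rng(seed=0)
X_demo = rng.standard_normal((30, 2))
y_demo = np.where(X_demo[:, 0] + X_demo[:, 1] >= 0, 1, 0)

w_log, b_log = fit_logistic(X_demo, y_demo)
pred = np.where(sigmoid(X_demo @ w_log + b_log) >= 0.5, 1, 0)
accuracy = np.mean(pred == y_demo)
print(w_log, b_log, accuracy)
```

Library implementations typically use more elaborate solvers and add regularization, so their coefficients will generally differ from those of this sketch.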
NB: We can use both versions for either classification or regression. For classification, we apply the threshold function to the output of the activation function after training; for regression, we simply use the output of the activation function as the result.
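The classification step can be sketched as a threshold applied to the activation output. The helper name `predict_class` and the default threshold of 0.5 (a natural choice for labels 0 and 1) are assumptions for illustration.

```python
import numpy as np


def predict_class(activation, threshold=0.5):
    """Map activation outputs to class labels 0 or 1."""
    return np.where(activation >= threshold, 1, 0)


# hypothetical activation outputs for three records
activation = np.array([0.1, 0.5, 0.9])
labels = predict_class(activation)
print(labels)  # [0 1 1]
```

For regression, this step is skipped and `activation` itself is the prediction.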
The following code shows how to use logistic regression as a classifier for our previously defined toy data.
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
def plot_decision_boundary(X, y, classifier, resolution=0.02):
    """
    Modified version of an implementation in
    Sebastian Raschka and Vahid Mirjalili,
    Python Machine Learning,
    2nd ed., 2017, Packt Publishing
    """
    markers = ('o', 's')
    colors = ('tab:blue', 'tab:orange')
    cmap = ListedColormap(colors)
    # define the grid
    x1_min, x1_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    x2_min, x2_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, resolution),
                           np.arange(x2_min, x2_max, resolution))
    if classifier is not None:
        # for each grid point, predict the class
        lab = classifier.predict(np.array([xx1.ravel(), xx2.ravel()]).T)
        lab = lab.reshape(xx1.shape)
        # plot the decision regions
        plt.contourf(xx1, xx2, lab, alpha=0.3, cmap=cmap)
        plt.xlim(xx1.min(), xx1.max())
        plt.ylim(xx2.min(), xx2.max())
    # plot the data points
    for idx, cl in enumerate(np.unique(y)):
        plt.scatter(x=X[y == cl, 0],
                    y=X[y == cl, 1],
                    alpha=0.8,
                    c=colors[idx],
                    marker=markers[idx],
                    label=f'Class {cl}')
    plt.xlabel('feature 1')
    plt.ylabel('feature 2')
    plt.legend()
# create some toy data
rng = np.random.default_rng(seed=0)
X = rng.standard_normal((30, 2))
# divide the data into two classes along the line x2=-x1
y = np.where(X[:,0]+X[:,1] >= 0, 1, 0)
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X, y)
plot_decision_boundary(X, y, logreg)
plt.title('Scikit-Learn Logistic Regression')
plt.show()
In a real example, we would of course need to check the accuracy of our trained model. We will do this when we talk about the "standard" machine learning workflow.
Citation
The E-Learning project SOGA-Py was developed at the Department of Earth Sciences by Annette Rudolph, Joachim Krois and Kai Hartmann. You can reach us by e-mail at soga[at]zedat.fu-berlin.de.
Please cite as follows: Rudolph, A., Krois, J., Hartmann, K. (2023): Statistics and Geodata Analysis using Python (SOGA-Py). Department of Earth Sciences, Freie Universitaet Berlin.