So far we have only looked at single neurons. As we have seen, they are able to perform linear classification tasks. However, it can be shown that single neurons cannot solve XOR (exclusive-or) problems (we will see an example later). Moreover, they are binary classifiers. Multiclass classification is only possible by using techniques like One-vs-Rest (OvR), where one neuron is trained for each class.

In the introduction to artificial neurons, we said that they should mimic brain cells. Since the brain contains billions of interconnected cells, we can try to do something similar and build a network of neurons. This can be done in many different ways. Here we discuss only one type, the so-called multilayer perceptron (MLP), which has the advantage that we already have all the ingredients and only need to extend them to more neurons.

An MLP is a fully connected feedforward net with at least three layers (input, hidden and output), each consisting of several neurons, also called nodes. Feedforward means that there are no cycles within the connections of the neurons, while fully connected means that each neuron is connected to all neurons in the next layer. The next figure shows an example.

Because the input layer does not contain real neurons, the layers are sometimes counted differently (see 1st and 2nd layer in the figure). Under this convention, a single neuron is a single-layer ANN.

If we want to describe the calculations behind an MLP, we need to extend our previous notation. Some of this is already included in the figure above.

- The superscript indicates the layer of the corresponding node.
- $x_2^{(in)}$ is the value of the second feature of the training record.
- $a_4^{(h)}$ is the result of the activation function in the fourth hidden layer node.
- $w_{4,2}^{(h)}$ is the weight of the second input, as used in the net input of the fourth node.
- $b^{(h)}$ is the bias unit for the hidden layer, i.e., it is the same for all nodes in the layer.
- $\hat{y}_3$ is the result of the threshold function for the third output neuron.

Note that the threshold function is only calculated for the output layer; all hidden layers only calculate the activation function and pass the result to the next layer.

In the following, we will assume that our training data has $n$ features and the neural network has $k$ hidden layer nodes and $\ell$ output nodes. For the figure above, this means $n=2$, $k=4$, $\ell=3$.

To perform the actual computation, we extend the computation of the net input $z=\mathbf{w}^T\mathbf{x}+b$ to a matrix multiplication for the net input of the hidden layer $$ \mathbf{z^{(h)}}=\mathbf{W^{(h)}}^T\mathbf{x^{(in)}}+\mathbf{b^{(h)}} $$ where $\mathbf{z^{(h)}}$ is a $k$-component vector containing the net input for each node and $\mathbf{b^{(h)}}$ is a vector with the entry $b^{(h)}$ in each of the $k$ components. $\mathbf{W^{(h)}}^T$ is a bit more complicated because we now have $n$ weights at each of the $k$ nodes. This means that we have a $(k\times n)$ matrix $$ \mathbf{W^{(h)}}^T= \begin{pmatrix} w_{1,1} & w_{1,2} & \dots & w_{1,n}\\ w_{2,1} & w_{2,2} & \dots & w_{2,n}\\ \vdots & \vdots & & \vdots\\ w_{k,1} & w_{k,2} & \dots & w_{k,n} \end{pmatrix} $$

Note:

- Above we calculated the net input for one training record. However, it is also possible to process $m$ records at the same time (as is done in minibatch gradient descent). Then $\mathbf{x^{(in)}}$ is expanded to an $(n\times m)$ matrix, and consequently $\mathbf{B^{(h)}}$ and $\mathbf{Z^{(h)}}$ become $(k\times m)$ matrices.
- The same extension to matrices applies to the output layer (or any other hidden layer that might be included in the network).
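The shapes described above can be checked with a few lines of NumPy. This is only a minimal sketch: the names `W_h_T`, `b_h`, `x_in` and `X_in` are chosen here for illustration, the weights are random, and the bias value is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

n, k, m = 2, 4, 5  # features, hidden nodes, records in the minibatch

# W_h_T plays the role of W^(h)^T: one row with n weights per hidden node
W_h_T = rng.standard_normal((k, n))
b_h = 0.1  # one shared bias for the whole hidden layer (arbitrary value)

# single record: x_in is an n-component vector
x_in = rng.standard_normal(n)
z_h = W_h_T @ x_in + b_h   # k-component vector

# minibatch: each column of X_in is one record
X_in = rng.standard_normal((n, m))
Z_h = W_h_T @ X_in + b_h   # (k x m) matrix, the bias is broadcast

print(z_h.shape, Z_h.shape)
```

Each column of `Z_h` is the net input that the single-record formula would give for the corresponding column of `X_in`.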

Now we are ready to calculate the results of the ANN step by step:

- The net input of the hidden layer is $\mathbf{z^{(h)}}=\mathbf{W^{(h)}}^T\mathbf{x^{(in)}}+\mathbf{b^{(h)}}$
- The activation of the hidden layer is $\mathbf{a^{(h)}}=\sigma(\mathbf{z^{(h)}})$
- The net input of the output layer is $\mathbf{z^{(out)}}=\mathbf{W^{(out)}}^T\mathbf{a^{(h)}}+\mathbf{b^{(out)}}$
- The activation of the output layer is $\mathbf{a^{(out)}}=\sigma(\mathbf{z^{(out)}})$
- The output of the whole neural net is $\mathbf{\hat{y}^{(out)}}=\text{step}(\mathbf{a^{(out)}})$

Of course, $\sigma$ and $\text{step}$ act on each vector component individually.
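The five steps can be sketched in NumPy as follows. The sketch assumes that $\sigma$ is the logistic sigmoid and that the step function thresholds at 0.5; the function names, the random weights and the zero biases are illustrative choices, not part of the text above.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid, applied to each component (assumed activation)."""
    return 1.0 / (1.0 + np.exp(-z))

def step(a, threshold=0.5):
    """Threshold function, applied to each component (assumed threshold)."""
    return np.where(a >= threshold, 1, 0)

def forward(x_in, W_h_T, b_h, W_out_T, b_out):
    """Forward pass through one hidden layer and the output layer."""
    z_h = W_h_T @ x_in + b_h       # net input of the hidden layer
    a_h = sigmoid(z_h)             # activation of the hidden layer
    z_out = W_out_T @ a_h + b_out  # net input of the output layer
    a_out = sigmoid(z_out)         # activation of the output layer
    y_hat = step(a_out)            # output of the whole net
    return a_h, a_out, y_hat

rng = np.random.default_rng(seed=0)
n, k, l = 2, 4, 3                  # sizes as in the figure
W_h_T = rng.standard_normal((k, n))
W_out_T = rng.standard_normal((l, k))
x_in = np.array([0.5, -1.0])       # one made-up training record

a_h, a_out, y_hat = forward(x_in, W_h_T, 0.0, W_out_T, 0.0)
print(a_h.shape, a_out.shape, y_hat)
```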

As for the single neuron, the actual learning is done by updating the weights and bias units. To this end, we use stochastic gradient descent and define a loss function as $$ L=L(\mathbf{W^{(h)}}, \mathbf{W^{(out)}}, \mathbf{b^{(h)}}, \mathbf{b^{(out)}}) $$ and update the weights and bias units as $$ w_{i,j, new}=w_{i,j, old}-\eta\frac{\partial L}{\partial w_{i,j}}\qquad b_{new}=b_{old}-\eta\frac{\partial L}{\partial b} $$ Let us compute the gradient needed to update the two weights marked red in the above figure. For $w_{3,4}^{(out)}$ it is given by $$ \frac{\partial L}{\partial w_{3,4}^{(out)}}=\frac{\partial L}{\partial a_3^{(out)}}\frac{\partial a_3^{(out)}}{\partial w_{3,4}^{(out)}} $$ where we used the chain rule.

For $w_{4, 2}^{(h)}$ it gets a little more complicated, because the weight influences the activations of all output nodes. The gradient is then $$ \frac{\partial L}{\partial w_{4,2}^{(h)}}=\left[\frac{\partial L}{\partial a_1^{(out)}}\frac{\partial a_1^{(out)}}{\partial a_4^{(h)}}\frac{\partial a_4^{(h)}}{\partial w_{4,2}^{(h)}}+\frac{\partial L}{\partial a_2^{(out)}}\frac{\partial a_2^{(out)}}{\partial a_4^{(h)}}\frac{\partial a_4^{(h)}}{\partial w_{4,2}^{(h)}}+\frac{\partial L}{\partial a_3^{(out)}}\frac{\partial a_3^{(out)}}{\partial a_4^{(h)}}\frac{\partial a_4^{(h)}}{\partial w_{4,2}^{(h)}}\right] $$ This seems to be quite computationally expensive. But if we start with the weights for the output layer and store some of the partial derivatives, e.g. $\frac{\partial L}{\partial a_3^{(out)}}$, we can reuse them in the calculation for the hidden layer. This is the reason why this method is usually called backpropagation: the input is propagated forward through the network, while the weight correction is propagated backwards through the network. In the end, it is just a clever use of the chain rule.
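To make the reuse of stored derivatives concrete, the following sketch computes both gradients for a small network and compares them with finite differences. The text leaves the loss and activation unspecified, so the sketch assumes a squared-error loss $L=\frac{1}{2}\sum_i (a_i^{(out)}-y_i)^2$, sigmoid activations and zero biases; all variable names and data values are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(W_h_T, W_out_T, x, y):
    """Assumed squared-error loss with sigmoid activations, biases set to 0."""
    a_h = sigmoid(W_h_T @ x)
    a_out = sigmoid(W_out_T @ a_h)
    return 0.5 * np.sum((a_out - y) ** 2)

rng = np.random.default_rng(seed=0)
n, k, l = 2, 4, 3
W_h_T = rng.standard_normal((k, n))
W_out_T = rng.standard_normal((l, k))
x = np.array([0.5, -1.0])      # made-up record
y = np.array([0.0, 0.0, 1.0])  # made-up target

# forward pass, keeping the activations
a_h = sigmoid(W_h_T @ x)
a_out = sigmoid(W_out_T @ a_h)

# backward pass, output layer first
dL_da_out = a_out - y                          # dL/da_i^(out)
delta_out = dL_da_out * a_out * (1 - a_out)    # stored and reused below
grad_W_out_T = np.outer(delta_out, a_h)        # [i, j] = dL/dw_{i,j}^(out)

# hidden layer: the sum over all output nodes is a matrix product
dL_da_h = W_out_T.T @ delta_out
delta_h = dL_da_h * a_h * (1 - a_h)
grad_W_h_T = np.outer(delta_h, x)              # [i, j] = dL/dw_{i,j}^(h)

# finite-difference check for w_{3,4}^(out) and w_{4,2}^(h) (0-based indices)
eps = 1e-6
Wp, Wm = W_out_T.copy(), W_out_T.copy()
Wp[2, 3] += eps
Wm[2, 3] -= eps
num_out = (loss(W_h_T, Wp, x, y) - loss(W_h_T, Wm, x, y)) / (2 * eps)

Wp, Wm = W_h_T.copy(), W_h_T.copy()
Wp[3, 1] += eps
Wm[3, 1] -= eps
num_h = (loss(Wp, W_out_T, x, y) - loss(Wm, W_out_T, x, y)) / (2 * eps)

print(grad_W_out_T[2, 3], num_out)
print(grad_W_h_T[3, 1], num_h)
```

The vector `delta_out` is computed once for the output layer and then reused for every hidden-layer weight, which is the saving that backpropagation provides.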

Finally, let us apply the above MLP to our previous toy data.

In [1]:

```
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap


def plot_decision_boundary(X, y, classifier, resolution=0.02):
    """
    Modified version of an implementation in
    Sebastian Raschka and Vahid Mirjalili,
    Python Machine Learning,
    2nd ed., 2017, Packt Publishing
    """
    markers = ('o', 's')
    colors = ('tab:blue', 'tab:orange')
    cmap = ListedColormap(colors)
    # define the grid
    x1_min, x1_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    x2_min, x2_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, resolution),
                           np.arange(x2_min, x2_max, resolution))
    if classifier is not None:
        # for each grid point, predict the class
        lab = classifier.predict(np.array([xx1.ravel(), xx2.ravel()]).T)
        lab = lab.reshape(xx1.shape)
        # plot the decision regions
        plt.contourf(xx1, xx2, lab, alpha=0.3, cmap=cmap)
    plt.xlim(xx1.min(), xx1.max())
    plt.ylim(xx2.min(), xx2.max())
    # plot the data points
    for idx, cl in enumerate(np.unique(y)):
        plt.scatter(x=X[y == cl, 0],
                    y=X[y == cl, 1],
                    alpha=0.8,
                    c=colors[idx],
                    marker=markers[idx],
                    label=f'Class {cl}')
    plt.xlabel('feature 1')
    plt.ylabel('feature 2')
    plt.legend()
```

In [2]:

```
# create some toy data
rng = np.random.default_rng(seed=0)
X = rng.standard_normal((30, 2))
# divide the data into two classes along the line x2=-x1
y = np.where(X[:,0]+X[:,1] >= 0, 1, 0)
```

In [3]:

```
from sklearn.neural_network import MLPClassifier
# use 1 hidden layer with 4 neurons and stochastic gradient descent
mlp = MLPClassifier(hidden_layer_sizes=(4,), solver='sgd', batch_size=1, max_iter=500, random_state=0)
mlp.fit(X, y)
plot_decision_boundary(X, y, mlp)
plt.title('Scikit-Learn Multilayer Perceptron')
plt.show()
```

To demonstrate that the MLP is able to classify XOR problems, we divide the data according to the four quadrants: quadrants 1 and 3 correspond to class 1, and quadrants 2 and 4 correspond to class 0. For comparison, we first fit a logistic regression, which is a linear classifier.

In [4]:

```
# create some toy data
rng = np.random.default_rng(seed=0)
X = rng.standard_normal((30, 2))
# divide the data
y = np.where(X[:,0]*X[:,1] >= 0, 1, 0)
```

In [5]:

```
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X, y)
plot_decision_boundary(X, y, logreg)
plt.title('Scikit-Learn Logistic Regression')
plt.show()
```

Now, we use the MLP, which is able to find a nonlinear decision boundary.

In [6]:

```
from sklearn.neural_network import MLPClassifier
mlp = MLPClassifier(hidden_layer_sizes=(4,), solver='sgd', batch_size=1, max_iter=700, random_state=0)
mlp.fit(X, y)
plot_decision_boundary(X, y, mlp)
plt.title('Scikit-Learn Multilayer Perceptron')
plt.show()
```

The neural network described above contains only one hidden layer. It is possible to extend the network layout by adding more hidden layers with varying numbers of nodes. This is the subject of *deep learning*.
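As a small illustration of such an extension, the sketch below fits a scikit-learn `MLPClassifier` with two hidden layers on the XOR toy data and prints the shapes of the fitted weight matrices. The layer sizes `(8, 4)` and `max_iter=200` are arbitrary choices for this example, and the fit may emit a convergence warning. Note that scikit-learn stores one bias per node in `intercepts_`, whereas the notation above uses one shared bias per layer.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# same XOR toy data as above
rng = np.random.default_rng(seed=0)
X = rng.standard_normal((30, 2))
y = np.where(X[:, 0] * X[:, 1] >= 0, 1, 0)

# two hidden layers with 8 and 4 nodes (arbitrary sizes for illustration)
deep_mlp = MLPClassifier(hidden_layer_sizes=(8, 4), solver='sgd',
                         max_iter=200, random_state=0)
deep_mlp.fit(X, y)

# one weight matrix per connection between consecutive layers,
# stored with shape (nodes in previous layer, nodes in next layer)
weight_shapes = [w.shape for w in deep_mlp.coefs_]
bias_shapes = [b.shape for b in deep_mlp.intercepts_]

print(deep_mlp.n_layers_)  # counts input, hidden and output layers
print(weight_shapes)
print(bias_shapes)
```

For a binary problem scikit-learn uses a single output node, so the last weight matrix has only one column.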

As we saw above, adding more layers and more nodes simply requires matrices of different sizes. GPUs are specialized in performing linear algebra calculations on large data sets very efficiently. This is why today's machine learning algorithms for big data are preferably run on high-performance GPU clusters rather than on traditional CPUs.

**Citation**

The E-Learning project SOGA-Py was developed at the Department of Earth Sciences by Annette Rudolph, Joachim Krois and Kai Hartmann. You can reach us via e-mail at soga[at]zedat.fu-berlin.de.

You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follows: *Rudolph, A., Krois, J., Hartmann, K. (2023): Statistics and Geodata Analysis using Python (SOGA-Py). Department of Earth Sciences, Freie Universitaet Berlin.*