Since Cortes and Vapnik,1995 introduced the concept of
*Support Vector Networks* Support Vector Machines (SVM) (**SVM**)
experienced a fast increasing attention for supervised machine learning applications primarily
used for classification tasks. However, it can also be used for
regression, which is then referred to as Support Vector Regression
(SVR). A nice reference for SVM is provided by Steinwart & Christmann, 2008.

Given a set of labeled data, an SVM algorithm fits a model to the data such that it is optimally divided into the given categories by finding a hyperplane that maximizes the distance between them. Newly added (unlabeled) data points can then be effectively assigned to one of the categories.

The primary objective of SVM is to find a hyperplane that optimally separates the data into different classes. This involves solving an optimization problem to find the parameters \(w\) and \(b\) of the decision function:

\[f(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + b\] Where:

- \(\mathbf{w} \in \mathbb{R}^d\) is a weight vector and \(d\) is the number of features.
- \(\mathbf{x} \in \mathbb{R}^d\) is the input vector.
- \(b \in \mathbb{R}\) is the bias term.

The purpose of the bias term is to shift the decision boundary (hyperplane) away from the origin. Without the bias term, the decision boundary would always go through the origin \((0,0,...,0)\), which would limit the capacity of the model. The optimization aims to maximize the margin while minimizing misclassification. This is formulated as follows:

\[\min_{\mathbf{w}, b, \boldsymbol{\xi}} \frac{1}{2} ||\mathbf{w}||^2 + C \sum_{i=1}^{N} \xi_i\] Subject to: \[y_i (\mathbf{w}^T \mathbf{x}_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0 \] Where:

- \(\boldsymbol{\xi} \in \mathbb{R}^N\) is the slack variable vector, with each \(\xi_i\) allowing for the soft margin and possible misclassification of difficult points.
- \(C \in \mathbb{R}\) is a scalar that controls the trade-off between maximizing the margin and minimizing the classification error.
- \(y_i \in \{-1, 1\}\) is the label of the i-th data point, and \(\mathbf{x}_i \in \mathbb{R}^d\) is the feature vector of the i-th data point.

The so-called *support vectors* are the data points that lie
closest to the decision boundary and are the most difficult to classify.
They are critical to defining the optimal hyperplane because if they are
moved, the position of the hyperplane changes. Support vectors have to
satisfy the following condition:

\[y_i (\mathbf{w}^T \mathbf{x}_i + b) = 1\]

Linear SVM Classifier are primarily used for binary classification
tasks. To get an intuition, we will use an SVM to classify a set of toy
data points.

For demonstration purposes, we will generate a data set of 25 observations and 2 features (for a real data set this could be for example temperature and humidity). Then, we label each observation by assigning it to a class (here either class \(-1\)` or \(1\)) and the shift those classes a little away from each other. This way, we produce a data set with two clusters.

```
create_example_data <- function(n, f, seed, shift = 3, pattern = "linear") {
### n - number of observations, f - number of features/features ###
# data set
set.seed(seed)
x <- matrix(rnorm(f * n), n, f)
# vector that assigns each observation to a class (out of two classes)
y <- rep(c(-1, 1), c(n / f, n / f))
if (pattern == "linear") {
# shifting all observations assigned to class "1" for linearly separable data
x[y == 1, ] <- x[y == 1, ] + shift
}
if (pattern == "poly") {
# create a non-linear decision boundary
x[y == 1, 2] <- x[y == 1, 1]^3 + rnorm(sum(y == 1), 0, 1) + 2 * shift
x[y == -1, 2] <- x[y == -1, 1]^3 + rnorm(sum(y == -1), 1)
}
if (pattern == "radial") {
# create a non-linear decision boundary
x[y == 1, 2] <- x[y == 1, 1]^2 + rnorm(sum(y == 1), 0, 0.5)
x[y == -1, ] <- x[y == -1, ] * 0.4
x[y == -1, 2] <- x[y == -1, 2] + shift
}
if (pattern == "sigmoid") {
# create a non-linear decision boundary
x[y == 1, 2] <- 1 / (1 + exp(-5 * x[y == 1, 1])) - 1 + rnorm(sum(y == 1), 0, 0.1)
x[y == -1, 2] <- 2 / (1 + exp(-5 * x[y == -1, 1])) + rnorm(sum(y == 1), 0, 0.1)
}
# create a data frame
data.frame(Feature1 = x[, 1], Feature2 = x[, 2], class = as.factor(y))
}
toy_data <- create_example_data(30, 2, 111)
```

Let us visualize our data set.

```
library(ggplot2)
plot_clusters <- function(data, class) {
ggplot(data, aes(x = data[, 1], y = data[, 2], color = class)) +
geom_point(size = 4) +
labs(
title = "Toy Data",
x = "Feature 1",
y = "Feature 2"
) +
theme_minimal(base_size = 13) +
theme(plot.margin = margin(1, 1, 1, 1, "cm"), plot.title = element_text(size = 22, hjust = 0.0)) +
scale_color_manual(values = c("#f6525a", "cornflowerblue"))
}
plot_clusters(toy_data, toy_data$class)
```

As we can observe, the points are distinctly separated into two clusters.

We will apply the SVM algorithm using the `svm`

function
from the `e1071`

package. The goal is to create a hyperplane
that can accurately classify the data points.

```
library(e1071)
model <- svm(class ~ ., data = toy_data, type = "C-classification", kernel = "linear", scale = FALSE)
```

Let us plot our model and the training data. This can simply be done
by using the `plot`

function.

`plot(model, toy_data, color.palette = terrain.colors)`

The plot shows how the SVM divides the data set into two regions. The
hyperplane, for a two-dimensional data set simply a line, is called
*decision boundary*. Every data point below the the decision
boundary is assigned to class \(-1\)
and every data point above to class \(1\). It is obvious that there are many
possible hyperplanes for separating both clusters. In SVM’s, the best
separating hyperplane maximizes the distance, or *margin*,
between the hyperplane and the data points closest by the hyperplane(.
Those data points are called (*support vectors*), indicated with
an `x`

in the plot above. The support vectors are the
training data to find the decision boundary. Note that the SVM plot
above breaks a convention by plotting the first feature on the vertical
axis.

By using the `print`

function, we get a summary of our
model.

`print(model)`

```
##
## Call:
## svm(formula = class ~ ., data = toy_data, type = "C-classification",
## kernel = "linear", scale = FALSE)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 1
##
## Number of Support Vectors: 3
```

Our SVM model uses 3 support vectors, is of type “C-classification”,
has a linear kernel and a cost parameter of \(1\) (we will come to this later). Calling
the `names`

lists all components of our model.

`names(model)`

```
## [1] "call" "type" "kernel" "cost"
## [5] "degree" "gamma" "coef0" "nu"
## [9] "epsilon" "sparse" "scaled" "x.scale"
## [13] "y.scale" "nclasses" "levels" "tot.nSV"
## [17] "nSV" "labels" "SV" "index"
## [21] "rho" "compprob" "probA" "probB"
## [25] "sigma" "coefs" "na.action" "fitted"
## [29] "decision.values" "terms"
```

Let us now create a data grid and predict for each data point its
class according to our SVM model. We us the apply function to get the
range of each of the features and define an evenly spaced grid within
them. By using the `predict`

function, we assign a each of
the data points to either class \(1\)
or \(-1\). We also want to visualize
the decision boundary and the margins. A linear decision boundary can
simply be described by: \[\beta_0 + \beta_1
x_1 + \beta_2x_2 = 0\] Since we want \(x_1\) as the independent variable, we get
\(x_2 = - \frac{\beta_1}{\beta_2}x_1 -
\frac{\beta_1}{\beta_2}\).

`SV`

contains the support vectors, `rho`

the
negative intercept,

```
beta <- t(model$coefs) %*% model$SV
beta0 <- -model$rho
```

```
plot_svm <- function(svm_model, data) {
color_1 <- "firebrick1"
color_2 <- "dodgerblue"
# create grid of feature space
var_range <- apply(data[, 1:2], 2, range)
var1 <- seq(from = var_range[1, 1], to = var_range[2, 1], length = 100)
var2 <- seq(from = var_range[1, 2], to = var_range[2, 2], length = 100)
data_grid <- expand.grid(Feature1 = var1, Feature2 = var2)
# classify grid points
label_grid <- predict(svm_model, data_grid)
# classes of training data
Class <- data$class
# plot original data
plot <- ggplot(data, aes(x = data[, 1], y = data[, 2], color = Class)) +
scale_color_manual(values = c(color_1, color_2)) +
geom_point(size = 4) +
# plot labeled data grid
geom_point(
data = data_grid, aes(x = data_grid[, 1], y = data_grid[, 2]),
col = c(color_1, color_2)[as.numeric(label_grid)], size = 0.2
) +
# plot support vectors
geom_point(
data = data[svm_model$index, ], aes(x = data[svm_model$index, 1], y = data[svm_model$index, 2]),
col = "black", size = 5, shape = 8
)
# plot decision boundary and margins for linear kernels
if (svm_model$kernel == 0) {
title <- "SVM Classifier with linear Kernel"
beta <- t(svm_model$coefs) %*% svm_model$SV
beta0 <- -svm_model$rho
plot <- plot +
geom_abline(slope = -beta[1] / beta[2], intercept = -beta0 / beta[2]) +
geom_abline(slope = -beta[1] / beta[2], intercept = -(beta0 - 1) / beta[2], linetype = "dashed") +
geom_abline(slope = -beta[1] / beta[2], intercept = -(beta0 + 1) / beta[2], linetype = "dashed")
} else {
title <- ""
}
# layout
plot <- plot +
labs(
title = title,
x = "Feature 1",
y = "Feature 2"
) +
theme_minimal(base_size = 14) +
theme(plot.margin = margin(1, 1, 1, 1, "cm"), plot.title = element_text(size = 22, hjust = 0.5))
return(plot)
}
plot_svm(model, toy_data)
```

The ideal scenario for applying a linear SVM is when the data is linearly separable. However, we know from our experience that data sets can have outliers that are closer to a different class than to their own. Let us look at the following example:

```
test_data <- create_example_data(100, 2, 7)
plot_clusters(test_data, test_data$class) +
labs(title = "Intermixing Outliers from Different Clusters")
```

Data points of both classes mix at the border of both classes. The data set is clearly not separable by a single line. There are multiple ways to deal with this.

In a “hard margin” SVM, the algorithm strictly enforces that all observations are classified correctly, which is problematic in the presence of outliers. Conversely, a “soft margin” SVM introduces a degree of flexibility, allowing certain points to be misclassified in favor of achieving a broader, more generalizable margin.

**Slack Variables**: To accommodate the misclassification of difficult or noisy points, soft margin SVM introduces slack variables. These variables measure the degree of misclassification for each data point.**Cost Parameter (C)**: The introduction of slack variables brings a new hyperparameter to the SVM: the cost parameter, denoted as \(C\). This parameter controls the trade-off between maximizing the margin and minimizing the classification error on the training data.- A small value of \(C\) makes the cost of misclassification low, focusing more on maximizing the margin. This could lead to a broader margin at the expense of a few misclassifications, which can be beneficial in the presence of outliers.
- A large value of \(C\) prioritizes minimizing the classification error on the training data, potentially resulting in a narrower margin.

```
library(ggpubr)
test_model_100 <- svm(class ~ ., data = test_data, type = "C-classification", kernel = "linear", scale = FALSE, cost = 100)
test_model_1 <- svm(class ~ ., data = test_data, type = "C-classification", kernel = "linear", scale = FALSE, cost = 1)
plot1 <- plot_svm(test_model_1, test_data) +
ggtitle("Cost = 1")
plot100 <- plot_svm(test_model_100, test_data) +
ggtitle("Cost = 100")
figure <- ggarrange(plot1, plot100,
ncol = 2, nrow = 1,
common.legend = TRUE,
legend = "right",
align = "h"
)
figure <- figure +
ggtitle("Comparative SVM Decision Boundaries: Impact of Cost Parameter Variation")
# Displaying the figure
figure
```

While introducing slack variables and the cost parameter offers flexibility, it is important to balance them:

**Overfitting**: A very high value of \(C\) can result in a decision boundary that is heavily influenced by outliers, leading to overfitting.**Underfitting**: Conversely, a very low value of \(C\) might result in underfitting, where the model becomes too simplistic to capture the underlying patterns in the data.

A common strategy to selecting the optimal value for \(C\) involves:

- Starting with a range of \(C\) values.
- Using cross-validation to evaluate the performance of the SVM for each \(C\).
- Selecting the value of \(C\) that gives the best cross-validation accuracy.

**Robustness**: Soft margin SVM provides robustness against outliers and noisy data, making it suitable for real-world datasets.**Sensitivity**: The performance of soft margin SVM is sensitive to the choice of \(C\). Proper parameter tuning is essential to avoid underfitting or overfitting.**Increased Flexibility**: By introducing the cost parameter, SVM can be tuned to match the specific characteristics and challenges of the dataset in question.

Another option to deal with outliers and non-linearly separable data is the so-called kernel trick.

The power of SVM lies in its capability to transform data into a higher-dimensional space to make it linearly separable, even when it is not in its original space. This transformation is achieved using various kernel functions. The kernel function implicitly computes the dot product between two observations in this higher-dimensional space without us having to explicitly calculate the coordinates in that space. This not only saves computational cost but also allows SVM to capture complex, non-linear relationships in the data.

**Linear Kernel**: This is essentially the dot product of two given observations. The resulting boundary is linear. The linear kernel is represented as: \[K(\textbf{x}, \textbf{y}) = \textbf{x} \cdot \textbf{y}\]**Polynomial Kernel**: It computes the dot product of vectors, raised to a specified power (degree). This kernel can capture polynomial decision boundaries. \[K(\textbf{x}, \textbf{y}) = (c + \textbf{x} \cdot \textbf{y})^d\] where \(d\) is the degree of the polynomial and \(c\) is the constant.**Radial Basis Function (RBF) or Gaussian Kernel**: The RBF kernel is a popular choice and can map an input space into an infinite-dimensional space, making it very powerful for handling non-linear relationships. \[K(\textbf{x}, \textbf{y}) = e^{-\gamma \| \textbf{x} - \textbf{y} \|^2}\] where \(\gamma\) is a parameter that needs to be set.**Sigmoid Kernel**: The sigmoid kernel has properties similar to the RBF kernel and can be used as a proxy for neural networks. \[K(\textbf{x}, \textbf{y}) = \tanh(k\textbf{x} \cdot \textbf{y} + c)\] where \(k\) is the slope of the function and \(c\) is the constant.

```
# create test data
test_data_poly <- create_example_data(100, 2, 2, pattern = "poly")
test_data_radial <- create_example_data(100, 2, 2, pattern = "radial")
test_data_sigmoid <- create_example_data(100, 2, 2, pattern = "sigmoid")
# apply models
test_model_poly <- svm(class ~ ., data = test_data_poly, type = "C-classification", kernel = "polynomial", scale = FALSE, degree = 5)
test_model_radial <- svm(class ~ ., data = test_data_radial, type = "C-classification", kernel = "radial", scale = TRUE)
test_model_sigmoid <- svm(class ~ ., data = test_data_sigmoid, type = "C-classification", kernel = "sigmoid", scale = TRUE)
# plots
plot_data_poly <- plot_clusters(test_data_poly) + ggtitle("SVM Classifier with Polynomial Kernel")
plot_data_radial <- plot_clusters(test_data_radial) + ggtitle("SVM Classifier with Radial Basis Kernel")
plot_data_sigmoid <- plot_clusters(test_data_sigmoid) + ggtitle("SVM Classifier with Sigmoid Kernel")
plot_model_poly <- plot_svm(test_model_poly, test_data_poly)
plot_model_radial <- plot_svm(test_model_radial, test_data_radial)
plot_model_sigmoid <- plot_svm(test_model_sigmoid, test_data_sigmoid)
figure <- ggarrange(
plot_data_poly, plot_model_poly,
plot_data_radial, plot_model_radial,
plot_data_sigmoid, plot_model_sigmoid,
ncol = 2, nrow = 3,
common.legend = TRUE,
legend = "right",
align = "h"
)
figure
```

The choice of kernel and its parameters influences the performance of the SVM. While the RBF kernel works well for many problems, it might not always be the best choice. Cross-validation can be used to tune kernel parameters and select the best kernel for a given problem.

**Flexibility**: The kernel trick offers a flexible way to design decision boundaries, making SVM versatile for a wide range of tasks.**Overfitting**: Using complex kernels (like high-degree polynomial or RBF with a small value of \(\gamma\)) can lead to overfitting, especially with limited training data.**Computational Complexity**: Some kernels can lead to increased training times, especially on large datasets.

Basic SVM is designed for binary classification, but it can be extended to handle multi-class classification problems using mainly two distinctive strategies: One-vs-One (OvO) and One-vs-All/One-vs-Rest (OvA/OvR).

The OvO strategy decomposes the multiclass problem into multiple binary classification problems. Specifically, it constructs one SVM for each pair of classes.

For a problem with \(K\) classes, \(\frac{K \times (K-1)}{2}\) classifiers are built.

Each classifier is trained on the subset of the data belonging to the two classes it distinguishes.

When classifying new data:

The sample is run through all \(\frac{K \times (K-1)}{2}\) classifiers.

Each classifier casts a vote for one class.

The class getting the most votes determines the final predicted class.

**Pros**: Each classifier only needs to be trained on the part of the dataset involving the two classes it focuses on, which can be beneficial for the computational cost and memory usage when the dataset is large.**Cons**: The strategy could become computationally intensive as the number of classes (\(K\)) grows, due to the need to train and store \(\frac{K \times (K-1)}{2}\) models.

OvA constructs one SVM for each class.

- The SVM for a class is trained with all of the data from that class being the positive class and all data from other classes forming the negative class.

When classifying new data:

The classifier corresponding to each class computes a value.

The class whose classifier outputs the highest value is chosen as the prediction.

**Pros**: Only \(K\) classifiers are constructed, which is computation-wise better compared to OvO in case of a large number of classes.**Cons**: It involves imbalances, because typically the set of negatives is much larger than the set of positives. This potentially impacts performance and the placement of the decision boundary.

```
# Example Data
set.seed(123)
multi_class_data <- data.frame(
feature1 = c(rnorm(50, mean = 20), rnorm(50, mean = 10, sd = 3), rnorm(50, mean = 15, sd = 2), rnorm(50, mean = 20)),
feature2 = c(rnorm(50, mean = 20, sd = 2), rnorm(50, mean = 15), rnorm(50, mean = 10), rnorm(50, mean = 5, sd = 2)),
class = as.factor(rep(1:4, each = 50))
)
# Training the SVM Model
multi_svm_model <- svm(class ~ ., data = multi_class_data, type = "C-classification", kernel = "radial")
# Visualizing the Model (only feasible for 2D data)
plot(multi_svm_model, multi_class_data)
```