Since Cortes and Vapnik,1995 introduced the concept of Support Vector Networks Support Vector Machines (SVM) (SVM) experienced a fast increasing attention for supervised machine learning applications primarily used for classification tasks. However, it can also be used for regression, which is then referred to as Support Vector Regression (SVR). A nice reference for SVM is provided by Steinwart & Christmann, 2008.

Given a set of labeled data, an SVM algorithm fits a model to the data such that it is optimally divided into the given categories by finding a hyperplane that maximizes the distance between them. Newly added (unlabeled) data points can then be effectively assigned to one of the categories.

Mathematical Foundation of SVM

The primary objective of SVM is to find a hyperplane that optimally separates the data into different classes. This involves solving an optimization problem to find the parameters \(w\) and \(b\) of the decision function:

\[f(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + b\] Where:

The purpose of the bias term is to shift the decision boundary (hyperplane) away from the origin. Without the bias term, the decision boundary would always go through the origin \((0,0,...,0)\), which would limit the capacity of the model. The optimization aims to maximize the margin while minimizing misclassification. This is formulated as follows:

\[\min_{\mathbf{w}, b, \boldsymbol{\xi}} \frac{1}{2} ||\mathbf{w}||^2 + C \sum_{i=1}^{N} \xi_i\] Subject to: \[y_i (\mathbf{w}^T \mathbf{x}_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0 \] Where:

The so-called support vectors are the data points that lie closest to the decision boundary and are the most difficult to classify. They are critical to defining the optimal hyperplane because if they are moved, the position of the hyperplane changes. Support vectors have to satisfy the following condition:

\[y_i (\mathbf{w}^T \mathbf{x}_i + b) = 1\]

Linear SVM Classifier

Linear SVM Classifier are primarily used for binary classification tasks. To get an intuition, we will use an SVM to classify a set of toy data points.

A Simple Example

For demonstration purposes, we will generate a data set of 25 observations and 2 features (for a real data set this could be for example temperature and humidity). Then, we label each observation by assigning it to a class (here either class \(-1\)` or \(1\)) and the shift those classes a little away from each other. This way, we produce a data set with two clusters.

create_example_data <- function(n, f, seed, shift = 3, pattern = "linear") {
  ### n - number of observations, f - number of features/features ###

  # data set
  x <- matrix(rnorm(f * n), n, f)

  # vector that assigns each observation to a class (out of two classes)
  y <- rep(c(-1, 1), c(n / f, n / f))

  if (pattern == "linear") {
    # shifting all observations assigned to class "1" for linearly separable data
    x[y == 1, ] <- x[y == 1, ] + shift
  if (pattern == "poly") {
    # create a non-linear decision boundary
    x[y == 1, 2] <- x[y == 1, 1]^3 + rnorm(sum(y == 1), 0, 1) + 2 * shift
    x[y == -1, 2] <- x[y == -1, 1]^3 + rnorm(sum(y == -1), 1)
  if (pattern == "radial") {
    # create a non-linear decision boundary
    x[y == 1, 2] <- x[y == 1, 1]^2 + rnorm(sum(y == 1), 0, 0.5)
    x[y == -1, ] <- x[y == -1, ] * 0.4
    x[y == -1, 2] <- x[y == -1, 2] + shift
  if (pattern == "sigmoid") {
    # create a non-linear decision boundary
    x[y == 1, 2] <- 1 / (1 + exp(-5 * x[y == 1, 1])) - 1 + rnorm(sum(y == 1), 0, 0.1)
    x[y == -1, 2] <- 2 / (1 + exp(-5 * x[y == -1, 1])) + rnorm(sum(y == 1), 0, 0.1)

  # create a data frame
  data.frame(Feature1 = x[, 1], Feature2 = x[, 2], class = as.factor(y))

toy_data <- create_example_data(30, 2, 111)

Let us visualize our data set.


plot_clusters <- function(data, class) {
  ggplot(data, aes(x = data[, 1], y = data[, 2], color = class)) +
    geom_point(size = 4) +
      title = "Toy Data",
      x = "Feature 1",
      y = "Feature 2"
    ) +
    theme_minimal(base_size = 13) +
    theme(plot.margin = margin(1, 1, 1, 1, "cm"), plot.title = element_text(size = 22, hjust = 0.0)) +
    scale_color_manual(values = c("#f6525a", "cornflowerblue"))

plot_clusters(toy_data, toy_data$class)

As we can observe, the points are distinctly separated into two clusters.

Applying SVM

We will apply the SVM algorithm using the svm function from the e1071 package. The goal is to create a hyperplane that can accurately classify the data points.


model <- svm(class ~ ., data = toy_data, type = "C-classification", kernel = "linear", scale = FALSE)

Let us plot our model and the training data. This can simply be done by using the plot function.

plot(model, toy_data, color.palette = terrain.colors)

The plot shows how the SVM divides the data set into two regions. The hyperplane, for a two-dimensional data set simply a line, is called decision boundary. Every data point below the the decision boundary is assigned to class \(-1\) and every data point above to class \(1\). It is obvious that there are many possible hyperplanes for separating both clusters. In SVM’s, the best separating hyperplane maximizes the distance, or margin, between the hyperplane and the data points closest by the hyperplane(. Those data points are called (support vectors), indicated with an x in the plot above. The support vectors are the training data to find the decision boundary. Note that the SVM plot above breaks a convention by plotting the first feature on the vertical axis.

By using the print function, we get a summary of our model.

## Call:
## svm(formula = class ~ ., data = toy_data, type = "C-classification", 
##     kernel = "linear", scale = FALSE)
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  1 
## Number of Support Vectors:  3

Our SVM model uses 3 support vectors, is of type “C-classification”, has a linear kernel and a cost parameter of \(1\) (we will come to this later). Calling the names lists all components of our model.

##  [1] "call"            "type"            "kernel"          "cost"           
##  [5] "degree"          "gamma"           "coef0"           "nu"             
##  [9] "epsilon"         "sparse"          "scaled"          "x.scale"        
## [13] "y.scale"         "nclasses"        "levels"          "tot.nSV"        
## [17] "nSV"             "labels"          "SV"              "index"          
## [21] "rho"             "compprob"        "probA"           "probB"          
## [25] "sigma"           "coefs"           "na.action"       "fitted"         
## [29] "decision.values" "terms"

Visualizing decision boundary and margins

Let us now create a data grid and predict for each data point its class according to our SVM model. We us the apply function to get the range of each of the features and define an evenly spaced grid within them. By using the predict function, we assign a each of the data points to either class \(1\) or \(-1\). We also want to visualize the decision boundary and the margins. A linear decision boundary can simply be described by: \[\beta_0 + \beta_1 x_1 + \beta_2x_2 = 0\] Since we want \(x_1\) as the independent variable, we get \(x_2 = - \frac{\beta_1}{\beta_2}x_1 - \frac{\beta_1}{\beta_2}\).

SV contains the support vectors, rho the negative intercept,

beta <- t(model$coefs) %*% model$SV
beta0 <- -model$rho
plot_svm <- function(svm_model, data) {
  color_1 <- "firebrick1"
  color_2 <- "dodgerblue"

  # create grid of feature space
  var_range <- apply(data[, 1:2], 2, range)
  var1 <- seq(from = var_range[1, 1], to = var_range[2, 1], length = 100)
  var2 <- seq(from = var_range[1, 2], to = var_range[2, 2], length = 100)
  data_grid <- expand.grid(Feature1 = var1, Feature2 = var2)

  # classify grid points
  label_grid <- predict(svm_model, data_grid)

  # classes of training data
  Class <- data$class

  # plot original data
  plot <- ggplot(data, aes(x = data[, 1], y = data[, 2], color = Class)) +
    scale_color_manual(values = c(color_1, color_2)) +
    geom_point(size = 4) +
    # plot labeled data grid
      data = data_grid, aes(x = data_grid[, 1], y = data_grid[, 2]),
      col = c(color_1, color_2)[as.numeric(label_grid)], size = 0.2
    ) +
    # plot support vectors
      data = data[svm_model$index, ], aes(x = data[svm_model$index, 1], y = data[svm_model$index, 2]),
      col = "black", size = 5, shape = 8

  # plot decision boundary and margins for linear kernels
  if (svm_model$kernel == 0) {
    title <- "SVM Classifier with linear Kernel"
    beta <- t(svm_model$coefs) %*% svm_model$SV
    beta0 <- -svm_model$rho
    plot <- plot +
      geom_abline(slope = -beta[1] / beta[2], intercept = -beta0 / beta[2]) +
      geom_abline(slope = -beta[1] / beta[2], intercept = -(beta0 - 1) / beta[2], linetype = "dashed") +
      geom_abline(slope = -beta[1] / beta[2], intercept = -(beta0 + 1) / beta[2], linetype = "dashed")
  } else {
    title <- ""
  # layout
  plot <- plot +
      title = title,
      x = "Feature 1",
      y = "Feature 2"
    ) +
    theme_minimal(base_size = 14) +
    theme(plot.margin = margin(1, 1, 1, 1, "cm"), plot.title = element_text(size = 22, hjust = 0.5))


plot_svm(model, toy_data)

What about outliers?

The ideal scenario for applying a linear SVM is when the data is linearly separable. However, we know from our experience that data sets can have outliers that are closer to a different class than to their own. Let us look at the following example:

test_data <- create_example_data(100, 2, 7)
plot_clusters(test_data, test_data$class) +
  labs(title = "Intermixing Outliers from Different Clusters")

Data points of both classes mix at the border of both classes. The data set is clearly not separable by a single line. There are multiple ways to deal with this.

1. Soft Margins

In a “hard margin” SVM, the algorithm strictly enforces that all observations are classified correctly, which is problematic in the presence of outliers. Conversely, a “soft margin” SVM introduces a degree of flexibility, allowing certain points to be misclassified in favor of achieving a broader, more generalizable margin.

  1. Slack Variables: To accommodate the misclassification of difficult or noisy points, soft margin SVM introduces slack variables. These variables measure the degree of misclassification for each data point.

  2. Cost Parameter (C): The introduction of slack variables brings a new hyperparameter to the SVM: the cost parameter, denoted as \(C\). This parameter controls the trade-off between maximizing the margin and minimizing the classification error on the training data.

    • A small value of \(C\) makes the cost of misclassification low, focusing more on maximizing the margin. This could lead to a broader margin at the expense of a few misclassifications, which can be beneficial in the presence of outliers.
    • A large value of \(C\) prioritizes minimizing the classification error on the training data, potentially resulting in a narrower margin.

test_model_100 <- svm(class ~ ., data = test_data, type = "C-classification", kernel = "linear", scale = FALSE, cost = 100)
test_model_1 <- svm(class ~ ., data = test_data, type = "C-classification", kernel = "linear", scale = FALSE, cost = 1)

plot1 <- plot_svm(test_model_1, test_data) +
  ggtitle("Cost = 1")

plot100 <- plot_svm(test_model_100, test_data) +
  ggtitle("Cost = 100")

figure <- ggarrange(plot1, plot100,
  ncol = 2, nrow = 1,
  common.legend = TRUE,
  legend = "right",
  align = "h"

figure <- figure +
  ggtitle("Comparative SVM Decision Boundaries: Impact of Cost Parameter Variation")

# Displaying the figure

Balancing Margins and Misclassifications:

While introducing slack variables and the cost parameter offers flexibility, it is important to balance them:

  • Overfitting: A very high value of \(C\) can result in a decision boundary that is heavily influenced by outliers, leading to overfitting.

  • Underfitting: Conversely, a very low value of \(C\) might result in underfitting, where the model becomes too simplistic to capture the underlying patterns in the data.


Choosing the right \(C\):

A common strategy to selecting the optimal value for \(C\) involves:

  1. Starting with a range of \(C\) values.
  2. Using cross-validation to evaluate the performance of the SVM for each \(C\).
  3. Selecting the value of \(C\) that gives the best cross-validation accuracy.

Benefits and Limitations:

  • Robustness: Soft margin SVM provides robustness against outliers and noisy data, making it suitable for real-world datasets.

  • Sensitivity: The performance of soft margin SVM is sensitive to the choice of \(C\). Proper parameter tuning is essential to avoid underfitting or overfitting.

  • Increased Flexibility: By introducing the cost parameter, SVM can be tuned to match the specific characteristics and challenges of the dataset in question.

Another option to deal with outliers and non-linearly separable data is the so-called kernel trick.

2. The Kernel Trick

The power of SVM lies in its capability to transform data into a higher-dimensional space to make it linearly separable, even when it is not in its original space. This transformation is achieved using various kernel functions. The kernel function implicitly computes the dot product between two observations in this higher-dimensional space without us having to explicitly calculate the coordinates in that space. This not only saves computational cost but also allows SVM to capture complex, non-linear relationships in the data.

Commonly Used Kernels:

  1. Linear Kernel: This is essentially the dot product of two given observations. The resulting boundary is linear. The linear kernel is represented as: \[K(\textbf{x}, \textbf{y}) = \textbf{x} \cdot \textbf{y}\]

  2. Polynomial Kernel: It computes the dot product of vectors, raised to a specified power (degree). This kernel can capture polynomial decision boundaries. \[K(\textbf{x}, \textbf{y}) = (c + \textbf{x} \cdot \textbf{y})^d\] where \(d\) is the degree of the polynomial and \(c\) is the constant.

  3. Radial Basis Function (RBF) or Gaussian Kernel: The RBF kernel is a popular choice and can map an input space into an infinite-dimensional space, making it very powerful for handling non-linear relationships. \[K(\textbf{x}, \textbf{y}) = e^{-\gamma \| \textbf{x} - \textbf{y} \|^2}\] where \(\gamma\) is a parameter that needs to be set.

  4. Sigmoid Kernel: The sigmoid kernel has properties similar to the RBF kernel and can be used as a proxy for neural networks. \[K(\textbf{x}, \textbf{y}) = \tanh(k\textbf{x} \cdot \textbf{y} + c)\] where \(k\) is the slope of the function and \(c\) is the constant.

# create test data
test_data_poly <- create_example_data(100, 2, 2, pattern = "poly")
test_data_radial <- create_example_data(100, 2, 2, pattern = "radial")
test_data_sigmoid <- create_example_data(100, 2, 2, pattern = "sigmoid")

# apply models
test_model_poly <- svm(class ~ ., data = test_data_poly, type = "C-classification", kernel = "polynomial", scale = FALSE, degree = 5)
test_model_radial <- svm(class ~ ., data = test_data_radial, type = "C-classification", kernel = "radial", scale = TRUE)
test_model_sigmoid <- svm(class ~ ., data = test_data_sigmoid, type = "C-classification", kernel = "sigmoid", scale = TRUE)

# plots
plot_data_poly <- plot_clusters(test_data_poly) + ggtitle("SVM Classifier with Polynomial Kernel")
plot_data_radial <- plot_clusters(test_data_radial) + ggtitle("SVM Classifier with Radial Basis Kernel")
plot_data_sigmoid <- plot_clusters(test_data_sigmoid) + ggtitle("SVM Classifier with Sigmoid Kernel")

plot_model_poly <- plot_svm(test_model_poly, test_data_poly)
plot_model_radial <- plot_svm(test_model_radial, test_data_radial)
plot_model_sigmoid <- plot_svm(test_model_sigmoid, test_data_sigmoid)

figure <- ggarrange(
  plot_data_poly, plot_model_poly,
  plot_data_radial, plot_model_radial,
  plot_data_sigmoid, plot_model_sigmoid,
  ncol = 2, nrow = 3,
  common.legend = TRUE,
  legend = "right",
  align = "h"


The choice of kernel and its parameters influences the performance of the SVM. While the RBF kernel works well for many problems, it might not always be the best choice. Cross-validation can be used to tune kernel parameters and select the best kernel for a given problem.

Benefits and Limitations:

  • Flexibility: The kernel trick offers a flexible way to design decision boundaries, making SVM versatile for a wide range of tasks.

  • Overfitting: Using complex kernels (like high-degree polynomial or RBF with a small value of \(\gamma\)) can lead to overfitting, especially with limited training data.

  • Computational Complexity: Some kernels can lead to increased training times, especially on large datasets.

Multiclass Classification with SVM

Basic SVM is designed for binary classification, but it can be extended to handle multi-class classification problems using mainly two distinctive strategies: One-vs-One (OvO) and One-vs-All/One-vs-Rest (OvA/OvR).

One-vs-One Strategy

The OvO strategy decomposes the multiclass problem into multiple binary classification problems. Specifically, it constructs one SVM for each pair of classes.

  • For a problem with \(K\) classes, \(\frac{K \times (K-1)}{2}\) classifiers are built.

  • Each classifier is trained on the subset of the data belonging to the two classes it distinguishes.

Class Prediction:

When classifying new data:

  • The sample is run through all \(\frac{K \times (K-1)}{2}\) classifiers.

  • Each classifier casts a vote for one class.

  • The class getting the most votes determines the final predicted class.

Benefits and Limitations:

  • Pros: Each classifier only needs to be trained on the part of the dataset involving the two classes it focuses on, which can be beneficial for the computational cost and memory usage when the dataset is large.

  • Cons: The strategy could become computationally intensive as the number of classes (\(K\)) grows, due to the need to train and store \(\frac{K \times (K-1)}{2}\) models.

One-vs-All Strategy

OvA constructs one SVM for each class.

  • The SVM for a class is trained with all of the data from that class being the positive class and all data from other classes forming the negative class.

Class Prediction:

When classifying new data:

  • The classifier corresponding to each class computes a value.

  • The class whose classifier outputs the highest value is chosen as the prediction.

Benefits and Limitations:

  • Pros: Only \(K\) classifiers are constructed, which is computation-wise better compared to OvO in case of a large number of classes.

  • Cons: It involves imbalances, because typically the set of negatives is much larger than the set of positives. This potentially impacts performance and the placement of the decision boundary.


# Example Data
multi_class_data <- data.frame(
  feature1 = c(rnorm(50, mean = 20), rnorm(50, mean = 10, sd = 3), rnorm(50, mean = 15, sd = 2), rnorm(50, mean = 20)),
  feature2 = c(rnorm(50, mean = 20, sd = 2), rnorm(50, mean = 15), rnorm(50, mean = 10), rnorm(50, mean = 5, sd = 2)),
  class = as.factor(rep(1:4, each = 50))

# Training the SVM Model
multi_svm_model <- svm(class ~ ., data = multi_class_data, type = "C-classification", kernel = "radial")

# Visualizing the Model (only feasible for 2D data)
plot(multi_svm_model, multi_class_data)