The k-Nearest Neighbors (k-NN) algorithm is one of the simplest and most widely used supervised machine learning methods. It classifies data points based on the classes of their nearest neighbors.
Generally, k-NN is used as a multi-class classifier. A common task is to classify new data that has been added to an already classified dataset. Let's start by exploring how to use k-NN in R before we dive into the details of the algorithm.
Let us have a look at the following case:
library(ggplot2)
# Create synthetic data for 3 classes
set.seed(42)
n <- 50 # number of samples per class
data <- data.frame(
  x = c(rnorm(n, 2), rnorm(n, 6), rnorm(n, 4)),
  y = c(rnorm(n, 3), rnorm(n, 7), rnorm(n, 4)),
  class = factor(rep(1:3, each = n))
)
# Define new data points
new_points <- data.frame(
  x = c(1, 5, 4, 2, 3, 1),
  y = c(2, 6, 5, 5, 2, 7)
)
# Plotting function
plot_classification <- function(data, classes) {
  p <- ggplot(data) +
    geom_point(aes(x = x, y = y, color = factor(classes), shape = factor(classes)), size = 3) +
    scale_color_manual(values = c("blue", "darkgreen", "purple"), name = "Class") +
    scale_shape_manual(values = c(15, 17, 18), name = "Class") +
    labs(
      title = "k-NN Classification",
      x = "Feature 1",
      y = "Feature 2"
    ) +
    theme_minimal(base_size = 11) +
    theme(
      plot.title = element_text(size = 22, hjust = 0.5),
      plot.margin = margin(1, 1, 1, 1, "cm"),
      axis.title.x = element_text(angle = 0, vjust = 1, size = 14),
      axis.title.y = element_text(angle = 90, vjust = 2, size = 14)
    )
  return(p)
}
plot_classification(data, data$class) +
  geom_point(data = new_points, aes(x = x, y = y), color = "red", size = 3, shape = 16) +
  labs(title = "Toy Data")
We have a dataset that is divided into three different classes (1, 2 and 3). Then, we add new, unclassified data (red dots).
Let us now perform a k-NN classification using the knn() function from the class package. This function takes the training data and their labels, as well as the test data (our new_points) and k (the number of neighbors to consider), as input.
library(class)
# Specifying number of neighbors
k <- 7
# Performing k-NN classification
predicted_labels <- knn(
  train = data[, c("x", "y")],
  test = new_points[, c("x", "y")],
  cl = data$class,
  k = k
)
The predicted_labels object now contains the predicted classes for our new data points. We can visualize the result with our plot function.
new_classified_points <- new_points
new_classified_points$class <- predicted_labels
extended_data <- rbind(data, new_classified_points)
plot_classification(extended_data, extended_data$class) +
  geom_point(data = new_points, aes(x = x, y = y), color = "red", size = 4, shape = 5)
The new data points have each been assigned to a class according to their seven closest neighbors. Now, let's delve deeper into the mechanics of the k-NN algorithm.
When classifying data, k-NN operates by identifying the k data points from the training data that are nearest to the unclassified point. The class of the new point is then determined by majority vote, meaning it is assigned the most common class among its k nearest neighbors.
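In R, such a majority vote can be implemented with table() and which.max(). A minimal sketch with made-up neighbor labels:

# hypothetical labels of the k = 5 nearest neighbors (for illustration only)
neighbor_labels <- factor(c(2, 1, 2, 3, 2))
votes <- table(neighbor_labels) # votes per class: 1 -> 1, 2 -> 3, 3 -> 1
names(which.max(votes)) # "2", the majority class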
The choice of distance metric to determine the neighbors depends on the problem. Usually, the Euclidean distance between two points \(p, q \in \mathbb{R}^n\) is used: \[d(p,q) = \|q-p\|_2 = \sqrt{\sum_{i=1}^{n}(q_{i}-p_{i})^{2}},\]
where \(n\) corresponds to the number of features.
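The formula translates directly into R. For example, for the points \(p = (1, 2)\) and \(q = (4, 6)\):

# Euclidean distance computed from the formula above
p <- c(1, 2)
q <- c(4, 6)
sqrt(sum((q - p)^2)) # 5
# equivalently, using the built-in dist() function
dist(rbind(p, q)) # 5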
The steps for classifying a point using k-NN are:

1. Choose k, the number of neighbors to consider.
2. Compute the distances between the new point and all points in the training data.
3. Select the k closest points.
4. Assign the most common class among the k neighbors to the new point.

To showcase k-NN, we develop the algorithm step by step. Let us create a new training data set and some new, unclassified data points:
# create training data
set.seed(42)
n <- 25 # number of samples per class
train_data <- data.frame(
  x = c(rnorm(n, 2.5, 0.7), rnorm(n, 5, 0.8), rnorm(n, 4)),
  y = c(rnorm(n, 4, 0.7), rnorm(n, 4.5, 0.8), rnorm(n, 4)),
  class = factor(rep(1:3, each = n))
)
# create test data
set.seed(43)
new_points <- data.frame(x = rnorm(4, mean = 4, sd = 1.5), y = rnorm(4, mean = 4, sd = 1.5))
new_points$class <- c(0, 0, 0, 0)
# Visualize Original Data
plot_classification(train_data, train_data$class) +
  geom_point(data = new_points, aes(x = x, y = y), color = "red", size = 3, shape = 16) +
  labs(title = "Training Data and New Points")
We define our function k_nearest_neighbors() by specifying the following arguments: the training dataset, the test data, and the parameter \(k\). We build it up step by step, starting from an empty skeleton:
k_nearest_neighbors <- function(train_data, test, k) {
}
k <- 5
First, we implement the Euclidean distance as a helper function:

k_nearest_neighbors <- function(train_data, test, k) {
  # Euclidean distance
  calculate_distance <- function(a, b) {
    sqrt(sum((a - b)^2))
  }
}
Next, we add a helper that computes the distances between the new point and all training points and selects the k closest ones:

k_nearest_neighbors <- function(train_data, test, k) {
  # Euclidean distance
  calculate_distance <- function(a, b) {
    sqrt(sum((a - b)^2))
  }
  # compute distance between new point and all training data points
  find_neighbors <- function(train_data, test_instance, k) {
    distances <- train_data
    distances$distance <- vector("numeric", nrow(train_data))
    for (i in 1:nrow(distances)) {
      distances$distance[i] <- calculate_distance(train_data[i, 1:2], test_instance[, 1:2])
    }
    # select k closest points
    distances <- distances[order(distances$distance), ]
    neighbors <- distances[1:k, ]
    return(neighbors)
  }
  neighbors <- find_neighbors(train_data, test, k)
  return(neighbors)
}
# Example with a random test point
neighbors <- k_nearest_neighbors(train_data, new_points[1, ], k)
# Plotting training data, test instance, and neighbors
plot_classification(train_data, train_data$class) +
  geom_point(data = new_points[1, ], aes(x = x, y = y), size = 3, shape = 16, color = "red") +
  geom_line(data = rbind(new_points[1, 1:2], neighbors[1, 1:2]), aes(x = x, y = y, group = 1), linetype = "dashed", color = "black") +
  geom_line(data = rbind(new_points[1, 1:2], neighbors[2, 1:2]), aes(x = x, y = y), linetype = "dashed", color = "black") +
  geom_line(data = rbind(new_points[1, 1:2], neighbors[3, 1:2]), aes(x = x, y = y), linetype = "dashed", color = "black") +
  geom_line(data = rbind(new_points[1, 1:2], neighbors[4, 1:2]), aes(x = x, y = y), linetype = "dashed", color = "black") +
  geom_line(data = rbind(new_points[1, 1:2], neighbors[5, 1:2]), aes(x = x, y = y), linetype = "dashed", color = "black") +
  labs(title = paste("k-NN: Test Point and its", k, "Nearest Neighbors"))
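As a side note, the five geom_line() calls can be collapsed into a single geom_segment() layer that connects the test point to all of its neighbors at once; a sketch using the neighbors data frame from above:

plot_classification(train_data, train_data$class) +
  geom_point(data = new_points[1, ], aes(x = x, y = y), size = 3, shape = 16, color = "red") +
  # one dashed segment from each neighbor to the test point
  geom_segment(
    data = neighbors,
    aes(x = x, y = y, xend = new_points$x[1], yend = new_points$y[1]),
    linetype = "dashed", color = "black"
  ) +
  labs(title = paste("k-NN: Test Point and its", k, "Nearest Neighbors"))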
Next, we assign the label via a majority vote of the k neighbors:
k_nearest_neighbors <- function(train_data, test, k) {
  # Euclidean distance
  calculate_distance <- function(a, b) {
    sqrt(sum((a - b)^2))
  }
  # compute distance between new point and all training data points
  find_neighbors <- function(train_data, instance, k) {
    distances <- train_data
    distances$distance <- vector("numeric", nrow(train_data))
    for (i in 1:nrow(distances)) {
      distances$distance[i] <- calculate_distance(train_data[i, 1:2], instance[, 1:2])
    }
    # select k closest points
    distances <- distances[order(distances$distance), ]
    neighbors <- distances[1:k, ]
    return(neighbors)
  }
  # assign label to test point
  assign_label <- function(neighbors_labels) {
    as.integer(names(which.max(table(neighbors_labels))))
  }
  neighbors <- find_neighbors(train_data, test, k)
  assigned_label <- assign_label(neighbors$class)
  return(assigned_label)
}
# Testing label assignment
assigned_label <- k_nearest_neighbors(train_data, new_points[1, ], k)
print(paste("Assigned label for the test instance:", assigned_label))
## [1] "Assigned label for the test instance: 3"
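Note that which.max() returns the first maximum, so ties among the k neighbors are resolved in favor of the class that appears first in the table (the smallest label here). Choosing an odd k reduces, though with three classes does not eliminate, the chance of ties:

# tie between classes 1 and 2: which.max() picks the first maximum
tied_votes <- table(factor(c(1, 1, 2, 2)))
names(which.max(tied_votes)) # "1"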
Finally, we apply steps 2 to 4 to all new instances.
# k-NN algorithm
k_nearest_neighbors <- function(train_data, test, k) {
  ### input: training data (data frame: points and labels), test data (data frame: points), k (integer) ###
  ### output: test points and predicted labels, k nearest neighbors (list) ###
  # Euclidean distance
  calculate_distance <- function(a, b) {
    sqrt(sum((a - b)^2))
  }
  # compute distance between new point and all training data points
  find_neighbors <- function(train_data, instance, k) {
    distances <- train_data
    distances$distance <- vector("numeric", nrow(train_data))
    for (i in 1:nrow(distances)) {
      distances$distance[i] <- calculate_distance(train_data[i, 1:2], instance[, 1:2])
    }
    # select k closest points
    distances <- distances[order(distances$distance), ]
    neighbors <- distances[1:k, ]
    return(neighbors)
  }
  # assign label to test point
  assign_label <- function(neighbors_labels) {
    as.integer(names(which.max(table(neighbors_labels))))
  }
  # repeat for all test points
  assigned_labels <- vector("numeric", nrow(test))
  neighbors_list <- list()
  for (i in 1:nrow(test)) {
    neighbors <- find_neighbors(train_data, test[i, ], k)
    assigned_labels[i] <- assign_label(neighbors$class)
    neighbors_list <- append(neighbors_list, list(neighbors))
  }
  # create output list of classified points, their assigned labels and the respective neighbors
  classified_points <- test[, 1:2]
  classified_points$class <- assigned_labels
  result <- list(
    classified_points = list(classified_points),
    neighbors = neighbors_list
  )
  return(result)
}
knn_result <- k_nearest_neighbors(train_data, new_points, k)
extended_data <- rbind(train_data, as.data.frame(knn_result$classified_points))
# plot output
p <- plot_classification(extended_data, extended_data$class)
# test points and neighbors
for (i in 1:4) {
  p <- p +
    geom_point(data = knn_result$classified_points[[1]][i, 1:2], aes(x = x, y = y), size = 4, shape = 5, color = "red") +
    geom_line(data = rbind(knn_result$classified_points[[1]][i, 1:2], knn_result$neighbors[[i]][1, 1:2]), aes(x = x, y = y), linetype = "dashed", color = "black") +
    geom_line(data = rbind(knn_result$classified_points[[1]][i, 1:2], knn_result$neighbors[[i]][2, 1:2]), aes(x = x, y = y), linetype = "dashed", color = "black") +
    geom_line(data = rbind(knn_result$classified_points[[1]][i, 1:2], knn_result$neighbors[[i]][3, 1:2]), aes(x = x, y = y), linetype = "dashed", color = "black") +
    geom_line(data = rbind(knn_result$classified_points[[1]][i, 1:2], knn_result$neighbors[[i]][4, 1:2]), aes(x = x, y = y), linetype = "dashed", color = "black") +
    geom_line(data = rbind(knn_result$classified_points[[1]][i, 1:2], knn_result$neighbors[[i]][5, 1:2]), aes(x = x, y = y), linetype = "dashed", color = "black")
}
print(p)
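As a plausibility check, we can compare our implementation against the knn() function from the class package. With identical training data, test points, and k, the two should agree up to tie-breaking (which knn() resolves at random):

library(class)
# predictions from the class package
pred_class <- knn(
  train = train_data[, c("x", "y")],
  test = new_points[, c("x", "y")],
  cl = train_data$class,
  k = k
)
# predictions from our own implementation
pred_own <- knn_result$classified_points[[1]]$class
# the two label vectors should agree (apart from possible ties)
all(as.integer(as.character(pred_class)) == pred_own)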
The choice of k can significantly influence the classification. A small k might capture noise in the data, while a large k may smooth over the data, ignoring smaller, potentially relevant patterns. Various strategies, such as cross-validation, can be utilized to select an optimal k.
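For instance, leave-one-out cross-validation with knn.cv() from the class package estimates the accuracy for each candidate value of k; a sketch using the training data from above:

library(class)
# leave-one-out cross-validated accuracy for a range of candidate k values
candidate_k <- 1:15
cv_accuracy <- sapply(candidate_k, function(k) {
  pred <- knn.cv(
    train = train_data[, c("x", "y")],
    cl = train_data$class,
    k = k
  )
  mean(pred == train_data$class) # proportion of correct classifications
})
# candidate with the highest cross-validated accuracy
candidate_k[which.max(cv_accuracy)]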