The k-Nearest Neighbors (k-NN) algorithm is one of the simplest and most widely used supervised machine learning methods. It classifies data points based on the classes of their nearest neighbors.
Generally, k-NN is used as a multi-class classifier. A common task is to classify new data that has been added to an already classified dataset. Let's start by exploring how to use k-NN in R before we dive into the details of the algorithm.
Let us have a look at the following case:
library(ggplot2)
# Create synthetic data for 3 classes
set.seed(42)
n <- 50 # number of samples per class
data <- data.frame(
  x = c(rnorm(n, 2), rnorm(n, 6), rnorm(n, 4)),
  y = c(rnorm(n, 3), rnorm(n, 7), rnorm(n, 4)),
  class = factor(rep(1:3, each = n))
)
# Define new data points
new_points <- data.frame(
  x = c(1, 5, 4, 2, 3, 1),
  y = c(2, 6, 5, 5, 2, 7)
)
# Plotting function
plot_classification <- function(data, classes) {
  p <- ggplot(data) +
    geom_point(aes(x = x, y = y, color = factor(classes), shape = factor(classes)), size = 3) +
    scale_color_manual(values = c("blue", "darkgreen", "purple"), name = "Class") +
    scale_shape_manual(values = c(15, 17, 18), name = "Class") +
    labs(
      title = "k-NN Classification",
      x = "Feature 1",
      y = "Feature 2"
    ) +
    theme_minimal(base_size = 11) +
    theme(
      plot.title = element_text(size = 22, hjust = 0.5),
      plot.margin = margin(1, 1, 1, 1, "cm"),
      axis.title.x = element_text(angle = 0, vjust = 1, size = 14),
      axis.title.y = element_text(angle = 90, vjust = 2, size = 14)
    )
  return(p)
}
plot_classification(data, data$class) +
  geom_point(data = new_points, aes(x = x, y = y), color = "red", size = 3, shape = 16) +
  labs(title = "Toy Data")
We have a dataset that is divided into three different classes (1, 2 and 3). Then, we add new, unclassified data (red dots).
Let us now perform a k-NN classification using the knn() function from the class package. This function takes the training data and their labels, as well as the test data (our new_points) and k (the number of neighbors to consider), as input.
library(class)
# Specifying number of neighbors
k <- 7
# Performing k-NN classification
predicted_labels <- knn(
  train = data[, c("x", "y")],
  test = new_points[, c("x", "y")],
  cl = data$class,
  k = k
)
The predicted_labels object now contains the predicted classes for our new data points. We can visualize the result with our plot function.
new_classified_points <- new_points
new_classified_points$class <- predicted_labels
extended_data <- rbind(data, new_classified_points)
plot_classification(extended_data, extended_data$class) +
  geom_point(data = new_points, aes(x = x, y = y), color = "red", size = 4, shape = 5)
The new data points have each been assigned to a class according to their seven closest neighbors. Now, let's delve deeper into the mechanics of the k-NN algorithm.
When classifying data, k-NN operates by identifying the k data points from the training data that are nearest to the unclassified point. The class of the new point is then determined by majority vote, meaning it is assigned the most common class among its k nearest neighbors.
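In R, such a majority vote can be implemented with table() and which.max(). A minimal sketch with made-up neighbor labels:

# hypothetical labels of the k = 5 nearest neighbors (for illustration only)
neighbor_labels <- factor(c(2, 1, 2, 3, 2))
votes <- table(neighbor_labels) # votes per class: 1 -> 1, 2 -> 3, 3 -> 1
names(which.max(votes)) # "2", the majority class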
The choice of distance metric to determine the neighbors depends on the problem. Usually, the Euclidean distance between two points \(p, q \in \mathbb{R}^n\) is used: \[d(p,q) = \|q-p\|_2 = \sqrt{\sum_{i=1}^{n}(q_{i}-p_{i})^{2}},\]
where \(n\) corresponds to the number of features.
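The formula translates directly into R. For example, for the points \(p = (1, 2)\) and \(q = (4, 6)\):

# Euclidean distance computed from the formula above
p <- c(1, 2)
q <- c(4, 6)
sqrt(sum((q - p)^2)) # 5
# equivalently, using the built-in dist() function
dist(rbind(p, q)) # 5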
The steps for classifying a point using k-NN are:

1. Choose k, the number of neighbors to consider.
2. Compute the distances between the new point and all points in the training data.
3. Select the k closest points.
4. Assign the most common class among the k neighbors to the new point.

To showcase k-NN, we develop the algorithm step by step. Let us create a new training data set and some new, unclassified data points:
# create training data
set.seed(42)
n <- 25 # number of samples per class
train_data <- data.frame(
  x = c(rnorm(n, 2.5, 0.7), rnorm(n, 5, 0.8), rnorm(n, 4)),
  y = c(rnorm(n, 4, 0.7), rnorm(n, 4.5, 0.8), rnorm(n, 4)),
  class = factor(rep(1:3, each = n))
)
# create test data
set.seed(43)
new_points <- data.frame(x = rnorm(4, mean = 4, sd = 1.5), y = rnorm(4, mean = 4, sd = 1.5))
new_points$class <- c(0, 0, 0, 0)
# Visualize Original Data
plot_classification(train_data, train_data$class) +
  geom_point(data = new_points, aes(x = x, y = y), color = "red", size = 3, shape = 16) +
  labs(title = "Training Data and New Points")
We define our function k_nearest_neighbors() by specifying the following arguments: the training dataset, the test data, and the parameter \(k\). We build it up step by step, starting from an empty skeleton:
k_nearest_neighbors <- function(train_data, test, k) {
}
k <- 5
First, we implement the Euclidean distance as a helper function:

k_nearest_neighbors <- function(train_data, test, k) {
  # Euclidean distance
  calculate_distance <- function(a, b) {
    sqrt(sum((a - b)^2))
  }
}
Next, we add a helper that computes the distances between the new point and all training points and selects the k closest ones:

k_nearest_neighbors <- function(train_data, test, k) {
  # Euclidean distance
  calculate_distance <- function(a, b) {
    sqrt(sum((a - b)^2))
  }
  # compute distance between new point and all training data points
  find_neighbors <- function(train_data, test_instance, k) {
    distances <- train_data
    distances$distance <- vector("numeric", nrow(train_data))
    for (i in 1:nrow(distances)) {
      distances$distance[i] <- calculate_distance(train_data[i, 1:2], test_instance[, 1:2])
    }
    # select k closest points
    distances <- distances[order(distances$distance), ]
    neighbors <- distances[1:k, ]
    return(neighbors)
  }
  neighbors <- find_neighbors(train_data, test, k)
  return(neighbors)
}
# Example with a random test point
neighbors <- k_nearest_neighbors(train_data, new_points[1, ], k)
# Plotting training data, test instance, and neighbors
plot_classification(train_data, train_data$class) +
  geom_point(data = new_points[1, ], aes(x = x, y = y), size = 3, shape = 16, color = "red") +
  geom_line(data = rbind(new_points[1, 1:2], neighbors[1, 1:2]), aes(x = x, y = y, group = 1), linetype = "dashed", color = "black") +
  geom_line(data = rbind(new_points[1, 1:2], neighbors[2, 1:2]), aes(x = x, y = y), linetype = "dashed", color = "black") +
  geom_line(data = rbind(new_points[1, 1:2], neighbors[3, 1:2]), aes(x = x, y = y), linetype = "dashed", color = "black") +
  geom_line(data = rbind(new_points[1, 1:2], neighbors[4, 1:2]), aes(x = x, y = y), linetype = "dashed", color = "black") +
  geom_line(data = rbind(new_points[1, 1:2], neighbors[5, 1:2]), aes(x = x, y = y), linetype = "dashed", color = "black") +
  labs(title = paste("k-NN: Test Point and its", k, "Nearest Neighbors"))
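As a side note, the five geom_line() calls can be collapsed into a single geom_segment() layer that connects the test point to all of its neighbors at once; a sketch using the neighbors data frame from above:

plot_classification(train_data, train_data$class) +
  geom_point(data = new_points[1, ], aes(x = x, y = y), size = 3, shape = 16, color = "red") +
  # one dashed segment from each neighbor to the test point
  geom_segment(
    data = neighbors,
    aes(x = x, y = y, xend = new_points$x[1], yend = new_points$y[1]),
    linetype = "dashed", color = "black"
  ) +
  labs(title = paste("k-NN: Test Point and its", k, "Nearest Neighbors"))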
Next, we assign the label via a majority vote of the k neighbors:
k_nearest_neighbors <- function(train_data, test, k) {
  # Euclidean distance
  calculate_distance <- function(a, b) {
    sqrt(sum((a - b)^2))
  }
  # compute distance between new point and all training data points
  find_neighbors <- function(train_data, instance, k) {
    distances <- train_data
    distances$distance <- vector("numeric", nrow(train_data))
    for (i in 1:nrow(distances)) {
      distances$distance[i] <- calculate_distance(train_data[i, 1:2], instance[, 1:2])
    }
    # select k closest points
    distances <- distances[order(distances$distance), ]
    neighbors <- distances[1:k, ]
    return(neighbors)
  }
  # assign label to test point
  assign_label <- function(neighbors_labels) {
    as.integer(names(which.max(table(neighbors_labels))))
  }
  neighbors <- find_neighbors(train_data, test, k)
  assigned_label <- assign_label(neighbors$class)
  return(assigned_label)
}
# Testing label assignment
assigned_label <- k_nearest_neighbors(train_data, new_points[1, ], k)
print(paste("Assigned label for the test instance:", assigned_label))
## [1] "Assigned label for the test instance: 3"
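Note that which.max() returns the first maximum, so ties among the k neighbors are resolved in favor of the class that appears first in the table (the smallest label here). Choosing an odd k reduces, though with three classes does not eliminate, the chance of ties:

# tie between classes 1 and 2: which.max() picks the first maximum
tied_votes <- table(factor(c(1, 1, 2, 2)))
names(which.max(tied_votes)) # "1"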
Finally, we apply steps 2 to 4 to all new instances.
# k-NN algorithm
k_nearest_neighbors <- function(train_data, test, k) {
  ### input: training data (data frame: points and labels), test data (data frame: points), k (integer) ###
  ### output: test points and predicted labels, k nearest neighbors (list) ###
  # Euclidean distance
  calculate_distance <- function(a, b) {
    sqrt(sum((a - b)^2))
  }
  # compute distance between new point and all training data points
  find_neighbors <- function(train_data, instance, k) {
    distances <- train_data
    distances$distance <- vector("numeric", nrow(train_data))
    for (i in 1:nrow(distances)) {
      distances$distance[i] <- calculate_distance(train_data[i, 1:2], instance[, 1:2])
    }
    # select k closest points
    distances <- distances[order(distances$distance), ]
    neighbors <- distances[1:k, ]
    return(neighbors)
  }
  # assign label to test point
  assign_label <- function(neighbors_labels) {
    as.integer(names(which.max(table(neighbors_labels))))
  }
  # repeat for all test points
  assigned_labels <- vector("numeric", nrow(test))
  neighbors_list <- list()
  for (i in 1:nrow(test)) {
    neighbors <- find_neighbors(train_data, test[i, ], k)
    assigned_labels[i] <- assign_label(neighbors$class)
    neighbors_list <- append(neighbors_list, list(neighbors))
  }
  # create output list of classified points, their assigned labels and the respective neighbors
  classified_points <- test[, 1:2]
  classified_points$class <- assigned_labels
  result <- list(
    classified_points = list(classified_points),
    neighbors = neighbors_list
  )
  return(result)
}
knn_result <- k_nearest_neighbors(train_data, new_points, k)
extended_data <- rbind(train_data, as.data.frame(knn_result$classified_points))
# plot output
p <- plot_classification(extended_data, extended_data$class)
# test points and neighbors
for (i in 1:4) {
  p <- p +
    geom_point(data = knn_result$classified_points[[1]][i, 1:2], aes(x = x, y = y), size = 4, shape = 5, color = "red") +
    geom_line(data = rbind(knn_result$classified_points[[1]][i, 1:2], knn_result$neighbors[[i]][1, 1:2]), aes(x = x, y = y), linetype = "dashed", color = "black") +
    geom_line(data = rbind(knn_result$classified_points[[1]][i, 1:2], knn_result$neighbors[[i]][2, 1:2]), aes(x = x, y = y), linetype = "dashed", color = "black") +
    geom_line(data = rbind(knn_result$classified_points[[1]][i, 1:2], knn_result$neighbors[[i]][3, 1:2]), aes(x = x, y = y), linetype = "dashed", color = "black") +
    geom_line(data = rbind(knn_result$classified_points[[1]][i, 1:2], knn_result$neighbors[[i]][4, 1:2]), aes(x = x, y = y), linetype = "dashed", color = "black") +
    geom_line(data = rbind(knn_result$classified_points[[1]][i, 1:2], knn_result$neighbors[[i]][5, 1:2]), aes(x = x, y = y), linetype = "dashed", color = "black")
}
print(p)
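As a plausibility check, we can compare our implementation against the knn() function from the class package. With identical training data, test points, and k, the two should agree up to tie-breaking (which knn() resolves at random):

library(class)
# predictions from the class package
pred_class <- knn(
  train = train_data[, c("x", "y")],
  test = new_points[, c("x", "y")],
  cl = train_data$class,
  k = k
)
# predictions from our own implementation
pred_own <- knn_result$classified_points[[1]]$class
# the two label vectors should agree (apart from possible ties)
all(as.integer(as.character(pred_class)) == pred_own)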
The choice of k can significantly influence the classification. A small k might capture noise in the data, while a large k may smooth over the data, ignoring smaller, potentially relevant patterns. Various strategies, such as cross-validation, can be utilized to select an optimal k.
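For instance, leave-one-out cross-validation with knn.cv() from the class package estimates the accuracy for each candidate value of k; a sketch using the training data from above:

library(class)
# leave-one-out cross-validated accuracy for a range of candidate k values
candidate_k <- 1:15
cv_accuracy <- sapply(candidate_k, function(k) {
  pred <- knn.cv(
    train = train_data[, c("x", "y")],
    cl = train_data$class,
    k = k
  )
  mean(pred == train_data$class) # proportion of correct classifications
})
# candidate with the highest cross-validated accuracy
candidate_k[which.max(cv_accuracy)]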