The k-Nearest Neighbors (k-NN) algorithm is one of the simplest and widely used supervised machine learning method that categorizes data based on the classes of their nearest neighbors.

Generally, k-NN is used as a multi-class classifiers. A common task is to classify new data that has been added to an already classified dataset. Let’s start by exploring how to use k-NN in R before we dive into the details of the algorithm.

Let us have a look at the following case:

```
library(ggplot2)
# Create synthetic data for 3 classes
set.seed(42)
n <- 50 # number of samples per class
data <- data.frame(
x = c(rnorm(n, 2), rnorm(n, 6), rnorm(n, 4)),
y = c(rnorm(n, 3), rnorm(n, 7), rnorm(n, 4)),
class = factor(rep(1:3, each = n))
)
# Define new data points
new_points <- data.frame(
x = c(1, 5, 4, 2, 3, 1),
y = c(2, 6, 5, 5, 2, 7)
)
# Plotting data
library(ggplot2)
plot_classification <- function(data, classes) {
p <- ggplot(data) +
geom_point(aes(x = x, y = y, color = factor(classes), shape = factor(classes)), size = 3) +
scale_color_manual(values = c("blue", "darkgreen", "purple"), name = "Class") +
scale_shape_manual(values = c(15, 17, 18), name = "Class") +
labs(
title = "k-NN Classification",
x = "Feature 1",
y = "Feature 2"
) +
theme_minimal(base_size = 11) +
theme(
plot.title = element_text(size = 22, hjust = 0.5),
plot.margin = margin(1, 1, 1, 1, "cm"),
axis.title.x = element_text(angle = 0, vjust = 1, size = 14),
axis.title.y = element_text(angle = 90, vjust = 2, size = 14)
)
return(p)
}
plot_classification(data, data$class) +
geom_point(data = new_points, aes(x = x, y = y), color = "red", size = 3, shape = 16) +
labs(title = "Toy Data")
```

We have a dataset that is divided into three different classes (1, 2 and 3). Then, we add new, unclassified data (red dots).

Let us now perform a k-NN classification by using the
`knn`

function from the `class`

package. This
function takes the training data and their labels, as well as the test
data (our `new_points`

) and `k`

(the number of
neighbors to consider), as input.

```
library(class)
# Specifying number of neighbors
k <- 7
# Performing k-NN classification
predicted_labels <- knn(
train = data[, c("x", "y")],
test = new_points[, c("x", "y")],
cl = data$class,
k = k
)
```

The `predicted_labels`

object now contains the predicted
classes for our data points. We can visualize the result with our plot
function.

```
new_classified_points <- new_points
new_classified_points$class <- predicted_labels
extended_data <- rbind(data, new_classified_points)
plot_classification(extended_data, extended_data$class) +
geom_point(data = new_points, aes(x = x, y = y), color = "red", size = 4, shape = 5)
```

The new data points have each been assign to a class according to their seven closest neighbors. Now, let’s delve deeper into the mechanics of the k-NN algorithm.

When classifying data, k-NN operates by identifying `k`

data points from the training data that are nearest to the unclassified
point. The class of the new point is then determined by majority vote,
meaning it is assigned the most common class among its `k`

nearest neighbors.

The choice of distance metric to determine the neighbors depends on the problem. Usually, the Euclidean distance between two points \(p\) and \(q\in \mathbb{R}^n\), is used: \[d(p,q)=‖q−p‖_2=\sqrt{\sum _{i=1}^{n}(q_{i}-p_{i})^{2}}\]

\(n\) corresponds to the number of features.

Steps for classifying a point using k-NN are:

- Choose
`k`

, the number of neighbors to consider. - Calculate the distance between the point of interest and all other points.
- Identify
`k`

closest points. - Assign the class that is most common among the
`k`

neighbors to the new point.

To showcase k-NN, we develop the algorithm step by step. Let us create a new training data set some new, unclassified data points:

```
# create training data
set.seed(42)
n <- 25 # number of samples per class
train_data <- data.frame(
x = c(rnorm(n, 2.5, 0.7), rnorm(n, 5, 0.8), rnorm(n, 4)),
y = c(rnorm(n, 4, 0.7), rnorm(n, 4.5, 0.8), rnorm(n, 4)),
class = factor(rep(1:3, each = n))
)
# create test data
set.seed(43)
new_points <- data.frame(x = rnorm(4, mean = 4, sd = 1.5), y = rnorm(4, mean = 4, sd = 1.5))
new_points$class <- c(0, 0, 0, 0)
# Visualize Original Data
plot_classification(train_data, train_data$class) +
geom_point(data = new_points, aes(x = x, y = y), color = "red", size = 3, shape = 16) +
labs(title = paste("Training Data and new Points"))
```

We define our function `k_nearest_neighbors()`

by
specifying the following arguments: the training dataset, the test data,
and the parameter \(k\).

```
k_nearest_neighbors <- function(training, test, k) {
}
```

`k <- 5`

```
k_nearest_neighbors <- function(train_data, test, k) {
# Euclidean distance
calculate_distance <- function(a, b) {
sqrt(sum((a - b)^2))
}
}
```

```
k_nearest_neighbors <- function(train_data, test, k) {
# Euclidean distance
calculate_distance <- function(a, b) {
sqrt(sum((a - b)^2))
}
# compute distance between new point and all training data points
find_neighbors <- function(train_data, test_instance, k) {
distances <- train_data
distances$distance <- vector("numeric", nrow(train_data))
for (i in 1:nrow(distances)) {
distances$distance[i] <- calculate_distance(train_data[i, 1:2], test_instance[, 1:2])
}
# select k closest points
distances <- distances[order(distances$distance), ]
neighbors <- distances[1:k, ]
return(neighbors)
}
neighbors <- find_neighbors(train_data, test, k)
return(neighbors)
}
# Example with a random test point
neighbors <- k_nearest_neighbors(train_data, new_points[1, ], k)
# Plotting training data, test instance, and neighbors
plot_classification(train_data, train_data$class) +
geom_point(data = new_points[1, ], aes(x = x, y = y), size = 3, shape = 16, color = "red") +
geom_line(data = rbind(new_points[1, 1:2], neighbors[1, 1:2]), aes(x = x, y = y, group = 1), linetype = "dashed", color = "black") +
geom_line(data = rbind(new_points[1, 1:2], neighbors[2, 1:2]), aes(x = x, y = y), linetype = "dashed", color = "black") +
geom_line(data = rbind(new_points[1, 1:2], neighbors[3, 1:2]), aes(x = x, y = y), linetype = "dashed", color = "black") +
geom_line(data = rbind(new_points[1, 1:2], neighbors[4, 1:2]), aes(x = x, y = y), linetype = "dashed", color = "black") +
geom_line(data = rbind(new_points[1, 1:2], neighbors[5, 1:2]), aes(x = x, y = y), linetype = "dashed", color = "black") +
labs(title = paste("k-NN: Test Point and its", k, "Nearest Neighbors"))
```

Assign the label via majority vote of k neighbors.

```
k_nearest_neighbors <- function(train_data, test, k) {
# Euclidean distance
calculate_distance <- function(a, b) {
sqrt(sum((a - b)^2))
}
# compute distance between new point and all training data points
find_neighbors <- function(train_data, instance, k) {
distances <- train_data
distances$distance <- vector("numeric", nrow(train_data))
for (i in 1:nrow(distances)) {
distances$distance[i] <- calculate_distance(train_data[i, 1:2], instance[, 1:2])
}
# select k closest points
distances <- distances[order(distances$distance), ]
neighbors <- distances[1:k, ]
return(neighbors)
}
# assign label to text point
assign_label <- function(neighbors_labels) {
as.integer(names(which.max(table(neighbors_labels))))
}
neighbors <- find_neighbors(train_data, test, k)
assigned_label <- assign_label(neighbors$class)
return(assigned_label)
}
# Testing label assignment
assigned_label <- k_nearest_neighbors(train_data, new_points[1, ], k)
print(paste("Assigned label for the test instance:", assigned_label))
```

`## [1] "Assigned label for the test instance: 3"`

Finally, we apply steps 3 and 4 to all new instances.

```
# k-NN algorithm
k_nearest_neighbors <- function(train_data, test, k) {
### input: training data (data frame: points and labels), test data (data frame: points), k (integer) ###
### output: test points and predicted labels, k nearest neighbors (list) ###
# Euclidean distance
calculate_distance <- function(a, b) {
sqrt(sum((a - b)^2))
}
# compute distance between new point and all training data points
find_neighbors <- function(train_data, instance, k) {
distances <- train_data
distances$distance <- vector("numeric", nrow(train_data))
for (i in 1:nrow(distances)) {
distances$distance[i] <- calculate_distance(train_data[i, 1:2], instance[, 1:2])
}
# select k closest points
distances <- distances[order(distances$distance), ]
neighbors <- distances[1:k, ]
return(neighbors)
}
# assign label to text point
assign_label <- function(neighbors_labels) {
as.integer(names(which.max(table(neighbors_labels))))
}
# repeat for all test points
assigned_labels <- vector("numeric", nrow(test))
neighbors_list <- list()
for (i in 1:nrow(test)) {
neighbors <- find_neighbors(train_data, test[i, ], k)
assigned_labels[i] <- assign_label(neighbors$class)
neighbors_list <- append(neighbors_list, list(neighbors))
}
# create output list of classified points, their assigned labels and the respective neighbors
classified_points <- test[, 1:2]
classified_points$class <- assigned_labels
result <- list(
classified_points = list(classified_points),
neighbors = neighbors_list
)
return(result)
}
```

```
knn_result <- k_nearest_neighbors(train_data, new_points, k)
extended_data <- rbind(train_data, as.data.frame(knn_result$classified_points))
# plot output
p <- plot_classification(extended_data, extended_data$class)
# test points and neighbors
for (i in 1:4) {
p <- p +
geom_point(data = knn_result$classified_points[[1]][i, 1:2], aes(x = x, y = y), size = 4, shape = 5, color = "red") +
geom_line(data = rbind(knn_result$classified_points[[1]][i, 1:2], knn_result$neighbors[[i]][1, 1:2]), aes(x = x, y = y), linetype = "dashed", color = "black") +
geom_line(data = rbind(knn_result$classified_points[[1]][i, 1:2], knn_result$neighbors[[i]][2, 1:2]), aes(x = x, y = y), linetype = "dashed", color = "black") +
geom_line(data = rbind(knn_result$classified_points[[1]][i, 1:2], knn_result$neighbors[[i]][3, 1:2]), aes(x = x, y = y), linetype = "dashed", color = "black") +
geom_line(data = rbind(knn_result$classified_points[[1]][i, 1:2], knn_result$neighbors[[i]][4, 1:2]), aes(x = x, y = y), linetype = "dashed", color = "black") +
geom_line(data = rbind(knn_result$classified_points[[1]][i, 1:2], knn_result$neighbors[[i]][5, 1:2]), aes(x = x, y = y), linetype = "dashed", color = "black")
}
print(p)
```

**Simple and intuitive**: It easy applicable for a wide range of classification problems.**Non-Parametric**: The algorithm makes no assumptions about the underlying data distribution.**Lazy Learner**: k-NN doesn’t learn a discriminative function from the training data but memorizes the training dataset instead.

**Sensitive to Irrelevant Features**: Since k-NN uses all features for classification, the presence of irrelevant features can reduce performance.**Dimensionality**: k-NN is adversely affected by a high dimensionality, meaning performance decreases as the number of features increases, particularly with a limited amount of data.**Choosing ‘k’**: The choice of`k`

can significantly influence the classification. A small`k`

might capture noise in the data, while a large`k`

may smooth over the data, ignoring smaller, potentially relevant, patterns. Various strategies, such as cross-validation, can be utilized to select an optimal`k`

.

**Citation**

The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us via mail by soga[at]zedat.fu-berlin.de.

You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follow: *Hartmann,
K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis
using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.*