The k-means
clustering algorithm is a popular unsupervised
machine learning method that groups data points into distinct
clusters. As the name suggests, the algorithm divides a data set into a
given number (`k`

) of clusters, based on their individual
arithmetic mean. In this introduction, we will explore how to use
k-means in R and provide an explanation of the algorithm.

To begin, letâ€™s define a function that will help us visualize the results. The function will create a scatter plot with different colors representing the clusters.

```
library(ggplot2)
plot_clusters <- function(data, clusters) {
ggplot(data, aes(x = x, y = y, color = factor(clusters))) +
geom_point(size = 3) +
labs(title = "k-means clustering",
x = "Feature 1",
y = "Feature 2") +
theme_minimal(base_size = 11) +
theme(plot.margin = margin(1, 1, 1, 1, "cm"))
}
```

Now, letâ€™s generate some random toy data to demonstrate the k-means
algorithm. Weâ€™ll use a two-dimensional data set and plot it using the
`plot_clusters`

function.

```
set.seed(0)
data <- data.frame(x = rnorm(35), y = rnorm(35))
plot_clusters(data, rep(1, nrow(data)))
```

The data points are currently plotted with a single color, indicating that they havenâ€™t been clustered yet.

To apply the k-means algorithm, weâ€™ll use the `kmeans`

function from the `stats`

package in R. This function takes
the data as input and requires us to specify the number of clusters we
want to create.

```
library(stats)
k <- 4 # Number of clusters
kmeans_result <- kmeans(data, centers = k)
```

After running the `kmeans`

function, we obtain the
clustering results in `kmeans_result`

. We can access the
assigned cluster for each data point using
`kmeans_result$cluster`

.

Letâ€™s update our plot to visualize the clusters obtained from k-means.

`plot_clusters(data, kmeans_result$cluster)`

The data points are colored based on the assigned clusters. The algorithm has separated the data into four distinct clusters.

Now, that weâ€™ve seen how to apply the algorithm and visualize the resulting clusters, we want to explore the algorithm in depth.

When clustering data, we aim for a high similarity of data points within each cluster and maximal dissimilarity between different clusters. The k-means algorithm achieves this by minimizing the sum of squared distances between points and their assigned centroid within each cluster. The corresponding optimization problem is \[\underset{\mathcal{C}}{\arg \min} \left( \sum_{i=1}^{k} \sum_{\boldsymbol{x}\in C_i}\| \boldsymbol{x} - \boldsymbol{\mu}_i \|^2\right),\]

with \(\boldsymbol{x}\in \mathbb{R}^d\) where \(d\in \mathbb{N}\) is the dimensionality of the data, \(k\in \mathbb{N}\) the number of clusters, \(\mathcal{C}=\{C_1,C_2, ... , C_k\}\) the set of clusters and \(\boldsymbol{\mu}_i \in \mathbb{R}^d\) the mean (or centroid) of cluster \(C_i\). \(\|\cdot\|\) is usually the Euclidean norm, but based on the nature of the data or problem, other distance metrics can be used.

The algorithm can be broken down into five steps:

- Choose the number of clusters
`k`

, e.g.Â k = 2. - Select k random points from the data set as initial centroids.
- Assign each data point to its closest centroid.
- Recompute the centroid (i.e.Â compute the mean) of each newly formed cluster.
- Repeat step 3 and 4 until the iteration stops.

To demonstrate, how k-means works we will start from scratch and develop the algorithm step by step in the following. We start again with a set of randomly distributed, unclustered data points.

`plot_clusters(data, rep(1, nrow(data)))`

For the algorithm itself, the first step is irrelevant, but it is
certainly necessary to pass `k`

, along with the data, to our
function. Letâ€™s call our function `k_means_clustering`

.

```
k <- 4
k_means_clustering <- function(data, k) {
}
```

The initial centroids will be `k`

randomly chosen data
points from our data set.

```
k_means_clustering <- function(data, k) {
# select k centroids from the initial points
set.seed(1)
centroids <- data[sample(nrow(data), k), ]
return(centroids)
}
### Plot ###
centroids <- k_means_clustering(data, k)
plot_clusters(data, rep(1, nrow(data))) +
geom_point(data = centroids, aes(x = centroids[, 1], y = centroids[, 2]), size = 6, color = c("#F8766D", "#7CAE00", "#00BFC4", "#C77CFF"), fill = NA, shape = 9)
```

So far, our function is capable of randomly selecting a given number of points. Now, we will form clusters by assigning each point to its closest centroid. Therefore, we compute the distances of all data points to each centroid and select the minimum, respectively. Usually, the Euclidean norm, commonly known as the Euclidean distance, serves as distance metric. However, based on the nature of the data or problem other distance metrics can also be used.

```
k_means_clustering <- function(data, k) {
# select k centroids from the initial points
set.seed(1)
centroids <- data[sample(nrow(data), k), ] # shuffle and select k points
# compute distances (Euclidean norm)
distances <- matrix(0, nrow = nrow(data), ncol = nrow(centroids))
for (i in 1:nrow(data)) {
for (j in 1:nrow(centroids)) {
distances[i, j] <- sqrt(sum((data[i, ] - centroids[j, ])^2)) # Euclidean norm
}
}
# assign data to centroids
cluster_assignments <- apply(distances, 1, which.min)
return(list(clusters = cluster_assignments, centroids = centroids))
}
### Plot ###
result <- k_means_clustering(data, 4)
plot_clusters(data, result$clusters) +
geom_point(data = result$centroids, aes(x = x, y = y), size = 6, color = c("#F8766D", "#7CAE00", "#00BFC4", "#C77CFF"), fill = NA, shape = 9)
```

Now, that we have an initial clustering, we compute the arithmetic mean of each cluster to be the new centroid.

```
k_means_clustering <- function(data, k) {
# select k centroids from the initial points
set.seed(1)
centroids <- data[sample(nrow(data), k), ] # shuffle and select k points
# compute distances (Euclidean norm)
distances <- matrix(0, nrow = nrow(data), ncol = nrow(centroids))
for (i in 1:nrow(data)) {
for (j in 1:nrow(centroids)) {
distances[i, j] <- sqrt(sum((data[i, ] - centroids[j, ])^2))
}
}
# assign data to centroids
cluster_assignments <- apply(distances, 1, which.min)
# update centroids based on mean of the assigned data points
new_centroids <- data.frame(matrix(0, nrow = k, ncol = ncol(data)))
colnames(new_centroids) <- colnames(data)
for (i in 1:k) {
cluster_points <- data[cluster_assignments == i, ]
if (nrow(cluster_points) > 0) {
new_centroids[i, ] <- colMeans(cluster_points)
}
}
return(list(clusters = cluster_assignments, centroids = new_centroids))
}
### Plot ###
result <- k_means_clustering(data, 4)
plot_clusters(data, result$clusters) +
geom_point(data = result$centroids, aes(x = x, y = y), size = 6, color = c("#F8766D", "#7CAE00", "#00BFC4", "#C77CFF"), fill = NA, shape = 9)
```

Finally, we iterate step 3 and 4 until a stopping criterion is fulfilled. Possible stopping criteria are:

- Centroids of newly formed clusters do not change
- Points within one cluster remain the same
- Maximum number of iterations is reached

Here, we will stop the iteration when the centroids converge towards a fixed position.

```
k_means_clustering <- function(data, k) {
# select k centroids from the initial points
set.seed(1)
centroids <- data[sample(nrow(data), k), ] # shuffle and select k points
# stopping criterion
converged <- FALSE
#iteration
while (!converged) {
# compute distances (Euclidean norm)
distances <- matrix(0, nrow = nrow(data), ncol = k)
for (i in 1:nrow(data)) {
for (j in 1:nrow(centroids)) {
distances[i, j] <- sqrt(sum((data[i, ] - centroids[j, ])^2))
}
}
# assign data to centroids
cluster_assignments <- apply(distances, 1, which.min)
# update centroids based on mean of the assigned data points
new_centroids <- data.frame(matrix(0, nrow = k, ncol = ncol(data)))
colnames(new_centroids) <- colnames(data)
for (i in 1:k) {
cluster_points <- data[cluster_assignments == i, ]
if (nrow(cluster_points) > 0) {
new_centroids[i, ] <- colMeans(cluster_points)
}
}
# Check for convergence
if (identical(new_centroids, centroids)) {
converged <- TRUE
} else {
centroids <- new_centroids
}
}
return(list(clusters = cluster_assignments, centroids = new_centroids))
}
### Plot ###
result <- k_means_clustering(data,4)
plot_clusters(data, result$clusters) +
geom_point(data = result$centroids, aes(x = x, y = y), size = 6, color = c("#F8766D", "#7CAE00", "#00BFC4", "#C77CFF"), fill = NA, shape = 9)
```

Important properties of k-means:

- Limitations: It assumes that clusters have similar sizes and densities and is therefore limited to those cases. As we see in the following examples, k-means clustering solutions may not be suitable in situations where clusters are of varying sizes and densities.