The k-means clustering algorithm is a popular unsupervised machine learning method that groups data points into distinct clusters. As the name suggests, the algorithm divides a data set into a given number (k) of clusters, each characterized by the arithmetic mean (centroid) of the data points assigned to it. In this introduction, we will explore how to use k-means in R and provide an explanation of the algorithm.
To begin, let’s define a function that will help us visualize the results. The function will create a scatter plot with different colors representing the clusters.
library(ggplot2)

plot_clusters <- function(data, clusters) {
  ggplot(data, aes(x = x, y = y, color = factor(clusters))) +
    geom_point(size = 3) +
    labs(title = "k-means clustering",
         x = "Feature 1",
         y = "Feature 2") +
    theme_minimal(base_size = 11) +
    theme(plot.margin = margin(1, 1, 1, 1, "cm"))
}
Now, let’s generate some random toy data to demonstrate the k-means algorithm. We’ll use a two-dimensional data set and plot it using the plot_clusters function.
set.seed(0)
data <- data.frame(x = rnorm(35), y = rnorm(35))
plot_clusters(data, rep(1, nrow(data)))
The data points are currently plotted with a single color, indicating that they haven’t been clustered yet.
To apply the k-means algorithm, we’ll use the kmeans function from the stats package in R. This function takes the data as input and requires us to specify the number of clusters we want to create.
library(stats)
k <- 4 # Number of clusters
kmeans_result <- kmeans(data, centers = k)
After running the kmeans function, we obtain the clustering results in kmeans_result. We can access the assigned cluster for each data point using kmeans_result$cluster.
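The result object contains further useful components; for example, kmeans_result$centers holds the coordinates of the final cluster centroids and kmeans_result$size the number of points assigned to each cluster:

kmeans_result$centers  # coordinates of the k cluster centroids
kmeans_result$size     # number of data points in each cluster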
Let’s update our plot to visualize the clusters obtained from k-means.
plot_clusters(data, kmeans_result$cluster)
The data points are colored based on the assigned clusters. The algorithm has separated the data into four distinct clusters.
Now that we’ve seen how to apply the algorithm and visualize the resulting clusters, let’s explore the algorithm in depth.
When clustering data, we aim for a high similarity of data points within each cluster and maximal dissimilarity between different clusters. The k-means algorithm achieves this by minimizing the sum of squared distances between points and their assigned centroid within each cluster. The corresponding optimization problem is \[\underset{\mathcal{C}}{\arg \min} \left( \sum_{i=1}^{k} \sum_{\boldsymbol{x}\in C_i}\| \boldsymbol{x} - \boldsymbol{\mu}_i \|^2\right),\]
with \(\boldsymbol{x}\in \mathbb{R}^d\) where \(d\in \mathbb{N}\) is the dimensionality of the data, \(k\in \mathbb{N}\) the number of clusters, \(\mathcal{C}=\{C_1,C_2, ... , C_k\}\) the set of clusters and \(\boldsymbol{\mu}_i \in \mathbb{R}^d\) the mean (or centroid) of cluster \(C_i\). \(\|\cdot\|\) is usually the Euclidean norm, but based on the nature of the data or problem, other distance metrics can be used.
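The quantity being minimized, the total within-cluster sum of squares, is also returned by the kmeans function, so we can inspect it for the result obtained above:

kmeans_result$withinss      # within-cluster sum of squares, one value per cluster
kmeans_result$tot.withinss  # total within-cluster sum of squares (the objective)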
The algorithm can be broken down into five steps:

1. Select the number of clusters k, e.g. k = 2.
2. Randomly choose k data points from the data set as initial centroids.
3. Assign each data point to its closest centroid.
4. Recompute each centroid as the arithmetic mean of the data points assigned to it.
5. Repeat steps 3 and 4 until a stopping criterion is fulfilled.

To demonstrate how k-means works, we will start from scratch and develop the algorithm step by step. We start again with a set of randomly distributed, unclustered data points.
plot_clusters(data, rep(1, nrow(data)))
For the algorithm itself, the first step is irrelevant, but it is certainly necessary to pass k, along with the data, to our function. Let’s call our function k_means_clustering.
k <- 4
k_means_clustering <- function(data, k) {
}
The initial centroids will be k randomly chosen data points from our data set.
k_means_clustering <- function(data, k) {
  # select k centroids from the initial points
  set.seed(1)
  centroids <- data[sample(nrow(data), k), ]
  return(centroids)
}

### Plot ###
centroids <- k_means_clustering(data, k)
plot_clusters(data, rep(1, nrow(data))) +
  geom_point(data = centroids, aes(x = x, y = y), size = 6,
             color = c("#F8766D", "#7CAE00", "#00BFC4", "#C77CFF"),
             fill = NA, shape = 9)
So far, our function is capable of randomly selecting a given number of points. Now we will form clusters by assigning each point to its closest centroid. To do so, we compute the distance of every data point to each centroid and pick the minimum. Usually, the Euclidean norm, commonly known as the Euclidean distance, serves as the distance metric; however, depending on the nature of the data or problem, other distance metrics can also be used.
k_means_clustering <- function(data, k) {
  # select k centroids from the initial points
  set.seed(1)
  centroids <- data[sample(nrow(data), k), ] # shuffle and select k points

  # compute distances (Euclidean norm)
  distances <- matrix(0, nrow = nrow(data), ncol = nrow(centroids))
  for (i in 1:nrow(data)) {
    for (j in 1:nrow(centroids)) {
      distances[i, j] <- sqrt(sum((data[i, ] - centroids[j, ])^2)) # Euclidean norm
    }
  }

  # assign data to centroids
  cluster_assignments <- apply(distances, 1, which.min)

  return(list(clusters = cluster_assignments, centroids = centroids))
}

### Plot ###
result <- k_means_clustering(data, 4)
plot_clusters(data, result$clusters) +
  geom_point(data = result$centroids, aes(x = x, y = y), size = 6,
             color = c("#F8766D", "#7CAE00", "#00BFC4", "#C77CFF"),
             fill = NA, shape = 9)
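As noted above, other distance metrics could be plugged in by changing only the distance computation. The following is a minimal sketch, not part of the tutorial’s implementation, of the assignment step using the Manhattan (L1) distance; the function name assign_manhattan is introduced here purely for illustration.

# sketch: assign each point to its closest centroid using the Manhattan (L1) distance
assign_manhattan <- function(data, centroids) {
  distances <- matrix(0, nrow = nrow(data), ncol = nrow(centroids))
  for (i in 1:nrow(data)) {
    for (j in 1:nrow(centroids)) {
      distances[i, j] <- sum(abs(data[i, ] - centroids[j, ])) # L1 instead of Euclidean
    }
  }
  apply(distances, 1, which.min)
}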
Now that we have an initial clustering, we compute the arithmetic mean of each cluster to be the new centroid.
k_means_clustering <- function(data, k) {
  # select k centroids from the initial points
  set.seed(1)
  centroids <- data[sample(nrow(data), k), ] # shuffle and select k points

  # compute distances (Euclidean norm)
  distances <- matrix(0, nrow = nrow(data), ncol = nrow(centroids))
  for (i in 1:nrow(data)) {
    for (j in 1:nrow(centroids)) {
      distances[i, j] <- sqrt(sum((data[i, ] - centroids[j, ])^2))
    }
  }

  # assign data to centroids
  cluster_assignments <- apply(distances, 1, which.min)

  # update centroids based on mean of the assigned data points
  new_centroids <- data.frame(matrix(0, nrow = k, ncol = ncol(data)))
  colnames(new_centroids) <- colnames(data)
  for (i in 1:k) {
    cluster_points <- data[cluster_assignments == i, ]
    if (nrow(cluster_points) > 0) {
      new_centroids[i, ] <- colMeans(cluster_points)
    } else {
      new_centroids[i, ] <- centroids[i, ] # keep previous centroid if a cluster is empty
    }
  }

  return(list(clusters = cluster_assignments, centroids = new_centroids))
}

### Plot ###
result <- k_means_clustering(data, 4)
plot_clusters(data, result$clusters) +
  geom_point(data = result$centroids, aes(x = x, y = y), size = 6,
             color = c("#F8766D", "#7CAE00", "#00BFC4", "#C77CFF"),
             fill = NA, shape = 9)
Finally, we iterate steps 3 and 4 until a stopping criterion is fulfilled. Possible stopping criteria are, for example:

- the centroids no longer change between iterations (they have converged),
- the cluster assignments no longer change, or
- a maximum number of iterations is reached (the kmeans function, for instance, caps iterations via its iter.max argument).

Here, we will stop the iteration when the centroids converge towards a fixed position.
k_means_clustering <- function(data, k) {
  # select k centroids from the initial points
  set.seed(1)
  centroids <- data[sample(nrow(data), k), ] # shuffle and select k points

  # stopping criterion
  converged <- FALSE

  # iteration
  while (!converged) {
    # compute distances (Euclidean norm)
    distances <- matrix(0, nrow = nrow(data), ncol = k)
    for (i in 1:nrow(data)) {
      for (j in 1:nrow(centroids)) {
        distances[i, j] <- sqrt(sum((data[i, ] - centroids[j, ])^2))
      }
    }

    # assign data to centroids
    cluster_assignments <- apply(distances, 1, which.min)

    # update centroids based on mean of the assigned data points
    new_centroids <- data.frame(matrix(0, nrow = k, ncol = ncol(data)))
    colnames(new_centroids) <- colnames(data)
    for (i in 1:k) {
      cluster_points <- data[cluster_assignments == i, ]
      if (nrow(cluster_points) > 0) {
        new_centroids[i, ] <- colMeans(cluster_points)
      } else {
        new_centroids[i, ] <- centroids[i, ] # keep previous centroid if a cluster is empty
      }
    }

    # check for convergence: stop once the centroids no longer move
    if (identical(new_centroids, centroids)) {
      converged <- TRUE
    } else {
      centroids <- new_centroids
    }
  }

  return(list(clusters = cluster_assignments, centroids = new_centroids))
}

### Plot ###
result <- k_means_clustering(data, 4)
plot_clusters(data, result$clusters) +
  geom_point(data = result$centroids, aes(x = x, y = y), size = 6,
             color = c("#F8766D", "#7CAE00", "#00BFC4", "#C77CFF"),
             fill = NA, shape = 9)
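As a quick sanity check, we can cross-tabulate the cluster labels produced by our implementation against those from the kmeans function used earlier. The label numbers themselves are arbitrary, and the two solutions need not be identical, since k-means can converge to different local optima depending on the initial centroids; agreement shows up as each row of the table having essentially one non-zero entry.

table(ours = result$clusters, stats = kmeans_result$cluster)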
Important properties of k-means:
- The choice of k significantly affects the final clustering result. But which number of clusters is optimal? Usually, when using this method, you want to divide the data set into a specific number of clusters. If you are not sure, there are several methods to determine the optimal number, such as the Elbow method, which compares several runs with different values of k, the Average silhouette method and the Gap statistic method, among others (see the sketch of the Elbow method below).
- Because of the randomness of the initial placement of centroids, running the algorithm several times can lead to different clustering solutions. To find the optimal clustering, multiple runs are recommended; the kmeans function supports this via its nstart argument.
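As an illustration of the first point, here is a minimal sketch of the Elbow method: we run kmeans for a range of values of k, record the total within-cluster sum of squares, and look for the "elbow" where adding further clusters stops reducing it noticeably.

# Elbow method: total within-cluster sum of squares for k = 1, ..., 10
set.seed(0)
wss <- sapply(1:10, function(k) kmeans(data, centers = k, nstart = 10)$tot.withinss)
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")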