The k-means clustering algorithm is a popular unsupervised machine learning method that groups data points into distinct clusters. As the name suggests, the algorithm divides a data set into a given number (k) of clusters, each characterized by the arithmetic mean (centroid) of the data points assigned to it. In this introduction, we will explore how to use k-means in R and provide an explanation of the algorithm.
To begin, let’s define a function that will help us visualize the results. The function will create a scatter plot with different colors representing the clusters.
library(ggplot2)

plot_clusters <- function(data, clusters) {
  ggplot(data, aes(x = x, y = y, color = factor(clusters))) +
    geom_point(size = 3) +
    labs(title = "k-means clustering",
         x = "Feature 1",
         y = "Feature 2") +
    theme_minimal(base_size = 11) +
    theme(plot.margin = margin(1, 1, 1, 1, "cm"))
}
Now, let’s generate some random toy data to demonstrate the k-means algorithm. We’ll use a two-dimensional data set and plot it using the plot_clusters function.
set.seed(0)
data <- data.frame(x = rnorm(35), y = rnorm(35))
plot_clusters(data, rep(1, nrow(data)))
The data points are currently plotted with a single color, indicating that they haven’t been clustered yet.
To apply the k-means algorithm, we’ll use the kmeans function from the stats package in R. This function takes the data as input and requires us to specify the number of clusters we want to create.
library(stats)
k <- 4 # Number of clusters
kmeans_result <- kmeans(data, centers = k)
After running the kmeans function, we obtain the clustering results in kmeans_result. We can access the assigned cluster for each data point using kmeans_result$cluster.
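The result object contains further useful components; for example, kmeans_result$centers holds the coordinates of the final cluster centroids and kmeans_result$size the number of points assigned to each cluster:

kmeans_result$centers  # coordinates of the k cluster centroids
kmeans_result$size     # number of data points in each cluster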
Let’s update our plot to visualize the clusters obtained from k-means.
plot_clusters(data, kmeans_result$cluster)
The data points are colored based on the assigned clusters. The algorithm has separated the data into four distinct clusters.
Now that we’ve seen how to apply the algorithm and visualize the resulting clusters, let’s explore the algorithm in depth.
When clustering data, we aim for a high similarity of data points within each cluster and maximal dissimilarity between different clusters. The k-means algorithm achieves this by minimizing the sum of squared distances between points and their assigned centroid within each cluster. The corresponding optimization problem is \[\underset{\mathcal{C}}{\arg \min} \left( \sum_{i=1}^{k} \sum_{\boldsymbol{x}\in C_i}\| \boldsymbol{x} - \boldsymbol{\mu}_i \|^2\right),\]
with \(\boldsymbol{x}\in \mathbb{R}^d\) where \(d\in \mathbb{N}\) is the dimensionality of the data, \(k\in \mathbb{N}\) the number of clusters, \(\mathcal{C}=\{C_1,C_2, ... , C_k\}\) the set of clusters and \(\boldsymbol{\mu}_i \in \mathbb{R}^d\) the mean (or centroid) of cluster \(C_i\). \(\|\cdot\|\) is usually the Euclidean norm, but based on the nature of the data or problem, other distance metrics can be used.
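The quantity being minimized, the total within-cluster sum of squares, is also returned by the kmeans function, so we can inspect it for the result obtained above:

kmeans_result$withinss      # within-cluster sum of squares, one value per cluster
kmeans_result$tot.withinss  # total within-cluster sum of squares (the objective)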
The algorithm can be broken down into five steps:

1. Select the number of clusters k, e.g. k = 2.
2. Randomly choose k data points from the data set as initial centroids.
3. Assign each data point to its closest centroid.
4. Recompute each centroid as the arithmetic mean of the data points assigned to it.
5. Repeat steps 3 and 4 until a stopping criterion is fulfilled.

To demonstrate how k-means works, we will start from scratch and develop the algorithm step by step. We start again with a set of randomly distributed, unclustered data points.
plot_clusters(data, rep(1, nrow(data)))
For the algorithm itself, the first step is irrelevant, but it is certainly necessary to pass k, along with the data, to our function. Let’s call our function k_means_clustering.
k <- 4
k_means_clustering <- function(data, k) {
}
The initial centroids will be k randomly chosen data points from our data set.
k_means_clustering <- function(data, k) {
  # select k centroids from the initial points
  set.seed(1)
  centroids <- data[sample(nrow(data), k), ]
  return(centroids)
}

### Plot ###
centroids <- k_means_clustering(data, k)
plot_clusters(data, rep(1, nrow(data))) +
  geom_point(data = centroids, aes(x = x, y = y), size = 6,
             color = c("#F8766D", "#7CAE00", "#00BFC4", "#C77CFF"),
             fill = NA, shape = 9)
So far, our function is capable of randomly selecting a given number of points. Now we will form clusters by assigning each point to its closest centroid. To do so, we compute the distance of every data point to each centroid and pick the minimum. Usually, the Euclidean norm, commonly known as the Euclidean distance, serves as the distance metric; however, depending on the nature of the data or problem, other distance metrics can also be used.
k_means_clustering <- function(data, k) {
  # select k centroids from the initial points
  set.seed(1)
  centroids <- data[sample(nrow(data), k), ] # shuffle and select k points

  # compute distances (Euclidean norm)
  distances <- matrix(0, nrow = nrow(data), ncol = nrow(centroids))
  for (i in 1:nrow(data)) {
    for (j in 1:nrow(centroids)) {
      distances[i, j] <- sqrt(sum((data[i, ] - centroids[j, ])^2)) # Euclidean norm
    }
  }

  # assign data to centroids
  cluster_assignments <- apply(distances, 1, which.min)

  return(list(clusters = cluster_assignments, centroids = centroids))
}

### Plot ###
result <- k_means_clustering(data, 4)
plot_clusters(data, result$clusters) +
  geom_point(data = result$centroids, aes(x = x, y = y), size = 6,
             color = c("#F8766D", "#7CAE00", "#00BFC4", "#C77CFF"),
             fill = NA, shape = 9)
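As noted above, other distance metrics could be plugged in by changing only the distance computation. The following is a minimal sketch, not part of the tutorial’s implementation, of the assignment step using the Manhattan (L1) distance; the function name assign_manhattan is introduced here purely for illustration.

# sketch: assign each point to its closest centroid using the Manhattan (L1) distance
assign_manhattan <- function(data, centroids) {
  distances <- matrix(0, nrow = nrow(data), ncol = nrow(centroids))
  for (i in 1:nrow(data)) {
    for (j in 1:nrow(centroids)) {
      distances[i, j] <- sum(abs(data[i, ] - centroids[j, ])) # L1 instead of Euclidean
    }
  }
  apply(distances, 1, which.min)
}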
Now that we have an initial clustering, we compute the arithmetic mean of each cluster to be the new centroid.
k_means_clustering <- function(data, k) {
  # select k centroids from the initial points
  set.seed(1)
  centroids <- data[sample(nrow(data), k), ] # shuffle and select k points

  # compute distances (Euclidean norm)
  distances <- matrix(0, nrow = nrow(data), ncol = nrow(centroids))
  for (i in 1:nrow(data)) {
    for (j in 1:nrow(centroids)) {
      distances[i, j] <- sqrt(sum((data[i, ] - centroids[j, ])^2))
    }
  }

  # assign data to centroids
  cluster_assignments <- apply(distances, 1, which.min)

  # update centroids based on mean of the assigned data points
  new_centroids <- data.frame(matrix(0, nrow = k, ncol = ncol(data)))
  colnames(new_centroids) <- colnames(data)
  for (i in 1:k) {
    cluster_points <- data[cluster_assignments == i, ]
    if (nrow(cluster_points) > 0) {
      new_centroids[i, ] <- colMeans(cluster_points)
    } else {
      new_centroids[i, ] <- centroids[i, ] # keep previous centroid if a cluster is empty
    }
  }

  return(list(clusters = cluster_assignments, centroids = new_centroids))
}

### Plot ###
result <- k_means_clustering(data, 4)
plot_clusters(data, result$clusters) +
  geom_point(data = result$centroids, aes(x = x, y = y), size = 6,
             color = c("#F8766D", "#7CAE00", "#00BFC4", "#C77CFF"),
             fill = NA, shape = 9)
Finally, we iterate steps 3 and 4 until a stopping criterion is fulfilled. Possible stopping criteria are, for example:

- the centroids no longer change between iterations (they have converged),
- the cluster assignments no longer change, or
- a maximum number of iterations is reached (the kmeans function, for instance, caps iterations via its iter.max argument).

Here, we will stop the iteration when the centroids converge towards a fixed position.
k_means_clustering <- function(data, k) {
  # select k centroids from the initial points
  set.seed(1)
  centroids <- data[sample(nrow(data), k), ] # shuffle and select k points

  # stopping criterion
  converged <- FALSE

  # iteration
  while (!converged) {
    # compute distances (Euclidean norm)
    distances <- matrix(0, nrow = nrow(data), ncol = k)
    for (i in 1:nrow(data)) {
      for (j in 1:nrow(centroids)) {
        distances[i, j] <- sqrt(sum((data[i, ] - centroids[j, ])^2))
      }
    }

    # assign data to centroids
    cluster_assignments <- apply(distances, 1, which.min)

    # update centroids based on mean of the assigned data points
    new_centroids <- data.frame(matrix(0, nrow = k, ncol = ncol(data)))
    colnames(new_centroids) <- colnames(data)
    for (i in 1:k) {
      cluster_points <- data[cluster_assignments == i, ]
      if (nrow(cluster_points) > 0) {
        new_centroids[i, ] <- colMeans(cluster_points)
      } else {
        new_centroids[i, ] <- centroids[i, ] # keep previous centroid if a cluster is empty
      }
    }

    # check for convergence: stop once the centroids no longer move
    if (identical(new_centroids, centroids)) {
      converged <- TRUE
    } else {
      centroids <- new_centroids
    }
  }

  return(list(clusters = cluster_assignments, centroids = new_centroids))
}

### Plot ###
result <- k_means_clustering(data, 4)
plot_clusters(data, result$clusters) +
  geom_point(data = result$centroids, aes(x = x, y = y), size = 6,
             color = c("#F8766D", "#7CAE00", "#00BFC4", "#C77CFF"),
             fill = NA, shape = 9)
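As a quick sanity check, we can cross-tabulate the cluster labels produced by our implementation against those from the kmeans function used earlier. The label numbers themselves are arbitrary, and the two solutions need not be identical, since k-means can converge to different local optima depending on the initial centroids; agreement shows up as each row of the table having essentially one non-zero entry.

table(ours = result$clusters, stats = kmeans_result$cluster)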
Important properties of k-means:
- The choice of k significantly affects the final clustering result. But which number of clusters is optimal? Usually, when using this method, you want to divide the data set into a specific number of clusters. If you are not sure, there are several methods to determine the optimal number, such as the Elbow method, which compares several runs with different values of k, the Average silhouette method and the Gap statistic method, among others (see the sketch of the Elbow method below).
- Because of the randomness of the initial placement of centroids, running the algorithm several times can lead to different clustering solutions. To find the optimal clustering, multiple runs are recommended; the kmeans function supports this via its nstart argument.
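As an illustration of the first point, here is a minimal sketch of the Elbow method: we run kmeans for a range of values of k, record the total within-cluster sum of squares, and look for the "elbow" where adding further clusters stops reducing it noticeably.

# Elbow method: total within-cluster sum of squares for k = 1, ..., 10
set.seed(0)
wss <- sapply(1:10, function(k) kmeans(data, centers = k, nstart = 10)$tot.withinss)
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")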