What Is K-Means Clustering?

The iterative Kmeans algorithm attempts to divide the dataset into K unique, non-overlapping subgroups (clusters), each of which contains a single data point. The iterative Kmeans algorithm attempts to divide the dataset into K unique, non-overlapping subgroups (clusters), each of which contains a single data point.

It tries to make the intra-cluster data points as similar as possible while also keeping the clusters as different (far) as possible. It distributes data points to clusters in a way that minimizes the sum of the squared distances between the data points and the cluster centroid, which is the average value of all the data points in the cluster.

The homogeneity (similarity) of the data points within a cluster increases as the amount of variance within the cluster decreases.

Steps in the algorithm
K-means algorithm iteratively minimizes the distances between every data point and its centroid in order to find the most optimal solution for all the data points.

  1. k random points of the data set are chosen to be centroids.
  2. Distances between every data point and the kk centroids are calculated and stored.
  3. Based on the distance calculations, each point is assigned to the nearest cluster.
  4. New cluster centroid positions are updated, similar to finding a mean in the point locations.
  5. If the centroid locations changed, the process repeats from step 2, until the calculated new center stays the same, which signals that the clusters’ members and centroids are now set.