K-Means Clustering: Algorithm Principle and Implementation
K-means clustering is a classic unsupervised learning algorithm used to partition a dataset into K disjoint clusters. Its goal is to make data points within the same cluster as similar as possible, while data points in different clusters are as dissimilar as possible. The core idea is to iteratively minimize the sum of squared distances between each data point and the centroid of its assigned cluster.
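In symbols (a standard formulation, using \(\mu_i\) for the centroid of cluster \(C_i\)), this objective is the within-cluster sum of squares:

J = \sum_{i=1}^{K} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2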
Detailed Algorithm Steps
1. Initialize Cluster Centers (Centroids)
- Randomly select K data points as the initial cluster centers (centroids). K is a predefined hyperparameter representing the desired number of clusters.
- For example, given a dataset of 100 two-dimensional data points to be divided into 3 clusters (K=3), three points are randomly chosen as initial centroids.
2. Assign Data Points to the Nearest Cluster
- For each data point in the dataset, calculate its distance (commonly Euclidean distance) to each of the K centroids.
- Assign each data point to the cluster whose centroid is the closest.
- For example, for a point P(x, y), calculate its distances d1, d2, d3 to the 3 centroids. If d2 is the smallest, assign P to Cluster 2 (see the NumPy sketch after Step 4).
3. Recalculate Cluster Centers
- For each cluster, calculate the mean (average of coordinates) of all its data points, and use this mean as the new centroid.
- For example, if Cluster 2 has 10 points, the new centroid's x-coordinate = (point1_x + point2_x + ... + point10_x) / 10, and similarly for the y-coordinate.
4. Iterative Optimization
- Repeat Steps 2 and 3 until a termination condition is met (e.g., the centroids' movement distance falls below a threshold, or cluster assignments no longer change).
- Each iteration reduces (or at worst leaves unchanged) the within-cluster sum of squares, so the algorithm eventually converges to a local optimum.
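To make Steps 2 and 3 concrete, here is a minimal NumPy sketch; the point P, centroids, and cluster members are hypothetical toy values chosen to match the examples above:

import numpy as np

# Hypothetical point P and three current centroids
P = np.array([2.0, 3.0])
centroids = np.array([[0.0, 0.0], [2.0, 4.0], [5.0, 1.0]])

# Step 2: Euclidean distances d1, d2, d3 from P to each centroid
distances = np.linalg.norm(centroids - P, axis=1)
nearest = np.argmin(distances)  # here d2 is smallest, so P joins Cluster 2

# Step 3: a cluster's new centroid is the mean of its assigned points
cluster_points = np.array([[1.9, 3.8], [2.1, 4.2], [2.0, 4.0]])
new_centroid = cluster_points.mean(axis=0)  # [2.0, 4.0]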
Key Issues and Optimizations
- Choosing K: K must be specified in advance. The Elbow Method (looking for the inflection point in a plot of within-cluster error versus K) or metrics such as the Silhouette Coefficient can help determine an appropriate value (see the sketch after this list).
- Sensitivity to Initial Centroids: Different initial centroids can lead to different results. The K-means++ algorithm is commonly used to improve initialization by spreading initial centroids apart, which improves convergence speed and stability (a sketch of this seeding appears after the implementation below).
- Distance Metric: Euclidean distance is suitable for compact, roughly spherical clusters. For other data distributions, Manhattan distance or cosine similarity can be substituted, though the centroid update should then match the chosen metric.
- Limitations: Performs poorly on non-spherical clusters, is sensitive to noise and outliers, and implicitly assumes clusters of relatively uniform size.
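As a sketch of choosing K (assuming scikit-learn is available and using synthetic stand-in data), the loop below records the within-cluster error (inertia) for the Elbow Method and the Silhouette Coefficient for each candidate K; note that scikit-learn's KMeans uses k-means++ initialization by default:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Hypothetical 2-D dataset; replace with your own data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))

for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_: sum of squared distances to the nearest centroid
    print(k, km.inertia_, silhouette_score(X, km.labels_))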
Python Implementation Example
import numpy as np

def k_means(data, k, max_iters=100, tol=1e-4):
    # Randomly pick k distinct data points as the initial centroids
    centroids = data[np.random.choice(len(data), k, replace=False)]
    for _ in range(max_iters):
        # Assignment step: distance from every point to every centroid
        distances = np.linalg.norm(data[:, np.newaxis] - centroids, axis=2)
        labels = np.argmin(distances, axis=1)
        # Update step: each centroid becomes the mean of its assigned points;
        # an empty cluster keeps its previous centroid to avoid a NaN mean
        new_centroids = np.array([
            data[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
            for i in range(k)
        ])
        # Converged once the centroids' total movement falls below the threshold
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:
            break
    return labels, centroids
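As a sketch of the k-means++ seeding mentioned earlier (kmeans_pp_init is a hypothetical helper name, not part of the implementation above), each new centroid is sampled with probability proportional to its squared distance from the nearest centroid chosen so far; it could replace the random initialization inside k_means:

def kmeans_pp_init(data, k, rng=None):
    # Hypothetical k-means++ seeding: spread the initial centroids apart
    rng = np.random.default_rng() if rng is None else rng
    # First centroid: a uniformly random data point
    centroids = [data[rng.integers(len(data))]]
    for _ in range(k - 1):
        # Squared distance from each point to its nearest chosen centroid
        d2 = np.min(
            np.linalg.norm(data[:, np.newaxis] - np.array(centroids), axis=2) ** 2,
            axis=1,
        )
        # Sample the next centroid with probability proportional to d2
        centroids.append(data[rng.choice(len(data), p=d2 / d2.sum())])
    return np.array(centroids)

And a minimal usage sketch on synthetic data (values purely illustrative):

rng = np.random.default_rng(42)
# Three hypothetical 2-D Gaussian blobs
data = np.vstack([
    rng.normal(loc=c, scale=0.5, size=(30, 2))
    for c in ([0, 0], [5, 5], [0, 5])
])
labels, centroids = k_means(data, k=3)
print(centroids)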
Due to its simplicity and efficiency, K-means clustering is widely used in customer segmentation, image segmentation, anomaly detection, and more. Understanding its principles and optimization methods is crucial for tackling real-world clustering problems.