Clustering

Overview

Clustering is a form of unsupervised learning that groups a set of data points so that points in the same group (called a cluster) are more similar to one another than to points in other groups. The goal is to discover intrinsic groupings in the data without any prior labels.

Key concepts in clustering include:

  • Similarity/Distance Measure: A function that quantifies the likeness or dissimilarity between two data points (e.g., Euclidean distance, cosine similarity).
  • Centroid: The center point of a cluster, often calculated as the mean of all data points in that cluster (especially in centroid-based clustering like K-Means).
  • Intra-cluster distance: The distance between data points within the same cluster. Good clustering minimizes this.
  • Inter-cluster distance: The distance between points in different clusters (or between their centroids). Good clustering maximizes this.

Common clustering algorithms include K-Means, Hierarchical Clustering, DBSCAN, and Gaussian Mixture Models.
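
To make these terms concrete, here is a minimal numpy sketch (using made-up points) that computes a centroid and contrasts intra-cluster with inter-cluster distance:

    import numpy as np

    # Two small, made-up 2-D clusters (illustration only)
    cluster_a = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1]])
    cluster_b = np.array([[5.0, 5.0], [5.1, 4.8], [4.9, 5.2]])

    # Centroid: the mean of the points in a cluster
    centroid_a = cluster_a.mean(axis=0)
    centroid_b = cluster_b.mean(axis=0)

    # Intra-cluster distance: average distance from points to their own centroid (smaller is better)
    intra_a = np.linalg.norm(cluster_a - centroid_a, axis=1).mean()

    # Inter-cluster distance: distance between the two centroids (larger is better)
    inter_ab = np.linalg.norm(centroid_a - centroid_b)

    print(f"intra-cluster (A): {intra_a:.3f}, inter-cluster (A vs B): {inter_ab:.3f}")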

Core Concepts

  • Types of Clustering Algorithms

    Different clustering algorithms take different approaches to grouping data (a short sketch instantiating one estimator per family follows this list):

    • Centroid-based (e.g., K-Means):
      • Represents clusters by a central vector
      • Good for well-separated, spherical clusters
      • Sensitive to outliers
      [Figure] K-Means Clustering: iterative process of assigning points to the nearest centroids and then updating the centroids

    • Density-based (e.g., DBSCAN):
      • Defines clusters as dense regions in space
      • Can find arbitrarily shaped clusters
      • Handles noise and outliers well
      [Figure] DBSCAN: density-based clustering showing core points, border points, and noise points

    • Hierarchical (e.g., Agglomerative):
      • Creates a tree of clusters
      • Provides multiple levels of granularity
      • No need to pre-specify the number of clusters (cut the dendrogram at the desired level)
      [Figure] Hierarchical Clustering: dendrogram showing the tree-like structure of clusters

    • Distribution-based (e.g., Gaussian Mixture):
      • Models clusters using probability distributions
      • Provides soft cluster assignments
      • Can capture elliptical and overlapping cluster shapes
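
    As a quick reference, the sketch below (assuming scikit-learn) instantiates one estimator from each family; the parameter values are placeholders rather than recommendations:

    from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
    from sklearn.mixture import GaussianMixture

    # One representative estimator per family (parameter values are illustrative)
    centroid_based = KMeans(n_clusters=3, n_init=10, random_state=0)
    density_based = DBSCAN(eps=0.5, min_samples=5)
    hierarchical = AgglomerativeClustering(n_clusters=3, linkage='ward')
    distribution_based = GaussianMixture(n_components=3, covariance_type='full', random_state=0)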
  • Distance Metrics

    Common distance metrics used in clustering (a small numpy sketch follows this list):

    • Euclidean Distance:
      • Most common metric
      • $d(p,q) = \sqrt{\sum_{i=1}^n (p_i - q_i)^2}$
      • Good for continuous data
    • Manhattan Distance:
      • Sum of absolute differences
      • $d(p,q) = \sum_{i=1}^n |p_i - q_i|$
      • Less sensitive to outliers
    • Cosine Similarity:
      • Measures angle between vectors
      • $\cos(\theta) = \frac{p \cdot q}{||p|| ||q||}$
      • Good for high-dimensional data
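
    The minimal numpy sketch below (with made-up vectors) evaluates the three formulas directly:

    import numpy as np

    p = np.array([1.0, 2.0, 3.0])   # made-up example vectors
    q = np.array([2.0, 0.0, 4.0])

    # Euclidean distance: square root of the sum of squared differences
    euclidean = np.sqrt(np.sum((p - q) ** 2))

    # Manhattan distance: sum of absolute differences
    manhattan = np.sum(np.abs(p - q))

    # Cosine similarity: dot product divided by the product of the norms
    cosine = np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q))

    print(f"Euclidean: {euclidean:.3f}, Manhattan: {manhattan:.3f}, Cosine: {cosine:.3f}")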
  • Algorithm Selection

    Guidelines for choosing the right clustering algorithm:

    • K-Means:
      • When clusters are expected to be spherical
      • When the number of clusters is known
      • For large datasets (scalable)
    • DBSCAN:
      • When clusters have arbitrary shapes
      • When there's noise in the data
      • When the number of clusters is unknown
    • Hierarchical:
      • When hierarchical structure is important
      • For smaller datasets (time and memory grow at least quadratically with the number of samples)
      • When visualization of hierarchy is needed
    • Gaussian Mixture:
      • When soft clustering is needed
      • When clusters overlap
      • When probabilistic assignments are useful
  • Data Preprocessing

    Important preprocessing steps for clustering (see the pipeline sketch after this list):

    • Feature Scaling:
      • Essential for distance-based algorithms
      • Use StandardScaler or MinMaxScaler
      • Ensures all features contribute equally
    • Dimensionality Reduction:
      • Consider PCA for high-dimensional data (t-SNE is better suited to visualization than to preprocessing)
      • Can improve clustering performance
      • Helps with visualization
    • Handling Missing Values:
      • Impute or remove missing values
      • Consider impact on distance calculations
      • Document treatment of missing values
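
    A minimal preprocessing sketch, assuming scikit-learn and a small made-up feature matrix with a missing value; the imputation strategy, scaler, and number of PCA components are illustrative choices, not recommendations:

    import numpy as np
    from sklearn.pipeline import Pipeline
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA
    from sklearn.cluster import KMeans

    # Made-up feature matrix with one missing value and very different feature scales
    X = np.array([[1.0, 200.0], [2.0, np.nan], [1.5, 180.0],
                  [8.0, 900.0], [7.5, 950.0], [8.2, 880.0]])

    preprocess = Pipeline([
        ('impute', SimpleImputer(strategy='mean')),   # fill missing values with the column mean
        ('scale', StandardScaler()),                  # put features on a comparable scale
        ('reduce', PCA(n_components=2)),              # optional dimensionality reduction
    ])

    X_prepared = preprocess.fit_transform(X)
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_prepared)
    print(labels)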

Implementation

  • K-Means Clustering Example

    
    import numpy as np
    from sklearn.datasets import make_blobs
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score
    import matplotlib.pyplot as plt
    
    def kmeans_example():
        # Generate synthetic data
        n_samples = 300
        n_features = 2
        n_clusters = 4
        X, y_true = make_blobs(
            n_samples=n_samples,
            n_features=n_features,
            centers=n_clusters,
            cluster_std=0.60,
            random_state=42
        )
    
        # Initialize and fit KMeans
        kmeans = KMeans(
            n_clusters=n_clusters,
            init='k-means++',
            n_init=10,
            random_state=42
        )
        kmeans.fit(X)
    
        # Get cluster assignments
        y_pred = kmeans.predict(X)
    
        # Calculate silhouette score
        silhouette_avg = silhouette_score(X, y_pred)
        print(f"Silhouette Score: {silhouette_avg:.3f}")
    
        # Visualize results
        plt.figure(figsize=(12, 5))
    
        # Plot original data
        plt.subplot(1, 2, 1)
        plt.scatter(X[:, 0], X[:, 1], c=y_true, cmap='viridis', alpha=0.7)
        plt.title('Original Data')
        plt.xlabel('Feature 1')
        plt.ylabel('Feature 2')
    
        # Plot clustered data
        plt.subplot(1, 2, 2)
        plt.scatter(X[:, 0], X[:, 1], c=y_pred, cmap='viridis', alpha=0.7)
        centers = kmeans.cluster_centers_
        plt.scatter(centers[:, 0], centers[:, 1], c='red', marker='x', s=200, linewidths=3, label='Centroids')
        plt.title('K-Means Clustering')
        plt.xlabel('Feature 1')
        plt.ylabel('Feature 2')
        plt.legend()
    
        plt.tight_layout()
        # plt.show()
    
        # Elbow method for optimal k
        inertias = []
        silhouette_scores = []
        k_range = range(2, 11)
    
        for k in k_range:
            kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
            kmeans.fit(X)
            inertias.append(kmeans.inertia_)
            silhouette_scores.append(silhouette_score(X, kmeans.labels_))
    
        plt.figure(figsize=(12, 5))
    
        # Plot elbow curve
        plt.subplot(1, 2, 1)
        plt.plot(k_range, inertias, 'bo-')
        plt.xlabel('Number of Clusters (k)')
        plt.ylabel('Inertia')
        plt.title('Elbow Method')
    
        # Plot silhouette scores
        plt.subplot(1, 2, 2)
        plt.plot(k_range, silhouette_scores, 'ro-')
        plt.xlabel('Number of Clusters (k)')
        plt.ylabel('Silhouette Score')
        plt.title('Silhouette Analysis')
    
        plt.tight_layout()
        # plt.show()
  • DBSCAN Clustering Example

    
    from sklearn.cluster import DBSCAN
    from sklearn.preprocessing import StandardScaler
    from sklearn.datasets import make_moons
    
    def dbscan_example():
        # Generate non-spherical data
        X, y_true = make_moons(n_samples=200, noise=0.05, random_state=42)
    
        # Scale the data
        scaler = StandardScaler()
        X_scaled = scaler.fit_transform(X)
    
        # Initialize and fit DBSCAN
        dbscan = DBSCAN(
            eps=0.3,          # max distance for two samples to be considered neighbors
            min_samples=5,    # min samples in a neighborhood for a point to be a core point
            metric='euclidean'
        )
        y_pred = dbscan.fit_predict(X_scaled)
    
        # Number of clusters and noise points
        n_clusters = len(set(y_pred)) - (1 if -1 in y_pred else 0)
        n_noise = list(y_pred).count(-1)
    
        print(f"Number of clusters: {n_clusters}")
        print(f"Number of noise points: {n_noise}")
    
        # Visualize results
        plt.figure(figsize=(12, 5))
    
        # Plot original data
        plt.subplot(1, 2, 1)
        plt.scatter(X[:, 0], X[:, 1], c=y_true, cmap='viridis', alpha=0.7)
        plt.title('Original Data')
        plt.xlabel('Feature 1')
        plt.ylabel('Feature 2')
    
        # Plot DBSCAN clusters
        plt.subplot(1, 2, 2)
        unique_labels = set(y_pred)
        colors = plt.cm.Spectral(np.linspace(0, 1, len(unique_labels)))
    
        for k, col in zip(unique_labels, colors):
            if k == -1:
                # Black used for noise
                col = [0, 0, 0, 1]
    
            class_member_mask = (y_pred == k)
            xy = X_scaled[class_member_mask]
            plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
                     markeredgecolor='k', markersize=6, alpha=0.7,
                     label='Noise' if k == -1 else f'Cluster {k}')
    
        plt.title('DBSCAN Clustering')
        plt.xlabel('Feature 1')
        plt.ylabel('Feature 2')
        plt.legend()
    
        plt.tight_layout()
        # plt.show()
  • Hierarchical Clustering Example

    
    from sklearn.cluster import AgglomerativeClustering
    from scipy.cluster.hierarchy import dendrogram, linkage
    
    def hierarchical_clustering_example():
        # Generate synthetic data
        X, y_true = make_blobs(
            n_samples=50,     # Smaller sample size for clearer dendrogram
            n_features=2,
            centers=3,
            cluster_std=0.60,
            random_state=42
        )
    
        # Scale the data
        scaler = StandardScaler()
        X_scaled = scaler.fit_transform(X)
    
        # Initialize and fit Hierarchical Clustering
        hierarchical = AgglomerativeClustering(
            n_clusters=3,
            linkage='ward'    # Minimizes variance within clusters
        )
        y_pred = hierarchical.fit_predict(X_scaled)
    
        # Visualize results
        plt.figure(figsize=(15, 5))
    
        # Plot clustered data
        plt.subplot(1, 2, 1)
        plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=y_pred, cmap='viridis', alpha=0.7)
        plt.title('Hierarchical Clustering')
        plt.xlabel('Feature 1')
        plt.ylabel('Feature 2')
    
        # Create and plot dendrogram
        plt.subplot(1, 2, 2)
        linkage_matrix = linkage(X_scaled, method='ward')
        dendrogram(linkage_matrix)
        plt.title('Dendrogram')
        plt.xlabel('Sample Index')
        plt.ylabel('Distance')
    
        plt.tight_layout()
        # plt.show()
  • Gaussian Mixture Model Example

    
    from sklearn.mixture import GaussianMixture
    from matplotlib.patches import Ellipse
    
    def gmm_example():
        # Generate synthetic data with overlapping clusters
        X, y_true = make_blobs(
            n_samples=300,
            n_features=2,
            centers=3,
            cluster_std=1.0,
            random_state=42
        )
    
        # Initialize and fit GMM
        gmm = GaussianMixture(
            n_components=3,
            covariance_type='full',
            random_state=42
        )
        gmm.fit(X)
    
        # Get cluster assignments and probabilities
        y_pred = gmm.predict(X)
        probs = gmm.predict_proba(X)
    
        # Plot results
        plt.figure(figsize=(15, 5))
    
        # Plot cluster assignments
        plt.subplot(1, 3, 1)
        plt.scatter(X[:, 0], X[:, 1], c=y_pred, cmap='viridis', alpha=0.7)
        plt.title('GMM Clustering')
        plt.xlabel('Feature 1')
        plt.ylabel('Feature 2')
    
        # Plot probability contours
        plt.subplot(1, 3, 2)
        # Evaluation grid covering the data range
        x, y = np.mgrid[X[:, 0].min() - 1:X[:, 0].max() + 1:.05,
                        X[:, 1].min() - 1:X[:, 1].max() + 1:.05]
        pos = np.dstack((x, y))
        rv = gmm.score_samples(pos.reshape(-1, 2))
        plt.contourf(x, y, rv.reshape(x.shape), levels=20, cmap='viridis')
        plt.scatter(X[:, 0], X[:, 1], c='white', alpha=0.3)
        plt.title('Log Probability Density')
        plt.xlabel('Feature 1')
        plt.ylabel('Feature 2')
    
        # Plot cluster membership probabilities
        plt.subplot(1, 3, 3)
        # Color each point by its membership probabilities, interpreted as an RGB triple
        # (valid here because the mixture has exactly 3 components)
        plt.scatter(X[:, 0], X[:, 1], c=probs, alpha=0.7)
        plt.title('Cluster Membership Probabilities')
        plt.xlabel('Feature 1')
        plt.ylabel('Feature 2')
    
        plt.tight_layout()
        # plt.show()
    
    if __name__ == "__main__":
        print("Running Clustering Examples...")
        
        print("\n1. K-Means Clustering Example:")
        kmeans_example()
        
        print("\n2. DBSCAN Clustering Example:")
        dbscan_example()
        
        print("\n3. Hierarchical Clustering Example:")
        hierarchical_clustering_example()
        
        print("\n4. Gaussian Mixture Model Example:")
        gmm_example()

Interview Examples

K-Means vs DBSCAN

Compare K-Means and DBSCAN clustering algorithms. When would you use each?
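
One way to ground the comparison is to run both algorithms on data with non-spherical clusters. A minimal sketch, assuming scikit-learn; the eps and min_samples values are illustrative and would normally be tuned:

    from sklearn.datasets import make_moons
    from sklearn.preprocessing import StandardScaler
    from sklearn.cluster import KMeans, DBSCAN
    from sklearn.metrics import adjusted_rand_score

    # Two interleaving half-moons: clusters that are dense but not spherical
    X, y_true = make_moons(n_samples=300, noise=0.05, random_state=42)
    X = StandardScaler().fit_transform(X)

    kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)
    dbscan_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)

    # K-Means assumes compact, roughly spherical clusters and tends to cut across the moons;
    # DBSCAN follows the dense, curved shapes and marks sparse points as noise (-1)
    print(f"K-Means ARI: {adjusted_rand_score(y_true, kmeans_labels):.3f}")
    print(f"DBSCAN  ARI: {adjusted_rand_score(y_true, dbscan_labels):.3f}")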

Evaluating Clustering Quality

How do you evaluate the quality of clustering results without ground truth labels?
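
Without ground truth, internal validity indices are the usual starting point. A minimal sketch, assuming scikit-learn, that scores a K-Means result with three common indices:

    from sklearn.datasets import make_blobs
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score, davies_bouldin_score, calinski_harabasz_score

    X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=42)
    labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)

    # Internal indices use only the data and the assigned labels, not ground truth
    print(f"Silhouette (higher is better):        {silhouette_score(X, labels):.3f}")
    print(f"Davies-Bouldin (lower is better):     {davies_bouldin_score(X, labels):.3f}")
    print(f"Calinski-Harabasz (higher is better): {calinski_harabasz_score(X, labels):.3f}")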

Practice Questions

1. How would you implement this in a production environment? (Hard)

Hint: Consider scalability and efficiency

2. What are the practical applications of Clustering? (Medium)

Hint: Consider both academic and industry use cases

3. Explain the core concepts of Clustering. (Easy)

Hint: Think about the fundamental principles