3. K-Means clustering digits#

3.1. Pre-reading#

K-Means Clustering

According to my especially unsupervised K-means clustering algorithm, there are currently about 8 billion types of people in the world.

3.1.1. Goals#

  • Learn to import and load datasets as NumPy arrays

  • Use KMeans to conduct unsupervised learning

  • Explore the impact of initialization and iterations

  • Explore a mismatch between K and the number of classes

This lab is modified from the Scikit Learn example.

3.2. Load the dataset#

We will start by loading the digits dataset. This dataset contains handwritten digits from 0 to 9. In the context of clustering, one would like to group images such that the handwritten digits on the image are the same.

%matplotlib inline
import numpy as np
from sklearn.datasets import load_digits

# Load the dataset
data, labels = load_digits(return_X_y=True)
(n_samples, n_features), n_digits = data.shape, np.unique(labels).size

print(f"# digits: {n_digits}; # samples: {n_samples}; # features {n_features}")

We can also display a sample from the dataset, so we know what we are working with. Note that the images are fairly blurry because they are only 8x8 pixels. Is this enough for our machine learning to reliably work?

Run this a few times to see multiple samples.

import matplotlib.pyplot as plt
from random import randrange

# select a random sample
sample_id = randrange(n_samples)
sample_image = data[sample_id]

# reshape the vector back to a 2D image
sample_image = np.reshape(sample_image, (8, 8))

# plot the sample
print(f"Sample labeled as {labels[sample_id]}")
plt.imshow(sample_image, cmap="gray")
plt.show()

3.3. Visualize the results on PCA-reduced data#

sklearn.decomposition.PCA allows us to project the data from the original 64-dimensional space into a lower-dimensional space. Here we use PCA to project into a 2-dimensional space so that we can plot the data and the clusters in this new space.

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# reduce the 64-dimension data to 2-D
pca_data = PCA(n_components=2).fit_transform(data)
# simple, non-fancy plot
plt.plot(pca_data[:, 0], pca_data[:, 1], "k.", markersize=2)

3.4. Iterate with K-Means#

First, we’ll define a method plot_kmeans that will add color, centroids, and lines to our PCA plot.

import matplotlib.pyplot as plt
from IPython.display import clear_output


def plot_kmeans(reduced_data, kmeans, iteration):
    # Step size of the mesh. Decrease to increase the quality of the VQ.
    h = 0.02  # point in the mesh [x_min, x_max]x[y_min, y_max].

    # Plot the decision boundary. For that, we will assign a color to each point in the mesh.
    x_min, x_max = reduced_data[:, 0].min() - 1, reduced_data[:, 0].max() + 1
    y_min, y_max = reduced_data[:, 1].min() - 1, reduced_data[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

    # Obtain labels for each point in mesh. Use last trained model.
    Z = kmeans.predict(np.c_[xx.ravel(), yy.ravel()])

    # Clear previous plot
    clear_output(wait=True)

    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.figure(1)
    plt.clf()
    plt.imshow(
        Z,
        interpolation="nearest",
        extent=(xx.min(), xx.max(), yy.min(), yy.max()),
        cmap=plt.cm.Paired,
        aspect="auto",
        origin="lower",
    )

    plt.plot(reduced_data[:, 0], reduced_data[:, 1], "k.", markersize=2)
    # Plot the centroids as a white X
    centroids = kmeans.cluster_centers_
    plt.scatter(
        centroids[:, 0],
        centroids[:, 1],
        marker="x",
        s=169,
        linewidths=3,
        color="w",
        zorder=10,
    )
    plt.title(
        "K-means clustering on the digits dataset (PCA-reduced data)\n"
        "Centroids are marked with white cross\n"
        "Iteration: {}".format(iteration)
    )
    plt.xlim(x_min, x_max)
    plt.ylim(y_min, y_max)
    plt.xticks(())
    plt.yticks(())
    plt.show()

3.4.1. Step through K-Means#

Before we use the built-in fit method, let’s step through what the algorithm is doing. We will do this with MiniBatchKMeans.

This first code block randomly picks centroids and then runs k-means once before plotting.

from sklearn.cluster import MiniBatchKMeans
from sklearn.decomposition import PCA
from time import sleep


# PCA reduction to 2D
reduced_data = PCA(n_components=2).fit_transform(data)

# initialize
kmeans_step = MiniBatchKMeans(
    n_clusters=10, init="random", n_init=1, batch_size=10, max_iter=1, random_state=16
)

kmeans_step.partial_fit(reduced_data)

# display clusters after the first iteration
plot_kmeans(reduced_data, kmeans_step, 1)

max_iterations = 16
step_size = 1
for i in range(step_size, max_iterations, step_size):
    # run step_size more iterations between plots
    for k in range(step_size):
        kmeans_step.partial_fit(reduced_data)

    # 1 initial iteration + i further iterations so far
    plot_kmeans(reduced_data, kmeans_step, i + 1)
    sleep(0.05)

3.5. Fit#

We’ll now use the built-in fit method. This is more realistic for an actual application.

Instead of specifying the number of iterations, this method stops automatically when it converges, i.e., when the centroid locations no longer change meaningfully between rounds (within a tolerance).
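
As a quick illustration (a minimal sketch that reuses the PCA-reduced reduced_data from the previous section; the parameter values shown are only examples, not required lab settings), we can fit a KMeans model and check how many iterations it actually ran before stopping:

from sklearn.cluster import KMeans

# Sketch: fit with the built-in method; it stops once the centroid shift between
# rounds falls below `tol`, or after `max_iter` iterations, whichever comes first.
kmeans_fit = KMeans(n_clusters=10, n_init=4, tol=1e-4, max_iter=300, random_state=0)
kmeans_fit.fit(reduced_data)

print(f"Stopped after {kmeans_fit.n_iter_} iterations")
print(f"Final inertia: {kmeans_fit.inertia_:.1f}")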

3.5.1. Define our evaluation benchmark#

We will first define our evaluation benchmark. In this benchmark, we will compare different initialization methods for KMeans. Our benchmark will:

  • create a pipeline which will scale the data using a sklearn.preprocessing.StandardScaler

  • train and time the pipeline fitting;

  • measure the performance of the clustering obtained via different metrics.

3.5.1.1. Metrics#

We will pull several metrics from sklearn.metrics. It takes a fair amount of knowledge to understand what a set of metrics is communicating, so it is worth reading about the ones we are using here.

We will use multiple metrics because they communicate different things about our model. Depending on the application and our assessment of risk, we will likely bias towards accepting one type of error over another.
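
Before running the benchmark, a tiny toy example (made-up labels, separate from the digits data) shows why a single score can be misleading: a clustering that splits one true class into two clusters is still perfectly homogeneous, but its completeness, and therefore its v-measure, drops.

from sklearn import metrics

# Toy ground truth: 6 samples from 3 classes
labels_true = [0, 0, 1, 1, 2, 2]
# Hypothetical clustering that splits class 2 across two clusters
labels_pred = [0, 0, 1, 1, 2, 3]

h = metrics.homogeneity_score(labels_true, labels_pred)
c = metrics.completeness_score(labels_true, labels_pred)
v = metrics.v_measure_score(labels_true, labels_pred)
print(f"homogeneity={h:.2f}, completeness={c:.2f}, v-measure={v:.2f}")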

from time import time
from sklearn import metrics
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def bench_k_means(kmeans, name, data, labels):
    """Benchmark to evaluate the KMeans initialization methods.

    Parameters
    ----------
    kmeans : KMeans instance
        A :class:`~sklearn.cluster.KMeans` instance with the initialization
        already set.
    name : str
        Name given to the strategy. It will be used to show the results in a
        table.
    data : ndarray of shape (n_samples, n_features)
        The data to cluster.
    labels : ndarray of shape (n_samples,)
        The labels used to compute the clustering metrics, which require some
        supervision.
    """
    t0 = time()
    estimator = make_pipeline(StandardScaler(), kmeans).fit(data)
    fit_time = time() - t0
    results = [name, fit_time, estimator[-1].inertia_]

    # Define the metrics which require only the true labels and estimator labels
    clustering_metrics = [
        metrics.homogeneity_score,
        metrics.completeness_score,
        metrics.v_measure_score,
    ]

    results += [m(labels, estimator[-1].labels_) for m in clustering_metrics]

    # The silhouette score requires the full dataset
    results += [
        metrics.silhouette_score(
            data,
            estimator[-1].labels_,
            metric="euclidean",
            sample_size=300,
        )
    ]

    # Show the results
    formatter_result = "{:9s}\t{:.3f}s\t{:.0f}\t{:.3f}\t{:.3f}\t{:.3f}\t{:.3f}"
    print(formatter_result.format(*results))

3.5.2. Run the benchmark#

We will compare three approaches:

  • an initialization using k-means++. This method is stochastic and we will run the initialization 4 times;

  • a random initialization. This method is stochastic as well and we will run the initialization 4 times;

  • an initialization based on a PCA projection. This method is deterministic and a single initialization suffices.

import numpy as np
from sklearn.datasets import load_digits
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Load a fresh copy of the dataset
data, labels = load_digits(return_X_y=True)
(n_samples, n_features), n_digits = data.shape, np.unique(labels).size
print(f"# digits: {n_digits}; # samples: {n_samples}; # features {n_features}")

# How many clusters
n_digits = 10

# top bar of stats
print("Results for", n_digits, "clusters")
print(82 * "_")
print("init\t\ttime\tinertia\thomo\tcompl\tv-meas\tsilhouette")

# try with k-means++
kmeans = KMeans(init="k-means++", n_clusters=n_digits, n_init=4, random_state=0)
bench_k_means(kmeans=kmeans, name="k-means++", data=data, labels=labels)

# try with random
kmeans = KMeans(init="random", n_clusters=n_digits, n_init=4, random_state=0)
bench_k_means(kmeans=kmeans, name="random", data=data, labels=labels)

# try with PCA
pca = PCA(n_components=n_digits).fit(data)
kmeans = KMeans(init=pca.components_, n_clusters=n_digits, n_init=1)
bench_k_means(kmeans=kmeans, name="PCA-based", data=data, labels=labels)

# bottom bar of stats
print(82 * "_")

GPT 3.5: Please interpret these results from sklearn.metrics: init, time, inertia, homo, compl, v-meas, silhouette

Overall metrics

  • Time: This is the time taken for the respective clustering initialization method to complete.

  • Inertia: The sum of squared distances between data points and their cluster’s center. A lower inertia indicates better clustering.

The next three metrics evaluate the quality of the clustering in terms of how well the clusters capture the true classes in your dataset. They are measures of purity and completeness.

  • Homogeneity: measures how much each cluster contains only data points that are members of a single class. Higher values are better.

  • Completeness: measures how well all data points that are members of a given class are assigned to the same cluster. Higher values are better.

  • V-Measure: the harmonic mean of homogeneity and completeness. It provides a balanced view of both measures.

Say more about Silhouette score

Finally, the Silhouette Score measures the quality of clusters in a dataset. It takes into account both cohesion (how close the data points within the same cluster are to each other) and separation (how far apart different clusters are from each other). The Silhouette Score is computed for each data point and then averaged to obtain an overall score for the entire dataset. The score ranges from -1 to 1.

  1. High Positive Scores: If most of the Silhouette Scores are close to +1, it suggests that the clusters are well-separated and the data points are appropriately assigned to their respective clusters. This indicates a good clustering solution.

  2. Scores Around 0: If the Silhouette Scores are around 0, it indicates that data points might be on or near the decision boundary between clusters. This could imply that the clusters are overlapping or that the clustering algorithm is having difficulty distinguishing between certain data points.

  3. Negative Scores: If a significant number of data points have negative Silhouette Scores, it implies that these data points might have been assigned to the wrong clusters.
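
To see these ranges in practice, here is a small sketch (not part of the original lab; it uses synthetic make_blobs data rather than the digits) that compares the silhouette score of well-separated clusters against heavily overlapping ones:

from sklearn import metrics
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

for std in (0.5, 5.0):  # small spread = well separated, large spread = heavy overlap
    X, _ = make_blobs(n_samples=500, centers=3, cluster_std=std, random_state=0)
    pred = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
    print(f"cluster_std={std}: silhouette={metrics.silhouette_score(X, pred):.3f}")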