Hierarchical Clustering

Hierarchical clustering is a family of unsupervised machine learning algorithms that aim to build a hierarchy of clusters. Unlike partitioning methods like K-Means, it doesn't require specifying the number of clusters beforehand. Instead, it creates a tree-like structure called a dendrogram that visually represents how data points are grouped together at different levels of similarity.

There are two main approaches: agglomerative (bottom-up) and divisive (top-down). Agglomerative clustering starts with each data point as its own cluster and iteratively merges the closest clusters until all points belong to a single cluster. Conversely, divisive clustering starts with all points in one cluster and recursively divides them into smaller clusters. The “closeness” between clusters is determined by a linkage criterion, such as single, complete, average, or Ward linkage, which impacts the shape and size of the resulting clusters.
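As a minimal sketch of how the linkage choice is made in practice (the make_blobs data and n_clusters=3 below are illustrative assumptions, not part of this module's dataset), scikit-learn's AgglomerativeClustering exposes the criterion as the linkage parameter:

      # Minimal sketch: the linkage criterion is a parameter of AgglomerativeClustering.
      # The synthetic make_blobs data and n_clusters=3 are illustrative choices only.
      from sklearn.datasets import make_blobs
      from sklearn.cluster import AgglomerativeClustering

      X_demo, _ = make_blobs(n_samples=100, centers=3, random_state=42)

      for linkage_method in ['single', 'complete', 'average', 'ward']:
          model = AgglomerativeClustering(n_clusters=3, linkage=linkage_method)
          demo_labels = model.fit_predict(X_demo)
          print(linkage_method, '->', demo_labels[:10])   # first ten cluster assignments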

The strength of hierarchical clustering lies in its ability to reveal the hierarchical relationships within the data. The dendrogram allows for a visual inspection of the clustering process, and the desired number of clusters can be determined by cutting the dendrogram at an appropriate level.
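For example, once a linkage matrix has been computed with SciPy, the fcluster() function can "cut" the hierarchy into a chosen number of flat clusters. The sketch below uses small random 2-D points purely for illustration:

      # Sketch: cut a hierarchy into a fixed number of flat clusters with SciPy.
      # The random 2-D points are illustrative only.
      import numpy as np
      from scipy.cluster.hierarchy import linkage, fcluster

      rng = np.random.default_rng(0)
      points = rng.normal(size=(20, 2))
      Z = linkage(points, method='ward')                     # linkage matrix

      flat_labels = fcluster(Z, t=2, criterion='maxclust')   # cut into 2 clusters
      print(flat_labels)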

This module outlines the following steps:

  ➤ 1. Import data

  ➤ 2. Analyze data

  ➤ 3. Dendrogram

  ➤ 4. Agglomerative Clustering

  ➤ 5. Visualize the clusters

  ➤ 6. Accuracy Score

  ➤ 7. Confusion Matrix

  ➤ 8. Silhouette score

  ➤ 9. Strengths and Weaknesses


↪ 1. Import data

The “Breast Cancer” dataset, a toy binary classification dataset included in the scikit-learn library, serves as a popular example for demonstrating machine learning clustering tasks. The dataset's features are calculated from digitized images of fine needle aspirates (FNA) of breast masses and are used to predict if a mass is malignant (cancerous) or benign (non-cancerous).

The sklearn.datasets package provides access to the “Breast Cancer” dataset, which is loaded using the load_breast_cancer() function. The returned dataset object is stored in the variable bc. This object contains the features, targets (labels), and descriptive metadata. Specifically, bc.feature_names holds the names of all 30 features (for example, “mean radius” and “mean texture”).

      # Load Breast Cancer dataset
      from sklearn.datasets import load_breast_cancer
      bc = load_breast_cancer()
 
      # Access the data and target attributes
      print(bc.feature_names)

      ---Output---
      # ['mean radius' 'mean texture' 'mean perimeter' 'mean area'
      #  'mean smoothness' 'mean compactness' 'mean concavity'
      #  'mean concave points' 'mean symmetry' 'mean fractal dimension'
      #  'radius error' 'texture error' 'perimeter error' 'area error'
      #  'smoothness error' 'compactness error' 'concavity error'
      #  'concave points error' 'symmetry error' 'fractal dimension error'
      #  'worst radius' 'worst texture' 'worst perimeter' 'worst area'
      #  'worst smoothness' 'worst compactness' 'worst concavity'
      #  'worst concave points' 'worst symmetry' 'worst fractal dimension']

The dataset contains a total of 569 instances (or data points).

      data = bc.data
      print("Number of Instances, Number of features", data.shape)

      ---Output---
      # Number of Instances, Number of features (569, 30)

The bc.target_names attribute provides the labels for the classifications. It has two classes:
      Malignant (0): Indicates a cancerous breast mass.
      Benign (1): Indicates a non-cancerous breast mass.
 

      print(bc.target_names)
      target = bc.target

      ---Output---
      # ['malignant' 'benign']


↪ 2. Analyze data

The code below calls load_breast_cancer() with the return_X_y=True argument to return the features and the target variable (labels) separately, and with as_frame=True to load the data as a pandas DataFrame (features) and Series (target). The feature data is assigned to the variable df, and the target variable is assigned to m.

      df, m = load_breast_cancer(return_X_y=True, as_frame=True)
      print("Number of Instances", df.shape)

      ---Output---
      # Number of Instances (569, 30)

Next, the code explores the target variable m. It uses the .value_counts() method to display the number of occurrences of each unique value within the target variable. The printed output indicates the number of instances in the 'malignant' category and the number in the 'benign' category, revealing the class distribution in the dataset.

      print("m values", m.value_counts()) # 0 = malignant, 1 = benign

      ---Output---
      # m values target
      # 1    357
      # 0    212
      # Name: count, dtype: int64


↪ 3. Dendrogram

Feature Scaling

The code below extracts the feature data (bc.data) and the target variable (bc.target) from the previously loaded dataset, assigning them to X and y, respectively. It then imports StandardScaler from scikit-learn, because feature scaling is essential for distance-based algorithms. A StandardScaler object is created and used to transform X into X_scaled. Standardization gives each feature a mean of 0 and a standard deviation of 1, which prevents features with larger numeric ranges from dominating the clustering process.

      X = bc.data
      y = bc.target
      from sklearn.preprocessing import StandardScaler
      scaler = StandardScaler()
      X_scaled = scaler.fit_transform(X)
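
As a quick sanity check (an optional addition, not part of the original walkthrough), one can verify that each scaled feature now has a mean of approximately 0 and a standard deviation of approximately 1:

      # Optional sanity check: scaled features should have mean ~0 and std ~1
      import numpy as np
      print(np.round(X_scaled.mean(axis=0)[:5], 6))   # first five feature means
      print(np.round(X_scaled.std(axis=0)[:5], 6))    # first five feature stds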

Linkage Matrix

The code below performs the core computation for hierarchical clustering and prepares the data for a dendrogram, a graphical representation of the hierarchical clustering process. It imports the linkage() function from the scipy.cluster.hierarchy module and applies it to the scaled data X_scaled using the 'ward' method, which at each step merges the pair of clusters that yields the smallest increase in within-cluster variance. The output is stored in the linkage_matrix variable.

      from scipy.cluster.hierarchy import linkage
      linkage_matrix = linkage(X_scaled, method='ward')
      print(linkage_matrix.shape)

      ---Output---
      # (568, 4)
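
Each of the 568 rows of the linkage matrix describes one merge step. By SciPy's convention, the four columns are the indices of the two clusters being merged, the distance between them, and the number of original samples in the newly formed cluster. The optional snippet below prints the first few merge steps:

      # Optional: inspect the first merge steps of the linkage matrix.
      # Columns: [cluster index 1, cluster index 2, merge distance, sample count]
      print(linkage_matrix[:3])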


Visualize

The code below visualizes the results of hierarchical clustering as a dendrogram. The dendrogram() function is called with linkage_matrix as input to draw the hierarchy.

    – The labels=listy argument attaches the true class label to each leaf of the dendrogram.
    – leaf_rotation=90 rotates the leaf labels 90 degrees for readability.

      listy = bc.target_names[y]
 
      from scipy.cluster.hierarchy import dendrogram
      import matplotlib.pyplot as plt
      plt.figure(figsize=(18, 6))
      dendrogram(linkage_matrix, labels=listy, leaf_rotation=90)
      plt.title('Hierarchical Clustering Dendrogram')
      plt.xlabel('Index')
      plt.ylabel('Distance')
      plt.show()

[Figure: Hierarchical Clustering Dendrogram]
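
With 569 leaves, the full dendrogram can be difficult to read. An optional variation (not part of the original code, reusing the imports above) is to truncate the plot to the last merges using the truncate_mode and p parameters of dendrogram():

      # Optional: truncated dendrogram showing only the last 20 merged clusters
      plt.figure(figsize=(12, 5))
      dendrogram(linkage_matrix, truncate_mode='lastp', p=20, leaf_rotation=90)
      plt.title('Truncated Hierarchical Clustering Dendrogram')
      plt.ylabel('Distance')
      plt.show()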


↪ 4. Agglomerative Clustering

The code below demonstrates the use of Agglomerative Clustering from the scikit-learn library. First, it initializes an AgglomerativeClustering object, setting the n_clusters parameter to 2, meaning it will attempt to group the data into two distinct clusters. The linkage parameter is set to 'ward', which specifies the Ward linkage method. This particular method aims to minimize the variance within each cluster, resulting in more compact and well-defined groups.

Next, the .fit_predict(X_scaled) method is called; it fits the model to the data and returns the cluster assignment for each data point. The returned labels variable therefore categorizes the data into the two clusters based on their similarity.

      from sklearn.cluster import AgglomerativeClustering
      agg_cluster = AgglomerativeClustering(n_clusters=2, linkage='ward')
      labels = agg_cluster.fit_predict(X_scaled)

The code below combines the true class labels and the predicted cluster labels into a single, viewable pandas DataFrame for easier comparison. It converts the cluster labels (labels) from the previous agglomerative clustering process into a pandas DataFrame named 'clusterdf'. Similarly, it converts the true class labels (y) into another pandas DataFrame named 'ydf'. Using the pd.concat() function, it combines 'ydf' and 'clusterdf' into a single DataFrame named 'dframe', concatenating them column-wise (specified by axis=1). The code prints the first five rows of the dframe for a side-by-side comparison.

      import pandas as pd
      clusterdf = pd.DataFrame(labels)
      ydf = pd.DataFrame(y)
      dframe = pd.concat([ydf,clusterdf], axis=1)
      dframe.columns = ['True Label y','Predicted']
      print(dframe.head(5))
      #OUT 
      #     True Label y  Predicted
      #  0             0          0
      #  1             0          0
      #  2             0          0
      #  3             0          0
      #  4             0          0


↪ 5. Visualize the clusters

The code below visualizes the hierarchical clustering results with matplotlib. It draws a scatter plot whose x-axis and y-axis are the second and third scaled features (X_scaled[:, 1] and X_scaled[:, 2], i.e. mean texture and mean perimeter). The parameter c=labels colors each point according to its cluster assignment, and cmap='viridis' sets the colormap. The marker size is set to 50 and alpha to 0.5 to reduce overplotting.

      plt.figure(figsize=(10, 6))
      # Color points by cluster; axes are the second and third scaled features
      plt.scatter(X_scaled[:, 1], X_scaled[:, 2], c=labels, cmap='viridis', s=50, alpha=0.5)
      plt.title('Hierarchical Clustering')
      plt.xlabel('Mean texture (scaled)')
      plt.ylabel('Mean perimeter (scaled)')
      plt.show()

[Figure: Hierarchical Clustering scatter plot of the two clusters]
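
Because the data has 30 dimensions, plotting two raw features shows only one slice of the structure. An alternative sketch (not part of the original walkthrough) is to project the scaled data onto its first two principal components and color the points by cluster:

      # Optional sketch: visualize the clusters in the first two principal components
      from sklearn.decomposition import PCA
      X_pca = PCA(n_components=2).fit_transform(X_scaled)
      plt.figure(figsize=(10, 6))
      plt.scatter(X_pca[:, 0], X_pca[:, 1], c=labels, cmap='viridis', s=50, alpha=0.5)
      plt.title('Hierarchical Clustering (PCA projection)')
      plt.xlabel('Principal Component 1')
      plt.ylabel('Principal Component 2')
      plt.show()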


↪ 6. Accuracy Score

The code below counts the data points that were assigned to the “correct” cluster by comparing the predicted cluster labels (labels) with the true class labels (y): summing the boolean array (y == labels) gives the number of matches, which is stored in correct_labels. Note that cluster numbers are assigned arbitrarily; this direct comparison works here because cluster 0 happens to align with the malignant class (label 0).

      correct_labels = sum(y == labels)
      print("Result: %d out of %d samples were correctly labeled." % (correct_labels, y.size))
      print('Accuracy score: {0:0.2f}'.format(correct_labels/float(y.size)))

      ---Output---
      # Result: 501 out of 569 samples were correctly labeled.
      # Accuracy score: 0.88

Alternatively, the accuracy_score() function from sklearn.metrics computes the same proportion directly.

      from sklearn.metrics import accuracy_score
      accuracy = accuracy_score(ydf, clusterdf)
      print(f"Accuracy: {accuracy}")

      ---Output---
      # Accuracy: 0.8804920913884007
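
Because cluster numbers are assigned arbitrarily, a direct accuracy comparison only works when the cluster numbering happens to match the class numbering, as it does here. A permutation-invariant alternative (shown below as an optional sketch) is the adjusted Rand index from sklearn.metrics:

      # Optional: permutation-invariant agreement between clusters and true labels
      from sklearn.metrics import adjusted_rand_score
      ari = adjusted_rand_score(y, labels)
      print(f"Adjusted Rand Index: {ari:.2f}")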


↪ 7. Confusion Matrix

The code below calculates and displays a confusion matrix to evaluate the performance of a classification model. It uses the confusion_matrix() function to generate a confusion matrix, which summarizes the model's predictions against the true labels.

From this matrix, the code extracts the individual counts for True Positives (TP), False Positives (FP), True Negatives (TN), and False Negatives (FN).

      from sklearn.metrics import confusion_matrix
      cm = confusion_matrix(ydf, clusterdf)
      print(cm)

      ---Output---
      # [[164  48]
      #  [ 20 337]]

The .ravel() method flattens the confusion matrix into a 1D array. A typical 2×2 confusion matrix (for binary classification) looks like this: [[TN, FP], [FN, TP]]

      TN, FP, FN, TP = confusion_matrix(ydf, clusterdf).ravel()
      print('True Positive (TP)  = ', TP)
      print('False Positive (FP) = ', FP)
      print('True Negative (TN)  = ', TN)
      print('False Negative (FN) = ', FN)

      ---Output---
      # True Positive (TP)  =  337
      # False Positive (FP) =  48
      # True Negative (TN)  =  164
      # False Negative (FN) =  20

Accuracy score can be calculated using the following formula.

      accuracy =  (TP+TN) / (TP+FP+TN+FN)
      print('Accuracy of the classification = {:0.3f}'.format(accuracy))

      ---Output---
      # Accuracy of the classification = 0.880


↪ 8. Silhouette score

The code below calculates and displays the silhouette score, a metric used to evaluate the quality of clustering results. It imports the silhouette_score() function from sklearn.metrics, which computes the silhouette coefficient for each sample and returns their mean. The function takes the scaled feature data (X_scaled) and the cluster labels (labels) as input, and the resulting average silhouette score is stored in the variable silhouette_avg. The score measures how similar each point is to its own cluster compared with other clusters; it ranges from -1 to 1, with higher values indicating better-defined clusters.

      from sklearn.metrics import silhouette_score
      silhouette_avg = silhouette_score(X_scaled, labels)
      print(f"Average Silhouette Score: {silhouette_avg:.2f}")

      ---Output---
      # Average Silhouette Score: 0.34

The average silhouette score of 0.34 indicates that the two clusters are only moderately well separated, since scores closer to 1 are desirable.
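
The silhouette score can also guide the choice of the number of clusters. The optional sketch below recomputes the score for several cluster counts so they can be compared:

      # Optional: compare silhouette scores for several cluster counts
      from sklearn.cluster import AgglomerativeClustering
      from sklearn.metrics import silhouette_score
      for k in range(2, 6):
          k_labels = AgglomerativeClustering(n_clusters=k, linkage='ward').fit_predict(X_scaled)
          print(k, 'clusters -> silhouette:', round(silhouette_score(X_scaled, k_labels), 3))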


↪ 9. Strengths and Weaknesses

Strengths

  • Unlike partitioning algorithms such as K-Means, hierarchical clustering doesn't require the number of clusters to be specified in advance. A dendrogram helps to visualize the structure and choose an appropriate number of clusters.
  • The output is a hierarchy, which can reveal nested relationships and the order in which clusters are formed.
  • Can be used with various distance metrics (e.g., Euclidean, Manhattan, cosine).
  • The algorithm is deterministic: given the same input and linkage method, the results will always be the same.
Weaknesses

  • For large datasets, hierarchical clustering can be significantly slower than methods like K-Means, especially with the bottom-up (agglomerative) approach: naive implementations take O(n^3) time and require O(n^2) memory.
  • Outliers can heavily influence cluster formation in certain linkage methods, leading to poor results if the data is noisy.
  • The method used to define distance between clusters (e.g., Ward, complete, average) can drastically affect the structure of the resulting clusters and may require some experimentation to pick the best method for the data.