Analysis of Clustering Algorithms
A Comparative Study of Clustering Algorithms on Customer Data
In this article, we analyze three clustering algorithms: K-Means, K-Medoids, and Hierarchical clustering. We compare their performance on a sample dataset using the Silhouette score and discuss the factors to consider when choosing a clustering algorithm.
Clustering is an unsupervised machine learning technique that groups data points based on their similarity. Clustering algorithms can reveal hidden patterns in data and identify groups of similar data points.
K-Means is a popular clustering algorithm that divides a dataset into k clusters, where k is a user-specified number. It works by iteratively assigning each data point to the cluster with the closest mean. K-Means is simple and efficient, but it can be sensitive to outliers and noise.
K-Medoids is similar to K-Means, but instead of means it uses medoids: actual data points that are most central within each cluster. K-Medoids is more robust to outliers and noise than K-Means, but it is also more computationally expensive.
Hierarchical clustering builds a hierarchy of clusters by iteratively merging or splitting them. It is more flexible than K-Means and K-Medoids, but its results can be harder to interpret.
In the following sections, we compare the performance of these three clustering algorithms on a customer dataset using the Silhouette score, walking through each step from preprocessing to evaluation.
Dataset Description:
Customer segmentation is the practice of dividing a customer base into groups of individuals that are similar in specific ways relevant to marketing, such as age, gender, interests and spending habits.
The dataset has 8069 samples with the attributes ID, Gender, Ever_Married, Age, Graduated, Profession, Work_Experience, Spending_Score, Family_Size, Var_1, and Segmentation. A few values are missing, which we will handle during preprocessing.
Import all the necessary libraries:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn_extra.cluster import KMedoids
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import silhouette_score
Read the Dataset:
df = pd.read_csv('dataset.csv')
df
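Before encoding anything, it helps to inspect the column types and count the missing values mentioned above. A minimal inspection sketch:
# Inspect column types and count missing values before preprocessing
df.info()
print(df.isnull().sum())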
Convert the string data into numerical data:
Profession = {
    'Healthcare':1, 'Engineer':2, 'Lawyer':3, 'Entertainment':4, 'Artist':5,
    'Executive':6, 'Doctor':7, 'Homemaker':8, 'Marketing':9
}
df['Profession'] = df['Profession'].map(Profession)
var = {
    'Cat_4':4, 'Cat_6':6, 'Cat_7':7, 'Cat_3':3, 'Cat_1':1, 'Cat_2':2, 'Cat_5':5
}
df['Var_1'] = df['Var_1'].map(var)
graduate = {
    'No':0, 'Yes':1
}
df['Graduated'] = df['Graduated'].map(graduate)
married = {
    'No':0, 'Yes':1
}
df['Ever_Married'] = df['Ever_Married'].map(married)
gender = {
    'Male':0, 'Female':1
}
df['Gender'] = df['Gender'].map(gender)
spending = {
    'Low':0, 'Average':1, 'High':2
}
df['Spending_Score'] = df['Spending_Score'].map(spending)
segmentation = {
    'A':0, 'B':1, 'C':2, 'D':3
}
df['Segmentation'] = df['Segmentation'].map(segmentation)
df = df.drop('ID', axis=1)
The code first creates a dictionary called Profession that maps each profession to a number, then uses the map() method to replace each value in the Profession column with its corresponding code. The same process is repeated for the Var_1, Graduated, Ever_Married, Gender, Spending_Score, and Segmentation columns. Finally, the ID column is dropped from the DataFrame, since it carries no information useful for clustering.
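As an aside, the LabelEncoder imported earlier can automate this encoding when the exact integer codes do not matter. A minimal sketch (note that it assigns codes alphabetically, so the values will differ from the manual maps above):
# Alternative: let LabelEncoder assign integer codes automatically
from sklearn.preprocessing import LabelEncoder

for col in ['Gender', 'Ever_Married', 'Graduated', 'Profession',
            'Spending_Score', 'Var_1', 'Segmentation']:
    le = LabelEncoder()
    # astype(str) turns missing values into a literal 'nan' category,
    # so impute first if you take this route
    df[col] = le.fit_transform(df[col].astype(str))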
Fill the null values with the mean of each column:
from sklearn.impute import SimpleImputer
# Create a SimpleImputer object
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
# Fit the imputer to the DataFrame
imputer.fit(df)
# Transform the DataFrame
imputed_df = imputer.transform(df)
# Convert the imputed array back to a Pandas DataFrame
df = pd.DataFrame(imputed_df, columns=df.columns)
The code imports the SimpleImputer class from sklearn.impute, creates a SimpleImputer object, and fits it to the DataFrame. The missing values are then replaced with the mean of the non-missing values in each column, and because transform() returns a NumPy array, the result is converted back into a Pandas DataFrame.
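Before clustering, the features should be standardized so that no single column dominates the distance calculations; the code below operates on the resulting data_scaled array. A minimal sketch of that step, using the StandardScaler imported earlier:
# Standardize the features to zero mean and unit variance;
# the clustering and PCA steps below operate on this scaled array
scaler = StandardScaler()
data_scaled = scaler.fit_transform(df)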
Choose the number of clusters:
# Choose an appropriate number of clusters
num_clusters = 5
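A minimal sanity check for this choice, assuming data_scaled from the scaling step above, is to sweep candidate values of k and compare their Silhouette scores:
# Sweep candidate cluster counts and compare Silhouette scores
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(data_scaled)
    print(f"k={k}: silhouette={silhouette_score(data_scaled, labels):.3f}")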
Perform the clustering algorithms:
# Perform K-Means clustering
kmeans = KMeans(n_clusters=num_clusters, n_init=10, random_state=42)
kmeans_labels = kmeans.fit_predict(data_scaled)
# Perform K-Medoids clustering
kmedoids = KMedoids(n_clusters=num_clusters, random_state=42)
kmedoids_labels = kmedoids.fit_predict(data_scaled)
# Perform Hierarchical clustering; Ward linkage implies Euclidean
# distances, so the deprecated affinity argument is not needed
hierarchical = AgglomerativeClustering(n_clusters=num_clusters, linkage='ward')
hierarchical_labels = hierarchical.fit_predict(data_scaled)
The code performs three different clustering algorithms on the data: K-Means, K-Medoids, and Hierarchical.
K-Means clustering is an algorithm that groups data points into a specified number of clusters based on their similarity. The algorithm works by iteratively assigning data points to the cluster with the closest mean.
K-Medoids clustering is similar to K-Means, but instead of using the mean of the data points as a cluster center, it uses a medoid: the actual data point that is most central within the cluster.
Hierarchical clustering is a recursive algorithm that creates a hierarchy of clusters. It can be agglomerative, starting with each data point as its own cluster and merging clusters until only one remains, or divisive, starting with all data points in one cluster and splitting until each point stands alone. AgglomerativeClustering, used here, is the agglomerative variant.
The code then assigns each data point to a cluster using the three different clustering algorithms.
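Although not part of the comparison itself, the merge hierarchy built by agglomerative clustering can be visualized as a dendrogram. A minimal sketch using SciPy (assumed installed alongside scikit-learn):
# Visualize the agglomerative merge hierarchy as a dendrogram
from scipy.cluster.hierarchy import dendrogram, linkage

# Ward linkage needs O(n^2) memory, so build the dendrogram on a sample
idx = np.random.RandomState(42).choice(len(data_scaled), 1000, replace=False)
Z = linkage(data_scaled[idx], method='ward')
dendrogram(Z, truncate_mode='level', p=4)  # show only the top 4 merge levels
plt.title('Ward linkage dendrogram (1,000-point sample)')
plt.show()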
Perform Dimensionality Reduction:
# Apply PCA for dimensionality reduction
pca = PCA(n_components=2)
principal_components = pca.fit_transform(data_scaled)
The code applies Principal Component Analysis (PCA) for dimensionality reduction. PCA is a statistical procedure that transforms a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components.
The code first creates a PCA object with 2 principal components. Then, it uses the fit_transform() method to project the scaled data onto those components, producing a two-column array in which each column is one principal component.
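It is worth checking how much of the original variance the two components retain, since the scatter plots below show only this 2D projection. A minimal sketch:
# Report how much variance the two principal components retain
print(pca.explained_variance_ratio_)
print(f"Total variance retained: {pca.explained_variance_ratio_.sum():.2%}")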
Create a DataFrame with PCA results and cluster labels:
pca_df = pd.DataFrame(data=principal_components,
                      columns=['principal component 1', 'principal component 2'])
pca_df['KMeans'] = kmeans_labels
pca_df['KMedoids'] = kmedoids_labels
pca_df['Hierarchical'] = hierarchical_labels
The code creates a new DataFrame called pca_df from the principal components, with the data argument set to the principal components and the columns argument naming the two components. It then adds the cluster labels from the K-Means, K-Medoids, and Hierarchical clustering algorithms to the pca_df DataFrame as the KMeans, KMedoids, and Hierarchical columns.
Transform the cluster centers using PCA:
kmeans_centers_2d = pca.transform(kmeans.cluster_centers_)
kmedoids_centers_2d = pca.transform(kmedoids.cluster_centers_)
The code transforms the cluster centers from the K-Means and K-Medoids algorithms into the same 2D PCA space as the data points, so they can be overlaid on the scatter plots below. This works because both algorithms were fitted on data_scaled, the same array the PCA was fitted on.
Plot a scatter plot for each clustering algorithm:
fig, axes = plt.subplots(1, 3, figsize=(20, 6))
colors = ['r', 'g', 'b', 'y', 'c', 'm']
titles = ['K-Means', 'K-Medoids', 'Hierarchical']
for i, algorithm in enumerate(['KMeans', 'KMedoids', 'Hierarchical']):
    # Draw each cluster's points in its own color
    for j in range(num_clusters):
        axes[i].scatter(pca_df.loc[pca_df[algorithm] == j, 'principal component 1'],
                        pca_df.loc[pca_df[algorithm] == j, 'principal component 2'],
                        c=colors[j], label=f'Cluster {j}')
    if algorithm == 'KMeans':
        axes[i].scatter(kmeans_centers_2d[:, 0], kmeans_centers_2d[:, 1],
                        c='black', marker='x', s=100, label='Centroids')
    elif algorithm == 'KMedoids':
        axes[i].scatter(kmedoids_centers_2d[:, 0], kmedoids_centers_2d[:, 1],
                        c='black', marker='x', s=100, label='Medoids')
    elif algorithm == 'Hierarchical':
        # Hierarchical clustering has no explicit centers, so mark each
        # cluster's mean in PCA space as a representative point
        for cluster in np.unique(hierarchical_labels):
            cluster_points = principal_components[hierarchical_labels == cluster]
            representative_point = cluster_points.mean(axis=0)
            axes[i].scatter(representative_point[0], representative_point[1],
                            c='black', marker='x', s=100,
                            label='Representative Points' if cluster == 0 else None)
    axes[i].set_title(titles[i])
    axes[i].set_xlabel("Principal Component 1")
    axes[i].set_ylabel("Principal Component 2")
    axes[i].legend()
plt.show()
The code creates 3 subplots, each of which shows the results of a different clustering algorithm: K-Means, K-Medoids, and Hierarchical. In each subplot, the data points are colored according to their cluster, and the cluster centroids or representative points are marked with black X markers. The titles and axis labels are also set for each subplot.
NOTE: Hierarchical clustering does not have explicit cluster centers like K-Means and K-Medoids. However, the mean of each cluster in PCA space can serve as a representative point for visualization purposes, which is what the plotting code above does.
Calculate the Silhouette Score:
from sklearn.metrics import silhouette_score
kmeans_silhouette = silhouette_score(data_scaled, kmeans_labels)
kmedoids_silhouette = silhouette_score(data_scaled, kmedoids_labels)
hierarchical_silhouette = silhouette_score(data_scaled, hierarchical_labels)
print("Silhouette Scores:")
print(f"K-Means: {(kmeans_silhouette)*100}")
print(f"K-Medoids: {(kmedoids_silhouette)*100}")
print(f"Hierarchical: {(hierarchical_silhouette)*100}")
The code uses the silhouette_score function from sklearn.metrics to evaluate each clustering. The Silhouette score measures how well each data point fits its own cluster relative to the nearest neighboring cluster, ranging from -1 to 1 (scaled by 100 in the output above): a high score means points are well matched to their clusters, while a low score means they sit near or across cluster boundaries.
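The single score above averages over all points; per-point values, available through silhouette_samples, can show which clusters are weakest. A minimal sketch for the K-Means labels:
# Inspect per-point silhouette values to see which clusters are weakest
from sklearn.metrics import silhouette_samples

sample_values = silhouette_samples(data_scaled, kmeans_labels)
for j in range(num_clusters):
    cluster_values = sample_values[kmeans_labels == j]
    print(f"Cluster {j}: mean silhouette = {cluster_values.mean():.3f}")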
By this measure, K-Means performed best on this dataset, though K-Medoids and Hierarchical clustering also produced reasonable groupings. The best algorithm for your own data will depend on its size, the presence of outliers and noise, and how interpretable the resulting clusters need to be.