Process overview for clustering-based user segmentation

Marc Deveaux
Jun 4, 2022



Notes from a May 2022 project

Main takeaways

  • The dissimilarity measure is the most important decision affecting the clusters and needs to be justified (rather than using Euclidean distance by default without any prior thought). See "An Introduction to Statistical Learning", page 396 and figure 10.13, on the choice of dissimilarity measure. For instance, correlation-based distance is often better suited for capturing users' product interests. It is good practice to find 2 users who should be in the same cluster and confirm with the dissimilarity matrix that they are indeed similar (the similarity x becomes 1 - x once you build a dissimilarity matrix); see the short sketch after this list
  • Transform all features using the quantile method with a normal output distribution and scale everything to [0, 1]. Note that there may be cases where data transformation is not recommended (probably not when using Euclidean distance, though). As with the previous point, the best way is to manually check some users that you know should be similar and see if the similarity measure after data transformation gives the expected result
  • I found the tricky part to be the choice of the features used for clustering. It is not a bad idea to start from hypotheses and check visually with 2-3 features whether any kind of cluster appears. Avoid throwing a lot of random features at the clustering (especially if they all have the same weight)
  • Avoid binary features with Euclidean distance. If they are important, they can be used (with some other features) to check/confirm the clusters' results
  • Spend some time checking the cluster results and their robustness. Don't take the result as it is without proper verification. Check the Google article on analyzing k-means clustering results. Business knowledge also kicks in
  • There are lots of cool tools/measures to automate all of this, but at the end of the day I feel it should be careful craftsmanship: keep it simple, but make sure each step is thought through, clearly understood and can be justified
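
A minimal sketch of the two-user sanity check mentioned above; the DataFrame user_features (rows = users, columns = transformed features) and the two user ids are assumptions, not names from the project:

# minimal sketch: compare two users expected to be similar,
# using both Euclidean and correlation-based dissimilarity
from scipy.spatial.distance import euclidean
from scipy.stats import pearsonr

u1 = user_features.loc["user_a"].to_numpy()
u2 = user_features.loc["user_b"].to_numpy()

eucl_dissim = euclidean(u1, u2)   # 0 means identical feature profiles
corr, _ = pearsonr(u1, u2)
corr_dissim = 1 - corr            # the "1 - x" used when building a dissimilarity matrix

print(f"euclidean: {eucl_dissim:.3f}, correlation-based: {corr_dissim:.3f}")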

Process overview

  1. Feature creation
  2. Feature transformation
  3. Checking hypothesis
  4. Refining the feature selection
  5. Choosing the dissimilarity measure
  6. Clustering
  7. Clustering verification
  8. Creating personas

Feature creation

The tricky part? The current process is to start by forming hypotheses about the user personas, then create features and do a visual check with 2-3 features to see whether some potential clusters can be spotted by eye. I did this with the data both before and after quantile transformation.
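
As a sketch of that visual check (the users DataFrame and the feature names are placeholders), a simple pairplot before and after quantile transformation is often enough to see whether rough groups appear:

# visual check of 2-3 candidate features, before and after quantile transformation
import pandas as pd
import seaborn as sns
from sklearn.preprocessing import QuantileTransformer

cols = ["purchase_frequency", "avg_basket_value", "days_since_last_visit"]  # hypothetical features
raw = users[cols]

qt = QuantileTransformer(output_distribution="normal", random_state=0)
transformed = pd.DataFrame(qt.fit_transform(raw), columns=cols)

sns.pairplot(raw)            # before transformation
sns.pairplot(transformed)    # after transformation: do any groups appear by eye?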

It is good practice to keep some features that are not used to build the clustering but are used to check the clustering result. See the example on p. 513 of "The Elements of Statistical Learning": clusters are based on gene expression, and you compare the clusters to different types of cancer. It allows you to say that the clustering is successful because the cancer types are correctly grouped in the clusters.

Note: I saw multiple projects where clustering was used after modeling: SHAP is used to take the top features, and clustering is then applied to them. See the notes from other projects section below.

Feature Transformation

See the Google course on k-means clustering: https://developers.google.com/machine-learning/clustering/prepare-data

They transform the data first, using normalization for features that have a Gaussian distribution, log transformation for right-skewed features, or quantile transformation for the rest. Note that the quantile transformation can be used by default on everything if you don't want to dive into each feature.

  • Divide the data into intervals where each interval contains an equal number of examples. These interval boundaries are called quantiles.
  • Convert your data into quantiles by performing the following steps:
  1. Decide the number of intervals.
  2. Define intervals such that each interval has an equal number of examples.
  3. Replace each example by the index of the interval it falls in.
  4. Bring the indexes to the same range as the other feature data by scaling the index values to [0, 1].
  • In order to create quantiles that are reliable indicators of the underlying data distribution, you need a lot of data. As a rule of thumb, to create n quantiles, you should have at least 10n examples. If you don't have enough data, stick to normalization.
  • Also clip outliers if needed.
  • Check that everything is scaled to [0, 1].
# quantile transformation to a normal distribution
from sklearn.preprocessing import QuantileTransformer, minmax_scale

qt = QuantileTransformer(n_quantiles=1000, random_state=0, output_distribution='normal')
ds_to_quantile = qt.fit_transform(ds_to_quantile)
# scale everything to [0, 1]
ds_to_quantile = minmax_scale(ds_to_quantile, feature_range=(0, 1), axis=0, copy=True)
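
The literal four-step bucketing described in the list above (replace each value by its quantile index, then scale the index to [0, 1]) can also be sketched with pandas; the users DataFrame and the column name are placeholders:

# quantile bucketing by hand: interval index scaled to [0, 1]
import pandas as pd

n_intervals = 10                                        # step 1: decide the number of intervals
idx = pd.qcut(users["purchase_amount"], q=n_intervals,  # steps 2-3: equal-count intervals -> index
              labels=False, duplicates="drop")
users["purchase_amount_q"] = idx / (n_intervals - 1)    # step 4: scale the index to [0, 1]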

Note:

  • They then use a neural network (an autoencoder) to create embeddings. Potentially you could also use PCA; the downside is that PCA only captures linear relations
  • The Google article recommends always transforming the data (for k-means, since it uses the Euclidean distance measure); however, in "An Introduction to Statistical Learning", whether to transform/scale the data is one of the questions that has to be asked, especially if you are using another distance measure
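
If an autoencoder feels like overkill, a PCA embedding can be sketched as below, keeping enough components for roughly 90% of the variance (the threshold is an assumption); remember the linear-only caveat above:

# PCA embedding as a lighter alternative to an autoencoder (linear relations only)
from sklearn.decomposition import PCA

pca = PCA(n_components=0.90, random_state=0)   # keep enough components for ~90% of the variance
embedding = pca.fit_transform(ds_to_quantile)  # transformed/scaled features from the previous step
print(pca.explained_variance_ratio_)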

Feature selection

  • Keep the promising features from the previous hypothesis check
  • Remove useless features and features that are highly correlated with others (see the sketch after this list)
  • Avoid adding too many features (i.e. too much weight) related to a single topic, for example having 6 features related to user income but only 2 features related to user socio-demographic information
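
A minimal sketch of the correlation check (the features DataFrame and the 0.9 threshold are assumptions): compute the absolute correlation matrix and flag one feature out of each highly correlated pair.

# flag one feature out of each highly correlated pair
import numpy as np

corr = features.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))   # upper triangle only
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
features_selected = features.drop(columns=to_drop)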

Select the dissimilarity measure

Be able to justify what you use and why: "Specifying an appropriate dissimilarity measure is far more important in obtaining success with clustering than choice of clustering algorithm." (see "The Elements of Statistical Learning" and "An Introduction to Statistical Learning" on this). So don't take Euclidean distance by default, as often seems to be the case. Correlation-based distance can be very useful as a dissimilarity measure: "This is particularly the case in gene expression data analysis, where we might want to consider genes similar when they are 'up' and 'down' together. It is also the case in marketing if we want to identify groups of shoppers with the same preferences in terms of items, regardless of the volume of items they bought. If Euclidean distance is chosen, then observations with high values of features will be clustered together. The same holds true for observations with low values of features."
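
A sketch of building a correlation-based dissimilarity matrix with scipy (user_features is the same assumed user-by-feature DataFrame as earlier); the resulting matrix can then be passed to any algorithm that accepts precomputed dissimilarities:

# correlation-based dissimilarity: d(i, j) = 1 - corr(user_i, user_j)
from scipy.spatial.distance import pdist, squareform

X = user_features.to_numpy()                       # rows = users, columns = transformed features
dissim_condensed = pdist(X, metric="correlation")  # pdist already returns 1 - correlation per pair
dissim_matrix = squareform(dissim_condensed)       # full user-by-user dissimilarity matrix

# sanity check: two users expected to be similar should have a small value here
print(dissim_matrix[0, 1])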


Clustering

Decisions that have to be made ("An Introduction to Statistical Learning", p. 399):

  • “Should the features first be standardized in some way? For instance, maybe the variables should be centered to have mean zero and scaled to have standard deviation one.”

In the case of hierarchical clustering:

  • What dissimilarity measure should be used?
  • What type of linkage should be used?
  • Where should we cut the dendrogram in order to obtain clusters?

In the case of K-means clustering:

  • how many clusters should we look for in the data?

“Each of these decisions can have a strong impact on the results obtained. In practice, we try several different choices, and look for the one with the most useful or interpretable solution. With these methods there is no single right answer — any solution that exposes some interesting aspects of the data should be considered”
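
A minimal sketch of trying those choices out, reusing X and dissim_condensed from the dissimilarity sketch above; the linkage method, the cut into 5 clusters and the range of k are arbitrary assumptions:

# hierarchical clustering on the precomputed correlation-based dissimilarities
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

Z = linkage(dissim_condensed, method="average")        # linkage choice
hier_labels = fcluster(Z, t=5, criterion="maxclust")   # "cut" the dendrogram into 5 clusters

# k-means (Euclidean) on the transformed features, trying several values of k
for k in range(2, 9):
    km = KMeans(n_clusters=k, random_state=0, n_init=10)
    labels = km.fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))    # higher silhouette = better separated clusters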

Clustering verification

See: https://developers.google.com/machine-learning/clustering/interpret

Recall that the verification below is specific to k-means. Overall, you start by checking the following things:

  • Cluster cardinality (number of examples per cluster): investigate any strong outliers
  • Cluster magnitude (sum of distances from all examples in a cluster to its centroid): investigate outliers
  • Plot magnitude vs. cardinality: they should increase roughly linearly together, so check for clusters that deviate from that line (see the sketch after this list)
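
The three checks above could be sketched like this for a fitted k-means model, where X is the transformed feature matrix used earlier and the number of clusters is an assumption:

# cluster cardinality and magnitude checks for a fitted k-means model
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

km = KMeans(n_clusters=5, random_state=0, n_init=10).fit(X)
labels, centers = km.labels_, km.cluster_centers_

cardinality = np.bincount(labels)                            # examples per cluster
dists = np.linalg.norm(X - centers[labels], axis=1)          # distance of each example to its centroid
magnitude = np.array([dists[labels == c].sum() for c in range(km.n_clusters)])

plt.scatter(cardinality, magnitude)   # should be roughly linear; flag clusters far off the line
plt.xlabel("cardinality"); plt.ylabel("magnitude")
plt.show()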

Then you check the similarity measure by picking some users you know are similar/dissimilar and confirming that you get the expected result (note that because you are working with a dissimilarity matrix, when checking the correlation result between 2 users you have to take 1 - corr_result).

A useful flowchart for this verification process can be found in the Google article linked above.

Create personas

Self-explanatory.

Notes from other projects I saw

Cluster on limited features + SHAP creation for explanation

  • First, define clusters based on purchase amount and purchase frequency for users who bought products from brand X
  • Then build a model for each cluster, comparing the cluster 1 user list vs. average users who bought the same overall genre (for example cosmetics) at the same frequency/amount but not the same brand (in the example I saw, they used a product genre from a higher level of the hierarchy for the average users); see the sketch after this list
  • Then use SHAP values to check/explain the main differentiating features
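
A hedged sketch of that last step, assuming a DataFrame comparison_df with a binary column in_cluster_1 separating the cluster's users from the comparison group; the model choice and all names are assumptions, and the shape of shap_values depends on the shap version:

# explain what distinguishes cluster 1 users from comparable average users
import shap
from sklearn.ensemble import RandomForestClassifier

X_cmp = comparison_df.drop(columns=["in_cluster_1"])
y_cmp = comparison_df["in_cluster_1"]                  # 1 = cluster 1 user, 0 = comparison user

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_cmp, y_cmp)

explainer = shap.TreeExplainer(clf)
shap_values = explainer.shap_values(X_cmp)             # a list with one array per class in this shap version
shap.summary_plot(shap_values[1], X_cmp)               # top features driving "cluster 1" membership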

Other clustering project

The features chosen for clustering actually came from SHAP after modeling (the top features were taken). Then follows a more classic approach, with PCA and the elbow method to choose the number of clusters.

Customer persona understanding: A vs B brand products

  • You have 2 sets of users: the ones who bought A and the ones who bought B
  • Take some features (socio-demographic, etc.)
  • Build a model to predict whether a user buys A
  • Use SHAP to understand the characteristics of A customers vs. the others
  • Run k-means clustering on A purchasers. Remove some features because they are 1) useless for taking business actions or 2) correlated with other features
  • Then check the distributions to see the different clusters, etc.
