Classifying bank transactions

Marc Deveaux
9 min read · Oct 12, 2024


This project is an application of a June 2023 research paper named “Scalable and weakly supervised bank transaction classification” by Flowcast_ai. It describes a way to classify bank transactions when you don’t have a training set. See also the Medium link below for a summary.

Main Sources

Challenges

Quoting the team behind it: “Despite the richness of insights that are possible with bank data, accurately classifying bank transactions is fraught with challenges. Traditional approaches such as manual labeling or rule-based systems are insufficient in the face of increasing transaction volumes and complexity. This leads to biased datasets of easily labeled transactions and low volumes of labels.

The rising adoption of machine learning algorithms, including deep neural networks, helps to categorize transactions, but a significant challenge remains: the lack of training labels and the often obscure transaction descriptions that prevent us from getting deeper insights. Therefore, a scalable solution is necessary — one that can handle large volumes of unlabelled transactional data while still providing accurate classifications.”

Objective

  • For a given bank transaction, assign a class label. The labels are spread across two levels of detail defined by the business, for example 1st level “outflow_travel” and 2nd level “outflow_travel_hotel”. Every transaction belongs to exactly one label class.
  • Credit and debit card transactions are out of scope because they come with a pre-assigned Visa/Mastercard MCC code. Therefore, a simple mapping can be made between the MCC and your custom transaction labels
  • The key pieces of information we want to leverage are the beneficiary’s name (when it is a legal entity) and the transaction’s comment

Main challenge to tackle

The biggest challenge to address is the lack of a training set. To overcome this, the Snorkel library combines noisy Label Functions through a Label Model. The output is a multiclass training set on which a discriminative model is then trained.

Architecture Overview

Main steps include:

  • Data ingestion: the SQL query output of the bank transactions
  • Data preparation: cleaning the text (beneficiary names and transaction comments written by the debtor)
  • Weak Supervision Label Model: the part where we create a training set from scratch
  • DNN Discriminative model: final model trained on the Label Model’s output (i.e. the training set)
  • Data Publishing: the “predict” module to classify any transaction

Pipeline Overview

500 random transactions are manually annotated with class labels and used in the “test” steps of the pipeline. This lets you understand the “real” performance of the discriminative model and check whether it manages to generalize from the training set (more on this later).

Data preparation

Text normalization

Objective: take the beneficiary’s name and the transaction comments and clean the text (stop words, special characters, stems, etc.)

How: we use NLTK and Python’s standard string module
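As an illustration, here is a minimal normalization sketch assuming NLTK’s stop word lists and Snowball stemmer; the language choices and the normalize_text helper name are my own, and the removal of person names, locations and dates discussed below is not shown.

```python
import re
import string

from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer

# Assumption: French and English dominate the corpus; adjust the stop word
# lists and the stemmer to your own data. Requires nltk.download("stopwords").
STOP_WORDS = set(stopwords.words("french")) | set(stopwords.words("english"))
STEMMER = SnowballStemmer("french")

def normalize_text(text: str) -> str:
    """Lowercase, drop punctuation and digits, remove stop words, stem the rest."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\d+", " ", text)
    tokens = [STEMMER.stem(tok) for tok in text.split() if tok not in STOP_WORDS]
    return " ".join(tokens)

print(normalize_text("Football facture Juin 2024"))
```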

Key challenges:

  • Multiple languages in transaction comments mean transformations like stemming are not always effective
  • There are many spelling mistakes and word abbreviations
  • You want to remove natural persons’ names as they risk polluting the cleaned text that FastText will be trained on (see the Weak Supervision step); however, you need to keep company names as they carry precious information about the type of service the company provides (example: “assurance xxxx” or “fitness xxxx”)
  • Similarly, you also want to remove locations and dates. Therefore, you need a solid NER model applicable to multiple languages (I tested the spaCy library, without good results)

Feature engineering

Objective:

  • For a given debit account, group the transactions paid to the same credit account with the same transaction comment. We calculate various statistics such as min, max, standard deviation, the time delta between consecutive transactions, etc.
  • This aggregated information will be used to find patterns to create Label Functions as well as features for the final discriminative model

How: pandas

Key challenges:

  • As it is based on the previous step’s work, you need good text cleaning to make sure transactions can be grouped. Example: transactions with comments such as “football facture Jun” and “football factur Oct” both need to be cleaned to “football fact” so that they can be grouped together (see the pandas sketch below)
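To make the grouping concrete, here is a minimal pandas sketch; the column names (debit_account, credit_account, clean_comment, amount, date) and the toy data are hypothetical, the real input being the SQL query output.

```python
import pandas as pd

# Toy transactions with hypothetical column names
df = pd.DataFrame({
    "debit_account": ["A", "A", "A", "B"],
    "credit_account": ["C1", "C1", "C1", "C2"],
    "clean_comment": ["football fact", "football fact", "football fact", "loyer"],
    "amount": [35.0, 35.0, 36.0, 800.0],
    "date": pd.to_datetime(["2024-01-05", "2024-02-05", "2024-03-06", "2024-01-01"]),
})

group_cols = ["debit_account", "credit_account", "clean_comment"]
df = df.sort_values("date")

# Days elapsed between consecutive payments within each group
df["days_since_prev"] = df.groupby(group_cols)["date"].diff().dt.days

# Aggregated statistics used both to find LF patterns and as model features
agg = (
    df.groupby(group_cols)
      .agg(
          n_transactions=("amount", "size"),
          amount_min=("amount", "min"),
          amount_max=("amount", "max"),
          amount_std=("amount", "std"),
          mean_days_between=("days_since_prev", "mean"),
      )
      .reset_index()
)
print(agg)
```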

Weak Supervision Label Model

Fasttext unsupervised learning

Objective: create a Fasttext model trained on your cleaned text. For a given word (e.g. “sport”), the model will be able to find the most similar words associated with it (e.g. “fitness”, “fit”, “gym”, “piscine”, etc.)

How: Fasttext library

Key challenges: the embeddings are only as good as your text normalization step
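A minimal sketch with the fasttext Python package; the input file name and the hyperparameters are assumptions.

```python
import fasttext

# Assumption: cleaned_transactions.txt contains one cleaned comment per line
model = fasttext.train_unsupervised(
    "cleaned_transactions.txt",
    model="skipgram",  # skipgram tends to handle rare tokens better than cbow
    dim=100,           # embedding size, matching the 100-float vectors used later
    minn=3, maxn=6,    # character n-grams let the model embed unseen/misspelled words
)

# Most similar words to a seed term, later reused by the anchoring LFs
print(model.get_nearest_neighbors("sport", k=20))
```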

Label Functions

Objective:

  • “Labeling functions (LFs) help users encode domain knowledge and other supervision sources programmatically. LFs are heuristics that take as input a data point and assign a label to it”. The goal is to create multiple LFs using SME knowledge to generate noisy labels (meaning LFs will have varying coverage, some will overlap, others will conflict, etc.)
  • Applying the LFs generates a matrix where each column is an LF and each row is a transaction. A cell shows which class label (if any) a given LF assigns to that transaction

How:

  • Snorkel + Fasttext
  • LFs are of 2 types: heuristic and anchoring. Heuristic means simple business rules to detect a label. Anchoring means leveraging the Fasttext model generated in the previous step by taking the x words most similar to a seed word and searching for those words in the transaction data. For example, the words associated with “fitness” in the Fasttext model will be “gym”, “sport”, “fit”, etc. So by listing a few seed words associated with a label, we can search for dozens or hundreds of highly similar and relevant words (see the sketch below)
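To make this concrete, here is a sketch of one heuristic LF and one anchoring LF with Snorkel. The label constants, keyword choices, and the ft_model / transactions_df variables (the FastText model and the cleaned transactions from the previous steps) are placeholders.

```python
from snorkel.labeling import labeling_function, PandasLFApplier

# Hypothetical label constants; the real taxonomy has two levels of classes
ABSTAIN, OUTFLOW_HOME, OUTFLOW_SPORT = -1, 0, 1

# Heuristic LF: a simple business rule on the cleaned comment
@labeling_function()
def lf_rent_keyword(x):
    return OUTFLOW_HOME if "rent" in x.clean_comment or "loyer" in x.clean_comment else ABSTAIN

# Anchoring LF: expand a seed word with the FastText model trained earlier
SPORT_ANCHORS = {word for _, word in ft_model.get_nearest_neighbors("sport", k=50)} | {"sport"}

@labeling_function()
def lf_sport_anchor(x):
    return OUTFLOW_SPORT if any(w in x.clean_comment.split() for w in SPORT_ANCHORS) else ABSTAIN

# Apply every LF to each transaction to build the label matrix
# (rows = transactions, columns = LFs, cells = assigned label or ABSTAIN)
applier = PandasLFApplier(lfs=[lf_rent_keyword, lf_sport_anchor])
L_train = applier.apply(df=transactions_df)
```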

Key Challenges: creating LFs can be time-consuming

Label Model

Objective:

  • “The label model combines weak supervision sources (the LFs) to create probabilistic weak labels”. The Label Model can then make predictions on the LF output matrix; this prediction result is the training set we will use in the next step
  • To get more info on how it works, see https://medium.com/@annazubovab/understanding-snorkel-5e33351aa43b . Key passage: “Snorkel first treats LFs as independent voters, assuming there is no correlation between them. The next step is to include correlations to avoid a double vote in case the functions encode similar rules. […] Once LFs are created, Snorkel has the ability to decide whether to just apply a majority vote to determine the label, or model the accuracies in the generative model. The decision is made based on the label matrix density (mean number of non-abstention labels per data point) and an optimizer suggested by Snorkel’s creators. As for the density parameter, the generative model performs best with medium density, while majority vote gives the best results with low and high density. However, this criterion was not sufficient, so another optimization rule was introduced that is based on “the ratio of positive to negative labels for each data point”.

How: Snorkel + a hand-labeled test set of 500 random transactions
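A minimal sketch of fitting Snorkel’s Label Model on the LF matrix built in the previous step; the number of classes and the training hyperparameters are placeholders.

```python
from snorkel.labeling.model import LabelModel

NUM_CLASSES = 12  # placeholder: number of label classes in your taxonomy

label_model = LabelModel(cardinality=NUM_CLASSES, verbose=True)
label_model.fit(L_train, n_epochs=500, seed=42)

# Probabilistic labels (one probability per class and per transaction),
# used as the training set for the discriminative model
probs_train = label_model.predict_proba(L_train)

# Hard labels; -1 is returned when the model abstains (e.g. every LF abstained)
preds_train = label_model.predict(L_train)
```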

Key challenges: documentation on tuning the Label Model’s parameters is limited. It is also not clear whether the Snorkel library will keep being updated, as the team behind it has since founded a company

Discriminative model

Objective: using the Label Model’s output as a training set, train the final discriminative model

How:

  • Scikit-learn and feature-engine for the feature engineering pipeline, lightGBM for the model, with balanced accuracy as the main performance metric
  • In the original paper, they use an NN as the discriminative model. I used lightGBM to speed up the implementation
  • We pass probabilistic labels (generated by the Label Model) along with y_train. They indicate how confident we are of a given label, which greatly increases the final model’s performance (see the sketch below)
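A minimal sketch of how the probabilistic labels can be fed to lightGBM as per-row sample weights; X_train, probs_train, X_test and y_test are assumed to come from the previous steps, and this weighting scheme is my interpretation rather than the paper’s exact method.

```python
import lightgbm as lgb
from sklearn.metrics import balanced_accuracy_score

# Hard label = most probable class; confidence = how sure the Label Model is
y_train = probs_train.argmax(axis=1)
confidence = probs_train.max(axis=1)

clf = lgb.LGBMClassifier(objective="multiclass", n_estimators=500, class_weight="balanced")
# Pass the Label Model's confidence as a per-transaction sample weight
clf.fit(X_train, y_train, sample_weight=confidence)

# Evaluate on the 500 hand-labeled transactions
print(balanced_accuracy_score(y_test, clf.predict(X_test)))
```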

Key challenges: for this project, training speed was the main factor, therefore lightGBM was chosen. However, SVMs tend to perform well on this kind of task (but are very slow!). CatBoost was also tempting, as the dataset contains many categorical features, but I haven’t been able to test it yet

The original paper’s discriminative model used a neural network architecture (see the figure in the paper).

Lessons Learned

Why not use the Label Model as the final model? After all, the Label Model’s predictions become our new “truth”. In practice, you could do it, but there are a few issues with this approach:

  • Applying LFs to big datasets is time-consuming
  • Your LFs are unlikely to cover all the cases in the dataset. The remaining transactions will still need to be classified one way or another, and the Label Model won’t be good at it
  • LFs are simple rules and cannot generalize. For example, let’s say you want to classify whether a given transaction is a rent payment. You would create a label function saying that if the comment contains “rent”, the transaction belongs to the “outflow_home” category. Your discriminative model will be able to generalize by finding new rules, such as the transactions recurring with the same amount, the amount tending to be around x EUR, etc. Therefore, your discriminative model should have higher accuracy than the Label Model, because it can find “rent” without the word rent being mentioned in the transaction comment

The unsupervised Fasttext model worked well on the cleaned text and is extremely fast to train

The anchoring technique lets you reach hundreds of relevant words by manually preselecting a dozen, which also proved quite efficient

Because I used a boosting algorithm, I had to pass the cleaned text embedding (a 100-float vector for a given sentence) as one number = one feature. The model seems to have managed to leverage those 100 features; however, I am unsure whether any transformation (log, dropping correlated dimensions) should be applied to them. So far I have kept them as is

Fasttext’s ability to embed unseen words is useful for words with spelling mistakes and for words that don’t exist elsewhere (e.g. a company name, an abbreviation, etc.)

I used TfidfVectorizer from scikit-learn to create one-hot-style features from the most frequent words in the text (see the sketch below). I haven’t yet checked with Shapley values or the feature importance list whether it worked well
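For reference, a sketch of that step, assuming the cleaned comments live in a clean_comment column; binary=True with use_idf=False and norm=None makes the output behave like presence/absence indicators for the most frequent tokens.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Keep only the most frequent tokens and encode them as 0/1 presence features
vectorizer = TfidfVectorizer(max_features=50, binary=True, use_idf=False, norm=None)
word_features = vectorizer.fit_transform(transactions_df["clean_comment"])
print(vectorizer.get_feature_names_out())
```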

I found lightGBM pretty good at interpreting the embedding features

Some transactions are ambiguous, with no clear label even for an SME. I think it shows that there is a trade-off to be made between transaction coverage and performance. Although the Label Model from Snorkel is “basic”, you can probably get decent coverage with high performance just by using simple rules and anchoring techniques.

Pain points

One of the main challenges is the different languages used in the transactions’ comments, as well as the spelling mistakes. Different languages (and their writing mistakes!) make text normalization challenging for stemming or lemmatization. As a result, it is harder to group transactions (the feature engineering step) and Fasttext is less effective. An example of this is the words “clinique” and “clinic”: Fasttext doesn’t consider the two words similar because they are used in different language contexts.

Not much is said about the research paper’s original data, but it seems that their transactions involve companies rather than natural persons. A company name is insightful, but a person’s name is not and must be excluded from the corpus before Fasttext is trained. Therefore, a solid NER model covering multiple languages (and locations) would be a huge plus.

One of the main goals is to get your discriminative model to generalize beyond Snorkel’s label functions. If your heuristic label functions are based on data that will be available to the final model, it will just re-learn the label functions you manually wrote. This is probably why the authors “push” for using aggregated transaction data like frequency, variance, min and max: those features are not available in the final dataset, so your discriminative model will have no choice but to find new patterns and therefore (hopefully) generalize
