High level overview of a machine learning project in 10 steps
I wrote this post to clarify what a typical machine learning project framework could look like. I wanted a high-level roadmap that I could use to guide me through (almost) any new project, step by step. It is based on online content (mostly from Andrew Ng’s courses) and work-related experience. It is also a draft of sorts, as I intend to refine it as I complete more projects.
Step 1: Define the project objective and the business key performance metrics
First things first: we need to understand the business objectives. At the same time, it is good practice to get an expert’s input (ideally, someone from the business side) to understand which features may be important. Those features will be precious, as they will give you ideas on the best way to start the exploratory data analysis (EDA).
Metrics you should get from the business (or have approved by them):
- The business end-goal metric. For instance: the user conversion rate of an email marketing campaign.
- The baseline to beat. For instance: past marketing campaign results or human level performance.
- What your model should predict: Define the class label 1 and 0 for a classification project (or the numerical feature to predict for a regression project).
As the project evolves, make a habit of regularly checking that your model stays as closely aligned as possible with the business end goal.
Step 2: Data acquisition and EDA
Start by reviewing the available databases and their related tables, then start an EDA and write SQL queries. The goal is to discover a set of features to use for modeling while getting used to the data.
To guide you in the EDA, you can use a hypothesis approach:
- Write down your hypotheses about class label 1 (in a classification example), then write the related SQL queries and check the data to confirm your thoughts.
- While exploring the data, new ideas and hypotheses will come to your mind so iterate until you have a good sense of the data. By the end of this process, you should have a set of features that you believe may be useful for prediction.
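The hypothesis-driven loop above can be sketched with pandas. This is a minimal illustration on invented data: the column names (`received_discount`, `converted`) and the hypothesis itself are hypothetical, not from the original post.

```python
import pandas as pd

# Invented example data for one hypothesis:
# "users who received a discount convert more often" (label 1 = converted).
df = pd.DataFrame({
    "received_discount": [1, 1, 0, 0, 1, 0, 1, 0],
    "converted":         [1, 1, 0, 1, 0, 0, 1, 0],
})

# Check the hypothesis: conversion rate per group.
rates = df.groupby("received_discount")["converted"].mean()
print(rates)
```

If the rates differ meaningfully between groups, `received_discount` becomes a candidate feature; either way, the check usually suggests the next hypothesis to test.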
As you start to accumulate data, SQL code and other scripts, be sure to keep your folders and files well organized. I have been using the system described in this Towards Data Science article and it works quite well.
Step 3: Data set creation
Create a small data set by merging your SQL queries and use quick feature engineering fixes to have the data set ready in a short amount of time.
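In practice, "merging your SQL queries" often means loading each query result into a DataFrame and joining on a shared key. A minimal sketch, assuming hypothetical query outputs (`users`, `activity`) and a quick fix for missing values rather than a full pipeline:

```python
import pandas as pd

# Hypothetical results of two SQL queries, loaded as DataFrames.
users = pd.DataFrame({"user_id": [1, 2, 3],
                      "signup_channel": ["email", "ads", "email"]})
activity = pd.DataFrame({"user_id": [1, 2, 3],
                         "n_logins": [10, None, 3]})

# Merge on the shared key, then apply a quick feature-engineering fix:
# fill missing login counts with 0 to get a usable data set fast.
dataset = users.merge(activity, on="user_id", how="left")
dataset["n_logins"] = dataset["n_logins"].fillna(0)
print(dataset)
```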
Step 4: Quick modeling
Choose your model performance metric (F1 score, Brier score, etc.). It is good practice to use a single metric, as it will help you make decisions when comparing model performance. Then, get your hands dirty and start modeling:
- Try several models with a rough grid search, tested on a small data set. Note that the data set is the same for each model.
- Usually, try around 4–5 models, mixing linear, non-linear and ensemble models. Some of these models should be of the white-box type, to help you interpret what is happening.
- Each model has its own grid search. The grid should be small, so do not test too many values. Ideally, it should take one hour or less to run a model over all the grid combinations you selected.
- Using cross-validation, pick the top 2–3 models with their related best parameters. Check if your models tend to suffer from high bias or high variance by comparing the CV train and test performance results.
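The quick-modeling loop above can be sketched with scikit-learn. This is a toy illustration (the generated data and the three candidate models are placeholders), showing how comparing CV train vs. test scores surfaces bias/variance issues:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Toy stand-in for the small data set from step 3.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# A few candidate models mixing linear, non-linear and ensemble types.
models = {
    "logreg": LogisticRegression(max_iter=1000),   # white-box, linear
    "tree": DecisionTreeClassifier(random_state=0),  # white-box, non-linear
    "forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

results = {}
for name, model in models.items():
    cv = cross_validate(model, X, y, cv=5, scoring="f1",
                        return_train_score=True)
    # A large train/test gap suggests high variance (overfitting);
    # low scores on both suggest high bias (underfitting).
    results[name] = (cv["train_score"].mean(), cv["test_score"].mean())

for name, (train_f1, test_f1) in results.items():
    print(f"{name}: train F1={train_f1:.3f}, test F1={test_f1:.3f}")
```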
Step 5: Error analysis
Error analysis applied to machine learning is well explained by Andrew Ng, whether on Coursera or in the Machine Learning Yearning book. The below method largely comes from his work, so be sure to check his content to have a better and deeper understanding of this key step.
Start an error analysis by selecting 100 random cases where your algorithm made wrong predictions (among some of the worst false positives / false negatives). The error analysis is based on the model that gave you the best result. Among those 100 cases, manually search for any systematic trend in where and why your model was wrong.
Based on the error analysis you just did and the bias/variance issue your model may suffer from, decide how you will improve the model (create new features, etc.) and take action. Note that after doing CV and error analysis, you will have to run CV again with the same seed to check whether your actions worked. There is a risk of overfitting at some point.
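Pulling the 100-case sample is straightforward in pandas. A minimal sketch with invented validation data (the `segment` column is a hypothetical attribute you might slice errors by):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical predictions from the best model on a validation set.
val = pd.DataFrame({
    "y_true": rng.integers(0, 2, size=1000),
    "y_pred": rng.integers(0, 2, size=1000),
    "segment": rng.choice(["new_user", "returning"], size=1000),
})

# Keep only the errors (false positives and false negatives),
# then sample up to 100 of them for manual review.
errors = val[val["y_true"] != val["y_pred"]]
sample = errors.sample(n=min(100, len(errors)), random_state=42)

# Look for systematic trends: are errors concentrated in one segment?
print(sample["segment"].value_counts(normalize=True))
```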
Recall the following actions on how to address any bias and variance issue (source: Machine Learning Yearning).
Techniques to reduce avoidable bias:
- Increase the model size.
- Modify input features based on insights from error analysis.
- Reduce or eliminate regularization (L2 regularization, L1 regularization, dropout).
- Modify model architecture (such as neural network architecture).
Techniques to reduce variance:
- Add more training data.
- Add regularization (L2 regularization, L1 regularization, dropout).
- Add early stopping (i.e., stop gradient descent early, based on dev set error).
- Perform feature selection to decrease the number/type of input features.
- Decrease the model size.
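The regularization knob appears on both lists, pulling in opposite directions. As a concrete illustration (toy data, not from the original post): in scikit-learn's `LogisticRegression`, `C` is the *inverse* of L2 regularization strength, so lowering `C` adds regularization (fights variance) and raising it removes regularization (fights bias).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Sweep the regularization strength and watch the train/test gap:
# low C = strong L2 penalty, high C = weak L2 penalty.
scores = {}
for C in (0.01, 1.0, 100.0):
    cv = cross_validate(LogisticRegression(C=C, max_iter=1000),
                        X, y, cv=5, return_train_score=True)
    scores[C] = (cv["train_score"].mean(), cv["test_score"].mean())

for C, (train_acc, test_acc) in scores.items():
    print(f"C={C}: train={train_acc:.3f}, test={test_acc:.3f}")
```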
Step 6: Select a larger data set
Pretty self-explanatory!
Step 7: Feature Engineering
This is probably the most important step, so be sure to spend some time on this!
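To make this step concrete, here is a minimal sketch of the kind of derived features the post has in mind. The raw columns (`signup_date`, `last_login`, `n_purchases`, `n_visits`) and the derived features are invented examples, each encoding a domain hypothesis explicitly:

```python
import pandas as pd

# Hypothetical raw columns pulled from the database.
df = pd.DataFrame({
    "signup_date": pd.to_datetime(["2023-01-05", "2023-03-20", "2023-06-01"]),
    "last_login":  pd.to_datetime(["2023-06-01", "2023-06-10", "2023-06-02"]),
    "n_purchases": [4, 0, 1],
    "n_visits":    [20, 5, 1],
})

# Derived features: account age at last login, and purchase intensity.
df["days_active"] = (df["last_login"] - df["signup_date"]).dt.days
df["purchase_rate"] = df["n_purchases"] / df["n_visits"]
print(df[["days_active", "purchase_rate"]])
```

A ratio or a time delta like these often carries far more signal than the raw columns it is built from, which is why this step rewards the time spent.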
Step 8: Feature Selection
Feature selection can be done through techniques such as feature importance, lasso, or forward stepwise selection. Depending on the algorithm used, you can skip this step, as your modeling function will take care of it.
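The lasso route can be sketched as follows (toy regression data, not from the post). The L1 penalty drives the coefficients of weak features to exactly zero, which is what makes lasso usable as a selector:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Toy data where only 3 of the 10 features are informative.
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)
# Standardize so the L1 penalty treats all features on the same scale.
X = StandardScaler().fit_transform(X)

# Non-zero coefficients after fitting are the "kept" features.
lasso = Lasso(alpha=1.0).fit(X, y)
selected = np.flatnonzero(lasso.coef_)
print("kept features:", selected)
```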
Step 9: Tuning
It is now time to tune your model(s), expanding the ranges you already tested with the grid searches from step 4.
Tuning a model can be time consuming for what is sometimes a small (~5%) accuracy boost. Depending on your work environment and project, spending precious hours on tuning is not always the best course of action. Instead, focusing on feature engineering and testing various models can be a better use of your time.
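One way to widen the search at a fixed compute budget (a sketch, not the post's prescribed method) is randomized search, which samples from broad parameter distributions instead of enumerating a grid. Toy example with an assumed random forest candidate from step 4:

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Wider ranges than the small step-4 grids, but a fixed budget of
# n_iter sampled configurations keeps the runtime bounded.
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={
        "n_estimators": randint(50, 300),
        "max_depth": randint(2, 12),
    },
    n_iter=10, cv=3, scoring="f1", random_state=0,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```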
Step 10: Iterate
It is never completely over, so do a new error analysis to understand where your model went wrong, and iterate as the project progresses!