Lessons learned from scaling an ML product (Sept-Oct)
Sept-Oct notes based on my current project
What should be improved
Limit Jupyter notebook usage
Jupyter notebooks are very powerful for exploration and hunting down errors, and more generally for the whole initial model-creation phase. However, once routine starts to kick in (generating the monthly score list, etc.), the tool starts to show its limitations, and it is not good for scaling your solution. Save yourself some time by using basic Python scripts instead of notebooks, so that the transition to model deployment gets easier. Jupyter notebooks are also not very practical when several people work on the same project (the format makes them a bit annoying to review in a repository, etc.).
Moving forward
Use Jupyter notebooks only for the limited cases where their versatility shines: data exploration or one-shot investigations are good examples. Scripts that will be used regularly (feature engineering, prediction, modeling) should not be notebook based, as sketched below.
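To make this concrete, here is a minimal sketch of what a recurring notebook task can become; the file name, the `run_scoring` helper and the CLI flags are all made up for illustration:

```python
# score_customers.py - hypothetical replacement for a monthly scoring notebook
import argparse
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


def run_scoring(input_path: str, output_path: str) -> None:
    """Placeholder for the scoring logic that used to live in notebook cells."""
    logger.info("Scoring %s -> %s", input_path, output_path)
    # ... load the data, apply the model, write the monthly score list ...


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Generate the monthly score list")
    parser.add_argument("--input", required=True)
    parser.add_argument("--output", required=True)
    args = parser.parse_args()
    run_scoring(args.input, args.output)
```

A script like this can be scheduled, version-controlled and diffed cleanly, which is exactly where notebooks fall short.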
Code structure
Overall, the code structure was too far from what was needed in order to scale it. As a result, when the ML product started to be built, the code had to be rewritten from scratch using the same ML process logic. Likewise, more generalized functions would have helped when building other models.
Moving forward
Code structure should basically look like the following (more on this in another blog post):
- 1 module where you store all the functions, generalized as much as possible
- 1 script to apply the process you are interested in (for example feature engineering). The script is kept as short as possible (relying on the module’s functions) and provides extensive feedback through the logger
- 1 unit-test script to test the functions
This structure is replicated for each step of the ML project (one for feature engineering, one for modeling, one for prediction, and so on), and you can then easily use bash to execute all the needed scripts. A sketch of the three files follows.
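Here is a minimal sketch of the three pieces for a feature-engineering step, squeezed into one listing; all file names, column names and the `add_ratio_feature` helper are hypothetical:

```python
# --- features.py: the module, functions generalized as much as possible ---
import pandas as pd

def add_ratio_feature(df: pd.DataFrame, numerator: str, denominator: str,
                      name: str) -> pd.DataFrame:
    """Generic ratio feature, reusable across models."""
    out = df.copy()
    out[name] = out[numerator] / out[denominator]
    return out

# --- run_feature_engineering.py: the short script, verbose via the logger ---
import logging
# from features import add_ratio_feature

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("feature_engineering")

def main() -> None:
    df = pd.read_csv("transactions.csv")  # hypothetical input file
    logger.info("Loaded %d rows", len(df))
    df = add_ratio_feature(df, "amount", "n_items", "avg_item_amount")
    df.to_csv("features.csv", index=False)
    logger.info("Wrote features.csv")

# --- test_features.py: the unit tests for the module ---
import unittest
# from features import add_ratio_feature

class TestFeatures(unittest.TestCase):
    def test_add_ratio_feature(self) -> None:
        df = pd.DataFrame({"amount": [10.0], "n_items": [2.0]})
        result = add_ratio_feature(df, "amount", "n_items", "avg")
        self.assertEqual(result["avg"].iloc[0], 5.0)
```

With this split, chaining the steps is just a matter of calling the run scripts one after another from bash.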
Folder structure
The current folder structure I use for all my projects comes from the following blog post. It works really well, keeps things neatly organized and is easy to share with other people; however, it doesn’t really fit a code structure like the one above.
Moving forward
The next step, I think, is to adjust the folder structure to something closer to what would be needed if the model was going to be deployed. Instead of having folders ‘0_data’, ‘1_code’, ‘2_pipeline’ and ‘3_output’, we would move to ‘data_engineering’, ‘feature_engineering’, ‘model_prediction’, ‘model_training’, etc. We will probably lose a bit in terms of folder structure clarity, but it would make the transition to scale easier. I will keep the folder structure from the blog post for all other projects which are not expected to be developed into a full ML product.
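As a rough sketch of the shift (the exact folder names on the right are still up for debate):

```
Before (analysis-oriented)      After (deployment-oriented)
project/                        project/
├── 0_data/                     ├── data_engineering/
├── 1_code/                     ├── feature_engineering/
├── 2_pipeline/                 ├── model_training/
└── 3_output/                   ├── model_prediction/
                                └── ...
```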
The business and data understanding
A correct understanding of the databases, tables and data sounds like the most obvious thing, but in practice, in my recent projects, it turned out to be the hardest part to achieve. The reasons are multiple, but some of the main factors that come to mind are teams working in silos, people being busy or hard to reach, and miscommunication. The main issue is that in most cases you can’t find the answer by yourself; you need to rely on one of the data engineers working on the database. An example could be discovering after the fact that some of the transactions in your data should have been excluded by filtering on a column X, because they are dummy records or simply not adequate for your situation. In theory you should learn about this beforehand, but in practice you will only learn it once you start asking questions because some of the data doesn’t make sense.
Moving forward
You will need an ally, if possible on the data engineering team’s side, or at least a colleague who used the data for past projects and is willing to help you. If you can, gather all the SQL scripts your team members use to query those databases. In addition, take your own notes on the tricky cases you find, each table’s specificities, and so on.
Having a good understanding of the business is also critical to detect incoherences and form hypotheses. For example, having a personal broker account when you work on broker-related data really helps: it gives you intuitions about which features you will need, lets you sanity-check the feature importance list, sets expectations about what a feature’s distribution could look like, etc.
Creating trackers
Generally, there is a need for a better organization system around all the things you discover while working on the data.
Moving forward
Examples of areas to track (a minimal tracker sketch follows this list):
- the model’s performance, together with the marketing result performance used to assess the model
- as written in the paragraph above, information related to the databases, tables and data: past SQL queries, what each table is used for, when the tables are updated, what the traps are, etc.
- the conclusions of any past investigations (on seasonality, trends, personas, error analysis, etc.)
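A tracker can be as simple as a script that appends one row per run to a CSV; here is a minimal sketch (the file name, fields and example values are all arbitrary):

```python
# track_performance.py - append one row per monthly scoring run (fields are hypothetical)
import csv
from datetime import date
from pathlib import Path

TRACKER = Path("model_performance_tracker.csv")
FIELDS = ["run_date", "model_version", "auc", "campaign_conversion_rate"]

def log_run(model_version: str, auc: float, campaign_conversion_rate: float) -> None:
    """Append one run's metrics, writing the header on first use."""
    new_file = not TRACKER.exists()
    with TRACKER.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if new_file:
            writer.writeheader()
        writer.writerow({
            "run_date": date.today().isoformat(),
            "model_version": model_version,
            "auc": auc,
            "campaign_conversion_rate": campaign_conversion_rate,
        })

if __name__ == "__main__":
    log_run("v1.3", 0.81, 0.042)  # illustrative values, not real results
```

Even something this basic beats digging through old notebooks to reconstruct how a past model performed.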
Double-check the machine learning process fundamentals before starting
No need to rush the coding. Be sure to confirm the whole ML process flow before diving into feature engineering details.
Moving forward
Write the plan in advance, with clear indications of the main milestones and how each will be done. In particular: what data will be used and over which dates, what the class 0/1 definition is and why, how the cross-validation has to be set up alongside your feature engineering (check the Machine Learning Mastery post on this), etc.
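On the cross-validation point, the classic trap is fitting feature transformations on the full dataset before splitting, which leaks information across folds. A minimal scikit-learn sketch of keeping the transformation inside the CV loop, on toy data:

```python
# Keep feature engineering inside the CV loop so folds never leak statistics.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, random_state=0)  # toy stand-in data

# The scaler is re-fitted on each training fold, never on the whole dataset.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipeline, X, y, cv=5, scoring="roc_auc")
print(scores.mean())
```

The same pattern applies to any fitted transformation (imputation, encoding, feature selection): wrap it in the pipeline rather than applying it up front.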
Other quick notes
- error analysis (as described in ‘Machine Learning Yearning’ by Andrew Ng) is a must-have and should be done regularly
- settle on one metric to assess your model’s performance, and get it right
- align your model as closely as possible with the way the business will assess its performance
- there is always something to be done, whether it is analysis, investigation, model improvement, testing new features, etc. (after all, machine learning is an iterative process), so it is never over (until your manager tells you to stop!)
- Mark Manson said that to master something you have to 1) get the fundamentals right (what is the 20% that drives 80% of the result?), 2) create a constant feedback loop and 3) put in the time. That sounds like good advice to keep in mind when starting a new ML project too