June notes on building a data science product

Marc Deveaux
5 min read · Jun 12, 2021


Personal notes: a rough guideline for moving from a simple model built mostly with Jupyter scripts (except for the SQL extraction file) to a fully automated product.

Simple modeling project

Projects I usually create from scratch tend to look like this:

  1. Data extraction: SQL script(s) to extract all the required data, as one or multiple queries.
  2. Merge: if needed, a Python script to merge the outputs of the various SQL queries into one dataset and apply simple cleanup.
  3. Feature Engineering Creation: a Python script which takes a random sample and goes through the feature engineering process. All values needed for feature engineering (median, max/min values based on percentiles, scaler, etc.) are saved to be reused by the Feature Engineering Apply script.
  4. Feature Engineering Apply: a Python script which applies all the values saved by Feature Engineering Creation to any new dataset.
  5. Modeling: train the model. First, get baseline scores for a linear, a non-linear, and an ensemble model using cross-validation. Then, using grid search, train several more complex models and train the best resulting one on the full dataset.
  6. Prediction: create the final score list.
  7. Interpretation: error analysis, SHAP analysis, etc.
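The split between steps 3 and 4 (fit the feature-engineering parameters on a sample, then apply them unchanged to any new data) can be sketched as below. The column name, file path, and percentile are illustrative assumptions, not the article's actual values.

```python
import json
import statistics

def fit_feature_params(rows, column="amount", pct=0.99):
    """Step 3: compute and save the values needed to transform new data."""
    values = sorted(r[column] for r in rows)
    cap_index = min(int(len(values) * pct), len(values) - 1)
    params = {
        "median": statistics.median(values),  # used later to fill missing values
        "cap": values[cap_index],             # percentile-based max for clipping
    }
    with open("fe_params.json", "w") as f:    # persisted for the Apply script
        json.dump(params, f)
    return params

def apply_feature_params(rows, column="amount"):
    """Step 4: reuse the saved values on any new dataset."""
    with open("fe_params.json") as f:
        params = json.load(f)
    out = []
    for r in rows:
        v = r.get(column)
        if v is None:
            v = params["median"]              # impute with the training median
        out.append({**r, column: min(v, params["cap"])})  # clip at saved cap
    return out

train = [{"amount": a} for a in [10, 20, 30, 40, 1000]]
fit_feature_params(train)
new = apply_feature_params([{"amount": None}, {"amount": 5000}])
```

Keeping fit and apply in separate scripts is what later lets the Apply step run inside an automated pipeline without retraining anything.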

This process works but obviously presents issues if you need to move to a more automated and robust process:

  • Output is always a CSV file, while using tables could be more practical, especially for dealing with historical data
  • Feature engineering, label data processing, input data processing, training, and prediction are all tightly coupled
  • The implementation is not scalable
  • No system engineering, such as CI/CD, execution pipelines, etc.

ML Product structure

The sections below cover the main components of an ML product. Tip: it is good practice to target a minimum viable product at first!

First segment: Feature data creation and Feature Engineering

  • Feature data creation deals with the SQL extraction
  • Feature engineering is split into two parts: a first part done in SQL (rename features, replace NA with 0, etc.) and a second part done in Python during the data preparation step
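As a minimal sketch of this split (table, column names, and the scaling step are assumptions for illustration), the SQL side handles renames and NA defaults, and Python handles the rest during data preparation:

```python
import sqlite3

# Stand-in for the warehouse; the real setup would run against Synapse
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_data (cust_id INTEGER, amt REAL)")
conn.executemany("INSERT INTO raw_data VALUES (?, ?)", [(1, 50.0), (2, None)])

# First part, in SQL: rename features and put 0 for NA
sql_side = """
SELECT
    cust_id          AS customer_id,
    COALESCE(amt, 0) AS purchase_amount
FROM raw_data
"""
rows = conn.execute(sql_side).fetchall()

# Second part, in Python during data preparation: e.g. min-max scaling
amounts = [r[1] for r in rows]
lo, hi = min(amounts), max(amounts)
scaled = [(r[0], (r[1] - lo) / (hi - lo)) for r in rows]
```

The division of labor keeps cheap, set-based transforms close to the data and reserves Python for transforms that need saved state.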

Second segment: Data preparation

Data preparation is split into two sections: label data creation and input feature data creation.

Label data creation: a table containing the class label + ID. It also includes historical data.

Input feature data creation:

  • A flat table, “feature_summary”, contains all the feature values, transformed beforehand by a Python script doing feature engineering. The table also includes historical data
  • Two CSV files are kept separately and derive from the “feature_summary” table. One, “feature_store”, lists all available feature names (used by the model or not); it can also keep information such as feature key, status, etc. The other, “feature_list”, lists the feature names actually used by the model
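One way to use the two CSVs to select model columns out of the flat table could look like this. The file names follow the article; the metadata columns and sample values are assumptions.

```python
import csv
import io

# feature_store: every available feature, with metadata (status column is assumed)
feature_store_csv = "feature_name,status\nage,active\nincome,active\nlegacy_score,deprecated\n"
# feature_list: only the features the model actually uses
feature_list_csv = "feature_name\nage\nincome\n"

store = list(csv.DictReader(io.StringIO(feature_store_csv)))
model_features = [r["feature_name"] for r in csv.DictReader(io.StringIO(feature_list_csv))]

# Every model feature must exist in the feature store
available = {r["feature_name"] for r in store}
assert set(model_features) <= available

# Keep only the model's columns from one feature_summary row
summary_row = {"id": 7, "age": 41, "income": 52000, "legacy_score": 0.3}
model_input = {k: summary_row[k] for k in model_features}
```

Separating "everything available" from "what the model uses" lets you retire or add features without touching the training code.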

Note on flat table: “Highly normalized database design often uses a star or snowflake schema model, comprising multiple large fact tables and many smaller dimension tables. Queries typically involve joins between a large fact table and multiple dimension tables. Depending on the number of tables and quantity of data that are joined, these queries can incur significant overhead. To avoid this problem, some users create wide tables that combine all fact and dimension table columns that their queries require. These tables can dramatically speed up query execution. However, maintaining redundant sets of normalized and denormalized data has its own administrative costs. (source)”

Third segment: Model training framework

  • Configuration management: keeps historical references to old model versions, as well as performance metrics, failure logs, the ability to revert changes, etc. A tool like MLflow can be used for this purpose
  • Framework functions: multiple Python functions to deal with interpretation, model training, sampling, etc.
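A tool like MLflow covers configuration management out of the box; as a minimal stdlib sketch of the same idea (the fields and hashing scheme are assumptions, not MLflow's format), each training run gets a hashed config plus its metrics appended to a history:

```python
import hashlib
import json

HISTORY = []  # in a real setup this would be a file or an MLflow tracking server

def log_run(config, metrics):
    """Record one training run so old versions can be compared or reverted to."""
    config_id = hashlib.sha256(
        json.dumps(config, sort_keys=True).encode()  # stable ID for identical configs
    ).hexdigest()[:12]
    HISTORY.append({"config_id": config_id, "config": config, "metrics": metrics})
    return config_id

run_a = log_run({"model": "xgboost", "max_depth": 4}, {"auc": 0.81})
run_b = log_run({"model": "xgboost", "max_depth": 6}, {"auc": 0.84})

# Find the best historical run, e.g. to revert a bad deployment
best = max(HISTORY, key=lambda r: r["metrics"]["auc"])
```

The point is that "reverting changes" only works if every run's config and scores were recorded at training time, which is exactly what a tracking tool automates.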

Fourth segment: Prediction Framework

  • Prediction framework functions: any function related to the prediction script, whether to generate the score list or to evaluate/check the prediction results (e.g. the prediction distribution)
  • Prediction_result: a table holding the prediction results. Includes historical data
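A simple sanity check on the prediction distribution might compare the new score list against the previous run before writing to the prediction_result table. The tolerance and score range here are illustrative assumptions.

```python
import statistics

def check_prediction_distribution(scores, prev_mean, tolerance=0.10):
    """Flag a score list whose distribution drifted from the previous run."""
    issues = []
    if not all(0.0 <= s <= 1.0 for s in scores):
        issues.append("scores outside [0, 1]")
    mean = statistics.mean(scores)
    if abs(mean - prev_mean) > tolerance:
        issues.append(f"mean drifted: {mean:.3f} vs {prev_mean:.3f}")
    return issues  # empty list means the score list looks healthy

ok = check_prediction_distribution([0.2, 0.4, 0.6], prev_mean=0.41)
drifted = check_prediction_distribution([0.9, 0.95, 0.99], prev_mean=0.41)
```

In a pipeline, a non-empty issue list would stop the run before bad scores reach downstream consumers.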

Fifth segment: Execution pipeline

Pipelines are built using Airflow, the priority being the steps “Feature data creation and Feature Engineering”, “Data preparation” and “Prediction framework”. Training can at first be done on an ad hoc basis.

Roadmap overview

1: System Engineering

  • Setting up Git branching
  • CI/CD using Jenkins (CI/CD stands for continuous integration and continuous deployment; more information below)
  • Pipeline setup in Airflow
  • Security and environment setup for dev and prod (workspace, namespace, storage, etc.)

2: Server environment setup

  • Security and Schema/DB setup on the server — DEV and PRD
  • Editor setup and hands on experience with sample query executions

3: Feature Engineering and other SQL processing

  • DB, feature_set and feature design, implementation and pipeline execution
  • Label data and input data processing

4: Training Framework implementation

5: Prediction Framework implementation

Tools used

Airflow

  • Usage: free and open source
  • What is it: a pipeline orchestration tool / workflow management platform built around DAGs (Directed Acyclic Graphs)
  • Objective: author, schedule, and monitor data pipelines
  • Terminology: a DAG is a graph representation of all the tasks you want to run, and in which order
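In code, a DAG is just tasks plus "runs after" constraints with no cycles. A library-free sketch (the task names mirror the pipeline segments above; the helper function is an assumption, not Airflow's API) shows how an execution order falls out of the dependencies:

```python
# Each task maps to the tasks it depends on; no cycles allowed
dag = {
    "feature_data_creation": [],
    "data_preparation": ["feature_data_creation"],
    "prediction": ["data_preparation"],
}

def topological_order(dag):
    """Return an execution order where every task runs after its dependencies."""
    order, done = [], set()
    def visit(task):
        for dep in dag[task]:
            if dep not in done:
                visit(dep)
        if task not in done:
            done.add(task)
            order.append(task)
    for task in dag:
        visit(task)
    return order

run_order = topological_order(dag)
```

Airflow does this ordering (plus scheduling, retries, and monitoring) for you; the DAG file you write only declares the tasks and their dependencies.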

Jenkins

  • Usage: free and open source
  • What is it: a Continuous Integration / Continuous Deployment (CI/CD) tool. Jenkins is used to build and test your software projects continuously, making it easier for developers to integrate changes to the project and for users to obtain a fresh build (source)
  • Objective: to deploy your code in a multi development environment
  • Terminology:

Continuous integration (CI): “practice that involves developers making small changes and checks to their code. Due to the scale of requirements and the number of steps involved, this process is automated to ensure that teams can build, test, and package their applications in a reliable and repeatable way. CI helps streamline code changes, thereby increasing time for developers to make changes and contribute to improved software.

Continuous delivery (CD) is the automated delivery of completed code to environments like testing and development. CD provides an automated and consistent way for code to be delivered to these environments.

Continuous deployment is the next step of continuous delivery. Every change that passes the automated tests is automatically placed in production, resulting in many production deployments.

In short, CI is a set of practices performed as developers are writing code, and CD is a set of practices performed after the code is completed”. (source)

Azure Synapse tool

  • Usage: paid license. A free alternative would be MySQL (a relational database management system)
  • What is it: a SQL execution engine for cloud data warehousing and big data analytics, based on Azure SQL Data Warehouse. It plays the same role as Hadoop/Hive, where Hadoop is a framework to process/query big data and Hive is a SQL-based tool built over Hadoop to process the data.
  • Objective: automate SQL queries

Bitbucket

  • Usage: free and open source
  • What is it: Git code management / version control system, similar to GitHub
  • Objective: store and share your code

MlFlow

  • Usage: free and open source
  • What is it: MLflow is an open source platform for managing the end-to-end machine learning lifecycle. For example, you can see who started the model training, what was the accuracy, etc. Some key features:

Track experiments to record and compare parameters and results

Package ML code in a reusable, reproducible form in order to share with other data scientists or transfer to production

Manage and deploy models from a variety of ML libraries to a variety of model serving and inference platforms

  • Objective: track model performance results
