ML product folder structure and CI/CD concept
Notes from my current project
Folder structure overview
Here is what a typical ML product folder structure could look like:
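Sketched as a directory tree, based on the folders described below:

```
project/
├── cicd/
├── common/
├── config/
├── dags/
├── data-engineering/
│   ├── ddl/
│   ├── etl/
│   ├── script/
│   └── views/
├── feature-engineering/
├── function/
├── label-data/
├── model-prediction/
├── model-training/
├── models/
└── unit_test/
    └── data/
```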
cicd: folder with the Jenkins files that handle auto deployment. In our case, instead of Jenkins, a bash file executes the various CI/CD stages
common: folder storing any constant data that we need to keep (dimension tables, etc.). This is not specific to any particular model but rather holds data used across all models: dump tables, specific DDL (data definition language) files, static data/tables, or CSV files
config: folder storing the server configuration files and any other constants that need to be passed to the scripts. Configuration files are often in .yml format, written in YAML, a data serialization language (see the sketch below)
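For illustration, a script could read such a config file like this (a minimal sketch, assuming PyYAML is installed; the file name and keys are hypothetical):

```python
# Minimal sketch of reading a server configuration file from config/;
# the file name and the keys are hypothetical.
import yaml

with open("config/server.yml") as f:
    config = yaml.safe_load(f)

server = config["server"]      # e.g. the database server name
database = config["database"]  # e.g. the target database
```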
dags: folder storing all the Airflow files that schedule the data flows. The files are Python scripts: one for data engineering, one for feature engineering, one for prediction, etc. (a minimal example follows)
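A minimal sketch of what one of those DAG files could look like, assuming Airflow 2.x; the dag_id, task names, and script paths are hypothetical:

```python
# Minimal DAG sketch, assuming Airflow 2.x; names and paths are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="model_xx_pipeline",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # One task per pipeline stage, each running the matching folder's script.
    data_engineering = BashOperator(
        task_id="data_engineering",
        bash_command="python /mnt/azure/data-engineering/script/run_queries.py",
    )
    feature_engineering = BashOperator(
        task_id="feature_engineering",
        bash_command="python /mnt/azure/feature-engineering/model_XX.py",
    )
    prediction = BashOperator(
        task_id="prediction",
        bash_command="python /mnt/azure/model-prediction/predict.py",
    )

    data_engineering >> feature_engineering >> prediction
```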
data-engineering: data engineering files, mostly SQL related. Four subfolders:
- ddl: SQL CREATE queries
- etl: SQL INSERTs into the tables created above
- script: Python scripts listing the DML functions used to run the SQL queries from within Python. These functions specify things like the server name, ODBC driver, and database, but also execute SQL and return a pandas DataFrame (see the sketch after this list)
- views: SQL CREATE queries for views, which are used as temporary tables in the data engineering process flow
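A minimal sketch of such a DML helper, assuming pyodbc and pandas are available; the ODBC driver name and connection details are placeholders:

```python
# Minimal sketch of a DML helper; driver name and connection details
# are placeholders, not the actual project configuration.
import pandas as pd
import pyodbc

def get_connection(server: str, database: str) -> pyodbc.Connection:
    """Open an ODBC connection to the given SQL Server database."""
    conn_str = (
        "DRIVER={ODBC Driver 17 for SQL Server};"
        f"SERVER={server};DATABASE={database};Trusted_Connection=yes;"
    )
    return pyodbc.connect(conn_str)

def run_query(sql: str, server: str, database: str) -> pd.DataFrame:
    """Execute a SQL query and return the result as a pandas DataFrame."""
    conn = get_connection(server, database)
    try:
        return pd.read_sql(sql, conn)
    finally:
        conn.close()
```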
feature-engineering: Python scripts for feature engineering (a module plus the scripts that use it). The folder contains the following files (a short sketch follows this list):
- utils.py: functions that are generalized across all models
- common.py: stores the dictionaries used for encoding, etc.
- model_XX.py: the script executing the feature engineering process
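As an illustration, common.py and utils.py could look roughly like this (a hypothetical sketch; the dictionary contents and the helper are made up):

```python
import pandas as pd

# common.py: hypothetical example of a shared encoding dictionary.
REGION_ENCODING = {"north": 0, "south": 1, "east": 2, "west": 3}

# utils.py: hypothetical helper generalized across all models.
def encode_column(df: pd.DataFrame, column: str, mapping: dict) -> pd.DataFrame:
    """Map a categorical column to numeric codes using a shared dictionary."""
    out = df.copy()
    out[column] = out[column].map(mapping)
    return out
```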
function: folder related to FaaS creation/execution. The folder is divided into subfolders: one for BCP, one for data engineering, one for feature engineering, and one for prediction. Each subfolder has a main.sh bash file executing the related Python script (for example, for data engineering it executes the data-engineering/script Python file, while for feature engineering it executes the model_XX.py script). FaaS here plays the role of the Airflow Python operator
label-data: files related to the target classes (labels)
model-prediction: folder for model score prediction
model-training: all the Python files used to train a model
models: place where the pickled models (the trained algorithms) are stored; see the example below
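For example, a training script could persist its output there along these lines (a sketch; the placeholder below stands in for a fitted estimator):

```python
# Hypothetical sketch of persisting and reloading a trained model in models/.
import pickle

trained_model = {"weights": [0.1, 0.2]}  # placeholder for a fitted estimator

with open("models/model_xx.pkl", "wb") as f:
    pickle.dump(trained_model, f)

with open("models/model_xx.pkl", "rb") as f:
    model = pickle.load(f)
```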
unit_test: limited to Python scripts only, testing all the Python functions. A ‘data’ subfolder stores all the dummy data, both the inputs and the expected outputs needed for testing (an example test is sketched below)
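A hypothetical example of such a test, assuming pytest and the encode_column helper sketched earlier; the import paths and CSV file names are placeholders:

```python
# unit_test/test_utils.py: hypothetical test comparing a function's output
# against dummy input/expected files stored in the data/ subfolder.
import pandas as pd

from utils import encode_column      # hypothetical import path
from common import REGION_ENCODING   # hypothetical import path

def test_encode_column():
    input_df = pd.read_csv("unit_test/data/encode_column_input.csv")
    expected = pd.read_csv("unit_test/data/encode_column_expected.csv")
    result = encode_column(input_df, "region", REGION_ENCODING)
    pd.testing.assert_frame_equal(result, expected)
```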
CI/CD overview
Short definitions related to CI/CD
- FaaS: “Function-as-a-Service, or FaaS, is a kind of cloud computing service that allows developers to build, compute, run, and manage application packages as functions without having to maintain their own infrastructure. […] FaaS is an event-driven execution model that runs in stateless containers and those functions manage server-side logic and state through the use of services from a FaaS provider.” (source)
- Container: “Containers are packages of software that contain all of the necessary elements to run in any environment. In this way, containers virtualize the operating system and run anywhere, from a private data center to the public cloud or even on a developer’s personal laptop” (source)
- Jenkins: “Jenkins is used to build and test your product continuously, so developers can continuously integrate changes into the build. Jenkins is the most popular open source CI/CD tool on the market today” (source)
- Git webhook: it is used to “fire off custom scripts when certain important actions occur. There are two groups of these hooks: client-side and server-side. Client-side hooks are triggered by operations such as committing and merging, while server-side hooks run on network operations such as receiving pushed commits” (source)
- BCP (Bulk copy program): “bcp is the console application used to export and import data from text files to SQL Server or vice versa” (source)
- cURL: “cURL is a command-line tool that lets you transfer data to/from a server using various protocols” (source). cURL commands are used to trigger FaaS events, as an alternative to Airflow
CI/CD process
The CI/CD process automates the delivery pipeline: initiating code builds, running tests, and deploying to a dev, staging, or production environment.
This is normally where Jenkins comes in, handling the automated deployment to a cloud platform. In our case, an alternative based on FaaS is used (on a dedicated platform).
Concept
When a commit is pushed or a pull request is opened on a branch in git, a git webhook triggers the following process on the platform:
1. HTTP FaaS
The HTTP FaaS triggers the following FaaS event.
There are two types of FaaS/serverless containers that we will use. The first is the HTTP endpoint as a container: it is built with an HTTP trigger so that end users can access your website on the data science platform. The second is the batch job as a container, which allows you to build an executable job container with an event trigger. You can then schedule those batch jobs with Airflow.
2. Event FaaS is triggered
After receiving an event from the HTTP FaaS, a bash script executes the various Python scripts that run the CI/CD stages.
3. CI/CD stages
1. Assign the environment based on the input branch name (see the sketch after this list)
2. Clone the input branch to a temp directory in the Azure mount
3. Run a code quality check on the Python scripts
4. Check the unit test coverage
5. Copy the input branch from the temp directory to the working directory in the Azure mount. Depending on where the pull request comes from (master, QA/test, dev), a different workspace and /mnt/azure/ folder is chosen
6. Create the FaaS if it does not exist
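As an illustration of stage 1, the branch-to-environment mapping could be as simple as this (a hypothetical sketch; the branch names follow the branching strategy described below):

```python
# Hypothetical sketch of stage 1: mapping the input branch name to a target
# environment; branch names match the branching strategy described below.
def assign_environment(branch: str) -> str:
    mapping = {
        "master": "production",
        "QA/test": "staging",
        "development": "dev",
    }
    if branch not in mapping:
        raise ValueError(f"Unknown branch: {branch}")
    return mapping[branch]
```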
Git branching strategy in a CI/CD setup
This system makes it possible to achieve continuous integration with parallel development.
The branches
- master: code used in the production environment
- QA/test: branch used for Quality Assurance and Integration testing before releasing to production. The code is used in the staging environment
- development: branch used for all development work
Steps to create and integrate a new feature
- First, fetch the master branch
- Develop your new feature on the development branch and do unit testing
- Testing phase: after pull request approval, merge your feature into the QA/test branch for QA and integration testing
- Release phase: after QA, the development branch is merged into master through a pull request. So you do not merge QA/test into master, but the development branch into master
To avoid errors, direct commits to the master branch are not allowed: it only accepts pull requests, where a reviewer validates the merge.
When two developers work on the same module and your colleague's feature is released before yours, you will have to rebase onto the production branch.