ML product: Data engineering notes

Marc Deveaux
Nov 5, 2021

Data engineering notes on my current project

Concept

We want to build a machine learning product whose end goal is to predict whether a user will be interested in a product A. To do this, we will use customer information and marketing campaign results held in a business database. Today we will have a look at how the data engineering part can be built.

Data engineering folder organization

On the repo, the data engineering folder is split into 4 subfolders:

  • ddl: (data definition language) CREATE SQL queries
  • etl: SQL INSERT statements into the tables created above
  • script: Python scripts listing DML functions used to run the SQL queries from within Python. These functions specify things like the server name, ODBC driver and database, but also execute SQL and return a pandas DataFrame
  • views: CREATE SQL for views. They are used as temp tables in the data engineering process flow

Structure overview

Extract business data into TBL_A and TBL_B

Goal: data is extracted from various business source tables at a specific base_date and inserted into tables we created with the ddl files (TBL_A and TBL_B). The insert script can be found in the etl folder

  • TBL_A and TBL_B are both historical tables
  • In case of multiple extracts at the same base_date, the data in TBL_A or TBL_B is overwritten. We do this by putting the following code at the top of the insert script

DELETE FROM ${sqlconf:db_name}.TBL_A
WHERE BASE_DATE = ${sqlconf:base_date};

  • We could potentially put all the extracted data into a single table (just TBL_A); however, using 2 tables allows us to avoid having too many columns in one place; it is also good practice to separate data by theme / type of information
  • SQL is a good place to do some feature engineering (FE), like categorical or binary encoding with CASE WHEN. However, it can make it harder to detect that something is wrong if the original values change (for example, a new category added within a column without the team being informed). To mitigate this, we keep the 2 features: the original value and the encoded feature (see the combined sketch below)
  • The following columns are added to TBL_A and TBL_B for audit purposes

${sqlconf:base_date} AS BASE_DATE
,CONVERT(smalldatetime, CURRENT_TIMESTAMP) AS CREATION_DATE
,SYSTEM_USER AS CREATED_BY
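
To make the structure of the insert script more concrete, here is a minimal sketch combining the points above. The source table SOURCE_CUSTOMER, the PAYMENT_TYPE column and its encoding are hypothetical placeholders, not the actual project columns.

DELETE FROM ${sqlconf:db_name}.TBL_A
WHERE BASE_DATE = ${sqlconf:base_date};

INSERT INTO ${sqlconf:db_name}.TBL_A
SELECT
    SRC.USER_ID
    ,SRC.PAYMENT_TYPE                     -- original value, kept for auditing
    ,CASE SRC.PAYMENT_TYPE                -- encoded feature kept alongside it
        WHEN 'CREDIT_CARD' THEN 1
        WHEN 'BANK_TRANSFER' THEN 2
        ELSE 0                            -- new / unknown categories land here
     END AS PAYMENT_TYPE_CD
    ,${sqlconf:base_date} AS BASE_DATE
    ,CONVERT(smalldatetime, CURRENT_TIMESTAMP) AS CREATION_DATE
    ,SYSTEM_USER AS CREATED_BY
FROM ${sqlconf:db_name}.SOURCE_CUSTOMER AS SRC;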

Extract data from TBL_A and TBL_B into a VW_FEATURE view

Goal: create a view with all the columns we need in order to train the model or do a prediction. There is no date filter for this; the view is basically all the data available from TBL_A and TBL_B.

  • The view VW_FEATURE lists all the columns we need before applying feature engineering in Python
  • The view is recreated each time we need it. The SQL script to create it is in the “views” subfolder
  • If needed, specific criteria can be defined in order to filter users
  • This view has historical data
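
A minimal sketch of what the view definition could look like, assuming USER_ID and BASE_DATE are the join keys (the other column names are placeholders):

CREATE VIEW ${sqlconf:db_name}.VW_FEATURE AS
SELECT
    A.USER_ID
    ,A.BASE_DATE
    ,A.AGE
    ,A.PAYMENT_TYPE_CD
    ,B.CAMPAIGN_CONTACT_CNT
FROM ${sqlconf:db_name}.TBL_A AS A
INNER JOIN ${sqlconf:db_name}.TBL_B AS B
    ON A.USER_ID = B.USER_ID
    AND A.BASE_DATE = B.BASE_DATE;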

VW_LABEL_FEATURE: store the data used for training the model into a view

Goal: get the data for training the model

  • This view is the result of a join of 2 tables: a table holding the marketing results (which users converted and which didn’t) and the VW_FEATURE view from above, which holds all the features we need to train the model
  • This view is then used as an input to compute all the feature engineering values (like min-max scaling, saving median values, etc.)
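
A minimal sketch of what this join could look like; the marketing results table TBL_CAMPAIGN_RESULT and its CONVERTED_FLG label column are assumptions used for illustration:

CREATE VIEW ${sqlconf:db_name}.VW_LABEL_FEATURE AS
SELECT
    F.*
    ,R.CONVERTED_FLG AS LABEL             -- 1 if the user converted, 0 otherwise
FROM ${sqlconf:db_name}.VW_FEATURE AS F
INNER JOIN ${sqlconf:db_name}.TBL_CAMPAIGN_RESULT AS R
    ON F.USER_ID = R.USER_ID
    AND F.BASE_DATE = R.BASE_DATE;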

VW_PREDICTION_FEATURE

Goal: this view extracts the data we need in order to do a prediction

  • It extracts the data from VW_FEATURE for the most recent (MAX) base_date
  • This view is then used as an input for a python script applying feature engineering
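
A minimal sketch of the view, assuming the latest base_date is selected with a subquery:

CREATE VIEW ${sqlconf:db_name}.VW_PREDICTION_FEATURE AS
SELECT F.*
FROM ${sqlconf:db_name}.VW_FEATURE AS F
WHERE F.BASE_DATE = (
    SELECT MAX(BASE_DATE) FROM ${sqlconf:db_name}.VW_FEATURE
);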

Other tables for feature engineering and prediction

The other tables are all intermediary tables that we use to save each step. For example, one table stores the data transformed by the feature engineering Python script. Another table stores all the prediction scores, etc. They also keep historical records, mostly for audit purposes.

Data flow diagram for Airflow execution

Note that the diagram does not show all of the data engineering: we have additional tables created on the right (as specified just above), from intermediary tables to PREDICTION_SCORE.

Other notes

  • DEV and PROD use different databases
  • Airflow will orchestrate the data flow; the related files are found in the “dags” folder
  • Questions to ask systematically when writing a SQL CREATE query: what happens if there is NULL data? Is the query optimized? Are you doing the JOIN on the columns the data is partitioned on? Is each feature’s type definition as precise as possible (using bigint, varchar(7), etc.)?
  • SQL statements can finish with GO: GO n executes the related SQL commands n times. This is specific to some MS utilities, as GO is a batch separator rather than a T-SQL statement. Example: INSERT INTO mytable DEFAULT VALUES GO 10 (source)
  • SQL feature names are always written in CAPITAL_LETTERS (upper snake case)
  • SQL file names are in capital letters, except for the prefix giving the subfolder name, like ddl_SOMETHING_INFO
  • DISTINCT should be avoided (for example when it is used to remove duplicates), as it usually hides an underlying issue
  • To compare 2 tables (for example in the staging phase, to see if the data is exactly the same), you can use the following query. The result should be empty if the 2 tables have the same values.

SELECT XXX AS ID
FROM DB.TBL_1
EXCEPT
SELECT XXX AS ID
FROM DB.TBL_2

“The EXCEPT operator is used to exclude like rows that are found in one query but not another. It returns rows that are unique to one result” (source)
