ML deployment: feature engineering notes

Marc Deveaux
9 min read · Dec 11, 2021
Photo by Matt Hardy on Unsplash

Feature engineering notes related to the ongoing ML deployment working project

Table of contents

  1. Feature Engineering structure
  2. Libraries
  3. Other Notes
  • Difference between args and kwargs
  • Dunder
  • Public and Private functions
  • Returning various values in a function using SimpleNamespace
  • How to import a module if the .py file is in another folder
  • Mutable object vs non mutable data types

Feature engineering structure

Folder structure

We have 3 scripts within the folder:

  • common.py: categorical mappings stored as dictionaries
  • model.py: the main FE script
  • utils.py: the functions used in model.py

Model.py structure skeleton

Generally, the goal is to keep as many functions as possible in utils.py, so that the model.py script stays clean and short, with many logging.info() calls.

import os
import sys
import logging
# ...
logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s %(message)s',
                    datefmt='%m/%d/%Y %I:%M:%S %p')

if __name__ == "__main__":
    try:
        # using argparse, pass config file information (is it FE create or FE apply? etc.)
        # using configparser to get configurations
        ...
    except Exception as exp:
        logging.error("Failed to get config, %s", exp)
        raise
    # start FE process
    # ...
    if operation == 'create':
        ...  # FE create situation
    else:
        ...  # FE apply situation

Small remarks on the usage of __name__

  • __name__ is simply a built-in variable in Python which evaluates to the name of the current module (see https://www.afternerd.com/blog/python-__name__-__main__/)
  • In model.py, we run the FE process only if the current module is __main__, that is, when the script is executed directly. Another script can still import the module, but the guarded code does not run on import. This protects users from accidentally invoking the script when they didn’t intend to
# hello.py -> hello('Afternerd') runs only when this file is executed directly, not when it is imported
def hello(name):
    print(f"Hello, {name}!")

if __name__ == '__main__':
    hello('Afternerd')

Utils.py notes

The goal is to make the functions as general as possible so that you can reuse them for other models. Even simple pandas operations (dropping columns, etc.) are wrapped into functions. Note that we also use logging within the function definitions.

Overall, the functions do not perform error checks such as: is the input a dataframe, what happens if the file is not in the folder, etc.

Function for categorical encoding

One function from utils.py, _mapping_func, performs the categorical encoding transformation. The transformation dictionaries are stored in common.py and are looked up as follows:

# at the top of utils.py
import common
# written within the function
# mapper_data is a str specified in the function arguments;
# getattr(common, 'XXXX') returns the dictionary XXXX defined in common.py
comms_mapper = getattr(common, mapper_data)
# category mapping
dataset_map = dataset.replace({column_name: comms_mapper})

Quick remarks on building functions

  • always define the output type, and always define the type of each function argument
  • functions should be generalized as much as possible
  • one advantage of using *args and **kwargs is future development, as you can easily pass new arguments if needed. More information on this in the “Other notes” section
  • so instead of function(cols: list, apply_mean=True), where True means mean encoding and False means frequency encoding, we do: function(*args, **kwargs)
  • where args holds the column names, which you get through col_mean_encode = list(args)
  • where kwargs defines the technique we want, which we get with apply_techq = kwargs.pop('technique', None), followed by if apply_techq == 'mean' […]
  • if one of the arguments is not always needed, you can read it with kwargs.pop()
def my_funct(mkg_data: bool = False, **kwargs) -> pd.DataFrame:
    if mkg_data:
        # 'mkg_path' is only required when mkg_data is True
        mkg_query_path = kwargs.pop('mkg_path')
        ds = pd.read_csv(f"{mkg_query_path}_mkt_raw.csv")
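As a rough illustration of the *args / **kwargs pattern described above, the sketch below uses a hypothetical encode_columns function. To stay runnable without pandas it works on a plain dictionary of columns instead of a DataFrame. The function name, the 'technique' and 'target' keywords and the toy data are assumptions made for this example, not the project's actual code.

```python
from collections import Counter


def encode_columns(dataset: dict, *args: str, **kwargs) -> dict:
    """Return a copy of dataset with an encoded column added for each name in args."""
    cols = list(args)                          # positional arguments are the column names
    technique = kwargs.pop('technique', None)  # optional keyword, None if not given
    target = kwargs.pop('target', None)        # only needed for mean encoding
    encoded = {name: list(values) for name, values in dataset.items()}
    for col in cols:
        values = dataset[col]
        if technique == 'mean':
            # mean encoding: average of the target for each category
            sums, counts = Counter(), Counter()
            for value, y in zip(values, dataset[target]):
                sums[value] += y
                counts[value] += 1
            encoded[col + '_enc'] = [sums[v] / counts[v] for v in values]
        else:
            # frequency encoding: share of rows holding each category
            counts = Counter(values)
            encoded[col + '_enc'] = [counts[v] / len(values) for v in values]
    return encoded


data = {'city': ['a', 'a', 'b', 'b'], 'bought': [1, 0, 1, 1]}
mean_encoded = encode_columns(data, 'city', technique='mean', target='bought')
freq_encoded = encode_columns(data, 'city', technique='freq')
print(mean_encoded['city_enc'])
print(freq_encoded['city_enc'])
```

Adding a third technique later only needs a new branch and, if required, a new keyword read with kwargs.pop(); existing calls keep working unchanged.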

Testing your functions with unit tests

Unit tests have their own folder, whose goal is to test your functions. Note that if the folder is named unittest, it can shadow Python’s built-in unittest module and cause bugs when libraries are imported; unit_test is a better name

  • Inside the unit_test folder, we have a subfolder named data where we put sample csv files, so the tests are run on them. The csv files are dummy data, extremely simple, and for each test we have test_function_name_input_data.csv and test_function_name_output_data.csv
  • For example, you load “input_data”, transform it with the related function, call the result “expected_data” and compare it against “output_data”, which you also loaded
  • To compare two dataframes:
pd.testing.assert_frame_equal(output_data, expected_data)
  • we can use pytest rather than unittest, as it is easier to use. You execute a test by going, in the Windows terminal, to the folder of the test file and running pytest -s -v test_XXXX.py
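A minimal sketch of such a test is shown below, assuming pandas is installed. The drop_columns function and the CSV contents are hypothetical; the two strings stand in for the test_drop_columns_input_data.csv and test_drop_columns_output_data.csv files that would normally sit in the data subfolder.

```python
import io

import pandas as pd


def drop_columns(dataset: pd.DataFrame, *args: str) -> pd.DataFrame:
    """Hypothetical utils.py function: drop the columns named in args."""
    return dataset.drop(columns=list(args))


# In-memory stand-ins for the input and output csv files
INPUT_CSV = "id,age,city\n1,30,Paris\n2,41,Tokyo\n"
OUTPUT_CSV = "id,age\n1,30\n2,41\n"


def test_drop_columns():
    input_data = pd.read_csv(io.StringIO(INPUT_CSV))
    output_data = pd.read_csv(io.StringIO(OUTPUT_CSV))
    expected_data = drop_columns(input_data, 'city')
    # Raises an AssertionError describing the difference if the frames differ
    pd.testing.assert_frame_equal(output_data, expected_data)


test_drop_columns()  # pytest would discover and call this function by itself
```

In a real test file the explicit call on the last line is not needed, since pytest collects every function whose name starts with test_.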

Libraries

List of useful libraries

  • os : used for dealing with the current working directory / file system
  • sys : the sys module provides functions and variables used to manipulate different parts of the Python runtime environment
  • logging : create log messages
  • configparser : used for working with configuration files (which store information about the database, the server, the username, etc.)
  • argparse : allows you to specify argument when you are executing your python script from the command line (such as python xxx.py arg1 arg2)

OS library basics

Source: https://www.geeksforgeeks.org/os-path-module-python/

  • os.path.basename(path) : returns the file name from the given path
  • os.path.dirname(path) : returns the directory name from the given path, that is, everything before the final component
  • os.path.isdir(path) : tells whether the path is an existing directory
  • os.path.isfile(path) : tells whether the path is an existing file
  • os.path.normcase(path) : on Windows, converts the path to lowercase and forward slashes to backslashes
  • os.path.normpath(path) : normalizes the path by collapsing redundant separators and up-level references, so that A//B, A/B/, A/./B and A/foo/../B all become A/B. On Windows, it converts forward slashes to backslashes
  • os.path.join(os.path.dirname(__file__), 'config') : concatenates path components. The second component must not start with a slash, otherwise everything before it is discarded
Getting the parent of the folder that contains the script: os.path.abspath(os.path.join(os.path.dirname(os.path.abspath(__file__)), os.pardir))
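The functions above can be tried on a made-up path. The sketch below assumes a hypothetical script location, project/fe/model.py, and does not touch the file system, so the path does not need to exist.

```python
import os

# Hypothetical location of the FE script, used only for illustration
script_path = os.path.join('project', 'fe', 'model.py')

file_name = os.path.basename(script_path)  # last component of the path
script_dir = os.path.dirname(script_path)  # everything before the last component

# os.pardir is '..'; normpath collapses 'project/fe/..' into 'project'
parent_dir = os.path.normpath(os.path.join(script_dir, os.pardir))

# Joining with a relative name appends it to script_dir
config_dir = os.path.join(script_dir, 'config')

# Joining with a name that starts with a separator discards script_dir
discarded = os.path.join(script_dir, '/config')

print(file_name)
print(parent_dir)
print(config_dir)
print(discarded)
```

The last two lines show why the folder name passed to os.path.join should be written without a leading slash.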

Sys library basics

Source : https://www.tutorialsteacher.com/python/sys-module

  • sys.argv : the list of command line arguments passed to a Python script. The item at index 0 in this list is always the name of the script
import sys
print("You entered: ", sys.argv[1], sys.argv[2], sys.argv[3])
  • sys.exit : exits the script back to either the Python console or the command prompt. It works by raising the SystemExit exception, and is generally used to stop the program cleanly, for example after an error has been handled
  • sys.path : a list of strings that gives the search path for Python modules. It is initialized from the script’s directory, the PYTHONPATH environment variable and installation-dependent defaults
  • sys.version : a string containing the version number of the current Python interpreter

Logging library basics

At the top of the script, configure logging:

logging.basicConfig(level=logging.INFO, format='%(asctime)s %(message)s', datefmt='%m/%d/%Y %I:%M:%S %p')

Then, during the script, write log messages:

logging.info("Load Datasets ...")
logging.info(f"{dataset.dtypes}")
# ...
try:
    ...  # step that may fail, such as reading the configuration
except Exception as exp:
    logging.error("Failed to get configurations!, %s ", exp)
    raise
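The same two calls can also live inside a utility function. The sketch below is only an illustration: safe_ratio is a hypothetical helper, not one of the project's functions. It logs what it is doing, logs the error if the step fails, and re-raises so the caller still sees the exception.

```python
import logging

logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s %(message)s',
                    datefmt='%m/%d/%Y %I:%M:%S %p')


def safe_ratio(numerator: float, denominator: float) -> float:
    """Hypothetical utility that logs its work and re-raises on failure."""
    logging.info("Computing ratio of %s and %s", numerator, denominator)
    try:
        return numerator / denominator
    except ZeroDivisionError as exp:
        # Log the problem, then let the exception propagate to the caller
        logging.error("Failed to compute ratio, %s", exp)
        raise


ratio = safe_ratio(3, 4)
print(ratio)
```

Passing the values as arguments to logging.info() rather than formatting the string beforehand lets the logging module skip the formatting when the message level is disabled.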

Configparser library basics

Example where we read a conf file and log an error if we can’t read it

import configparser
import logging
import os

try:
    config = configparser.ConfigParser(strict=False)
    config_dir = os.path.abspath(os.path.join(os.path.dirname(os.path.abspath(__file__)), os.pardir))
    var = os.path.abspath(os.path.join(config_dir, "../"))
    print("Directory path", var)
    # config.read() silently skips a file it cannot find ...
    config.read('{}/XXXX/configs/env.conf'.format(var))
    # ... so a missing file or section surfaces here as a KeyError
    env_conf = config[env_label]  # env_label comes from argparse
except Exception as exp:
    logging.error("Failed to get configurations!, %s", exp)
    raise
  • generally, you should put in the config file the information that you would traditionally hard-code in the script, such as test_size: 0.1, random_state: 0, n_estimators: 50, etc.
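To make this concrete, here is a small sketch of a config holding the parameters mentioned above. The section names dev and prod and the prod values are assumptions for the example, and read_string() is used in place of read() so that no file is needed.

```python
import configparser

# Hypothetical contents of an env.conf file, one section per environment
CONF_TEXT = """
[dev]
test_size = 0.1
random_state = 0
n_estimators = 50

[prod]
test_size = 0.2
random_state = 0
n_estimators = 200
"""

config = configparser.ConfigParser(strict=False)
config.read_string(CONF_TEXT)

env_label = 'dev'  # would normally come from the command line
env_conf = config[env_label]

# Values are stored as strings, so convert them explicitly
test_size = env_conf.getfloat('test_size')
random_state = env_conf.getint('random_state')
n_estimators = env_conf.getint('n_estimators')
print(test_size, random_state, n_estimators)
```

Switching environment then only means passing a different label; the script itself does not change.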

Argparse library basics

source: https://towardsdatascience.com/learn-enough-python-to-be-useful-argparse-e482e1764e05

For example, the following script creates a parser object with two required string arguments

# videos.py
import argparse
parser = argparse.ArgumentParser(description='Videos to images')
parser.add_argument('indir',
                    type=str,
                    help='Input dir for videos')
parser.add_argument('outdir',
                    type=str,
                    help='Output dir for image')
args = parser.parse_args()
print(args.indir)

You can then call the script with the related arguments:

python videos.py /videos /images

Another example from a FE script:

parser = argparse.ArgumentParser()
parser.add_argument('-env', "--env",
                    type=str,
                    required=True,
                    help='Environment variable')
args = parser.parse_args()
env_label = args.env
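The FE create / FE apply switch mentioned in the model.py skeleton could be passed the same way. The sketch below is an assumption about how that might look: the --operation flag and its two choices are hypothetical, and a list is passed to parse_args() so the example runs without a real command line.

```python
import argparse

parser = argparse.ArgumentParser(description='Feature engineering')
parser.add_argument('--env', type=str, required=True,
                    help='Environment label')
# choices makes argparse reject any value other than the two listed
parser.add_argument('--operation', choices=['create', 'apply'], default='create',
                    help='Create the FE artefacts or apply existing ones')

# Equivalent to running: python model.py --env dev --operation apply
args = parser.parse_args(['--env', 'dev', '--operation', 'apply'])
env_label = args.env
operation = args.operation
print(env_label, operation)
```

In the real script, parse_args() is called without arguments so that the values are read from sys.argv.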

Other notes

Difference between args and kwargs

source: https://careerkarma.com/blog/python-args-kwargs/

Python argument orders

*args uses a single asterisk whereas **kwargs uses a double asterisk. The asterisks are what matter: “args” and “kwargs” are only conventional placeholder names, and you can replace them with any valid name

Arguments in a Python function must appear in a specific order:

  • Formal arguments
  • *args
  • Keyword arguments
  • **kwargs

Example:

def printOrder(coffee, *args, coffee_order="Espresso", **kwargs):
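To see how the four kinds of arguments are bound, the sketch below gives this signature a small body that simply returns what each parameter received. The body and the drink names are made up for the illustration.

```python
def printOrder(coffee, *args, coffee_order="Espresso", **kwargs):
    # Return how each argument was bound, so the result can be inspected
    return {
        'coffee': coffee,              # formal (positional) argument
        'args': args,                  # extra positional arguments, as a tuple
        'coffee_order': coffee_order,  # keyword argument with a default
        'kwargs': kwargs,              # extra keyword arguments, as a dict
    }


bound = printOrder("Mocha", "Latte", "Cortado", coffee_order="Ristretto", size="Large")
print(bound)
```

Because coffee_order comes after *args, it can only be set by keyword; a third positional value would land in args instead.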

Python args

  • The Python *args syntax collects a variable number of positional arguments into a tuple
  • In your function, you can replace the name “args” with any valid name
def addNumbers(*args):
    total = 0
    for a in args:  # you loop through args
        total = total + a
    print(total)

addNumbers(9, 10, 12)
addNumbers(13, 14, 15, 15)

Python kwargs

  • The **kwargs syntax collects an arbitrary number of keyword arguments passed to a function
  • The keyword arguments are stored in a dictionary. You can access each item by the keyword you used when you passed the argument
def our_function(**record):
    for key, value in record.items():
        print("{} (key) and {} (value)".format(key, value))

def printOrder(**kwargs):
    for key, value in kwargs.items():
        print("{} is equal to {}".format(key, value))

printOrder(coffee="Mocha", price=2.90, size="Large")

Gives:

coffee is equal to Mocha
price is equal to 2.9
size is equal to Large

Dunder

Source : https://www.geeksforgeeks.org/__file__-a-special-variable-in-python/

  • A double underscore variable in Python is usually referred to as a dunder. A dunder variable is a variable that Python has defined so that it can use it in a “special way”. This special way depends on the variable that is being used.
  • __file__ is a variable that contains the path to the module that is currently being imported. Python creates a __file__ variable for itself when it is about to import a module. Updating and maintaining this variable is the responsibility of the import system. The import system can choose to leave the variable empty when it has no semantic meaning, for example when the module is imported from a database. This attribute is a string. It can be used to know the path of the module you are using.
  • Ex: file_path = os.path.join(os.path.dirname(__file__), 'config')

Public and Private functions

Source : https://salishsea-meopar-tools.readthedocs.io/en/latest/python_packaging/library_code.html

In many languages, private functions are inaccessible outside of their particular program scope. Python does not enforce this: “Dynamic languages like Python have very strong introspection capabilities that make such privacy constraints impossible. Instead, the Python community relies on the social convention that functions, methods, etc. that are spelled with leading underscore characters (_) are considered to be private”.

Reasons to mark a function as private: “it is not intended to be used outside of this module” or “I don’t want to guarantee that I won’t change its arguments later and I don’t want other people to rely on its definition”

In the example below, the helpers whose names start with an underscore, such as _plot_alerts_map, are private, while storm_surge_alerts is public

def storm_surge_alerts(
    grids_15m, weather_path, coastline, tidal_predictions,
    figsize=(18, 20),
    theme=nowcast.figures.website_theme,
):
    ...
    plot_data = _prep_plot_data(grids_15m, tidal_predictions, weather_path)
    fig, (ax_map, ax_pa_info, ax_cr_info, ax_vic_info) = _prep_fig_axes(
        figsize, theme)
    _plot_alerts_map(ax_map, coastline, plot_data, theme)
    ...

Returning various values in a function using SimpleNamespace

“if you are writing a function that returns more than one value, consider returning the collection of values as a SimpleNamespace. If your function returns more than 3 values, definitely return them as a SimpleNamespace.”

“SimpleNamespace objects that have fields accessible by attribute lookup (dotted notation). They also have a helpful string representation which lists the namespace contents in a name=value format.”

from types import SimpleNamespace

def load_ADCP(
    daterange, station='central',
    adcp_data_dir='/ocean/dlatorne/MEOPAR/ONC_ADCP/',
):
    """
    [...]
    :returns: :py:attr:`datetime` attribute holds a :py:class:`numpy.ndarray`
              of data datetime stamps,
              :py:attr:`depth` holds the depth at which the ADCP sensor is
              deployed,
              :py:attr:`u` and :py:attr:`v` hold :py:class:`numpy.ndarray`
              of the zonal and meridional velocity profiles at each datetime.
    :rtype: 4 element :py:class:`types.SimpleNamespace`
    """
    ...
    return SimpleNamespace(datetime=datetime, depth=depth, u=u, v=v)

You can then call the function and read each value by attribute, as below

adcp_data = load_ADCP(('2016 05 01', '2016 05 31'))
adcp_data.depth
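Since load_ADCP depends on data that is not available here, a smaller runnable sketch of the same idea follows. The split_dataset helper and its fields are hypothetical and only serve to show the attribute access and the string representation.

```python
from types import SimpleNamespace


def split_dataset(rows, test_size=0.25):
    """Hypothetical helper: split rows into a train part and a test part."""
    n_test = int(len(rows) * test_size)
    n_train = len(rows) - n_test
    return SimpleNamespace(
        train=rows[:n_train],
        test=rows[n_train:],
        n_train=n_train,
        n_test=n_test,
    )


parts = split_dataset(list(range(8)))
print(parts.n_train, parts.n_test)  # fields are read with dotted notation
print(parts)                        # the repr lists every name=value pair
```

Compared with returning a plain tuple, the caller does not need to remember the order of the returned values.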

How to import a module if the .py file is in another folder

Source: https://realpython.com/lessons/module-search-path/

“When the interpreter executes the import statement (such as ‘import mod’), it searches for mod.py in a list of directories assembled from the following sources:

  1. The directory from which the input script was run or the current directory if the interpreter is being run interactively
  2. The list of directories contained in the PYTHONPATH environment variable, if it is set. (The format for PYTHONPATH is OS-dependent but should mimic the PATH environment variable.)
  3. An installation-dependent list of directories configured at the time Python is installed

The resulting search path is accessible in the Python variable sys.path, which is obtained from a module named sys”

import sys
sys.path

So if the module is not in the same directory and you don’t want to put it in the same folder, you can:

  • Modify the PYTHONPATH environment variable to contain the directory where mod.py is located before starting the interpreter
  • Put mod.py in one of the installation-dependent directories, which you may or may not have write-access to, depending on the OS
  • You can put the module file in any directory of your choice and then modify sys.path at run-time so that it contains that directory. For example, in this case, you could put mod.py in directory C:\Users\john and then issue the following statements
>>> sys.path.append(r'C:\Users\john')
>>> sys.path
['', 'C:\\Users\\john\\Documents\\Python\\doc', 'C:\\Python36\\Lib\\idlelib',
'C:\\Python36\\python36.zip', 'C:\\Python36\\DLLs', 'C:\\Python36\\lib',
'C:\\Python36', 'C:\\Python36\\lib\\site-packages', 'C:\\Users\\john']
>>> import mod

Once a module has been imported, you can determine the location where it was found with the module’s __file__ attribute:

import mod
mod.__file__
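The run-time option can be tried end to end without touching a real project. The sketch below writes a hypothetical module, fe_helpers.py, into a temporary folder, adds that folder to sys.path and imports it; the module name and its double function are made up for the example.

```python
import importlib
import os
import sys
import tempfile

# Create a throwaway folder holding a hypothetical module, fe_helpers.py
module_dir = tempfile.mkdtemp()
with open(os.path.join(module_dir, 'fe_helpers.py'), 'w') as handle:
    handle.write("def double(x):\n    return 2 * x\n")

# Make the folder visible to the import system at run time
sys.path.append(module_dir)
importlib.invalidate_caches()  # the file was created after the interpreter started

import fe_helpers

print(fe_helpers.double(21))
print(fe_helpers.__file__)  # shows the folder the module was found in
```

The change to sys.path only lasts for the current interpreter session, which is why PYTHONPATH is the better choice for a permanent setup.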

Mutable object vs non mutable data types

Source : https://towardsdatascience.com/https-towardsdatascience-com-python-basics-mutable-vs-immutable-objects-829a0cb1530a

  • Some of the mutable data types in Python are list, dictionary, set and user-defined classes
  • On the other hand, some of the immutable data types are int, float, decimal, bool, string, tuple, and range
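The distinction matters most when an object is passed to a function. In the sketch below, add_feature is a hypothetical helper: it changes the caller's list in place, whereas a tuple can only be "changed" by building a new object.

```python
def add_feature(columns, name):
    """Append in place: the list owned by the caller is modified too."""
    columns.append(name)
    return columns


features = ['age', 'city']
result = add_feature(features, 'income')
print(features)            # the original list now holds three names
print(result is features)  # both names point to the same object

frozen = ('age', 'city')
extended = frozen + ('income',)  # builds a new tuple, frozen is untouched
print(frozen)
print(extended)
```

If the caller's list must stay unchanged, the function can work on a copy, for example list(columns), before appending.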
