ML deployment: feature engineering notes

Marc Deveaux

Dec 11, 2021

Feature engineering notes related to the ongoing ML deployment working project

Table of contents

Feature Engineering structure
Libraries
Other Notes

Difference between args and kwargs
Dunder
Public and Private functions
Returning various values in a function using SimpleNamespace
How to import a module if the .py file is in another folder
Mutable object vs non mutable data types

Feature engineering structure

Folder structure

We have 3 scripts within the folder:

common.py: some categorical mapping stored as dictionary
model.py : the FE script to go through
utils.py: list of functions to use in model.py

model.py structure skeleton

Generally, the goal is to have as many functions as possible in utils.py so that the model.py script is kept clean and short, with many logging.info() calls.

import os
import sys
import logging
# ...
logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s %(message)s',
                    datefmt='%m/%d/%Y %I:%M:%S %p')

if __name__ == "__main__":
    try:
        # using argparse, pass config file information (is it FE create or FE apply? etc.)
        try: 
            # using configparser to get configurations
        except Exception as exp:
            logging.error("Failed to get config,  %s", exp)
            raise
        # start FE process
        # ...
        if operation == 'create':
            # ...
        else:
            # FE apply situation

Small remarks on the usage of __name__

__name__ is simply a built-in variable in Python which evaluates to the name of the current module (see https://www.afternerd.com/blog/python-__name__-__main__/)
In the skeleton above, the FE process runs only if the current module is __main__, that is, when the script is executed directly. The module can still be imported by another script, but the guarded block is skipped on import. It protects users from accidentally invoking the script when they didn’t intend to. The small example below uses the same guard

# hello.py -> run directly, this prints "Hello, Afternerd!"; when imported, nothing is printed
def hello(name):
    print(f"Hello, {name}!")

if __name__ == '__main__':
    hello('Afternerd')

Utils.py notes

The goal is to make the functions as generalized as possible so you can use them for other models. Even pandas operations are wrapped into functions (like dropping columns, etc.). Note that we also use logging within the function definitions.

Overall, the functions do not perform error checks such as: is the input a dataframe, what happens if the file is not in the folder, etc.

Function for categorical encoding

One function from utils.py, _mapping_func, performs the categorical encoding transformation. The transformation dictionaries are stored in common.py and are looked up by name as follows:

# written within the function (common is imported at the top of utils.py)
# the string 'common.' + mapper_data is only a name, not the dictionary,
# so the dictionary is fetched by name with getattr
comms_mapper = getattr(common, mapper_data)
# where mapper_data is a str specified in the fct
# so getattr(common, 'XXXX') returns the dictionary common.XXXX
# category_mapping
dataset_map = dataset.replace({column_name: comms_mapper})
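As a minimal sketch of the name-based lookup described above, the dictionary can be fetched from the module by its name and then applied to a column. The names here are hypothetical stand-ins so that the example runs on its own: a SimpleNamespace called common replaces the real common.py, gender_map is an invented mapping, and a plain list replaces the dataframe column.

```python
from types import SimpleNamespace

# Hypothetical stand-in for common.py: one mapping dictionary per categorical column
common = SimpleNamespace(gender_map={"M": 0, "F": 1})


def map_categories(values: list, mapper_data: str) -> list:
    """Encode `values` with the dictionary named `mapper_data` in `common`."""
    mapper = getattr(common, mapper_data)   # same as common.gender_map, but chosen by name
    # Values missing from the dictionary are left unchanged
    return [mapper.get(value, value) for value in values]


encoded = map_categories(["M", "F", "unknown"], "gender_map")
print(encoded)
```

Passing the dictionary name as a string keeps the function generic: adding a new mapping only requires a new dictionary in common.py.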

Quick remarks on building functions

Always define the output type and the type of each function argument
Functions should be generalized as much as possible
One of the advantages of using *args and **kwargs is for future development, as you can easily pass any new arguments if needed. More information on this in the “Other notes” section
So instead of function(cols: list, apply_mean=True), where True means mean encoding and False means frequency encoding, we do: function(*args, **kwargs)
where args holds the column names, which you get through col_mean_encode = list(args)
and kwargs defines the technique we want, which we get with apply_techq = kwargs.pop('technique', None) followed by if apply_techq == 'mean' […]
If an argument is only needed in some cases, you can read it with kwargs.pop(). With a default, as in kwargs.pop('technique', None), a missing key returns the default; without one, a missing key raises KeyError

import pandas as pd

def my_funct(mkg_data: bool = False, **kwargs) -> pd.DataFrame:
    if mkg_data:
        # 'mkg_path' is only needed (and only read) when mkg_data is True
        mkg_query_path = kwargs.pop('mkg_path')
        ds = pd.read_csv(f"{mkg_query_path}_mkt_raw.csv")
    # ...
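The *args / **kwargs pattern from the remarks above could look roughly like the sketch below. Everything in it is an assumption made for illustration: the function name encode_columns, the 'target' keyword, and the use of a plain dict of lists instead of a dataframe so that the example has no dependencies.

```python
from collections import Counter, defaultdict


def encode_columns(data: dict, *args: str, **kwargs) -> dict:
    """Return a copy of `data` (column name -> list) with the columns in *args encoded."""
    cols = list(args)                              # positional arguments are the column names
    technique = kwargs.pop('technique', None)      # optional keyword, None if not passed
    target_name = kwargs.pop('target', None)       # hypothetical keyword: target column name
    encoded = {name: list(values) for name, values in data.items()}
    for col in cols:
        values = data[col]
        if technique == 'mean':
            # Mean encoding: replace each category by the mean target of its rows
            groups = defaultdict(list)
            for value, y in zip(values, data[target_name]):
                groups[value].append(y)
            mapping = {k: sum(v) / len(v) for k, v in groups.items()}
        else:
            # Frequency encoding: replace each category by its share of the rows
            counts = Counter(values)
            mapping = {k: n / len(values) for k, n in counts.items()}
        encoded[col] = [mapping[v] for v in values]
    return encoded


data = {'city': ['a', 'a', 'b', 'a'], 'y': [1, 0, 1, 1]}
freq = encode_columns(data, 'city')
mean = encode_columns(data, 'city', technique='mean', target='y')
```

A new technique can later be added as another branch without changing the signature or any existing call.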

Testing your functions with unittest

Unit tests have their own folder, whose goal is to test your functions. Note that if your folder is named unittest, it can shadow the standard library module of the same name and cause errors when libraries are imported; so unit_test is a better idea

Inside the unit_test folder, we have a subfolder named data where we put sample csv files, so the tests are run on them. The csv files are dummy data, extremely simple, and for each test we have test_function_name_input_data.csv and test_function_name_output_data.csv
For example, you load “input_data”, transform it with the related function, call the result “expected_data”, and compare it against “output_data” that you also loaded
To compare the two dataframes:

pd.testing.assert_frame_equal(output_data, expected_data)

We can use pytest rather than unittest, as it is easier to use. To execute a test, open a terminal (Windows terminal in our case), go to the folder containing the test file and run pytest -s -v test_XXXX.py
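A test file following the input/output convention above might look like this sketch. The drop_columns wrapper is a hypothetical utils.py function, and the two CSV strings stand in for test_drop_columns_input_data.csv and test_drop_columns_output_data.csv so that the example does not need files on disk.

```python
import io

import pandas as pd


def drop_columns(dataset: pd.DataFrame, *args: str) -> pd.DataFrame:
    """Hypothetical utils.py wrapper: return a copy without the listed columns."""
    return dataset.drop(columns=list(args))


# Stand-ins for the input and output csv files of the data subfolder
INPUT_CSV = "id,age,tmp\n1,30,x\n2,41,y\n"
OUTPUT_CSV = "id,age\n1,30\n2,41\n"


def test_drop_columns():
    input_data = pd.read_csv(io.StringIO(INPUT_CSV))
    output_data = pd.read_csv(io.StringIO(OUTPUT_CSV))
    expected_data = drop_columns(input_data, "tmp")
    # Raises AssertionError with a readable diff if the frames differ
    pd.testing.assert_frame_equal(output_data, expected_data)
```

pytest collects any function whose name starts with test_, so no class or runner code is needed.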

Libraries

List of useful libraries

os : used for dealing with the current working directory / file system
sys : the sys module provides functions and variables used to manipulate different parts of the Python runtime environment
logging : create log messages
configparser : used for working with configuration files (which store information about the database, the server, the username, etc.)
argparse : allows you to specify arguments when you are executing your python script from the command line (such as python xxx.py arg1 arg2)

OS library basics

Source: https://www.geeksforgeeks.org/os-path-module-python/

os.path.basename(path) : returns the file name from the given path
os.path.dirname(path) : returns the directory name from the given path, that is, everything except the last component
os.path.isdir(path) : returns whether the path is an existing directory
os.path.isfile(path) : returns whether the path is an existing file
os.path.normcase(path) : on Windows, converts the path to lowercase and forward slashes to backslashes
os.path.normpath(path) : normalizes the path by collapsing redundant separators and up-level references so that A//B, A/B/, A/./B and A/foo/../B all become A/B. On Windows, it converts forward slashes to backslashes
os.path.join(os.path.dirname(__file__), 'config') : concatenates path components. The second component must not start with a slash, otherwise join() treats it as absolute and discards the first one

Getting the parent of the script's directory: os.path.abspath(os.path.join(os.path.dirname(os.path.abspath(__file__)), os.pardir))
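The functions above can be tried on a made-up path. This sketch uses posixpath, the POSIX implementation behind os.path, so that the results do not depend on the operating system; the script location /project/fe/model.py is an assumption for illustration.

```python
import posixpath  # the POSIX flavour of os.path, so results do not depend on the OS

script = "/project/fe/model.py"     # hypothetical location of the running script

folder = posixpath.dirname(script)                       # '/project/fe'
name = posixpath.basename(script)                        # 'model.py'
# Parent of the script's folder, as in the os.pardir expression above
parent = posixpath.normpath(posixpath.join(folder, posixpath.pardir))   # '/project'
config_dir = posixpath.join(folder, "config")            # '/project/fe/config'

# A component that starts with a slash is absolute, so join() drops everything before it
discarded = posixpath.join(folder, "/config")            # '/config'
```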

Sys library basics

Source : https://www.tutorialsteacher.com/python/sys-module

sys.argv returns a list of command line arguments passed to a Python script. The item at index 0 in this list is always the name of the script

import sys
print("You entered: ",sys.argv[1], sys.argv[2], sys.argv[3])

sys.exit: raises SystemExit, which causes the script to exit back to either the Python console or the command prompt. It is generally used to exit the program cleanly, for example after an exception has been handled
sys.path: a list of strings giving the search path for Python modules. It is initialized in part from the PYTHONPATH environment variable
sys.version: a string containing the version number of the current Python interpreter
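As a small sketch of sys.argv and sys.exit working together, the function below checks the argument list before using it. The main function and the script name model.py are assumptions; the list is passed explicitly instead of reading sys.argv so that the example can run anywhere.

```python
import sys


def main(argv: list) -> str:
    """argv mirrors sys.argv: index 0 is the script name, the rest are user arguments."""
    if len(argv) < 2:
        # sys.exit raises SystemExit; a string argument becomes the exit message
        sys.exit(f"usage: python {argv[0]} <operation>")
    return f"You entered: {argv[1]}"


print(main(["model.py", "create"]))

try:
    main(["model.py"])              # no user argument: the function asks to exit
except SystemExit as exc:
    print("exit requested:", exc.code)
```

In a real script the call would be main(sys.argv), and the SystemExit would simply end the program.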

Logging library basics

At the top of the script, configure the logging level and message format:

logging.basicConfig(level=logging.INFO, format='%(asctime)s %(message)s', datefmt='%m/%d/%Y %I:%M:%S %p')

Then during the script, create log messages:

logging.info("Load Datasets ...")
logging.info(f"{dataset.dtypes}")
try:
    # ... step that may fail
    ...
except Exception as exp:
    logging.error("Failed to get configurations!, %s ", exp)
    raise

Configparser library basics

Example where we read a conf file and generate an error if we can’t read it

try:
    config = configparser.ConfigParser(strict=False)
    config_dir = os.path.abspath(os.path.join(os.path.dirname(os.path.abspath(__file__)), os.pardir))
    var = os.path.abspath(os.path.join(config_dir, "../"))
    print("Directory path", var)
    # config.read() silently skips a missing file; the lookup below then raises KeyError
    config.read('{}/XXXX/configs/env.conf'.format(var))
    env_conf = config[env_label]
except Exception as exp:
    logging.error("Failed to get configurations!, %s", exp)
    raise

Generally, the config file should hold the parameters that you would traditionally hard-code in the script, such as: test_size: 0.1, random_state: 0, n_estimators: 50, etc.
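Reading such parameters back could be sketched as follows. The section name dev (playing the role of env_label) and the inline config text are assumptions; read_string is used instead of read so that no file is needed.

```python
import configparser

# Hypothetical contents of a config file; "dev" plays the role of env_label
CONF_TEXT = """
[dev]
test_size: 0.1
random_state: 0
n_estimators: 50
"""

config = configparser.ConfigParser(strict=False)
config.read_string(CONF_TEXT)       # config.read(path) would load the same text from a file

env_conf = config["dev"]
# Values are stored as strings, so convert them with the typed getters
test_size = env_conf.getfloat("test_size")
random_state = env_conf.getint("random_state")
n_estimators = env_conf.getint("n_estimators")
```

Plain indexing such as env_conf["test_size"] returns the raw string "0.1", which is why the typed getters are used here.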

Argparse library basics

source: https://towardsdatascience.com/learn-enough-python-to-be-useful-argparse-e482e1764e05

For example, the following script creates a parser object with 2 string arguments to be defined

# videos.py
import argparse
parser = argparse.ArgumentParser(description='Videos to images')
parser.add_argument('indir', 
                    type=str, 
                    help='Input dir for videos')
parser.add_argument('outdir', 
                    type=str, 
                    help='Output dir for image')
args = parser.parse_args()
print(args.indir)

You can then call the script with the related arguments:

python videos.py /videos /images

Another example from a FE script:

parser = argparse.ArgumentParser()
parser.add_argument('-env', "--env",
                    type=str,
                    required=True,
                    help='Environment variable')
args = parser.parse_args()
env_label = args.env
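To experiment with a parser without going through the command line, parse_args accepts an explicit list of strings. In the sketch below, the --operation flag and the parse_cli helper are hypothetical additions used to show how a create/apply switch could be restricted to valid values.

```python
import argparse


def parse_cli(argv: list) -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Feature engineering")
    parser.add_argument("--env", type=str, required=True, help="Environment label")
    # Hypothetical flag: restrict the operation to the two FE modes
    parser.add_argument("--operation", choices=["create", "apply"], default="create",
                        help="FE create or FE apply")
    return parser.parse_args(argv)


args = parse_cli(["--env", "dev", "--operation", "apply"])
print(args.env, args.operation)
```

Calling parser.parse_args() with no argument reads sys.argv[1:] instead, which is what happens when the script is run from the terminal.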

Other notes

Difference between args and kwargs

source: https://careerkarma.com/blog/python-args-kwargs/

Python argument orders

*args uses a single asterisk whereas **kwargs uses a double asterisk. The asterisks are what matter: “args” and “kwargs” themselves are only conventional placeholder names, and you can replace them with any valid name

Arguments in a Python function must appear in a specific order:

Formal arguments
*args
Keyword arguments
**kwargs

Example:

def printOrder(coffee, *args, coffee_order="Espresso", **kwargs):
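To see where each value ends up, here is a small sketch with the same signature. The function name describe_order and the argument values are made up for illustration.

```python
def describe_order(coffee, *args, coffee_order="Espresso", **kwargs):
    # coffee: formal argument; args: extra positional values (tuple);
    # coffee_order: keyword argument with a default; kwargs: extra keywords (dict)
    return coffee, args, coffee_order, kwargs


result = describe_order("Latte", "milk", "sugar", coffee_order="Mocha", size="Large")
print(result)
```

Because coffee_order comes after *args, it can only be set by keyword: any extra positional value is collected into args.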

Python args

The Python *args syntax collects a variable number of positional arguments into a tuple
In your function, you can replace the name “args” with any valid name

def addNumbers(*args):
    total = 0
    for a in args:  # you loop through args
        total = total + a
    print(total)

addNumbers(9, 10, 12)
addNumbers(13, 14, 15, 15)

Python kwargs

The **kwargs syntax represents an arbitrary number of keyword arguments that are passed to a function
The keyword arguments are stored in a dictionary. You can access each item by referring to the keyword you associated with an argument when you passed the argument

def our_function(**record):
    for key, value in record.items():
        print("{} (key) and {} (value)".format(key, value))

def printOrder(**kwargs):
    for key, value in kwargs.items():
        print("{} is equal to {}".format(key, value))

printOrder(coffee="Mocha", price=2.90, size="Large")

Gives:

coffee is equal to Mocha
price is equal to 2.9
size is equal to Large

Dunder

Source : https://www.geeksforgeeks.org/__file__-a-special-variable-in-python/

A double underscore variable in Python is usually referred to as a dunder. A dunder variable is a variable that Python has defined so that it can use it in a “special way”. This special way depends on the variable that is being used.
__file__ is a variable that contains the path to the module that is currently being imported. Python creates a __file__ variable for itself when it is about to import a module. Updating and maintaining this variable is the responsibility of the import system. The import system can choose to leave the variable empty when there is no semantic meaning, for example when the module is imported from a database. This attribute is a string. It can be used to know the path of the module you are using.
Ex: file_path = os.path.join(os.path.dirname(__file__), 'config')

Public and Private functions

Source : https://salishsea-meopar-tools.readthedocs.io/en/latest/python_packaging/library_code.html

In many languages, private functions are inaccessible outside of their particular program scope. Python does not enforce this: “Dynamic languages like Python have very strong introspection capabilities that make such privacy constraints impossible. Instead, the Python community relies on the social convention that functions, methods, etc. that are spelled with leading underscore characters (_) are considered to be private”.

Reasons to use private functions: “it is not intended to be used outside of this module.” or “I don’t want to guarantee that I won’t change its arguments later and I don’t want other people to rely its definition”

In the below example, _plot_alerts_map is private while storm_surge_alerts is public

def storm_surge_alerts(
    grids_15m, weather_path, coastline, tidal_predictions,
    figsize=(18, 20),
    theme=nowcast.figures.website_theme,
):
    ...
    plot_data = _prep_plot_data(grids_15m, tidal_predictions, weather_path)
    fig, (ax_map, ax_pa_info, ax_cr_info, ax_vic_info) = _prep_fig_axes(
        figsize, theme)
    _plot_alerts_map(ax_map, coastline, plot_data, theme)
    ...

Returning various values in a function using SimpleNamespace

“if you are writing a function that returns more than one value, consider returning the collection of values as a SimpleNamespace. If your function returns more than 3 values, definitely return them as a SimpleNamespace.”

“SimpleNamespace objects that have fields accessible by attribute lookup (dotted notation). They also have a helpful string representation which lists the namespace contents in a name=value format.”

from types import SimpleNamespace

def load_ADCP(
        daterange, station='central',
        adcp_data_dir='/ocean/dlatorne/MEOPAR/ONC_ADCP/',
):
    """
    [...]
    :returns: :py:attr:`datetime` attribute holds a :py:class:`numpy.ndarray`
              of data datetime stamps,
              :py:attr:`depth` holds the depth at which the ADCP sensor is
              deployed,
              :py:attr:`u` and :py:attr:`v` hold :py:class:`numpy.ndarray`
              of the zonal and meridional velocity profiles at each datetime.
    :rtype: 4 element :py:class:`types.SimpleNamespace`
    """
    ...
    return SimpleNamespace(datetime=datetime, depth=depth, u=u, v=v)

You then can call the function and extract the information as below

adcp_data = load_ADCP(('2016 05 01', '2016 05 31'))
adcp_data.depth
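Since load_ADCP depends on data that is not available here, the same idea can be tried with a small self-contained sketch. The split_dataset helper and its field names are hypothetical.

```python
from types import SimpleNamespace


def split_dataset(rows: list, test_size: float = 0.1) -> SimpleNamespace:
    """Hypothetical helper: split rows into train and test parts, returned by name."""
    n_test = max(1, round(len(rows) * test_size))
    return SimpleNamespace(train=rows[:-n_test], test=rows[-n_test:], n_test=n_test)


parts = split_dataset(list(range(10)), test_size=0.2)
print(parts.train, parts.test, parts.n_test)
print(parts)    # the string representation lists every field as name=value
```

Compared with returning a tuple, the caller does not need to remember the order of the returned values.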

How to import a module if the .py file is in another folder

Source: https://realpython.com/lessons/module-search-path/

“When the interpreter executes the import statement (such as ‘import mod’), it searches for mod.py in a list of directories assembled from the following sources:

The directory from which the input script was run or the current directory if the interpreter is being run interactively
The list of directories contained in the PYTHONPATH environment variable, if it is set. (The format for PYTHONPATH is OS-dependent but should mimic the PATH environment variable.)
An installation-dependent list of directories configured at the time Python is installed

The resulting search path is accessible in the Python variable sys.path, which is obtained from a module named sys”

import sys
sys.path

So if the module is not in the same directory and you don’t want to put it in the same folder, you can:

Modify the PYTHONPATH environment variable to contain the directory where mod.py is located before starting the interpreter
Put mod.py in one of the installation-dependent directories, which you may or may not have write-access to, depending on the OS
You can put the module file in any directory of your choice and then modify sys.path at run-time so that it contains that directory. For example, in this case, you could put mod.py in directory C:\Users\john and then issue the following statements

>>> sys.path.append(r'C:\Users\john')
>>> sys.path
['', 'C:\\Users\\john\\Documents\\Python\\doc', 'C:\\Python36\\Lib\\idlelib',
'C:\\Python36\\python36.zip', 'C:\\Python36\\DLLs', 'C:\\Python36\\lib',
'C:\\Python36', 'C:\\Python36\\lib\\site-packages', 'C:\\Users\\john']
>>> import mod

Once a module has been imported, you can determine the location where it was found with the module’s __file__ attribute:

import mod
mod.__file__
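The sys.path approach can be tried end to end with a throwaway folder. In this sketch the module name mod_example and its GREETING variable are made up, and the folder is a temporary directory rather than a real project folder.

```python
import importlib
import os
import sys
import tempfile

# Create a throwaway folder holding mod_example.py (hypothetical module name)
folder = tempfile.mkdtemp()
with open(os.path.join(folder, "mod_example.py"), "w") as handle:
    handle.write("GREETING = 'hello from another folder'\n")

sys.path.append(folder)                 # make the folder visible to the import system
importlib.invalidate_caches()           # the file was created after the interpreter started
mod_example = importlib.import_module("mod_example")

print(mod_example.GREETING)
print(mod_example.__file__)             # path inside the temporary folder
```

A plain import mod_example statement would work the same way once the folder is in sys.path.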

Mutable object vs non mutable data types

Source : https://towardsdatascience.com/https-towardsdatascience-com-python-basics-mutable-vs-immutable-objects-829a0cb1530a

Some of the mutable data types in Python are list, dictionary, set and user-defined classes
On the other hand, some of the immutable data types are int, float, decimal, bool, string, tuple, and range
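The practical consequence of the distinction is easiest to see in a short sketch: a mutable object changed inside a function is changed for the caller too, while operations on immutable objects return new objects. The add_flag function and the column names are made up for illustration.

```python
def add_flag(columns: list) -> None:
    columns.append("flag")          # mutates the caller's list in place


cols = ["age", "city"]
alias = cols                        # second name for the same list object
add_flag(cols)
print(alias)                        # the change is visible through both names

name = "age"
upper = name.upper()                # str is immutable: a new string is returned
print(name, upper)

frozen = ("age", "city")
try:
    frozen[0] = "id"                # tuples do not support item assignment
except TypeError as exc:
    print("TypeError:", exc)
```

This is worth keeping in mind for utility functions that receive lists or dictionaries: copy the argument first if the caller's object should stay untouched.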