Creating a pipeline for feature engineering
Building a FE pipeline
Source
Code and notes come from the Udemy course “Deployment of Machine Learning Models”. See section 4 of https://github.com/trainindata/deploying-machine-learning-models
See also:
- https://www.kaggle.com/sermakarevich/sklearn-pipelines-tutorial
- https://www.kaggle.com/baghern/a-deep-dive-into-sklearn-pipelines
Instead of saving a CSV after each FE step, you can save each FE class of the package, or better, set up all the transformations within a single pipeline and save the fitted pipeline in a pickle file. This lets you apply exactly the same transformations to new data easily. In general, try to use Scikit-learn and the feature_engine package for FE transformations.
To set up the entire feature transformation within a pipeline, you have to create classes that can be used within the pipeline, e.g. to map the categorical variables. You can use these classes to build an in-house package.
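A minimal sketch of the save-and-reuse idea with joblib (price_pipe is a fitted Pipeline as in the examples below; the filename and new_data are placeholders):
import joblib

# persist the fitted pipeline to disk
joblib.dump(price_pipe, 'price_pipe.joblib')

# later, e.g. in another process: reload and apply to new data
price_pipe = joblib.load('price_pipe.joblib')
new_data_transformed = price_pipe.transform(new_data)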
Example of a FE script with a pipeline
# 1. config: define all the variables for FE

# categorical variables with NA in train set
CATEGORICAL_VARS_WITH_NA_MISSING = [
    'Alley', 'FireplaceQu', 'PoolQC', 'Fence', 'MiscFeature']

# numerical variables with NA in train set
NUMERICAL_VARS_WITH_NA = ['LotFrontage', 'MasVnrArea', 'GarageYrBlt']

# variables to log transform
NUMERICALS_LOG_VARS = ["LotFrontage", "1stFlrSF", "GrLivArea"]

# [...]

# 2. set up the pipeline
price_pipe = Pipeline([

    # ===== IMPUTATION =====
    # impute categorical variables with string 'missing'
    ('missing_imputation', CategoricalImputer(
        imputation_method='missing', variables=CATEGORICAL_VARS_WITH_NA_MISSING)),

    # impute categorical variables with the most frequent category
    ('frequent_imputation', CategoricalImputer(
        imputation_method='frequent', variables=CATEGORICAL_VARS_WITH_NA_FREQUENT)),

    # add missing indicator
    ('missing_indicator', AddMissingIndicator(variables=NUMERICAL_VARS_WITH_NA)),

    # impute numerical variables with the mean
    ('mean_imputation', MeanMedianImputer(
        imputation_method='mean', variables=NUMERICAL_VARS_WITH_NA)),

    # ===== TEMPORAL VARIABLES =====
    ('elapsed_time', pp.TemporalVariableTransformer(
        variables=TEMPORAL_VARS, reference_variable=REF_VAR)),
    ('drop_features', DropFeatures(features_to_drop=[REF_VAR])),

    # ===== VARIABLE TRANSFORMATION =====
    ('log', LogTransformer(variables=NUMERICALS_LOG_VARS)),
    ('yeojohnson', YeoJohnsonTransformer(variables=NUMERICALS_YEO_VARS)),
    ('binarizer', SklearnTransformerWrapper(
        transformer=Binarizer(threshold=0), variables=BINARIZE_VARS)),

    # ===== MAPPERS =====
    ('mapper_qual', pp.Mapper(
        variables=QUAL_VARS, mappings=QUAL_MAPPINGS)),
    ('mapper_exposure', pp.Mapper(
        variables=EXPOSURE_VARS, mappings=EXPOSURE_MAPPINGS)),
    ('mapper_finish', pp.Mapper(
        variables=FINISH_VARS, mappings=FINISH_MAPPINGS)),
    ('mapper_garage', pp.Mapper(
        variables=GARAGE_VARS, mappings=GARAGE_MAPPINGS)),
    ('mapper_fence', pp.Mapper(
        variables=FENCE_VARS, mappings=FENCE_MAPPINGS)),

    # ===== CATEGORICAL ENCODING =====
    # group rare labels under a single category
    ('rare_label_encoder', RareLabelEncoder(
        tol=0.01, n_categories=1, variables=CATEGORICAL_VARS)),

    # encode categorical and discrete variables, ordering the categories
    # by the target mean
    ('categorical_encoder', OrdinalEncoder(
        encoding_method='ordered', variables=CATEGORICAL_VARS)),
])

# 3. train the pipeline
price_pipe.fit(X_train, y_train)
X_train = price_pipe.transform(X_train)
X_test = price_pipe.transform(X_test)

# check absence of NA in the train set
[var for var in X_train.columns if X_train[var].isnull().sum() > 0]

# check absence of NA in the test set
[var for var in X_test.columns if X_test[var].isnull().sum() > 0]

# the parameters are learnt and stored in each step of the pipeline
price_pipe.named_steps['frequent_imputation'].imputer_dict_
- Another example of a pipeline, this time with the model created within the pipeline. Note that we don't call transform() on X_train or X_test in this case: predict() runs the data through all the transformation steps before the model. The target was log-transformed during FE, which is why predictions are mapped back to the original scale with np.exp before computing the metrics.
# [...] all the previous pipeline steps
    ('scaler', MinMaxScaler()),
    # ('selector', SelectFromModel(Lasso(alpha=0.001, random_state=0))),
    ('Lasso', Lasso(alpha=0.001, random_state=0)),  # linear model
])

# make predictions for train set
pred = price_pipe.predict(X_train)

# determine mse, rmse and r2
print('train mse: {}'.format(int(
    mean_squared_error(np.exp(y_train), np.exp(pred)))))
print('train rmse: {}'.format(int(
    mean_squared_error(np.exp(y_train), np.exp(pred), squared=False))))
print('train r2: {}'.format(
    r2_score(np.exp(y_train), np.exp(pred))))
print()

# make predictions for test set
pred = price_pipe.predict(X_test)

# determine mse, rmse and r2
print('test mse: {}'.format(int(
    mean_squared_error(np.exp(y_test), np.exp(pred)))))
print('test rmse: {}'.format(int(
    mean_squared_error(np.exp(y_test), np.exp(pred), squared=False))))
print('test r2: {}'.format(
    r2_score(np.exp(y_test), np.exp(pred))))
print()

print('Average house price: ', int(np.exp(y_train).median()))
Procedural Programming vs Object Oriented Programming for ML
Why and how to create a class where you store all the FE functions and FE values
Procedural Programming
Pretty straightforward: you hard-code the parameters and save multiple objects or data structures (see the sketch after the lists below).
Code:
- Learn the parameters
- Make the transformations
- Make the predictions
Data:
- Store the parameters
- Mean values, regression coefficients, etc…
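A minimal procedural sketch, assuming hypothetical variable names and hard-coded values learned from a train set:
# hypothetical means, learned once on the train set and hard-coded here
AGE_MEAN = 29.7
FARE_MEAN = 32.2

def impute_na(df):
    # apply the hard-coded parameters to any new dataframe
    df = df.copy()
    df['age'] = df['age'].fillna(AGE_MEAN)
    df['fare'] = df['fare'].fillna(FARE_MEAN)
    return df

If the model is re-trained, these constants must be updated by hand; that is exactly what the OOP approach below avoids.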
Object Oriented Programming — OOP
We write code in the form of “objects”. These objects can store data, and they can also store instructions or procedures (code) to modify that data, or do something else, like obtaining predictions.
- Data -> attributes, properties
- Code or Instructions -> methods (procedures)
So we can learn and store parameters:
- the parameters get automatically refreshed every time the model is re-trained
- no need for manual hard-coding
Methods:
- fit(): learns the parameters
- transform(): transforms the data with the learned parameters
Attributes: store the learned parameters
In order to use those, we need to create a class
Creating a class
Methods are functions defined inside a class and can only be called from an instance of that class
The first parameter will always be a variable called self.
Our fit() method learns the parameters:
class MeanImputer:

    def __init__(self, variables):
        self.variables = variables

    def fit(self, X, y=None):
        # learn and store the mean of each variable
        self.imputer_dict_ = X[self.variables].mean().to_dict()
        return self

    def transform(self, X):
        # replace NA with the stored means
        for x in self.variables:
            X[x] = X[x].fillna(self.imputer_dict_[x])
        return X

my_imputer = MeanImputer(variables=['age', 'fare'])
my_imputer.fit(my_data)
my_imputer.imputer_dict_  # dictionary with the mean of each variable
Inheritance
Inheritance is the process by which one class takes on the attributes and methods of another
parent class
class TransformerMixin:
def fit_transform(self, X, y=None):
X = self.fit(X, y).transform(X)
return X
child class
class MeanImputer(TransformerMixin):

    def __init__(self, variables):
        self.variables = variables

    def fit(self, X, y=None):
        self.imputer_dict_ = X[self.variables].mean().to_dict()
        return self

    def transform(self, X):
        for x in self.variables:
            X[x] = X[x].fillna(self.imputer_dict_[x])
        return X
You can now do:
my_imputer = MeanImputer(variables=['age', 'fare'])
data_t = my_imputer.fit_transform(my_data)
data_t.head()  # returns a DataFrame
Check the Scikit-learn API documentation: https://scikit-learn.org/stable/modules/classes.html
For example, you can inherit from TransformerMixin to get fit_transform(), which fits and then transforms whatever you pass to it. BaseEstimator provides get_params() and set_params(), which expose the parameters passed to __init__.
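A minimal sketch of what the two base classes give you (DemoTransformer is a made-up name for illustration):
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class DemoTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, variables=None):
        self.variables = variables

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X

X = pd.DataFrame({'age': [20, 30]})
demo = DemoTransformer(variables=['age'])
print(demo.get_params())  # {'variables': ['age']} <- from BaseEstimator
demo.fit_transform(X)     # fit, then transform   <- from TransformerMixin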
Create a Scikit-learn compatible transformer
If you create your own transformers, to make them compatible with Scikit-learn you need to follow this framework:
class MyTransformer(BaseEstimator, TransformerMixin):

    def __init__(self, variables):
        self.variables = variables

    def fit(self, X, y=None):
        # we need this step to fit the sklearn pipeline,
        # even if there is nothing to fit
        return self

    def transform(self, X):
        X = X.copy()
        # your code to transform X
        return X
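Such a transformer then plugs into a Pipeline like any built-in step (the step name and variable list here are illustrative):
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ('my_step', MyTransformer(variables=['LotFrontage'])),
    # [...] other steps
])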
Example:
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

# temporal elapsed time transformer
class TemporalVariableTransformer(BaseEstimator, TransformerMixin):

    def __init__(self, variables, reference_variable):
        if not isinstance(variables, list):
            raise ValueError('variables should be a list')
        self.variables = variables
        self.reference_variable = reference_variable

    def fit(self, X, y=None):
        # we need this step to fit the sklearn pipeline
        return self

    def transform(self, X):
        X = X.copy()
        for feature in self.variables:
            X[feature] = X[self.reference_variable] - X[feature]
        return X

# categorical variable mapper
class Mapper(BaseEstimator, TransformerMixin):

    def __init__(self, variables, mappings):
        if not isinstance(variables, list):
            raise ValueError('variables should be a list')
        self.variables = variables
        self.mappings = mappings

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()
        for feature in self.variables:
            X[feature] = X[feature].map(self.mappings)
        return X
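A quick standalone check of the custom transformer on toy data (the column values are made up for illustration); fit_transform() comes from TransformerMixin:
df = pd.DataFrame({'YrSold': [2010, 2009], 'YearBuilt': [2000, 1995]})
tv = TemporalVariableTransformer(
    variables=['YearBuilt'], reference_variable='YrSold')
tv.fit_transform(df)
#    YrSold  YearBuilt
# 0    2010         10
# 1    2009         14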
Example of a pipeline with open-source and in-house transformers
# 1. IMPORT LIBRARIES
# [...] pandas, etc...

# for saving the pipeline
import joblib

# from Scikit-learn
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, Binarizer

# from feature-engine
from feature_engine.imputation import (
    AddMissingIndicator,
    MeanMedianImputer,
    CategoricalImputer,
)
from feature_engine.encoding import (
    RareLabelEncoder,
    OrdinalEncoder,
)
from feature_engine.transformation import (
    LogTransformer,
    YeoJohnsonTransformer,
)
from feature_engine.selection import DropFeatures
from feature_engine.wrappers import SklearnTransformerWrapper

import preprocessors as pp  # classes we created

# 2. LOAD DATASET, CREATE TRAIN-TEST, ETC.

# 3. CONFIG
# categorical variables with NA in train set
CATEGORICAL_VARS_WITH_NA_FREQUENT = ['MasVnrType',
                                     'BsmtQual',
                                     # [...]
                                     'GarageCond']

CATEGORICAL_VARS_WITH_NA_MISSING = [
    'Alley',  # [...]
    'MiscFeature']

# numerical variables with NA in train set
NUMERICAL_VARS_WITH_NA = ['LotFrontage', 'MasVnrArea', 'GarageYrBlt']

TEMPORAL_VARS = ['YearBuilt', 'YearRemodAdd', 'GarageYrBlt']
REF_VAR = "YrSold"

# variables to log transform
NUMERICALS_LOG_VARS = ["LotFrontage", "1stFlrSF", "GrLivArea"]
NUMERICALS_YEO_VARS = ['LotArea']

# variables to map
QUAL_VARS = ['ExterQual', 'ExterCond', 'BsmtQual', 'BsmtCond', 'HeatingQC',
             'KitchenQual', 'FireplaceQu', 'GarageQual', 'GarageCond']
QUAL_MAPPINGS = {'Po': 1, 'Fa': 2, 'TA': 3,
                 'Gd': 4, 'Ex': 5, 'Missing': 0, 'NA': 0}

# 4. PIPELINE
price_pipe = Pipeline([

    # ===== IMPUTATION =====
    # impute categorical variables with string 'missing'
    ('missing_imputation', CategoricalImputer(
        imputation_method='missing', variables=CATEGORICAL_VARS_WITH_NA_MISSING)),

    # impute categorical variables with the most frequent category
    ('frequent_imputation', CategoricalImputer(
        imputation_method='frequent', variables=CATEGORICAL_VARS_WITH_NA_FREQUENT)),

    # add missing indicator
    ('missing_indicator', AddMissingIndicator(variables=NUMERICAL_VARS_WITH_NA)),

    # impute numerical variables with the mean
    ('mean_imputation', MeanMedianImputer(
        imputation_method='mean', variables=NUMERICAL_VARS_WITH_NA)),

    # ===== TEMPORAL VARIABLES =====
    ('elapsed_time', pp.TemporalVariableTransformer(
        variables=TEMPORAL_VARS, reference_variable=REF_VAR)),
    ('drop_features', DropFeatures(features_to_drop=[REF_VAR])),

    # ===== VARIABLE TRANSFORMATION =====
    ('log', LogTransformer(variables=NUMERICALS_LOG_VARS)),
    ('yeojohnson', YeoJohnsonTransformer(variables=NUMERICALS_YEO_VARS)),
    ('binarizer', SklearnTransformerWrapper(
        transformer=Binarizer(threshold=0), variables=BINARIZE_VARS)),

    # ===== MAPPERS =====
    ('mapper_qual', pp.Mapper(
        variables=QUAL_VARS, mappings=QUAL_MAPPINGS)),
    ('mapper_exposure', pp.Mapper(
        variables=EXPOSURE_VARS, mappings=EXPOSURE_MAPPINGS)),
    ('mapper_finish', pp.Mapper(
        variables=FINISH_VARS, mappings=FINISH_MAPPINGS)),
    ('mapper_garage', pp.Mapper(
        variables=GARAGE_VARS, mappings=GARAGE_MAPPINGS)),
    ('mapper_fence', pp.Mapper(
        variables=FENCE_VARS, mappings=FENCE_MAPPINGS)),

    # ===== CATEGORICAL ENCODING =====
    ('rare_label_encoder', RareLabelEncoder(
        tol=0.01, n_categories=1, variables=CATEGORICAL_VARS)),

    # encode categorical and discrete variables, ordering the categories
    # by the target mean
    ('categorical_encoder', OrdinalEncoder(
        encoding_method='ordered', variables=CATEGORICAL_VARS)),
])

# train the pipeline
price_pipe.fit(X_train, y_train)
X_train = price_pipe.transform(X_train)
X_test = price_pipe.transform(X_test)

# the parameters are learnt and stored in each step of the pipeline
price_pipe.named_steps['frequent_imputation'].imputer_dict_
Pipeline with modeling included
If you use the pipeline only for feature engineering, you call transform() at the end to push X_train and X_test through each feature-engineering step. More exactly, you fit() on X_train so that the pipeline learns the FE values, then you transform() X_train and X_test to apply those values. However, if your pipeline does feature engineering + modeling, there is no need to call transform(), as you can see below: predict() applies the transformations before running the model.
[...]

    # ===== CATEGORICAL ENCODING =====
    ('rare_label_encoder', RareLabelEncoder(
        tol=0.01, n_categories=1, variables=CATEGORICAL_VARS)),

    # encode categorical and discrete variables, ordering the categories
    # by the target mean
    ('categorical_encoder', OrdinalEncoder(
        encoding_method='ordered', variables=CATEGORICAL_VARS)),

    ('scaler', MinMaxScaler()),
    # ('selector', SelectFromModel(Lasso(alpha=0.001, random_state=0))),
    ('Lasso', Lasso(alpha=0.001, random_state=0)),
])

# train the pipeline
price_pipe.fit(X_train, y_train)

# evaluate the model:
# ====================
# make predictions for train set
pred = price_pipe.predict(X_train)

# determine mse, rmse and r2
print('train mse: {}'.format(int(
    mean_squared_error(np.exp(y_train), np.exp(pred)))))
print('train rmse: {}'.format(int(
    mean_squared_error(np.exp(y_train), np.exp(pred), squared=False))))
print('train r2: {}'.format(
    r2_score(np.exp(y_train), np.exp(pred))))
print()

# make predictions for test set
pred = price_pipe.predict(X_test)

# determine mse, rmse and r2
print('test mse: {}'.format(int(
    mean_squared_error(np.exp(y_test), np.exp(pred)))))
print('test rmse: {}'.format(int(
    mean_squared_error(np.exp(y_test), np.exp(pred), squared=False))))
print('test r2: {}'.format(
    r2_score(np.exp(y_test), np.exp(pred))))
print()

print('Average house price: ', int(np.exp(y_train).median()))
Thoughts: I wonder how practical it is to have feature engineering + modeling within the same pipeline. From what I see, it may be worth creating one pipeline for feature engineering and one for modeling, and then joining them if needed. FeatureUnion runs several transformer pipelines in parallel and concatenates their outputs, as in this example:
# source: https://www.kaggle.com/baghern/a-deep-dive-into-sklearn-pipelines
words = Pipeline([
    ('selector', NumberSelector(key='words')),
    ('standard', StandardScaler())
])
words_not_stopword = Pipeline([
    ('selector', NumberSelector(key='words_not_stopword')),
    ('standard', StandardScaler())
])
avg_word_length = Pipeline([
    ('selector', NumberSelector(key='avg_word_length')),
    ('standard', StandardScaler())
])
commas = Pipeline([
    ('selector', NumberSelector(key='commas')),
    ('standard', StandardScaler()),
])

# followed by
from sklearn.pipeline import FeatureUnion

feats = FeatureUnion([('text', text),
                      ('length', length),
                      ('words', words),
                      ('words_not_stopword', words_not_stopword),
                      ('avg_word_length', avg_word_length),
                      ('commas', commas)])

feature_processing = Pipeline([('feats', feats)])
feature_processing.fit_transform(X_train)
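Note that NumberSelector, text, and length are defined in the Kaggle notebook. As a sketch, NumberSelector presumably looks something like this: a custom transformer that picks one numeric column and returns it as a 2-D frame so StandardScaler can consume it:
from sklearn.base import BaseEstimator, TransformerMixin

class NumberSelector(BaseEstimator, TransformerMixin):
    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # return a DataFrame with the single selected column
        return X[[self.key]]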