Creating a pipeline for feature engineering

Marc Deveaux
7 min read · Dec 26, 2021

Building an FE pipeline


Code and notes come from the Udemy course "Deployment of Machine Learning Models". See section 4 of https://github.com/trainindata/deploying-machine-learning-models


Instead of saving intermediate CSV files, you can save each feature engineering (FE) step as a class in a package. You can simplify things further by setting up all transformations within a single pipeline and saving the fitted result in a pickle file, which lets you apply the exact same transformations to new data easily. In general, try to use Scikit-learn and the feature_engine package for FE transformations.
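
For instance, a minimal sketch of persisting a fitted pipeline with joblib (here pipe, the file name, and X_new are placeholders for your own objects):

import joblib

# persist the fitted pipeline to a pickle file
joblib.dump(pipe, 'fe_pipeline.pkl')

# later: reload it and apply the exact same transformations to new data
pipe = joblib.load('fe_pipeline.pkl')
X_new_t = pipe.transform(X_new)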

In order to set up the entire feature transformation within a pipeline, you have to create classes that can be used as pipeline steps, for example to map the categorical variables. You can use this approach to build an in-house package.

Example of an FE script with a pipeline

# 1. config: define all the variables for FE
CATEGORICAL_VARS_WITH_NA_MISSING = [
    'Alley', 'FireplaceQu', 'PoolQC', 'Fence', 'MiscFeature']

# numerical variables with NA in train set
NUMERICAL_VARS_WITH_NA = ['LotFrontage', 'MasVnrArea', 'GarageYrBlt']

# variables to log transform
NUMERICALS_LOG_VARS = ["LotFrontage", "1stFlrSF", "GrLivArea"]
# [...]

# 2. set up the pipeline
price_pipe = Pipeline([

    # ===== IMPUTATION =====
    # impute categorical variables with the string 'Missing'
    ('missing_imputation', CategoricalImputer(
        imputation_method='missing', variables=CATEGORICAL_VARS_WITH_NA_MISSING)),
    # impute categorical variables with the most frequent category
    ('frequent_imputation', CategoricalImputer(
        imputation_method='frequent', variables=CATEGORICAL_VARS_WITH_NA_FREQUENT)),
    # add missing indicator to numerical variables
    ('missing_indicator', AddMissingIndicator(variables=NUMERICAL_VARS_WITH_NA)),
    # impute numerical variables with the mean
    ('mean_imputation', MeanMedianImputer(
        imputation_method='mean', variables=NUMERICAL_VARS_WITH_NA)),

    # ===== TEMPORAL VARIABLES =====
    ('elapsed_time', pp.TemporalVariableTransformer(
        variables=TEMPORAL_VARS, reference_variable=REF_VAR)),
    ('drop_features', DropFeatures(features_to_drop=[REF_VAR])),

    # ===== VARIABLE TRANSFORMATION =====
    ('log', LogTransformer(variables=NUMERICALS_LOG_VARS)),
    ('yeojohnson', YeoJohnsonTransformer(variables=NUMERICALS_YEO_VARS)),
    ('binarizer', SklearnTransformerWrapper(
        transformer=Binarizer(threshold=0), variables=BINARIZE_VARS)),

    # ===== MAPPERS =====
    ('mapper_qual', pp.Mapper(
        variables=QUAL_VARS, mappings=QUAL_MAPPINGS)),
    ('mapper_exposure', pp.Mapper(
        variables=EXPOSURE_VARS, mappings=EXPOSURE_MAPPINGS)),
    ('mapper_finish', pp.Mapper(
        variables=FINISH_VARS, mappings=FINISH_MAPPINGS)),
    ('mapper_garage', pp.Mapper(
        variables=GARAGE_VARS, mappings=GARAGE_MAPPINGS)),
    ('mapper_fence', pp.Mapper(
        variables=FENCE_VARS, mappings=FENCE_MAPPINGS)),

    # ===== CATEGORICAL ENCODING =====
    # group rare labels under a single label
    ('rare_label_encoder', RareLabelEncoder(
        tol=0.01, n_categories=1, variables=CATEGORICAL_VARS)),
    # encode categorical and discrete variables, ordered by the target mean
    ('categorical_encoder', OrdinalEncoder(
        encoding_method='ordered', variables=CATEGORICAL_VARS)),
])

# 3. train the pipeline
price_pipe.fit(X_train, y_train)
X_train = price_pipe.transform(X_train)
X_test = price_pipe.transform(X_test)

# check absence of na in the train set
[var for var in X_train.columns if X_train[var].isnull().sum() > 0]
# check absence of na in the test set
[var for var in X_test.columns if X_test[var].isnull().sum() > 0]

# the parameters are learnt and stored in each step of the pipeline
price_pipe.named_steps['frequent_imputation'].imputer_dict_
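
Other steps expose their learned parameters the same way. For instance, a couple of attributes you can inspect (attribute names follow the feature_engine convention; this assumes the pipeline above was fitted):

# means learned by the numerical imputer
price_pipe.named_steps['mean_imputation'].imputer_dict_
# frequent categories retained per variable by the rare label encoder
price_pipe.named_steps['rare_label_encoder'].encoder_dict_
# ordinal mapping learned from the target for each categorical variable
price_pipe.named_steps['categorical_encoder'].encoder_dict_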
  • Another example of a pipeline, this time with the model created inside the pipeline. Note that we don't call transform() on X_train or X_test in this case:
# [...] all the previous pipeline steps
    ('scaler', MinMaxScaler()),
    # ('selector', SelectFromModel(Lasso(alpha=0.001, random_state=0))),
    ('Lasso', Lasso(alpha=0.001, random_state=0)),  # linear model
])

# make predictions for train set
pred = price_pipe.predict(X_train)

# determine mse, rmse and r2 (the target was log transformed, hence np.exp)
print('train mse: {}'.format(int(
    mean_squared_error(np.exp(y_train), np.exp(pred)))))
print('train rmse: {}'.format(int(
    mean_squared_error(np.exp(y_train), np.exp(pred), squared=False))))
print('train r2: {}'.format(
    r2_score(np.exp(y_train), np.exp(pred))))
print()

# make predictions for test set
pred = price_pipe.predict(X_test)

# determine mse, rmse and r2
print('test mse: {}'.format(int(
    mean_squared_error(np.exp(y_test), np.exp(pred)))))
print('test rmse: {}'.format(int(
    mean_squared_error(np.exp(y_test), np.exp(pred), squared=False))))
print('test r2: {}'.format(
    r2_score(np.exp(y_test), np.exp(pred))))
print()

print('Average house price: ', int(np.exp(y_train).median()))

Procedural Programming vs Object Oriented Programming for ML

Why and how to create a class where you store all the FE functions and FE values

Procedural Programming

Pretty straightforward: you hard-code the parameters and save multiple objects or data structures (a minimal sketch follows the lists below).

Code:

  • Learn the parameters
  • Make the transformations
  • Make the predictions

Data:

  • Store the parameters
  • Mean values, regression coefficients, etc…
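
For illustration, a minimal procedural sketch along these lines (the variable and file names are hypothetical):

import joblib

# learn the parameter
age_mean = X_train['age'].mean()

# store it as a standalone object on disk
joblib.dump(age_mean, 'age_mean.pkl')

# at prediction time: reload the parameter and hard-code the transformation
age_mean = joblib.load('age_mean.pkl')
X_new['age'] = X_new['age'].fillna(age_mean)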

Object Oriented Programming — OOP

We write code in the form of “objects”. These objects can store data, and they can also store instructions or procedures (code) to modify that data or do something else, like obtaining predictions.

  • Data -> attributes, properties
  • Code or Instructions -> methods (procedures)

So we can learn and store parameters:

  • parameters get automatically refreshed every time the model is re-trained
  • no need for manual hard-coding

Methods:

  • Fit: learns the parameters
  • Transform: transforms the data with the learned parameters

Attributes: store the learned parameters

In order to use these methods and attributes, we need to create a class.

Creating a class

Methods are functions defined inside a class, and they can only be called from an instance of that class.

The first parameter will always be a variable called self.

Our fit() method learns the parameters:

class MeanImputer:
    def __init__(self, variables):
        self.variables = variables

    def fit(self, X, y=None):
        # learn and store the mean of each variable
        self.imputer_dict_ = X[self.variables].mean().to_dict()
        return self

    def transform(self, X):
        # replace missing values with the stored means
        for x in self.variables:
            X[x] = X[x].fillna(self.imputer_dict_[x])
        return X

my_imputer = MeanImputer(variables=['age', 'fare'])
my_imputer.fit(my_data)
my_imputer.imputer_dict_  # dictionary with the mean of each variable

Inheritance

Inheritance is the process by which one class takes on the attributes and methods of another.

The parent class:

class TransformerMixin:
    def fit_transform(self, X, y=None):
        X = self.fit(X, y).transform(X)
        return X

The child class:

class MeanImputer(TransformerMixin):
    def __init__(self, variables):
        self.variables = variables

    def fit(self, X, y=None):
        self.imputer_dict_ = X[self.variables].mean().to_dict()
        return self

    def transform(self, X):
        for x in self.variables:
            X[x] = X[x].fillna(self.imputer_dict_[x])
        return X

You can now do:

my_imputer = MeanImputer(variables=['age', 'fare'])
data_t = my_imputer.fit_transform(my_data)
data_t.head()  # returns a DataFrame

Check the Scikit-learn API documentation: https://scikit-learn.org/stable/modules/classes.html

For example, you can inherit from TransformerMixin to get fit_transform(), which fits and then transforms whatever you pass to it. BaseEstimator provides get_params() and set_params(), which give you the list of parameters passed at initialization.
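
A quick sketch of what BaseEstimator buys you, re-using the MeanImputer from above:

from sklearn.base import BaseEstimator, TransformerMixin

class MeanImputer(BaseEstimator, TransformerMixin):
    def __init__(self, variables):
        self.variables = variables

    def fit(self, X, y=None):
        self.imputer_dict_ = X[self.variables].mean().to_dict()
        return self

    def transform(self, X):
        X = X.copy()
        for x in self.variables:
            X[x] = X[x].fillna(self.imputer_dict_[x])
        return X

my_imputer = MeanImputer(variables=['age', 'fare'])
my_imputer.get_params()  # {'variables': ['age', 'fare']}, courtesy of BaseEstimator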

Creating a Scikit-learn compatible transformer

If you create your own transformers, you need to follow this framework to make them compatible with Scikit-learn:

class MyTransformer(BaseEstimator, TransformerMixin):

    def __init__(self, variables):
        self.variables = variables

    def fit(self, X, y=None):
        # we need this step to fit the sklearn pipeline,
        # even if there is nothing to learn
        return self

    def transform(self, X):
        X = X.copy()
        # your code to transform X
        return X

Example:

import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

# temporal elapsed time transformer
class TemporalVariableTransformer(BaseEstimator, TransformerMixin):

    def __init__(self, variables, reference_variable):
        if not isinstance(variables, list):
            raise ValueError('variables should be a list')

        self.variables = variables
        self.reference_variable = reference_variable

    def fit(self, X, y=None):
        # we need this step to fit the sklearn pipeline
        return self

    def transform(self, X):
        X = X.copy()
        for feature in self.variables:
            X[feature] = X[self.reference_variable] - X[feature]
        return X

# categorical variable mapper
class Mapper(BaseEstimator, TransformerMixin):

    def __init__(self, variables, mappings):
        if not isinstance(variables, list):
            raise ValueError('variables should be a list')

        self.variables = variables
        self.mappings = mappings

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()
        for feature in self.variables:
            X[feature] = X[feature].map(self.mappings)
        return X
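
A quick usage sketch on a made-up DataFrame (the data below is purely illustrative):

df = pd.DataFrame({
    'YrSold': [2010, 2011],
    'YearBuilt': [2000, 1990],
    'ExterQual': ['Gd', 'TA'],
})

# elapsed years between the sale and the construction
elapsed = TemporalVariableTransformer(
    variables=['YearBuilt'], reference_variable='YrSold')
df = elapsed.fit_transform(df)  # YearBuilt becomes 10 and 21

# map quality labels to ordered integers
quality = Mapper(
    variables=['ExterQual'],
    mappings={'Po': 1, 'Fa': 2, 'TA': 3, 'Gd': 4, 'Ex': 5})
df = quality.fit_transform(df)  # ExterQual becomes 4 and 3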

Example of a pipeline with open-source and in-house transformers

# 1. IMPORT LIBRARIES
# [...] pandas, etc...

# for saving the pipeline
import joblib

# from Scikit-learn
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, Binarizer

# from feature-engine
from feature_engine.imputation import (
    AddMissingIndicator,
    MeanMedianImputer,
    CategoricalImputer,
)
from feature_engine.encoding import (
    RareLabelEncoder,
    OrdinalEncoder,
)
from feature_engine.transformation import (
    LogTransformer,
    YeoJohnsonTransformer,
)
from feature_engine.selection import DropFeatures
from feature_engine.wrappers import SklearnTransformerWrapper

import preprocessors as pp  # the classes we created

# 2. LOAD DATASET, CREATE TRAIN-TEST, ETC.

# 3. CONFIG
# categorical variables with NA in train set
CATEGORICAL_VARS_WITH_NA_FREQUENT = ['MasVnrType',
                                     'BsmtQual',
                                     # [...]
                                     'GarageCond']
CATEGORICAL_VARS_WITH_NA_MISSING = [
    'Alley',  # [...]
    'MiscFeature']

# numerical variables with NA in train set
NUMERICAL_VARS_WITH_NA = ['LotFrontage', 'MasVnrArea', 'GarageYrBlt']

TEMPORAL_VARS = ['YearBuilt', 'YearRemodAdd', 'GarageYrBlt']
REF_VAR = "YrSold"

# variables to log transform
NUMERICALS_LOG_VARS = ["LotFrontage", "1stFlrSF", "GrLivArea"]
NUMERICALS_YEO_VARS = ['LotArea']

# variables to map
QUAL_VARS = ['ExterQual', 'ExterCond', 'BsmtQual', 'BsmtCond', 'HeatingQC',
             'KitchenQual', 'FireplaceQu', 'GarageQual', 'GarageCond']
QUAL_MAPPINGS = {'Po': 1, 'Fa': 2, 'TA': 3,
                 'Gd': 4, 'Ex': 5, 'Missing': 0, 'NA': 0}
# 4. PIPELINE
price_pipe = Pipeline([

    # ===== IMPUTATION =====
    # impute categorical variables with the string 'Missing'
    ('missing_imputation', CategoricalImputer(
        imputation_method='missing', variables=CATEGORICAL_VARS_WITH_NA_MISSING)),
    ('frequent_imputation', CategoricalImputer(
        imputation_method='frequent', variables=CATEGORICAL_VARS_WITH_NA_FREQUENT)),
    # add missing indicator
    ('missing_indicator', AddMissingIndicator(variables=NUMERICAL_VARS_WITH_NA)),
    # impute numerical variables with the mean
    ('mean_imputation', MeanMedianImputer(
        imputation_method='mean', variables=NUMERICAL_VARS_WITH_NA)),

    # ===== TEMPORAL VARIABLES =====
    ('elapsed_time', pp.TemporalVariableTransformer(
        variables=TEMPORAL_VARS, reference_variable=REF_VAR)),
    ('drop_features', DropFeatures(features_to_drop=[REF_VAR])),

    # ===== VARIABLE TRANSFORMATION =====
    ('log', LogTransformer(variables=NUMERICALS_LOG_VARS)),
    ('yeojohnson', YeoJohnsonTransformer(variables=NUMERICALS_YEO_VARS)),
    ('binarizer', SklearnTransformerWrapper(
        transformer=Binarizer(threshold=0), variables=BINARIZE_VARS)),

    # ===== MAPPERS =====
    ('mapper_qual', pp.Mapper(
        variables=QUAL_VARS, mappings=QUAL_MAPPINGS)),
    ('mapper_exposure', pp.Mapper(
        variables=EXPOSURE_VARS, mappings=EXPOSURE_MAPPINGS)),
    ('mapper_finish', pp.Mapper(
        variables=FINISH_VARS, mappings=FINISH_MAPPINGS)),
    ('mapper_garage', pp.Mapper(
        variables=GARAGE_VARS, mappings=GARAGE_MAPPINGS)),
    ('mapper_fence', pp.Mapper(
        variables=FENCE_VARS, mappings=FENCE_MAPPINGS)),

    # ===== CATEGORICAL ENCODING =====
    ('rare_label_encoder', RareLabelEncoder(
        tol=0.01, n_categories=1, variables=CATEGORICAL_VARS)),
    # encode categorical and discrete variables, ordered by the target mean
    ('categorical_encoder', OrdinalEncoder(
        encoding_method='ordered', variables=CATEGORICAL_VARS)),
])

# train the pipeline
price_pipe.fit(X_train, y_train)
X_train = price_pipe.transform(X_train)
X_test = price_pipe.transform(X_test)

# the parameters are learnt and stored in each step of the pipeline
price_pipe.named_steps['frequent_imputation'].imputer_dict_

Pipeline with modeling included

If you use a pipeline only for feature engineering, you call transform() at the end to pass X_train and X_test through each feature engineering step. More precisely, you fit() on X_train so that the pipeline learns the FE parameters, then you transform() X_train and X_test to apply them. However, if your pipeline does feature engineering + modeling, there is no need to call transform(), as you can see below:

[...]
    # ===== CATEGORICAL ENCODING =====
    ('rare_label_encoder', RareLabelEncoder(
        tol=0.01, n_categories=1, variables=CATEGORICAL_VARS)),
    # encode categorical and discrete variables using the target mean
    ('categorical_encoder', OrdinalEncoder(
        encoding_method='ordered', variables=CATEGORICAL_VARS)),

    ('scaler', MinMaxScaler()),
    # ('selector', SelectFromModel(Lasso(alpha=0.001, random_state=0))),
    ('Lasso', Lasso(alpha=0.001, random_state=0)),
])

# train the pipeline
price_pipe.fit(X_train, y_train)

# evaluate the model:
# ====================
# make predictions for train set
pred = price_pipe.predict(X_train)

# determine mse, rmse and r2
print('train mse: {}'.format(int(
    mean_squared_error(np.exp(y_train), np.exp(pred)))))
print('train rmse: {}'.format(int(
    mean_squared_error(np.exp(y_train), np.exp(pred), squared=False))))
print('train r2: {}'.format(
    r2_score(np.exp(y_train), np.exp(pred))))
print()

# make predictions for test set
pred = price_pipe.predict(X_test)

# determine mse, rmse and r2
print('test mse: {}'.format(int(
    mean_squared_error(np.exp(y_test), np.exp(pred)))))
print('test rmse: {}'.format(int(
    mean_squared_error(np.exp(y_test), np.exp(pred), squared=False))))
print('test r2: {}'.format(
    r2_score(np.exp(y_test), np.exp(pred))))
print()

print('Average house price: ', int(np.exp(y_train).median()))

Thoughts: I wonder how practical it is to have feature engineering + modeling within the same pipeline. From what I have seen, it may be worth creating one pipeline for feature engineering and another for modeling, then joining them if needed using FeatureUnion:

# source: https://www.kaggle.com/baghern/a-deep-dive-into-sklearn-pipelines
# NumberSelector, and the 'text' and 'length' pipelines used below,
# are defined in the source notebook
words = Pipeline([
    ('selector', NumberSelector(key='words')),
    ('standard', StandardScaler())
])
words_not_stopword = Pipeline([
    ('selector', NumberSelector(key='words_not_stopword')),
    ('standard', StandardScaler())
])
avg_word_length = Pipeline([
    ('selector', NumberSelector(key='avg_word_length')),
    ('standard', StandardScaler())
])
commas = Pipeline([
    ('selector', NumberSelector(key='commas')),
    ('standard', StandardScaler())
])

# combine all the parallel feature pipelines into a single transformer
from sklearn.pipeline import FeatureUnion
feats = FeatureUnion([('text', text),
                      ('length', length),
                      ('words', words),
                      ('words_not_stopword', words_not_stopword),
                      ('avg_word_length', avg_word_length),
                      ('commas', commas)])

feature_processing = Pipeline([('feats', feats)])
feature_processing.fit_transform(X_train)
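
To then join this feature engineering with a model, a minimal sketch (the Lasso here is an arbitrary choice of final estimator):

from sklearn.linear_model import Lasso

model_pipe = Pipeline([
    ('feats', feats),               # the FeatureUnion defined above
    ('model', Lasso(alpha=0.001)),  # any estimator as the final step
])
model_pipe.fit(X_train, y_train)
pred = model_pipe.predict(X_test)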
