Creating a pipeline for feature engineering
Building a FE pipeline
Source
Code and notes come from the Udemy course “Deployment of Machine Learning Models”. See section 4 of https://github.com/trainindata/deploying-machine-learning-models
See also:
- https://www.kaggle.com/sermakarevich/sklearn-pipelines-tutorial
- https://www.kaggle.com/baghern/a-deep-dive-into-sklearn-pipelines
Instead of saving a CSV after each FE step, you can save each FE class of the package, or better, set up all the transformations within a single pipeline and save the fitted pipeline in a pickle file. This lets you apply exactly the same transformations to new data easily. In general, try to use Scikit-learn and the feature_engine package for FE transformations.
To set up the entire feature transformation within a pipeline, you have to create classes that can be used within the pipeline, e.g. to map the categorical variables. You can use these classes to build an in-house package.
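A minimal sketch of the save-and-reuse idea with joblib (price_pipe is a fitted Pipeline as in the examples below; the filename and new_data are placeholders):
import joblib

# persist the fitted pipeline to disk
joblib.dump(price_pipe, 'price_pipe.joblib')

# later, e.g. in another process: reload and apply to new data
price_pipe = joblib.load('price_pipe.joblib')
new_data_transformed = price_pipe.transform(new_data)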
Example of a FE script with a pipeline
# 1. config: define all the variables for FE

# categorical variables with NA in train set
CATEGORICAL_VARS_WITH_NA_MISSING = [
    'Alley', 'FireplaceQu', 'PoolQC', 'Fence', 'MiscFeature']

# numerical variables with NA in train set
NUMERICAL_VARS_WITH_NA = ['LotFrontage', 'MasVnrArea', 'GarageYrBlt']

# variables to log transform
NUMERICALS_LOG_VARS = ["LotFrontage", "1stFlrSF", "GrLivArea"]

# [...]

# 2. set up the pipeline
price_pipe = Pipeline([

    # ===== IMPUTATION =====
    # impute categorical variables with string 'missing'
    ('missing_imputation', CategoricalImputer(
        imputation_method='missing', variables=CATEGORICAL_VARS_WITH_NA_MISSING)),

    # impute categorical variables with the most frequent category
    ('frequent_imputation', CategoricalImputer(
        imputation_method='frequent', variables=CATEGORICAL_VARS_WITH_NA_FREQUENT)),

    # add missing indicator
    ('missing_indicator', AddMissingIndicator(variables=NUMERICAL_VARS_WITH_NA)),

    # impute numerical variables with the mean
    ('mean_imputation', MeanMedianImputer(
        imputation_method='mean', variables=NUMERICAL_VARS_WITH_NA)),

    # ===== TEMPORAL VARIABLES =====
    ('elapsed_time', pp.TemporalVariableTransformer(
        variables=TEMPORAL_VARS, reference_variable=REF_VAR)),
    ('drop_features', DropFeatures(features_to_drop=[REF_VAR])),

    # ===== VARIABLE TRANSFORMATION =====
    ('log', LogTransformer(variables=NUMERICALS_LOG_VARS)),
    ('yeojohnson', YeoJohnsonTransformer(variables=NUMERICALS_YEO_VARS)),
    ('binarizer', SklearnTransformerWrapper(
        transformer=Binarizer(threshold=0), variables=BINARIZE_VARS)),

    # ===== MAPPERS =====
    ('mapper_qual', pp.Mapper(
        variables=QUAL_VARS, mappings=QUAL_MAPPINGS)),
    ('mapper_exposure', pp.Mapper(
        variables=EXPOSURE_VARS, mappings=EXPOSURE_MAPPINGS)),
    ('mapper_finish', pp.Mapper(
        variables=FINISH_VARS, mappings=FINISH_MAPPINGS)),
    ('mapper_garage', pp.Mapper(
        variables=GARAGE_VARS, mappings=GARAGE_MAPPINGS)),
    ('mapper_fence', pp.Mapper(
        variables=FENCE_VARS, mappings=FENCE_MAPPINGS)),

    # ===== CATEGORICAL ENCODING =====
    # group rare labels under a single category
    ('rare_label_encoder', RareLabelEncoder(
        tol=0.01, n_categories=1, variables=CATEGORICAL_VARS)),

    # encode categorical and discrete variables, ordering the categories
    # by the target mean
    ('categorical_encoder', OrdinalEncoder(
        encoding_method='ordered', variables=CATEGORICAL_VARS)),
])

# 3. train the pipeline
price_pipe.fit(X_train, y_train)
X_train = price_pipe.transform(X_train)
X_test = price_pipe.transform(X_test)

# check absence of NA in the train set
[var for var in X_train.columns if X_train[var].isnull().sum() > 0]

# check absence of NA in the test set
[var for var in X_test.columns if X_test[var].isnull().sum() > 0]

# the parameters are learnt and stored in each step of the pipeline
price_pipe.named_steps['frequent_imputation'].imputer_dict_
- Another example of a pipeline, this time with the model created within the pipeline. Note that we don't call transform() on X_train or X_test in this case: predict() runs the data through all the transformation steps before the model. The target was log-transformed during FE, which is why predictions are mapped back to the original scale with np.exp before computing the metrics.
# [...] all the previous pipeline steps
    ('scaler', MinMaxScaler()),
    # ('selector', SelectFromModel(Lasso(alpha=0.001, random_state=0))),
    ('Lasso', Lasso(alpha=0.001, random_state=0)),  # linear model
])

# make predictions for train set
pred = price_pipe.predict(X_train)

# determine mse, rmse and r2
print('train mse: {}'.format(int(
    mean_squared_error(np.exp(y_train), np.exp(pred)))))
print('train rmse: {}'.format(int(
    mean_squared_error(np.exp(y_train), np.exp(pred), squared=False))))
print('train r2: {}'.format(
    r2_score(np.exp(y_train), np.exp(pred))))
print()

# make predictions for test set
pred = price_pipe.predict(X_test)

# determine mse, rmse and r2
print('test mse: {}'.format(int(
    mean_squared_error(np.exp(y_test), np.exp(pred)))))
print('test rmse: {}'.format(int(
    mean_squared_error(np.exp(y_test), np.exp(pred), squared=False))))
print('test r2: {}'.format(
    r2_score(np.exp(y_test), np.exp(pred))))
print()

print('Average house price: ', int(np.exp(y_train).median()))
Procedural Programming vs Object Oriented Programming for ML
Why and how to create a class where you store all the FE functions and FE values
Procedural Programming
Pretty straightforward: you hard-code the parameters and save multiple objects or data structures (see the sketch after the lists below).
Code:
- Learn the parameters
- Make the transformations
- Make the predictions
Data:
- Store the parameters
- Mean values, regression coefficients, etc…
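A minimal procedural sketch, assuming hypothetical variable names and hard-coded values learned from a train set:
# hypothetical means, learned once on the train set and hard-coded here
AGE_MEAN = 29.7
FARE_MEAN = 32.2

def impute_na(df):
    # apply the hard-coded parameters to any new dataframe
    df = df.copy()
    df['age'] = df['age'].fillna(AGE_MEAN)
    df['fare'] = df['fare'].fillna(FARE_MEAN)
    return df

If the model is re-trained, these constants must be updated by hand; that is exactly what the OOP approach below avoids.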
Object Oriented Programming — OOP
We write code in the form of “objects”. These objects can store data, and they can also store instructions or procedures (code) to modify that data, or do something else, like obtaining predictions.
- Data -> attributes, properties
- Code or Instructions -> methods (procedures)
So we can learn and store parameters:
- the parameters get automatically refreshed every time the model is re-trained
- no need for manual hard-coding
Methods:
- fit(): learns the parameters
- transform(): transforms the data with the learned parameters
Attributes: store the learned parameters
In order to use those, we need to create a class
Creating a class
Methods are functions defined inside a class and can only be called from an instance of that class
The first parameter will always be a variable called self.
Our fit() method learns the parameters:
class MeanImputer:

    def __init__(self, variables):
        self.variables = variables

    def fit(self, X, y=None):
        # learn and store the mean of each variable
        self.imputer_dict_ = X[self.variables].mean().to_dict()
        return self

    def transform(self, X):
        # replace NA with the stored means
        for x in self.variables:
            X[x] = X[x].fillna(self.imputer_dict_[x])
        return X

my_imputer = MeanImputer(variables=['age', 'fare'])
my_imputer.fit(my_data)
my_imputer.imputer_dict_  # dictionary with the mean of each variable
Inheritance
Inheritance is the process by which one class takes on the attributes and methods of another
parent class
class TransformerMixin:
def fit_transform(self, X, y=None):
X = self.fit(X, y).transform(X)
return X
child class
class MeanImputer(TransformerMixin):

    def __init__(self, variables):
        self.variables = variables

    def fit(self, X, y=None):
        self.imputer_dict_ = X[self.variables].mean().to_dict()
        return self

    def transform(self, X):
        for x in self.variables:
            X[x] = X[x].fillna(self.imputer_dict_[x])
        return X
You can now do:
my_imputer = MeanImputer(variables=['age', 'fare'])
data_t = my_imputer.fit_transform(my_data)
data_t.head()  # returns a DataFrame
Check the Scikit-learn API documentation: https://scikit-learn.org/stable/modules/classes.html
For example, you can inherit from TransformerMixin to get fit_transform(), which fits and then transforms whatever you pass to it. BaseEstimator provides get_params() and set_params(), which expose the parameters passed to __init__.
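A minimal sketch of what the two base classes give you (DemoTransformer is a made-up name for illustration):
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class DemoTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, variables=None):
        self.variables = variables

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X

X = pd.DataFrame({'age': [20, 30]})
demo = DemoTransformer(variables=['age'])
print(demo.get_params())  # {'variables': ['age']} <- from BaseEstimator
demo.fit_transform(X)     # fit, then transform   <- from TransformerMixin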
Create a Scikit-learn compatible transformer
If you create your own transformers, to make them compatible with Scikit-learn you need to follow this framework:
class MyTransformer(BaseEstimator, TransformerMixin):

    def __init__(self, variables):
        self.variables = variables

    def fit(self, X, y=None):
        # we need this step to fit the sklearn pipeline,
        # even if there is nothing to fit
        return self

    def transform(self, X):
        X = X.copy()
        # your code to transform X
        return X
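Such a transformer then plugs into a Pipeline like any built-in step (the step name and variable list here are illustrative):
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ('my_step', MyTransformer(variables=['LotFrontage'])),
    # [...] other steps
])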
Example:
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

# temporal elapsed time transformer
class TemporalVariableTransformer(BaseEstimator, TransformerMixin):

    def __init__(self, variables, reference_variable):
        if not isinstance(variables, list):
            raise ValueError('variables should be a list')
        self.variables = variables
        self.reference_variable = reference_variable

    def fit(self, X, y=None):
        # we need this step to fit the sklearn pipeline
        return self

    def transform(self, X):
        X = X.copy()
        for feature in self.variables:
            X[feature] = X[self.reference_variable] - X[feature]
        return X

# categorical variable mapper
class Mapper(BaseEstimator, TransformerMixin):

    def __init__(self, variables, mappings):
        if not isinstance(variables, list):
            raise ValueError('variables should be a list')
        self.variables = variables
        self.mappings = mappings

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()
        for feature in self.variables:
            X[feature] = X[feature].map(self.mappings)
        return X
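A quick standalone check of the custom transformer on toy data (the column values are made up for illustration); fit_transform() comes from TransformerMixin:
df = pd.DataFrame({'YrSold': [2010, 2009], 'YearBuilt': [2000, 1995]})
tv = TemporalVariableTransformer(
    variables=['YearBuilt'], reference_variable='YrSold')
tv.fit_transform(df)
#    YrSold  YearBuilt
# 0    2010         10
# 1    2009         14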
Example of a pipeline with open-source and in-house transformers
# 1. IMPORT LIBRARIES
# [...] pandas, etc...

# for saving the pipeline
import joblib

# from Scikit-learn
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, Binarizer

# from feature-engine
from feature_engine.imputation import (
    AddMissingIndicator,
    MeanMedianImputer,
    CategoricalImputer,
)
from feature_engine.encoding import (
    RareLabelEncoder,
    OrdinalEncoder,
)
from feature_engine.transformation import (
    LogTransformer,
    YeoJohnsonTransformer,
)
from feature_engine.selection import DropFeatures
from feature_engine.wrappers import SklearnTransformerWrapper

import preprocessors as pp  # classes we created

# 2. LOAD DATASET, CREATE TRAIN-TEST, ETC.

# 3. CONFIG
# categorical variables with NA in train set
CATEGORICAL_VARS_WITH_NA_FREQUENT = ['MasVnrType',
                                     'BsmtQual',
                                     # [...]
                                     'GarageCond']

CATEGORICAL_VARS_WITH_NA_MISSING = [
    'Alley',  # [...]
    'MiscFeature']

# numerical variables with NA in train set
NUMERICAL_VARS_WITH_NA = ['LotFrontage', 'MasVnrArea', 'GarageYrBlt']

TEMPORAL_VARS = ['YearBuilt', 'YearRemodAdd', 'GarageYrBlt']
REF_VAR = "YrSold"

# variables to log transform
NUMERICALS_LOG_VARS = ["LotFrontage", "1stFlrSF", "GrLivArea"]
NUMERICALS_YEO_VARS = ['LotArea']

# variables to map
QUAL_VARS = ['ExterQual', 'ExterCond', 'BsmtQual', 'BsmtCond', 'HeatingQC',
             'KitchenQual', 'FireplaceQu', 'GarageQual', 'GarageCond']
QUAL_MAPPINGS = {'Po': 1, 'Fa': 2, 'TA': 3,
                 'Gd': 4, 'Ex': 5, 'Missing': 0, 'NA': 0}

# 4. PIPELINE
price_pipe = Pipeline([

    # ===== IMPUTATION =====
    # impute categorical variables with string 'missing'
    ('missing_imputation', CategoricalImputer(
        imputation_method='missing', variables=CATEGORICAL_VARS_WITH_NA_MISSING)),

    # impute categorical variables with the most frequent category
    ('frequent_imputation', CategoricalImputer(
        imputation_method='frequent', variables=CATEGORICAL_VARS_WITH_NA_FREQUENT)),

    # add missing indicator
    ('missing_indicator', AddMissingIndicator(variables=NUMERICAL_VARS_WITH_NA)),

    # impute numerical variables with the mean
    ('mean_imputation', MeanMedianImputer(
        imputation_method='mean', variables=NUMERICAL_VARS_WITH_NA)),

    # ===== TEMPORAL VARIABLES =====
    ('elapsed_time', pp.TemporalVariableTransformer(
        variables=TEMPORAL_VARS, reference_variable=REF_VAR)),
    ('drop_features', DropFeatures(features_to_drop=[REF_VAR])),

    # ===== VARIABLE TRANSFORMATION =====
    ('log', LogTransformer(variables=NUMERICALS_LOG_VARS)),
    ('yeojohnson', YeoJohnsonTransformer(variables=NUMERICALS_YEO_VARS)),
    ('binarizer', SklearnTransformerWrapper(
        transformer=Binarizer(threshold=0), variables=BINARIZE_VARS)),

    # ===== MAPPERS =====
    ('mapper_qual', pp.Mapper(
        variables=QUAL_VARS, mappings=QUAL_MAPPINGS)),
    ('mapper_exposure', pp.Mapper(
        variables=EXPOSURE_VARS, mappings=EXPOSURE_MAPPINGS)),
    ('mapper_finish', pp.Mapper(
        variables=FINISH_VARS, mappings=FINISH_MAPPINGS)),
    ('mapper_garage', pp.Mapper(
        variables=GARAGE_VARS, mappings=GARAGE_MAPPINGS)),
    ('mapper_fence', pp.Mapper(
        variables=FENCE_VARS, mappings=FENCE_MAPPINGS)),

    # ===== CATEGORICAL ENCODING =====
    ('rare_label_encoder', RareLabelEncoder(
        tol=0.01, n_categories=1, variables=CATEGORICAL_VARS)),

    # encode categorical and discrete variables, ordering the categories
    # by the target mean
    ('categorical_encoder', OrdinalEncoder(
        encoding_method='ordered', variables=CATEGORICAL_VARS)),
])

# train the pipeline
price_pipe.fit(X_train, y_train)
X_train = price_pipe.transform(X_train)
X_test = price_pipe.transform(X_test)

# the parameters are learnt and stored in each step of the pipeline
price_pipe.named_steps['frequent_imputation'].imputer_dict_
Pipeline with modeling included
If you use the pipeline only for feature engineering, you call transform() at the end to push X_train and X_test through each feature-engineering step. More exactly, you fit() on X_train so that the pipeline learns the FE values, then you transform() X_train and X_test to apply those values. However, if your pipeline does feature engineering + modeling, there is no need to call transform(), as you can see below: predict() applies the transformations before running the model.
[...]

    # ===== CATEGORICAL ENCODING =====
    ('rare_label_encoder', RareLabelEncoder(
        tol=0.01, n_categories=1, variables=CATEGORICAL_VARS)),

    # encode categorical and discrete variables, ordering the categories
    # by the target mean
    ('categorical_encoder', OrdinalEncoder(
        encoding_method='ordered', variables=CATEGORICAL_VARS)),

    ('scaler', MinMaxScaler()),
    # ('selector', SelectFromModel(Lasso(alpha=0.001, random_state=0))),
    ('Lasso', Lasso(alpha=0.001, random_state=0)),
])

# train the pipeline
price_pipe.fit(X_train, y_train)

# evaluate the model:
# ====================
# make predictions for train set
pred = price_pipe.predict(X_train)

# determine mse, rmse and r2
print('train mse: {}'.format(int(
    mean_squared_error(np.exp(y_train), np.exp(pred)))))
print('train rmse: {}'.format(int(
    mean_squared_error(np.exp(y_train), np.exp(pred), squared=False))))
print('train r2: {}'.format(
    r2_score(np.exp(y_train), np.exp(pred))))
print()

# make predictions for test set
pred = price_pipe.predict(X_test)

# determine mse, rmse and r2
print('test mse: {}'.format(int(
    mean_squared_error(np.exp(y_test), np.exp(pred)))))
print('test rmse: {}'.format(int(
    mean_squared_error(np.exp(y_test), np.exp(pred), squared=False))))
print('test r2: {}'.format(
    r2_score(np.exp(y_test), np.exp(pred))))
print()

print('Average house price: ', int(np.exp(y_train).median()))
Thoughts: I wonder how practical it is to have feature engineering + modeling within the same pipeline. From what I see, it may be worth creating one pipeline for feature engineering and one for modeling, and then joining them if needed. FeatureUnion runs several transformer pipelines in parallel and concatenates their outputs, as in this example:
# source: https://www.kaggle.com/baghern/a-deep-dive-into-sklearn-pipelines
words = Pipeline([
    ('selector', NumberSelector(key='words')),
    ('standard', StandardScaler())
])
words_not_stopword = Pipeline([
    ('selector', NumberSelector(key='words_not_stopword')),
    ('standard', StandardScaler())
])
avg_word_length = Pipeline([
    ('selector', NumberSelector(key='avg_word_length')),
    ('standard', StandardScaler())
])
commas = Pipeline([
    ('selector', NumberSelector(key='commas')),
    ('standard', StandardScaler()),
])

# followed by
from sklearn.pipeline import FeatureUnion

feats = FeatureUnion([('text', text),
                      ('length', length),
                      ('words', words),
                      ('words_not_stopword', words_not_stopword),
                      ('avg_word_length', avg_word_length),
                      ('commas', commas)])

feature_processing = Pipeline([('feats', feats)])
feature_processing.fit_transform(X_train)
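Note that NumberSelector, text, and length are defined in the Kaggle notebook. As a sketch, NumberSelector presumably looks something like this: a custom transformer that picks one numeric column and returns it as a 2-D frame so StandardScaler can consume it:
from sklearn.base import BaseEstimator, TransformerMixin

class NumberSelector(BaseEstimator, TransformerMixin):
    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # return a DataFrame with the single selected column
        return X[[self.key]]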