Notes on datetime feature creation
Lessons learned from reviewing another data scientist's project, where the data is mostly monthly GMS per area. The focus here is on how datetime features are treated.
Folder architecture used
Folders:
- analysis
- code
- data_raw
- data_intermediate
- data_master
- data_to_bu
- model
- send_to_bu
In the code folder, he keeps a utils.py script containing the many preprocessing / ML functions he reuses from one project to another. Functions specific to this project live in a separate file named util_XXX.py.
List of date-related features used for model training
- diff: monthly % change of the number of users or GMS per area
The 1-month version is the pct change vs last month; the others are vs 1 year ago.
Done for 1 month, 6 months, 1 year, and 3 years. Ex: pct change of GMS over the last 6 months vs GMS over the same 6 months 1 year ago.
df2['gms_1mth_diff'] = df2.groupby('id')["gms_total_1mth_area"].pct_change(1).replace(np.inf,2).replace(-np.inf,-2)
df2['gms_6mth_diff'] = df2.groupby('id')["gms_total_6mth_area"].pct_change(12).replace(np.inf,2).replace(-np.inf,-2)
- diff_all: GMS of all areas over the last X months vs the same thing 1 year ago
Done for 1 month (so Jan 2018 vs Jan 2017), 6 months (last 6 months vs the same 6 months 1 year ago), and 1 year.
df2['gms_1yr_diff_all'] = df2.groupby('date')["gms_total_1yr_area"].transform("sum").pct_change(12).replace(np.inf,2).replace(-np.inf,-2)
- diff_relative: YoY GMS change of an area minus the same YoY change for all of Japan (which was calculated when creating diff_all)
Tells you how an area's trend is doing vs the whole country's trend.
Done for two periods: 6 months and 1 year.
df2['gms_1yr_diff'] = df2.groupby('id')["gms_total_1yr_area"].pct_change(12).replace(np.inf,2).replace(-np.inf,-2)
df2['gms_1yr_diff_relative'] = df2['gms_1yr_diff'] - df2['gms_1yr_diff_all']
- _area_log and _cnt_log: log of GMS (or of the number of users) over the last X months for an area
1 month, 3 months, 6 months, 1 year, 3 years
- vol: rolling volatility of GMS over the last X months (6 months, 1 year, 2 years)
df2['gms_6mth_vol'] = df2.reset_index().set_index(['date']).groupby(['zip3'])["gms_1mth_diff"].rolling(6).std(ddof=0).replace(np.inf,2).replace(-np.inf,-2)
- ratio: total GMS over the last year in an area divided by the number of XXX
df2['gms_1yr_ratio'] = df2["gms_total_1yr_area"] / df2["n_XXX"]
- area: integer encoding of the prefecture
- current period: 'month' and 'quarter', integers of the current month / quarter
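The diff and _log features above can be sketched on toy data. This is a minimal illustration, not the project's actual pipeline; the frame and column names (id, date, gms_total_1mth_area) are hypothetical, following the naming patterns in the list:

```python
import numpy as np
import pandas as pd

# Toy frame: one row per (id, date), monthly GMS for a single area "A".
df2 = pd.DataFrame({
    "id": ["A"] * 4,
    "date": pd.date_range("2018-01-01", periods=4, freq="MS"),
    "gms_total_1mth_area": [0.0, 100.0, 200.0, 300.0],
})

# diff: monthly pct change per area; a zero baseline yields inf, clipped to +/-2.
df2["gms_1mth_diff"] = (
    df2.groupby("id")["gms_total_1mth_area"]
    .pct_change(1)
    .replace(np.inf, 2).replace(-np.inf, -2)
)
# -> [NaN, 2.0 (clipped inf from 100/0), 1.0, 0.5]

# _area_log: GMS over the last 3 months (per-id rolling sum), then log,
# clipping the -inf that a zero total would produce.
df2["gms_total_3mth_area"] = (
    df2.groupby("id")["gms_total_1mth_area"].rolling(3).sum()
    .reset_index(level=0, drop=True)  # drop the group level so it aligns with df2
)
df2["gms_total_3mth_area_log"] = (
    np.log(df2["gms_total_3mth_area"]).replace(np.inf, 10).replace(-np.inf, -10)
)
```

The clipping values (±2 for pct changes, ±10 for logs) are just the sentinels used throughout these notes, not universal constants.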
Columns used for model training — date related
'month', 'quarter', # integer of the month / quarter
'gms_1mth_diff', # pct change vs last month
'gms_6mth_diff', # gms for the last 6 month vs gms for the last 6 month 1 year ago
'gms_1yr_diff', # gms of the last 1 year vs gms of the last 1 year (1 year ago)
'cnt_1yr_diff', # cnt is related to the number of users
'cnt_6mth_diff', # sum of users during last 6 months vs the same but one year ago
'gms_1yr_ratio', # total gms in area divided by number of XXX
'gms_6mth_vol', 'gms_1yr_vol', 'gms_2yr_vol', # volatility of gms on the last X months
# gms over the last X month
'gms_total_1mth_area_log', 'gms_total_3mth_area_log',
'gms_total_6mth_area_log', 'gms_total_1yr_area_log',
'gms_total_3yr_area_log',
# number of users over the last X month
'cnt_total_1mth_area_log','cnt_total_3mth_area_log', 'cnt_total_6mth_area_log',
'cnt_total_1yr_area_log', 'cnt_total_3yr_area_log',
'gms_6mth_diff_all', # total gms over the last 6 months for ALL areas vs the same thing 1 year ago
'gms_1yr_diff_all', # same thing for a 1 year period vs 1 year ago
'gms_6mth_diff_relative', # gms_6mth_diff minus gms_6mth_diff_all
'gms_1yr_diff_relative',
'y']
Feature creation with a Datetime data type
Take advantage of existing functions
Overall, I should rely more on existing functions for feature engineering rather than writing them from scratch.
- When it comes to dates, use the pandas rolling (with sum, for example) and pct_change functions
- Create binary features comparing GMS to the average (e.g. is this area's GMS above the average GMS?)
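The binary-feature idea can be sketched with groupby + transform. A minimal example on a hypothetical frame (the gms_above_avg column name is my own, not from the project):

```python
import pandas as pd

# Hypothetical per-area GMS for one month.
df2 = pd.DataFrame({
    "date": pd.to_datetime(["2018-01-01"] * 3),
    "id": ["A", "B", "C"],
    "gms": [100.0, 50.0, 180.0],
})

# Binary flag: is this area's GMS above the all-area average for the same month?
avg = df2.groupby("date")["gms"].transform("mean")  # 110.0 broadcast to every row
df2["gms_above_avg"] = (df2["gms"] > avg).astype(int)

print(df2["gms_above_avg"].tolist())  # [0, 0, 1]
```

transform keeps the original row alignment, so the per-date mean can be compared row by row without a merge.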
Use log even with infinite values
Having infinite values after a data transformation is not a reason to avoid log. Just replace +inf / -inf with extreme sentinel values:
df2[t2] = np.log(df2[t1])
df2[t2] = df2[t2].replace(np.inf, 10).replace(-np.inf, -10)
Create a volatility feature
Calculate the volatility with the standard deviation (std). First, compute the month-over-month percentage change with the pct_change function (which gives the feature gms_1mth_diff). Then apply a rolling window of the desired length (say 6 months) and take the standard deviation within that window.
df2['gms_6mth_vol'] = df2.reset_index().set_index(['date']).groupby(['id'])["gms_1mth_diff"].rolling(6).std(ddof=0).replace(np.inf, 2).replace(-np.inf,-2)
Fill NAs with the interpolate function
Use pandas' interpolate function to fill NAs when month values are missing. Several methods are available, such as linear interpolation or padding.
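A small sketch of interpolate on a monthly series with a gap. The series and values are made up; method="time" is shown as an alternative that weights by the actual spacing between dates:

```python
import numpy as np
import pandas as pd

# Monthly series with one missing month (Feb) in the middle.
idx = pd.date_range("2018-01-01", periods=4, freq="MS")
s = pd.Series([100.0, np.nan, 300.0, 400.0], index=idx)

linear = s.interpolate(method="linear")  # fills the gap with the midpoint, 200.0
timed = s.interpolate(method="time")     # weights by actual day spacing instead
```

Linear interpolation treats consecutive rows as equally spaced; method="time" uses the DatetimeIndex, which matters when months are missing or unevenly spaced.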
Other
- Convert the current date into month and quarter integers, particularly if seasonality may play a role
- Remember to systematically use a compressed archive format (gz) for intermediate files
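Both points above are one-liners in pandas; a minimal sketch (the file path is illustrative, using a temp directory so it runs anywhere):

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"date": pd.to_datetime(["2018-01-15", "2018-05-03"])})

# Integer month / quarter features, useful when seasonality matters.
df["month"] = df["date"].dt.month      # 1 and 5
df["quarter"] = df["date"].dt.quarter  # 1 and 2

# Intermediate files: pandas infers gzip compression from the .gz suffix.
path = os.path.join(tempfile.gettempdir(), "features.csv.gz")
df.to_csv(path, index=False)
```

read_csv infers the compression from the suffix the same way, so reading the file back needs no extra arguments.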