Notes on datetime feature creation

Lesson learned from seeing another data scientist’s project where the data is mostly monthly gms. Learning focus is on datetime features treatment.

Folder architecture used


  • analysis
  • code
  • data_raw
  • data_intermediate
  • data_master
  • data_to_bu
  • model
  • send_to_bu

In the code folder, he has a script which is made of many functions dedicated to preprocessing / ML that he uses from one project to another. Specific functions used for this project are in another file named

List of date related features used for model training

  • diff: Monthly % change of number of users or gms per area

1 month is pct change vs last month, the other are vs 1 year ago

done for 1 month, 6 month, 1 year, 3 years. Ex: pct change of last 6 mth GMS vs last 6 month GMS 1 year ago

df2['gms_1mth_diff'] = df2.groupby('id')["gms_total_1mth_area"].pct_change(1).replace(np.inf,2).replace(-np.inf,-2)
df2['gms_6mth_diff'] = df2.groupby('id')["gms_total_6mth_area"].pct_change(12).replace(np.inf,2).replace(-np.inf,-2)
  • diff_all : GMS of all areas over the last X month vs same thing 1 year ago

done for 1 month (so Jan 2018 vs Jan 2017), 6 month (last 6 months vs last 6 months 1 year ago) and 1 year

df['gms_1yr_diff_all'] = df2.groupby('date')["gms_total_1yr_area"].transform("sum").pct_change(12).replace(np.inf,2).replace(-np.inf,-2)
  • diff_relative: GMS YoY of an area compared to 1 year ago minus same thing but for all Japan (which was calculated when creating diff_all)

tells you how an area trend is doing vs the whole country trend

do this for different period of time: 6 month and 1 year

df2['gms_1yr_diff'] = df2.groupby('id')["gms_total_1yr_area"].pct_change(12).replace(np.inf,2).replace(-np.inf,-2)
df2['gms_1yr_diff_relative'] = df2['gms_1yr_diff'] - df2['gms_1yr_diff_all']
  • _area_log and _cnt_log : GMS or number of users over the last X months for an area

1 month, 3 months, 6 months, 1 year, 3 years

  • vol: rolling volatility of gms over the last X months (6 month, 1 year, 2 years)
df2['gms_6mth_vol'] = df2.reset_index().set_index(['date']).groupby(['zip3'])["gms_1mth_diff"].rolling(6).std(ddof=0).replace(np.inf,2).replace(-np.inf,-2)
  • ratio: total gms over the last year in area divided by number of XXX
df2['gms_1yr_ratio'] = df2["gms_total_1yr_area"] / df2["n_XXX"]
  • area: integer of prefecture
  • current period: ‘month’, ‘quarter’, integer of the month / quarter

Columns used for model training — date related

 'month', 'quarter', # integer of the month / quarter
'gms_1mth_diff', # pct change btw last month,
'gms_6mth_diff', # gms for the last 6 month vs gms for the last 6 month 1 year ago
'gms_1yr_diff', # gms of the last 1 year vs gms of the last 1 year (1 year ago)
'cnt_1yr_diff', # cnt is related to the number of users
'cnt_6mth_diff', # sum of users during last 6 months vs the same but one year ago
'gms_1yr_ratio', # total gms in area divided by number of XXX
'gms_6mth_vol', 'gms_1yr_vol', 'gms_2yr_vol', # volatility of gms on the last X months

# gms over the last X month
'gms_total_1mth_area_log', 'gms_total_3mth_area_log',
'gms_total_6mth_area_log', 'gms_total_1yr_area_log',

# number of users over the last X month
'cnt_total_1mth_area_log','cnt_total_3mth_area_log', 'cnt_total_6mth_area_log',
'cnt_total_1yr_area_log', 'cnt_total_3yr_area_log',

'gms_6mth_diff_all', # total gms last 6 month for ALL area vs same thing last month
'gms_1yr_diff_all', # same thing for a 1 year period vs 1 year ago
'gms_6mth_diff_relative', # diff last 6 mth change over 12 month and gms_6mth_diff_all

Feature creation with a Datetime data type

Take advantages of existing functions

Overall, I should rely more on already existing functions for feature engineering rather than writing the functions from scratch.

  • When it comes to dates, use pandas rolling (with sum for example) and pct_change functions
  • Create binary features where you compare the GMS vs the average (like is your gms > GMS average?)

Use log with infinite values

It is not because you have infinite values after data transformation that you shouldn’t use log. Just replace the infinite / minus infinite by some extremity values

df2[t2] = np.log(df2[t1]) 
df2[t2] = df2[t2].replace(np.inf, 10).replace(-np.inf, -10)

Create a volatility feature

Calculate the volatility with the usage of numpy std (standard deviation). First you check how much percentage change you have from one month to another using perc_change function (which give the feature gms_1mth_diff). Then you do rolling function to get the time windows we want (let’s say 6 months) and check the standard deviation within this period.

df2['gms_6mth_vol'] = df2.reset_index().set_index(['date']).groupby(['id'])["gms_1mth_diff"].rolling(6).std(ddof=0).replace(np.inf, 2).replace(-np.inf,-2)

Fill NA with interpolate function

Use of pandas interpolate function to fill NA when we have missing month values. Many possibilities like linear or padding, etc…


  • convert current date into month and quarter (integer), particularly if seasonality may play a role
  • Remember to systematically use archive format (gz) for intermediary files



Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store