# Notes on datetime feature creation

Lesson learned from seeing another data scientist’s project where the data is mostly monthly gms. Learning focus is on datetime features treatment.

# Folder architecture used

Folders:

- analysis
- code
- data_raw
- data_intermediate
- data_master
- data_to_bu
- model
- send_to_bu

In the code folder, he has a utils.py script which is made of many functions dedicated to preprocessing / ML that he uses from one project to another. Specific functions used for this project are in another file named util_XXX.py.

# List of date related features used for model training

**diff**: Monthly % change of number of users or gms per area

**1 month is pct change vs last month, the other are vs 1 year ago**

done for 1 month, 6 month, 1 year, 3 years. Ex: pct change of last 6 mth GMS vs last 6 month GMS 1 year ago

`df2['gms_1mth_diff'] = df2.groupby('id')["gms_total_1mth_area"].pct_change(1).replace(np.inf,2).replace(-np.inf,-2)`

df2['gms_6mth_diff'] = df2.groupby('id')["gms_total_6mth_area"].pct_change(12).replace(np.inf,2).replace(-np.inf,-2)

**diff_all**: GMS of all areas over the last X month vs same thing 1 year ago

done for 1 month (so Jan 2018 vs Jan 2017), 6 month (last 6 months vs last 6 months 1 year ago) and 1 year

`df['gms_1yr_diff_all'] = df2.groupby('date')["gms_total_1yr_area"].transform("sum").pct_change(12).replace(np.inf,2).replace(-np.inf,-2)`

**diff_relative**: GMS YoY of an area compared to 1 year ago minus same thing but for all Japan (which was calculated when creating diff_all)

tells you how an area trend is doing vs the whole country trend

do this for different period of time: 6 month and 1 year

`df2['gms_1yr_diff'] = df2.groupby('id')["gms_total_1yr_area"].pct_change(12).replace(np.inf,2).replace(-np.inf,-2)`

df2['gms_1yr_diff_relative'] = df2['gms_1yr_diff'] - df2['gms_1yr_diff_all']

**_area_log**and**_cnt_log**: GMS or number of users over the last X months for an area

1 month, 3 months, 6 months, 1 year, 3 years

**vol**: rolling volatility of gms over the last X months (6 month, 1 year, 2 years)

`df2['gms_6mth_vol'] = df2.reset_index().set_index(['date']).groupby(['zip3'])["gms_1mth_diff"].rolling(6).std(ddof=0).replace(np.inf,2).replace(-np.inf,-2)`

**ratio**: total gms over the last year in area divided by number of XXX

`df2['gms_1yr_ratio'] = df2["gms_total_1yr_area"] / df2["n_XXX"]`

**area**: integer of prefecture**current****period**: ‘month’, ‘quarter’, integer of the month / quarter

# Columns used for model training — date related

` 'month', 'quarter', # integer of the month / quarter`

'gms_1mth_diff', # pct change btw last month,

'gms_6mth_diff', # gms for the last 6 month vs gms for the last 6 month 1 year ago

'gms_1yr_diff', # gms of the last 1 year vs gms of the last 1 year (1 year ago)

'cnt_1yr_diff', # cnt is related to the number of users

'cnt_6mth_diff', # sum of users during last 6 months vs the same but one year ago

'gms_1yr_ratio', # total gms in area divided by number of XXX

'gms_6mth_vol', 'gms_1yr_vol', 'gms_2yr_vol', # volatility of gms on the last X months

# gms over the last X month

'gms_total_1mth_area_log', 'gms_total_3mth_area_log',

'gms_total_6mth_area_log', 'gms_total_1yr_area_log',

'gms_total_3yr_area_log',

# number of users over the last X month

'cnt_total_1mth_area_log','cnt_total_3mth_area_log', 'cnt_total_6mth_area_log',

'cnt_total_1yr_area_log', 'cnt_total_3yr_area_log',

'gms_6mth_diff_all', # total gms last 6 month for ALL area vs same thing last month

'gms_1yr_diff_all', # same thing for a 1 year period vs 1 year ago

'gms_6mth_diff_relative', # diff last 6 mth change over 12 month and gms_6mth_diff_all

'gms_1yr_diff_relative',

'y']

# Feature creation with a Datetime data type

# Take advantages of existing functions

Overall, I should **rely more on already existing functions** for feature engineering rather than writing the functions from scratch.

- When it comes to dates, use pandas
**rolling**(with sum for example) and**pct_change**functions - Create binary features where you compare the GMS vs the average (like is your gms > GMS average?)

# Use log with infinite values

It is not because you have infinite values after data transformation that you shouldn’t use log. Just replace the infinite / minus infinite by some extremity values

`df2[t2] = np.log(df2[t1]) `

df2[t2] = df2[t2].replace(np.inf, 10).replace(-np.inf, -10)

# Create a volatility feature

Calculate the volatility with the usage of numpy std (standard deviation). First you check how much percentage change you have from one month to another using perc_change function (which give the feature gms_1mth_diff). Then you do rolling function to get the time windows we want (let’s say 6 months) and check the standard deviation within this period.

`df2['gms_6mth_vol'] = df2.reset_index().set_index(['date']).groupby(['id'])["gms_1mth_diff"].rolling(6).std(ddof=0).replace(np.inf, 2).replace(-np.inf,-2)`

# Fill NA with interpolate function

Use of pandas **interpolate** function to fill NA when we have missing month values. Many possibilities like linear or padding, etc…

# Other

- convert current date into month and quarter (integer), particularly if seasonality may play a role
- Remember to systematically use archive format (gz) for intermediary files