Let's analyze the VMONTH predictor in more detail. The following code prints each distinct value of VMONTH in the training set along with its count:
print(X_train.groupby('VMONTH').size())
The output is as follows:
VMONTH
01    1757
02    1396
03    1409
04    1719
05    2032
06    1749
07    1696
08    1034
09    1240
10    1306
11    1693
12    1551
dtype: int64
We can now see that the months are numbered from 01 to 12, as stated in the documentation, and that every month is represented in the training set.
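If we want to confirm this programmatically rather than by eye, a quick sanity check could look like the following. This is a minimal sketch, assuming (as the output above suggests) that VMONTH is stored as zero-padded strings:

# Confirm that all 12 zero-padded month codes appear in the training set
expected_months = {f'{m:02d}' for m in range(1, 13)}
assert set(X_train['VMONTH'].unique()) == expected_months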
One part of preprocessing the data is performing feature engineering – that is, combining or transforming existing features to create new features that are more predictive than the originals. For example, suppose we had a hypothesis that ED visitors tend to be admitted more often during the winter months. We could derive a predictor called WINTER from the VMONTH predictor such that its value is 1 only if the patient visited during December, January, February, or March, and 0 otherwise. We have done that in the following cell. Later on, we can test this hypothesis when we assess variable importance while making our machine learning models:
def is_winter(vmonth):
    # VMONTH is a zero-padded string, e.g. '01' for January
    if vmonth in ['12', '01', '02', '03']:
        return 1
    else:
        return 0
X_train.loc[:, 'WINTER'] = X_train.loc[:, 'VMONTH'].apply(is_winter)
X_test.loc[:, 'WINTER'] = X_test.loc[:, 'VMONTH'].apply(is_winter)
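As an aside, the same feature can be computed without a helper function by using pandas' vectorized isin method. This sketch again assumes VMONTH holds zero-padded strings:

# Equivalent vectorized version: isin returns a Boolean Series,
# which astype(int) converts to 0/1
winter_months = ['12', '01', '02', '03']
X_train.loc[:, 'WINTER'] = X_train['VMONTH'].isin(winter_months).astype(int)
X_test.loc[:, 'WINTER'] = X_test['VMONTH'].isin(winter_months).astype(int)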
As an informal test, let's print out the distribution of the WINTER variable and confirm that its count of 1s equals the sum of the counts of the four winter months:
X_train.groupby('WINTER').size()
The output is as follows:
WINTER
0    12469
1     6113
dtype: int64
Sure enough, we get 6113 = 1551 + 1757 + 1396 + 1409 (the December, January, February, and March counts from the earlier output), confirming that we engineered the feature correctly.
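Rather than adding the counts by hand, we could also let pandas do the arithmetic. This is a minimal sketch that reuses the VMONTH counts computed earlier:

# Sum the VMONTH counts for the four winter months and compare
# against the count of WINTER == 1; both should print 6113
winter_months = ['12', '01', '02', '03']
month_counts = X_train.groupby('VMONTH').size()
print(month_counts.loc[winter_months].sum())
print((X_train['WINTER'] == 1).sum())

We will see other examples of feature engineering throughout this chapter.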