Let's analyze the VMONTH predictor in more detail. The following code prints each distinct value of VMONTH in the training set along with its count:
print(X_train.groupby('VMONTH').size())
The output is as follows:
VMONTH
01    1757
02    1396
03    1409
04    1719
05    2032
06    1749
07    1696
08    1034
09    1240
10    1306
11    1693
12    1551
dtype: int64
We can now see that the months are numbered from 01 to 12, as stated in the documentation, and that every month is represented in the training set.
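If we want to confirm this programmatically rather than by eye, a quick sanity check could look like the following. This is a minimal sketch, assuming (as the output above suggests) that VMONTH is stored as zero-padded strings:

# Confirm that all 12 zero-padded month codes appear in the training set
expected_months = {f'{m:02d}' for m in range(1, 13)}
assert set(X_train['VMONTH'].unique()) == expected_months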
One part of preprocessing the data is performing feature engineering – that is, combining or transforming existing features to create new features that are more predictive than the originals. For example, suppose we had a hypothesis that ED visitors tend to be admitted more often during the winter months. We could derive a predictor called WINTER from the VMONTH predictor such that its value is 1 only if the patient visited during December, January, February, or March, and 0 otherwise. We have done that in the following cell. Later on, we can test this hypothesis when we assess variable importance while making our machine learning models:
def is_winter(vmonth):
    # VMONTH is a zero-padded string, e.g. '01' for January
    if vmonth in ['12', '01', '02', '03']:
        return 1
    else:
        return 0
X_train.loc[:, 'WINTER'] = X_train.loc[:, 'VMONTH'].apply(is_winter)
X_test.loc[:, 'WINTER'] = X_test.loc[:, 'VMONTH'].apply(is_winter)
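As an aside, the same feature can be computed without a helper function by using pandas' vectorized isin method. This sketch again assumes VMONTH holds zero-padded strings:

# Equivalent vectorized version: isin returns a Boolean Series,
# which astype(int) converts to 0/1
winter_months = ['12', '01', '02', '03']
X_train.loc[:, 'WINTER'] = X_train['VMONTH'].isin(winter_months).astype(int)
X_test.loc[:, 'WINTER'] = X_test['VMONTH'].isin(winter_months).astype(int)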
As an informal test, let's print out the distribution of the WINTER variable and confirm that its count of 1s equals the sum of the counts of the four winter months:
X_train.groupby('WINTER').size()
The output is as follows:
WINTER
0    12469
1     6113
dtype: int64
Sure enough, we get 6113 = 1551 + 1757 + 1396 + 1409 (the December, January, February, and March counts from the earlier output), confirming that we engineered the feature correctly.
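Rather than adding the counts by hand, we could also let pandas do the arithmetic. This is a minimal sketch that reuses the VMONTH counts computed earlier:

# Sum the VMONTH counts for the four winter months and compare
# against the count of WINTER == 1; both should print 6113
winter_months = ['12', '01', '02', '03']
month_counts = X_train.groupby('VMONTH').size()
print(month_counts.loc[winter_months].sum())
print((X_train['WINTER'] == 1).sum())

We will see other examples of feature engineering throughout this chapter.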