A convenient feature of scikit-learn is that many of its classifiers share identical methods, so models can be built interchangeably. We can see this in the following code, where we use the fit() and score() methods of the RandomForestClassifier class to train the model and assess its performance, respectively:
from sklearn.ensemble import RandomForestClassifier

clfs_rf = [RandomForestClassifier(n_estimators=100)]

for clf in clfs_rf:
    clf.fit(X_train, y_train.ravel())
    print(type(clf))
    print('Training accuracy: ' + str(clf.score(X_train, y_train)))
    print('Validation accuracy: ' + str(clf.score(X_test, y_test)))
    imps = {
        'column': [X_train_cols[i] for i in range(len(X_train_cols))],
        'imp': [clf.feature_importances_[i] for i in range(len(X_train_cols))]
    }
    df_imps = pd.DataFrame(imps)
    print(df_imps.sort_values('imp', axis=0, ascending=False))
The output should look similar to the following:
<class 'sklearn.ensemble.forest.RandomForestClassifier'>
Training accuracy: 1.0
Validation accuracy: 0.885391444713
        column       imp
1          AGE  0.039517
13       PULSE  0.028348
15       BPSYS  0.026833
12       TEMPF  0.025898
16      BPDIAS  0.025844
0     WAITTIME  0.025111
14       RESPR  0.021329
17       POPCT  0.020407
29    TOTCHRON  0.018417
896  IMMEDR_02  0.016714
...
This time the validation accuracy was similar to that of the logistic regression model, approximately 88%. However, the accuracy on the training data was 100%, indicating that we have overfit the model to the training data. We'll discuss potential ways of improving our models at the end of this chapter.
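One common way to rein in this kind of overfitting is to constrain how far each tree can grow. The following sketch is not part of the chapter's code; it uses synthetic data from make_classification (the names deep, shallow, X_tr, and so on are illustrative) to compare an unconstrained forest with one limited by max_depth and min_samples_leaf:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the chapter's emergency-department data
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# An unconstrained forest versus one with limited tree depth/leaf size
deep = RandomForestClassifier(n_estimators=100, random_state=0)
shallow = RandomForestClassifier(n_estimators=100, max_depth=5,
                                 min_samples_leaf=10, random_state=0)

for clf in (deep, shallow):
    clf.fit(X_tr, y_tr)
    print('max_depth=%s' % clf.max_depth,
          'train=%.3f' % clf.score(X_tr, y_tr),
          'test=%.3f' % clf.score(X_te, y_te))
```

The unconstrained forest typically memorizes the training set (training accuracy near 1.0), while the constrained one trades a little training accuracy for a smaller gap between training and validation scores.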
Looking at the feature importances, again the results make sense. This time the vital signs appear to be the most important predictors, along with age. Scrolling to the bottom reveals the features that had no measurable impact on the predictions. Note that, in contrast to regression coefficients, a random forest's variable importances are influenced by how frequently a variable takes a positive value; this may explain why IMMEDR_02 is ranked as more important than IMMEDR_01.
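Because impurity-based importances can be inflated in this way, one less biased check is permutation importance, which shuffles each column on held-out data and measures the resulting drop in score. A minimal sketch on synthetic data (the column names f0, f1, ... are illustrative, not the chapter's columns):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=8,
                           n_informative=3, random_state=0)
cols = ['f%d' % i for i in range(X.shape[1])]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Shuffle each column of the validation set and record the mean score drop
result = permutation_importance(clf, X_te, y_te, n_repeats=10, random_state=0)
df_perm = pd.DataFrame({'column': cols, 'imp': result.importances_mean})
print(df_perm.sort_values('imp', ascending=False))
```

Features that genuinely drive the predictions show a clear drop in validation accuracy when shuffled, while uninformative ones hover near zero regardless of how often they are positive.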