A convenient feature of scikit-learn is that many of its classifiers share identical methods, so models can be built interchangeably. We can see this in the following code, where we use the fit() and score() methods of the RandomForestClassifier class to train the model and assess its performance, respectively:
from sklearn.ensemble import RandomForestClassifier

clfs_rf = [RandomForestClassifier(n_estimators=100)]

for clf in clfs_rf:
    clf.fit(X_train, y_train.ravel())
    print(type(clf))
    print('Training accuracy: ' + str(clf.score(X_train, y_train)))
    print('Validation accuracy: ' + str(clf.score(X_test, y_test)))
    imps = {
        'column': [X_train_cols[i] for i in range(len(X_train_cols))],
        'imp': [clf.feature_importances_[i] for i in range(len(X_train_cols))]
    }
    df_imps = pd.DataFrame(imps)
    print(df_imps.sort_values('imp', axis=0, ascending=False))
The output should look similar to the following:
<class 'sklearn.ensemble.forest.RandomForestClassifier'>
Training accuracy: 1.0
Validation accuracy: 0.885391444713
        column       imp
1          AGE  0.039517
13       PULSE  0.028348
15       BPSYS  0.026833
12       TEMPF  0.025898
16      BPDIAS  0.025844
0     WAITTIME  0.025111
14       RESPR  0.021329
17       POPCT  0.020407
29    TOTCHRON  0.018417
896  IMMEDR_02  0.016714
...
This time the validation accuracy was similar to that of the logistic regression model, approximately 88%. However, the accuracy on the training data was 100%, indicating that we have overfit the model to the training data. We'll discuss potential ways of improving our models at the end of this chapter.
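One common way to rein in this kind of overfitting is to constrain how far each tree can grow. The following sketch is not part of the chapter's code; it uses synthetic data from make_classification (the names deep, shallow, X_tr, and so on are illustrative) to compare an unconstrained forest with one limited by max_depth and min_samples_leaf:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the chapter's emergency-department data
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# An unconstrained forest versus one with limited tree depth/leaf size
deep = RandomForestClassifier(n_estimators=100, random_state=0)
shallow = RandomForestClassifier(n_estimators=100, max_depth=5,
                                 min_samples_leaf=10, random_state=0)

for clf in (deep, shallow):
    clf.fit(X_tr, y_tr)
    print('max_depth=%s' % clf.max_depth,
          'train=%.3f' % clf.score(X_tr, y_tr),
          'test=%.3f' % clf.score(X_te, y_te))
```

The unconstrained forest typically memorizes the training set (training accuracy near 1.0), while the constrained one trades a little training accuracy for a smaller gap between training and validation scores.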
Looking at the feature importances, again the results make sense. This time the vital signs appear to be the most important predictors, along with age. Scrolling to the bottom reveals the features that had no measurable impact on the predictions. Note that, in contrast to regression coefficients, a random forest's variable importances are influenced by how frequently a variable takes a positive value; this may explain why IMMEDR_02 is ranked as more important than IMMEDR_01.
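Because impurity-based importances can be inflated in this way, one less biased check is permutation importance, which shuffles each column on held-out data and measures the resulting drop in score. A minimal sketch on synthetic data (the column names f0, f1, ... are illustrative, not the chapter's columns):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=8,
                           n_informative=3, random_state=0)
cols = ['f%d' % i for i in range(X.shape[1])]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Shuffle each column of the validation set and record the mean score drop
result = permutation_importance(clf, X_te, y_te, n_repeats=10, random_state=0)
df_perm = pd.DataFrame({'column': cols, 'imp': result.importances_mean})
print(df_perm.sort_values('imp', ascending=False))
```

Features that genuinely drive the predictions show a clear drop in validation accuracy when shuffled, while uninformative ones hover near zero regardless of how often they are positive.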