When fitting a decision tree, the model starts at the root node and greedily chooses, at every junction, the split that best improves a certain metric of node purity. By default, scikit-learn optimizes for the Gini impurity metric at every step. As each split is created, the model keeps track of how much that split helps the overall optimization goal. In doing so, tree-based models that choose splits based on such metrics come with a built-in notion of feature importance.
To illustrate this further, let's go ahead and fit a decision tree to our data and output the feature importances with the help of the following code:
from sklearn.tree import DecisionTreeClassifier

# create a brand new decision tree classifier
tree = DecisionTreeClassifier()
tree.fit(X, y)
Once our tree has been fit to the data, we can call on the feature_importances_ attribute to capture the importance of each feature, relative to the fitting of the tree:
# note that we have some other features in play besides what our last two selectors decided for us
importances = pd.DataFrame({'importance': tree.feature_importances_,
                            'feature': X.columns}).sort_values('importance', ascending=False)

importances.head()
The preceding code produces the following table as the output:
|    | feature   | importance |
|----|-----------|------------|
| 5  | PAY_0     | 0.161829   |
| 4  | AGE       | 0.074121   |
| 11 | BILL_AMT1 | 0.064363   |
| 0  | LIMIT_BAL | 0.058788   |
| 19 | PAY_AMT3  | 0.054911   |
This table tells us that the most important feature during fitting was the PAY_0 column, which matches up with what our statistical models were telling us earlier in this chapter. More notable are the second, third, and fifth most important features, which didn't really show up before we applied our statistical tests. This is a good indicator that this method of feature selection might yield some new results for us.
Recall that, earlier, we relied on a built-in scikit-learn wrapper called SelectKBest to capture the top k features based on a ranking function such as the ANOVA p-value. We will now introduce a similar style of wrapper called SelectFromModel which, like SelectKBest, selects a subset of the most important features. However, it does so by listening to a machine learning model's internal metric of feature importance rather than the p-values of a statistical test. We will use the following code to import SelectFromModel:
# similar to SelectKBest, but not with statistical tests
from sklearn.feature_selection import SelectFromModel
The biggest difference in usage between SelectFromModel and SelectKBest is that SelectFromModel doesn't take in an integer k representing the number of features to keep; instead, SelectFromModel uses a threshold for selection, which acts as a hard minimum of importance that a feature must reach in order to be kept. In this way, the model-based selectors of this chapter are able to move away from a human-specified number of features to keep and instead rely on relative importance, including only as many features as the pipeline needs. Let's instantiate our class as follows:
# instantiate a class that chooses features based
# on feature importances according to the fitting phase
# of a separate decision tree classifier
select_from_model = SelectFromModel(DecisionTreeClassifier(),
                                    threshold=.05)
Let's fit this SelectFromModel class to our data and invoke the transform method to watch our data get subsetted, with the help of the following code:
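Something along these lines should do the trick (selected_X is just an illustrative variable name, and the number of surviving columns will depend on the importances the internal tree happens to assign):
# fit the selector to our data and subset the columns in one shot
selected_X = select_from_model.fit_transform(X, y)
selected_X.shape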
Now that we know the basic mechanics of the module, let's use it to select features for us in a pipeline. Recall that the accuracy to beat is .8206, which we got both from our correlation chooser and our ANOVA test (because they both returned the same features):
# to speed things up a bit in the future
tree_pipe_params = {'classifier__max_depth': [1, 3, 5, 7]}
from sklearn.pipeline import Pipeline
from copy import deepcopy
# create a SelectFromModel that is tuned by a DecisionTreeClassifier
select = SelectFromModel(DecisionTreeClassifier())
select_from_pipe = Pipeline([('select', select),
                             ('classifier', d_tree)])
select_from_pipe_params = deepcopy(tree_pipe_params)
select_from_pipe_params.update({
'select__threshold': [.01, .05, .1, .2, .25, .3, .4, .5, .6, "mean", "median", "2.*mean"],
'select__estimator__max_depth': [None, 1, 3, 5, 7]
})
print(select_from_pipe_params)

{'select__threshold': [0.01, 0.05, 0.1, 0.2, 0.25, 0.3, 0.4, 0.5, 0.6, 'mean', 'median', '2.*mean'], 'select__estimator__max_depth': [None, 1, 3, 5, 7], 'classifier__max_depth': [1, 3, 5, 7]}
get_best_model_and_accuracy(select_from_pipe,
                            select_from_pipe_params,
                            X, y)
# not better than original
Best Accuracy: 0.820266666667
Best Parameters: {'select__threshold': 0.01, 'select__estimator__max_depth': None, 'classifier__max_depth': 3}
Average Time to Fit (s): 0.192
Average Time to Score (s): 0.002
Note first that, as part of the threshold parameter, we are able to pass in certain reserved strings rather than a float that represents the minimum importance to use. For example, a threshold of "mean" only selects features whose importance is at least the mean importance. Similarly, a value of "median" as a threshold only selects features whose importance is at least the median importance. We may also attach multipliers to these reserved words, so that "2.*mean" will only include features that are at least twice as important as the mean importance value.
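For instance, a stricter selector that keeps only the features scoring at least twice the mean importance could be set up as follows (strict_select is just an illustrative name):
# keep only the features whose importance is at least twice the mean importance
strict_select = SelectFromModel(DecisionTreeClassifier(),
                                threshold="2.*mean")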
Let's take a peek at which features our decision tree-based selector is choosing for us. We can do this by invoking a method of SelectFromModel called get_support(). It returns an array of Booleans, one for each original feature column, telling us which of the features it decided to keep, as follows:
# set the optimal params to the pipeline
select_from_pipe.set_params(**{'select__threshold': 0.01,
                               'select__estimator__max_depth': None,
                               'classifier__max_depth': 3})

# fit the selector step of our pipeline to our data
select_from_pipe.steps[0][1].fit(X, y)

# list the columns that the selector kept by calling the get_support() method of SelectFromModel
X.columns[select_from_pipe.steps[0][1].get_support()]
[u'LIMIT_BAL', u'SEX', u'EDUCATION', u'MARRIAGE', u'AGE', u'PAY_0', u'PAY_2', u'PAY_3', u'PAY_6', u'BILL_AMT1', u'BILL_AMT2', u'BILL_AMT3', u'BILL_AMT4', u'BILL_AMT5', u'BILL_AMT6', u'PAY_AMT1', u'PAY_AMT2', u'PAY_AMT3', u'PAY_AMT4', u'PAY_AMT5', u'PAY_AMT6']
Wow! So the tree decided to keep all but two features, and still did only just as well as the tree did without any selection at all.
We could continue onward by trying several other tree-based models, such as RandomForestClassifier or ExtraTreesClassifier, but perhaps we may be able to do better by utilizing a model other than a tree-based one.
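For instance, swapping a random forest into the selector we already built might look something like the following sketch (forest_select and forest_pipe are just illustrative names, and the forest's hyperparameters are left untuned):
from sklearn.ensemble import RandomForestClassifier

# the same SelectFromModel mechanics as before, but listening to a
# random forest's aggregated feature importances instead of a single tree's
forest_select = SelectFromModel(RandomForestClassifier(n_estimators=100),
                                threshold="mean")

forest_pipe = Pipeline([('select', forest_select),
                        ('classifier', d_tree)])
This pipeline could be fed to the same get_best_model_and_accuracy call as before, with a parameter grid built just like select_from_pipe_params.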