Binary Decision Trees

We will go through decision trees in detail, since they are highly useful and use most of the functionality in the machine learning library (and thus serve well as an instructional example). Binary decision trees were invented by Leo Breiman and colleagues,^[245] who named them classification and regression tree (CART) algorithms. This is the decision tree algorithm that OpenCV implements. The gist of the algorithm is to define an impurity metric relative to the data in every node of the tree. For example, when using regression to fit a function, we might use the sum of squared differences between the true value and the predicted value. We want to minimize the sum of differences (the "impurity") in each node of the tree. For categorical labels, we define a measure that is minimal when most values in a node are of the same class. Three common measures to use are entropy, Gini index, and misclassification (all are described in this section). Once we have such a metric, a binary decision tree searches through the feature vector to find which feature combined with which threshold most purifies the data. By convention, we say that features above the threshold are "true" and that the data thus classified will branch to the left; the other data points branch right. This procedure is then used recursively down each branch of the tree until the data is of sufficient purity or until the number of data points in a node reaches a set minimum.

The equations for node impurity i(N) are given next. We must deal with two cases, regression and classification.

Regression Impurity

For regression or function fitting, the equation for node impurity is simply the square of the difference in value between the node value y and the data value x. We want to minimize:

Classification Impurity

For classification, decision trees often use one of three methods: entropy impurity, Gini impurity, or misclassification impurity. For these methods, we use the notation P(ω_j) to denote the fraction of patterns at node N that are in class ω_j. Each of these impurities has slightly different effects on the splitting decision. Gini is the most commonly used, but all the algorithms attempt to minimize the impurity at a node. Figure 13-7 graphs the impurity measures that we want to minimize.

Misclassification impurity

Figure 13-7. Decision tree impurity measures

Decision trees are perhaps the most widely used classification technology. This is due to their simplicity of implementation, ease of interpretation of results, flexibility with different data types (categorical, numerical, unnormalized and mixes thereof), ability to handle missing data through surrogate splits, and natural way of assigning importance to the data features by order of splitting. Decision trees form the basis of other algorithms such as boosting and random trees, which we will discuss shortly.

Decision Tree Usage

In what follows we describe perhaps more than enough for you to get decision trees working well. However, there are many more methods for accessing nodes, modifying splits, and so forth. For that level of detail (which few readers are likely ever to need) you should consult the user manual …/opencv/docs/ref/opencvref_ml.htm, particularly with regard to the classes CvDTree{}, the training class CvDTreeTrainData{}, and the nodes CvDTreeNode{} and splits CvDTreeSplit{}.

For a pragmatic introduction, we start by dissecting a specific example. In the …/opencv/samples/c directory, there is a mushroom.cpp file that runs decision trees on the agaricus-lepiota.data data file. This data file consists of a label "p" or "e" (denoting poisonous or edible, respectively) followed by 22 categorical attributes, each represented by a single letter. Observe that the data file is given in "comma separated value" (CSV) format, where the features' values are separated from each other by commas. In mushroom.cpp there is a rather messy function mushroom_read_database() for reading in this particular data file. This function is rather overspecific and brittle but mainly it's just filling three arrays as follows. (1) A floating-point matrix data[][], which has dimensions rows = number of data points by columns = number of features (22 in this case) and where all the features are converted from their categorical letter values to floating-point numbers. (2) A character matrix missing[][], where a "true" or "1" indicates a missing value that is indicated in the raw data file by a question mark and where all other values are set to 0. (3) A floating-point vector responses[], which contains the poison "p" or edible "e" response cast in floating-point values. In most cases you would write a more general data input program. We'll now discuss the main working points of mushroom.cpp, all of which are called directly or indirectly from main() in the program.

Training the tree

For training the tree, we fill out the tree parameter structure CvDTreeParams{}:

struct CvDTreeParams {

  int   max_categories;       //Until pre-clustering
  int   max_depth;            //Maximum levels in a tree
  int   min_sample_count;     //Don't split a node if less
  int   cv_folds;             //Prune tree with K fold cross-validation
  bool  use_surrogates;       //Alternate splits for missing data
  bool  use_1se_rule;         //Harsher pruning
  bool  truncate_pruned_tree; //Don't "remember" pruned branches
  float regression_accuracy;  //One of the "stop splitting" criteria
  const float* priors;        //Weight of each prediction category

  CvDTreeParams() : max_categories(10), max_depth(INT_MAX),
    min_sample_count(10), cv_folds(10), use_surrogates(true),
    use_1se_rule(true), truncate_pruned_tree(true),
    regression_accuracy(0.01f), priors(NULL) { ; }

  CvDTreeParams(
    int          _max_depth,
    int          _min_sample_count,
    float        _regression_accuracy,
    bool         _use_surrogates,
    int          _max_categories,
    int          _cv_folds,
    bool         _use_1se_rule,
    bool         _truncate_pruned_tree,
    const float* _priors
  );
}

In the structure, max_categories has a default value of 10. This limits the number of categorical values before which the decision tree will precluster those categories so that it will have to test no more than 2^{max_categories}–2 possible value subsets.^[246] This isn't a problem for ordered or numerical features, where the algorithm just has to find a threshold at which to split left or right. Those variables that have more categories than max_categories will have their category values clustered down to max_categories possible values. In this way, decision trees will have to test no more than max_categories levels at a time. This parameter, when set to a low value, reduces computation at the cost of accuracy.

The other parameters are fairly self-explanatory. The last parameter, priors, can be crucial. It sets the relative weight that you give to misclassification. That is, if the weight of the first category is 1 and the weight of the second category is 10, then each mistake in predicting the second category is equivalent to making 10 mistakes in predicting the first category. In the code we have edible and poisonous mushrooms, so we "punish" mistaking a poisonous mushroom for an edible one 10 times more than mistaking an edible mushroom for a poisonous one.

The template of the methods for training a decision tree is shown below. There are two methods: the first is used for working directly with decision trees; the second is for ensembles (as used in boosting) or forests (as used in random trees).

// Work directly with decision trees:
  bool CvDTree::train(
  const CvMat*  _train_data,
  int           _tflag,
  const CvMat*  _responses,
  const CvMat*  _var_idx      = 0,
  const CvMat*  _sample_idx   = 0,
  const CvMat*  _var_type     = 0,
  const CvMat*  _missing_mask = 0,
  CvDTreeParams params        = CvDTreeParams()
);

// Method that ensembles of decision trees use to call individual

// training for each tree in the ensemble
bool CvDTree::train(
  CvDTreeTrainData* _train_data,
  const CvMat*      _subsample_idx
);

In the train() method, we have the floating-point _train_data[][] matrix. In that matrix, if _tflag is set to CV_ROW_SAMPLE then each row is a data point consisting of a vector of features that make up the columns of the matrix. If tflag is set to CV_COL_SAMPLE, the row and column meanings are reversed. The _responses[] argument is a floating-point vector of values to be predicted given the data features. The other parameters are optional. The vector _var_idx indicates features to include, and the vector _sample_idx indicates data points to include; both of these vectors are either zero-based integer lists of values to skip or 8-bit masks of active (1) or skip (0) values (see our general discussion of the train() method earlier in the chapter). The byte (CV_8UC1) vector _var_type is a zero-based mask for each feature type (CV_VAR_CATEGORICAL or CV_VAR_ORDERED^[247]); its size is equal to the number of features plus 1. That last entry is for the response type to be learned. The byte-valued _missing_mask[][] matrix is used to indicate missing values with a 1 (else 0 is used). Example 13-2 details the creation and training of a decision tree.

Example 13-2. Creating and training a decision tree

float priors[] = { 1.0, 10.0}; // Edible vs poisonous weights

CvMat* var_type;

var_type = cvCreateMat( data->cols + 1, 1, CV_8U );

cvSet( var_type, cvScalarAll(CV_VAR_CATEGORICAL) ); // all these vars
                                                    // are categorical
CvDTree* dtree;
dtree = new CvDTree;
dtree->train(
  data,
  CV_ROW_SAMPLE,
  responses,
  0,
  0,
  var_type,
  missing,
  CvDTreeParams(
    8,     // max depth
    10,    // min sample count
    0,     // regression accuracy: N/A here
    true,  // compute surrogate split,
           //   since we have missing data
    15,    // max number of categories
           //   (use suboptimal algorithm for
           //   larger numbers)
    10,    // cross-validations
    true,  // use 1SE rule => smaller tree
    true,  // throw away the pruned tree branches
    priors // the array of priors, the bigger
           //   p_weight, the more attention
           //   to the poisonous mushrooms
  )
);

In this code the decision tree dtree is declared and allocated. The dtree->train() method is then called. In this case, the vector of responses[] (poisonous or edible) was set to the ASCII value of "p" or "e" (respectively) for each data point. After the train() method terminates, dtree is ready to be used for predicting new data. The decision tree may also be saved to disk via save() and loaded via load() (each method is shown below).^[248] Between the saving and the loading, we reset and zero out the tree by calling the clear() method.

dtree->save("tree.xml","MyTree");

dtree->clear();

dtree->load("tree.xml","MyTree");

This saves and loads a tree file called tree.xml. (Using the .xml extension stores an XML data file; if we used a .yml or .yaml extension, it would store a YAML data file.) The optional "MyTree" is a tag that labels the tree within the tree.xml file. As with other statistical models in the machine learning module, multiple objects cannot be stored in a single .xml or .yml file when using save(); for multiple storage one needs to use cvOpenFileStorage() and write(). However, load() is a different story: this function can load an object by its name even if there is some other data stored in the file.

The function for prediction with a decision tree is:

CvDTreeNode* CvDTree::predict(
  const CvMat* _sample,
  const CvMat* _missing_data_mask = 0,
  bool         raw_mode           = false
) const;

Here _sample is a floating-point vector of features used to predict; _missing_data_mask is a byte vector of the same length and orientation^[249] as the _sample vector, in which nonzero values indicate a missing feature value. Finally, raw_mode indicates unnormalized data with "false" (the default) or "true" for normalized input categorical data values. This is mainly used in ensembles of trees to speed up prediction. Normalizing data to fit within the (0, 1) interval is simply a computational speedup because the algorithm then knows the bounds in which data may fluctuate. Such normalization has no effect on accuracy. This method returns a node of the decision tree, and you may access the predicted value using (CvDTreeNode *)->value which is returned by the dtree->predict() method (see CvDTree::predict() described previously):

double r = dtree->predict( &sample, &mask )->value;

Finally, we can call the useful var_importance() method to learn about the importance of the individual features. This function will return an N-by-1 vector of type double (CV_64FC1) containing each feature's relative importance for prediction, where the value 1 indicates the highest importance and 0 indicates absolutely not important or useful for prediction. Unimportant features may be eliminated on a second-pass training. (See Figure 13-12 for a display of variable importance.) The call is as follows:

const CvMat* var_importance = dtree->get_var_importance();

As demonstrated in the …/opencv/samples/c/mushroom.cpp file, individual elements of the importance vector may be accessed directly via

double val = var_importance->data.db[i];

Most users will only train and use the decision trees, but advanced or research users may sometimes wish to examine and/or modify the tree nodes or the splitting criteria. As stated in the beginning of this section, the information for how to do this is in the ML documentation that ships with OpenCV at …/opencv/docs/ref/opencvref_ml.htm#ch_dtree, which can also be accessed via the OpenCV Wiki (http://opencvlibrary.sourceforge.net/). The sections of interest for such advanced analysis are the class structure CvDTree{}, the training structure CvDTreeTrainData{}, the node structure CvDTreeNode{}, and its contained split structure CvDTreeSplit{}.

Decision Tree Results

Using the code just described, we can learn several things about edible or poisonous mushrooms from the agaricus-lepiota.data file. If we just train a decision tree without pruning, so that it learns the data perfectly, we get the tree shown in Figure 13-8. Although the full decision tree learns the training set of data perfectly, remember the lesson of Figure 13-2 (overfitting). What we've done in Figure 13-8 is to memorize the data together with its mistakes and noise. Thus, it is unlikely to perform well on real data. That is why OpenCV decision trees and CART type trees typically include an additional step of penalizing complex trees and pruning them back until complexity is in balance with performance. There are other decision tree implementations that grow the tree only until complexity is balanced with performance and so combine the pruning phase with the learning phase. However, during development of the ML library it was found that trees that are fully grown first and then pruned (as implemented in OpenCV) performed better than those that combine training with pruning in their generation phase.

Figure 13-9 shows a pruned tree that still does quite well (but not perfectly) on the training set but will probably perform better on real data because it has a better balance between bias and variance. Yet this classifier has an serious shortcoming: Although it performs well on the data, it still labels poisonous mushrooms as edible 1.23% of the time. Perhaps we'd be happier with a worse classifier that labeled many edible mushrooms as poisonous provided it never invited us to eat a poisonous mushroom! Such a classifier can be created by intentionally biasing the classifier and/or the data. This is sometimes referred to as adding a cost to the classifier. In our case, we want to add a higher cost for misclassifying poisonous mushrooms than for misclassifying edible mushrooms. Cost can be imposed "inside" a classifier by changing the weighting of how much a "bad" data point counts versus a "good" one. OpenCV allows you to do this by adjusting the priors vector in the CvDTreeParams{} structure passed to the train() method, as we have discussed previously. Even without going inside the classifier code, we can impose a prior cost by duplicating (or resampling from) "bad" data. Duplicating "bad" data points implicitly gives a higher weight to the "bad" data, a technique that can work with any classifier.

Full decision tree for poisonous (p) or edible (e) mushrooms: this tree was built out to full complexity for 0% error on the training set and so would probably suffer from variance problems on test or real data (the dark portion of a rectangle represents the poisonous portion of mushrooms at that phase of categorization)

Figure 13-8. Full decision tree for poisonous (p) or edible (e) mushrooms: this tree was built out to full complexity for 0% error on the training set and so would probably suffer from variance problems on test or real data (the dark portion of a rectangle represents the poisonous portion of mushrooms at that phase of categorization)

Figure 13-10 shows a tree where a 10 X bias was imposed against poisonous mushrooms. This tree makes no mistakes on poisonous mushrooms at a cost of many more mistakes on edible mushrooms'a case of "better safe than sorry". Confusion matrices for the (pruned) unbiased and biased trees are shown in Figure 13-11.

Pruned decision tree for poisonous (p) and edible (e) mushrooms: despite being pruned, this tree shows low error on the training set and would likely work well on real data

Figure 13-9. Pruned decision tree for poisonous (p) and edible (e) mushrooms: despite being pruned, this tree shows low error on the training set and would likely work well on real data

An edible mushroom decision tree with 10 × bias against misidentification of poisonous mushrooms as edible; note that the lower right rectangle, though containing a vast majority of edible mushrooms, does not contain a 10 × majority and so would be classified as inedible

Figure 13-10. An edible mushroom decision tree with 10 × bias against misidentification of poisonous mushrooms as edible; note that the lower right rectangle, though containing a vast majority of edible mushrooms, does not contain a 10 × majority and so would be classified as inedible

Finally, we can learn something more from the data by using the variable importance machinery that comes with the tree-based classifiers in OpenCV.^[250] Variable importance measurement techniques were discussed in a previous subsection, and they involve successively perturbing each feature and then measuring the effect on classifier performance. Features that cause larger drops in performance when perturbed are more important. Also, decision trees directly show importance via the splits they found in the data: the first splits are presumably more important than later splits. Splits can be a useful indicator of importance, but they are done in a "greedy" fashion—finding which split most purifies the data now. It is often the case that doing a worse split first leads to better splits later, but these trees won't find this out.^[251] The variable importance for poisonous mushrooms is shown in Figure 13-12 for both the unbiased and the biased trees. Note that the order of important variables changes depending on the bias of the trees.

Confusion matrices for (pruned) edible mushroom decision trees: the unbiased tree yields better overall performance (top panel) but sometimes misclassifies poisonous mushrooms as edible; the biased tree does not perform as well overall (lower panel) but never misclassifies poisonous mushrooms

Figure 13-11. Confusion matrices for (pruned) edible mushroom decision trees: the unbiased tree yields better overall performance (top panel) but sometimes misclassifies poisonous mushrooms as edible; the biased tree does not perform as well overall (lower panel) but never misclassifies poisonous mushrooms

^[245] L. Breiman, J. Friedman, R. Olshen, and C. Stone, Classification and Regression Trees (1984), Wadsworth.

^[246]More detail on categorical vs. ordered splits: Whereas a split on an ordered variable has the form "if x < a then go left, else go right", a split on a categorical variable has the form "if then go left, else go right", where the v_i are some possible values of the variable. Thus, if a categorical variable has N possible values then, in order to find a best split on that variable, one needs to try 2^N –2 subsets (empty and full subset are excluded). Thus, an approximate algorithm is used whereby all N values are grouped into K ≤ max_categories clusters (via the K-mean algorithm) based on the statistics of the samples in the currently analyzed node. Thereafter, the algorithm tries different combinations of the clusters and chooses the best split, which often gives quite a good result. Note that for the two most common tasks, two-class classification and regression, the optimal categorical split (i.e., the best subset of values) can be found efficiently without any clustering. Hence the clustering is applied only in n > 2-class classification problems for categorical variables with N > max_categories possible values. Therefore, you should think twice before setting max_categories to anything greater than 20, which would imply more than a million operations for each split!

^[247] CV_VAR_ORDERED is the same thing as CV_VAR_NUMERICAL.

^[248]As mentioned previously, save() and load() are convenience wrappers for the more complex functions write() and read().

^[249]By "same … orientation" we mean that if the sample is a 1-by-N vector the mask must be 1-by-N, and if the sample is N-by-1 then the mask must be N-by-1.

^[250] Variable importance techniques may be used with any classifier, but at this time OpenCV implements them only with tree-based methods.

^[251]OpenCV (following Breiman's technique) computes variable importance across all the splits, including surrogate ones, which decreases the possible negative effect that CART's greedy splitting algorithm would have on variable importance ratings.