Random Trees

OpenCV contains a random trees class, which is implemented following Leo Breiman's theory of random forests.[259] Random trees can learn more than one class at a time simply by collecting the class "votes" at the leaves of each of many trees and selecting the class receiving the maximum votes as the winner. Regression is done by averaging the values across the leaves of the "forest". Random trees consist of randomly perturbed decision trees and were among the best-performing classifiers on the data sets studied while the ML library was being assembled. Random trees also have the potential for parallel implementation, even on non-shared-memory systems, a feature that lends itself to increased use in the future. The basic subsystem on which random trees are built is once again a decision tree. Each such tree is grown all the way down, without pruning, until its leaves are pure. Thus (cf. the upper right panel of Figure 13-2), each tree is a high-variance classifier that nearly perfectly learns its training data. To counterbalance this high variance, we average together many such trees (hence the name random trees).

Of course, averaging trees will do us no good if the trees are all very similar to each other. To overcome this, random trees force each tree to be different by randomly selecting, at each node, a different subset of the total features from which that node may learn. For example, an object-recognition tree might have a long list of potential features: color, texture, gradient magnitude, gradient direction, variance, ratios of values, and so on. Each node of the tree is allowed to choose from a random subset of these features when determining how best to split the data, and each subsequent node of the tree gets a new, randomly chosen subset of features on which to split. The size of these random subsets is often chosen as the square root of the number of features. Thus, if we had 100 potential features, then each node would randomly choose 10 of them and find the best split of the data from among those 10. To increase robustness, random trees also use an out-of-bag measure: each tree is trained on a subset of the data selected at random with replacement,[260] and the rest of the data (those points never selected, called "out-of-bag" or OOB data) are used to estimate that tree's performance. On average, the OOB data amounts to about one third of all the data points.
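
You do not implement either mechanism yourself (the library handles both), but the following standalone C++ sketch may help fix the ideas. It is an illustration only, not OpenCV's internal code; the function names and the use of rand() are our own choices for the example.

#include <cmath>
#include <cstdlib>
#include <vector>

// Pick floor(sqrt(total_features)) distinct feature indices at random;
// a node then searches only these features for its best split.
std::vector<int> randomFeatureSubset( int total_features ) {
  int subset_size = (int)std::sqrt( (double)total_features );
  std::vector<bool> used( total_features, false );
  std::vector<int>  subset;
  while( (int)subset.size() < subset_size ) {
    int f = std::rand() % total_features;
    if( !used[f] ) { used[f] = true; subset.push_back( f ); }
  }
  return subset;
}

// Draw a bootstrap sample of size n_samples (with replacement); any point
// never drawn is "out of bag" -- about one third of the data on average.
void bootstrapSample(
  int               n_samples,
  std::vector<int>& in_bag,
  std::vector<int>& out_of_bag
) {
  std::vector<bool> picked( n_samples, false );
  for( int i = 0; i < n_samples; i++ ) {
    int j = std::rand() % n_samples;     // duplicates are allowed by design
    in_bag.push_back( j );
    picked[j] = true;
  }
  for( int i = 0; i < n_samples; i++ )
    if( !picked[i] )
      out_of_bag.push_back( i );
}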

Like all tree-based methods, random trees inherit many of the good properties of trees: surrogate splits for missing values, handling of both categorical and numerical values, no need to normalize values, and easy methods for finding the variables that are important for prediction. Random trees also use the OOB error to estimate how well they will do on unseen data. If the training data has a distribution similar to that of the test data, this OOB performance estimate can be quite accurate.

Finally, random trees can be used to determine, for any two data points, their proximity (which in this context means "how alike" they are, not "how near" they are). The algorithm does this by (1) "dropping" the data points into the trees, (2) counting how many times they end up in the same leaf, and (3) dividing this "same leaf" count by the total number of trees. A proximity of 1 means the two points are exactly alike (they land in the same leaf of every tree), while a proximity of 0 means they are completely dissimilar. This proximity measure can be used to identify outliers (points very unlike any other) and also to cluster points (group similar points together).
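
A compact sketch of this computation is shown below. It is illustrative only (in OpenCV the bookkeeping is done for you by CvRTrees::get_proximity(), described later in this section); leaf_index() is a hypothetical helper that reports which leaf of a given tree a sample lands in.

// Proximity as the fraction of trees in which two samples share a leaf.
// leaf_index() is a hypothetical helper, not part of the OpenCV API.
double proximity(
  int          n_trees,
  const float* sample_a,
  const float* sample_b,
  int        (*leaf_index)( int tree, const float* sample )
) {
  int same_leaf = 0;
  for( int t = 0; t < n_trees; t++ )
    if( leaf_index( t, sample_a ) == leaf_index( t, sample_b ) )
      same_leaf++;
  return (double)same_leaf / n_trees;    // 1 = always together, 0 = never together
}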

We are by now familiar with how the ML library works, and random trees are no exception. As usual, we start with a parameter structure, CvRTParams, which is derived from the decision-tree parameter structure CvDTreeParams:

struct CvRTParams : public CvDTreeParams {

  bool           calc_var_importance;
  int            nactive_vars;
  CvTermCriteria term_crit;

  CvRTParams() : CvDTreeParams(
    5, 10, 0, false,
    10, 0, false, false,
    0
  ), calc_var_importance(false), nactive_vars(0) {

    term_crit = cvTermCriteria(
      CV_TERMCRIT_ITER | CV_TERMCRIT_EPS,
      50,
      0.1
    );
  }

  CvRTParams(
    int          _max_depth,
    int          _min_sample_count,
    float        _regression_accuracy,
    bool         _use_surrogates,
    int          _max_categories,
    const float* _priors,
    bool         _calc_var_importance,
    int          _nactive_vars,
    int          max_tree_count,
    float        forest_accuracy,
    int          termcrit_type
  );

};

There are three new parameters in CvRTParams. The first, calc_var_importance, is a switch that tells the algorithm to compute the importance of each feature during training (at a slight cost in additional computation time). Figure 13-13 shows the variable importance computed on a subset of the mushroom data set that ships with OpenCV in the …/opencv/samples/c/agaricus-lepiota.data file. The nactive_vars parameter sets the size of the randomly selected subset of features to be tested at any given node and is typically set to the square root of the total number of features. Finally, term_crit (a structure discussed elsewhere in this chapter) controls how the forest stops growing. For random trees, the max_iter field of term_crit sets the maximum number of trees; epsilon sets the "stop learning" criterion, so that no new trees are added once the OOB error drops below this value; and type selects which of the two stopping criteria to use (usually both: CV_TERMCRIT_ITER | CV_TERMCRIT_EPS).
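
To make the mapping between these fields and the constructor arguments concrete, here is one plausible way of filling out CvRTParams by hand. The particular values are illustrative assumptions only (most of them mirror the letter_recog.cpp call shown later in this section, except that both stopping criteria are enabled here).

CvRTParams params(
  10,                                  // max_depth of each tree
  10,                                  // min_sample_count required to split a node
  0,                                   // regression_accuracy (unused for classification)
  false,                               // use_surrogates for missing feature values
  15,                                  // max_categories (for splitting categorical variables)
  0,                                   // priors (0 means none specified)
  true,                                // calc_var_importance during training
  4,                                   // nactive_vars tested at each node
  100,                                 // maximum number of trees (term_crit max_iter)
  0.01f,                               // sufficient OOB error (term_crit epsilon)
  CV_TERMCRIT_ITER | CV_TERMCRIT_EPS   // stop on tree count or OOB error, whichever comes first
);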

Random trees training has the same form as decision trees training (see the deconstruction of CvDTree::train() in the subsection on "Training the Tree") except that it uses the CvRTParams structure:

bool CvRTrees::train(
  const CvMat* train_data,
  int          tflag,
  const CvMat* responses,
  const CvMat* comp_idx     = 0,
  const CvMat* sample_idx   = 0,
  const CvMat* var_type     = 0,
  const CvMat* missing_mask = 0,
  CvRTParams   params       = CvRTParams()
);

An example of calling the train function for a multiclass learning problem is provided in the samples directory that ships with OpenCV; see the …/opencv/samples/c/letter_recog.cpp file, where the random trees classifier is named forest.

forest.train(
  data,
  CV_ROW_SAMPLE,
  responses,
  0,
  sample_idx,
  var_type,
  0,
  CvRTParams(10,10,0,false,15,0,true,4,100,0.01f,CV_TERMCRIT_ITER)
);

Random trees prediction has a form similar to that of the decision trees prediction function CvDTree::predict(), but rather than returning a CvDTreeNode* pointer it returns the cumulative result over all the trees in the forest: the class receiving the majority of votes for classification, or the average of the tree outputs for regression. The missing mask is an optional parameter of the same dimension as the sample vector, in which nonzero values indicate a missing feature value in sample.

double CvRTrees::predict(
  const CvMat* sample,
  const CvMat* missing = 0
) const;

An example prediction call from the letter_recog.cpp file is

double r;
CvMat sample;

cvGetRow( data, &sample, i );

r = forest.predict( &sample );
r = fabs((double)r - responses->data.fl[i]) <= FLT_EPSILON ? 1 : 0;

In this code, the return value r is converted into a 1 if the prediction matches the true label stored in responses and into a 0 otherwise; summing these values over the test set yields the number of correct predictions.
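
In letter_recog.cpp this prediction sits inside a loop over the test portion of the data. A sketch of that accumulation, reusing the sample program's variable names (ntrain_samples and nsamples_all mark the train/test split), might look like this:

int correct = 0;
for( int i = ntrain_samples; i < nsamples_all; i++ ) {
  CvMat sample;
  cvGetRow( data, &sample, i );                          // i-th feature vector
  double r = forest.predict( &sample );                  // predicted class label
  if( fabs( r - responses->data.fl[i] ) <= FLT_EPSILON )
    correct++;                                           // prediction matched the truth
}
double test_hit_rate = (double)correct / (nsamples_all - ntrain_samples);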

Finally, there are random tree analysis and utility functions. Assuming that CvRTParams::calc_var_importance is set in training, we can obtain the relative importance of each variable by

const CvMat* CvRTrees::get_var_importance() const;
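
The returned matrix contains one floating-point entry per feature. A quick (illustrative) way to inspect it is simply to walk that row after training:

// Print the relative importance of each feature. Assumes the forest was
// trained with calc_var_importance enabled; otherwise the pointer may be NULL.
const CvMat* var_importance = forest.get_var_importance();
if( var_importance ) {
  for( int i = 0; i < var_importance->cols; i++ ) {
    printf( "feature %d: importance %.4f\n", i, var_importance->data.fl[i] );
  }
}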

See Figure 13-13 for an example of variable importance for the mushroom data set as computed by random trees. We can also obtain the learned random trees model's measure of the proximity of one data point to another by using the call

float CvRTrees::get_proximity(
  const CvMat* sample_1,
  const CvMat* sample_2
) const;

As mentioned previously, the returned proximity is 1 if the data points are identical and 0 if the points are completely different. This value is usually between 0 and 1 for two data points drawn from a distribution similar to that of the training set data.
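
For example, to compare two rows of the training matrix once the forest has been trained (the row indices here are arbitrary, for illustration only):

CvMat row_a, row_b;
cvGetRow( data, &row_a, 0 );                        // first sample
cvGetRow( data, &row_b, 1 );                        // second sample
float p = forest.get_proximity( &row_a, &row_b );   // near 1.0: very alike; near 0.0: very different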

Two other useful functions give the total number of trees or the data structure containing a given decision tree:

int           get_tree_count() const; // How many trees are in the forest
CvForestTree* get_tree(int i) const;  // Get an individual decision tree
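
These two calls can be combined to walk the forest one tree at a time, for instance to inspect the individual trees:

for( int t = 0; t < forest.get_tree_count(); t++ ) {
  CvForestTree* tree = forest.get_tree( t );
  // ...examine this tree here (CvForestTree is derived from CvDTree)...
}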

We've remarked that the random trees algorithm often performs the best (or among the best) on the data sets we tested, but the best policy is still to try many classifiers once you have your training data defined. We ran random trees, boosting, and decision trees on the mushroom data set. From the 8,124 data points we randomly extracted 1,624 test points, leaving the remainder as the training set. After training these three tree-based classifiers with their default parameters, we obtained the results shown in Table 13-4 on the test set. The mushroom data set is fairly easy, and so, although random trees did the best, the margin was not so large that we can definitively say which of the three classifiers works best on this particular data set.

What is more interesting is the variable importance (which we also measured from the classifiers), shown in Figure 13-13. The figure shows that random trees and boosting each used significantly fewer important variables than required by decision trees. Above 15% significance, random trees used only three variables and boosting used six whereas decision trees needed thirteen. We could thus shrink the feature set size to save computation and memory and still obtain good results. Of course, for the decision trees algorithm you have just a single tree while for random trees and AdaBoost you must evaluate multiple trees; thus, which method has the least computational cost depends on the nature of the data being used.



[259] Most of Breiman's work on random forests is conveniently collected on a single website (http://www.stat.berkeley.edu/users/breiman/RandomForests/cc_home.htm).

[260] This means that some data points might be randomly repeated.