Boosting

Decision trees are extremely useful, but they are often not the best-performing classifiers. In this and the next section we present two techniques, boosting and random trees, that use trees in their inner loop and so inherit many of the useful properties of trees (e.g., being able to deal with mixed and unnormalized data types and missing features). These techniques typically perform at or near the state of the art; thus they are often the best "out of the box" supervised classification techniques[252] available in the library.

Within the field of supervised learning there is a meta-learning algorithm (first described by Michael Kearns in 1988) called statistical boosting. Kearns wondered whether it is possible to learn a strong classifier out of many weak classifiers.[253] The first boosting algorithm, known as AdaBoost, was formulated shortly thereafter by Freund and Schapire.[254] OpenCV ships with four types of boosting:

CvBoost::DISCRETE (discrete AdaBoost)
CvBoost::REAL (real AdaBoost)
CvBoost::LOGIT (LogitBoost)
CvBoost::GENTLE (gentle AdaBoost)

Figure 13-12. Variable importance for edible mushroom as measured by an unbiased tree (left panel) and a tree biased against poison (right panel)

Each of these is a variant of the original AdaBoost, and often we find that the "real" and "gentle" forms of AdaBoost work best. Real AdaBoost is a technique that utilizes confidence-rated predictions and works well with categorical data. Gentle AdaBoost puts less weight on outlier data points and for that reason is often good with regression data. LogitBoost can also produce good regression fits. Because you need only set a flag, there's no reason not to try all types on a data set and then select the boosting method that works best.[255] Here we'll describe the original AdaBoost. For classification it should be noted that, as implemented in OpenCV, boosting is a two-class (yes-or-no) classifier[256] (unlike the decision tree or random tree classifiers, which can handle multiple classes at once). Of the different OpenCV boosting methods, LogitBoost and GentleBoost (referenced in the "Boosting Code" subsection to follow) can be used to perform regression in addition to binary classification.
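Because switching methods really is just a matter of one flag, a quick comparison might look like the sketch below; the boost_type field belongs to the CvBoostParams structure described shortly, and the training call itself is covered later in this section.

int types[] = { CvBoost::DISCRETE, CvBoost::REAL,
                CvBoost::LOGIT,    CvBoost::GENTLE };
for( int k = 0; k < 4; k++ ) {
  CvBoostParams params;
  params.boost_type = types[k];  // only this flag changes between runs
  // ... train with params (see CvBoost::train below) and compare
  // validation error to pick the best-performing variant ...
}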

Boosting algorithms are used to train T weak classifiers ht, t = 1, …, T. These classifiers are generally very simple individually. In most cases these classifiers are decision trees with only one split (called decision stumps) or at most a few levels of splits (perhaps up to three). Each of the classifiers is assigned a weighted vote αt in the final decision-making process. We use a labeled data set of input feature vectors xi, each with scalar label yi (where i = 1, …, M indexes the data points). For AdaBoost the label is binary, yi ∈ {–1, +1}, though it can be any floating-point number in other algorithms. We initialize a data point weighting distribution Dt(i) that tells the algorithm how much misclassifying a data point will "cost". The key feature of boosting is that, as the algorithm progresses, this cost will evolve so that weak classifiers trained later will focus on the data points that the earlier trained weak classifiers tended to do poorly on. The algorithm is as follows.

1. Initialize the data point weights uniformly: D1(i) = 1/M, i = 1, …, M.

2. For each round t = 1, …, T:

a. Find the weak classifier ht that minimizes the weighted error εt = Σi Dt(i), where the sum runs over the data points that ht misclassifies (those with ht(xi) ≠ yi).

b. If no classifier achieves εt < 0.5, quit. Otherwise set the voting weight of ht to αt = ½ ln[(1 – εt)/εt].

c. Update the weights: the new weight of data point i is Dt(i) exp(–αt yi ht(xi)) / Zt, where Zt is a normalization factor chosen so that the new weights sum to 1. Misclassified points thus gain weight and correctly classified points lose weight.

Note that, in step 2b, if we can't find a classifier with less than a 50% error rate then we quit; we probably need better features.
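As a quick, made-up illustration of the update in step 2c: suppose there are M = 4 data points with uniform initial weights D1(i) = 0.25 and the first weak classifier misclassifies only the third point. Then ε1 = 0.25 and α1 = ½ ln(0.75/0.25) ≈ 0.55; after updating and normalizing, the misclassified point's weight rises to 0.50 while each correctly classified point's weight falls to about 0.17, so the next weak classifier is pushed to concentrate on the point the first one got wrong.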

When the training algorithm just described is finished, the final strong classifier takes a new input vector x and classifies it using a weighted sum over the learned weak classifiers ht:

H(x) = sign( Σt αt ht(x) )

Here, the sign function converts anything positive into a 1 and anything negative into a –1 (zero remains 0).
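As a rough sketch of this decision rule in code, the following placeholder function takes the learned voting weights alpha[t] and the weak classifiers' ±1 outputs h[t] on a sample and returns the strong classifier's vote; in practice CvBoost::predict performs this combination internally.

// Weighted vote over the weak classifiers' outputs (placeholder arrays).
float strong_classify( const float* alpha, const int* h, int T ) {
  float sum = 0.f;
  for( int t = 0; t < T; t++ )
    sum += alpha[t] * h[t];                            // alpha_t * h_t(x)
  return sum > 0.f ? 1.f : (sum < 0.f ? -1.f : 0.f);   // sign()
}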

There is example code in …/opencv/samples/c/letter_recog.cpp that shows how to use boosting, random trees, and back-propagation (aka the multilayer perceptron, MLP). The code for boosting is similar to the code for decision trees but with its own control parameters:

struct CvBoostParams : public CvDTreeParams {
  int    boost_type;        // CvBoost::DISCRETE, REAL, LOGIT, or GENTLE
  int    weak_count;        // How many weak classifiers to train
  int    split_criteria;    // CvBoost::DEFAULT, GINI, MISCLASS, or SQERR
  double weight_trim_rate;  // Threshold (0 to 1) for trimming low-weight samples
  CvBoostParams();
  CvBoostParams(
    int          boost_type,
    int          weak_count,
    double       weight_trim_rate,
    int          max_depth,
    bool         use_surrogates,
    const float* priors
  );
};

In CvBoostParams, boost_type selects one of the four boosting algorithms listed previously. The split_criteria is one of the following:

CvBoost::DEFAULT (use the default criterion for the chosen boosting method)
CvBoost::GINI (the Gini index; the default for real AdaBoost)
CvBoost::MISCLASS (the misclassification rate; the default for discrete AdaBoost)
CvBoost::SQERR (least-square error; the only option available for LogitBoost and gentle AdaBoost)

The last parameter, weight_trim_rate, is included for computational savings. As training goes on, many data points become unimportant; that is, the weight Dt(i) of the ith data point becomes very small. The weight_trim_rate is a threshold between 0 and 1 (inclusive) that is implicitly used to throw away some training samples in a given boosting iteration. For example, suppose weight_trim_rate is set to 0.95. Then the lowest-weighted samples, whose weights together sum to no more than 1.0 − 0.95 = 0.05 (5% of the total weight mass), do not participate in the next iteration of training. Note the words "next iteration": the samples are not discarded forever. When the next weak classifier is trained, the weights are recomputed for all samples, so some previously insignificant samples may be returned to the training set. To turn this functionality off, set weight_trim_rate to 0.

Observe that CvBoostParams{} inherits from CvDTreeParams{}, so we may set other parameters that are related to decision trees. In particular, if we are dealing with features that may be missing[257] then we can set use_surrogates to true, which will ensure that alternate features on which the splitting is based are stored at each node. An important option is that of using priors to set the "cost" of false positives. Again, if we are learning edible or poisonous mushrooms then we might set the priors to be float priors[] = {1.0, 10.0}; then each error of labeling a poisonous mushroom edible would cost ten times as much as labeling an edible mushroom poisonous.
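As a concrete (though arbitrary) illustration, the six-argument constructor shown earlier could be filled in as follows to combine the ten-to-one poison penalty with surrogate splits and weight trimming; the specific numbers are only examples, not recommendations.

// Illustrative setup: penalize poisonous-labeled-edible errors 10x,
// keep surrogate splits for missing features, and trim low-weight samples.
float priors[] = { 1.0f, 10.0f };
CvBoostParams params(
  CvBoost::REAL,   // boost_type
  100,             // weak_count
  0.95,            // weight_trim_rate
  1,               // max_depth: decision stumps
  true,            // use_surrogates
  priors
);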

The CvBoost class contains the member weak, a CvSeq* pointer to a sequence of the weak classifiers, each of which is a CvBoostTree that inherits from the CvDTree decision tree.[258] For LogitBoost and GentleBoost, the trees are regression trees (trees that predict floating-point values); for the other methods, the decision trees return only votes for class 0 (if positive) or class 1 (if negative). This contained class sequence has the following prototype:

class CvBoostTree: public CvDTree {

public:
    CvBoostTree();
    virtual ~CvBoostTree();
    virtual bool train(
      CvDTreeTrainData* _train_data,
      const CvMat*      subsample_idx,
      CvBoost*          ensemble
    );
    virtual void scale( double s );
    virtual void read(
      CvFileStorage*    fs,
      CvFileNode*       node,
      CvBoost*          ensemble,
      CvDTreeTrainData* _data
    );
    virtual void clear();

protected:
    ...
    CvBoost* ensemble;

};

Training is almost the same as for decision trees, but there is an extra parameter called update that is set to false (0) by default. With this setting, we train a whole new ensemble of weak classifiers from scratch. If update is set to true (1) then we just add new weak classifiers onto the existing group. The function prototype for training a boosted classifier is:

bool CvBoost::train(
  const CvMat*  _train_data,
  int           _tflag,
  const CvMat*  _responses,
  const CvMat*  _var_idx      = 0,
  const CvMat*  _sample_idx   = 0,
  const CvMat*  _var_type     = 0,
  const CvMat*  _missing_mask = 0,
  CvBoostParams params        = CvBoostParams(),
  bool          update        = false
);

An example of training a boosted classifier may be found in …/opencv/samples/c/letter_recog.cpp. The training code snippet is shown in Example 13-3.
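As a minimal sketch of such a call (this is not the Example 13-3 listing; train_data, responses, and var_type stand for CvMat* structures prepared just as they would be for a decision tree), training and then extending a boosted classifier might look like this:

CvBoost boost;
CvBoostParams params( CvBoost::REAL, 100, 0.95, 1, false, 0 );

// Train an ensemble of weak classifiers from scratch (update defaults to false).
boost.train( train_data, CV_ROW_SAMPLE, responses,
             0, 0, var_type, 0, params );

// Later, add new weak classifiers onto the existing group.
boost.train( train_data, CV_ROW_SAMPLE, responses,
             0, 0, var_type, 0, params, true );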

The prediction function for boosting is also similar to that for decision trees:

float CvBoost::predict(
  const CvMat* sample,
  const CvMat* missing        = 0,
  CvMat*       weak_responses = 0,
  CvSlice      slice          = CV_WHOLE_SEQ,
  bool         raw_mode       = false
) const;

To perform a simple prediction, we pass in the feature vector sample and then predict() returns the predicted value. Of course, there are a variety of optional parameters. The first of these is the missing feature mask, which is the same as it was for decision trees; it consists of a byte vector of the same dimension as the sample vector, where nonzero values indicate a missing feature. (Note that this mask cannot be used unless you trained the classifier with the use_surrogates parameter set to true.)
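For example, a hypothetical four-feature sample whose third feature is missing could be classified as follows (boost is assumed to be a trained CvBoost, and the feature values are made up):

// Hypothetical sample with four features, the third of which is missing.
float sample_data[4]  = { 1.5f, 0.f, 0.f, 7.f };
uchar missing_data[4] = { 0, 0, 1, 0 };      // nonzero marks a missing feature
CvMat sample  = cvMat( 1, 4, CV_32FC1, sample_data );
CvMat missing = cvMat( 1, 4, CV_8UC1,  missing_data );
float prediction = boost.predict( &sample, &missing );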

If we want to get back the responses of each of the weak classifiers, we can pass in a floating-point CvMat vector, weak_responses, with length equal to the number of weak classifiers. If weak_responses is passed, CvBoost::predict will fill the vector with the response of each individual classifier:

CvMat* weak_responses = cvCreateMat(
  1,
  boostedClassifier.get_weak_predictors()->total,
  CV_32F
);

The next prediction parameter, slice, indicates which contiguous subset of the weak classifiers to use; it can be set by

inline CvSlice cvSlice( int start, int end );

However, we usually just accept the default and leave slice set to "every weak classifier" (CvSlice slice=CV_WHOLE_SEQ). Finally, we have the raw_mode, which is off by default but can be turned on by setting it to true. This parameter is exactly the same as for decision trees and indicates that the data is prenormalized to save computation time. Normally you won't need to use this. An example call for boosted prediction is

boost.predict( temp_sample, 0, weak_responses );
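If, hypothetically, we wanted the vote of only the first ten weak classifiers, we could instead write

boost.predict( temp_sample, 0, 0, cvSlice( 0, 10 ) );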

Finally, some auxiliary functions may be of use from time to time. We can remove a weak classifier from the learned model via

void CvBoost::prune( CvSlice slice );

We can also return all the weak classifiers for examination:

CvSeq* CvBoost::get_weak_predictors();

This function returns a CvSeq of pointers to CvBoostTree.
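For example, this sequence can be walked with a CvSeqReader to examine each weak tree in turn; here boostedClassifier is assumed to be a trained CvBoost.

// Walk the sequence of weak classifiers returned by get_weak_predictors().
CvSeq* weak = boostedClassifier.get_weak_predictors();
CvSeqReader reader;
cvStartReadSeq( weak, &reader );
for( int i = 0; i < weak->total; i++ ) {
  CvBoostTree* tree;
  CV_READ_SEQ_ELEM( tree, reader );   // fetch the next CvBoostTree* pointer
  // ... examine the tree here, e.g. tree->scale( 0.5 ); ...
}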



[252] Recall that the "no free lunch" theorem informs us that there is no a priori "best" classifier. But on many data sets of interest in vision, boosting and random trees perform quite well.

[254] Y. Freund and R. E. Schapire, "Experiments with a New Boosting Algorithm", in Machine Learning: Proceedings of the Thirteenth International Conference (Morgan Kaufmann, San Francisco, 1996), 148–156.

[255] This procedure is an example of the machine learning metatechnique known as voodoo learning or voodoo programming. Although unprincipled, it is often an effective method of achieving the best possible performance. Sometimes, after careful thought, one can figure out why the best-performing method was the best, and this can lead to a deeper understanding of the data. Sometimes not.

[256] There is a trick called unrolling that can be used to adapt any binary classifier (including boosting) for N-class classification problems, but this makes both training and prediction significantly more expensive. See …/opencv/samples/c/letter_recog.cpp.

[257] Note that, for computer vision, features are computed from an image and then fed to the classifier; hence they are almost never "missing". Missing features arise often in data collected by humans—for example, forgetting to take the patient's temperature one day.