Common Routines in the ML Library

This chapter is written to get you up and running with the machine learning algorithms. As you try out and become comfortable with different methods, you'll also want to reference the …/opencv/docs/ref/opencvref_ml.htm manual that installs with OpenCV and/or the online OpenCV Wiki documentation (http://opencvlibrary.willowgarage.com/). Because this portion of the library is under active development, you will want to know about the latest and greatest available tools.

All the routines in the ML library[238] are written as C++ classes, all derived from the CvStatModel class, which holds the methods that are universal to all the algorithms. These methods are listed in Table 13-3. Note that in CvStatModel there are two ways of storing and recalling the model from disk: save() versus write() and load() versus read(). For machine learning models, you should use the much simpler save() and load(), which essentially wrap the more complex write() and read() functions into an interface that writes and reads XML and YAML to and from disk. Beyond that, the two most important functions for learning from data, train() and predict(), vary by algorithm and will be discussed next.

Table 13-3. Base class methods for the machine learning (ML) library

CvStatModel:: Methods

Description

save(
   const char* filename,
   const char* name    = 0
)

Saves the learned model in XML or YAML. Use this method for storage.

load(
   const char* filename,
   const char* name=0
);

Calls clear() and then loads the XML or YAML model. Use this method for recall.

clear()

De-allocates all memory. Ready for reuse.

bool train(
  -data points-,
  [flags,]
  -responses-,
  [flags, etc.]
);

The training function to learn a model of the dataset. Training is specific to the algorithm and so the input parameters will vary.

float predict(
   const CvMat* sample
   [,<prediction_params>]
 ) const;

After training, use this function to predict the label or value of a new data point (or points).

Constructor, Destructor:

CvStatModel();
CvStatModel(
  const CvMat* train_data ...
);

Default constructor and constructor that allows creation and training of the model in one shot.

CvStatModel::~CvStatModel();

The destructor of the ML model.

Write/Read support (but use save/load above instead):

write(
  CvFileStorage* storage,
  const char*    name
);

Generic CvFileStorage structured write to disk, located in the cxcore library (discussed in Chapter 3) and called by save().

read(
  CvFileStorage* storage,
  CvFileNode*    node
);

Generic file read to CvFileStorage structure, located in the cxcore library and called by load().
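
For example, assuming model is an already trained CvNormalBayesClassifier (any class derived from CvStatModel works the same way; the filename here is invented), a save/load round trip looks like this:

// 'model' is assumed to be an already trained CvNormalBayesClassifier.
model.save( "mushroom_bayes.xml" );    // save() wraps write(); output is XML here

CvNormalBayesClassifier model2;
model2.load( "mushroom_bayes.xml" );   // load() calls clear(), then wraps read()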

The training prototype is as follows:

bool CvStatModel::train(
  const CvMat* train_data,
  [int tflag,]               ...,
  const CvMat* responses,    ...,
  [const CvMat* var_idx,]    ...,
  [const CvMat* sample_idx,] ...,
  [const CvMat* var_type,]   ...,
  [const CvMat* missing_mask,]
  <misc_training_alg_params> ...
);

The train() method for the machine learning algorithms can assume different forms according to what the algorithm can do. All algorithms take a CvMat matrix pointer as training data. This matrix must be of type 32FC1 (32-bit, floating-point, single-channel). CvMat does allow for multichannel matrices, but the machine learning algorithms take only a single channel—that is, just a two-dimensional matrix of numbers. Typically this matrix is organized as rows of data points, where each "point" is represented as a vector of features. Hence the columns contain the individual features for each data point and the data points are stacked to yield the 2D single-channel training matrix. To belabor the topic: the typical data matrix is thus composed of (rows, columns) = (data points, features). However, some algorithms can handle transposed matrices directly. For such algorithms you may use the tflag parameter to tell the algorithm that the training points are organized in columns. This is just a convenience so that you won't have to transpose a large data matrix. When the algorithm can handle both row-order and column-order data, the following flags apply.
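
tflag = CV_ROW_SAMPLE
Means that the feature vectors are stored as rows (this is the default).

tflag = CV_COL_SAMPLE
Means that the feature vectors are stored as columns.

As a concrete sketch of the row-organized layout (the matrix size and the get_feature() helper are invented for illustration), a training set of 100 data points with 5 features each would be packed like this:

#include "cxcore.h"

// 100 data points (rows), each with 5 features (columns), as 32-bit floats
CvMat* train_data = cvCreateMat( 100, 5, CV_32FC1 );
for( int r = 0; r < 100; r++ ) {        // one row per data point
    for( int c = 0; c < 5; c++ ) {      // one column per feature
        // get_feature() is a hypothetical stand-in for your measurement source
        CV_MAT_ELEM( *train_data, float, r, c ) = get_feature( r, c );
    }
}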

The reader may well ask: What if my training data is not floating-point numbers but instead is letters of the alphabet or integers representing musical notes or names of plants? The answer is: Fine, just turn them into unique 32-bit floating-point numbers when you fill the CvMat. If you have letters as features or labels, you can cast the ASCII characters to floats when filling the data array. The same applies to integers. As long as the conversion is unique, things should work—but remember that some routines are sensitive to widely differing variances among features. It's generally best to normalize the variance of features as discussed previously. With the exception of the tree-based algorithms (decision trees, random trees, and boosting), which support both categorical and ordered input variables, all other OpenCV ML algorithms work only with ordered inputs. A popular technique for making ordered-input algorithms also work with categorical data is to represent the categorical values in 1-radix notation; for example, if the input variable color can take seven different values, then it can be replaced by seven binary variables, where one and only one of the variables may be set to 1.
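
Here is a minimal sketch of that 1-radix (one-of-k) expansion; the variables color_index, row, and first_color_col are hypothetical:

// The categorical variable 'color', with 7 possible values, becomes
// 7 binary features, exactly one of which is set to 1 for each data point.
for( int k = 0; k < 7; k++ ) {
    CV_MAT_ELEM( *train_data, float, row, first_color_col + k ) =
        ( k == color_index ) ? 1.0f : 0.0f;
}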

The responses parameter holds either categorical labels such as "poisonous" or "nonpoisonous", as with mushroom identification, or regression values (numbers) such as body temperatures taken with a thermometer. The response values or "labels" are usually a one-dimensional vector of one value per data point—except for neural networks, which can have a vector of responses for each data point. Response values take one of two types: for categorical responses, the type can be integer (32SC1); for regression values, the response is 32-bit floating-point (32FC1). Observe also that some algorithms can deal only with classification problems and others only with regression, while some can handle both. In this last case, the type of the output variable is passed either as a separate parameter or as the last element of a var_type vector, which can be set as follows.
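
CV_VAR_CATEGORICAL
Means that the output values are discrete class labels.

CV_VAR_ORDERED (= CV_VAR_NUMERICAL)
Means that the output values are ordered numbers; that is, two values can be compared, so this is a regression problem.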

The types of input variables can also be specified using var_type. However, algorithms of the regression type can handle only ordered-input variables. It is sometimes possible to make up an ordering for categorical variables, as long as the ordering is kept consistent, but this can cause difficulties for regression because the pretend "ordered" values may jump around wildly when they have no physical basis for their imposed order.

Many models in the ML library may be trained on a selected feature subset and/or on a selected sample subset of the training set. To make this easier for the user, the method train() usually includes the vectors var_idx and sample_idx as parameters. These may be defaulted to "use all data" by passing NULL values, but var_idx can be used to identify variables (features) of interest and sample_idx can identify data points of interest. Using these, you may specify which features and which sample points to train on. Both vectors are either single-channel integer (CV_32SC1) vectors—that is, lists of zero-based indices—or single-channel 8-bit (CV_8UC1) masks of active variables/samples, where a nonzero value signifies active. The parameter sample_idx is particularly helpful when you've read in a chunk of data and want to use some of it for training and some of it for testing without breaking it into two different vectors.
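
As a sketch of such a train/test split (the sizes are invented; the index vector and the mask are two equivalent alternatives):

// Train on the first 50 of 100 samples, using a CV_32SC1 index vector:
CvMat* sample_idx = cvCreateMat( 1, 50, CV_32SC1 );
for( int i = 0; i < 50; i++ )
    CV_MAT_ELEM( *sample_idx, int, 0, i ) = i;    // zero-based row indices

// ... or, equivalently, a CV_8UC1 mask where nonzero means "active":
CvMat* sample_mask = cvCreateMat( 1, 100, CV_8UC1 );
cvZero( sample_mask );
for( int i = 0; i < 50; i++ )
    CV_MAT_ELEM( *sample_mask, uchar, 0, i ) = 1;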

Additionally, some algorithms can handle missing measurements. For example, when the authors were working with manufacturing data, some measurement features would end up missing during the time that workers took coffee breaks. Sometimes experimental data is simply forgotten, as when someone forgets to take a patient's temperature one day during a medical experiment. For such situations, the parameter missing_mask, an 8-bit matrix of the same dimensions as train_data, is used to mark the missing values (the nonzero elements of the mask). Some algorithms cannot handle missing values, so the missing points should be interpolated by the user before training or the corrupted records should be rejected in advance. Other algorithms, such as decision trees and naïve Bayes, handle missing values in different ways: decision trees use alternative splits (called "surrogate splits" by Breiman), while the naïve Bayes algorithm infers the values.
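
As a sketch (which entries are missing is of course data-dependent; the indices here are invented):

// missing_mask: 8-bit, same dimensions as train_data; nonzero marks a missing value
CvMat* missing_mask = cvCreateMat( train_data->rows, train_data->cols, CV_8UC1 );
cvZero( missing_mask );                            // start with "nothing missing"
CV_MAT_ELEM( *missing_mask, uchar, 17, 3 ) = 1;    // feature 3 of data point 17 was never measured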

Usually, the previous model state is cleared by clear() before running the training procedure. However, some algorithms may optionally update the model with the new training data instead of starting from scratch.

When using the method predict(), the var_idx parameter that specifies which features were used in the train() method is remembered and then used to extract only the necessary components from the input sample. The general form of the predict() method is as follows:

float CvStatModel::predict(
  const CvMat* sample
  [, <prediction_params>]
) const;

This method is used to predict the response for a new input data vector. When using a classifier, predict() returns a class label. For the case of regression, this method returns a numerical value. Note that the input sample must have as many components as the train_data that was used for training. Additional prediction_params are algorithm-specific and allow for such things as missing feature values in tree-based methods. The function suffix const tells us that prediction does not affect the internal state of the model, so this method is thread-safe and can be run in parallel, which is useful for web servers performing image retrieval for multiple clients and for robots that need to accelerate the scanning of a scene.
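
For example, assuming a trained CvNormalBayesClassifier and the five-feature layout sketched earlier (both are assumptions for illustration):

// 'classifier' is assumed to be already trained on 5-feature row vectors
CvMat* sample = cvCreateMat( 1, 5, CV_32FC1 );   // one row, same feature count as in training
// ... fill in the five feature values of the new point ...
float response = classifier.predict( sample );   // a class label here; a number for regression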

Although the iteration control structure CvTermCriteria has been discussed in other chapters, it is used by several machine learning routines. So, just to remind you of what the structure is, we repeat it here.

typedef struct CvTermCriteria {
    int    type;     /* CV_TERMCRIT_ITER and/or CV_TERMCRIT_EPS */
    int    max_iter; /* maximum number of iterations */
    double epsilon;  /* stop when error is below this value    */
} CvTermCriteria;

The integer parameter max_iter sets the total number of iterations that the algorithm will perform. The epsilon parameter sets an error threshold stopping criterion; when the error drops below this level, the routine stops. Finally, the type tells which of these two criteria to use, and you may combine them with a bitwise OR and so use both (CV_TERMCRIT_ITER | CV_TERMCRIT_EPS). The defined values for term_crit.type are:

#define CV_TERMCRIT_ITER    1
#define CV_TERMCRIT_NUMBER  CV_TERMCRIT_ITER
#define CV_TERMCRIT_EPS     2
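
In practice you rarely fill the structure by hand; the cvTermCriteria() convenience function in cxcore does it for you. For example, to stop after at most 100 iterations or once the error falls below 0.01, whichever comes first:

CvTermCriteria term_crit = cvTermCriteria(
    CV_TERMCRIT_ITER | CV_TERMCRIT_EPS,   // use both stopping criteria
    100,                                  // max_iter
    0.01                                  // epsilon
);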

Let's now move on to describing the specific algorithms implemented in OpenCV. We will start with the frequently used Mahalanobis distance metric and then go into some detail on one unsupervised algorithm (K-means); both of these may be found in the cxcore library. We then move into the machine learning library proper with the normal Bayes classifier, after which we discuss the tree-based algorithms (decision trees, boosting, random trees, and Haar cascade). For the other algorithms we'll provide short descriptions and usage examples.