This chapter is written to get you up and running with the machine learning algorithms. As you try out and become comfortable with different methods, you'll also want to reference the …/opencv/docs/ref/opencvref_ml.htm manual that installs with OpenCV and/or the online OpenCV Wiki documentation (http://opencvlibrary.willowgarage.com/). Because this portion of the library is under active development, you will want to know about the latest and greatest available tools.
All the routines in the ML library[238] are written as C++ classes, all derived from the CvStatModel class, which holds the methods that are universal to all the algorithms. These methods are listed in Table 13-3. Note that CvStatModel provides two ways of storing and recalling a model from disk: save() versus write() and load() versus read(). For machine learning models, you should use the much simpler save() and load(), which essentially wrap the more complex write() and read() functions into an interface that writes and reads XML and YAML to and from disk. Beyond that, the two most important functions for learning from data, predict() and train(), vary by algorithm and will be discussed next.
Table 13-3. Base class methods for the machine learning (ML) library
| Method | Description |
|---|---|
| `save( const char* filename, const char* name = 0 );` | Saves the learned model in XML or YAML. Use this method for storage. |
| `load( const char* filename, const char* name = 0 );` | Calls `clear()` and then loads the XML or YAML model from disk. Use this method for recall. |
| `clear();` | De-allocates all memory. Ready for reuse. |
| `bool train( -data points-, [flags] -responses-, [flags etc] );` | The training function to learn a model of the dataset. Training is specific to the algorithm and so the input parameters will vary. |
| `float predict( const CvMat* sample [,<prediction_params>] ) const;` | After training, use this function to predict the label or value of a new data point or points. |
| **Constructor, Destructor:** | |
| `CvStatModel(); CvStatModel( const CvMat* train_data ... );` | Default constructor, and a constructor that allows creation and training of the model in one shot. |
| `CvStatModel::~CvStatModel();` | The destructor of the ML model. |
| **Write/Read support (but use save/load above instead):** | |
| `write( CvFileStorage* storage, const char* name );` | Generic write to disk via `CvFileStorage`; wrapped by the simpler `save()`. |
| `read( CvFileStorage* storage, CvFileNode* node );` | Generic file read to a `CvFileStorage` structure; wrapped by the simpler `load()`. |
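To give a concrete feel for these methods, here is a minimal sketch of the save/load round trip using the normal Bayes classifier covered later in this chapter. The helper function, the filename, and the matrix contents are our own invention for illustration:

```cpp
#include "ml.h"   // OpenCV machine learning library

// Train a classifier, save it, and recall it later.
// train_data: N x d CV_32FC1 matrix; responses: N x 1 class labels.
void save_and_reload( const CvMat* train_data, const CvMat* responses )
{
    CvNormalBayesClassifier bayes;
    bayes.train( train_data, responses );   // learn the model
    bayes.save( "bayes_model.xml" );        // written to disk as XML

    // ... later, possibly in another program:
    CvNormalBayesClassifier bayes2;
    bayes2.load( "bayes_model.xml" );       // clear() + read the model back
}
```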
The training prototype is as follows:
```cpp
bool CvStatModel::train(
    const CvMat*   train_data,
    [int           tflag,]         ...,
    const CvMat*   responses,      ...,
    [const CvMat*  var_idx,]       ...,
    [const CvMat*  sample_idx,]    ...,
    [const CvMat*  var_type,]      ...,
    [const CvMat*  missing_mask,]
    <misc_training_alg_params>     ...
);
```
The train() method for the machine learning algorithms can assume different forms according to what the algorithm can do. All algorithms take a CvMat matrix pointer as training data. This matrix must be of type 32FC1 (32-bit, floating-point, single-channel). CvMat does allow for multichannel images, but the machine learning algorithms take only a single channel—that is, just a two-dimensional matrix of numbers. Typically this matrix is organized as rows of data points, where each "point" is represented as a vector of features. Hence the columns contain the individual features for each data point, and the data points are stacked to yield the 2D single-channel training matrix. To belabor the topic: the typical data matrix is thus composed of (rows, columns) = (data points, features). However, some algorithms can handle transposed matrices directly. For such algorithms you may use the tflag parameter to tell the algorithm that the training points are organized in columns. This is just a convenience so that you won't have to transpose a large data matrix. When the algorithm can handle both row-order and column-order data, the following flags apply (a short training sketch follows the list).
tflag = CV_ROW_SAMPLE
Means that the feature vectors are stored as rows (default)
tflag = CV_COL_SAMPLE
Means that the feature vectors are stored as columns
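For instance, a decision tree (one of the algorithms that accepts tflag) could be trained on row-ordered data roughly as follows. This is a sketch with invented dimensions; filling the matrices is elided:

```cpp
#include "ml.h"

// 10 data points in rows, 3 features in columns (sizes invented).
CvMat* data      = cvCreateMat( 10, 3, CV_32FC1 );
CvMat* responses = cvCreateMat( 10, 1, CV_32FC1 );
// ... fill data and responses here ...

CvDTree tree;
tree.train(
    data,
    CV_ROW_SAMPLE,   // each row of data is one feature vector
    responses
);
```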
The reader may well ask: What if my training data is not floating-point numbers but instead is letters of the alphabet or integers representing musical notes or names of plants? The answer is: Fine, just turn them into unique 32-bit floating-point numbers when you fill the CvMat. If you have letters as features or labels, you can cast the ASCII character to a float when filling the data array. The same applies to integers. As long as the conversion is unique, things should work—but remember that some routines are sensitive to widely differing variances among features. It's generally best to normalize the variance of features as discussed previously. With the exception of the tree-based algorithms (decision trees, random trees, and boosting), which support both categorical and ordered input variables, all other OpenCV ML algorithms work only with ordered inputs. A popular technique for making ordered-input algorithms also work with categorical data is to represent the categories in 1-radix notation (also known as "one-hot" encoding); for example, if the input variable color may take seven different values, then it may be replaced by seven binary variables, where one and only one of the variables may be set to 1.
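Such an encoding can be done by hand when filling the data matrix. A small sketch, with the color value invented:

```cpp
// Expand a categorical variable (color, coded 0..6) into seven
// binary features; exactly one of them is set to 1.
int   color = 3;                        // invented categorical value
float binary[7] = { 0, 0, 0, 0, 0, 0, 0 };
binary[color] = 1.0f;                   // 1-of-N ("one-hot") encoding
// These seven values then occupy seven columns of the training matrix.
```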
The responses are either categorical labels such as "poisonous" or "nonpoisonous", as with mushroom identification, or regression values (numbers) such as body temperatures taken with a thermometer. The response values or "labels" are usually a one-dimensional vector of one value per data point—except for neural networks, which can have a vector of responses for each data point. Response values are one of two types: for categorical responses, the type can be integer (32SC1); for regression values, the response is 32-bit floating-point (32FC1). Observe also that some algorithms can deal only with classification problems, others only with regression, and still others with both. In this last case, the type of the output variable is passed either as a separate parameter or as the last element of a var_type vector, which can be set as follows.
CV_VAR_CATEGORICAL
Means that the output values are discrete class labels
CV_VAR_ORDERED (= CV_VAR_NUMERICAL)
Means that the output values are ordered; that is, different values can be compared as numbers and so this is a regression problem
The types of input variables can also be specified using var_type. However, algorithms of the regression type can handle only ordered-input variables. It is sometimes possible to make up an ordering for categorical variables as long as the order is kept consistent, but this can cause difficulties for regression because the pretend "ordered" values may jump around wildly when they have no physical basis for their imposed order.
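With a tree-based algorithm, for example, you might mark all inputs as ordered and the response as categorical, which turns the problem into classification. A sketch, with invented matrix sizes:

```cpp
#include "ml.h"

// data: N x n_features CV_32FC1 matrix; responses: N x 1 labels (invented setup).
int n_features = 3;
CvMat* data      = cvCreateMat( 10, n_features, CV_32FC1 );
CvMat* responses = cvCreateMat( 10, 1, CV_32FC1 );
// ... fill data and responses here ...

// One entry per input variable, plus one for the response.
CvMat* var_type = cvCreateMat( n_features + 1, 1, CV_8UC1 );
cvSet( var_type, cvScalarAll(CV_VAR_ORDERED) );      // all inputs ordered
CV_MAT_ELEM( *var_type, uchar, n_features, 0 ) =
    CV_VAR_CATEGORICAL;                              // response is a class label

CvDTree tree;
tree.train( data, CV_ROW_SAMPLE, responses, 0, 0, var_type );
```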
Many models in the ML library may be trained on a selected feature subset and/or a selected sample subset of the training set. To make this easier for the user, the method train() usually includes the vectors var_idx and sample_idx as parameters. These may be defaulted to "use all data" by passing NULL values, but var_idx can be used to identify variables (features) of interest and sample_idx can identify data points of interest. Using these, you may specify which features and which sample points to train on. Both vectors are either single-channel integer (CV_32SC1) vectors—that is, lists of zero-based indices—or single-channel 8-bit (CV_8UC1) masks of active variables/samples, where a nonzero value signifies active. The parameter sample_idx is particularly helpful when you've read in a chunk of data and want to use some of it for training and some of it for test without breaking it into two different vectors.
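For example, to train on the first 70 of 100 samples while holding out the rest for testing, you could build sample_idx like this (a sketch; the counts are invented):

```cpp
#include "cxcore.h"

// Use the first 70 of 100 samples for training; the rest serve as a
// test set without ever splitting the data matrix.
int n_train = 70;   /* out of 100 total samples */
CvMat* sample_idx = cvCreateMat( 1, n_train, CV_32SC1 );
for( int i = 0; i < n_train; i++ )
    CV_MAT_ELEM( *sample_idx, int, 0, i ) = i;   // zero-based indices
// Pass sample_idx to train(); rows 70..99 remain available for testing.
```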
Additionally, some algorithms can handle missing measurements. For example, when the authors were working with manufacturing data, some measurement features would end up missing while workers took coffee breaks. Sometimes experimental data is simply lost, as when someone forgets to take a patient's temperature one day during a medical experiment. For such situations, the parameter missing_mask, an 8-bit matrix of the same dimensions as train_data, is used to mark the missing values (nonzero elements of the mask). Some algorithms cannot handle missing values, so the missing points should be interpolated by the user before training or the corrupted records should be rejected in advance. Other algorithms, such as decision trees and naïve Bayes, handle missing values in different ways: decision trees use alternative splits (called "surrogate splits" by Breiman), while the naïve Bayes algorithm infers the values.
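Building such a mask might look like the following sketch, which assumes data is the training matrix from an earlier sketch and (purely as an invented convention) that a negative raw value marks an absent measurement:

```cpp
#include "cxcore.h"

// missing_mask has the same dimensions as train_data; nonzero entries
// mark measurements that are absent.
CvMat* missing_mask = cvCreateMat( data->rows, data->cols, CV_8UC1 );
cvZero( missing_mask );
for( int r = 0; r < data->rows; r++ )
    for( int c = 0; c < data->cols; c++ )
        if( CV_MAT_ELEM( *data, float, r, c ) < 0 )   // assumed sentinel
            CV_MAT_ELEM( *missing_mask, uchar, r, c ) = 1;
```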
Usually, the previous model state is cleared by clear() before running the training procedure. However, some algorithms may optionally update the model with the new training data instead of starting from scratch.
When using the method predict(), the var_idx parameter that specifies which features were used in the train() method is remembered and then used to extract only the necessary components from the input sample. The general form of the predict() method is as follows:
```cpp
float CvStatModel::predict( const CvMat* sample [, <prediction_params>] ) const;
```
This method is used to predict the response for a new input data vector. When using a classifier, predict() returns a class label. For regression, this method returns a numerical value. Note that the input sample must have the same number of components as the train_data that was used for training. Additional prediction_params are algorithm-specific and allow for such things as missing feature values in tree-based methods. The const suffix on the method tells us that prediction does not affect the internal state of the model, so this method is thread-safe and can be run in parallel, which is useful for web servers performing image retrieval for multiple clients and for robots that need to accelerate the scanning of a scene.
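A minimal prediction sketch, assuming the trained bayes classifier from the earlier example and three invented feature values:

```cpp
// Query a trained model with one new point; the point must have the
// same number of features (here 3) used in training.
CvMat* sample = cvCreateMat( 1, 3, CV_32FC1 );
CV_MAT_ELEM( *sample, float, 0, 0 ) = 0.5f;    // invented feature values
CV_MAT_ELEM( *sample, float, 0, 1 ) = 1.2f;
CV_MAT_ELEM( *sample, float, 0, 2 ) = -0.3f;

float label = bayes.predict( sample );   // returns a class label here
cvReleaseMat( &sample );
```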
Although the iteration control structure CvTermCriteria has been discussed in other chapters, it is used by several machine learning routines. So, just to remind you of what the structure is, we repeat it here.
```cpp
typedef struct CvTermCriteria {
    int    type;      /* CV_TERMCRIT_ITER and/or CV_TERMCRIT_EPS */
    int    max_iter;  /* maximum number of iterations */
    double epsilon;   /* stop when error is below this value */
} CvTermCriteria;
```
The integer parameter max_iter sets the total number of iterations that the algorithm will perform. The epsilon parameter sets an error threshold stopping criterion: when the error drops below this level, the routine stops. Finally, type tells which of these two criteria to use, though you may OR the criteria together and so use both (CV_TERMCRIT_ITER | CV_TERMCRIT_EPS). The defined values for term_crit.type are:
```cpp
#define CV_TERMCRIT_ITER    1
#define CV_TERMCRIT_NUMBER  CV_TERMCRIT_ITER
#define CV_TERMCRIT_EPS     2
```
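As an example of setting these criteria, the K-means routine discussed later in this chapter takes a CvTermCriteria directly. A sketch, assuming samples is an N-by-d matrix of CV_32FC1 data prepared elsewhere:

```cpp
#include "cxcore.h"

// Cluster into 4 groups; stop after 100 iterations or once the cluster
// centers move less than 0.01, whichever comes first.
CvMat* labels = cvCreateMat( samples->rows, 1, CV_32SC1 );
cvKMeans2( samples, 4, labels,
           cvTermCriteria( CV_TERMCRIT_ITER | CV_TERMCRIT_EPS,
                           100,        /* max_iter */
                           0.01 ) );   /* epsilon  */
```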
Let's now move on to describing specific algorithms that are implemented in OpenCV. We will start with the frequently used Mahalanobis distance metric and then go into some detail on one unsupervised algorithm (K-means); both of these may be found in the cxcore library. We then move into the machine learning library proper with the normal Bayes classifier, after which we discuss decision-tree algorithms (decision trees, boosting, random trees, and Haar cascade). For the other algorithms we'll provide short descriptions and usage examples.
[238] Note that the Haar classifier, Mahalanobis, and K-means algorithms were written before the ML library was created, and so they are in the cv and cxcore libraries instead.