Computer vision applications that use machine learning share a common basic structure, which is divided into the following steps:
- Pre-process
- Segmentation
- Feature extraction
- Classification result
- Post-process
Most of these steps appear in almost all computer vision applications, although some applications omit one or more of them. In the following diagram, you can see the different steps that are involved:
Almost all computer vision applications start with a Pre-process applied to the input image, which consists of removing lighting effects and noise, filtering, blurring, and so on. After applying all the pre-processing required to the input image, the second step is Segmentation. In this step, we have to extract the regions of interest in the image and isolate each one as a unique object of interest. For example, in a face detection system, we have to separate the faces from the rest of the scene.
After detecting the objects inside the image, we continue to the next step, Feature extraction. Here, we have to extract the features of each object; the features are normally stored as a vector of object characteristics. A characteristic describes our object and can be its area, contour, texture pattern, pixel values, and so on.
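The segmentation and feature extraction steps described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the implementation used later: the image is a small hand-written grayscale grid, the threshold of 128 is arbitrary, and the helper names `segment` and `extract_features` are hypothetical. A real application would normally rely on a library such as OpenCV for both steps.

```python
from collections import deque


def segment(image, threshold=128):
    """Group pixels brighter than `threshold` into 4-connected regions.

    Returns a list of regions; each region is a list of (row, col) pixels.
    """
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    regions = []
    for r in range(rows):
        for c in range(cols):
            if seen[r][c] or image[r][c] <= threshold:
                continue
            # Breadth-first flood fill starting from this unvisited pixel.
            seen[r][c] = True
            queue = deque([(r, c)])
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and not seen[ny][nx] and image[ny][nx] > threshold:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            regions.append(pixels)
    return regions


def extract_features(pixels):
    """Return a feature vector [area, aspect ratio] for one region."""
    ys = [p[0] for p in pixels]
    xs = [p[1] for p in pixels]
    height = max(ys) - min(ys) + 1
    width = max(xs) - min(xs) + 1
    return [float(len(pixels)), width / height]


# Hypothetical input: a 3x2 bright block and a 1x2 bright bar.
image = [
    [0,   0,   0,   0, 0,   0],
    [0, 255, 255, 255, 0,   0],
    [0, 255, 255, 255, 0,   0],
    [0,   0,   0,   0, 0, 255],
    [0,   0,   0,   0, 0, 255],
]

regions = segment(image)
features = [extract_features(region) for region in regions]
print(features)
```

Each inner list is the descriptor of one object: its area in pixels followed by the width-to-height ratio of its bounding box.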
Now, we have the descriptor, also known as a feature vector or feature set, of our object. Descriptors are the features that describe an object, and we use them either to train a model or to predict with one. To train, we have to create a large dataset of features built from thousands of pre-processed images. We then pass the extracted features (image/object characteristics), such as area, size, and aspect ratio, to the Train model function we choose. In the following diagram, we can see how a dataset is fed into a Machine Learning Algorithm to train and generate a Model:
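As a rough sketch of what such a dataset might look like in memory, the following snippet flattens labelled feature vectors into a list of samples and a parallel list of integer labels, which is the shape most training functions expect. The class names `nut` and `screw`, the feature values, and the helper `build_dataset` are all hypothetical and chosen only for illustration; a real dataset would contain far more samples.

```python
def build_dataset(labelled_features):
    """Flatten {class name: [feature vectors]} into samples and integer labels.

    Class names are sorted so that the name-to-label mapping is stable.
    """
    class_names = sorted(labelled_features)
    samples, labels = [], []
    for label, name in enumerate(class_names):
        for vector in labelled_features[name]:
            samples.append(list(vector))
            labels.append(label)
    return samples, labels, class_names


# Hypothetical feature vectors: [area, aspect ratio] per object.
labelled_features = {
    "screw": [[80.0, 3.0], [90.0, 3.2]],
    "nut": [[120.0, 1.0], [130.0, 1.1]],
}

samples, labels, class_names = build_dataset(labelled_features)
print(class_names)  # which name each integer label stands for
print(samples)
print(labels)
```

Keeping `class_names` alongside the integer labels makes it possible to turn a predicted label back into a readable name later.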
When we Train with a dataset, the Model learns all the parameters it needs to make a prediction when a new feature vector with an unknown label is given as input to our algorithm. In the following diagram, we can see how an unknown vector of features is used to Predict with the generated Model, which returns the Classification result or, for a regression model, a predicted value:
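To make the train and predict steps concrete, here is a minimal sketch that uses a nearest-centroid classifier. This is deliberately a much simpler stand-in, not a support vector machine and not OpenCV's API: the "parameters" the model learns are just the mean feature vector of each class, and prediction picks the class whose mean is closest. The functions `train` and `predict` and all sample values are assumptions made for this example.

```python
import math


def train(samples, labels):
    """Learn one centroid (mean feature vector) per label."""
    sums, counts = {}, {}
    for vector, label in zip(samples, labels):
        acc = sums.setdefault(label, [0.0] * len(vector))
        for i, value in enumerate(vector):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict(model, vector):
    """Return the label whose centroid is nearest to `vector`."""
    return min(model, key=lambda label: math.dist(model[label], vector))


# Hypothetical training data: [area, aspect ratio] with labels 0 and 1.
samples = [[120.0, 1.0], [130.0, 1.1], [80.0, 3.0], [90.0, 3.2]]
labels = [0, 0, 1, 1]

model = train(samples, labels)

# Feature vectors with unknown labels.
first = predict(model, [118.0, 0.9])
second = predict(model, [88.0, 2.9])
print(first, second)
```

Because area and aspect ratio live on very different scales, the distance here is dominated by area; a practical system would usually scale the features before training.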
After predicting the result, post-processing of the output data is sometimes required, for example, merging multiple classifications to decrease the prediction error, or merging multiple labels. A sample case is optical character recognition (OCR), where the Classification result is given for each predicted character, and we construct a word by combining the results of character recognition. This means that we can create a post-processing method to correct errors in the detected words.
With this small introduction to machine learning for computer vision, we are going to implement our own application that uses machine learning to classify objects in a slide tape. We are going to use support vector machines as our classification method and explain how to use them. The other machine learning algorithms are used in a very similar way. The OpenCV documentation has detailed information about all of the machine learning algorithms at the following link: https://docs.opencv.org/master/dd/ded/group__ml.html.
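The post-processing ideas mentioned earlier for character recognition can be sketched as follows. This is an illustrative example under assumptions, not a prescribed method: several hypothetical classifiers each predict every character, a majority vote merges their answers, the characters are joined into a word, and the word is then matched against a small made-up vocabulary using `difflib` from the Python standard library.

```python
from collections import Counter
from difflib import get_close_matches


def majority_vote(predictions):
    """Merge several predictions for one character into the most common one."""
    return Counter(predictions).most_common(1)[0][0]


def build_word(per_character_predictions):
    """Vote on each character position, then join the winners into a word."""
    return "".join(majority_vote(p) for p in per_character_predictions)


def correct_word(word, vocabulary):
    """Replace `word` with its closest vocabulary entry, if one is close enough."""
    matches = get_close_matches(word, vocabulary, n=1)
    return matches[0] if matches else word


# Hypothetical output of three classifiers for each of five characters.
per_character_predictions = [
    ["h", "h", "b"],
    ["e", "e", "e"],
    ["1", "1", "l"],  # the digit 1 wins the vote, which is a recognition error
    ["l", "l", "l"],
    ["o", "o", "0"],
]
vocabulary = ["hello", "world"]  # made-up dictionary for this example

raw_word = build_word(per_character_predictions)
final_word = correct_word(raw_word, vocabulary)
print(raw_word, final_word)
```

Voting reduces errors made by a single classifier, while the vocabulary lookup catches errors that survive the vote.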