Text detection

Let's start by creating a simple program so that we can perform text segmentation using ERFilters. In this program, we will use the trained classifiers from the text API samples. You may download these files from the OpenCV repository, but they are also available in this book's companion code.

First, we include all of the necessary headers and using declarations:

#include "opencv2/highgui.hpp"
#include "opencv2/imgproc.hpp"
#include "opencv2/text.hpp"

#include <vector>
#include <iostream>

using namespace std;
using namespace cv;
using namespace cv::text;

Recall from the Extremal region filtering section that the ERFilter works separately on each image channel. Therefore, we must provide a way to separate each desired channel into a different single-channel cv::Mat. This is done by the separateChannels function:

vector<Mat> separateChannels(const Mat& src)
{
   vector<Mat> channels;
   // Grayscale images: use the image itself and its negative
   if (src.type() == CV_8UC1) {
         channels.push_back(src);
         channels.push_back(255 - src);
         return channels;
   }

   // Colored images: decompose with computeNMChannels, then append
   // the negative of every channel except the last one (the gradient
   // magnitude, for which a negative makes no sense)
   if (src.type() == CV_8UC3) {
         computeNMChannels(src, channels);
         int size = static_cast<int>(channels.size()) - 1;
         for (int c = 0; c < size; c++)
               channels.push_back(255 - channels[c]);
         return channels;
   }

   // Other image types are not supported
   cout << "Invalid image format!" << endl;
   exit(-1);
}

First, we verify whether the image is already a single-channel (grayscale) image. If that's the case, we just add this image and its negative; no further processing is needed. Otherwise, we check whether it's a three-channel color image. For colored images, we call the computeNMChannels function to split the image into several channels. This function is declared as follows:

void computeNMChannels(InputArray src, OutputArrayOfArrays channels, int mode = ERFILTER_NM_RGBLGrad); 

The following are its parameters:

- src: The source image. It must be a colored (CV_8UC3) image.
- channels: The output vector of Mats, where each computed single channel is stored.
- mode: The mode of operation. ERFILTER_NM_RGBLGrad (the default) extracts the red, green, and blue channels plus the lightness and the gradient magnitude, while ERFILTER_NM_IHSGrad extracts the intensity, hue, saturation, and gradient magnitude.
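If you'd like to experiment with the alternative decomposition, a minimal sketch of calling the function with the IHSGrad mode might look like this (assuming input holds a loaded color image):

// Decompose into intensity, hue, saturation, and gradient channels
// instead of the default RGBLGrad mode
vector<Mat> ihsChannels;
computeNMChannels(input, ihsChannels, ERFILTER_NM_IHSGrad);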

Back in separateChannels, we also append the negative of each computed channel to the vector, just as we did with the grayscale image. This almost doubles the number of channels to be analyzed, which can be computationally intensive, so you're free to test with your own images whether the extra channels actually lead to a better result. Finally, if another kind of image is provided, the function will terminate the program with an error message.

Negatives are appended so that the algorithm covers both bright text on a dark background and dark text on a bright background. There is no sense in adding a negative of the gradient magnitude, which is why the loop skips the last channel.

Let's proceed to the main method. We'll use this program to segment the easel.png image, which is provided with the source code.

This picture was taken by a mobile phone camera while I was walking on the street. Let's write the code so that you can easily use a different image by providing its name as the first program argument:

int main(int argc, const char* argv[])
{
   // Load the image given in the first argument; easel.png is the default
   const char* image = argc < 2 ? "easel.png" : argv[1];
   auto input = imread(image);

Next, we'll convert the image to grayscale and separate its channels by calling the separateChannels function:

   Mat processed;
   // imread loads images in BGR order, so we convert with COLOR_BGR2GRAY
   cvtColor(input, processed, COLOR_BGR2GRAY);

   auto channels = separateChannels(processed);

If you want to work with all of the channels of the colored image, just replace the first two lines of this code extract with the following:

Mat processed = input;

We would then need to analyze the color channels and their inverted versions instead of just two channels (gray and its inverse). In practice, the processing time increases much more than the results improve. With the channels in hand, we need to create ERFilters for both stages of the algorithm. Luckily, the OpenCV text contribution module provides functions for this:

// Create ERFilter objects with the 1st and 2nd stage classifiers
auto filter1 = createERFilterNM1(
    loadClassifierNM1("trained_classifierNM1.xml"),
    15, 0.00015f, 0.13f, 0.2f, true, 0.1f);
auto filter2 = createERFilterNM2(
    loadClassifierNM2("trained_classifierNM2.xml"), 0.5);

For the first stage, we call the loadClassifierNM1 function to load a previously trained classification model. The .xml file containing the training data is its only argument. Then, we call createERFilterNM1 to create an instance of the ERFilter class that will perform the classification. The function has the following signature:

Ptr<ERFilter> createERFilterNM1(const Ptr<ERFilter::Callback>& cb, int thresholdDelta = 1, float minArea = 0.00025, float maxArea = 0.13, float minProbability = 0.4, bool nonMaxSuppression = true, float minProbabilityDiff = 0.1); 

The parameters for this function are as follows:

- cb: The classification model, loaded with the loadClassifierNM1 function.
- thresholdDelta: The amount added to the threshold on each image binarization while the component tree is built. The default is 1; our code uses 15, which reduces the number of thresholds tested.
- minArea and maxArea: The minimum and maximum area, expressed as a percentage of the image size, in which extremal regions (ERs) are searched. Regions outside these limits are discarded.
- minProbability: The minimum probability that a region must have of being a character in order to remain for the next stage.
- nonMaxSuppression: Indicates whether non-maximum suppression is performed over each branch of the component tree.
- minProbabilityDiff: The minimum probability difference between the local maximum and minimum ERs.
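Since every parameter except the classifier has a default value, a minimal sketch of creating a first-stage filter with the defaults could be as simple as this:

// First-stage filter with all default parameters; only the
// classification model is required
auto defaultFilter1 = createERFilterNM1(
    loadClassifierNM1("trained_classifierNM1.xml"));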

The process for the second stage is similar. We call loadClassifierNM2 to load the classifier model for the second stage and createERFilterNM2 to create the second-stage classifier. This function takes only the loaded classification model and the minimum probability that a region must achieve to be considered a character. So, let's call these algorithms on each channel to identify all possible text regions:

//Extract text regions using the Neumann & Matas algorithm
cout << "Processing " << channels.size() << " channels...";
cout << endl;
vector<vector<ERStat> > regions(channels.size());
for (size_t c = 0; c < channels.size(); c++)
{
    cout << "    Channel " << (c+1) << endl;
    filter1->run(channels[c], regions[c]);
    filter2->run(channels[c], regions[c]);
}
filter1.release();
filter2.release();

In the previous code, we used the run function of the ERFilter class. This function takes two arguments:

- The input channel: The image to be processed.
- The regions vector: For the first-stage filter, this argument is an output that is filled with the detected regions. For the second-stage filter, it's both input and output; the regions detected in the first stage come in, and those rejected by the second stage are filtered out.

Finally, we release both filters, since they will not be needed again in the program. The final segmentation step is grouping all ERs into possible words and defining their bounding boxes. This is done by calling the erGrouping function:

//Separate character groups from regions 
vector< vector<Vec2i> > groups; 
vector<Rect> groupRects; 
erGrouping(input, channels, regions, groups, groupRects, ERGROUPING_ORIENTATION_HORIZ); 

This function has the following signature:

void erGrouping(InputArray img, InputArrayOfArrays channels, std::vector<std::vector<ERStat> > &regions, std::vector<std::vector<Vec2i> > &groups, std::vector<Rect> &groups_rects, int method = ERGROUPING_ORIENTATION_HORIZ, const std::string& filename = std::string(), float minProbablity = 0.5); 

Let's take a look at the meaning of each parameter:

- img: The original input image.
- channels: The vector of single-channel images from which the regions were extracted.
- regions: The vector of ERs detected by the ERFilter algorithm in each channel.
- groups: An output vector with the indexes of the grouped regions. Each index is a Vec2i pair identifying the channel and the region within that channel.
- groups_rects: An output list of rectangles with the detected text regions.
- method: The grouping method. It can be ERGROUPING_ORIENTATION_HORIZ (the default, an exhaustive search for horizontally aligned text, as proposed by Neumann and Matas) or ERGROUPING_ORIENTATION_ANY (which groups text in any orientation using a classifier).
- filename: The classifier model file. It's only needed when ERGROUPING_ORIENTATION_ANY is selected.
- minProbablity: The minimum probability for accepting a group. It's also only needed for ERGROUPING_ORIENTATION_ANY.
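To make the groups parameter more concrete, here is a small sketch (not part of the book's code) showing how each Vec2i index pair can be resolved back to its ERStat region:

// Each Vec2i in a group stores (channel index, region index),
// so it can be used to look up the original ERStat
for (const auto& group : groups) {
    for (const Vec2i& idx : group) {
        const ERStat& er = regions[idx[0]][idx[1]];
        // er.rect holds the bounding box of this single region
    }
}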

The code also provides a call to the second method, but it's commented out. You may switch between the two to test this out: just comment out the previous call and uncomment this one:

erGrouping(input, channels, regions,  
    groups, groupRects, ERGROUPING_ORIENTATION_ANY,  
    "trained_classifier_erGrouping.xml", 0.5); 

For this call, we also used the default trained classifier that's provided in the text module sample package. Finally, we draw the region boxes and show the results:

// draw group boxes
for (const auto& rect : groupRects)
    rectangle(input, rect, Scalar(0, 255, 0), 3);

imshow("grouping", input);
waitKey(0);

The program displays the detected text regions drawn as green boxes over the input image.

You may check the entire source code in the detection.cpp file.

While most OpenCV text module functions are written to support both grayscale and colored images as input, at the time of writing this book, there were bugs preventing us from using grayscale images in functions such as erGrouping. For more information, take a look at the following GitHub issue: https://github.com/Itseez/opencv_contrib/issues/309.
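If you run into this limitation, one possible workaround (a sketch based on our own assumption, not an official fix) is to convert the grayscale image to a three-channel image before passing it to erGrouping, while still running the filters on the grayscale channels. Here, grayInput is a hypothetical single-channel image:

// Hypothetical workaround: give erGrouping a 3-channel version of a
// grayscale input; channels, regions, groups, and groupRects are the
// same variables used earlier
Mat inputBGR;
cvtColor(grayInput, inputBGR, COLOR_GRAY2BGR);
erGrouping(inputBGR, channels, regions, groups, groupRects,
    ERGROUPING_ORIENTATION_HORIZ);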
Always remember that the OpenCV contrib modules package is not as stable as the default OpenCV packages.