Text detection

Let's start by creating a simple program so that we can perform text segmentation using ERFilters. In this program, we will use the trained classifiers from the text API samples. You may download these files from the OpenCV repository, but they are also available in this book's companion code.

First, we include all of the necessary headers and using declarations:

#include "opencv2/highgui.hpp"
#include "opencv2/imgproc.hpp"
#include "opencv2/text.hpp"

#include <vector>
#include <iostream>

using namespace std;
using namespace cv;
using namespace cv::text;

Recall from the Extremal region filtering section that the ERFilter works separately on each image channel. Therefore, we must provide a way to separate each desired channel into a different single-channel cv::Mat. This is done by the separateChannels function:

vector<Mat> separateChannels(const Mat& src)
{
   vector<Mat> channels;
   // Grayscale images: use the image itself and its negative
   if (src.type() == CV_8UC1) {
         channels.push_back(src);
         channels.push_back(255 - src);
         return channels;
   }

   // Colored images: decompose with computeNMChannels, then append
   // the negative of every channel except the last one (the gradient
   // magnitude, for which a negative makes no sense)
   if (src.type() == CV_8UC3) {
         computeNMChannels(src, channels);
         int size = static_cast<int>(channels.size()) - 1;
         for (int c = 0; c < size; c++)
               channels.push_back(255 - channels[c]);
         return channels;
   }

   // Other image types are not supported
   cout << "Invalid image format!" << endl;
   exit(-1);
}

First, we verify whether the image is already a single-channel (grayscale) image. If that's the case, we just add this image and its negative; no further processing is needed. Otherwise, we check whether it's a three-channel color image. For colored images, we call the computeNMChannels function to split the image into several channels. This function is declared as follows:

void computeNMChannels(InputArray src, OutputArrayOfArrays channels, int mode = ERFILTER_NM_RGBLGrad); 

The following are its parameters:

- src: The source image. It must be a colored (CV_8UC3) image.
- channels: The output vector of Mats, where each computed single channel is stored.
- mode: The mode of operation. ERFILTER_NM_RGBLGrad (the default) extracts the red, green, and blue channels plus the lightness and the gradient magnitude, while ERFILTER_NM_IHSGrad extracts the intensity, hue, saturation, and gradient magnitude.
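If you'd like to experiment with the alternative decomposition, a minimal sketch of calling the function with the IHSGrad mode might look like this (assuming input holds a loaded color image):

// Decompose into intensity, hue, saturation, and gradient channels
// instead of the default RGBLGrad mode
vector<Mat> ihsChannels;
computeNMChannels(input, ihsChannels, ERFILTER_NM_IHSGrad);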

Back in separateChannels, we also append the negative of each computed channel to the vector, just as we did with the grayscale image. This almost doubles the number of channels to be analyzed, which can be computationally intensive, so you're free to test with your own images whether the extra channels actually lead to a better result. Finally, if another kind of image is provided, the function will terminate the program with an error message.

Negatives are appended so that the algorithm covers both bright text on a dark background and dark text on a bright background. There is no sense in adding a negative of the gradient magnitude, which is why the loop skips the last channel.

Let's proceed to the main method. We'll use this program to segment the easel.png image, which is provided with the source code.

This picture was taken by a mobile phone camera while I was walking on the street. Let's write the code so that you can easily use a different image by providing its name as the first program argument:

int main(int argc, const char* argv[])
{
   // Load the image given in the first argument; easel.png is the default
   const char* image = argc < 2 ? "easel.png" : argv[1];
   auto input = imread(image);

Next, we'll convert the image to grayscale and separate its channels by calling the separateChannels function:

   Mat processed;
   // imread loads images in BGR order, so we convert with COLOR_BGR2GRAY
   cvtColor(input, processed, COLOR_BGR2GRAY);

   auto channels = separateChannels(processed);

If you want to work with all of the channels of the colored image, just replace the first two lines of this code extract with the following:

Mat processed = input;

We would then need to analyze the color channels and their inverted versions instead of just two channels (gray and its inverse). In practice, the processing time increases much more than the results improve. With the channels in hand, we need to create ERFilters for both stages of the algorithm. Luckily, the OpenCV text contribution module provides functions for this:

// Create ERFilter objects with the 1st and 2nd stage classifiers
auto filter1 = createERFilterNM1(
    loadClassifierNM1("trained_classifierNM1.xml"),
    15, 0.00015f, 0.13f, 0.2f, true, 0.1f);
auto filter2 = createERFilterNM2(
    loadClassifierNM2("trained_classifierNM2.xml"), 0.5);

For the first stage, we call the loadClassifierNM1 function to load a previously trained classification model. The .xml file containing the training data is its only argument. Then, we call createERFilterNM1 to create an instance of the ERFilter class that will perform the classification. The function has the following signature:

Ptr<ERFilter> createERFilterNM1(const Ptr<ERFilter::Callback>& cb, int thresholdDelta = 1, float minArea = 0.00025, float maxArea = 0.13, float minProbability = 0.4, bool nonMaxSuppression = true, float minProbabilityDiff = 0.1); 

The parameters for this function are as follows:

- cb: The classification model, loaded with the loadClassifierNM1 function.
- thresholdDelta: The amount added to the threshold on each image binarization while the component tree is built. The default is 1; our code uses 15, which reduces the number of thresholds tested.
- minArea and maxArea: The minimum and maximum area, expressed as a percentage of the image size, in which extremal regions (ERs) are searched. Regions outside these limits are discarded.
- minProbability: The minimum probability that a region must have of being a character in order to remain for the next stage.
- nonMaxSuppression: Indicates whether non-maximum suppression is performed over each branch of the component tree.
- minProbabilityDiff: The minimum probability difference between the local maximum and minimum ERs.
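Since every parameter except the classifier has a default value, a minimal sketch of creating a first-stage filter with the defaults could be as simple as this:

// First-stage filter with all default parameters; only the
// classification model is required
auto defaultFilter1 = createERFilterNM1(
    loadClassifierNM1("trained_classifierNM1.xml"));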

The process for the second stage is similar. We call loadClassifierNM2 to load the classifier model for the second stage and createERFilterNM2 to create the second-stage classifier. This function takes only the loaded classification model and the minimum probability that a region must achieve to be considered a character. So, let's call these algorithms on each channel to identify all possible text regions:

//Extract text regions using the Neumann & Matas algorithm
cout << "Processing " << channels.size() << " channels...";
cout << endl;
vector<vector<ERStat> > regions(channels.size());
for (size_t c = 0; c < channels.size(); c++)
{
    cout << "    Channel " << (c+1) << endl;
    filter1->run(channels[c], regions[c]);
    filter2->run(channels[c], regions[c]);
}
filter1.release();
filter2.release();

In the previous code, we used the run function of the ERFilter class. This function takes two arguments:

- The input channel: The image to be processed.
- The regions vector: For the first-stage filter, this argument is an output that is filled with the detected regions. For the second-stage filter, it's both input and output; the regions detected in the first stage come in, and those rejected by the second stage are filtered out.

Finally, we release both filters, since they will not be needed again in the program. The final segmentation step is grouping all ERs into possible words and defining their bounding boxes. This is done by calling the erGrouping function:

//Separate character groups from regions 
vector< vector<Vec2i> > groups; 
vector<Rect> groupRects; 
erGrouping(input, channels, regions, groups, groupRects, ERGROUPING_ORIENTATION_HORIZ); 

This function has the following signature:

void erGrouping(InputArray img, InputArrayOfArrays channels, std::vector<std::vector<ERStat> > &regions, std::vector<std::vector<Vec2i> > &groups, std::vector<Rect> &groups_rects, int method = ERGROUPING_ORIENTATION_HORIZ, const std::string& filename = std::string(), float minProbablity = 0.5); 

Let's take a look at the meaning of each parameter:

- img: The original input image.
- channels: The vector of single-channel images from which the regions were extracted.
- regions: The vector of ERs detected by the ERFilter algorithm in each channel.
- groups: An output vector with the indexes of the grouped regions. Each index is a Vec2i pair identifying the channel and the region within that channel.
- groups_rects: An output list of rectangles with the detected text regions.
- method: The grouping method. It can be ERGROUPING_ORIENTATION_HORIZ (the default, an exhaustive search for horizontally aligned text, as proposed by Neumann and Matas) or ERGROUPING_ORIENTATION_ANY (which groups text in any orientation using a classifier).
- filename: The classifier model file. It's only needed when ERGROUPING_ORIENTATION_ANY is selected.
- minProbablity: The minimum probability for accepting a group. It's also only needed for ERGROUPING_ORIENTATION_ANY.
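To make the groups parameter more concrete, here is a small sketch (not part of the book's code) showing how each Vec2i index pair can be resolved back to its ERStat region:

// Each Vec2i in a group stores (channel index, region index),
// so it can be used to look up the original ERStat
for (const auto& group : groups) {
    for (const Vec2i& idx : group) {
        const ERStat& er = regions[idx[0]][idx[1]];
        // er.rect holds the bounding box of this single region
    }
}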

The code also provides a call to the second method, but it's commented out. You may switch between the two to test this out: just comment out the previous call and uncomment this one:

erGrouping(input, channels, regions,  
    groups, groupRects, ERGROUPING_ORIENTATION_ANY,  
    "trained_classifier_erGrouping.xml", 0.5); 

For this call, we also used the default trained classifier that's provided in the text module sample package. Finally, we draw the region boxes and show the results:

// draw group boxes
for (const auto& rect : groupRects)
    rectangle(input, rect, Scalar(0, 255, 0), 3);

imshow("grouping", input);
waitKey(0);

The program displays the detected text regions drawn as green boxes over the input image.

You may check the entire source code in the detection.cpp file.

While most OpenCV text module functions are written to support both grayscale and colored images as input, at the time of writing this book, there were bugs preventing us from using grayscale images in functions such as erGrouping. For more information, take a look at the following GitHub issue: https://github.com/Itseez/opencv_contrib/issues/309.
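If you run into this limitation, one possible workaround (a sketch based on our own assumption, not an official fix) is to convert the grayscale image to a three-channel image before passing it to erGrouping, while still running the filters on the grayscale channels. Here, grayInput is a hypothetical single-channel image:

// Hypothetical workaround: give erGrouping a 3-channel version of a
// grayscale input; channels, regions, groups, and groupRects are the
// same variables used earlier
Mat inputBGR;
cvtColor(grayInput, inputBGR, COLOR_GRAY2BGR);
erGrouping(inputBGR, channels, regions, groups, groupRects,
    ERGROUPING_ORIENTATION_HORIZ);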
Always remember that the OpenCV contrib modules package is not as stable as the default OpenCV packages.