To work with deep learning in our code, we have to include the corresponding headers (iostream is also needed, for console output):
#include <opencv2/dnn.hpp>
#include <opencv2/imgproc.hpp>
#include <opencv2/highgui.hpp>
#include <iostream>
After that, we will bring the required namespaces into scope:
using namespace cv;
using namespace std;
using namespace cv::dnn;
Now we are going to define the input image size and the other constants used in our code; meanVal holds the per-channel mean values (in BGR order) that this model expects to be subtracted from the input image:
const size_t inWidth = 300;
const size_t inHeight = 300;
const double inScaleFactor = 1.0;
const Scalar meanVal(104.0, 177.0, 123.0);
In this example, we need a few input parameters: the model configuration and the pre-trained weights, whether to process camera or video input, and the minimum confidence required to accept a prediction as a correct detection:
const char* params
= "{ help | false | print usage }"
"{ proto | | model configuration (deploy.prototxt) }"
"{ model | | model weights (res10_300x300_ssd_iter_140000.caffemodel) }"
"{ camera_device | 0 | camera device number }"
"{ video | | video or image for detection }"
"{ opencl | false | enable OpenCL }"
"{ min_confidence | 0.5 | min confidence }";
Now, we are going to start with the main function, where we are going to parse the arguments with the CommandLineParser class:
int main(int argc, char** argv)
{
CommandLineParser parser(argc, argv, params);
if (parser.get<bool>("help"))
{
parser.printMessage();
return 0;
}
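The proto and model parameters have no default values, so before loading anything it is worth verifying that both were supplied (a small defensive check that is not part of the original sample):
if (parser.get<String>("proto").empty() || parser.get<String>("model").empty())
{
cerr << "Missing --proto or --model argument" << endl;
parser.printMessage();
return -1;
}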
We are also going to read the paths of the model architecture and pre-trained weights files, and load the model into a deep learning network:
String modelConfiguration = parser.get<string>("proto");
String modelBinary = parser.get<string>("model");
//! [Initialize network]
dnn::Net net = readNetFromCaffe(modelConfiguration, modelBinary);
//! [Initialize network]
It's very important to check that the network was loaded correctly; we can do this with the empty function, as follows:
if (net.empty())
{
cerr << "Can't load the network using the given files: " << modelConfiguration << ", " << modelBinary << endl;
return -1;
}
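Note that the opencl flag we parse is never used in the code shown. One way to honor it, assuming your OpenCV build has OpenCL support, is to ask the network to prefer the OpenCL target (a minimal sketch, not part of the original code):
if (parser.get<bool>("opencl"))
net.setPreferableTarget(DNN_TARGET_OPENCL);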
After loading our network, we are going to initialize our input source, a camera or a video file, and open it with a VideoCapture object, as follows:
VideoCapture cap;
if (parser.get<String>("video").empty())
{
int cameraDevice = parser.get<int>("camera_device");
cap = VideoCapture(cameraDevice);
if(!cap.isOpened())
{
cout << "Couldn't find camera: " << cameraDevice << endl;
return -1;
}
}
else
{
cap.open(parser.get<String>("video"));
if(!cap.isOpened())
{
cout << "Couldn't open image or video: " << parser.get<String>("video") << endl;
return -1;
}
}
Now we are prepared to start capturing frames and feeding each one to the deep neural network to find faces.
First, we have to capture each frame in a loop:
for(;;)
{
Mat frame;
cap >> frame; // get a new frame from camera/video or read image
if (frame.empty())
{
waitKey();
break;
}
Next, we will convert the input frame into a blob, the Mat structure that the deep neural network can consume. We have to send the image at the size the SSD expects, which is 300 x 300 (the inWidth and inHeight constants we initialized earlier), and subtract the mean value required by the SSD, given by the meanVal constant we defined:
Mat inputBlob = blobFromImage(frame, inScaleFactor, Size(inWidth, inHeight), meanVal, false, false);
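blobFromImage returns a four-dimensional blob in NCHW order, so for a single frame it has the shape 1 x 3 x 300 x 300. If you want to be defensive, you can verify this with an assertion (an optional check, not part of the original code):
// Optional sanity check: the blob should be 1 x 3 x inHeight x inWidth
CV_Assert(inputBlob.dims == 4 && inputBlob.size[2] == (int)inHeight && inputBlob.size[3] == (int)inWidth);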
Now we are ready to set the data into the network and obtain the predictions/detections, using the net.setInput and net.forward functions respectively. The raw output is a four-dimensional blob of shape [1, 1, N, 7], so we wrap it in a two-dimensional Mat that is easy to read, where detection.size[2] is the number of detected objects (N) and detection.size[3] is the number of values per detection (seven: image index, label, confidence, and the bounding box coordinates):
net.setInput(inputBlob, "data"); //set the network input
Mat detection = net.forward("detection_out"); //compute output
Mat detectionMat(detection.size[2], detection.size[3], CV_32F, detection.ptr<float>());
Each row of the detectionMat matrix contains the following seven values:
- Column 0: Index of the image in the batch (always 0 here, as we process one frame at a time)
- Column 1: Class label of the detection
- Column 2: Confidence of the detection
- Column 3: X coordinate of the left edge of the bounding box
- Column 4: Y coordinate of the top edge of the bounding box
- Column 5: X coordinate of the right edge of the bounding box
- Column 6: Y coordinate of the bottom edge of the bounding box
The bounding box coordinates are relative (zero to one) to the image size; for example, in a 640 x 480 frame, an x value of 0.25 corresponds to pixel column 160.
Now we have to filter the detections, keeping only those whose confidence exceeds the threshold given as input:
float confidenceThreshold = parser.get<float>("min_confidence");
for(int i = 0; i < detectionMat.rows; i++)
{
float confidence = detectionMat.at<float>(i, 2);
if(confidence > confidenceThreshold)
{
Now we are going to convert the normalized coordinates to pixel coordinates, extract the bounding box, and draw a rectangle over each detected face, as follows:
int xMin = static_cast<int>(detectionMat.at<float>(i, 3) * frame.cols);
int yMin = static_cast<int>(detectionMat.at<float>(i, 4) * frame.rows);
int xMax = static_cast<int>(detectionMat.at<float>(i, 5) * frame.cols);
int yMax = static_cast<int>(detectionMat.at<float>(i, 6) * frame.rows);
Rect object(xMin, yMin, xMax - xMin, yMax - yMin);
rectangle(frame, object, Scalar(0, 255, 0));
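If we also want to see how confident the network is about each detection, we can draw the confidence value above the rectangle with putText (an optional addition, not in the original code; the label format is arbitrary):
String label = format("face: %.2f", confidence); // hypothetical label text
putText(frame, label, Point(xMin, yMin - 5), FONT_HERSHEY_SIMPLEX, 0.5, Scalar(0, 255, 0));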
}
}
imshow("detections", frame);
if (waitKey(1) >= 0) break;
}
return 0;
}
The final result is the input frame displayed in the detections window, with a green rectangle drawn around each detected face.
In this section, you learned about a new deep learning architecture, SSD, and how to use it for face detection.