OpenCV is very feature rich and provides multiple solutions and paths to resolve a visual-understanding problem. With this great power comes the hard work of choosing and crafting the best processing pipeline for the project requirements. Having multiple options means that finding the single best-performing solution is usually next to impossible, as many pieces are interchangeable and testing every possible combination is out of our reach. This exponential complexity is compounded by the input data: the more unknown variance in the incoming data, the more unstable our algorithm choices become. In other words, working with OpenCV, or any other computer vision library, is still a matter of experience and art. A priori intuition as to which route to a solution will succeed is something computer vision engineers develop over years of experience, and for the most part there are no shortcuts.
There is, however, the option of learning from someone else's experience. If you've purchased this book, it likely means you are looking to do just that. In this section, we have prepared a partial list of problems that we encountered in our years of work as computer vision engineers, along with the solutions we used in our own work. The list focuses on problems arising in computer vision engineering; however, any engineer should also be aware of common problems in general-purpose software and system engineering, which we do not enumerate here. In practice, no system implementation is without some problems, bugs, or under-optimizations, and even after following our list, one will probably find there is much more left to do.
The primary common pitfall in any engineering field is making assumptions instead of assertions. For any engineer, if there's an option to measure something, it should be measured, even by approximation, by establishing lower and upper bounds, or by measuring a different, highly correlated phenomenon. For examples of metrics that can be used for measurement in OpenCV, refer to Chapter 20, Finding the Best OpenCV Algorithm for the Job. The best decisions are informed ones, based on hard data and visibility; however, that is often not the privilege of an engineer. Some projects require a fast and cold start that forces an engineer to rapidly build up a solution from scratch, without much data or intuition. In such cases, the following advice can save a lot of grief:
- Not comparing algorithm options: One pitfall engineers often fall into is choosing algorithms categorically based on what they encounter first, something they've done in the past that seemed to work, or something that has a nice tutorial (someone else's experience). This is the anchoring or focalism cognitive bias, a well-known problem in decision-making theory. Reiterating the words from the last chapter, the choice of algorithm can have a tremendous impact on the results of the entire pipeline and project, in terms of accuracy, speed, resources, and more. Making uninformed decisions when selecting algorithms is not a good idea.
- Solution: OpenCV has many ways to assist in testing different options seamlessly, through common base APIs (such as Feature2D, DescriptorMatcher, SparseOpticalFlow, and more) or common function signatures (such as solvePnP and solvePnPRansac). High-level programming languages, such as Python, have even more flexibility in interchanging algorithms; however, this is also possible in C++ beyond polymorphism, with some instrumentation code. After establishing a pipeline, see how you can interchange some of the algorithms (for example, feature type or matcher type, thresholding technique) or their parameters (for example, threshold values, algorithm flags) and measure the effect on the final result, as sketched below. Strictly changing parameters is often called hyperparameter tuning, which is standard practice in machine learning.
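For instance, the common cv::Feature2D base class lets several feature extractors stand behind a single pointer type, so candidates can be swapped and measured in a loop. The following is a minimal sketch of this idea; the image filename is a placeholder for your own data:

#include <opencv2/opencv.hpp>
#include <iostream>
#include <vector>

int main() {
    cv::Mat image = cv::imread("scene.png", cv::IMREAD_GRAYSCALE);

    // Candidate algorithms, all behind the common Feature2D interface
    std::vector<cv::Ptr<cv::Feature2D>> extractors = {
        cv::ORB::create(), cv::AKAZE::create(), cv::BRISK::create()
    };

    for (const auto& extractor : extractors) {
        std::vector<cv::KeyPoint> keypoints;
        cv::Mat descriptors;
        cv::TickMeter tm;
        tm.start();
        extractor->detectAndCompute(image, cv::noArray(), keypoints, descriptors);
        tm.stop();
        // Measure: keypoint count and runtime per algorithm
        std::cout << keypoints.size() << " keypoints in "
                  << tm.getTimeMilli() << " ms" << std::endl;
    }
    return 0;
}

The same loop structure extends to downstream stages (matching, pose estimation), so the effect of each swap can be measured on the final result rather than in isolation.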
- Not unit testing homegrown solutions or algorithms: It is a common programmer's fallacy to believe their work is bug-free and that they've covered all edge cases. It is far better to err on the side of caution when it comes to computer vision algorithms, since in many cases the input space is vastly unknown, as it is extremely high-dimensional. Unit tests are excellent tools to make sure functionality doesn't break on unexpected input, invalid data, or edge cases (for example, an empty image), and that it degrades gracefully.
- Solution: Establish unit tests for any meaningful function in your code, and make sure to cover the important parts. For example, any function that either reads or writes image data is a good candidate for a unit test. A unit test is a simple piece of code that usually invokes the function a number of times with different arguments, testing the function's ability (or inability) to handle the input. For C++, there are many options for a test framework; one such framework is part of the Boost C++ libraries, Boost.Test (https://www.boost.org/doc/libs/1_66_0/libs/test/doc/html/index.html). Here is an example:
#define BOOST_TEST_MODULE binarization test
#include <boost/test/unit_test.hpp>
#include <opencv2/opencv.hpp>

// The homegrown function under test, declared in our own header
cv::Mat binarization_function(const cv::Mat& input);

BOOST_AUTO_TEST_CASE( binarization_test )
{
    // On empty input, it should return empty output
    BOOST_TEST(binarization_function(cv::Mat()).empty());

    // On 3-channel color input, it should return 1-channel output
    cv::Mat input = cv::imread("test_image.png");
    BOOST_TEST(binarization_function(input).channels() == 1);
}
Compiling this file creates an executable that performs the tests and exits with a status of 0 if all tests pass, or a non-zero status if any of them fail. It is common to combine this approach with CMake's CTest (https://cmake.org/cmake/help/latest/manual/ctest.1.html) feature (via ADD_TEST in the CMakeLists.txt files), which facilitates building tests for many parts of the code and running them all with a single command.
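A minimal CMakeLists.txt fragment for registering the preceding test with CTest might look as follows; the target and file names are ours for illustration, and linking details may vary with your Boost setup (for example, a dynamically linked Boost.Test also needs BOOST_TEST_DYN_LINK defined):

cmake_minimum_required(VERSION 3.5)
project(binarization_tests)
find_package(OpenCV REQUIRED)
find_package(Boost REQUIRED COMPONENTS unit_test_framework)
enable_testing()
add_executable(binarization_test binarization_test.cpp)
target_link_libraries(binarization_test ${OpenCV_LIBS} Boost::unit_test_framework)
# Registers the executable with CTest; running 'ctest' executes all
# registered tests and reports any failures
add_test(NAME binarization_test COMMAND binarization_test)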
- Not checking data ranges: A common problem in computer vision programming is to assume a range for the data, for example, a range of [0, 1] for floating-point pixels (float, CV_32F) or [0, 255] for byte pixels (unsigned char, CV_8U). There are no real guarantees that these assumptions hold, since the memory block can hold any value. The problems that arise from such errors are mostly value saturation, when trying to write a value bigger than the representation allows; for example, writing 325 into a byte that can hold [0, 255] will saturate to 255, losing a great deal of precision. Other potential problems are differences between expected and actual data, for example, expecting a depth image in the range of [0, 2048] (for example, two meters in millimeters) only to find the actual range is [0, 1], meaning it was normalized somewhere along the way. This can lead to underperformance of the algorithm, or a complete breakdown (imagine dividing the [0, 1] range by 2048 again).
- Solution: Check the input data range and make sure it is what you expect. If the range is not within acceptable bounds, you may throw an out_of_range exception (a standard library class; visit https://en.cppreference.com/w/cpp/error/out_of_range for more details). You can also consider using CV_Assert to check the range, which will throw a cv::Exception on failure.
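A minimal sketch of such a check, using cv::minMaxLoc to find the actual range; the function name and the expected bounds (depth in millimeters, matching the example above) are our own:

#include <opencv2/opencv.hpp>
#include <stdexcept>

void check_depth_range(const cv::Mat& depth) {
    // minMaxLoc expects a single-channel array, as depth images are
    double minVal, maxVal;
    cv::minMaxLoc(depth, &minVal, &maxVal);

    // We expect depth in millimeters, [0, 2048]; anything else is suspect
    if (minVal < 0.0 || maxVal > 2048.0) {
        throw std::out_of_range("depth image outside expected [0, 2048] range");
    }

    // Alternatively, let OpenCV throw a cv::Exception on failure:
    CV_Assert(minVal >= 0.0 && maxVal <= 2048.0);
}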
- Data types, channels, conversion, and rounding errors: One of the most vexing problems with OpenCV's cv::Mat data structure is that it doesn't carry data type information in its variable type. A cv::Mat can hold any type of data (float, uchar, int, short, and so on) in any size, and a receiving function cannot know what data is inside the array without inspection or convention. The problem is compounded by the number of channels, as an array can hold any number of them arbitrarily (for example, a cv::Mat can hold CV_8UC1 or CV_8UC3). Failing to establish a known data type can lead to runtime exceptions from OpenCV functions that don't expect such data, and therefore to a potential crash of the entire application. Handling multiple data types on the same input cv::Mat may lead to further conversion issues. For example, if we know an incoming array holds CV_32F (by checking input.type() == CV_32F), we may call input.convertTo(out, CV_8U) to "normalize" it to uchar; however, if the float data is in the [0, 1] range, the converted output will contain only 0s and 1s in a [0, 255] image, which may be a problem.
- Solution: Prefer cv::Mat_<> types (for example, cv::Mat_<float>) over cv::Mat so the data type travels with the variable, establish very clear conventions on variable naming (for example, cv::Mat image_8uc1), test to make sure the types you expect are the types you get, or create a "normalization" scheme to turn any unexpected input type into the type you would like to work with in your function, as in the sketch below. Using try .. catch blocks is also good practice when data type uncertainty is feared.
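Here is a minimal sketch of such a normalization scheme; the helper name is ours, and the scaling assumes CV_32F input lies in [0, 1]:

#include <opencv2/opencv.hpp>

// Normalize any single-channel input to CV_8U, [0, 255] (hypothetical helper)
cv::Mat_<uchar> to_8u(const cv::Mat& input) {
    CV_Assert(input.channels() == 1);
    cv::Mat_<uchar> out;
    if (input.type() == CV_32F) {
        // Assume [0, 1] floats: scale by 255 during conversion,
        // instead of truncating everything down to 0s and 1s
        input.convertTo(out, CV_8U, 255.0);
    } else if (input.type() == CV_8U) {
        out = input; // already the type we want; shares data, no copy
    } else {
        input.convertTo(out, CV_8U); // saturating cast for other types
    }
    return out;
}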
- Colorspace-originating problems: RGB versus perceptual (HSV, L*a*b*) versus technical (YUV): Colorspaces are a way of encoding color information in the numeric values of a pixel array (image). However, there are a number of problems with this encoding. The foremost problem is that any colorspace eventually becomes a series of numbers stored in the array, and OpenCV does not keep track of colorspace information in cv::Mat (for example, an array may hold 3-byte RGB or 3-byte HSV pixels, and the variable's user cannot tell the difference). This is dangerous, because we tend to think we can perform any kind of numeric manipulation on numeric data and it will make sense. However, in some colorspaces, certain manipulations must be cognizant of the colorspace. For example, in the very useful HSV (Hue, Saturation, Value) colorspace, one must remember that H (Hue) is in fact a measure of degrees [0, 360] that is usually compressed to [0, 180] to fit in a uchar. There is, therefore, no sense in putting a value of 200 in the H channel, as it violates the colorspace definition and leads to unexpected problems. The same goes for linear operations. If, for example, we wish to dim an image by 50%, in RGB we simply divide all channels by two; however, in HSV (or L*a*b*, Luv, and so on) one must only perform the division on the V (Value) or L (Luminance) channel, as we sketch in code after this item.
The problem becomes much worse when working with non-byte images, such as YUV420 or RGB555 (16-bit colorspaces). These images store pixel values at the bit level, not the byte level, packing data for more than one pixel or channel into the same byte. For example, an RGB555 pixel is stored in two bytes (16 bits): one bit unused, then five bits each for red, green, and blue. All kinds of numeric operations (for example, arithmetic) fail in that case, and may cause irreparable corruption of the data.
- Solution: Always know the colorspace of the data you process. When reading images from files using cv::imread, you may assume they are read in BGR order (OpenCV's standard pixel data storage). When no colorspace information is available, you may rely on heuristics or test the input. In general, you should be wary of images with only two channels, as they are more than likely a bit-packed colorspace. Images with four channels are usually ARGB or RGBA, adding an alpha channel, and again introduce some uncertainty. Testing for perceptual colorspaces can be done visually, by displaying the channels on the screen. The worst of the bit-packing problem comes from working with image files, memory blocks from external libraries, or other external sources. Within OpenCV, most of the work is done on single-channel grayscale or BGR data, but when it comes to saving to a file, or preparing an image memory block for use in a different library, it is important to keep track of colorspace conversions. Remember that cv::imwrite expects BGR data, and not any other format.
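As promised above, here is a minimal sketch of colorspace-aware dimming, scaling only the V channel in HSV rather than all three channels as we would in BGR; the function name is ours:

#include <opencv2/opencv.hpp>
#include <vector>

// Dim a BGR image by 50% via the HSV colorspace (hypothetical helper)
cv::Mat dim_50_percent(const cv::Mat& bgr) {
    // In BGR, dimming would be a uniform scale: bgr * 0.5.
    // In HSV, only the V (Value) channel may be scaled;
    // touching H would corrupt the hue angles.
    cv::Mat hsv;
    cv::cvtColor(bgr, hsv, cv::COLOR_BGR2HSV);

    std::vector<cv::Mat> channels;
    cv::split(hsv, channels);
    channels[2] *= 0.5; // scale the V channel only
    cv::merge(channels, hsv);

    cv::Mat dimmed;
    cv::cvtColor(hsv, dimmed, cv::COLOR_HSV2BGR);
    return dimmed;
}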
- Accuracy versus speed versus resources (CPU, memory) trade-offs and optimization: Most problems in computer vision involve trade-offs between computation and resource efficiency. Some algorithms are fast because they cache crucial data in memory with fast lookups; others may be fast because of a rough approximation they make on the input or output that reduces accuracy. In most cases, one desirable trait comes at the expense of another. Not paying attention to these trade-offs, or paying too much attention to them, can become a problem. A common pitfall for engineers revolves around optimization. There is under- or over-optimization, premature optimization, unnecessary optimization, and more. When looking to optimize an algorithm, there's a tendency to treat all optimizations as equal, when in fact there is usually just one culprit (a code line or method) causing most of the inefficiency. Dealing with algorithmic trade-offs or optimization is mostly a problem of research and development time, rather than result. Engineers may spend too much or not enough time on optimization, or optimize at the wrong time.
- Solution: Know the algorithms before or while employing them. If you choose an algorithm, make sure you understand its complexity (runtime and resources) by testing it, or at least by looking at the OpenCV documentation pages. For example, when matching image features, one should know that the brute-force BFMatcher is often a few orders of magnitude slower than the approximate FLANN-based FlannBasedMatcher, especially if preloading and caching the features is possible, as the timing sketch below illustrates.
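A minimal sketch for measuring that trade-off yourself, timing both matchers on the same ORB descriptors; the image filenames and parameter values are placeholders, and FLANN needs LSH index parameters to handle binary descriptors. Note that on small descriptor sets the difference may be negligible; the gap grows with descriptor count:

#include <opencv2/opencv.hpp>
#include <iostream>
#include <vector>

int main() {
    // Placeholder images; substitute your own pair
    cv::Mat img1 = cv::imread("frame1.png", cv::IMREAD_GRAYSCALE);
    cv::Mat img2 = cv::imread("frame2.png", cv::IMREAD_GRAYSCALE);

    cv::Ptr<cv::ORB> orb = cv::ORB::create(2000);
    std::vector<cv::KeyPoint> kp1, kp2;
    cv::Mat desc1, desc2;
    orb->detectAndCompute(img1, cv::noArray(), kp1, desc1);
    orb->detectAndCompute(img2, cv::noArray(), kp2, desc2);

    std::vector<cv::DMatch> matches;
    cv::TickMeter tm;

    // Brute force: exact, but compares every descriptor pair
    cv::BFMatcher bf(cv::NORM_HAMMING);
    tm.start();
    bf.match(desc1, desc2, matches);
    tm.stop();
    std::cout << "BFMatcher:         " << tm.getTimeMilli() << " ms" << std::endl;

    // FLANN with LSH: approximate, typically much faster on large sets
    cv::FlannBasedMatcher flann(cv::makePtr<cv::flann::LshIndexParams>(12, 20, 2));
    tm.reset();
    tm.start();
    flann.match(desc1, desc2, matches);
    tm.stop();
    std::cout << "FlannBasedMatcher: " << tm.getTimeMilli() << " ms" << std::endl;
    return 0;
}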