As an example, we will set up a scenario in which we are required to align overlapping images, as is done in panorama or aerial photo stitching. One essential ingredient for measuring performance is ground truth: a precise measurement of the true condition that we are trying to recover with our approximation method. Ground truth data can be obtained from datasets made available for researchers to test and compare their algorithms; indeed, many such datasets exist, and computer vision researchers use them all the time. One good resource for finding computer vision datasets is Yet Another Computer Vision Index To Datasets (YACVID), https://riemenschneider.hayko.at/vision/dataset/, which has been actively maintained for the past eight years and contains hundreds of links to datasets. The following is also a good resource for data: https://github.com/jbhuang0604/awesome-computer-vision#datasets.
We, however, will pick a different way to obtain ground truth, one that is common practice in the computer vision literature. We will create a contrived situation under our parametric control, and build a benchmark that we can vary to test different aspects of our algorithms. For our example, we will take a single image, split it into two overlapping images, and apply some transformations to one of them. Fusing the images with our algorithm will attempt to recreate the original image, but it will likely not do a perfect job. The choices we make in selecting the pieces of our system (for example, the type of 2D feature, the feature matching algorithm, and the transform recovery algorithm) will affect the final result, which we will measure and compare. Working with artificial ground truth data gives us a great deal of control over the conditions and difficulty level of our trials.
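As an illustration, the overlapping split can be produced with a couple of NumPy slices. The following is a minimal sketch, not the accompanying code: the file name and the 60%/40% split point are assumptions made for the example.

```python
import cv2

# Hypothetical source image; the 60%/40% split point is an illustrative choice.
original = cv2.imread("scene.jpg", cv2.IMREAD_GRAYSCALE)
h, w = original.shape
left = original[:, : int(w * 0.6)]         # left crop, kept untouched
right = original[:, int(w * 0.4):]         # right crop, to be transformed later
overlap_px = int(w * 0.6) - int(w * 0.4)   # width of the band both crops share
```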
Consider the following image and its two-way overlapping split:
The left image we keep untouched, while we apply artificial transformations to the right image to see how well our algorithm can undo them. To keep things simple, we will only rotate the right image, at several rotation brackets, like so:
We add a middle bracket for the no rotation case, in which the right image is only translated somewhat. This makes up our ground truth data, where we know exactly what transformation occurred and what the original input was.
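The rotation brackets themselves are easy to generate with OpenCV's affine warping. The sketch below assumes the right crop from the previous snippet and nine evenly spaced angles between -90° and 90°; both are illustrative choices rather than values from the accompanying code.

```python
import cv2
import numpy as np

right = cv2.imread("right.jpg", cv2.IMREAD_GRAYSCALE)  # hypothetical right crop
angles = np.linspace(-90, 90, 9)  # includes the 0-degree, translation-only bracket

def rotate(img, angle):
    h, w = img.shape[:2]
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)  # rotate about the center
    return cv2.warpAffine(img, M, (w, h))

rotated_rights = {a: rotate(right, a) for a in angles}
```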
Our goal is to measure how successful different 2D feature descriptor types are at aligning the images. One measure of success is the Mean Squared Error (MSE) over the pixels of the final re-stitched image. If the transformation recovery was not done well, the pixels will not align perfectly, and we therefore expect a high MSE. As the MSE approaches zero, we know the stitching was done well. We may also wish to know, for practical reasons, which feature is the most efficient, so we also take a measurement of execution time. To this end, our algorithm can be very simple:
- Split the original image into a left image and a right image.
- For each of the feature types (SURF, SIFT, ORB, AKAZE, BRISK), do the following:
- Find keypoints and features in the left image.
- For each rotation angle [-90, -67, ..., 67, 90] do the following:
- Rotate the right image by the rotation angle.
- Find keypoints and features in the rotated right image.
- Match keypoints between the rotated right image and the left image.
- Estimate a rigid 2D transform.
- Transform the rotated right image according to the estimation.
- Measure the MSE of the final result with the original unsplit image.
- Measure the overall time it takes to extract, compute, and match features, and perform the alignment.
As a quick optimization, we can cache the rotated images rather than recompute them for each feature type; the rest of the algorithm remains untouched. Additionally, to keep the timing comparison fair, we should take care to extract a similar number of keypoints for each feature type (for example, 2,500 keypoints), which can be done by setting the threshold of each keypoint extraction function. A minimal sketch of this benchmark loop follows.
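The following is a minimal Python sketch of the benchmark loop, including the caching optimization. It is not the accompanying implementation: the file name, split point, angle spacing, and per-extractor thresholds are illustrative assumptions; the left crop is taken from x = 0 so that its keypoints already live in the original image's coordinate frame; and SURF is only added when an opencv-contrib build with the non-free modules is available.

```python
import time
import cv2
import numpy as np

def mse(a, b):
    return float(np.mean((a.astype(np.float32) - b.astype(np.float32)) ** 2))

original = cv2.imread("scene.jpg", cv2.IMREAD_GRAYSCALE)  # hypothetical file
h, w = original.shape
left, right = original[:, : int(w * 0.6)], original[:, int(w * 0.4):]

def rotate(img, angle):
    M = cv2.getRotationMatrix2D((img.shape[1] / 2, img.shape[0] / 2), angle, 1.0)
    return cv2.warpAffine(img, M, (img.shape[1], img.shape[0]))

angles = np.linspace(-90, 90, 9)
rotated_rights = {a: rotate(right, a) for a in angles}  # cached once, shared by all features

# Aim for a comparable keypoint budget (~2,500) per extractor; these thresholds
# are illustrative rather than tuned values.
features = {
    "SIFT":  (cv2.SIFT_create(nfeatures=2500), cv2.NORM_L2),
    "ORB":   (cv2.ORB_create(nfeatures=2500), cv2.NORM_HAMMING),
    "AKAZE": (cv2.AKAZE_create(threshold=1e-3), cv2.NORM_HAMMING),
    "BRISK": (cv2.BRISK_create(thresh=30), cv2.NORM_HAMMING),
}
try:
    features["SURF"] = (cv2.xfeatures2d.SURF_create(hessianThreshold=400), cv2.NORM_L2)
except (AttributeError, cv2.error):
    pass  # SURF needs an opencv-contrib build with non-free modules enabled

mse_results, timings = {}, {}
for name, (detector, norm) in features.items():
    matcher = cv2.BFMatcher(norm, crossCheck=True)
    t0 = time.perf_counter()
    kp_l, des_l = detector.detectAndCompute(left, None)  # left image: once per feature type
    for angle, rot_right in rotated_rights.items():
        kp_r, des_r = detector.detectAndCompute(rot_right, None)
        matches = matcher.match(des_r, des_l)
        src = np.float32([kp_r[m.queryIdx].pt for m in matches])
        dst = np.float32([kp_l[m.trainIdx].pt for m in matches])
        M, _ = cv2.estimateAffinePartial2D(src, dst)  # rigid-like (similarity) 2D transform
        # Warp the rotated right crop back onto a full-size canvas, lay the untouched
        # left crop on top, and compare the stitch against the unsplit original.
        stitched = cv2.warpAffine(rot_right, M, (w, h))
        stitched[:, : left.shape[1]] = left
        mse_results[(name, angle)] = mse(stitched, original)
    timings[name] = time.perf_counter() - t0
```

The dictionaries keyed by feature type (and angle) are then what we would plot and tabulate in the comparisons that follow.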
Note that the alignment execution pipeline is oblivious to the feature type, and works exactly the same given the matched keypoints. This is a very important property for testing many options. With OpenCV's common cv::Feature2D and cv::DescriptorMatcher base APIs, this is easy to achieve, since all features and matchers implement them. However, if we look at the table in the Is it covered in OpenCV? section, we can see that this may not be possible for all vision problems in OpenCV, so we may need to add our own instrumentation code to make such a comparison possible.
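To make that point concrete, here is a small, hypothetical illustration (again, not the accompanying code): every extractor used above derives from cv2.Feature2D in the Python bindings, so a single generic routine can serve them all. The synthetic test images and the choice of brute-force matchers are assumptions made for the sake of the demo.

```python
import cv2
import numpy as np

def match_pair(detector, matcher, img_a, img_b):
    """Generic matching step: unchanged for any cv2.Feature2D subclass."""
    kp_a, des_a = detector.detectAndCompute(img_a, None)
    kp_b, des_b = detector.detectAndCompute(img_b, None)
    if des_a is None or des_b is None:
        return kp_a, kp_b, []  # no keypoints found; nothing to match
    return kp_a, kp_b, matcher.match(des_a, des_b)

# Synthetic images with a little structure, so the demo has something to match
canvas = np.zeros((480, 640), np.uint8)
cv2.circle(canvas, (200, 240), 80, 255, -1)
cv2.rectangle(canvas, (350, 100), (520, 300), 180, -1)
img_a, img_b = canvas, np.roll(canvas, 40, axis=1)

pairs = [
    (cv2.SIFT_create(), cv2.BFMatcher(cv2.NORM_L2, crossCheck=True)),
    (cv2.ORB_create(), cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)),
    (cv2.AKAZE_create(), cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)),
    (cv2.BRISK_create(), cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)),
]
for detector, matcher in pairs:
    kp_a, kp_b, matches = match_pair(detector, matcher, img_a, img_b)
    print(type(detector).__name__, isinstance(detector, cv2.Feature2D), len(matches))
```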
In the accompanying code, we can find the Python implementation of this routine, which provides the following results. To test rotation invariance, we vary the angle and measure the reconstruction MSE:
From the same experiments, we also record, for each feature type, the mean MSE across all rotation conditions as well as the mean execution time, shown as follows:
Analyzing the results, we can clearly see some features performing better than others in terms of MSE, with respect to both the individual rotation angles and the overall average, and we can also see a large variance in timing. AKAZE and SURF appear to be the strongest performers in terms of alignment success across the rotation angle domain, with an advantage for AKAZE at higher rotations (~60°). However, at very small angular variation (rotation angle close to 0°), SIFT achieves practically perfect reconstruction with an MSE around zero, and it also does as well as, if not better than, the others for rotations below 30°. ORB does very badly throughout the domain, and BRISK, while not as bad, was rarely able to beat any of the forerunners.
Considering timing, ORB and BRISK (which are similar binary-feature algorithms) are the clear winners, but they both fall far behind the others in terms of reconstruction accuracy. AKAZE and SURF, the accuracy leaders, show neck-and-neck timing performance.
Now it is up to us, as the application developers, to rank the features according to the requirements of the project. With the data from this test, it should be easy to make a decision. If we are looking for speed, we would choose BRISK, since it is the fastest and performs better than ORB. If we are looking for accuracy, we would choose AKAZE, since it is the best performer and is faster than SURF. Using SURF is a problem in itself, since the algorithm is not free and is protected by a patent, so we are lucky to have AKAZE as a free and adequate alternative.
This was a very rudimentary test, looking at only two simple measures (MSE and time) and only one varied parameter (rotation). In a real situation, we may wish to introduce more complexity into the transformations, according to the requirements of our system; for example, we may use a full perspective transformation rather than just a rigid rotation. Additionally, we may want to perform a deeper statistical analysis of the results. In this test, we ran the alignment process only once for each rotation condition, which is not ideal for capturing a reliable measure of timing, since some of the algorithms may benefit from executing in succession (for example, by loading static data into memory). With multiple executions, we can reason about the variance across runs and calculate the standard deviation or standard error, giving our decision-making process more information. Lastly, given enough data, we can apply statistical inference and hypothesis testing, such as a t-test or analysis of variance (ANOVA), to determine whether the small differences between conditions (for example, AKAZE versus SURF) are statistically significant or too noisy to tell apart.
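As a closing illustration, this kind of statistical follow-up could be sketched with SciPy as follows. The arrays here are synthetic placeholders standing in for per-run MSE values collected from repeated executions of the benchmark; they are not results from the experiment above.

```python
import numpy as np
from scipy import stats

# Synthetic placeholder samples, NOT measured results: imagine these are per-run
# MSE values gathered from many repeated executions of the benchmark.
rng = np.random.default_rng(42)
akaze_mse = rng.normal(loc=12.0, scale=0.5, size=20)
surf_mse = rng.normal(loc=12.3, scale=0.5, size=20)
brisk_mse = rng.normal(loc=30.0, scale=1.0, size=20)

# Welch's two-sample t-test: is the AKAZE vs. SURF gap distinguishable from noise?
t_stat, p_ttest = stats.ttest_ind(akaze_mse, surf_mse, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_ttest:.3f}")

# One-way ANOVA when comparing more than two feature types at once
f_stat, p_anova = stats.f_oneway(akaze_mse, surf_mse, brisk_mse)
print(f"F = {f_stat:.2f}, p = {p_anova:.3f}")
```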