Although OpenCV does not focus exclusively on real-time algorithms, it will continue to favor real-time techniques. No one can state future plans with certainty, but the following high-priority areas are likely to be addressed.
There are more "consumers" for fully working applications than for low-level functionality. For example, more people will make use of a fully automatic stereo solution than of a better subpixel corner detector. Expect several more full applications, such as extensible single-to-many camera calibration and rectification, as well as a GUI for 3D depth display.
As already mentioned, you can expect to see better support for 3D depth sensors and for combinations of 2D cameras with 3D measurement devices, along with better stereo algorithms. Support for structured light is also likely.
Because we want to know how whole objects move (and partly to support 3D vision), OpenCV is long overdue for an efficient implementation of Black's [Black96] dense optical flow techniques.
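For the curious, OpenCV's current dense flow support gives a feel for the interface such an implementation would likely follow. The fragment below is a minimal sketch using the existing Horn-Schunck routine, cvCalcOpticalFlowHS; a Black-style robust estimator would replace that one call while keeping the same dense velx/vely output. The file names and parameter settings here are placeholder choices, not recommendations.

    #include <cv.h>
    #include <highgui.h>

    int main() {
        // Placeholder input: two consecutive 8-bit grayscale frames
        IplImage* prev = cvLoadImage("prev.png", CV_LOAD_IMAGE_GRAYSCALE);
        IplImage* curr = cvLoadImage("curr.png", CV_LOAD_IMAGE_GRAYSCALE);
        if (!prev || !curr) return -1;

        CvSize size = cvGetSize(prev);
        IplImage* velx = cvCreateImage(size, IPL_DEPTH_32F, 1);  // per-pixel x velocity
        IplImage* vely = cvCreateImage(size, IPL_DEPTH_32F, 1);  // per-pixel y velocity

        // Horn-Schunck dense flow; lambda weights the smoothness term.
        // A robust (Black-style) estimator would replace this one call.
        cvCalcOpticalFlowHS(prev, curr, 0, velx, vely, 0.001,
                            cvTermCriteria(CV_TERMCRIT_ITER | CV_TERMCRIT_EPS, 64, 0.01));

        // ... use velx and vely here ...

        cvReleaseImage(&velx);  cvReleaseImage(&vely);
        cvReleaseImage(&prev);  cvReleaseImage(&curr);
        return 0;
    }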
In support of better object recognition, you can expect a full-function tool kit that will have a framework for interchangeable interest-point detection and interchangeable keys for interest-point identification. This will include popular features such as SURF, HoG, Shape Context, MSER, Geometric Blur, PHOG, PHOW, and others. Support for 2D and 3D features is planned.
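To make the idea of interchangeability concrete, here is one possible shape for such a framework. None of these class or struct names exist in OpenCV today; they are purely illustrative of how any detector could be paired with any key extractor.

    #include <cv.h>
    #include <vector>

    // Hypothetical types -- not part of OpenCV.
    struct InterestPoint { CvPoint2D32f pt; float scale, angle; };

    class InterestPointDetector {        // e.g., Harris, DoG, or MSER behind one interface
    public:
        virtual ~InterestPointDetector() {}
        virtual void detect(const IplImage* img,
                            std::vector<InterestPoint>& pts) const = 0;
    };

    class FeatureKeyExtractor {          // e.g., SURF, HoG, or Shape Context keys
    public:
        virtual ~FeatureKeyExtractor() {}
        virtual void compute(const IplImage* img,
                             const std::vector<InterestPoint>& pts,
                             CvMat** keys) const = 0;   // one descriptor row per point
    };

    // Any detector can then be paired with any key extractor:
    void detectAndDescribe(const IplImage* img,
                           const InterestPointDetector& detector,
                           const FeatureKeyExtractor& extractor,
                           CvMat** keys) {
        std::vector<InterestPoint> pts;
        detector.detect(img, pts);
        extractor.compute(img, pts, keys);
    }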
Planned infrastructure improvements include things like a wrapper class,[269] a good Python interface, GUI improvements, documentation improvements, better error handling, improved Linux support, and so on.
More seamless handling of cameras is planned, along with eventual support for cameras with higher dynamic range. Currently, most cameras support only 8 bits per color channel (if that), but newer cameras can supply 10 or 12 bits per channel.[270] The higher dynamic range of such cameras allows for better recognition and stereo registration because it captures the subtle textures and colors to which older, narrower-range cameras are blind.
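As a small illustration of what higher-dynamic-range support involves, the sketch below maps a hypothetical 10-bit image (values 0-1023, stored in a 16-bit IplImage) down to 8 bits for display using the existing cvConvertScale. The capture path that would deliver such an image is assumed, not shown.

    #include <cv.h>

    // raw16 is IPL_DEPTH_16U holding 10 significant bits;
    // disp8 is a preallocated IPL_DEPTH_8U image of the same size.
    void show10BitAs8Bit(const IplImage* raw16, IplImage* disp8) {
        cvConvertScale(raw16, disp8, 255.0 / 1023.0, 0);  // rescale 0..1023 -> 0..255
    }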
Many object recognition techniques in computer vision detect salient regions that change little between views. These salient regions[271] can be tagged with some kind of key—for example, a histogram of image gradient directions around the salient point. Although all the techniques described in this section can be built with existing OpenCV primitives, OpenCV currently lacks direct implementations of the most popular interest-region detectors and feature keys.
OpenCV does include an efficient implementation of the Harris corner interest-point detector, but it lacks direct support for the popular "maximal Laplacian over scale" detector developed by David Lowe [Lowe04], for maximally stable extremal region (MSER) detectors [Matas02], and for others.
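Meanwhile, the existing primitives do let you assemble a rough detect-and-describe pipeline yourself. The sketch below finds Harris corners with cvGoodFeaturesToTrack and then builds a crude key for each corner: a histogram of gradient directions over a small patch, in the spirit of the keys described above. The patch radius, bin count, and corner limit are arbitrary illustrative choices.

    #include <cv.h>
    #include <math.h>

    #define MAX_CORNERS 100
    #define NBINS 8      // orientation histogram bins per key
    #define RADIUS 8     // half-width of the patch around each corner

    void harrisWithOrientationKeys(const IplImage* gray,   // 8-bit, single channel
                                   float keys[MAX_CORNERS][NBINS], int* nkeys) {
        CvSize sz = cvGetSize(gray);
        IplImage* eig = cvCreateImage(sz, IPL_DEPTH_32F, 1);
        IplImage* tmp = cvCreateImage(sz, IPL_DEPTH_32F, 1);
        IplImage* dx  = cvCreateImage(sz, IPL_DEPTH_32F, 1);
        IplImage* dy  = cvCreateImage(sz, IPL_DEPTH_32F, 1);
        CvPoint2D32f corners[MAX_CORNERS];
        int n = MAX_CORNERS;

        // use_harris=1 selects the Harris response rather than min eigenvalue
        cvGoodFeaturesToTrack(gray, eig, tmp, corners, &n, 0.01, 10, 0, 3, 1, 0.04);
        cvSobel(gray, dx, 1, 0, 3);   // image gradients
        cvSobel(gray, dy, 0, 1, 3);

        for (int i = 0; i < n; i++) {
            for (int b = 0; b < NBINS; b++) keys[i][b] = 0.f;
            int cx = cvRound(corners[i].x), cy = cvRound(corners[i].y);
            for (int y = cy - RADIUS; y <= cy + RADIUS; y++) {
                if (y < 0 || y >= sz.height) continue;
                for (int x = cx - RADIUS; x <= cx + RADIUS; x++) {
                    if (x < 0 || x >= sz.width) continue;
                    float gx = CV_IMAGE_ELEM(dx, float, y, x);
                    float gy = CV_IMAGE_ELEM(dy, float, y, x);
                    double ang = atan2((double)gy, (double)gx) + CV_PI;  // [0, 2pi]
                    int bin = cvFloor(ang / (2 * CV_PI) * NBINS) % NBINS;
                    keys[i][bin] += (float)sqrt((double)(gx*gx + gy*gy)); // magnitude-weighted
                }
            }
        }
        *nkeys = n;
        cvReleaseImage(&eig); cvReleaseImage(&tmp);
        cvReleaseImage(&dx);  cvReleaseImage(&dy);
    }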
Similarly, OpenCV lacks many of the popular keys that identify salient regions, such as SURF gradient histogram grids [Bay06]. Also, we hope to include features such as histograms of oriented gradients (HoG) [Dalal05], Geometric Blur [Berg01], offset image patches [Torralba07], dense rapidly computed Gaussian scale variant gradients (DAISY) [Tola08], and gradient location and orientation histograms (GLOH) [Mikolajczyk04]; and, though patented, we want to add for reference the scale invariant feature transform (SIFT) descriptor [Lowe04] that started it all. Other learned feature descriptors that show promise are learned patches with orientation [Hinterstoisser08] and learned ratio points [Ozuysal07]. We'd also like to see contextual or meta-features such as pyramid match kernels [Grauman05], pyramid histogram embedding of other features (PHOW) [Bosch07], Shape Context [Belongie00; Mori05], and other approaches that locate features by their probabilistic spatial distribution [Fei-Fei98]. Finally, some global features give the gist of an entire scene, which can be used to boost recognition by context [Oliva06]. All this is a tall order, and the OpenCV community is encouraged to develop and donate code for these and other features.
Other groups have demonstrated encouraging results using frameworks that employ efficient nearest neighbor matching to recognize objects against huge learned databases of objects [Nister06; Philbin07; Torralba08]. Adding an efficient nearest neighbor framework to OpenCV is therefore suggested.
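For reference, the brute-force version of nearest neighbor matching is easy to write; an efficient framework would keep an interface like this one but replace the linear scan with, say, a kd-tree or a vocabulary tree [Nister06]. The function below assumes single-channel floating-point descriptor matrices, one descriptor per row.

    #include <cv.h>
    #include <float.h>

    // Return the row index of 'db' closest (L2 distance) to the
    // single-row matrix 'query'. Both assumed CV_32FC1 or CV_64FC1.
    int nearestNeighbor(const CvMat* db, const CvMat* query) {
        int best = -1;
        double bestDist = DBL_MAX;
        for (int r = 0; r < db->rows; r++) {
            double d = 0;
            for (int c = 0; c < db->cols; c++) {
                double diff = cvmGet(db, r, c) - cvmGet(query, 0, c);
                d += diff * diff;
            }
            if (d < bestDist) { bestDist = d; best = r; }
        }
        return best;  // the linear scan is what a real framework would accelerate
    }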
For robotics, we need object recognition (what) and object location (where). This suggests adding segmentation approaches that build on Shi and Malik's work [Shi00], perhaps with faster implementations [Sharon06]. Recent approaches, however, use learning to provide recognition and segmentation together [Oppelt08; Schroff08; Sivic08]. Direction of lighting [Sun98] and shape cues [Zhang99; Prados05] may also be important.
Along with better support for features and for 3D sensing should come support for visual odometry and visual SLAM (simultaneous localization and mapping). As we acquire more accurate depth perception and feature identification, we'll want to enable better navigation and 3D object manipulation. There is also discussion about creating a specialized vision interface to a ray-tracing package (perhaps the Manta open source ray-tracing software [Manta]) in order to generate better 3D object training sets.
Robots, security systems, and Web image and video search all need the ability to recognize objects; thus, OpenCV must refine the pattern-matching techniques in its machine learning library. In particular, OpenCV should first simplify its interface to the learning algorithms and then give them good defaults so that they work "out of the box". Several new learning techniques may arise, some of which will work with two or more object classes at a time (as random forest does now in OpenCV). There is a need for scalable recognition techniques so that the user can avoid having to learn a completely new model for each object class. More allowances should be made to enable ML classifiers to work with depth information and 3D features.
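To illustrate the "out of the box" goal, the existing random trees classifier already comes close: it trains with all-default parameters. The matrices in this sketch are placeholders the caller must fill.

    #include <ml.h>

    void trainWithDefaults(const CvMat* data,        // CV_32FC1, one sample per row
                           const CvMat* responses,   // one class label per sample
                           const CvMat* sample) {    // a single CV_32FC1 row to classify
        CvRTrees forest;
        // CV_ROW_SAMPLE says samples are rows; all other arguments left at defaults
        forest.train(data, CV_ROW_SAMPLE, responses, 0, 0, 0, 0, CvRTParams());
        double predictedClass = forest.predict(sample, 0);
        (void)predictedClass;  // ... use the prediction ...
    }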
Markov random fields (MRFs) and conditional random fields (CRFs) are becoming quite popular in computer vision. These methods are often highly problem-specific, yet we would like to figure out how they might be supported in a flexible way.
We'll also want methods for learning from web-sized databases, or from databases collected automatically by a moving robot, perhaps by incorporating Zisserman's suggestion of "approximate nearest neighbor" techniques, mentioned previously, for dealing with millions or billions of data points. Similarly, we need much-accelerated boosting and Haar feature training support to allow scaling to larger object databases. Several of the ML library routines currently require that all the data reside in memory, severely limiting their use on large datasets. OpenCV will need to break free of such restrictions.
OpenCV also requires better documentation than is now available. This book helps, of course, but the OpenCV manual needs an overhaul together with improved search capability. A high priority is incorporating better Linux support and a better external language interface—especially to allow easy vision programming with Python and NumPy. We'll also want to make sure that the machine learning library can be directly called from Python and its SciPy and NumPy packages.
For better developer community interaction, developer workshops may be held at major vision conferences. There are also efforts underway that propose vision "grand challenge" competitions with commensurate prize money.
[269] Daniel Filip and Google have donated the fast, lightweight image class wrapper, WImage, which they developed for internal use, to OpenCV. It will be incorporated by the time this book is published, but too late for documentation in this version.
[270] Many expensive cameras claim up to 16 bits, but the authors have yet to see more than 10 actual bits of resolution, the rest being noise.
[271] These are also known as interest points.