Estimating distances (a cheap approach)

Suppose we have an object sitting in front of a pinhole camera. Regardless of the distance between the camera and object, the following equation holds true:

objectSizeInImage / focalLength = objectSizeInReality / distance

We might use any unit (such as pixels) in the equation's left-hand side and any unit (such as meters) in its right-hand side. (On each side of the equation, the division cancels the unit.) Moreover, we can define the object's size based on anything linear that we can detect in the image, such as the diameter of a detected blob or the width of a detected face rectangle.

Let's rearrange the equation to illustrate that the distance to the object is inversely proportional to the object's size in the image:

distance = focalLength * objectSizeInReality / objectSizeInImage
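To make the arithmetic concrete, here is a minimal Python sketch with made-up numbers (the focal length, pixel size, and real size below are illustrative, not measured):

```python
focal_length_px = 800.0  # hypothetical focal length, in pixels
object_size_px = 300.0   # hypothetical detected size, in pixels
object_size_m = 1.5      # hypothetical real-world size, in meters

# distance = focalLength * objectSizeInReality / objectSizeInImage
distance_m = focal_length_px * object_size_m / object_size_px
print(distance_m)  # 4.0 (meters)
```

Note that the pixel units cancel on the right-hand side, leaving the distance in meters, just as the earlier discussion of units predicted.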

Let's assume that the object's real size and the camera's focal length are constant. Consider the following rearrangement, which isolates this pair of constants on the right-hand side of the equation:

distance * objectSizeInImage = focalLength * objectSizeInReality

As the right-hand side of the equation is constant, so is the left. We can conclude that the following relationship holds true over time:

newDist * newObjectSizeInImage = oldDist * oldObjectSizeInImage

Let's solve the equation for the new distance:

newDist = oldDist * oldObjectSizeInImage / newObjectSizeInImage
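Expressed as code, this is a one-line function. The following is a minimal sketch (the function and variable names are our own choices, not from any library):

```python
def estimate_distance(old_dist, old_object_size_in_image,
                      new_object_size_in_image):
    """Estimate the current distance to an object, given one
    reference (ground-truth) distance and the object's pixel
    size then and now."""
    return old_dist * old_object_size_in_image / new_object_size_in_image

# Example: the object was measured at 10.0 m when it appeared
# 120 px wide. It now appears 60 px wide, so it should be about
# twice as far away.
print(estimate_distance(10.0, 120.0, 60.0))  # 20.0
```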

Now, let's think about applying this equation in software. To provide ground truth, the user must take a single, real measurement of the distance, which serves as the old distance in all subsequent calculations. Meanwhile, our detection algorithm gives us the object's old pixel size and each subsequent new pixel size, so we can compute a new distance whenever we have a detection result. Let's review the following assumptions for the camera:

  1. There is no lens distortion; the pinhole camera model can be used here.
  2. Focal length is constant; no zoom is applied.
  3. The object is rigid; its real-world measurements do not change.
  4. The camera is always viewing the same side of the object; the relative rotation of camera and object does not change.

One might wonder whether the first assumption is problematic, as webcams often have cheap wide-angle lenses with significant distortion. Despite lens distortion, does the object's size in the image remain inversely proportional to the real distance between the camera and object? The following paper reports experimental results for a lens that appears to distort very badly and an object that is located off-center (in an image region where distortion is likely to be especially bad):

M. N. A. Wahab, N. Sivadev, and K. Sundaraj, "Target distance estimation using monocular vision system for mobile robot," IEEE Conference on Open Systems (ICOS) 2011 Proceedings, vol. 11, no. 15, pp. 25-28, September 2011.

Using exponential regression, the authors show that the following model is a good fit for the experimental data (R^2 = 0.995):

distanceInCentimeters = 4042 * (objectSizeInPixels ^ -1.2)

Note that the exponent is close to -1 and, thus, the statistical model is not far from the ideal inverse relationship that we would hope to find. Even the bad lens and off-center subject did not break our assumptions too horribly!
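As a quick sanity check of this claim, we can compare the authors' fitted model against an ideal inverse model (exponent -1) that is pinned to agree with it at one reference size. This sketch uses only the coefficients reported above; the reference size of 100 px is an arbitrary choice:

```python
def fitted(size_px):
    # The paper's regression model (R^2 = 0.995).
    return 4042.0 * size_px ** -1.2

def ideal(size_px):
    # An exact inverse model, calibrated to match the fitted
    # model at a reference size of 100 px.
    return fitted(100.0) * (100.0 / size_px)

for size_px in (50.0, 100.0, 200.0):
    print(size_px, round(fitted(size_px), 1), round(ideal(size_px), 1))
# Over this 4x range of sizes, the two models stay within
# roughly 15 percent of each other.
```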

We can ensure that the second assumption (no zooming) holds true simply by never adjusting the camera's zoom while our application runs.

Let's consider the third and fourth assumptions (rigidity and constant rotation) in the case of a camera and object that are parts of two cars on a highway. Except in a crash, most of a car's exterior parts are rigid. Except when passing or pulling over, one car travels directly behind the other on a surface that is mostly flat and mostly straight. However, on a road that is hilly or has many turns, the assumptions start to fall apart: it becomes more difficult to predict which side of the object is currently being viewed, and thus more difficult to say whether our reference measurements apply to the side that is currently in view.

Of course, we need to define a generic piece of a car to be our "object". The headlights (and the space between them) are a decent choice, since we have a method for detecting them and the distance between headlights is similar across most models of cars. That being said, the distance is not exactly the same across all models, and this variation is a weakness in our choice of object, though the same is true of other car components too.
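To sketch how this might look in code, suppose our detector yields each headlight's center as an (x, y) pixel coordinate. The detector itself, and all of the numbers below, are hypothetical:

```python
import math

# One-time ground truth, supplied by the user: the car was
# measured at 18.0 m when its headlights appeared 80 px apart.
REF_DIST_M = 18.0
REF_GAP_PX = 80.0

def headlight_gap_px(left_center, right_center):
    """Pixel distance between two detected headlight centers."""
    dx = right_center[0] - left_center[0]
    dy = right_center[1] - left_center[1]
    return math.hypot(dx, dy)

def estimate_car_distance(left_center, right_center):
    """Apply newDist = oldDist * oldSize / newSize to the gap
    between the headlights."""
    return REF_DIST_M * REF_GAP_PX / headlight_gap_px(
        left_center, right_center)

# The headlights now appear 40 px apart, so the car should be
# about twice as far away as at calibration time: 36.0 m.
print(estimate_car_distance((300, 240), (340, 240)))
```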

All distance estimation techniques in computer vision rely on some assumptions (or some calibration steps) relating to the camera, the object, the relationship between camera and object, and/or the lighting; the more sophisticated techniques are no exception.

On balance, the simplistic approach based on pixel distances being inversely proportional to real distances is a justifiable choice, given our application and our intent to support the Pi.