Estimating distances (a cheap approach)

Suppose we have an object sitting in front of a pinhole camera. Regardless of the distance between the camera and object, the following equation holds true:

objectSizeInImage / focalLength = objectSizeInReality / distance

We might use any unit (such as pixels) in the equation's left-hand side and any unit (such as meters) in its right-hand side. (On each side of the equation, the division cancels the unit.) Moreover, we can define the object's size based on anything linear that we can detect in the image, such as the diameter of a detected blob or the width of a detected face rectangle.

Let's rearrange the equation to illustrate that the distance to the object is inversely proportional to the object's size in the image:

distance = focalLength * objectSizeInReality / objectSizeInImage
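To make the arithmetic concrete, here is a minimal Python sketch with made-up numbers (the focal length, pixel size, and real size below are illustrative, not measured):

```python
focal_length_px = 800.0  # hypothetical focal length, in pixels
object_size_px = 300.0   # hypothetical detected size, in pixels
object_size_m = 1.5      # hypothetical real-world size, in meters

# distance = focalLength * objectSizeInReality / objectSizeInImage
distance_m = focal_length_px * object_size_m / object_size_px
print(distance_m)  # 4.0 (meters)
```

Note that the pixel units cancel on the right-hand side, leaving the distance in meters, just as the earlier discussion of units predicted.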

Let's assume that the object's real size and the camera's focal length are constant. Consider the following rearrangement, which isolates this pair of constants on the right-hand side of the equation:

distance * objectSizeInImage = focalLength * objectSizeInReality

As the right-hand side of the equation is constant, so is the left. We can conclude that the following relationship holds true over time:

newDist * newObjectSizeInImage = oldDist * oldObjectSizeInImage

Let's solve the equation for the new distance:

newDist = oldDist * oldObjectSizeInImage / newObjectSizeInImage
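Expressed as code, this is a one-line function. The following is a minimal sketch (the function and variable names are our own choices, not from any library):

```python
def estimate_distance(old_dist, old_object_size_in_image,
                      new_object_size_in_image):
    """Estimate the current distance to an object, given one
    reference (ground-truth) distance and the object's pixel
    size then and now."""
    return old_dist * old_object_size_in_image / new_object_size_in_image

# Example: the object was measured at 10.0 m when it appeared
# 120 px wide. It now appears 60 px wide, so it should be about
# twice as far away.
print(estimate_distance(10.0, 120.0, 60.0))  # 20.0
```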

Now, let's think about applying this equation in software. To provide ground truth, the user must take a single, real measurement of the distance, which serves as the old distance in all subsequent calculations. Meanwhile, our detection algorithm gives us the object's old pixel size and each subsequent new pixel size, so we can compute a new distance whenever we have a detection result. Let's review the following assumptions for the camera:

  1. There is no lens distortion; the pinhole camera model can be used here.
  2. Focal length is constant; no zoom is applied.
  3. The object is rigid; its real-world measurements do not change.
  4. The camera is always viewing the same side of the object; the relative rotation of camera and object does not change.

One might wonder whether the first assumption is problematic, as webcams often have cheap wide-angle lenses with significant distortion. Despite lens distortion, does the object's size in the image remain inversely proportional to the real distance between the camera and object? The following paper reports experimental results for a lens that appears to distort very badly and an object that is located off-center (in an image region where distortion is likely to be especially bad):

M. N. A. Wahab, N. Sivadev, and K. Sundaraj, "Target distance estimation using monocular vision system for mobile robot," IEEE Conference on Open Systems (ICOS) 2011 Proceedings, vol. 11, no. 15, pp. 25-28, September 2011.

Using exponential regression, the authors show that the following model is a good fit for the experimental data (R^2 = 0.995):

distanceInCentimeters = 4042 * (objectSizeInPixels ^ -1.2)

Note that the exponent is close to -1 and, thus, the statistical model is not far from the ideal inverse relationship that we would hope to find. Even the bad lens and off-center subject did not break our assumptions too horribly!
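As a quick sanity check of this claim, we can compare the authors' fitted model against an ideal inverse model (exponent -1) that is pinned to agree with it at one reference size. This sketch uses only the coefficients reported above; the reference size of 100 px is an arbitrary choice:

```python
def fitted(size_px):
    # The paper's regression model (R^2 = 0.995).
    return 4042.0 * size_px ** -1.2

def ideal(size_px):
    # An exact inverse model, calibrated to match the fitted
    # model at a reference size of 100 px.
    return fitted(100.0) * (100.0 / size_px)

for size_px in (50.0, 100.0, 200.0):
    print(size_px, round(fitted(size_px), 1), round(ideal(size_px), 1))
# Over this 4x range of sizes, the two models stay within
# roughly 15 percent of each other.
```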

We can ensure that the second assumption (no zooming) holds true simply by never adjusting the camera's zoom while our application runs.

Let's consider the third and fourth assumptions (rigidity and constant rotation) in the case of a camera and object that are parts of two cars on a highway. Except in a crash, most of a car's exterior parts are rigid. Except when passing or pulling over, one car travels directly behind the other on a surface that is mostly flat and mostly straight. However, on a road that is hilly or has many turns, the assumptions start to fall apart: it becomes more difficult to predict which side of the object is currently being viewed, and thus more difficult to say whether our reference measurements apply to the side that is currently in view.

Of course, we need to define a generic piece of a car to be our "object". The headlights (and the space between them) are a decent choice, since we have a method for detecting them and the distance between headlights is similar across most models of cars. That being said, the distance is not exactly the same across all models, and this variation is a weakness in our choice of object, though the same is true of other car components too.
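To sketch how this might look in code, suppose our detector yields each headlight's center as an (x, y) pixel coordinate. The detector itself, and all of the numbers below, are hypothetical:

```python
import math

# One-time ground truth, supplied by the user: the car was
# measured at 18.0 m when its headlights appeared 80 px apart.
REF_DIST_M = 18.0
REF_GAP_PX = 80.0

def headlight_gap_px(left_center, right_center):
    """Pixel distance between two detected headlight centers."""
    dx = right_center[0] - left_center[0]
    dy = right_center[1] - left_center[1]
    return math.hypot(dx, dy)

def estimate_car_distance(left_center, right_center):
    """Apply newDist = oldDist * oldSize / newSize to the gap
    between the headlights."""
    return REF_DIST_M * REF_GAP_PX / headlight_gap_px(
        left_center, right_center)

# The headlights now appear 40 px apart, so the car should be
# about twice as far away as at calibration time: 36.0 m.
print(estimate_car_distance((300, 240), (340, 240)))
```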

All distance estimation techniques in computer vision rely on some assumptions (or some calibration steps) relating to the camera, the object, the relationship between camera and object, and/or the lighting; the more sophisticated techniques are no exception.

On balance, the simplistic approach based on pixel distances being inversely proportional to real distances is a justifiable choice, given our application and our intent to support the Pi.