Before moving on to stereo vision, we should visit a useful algorithm that can estimate the positions of known objects in three dimensions. POSIT (also known as "Pose from Orthography and Scaling with Iteration") is an algorithm originally proposed in 1992 [DeMenthon92] for computing the pose (the position T and orientation R, described by six parameters) of a 3D object whose exact dimensions are known. To compute this pose, we must find on the image the corresponding locations of at least four non-coplanar points on the surface of that object. The first part of the algorithm, pose from orthography and scaling (POS), assumes that the points on the object are all at effectively the same depth[194] and that size variations from the original model are due solely to scaling with distance from the camera. In this case there is a closed-form solution for that object's 3D pose based on scaling. The assumption that the object points are all at the same depth effectively means that the object is far enough away from the camera that we can neglect any internal depth differences within the object; this assumption is known as the weak-perspective approximation.
Given that we know the camera intrinsics, we can find the perspective scaling of our known object and thus compute its approximate pose. This computation will not be very accurate, but we can then project where our four observed points would go if the true 3D object were at the pose we calculated through POS. We then start all over again with these new point positions as the inputs to the POS algorithm. This process typically converges within four or five iterations to the true object pose—hence the name "POS algorithm with iteration". Remember, though, that all of this assumes that the internal depth of the object is in fact small compared to the distance away from the camera. If this assumption is not true, then the algorithm will either not converge or will converge to a "bad pose". The OpenCV implementation of this algorithm will allow us to track more than four (non-coplanar) points on the object to improve pose estimation accuracy.
The POSIT algorithm in OpenCV has three associated functions: one to allocate a data structure for the pose of an individual object, one to de-allocate the same data structure, and one to actually implement the algorithm.
CvPOSITObject* cvCreatePOSITObject(
  CvPoint3D32f*   points,
  int             point_count
);
void cvReleasePOSITObject(
  CvPOSITObject** posit_object
);
The cvCreatePOSITObject() routine just takes points (a set of three-dimensional points) and point_count (an integer indicating the number of points) and returns a pointer to an allocated POSIT object structure. Then cvReleasePOSITObject() takes a pointer to such a structure pointer and de-allocates it (setting the pointer to NULL in the process).
void cvPOSIT(
  CvPOSITObject* posit_object,
  CvPoint2D32f*  image_points,
  double         focal_length,
  CvTermCriteria criteria,
  float*         rotation_matrix,
  float*         translation_vector
);
Now, on to the POSIT function itself. The argument list to cvPOSIT() differs stylistically from most of the other functions we have seen in that it uses the "old style" arguments common in earlier versions of OpenCV.[195] Here posit_object is just a pointer to the POSIT object that you are trying to track, and image_points is a list of the locations of the corresponding points in the image plane (notice that these are 32-bit floating-point values, thus allowing for subpixel locations). The current implementation of cvPOSIT() assumes square pixels and thus allows only a single value for the focal_length parameter instead of one in the x and one in the y directions. Because cvPOSIT() is an iterative algorithm, it requires a termination criterion: criteria is of the usual form and indicates when the fit is "good enough". The final two parameters, rotation_matrix and translation_vector, are analogous to the same arguments in earlier routines; observe, however, that these are pointers to float and so are just the data part of the matrices you would obtain from calling (for example) cvCalibrateCamera2(). In this case, given a matrix M, you would want to use something like M->data.fl as an argument to cvPOSIT().
When using POSIT, keep in mind that the algorithm does not benefit from additional
surface points that are coplanar with other points already on the surface. Any point lying
on a plane defined by three other points will not contribute anything useful to the
algorithm. In fact, extra coplanar points can cause degeneracies that hurt the algorithm's
performance. Extra non-coplanar points will help the algorithm. Figure 12-3 shows the POSIT algorithm in use with
a toy plane [Tanguay00]. The plane has marking lines on it, which are used to define four
non-coplanar points. These points were fed into cvPOSIT()
, and the resulting rotation_matrix
and translation_vector
are used to control a flight simulator.
Figure 12-3. POSIT algorithm in use: four non-coplanar points on a toy jet are used to control a flight simulator
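The coplanarity caveat above is easy to check numerically before handing points to cvCreatePOSITObject(): four points are coplanar exactly when the scalar triple product of the three edge vectors from one of them is zero. Here is a small helper in plain C (no OpenCV dependency; the tolerance argument is our own assumption and should be scaled to your object's dimensions):

```c
#include <math.h>

/* Returns nonzero if the four 3D points are non-coplanar, i.e. usable
   as a minimal POSIT point set.  p holds four (x, y, z) rows; eps is a
   small tolerance appropriate to the object's scale. */
int points_are_noncoplanar( const double p[4][3], double eps ) {
    double a[3], b[3], c[3];
    for ( int i = 0; i < 3; i++ ) {
        a[i] = p[1][i] - p[0][i];   /* edge vectors from point 0 */
        b[i] = p[2][i] - p[0][i];
        c[i] = p[3][i] - p[0][i];
    }
    /* scalar triple product a . (b x c): zero iff points are coplanar */
    double det = a[0] * ( b[1]*c[2] - b[2]*c[1] )
               - a[1] * ( b[0]*c[2] - b[2]*c[0] )
               + a[2] * ( b[0]*c[1] - b[1]*c[0] );
    return fabs( det ) > eps;
}
```

A degenerate, all-coplanar set of four points (say, the corners of a square) fails this test, which is exactly the situation in which POSIT gains nothing and may behave badly.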
[194] The construction finds a reference plane through the object that is parallel to the image plane; this plane through the object then has a single distance Z from the image plane. The 3D points on the object are first projected to this plane through the object and then projected onto the image plane using perspective projection. The result is scaled orthographic projection, and it makes relating object size to depth particularly easy.
[195] You might have noticed that many function names end in "2". More often than not, this is because the function in the current release in the library has been modified from its older incarnation to use the newer style of arguments.