Face recognition with dlib

Dlib offers a high-quality face recognition algorithm based on deep learning that achieves state-of-the-art accuracy. More specifically, the model reaches an accuracy of 99.38% on the Labeled Faces in the Wild database.

The implementation of this algorithm is based on the ResNet-34 network proposed in the paper Deep Residual Learning for Image Recognition (2016), and the model was trained using around three million faces. The resulting model (21.4 MB) can be downloaded from https://github.com/davisking/dlib-models/blob/master/dlib_face_recognition_resnet_model_v1.dat.bz2.

This network is trained in a way that generates a 128-dimensional (128D) descriptor, used to quantify the face. The training step is performed using triplets. A single triplet training example is composed of three images, two of which correspond to the same person. The network generates the 128D descriptor for each of the three images and slightly adjusts the neural network weights so that the two vectors that correspond to the same person move closer together, while the feature vector from the other person moves further away. The triplet loss function formalizes this idea: it pushes the 128D descriptors of two images of the same person closer together, while pulling the 128D descriptors of images of different people further apart.
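
As a rough illustration of this idea, here is a minimal NumPy sketch of one common triplet loss formulation (this is not dlib's actual training code; the triplet_loss() helper and the margin value are illustrative assumptions):

import numpy as np


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Illustrative triplet loss for three 128D descriptors (the margin value is an assumption)"""

    # Distance between the anchor and the image of the same person (positive):
    positive_distance = np.linalg.norm(anchor - positive)
    # Distance between the anchor and the image of a different person (negative):
    negative_distance = np.linalg.norm(anchor - negative)
    # The loss is zero once the positive pair is closer than the negative pair by at least the margin:
    return max(positive_distance - negative_distance + margin, 0.0)


# Toy example with three random 128D descriptors:
rng = np.random.default_rng(0)
anchor, positive, negative = rng.normal(size=(3, 128))
print(triplet_loss(anchor, positive, negative))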

This process is repeated millions of times for millions of images of thousands of different people and, finally, the network is able to generate a 128D descriptor for each face. Therefore, the final 128D descriptor is a good encoding because descriptors of different images of the same person end up close together in the 128D space, while descriptors of images of different people end up far apart.

Therefore, making use of dlib's functionality, we can use this pre-trained model to map a face into a 128D descriptor. Afterward, we can use these feature vectors to perform face recognition.

The encode_face_dlib.py script shows how to calculate the 128D descriptor, used to quantify the face. The process is quite simple, as shown in the following code:

import cv2

# Load image:
image = cv2.imread("jared_1.jpg")

# Convert image from BGR (OpenCV format) to RGB (dlib format):
rgb = image[:, :, ::-1]

# Calculate the encodings for every face of the image:
encodings = face_encodings(rgb)

# Show the first encoding:
print(encodings[0])

As you can guess, the face_encodings() function returns the 128D descriptor for each face in the image:

import dlib
import numpy as np

pose_predictor_5_point = dlib.shape_predictor("shape_predictor_5_face_landmarks.dat")
face_encoder = dlib.face_recognition_model_v1("dlib_face_recognition_resnet_model_v1.dat")
detector = dlib.get_frontal_face_detector()


def face_encodings(face_image, number_of_times_to_upsample=1, num_jitters=1):
    """Returns the 128D descriptor for each face in the image"""

    # Detect faces:
    face_locations = detector(face_image, number_of_times_to_upsample)
    # Detected landmarks:
    raw_landmarks = [pose_predictor_5_point(face_image, face_location) for face_location in face_locations]
    # Calculate the face encoding for every detected face using the detected landmarks for each one:
    return [np.array(face_encoder.compute_face_descriptor(face_image, raw_landmark_set, num_jitters)) for
            raw_landmark_set in raw_landmarks]

As you can see, the key point is to calculate the face encoding for every detected face, using the detected landmarks for each one and calling the dlib face_encoder.compute_face_descriptor() function.

The num_jitters parameter sets the number of times each face will be randomly jittered, and the average of the 128D descriptors computed over those jittered versions is returned. In this case, the output (the 128D descriptor that encodes the face) is as follows:

[-0.08550473 0.14213498 0.01144615 -0.05947386 -0.05831585 0.01127038 -0.05497809 -0.03466939 0.14322688 -0.1001832 0.17384697 0.02444006 -0.25994921 0.13708787 -0.08945534 0.11796272 -0.25426617 -0.0829383 -0.05489913 -0.10409787 0.07074109 0.05810066 -0.03349853 0.07649824 -0.07817822 -0.29932317 -0.15986916 -0.087205 0.10356752 -0.12659372 0.01795856 -0.01736169 -0.17094864 -0.01318233 -0.00201829 0.0104903 -0.02453734 -0.11754096 0.2014133 0.12671679 -0.0271306 -0.02350519 0.08327188 0.36815098 0.12599576 0.04692561 0.03585262 -0.03999642 0.23675609 -0.28394884 0.11896492 0.11870296 0.20243752 0.2106981 0.03092775 -0.14315812 0.07708532 0.16536239 -0.19648902 0.22793224 0.06825032 -0.00117573 0.00304667 -0.01902146 0.2539638 0.09768397 -0.13558105 -0.15079053 0.11357955 -0.14893037 -0.09028706 0.03625216 -0.13004847 -0.16567475 -0.21958281 0.08687183 0.35941613 0.16637127 -0.08334676 0.02806632 -0.09188357 -0.10760318 0.02889947 0.08376379 -0.11524356 -0.00998984 -0.05582509 0.09372396 0.30287758 -0.01063644 -0.07903813 0.30418509 -0.01998731 0.0752025 -0.00424637 0.07463965 -0.12972119 -0.04034984 -0.08435905 -0.01642537 0.00847361 -0.09549874 -0.07568903 0.06476583 -0.19202243 0.16904426 -0.01247451 0.03941975 -0.01960869 0.02145611 -0.25607404 -0.03039071 0.20248309 -0.25835767 0.21397503 0.19302645 0.07284702 0.07879912 0.06171442 0.02366752 0.06781606 -0.06446165 -0.14713687 -0.0714087 0.11978403 -0.01525984 -0.04687868 0.00167655]
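
For example, to average the descriptor over ten randomly jittered copies of each detected face (at the cost of a slower computation), the face_encodings() function defined above could be called like this:

# Compute the encodings, averaging over 10 random jitters of each detected face:
encodings = face_encodings(rgb, num_jitters=10)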

Once the faces are encoded, the next step is to perform the recognition.

Recognition can then be easily performed by computing a distance metric between the 128D descriptors. Indeed, if the Euclidean distance between two face descriptor vectors is less than 0.6, they can be considered to belong to the same person; otherwise, they are from different people.

The Euclidean distance can be calculated using numpy.linalg.norm().
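
For example, a minimal sketch of this check, reusing the face_encodings() function defined previously and two of the test images, could look like this:

# Load two images and convert them from BGR (OpenCV format) to RGB (dlib format):
image_1 = cv2.imread("jared_1.jpg")[:, :, ::-1]
image_2 = cv2.imread("jared_2.jpg")[:, :, ::-1]

# Compute the 128D descriptor of the first detected face in each image:
encoding_1 = face_encodings(image_1)[0]
encoding_2 = face_encodings(image_2)[0]

# Two descriptors are considered to belong to the same person if their Euclidean distance is below 0.6:
distance = np.linalg.norm(encoding_1 - encoding_2)
print("distance: {:.4f}, same person: {}".format(distance, distance < 0.6))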

In the compare_faces_dlib.py script, we compare four images against another image. To compare the faces, we have coded two functions: compare_faces() and compare_faces_ordered(). The compare_faces() function returns the distances when comparing a list of face encodings against a candidate to check:

def compare_faces(face_encodings, encoding_to_check):
    """Returns the distances when comparing a list of face encodings against a candidate to check"""

    return list(np.linalg.norm(face_encodings - encoding_to_check, axis=1))

The compare_faces_ordered() function returns the ordered distances and the corresponding names when comparing a list of face encodings against a candidate to check:

def compare_faces_ordered(face_encodings, face_names, encoding_to_check):
    """Returns the ordered distances and names when comparing a list of face encodings against a candidate to check"""

    distances = list(np.linalg.norm(face_encodings - encoding_to_check, axis=1))
    return zip(*sorted(zip(distances, face_names)))

Therefore, the first step in comparing four images against another image is to load all of them and convert them to RGB (dlib format):

# Load images:
known_image_1 = cv2.imread("jared_1.jpg")
known_image_2 = cv2.imread("jared_2.jpg")
known_image_3 = cv2.imread("jared_3.jpg")
known_image_4 = cv2.imread("obama.jpg")
unknown_image = cv2.imread("jared_4.jpg")

# Convert images from BGR (OpenCV format) to RGB (dlib format):
known_image_1 = known_image_1[:, :, ::-1]
known_image_2 = known_image_2[:, :, ::-1]
known_image_3 = known_image_3[:, :, ::-1]
known_image_4 = known_image_4[:, :, ::-1]
unknown_image = unknown_image[:, :, ::-1]

# Create names for each loaded image:
names = ["jared_1.jpg", "jared_2.jpg", "jared_3.jpg", "obama.jpg"]

The next step is to compute the encodings for each image:

# Create the encodings:
known_image_1_encoding = face_encodings(known_image_1)[0]
known_image_2_encoding = face_encodings(known_image_2)[0]
known_image_3_encoding = face_encodings(known_image_3)[0]
known_image_4_encoding = face_encodings(known_image_4)[0]
known_encodings = [known_image_1_encoding, known_image_2_encoding, known_image_3_encoding, known_image_4_encoding]
unknown_encoding = face_encodings(unknown_image)[0]

And finally, you can compare the faces using the previous functions. For example, let's make use of the compare_faces_ordered() function:

computed_distances_ordered, ordered_names = compare_faces_ordered(known_encodings, names, unknown_encoding)
print(computed_distances_ordered)
print(ordered_names)

Doing so will give us the following:

(0.3913191431497527, 0.39983264838593896, 0.4104153683230741, 0.9053700273411349)
('jared_3.jpg', 'jared_1.jpg', 'jared_2.jpg', 'obama.jpg')

The first three values (0.3913191431497527, 0.39983264838593896, 0.4104153683230741) are less than 0.6. This means that the first three images ('jared_3.jpg', 'jared_1.jpg', 'jared_2.jpg') can be considered to show the same person as the image to check ('jared_4.jpg'). The fourth value obtained (0.9053700273411349) means that the fourth image ('obama.jpg') does not show the same person as the image to check.

This can be seen in the next screenshot:


In the previous screenshot, you can see that the first three images can be considered to show the same person as the query image (the obtained values are less than 0.6), while the fourth image shows another person (the obtained value is greater than 0.6).
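
Finally, as an illustrative sketch (this snippet is not part of the scripts described above), the ordered distances could be turned into a simple identification decision by taking the best match and applying the 0.6 threshold:

# Take the best (smallest) distance and its corresponding name:
best_distance = computed_distances_ordered[0]
best_name = ordered_names[0]

# Apply the 0.6 threshold to decide whether the unknown face matches one of the known faces:
if best_distance < 0.6:
    print("The unknown face matches '{}' (distance: {:.4f})".format(best_name, best_distance))
else:
    print("The unknown face does not match any of the known faces")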