NEURAL NETWORKS AND DEEP LEARNING
Every day, cameras powered by facial recognition technology surveil travelers moving through airports to instantly identify terrorists, dissidents, and other political opponents. Autonomous vehicles and drones use similar technology to navigate their surroundings. While this type of computer vision may seem like an extraordinarily complex and mysterious superpower, at its core, this kind of narrow AI system is merely a sophisticated form of supervised learning. To help you understand how computer vision systems work, let’s start with a simple image classification task.
Figure 11.1 Closed-circuit monitoring of travelers. © Pixinoo – Licensed from Dreamstime.com ID 79209154.
Suppose we want to build a system that can distinguish apples from bananas.
Figure 11.2 Apple vs. banana classification task.
We could first run each picture through a color filter to extract the dominant color. Then we could create a training table like table 11.1 that contains a manually entered feature (the color) of each image:
COLOR | ANSWER |
---|---|
Red | Apple |
Yellow | Banana |
Table 11.1 Simple training table to distinguish apples from bananas.
Of course, we do not need an AI system to figure out the rule “If it is red, it is an apple; if it is yellow, it is a banana.” But that scheme would not work anyway once we encounter green apples and green bananas. Instead of relying only on color, table 11.2 adds a column that indicates whether the fruit shape is roundish.
COLOR | ROUNDISH? | ANSWER |
---|---|---|
Red | Yes | Apple |
Yellow | No | Banana |
Green | Yes | Apple |
Green | No | Banana |
Table 11.2 Training table that includes both apples and bananas.
Analyzing this table, we find we do not need the color column at all, since we can just classify the fruits based on whether they are roundish or not.
Next, suppose we want our classifier to distinguish among apples, bananas, and pears.
Figure 11.3 Apples, bananas, pears classification.
COLOR | ROUNDISH? | ANSWER |
---|---|---|
Red | Yes | Apple |
Yellow | No | Banana |
Green | Yes | Apple |
Green | No | Banana |
Brown | Yes | Pear |
Table 11.3 Training table that includes pears.
When we analyze table 11.3, we find we need a rule based on both color and shape.
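To make this concrete, here is a minimal sketch of how a supervised learning algorithm could learn such a rule from table 11.3. The use of a decision tree, the scikit-learn library, and the one-hot color encoding are my own illustrative choices, not anything prescribed here.

```python
# A minimal sketch: learning a classification rule from table 11.3
# with a decision tree. The encoding (one-hot color, 0/1 roundish)
# and the use of scikit-learn are illustrative assumptions.
from sklearn.tree import DecisionTreeClassifier

# Rows of table 11.3: (color, roundish?) -> answer
colors = ["Red", "Yellow", "Green", "Green", "Brown"]
roundish = [1, 0, 1, 0, 1]
answers = ["Apple", "Banana", "Apple", "Banana", "Pear"]

# One-hot encode the color feature by hand to keep the example small.
color_names = ["Red", "Yellow", "Green", "Brown"]
X = [[int(c == name) for name in color_names] + [r]
     for c, r in zip(colors, roundish)]

tree = DecisionTreeClassifier().fit(X, answers)

# A green, roundish fruit should be classified as an apple.
green_roundish = [[0, 0, 1, 0, 1]]
print(tree.predict(green_roundish))  # ['Apple']
```

Learned from these five rows, the tree effectively rediscovers the rule we could state in words: roundish green fruit is an apple, roundish brown fruit is a pear, and fruit that is not roundish is a banana.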
The problem gets harder if we decide to classify only apples and change the task to naming the type of apple: Red Delicious, McIntosh, Braeburn, all of which have roughly the same color and shape. In that case, we would need to find features other than color and shape that distinguish these types of apples. The problem becomes even harder if we want to build an image recognition system that can distinguish among tens, hundreds, or thousands of image categories, such as individual human faces. We need to extract these features automatically so we do not have to hand-code each feature for each training image to create a training table. Finally, we need to recognize each feature regardless of its orientation, scale, rotation, or the illumination level in the image.
These challenges were the focus of research on image recognition during the first decade of the twenty-first century. Instead of using simple characteristics like color and shape, researchers used features created by sophisticated mathematical algorithms.1 They would then create a training table with one row per training image and one column for each feature. They also added an output column with the name of the image class. Last, they fed the training table as input into a supervised learning classification algorithm so it could learn a function that could perform the image classification task.
Since 2010, researchers have engaged in an annual image classification competition named the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). They begin with an ILSVRC database that contains images in one thousand categories such as appliance, bird, and flower. The competition goal is to train a classification algorithm on 1.2 million observations in the database to identify the primary image category and then test the resulting learned function on 150,000 test images and achieve the lowest error rate.2
In the first two competitions, the winners used different types of mathematical feature extraction, fed the training table of features plus the column with the image category name into a supervised learning algorithm, and achieved around a 25 percent top-five error rate in both 2010 and 2011. The top-five error rate counts a prediction as correct if the true category appears among the system’s five highest-ranked guesses. For example, if a system is shown a test image of a motor scooter, it might rank these five classes highest: motor scooter, go-kart, moped, bumper car, and golf cart. In this case, the system’s first choice was correct, but it would have been counted as correct even if motor scooter had appeared anywhere among the five. When systems were allowed only one prediction per image, the error rate in 2010 was 47 percent.
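To pin down what “top-five error rate” means, here is a small illustrative function (my own sketch with made-up data, not the competition’s actual scoring code): an image counts as correct if its true category appears among the system’s five highest-ranked guesses.

```python
# Illustrative sketch of the top-five error rate: a prediction is
# counted as correct if the true label appears among the system's
# five highest-ranked guesses. The example data are made up.
def top5_error_rate(true_labels, ranked_guesses):
    wrong = sum(1 for truth, guesses in zip(true_labels, ranked_guesses)
                if truth not in guesses[:5])
    return wrong / len(true_labels)

true_labels = ["motor scooter", "leopard"]
ranked_guesses = [
    ["motor scooter", "go-kart", "moped", "bumper car", "golf cart"],
    ["jaguar", "cheetah", "snow leopard", "lynx", "Egyptian cat"],
]
print(top5_error_rate(true_labels, ranked_guesses))  # 0.5
```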
The 2012 competition woke up the world to the power of neural networks, even though they had been around for thirty years. A team of University of Toronto researchers3 created the AlexNet system, which won the 2012 challenge with a 15 percent top-five error rate. The next-best system managed only a 26 percent error rate.4
Figure 11.4 Increasing ILSVRC performance for image classification systems.
The top-five error rate for the winning system decreased to 11 percent in 2013. In 2015, the winning system beat human performance for the first time, with an error rate under 5 percent. By 2017, the winning system’s error rate was down to 2.3 percent.
HOW NEURAL NETWORKS WORK
To understand how neural networks5 work, let’s start with a much simpler image classification problem. Suppose we want to build a system that can recognize handwritten numbers like in figure 11.5.
Figure 11.5 Examples of handwritten numbers from the MNIST database. Josef Teppan licensed under CC BY-SA 4.0.
These are images from an extensive database published by NIST in 1995 to support the development of handwriting recognition technology. Each image is 28 × 28 pixels. Each training image will be a row in our training table. Instead of trying to extract features from the images, the training table will contain the raw pixels, and it will be up to the neural network to figure out the features. There will be one column for each of the 28 × 28 = 784 pixels plus a column for the output number. Figure 11.6 depicts a neural network for this problem.
Figure 11.6 A neural network for classifying handwritten numbers.
In figure 11.6, there is an input layer that has one neuron for each of the 784 pixels in the image of the handwritten number. Each of the 784 input-layer neurons will have a value of one if its pixel is black in the image and zero otherwise. The output layer has ten neurons, one for each possible output value (zero through nine). In the middle are two hidden layers (only the input and output layer values come directly from the training table).
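As a rough sketch of what this training table looks like in practice, the snippet below (my own illustration, using random values in place of a real MNIST digit) flattens a 28 × 28 image into 784 pixel columns and appends the label column.

```python
# Illustrative sketch: turning a 28 x 28 image into one row of a
# training table with 784 pixel columns plus the digit label.
# The image here is random noise standing in for a real MNIST digit.
import numpy as np

image = np.random.randint(0, 2, size=(28, 28))  # 1 = black pixel, 0 = white
label = 7                                       # the digit this image shows

row = np.append(image.flatten(), label)         # 784 pixel values + 1 label
print(row.shape)                                # (785,)
```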
In addition to neurons, neural networks have weights that are variables like those in the temperature and housing functions. There is one weight for each connection between two neurons. In neural networks, the weights represent the strength of the connection between two neurons.
We can write a function for every neural network that operates the same way as the function we discussed for computing Fahrenheit temperatures from Celsius temperatures. The difference is in the number of variables (weights) in the function. In the temperature-conversion example, there was only one variable. OpenAI recently came out with a network that has 175 billion variables.6
We will skip the technical details of how neural networks learn the optimal values of their weights.7 The result is a function that can take an image as input and predict a category as output. In this example, the predicted category is one of the ten numbers.
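For readers who want to see what a network like the one in figure 11.6 might look like in code, here is a minimal sketch written with the PyTorch library. The hidden-layer sizes are arbitrary assumptions, and the training step that learns the weights is deliberately omitted, just as it is in the text.

```python
# A minimal sketch of a network like the one in figure 11.6, written
# with PyTorch. The hidden-layer sizes (128 and 64) are arbitrary
# assumptions, and the training loop that learns the weights is omitted.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(784, 128),  # input layer -> first hidden layer
    nn.ReLU(),
    nn.Linear(128, 64),   # first hidden layer -> second hidden layer
    nn.ReLU(),
    nn.Linear(64, 10),    # second hidden layer -> ten output neurons
)

pixels = torch.rand(1, 784)            # one flattened 28 x 28 image
scores = model(pixels)                 # one score per digit, 0 through 9
print(scores.argmax(dim=1))            # the predicted digit
```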
AlexNet and all the post-2012 winning ILSVRC systems used convolutional neural networks (ConvNets).8 ConvNets have had tremendous success in image processing tasks, such as handwriting recognition, image classification, and facial recognition.9 Yann LeCun and his colleagues at Bell Labs used a ConvNet to create the first commercial system that could read the handwritten letters and numbers written on checks. And ConvNets are the primary technology behind facial recognition.10
Deep learning refers to neural network architectures with more than one hidden layer. The network shown in figure 11.6 is a deep learning network because it has two hidden layers. The use of multiple hidden layers usually improves performance. For example, AlexNet had 8 hidden layers, the GoogleNet system that won the 2014 ILSVRC competition had 22 layers,11 and the 2015 winner from Microsoft had 152 layers.12 Deep learning can be used in supervised, unsupervised, and reinforcement learning.
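As a companion to the fully connected sketch above, here is a toy convolutional network in the same style. The layer sizes are again arbitrary assumptions and are nowhere near the scale of AlexNet or the later ILSVRC winners.

```python
# A toy convolutional network (ConvNet) sketch in PyTorch. The layer
# sizes are arbitrary assumptions; real ILSVRC winners such as AlexNet
# are far larger and are trained on millions of images.
import torch
import torch.nn as nn

convnet = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1),   # learn 8 small image filters
    nn.ReLU(),
    nn.MaxPool2d(2),                             # shrink 28x28 -> 14x14
    nn.Conv2d(8, 16, kernel_size=3, padding=1),  # learn 16 higher-level filters
    nn.ReLU(),
    nn.MaxPool2d(2),                             # shrink 14x14 -> 7x7
    nn.Flatten(),
    nn.Linear(16 * 7 * 7, 10),                   # map features to ten digits
)

image = torch.rand(1, 1, 28, 28)                 # one grayscale image
print(convnet(image).shape)                      # torch.Size([1, 10])
```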
MACHINE TRANSLATION
Until 1799, archaeologists frequently discovered Egyptian hieroglyphs, but no one knew how to interpret them. That year, a Napoleon-led expedition to Egypt discovered the Rosetta Stone, which had a decree by King Ptolemy V in Egyptian hieroglyphs and an ancient Greek translation of the decree. It took more than twenty years, but scholars finally figured out how to interpret the hieroglyphs.
Just as deciphering the Rosetta Stone required access to the same text in two different languages, machine translation systems require translated documents for each language pair. Automated techniques analyze the parallel texts and derive statistics and rules for machine translation. One of the most heavily used parallel texts was extracted from the proceedings of the European Parliament. It contains parallel texts in twenty-one European languages. Google Translate started with this set of texts and added many other parallel texts, including records of international tribunals, company reports, and articles and books in bilingual form that have been put up on the web by individuals, libraries, booksellers, authors, and academic departments.13
If you have a smartphone or web browser, there is a good chance that you have used Google Translate to translate webpages and other documents. Google Translate launched in 2006,14 and until late 2016 its translations were usually only good enough to convey the gist of a webpage; it was often hard to fully understand the translated text. The technique used by Google Translate before late 2016 was phrase-based machine translation, which relies on a massive phrasal dictionary. To build this dictionary, Google engineers wrote programs that scoured parallel texts for translations of each phrase that occurred in each of the texts. The result was a massive electronic dictionary that could be used to look up a phrase and find a corresponding translation.15
Then, when asked to translate text, the system would look up each phrase in the electronic dictionary and replace it with the translated phrase.16 This phrasal approach enabled the system to handle some differences in word order17 between languages. More importantly, it helped to sort out word ambiguity. For example, the word break has multiple meanings when it is looked up in a translation dictionary by itself, whereas in the phrases give me a break and break a window it has a single (and distinct) meaning in each. According to Google, in 2016 there were 500 million Google Translate users, and Google Translate translated 100 billion words per day.18
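A drastically simplified sketch of the phrase-lookup idea appears below. The three-entry dictionary is made up, and real phrase-based systems add statistical scoring and reordering logic that this toy version ignores.

```python
# A drastically simplified sketch of phrase-based translation: look up
# known phrases in a dictionary and substitute their translations.
# The three-entry phrase table is made up; real phrase tables hold
# millions of entries plus scores and reordering rules.
phrase_table = {
    "where is": "où est",
    "the bus station": "l'arrêt de bus",
    "break a window": "casser une fenêtre",
}

def translate(text):
    # Replace every known phrase with its translation from the dictionary.
    for phrase, translation in phrase_table.items():
        text = text.replace(phrase, translation)
    return text

print(translate("where is the bus station"))
# où est l'arrêt de bus
```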
In the middle of the night, on November 15, 2016, Google switched out the underlying paradigm from phrase-based machine translation to a neural network–based paradigm.19 That morning, Google Translate users who were accustomed to marginal translations woke up to a whole new experience; the change reduced translation errors by up to 85 percent on several major language pairs, bringing accuracy up to near-human quality.
Reporting on the new system, The New York Times wrote20 that a Japanese professor, Jun Rekimoto, had translated a few sentences from Hemingway’s “The Snows of Kilimanjaro” from English to Japanese and back to English with Google Translate. Using the old phrase-based system, the first line of the translation had read as follows:
“Kilimanjaro is 19,710 feet of the mountain covered with snow, and it is said that the highest mountain in Africa.”
As with most internet text translated by Google Translate up to that point, you could make a fairly accurate guess at the meaning, but the translation was nowhere near human quality. The day after the switch to the neural translation model, the first line read like this:
“Kilimanjaro is a mountain of 19,710 feet covered with snow and is said to be the highest mountain in Africa.”
On that night in 2016, Google Translate started using a deep learning architecture called neural machine translation (NMT). Google Translate was trained separately for each language pair and in each direction. Researchers trained one system on English to French, another on French to English, another on French to Japanese, and so on. When you type a sentence in a source language into Google Translate and request a translation in a target language, Google Translate passes the source text to the appropriate system.21
The training input to NMT for each language pair and direction is a training table with one row per training sentence pair (i.e., a sentence in one language plus its translation in the other language). Each row has a column for each word in the source language followed by a column for each word in the target language. For example, the source language sentence Where is the bus station? combined with the correct translation “Où est l’arrêt de bus?” would constitute one row of the training table. The other rows would hold different sentences. Google does not disclose the size of its training tables, but they likely have anywhere from tens of millions to billions of rows for each language pair.
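As a tiny illustration of what one row of such a training table might look like, consider the sketch below (my own example; production systems use subword tokenization and far more elaborate preprocessing).

```python
# Illustrative sketch: one row of an NMT training table, with a column
# per source-language word followed by a column per target-language word.
# Real systems use subword tokenization and much more preprocessing.
source = "Where is the bus station ?".split()
target = "Où est l'arrêt de bus ?".split()

row = source + target
print(row)
# ['Where', 'is', 'the', 'bus', 'station', '?',
#  'Où', 'est', "l'arrêt", 'de', 'bus', '?']
```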
The job of the NMT algorithm is to take as input a training table for a language pair and direction and learn to translate the source words into the target words. You can see the architecture of the NMT system in figure 11.7.
Figure 11.7 Encoder-decoder with attention architecture for machine translation.
The NMT system is composed of three different deep neural networks: the encoder, the decoder, and the attention system.22 You can think of it like this: The encoder passes along the gist, the attention system helps put it together word by word, and the decoder computes the translation in the target language. Together, these three deep neural networks learn a function that can translate from one language to another.23
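To give a flavor of what the attention system computes, here is a stripped-down sketch of dot-product attention. It is my own simplification: the real system learns its attention weights inside a much larger network.

```python
# A stripped-down sketch of dot-product attention: score each encoder
# state against the current decoder state, turn the scores into weights
# with a softmax, and return the weighted sum as a "context" vector.
# Real NMT attention is learned and considerably more elaborate.
import numpy as np

def attention(decoder_state, encoder_states):
    scores = encoder_states @ decoder_state        # one score per source word
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                       # softmax -> attention weights
    return weights @ encoder_states                # context vector

encoder_states = np.random.rand(6, 4)   # 6 source words, 4-dimensional states
decoder_state = np.random.rand(4)       # state while producing one target word
print(attention(decoder_state, encoder_states).shape)   # (4,)
```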
NMT is an ingenious architecture, but it is nonetheless narrow AI. An NMT program takes as input a training table with one row per sentence pair. Each row contains both the words of the source language (e.g., English) sentence and the words of the target language (e.g., French) translation. Each column holds one word. When the NMT algorithm reads this training table, it learns a function that translates the input columns into the output columns. Most importantly, the function can also translate sentences that were not in the training table.
SPEECH RECOGNITION
Much of the world uses speech recognition daily. When we talk to Siri on iPhones or Google Assistant on Android devices, our speech is converted to text almost as accurately as if we had typed the text ourselves. Microsoft and Baidu recently published papers on speech recognition systems that transcribe better than humans in controlled environments with unaccented, clear speech. But so far, no system is as good as a human for noisy environments, speakers with accents, people who do not articulate clearly, and lower-quality microphones.
The first attempt at speech recognition was in 1952 by a group of researchers at the Bell Labs facility in Murray Hill, New Jersey.24 Computers had been invented but were not yet in widespread use. At the time, it was probably as easy to develop custom circuitry as it was to program a computer (never mind getting access to one of the few available). This group of researchers developed an ingenious custom circuit that ended with ten gas tubes. A single male speaker would utter a random string of digits, pausing at least one-third of a second between digits, and the correct gas tube would light up for each digit with 97 percent accuracy. The discrimination algorithm built into the circuit identified vowel patterns and mapped them to numbers.
Figure 11.8 Speech translation. Speaker icon: ID 142615061 © Pavel Stasevich | Dreamstime.com.
Today’s speech recognition systems work the same basic way: they take an audio signal as input and produce a string of characters as output. In 1952, this approach was feasible only because the vocabulary was limited to ten digits. If the system had tried to recognize all 170,000-plus English words in this fashion, it would not have worked, because vowels alone do not provide enough information to discriminate among a large set of words; the algorithm would need to consider consonants as well.
Unfortunately, even basing a speech recognition algorithm on consonants as well as vowels is problematic. Although there are only twenty-six letters in the English alphabet, individual letters do not always have the same sounds. For example, in the phrase speech recognition, there are three instances of the letter “e,” and they represent different sounds. The first two act as a pair, indicating a vowel that resonates in the upper front part of the mouth, like in feet. The third “e” is a mid-central vowel, like in bet. Worse, these letters will sound different when pronounced by various speakers. Regional pronunciation patterns and the many dialects of English exacerbate this issue. I pronounce tomato and potato so that the second syllable rhymes with pay. As a child, I was surprised to learn that in other parts of the world (and in a famous song), people pronounce these words so that the syllable rhymes with paw.
Creating algorithms based on whole words is also problematic. The Oxford English Dictionary contains over 170,000 English words, and that does not include proper names. To create a training table based on words would require numerous spoken examples of each of those 170,000 words because of the varied pronunciations of different speakers. The audio waveform for a word also changes based on its context: The words spoken before and after it can affect its pronunciation (for example, the is pronounced differently in the fact and the act). Also, homophones (like there, they’re, and their) are indistinguishable by sound alone. A supervised learning algorithm trained on whole words would need 170,000 output categories, and the higher the number of output categories, the harder it is to train a supervised learning system.
Instead, most of today’s speech recognition systems use subword units—for example, phonemes—that have fewer possibilities. Phonemes are the distinctive sounds made by speakers; for example, those three instances of the letter “e” in speech recognition are pronounced as two different phonemes. There are between thirteen and twenty-one vowel phonemes and between twenty-two and twenty-six consonant phonemes in the English language, depending on the dialect. Standard American English has forty-four phonemes. Depending on which expert you ask, we could be looking at several hundred or maybe thousands of phonemes across all human languages.
The first stage of speech recognition is to break the acoustic signal down into small windows, or frames, of about twenty to twenty-five milliseconds. Because the speech signal for a phoneme typically stays relatively constant for only about ten to twenty milliseconds, a single fixed, non-overlapping set of windows is unlikely to line up neatly with every phoneme. To make sure that some windows capture each phoneme, researchers use overlapping windows; for example, they will start a new window every ten milliseconds.
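A small sketch of this windowing step is shown below (my own illustration, with random samples standing in for real audio): twenty-five-millisecond windows that start every ten milliseconds.

```python
# Illustrative sketch of overlapping speech windows: 25-millisecond
# frames that start every 10 milliseconds. The audio here is random
# noise standing in for a real recording.
import numpy as np

sample_rate = 16000                       # samples per second
audio = np.random.randn(sample_rate * 2)  # two seconds of fake audio

frame_len = int(0.025 * sample_rate)      # 25 ms -> 400 samples
hop_len = int(0.010 * sample_rate)        # 10 ms -> 160 samples

frames = [audio[start:start + frame_len]
          for start in range(0, len(audio) - frame_len + 1, hop_len)]
print(len(frames), len(frames[0]))        # number of windows, samples per window
```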
The speech recognition task starts with a training table that has one row per spoken sentence. The columns are the windows (i.e., the audio signal in each window) followed by the words in the spoken sentence. The task is to learn a function that will translate the windows into words.
Before around 2010, speech recognition researchers had spent years developing methods to extract features from these windows that they could feed into supervised learning algorithms to identify phonemes. They then used pronunciation dictionaries, which contain each word in the language and the sequence of phonemes used to pronounce it. And they used language models25 that specify how likely any given sequence of words is to be a legal construct in the target language.26
Speech recognition started to improve dramatically in 2009, when researchers at the University of Toronto started experimenting with deep learning networks for speech recognition.27 In 2012, one of the authors of the 2009 paper helped Microsoft create a deep neural network that produced a 16 percent increase in performance compared with the technologies that researchers had been using for the previous five decades.28 One of the more popular speech recognition architectures is illustrated in figure 11.9.
Figure 11.9 Listener–speller with attention architecture for speech recognition.
This architecture has similarities to the NMT system illustrated in figure 11.7. Like NMT, it has three deep neural networks: the listener, the speller, and the attention mechanism.29 The listener functions like the encoder, and the speller functions like the decoder in the NMT architecture.
Thanks to deep learning techniques, and after decades of laborious effort using conventionally coded features as input to supervised learning algorithms, speech recognition technology has finally reached its potential. Society is now reaping a wide range of benefits, from hands-free phone dialing while driving to voice interactions with digital personal assistants like Siri and Alexa.
DEEPFAKES
Deepfakes are often created with an unsupervised learning technique known as an autoencoder. An autoencoder (depicted in figure 11.10) is a type of neural network that learns an internal representation from a large set of images of a person or object. Then it can take as input a new image of that person or object and reproduce the image from the internal representation.
Figure 11.10 An autoencoder that learns to reproduce images.
Why do this? We do not need AI to be able to reproduce an image. We can do that with copiers. The goal of an autoencoder is to create a compact internal representation of the input image that has fewer dimensions than the input. More specifically, there will be many fewer neurons (variables) in the encoder output layer than in the encoder input layer. This output layer contains a compact internal representation of the input image. If the decoder can take this compact internal representation as input and still reproduce the images, then it has not merely memorized the values of the pixels in the input images. The compact internal representation must capture the essential features of the input image to reproduce the image. In other words, the learned weights and neurons of the encoder output layer must capture details about what makes up a face—the angle of the face, the facial expression, and other features.
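A minimal sketch of this bottleneck idea is shown below. The layer sizes are arbitrary assumptions, and real deepfake autoencoders are convolutional and far larger, but the shape of the idea is the same: a narrow encoder output forces the network to learn a compact representation.

```python
# A minimal autoencoder sketch in PyTorch: the encoder squeezes a
# 64 x 64 image (4,096 pixels) down to a 64-number internal
# representation, and the decoder tries to rebuild the image from it.
# The layer sizes are arbitrary assumptions; real deepfake autoencoders
# are convolutional and far larger.
import torch
import torch.nn as nn

encoder = nn.Sequential(
    nn.Linear(64 * 64, 512), nn.ReLU(),
    nn.Linear(512, 64),                     # compact internal representation
)
decoder = nn.Sequential(
    nn.Linear(64, 512), nn.ReLU(),
    nn.Linear(512, 64 * 64), nn.Sigmoid(),  # rebuilt pixel values in [0, 1]
)

image = torch.rand(1, 64 * 64)              # one flattened face image
reconstruction = decoder(encoder(image))
print(reconstruction.shape)                 # torch.Size([1, 4096])
```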
In the example diagrammed in figure 11.10, the training table included many different images of this person, including photos of her smiling, speaking, in different poses, and in different lighting. The encoder learned to produce a compact internal representation that was sufficient for the decoder to reproduce what she looked like in the input image, her facial expression, her pose, and the lighting.
In figure 11.11, we have used an autoencoder to reconstruct images of two people.30
Figure 11.11 Training the deepfake network.
The encoder learns a compact internal representation that captures the key features of both individuals. However, the system learns a separate decoder for each person. Each decoder then learns to take the shared internal representation of a facial expression and pose and reconstruct the corresponding person’s image.
Once the network completes its learning phase, producing a deepfake is easy: Just switch the decoder, as illustrated in figure 11.12. When a new image of the woman is input to the network, the output will be an image of the man, but with the facial expression and pose that were in the woman’s input image.
Figure 11.12 Using the trained system to produce deepfakes.
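Here is a sketch of that decoder swap (an illustration only, not the actual FaceSwap code): one shared encoder, a separate decoder per person, and a deepfake frame produced by pairing person A’s encoding with person B’s decoder.

```python
# Sketch of the decoder swap that produces a deepfake: one shared
# encoder, a separate decoder per person. Encoding a frame of person A
# and decoding it with person B's decoder yields person B wearing A's
# expression and pose. Layer sizes are arbitrary; this is not the
# actual FaceSwap code.
import torch
import torch.nn as nn

def make_decoder():
    return nn.Sequential(nn.Linear(64, 512), nn.ReLU(),
                         nn.Linear(512, 64 * 64), nn.Sigmoid())

encoder = nn.Sequential(nn.Linear(64 * 64, 512), nn.ReLU(),
                        nn.Linear(512, 64))       # shared by both people
decoder_a = make_decoder()                        # trained on person A's images
decoder_b = make_decoder()                        # trained on person B's images

frame_of_a = torch.rand(1, 64 * 64)               # a new frame of person A
deepfake_frame = decoder_b(encoder(frame_of_a))   # person B, with A's pose
print(deepfake_frame.shape)                       # torch.Size([1, 4096])
```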
The next step is to make this substitution for each frame in an entire video. The result will be a video with frames of the man wearing the same facial expression as the woman at each point in the video. If the woman is talking in the video, the fake video of the man will appear to be saying the same words.31
You can see a video of Jennifer Lawrence answering questions at the Golden Globe Awards with Steve Buscemi’s face.32 The creators of this video used the FaceSwap tool.33
These deepfake videos still had the original voices (Steve Buscemi’s face still sounded like Jennifer Lawrence) because only the video, not the audio, was faked. However, researchers are developing technologies to change the words of a speaker as well. For example, the Lyrebird division of Descript, a Montreal company, offers a very impressive public-facing demo.34 You can record your voice on the website, then type in the words you want to say and hear them played back in your own voice. In conjunction with video deepfake technology, you will soon be able to create a video of someone saying whatever you want them to say.
DEEP LEARNING SYSTEMS CAN BE UNRELIABLE
Image recognition, machine translation, and speech recognition systems represent tremendous victories for deep learning. Unfortunately, these same deep learning systems make surprising mistakes.
If I train a system to distinguish cats from dogs, and in the training table all the pictures of dogs are outdoors and all the pictures of cats are inside homes, the deep learning system will likely key in on yard and home features instead of features of the animals themselves. Then, if I show it a picture of a dog inside a home, the system will probably label it a cat.
Figure 11.13 Adding a guitar causes the narrow AI system to misclassify the monkey as human. Reprinted with permission of International Press of Boston, Inc.
Similarly, in figure 11.13, before researchers pasted the guitar onto the picture, both object recognition systems and human subjects correctly labeled the monkey in the photo. Adding the guitar did not confuse people, but it made the object recognition system think it was observing a picture of a person instead of a monkey.35 The object classification system did not learn the visual characteristics that people use to recognize monkeys and people. Instead, it learned that guitars are only present in pictures of people.
Other types of mistakes are even more concerning. A group of Japanese researchers found that by modifying just a single pixel in an image, they could alter an object recognition system’s category choice. In one instance, by changing a single pixel in a picture of a deer, they fooled the object recognition system into identifying the image as a car.36 Researchers have also figured out how to fool deep learning systems into confidently recognizing objects such as cheetahs and peacocks in images that contain no objects at all.37
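As a rough sketch of the single-pixel idea, the code below tries every pixel by brute force and checks whether the classifier’s answer flips. The actual researchers used a smarter search method (differential evolution), and the stand-in classifier here is just a toy.

```python
# Rough sketch of a one-pixel attack: try changing each pixel in turn
# and check whether the classifier's answer flips. The real research
# used differential evolution; the "classifier" below is a toy stand-in
# for a trained deep learning system.
import numpy as np

def one_pixel_attack(image, classify, new_value=255):
    original_label = classify(image)
    for row in range(image.shape[0]):
        for col in range(image.shape[1]):
            candidate = image.copy()
            candidate[row, col] = new_value      # change a single pixel
            if classify(candidate) != original_label:
                return (row, col)                # found a fooling pixel
    return None

# Toy demonstration: the stand-in "classifier" just thresholds total
# brightness, so flipping one pixel is enough to change its answer.
image = np.zeros((28, 28), dtype=np.uint8)
classify = lambda img: "deer" if img.sum() < 200 else "car"
print(one_pixel_attack(image, classify))         # (0, 0)
```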
Another scary example: Researchers have developed deep learning systems that very accurately diagnose the presence or absence of pneumonia. However, they have also discovered that spurious noise in the medical images can cause these deep learning systems to produce an incorrect diagnosis.38
The same is true for audio streams. Researchers created mathematical perturbations of audio streams that have no impact on people’s ability to identify words. Yet these perturbations caused deep learning systems that previously identified the words with high accuracy to fail to recognize any words at all.39
The reliability of deep learning systems matters because many critical applications, including self-driving cars and autonomous weapons, depend on deep learning technology.
RAW DATA
Facial recognition, machine translation, speech recognition, and many other applications can be built using supervised deep learning. For each of these applications, deep learning technology has eliminated the need to manually extract features that are placed in the input columns. Instead, the input columns contain raw data. Pixels are used for facial recognition, words (and parts of words) are used for machine translation, and waveforms are used for speech recognition. The deep learning system determines what features are important and incorporates those features into a function that can transform the input values into the output values.
It is possible to build amazing applications using deep learning. However, each of these applications is a narrow AI application: The learned function can only predict the outputs from the inputs for that specific application. Deep learning is also unreliable. For critical applications like self-driving vehicles that rely on functions created with deep learning, this unreliability could result in traffic jams and injuries.