In the 1990s, neural network research was in a rut again. After the discovery of backpropagation, connectionists had reaped a few high-profile successes, including a system to read handwritten numbers for the U.S. Postal Service.[29] And yet, the AI community at large still scoffed at neural networks.
The general consensus was that yeah, regular neural networks could solve simple problems—but that was about as far as they could go. To tackle more interesting problems, you needed deep neural networks, and those were a pain in the neck. They were slow to train, prone to overfitting, and riddled with frustrating problems such as the vanishing gradient. Connectionists had cracked backpropagation, but they still couldn’t win over their peers.
Then a few things happened that changed everything.
The progress of machine learning from the 1990s to the early 2010s is the stuff of lore. In spite of the general skepticism around neural networks, a small band of researchers kept advancing the field. Today, some of their names are famous: Geoffrey Hinton, Yann LeCun, Yoshua Bengio, and many others.
One problem after another, those stubborn pioneers tackled the most pressing issues of deep networks. They discovered novel weight initialization methods and activation functions like ReLUs to counter vanishing and exploding gradients. They sped up training with better optimization algorithms and, once again, better activation functions. They invented new regularization techniques such as dropout to keep overfitting in check.
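To get a concrete feel for one of those fixes, take ReLUs and the vanishing gradient. The sketch below is mine, not code from this book, and it ignores the weights for simplicity: it only shows how much a single sigmoid or ReLU unit scales the backpropagated gradient across a hypothetical 20-layer stack.

    import numpy as np

    # A sigmoid's derivative is at most 0.25, so backpropagating through
    # many sigmoid layers multiplies the gradient by a small factor over
    # and over. A ReLU's derivative is exactly 1 for positive inputs,
    # so the gradient passes through unchanged.

    def sigmoid_derivative(z):
        s = 1 / (1 + np.exp(-z))
        return s * (1 - s)

    def relu_derivative(z):
        return np.where(z > 0, 1.0, 0.0)

    z = 0.5          # an arbitrary positive pre-activation value
    layers = 20      # a hypothetical 20-layer network

    print(sigmoid_derivative(z) ** layers)   # about 3e-13: the gradient all but vanishes
    print(relu_derivative(z) ** layers)      # 1.0: the gradient survives intact

It's a toy calculation, but it captures the idea: stacking sigmoids starved the lower layers of gradient, while ReLUs (together with careful initialization) let the training signal reach all the way down.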
At the same time, those trailblazers invented or perfected radically different ways to build neural networks. Fully connected networks struggled with complex data like images, text, and speech. Early deep network practitioners developed architectures that were custom-fit to those data, such as convolutional neural networks (which we looked at in the previous chapter) and recurrent neural networks (which I’ll mention a few pages from now). Those new architectures brought quick progress to corners of AI that had barely improved in decades.
Besides their formidable skills and resolve, the pioneers of deep networks also had their fair share of luck. During those years, the world of computing underwent two sea changes that buoyed their efforts: the spread of graphics processing units (GPUs), and the explosive rise of the Internet.
GPUs are the massively parallel graphics processors popularized by 3D computer games. In the 2000s, as progress on CPUs slowed down, GPUs kept getting faster and faster—and they happened to fit neural networks like a glove. Like neural network training, 3D graphics rendering is mostly about churning through matrix operations as fast as possible. Deep networks took forever to train on a CPU, but they could be trained dozens of times faster on a GPU.
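To see why the fit is so tight, here is a minimal NumPy sketch with made-up layer sizes (again, not code from this chapter): the forward pass of a fully connected layer over a whole batch boils down to one big matrix multiplication plus an elementwise activation, exactly the kind of workload a GPU spreads across thousands of cores.

    import numpy as np

    # A toy forward pass for one fully connected layer.
    # Hypothetical sizes: a batch of 256 examples with 784 inputs each,
    # feeding a layer of 100 neurons.
    batch = np.random.randn(256, 784)
    weights = np.random.randn(784, 100)
    biases = np.random.randn(100)

    # One matrix multiplication, one broadcasted addition, one elementwise
    # ReLU. On a GPU, these operations run massively in parallel.
    activations = np.maximum(0, batch @ weights + biases)
    print(activations.shape)    # (256, 100)

The math is identical on a CPU or a GPU; the GPU just churns through it in parallel, which is where those large speedups came from.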
The second boost to the connectionists’ efforts came from the Internet, which helped solve a critical problem with early deep networks: even with all those clever regularization techniques, they kept overfitting their small datasets. In Collecting More Data, you learned that the first defense against overfitting is collecting more data. When Internet companies burst onto the scene, they flooded researchers with previously unthinkable amounts of data—and the economic resources to process them.
In 2012, all those factors came together to form a perfect storm.
In 2012, three deep network researchers (Geoffrey Hinton, Alex Krizhevsky, and Ilya Sutskever) entered the high-profile ImageNet computer vision competition.[30] Up to then, ImageNet had been dominated by traditional techniques, with accuracies that topped out well below 75% on the benchmark dataset. The new challenger, a convolutional deep network called AlexNet, wiped the floor with the competition, scoring an astounding 84.7%. Nobody had ever seen anything like it.
It was a watershed moment. The name “deep learning,” which replaced the discredited name “neural networks,” suddenly became a buzzword. As research funding exploded, companies like Google and Facebook snapped up the pioneers of the field, putting them at the helm of their nascent AI departments. Deep networks got deeper, reaching hundreds of layers—numbers that would have seemed ludicrous a few years before. Their evolution became hard to keep up with as they pulverized long-standing records across all fields of AI.
Soon enough, neural networks had become the most popular idea in machine learning. In Chapter 1, How Machine Learning Works, you learned that ML has three main branches: supervised learning, which is the subject of this book; reinforcement learning, which is all about learning by trial and error; and unsupervised learning, which is a collection of algorithms to make sense of unlabeled data. Deep neural networks seeped into all three, often displacing long-established techniques.
Since the 1960s, connectionist ideas had been mostly cast out of the field of AI. In hindsight, AI had been impoverished by that exile. For decades, the brightest minds of artificial intelligence had earned a reputation for overpromising and underdelivering. Now, for the first time since Minsky’s book on perceptrons, connectionist ideas were popular again—and AI was leaping forward. Neural networks quickly displaced ineffective symbolist AI in crucial fields like natural language processing, image recognition, and medical diagnosis. Voice-controlled digital assistants, automatic photo captioning, and sophisticated recommendation systems became the new normal. Previously unthinkable technologies, such as self-driving cars, became a matter of public discussion. After all those years, connectionism was back with a vengeance.
The rest of the story is still being written. In 2020, as I finish this book, the progress of deep learning shows no sign of slowing down. It’s been hailed as a revolution in AI, a strangely enthusiastic label coming from the traditionally staid academic community. For better or worse, deep learning is changing the world.
I mentioned a few factors behind this upheaval:
Backpropagation, which allowed people to train neural networks in the first place.
Dozens of novel techniques such as ReLUs, dropout, and Xavier initialization, which paved the way to deeper neural networks.
New powerful architectures such as convolutional and recurrent neural networks, which boosted accuracy on complex datasets.
More processing power, especially thanks to GPUs, which made deep networks not just possible in theory, but feasible in practice.
Large datasets, which allowed deep networks to spend their power generalizing from the training data instead of overfitting it.
All those elements were important. However, they would have amounted to little if not for another, less concrete factor: the tenacity of the trailblazers who kept advancing machine learning through decades of pushback. They swam against the tide all those years, cracking problems that most of their colleagues deemed unsolvable. If we have deep learning today, it’s thanks to them.
And that’s how deep neural networks took over AI. But we still haven’t investigated the secret sauce that makes them so good at what they do.