Chapter 18
Taming Deep Networks

In a sense, there’s nothing special about deep networks. They’re like shallow neural networks, only with more layers. When people started experimenting with deep networks, however, they faced an uncomfortable truth: building deep networks may be easy, but training them is not.

Backpropagation on deep networks comes with its own specific challenges that carry intimidating names such as “vanishing gradients” and “dead neurons.” Those challenges rarely come up in shallow neural networks—but they’re par for the course in deep neural networks.
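
To see where “vanishing gradients” come from, remember that backpropagation multiplies local gradients together, one per layer. The sigmoid’s derivative is at most 0.25, so that product shrinks quickly as the layers pile up. Here is a minimal sketch of the effect (the pre-activation of zero and the twenty layers are made-up numbers, chosen only to show the trend):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_gradient(z):
    # The sigmoid's derivative: sigmoid(z) * (1 - sigmoid(z)).
    # Its largest possible value is 0.25, reached at z = 0.
    s = sigmoid(z)
    return s * (1 - s)

# Chain the local gradients across layers, as backpropagation does.
# Even in the sigmoid's best case (z = 0), the chained gradient collapses fast.
gradient = 1.0
for layer in range(1, 21):
    gradient *= sigmoid_gradient(0.0)
    print("After %2d layers, the chained gradient is about %g" % (layer, gradient))
```

After twenty layers of sigmoids, the gradient that reaches the early layers is on the order of one trillionth of the original signal, which is why those layers barely learn anything.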

Over the years, neural network researchers developed a collection of strategies, or what you might call a “bag of tricks,” to tackle those challenges and tame deep neural networks.

This chapter is a whirlwind tour through these techniques. We’ll spend most of our time discussing activation functions: why the sigmoid doesn’t pass muster in deep neural networks, and how to replace it. Then we’ll round off the chapter, and your bag of tricks, with a few other choice approaches from that collection of strategies.
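
To give “how to replace it” a concrete shape before we get there, here is a minimal sketch of the ReLU, one popular alternative to the sigmoid. Its gradient is exactly 1 for positive inputs, so chaining it across many layers doesn’t shrink the gradient the way the sigmoid does. Take ReLU here as an illustrative candidate; the chapter’s detailed discussion of activation functions is still ahead.

```python
import numpy as np

def relu(z):
    # The Rectified Linear Unit: 0 for negative inputs, the input itself otherwise
    return np.maximum(0, z)

def relu_gradient(z):
    # 0 for negative inputs, 1 for positive inputs, so it doesn't shrink
    # when multiplied across many layers
    return np.where(z > 0, 1.0, 0.0)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(z))           # [0.  0.  0.  0.5 2. ]
print(relu_gradient(z))  # [0. 0. 0. 1. 1.]
```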

Let’s start our tour with activation functions.