Chapter 18
Taming Deep Networks

In a sense, there’s nothing special about deep networks. They’re like shallow neural networks, only with more layers. When people started experimenting with deep networks, however, they faced an uncomfortable truth: building deep networks may be easy, but training them is not.

Backpropagation on deep networks comes with its own specific challenges that carry intimidating names such as “vanishing gradients” and “dead neurons.” Those challenges rarely come up in shallow neural networks—but they’re par for the course in deep neural networks.
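
To see where “vanishing gradients” come from, remember that backpropagation multiplies local gradients together, one per layer. The sigmoid’s derivative is at most 0.25, so that product shrinks quickly as the layers pile up. Here is a minimal sketch of the effect (the pre-activation of zero and the twenty layers are made-up numbers, chosen only to show the trend):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_gradient(z):
    # The sigmoid's derivative: sigmoid(z) * (1 - sigmoid(z)).
    # Its largest possible value is 0.25, reached at z = 0.
    s = sigmoid(z)
    return s * (1 - s)

# Chain the local gradients across layers, as backpropagation does.
# Even in the sigmoid's best case (z = 0), the chained gradient collapses fast.
gradient = 1.0
for layer in range(1, 21):
    gradient *= sigmoid_gradient(0.0)
    print("After %2d layers, the chained gradient is about %g" % (layer, gradient))
```

After twenty layers of sigmoids, the gradient that reaches the early layers is on the order of one trillionth of the original signal, which is why those layers barely learn anything.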

Over the years, neural network researchers developed a collection of strategies, or what you might call a “bag of tricks,” to tackle those challenges and tame deep neural networks.

This chapter is a whirlwind tour through these techniques. We’ll spend most of our time discussing activation functions: why the sigmoid doesn’t pass muster in deep neural networks, and how to replace it. Then we’ll round off the chapter, and your bag of tricks, with a few other choice approaches from that collection of strategies.
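
To give “how to replace it” a concrete shape before we get there, here is a minimal sketch of the ReLU, one popular alternative to the sigmoid. Its gradient is exactly 1 for positive inputs, so chaining it across many layers doesn’t shrink the gradient the way the sigmoid does. Take ReLU here as an illustrative candidate; the chapter’s detailed discussion of activation functions is still ahead.

```python
import numpy as np

def relu(z):
    # The Rectified Linear Unit: 0 for negative inputs, the input itself otherwise
    return np.maximum(0, z)

def relu_gradient(z):
    # 0 for negative inputs, 1 for positive inputs, so it doesn't shrink
    # when multiplied across many layers
    return np.where(z > 0, 1.0, 0.0)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(z))           # [0.  0.  0.  0.5 2. ]
print(relu_gradient(z))  # [0. 0. 0. 1. 1.]
```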

Let’s start our tour with activation functions.