Soon enough, we’re going to explain backpropagation from the ground up. But first, let’s see why people use backpropagation in the first place.
To begin with, here’s a piece of good news: in a sense, you already know how to train a neural network. You train it with gradient descent, like you train a perceptron. At each iteration, you calculate the gradient of the loss, then descend that gradient to minimize the loss. (If you need a refresher, review Chapter 3, Walking the Gradient.)
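If that loop feels hazy, here’s a minimal sketch of it in code. It’s only an illustration, not a copy of the earlier code: the gradient() below computes the derivative of a toy mean squared error loss on a linear model, standing in for whatever loss we happen to be minimizing, and the names X, Y, w, and lr are just placeholders.

    import numpy as np

    # The gradient of a toy mean squared error loss for a linear model,
    # used here only as a concrete stand-in.
    def gradient(X, Y, w):
        return 2 * np.matmul(X.T, (np.matmul(X, w) - Y)) / X.shape[0]

    # A bare-bones gradient descent loop.
    def train(X, Y, iterations, lr):
        w = np.zeros((X.shape[1], 1))
        for _ in range(iterations):
            w -= lr * gradient(X, Y, w)   # descend the gradient of the loss
        return w

The details of the loss don’t matter here. The point is that gradient() is the piece we have to supply ourselves, and that’s exactly the piece that gets hard for a neural network.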
Now for the less-than-good news: descending the gradient is the easy part of the job. The tough part is calculating that gradient in the first place.
In the case of the perceptron, we knew how to get the gradient: we calculated the derivatives of the loss with respect to the weights. (Well, we actually looked up those derivatives in a textbook. Still counts.) In the case of a neural network, however, coming up with the equivalent derivatives can be hard. For example, the code that computes the loss in our three-layered network looks something like this:
    h = sigmoid(matmul(X, w1))          # hidden layer
    y_hat = softmax(matmul(h, w2))      # predictions, one row per example
    L = cross_entropy_loss(Y, y_hat)    # the loss we want to minimize
Imagine writing the mathematical equivalent of the previous code, and then taking the derivatives of that formula with respect to w₁ and w₂. That would be some work, even for someone steeped in calculus.
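To give you a taste, here is roughly what that formula looks like once the three lines of code are collapsed into a single expression. (This assumes the usual conventions: σ is the sigmoid applied element by element, the softmax is applied row by row, Y is one-hot encoded, and the cross entropy is averaged over the m examples.)

    L(w_1, w_2) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{j} Y_{ij}\,
                  \log\!\Big(\mathrm{softmax}\big(\sigma(X w_1)\, w_2\big)_{ij}\Big)

Every weight in w₁ and w₂ sits somewhere inside that nested expression, and each one needs its own partial derivative.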
We could still sweat our way to the gradient of our neural network—but the typical modern network would be much more challenging. Real-world neural networks can be mind-bogglingly complicated, with dozens of intricately connected layers and weight matrices. It would be very hard to write down the loss for one of those large networks—let alone calculate its derivatives.
We’re facing a dilemma: on one hand, we want to calculate the gradients of the loss for any neural network; on the other, calculating those derivatives is infeasible for all but the simplest, least powerful networks.
That’s where backpropagation enters the picture. Backprop is the way out of that dilemma—an algorithm that calculates the gradients of the loss in any neural network. Once we know those gradients, we can descend them with good old GD.
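In code, the plan looks something like the sketch below. The back() function is a placeholder for the backpropagation step we’re about to build; all we assume is that it returns the gradients of the loss with respect to w₁ and w₂. The function and argument names are illustrative, not final.

    def train(X, Y, w1, w2, back, iterations, lr):
        # back() stands in for backpropagation: given the data and the current
        # weights, it returns the gradients of the loss with respect to w1 and w2.
        for _ in range(iterations):
            w1_gradient, w2_gradient = back(X, Y, w1, w2)
            w1 = w1 - lr * w1_gradient   # plain gradient descent on w1...
            w2 = w2 - lr * w2_gradient   # ...and on w2
        return (w1, w2)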
Let’s see how backprop works its magic.