Cross Entropy

So far, we used the log loss formula for our binary classifiers. We even used the log loss when we bundled ten binary classifiers in a multiclass classifier (in Chapter 7, The Final Challenge). In that case, we added together the losses of the ten classifiers to get a total loss.

While the log loss has served us well so far, it’s time to switch to a simpler formula, one that’s specific to multiclass classifiers. It’s called the cross-entropy loss, and it looks like this:

$$L = -\frac{1}{n} \sum_{i=1}^{n} \sum_{c} y_{i,c} \, \log(\hat{y}_{i,c})$$

Here, $n$ is the number of examples, the inner sum runs over the classes, $y$ are the one-hot encoded labels, and $\hat{y}$ are the classifier’s predictions.

Here’s the cross-entropy loss in code form:

 import numpy as np

 def loss(Y, y_hat):
     return -np.sum(Y * np.log(y_hat)) / Y.shape[0]
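
As a quick sanity check, here’s how loss behaves on a tiny hand-made batch. The numbers are made up for illustration, and the snippet reuses the loss function (and the NumPy import) from the listing above:

 # Two examples, three classes: Y holds one-hot labels, y_hat holds
 # softmax outputs (each row sums to 1).
 Y = np.array([[1, 0, 0],
               [0, 1, 0]])
 y_hat = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.8, 0.1]])
 print(loss(Y, y_hat))    # roughly 0.29, that is -(log(0.7) + log(0.8)) / 2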

If you’re curious, you can read how the cross-entropy loss works in “Grokking the Cross-Entropy Loss” on the ProgML[15] site. However, you don’t need to understand its inner workings, as long as you remember what it does: like other loss formulae, it measures the distance between the classifier’s predictions and the labels. The lower the loss, the better the classifier.

Besides its cool name, there is a pragmatic reason to use the cross-entropy loss in our neural network: it’s a perfect match for the softmax. More specifically, a softmax followed by a cross-entropy loss makes it easier to code gradient descent. But I’m getting ahead of myself here—that’s a topic for the next chapter. For now, just know that the softmax and the cross-entropy loss play nicely together, and they’ll cap off our neural networks for the rest of this book.
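
Just to give a taste of that simplification, here is the well-known result for a single example with a one-hot label: differentiating the cross-entropy loss through the softmax gives a gradient, with respect to the softmax’s inputs, of

$$\frac{\partial L}{\partial z_c} = \hat{y}_c - y_c$$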


Finally, let me clear up one potential source of confusion. If you look at the code, you might wonder why we bother with the loss in the first place. After all, we never seem to use the loss function, apart from printing its value on the screen.

Indeed, we don’t care about the loss itself as much as the gradient of the loss, which we’re going to use later during gradient descent. While the loss function isn’t strictly necessary, it’s still nice to have: we can look at that number to gauge how well the classifier is doing, both during training and during classification. That’s why we bothered to code loss and call it from the report function.
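
For the record, here’s a minimal sketch of what such a report-style helper could look like. The signature is an illustrative assumption, not the book’s exact code; the point is just that loss only gets printed:

 # A hypothetical report helper (illustrative signature): it only prints the
 # loss so we can watch progress; gradient descent itself uses the gradient.
 def report(iteration, Y, y_hat):
     print("Iteration: %5d, Loss: %.8f" % (iteration, loss(Y, y_hat)))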

With the loss function, our neural network’s classification code is done. Let’s wrap it up.