By now, we’re familiar enough with gradient descent. This chapter introduces a souped-up variant of GD: mini-batch gradient descent.
Mini-batch gradient descent is slightly more complicated than plain vanilla GD, but as we're about to see, it also tends to converge faster. In simpler terms, mini-batch GD approaches the minimum loss more quickly, speeding up the network's training. As a bonus, it takes less memory, and sometimes it even reaches a lower loss than regular GD. In fact, after this chapter, you might never use regular GD again!
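If you'd like a concrete preview before we dive in, here's a minimal sketch of the two training loops side by side. It uses a toy linear model with a mean squared error gradient instead of our network's backpropagation, and the names involved (gradient(), batch_gd(), mini_batch_gd(), lr, batch_size) are illustrative placeholders, not code from this book.

```python
import numpy as np

def gradient(X, Y, w):
    # Toy stand-in for backpropagation: the gradient of the mean squared
    # error of a linear model with weights w over the examples (X, Y).
    return 2 * X.T @ (X @ w - Y) / len(X)

def batch_gd(X, Y, w, lr, epochs):
    # Plain GD: one weight update per pass over the whole training set.
    for _ in range(epochs):
        w = w - lr * gradient(X, Y, w)
    return w

def mini_batch_gd(X, Y, w, lr, epochs, batch_size):
    # Mini-batch GD: shuffle the examples, slice them into small batches,
    # and update the weights once per batch, so many updates per pass.
    for _ in range(epochs):
        order = np.random.permutation(len(X))
        X_shuffled, Y_shuffled = X[order], Y[order]
        for start in range(0, len(X_shuffled), batch_size):
            x_batch = X_shuffled[start:start + batch_size]
            y_batch = Y_shuffled[start:start + batch_size]
            w = w - lr * gradient(x_batch, y_batch, w)
    return w

# Example: both loops recover w_true on this toy data, but mini-batch GD
# performs many more (and much cheaper) weight updates per pass.
X = np.random.randn(1000, 5)
w_true = np.array([1.0, -2.0, 0.5, 3.0, -1.0])
Y = X @ w_true
w_batch = batch_gd(X, Y, np.zeros(5), lr=0.1, epochs=100)
w_mini = mini_batch_gd(X, Y, np.zeros(5), lr=0.1, epochs=10, batch_size=32)
```

The only real difference between the two loops is how many examples feed each gradient computation: all of them in plain GD, a small random subset in mini-batch GD. That's why each mini-batch update is cheaper in both time and memory, and why we get many updates out of a single pass over the training set.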
You might wonder why we're focusing on training speed when we have more pressing concerns to deal with. In particular, the accuracy of our neural network is still disappointing: better than a perceptron, yes, but well below our target of 99% on MNIST. Shouldn't we make the network more accurate first, and faster later? After all, as Donald Knuth said, "premature optimization is the root of all evil"![21]
However, there’s a reason to speed up training straight away. Within a couple of chapters, we’re going to inch toward that 99% goal by tuning the network iteratively: we’ll tweak the hyperparameters, train the network… and then do it all over again, until we’re happy with the result. Each of those iterations could take hours. We’d better find a way to speed them up—otherwise, the tuning process might take days.
You might also wonder why we’re looking for an alternative algorithm, instead of speeding up the algorithm that we already have. Here’s the answer: soon enough, we’ll stop writing our own backpropagation code, and we’ll switch to highly optimized ML libraries. Instead of optimizing code that we’re going to throw away soon, we’d better focus on techniques that will stay valid even after we switch to libraries.
Mini-batch gradient descent is one of those techniques. Let’s see what it’s about. But first, let’s review what happens when we train a neural network.