In the previous chapter, we achieved something to be proud of: we wrote a piece of code that learns. If we got that code reviewed by computer scientists, however, they would find it lacking. In particular, they’d raise an eyebrow at the sight of the train function. “This code might work okay for this simple example,” one stern computer scientist would say, “but it won’t scale to real-world problems.”
Fair enough. In this chapter, we’re going to address those concerns in two ways. First, we’re not going to get our code reviewed by a computer scientist. Second, we’re going to analyze the shortcomings of the current train implementation and solve them with one of machine learning’s key ideas: an algorithm called gradient descent. Like our current train code, gradient descent is a way to find the minimum of the loss function—but it’s faster, more precise, and more general than the code from the previous chapter.
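To give a flavor of where we’re headed, here is a minimal sketch of the idea in Python. It is not the train function we’ll build in this chapter: the single-weight linear model, the toy dataset, and the learning rate lr below are assumptions made up for this illustration. Instead of probing the loss at candidate values of the weight, gradient descent uses the loss’s derivative to nudge the weight in the direction that lowers the loss.

```python
import numpy as np

# A made-up toy dataset, purely for illustration
X = np.array([1.0, 2.0, 3.0, 4.0])
Y = np.array([2.1, 3.9, 6.2, 7.8])

def predict(X, w):
    # A bare-bones linear model with a single weight and no bias
    return X * w

def loss(X, Y, w):
    # Mean squared error: the quantity we want to minimize
    return np.average((predict(X, w) - Y) ** 2)

def gradient(X, Y, w):
    # Derivative of the mean squared error with respect to w
    return 2 * np.average(X * (predict(X, w) - Y))

def gradient_descent(X, Y, iterations, lr):
    w = 0.0
    for _ in range(iterations):
        # Step in the direction that decreases the loss
        w -= lr * gradient(X, Y, w)
    return w

w = gradient_descent(X, Y, iterations=1000, lr=0.01)
print("w=%.3f, loss=%.6f" % (w, loss(X, Y, w)))  # w lands near 2
```

Why this works, and how it improves on our current train, is exactly what the rest of this chapter covers.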
Gradient descent isn’t just useful for our tiny program. In fact, you cannot go very far in ML without gradient descent. In different forms, this algorithm will accompany us all the way to the end of this book.
Let’s start with the problem that gradient descent is meant to solve.