Our program can successfully forecast pizza sales, but why stop there? Maybe we could use the same code to forecast other things, such as the stock market. We could get rich overnight! (Spoiler: that wouldn’t really work.)
If we tried to apply our linear regression program to a different problem, however, we’d run into an obstacle. Our code is based on a simple line-shaped model with two parameters: the weight w and the bias b. Most real-life problems require complex models with more parameters. As an example, remember our goal for Part I of this book: we want to build a system that recognizes images. An image is way more complicated than a single number, so it needs a model with many more parameters than the pizza forecaster.
Unfortunately, if we added more parameters to our model, we’d kill its performance. To see why, let’s review the train function from the previous chapter:
def train(X, Y, iterations, lr):
    w = b = 0
    for i in range(iterations):
        current_loss = loss(X, Y, w, b)
        print("Iteration %4d => Loss: %.6f" % (i, current_loss))

        if loss(X, Y, w + lr, b) < current_loss:
            w += lr
        elif loss(X, Y, w - lr, b) < current_loss:
            w -= lr
        elif loss(X, Y, w, b + lr) < current_loss:
            b += lr
        elif loss(X, Y, w, b - lr) < current_loss:
            b -= lr
        else:
            return w, b

    raise Exception("Couldn't converge within %d iterations" % iterations)
At each iteration, this algorithm tweaks either w or b, looking for the values that minimize the loss. Here is one way in which that approach could go wrong: as we tweak w, we might be increasing the loss caused by b, and the other way around. To avoid that problem and get as close as possible to the minimum loss, we should tweak both parameters at once. The more parameters we have, the more important it is to tweak them all at the same time.
To tweak w and b together, we’d have to try all the possible combinations of tweaks: increase w and b; increase w and decrease b; increase w and leave b unchanged; decrease w and… well, you get the point. Do the math, and you’ll find that the total number of tweaking combinations, including the one where all the parameters stay unchanged, would be 3 to the power of the number of parameters. With two parameters, that would be 3², that is, nine combinations.
Calling loss nine times per iteration doesn’t sound like a big deal, but increase the number of parameters to 10, and you get 3¹⁰ combinations, which is almost 60,000 calls per iteration. You might think that 10 parameters are far-fetched, but they aren’t. Later in this book we’ll use models with hundreds of thousands of parameters. With such large models, an algorithm that tries every combination of parameters is never going to fly. We should nip this slow code in the bud.
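To see the explosion concretely, here is a quick sketch (the helper name `tweak_combinations` is made up for illustration, not part of the book’s code). Each parameter can be nudged down by lr, left alone, or nudged up by lr, so `itertools.product` over those three choices enumerates every combination:

```python
from itertools import product

def tweak_combinations(n_params, lr):
    """Every way to nudge each of n_params parameters by -lr, 0, or +lr."""
    return list(product((-lr, 0.0, +lr), repeat=n_params))

print(len(tweak_combinations(2, 0.001)))   # 9 combinations for w and b
print(len(tweak_combinations(10, 0.001)))  # 59049 loss calls per iteration
```

The count grows as 3 to the power of the number of parameters, which is why this brute-force approach is hopeless for large models.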
There is also a more urgent problem in the current implementation of train: it tweaks parameters in increments that are equal to the learning rate. If lr is large, then the parameters change quickly, which speeds up training—but the final result is less precise, because each parameter has to be a multiple of that large lr. To increase precision, we need a small lr, which results in even slower training. We’re trading off speed for precision, when we actually need both.
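A tiny sketch makes the precision problem tangible (the target value and helper name are made up for illustration). Starting from 0 and moving only in whole steps of lr, a parameter can only ever land on a multiple of lr, so the best achievable error is bounded by half of lr:

```python
def closest_reachable(target, lr):
    """Best value reachable from 0 when every tweak is a whole step of lr."""
    return round(target / lr) * lr

target = 1.337  # made-up "ideal" parameter value
for lr in (0.1, 0.01, 0.001):
    approx = closest_reachable(target, lr)
    print("lr=%s -> best value %s (error %s)" % (lr, approx, abs(approx - target)))
```

With lr = 0.1 the error is stuck around 0.037; only a much smaller lr gets close to the target, at the cost of many more, smaller steps.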
That’s why our current code is basically a hack. We should replace it with a better algorithm—one that makes train both fast and precise.