Just like tuning hyperparameters, reducing overfitting is more art than science. Besides L1 and L2, there are many other regularization methods you can use. Here’s an overview of some of them.
The most fundamental regularization technique is also the first one you should reach for: make the overfitting network smaller. After all, overfitting happens because the system is too clever for the data it's learning from. Smaller networks are not as clever as big networks. Try reducing the number of hidden nodes, or even removing a few layers, as in the sketch below. You'll have a go at this approach in the chapter's closing exercise.
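For instance, here's a minimal sketch of what shrinking a network might look like in Keras. The architectures, layer sizes, and the 20-variable input are made up for illustration:

```python
from keras.models import Sequential
from keras.layers import Dense

# A hypothetical network that overfits: two wide hidden layers.
big_model = Sequential([
    Dense(200, activation='relu', input_shape=(20,)),
    Dense(200, activation='relu'),
    Dense(1)
])

# A smaller candidate: fewer hidden nodes, and one layer removed.
small_model = Sequential([
    Dense(30, activation='relu', input_shape=(20,)),
    Dense(1)
])
```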
Instead of simplifying the model, you can also reduce overfitting by simplifying the data—that is, removing a few input variables. Let’s say you’re predicting a boiler’s consumption from a set of 20 input variables. An overfitting network strives to fit the details of that dataset, noise included. Try dropping a few variables that are less likely to impact consumption (like the day of the week) in favor of the ones that seem more relevant (like the outside temperature). The idea is that the fewer features you have, the less noise you inject into the system.
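As a sketch, here's how that feature selection might look with pandas. The file and column names are invented for illustration:

```python
import pandas as pd

# Load the hypothetical boiler dataset with its 20 input variables.
data = pd.read_csv('boiler_readings.csv')

# Keep only the features that seem most relevant to consumption...
features = data[['outside_temperature', 'water_flow', 'heating_hours']]
labels = data['consumption']

# ...or, equivalently, drop the ones that look like noise:
# features = data.drop(columns=['day_of_week', 'consumption'])
```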
Here’s another way to reduce overfitting: cut short the network’s training. This idea is not as weird as it sounds. If you look at the history of the network’s loss during training, you can see the system moving from underfitting to overfitting as it learns the noise in the training data. Once overfitting kicks in, the validation loss flattens out and then diverges from the training loss. If you stop training at that point, you get a network that hasn’t yet learned enough to overfit the data. This technique is called early stopping.
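Keras ships with an EarlyStopping callback that does this automatically. Here's a minimal sketch, assuming a compiled model and the usual training and validation sets are already defined:

```python
from keras.callbacks import EarlyStopping

# Stop training when the validation loss hasn't improved for 10 epochs,
# and roll the weights back to the best epoch seen so far.
early_stopping = EarlyStopping(monitor='val_loss',
                               patience=10,
                               restore_best_weights=True)

model.fit(X_train, Y_train,
          validation_data=(X_validation, Y_validation),
          epochs=1000,
          callbacks=[early_stopping])
```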
Finally, and perhaps surprisingly, you can sometimes reduce overfitting by increasing a neural network’s learning rate. To understand why, remember that the learning rate measures the size of each GD step. With a bigger learning rate, GD takes bolder, coarser steps. As a result, the trained model is likely to be less detailed, which might help reduce overfitting.
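In Keras, that's just a matter of passing a larger learning rate to the optimizer. The value below is purely illustrative; you'd have to tune it for your own network and data:

```python
from keras.optimizers import SGD

# Use a coarser learning rate than SGD's default of 0.01.
model.compile(loss='mse', optimizer=SGD(learning_rate=0.1))
```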
We went through a few regularization techniques, and we’ll see a couple more in the next chapter. Each of these approaches might or might not work for a specific network and dataset. Here, as in many other aspects of ML, your mileage may vary. Be ready to experiment with different approaches, either alone or in combination, and learn by experience which approaches work best in which circumstances.