Just like tuning hyperparameters, reducing overfitting is more art than science. Besides L1 and L2, there are many other regularization methods you can use. Here’s an overview of some of them.
The most fundamental regularization technique is also the first one you should reach for: make the overfitting network smaller. After all, overfitting happens because the system is too clever for the data it's learning from. Smaller networks are not as clever as big networks. Try reducing the number of hidden nodes, or even removing a few layers, as in the sketch below. You'll have a go at this approach in the chapter's closing exercise.
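For instance, here's a minimal sketch of what shrinking a network might look like in Keras. The architectures, layer sizes, and the 20-variable input are made up for illustration:

```python
from keras.models import Sequential
from keras.layers import Dense

# A hypothetical network that overfits: two wide hidden layers.
big_model = Sequential([
    Dense(200, activation='relu', input_shape=(20,)),
    Dense(200, activation='relu'),
    Dense(1)
])

# A smaller candidate: fewer hidden nodes, and one layer removed.
small_model = Sequential([
    Dense(30, activation='relu', input_shape=(20,)),
    Dense(1)
])
```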
Instead of simplifying the model, you can also reduce overfitting by simplifying the data—that is, removing a few input variables. Let’s say you’re predicting a boiler’s consumption from a set of 20 input variables. An overfitting network strives to fit the details of that dataset, noise included. Try dropping a few variables that are less likely to impact consumption (like the day of the week) in favor of the ones that seem more relevant (like the outside temperature). The idea is that the fewer features you have, the less noise you inject into the system.
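As a sketch, here's how that feature selection might look with pandas. The file and column names are invented for illustration:

```python
import pandas as pd

# Load the hypothetical boiler dataset with its 20 input variables.
data = pd.read_csv('boiler_readings.csv')

# Keep only the features that seem most relevant to consumption...
features = data[['outside_temperature', 'water_flow', 'heating_hours']]
labels = data['consumption']

# ...or, equivalently, drop the ones that look like noise:
# features = data.drop(columns=['day_of_week', 'consumption'])
```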
Here’s another way to reduce overfitting: cut short the network’s training. This idea is not as weird as it sounds. If you look at the history of the network’s loss during training, you can see the system moving from underfitting to overfitting as it learns the noise in the training data. Once overfitting kicks in, the validation loss flattens out and then diverges from the training loss. If you stop training at that point, you get a network that hasn’t yet learned enough to overfit the data. This technique is called early stopping.
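Keras ships with an EarlyStopping callback that does this automatically. Here's a minimal sketch, assuming a compiled model and the usual training and validation sets are already defined:

```python
from keras.callbacks import EarlyStopping

# Stop training when the validation loss hasn't improved for 10 epochs,
# and roll the weights back to the best epoch seen so far.
early_stopping = EarlyStopping(monitor='val_loss',
                               patience=10,
                               restore_best_weights=True)

model.fit(X_train, Y_train,
          validation_data=(X_validation, Y_validation),
          epochs=1000,
          callbacks=[early_stopping])
```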
Finally, and perhaps surprisingly, you can sometimes reduce overfitting by increasing a neural network’s learning rate. To understand why, remember that the learning rate measures the size of each GD step. With a bigger learning rate, GD takes bolder, coarser steps. As a result, the trained model is likely to be less detailed, which might help reduce overfitting.
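In Keras, that's just a matter of passing a larger learning rate to the optimizer. The value below is purely illustrative; you'd have to tune it for your own network and data:

```python
from keras.optimizers import SGD

# Use a coarser learning rate than SGD's default of 0.01.
model.compile(loss='mse', optimizer=SGD(learning_rate=0.1))
```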
We went through a few regularization techniques, and we’ll see a couple more in the next chapter. Each of these approaches might or might not work for a specific network and dataset. Here, as in many other aspects of ML, your mileage may vary. Be ready to experiment with different approaches, either alone or in combination, and learn by experience which approaches work best in which circumstances.