So far, we split our examples into a training set and a test set. This approach, however, tends to break down once we start tuning our system’s hyperparameters, because it sneakily leads us to optimize the system for the specific examples in the test set.
In this chapter, we switched to a more sophisticated approach, splitting the test set in two: a smaller test set, and a brand-new validation set. We’ll use the validation set for development, and the test set only once, for our final benchmark. During the final benchmark, the network comes into contact with the test set for the first time, so it should give us a reliable idea of the network’s accuracy on future production data.
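To make the idea concrete, here is a minimal sketch of that three-way split in plain NumPy. The array names (X and Y), the split sizes, and the helper function three_way_split() are illustrative assumptions, not code from this chapter:

```python
import numpy as np

def three_way_split(X, Y, validation_size=5000, test_size=5000):
    # Assumes X (inputs) and Y (labels) are already loaded and shuffled.
    # Hold out the last (validation_size + test_size) examples;
    # everything before them becomes the training set.
    held_out = validation_size + test_size
    X_train, Y_train = X[:-held_out], Y[:-held_out]
    # Split the held-out examples into a validation set (used during
    # development) and a test set (touched only for the final benchmark).
    X_validation, Y_validation = X[-held_out:-test_size], Y[-held_out:-test_size]
    X_test, Y_test = X[-test_size:], Y[-test_size:]
    return (X_train, Y_train), (X_validation, Y_validation), (X_test, Y_test)
```

During development, we would tune hyperparameters while looking only at the validation accuracy, and run the network on the test set a single time at the very end.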
One word of warning before we move on: this three-sets strategy helps us measure our system’s accuracy correctly, but it doesn’t eliminate overfitting. Our network is still going to yield unrealistically good results on the training and the validation set—we just decided not to trust those results, and look at the results on the test set instead.
In other words, we worked around overfitting, but we didn’t vanquish it. In fact, overfitting is this book’s recurring villain, and it will rear its ugly head again. We’ll have a final confrontation with overfitting in Chapter 17, Defeating Overfitting. For the time being, we’ll have to live with it, and neutralize its effects by applying the Blind Test Rule.
Armed with that sound testing strategy, we can finally roll up our sleeves and dive into development.