A Testing Conundrum

To see where the testing hurdle is, consider that we’re going to tune our neural network with an iterative process. That process is going to work like this:

  1. Tune the network’s hyperparameters.
  2. Train the network on the training set.
  3. Test the network on the test set.
  4. Repeat until we’re happy with the network’s accuracy.
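
In code, this tune-train-test loop might look roughly like the following sketch. All the names here are placeholders, not functions from our actual code: train() stands in for whatever training function we're using, accuracy() for whatever evaluation we run, and settings_to_try for the hyperparameter values we want to explore.

 # A rough sketch of the tune-train-test loop (all names are placeholders):
 for hyperparameters in settings_to_try:
     weights = train(X_train, Y_train, hyperparameters)    # step 2: train
     test_accuracy = accuracy(X_test, Y_test, weights)     # step 3: test
     print(hyperparameters, test_accuracy)                 # tweak and go again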

This process is pretty much the ML equivalent of software development, so let’s simply call it that: the “development cycle.”

We already went through a few iterations of development in the previous chapter, when we measured the network’s performance with different batch sizes. However, we overlooked a distressing fact: the development cycle violates the Blind Test Rule. Here is why.

During development, we tune the neural network’s hyperparameters while looking at the network’s accuracy on the test set. By doing that, we implicitly custom-tailor the hyperparameters to get a good result on that set. In the end, our hyperparameters are optimized for the test set, and are unlikely to be equally good on never-before-seen production data. In a sense, our own brain violates the Blind Test Rule by leaking information about the test examples into the network. We threw overfitting out the door, and it sneaked back in through the window.

If you find it hard to understand how the development cycle can cause overfitting, think of a similar example from the world of computing hardware. People commonly use standardized benchmarks to measure the performance of hardware such as graphics cards. Every now and then, a hardware maker is caught “optimizing for the benchmark”—that is, engineering a product to get great results on a particular benchmark, even though those results don’t translate as well to real-world problems. The development cycle can easily mislead us into the same kind of “cheating” behavior, where a machine learning algorithm gets unrealistically good results on the test set. Call it “unintended optimization,” if you wish. To avoid unintended optimization, we shouldn’t use the test set during tuning.

Unintended optimization is a sneaky issue. As long as we’re aware of it, however, we can avoid it with a low-cost approach: instead of two sets of examples, we can have three—one for training, one for testing, and a third one that we can use during the development cycle. This third set is usually called the validation set. If we use the validation set during development, then we can safely use the test set at the very end of the process, to get a realistic estimate of the system’s accuracy.

Let me recap how this strategy works:

  1. The setup: we put the test set aside. We’ll never look at it until the very end.

  2. The development cycle: we train the network on the training set as usual, but we use the validation set to gauge its performance.

  3. The final test: after we’ve tweaked and tuned our hyperparameters, we test the network on the test set, which gives us an objective idea of how it will perform in production.

The key idea in this strategy is worth repeating: bury the test set under a stone, and forget about it until the very end of the process. We shouldn’t cave in to the temptation of using the test set during development. As soon as we do, we violate the Blind Test Rule and risk getting an unrealistic measure of the system’s accuracy.
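
Sticking with the placeholder train() and accuracy() functions from the earlier sketch (and a best_hyperparameters stand-in for whichever settings won the tuning), the strategy boils down to something like this:

 # Development cycle: tune hyperparameters against the validation set only
 for hyperparameters in settings_to_try:
     weights = train(X_train, Y_train, hyperparameters)
     print(hyperparameters, accuracy(X_validation, Y_validation, weights))

 # Final test: touch the test set exactly once, at the very end
 weights = train(X_train, Y_train, best_hyperparameters)
 print("Expected production accuracy:", accuracy(X_test, Y_test, weights))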

How many examples should we set aside for the validation and test sets? That depends on the specific problem. Some people recommend setting aside 20% of the examples for the validation set, and just as many for the test set. That’s called the “60/20/20” split.
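
For example, assuming a generic dataset stored in NumPy arrays X and Y with one example per row, a 60/20/20 split could be carved out like this (the split_60_20_20() helper is just an illustration, not part of our code):

 import numpy as np

 def split_60_20_20(X, Y):
     examples = X.shape[0]
     shuffled = np.random.permutation(examples)   # shuffle before splitting
     X, Y = X[shuffled], Y[shuffled]
     first_cut = int(examples * 0.6)              # first 60% for training...
     second_cut = int(examples * 0.8)             # ...then 20% each for validation and testing
     X_train, X_validation, X_test = np.split(X, [first_cut, second_cut])
     Y_train, Y_validation, Y_test = np.split(Y, [first_cut, second_cut])
     return X_train, X_validation, X_test, Y_train, Y_validation, Y_test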

In MNIST, however, we have plenty of examples—70,000 in total. It feels like a waste to set aside almost 30,000 of them for validation and testing. Instead, we can take the 10,000 examples from the current test set and split them into two groups of 5,000—one for the validation set, and one for the new test set. Here’s the updated code that does that:

14_testing/mnist_three_sets.py
 # X_train/X_validation/X_test: 60K/5K/5K images
 # Each image has 784 elements (28 * 28 pixels)
 X_train = load_images("../data/mnist/train-images-idx3-ubyte.gz")
 X_test_all = load_images("../data/mnist/t10k-images-idx3-ubyte.gz")
 X_validation, X_test = np.split(X_test_all, 2)

 # 60K labels, each a single digit from 0 to 9
 Y_train_unencoded = load_labels("../data/mnist/train-labels-idx1-ubyte.gz")

 # Y_train: 60K labels, each consisting of 10 one-hot-encoded elements
 Y_train = one_hot_encode(Y_train_unencoded)

 # Y_validation/Y_test: 5K/5K labels, each a single digit from 0 to 9
 Y_test_all = load_labels("../data/mnist/t10k-labels-idx1-ubyte.gz")
 Y_validation, Y_test = np.split(Y_test_all, 2)

There you have it—a training set, a validation set, and a test set.
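
If you want to double-check the result, a quick sanity check (not part of the listing above) is to print the number of examples in each set:

 print(len(X_train), len(X_validation), len(X_test))    # expect 60000, 5000, 5000
 print(len(Y_train), len(Y_validation), len(Y_test))    # expect 60000, 5000, 5000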