Hands On: Thinking About Testing

This exercise isn’t literally “hands on.” Instead, it asks you to stop for a moment and think about what could go wrong as you split a dataset.

Imagine that you receive a large dataset of handwritten characters, a bit like MNIST, except with letters from A to Z instead of digits. You want to build a classifier for this dataset, so you need training, validation, and test sets. Let’s say the entire dataset contains 1,000,000 characters, and you plan to reserve 50,000 of them for the validation set and another 50,000 for the test set. Here’s the code that does that:

 import numpy as np
 data_train, data_validation, data_test = np.split(data_all, [900_000, 950_000])

This line splits data_all at the two given indices. The result is that data_train contains the first 900,000 characters, data_validation the next 50,000, and data_test the remaining 50,000.
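
If you’re not familiar with np.split, here’s a minimal sketch of the same idea on a toy array (the array and split indices here are made up for illustration):

 import numpy as np

 # A toy stand-in for data_all: ten "characters," labeled 0 through 9.
 data_all = np.arange(10)

 # Split at indices 6 and 8: the first 6 items, the next 2, and the last 2.
 data_train, data_validation, data_test = np.split(data_all, [6, 8])

 print(data_train)       # [0 1 2 3 4 5]
 print(data_validation)  # [6 7]
 print(data_test)        # [8 9]

In general, passing N indices to np.split produces N + 1 chunks, with each index marking where one chunk ends and the next begins.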

The code here seems innocuous enough, but it contains a sneaky problem that might easily trip you up (or not, depending on the specific dataset). Here’s the question I’d like you to ponder: can you guess what that problem is, and what specific property of a dataset makes this splitting strategy potentially wrong? How would you counter the problem if it happens?

Try to answer on your own, then read my answer in the 14_testing/solution folder.