Preparing Data

You might think that an ML engineer spends her time dreaming up and training sophisticated algorithms. Just like programming, however, the job comes with a less glamorous and more time-consuming side. In the case of ML, that grindwork usually involves preparing data.

If you’re not convinced that preparing data is a big time sink, think of the effort that went into MNIST. Somebody had to collect and scan 60,000 handwritten digits. They probably hand-checked all those digits to remove the examples that were not representative of real-life digits, maybe because they were too garbled. They also had to center, crop, and scale those images to the same resolution, taking care to avoid graphical artifacts such as jagged edges. I’d wager that they also processed all the digits to give them uniform lightness and contrast, from clear white to pitch black. Last but not least, they labeled each example and double-checked the labels to weed out mistakes.

In the case of MNIST, we don’t have to do all that work—somebody did it for us. However, we can still massage MNIST a bit further, to make it more friendly to our network.

Preparing data is a complex activity in itself. Here, we’re going to scratch its surface by looking at a couple of common techniques, and we’ll get an intuitive understanding of what those techniques are for.

Checking the Range of Input Variables

Before you feed data to a neural network, it’s a good idea to check that all input variables span similar ranges. Imagine what happens if one input variable ranges from 0 to 10, and another from 1,000 to 2,000. The two variables might be equally important to predict the label, but the second one would contribute more to the loss, just because it’s bigger. As a result, the network would focus on minimizing the part of the loss that comes from the larger variable, and mostly ignore the smaller one.

To avoid that problem, you can rescale the variables to a similar range. That operation is called feature scaling, where feature is just another name for “input variable.” In our case, we don’t need to bother with feature scaling, because all the variables in MNIST are 1-byte pixels: they never drop below 0 or rise above 255.
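
If we did need feature scaling, one simple variant is min-max scaling, which squeezes every value into a fixed range such as 0 to 1. Here’s a minimal sketch, just for illustration; the rescale function below is made up for this example and isn’t part of our MNIST code:

 import numpy as np

 def rescale(inputs, new_min=0.0, new_max=1.0):
     # Min-max scaling: squeeze the values into [new_min, new_max]
     old_min, old_max = np.min(inputs), np.max(inputs)
     return (inputs - old_min) / (old_max - old_min) * (new_max - new_min) + new_min

 # Two variables with very different ranges...
 small_variable = np.array([0.0, 2.5, 7.0, 10.0])
 large_variable = np.array([1000.0, 1200.0, 1800.0, 2000.0])

 # ...end up on the same footing after rescaling:
 print(rescale(small_variable))   # => 0, 0.25, 0.7, 1
 print(rescale(large_variable))   # => 0, 0.2, 0.8, 1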

However, even if your input variables span a similar range, you don’t want that range to extend to large numbers. The problem with feeding large numbers to the network is that they tend to cause large numbers inside the network. As we learned in Dead Neurons, neural networks work better when they process values that are close to zero, because that’s where sigmoids give their best.

Bottom line: if your input variables are spread out (for example, from -10,000 to +10,000), or off-center (for example, from 10,000 to 10,100), then you should shift them and scale them to make them small and centered around zero. Let’s see a common technique to do that.

Standardizing Input Variables

To keep input variables close to zero, ML practitioners often standardize them. “Standardization” means slightly different things to different people, but its most common meaning is this: “rescale the inputs so that their average is 0 and their standard deviation is 1.”

In case you don’t know, the standard deviation measures how “spread out” a variable is. If the standard deviation is low, that means that the values tend to stay close to their average. For example, the height of humans has a relatively low standard deviation because nobody is hundreds of times taller than anyone else. On the other hand, the height of plants has a high standard deviation because a plant can be as short as moss, or as tall as a redwood.
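
In NumPy, np.std measures that spread for us. Here’s a toy example with made-up numbers: two sets of values that share the same average, but have very different standard deviations:

 import numpy as np

 tight = np.array([9, 10, 10, 11])       # values huddle around the average
 spread_out = np.array([0, 5, 15, 20])   # values stray far from the average

 print(np.average(tight), np.std(tight))             # => 10.0 0.707...
 print(np.average(spread_out), np.std(spread_out))   # => 10.0 7.905...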

A quick way to standardize a bunch of inputs is to subtract their average and divide them by their standard deviation:

 standardized_inputs = (inputs - np.average(inputs)) / np.std(inputs)

This operation gives us the kind of data we want: data that are centered around zero, and never stray too far from it.
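
If you want to convince yourself, you can verify the result with a few lines of NumPy. Up to floating-point error, the standardized data always comes out with an average of 0 and a standard deviation of 1. The array below is just a handful of made-up, pixel-like values:

 import numpy as np

 inputs = np.array([120.0, 64.0, 255.0, 0.0, 31.0])
 standardized_inputs = (inputs - np.average(inputs)) / np.std(inputs)

 print(np.average(standardized_inputs))   # => ~0 (a tiny number, due to floating point)
 print(np.std(standardized_inputs))       # => 1.0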

Here’s an updated version of mnist.py that standardizes the dataset:

15_development/mnist_standardized.py
 def standardize(training_set, test_set):
     average = np.average(training_set)
     standard_deviation = np.std(training_set)
     training_set_standardized = (training_set - average) / standard_deviation
     test_set_standardized = (test_set - average) / standard_deviation
     return (training_set_standardized, test_set_standardized)


 # X_train/X_validation/X_test: 60K/5K/5K images
 # Each image has 784 elements (28 * 28 pixels)
 X_train_raw = load_images("../data/mnist/train-images-idx3-ubyte.gz")
 X_test_raw = load_images("../data/mnist/t10k-images-idx3-ubyte.gz")
 X_train, X_test_all = standardize(X_train_raw, X_test_raw)
 X_validation, X_test = np.split(X_test_all, 2)

We could standardize each input variable separately, or all of them together. The input variables in MNIST all have comparable ranges, so the preceding function standardizes them together. It uses NumPy to calculate the average and standard deviation of the training set, and then applies the formula:

standardized_values = (original_values - average) / standard_deviation
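
As an aside, if you ever deal with input variables that span wildly different ranges, you can standardize each variable on its own by computing the statistics column by column. Here’s a sketch of that variant; the standardize_per_feature function is just an illustration, and we won’t use it for MNIST:

 import numpy as np

 def standardize_per_feature(training_set, test_set):
     # One average and one standard deviation per column (that is, per input variable)
     average = np.average(training_set, axis=0)
     standard_deviation = np.std(training_set, axis=0)
     training_set_standardized = (training_set - average) / standard_deviation
     test_set_standardized = (test_set - average) / standard_deviation
     return (training_set_standardized, test_set_standardized)

Be careful with MNIST, though: some of its border pixels are 0 in every training image, so their standard deviation is 0, and the division above would fail. That’s one more reason to standardize all the pixels together.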

The last four lines of mnist_standardized.py load the two MNIST datasets, standardize them, and split the test data into a validation set and a test set, like we did in the previous chapter. Let me spell out what the standardize function does, because this process is a common source of rookie mistakes:

  1. The function calculates the average and the standard deviation on the training set alone, because that’s the only information that we want to train on. If we involved the validation and test sets in those calculations, we’d leak information from those sets into the neural network’s training.

  2. On the other hand, after calculating those parameters, the function uses them to standardize all three sets: the training, the validation, and the test set. That’s because we want the three sets to be similar—otherwise, the network would fail once it moves from training to testing. If we ever deployed our network to production, we’d also have to standardize production data with the same average and standard deviation.

So, calculate the average and the standard deviation on the training set alone, then use them to standardize the entire dataset. I know that process can be confusing; it tripped me up a few times when I was a beginner.
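
In code, “using the same average and standard deviation” boils down to hanging on to the training set’s statistics and reusing them later. Here’s a minimal sketch of what that could look like at deployment time. It assumes the X_train_raw array from the earlier listing, and the standardize_production_data function is just a hypothetical name:

 import numpy as np

 # Compute the statistics once, on the training set alone...
 average = np.average(X_train_raw)
 standard_deviation = np.std(X_train_raw)

 # ...then reuse those same two numbers on any data the network sees later
 def standardize_production_data(images):
     return (images - average) / standard_deviation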

So, we just standardized MNIST. Let’s check whether it was worth it.

Standardization in Practice

Let’s check the effect of standardization on our neural network’s accuracy. The following code runs the network twice, once with regular MNIST and once with standardized MNIST. Each configuration runs for two epochs, with a batch size of 60. The MNIST training set contains 60,000 examples, so that’s 1,000 iterations per epoch:

15_development/mnist_vs_standardized_mnist.py
 import neural_network as nn
 import mnist as normal
 import mnist_standardized as standardized

 print("Regular MNIST:")
 nn.train(normal.X_train, normal.Y_train,
          normal.X_validation, normal.Y_validation,
          n_hidden_nodes=200, epochs=2, batch_size=60, lr=0.1)

 print("Standardized MNIST:")
 nn.train(standardized.X_train, standardized.Y_train,
          standardized.X_validation, standardized.Y_validation,
          n_hidden_nodes=200, epochs=2, batch_size=60, lr=0.1)

Here are the results after a few tens of minutes of number crunching:

 Regular MNIST:
  0-0 > Loss: 2.28073164, Accuracy: 18.90%
 
  RuntimeWarning: overflow encountered in exp
 
  1-999 > Loss: 0.41537869, Accuracy: 84.60%
 Standardized MNIST:
  0-0 > Loss: 2.26244033, Accuracy: 17.96%
 
  1-999 > Loss: 0.21364970, Accuracy: 91.80%

We don’t know what would happen if we kept training the system for days, but this short test gives us pretty compelling numbers. The network is way more accurate when we train it on standardized MNIST than on regular MNIST. Also, while regular MNIST causes an overflow during training, standardized MNIST doesn’t. It seems that having smaller input variables makes the network more numerically stable, just as the theory predicts.

All in all, the verdict is clear: from now on, we’ll use the standardized version of MNIST. And now that we have data we trust, let’s move into the heart of the development cycle.