The Finished Network

Here is the complete code of our neural network:

11_training/neural_network.py
 import numpy as np


 # Logistic function, used as the activation of the hidden layer
 def sigmoid(z):
     return 1 / (1 + np.exp(-z))


 # Softmax activation for the output layer
 def softmax(logits):
     exponentials = np.exp(logits)
     return exponentials / np.sum(exponentials, axis=1).reshape(-1, 1)


 # Derivative of the sigmoid, expressed in terms of the sigmoid's own output
 def sigmoid_gradient(sigmoid):
     return np.multiply(sigmoid, (1 - sigmoid))


 # Cross-entropy loss, averaged over the examples
 def loss(Y, y_hat):
     return -np.sum(Y * np.log(y_hat)) / Y.shape[0]


 # Add a column of 1s for the bias
 def prepend_bias(X):
     return np.insert(X, 0, 1, axis=1)


 # Forward propagation: returns the predictions and the hidden layer
 def forward(X, w1, w2):
     h = sigmoid(np.matmul(prepend_bias(X), w1))
     y_hat = softmax(np.matmul(prepend_bias(h), w2))
     return (y_hat, h)


 # Backpropagation: gradients of the loss with respect to w1 and w2
 def back(X, Y, y_hat, w2, h):
     w2_gradient = np.matmul(prepend_bias(h).T, (y_hat - Y)) / X.shape[0]
     w1_gradient = np.matmul(prepend_bias(X).T, np.matmul(y_hat - Y, w2[1:].T)
                             * sigmoid_gradient(h)) / X.shape[0]
     return (w1_gradient, w2_gradient)


 # Pick the class with the highest predicted probability
 def classify(X, w1, w2):
     y_hat, _ = forward(X, w1, w2)
     labels = np.argmax(y_hat, axis=1)
     return labels.reshape(-1, 1)


 # Initialize the weights with small random values
 def initialize_weights(n_input_variables, n_hidden_nodes, n_classes):
     w1_rows = n_input_variables + 1
     w1 = np.random.randn(w1_rows, n_hidden_nodes) * np.sqrt(1 / w1_rows)

     w2_rows = n_hidden_nodes + 1
     w2 = np.random.randn(w2_rows, n_classes) * np.sqrt(1 / w2_rows)

     return (w1, w2)


 # Print the training loss and the test accuracy
 def report(iteration, X_train, Y_train, X_test, Y_test, w1, w2):
     y_hat, _ = forward(X_train, w1, w2)
     training_loss = loss(Y_train, y_hat)
     classifications = classify(X_test, w1, w2)
     accuracy = np.average(classifications == Y_test) * 100.0
     print("Iteration: %5d, Loss: %.8f, Accuracy: %.2f%%" %
           (iteration, training_loss, accuracy))


 # Train the network with gradient descent
 def train(X_train, Y_train, X_test, Y_test, n_hidden_nodes, iterations, lr):
     n_input_variables = X_train.shape[1]
     n_classes = Y_train.shape[1]
     w1, w2 = initialize_weights(n_input_variables, n_hidden_nodes, n_classes)
     for iteration in range(iterations):
         y_hat, h = forward(X_train, w1, w2)
         w1_gradient, w2_gradient = back(X_train, Y_train, y_hat, w2, h)
         w1 = w1 - (w1_gradient * lr)
         w2 = w2 - (w2_gradient * lr)
         report(iteration, X_train, Y_train, X_test, Y_test, w1, w2)
     return (w1, w2)


 import mnist
 w1, w2 = train(mnist.X_train, mnist.Y_train,
                mnist.X_test, mnist.Y_test,
                n_hidden_nodes=200, iterations=10000, lr=0.01)

To write the very last line here, I had to set values for the hyperparameters. The number of hidden nodes was easy: we had already decided on 200 hidden nodes, plus the bias node that the network adds automatically. By contrast, it took me some time to find a good value for lr: I had to try a few different learning rates and pick the one that resulted in the lowest loss. Near the end of Part II, we'll look at a more systematic way to choose hyperparameter values.
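If you want to get a feel for that search, here is a minimal sketch of the kind of sweep I ran. It reuses the train(), forward(), and loss() functions from the listing above; the candidate learning rates and the shortened run of 500 iterations are illustrative choices, not the exact values from my experiments:

 # Illustrative learning rate sweep: train briefly with each candidate
 # and keep the one that ends with the lowest training loss.
 import mnist

 candidates = [0.03, 0.01, 0.003, 0.001]
 losses = {}
 for candidate_lr in candidates:
     w1, w2 = train(mnist.X_train, mnist.Y_train,
                    mnist.X_test, mnist.Y_test,
                    n_hidden_nodes=200, iterations=500, lr=candidate_lr)
     y_hat, _ = forward(mnist.X_train, w1, w2)
     losses[candidate_lr] = loss(mnist.Y_train, y_hat)

 best_lr = min(losses, key=losses.get)
 print("Lowest loss with lr =", best_lr)

Even a shortened sweep like this takes a while, because each candidate triggers a full round of training.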

And at long last, here are the results of running the network:

 Iteration: 0, Loss: 2.38746031, Accuracy: 13.61%
 Iteration: 1, Loss: 2.34527197, Accuracy: 15.00%
 …
 Iteration: 9999, Loss: 0.14668400, Accuracy: 93.25%

The perceptron that we built in Chapter 7, The Final Challenge, couldn't reach 93% accuracy, no matter how long you trained it. The neural network has to do more calculations than a perceptron, so it takes a bit longer to pass 90% accuracy; after that, however, it just keeps going. After a few hours of number crunching, the network reaches over 93% accuracy at recognizing MNIST characters. Pretty good, although still far from our lofty goal of 99% accuracy.

In case you think that my enthusiasm for that 1% improvement is unwarranted, look at it from the opposite point of view: the perceptron makes a classification mistake over 8% of the time, while the network gets it wrong less than 7% of the time (at 93.25% accuracy, the error rate is 100% − 93.25% = 6.75%). In a production system, that improvement might already make a difference, and it's only the beginning. Over the next few chapters, we'll push that accuracy much higher.

Let’s recap what we learned in this long chapter.