The Finished Network

Here is the complete code of our neural network:

11_training/neural_network.py
 import numpy as np


 # Logistic function, used as the activation of the hidden layer
 def sigmoid(z):
     return 1 / (1 + np.exp(-z))


 # Softmax activation for the output layer
 def softmax(logits):
     exponentials = np.exp(logits)
     return exponentials / np.sum(exponentials, axis=1).reshape(-1, 1)


 # Derivative of the sigmoid, expressed in terms of the sigmoid's own output
 def sigmoid_gradient(sigmoid):
     return np.multiply(sigmoid, (1 - sigmoid))


 # Cross-entropy loss, averaged over the examples
 def loss(Y, y_hat):
     return -np.sum(Y * np.log(y_hat)) / Y.shape[0]


 # Add a column of 1s for the bias
 def prepend_bias(X):
     return np.insert(X, 0, 1, axis=1)


 # Forward propagation: returns the predictions and the hidden layer
 def forward(X, w1, w2):
     h = sigmoid(np.matmul(prepend_bias(X), w1))
     y_hat = softmax(np.matmul(prepend_bias(h), w2))
     return (y_hat, h)


 # Backpropagation: gradients of the loss with respect to w1 and w2
 def back(X, Y, y_hat, w2, h):
     w2_gradient = np.matmul(prepend_bias(h).T, (y_hat - Y)) / X.shape[0]
     w1_gradient = np.matmul(prepend_bias(X).T, np.matmul(y_hat - Y, w2[1:].T)
                             * sigmoid_gradient(h)) / X.shape[0]
     return (w1_gradient, w2_gradient)


 # Pick the class with the highest predicted probability
 def classify(X, w1, w2):
     y_hat, _ = forward(X, w1, w2)
     labels = np.argmax(y_hat, axis=1)
     return labels.reshape(-1, 1)


 # Initialize the weights with small random values
 def initialize_weights(n_input_variables, n_hidden_nodes, n_classes):
     w1_rows = n_input_variables + 1
     w1 = np.random.randn(w1_rows, n_hidden_nodes) * np.sqrt(1 / w1_rows)

     w2_rows = n_hidden_nodes + 1
     w2 = np.random.randn(w2_rows, n_classes) * np.sqrt(1 / w2_rows)

     return (w1, w2)


 # Print the training loss and the test accuracy
 def report(iteration, X_train, Y_train, X_test, Y_test, w1, w2):
     y_hat, _ = forward(X_train, w1, w2)
     training_loss = loss(Y_train, y_hat)
     classifications = classify(X_test, w1, w2)
     accuracy = np.average(classifications == Y_test) * 100.0
     print("Iteration: %5d, Loss: %.8f, Accuracy: %.2f%%" %
           (iteration, training_loss, accuracy))


 # Train the network with gradient descent
 def train(X_train, Y_train, X_test, Y_test, n_hidden_nodes, iterations, lr):
     n_input_variables = X_train.shape[1]
     n_classes = Y_train.shape[1]
     w1, w2 = initialize_weights(n_input_variables, n_hidden_nodes, n_classes)
     for iteration in range(iterations):
         y_hat, h = forward(X_train, w1, w2)
         w1_gradient, w2_gradient = back(X_train, Y_train, y_hat, w2, h)
         w1 = w1 - (w1_gradient * lr)
         w2 = w2 - (w2_gradient * lr)
         report(iteration, X_train, Y_train, X_test, Y_test, w1, w2)
     return (w1, w2)


 import mnist
 w1, w2 = train(mnist.X_train, mnist.Y_train,
                mnist.X_test, mnist.Y_test,
                n_hidden_nodes=200, iterations=10000, lr=0.01)

To write the very last line here, I had to set values for the hyperparameters. The number of hidden nodes was easy: we had already decided on 200 hidden nodes, plus the bias node that the network adds automatically. By contrast, it took me some time to find a good value for lr: I had to try a few different learning rates and pick the one that resulted in the lowest loss. Near the end of Part II, we'll look at a more systematic way to choose hyperparameter values.
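If you want to get a feel for that search, here is a minimal sketch of the kind of sweep I ran. It reuses the train(), forward(), and loss() functions from the listing above; the candidate learning rates and the shortened run of 500 iterations are illustrative choices, not the exact values from my experiments:

 # Illustrative learning rate sweep: train briefly with each candidate
 # and keep the one that ends with the lowest training loss.
 import mnist

 candidates = [0.03, 0.01, 0.003, 0.001]
 losses = {}
 for candidate_lr in candidates:
     w1, w2 = train(mnist.X_train, mnist.Y_train,
                    mnist.X_test, mnist.Y_test,
                    n_hidden_nodes=200, iterations=500, lr=candidate_lr)
     y_hat, _ = forward(mnist.X_train, w1, w2)
     losses[candidate_lr] = loss(mnist.Y_train, y_hat)

 best_lr = min(losses, key=losses.get)
 print("Lowest loss with lr =", best_lr)

Even a shortened sweep like this takes a while, because each candidate triggers a full round of training.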

And at long last, here are the results of running the network:

 Iteration: 0, Loss: 2.38746031, Accuracy: 13.61%
 Iteration: 1, Loss: 2.34527197, Accuracy: 15.00%
 …
 Iteration: 9999, Loss: 0.14668400, Accuracy: 93.25%

The perceptron that we built in Chapter 7, The Final Challenge, couldn't reach 93% accuracy, no matter how long you trained it. The neural network has to do more calculations than a perceptron, so it takes a bit longer to pass 90% accuracy; after that, however, it just keeps going. After a few hours of number crunching, the network reaches over 93% accuracy at recognizing MNIST characters. Pretty good, although still far from our lofty goal of 99% accuracy.

In case you think that my enthusiasm for that 1% improvement is unwarranted, look at it from the opposite point of view: the perceptron makes a classification mistake over 8% of the time, while the network gets it wrong less than 7% of the time (at 93.25% accuracy, the error rate is 100% − 93.25% = 6.75%). In a production system, that improvement might already make a difference, and it's only the beginning. Over the next few chapters, we'll push that accuracy much higher.

Let’s recap what we learned in this long chapter.