Handwritten digit recognition using TensorFlow

In this example, we are going to classify images using TensorFlow. More specifically, we are going to create a simple model (a softmax regression model) for learning and predicting handwritten digits in images using the MNIST dataset.

Softmax regression is a generalization of logistic regression that we can use for multi-class classification. The MNIST dataset (http://yann.lecun.com/exdb/mnist/) contains images of handwritten digits.
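To make the relationship between softmax regression and logistic regression concrete, the following minimal sketch (plain Python, not taken from the script) checks that a two-class softmax over the scores [z, 0] gives the same probability as the logistic (sigmoid) function of z. The helper names sigmoid and softmax are chosen here for illustration only.

```python
import math


def sigmoid(z):
    """Logistic function used by binary logistic regression."""
    return 1.0 / (1.0 + math.exp(-z))


def softmax(logits):
    """Softmax over a list of scores; subtracting the max avoids overflow."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# With two classes and scores [z, 0], softmax reduces to the sigmoid of z.
z = 1.5
two_class = softmax([z, 0.0])
print(two_class[0], sigmoid(z))

# With ten classes (as in MNIST), the outputs still sum to 1.
ten_class = softmax([0.1 * k for k in range(10)])
print(sum(ten_class))
```

In other words, logistic regression is the special case with two classes, and softmax extends the same idea to the ten digit classes.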

The mnist_tensorflow_save_model.py script creates the model for learning and predicting handwritten digits in images.

The main steps are as follows. First, the following code downloads and imports the dataset automatically:

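# TensorFlow 1.x API; the tensorflow.examples.tutorials module is not available in TensorFlow 2.x
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data
data = input_data.read_data_sets("MNIST/", one_hot=True)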

The downloaded dataset is composed of three parts: 55,000 rows of training data (data.train), 10,000 rows of test data (data.test), and 5,000 rows of validation data (data.validation). Each part also contains the corresponding label for every digit. For example, the training data is composed of data.train.images (training dataset images) and data.train.labels (training dataset labels). Each image is composed of 28 x 28 pixels, resulting in a 784-element array. The one_hot=True option means that each label is represented as a vector in which only one element is 1. For example, for 9, the corresponding label will be [0 0 0 0 0 0 0 0 0 1].

This technique is called one-hot encoding, meaning that labels have been converted from a single number to a vector whose length is equal to the number of possible classes. All elements of the vector are set to zero, except the i-th element, which is set to 1 to indicate the i-th class.
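As a small illustration of one-hot encoding, the following sketch converts a digit into its one-hot vector and back. It uses plain Python lists rather than the arrays returned by read_data_sets(), and the function names one_hot and decode are hypothetical helpers, not part of TensorFlow.

```python
def one_hot(label, num_classes=10):
    """Return a list of zeros with a single 1 at index `label`."""
    if not 0 <= label < num_classes:
        raise ValueError("label out of range")
    vector = [0] * num_classes
    vector[label] = 1
    return vector


def decode(vector):
    """Recover the class index from a one-hot vector."""
    return vector.index(1)


print(one_hot(9))          # the label for digit 9
print(decode(one_hot(3)))  # back to the original digit
```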

When defining the placeholders, we need to match their shapes and types to the data that will be fed into them:

x = tf.placeholder(tf.float32, shape=[None, 784], name='myInput')
y = tf.placeholder(tf.float32, shape=[None, 10], name='Y')

When we use None as the first dimension of a placeholder's shape, the placeholder can be fed with as many examples as necessary. In this case, the x placeholder can be fed with any number of 784-dimensional vectors, so the shape of this tensor is [None, 784]. We also create the y placeholder for feeding the true labels; the shape of this tensor is [None, 10].
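The meaning of None in a declared shape can be sketched with a small compatibility check. This is only an illustration of the rule described above, written in plain Python; is_compatible is a hypothetical helper and not how TensorFlow implements its shape checks.

```python
def is_compatible(actual_shape, declared_shape):
    """Check a concrete shape against a declared shape where None means 'any size'."""
    if len(actual_shape) != len(declared_shape):
        return False
    return all(d is None or a == d for a, d in zip(actual_shape, declared_shape))


x_shape = [None, 784]  # same shape as the x placeholder

print(is_compatible((1, 784), x_shape))       # a single flattened image
print(is_compatible((100, 784), x_shape))     # a batch of 100 flattened images
print(is_compatible((100, 28, 28), x_shape))  # images that were not flattened
print(is_compatible((100, 783), x_shape))     # wrong number of pixels
```

Only the first two shapes match: the batch size is free, but every example must have exactly 784 values.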

At this point, we can start building the computation graph. The first step is to create the W and b variables as follows:

W = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))

The W and b variables are initialized with zeros because TensorFlow will optimize these values during training. The dimension of W is [784, 10] because we multiply each 784-dimensional array (the representation of one image) by W in order to get a 10-dimensional output vector, to which the 10-dimensional bias b is added.
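The shape reasoning above can be checked with a short sketch in plain Python, using lists of lists in place of tensors. The affine helper and the batch of three synthetic "images" are assumptions made for this illustration.

```python
def affine(x_batch, W, b):
    """Compute x_batch @ W + b for lists of lists (one row per example)."""
    return [
        [sum(x_i * W[i][j] for i, x_i in enumerate(row)) + b[j] for j in range(len(b))]
        for row in x_batch
    ]


num_pixels, num_classes = 784, 10
W = [[0.0] * num_classes for _ in range(num_pixels)]  # shape [784, 10]
b = [0.0] * num_classes                               # shape [10]

# Three synthetic flattened images, so x_batch has shape [3, 784].
x_batch = [[0.5] * num_pixels for _ in range(3)]

logits = affine(x_batch, W, b)
print(len(logits), len(logits[0]))  # one 10-dimensional vector per image
print(logits[0])
```

Because W and b start at zero, every class receives the same score for every image; the scores only become informative once training updates the variables.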

Now, we can implement our model as follows:

output_logits = tf.matmul(x, W) + b
y_pred = tf.nn.softmax(output_logits, name='myOutput')

tf.matmul() is used for matrix multiplication and tf.nn.softmax() is used to apply the softmax function to an input tensor, meaning that the output is normalized and can be interpreted as probabilities. At this point, we can define the loss function, the optimizer (in this case, AdamOptimizer (https://www.tensorflow.org/api_docs/python/tf/train/AdamOptimizer) is created), and the accuracy of the model as follows:

# Define the loss function, optimizer, and accuracy
# (learning_rate is a hyperparameter assumed to be defined earlier in the script)
loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits_v2(labels=y, logits=output_logits), name='loss')
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate, name='Adam-op').minimize(loss)
correct_prediction = tf.equal(tf.argmax(output_logits, 1), tf.argmax(y, 1), name='correct_pred')
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32), name='accuracy')
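To show what the loss and accuracy measure, here is a minimal plain-Python sketch of mean softmax cross-entropy and accuracy on a made-up batch of three examples with three classes. The data and the helper names are assumptions for illustration; the TensorFlow functions above perform the equivalent computation on tensors.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(one_hot_label, logits):
    """Cross-entropy between a one-hot label and softmax(logits)."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(one_hot_label, probs) if t > 0)


def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=lambda k: values[k])


# Made-up logits and one-hot labels for a batch of three examples.
logits_batch = [[2.0, 0.5, 0.1], [0.2, 0.1, 3.0], [1.0, 1.5, 0.3]]
labels_batch = [[1, 0, 0], [0, 0, 1], [1, 0, 0]]

# Loss: mean cross-entropy over the batch.
losses = [cross_entropy(t, z) for t, z in zip(labels_batch, logits_batch)]
loss = sum(losses) / len(losses)

# Accuracy: fraction of examples whose highest score matches the true class.
correct = [argmax(z) == argmax(t) for z, t in zip(logits_batch, labels_batch)]
accuracy = sum(correct) / len(correct)

print(loss, accuracy)
```

Note that the accuracy can be computed from the logits directly, as the listing above does, because softmax preserves the ordering of the scores, so the largest logit is also the largest probability.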

Finally, we can train the model, validate it with the mnist.validation validation data, and also save the model as follows:

# saver is not defined in this listing; a tf.train.Saver() created before the session is assumed here
saver = tf.train.Saver()

# num_steps and batch_size are hyperparameters assumed to be defined earlier in the script
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for i in range(num_steps):
        # Get a batch of training examples and their corresponding labels.
        x_batch, y_true_batch = data.train.next_batch(batch_size)

        # Put the batch into a dict to be fed into the placeholders
        feed_dict_train = {x: x_batch, y: y_true_batch}
        sess.run(optimizer, feed_dict=feed_dict_train)

    # Validation:
    feed_dict_validation = {x: data.validation.images, y: data.validation.labels}
    loss_test, acc_test = sess.run([loss, accuracy], feed_dict=feed_dict_validation)
    print("Validation loss: {}, Validation accuracy: {}".format(loss_test, acc_test))

    # Save model:
    saved_path_model = saver.save(sess, './softmax_regression_model_mnist')
    print('Model has been saved in {}'.format(saved_path_model))
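The training loop relies on next_batch() to hand out one mini-batch per step. The sketch below mimics that pattern in plain Python with a hypothetical BatchFeeder class. The text does not describe how next_batch() orders or shuffles the examples, so the shuffling and the handling of leftover examples here are assumptions, not a description of the real implementation.

```python
import random


class BatchFeeder:
    """Hypothetical stand-in for data.train: serves shuffled mini-batches."""

    def __init__(self, images, labels, seed=0):
        self.images = images
        self.labels = labels
        self._rng = random.Random(seed)
        self._order = list(range(len(images)))
        self._rng.shuffle(self._order)
        self._pos = 0

    def next_batch(self, batch_size):
        """Return the next batch_size examples, reshuffling when too few remain."""
        if self._pos + batch_size > len(self._order):
            # Assumption: leftover examples are skipped and a new pass starts.
            self._rng.shuffle(self._order)
            self._pos = 0
        idx = self._order[self._pos:self._pos + batch_size]
        self._pos += batch_size
        return [self.images[i] for i in idx], [self.labels[i] for i in idx]


# Ten toy examples: each "image" holds one number, each label is its parity.
images = [[float(n)] for n in range(10)]
labels = [n % 2 for n in range(10)]
feeder = BatchFeeder(images, labels)

num_steps, batch_size = 5, 4
for step in range(num_steps):
    x_batch, y_true_batch = feeder.next_batch(batch_size)
    # In the TensorFlow script, this is where sess.run(optimizer, ...) is called.
    print(step, [int(v[0]) for v in x_batch], y_true_batch)
```

The important property is that images and labels stay aligned within each batch, which is what allows them to be fed to the x and y placeholders together.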

Once the model has been saved, we can use it to recognize handwritten digits in images. In the mnist_save_and_load_model_builder.py script, we create saved_model.pb inside the my_model folder and use this model to make new predictions on images loaded using OpenCV. To save the model, we make use of the export_model() function that was introduced in the previous section. To make new predictions, we use the following code:

# Load some test images:
test_digit_0 = load_digit("digit_0.png")
test_digit_1 = load_digit("digit_1.png")
test_digit_2 = load_digit("digit_2.png")
test_digit_3 = load_digit("digit_3.png")

with tf.Session(graph=tf.Graph()) as sess:
    tf.saved_model.loader.load(sess, [tf.saved_model.tag_constants.SERVING], './my_model')
    graph = tf.get_default_graph()
    x = graph.get_tensor_by_name('myInput:0')
    model = graph.get_tensor_by_name('myOutput:0')
    output = sess.run(model, {x: [test_digit_0, test_digit_1, test_digit_2, test_digit_3]})
    print("predicted labels: {}".format(np.argmax(output, axis=1)))

Here, test_digit_0, test_digit_1, test_digit_2, and test_digit_3 are four loaded images containing one digit each. To load each image, we make use of the load_digit() function as follows:

def load_digit(image_name):
    """Loads a digit image and preprocesses it into the format the model expects"""

    # Read as a single-channel image, resize to 28 x 28, flatten to 784 values in [0, 1]
    gray = cv2.imread(image_name, cv2.IMREAD_GRAYSCALE)
    gray = cv2.resize(gray, (28, 28))
    flatten = gray.flatten() / 255.0
    return flatten
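The flatten-and-scale step of load_digit() can be illustrated without OpenCV. The following sketch applies the same two operations to a synthetic 28 x 28 image stored as a list of lists; the flatten_and_scale helper and the synthetic image are assumptions made for this example.

```python
def flatten_and_scale(gray_image, max_value=255.0):
    """Turn a 2D grid of 0-255 pixel values into a flat list of floats in [0, 1]."""
    # Rows are concatenated one after another (row-major order).
    return [pixel / max_value for row in gray_image for pixel in row]


# Synthetic 28 x 28 grayscale image: black background with a white vertical stroke.
gray = [[0] * 28 for _ in range(28)]
for r in range(4, 24):
    gray[r][14] = 255

flat = flatten_and_scale(gray)
print(len(flat), min(flat), max(flat))
```

The result has 784 values between 0 and 1, matching the [None, 784] shape of the input placeholder, and the pixel at row r and column c ends up at index r * 28 + c.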

As you can see, we have to preprocess each image so that it matches the format of the MNIST database images. If we execute this script, we will get the following predicted class for each image:

predicted labels: [0 1 2 3]
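The predicted labels are obtained by taking, for each image, the index of the largest value in the model output. The sketch below reproduces that step in plain Python on made-up probabilities. The argmax_rows and fake_row helpers are hypothetical, and the numbers are not real model outputs.

```python
def argmax_rows(matrix):
    """Index of the largest value in each row, like np.argmax(matrix, axis=1)."""
    return [max(range(len(row)), key=lambda k: row[k]) for row in matrix]


def fake_row(winner, num_classes=10, peak=0.91):
    """Made-up probability vector whose largest entry is at index `winner`."""
    rest = (1.0 - peak) / (num_classes - 1)
    return [peak if k == winner else rest for k in range(num_classes)]


# One row of ten class probabilities per image (illustrative values only).
output = [fake_row(digit) for digit in (0, 1, 2, 3)]

print("predicted labels: {}".format(argmax_rows(output)))
```

Since this sketch returns a Python list instead of a NumPy array, the labels are printed with commas between them.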