After applying multiple convolutional layers, the resulting data structure is a multi-dimensional matrix (or tensor). We must transform this into a matrix with the shape of the required output. For example, if our classification task has 10 classes (as in the MNIST example), we need the output of the model to be a 1 x 10 matrix. We do this by taking the results of our convolutional and max-pooling layers and using a Flatten layer to reshape the data. The last layer should have the same number of nodes as the number of classes we wish to predict. If our task is binary classification, the activation function in our last layer will be sigmoid. If our task is multi-class classification, the activation function in our last layer will be softmax.
Before the final softmax/sigmoid activation, we can optionally apply a number of dense layers. A dense layer is just a normal hidden layer, as we saw in Chapter 1, Getting Started with Deep Learning.
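As a concrete illustration, here is a minimal sketch of this output stage using the Keras Sequential API. The filter counts, the 128-unit dense layer, and the MNIST-style 28 x 28 input shape are illustrative assumptions, not values taken from this section:

```python
from tensorflow.keras import layers, models

# A minimal sketch: convolutional feature extraction followed by the
# output stage described above (Flatten -> optional Dense -> softmax).
model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(28, 28, 1)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    # Reshape the multi-dimensional tensor into a flat vector.
    layers.Flatten(),
    # An optional dense (normal hidden) layer before the output.
    layers.Dense(128, activation="relu"),
    # One node per class; softmax for multi-class (MNIST has 10 classes).
    layers.Dense(10, activation="softmax"),
])
model.summary()
```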
We need a softmax layer because the values in the last layer are numeric and can range from negative infinity to positive infinity. We must convert this series of input values into a series of probabilities indicating how likely the instance is to belong to each category. The function that transforms these numeric values into probabilities must have the following characteristics:
- Each output value must be between 0.0 and 1.0
- The sum of the output values should be exactly 1.0
One way to do this is simply to rescale the values by dividing each input value by the sum of the absolute input values. That approach has two problems, as the sketch after this list demonstrates:
- It does not handle negative values correctly
- Rescaling the input values may give us probabilities that are too close to each other
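A small numeric sketch, using hypothetical input values, makes both problems concrete:

```python
import numpy as np

def naive_rescale(x):
    # Rescale by dividing each value by the sum of absolute values.
    return x / np.sum(np.abs(x))

# Problem 1: a negative input produces a negative "probability",
# and the outputs no longer sum to 1.0.
values = np.array([-2.0, 1.0, 3.0])
print(naive_rescale(values))   # [-0.333  0.167  0.5  ], sum = 0.333

# Problem 2: distinct values map to probabilities that are
# almost indistinguishable.
close = np.array([17.2, 15.8])
print(naive_rescale(close))    # [0.521  0.479]
```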
These two issues can be solved by first applying e^x (where e is approximately 2.71828) to each input value x and then rescaling the resulting values. This maps any negative number to a small positive number, and it also makes the probabilities more polarized. This can be demonstrated with an example using output values from our dense layers. The values for categories 5 and 6 are quite close, at 17.2 and 15.8, respectively. However, when we apply the softmax function, the probability value for category 5 is about 4 times the probability value for category 6. The softmax function tends to produce probabilities that emphasize one category over all others, which is exactly what we want:
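Here is a minimal sketch of that computation. The values for categories 5 and 6 come from the example above; the other dense-layer outputs are hypothetical fillers, and categories are indexed from 0:

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability; this does not change the result.
    exps = np.exp(x - np.max(x))
    return exps / np.sum(exps)

# Hypothetical dense-layer outputs for 10 categories; only the values at
# positions 5 and 6 (17.2 and 15.8) are taken from the example above.
logits = np.array([1.3, -4.1, 2.8, 0.4, -0.9, 17.2, 15.8, 3.5, -2.2, 1.1])
probs = softmax(logits)

print(probs[5], probs[6])    # ~0.80 and ~0.20
print(probs[5] / probs[6])   # ~4.06, which equals e**(17.2 - 15.8)
print(probs.sum())           # 1.0 -- the outputs form a valid distribution
```

Note that the ratio between any two output probabilities depends only on the difference between their inputs: here e^(17.2 - 15.8) = e^1.4, which is approximately 4.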
