Before running the model, we first have to determine the elements that we will use to build the multilayer perceptron model. These elements are as follows:
- Architecture: The model contains 784 neurons in the input layer. This is because each image is 28 x 28 pixels, and each pixel is a feature in this case, so we have 784 input features. We will have 10 neurons in the output layer, one per class. We will also use three hidden layers, although we could use any number of hidden layers. The number of neurons we will use is 350 in the first hidden layer, 200 in the second one, and 100 in the third.
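As a minimal sketch, these sizes can be captured as Python constants (the variable names here are our own, not from the original code):

n_inputs = 28 * 28   # 784 input neurons, one per pixel
n_hidden1 = 350      # first hidden layer
n_hidden2 = 200      # second hidden layer
n_hidden3 = 100      # third hidden layer
n_outputs = 10       # one output neuron per class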
- Activation function: We will use the ReLU activation function, as shown in the following code block:
import numpy as np

vector = np.arange(-5, 5, 0.1)

def relu(x):
    return max(0., x)

# Vectorize so that relu can be applied element-wise to NumPy arrays
relu = np.vectorize(relu)
If the input is negative, the function outputs 0, and if the input is positive, the function outputs the input unchanged. In other words, ReLU takes the maximum between 0 and the input: relu(x) = max(0, x). This activation function will be used in every neuron of the hidden layers.
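The graphical representation of the ReLU activation function can be generated with a few lines of matplotlib. The following is a minimal sketch of our own (the original shows this code only as a screenshot), reusing the vector and relu objects defined above:

import matplotlib.pyplot as plt

# Plot the vectorized ReLU over the interval [-5, 5)
plt.plot(vector, relu(vector))
plt.title('ReLU activation function')
plt.xlabel('x')
plt.ylabel('relu(x)')
plt.show()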
- Optimizing algorithm: The optimizing algorithm used here is gradient descent, with a learning rate of 0.01.
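In TensorFlow 1.x terms, this can be sketched as follows (it assumes a loss tensor named loss, defined under the next element):

import tensorflow as tf

learning_rate = 0.01
optimizer = tf.train.GradientSescentOptimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)
# `loss` is the tensor defined under the loss function element below
training_op = optimizer.minimize(loss)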
- Loss function: For the loss function, we will use the cross_entropy function. As with the other loss functions that we have used in this book, it measures the distance between the actual values and the predictions that the model makes.
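A minimal TensorFlow 1.x sketch of this loss, assuming logits coming out of the output layer and a placeholder y holding the integer class labels:

import tensorflow as tf

# `logits` and `y` are assumed to be defined elsewhere in the model
xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
# Average the per-example cross-entropy over the batch
loss = tf.reduce_mean(xentropy, name='loss')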
- Weights initialization strategy: For this, we will use the Xavier initializer, which is actually the default for the fully_connected function from TensorFlow.
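As an illustration (assuming the TensorFlow 1.x contrib API), a hidden layer built with fully_connected picks up the Xavier initializer without any extra arguments:

import tensorflow as tf
from tensorflow.contrib.layers import fully_connected

# Hypothetical placeholder for the 784 input features
X = tf.placeholder(tf.float32, shape=(None, 784), name='X')
# Xavier weights initialization (and the ReLU activation) are the defaults here
hidden1 = fully_connected(X, 350)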
- Regularization strategy: We are not going to use any regularization strategy.
- Training strategy: We are going to use 20 epochs, meaning the whole dataset will be presented to the network 20 times, and in every iteration we will use a batch size of 80. So, we will present the data to the network 80 data points at a time, and the whole dataset 20 times.
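Putting the training strategy together, a hypothetical TensorFlow 1.x training loop could look like the following. It assumes the X and y placeholders and the training_op defined in the previous sketches, and uses the standard MNIST input_data helper:

import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data

mnist = input_data.read_data_sets('/tmp/data/')  # hypothetical data directory

n_epochs = 20    # present the whole dataset to the network 20 times
batch_size = 80  # 80 data points per iteration
n_batches = mnist.train.num_examples // batch_size

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for epoch in range(n_epochs):
        for _ in range(n_batches):
            X_batch, y_batch = mnist.train.next_batch(batch_size)
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})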