Network initialization

So far, we have seen that a neural network model is built up in a number of stages. We already know that a weight connects each pair of nodes in two adjacent layers. The values from the input nodes undergo a linear transformation with these weights, and the result passes through a nonlinear activation function to yield the values of the next layer. This is repeated for each subsequent layer and, later on, backpropagation is used to find the optimal values of the weights.
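
To make the forward pass concrete, here is a minimal sketch in NumPy, assuming a single hidden layer with a sigmoid activation; the layer sizes and the names `W1`, `b1`, `W2`, and `b2` are purely illustrative:

```python
import numpy as np

def sigmoid(z):
    # Nonlinear activation, applied element-wise
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative sizes: 3 input features, 4 hidden units, 1 output unit
rng = np.random.default_rng(0)
x = rng.normal(size=(3,))        # values from the input nodes
W1 = rng.normal(size=(4, 3))     # weights between input and hidden layer
b1 = np.zeros(4)
W2 = rng.normal(size=(1, 4))     # weights between hidden and output layer
b2 = np.zeros(1)

# Linear transformation of the inputs, then the nonlinear activation
h = sigmoid(W1 @ x + b1)         # hidden-layer values
y = sigmoid(W2 @ h + b2)         # output value
print(h, y)
```

In practice, backpropagation would then compute the gradient of a loss with respect to `W1`, `b1`, `W2`, and `b2` and update them iteratively.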

For a long time, weights were simply initialized at random. Later, it was realized that the way we initialize the network has a massive impact on the model. Let's see how we initialize the model:

If a network is initialized with zeros, then all the hidden nodes receive a zero signal, because every input is multiplied by zero. More generally, no matter what the input values are, if all the weights are the same, all the units in the hidden layer will be the same too. This is called symmetry, and it has to be broken for the network to capture more information and become a good model. Hence, the weights should be initialized randomly, or at least with different values:
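
The following sketch (again in NumPy, with illustrative layer sizes) shows the symmetry problem directly: with zero initialization every hidden unit computes exactly the same value, and since they would also receive identical gradients, they could never become different during training. Random initialization breaks this symmetry from the start:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=(3,))            # an arbitrary input vector

# Zero initialization: every hidden unit applies the same (zero) weights,
# so all hidden values are identical -- this is the symmetry problem.
W_zero = np.zeros((4, 3))
print(sigmoid(W_zero @ x))           # e.g. [0.5 0.5 0.5 0.5]

# Random initialization with small values breaks the symmetry:
# each hidden unit starts with different weights and produces a
# different value, so the units can learn different features.
W_rand = rng.normal(scale=0.1, size=(4, 3))
print(sigmoid(W_rand @ x))           # four distinct values
```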