All the models that we have analyzed so far share a common feature: once the training process is completed, the weights are frozen and the output depends only on the input sample. Clearly, this is the expected behavior of a classifier, but there are many scenarios where a prediction must take into account the history of the input values. A time series is a classic example. Let's suppose that we need to predict the temperature for the next week. If we try to use only the last known value x(t) and an MLP trained to predict x(t+1), it's impossible to take temporal conditions into account: the current season, its history over the years, the position within the season, and so on. The regressor will learn to associate each input with the output that yields the minimum average error, but in real-life situations this isn't enough. The only reasonable way to solve this problem is to define a new architecture for the artificial neuron, providing it with a memory. This concept is shown in the following diagram:
Now the neuron is no longer a pure feed-forward computational unit, because the feedback connection forces it to remember its past and use it to predict new values. The new dynamic rule is as follows:
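Here, the symbols are chosen for illustration: w is the input weight vector, u the recurrent weight, and f(·) a generic activation function:

$$
y(t) = f\big(\bar{w}^T \bar{x}(t) + u \, y(t-1)\big), \qquad y(0) = 0
$$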
The previous prediction is fed back and summed with the new linear output. The resulting value is transformed by the activation function to produce the actual new output (conventionally, the first feedback value is null, but this is not a constraint). An immediate consideration concerns the activation function: this is a dynamic system that could easily become unstable. The only way to prevent this phenomenon is to employ saturating functions (such as the sigmoid or the hyperbolic tangent). In fact, whatever the input is, the output can never explode towards +∞ or -∞.
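To make the rule concrete, here is a minimal NumPy sketch of a single recurrent unit (the function name recurrent_unit and its parameter names are hypothetical; the symbols match those in the formula above):

```python
import numpy as np

def recurrent_unit(x_seq, w, u, b=0.0, activation=np.tanh):
    # Implements y(t) = activation(w^T x(t) + u * y(t-1) + b), with y(0) = 0
    y = 0.0  # conventionally, the first feedback value is null
    outputs = []
    for x in x_seq:
        y = activation(np.dot(w, x) + u * y + b)
        outputs.append(y)
    return np.array(outputs)

# Example: feed a sinusoidal sequence through the unit
x_seq = np.sin(np.linspace(0.0, 4.0 * np.pi, 50)).reshape(-1, 1)
y_seq = recurrent_unit(x_seq, w=np.array([0.5]), u=0.8)
```

With the saturating tanh activation, every value in y_seq remains bounded in (-1, 1), no matter how the input behaves.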
Suppose that, instead, we were to use a ReLU activation: under some conditions, the output would grow indefinitely, leading to an overflow. Clearly, the situation is even worse with a linear activation, and could be very similar even when using a Leaky ReLU or an ELU. Hence, it's obvious that we need to select saturating functions, but is this enough to ensure stability? Even if the hyperbolic tangent has two stable points (-1 and +1), and the sigmoid likewise saturates at 0 and 1, this isn't enough to guarantee stability. Let's imagine that the output is affected by noise and oscillates around 0.0. The unit cannot converge towards a value and remains trapped in a limit cycle.
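This behavior is easy to verify numerically. The following sketch (with illustrative values; a recurrent weight u greater than 1 amplifies the feedback at each step) contrasts the two activations on the same recurrence:

```python
import numpy as np

# Same recurrence under a non-saturating and a saturating activation.
u, wx = 1.2, 0.5          # recurrent weight and constant driving term w^T x
y_relu, y_tanh = 0.0, 0.0
for _ in range(30):
    y_relu = max(wx + u * y_relu, 0.0)   # ReLU: feedback keeps amplifying
    y_tanh = np.tanh(wx + u * y_tanh)    # tanh: output is squashed into (-1, 1)

print(y_relu)   # ~590: diverges geometrically, growing as 1.2^t
print(y_tanh)   # ~0.92: converges to a bounded fixed point
```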
Luckily, the ability to learn the weights allows us to increase the robustness to noise, preventing small changes in the input from inverting the dynamics of the neuron. This is a very important (and easy to prove) result that guarantees stability under very simple conditions, but again, what price do we need to pay? Is it something simple and straightforward? Unfortunately, the answer is negative, and the price of stability is extremely high. However, before discussing this problem, let's show how a simple recurrent network can be trained.