Reinforcement learning

Reinforcement learning is a different paradigm in machine learning, in which an agent tries to learn to behave optimally in a defined environment by making decisions/actions and observing the outcomes of those decisions. In reinforcement learning, the agent does not learn from a given dataset; rather, it learns by interacting with the environment and observing the effects of its actions. The environment is defined in such a way that the agent receives rewards if its actions bring it closer to its goal.

Humans are known to learn in this way. For example, consider a child in front of a fireplace, where the child is the agent and the space around the child is the environment. If the child moves its hand towards the fire, it feels the warmth, which feels good; in a way, the child (or the agent) is rewarded for the action of moving its hand close to the fire. But if the child moves its hand too close to the fire, its hand will burn, and it receives a negative reward. Using these rewards, the child is able to figure out the optimal distance to keep its hand from the fire. Reinforcement learning imitates exactly this kind of system in order to train an agent to achieve its goal in a given environment.

Making this more formal, to train an agent we need an environment that represents the world in which the agent can take actions. For each of these actions, the environment returns an observation, which contains the reward, telling the agent how good or bad the action was, as well as the next state of the agent in the environment. Based on these observations, the agent tries to figure out the optimal way to reach its goal. Figure 1 shows this interaction between an agent and an environment.
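To make this interaction loop concrete, here is a minimal Python sketch built around the fireplace example above. The `FireplaceEnv` class, its distances, and its reward values are all invented purely for illustration; the loop at the bottom is the same cycle the figure describes: the agent picks an action, and the environment answers with a reward and the next state.

```python
import random

class FireplaceEnv:
    """A toy environment loosely based on the fireplace example above.

    The state is the hand's distance from the fire (in arbitrary units);
    the distances and reward values are made up purely for illustration.
    """

    def reset(self):
        self.distance = 10           # the hand starts far away from the fire
        return self.distance         # the initial state/observation

    def step(self, action):
        # action: -1 moves the hand closer to the fire, +1 moves it away
        self.distance = max(0, self.distance + action)
        if self.distance == 0:
            reward = -5              # hand in the fire: burned
        elif self.distance <= 3:
            reward = 1               # comfortably warm
        else:
            reward = 0               # too far away to feel anything
        done = self.distance == 0    # the episode ends if the hand gets burned
        return self.distance, reward, done


env = FireplaceEnv()
state = env.reset()
for _ in range(15):                              # interact for a few steps
    action = random.choice([-1, 1])              # an untrained agent acts randomly
    next_state, reward, done = env.step(action)
    print(f"state={state} action={action:+d} reward={reward} next_state={next_state}")
    state = next_state
    if done:                                     # stop once the episode is over
        break
```

An untrained agent can only act randomly here; the whole point of a reinforcement learning algorithm is to use the observed rewards to gradually choose better actions.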

What makes reinforcement learning fundamentally different from other machine learning algorithms is that there is no supervisor; there is only a reward signal giving the agent feedback about the action it took. It is also important to mention that an environment can be constructed in such a way that the reward is delayed, which can make the agent wander around for a long time before reaching its goal. The agent might also have to go through a lot of negative feedback before it gets there.

In the case of supervised learning, we are given a dataset that essentially tells our learning algorithm the right answers in different situations. The learning algorithm then looks at all of these situations and their solutions and tries to generalize from them, and so we also expect the dataset given to us to be independent and identically distributed (IID). In the case of reinforcement learning, however, the data is not IID: the data that is generated depends on the path the agent takes, and hence on the actions the agent chooses. Reinforcement learning is therefore an active learning process, in which the actions taken by the agent influence the environment, which in turn influences the data the environment generates.
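The following short sketch illustrates this point with an ad-hoc corridor environment and two hard-coded policies (everything here is invented for illustration): the distribution of states that ends up in the agent's "dataset" changes as soon as its behaviour changes, which is exactly why the data cannot be treated as IID.

```python
import random
from collections import Counter

def rollout(policy, n_steps=1000, n_states=10):
    """Walk a simple corridor of states 0..n_states-1, recording every state visited."""
    state, visited = 0, []
    for _ in range(n_steps):
        visited.append(state)
        # apply the policy's move and clamp the result to the corridor
        state = min(n_states - 1, max(0, state + policy(state)))
    return visited

def go_right(state):
    return +1 if random.random() < 0.8 else -1   # a policy that mostly moves right

def go_left(state):
    return -1 if random.random() < 0.8 else +1   # a policy that mostly moves left

random.seed(0)
# The "dataset" each policy generates covers very different parts of the state space.
print("mostly right:", Counter(rollout(go_right)))
print("mostly left: ", Counter(rollout(go_left)))
```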

We can take a very simple example to better understand how a reinforcement learning agent and environment behave. Consider an agent trying to learn to play Super Mario Bros:

  1. The agent will receive the initial state from the environment. In the case of Super Mario, this would be the current frame of the game.
  2. Having received the state information, the agent takes an action. Let's say the action it takes is to move to the right.
  3. When the environment receives this action, it returns the next state based on it. The next state would also be a frame, but it could show a dying Mario if there was an enemy to Mario's right in the previous state; otherwise, it would simply show Mario having moved one step to the right. The environment also returns a reward based on the action: let's say -5 if there was an enemy to Mario's right (since the action killed Mario), or +1 if Mario moved towards finishing the level. A simplified sketch of this step is shown after the list.
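As a rough illustration of step 3, the following sketch hard-codes the two outcomes described above. The reward values (+1 and -5) come from the text; the dictionary-based state and the function itself are a deliberate simplification, since a real Super Mario environment would return image frames as states.

```python
def step(state, action):
    """A drastically simplified step function for the scenario above.

    In the real game the state would be a video frame; here it is a small dict.
    The reward values (+1 and -5) follow the text; everything else is illustrative.
    """
    next_state = dict(state)
    if action == "move_right":
        if state["enemy_to_the_right"]:
            next_state["mario_alive"] = False
            return next_state, -5, True     # Mario ran into the enemy: episode over
        return next_state, +1, False        # one step closer to finishing the level
    return next_state, 0, False             # any other action: nothing happens here

state = {"enemy_to_the_right": True, "mario_alive": True}
next_state, reward, done = step(state, "move_right")
print(reward, done)                         # prints: -5 True
```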