Reinforcement learning is the last of the three broad categories of machine learning. We have already studied supervised learning and unsupervised learning. Reinforcement learning differs from the other two in significant ways: it neither trains on labeled data nor assigns labels to data. Instead, it seeks to find the sequence of actions that allows an agent to collect the highest reward.
The environment is the space in which the agent completes its task. In our case, the environment will be the 3 x 3 grid used to play the game tic-tac-toe. The agent performs actions within the environment; in this case, the agent places the X's or O's on the grid. The environment also contains rewards and penalties, meaning the agent is rewarded for certain actions and penalized for others. In tic-tac-toe, if a player places three of their marks (X or O) in a row horizontally, vertically, or diagonally, then they win and, conversely, the other player loses. This is the simple reward and penalty structure for this game. The policy is the strategy that dictates which actions the agent should take to achieve the greatest probability of success given any set of previous actions.
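To make this reward and penalty structure concrete, here is a minimal R sketch that scores a board for a given player. The board representation (a length-9 character vector read row by row), the reward values of 1, -1, and 0, and the function name reward are assumptions made for this example, not necessarily the implementation we will build later in the chapter.

reward <- function(board, player) {
  # every line of three cells that wins the game: rows, columns, diagonals
  win_lines <- list(c(1, 2, 3), c(4, 5, 6), c(7, 8, 9),
                    c(1, 4, 7), c(2, 5, 8), c(3, 6, 9),
                    c(1, 5, 9), c(3, 5, 7))
  opponent <- if (player == "X") "O" else "X"
  for (idx in win_lines) {
    if (all(board[idx] == player))   return(1)    # agent completed a line: reward
    if (all(board[idx] == opponent)) return(-1)   # opponent completed a line: penalty
  }
  0                                               # no winner yet: neutral outcome
}

# Example: X holds the entire top row, so X is rewarded and O is penalized
board <- c("X", "X", "X", "O", "O", "", "", "", "")
reward(board, "X")   # 1
reward(board, "O")   # -1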
To determine the optimal policy, we will be using Q-learning. The Q in Q-learning stands for quality: the algorithm builds a quality matrix that records how good each action is in each state and uses it to determine the best course of action. The matrix is updated with the Bellman equation. The interior of the equation calculates the reward value plus the discounted maximum value of future moves minus the current quality score. This calculated value is then multiplied by the learning rate and added to the current quality score. Later, we will see how to write this equation using R.
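In standard notation, the update just described is usually written as follows (this is the conventional form of the Q-learning update rule; the exact symbols used later in the chapter may differ):

Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]

Here Q(s, a) is the current quality score for taking action a in state s, r is the reward received, \gamma is the discount factor applied to the highest quality score available from the next state s', and \alpha is the learning rate that controls how much of the bracketed value is added to the current quality score.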
In this chapter, we are using Q-learning; however, there are other ways to perform reinforcement learning. Another popular algorithm is called actor–critic, and it differs from Q-learning in significant ways. The following paragraph compares the two to show the different approaches to the same type of machine learning.
Q-learning computes a value function, so it requires a finite set of states and actions, such as those in the game tic-tac-toe. Actor–critic can work with continuous environments and optimizes the policy directly rather than deriving it from a value function the way Q-learning does. It uses two models: the actor, which chooses and performs actions, and the critic, which estimates the value function and evaluates those actions. This takes place for each action, and over numerous iterations the actor learns the best set of actions. While Q-learning works well for solving a game like tic-tac-toe, which has a finite board and set of moves, actor–critic works well for environments that are not so constrained or that change dynamically.
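As a heavily simplified illustration of this division of labor, the following R sketch applies the actor/critic idea to a two-armed bandit, a single-state toy problem chosen only for brevity. The reward probabilities, learning rates, and variable names are assumptions for this illustration, and the full actor–critic algorithm for continuous environments is considerably more involved.

set.seed(42)
prefs <- c(0, 0)              # actor: preferences over the two actions
value <- 0                    # critic: running estimate of the expected reward
alpha_actor  <- 0.1           # learning rate for the actor
alpha_critic <- 0.1           # learning rate for the critic
reward_prob  <- c(0.2, 0.8)   # action 2 pays off more often (assumed values)

for (step in 1:5000) {
  probs  <- exp(prefs) / sum(exp(prefs))        # softmax turns preferences into a policy
  action <- sample(1:2, 1, prob = probs)        # actor chooses and performs an action
  reward <- rbinom(1, 1, reward_prob[action])   # environment returns a reward
  td_error <- reward - value                    # critic compares reward to its estimate
  value <- value + alpha_critic * td_error      # critic updates its value estimate
  grad <- -probs                                # policy-gradient step: shift probability
  grad[action] <- grad[action] + 1              # toward actions rated above average
  prefs <- prefs + alpha_actor * td_error * grad
}

round(exp(prefs) / sum(exp(prefs)), 2)          # the actor now strongly prefers action 2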
In this section, we quickly reviewed two common methods for performing reinforcement learning. Next, we will begin to implement Q-learning on our tic-tac-toe data.