REINFORCEMENT LEARNING
In the early 1900s, Edward Thorndike put cats in boxes, from which they could escape only by stepping on a switch. After some wandering, the cats would eventually learn to step on the switch and free themselves. He termed the exit opening a reinforcement event. Ivan Pavlov and then B. F. Skinner further refined the theory of behavior modification using reinforcement events. Subsequent researchers used these ideas to develop technology to control elevators, robots, computer games, self-driving cars, and many other applications.1 Reinforcement learning is a form of machine learning that uses trial and error with reinforcement, much like Thorndike used on his cats.
Let’s start with the challenge of controlling elevators. Many of us have stood waiting in a lobby in front of a bank of elevators, wondering why those stupid machines cannot be more efficient. We understand how to manage a single elevator. It goes up to the highest floor, picks up waiting passengers, and makes its way down, often stopping several times on the way. With more than one elevator, however, the optimal algorithm is far more complex.
Consider a system with four elevators that services twenty floors.2 If each of the four elevators goes to the highest floor on every trip, one will pick up all the passengers, and the others will have no one to pick up. Alternatively, we could program each elevator to service five of the twenty floors. But when the top floors are busy and the bottom floors are not, or vice versa, the result will be unhappy customers.
With eighty call buttons inside the elevators (twenty floors times four elevator cars) and thirty-eight up/down buttons in the hallways (the top and bottom floors have one button each, and the other eighteen hallways have two buttons), we have a total of 118 buttons. If we record which of these 118 buttons are pressed and which are not at any point in time, we will find well over a trillion-trillion possible combinations.3 Each of these combinations is a state, and the set of all possible combinations is the state space. This matters because we must create a program that responds optimally to each one of these trillion-trillion possible states.
In response to a state, a system can give each elevator only one of three actions: stop at the current floor, go up, or go down. Therefore, while the state space is enormous, the number of possible actions (i.e., the action space) is small. The problem to solve is this: Given any state,4 which action should each of the four elevators take?
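To make the arithmetic concrete, here is a small Python sketch (the names are purely illustrative) that counts the buttons, the resulting button-press states, and the three possible actions per elevator:

```python
# Illustrative arithmetic for the elevator example.
FLOORS = 20
ELEVATORS = 4

call_buttons = FLOORS * ELEVATORS            # 80 buttons inside the cars
hall_buttons = 2 + (FLOORS - 2) * 2          # 38 up/down buttons in the hallways
total_buttons = call_buttons + hall_buttons  # 118 buttons in total

# Each button is either pressed or not, so the number of possible
# combinations (the state space) is 2 raised to the number of buttons:
# roughly 3 x 10^35, well over a trillion-trillion.
num_states = 2 ** total_buttons

# The action space, by contrast, is tiny.
ACTIONS = ("stop", "up", "down")

print(total_buttons, num_states, len(ACTIONS))
```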
First, we must define our business objective. We may want to do any of the following:
•Minimize the average wait time.
•Minimize the maximum wait time.
•Minimize the average total transport time.
•Minimize the number of times anyone must wait more than one minute.
•Minimize the number of people waiting.
•Minimize the number of people whose wait time is higher than the average wait time.
•Minimize the cost of operations (i.e., power consumption).
Our goal is to pick one of these business objectives and define a policy (i.e., a function) to achieve it.5 The policy will take as input any of our trillion-trillion states and produce as output what each of the four elevators should do (go up, go down, or stop at a floor). And we do not want just any policy; we want to find the optimal policy.6
From the chosen business objective, we can derive a formula, termed a reward function, that we can use to measure the performance of any policy. For example, we could take a guess at an initial policy that assesses the state and sends out instructions to the elevators every five seconds. We could try this policy for a week and use the reward function to evaluate its performance. If we chose to minimize the average wait time, our reward function would tell us the average wait time of customers that week.7
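In code, a policy and a reward function for the "minimize the average wait time" objective might look something like the following sketch (the function names and the random first guess are assumptions for illustration, not a prescription):

```python
import random

ACTIONS = ("stop", "up", "down")

def initial_policy(state):
    """A first-guess policy: choose an action for each of the four elevators.

    `state` stands for whatever description of the 118 buttons we use;
    this naive version ignores it and picks actions at random.
    """
    return [random.choice(ACTIONS) for _ in range(4)]

def reward(wait_times_for_the_week):
    """Reward function for the 'minimize average wait time' objective.

    We negate the average wait so that a larger reward means a
    better-performing policy.
    """
    return -sum(wait_times_for_the_week) / len(wait_times_for_the_week)
```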
Testing the algorithm, however, is problematic. We could try it out in a hotel with actual customers, find out what does not work well, adjust the program, try it out again, and keep trying until the program works well. The problem with this approach is that not only would it take years, but customers would be so unhappy that, by the time we figured it out, the hotel might have no guests.
A better approach is to construct a simulator and test algorithms there. The simulator is a computer program that, in this case, simulates people’s arrival and the movement of the elevators. For example, we might program the simulator to allow a capacity of eight people per elevator, take five seconds to transit each floor, and have the elevators’ doors stay open for fifteen seconds.
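The simulator's parameters can be captured in a small configuration object; the sketch below simply records the numbers mentioned above (the class and field names are hypothetical):

```python
from dataclasses import dataclass

@dataclass
class ElevatorSimulatorConfig:
    """Settings for the simulated building and elevators."""
    elevators: int = 4
    floors: int = 20
    capacity_per_elevator: int = 8      # people per car
    seconds_per_floor: int = 5          # travel time between adjacent floors
    door_open_seconds: int = 15         # how long the doors stay open
    decision_interval_seconds: int = 5  # how often the policy issues instructions
```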
Simulators are critical for testing possible solutions to many real-world reinforcement learning problems. For example, we do not want bad algorithms to crash self-driving cars in the real world; we want the crashes to occur in a simulator. In robotics, simulators are important because robots are expensive pieces of machinery, and constant trial and error results in excessive wear and tear at best and accidents at worst.8
By running the elevator tests in our simulator, we can simulate a week's worth of traffic in a few seconds and evaluate the initial random policy function against our business objective (e.g., minimizing the average wait time). Then we can change the values of the weights that define the policy function in a way that produces a better policy (e.g., a policy with a lower average wait time). It turns out that, mathematically, we can compute weight values that will be better, but we cannot calculate the optimal values from just one simulation. We will need to repeat this process over and over until the result (e.g., the average wait time) stops getting better, indicating that we have found a policy that achieves the optimal average wait time.9
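Put together, the training procedure is a loop: simulate a week, measure the average wait time, adjust the weights, and repeat until the result stops improving. Here is a minimal sketch, in which `simulate_one_week` and `improve_weights` are hypothetical stand-ins for the simulator and for the mathematical update rule:

```python
def train(weights, simulate_one_week, improve_weights, patience=10):
    """Repeatedly evaluate and adjust the policy weights.

    Stops once the simulated average wait time has not improved for
    `patience` rounds in a row.
    """
    best_wait = float("inf")
    rounds_without_improvement = 0
    while rounds_without_improvement < patience:
        avg_wait = simulate_one_week(weights)     # evaluate the current policy
        if avg_wait < best_wait:
            best_wait = avg_wait
            rounds_without_improvement = 0
        else:
            rounds_without_improvement += 1
        weights = improve_weights(weights, avg_wait)  # nudge toward a better policy
    return weights, best_wait
```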
REINFORCEMENT VERSUS SUPERVISED LEARNING
In supervised learning systems, every prediction made on a training table observation can be analyzed right away by comparing the predicted value with the actual observed output value (i.e., the value in the output column). For reinforcement learning, it is not possible to draw an immediate conclusion about an action. When the system tells the elevator to go up and stop on the seventh floor, that action has consequences. For example, it delays the ability to pick up someone on the fifth floor. Also, the other elevators will not need to stop at the seventh floor. The point is that we cannot tell the effect of a single selected action on the average wait time (or whatever goal is selected). We can only observe the impact of the policy function on the average wait time over a period of time.
At the same time, reinforcement learning has many similarities to supervised learning. Supervised systems learn a function that can predict output values for previously unobserved input data. Similarly, reinforcement systems learn a policy function that can identify the optimal action for a given state, even if the system did not encounter that state during training. As in supervised learning, the function learned by the reinforcement learning algorithm is specific to the task.
GAMES
Learning a computer game like Pong is a classic reinforcement learning problem. The states are the pixels in the game image at any point in time. The actions are to move the paddle up or down. The reward function is simply the number of points scored. One advantage of computer games is that we do not need to create a simulator; the game itself is a simulator.
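The interaction between the learner and the game can be written as a simple loop. In this sketch, `pong` is a hypothetical game object that returns the screen pixels, the points scored, and whether the game is over:

```python
def play_one_episode(pong, policy):
    """Play one game of Pong and return the total points scored."""
    state = pong.reset()        # the starting screen image (the first state)
    total_reward = 0
    done = False
    while not done:
        action = policy(state)                   # "up" or "down"
        state, points, done = pong.step(action)  # next screen, points, game over?
        total_reward += points                   # the reward is points scored
    return total_reward
```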
Reinforcement learning for game playing came into its own in 2013, when a little company in London named DeepMind used neural network technology to learn functions to play seven video games10 on an Atari 2600.11 It learned to play three of those games12 better than a human. Shortly after that, Google bought the company.
The DeepMind team also built a system to play Go that beat the European champion five straight games.13 Go is considered even more complicated than chess because the number of possible moves is 10^360; in chess, it's 10^123. The same team14 then created a generic game-playing algorithm based on the AlphaGo system that learned chess, shogi (Japanese chess), and Go. This system achieved superhuman performance on all three games within twenty-four hours. In each case, the reward function was simple (a game is either won or lost). Also, in each case, the system started playing games against itself and used the game outcomes to determine the value of the reward function. After each game, it gradually updated its network weights to play slightly better than the previous time.
Like supervised learning systems, reinforcement learning systems can only learn patterns found in the training data. For example, a team of University of California at Berkeley researchers trained a system using reinforcement learning to play a game in which a soccer player kicks a ball past a goalie. However, when the movements of the goalie were reprogrammed to do things that the system did not see during training, such as sitting down, the kicking player completely lost the ability to kick the ball.15
ROBOTICS
Robots come in a variety of forms. A robot might look something like a human, with arms and legs, but it might also be an arm bolted to the floor or a wall, or a vacuum cleaner on a set of wheels. As is depicted in figure 9.1, a robot may have cameras and image recognition software that enable it to see. It may have audio sensors, speech recognition software, and natural language processing software that allow it to hear. No one knows yet how to build computers with commonsense reasoning capabilities; however, many consider it a logical possibility, so I have included it in the figure. Some companies specialize in creating robots that look as much like humans as possible.16
Figure 9.1 Robot components. PhonlamaiPhoto – Licensed from iStockphoto ID 1050049486.
Robots also have varying numbers of components that can move in different directions and are guided by an algorithm known as a controller, a miniature brain that governs the robot’s movements. Robots also have other sensors that provide input to the controller. For example, a sensor might tell the controller how much pressure a robotic hand is applying to an object.
Companies across many industries employ robots in all sorts of functions, mostly for tasks with repetitive motions, such as welding, gluing, and painting in the automotive industry. Amazon has robots that move boxes from one part of a warehouse to another.
Programming robot controllers for a task is time-consuming, and the resulting controller will only work for a well-defined set of environmental conditions. A robot arm that places welds on a car requires the car to be in the same position for every weld. Such robots perform the same specific task over and over in the same way, so they are far easier to program than robots that must navigate the real world.
Figure 9.2 An assembly line composed of robot arms. © imaginima – Licensed from iStockphoto ID 1057277428.
Robots navigate based on sensors that provide input about the environment and a controller that can interpret those sensors and direct the robot components accordingly. Tasks that sound easy to us, such as walking and picking specified objects out of bins, are challenging to teach to robots. There is a hilarious video of robots falling while trying to walk at a 2015 DARPA-sponsored contest named the Robotics Challenge.17 Researchers have made progress since then,18 but even the latest models have difficulties navigating the real world.19 A task like picking specified objects out of a bin is challenging to achieve with conventional controller programming, because the robot must sense the environment and adapt accordingly. It must find the right spot on an object to pick it up, and it must exert enough pressure so the object does not fall but not so much pressure that the object breaks. Controllers trained using reinforcement learning to pick items out of bins can be more flexible than conventionally coded controllers, and researchers are working to develop them.
Reinforcement learning technology offers the promise of robot controllers that can learn rather than having to be programmed, and of robots that can generalize, meaning they can function within a wider range of environmental conditions. Researchers have applied both reinforcement learning and supervised learning to robotics. However, movement controllers appear to be better suited to reinforcement learning. Consider programming a robotic arm to find and pick up a ball. The reward gets triggered when the ball is picked up. None of the individual small movements that lead to picking up the ball has a reward of its own, so it is hard to label each incremental movement as correct or incorrect.
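This is what such a sparse reward looks like in code; the snippet below is a trivial illustration, not any particular lab's reward function:

```python
def pick_up_ball_reward(ball_is_held):
    """Reward for the ball-picking task.

    Only final success is rewarded; none of the small arm movements
    along the way receives any feedback of its own.
    """
    return 1.0 if ball_is_held else 0.0
```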
Reinforcement learning for robotics is significantly different from reinforcement learning for game playing. First, defining a reward function for robotics tasks can be challenging. In an Atari game, the reward function is simple. A reward is a scored point. But what should the reward function be for turning on a light switch or preparing a meal? In the laboratory, human researchers can indicate when the light switch has been successfully turned on and can decide when a meal is ready and if it tastes good. That will not work when a researcher is not present to provide feedback. Second, it is often difficult to obtain many trials in a robotics setting because robots and other industrial automation equipment are expensive to run and maintain. A simulator can be used, but it is difficult if not impossible to make simulators that are identical to the real-world environment in all respects.
IMITATION LEARNING
It turns out that both issues are often best managed by learning from an expert who demonstrates how to do a task instead of defining a reward function and acquiring many trials. When we teach a young adult to drive a car, we do not give them a reward function. Instead, we demonstrate how to drive.20 The trainee observes the driver, who is assumed to be following an optimal driving policy and maximizing rewards.21 Learning by copying the behavior of a person is known as imitation learning. Researchers have successfully used imitation learning in a wide variety of research settings, ranging from robot arm tasks to simulating helicopter maneuvers.
One crucial aspect of imitation learning for robots is that the teacher is often someone who knows how to do a task but is not knowledgeable about machine learning. Researchers use imitation learning to teach robots to do tasks such as picking up an object and placing it somewhere else. One difficulty is that the demonstrator often has different degrees of freedom from the robot. For example, people have wrists and elbows that bend, but most robots do not have bendable wrists and elbows. This mismatch is known as the correspondence problem; the demonstrator might do the task in a way that the robot cannot imitate.22
One method to avoid the correspondence problem is to have a human physically guide the robot to perform a task and have the robot record both its sensor inputs and the positions of its parts. For example, there is a video23 that shows a Google DeepMind researcher guiding a robot arm to move to a door handle, then grab it, turn it, and pull it open. This type of training is known as kinesthetic feedback. Other researchers have used kinesthetic feedback to teach a robot to pour a glass of water,24 play Ping-Pong,25 and pick up and place objects.26
Another approach is remote operation. For example, researchers used remote operation to train a robot arm to pick up a towel, wipe an object, and then put the towel back in its original location. A University of California at Berkeley video27 shows how researchers can train a robot using a virtual reality headset.
With both kinesthetic feedback and remote operation, the robot uses its own sensors to gather data about the environment and about its position and movement. In contrast, if a human merely demonstrates the task, all the data from the robot's sensors (e.g., camera images) must somehow be translated into robotic movements. There are several reinforcement learning methods that take the robot's sensor data, plus the data on the movements of its body parts, and learn a policy that enables the robot to do the task.
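One simple way to learn from such recordings, often called behavior cloning, is to treat the recorded sensor readings and movements as a supervised learning problem. The sketch below assumes a hypothetical `fit_model` routine (for example, a neural network trainer) and is only meant to show the shape of the approach:

```python
def learn_policy_from_demonstrations(demonstrations, fit_model):
    """Learn a policy from (sensor_reading, movement) pairs.

    `demonstrations` holds the data recorded while a human guided the
    robot, whether by kinesthetic feedback or by remote operation.
    """
    sensor_readings = [sensors for sensors, movement in demonstrations]
    movements = [movement for sensors, movement in demonstrations]
    model = fit_model(sensor_readings, movements)   # supervised learning step
    return lambda sensors: model.predict(sensors)   # the learned policy
```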
STILL NARROW
In supervised learning, feedback is available for each observation in the training table in the form of an output value. In reinforcement learning, feedback arrives in the form of intermittent rewards, and the system must determine how a sequence of actions contributes to a reward. With this approach, reinforcement learning has found success in game playing and robotics. With imitation learning, a system can even learn to perform a task by following a demonstration. However, like supervised learning systems, reinforcement learning systems learn a function that is specific to one narrowly defined task.