CONTENTS
Foreword by Kevin Scott, CTO, Microsoft
2 SELF-DRIVING CARS AND THE DARPA GRAND CHALLENGE
3 KEEPING WITHIN THE LANES: PERCEPTION IN SELF-DRIVING CARS
4 YIELDING AT INTERSECTIONS: THE BRAIN OF A SELF-DRIVING CAR
5 NETFLIX AND THE RECOMMENDATION-ENGINE CHALLENGE
6 ENSEMBLES OF TEAMS: THE NETFLIX PRIZE WINNERS
7 TEACHING COMPUTERS BY GIVING THEM TREATS
8 HOW TO BEAT ATARI GAMES BY USING NEURAL NETWORKS
9 ARTIFICIAL NEURAL NETWORKS’ VIEW OF THE WORLD
10 LOOKING UNDER THE HOOD OF DEEP NEURAL NETWORKS
11 NEURAL NETWORKS THAT CAN HEAR, SPEAK, AND REMEMBER
12 UNDERSTANDING NATURAL LANGUAGE (AND JEOPARDY! QUESTIONS)
13 MINING THE BEST JEOPARDY! ANSWER
14 BRUTE-FORCE SEARCH YOUR WAY TO A GOOD STRATEGY
15 EXPERT-LEVEL PLAY FOR THE GAME OF GO
17 FIVE DECADES (OR MORE) FROM NOW
LIST OF FIGURES
A centrifugal governor, the precursor to electronic control systems. As the engine runs more quickly, the rotating axis with the “flyballs” spins more rapidly, and centrifugal force pulls the flyballs outward. Through a series of levers, this closes the valve to the engine. If the engine is running too slowly, the valve opens to allow more fuel through.
The control loop for a PID controller, the three-rule controller described in the text. The controller uses feedback from the speedometer to adjust inputs to the engine, such as power.
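To make the three-rule controller concrete, here is a minimal sketch of a PID feedback loop in Python. The gains, the 60 mph target, and the crude car model are illustrative assumptions, not values from the text.

```python
# A minimal sketch of the PID control loop from the figure: the controller
# compares the speedometer reading with the target speed and adjusts engine
# power. The gains (kp, ki, kd) and the toy car model are illustrative only.

def pid_controller(kp, ki, kd, target):
    integral = 0.0
    previous_error = None

    def step(measured_speed, dt):
        nonlocal integral, previous_error
        error = target - measured_speed          # proportional term: how far off are we?
        integral += error * dt                   # integral term: accumulated past error
        derivative = 0.0 if previous_error is None else (error - previous_error) / dt
        previous_error = error
        return kp * error + ki * integral + kd * derivative

    return step

# Toy simulation: a "car" whose speed rises with applied power and decays with drag.
controller = pid_controller(kp=0.8, ki=0.1, kd=0.2, target=60.0)
speed = 0.0
for _ in range(50):
    power = controller(speed, dt=0.1)
    speed += 0.1 * (power - 0.05 * speed)        # crude dynamics, not a real car model
print(round(speed, 1))                           # climbs toward the 60 mph target
```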
An example map. Darker shades indicate a higher travel cost.
The search “frontier” at different iterations of Dijkstra’s algorithm.
An optimal path through this map.
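For readers who want to see the frontier-expansion idea in code, here is a minimal sketch of Dijkstra’s algorithm on a small cost grid like the maps in these figures. The grid values and the four-neighbor movement rule are illustrative assumptions.

```python
# A minimal sketch of Dijkstra's algorithm on a small grid map: each cell has a
# travel cost, and the "frontier" is a priority queue of cells ordered by the
# cheapest known cost to reach them. The grid values are made up for illustration.
import heapq

def dijkstra(grid, start, goal):
    rows, cols = len(grid), len(grid[0])
    frontier = [(0, start)]                  # (cost so far, cell)
    best = {start: 0}
    came_from = {}
    while frontier:
        cost, cell = heapq.heappop(frontier)
        if cell == goal:
            break
        r, c = cell
        for dr, dc in [(-1, 0), (1, 0), (0, -1), (0, 1)]:
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols:
                new_cost = cost + grid[nr][nc]        # darker cells = higher travel cost
                if new_cost < best.get((nr, nc), float("inf")):
                    best[(nr, nc)] = new_cost
                    came_from[(nr, nc)] = cell
                    heapq.heappush(frontier, (new_cost, (nr, nc)))
    # Walk backward from the goal to recover the optimal path.
    path, cell = [goal], goal
    while cell != start:
        cell = came_from[cell]
        path.append(cell)
    return list(reversed(path))

grid = [[1, 1, 5, 1],
        [1, 9, 5, 1],
        [1, 1, 1, 1]]
print(dijkstra(grid, start=(0, 0), goal=(0, 3)))
```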
A simplified summary of the software and hardware organization of Stanley, the Stanford Racing Team’s 2005 Grand Challenge winner.
Area A in the DARPA Urban Challenge. Professional human drivers circled in the outer loop as self-driving cars circulated on the right half. The primary challenge here for autonomous cars was to merge into a lane of moving traffic at the stop sign. Self-driving cars were allowed to circulate as many times as they could within their time limit.
Boss’s architecture, simplified: the hardware, the perception (and world-modeling), and the reasoning (planning) modules, organized in increasing levels of reasoning abstraction from left to right. Its highest-level reasoning layer (planning, far right) was organized into its own three-layer architecture: the controller (route planner module), the sequencer (Monopoly board module), and the deliberator (motion planner module). The motion planner could have arguably been placed with the sequencer.
The handle intersection finite state machine. The Monopoly board module steps through the diagram, from “start” to “done.” The state machine waits for precedence and then attempts to enter the intersection. If the intersection is partially blocked, it is handled as a “zone,” which is a complex area like a parking lot, instead of being handled as a “lane”; otherwise the state machine creates a “virtual lane” through the intersection and drives in that lane. This is a simplified version of the state machine described in Urmson et al. (see note 7).
The motion planner of a self-driving car has been instructed by the Monopoly board to park in the designated parking spot.
The car has an internal map represented as a grid in which obstacles fill up cells in that grid. The motion planner also uses a cost function when picking its path (shaded gray). The cost function incorporates the distance to obstacles (in this case, other cars).
The motion planner searches for a path to its goal. The path will comprise many small path segments that encode velocity, position, and orientation. Although this picture suggests otherwise, the search is actually performed from the end state back to the start state.
A candidate path to the goal.
By applying a classifier to the recipe “Holiday Baked Alaska” we can see whether it would be good for kids. The weights stay fixed, and the details (and hence the recipe score) change for each recipe. “Holiday Baked Alaska” details are from the Betty Crocker website (see notes).
Some examples of star ratings that Netflix customers might have given various movies. Netflix provided some of the ratings in the matrix (shown as numbers). Competitors needed to predict some of the missing ratings (shown as question marks).
A test to determine whether Steven Spielberg would like the movie Jurassic Park. Here, we can assume Jurassic Park falls into two genres: science fiction and adventure. Steven Spielberg tends to like science fiction movies, comedies, and adventures, and tends not to like horror movies, as indicated by his genre affinities. We combine these into a score by multiplying each genre score, which is 0 or 1, by Spielberg’s affinity for that genre, and adding the results together. The result is a fairly high “affinity” score describing how much Spielberg likes Jurassic Park.
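The affinity calculation in this figure is just a weighted sum, as the short sketch below shows. The specific genre list and affinity numbers are made up for illustration.

```python
# A minimal sketch of the affinity score described in the figure: multiply each
# genre score (1 if the movie is in that genre, else 0) by the viewer's affinity
# for that genre and add up the results. The numbers are illustrative.
spielberg_affinities = {"science fiction": 0.9, "adventure": 0.7,
                        "comedy": 0.6, "horror": -0.8}
jurassic_park_genres = {"science fiction": 1, "adventure": 1,
                        "comedy": 0, "horror": 0}

score = sum(jurassic_park_genres[g] * spielberg_affinities[g]
            for g in spielberg_affinities)
print(round(score, 2))   # 1.6, a fairly high affinity
```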
Popularity of the movie The Matrix over time.
The chart shows team progress toward the Netflix Prize. The team that ultimately won the competition was BellKor’s Pragmatic Chaos.
Two of the Atari games played by DeepMind’s agent: Space Invaders (top) and Breakout (bottom).
DeepMind’s Atari-playing agent ran continuously. At any given moment, it would receive the last four screenshots’ worth of pixels as input, run an algorithm to decide on its next action, and output that action.
The golf course used in the reinforcement-learning example. Terrain types, shaded from light gray to black: the green (least difficult), fairway, rough, sand trap, and water hazard (most difficult). The starting point is on the left, and the goal is in the top-right corner.
Your goal is to hit the ball from the start position into the hole in as few swings as possible; the ball moves only one square (or zero squares) per swing.
The golf course also has explosive mines, each of which is marked with an x. You must avoid hitting these.
One way to train a reinforcement-learning agent is with simulation. First the agent plays through a game to generate a series of state-action pairs and rewards, as shown in the left panel. Next, as shown in the right panel, the agent’s estimate of the future reward for taking different actions in a given state is updated using the state-action pairs it has experienced. This particular method is sometimes called “temporal difference, or TD, learning.”
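A minimal sketch of a temporal-difference update of this kind is shown below, written as a Q-learning-style rule. The toy trajectory and the learning parameters are illustrative assumptions, not the golf example from the text.

```python
# A minimal sketch of a temporal-difference update: after each
# (state, action, reward, next state) step of a simulated game, the agent
# nudges its estimate of future reward toward what it actually observed.
q = {}                                   # q[(state, action)] = estimated future reward
alpha, gamma = 0.1, 0.9                  # learning rate and discount factor
actions = ["left", "right"]

def td_update(state, action, reward, next_state):
    best_next = max(q.get((next_state, a), 0.0) for a in actions)
    old = q.get((state, action), 0.0)
    # Move the old estimate a little toward the observed reward plus
    # the discounted value of the best action from the next state.
    q[(state, action)] = old + alpha * (reward + gamma * best_next - old)

# One simulated "game": a short trajectory of state-action pairs and rewards.
trajectory = [(0, "right", 0, 1), (1, "right", 0, 2), (2, "right", 1, 3)]
for episode in range(100):
    for state, action, reward, next_state in trajectory:
        td_update(state, action, reward, next_state)
print(round(q[(0, "right")], 2))         # value propagates back from the reward
```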
Trajectories (white paths) made by the golf-playing agent. (a): a trajectory made by the agent after playing 10 games. (b): a trajectory made by the agent after playing 3,070 games.
A simple neural network.
Propagation of values through a neural network. In neural networks, the value of a neuron is either specified by outside data—that is, it’s an “input” neuron—or it is a function of other “upstream” neurons that act as inputs to it. When the value of the neuron is determined by other neurons, the values of the upstream neurons are weighted by the edges, summed, and passed through a nonlinear function such as max(x,0), tanh(x), or an S-shaped function, exp(x)/(exp(x) + 1).
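The propagation rule in this caption, a weighted sum followed by a nonlinearity, can be written in a few lines. The weights and inputs below are made-up numbers.

```python
# A minimal sketch of the propagation rule described in the caption: each
# downstream neuron takes a weighted sum of its upstream neurons and passes it
# through a nonlinearity such as max(x, 0) or exp(x)/(exp(x) + 1).
import math

def relu(x):
    return max(x, 0.0)

def sigmoid(x):
    return math.exp(x) / (math.exp(x) + 1.0)

def layer(values, weights, activation):
    # One output value per row of weights: weighted sum, then nonlinearity.
    return [activation(sum(w * v for w, v in zip(row, values))) for row in weights]

inputs = [0.5, -1.0, 2.0]                              # "input" neurons set by outside data
hidden = layer(inputs, [[0.1, 0.4, -0.2], [0.7, 0.0, 0.3]], relu)
output = layer(hidden, [[1.5, -2.0]], sigmoid)
print([round(v, 3) for v in hidden], round(output[0], 3))
```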
The performance of several neural networks (b)–(d) trained to represent a target image (a). The networks take as their inputs the x and y coordinates of each pixel in the image and predict the brightness of each pixel, in the range 0 to 1.
A neural network designed to play the game of golf. The right two layers, from the “Current position on course” to the “Joystick direction,” determine where the agent should aim, given the current position of the ball and the goal. The left two layers convert the pixels of the screen into those coordinates.
A convolutional layer with two filters. Each filter scans the image and produces a resulting image in which each “pixel” corresponds to a patch of the input image passed through that filter.
Convolutional filters for the flagpole in the hole (left) and the ball (right).
A convolutional layer with two filters. The filters are classifiers that scan the input image looking for certain patterns. The output of each filter is a set of neurons, organized as an image, which are “bright” wherever patches of pixels in the original image matched the filter.
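A minimal sketch of this filter-scanning operation is shown below; the tiny image and the filter are made-up examples, not filters from any trained network.

```python
# A minimal sketch of a convolutional layer: a small filter is slid across the
# image, and each output "pixel" measures how well the patch underneath matches
# the filter.
import numpy as np

def convolve(image, filt):
    fh, fw = filt.shape
    out = np.zeros((image.shape[0] - fh + 1, image.shape[1] - fw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            patch = image[r:r + fh, c:c + fw]
            out[r, c] = np.sum(patch * filt)     # bright where the patch matches
    return out

image = np.array([[0, 0, 0, 0, 0],
                  [0, 1, 0, 0, 0],
                  [0, 1, 0, 1, 0],
                  [0, 0, 0, 1, 0],
                  [0, 0, 0, 0, 0]])
vertical_filter = np.array([[1], [1]])           # responds to two stacked bright pixels
print(convolve(image, vertical_filter))
```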
A layer that converts white pixels in a convolutional layer into coordinates. In this figure, the weight between a pixel and the neuron giving the x-coordinate is equal to the x-coordinate of that pixel, and the weight between a pixel and the neuron giving the y-coordinate is equal to the y-coordinate of the pixel. If the neuron at (4,3) in the left layer is glowing with a value of 1 and all other neurons are dark, then the values of the output of this layer will reflect this: they will be x=4 and y=3.
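The coordinate-reading trick described here amounts to two weighted sums, as in this short sketch; the image size and pixel location are arbitrary.

```python
# A minimal sketch of the coordinate-reading layer in the caption: the weight
# from each pixel to the "x" neuron is that pixel's x-coordinate, and likewise
# for "y", so a single bright pixel at (4, 3) yields outputs x = 4 and y = 3.
import numpy as np

width, height = 8, 6
pixels = np.zeros((height, width))
pixels[3, 4] = 1.0                                    # one glowing pixel at x=4, y=3

x_weights = np.tile(np.arange(width), (height, 1))    # weight = pixel's x-coordinate
y_weights = np.tile(np.arange(height), (width, 1)).T  # weight = pixel's y-coordinate

x = np.sum(pixels * x_weights)
y = np.sum(pixels * y_weights)
print(x, y)                                           # 4.0 3.0
```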
The convolutional layer of the Atari neural network. Layer 1 shows the input to the network: a screenshot of the game (the Atari network actually used the four most recent screenshots). The next layer is a convolutional layer that uses 32 filters to search for 32 distinct patterns of pixels in the first layer. Applying these filters produces 32 images, each of which is close to 0 everywhere except where its filter matches part of the input screenshot.
The architecture for an Atari-playing agent.
Plots to illustrate overfitting: (a) a sample of points (input, output) for which we hope to build a model; (b) a complex and overfit model of these points (the black curvy line); (c) a linear model of these points (the straight line); and (d) a complex but not overfit model of these points (the black, not-very-curvy line).
The architecture of AlexNet, the artificial neural network that won the 2012 ImageNet Challenge and set the stage for further improvements in image classification. AlexNet had five convolutional layers followed by three fully connected layers. Much of the network was trained on two different processors, so that some layers didn’t process any inputs from the convolutional layers handled by the other processor. The input layer represented the red-green-blue values of an image, while the output layer had 1,000 neurons, one for each of the categories predicted by the network. Image adapted with permission from Russakovsky et al., “ImageNet Large Scale Visual Recognition Challenge.”
Patterns of pixels that activate filters in various layers of AlexNet in convolutional layers 1 (a), 2 (b), 3 (c), and 4 (d). These filters search for patterns of light and dark (they also search for certain colors, which you can’t see in this picture). Images used with permission from Yosinski et al., “Understanding Neural Networks Through Deep Visualization.”
Image patches that activate the neurons in the output layer of our network. Neurons correspond to categories in the ImageNet Challenge (A: great white shark; B: hourglass; C: hen; D: wall clock). Images used with permission from Yosinski et al., “Understanding Neural Networks Through Deep Visualization.”
Activation functions for neural networks. The S-curve (a) (formally known as a sigmoid) was used for a long time, but ReLU activation functions (c) have become popular because they make training deep neural networks easier.
A photo of foster kittens, (a), along with a reinterpretation of the photo based on what the network sees after many iterations of the Deep Dream algorithm, (b), and style transfer algorithms, (c) and (d). Image (c) uses style from a Vincent van Gogh painting, while image (d) uses a style created from The Simpsons. All images except (a) were generated via https://
Recurrent neural network (RNN) units unfolded in time. Each unit has a state variable h that transitions from unit to unit. The transition is determined by the input x and the state of the previous unit. Each unit also produces an output y to share information about the state with the rest of the network. The dark boxes represent transformations—typically encoded by other neurons in the network—that may take place within the unit.
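A minimal sketch of one such recurrent unit, unrolled over a short input sequence, is shown below. The scalar weights are arbitrary stand-ins for the transformations in the dark boxes.

```python
# A minimal sketch of the unrolled recurrent unit in the figure: at each step
# the state h is updated from the input x and the previous state, and an
# output y is produced from the new state. Weights here are arbitrary numbers.
import math

w_xh, w_hh, w_hy = 0.5, 0.8, 1.2              # input-to-state, state-to-state, state-to-output

def rnn_step(h_prev, x):
    h = math.tanh(w_xh * x + w_hh * h_prev)   # new state from input and previous state
    y = w_hy * h                              # output shares the state with the rest of the network
    return h, y

h = 0.0                                       # initial state
for x in [1.0, 0.0, -1.0, 0.5]:               # a short input sequence
    h, y = rnn_step(h, x)
    print(round(h, 3), round(y, 3))
```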
The architecture for Deep Speech 2, Baidu’s speech-recognition system. The network is trained using written transcriptions of recordings of human speech and a method known as connectionist temporal classification, which searches for an alignment between the transcription and the output of the fully connected layers. This image is adapted from Amodei et al., “Deep Speech 2” (cited in note 1).
An image-captioning neural network. The state for each RNN unit summarizes how much of the caption has been generated. The output of each unit is a probability distribution over words; and the input to each unit is the previously generated word. The input to the first unit is the output of a convolutional neural network. This image is adapted from Vinyals et al., “Show and Tell.”
Long Short-Term Memory (LSTM) units in a recurrent neural network. This particular LSTM is the one used by Google for its image caption generator. As with a general RNN, the state can shift with each subsequent unit based on what is observed in the network. LSTMs like this one use “gates” to modify the input, the output, and the state for each unit, typically with just a multiplication. Adapted from Vinyals et al., “Show and Tell” (cited in note 8).
A very basic overview of the very complicated DeepQA pipeline.
A parse tree for the sentence “Milorad Čavić almost upset this man's perfect 2008 Olympics, losing to him by one hundredth of a second.” This tree is a traditional parse, much like what you might have learned in grade school. Watson didn’t parse a sentence exactly like this, but it used the same idea.
Some of the most important information Watson looks for in its clues during its Question Analysis phase.
The evidence retrieval phase of Watson. Watson first filtered its answer candidates using a lightweight filter and then collected reams of evidence for each of its remaining candidates from its databases and search engines.
The Merging and Ranking phase in the DeepQA framework on which Watson ran. This phase consisted of seven operations, each of which could include a merge step, a transform-and-filter step, and a linear-classifier step that used different classifiers for different types of questions. The seven operations were unique in that their merge, transform, and classify steps differed (some operations even skipped one or more of these steps), but the framework provided the capability for each step within each operation.
A search tree to find all ways to fill a 2 × 2 grid with the numbers 1 through 3. The number of states to search through grows quickly with each level of the tree, and there are 3⁴ = 81 states at the bottom of this tree. A Sudoku board with 45 empty spots would have an unfathomable 9⁴⁵ states at the bottom of the tree.
A subset of the search tree in figure 14.2, showing only selected states in the tree. At each level of the tree, the algorithm selects the next empty square and attempts to fill it in with each of the numbers 1 through 3 (shown in bold). The algorithm fills the spot with one of these values, descends to the next level, and attempts to fill in the next number.
The number of states just two levels into the Sudoku search tree is 9 × 9 = 81. Because the number of states grows by a factor of 9 with each level in the tree, we must use a pruning algorithm to narrow the search down.
A “pruned” search tree to find the values in a Sudoku board. Most branches are cut short because they would lead to a Sudoku board that couldn’t lead to a valid Sudoku layout.
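The pruned search in these figures is the classic backtracking pattern, sketched below for the tiny 2 × 2 grid filled with the numbers 1 through 3. This is an illustrative sketch, not Sudoku-solving code from the text.

```python
# A minimal sketch of a pruned ("backtracking") search on a 2x2 grid filled
# with the numbers 1 through 3: a branch is cut short as soon as it repeats a
# number in a row or column, so most of the 81 leaf states are never visited.
def valid(grid, row, col, value):
    # Prune any branch that repeats a value in this row or column.
    return all(grid[row][c] != value for c in range(2)) and \
           all(grid[r][col] != value for r in range(2))

def solve(grid, pos=0):
    if pos == 4:                       # all four cells filled: a valid layout
        return [row[:] for row in grid]
    row, col = divmod(pos, 2)
    for value in (1, 2, 3):
        if valid(grid, row, col, value):
            grid[row][col] = value
            solution = solve(grid, pos + 1)
            if solution:
                return solution
            grid[row][col] = 0         # undo and try the next value
    return None                        # dead end: back up a level

print(solve([[0, 0], [0, 0]]))         # e.g. [[1, 2], [2, 1]]
```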
A multilayer search tree representing choices in a two-player game. Each level of the tree represents a single player’s choice between two actions. Gray dots at the end represent the outcome that you win, while white dots represent the outcome that I win.
A multilayer search tree in which each state is colored with the result of an evaluation function. This evaluation function is perfect: it describes which player will win the game from each state, provided that each player plays perfectly. In practice, most evaluation functions are approximate.
Using an evaluation function to search to a fixed depth in a two-player game.
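A minimal sketch of fixed-depth search with an evaluation function (minimax) is shown below. The toy game, in which each move adds to or subtracts from a running score, is purely illustrative.

```python
# A minimal sketch of fixed-depth, two-player search: the players alternate,
# and when the search reaches its depth limit it falls back on an
# (approximate) evaluation function instead of playing the game to the end.

def evaluate(state):
    return state                       # stand-in for a real evaluation function

def minimax(state, depth, my_turn):
    if depth == 0:
        return evaluate(state)
    moves = [+1, -1, +2]               # toy actions available to either player
    if my_turn:                        # I pick the move that maximizes the score
        return max(minimax(state + m, depth - 1, False) for m in moves)
    else:                              # you pick the move that minimizes it
        return min(minimax(state + m, depth - 1, True) for m in moves)

print(minimax(state=0, depth=4, my_turn=True))
```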
An example of a Go game between Go champions Lee Sedol and Ke Jie. In the board on the bottom, which is one move after the board on the top, white places a stone that “captures” two black pieces. Game snapshots are available at https://
An example of a simulated game in the Go search tree to be used as a sample in its move decision (top). The sample game is run until its end. At the end of the game, the outcome of it is known, and this information is “bubbled up” through the tree to the top layers (bottom). Sampled games are sometimes called “rollouts.”
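The rollout idea, playing simulated games to the end and bubbling the results back up as statistics, can be sketched in a few lines. The coin-flip “game” below is a stand-in for a real Go simulation and is not AlphaGo’s code.

```python
# A minimal sketch of rollouts: from a candidate move, play many simulated
# games to the end and bubble the win/loss results back up as a statistic for
# that move. The win rates here are invented for illustration.
import random

def rollout(move):
    # Pretend move "a" wins 60% of simulated games and move "b" wins 40%.
    win_rate = {"a": 0.6, "b": 0.4}[move]
    return 1 if random.random() < win_rate else 0

stats = {"a": [0, 0], "b": [0, 0]}       # [wins bubbled up, games played]
for _ in range(2000):
    move = random.choice(["a", "b"])     # pick a move to try at the top of the tree
    outcome = rollout(move)              # run the simulated game to its end
    stats[move][0] += outcome            # bubble the result back up the tree
    stats[move][1] += 1

for move, (wins, games) in stats.items():
    print(move, round(wins / games, 2))  # estimated win rate per move
```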
The “Hand of God” move played by Lee Sedol in the fourth of their five games (L-11, top board). The “Hand of Elephant” played by AlphaGo followed immediately afterwards (K-10). Game states are available at https://
The boundary between the slow rollout phase and the fast rollout phase. The slow move-prediction network and win/loss statistics from past simulations are used to choose actions during the slow rollout phase. When a game reaches the fast rollout phase, an evaluation function is run on the state at the boundary, and the fast move-prediction network is used to choose actions for the rest of the simulation. As AlphaGo runs more simulations and becomes more confident about states near the top of the tree, it extends the envelope of the slow-rollout phase to include the states with the most promise.
A sample StarCraft bot architecture, simplified.
A paper-rock-scissors cycle among StarCraft bots from a 2011 competition. In the competition, Xelnaga usually won when it played against Skynet, Skynet usually won when it played against AIUR, and AIUR usually won when it played against Xelnaga.
An architecture of the bot that beat some top human players at DOTA 2. At each epoch, the agent runs a neural network that takes in a feature vector summarizing the current world and outputs variables that determine the action the agent will select. The agent also keeps track of state, which it passes from epoch to epoch. This state serves as a sort of “memory” for the agent.
LIST OF TABLES
Candidate answers for the clue “Milorad Čavić almost upset this man’s perfect 2008 Olympics, losing to him by one hundredth of a second”