Preface
Who this book is for
What this book covers
To get the most out of this book
Get in touch
Fundamentals of Reinforcement Learning
Key elements of RL
Agent
Environment
State and action
Reward
The basic idea of RL
The RL algorithm
RL agent in the grid world
How RL differs from other ML paradigms
Markov Decision Processes
The Markov property and Markov chain
The Markov Reward Process
The Markov Decision Process
Fundamental concepts of RL
Math essentials
Expectation
Action space
Policy
Deterministic policy
Stochastic policy
Episode
Episodic and continuous tasks
Horizon
Return and discount factor
Small discount factor
Large discount factor
What happens when we set the discount factor to 0?
What happens when we set the discount factor to 1?
The value function
Q function
Model-based and model-free learning
Different types of environments
Deterministic and stochastic environments
Discrete and continuous environments
Episodic and non-episodic environments
Single and multi-agent environments
Applications of RL
RL glossary
Summary
Questions
Further reading
A Guide to the Gym Toolkit
Setting up our machine
Installing Anaconda
Installing the Gym toolkit
Common error fixes
Creating our first Gym environment
Exploring the environment
States
Actions
Transition probability and reward function
Generating an episode in the Gym environment
Action selection
Generating an episode
More Gym environments
Classic control environments
State space
Action space
Cart-Pole balancing with random policy
Atari game environments
General environment
Deterministic environment
No frame skipping
State and action space
An agent playing the Tennis game
Recording the game
Other environments
Box2D
MuJoCo
Robotics
Toy text
Algorithms
Environment synopsis
Summary
Questions
Further reading
The Bellman Equation and Dynamic Programming
The Bellman equation
The Bellman equation of the value function
The Bellman equation of the Q function
The Bellman optimality equation
The relationship between the value and Q functions
Dynamic programming
Value iteration
The value iteration algorithm
Solving the Frozen Lake problem with value iteration
Policy iteration
Algorithm – policy iteration
Solving the Frozen Lake problem with policy iteration
Is DP applicable to all environments?
Summary
Questions
Monte Carlo Methods
Understanding the Monte Carlo method
Prediction and control tasks
Prediction task
Control task
Monte Carlo prediction
MC prediction algorithm
Types of MC prediction
First-visit Monte Carlo
Every-visit Monte Carlo
Implementing the Monte Carlo prediction method
Understanding the blackjack game
The blackjack environment in the Gym library
Every-visit MC prediction with the blackjack game
First-visit MC prediction with the blackjack game
Incremental mean updates
MC prediction (Q function)
Monte Carlo control
MC control algorithm
On-policy Monte Carlo control
Monte Carlo exploring starts
Monte Carlo with the epsilon-greedy policy
Implementing on-policy MC control
Off-policy Monte Carlo control
Is the MC method applicable to all tasks?
Summary
Questions
Understanding Temporal Difference Learning
TD learning
TD prediction
TD prediction algorithm
Predicting the value of states in the Frozen Lake environment
TD control
On-policy TD control – SARSA
Computing the optimal policy using SARSA
Off-policy TD control – Q learning
Computing the optimal policy using Q learning
The difference between Q learning and SARSA
Comparing the DP, MC, and TD methods
Summary
Questions
Further reading
Case Study – The MAB Problem
The MAB problem
Creating a bandit in the Gym
Exploration strategies
Epsilon-greedy
Softmax exploration
Upper confidence bound
Thompson sampling
Applications of MAB
Finding the best advertisement banner using bandits
Creating a dataset
Initialize the variables
Define the epsilon-greedy method
Run the bandit test
Contextual bandits
Summary
Questions
Further reading
Deep Learning Foundations
Biological and artificial neurons
ANN and its layers
Input layer
Hidden layer
Output layer
Exploring activation functions
The sigmoid function
The tanh function
The Rectified Linear Unit function
The softmax function
Forward propagation in ANNs
How does an ANN learn?
Putting it all together
Building a neural network from scratch
Recurrent Neural Networks
The difference between feedforward networks and RNNs
Forward propagation in RNNs
Backpropagating through time
LSTM to the rescue
Understanding the LSTM cell
What are CNNs?
Convolutional layers
Strides
Padding
Pooling layers
Fully connected layers
The architecture of CNNs
Generative adversarial networks
Breaking down the generator
Breaking down the discriminator
How do they learn, though?
Architecture of a GAN
Demystifying the loss function
Discriminator loss
Generator loss
Total loss
Summary
Questions
Further reading
A Primer on TensorFlow
What is TensorFlow?
Understanding computational graphs and sessions
Sessions
Variables, constants, and placeholders
Variables
Constants
Placeholders and feed dictionaries
Introducing TensorBoard
Creating a name scope
Handwritten digit classification using TensorFlow
Importing the required libraries
Loading the dataset
Defining the number of neurons in each layer
Defining placeholders
Forward propagation
Computing loss and backpropagation
Computing accuracy
Creating a summary
Training the model
Visualizing graphs in TensorBoard
Introducing eager execution
Math operations in TensorFlow
TensorFlow 2.0 and Keras
Bonjour Keras
Defining the model
Compiling the model
Training the model
Evaluating the model
MNIST digit classification using TensorFlow 2.0
Summary
Questions
Further reading
Deep Q Network and Its Variants
What is DQN?
Understanding DQN
Replay buffer
Loss function
Target network
Putting it all together
The DQN algorithm
Playing Atari games using DQN
Architecture of the DQN
Getting hands-on with the DQN
Preprocess the game screen
Defining the DQN class
Training the DQN
The double DQN
The double DQN algorithm
DQN with prioritized experience replay
Types of prioritization
Proportional prioritization
Rank-based prioritization
Correcting the bias
The dueling DQN
Understanding the dueling DQN
The architecture of a dueling DQN
The deep recurrent Q network
The architecture of a DRQN
Summary
Questions
Further reading
Policy Gradient Method
Why policy-based methods?
Policy gradient intuition
Understanding the policy gradient
Deriving the policy gradient
Algorithm – policy gradient
Variance reduction methods
Policy gradient with reward-to-go
Algorithm – Reward-to-go policy gradient
Cart pole balancing with policy gradient
Computing discounted and normalized reward
Building the policy network
Training the network
Policy gradient with baseline
Algorithm – REINFORCE with baseline
Summary
Questions
Further reading
Actor-Critic Methods – A2C and A3C
Overview of the actor-critic method
Understanding the actor-critic method
The actor-critic algorithm
Advantage actor-critic (A2C)
Asynchronous advantage actor-critic (A3C)
The three As
The architecture of A3C
Mountain car climbing using A3C
Creating the mountain car environment
Defining the variables
Defining the actor-critic class
Defining the worker class
Training the network
Visualizing the computational graph
A2C revisited
Summary
Questions
Further reading
Learning DDPG, TD3, and SAC
Deep deterministic policy gradient
An overview of DDPG
Actor
Critic
DDPG components
Critic network
Actor network
Putting it all together
Algorithm – DDPG
Swinging up a pendulum using DDPG
Creating the Gym environment
Defining the variables
Defining the DDPG class
Training the network
Twin delayed DDPG
Key features of TD3
Clipped double Q learning
Delayed policy updates
Target policy smoothing
Putting it all together
Algorithm – TD3
Soft actor-critic
Understanding soft actor-critic
V and Q functions with the entropy term
Components of SAC
Critic network
Actor network
Putting it all together
Algorithm – SAC
Summary
Questions
Further reading
TRPO, PPO, and ACKTR Methods
Trust region policy optimization
Math essentials
The Taylor series
The trust region method
The conjugate gradient method
Lagrange multipliers
Importance sampling
Designing the TRPO objective function
Parameterizing the policies
Sample-based estimation
Solving the TRPO objective function
Computing the search direction
Performing a line search in the search direction
Algorithm – TRPO
Proximal policy optimization
PPO with a clipped objective
Algorithm – PPO-clipped
Implementing the PPO-clipped method
Creating the Gym environment
Defining the PPO class
Training the network
PPO with a penalized objective
Algorithm – PPO-penalty
Actor-critic using Kronecker-factored trust region
Math essentials
Block matrix
Block diagonal matrix
The Kronecker product
The vec operator
Properties of the Kronecker product
Kronecker-Factored Approximate Curvature (K-FAC)
K-FAC in actor-critic
Incorporating the trust region
Summary
Questions
Further reading
Distributional Reinforcement Learning
Why distributional reinforcement learning?
Categorical DQN
Predicting the value distribution
Selecting an action based on the value distribution
Training the categorical DQN
Projection step
Putting it all together
Algorithm – categorical DQN
Playing Atari games using a categorical DQN
Defining the variables
Defining the replay buffer
Defining the categorical DQN class
Quantile Regression DQN
Math essentials
Quantile
Inverse CDF (quantile function)
Understanding QR-DQN
Action selection
Loss function
Distributed Distributional DDPG
Critic network
Actor network
Algorithm – D4PG
Summary
Questions
Further reading
Imitation Learning and Inverse RL
Supervised imitation learning
DAgger
Understanding DAgger
Algorithm – DAgger
Deep Q learning from demonstrations
Phases of DQfD
Pre-training phase
Training phase
Loss function of DQfD
Algorithm – DQfD
Inverse reinforcement learning
Maximum entropy IRL
Key terms
Back to maximum entropy IRL
Computing the gradient
Algorithm – maximum entropy IRL
Generative adversarial imitation learning
Formulation of GAIL
Summary
Questions
Further reading
Deep Reinforcement Learning with Stable Baselines
Installing Stable Baselines
Creating our first agent with Stable Baselines
Evaluating the trained agent
Storing and loading the trained agent
Viewing the trained agent
Putting it all together
Vectorized environments
SubprocVecEnv
DummyVecEnv
Integrating custom environments
Playing Atari games with a DQN and its variants
Implementing DQN variants
Lunar lander using A2C
Creating a custom network
Swinging up a pendulum using DDPG
Viewing the computational graph in TensorBoard
Training an agent to walk using TRPO
Installing the MuJoCo environment
Implementing TRPO
Recording the video
Training a cheetah bot to run using PPO
Making a GIF of a trained agent
Implementing GAIL
Summary
Questions
Further reading
Reinforcement Learning Frontiers
Meta reinforcement learning
Model-agnostic meta learning
Understanding MAML
MAML in a supervised learning setting
MAML in a reinforcement learning setting
Hierarchical reinforcement learning
MAXQ value function decomposition
Imagination augmented agents
Summary
Questions
Further reading
Appendix 1 – Reinforcement Learning Algorithms
Reinforcement learning algorithm
Value Iteration
Policy Iteration
First-Visit MC Prediction
Every-Visit MC Prediction
MC Prediction – the Q Function
MC Control Method
On-Policy MC Control – Exploring Starts
On-Policy MC Control – Epsilon-Greedy
Off-Policy MC Control
TD Prediction
On-Policy TD Control – SARSA
Off-Policy TD Control – Q Learning
Deep Q Learning
Double DQN
REINFORCE Policy Gradient
Policy Gradient with Reward-To-Go
REINFORCE with Baseline
Advantage Actor Critic
Asynchronous Advantage Actor-Critic
Deep Deterministic Policy Gradient
Twin Delayed DDPG
Soft Actor-Critic
Trust Region Policy Optimization
PPO-Clipped
PPO-Penalty
Categorical DQN
Distributed Distributional DDPG
DAgger
Deep Q Learning from Demonstrations
MaxEnt Inverse Reinforcement Learning
MAML in Reinforcement Learning
Appendix 2 – Assessments
Chapter 1 – Fundamentals of Reinforcement Learning
Chapter 2 – A Guide to the Gym Toolkit
Chapter 3 – The Bellman Equation and Dynamic Programming
Chapter 4 – Monte Carlo Methods
Chapter 5 – Understanding Temporal Difference Learning
Chapter 6 – Case Study – The MAB Problem
Chapter 7 – Deep Learning Foundations
Chapter 8 – A Primer on TensorFlow
Chapter 9 – Deep Q Network and Its Variants
Chapter 10 – Policy Gradient Method
Chapter 11 – Actor-Critic Methods – A2C and A3C
Chapter 12 – Learning DDPG, TD3, and SAC
Chapter 13 – TRPO, PPO, and ACKTR Methods
Chapter 14 – Distributional Reinforcement Learning
Chapter 15 – Imitation Learning and Inverse RL
Chapter 16 – Deep Reinforcement Learning with Stable Baselines
Chapter 17 – Reinforcement Learning Frontiers
Other Books You May Enjoy
Index