Preface
Who this book is for
What this book covers
To get the most out of this book
Get in touch
Fundamentals of Reinforcement Learning
Key elements of RL
Agent
Environment
State and action
Reward
The basic idea of RL
The RL algorithm
RL agent in the grid world
How RL differs from other ML paradigms
Markov Decision Processes
The Markov property and Markov chain
The Markov Reward Process
The Markov Decision Process
Fundamental concepts of RL
Math essentials
Expectation
Action space
Policy
Deterministic policy
Stochastic policy
Episode
Episodic and continuous tasks
Horizon
Return and discount factor
Small discount factor
Large discount factor
What happens when we set the discount factor to 0?
What happens when we set the discount factor to 1?
The value function
Q function
Model-based and model-free learning
Different types of environments
Deterministic and stochastic environments
Discrete and continuous environments
Episodic and non-episodic environments
Single and multi-agent environments
Applications of RL
RL glossary
Summary
Questions
Further reading
A Guide to the Gym Toolkit
Setting up our machine
Installing Anaconda
Installing the Gym toolkit
Common error fixes
Creating our first Gym environment
Exploring the environment
States
Actions
Transition probability and reward function
Generating an episode in the Gym environment
Action selection
Generating an episode
More Gym environments
Classic control environments
State space
Action space
Cart-Pole balancing with random policy
Atari game environments
General environment
Deterministic environment
No frame skipping
State and action space
An agent playing the Tennis game
Recording the game
Other environments
Box2D
MuJoCo
Robotics
Toy text
Algorithms
Environment synopsis
Summary
Questions
Further reading
The Bellman Equation and Dynamic Programming
The Bellman equation
The Bellman equation of the value function
The Bellman equation of the Q function
The Bellman optimality equation
The relationship between the value and Q functions
Dynamic programming
Value iteration
The value iteration algorithm
Solving the Frozen Lake problem with value iteration
Policy iteration
Algorithm – policy iteration
Solving the Frozen Lake problem with policy iteration
Is DP applicable to all environments?
Summary
Questions
Monte Carlo Methods
Understanding the Monte Carlo method
Prediction and control tasks
Prediction task
Control task
Monte Carlo prediction
MC prediction algorithm
Types of MC prediction
First-visit Monte Carlo
Every-visit Monte Carlo
Implementing the Monte Carlo prediction method
Understanding the blackjack game
The blackjack environment in the Gym library
Every-visit MC prediction with the blackjack game
First-visit MC prediction with the blackjack game
Incremental mean updates
MC prediction (Q function)
Monte Carlo control
MC control algorithm
On-policy Monte Carlo control
Monte Carlo exploring starts
Monte Carlo with the epsilon-greedy policy
Implementing on-policy MC control
Off-policy Monte Carlo control
Is the MC method applicable to all tasks?
Summary
Questions
Understanding Temporal Difference Learning
TD learning
TD prediction
TD prediction algorithm
Predicting the value of states in the Frozen Lake environment
TD control
On-policy TD control – SARSA
Computing the optimal policy using SARSA
Off-policy TD control – Q learning
Computing the optimal policy using Q learning
The difference between Q learning and SARSA
Comparing the DP, MC, and TD methods
Summary
Questions
Further reading
Case Study – The MAB Problem
The MAB problem
Creating a bandit in the Gym
Exploration strategies
Epsilon-greedy
Softmax exploration
Upper confidence bound
Thompson sampling
Applications of MAB
Finding the best advertisement banner using bandits
Creating a dataset
Initialize the variables
Define the epsilon-greedy method
Run the bandit test
Contextual bandits
Summary
Questions
Further reading
Deep Learning Foundations
Biological and artificial neurons
ANN and its layers
Input layer
Hidden layer
Output layer
Exploring activation functions
The sigmoid function
The tanh function
The Rectified Linear Unit function
The softmax function
Forward propagation in ANNs
How does an ANN learn?
Putting it all together
Building a neural network from scratch
Recurrent Neural Networks
The difference between feedforward networks and RNNs
Forward propagation in RNNs
Backpropagating through time
LSTM to the rescue
Understanding the LSTM cell
What are CNNs?
Convolutional layers
Strides
Padding
Pooling layers
Fully connected layers
The architecture of CNNs
Generative adversarial networks
Breaking down the generator
Breaking down the discriminator
How do they learn, though?
Architecture of a GAN
Demystifying the loss function
Discriminator loss
Generator loss
Total loss
Summary
Questions
Further reading
A Primer on TensorFlow
What is TensorFlow?
Understanding computational graphs and sessions
Sessions
Variables, constants, and placeholders
Variables
Constants
Placeholders and feed dictionaries
Introducing TensorBoard
Creating a name scope
Handwritten digit classification using TensorFlow
Importing the required libraries
Loading the dataset
Defining the number of neurons in each layer
Defining placeholders
Forward propagation
Computing loss and backpropagation
Computing accuracy
Creating a summary
Training the model
Visualizing graphs in TensorBoard
Introducing eager execution
Math operations in TensorFlow
TensorFlow 2.0 and Keras
Bonjour Keras
Defining the model
Compiling the model
Training the model
Evaluating the model
MNIST digit classification using TensorFlow 2.0
Summary
Questions
Further reading
Deep Q Network and Its Variants
What is DQN?
Understanding DQN
Replay buffer
Loss function
Target network
Putting it all together
The DQN algorithm
Playing Atari games using DQN
Architecture of the DQN
Getting hands-on with the DQN
Preprocess the game screen
Defining the DQN class
Training the DQN
The double DQN
The double DQN algorithm
DQN with prioritized experience replay
Types of prioritization
Proportional prioritization
Rank-based prioritization
Correcting the bias
The dueling DQN
Understanding the dueling DQN
The architecture of a dueling DQN
The deep recurrent Q network
The architecture of a DRQN
Summary
Questions
Further reading
Policy Gradient Method
Why policy-based methods?
Policy gradient intuition
Understanding the policy gradient
Deriving the policy gradient
Algorithm – policy gradient
Variance reduction methods
Policy gradient with reward-to-go
Algorithm – Reward-to-go policy gradient
Cart pole balancing with policy gradient
Computing discounted and normalized reward
Building the policy network
Training the network
Policy gradient with baseline
Algorithm – REINFORCE with baseline
Summary
Questions
Further reading
Actor-Critic Methods – A2C and A3C
Overview of the actor-critic method
Understanding the actor-critic method
The actor-critic algorithm
Advantage actor-critic (A2C)
Asynchronous advantage actor-critic (A3C)
The three As
The architecture of A3C
Mountain car climbing using A3C
Creating the mountain car environment
Defining the variables
Defining the actor-critic class
Defining the worker class
Training the network
Visualizing the computational graph
A2C revisited
Summary
Questions
Further reading
Learning DDPG, TD3, and SAC
Deep deterministic policy gradient
An overview of DDPG
Actor
Critic
DDPG components
Critic network
Actor network
Putting it all together
Algorithm – DDPG
Swinging up a pendulum using DDPG
Creating the Gym environment
Defining the variables
Defining the DDPG class
Training the network
Twin delayed DDPG
Key features of TD3
Clipped double Q learning
Delayed policy updates
Target policy smoothing
Putting it all together
Algorithm – TD3
Soft actor-critic
Understanding soft actor-critic
V and Q functions with the entropy term
Components of SAC
Critic network
Actor network
Putting it all together
Algorithm – SAC
Summary
Questions
Further reading
TRPO, PPO, and ACKTR Methods
Trust region policy optimization
Math essentials
The Taylor series
The trust region method
The conjugate gradient method
Lagrange multipliers
Importance sampling
Designing the TRPO objective function
Parameterizing the policies
Sample-based estimation
Solving the TRPO objective function
Computing the search direction
Performing a line search in the search direction
Algorithm – TRPO
Proximal policy optimization
PPO with a clipped objective
Algorithm – PPO-clipped
Implementing the PPO-clipped method
Creating the Gym environment
Defining the PPO class
Training the network
PPO with a penalized objective
Algorithm – PPO-penalty
Actor-critic using Kronecker-factored trust region
Math essentials
Block matrix
Block diagonal matrix
The Kronecker product
The vec operator
Properties of the Kronecker product
Kronecker-Factored Approximate Curvature (K-FAC)
K-FAC in actor-critic
Incorporating the trust region
Summary
Questions
Further reading
Distributional Reinforcement Learning
Why distributional reinforcement learning?
Categorical DQN
Predicting the value distribution
Selecting an action based on the value distribution
Training the categorical DQN
Projection step
Putting it all together
Algorithm – categorical DQN
Playing Atari games using a categorical DQN
Defining the variables
Defining the replay buffer
Defining the categorical DQN class
Quantile Regression DQN
Math essentials
Quantile
Inverse CDF (quantile function)
Understanding QR-DQN
Action selection
Loss function
Distributed Distributional DDPG
Critic network
Actor network
Algorithm – D4PG
Summary
Questions
Further reading
Imitation Learning and Inverse RL
Supervised imitation learning
DAgger
Understanding DAgger
Algorithm – DAgger
Deep Q learning from demonstrations
Phases of DQfD
Pre-training phase
Training phase
Loss function of DQfD
Algorithm – DQfD
Inverse reinforcement learning
Maximum entropy IRL
Key terms
Back to maximum entropy IRL
Computing the gradient
Algorithm – maximum entropy IRL
Generative adversarial imitation learning
Formulation of GAIL
Summary
Questions
Further reading
Deep Reinforcement Learning with Stable Baselines
Installing Stable Baselines
Creating our first agent with Stable Baselines
Evaluating the trained agent
Storing and loading the trained agent
Viewing the trained agent
Putting it all together
Vectorized environments
SubprocVecEnv
DummyVecEnv
Integrating custom environments
Playing Atari games with a DQN and its variants
Implementing DQN variants
Lunar lander using A2C
Creating a custom network
Swinging up a pendulum using DDPG
Viewing the computational graph in TensorBoard
Training an agent to walk using TRPO
Installing the MuJoCo environment
Implementing TRPO
Recording the video
Training a cheetah bot to run using PPO
Making a GIF of a trained agent
Implementing GAIL
Summary
Questions
Further reading
Reinforcement Learning Frontiers
Meta reinforcement learning
Model-agnostic meta learning
Understanding MAML
MAML in a supervised learning setting
MAML in a reinforcement learning setting
Hierarchical reinforcement learning
MAXQ value function decomposition
Imagination augmented agents
Summary
Questions
Further reading
Appendix 1 – Reinforcement Learning Algorithms
Reinforcement learning algorithm
Value Iteration
Policy Iteration
First-Visit MC Prediction
Every-Visit MC Prediction
MC Prediction – the Q Function
MC Control Method
On-Policy MC Control – Exploring Starts
On-Policy MC Control – Epsilon-Greedy
Off-Policy MC Control
TD Prediction
On-Policy TD Control – SARSA
Off-Policy TD Control – Q Learning
Deep Q Learning
Double DQN
REINFORCE Policy Gradient
Policy Gradient with Reward-To-Go
REINFORCE with Baseline
Advantage Actor Critic
Asynchronous Advantage Actor-Critic
Deep Deterministic Policy Gradient
Twin Delayed DDPG
Soft Actor-Critic
Trust Region Policy Optimization
PPO-Clipped
PPO-Penalty
Categorical DQN
Distributed Distributional DDPG
DAgger
Deep Q Learning from Demonstrations
MaxEnt Inverse Reinforcement Learning
MAML in Reinforcement Learning
Appendix 2 – Assessments
Chapter 1 – Fundamentals of Reinforcement Learning
Chapter 2 – A Guide to the Gym Toolkit
Chapter 3 – The Bellman Equation and Dynamic Programming
Chapter 4 – Monte Carlo Methods
Chapter 5 – Understanding Temporal Difference Learning
Chapter 6 – Case Study – The MAB Problem
Chapter 7 – Deep Learning Foundations
Chapter 8 – A Primer on TensorFlow
Chapter 9 – Deep Q Network and Its Variants
Chapter 10 – Policy Gradient Method
Chapter 11 – Actor-Critic Methods – A2C and A3C
Chapter 12 – Learning DDPG, TD3, and SAC
Chapter 13 – TRPO, PPO, and ACKTR Methods
Chapter 14 – Distributional Reinforcement Learning
Chapter 15 – Imitation Learning and Inverse RL
Chapter 16 – Deep Reinforcement Learning with Stable Baselines
Chapter 17 – Reinforcement Learning Frontiers
Other Books You May Enjoy
Index