Index
Preface
Who this book is for
What this book covers
To get the most out of this book
Get in touch
Fundamentals of Reinforcement Learning
Key elements of RL
Agent
Environment
State and action
Reward
The basic idea of RL
The RL algorithm
RL agent in the grid world
How RL differs from other ML paradigms
Markov Decision Processes
The Markov property and Markov chain
The Markov Reward Process
The Markov Decision Process
Fundamental concepts of RL
Math essentials
Expectation
Action space
Policy
Deterministic policy
Stochastic policy
Episode
Episodic and continuous tasks
Horizon
Return and discount factor
Small discount factor
Large discount factor
What happens when we set the discount factor to 0?
What happens when we set the discount factor to 1?
The value function
Q function
Model-based and model-free learning
Different types of environments
Deterministic and stochastic environments
Discrete and continuous environments
Episodic and non-episodic environments
Single and multi-agent environments
Applications of RL
RL glossary
Summary
Questions
Further reading
A Guide to the Gym Toolkit
Setting up our machine
Installing Anaconda
Installing the Gym toolkit
Common error fixes
Creating our first Gym environment
Exploring the environment
States
Actions
Transition probability and reward function
Generating an episode in the Gym environment
Action selection
Generating an episode
More Gym environments
Classic control environments
State space
Action space
Cart-Pole balancing with random policy
Atari game environments
General environment
Deterministic environment
No frame skipping
State and action space
An agent playing the Tennis game
Recording the game
Other environments
Box2D
MuJoCo
Robotics
Toy text
Algorithms
Environment synopsis
Summary
Questions
Further reading
The Bellman Equation and Dynamic Programming
The Bellman equation
The Bellman equation of the value function
The Bellman equation of the Q function
The Bellman optimality equation
The relationship between the value and Q functions
Dynamic programming
Value iteration
The value iteration algorithm
Solving the Frozen Lake problem with value iteration
Policy iteration
Algorithm – policy iteration
Solving the Frozen Lake problem with policy iteration
Is DP applicable to all environments?
Summary
Questions
Monte Carlo Methods
Understanding the Monte Carlo method
Prediction and control tasks
Prediction task
Control task
Monte Carlo prediction
MC prediction algorithm
Types of MC prediction
First-visit Monte Carlo
Every-visit Monte Carlo
Implementing the Monte Carlo prediction method
Understanding the blackjack game
The blackjack environment in the Gym library
Every-visit MC prediction with the blackjack game
First-visit MC prediction with the blackjack game
Incremental mean updates
MC prediction (Q function)
Monte Carlo control
MC control algorithm
On-policy Monte Carlo control
Monte Carlo exploring starts
Monte Carlo with the epsilon-greedy policy
Implementing on-policy MC control
Off-policy Monte Carlo control
Is the MC method applicable to all tasks?
Summary
Questions
Understanding Temporal Difference Learning
TD learning
TD prediction
TD prediction algorithm
Predicting the value of states in the Frozen Lake environment
TD control
On-policy TD control – SARSA
Computing the optimal policy using SARSA
Off-policy TD control – Q learning
Computing the optimal policy using Q learning
The difference between Q learning and SARSA
Comparing the DP, MC, and TD methods
Summary
Questions
Further reading
Case Study – The MAB Problem
The MAB problem
Creating a bandit in the Gym
Exploration strategies
Epsilon-greedy
Softmax exploration
Upper confidence bound
Thompson sampling
Applications of MAB
Finding the best advertisement banner using bandits
Creating a dataset
Initialize the variables
Define the epsilon-greedy method
Run the bandit test
Contextual bandits
Summary
Questions
Further reading
Deep Learning Foundations
Biological and artificial neurons
ANN and its layers
Input layer
Hidden layer
Output layer
Exploring activation functions
The sigmoid function
The tanh function
The Rectified Linear Unit function
The softmax function
Forward propagation in ANNs
How does an ANN learn?
Putting it all together
Building a neural network from scratch
Recurrent Neural Networks
The difference between feedforward networks and RNNs
Forward propagation in RNNs
Backpropagating through time
LSTM to the rescue
Understanding the LSTM cell
What are CNNs?
Convolutional layers
Strides
Padding
Pooling layers
Fully connected layers
The architecture of CNNs
Generative adversarial networks
Breaking down the generator
Breaking down the discriminator
How do they learn, though?
Architecture of a GAN
Demystifying the loss function
Discriminator loss
Generator loss
Total loss
Summary
Questions
Further reading
A Primer on TensorFlow
What is TensorFlow?
Understanding computational graphs and sessions
Sessions
Variables, constants, and placeholders
Variables
Constants
Placeholders and feed dictionaries
Introducing TensorBoard
Creating a name scope
Handwritten digit classification using TensorFlow
Importing the required libraries
Loading the dataset
Defining the number of neurons in each layer
Defining placeholders
Forward propagation
Computing loss and backpropagation
Computing accuracy
Creating a summary
Training the model
Visualizing graphs in TensorBoard
Introducing eager execution
Math operations in TensorFlow
TensorFlow 2.0 and Keras
Bonjour Keras
Defining the model
Compiling the model
Training the model
Evaluating the model
MNIST digit classification using TensorFlow 2.0
Summary
Questions
Further reading
Deep Q Network and Its Variants
What is DQN?
Understanding DQN
Replay buffer
Loss function
Target network
Putting it all together
The DQN algorithm
Playing Atari games using DQN
Architecture of the DQN
Getting hands-on with the DQN
Preprocess the game screen
Defining the DQN class
Training the DQN
The double DQN
The double DQN algorithm
DQN with prioritized experience replay
Types of prioritization
Proportional prioritization
Rank-based prioritization
Correcting the bias
The dueling DQN
Understanding the dueling DQN
The architecture of a dueling DQN
The deep recurrent Q network
The architecture of a DRQN
Summary
Questions
Further reading
Policy Gradient Method
Why policy-based methods?
Policy gradient intuition
Understanding the policy gradient
Deriving the policy gradient
Algorithm – policy gradient
Variance reduction methods
Policy gradient with reward-to-go
Algorithm – Reward-to-go policy gradient
Cart pole balancing with policy gradient
Computing discounted and normalized reward
Building the policy network
Training the network
Policy gradient with baseline
Algorithm – REINFORCE with baseline
Summary
Questions
Further reading
Actor-Critic Methods – A2C and A3C
Overview of the actor-critic method
Understanding the actor-critic method
The actor-critic algorithm
Advantage actor-critic (A2C)
Asynchronous advantage actor-critic (A3C)
The three As
The architecture of A3C
Mountain car climbing using A3C
Creating the mountain car environment
Defining the variables
Defining the actor-critic class
Defining the worker class
Training the network
Visualizing the computational graph
A2C revisited
Summary
Questions
Further reading
Learning DDPG, TD3, and SAC
Deep deterministic policy gradient
An overview of DDPG
Actor
Critic
DDPG components
Critic network
Actor network
Putting it all together
Algorithm – DDPG
Swinging up a pendulum using DDPG
Creating the Gym environment
Defining the variables
Defining the DDPG class
Training the network
Twin delayed DDPG
Key features of TD3
Clipped double Q learning
Delayed policy updates
Target policy smoothing
Putting it all together
Algorithm – TD3
Soft actor-critic
Understanding soft actor-critic
V and Q functions with the entropy term
Components of SAC
Critic network
Actor network
Putting it all together
Algorithm – SAC
Summary
Questions
Further reading
TRPO, PPO, and ACKTR Methods
Trust region policy optimization
Math essentials
The Taylor series
The trust region method
The conjugate gradient method
Lagrange multipliers
Importance sampling
Designing the TRPO objective function
Parameterizing the policies
Sample-based estimation
Solving the TRPO objective function
Computing the search direction
Performing a line search in the search direction
Algorithm – TRPO
Proximal policy optimization
PPO with a clipped objective
Algorithm – PPO-clipped
Implementing the PPO-clipped method
Creating the Gym environment
Defining the PPO class
Training the network
PPO with a penalized objective
Algorithm – PPO-penalty
Actor-critic using Kronecker-factored trust region
Math essentials
Block matrix
Block diagonal matrix
The Kronecker product
The vec operator
Properties of the Kronecker product
Kronecker-Factored Approximate Curvature (K-FAC)
K-FAC in actor-critic
Incorporating the trust region
Summary
Questions
Further reading
Distributional Reinforcement Learning
Why distributional reinforcement learning?
Categorical DQN
Predicting the value distribution
Selecting an action based on the value distribution
Training the categorical DQN
Projection step
Putting it all together
Algorithm – categorical DQN
Playing Atari games using a categorical DQN
Defining the variables
Defining the replay buffer
Defining the categorical DQN class
Quantile Regression DQN
Math essentials
Quantile
Inverse CDF (quantile function)
Understanding QR-DQN
Action selection
Loss function
Distributed Distributional DDPG
Critic network
Actor network
Algorithm – D4PG
Summary
Questions
Further reading
Imitation Learning and Inverse RL
Supervised imitation learning
DAgger
Understanding DAgger
Algorithm – DAgger
Deep Q learning from demonstrations
Phases of DQfD
Pre-training phase
Training phase
Loss function of DQfD
Algorithm – DQfD
Inverse reinforcement learning
Maximum entropy IRL
Key terms
Back to maximum entropy IRL
Computing the gradient
Algorithm – maximum entropy IRL
Generative adversarial imitation learning
Formulation of GAIL
Summary
Questions
Further reading
Deep Reinforcement Learning with Stable Baselines
Installing Stable Baselines
Creating our first agent with Stable Baselines
Evaluating the trained agent
Storing and loading the trained agent
Viewing the trained agent
Putting it all together
Vectorized environments
SubprocVecEnv
DummyVecEnv
Integrating custom environments
Playing Atari games with a DQN and its variants
Implementing DQN variants
Lunar lander using A2C
Creating a custom network
Swinging up a pendulum using DDPG
Viewing the computational graph in TensorBoard
Training an agent to walk using TRPO
Installing the MuJoCo environment
Implementing TRPO
Recording the video
Training a cheetah bot to run using PPO
Making a GIF of a trained agent
Implementing GAIL
Summary
Questions
Further reading
Reinforcement Learning Frontiers
Meta reinforcement learning
Model-agnostic meta learning
Understanding MAML
MAML in a supervised learning setting
MAML in a reinforcement learning setting
Hierarchical reinforcement learning
MAXQ value function decomposition
Imagination augmented agents
Summary
Questions
Further reading
Appendix 1 – Reinforcement Learning Algorithms
Reinforcement learning algorithm
Value Iteration
Policy Iteration
First-Visit MC Prediction
Every-Visit MC Prediction
MC Prediction – the Q Function
MC Control Method
On-Policy MC Control – Exploring starts
On-Policy MC Control – Epsilon-Greedy
Off-Policy MC Control
TD Prediction
On-Policy TD Control – SARSA
Off-Policy TD Control – Q Learning
Deep Q Learning
Double DQN
REINFORCE Policy Gradient
Policy Gradient with Reward-To-Go
REINFORCE with Baseline
Advantage Actor Critic
Asynchronous Advantage Actor-Critic
Deep Deterministic Policy Gradient
Twin Delayed DDPG
Soft Actor-Critic
Trust Region Policy Optimization
PPO-Clipped
PPO-Penalty
Categorical DQN
Distributed Distributional DDPG
DAgger
Deep Q learning from demonstrations
MaxEnt Inverse Reinforcement Learning
MAML in Reinforcement Learning
Appendix 2 – Assessments
Chapter 1 – Fundamentals of Reinforcement Learning
Chapter 2 – A Guide to the Gym Toolkit
Chapter 3 – The Bellman Equation and Dynamic Programming
Chapter 4 – Monte Carlo Methods
Chapter 5 – Understanding Temporal Difference Learning
Chapter 6 – Case Study – The MAB Problem
Chapter 7 – Deep Learning Foundations
Chapter 8 – A Primer on TensorFlow
Chapter 9 – Deep Q Network and Its Variants
Chapter 10 – Policy Gradient Method
Chapter 11 – Actor-Critic Methods – A2C and A3C
Chapter 12 – Learning DDPG, TD3, and SAC
Chapter 13 – TRPO, PPO, and ACKTR Methods
Chapter 14 – Distributional Reinforcement Learning
Chapter 15 – Imitation Learning and Inverse RL
Chapter 16 – Deep Reinforcement Learning with Stable Baselines
Chapter 17 – Reinforcement Learning Frontiers
Other Books You May Enjoy
Index