The following pseudocode shows the DDPG algorithm:
Initialize replay memory $R$ to capacity $C$;
Initialize the critic network $Q(s, a|\theta^Q)$ and the actor network $\mu(s|\theta^\mu)$ with random weights $\theta^Q$ and $\theta^\mu$;
Initialize the target networks $Q'$ and $\mu'$ with weights $\theta^{Q'} \leftarrow \theta^Q$ and $\theta^{\mu'} \leftarrow \theta^\mu$;
Repeat for each episode:
Set time step $t = 1$;
Initialize a random process $\mathcal{N}$ for action exploration noise;
Receive an initial observation state $s_1$;
While the terminal state hasn't been reached:
Select an action $a_t = \mu(s_t|\theta^\mu) + \mathcal{N}_t$ according to the current policy and exploration noise;
Execute action $a_t$ in the simulator and observe the reward $r_t$ and the next state $s_{t+1}$;
Store transition $(s_t, a_t, r_t, s_{t+1})$ into replay memory $R$;
Randomly sample a batch of $N$ transitions $(s_i, a_i, r_i, s_{i+1})$ from $R$;
Set $y_i = r_i$ if $s_{i+1}$ is a terminal state, or $y_i = r_i + \gamma Q'(s_{i+1}, \mu'(s_{i+1}|\theta^{\mu'})|\theta^{Q'})$ if $s_{i+1}$ is a non-terminal state;
Update critic by minimizing the loss:
$L = \frac{1}{N}\sum_i \left(y_i - Q(s_i, a_i|\theta^Q)\right)^2$;
Update the actor policy using the sampled policy gradient:
$\nabla_{\theta^\mu} J \approx \frac{1}{N}\sum_i \nabla_a Q(s, a|\theta^Q)\big|_{s=s_i,\, a=\mu(s_i)}\, \nabla_{\theta^\mu}\mu(s|\theta^\mu)\big|_{s=s_i}$;
Update the target networks:
$\theta^{Q'} \leftarrow \tau\theta^Q + (1-\tau)\theta^{Q'}$,
$\theta^{\mu'} \leftarrow \tau\theta^\mu + (1-\tau)\theta^{\mu'}$;
End while
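The update steps above translate almost line by line into code. What follows is a minimal sketch of a single DDPG update written in PyTorch; the framework choice, network sizes, and hyperparameters such as $\gamma = 0.99$ and $\tau = 0.001$ are assumptions for illustration, not an implementation from this book. It computes the critic target $y_i$, minimizes the critic loss, applies the sampled policy gradient to the actor, and soft-updates the target networks:

# A minimal DDPG update sketch in PyTorch. Network sizes, hyperparameters,
# and the minibatch format (r and done have shape (batch, 1)) are assumptions.
import copy
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Deterministic policy mu(s | theta_mu): maps a state to an action."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, action_dim), nn.Tanh())  # actions scaled to [-1, 1]

    def forward(self, state):
        return self.net(state)

class Critic(nn.Module):
    """Action-value function Q(s, a | theta_Q)."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, 1))

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=1))

def soft_update(target, source, tau=0.001):
    """theta' <- tau * theta + (1 - tau) * theta'."""
    for tp, sp in zip(target.parameters(), source.parameters()):
        tp.data.copy_(tau * sp.data + (1.0 - tau) * tp.data)

def ddpg_update(batch, actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, gamma=0.99, tau=0.001):
    """One gradient step on a sampled minibatch (s, a, r, s_next, done)."""
    s, a, r, s_next, done = batch

    # Critic target: y = r + gamma * Q'(s', mu'(s')) for non-terminal s'.
    with torch.no_grad():
        y = r + gamma * (1.0 - done) * target_critic(s_next, target_actor(s_next))

    # Update the critic by minimizing the mean squared TD error.
    critic_loss = nn.functional.mse_loss(critic(s, a), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Update the actor with the sampled policy gradient: maximize Q(s, mu(s)).
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Soft-update the target networks.
    soft_update(target_critic, critic, tau)
    soft_update(target_actor, actor, tau)

# Usage (hypothetical dimensions): targets start as copies of the online networks.
actor, critic = Actor(3, 1), Critic(3, 1)
target_actor, target_critic = copy.deepcopy(actor), copy.deepcopy(critic)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

Note how the actor loss is simply the negated critic value of the actor's own actions: backpropagating through the critic into the actor is exactly the sampled policy gradient from the pseudocode.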
There is a natural extension of DDPG in which the feedforward neural networks used to approximate the actor and the critic are replaced with recurrent neural networks. This extension is called the recurrent deterministic policy gradient algorithm (RDPG) and is discussed in the paper N. Heess, J. J. Hunt, T. P. Lillicrap, and D. Silver, Memory-based control with recurrent neural networks, 2015.
The recurrent critic and actor are trained using backpropagation through time (BPTT). Readers who are interested can download the paper from https://arxiv.org/abs/1512.04455.
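As a rough illustration of what this change looks like, in the same PyTorch style as above (the LSTM size, observation dimension, and interface are assumptions, not taken from the paper), the feedforward actor can be replaced with a recurrent one that consumes the observation history; calling backward() on a loss computed over the whole sequence then performs BPTT through the recurrent layers:

# A sketch of a recurrent actor for an RDPG-style agent; dimensions are assumptions.
import torch
import torch.nn as nn

class RecurrentActor(nn.Module):
    def __init__(self, obs_dim, action_dim, hidden_dim=128):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim, hidden_dim, batch_first=True)
        self.head = nn.Sequential(nn.Linear(hidden_dim, action_dim), nn.Tanh())

    def forward(self, obs_seq, hidden=None):
        # obs_seq: (batch, time, obs_dim); returns an action for every time step
        # plus the LSTM state, so the policy can also be run one step at a time.
        out, hidden = self.lstm(obs_seq, hidden)
        return self.head(out), hidden

# Example: actions for a batch of 4 trajectories, each 20 steps long.
actor = RecurrentActor(obs_dim=8, action_dim=2)
actions, _ = actor(torch.randn(4, 20, 8))   # shape: (4, 20, 2)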