DDPG algorithm

The following pseudocode shows the DDPG algorithm:

Initialize replay memory $D$ to capacity $C$;
Initialize the critic network $Q(s, a|\theta^Q)$ and actor network $\mu(s|\theta^\mu)$ with random weights $\theta^Q$ and $\theta^\mu$;
Initialize the target networks $Q'$ and $\mu'$ with weights $\theta^{Q'} \leftarrow \theta^Q$ and $\theta^{\mu'} \leftarrow \theta^\mu$;
Repeat for each episode:
    Set time step $t = 1$;
    Initialize a random process $\mathcal{N}$ for action exploration noise;
    Receive an initial observation state $s_1$;
    While the terminal state hasn't been reached:
        Select an action $a_t = \mu(s_t|\theta^\mu) + \mathcal{N}_t$ according to the current policy and exploration noise;
        Execute action $a_t$ in the simulator and observe reward $r_t$ and the next state $s_{t+1}$;
        Store transition $(s_t, a_t, r_t, s_{t+1})$ into replay memory $D$;
        Randomly sample a batch of $N$ transitions $(s_i, a_i, r_i, s_{i+1})$ from $D$;
        Set $y_i = r_i$ if $s_{i+1}$ is a terminal state, or $y_i = r_i + \gamma Q'(s_{i+1}, \mu'(s_{i+1}|\theta^{\mu'})|\theta^{Q'})$ if $s_{i+1}$ is a non-terminal state;
        Update the critic by minimizing the loss:
        $L = \frac{1}{N}\sum_i \left(y_i - Q(s_i, a_i|\theta^Q)\right)^2$;
        Update the actor policy using the sampled policy gradient:
        $\nabla_{\theta^\mu} J \approx \frac{1}{N}\sum_i \nabla_a Q(s, a|\theta^Q)\big|_{s=s_i, a=\mu(s_i)} \nabla_{\theta^\mu} \mu(s|\theta^\mu)\big|_{s=s_i}$;
        Update the target networks:
        $\theta^{Q'} \leftarrow \tau\theta^Q + (1-\tau)\theta^{Q'}$,
        $\theta^{\mu'} \leftarrow \tau\theta^\mu + (1-\tau)\theta^{\mu'}$;
    End while
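
To make the update step concrete, here is a minimal sketch of how one DDPG update can be written in PyTorch. This is an illustration under assumptions, not the book's implementation: the network sizes, hyperparameters, and the dummy transitions used to fill the buffer are all placeholders.

```python
# A minimal DDPG update-step sketch in PyTorch (assumed framework).
# Network sizes, hyperparameters, and the dummy data are illustrative only.
import copy
import random
from collections import deque

import numpy as np
import torch
import torch.nn as nn


class Actor(nn.Module):
    """Maps a state to a deterministic action in [-1, 1]."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, action_dim), nn.Tanh(),
        )

    def forward(self, s):
        return self.net(s)


class Critic(nn.Module):
    """Maps a (state, action) pair to a scalar Q-value."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))


state_dim, action_dim = 3, 1
gamma, tau = 0.99, 0.005          # discount factor and soft-update rate

actor, critic = Actor(state_dim, action_dim), Critic(state_dim, action_dim)
target_actor, target_critic = copy.deepcopy(actor), copy.deepcopy(critic)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
replay_memory = deque(maxlen=100_000)


def update(batch_size=64):
    """One critic + actor update on a sampled mini-batch, then soft target updates."""
    batch = random.sample(replay_memory, batch_size)
    s, a, r, s2, done = (torch.as_tensor(np.asarray(x), dtype=torch.float32)
                         for x in zip(*batch))

    # Critic target: y_i = r_i + gamma * Q'(s_{i+1}, mu'(s_{i+1})), zeroed at terminal states.
    with torch.no_grad():
        y = r.unsqueeze(-1) + gamma * (1 - done.unsqueeze(-1)) * target_critic(s2, target_actor(s2))

    # Update the critic by minimizing the mean squared TD error.
    critic_loss = nn.functional.mse_loss(critic(s, a), y)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Update the actor with the sampled deterministic policy gradient,
    # i.e. minimize -Q(s, mu(s)) so that Q is ascended.
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Soft target updates: theta' <- tau * theta + (1 - tau) * theta'.
    for net, target in ((actor, target_actor), (critic, target_critic)):
        for p, p_t in zip(net.parameters(), target.parameters()):
            p_t.data.mul_(1 - tau).add_(tau * p.data)


# Fill the buffer with dummy transitions so the sketch runs end to end.
for _ in range(1000):
    replay_memory.append((np.random.randn(state_dim), np.random.randn(action_dim),
                          np.random.randn(), np.random.randn(state_dim), 0.0))
update()
```

Note that the actor loss is simply the negated critic value of the actor's own actions; backpropagating through the critic into the actor is exactly the sampled deterministic policy gradient from the pseudocode.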

There is a natural extension of DDPG: replace the feedforward neural networks used to approximate the actor and the critic with recurrent neural networks. This extension is called the recurrent deterministic policy gradient algorithm (RDPG) and is discussed in the paper: N. Heess, J. J. Hunt, T. P. Lillicrap, and D. Silver. Memory-based control with recurrent neural networks. 2015.

The recurrent critic and actor are trained using backpropagation through time (BPTT). For readers who are interested in it, the paper can be downloaded from https://arxiv.org/abs/1512.04455.
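
As a rough illustration (not code from the paper), the following sketch shows what swapping the feedforward actor for a recurrent one might look like: an LSTM consumes the whole observation history, and calling backward() on a loss computed over the unrolled sequence performs BPTT. The class name, dimensions, and the stand-in objective are assumptions made for the example.

```python
# A hypothetical recurrent actor for an RDPG-style agent (illustrative only).
import torch
import torch.nn as nn


class RecurrentActor(nn.Module):
    """Maps a sequence of observations to a sequence of actions in [-1, 1]."""
    def __init__(self, obs_dim, action_dim, hidden_dim=64):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim, hidden_dim, batch_first=True)
        self.head = nn.Sequential(nn.Linear(hidden_dim, action_dim), nn.Tanh())

    def forward(self, obs_seq, hidden=None):
        # obs_seq: (batch, time, obs_dim); returns one action per time step.
        out, hidden = self.lstm(obs_seq, hidden)
        return self.head(out), hidden


# Unrolling the actor over sampled trajectories and backpropagating a loss
# defined over the whole sequence is backpropagation through time (BPTT).
actor = RecurrentActor(obs_dim=3, action_dim=1)
actions, _ = actor(torch.randn(8, 20, 3))   # batch of 8 histories, 20 steps each
loss = -actions.sum()                        # stand-in objective for illustration
loss.backward()                              # gradients flow back through all 20 steps
```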