Let's first see how to build a random policy using a simple fully connected (dense) neural network that takes the 4 values of an observation as input, uses a hidden layer of 4 neurons, and outputs the probability of action 0, from which the agent can sample an action of 0 or 1:
# nn_random_policy.py
import tensorflow as tf
import numpy as np
import gym

env = gym.make("CartPole-v0")
num_inputs = env.observation_space.shape[0]

inputs = tf.placeholder(tf.float32, shape=[None, num_inputs])
hidden = tf.layers.dense(inputs, 4, activation=tf.nn.relu)
outputs = tf.layers.dense(hidden, 1, activation=tf.nn.sigmoid)
action = tf.multinomial(tf.log(tf.concat([outputs, 1-outputs], 1)), 1)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    total_rewards = []
    for _ in range(1000):
        rewards = 0
        obs = env.reset()
        while True:
            a = sess.run(action, feed_dict={inputs: obs.reshape(1, num_inputs)})
            obs, reward, done, info = env.step(a[0][0])
            rewards += reward
            if done:
                break
        total_rewards.append(rewards)
    print(np.mean(total_rewards))
Note that we use the tf.multinomial function to sample an action based on the probability distribution over actions 0 and 1, defined as outputs and 1-outputs, respectively (the sum of the two probabilities is 1). The mean of the total rewards will be around 20-something, better than the single-minded policy but worse than the simple intuitive policy in the previous subsection. This is a neural network generating a random policy, with no training at all.
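To make the sampling step concrete, here is a rough NumPy equivalent of what the tf.multinomial line does for a single observation (an illustrative sketch, not part of the chapter's code; p0 is a hypothetical value standing in for the network's output):

import numpy as np
p0 = 0.7  # hypothetical probability of action 0 output by the network
a = np.random.choice(2, p=[p0, 1 - p0])  # returns 0 about 70% of the time, 1 about 30%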
To train the network, we use tf.nn.sigmoid_cross_entropy_with_logits to define the loss function between the network output and the desired y_target action, defined using the basic simple policy in the previous subsection, so we expect this neural network policy to achieve about the same rewards as the basic non-neural-network policy:
# nn_simple_policy.py
import tensorflow as tf
import numpy as np
import gym

env = gym.make("CartPole-v0")
num_inputs = env.observation_space.shape[0]

inputs = tf.placeholder(tf.float32, shape=[None, num_inputs])
y = tf.placeholder(tf.float32, shape=[None, 1])
hidden = tf.layers.dense(inputs, 4, activation=tf.nn.relu)
logits = tf.layers.dense(hidden, 1)
outputs = tf.nn.sigmoid(logits)
action = tf.multinomial(tf.log(tf.concat([outputs, 1-outputs], 1)), 1)
cross_entropy = tf.nn.sigmoid_cross_entropy_with_logits(labels=y, logits=logits)
optimizer = tf.train.AdamOptimizer(0.01)
training_op = optimizer.minimize(cross_entropy)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(1000):
        obs = env.reset()
        while True:
            y_target = np.array([[1. if obs[2] < 0 else 0.]])
            a, _ = sess.run([action, training_op], feed_dict={inputs: obs.reshape(1, num_inputs), y: y_target})
            obs, reward, done, info = env.step(a[0][0])
            if done:
                break
    print("training done")
We define outputs as the sigmoid of the network's logits output, that is, the probability of action 0, and then use tf.multinomial to sample an action. Note that we use the standard tf.train.AdamOptimizer and its minimize method to train the network. To test how good the policy is, run the following code (still inside the same session):
    total_rewards = []
    for _ in range(1000):
        rewards = 0
        obs = env.reset()
        while True:
            a = sess.run(action, feed_dict={inputs: obs.reshape(1, num_inputs)})
            obs, reward, done, info = env.step(a[0][0])
            rewards += reward
            if done:
                break
        total_rewards.append(rewards)
    print(np.mean(total_rewards))
The mean of the total rewards will be around 40-something, about the same as when using the simple policy with no neural network, which is exactly what we expected, since we used that simple policy (via y: y_target in the training phase) to train the network.
We're now all set to explore how we can implement a policy gradient method on top of this to make our neural network perform much better, getting rewards several times larger.
The basic idea of a policy gradient is that, in order to train a neural network to generate a better policy when all an agent knows from the environment is the rewards it can get when taking an action from any given state (meaning we can't use supervised learning for training), we can adopt two new mechanisms:
- Discounted rewards: Each action's value needs to take its future rewards into account. For example, an action that gets an immediate reward of 1 but whose episode ends two steps later should have a smaller long-term reward than an action that gets an immediate reward of 1 and whose episode ends 10 steps later. The typical discounted reward for an action is its immediate reward plus each of its future rewards multiplied by the discount rate raised to the power of the number of steps into the future. So if an action sequence has rewards 1, 1, 1, 1, 1 before the end of an episode, the discounted reward for the first action is 1+(1*discount_rate)+(1*discount_rate**2)+(1*discount_rate**3)+(1*discount_rate**4).
- Test runs: Run the current policy and see which actions lead to higher discounted rewards, then update the current policy's gradients (of the loss with respect to the weights) with the discounted rewards, so that an action with higher discounted rewards will, after the network update, have a higher probability of being chosen next time. Repeat such test runs and updates many times to train the neural network toward a better policy.
For a more detailed discussion and a walkthrough of policy gradients, see Andrej Karpathy's blog entry, Deep Reinforcement Learning: Pong from Pixels (http://karpathy.github.io/2016/05/31/rl). Let's now see how to implement a policy gradient for our CartPole problem in TensorFlow.
First, import tensorflow, numpy, and gym, and define a helper method that calculates the normalized and discounted rewards:
import tensorflow as tf
import numpy as np
import gym

def normalized_discounted_rewards(rewards):
    dr = np.zeros(len(rewards))
    dr[-1] = rewards[-1]
    for n in range(2, len(rewards)+1):
        dr[-n] = rewards[-n] + dr[-n+1] * discount_rate
    return (dr - dr.mean()) / dr.std()
For example, if discount_rate is 0.95, then the discounted reward for the first action in a reward list [1,1,1] is 1+1*0.95+1*0.95**2=2.8525, and the discounted rewards for the second and the last elements are 1.95 and 1; the discounted reward for the first action in a reward list [1,1,1,1,1] is 1+1*0.95+1*0.95**2+1*0.95**3+1*0.95**4=4.5244, and those for the rest of the actions are 3.7099, 2.8525, 1.95, and 1. The normalized discounted rewards for [1,1,1] and [1,1,1,1,1] are [1.2141, 0.0209, -1.2350] and [1.3777, 0.7242, 0.0362, -0.6879, -1.4502]. Each normalized discounted list is in decreasing order, meaning the more steps an action has left before the end of an episode, the larger its discounted reward.
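If you want to verify these numbers yourself (a quick sanity check, not part of the chapter's script), you can call the helper directly; this assumes discount_rate has already been set to 0.95, as it will be in the next snippet:

discount_rate = 0.95  # also defined in the next snippet
print(normalized_discounted_rewards([1, 1, 1]))        # roughly [ 1.2141  0.0209 -1.2350]
print(normalized_discounted_rewards([1, 1, 1, 1, 1]))  # roughly [ 1.3777  0.7242  0.0362 -0.6879 -1.4502]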
Next, create the CartPole gym environment, define the learning_rate and discount_rate hyper-parameters, and build the network with four input neurons, four hidden neurons and one output neuron as before:
env = gym.make("CartPole-v0")
learning_rate = 0.05
discount_rate = 0.95
num_inputs = env.observation_space.shape[0]
inputs = tf.placeholder(tf.float32, shape=[None, num_inputs])
hidden = tf.layers.dense(inputs, 4, activation=tf.nn.relu)
logits = tf.layers.dense(hidden, 1)
outputs = tf.nn.sigmoid(logits)
action = tf.multinomial(tf.log(tf.concat([outputs, 1-outputs], 1)), 1)
prob_action_0 = tf.to_float(1-action)
cross_entropy = tf.nn.sigmoid_cross_entropy_with_logits(logits=logits, labels=prob_action_0)
optimizer = tf.train.AdamOptimizer(learning_rate)
Note that we don't use the minimize method here, as we did in the previous simple neural network policy example, because we need to manually fine-tune the gradients to take the discounted rewards for each action into account. This requires us to first use the compute_gradients method, then update the gradients the way we want, and finally call the apply_gradients method (the minimize method, which we should use most of the time, actually calls compute_gradients and apply_gradients behind the scenes; see https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/training/optimizer.py for more information).
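In other words, the following simplified sketch shows roughly what optimizer.minimize(cross_entropy) does behind the scenes (ignoring details such as gradient gating and the global step):

# simplified sketch of what minimize does internally
grads_and_vars = optimizer.compute_gradients(cross_entropy)
training_op = optimizer.apply_gradients(grads_and_vars)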
So let's now compute the gradients of the cross-entropy loss with respect to the network parameters (weights and biases), and set up gradient placeholders, which will later be fed with values that combine the computed gradients and the discounted rewards of the actions taken with the current policy during test runs:
gvs = optimizer.compute_gradients(cross_entropy)
gvs = [(g, v) for g, v in gvs if g is not None]
gs = [g for g, _ in gvs]

gps = []
gvs_feed = []
for g, v in gvs:
    gp = tf.placeholder(tf.float32, shape=g.get_shape())
    gps.append(gp)
    gvs_feed.append((gp, v))
training_op = optimizer.apply_gradients(gvs_feed)
The gvs returned from optimizer.compute_gradients(cross_entropy) is a list of tuples, each consisting of a gradient (of cross_entropy with respect to a trainable variable) and the trainable variable itself. For example, if you take a look at gvs after the whole program runs, you'll see something like this:
[(<tf.Tensor 'gradients/dense/MatMul_grad/tuple/control_dependency_1:0' shape=(4, 4) dtype=float32>,
<tf.Variable 'dense/kernel:0' shape=(4, 4) dtype=float32_ref>),
(<tf.Tensor 'gradients/dense/BiasAdd_grad/tuple/control_dependency_1:0' shape=(4,) dtype=float32>,
<tf.Variable 'dense/bias:0' shape=(4,) dtype=float32_ref>),
(<tf.Tensor 'gradients/dense_2/MatMul_grad/tuple/control_dependency_1:0' shape=(4, 1) dtype=float32>,
<tf.Variable 'dense_1/kernel:0' shape=(4, 1) dtype=float32_ref>),
(<tf.Tensor 'gradients/dense_2/BiasAdd_grad/tuple/control_dependency_1:0' shape=(1,) dtype=float32>,
<tf.Variable 'dense_1/bias:0' shape=(1,) dtype=float32_ref>)]
Note that kernel is just another name for weight, and (4, 4), (4, ), (4, 1), and (1, ) are the shapes of the weights and biases for the first (input to hidden) and second (hidden to output) layers. If you run the script multiple times from IPython, the default graph of the tf object will contain trainable variables from previous runs, so unless you call tf.reset_default_graph(), you need to use gvs = [(g, v) for g, v in gvs if g is not None] to remove those obsolete trainable variables, which would return None gradients (for more information about compute_gradients, see https://www.tensorflow.org/api_docs/python/tf/train/AdamOptimizer#compute_gradients).
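If you want to double-check which trainable variables (and shapes) are in the current graph, one way to do so (not part of the chapter's script) is to list them directly:

# optional: inspect the trainable variables and their shapes in the current graph
for v in tf.trainable_variables():
    print(v.name, v.get_shape())
# for a fresh graph, this should list dense/kernel (4, 4), dense/bias (4,),
# dense_1/kernel (4, 1), and dense_1/bias (1,)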
Now, play some games and save the rewards and gradient values:
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(1000):
        rewards, grads = [], []
        obs = env.reset()
        # using current policy to test play a game
        while True:
            a, gs_val = sess.run([action, gs], feed_dict={inputs: obs.reshape(1, num_inputs)})
            obs, reward, done, info = env.step(a[0][0])
            rewards.append(reward)
            grads.append(gs_val)
            if done:
                break
After the test play of a game, and still inside the outer for loop, update the gradients with the discounted rewards and train the network (remember that training_op is defined as optimizer.apply_gradients(gvs_feed)):
        # update gradients and do the training
        nd_rewards = normalized_discounted_rewards(rewards)
        gp_val = {}
        for i, gp in enumerate(gps):
            gp_val[gp] = np.mean([grads[k][i] * reward for k, reward in enumerate(nd_rewards)], axis=0)
        sess.run(training_op, feed_dict=gp_val)
Finally, after 1,000 iterations of test play and updates, and still inside the session, we can test the trained model:
    total_rewards = []
    for _ in range(100):
        rewards = 0
        obs = env.reset()
        while True:
            a = sess.run(action, feed_dict={inputs: obs.reshape(1, num_inputs)})
            obs, reward, done, info = env.step(a[0][0])
            rewards += reward
            if done:
                break
        total_rewards.append(rewards)
    print(np.mean(total_rewards))
Note that we now use the trained policy network and sess.run to get the next action with the current observation as input. The mean of the total rewards will be about 200, a big improvement over our simple intuitive policy, whether it uses a neural network or not.
You can also save the trained model, while still inside the session, using tf.train.Saver, as we did many times in the previous chapters:
    saver = tf.train.Saver()
    saver.save(sess, "./nnpg.ckpt")
Then you can reload it in a separate test program with:
with tf.Session() as sess:
    saver.restore(sess, "./nnpg.ckpt")
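Keep in mind that restore only loads the variable values, so the separate test program needs to rebuild the same network graph (and create a Saver) before calling restore. If you'd rather not duplicate the graph-building code, one alternative (an assumption on my part, not what this chapter does) is to load the graph structure from the checkpoint's meta file:

# alternative sketch: rebuild the graph from the checkpoint's meta file
with tf.Session() as sess:
    saver = tf.train.import_meta_graph("./nnpg.ckpt.meta")
    saver.restore(sess, "./nnpg.ckpt")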
All the preceding policy implementations run on the Raspberry Pi, even the one that uses TensorFlow to train a reinforcement learning policy gradient model, which takes about 15 minutes to finish. Here are the mean total rewards returned after running each of the policies we've covered on the Pi:
pi@raspberrypi:~/mobiletf/ch12 $ python single_minded_policy.py
9.362
pi@raspberrypi:~/mobiletf/ch12 $ python simple_policy.py
42.535
pi@raspberrypi:~/mobiletf/ch12 $ python nn_random_policy.py
21.182
pi@raspberrypi:~/mobiletf/ch12 $ python nn_simple_policy.py
41.852
pi@raspberrypi:~/mobiletf/ch12 $ python nn_pg.py
199.116
Now that you have a powerful neural-network-based policy model that can help your robot keep its balance, fully tested in a simulated environment, you can deploy it in a real physical environment (after replacing the simulated environment's API returns with real environment data, of course), and the code to build and train the neural network reinforcement learning model can certainly be easily reused.