In DQN, we replace the Q-Table with a neural network (Q-Network) that learns to predict the optimal action as we train it continuously on the explored states and their Q-Values. Thus, to train the network, we need a place to store the game memory:
- Implement the game memory using a deque of size 1000:
from collections import deque

# replay memory: holds the 1,000 most recent transitions
memory = deque(maxlen=1000)
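Because maxlen is set, the deque automatically discards its oldest entries once it is full, so the replay memory always holds the most recent gameplay. A quick illustration, separate from the training code:
from collections import deque

d = deque(maxlen=3)
for i in range(5):
    d.append(i)
print(d)   # deque([2, 3, 4], maxlen=3) -- the two oldest items were dropped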
- Next, build a simple neural network model with one hidden layer, q_nn:
from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
# hidden layer: 8 units, taking the 4-dimensional state as input
model.add(Dense(8, input_dim=4, activation='relu'))
# output layer: one linear Q-Value per action (2 actions)
model.add(Dense(2, activation='linear'))
model.compile(loss='mse', optimizer='adam')
model.summary()
q_nn = model
The Q-Network looks like this:
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
dense_1 (Dense)              (None, 8)                 40
_________________________________________________________________
dense_2 (Dense)              (None, 2)                 18
=================================================================
Total params: 58
Trainable params: 58
Non-trainable params: 0
_________________________________________________________________
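The Q-Network is used inside the policy function to pick actions. The policy_q_nn() function called later in this section is not listed here; a minimal epsilon-greedy sketch, assuming the explore_rate variable defined below, the q_nn model we just built, and the same (state, env) signature used by the episode() code, could look like this (the exact implementation may differ):
import numpy as np

def policy_q_nn(state, env):
    # explore: take a random action with probability explore_rate
    if np.random.random() < explore_rate:
        action = env.action_space.sample()
    # exploit: take the action with the highest predicted Q-Value
    else:
        action = np.argmax(q_nn.predict(np.array([state])))
    return action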
The episode() function, which executes one episode of the game, incorporates the following changes for the Q-Network-based algorithm:
- After generating the next state, add the previous state, action, reward, next state, and done flag to the game memory:
action = policy(state_prev, env)
obs, reward, done, info = env.step(action)
state_next = discretize_state(obs, s_bounds, n_s)
# add the state_prev, action, reward, state_next, done to memory
memory.append([state_prev, action, reward, state_next, done])
- Generate the q_values and update them with the maximum discounted future reward using the Bellman equation:
states = np.array([x[0] for x in memory])
states_next = np.array([np.zeros(4) if x[4] else x[3] for x in memory])
q_values = q_nn.predict(states)
q_values_next = q_nn.predict(states_next)

for i in range(len(memory)):
    state_prev, action, reward, state_next, done = memory[i]
    if done:
        # terminal state: no future reward beyond this step
        q_values[i, action] = reward
    else:
        # Bellman update: immediate reward plus discounted best future Q-Value
        best_q = np.amax(q_values_next[i])
        bellman_q = reward + discount_rate * best_q
        q_values[i, action] = bellman_q
- Train the q_nn with the states and the q_values computed from memory:
q_nn.fit(states, q_values, epochs=1, batch_size=50, verbose=0)
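Putting these pieces together, the modified episode() might be structured roughly as follows. This is a sketch rather than the exact listing: it assumes the discretize_state(), s_bounds, and n_s helpers from the earlier Q-Table version, a simplified signature, and the older gym API in which env.step() returns four values:
def episode(env, policy):
    # reset the environment and discretize the initial observation
    obs = env.reset()
    state_prev = discretize_state(obs, s_bounds, n_s)

    episode_reward = 0
    done = False

    while not done:
        # pick an action with the supplied policy and take one step
        action = policy(state_prev, env)
        obs, reward, done, info = env.step(action)
        state_next = discretize_state(obs, s_bounds, n_s)

        # add the state_prev, action, reward, state_next, done to memory
        memory.append([state_prev, action, reward, state_next, done])

        # build the training targets from the replay memory
        states = np.array([x[0] for x in memory])
        states_next = np.array([np.zeros(4) if x[4] else x[3] for x in memory])
        q_values = q_nn.predict(states)
        q_values_next = q_nn.predict(states_next)

        for i in range(len(memory)):
            s_prev, a, r, s_next, d = memory[i]
            if d:
                q_values[i, a] = r
            else:
                q_values[i, a] = r + discount_rate * np.amax(q_values_next[i])

        # train the Q-Network on the states and updated Q-Values
        q_nn.fit(states, q_values, epochs=1, batch_size=50, verbose=0)

        state_prev = state_next
        episode_reward += reward

    return episode_reward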
The process of saving gameplay in memory and using it to train the model is known as memory replay (or experience replay) in the deep reinforcement learning literature. Let us run our DQN-based gameplay as follows:
learning_rate = 0.8
discount_rate = 0.9
explore_rate = 0.2
n_episodes = 100
experiment(env, policy_q_nn, n_episodes)
We get a max reward of 150, which you can improve upon with hyperparameter tuning, network tuning, and decay of the discount rate and explore rate:
Policy:policy_q_nn, Min reward:8.0, Max reward:150.0, Average reward:41.27
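The experiment() helper is carried over from the earlier Q-Table examples and is not repeated here. A minimal sketch consistent with the summary line above (its names and structure are assumptions) might be:
def experiment(env, policy, n_episodes):
    # run n_episodes episodes with the given policy and collect the rewards
    rewards = np.empty(shape=[n_episodes])
    for i in range(n_episodes):
        rewards[i] = episode(env, policy)
    # report the reward statistics for this policy
    print('Policy:{}, Min reward:{}, Max reward:{}, Average reward:{}'
          .format(policy.__name__, np.min(rewards), np.max(rewards),
                  np.mean(rewards)))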
We computed the Q-Values and trained the model at every step; you may want to explore training only once per episode instead. You can also change the code to discard the replay memory and retrain the model only for episodes that return smaller rewards. However, implement this option with caution, as it may slow down learning, since the initial gameplay generates smaller rewards more often.