Both DQN in deep reinforcement learning and classical Q-learning suffer from maximization bias: they systematically overestimate Q-values. Why? Consider the formula Q-learning uses to update its Q-values:
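$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\left[R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t)\right]$$

The same estimate $Q$ is used both to select the greedy action (the $\arg\max$ implicit in $\max_a$) and to evaluate it. Because the estimates are noisy, taking the max of noisy values is biased upward, and that positive bias is propagated into the target.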
In 2010, van Hasselt proposed Double Q-learning to address this problem. Double Q-learning maintains two sets of Q-values, so that selecting the best action and estimating that action's value are carried out by different estimators. The update is as follows:
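With two independent estimates $Q^A$ and $Q^B$, one of them (chosen at random each step, here $Q^A$) is updated by letting the other evaluate the greedy action it selects:

$$Q^A(S_t, A_t) \leftarrow Q^A(S_t, A_t) + \alpha\left[R_{t+1} + \gamma\, Q^B\!\big(S_{t+1}, \arg\max_a Q^A(S_{t+1}, a)\big) - Q^A(S_t, A_t)\right]$$

Because the errors of $Q^B$ are independent of those of $Q^A$, the evaluation no longer inherits the upward bias of the max operator.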
Double DQN is simply Double Q-learning applied to DQN. The two sets of Q-values above map naturally onto DQN's policy network (updated frequently) and target network (synchronized with the policy network every so often). The Double DQN target is:

$$Y_t^{DoubleDQN} \equiv R_{t+1} + \gamma\, Q\!\left(S_{t+1}, \arg\max_a Q(S_{t+1}, a; \theta_t);\ \theta_t^-\right)$$

The implementation is also simple: only a small change is needed where DQN computes its TD error:
def compute_td_loss(self, states, actions, rewards, next_states, is_done, gamma=0.99):
    """ Compute TD loss using torch operations only. Uses the formula above. """
    # states / next_states are assumed to be float tensors already (or converted inside self.DQN)
    actions = torch.tensor(actions).long()              # shape: [batch_size]
    rewards = torch.tensor(rewards, dtype=torch.float)  # shape: [batch_size]
    is_done = torch.tensor(is_done).bool()              # shape: [batch_size]

    if self.USE_CUDA:
        actions = actions.cuda()
        rewards = rewards.cuda()
        is_done = is_done.cuda()

    # get q-values for all actions in current states
    predicted_qvalues = self.DQN(states)

    # select q-values for the chosen actions
    predicted_qvalues_for_actions = predicted_qvalues[
        range(states.shape[0]), actions
    ]

    # compute q-values for all actions in next states
    ## where DDQN differs from DQN: the policy network selects the next action,
    ## the target network evaluates it
    predicted_next_qvalues_current = self.DQN(next_states)
    predicted_next_qvalues_target = self.DQN_target(next_states)

    # compute V*(next_states): the target-network value of the action chosen by the policy network
    next_actions = predicted_next_qvalues_current.argmax(dim=1, keepdim=True)
    next_state_values = predicted_next_qvalues_target.gather(1, next_actions).squeeze(1)

    # compute "target q-values" for the loss -- the term inside the square brackets in the formula above
    target_qvalues_for_actions = rewards + gamma * next_state_values

    # at terminal states use the simplified formula Q(s,a) = r(s,a), since s' doesn't exist
    target_qvalues_for_actions = torch.where(
        is_done, rewards, target_qvalues_for_actions)

    # Huber loss to minimize; the target is detached so gradients only flow through the policy network
    # (plain MSE would also work:
    #  loss = torch.mean((predicted_qvalues_for_actions - target_qvalues_for_actions.detach()) ** 2))
    loss = F.smooth_l1_loss(predicted_qvalues_for_actions,
                            target_qvalues_for_actions.detach())

    return loss
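For reference, this is the only place where the implementation differs from vanilla DQN. Using the same tensor names as in the function above, the two variants of the next-state value would look like this (a sketch, not part of the function):

# vanilla DQN: the target network both selects and evaluates the greedy action,
# so noise in its estimates is maximized over -- the source of the overestimation
next_state_values_dqn = predicted_next_qvalues_target.max(dim=1)[0]

# Double DQN: the policy network selects the action, the target network evaluates it
next_actions = predicted_next_qvalues_current.argmax(dim=1, keepdim=True)
next_state_values_ddqn = predicted_next_qvalues_target.gather(1, next_actions).squeeze(1)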
Let's look at the training results. The figure below shows DDQN (blue) and DQN (orange) on Pong: DDQN converges roughly 10% faster than DQN.
DDQN is easy to implement, but the improvement on Pong is modest; the benefit likely only becomes apparent on harder tasks.