In traditional RL, the agent's goal is to maximize the cumulative discounted reward: $$\pi^\ast = \arg\max_\pi \, \mathbb{E}_\pi\Big[\sum_t \gamma^t\, r(s_t, a_t)\Big]$$
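Maximum-entropy (max-ent) RL, the setting addressed below, augments this objective with an entropy bonus, so the policy is rewarded for staying as random as possible while still collecting reward. Stated here for context in its standard form from the max-ent RL literature, with a temperature $\alpha$ trading off reward against entropy: $$\pi^\ast_\text{MaxEnt} = \arg\max_\pi \, \mathbb{E}_\pi\Big[\sum_t \gamma^t\big(r(s_t, a_t) + \alpha\,\mathcal{H}(\pi(\cdot|s_t))\big)\Big]$$ where $\mathcal{H}(\pi(\cdot|s)) = -\mathbb{E}_{a\sim\pi}[\log\pi(a|s)]$.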
Soft Q-Learning is one algorithm for solving the max-ent RL problem, and it was first applied to continuous-action tasks (the MuJoCo benchmark). Compared with policy-based algorithms such as DDPG and PPO, it performs better and learns more stably. Here I mainly describe how to use Soft Q-Learning on discrete-action tasks. Unlike the continuous-action case, discrete-action tasks do not need a separate policy network, which makes the method very simple and easy to understand.
As in Q-learning, we first need to define the algorithm's value functions. Here we define the soft Q function and the soft value function: $$Q^\text{soft}_\pi(s, a) = r(s, a) + \gamma\, V^\text{soft}_\pi(s')$$ $$V^\text{soft}_\pi(s) = \mathbb{E}_{a\sim\pi}\big[Q^\text{soft}_\pi(s, a) - \alpha\log\pi(a|s)\big] = \alpha\log\sum_a\exp\Big(\frac{1}{\alpha}Q^\text{soft}_\pi(s, a)\Big)$$ (the last equality holds when $\pi$ is the softmax policy defined below). We can use the soft Bellman equation above to perform policy evaluation. The original paper proves that this soft policy evaluation makes the soft Q function converge to the true soft Q function.
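To build intuition for this "soft maximum": for any $\alpha > 0$ the soft value always lies between $\max_a Q^\text{soft}_\pi(s,a)$ and $\max_a Q^\text{soft}_\pi(s,a) + \alpha\log|\mathcal{A}|$, and as $\alpha \to 0$ it collapses to the hard max while the induced softmax policy (defined below) becomes greedy. The following toy snippet (made-up Q values, not from the original post) illustrates this behaviour:

```python
import torch

# toy soft Q-values for a single state with 3 actions (made-up numbers)
q = torch.tensor([1.0, 2.0, 4.0])

for alpha in [0.01, 0.5, 2.0, 10.0]:
    v_soft = alpha * torch.logsumexp(q / alpha, dim=-1)   # soft value of the state
    pi = torch.softmax(q / alpha, dim=-1)                 # corresponding softmax policy
    print(f"alpha={alpha:5.2f}  V_soft={v_soft.item():6.3f}  pi={pi.numpy().round(3)}")
```

With a small temperature the soft value is essentially $\max_a Q = 4.0$ and the policy is nearly deterministic; with a large temperature the policy approaches uniform and the soft value exceeds the hard max by close to $\alpha\log 3$.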
The policy update is also simple: the probability of selecting an action is the softmax of the actions' soft Q values, as follows: $$\pi(a|s) = \text{softmax}_a\Big(\frac{1}{\alpha}Q^\text{soft}_\pi(s, a)\Big)$$ As you can see, the max operation of Q-learning is replaced by a softmax, so actions with suboptimal Q values still have some probability of being chosen, which improves the algorithm's exploration and generalization. The original paper proves that this soft policy improvement increases the soft Q values. We only need to change the policy-evaluation and policy-improvement code of DQN to implement soft-DQN. The modified TD-loss code is shown below (an action-selection sketch for the policy-improvement side follows the loss code):
```python
def compute_td_loss(self, states, actions, rewards, next_states, is_done, gamma=0.99):
    """Compute the soft-DQN TD loss using torch operations only."""
    actions = torch.tensor(actions).long()              # shape: [batch_size]
    rewards = torch.tensor(rewards, dtype=torch.float)  # shape: [batch_size]
    is_done = torch.tensor(is_done).bool()              # shape: [batch_size]
    if self.USE_CUDA:
        actions = actions.cuda()
        rewards = rewards.cuda()
        is_done = is_done.cuda()

    # get q-values for all actions in the current states
    # (states / next_states are assumed to be converted to tensors inside the network)
    predicted_qvalues = self.DQN(states)

    # select q-values for the chosen actions
    predicted_qvalues_for_actions = predicted_qvalues[
        range(states.shape[0]), actions
    ]

    # compute q-values for all actions in the next states with the target network
    predicted_next_qvalues = self.DQN_target(next_states)

    # soft value of the next states: V(s') = alpha * logsumexp(Q(s', .) / alpha)
    next_state_values = self.alpha * torch.logsumexp(
        predicted_next_qvalues / self.alpha, dim=-1
    )

    # compute "target q-values" for the loss: r(s, a) + gamma * V(s')
    target_qvalues_for_actions = rewards + gamma * next_state_values

    # at terminal states we use the simplified formula Q(s, a) = r(s, a), since s' doesn't exist
    target_qvalues_for_actions = torch.where(
        is_done, rewards, target_qvalues_for_actions)

    # Huber loss between predicted and target q-values
    # (a plain MSE loss, torch.mean((predicted - target.detach()) ** 2), also works)
    loss = F.smooth_l1_loss(predicted_qvalues_for_actions,
                            target_qvalues_for_actions.detach())
    return loss
```
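For completeness, the policy-improvement side, acting according to $\pi(a|s) = \text{softmax}_a(Q(s,a)/\alpha)$ instead of a greedy argmax, is an equally small change. Below is a minimal sketch of what such an action-selection method could look like; `sample_actions` is a hypothetical helper not taken from the original code, but it reuses the same `self.DQN` network and `self.alpha` temperature as the loss above:

```python
def sample_actions(self, states):
    """Sample actions from the Boltzmann (softmax) policy pi(a|s) ∝ exp(Q(s, a) / alpha).

    Hypothetical helper: assumes self.DQN(states) returns a
    [batch_size, n_actions] tensor of soft Q-values, as in compute_td_loss.
    """
    with torch.no_grad():
        qvalues = self.DQN(states)                            # [batch_size, n_actions]
        policy = torch.softmax(qvalues / self.alpha, dim=-1)  # pi(a|s), rows sum to 1
        actions = torch.multinomial(policy, num_samples=1)    # one sampled action per state
    return actions.squeeze(-1).cpu().numpy()
```

Unlike DQN's epsilon-greedy behaviour policy, no separate exploration schedule is needed here: the temperature alpha already controls how random the behaviour is.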
For reasons I haven't figured out, soft-DQN performs very poorly on Pong, so I use it on MsPacman instead. Most people probably played MsPacman as kids: you just run around eating pellets. DQN learns slowly on this game (it needs roughly 20M steps), while soft-DQN learns quickly, within about 1M steps. The results are shown below (the DQN results are not shown):