Soft Actor-Critic with hybrid action types#

Actor training loss#

Let \(a\) be the discrete action and \(b\) the continuous action, then the actor loss is:

\[\begin{split}\begin{equation} \begin{array}{ll} & \displaystyle\int_{a,b}\pi(a,b|s)[\alpha \log \pi(a,b|s) - Q(s,a,b)]\\ =& \displaystyle\int_{a,b}\pi_{\phi}(b|s)\pi(a|b,s)[\alpha\log \pi_{\phi}(b|s) + \alpha \log \pi(a|b,s) - Q(s,a,b)]\\ =& \displaystyle\int_b \pi_{\phi}(b|s) \left(\alpha\log \pi_{\phi}(b|s) - \int_a \pi(a|b,s) q_b(s,a)\right)\\ =& \displaystyle\int_b \pi_{\phi}(b|s) \left(\alpha\log \pi_{\phi}(b|s) - \mathbb{E}_{\pi(a|b,s)}[q_b(s,a)]\right)\\ \end{array} \end{equation}\end{split}\]

where \(q_b(s,a):= Q(s,a,b)-\alpha\log\pi(a|b,s)\). Given any \(\pi_{\phi}(b|s)\) for any \(s,b\), maximizing the inner expectation we have

\[\pi^*(a|b,s)=\text{argmax}_{\pi}\mathbb{E}_{\pi(a|b,s)}[q_b(s,a)]=\frac{\exp(\frac{Q(s,a,b)}{\alpha})}{Z(s,b)}\]

as the optimal conditional policy for action \(a\). For a reasonable discrete action space, the optimal inner expectation \(q^*(s,b):=\mathbb{E}_{\pi^*(a|b,s)}[\cdot]\) can be easily computed. It’s a function of \(s\) and \(b\), and differentiable w.r.t. \(b\). Thus we can still use re-parameterization trick \(b=g_{\phi}(\epsilon,s)\) to optimize \(\phi\):

\[\mathbb{E}_{\epsilon\sim p(\epsilon)}\left[\alpha \log\pi_{\phi}(g_{\phi}(\epsilon,s)|s) - q^*(s,g_{\phi}(\epsilon,s))\right]\]

Different entropy coefficients#

To have different entropy coefficients, the actor loss becomes

\[\displaystyle\int_{a,b}\pi_{\phi}(b|s)\pi(a|b,s)[\alpha_b\log \pi_{\phi}(b|s) + \alpha_a \log \pi(a|b,s) - Q(s,a,b)]\]

And accordingly, the value definition is changed to

\[\displaystyle V(s)=\mathbb{E}_{a,b\sim \pi}[Q(s,a,b)-\alpha_a \log \pi(a|s,b) - \alpha_b \log\pi_{\phi}(b|s)]\]