Estimating the derivative of an expectation

Given a probability distribution \(p(x)\) and a function \(f(x)\), the expectation of \(f\) under \(p\) is:

\[E(f(x)) = \int p(x)f(x) dx\]

If only \(f(x)\) (but not \(p(x)\)) is parameterized by \(\theta\), the derivative can be moved inside the integral, so

\[\frac{\partial E(f_{\theta}(x))}{\partial\theta} = \int p(x)\frac{\partial f_{\theta}(x)}{\partial \theta} dx = E\left(\frac{\partial f_{\theta}(x)}{\partial\theta}\right)\]

and \(\frac{\partial f_{\theta}(x)}{\partial\theta}\), evaluated at a sample \(x \sim p\), is an unbiased estimator of \(\frac{\partial E(f_{\theta}(x))}{\partial\theta}\). However, when \(p(x)\) is also parameterized by \(\theta\), this estimator is in general biased, because it ignores the dependence of \(p_\theta\) on \(\theta\). In that case we have to work through the derivative explicitly to obtain a correct estimator. Below are two example scenarios.
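To make the bias concrete, here is a minimal numerical sketch. The setup is an assumption chosen for illustration and does not come from the text above: \(x \sim N(\theta, 1)\) and \(f_\theta(x) = \theta x\), so that \(E(f_\theta(x)) = \theta^2\) and the true derivative is \(2\theta\). The corrected estimator adds the term \(f_\theta(x)\frac{\partial\log p_\theta(x)}{\partial\theta}\), which follows from differentiating \(\int p_\theta(x) f_\theta(x) dx\) with the product rule.

```python
import random


def estimate(theta, n=200_000, seed=0):
    """Monte Carlo estimates of d/dtheta E_{x ~ N(theta, 1)}[theta * x].

    The toy distribution and function are illustrative assumptions.
    The true derivative is 2 * theta.
    """
    rng = random.Random(seed)
    naive = 0.0
    corrected = 0.0
    for _ in range(n):
        x = rng.gauss(theta, 1.0)
        df = x             # d f_theta(x) / d theta for f_theta(x) = theta * x
        score = x - theta  # d log p_theta(x) / d theta for N(theta, 1)
        naive += df
        corrected += df + theta * x * score
    return naive / n, corrected / n


theta = 1.5
naive, corrected = estimate(theta)
print("naive:", naive, "corrected:", corrected, "true:", 2 * theta)
```

With these assumptions the naive average should settle near \(\theta\), while the corrected one should settle near \(2\theta\).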

Scenario 1: Derivative of an entropy

For some algorithms, we need to calculate the entropy and its derivative. If there is no analytic formula for the entropy, we can resort to sampling. Given the definition of entropy:

\[\begin{equation*} H(p) = E_{x\sim p_\theta}(-\log p_\theta(x)) \end{equation*}\]

We can see that \(-\log p_{\theta}(x)\) is an unbiased estimator of \(H\) if \(x\) is sampled from \(p_{\theta}\). It is tempting to use \(-\frac{\partial\log p_\theta(x)}{\partial\theta}\) as an estimator of \(\frac{\partial H}{\partial\theta}\). However, this is wrong, because the expectation of \(\frac{\partial\log p_\theta(x)}{\partial\theta}\) is always zero:

\[\begin{equation*} E_{x\sim p_\theta}\left(\frac{\partial\log p_\theta(x)}{\partial\theta}\right) = \int \frac{\partial\log p_\theta(x)}{\partial\theta} p_\theta(x) dx = \int \frac{\partial p_\theta(x)}{\partial\theta} dx = \frac{\partial}{\partial\theta} \int p_\theta(x) dx = \frac{\partial 1}{\partial\theta} = 0 \end{equation*}\]

We need to actually go through the process of calculating the derivative to get the unbiased estimator of \(\frac{\partial H}{\partial\theta}\):

\[\begin{split}\begin{array}{rcl} \frac{\partial H}{\partial\theta} &=&-\frac{\partial}{\partial\theta}\int \log p_\theta(x) p_\theta(x) dx \\ &=& - \int \left(\frac{\partial\log p_\theta(x)}{\partial\theta}p_\theta(x) + \log p_\theta(x) \frac{\partial p_\theta(x)}{\partial\theta}\right) dx \\ &=& - \int \left(\frac{\partial\log p_\theta(x)}{\partial\theta}p_\theta(x) + \log p_\theta(x) \frac{\partial\log p_\theta(x)}{\partial\theta} p_\theta(x)\right) dx \\ &=& - \int (1+\log p_\theta(x))\frac{\partial\log p_\theta(x)}{\partial\theta} p_\theta(x) dx \\ &=& -E_{x\sim p_\theta}\left(\log p_\theta(x)\frac{\partial\log p_\theta(x)}{\partial\theta}\right) -E_{x\sim p_\theta}\left(\frac{\partial\log p_\theta(x)}{\partial\theta}\right) \\ &=& -\frac{1}{2}E_{x\sim p_\theta}\left(\frac{\partial}{\partial\theta}(\log p_\theta(x))^2\right) \end{array}\end{split}\]

This means that \(-\frac{1}{2}\frac{\partial}{\partial\theta}(\log p_\theta(x))^2\) is an unbiased estimator of \(\frac{\partial H}{\partial\theta}\). Actually, \(-\frac{1}{2}\frac{\partial}{\partial\theta}(c+\log p_\theta(x))^2\) is an unbiased estimator for any constant \(c\).
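As a sanity check, the two estimators can be compared numerically. The sketch below assumes a zero-mean Gaussian \(p_\sigma = N(0, \sigma^2)\) with \(\theta = \sigma\), chosen only because its entropy \(\frac{1}{2}\log(2\pi e\sigma^2)\) has the known derivative \(\frac{1}{\sigma}\). The function name is hypothetical.

```python
import math
import random


def entropy_grad_estimates(sigma, n=200_000, seed=0):
    """Average two candidate estimators of dH/dsigma for N(0, sigma^2).

    Samples are treated as fixed values: no gradient flows through x.
    """
    rng = random.Random(seed)
    log_norm = 0.5 * math.log(2.0 * math.pi)
    naive = 0.0
    squared = 0.0
    for _ in range(n):
        x = rng.gauss(0.0, sigma)
        log_p = -math.log(sigma) - log_norm - x * x / (2.0 * sigma ** 2)
        dlog_p = -1.0 / sigma + x * x / sigma ** 3  # d log p / d sigma at fixed x
        naive += -dlog_p
        squared += -log_p * dlog_p  # equals -0.5 * d/dsigma (log p)^2
    return naive / n, squared / n


sigma = 2.0
naive, squared = entropy_grad_estimates(sigma)
print("naive:", naive, "squared-log:", squared, "true:", 1.0 / sigma)
```

Under these assumptions the naive average should hover around zero, while the squared-log estimator should approach \(\frac{1}{\sigma}\).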

For some distributions, a sample from \(p_\theta\) is generated by drawing \(\epsilon \sim q\) and transforming it as \(x = f_\theta(\epsilon)\), where \(q\) is a fixed distribution and \(f_\theta\) is a smooth bijective mapping. \(p_\theta(x)\) is then implicitly defined by \(q\) and \(f_\theta\) as:

\[\begin{equation*} p_\theta(x) = \frac{q(f_\theta^{-1}(x))}{\left|\det \left. \frac{\partial f_\theta(\epsilon)}{\partial\epsilon}\right| _{\epsilon=f_\theta^{-1}(x)}\right|} \end{equation*}\]

Interestingly, if we substitute \(x=f_\theta(\epsilon)\) and take the total derivative of \(-\log p_\theta(f_\theta(\epsilon))\) w.r.t. \(\theta\), so that \(\theta\) enters both through the density and through the sample, we get an unbiased estimator of \(\frac{\partial H}{\partial\theta}\):

\[\begin{split}\begin{array}{rcl} E_{\epsilon \sim q}\left(-\frac{d\log p_\theta(f_\theta(\epsilon))}{d\theta}\right) &=& -\frac{\partial}{\partial\theta}E_{\epsilon \sim q}\left(\log p_\theta(f_\theta(\epsilon))\right) \\ &=& -\frac{\partial}{\partial\theta}E_{x \sim p_\theta}\left(\log p_\theta(x)\right) = \frac{\partial}{\partial\theta}H(p) \end{array}\end{split}\]

The derivative can be moved outside the expectation in the first step because \(q\) does not depend on \(\theta\). So we can use \(-\frac{\partial\log p_\theta(x)}{\partial\theta}\) as an unbiased estimator of \(\frac{\partial H(p)}{\partial\theta}\) if \(x=f_\theta(\epsilon)\) and we allow the gradient to propagate through \(x\) to \(\theta\).
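The reparameterized estimator can be checked with a small sketch. It assumes \(q = N(0, 1)\) and \(f_\sigma(\epsilon) = \sigma\epsilon\), so that \(p_\sigma = N(0, \sigma^2)\) and the true value is \(\frac{1}{\sigma}\). A central finite difference stands in for automatic differentiation here. The key point is that \(x\) is recomputed from \(\epsilon\) whenever \(\sigma\) is perturbed.

```python
import math
import random


def log_p(x, sigma):
    """Log-density of N(0, sigma^2) at x."""
    return -math.log(sigma) - 0.5 * math.log(2.0 * math.pi) - x * x / (2.0 * sigma ** 2)


def pathwise_entropy_grad(sigma, n=10_000, seed=0, h=1e-5):
    """Estimate dH/dsigma by differentiating -log p_sigma(f_sigma(eps)) in sigma.

    Assumes f_sigma(eps) = sigma * eps with eps ~ N(0, 1).
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        eps = rng.gauss(0.0, 1.0)
        # x moves together with sigma, which is what lets the gradient reach sigma via x.
        up = log_p((sigma + h) * eps, sigma + h)
        down = log_p((sigma - h) * eps, sigma - h)
        total += -(up - down) / (2.0 * h)
    return total / n


sigma = 2.0
grad = pathwise_entropy_grad(sigma)
print("pathwise:", grad, "true:", 1.0 / sigma)
```

For this particular Gaussian the \(\epsilon\)-dependent terms cancel, so every sample yields approximately \(\frac{1}{\sigma}\). That zero-variance behavior is specific to this example and should not be expected in general.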

Scenario 2: SAC discrete actor training with a CriticNetwork

Usually, if the action is discrete in SAC, we choose a QNetwork to estimate Q values. If there are \(K\) actions in total, the QNetwork outputs \(K\) heads given an observation \(s\), each of which represents a Q-value \(Q(s,k)\).

Theoretically, we can also use CriticNetwork, which takes both an observation \(s\) and an action \(k\) (after proper encoding) and outputs a single value representing \(Q(s,k)\).

With a CriticNetwork in the actor training stage, for continuous actions, the original SAC paper already derives the gradient formulation. Next we will derive the gradient for discrete actions.

Given a sampled discrete action \(k\) and its corresponding critic value \(Q(s,k)\), we want to empirically estimate the gradient of the following expectation, in order to train the actor \(p_{\theta}(k|s)\) by gradient descent:

\[L_{\theta} = \sum_k p_{\theta}(k|s)(\alpha \log p_{\theta}(k|s) - Q(s,k))\]

where \(\alpha\) is the entropy coefficient. Taking the derivative w.r.t. \(\theta\), with \(Q(s,k)\) treated as a constant, we have:

\[\begin{split}\begin{array}{ll} \frac{\partial L_{\theta}}{\partial \theta} &= \sum_k \left[\frac{\partial p_{\theta}(k|s)}{\partial \theta}(\alpha\log p_{\theta}(k|s) - Q(s,k)) + \alpha p_{\theta}(k|s)\frac{\partial \log p_{\theta}(k|s)}{\partial \theta}\right]\\ &=\sum_k \left[p_{\theta}(k|s)\frac{ \partial \log p_{\theta}(k|s)}{\partial \theta}(\alpha\log p_{\theta}(k|s) - Q(s,k))\right] + \underbrace{\alpha\frac{\partial\sum_k p_{\theta}(k|s)}{\partial \theta}}_{=0}\\ &=E_{p_{\theta}}\left[\frac{\partial\log p_{\theta}(k|s)}{\partial\theta} (\alpha \log p_{\theta}(k|s) - Q(s,k))\right]\\ \end{array}\end{split}\]

which means that \(\frac{\partial\log p_{\theta}(k|s)}{\partial\theta}(\alpha \log p_{\theta}(k|s) - Q(s,k))\) is an unbiased estimator of the actor training gradient \(\frac{\partial L_{\theta}}{\partial \theta}\).
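The estimator can be checked against a direct numerical derivative of \(L_\theta\). The sketch below assumes a softmax actor whose logits are the parameters \(\theta\) themselves, for which \(\frac{\partial\log p_\theta(k|s)}{\partial\theta_j} = 1[j=k] - p_\theta(j|s)\). The Q values, logits, and function names are made up for illustration.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]


def actor_loss(logits, q, alpha):
    """L = sum_k p(k) * (alpha * log p(k) - Q(k)) for a softmax actor."""
    p = softmax(logits)
    return sum(pk * (alpha * math.log(pk) - qk) for pk, qk in zip(p, q))


def finite_diff_grad(logits, q, alpha, h=1e-6):
    """Reference gradient of the loss by central differences."""
    grad = []
    for j in range(len(logits)):
        up = list(logits)
        up[j] += h
        down = list(logits)
        down[j] -= h
        grad.append((actor_loss(up, q, alpha) - actor_loss(down, q, alpha)) / (2.0 * h))
    return grad


def sampled_grad(logits, q, alpha, n=200_000, seed=0):
    """Average of dlog p(k)/dtheta * (alpha * log p(k) - Q(k)) over sampled actions."""
    rng = random.Random(seed)
    p = softmax(logits)
    actions = rng.choices(range(len(p)), weights=p, k=n)
    grad = [0.0] * len(p)
    for k in actions:
        weight = alpha * math.log(p[k]) - q[k]  # treated as a constant w.r.t. theta
        for j in range(len(p)):
            dlog_p = (1.0 if j == k else 0.0) - p[j]
            grad[j] += dlog_p * weight
    return [g / n for g in grad]


logits = [0.2, -0.1, 0.4]   # hypothetical actor parameters
q_values = [1.0, 0.5, -0.3]  # hypothetical critic outputs Q(s, k)
alpha = 0.2
exact = finite_diff_grad(logits, q_values, alpha)
estimate = sampled_grad(logits, q_values, alpha)
print("finite difference:", exact)
print("sampled estimate: ", estimate)
```

In an actual training loop only one or a few sampled actions per state would be used, so each individual gradient is noisy even though its expectation matches.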

Note

Although the above is theoretically sound, in practice fitting an actor \(p_{\theta}(k|s)\) to Q values \(Q(s,k)\) is inefficient. In the discrete case, the distribution that minimizes \(L_\theta\) is available in closed form, so we can sample actions directly from the Q values:

\[p(k|s) \propto \exp\left(\frac{Q(s,k)}{\alpha}\right)\]

So usually we will still choose QNetwork because it enables the above sampling while CriticNetwork doesn’t.
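A minimal sketch of this sampling step, assuming the \(K\) Q values for one observation are already available as a list (the values and the function name are made up for illustration):

```python
import math
import random


def boltzmann_policy(q_values, alpha):
    """Return p(k|s) proportional to exp(Q(s, k) / alpha)."""
    scaled = [q / alpha for q in q_values]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in scaled]
    z = sum(exps)
    return [e / z for e in exps]


q_values = [1.0, 0.5, -0.3]  # hypothetical Q(s, k) for K = 3 actions
alpha = 0.2
probs = boltzmann_policy(q_values, alpha)
action = random.Random(0).choices(range(len(q_values)), weights=probs, k=1)[0]
print("probabilities:", probs, "sampled action:", action)
```

A smaller \(\alpha\) concentrates the probability mass on the action with the largest Q value, while a larger \(\alpha\) pushes the policy toward uniform.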