Policy Gradient Derivation

We begin with the expected return (utility) of a policy with parameters \(\theta\), written as a sum over all possible trajectories \(\tau\):

\[U(\theta) = \sum_{\tau} P(\tau ; \theta) R(\tau)\]
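
As a concrete instance (the two-armed bandit below is an illustrative assumption, not part of the original text), consider a one-step problem where each trajectory is a single action \(a\), so \(P(\tau ; \theta)\) is just the policy's action probability \(\pi_{\theta}(a)\):

\[U(\theta) = \pi_{\theta}(a_1) R(a_1) + \pi_{\theta}(a_2) R(a_2)\]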

Take the gradient with respect to \(\theta\):

\[\nabla_{\theta} U(\theta) = \nabla_{\theta} \sum_{\tau} P(\tau ; \theta) R(\tau)\]

Since the gradient of a sum is the sum of the gradients, we can move the gradient inside the sum.

\[\nabla_{\theta} U(\theta) = \sum_{\tau} \nabla_{\theta} P(\tau ; \theta) R(\tau)\]

It is not obvious what to do next. We could immediately expand \(P(\tau ; \theta)\), but that would lead to problems a bit further down the road. Instead, we will employ the grad-log trick. Let's first multiply and divide by \(P(\tau ; \theta)\), i.e., insert two factors whose product is 1. Their utility will soon be apparent.

\[\nabla_{\theta} U(\theta) = \sum_{\tau} P(\tau ; \theta) \, \frac{1}{P(\tau ; \theta)} \, \nabla_{\theta} P(\tau ; \theta) R(\tau)\]

Now we apply the log-derivative trick (\(\nabla \log(x) = \frac{1}{x} \nabla x\)):

\[\nabla_{\theta} U(\theta) = \sum_{\tau} P(\tau ; \theta) \, \nabla_{\theta} \log P(\tau ; \theta) \, R(\tau)\]
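
As a quick sanity check of the identity, the sketch below compares \(\nabla \log p\) against \(\frac{1}{p} \nabla p\) with finite differences. The sigmoid parameterization and the step size are illustrative assumptions.

```python
import numpy as np

# Numerical check of the log-derivative trick for a single parameter:
# p(theta) = sigmoid(theta); compare grad log p with (1/p) * grad p
# using a central finite difference (eps is an arbitrary small step).
theta, eps = 0.3, 1e-6
p = lambda th: 1.0 / (1.0 + np.exp(-th))

grad_p = (p(theta + eps) - p(theta - eps)) / (2 * eps)
grad_log_p = (np.log(p(theta + eps)) - np.log(p(theta - eps))) / (2 * eps)

print(np.isclose(grad_log_p, grad_p / p(theta)))  # expect: True
```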

Let's condense the formula using the expectation operator: a sum over trajectories weighted by \(P(\tau ; \theta)\) is exactly an expectation over trajectories sampled from the policy.

\[\nabla_{\theta} U(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta}} \left[ \nabla_{\theta} \log P(\tau ; \theta) \, R(\tau) \right]\]
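
Writing the gradient as an expectation is what makes it practical: it can be approximated by a Monte Carlo average over sampled trajectories. In the sketch below, \(m\) denotes the number of rollouts (notation introduced here for illustration, not part of the original derivation):

\[\nabla_{\theta} U(\theta) \approx \frac{1}{m} \sum_{i=1}^{m} \nabla_{\theta} \log P(\tau^{(i)} ; \theta) \, R(\tau^{(i)}), \qquad \tau^{(i)} \sim \pi_{\theta}\]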

Now let's expand \(P(\tau ; \theta)\). The probability of a trajectory is the probability of the initial state, multiplied, at every time step, by the probability of the selected action under the current policy and the probability of transitioning into the next state given the current state and that action. For a trajectory of length \(T\):

\[\nabla_{\theta} U(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta}} \left[ \nabla_{\theta} \log \left( P(s_0) \prod_{t=0}^{T-1} P \left( s_{t+1} | s_t, a_t \right) \pi_{\theta} (a_t | s_t) \right) R(\tau) \right]\]
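
To make this factorization concrete, here is a minimal sketch that accumulates \(\log P(\tau ; \theta)\) factor by factor along one sampled trajectory. The toy MDP, the softmax parameterization, and every identifier in the snippet are illustrative assumptions, not part of the derivation.

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny, made-up MDP: 2 states, 2 actions, horizon T = 5.
n_states, n_actions, T = 2, 2, 5
P0 = np.array([0.7, 0.3])                                                # P(s_0)
P_trans = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P(s' | s, a)
theta = rng.normal(size=(n_states, n_actions))                          # policy parameters

def policy(s, theta):
    """Softmax policy pi_theta(. | s)."""
    logits = theta[s]
    e = np.exp(logits - logits.max())
    return e / e.sum()

# Sample one trajectory and accumulate log P(tau; theta) term by term.
s = rng.choice(n_states, p=P0)
log_p_tau = np.log(P0[s])
for t in range(T):
    probs = policy(s, theta)
    a = rng.choice(n_actions, p=probs)
    s_next = rng.choice(n_states, p=P_trans[s, a])
    log_p_tau += np.log(probs[a]) + np.log(P_trans[s, a, s_next])
    s = s_next

print("log P(tau; theta) =", log_p_tau)
```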

Now we use the fact that the log of a product is the sum of the logs (\(\log(ab) = \log(a) + \log(b)\)):

\[\nabla_{\theta} U(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta}} \left[ \nabla_{\theta} \left( \log P(s_0) + \sum_{t=0}^{T-1} \left[ \log P \left( s_{t+1} | s_t, a_t \right) + \log \pi_{\theta} (a_t | s_t) \right] \right) R(\tau) \right]\]

Let's distribute the gradient across the terms inside the parentheses.

\[\nabla_{\theta} U(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta}} \left[ \left( \nabla_{\theta} \log P(s_0) + \nabla_{\theta} \sum_{t=0}^{T-1} \left[ \log P \left( s_{t+1} | s_t, a_t \right) + \log \pi_{\theta} (a_t | s_t) \right] \right) R(\tau) \right]\]

Recognize that the first term does not depend on \(\theta\). Thus, its gradient is 0.

\[\nabla_{\theta} U(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta}} \left[ \left( \nabla_{\theta} \sum_{t=0}^{T-1} \left[ \log P \left( s_{t+1} | s_t, a_t \right) + \log \pi_{\theta} (a_t | s_t) \right] \right) R(\tau) \right]\]

Let's move the gradient inside the sum.

\[\nabla_{\theta} U(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta}} \left[ \left( \sum_{t=0}^{T-1} \left[ \nabla_{\theta} \log P \left( s_{t+1} | s_t, a_t \right) + \nabla_{\theta} \log \pi_{\theta} (a_t | s_t) \right] \right) R(\tau) \right]\]

We can see that the transition probability depends only on the environment dynamics, not on \(\theta\). Thus, its gradient is 0.

\[\nabla_{\theta} U(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta}} \left[ \left( \sum_{t=0}^{T-1} \nabla_{\theta} \log \pi_{\theta} (a_t | s_t) \right) R(\tau) \right]\]
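
This final expression is what REINFORCE-style methods estimate from samples: roll out trajectories with the current policy, then average \(\left(\sum_t \nabla_{\theta} \log \pi_{\theta}(a_t | s_t)\right) R(\tau)\) over the rollouts. Below is a minimal sketch under assumed toy dynamics and a tabular softmax policy; the environment, the hyperparameters, and every identifier are illustrative assumptions rather than anything prescribed by the derivation.

```python
import numpy as np

rng = np.random.default_rng(0)

n_states, n_actions, T, n_traj = 2, 2, 5, 100
theta = np.zeros((n_states, n_actions))          # tabular softmax policy parameters

def policy(s, theta):
    """pi_theta(. | s) as a softmax over theta[s]."""
    logits = theta[s]
    e = np.exp(logits - logits.max())
    return e / e.sum()

def grad_log_pi(s, a, theta):
    """Gradient of log pi_theta(a | s) w.r.t. theta for a tabular softmax."""
    g = np.zeros_like(theta)
    probs = policy(s, theta)
    g[s] = -probs
    g[s, a] += 1.0
    return g

def step(s, a):
    """Made-up dynamics and reward, purely for illustration."""
    s_next = rng.integers(n_states)
    reward = 1.0 if a == s else 0.0
    return s_next, reward

grad_estimate = np.zeros_like(theta)
for _ in range(n_traj):
    s, ret, grad_log_p_tau = rng.integers(n_states), 0.0, np.zeros_like(theta)
    for t in range(T):
        a = rng.choice(n_actions, p=policy(s, theta))
        grad_log_p_tau += grad_log_pi(s, a, theta)   # sum_t grad log pi(a_t | s_t)
        s, r = step(s, a)
        ret += r                                     # R(tau): total return
    grad_estimate += grad_log_p_tau * ret

grad_estimate /= n_traj                              # Monte Carlo average over rollouts
print(grad_estimate)
```

Since the estimate is a Monte Carlo average, it is noisy for a small number of rollouts; increasing `n_traj` brings it closer to the expectation above.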