Policy Gradients
What
Policy gradient algorithms are a class of deep reinforcement learning algorithms that optimize a parameterized policy \(\pi_{\theta}\) by gradient ascent on its expected return.
Why
They are useful because they optimize the policy directly, rather than deriving it from a learned value function, and they handle continuous action spaces naturally.
Explanation
In RL we attempt to maximize the expected discounted return. The utility of a policy (how well it performs) is defined as the expected discounted return of trajectories sampled from it:
\[U(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta}} \left[
\sum_{t=0}^{T} \gamma^{t} r_t
\right]\]
where \(T\) is the length of the trajectory \(\tau\), \(\gamma \in [0, 1]\) is the discount factor, and \(r_t\) is the reward received at step \(t\).
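For intuition, here is a minimal sketch in Python of the quantity inside the expectation for a single sampled trajectory; the reward values and discount factor are made-up illustrative numbers, not something specified by this note.

```python
# Minimal sketch: the discounted return of one sampled trajectory.
# The rewards and discount factor below are made-up illustrative values.
rewards = [1.0, 0.0, 2.0, 1.0]   # r_0, r_1, ..., r_T
gamma = 0.99                     # discount factor

discounted_return = sum(gamma ** t * r for t, r in enumerate(rewards))
print(discounted_return)
```

\(U(\theta)\) is the expectation of this number over trajectories sampled from \(\pi_{\theta}\), so in practice it is estimated by averaging over many rollouts.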
Writing \(R(\tau) = \sum_{t=0}^{T} \gamma^{t} r_t\) for the return of a trajectory and \(P(\tau ; \theta)\) for the probability of that trajectory under \(\pi_{\theta}\), it can also be written as a sum over trajectories:
\[U(\theta) = \sum_{\tau} P(\tau ; \theta) \, R(\tau)\]
A common way to maximize a function is gradient ascent: compute its gradient and repeatedly step in that direction. Since \(\theta\) are the parameters we control, we take the gradient of \(U\) with respect to \(\theta\).
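In update form, with \(\alpha\) denoting a step size (learning rate, a symbol not used elsewhere in this note), each gradient-ascent step is
\[\theta \leftarrow \theta + \alpha \, \nabla_{\theta} U(\theta)\]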
Using the likelihood-ratio (log-derivative) trick, the gradient ends up being
\[\nabla_{\theta} U(\theta) =
\mathbb{E}_{\tau \sim \pi_{\theta}} \left[
\sum_{t=0}^{T}
\nabla_{\theta} \log \pi_{\theta} (a_t \mid s_t) \,
R(\tau)
\right]\]
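As a rough illustration of how this estimator turns into code, here is a minimal REINFORCE-style sketch in PyTorch. The network architecture, sizes, hyperparameters, and the toy trajectory data are illustrative assumptions, not something specified by this note.

```python
import torch
import torch.nn as nn

# Illustrative policy network and sizes; these are assumptions, not part of the note.
obs_dim, n_actions, gamma, lr = 4, 2, 0.99, 1e-2

policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(policy.parameters(), lr=lr)

def reinforce_update(states, actions, rewards):
    """One policy-gradient update from a single sampled trajectory.

    states:  tensor of shape (T, obs_dim)
    actions: tensor of shape (T,) with integer action indices
    rewards: list of floats [r_0, ..., r_T]
    """
    # R(tau): discounted return of the whole trajectory.
    R = sum(gamma ** t * r for t, r in enumerate(rewards))

    # log pi_theta(a_t | s_t) for every step of the trajectory.
    dist = torch.distributions.Categorical(logits=policy(states))
    log_probs = dist.log_prob(actions)

    # The gradient of sum_t log pi_theta(a_t | s_t) * R(tau) is the estimator above;
    # minimizing its negation makes the optimizer perform gradient *ascent* on U.
    loss = -(log_probs.sum() * R)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Made-up trajectory data purely to show the call signature.
T = 5
states = torch.randn(T, obs_dim)
actions = torch.randint(0, n_actions, (T,))
rewards = [1.0] * T
reinforce_update(states, actions, rewards)
```

In practice the update is averaged over a batch of trajectories, and a baseline is usually subtracted from \(R(\tau)\) to reduce variance, but the core term is exactly the \(\nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t) \, R(\tau)\) product from the formula above.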