Policy Gradients
What
Policy gradient algorithms are a class of deep reinforcement learning algorithms that optimize a parameterized policy \(\pi_{\theta}\) by gradient ascent on its expected return.
Why
They are useful because they optimize the policy directly, rather than deriving it from a learned value function, and they handle continuous action spaces naturally.
Explanation
In RL we attempt to maximize the expected discounted return. The utility of a policy (how well it performs) is defined as the expected discounted return of trajectories sampled from it:
\[U(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta}} \left[
\sum_{t=0}^{T} \gamma^{t} r_t
\right]\]
where \(T\) is the length of the trajectory \(\tau\), \(\gamma \in [0, 1]\) is the discount factor, and \(r_t\) is the reward received at step \(t\).
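For intuition, here is a minimal sketch in Python of the quantity inside the expectation for a single sampled trajectory; the reward values and discount factor are made-up illustrative numbers, not something specified by this note.

```python
# Minimal sketch: the discounted return of one sampled trajectory.
# The rewards and discount factor below are made-up illustrative values.
rewards = [1.0, 0.0, 2.0, 1.0]   # r_0, r_1, ..., r_T
gamma = 0.99                     # discount factor

discounted_return = sum(gamma ** t * r for t, r in enumerate(rewards))
print(discounted_return)
```

\(U(\theta)\) is the expectation of this number over trajectories sampled from \(\pi_{\theta}\), so in practice it is estimated by averaging over many rollouts.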
Writing \(R(\tau) = \sum_{t=0}^{T} \gamma^{t} r_t\) for the return of a trajectory and \(P(\tau ; \theta)\) for the probability of that trajectory under \(\pi_{\theta}\), it can also be written as a sum over trajectories:
\[U(\theta) = \sum_{\tau} P(\tau ; \theta) \, R(\tau)\]
A common way to maximize a function is gradient ascent: compute its gradient and repeatedly step in that direction. Since \(\theta\) are the parameters we control, we take the gradient of \(U\) with respect to \(\theta\).
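In update form, with \(\alpha\) denoting a step size (learning rate, a symbol not used elsewhere in this note), each gradient-ascent step is
\[\theta \leftarrow \theta + \alpha \, \nabla_{\theta} U(\theta)\]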
Using the likelihood-ratio (log-derivative) trick, the gradient ends up being
\[\nabla_{\theta} U(\theta) =
\mathbb{E}_{\tau \sim \pi_{\theta}} \left[
\sum_{t=0}^{T}
\nabla_{\theta} \log \pi_{\theta} (a_t \mid s_t) \,
R(\tau)
\right]\]
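As a rough illustration of how this estimator turns into code, here is a minimal REINFORCE-style sketch in PyTorch. The network architecture, sizes, hyperparameters, and the toy trajectory data are illustrative assumptions, not something specified by this note.

```python
import torch
import torch.nn as nn

# Illustrative policy network and sizes; these are assumptions, not part of the note.
obs_dim, n_actions, gamma, lr = 4, 2, 0.99, 1e-2

policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(policy.parameters(), lr=lr)

def reinforce_update(states, actions, rewards):
    """One policy-gradient update from a single sampled trajectory.

    states:  tensor of shape (T, obs_dim)
    actions: tensor of shape (T,) with integer action indices
    rewards: list of floats [r_0, ..., r_T]
    """
    # R(tau): discounted return of the whole trajectory.
    R = sum(gamma ** t * r for t, r in enumerate(rewards))

    # log pi_theta(a_t | s_t) for every step of the trajectory.
    dist = torch.distributions.Categorical(logits=policy(states))
    log_probs = dist.log_prob(actions)

    # The gradient of sum_t log pi_theta(a_t | s_t) * R(tau) is the estimator above;
    # minimizing its negation makes the optimizer perform gradient *ascent* on U.
    loss = -(log_probs.sum() * R)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Made-up trajectory data purely to show the call signature.
T = 5
states = torch.randn(T, obs_dim)
actions = torch.randint(0, n_actions, (T,))
rewards = [1.0] * T
reinforce_update(states, actions, rewards)
```

In practice the update is averaged over a batch of trajectories, and a baseline is usually subtracted from \(R(\tau)\) to reduce variance, but the core term is exactly the \(\nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t) \, R(\tau)\) product from the formula above.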