Reward to Go Derivation

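We want to show that weighting each action's log-probability gradient by the full return R(τ) gives the same expected gradient as weighting it by the reward-to-go R(τ[t:]), the sum of rewards from step t onward: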

\[\mathbb{E}_{\tau \sim \pi_{\theta}} \left[ \sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta} (a_t | s_t) \, R(\tau) \right] = \mathbb{E}_{\tau \sim \pi_{\theta}} \left[ \sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta} (a_t | s_t) \, R(\tau[t:]) \right]\]
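Before working through the algebra, the claim can be sanity-checked numerically. The sketch below is only an illustration under assumed settings: a made-up MDP with 2 states, 2 actions, 3 steps, a tabular softmax policy, and random transition and reward tables. None of these come from the sources. It enumerates every trajectory, so both expectations are computed exactly rather than sampled.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, horizon = 2, 2, 3  # horizon = number of actions per episode

# All of the following tables are hypothetical, chosen only for illustration.
theta = rng.normal(size=(n_states, n_actions))                    # softmax logits
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
R = rng.normal(size=(n_states, n_actions, n_states))              # R[s, a, s']
mu = np.full(n_states, 1.0 / n_states)                            # initial state distribution

pi = np.exp(theta) / np.exp(theta).sum(axis=1, keepdims=True)     # pi[s, a]


def grad_log_pi(s, a):
    """Gradient of log pi(a | s) w.r.t. theta for a tabular softmax policy."""
    g = np.zeros_like(theta)
    g[s] = -pi[s]
    g[s, a] += 1.0
    return g


def expected_gradients():
    """Exact expectations of the full-return and reward-to-go estimators."""
    full = np.zeros_like(theta)
    togo = np.zeros_like(theta)
    for states in itertools.product(range(n_states), repeat=horizon + 1):
        for actions in itertools.product(range(n_actions), repeat=horizon):
            prob = mu[states[0]]
            rewards = []
            for t in range(horizon):
                s, a, s_next = states[t], actions[t], states[t + 1]
                prob *= pi[s, a] * P[s, a, s_next]
                rewards.append(R[s, a, s_next])
            rewards = np.array(rewards)
            for t in range(horizon):
                g = grad_log_pi(states[t], actions[t])
                full += prob * g * rewards.sum()      # weight by R(tau)
                togo += prob * g * rewards[t:].sum()  # weight by R(tau[t:])
    return full, togo


full_grad, togo_grad = expected_gradients()
print(np.allclose(full_grad, togo_grad))
```

The two gradients should agree up to floating-point error. Any single sampled trajectory will generally give different values for the two estimators; only their expectations match.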

Let's expand the left side by writing the return as a sum of per-step rewards.

\[\mathbb{E}_{\tau \sim \pi_{\theta}} \left[ \sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta} (a_t | s_t) \left( \sum_{t'=0}^{T}R(s_{t'}, a_{t'}, s_{t'+1}) \right) \right]\]

Let's put the sums together.

\[\mathbb{E}_{\tau \sim \pi_{\theta}} \left[ \sum_{t=0}^{T}\sum_{t'=0}^{T} \nabla_{\theta} \log \pi_{\theta} (a_t | s_t) \, R(s_{t'}, a_{t'}, s_{t'+1}) \right]\]

Now take the summations outside of the expectation. This doesn't seem valid at first, since T is the length of the episode sampled inside the expectation. However, we can set T to the length of the longest possible episode (or even let it be infinite) and treat every episode that ends before step T as staying in an absorbing state with zero reward for the remaining steps. With T fixed, the swap is just linearity of expectation.

\[\sum_{t=0}^{T}\sum_{t'=0}^{T} \mathbb{E}_{\tau \sim \pi_{\theta}} \left[ \nabla_{\theta} \log \pi_{\theta} (a_t | s_t) \, R(s_{t'}, a_{t'}, s_{t'+1}) \right]\]
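The padding argument can be illustrated with a small sketch. The reward lists below are invented for illustration, and an average over three episodes stands in for the expectation. Once short episodes are padded with zero-reward steps up to a common length T, averaging the per-episode sums gives the same number as summing the per-step averages.

```python
import numpy as np

# Hypothetical per-step rewards for three episodes of different lengths.
episodes = [[1.0, 0.5], [2.0], [0.3, 0.3, 0.3]]

# Pad every episode to the longest length with zero-reward "absorbing" steps.
T = max(len(ep) for ep in episodes)
padded = np.array([ep + [0.0] * (T - len(ep)) for ep in episodes])

mean_of_sums = padded.sum(axis=1).mean()   # E[ sum_t r_t ]
sum_of_means = padded.mean(axis=0).sum()   # sum_t E[ r_t ]
print(np.isclose(mean_of_sums, sum_of_means))
```

Padding leaves each episode's return unchanged, which is why treating ended episodes as sitting in an absorbing state does not alter the quantity being differentiated.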

We want to show that rewards which occur before an action contribute nothing to the expected gradient term for that action. So, we want to show that when t' < t, the value of the expectation is 0. Let's separate the summation into the cases t' < t and t' >= t.

\[ \begin{align}\begin{aligned}\mathbb{E}_{\tau \sim \pi_{\theta}} \left[ t, t' \right] := \mathbb{E}_{\tau \sim \pi_{\theta}} \left[ \nabla_{\theta} \log \pi_{\theta} (a_t | s_t) \, R(s_{t'}, a_{t'}, s_{t'+1}) \right]\\\textcolor{red}{ \sum_{t=0}^{T} \sum_{t'=0}^{t-1} \mathbb{E}_{\tau \sim \pi_{\theta}} \left[ t, t' \right]} + \sum_{t=0}^{T} \sum_{t'=t}^{T} \mathbb{E}_{\tau \sim \pi_{\theta}} \left[ t, t' \right]\end{aligned}\end{align} \]
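As a quick check on the index bookkeeping, the sketch below enumerates the (t, t') pairs for a hypothetical horizon T = 4. It confirms that the cases t' < t and t' >= t together cover every pair exactly once, so no term is dropped or counted twice by the split.

```python
T = 4  # hypothetical horizon; indices run from 0 to T inclusive

all_pairs = {(t, tp) for t in range(T + 1) for tp in range(T + 1)}
past = {(t, tp) for t in range(T + 1) for tp in range(0, t)}                # t' < t
present_future = {(t, tp) for t in range(T + 1) for tp in range(t, T + 1)}  # t' >= t

print((past | present_future) == all_pairs)  # the two cases cover everything
print(len(past & present_future))            # number of pairs counted twice
```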

Let's expand the left double sum.

\[\sum_{t=0}^{T} \sum_{t'=0}^{t-1} \mathbb{E}_{\tau \sim \pi_{\theta}} \left[ \nabla_{\theta} \log \pi_{\theta} (a_t | s_t) \, R(s_{t'}, a_{t'}, s_{t'+1}) \right]\]

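Since each of these rewards occurs before the action a_t is taken, it is already fixed once we condition on the history up to s_t, so the gradient term can be split off from it. Strictly speaking, the two factors are not independent, because both depend on the earlier states. The factorization below still holds because the expected gradient of the log-probability is zero for every s_t, which makes both sides zero.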

\[\sum_{t=0}^{T} \sum_{t'=0}^{t-1} \textcolor{red}{\mathbb{E}_{\tau \sim \pi_{\theta}} \left[ \nabla_{\theta} \log \pi_{\theta} (a_t | s_t) \right]} \, \mathbb{E}_{\tau \sim \pi_{\theta}} \left[ R(s_{t'}, a_{t'}, s_{t'+1}) \right]\]

Now, we attempt to prove that the left (red) term equals zero via the expected grad-log-prob lemma.

\[ \begin{align}\begin{aligned}\int_{a_t} \pi_{\theta}(a_t|s_t) \, da_t = 1\\\text{Differentiate with respect to } \theta\\\nabla_{\theta} \int_{a_t} \pi_{\theta}(a_t|s_t) \, da_t = 0\\\text{Swap the order of } \nabla_{\theta} \text{ and } \int\\\int_{a_t} \nabla_{\theta} \pi_{\theta}(a_t|s_t) \, da_t = 0\\\text{Apply the log-derivative trick}\\\int_{a_t} \pi_{\theta}(a_t | s_t) \, \nabla_{\theta} \log\pi_{\theta}(a_t|s_t) \, da_t = 0\\\text{Convert to an expectation}\\\mathbb{E}_{a_t \sim \pi_{\theta}(\cdot | s_t)} \left[ \nabla_{\theta} \log\pi_{\theta}(a_t|s_t) \right] = 0\end{aligned}\end{align} \]
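The lemma is easy to check numerically for a discrete action space, where the integral becomes a sum. The sketch below assumes a softmax policy over four actions in one fixed state, with randomly drawn logits standing in for θ. Both choices are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
logits = rng.normal(size=4)          # hypothetical parameters theta for one fixed state

pi = np.exp(logits - logits.max())   # softmax, shifted for numerical stability
pi /= pi.sum()

# For a softmax policy, grad_theta log pi(a) = onehot(a) - pi.
# Row a of `score` holds that gradient for action a.
score = np.eye(len(pi)) - pi

expected_score = pi @ score          # sum_a pi(a) * grad_theta log pi(a)
print(np.allclose(expected_score, 0.0))
```

The result is a zero vector up to rounding, because the probability-weighted one-hot vectors sum to `pi` and cancel the `-pi` term exactly.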

Thus, the previous formula …

doesn't quite work
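One way the gap might be closed, offered as a sketch rather than as the argument from the sources, is to avoid the independence claim and condition on the history instead. For t' < t, the reward R(s_{t'}, a_{t'}, s_{t'+1}) depends only on states and actions up to s_t, so it is a constant inside the inner expectation over a_t. Here h_t is shorthand, introduced for this sketch, for the history (s_0, a_0, ..., s_t).

\[ \begin{align}\begin{aligned}\mathbb{E}_{\tau \sim \pi_{\theta}} \left[ \nabla_{\theta} \log \pi_{\theta} (a_t | s_t) \, R(s_{t'}, a_{t'}, s_{t'+1}) \right] = \mathbb{E}_{h_t} \left[ R(s_{t'}, a_{t'}, s_{t'+1}) \, \mathbb{E}_{a_t \sim \pi_{\theta}(\cdot | s_t)} \left[ \nabla_{\theta} \log \pi_{\theta} (a_t | s_t) \right] \right]\\= \mathbb{E}_{h_t} \left[ R(s_{t'}, a_{t'}, s_{t'+1}) \cdot 0 \right] = 0\end{aligned}\end{align} \]

The inner expectation is zero for every s_t by the expected grad-log-prob lemma, so every term with t' < t vanishes and only the terms with t' >= t, the reward-to-go, remain.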

Sources

https://ai.stackexchange.com/questions/9614/why-does-the-reward-to-go-trick-in-policy-gradient-methods-work

https://spinningup.openai.com/en/latest/spinningup/rl_intro3.html#expected-grad-log-prob-lemma

