Reward to Go Derivation
We want to show that
\[\mathbb{E}_{\tau \sim \pi_{\theta}}
\left[
\sum_{t=0}^{T}
\nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t) \cdot R(\tau)
\right]
=
\mathbb{E}_{\tau \sim \pi_{\theta}}
\left[
\sum_{t=0}^{T}
\nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t) \cdot R(\tau[t:])
\right]\]
Let's expand the left-hand side, writing the return of the trajectory as the sum of its per-step rewards.
\[\mathbb{E}_{\tau \sim \pi_{\theta}}
\left[
\sum_{t=0}^{T}
\nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t) \cdot \left(
\sum_{t'=0}^{T}R(s_{t'}, a_{t'}, s_{t'+1})
\right)
\right]\]
Let's put the sums together.
\[\mathbb{E}_{\tau \sim \pi_{\theta}}
\left[
\sum_{t=0}^{T}\sum_{t'=0}^{T}
\nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t) \cdot \left(
R(s_{t'}, a_{t'}, s_{t'+1})
\right)
\right]\]
Now take the summations outside of the expectation. This doesn't make sense at first, since
T is the length of the episode sampled inside the expectation. However, we can set T to the
length of the longest possible episode (or even let it be infinite) and treat every episode that
ends before step T as staying in an absorbing state, with zero reward, for the remaining steps up to T.
\[\sum_{t=0}^{T}\sum_{t'=0}^{T} \mathbb{E}_{\tau \sim \pi_{\theta}}
\left[
\nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t)
R(s_{t'}, a_{t'}, s_{t'+1})
\right]\]
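The padding convention can be made explicit. As a sketch, assume that a sampled episode actually ends at some step no later than T (written T_{\tau} below, a symbol introduced here for illustration) and that every step spent in the absorbing state yields zero reward. With T then a fixed constant, moving the sums outside is just linearity of expectation, where X_{t,t'} (also introduced here) stands for the product inside the expectation:
\[ \begin{align}\begin{aligned}
R(s_{t'}, a_{t'}, s_{t'+1}) = 0 \quad \text{for } T_{\tau} < t' \le T\\
\mathbb{E}_{\tau \sim \pi_{\theta}}
\left[
\sum_{t=0}^{T}\sum_{t'=0}^{T} X_{t,t'}
\right]
=
\sum_{t=0}^{T}\sum_{t'=0}^{T}
\mathbb{E}_{\tau \sim \pi_{\theta}}
\left[
X_{t,t'}
\right]
\end{aligned}\end{align} \]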
We want to show that rewards which occur before an action have no effect on the gradient term
for that action. So, we want to show that when t' < t, the value of the expectation is 0.
Let's separate the summation into the cases t' < t and t' >= t. The first line below introduces
E[t, t'] as shorthand for the (t, t') term.
\[ \begin{align}\begin{aligned}\mathbb{E}_{\tau \sim \pi_{\theta}}
\left[
t, t'
\right]
:= \mathbb{E}_{\tau \sim \pi_{\theta}}
\left[
\nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t)
R(s_{t'}, a_{t'}, s_{t'+1})
\right]\\\textcolor{red}{
\sum_{t=0}^{T}
\sum_{t'=0}^{t-1} \mathbb{E}_{\tau \sim \pi_{\theta}}
\left[
t, t'
\right]}
+
\sum_{t=0}^{T}
\sum_{t'=t}^{T} \mathbb{E}_{\tau \sim \pi_{\theta}}
\left[
t, t'
\right]\end{aligned}\end{align} \]
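As a small worked case (an illustration, not part of the original derivation), take T = 1. There are four (t, t') pairs, and only the pair (1, 0) has t' < t. For t = 0 the inner sum over t' < t is empty. The split therefore reads:
\[\sum_{t=0}^{1}\sum_{t'=0}^{1} \mathbb{E}_{\tau \sim \pi_{\theta}}
\left[
t, t'
\right]
=
\textcolor{red}{\mathbb{E}_{\tau \sim \pi_{\theta}} \left[ 1, 0 \right]}
+
\mathbb{E}_{\tau \sim \pi_{\theta}} \left[ 0, 0 \right]
+
\mathbb{E}_{\tau \sim \pi_{\theta}} \left[ 0, 1 \right]
+
\mathbb{E}_{\tau \sim \pi_{\theta}} \left[ 1, 1 \right]\]
Each pair appears exactly once, so the two groups together recover the full double sum.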
Let's expand the left (red) double sum.
\[\sum_{t=0}^{T}
\sum_{t'=0}^{t-1} \mathbb{E}_{\tau \sim \pi_{\theta}}
\left[
\nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t)
R(s_{t'}, a_{t'}, s_{t'+1})
\right]\]
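Since the rewards occur before the action, we would like to treat the two factors inside the
expectation as independent and separate them. This is the step that needs care: a_t depends on
s_t, which in turn depends on the earlier states and actions that produced the reward.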
\[\sum_{t=0}^{T}
\sum_{t'=0}^{t-1}
\textcolor{red}{\mathbb{E}_{\tau \sim \pi_{\theta}} \left[
\nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t)
\right]}
\cdot
\mathbb{E}_{\tau \sim \pi_{\theta}} \left[
R(s_{t'}, a_{t'}, s_{t'+1})
\right]\]
Now we attempt to prove that the left (red) term equals zero via the expected grad-log-prob lemma.
\[ \begin{align}\begin{aligned}\int_{a_t} \pi_{\theta}(a_t \mid s_t) \, da_t = 1\\
\text{Differentiate with respect to } \theta\\
\nabla_{\theta} \int_{a_t} \pi_{\theta}(a_t \mid s_t) \, da_t = 0\\
\text{Swap the order of } \nabla_{\theta} \text{ and } \int\\
\int_{a_t} \nabla_{\theta} \pi_{\theta}(a_t \mid s_t) \, da_t = 0\\
\text{Apply the log-derivative trick}\\
\int_{a_t} \pi_{\theta}(a_t \mid s_t) \cdot \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t) \, da_t = 0\\
\text{Convert to an expectation}\\
\mathbb{E}_{a_t \sim \pi_{\theta}(\cdot \mid s_t)} \left[
\nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t)
\right]
= 0\end{aligned}\end{align} \]
Thus, the previous formula …
doesn't quite work
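One standard way to repair the step, sketched here as a suggestion rather than as the argument these notes originally intended, is to avoid the independence claim and condition on the history instead. Write h_t = (s_0, a_0, ..., s_t) for everything up to and including s_t (the symbol h_t is introduced here for illustration). For t' < t the reward R(s_{t'}, a_{t'}, s_{t'+1}) is fully determined by h_t, so it can be pulled out of the inner expectation, and the lemma above applies to what remains because a_t is drawn from the policy given s_t:
\[ \begin{align}\begin{aligned}
\mathbb{E}_{\tau \sim \pi_{\theta}} \left[
\nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t)
R(s_{t'}, a_{t'}, s_{t'+1})
\right]
&= \mathbb{E}_{h_t} \left[
R(s_{t'}, a_{t'}, s_{t'+1}) \cdot
\mathbb{E}_{a_t \sim \pi_{\theta}(\cdot \mid s_t)} \left[
\nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t)
\right]
\right]\\
&= \mathbb{E}_{h_t} \left[
R(s_{t'}, a_{t'}, s_{t'+1}) \cdot 0
\right]\\
&= 0
\end{aligned}\end{align} \]
With every t' < t term equal to zero, only the t' >= t terms of the double sum remain, and for each t those rewards add up to the reward-to-go R(\tau[t:]) on the right-hand side of the identity we set out to show.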