Reward to Go Derivation
============================

We want to show that

.. math::

    \mathbb{E}_{\tau \sim \pi_{\theta}} \left[ \sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta} (a_t | s_t) \, R(\tau) \right] = \mathbb{E}_{\tau \sim \pi_{\theta}} \left[ \sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta} (a_t | s_t) \, R(\tau[t:]) \right]

Here R(τ) is the total reward of the trajectory and R(τ[t:]) is the reward-to-go, i.e. the sum of rewards from step t onward.

Let's expand the left-hand side:

.. math::

    \mathbb{E}_{\tau \sim \pi_{\theta}} \left[ \sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta} (a_t | s_t) \left( \sum_{t'=0}^{T} R(s_{t'}, a_{t'}, s_{t'+1}) \right) \right]

Let's merge the two sums:

.. math::

    \mathbb{E}_{\tau \sim \pi_{\theta}} \left[ \sum_{t=0}^{T} \sum_{t'=0}^{T} \nabla_{\theta} \log \pi_{\theta} (a_t | s_t) \, R(s_{t'}, a_{t'}, s_{t'+1}) \right]

Now move the summations outside of the expectation. At first this looks questionable, since T is the length of the particular episode sampled inside the expectation. However, we can take T to be the longest possible episode length (or even let it be infinite) and treat every episode that ends before step T as remaining in a zero-reward absorbing state for the remaining steps up to T.

.. math::

    \sum_{t=0}^{T} \sum_{t'=0}^{T} \mathbb{E}_{\tau \sim \pi_{\theta}} \left[ \nabla_{\theta} \log \pi_{\theta} (a_t | s_t) \, R(s_{t'}, a_{t'}, s_{t'+1}) \right]

We want to show that rewards collected before an action contribute nothing to that action's term in the gradient, i.e. that the expectation above is 0 whenever t' < t. Let's separate the summation into the cases t' < t and t' >= t, using the shorthand

.. math::

    f(t, t') = \nabla_{\theta} \log \pi_{\theta} (a_t | s_t) \, R(s_{t'}, a_{t'}, s_{t'+1})

    \textcolor{red}{ \sum_{t=0}^{T} \sum_{t'=0}^{t-1} \mathbb{E}_{\tau \sim \pi_{\theta}} \left[ f(t, t') \right]} + \sum_{t=0}^{T} \sum_{t'=t}^{T} \mathbb{E}_{\tau \sim \pi_{\theta}} \left[ f(t, t') \right]

Let's expand the left (red) double sum:

.. math::

    \sum_{t=0}^{T} \sum_{t'=0}^{t-1} \mathbb{E}_{\tau \sim \pi_{\theta}} \left[ \nabla_{\theta} \log \pi_{\theta} (a_t | s_t) \, R(s_{t'}, a_{t'}, s_{t'+1}) \right]

Since the reward at step t' is collected before the action at step t, we would like to treat the two factors inside the expectation as independent and split the expectation:

.. math::

    \sum_{t=0}^{T} \sum_{t'=0}^{t-1} \textcolor{red}{\mathbb{E}_{\tau \sim \pi_{\theta}} \left[ \nabla_{\theta} \log \pi_{\theta} (a_t | s_t) \right]} \, \mathbb{E}_{\tau \sim \pi_{\theta}} \left[ R(s_{t'}, a_{t'}, s_{t'+1}) \right]

Now we attempt to show that the red factor equals zero via the expected grad-log-prob (EGLP) lemma:

.. math::

    \int_{a_t} \pi_{\theta}(a_t | s_t) \, da_t = 1

    \text{Differentiate with respect to } \theta

    \nabla_{\theta} \int_{a_t} \pi_{\theta}(a_t | s_t) \, da_t = 0

    \text{Swap the order of } \nabla_{\theta} \text{ and } \int

    \int_{a_t} \nabla_{\theta} \pi_{\theta}(a_t | s_t) \, da_t = 0

    \text{Apply the log-derivative trick}

    \int_{a_t} \pi_{\theta}(a_t | s_t) \, \nabla_{\theta} \log \pi_{\theta}(a_t | s_t) \, da_t = 0

    \text{Rewrite as an expectation}

    \mathbb{E}_{a_t \sim \pi_{\theta}(\cdot | s_t)} \left[ \nabla_{\theta} \log \pi_{\theta}(a_t | s_t) \right] = 0

Thus, the previous formula ... doesn't quite work as written: the EGLP lemma zeroes out an expectation over a_t with s_t held fixed, while the red factor is an expectation over whole trajectories, and the independence used to split the expectation is not obvious, since s_t itself depends on the transition at step t'. The sources below close the gap by conditioning on the trajectory up to s_t (law of iterated expectations) and applying the EGLP lemma inside that conditional expectation; that step is sketched below, after the Sources.

Sources
----------------

- https://ai.stackexchange.com/questions/9614/why-does-the-reward-to-go-trick-in-policy-gradient-methods-work
- https://spinningup.openai.com/en/latest/spinningup/rl_intro3.html#expected-grad-log-prob-lemma
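Closing the gap
----------------

Following the sources above, here is a sketch of the conditioning step that makes the red term vanish (this is the standard argument from those sources, not something established by the derivation attempt above). Condition on the trajectory prefix (s_0, a_0, ..., s_t), written τ[:t]. For t' < t the reward R(s_t', a_t', s_t'+1) is fully determined by this prefix, so by the law of iterated expectations

.. math::

    \mathbb{E}_{\tau \sim \pi_{\theta}} \left[ f(t, t') \right] = \mathbb{E}_{\tau[:t]} \left[ R(s_{t'}, a_{t'}, s_{t'+1}) \, \mathbb{E}_{a_t \sim \pi_{\theta}(\cdot | s_t)} \left[ \nabla_{\theta} \log \pi_{\theta} (a_t | s_t) \right] \right] = \mathbb{E}_{\tau[:t]} \left[ R(s_{t'}, a_{t'}, s_{t'+1}) \cdot 0 \right] = 0

The inner expectation is exactly the EGLP lemma applied with s_t held fixed, which is the conditioning that the direct independence argument was missing.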
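Numerical sanity check
----------------------

The identity can also be checked numerically. Below is a minimal Monte Carlo sketch under made-up assumptions: a two-state, two-action MDP with random transition and reward tables and a tabular softmax policy (the environment details and helper names such as sample_trajectory and grad_log_pi are illustrative, not part of the derivation above). Averaged over many sampled trajectories, the full-return estimator and the reward-to-go estimator should agree up to sampling noise.

.. code-block:: python

    # Monte Carlo check of the reward-to-go identity in a tiny, made-up MDP.
    import numpy as np

    rng = np.random.default_rng(0)

    n_states, n_actions, T = 2, 2, 5
    P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a] = next-state distribution
    R = rng.normal(size=(n_states, n_actions, n_states))              # R[s, a, s'] = reward
    theta = rng.normal(size=(n_states, n_actions))                    # tabular softmax parameters

    def policy(s):
        """Action probabilities pi_theta(. | s)."""
        z = np.exp(theta[s] - theta[s].max())
        return z / z.sum()

    def grad_log_pi(s, a):
        """Gradient of log pi_theta(a | s) with respect to theta."""
        g = np.zeros_like(theta)
        g[s] = -policy(s)
        g[s, a] += 1.0
        return g

    def sample_trajectory():
        """Roll out one fixed-length episode; return states, actions, rewards."""
        s, states, actions, rewards = 0, [], [], []
        for _ in range(T):
            a = rng.choice(n_actions, p=policy(s))
            s_next = rng.choice(n_states, p=P[s, a])
            states.append(s)
            actions.append(a)
            rewards.append(R[s, a, s_next])
            s = s_next
        return states, actions, rewards

    n_episodes = 100_000
    full_return_grad = np.zeros_like(theta)   # estimator weighted by R(tau)
    reward_to_go_grad = np.zeros_like(theta)  # estimator weighted by R(tau[t:])
    for _ in range(n_episodes):
        states, actions, rewards = sample_trajectory()
        total = sum(rewards)
        to_go = np.cumsum(rewards[::-1])[::-1]  # to_go[t] = sum of rewards from step t onward
        for t in range(T):
            g = grad_log_pi(states[t], actions[t])
            full_return_grad += g * total
            reward_to_go_grad += g * to_go[t]

    # Both averages estimate the same policy gradient, so they should agree
    # up to Monte Carlo noise.
    print(full_return_grad / n_episodes)
    print(reward_to_go_grad / n_episodes)

Increasing n_episodes should shrink the remaining gap between the two averages; the reward-to-go estimator is the lower-variance of the two, which is the practical reason for preferring it.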