Policy-based approach and actor-critic algorithm

Given a policy \pi_\theta(a|s) and transition probability P(s_{t+1}|s_t,a_t), the probability of a trajectory \tau = s_0,a_0,s_1,a_1,s_2,a_2,\cdots,s_T is

    \[\pi_\theta(\tau) = \rho_0(s_0)\prod_{t=0}^{T-1}P(s_{t+1}|s_t,a_t)\pi_\theta(a_t|s_t)\]

Denote the expected return as J(\theta)=E\left[ \sum_{t=0}^T \gamma^t r(s_t,a_t)\right]\triangleq E[r(\tau)]. Let’s try to improve the policy through gradient ascent by updating

    \[\theta_{k+1} \leftarrow \theta_k + \alpha \nabla_\theta J(\theta)|_{\theta_k}\]

Policy gradient and REINFORCE

Let’s compute the policy gradient \nabla_\theta J(\theta),

(1)   \begin{align*} \nabla_\theta J(\theta)& =\nabla_\theta E_{\tau \sim \pi_\theta}[r(\tau)] = \nabla_\theta \int \pi_\theta(\tau) r(\tau) d\tau \\ &=\int \nabla_\theta \pi_\theta(\tau) r(\tau) d\tau =\int \pi_\theta(\tau) \nabla_\theta \log \pi_\theta(\tau) r(\tau) d\tau \\ &=E_{\tau \sim \pi_\theta}\left[ \nabla_\theta \log \pi_\theta(\tau) r(\tau) \right] \\ &=E_{\tau \sim \pi_\theta}\left[ \nabla_\theta \left[\log \rho_0(s_0) + \sum_t \log P(s_{t+1}|s_t,a_t) \right. \right. \\ & \qquad \left. \left. +\sum_{t}\log \pi_\theta(a_t|s_t)\right] r(\tau) \right]\\ &=E_{\tau \sim \pi_\theta}\left[ \left(\sum_{t} \nabla_\theta \log \pi_\theta(a_t|s_t)\right) r(\tau) \right]\\ &=E_{\tau \sim \pi_\theta}\left[ \left(\sum_{t}\nabla_\theta \log \pi_\theta(a_t|s_t)\right) \left(\sum_t \gamma^t r(s_t,a_t)\right) \right] \end{align*}

With N sampled trajectories, we can approximate the gradient as

    \[\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \left( \sum_t \nabla_\theta \log \pi_\theta(a_{i,t}|s_{i,t})\right) \left( \sum_t \gamma^t r(s_{i,t},a_{i,t})\right)\]

The resulting algorithm is known as REINFORCE.
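
To make this concrete, below is a minimal sketch of a REINFORCE update in PyTorch. It is only illustrative and not an implementation from this post: the state/action dimensions, the network architecture, the learning rate, and the random stand-in trajectories are all assumptions.

import torch
import torch.nn as nn

# Minimal REINFORCE sketch (illustrative; network size, gamma, lr are assumptions).
policy = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 2))  # 4-dim state, 2 actions
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)
gamma = 0.99

def reinforce_update(episodes):
    """episodes: list of (states, actions, rewards) tensors, one tuple per trajectory."""
    loss = 0.0
    for states, actions, rewards in episodes:
        logits = policy(states)                          # (T, num_actions)
        log_probs = torch.log_softmax(logits, dim=-1)    # log pi_theta(.|s_t)
        log_pi = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)  # log pi_theta(a_t|s_t)
        # discounted return of the whole trajectory: sum_t gamma^t r(s_t, a_t)
        discounts = gamma ** torch.arange(len(rewards), dtype=torch.float32)
        traj_return = (discounts * rewards).sum()
        # surrogate whose gradient is (sum_t grad log pi) * r(tau); minus sign for ascent
        loss = loss - log_pi.sum() * traj_return
    loss = loss / len(episodes)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# toy usage with random data standing in for N = 3 sampled trajectories of length T = 5
T = 5
episodes = [(torch.randn(T, 4), torch.randint(0, 2, (T,)), torch.rand(T)) for _ in range(3)]
reinforce_update(episodes)

Each trajectory contributes \left(\sum_t \nabla_\theta \log \pi_\theta(a_t|s_t)\right) times its total discounted return, and averaging over the N trajectories gives exactly the Monte Carlo estimate above.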

Actor-critic algorithm

One great thing about policy-gradient methods is that they place far fewer restrictions on the action space than methods like Q-learning. However, one can only learn after a trajectory is finished. Note that, because of causality, rewards collected before the current action cannot be affected by that action. So we can rewrite the policy-gradient approximation in REINFORCE as

(2)   \begin{align*} \nabla_\theta J(\theta) &= E_{\tau \sim \pi_\theta}\left[ \left(\sum_{t=0}^T\nabla_\theta \log \pi_\theta(a_t|s_t)\right) \left(\sum_{t'=0}^T \gamma^{t'} r(s_{t'},a_{t'})\right) \right] \\ &\approx E_{\tau \sim \pi_\theta}\left[ \sum_{t=0}^T\nabla_\theta \log \pi_\theta(a_t|s_t)\sum_{t'=t}^T \gamma^{t'} r(s_{t'},a_{t'})\right]\\ &= E_{\tau \sim \pi_\theta}\left[ \sum_{t=0}^T\nabla_\theta \log \pi_\theta(a_t|s_t) \gamma^t \sum_{t'=t}^T \gamma^{t'-t} r(s_{t'},a_{t'})\right] \\ &\approx E_{\tau \sim \pi_\theta}\left[ \sum_{t=0}^T\nabla_\theta \log \pi_\theta(a_t|s_t) \gamma^t Q(s_t,a_t)\right] \end{align*}

Rather than obtaining the return directly from the trajectory, an actor-critic algorithm estimates it with a critic network Q(s_t,a_t). The policy gradient can then be approximated by (ignoring the discount \gamma)

    \[\nabla_\theta J(\theta) \approx E\left[\sum_{t=0}^T \nabla_\theta \log \underset{\mbox{\tiny actor}}{\underbrace{\pi_\theta(a_t|s_t)}} \underset{\mbox{\tiny critic}}{\underbrace{Q(s_t,a_t)}}\right]\]
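
A rough sketch of how such an actor-critic update can be implemented is shown below. The network architectures, the one-step (SARSA-style) target used to train the critic, and the random stand-in transitions are assumptions made for illustration, not prescriptions from the derivation above.

import torch
import torch.nn as nn

# Sketch of a one-step actor-critic update (architectures and targets are assumptions).
actor = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 2))   # pi_theta(a|s)
critic = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 2))  # Q(s, a), one output per action
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
gamma = 0.99

def actor_critic_update(s, a, r, s_next, a_next):
    # critic: move Q(s,a) toward the one-step target r + gamma * Q(s', a')
    q_sa = critic(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        q_next = critic(s_next).gather(1, a_next.unsqueeze(1)).squeeze(1)
        target = r + gamma * q_next
    critic_loss = (q_sa - target).pow(2).mean()
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # actor: ascend E[ grad log pi(a|s) * Q(s,a) ]; the critic value is treated as a constant
    log_pi = torch.log_softmax(actor(s), dim=-1).gather(1, a.unsqueeze(1)).squeeze(1)
    actor_loss = -(log_pi * q_sa.detach()).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

# toy usage with random transitions standing in for sampled data
B = 8
s, s_next = torch.randn(B, 4), torch.randn(B, 4)
a, a_next = torch.randint(0, 2, (B,)), torch.randint(0, 2, (B,))
r = torch.rand(B)
actor_critic_update(s, a, r, s_next, a_next)

Because the critic bootstraps from single transitions, the actor can be updated before a trajectory finishes, which is the practical gain over REINFORCE noted above.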

Advantage Actor Critic (A2C)

One problem of REINFORCE and the original actor-critic algorithm is high variance. To reduce the variance, we can simply subtract from Q(s,a) its average over actions, V(s). The difference Q(s,a)-V(s)\triangleq A(s,a) is known as the advantage function, and the resulting algorithm is known as the Advantage Actor Critic (A2C) algorithm. That is, the policy gradient is computed as

    \[\nabla_\theta J(\theta) \approx E\left[ \sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_t|s_t) A(s_t,a_t)\right]\]

instead.
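
In implementations, the advantage is often estimated with a single state-value critic V(s) through the one-step TD error r + \gamma V(s') - V(s), which matches Q(s,a)-V(s) in expectation when V is accurate. A minimal A2C-style sketch under that assumption follows; the value network, the loss weighting, and the random stand-in data are likewise assumptions.

import torch
import torch.nn as nn

# Sketch of an A2C-style update using a state-value critic V(s) (details are assumptions).
actor = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 2))   # pi_theta(a|s)
value = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 1))   # V(s)
optimizer = torch.optim.Adam(list(actor.parameters()) + list(value.parameters()), lr=1e-3)
gamma = 0.99

def a2c_update(s, a, r, s_next, done):
    v_s = value(s).squeeze(1)
    with torch.no_grad():
        v_next = value(s_next).squeeze(1)
        target = r + gamma * (1.0 - done) * v_next
    advantage = target - v_s                                # TD-error estimate of A(s,a)

    log_pi = torch.log_softmax(actor(s), dim=-1).gather(1, a.unsqueeze(1)).squeeze(1)
    policy_loss = -(log_pi * advantage.detach()).mean()     # advantage treated as a constant
    value_loss = advantage.pow(2).mean()                    # regress V(s) toward the target

    optimizer.zero_grad()
    (policy_loss + 0.5 * value_loss).backward()
    optimizer.step()

# toy usage with random transitions standing in for sampled data
B = 8
s, s_next = torch.randn(B, 4), torch.randn(B, 4)
a = torch.randint(0, 2, (B,))
r, done = torch.rand(B), torch.zeros(B)
a2c_update(s, a, r, s_next, done)

Subtracting the baseline V(s) does not change the expected gradient, since E_{a\sim\pi_\theta}[\nabla_\theta \log \pi_\theta(a|s)] = 0, but it typically reduces the variance of the estimate substantially.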

 
