Define the policy $\pi_\theta(a_t \mid s_t)$ and the transition probability $p(s_{t+1} \mid s_t, a_t)$. The probability of a trajectory $\tau = (s_1, a_1, \dots, s_T, a_T)$ is then equal to

$$p_\theta(\tau) = p(s_1) \prod_{t=1}^{T} \pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t).$$

Denote the expected return as $J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\!\left[\sum_{t=1}^{T} r(s_t, a_t)\right]$. Let's try to improve a policy through gradient ascent by updating

$$\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta).$$
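To make the definitions concrete, here is a minimal sketch of a stochastic policy $\pi_\theta(a_t \mid s_t)$ as a small neural network, assuming a discrete action space and PyTorch; `PolicyNet`, `obs_dim`, and `n_actions` are illustrative names, not from the original text.

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """A small network mapping a state s_t to a distribution pi_theta(a_t | s_t)."""

    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.distributions.Categorical:
        logits = self.net(state)
        return torch.distributions.Categorical(logits=logits)

# Sampling an action and its log-probability, the quantities the policy
# gradient below is built from.
policy = PolicyNet(obs_dim=4, n_actions=2)
state = torch.randn(4)
dist = policy(state)
action = dist.sample()
log_prob = dist.log_prob(action)   # log pi_theta(a_t | s_t)
```

For a continuous action space, the same structure works with the network outputting the parameters of, e.g., a Gaussian instead of logits.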
Policy gradient and REINFORCE
Let’s compute the policy gradient $\nabla_\theta J(\theta)$:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\!\left[\left(\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right)\left(\sum_{t=1}^{T} r(s_t, a_t)\right)\right]. \tag{1}$$
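One way to see (1) is through the log-derivative trick $\nabla_\theta p_\theta(\tau) = p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)$. Writing $r(\tau) = \sum_{t=1}^{T} r(s_t, a_t)$,

$$\nabla_\theta J(\theta) = \nabla_\theta \int p_\theta(\tau)\, r(\tau)\, d\tau = \int p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, r(\tau)\, d\tau = \mathbb{E}_{\tau \sim p_\theta(\tau)}\!\left[\nabla_\theta \log p_\theta(\tau)\, r(\tau)\right],$$

and since $\log p_\theta(\tau) = \log p(s_1) + \sum_{t} \log \pi_\theta(a_t \mid s_t) + \sum_{t} \log p(s_{t+1} \mid s_t, a_t)$, the initial-state and transition terms do not depend on $\theta$ and vanish under the gradient, leaving exactly (1).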
With $N$ sampled trajectories, we can approximate the gradient as

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \left(\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t})\right)\left(\sum_{t=1}^{T} r(s_{i,t}, a_{i,t})\right).$$
The resulting algorithm is known as REINFORCE.
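As a minimal sketch (not the author's implementation), here is what one REINFORCE update could look like in PyTorch. Random states and rewards stand in for real environment interaction, and `N`, `T`, the learning rate, and the network sizes are arbitrary illustrative choices.

```python
import torch
import torch.nn as nn

# Categorical policy as in the earlier sketch; all sizes are illustrative.
obs_dim, n_actions = 4, 2
policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def sample_trajectory(T=20):
    """Placeholder for environment interaction: random states and rewards,
    actions sampled from pi_theta. Returns per-step log-probs and rewards."""
    log_probs, rewards = [], []
    for _ in range(T):
        state = torch.randn(obs_dim)                  # stands in for s_t
        dist = torch.distributions.Categorical(logits=policy(state))
        action = dist.sample()
        log_probs.append(dist.log_prob(action))       # log pi_theta(a_t | s_t)
        rewards.append(torch.randn(()).item())        # stands in for r(s_t, a_t)
    return log_probs, rewards

# REINFORCE estimate: average over N trajectories of (sum_t log pi) * (sum_t r),
# negated so that one (descent) optimizer step is a gradient-ascent step on J.
N = 8
loss = torch.stack([
    -torch.stack(log_probs).sum() * sum(rewards)
    for log_probs, rewards in (sample_trajectory() for _ in range(N))
]).mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()
```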
Actor-critic algorithm
One great thing about policy-gradient methods is that they place far fewer restrictions on the action space than methods like Q-learning. However, REINFORCE can only learn after a trajectory is finished. Note that, because of causality, rewards received before the current action cannot be affected by that action, so those rewards can be dropped from the weight on each $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$ term. We can therefore write the policy gradient estimate in REINFORCE instead as
$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \left(\sum_{t'=t}^{T} r(s_{i,t'}, a_{i,t'})\right). \tag{2}$$
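The inner sum $\sum_{t'=t}^{T} r(s_{i,t'}, a_{i,t'})$ in (2) is often called the reward-to-go. A minimal sketch of computing it for one trajectory (`reward_to_go` is a hypothetical helper name):

```python
from typing import List

def reward_to_go(rewards: List[float]) -> List[float]:
    """Return [sum_{t' >= t} r_{t'} for each t], the per-step weight in (2)."""
    out = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running += rewards[t]       # accumulate from the end of the trajectory
        out[t] = running
    return out

# Example: rewards [1, 2, 3] give rewards-to-go [6, 5, 3].
assert reward_to_go([1.0, 2.0, 3.0]) == [6.0, 5.0, 3.0]
```

This replaces the full-trajectory return in the REINFORCE sketch above, so each $\nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t})$ is weighted only by rewards that come after it.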
Rather than computing the return from the trajectory's rewards directly, an actor-critic algorithm estimates it with a critic network $\hat{Q}^{\pi}_{\phi}(s_t, a_t)$. So the policy gradient could instead be approximated by (ignoring the discount factor $\gamma$)

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t})\, \hat{Q}^{\pi}_{\phi}(s_{i,t}, a_{i,t}).$$
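A minimal sketch of the corresponding actor update, assuming a hypothetical critic network `q_net` that scores every discrete action of a state as $\hat{Q}^{\pi}_{\phi}(s_t, \cdot)$; how the critic itself is trained (e.g. regression onto observed returns) is omitted, and the batch is a placeholder.

```python
import torch
import torch.nn as nn

obs_dim, n_actions = 4, 2                      # illustrative dimensions
policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
q_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
actor_opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

def actor_loss(states, actions):
    """Surrogate loss whose gradient is minus the actor-critic estimate:
    the average of grad log pi_theta(a_t | s_t) * Q_hat(s_t, a_t)."""
    dist = torch.distributions.Categorical(logits=policy(states))
    log_probs = dist.log_prob(actions)                       # log pi_theta(a_t | s_t)
    with torch.no_grad():                                    # critic values are constants here
        q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    return -(log_probs * q_values).mean()

# One actor update on a batch of (s_t, a_t) pairs gathered from trajectories.
states = torch.randn(32, obs_dim)              # placeholder batch of states
actions = torch.randint(0, n_actions, (32,))   # placeholder batch of actions
actor_opt.zero_grad()
actor_loss(states, actions).backward()
actor_opt.step()
```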
Advantage Actor Critic (A2C)
One problem with REINFORCE and the original actor-critic algorithm is high variance. To reduce the variance, we can simply subtract $Q^{\pi}(s_t, a_t)$ by its average over the actions, i.e. the value function $V^{\pi}(s_t) = \mathbb{E}_{a_t \sim \pi_\theta(a_t \mid s_t)}\!\left[Q^{\pi}(s_t, a_t)\right]$. The difference

$$A^{\pi}(s_t, a_t) = Q^{\pi}(s_t, a_t) - V^{\pi}(s_t)$$

is known as the advantage function, and the resulting algorithm is known as the Advantage Actor Critic (A2C) algorithm. That is, the policy gradient will be computed as

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t})\, A^{\pi}(s_{i,t}, a_{i,t})$$

instead.
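A minimal sketch of how the advantage could be formed and used, assuming a state-value critic $V^{\pi}_{\phi}(s_t)$ and the common one-step estimate $Q^{\pi}(s_t, a_t) \approx r(s_t, a_t) + V^{\pi}_{\phi}(s_{t+1})$ (still ignoring the discount); the network names, sizes, and loss weighting are illustrative choices, since A2C implementations vary.

```python
import torch
import torch.nn as nn

obs_dim, n_actions = 4, 2                      # illustrative dimensions
policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
v_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))

def a2c_losses(states, actions, rewards, next_states):
    """Advantage A(s,a) ~= r + V(s') - V(s), used to weight log-probs as in the
    gradient above, plus a regression loss that fits the value critic."""
    values = v_net(states).squeeze(1)                        # V(s_t)
    with torch.no_grad():
        next_values = v_net(next_states).squeeze(1)          # V(s_{t+1}), held fixed
        targets = rewards + next_values                      # bootstrapped Q estimate (no discount)
    advantages = targets - values                            # A(s_t, a_t)
    dist = torch.distributions.Categorical(logits=policy(states))
    log_probs = dist.log_prob(actions)
    actor_loss = -(log_probs * advantages.detach()).mean()   # advantage-weighted policy gradient
    critic_loss = (values - targets).pow(2).mean()           # fit V to the bootstrapped target
    return actor_loss, critic_loss

# Usage on a placeholder batch of transitions (s_t, a_t, r_t, s_{t+1}).
states = torch.randn(32, obs_dim)
actions = torch.randint(0, n_actions, (32,))
rewards = torch.randn(32)
next_states = torch.randn(32, obs_dim)
actor_loss, critic_loss = a2c_losses(states, actions, rewards, next_states)
(actor_loss + 0.5 * critic_loss).backward()    # single combined backward pass
```

Detaching the advantage keeps the actor term from pushing gradients into the critic; the critic is instead trained only by the squared-error term.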