DDPG and TD3

DDPG

Deep deterministic policy gradient (DDPG) is an actor-critic method. As the name suggests, the action is deterministic given the observation. DDPG consists of an actor network and a critic network. Given an observation, the actor network outputs an appropriate action. Given an observation and an action, the critic network outputs a prediction of the expected return (the Q-value).
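
As a quick sketch (not the exact architecture from the paper; the layer sizes and the names obs_dim, act_dim, and act_limit are assumptions for illustration), the two networks could look like this in PyTorch:

    # Minimal sketch of DDPG's actor and critic networks.
    import torch
    import torch.nn as nn

    class Actor(nn.Module):
        def __init__(self, obs_dim, act_dim, act_limit):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(obs_dim, 256), nn.ReLU(),
                nn.Linear(256, 256), nn.ReLU(),
                nn.Linear(256, act_dim), nn.Tanh(),   # squash to [-1, 1]
            )
            self.act_limit = act_limit

        def forward(self, obs):
            # Deterministic action given the observation.
            return self.act_limit * self.net(obs)

    class Critic(nn.Module):
        def __init__(self, obs_dim, act_dim):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(obs_dim + act_dim, 256), nn.ReLU(),
                nn.Linear(256, 256), nn.ReLU(),
                nn.Linear(256, 1),
            )

        def forward(self, obs, act):
            # Q-value prediction for the (observation, action) pair.
            return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)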

Replay Buffer

A replay buffer stores the (obs, action, reward, obs_next) tuple for each step the agent interacts with the environment. Once the buffer is full, batches of tuples can be sampled from it for training. Meanwhile, new experience can be stored in another buffer, and once the latter buffer is full, it can be swapped with the training buffer.
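
A minimal sketch of the more common single-buffer variant (a fixed-size ring buffer with uniform sampling; the two-buffer swap above could be built from two of these). The capacity and batch size are arbitrary assumptions:

    # Simple replay buffer: a fixed-size deque with uniform random sampling.
    import random
    from collections import deque

    class ReplayBuffer:
        def __init__(self, capacity=100_000):
            self.buffer = deque(maxlen=capacity)  # oldest entries are dropped when full

        def add(self, obs, action, reward, obs_next):
            self.buffer.append((obs, action, reward, obs_next))

        def sample(self, batch_size=64):
            # Uniformly sample a batch of (obs, action, reward, obs_next) tuples.
            return random.sample(self.buffer, batch_size)

        def __len__(self):
            return len(self.buffer)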

Training

A batch sampled from the replay buffer is used to train the critic network and then the actor network. Another batch is then sampled and both networks are trained again. As you will see, there are actually two critic networks and two actor networks in DDPG. The two networks of each type are almost identical; one is simply a delayed or exponentially averaged copy of the other.

Training Actor Networks

We fix the critic network when we train the actor network. Given a tuple (obs, action, reward, obs_next), we only use the obs variable and plug it into the actor network. The current actor network produces action_est, and we feed both action_est and obs into the critic network to get a Q-value estimate. Assuming that the critic network is well trained, the objective here is simply to maximize this estimated Q-value.
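
A sketch of one actor update step, reusing the hypothetical actor and critic modules from the earlier sketch; the names actor, critic, and actor_optimizer are assumptions. Maximizing the Q-estimate is done by minimizing its negative:

    # One actor update step: maximize Q(obs, actor(obs)) with the critic held fixed.
    def update_actor(actor, critic, actor_optimizer, obs_batch):
        action_est = actor(obs_batch)                       # deterministic action from current actor
        actor_loss = -critic(obs_batch, action_est).mean()  # maximize Q  <=>  minimize -Q
        actor_optimizer.zero_grad()
        actor_loss.backward()                               # gradient flows through the critic into the actor
        actor_optimizer.step()                              # only the actor's parameters are updated here
        return actor_loss.item()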

Training Critic Networks

Given a tuple (obs, action, reward, obs_next) and a fixed actor network, we have multiple ways to estimate a Q-value. For example, we can have

Q_est = critic_est(obs, action)

Q_tar = reward + discount * critic_est(obs_next, act_est(obs_next))

Here critic_est and act_est are the critic and actor networks, respectively. To train the critic network, we can minimize the squared difference between Q_est and Q_tar.
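
A sketch of one critic update step along these lines (terminal-state handling is omitted, and the batch layout is an assumption):

    # One critic update step: regress Q_est toward the bootstrapped target Q_tar.
    import torch
    import torch.nn.functional as F

    def update_critic(actor, critic, critic_optimizer, batch, discount=0.99):
        obs, action, reward, obs_next = batch               # batched tensors
        q_est = critic(obs, action)                         # Q_est = critic_est(obs, action)
        with torch.no_grad():                               # do not backpropagate through the target
            q_tar = reward + discount * critic(obs_next, actor(obs_next))
        critic_loss = F.mse_loss(q_est, q_tar)              # squared difference of Q_est and Q_tar
        critic_optimizer.zero_grad()
        critic_loss.backward()
        critic_optimizer.step()
        return critic_loss.item()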

Target Networks

As both Q_est and Q_tar depend on the critic_est network, the naive implementation of critic training tends to have poor convergence. To mitigate this problem, we can introduce a separate copy of the critic network to compute Q_tar. We call this the target critic network and denote it critic_target. In practice, critic_target is simply a delayed copy or an exponential moving average of critic_est. A target actor network, act_target, is introduced in the same manner and is used in place of act_est when computing Q_tar, so that the target does not chase the constantly changing online networks.
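
A sketch of the exponential-average (soft) target update; with target networks, Q_tar above would be computed with critic_target and act_target instead of critic_est and act_est. The rate tau is an assumed hyperparameter:

    # Soft target update: target <- tau * online + (1 - tau) * target.
    import torch

    @torch.no_grad()
    def soft_update(online_net, target_net, tau=0.005):
        for p, p_target in zip(online_net.parameters(), target_net.parameters()):
            p_target.mul_(1.0 - tau)
            p_target.add_(tau * p)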

TD3

TD3 stands for Twin Delayed DDPG. It is essentially DDPG with two additional tricks, as follows.

Delayed Policy Update

The authors found that having an accurate critic network is more important than having an accurate actor network (similar to GANs, where an accurate discriminator is important). Consequently, they suggest “delaying” the policy update: in practice, for example, train the critic on two batches before training the actor on one batch.
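
A simplified sketch of this schedule, reusing the hypothetical update_critic, update_actor, and soft_update helpers from the sketches above; nets is an assumed container bundling the networks and optimizers (the twin critics of the next subsection are not shown here):

    # Delayed policy update: train the critic every step, but update the actor and the
    # target networks only every `policy_delay` critic steps.
    def td3_update(step, batch, nets, policy_delay=2):
        obs, action, reward, obs_next = batch
        update_critic(nets.actor, nets.critic, nets.critic_optimizer, batch)
        if step % policy_delay == 0:
            update_actor(nets.actor, nets.critic, nets.actor_optimizer, obs)
            soft_update(nets.actor, nets.actor_target)
            soft_update(nets.critic, nets.critic_target)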

Clipped Double Q-Learning

Double Q-learning was introduced to address the overestimation of Q-values. The TD3 authors argue that Q is overestimated in DDPG as well, and they state that even double Q-learning is not sufficient to suppress the overestimation. Instead, they introduce “clipped double Q-learning,” which uses two critic networks rather than one to generate the target Q-estimate and aggregates the two estimates by taking their minimum.
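
A sketch of the clipped double-Q target computation; the twin target critics and the target actor are assumed to exist as in the earlier sketches. Both online critics are then regressed toward this same target:

    # Clipped double-Q target: evaluate the target action with two target critics and
    # take the elementwise minimum to get a pessimistic estimate.
    import torch

    @torch.no_grad()
    def clipped_double_q_target(critic1_target, critic2_target, actor_target,
                                reward, obs_next, discount=0.99):
        action_next = actor_target(obs_next)
        q1 = critic1_target(obs_next, action_next)
        q2 = critic2_target(obs_next, action_next)
        return reward + discount * torch.min(q1, q2)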

Policy-based approach and actor critic algorithm

Given a policy \pi_\theta(a|s) and transition probability P(s_{t+1}|s_t,a_t), the probability of a trajectory \tau = s_0,a_0,s_1,a_1,s_2,a_2,\cdots,s_T is equal to

    \[\pi_\theta(\tau) = \rho_0(s_0)\prod_{t=0}^{T-1}P(s_{t+1}|s_t,a_t)\pi_\theta(a_t|s_t)\]

Denote the expected return as J=E\left[ \sum_{t=0}^T \gamma^t r(s_t,a_t)\right]\triangleq E[r(\tau)]. Let’s try to improve the policy through gradient ascent by updating

    \[\theta_{k+1} \leftarrow \theta_k + \alpha \nabla_\theta J(\theta)|_{\theta_k}\]

Policy gradient and REINFORCE

Let’s compute the policy gradient \nabla_\theta J(\theta),

(1)   \begin{align*} \nabla_\theta J(\theta)& =\nabla_\theta E_{\tau \sim \pi_\theta}[r(\tau)] = \nabla_\theta \int \pi_\theta(\tau) r(\tau) d\tau \\ &=\int \nabla_\theta \pi_\theta(\tau) r(\tau) d\tau =\int \pi_\theta(\tau) \nabla_\theta \log \pi_\theta(\tau) r(\tau) d\tau \\ &=E_{\tau \sim \pi_\theta}\left[ \nabla_\theta \log \pi_\theta(\tau) r(\tau) \right] \\ &=E_{\tau \sim \pi_\theta}\left[ \nabla_\theta \left[\log \rho_0(s_0) + \sum_t \log P(s_{t+1}|s_t,a_t) \right. \right. \\ & \qquad \left. \left. +\sum_{t}\log \pi_\theta(a_t|s_t)\right] r(\tau) \right]\\ &=E_{\tau \sim \pi_\theta}\left[ \left(\sum_{t} \nabla_\theta \log \pi_\theta(a_t|s_t)\right) r(\tau) \right]\\ &=E_{\tau \sim \pi_\theta}\left[ \left(\sum_{t}\nabla_\theta \log \pi_\theta(a_t|s_t)\right) \left(\sum_t \gamma^t r(s_t,a_t)\right) \right] \end{align*}

With N sampled trajectories (writing the t-th state-action pair of the i-th trajectory as s_{i,t}, a_{i,t}), we can approximate the gradient as

    \[\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \left( \sum_t \nabla_\theta \log \pi_\theta(a_{i,t}|s_{i,t})\right) \left( \sum_t \gamma^t r(s_{i,t},a_{i,t})\right)\]

The resulting algorithm is known as REINFORCE.
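
As a sketch, the REINFORCE estimate above can be turned into a surrogate loss whose gradient matches the expression. Here policy is assumed to be a network returning a torch Categorical distribution, and each trajectory is assumed to be a (states, actions, rewards) tuple:

    # Surrogate loss whose gradient is (1/N) sum_i (sum_t grad log pi(a_t|s_t)) (sum_t gamma^t r_t).
    import torch

    def reinforce_loss(policy, trajectories, gamma=0.99):
        losses = []
        for states, actions, rewards in trajectories:             # one entry per trajectory
            dist = policy(states)                                 # pi_theta(.|s_t) for all t
            log_probs = dist.log_prob(actions).sum()              # sum_t log pi_theta(a_t|s_t)
            ret = sum(gamma**t * r for t, r in enumerate(rewards))  # sum_t gamma^t r_t
            losses.append(-log_probs * ret)                       # minus sign: gradient *ascent* on J
        return torch.stack(losses).mean()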

Actor-critic algorithm

One great thing about the policy-gradient method is that there is much less restriction on the action space than in methods like Q-learning. However, one can only learn after a trajectory is finished. Note that, because of causality, rewards collected before the current action are not affected by that action. So we can write the policy gradient approximation in REINFORCE instead as

(2)   \begin{align*} \nabla_\theta J(\theta) &= E_{\tau \sim \pi_\theta}\left[ \left(\sum_{t=0}^T\nabla_\theta \log \pi_\theta(a_t|s_t)\right) \left(\sum_{t'=0}^T \gamma^{t'} r(s_{t'},a_{t'})\right) \right] \\ &\approx E_{\tau \sim \pi_\theta}\left[ \sum_{t=0}^T\nabla_\theta \log \pi_\theta(a_t|s_t)\sum_{t'=t}^T \gamma^{t'} r(s_{t'},a_{t'})\right]\\ &\approx E_{\tau \sim \pi_\theta}\left[ \sum_{t=0}^T\nabla_\theta \log \pi_\theta(a_t|s_t) \gamma^t \sum_{t'=t}^T \gamma^{(t'-t)} r(s_{t'},a_{t'})\right] \\ &\approx E_{\tau \sim \pi_\theta}\left[ \sum_{t=0}^T\nabla_\theta \log \pi_\theta(a_t|s_t) \gamma^t Q(a_t,s_t)\right] \end{align*}

Rather than getting the reward-to-go from the trajectory directly, an actor-critic algorithm estimates it with a critic network Q(a_t,s_t). So the policy gradient could instead be approximated by (ignoring the discount \gamma)

    \[\nabla_\theta J(\theta) \approx E\left[\sum_{t=0}^T \nabla_\theta \log \underset{\mbox{\tiny actor}}{\underbrace{\pi_\theta(a_t|s_t)}} \underset{\mbox{\tiny critic}}{\underbrace{Q(a_t,s_t)}}\right]\]
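
A sketch of this estimator as a surrogate loss, with the same assumed policy network and a critic network Q(s,a); the detach keeps the critic fixed while the actor is updated:

    # Actor-critic surrogate loss: weight log pi(a_t|s_t) by the critic's Q(s_t, a_t).
    def actor_critic_loss(policy, critic, states, actions):
        dist = policy(states)                                 # actor: pi_theta(.|s_t)
        log_probs = dist.log_prob(actions)                    # log pi_theta(a_t|s_t) per step
        q_values = critic(states, actions).detach()           # critic: Q(s_t, a_t), no gradient
        return -(log_probs * q_values).mean()                 # gradient ascent on J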

Advantage Actor Critic (A2C)

One problem with REINFORCE and the original actor-critic algorithm is high variance. To reduce the variance, we can simply subtract from Q(s,a) its average over actions, V(s). The difference Q(s,a)-V(s)\triangleq A(s,a) is known as the advantage function, and the resulting algorithm is known as the Advantage Actor Critic (A2C) algorithm. That is, the policy gradient will be computed as

    \[\nabla_\theta J(\theta) \approx E\left[ \sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_t|s_t) A(s_t,a_t)\right]\]

instead.
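
A sketch of the A2C losses; here the advantage is estimated with a learned state-value network V(s) and a one-step bootstrapped return, which is one common choice (an assumption, not the only way to estimate A):

    # A2C surrogate losses: policy loss weighted by the advantage, plus a value loss for V(s).
    import torch
    import torch.nn.functional as F

    def a2c_losses(policy, value_net, states, actions, rewards, next_states, gamma=0.99):
        values = value_net(states).squeeze(-1)                       # V(s_t)
        with torch.no_grad():                                        # one-step return estimate
            targets = rewards + gamma * value_net(next_states).squeeze(-1)
        advantages = targets - values.detach()                       # A(s_t, a_t) estimate
        policy_loss = -(policy(states).log_prob(actions) * advantages).mean()
        value_loss = F.mse_loss(values, targets)                     # train the critic V
        return policy_loss, value_loss                               # (terminal handling omitted)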


Q-learning, SARSA, on/off policy learning

Dong gave a very nice talk (slides here) on reinforcement learning (RL) earlier. I learned Q-learning from an online Berkeley lecture several years ago, but I never had a chance to look into SARSA or grasp the concept of on-policy learning before. I think the talk sorted out some of my thoughts.

A background of RL

A typical RL problem involves the interaction between an agent and an environment. The agent decides on an action based on the current state as it interacts with the environment. Based on this action, the environment transitions stochastically to a new state and returns some reward. The agent’s goal is to devise a policy (i.e., to determine what action to take in each state) that maximizes the total expected reward. We usually model this setup as a Markov decision process (MDP), where the probability of reaching the next state s' depends only on the current state s and the current choice of action a (independent of all earlier states), hence a Markov model.

Policy and value functions

A policy \pi is just a mapping from each state s to an action a. The value function is defined as the expected utility (total reward) of a state given a policy. The two most common value functions are the state-value function, V_\pi(s), which is the expected utility given the current state and the policy, and the action-value function, Q_\pi(s,a), which is the expected utility given the current state, the current action, and the policy.
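
In symbols, with discount factor \gamma and reward r_t at step t, these are

    \[V_\pi(s) = E_\pi\left[\sum_{t=0}^{\infty} \gamma^t r_t \,\middle|\, s_0 = s\right], \qquad Q_\pi(s,a) = E_\pi\left[\sum_{t=0}^{\infty} \gamma^t r_t \,\middle|\, s_0 = s, a_0 = a\right]\]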

Q-Learning

The main difference between RL and a known MDP is that the transition probability p(s'|s,a) is not known in general. If we knew it, together with the reward r(s,a,s') for each state, action, and next state, we could always compute the expected utility of any state and the appropriate action. For Q-learning, the goal is precisely to estimate the Q function of the optimal policy by updating Q as

    \[Q(s,a) \leftarrow (1-\alpha) Q(s,a) + \alpha [r + \gamma (\max_{a'} Q(s',a'))],\]

where we have 0<\alpha\le 1 to control the degree of exponential smoothing in approximating Q(s,a). When \alpha =1, we do not use exponential smoothing at all.

Note that the equation above only describes how we update Q(s,a); it does not describe what action should be taken. Of course, we can exploit the current estimate of Q and select the action that maximizes it. However, the early estimates of Q(s,a) are bad, so pure exploitation works very poorly. Instead, we may simply select the action a at random initially and, as the estimate of Q improves, exploit the knowledge of Q and take the Q-maximizing action more often. This is the exploration vs. exploitation trade-off. An algorithm is called \epsilon-greedy when, with probability \epsilon, it explores by taking a random action and otherwise takes the best action according to the current estimate of Q.
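
A sketch of tabular Q-learning with an \epsilon-greedy behavior policy; env is assumed to follow the classic Gym reset()/step() interface with discrete states and actions, and the hyperparameters are arbitrary:

    # Tabular Q-learning with an epsilon-greedy behavior policy.
    import random
    from collections import defaultdict

    def q_learning(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
        Q = defaultdict(float)                                   # Q[(state, action)], 0 by default
        actions = list(range(env.action_space.n))
        for _ in range(episodes):
            s, done = env.reset(), False
            while not done:
                # epsilon-greedy: explore with probability epsilon, otherwise exploit current Q
                if random.random() < epsilon:
                    a = random.choice(actions)
                else:
                    a = max(actions, key=lambda x: Q[(s, x)])
                s_next, r, done, _ = env.step(a)
                target = r + (0.0 if done else gamma * max(Q[(s_next, x)] for x in actions))
                Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target   # exponential smoothing update
                s = s_next
        return Q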

On policy/Off-policy

In Q-learning, the Q-value is not updated using the action that is actually taken in the next state; the update target uses the maximizing action instead. There are two terminologies that sometimes confuse me.

  • Behavior policy: the policy that actually determines the next action
  • Target policy: the policy that is used to evaluate an action and that we are trying to learn

For Q-learning, the behavior policy and the target policy are apparently not the same, since the action that maximizes Q is not necessarily the action that was actually taken. Q-learning is therefore off-policy.

SARSA

Given an experience (s,a,r,s',a') (that is why it is called SARSA), we instead update the estimate of Q by

    \[Q(s,a) \leftarrow (1-\alpha) Q(s,a) + \alpha [r + \gamma Q(s',a')].\]

It is on-policy because the action a' used in the update target is the action actually chosen by the behavior policy; that is, the target policy is the same as the behavior policy that generated the data.
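
A sketch of tabular SARSA under the same assumptions as the Q-learning sketch above; the only substantive change is that the bootstrap term uses the action a' that will actually be taken next:

    # Tabular SARSA with an epsilon-greedy behavior policy.
    import random
    from collections import defaultdict

    def sarsa(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
        Q = defaultdict(float)
        actions = list(range(env.action_space.n))

        def epsilon_greedy(s):
            if random.random() < epsilon:
                return random.choice(actions)
            return max(actions, key=lambda x: Q[(s, x)])

        for _ in range(episodes):
            s, done = env.reset(), False
            a = epsilon_greedy(s)
            while not done:
                s_next, r, done, _ = env.step(a)
                a_next = epsilon_greedy(s_next)                  # the action that will actually be taken
                target = r + (0.0 if done else gamma * Q[(s_next, a_next)])
                Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
                s, a = s_next, a_next                            # on-policy: keep following the same policy
        return Q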

Off-policy learning has the advantage of being more flexible and sample-efficient, but it can also be less stable (see [6], for example).

