Q-learning, SARSA, on/off policy learning

Dong gave a very nice talk (slides here) on reinforcement learning (RL) earlier. I learned Q-learning from an online Berkeley lecture several years ago, but I never had a chance to look into SARSA or to grasp the concept of on-policy learning. The talk helped me sort out some of my thoughts.

A background of RL

A typical RL problem involves the interaction between an agent and an environment. The agent decides on an action based on the current state. In response to this action, the environment moves stochastically to a new state and returns some reward. The agent’s goal is to devise a policy (i.e., a rule for what action to take in each state) that maximizes the total expected reward. We usually model this setup as a Markov decision process (MDP): the probability of reaching the next state depends only on the current state and the current choice of action, and is independent of all earlier states, hence the Markov property.
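To make the agent/environment loop concrete, here is a minimal sketch of a toy MDP. The two states, two actions, probabilities, and rewards are all made up for illustration; the only point is that the next state and reward are sampled from a distribution that depends solely on the current state and action.

```python
import random

# Hypothetical two-state MDP (all numbers are illustrative assumptions).
# TRANSITIONS[state][action] is a list of (probability, next_state, reward).
TRANSITIONS = {
    "cool": {
        "slow": [(1.0, "cool", 1.0)],
        "fast": [(0.5, "cool", 2.0), (0.5, "hot", 2.0)],
    },
    "hot": {
        "slow": [(1.0, "cool", 1.0)],
        "fast": [(1.0, "hot", -10.0)],
    },
}

def step(state, action, rng=random):
    """Sample (next_state, reward) given only the current state and action."""
    outcomes = TRANSITIONS[state][action]
    weights = [p for p, _, _ in outcomes]
    _, next_state, reward = rng.choices(outcomes, weights=weights, k=1)[0]
    return next_state, reward

# A fixed policy that always plays "slow" earns reward 1.0 on every step.
state, total = "cool", 0.0
for _ in range(5):
    state, reward = step(state, "slow")
    total += reward
```

In an RL setting the agent would only be allowed to call `step`, not to read `TRANSITIONS` directly.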

Policy and value functions

A policy $\pi$ is just a mapping from each state $s$ to an action $a$. A value function is the expected utility (total reward) of a state given a policy. The two most common value functions are the state-value function, $V^\pi(s)$, which is the expected utility given the current state and policy, and the action-value function, $Q^\pi(s, a)$, which is the expected utility given the current action, current state, and policy.
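Written out, the two value functions look roughly as follows. This is a sketch under the usual discounted-reward convention: the discount factor $\gamma \in [0, 1)$ and the per-step reward $r_t$ are notation assumed here, not taken from the talk.

$$
V^\pi(s) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r_t \,\middle|\, s_0 = s\right],
\qquad
Q^\pi(s, a) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r_t \,\middle|\, s_0 = s,\ a_0 = a\right]
$$

For a deterministic policy the two are linked by $V^\pi(s) = Q^\pi(s, \pi(s))$: the value of a state is the value of taking the policy's action in that state.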

Q-Learning

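The main difference between RL and solving an MDP is that the transition probability $P(s' \mid s, a)$ is not known in general. If we knew it, along with the reward for each state, action, and next state, we could always compute the expected utility of any state and action. In Q-learning, the goal is instead to estimate the Q function of the optimal policy directly from experience. After taking action $a$ in state $s$, observing reward $r$, and reaching state $s'$, we update $Q$ as

$$Q(s, a) \leftarrow (1 - \alpha)\, Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') \right]$$

where $0 < \alpha \le 1$ controls the degree of exponential smoothing in approximating $Q(s, a)$ and $\gamma$ is the discount factor. When $\alpha = 1$, we do not use exponential smoothing at all.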
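As a minimal sketch, the tabular form of this update might look like the following. The function name, the dictionary-based table, and the default values of `alpha` and `gamma` are assumptions made for illustration.

```python
from collections import defaultdict

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.5, gamma=0.9):
    """Apply one tabular Q-learning update in place; return the new Q(s, a)."""
    # Bootstrap from the best action in the next state, regardless of
    # which action the agent will actually take there.
    best_next = max(Q[(s_next, a_next)] for a_next in actions)
    target = r + gamma * best_next
    # Exponential smoothing between the old estimate and the new target.
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
    return Q[(s, a)]

# Hypothetical table: unseen (state, action) pairs default to 0.0.
Q = defaultdict(float)
Q[("s1", "left")] = 1.0
Q[("s1", "right")] = 2.0

# target = 1.0 + 0.5 * max(1.0, 2.0) = 2.0; new value = 0.5 * 0.0 + 0.5 * 2.0
new_value = q_learning_update(
    Q, "s0", "right", r=1.0, s_next="s1",
    actions=["left", "right"], alpha=0.5, gamma=0.5,
)
```

Setting `alpha=1.0` overwrites the old estimate with the target, which matches the no-smoothing case.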

Note that the equation above only describes how we update $Q$; it does not say which action should be taken. Of course, we can exploit the estimate of $Q$ and select the action that maximizes it. However, the early estimates of $Q$ are bad, so exploiting them works very poorly. Instead, we may simply select actions at random initially and, as the estimate of $Q$ improves, increasingly exploit it by taking the action that maximizes $Q$. This is the exploration vs. exploitation trade-off. We often denote the probability of exploration as $\epsilon$, and we say an algorithm is $\epsilon$-greedy when it takes the best action according to the current estimate of $Q$ with probability $1 - \epsilon$ and a random action otherwise.
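A minimal sketch of $\epsilon$-greedy action selection follows. It assumes the common convention that `epsilon` is the probability of exploring; the function name and the dictionary-based Q table are illustrative.

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1, rng=random):
    """With probability epsilon explore, otherwise take the greedy action."""
    if rng.random() < epsilon:
        # Explore: pick uniformly at random among all actions.
        return rng.choice(actions)
    # Exploit: pick the action with the highest current Q estimate.
    return max(actions, key=lambda a: Q.get((state, a), 0.0))

# Hypothetical Q table in which "right" currently looks better in state "s0".
Q = {("s0", "left"): 0.2, ("s0", "right"): 0.7}

greedy_choice = epsilon_greedy(Q, "s0", ["left", "right"], epsilon=0.0)
random_choice = epsilon_greedy(Q, "s0", ["left", "right"], epsilon=1.0)
```

With `epsilon=0.0` the choice is always `"right"`; with `epsilon=1.0` it is uniformly random. A common refinement is to start with a large `epsilon` and decay it as the estimate of Q improves.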

On policy/Off-policy

In Q-learning, the Q-value update does not depend on the action the agent actually takes next; the target uses the maximizing action instead. Two terms are relevant here, and they sometimes confuse me.

• Behavior policy: policy that actually determines the next action
• Target policy: policy that is used to evaluate an action and that we are trying to learn

For Q-learning, the behavior policy and the target policy are evidently not the same, since the action that maximizes $Q(s', a')$ is not necessarily the action that is actually taken.

SARSA

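Given an experience $(s, a, r, s', a')$ (that is why it is called SARSA), we instead update the estimate of $Q$ by

$$Q(s, a) \leftarrow (1 - \alpha)\, Q(s, a) + \alpha \left[ r + \gamma\, Q(s', a') \right]$$

where $a'$ is the action the agent actually takes in $s'$, $\alpha$ is the smoothing rate, and $\gamma$ is the discount factor.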

It is on-policy because the data used to update $Q$ (the target policy) comes directly from the behavior policy that generated it.
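A minimal sketch of the tabular SARSA update is below. The function name, the dictionary-based table, and the default `alpha` and `gamma` are illustrative assumptions. The only difference from a max-based update is that the target bootstraps from the action `a_next` that the behavior policy actually chose.

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.5, gamma=0.9):
    """Apply one tabular SARSA update in place; return the new Q(s, a)."""
    # On-policy target: use the action actually taken in s_next,
    # not the maximizing one.
    target = r + gamma * Q.get((s_next, a_next), 0.0)
    Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * target
    return Q[(s, a)]

# Hypothetical table; the agent explored and took "left" in s1 even though
# "right" has the higher estimate.
Q = {("s1", "left"): 1.0, ("s1", "right"): 2.0}

# target = 1.0 + 0.5 * Q(s1, left) = 1.5; new value = 0.5 * 0.0 + 0.5 * 1.5
new_value = sarsa_update(
    Q, "s0", "right", r=1.0, s_next="s1", a_next="left",
    alpha=0.5, gamma=0.5,
)
```

Here the new estimate is 0.75. Had the target used the maximizing action in `s1` (value 2.0), the same experience would have produced 1.0, so exploratory moves directly lower SARSA's estimates while leaving a max-based target unaffected.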

Off-policy learning has the advantage of being more flexible and sample-efficient, but it can also be less stable (see [6], for example).