SARSA - Albert Masoliver's learning site

## Definition **SARSA** (State-Action-Reward-State-Action) is a *model-free*, *on-policy* RL algorithm. Like [[Q-Learning]] it learns an action-value function, but evaluates the *policy actually being followed* rather than the greedy policy. ## Update Rule After observing transition $(s, a, r, s', a')$: $ Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma Q(s', a') - Q(s, a) \right] $ The next action $a'$ is the one *actually taken* by the behavior policy (typically $\epsilon$-greedy on the current $Q$), not necessarily the greedy one. The acronym SARSA comes from the five-tuple: $(s, a, r, s', a')$. ## On-Policy vs Off-Policy | | SARSA | Q-Learning | |---|---|---| | TD target uses | $Q(s', a')$ for behaviour action | $\max_{a'} Q(s', a')$ | | Policy being learnt | Same as behaviour | Greedy (different from behaviour) | | Type | On-policy | Off-policy | ## Behavioural Difference: The Cliff Walk Example Sutton & Barto's classic example. A grid world with a cliff alongside a path to the goal. Stepping off the cliff costs -100. - **Q-learning** learns the optimal *greedy* policy — walking right along the cliff edge for the shortest path. But during $\epsilon$-greedy training, random exploration occasionally pushes it off the cliff. Cumulative training reward is poor. - **SARSA** learns the optimal policy *given* the $\epsilon$-greedy exploration — a safer path away from the cliff edge. Cumulative training reward is higher. After exploration decays, Q-learning's policy is *better* (optimal); during training, SARSA's policy is *safer*. ## When To Prefer SARSA - **Online learning where exploration is risky.** Robots in the real world; medical treatment policies. Safer behaviour during training matters. - **When you want the agent's policy to reflect its actual behaviour** including exploration noise. ## When To Prefer Q-Learning - **Simulated environments** where exploration is free. - **When the goal is the optimal policy** for later deployment. - **Off-policy learning from a fixed dataset** (a common modern setting). ## Expected SARSA A variant: replace $Q(s', a')$ with $\mathbb{E}_{a' \sim \pi}[Q(s', a')]$ — the expected Q over the policy's action distribution. Less variance than vanilla SARSA; works well in practice. ## Related - [[Q-Learning]] - [[Markov Decision Process]] - [[Policy Iteration]] - [[Exploration vs Exploitation]]