## Definition
**SARSA** (State-Action-Reward-State-Action) is a *model-free*, *on-policy* RL algorithm. Like [[Q-Learning]] it learns an action-value function, but evaluates the *policy actually being followed* rather than the greedy policy.
## Update Rule
After observing transition $(s, a, r, s', a')$:
$
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma Q(s', a') - Q(s, a) \right]
$
The next action $a'$ is the one *actually taken* by the behavior policy (typically $\epsilon$-greedy on the current $Q$), not necessarily the greedy one.
The acronym SARSA comes from the five-tuple: $(s, a, r, s', a')$.
## On-Policy vs Off-Policy
| | SARSA | Q-Learning |
|---|---|---|
| TD target uses | $Q(s', a')$ for behaviour action | $\max_{a'} Q(s', a')$ |
| Policy being learnt | Same as behaviour | Greedy (different from behaviour) |
| Type | On-policy | Off-policy |
## Behavioural Difference: The Cliff Walk Example
Sutton & Barto's classic example. A grid world with a cliff alongside a path to the goal. Stepping off the cliff costs -100.
- **Q-learning** learns the optimal *greedy* policy — walking right along the cliff edge for the shortest path. But during $\epsilon$-greedy training, random exploration occasionally pushes it off the cliff. Cumulative training reward is poor.
- **SARSA** learns the optimal policy *given* the $\epsilon$-greedy exploration — a safer path away from the cliff edge. Cumulative training reward is higher.
After exploration decays, Q-learning's policy is *better* (optimal); during training, SARSA's policy is *safer*.
## When To Prefer SARSA
- **Online learning where exploration is risky.** Robots in the real world; medical treatment policies. Safer behaviour during training matters.
- **When you want the agent's policy to reflect its actual behaviour** including exploration noise.
## When To Prefer Q-Learning
- **Simulated environments** where exploration is free.
- **When the goal is the optimal policy** for later deployment.
- **Off-policy learning from a fixed dataset** (a common modern setting).
## Expected SARSA
A variant: replace $Q(s', a')$ with $\mathbb{E}_{a' \sim \pi}[Q(s', a')]$ — the expected Q over the policy's action distribution. Less variance than vanilla SARSA; works well in practice.
## Related
- [[Q-Learning]]
- [[Markov Decision Process]]
- [[Policy Iteration]]
- [[Exploration vs Exploitation]]