Part 4: A Cliffhanger - Comparing SARSA, Q-Learning, and Expected SARSA
If you missed the previous chapter, start here: Part 3: Stepping into the World – Tabular Value-Based Methods
If you spend enough time building Reinforcement Learning (RL) agents, you eventually run into a philosophical dilemma: should your agent evaluate the policy it is actually executing (flaws and all), or should it dream of a perfect world where it never makes mistakes?
I recently ran a series of experiments comparing tabular control algorithms—Monte Carlo plus the Temporal Difference (TD) methods SARSA, Q-Learning, and Expected SARSA—across two classic Gymnasium environments: Cliff Walking and Taxi-v3.
The results highlight exactly why the “on-policy” vs. “off-policy” distinction is so critical in RL, and how a few simple tweaks can drastically change an agent’s behavior.
The Setup: Cliff Walking
Imagine a gridworld where the agent needs to get from the bottom-left to the bottom-right. The catch? The entire bottom edge between the start and the goal is a cliff. Stepping on it yields a catastrophic -100 reward and resets the agent to the start.
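For anyone following along, spinning up the environment is a one-liner in Gymnasium (the seed here is arbitrary):

```python
import gymnasium as gym

# CliffWalking-v0: a 4x12 grid with 48 discrete states and 4 actions
# (up, right, down, left). Every step costs -1; the cliff costs -100.
env = gym.make("CliffWalking-v0")
obs, info = env.reset(seed=0)
```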
I trained both SARSA and Q-Learning using an \(\epsilon\)-greedy policy with a constant exploration rate of \(\epsilon = 0.1\). This means that 10% of the time, the agent ignores its Q-values and takes a uniformly random action.
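Here is a minimal sketch of that action selection (the `q_table` array indexed as `[state, action]` is my own convention, not anything from Gymnasium):

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q_table, state, epsilon=0.1):
    """With probability epsilon explore uniformly at random; otherwise exploit."""
    n_actions = q_table.shape[1]
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))  # explore: ignore the Q-values
    return int(np.argmax(q_table[state]))    # exploit: current best estimate
```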
Here is what happened:
{
"title": { "text": "Performance Comparison: MC vs SARSA vs Q-Learning", "left": "center" },
"tooltip": { "trigger": "axis" },
"legend": { "data": ["Monte Carlo", "SARSA", "Q-learning"], "bottom": 0 },
"xAxis": { "type": "category", "name": "Episodes", "data": ["0", "250", "500", "750", "1000", "1250", "1500", "1750", "2000"] },
"yAxis": { "type": "value", "name": "Sum of Rewards", "min": -200, "max": 0 },
"series": [
{ "name": "Monte Carlo", "type": "line", "smooth": true, "data": [-175, -110, -60, -45, -35, -30, -28, -26, -25], "lineStyle": {"width": 2, "color": "#9C27B0"} },
{ "name": "SARSA", "type": "line", "smooth": true, "data": [-100, -40, -30, -26, -25, -25, -25, -25, -25], "lineStyle": {"width": 3, "color": "#2196F3"} },
{ "name": "Q-learning", "type": "line", "smooth": true, "data": [-110, -75, -75, -75, -75, -75, -75, -75, -75], "lineStyle": {"width": 3, "color": "#f44336"} }
]
}
SARSA Learns the “Safe” Path
SARSA is an on-policy algorithm. Its update rule looks at the exact next action the agent will take. Because SARSA knows it has a 10% chance of making a foolish, random move, it learns a path along the very top of the grid, as far away from the cliff as possible. Its training curve looks great because it stops falling off the cliff early on.
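As an update rule, this means the target bootstraps off `a_next`, the action the agent has already committed to taking. A sketch, with my own variable names and placeholder hyperparameter defaults:

```python
def sarsa_update(q_table, s, a, r, s_next, a_next, alpha=0.5, gamma=1.0):
    """On-policy TD(0) update: the target uses the action actually taken next."""
    td_target = r + gamma * q_table[s_next, a_next]
    q_table[s, a] += alpha * (td_target - q_table[s, a])
```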
Q-Learning Learns the “Optimal” Path (and suffers for it)
Q-Learning is an off-policy algorithm. During its update, it takes the max of the next state’s values, completely ignoring its own \(\epsilon\)-greedy exploration rate. It learns the absolute shortest path right along the edge of the cliff.
The problem? Because the behavior policy still forces it to act randomly 10% of the time, Q-Learning constantly steps right off the cliff during training. Its average reward looks terrible, even though the underlying Q-table actually contains the mathematically optimal route!
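Compare the update to SARSA's above; the only change is the max (again a sketch, same conventions):

```python
import numpy as np

def q_learning_update(q_table, s, a, r, s_next, alpha=0.5, gamma=1.0):
    """Off-policy TD(0) update: the target assumes greedy behavior from s_next."""
    td_target = r + gamma * np.max(q_table[s_next])  # ignores the exploration rate
    q_table[s, a] += alpha * (td_target - q_table[s, a])
```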
Fixing Q-Learning: Epsilon-Decay
If Q-learning knows the optimal path, how do we get it to actually use it without falling? The answer is \(\epsilon\)-decay.
By starting \(\epsilon\) at 1.0 (pure exploration) and slowly decaying it to 0.01 over 2,000 episodes, Q-Learning finally gets to show off. It explores heavily at the beginning (resulting in low initial rewards), but as the random noise fades, its performance shoots up sharply, overtaking SARSA to exploit the optimal path almost perfectly.
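The exact schedule matters less than the destination. A linear anneal like the sketch below is one simple option; the endpoints match my experiment, but the linear shape is just one reasonable choice (exponential decay works too):

```python
def decayed_epsilon(episode, n_episodes=2000, eps_start=1.0, eps_end=0.01):
    """Linearly anneal epsilon from eps_start down to eps_end over training."""
    fraction = min(episode / n_episodes, 1.0)
    return eps_start + fraction * (eps_end - eps_start)
```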
The Takeaway: Q-learning is incredibly powerful, but if you don’t phase out your forced exploration, your agent will constantly self-destruct in high-penalty environments.
The Real Hero: Expected SARSA and the Variance Problem
Standard SARSA updates its Q-values based on a single, randomly sampled next action. This introduces variance. If you crank your learning rate (\(\alpha\)) too high, standard SARSA will bootstrap off its own random mistakes and destroy its Q-table.
To prove this, I implemented Expected SARSA. Instead of taking the max (like Q-learning) or a random sample (like standard SARSA), it calculates the exact probability-weighted average of all possible next actions:
\[Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \sum_a \pi(a|S_{t+1}) Q(S_{t+1}, a) - Q(S_t, A_t) \right]\]

I ran a sweep, testing asymptotic performance against the step size \(\alpha\) from 0.1 all the way to 1.0.
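Before the results, here is roughly how that expectation is computed for an \(\epsilon\)-greedy policy (a sketch; `expected_sarsa_target` is my own helper name):

```python
import numpy as np

def expected_sarsa_target(q_table, r, s_next, epsilon=0.1, gamma=1.0):
    """TD target: expectation of Q(s_next, .) under the epsilon-greedy policy."""
    n_actions = q_table.shape[1]
    # epsilon-greedy puts epsilon/|A| probability on every action, plus the
    # remaining (1 - epsilon) mass on the current greedy action.
    probs = np.full(n_actions, epsilon / n_actions)
    probs[np.argmax(q_table[s_next])] += 1.0 - epsilon
    return r + gamma * probs @ q_table[s_next]
```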
{
"title": { "text": "Asymptotic Performance vs Step Size (α)", "left": "center" },
"tooltip": { "trigger": "axis" },
"legend": { "data": ["Expected SARSA", "SARSA", "Q-learning"], "bottom": 0 },
"xAxis": { "type": "category", "name": "Step Size (α)", "data": ["0.1", "0.2", "0.3", "0.4", "0.5", "0.6", "0.7", "0.8", "0.9", "1.0"] },
"yAxis": { "type": "value", "name": "Sum of rewards", "min": -150, "max": 0 },
"series": [
{ "name": "Expected SARSA", "type": "line", "data": [-25, -25, -25, -25, -25, -25, -25, -25, -25, -25], "lineStyle": {"width": 3, "color": "#4CAF50"} },
{ "name": "SARSA", "type": "line", "data": [-25, -27, -32, -40, -55, -70, -85, -100, -120, -140], "lineStyle": {"width": 3, "color": "#2196F3"} },
{ "name": "Q-learning", "type": "line", "data": [-75, -75, -75, -75, -75, -75, -75, -75, -75, -75], "lineStyle": {"width": 3, "color": "#f44336", "type": "dashed"} }
]
}
The results are striking. As \(\alpha\) approaches 1.0, standard SARSA’s performance plummets. It simply cannot handle the noisy updates. Expected SARSA, however, stays flat and stable across all learning rates. Because it uses the true mathematical expectation over the policy’s action probabilities, it eliminates the variance introduced by sampling the next action.
A QUICK CODING TRAP
If you implement Expected SARSA, remember to check for terminal states! If your agent falls off the cliff, the episode ends, and there is no next state to take an expectation over. You must truncate the TD target to just the immediate reward, without adding the future expected value, or your math will break.
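Concretely, with Gymnasium’s step API and the `expected_sarsa_target` helper sketched above, the inner loop looks something like this (`a`, `epsilon`, and `alpha` come from the surrounding training loop):

```python
# Gymnasium's step() returns (obs, reward, terminated, truncated, info).
s_next, r, terminated, truncated, info = env.step(a)

if terminated:
    td_target = r  # episode over: no next state, so the expectation term is zero
else:
    td_target = expected_sarsa_target(q_table, r, s_next, epsilon)

q_table[s, a] += alpha * (td_target - q_table[s, a])
```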
When Environments Forgive: Taxi-v3
To make sure these behaviors weren’t just flukes, I ran the exact same algorithms on the Taxi-v3 environment.
In Taxi-v3, there is no “cliff.” Taking a random or suboptimal step usually just results in a standard -1 time penalty.
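Swapping environments is a one-line change:

```python
import gymnasium as gym

# Taxi-v3: 500 discrete states, 6 actions (4 moves plus pickup and dropoff).
env = gym.make("Taxi-v3")
obs, info = env.reset(seed=0)
```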
{
"title": { "text": "Performance on Taxi-v3", "left": "center" },
"tooltip": { "trigger": "axis" },
"legend": { "data": ["SARSA", "Q-learning", "Expected SARSA"], "bottom": 0 },
"xAxis": { "type": "category", "name": "Episodes", "data": ["0", "500", "1000", "1500", "2000"] },
"yAxis": { "type": "value", "name": "Sum of Rewards", "min": -300, "max": 20 },
"series": [
{ "name": "SARSA", "type": "line", "smooth": true, "data": [-280, -50, 5, 8, 8], "lineStyle": {"width": 2, "color": "#2196F3"} },
{ "name": "Q-learning", "type": "line", "smooth": true, "data": [-280, -60, 2, 7, 7], "lineStyle": {"width": 2, "color": "#f44336"} },
{ "name": "Expected SARSA", "type": "line", "smooth": true, "data": [-280, -45, 6, 8, 8], "lineStyle": {"width": 2, "color": "#4CAF50"} }
]
}
In a forgiving environment, the “safe” path and the “optimal” path are exactly the same thing. Because a random exploratory step merely delays success rather than instantly destroying the episode’s return, SARSA, Q-learning, and Expected SARSA all converged to the same routing behavior and essentially the same average reward.
Final Thoughts
Running these algorithms side by side really clarifies the math behind the textbook.
- Use SARSA if you want your agent to care about the mistakes it makes during training.
- Use Q-Learning if you want the absolute optimal policy (but remember to decay your exploration!).
- Use Expected SARSA if you want the safety of on-policy learning but need to run a high learning rate without your agent losing its mind.
Code for these implementations is available on my GitHub.