- Part 6: Cutting Out the Middleman – Policy Gradient Methods
- Part 7: The Complete Alignment Pipeline — From SFT to Advanced RL
  A complete guide to how PPO, DPO, and GRPO transform language models from pattern copiers into reasoning agents.
- Part 5: Breaking the Table – Function Approximation
- Part 4: A Cliffhanger - Comparing SARSA, Q-Learning, and Expected SARSA
- 10 RL Stability Tests That Detect Collapse Before Rewards Drop
  How to catch reinforcement learning collapse early with the right stability tests, while headline reward metrics still say everything is fine.