Replicating R1-Zero on a Countdown task
When DeepSeek's R1 paper landed in January 2025, the headline result was a reasoning model trained with reinforcement learning alone, no supervised distillation, that developed chain-of-thought behavior purely from the RL signal. R1-Zero, the variant that skips the cold-start supervised phase, was the more philosophically interesting one. Pure RL. The model discovers reasoning because reasoning is what earns the reward.
The TinyZero project published a small-scale reproduction pipeline. Its training script, train_tiny_zero.sh, runs RL on a Countdown numbers task at a scale that fits on a single consumer GPU. The task is the arithmetic game where you're given a set of numbers and a target and have to combine the numbers with basic operations to hit the target. Solvable by hand, easy to score automatically (you either reached the target or you didn't), and large enough in state space that the model has room to discover non-trivial reasoning strategies.
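That scoring property is the whole trick: the reward is verifiable. Here is a minimal sketch of what such a verifier can look like, in the spirit of TinyZero's reward function but not its actual code (the `<answer>` tag convention and the use-each-number-at-most-once rule are assumptions):

```python
import re

def countdown_reward(response: str, nums: list[int], target: int) -> float:
    """Binary reward for a Countdown rollout: 1.0 if the extracted
    expression reaches the target using only the given numbers, else 0.0.
    Sketch only; the <answer> tag format is assumed, not TinyZero's code.
    """
    match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    if match is None:
        return 0.0
    expr = match.group(1).strip()

    # Refuse anything that isn't plain integer arithmetic before evaluating.
    if not re.fullmatch(r"[\d+\-*/() ]+", expr):
        return 0.0

    # Each given number may be used at most once (assumed rule; some
    # variants require every number to be used exactly once).
    pool = list(nums)
    for token in re.findall(r"\d+", expr):
        n = int(token)
        if n in pool:
            pool.remove(n)
        else:
            return 0.0

    try:
        # Safe-ish: the character whitelist above rules out names and calls.
        value = eval(expr, {"__builtins__": {}}, {})
    except Exception:  # division by zero, malformed expression, ...
        return 0.0
    return 1.0 if abs(value - target) < 1e-6 else 0.0
```

So `countdown_reward("<answer>(100 - 4) / 2</answer>", [100, 4, 2], 48)` returns 1.0. A whitelist-then-eval check is the lazy-but-serviceable approach for a toy verifier; a production reward would parse the expression properly.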
timeseries (poorly named, in retrospect: this replication predates the repo's eventual time-series work, and the name stuck) is my replication attempt. The repository is mostly a thin layer over TinyZero with one practical addition: a working train_tiny_zero.sh invocation with data.train_batch_size=256 and data.val_batch_size=1312 hardcoded in, because at the time the command-line override path was broken and the only way to get the right batch sizes through was to edit the script.
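For concreteness, the hardcoded invocation looks roughly like this. Reconstructed from memory: the two batch-size values are the real ones, while the entry point and the elided flags follow TinyZero's upstream script and should be checked against it.

```bash
# From train_tiny_zero.sh. The two batch sizes are pinned in the script
# itself because passing them as command-line overrides did not work
# at the time.
python3 -m verl.trainer.main_ppo \
    data.train_batch_size=256 \
    data.val_batch_size=1312
    # ...model, actor, and critic overrides follow in the real script
```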
(The note in the README is a souvenir of that frustration. It says "ADJUST! Commandline handover of override does not work." Two years later it's probably been fixed upstream. I left the note because at the time it was the only thing standing between someone with this repo and a working training run.)
The replication itself behaved roughly as the paper predicted at this scale: the model's outputs got more verbose, the structure became more deliberate, the success rate on held-out countdown problems climbed. The interesting part wasn't reaching the target accuracy — it was watching the model's outputs change qualitatively as RL training proceeded. Early in training, the model guesses. By mid-training it's writing out partial computations. By late training it's hedging, backtracking, restating the problem.
R1-Zero's broader claim — that pure RL on a verifiable reward is sufficient to elicit reasoning — held up in miniature. The countdown task is small enough that you can stare at the trajectories and see the reasoning emerge.
What becomes possible: a single-GPU reproduction of one of the more philosophically significant RL results of the year, on hardware you already own.