
DeepSeek-R1: RL-trained reasoning and distillation

[Cover image: a whale swimming in the ocean. Photo by Todd Cravens]

Overview: what DeepSeek-R1 is optimizing for

For a long time, large language models (LLMs) primarily relied on massive supervised datasets and human feedback to become more “aligned” or user-friendly. Yet these conventional methods can struggle to teach deep, multi-step reasoning. Many LLMs learn to produce plausible text, which is helpful for casual Q&A but less reliable for systematically solving math, code, or logic problems whose answers can be checked for correctness (even if they are often quite good at those tasks).

Recently, we’ve seen more models explicitly optimized for verifiable, multi-step reasoning. OpenAI released models like o1 and o3-mini; while technical details are limited, the open-source community has been working hard to reproduce similar behaviors.

Around the same time, DeepSeek released a new family of models in this direction.

DeepSeek-R1 and DeepSeek-R1-Zero are both built on a 671B-parameter Mixture-of-Experts (MoE) transformer (DeepSeek-V3-Base), but they diverge in training philosophy: R1-Zero is trained purely with reinforcement learning on verifiable tasks, while R1 adds a small supervised “cold-start” set, multiple RL stages, and preference alignment.

Together, they show a recipe for improving reasoning with automated pass/fail feedback on verifiable tasks, then packaging the results into smaller models via distillation. Below, I will explain what these models are, how Group Relative Policy Optimization (GRPO) sidesteps the usual complexities of RL at this scale, and how the resulting reasoning behaviors are distilled into smaller models.


The Two Models in Brief: R1-Zero vs. R1

Shared Architecture, Different Journeys

Both models share a Mixture-of-Experts transformer architecture with a total of 671B parameters, while only ~37B parameters are “active” for any given token. This gating approach amplifies capacity without linearly ballooning compute. They also support a 128K context window (useful for long chain-of-thought (CoT) reasoning) and incorporate an auxiliary Multi-Token Prediction (MTP) training objective that predicts multiple future tokens (inference is still standard next‑token decoding).
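To see why only a fraction of the parameters fire per token, here is a toy sketch of top-k expert routing in Python. The expert count, dimensions, and gating details are illustrative stand-ins, not DeepSeek-V3’s actual configuration.

```python
# Toy sketch of top-k expert routing (illustrative sizes, not DeepSeek-V3's real config).
import numpy as np

rng = np.random.default_rng(0)

n_experts, d_model, top_k = 8, 16, 2                       # toy sizes
router_w = rng.normal(size=(d_model, n_experts))           # router projection
expert_w = rng.normal(size=(n_experts, d_model, d_model))  # one weight matrix per expert

def moe_layer(x):
    """Route a single token vector x to its top-k experts only."""
    logits = x @ router_w                                   # score every expert
    top = np.argsort(logits)[-top_k:]                       # keep the k highest-scoring experts
    gates = np.exp(logits[top]) / np.exp(logits[top]).sum() # softmax over the chosen experts
    # Only the selected experts' parameters are used for this token.
    return sum(g * (x @ expert_w[i]) for g, i in zip(gates, top))

token = rng.normal(size=d_model)
print(moe_layer(token).shape)  # (16,) -- capacity of 8 experts, compute of only 2
```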

But the fine-tuning process for each model is drastically different:

  1. DeepSeek-R1-Zero

    • Trains entirely via RL with no supervised data.
    • Starts RL “cold”, directly from the base model: it learns purely from pass/fail signals (can the code solve the problem? is the math answer correct?).
    • Avoids learned reward models (to prevent “reward hacking”), relying on rule-based checks for correctness and format.
    • Yields high accuracy on tasks with unambiguous answers, though the chain-of-thought can be verbose or chaotic.
  2. DeepSeek-R1

    • Multi-stage approach with a small supervised “cold-start” set, multiple RL phases, preference alignment, and rejection sampling.
    • Maintains R1-Zero’s core problem-solving but polishes style, consistency, and alignment for broader usage.
    • More readable chain-of-thought, less random language mixing, and more practical usefulness as an AI assistant.

How DeepSeek-R1-Zero Was Trained: Pure RL, No Human Data

Cold-Start RL: No SFT, No Problem

Typically, training a reasoning LLM without human-provided solutions seems like a recipe for unstable learning, yet R1-Zero shows it can be done. The authors took DeepSeek-V3-Base (pre-trained on ~14.8T tokens of text) and directly applied reinforcement learning on tasks with deterministic checks: math problems with exact final answers, coding challenges with unit tests, and logic puzzles whose solutions can be verified automatically.

This setup is reminiscent of training game agents like AlphaZero, where outcome-level signals drive the entire learning process. The difference is that R1-Zero deals with open-ended text. Yet with enough pass/fail tasks (math, coding challenges, logic puzzles), it can glean the entire “how to reason” skill from repeated trial and error.

If you’re unfamiliar with Supervised Fine-Tuning (SFT): it’s training on prompt → ideal completion examples.

[Figure: supervised fine-tuning illustrated as a set of instructions paired with ideal outputs]
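For the concrete-minded, here is a minimal sketch of the SFT objective: cross-entropy on the completion tokens only, with the prompt masked out. It assumes a Hugging Face-style model that returns `.logits`; it is not DeepSeek’s actual training code.

```python
# Minimal sketch of supervised fine-tuning: maximize the likelihood of the ideal
# completion given the prompt. Assumes an HF-style causal LM; placeholder, not DeepSeek's pipeline.
import torch
import torch.nn.functional as F

def sft_loss(model, prompt_ids, completion_ids):
    """prompt_ids, completion_ids: 1-D LongTensors. Loss on completion tokens only."""
    input_ids = torch.cat([prompt_ids, completion_ids], dim=-1)
    labels = input_ids.clone()
    labels[: prompt_ids.shape[-1]] = -100          # ignore loss on the prompt
    logits = model(input_ids.unsqueeze(0)).logits  # [1, seq, vocab]
    return F.cross_entropy(
        logits[0, :-1].float(),                    # position t predicts token t+1
        labels[1:],
        ignore_index=-100,
    )
```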

Ground-Truth Rewards, No Learned Reward Models

One key design choice was to avoid learned reward models altogether. Learned critics can be gamed (“reward hacking”), or they might misjudge correctness if they only see approximate signals. By sticking to deterministic checks (did the code compile and solve the test cases? did the final numeric answer match the ground truth?), R1-Zero gets a clear correctness signal. This also simplifies the training pipeline: no large value network to maintain, and no risk that a flawed critic reinforces the wrong behavior.
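As a rough sketch, an outcome-level rule-based reward might look like the snippet below. The tag-based format mirrors the kind of template the paper describes, but the exact checks are my own simplification, not DeepSeek’s implementation.

```python
# Sketch of outcome-level, rule-based rewards: one term for format, one for correctness.
import re

def format_reward(response: str) -> float:
    """Reward responses that keep reasoning inside <think>...</think> tags."""
    return 1.0 if re.search(r"<think>.+?</think>", response, re.DOTALL) else 0.0

def accuracy_reward(response: str, ground_truth: str) -> float:
    """Exact-match check on the final answer; for code, this would be running the test suite."""
    match = re.search(r"<answer>(.+?)</answer>", response, re.DOTALL)
    answer = match.group(1).strip() if match else ""
    return 1.0 if answer == ground_truth.strip() else 0.0

def total_reward(response: str, ground_truth: str) -> float:
    return accuracy_reward(response, ground_truth) + format_reward(response)

print(total_reward("<think>3*4=12, minus 5 is 7</think><answer>7</answer>", "7"))  # 2.0
```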

Emergent Self-Reflection and “Aha Moments”

Even though R1-Zero never sees annotated step-by-step solutions, it begins to generate them anyway. Midway through training, it discovers that longer, more careful chain-of-thought outputs yield higher pass rates. Over thousands of RL steps, these improvements compound:

  1. The model attempts a solution (possibly short at first).
  2. If it fails, it gets zero reward; if correct, it’s reinforced.
  3. Eventually, more thorough reasoning sequences happen to succeed.
  4. The RL update “nudges” the policy to produce even longer solutions next time.
  5. This self-reinforcing dynamic leads to an emergent “reflection” or “verification” habit, purely to maximize pass/fail success rates.

The result is unpolished reasoning. R1-Zero’s chain-of-thought can be verbose, or randomly peppered with partial code or pseudo-math if that happened to be rewarded at some point. It isn’t aligned for clarity or single-language output, yet it remains strong on tasks that can be automatically verified.


Group Relative Policy Optimization (GRPO): RL without the Giant Critic

[Figure: overview of Group Relative Policy Optimization]

Why GRPO Instead of PPO or Value Models?

Traditional RL for language models (like RLHF) often pairs a large policy with an equally large value (critic) network. That’s computationally daunting for a 671B-parameter model. Moreover, LLM outputs are highly non-deterministic, making it tricky for the critic to estimate expected returns. GRPO (Group Relative Policy Optimization) sidesteps these challenges. Jay Alammar has a good visual explanation that matches the sequence below.

  1. Sampling: For each prompt, the model generates multiple responses (“a group”).

    [Figure: GRPO samples multiple responses for each prompt]
  2. Reward Calculation: Each response is tested (did it pass all code tests? match the math answer?).

    [Figure: each response is tested and scored]
  3. Relative Advantages: Compare each response’s reward to the group’s average; better-than-average samples get a positive advantage, below-average samples a negative one.

    [Figure: each response’s reward compared to the group average]
  4. Update: Apply a PPO-like clipped objective, but compute advantages from the group’s reward distribution rather than from a learned value baseline.

This approach drastically reduces memory overhead, since no large critic network is needed. It also avoids “reward hacking” that can happen if a learned critic is imperfect.
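To make step 3 concrete, here is a minimal sketch of the group-relative advantage: each sample’s reward is normalized by the mean and standard deviation of its group. The PPO-style clipped policy update that consumes these advantages is omitted.

```python
# Minimal sketch of GRPO's group-relative advantages: each sampled response is
# scored against the other samples for the same prompt (clipped policy update omitted).
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """rewards: one scalar reward per sampled response for a single prompt."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)   # better-than-average > 0, worse < 0

# e.g. 4 samples for one prompt: two pass the verifier, two fail
print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))  # [ 1. -1.  1. -1.]
```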

How GRPO Spurs Longer Reasoning

When the model’s own longer chain-of-thought solutions outscore shorter ones, GRPO effectively says, “Do more of that.” Because each prompt sees multiple samples, an “extra-thorough” solution that passes the test suite stands out among the group, reaping a higher advantage. Over time, the model invests more tokens into “thinking,” leading to emergent self-reflection and iterative checking, which can raise pass rates on verifiable tasks.


The Additional Phases That Produced DeepSeek-R1

While R1-Zero proved that pure RL can yield strong emergent reasoning, it also had some liabilities: messy chain-of-thought, random language mixing, and minimal alignment to typical user preferences. DeepSeek-R1 takes a more measured, multi-phase route. Jay Alammar has a good visual explanation that matches the phases below.

Phase 1: Small “Cold-Start” Supervised Fine-Tuning

DeepSeek-R1 begins by fine-tuning the base model (DeepSeek-V3-Base) on thousands of cold-start examples to seed readable reasoning before RL. This dataset contains carefully verified solutions (complete with chain-of-thought) to combat the “cold start” problem, where a purely RL-trained model can produce unreadable or disorganized text. This initial step gives the model a readable, well-structured reasoning style to build on before any reinforcement learning begins.

[Figure: the cold-start supervised fine-tuning phase]

Phase 2: RL for Reasoning Accuracy

Following the cold start, the model is trained via reasoning-oriented reinforcement learning, taking inspiration from approaches used in earlier “Zero”-style training. Large sets of math, coding, and logic tasks each have a pass/fail test, providing the core signal:

  1. Generate candidate solutions.
  2. Reinforce those that pass correctness checks (uses GRPO).
  3. Include an extra reward term to keep the target language consistent (preventing multilingual drift).

By the end, the model achieves strong problem-solving capabilities similar to previous RL-based methods, but with far cleaner style and structure thanks to the initial supervised step.
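As a rough illustration of the language-consistency term in step 3, the snippet below rewards the proportion of chain-of-thought tokens written in the target language. Counting ASCII-only words as “English” is a crude stand-in for a real language detector, and how this term is weighted against the accuracy reward is an assumption here.

```python
# Rough sketch of a language-consistency reward: the proportion of the
# chain-of-thought written in the target language (ASCII check is a crude stand-in).
def language_consistency_reward(chain_of_thought: str) -> float:
    words = chain_of_thought.split()
    if not words:
        return 0.0
    target_language = sum(all(ord(c) < 128 for c in w) for w in words)
    return target_language / len(words)        # 1.0 if the whole CoT stays in one language

print(language_consistency_reward("Let x = 3, then 2x equals 6"))   # 1.0
print(language_consistency_reward("Let x = 3, 那么 2x 等于 6"))      # mixed-language CoT scores lower
```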

[Figure: the reasoning-oriented RL phase]

Phase 3: Generating a Massive Verified Dataset via Rejection Sampling

With the RL stage stabilized, the authors then leveraged this improved model to produce large-scale synthetic reasoning data:

  1. Pose fresh or more diverse problems.
  2. Collect multiple candidate solutions.
  3. Reject low-quality or incorrect outputs using a mix of rule-based checks and a reward model.
  4. Keep only well-structured, accurate chains of thought.

Through this rejection sampling, they amassed around 600,000 valid reasoning samples. Additionally, about 200,000 non-reasoning or general-purpose samples were included, bringing the total dataset size to roughly 800,000 examples.
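Schematically, the rejection-sampling loop looks something like this; `generate`, `passes_rule_checks`, and `reward_model_score` are hypothetical placeholders for the model call and the two kinds of filters mentioned above.

```python
# Schematic of rejection sampling: sample several candidates per prompt, drop anything
# that fails the rule-based checks or scores poorly under a reward model, and keep the
# survivors as new SFT data. All callables here are placeholders.
def build_sft_dataset(prompts, generate, passes_rule_checks, reward_model_score,
                      n_samples=8, min_score=0.5):
    kept = []
    for prompt in prompts:
        candidates = [generate(prompt) for _ in range(n_samples)]
        for cand in candidates:
            if not passes_rule_checks(prompt, cand):           # e.g. answer matches, format is clean
                continue
            if reward_model_score(prompt, cand) < min_score:   # used where no exact check exists
                continue
            kept.append({"prompt": prompt, "completion": cand})
    return kept
```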

[Figure: the verified-dataset generation phase]

Phase 4: Second Supervised Fine-Tuning

In this step, DeepSeek-V3-Base (the underlying base for DeepSeek-R1) is fine-tuned again, this time on the newly generated 800k-sample dataset. This large-scale supervised phase distills the best RL-derived reasoning patterns, alongside the general-purpose data, into a single consistent, readable model that the final alignment stage can build on.

[Figure: the second supervised fine-tuning phase]

Phase 5: Final RL Stage with Preference Alignment

Finally, DeepSeek-R1 undergoes an additional RL-based phase focused on preference alignment. Unlike purely objective math or coding problems, these open-ended or subjective tasks rely on feedback signals that score helpfulness, clarity, and harmlessness. During this stage, the model is tuned against those preference signals while the earlier verifiable-task rewards keep its reasoning ability intact.

[Figure: the final RL stage with preference alignment]

More on Reward Models: Process-Level vs. Outcome-Level

A lot of existing research uses separate “Process Reward Models (PRM)” to judge each step of chain-of-thought and “Outcome Reward Models (ORM)” to evaluate just the final answer. Some approaches (like LaTent Reasoning Optimization, or LaTRO) even treat the chain-of-thought as a latent variable, reinforcing partial steps toward correctness. DeepSeek took a simpler route: purely outcome-level checks for tasks with ground-truth solutions, plus straightforward format rewards. There’s no separate model scrutinizing each step for correctness. This simpler design can still lead to advanced emergent reasoning, because a reliably correct final answer incentivizes thorough intermediate work.


Emergent “Self-Reflection” and Exploration Behaviors

One notable aspect of R1-Zero (and, to a refined extent, R1) is that these self-checking or “exploratory” traits arose naturally from RL. The model discovered that:

  1. Longer chain-of-thought => higher chance of catching errors => higher reward.
  2. Occasional backtracking => better final correctness => again, higher reward.

No one explicitly told it to do sub-step verification. RL, especially with GRPO’s batch-based comparison, reinforced any expansions in reasoning that led to more consistent final answers. Over repeated updates, the policy distribution skewed toward these emergent solutions and became more consistent at self-checking.


Distillation: Compressing Big Reasoning into Smaller Models

Why Distillation?

Training or even serving a 671B MoE model is not trivial. So the DeepSeek team released six distilled models built on smaller backbones (Qwen and Llama), spanning 1.5B to 70B parameters. They fine-tuned these smaller “students” on the verified solutions produced by R1.

How It Works

  1. Teacher: DeepSeek-R1 (the “large teacher”) is prompted with tasks and generates verified, high-quality chain-of-thought + final answers.
  2. Student: A smaller Qwen or Llama model is then trained via supervised learning to reproduce those sequences, learning both the final solutions and the token-by-token reasoning style.

Because the teacher’s solutions are correct and thorough, the student effectively “inherits” advanced CoT reasoning it might never discover with small-scale RL from scratch.
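Put together, distillation here is just supervised imitation of teacher traces. The sketch below captures the loop; `teacher_generate`, `is_verified`, and `sft_train` are hypothetical placeholders, not a real DeepSeek or Hugging Face API.

```python
# Distillation as supervised imitation: the student is fine-tuned on the teacher's
# verified chain-of-thought + answer traces. All callables here are placeholders.
def distill(student_model, teacher_generate, is_verified, sft_train, prompts):
    traces = []
    for prompt in prompts:
        trace = teacher_generate(prompt)           # chain-of-thought + final answer from the teacher
        if is_verified(prompt, trace):             # keep only correct, well-formed traces
            traces.append({"prompt": prompt, "completion": trace})
    # Standard next-token SFT on the kept traces -- no RL on the student side.
    return sft_train(student_model, traces)
```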

Distillation is supervised imitation. It can transfer useful reasoning behaviors and materially improve benchmark performance, but it doesn’t give you the same training signal as RL on verifiable tasks, so generalization depends on the student model and the distillation data.

What distillation buys you

Even a 7B or 30B student model can match or exceed older, bigger LLMs on tricky math/coding benchmarks, simply by training on the teacher’s verified solutions. The largest student (70B) sometimes approaches R1’s performance. Distillation is how you keep much of that behavior without serving the entire 671B MoE at inference time.

[Figure: benchmark performance of the distilled models]

Not Everything Worked: MCTS, Large-Scale PRMs, and Other Experiments

Interestingly, the authors openly mention that certain ideas (like Monte Carlo Tree Search (MCTS)) didn’t scale to open-ended text tasks. The branching factor is too high, making it computationally intractable. They also skipped process-level learned reward models because they can become huge and are prone to reward exploitation by the policy. Instead, they used simpler rule-based checks plus GRPO.

Likewise, they found that small models trained directly with RL (rather than distillation) struggle to replicate the emergent reasoning that big RL models develop. Hence the strategy: do the RL on a large base, then distill to smaller backbones.


Wrap-up: reinforcement-learned reasoning in practice

DeepSeek-R1-Zero and DeepSeek-R1 show a pragmatic training recipe:

  1. R1-Zero pushes reasoning with outcome-level rewards on tasks you can automatically verify.
  2. R1 makes that capability usable by adding cold-start data, multiple RL stages, and preference alignment, then distilling into smaller backbones.

GRPO is the practical enabler here: it makes large-scale RL feasible without a separate value network, while still favoring samples that actually pass verifiers.

Key takeaways

This works best when correctness is automatically verifiable (tests, exact answers). For open-ended tasks, preference data and reward modeling still matter.


Resources