
DeepSeek-R1: RL-trained reasoning and distillation

[Cover image: a whale swimming in the ocean. Photo by Todd Cravens]

Overview: what DeepSeek-R1 is optimizing for

For a long time, large language models (LLMs) primarily relied on massive supervised datasets and human feedback to become more “aligned” or user-friendly. Yet these conventional methods can struggle to teach deep, multi-step reasoning. Many LLMs learn to produce plausible text, which is helpful for casual Q&A but less reliable for systematically solving math, code, or logic problems whose answers can be checked for correctness (even if they are often quite good at those tasks).

Recently, we’ve seen more models explicitly optimized for verifiable, multi-step reasoning. OpenAI released models like o1 and o3-mini; while technical details are limited, the open-source community has been working hard to reproduce similar behaviors.

Around the same time, DeepSeek released a new family of models in this direction.

DeepSeek-R1 and DeepSeek-R1-Zero are both built on a 671B-parameter Mixture-of-Experts (MoE) transformer (DeepSeek-V3-Base), but they diverge in training philosophy: R1-Zero is trained purely with reinforcement learning on verifiable tasks, while R1 adds a small supervised “cold-start” set, multiple RL stages, and preference alignment.

Together, they show a recipe for improving reasoning with automated pass/fail feedback on verifiable tasks, then packaging the results into smaller models via distillation. Below, I will explain what these models are, how Group Relative Policy Optimization (GRPO) sidesteps the usual complexities of RL at this scale, and how the resulting reasoning behaviors are distilled into smaller models.


The Two Models in Brief: R1-Zero vs. R1

Shared Architecture, Different Journeys

Both models share a Mixture-of-Experts transformer architecture with a total of 671B parameters, while only ~37B parameters are “active” for any given token. This gating approach amplifies capacity without linearly ballooning compute. They also support a 128K context window (useful for long chain-of-thought (CoT) reasoning) and incorporate an auxiliary Multi-Token Prediction (MTP) training objective that predicts multiple future tokens (inference is still standard next‑token decoding).
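To see why only a fraction of the parameters fire per token, here is a toy sketch of top-k expert routing in Python. The expert count, dimensions, and gating details are illustrative stand-ins, not DeepSeek-V3’s actual configuration.

```python
# Toy sketch of top-k expert routing (illustrative sizes, not DeepSeek-V3's real config).
import numpy as np

rng = np.random.default_rng(0)

n_experts, d_model, top_k = 8, 16, 2                       # toy sizes
router_w = rng.normal(size=(d_model, n_experts))           # router projection
expert_w = rng.normal(size=(n_experts, d_model, d_model))  # one weight matrix per expert

def moe_layer(x):
    """Route a single token vector x to its top-k experts only."""
    logits = x @ router_w                                   # score every expert
    top = np.argsort(logits)[-top_k:]                       # keep the k highest-scoring experts
    gates = np.exp(logits[top]) / np.exp(logits[top]).sum() # softmax over the chosen experts
    # Only the selected experts' parameters are used for this token.
    return sum(g * (x @ expert_w[i]) for g, i in zip(gates, top))

token = rng.normal(size=d_model)
print(moe_layer(token).shape)  # (16,) -- capacity of 8 experts, compute of only 2
```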

But the fine-tuning process for each model is drastically different:

  1. DeepSeek-R1-Zero

    • Trains entirely via RL with no supervised data.
    • Starts RL “cold”, directly from the base model: it learns purely from pass/fail signals (can the code solve the problem? is the math answer correct?).
    • Avoids learned reward models (to prevent “reward hacking”), relying on rule-based checks for correctness and format.
    • Yields high accuracy on tasks with unambiguous answers, though the chain-of-thought can be verbose or chaotic.
  2. DeepSeek-R1

    • Multi-stage approach with a small supervised “cold-start” set, multiple RL phases, preference alignment, and rejection sampling.
    • Maintains R1-Zero’s core problem-solving but polishes style, consistency, and alignment for broader usage.
    • More readable chain-of-thought, less random language mixing, and more practical usefulness as an AI assistant.

How DeepSeek-R1-Zero Was Trained: Pure RL, No Human Data

Cold-Start RL: No SFT, No Problem

Typically, training a reasoning LLM without human-provided solutions seems like a recipe for unstable learning, yet R1-Zero shows it can be done. The authors took DeepSeek-V3-Base (pre-trained on ~14.8T tokens of text) and directly applied reinforcement learning on tasks with deterministic checks: math problems with exact final answers, coding challenges with unit tests, and logic puzzles whose solutions can be verified automatically.

This setup is reminiscent of training game agents like AlphaZero, where outcome-level signals drive the entire learning process. The difference is that R1-Zero deals with open-ended text. Yet with enough pass/fail tasks (math, coding challenges, logic puzzles), it can glean the entire “how to reason” skill from repeated trial and error.

If you’re unfamiliar with Supervised Fine-Tuning (SFT): it’s training on prompt → ideal completion examples.

[Figure: supervised fine-tuning illustrated as a set of instructions paired with ideal outputs]
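For the concrete-minded, here is a minimal sketch of the SFT objective: cross-entropy on the completion tokens only, with the prompt masked out. It assumes a Hugging Face-style model that returns `.logits`; it is not DeepSeek’s actual training code.

```python
# Minimal sketch of supervised fine-tuning: maximize the likelihood of the ideal
# completion given the prompt. Assumes an HF-style causal LM; placeholder, not DeepSeek's pipeline.
import torch
import torch.nn.functional as F

def sft_loss(model, prompt_ids, completion_ids):
    """prompt_ids, completion_ids: 1-D LongTensors. Loss on completion tokens only."""
    input_ids = torch.cat([prompt_ids, completion_ids], dim=-1)
    labels = input_ids.clone()
    labels[: prompt_ids.shape[-1]] = -100          # ignore loss on the prompt
    logits = model(input_ids.unsqueeze(0)).logits  # [1, seq, vocab]
    return F.cross_entropy(
        logits[0, :-1].float(),                    # position t predicts token t+1
        labels[1:],
        ignore_index=-100,
    )
```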

Ground-Truth Rewards, No Learned Reward Models

One key design choice was to avoid learned reward models altogether. Learned critics can be gamed (“reward hacking”), or they might misjudge correctness if they only see approximate signals. By sticking to deterministic checks (did the code compile and solve the test cases? did the final numeric answer match the ground truth?), R1-Zero gets a clear correctness signal. This also simplifies the training pipeline: no large value network to maintain, and no risk that a flawed critic reinforces the wrong behavior.
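As a rough sketch, an outcome-level rule-based reward might look like the snippet below. The tag-based format mirrors the kind of template the paper describes, but the exact checks are my own simplification, not DeepSeek’s implementation.

```python
# Sketch of outcome-level, rule-based rewards: one term for format, one for correctness.
import re

def format_reward(response: str) -> float:
    """Reward responses that keep reasoning inside <think>...</think> tags."""
    return 1.0 if re.search(r"<think>.+?</think>", response, re.DOTALL) else 0.0

def accuracy_reward(response: str, ground_truth: str) -> float:
    """Exact-match check on the final answer; for code, this would be running the test suite."""
    match = re.search(r"<answer>(.+?)</answer>", response, re.DOTALL)
    answer = match.group(1).strip() if match else ""
    return 1.0 if answer == ground_truth.strip() else 0.0

def total_reward(response: str, ground_truth: str) -> float:
    return accuracy_reward(response, ground_truth) + format_reward(response)

print(total_reward("<think>3*4=12, minus 5 is 7</think><answer>7</answer>", "7"))  # 2.0
```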

Emergent Self-Reflection and “Aha Moments”

Even though R1-Zero never sees annotated step-by-step solutions, it begins to generate them anyway. Midway through training, it discovers that longer, more careful chain-of-thought outputs yield higher pass rates. Over thousands of RL steps, these improvements compound:

  1. The model attempts a solution (possibly short at first).
  2. If it fails, it gets zero reward; if correct, it’s reinforced.
  3. Eventually, more thorough reasoning sequences happen to succeed.
  4. The RL update “nudges” the policy to produce even longer solutions next time.
  5. This self-reinforcing dynamic leads to an emergent “reflection” or “verification” habit, purely to maximize pass/fail success rates.

The result is unpolished reasoning. R1-Zero’s chain-of-thought can be verbose, or randomly peppered with partial code or pseudo-math if that happened to be rewarded at some point. It isn’t aligned for clarity or single-language output, yet it remains strong on tasks that can be automatically verified.


Group Relative Policy Optimization (GRPO): RL without the Giant Critic

[Figure: overview of Group Relative Policy Optimization]

Why GRPO Instead of PPO or Value Models?

Traditional RL for language models (like RLHF) often pairs a large policy with an equally large value (critic) network. That’s computationally daunting for a 671B-parameter model. Moreover, LLM outputs are highly non-deterministic, making it tricky for the critic to estimate expected returns. GRPO (Group Relative Policy Optimization) sidesteps these challenges. Jay Alammar has a good visual explanation that matches the sequence below.

  1. Sampling: For each prompt, the model generates multiple responses (“a group”).

    [Figure: GRPO samples multiple responses for each prompt]
  2. Reward Calculation: Each response is tested (did it pass all code tests? match the math answer?).

    [Figure: each response is tested and scored]
  3. Relative Advantages: Compare each response’s reward to the group’s average; better-than-average samples get a positive advantage, below-average samples a negative one.

    [Figure: each response’s reward compared to the group average]
  4. Update: Apply a PPO-like clipped objective, but compute advantages from the group’s reward distribution rather than from a learned value baseline.

This approach drastically reduces memory overhead, since no large critic network is needed. It also avoids “reward hacking” that can happen if a learned critic is imperfect.
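To make step 3 concrete, here is a minimal sketch of the group-relative advantage: each sample’s reward is normalized by the mean and standard deviation of its group. The PPO-style clipped policy update that consumes these advantages is omitted.

```python
# Minimal sketch of GRPO's group-relative advantages: each sampled response is
# scored against the other samples for the same prompt (clipped policy update omitted).
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """rewards: one scalar reward per sampled response for a single prompt."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)   # better-than-average > 0, worse < 0

# e.g. 4 samples for one prompt: two pass the verifier, two fail
print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))  # [ 1. -1.  1. -1.]
```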

How GRPO Spurs Longer Reasoning

When the model’s own longer chain-of-thought solutions outscore shorter ones, GRPO effectively says, “Do more of that.” Because each prompt sees multiple samples, an “extra-thorough” solution that passes the test suite stands out among the group, reaping a higher advantage. Over time, the model invests more tokens into “thinking,” leading to emergent self-reflection and iterative checking, which can raise pass rates on verifiable tasks.


The Additional Phases That Produced DeepSeek-R1

While R1-Zero proved that pure RL can yield strong emergent reasoning, it also had some liabilities: messy chain-of-thought, random language mixing, and minimal alignment to typical user preferences. DeepSeek-R1 takes a more measured, multi-phase route. Jay Alammar has a good visual explanation that matches the phases below.

Phase 1: Small “Cold-Start” Supervised Fine-Tuning

DeepSeek-R1 begins by fine-tuning the base model (DeepSeek-V3-Base) on thousands of cold-start examples to seed readable reasoning before RL. This dataset contains carefully verified solutions (complete with chain-of-thought) to combat the “cold start” problem, where a purely RL-trained model can produce unreadable or disorganized text. This initial step gives the model a readable, well-structured reasoning style to build on before any reinforcement learning begins.

[Figure: the cold-start supervised fine-tuning phase]

Phase 2: RL for Reasoning Accuracy

Following the cold start, the model is trained via reasoning-oriented reinforcement learning, taking inspiration from approaches used in earlier “Zero”-style training. Large sets of math, coding, and logic tasks each have a pass/fail test, providing the core signal:

  1. Generate candidate solutions.
  2. Reinforce those that pass correctness checks (uses GRPO).
  3. Include an extra reward term to keep the target language consistent (preventing multilingual drift).

By the end, the model achieves strong problem-solving capabilities similar to previous RL-based methods, but with far cleaner style and structure thanks to the initial supervised step.
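As a rough illustration of the language-consistency term in step 3, the snippet below rewards the proportion of chain-of-thought tokens written in the target language. Counting ASCII-only words as “English” is a crude stand-in for a real language detector, and how this term is weighted against the accuracy reward is an assumption here.

```python
# Rough sketch of a language-consistency reward: the proportion of the
# chain-of-thought written in the target language (ASCII check is a crude stand-in).
def language_consistency_reward(chain_of_thought: str) -> float:
    words = chain_of_thought.split()
    if not words:
        return 0.0
    target_language = sum(all(ord(c) < 128 for c in w) for w in words)
    return target_language / len(words)        # 1.0 if the whole CoT stays in one language

print(language_consistency_reward("Let x = 3, then 2x equals 6"))   # 1.0
print(language_consistency_reward("Let x = 3, 那么 2x 等于 6"))      # mixed-language CoT scores lower
```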

[Figure: the reasoning-oriented RL phase]

Phase 3: Generating a Massive Verified Dataset via Rejection Sampling

With the RL stage stabilized, the authors then leveraged this improved model to produce large-scale synthetic reasoning data:

  1. Pose fresh or more diverse problems.
  2. Collect multiple candidate solutions.
  3. Reject low-quality or incorrect outputs using a mix of rule-based checks and a reward model.
  4. Keep only well-structured, accurate chains of thought.

Through this rejection sampling, they amassed around 600,000 valid reasoning samples. Additionally, about 200,000 non-reasoning or general-purpose samples were included, bringing the total dataset size to roughly 800,000 examples.
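Schematically, the rejection-sampling loop looks something like this; `generate`, `passes_rule_checks`, and `reward_model_score` are hypothetical placeholders for the model call and the two kinds of filters mentioned above.

```python
# Schematic of rejection sampling: sample several candidates per prompt, drop anything
# that fails the rule-based checks or scores poorly under a reward model, and keep the
# survivors as new SFT data. All callables here are placeholders.
def build_sft_dataset(prompts, generate, passes_rule_checks, reward_model_score,
                      n_samples=8, min_score=0.5):
    kept = []
    for prompt in prompts:
        candidates = [generate(prompt) for _ in range(n_samples)]
        for cand in candidates:
            if not passes_rule_checks(prompt, cand):           # e.g. answer matches, format is clean
                continue
            if reward_model_score(prompt, cand) < min_score:   # used where no exact check exists
                continue
            kept.append({"prompt": prompt, "completion": cand})
    return kept
```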

[Figure: the verified-dataset generation phase]

Phase 4: Second Supervised Fine-Tuning

In this step, DeepSeek-V3-Base (the underlying base for DeepSeek-R1) is fine-tuned again, this time on the newly generated 800k-sample dataset. This large-scale supervised phase distills the best RL-derived reasoning patterns, alongside the general-purpose data, into a single consistent, readable model that the final alignment stage can build on.

[Figure: the second supervised fine-tuning phase]

Phase 5: Final RL Stage with Preference Alignment

Finally, DeepSeek-R1 undergoes an additional RL-based phase focused on preference alignment. Unlike purely objective math or coding problems, these open-ended or subjective tasks rely on feedback signals that score helpfulness, clarity, and harmlessness. During this stage, the model is tuned against those preference signals while the earlier verifiable-task rewards keep its reasoning ability intact.

[Figure: the final RL stage with preference alignment]

More on Reward Models: Process-Level vs. Outcome-Level

A lot of existing research uses separate “Process Reward Models (PRM)” to judge each step of chain-of-thought and “Outcome Reward Models (ORM)” to evaluate just the final answer. Some approaches (like LaTent Reasoning Optimization, or LaTRO) even treat the chain-of-thought as a latent variable, reinforcing partial steps toward correctness. DeepSeek took a simpler route: purely outcome-level checks for tasks with ground-truth solutions, plus straightforward format rewards. There’s no separate model scrutinizing each step for correctness. This simpler design can still lead to advanced emergent reasoning, because a reliably correct final answer incentivizes thorough intermediate work.


Emergent “Self-Reflection” and Exploration Behaviors

One notable aspect of R1-Zero (and, to a refined extent, R1) is that these self-checking or “exploratory” traits arose naturally from RL. The model discovered that:

  1. Longer chain-of-thought => higher chance of catching errors => higher reward.
  2. Occasional backtracking => better final correctness => again, higher reward.

No one explicitly told it to do sub-step verification. RL, especially with GRPO’s batch-based comparison, reinforced any expansions in reasoning that led to more consistent final answers. Over repeated updates, the policy distribution skewed toward these emergent solutions and became more consistent at self-checking.


Distillation: Compressing Big Reasoning into Smaller Models

Why Distillation?

Training or even serving a 671B MoE model is not trivial. So the DeepSeek team released six distilled models built on smaller backbones (Qwen and Llama), spanning 1.5B to 70B parameters. They fine-tuned these smaller “students” on the verified solutions produced by R1.

How It Works

  1. Teacher: DeepSeek-R1 (the “large teacher”) is prompted with tasks and generates verified, high-quality chain-of-thought + final answers.
  2. Student: A smaller Qwen or Llama model is then trained via supervised learning to reproduce those sequences, learning both the final solutions and the token-by-token reasoning style.

Because the teacher’s solutions are correct and thorough, the student effectively “inherits” advanced CoT reasoning it might never discover with small-scale RL from scratch.
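Put together, distillation here is just supervised imitation of teacher traces. The sketch below captures the loop; `teacher_generate`, `is_verified`, and `sft_train` are hypothetical placeholders, not a real DeepSeek or Hugging Face API.

```python
# Distillation as supervised imitation: the student is fine-tuned on the teacher's
# verified chain-of-thought + answer traces. All callables here are placeholders.
def distill(student_model, teacher_generate, is_verified, sft_train, prompts):
    traces = []
    for prompt in prompts:
        trace = teacher_generate(prompt)           # chain-of-thought + final answer from the teacher
        if is_verified(prompt, trace):             # keep only correct, well-formed traces
            traces.append({"prompt": prompt, "completion": trace})
    # Standard next-token SFT on the kept traces -- no RL on the student side.
    return sft_train(student_model, traces)
```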

Distillation is supervised imitation. It can transfer useful reasoning behaviors and materially improve benchmark performance, but it doesn’t give you the same training signal as RL on verifiable tasks, so generalization depends on the student model and the distillation data.

What distillation buys you

Even a 7B or 30B student model can match or exceed older, bigger LLMs on tricky math/coding benchmarks, simply by training on the teacher’s verified solutions. The largest student (70B) sometimes approaches R1’s performance. Distillation is how you keep much of that behavior without serving the entire 671B MoE at inference time.

[Figure: benchmark performance of the distilled models]

Not Everything Worked: MCTS, Large-Scale PRMs, and Other Experiments

Interestingly, the authors openly mention that certain ideas (like Monte Carlo Tree Search (MCTS)) didn’t scale to open-ended text tasks. The branching factor is too high, making it computationally intractable. They also skipped process-level learned reward models because they can become huge and are prone to reward exploitation by the policy. Instead, they used simpler rule-based checks plus GRPO.

Likewise, they found that small models trained directly with RL (rather than distillation) struggle to replicate the emergent reasoning that big RL models develop. Hence the strategy: do the RL on a large base, then distill to smaller backbones.


Wrap-up: reinforcement-learned reasoning in practice

DeepSeek-R1-Zero and DeepSeek-R1 show a pragmatic training recipe:

  1. R1-Zero pushes reasoning with outcome-level rewards on tasks you can automatically verify.
  2. R1 makes that capability usable by adding cold-start data, multiple RL stages, and preference alignment, then distilling into smaller backbones.

GRPO is the practical enabler here: it makes large-scale RL feasible without a separate value network, while still favoring samples that actually pass verifiers.

Key takeaways

This works best when correctness is automatically verifiable (tests, exact answers). For open-ended tasks, preference data and reward modeling still matter.


Resources