- Authors: DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z.F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Han Bao, Hanwei Xu, Haocheng Wang, Honghui Ding, Huajian Xin, Huazuo Gao, Hui Qu, Hui Li, Jianzhong Guo, Jiashi Li, Jiawei Wang, Jingchang Chen, Jingyang Yuan, Junjie Qiu, Junlong Li, J.L. Cai, Jiaqi Ni, Jian Liang, Jin Chen, Kai Dong, Kai Hu, Kaige Gao, Kang Guan, Kexin Huang, Kuai Yu, Lean Wang, Lecong Zhang, Liang Zhao, Litong Wang, Liyue Zhang, Lei Xu, Leyi Xia, Mingchuan Zhang, Minghua Zhang, Minghui Tang, Meng Li, Miaojun Wang, Mingming Li, Ning Tian, Panpan Huang, Peng Zhang, Qiancheng Wang, Qinyu Chen, Qiushi Du, Ruiqi Ge, Ruisong Zhang, Ruizhe Pan, Runji Wang, R.J. Chen, R.L. Jin, Ruyi Chen, Shanghao Lu, Shangyan Zhou, Shanhuang Chen, Shengfeng Ye, Shiyu Wang, Shuiping Yu, Shunfeng Zhou, Shuting Pan, S.S. Li et al. (100 additional authors not shown)
- Source: Excerpts from “2501.12948.pdf”
Introduction
- Context: The paper addresses the rapid advancement of Large Language Models (LLMs) and their move toward Artificial General Intelligence (AGI). A key area of focus is improving reasoning capabilities, particularly through post-training methods that require less computational power than pre-training.
- Problem: Existing methods for enhancing reasoning, including process-based rewards, RL, and search algorithms, haven’t reached the general performance levels of OpenAI’s o1 series models.
- Approach: The paper presents a two-pronged strategy for improving reasoning in LLMs using
Reinforcement Learning (RL):
- DeepSeek-R1-Zero: A model trained purely through RL, without initial supervised fine-tuning (SFT), designed to explore the emergence of reasoning abilities through self-evolution.
- DeepSeek-R1: A model that builds upon DeepSeek-R1-Zero, using RL with a “cold-start” of curated reasoning data, aiming for higher performance and more human-friendly outputs.
- Key Goal: The core goal of the study is to explore the extent to which LLMs can develop reasoning capabilities solely through RL, without the need for a preliminary SFT phase.
DeepSeek-R1-Zero: Pure Reinforcement Learning
- Method: The researchers used the DeepSeek-V3-Base model and Group Relative Policy Optimization
(GRPO) as the RL framework.
- GRPO: This method forgoes the need for a critic model (which is typically the same size as the policy model) and instead estimates the baseline from group scores, reducing training costs (a minimal sketch follows).
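A minimal Python sketch of the group-relative advantage and clipped surrogate objective this implies (names and the clipping constant are illustrative; the paper's full objective also subtracts a KL penalty against a frozen reference policy, omitted here):

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages: for the G outputs sampled for one prompt,
    normalize each scalar reward by the group's mean and standard deviation,
    so no separate critic/value model is needed."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def grpo_policy_loss(logp_new: torch.Tensor, logp_old: torch.Tensor,
                     advantages: torch.Tensor, clip_eps: float = 0.2) -> torch.Tensor:
    """PPO-style clipped surrogate over the sampled group (one log-prob per output)."""
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Maximize the surrogate, i.e., minimize its negative mean.
    return -torch.min(unclipped, clipped).mean()
```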
- Reward System: The model is trained with a rule-based reward system (see the sketch after the template below) focused on:
- Accuracy: Correctness of the final answer, checked by problem-specific rules or, for code, a compiler.
- Format: Ensuring that the thinking process is enclosed within <think> and </think> tags.
- Template: A straightforward template is used, where the model must first give its reasoning process and then provide the final answer, without any other constraints:
- “The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think> <answer> answer here </answer>.”
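A hedged sketch of such a rule-based reward under this template (the exact rules, weights, and answer checkers are not specified in these excerpts; check_answer is a hypothetical problem-specific verifier, e.g., exact match for math or a compiler/test harness for code):

```python
import re

# Expect the template's structure: a <think> block followed by an <answer> block.
TEMPLATE_RE = re.compile(r"<think>(.*?)</think>\s*<answer>(.*?)</answer>", re.DOTALL)

def rule_based_reward(completion: str, reference: str, check_answer) -> float:
    """Format reward: did the model wrap its reasoning and answer in the required tags?
    Accuracy reward: is the extracted final answer verifiably correct?"""
    match = TEMPLATE_RE.search(completion)
    format_reward = 1.0 if match else 0.0
    if match is None:
        return format_reward  # nothing parsable to grade for accuracy
    final_answer = match.group(2).strip()
    accuracy_reward = 1.0 if check_answer(final_answer, reference) else 0.0
    return format_reward + accuracy_reward
```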
- Results: DeepSeek-R1-Zero showed significant improvement on reasoning benchmarks during RL training, specifically:
- The AIME 2024 pass@1 score increased from 15.6% to 71.0%, and further to 86.7% with majority voting; this performance is comparable to OpenAI-o1-0912.
- Emergence of sophisticated reasoning behaviors:
- The model naturally learns to use more “thinking time” (longer generated reasoning chains) for complex tasks.
- The model demonstrates “self-verification, reflection, and generating long CoTs.”
- The model exhibits an “aha moment” where it reevaluates its initial approach during the reasoning process, which is not an explicitly taught behavior, but one that emerges via RL.
- Example from the model’s own output: “Wait, wait. Wait. That’s an aha moment I can flag here. Let’s reevaluate this step-by-step to identify if the correct sum can be…”
- Drawbacks: While successful in demonstrating the ability of RL alone to enhance reasoning,
DeepSeek-R1-Zero has issues:
- Poor readability.
- Language mixing.
DeepSeek-R1: Reinforcement Learning with Cold Start
- Motivation: To address the drawbacks of DeepSeek-R1-Zero and further improve performance, the researchers used a cold start phase for DeepSeek-R1 with high-quality data.
- Cold Start Data: Thousands of long Chain-of-Thought (CoT) examples were collected via methods such as few-shot prompting, direct prompting, gathering DeepSeek-R1-Zero outputs, and refining these outputs through human annotation.
- Advantages of Cold Start Data:
- Readability: This data is formatted to have a clear structure, including a summary at the end of each response, improving the human readability of responses.
- “Here, we define the output format as |special_token|<reasoning_process>|special_token|<summary>, where the reasoning process is the CoT for the query, and the summary is used to summarize the reasoning results.”
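As a small illustration of this format (assuming |special_token| is a literal delimiter string standing in for whatever special token the tokenizer actually uses):

```python
def split_cold_start_output(text: str, sep: str = "|special_token|"):
    """Split a response of the form
    |special_token|<reasoning_process>|special_token|<summary>
    into its reasoning process (the CoT) and the trailing summary."""
    parts = text.split(sep)
    if len(parts) < 3:
        raise ValueError("response does not follow the cold-start output format")
    reasoning_process, summary = parts[1], parts[2]
    return reasoning_process.strip(), summary.strip()
```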
- Potential: With cold-start data, the model achieves better performance than DeepSeek-R1-Zero.
- Training Stages:
- Cold Start SFT: The DeepSeek-V3-Base model was fine-tuned using the cold-start data.
- Reasoning-oriented RL: The fine-tuned model undergoes large-scale RL to enhance reasoning,
focusing on tasks like coding and mathematics.
- A language consistency reward was added to address the mixing of languages during RL (a rough sketch follows below).
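The paper computes this reward as the proportion of target-language words in the CoT; a rough sketch (the word segmentation and language-detection heuristics below are assumptions, not the paper's implementation):

```python
def language_consistency_reward(cot_text: str, target: str = "en") -> float:
    """Fraction of tokens in the CoT whose script matches the target language.
    Crude heuristics: ASCII letters for English, CJK codepoints for Chinese."""
    tokens = cot_text.split()
    if not tokens:
        return 0.0
    if target == "en":
        hits = sum(1 for t in tokens if all(ord(c) < 128 for c in t))
    else:  # treat anything else as Chinese for this sketch
        hits = sum(1 for t in tokens if any("\u4e00" <= c <= "\u9fff" for c in t))
    return hits / len(tokens)
```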
- Rejection Sampling and Supervised Fine-Tuning:
- The RL checkpoint is used to generate new SFT data via rejection sampling (a minimal sketch follows this stage).
- The data includes both reasoning and non-reasoning samples to improve general performance.
- About 600k reasoning-related training samples and 200k non-reasoning samples were collected.
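A minimal sketch of the rejection-sampling step for reasoning data (generate and is_correct are hypothetical placeholders for sampling from the RL checkpoint and for automatic verification; readability filtering and the non-reasoning data pipeline are omitted):

```python
def rejection_sample_sft_data(prompts, generate, is_correct, n_samples: int = 16):
    """For each prompt, sample several candidate completions from the current
    checkpoint and keep only those whose final answers pass verification;
    the survivors become new SFT training pairs."""
    sft_data = []
    for prompt in prompts:
        candidates = generate(prompt, n_samples)          # list[str]
        kept = [c for c in candidates if is_correct(prompt, c)]
        sft_data.extend({"prompt": prompt, "completion": c} for c in kept)
    return sft_data
```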
- Reinforcement Learning for all Scenarios: A final RL stage is used to improve helpfulness and
harmlessness in addition to reasoning capabilities.
- Reward signals and diverse prompt distributions are used to capture human preferences.
- Results: DeepSeek-R1 achieves performance on par with OpenAI-o1-1217 on reasoning tasks.
Distillation
- Method: DeepSeek-R1’s reasoning capabilities were distilled into smaller models (Qwen and Llama series) through direct SFT using the 800k samples curated with DeepSeek-R1, without any additional RL (a hedged SFT sketch appears at the end of this section).
- Results: These distilled models show significantly improved reasoning ability, outperforming larger open-source models (e.g., the 14B model outperforms QwQ-32B-Preview) and approaching the performance of models like o1-mini.
- DeepSeek-R1-Distill-Qwen-7B achieves 55.5% on AIME 2024, surpassing QwQ-32B-Preview. Additionally, DeepSeek-R1-Distill-Qwen-32B scores 72.6% on AIME 2024, 94.3% on MATH-500, and 57.2% on LiveCodeBench.
- Distillation demonstrated that the reasoning patterns of larger models can be effectively transferred to smaller ones.
- The distilled 32B and 70B models set a new record on reasoning benchmarks among dense models.
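Since distillation here is plain SFT on the curated samples, a hedged sketch with Hugging Face transformers could look like the following (the base checkpoint, dataset file, field names, and hyperparameters are illustrative assumptions, not the paper's actual configuration):

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base = "Qwen/Qwen2.5-7B"          # assumed stand-in for a Qwen/Llama base model
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Assumed layout: a JSONL file with one "text" field per curated sample
# (prompt + reasoning trace + answer concatenated).
dataset = load_dataset("json", data_files="r1_curated_sft.jsonl")["train"]
dataset = dataset.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=4096),
    remove_columns=dataset.column_names,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="r1-distill-sft", num_train_epochs=2,
                           per_device_train_batch_size=1, learning_rate=1e-5,
                           bf16=True, logging_steps=50),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()   # standard causal-LM SFT; no RL stage follows for the distilled models
```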
Experimental Evaluation
- Benchmarks: The models were evaluated on a range of benchmarks, including MMLU, GPQA, AIME, MATH-500, Codeforces, LiveCodeBench, and others.
- Baselines: Models were compared against several strong baselines including DeepSeek-V3,
Claude-Sonnet-3.5-1022, GPT-4o-0513, OpenAI-o1-mini, and OpenAI-o1-1217.
- Results:
- DeepSeek-R1: Demonstrated strong performance in reasoning and knowledge tasks, comparable to OpenAI-o1-1217 on some benchmarks. Outperformed other models on math and coding benchmarks. Achieved impressive performance on long-context understanding tasks.
- Distilled Models: Outperformed the models they were based on. The 7B model outperformed GPT-4o and Claude-3.5-Sonnet on math benchmarks.
- Evaluation Protocol: Used pass@1 for evaluation, sampling multiple responses per question and averaging their correctness to obtain more stable performance estimates (sketched below).
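A small sketch of these two measurements, pass@1 and the majority-voting (consensus) number reported for AIME (the inputs are hypothetical: correct_flags are per-sample correctness booleans, answers are the extracted final answers across the k samples):

```python
from collections import Counter

def pass_at_1(correct_flags) -> float:
    """pass@1 estimated as the average correctness over k sampled responses."""
    return sum(correct_flags) / len(correct_flags)

def majority_vote(answers):
    """Consensus result (majority voting): the most frequent final answer across samples."""
    return Counter(answers).most_common(1)[0][0]
```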
Discussion
- Distillation vs. RL: Distilling the reasoning capability of large models into smaller ones is more effective than running large-scale RL on the smaller models directly. While smaller models can reach decent performance with large-scale RL training, they still fall short of models distilled with reasoning data from a larger teacher.
- Unsuccessful Attempts: The researchers also discussed some unsuccessful attempts, including:
- Process Reward Models (PRM): Found that it was too difficult to define fine-grained reasoning steps and avoid reward hacking.
- Monte Carlo Tree Search (MCTS): Experienced difficulties in scaling the search space and training an effective value model.
Conclusion and Future Directions
- Summary: The research successfully explores methods to enhance LLM reasoning capabilities via RL, both with and without an SFT cold-start, and via distillation. DeepSeek-R1 was shown to be highly competitive with the best models, demonstrating the potential of the approach.
- Future Work: The authors plan to focus on:
- Improving general capabilities beyond reasoning (e.g., function calling, multi-turn tasks).
- Addressing language mixing issues when handling queries in languages other than English or Chinese.
- Improving robustness to prompts and the zero-shot setting.
- Scaling RL to software engineering tasks, e.g., via rejection sampling on software engineering data or asynchronous evaluations during RL.
- Exploring further gains by applying RL to the distilled models.
Key Contributions
- Demonstrates that RL alone, without supervised fine-tuning, can incentivize the emergence of reasoning capabilities in LLMs.
- Introduces a pipeline for developing high-performance reasoning models using cold-start data with iterative RL fine-tuning.
- Shows that smaller models can be significantly improved using distilled reasoning knowledge from larger models.