ByteDance Research Releases DAPO: A Fully Open-Sourced LLM Reinforcement Learning System at Scale

Reinforcement learning (RL) has become central to advancing large language models (LLMs), giving them the improved reasoning capabilities needed for complex tasks. However, the research community faces considerable challenges in reproducing state-of-the-art RL techniques because major industry players disclose key training details only incompletely. This opacity has limited progress in broader scientific efforts and collaborative research.
Researchers from ByteDance, Tsinghua University, and the University of Hong Kong recently introduced DAPO (Decoupled Clip and Dynamic Sampling Policy Optimization), a fully open-source, large-scale reinforcement learning system designed to enhance the reasoning abilities of large language models. The DAPO system seeks to bridge the reproducibility gap by openly sharing all algorithmic details, training procedures, and datasets. Built on the verl framework, DAPO includes training code and a carefully prepared dataset called DAPO-Math-17K, designed specifically for mathematical reasoning tasks.
DAPO's technical foundation includes four core innovations aimed at resolving key challenges in reinforcement learning. The first, “Clip-Higher”, addresses entropy collapse, a situation in which models prematurely settle into limited exploration patterns. By carefully managing the clipping ratio in policy updates, this technique encourages greater diversity in model outputs. “Dynamic Sampling” counters training inefficiencies by dynamically filtering samples based on their usefulness, thus ensuring a more consistent gradient signal. The “Token-Level Policy Gradient Loss” refines the loss calculation, operating at the token level rather than the sample level to better accommodate reasoning sequences of varying length. Finally, “Overlong Reward Shaping” introduces a controlled penalty for excessively long responses, gently steering models toward concise and efficient reasoning.
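To make the four techniques above more concrete, here is a minimal pure-Python sketch of each idea (Clip-Higher, Dynamic Sampling, Token-Level Policy Gradient Loss, Overlong Reward Shaping). It is not the DAPO training code: the function names, the clip ranges `eps_low` and `eps_high`, and the toy length limits `max_len` and `buffer` are illustrative assumptions, and a real system would compute these quantities on tensors inside the training framework.

```python
def clipped_token_objective(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """Clip-Higher idea: PPO-style clipped surrogate for one token, with
    separate lower and upper clip ranges (values here are illustrative)."""
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped * advantage)


def keep_group(rewards):
    """Dynamic-sampling idea: a group of responses to one prompt is only
    useful if the rewards differ; identical rewards give zero advantage
    after group normalization, hence no gradient signal."""
    return len(set(rewards)) > 1


def token_level_loss(token_objectives):
    """Token-level loss: average over every token in the batch, so longer
    responses contribute proportionally more terms."""
    total = sum(sum(sample) for sample in token_objectives)
    count = sum(len(sample) for sample in token_objectives)
    return -total / count


def sample_level_loss(token_objectives):
    """Sample-level loss, shown for contrast: average within each response
    first, then across responses, so every response has equal weight."""
    per_sample = [sum(sample) / len(sample) for sample in token_objectives]
    return -sum(per_sample) / len(per_sample)


def length_penalty(length, max_len=100, buffer=20):
    """Overlong-shaping idea: no penalty up to (max_len - buffer), a linear
    ramp down to -1 inside the buffer, and -1 beyond max_len.
    max_len and buffer are toy values chosen for this sketch."""
    soft_start = max_len - buffer
    if length <= soft_start:
        return 0.0
    if length <= max_len:
        return (soft_start - length) / buffer
    return -1.0


if __name__ == "__main__":
    # A token whose probability rose by 50% with positive advantage:
    # a wider upper clip range lets more of that increase through.
    print(clipped_token_objective(1.5, 1.0, eps_high=0.2))
    print(clipped_token_objective(1.5, 1.0, eps_high=0.28))

    # One short response (2 tokens) and one long response (6 tokens).
    batch = [[1.0, 1.0], [0.0] * 6]
    print(token_level_loss(batch), sample_level_loss(batch))

    print(keep_group([1, 1, 1]), keep_group([1, 0, 1]))
    print(length_penalty(80), length_penalty(90), length_penalty(120))
```

In the toy batch, the long response supplies 6 of the 8 tokens, so the token-level loss is -0.25, whereas the sample-level loss weights both responses equally and gives -0.5. This is the sense in which token-level averaging lets long reasoning chains influence the update in proportion to their length.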
In experiments, DAPO demonstrated significant improvements. Evaluations on the American Invitational Mathematics Examination (AIME) 2024 benchmark show that DAPO-trained models achieved a score of 50 points using the Qwen2.5-32B base model, an improvement over previous methods such as DeepSeek-R1-Zero-Qwen-32B, which achieved 47 points. Notably, DAPO reached this result with roughly half the training steps, which underscores the efficiency of the proposed methods. A systematic analysis revealed incremental gains from each technique introduced, moving from a baseline of 30 points (using GRPO alone) to 50 points with the full DAPO methodology.
Beyond the quantitative results, DAPO's training dynamics offered insight into the model's evolving reasoning patterns. Initially, the models showed little reflective behavior, often proceeding linearly through tasks without reconsidering previous steps. With continued training, however, the models gradually exhibited more reflective behaviors, indicating a form of iterative self-review. This shift highlights the ability of reinforcement learning not only to strengthen existing reasoning pathways but also to cultivate new cognitive strategies over time.
In conclusion, the open-sourcing of DAPO represents a meaningful contribution to the reinforcement learning community, removing barriers previously created by inaccessible methodologies. By clearly documenting the system's techniques and providing comprehensive access to its dataset and code, this collaborative initiative invites further research and innovation. The combined efforts of ByteDance, Tsinghua University, and the University of Hong Kong showcase the potential of transparent, cooperative research to advance the collective understanding and practical capabilities of large-scale reinforcement learning systems.
Check out the paper and project page. All credit for this research goes to the researchers of this project.
Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of artificial intelligence for social good. His most recent endeavor is the launch of an artificial intelligence media platform, Marktechpost, which stands out for in-depth coverage of machine learning and deep learning news that is both technically sound and easily understood by a wide audience. The platform draws more than 2 million monthly views, which illustrates its popularity among readers.
2025-03-18 06:48:00