
Qwen Researchers Propose QwenLong-L1: A Reinforcement Learning Framework for Long-Context Reasoning in Large Language Models

While large reasoning models (LRMs) have shown impressive gains on short-context reasoning through reinforcement learning (RL), these gains do not transfer well to long-context scenarios. Applications such as multi-document question answering, research synthesis, and legal or financial analysis require models to process and reason over sequences exceeding 100,000 tokens. However, RL optimization in such regimes suffers from slow reward convergence, unstable policy updates due to KL divergence fluctuations, and reduced exploration caused by entropy collapse. These bottlenecks expose a fundamental gap in moving LRMs from short-context proficiency to long-context generalization.

QwenLong-L1: A structured RL framework for long-context adaptation

To address these limitations, the Qwen research team presents QwenLong-L1, a new RL framework for adapting LRMs to long-context reasoning tasks. The framework is organized into three main stages:

  • Warm-up supervised fine-tuning (SFT): Provides a stable initialization of the policy model by training on curated question-context-answer triplets, establishing basic competence in contextual comprehension and answer extraction.
  • Curriculum-guided phased RL: Applies a staged training process with gradually increasing context lengths, letting the model acquire long-context reasoning behaviors incrementally without destabilizing policy updates.
  • Difficulty-aware retrospective sampling: Boosts exploration by retaining difficult examples from earlier stages and reusing them, weighted by difficulty, to encourage deeper reasoning and robustness across diverse inputs (a rough pipeline sketch follows this list).
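To make the three-stage structure concrete, here is a minimal Python sketch of how the pipeline might be wired together. All helper functions, data shapes, and the difficulty heuristic are illustrative assumptions inferred from the description above, not the released training code:

```python
# Illustrative sketch only: stage structure inferred from the article; all
# helpers below are hypothetical stand-ins, not the official QwenLong-L1 code.

def supervised_finetune(policy, triplets):
    """Stage 1 stub: warm-up SFT on (question, context, answer) triplets."""
    return policy  # placeholder: real code would run gradient updates

def rl_train_stage(policy, examples, max_context_tokens):
    """Stage 2 stub: one curriculum phase of RL at a fixed context budget."""
    stats = {ex["id"]: 0.5 for ex in examples}  # placeholder per-example pass rates
    return policy, stats

def sample_by_difficulty(hard_pool, k=8):
    """Stage 3 stub: reuse the hardest retained examples (lowest pass rate first)."""
    return sorted(hard_pool, key=lambda ex: ex["pass_rate"])[:k]

def train_qwenlong_l1(policy, sft_triplets, curriculum):
    policy = supervised_finetune(policy, sft_triplets)
    hard_pool = []
    for stage in curriculum:                      # e.g., 20K-token then 60K-token inputs
        batch = stage["examples"] + sample_by_difficulty(hard_pool)
        policy, stats = rl_train_stage(policy, batch, stage["max_tokens"])
        hard_pool += [dict(ex, pass_rate=stats[ex["id"]]) for ex in batch]
    return policy
```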

These stages are complemented by hybrid reward mechanisms, combining rule-based exact-match verification with semantic evaluation by a lightweight LLM, to balance precision and recall during policy training.

Technical design and methodological benefits

QwenLong-L1 integrates recent advances in group-relative RL optimization, specifically GRPO and DAPO, to mitigate the computational overhead associated with value estimation over long contexts:

  • GRPO estimates advantages by normalizing rewards within groups of sampled responses, eliminating the need for a separate value network and encouraging diverse generation patterns (a small sketch follows this list).
  • DAPO adds mechanisms such as dynamic sampling, overlong reward shaping, and asymmetric clipping thresholds to prevent entropy collapse and mitigate length biases during training.
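As a concrete illustration of the group-relative idea, below is a minimal NumPy sketch of GRPO-style advantage normalization together with a DAPO-style asymmetrically clipped objective. The epsilon constant and the clipping bounds are illustrative values, not numbers taken from the paper:

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: normalize each reward against its own sampled group,
    so no separate value network is needed."""
    r = np.asarray(rewards, dtype=np.float64)   # rewards for G responses to one prompt
    return (r - r.mean()) / (r.std() + eps)

def clipped_objective(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """DAPO-style asymmetric clipping: a wider upper bound keeps low-probability
    tokens explorable and helps counter entropy collapse (bounds are illustrative)."""
    return np.minimum(ratio * advantage,
                      np.clip(ratio, 1.0 - eps_low, 1.0 + eps_high) * advantage)

# Example: 4 sampled responses to the same long-context question
print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))  # ~[ 1., -1.,  1., -1.]
```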

The reward function is defined as the maximum of two signals: a deterministic rule-based match and a semantic judgment from a compact evaluator model (for example, Qwen2.5-1.5B). This hybrid approach avoids overfitting to rigid answer formats while still crediting correct answers expressed in varied notations and phrasings.
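A minimal sketch of this maximum-of-two-signals reward is shown below. The string normalization and the `llm_judge` callback are assumptions for illustration; the article only states that the semantic verdict comes from a compact evaluator such as Qwen2.5-1.5B:

```python
import re

def rule_based_match(prediction: str, gold: str) -> float:
    """Deterministic check: exact match after light normalization (illustrative)."""
    norm = lambda s: re.sub(r"\s+", " ", s.strip().lower())
    return 1.0 if norm(prediction) == norm(gold) else 0.0

def hybrid_reward(prediction: str, gold: str, llm_judge=None) -> float:
    """Final reward = max(rule-based signal, semantic judge signal).
    `llm_judge` is a hypothetical callable returning a score in [0, 1],
    e.g. backed by a compact evaluator such as Qwen2.5-1.5B."""
    rule_score = rule_based_match(prediction, gold)
    judge_score = llm_judge(prediction, gold) if llm_judge else 0.0
    return max(rule_score, judge_score)
```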

Moreover, the framework is trained with progressive context scaling, in which the RL process moves from 20K-token to 60K-token input lengths in controlled stages, stabilizing training dynamics and aiding policy generalization.
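One simple way to picture this schedule is as per-stage token budgets used to filter training inputs, as in the hypothetical snippet below; the 20K and 60K figures follow the article, while the filtering logic itself is an assumption:

```python
# Hypothetical progressive context scaling: per-stage token budgets.
STAGE_TOKEN_BUDGETS = [20_000, 60_000]

def examples_for_stage(dataset, stage_idx, count_tokens):
    """Keep only examples whose input fits the current stage's context budget.
    `count_tokens` is any tokenizer-backed length function supplied by the caller."""
    budget = STAGE_TOKEN_BUDGETS[stage_idx]
    return [ex for ex in dataset if count_tokens(ex["input"]) <= budget]
```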

Experimental results and benchmark performance

QwenLong-L1 was evaluated on seven long-context document QA benchmarks, including DocMath, Frames, 2WikiMultihopQA, HotpotQA, MuSiQue, NarrativeQA, and Qasper. The 32B variant, QwenLong-L1-32B, showed strong empirical performance:

  • It outperformed baseline models such as R1-Distill-Qwen-32B by 5.1 points and surpassed leading proprietary systems such as OpenAI-o3-mini and Qwen3-235B-A22B.
  • Its performance was comparable to Claude-3.7-Sonnet-Thinking, indicating competitive reasoning capability at extreme context lengths.
  • Pass@K experiments showed consistent improvement as more samples were drawn, reaching a Pass@2 average of 73.7 and exceeding DeepSeek-R1 and OpenAI-o1-preview even at low sampling rates.

Ablation studies further demonstrated the individual contributions of SFT, phased RL, and retrospective sampling. Notably, RL proved crucial for enabling emergent reasoning behaviors such as grounding, subgoal setting, verification, and backtracking, traits that supervised fine-tuning alone did not effectively induce.

Conclusion

QwenLong-L1 offers a systematic approach to equipping LRMs with robust long-context reasoning capabilities through reinforcement learning. Its design bridges the gap between short-context expertise and the demands of information-intensive tasks by combining supervised warm-up, curriculum-based context scaling, and hybrid evaluation strategies. The framework not only achieves state-of-the-art results on long-context benchmarks but also exhibits interpretable reasoning patterns that emerge during training.


Check out the paper, the model on Hugging Face, and the GitHub page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter, and don't forget to join our 95K+ ML SubReddit and subscribe to our newsletter.


Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of artificial intelligence for social good. His most recent endeavor is the launch of the AI media platform Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable to a broad audience. The platform boasts over 2 million monthly views, illustrating its popularity among readers.
