Meet ReSearch: A Novel AI Framework that Trains LLMs to Reason with Search via Reinforcement Learning without Using Any Supervised Data on Reasoning Steps

Large language models (LLMs) have made significant progress across a range of tasks, especially in reasoning. However, integrating reasoning with external search remains difficult, particularly for multi-hop questions that require intricate reasoning chains and multiple retrieval steps. Current methods depend primarily on manually designed prompts or heuristics, which limits their scalability and flexibility. In addition, generating supervised data for multi-step reasoning scenarios is often prohibitively expensive.
Researchers from Baichuan Inc., Tongji University, the University of Edinburgh, and Zhejiang University introduce ReSearch, a new AI framework designed to train LLMs to integrate reasoning with search through reinforcement learning, notably without relying on supervised reasoning steps. The core methodology of ReSearch embeds search operations directly into the reasoning chain. Using Group Relative Policy Optimization (GRPO), a reinforcement learning technique, ReSearch guides LLMs to learn when and how to perform searches, and the results of those searches in turn shape the subsequent reasoning. This approach lets models improve their reasoning progressively and naturally gives rise to advanced capabilities such as reflection and self-correction.
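To make the "group relative" idea concrete, here is a minimal sketch of how GRPO-style advantages are commonly computed: several rollouts are sampled for the same question, and each rollout's reward is normalized against the mean and standard deviation of its own group, so no separate value model is needed. The function name `group_relative_advantages`, the `eps` constant, and the use of the population standard deviation are assumptions made for illustration, not details taken from the ReSearch code.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its own sampling group.

    `rewards` holds one scalar reward per rollout, all sampled for the
    same question. `eps` (an assumed constant) avoids division by zero
    when every rollout in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical rollouts for one question, scored between 0 and 1.
rewards = [0.0, 0.5, 1.0, 0.5]
advantages = group_relative_advantages(rewards)
for r, a in zip(rewards, advantages):
    print(f"reward={r:.2f} advantage={a:+.3f}")
```

Rollouts that score above their group's average get a positive advantage and are reinforced; those below the average are discouraged. If every rollout in a group earns the same reward, all advantages are zero and that group contributes no learning signal.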
From a technical perspective, ReSearch uses a structured output format that embeds specific tags (such as <think>, <search>, <result>, and <answer>) within the reasoning chain. These tags enable clear communication between the model and the external retrieval environment and organize the generated output systematically. During training, retrieval results are intentionally excluded from the loss computation to prevent them from biasing the model. The reward signals that guide the reinforcement learning process rest on two simple criteria: answer accuracy, measured by F1 score, and adherence to the predefined structured output format. This design encourages advanced reasoning patterns to emerge on their own, removing the need for manually annotated reasoning datasets.
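The two training details described above, the reward and the exclusion of retrieved text from the loss, can be sketched in a few lines. This is an illustrative sketch under stated assumptions rather than the authors' implementation: the tag names `<think>`, `<search>`, `<result>`, and `<answer>` are assumed here; the article does not say how F1 and format adherence are combined, so the `FORMAT_BONUS` value and the branching in `reward` are hypothetical; and the mask works on characters, whereas a real trainer would mask tokens.

```python
import re
from collections import Counter

ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)
RESULT_RE = re.compile(r"<result>.*?</result>", re.DOTALL)
FORMAT_BONUS = 0.1  # hypothetical value, not taken from the article


def f1_score(prediction, reference):
    """Word-level F1 between a predicted and a reference answer."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def format_ok(rollout):
    """Assumed format rule: every tag is balanced and there is one answer."""
    for tag in ("think", "search", "result", "answer"):
        if rollout.count(f"<{tag}>") != rollout.count(f"</{tag}>"):
            return False
    return len(ANSWER_RE.findall(rollout)) == 1


def reward(rollout, reference):
    """Zero for malformed output; otherwise F1, with a small format bonus."""
    if not format_ok(rollout):
        return 0.0
    answer = ANSWER_RE.search(rollout).group(1).strip()
    score = f1_score(answer, reference)
    return score if score > 0 else FORMAT_BONUS


def loss_mask(rollout):
    """Character-level stand-in for a token mask: 0 inside <result> spans."""
    mask = [1] * len(rollout)
    for match in RESULT_RE.finditer(rollout):
        for i in range(match.start(), match.end()):
            mask[i] = 0
    return mask


rollout = (
    "<think>I need the capital of France.</think>"
    "<search>capital of France</search>"
    "<result>Paris is the capital of France.</result>"
    "<answer>Paris</answer>"
)
print(reward(rollout, "Paris"))
print(sum(loss_mask(rollout)), "of", len(rollout), "characters count toward the loss")
```

Masking the retrieved span means the model is never trained to reproduce text it did not generate, while the reward depends only on the final answer and the overall structure, so no annotated reasoning steps are required.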
Experimental evaluation confirms the robustness of ReSearch. On multi-hop question-answering benchmarks, including HotpotQA, 2WikiMultiHopQA, MuSiQue, and Bamboogle, ReSearch consistently outperformed baseline methods. Specifically, ReSearch-Qwen-32B-Instruct achieved improvements of between 8.9% and 22.4% over established baselines. Notably, these gains were achieved even though the model was trained on a single dataset, which points to strong generalization capabilities. Additional analyses showed that the models gradually increased their reliance on iterative search operations over the course of training, indicating improved reasoning proficiency. A detailed case study illustrates the model's ability to identify suboptimal search queries, reflect on its reasoning steps, and take corrective action on its own.

In summary, ReSearch represents a significant methodological advance in training LLMs to integrate reasoning seamlessly with external search mechanisms through reinforcement learning. By eliminating the dependence on supervised reasoning data, the framework addresses the scalability and adaptability problems inherent in multi-hop reasoning scenarios. Its capacity for self-reflection and correction strengthens its practical applicability in complex, realistic contexts. Future research may extend this reinforcement learning-based framework to broader applications and incorporate additional external knowledge resources.
Check out the Paper and GitHub page. All credit for this research goes to the researchers of this project.

Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of artificial intelligence for social good. His most recent endeavor is the launch of an artificial intelligence media platform, Marktechpost, which stands out for in-depth coverage of machine learning and deep learning news that is both technically sound and easily understood by a wide audience. The platform has more than 2 million monthly views, reflecting its popularity among readers.
2025-04-01 06:38:00