Reinforcement Learning Makes LLMs Search-Savvy: Ant Group Researchers Introduce SEM to Optimize Tool Usage and Reasoning Efficiency

Recent progress in LLMs has shown their potential to perform complex reasoning tasks and to use external tools such as search engines. Nevertheless, teaching models to make smart decisions about when to search versus when to rely on internal knowledge remains a major challenge. While simple prompting methods can direct models to invoke tools, LLMs still struggle with more nuanced behaviors, such as recognizing when an initial search was incorrect and deciding to search again. Reinforcement learning (RL) has been explored to improve these behaviors by rewarding effective search use. However, RL often leads to unnecessary tool use, with models executing redundant searches even for simple tasks, highlighting inefficiencies that must be addressed.
Various RL strategies, including Proximal Policy Optimization (PPO), Direct Preference Optimization (DPO), and Group Relative Policy Optimization (GRPO), have been used to align LLM behavior with human expectations. PPO balances exploration with policy stability, while DPO simplifies alignment by optimizing responses directly on user preferences. GRPO introduces group-based evaluations to capture subtle improvements in reasoning. Meanwhile, treating LLMs as autonomous agents that plan and execute multi-step reasoning tasks is gaining traction. Frameworks like AutoGPT and LangChain show how such agents can refine their outputs through iterative reasoning and search. However, current agent systems often depend on fixed prompts or heuristic-based tool use, which limits their adaptability and efficiency.
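As a rough illustration of GRPO's group-based evaluation, the sketch below (Python, an assumption for illustration rather than the paper's implementation) computes group-relative advantages by normalizing each sampled response's reward against the mean and standard deviation of its group.

```python
# Minimal sketch of the group-relative advantage idea behind GRPO (illustrative,
# not the authors' implementation). For each prompt, several responses are sampled
# and each response's advantage is its reward normalized against the group statistics.
from statistics import mean, stdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize per-response rewards within one sampled group of responses."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: four sampled responses to the same question, scored 0/1 for correctness.
print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
```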
Researchers at Ant Group propose SEM, a post-training reinforcement learning framework designed to teach LLMs when to use search tools and when to rely on internal knowledge. By training on a balanced dataset that combines questions requiring and not requiring external retrieval, SEM guides the model to issue search requests only when necessary. Using a structured reasoning format and GRPO, the framework rewards accurate answers obtained without searching and penalizes unnecessary tool use. The results show that SEM improves response accuracy and efficiency, helping models better judge when external information is needed and thus enhancing reasoning in complex scenarios.
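To make this reward structure concrete, here is a minimal, hypothetical sketch of SEM-style reward shaping; the specific penalty values and helper names are assumptions for illustration, not the paper's exact reward function.

```python
# Illustrative reward shaping in the spirit of SEM (hypothetical values):
# reward correct answers, and penalize issuing a search when the question
# is answerable from internal knowledge alone.
def sem_style_reward(is_correct: bool, used_search: bool, needs_search: bool) -> float:
    reward = 1.0 if is_correct else 0.0
    if used_search and not needs_search:
        reward -= 0.5   # unnecessary tool call on a "known" question
    if not used_search and needs_search and not is_correct:
        reward -= 0.5   # failed to retrieve when internal knowledge was insufficient
    return reward

# A correct answer without searching on an MMLU-style question scores highest.
print(sem_style_reward(is_correct=True, used_search=False, needs_search=False))  # 1.0
print(sem_style_reward(is_correct=True, used_search=True, needs_search=False))   # 0.5
```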
To integrate search tools into the model's reasoning process, SEM uses reinforcement learning to teach models when and how to search effectively. The training data combines MuSiQue (questions that require external information) and MMLU (questions answerable from prior knowledge), which helps models learn to judge when searching is necessary. Using the GRPO framework, the model is rewarded for accurate and efficient answers, discouraged from unnecessary searches, and encouraged to search when internal knowledge falls short. A structured response format separates the model's reasoning, search queries, retrieved results, and final answer, which makes its behavior easier to evaluate and reward during training.
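As an illustration of how such a structured response format can be consumed during training, the sketch below parses tagged output; the tag names <think>, <search>, and <answer> are assumptions for illustration rather than the paper's exact schema.

```python
# Minimal sketch of parsing a structured, tagged model response (tag names assumed).
import re

def parse_response(text: str) -> dict:
    """Extract reasoning, an optional search query, and the final answer from tagged output."""
    def grab(tag):
        m = re.search(rf"<{tag}>(.*?)</{tag}>", text, re.DOTALL)
        return m.group(1).strip() if m else None
    return {"think": grab("think"), "search": grab("search"), "answer": grab("answer")}

sample = "<think>The capital of France is well known.</think><answer>Paris</answer>"
print(parse_response(sample))  # no <search> tag: the model answered from internal knowledge
```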
The study evaluates a model trained to decide when to rely on its internal knowledge and when to use external search. It combines MuSiQue (unfamiliar questions) and MMLU (familiar questions) for training, and evaluates performance on datasets such as HotpotQA, GSM8K, and MMLU. The proposed SEM method outperforms baselines such as Naive RAG and ReSearch in both answer accuracy and search efficiency. SEM reduces unnecessary searches on known questions while improving reasoning on unfamiliar ones. Case studies and training curves confirm stable learning and intelligent decision-making. Overall, SEM strengthens both retrieval decisions and internal reasoning in large language models.
In conclusion, SEM is a post-training reinforcement learning framework designed to improve how large language models use external search. The model is trained on a dataset combining MuSiQue and MMLU, which helps it distinguish questions it can answer internally from those that require external retrieval. SEM uses a structured reasoning approach and a reward function that penalizes unnecessary searches while promoting accurate and efficient retrieval. Experiments on benchmarks such as HotpotQA, GSM8K, and MMLU show that SEM reduces redundant searches and improves accuracy. This approach enhances reasoning efficiency and the intelligent use of external knowledge in LLMs.
Check out the paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don't forget to join our 95K+ ML SubReddit.

Sana Hassan, a consulting intern at Marktechpost and a dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of artificial intelligence and real-life solutions.