
Apple and Duke Researchers Present a Reinforcement Learning Approach That Enables LLMs to Provide Intermediate Answers, Enhancing Speed and Accuracy

Long chain-of-thought (CoT) reasoning improves the performance of large language models on complex tasks, but it comes with drawbacks. The extended "thinking" phase slows response times, which disrupts real-time interactions such as chatbots. It also risks inaccuracy, because errors in earlier reasoning steps can propagate into a misleading final answer. Unlike humans, who often share partial ideas or conclusions during a conversation, LLMs delay their response until all reasoning is complete. Although RL is commonly used to train reasoning models, it mainly rewards final answers and overlooks useful intermediate insights. There is growing interest in teaching models to alternate between thinking and answering, but this remains a challenge.

RL has become a common way to enhance reasoning in LLMs, building on its success in aligning models with human preferences. Two reward types are common: outcome-based rewards (ORM), which focus on the final answer, and process-based rewards (PRM), which provide feedback on intermediate reasoning steps. While PRMs offer more detailed supervision, they often depend on human annotation and additional models, making them complex and prone to issues such as reward hacking. Separately, efforts to improve LLM reasoning have explored prompting strategies, structured reasoning, tool integration, and ways to reduce latency and improve efficiency.
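To make the distinction concrete, here is a minimal conceptual sketch (not from the paper) contrasting the two reward types; the helper functions and scoring choices are illustrative assumptions only:

```python
from typing import List

def outcome_reward(final_answer: str, gold_answer: str) -> float:
    """ORM-style: a single scalar based only on the final answer."""
    return 1.0 if final_answer.strip() == gold_answer.strip() else 0.0

def process_reward(step_scores: List[float]) -> float:
    """PRM-style: aggregates per-step scores, typically produced by a learned
    verifier or by human annotations on each intermediate reasoning step."""
    return sum(step_scores) / max(len(step_scores), 1)
```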

Researchers from Apple and Duke University introduce interleaved reasoning, a new RL approach that enables language models to alternate between thinking and answering when solving complex, multi-step questions. Instead of waiting until the end to respond, the models provide informative intermediate answers, which improves feedback for users and helps guide the model's own reasoning. Using a straightforward rule-based reward, the model is trained to produce helpful reasoning steps, leading to over 80% faster responses and accuracy gains of up to 19.3%. Although trained only on question-answering and logical reasoning datasets, the method generalizes strongly to more challenging benchmarks, such as MATH, GPQA, and MMLU.
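To illustrate the idea, here is a hypothetical sketch (the tag names follow the think/answer convention described in the paper, but the exact template and example question are assumptions) of how an interleaved trace differs from a standard think-then-answer trace:

```python
# Hypothetical traces for a multi-hop question:
# "Which river runs through the capital of France?"

think_then_answer = (
    "<think> The capital of France is Paris. "
    "Paris lies on the Seine. </think>\n"
    "<answer> The Seine </answer>"            # user sees nothing until the end
)

interleaved = (
    "<think> First, find the capital of France. It is Paris. </think>\n"
    "<answer> The capital of France is Paris. </answer>\n"   # intermediate answer, shown early
    "<think> Next, which river runs through Paris? The Seine. </think>\n"
    "<answer> The Seine </answer>"                            # final answer
)
```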

The study proposes a training framework for interleaved reasoning, in which models alternate between internal thinking and user-facing intermediate answers. Each intermediate step, or "sub-answer," is shared once the model reaches a meaningful milestone in its reasoning. A specialized training template with <think> and <answer> tags is used. The approach relies on rule-based rewards, specifically format, final accuracy, and conditional intermediate accuracy, to guide learning. Notably, intermediate rewards are applied only when specific criteria are met, ensuring that the model prioritizes overall correctness. The authors also test various reward schemes, such as all-or-nothing, partial credit, and time-discounted rewards, to improve the quality of reasoning.
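A rough sketch of how such a rule-based reward could be composed is shown below; the tag parser, the exact-match correctness check, and the weights are illustrative assumptions rather than the paper's implementation:

```python
import re
from typing import List

THINK_ANSWER_PATTERN = re.compile(
    r"^(\s*<think>.*?</think>\s*<answer>.*?</answer>\s*)+$", re.DOTALL
)

def format_reward(trace: str) -> float:
    """1.0 if the trace strictly alternates <think>/<answer> blocks."""
    return 1.0 if THINK_ANSWER_PATTERN.match(trace) else 0.0

def final_accuracy_reward(trace: str, gold_final: str) -> float:
    """Reward based only on the last <answer> block."""
    answers = re.findall(r"<answer>(.*?)</answer>", trace, re.DOTALL)
    return 1.0 if answers and answers[-1].strip() == gold_final.strip() else 0.0

def intermediate_accuracy_reward(trace: str, gold_steps: List[str],
                                 scheme: str = "partial") -> float:
    """Reward for intermediate answers under different schemes
    (all-or-nothing, partial credit, time-discounted)."""
    answers = [a.strip() for a in
               re.findall(r"<answer>(.*?)</answer>", trace, re.DOTALL)][:-1]
    hits = [1.0 if a == g.strip() else 0.0 for a, g in zip(answers, gold_steps)]
    if not hits:
        return 0.0
    if scheme == "all_or_nothing":
        return 1.0 if all(hits) else 0.0
    if scheme == "time_discounted":          # earlier correct sub-answers count more
        return sum(h * 0.9 ** i for i, h in enumerate(hits)) / len(hits)
    return sum(hits) / len(hits)             # partial credit

def total_reward(trace: str, gold_final: str, gold_steps: List[str]) -> float:
    r_fmt = format_reward(trace)
    r_final = final_accuracy_reward(trace, gold_final)
    # Conditional: intermediate credit applies only once format and final
    # answer are already correct, so overall correctness stays the priority.
    r_mid = intermediate_accuracy_reward(trace, gold_steps) if (r_fmt and r_final) else 0.0
    return r_fmt + r_final + 0.5 * r_mid     # weighting is an assumption
```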

The interleaved reasoning approach was evaluated on both familiar and unfamiliar datasets using Qwen2.5 models (1.5B and 7B). Unlike traditional methods that separate thinking from responding, the interleaved method provides answers incrementally, improving both speed and usefulness. When combined with intermediate rewards, it significantly enhances model performance while reducing response delay by more than 80%. Even without exposure to new domains during training, the model adapts well, indicating strong generalization. These results highlight the value of interleaved reasoning in making AI systems more responsive and effective on real-world, multi-step reasoning tasks.

In conclusion, the study explores how interleaved reasoning, where models alternate between thinking and generating intermediate answers, can significantly improve performance and responsiveness. Using the Qwen2.5-1.5B model, the authors show that providing timely intermediate rewards during training enhances both accuracy and response generation. Various RL strategies were tested, with PPO showing stable results and conditional, time-discounted rewards proving the most effective. The method generalizes well to complex tasks and outperforms traditional think-then-answer baselines. Unlike token-level reward models, this approach uses simple rule-based rewards applied after complete reasoning steps, thereby avoiding reward hacking. Ultimately, interleaved reasoning improves reasoning quality and efficiency without relying on external tools.


Check out the paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter, join our 95K+ ML SubReddit, and subscribe to our newsletter.


Sana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.


