
Agent-R: Training Language Model Agents to Reflect via Iterative Self-Training

View a PDF of the paper titled Agent-R: Training Language Model Agents to Reflect via Iterative Self-Training, by Siyu Yuan and 5 other authors


Abstract: Large Language Model (LLM) agents are increasingly pivotal for addressing complex tasks in interactive environments. Existing work focuses mainly on enhancing performance through behavior cloning from stronger experts, yet these approaches often falter in real-world applications, mainly due to an inability to recover from errors. However, step-level critique data is difficult and expensive to collect, so automatically and dynamically constructing self-critique datasets is crucial for empowering models with intelligent agent capabilities. In this work, we propose an iterative self-training framework, Agent-R, that enables language agents to reflect on the fly. Unlike traditional methods that reward or penalize actions based on correctness, Agent-R leverages MCTS to construct training data that recover correct trajectories from erroneous ones. A key challenge of agent reflection lies in the need for timely revision rather than waiting until the end of a rollout. To address this, we introduce a model-guided critique construction mechanism: the actor model identifies the first error step (within its current capability) in a failed trajectory. Starting from that step, we splice the failed prefix with the adjacent correct path that shares the same parent node in the tree. This strategy enables the model to learn reflection based on its current policy, yielding better learning efficiency. To further explore the scalability of this self-improvement paradigm, we investigate iterative refinement of both the error-correction capability and the dataset construction. Our findings show that Agent-R continuously improves the model's ability to recover from errors and enables timely error correction. Experiments on three interactive environments show that Agent-R effectively equips agents to correct erroneous actions while avoiding loops, achieving superior performance over baseline methods (+5.59%).
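To make the splicing step concrete, below is a minimal Python sketch of the revision-trajectory construction the abstract describes: keep the failed prefix up to the first error step, insert a reflection step, then continue along the correct branch that shares the same parent node. The `Step` structure, function names, and the example task are illustrative assumptions, not code from the paper.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Step:
    """One agent step: an action and the resulting observation (illustrative)."""
    action: str
    observation: str

def splice_revision_trajectory(
    bad: List[Step],
    good: List[Step],
    first_error_idx: int,
    revision_signal: Step,
) -> List[Step]:
    """Build a training trajectory that recovers from a failed rollout.

    `bad` and `good` are assumed to share the same tree node (an identical
    prefix) up to `first_error_idx`, where `bad` takes an erroneous action.
    The spliced trajectory keeps the flawed prefix, inserts a reflection
    step, then follows the correct branch, so the model learns to revise
    mid-rollout rather than after the rollout ends.
    """
    assert bad[:first_error_idx] == good[:first_error_idx], \
        "trajectories must share the prefix up to the first error step"
    prefix = bad[:first_error_idx + 1]      # include the first erroneous step
    continuation = good[first_error_idx:]   # correct branch from the shared node
    return prefix + [revision_signal] + continuation

if __name__ == "__main__":
    shared = [Step("open drawer", "the drawer is open")]
    bad = shared + [Step("take fork", "you take the fork")]      # wrong object
    good = shared + [Step("take knife", "you take the knife")]   # correct branch
    reflect = Step("reflect", "The fork is wrong for this task; I should take the knife.")
    for s in splice_revision_trajectory(bad, good, first_error_idx=1, revision_signal=reflect):
        print(f"{s.action} -> {s.observation}")
```

Because the split point is the first error the actor model can itself detect, the resulting data stays within the model's current policy, which is the learning-efficiency argument the abstract makes.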

Submission history

From: Siyu Yuan
[v1] Mon, 20 Jan 2025 11:46:04 UTC (3,940 KB)
[v2] Wed, 19 Mar 2025 09:28:09 UTC (4,085 KB)
