Researchers from PSU and Duke introduce “Multi-Agent Systems Automated Failure Attribution”



Share My Research is Synced's column that welcomes scholars to share their own research breakthroughs with more than 2 million AI enthusiasts worldwide. Beyond technological advances, Share My Research also welcomes the interesting stories behind the research and exciting research ideas.

Meet the authors
Institutions: Pennsylvania State University, Duke University, Google DeepMind, University of Washington, Meta, Nanyang Technological University, and Oregon State University. The co-first authors are Shaokun Zhang of Pennsylvania State University and Ming Yin of Duke University.

In recent years, LLM multi-agent systems have received widespread attention for their collaborative approach to solving complex problems. However, it is common for these systems to fail at a task despite a flurry of activity. This leaves developers with a critical question: which agent, at what point, was responsible for the failure? Sifting through vast interaction logs to pinpoint the root cause is like finding a needle in a haystack: time-consuming and labor-intensive.

This is a familiar frustration for developers. In increasingly complex multi-agent systems, failures are not only common but also difficult to diagnose, because agents collaborate autonomously and information passes through long chains. Without a way to quickly identify the source of a failure, system iteration and optimization grind to a halt.

To address this challenge, researchers from Pennsylvania State University and Duke University, in collaboration with institutions including Google DeepMind, have introduced the new research problem of “Automated Failure Attribution.” They built the first benchmark dataset for this task, Who&When, and developed and evaluated several automated attribution methods. This work not only highlights the complexity of the task but also aims to pave a new path toward improving the reliability of LLM multi-agent systems.
The paper was accepted as a Spotlight presentation at the top-tier machine learning conference ICML 2025, and the code and dataset are now fully open source.

Paper: https://arxiv.org/pdf/2505.00212
Code: https://github.com/mingyin1/Agents_Failure_Attribution
Dataset: https://huggingface.co/datasets/Kevin355/Who_and_When

Research Background and Challenges
LLM-driven multi-agent systems have shown tremendous potential across many domains. However, these systems are fragile: an error by a single agent, a misunderstanding between agents, or a mistake in passing information along can cause the entire task to fail.

Currently, when a system fails, developers are often left with manual and inefficient debugging methods:
Manual log review: Developers must manually read through lengthy interaction logs to find the source of the problem.
Reliance on expertise: The debugging process depends heavily on the developer's deep understanding of the system and the task at hand.

This “needle in a haystack” approach to debugging is not only inefficient but also severely hinders rapid system iteration and improvements in reliability. There is an urgent need for an automated, systematic method to pinpoint the cause of failures, bridging the gap between “evaluation results” and “system improvement.”

Core Contributions
This paper makes several pioneering contributions to address the challenges described above:
1. Defining a new problem: The paper is the first to formalize “automated failure attribution” as a specific research task. The task is defined as identifying the failure-responsible agent and the decisive error step that led to the task's failure.

2. Constructing the first benchmark dataset, Who&When: This dataset includes a wide range of failure logs collected from 127 LLM multi-agent systems, which were either algorithmically generated or hand-crafted by experts to ensure realism and diversity. Each failure log is accompanied by fine-grained human annotations for:
Who: The agent responsible for the failure.
When: The specific interaction step where the decisive error occurred.
Why: A natural-language explanation of the cause of the failure.

3. Exploring initial automated attribution methods: Using the Who&When dataset, the paper designs and evaluates three distinct methods for automated failure attribution:
All-at-Once: This method gives the LLM the user query and the complete failure log, and asks it to identify the responsible agent and the decisive error step in a single pass. Although it is cost-effective, it may struggle to pinpoint precise errors in long contexts.
Step-by-Step: This approach mimics manual debugging by having the LLM review the interaction log sequentially, making a judgment at each step until the error is found. It is more precise at locating the error step, but it incurs higher costs and risks accumulating errors.
Binary Search: A compromise between the first two methods, this strategy repeatedly divides the log in half, using the LLM to determine which half contains the error. It then recursively searches the identified segment, offering a balance between cost and performance.
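
The three strategies above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the `Step` type, the three judge callables (`judge_log`, `judge_step`, `judge_half`), and the oracle returned by `make_oracle_judges` are all hypothetical stand-ins for LLM prompts, introduced here only so the control flow can run without a model.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Step:
    agent: str    # name of the agent that produced this step
    content: str  # what the agent said or did


Attribution = Optional[Tuple[str, int]]  # (responsible agent, error step index)


def all_at_once(query, log: Sequence[Step], judge_log) -> Attribution:
    """One call: the judge sees the whole log and names agent and step."""
    if not log:
        return None
    return judge_log(query, log)


def step_by_step(query, log: Sequence[Step], judge_step) -> Attribution:
    """Walk the log in order; stop at the first step judged decisive."""
    for i in range(len(log)):
        # The judge only sees the log up to and including step i.
        if judge_step(query, log[: i + 1]):
            return log[i].agent, i
    return None


def binary_search(query, log: Sequence[Step], judge_half) -> Attribution:
    """Repeatedly halve the range [lo, hi) that is believed to hold the error."""
    if not log:
        return None
    lo, hi = 0, len(log)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if judge_half(query, log, lo, mid, hi):  # True: error is in [lo, mid)
            hi = mid
        else:
            lo = mid
    return log[lo].agent, lo


def make_oracle_judges(error_index: int):
    """Hypothetical perfect judges (no LLM), used only to exercise the code."""
    calls = Counter()

    def judge_log(query, log):
        calls["all_at_once"] += 1
        return log[error_index].agent, error_index

    def judge_step(query, prefix):
        calls["step_by_step"] += 1
        return len(prefix) - 1 == error_index

    def judge_half(query, log, lo, mid, hi):
        calls["binary_search"] += 1
        return lo <= error_index < mid

    return judge_log, judge_step, judge_half, calls


QUERY = "toy task"
LOG = [
    Step("Planner", "plan the task"),
    Step("Coder", "write code"),
    Step("Executor", "run code"),
    Step("Coder", "patch code incorrectly"),  # assumed decisive error
    Step("Verifier", "accept result"),
    Step("Planner", "report answer"),
]

judge_log, judge_step, judge_half, calls = make_oracle_judges(error_index=3)
results = {
    "all_at_once": all_at_once(QUERY, LOG, judge_log),
    "step_by_step": step_by_step(QUERY, LOG, judge_step),
    "binary_search": binary_search(QUERY, LOG, judge_half),
}
print(results)
print(dict(calls))
```

With a perfect oracle all three strategies agree on the answer, so the only visible difference in this toy run is the number of judge calls, which mirrors the cost trade-off described above. With a real LLM judge, each call can be wrong, which is where the accuracy differences come from.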

Experimental Results and Key Findings

Experiments were conducted in two settings: one where the LLM knows the ground-truth answer to the problem the multi-agent system is trying to solve (With Ground Truth), and one where it does not (Without Ground Truth). The primary model used was GPT-4o, although other models were also tested. The systematic evaluation of these methods on the Who&When dataset yielded several important insights:

A long way to go: Current methods are far from perfect. Even the best-performing single method achieved an accuracy of only about 53.5% in identifying the responsible agent and just 14.2% in pinpointing the exact error step. Some methods performed worse than random guessing, which underscores the difficulty of the task.
No “all-in-one” solution: Different methods excel at different aspects of the problem. The All-at-Once method is better at identifying “Who,” while the Step-by-Step method is more effective at determining “When.” The Binary Search method delivers middle-ground performance.
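
The two accuracy figures quoted above are agent-level (“Who”) and step-level (“When”) accuracy. A minimal sketch of how such scores could be computed is shown below. The function name `attribution_accuracy` and the data layout are illustrative assumptions, and the sketch assumes that the step is scored independently of the agent and that a missing prediction counts as a miss on both.

```python
from typing import Optional, Sequence, Tuple

Label = Tuple[str, int]        # (responsible agent, decisive error step)
Prediction = Optional[Label]   # None when a method returns no attribution


def attribution_accuracy(
    predictions: Sequence[Prediction], labels: Sequence[Label]
) -> Tuple[float, float]:
    """Return (agent-level accuracy, step-level accuracy) as fractions."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        return 0.0, 0.0
    # Assumption: agent and step are scored independently of each other.
    agent_hits = sum(
        p is not None and p[0] == g[0] for p, g in zip(predictions, labels)
    )
    step_hits = sum(
        p is not None and p[1] == g[1] for p, g in zip(predictions, labels)
    )
    n = len(labels)
    return agent_hits / n, step_hits / n


# Toy data, invented purely for illustration.
labels = [("Coder", 3), ("Coder", 3), ("Coder", 2), ("Verifier", 0)]
predictions = [("Coder", 3), ("Planner", 5), ("Coder", 1), None]

agent_acc, step_acc = attribution_accuracy(predictions, labels)
print(f"agent-level: {agent_acc:.2f}, step-level: {step_acc:.2f}")
```

Scoring the two levels separately is what makes it possible for one method to lead on “Who” while another leads on “When.”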

Hybrid approaches show promise, but at a high cost: The researchers found that combining different methods, such as using the All-at-Once approach to select a likely agent and then applying the Step-by-Step method to find the error, can improve overall performance. However, this comes with a significant increase in computational cost.
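
One way such a combination could be wired together is sketched below. This is an assumption about the control flow rather than the paper's exact procedure: a hypothetical `judge_agent` callable picks the suspect agent from the full log, and a hypothetical `judge_step` callable is then consulted only on steps taken by that agent. Both callables stand in for LLM prompts.

```python
from typing import Callable, NamedTuple, Optional, Sequence, Tuple


class Step(NamedTuple):
    agent: str
    content: str


def hybrid_attribution(
    query: str,
    log: Sequence[Step],
    judge_agent: Callable[[str, Sequence[Step]], str],
    judge_step: Callable[[str, Sequence[Step]], bool],
) -> Tuple[Optional[str], Optional[int]]:
    """All-at-Once picks the agent; Step-by-Step scans only that agent's steps."""
    if not log:
        return None, None
    agent = judge_agent(query, log)  # one call over the complete log
    for i, step in enumerate(log):
        if step.agent != agent:
            continue  # skip steps by agents that were not flagged
        if judge_step(query, log[: i + 1]):  # judge sees the log up to step i
            return agent, i
    return agent, None  # agent flagged, but no step was judged decisive


# Toy log and oracle judges, invented for illustration only.
LOG = [
    Step("Planner", "plan the task"),
    Step("Coder", "write code"),
    Step("Executor", "run code"),
    Step("Coder", "patch code incorrectly"),  # assumed decisive error
    Step("Verifier", "accept result"),
]
step_calls = []


def oracle_agent(query, log):
    return "Coder"


def oracle_step(query, prefix):
    step_calls.append(len(prefix) - 1)  # record which step index was judged
    return len(prefix) - 1 == 3


result = hybrid_attribution("toy task", LOG, oracle_agent, oracle_step)
print(result, step_calls)
```

Even in this sketch the hybrid needs the full-log call plus one call per candidate step, which illustrates why combining methods raises the overall cost.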

State-of-the-art models struggle: Surprisingly, even the most advanced reasoning models, such as OpenAI o1 and DeepSeek R1, struggle with this task. This highlights the inherent difficulty of automated failure attribution, which demands a higher level of reasoning than more conventional tasks require.

The importance of explicit reasoning: Prompts that explicitly require the LLM to explain its reasoning were shown to improve performance in the All-at-Once and Step-by-Step methods.

Context length is a limiting factor: The study also found that as the context length of the failure logs increases, the performance of all attribution methods tends to decrease, with a more pronounced effect on the accuracy of identifying the error step.