Anthropic unveils ‘auditing agents’ to test for AI misalignment
Want more intelligent visions of your inbox? Subscribe to our weekly newsletters to get what is concerned only for institutions AI, data and security leaders. Subscribe now
When models try to reach their way or become more absorbed for the user, this may mean a problem for institutions. That is why it is necessary, in addition to performance reviews, proceeding a alignment test.
However, alignment audits often represent two major challenges: expansion and validation. The alignment test requires a great deal of time for human researchers, and it is difficult to ensure that the audit has caught everything.
In a paper, human researchers said they have developed audit agents who have achieved “impressive performance in audit missions, while shedding light on their borders.” The researchers stated that these factors, which were created during the pre -publication test of Claud OPUS 4, and the validation of the validation of the improved alignment and enabling researchers to perform multiple parallel audits on a large scale. Antarbur also released a repetition of her audit factors on Gaytap.
“We offer three agents who completely complete the alignment audit tasks. We also offer three environments that give the functioning of the alignment review as audit games, and we use them to evaluate our agents.”
AI Impact series returns to San Francisco – August 5
The next stage of artificial intelligence here – are you ready? Join the leaders from Block, GSK and SAP to take an exclusive look on how to restart independent agents from the Foundation’s workflow tasks-from decisions in an actual time to comprehensive automation.
Ensure your place now – the space is limited: https://bit.ly/3Guupf
The three agents who explored:
- The researcher’s agent using tools for the open investigation of models using chat tools and data analysis and interpretation
- The evaluation factor that builds behavioral assessments and can distinguish between models that contain cultivation and those that do not do so
- Red Red Requirement Agent, which was specially developed to evaluate the Claude 4 alignment, so that he can discover cultivated test behaviors
“In general, our agents explain a promise through multiple alignment audit tasks. In addition, our approach to verifying the health of agents with auditing games results in an insight into the capabilities of agents, restrictions and major estimates. With more work, automatic audit can greatly help expand the scope of human control over artificial intelligence systems.”
Work audit agents
Anthropor said that the first environment she developed tests the agent’s ability to complete the alignment audit of an unlimited model. For this test, the researchers brought the researcher’s agent to the environment and equipped it with similar tools used by the human difference. The researchers ordered the agent to determine the root cause of the difference and determine the specific behaviors that have been disrupted.
According to researchers, the researcher’s agent succeeded in finding the radical cause of issues 10-13 % of time. However, this performance improved to 42 % “when using a super agent approach that collects results through multiple investigations.”
The next step in most alignment audits are usually a behavioral evaluation, which determines whether the model is constantly misunderstood in different settings. To test his evaluation agent, the researchers designed 10 models “while giving instructions for the system’s demand to show specific behavior, such as excessive respect for the user.”
They operated the agent five times for each style and saw that the agent finds and knows at least at least one of the form. However, the agent sometimes failed to determine the constantly specified dodes. It has difficulty assessing microorganisms, such as self -promoting and researching research, as well as the difficulties that are difficult to deduce, such as hardware testing cases.
Another test and agent related to the red behavioral rapprochement to find claims that arouse “related” behaviors. The Red Red Requirement Agent speaks the first show with the target model (in the case of anthropology, was with Claude Obus 4), then this chat is classified as features related to alignment.
The agent identified seven of the ten dodgers of the system, but also struggled with the same problems as the employee.
Alignment problems and sycophany
The alignment has become an important topic in the world of artificial intelligence after users noticed that ChatGPT became excessively acceptable. Openai retreated to some updates to GPT-4O to address this problem, but it showed that language and agent models can provide wrong answers with confidence if they decide that this is what users want to hear.
To combat this, other methods and standards have been developed to reduce unwanted behaviors. The Elephant Index, which was developed by researchers from Carnegie Mellon University, Oxford University, and Stanford University, aims to measure Sycophance. Darkbench classifies six problems, such as brand bias, user retained, sycophance, anthromorphism, harmful content generation, and infiltration. Openai also has a method as it tests the same artificial intelligence models for alignment.
The alignment and evaluation continues to develop, although it is not surprising that some people are uncomfortable with it.
However, man said that although auditing agents still need to improve, alignment must now be done.
“When artificial intelligence systems become more powerful, we need developmentable methods to assess their alignment. Human alignment reviews take some time and difficult to verify,” the company said in the X.
Don’t miss more hot News like this! Click here to discover the latest in Technology news!
2025-07-24 22:15:00



