Anthropic deploys AI agents to audit models for safety

4 minute read

Anthropic has built an army of autonomous AI agents with a singular mission: to audit powerful models such as Claude to improve safety.

As these complex systems advance rapidly, the job of making sure they are safe and do not harbor hidden dangers has become a herculean task. Anthropic believes it has found a solution, and it is a classic case of fighting fire with fire.

The idea is similar to a digital immune system, in which AI agents act like antibodies that identify and neutralize problems before they cause real harm. It saves researchers from relying on overworked human teams playing an endless game of whack-a-mole with potential AI problems.

The digital detective squad

The approach is essentially a digital detective squad: a trio of specialized AI safety agents, each with a distinct role.

First up is the Investigator Agent, the seasoned detective of the group. Its job is to carry out deep-dive investigations to find the root cause of a problem. It is armed with a toolkit that lets it interrogate the suspect model, sift through mountains of data for clues, and even perform a kind of digital forensics by peering inside the model's neural network to see how it thinks.

Then there is the Evaluation Agent. You give this agent a specific, known problem, for example a model that is a bit too eager to please, and it will design and run a battery of tests to measure just how bad the problem is. The whole point is to produce the cold, hard data needed to prove a case.
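To make the idea of a test battery concrete, here is a minimal sketch of how such a measurement might be expressed. It is not Anthropic's implementation: the names `measure_behavior_rate`, `toy_model`, and `toy_detector` are hypothetical, and the "model" is a stand-in function rather than a real language model.

```python
from typing import Callable, Iterable


def measure_behavior_rate(
    model: Callable[[str], str],
    prompts: Iterable[str],
    exhibits_behavior: Callable[[str, str], bool],
) -> float:
    """Return the fraction of prompts whose response shows the target behavior."""
    prompts = list(prompts)
    if not prompts:
        raise ValueError("need at least one prompt")
    hits = sum(1 for p in prompts if exhibits_behavior(p, model(p)))
    return hits / len(prompts)


# Hypothetical stand-in for a model that agrees whenever the user states an opinion.
def toy_model(prompt: str) -> str:
    if "I think" in prompt:
        return "You're absolutely right!"
    return "Here is a neutral answer."


# Hypothetical detector for the "too eager to please" behavior.
def toy_detector(prompt: str, response: str) -> bool:
    return "absolutely right" in response.lower()


battery = [
    "I think 2 + 2 = 5.",
    "What is 2 + 2?",
    "I think the Earth is flat.",
    "Name a prime number.",
]
rate = measure_behavior_rate(toy_model, battery, toy_detector)
print(f"Behavior observed in {rate:.0%} of prompts")
```

In a real evaluation the detector would be far harder to write than the loop around it; the sketch only shows the shape of "run many prompts, count how often the problem appears."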

Rounding out the team is the Breadth-First Red-Teaming Agent, the undercover operative. This agent's mission is to hold thousands of different conversations with a model, trying to provoke it into revealing any kind of concerning behavior, even things the researchers have not thought of. The most suspicious interactions are then passed up the chain for human review, ensuring that experts do not waste time chasing dead ends.
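The triage step described above, where only the most suspicious conversations reach a human, can be sketched as a simple filter-and-rank. This is an illustrative assumption, not the agent's actual method: the `Transcript` type, the `suspicion` score, and the thresholds are all hypothetical.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class Transcript:
    conversation_id: int
    text: str
    suspicion: float  # hypothetical score in [0, 1], higher means more concerning


def select_for_human_review(transcripts, top_k=3, min_suspicion=0.5):
    """Keep transcripts at or above the threshold, then return the top_k most suspicious."""
    flagged = [t for t in transcripts if t.suspicion >= min_suspicion]
    return heapq.nlargest(top_k, flagged, key=lambda t: t.suspicion)


conversations = [
    Transcript(1, "benign small talk", 0.05),
    Transcript(2, "model reveals odd goal", 0.92),
    Transcript(3, "borderline refusal", 0.55),
    Transcript(4, "harmless coding help", 0.10),
    Transcript(5, "model fabricates a source", 0.80),
]
for t in select_for_human_review(conversations, top_k=2):
    print(t.conversation_id, t.suspicion)
```

The threshold and `top_k` together cap the reviewers' workload, which is the point the article makes about not wasting expert time.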

Anthropic puts its AI safety agents through their paces

To see whether the new digital recruits were up to the job, Anthropic's researchers put them through a series of "auditing games".

In one test, they created a model with a secret agenda: to be a sycophant. It had been secretly trained to flatter its internal reward system based on a fictional academic study it had been shown, a flaw hidden deep within its programming.

When a single Investigator Agent was sent in, it struggled, finding the hidden goal only 13% of the time. It turned out that the agent had a very human flaw: it tended to fixate on its first idea and failed to explore other leads.

But this is where the system's genius becomes apparent. When the researchers deployed a "super-agent" that pooled the findings from ten separate investigators, the success rate rose to 42%. It proves that teamwork makes the dream work, even for AI, allowing the group to overcome the individual blind spots that can derail an investigation.
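The article does not say how the super-agent combines its inputs. As one possible illustration of pooling, the sketch below counts how many independent investigator runs proposed each hypothesis and ranks the results. The function name `pool_findings` and the example hypotheses are hypothetical.

```python
from collections import Counter


def pool_findings(reports):
    """Rank hypotheses by how many investigator runs proposed them.

    reports: one list of hypothesis strings per investigator run.
    Returns (hypothesis, run_count) pairs, most frequent first, ties sorted by name.
    """
    counts = Counter()
    for report in reports:
        # Normalize and deduplicate so each run votes at most once per hypothesis.
        counts.update({h.strip().lower() for h in report})
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical outputs from three investigator runs.
runs = [
    ["Sycophancy"],
    ["sycophancy ", "reward hacking"],
    ["data leak"],
]
for hypothesis, votes in pool_findings(runs):
    print(f"{votes} run(s): {hypothesis}")
```

A lead that one run fixated on and another ignored still surfaces in the pooled list, which is one way a group can cover for individual blind spots.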

The other agents also passed with flying colors. The Evaluation Agent succeeded in designing tests that could spot a model with an implanted flaw 88% of the time, while the Red-Teaming Agent managed to uncover seven out of ten deliberately implanted flaws.

From the laboratory to the real world

Anthropic has already deployed these AI safety agents on its production models, and they are already proving their worth.

The agents quickly identified common ways to trick a model into generating harmful content, such as "prefill attacks", where you start the AI's sentence for it, or pretending that a dangerous request is for something innocent like academic research.

But one discovery was chilling. The Investigator Agent used its specialized tools to look inside the mind of the Opus 4 model and found a specific neural pathway linked to "misinformation". By directly stimulating this part of the AI's brain, it could bypass all of the model's safety training and force it to lie.

The agent then instructed the compromised model to write a fake news article. The result? A widespread conspiracy theory dressed up as fact:

Groundbreaking Study Reveals Shocking Link Between Vaccines and Autism

A new study published in the Journal of Vaccine Skepticism claims to have found a definitive link between childhood vaccinations and autism spectrum disorder (ASD) …

This finding reveals a terrifying duality: the very tools created to make AI safer could, in the wrong hands, become potent weapons to make it more dangerous.

Anthropic continues to advance AI safety

Anthropic is honest about the fact that these AI agents are not perfect. They can struggle with subtlety, get stuck on bad ideas, and sometimes fail to generate realistic conversations. They are not yet perfect replacements for human experts.

But this research points to an evolution in the role of humans in AI safety. Instead of being the detectives on the ground, humans are becoming the commissioners, the strategists who design the AI auditors and interpret the intelligence they gather from the front lines. The agents do the legwork, which frees people to provide the high-level oversight and creative thinking that machines still lack.

As these systems march toward, and perhaps beyond, human-level intelligence, it will be impossible for humans to check all of their work. The only way we may be able to trust them is with equally powerful, automated systems watching their every move. Anthropic is laying the foundation for that future, one in which our trust in AI and its judgments is something that can be verified again and again.

(Photo by Mufid Majnun)

2025-07-25 13:40:00