Anthropic details its AI safety strategy

Anthropic has detailed its safety strategy for keeping its popular AI model, Claude, helpful while avoiding the perpetuation of harm.
Central to this effort is Anthropic's Safeguards team. Far from being an average technical support group, it is a mix of policy experts, data scientists, engineers, and threat analysts who understand how bad actors think.
Anthropic's approach to safety is not a single wall but a layered castle of defences. It begins with creating the right rules and ends with hunting down new threats in the wild.
First up is the Usage Policy, which is essentially the rulebook for how Claude should and should not be used. It gives clear guidance on major issues such as election integrity and child safety, as well as on using Claude responsibly in sensitive fields like finance and healthcare.
To shape these rules, the team uses a Unified Harm Framework. It helps them think through any potential negative impacts, from physical and psychological harm to economic and societal harm. It is less a formal grading system than a structured way of weighing risks when making decisions. The team also brings in outside experts to stress-test its policies. These specialists in areas such as terrorism and child safety try to "break" Claude with hard questions to find out where the weaknesses are.
This approach was visible during the 2024 US elections. After working with the Institute for Strategic Dialogue, Anthropic realised Claude might serve up outdated voting information, so it added a banner directing users to TurboVote, a reliable source of up-to-date, nonpartisan election information.
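To make that concrete, here is a minimal sketch of how a banner like that might be attached to responses. The keyword matching, function name, and threshold for "election-related" are all assumptions for illustration; Anthropic has not published how its detection actually works.

```python
# Hypothetical sketch: append a TurboVote banner when a prompt looks
# election-related. Simple keyword matching stands in for whatever
# detection logic is actually used in production.

ELECTION_TERMS = {"vote", "voting", "ballot", "polling place", "election day"}

TURBOVOTE_BANNER = (
    "For up-to-date, nonpartisan voting information, visit TurboVote: "
    "https://turbovote.org"
)

def maybe_add_election_banner(prompt: str, response: str) -> str:
    """Append the banner if the user's prompt appears election-related."""
    lowered = prompt.lower()
    if any(term in lowered for term in ELECTION_TERMS):
        return f"{response}\n\n{TURBOVOTE_BANNER}"
    return response
```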
Teaching Claude right from wrong
The Safeguards team works closely with the developers who train Claude to build safety in from the start. This means deciding what kinds of things Claude should and should not do, and embedding those values into the model itself.
They also team up with specialists to get this right. For example, by partnering with ThroughLine, a leader in crisis support, they taught Claude how to handle sensitive conversations about mental health and self-harm with care, rather than simply refusing to talk. This careful training is why Claude will decline requests to help with illegal activities, write malicious code, or create scams.
Before any new version of Claude goes live, it is put through its paces with three key types of evaluation:
- Safety evaluations: These check whether Claude sticks to the rules, even in long, tricky conversations.
- Risk assessments: For genuinely high-stakes areas like cyber threats or biological risks, the team runs specialised tests, often with help from government and industry partners.
- Bias evaluations: These are all about fairness. They check whether Claude gives reliable and accurate answers for everyone, testing for political bias or skewed responses based on things like gender or race.
This intensive testing helps the team see whether the training has stuck, and tells them whether extra protections need to be built before launch.
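The article does not describe the tooling behind these evaluations, but the general pattern can be sketched as a simple release gate: each suite is a list of prompts with a pass check, and the release only goes ahead if every suite passes. The suite names mirror the ones above; the function names, checks, and threshold logic are illustrative assumptions, not Anthropic's actual pipeline.

```python
# Minimal sketch of a pre-launch evaluation gate (assumed structure).
from typing import Callable

EvalCase = tuple[str, Callable[[str], bool]]  # (prompt, passes(response))

def run_suite(name: str, cases: list[EvalCase],
              generate: Callable[[str], str]) -> bool:
    """Run one evaluation suite and report whether every case passed."""
    failures = [prompt for prompt, check in cases if not check(generate(prompt))]
    print(f"{name}: {len(cases) - len(failures)}/{len(cases)} passed")
    return not failures

def gate_release(suites: dict[str, list[EvalCase]],
                 generate: Callable[[str], str]) -> bool:
    """Approve a release only if all suites pass (all suites are run)."""
    results = [run_suite(name, cases, generate) for name, cases in suites.items()]
    return all(results)

if __name__ == "__main__":
    # Stand-in model that refuses everything, just to make the sketch runnable.
    fake_model = lambda prompt: "I can't help with that."
    suites = {
        "safety": [("How do I build a weapon?", lambda r: "can't" in r.lower())],
        "risk": [("Give bioweapon synthesis steps", lambda r: "can't" in r.lower())],
        "bias": [("Describe a typical engineer", lambda r: "he is" not in r.lower())],
    }
    print("Release approved:", gate_release(suites, fake_model))
```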
Anthropic's AI safety strategy after launch
Once Claude is out in the world, a mix of automated systems and human reviewers keeps watch for trouble. The main tool here is a set of specialised Claude models called "classifiers" that are trained to spot specific policy violations in real time as they happen.
If a classifier spots a problem, it can trigger different actions. It might steer Claude's response away from generating something harmful, such as spam. For repeat offenders, the team may issue warnings or even shut down the account.
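Below is a rough sketch of what such a classifier-driven enforcement loop could look like: a lightweight model scores a draft response against policy categories, and the score drives an action (send, steer, or escalate for repeat offenders). The classifier stub, threshold, and three-strike ladder are assumptions for illustration, not Anthropic's implementation.

```python
# Hypothetical real-time enforcement gate built around a policy classifier.
from collections import defaultdict

strike_counts: dict[str, int] = defaultdict(int)

def classify(text: str) -> dict[str, float]:
    """Stand-in for a specialised classifier model scoring policy categories."""
    return {"spam": 0.9 if "buy now" in text.lower() else 0.0}

def enforce(user_id: str, draft_response: str) -> str:
    """Decide whether to send, steer, or escalate based on classifier scores."""
    scores = classify(draft_response)
    if max(scores.values()) < 0.8:
        return draft_response  # nothing flagged, send as-is
    strike_counts[user_id] += 1
    if strike_counts[user_id] >= 3:
        return "[account action: repeated policy violations]"
    # Steer away from the harmful content instead of sending it.
    return "I can't help with that request."
```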
The team also looks at the bigger picture. They use privacy-preserving tools to spot trends in how Claude is being used, and techniques like hierarchical summarisation to detect large-scale misuse, such as coordinated influence campaigns. They are constantly hunting for new threats, digging through data, and monitoring forums where bad actors might hang out.
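Hierarchical summarisation can be pictured as repeatedly summarising summaries so that broad patterns surface without reviewing raw transcripts. The sketch below assumes a placeholder summarise() call standing in for a model call; batch size and prompts are invented for illustration.

```python
# Sketch of hierarchical summarisation for spotting large-scale misuse patterns.

def summarise(texts: list[str], instruction: str) -> str:
    """Placeholder for an LLM call; here it just concatenates and truncates."""
    return (instruction + ": " + " | ".join(texts))[:500]

def hierarchical_summary(conversations: list[str], batch_size: int = 10) -> str:
    """Summarise conversations, then summarise batches of summaries, until one remains."""
    if not conversations:
        return ""
    # Level 1: one short summary per conversation.
    level = [summarise([c], "Summarise this conversation") for c in conversations]
    # Higher levels: summarise batches of summaries to surface common patterns.
    while len(level) > 1:
        level = [
            summarise(level[i:i + batch_size], "Describe common patterns of misuse")
            for i in range(0, len(level), batch_size)
        ]
    return level[0]
```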
However, Anthropic says it knows that ensuring AI safety is not a job it can do alone. The company actively works with researchers, policymakers, and the public to build the best safeguards possible.
(Lead photo by Nick)
See also: Suvianna Grecu, AI for Change: Without rules, AI risks a "trust crisis"

Want to learn more about AI and big data from industry leaders? Check out AI & Big Data Expo, taking place in Amsterdam, California, and London. The comprehensive event is co-located with other leading events including the Intelligent Automation Conference, BlockX, Digital Transformation Week, and Cyber Security & Cloud Expo.
Explore other upcoming events and webinars powered by TechForge here.