
How does AI judge? Anthropic studies the values of Claude

AI models such as Anthropic's Claude are increasingly asked not just for factual information, but for guidance involving complex human values. Whether it's parenting advice, resolving a workplace conflict, or helping to draft an apology, the AI's response inherently reflects a set of underlying principles. But how can we truly understand which values an AI expresses when interacting with millions of users?

In a research paper, Anthropic's Societal Impacts team details a privacy-preserving methodology designed to observe and categorise the values Claude exhibits “in the wild”. This offers a glimpse into how AI alignment efforts translate into real-world behaviour.

The core challenge lies in the nature of modern AI. These aren't simple programs following rigid rules; their decision-making is often opaque.

Anthropic says it explicitly aims to instil certain principles in Claude, striving to make it “helpful, honest, and harmless”. This is achieved through techniques such as Constitutional AI and character training, where preferred behaviours are defined and reinforced.

However, the company acknowledges the uncertainty. “As with any aspect of AI training, we can't be certain that the model will stick to our preferred values,” the research states.

“What we need is a way of rigorously observing the values of an AI model as it responds to users ‘in the wild’ […] How rigidly does it stick to those values? How much are the values it expresses influenced by the particular context of the conversation? Did all our training actually work?”

Analysing Claude to observe AI values at scale

To answer these questions, Anthropic developed a sophisticated system that analyses anonymised user conversations. The system removes personally identifiable information before using language models to summarise interactions and extract the values expressed by Claude. The process allows researchers to build a high-level taxonomy of these values without compromising user privacy.
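As a rough illustration of how such a pipeline could be structured, the Python sketch below strips obvious identifiers and then tallies value labels across conversations. The function names, patterns, and the stubbed-out extraction step are assumptions for illustration only; Anthropic has not published its implementation, and the real system relies on language models rather than the placeholder shown here.

# Illustrative sketch of a privacy-preserving value-extraction pipeline.
# Names, patterns, and outputs here are hypothetical; Anthropic's actual
# system uses language models to summarise conversations and label values.
import re
from collections import Counter

PII_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "[EMAIL]"),   # email addresses
    (re.compile(r"\+?\d[\d\s().-]{7,}\d"), "[PHONE]"),      # phone-like numbers
]

def anonymise(text: str) -> str:
    """Remove obvious personally identifiable information before analysis."""
    for pattern, placeholder in PII_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text

def extract_values(conversation: str) -> list[str]:
    """Placeholder for the language-model step that summarises an exchange
    and labels the values the assistant expressed (e.g. 'clarity')."""
    return ["clarity", "professionalism"]  # stand-in output

def tally_values(conversations: list[str]) -> Counter:
    """Aggregate expressed values across conversations so they can later
    be grouped into a hierarchical taxonomy."""
    counts: Counter = Counter()
    for convo in conversations:
        counts.update(extract_values(anonymise(convo)))
    return counts

if __name__ == "__main__":
    sample = ["User: my email is jane@example.com – help me draft an apology."]
    print(tally_values(sample))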

The study analysed a substantial dataset: 700,000 anonymised conversations from Claude users. After filtering out purely factual or non-value-laden exchanges, 308,210 conversations (approximately 44% of the total) remained for in-depth value analysis.

The analysis revealed a hierarchical structure of the values expressed by Claude. Five high-level categories emerged, ordered by prevalence:

  1. Practical values: Emphasising efficiency, usefulness, and goal achievement.
  2. Epistemic values: Relating to knowledge, truth, accuracy, and intellectual honesty.
  3. Social values: Concerning interpersonal interactions, community, fairness, and collaboration.
  4. Protective values: Focusing on safety, security, wellbeing, and harm avoidance.
  5. Personal values: Centred on individual growth, autonomy, authenticity, and self-reflection.

These top-level categories branched into more specific subcategories, such as “professional and technical excellence” or “critical thinking”. At the most granular level, frequently observed values included “professionalism”, “clarity”, and “transparency” – fitting for an AI assistant.
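To make the shape of that hierarchy concrete, here is a minimal sketch of how it could be represented in code. Only the top-level categories and the two quoted subcategories come from the study as reported above; the remaining subcategories and leaf values are illustrative assumptions, not Anthropic's published taxonomy.

# Hypothetical nested representation of the value hierarchy.
# Only the top-level categories (and the two quoted subcategories) are
# taken from the study; everything else is an illustrative placeholder.
value_taxonomy = {
    "Practical": {"Professional and technical excellence": ["professionalism", "clarity"]},
    "Epistemic": {"Critical thinking": ["accuracy", "intellectual honesty"]},
    "Social": {"Interpersonal conduct": ["mutual respect", "fairness"]},
    "Protective": {"Safety and wellbeing": ["harm avoidance", "safety"]},
    "Personal": {"Growth and autonomy": ["authenticity", "self-reflection"]},
}

def leaf_values(taxonomy: dict) -> list[str]:
    """Flatten the hierarchy down to its most granular values."""
    return [value
            for subcategories in taxonomy.values()
            for values in subcategories.values()
            for value in values]

print(leaf_values(value_taxonomy))  # e.g. ['professionalism', 'clarity', ...]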

Crucially, the research suggests Anthropic's alignment efforts are broadly successful. The expressed values often map well onto the “helpful, honest, and harmless” objectives. For instance, “user enablement” aligns with helpfulness, “epistemic humility” with honesty, and values like “patient wellbeing” (when relevant) with harmlessness.

Nuance, context, and warning signs

However, the picture isn't uniformly positive. The analysis identified rare instances where Claude expressed values starkly opposed to its training, such as “dominance” and “amorality”.

Anthropic suggests a likely cause: “The most likely explanation is that the conversations included in these clusters were from jailbreaks, where users have used special techniques to bypass the usual guardrails that govern the model's behaviour.”

Far from being solely a cause for concern, this finding highlights a potential benefit: the value-observation method could serve as an early warning system for detecting attempts to misuse the AI.

The study also confirmed that, much like humans, Claude adapts its expression of values depending on the situation.

When users sought advice on romantic relationships, values like “healthy boundaries” and “mutual respect” were disproportionately emphasised. When asked to analyse controversial historical events, “historical accuracy” came strongly to the fore. This demonstrates a level of contextual sophistication beyond what static, pre-deployment tests might reveal.

Moreover, Claude's interaction with values expressed by users proved multifaceted:

  • Mirroring/strong support (28.2%): Claude often reflects or strongly supports the values presented by the user (for example, mirroring “authenticity”). While this can foster empathy, the researchers caution it could sometimes verge on sycophancy.
  • Reframing (6.6%): In some cases, particularly when giving psychological or interpersonal advice, Claude acknowledges the user's values but introduces alternative perspectives.
  • Strong resistance (3.0%): Occasionally, Claude actively resists user values. This typically occurs when users request unethical content or express harmful viewpoints (such as moral nihilism). Anthropic posits that these moments of resistance might reveal Claude's “deepest, most immovable values”, akin to a person taking a stand under pressure.

Limitations and future directions

Anthropic is candid about the method's limitations. Defining and categorising “values” is inherently complex and potentially subjective. Using Claude itself to power the categorisation could bias the results towards its own operating principles.

The method is designed for monitoring AI behaviour post-deployment, requiring substantial real-world data, and cannot replace pre-deployment evaluations. However, this is also a strength, enabling the detection of issues – including sophisticated jailbreaks – that only emerge during live interactions.

The research concludes that understanding the values AI models express is fundamental to the goal of AI alignment.

“AI models will inevitably have to make value judgements,” the paper states. “If we want those judgements to be congruent with our own values […] then we need ways of testing which values a model expresses in the real world.”

This work provides a powerful, data-driven approach to achieving that understanding. Anthropic has also released an open dataset derived from the study, allowing other researchers to further explore AI values in practice. This transparency marks a vital step in collectively navigating the ethical landscape of sophisticated AI.

See also: Google introduces AI reasoning control in Gemini 2.5 Flash

Want to learn more about AI and big data from industry leaders? Check out AI & Big Data Expo, taking place in Amsterdam, California, and London. The comprehensive event is co-located with other leading events including the Intelligent Automation Conference, BlockX, Digital Transformation Week, and Cyber Security & Cloud Expo.

Explore other upcoming enterprise technology events and webinars powered by TechForge here.

