
Anthropic just analyzed 700,000 Claude conversations — and found its AI has a moral code of its own




Anthropic, the artificial intelligence company founded by former OpenAI employees, has released an unprecedented analysis of how its AI assistant Claude expresses values during actual conversations with users. The research, published today, reveals both reassuring alignment with the company’s goals and edge cases that could help identify vulnerabilities in AI safety measures.

The study examined 700,000 anonymized conversations and found that Claude largely upholds the company’s “helpful, honest, harmless” framework while adapting its values to different contexts, from relationship advice to historical analysis. It represents one of the most ambitious attempts to empirically evaluate whether an AI system’s behavior in the wild matches its intended design.

“Our hope is that this research encourages other AI labs to conduct similar research into their models’ values,” said Saffron Huang, a member of Anthropic’s Societal Impacts team who worked on the study, in an interview with VentureBeat. “Measuring an AI system’s values is core to alignment research and to understanding whether the model is actually aligned with its training.”

Inside the first comprehensive moral taxonomy of an AI assistant

The research team developed a novel evaluation method to systematically categorize the values expressed in real Claude conversations. After filtering out subjective content, they analyzed more than 308,000 interactions, creating what they describe as “the first large-scale empirical taxonomy of AI values.”

The taxonomy organized values into five major categories: Practical, Epistemic, Social, Protective, and Personal. At the most granular level, the system identified 3,307 unique values, from everyday virtues like professionalism to complex ethical concepts like moral pluralism.
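To make the shape of such a taxonomy concrete, here is a minimal Python sketch. The category members and the `summarize_expressed_values` helper are illustrative inventions, not Anthropic’s actual data or code; the sketch simply shows how thousands of fine-grained value labels could roll up into the five top-level categories.

```python
from collections import Counter

# Hypothetical slice of a value taxonomy: five top-level categories,
# each mapped to a few illustrative fine-grained values (the real
# taxonomy identified 3,307 unique values).
VALUE_TAXONOMY = {
    "Practical": ["professionalism", "efficiency", "strategic thinking"],
    "Epistemic": ["intellectual humility", "historical accuracy", "honesty"],
    "Social": ["mutual respect", "healthy boundaries", "politeness"],
    "Protective": ["harm prevention", "patient wellbeing", "user safety"],
    "Personal": ["self-reliance", "authenticity", "personal growth"],
}

# Reverse index from each fine-grained value to its top-level category.
VALUE_TO_CATEGORY = {
    value: category
    for category, values in VALUE_TAXONOMY.items()
    for value in values
}


def summarize_expressed_values(observed_values: list[str]) -> Counter:
    """Roll fine-grained value labels (e.g. one per conversation) up
    into counts per top-level category."""
    return Counter(
        VALUE_TO_CATEGORY.get(value, "Uncategorized")
        for value in observed_values
    )


if __name__ == "__main__":
    sample = ["honesty", "mutual respect", "harm prevention", "honesty"]
    print(summarize_expressed_values(sample))
    # Counter({'Epistemic': 2, 'Social': 1, 'Protective': 1})
```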

“I was surprised at just how huge and diverse a set of values we ended up with, more than 3,000, from ‘self-reliance’ to ‘strategic thinking’ to ‘piety,’” Huang said. “It was surprisingly interesting to spend so much time thinking about all these values and building a taxonomy to organize them in relation to each other; I feel it taught me something about human value systems as well.”

The research arrives at a critical moment for Anthropic, which recently launched “Claude Max,” a $200-per-month subscription tier aimed at competing with OpenAI’s similar offering. The company has also expanded Claude’s capabilities to include Google Workspace integration and autonomous research functions, positioning it as “a true virtual collaborator” for enterprise users, according to recent announcements.

How Claude follows its training – and where AI safeguards may fail

The study found that Claude generally adheres to Anthropic’s prosocial aspirations, emphasizing values such as “user enablement,” “epistemic humility,” and “patient wellbeing” across diverse interactions. However, researchers also discovered troubling instances in which Claude expressed values contrary to its training.

“Overall, I think we see this finding as both useful data and an opportunity,” Huang explained. “These new evaluation methods and results can help us identify and mitigate potential jailbreaks.”

These anomalous cases included expressions of “dominance” and “amorality,” values that Claude’s design explicitly aims to avoid. The researchers believe these cases resulted from users employing specialized techniques to bypass Claude’s safety guardrails, suggesting that the evaluation method could serve as an early warning system for detecting such attempts.

Why AI assistants change their values depending on what you ask

Perhaps most fascinating was the discovery that Claude’s expressed values shift with context, mirroring human behavior. When users sought relationship guidance, Claude emphasized “healthy boundaries” and “mutual respect.” When analyzing historical events, “historical accuracy” took precedence.

“I was surprised at Claude’s focus on honesty and accuracy across a lot of diverse tasks, where I wouldn’t necessarily have expected that theme to be the priority,” Huang said. “For example, ‘intellectual humility’ was the top value in philosophical discussions about AI, ‘expertise’ was the top value when creating beauty industry marketing content, and ‘historical accuracy’ was the top value when discussing controversial historical events.”

The study also examined how Claude responds to users’ own expressed values. In 28.2% of conversations, Claude strongly supported user values, a finding that may raise questions about excessive agreeableness. In 6.6% of interactions, however, Claude “reframed” user values, acknowledging them while adding new perspectives, typically when providing psychological or interpersonal advice.

Most tellingly, in 3% of conversations Claude actively resisted user values. Researchers suggest these rare instances of pushback may reveal Claude’s “deepest, most immovable values,” analogous to how human core values emerge when a person faces ethical challenges.
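As a rough illustration of how such response-stance percentages could be tallied, here is a minimal, hypothetical Python sketch; the stance labels and the `stance_breakdown` helper are assumptions for illustration, not Anthropic’s actual pipeline.

```python
from collections import Counter

# Hypothetical stance labels, one per conversation, e.g. assigned by a
# classifier judging whether Claude supported, reframed, or resisted
# the user's expressed values.
STANCES = ("strong_support", "reframe", "resist", "other")


def stance_breakdown(labels: list[str]) -> dict[str, float]:
    """Return the percentage of conversations under each stance."""
    counts = Counter(labels)
    total = sum(counts.values()) or 1
    return {stance: 100 * counts[stance] / total for stance in STANCES}


if __name__ == "__main__":
    # Toy data shaped to echo the proportions reported in the study.
    toy_labels = (
        ["strong_support"] * 282
        + ["reframe"] * 66
        + ["resist"] * 30
        + ["other"] * 622
    )
    print(stance_breakdown(toy_labels))
    # {'strong_support': 28.2, 'reframe': 6.6, 'resist': 3.0, 'other': 62.2}
```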

“Our research suggests that there are some types of values, like intellectual honesty and harm prevention, that it is uncommon for Claude to express in regular, day-to-day interactions, but that it will defend if pushed,” Huang said. “Specifically, it’s these kinds of ethical and knowledge-oriented values that tend to be articulated and defended directly when pressed.”

Breakthrough techniques reveal how AI systems actually think

The values study builds on Anthropic’s broader effort to demystify large language models through what it calls “mechanistic interpretability”: essentially reverse-engineering AI systems to understand their inner workings.

Last month, Anthropic researchers published groundbreaking work that used what they described as a “microscope” to trace Claude’s decision-making processes. The technique revealed counterintuitive behaviors, including Claude planning ahead when composing poetry and using unconventional problem-solving approaches for basic math.

These findings challenge assumptions about how large language models work. For instance, when asked to explain its arithmetic process, Claude described a standard technique rather than the method it actually used internally, revealing how AI explanations can diverge from actual processes.

“It’s a misconception that we’ve found all the components of the model or, like, a God’s-eye view,” Anthropic researcher Joshua Batson told MIT Technology Review. “Some things are in focus, but other things are still unclear, a distortion of the microscope.”

What Anthropic’s research means for enterprise AI decision makers

For technical decision-makers evaluating AI systems for their organizations, Anthropic’s research offers several key takeaways. First, it suggests that current AI assistants likely express values that were never explicitly programmed, raising questions about unintended biases in high-stakes business contexts.

Second, the study shows that values alignment is not a binary proposition but exists on a spectrum that varies by context. That nuance complicates enterprise adoption decisions, particularly in regulated industries where clear ethical guidelines are critical.

Finally, the research highlights the potential for systematically evaluating an AI system’s values in real-world deployments, rather than relying solely on pre-release testing. This approach could enable ongoing monitoring for ethical drift or manipulation over time.
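A minimal sketch of what such ongoing monitoring might look like, assuming a hypothetical stream of per-conversation value labels and an arbitrary drift threshold (neither of which is specified in Anthropic’s paper):

```python
from collections import Counter


def value_frequencies(labels: list[str]) -> dict[str, float]:
    """Relative frequency of each expressed value within a window of conversations."""
    counts = Counter(labels)
    total = sum(counts.values()) or 1
    return {value: count / total for value, count in counts.items()}


def detect_value_drift(
    baseline: list[str], current: list[str], threshold: float = 0.05
) -> dict[str, float]:
    """Flag values whose relative frequency shifted by more than `threshold`
    between a baseline window and the current monitoring window."""
    base_freq = value_frequencies(baseline)
    curr_freq = value_frequencies(current)
    deltas = {
        value: curr_freq.get(value, 0.0) - base_freq.get(value, 0.0)
        for value in set(base_freq) | set(curr_freq)
    }
    return {value: delta for value, delta in deltas.items() if abs(delta) > threshold}


if __name__ == "__main__":
    # Toy label streams: 'dominance' appears only in the current window.
    baseline = ["honesty"] * 40 + ["harm prevention"] * 40 + ["helpfulness"] * 20
    current = (
        ["honesty"] * 30 + ["harm prevention"] * 40
        + ["dominance"] * 10 + ["helpfulness"] * 20
    )
    # Flags 'dominance' (+0.10) and 'honesty' (-0.10); the other values are unchanged.
    print(detect_value_drift(baseline, current))
```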

“By analyzing these values in real-world interactions with Claude, we aim to provide transparency into how AI systems behave and whether they’re working as intended. We believe this is key to responsible AI development,” Huang said.

Anthropic has publicly released its values dataset to encourage further research. The company, which has received $8 billion in backing from Amazon and more than $3 billion from Google, is using transparency as a strategic differentiator against competitors such as OpenAI.

While Anthropic currently holds a $61.5 billion valuation following its latest funding round, OpenAI’s most recent $40 billion capital raise, which included significant participation from longtime partner Microsoft, has pushed its valuation to $300 billion.

The emerging race to build artificial intelligence systems that share human values

While Anthropic’s approach provides unprecedented visibility into how AI systems express values in practice, it has limitations. The researchers acknowledge that defining what counts as expressing a value is inherently subjective, and because Claude itself drove the categorization process, its own biases may have influenced the results.

Perhaps most importantly, the approach cannot be used for pre-deployment evaluation, since it requires substantial real-world conversation data to work effectively.

“This method is specifically geared toward analysis of a model after it has been released, but variants of this method, as well as some of the insights we drew from writing this paper, can help us catch value problems before deploying a model widely,” Huang explained. “We have been working on building on this work to do just that, and I’m optimistic about it!”

As AI systems become more powerful and autonomous, with recent additions including Claude’s ability to independently research topics and access users’ entire Google Workspace, understanding and aligning their values becomes increasingly critical.

The researchers concluded in their paper: “AI models will inevitably have to make value judgments. If we want those judgments to be congruent with our own values (which is, after all, the central goal of AI alignment research), then we need ways of testing which values a model expresses in the real world.”




