
Anthropic scientists expose how AI actually ‘thinks’ — and discover it secretly plans ahead and sometimes lies


Anthropic has developed a new method for peering inside large language models such as Claude, revealing for the first time how these AI systems process information and make decisions.

The research, published today in two papers (available here and here), shows these models are more sophisticated than previously understood: they plan ahead when writing poetry, use the same internal blueprint to interpret ideas regardless of language, and sometimes work backward from a desired outcome rather than simply building up from the facts.

The work, which draws inspiration from neuroscience techniques used to study biological brains, represents a significant advance in AI interpretability. The approach could allow researchers to audit these systems for safety issues that might remain hidden during conventional external testing.

“We’ve created these AI systems with remarkable capabilities, but because of how they were trained, we haven’t understood how those capabilities actually emerged,” said Joshua Batson, a researcher at Anthropic, in an exclusive interview with VentureBeat. “Inside the model, it’s just a bunch of numbers – the weights of the artificial neural network.”

New techniques illuminate AI’s previously hidden decision-making process

Large language models such as OpenAI’s GPT-4o, Anthropic’s Claude, and Google’s Gemini have demonstrated remarkable capabilities, from writing code to synthesizing research papers. But these systems have largely operated as “black boxes” – even their creators often do not understand exactly how they arrive at particular responses.

Anthropic’s new interpretability techniques, which the company calls “circuit tracing” and “attribution graphs,” allow researchers to map the specific pathways of neuron-like features that activate when models perform tasks. The approach borrows concepts from neuroscience, treating AI models as analogous to biological systems.
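
To make the intuition concrete, here is a minimal toy sketch in Python, not Anthropic’s actual tooling: it treats a tiny linear “model” as a handful of named features and estimates each feature’s contribution to an output by ablating it and measuring the change, the same basic logic that circuit tracing applies at far larger scale. The feature names and weights are invented for illustration.

```python
import numpy as np

# Hypothetical named features and their activations on one prompt.
feature_names = ["say_capital", "state_texas", "city_austin", "unrelated"]
activations = np.array([0.9, 1.2, 1.5, 0.3])
# Hypothetical readout weights mapping features to the answer's logit.
readout = np.array([0.5, 0.8, 1.5, 0.01])

def output(acts: np.ndarray) -> float:
    """Toy 'logit' for the answer token: a weighted sum of feature activations."""
    return float(acts @ readout)

baseline = output(activations)
print(f"baseline output: {baseline:.3f}")

# Ablate each feature in turn; a large drop marks a feature the output
# causally depends on, which is the core idea behind attribution.
for i, name in enumerate(feature_names):
    ablated = activations.copy()
    ablated[i] = 0.0
    print(f"attribution of {name:12s}: {baseline - output(ablated):+.3f}")
```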

“This work turns what were almost philosophical questions – ‘Do models think? Do models plan? Are models just regurgitating information?’ – into concrete scientific inquiries about what is literally happening inside these systems,” Batson said.

Claude’s hidden planning: How it plots poem lines and solves geography questions

Among the most striking discoveries was evidence that Claude plans ahead when writing poetry. Asked to compose a rhyming couplet, the model identified potential rhyming words for the end of the next line before it began writing – a level of sophistication that surprised even Anthropic’s researchers.

“This is probably happening all over the place,” Batson said. “If you had asked me before this research, I would have guessed the model is thinking ahead in various contexts. But this example provides the most compelling evidence we’ve seen of that capability.”

For example, when writing a couplet ending with the word “rabbit,” the model activates features representing that word at the start of the line, then constructs the sentence so that it arrives naturally at that conclusion.
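
The mechanism can be pictured with a small hedged sketch, purely illustrative rather than the model’s real machinery: the rhyme word is chosen before any of the line is written, and the line is then composed to land on it.

```python
# Purely illustrative: pick the target end-word first, then write toward it,
# mirroring how features for the rhyme word activate at the start of the line.
def plan_rhyme(candidates: list[str]) -> str:
    # Stand-in for the model "activating" a rhyme-word feature early on.
    return candidates[0]

def write_line(rhyme_word: str) -> str:
    # The rest of the line is built so it ends naturally on the planned word.
    return f"He hopped through the garden, a quick little {rhyme_word}"

target = plan_rhyme(["rabbit", "habit"])
print(write_line(target))
```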

The researchers also found that Claude performs genuine multi-step reasoning. In one test, prompted with “The capital of the state containing Dallas is…”, the model first activates features representing “Texas,” then uses that representation to determine “Austin” as the correct answer. This suggests the model is actually carrying out a chain of reasoning rather than simply regurgitating memorized associations.

By manipulating these internal representations – for example, swapping “Texas” for “California” – researchers could make the model output “Sacramento” instead, confirming the causal relationship.
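
A rough analogue of that intervention, with the model’s internal chain stood in for by a two-step lookup (an assumption for illustration, not Anthropic’s code), shows how patching the intermediate “state” representation flips the final answer:

```python
# Toy stand-in for the model's two internal steps: city -> state -> capital.
CITY_TO_STATE = {"Dallas": "Texas", "Fresno": "California"}
STATE_TO_CAPITAL = {"Texas": "Austin", "California": "Sacramento"}

def answer(city: str, patch_state: str | None = None) -> str:
    """Answer 'the capital of the state containing <city>'."""
    state = CITY_TO_STATE[city]      # step 1: activate a 'state' representation
    if patch_state is not None:
        state = patch_state          # intervention: overwrite the intermediate step
    return STATE_TO_CAPITAL[state]   # step 2: read the capital off that representation

print(answer("Dallas"))                            # -> Austin
print(answer("Dallas", patch_state="California"))  # -> Sacramento, as in the experiment
```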

Beyond translation: Claude’s universal language concept network revealed

Another key discovery concerns how Claude handles multiple languages. Rather than maintaining separate systems for English, French, and Chinese, the model appears to translate concepts into a shared abstract representation before generating responses.

“We find the model uses a mixture of language-specific and abstract, language-independent circuits,” the researchers write in their paper. When asked for the opposite of “small” in different languages, the model uses the same internal features representing “opposites” and “smallness,” regardless of the input language.
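
One way to picture this finding, sketched below with made-up feature names rather than anything from the paper, is to compare which features fire for the same question asked in different languages and measure how much they overlap:

```python
# Hypothetical feature sets for the same question in three languages.
active_features = {
    "en: opposite of 'small'":  {"antonym", "size_small", "say_large", "lang_english"},
    "fr: contraire de 'petit'": {"antonym", "size_small", "say_large", "lang_french"},
    "zh: opposite of 'xiao'":   {"antonym", "size_small", "say_large", "lang_chinese"},
}

def jaccard(a: set, b: set) -> float:
    """Overlap between two feature sets (1.0 means identical)."""
    return len(a & b) / len(a | b)

prompts = list(active_features)
for i in range(len(prompts)):
    for j in range(i + 1, len(prompts)):
        overlap = jaccard(active_features[prompts[i]], active_features[prompts[j]])
        print(f"{prompts[i]} vs {prompts[j]}: overlap = {overlap:.2f}")
```

In this toy, the abstract features are shared across languages and only the language-tag features differ, which is the pattern the quoted finding describes.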

This finding has implications for how models might transfer knowledge learned in one language to others, and suggests that models with larger parameter counts develop more language-agnostic representations.

When AI makes up its answers: Detecting Claude’s mathematical fabrications

Perhaps most consequentially, the research revealed cases where Claude’s reasoning does not match what it claims. When given hard math problems, such as computing cosine values of large numbers, the model sometimes claims to follow a calculation process that is not reflected in its internal activity.

“We are able to distinguish between cases in which the model genuinely performs the steps it says it performs, cases in which it makes up its reasoning without regard for truth, and cases in which it works backward from a clue provided by the human,” the researchers explain.

In one example, when a user suggests an answer to a difficult problem, the model works backward to construct a chain of reasoning that leads to that answer, rather than working forward from first principles.

“We mechanistically distinguish an example of Claude 3.5 Haiku using a faithful chain of thought from examples of unfaithful chains of thought,” the paper states. “In one, the model exhibits ‘bullshitting’… in the other, it exhibits motivated reasoning.”

Inside AI hallucinations: How Claude decides when to answer and when to decline

The research also offers insight into why language models hallucinate – making up information when they do not know the answer. Anthropic found evidence of a “default” circuit that causes Claude to decline to answer questions, a circuit that is inhibited when the model recognizes entities it knows about.

“The model contains ‘default’ circuits that cause it to decline to answer questions,” the researchers explain. “When a question is asked about something the model knows, it activates a set of features that inhibit this default circuit, thereby allowing the model to respond to the question.”

When this mechanism misfires – recognizing an entity but lacking specific knowledge about it – hallucinations can occur. This explains why models may confidently supply incorrect information about well-known figures while declining to answer questions about obscure ones.
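
The dynamic can be summarized in a short hedged sketch, a simplification rather than the actual circuit: a default “decline” signal is suppressed whenever “known entity” features fire, and the hallucination case is the misfire where the entity is recognized but the specific fact is missing.

```python
# Simplified illustration of the default-refusal dynamic; the entity and fact
# tables are invented for the example.
KNOWN_ENTITIES = {"Michael Jordan", "Marie Curie"}
KNOWN_FACTS = {("Michael Jordan", "sport"): "basketball"}

def respond(entity: str, attribute: str) -> str:
    decline_by_default = True
    if entity in KNOWN_ENTITIES:
        decline_by_default = False   # recognition features inhibit the default circuit
    if decline_by_default:
        return "I don't know."
    fact = KNOWN_FACTS.get((entity, attribute))
    if fact is None:
        # Entity recognized but the fact is missing: the hallucination failure mode.
        return "<confident-sounding but made-up answer>"
    return fact

print(respond("Michael Jordan", "sport"))  # known entity, known fact -> correct answer
print(respond("Marie Curie", "sport"))     # known entity, missing fact -> hallucination risk
print(respond("Zyx Qeltran", "sport"))     # unrecognized entity -> default circuit declines
```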

Safety implications: Using circuit tracing to improve AI reliability and trustworthiness

This research represents an important step toward making AI systems more transparent and potentially safer. By understanding how models arrive at their answers, researchers could identify and address problematic reasoning patterns.

“We and others can use these discoveries to make models safer,” the researchers write. “For example, it might be possible to use the techniques described here to monitor AI systems for certain dangerous behaviors – such as deceiving the user – to steer them toward desirable outcomes, or to remove certain dangerous subject matter entirely.”

However, Batson cautions that the current techniques still have significant limitations. They capture only a fraction of the total computation these models perform, and analyzing the results remains labor-intensive.

“Even on short, simple prompts, our method captures only a fraction of the total computation performed by Claude,” the researchers acknowledge.

The future of AI transparency: Challenges and opportunities in model interpretability

Anthropic’s new techniques arrive amid growing concern about AI transparency and safety. As these models become more powerful and more widely deployed, understanding their internal mechanisms becomes increasingly important.

The research also has potential commercial implications. As enterprises increasingly rely on large language models to power applications, understanding when and why these systems might produce incorrect information becomes critical for managing risk.

“Anthropic wants to make models safe in a broad sense, covering everything from mitigating bias to ensuring an AI is acting honestly to preventing misuse – including in scenarios of catastrophic risk,” the researchers say.

While the research represents significant progress, Batson emphasized that it is only the beginning of a much longer journey. “The work has really just begun,” he said. “Understanding the representations the model uses doesn’t tell us how it uses them.”

For now, Anthropic’s circuit tracing offers a preliminary map of previously uncharted territory – much like the early anatomists who sketched the first crude diagrams of the human brain. The full atlas of AI cognition remains to be drawn, but we can now begin to see the outlines of how these systems think.


