Anthropic can now track the bizarre inner workings of a large language model

So what did they find? Anthropic looked at 10 different behaviors in Claude. One involved the use of different languages. Does Claude have a part that speaks French and another part that speaks Chinese, and so on?
The team found that Claude used components that were independent of any language to answer a question or solve a problem, and only then picked a specific language in which to reply. Ask it "What is the opposite of small?" in English, French, or Chinese, and Claude will first use language-neutral components related to "smallness" and "opposites" to come up with an answer, and only afterward choose a specific language for the response. This suggests that large language models can learn things in one language and apply them in other languages.
Anthropic also looked at how Claude solved simple math problems. The team found that the model seems to have developed its own internal strategies, unlike anything it would have seen in its training data. Ask Claude to add 36 and 59 and the model runs through a series of odd steps, including first adding a selection of approximate values (add 40ish and 60ish, add 57ish and 36ish). Toward the end of its process, it comes up with the value 92ish. Meanwhile, another sequence of steps focuses on the last digits, 6 and 9, and determines that the answer must end in a 5. Putting that together with 92ish gives the correct answer of 95.
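To make that two-track description concrete, here is a toy Python sketch of the combination step described above: one pathway produces a rough estimate of the total, another works out only the final digit, and the answer is the number consistent with both. This is purely illustrative, with function names and the crude estimation rule invented here; it is not Anthropic's analysis or Claude's actual computation.

```python
def rough_sum(a: int, b: int) -> float:
    """A deliberately imprecise total, standing in for the '40ish + 60ish'
    pathway described above (this blend is our own stand-in)."""
    return (round(a, -1) + round(b, -1) + (a + b)) / 2

def last_digit_of_sum(a: int, b: int) -> int:
    """The second pathway: only the ones digits matter (6 + 9 ends in 5)."""
    return (a % 10 + b % 10) % 10

def combine(estimate: float, digit: int) -> int:
    """Pick the number ending in `digit` that lies closest to the estimate."""
    base = int(estimate) // 10 * 10
    candidates = (base - 10 + digit, base + digit, base + 10 + digit)
    return min(candidates, key=lambda c: abs(c - estimate))

print(combine(rough_sum(36, 59), last_digit_of_sum(36, 59)))  # 95
```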
And yet if you then ask Claude how it worked that out, it will say something like: "I added the ones (6+9=15), carried the 1, then added the 10s (3+5+1=9), resulting in 95." In other words, it gives you the common approach found everywhere online rather than what it actually did. Yep! LLMs are weird. (And not to be trusted.)
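For comparison, the carry-the-one procedure Claude describes in its self-report is the familiar grade-school method, sketched below with a function name of our own choosing:

```python
def add_with_carry(a: int, b: int) -> int:
    """Textbook column addition for two-digit numbers, as in Claude's
    explanation: add the ones, carry the 1, then add the tens."""
    ones = (a % 10) + (b % 10)            # 6 + 9 = 15
    carry, ones_digit = divmod(ones, 10)  # carry 1, keep 5
    tens = (a // 10) + (b // 10) + carry  # 3 + 5 + 1 = 9
    return tens * 10 + ones_digit         # 95

print(add_with_carry(36, 59))  # 95
```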
This is clear evidence that large language models will give reasons for what they do that do not necessarily reflect what they actually did. But the same is true of people, says Batson: "You ask somebody, 'Why did you do that?' And they're like, 'Um, I guess it's because I was...' You know, maybe not."
Biran thinks this finding is especially interesting. Many researchers study the behavior of large language models by asking them to explain their actions. But that could be a risky approach, he says: "As models keep getting stronger, they must be equipped with better safeguards. I think this work also shows that relying only on the models' outputs is not enough."
The third task that Anthropic studied was writing poems. The researchers wanted to know whether the model really does just wing it, predicting one word at a time. Instead, they found that Claude somehow looked ahead, picking the word at the end of the next line several words in advance.
For example, when Claude was given the prompt "A rhyming couplet: He saw a carrot and had to grab it," the model responded, "His hunger was like a starving rabbit." But using their microscope, the researchers saw that Claude had already hit upon the word "rabbit" while it was still processing "grab it." It then seemed to write the next line with that ending already in place.