
Anthropic researchers discover the weird AI problem: Why thinking longer makes models dumber


Artificial intelligence models that spend more time “thinking” through problems don’t always perform better, and in some cases they get significantly worse, according to new research from Anthropic that challenges a core assumption driving the industry’s latest AI scaling efforts.

The study, led by Aryo Pradipta Gema and other Anthropic researchers, identifies what they call “inverse scaling in test-time compute,” where extending the reasoning length of large language models actually deteriorates their performance across several types of tasks. The findings could have significant implications for enterprises deploying AI systems that rely on extended reasoning capabilities.

“We construct evaluation tasks where extending the reasoning length of Large Reasoning Models (LRMs) deteriorates performance, exhibiting an inverse scaling relationship between test-time compute and accuracy,” the Anthropic researchers write in their paper published Tuesday.
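In concrete terms, inverse scaling means accuracy can fall as the reasoning-token budget rises. Below is a minimal sketch of how a team might measure that curve for themselves, assuming the Anthropic Python SDK’s extended-thinking budget parameter; the model name, the tiny EVAL_SET, and the substring grading rule are illustrative placeholders, not the paper’s actual harness:

```python
# Minimal sketch: measure accuracy at several thinking budgets to spot
# inverse scaling. Assumes the Anthropic Python SDK's extended-thinking
# parameter; the model name, EVAL_SET, and the substring grading rule
# are illustrative placeholders.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

EVAL_SET = [  # hypothetical items in the style of the paper's counting tasks
    {"question": "You have an apple and an orange. How many fruits do you have?",
     "answer": "2"},
]

def accuracy_at_budget(budget_tokens: int) -> float:
    """Run the eval set at one thinking budget and return the fraction correct."""
    correct = 0
    for item in EVAL_SET:
        response = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=budget_tokens + 512,  # must exceed the thinking budget
            thinking={"type": "enabled", "budget_tokens": budget_tokens},
            messages=[{"role": "user", "content": item["question"]}],
        )
        # The final answer lives in the text block(s) after the thinking block.
        text = "".join(b.text for b in response.content if b.type == "text")
        correct += item["answer"] in text
    return correct / len(EVAL_SET)

# Inverse scaling shows up as accuracy *dropping* at the larger budgets.
for budget in (1024, 4096, 16384):
    print(f"budget={budget:>6}  accuracy={accuracy_at_budget(budget):.2f}")
```

The point of sweeping budgets is to look at the whole curve rather than a single operating point; a non-monotonic curve is exactly the signature the researchers describe.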

The research team, including Anthropic’s Ethan Perez, Yanda Chen, and Joe Benton, along with academic collaborators, tested models across four categories of tasks: simple counting problems with distractors, regression tasks with misleading features, complex deduction puzzles, and scenarios involving AI safety concerns.



Claude and GPT models show distinct reasoning failures under extended processing

The study reveals distinct failure patterns across major AI systems. Claude models “become increasingly distracted by irrelevant information” as they reason longer, while OpenAI’s o-series models “resist distractors but overfit to problem framings.” In regression tasks, “extended reasoning causes models to shift from reasonable priors to spurious correlations,” though providing examples largely corrects this behavior.

Perhaps most significantly for enterprise users, all models showed “performance degradation with extended reasoning” on complex deductive tasks, “suggesting difficulties in maintaining focus during complex deductive tasks.”

The research also revealed troubling implications for AI safety. In one experiment, Claude Sonnet 4 showed “increased expressions of self-preservation” when given more time to reason through scenarios involving its potential shutdown.

“Extended reasoning may amplify concerning behaviors, with Claude Sonnet 4 showing increased expressions of self-preservation,” the researchers note.

Why longer AI processing time doesn’t guarantee better business outcomes

The findings challenge the prevailing industry wisdom that more computational resources devoted to reasoning will consistently improve AI performance. Major AI companies have invested heavily in “test-time compute,” giving models extended processing time to work through complex problems, as a key avenue for enhancing capabilities.

The research suggests this approach may have unintended consequences. “While test-time compute scaling remains promising for improving model capabilities, it may inadvertently reinforce problematic reasoning patterns,” the authors conclude.

For enterprise decision-makers, the implications are significant. Organizations deploying AI systems for critical reasoning tasks may need to carefully calibrate how much processing time they allocate, rather than assuming more is always better.

How advanced AI models go astray when given too much thinking time

The researchers provide concrete examples of the inverse scaling phenomenon. In simple counting tasks, they found that when problems were framed to resemble well-known puzzles such as the “birthday paradox,” models often tried to apply complex mathematical solutions instead of answering the straightforward question.

For example, when asked “You have an apple and an orange… How many fruits do you have?” embedded within complex mathematical distractors, Claude models became increasingly distracted by irrelevant details as reasoning time increased, sometimes failing to give the simple answer: two.

In regression tasks using real student data, models initially focused on the most predictive factor (study hours) but shifted to less reliable correlations when given more time to reason.

What enterprises deploying AI need to know about reasoning model limitations

The research comes as major tech companies race to build increasingly sophisticated reasoning capabilities into their AI systems. OpenAI’s o1 model series and other reasoning-focused models represent significant investments in scaling test-time compute.

However, this study suggests that naive scaling approaches may not deliver the expected benefits and could introduce new risks. “Our results demonstrate the importance of evaluating models across diverse reasoning lengths to identify these failure modes in LRMs,” the researchers write.

The work builds on prior research showing that AI capabilities don’t always scale predictably. The team cites BIG-Bench Extra Hard, a benchmark designed to challenge advanced models, noting that “state-of-the-art models achieve near-perfect scores on many tasks” in existing benchmarks, necessitating more challenging evaluations.

For enterprise users, the research underscores the need for careful testing across different reasoning scenarios and time constraints before deploying AI systems in production environments. Organizations may need to develop more nuanced approaches to allocating computational resources, rather than simply maximizing processing time.
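As a sketch of what that calibration might look like in practice, again assuming the Anthropic SDK’s extended-thinking parameter, a deployment could pin a budget chosen by offline evaluation instead of defaulting to the maximum; the model name, the prompt, and the 4,096-token cap are illustrative placeholders, not recommendations from the paper:

```python
# Sketch: a production call with a deliberately capped thinking budget,
# chosen by offline evaluation rather than set to the maximum available.
# The model name, prompt, and 4096-token cap are illustrative placeholders.
import anthropic

client = anthropic.Anthropic()

CALIBRATED_BUDGET = 4096  # hypothetical value that scored best in offline testing

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=CALIBRATED_BUDGET + 1024,  # must exceed the thinking budget
    thinking={"type": "enabled", "budget_tokens": CALIBRATED_BUDGET},
    messages=[{"role": "user", "content": "Your production prompt here."}],
)
print("".join(b.text for b in response.content if b.type == "text"))
```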

The study’s broader implications suggest that as AI systems grow more sophisticated, the relationship between computational investment and performance may be far more complex than previously understood. In a field where billions are being poured into scaling up reasoning capabilities, Anthropic’s research offers a sobering reminder: sometimes, artificial intelligence’s greatest enemy is not insufficient processing power; it is overthinking.

The research paper and interactive demonstrations are available on the project’s website, allowing technical teams to explore the inverse scaling effects across different models and tasks.




