AI Overthinking: How LLMs Fall into Analysis Paralysis

Recent advances in large language models (LLMs) have drastically improved their ability to reason through answers to prompts. But it turns out that as their ability to reason improves, they increasingly fall victim to a relatable problem: analysis paralysis.
A recent preprint from a large team, with authors from the University of California, Berkeley; ETH Zurich; Carnegie Mellon University; and the University of Illinois Urbana-Champaign, found that LLMs with reasoning are prone to overthinking.
In other words, the model gets stuck in its own head.
What does it mean to overthink?
The paper on overthinking, which has not yet been peer reviewed, defines overthinking as “a phenomenon where models favor extended internal reasoning chains over environmental interaction.”
Alejandro Cuadrón, a research scholar at UC Berkeley and coauthor on the paper, drew an analogy to the very human problem of decision-making without certainty about the results.
“What happens when we really don’t have enough information?” asks Cuadrón. “If you’re asking yourself more and more questions, just talking to yourself…in the best scenario, I’ll realize I need more information. In the worst, I’ll get the wrong results.”
To test how the latest AI models handle this situation, Cuadrón and his colleagues tasked leading reasoning LLMs (also known as large reasoning models, or LRMs), such as OpenAI’s o1 and DeepSeek-R1, with solving problems in a popular software-engineering benchmark. The models had to find bugs and design solutions using the OpenHands agentic platform.
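To make the setup concrete, here is a minimal sketch of the kind of agentic loop such a benchmark involves: at each turn the model either produces more internal reasoning or takes an action in the environment, such as running a test or editing a file. This is not the paper's actual harness or the OpenHands API; the `llm_step` and `environment` interfaces and the scoring proxy are hypothetical stand-ins.

```python
# Hypothetical sketch of an agentic debugging loop (not the paper's harness).
# In each turn the model either (a) keeps reasoning internally or
# (b) interacts with the environment (runs a test, edits a file).
# Overthinking, as the paper defines it, is favoring (a) over (b).

from dataclasses import dataclass

@dataclass
class Turn:
    kind: str      # "reasoning" or "action"
    content: str   # chain-of-thought text, or a command/edit

def solve_issue(llm_step, environment, max_turns: int = 50) -> list[Turn]:
    """Run a toy agent loop and record the trajectory."""
    trajectory: list[Turn] = []
    observation = environment.initial_state()
    for _ in range(max_turns):
        turn = llm_step(observation, trajectory)  # hypothetical model call
        trajectory.append(turn)
        if turn.kind == "action":
            observation = environment.execute(turn.content)
            if environment.is_resolved():
                break
        # If turn.kind == "reasoning", no new information arrives:
        # the model is only talking to itself.
    return trajectory

def reasoning_fraction(trajectory: list[Turn]) -> float:
    """Crude proxy for overthinking: the share of turns spent reasoning
    instead of acting. (The paper scores full trajectories with a judge
    model rather than a simple ratio like this.)"""
    if not trajectory:
        return 0.0
    return sum(t.kind == "reasoning" for t in trajectory) / len(trajectory)
```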
The results? While the best reasoning models performed well overall, reasoning models were found to overthink nearly three times as often as nonreasoning models. And the more a model overthought, the fewer issues it resolved. On average, reasoning models were 7.9 percent less successful per unit increase in overthinking.
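A per-unit figure like that is essentially a regression slope: plot each model's overthinking score against the share of issues it resolved, fit a line, and the slope tells you how much resolution drops for each additional unit of overthinking. The sketch below shows the calculation with invented numbers; the paper's real scores come from its released dataset, and the article's 7.9 percent figure is what such a fit reported for reasoning models.

```python
# Illustrative only: fit resolution rate against an overthinking score.
# The data points are invented for the example; the paper reports a drop
# of roughly 7.9 percent per unit of overthinking for reasoning models.
import numpy as np

overthinking_score = np.array([1.2, 2.5, 3.1, 4.0, 5.6, 6.3])       # hypothetical
resolution_rate = np.array([30.0, 24.0, 21.5, 16.0, 12.0, 9.0])     # percent, hypothetical

slope, intercept = np.polyfit(overthinking_score, resolution_rate, 1)
print(f"resolution changes by {slope:.1f} points per unit of overthinking")
```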
Reasoning models based on LLMs with relatively few parameters, such as Alibaba’s QwQ-32B (which has 32 billion parameters), were especially prone to overthinking. QwQ, DeepSeek-R1 32B, and Sky-T1-R had the highest overthinking scores, and they weren’t any more successful at resolving tasks than nonreasoning models.
Cuadrón says this shows a link between a model’s general level of intelligence and its ability to successfully reason through problems.
“I think model size is one of the key contributors, as model size leads to ‘smartness,’ so to speak,” said Cuadrón. “To avoid overthinking, a model must interact with and understand the environment, and it must understand its output.”
Overthinking is an expensive mistake
AI overthinking is an intriguing problem from a human perspective, as it mirrors a state of mind many of us struggle with. But LLMs are, of course, computer systems, which means overthinking has different consequences.
The most obvious is increased compute costs. Reasoning LLMs essentially prompt themselves to reason through a problem, which in turn generates more tokens and keeps expensive hardware (such as GPUs or tensor processing units) occupied. The more reasoning, the higher the costs.
Cuadrón and his colleagues found that running OpenAI’s o1 with high reasoning effort could cost as much as US $1,400, whereas a lower-reasoning configuration brought the cost down to $800. Despite that gap, the models performed almost identically on the software-engineering benchmark. OpenAI o1-high resolved 29.1 percent of problems, while o1-low resolved 27.3 percent.
The researchers also found that running o1-low multiple times and selecting the best output outperformed o1-high while also proving more cost efficient. The lower cost of the low-reasoning configuration meant this technique saved $200 when compared with o1-high.
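In practice this is a best-of-n strategy: run the cheaper low-effort configuration several times and keep the most promising candidate. Here is a minimal sketch of the idea; `generate_patch` and `score_patch` are hypothetical placeholders for a real model call and a selection heuristic (say, how many regression tests a candidate patch passes), not APIs from the paper.

```python
# Hypothetical sketch of "run the low-effort configuration n times, keep the best".
from typing import Callable, Tuple

def best_of_n(
    issue: str,
    generate_patch: Callable[[str], Tuple[str, float]],  # returns (patch, cost in dollars)
    score_patch: Callable[[str, str], float],            # higher score = better candidate
    n: int = 2,
) -> Tuple[str, float]:
    best_patch, best_score, total_cost = "", float("-inf"), 0.0
    for _ in range(n):
        patch, cost = generate_patch(issue)   # each call uses the cheap, low-effort config
        total_cost += cost
        score = score_patch(issue, patch)
        if score > best_score:
            best_patch, best_score = patch, score
    return best_patch, total_cost
```

Whether the savings hold depends on the per-run cost and how reliable the selection heuristic is; in the researchers' evaluation, the repeated low-effort runs still came in roughly $200 under a single o1-high pass.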
These results suggest there’s plenty of room to optimize reasoning models, and that throwing more reasoning at a problem isn’t always the best solution.
There’s more to think about
Interestingly, the paper found that DeepSeek-R1 671B, unlike the other reasoning models tested, didn’t overthink relative to DeepSeek-V3 671B, the nonreasoning model that R1 is based on. That powered R1 to healthy results. It beat DeepSeek-V3 to reach the third-best success rate of all models tested and scored second best among reasoning models.
Cuadrón speculates that outcome is due to how DeepSeek trained the model. While large-scale reinforcement learning was key to its training, that technique wasn’t used to train the model specifically for software-engineering tasks. “That means that when the model is presented with a software-engineering task it won’t reason as much, and will prefer to interact with the environment more,” he said.
The paper makes a clear argument that LRMs are more efficient when they use only as much reasoning as required to complete a task successfully. But how exactly can a model be trained to use just the right amount of reasoning across a wide variety of tasks?
That remains to be solved. The paper’s coauthors hope they can help the broader research community tackle overthinking in LLMs by making their evaluation framework and dataset open source. The full dataset, along with the methodology used to quantify overthinking, is available on GitHub.