
How Good Are AI Agents at Real Research? Inside the Deep Research Bench Report

As large language models (LLMs) advance, so does their promise as capable research assistants. Increasingly, they are asked to do more than answer simple factual questions: they take on "deep research" tasks that involve multi-step reasoning, weighing conflicting information, sourcing data from across the web, and synthesizing it all into a coherent output.

This emerging capability is now marketed under different brand names by the major labs. OpenAI ships it as "Deep Research", Anthropic refers to it as "Extended Thinking", Google Gemini offers "Search + Pro" features, and Perplexity labels its versions "Pro Search" or "Deep Research". But how effective are these offerings in practice? A new report from FutureSearch, titled Deep Research Bench (DRB): Evaluating Web Research Agents, provides the most rigorous evaluation to date, and the results reveal both impressive capabilities and critical shortcomings.

What Is Deep Research Bench?

Deep Research Bench, created by the team at FutureSearch, is a meticulously constructed benchmark designed to assess how well AI agents perform on multi-step research tasks. These are not simple questions with clear answers; they mirror the messy, open-ended challenges that analysts, policymakers, and researchers face in real-world settings.

The benchmark includes 89 distinct tasks across 8 categories, such as:

  • Find Number: e.g. "How many FDA Class II medical device recalls occurred?"
  • Validate Claim: e.g. "Is ChatGPT 10x more energy-intensive than Google Search?"
  • Compile Dataset: e.g. "Job trends for US software developers from 2019 to 2023"

Each task type is carefully constructed with human-verified answers and evaluated against a frozen dataset of scraped web pages, known as RetroSearch. This guarantees consistency across model evaluations and avoids the fluctuating state of the live web.
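
To make that setup concrete, here is a minimal sketch of how a benchmark task of this kind could be represented and graded in Python. The field names, the placeholder values, and the exact-match scorer are illustrative assumptions, not FutureSearch's actual implementation, which uses more nuanced grading.

```python
from dataclasses import dataclass

@dataclass
class ResearchTask:
    category: str          # e.g. "Find Number", "Validate Claim", "Compile Dataset"
    prompt: str            # the open-ended research question posed to the agent
    reference_answer: str  # the human-verified answer the agent is graded against

def score_answer(task: ResearchTask, agent_answer: str) -> float:
    """Toy scorer: full credit for an exact (case-insensitive) match, none otherwise.
    The real benchmark grades answers with far more nuance than this."""
    return 1.0 if agent_answer.strip().lower() == task.reference_answer.strip().lower() else 0.0

example = ResearchTask(
    category="Validate Claim",
    prompt="Is ChatGPT 10x more energy-intensive than Google Search?",
    reference_answer="(human-verified verdict goes here)",
)
print(score_answer(example, "(human-verified verdict goes here)"))  # prints 1.0
```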

The Agent Architecture: ReAct and RetroSearch

At the heart of Deep Research Bench lies the ReAct architecture, short for "Reason + Act". This method mimics how a human researcher might tackle a problem: thinking through the task, taking an action such as running a web search, observing the results, and then deciding whether to iterate or conclude.
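
For readers unfamiliar with the pattern, the sketch below shows the basic shape of a ReAct loop in Python. The llm and web_search functions are placeholder stubs standing in for a real model API and a real search backend; this is a generic illustration of the loop, not the DRB agent harness itself.

```python
def llm(prompt: str) -> str:
    """Stub: in a real agent this would call a language-model API."""
    return "FINAL ANSWER: (placeholder)"

def web_search(query: str) -> str:
    """Stub: in a real agent this would hit a search backend."""
    return f"(top results for: {query})"

def react_agent(question: str, max_steps: int = 5) -> str:
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        # Reason: decide the next step given everything seen so far.
        thought = llm(f"{transcript}Thought: what should I do next?")
        transcript += f"Thought: {thought}\n"
        if thought.startswith("FINAL ANSWER:"):
            # Conclude once the model commits to an answer.
            return thought.removeprefix("FINAL ANSWER:").strip()
        # Act: issue a web search, then observe the result.
        query = llm(f"{transcript}Write one concise web search query.")
        observation = web_search(query)
        transcript += f"Action: search[{query}]\nObservation: {observation}\n"
    # Fall back to a best-effort answer if the step budget runs out.
    return llm(f"{transcript}Give your best final answer.")
```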

While earlier models follow this loop explicitly, newer "thinking" models often streamline the process, folding the reasoning more implicitly into their actions. To ensure consistency across evaluations, DRB introduces RetroSearch, a custom-built, static version of the web. Rather than relying on the live internet, which changes constantly, agents query a curated archive of web pages scraped with tools such as Serper, Playwright, and ScraperAPI. The scale is impressive: for high-complexity tasks such as "Gather Evidence", RetroSearch can provide access to more than 189,000 pages, all frozen in time, ensuring a fair and repeatable test environment.
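
Conceptually, swapping the live web for RetroSearch means the agent's search tool reads from a fixed snapshot instead of the open internet. The sketch below illustrates that idea with an assumed in-memory index; the data structure and the keyword-matching logic are my own illustration, not RetroSearch's actual design.

```python
# A tiny stand-in for a frozen web archive: every agent queries the same
# snapshot, so results never drift between evaluation runs.
FROZEN_ARCHIVE: dict[str, list[dict]] = {
    "fda class ii medical device recalls": [
        {"url": "https://example.org/recalls", "snapshot": "(page text frozen at scrape time)"},
    ],
    # ...in DRB's case, up to 189,000+ pages for the hardest task types
}

def retro_search(query: str) -> list[dict]:
    """Return archived pages whose index key shares terms with the query.
    Because the archive is static, the same query always yields the same pages."""
    terms = set(query.lower().split())
    return [
        page
        for key, pages in FROZEN_ARCHIVE.items()
        if terms & set(key.split())
        for page in pages
    ]

print(retro_search("FDA Class II recalls"))
```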

Which AI Agents Perform Best?

Among all the contenders, OpenAI's o3 emerged as the top performer, scoring 0.51 out of a possible 1.0 on Deep Research Bench. While that may sound modest, it is important to understand the benchmark's difficulty: because of ambiguity in task definitions and scoring, even a flawless agent would likely top out at around 0.8, what the researchers call the "noise ceiling". In other words, even today's best models still fall short of well-informed human researchers.

Still, the leaderboard offers revealing insights. o3 not only led the pack but did so with speed and consistency, performing strongly across nearly every task type. Anthropic's Claude 3.7 Sonnet followed closely, showing versatility in both its "thinking" and "non-thinking" modes. Gemini 2.5 Pro, Google's flagship model, stood out for its ability to handle tasks that require structured planning and step-by-step reasoning. Meanwhile, the open-weight DeepSeek-R1 delivered a pleasant surprise, keeping pace with GPT-4 Turbo and narrowing the performance gap between open and closed models.

Across the board, a clear pattern emerged: newer "thinking-enabled" models consistently outperformed their earlier counterparts, and closed-source models maintained a noticeable edge over open-weight alternatives.

Where Do Agents Struggle?

Reading through the failure patterns in the Deep Research Bench report, I was struck by how familiar they felt. One of the most frustrating things I have run into personally, especially during long research or content-creation sessions, is when an AI agent simply forgets what we were doing. As the context window stretches, the model often starts to lose the thread: key details fade, goals get muddled, and suddenly the responses feel disjointed and aimless. At some point, I learned that it is often better to cut my losses and start from scratch, even if that means throwing away everything generated so far.

That kind of forgetfulness is not just anecdotal; it is the single strongest predictor of failure in the Deep Research Bench evaluation. But it is not the only recurring issue. The report also highlights how some models fall into repetitive tool use, running the same search again and again as if stuck in a loop. Others craft poor queries, matching keywords lazily rather than thinking about how to search effectively. And agents frequently fall victim to premature conclusions, settling for a half-formed answer that technically ticks the box but falls short of real insight.
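
As a concrete illustration of the repetitive-tool-use failure mode, here is one simple way an evaluation harness could flag it: normalize each search query and count exact repeats. This check is my own illustrative sketch, not something the DRB report prescribes.

```python
from collections import Counter

class SearchLog:
    """Tracks the search queries an agent issues during a single task."""

    def __init__(self) -> None:
        self.queries: Counter[str] = Counter()

    def record(self, query: str) -> bool:
        """Record a query and return True if it repeats an earlier one."""
        normalized = " ".join(query.lower().split())
        self.queries[normalized] += 1
        return self.queries[normalized] > 1

log = SearchLog()
print(log.record("chatgpt energy use vs google search"))   # False: first occurrence
print(log.record("ChatGPT  energy use vs Google Search"))  # True: the agent is looping
```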

Even among the top models, the differences are stark. GPT-4 Turbo, for example, showed a notable tendency to forget earlier steps, while DeepSeek-R1 was more likely to hallucinate, inventing plausible-sounding but incorrect information. Across the board, models frequently failed to cross-check sources or validate findings before finalizing their output. For anyone who relies on AI for serious work, these issues will feel painfully familiar, and they underscore how far we still have to go in building agents that can truly think and research like humans.

What About Memory-Based Performance?

Interestingly, the deep search seat has also evaluated what the “Tooless” agents – language models work without any access to external tools, such as searching on the web or recovering documents. These agents are entirely dependent on their internal training data and their memory, which generates answers that depend only on what they previously learned during training. In practice, this means that they cannot search for anything or verify information – it guesses based on what they remember. “

Surprisingly, these toolless agents performed almost as well as full research agents on certain tasks. On the Validate Claim task, for example, where the goal is to assess whether a statement is plausible, they scored 0.61, nearly matching the 0.62 average of the tool-enabled agents. This suggests that models like o3 and Claude have strong internal priors and can often judge the truthfulness of common claims without needing to search the web.

But on more demanding tasks, such as Derive Number, which requires piecing together multiple values from different sources, or Gather Evidence, which depends on finding and weighing diverse facts in context, the toolless models collapsed completely. Without fresh information or real search capabilities, they simply lacked the means to produce accurate or comprehensive answers.

The contrast highlights an important nuance: while today's LLMs can simulate "knowing" a great deal, deep research depends not only on recall but on reasoning over up-to-date, verifiable information, something only tool-equipped agents can provide.

Final Thoughts

The DRB report makes one thing clear: while today's best AI agents can outperform average humans on narrowly defined tasks, they still lag behind skilled generalist researchers, especially when it comes to planning strategically, adapting mid-process, and reasoning with nuance.

This gap becomes especially apparent during long or complex sessions, something I have experienced firsthand as an agent gradually loses sight of the task's purpose, leading to a frustrating breakdown in coherence and usefulness.

What makes Deep Research Bench so valuable is that it does not just test surface-level knowledge. It probes the intersection of tool use, memory, reasoning, and adaptation, offering a closer analog to real-world research than benchmarks like MMLU or GSM8K.

As LLMs continue to integrate into serious knowledge work, tools like DRB will be essential for assessing not just what these systems know, but how well they actually work.
