
SWE-Bench Performance Reaches 50.8% Without Tool Use: A Case for Monolithic State-in-Context Agents

Recent developments in language model (LM) agents have shown promising potential for automating complex real-world tasks. These agents typically operate by proposing and executing actions through APIs, supporting applications such as software engineering, robotics, and scientific experimentation. As these tasks grow more complex, LM agent frameworks have evolved to include multiple agents, multi-step retrieval, and tailored scaffolding to improve performance. A central challenge lies in effectively exploring and understanding the environment, which has driven the development of engineered scaffolds using tools, memory mechanisms, and custom pipelines. However, most existing methods assume partial observability, requiring agents to collect observations incrementally. While this assumption holds in dynamic or unfamiliar environments, it is less applicable in fully observable settings like SWE-bench, where all relevant information is accessible from the start.

In software engineering, research on LM agents has focused on two main strategies: agent-based frameworks and structured pipelines. Agent-based systems, such as SWE-Agent and OpenHands CodeAct, allow LMs to interact autonomously with codebases, often through custom interfaces and retrieval tools. Other systems, like Moatless and AutoCodeRover, enhance localization through search techniques, while SpecRover refines scaffold design. Alternatively, structured pipelines, such as Agentless and CodeMonkeys, decompose tasks into sequential phases like localization, repair, and validation. While these approaches depend on engineered components for performance, the current study proposes leveraging long-context LMs (LCLMs) to directly interpret the entire task environment. Advances in LCLM architecture and infrastructure now allow these models to outperform retrieval-augmented systems in many contexts, reducing reliance on complex external scaffolding.

Researchers from Stanford, IBM, and the University of Toronto explored whether complex scaffolding is necessary for LM agents tackling tasks like SWE-bench. They show that simply using LCLMs, such as Gemini-1.5-Pro, with proper prompting and no scaffolding, can achieve competitive performance, reaching 38% on SWE-bench Verified. Gemini-2.5-Pro, using the same simple setup, reaches 50.8%. Their work suggests that many complex agentic designs could be replaced with a single powerful LCLM, simplifying architecture and training. Additionally, a two-stage hybrid approach using Gemini-1.5-Pro and Claude-3.7-Sonnet achieves a 48.6% solve rate, further supporting this simplified direction.

Traditional LM agents rely on interactive exploration because of partial observability, but many tasks, such as software debugging, permit full observability. The study proposes state-in-context agents that leverage LCLMs to process the full, or compressed, environment state directly, bypassing the need for complex scaffolding. For large codebases, a ranking-based compression step selects the most relevant files to fit within context limits. Two methods are introduced: DirectSolve, where an LCLM solves the task using the full context, and SelectSolve, where the LCLM localizes the relevant files and a short-context LM (SCLM) produces the fix. Both use targeted patch formats and validation to ensure accuracy and reduce hallucination.
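To make the two strategies concrete, here is a minimal Python sketch under stated assumptions: the token-overlap ranking heuristic, the character-based context budget, and the call_lclm / call_sclm callables are illustrative stand-ins, not the paper's implementation.

```python
# Hedged sketch of DirectSolve and SelectSolve. `call_lclm` and `call_sclm`
# are assumed to be caller-supplied functions that send a prompt to a
# long-context and a short-context model, respectively.

def rank_files_by_relevance(issue: str, files: dict[str, str]) -> list[str]:
    """Toy compression-style ranking: prefer files sharing tokens with the issue."""
    issue_tokens = set(issue.lower().split())
    def score(path: str) -> int:
        return len(issue_tokens & set(files[path].lower().split()))
    return sorted(files, key=score, reverse=True)

def build_context(issue: str, files: dict[str, str],
                  ordered_paths: list[str], budget_chars: int) -> str:
    """Pack the most relevant files first until the context budget is spent."""
    parts, used = [], 0
    for path in ordered_paths:
        chunk = f"### {path}\n{files[path]}\n"
        if used + len(chunk) > budget_chars:
            break
        parts.append(chunk)
        used += len(chunk)
    return "".join(parts) + f"\n### Issue\n{issue}\n"

def direct_solve(issue, files, call_lclm, budget_chars=2_000_000):
    """DirectSolve: one long-context model sees the (compressed) codebase and patches."""
    ordered = rank_files_by_relevance(issue, files)
    prompt = build_context(issue, files, ordered, budget_chars)
    return call_lclm(prompt + "\nProduce a unified diff that fixes the issue.")

def select_solve(issue, files, call_lclm, call_sclm, k=5):
    """SelectSolve: the LCLM localizes; a stronger short-context model patches."""
    ordered = rank_files_by_relevance(issue, files)
    prompt = build_context(issue, files, ordered, budget_chars=2_000_000)
    listing = call_lclm(prompt + f"\nList the {k} files most relevant to the issue.")
    selected = [p for p in ordered if p in listing][:k]
    small_ctx = build_context(issue, files, selected, budget_chars=200_000)
    return call_sclm(small_ctx + "\nProduce a unified diff that fixes the issue.")
```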

Experiments evaluate this simplified framework on SWE-bench Verified, a benchmark of 500 real-world software engineering tasks. The proposed methods, DirectSolve and SelectSolve, use LCLMs such as Gemini-1.5-Pro and Gemini-2.5-Pro and, in SelectSolve, an additional SCLM (Claude-3.7-Sonnet) for patch generation. Results show that DirectSolve outperforms complex agentic approaches like Agentless and CodeAct with minimal engineering. SelectSolve achieves even higher accuracy by leveraging a stronger model for patching. Ablation studies highlight the importance of chain-of-thought (CoT) prompting, code restatement, and token-efficient context design. Additionally, placing the relevant files at the beginning of the prompt improves performance, underscoring limitations in long-context processing.
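The prompt-design findings from those ablations can be illustrated with a small template; the exact wording below is an assumption for demonstration, not the paper's prompt.

```python
# Hedged illustration of the ablation findings: relevant files are placed
# first, the model is asked to reason step by step (CoT), and to restate the
# code it will change before emitting a patch.

PATCH_PROMPT = """\
{ranked_files}

### Issue
{issue}

### Instructions
1. Reason step by step about which functions the issue implicates.
2. Restate, verbatim, the code lines you intend to change.
3. Output a unified diff touching only those lines.
"""

def make_prompt(ranked_files: str, issue: str) -> str:
    # Relevant files go at the *start* of the prompt: the ablations suggest
    # long-context models use early prompt positions more reliably.
    return PATCH_PROMPT.format(ranked_files=ranked_files, issue=issue)
```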

In conclusion, the cost of using LCLM-based methods is currently higher than existing approaches like Agentless and CodeAct, averaging $2.60 per instance compared to $0.25 and $0.87, respectively. However, rapid drops in inference costs and growing context lengths are making LCLMs more practical. Techniques like KV caching significantly lower costs after the first run, reducing the per-instance cost to about $0.725. Although small codebase changes still limit caching benefits, further improvements could help. The study also suggests that LCLMs can handle long interaction histories, reducing the need for complex memory and retrieval mechanisms. Notably, these unscaffolded LCLM models can perform competitively on SWE-bench tasks.
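As a quick sanity check on those figures, the back-of-envelope arithmetic below uses the per-instance costs quoted above; the comparison logic itself is just illustrative.

```python
# Per-instance costs quoted in the article; the ratio is simple arithmetic.
costs = {"LCLM (first run)": 2.60, "Agentless": 0.25, "CodeAct": 0.87}
lclm_cached = 0.725  # approximate cost once the codebase prefix is KV-cached

for name, c in costs.items():
    print(f"{name:>16}: ${c:.2f} per instance")
print(f"LCLM with KV caching: ${lclm_cached:.3f} "
      f"({costs['LCLM (first run)'] / lclm_cached:.1f}x cheaper than uncached)")
```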


Check out the Paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and join our 90k+ ML SubReddit.


Sana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.
