Augment Code Released Augment SWE-bench Verified Agent: An Open-Source Agent Combining Claude Sonnet 3.7 and OpenAI O1 to Excel in Complex Software Engineering Tasks


AI agents are increasingly vital in helping engineers handle complex coding tasks efficiently. However, one significant challenge has been evaluating these agents accurately and ensuring that they can handle real-world coding scenarios beyond simplified benchmark tests.

Augment Code has announced the launch of its Augment SWE-bench Verified Agent, a development in agentic AI tailored specifically for software engineering. This release places the company at the top of open-source agent performance on the SWE-bench leaderboard. By combining the strengths of Anthropic's Claude Sonnet 3.7 and OpenAI's O1 model, Augment Code's approach has delivered impressive results, showing a compelling blend of innovation and pragmatism.

The SWE-bench benchmark is a rigorous test that measures an AI agent's effectiveness in handling practical software engineering tasks drawn directly from GitHub issues in prominent open-source repositories. Unlike traditional coding benchmarks, which generally focus on isolated, algorithm-style problems, SWE-bench offers a more realistic testbed that requires agents to navigate existing codebases, identify relevant tests autonomously, create scripts, and iterate against comprehensive regression test suites.
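To make the pass criterion concrete, the sketch below shows one simplified way to decide whether a task counts as solved: the tests tied to the issue must pass after the patch, and the pre-existing regression tests must still pass. The `TaskInstance` class, its field names, and the example identifiers are illustrative assumptions, not the benchmark's actual evaluation harness.

```python
from dataclasses import dataclass


@dataclass
class TaskInstance:
    """Simplified stand-in for one benchmark task (field names are illustrative)."""
    instance_id: str
    fail_to_pass: list[str]  # tests that should start passing once the issue is fixed
    pass_to_pass: list[str]  # regression tests that must keep passing


def is_resolved(task: TaskInstance, test_results: dict[str, bool]) -> bool:
    """Return True only if the patch fixes the issue without breaking anything.

    test_results maps a test name to True (passed) or False (failed); a test
    missing from the results is treated as a failure.
    """
    fixed = all(test_results.get(t, False) for t in task.fail_to_pass)
    no_regressions = all(test_results.get(t, False) for t in task.pass_to_pass)
    return fixed and no_regressions


task = TaskInstance(
    instance_id="example__repo-1",  # hypothetical identifier
    fail_to_pass=["test_bug_is_fixed"],
    pass_to_pass=["test_existing_a", "test_existing_b"],
)

# The patch fixes the bug but breaks an existing test, so it does not count.
results = {"test_bug_is_fixed": True, "test_existing_a": True, "test_existing_b": False}
print(is_resolved(task, results))
```

A patch that fixes the reported bug but breaks an existing test is rejected, which is why agents need to iterate against the regression suite rather than only the failing test.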

Augment Code's initial submission achieved a 65.4% success rate, a notable result in this demanding environment. The company focused its first effort on leveraging existing state-of-the-art models, specifically Anthropic's Claude Sonnet 3.7 as the primary driver for task execution and OpenAI's O1 model for ensembling. This approach deliberately bypassed training proprietary models at this initial stage, establishing a strong baseline.
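As a rough sense of scale, the reported rate can be converted into a task count. The snippet assumes the SWE-bench Verified subset contains 500 tasks; that figure is not stated in the article and is labeled as an assumption in the code.

```python
TOTAL_TASKS = 500     # assumption: size of the SWE-bench Verified subset
SUCCESS_RATE = 0.654  # success rate reported in the article

# Number of tasks that would have to be resolved to reach the reported rate.
resolved = round(TOTAL_TASKS * SUCCESS_RATE)
print(resolved)
print(f"{resolved / TOTAL_TASKS:.1%}")
```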

One interesting aspect of Augment's methodology is its exploration of different agent behaviors and strategies. For example, the team found that some techniques expected to help, such as Claude Sonnet's "thinking mode" and separate regression-fixing agents, did not produce meaningful performance improvements. This highlights the nuanced and sometimes counterintuitive dynamics of optimizing agent performance. Basic ensembling techniques such as majority voting were also explored but ultimately abandoned because of cost and efficiency considerations. However, simple ensembling with OpenAI's O1 did yield incremental improvements in accuracy, underscoring the value of ensembling even in constrained scenarios.
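The two ensembling styles mentioned above can be contrasted with a minimal sketch. The article does not describe how Augment implemented either one, so this is a generic illustration: `majority_vote` picks the most frequent candidate patch, while `judge_select` asks a second model to choose among candidates. The `judge` parameter is a hypothetical callable standing in for a call to a model such as O1; no real model API is used here.

```python
from collections import Counter
from typing import Callable, Sequence


def majority_vote(candidates: Sequence[str]) -> str:
    """Pick the candidate patch that appears most often (ties go to the earliest)."""
    normalized = [c.strip() for c in candidates]
    return Counter(normalized).most_common(1)[0][0]


def judge_select(issue: str, candidates: Sequence[str], judge: Callable[[str], str]) -> str:
    """Ask a second model to choose among candidates; fall back to the first one."""
    numbered = "\n\n".join(f"[{i}]\n{c}" for i, c in enumerate(candidates))
    prompt = (
        f"Issue:\n{issue}\n\nCandidate patches:\n{numbered}\n\n"
        "Reply with the index of the best patch."
    )
    reply = judge(prompt).strip()
    # Accept the reply only if it is a valid index into the candidate list.
    if reply.isdecimal() and int(reply) < len(candidates):
        return candidates[int(reply)]
    return candidates[0]


patches = ["fix A", "fix B", "fix A"]
print(majority_vote(patches))

# Stub judge that always answers "1"; a real judge would be a model call.
print(judge_select("Example issue text", patches, judge=lambda prompt: "1"))
```

Majority voting needs several full agent runs per task before it can vote, which is consistent with the cost concern raised above; a judge adds one extra model call on top of however many candidates are generated.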

While the initial SWE-bench success of Augment Code is commendable, the company is transparent about the benchmark's limitations. Notably, SWE-bench problems are heavily skewed toward bug fixing rather than feature creation, the provided descriptions are more structured and LLM-friendly than typical real-world developer prompts, and the benchmark uses only Python. Real-world complexities, such as navigating massive production codebases and dealing with less descriptive programming languages, pose challenges that SWE-bench does not capture.

Augment Code has openly acknowledged these limitations while emphasizing its continued commitment to optimizing agent performance beyond benchmark metrics. The company stresses that although improvements to prompts and ensembling can boost quantitative results, customer feedback and real-world usability remain its priorities. The ultimate goal for Augment Code is to develop fast, cost-effective agents that provide unparalleled coding assistance in practical professional environments.

As part of its future roadmap, Augment is actively exploring the fine-tuning of proprietary models using RL techniques and proprietary data. Such advances promise to improve model accuracy and significantly reduce latency and operating costs, making AI-driven coding assistance more accessible and scalable.

Some key takeaways from the Augment SWE-bench Verified Agent include:

Augment Code released the Augment SWE-bench Verified Agent, which ranked first among open-source agents.
The agent combines Anthropic's Claude Sonnet 3.7 as its core driver and OpenAI's O1 model for ensembling.
It achieved a 65.4% success rate on SWE-bench, highlighting strong baseline capabilities.
The team found counterintuitive results: features expected to help, such as "thinking mode" and separate regression-fixing agents, offered no substantial performance gains.
Cost-effectiveness was identified as a critical barrier to implementing extensive ensembling in real-world scenarios.
The company acknowledged the benchmark's limitations, including its bias toward Python and smaller-scale bug-fixing tasks.
Future improvements will focus on reducing cost, lowering latency, and improving usability through reinforcement learning and fine-tuned proprietary models.
The release highlighted the importance of balancing benchmark-driven improvements with qualitative, user-centric enhancements.

Check out the GitHub page. All credit for this research goes to the researchers of this project.


Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of artificial intelligence for social good. His most recent endeavor is the launch of an artificial intelligence media platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform has more than 2 million monthly views, which shows its popularity among readers.