AI models still struggle to debug software, Microsoft study shows

AI models from OpenAI, Anthropic, and other labs are increasingly being used to assist with programming tasks. Google CEO Sundar Pichai has said that 25% of new code at the company is generated by AI, and Meta CEO Mark Zuckerberg has expressed ambitions to deploy AI coding models widely within the social media giant.
Yet even some of the best models today struggle to resolve software bugs that wouldn't trip up experienced developers.
A new study from Microsoft Research, Microsoft's R&D division, reveals that models, including Anthropic's Claude 3.7 Sonnet and OpenAI's o3-mini, fail to debug many issues in a software development benchmark called SWE-bench Lite. The results are a sobering reminder that, despite bold pronouncements from companies like OpenAI, AI still can't match human experts in domains such as coding.
The study's co-authors tested nine different models as the backbone of a "single prompt-based agent" that had access to a number of debugging tools, including a Python debugger. They tasked this agent with solving a curated set of 300 software debugging tasks from SWE-bench Lite.
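The study doesn't spell out the agent's implementation here, but the basic shape of a "prompt-based agent with a debugging tool" can be sketched in a few lines. The following toy Python sketch is purely illustrative and is not the researchers' code: `fake_model` stands in for the LLM backbone, `buggy_mean` is a made-up buggy function, and the "tool" simply executes the code and reports what happened, the way a debugger session would surface evidence before the model proposes a fix.

```python
def buggy_mean(xs):
    """A deliberately buggy function for the agent to investigate."""
    return sum(xs) / (len(xs) - 1)  # off-by-one bug: should divide by len(xs)

def run_debugger_tool(func, args):
    """Tool: execute the function and report the observed behavior,
    loosely analogous to stepping through it in a debugger."""
    try:
        return f"returned {func(*args)!r}"
    except Exception as e:
        return f"raised {type(e).__name__}: {e}"

def fake_model(prompt, observations):
    """Stand-in for an LLM backbone: picks the next action from context.
    A real agent would send the prompt plus observations to a model API."""
    if not observations:
        return ("use_tool", ([1, 2, 3],))  # first gather evidence
    return ("propose_fix", "len(xs) - 1 should be len(xs)")

def agent(prompt, func, max_steps=5):
    """Single prompt-based agent loop: alternate between invoking the
    debugging tool and proposing a fix, up to a step budget."""
    observations = []
    for _ in range(max_steps):
        action, payload = fake_model(prompt, observations)
        if action == "use_tool":
            observations.append(run_debugger_tool(func, payload))
        else:
            return payload, observations
    return None, observations

fix, obs = agent("Why does buggy_mean give wrong averages?", buggy_mean)
```

The key point the study probes is the loop itself: whether models can decide *when* to call the tool and *how* to interpret its output, a sequential decision-making skill that turns out to be weak in practice.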
According to the co-authors, even when equipped with stronger and more recent models, their agent rarely completed more than half of the debugging tasks successfully. Claude 3.7 Sonnet had the highest average success rate (48.4%), followed by OpenAI's o1 (30.2%) and o3-mini (22.1%).
Why the underwhelming performance? Some models struggled to use the debugging tools available to them and to understand how different tools might help with different problems. The bigger problem, though, was data scarcity, according to the co-authors. They speculate that there isn't enough data representing "sequential decision-making processes", that is, human debugging traces, in current models' training data.
"We strongly believe that training or fine-tuning [models] can make them better interactive debuggers," the co-authors wrote.
The findings aren't exactly shocking. Plenty of studies have shown that code-generating AI tends to introduce security vulnerabilities and errors, owing to weaknesses in areas like the ability to understand programming logic. One recent evaluation of Devin, a popular AI coding tool, found that it could complete only three out of 20 programming tests.
But Microsoft's work is one of the more detailed looks yet at a persistent problem area for models. It likely won't dampen investor enthusiasm for AI-powered coding tools, but with any luck, it will make developers, and their higher-ups, think twice about letting AI run the coding show.
For what it's worth, a growing number of tech leaders have disputed the notion that AI will automate away coding jobs. Microsoft co-founder Bill Gates has said he believes programming as a profession is here to stay. So have Replit CEO Amjad Masad, Okta CEO Todd McKinnon, and IBM CEO Arvind Krishna.
2025-04-10 19:10:00