IT Home news: AI models from OpenAI, Anthropic, and other top artificial intelligence labs are increasingly used to assist with programming tasks. Google CEO Sundar Pichai revealed last October that 25% of the company's new code is generated by AI, and Meta CEO Mark Zuckerberg has expressed ambitions to deploy AI coding models widely within the company.
However, even some of today's most advanced AI models are no match for experienced developers when it comes to fixing software bugs. A new study from Microsoft Research, the company's R&D division, shows that several models, including Anthropic's Claude 3.7 Sonnet and OpenAI's o3-mini, failed to successfully debug many issues in a software development benchmark called SWE-bench Lite.
The study's co-authors tested nine different models as the backbone of a "single prompt-based agent" that had access to a range of debugging tools, including a Python debugger. They assigned the agent a curated set of 300 software debugging tasks drawn from SWE-bench Lite.
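To make that setup concrete, here is a minimal sketch of what such a single prompt-based debugging agent could look like, assuming a generic tool-calling loop. None of this is the study's actual code: run_tests, run_pdb, debug_agent, and the model callable are all hypothetical names, and the model call stands in for any LLM API.

```python
import subprocess
from typing import Callable


def run_tests(repo_dir: str) -> str:
    """Run the project's test suite and return the (truncated) output."""
    result = subprocess.run(
        ["python", "-m", "pytest", "-x"],
        cwd=repo_dir, capture_output=True, text=True,
    )
    return (result.stdout + result.stderr)[-4000:]


def run_pdb(repo_dir: str, test_file: str, commands: str) -> str:
    """Drive the Python debugger non-interactively with a scripted
    command sequence such as 'b parser.py:88', 'c', 'p tokens', 'q'."""
    result = subprocess.run(
        ["python", "-m", "pdb", test_file],
        cwd=repo_dir, input=commands, capture_output=True, text=True,
    )
    return (result.stdout + result.stderr)[-4000:]


TOOLS = {"run_tests": run_tests, "pdb": run_pdb}


def debug_agent(model: Callable[[str], dict], task: str,
                repo_dir: str, max_turns: int = 20) -> str | None:
    """Loop until the model submits a patch or the turn budget runs out.

    Each turn, the model sees the full transcript and returns either a
    tool call ({"tool": ..., "args": [...]}) or a final patch
    ({"tool": "submit_patch", "patch": ...}).
    """
    transcript = [f"Task: {task}\nAvailable tools: {sorted(TOOLS)}"]
    for _ in range(max_turns):
        action = model("\n\n".join(transcript))  # hypothetical LLM call
        if action["tool"] == "submit_patch":
            return action["patch"]  # candidate bug fix
        observation = TOOLS[action["tool"]](repo_dir, *action.get("args", []))
        transcript.append(f"Action: {action}\nObservation: {observation}")
    return None  # no fix within the budget
```

The key design point in this kind of setup is that the model never edits code blindly: every proposed patch is preceded by tool observations the agent requested itself.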
According to the co-authors, even with the more powerful, state-of-the-art models, their agent rarely succeeded on more than half of the debugging tasks. Among them, Claude 3.7 Sonnet had the highest average success rate at 48.4%, followed by OpenAI's o1 at 30.2% and o3-mini at 22.1%.
Why do these AI models perform so poorly? Some models struggle to use the available debugging tools and to understand how different tools can help solve different problems. However, the co-authors believe the bigger problem is data scarcity: they speculate that the models' current training data contains too little "sequential decision-making" data, that is, traces of humans working through the debugging process.
"We strongly believe that training or fine-tuning these models can make them better interactive debuggers." "However, this requires specialized data to meet the needs of such model training, such as recording the trajectory data of the agent interacting with the debugger to gather the necessary information and then suggest a vulnerability fix." ”
This finding is not surprising. Many studies have shown that code-generating AI often introduces security vulnerabilities and bugs, owing to weaknesses in areas such as understanding programming logic. One recent evaluation of Devin, a popular AI programming tool, found that it could complete only 3 out of 20 programming tests.
Still, Microsoft's research is one of the most detailed dissections to date of how models fare in this persistent problem area. While it may not dampen investor enthusiasm for AI-assisted programming tools, it will hopefully make developers, and their superiors, think twice about handing programming over entirely to AI.
IT Home has noticed that a growing number of tech leaders are pushing back on the idea that AI will replace programming jobs. Microsoft co-founder Bill Gates has said he believes programming as a profession is here to stay, a view shared by Replit CEO Amjad Masad, Okta CEO Todd McKinnon, and IBM CEO Arvind Krishna.