
Benchmarking AI Models in Software Engineering: A Review, Search Tool, and Unified Approach for Elevating Benchmark Quality

By Roham Kohistani and 3 other authors

Abstract: Benchmarks are essential for standardized evaluation and reproducibility. The rapid rise of artificial intelligence for software engineering (AI4SE) has produced many benchmarks for tasks such as code generation and bug fixing. However, this proliferation has led to significant challenges: (1) fragmented knowledge across tasks, (2) difficulty in selecting contextually relevant benchmarks, (3) lack of standardization in benchmark construction, and (4) limitations that reduce their utility. Addressing these challenges requires a dual approach: systematically mapping existing benchmarks to support informed selection, and defining unified guidelines for developing robust and adaptable benchmarks.

We conducted a review of 247 studies, identifying 273 AI4SE benchmarks published since 2014. We categorize these benchmarks, analyze their limitations, and reveal gaps in current practice. Building on these findings, we present BenchScout, an extensible semantic search tool for identifying relevant benchmarks. BenchScout applies automated clustering to contextual embeddings of benchmark-related studies, followed by dimensionality reduction. In a user study with 22 participants, BenchScout achieved average ease-of-use, effectiveness, and intuitiveness scores of 4.5, 4.0, and 4.1 out of 5.
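For intuition, a BenchScout-style search over benchmark studies can be sketched in a few lines: embed study descriptions, cluster the embeddings, project them to two dimensions, and rank them against a free-text query. The model name, PCA, and k-means below are illustrative assumptions, not the tool's documented implementation.

# Minimal sketch of a semantic search over benchmark-related studies.
# Library and model choices here are assumptions for illustration only.
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

studies = [
    "HumanEval: hand-written Python problems for evaluating code generation",
    "Defects4J: a database of real bugs for evaluating automated program repair",
    "MBPP: mostly basic Python programming problems with test cases",
]

model = SentenceTransformer("all-MiniLM-L6-v2")        # contextual embeddings
embeddings = model.encode(studies)                     # shape: (n_studies, dim)

coords = PCA(n_components=2).fit_transform(embeddings)        # 2-D map for browsing
clusters = KMeans(n_clusters=2, n_init=10).fit_predict(embeddings)

query = model.encode(["benchmark for automated bug fixing"])
scores = cosine_similarity(query, embeddings)[0]       # rank studies by relevance
best = max(range(len(studies)), key=lambda i: scores[i])
print(studies[best], "| cluster:", clusters[best], "| 2-D position:", coords[best])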

To improve benchmarking further, we propose BenchFrame, a unified framework for enhancing benchmark quality. Applying BenchFrame to HumanEval produced HumanEvalNext, which features corrected errors, improved language translations, higher test coverage, and greater difficulty. Evaluating 10 recent code models on HumanEval, HumanEvalPlus, and HumanEvalNext revealed average pass@1 drops on HumanEvalNext of 31.22% and 19.94% relative to HumanEval and HumanEvalPlus, respectively, underscoring the need to continually improve benchmarks. We also examine the scalability of BenchFrame through a proxy pipeline and confirm its generalizability on the MBPP dataset. All review data, user study materials, and enhanced benchmarks are publicly available.
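For readers unfamiliar with the metric, the reported pass@1 figures follow the convention of sampling n solutions per problem, counting the c correct ones, and estimating pass@k as 1 - C(n-c, k)/C(n, k). The snippet below is a minimal sketch of that standard estimator (popularized with HumanEval); it is not taken from the paper's evaluation harness.

import math

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased estimator of pass@k for one problem:
    # n samples drawn, c of them correct.
    # pass@k = 1 - C(n - c, k) / C(n, k), in a numerically stable product form.
    if n - c < k:
        return 1.0
    return 1.0 - math.prod(1.0 - k / i for i in range(n - c + 1, n + 1))

# Example: 20 samples per problem, 8 passing -> estimated pass@1 of 0.4
print(pass_at_k(n=20, c=8, k=1))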

Submission history

From: Roham Kohistani [view email]

[v1] Friday, 7 March 2025, 18:44:32 UTC (1,018 KB)
[v2] Tuesday, 28 October 2025, 10:40:52 UTC (604 KB)
[v3] Friday, 12 December 2025, 09:37:15 UTC (605 KB)
