VerifyBench: A Systematic Benchmark for Evaluating Reasoning Verifiers Across Domains

View a PDF of the paper titled VerifyBench: A Systematic Benchmark for Evaluating Reasoning Verifiers Across Domains
Abstract: Large language models (LLMs) increasingly rely on reinforcement learning (RL) to enhance their reasoning capabilities through feedback. A critical challenge is verifying the consistency between model-generated responses and reference answers, since these responses are often long, diverse, and nuanced. Rule-based verifiers struggle with this complexity, prompting the use of model-based verifiers. However, specialized verifiers lack flexibility, while general LLM judges can be inconsistent. Existing research focuses primarily on building better verifiers, yet a systematic cross-domain evaluation of different verifier types is still lacking, severely constraining the reliable development of Reinforcement Learning with Verifiable Reward (RLVR). To address this, we propose VerifyBench, a comprehensive cross-domain benchmark for systematically evaluating verifiers. We construct 4,000 expert-level questions covering mathematics, physics, chemistry, and biology. Each question is equipped with reference answers and diverse responses, and the reliability of the evaluation is ensured through a rigorous annotation process conducted by a multidisciplinary expert team. We design a four-dimensional experimental framework to compare the performance boundaries of specialized verifiers and general LLMs under combined conditions: extracted answers versus complete responses, and short outputs versus long outputs. Our evaluation reveals fundamental trade-offs in verifiers: while specialized verifiers achieve leading accuracy, they exhibit deficiencies in recall; general models show stronger inclusivity but unstable precision. More importantly, we discover verifiers' high sensitivity to input structure and inherent limitations in cross-domain generalization, providing critical insights into the bottlenecks of current verifier technology.
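To make the verification task concrete, the sketch below shows in plain Python how a benchmark of this kind can score a verifier against expert labels on (question, reference answer, response) triples. This is a hypothetical illustration, not the authors' code: the Record fields, rule_based_verifier, and the metrics in evaluate are all assumptions, and the exact-match rule is deliberately brittle to show why rule-based verification struggles with long, varied responses.

```python
# Hypothetical sketch of scoring a verifier against expert labels.
# All names and the rule-based check are illustrative assumptions,
# not the VerifyBench implementation.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Record:
    question: str
    reference: str   # gold reference answer
    response: str    # model-generated response (full text or extracted answer)
    label: bool      # expert judgment: does the response match the reference?

def rule_based_verifier(reference: str, response: str) -> bool:
    """Brittle rule: whitespace-normalized substring match.
    Fails on paraphrases and unit conversions, illustrating why
    rule-based verifiers struggle with diverse responses."""
    norm = lambda s: "".join(s.lower().split())
    return norm(reference) in norm(response)

def evaluate(verifier: Callable[[str, str], bool], data: list[Record]) -> dict:
    """Compare verifier verdicts to expert labels; report accuracy
    plus precision/recall on the 'correct' class."""
    tp = fp = fn = agree = 0
    for r in data:
        verdict = verifier(r.reference, r.response)
        agree += verdict == r.label
        tp += verdict and r.label
        fp += verdict and not r.label
        fn += (not verdict) and r.label
    return {
        "accuracy": agree / len(data),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

if __name__ == "__main__":
    data = [
        Record("2+2?", "4", "The answer is 4.", True),
        Record("Boiling point of water at 1 atm?", "100 °C",
               "It boils at 212 °F, i.e. one hundred degrees Celsius.", True),
        Record("2+2?", "4", "The answer is 5.", False),
    ]
    print(evaluate(rule_based_verifier, data))
```

On this toy data the rule-based verifier scores perfect precision but only 0.5 recall (it rejects the correct Fahrenheit paraphrase), a small-scale analogue of the accuracy-versus-recall trade-off the paper reports for strict verifiers.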
Submission history
From: Xucen Li [view email]
[v1]
Mon, 14 Jul 2025 03:45:24 UTC (5,339 KB)
[v2]
Tue, 15 Jul 2025 05:01:49 UTC (5,352 KB)
[v3]
Sat, 26 Jul 2025 11:17:08 UTC (5,372 KB)