
REST: A Stress-Testing Framework for Evaluating Multi-Problem Reasoning in Large Reasoning Models

Large reasoning models (LRMs) have advanced rapidly, showing impressive performance on complex problem-solving tasks across domains such as mathematics, coding, and scientific reasoning. However, current evaluation methods focus mainly on single-question testing, which has significant limitations. This article introduces REST (Reasoning Evaluation through Simultaneous Testing), a new multi-problem stress-testing framework designed to push LRMs beyond isolated problem solving and better reflect the multi-context reasoning demands of the real world.

Why Current Evaluation Benchmarks Fall Short for Large Reasoning Models

Most current benchmarks, such as GSM8K and MATH, evaluate LRMs by asking one question at a time. While effective for early model development, this isolated-question approach has critical drawbacks:

  1. Reduced discriminative power: Many state-of-the-art LRMs now achieve near-perfect scores on popular benchmarks (for example, DeepSeek-R1 reaches 97% on MATH500). These saturated results make it increasingly difficult to distinguish genuine model improvements, forcing the continual, costly creation of harder datasets to differentiate capabilities.
  2. Lack of real-world multi-context evaluation: Real-world applications, such as educational tutoring, technical support, or multitasking AI assistants, require reasoning over multiple questions, possibly presented and interfering at once. Single-question testing does not capture these dynamic multi-problem challenges, which reflect true cognitive load and reasoning robustness.

Introducing REST: Testing LRMs on Multiple Problems at Once

To address these challenges, researchers from Tsinghua University, OpenDataLab, Shanghai AI Laboratory, and Renmin University of China developed REST, a simple yet powerful evaluation method that simultaneously tests LRMs on multiple questions bundled into a single prompt.

  • Multi-problem benchmark reconstruction: REST repurposes existing benchmarks by concatenating multiple questions into a single prompt, with a stress level parameter controlling how many questions are presented simultaneously (see the sketch after this list).
  • Comprehensive evaluation: REST assesses critical reasoning competencies beyond basic problem solving, including contextual priority allocation, resistance to cross-problem interference, and dynamic cognitive load management.
  • Broad applicability: The framework is validated on 34 advanced LRMs ranging from 1.5 billion to 671 billion parameters, tested on 7 diverse benchmarks of varying difficulty (from the simple GSM8K to the challenging AIME and GPQA).
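The prompt-construction step is easy to illustrate. Below is a minimal Python sketch of how a REST-style multi-problem prompt might be assembled from an existing benchmark at a given stress level; the function name, instruction wording, and sampling logic are illustrative assumptions, not the authors' exact implementation.

```python
import random

def build_rest_prompt(questions, stress_level, seed=0):
    """Concatenate `stress_level` benchmark questions into one prompt.

    Illustrative sketch: only the concatenation idea and the stress-level
    parameter come from the REST description; the wording is assumed.
    """
    rng = random.Random(seed)
    batch = rng.sample(questions, stress_level)
    lines = [
        "Please solve all of the following problems.",
        "Answer each one separately and clearly label your answers.",
        "",
    ]
    for i, q in enumerate(batch, start=1):
        lines.append(f"Problem {i}: {q}")
    return "\n".join(lines), batch

# Example: stress level 3 bundles three GSM8K-style questions into one prompt.
gsm8k_like = [
    "Natalia sold clips to 48 of her friends in April ...",
    "A robe takes 2 bolts of blue fiber and half that much white fiber ...",
    "Weng earns $12 an hour for babysitting ...",
    "James writes a 3-page letter to 2 different friends twice a week ...",
]
prompt, _ = build_rest_prompt(gsm8k_like, stress_level=3)
print(prompt)
```

At stress level 1 this reduces to the standard single-question evaluation, which makes the degradation results below directly comparable to existing benchmark scores.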

REST Reveals Key Insights About LRM Reasoning Capabilities

REST evaluation uncovers several notable findings:

1. Significant performance degradation under multi-problem stress

Even state-of-the-art LRMs like DeepSeek-R1 show notable accuracy drops when handling multiple questions together. For example, DeepSeek-R1's accuracy on challenging benchmarks such as AIME24 falls by nearly 30% under REST compared with isolated-question testing. This contradicts prior assumptions that large language models are inherently capable of effortless multitasking.

2. Enhanced discrimination between similar models

REST amplifies the differences between models whose single-question scores are nearly identical. On MATH500, for example:

  • R1-7B and R1-32B achieve close single-question accuracies of 93% and 94.6%, respectively.
  • Under REST, R1-7B's accuracy drops to 66.75% while R1-32B maintains a high 88.97%, exposing a stark performance gap of more than 22 points.

Likewise, among same-sized models such as AReaL-boba-RL-7B and OpenThinker2-7B, REST captures significant differences in multi-problem capability that single-question evaluations underestimate.

3. Post-training methods may not guarantee robust multi-problem reasoning

Models fine-tuned with reinforcement learning or supervised fine-tuning on single-problem reasoning often fail to preserve their advantages in REST's multi-problem setting. This calls for rethinking training strategies to improve reasoning robustness under realistic scenarios.

4. “Long2short” training enhances performance under stress

Models trained with “long2short” techniques, which encourage concise and efficient reasoning chains, maintain higher accuracy under REST. This suggests a promising direction for designing models better suited to simultaneous multi-problem reasoning.

How REST Simulates Realistic Reasoning Challenges

By increasing the cognitive load on LRMs through simultaneous problem presentation, REST simulates real-world demands in which reasoning systems must allocate effort dynamically, avoid overthinking any single problem, and resist interference from concurrent tasks.

REST also systematically analyzes error types, revealing common failure modes such as:

  • Question omission: ignoring later questions in a multi-question prompt.
  • Summary errors: incorrectly summarizing answers across problems.
  • Reasoning errors: logical or arithmetic mistakes within the reasoning process.

These fine-grained insights are largely invisible in single-question assessments.
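To make the categories concrete, here is a hedged Python sketch of how a multi-problem response might be tagged with these failure modes. The regex-based answer extraction and per-problem comparison are illustrative assumptions, not REST's actual grading pipeline.

```python
import re

def classify_failures(response: str, gold_answers: list[str]) -> list[str]:
    """Heuristically tag a multi-problem response with REST-style failure modes.

    Assumes the model labels answers like "Answer 1: ...". This is an
    illustrative approximation, not the paper's official grader.
    """
    failures = []
    for i, gold in enumerate(gold_answers, start=1):
        match = re.search(rf"Answer\s*{i}\s*[:=]\s*(\S+)", response)
        if match is None:
            # The i-th question was never answered at all.
            failures.append(f"question_omission:{i}")
        elif match.group(1).strip(".,") != gold:
            # Without inspecting the reasoning trace we cannot cleanly split
            # summary errors from reasoning errors; a real grader would.
            failures.append(f"wrong_answer:{i}")
    return failures

resp = "Answer 1: 72\nAnswer 3: 10"
print(classify_failures(resp, ["72", "3", "10"]))
# -> ['question_omission:2']
```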

Practical Evaluation Setup and Benchmark Coverage

  • REST evaluates 34 LRMs spanning sizes from 1.5B to 671B parameters.
  • The benchmarks tested include:
    • Simple: GSM8K
    • Medium: MATH500, AMC23
    • Challenging: AIME24, AIME25, GPQA Diamond, LiveCodeBench
  • Model generation parameters are set according to each model's official guidelines, with an output token limit of 32K for reasoning models.
  • Use of the standardized OpenCompass toolkit ensures consistent, reproducible results.
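For a concrete picture of such a setup, below is a minimal Python sketch of what a REST-style evaluation configuration could look like. The dictionary schema, the specific stress-level values, and the sampling note are illustrative assumptions based on the settings described above, not the project's actual OpenCompass configuration files.

```python
# Hypothetical REST-style evaluation config, loosely modeled on the settings
# described in this article (stress levels, 32K output budget, 7 benchmarks).
# Keys, structure, and stress-level values are illustrative assumptions.
REST_EVAL_CONFIG = {
    "benchmarks": {
        "simple": ["GSM8K"],
        "medium": ["MATH500", "AMC23"],
        "challenging": ["AIME24", "AIME25", "GPQA-Diamond", "LiveCodeBench"],
    },
    "stress_levels": [1, 3, 5, 9],       # 1 = standard single-question baseline
    "generation": {
        "max_output_tokens": 32_768,     # 32K budget for long reasoning traces
        "sampling": "per each model's official guidelines",
    },
    "num_models": 34,                    # 1.5B to 671B parameters
}
```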

Conclusion: REST as a Blueprint for Future Realistic LRM Evaluation

REST represents a significant leap forward in evaluating large reasoning models by:

  • Addressing benchmark saturation: it revitalizes existing datasets without requiring expensive full replacements.
  • Reflecting real-world multi-task demands: it tests models under realistic, high-cognitive-load conditions.
  • Guiding model development: it highlights the value of training methods such as long2short for mitigating overthinking and encouraging adaptive, focused reasoning.

In short, REST paves the way for more reliable, robust, and application-relevant benchmarking of next-generation AI reasoning systems.


Check out the Paper, Project Page, and Code. All credit for this research goes to the researchers of this project.


Sajjad Ansari is a final-year undergraduate at IIT Kharagpur. As a tech enthusiast, he delves into the practical applications of AI, with a focus on understanding the impact of AI technologies and their real-world implications. He aims to articulate complex AI concepts in a clear and accessible manner.
