Can GRPO Be 10x More Efficient? Kwai AI’s SRPO Suggests Yes

The remarkable success of OpenAI’s o1 and the DeepSeek-R1 series has unambiguously demonstrated the power of large-scale reinforcement learning (RL) in eliciting advanced reasoning behaviors and greatly enhancing the capabilities of large language models (LLMs).

However, the core training methodologies behind these pioneering models are often left undisclosed in their technical reports. Recent community efforts have mostly focused on mathematical reasoning, leaving cross-domain generalization largely an open challenge. Moreover, the standard Group Relative Policy Optimization (GRPO) algorithm suffers from common problems such as performance bottlenecks, inefficient sample use, and difficulty in developing specialized reasoning skills when working with mixed-domain datasets. These challenges complicate the effective scaling of RL methods for LLMs.

To address these limitations, researchers from Kuaishou’s Kwaipilot team introduced a new reinforcement learning framework: two-Staged history-Resampling Policy Optimization (SRPO). This approach is designed to tackle the challenges above systematically across multiple dimensions. The team has publicly released a technical report detailing the intricacies of their training method and has also open-sourced the SRPO-Qwen-32B model.

Notably, this work represents the first demonstration of DeepSeek-R1-Zero-level performance in both math and code simultaneously. Using the same base model as DeepSeek (Qwen2.5-32B) and a purely reinforcement-learning training setup, SRPO achieved strong results on the AIME24 (50) and LiveCodeBench (41.6) benchmarks, surpassing the performance of DeepSeek-R1-Zero-Qwen-32B.

More significantly, SRPO reaches this level of performance with only one-tenth of the training steps required by R1-Zero.

Challenges with vanilla GRPO

In their initial exploration, the Kwaipilot team experimented with the vanilla GRPO algorithm. However, they quickly hit bottlenecks that prevented the model from reaching the desired R1-Zero performance levels. These issues included:

  • Cross-domain optimization conflicts (math vs. code): Mathematical problems tend to elicit longer, more detailed reasoning trajectories (long CoT), while code data shows a weaker tendency toward this behavior. Mixing the two directly created a conflict that resulted in suboptimal performance in both domains.
  • Low training efficiency from identical group rewards: The GRPO algorithm relies on non-zero variance of rewards within a group of sampled rollouts to compute advantages. When the rollouts in a group yield nearly identical reward values, the computed advantage approaches zero. If a large portion of a training batch exhibits this phenomenon, effective gradient contributions become minimal, which severely reduces training efficiency (see the sketch after this list).
  • Premature performance saturation: GRPO training hit reward plateaus and benchmark saturation early on. This problem is partly attributable to insufficient data quality. When the training data lacks complexity or diversity, with an abundance of overly simple problems, the model tends to conservatively maintain its performance on the easiest tasks, which hinders its ability to develop the complex, deep reasoning required for difficult problems.
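To make the group-reward issue concrete, here is a minimal Python sketch (not taken from the Kwaipilot codebase) of the standard group-relative advantage computation used in GRPO: when every rollout in a group receives the same reward, the normalized advantages collapse to zero and the group contributes essentially no gradient.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages for one group of rollouts on the same prompt:
    each reward is normalized by the group mean and standard deviation."""
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Mixed outcomes give informative, non-zero advantages ...
print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))  # [ 1. -1.  1. -1.]

# ... but identical rewards collapse the advantages to (near) zero,
# so the whole group contributes essentially no gradient signal.
print(group_relative_advantages([1.0, 1.0, 1.0, 1.0]))  # [0. 0. 0. 0.]
```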

Two-stage training

To address the inherent conflict between response patterns in math and code, the Kwaipilot team implemented a two-stage training paradigm:

  • Stage 1: Eliciting reasoning capabilities: This initial training phase focuses exclusively on challenging math data. The primary goal is to fully stimulate test-time scaling of reasoning, fostering capabilities such as pausing, backtracking, and step-by-step decomposition.
  • Stage 2: Skill integration: In this stage, code data is introduced into the training process. Building on the reasoning foundation established in Stage 1, this stage aims to further enhance coding capabilities while progressively strengthening procedural reasoning, recursion, and tool-use abilities (a minimal sketch of this schedule follows the list).
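As a rough illustration of the staged schedule described above, the sketch below expresses a two-stage data curriculum in configuration form; the dataset names and stage-switch criteria are hypothetical placeholders, not details from the SRPO report.

```python
# Hypothetical configuration for a two-stage RL data schedule; names and
# stopping criteria are illustrative only.
TRAINING_STAGES = [
    {
        "name": "stage1_reasoning",
        "prompt_sources": ["math_prompts"],                # math-only RL prompts
        "goal": "elicit long CoT: pausing, backtracking, decomposition",
        "advance_when": "reward growth plateaus",
    },
    {
        "name": "stage2_skill_integration",
        "prompt_sources": ["math_prompts", "code_prompts"],  # code data added
        "goal": "build coding ability on the stage-1 reasoning foundation",
        "advance_when": "target training steps reached",
    },
]

def active_prompt_sources(stage_index):
    """Return the prompt datasets sampled during the given stage."""
    return TRAINING_STAGES[stage_index]["prompt_sources"]

print(active_prompt_sources(1))  # ['math_prompts', 'code_prompts']
```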

Comparative analysis of training strategies

The team analyzed the effect of different training data strategies on response length, which revealed the following insights:

  • Mixed training: Models trained on a direct mixture of math and code data showed limited growth in response length and weak benchmark performance. Although math problems elicited some reasoning patterns, code problems often produced direct responses that jumped straight to the code output with minimal analysis or upfront planning.
  • Math-only training: Training on math data alone led to a steady increase in response length and excellent performance on math benchmarks. Crucially, it fostered strong, generalizable reasoning abilities; when facing programming tasks, the model attempted detailed, step-by-step reasoning, including the careful checking and re-examination habits carried over from math problems.
  • Code-only training: While performance on code benchmarks improved, the development of explicit reasoning behavior was minimal, and achieving substantial increases in response length proved difficult. Responses to both code and math problems were significantly shorter than under math-only training, with code solutions generated directly and little step-by-step reasoning or preliminary analysis.
  • Staged training: The two-stage training approach proposed by the Kwaipilot team produced superior results in both math and programming. The model consistently generated step-by-step reasoning for math problems and structured reasoning patterns for programming tasks. Notably, sophisticated behaviors emerged, such as the model spontaneously using code to assist its mathematical reasoning.

History Resampling

The Kwaipilot team observed that during the middle and later stages of training, roughly 50% of the sampled groups within a batch produced identical rewards. This typically happened when the model consistently succeeded on easier problems, leading to near-zero advantages and ineffective gradient updates.

To address this inefficiency and improve the quality of the gradient signal, they introduced History Resampling. During training, they recorded the reward outcomes of all rollouts within each epoch. At the end of an epoch, they rebuilt the dataset for the next epoch based on the following criteria:

  • Filter out overly simple samples: Samples for which all rollouts produced correct answers were excluded, since they provide no useful signal for policy improvement.
  • Keep informative samples: Samples with mixed (both correct and incorrect) results were retained, since their reward variance guarantees non-zero advantages and effective gradient signals. In addition, hard samples where all rollouts were incorrect in the current epoch were also kept. The rationale is that problems that are initially too difficult may become relatively easier for the updated policy, thus generating effective gradients in subsequent training. This strategy aligns with the principle of curriculum learning, gradually exposing the model to samples of increasing average difficulty to improve training efficiency (a minimal sketch of this rule follows the list).
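Below is a minimal sketch of this epoch-end resampling rule, assuming binary correctness rewards (1.0 for correct, 0.0 for incorrect); it illustrates the criteria above and is not the Kwaipilot implementation.

```python
def resample_for_next_epoch(epoch_rewards, keep_all_incorrect=True):
    """Rebuild the next epoch's sample list from this epoch's rollout rewards.

    `epoch_rewards` maps a sample id to the rewards of all its rollouts in
    the finished epoch (1.0 = correct, 0.0 = incorrect).
    """
    kept = []
    for sample_id, rewards in epoch_rewards.items():
        if all(r == 1.0 for r in rewards):
            continue  # too easy: identical rewards, zero advantage, drop it
        if any(r == 1.0 for r in rewards):
            kept.append(sample_id)  # mixed outcomes: informative gradients
        elif keep_all_incorrect:
            # All rollouts failed: keep as a hard sample on the assumption
            # that it may become solvable for the updated policy.
            kept.append(sample_id)
    return kept

epoch_rewards = {
    "easy":  [1.0, 1.0, 1.0, 1.0],   # filtered out
    "mixed": [1.0, 0.0, 0.0, 1.0],   # kept
    "hard":  [0.0, 0.0, 0.0, 0.0],   # kept
}
print(resample_for_next_epoch(epoch_rewards))  # ['mixed', 'hard']
```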

Compared with the dynamic sampling method proposed in DAPO, History Resampling significantly improves computational efficiency and yields more stable growth in response length.

Data

The Kwaipilot team performed meticulous data cleaning and filtering on publicly available code and math datasets. They applied heuristic rules to filter out irrelevant URLs and formatting noise, and to ensure the completeness of core fields (question and ground-truth answer) in the original data. Following the PRIME data cleaning approach, they removed multi-part questions, pure proof-based problems, and problems requiring image or table understanding. For code data, they excluded problems dependent on specific environments, file I/O, or network interaction, focusing on algorithmic logic.

Before data ingestion, they verified the correctness of both math and code problems to ensure the answers were accurate and verifiable, discarding incorrect or ambiguous solutions. They then assessed the difficulty of each problem, classifying it as easy, medium, or hard based on its pass rate (Pass@k).
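The pass-rate-based difficulty grading could look roughly like the sketch below; the cutoff values are hypothetical placeholders, since the article only states that problems were bucketed into easy, medium, and hard by Pass@k.

```python
def pass_rate(rollout_correct):
    """Fraction of the k rollouts that produced a verified-correct answer."""
    return sum(rollout_correct) / len(rollout_correct)

def classify_difficulty(rate, easy_cutoff=0.75, hard_cutoff=0.25):
    """Bucket a problem by its pass rate; the cutoffs here are placeholders."""
    if rate >= easy_cutoff:
        return "easy"
    if rate <= hard_cutoff:
        return "hard"
    return "medium"

rollouts = [1, 0, 1, 1, 0, 1, 1, 1]              # k = 8 rollouts, 6 correct
print(classify_difficulty(pass_rate(rollouts)))  # 'easy' (pass rate 0.75)
```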

Experimental results

This section presents the experimental results obtained with the SRPO method. The Kwaipilot team focused on monitoring changes in reward and in metrics such as response length during training.

Training process

The figure above shows the complete reward curve and response-length curve during SRPO training. After the reward growth of the first stage began to plateau, training moved to the second stage. At the start of the second stage, the overall reward dropped because the model had not previously been trained on code, and it then increased steadily over the rest of training. Incorporating code data did not significantly increase response length, which matched their expectations. At the same time, benchmark results indicated a consistent, stable improvement in both the model's math and coding abilities, demonstrating the effectiveness of the new method.

Specifically, History Resampling ensured that gradient updates remained effective at every training step, directly increasing the proportion of informative gradients. This improved sampling efficiency produced stable reward growth, clearly demonstrating the training-efficiency gains achieved by the resampling strategy.

Reasoning behaviors

The Kwaipilot team identified three representative reflective patterns: re-checking, hesitation, and exploration. They statistically analyzed responses containing these patterns and recorded the average response length for each. During RL training, they observed a gradual increase in the frequency of the model's self-reflection, correction, and backtracking, indicating the emergence of a "self-verification" ability. They posit that the emergence of "reflection" resembling human cognitive processes during RL is an adaptive behavior arising from the policy optimization process.

As shown in the figure above, the model exhibited almost no checking or reflection on its earlier reasoning steps in the early stages of training. As training progressed, however, the model displayed marked reflective behaviors, forming response patterns such as step-by-step reasoning, numerical substitution, step-by-step verification, and self-correction.

Interestingly, they also found that the model learned to spontaneously use program code for verification when solving math problems. It would first work through a solution via mathematical reasoning and then proactively write program code to check the correctness of that solution. These cases demonstrated the model's ability to leverage procedural thinking for self-correction and multiple attempts, and indicated that in the later stages of training the model had mastered broad reasoning and the integrated use of various code-based reasoning methods to solve problems.
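To make the kind of pattern analysis described in this section concrete, here is a small sketch that counts responses containing each reflective pattern and records their average length; the keyword lists are illustrative guesses, since the exact phrases the team matched on are not given in this article.

```python
import re

# Illustrative keyword lists for the three reflective patterns named above
# (re-checking, hesitation, exploration); placeholders, not the team's lists.
PATTERNS = {
    "recheck":     re.compile(r"\b(re-?check|double-?check|verify)\b", re.I),
    "hesitation":  re.compile(r"\b(wait|hmm|actually|on second thought)\b", re.I),
    "exploration": re.compile(r"\b(alternatively|another approach|let'?s try)\b", re.I),
}

def reflection_stats(responses):
    """Count responses containing each pattern and their average length."""
    stats = {}
    for name, pattern in PATTERNS.items():
        hits = [r for r in responses if pattern.search(r)]
        avg_len = sum(len(r) for r in hits) / len(hits) if hits else 0.0
        stats[name] = {"count": len(hits), "avg_len": avg_len}
    return stats

sample_responses = [
    "Let me double-check the arithmetic before finalizing the answer.",
    "Wait, that substitution was wrong; going back one step.",
    "The answer is 42.",
]
print(reflection_stats(sample_responses))
```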

SRPO paper: available on arXiv

Try the SRPO-Qwen-32B model on Hugging Face
