
[2507.05386] Reinforcement Fine-Tuning Naturally Mitigates Forgetting in Continual Post-Training

Authors: Song Lai, Haohan Zhao, Rong Feng, Changyi Ma, Wenzhuo Liu, Hongbo Zhao, Xi Lin, Dong Yi, Min Xie, Qingfu Zhang, Hongbin Liu, Gaofeng Meng, Fei Zhu

View a PDF of the paper titled Reinforcement Fine-Tuning Naturally Mitigates Forgetting in Continual Post-Training, by Song Lai and 12 other authors


Abstract: Continual post-training (CPT) is a popular and effective technique for adapting foundation models, such as multimodal large language models, to specific and ever-evolving downstream tasks. While existing research has mainly focused on methods such as data replay, model expansion, or parameter regularization, the fundamental role of the learning paradigm within CPT remains largely unexplored. This paper presents a comparative analysis of two core post-training paradigms, supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT), and investigates their respective effects on knowledge retention during CPT. Our experiments are conducted on a benchmark of seven diverse multimodal tasks, using Qwen2.5-VL-7B-Instruct as the base model for continual post-training. The investigation yields two significant findings: (1) When continually learning on downstream tasks, SFT leads to catastrophic forgetting of previously learned tasks, whereas RFT preserves prior knowledge and achieves performance comparable to multi-task training. (2) RFT protects and even enhances general knowledge on standard benchmarks (e.g., MMMU and MMLU-Pro), whereas SFT severely degrades the model's general capabilities. Further analysis reveals that this stability is not primarily due to explicit mechanisms such as a KL penalty or chain-of-thought reasoning. Instead, we identify an implicit regularization mechanism inherent to RFT as a key contributing factor. Our theoretical analysis suggests that RFT's updates are naturally scaled by the reward variance, which acts as a data-dependent regularizer that protects previously acquired knowledge. Finally, we propose a rollout-based instance filtering algorithm to improve the stability and efficiency of RFT. Our comprehensive study demonstrates the superiority of RFT as a robust paradigm for continual post-training.
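The mechanism the abstract points to, policy-gradient updates scaled by the reward variance, can be illustrated with a short sketch. The snippet below is a minimal illustration rather than the paper's implementation: it assumes a GRPO-style group-normalized advantage, and filter_prompts is a hypothetical helper standing in for the proposed rollout-based instance filtering.

```python
import numpy as np

def group_advantages(rewards):
    """GRPO-style group-normalized advantages: (r - mean) / (std + eps).

    If every rollout for a prompt receives the same reward, the advantages
    are all zero and the policy-gradient update for that prompt vanishes.
    This is one way to see the reward-variance scaling described in the
    abstract: low-variance prompts (already mastered, or hopeless) contribute
    little or no gradient, implicitly protecting prior knowledge.
    """
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

def filter_prompts(prompt_rewards, min_std=0.05):
    """Hypothetical rollout-based filtering: keep only prompts whose rollout
    rewards vary enough to yield a useful learning signal."""
    return [p for p, rewards in prompt_rewards.items()
            if np.std(rewards) >= min_std]

# Example: prompt "a" is already solved (all rewards 1.0), so it is filtered
# out and would receive zero advantage; prompt "b" still carries a signal.
prompt_rewards = {"a": [1.0, 1.0, 1.0, 1.0], "b": [0.0, 1.0, 0.0, 1.0]}
print(filter_prompts(prompt_rewards))         # ['b']
print(group_advantages(prompt_rewards["a"]))  # [0. 0. 0. 0.]
```

Under these assumptions, filtering zero-variance prompts simply skips updates that would have been (near) zero anyway, which is why such a step can improve efficiency without changing what the model learns.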

Submission history

From: Song Lai [view email]
[v1]

Monday, 7 July 2025 18:17:06 UTC (1,777 KB)
[v2]

Thursday, 25 Sep 2025 09:26:17 UTC (1,778 KB)
[v3]

Tuesday, 30 Sep 2025 07:32:01 UTC (1,778 KB)


