MobileLLM-R1: Exploring the Limits of Sub-Billion Language Model Reasoners with Open Training Recipes

View a PDF of the paper titled MobileLLM-R1: Exploring the Limits of Sub-Billion Language Model Reasoners with Open Training Recipes, by Changsheng Zhao and 10 other authors
Abstract: The paradigm shift in large language models (LLMs) from instinctive responses to chain-of-thought (CoT) reasoning has fueled two prevailing assumptions: (1) reasoning capabilities emerge only in sufficiently large models, and (2) such capabilities require training on massive datasets. While the first assumption has already been challenged by recent sub-billion-parameter reasoning models such as Qwen3-0.6B and the DeepSeek distilled variants, the second remains largely unexamined. In this work, we revisit the necessity of scaling to extremely large corpora (>10T tokens) for reasoning to emerge. By curating and resampling open-source datasets that we identify as beneficial under our designed metrics, we demonstrate that strong reasoning capabilities can emerge with far less data. Specifically, we show that only ~2T tokens of high-quality data are sufficient, and that pre-training with 4.2T tokens resampled from this ~2T-token dataset, followed by a post-training procedure, enables the development of MobileLLM-R1 models that outperform prior fully open models. For example, MobileLLM-R1-950M achieves an AIME score of 15.5, compared with only 0.6 for OLMo-2-1.48B and 0.3 for SmolLM-2-1.7B. Strikingly, despite being trained on only 11.7% of the tokens of Qwen3's proprietary 36T-token pre-training corpus, MobileLLM-R1-950M matches or surpasses Qwen3-0.6B across multiple reasoning benchmarks. To facilitate further research in this direction, we release the complete training recipe, data sources, data mixing ratios, and model checkpoints, together with the key insights obtained throughout this study.
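As a quick sanity check of the token budgets quoted above: 4.2T pre-training tokens correspond to roughly two passes over the ~2T-token curated set, and to about 11.7% of Qwen3's reported 36T-token corpus. The short Python sketch below only reproduces that arithmetic; the variable names are illustrative and are not part of the released recipe.

# Back-of-the-envelope check of the token budgets quoted in the abstract.
# The three numbers come from the abstract; everything else is illustrative.
CURATED_TOKENS = 2.0e12       # ~2T tokens of curated high-quality data
PRETRAIN_TOKENS = 4.2e12      # 4.2T tokens seen during pre-training (resampled)
QWEN3_CORPUS_TOKENS = 36e12   # Qwen3's reported 36T-token pre-training corpus

# Pre-training revisits the curated data roughly twice via resampling.
effective_passes = PRETRAIN_TOKENS / CURATED_TOKENS
print(f"Effective passes over the curated data: {effective_passes:.1f}")   # ~2.1

# The "11.7%" figure is the ratio of pre-training token budgets.
token_fraction = PRETRAIN_TOKENS / QWEN3_CORPUS_TOKENS
print(f"Fraction of Qwen3's token budget: {token_fraction:.1%}")           # ~11.7%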
Submission history
From: Zechun Liu [view email]
[v1]
Monday, 29 Sep 2025 15:43:59 UTC (714 KB)
[v2]
Tuesday, 30 Sep 2025 18:16:06 UTC (714 KB)