GLM-4.1V-Thinking and GLM-4.5V: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning

Abstract: We present GLM-4.1V-Thinking and GLM-4.5V, a family of vision-language models (VLMs) designed to advance general-purpose multimodal understanding and reasoning. In this report, we share our key findings in the development of the reasoning-centric training framework. We first develop a capable vision foundation model with significant potential through large-scale pre-training, which arguably sets the upper bound for the final performance. We then propose Reinforcement Learning with Curriculum Sampling (RLCS) to unlock the full potential of the model, leading to comprehensive capability enhancement across a diverse range of tasks, including STEM problem solving, video understanding, content recognition, coding, grounding, GUI-based agents, and long document understanding. In a comprehensive evaluation across 42 public benchmarks, GLM-4.5V achieves state-of-the-art performance on nearly all tasks among open-source models of similar size, and demonstrates competitive or even superior results compared to closed-source models such as Gemini-2.5-Flash on challenging tasks including Coding and GUI Agents. Meanwhile, the smaller GLM-4.1V-9B-Thinking remains highly competitive, achieving results comparable to or better than the much larger Qwen2.5-VL-72B on 29 benchmarks. We open-source both GLM-4.1V-9B-Thinking and GLM-4.5V. Code, models, and more information are released at this https URL.
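The abstract names Reinforcement Learning with Curriculum Sampling (RLCS) but does not specify its mechanics. Below is a minimal, hypothetical sketch of one way curriculum sampling over RL training data could work, assuming prompts are chosen by the model's current rollout pass rate so that each batch carries useful reward signal. The `Sample` class, `curriculum_batch` function, and threshold values are illustrative assumptions, not the paper's actual implementation.

```python
# Sketch of curriculum sampling for RL training data (illustrative only).
import random
from dataclasses import dataclass, field

@dataclass
class Sample:
    prompt: str
    pass_rates: list = field(default_factory=list)  # recent rollout success rates

    @property
    def difficulty(self) -> float:
        # Estimated pass rate; unseen samples default to 0.5 (unknown).
        return sum(self.pass_rates) / len(self.pass_rates) if self.pass_rates else 0.5

def curriculum_batch(pool, batch_size, low=0.1, high=0.9):
    """Prefer samples the model sometimes solves (pass rate in (low, high)).
    Prompts that are always solved or never solved yield little gradient
    signal for group-based RL objectives, so they are skipped when possible."""
    informative = [s for s in pool if low < s.difficulty < high]
    fallback = pool if len(informative) < batch_size else informative
    return random.sample(fallback, min(batch_size, len(fallback)))

# Usage: after each RL step, record rollout pass rates back onto the samples
# so the curriculum tracks the model's evolving capability.
pool = [Sample(f"problem-{i}") for i in range(1000)]
batch = curriculum_batch(pool, batch_size=32)
```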
Submission history
From: Gayel Cheng
[v1] Tue, 1 Jul 2025 17:55:04 UTC (8,647 KB)
[v2] Wed, 2 Jul 2025 15:53:43 UTC (10,689 KB)
[v3] Wed, 13 Aug 2025 15:10:17 UTC (16,738 KB)
[v4] Thu, 14 Aug 2025 14:37:03 UTC (16,743 KB)