GLM-4.1V-Thinking and GLM-4.5V: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning

Abstract: We present GLM-4.1V-Thinking and GLM-4.5V, a family of vision-language models (VLMs) designed to advance general-purpose multimodal understanding and reasoning. In this report, we share our key findings in the development of the reasoning-centric training framework. We first develop a capable vision foundation model with significant potential through large-scale pre-training, which arguably sets the upper bound for the final performance. We then propose Reinforcement Learning with Curriculum Sampling (RLCS) to unlock the full potential of the model, leading to comprehensive capability enhancement across a diverse range of tasks, including STEM problem solving, video understanding, content recognition, coding, grounding, GUI-based agents, and long document understanding. In a comprehensive evaluation across 42 public benchmarks, GLM-4.5V achieves state-of-the-art performance on nearly all tasks among open-source models of similar size, and demonstrates competitive or even superior results compared to closed-source models such as Gemini-2.5-Flash on challenging tasks including Coding and GUI Agents. Meanwhile, the smaller GLM-4.1V-9B-Thinking remains highly competitive, achieving results comparable to or better than the much larger Qwen2.5-VL-72B on 29 benchmarks. We open-source both GLM-4.1V-9B-Thinking and GLM-4.5V. Code, models, and more information are released at this https URL.
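The abstract names Reinforcement Learning with Curriculum Sampling (RLCS) but does not specify its mechanics. Below is a minimal, hypothetical sketch of one way curriculum sampling over RL training data could work, assuming prompts are chosen by the model's current rollout pass rate so that each batch carries useful reward signal. The `Sample` class, `curriculum_batch` function, and threshold values are illustrative assumptions, not the paper's actual implementation.

```python
# Sketch of curriculum sampling for RL training data (illustrative only).
import random
from dataclasses import dataclass, field

@dataclass
class Sample:
    prompt: str
    pass_rates: list = field(default_factory=list)  # recent rollout success rates

    @property
    def difficulty(self) -> float:
        # Estimated pass rate; unseen samples default to 0.5 (unknown).
        return sum(self.pass_rates) / len(self.pass_rates) if self.pass_rates else 0.5

def curriculum_batch(pool, batch_size, low=0.1, high=0.9):
    """Prefer samples the model sometimes solves (pass rate in (low, high)).
    Prompts that are always solved or never solved yield little gradient
    signal for group-based RL objectives, so they are skipped when possible."""
    informative = [s for s in pool if low < s.difficulty < high]
    fallback = pool if len(informative) < batch_size else informative
    return random.sample(fallback, min(batch_size, len(fallback)))

# Usage: after each RL step, record rollout pass rates back onto the samples
# so the curriculum tracks the model's evolving capability.
pool = [Sample(f"problem-{i}") for i in range(1000)]
batch = curriculum_batch(pool, batch_size=32)
```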
Submission history
From: Gayel Cheng
[v1] Tue, 1 Jul 2025 17:55:04 UTC (8,647 KB)
[v2] Wed, 2 Jul 2025 15:53:43 UTC (10,689 KB)
[v3] Wed, 13 Aug 2025 15:10:17 UTC (16,738 KB)
[v4] Thu, 14 Aug 2025 14:37:03 UTC (16,743 KB)