
Qwen Releases Qwen2.5-VL-32B-Instruct: A 32B-Parameter VLM that Surpasses Qwen2.5-VL-72B and Other Models like GPT-4o Mini

In the field of artificial intelligence, vision-language models (VLMs) have become essential tools, enabling machines to interpret and reason over visual and textual data together. Despite this progress, balancing model performance with computational efficiency remains a challenge, especially when deploying large-scale models in resource-constrained settings.

Qwen has released Qwen2.5-VL-32B-Instruct, a 32-billion-parameter VLM that surpasses its larger predecessor, Qwen2.5-VL-72B, as well as models such as GPT-4o Mini, under an Apache 2.0 release. This development reflects a commitment to open-source collaboration and addresses the need for models that are both high-performing and computationally manageable.

Technically, the Qwen2.5-VL-32B-Instruct model introduces several improvements:

  • Visual understanding: The model excels at recognizing objects and analyzing texts, charts, icons, graphics, and layouts within images.
  • Agentic capabilities: It acts as a dynamic visual agent capable of reasoning and directing tools for computer and phone interactions.
  • Video understanding: The model can comprehend videos over an hour long and pinpoint the relevant segments, demonstrating advanced temporal localization.
  • Object grounding: It accurately localizes objects in images by generating bounding boxes or points, providing stable JSON outputs for coordinates and attributes (see the sketch after this list).
  • Structured output generation: The model supports structured outputs for data such as invoices, forms, and tables, benefiting applications in finance and commerce.

These features broaden the model's applicability across domains that require fine-grained multimodal understanding.
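To make the object-grounding capability above concrete, below is a minimal sketch of querying the model through the Hugging Face Transformers library, following the usage pattern published on the Qwen2.5-VL model cards. The image URL, the prompt wording, and the JSON shape shown in the final comment are illustrative assumptions rather than an official specification.

```python
# Minimal sketch: object grounding with Qwen2.5-VL-32B-Instruct via Hugging Face
# Transformers. Assumes a recent transformers release with Qwen2.5-VL support
# and the qwen_vl_utils helper package (pip install qwen-vl-utils).
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model_id = "Qwen/Qwen2.5-VL-32B-Instruct"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Ask for bounding boxes as JSON; the image URL below is a hypothetical placeholder.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://example.com/invoice.png"},
            {"type": "text", "text": "Detect every line item on this invoice and "
                                     "return its bounding box and label as JSON."},
        ],
    }
]

# Build the chat prompt and pack image/video tensors the way the processor expects.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt"
).to(model.device)

# Generate, then strip the prompt tokens before decoding the reply.
output_ids = model.generate(**inputs, max_new_tokens=512)
generated = output_ids[:, inputs.input_ids.shape[1]:]
print(processor.batch_decode(generated, skip_special_tokens=True)[0])
# Illustrative shape of the grounded reply:
# [{"bbox_2d": [x1, y1, x2, y2], "label": "line item"}, ...]
```

Because the model emits coordinates as structured JSON, a reply like this can typically be parsed with `json.loads` and fed straight into downstream pipelines, which is what makes the grounding output practical for automation.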

Empirical evaluations highlight the model's strengths:

  • Vision tasks: On the Massive Multi-discipline Multimodal Understanding benchmark (MMMU), the model scored 70.0, surpassing Qwen2-VL-72B's 64.5. On MathVista, it reached 74.7 compared to the previous 70.5. Notably, on OCRBenchV2 the model scored 57.2/59.1, a significant improvement over 47.8/46.1. On Android Control tasks, it achieved 69.6/93.3, exceeding 66.4/84.4.
  • Text tasks: The model showed competitive performance, scoring 78.4 on MMLU, 82.2 on MATH, and an impressive 91.5 on HumanEval, matching or exceeding models such as GPT-4o Mini in certain areas.

These results underscore the model's balanced proficiency across diverse tasks.

In conclusion, Qwen2.5-VL-32B-Instruct represents a significant advance in vision-language modeling, striking a harmonious balance between performance and efficiency. Its open-source availability under the Apache 2.0 license invites the global AI community to explore and adapt this powerful model, likely accelerating innovation and application across various sectors.


Check out the model weights. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don't forget to join our 85k+ ML SubReddit.


Nikhil is an intern consultant at Marktechpost. He is pursuing an integrated dual degree in Materials at the Indian Institute of Technology, Kharagpur. Nikhil is an AI/ML enthusiast who is always researching applications in fields like biomaterials and biomedical science. With a strong background in Material Science, he explores new advancements and creates opportunities to contribute.

2025-03-25 05:46:00

