MindGYM: Enhancing Vision-Language Models via Synthetic Self-Challenging Questions

arXiv:2503.09499v1 Announce Type: cross ABSTRACT: Large vision-language models (VLMs) face challenges in achieving robust, transferable reasoning abilities due to reliance on labor-intensive manual instruction datasets or computationally expensive self-supervised methods. To address these issues, we introduce MindGYM, a framework that enhances VLMs through synthetic self-challenging questions, consisting of three stages: (1) Seed Single-Hop Question Synthesis, generating cognitively demanding questions across textual contexts (e.g., logical deduction) and multimodal contexts (e.g., diagram-based reasoning) spanning diverse semantic areas; (2) Challenging Multi-Hop Question Synthesis, combining seed questions via diverse principles such as bridging and visual-textual alignment to create multi-step problems demanding deeper reasoning; and (3) Thinking-Induced Curriculum Fine-Tuning, a structured pipeline that progressively trains the model from scaffolded reasoning to standalone inference. By leveraging the model's self-synthesis capability, MindGYM achieves high data efficiency (e.g., +16% gains on MathVision-Mini with only 400 samples), computational efficiency (reducing both training and inference costs), and robust generalization across tasks. Extensive evaluations on seven benchmarks demonstrate superior performance over strong baselines, with notable improvements (+15.77% win rates) in reasoning depth and breadth validated via GPT-based scoring. MindGYM underscores the viability of self-challenging for refining VLM capabilities while minimizing human intervention and resource demands. Code and data are released to advance multimodal reasoning research.
2025-03-13 04:00:00
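The three-stage pipeline described in the abstract (seed single-hop synthesis, multi-hop composition, curriculum ordering) can be illustrated with a minimal sketch. Everything below is hypothetical scaffolding: the function names, the `Question` dataclass, the placeholder question text, and the two-stage curriculum split are assumptions for illustration, not the paper's actual implementation (which prompts the VLM itself to generate and compose questions).

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Question:
    text: str
    hops: int  # number of reasoning steps required

def synthesize_seed_questions(domains: List[str]) -> List[Question]:
    # Stage 1 (sketch): one single-hop question per semantic domain.
    # Placeholder text stands in for VLM-generated questions.
    return [Question(f"[{d}] single-hop question", hops=1) for d in domains]

def compose_multi_hop(seeds: List[Question], principle: str) -> List[Question]:
    # Stage 2 (sketch): pair seed questions under a composition
    # principle (e.g., "bridging") to form deeper multi-step problems.
    pairs = zip(seeds[::2], seeds[1::2])
    return [Question(f"({principle}) {a.text} -> {b.text}", hops=a.hops + b.hops)
            for a, b in pairs]

def curriculum(questions: List[Question]) -> List[List[Question]]:
    # Stage 3 (sketch): order training data from fewer to more hops,
    # mimicking the scaffolded-to-standalone progression.
    ordered = sorted(questions, key=lambda q: q.hops)
    mid = len(ordered) // 2
    return [ordered[:mid], ordered[mid:]]

seeds = synthesize_seed_questions(["logic", "charts", "ethics", "math"])
multi = compose_multi_hop(seeds, "bridging")
stages = curriculum(seeds + multi)
print(len(seeds), len(multi), [len(s) for s in stages])  # → 4 2 [3, 3]
```

The sketch only captures the data-flow shape: easy single-hop items are generated first, composed into harder multi-hop items, and the combined pool is ordered by difficulty for fine-tuning.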