Researchers Introduce MMLongBench: A Comprehensive Benchmark for Long-Context Vision-Language Models

Recent developments in long-context (LC) modeling have opened new possibilities for large language models (LLMs) and large vision-language models (LVLMs). Long-context vision-language models (LCVLMs) represent an important step forward, enabling LVLMs to process hundreds of images and thousands of interleaved text tokens in a single forward pass. However, the development of effective evaluation benchmarks has lagged behind. It remains unclear how well current LCVLMs perform in long-context settings, which tasks they struggle with, and how robust they are to variation in input length. Current benchmarks face the following problems: (a) limited coverage of downstream tasks, (b) insufficient coverage of image types, (c) lack of control over context length, and (d) evaluation at only a single context length.
Various techniques have extended the context windows of LVLMs, including longer pre-training sequence lengths, position extrapolation, and efficient architectures. Models such as Gemini-2.5 and Qwen2.5-VL have adopted these methods, along with visual token compression, to accommodate longer sequences. For evaluation, the needle-in-a-haystack (NIAH) task, which inserts a piece of information at a specific depth within a long text, has become the de facto standard for testing LC capability. However, current vision-language benchmarks remain limited, focusing only on NIAH variants or long-document VQA tasks. Even MileBench contains short-context tasks with an average length of only 9K tokens, and it fails to assess true LC capability across diverse vision-language applications.
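To make the NIAH setup concrete, the sketch below shows one simple way such a probe can be built: a "needle" sentence is inserted at a chosen relative depth inside a long run of filler passages, and the model is then asked a question that only the needle answers. The passage contents and the helper name are illustrative and are not taken from any specific benchmark implementation.

```python
from typing import List

def build_niah_context(haystack: List[str], needle: str, depth: float) -> str:
    """Insert the needle passage at a relative depth (0.0 = start, 1.0 = end)."""
    idx = int(round(depth * len(haystack)))
    passages = haystack[:idx] + [needle] + haystack[idx:]
    return "\n\n".join(passages)

# Hypothetical filler and needle, purely for illustration.
filler = [f"Distractor passage number {i} about an unrelated topic." for i in range(1000)]
needle = "The access code mentioned in the meeting was 7421."
prompt = build_niah_context(filler, needle, depth=0.35)
question = "What was the access code mentioned in the meeting?"
```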
Researchers from HKUST, Tencent AI Seattle Lab, University of Edinburgh, Miniml.AI, and NVIDIA AI Center have introduced MMLongBench, the first comprehensive benchmark for LCVLMs. It contains 13,331 examples spanning five downstream task categories, including Visual RAG and Many-Shot ICL, and covers both natural and synthetic image types. All examples are standardized across five input lengths, from 8K to 128K tokens, using a cross-modal tokenization scheme that combines vision patches with text tokens. By benchmarking 46 closed-source and open-source models, the study reveals that single-task performance is a weak predictor of overall LC capability, that models of both kinds struggle with LC tasks, and that models with stronger reasoning ability show better LC performance.
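The cross-modal token counting can be pictured roughly as follows: text contributes tokens from an ordinary tokenizer, and each image contributes a number of visual tokens determined by its patch grid. The sketch below assumes 14×14 patches with 2×2 pixel unshuffle (so four patches merge into one token), matching the settings described in the next paragraph; the exact rounding rules are an assumption.

```python
import math
from typing import List, Tuple

def visual_token_count(width: int, height: int,
                       patch_size: int = 14, unshuffle: int = 2) -> int:
    """Split an image into patch_size x patch_size patches, then merge each
    unshuffle x unshuffle block of patches into a single visual token."""
    patches_w = math.ceil(width / patch_size)
    patches_h = math.ceil(height / patch_size)
    return math.ceil(patches_w / unshuffle) * math.ceil(patches_h / unshuffle)

def cross_modal_length(text_tokens: int, image_sizes: List[Tuple[int, int]]) -> int:
    """Total example length = text tokens + visual tokens over all images."""
    return text_tokens + sum(visual_token_count(w, h) for w, h in image_sizes)

# A 448x448 image gives a 32x32 patch grid, i.e. 16x16 = 256 tokens after unshuffle.
print(cross_modal_length(text_tokens=2000, image_sizes=[(448, 448)]))  # 2256
```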
The researchers construct long-context inputs by inserting the gold passages that contain the answer among large sets of distracting passages retrieved from Wikipedia. For ViQuAE, gold passages are taken from KILT, while InfoSeek uses the lead sections of Wikipedia entity pages. Wikipedia pages are split into 100-word passages, and retrieved distractors are added until the required input length is reached. The many-shot in-context learning tasks use four diverse image classification datasets: Stanford Cars, Food101, SUN397, and iNat2021, accommodating 500 images within 128K context windows. The cross-modal token count combines text tokens from a text tokenizer with visual tokens computed from 14×14 patches followed by 2×2 pixel unshuffle, ensuring compatibility with modern LVLMs during evaluation.
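A simplified sketch of this padding procedure is shown below: the gold passages are kept, retrieved distractor passages are appended until the counted length reaches the target, and the final ordering is shuffled so the evidence does not sit in a fixed position. The function names, the shuffling step, and the stopping rule are assumptions for illustration; the benchmark's exact construction may differ.

```python
import random
from typing import Callable, List

def build_long_context(gold: List[str], distractors: List[str],
                       target_tokens: int,
                       count_tokens: Callable[[str], int]) -> List[str]:
    """Pad gold passages with retrieved distractors until the target length,
    then shuffle so the gold evidence lands at varying positions."""
    context = list(gold)
    for passage in distractors:
        if count_tokens("\n\n".join(context)) >= target_tokens:
            break
        context.append(passage)
    random.shuffle(context)
    return context

# count_tokens would be the same cross-modal counter used for standardization;
# for text-only passages it can simply be len(tokenizer.encode(text)).
```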
Evaluation on MMLongBench across tasks and context lengths shows that all models struggle, but closed-source models perform better. At the longest input length of 128K, all models struggle with long-context vision-language tasks; GPT-4o reaches only a 62.9 average score. Gemini-2.5-Pro emerges as the strongest performer, outperforming open-source models by 20 points on everything except the ICL tasks. Moreover, the Ovis2-34B model achieves a score of 41.6 on summarization, similar to GPT-4o (42.4). Qwen2.5-VL-32B achieves a SubEM score of 64.6 on Visual RAG, even better than Gemini-2.0-Flash. Models also show some ability to generalize beyond their training context lengths, with Qwen2-VL-72B scoring 51.9 at 128K despite a 32K training window.

In conclusion, the researchers introduced MMLongBench, the first comprehensive benchmark for evaluating LCVLMs across diverse downstream tasks. It provides a rigorous foundation for diagnosing frontier-model capability by covering five distinct task categories with unified cross-modal token counting and standardized context lengths. The evaluation of 46 models shows that single-task performance is a poor predictor of long-context capability, and that frontier models face significant challenges with OCR accuracy and cross-modal retrieval. MMLongBench offers a standard evaluation framework to drive future research toward more efficient vision-language token encoding, more robust position-extrapolation schemes, and improved multimodal retrieval and reasoning capabilities.
Check out the Paper and GitHub page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter, and don't forget to join our 95K+ ML SubReddit and subscribe to our newsletter.

Sajjad Ansari is a final-year undergraduate at IIT Kharagpur. As a tech enthusiast, he delves into the practical applications of AI, with a focus on understanding the impact of AI technologies and their real-world implications. He aims to articulate complex AI concepts in a clear and accessible manner.
