ViSMaP: Unsupervised Summarization of Hour-Long Videos Using Meta-Prompting and Short-Form Datasets

Video captioning models are typically trained on datasets of short clips, usually under three minutes, paired with corresponding captions. While this enables them to describe basic actions such as walking or talking, these models struggle with the complexity of long-form videos, such as vlogs, sports events, and films that can last over an hour. When applied to such videos, they often produce fragmented descriptions focused on isolated actions rather than capturing the broader storyline. Efforts such as MA-LMM and LaViLa extended video captioning to clips of up to 10 minutes using LLMs, but hour-long videos remain challenging due to a shortage of suitable datasets. Although Ego4D introduced a large dataset of hour-long videos, its first-person perspective limits its broader applicability. Video ReCap addressed this gap by training on hour-long videos with multi-granularity annotations, but this approach is expensive and prone to annotation inconsistencies. In contrast, annotated short-form video datasets are widely available and easier to use.
Advances in vision-language models have greatly strengthened the integration of vision and language tasks, with early works such as CLIP laying the foundation. Subsequent models, such as LLaVA and MiniGPT-4, extended these capabilities to images, while others adapted them to video understanding by focusing on temporal modeling and building stronger datasets. Despite this progress, the scarcity of large annotated long-video datasets remains a major obstacle. Traditional video tasks, such as video question answering, captioning, and grounding, mostly require short-form spatial or temporal understanding, whereas summarizing hour-long videos requires identifying key content amid substantial redundancy. While some models, such as LongVA and LLaVA-Video, can perform VQA on long videos, they struggle with summarization due to data limitations.
Researchers from Queen Mary University of London and Spotify introduce ViSMaP, an unsupervised method for summarizing hour-long videos without the need for costly annotations. Traditional models perform well on short, pre-segmented videos but struggle with longer content where key events are scattered throughout. ViSMaP bridges this gap by using LLMs and a meta-prompting strategy to generate and iteratively refine pseudo-summaries from clip descriptions produced by short-form video models. The process involves three LLMs working in sequence for generation, evaluation, and prompt optimization. ViSMaP achieves performance comparable to fully supervised models across multiple datasets while remaining adaptable to new domains and eliminating the need for extensive manual labeling.
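The article does not include code, but the generator-evaluator-optimizer cycle over clip captions can be sketched roughly as follows. This is a minimal illustration only: `call_llm` is a hypothetical stand-in for any LLM client, and the prompts, scoring scheme, and number of rounds are assumptions rather than the authors' actual choices.

```python
# Minimal sketch of a three-LLM meta-prompting loop (illustrative, not the paper's code).
# `call_llm` is a hypothetical helper standing in for any chat-completion client.

def call_llm(prompt: str) -> str:
    """Placeholder for a single LLM call (OpenAI client, local model, etc.)."""
    raise NotImplementedError

def meta_prompt_summary(clip_captions: list[str], rounds: int = 3) -> str:
    captions = "\n".join(clip_captions)  # captions from a short-form video captioner
    instruction = "Summarize the hour-long video described by these clip captions."

    best_summary, best_score = "", float("-inf")
    for _ in range(rounds):
        # 1) Generator LLM: produce a candidate summary under the current instruction.
        summary = call_llm(f"{instruction}\n\nClip captions:\n{captions}")

        # 2) Evaluator LLM: score the candidate (here, a 1-10 rating parsed from text).
        rating = call_llm(
            "Rate the following summary of the clip captions from 1 to 10. "
            f"Answer with a number only.\n\nCaptions:\n{captions}\n\nSummary:\n{summary}"
        )
        score = float(rating.strip().split()[0])
        if score > best_score:
            best_summary, best_score = summary, score

        # 3) Optimizer LLM: rewrite the instruction so the next candidate scores higher.
        instruction = call_llm(
            "Rewrite this summarization instruction so it yields a better summary "
            f"of the captions (last score: {score}/10). Return only the new "
            f"instruction.\n\nInstruction:\n{instruction}"
        )
    return best_summary
```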
The study addresses cross-domain video summarization by training on a labeled short-form video dataset and adapting to unlabeled hour-long videos from a different domain. Initially, a model is trained to summarize 3-minute videos using TimeSformer features, a visual-language alignment module, and a text decoder, optimized with cross-entropy and contrastive losses. To handle long videos, they are segmented into 3-minute clips and pseudo-captions are generated for each clip. An iterative meta-prompting approach with multiple LLMs (generator, evaluator, optimizer) then refines these into pseudo-summaries. Finally, the model is fine-tuned on these pseudo-summaries using a symmetric cross-entropy loss to handle noisy labels and improve adaptation.
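For the final fine-tuning step, symmetric cross-entropy (SCE) is a standard way to make training robust to noisy labels: the reverse term down-weights examples whose pseudo-labels the model confidently disagrees with. Below is a minimal PyTorch sketch of a common SCE formulation (forward plus reverse cross-entropy); the weights `alpha`, `beta`, and the clamp value are illustrative defaults, not values taken from the paper.

```python
import torch
import torch.nn.functional as F

def symmetric_cross_entropy(logits, targets, alpha=1.0, beta=1.0, clamp_min=1e-4):
    """Symmetric cross-entropy over class/token logits (sketch).

    logits:  (batch, num_classes) unnormalized scores from the decoder
    targets: (batch,) integer indices from noisy pseudo-labels
    """
    # Forward CE: penalizes the model for disagreeing with the (noisy) label.
    ce = F.cross_entropy(logits, targets)

    # Reverse CE: penalizes the label for disagreeing with the model's prediction,
    # dampening the gradient contribution of badly mislabeled examples.
    pred = F.softmax(logits, dim=-1)
    one_hot = F.one_hot(targets, num_classes=logits.size(-1)).float()
    one_hot = torch.clamp(one_hot, min=clamp_min, max=1.0)  # avoid log(0)
    rce = (-pred * torch.log(one_hot)).sum(dim=-1).mean()

    return alpha * ce + beta * rce
```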
The study evaluates ViSMaP across three scenarios: summarization of hour-long videos on Ego4D-HCap, cross-domain generalization on the MSRVTT, MSVD, and YouCook2 datasets, and adaptation to short videos using EgoSchema. ViSMaP, trained on hour-long videos, is compared against fully supervised and zero-shot methods, such as Video ReCap and LaViLa+GPT-3.5, showing competitive or superior performance without supervision. Evaluations use CIDEr, ROUGE-L, and METEOR scores, as well as QA accuracy. Ablation studies highlight the benefits of the meta-prompting approach and of components such as contrastive learning and the SCE loss. Implementation details include the use of TimeSformer, DistilBERT, and GPT-2, with training conducted on an NVIDIA A100 GPU.
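As a rough illustration of how such caption metrics can be computed, the snippet below scores a toy prediction against a reference using the Hugging Face `evaluate` library for ROUGE-L and METEOR and `pycocoevalcap` for CIDEr; these library choices and the example strings are assumptions for demonstration, not the authors' evaluation code.

```python
# Illustrative metric computation for generated summaries (not the paper's pipeline).
import evaluate
from pycocoevalcap.cider.cider import Cider

predictions = ["a person prepares a meal and serves it to friends"]
references  = ["someone cooks dinner and shares it with guests"]

rouge = evaluate.load("rouge")
meteor = evaluate.load("meteor")
print("ROUGE-L:", rouge.compute(predictions=predictions, references=references)["rougeL"])
print("METEOR: ", meteor.compute(predictions=predictions, references=references)["meteor"])

# CIDEr expects dicts mapping example ids to lists of captions. Note that CIDEr is
# corpus-level (IDF over the whole test set), so scores are only meaningful across
# many examples, not a single toy pair like this one.
gts = {0: references}
res = {0: predictions}
cider_score, _ = Cider().compute_score(gts, res)
print("CIDEr:  ", cider_score)
```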
In conclusion, ViSMaP is an unsupervised approach to summarizing long videos that leverages annotated short-form video datasets and a meta-prompting strategy. It first creates high-quality pseudo-summaries through meta-prompting and then trains a summarization model on them, reducing the need for extensive annotation. Experimental results show that ViSMaP performs on par with fully supervised methods and adapts effectively across diverse video datasets. However, its reliance on pseudo-labels from a source-domain model may affect performance under significant domain shifts. In addition, ViSMaP relies solely on visual information. Future work could integrate multimodal data, introduce hierarchical summarization, and develop more generalizable meta-prompting techniques.

Sana Hassan, a consulting intern at Marktechpost and a dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of artificial intelligence and real-life solutions.
