Towards Efficient and Dynamic Video Token Representation for Video Large Language Models



This paper has been withdrawn by Haichao Zhang

[Submitted on 21 Mar 2025 (v1), last revised 2 Apr 2025 (this version, v3)]

View a PDF of the paper titled Token Dynamics: Towards Efficient and Dynamic Video Token Representation for Video Large Language Models, by Haichao Zhang and 1 other authors

No PDF available, click to view other formats

Abstract: Token-based video representation has emerged as a promising approach for enabling LLMs to interpret video content. However, existing token reduction techniques, such as token pruning and token merging, often disrupt essential spatial-temporal positional information and fail to adequately balance computational efficiency with fewer tokens. Consequently, these methods produce long token sequences, which limits their applicability in scenarios that require extreme token compression, such as video large language models. In this paper, we introduce the novel task of extreme short token reduction, which aims to represent extensive video sequences with a minimal number of tokens. To address this challenge, we propose Token Dynamics, a new video representation that dynamically reduces the number of tokens while preserving spatial-temporal coherence. Specifically, we disentangle video representations by separating visual content from grid-level motion information and organizing them into: 1. a concise token hash table, created by clustering tokens that describe object-level content; 2. a token indices key map, capturing detailed spatial-temporal motion patterns across grids; 3. a token hash function, which is applied to the token hash table to reconstruct the token sequence from the key map. Furthermore, we introduce a cross-dynamics attention mechanism that integrates motion features into the token base without increasing the token length, thereby maintaining compactness and spatial-temporal integrity. Experiments show a reduction in token count to only 0.07% of the original tokens, with only a minor performance drop of 1.13%.
In addition, we propose two novel subtasks within extreme token reduction (fixed-length and adaptive-length compression). Our method offers significantly lower theoretical complexity, fewer tokens, and improved throughput, thus providing an efficient solution for video LLMs.
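
The three components listed in the abstract (hash table, key map, hash function) can be illustrated with a small sketch. This is not the authors' implementation: plain k-means stands in for the clustering step as an assumption, the names `build_token_table` and `reconstruct` are hypothetical, and the cross-dynamics attention mechanism is not modeled. It only shows how a short table of representative tokens plus a per-grid index map can stand in for a full token sequence.

```python
import numpy as np

def build_token_table(tokens, num_clusters, num_iters=10, seed=0):
    """Cluster tokens with plain k-means (an assumed stand-in, not the paper's method).

    tokens: (N, D) array of per-grid video token embeddings.
    Returns (table, key_map): table holds (K, D) centroids,
    key_map holds (N,) integer indices into the table.
    """
    rng = np.random.default_rng(seed)
    # Initialise centroids from randomly chosen distinct tokens.
    init = rng.choice(len(tokens), size=num_clusters, replace=False)
    table = tokens[init].copy()
    for _ in range(num_iters):
        # Squared Euclidean distance from every token to every centroid: (N, K).
        dists = ((tokens[:, None, :] - table[None, :, :]) ** 2).sum(axis=-1)
        key_map = dists.argmin(axis=1)
        for k in range(num_clusters):
            members = tokens[key_map == k]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                table[k] = members.mean(axis=0)
    # Final assignment against the updated table.
    dists = ((tokens[:, None, :] - table[None, :, :]) ** 2).sum(axis=-1)
    key_map = dists.argmin(axis=1)
    return table, key_map

def reconstruct(table, key_map):
    """Look up each position's table entry to rebuild a full-length sequence."""
    return table[key_map]

# Toy input: 8 frames, 16 grid cells per frame, 32-dimensional tokens.
rng = np.random.default_rng(0)
frames, grids, dim = 8, 16, 32
tokens = rng.normal(size=(frames * grids, dim))

table, key_map = build_token_table(tokens, num_clusters=4)
approx = reconstruct(table, key_map).reshape(frames, grids, dim)

# 128 original tokens are summarised by 4 table entries plus an 8 x 16 index map.
print(table.shape, key_map.reshape(frames, grids).shape, approx.shape)
```

In this toy setup only the table entries carry embedding vectors, while the key map stores one integer per grid cell, which is the sense in which the sequence of embeddings passed onward can be far shorter than the original.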