Improving Long Video Description through Global Audio-Visual Character Identification

View a PDF of the paper titled StoryTeller: Improving Long Video Description through Global Audio-Visual Character Identification, by Yichen He and 5 other authors

Abstract: Existing large vision-language models (LVLMs) are largely limited to processing short, seconds-long videos and struggle to generate coherent descriptions for videos spanning minutes or more. Long video description introduces new challenges, such as consistent character identification and plot-level descriptions that incorporate both visual and audio information. To address these, we identify audio-visual character identification, which matches character names to each line of dialogue, as a key factor. We propose StoryTeller, a system for generating dense descriptions of long videos that incorporates both low-level visual concepts and high-level plot information. StoryTeller uses a multimodal large language model that integrates visual, audio, and text modalities to perform audio-visual character identification on minute-long video clips. The results are then fed into an LVLM to enhance the consistency of the video description. We validate our approach on movie description tasks and introduce MovieStory101, a dataset with dense descriptions of three-minute movie clips. To evaluate long video descriptions, we create StoryQA, a large set of multiple-choice questions for the MovieStory101 test set. We assess descriptions by inputting them into GPT-4 to answer these questions, using accuracy as an automatic evaluation metric. Experiments show that StoryTeller outperforms all open-source and closed-source baselines on StoryQA, achieving 9.5% higher accuracy than the strongest baseline, Gemini-1.5-pro, and demonstrating a +15.56% advantage in human side-by-side evaluations. In addition, incorporating the audio-visual character identification from StoryTeller improves the performance of all video description models, with Gemini-1.5-pro and GPT-4o showing relative improvements of 5.5% and 13.0%, respectively, in accuracy on StoryQA.
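The evaluation protocol described in the abstract (answer multiple-choice questions from a generated description alone, then report accuracy) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the question record layout, the `storyqa_accuracy` and `relative_improvement` helpers, and the keyword-matching answerer that stands in for the GPT-4 judge. It is not the authors' implementation.

```python
from typing import Callable, Dict, List

# Hypothetical record layout; the real StoryQA schema is not given in the abstract.
Question = Dict[str, object]  # keys: "question", "options", "answer" (index)


def storyqa_accuracy(description: str,
                     questions: List[Question],
                     answer_fn: Callable[[str, str, List[str]], int]) -> float:
    """Fraction of questions answered correctly using only the description."""
    if not questions:
        raise ValueError("need at least one question")
    correct = 0
    for q in questions:
        predicted = answer_fn(description, q["question"], q["options"])
        if predicted == q["answer"]:
            correct += 1
    return correct / len(questions)


def relative_improvement(baseline: float, improved: float) -> float:
    """Relative gain of `improved` over `baseline`, e.g. 0.10 means +10%."""
    return (improved - baseline) / baseline


def keyword_answerer(description: str, question: str, options: List[str]) -> int:
    """Stand-in for the LLM judge: pick the first option named in the description.

    Returns -1 (no answer) when no option appears in the text.
    """
    text = description.lower()
    for i, option in enumerate(options):
        if option.lower() in text:
            return i
    return -1


# Toy questions, invented for this sketch.
questions = [
    {"question": "Who opens the door?", "options": ["Anna", "Ben"], "answer": 0},
    {"question": "Where does the scene end?", "options": ["kitchen", "garage"], "answer": 1},
]

# A description without character names versus one that names the character.
anonymous = "A person opens the door and walks into the garage."
named = "Anna opens the door and walks into the garage."

acc_anonymous = storyqa_accuracy(anonymous, questions, keyword_answerer)
acc_named = storyqa_accuracy(named, questions, keyword_answerer)
gain = relative_improvement(acc_anonymous, acc_named)
```

In this toy setup the description that names the character answers the "who" question while the anonymous one cannot, which mirrors, in a very simplified way, why character identification would matter for a question-answering metric of this kind.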

Submission history

From: Yichen He
[v1] Mon, 11 Nov 2024 15:51:48 UTC (10,632 KB)
[v2] Thu, 6 Mar 2025 09:13:28 UTC (10,683 KB)
[v3] Wed, 30 Jul 2025 12:47:35 UTC (1,803 KB)

2025-07-31 04:00:00
