
This AI Paper Introduces WINGS: A Dual-Learner Architecture to Prevent Text-Only Forgetting in Multimodal Large Language Models

Multimodal LLMs: Expanding Capabilities Across Text and Vision

Large language models (LLMs) have been extended to handle multiple modalities, especially images and text, in order to build more interactive and intuitive AI systems. Multimodal LLMs (MLLMs) can interpret visual scenes, answer questions about images, and hold dialogues that mix text and pictures. Their ability to reason across visual and linguistic domains makes them increasingly valuable for applications such as education, content generation, and interactive assistants.

The Challenge of Text-Only Forgetting in MLLMs

However, integrating vision into LLMs creates a problem. When trained on datasets that mix images with text, MLLMs often lose their ability to handle purely textual tasks. This phenomenon, known as text-only forgetting, occurs because the visual tokens inserted into the language sequence pull the model's attention away from the text. As a result, the MLLM begins to prioritize image-related content and performs poorly on tasks that require only language understanding, such as basic reasoning, reading comprehension, and text-only question answering (Q&A).

Limitations of Current Mitigation Strategies

Several approaches try to address this degradation. Some methods replay large amounts of text-only data during training, while others alternate between text-only and multimodal fine-tuning. These strategies aim to remind the model of its original language abilities. Other designs add adapter layers or prompt-based tuning. However, these techniques often increase training costs, require complex switching logic during inference, or fail to fully restore text understanding. The problem stems largely from how the model's attention shifts once image tokens are inserted into the sequence. A rough sketch of the replay/alternation idea appears below.
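To make the replay and alternation strategies mentioned above concrete, here is a minimal illustrative sketch, not the recipe of any specific cited method. The function name and the `text_ratio` knob are hypothetical.

```python
import random

def mixed_training_stream(text_only_batches, multimodal_batches, text_ratio=0.5):
    """Toy illustration of replay/alternation mitigations: interleave text-only
    batches with multimodal batches so the model keeps seeing pure-language
    data during multimodal fine-tuning. `text_ratio` is a hypothetical knob."""
    while True:
        if random.random() < text_ratio:
            yield ("text_only", random.choice(text_only_batches))
        else:
            yield ("multimodal", random.choice(multimodal_batches))
```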

Introducing WINGS: A Dual-Learner Approach from Alibaba and Nanjing University

Researchers from Alibaba Group's AI Business team and Nanjing University introduced a new approach called WINGS. The design adds two new modules, visual and textual learners, into every layer of the MLLM. These learners operate in parallel with the model's core attention mechanism. The structure resembles "wings" attached to both sides of the attention layers. A routing component controls how much weight each learner receives based on the current token mix, allowing the model to balance its focus between visual and textual information.
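To make the dual-learner idea concrete, the following minimal PyTorch sketch shows one way such a layer could be wired: a per-token router softly mixes the outputs of a visual and a textual side module into the main attention path. The class names, the low-rank learner design, and the linear router are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class ModalityLearner(nn.Module):
    """Lightweight side module ("wing") attached to a decoder layer.
    Hypothetical sketch: a low-rank projection pair acting on hidden states,
    standing in for the visual or textual learner described in the paper."""
    def __init__(self, hidden_dim: int, rank: int = 16):
        super().__init__()
        self.down = nn.Linear(hidden_dim, rank, bias=False)
        self.up = nn.Linear(rank, hidden_dim, bias=False)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        return self.up(self.down(hidden))

class WingsLayer(nn.Module):
    """Decoder layer wrapper with parallel visual/textual learners and a router."""
    def __init__(self, base_layer: nn.Module, hidden_dim: int):
        super().__init__()
        self.base_layer = base_layer            # the original LLM layer
        self.visual_learner = ModalityLearner(hidden_dim)
        self.textual_learner = ModalityLearner(hidden_dim)
        self.router = nn.Linear(hidden_dim, 2)  # assumption: per-token soft routing

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        main_out = self.base_layer(hidden)                # core attention path
        weights = torch.softmax(self.router(hidden), -1)  # [batch, seq, 2]
        side = (weights[..., :1] * self.visual_learner(hidden)
                + weights[..., 1:] * self.textual_learner(hidden))
        return main_out + side                            # residual combination
```

The key design point this sketch tries to capture is that the side learners never replace the base attention; they only add a routed residual, so text-only behavior can be preserved when the router down-weights the visual path.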

Low-Rank Residual Attention (LoRRA): Balancing Efficiency and Modality Awareness

The WINGS architecture uses a mechanism called Low-Rank Residual Attention (LoRRA), which keeps computation lightweight while allowing the learners to capture essential modality-specific information. In the first training stage, only the visual learners are activated to align image features. In the second stage, both the visual and textual learners are trained together with a router module that uses attention weights to allocate responsibility between them. Each learner uses efficient attention blocks to interact with either the image or the surrounding text, and its outputs are combined with those of the main model. This ensures that visual attention does not overwhelm textual understanding.
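The sketch below, reusing the module names from the previous snippet, illustrates one plausible reading of this description: a low-rank cross-attention block whose queries come from the text stream and whose keys and values come from modality tokens, plus a toy helper for the two-stage schedule. The rank, wiring, and freezing policy are assumptions for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LowRankResidualAttention(nn.Module):
    """Hypothetical LoRRA-style block: cross-attention through low-rank projections.
    Queries come from the language hidden states; keys/values come from the
    modality-specific tokens (image features for the visual learner, text
    context for the textual learner)."""
    def __init__(self, hidden_dim: int, rank: int = 32):
        super().__init__()
        self.q = nn.Linear(hidden_dim, rank, bias=False)
        self.k = nn.Linear(hidden_dim, rank, bias=False)
        self.v = nn.Linear(hidden_dim, rank, bias=False)
        self.out = nn.Linear(rank, hidden_dim, bias=False)
        self.scale = rank ** -0.5

    def forward(self, text_hidden: torch.Tensor, modality_tokens: torch.Tensor) -> torch.Tensor:
        q = self.q(text_hidden)                  # [B, T_text, r]
        k = self.k(modality_tokens)              # [B, T_mod, r]
        v = self.v(modality_tokens)              # [B, T_mod, r]
        attn = F.softmax(q @ k.transpose(-1, -2) * self.scale, dim=-1)
        return self.out(attn @ v)                # added as a residual to the main stream

def set_stage(model: nn.Module, stage: int) -> None:
    """Toy two-stage schedule: stage 1 trains only the visual learners;
    stage 2 also unfreezes the textual learners and the router."""
    for name, param in model.named_parameters():
        if "visual_learner" in name:
            param.requires_grad = True
        elif "textual_learner" in name or "router" in name:
            param.requires_grad = (stage == 2)
        else:
            param.requires_grad = False  # keep the base LLM frozen in this sketch
```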

WINGS Performance Across Text-Only and Multimodal Benchmarks

In terms of performance, WINGS delivered strong results. On the text-only MMLU benchmark, it scored 60.53, a 9.70-point improvement over a comparable baseline model. On CMMLU, it scored 69.82, which is 9.36 points above the baseline. On reasoning tasks such as RACE-High it gained 11.9 points, and on WSC it recorded an improvement of 11.12 points. On multimodal benchmarks such as MMMU-VAL, WINGS achieved an improvement of 4.78 points. It also showed strong results on the IIT benchmark, handling mixed text-and-image multi-turn dialogues more effectively than open-source MLLMs of the same scale.

Conclusion: Toward More Balanced and Capable MLLMs

In short, the researchers addressed the problem of catastrophic text-only forgetting in MLLMs by introducing WINGS, an architecture that places dedicated visual and textual learners alongside the core attention. By analyzing attention shifts and intervening in a targeted way, they preserve text-only performance while enhancing visual understanding, offering a more balanced and efficient multimodal model.


Check out the Paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter, join our 100K+ ML SubReddit, and subscribe to our Newsletter.


Nikhil is a consulting intern at Marktechpost. He is pursuing an integrated dual degree in Materials at the Indian Institute of Technology, Kharagpur. Nikhil is an AI/ML enthusiast who is always researching applications in fields such as biomaterials and biomedical science. With a strong background in Material Science, he explores new advancements and creates opportunities to contribute.
