The Future of Vision AI: How Apple’s AIMV2 Leverages Images and Text to Lead the Pack

The landscape of vision model pre-training has evolved considerably, especially with the rise of large language models (LLMs). Traditionally, vision models operated under fixed, predefined paradigms, but LLMs have introduced a more flexible approach, unlocking new ways to leverage pre-trained vision encoders. This shift has prompted a reassessment of pre-training methodologies for vision models to better align with multimodal applications.
In a new paper, Multimodal Autoregressive Pre-training of Large Vision Encoders, an Apple research team introduces AIMV2, a family of vision encoders that employs a multimodal autoregressive pre-training strategy. Unlike conventional methods, AIMV2 is trained to predict both image patches and text tokens within a unified sequence. This joint objective enables the model to excel across a range of tasks, including image recognition, visual grounding, and multimodal understanding.
The core innovation of AIMV2 lies in its generalization of the unimodal autoregressive framework to a multimodal setting. By treating image patches and text tokens as a single sequence, AIMV2 unifies the prediction process across both modalities. This approach strengthens the model's ability to capture complex relationships between visual and textual information.
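As a minimal NumPy sketch of this idea, the snippet below builds one unified sequence from image-patch embeddings followed by text-token embeddings. The dimensions and variable names are illustrative assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dimensions; the actual AIMV2 models are far larger.
num_patches, num_tokens, d_model = 4, 3, 8

# Image patches come first, text tokens follow, so the text portion
# of the sequence can condition on the full image prefix.
patch_embeddings = rng.normal(size=(num_patches, d_model))
token_embeddings = rng.normal(size=(num_tokens, d_model))

# One unified sequence: [patch_1 ... patch_N, token_1 ... token_M]
sequence = np.concatenate([patch_embeddings, token_embeddings], axis=0)
assert sequence.shape == (num_patches + num_tokens, d_model)
```

Ordering patches before tokens is what lets a single autoregressive objective cover both modalities in one pass.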
AIMV2's pre-training relies on a causal multimodal decoder that first regresses image patches and then generates text tokens autoregressively. This simple yet effective design offers several advantages:
- Simplicity and efficiency: Pre-training requires neither extremely large batch sizes nor complex inter-batch communication, making it straightforward to implement and scale.
- Compatibility with multimodal LLM applications: The architecture integrates naturally with LLM-powered multimodal systems, enabling seamless interoperability.
- Dense supervision: By deriving a learning signal from every image patch and text token, AIMV2 achieves denser supervision than conventional discriminative objectives, enabling more sample-efficient training.
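The dense-supervision objective described above can be sketched as a combined loss: a regression term over every image patch plus a cross-entropy term over every text token. The shapes, variable names, and equal weighting of the two terms below are assumptions for illustration, not details taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dimensions.
num_patches, patch_dim = 4, 6   # image patches regressed as raw values
num_tokens, vocab_size = 3, 10  # text tokens predicted over a vocabulary

# Stand-ins for decoder outputs and targets.
patch_predictions = rng.normal(size=(num_patches, patch_dim))
patch_targets = rng.normal(size=(num_patches, patch_dim))
token_logits = rng.normal(size=(num_tokens, vocab_size))
token_targets = rng.integers(0, vocab_size, size=num_tokens)

# Patch term: mean squared error over every patch (a signal per patch).
patch_loss = np.mean((patch_predictions - patch_targets) ** 2)

# Token term: cross-entropy over every text token (a signal per token).
log_probs = token_logits - np.log(np.sum(np.exp(token_logits), axis=-1, keepdims=True))
token_loss = -np.mean(log_probs[np.arange(num_tokens), token_targets])

# Combined objective; the 1:1 weighting here is an assumption.
total_loss = patch_loss + token_loss
```

Because every patch and every token contributes a loss term, each training sample yields many more gradient signals than a single contrastive comparison would.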
The AIMV2 architecture builds on the Vision Transformer (ViT), a well-established model for vision tasks. However, the AIMV2 team introduces several key modifications to enhance its performance:
- Prefix attention mask: A prefix attention mask is applied within the vision encoder, enabling bidirectional attention during inference without additional adjustments.
- Feedforward and normalization upgrades: The feedforward network (FFN) uses the SwiGLU activation function, and all normalization layers are replaced with RMSNorm. These choices, inspired by the success of similar techniques in language modeling, improve training stability and efficiency.
- Unified multimodal decoder: A joint autoregressive decoder handles the generation of image patches and text tokens simultaneously, further strengthening AIMV2's multimodal capabilities.
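The three architectural pieces above can be sketched in a few lines of NumPy. These are generic reference implementations of RMSNorm, a SwiGLU branch, and a prefix attention mask, under assumed shapes and names; they are not Apple's code:

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    """RMSNorm: rescale by root-mean-square; no mean-centering, unlike LayerNorm."""
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return x / rms * weight

def swiglu(x, w_gate, w_up):
    """SwiGLU FFN branch: a SiLU-gated linear unit, as popularized in LLM FFNs."""
    gate = x @ w_gate
    silu = gate / (1.0 + np.exp(-gate))  # SiLU (swish) activation
    return silu * (x @ w_up)

def prefix_attention_mask(prefix_len, seq_len):
    """Causal mask whose first `prefix_len` positions attend bidirectionally."""
    mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))  # standard causal mask
    mask[:prefix_len, :prefix_len] = True  # full attention within the prefix
    return mask

# Example: a 5-position sequence whose first 3 positions form the prefix.
mask = prefix_attention_mask(prefix_len=3, seq_len=5)
assert mask[0, 2]        # prefix positions see each other in both directions
assert not mask[3, 4]    # positions after the prefix remain strictly causal
```

The prefix mask is what lets the encoder be trained with an autoregressive objective yet still attend bidirectionally over the image at inference time.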
Experimental evaluations reveal AIMV2's impressive capabilities. The AIMV2-3B encoder achieves 89.5% accuracy on ImageNet-1k with a frozen trunk, demonstrating strong image recognition performance. Moreover, AIMV2 consistently outperforms state-of-the-art contrastive models such as CLIP and SigLIP in multimodal image understanding across various benchmarks.
A key contributor to this success is AIMV2's ability to extract learning signals from every text token and image patch. This dense supervision enables more effective training with fewer samples than other self-supervised or contrastively pre-trained models.
AIMV2 represents a significant step forward in the development of vision encoders. By unifying image and text prediction under a single multimodal framework, AIMV2 achieves strong performance across a wide range of tasks. Its straightforward training process, together with architectural improvements such as SwiGLU and RMSNorm, ensures scalability and adaptability. As vision models continue to scale, AIMV2 offers a blueprint for more efficient, versatile, and unified multimodal learning systems.
The code is available on the project's GitHub. The paper Multimodal Autoregressive Pre-training of Large Vision Encoders is on arXiv.
Author: Hecate | Editor: Chain Zhang
2024-12-08 00:43:00