NVIDIA AI Releases Llama Nemotron Nano VL: A Compact Vision-Language Model Optimized for Document Understanding

NVIDIA has introduced Llama Nemotron Nano VL, a vision-language model (VLM) designed to handle document-level understanding tasks efficiently and accurately. Built on the Llama 3.1 architecture and paired with a lightweight vision encoder, this release targets applications that require accurate parsing of complex document structures such as scanned forms, financial reports, and technical diagrams.
Model Overview and Architecture
Llama Nemotron Nano VL combines the CRadioV2-H vision encoder with a Llama 3.1 8B instruction-tuned language model, forming a pipeline that can jointly process multimodal inputs, including multi-page documents with both visual and textual elements.
The architecture is optimized for token-efficient inference and supports a context length of up to 16K tokens across image and text sequences. The model can process multiple images alongside textual input, which makes it suitable for long-form multimodal tasks. Vision-text alignment is achieved through projection layers and rotary positional encoding tailored to image patch embeddings.
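To make the rotary positional encoding mentioned above more concrete, here is a minimal sketch of the standard one-dimensional formulation in plain Python. It is not NVIDIA's implementation: the function names `rope` and `dot`, the base of 10000, and the toy vectors are assumptions for illustration, and the image-patch-specific variant used by the model is not described in this article.

```python
import math


def rope(vec, position, base=10000.0):
    """Apply standard rotary position encoding to one embedding vector.

    Consecutive pairs (vec[i], vec[i + 1]) are rotated by an angle that
    grows with `position` and shrinks with the pair index.
    """
    dim = len(vec)
    if dim % 2 != 0:
        raise ValueError("embedding dimension must be even")
    out = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)  # per-pair rotation angle
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


# Toy query/key vectors (hypothetical values).
q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]

# The attention score depends only on the relative offset (2 in both cases).
score_near = dot(rope(q, 5), rope(k, 3))
score_far = dot(rope(q, 12), rope(k, 10))
```

The useful property is that the two scores agree up to floating-point error, because rotating query and key by position-dependent angles leaves only their relative offset in the dot product.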
Training was conducted in three stages:
- Stage 1: Interleaved image-text pretraining on commercial image and video datasets.
- Stage 2: Multimodal instruction tuning to enable interactive prompting.
- Stage 3: Re-blending of text-only instruction data to improve performance on standard LLM benchmarks.
All training was carried out using NVIDIA’s Megatron-LLM framework with the Energon dataloader, distributed across clusters of A100 and H100 GPUs.
Benchmark Results and Evaluation
Llama Nemotron Nano VL was evaluated on OCRBench v2, a benchmark designed to assess document-level vision-language understanding across OCR, table parsing, and diagram reasoning tasks. OCRBench includes more than 10,000 human-verified QA pairs spanning documents from domains such as finance, healthcare, legal, and scientific publishing.
The results indicate that the model achieves state-of-the-art accuracy among compact VLMs on this benchmark. Notably, its performance is competitive with larger, less efficient models, particularly in extracting structured data (for example, tables and key-value pairs) and answering layout-dependent queries.
The model also generalizes to non-English documents and degraded scan quality, which reflects its robustness under real-world conditions.
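As a rough illustration of how accuracy over QA pairs can be scored, the sketch below computes a normalized exact-match rate. This is a simplified stand-in and not the OCRBench v2 scoring protocol, which this article does not describe; the function names and sample answers are hypothetical.

```python
import re
import unicodedata


def normalize(answer):
    """Unicode-normalize, lowercase, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", answer).lower()
    return re.sub(r"\s+", " ", text).strip()


def exact_match_accuracy(predictions, references):
    """Fraction of predictions equal to their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not references:
        return 0.0
    hits = sum(normalize(p) == normalize(r)
               for p, r in zip(predictions, references))
    return hits / len(references)


# Hypothetical model answers and ground-truth answers for three questions.
predicted = ["  Total: 42 ", "ACME Corp", "2021"]
expected = ["total: 42", "Acme Corp", "2022"]
accuracy = exact_match_accuracy(predicted, expected)  # two of three match
```

Exact match is strict, so real document benchmarks typically pair it with task-specific metrics for tables or long-form text.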
Deployment, Quantization, and Efficiency
Nemotron Nano VL is designed for flexible deployment and supports both server and edge inference. NVIDIA provides a quantized 4-bit version (AWQ) for efficient inference using TinyChat and TensorRT-LLM, with compatibility for Jetson Orin and other resource-constrained environments.
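To give a feel for what 4-bit weight storage involves, the following sketch shows plain group-wise round-to-nearest quantization in Python. It is deliberately not AWQ itself: AWQ additionally rescales weight channels using activation statistics before quantizing, which is omitted here. The group size, function names, and sample weights are assumptions chosen for illustration.

```python
def quantize_group(weights, bits=4):
    """Asymmetric round-to-nearest quantization of one group of weights."""
    levels = (1 << bits) - 1  # 15 distinct steps for 4-bit codes
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo


def quantize(weights, group_size=4, bits=4):
    """Split weights into groups, each with its own scale and offset."""
    return [quantize_group(weights[start:start + group_size], bits)
            for start in range(0, len(weights), group_size)]


def dequantize(groups):
    """Reconstruct approximate float weights from codes, scales, offsets."""
    restored = []
    for codes, scale, lo in groups:
        restored.extend(lo + c * scale for c in codes)
    return restored


# Hypothetical weight values; real groups are usually much larger.
weights = [0.12, -0.5, 0.33, 0.9, -1.2, 0.05, 0.7, -0.01]
groups = quantize(weights, group_size=4)
restored = dequantize(groups)
```

Each restored weight differs from the original by at most half a quantization step of its group. As back-of-the-envelope arithmetic, an 8B-parameter language model needs roughly 4 GB for weights at 4 bits per parameter versus roughly 16 GB at 16 bits, before counting per-group scales, activations, and the vision encoder.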
The main technical features include:
- Modular NIM (NVIDIA Inference Microservice) support, which simplifies API integration
- ONNX and TensorRT export support, which ensures compatibility with hardware acceleration
- A precomputed vision embeddings option, which reduces latency for static image documents
Conclusion
Llama Nemotron Nano VL represents a well-engineered trade-off between performance, context length, and deployment efficiency in the domain of document understanding. Its architecture, anchored in Llama 3.1 and enhanced with a compact vision encoder, offers a practical solution for enterprise applications that require multimodal understanding under strict latency or hardware constraints.
By topping OCRBench v2 while maintaining a deployable footprint, Nemotron Nano VL positions itself as a viable model for tasks such as automated document QA, intelligent OCR, and information extraction pipelines.
Check out the technical details and the model on Hugging Face. All credit for this research goes to the researchers of this project.
Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of artificial intelligence for social good. His most recent endeavor is the launch of an artificial intelligence media platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform has more than 2 million monthly views, which illustrates its popularity among readers.
2025-06-04 06:47:00