H Company Releases Holo1.5: Open-Weight Computer-Use VLMs Focused on GUI Localization and UI-VQA
H Company has released Holo1.5, a family of open vision-language models built for computer-use (CU) agents that act on real user interfaces through screenshots and pointer/keyboard actions. The release includes 3B, 7B, and 72B checkpoints, with a reported accuracy gain of roughly 10% over Holo1 across sizes. The 7B model is licensed under Apache-2.0; the 3B and 72B checkpoints inherit research-only restrictions from their upstream base models. The series targets two core capabilities that matter for CU: precise localization of UI elements (coordinate prediction) and UI visual question answering (UI-VQA) for understanding screen state.

Why does UI element localization matter?
Localization is how an agent turns an intent into a pixel-level action: an instruction such as “Open Spotify” becomes predicted click coordinates for the right control on the current screen. Failures here cascade: a single misclick can derail a multi-step workflow. Holo1.5 is trained and evaluated on high-resolution screens (up to 3840 x 2160) across desktop (macOS, Ubuntu, Windows), web, and mobile interfaces, which improves robustness on dense professional UIs where small icons and targets raise error rates.
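As a rough illustration of the pixel-level step, the sketch below maps a predicted point from a downscaled image back to native screen pixels. It assumes the model saw a resized screenshot and returned coordinates in that resized space; the article does not say which coordinate space Holo1.5 uses, and `to_native_pixels` is a hypothetical helper, not part of any Holo1.5 API.

```python
from typing import Tuple


def to_native_pixels(
    point: Tuple[float, float],
    model_size: Tuple[int, int],
    native_size: Tuple[int, int],
) -> Tuple[int, int]:
    """Map an (x, y) point from the image the model saw to native screen pixels.

    Assumption: the point is expressed in the resized image's pixel space.
    """
    x, y = point
    model_w, model_h = model_size
    native_w, native_h = native_size
    # Scale each axis independently, then clamp so the click stays on screen.
    px = min(max(round(x * native_w / model_w), 0), native_w - 1)
    py = min(max(round(y * native_h / model_h), 0), native_h - 1)
    return px, py


# Hypothetical case: a 3840 x 2160 screen was downscaled to 1920 x 1080.
click = to_native_pixels((960.0, 540.0), (1920, 1080), (3840, 2160))
print(click)  # (1920, 1080)
```

On a 4K screen a small error in the resized space doubles in native pixels under this scaling, which is one reason small icons are hard targets.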
How is Holo1.5 different from general VLMs?
General VLMs are often optimized for broad grounding and captioning; CU agents need reliable pointing in addition to interface comprehension. Holo1.5 aligns its data and objectives with these requirements: large-scale supervised fine-tuning (SFT) on GUI tasks, followed by GRPO-style reinforcement learning to tighten coordinate accuracy and decision reliability. The models are delivered as perception components to be embedded in planners/executors (for example, Surfer-style agents), not as end-to-end agents.
How does Holo1.5 perform on localization benchmarks?
Holo1.5 reports state-of-the-art GUI grounding across ScreenSpot-v2, ScreenSpot-Pro, GroundUI-Web, Showdown, and WebClick. Representative 7B numbers (averages over six localization tracks):
- Holo1.5-7B: 77.32
- Qwen2.5-VL-7B: 60.73
On ScreenSpot-Pro (professional applications with dense layouts), Holo1.5-7B achieves 57.94 versus 29.00 for Qwen2.5-VL-7B, indicating materially better target selection under realistic conditions. The 3B and 72B checkpoints show similar relative gains against their Qwen2.5-VL counterparts.
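To put the reported scores in perspective, the short calculation below derives the absolute and relative gains from the four figures quoted above. Only those four numbers come from the article; the helper `gains` is illustrative.

```python
from typing import Tuple


def gains(new: float, base: float) -> Tuple[float, float]:
    """Return (absolute gain in points, relative gain in percent)."""
    return new - base, (new - base) / base * 100


avg_abs, avg_rel = gains(77.32, 60.73)  # 7B averages over localization tracks
pro_abs, pro_rel = gains(57.94, 29.00)  # 7B on ScreenSpot-Pro

print(f"Average: +{avg_abs:.2f} points ({avg_rel:.1f}% relative)")
# Average: +16.59 points (27.3% relative)
print(f"ScreenSpot-Pro: +{pro_abs:.2f} points ({pro_rel:.1f}% relative)")
# ScreenSpot-Pro: +28.94 points (99.8% relative)
```

In other words, the 7B model's ScreenSpot-Pro score is almost exactly double that of its Qwen2.5-VL counterpart.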




Does it also improve UI understanding (UI-VQA)?
Yes. On VisualWebBench, WebSRC, and ScreenQA (short/complex), Holo1.5 delivers consistent accuracy improvements. The reported 7B average is about 88.17, with the 72B variant at around 90.00. This matters for agent reliability: queries such as “Which tab is active?” or “Is the user signed in?” reduce ambiguity and enable verification between actions.
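A minimal sketch of how such queries could serve as checks between actions is shown below. The `ask` callable stands in for any UI-VQA model call; its signature, the `verify_state` helper, and the stub `fake_ask` are assumptions made for illustration and do not reflect a published Holo1.5 interface.

```python
from typing import Callable


def verify_state(
    ask: Callable[[bytes, str], str],
    screenshot: bytes,
    question: str,
    expected: str,
) -> bool:
    """Ask a UI-VQA model about the screen and compare with the expected answer."""
    answer = ask(screenshot, question)
    # Normalize case, whitespace, and a trailing period before comparing.
    return answer.strip().lower().rstrip(".") == expected.strip().lower()


def fake_ask(screenshot: bytes, question: str) -> str:
    """Stub model so the sketch runs without any real model (hypothetical)."""
    return "Yes." if "signed in" in question else "Settings"


signed_in = verify_state(fake_ask, b"", "Is the user signed in?", "yes")
on_inbox = verify_state(fake_ask, b"", "Which tab is active?", "inbox")
print(signed_in, on_inbox)  # True False
```

An agent could run a check like this after each action and stop or retry when the answer does not match what the plan expects.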
How does it compare to specialized and closed systems?
Under the published evaluation setup, Holo1.5 outperforms open baselines (Qwen2.5-VL) and competitive specialized systems (for example, UI-TARS and UI-Venus), and shows advantages over closed generalist models (for example, Claude Sonnet 4) on the UI tasks cited above. Because protocols, prompts, and screen resolutions affect results, practitioners should replicate the evaluation with their own harness before drawing deployment-level conclusions.
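For teams replicating results on their own screens, a scoring harness can be very small. The sketch below assumes a click counts as correct when it lands inside the target's bounding box, which is a common convention for GUI grounding benchmarks but is stated here as an assumption, not as H Company's evaluation code. The sample predictions and boxes are made up.

```python
from typing import Iterable, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (left, top, right, bottom)


def hit(point: Point, box: Box) -> bool:
    """True when the predicted point lies inside the target box (inclusive)."""
    x, y = point
    left, top, right, bottom = box
    return left <= x <= right and top <= y <= bottom


def click_accuracy(preds: Iterable[Point], boxes: Iterable[Box]) -> float:
    """Percentage of predictions that land inside their target boxes."""
    pairs = list(zip(preds, boxes))
    if not pairs:
        return 0.0
    return 100.0 * sum(hit(p, b) for p, b in pairs) / len(pairs)


# Made-up sample: two of the three predictions fall inside their boxes.
preds = [(10, 10), (50, 50), (200, 5)]
boxes = [(0, 0, 20, 20), (60, 60, 80, 80), (190, 0, 210, 10)]
score = click_accuracy(preds, boxes)
print(f"{score:.2f}")  # 66.67
```

Keeping prompts, screenshot resolution, and the scoring rule fixed across models is what makes numbers from a harness like this comparable.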
What are the integration implications for CU agents?
- Higher click reliability at native resolution: Better ScreenSpot-Pro performance suggests fewer misclicks in complex applications (IDEs, design suites, admin consoles).
- Stronger state tracking: Higher UI-VQA accuracy improves detection of logged-in state, the active tab, modal visibility, and success/failure cues.
- Pragmatic licensing path: The 7B model (Apache-2.0) is suitable for production. The 72B checkpoint is currently research-only; use it for internal experiments or to gauge headroom.
Where does Holo1.5 fit in a modern computer-use (CU) stack?
Think of Holo1.5 as the screen perception layer:
- Input: Full-resolution screenshots (optionally with UI metadata).
- Outputs: Target coordinates with confidence; short textual answers about screen state.
- Downstream: Action policies convert predictions into click/keyboard events; monitoring verifies post-conditions and triggers retries or fallbacks.
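The three layers above can be sketched as a small loop. Everything in this example is hypothetical: the `Prediction` type, the callables `capture`, `locate`, `click`, and `check`, and the confidence threshold are placeholders for whatever a real agent stack provides, not Holo1.5 interfaces.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Prediction:
    x: int
    y: int
    confidence: float


def act(
    capture: Callable[[], bytes],
    locate: Callable[[bytes, str], Prediction],
    click: Callable[[int, int], None],
    check: Callable[[], bool],
    target: str,
    min_confidence: float = 0.5,
    max_attempts: int = 2,
) -> bool:
    """Perceive, act, and verify; retry with a fresh screenshot on failure."""
    for _ in range(max_attempts):
        pred = locate(capture(), target)  # perception layer
        if pred.confidence < min_confidence:
            continue  # too uncertain to click; recapture and retry
        click(pred.x, pred.y)  # action policy
        if check():  # monitoring: post-condition holds
            return True
    return False  # caller decides on a fallback


# Stubbed run so the sketch executes without a screen or a model.
clicks: List[Tuple[int, int]] = []
ok = act(
    capture=lambda: b"",
    locate=lambda shot, target: Prediction(120, 48, 0.9),
    click=lambda x, y: clicks.append((x, y)),
    check=lambda: True,
    target="Save button",
)
print(ok, clicks)  # True [(120, 48)]
```

Passing the layers in as callables keeps the perception model swappable, which fits the article's framing of Holo1.5 as a component inside a planner/executor and not a full agent.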
Summary
Holo1.5 narrows a practical gap in CU systems by pairing strong coordinate grounding with concise interface understanding. If you need a commercially usable base today, start with Holo1.5-7B (Apache-2.0), benchmark it on your own screens, and instrument your planner and safety layers around it.
Check out the models on Hugging Face and the technical details.
Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of artificial intelligence for social good. His most recent endeavor is the launch of an artificial intelligence media platform, Marktechpost, which stands out for in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform has more than 2 million monthly views.
2025-09-18 08:14:00



