ByteDance Releases UI-TARS-1.5: An Open-Source Multimodal AI Agent Built upon a Powerful Vision-Language Model

ByteDance has released UI-TARS-1.5, an updated version of its multimodal agent framework focused on graphical user interface (GUI) and game environments. Designed as a vision-language model capable of perceiving screen content and performing interactive tasks, UI-TARS-1.5 delivers consistent improvements across a range of GUI automation and game reasoning benchmarks. Notably, it surpasses several leading models, including OpenAI Operator and Claude 3.7, in both accuracy and task completion across multiple environments.
The release continues ByteDance's research direction of building native agent models, with the aim of unifying perception, cognition, and action within an integrated architecture that supports direct engagement with GUIs and visual content.
A native agent approach to GUI interaction
Unlike tool-augmented LLMs or function-calling architectures, UI-TARS-1.5 is trained end-to-end to perceive visual input (screenshots) and generate native, human-like control actions, such as mouse movements and keyboard input. This positions the model closer to how human users interact with digital systems.
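To make the end-to-end idea concrete, the sketch below shows a minimal perceive-decide-act cycle: a screenshot is captured, a vision-language model maps it (plus an instruction) to a textual action, and the action is replayed as real mouse or keyboard events with pyautogui. The query_model callable and the action string format are illustrative assumptions, not the released UI-TARS interface.

```python
# Minimal sketch of an end-to-end perceive-decide-act loop (illustrative only).
# "query_model" is a hypothetical callable wrapping the vision-language model;
# the action string format ("click(x=..., y=...)", "type(...)") is an assumption.
import re
from typing import Callable

import pyautogui        # desktop automation: screenshots, mouse, keyboard
from PIL import Image


def perceive_act_step(instruction: str,
                      query_model: Callable[[Image.Image, str], str]) -> str:
    # 1. Perceive: capture the current screen as the model's visual input.
    screenshot = pyautogui.screenshot()

    # 2. Decide: the model maps (screenshot, instruction) directly to a
    #    native control action, with no intermediate tool or function API.
    action = query_model(screenshot, instruction)   # e.g. "click(x=512, y=300)"

    # 3. Act: replay the predicted action as real input events.
    click = re.match(r"click\(x=(\d+), y=(\d+)\)", action)
    if click:
        x, y = map(int, click.groups())
        pyautogui.click(x, y)
    elif action.startswith("type(") and action.endswith(")"):
        pyautogui.write(action[len("type("):-1])
    return action
```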
UI-TARS-1.5 builds on its predecessor by introducing several architectural and training improvements:
- Perception and reasoning integration: The model jointly encodes screen images and textual instructions, supporting complex task comprehension and visual grounding. Reasoning is supported through a multi-step "think-then-act" mechanism, which separates high-level planning from low-level execution (see the sketch after this list).
- Unified action space: The action representation is designed to be platform-agnostic, enabling a consistent interface across desktop, mobile, and game environments.
- Self-evolution via replay traces: The training pipeline incorporates reflective online trace data. This allows the model to iteratively refine its behavior by analyzing past interactions, reducing reliance on curated demonstrations (a schematic of this loop follows below).
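A rough picture of how a unified action space and the "think-then-act" split might look in code is sketched below. The Action fields, the "Thought:"/"Action:" output format, and the parser are assumptions made for illustration; they are not the model's actual schema.

```python
# Illustrative sketch: a platform-agnostic action type plus a "think-then-act"
# split of the model's output. Field names and the output format are assumptions.
from dataclasses import dataclass
from typing import Optional


@dataclass
class Action:
    # One representation covers desktop, mobile, and game surfaces alike.
    kind: str                      # "click", "drag", "type", "scroll", "key", ...
    x: Optional[float] = None      # normalized [0, 1] screen coordinates
    y: Optional[float] = None
    text: Optional[str] = None     # payload for "type" / "key" actions


def parse_action(action_str: str) -> Action:
    # Tiny parser for strings like "click(x=0.42, y=0.77)" (assumed format).
    name, _, args = action_str.partition("(")
    fields = dict(p.strip().split("=", 1)
                  for p in args.rstrip(")").split(",") if "=" in p)
    return Action(
        kind=name.strip(),
        x=float(fields["x"]) if "x" in fields else None,
        y=float(fields["y"]) if "y" in fields else None,
        text=fields.get("text"),
    )


def think_then_act(model_output: str) -> tuple[str, Action]:
    """Separate the high-level plan ("thought") from the executable action."""
    thought, _, action_part = model_output.partition("Action:")
    thought = thought.removeprefix("Thought:").strip()
    return thought, parse_action(action_part.strip())
```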
Collectively, these improvements enable UI-TARS-1.5 to support long-horizon interaction, error recovery, and compositional task planning, capabilities that are important for realistic UI navigation and control.
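The self-evolution loop mentioned above can be pictured as collect-filter-retrain over the agent's own interaction traces. The following schematic uses assumed agent and environment interfaces (act, step, finetune); it is a conceptual sketch, not the actual UI-TARS training pipeline.

```python
# Schematic of self-improvement from replay traces (interfaces are assumed).
from dataclasses import dataclass, field


@dataclass
class Trace:
    """One recorded episode: (observation, action) steps plus the outcome."""
    steps: list = field(default_factory=list)
    success: bool = False


def collect_traces(agent, env, episodes: int) -> list:
    # Roll out the current agent online and keep full interaction histories.
    traces = []
    for _ in range(episodes):
        trace, obs, done = Trace(), env.reset(), False
        while not done:
            action = agent.act(obs)                      # assumed agent API
            trace.steps.append((obs, action))
            obs, done, trace.success = env.step(action)  # assumed env API
        traces.append(trace)
    return traces


def self_evolve(agent, env, rounds: int = 3):
    # Iteratively fine-tune on the agent's own successful traces,
    # reducing dependence on hand-curated demonstrations.
    for _ in range(rounds):
        traces = collect_traces(agent, env, episodes=100)
        agent.finetune([t for t in traces if t.success])  # assumed hook
    return agent
```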
Benchmarking and evaluation
The model has been evaluated on several benchmark suites that assess agent behavior in both GUI and game-based tasks. These benchmarks offer a standardized way to measure model performance across reasoning, grounding, and long-horizon execution.
GUI agent tasks
- OSWorld (100 steps): UI-TARS-1.5 achieves a success rate of 42.5%, outperforming OpenAI Operator (36.4%) and Claude 3.7 (28%). The benchmark evaluates long-context GUI tasks in a synthetic operating-system environment.
- Windows Agent Arena (50 steps): Scoring 42.1%, the model improves significantly over prior baselines (e.g., 29.8%), indicating robust handling of desktop environments.
- Android World: The model reaches a 64.2% success rate, suggesting generalization to mobile operating systems.
Visual grounding and screen understanding
- ScreenSpot-V2: The model achieves 94.2% accuracy in locating GUI elements, outperforming Operator (87.9%) and Claude 3.7 (87.6%).
- ScreenSpotPro: In a more complex grounding benchmark, UI-TARS-1.5 scores 61.6%, considerably ahead of Operator (23.4%) and Claude 3.7 (27.7%).
These results demonstrate consistent improvements in screen understanding and action grounding, which are critical for real-world GUI agents.
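For context, ScreenSpot-style grounding is typically scored by whether the model's predicted click point falls inside the target element's bounding box. A minimal version of that accuracy computation (with assumed input formats) is sketched below.

```python
# Minimal point-in-box grounding accuracy for ScreenSpot-style evaluation.
# Input formats are assumptions: predictions are (x, y) click points,
# ground truth entries are (x1, y1, x2, y2) element bounding boxes.

def grounding_accuracy(predictions, boxes) -> float:
    hits = sum(
        1
        for (px, py), (x1, y1, x2, y2) in zip(predictions, boxes)
        if x1 <= px <= x2 and y1 <= py <= y2
    )
    return hits / len(boxes)

# Example: two of three predicted points land inside their target boxes.
preds = [(120, 40), (300, 310), (15, 500)]
boxes = [(100, 20, 180, 60), (290, 300, 360, 330), (600, 480, 700, 520)]
print(grounding_accuracy(preds, boxes))   # ~0.67
```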
Game environments
- Poki Games: UI-TARS-1.5 achieves a 100% task completion rate across 14 mini-games. These games vary in mechanics and context, requiring the model to generalize across interactive dynamics.
- Minecraft (MineRL): The model achieves 42% success on mining tasks and 31% on mob-killing tasks when using the "think-then-act" module, suggesting it can support high-level planning in open-ended environments.
Accessibility and tooling
UI-TARS-1.5 is open-sourced under the Apache 2.0 license and is available through several deployment options.
In addition to the model, the project provides detailed documentation, replay data, and evaluation tooling to facilitate experimentation and reproducibility.
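As one way to experiment with the release, the sketch below loads a checkpoint through the standard Hugging Face transformers vision-language path and asks it for the next action given a screenshot. The model ID, its compatibility with AutoModelForVision2Seq, and the prompt format are assumptions; consult the official repository for the supported inference setup.

```python
# Hedged sketch: querying a UI-TARS-style checkpoint with Hugging Face
# transformers. The Hub ID and prompt format are assumptions, not official docs.
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

MODEL_ID = "ByteDance-Seed/UI-TARS-1.5-7B"   # assumed Hub identifier

processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForVision2Seq.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto", trust_remote_code=True
)

screenshot = Image.open("screenshot.png")
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Open the Settings app and enable dark mode."},
    ],
}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=screenshot, return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```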
Conclusion
UI-TARS-1.5 represents technically sound progress in the field of multimodal AI agents, particularly those focused on GUI control and grounded visual reasoning. Through a combination of vision-language integration, memory mechanisms, and structured action planning, the model demonstrates strong performance across a diverse set of interactive environments.
Rather than pursuing broad generality, the model is tuned for task-oriented multimodal reasoning, targeting the real-world challenge of interacting with software through visual understanding. Its open-source release offers a practical framework for researchers and developers interested in exploring native agent interfaces or automating interactive systems through language and vision.

Nikhil is an intern consultant at Marktechpost. He is pursuing an integrated dual degree in Materials at the Indian Institute of Technology, Kharagpur. Nikhil is an AI/ML enthusiast who is always researching applications in fields like biomaterials and biomedical science. With a strong background in materials science, he explores new advancements and creates opportunities to contribute.

2025-04-21 07:09:00