New vision model from Cohere runs on two GPUs, beats top-tier VLMs on visual tasks

The rise of deep research features and other AI-powered analysis has led to more models and services that aim to simplify that process and read more of the documents businesses actually use.

Canadian AI company Cohere is banking on its models, including a newly released visual model, to make the case that deep research features should also be optimized for enterprise use cases.

The company has released Command A Vision, a visual model that specifically targets enterprise use cases and is built on the back of its Command A model. The 112 billion parameter model can “unlock valuable insights from visual data, and make highly accurate, data-driven decisions through document optical character recognition (OCR) and image analysis,” the company says.

“Whether it’s interpreting product manuals with complex diagrams or analyzing photographs of real-world scenes for risk detection, Command A Vision excels at tackling the most demanding enterprise vision challenges,” the company said in a blog post.

This means that Command A Vision can read and analyze the most common types of images enterprises need: graphs, charts, diagrams, scanned documents and PDFs.

@cohere just dropped Command A Vision on Hugging Face
Designed for enterprise multimodal use cases: interpreting product manuals, analyzing photos, asking about charts …
A 112B dense vision-language model with SOTA performance - check out the benchmark metrics in … pic.twitter.com/ormfm5f8cf
Jeff Boudier (@jeffboudier) July 31, 2025

Since it is built on Command A’s architecture, Command A Vision requires two or fewer GPUs, just like the text model. The vision model also retains Command A’s text capabilities, so it can read words in images, and it understands at least 23 languages. Cohere said that, unlike other models, Command A Vision reduces the total cost of ownership for enterprises and is fully optimized for retrieval use cases.
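As a rough, weights-only check on the two-GPU claim, the sketch below multiplies the 112 billion parameters by the bytes each parameter occupies at a few common precisions. The 80 GB per-GPU figure and the list of precisions are assumptions for illustration; the article does not say which GPU or precision Cohere has in mind, and the estimate ignores activations and the KV cache.

```python
PARAMS = 112e9  # parameter count reported in the article

# Bytes per parameter at common precisions (standard sizes, not Cohere-specific).
BYTES_PER_PARAM = {"bf16": 2.0, "int8": 1.0, "int4": 0.5}

GPU_MEMORY_GB = 80  # assumption: 80 GB per accelerator
NUM_GPUS = 2        # from the article: two or fewer GPUs


def weight_memory_gb(params: float, bytes_per_param: float) -> float:
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9


def weights_fit(params: float, bytes_per_param: float,
                num_gpus: int = NUM_GPUS, gpu_gb: float = GPU_MEMORY_GB) -> bool:
    """True if the weights alone fit in the combined memory of the GPUs."""
    return weight_memory_gb(params, bytes_per_param) <= num_gpus * gpu_gb


if __name__ == "__main__":
    for precision, size in BYTES_PER_PARAM.items():
        needed = weight_memory_gb(PARAMS, size)
        print(f"{precision}: {needed:.0f} GB of weights, "
              f"fits on {NUM_GPUS} GPUs: {weights_fit(PARAMS, size)}")
```

Real deployments need headroom beyond the weights, so a result of "fits" here is a necessary condition, not a sufficient one.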

How Cohere is architecting Command A

Cohere said it followed a Llava architecture to build its Command A models, including the visual model. This architecture turns visual features into soft vision tokens, which can be divided into different tiles.

These tiles are passed into the Command A text tower, “a dense, 111B parameters textual LLM,” the company said. “In this manner, a single image consumes up to 3,328 tokens.”
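To make the per-image figure concrete, here is a minimal sketch of a token budget calculation based on the 3,328-token upper bound. The function names, the context budget and the amount reserved for text are hypothetical; the article does not state the model's context length.

```python
MAX_TOKENS_PER_IMAGE = 3328  # upper bound per image quoted by Cohere


def max_image_tokens(num_images: int) -> int:
    """Worst-case number of tokens consumed by `num_images` images."""
    return num_images * MAX_TOKENS_PER_IMAGE


def images_that_fit(context_tokens: int, reserved_for_text: int = 0) -> int:
    """How many worst-case images fit in a context budget.

    `context_tokens` and `reserved_for_text` are caller-supplied
    assumptions, not values published for the model.
    """
    available = max(context_tokens - reserved_for_text, 0)
    return available // MAX_TOKENS_PER_IMAGE


if __name__ == "__main__":
    # Hypothetical budget: 32,000 tokens with 4,000 kept for the prompt and answer.
    print(images_that_fit(32_000, reserved_for_text=4_000))
    print(max_image_tokens(10))
```

Because 3,328 is a ceiling, smaller or simpler images may consume fewer tokens, so this calculation is conservative.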

Cohere said it trained the visual model in three stages: vision-language alignment, supervised fine-tuning (SFT) and post-training reinforcement learning with human feedback (RLHF).

“This approach enables the mapping of image encoder features to the language model embedding space,” the company said. “In contrast, during the SFT stage, we simultaneously trained the vision encoder, the vision adapter and the language model on a diverse set of instruction-following multimodal tasks.”
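The three stages can be summarized as a small schedule of which components are updated when. This is an illustrative sketch, not Cohere's training code: the component names are hypothetical labels, the alignment stage is assumed to update only the adapter (a common choice in Llava-style training that the article does not confirm), and the RLHF stage is left unspecified because the article does not say what is trained there.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional

# Hypothetical labels for the three parts of the model named in the quote.
COMPONENTS = ("vision_encoder", "vision_adapter", "language_model")


@dataclass(frozen=True)
class Stage:
    name: str
    # Components updated in this stage; None means the article does not say.
    trainable: Optional[FrozenSet[str]]


STAGES: List[Stage] = [
    # Assumption: only the adapter is trained while features are mapped
    # into the language model's embedding space.
    Stage("vision_language_alignment", frozenset({"vision_adapter"})),
    # Per the quote: encoder, adapter and language model are trained together.
    Stage("sft", frozenset(COMPONENTS)),
    # Not described in the article.
    Stage("rlhf", None),
]


def frozen_in(stage: Stage) -> Optional[List[str]]:
    """Components left untouched in a stage, or None if unknown."""
    if stage.trainable is None:
        return None
    return [c for c in COMPONENTS if c not in stage.trainable]


if __name__ == "__main__":
    for stage in STAGES:
        print(stage.name, "frozen:", frozen_in(stage))
```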

Visualizing enterprise AI

Benchmark tests showed Command A Vision outperforming other models with similar visual capabilities.

Cohere pitted Command A Vision against OpenAI’s GPT 4.1, Meta’s Llama 4 Maverick, Mistral’s Pixtral Large and Mistral Medium 3 in nine benchmark tests. The company did not mention whether it tested the model against Mistral’s OCR-focused API, Mistral OCR.

Enable agents to securely see inside your organization’s visual data, unlocking the automation of tedious tasks involving slides, diagrams, PDFs, and photos. pic.twitter.com/ihznuwekrk
Cohere (@cohere) July 31, 2025

Command A Vision outscored the other models in tests such as ChartQA, OCRBench, AI2D and TextVQA. Overall, Command A Vision scored 83.1%, compared to GPT 4.1’s 78.6%, Llama 4 Maverick’s 80.5% and Mistral Medium 3’s 78.3%.
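For readers who want the gaps rather than the raw figures, the short sketch below computes Command A Vision's margin over each model using only the overall scores reported above. The helper name is hypothetical, and Pixtral Large is omitted because the article gives no overall score for it.

```python
# Overall scores (percent) as reported in the article.
SCORES = {
    "Command A Vision": 83.1,
    "GPT 4.1": 78.6,
    "Llama 4 Maverick": 80.5,
    "Mistral Medium 3": 78.3,
}


def margins_over(baseline: str, scores: dict) -> dict:
    """Percentage-point lead of `baseline` over every other model."""
    return {
        name: round(scores[baseline] - score, 1)
        for name, score in scores.items()
        if name != baseline
    }


if __name__ == "__main__":
    for model, gap in margins_over("Command A Vision", SCORES).items():
        print(f"Command A Vision leads {model} by {gap} percentage points")
```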

Most large language models (LLMs) these days are multimodal, meaning they can generate or understand visual media such as photos or videos. However, enterprises generally use more graphical documents such as charts and PDFs, so extracting information from these unstructured data sources often proves difficult.

With deep research on the rise, it has become more important to bring in models capable of reading, analyzing and even downloading unstructured data.

Cohere also said it is offering Command A Vision in an open-weights system, in the hope that enterprises looking to move away from closed or proprietary models will start using its products. So far, there is some interest from developers.

I was very impressed by the extraction of handwritten notes from an image!
Adam Sardo (sardo_adam) July 31, 2025

Finally, an AI that won’t judge my terrible drawings.
– Martha Weissar (Martwisener) August 1, 2025
