
New vision model from Cohere runs on two GPUs, beats top-tier VLMs on visual tasks




The rise of deep research features and other AI-powered analysis has brought more models and services that aim to simplify that process and read more of the documents businesses already use.

Canadian AI company Cohere is banking on its models, including a newly released visual model, to make the case that deep research features should also be optimized for enterprise use cases.

The company has released Command A Vision, a visual model specifically targeting enterprise use cases, built on the back of its Command A model. The company says the 112 billion parameter model can “unlock valuable insights from visual data, and make highly accurate, data-driven decisions through document optical character recognition (OCR) and image analysis.”

“Whether it’s interpreting product manuals with complex diagrams or analyzing photographs of real-world scenes for risk detection, Command A Vision excels at tackling the most demanding enterprise vision challenges,” the company said in a blog post.




This means Command A Vision can read and analyze the most common types of images enterprises need: graphs, charts, diagrams, scanned documents and PDFs.

Since it is built on Command A’s architecture, Command A Vision requires two or fewer GPUs, just like the text model. The vision model also retains the text capabilities of Command A to read words on images and understands at least 23 languages. Cohere said that, unlike other models, Command A Vision reduces the total cost of ownership for enterprises and is fully optimized for retrieval use cases.

How Cohere is architecting Command A

Cohere said it followed a Llava architecture to build its Command A models, including the visual model. This architecture turns visual features into soft vision tokens, which can be divided into different tiles.

These tiles are passed into the Command A text tower, “a dense, 111B parameters textual LLM,” the company said. “In this manner, a single image consumes up to 3,328 tokens.”
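The 3,328-token ceiling can be turned into a rough per-image budget. The sketch below is illustrative only: it assumes every tile costs the same number of tokens and that the cap corresponds to 12 tiles plus one thumbnail at 256 tokens each (13 × 256 = 3,328). That split is an assumption that happens to reproduce the quoted figure, not something stated in the article, and the names `TOKENS_PER_TILE`, `MAX_TILES`, `image_token_cost` and `images_per_context` are hypothetical.

```python
# Illustrative token budgeting for tiled image inputs.
# Assumption (not from the article): each tile and the thumbnail cost
# 256 tokens, and at most 12 tiles are used. Together these reproduce
# the quoted cap of 3,328 tokens per image.
TOKENS_PER_TILE = 256
MAX_TILES = 12
MAX_IMAGE_TOKENS = 3328  # figure quoted by Cohere


def image_token_cost(num_tiles: int) -> int:
    """Token cost of one image split into `num_tiles` tiles plus a thumbnail."""
    if num_tiles < 1:
        raise ValueError("an image needs at least one tile")
    tiles = min(num_tiles, MAX_TILES)  # tiling is capped under this assumption
    return (tiles + 1) * TOKENS_PER_TILE


def images_per_context(context_tokens: int, num_tiles: int = MAX_TILES) -> int:
    """How many images of a given tiling fit in a token budget."""
    return context_tokens // image_token_cost(num_tiles)


if __name__ == "__main__":
    print(image_token_cost(MAX_TILES))   # worst-case cost of a single image
    print(images_per_context(32_000))    # hypothetical 32,000-token budget
```

Under these assumptions, a page-heavy workload such as a scanned PDF is bounded by the worst-case image cost, so the per-image cap matters more than the average tile count when sizing a request.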

Cohere said it trained the visual model in three stages: vision-language alignment, supervised fine-tuning (SFT) and post-training reinforcement learning with human feedback (RLHF).

“This approach enables the mapping of the image encoder features to the language model embedding space,” the company said. “In contrast, during the SFT stage, we simultaneously trained the vision encoder, the vision adapter and the language model on a diverse set of instruction-following multimodal tasks.”

Visualizing enterprise AI

Benchmark tests showed Command A Vision outperforming other models with similar visual capabilities.

Cohere pitted Command A Vision against OpenAI’s GPT 4.1, Meta’s Llama 4 Maverick, Mistral’s Pixtral Large and Mistral Medium 3 in nine benchmark tests. The company did not mention whether it tested the model against Mistral’s OCR-focused API, Mistral OCR.

Command A Vision outscored the other models in tests such as ChartQA, OCRBench, AI2D and TextVQA. Overall, Command A Vision had an average score of 83.1%, compared to GPT 4.1’s 78.6%, Llama 4 Maverick’s 80.5% and Mistral Medium 3’s 78.3%.
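To put those averages side by side, the short sketch below computes the gap between Command A Vision and each competitor in percentage points. The scores are the vendor-reported averages quoted above; Pixtral Large is left out because the article gives no average for it. The names `AVERAGE_SCORES` and `margins_over` are hypothetical helpers, not part of any Cohere tooling.

```python
# Average scores across the nine benchmarks, as reported by Cohere.
# Pixtral Large is omitted: the article does not give its average.
AVERAGE_SCORES = {
    "Command A Vision": 83.1,
    "Llama 4 Maverick": 80.5,
    "GPT 4.1": 78.6,
    "Mistral Medium 3": 78.3,
}


def margins_over(baseline: str, scores: dict[str, float]) -> dict[str, float]:
    """Percentage-point lead of `baseline` over every other model."""
    base = scores[baseline]
    return {
        name: round(base - score, 1)  # round away floating-point noise
        for name, score in scores.items()
        if name != baseline
    }


if __name__ == "__main__":
    for model, lead in margins_over("Command A Vision", AVERAGE_SCORES).items():
        print(f"{model}: +{lead} points")
```

The leads work out to a few percentage points each, so the averages show a consistent but modest edge over the competing models on these tests.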

Most large language models (LLMs) these days are multimodal, meaning they can generate or understand visual media such as images or videos. However, enterprises generally use more graphical documents such as charts and PDFs, so extracting information from these unstructured data sources often proves difficult.

With deep research on the rise, the importance of bringing in models capable of reading, analyzing and even downloading unstructured data has grown.

Cohere also said it is offering Command A Vision in an open weights system, in the hope that enterprises looking to move away from closed or proprietary models will start using its products. So far, there is some interest from developers.




2025-08-01 22:05:00
