Visual Reasoning in AI: Boosting Problem-Solving with Images

When humans try to solve problems, they often visualize the tasks in their heads. New research suggests that enabling artificial intelligence to do the same could boost performance on spatial reasoning challenges.
While large language models excel at many text-based tasks, they often struggle with those that require more complex reasoning. One of the most promising approaches for boosting their performance on these kinds of problems is a technique known as “chain-of-thought” (CoT) prompting, in which users ask the model to “think” through a problem step by step.
This can lead to significant improvements on various reasoning tasks, especially in mathematics, coding, and logic. But the language-focused technique has proved less effective for problems requiring spatial or visual reasoning. To close that gap, researchers at the University of Cambridge and Microsoft Research have developed a new approach that lets models “think” in both text and images.
The technique enables multimodal large language models, which can process both image and text data, to generate visual representations of their intermediate reasoning steps. In non-peer-reviewed research posted to arXiv, the researchers report that when they tested the approach on spatial reasoning challenges involving 2D mazes, they saw significant improvements over the typical CoT technique on the most challenging scenarios.
“Spatial relations and layouts and also some geometric features are very hard to describe with pure text,” says co-lead author Chengzu Li, a Ph.D. student at Cambridge. “That’s why we think that reasoning with pure text would limit the performance of the model in spatial tasks. And that’s the main motivation for introducing visual ‘thoughts,’” he says.
How AI Visual Reasoning Works
This is not the first attempt to allow AI to reason visually. But Li says previous approaches have either involved extracting information from images and converting it to text before reasoning with it, or have relied on external tools or specialized vision models to enable visual reasoning.
The new approach enables a single multimodal model to generate both visual and text reasoning steps itself. This work only recently became feasible, says Li, thanks to the development of more powerful multimodal AI. Older models could interpret images and text, but could only generate text outputs. For these experiments, the researchers used a model called Anole that can respond in either modality.
This model is an open-source extension of Meta’s Chameleon multimodal model: the researchers behind Anole retrained it to generate sequences of text interleaved with images. For instance, it can generate a step-by-step recipe with an image for each step. Li and colleagues took this pre-trained model and fine-tuned it on text and image data from three maze-like games with different levels of complexity. They called their fine-tuned version Multimodal Visualization of Thought (MVoT).
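Conceptually, a Chameleon-style model represents images as sequences of discrete tokens from a learned codebook, so “thinking” in images simply means emitting image tokens in the middle of a text sequence. The sketch below shows roughly what that interleaved decoding loop might look like; the `model` object, its special-token IDs, and the `decode_image_tokens`/`decode_text` helpers are hypothetical stand-ins for illustration, not Anole’s actual API.

```python
def generate_interleaved(model, prompt_tokens, max_tokens=2048):
    """Sketch of interleaved text/image decoding for a Chameleon/Anole-style
    model that represents images as discrete tokens. The model interface,
    special-token IDs, and decoders here are hypothetical stand-ins."""
    BOI, EOI, EOS = model.begin_image_id, model.end_image_id, model.eos_id
    tokens = list(prompt_tokens)   # running context fed back into the model
    outputs = []                   # interleaved ("text", ...) / ("image", ...) chunks
    image_buf, in_image = [], False

    while len(tokens) < max_tokens:
        tok = model.sample_next(tokens)   # autoregressive next-token sample
        tokens.append(tok)
        if tok == EOS:
            break
        if tok == BOI:                    # model starts "thinking" in an image
            in_image, image_buf = True, []
        elif tok == EOI:                  # image finished: decode its token block
            outputs.append(("image", model.decode_image_tokens(image_buf)))
            in_image = False
        elif in_image:
            image_buf.append(tok)         # accumulate image tokens
        else:
            outputs.append(("text", model.decode_text([tok])))
    return outputs
```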
The researchers tested the new technique (bottom), which generates both visual and verbal thoughts, against one that reasons only in text (middle) and one that skips reasoning and jumps straight to the answer. Chengzu Li, Wenshan Wu et al.
The goal for the model was to work out what would happen if it took a pre-determined series of actions in each maze. During training, the model was shown examples that included an image of the starting position in the maze and a textual description of the task; a series of reasoning steps pairing text descriptions of actions with images of the player’s position on the map; and finally the outcome of those actions, such as reaching the desired destination or falling down a hole. During testing, the model was given only the starting image and a sequence of actions to perform. It then generated image and text reasoning steps, followed by a prediction of what would happen.
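As a rough illustration of that setup, here is how a single training example and the test-time prediction call might be structured; the field names, the `ReasoningStep` layout, and the model methods are hypothetical stand-ins chosen to match the description above, not the authors’ actual code.

```python
from dataclasses import dataclass
from typing import Any, List

Image = Any  # placeholder for whatever pixel/token representation the model consumes

@dataclass
class ReasoningStep:
    action_text: str    # e.g. "move down one square"
    board_image: Image  # rendered view of the player's position after the action

@dataclass
class MazeExample:
    start_image: Image          # image of the starting position in the maze
    task_description: str       # textual description of the task
    actions: List[str]          # the pre-determined series of actions to evaluate
    steps: List[ReasoningStep]  # interleaved reasoning, shown only during training
    outcome: str                # e.g. "reaches the destination" or "falls down a hole"

def predict_outcome(model, example: MazeExample) -> str:
    """At test time the model sees only the start image and the action list,
    generates its own interleaved text/image reasoning, then answers."""
    prompt = [example.start_image, example.task_description, *example.actions]
    reasoning = model.generate_interleaved(prompt)    # hypothetical API
    return model.generate_answer(prompt + reasoning)  # hypothetical API
```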
The researchers compared MVoT to four other models, three of which they fine-tuned themselves. The first two versions of the model were trained only on text data regarding the maze: One model jumped straight from a prompt to generating a final answer, the other used textual CoT reasoning. Another model was trained on examples of both image and text reasoning, but then did its own reasoning purely in text. Finally, they compared MVoT’s performance on the maze tasks to that of the GPT-4o model from OpenAI, which is the company’s most advanced multimodal model.
They found that on all three games, the MVoT model significantly outperformed all models apart from the one using traditional text CoT. That model actually did slightly better on the two simpler mazes, successfully predicting the outcome 98 percent of the time on both, compared to MVoT’s scores of 93 percent and 95 percent. But the traditional text CoT model did much worse on the most complicated game, scoring just 61 percent compared to MVoT’s 86 percent. They tested both models on progressively larger mazes, and while MVoT’s performance remained stable, the other model’s performance plummeted as maze size increased.
The researchers say this outcome is likely because CoT relies on accurate textual descriptions of the environment, which get harder the more complex the mazes become. In contrast, the inclusion of images in the reasoning process appears to make MVoT much better at dealing with more challenging environments.
Applications for AI Visual Reasoning
While the tests the researchers used are simple, Li says extending this approach into more complex domains could have broad applications. One of the most compelling is robotics, where the approach could help machines reason more effectively about the visual input they get from the environment. It could also help AI tutors better illustrate and explain ideas, particularly in areas like geometry. More broadly, he says, the approach can boost model interpretability by giving humans a clear picture of what the model is thinking about in spatial tasks.
One potential gap, admits Li, is that the model has no mechanism for deciding when to reason visually or when to reason via text. At present, the model simply alternates between the two, which works well for these maze navigation challenges that have discrete steps but may be less appropriate for more complex spatial reasoning tasks.
“We haven’t really touched on when is the appropriate time to do a visual reasoning process or not,” Li says. “But I think it’s definitely one of the very interesting directions to further explore.” One possibility, he adds, would be to generate reasoning sequences with both visual and text descriptions at each step, and then get humans to provide feedback on which is more expressive. This feedback could then be used to train the model to pick the best option at each reasoning step.