
Uni-MoE-2.0-Omni: An Open Qwen2.5-7B Based Omnimodal MoE for Text, Image, Audio and Video Understanding

How do you build a single open model that can reliably understand text, images, audio, and video while still running efficiently? A team of researchers from Harbin Institute of Technology, Shenzhen has released Uni-MoE-2.0-Omni, a large, fully open omnimodal model that pushes the Lychee Uni-MoE line toward language-centric multimodal reasoning. The system is trained from scratch on the Qwen2.5-7B dense backbone and expanded into a Mixture of Experts architecture with dynamic expert capacity, a progressive supervised and reinforcement learning recipe, and approximately 75 billion tokens of carefully matched multimodal data. It accepts text, images, audio, and video for understanding and can generate images, text, and speech.

https://idealistxy.github.io/Uni-MoE-v2.github.io/

Architecture, a unified token interface around a language core

The core of Uni-MoE-2.0-Omni is a Qwen2.5-7B style transformer that acts as the central language hub. Around this core, the research team attaches a unified speech encoder that maps diverse acoustics, including environmental sound, speech, and music, into a common representation space. On the visual side, pre-trained vision encoders process images and video frames and feed token sequences into the same transformer. For generation, a context-aware MoE-based text-to-speech module and a task-aware diffusion transformer handle speech and image synthesis.

https://idealistxy.github.io/Uni-MoE-v2.github.io/

All modalities are converted into token sequences that share a unified interface to the language model. This design means that the same self-attention layers see textual, visual, and audio tokens, which simplifies modality fusion and makes the language model the central controller for both understanding and generation. The architecture is designed to support 10 cross-modal input configurations, such as image plus text, video plus speech, and tri-modal combinations.
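As a rough illustration of this unified interface, the sketch below (in PyTorch, with made-up encoder dimensions and module names, not the released code) shows how projected vision and audio features could be concatenated with text embeddings into a single token sequence for the language model.

```python
# Minimal sketch of a language-centric token interface; dimensions and module
# names are assumptions for illustration, not the authors' implementation.
import torch
import torch.nn as nn

class UnifiedTokenInterface(nn.Module):
    def __init__(self, d_model=3584, d_audio=1280, d_vision=1152):
        super().__init__()
        # Projectors map each modality's encoder features into the LLM width.
        self.audio_proj = nn.Linear(d_audio, d_model)
        self.vision_proj = nn.Linear(d_vision, d_model)

    def forward(self, text_emb, vision_feats=None, audio_feats=None):
        # text_emb: (B, T_text, d_model) from the LLM's embedding table
        parts = []
        if vision_feats is not None:          # (B, T_img, d_vision)
            parts.append(self.vision_proj(vision_feats))
        if audio_feats is not None:           # (B, T_aud, d_audio)
            parts.append(self.audio_proj(audio_feats))
        parts.append(text_emb)
        # One fused sequence; the same self-attention layers see all modalities.
        return torch.cat(parts, dim=1)

fuse = UnifiedTokenInterface()
seq = fuse(torch.randn(1, 6, 3584), torch.randn(1, 16, 1152), torch.randn(1, 8, 1280))
print(seq.shape)  # torch.Size([1, 30, 3584])
```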

Omni-Modality 3D RoPE and MoE driven fusion

Cross-modal alignment is handled by an Omni-Modality 3D RoPE mechanism that encodes temporal and spatial structure directly into the rotary position embedding. Instead of using only one-dimensional positions as for text, the system assigns three coordinates to tokens: time, height, and width for visual streams, and time for speech and audio. This gives the transformer a clear view of when and where each token occurs, which is important for video understanding and audio-visual reasoning tasks.
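The snippet below is a hedged sketch of the idea behind 3D rotary positions: text tokens advance along the time axis only, while visual patches also carry height and width coordinates. The dimension split and frequency base are illustrative, not the paper's exact configuration.

```python
# Illustrative 3D rotary positions: each token gets (t, h, w) coordinates and
# the head dimension is split into three rotary groups (split is an assumption).
import torch

def rope_angles(positions, dim, base=10000.0):
    # positions: (N,) integer coordinates for one axis; returns (N, dim/2) angles
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    return positions.float()[:, None] * inv_freq[None, :]

def omni_3d_rope_angles(t, h, w, head_dim=128):
    # Split the rotary dimensions across time, height, and width (e.g. 64/32/32).
    dt, dh, dw = head_dim // 2, head_dim // 4, head_dim // 4
    return torch.cat([rope_angles(t, dt), rope_angles(h, dh), rope_angles(w, dw)], dim=-1)

# Text tokens: positions advance along time only; height and width stay at 0.
t_text = torch.arange(4)
zeros = torch.zeros(4, dtype=torch.long)
text_angles = omni_3d_rope_angles(t_text, zeros, zeros)

# A 2x2 video frame at timestep 5: same t, distinct (h, w) per patch.
hh, ww = torch.meshgrid(torch.arange(2), torch.arange(2), indexing="ij")
frame_angles = omni_3d_rope_angles(torch.full((4,), 5), hh.flatten(), ww.flatten())
print(text_angles.shape, frame_angles.shape)  # torch.Size([4, 64]) torch.Size([4, 64])
```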

Mixture of Experts layers replace standard MLP blocks with an MoE stack containing three types of experts. Null experts act as no-op functions that allow computation to be skipped at inference time. Routed experts are modality specific and store domain knowledge for audio, vision, or text. Shared experts are small and always active, providing a communication path for general information across modalities. A routing network selects which experts to activate for each input token, providing specialization without paying the full cost of a dense model with every expert active.
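A minimal sketch of such an MoE block is shown below, assuming a top-2 router, a handful of routed experts, and null slots that simply contribute nothing; sizes and routing details are placeholders rather than the released architecture.

```python
# Sketch of an MoE block with shared, routed, and null experts, following the
# three expert types described above; hyperparameters are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicCapacityMoE(nn.Module):
    def __init__(self, d_model=512, d_ff=1024, n_routed=6, n_null=2, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.shared = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.routed = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_routed)
        ])
        self.router = nn.Linear(d_model, n_routed + n_null)

    def forward(self, x):                          # x: (tokens, d_model)
        gates = F.softmax(self.router(x), dim=-1)  # (tokens, n_routed + n_null)
        weights, idx = gates.topk(self.top_k, dim=-1)
        routed_out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.routed):
                mask = idx[:, k] == e              # tokens routed to real expert e
                if mask.any():
                    routed_out[mask] += weights[mask, k, None] * expert(x[mask])
            # Indices >= n_routed select null experts, which add nothing:
            # the token skips that expert slot entirely.
        # The shared expert is small and always active for every token.
        return self.shared(x) + routed_out

moe = DynamicCapacityMoE()
print(moe(torch.randn(10, 512)).shape)  # torch.Size([10, 512])
```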

Training recipe, from cross-modal pre-training to GSPO-DPO

The training pipeline is organized as a data-matched recipe. First, a language-centric cross-modal pre-training phase uses paired image-text, audio-text, and video-text data. This step teaches the model to represent each modality in a common semantic space aligned with language. The base model is trained on approximately 75 billion open-source multimodal tokens and is equipped with special speech and image generation tokens so that generative behavior can be learned alongside language modeling.

Next, a progressive supervised fine-tuning phase activates modality-specific experts grouped into audio, vision, and text categories. During this phase, the research team introduces special control tokens so that the model can perform tasks such as text-conditioned speech synthesis and image generation within the language interface itself. After large-scale supervised fine-tuning (SFT), a data-balanced annealing phase reweights the dataset mix across modalities and tasks and trains at a lower learning rate. This avoids overfitting to a single modality and improves the stability of the final omnimodal behavior.
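As a toy illustration of the annealing idea, the snippet below reweights sampling probabilities across hypothetical modality buckets and drops the learning rate; the bucket names, weights, and learning rates are all invented for illustration, not values from the paper.

```python
# Data-balanced annealing sketch: rebalance the sampling mix across modality
# buckets and anneal at a lower learning rate. All numbers are illustrative.
import random

sft_mix    = {"text": 0.40, "image_text": 0.30, "video": 0.15, "audio": 0.10, "generation": 0.05}
anneal_mix = {"text": 0.25, "image_text": 0.25, "video": 0.20, "audio": 0.20, "generation": 0.10}

def sample_bucket(mix):
    buckets, weights = zip(*mix.items())
    return random.choices(buckets, weights=weights, k=1)[0]

sft_lr, anneal_lr = 1e-5, 2e-6   # annealing runs at a lower learning rate
print(sample_bucket(anneal_mix), anneal_lr)
```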

To unlock long-form reasoning, Uni-MoE-2.0-Omni adds an iterative policy optimization phase built on GSPO and DPO. GSPO uses the model itself or another LLM as a judge to evaluate responses and build preference signals, while DPO turns those preferences into a direct policy update objective that is more stable than standard reinforcement learning from human feedback. The research team applies this GSPO-DPO loop over multiple rounds to produce the Uni-MoE-2.0-Thinking variant, which inherits the omnimodal base and adds stronger step-by-step reasoning.
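The sketch below shows the standard DPO objective that such a loop could plug its judge-derived preference pairs into; the beta value and example log-probabilities are illustrative, not the authors' training configuration.

```python
# Minimal DPO loss sketch: preferences become a log-ratio margin objective
# against a frozen reference model. Inputs here are made-up example values.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # Log-ratios of the policy vs. the frozen reference model on each response.
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    # Push the margin between preferred and rejected responses apart.
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Example with sequence log-probabilities for a batch of 4 preference pairs.
loss = dpo_loss(torch.tensor([-12.0, -9.5, -20.1, -7.3]),
                torch.tensor([-15.2, -11.0, -22.4, -9.9]),
                torch.tensor([-13.0, -10.0, -21.0, -8.0]),
                torch.tensor([-14.8, -10.5, -21.9, -9.1]))
print(loss)
```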

Generation, MoE-TTS and task-aware diffusion

For speech generation, Uni-MoE-2.0-Omni uses a context-aware MoE-TTS module that sits on top of the language model. The LLM emits control tokens describing timbre, style, and language along with the text content. The MoE-TTS module consumes this sequence and produces discrete audio tokens, which are then decoded into waveforms by an external codec model aligned with the unified speech encoder on the input side. This design makes speech generation a first-class controllable generation task rather than a separate pipeline.
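The following sketch illustrates that flow with hypothetical control-token names and dummy components: the prompt carries timbre, style, and language tokens inside the language interface, the TTS module returns discrete audio tokens, and a codec decodes them into a waveform. None of the token strings or interfaces here are taken from the released model.

```python
# Hedged sketch of the speech generation flow; token formats and component
# interfaces are hypothetical stand-ins, not the actual Uni-MoE-TTS API.
from dataclasses import dataclass

@dataclass
class SpeechRequest:
    text: str
    timbre: str = "female_1"
    style: str = "narration"
    language: str = "en"

def build_tts_prompt(req: SpeechRequest) -> str:
    # Control tokens are emitted inside the same language interface as plain text.
    return (f"<|speech_gen|><|timbre:{req.timbre}|><|style:{req.style}|>"
            f"<|lang:{req.language}|>{req.text}<|speech_end|>")

def synthesize(req: SpeechRequest, tts_module, codec):
    prompt = build_tts_prompt(req)
    audio_tokens = tts_module(prompt)        # MoE-TTS -> discrete audio token ids
    return codec.decode(audio_tokens)        # external codec -> waveform samples

# Usage with dummy stand-ins, just to show the data flow end to end.
class DummyCodec:
    def decode(self, tokens):                # pretend each token covers 320 samples
        return [0.0] * (len(tokens) * 320)

waveform = synthesize(SpeechRequest("Hello from Uni-MoE."),
                      tts_module=lambda prompt: list(range(16)),
                      codec=DummyCodec())
print(len(waveform))                         # 5120
```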

On the vision side, the task-aware diffusion transformer is conditioned on both task tokens and image tokens. Task tokens encode whether the system should generate, edit, or enhance images, including low-level processing. Image tokens capture semantics from the multimodal backbone, for example from a text plus image dialogue. Lightweight projectors map these tokens into the diffusion transformer's conditioning space, enabling instruction-guided image generation and editing while keeping the main multimodal model frozen during the final visual tuning stage.
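Below is a minimal sketch of that conditioning path, assuming invented dimensions and a three-entry task vocabulary: an embedding for the task token and a lightweight projector for the backbone's image tokens, concatenated into the diffusion transformer's conditioning sequence.

```python
# Task-aware conditioning sketch: project task and image tokens from the frozen
# multimodal backbone into the diffusion model's conditioning space.
# Dimensions and the task vocabulary are assumptions for illustration.
import torch
import torch.nn as nn

TASKS = {"generate": 0, "edit": 1, "enhance": 2}   # hypothetical task vocabulary

class TaskAwareCondition(nn.Module):
    def __init__(self, d_llm=3584, d_cond=1024, n_tasks=len(TASKS)):
        super().__init__()
        self.task_embed = nn.Embedding(n_tasks, d_cond)   # task token embedding
        self.image_proj = nn.Linear(d_llm, d_cond)        # projector for backbone tokens

    def forward(self, task_name, image_tokens):
        # image_tokens: (B, T, d_llm) hidden states from the frozen multimodal LLM
        task = self.task_embed(torch.tensor([TASKS[task_name]]))          # (1, d_cond)
        task = task.expand(image_tokens.size(0), 1, -1)                   # (B, 1, d_cond)
        cond = torch.cat([task, self.image_proj(image_tokens)], dim=1)    # (B, 1+T, d_cond)
        return cond   # fed to the diffusion transformer as conditioning context

cond = TaskAwareCondition()("edit", torch.randn(2, 8, 3584))
print(cond.shape)   # torch.Size([2, 9, 1024])
```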

Benchmark results and open checkpoints

Uni-MoE-2.0-Omni is evaluated on 85 multimodal benchmarks covering image, text, video, audio, and cross-modal or tri-modal reasoning. Against Qwen2.5-Omni, which was trained on about 1.2T tokens, it wins on more than 50 of 76 shared benchmarks. Gains include about +7% on average in video understanding across 8 tasks, +7% on average in omnimodal understanding across 4 benchmarks including OmniVideoBench and WorldSense, and about +4% in audio-visual reasoning.

For long-form speech processing, Uni-MoE-2.0-Omni lowers word error rate by up to 4.2% on long LibriSpeech splits and improves WER by about 1% on TinyStories-en text-to-speech. Image generation and editing results are competitive with specialized visual models. The research team reports small but consistent gains of about 0.5% on GEdit-Bench compared with Ming-Lite-Omni, while also outperforming Qwen-Image and PixWizard on several low-level image processing metrics.

https://arxiv.org/pdf/2511.12609

Key takeaways

  1. Uni-MoE-2.0-Omni is a large, fully open omnimodal model built from scratch on the dense Qwen2.5-7B backbone and upgraded to a Mixture of Experts architecture that supports 10 cross-modal input types with shared understanding across text, images, audio, and video.
  2. The model introduces a Dynamic Capacity MoE with shared, routed, and null experts as well as Omni-Modality 3D RoPE, which together balance computation and capacity by routing each token to a subset of experts while maintaining spatiotemporal alignment across modalities inside the self-attention layers.
  3. Uni-MoE-2.0-Omni uses a staged training pipeline: cross-modal pre-training, progressive supervised fine-tuning with modality-specific experts, data-balanced annealing, and GSPO-based reinforcement learning combined with DPO, which yields the Uni-MoE-2.0-Thinking variant for stronger long-form reasoning.
  4. The system supports omnimodal understanding plus generation of images, text, and speech through a unified language-centric interface, with dedicated Uni-MoE-TTS and Uni-MoE-2.0-Image heads derived from the same base for controllable speech and image synthesis.
  5. Across 85 benchmarks, Uni-MoE-2.0-Omni outperforms Qwen2.5-Omni on more than 50 of 76 shared tasks, with roughly +7% gains in video understanding and omnimodal understanding, +4% in audio-visual reasoning, and up to a 4.2% relative reduction in WER on long-form speech.



Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of AI for social good. His most recent endeavor is the launch of the AI media platform, Marktechpost, which features in-depth coverage of machine learning and deep learning news that is technically sound and easy for a broad audience to understand. The platform draws more than 2 million views per month, a sign of its popularity among readers.
