[2511.00279] LongCat-Flash-Omni Technical Report
Authors: Meituan LongCat Team and 132 other authors
Abstract: We introduce LongCat-Flash-Omni, a state-of-the-art open-source omni-modal model with 560 billion parameters that excels at real-time audio-visual interaction. By adopting a curriculum-inspired progressive training strategy that moves from simpler to progressively more complex modality sequence modeling tasks, LongCat-Flash-Omni attains comprehensive multimodal capabilities while maintaining strong unimodal capability. Building upon LongCat-Flash, which adopts a high-performance shortcut-connected Mixture-of-Experts (MoE) architecture with zero-computation experts, LongCat-Flash-Omni integrates efficient multimodal perception and speech reconstruction modules. Despite its immense size of 560B parameters (with 27B activated), LongCat-Flash-Omni achieves low-latency, real-time audio-visual interaction. For the training infrastructure, we developed a modality-decoupled parallelism scheme specifically designed to manage the data and model heterogeneity inherent in large-scale multimodal training. This approach demonstrates exceptional efficiency by sustaining over 90% of the throughput achieved by text-only training. Comprehensive evaluations show that LongCat-Flash-Omni achieves state-of-the-art performance on omni-modal benchmarks among open-source models. Furthermore, it delivers highly competitive results across a wide range of modality-specific tasks, including text, image, and video understanding, as well as audio understanding and generation. We provide a comprehensive overview of the model architecture design, training procedures, and data strategies, and release the model as open source to foster future research and development in the community.
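The "zero-computation experts" mentioned in the abstract are expert slots that apply no transformation at all, so the router can let individual tokens skip FFN compute and vary per-token cost. Below is a minimal, hypothetical PyTorch sketch of that routing idea; all names and sizes (d_model, n_ffn_experts, n_zero_experts, top_k) are illustrative assumptions, not the paper's implementation, and the real shortcut-connected architecture is far more elaborate.

import torch
import torch.nn as nn

class ZeroComputationMoE(nn.Module):
    """Sketch of an MoE layer whose router can also pick identity
    ("zero-computation") experts that spend no FLOPs on a token."""

    def __init__(self, d_model=512, n_ffn_experts=8, n_zero_experts=4, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.n_ffn_experts = n_ffn_experts
        # Ordinary FFN experts that actually compute.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_ffn_experts)
        ])
        # The router scores FFN experts and identity slots together.
        self.router = nn.Linear(d_model, n_ffn_experts + n_zero_experts)

    def forward(self, x):  # x: (tokens, d_model)
        weights = torch.softmax(self.router(x), dim=-1)
        topw, topi = weights.topk(self.top_k, dim=-1)
        topw = topw / topw.sum(dim=-1, keepdim=True)  # renormalize top-k gates
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            idx, w = topi[:, k], topw[:, k:k + 1]
            for e in range(self.n_ffn_experts):
                mask = idx == e
                if mask.any():
                    out[mask] += w[mask] * self.experts[e](x[mask])
            # Identity experts: the token passes through unchanged,
            # so routing here costs (almost) nothing.
            zero_mask = idx >= self.n_ffn_experts
            if zero_mask.any():
                out[zero_mask] += w[zero_mask] * x[zero_mask]
        return out

# Usage: tokens routed to identity slots receive a scaled copy of themselves.
x = torch.randn(16, 512)
y = ZeroComputationMoE()(x)  # shape (16, 512)

Under this sketch, increasing n_zero_experts raises the chance that a token is routed past the FFNs, which is one plausible way a 560B-parameter model can keep per-token activated compute near 27B.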
Submission history
From: Songxiang Liu
[v1]
Fri, 31 Oct 2025 21:58:15 UTC (3,723 KB)
[v2]
Fri, 28 Nov 2025 09:10:07 UTC (3,707 KB)