
[2506.12708] Serving Large Language Models on Huawei CloudMatrix384

Authors: Pengfei Zuo, Huimin Lin, Junbo Deng, Nan Zou, Xingkun Yang, Yingyu Diao, Weifeng Gao, Ke Xu, Zhangyu Chen, Shirui Lu, Zhao Qiu, Peiyang Li, Wang, Zaijian Zong, Mosong Zhou, Wenli Zhou, Houjiang Chen, Xingyu Liao, Yipeng Li, Wenxiao Zhang, Ping Zhu, Yinggang Wang, Chuanjie Xiao, Depeng Liang, Dong Cao, Juncheng Liu, Xie, Huatao Wu, Zhibin Yu, Lv Chen, Hu Liu, Yujun Ding, Haipei Zhu, Jing Xia, Yi Xiong, Zhou Yu, Heng Liao

View a PDF of the paper titled Serving Large Language Models on Huawei CloudMatrix384, by Pengfei Zuo and 45 other authors

View PDF HTML (experimental)

Abstract: The rapid evolution of large language models (LLMs), driven by growing parameter scales, the adoption of mixture-of-experts (MoE) architectures, and expanding context lengths, imposes unprecedented demands on AI infrastructure. Traditional AI clusters face limitations in compute intensity, memory bandwidth, inter-chip communication, and latency, compounded by variable workloads and strict service-level objectives. Addressing these issues requires fundamentally redesigned hardware-software integration. This paper introduces Huawei CloudMatrix, a next-generation AI datacenter architecture, realized in the production-grade CloudMatrix384 supernode. It integrates 384 Ascend 910 NPUs and 192 Kunpeng CPUs interconnected via an ultra-high-bandwidth unified bus (UB) network, enabling direct all-to-all communication and dynamic pooling of resources. These features optimize performance for communication-intensive operations, such as large-scale MoE expert parallelism and distributed key-value (KV) cache access. To fully leverage CloudMatrix384, we propose CloudMatrix-Infer, an advanced LLM serving solution incorporating three core innovations: a peer-to-peer serving architecture that independently scales prefill, decode, and caching; a large-scale expert parallelism strategy supporting EP320 via efficient UB-based token dispatch; and hardware-aware optimizations including specialized operators, microbatch-based pipelining, and INT8 quantization. Evaluation with the DeepSeek-R1 model shows that CloudMatrix-Infer achieves state-of-the-art efficiency: a prefill throughput of 6,688 tokens/s per NPU and a decode throughput of 1,943 tokens/s per NPU (<50 ms TPOT). It effectively balances throughput and latency, sustaining 538 tokens/s per NPU even under a stringent 15 ms latency constraint, while INT8 quantization maintains model accuracy across benchmarks.
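For context on the decode figures above, per-NPU throughput and time-per-output-token (TPOT) jointly imply how many requests each NPU decodes concurrently: each request emits 1/TPOT tokens per second, so throughput divided by that rate gives the batch size. A back-of-the-envelope sketch in Python (our illustration, not from the paper):

```python
# Rough sketch relating the abstract's decode numbers: throughput per NPU,
# TPOT (time per output token), and implied concurrent requests per NPU.

def implied_concurrency(tokens_per_s_per_npu: float, tpot_s: float) -> float:
    """Each request produces 1/TPOT tokens per second; dividing per-NPU
    throughput by that per-request rate yields the concurrent batch size."""
    per_request_rate = 1.0 / tpot_s
    return tokens_per_s_per_npu / per_request_rate

# Figures quoted in the abstract:
print(implied_concurrency(1943, 0.050))  # ~97 requests/NPU at <50 ms TPOT
print(implied_concurrency(538, 0.015))   # ~8 requests/NPU under 15 ms TPOT
```

The two cases illustrate the throughput/latency trade-off the abstract claims: tightening TPOT from 50 ms to 15 ms shrinks the sustainable batch per NPU by roughly an order of magnitude, which is why per-NPU throughput drops from 1,943 to 538 tokens/s.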

Submission history

From: Pengfei Zuo
[v1] Sun, 15 June 2025 03:41:34 UTC (1,018 KB)
[v2] Wed, 18 June 2025 10:04:59 UTC (1,018 KB)
[v3] Thu, 19 June 2025 12:27:10 UTC (1,013 KB)
