Innovation and practice of a cloud-based ten-thousand-card cluster for intelligent computing centers
To address the low availability of computing power in the ultra-large-scale clusters of intelligent computing centers, the low maturity of domestically produced technologies, bottlenecks in large-scale networking efficiency, and complex operations and maintenance, a system for constructing a ten-thousand-card intelligent computing center cluster based on cloud computing technology was proposed. A cloud-based ten-thousand-card cluster was built from 18 432 neural processing unit (NPU) cards and an optimized remote direct memory access (RDMA) over Ethernet network. A multi-plane network architecture was adopted, and software defined network (SDN) technology was applied to achieve RDMA network tenant isolation. With an optimized network load-balancing strategy, the link load-balancing error was kept below 10% and the cluster All-Reduce bandwidth exceeded 35 GB/s. An optimized distributed storage protocol halved the model checkpoint recovery time. Validation results show that, with coordinated software-hardware optimization, the domestic NPU ten-thousand-card cluster not only meets the training needs of large models with hundreds of billions of parameters but can also support training tasks for trillion-parameter models.
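To put the reported All-Reduce figure in context, cluster benchmarks usually quote *bus bandwidth*, derived from a timed collective via the standard ring-all-reduce traffic model (each rank moves 2(n-1)/n of the message size). The function below is an illustrative sketch of that conversion, not code or data from the paper; the numbers in the usage example are assumptions.

```python
# Sketch: estimating All-Reduce bus bandwidth from one timed collective,
# using the ring-all-reduce traffic model (2*(n-1)/n of the message per rank).
# Illustrative only; not the paper's measurement methodology.

def allreduce_bus_bandwidth(message_bytes: int, n_ranks: int, seconds: float) -> float:
    """Return bus bandwidth in GB/s for one All-Reduce of `message_bytes`.

    Ring All-Reduce transfers message_bytes * 2*(n-1)/n per rank, so
    bus bandwidth = message_bytes * 2*(n-1)/n / seconds.
    """
    traffic = message_bytes * 2 * (n_ranks - 1) / n_ranks
    return traffic / seconds / 1e9

# Hypothetical sample: a 1 GiB buffer reduced across 8 ranks in 50 ms
# yields roughly 37.6 GB/s of bus bandwidth, on the order of the paper's
# reported 35+ GB/s.
bw = allreduce_bus_bandwidth(1 << 30, 8, 0.050)
```

At very large rank counts the 2(n-1)/n factor approaches 2, so bus bandwidth is effectively twice the per-rank payload rate.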
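The link load-balancing error bounded at 10% can be read as the deviation of per-link traffic from an even split across equal-cost paths. A minimal sketch of that metric, with hypothetical per-link counters (the definition here is an assumption; the paper does not give its exact formula):

```python
# Sketch: link load-balancing error as the maximum relative deviation of
# per-link traffic from the ideal even share. Hypothetical metric definition.

def load_balance_error(link_bytes: list[float]) -> float:
    """Max relative deviation of per-link traffic from the even share."""
    ideal = sum(link_bytes) / len(link_bytes)
    return max(abs(b - ideal) for b in link_bytes) / ideal

# Hypothetical byte counters for four parallel links over one interval;
# the even share is 100, so the worst link deviates by 4%.
err = load_balance_error([98, 104, 101, 97])
```

An error of 0 means perfectly even spreading; hash-polarization in ECMP-style RDMA fabrics is what typically pushes this number up, which is why the load-balancing strategy had to be optimized.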