Direct xPU: A Novel Distributed Heterogeneous Computing Architecture Optimized for Inter-node Communication

Li Rengang ¹, Wang Yanwei ², Hao Rui ², Xiao Linge ², Yang Le ³, Yang Guangwen ⁴, Kan Hongwei ³

Author information

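1. Department of Computer Science and Technology, Tsinghua University, Beijing 100084; Inspur (Beijing) Electronic Information Industry Co., Ltd., Beijing 100085
2. Inspur (Beijing) Electronic Information Industry Co., Ltd., Beijing 100085
3. Guangdong Inspur Smart Computing Technology Co., Ltd., Guangzhou 510623
4. Department of Computer Science and Technology, Tsinghua University, Beijing 100084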

Abstract

The explosive growth of large artificial intelligence model applications has made it difficult to deploy them at scale on a single node or a single type of computing hardware. Distributed heterogeneous computing has become the mainstream choice, and inter-node communication has become one of the main bottlenecks in training and inference for large models. The inter-node communication schemes of current computing architectures, which are dominated by leading chip vendors such as GPU and FPGA manufacturers, still have shortcomings. On the one hand, in pursuit of the best possible inter-node communication performance, some architectures adopt point-to-point transmission schemes whose protocols are simple but scale poorly. On the other hand, traditional heterogeneous computing engines (such as GPUs) are independent of the CPU in computing resources such as memory and compute pipelines, but they lack a dedicated network communication device; they must rely entirely or partially on the CPU to handle communication between the heterogeneous computing engine and a shared network communication device over physical links such as PCIe. The Direct xPU distributed heterogeneous computing architecture presented in this paper gives heterogeneous computing engines independent, dedicated devices for both computing and communication resources, achieving zero-copy data transfer and further eliminating the energy consumption and latency incurred by moving data across chips during inter-node communication. Test results show that Direct xPU achieves communication latency comparable to that of architectures that pursue the best possible inter-node communication performance, with bandwidth close to the upper limit of the physical link bandwidth.

Key words

inter-node communication/FPGA/GPU/RDMA/zero copy

Funding

Key-Area Research and Development Program of Guangdong Province (2021B0101400001)

Year of publication

2024

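Journal of Computer Research and Development

Institute of Computing Technology, Chinese Academy of Sciences; China Computer Federation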

CSTPCD, Peking University Core Journals list

Impact factor: 2.649

ISSN: 1000-1239

Number of references: 24
