Efficient Task Flow Parallel System for New Generation Sunway Processor

China's independently developed new-generation Sunway supercomputer offers a more powerful memory system and higher computational density than its predecessor, the Sunway TaihuLight, but its primary programming model is still bulk synchronous parallelism (BSP). The sequential task flow (STF) model uses dataflow information to parallelize serial programs into tasks automatically, and achieves asynchronous parallelism through fine-grained synchronization between tasks. Compared with the global synchronization of BSP, STF exposes more parallelism and balances load better, giving users a new option for using the Sunway platform efficiently. On many-core systems, however, the runtime overhead of the STF model directly affects the performance of parallel programs. This paper first analyzes two characteristics of the new-generation Sunway processor that affect an efficient implementation of the STF model. Then, exploiting features unique to the processor architecture, it proposes an agent-based dataflow graph construction mechanism to meet the model's graph-construction requirements, and a lock-free centralized task scheduling mechanism to reduce scheduling overhead. Finally, building on these techniques, it implements an efficient task flow parallel system for the AceMesh model. Experiments show that the system has a significant advantage over traditional runtime support, with a speedup of up to 2.37x in fine-grained task scenarios; AceMesh also outperforms the OpenACC model on the Sunway platform, with a speedup of up to 2.07x on typical applications.
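To make the STF idea concrete, the following is a minimal sketch of how task dependencies can be inferred from the data each task reads and writes, given only the sequential submission order. It is an illustration under stated assumptions, not the AceMesh API or the paper's runtime: the function names `build_task_graph` and `ready_waves`, the tuple layout, and the example tasks are all hypothetical, and the agent-based graph construction and lock-free scheduling described in the abstract are not modeled here.

```python
from collections import defaultdict


def build_task_graph(tasks):
    """Infer dependencies from data accesses (hypothetical helper).

    tasks: list of (name, reads, writes) in sequential submission order,
           with unique task names.
    Returns a dict mapping each task name to the set of tasks it must wait for.
    """
    last_writer = {}                         # data item -> last task that wrote it
    readers_since_write = defaultdict(list)  # data item -> readers since that write
    deps = {}
    for name, reads, writes in tasks:
        preds = set()
        for d in reads:
            if d in last_writer:             # read-after-write
                preds.add(last_writer[d])
        for d in writes:
            if d in last_writer:             # write-after-write
                preds.add(last_writer[d])
            preds.update(readers_since_write[d])  # write-after-read
        deps[name] = preds
        # Record this task's accesses for the tasks submitted after it.
        for d in reads:
            readers_since_write[d].append(name)
        for d in writes:
            last_writer[d] = name
            readers_since_write[d] = []
    return deps


def ready_waves(deps):
    """Group tasks into waves; tasks within one wave have no mutual dependency."""
    remaining = {t: set(p) for t, p in deps.items()}
    waves = []
    while remaining:
        ready = sorted(t for t, p in remaining.items() if not p)
        waves.append(ready)
        for t in ready:
            del remaining[t]
        for p in remaining.values():
            p.difference_update(ready)
    return waves


# Hypothetical serial program expressed as tasks over data items a, b, c, d.
tasks = [
    ("init_a", [], ["a"]),
    ("init_b", [], ["b"]),
    ("sum", ["a", "b"], ["c"]),
    ("norm", ["b"], ["d"]),
    ("update_a", ["c"], ["a"]),
]
deps = build_task_graph(tasks)
waves = ready_waves(deps)
```

In this example `sum` and `norm` fall into the same wave because they only share read access to `b`, while `update_a` must wait for `sum` (it reads `c` and overwrites `a`, which `sum` read). A real STF runtime would release each task as soon as its own predecessors finish rather than wave by wave; the waves are used here only to show the parallelism that the inferred graph exposes.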

Keywords: Sequential task flow model; Heterogeneous multi-core parallelism; Task scheduling; Dataflow parallelism; Bulk synchronous model

Fu You (傅游), Du Leiming (杜雷明), Gao Xiran (高希然), Chen Li (陈莉)

College of Computer Science and Engineering, Shandong University of Science and Technology, Qingdao, Shandong 266590

State Key Laboratory of Processors, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190

2024

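Computer Science (计算机科学)
Chongqing Southwest Information Co., Ltd. (formerly the Southwest Information Center of the Ministry of Science and Technology)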

Indexed in: CSTPCD; Peking University Core Journals
Impact factor: 0.944
ISSN:1002-137X
Year, volume (issue): 2024, 51(12)