Efficient Task Flow Parallel System for the New-Generation Sunway Processor
China's independently developed next-generation Sunway supercomputer has a more powerful memory system and higher computational density than its predecessor, Sunway TaihuLight. Its primary programming model remains the bulk synchronous parallel (BSP) model. The sequential task flow (STF) model uses data-flow information to automatically parallelize serial programs into tasks, and it achieves asynchronous parallelism through fine-grained synchronization between tasks. Compared with the global synchronization of the BSP model, STF offers higher parallelism and better load balance, giving users a new option for using the Sunway platform efficiently. On many-core systems, however, the runtime overhead of the STF model directly affects the performance of parallel programs. This paper first analyzes two characteristics of the new Sunway processor that affect the efficient implementation of the STF model. Then, exploiting the unique features of the processor architecture, it proposes an agent-based dataflow graph construction mechanism to meet the modeling requirements and a lock-free centralized task scheduling mechanism to reduce scheduling overhead. Finally, building on these techniques, it implements an efficient task flow parallel system for the AceMesh model. Experiments show that the implemented system has significant advantages over traditional runtime support, achieving a maximum speedup of 2.37 times in fine-grained task scenarios. On the Sunway platform, AceMesh also outperforms the OpenACC model, with a maximum speedup of 2.07 times on typical applications.
Keywords: Sequential task flow model; Heterogeneous multi-core parallelism; Task scheduling; Dataflow parallelism; Bulk synchronous model
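To make the STF idea in the abstract concrete, the following is a minimal sketch, assuming each task declares the data it reads and writes when it is submitted in program order. The runtime infers read-after-write, write-after-read and write-after-write dependencies and builds a dataflow graph. Tasks with no unfinished predecessors can run without any global barrier. The class `STFGraph` and its methods `submit` and `schedule` are hypothetical illustrations, not the AceMesh API. The sketch also does not model the paper's agent-based graph construction or its lock-free centralized scheduler. A single FIFO ready queue stands in for the scheduler.

```python
from collections import deque


class STFGraph:
    """Hypothetical STF runtime sketch: builds a task DAG from sequential submissions."""

    def __init__(self):
        self.succ = {}         # task -> set of successor tasks
        self.npred = {}        # task -> number of predecessor tasks
        self.last_writer = {}  # datum -> last task that wrote it
        self.readers = {}      # datum -> tasks that read it since its last write

    def _edge(self, src, dst):
        # Add each dependency edge once so predecessor counts stay exact.
        if src != dst and dst not in self.succ[src]:
            self.succ[src].add(dst)
            self.npred[dst] += 1

    def submit(self, name, reads=(), writes=()):
        """Register a task in program order and infer its dependencies."""
        self.succ[name] = set()
        self.npred[name] = 0
        for d in reads:                       # read-after-write
            if d in self.last_writer:
                self._edge(self.last_writer[d], name)
        for d in writes:
            for r in self.readers.get(d, []):  # write-after-read
                self._edge(r, name)
            if d in self.last_writer:          # write-after-write
                self._edge(self.last_writer[d], name)
        for d in reads:
            self.readers.setdefault(d, []).append(name)
        for d in writes:
            self.last_writer[d] = name
            self.readers[d] = []

    def schedule(self):
        """Return one dependency-respecting order (Kahn's algorithm).

        A real runtime would dispatch every ready task to idle cores
        concurrently; here ready tasks are popped from one FIFO queue.
        """
        npred = dict(self.npred)
        ready = deque(t for t, n in npred.items() if n == 0)
        order = []
        while ready:
            t = ready.popleft()
            order.append(t)
            for s in sorted(self.succ[t]):
                npred[s] -= 1
                if npred[s] == 0:
                    ready.append(s)
        return order


g = STFGraph()
g.submit("init_a", writes=["A"])
g.submit("init_b", writes=["B"])
g.submit("stencil", reads=["A", "B"], writes=["C"])
g.submit("reuse_a", writes=["A"])  # must wait until "stencil" has read A
g.submit("reduce", reads=["C"])
print(g.schedule())  # → ['init_a', 'init_b', 'stencil', 'reduce', 'reuse_a']
```

Here `init_a` and `init_b` share no data, so both are ready at once. `reuse_a` and `reduce` become ready as soon as `stencil` finishes. Each task therefore waits only for the tasks it actually depends on, not for a global synchronization point as in BSP. The abstract notes that on many-core systems the cost of this graph bookkeeping and of the ready-queue operations is what the runtime must keep small.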