Abstract
High-throughput computing (HTC) is a computing paradigm that accomplishes jobs by breaking them into smaller, independent components. However, it requires a large amount of computing power over a long period. Most existing HTC frameworks are job-oriented and lack support for co-scheduling with the hardware architecture and for task-level execution. In addition, most of these frameworks reach only a limited scale, and their usability needs further improvement. Herein, we present HTDcr, a job execution framework for HTC on supercomputers. This study aims to improve the throughput, task dispatching, and usability of the framework. In detail, the throughput optimizations include a carefully designed task management system, a hierarchical scheduler, and the co-optimization of the task-scheduling strategy with the application and hardware characteristics. The usability optimizations include a programmable execution workflow, mechanisms for more robust and reliable quality of service, and a fine-grained resource allocation system for the colocation of multiple jobs. According to our evaluations, HTDcr achieves outstanding scalability and high throughput on large-scale clusters for HTC workloads. We evaluate HTDcr with several microbenchmarks and real-world applications on Tianhe-2 and Sunway TaihuLight to demonstrate the effects of its design mechanisms. For instance, task scheduling that takes the application and hardware characteristics into account achieves 1.7× and 1.9× speedups over the basic task-scheduling strategy for two real-world applications.
Funding
National Key R&D Program of China (2021YFB0301300)
National Natural Science Foundation of China (U1811461)
Zhejiang Lab (2021KC0AB04)
Major Program of Guangdong Basic and Applied Research (2019B030302002)
Program for Guangdong Introducing Innovative and Entrepreneurial Teams (2016ZT06D211)