A Set of Performance Profiling Tools for the General-Purpose Processors of the TianHe New-Generation Supercomputing System
The TianHe new-generation supercomputer system is the next generation of the TianHe series after TianHe-2. The system is expected to adopt a hybrid heterogeneous architecture of general-purpose processors and accelerators, in which the general-purpose processors use the ARM architecture. At present, performance profiling tools for the ARM architecture are still immature, and those for new-generation supercomputers are even scarcer; their practicability and efficiency do not yet meet programmers' needs. For the general-purpose processors of the TianHe new-generation supercomputing system, this paper designs and develops a set of performance profiling tools covering cache conflict detection, false sharing detection, and memory defect detection. The tool set can analyze performance problems of single-node programs and of multi-node programs with high data parallelism, and can diagnose memory problems of programs while running with only ordinary-user privileges on the system. In particular, the performance problems addressed in this paper mainly concern the cache, which is invisible to programmers. This makes the work significant, because cache-induced performance problems are hard for programmers to uncover by inspecting their code alone. The memory defect detection tool detects five classes of problems: access to invalid or illegal address space, use after free, reads of uninitialized memory, double free, and memory leaks. Several performance optimization strategies, including min-write, cache-line alignment padding, and thread access isolation, are used to improve tool performance, making the tools 1.2 to 20 times faster than the unoptimized versions. Meanwhile, a novel red-zone detection method and a red-zone hiding and recovery mechanism are used to reduce the rate of false errors reported by the tool. The red-zone detection method places a red zone at the end of each allocated memory region to detect memory access errors. Its design follows from a common pattern in how programmers write code: out-of-bounds array accesses are usually concentrated near the array boundary. The red-zone hiding and recovery mechanism avoids false errors during consecutive memory allocations, further reducing the tool's false error rate. This paper also develops a supporting visual interface with which users can visually analyze and process program performance data, improving the utility and usability of the tools. In the experiments, we use the tool set to find severe cache contention in OCEAN-ncp from SPLASH-3, a well-known parallel benchmark suite, which reveals a large hidden optimization opportunity. We then use the false sharing detection tool to pinpoint the exact context and source line numbers where false sharing occurs and causes severe performance degradation. By combining this information and fully exploiting the opportunity, we achieve a 3x speedup of the parallel program OCEAN-ncp. The time and space overheads of the tools are about 40~100x and 100~200x, respectively. The tools offer moderate overhead, correctness, and practicability, and can improve programming efficiency and program performance on the TianHe new-generation supercomputer system.
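The idea of a red zone at the end of an allocation can be illustrated with a minimal C sketch. It is not the tool's implementation: the abstract does not say how accesses to the red zone are caught, so the sketch assumes the simplest variant, a fill pattern that is verified when the block is checked or freed, which only catches writes past the end. The names `rz_malloc`, `rz_check`, `rz_free`, `RZ_SIZE`, and `RZ_PATTERN` are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define RZ_SIZE 16        /* hypothetical red-zone length in bytes */
#define RZ_PATTERN 0xAB   /* hypothetical fill byte marking the red zone */

/* Header placed in front of the user block so its size can be recovered.
 * The max_align_t member keeps the user pointer suitably aligned. */
typedef union {
    size_t user_size;
    max_align_t align;
} rz_header;

/* Allocate `size` user bytes followed by an RZ_SIZE-byte red zone. */
void *rz_malloc(size_t size) {
    if (size > SIZE_MAX - sizeof(rz_header) - RZ_SIZE)
        return NULL;                       /* total size would overflow */
    unsigned char *raw = malloc(sizeof(rz_header) + size + RZ_SIZE);
    if (raw == NULL)
        return NULL;
    ((rz_header *)raw)->user_size = size;
    unsigned char *user = raw + sizeof(rz_header);
    memset(user + size, RZ_PATTERN, RZ_SIZE);  /* mark the red zone */
    return user;
}

/* Return the number of red-zone bytes that no longer hold the pattern. */
size_t rz_check(const void *user) {
    assert(user != NULL);
    const unsigned char *u = user;
    const rz_header *h = (const rz_header *)(u - sizeof(rz_header));
    size_t bad = 0;
    for (size_t i = 0; i < RZ_SIZE; i++) {
        if (u[h->user_size + i] != RZ_PATTERN)
            bad++;
    }
    return bad;
}

/* Free the block; return 1 if the red zone was overwritten, else 0. */
int rz_free(void *user) {
    if (user == NULL)
        return 0;
    int overflow = rz_check(user) != 0;
    free((unsigned char *)user - sizeof(rz_header));
    return overflow;
}
```

A caller would allocate with `rz_malloc` in place of `malloc` and treat a nonzero result from `rz_check` or `rz_free` as evidence of a write just past the end of the block, the boundary case the abstract identifies as the most common. Placing the red zone only at the end keeps the per-allocation space cost to the header plus `RZ_SIZE` bytes in this sketch.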