Dense linear solver on many-core CPUs: Characterization and optimization
The dense linear solver plays a vital role in high-performance computing and machine learning. Typical parallel implementations are built upon the well-known fork-join or task-based programming model. Although mainstream dense linear algebra libraries adopting the fork-join paradigm can shift most of the computation to well-tuned, high-performance BLAS 3 routines, they fail to exploit many-core CPUs efficiently due to the rigid execution stream of fork-join. While open-source implementations employing the task-based paradigm can deliver more promising performance thanks to the model's malleability and better load balance, they still leave much room for optimization on many-core platforms, especially for medium-sized matrices. In this paper, a quantitative characterization of the dense linear solver is carried out to locate performance bottlenecks, and a series of optimizations is proposed to deliver higher performance. Specifically, idle threads are reduced by merging LU factorization with the following lower triangular solve to improve parallelism. Moreover, duplicated matrix packing operations are reduced to lower memory overhead. Performance evaluation is conducted on two modern many-core platforms: Intel® Xeon® Gold 6252N (48 cores) and HiSilicon Kunpeng 920 (64 cores). Evaluation results show that our optimized solver outperforms the state-of-the-art open-source implementation by up to 10.05% (Xeon) and 13.63% (Kunpeng 920) on the two platforms, respectively.
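To make the abstract concrete, the sketch below (not the paper's implementation) shows the two stages the abstract refers to: LU factorization with partial pivoting followed by triangular solves, i.e. the `getrf`/`getrs` routine pair that fork-join libraries dispatch to BLAS/LAPACK. The unblocked, sequential formulation here is illustrative only; NumPy availability and the function names `lu_factor`/`lu_solve` are assumptions for this example.

```python
import numpy as np

def lu_factor(A):
    """LU factorization with partial pivoting: P A = L U.

    Returns the packed LU matrix (L strictly below the diagonal with a
    unit diagonal implied, U on and above it) and the pivot permutation."""
    A = A.astype(float).copy()
    n = A.shape[0]
    piv = np.arange(n)
    for k in range(n):
        p = k + np.argmax(np.abs(A[k:, k]))      # partial pivoting
        if p != k:
            A[[k, p]] = A[[p, k]]
            piv[[k, p]] = piv[[p, k]]
        A[k+1:, k] /= A[k, k]                    # column of L
        A[k+1:, k+1:] -= np.outer(A[k+1:, k], A[k, k+1:])  # trailing update
    return A, piv

def lu_solve(LU, piv, b):
    """Solve A x = b from the packed LU factors and pivots."""
    n = LU.shape[0]
    x = b[piv].astype(float)
    for k in range(n):                           # forward solve: L y = P b
        x[k+1:] -= LU[k+1:, k] * x[k]
    for k in range(n - 1, -1, -1):               # backward solve: U x = y
        x[k] = (x[k] - LU[k, k+1:] @ x[k+1:]) / LU[k, k]
    return x
```

In the baseline organization shown here, the lower triangular solve cannot start until the factorization has fully completed; the paper's optimization instead interleaves the two stages so that threads left idle near the end of the factorization can begin the solve.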
Keywords: dense linear solver, LU factorization, fork-join model, task-based model, many-core CPU