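ScaLAPACK (Scalable Linear Algebra PACKage) is a parallel computing software package for distributed-memory MIMD (Multiple Instruction, Multiple Data) parallel computers and is widely used to develop parallel applications built on linear algebra operations. In LU factorization, however, the routines in the ScaLAPACK library are not communication-optimal and do not make full use of current parallel architectures. To address this, a parallel LU factorization optimization algorithm for the Kunpeng processor (Parallel LU Factorization, PLF) is proposed, which achieves load balancing and is adapted to the domestic Kunpeng environment. PLF treats the data in different partitions of different processes differently: part of the data held by each process is assigned to the root process for computation, and the root process then scatters the results back to the child processes, which helps make full use of CPU resources and achieve load balancing. Tests were run on a single node with an Intel 9320R processor and with a Kunpeng 920 processor, using Intel MKL (Math Kernel Library) on the Intel platform and the PLF algorithm on the Kunpeng platform. Comparing the two platforms on systems of equations of different sizes shows that the Kunpeng platform has a significant performance advantage. With the number of processes equal to the number of NUMA nodes and a single thread per process, the optimized computational efficiency averages 4.35% at small scale, 215% higher than Intel's 1.38%; 4.24% at medium scale, 118% higher than the Intel platform's 1.86%; and 4.24% at large scale, 113% higher than Intel's 1.99%.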
LU Parallel Decomposition Optimization Algorithm Based on Kunpeng Processor
Scalable Linear Algebra PACKage (ScaLAPACK) is a parallel computing package for distributed-memory MIMD (multiple instruction, multiple data) parallel computers. It is widely used to develop parallel applications based on linear algebra operations. However, during LU decomposition, the routines in the ScaLAPACK library are not communication-optimal and do not take full advantage of current parallel architectures. To solve these problems, a parallel LU factorization optimization algorithm (PLF) based on the Kunpeng processor is proposed to achieve load balancing and to suit the domestic Kunpeng environment. PLF handles the data in different partitions of different processes differently. It allocates part of each process's data to the root process for calculation. After the calculation is completed, the root process scatters the data back to each sub-process, which helps to fully utilize CPU resources and achieve load balancing. Tests are performed on a single node with Intel 9320R processors and with Kunpeng 920 processors. Intel MKL (Math Kernel Library) is used on the Intel platform, and the PLF algorithm is used on the Kunpeng platform. A comparison of the two platforms on systems of equations of different scales shows that the Kunpeng platform has a significant performance advantage over the Intel platform. With as many processes as NUMA nodes and a single thread per process, the optimized computing efficiency averages 4.35% at small scale, which is 215% higher than Intel's 1.38%. The medium-scale average reaches 4.24%, compared with 1.86% on the Intel platform, an increase of 118%. The large-scale average reaches 4.24%, compared with Intel's 1.99%, an increase of 113%.
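The gather, compute, scatter pattern described above can be illustrated with a minimal single-process sketch. It is not the PLF implementation: the function names (`gather_to_root`, `lu_panel`, `scatter_from_root`, `factor_on_root`) are hypothetical, the "processes" are simulated as plain lists of rows, and it is assumed that the data sent to the root is a tall column panel factored with partial pivoting. In a real distributed code the two helper steps would presumably map onto MPI collectives such as `MPI_Gatherv` and `MPI_Scatterv`.

```python
def gather_to_root(chunks):
    """Simulated gather: the root copies every process's panel rows, in rank order."""
    return [row[:] for chunk in chunks for row in chunk]


def lu_panel(a):
    """In-place LU with partial pivoting on an m x b panel (m >= b).

    Afterwards the strict lower part of `a` holds the multipliers (L, with an
    implicit unit diagonal) and the top b x b upper triangle holds U.
    Returns `perm`, where factored row i came from gathered row perm[i].
    """
    m, b = len(a), len(a[0])
    perm = list(range(m))
    for k in range(b):
        # Pivot: the entry of largest magnitude in column k, at or below row k.
        p = max(range(k, m), key=lambda i: abs(a[i][k]))
        if a[p][k] == 0.0:
            raise ValueError("panel is rank deficient")
        a[k], a[p] = a[p], a[k]
        perm[k], perm[p] = perm[p], perm[k]
        for i in range(k + 1, m):
            a[i][k] /= a[k][k]                    # multiplier l_ik
            for j in range(k + 1, b):
                a[i][j] -= a[i][k] * a[k][j]      # update the trailing entries
    return perm


def scatter_from_root(rows, sizes):
    """Simulated scatter: split the root's rows back into per-process chunks."""
    out, start = [], 0
    for size in sizes:
        out.append(rows[start:start + size])
        start += size
    return out


def factor_on_root(chunks):
    """Gather the panel on the root, factor it there, scatter the result back."""
    sizes = [len(chunk) for chunk in chunks]
    panel = gather_to_root(chunks)
    perm = lu_panel(panel)
    return scatter_from_root(panel, sizes), perm


if __name__ == "__main__":
    # Three simulated processes owning 2, 1 and 2 rows of a 5 x 2 panel.
    local = [[[2.0, 1.0], [4.0, 3.0]], [[8.0, 7.0]], [[6.0, 5.0], [1.0, 1.0]]]
    factored, perm = factor_on_root(local)
    print(perm)
    print(factored)
```

Factoring the whole panel on one process keeps the pivot search free of per-column communication, at the cost of one gather and one scatter per panel; whether PLF makes the same trade-off in the same way is not stated in the abstract.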