面向DCU的LDS访存向量化优化

Vectorization Optimization of LDS Memory Access for DCU

Yang Sichi¹, Zhao Rongcai², Han Lin², Wang Hongsheng³

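Author information

1. School of Computer and Artificial Intelligence, Zhengzhou University, Zhengzhou 450000, Henan, China
2. School of Computer and Artificial Intelligence, Zhengzhou University, Zhengzhou 450000, Henan, China; National Supercomputing Center in Zhengzhou, Zhengzhou 450000, Henan, China
3. National Supercomputing Center in Zhengzhou, Zhengzhou 450000, Henan, China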
Abstract

In the Deep Computing Unit (DCU), a domestic general-purpose accelerator, the Local Data Shared (LDS) is a key storage component with lower latency and higher bandwidth than global memory. As heterogeneous programs use the LDS more frequently, inefficient LDS access has become an important factor limiting their performance. In addition, because LDS accesses are subject to bank conflicts, they must follow certain rules to use the LDS efficiently; when the data accessed by different threads overlap, vectorized access instructions incur additional latency. To address this problem, an LDS memory access vectorization optimization method for the DCU is proposed. By vectorizing accesses to contiguous data, the method reduces the number of LDS accesses and the memory access time, thereby improving the memory access efficiency of the program. On this basis, a method for identifying memory access characteristics is designed, and an LDS access vectorization method that effectively handles overlapping data is proposed, yielding an efficient LDS memory access technique for domestic general-purpose accelerators and ensuring that vectorization actually improves memory access efficiency. Experimental results show that, for heterogeneous programs that use the LDS, program performance improves by 22.6% on average after LDS access vectorization is applied, which verifies the effectiveness of the proposed method. The vectorization method also optimizes the case in which the data accessed by different threads in the LDS overlap, improving the performance of heterogeneous programs by 30% on average.
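The core idea of vectorizing contiguous LDS accesses can be illustrated with a small host-side model. This is only a sketch under stated assumptions, not the authors' implementation: the names `AccessCount` and `readLds` are hypothetical, the LDS is modeled as a plain array, and the model counts load instructions only, ignoring bank conflicts and the overlapping-access case discussed in the abstract.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side model (not device code): it only counts how many
// load instructions are needed to read a buffer that stands in for the LDS.
struct AccessCount {
    std::size_t instructions;  // number of modeled LDS load instructions
    float checksum;            // sum of all values read, to compare results
};

// Each of `threads` work-items reads `perThread` consecutive floats.
// One modeled load instruction moves up to `vectorWidth` elements
// (vectorWidth == 1 is the scalar case, 4 models a 128-bit vector load).
AccessCount readLds(const std::vector<float>& lds,
                    std::size_t threads,
                    std::size_t perThread,
                    std::size_t vectorWidth) {
    assert(vectorWidth > 0);
    assert(lds.size() >= threads * perThread);

    AccessCount result{0, 0.0f};
    for (std::size_t t = 0; t < threads; ++t) {
        const std::size_t base = t * perThread;  // contiguous slice per thread
        for (std::size_t i = 0; i < perThread; i += vectorWidth) {
            ++result.instructions;  // one instruction per vector of elements
            for (std::size_t k = 0; k < vectorWidth && i + k < perThread; ++k) {
                result.checksum += lds[base + i + k];
            }
        }
    }
    return result;
}
```

In this model, 64 threads that each read 4 consecutive floats issue 256 scalar loads but only 64 loads at a vector width of 4, while reading exactly the same data. On real hardware the achievable gain also depends on alignment and on bank conflicts, which this sketch does not capture.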

Key words

Deep Computing Unit (DCU) / Local Data Shared (LDS) / memory access vectorization / memory access characteristic / bank conflict

Funding

Major Science and Technology Project of Henan Province (221100210600)

Publication year

2024

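Computer Engineering (计算机工程)

East China Institute of Computing Technology; Shanghai Computer Society

Indexed in: CSTPCD, Peking University Core Journals

Impact factor: 0.581

ISSN: 1000-3428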

Citations: 1

References: 10
