Large Language Model-Specific Hardware Architecture Based on Integrated Compute-in-Memory Chips
Artificial intelligence (AI) models, represented by ChatGPT, are exhibiting exponential growth in parameter count and in the system computing power they require. This work studies a dedicated hardware architecture for large models and provides a detailed analysis of the memory-bandwidth bottleneck that large models face during deployment, as well as the significant impact this challenge has on current data centers. To address the issue, a solution based on integrated compute-in-memory (CIM) chiplets is proposed, aiming to relieve data-transfer pressure and improve the energy efficiency of large-model inference. In addition, the feasibility of a lightweight in-memory compression co-design under the compute-in-memory architecture is studied, so that sparse networks can be mapped densely onto the integrated CIM hardware, thereby significantly improving storage density and computational energy efficiency.
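To make the bandwidth bottleneck concrete, a rough back-of-the-envelope bound (the bandwidth and model-size figures below are illustrative assumptions, not numbers taken from the paper): in batch-1 autoregressive decoding, generating each token requires streaming essentially every weight from memory once, so token throughput is capped by bandwidth rather than by compute:

\[
\text{throughput} \;\le\; \frac{B_{\mathrm{mem}}}{b \cdot N},
\qquad \text{e.g.} \quad
\frac{2\times 10^{12}\,\mathrm{B/s}}{2\,\mathrm{B}\times 7\times 10^{9}} \approx 140\ \text{tokens/s},
\]

where \(B_{\mathrm{mem}}\) is the memory bandwidth, \(b\) the bytes per weight, and \(N\) the parameter count. At the same assumed 2 TB/s of HBM bandwidth, a GPT-3-scale model (\(N \approx 1.75\times 10^{11}\)) yields only a few tokens per second, far below the accelerator's compute ceiling, which is why large-model inference is bandwidth-bound and why moving computation into the memory arrays is attractive.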
Keywords: large language model; compute-in-memory; chiplet; in-memory compression
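As a minimal sketch of the dense-mapping idea behind the in-memory compression co-design described above (a conceptual illustration only; the helper functions and packing scheme here are hypothetical and are not the paper's actual design): a pruned weight matrix contains many all-zero rows, and packing only the surviving rows into the CIM crossbar, together with a small index map that routes the matching inputs onto the word lines, raises the effective storage density of the array.

import numpy as np

def pack_sparse_rows(w, tol=0.0):
    """Pack the nonzero rows of a pruned weight matrix densely.

    Returns the packed (dense) matrix and the indices of the kept
    rows, which a CIM controller would use to gate the matching
    inputs. Hypothetical helper for illustration only.
    """
    keep = np.flatnonzero(np.abs(w).sum(axis=1) > tol)  # surviving rows
    return w[keep], keep

def cim_matvec(packed_w, keep, x):
    """Emulate a dense CIM crossbar multiply on the packed matrix.

    Only the inputs matching the surviving rows are driven onto the
    word lines; pruned rows consume no array area or read energy.
    """
    return packed_w.T @ x[keep]

# Toy check: a layer with 90% of its rows pruned maps onto a
# roughly 10x smaller crossbar while producing the same output.
rng = np.random.default_rng(0)
w = rng.standard_normal((512, 256))
w[rng.random(512) < 0.9] = 0.0          # row pruning
packed, keep = pack_sparse_rows(w)
x = rng.standard_normal(512)
assert np.allclose(w.T @ x, cim_matvec(packed, keep, x))
print(f"rows stored: {packed.shape[0]} / {w.shape[0]}")

The key design point this sketch reflects is that the compression must stay lightweight: the index map is the only metadata the controller keeps, so the decompression cost at inference time is a simple input-gating step rather than a full decode.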