Accelerating Distributed Inference of Large Language Models in Low-Resource Clusters
A distributed inference paradigm for large language models (LLMs) with stronger parallelism and better compatibility is explored, designed for environments with weak computing power and small memory. An efficient tree-based All-Reduce collective communication technique is designed to exploit the differing bandwidths inside and outside the host, and a fine-grained memory management and scheduling technique is designed for small-memory clusters. Based on these key techniques, an LLM inference software system for resource-constrained scenarios is built, aiming to maximize the size of the LLMs that can be served on a limited number of low-resource devices, while accelerating distributed inference through optimized communication strategies and computation scheduling. Experiments demonstrate that with these techniques, first-token generation latency is reduced by 34%~61%, token generation throughput is increased by 52%~150%, and memory occupation is reduced by 61%.
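The abstract does not spell out how the tree-based All-Reduce works; the following is a minimal, illustrative sketch (not the paper's implementation) of the underlying two-level idea: values are first reduced over the fast intra-host links, then only one partial sum per host is exchanged over the slower inter-host links, and the global result is broadcast back. All names (`cluster`, `hierarchical_all_reduce`, etc.) are hypothetical.

```python
from typing import List

Vector = List[float]

def intra_host_reduce(device_vectors: List[Vector]) -> Vector:
    """Sum the vectors of all devices on one host (fast local links)."""
    return [sum(vals) for vals in zip(*device_vectors)]

def inter_host_reduce(host_partials: List[Vector]) -> Vector:
    """Sum per-host partial results over the slow cross-host links.
    Only one message per host crosses the network, instead of one per
    device, which is the point of the two-level communication tree."""
    return [sum(vals) for vals in zip(*host_partials)]

def hierarchical_all_reduce(cluster: List[List[Vector]]) -> List[List[Vector]]:
    """Two-level All-Reduce: intra-host reduce -> inter-host reduce -> broadcast."""
    host_partials = [intra_host_reduce(host) for host in cluster]
    total = inter_host_reduce(host_partials)
    # Broadcast the global sum back to every device on every host.
    return [[list(total) for _ in host] for host in cluster]

if __name__ == "__main__":
    # Two hosts with two devices each, holding 3-element vectors.
    cluster = [
        [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
        [[0.5, 0.5, 0.5], [1.5, 1.5, 1.5]],
    ]
    result = hierarchical_all_reduce(cluster)
    print(result[0][0])  # [7.0, 9.0, 11.0] on every device
```

The design choice this sketch illustrates is bandwidth awareness: cross-host traffic shrinks from one message per device to one per host, which matters when inter-host bandwidth is far lower than intra-host bandwidth.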
LLM distributed inference paradigm; resource-constrained scenarios; communication and computation scheduling optimization
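For the fine-grained memory management and scheduling technique, the abstract likewise gives no details; a plausible minimal sketch, under the assumption of layer-wise weight residency with LRU eviction, is shown below. Only a small window of transformer layers is kept in device memory, and weights are loaded just before use and evicted afterward. `LayerScheduler`, `load_layer`, and `max_resident` are illustrative assumptions, not names from the paper.

```python
from collections import OrderedDict
from typing import Callable, Dict

class LayerScheduler:
    def __init__(self, load_layer: Callable[[int], Dict], max_resident: int = 2):
        self.load_layer = load_layer       # loads weights from disk/host RAM
        self.max_resident = max_resident   # memory budget, in layers
        self.resident: "OrderedDict[int, Dict]" = OrderedDict()

    def acquire(self, layer_id: int) -> Dict:
        """Return the weights of `layer_id`, evicting the least recently
        used layer if the residency budget would be exceeded."""
        if layer_id not in self.resident:
            if len(self.resident) >= self.max_resident:
                self.resident.popitem(last=False)  # evict oldest layer
            self.resident[layer_id] = self.load_layer(layer_id)
        self.resident.move_to_end(layer_id)
        return self.resident[layer_id]

if __name__ == "__main__":
    fake_store = {i: {"w": f"weights-of-layer-{i}"} for i in range(8)}
    sched = LayerScheduler(load_layer=fake_store.__getitem__, max_resident=2)
    for layer in range(8):                 # forward pass, layer by layer
        weights = sched.acquire(layer)
        # ... run this layer's computation with `weights` here ...
    print(sorted(sched.resident))          # only the last 2 layers resident
```

Capping residency at a fixed number of layers is one way such a scheduler could bound peak memory on small-memory devices while trading extra weight-loading traffic for capacity.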