Efficient GPU Implementation of NTRU Lattice-Based Key Encapsulation Mechanism
As quantum computing technology advances,the threats to traditional encryption al-gorithms are becoming more and more serious.This entails not only developing robust encryp-tion techniques resilient to quantum attacks but also implementing comprehensive strategies to ensure the security of sensitive data and communication channels in the post-quantum era.In or-der to meet the challenges of the quantum computing era,countries are actively strengthening the implementation,migration and deployment of post-quantum cryptographic algorithms.Since the NTRU cryptographic scheme has the advantages of simple structure,high computational efficien-cy,small size,and no patent risk,the NTRU lattice-based key encapsulation algorithm is of great significance to the reserve and application of cryptographic technology in the post-quantum era.At the same time,the Graphics Processing Unit(GPU),with its powerful parallel compu-ting capabilities,high throughput,low energy consumption and other characteristics,has become an important platform in the implementation of current high-concurrency cryptographic engineering.We propose the first efficient GPU implementation of the post-quantum cryptographic algorithm CTRU/CNTR.Taking into account multiple aspects such as parallel computing,memory access,data layout and algorithm optimization,the main resource occupancy of the GPU is analyzed.We use a series of computation and memory optimization techniques to accelerate parallel computing,optimize memory access,reasonably occupy GPU resources,and reduce I/O delays,thereby improving the computing power and performance of this solution.The main contributions of our work are as follows:First,for the modular reduction operation,it is implemented using the NVIDIA parallel instruction set,which effectively reduces the number of instructions required.Besides,for the time-consuming polynomial multiplication module,we employ a mixed-base Number-Theoretic Transform(NTT)and utilize methods such as layer fusion,loop unrolling,and delayed reduction to speed up calculations.Additionally,for problems such as repeated memory access and conflict access,efficient memory access is achieved through optimization technologies such as memory coalescing and kernel fusion.Finally,to achieve high parallel algorithm computation,we design appropriate thread block size,and design a memory pool mechanism to achieve rapid memory access and efficient processing of multi-tasks.Based on the RTX 4090,our implementation of CTRU768 demonstrates throughputs of 11709k,9267k and 3154k operations per second for key generation,encapsulation and decapsula-tion,respectively.Compared with the C language reference implementation,the throughput of key generation,encapsulation and decapsulation is increased by 336 ×,174 × and 128 ×,respec-tively.CNTR768 demonstrates throughputs of 11173k,9718k and 3222k operations per second for key generation,encapsulation and decapsulation,respectively.Compared with the C language reference implementation,the throughput of key generation,encapsulation and decapsulation is increased by 329 ×,175 × and 134 ×,respectively.Compared with the latest open source Kyber implementation,the throughput of key generation,key encapsulation and key decapsulation is increased by 10.84~11.36 ×,9.49~9.95 × and 5.11~5.22 × respectively.High-performance key en-capsulation implementation by this work is the key to ensuring the smooth operation of large-capacity ap-plications,and is helpful to ensure information and data security in the post-quantum era.