NTRU格基密钥封装方案GPU高性能实现

扫码查看

原文链接

万方数据
维普

中文摘要：随着量子计算技术的发展,传统加密算法受到的威胁日益严重.为应对量子计算时代的挑战,各国正积极加强后量子密码算法的实现和迁移部署工作.由于NTRU密码方案具有结构简洁、计算效率高、尺寸较小、无专利风险等优点,因此NTRU格基密钥封装算法对于后量子时代的密码技术储备和应用具有重要意义.同时,图形处理器(Graphics Processing Unit,GPU)以其强大的并行计算能力、高吞吐量、低能耗等特性,已成为当前高并发密码工程实现的重要平台.本文给出后量子密码算法CTRU/CNTR的首个GPU高性能实现方案.对GPU主要资源占用进行分析,我们综合考虑并行计算、内存访问、数据布局和算法优化等多个方面,采用一系列计算和内存优化技术,旨在并行加速计算、优化访存、合理占用GPU资源以及减少I/O时延,从而提高本方案的计算能力和性能.本文的主要贡献在于以下几个方面:首先,针对模约减操作,使用NVIDIA并行指令集实现,有效减少所需指令条数;其次,针对耗时的多项式乘法模块,采用混合基NTT,并采用层融合、循环展开和延迟约减等方法,加快计算速度;此外,针对内存重复访问和冲突访问等问题,通过合并访存、核函数融合等优化技术,实现内存的高效访问;最后,为实现高并行的算法,设计恰当的线程块大小和数量,采用内存池机制,实现多任务的快速访存和高效处理.基于NVIDIA RTX409 0平台,本方案CTRU768实现中密钥生成、封装和解封装的吞吐量分别为每秒1170.9万次、926.7万次和315.4万次.与参考实现相比,密钥生成、封装和解封装的吞吐量分别提高了 336倍、174倍和128倍.本方案CNTR768实现中密钥生成、封装和解封装的吞吐量分别为每秒1117.3万次、971.8万次和322.2万次.与参考实现相比,密钥生成、封装和解封装的吞吐量分别提高了 329倍、175倍和134倍;与开源Kyber实现相比,密钥生成、密钥封装和密钥解封装的吞吐量分别提升10.84～11.36倍、9.49～9.95倍和5.11～5.22倍.高性能的密钥封装实现在大规模任务处理场景下具有较大的应用潜力,对保障后量子时代的信息和数据安全具有重要意义.

外文标题：Efficient GPU Implementation of NTRU Lattice-Based Key Encapsulation Mechanism

外文摘要：As quantum computing technology advances,the threats to traditional encryption al-gorithms are becoming more and more serious.This entails not only developing robust encryp-tion techniques resilient to quantum attacks but also implementing comprehensive strategies to ensure the security of sensitive data and communication channels in the post-quantum era.In or-der to meet the challenges of the quantum computing era,countries are actively strengthening the implementation,migration and deployment of post-quantum cryptographic algorithms.Since the NTRU cryptographic scheme has the advantages of simple structure,high computational efficien-cy,small size,and no patent risk,the NTRU lattice-based key encapsulation algorithm is of great significance to the reserve and application of cryptographic technology in the post-quantum era.At the same time,the Graphics Processing Unit(GPU),with its powerful parallel compu-ting capabilities,high throughput,low energy consumption and other characteristics,has become an important platform in the implementation of current high-concurrency cryptographic engineering.We propose the first efficient GPU implementation of the post-quantum cryptographic algorithm CTRU/CNTR.Taking into account multiple aspects such as parallel computing,memory access,data layout and algorithm optimization,the main resource occupancy of the GPU is analyzed.We use a series of computation and memory optimization techniques to accelerate parallel computing,optimize memory access,reasonably occupy GPU resources,and reduce I/O delays,thereby improving the computing power and performance of this solution.The main contributions of our work are as follows:First,for the modular reduction operation,it is implemented using the NVIDIA parallel instruction set,which effectively reduces the number of instructions required.Besides,for the time-consuming polynomial multiplication module,we employ a mixed-base Number-Theoretic Transform(NTT)and utilize methods such as layer fusion,loop unrolling,and delayed reduction to speed up calculations.Additionally,for problems such as repeated memory access and conflict access,efficient memory access is achieved through optimization technologies such as memory coalescing and kernel fusion.Finally,to achieve high parallel algorithm computation,we design appropriate thread block size,and design a memory pool mechanism to achieve rapid memory access and efficient processing of multi-tasks.Based on the RTX 4090,our implementation of CTRU768 demonstrates throughputs of 11709k,9267k and 3154k operations per second for key generation,encapsulation and decapsula-tion,respectively.Compared with the C language reference implementation,the throughput of key generation,encapsulation and decapsulation is increased by 336 ×,174 × and 128 ×,respec-tively.CNTR768 demonstrates throughputs of 11173k,9718k and 3222k operations per second for key generation,encapsulation and decapsulation,respectively.Compared with the C language reference implementation,the throughput of key generation,encapsulation and decapsulation is increased by 329 ×,175 × and 134 ×,respectively.Compared with the latest open source Kyber implementation,the throughput of key generation,key encapsulation and key decapsulation is increased by 10.84～11.36 ×,9.49～9.95 × and 5.11～5.22 × respectively.High-performance key en-capsulation implementation by this work is the key to ensuring the smooth operation of large-capacity ap-plications,and is helpful to ensure information and data security in the post-quantum era.

外文关键词：

post-quantum cryptographylattice-based cryptographykey encapsulation mecha-nismparallel processingGraphics Processing Units(GPU)

作者：

李文倩、沈诗羽、赵运磊

展开 >

作者单位：

复旦大学计算机科学技术学院上海 200433

密码科学技术全国重点实验室北京 100878

关键词：

后量子密码格基密码密钥封装方案并行处理图形处理器

基金：

国家重点研发计划基金资助项目密码科学技术国家重点实验室面上课题基金资助项目上海市科委技术标准基金资助项目上海市协同创新基金资助项目上海市科委区块链关键技术攻关专项基金资助项目

项目编号：

2022YFB2701601MMKFKT20222721DZ2200500XTCX-KJ-2023-5423511100300

出版年：

2024

DOI：

10.11897/SP.J.1016.2024.02163

计算机学报

中国计算机学会中国科学院计算技术研究所

计算机学报

CSTPCD北大核心

影响因子：3.18

ISSN：0254-4164

年,卷(期)：2024.47(9)