Polynomial multiplication is one of the most time-consuming underlying operations in hardware implementations of lattice-based post-quantum public-key cryptography. This paper analyzes fast implementations of the number theoretic transform (NTT) for polynomial multiplication in CRYSTALS-Kyber and proposes a hardware-oriented fast NTT architecture based on preprocessing with a 2n-th root of unity. To reduce computation time, the architecture combines parallel processing of small-bit-width NTTs with low-complexity computations. Taking the characteristics of the algorithm into account, the overall computing architecture adopts a 32-way parallel design. On this basis, we design a unified computing unit that matches the architecture and a storage unit with conflict-free read and write access and optimal address assignment. In a 65 nm CMOS process, one polynomial multiplication with 256 terms and modulus 3329 can be completed in 108 cycles within 97 ns. The maximum operating frequency reaches 1.1 GHz, and the area-time product is 20.7 kGE×μs.
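
For readers unfamiliar with the idea of 2n-th root preprocessing, the following is a minimal software sketch of the generic textbook technique, not the hardware architecture proposed in the paper. It multiplies two polynomials in Z_q[x]/(x^n + 1) by weighting the coefficients with powers of a 2n-th root of unity ψ, applying a cyclic NTT with ω = ψ², multiplying pointwise, and undoing the transform and the weighting. The length n = 128 is an assumption made for illustration: modulo 3329 a primitive 256-th root of unity exists (17, as used by Kyber) but a primitive 512-th root does not, so this toy example does not reproduce Kyber's full 256-term multiplication. All function names are hypothetical, and `pow(x, -1, q)` requires Python 3.8 or later.

```python
Q = 3329               # Kyber modulus
N = 128                # assumed toy length (see lead-in)
PSI = 17               # primitive 2N-th root of unity mod Q: PSI**N = -1 (mod Q)
OMEGA = PSI * PSI % Q  # primitive N-th root of unity mod Q


def ntt(a, omega):
    """Recursive radix-2 cyclic NTT of a list whose length is a power of two."""
    n = len(a)
    if n == 1:
        return a[:]
    omega_sq = omega * omega % Q
    even = ntt(a[0::2], omega_sq)
    odd = ntt(a[1::2], omega_sq)
    out = [0] * n
    w = 1
    for k in range(n // 2):
        t = w * odd[k] % Q                # butterfly: twiddle times odd half
        out[k] = (even[k] + t) % Q
        out[k + n // 2] = (even[k] - t) % Q
        w = w * omega % Q
    return out


def negacyclic_mul(a, b):
    """Multiply a and b modulo (x^N + 1, Q) using 2N-th root preprocessing."""
    psi_pow = [pow(PSI, i, Q) for i in range(N)]
    # Preprocessing: weight coefficient i by PSI^i.
    a_w = [x * p % Q for x, p in zip(a, psi_pow)]
    b_w = [x * p % Q for x, p in zip(b, psi_pow)]
    # Pointwise product in the transform domain.
    c_hat = [x * y % Q for x, y in zip(ntt(a_w, OMEGA), ntt(b_w, OMEGA))]
    # Inverse NTT (unscaled), then scale by N^-1 and remove the PSI weights.
    c_w = ntt(c_hat, pow(OMEGA, -1, Q))
    n_inv = pow(N, -1, Q)
    psi_inv = pow(PSI, -1, Q)
    return [x * n_inv % Q * pow(psi_inv, i, Q) % Q for i, x in enumerate(c_w)]


def schoolbook_mul(a, b):
    """Reference O(N^2) multiplication modulo (x^N + 1, Q)."""
    c = [0] * N
    for i in range(N):
        for j in range(N):
            if i + j < N:
                c[i + j] += a[i] * b[j]
            else:
                c[i + j - N] -= a[i] * b[j]   # x^N wraps around to -1
    return [x % Q for x in c]
```

Comparing `negacyclic_mul(a, b)` against `schoolbook_mul(a, b)` on the same inputs is a simple way to check the sketch. A hardware design would typically replace the recursion with an in-place iterative schedule and fold the ψ weights into the butterfly twiddle factors.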