Configurable Arithmetic and Hash Modules for ML-DSA and ML-KEM Standards

Abstract—In light of the emerging threat that quantum computers pose to current cryptographic standards, the U.S. National Institute of Standards and Technology (NIST) has introduced the Module-Lattice-Based Key-Encapsulation Mechanism (ML-KEM) and Module-Lattice-Based Digital Signature (ML-DSA) standards. These standards are derived from the lattice-based cryptosystems CRYSTALS-Kyber and CRYSTALS-Dilithium, whose security rests on the Module Learning With Errors (MLWE) assumption. Efficient hash and arithmetic modules are essential for implementing these schemes, because hashing and polynomial arithmetic are their most time-consuming phases. This paper presents a high-performance hardware architecture that optimizes and unifies the hash and arithmetic modules to accelerate Kyber and Dilithium implementations on FPGA. Our polynomial arithmetic module achieves speedups of up to 1.8× and 2.1× for the NTT/INTT of Dilithium and Kyber, respectively, compared to the high-speed software implementation on a Cortex-A72 CPU. Additionally, our Keccak-based hash module provides four distinct modes to support both cryptographic schemes at a comparable hardware cost.

Index Terms—Post-quantum cryptography, number theoretic transform, digital signature, key-encapsulation mechanism.

I. INTRODUCTION

The threat of quantum computers is steadily advancing, posing challenges to current cryptographic standards in terms of their capacity and implementation. Current public-key cryptography relies on the mathematical difficulty of factoring integers and solving discrete logarithm problems, so it could be broken by sufficiently powerful quantum computers. After the long process of the NIST Post-Quantum Cryptography (PQC) Standardization Project, two lattice-based algorithms, CRYSTALS-Kyber and CRYSTALS-Dilithium, have been designated as the primary choices for standardizing KEM and digital signatures [1].
These algorithms serve as the foundation for NIST's ML-DSA and ML-KEM standards, published as initial public drafts in August 2023 [2][3]. Both systems have a lattice-based structure, which necessitates computationally expensive polynomial arithmetic operations. The traditional schoolbook method is inefficient for these operations. To address the high complexity of polynomial multiplication, lattice-based cryptosystems employ the Number Theoretic Transform (NTT), which reduces the computational complexity from O(n^2) to quasi-linear, O(n log n).

Many works have focused on optimizing the efficiency of arithmetic blocks for post-quantum cryptosystems on both software [4][5] and hardware [6][7][8][9] platforms. However, these efforts have faced challenges in balancing the performance and area requirements of NTT architectures. For example, the memory-based NTT architectures proposed in [6][7] are constrained by the cost of memory access operations. Increasing the number of butterfly units (BUs) to speed up the NTT also increases the memory and hardware resources correspondingly. The 2×2 NTT and radix-4 iterative NTT architectures proposed in [8][17] aim to reduce memory access operations, but they still require a complex control unit to generate addresses and reorder positions during the NTT process. The pipelined NTT architecture proposed by Duong-Ngoc and Lee [9] simplifies the control requirement but requires a large number of BUs, which can waste hardware resources and is not suitable for integrating the other arithmetic operations into the same module.

In addition to the challenges of NTT architectures, these cryptosystems need a large amount of pseudo-random data for sample generation. Because hash algorithms largely consist of logical operations, they map well to hardware, which makes a hardware implementation of the current standard hash algorithm, SHA-3 [10], highly valuable.
There are four instances of SHA-3 used in Dilithium and Kyber, which calls for a unified hardware structure that deploys them within a comparable hardware budget.

This paper proposes an efficient unified hardware accelerator for the most time-consuming phases in Kyber and Dilithium. Our proposed polynomial arithmetic module efficiently performs NTT/INTT and polynomial arithmetic operations, while our unified Keccak-based hash module is configurable to perform the four instances of SHA-3 required by both cryptosystems.

II. PRELIMINARIES

Kyber and Dilithium are part of the Cryptographic Suite for Algebraic Lattices (CRYSTALS). Kyber is an MLWE-based KEM, while Dilithium is a lattice-based digital signature scheme built on the Fiat-Shamir paradigm. Their security rests on a well-studied lattice problem, the hardness of solving MLWE. Implementing them involves generating polynomials and computing polynomial arithmetic on them, which are the most time-consuming phases. Each scheme has 3 variants for the NIST security levels, with complexity increasing with the number of pseudo-random polynomial matrices and vectors used and the number of computational operations between them.

These polynomials and algebraic operations are defined over the polynomial ring R_q = Z_q[X]/(X^n + 1). The NTT is a form of the Fast Fourier Transform (FFT) that can be performed efficiently over this ring. The Cooley-Tukey (CT) and Gentleman-Sande (GS) algorithms are the most commonly used for the NTT and INTT transformations. Detailed information can be found in [2][3][12][18].

III. THE PROPOSED ARCHITECTURE

A. Polynomial Arithmetic Module

Fig. 1. Proposed polynomial arithmetic module.

Hardware NTT implementations use BUs to perform calculations layer by layer following a predetermined butterfly diagram. The number of BUs used and the architecture in which they are configured are the main factors that determine the efficiency of an NTT architecture for a specific application.
In this case, given the relatively small number of coefficients in both cryptosystems (n = 256), a pipelined architecture is more efficient than other NTT architectures. To accelerate the NTT without incurring excessive and complex memory accesses, this paper introduces an optimized pipelined NTT structure inspired by the radix-2 multipath delay commutator (R2MDC) FFT structure introduced in [11]. The proposed module improves upon the typical R2MDC structure by applying the folding transformation to halve the number of required BUs, as shown in Fig. 1. Instead of the 7 or 8 BUs typically used in R2MDC for Kyber and Dilithium, respectively, the folded structure requires only 4 BUs, mitigating the area cost often associated with the R2MDC structure. Using 4 BUs in a unified module also balances NTT/INTT against the other arithmetic operations, which is necessary because the storage scheme cannot sustain too many simultaneous read and write operations. Each BU computes two adjacent NTT layers in a time-sliced fashion; for example, BU 1 computes the first layer during odd cycles and the second layer during even cycles. In Fig. 1, the green lines represent the NTT data flow, the blue lines the INTT data flow, and the black lines the paths common to both transformations. Data flows from BU 1 to BU 4 in the NTT direction and in reverse during INTT. During polynomial arithmetic operations, the 4 BUs access memory directly and operate in parallel.

Dilithium's modulus q is 23 bits wide, nearly twice the 12 bits of Kyber's. We therefore built all sub-modules in our architecture around a 24-bit datapath. This allows the same datapath to be used for both schemes with only simple additional selection logic. With a 24-bit datapath, a BU can process 2 Dilithium or 4 Kyber coefficients per cycle. The architecture thus switches flexibly from a 2-path R2MDC NTT for Dilithium to a 4-path R2MDC NTT for Kyber.
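
The shared 24-bit datapath can be illustrated with a small software model. The Python sketch below treats one 24-bit lane as either one Dilithium coefficient or two packed Kyber coefficients. The moduli are the standard values (q = 8380417 for Dilithium, q = 3329 for Kyber); the function names and the choice of which Kyber coefficient occupies the low 12 bits are assumptions made for illustration and are not taken from the design.

```python
Q_DILITHIUM = 8380417  # 23-bit modulus, 2^23 - 2^13 + 1
Q_KYBER = 3329         # 12-bit modulus, 2^12 - 2^9 - 2^8 + 1
HALF_MASK = (1 << 12) - 1


def pack_kyber(c0, c1):
    """Pack two Kyber coefficients into one 24-bit lane (c0 in the low half, by assumption)."""
    assert 0 <= c0 < Q_KYBER and 0 <= c1 < Q_KYBER
    return (c1 << 12) | c0


def unpack_kyber(lane):
    """Split a 24-bit lane back into its two 12-bit Kyber coefficients."""
    return lane & HALF_MASK, (lane >> 12) & HALF_MASK


def lane_add(x, y, scheme):
    """Modular addition on one 24-bit lane; the scheme acts as the selection signal."""
    if scheme == "dilithium":
        # One 23-bit coefficient uses the whole lane.
        return (x + y) % Q_DILITHIUM
    # Kyber: the two 12-bit halves are added independently.
    x0, x1 = unpack_kyber(x)
    y0, y1 = unpack_kyber(y)
    return pack_kyber((x0 + y0) % Q_KYBER, (x1 + y1) % Q_KYBER)


def coefficients_per_bu_cycle(scheme):
    """A BU has two 24-bit operand lanes (a and b)."""
    return 2 if scheme == "dilithium" else 4


lane_a = pack_kyber(3000, 17)
lane_b = pack_kyber(400, 3328)
print(unpack_kyber(lane_add(lane_a, lane_b, "kyber")))  # (71, 16)
print(lane_add(8380000, 500, "dilithium"))              # 83
```

The model only captures the lane layout and the per-scheme selection; it says nothing about timing or about how the hardware actually implements the reduction.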
We also propose a BRAM configuration that is well suited to storing the polynomial samples of both schemes and is compatible with this arithmetic module. The configuration combines three 36-kbit BRAMs into one memory with a 96-bit data width. In this way, 4 Dilithium coefficients or 8 Kyber coefficients can be stored per address, as shown in Fig. 2. This reduces BRAM utilization by 25% compared with the previously reported work [7].

The module stores all twiddle factors (TFs) in one dual-port 36-kbit BRAM and eight 24-bit registers. The BRAM is configured with two ports and provides 72 bits of data per cycle to the last three BUs. It holds all precomputed TFs for both cryptosystems in 512 addresses, arranged in their order of use. This saves LUTs compared to [8][9], which store the TFs in a long register chain used as a FIFO connected to each BU.

B. Unified Butterfly Unit

Since the proposed design uses different butterfly structures for the NTT and INTT operations, we propose the unified BU shown in Fig. 3. This unit uses one modular multiplier, one adder, and one subtractor, the same resources as a dedicated CT or GS BU. Through the shared datapath, all the arithmetic units are configurable to work for both the Dilithium and Kyber schemes. As illustrated in Fig. 3, the proposed BU takes a, b, and ω as inputs, and the red and green lines are the selection signals. When performing polynomial arithmetic operations, the results are taken from ports A and B with the BU in its CT butterfly configuration.

A technique presented in [13] eliminates the need to multiply the resulting coefficients by n^-1 (mod q) after the INTT operation. In this work, we integrated this technique by adding a divide-by-2 operation to the addition unit and preprocessing the operand ω for the INTT to include the factor 2^-1, which reduces the need for modular divide-by-2 units compared to the works in [6][7].
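
The scaling technique can be sketched in software. The Python model below is an illustrative assumption, not the hardware described here: it uses Dilithium's parameters (q = 8380417, n = 256, and the 512th root of unity 1753 from the specification), a textbook in-place CT forward transform, and a GS inverse whose butterfly halves the sum and multiplies the difference by a twiddle factor pre-scaled by 2^-1. The function names, the twiddle table layout, and the loop structure are choices made for this sketch; it needs Python 3.8 or later for the modular inverse via `pow`.

```python
Q = 8380417          # Dilithium modulus
N = 256
PSI = 1753           # primitive 512th root of unity mod Q (Dilithium specification)
INV2 = (Q + 1) // 2  # 2^-1 mod Q


def bitrev8(k):
    """Reverse the 8-bit binary representation of k."""
    return int(f"{k:08b}"[::-1], 2)


# Forward twiddles in order of use, and inverse twiddles pre-scaled by 2^-1.
ZETAS = [pow(PSI, bitrev8(k), Q) for k in range(N)]
ZETAS_INV_HALF = [pow(z, -1, Q) * INV2 % Q for z in ZETAS]


def half(x):
    """Modular divide-by-2: a right shift, after adding Q when x is odd."""
    return (x + Q) >> 1 if x & 1 else x >> 1


def ntt(a):
    """Forward negacyclic NTT with CT butterflies: (x, y) -> (x + zeta*y, x - zeta*y)."""
    a = list(a)
    length = N // 2
    while length >= 1:
        for start in range(0, N, 2 * length):
            zeta = ZETAS[N // (2 * length) + start // (2 * length)]
            for j in range(start, start + length):
                t = zeta * a[j + length] % Q
                a[j], a[j + length] = (a[j] + t) % Q, (a[j] - t) % Q
        length //= 2
    return a


def intt(a):
    """Inverse NTT with GS butterflies that absorb the n^-1 scaling layer by layer."""
    a = list(a)
    length = 1
    while length < N:
        for start in range(0, N, 2 * length):
            w = ZETAS_INV_HALF[N // (2 * length) + start // (2 * length)]
            for j in range(start, start + length):
                u, v = a[j], a[j + length]
                a[j] = half((u + v) % Q)       # divide-by-2 in the addition path
                a[j + length] = (u - v) * w % Q  # omega already includes 2^-1
        length *= 2
    return a  # no final multiplication by n^-1 is needed


a = [(7 * i * i + 3) % Q for i in range(N)]
print(intt(ntt(a)) == a)  # prints True
```

Each inverse butterfly undoes one forward butterfly exactly, so the factor 1/2 applied in each of the log2(n) layers accumulates to n^-1 without a separate scaling pass.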
With the configurable 24-bit BU, we split each 24-bit arithmetic sub-module into two 12-bit parts and select the proper input signals based on the scheme. The modular adder/subtractor consists of two 12-bit parts. These parts work independently for Kyber, while for Dilithium the carry bit from the first adder/subtractor is passed to the second. The modular multiplier with reduction is shown in Fig. 3. We designed one modular reduction unit for Dilithium and two for Kyber by recursively exploiting the structure of their moduli: 2^23 ≡ 2^13 − 1 (mod 8380417) for Dilithium and 2^12 ≡ 2^9 + 2^8 − 1 (mod 3329) for Kyber. Using these congruences, we reduce the 46-bit multiplication result of Dilithium and the 24-bit result of Kyber to the remainder modulo q. This approach has been shown to be efficient and suitable for hardware, as described in [7][15].

Fig. 2. Structural arrangement of proposed BRAM configuration.

Fig. 3. Unified butterfly unit.

Fig. 4. Keccak-based hash module.

Kyber uses n-th primitive roots of unity, whereas Dilithium uses 2n-th roots, so Kyber's coefficient-wise multiplication (CWM) differs and is defined as in [7]. Dilithium's CWM can therefore be performed directly by our BUs, but Kyber requires 3 steps to perform CWM between polynomial matrix A and vector B. These steps are implemented as 3 multiply-accumulate operations. In the steps below, the red, green, and blue letters denote the input coefficients routed to ports b, ω, and a, respectively, and the output is taken from port A. W denotes the TF, which is precomputed and stored in our BRAM. The intermediate value At is temporarily saved in memory during the CWM process.

Step 1: At[2i] = A[2i]·1 + 0; At[2i+1] = A[2i+1]·W[i] + 0;
Step 2: At[2i] = A[2i]·B[2i+1] + At[2i]; At[2i+1] = A[2i+1]·B[2i+1] + At[2i+1];
Step 3: C[2i] = A[2i]·B[2i] + At[2i]; C[2i+1] = A[2i+1]·B[2i] + At[2i+1];

C. Keccak-based Hash Module

In both cryptosystems, 4 different instances of the SHA-3 standard are used for various purposes, including pseudo-random number generation, key derivation, and hashing. These instances share the same sponge construction, in which data is absorbed into the sponge and then squeezed out. During the absorbing phase, input message blocks are XORed into a subset of the state, and the entire state is then transformed by a permutation function known as Keccak-f, the main function of both the absorbing and squeezing phases. While Keccak-f remains the same in all 4 instances, the instances differ in their parameters, such as the bit-rate r and capacity c [10]. Notably, the SHA-3 hash functions produce a fixed output length d that is smaller than the bit-rate r, while SHAKE can generate as many bits as requested, making it an extendable-output function (XOF) used for sampling in both schemes.

To optimize hardware usage and eliminate the need for separate modules for each scheme, we propose a unified Keccak-based hash module with 4 modes, each corresponding to one of the 4 SHA-3 instances. The module's parameters, control signals, and datapath are configured per mode to meet the requirements of both Dilithium and Kyber. Our module is a modification of the high-speed Keccak hardware implementation from [14], whose cores each target a single SHA-3 instance. We employ a book-keeping approach to improve the performance of the sampling operation, eliminating the need to store the Keccak output and read it back in between. Since this sub-module is configured to support both sample generation and hashing, we use 64-bit input/output ports, which also eases integration with other software/hardware platforms. The choice of 64 bits matches the greatest common divisor of the bit-rates r of those SHA-3 instances. In the absorbing phase, the number of clock cycles (CCs) required depends on r.
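
The 64-bit port width can be checked against the FIPS 202 parameters. The text does not list the four instances by name; the sketch below assumes they are SHA3-256, SHA3-512, SHAKE128, and SHAKE256, the SHA-3 functions specified for Kyber and Dilithium, and takes their bit-rates from FIPS 202 [10]. The dictionary and variable names are illustrative.

```python
from functools import reduce
from math import gcd

STATE_BITS = 1600  # width of the Keccak-f[1600] state

# Bit-rate r of each instance (FIPS 202); the capacity is c = 1600 - r.
RATES = {
    "SHA3-256": 1088,
    "SHA3-512": 576,
    "SHAKE128": 1344,
    "SHAKE256": 1088,
}

# Greatest common divisor of all bit-rates: the widest port that divides every r.
PORT_BITS = reduce(gcd, RATES.values())

# Number of port-wide transfers needed to load one r-bit block in each mode.
WORDS_PER_BLOCK = {name: r // PORT_BITS for name, r in RATES.items()}

for name, r in RATES.items():
    print(f"{name}: r={r}, c={STATE_BITS - r}, {WORDS_PER_BLOCK[name]} words per block")
print("common port width:", PORT_BITS)  # 64
```

Under these assumptions, one block takes 9, 17, or 21 transfers over the 64-bit port depending on the mode, which is why the absorbing time varies with r.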
To maintain the security of the hash function, the Keccak-f permutation is constructed as an iteration of 24 identical rounds. Each round consists of 5 steps, θ, ρ, π, χ, and ι, executed sequentially as shown in Fig. 4. These hardware-friendly logical operations allow one round to be executed per cycle without affecting the critical path of the entire architecture. In total, 24 CCs are required for one permutation. When the input message length exceeds r bits, the message is processed in r-bit blocks, and padding may be needed to complete the last block. The estimated time for the absorbing phase is therefore r/64 + 24 × (N/r) CCs, where N is the length of the input message. Similarly, the time needed for the squeezing phase is determined by the amount of pseudo-random data requested, that is, the output size d of each instance. When executing XOFs, this module operates exclusively in the squeezing phase, as indicated by the blue boundary line in Fig. 4. New output values can be generated continuously after each 24-CC Keccak-f permutation.

IV. IMPLEMENTATION RESULTS AND COMPARISON

TABLE I
IMPLEMENTATION RESULTS AND COMPARISON OF PROPOSED ARITHMETIC MODULE TO PRIOR WORKS

Reference        Platform  Freq.  Kyber NTT/INTT  Dilithium NTT/INTT  LUT   ATP LUT  FF    ATP FF  DSP  ATP DSP  BRAM
                           (MHz)  (CCs)           (CCs)
Aikata [6]       ZUs+      270    224/224         512/512             3487  1.8      1918  2.2     4    1.1      1
Yaman [7]        Artix-7   172    69/71           -                   9508  2.4      2684  1.5     16   2.2      35
Beckwith [8]     VUs+      256    -               256/256             4509  1.2      3146  1.9     8    1.2      0
Duong-Ngoc [9]   Artix-7   265    64/64           -                   3918  0.6      4292  1.5     26   2.1      0
Abdulrahman [4]  C-M4      24     5992/6282       8093/8415           -     -        -     -       -    -        -
Becker [5]       C-A72     1500   1200/1338       2241/2821           -     -        -     -       -    -        -
This work        VUs+      300    112/112         256/256             4374  1        1900  1       8    1        1

TABLE II
IMPLEMENTATION RESULTS AND COMPARISON OF HASH MODULE

Reference  Plat.  Freq. (MHz)  Resources (LUT/FF/BRAM)
Li [16]    Us+    250          10570/3575/0
This work  VUs+   300          4522/3933/0

The proposed modules were implemented in Verilog HDL and synthesized with Vivado 2022.2 on the Virtex UltraScale+ VU47P platform.
Implementation results of the proposed architectures and of other works are detailed in Tables I and II. To ensure a fair comparison, we use the area-time product (ATP) metric, following the method introduced in [9]. It is worth noting that these implementations use different FPGA platforms and support varying sets of tasks, making direct one-to-one comparisons challenging.

In the existing literature, there are few hardware implementations that unify Dilithium and Kyber. Among the FPGA-based implementations, only [6] is comparable to our work. That architecture is a memory-based NTT with 2 BUs on the Zynq UltraScale+ platform. In contrast, our work employs a folded NTT architecture with 4 BUs and operates at a higher frequency, resulting in a 2× speed improvement. The ATP metric shows improvements of 1.8× in ATP LUT, 2.2× in ATP FF, and 1.1× in ATP DSP, with the same number of BRAMs.

Other works, such as [7][9], support only Kyber. Yaman et al. [7] employed an iterative NTT architecture for Kyber with 16 BUs, which also supports arithmetic operations. This approach consumes a significant amount of memory (35 BRAMs) and is difficult to realize because it must read and write 64 coefficients simultaneously per cycle. The work in [9], in contrast, introduces a 4-path pipelined MDF NTT architecture that uses a large number of BUs, resulting in 3× as many DSPs as our design while yielding only a 2× improvement in latency. That architecture is also less suitable for integration with arithmetic operations and requires an additional ModMult unit.

In [8], a 2×2 BU arrangement is proposed for Dilithium's arithmetic operations. That architecture employs the same number of BUs as ours but consumes more LUTs and FFs, even though it supports only Dilithium. It also operates at a lower frequency because of its iterative NTT architecture, which requires a complex control unit to map sample addresses during NTT/INTT.
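
The ATP columns of Table I can be approximated with a short calculation. The exact definition is deferred to [9], so the sketch below assumes that ATP is the resource count multiplied by the NTT execution time (cycles divided by clock frequency), normalized to this work, and that the Kyber latency is used when a design reports it and the Dilithium latency otherwise. Under these assumptions the results land within about 0.1 of the tabulated values; the function and variable names are illustrative.

```python
# (freq in MHz, Kyber NTT CCs, Dilithium NTT CCs, LUT, FF, DSP), copied from Table I.
DESIGNS = {
    "Aikata [6]":     (270, 224, 512, 3487, 1918, 4),
    "Yaman [7]":      (172, 69, None, 9508, 2684, 16),
    "Beckwith [8]":   (256, None, 256, 4509, 3146, 8),
    "Duong-Ngoc [9]": (265, 64, None, 3918, 4292, 26),
    "This work":      (300, 112, 256, 4374, 1900, 8),
}


def atp(name, scheme):
    """Assumed ATP: resource count times NTT time in microseconds, for LUT, FF, DSP."""
    freq, kyber_cc, dilithium_cc, lut, ff, dsp = DESIGNS[name]
    cycles = kyber_cc if scheme == "kyber" else dilithium_cc
    time_us = cycles / freq
    return tuple(area * time_us for area in (lut, ff, dsp))


def normalized_atp(name):
    """ATP of a design relative to this work, using the scheme that design supports."""
    scheme = "kyber" if DESIGNS[name][1] is not None else "dilithium"
    ours = atp("This work", scheme)
    theirs = atp(name, scheme)
    return tuple(round(t / o, 2) for t, o in zip(theirs, ours))


for name in DESIGNS:
    print(name, normalized_atp(name))
```

A value above 1 means the other design spends more area-time than this work for the same transform, so lower is better.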
The efficiency of our design is expressed by the ATP values in Table I. Because these works each support only one scheme, or only the NTT rather than all arithmetic operations, their complexity and hardware resource requirements differ, and the LUT, FF, and frequency results are given for reference only. We focus on the ATP DSP metric, which measures how efficiently the BUs are used to achieve a given latency. In this regard, our work outperforms the others.

When compared to state-of-the-art software implementations on Cortex-M4 [4] and Cortex-A72 [5], our high-performance hardware demonstrates substantial improvements: NTT/INTT latency improves by up to 31×/53× over Cortex-M4 and 1.8×/2.1× over Cortex-A72 for Dilithium/Kyber, respectively.

Our hash module requires more area than the high-speed Keccak cores in [14]. This is because our hash module supports multiple instances, while the cores in [14] are each designed for a single instance. However, compared to a similar 4-mode hash module in [16], our design is superior, particularly in terms of LUTs and frequency.

V. CONCLUSION

In this paper, we present an efficient and compact hardware architecture for both NIST PQC standards, ML-DSA and ML-KEM. Our design includes an arithmetic module for computational tasks and a versatile Keccak-based hash module for sample generation and hashing. Our design is highly configurable and shares common elements across hardware platforms. We carefully optimized our modules to use hardware resources efficiently, significantly reducing execution time without increasing resource consumption. Comparative analysis confirms the potential of our approach for PQC applications.

REFERENCES

[1] NIST, “Status report on the third round of the NIST-PQC standardization process,” url: https://csrc.nist.gov/Projects.
[2] NIST, “FIPS 203 - Module-Lattice-Based Key-Encapsulation Mechanism Standard,” url: https://csrc.nist.gov/pubs/fips/203/ipd
[3] NIST, “FIPS 204 - Module-Lattice-Based Digital Signature Standard,” url: https://csrc.nist.gov/pubs/fips/204/ipd
[4] A. Abdulrahman, V. Hwang, M. Kannwischer and A. Sprenkels, “Faster Kyber and Dilithium on the Cortex-M4,” in ACNS 22, vol. 13269 of LNCS, pp. 853–871, Rome, Italy, June 2022.
[5] H. Becker, V. Hwang, M. J. Kannwischer, B.-Y. Yang and S.-Y. Yang, “Neon NTT: Faster Dilithium, Kyber, and Saber on Cortex-A72 and Apple M1,” Cryptology ePrint Archive, Jul. 2021.
[6] A. Aikata, C. Mert, M. Imran, S. Pagliarini and S. Roy, “KaLi: A crystal for post-quantum security using Kyber and Dilithium,” IEEE Trans. Circuits Syst. I, vol. 70, no. 2, pp. 747–758, Feb. 2023.
[7] F. Yaman, C. Mert and E. Ozturk, “A hardware accelerator for polynomial multiplication operation of CRYSTALS-KYBER PQC scheme,” in 2021 Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 1020–1025, IEEE, 2021.
[8] L. Beckwith, D. T. Nguyen and K. Gaj, “High-performance hardware implementation of CRYSTALS-Dilithium,” in 2021 International Conference on Field-Programmable Technology (ICFPT), IEEE, 2021.
[9] P. Duong-Ngoc and H. Lee, “Configurable mixed-radix number theoretic transform architecture for lattice-based cryptography,” IEEE Access, vol. 10, pp. 12732–12741, 2022.
[10] NIST, “SHA-3 Standard: Permutation-Based Hash and Extendable-Output Functions,” url: https://csrc.nist.gov/pubs/fips/202/final
[11] C. Zhao, N. Zhang, H. Wang, B. Yang, W. Zhu, Z. Li, M. Zhu, S. Yin, S. Wei and L. Liu, “A compact and high-performance hardware architecture for CRYSTALS-Dilithium,” IACR Transactions on Cryptographic Hardware and Embedded Systems, pp. 270–296, 2022.
[12] P. Longa and M. Naehrig, “Speeding up the NTT for faster ideal lattice-based cryptography,” Cryptology ePrint Archive, 2016.
[13] N. Zhang, B. Yang, C. Chen, S. Yin, S. Wei and L. Liu, “Highly efficient architecture of NewHope-NIST on FPGA using low-complexity NTT/INTT,” IACR Trans. on CHES, vol. 2020, no. 2, pp. 49–72, 2020.
[14] Keccak Team, “Keccak in VHDL: High-speed core,” url: https://keccak.team/hardware.html, accessed Nov. 2021.
[15] G. Land, P. Sasdrich and T. Guneysu, “A hard crystal - Implementing Dilithium on reconfigurable hardware,” in International Conference on Smart Card Research and Advanced Applications, Cham: Springer International Publishing, 2021.
[16] A. Li, D. Liu, X. Li, T. Huang, S. Yang, J. Lu and A. Hu, “A flexible instruction-based post-quantum cryptographic processor with modulus reconfigurable arithmetic unit for Module LWR&E,” in 2022 IEEE ASSCC, pp. 1–3, IEEE, 2022.
[17] T. X. Pham, P. Duong-Ngoc and H. Lee, “An efficient unified polynomial arithmetic unit for CRYSTALS-Dilithium,” IEEE Trans. Circuits Syst. I, early access, 2023.
[18] M. Garrido, “A survey on pipelined FFT hardware architectures,” J. Signal Process. Syst., no. 6, pp. 849–863, Jul. 2021.