Configurable Arithmetic and Hash Modules for
ML-DSA and ML-KEM Standards
Abstract—In light of the emerging threat posed by quantum
computers to current cryptographic standards, the U.S.
National Institute of Standards and Technology (NIST) has introduced the Module-Lattice-Based Key-Encapsulation Mechanism
(ML-KEM) and Module-Lattice-Based Digital Signature Algorithm (ML-DSA) standards. These
standards are derived from the lattice-based cryptosystems
CRYSTALS-Kyber and CRYSTALS-Dilithium, whose security relies on the Module
Learning With Errors (MLWE) assumption.
Efficient hash and arithmetic modules are essential for implementing these schemes, as hashing and polynomial arithmetic are their most time-consuming
phases. This paper presents a high-performance hardware architecture that optimizes and unifies hash and arithmetic modules
to accelerate Kyber and Dilithium implementations on FPGA.
Our polynomial arithmetic module achieves a speedup of
up to 1.8× and 2.1× for the NTT/INTT of Dilithium
and Kyber, respectively, compared to the high-speed software
implementation on a Cortex-A72 CPU. Additionally, our Keccak-based hash module provides four distinct modes to support both
cryptographic schemes with comparable hardware consumption.
Index Terms—Post-quantum cryptography, number theoretic
transform, Digital Signature, Key-Encapsulation Mechanism.
The threat of quantum computers is steadily advancing, posing challenges to current cryptographic standards
in terms of their capacity and implementation. Current public-key cryptography relies on the
mathematical difficulty of factoring integers and solving discrete logarithm problems, and is therefore
vulnerable to sufficiently powerful quantum computers, which can solve both problems efficiently. After the long NIST Post-Quantum Cryptography
(PQC) Standardization Project, two lattice-based algorithms,
CRYSTALS-Dilithium and CRYSTALS-Kyber, were selected as the
primary choices for standardizing digital signatures and KEM
[1]. These algorithms serve as the foundation for NIST’s ML-DSA and ML-KEM standards, released as draft standards in August
2023 [2][3]. Both systems have a lattice-based structure, which
necessitates computationally expensive polynomial arithmetic
operations. The traditional schoolbook method is inefficient
for these operations. To address the high complexity of polynomial
multiplication, lattice-based cryptosystems employ the
Number Theoretic Transform (NTT), which reduces
the computational complexity from $O(n^2)$ to quasi-linear $O(n \log n)$.
Many works have focused on optimizing the efficiency of
arithmetic blocks for post-quantum cryptosystems on both
software [4][5] and hardware [6][7][8][9] platforms. However, these efforts have struggled to balance the
performance and area requirements of NTT architectures. For
example, the memory-based NTT architectures proposed in
[6][7] are constrained by the cost of memory access operations.
Increasing the number of butterfly units (BUs) to speed up the
NTT algorithm correspondingly increases the memory and
hardware resources required. The 2×2 NTT and radix-4 iterative NTT architectures proposed in [8][17] aim
to reduce memory access operations, but they still require
a complex control unit to generate addresses and reorder
positions during the NTT process. The pipelined NTT architecture
proposed by Duong-Ngoc et al. [9] simplifies the control requirements
but needs a large number of BUs, which wastes
hardware resources and is not suitable for integration with the other
arithmetic operations inside the module. In addition to the
challenges of NTT architectures, these cryptosystems need a
large amount of pseudo-random data for sample generation.
Given that hash algorithms largely consist of
logical operations, hardware implementations are well suited to the current standard hash algorithm SHA-3
[10]. Four instances of SHA-3 are used in Dilithium
and Kyber, which calls for a unified hardware structure that
deploys them with comparable hardware resources. This paper
proposes an efficient unified hardware accelerator for the
most time-consuming phases in Kyber and Dilithium. Our
proposed polynomial arithmetic module efficiently performs
NTT/INTT and polynomial arithmetic operations, while our
unified Keccak-based hash module is configurable to perform
the four instances of SHA-3 required by both cryptosystems.
Kyber and Dilithium are part of the Cryptographic Suite
for Algebraic Lattices (CRYSTALS). Kyber is an MLWE-based KEM, while Dilithium is a lattice-based digital signature scheme based
on the Fiat-Shamir paradigm. Their security rests on a
well-studied lattice problem, the hardness of solving
MLWE. Generating the polynomials involved and performing polynomial
arithmetic on them are the most time-consuming phases
when implementing these schemes. Each scheme has 3
variants for the NIST security levels, with complexity increasing
with the number of pseudo-random polynomial matrices and
vectors used and the number of computational operations
between them. These polynomials and algebraic operations are
defined over the polynomial ring $R_q = \mathbb{Z}_q[X]/(X^n + 1)$.
NTT is a form of the Fast Fourier Transform (FFT) that can
be performed efficiently over this ring. The Cooley-Tukey (CT)
and Gentleman-Sande (GS) algorithms are the most commonly
used algorithms for the NTT and INTT transformations. Detailed
information can be found in [2][3][12][18].
Fig. 1. Proposed polynomial arithmetic module.
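The CT and GS transforms mentioned above can be illustrated with a minimal software sketch. It assumes the ML-DSA parameters (q = 8380417, n = 256, 512th root of unity 1753) and a bit-reversed twiddle table; the function names are illustrative and the code models only the arithmetic, not the hardware architecture of this paper. It checks that multiplication through the NTT agrees with schoolbook multiplication in $R_q$.

```python
import random

# ML-DSA parameters: modulus, degree, and a primitive 512th root of unity mod Q.
Q, N, ZETA = 8380417, 256, 1753

def brv(x, bits=8):
    """Reverse the lowest `bits` bits of x."""
    return int(format(x, f"0{bits}b")[::-1], 2)

# Twiddle factors in bit-reversed order, indexed by butterfly block.
ZETAS = [pow(ZETA, brv(k), Q) for k in range(N)]

def ntt(poly):
    """Forward transform with Cooley-Tukey butterflies (a + w*b, a - w*b)."""
    a = list(poly)
    length = N // 2
    while length >= 1:
        blocks = N // (2 * length)
        for i in range(blocks):
            z = ZETAS[blocks + i]
            start = 2 * length * i
            for j in range(start, start + length):
                t = z * a[j + length] % Q
                a[j], a[j + length] = (a[j] + t) % Q, (a[j] - t) % Q
        length //= 2
    return a

def intt(poly_hat):
    """Inverse transform with Gentleman-Sande butterflies and a final n^-1 scaling."""
    a = list(poly_hat)
    length = 1
    while length < N:
        blocks = N // (2 * length)
        for i in range(blocks):
            z_inv = pow(ZETAS[blocks + i], -1, Q)
            start = 2 * length * i
            for j in range(start, start + length):
                u, v = a[j], a[j + length]
                a[j] = (u + v) % Q
                a[j + length] = (u - v) * z_inv % Q
        length *= 2
    n_inv = pow(N, -1, Q)
    return [x * n_inv % Q for x in a]

def schoolbook(a, b):
    """O(n^2) multiplication in Z_q[X]/(X^n + 1): X^n wraps around as -1."""
    c = [0] * N
    for i in range(N):
        for j in range(N):
            k = i + j
            if k < N:
                c[k] = (c[k] + a[i] * b[j]) % Q
            else:
                c[k - N] = (c[k - N] - a[i] * b[j]) % Q
    return c

rng = random.Random(0)
f = [rng.randrange(Q) for _ in range(N)]
g = [rng.randrange(Q) for _ in range(N)]
h = intt([x * y % Q for x, y in zip(ntt(f), ntt(g))])
assert h == schoolbook(f, g)
```

The schoolbook routine performs n^2 = 65536 coefficient multiplications, whereas each transform uses (n/2)·log2(n) = 1024 butterflies.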
A. Polynomial Arithmetic Module
Hardware NTT implementations use BUs to perform calculations layer by layer following a predetermined butterfly
diagram. The number of BUs and the architecture in which they are
arranged are the main factors that determine the efficiency
of an NTT architecture for a specific application. In this
case, given the relatively small number of coefficients in both
cryptosystems (n=256), a pipelined architecture is more efficient
than other NTT architectures. To accelerate the NTT
algorithm without incurring excessive and complex memory
accesses, this paper introduces an optimized pipelined NTT
structure inspired by the radix-2 multipath
delay commutator (R2MDC) FFT structure introduced in [11].
The proposed module improves upon the typical R2MDC
structure by applying the folding transformation to
halve the number of required BUs, as shown in
Fig.1. Instead of the 7 or 8 BUs typically used in R2MDC for
Kyber and Dilithium, the folded structure requires only 4 BUs,
mitigating the area cost often
associated with the R2MDC structure. Using 4 BUs in a
unified module balances NTT/INTT and arithmetic operations,
which is necessary because the storage scheme cannot handle too many
simultaneous read and write operations. Each BU computes
two adjacent NTT layers in a time-sliced fashion; for example, BU
1 computes the first layer during odd cycles and the second
layer during even cycles. In Fig.1, the green line represents
the NTT data flow, the blue line the INTT data flow, and the black line
a common path used in both transformations. The data flow
moves from BU 1 to BU 4 in the NTT direction, and in reverse
during INTT. During polynomial arithmetic operations, the
4 BUs access memory directly and operate in parallel.
Dilithium’s modulus q is 23 bits wide, nearly twice the 12-bit width of
Kyber’s. We therefore adopt a 24-bit datapath for
all sub-modules in our architecture. This allows the same
datapath to be used for both schemes, with the addition
of simple selection logic. With a 24-bit datapath, each BU
can process 2 Dilithium or 4 Kyber coefficients per cycle.
This architecture can flexibly switch from a 2-path R2MDC
NTT for Dilithium to a 4-path R2MDC NTT for Kyber. We also
propose a BRAM configuration that is well suited to storing the polynomial samples of both schemes and
is compatible with this arithmetic module. This configuration
combines three 36-kbit BRAMs into one memory with a 96-bit data width.
In this way, 4 Dilithium or 8 Kyber coefficients
can be stored per address, as shown in Fig.2. This reduces
BRAM utilization by 25% compared with the previously reported work [7].
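The storage layout described above can be sketched in software as follows. This is a minimal model assuming 24-bit lanes for Dilithium and 12-bit lanes for Kyber inside a 96-bit word; the lane ordering (lowest-index coefficient in the least significant bits) and the function names are assumptions made for illustration, not details taken from Fig.2.

```python
WORD_BITS = 96  # data width of the combined BRAM word

def pack(coeffs, lane_bits):
    """Pack coefficients into 96-bit words, lane 0 in the least significant bits."""
    lanes = WORD_BITS // lane_bits
    if len(coeffs) % lanes:
        raise ValueError("coefficient count must be a multiple of the lane count")
    words = []
    for base in range(0, len(coeffs), lanes):
        word = 0
        for lane in range(lanes):
            c = coeffs[base + lane]
            if not 0 <= c < (1 << lane_bits):
                raise ValueError("coefficient does not fit in its lane")
            word |= c << (lane * lane_bits)
        words.append(word)
    return words

def unpack(words, lane_bits):
    """Inverse of pack: split each 96-bit word back into its lanes."""
    lanes = WORD_BITS // lane_bits
    mask = (1 << lane_bits) - 1
    return [(w >> (lane * lane_bits)) & mask for w in words for lane in range(lanes)]

# One 256-coefficient polynomial per scheme (values chosen only for illustration).
dilithium_poly = [(i * 32749) % 8380417 for i in range(256)]
kyber_poly = [(i * 13) % 3329 for i in range(256)]

dilithium_words = pack(dilithium_poly, 24)  # 4 coefficients per address
kyber_words = pack(kyber_poly, 12)          # 8 coefficients per address
```

Under these assumptions a Dilithium polynomial occupies 64 addresses and a Kyber polynomial 32 addresses.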
This module stores all twiddle factors (TFs) in one dual-port 36-kbit BRAM and eight 24-bit registers. The BRAM is
configured to work with two ports, providing 72 bits of
data per cycle to the last three BUs. The BRAM stores all
precomputed TFs for both cryptosystems in 512 addresses,
arranged in accordance with their order of use. This reduces
LUT utilization compared to [8][9], which store the TFs in a
long shift register acting as a FIFO unit connected to each BU.
B. Unified Butterfly Unit
Since the proposed design uses different butterfly
structures for NTT and INTT operations, we propose the unified
BU shown in Fig.3. This unit uses one modular multiplier,
one adder and one subtractor, as in a dedicated CT or GS BU. By
using a shared datapath, all the arithmetic units are made
configurable to work for both the Dilithium and Kyber schemes.
As illustrated in Fig.3, the proposed BU takes a, b and ω as
inputs; the red and green lines are used as selection signals.
When performing polynomial arithmetic operations, the result
can be taken from ports A and B via the CT butterfly configuration.
A technique presented in [13] eliminates the need to
multiply the resulting coefficients by $n^{-1} \pmod{q}$ after the INTT
operation. In this work, we integrate this technique by adding
a divide-by-2 operation to the addition unit and preprocessing
the operand ω for the INTT to include the factor $2^{-1}$, reducing
the need for modular divide-by-2 units compared to the works in [6][7].
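The effect of folding the scaling into each layer can be checked with a small sketch. It assumes the Dilithium modulus and models a single butterfly pair; `half`, `ct_butterfly` and `gs_butterfly_halved` are illustrative names, and the parity-based halving is one common way to realize a modular divide-by-2, not necessarily the circuit used in Fig.3.

```python
import random

Q = 8380417            # Dilithium modulus (odd prime)
INV2 = pow(2, -1, Q)   # modular inverse of 2

def half(x):
    """Modular divide-by-2 for 0 <= x < Q: shift if even, else add Q first."""
    return x >> 1 if x % 2 == 0 else (x + Q) >> 1

def ct_butterfly(a, b, w):
    """Forward (CT) butterfly: (a + w*b, a - w*b)."""
    t = w * b % Q
    return (a + t) % Q, (a - t) % Q

def gs_butterfly_halved(A, B, w_inv_half):
    """Inverse (GS) butterfly with the 1/2 factor folded in.

    The sum path is halved explicitly; the difference path is halved through
    the preprocessed twiddle w_inv_half = w^-1 * 2^-1 mod Q.
    """
    return half((A + B) % Q), (A - B) * w_inv_half % Q

rng = random.Random(1)
for _ in range(1000):
    a, b = rng.randrange(Q), rng.randrange(Q)
    w = rng.randrange(1, Q)
    w_inv_half = pow(w, -1, Q) * INV2 % Q
    # One halved GS butterfly exactly undoes one CT butterfly.
    assert gs_butterfly_halved(*ct_butterfly(a, b, w), w_inv_half) == (a, b)

# Eight halved layers accumulate 2^-8 = n^-1 for n = 256, so no final scaling is needed.
assert pow(INV2, 8, Q) == pow(256, -1, Q)
```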
Fig. 2. Structural arrangement of proposed BRAM configuration.
Fig. 3. Unified butterfly unit.
Fig. 4. Keccak-based hash module.
To make the 24-bit BU configurable, we split each 24-bit arithmetic sub-module into two 12-bit parts and select the proper input signals based on the scheme. The modular adder/subtractor consists of two 12-bit parts. These
parts can work independently for Kyber, or a carry bit
from the first adder/subtractor can be passed to the second for
Dilithium. The modular multiplier with reduction is described
in Fig.3. We designed one modular reduction unit for Dilithium
and two for Kyber by recursively exploiting the structure of their moduli:
$2^{23} \equiv 2^{13} - 1 \pmod{8380417}$ and $2^{12} \equiv 2^{9} + 2^{8} - 1 \pmod{3329}$. Using these congruences, we
can reduce the 46-bit multiplication result of Dilithium
and the 24-bit result of Kyber to the remainder
modulo q. This approach has been shown to
be efficient and suitable for hardware, as described in [7][15].
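The recursive use of these congruences can be sketched in a few lines. This is a behavioural model only: the loop stands in for what hardware would unroll into a fixed number of shift-and-add stages, and the function names are illustrative.

```python
import random

Q_DILITHIUM = (1 << 23) - (1 << 13) + 1            # 8380417
Q_KYBER = (1 << 12) - (1 << 9) - (1 << 8) + 1      # 3329

def fold_reduce(x, k, c, q):
    """Reduce x modulo q using the congruence 2^k = c (mod q).

    Each pass replaces hi * 2^k + lo by hi * c + lo, which is congruent
    modulo q and strictly smaller, until the value fits in k bits.
    """
    mask = (1 << k) - 1
    while x >> k:
        x = (x & mask) + (x >> k) * c   # in hardware: shifts and adds only
    # Now x < 2^k < 2q, so one conditional subtraction completes the reduction.
    return x - q if x >= q else x

def reduce_dilithium(x):
    return fold_reduce(x, 23, (1 << 13) - 1, Q_DILITHIUM)

def reduce_kyber(x):
    return fold_reduce(x, 12, (1 << 9) + (1 << 8) - 1, Q_KYBER)

rng = random.Random(2)
for _ in range(10000):
    x = rng.randrange(1 << 46)   # 46-bit Dilithium product
    y = rng.randrange(1 << 24)   # 24-bit Kyber product
    assert reduce_dilithium(x) == x % Q_DILITHIUM
    assert reduce_kyber(y) == y % Q_KYBER
```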
Because Kyber uses n-th primitive roots of unity rather than
the 2n-th roots used by Dilithium, its coefficient-wise multiplication
(CWM) takes the form defined in [7]. Dilithium’s CWM can therefore be
performed directly using our BUs, whereas Kyber requires 3 steps
to perform CWM between polynomial matrix A and vector B.
These steps can be implemented using 3 multiply-accumulate
operations. The red, green and blue letters represent the input
coefficients routed to ports b, ω and a, respectively. The output is
taken from port A. W denotes the TF, which has been
precomputed and stored in our BRAM. The intermediate value
At is temporarily saved in memory during the CWM process.
Step 1: At[2i]   = A[2i]·1 + 0;
        At[2i+1] = A[2i+1]·W[i] + 0;
Step 2: At[2i]   = A[2i]·B[2i+1] + At[2i];
        At[2i+1] = A[2i+1]·B[2i+1] + At[2i+1];
Step 3: C[2i]    = A[2i]·B[2i] + At[2i];
        C[2i+1]  = A[2i+1]·B[2i] + At[2i+1];
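For reference, the base-case multiplication in ML-KEM computes, for each coefficient pair, C[2i] = A[2i]·B[2i] + A[2i+1]·B[2i+1]·W[i] and C[2i+1] = A[2i]·B[2i+1] + A[2i+1]·B[2i] (mod q). The sketch below shows one possible way to obtain this result in three rounds of multiply-accumulate operations; the operand routing and the names `mac` and `basemul_3step` are assumptions made for illustration and are not taken from the design.

```python
import random

Q = 3329  # Kyber / ML-KEM modulus

def mac(x, y, acc):
    """One multiply-accumulate on the butterfly datapath: x*y + acc mod Q."""
    return (x * y + acc) % Q

def basemul_3step(a0, a1, b0, b1, w):
    """Multiply (a0 + a1*X) by (b0 + b1*X) modulo (X^2 - w) in three MAC rounds."""
    # Round 1: pre-multiply the odd coefficient by the twiddle factor.
    t0 = mac(a1, w, 0)
    # Round 2: form the two cross terms that involve b1.
    t0 = mac(t0, b1, 0)      # a1 * w * b1
    t1 = mac(a0, b1, 0)      # a0 * b1
    # Round 3: accumulate the terms that involve b0.
    c0 = mac(a0, b0, t0)     # a0*b0 + a1*b1*w
    c1 = mac(a1, b0, t1)     # a1*b0 + a0*b1
    return c0, c1

def basemul_direct(a0, a1, b0, b1, w):
    """Reference result written out from the definition."""
    return (a0 * b0 + a1 * b1 * w) % Q, (a0 * b1 + a1 * b0) % Q

rng = random.Random(3)
for _ in range(1000):
    args = [rng.randrange(Q) for _ in range(5)]
    assert basemul_3step(*args) == basemul_direct(*args)
```

With a0 = 1, a1 = 2, b0 = 3, b1 = 4 and w = 17, both functions return (139, 10).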
C. Keccak-based Hash Module
In both cryptosystems, 4 different instances of the SHA-3
standard are utilized for various purposes, including pseudo-random number generation, key derivation, and hashing.
These instances share the same sponge construction, in which
data is absorbed into the sponge and then squeezed out. During
the absorbing phase, input message blocks are XORed into a
subset of the state, and the entire state is transformed using
a permutation function known as Keccak-f. It is the main
function used in both the absorbing and squeezing phases of
the sponge construction. While Keccak-f remains the same
in all 4 instances, the differences lie in the parameters, such
as the bit-rate r and capacity c [10]. Notably, SHA-3 restricts
the output length d to within the bit-rate r, while SHAKE can
generate as many bits as requested, making it an extendable-output function (XOF) used for sampling in both schemes.
To optimize hardware usage and eliminate the need for separate modules for each scheme, we propose a unified Keccak-based hash module with 4 modes, each corresponding to one
of the 4 SHA-3 instances. The module’s parameters, control
signals and datapath are configured to work in these modes,
meeting the requirements of both Dilithium and Kyber. Our
module is a modification of the high-speed Keccak hardware
implementation from [14], which dedicates a core to each SHA-3 instance. We employ a book-keeping approach to improve the
performance of the sampling operation, eliminating the need
to store and then read back the Keccak output.
Since this sub-module is configured to support sample
generation and hashing, we use 64-bit input/output
ports, making it compatible with integration into other software/hardware platforms. The choice of 64 bits matches the
greatest common divisor of the rates r of those SHA-3 instances.
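The difference between a fixed-length SHA-3 digest and a SHAKE output stream can be seen with Python's standard `hashlib`. The sketch below also draws coefficients from the stream by rejection sampling in the style of ML-KEM; `sample_uniform` is a simplified, illustrative helper and not the sampler of the proposed hardware.

```python
import hashlib

seed = bytes(range(32))  # arbitrary 32-byte seed chosen for illustration

# SHA3-256 always returns 32 bytes; SHAKE128 returns as many bytes as requested,
# and a shorter request is a prefix of a longer one.
fixed = hashlib.sha3_256(seed).digest()
short = hashlib.shake_128(seed).digest(32)
longer = hashlib.shake_128(seed).digest(2 * 168)
assert len(fixed) == 32 and longer[:32] == short

def sample_uniform(seed, count=256, q=3329):
    """Rejection-sample `count` values below q from a SHAKE128 stream.

    Every 3 bytes give two 12-bit candidates; candidates >= q are discarded.
    The stream is extended by one 168-byte block whenever it runs short.
    """
    nbytes = 3 * 168
    while True:
        stream = hashlib.shake_128(seed).digest(nbytes)
        out = []
        for i in range(0, len(stream) - 2, 3):
            d1 = stream[i] | ((stream[i + 1] & 0x0F) << 8)
            d2 = (stream[i + 1] >> 4) | (stream[i + 2] << 4)
            for d in (d1, d2):
                if d < q and len(out) < count:
                    out.append(d)
        if len(out) == count:
            return out
        nbytes += 168

coeffs = sample_uniform(seed)
```

Because hashlib recomputes the stream on every call, the loop simply requests a longer output when needed; a hardware XOF would instead keep squeezing from the same state.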
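The 64-bit choice can be verified directly from the sponge parameters. The sketch assumes the four instances are SHA3-256, SHA3-512, SHAKE128 and SHAKE256, with the capacities given in the SHA-3 standard and a 1600-bit state; the variable names are illustrative.

```python
from functools import reduce
from math import gcd

STATE_BITS = 1600  # width of the Keccak-f[1600] state

# Capacity c of each instance; the rate is r = 1600 - c.
CAPACITY = {"SHA3-256": 512, "SHA3-512": 1024, "SHAKE128": 256, "SHAKE256": 512}
RATE = {name: STATE_BITS - c for name, c in CAPACITY.items()}

# Largest port width that divides every rate.
common_width = reduce(gcd, RATE.values())

# Number of 64-bit words needed to load or unload one rate-sized block.
words_per_block = {name: r // common_width for name, r in RATE.items()}
```

Here `common_width` evaluates to 64, which is also the lane width of the Keccak state, and one block takes 17, 9, 21 and 17 words for SHA3-256, SHA3-512, SHAKE128 and SHAKE256, respectively.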
In the absorbing phase, the number of clock cycles (CCs)
required depends on r. To maintain the security
of the hash function, the Keccak-f permutation is constructed
as an iteration of 24 identical rounds. Each round
consists of 5 steps, θ, ρ, π, χ and ι, executed sequentially
as shown in Fig.4. These hardware-friendly logical operations
allow one round to be executed per cycle without affecting the
critical path of the entire architecture. In total, 24 CCs are required for one permutation. When the input message length exceeds
r bits, padding may be needed during the permutation process.
The estimated time for the absorbing phase is therefore
$r/64 + 24 \times (N/r)$ CCs, where N is the length of the input
message. Similarly, the time needed for the squeezing phase
is determined by the amount of pseudo-random data requested
for the output size d of each instance. When executing XOFs,
this module operates exclusively in the squeezing phase,
as indicated by the blue boundary line in Fig.4. New output
values can be generated continuously after each 24-CC Keccak-f permutation.
Table I. Latency (CCs): Aikata [6], Yaman [7], Beckwith [8], Duong-Ngoc [9], Abdulrahman [4], Becker [5], This work.
Table II. Resources (LUT/ FF/ BRAM)
Li [16]      10570/ 3575/ 0
This work    4522/ 3933/ 0
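The absorbing-phase estimate given in the text, r/64 + 24×(N/r), can be evaluated with a short helper. This is a sketch under stated assumptions: N/r is rounded up to a whole number of blocks, the rates are those of the four standard instances, and `absorb_cycles` is an illustrative name rather than part of the design.

```python
import math

RATE_BITS = {"SHA3-256": 1088, "SHA3-512": 576, "SHAKE128": 1344, "SHAKE256": 1088}
PORT_BITS = 64   # width of the input/output port
ROUNDS = 24      # one Keccak-f round per clock cycle

def absorb_cycles(mode, msg_bits):
    """Estimate r/64 + 24 * (N/r) clock cycles, with N/r rounded up (assumption)."""
    r = RATE_BITS[mode]
    blocks = math.ceil(msg_bits / r)
    return r // PORT_BITS + ROUNDS * blocks

# Example: a 34-byte (272-bit) seed absorbed by SHAKE128 fits in one block.
seed_cycles = absorb_cycles("SHAKE128", 272)
# Example: a 2000-bit message hashed with SHA3-512 spans four 576-bit blocks.
long_cycles = absorb_cycles("SHA3-512", 2000)
```

Under these assumptions the two examples evaluate to 21 + 24 = 45 and 9 + 96 = 105 clock cycles.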
The proposed modules were implemented in Verilog
HDL and synthesized with Vivado 2022.2 for the Virtex UltraScale+
VU47P platform. Implementation results of the proposed
architectures and other works are detailed in Tables I and II.
To ensure a fair comparison, we use the area-time product
(ATP) metric introduced in [9]. It is worth noting
that these implementations often use different FPGA platforms
and support varying numbers of tasks, making direct one-to-one comparisons challenging. In the existing literature, there are
few hardware implementations that unify Dilithium and
Kyber. Among the FPGA-based implementations, only [6] is
comparable to our work. That architecture is a memory-based NTT with 2
BUs on the Zynq UltraScale+ platform. In contrast, our work
employs a folded NTT architecture with 4 BUs and operates
at a higher frequency, resulting in a 2× speed improvement.
The ATP metric shows improvements of 1.8× in ATP LUT, 2.2× in
ATP FF and 1.1× in ATP DSP, with the same number of BRAMs.
Other works, such as those presented in [7][9], focus solely
on supporting Kyber. Yaman et al. [7] employed an iterative
NTT architecture for Kyber with 16 BUs, which also supports
arithmetic operations. This approach consumes a significant
amount of memory (35 BRAMs) and is difficult to realize
because it must simultaneously read and write 64 coefficients per
cycle. In contrast, the work presented in [9] introduces a 4-path pipelined MDF NTT architecture, which uses a large
number of BUs, consuming 3× as many DSPs as our design while
yielding only a 2× improvement in latency. This architecture is
less suitable for integration with arithmetic operations and
necessitates the addition of a ModMult unit. In [8], a 2×2 BU
arrangement is proposed for Dilithium’s arithmetic operations.
Their architecture employs the same number of BUs as ours but
consumes more LUTs and FFs, although it supports only
Dilithium. Additionally, it operates at a lower frequency due to
the use of an iterative NTT architecture, which requires a complex
control unit to map sample addresses during NTT/INTT. The
efficiency of our design is expressed by the ATP parameters
shown in Table I. Because these works support
only one scheme, and either NTT alone or all arithmetic operations, which affects complexity and hardware resource
requirements, the LUT, FF and frequency results are
for reference only. We focus on the ATP DSP metric, which
measures how efficiently the BUs are used to achieve a
given latency. In this regard, our work outperforms the others.
When compared to state-of-the-art software implementations on Cortex-M4 [4] and Cortex-A72 [5], our high-performance hardware demonstrates substantial improvements.
Its NTT/INTT latency for Dilithium and Kyber is up to 31× and 53× lower than on Cortex-M4,
and up to 1.8× and 2.1× lower than on Cortex-A72, respectively.
Our hash module requires more area resources than the
high-speed Keccak cores mentioned in [14]. This is because
our hash module supports multiple instances, while the cores
in [14] are designed for a single instance. However, when
compared to a similar 4-mode hash module in [16], our results
show superiority, particularly in terms of LUTs and frequency.
In this paper, we present an efficient and compact hardware
architecture for both NIST PQC standards, ML-DSA and ML-KEM. Our design includes an arithmetic module for computational tasks and a versatile Keccak-based hash module
for sample generation and hashing. Our design is highly
configurable and shares common elements across hardware
platforms. We carefully optimized our modules to use
hardware resources efficiently, significantly reducing execution time
without increasing resource consumption. Comparative analysis confirms our approach’s potential for PQC applications.
