Uploaded by ngocphap2806

Access-2023-18854 Proof hi

advertisement
IEEE Access
Efficient Low-Latency Hardware Architecture for CRYSTALSDilithium
Journal: IEEE Access
Manuscript ID Access-2023-18854
Manuscript Type: Regular Manuscript
Date Submitted by the
30-Jun-2023
Author:
Complete List of Authors: Lee, Hanho; Inha University College of Engineering
TRUONG, QUANG DANG; Inha University College of Engineering
PHAP, DUONG-NGOC; Inha University College of Engineering
Subject Category<br>Please
select at least two subject
Circuits and systems, Computers and information processing
categories that best reflect
the scope of your manuscript:
Keywords: <b>Please choose
keywords carefully as they
Information security, Cryptography, Hardware acceleration
help us find the most suitable
Editor to review</b>:
Additional Manuscript
Keywords:
Note: The following files were submitted by the author for peer review, but cannot be converted to PDF.
You must view these files (e.g. movies) online.
figure.rar
For Review Only
Page 1 of 11
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
IEEE Access
AUTHOR RESPONSES TO IEEE ACCESS
SUBMISSION QUESTIONS
Author chosen
Research Article
manuscript type:
- This paper proposes a low-latency hardware architecture that
supports all three algorithms of CRYSTALS-Dilithium while allowing
for customizable security levels. The design focuses on minimizing
latency and achieves the lowest execution time for all algorithms
Author explanation
simultaneously, while also balancing execution time and resource
/justification for
utilization. - The proposed architecture proposes optimized
choosing this
sub-modules specifically tailored to the characteristics of Dilithium,
manuscript type:
enhancing the overall performance of the system. - The
experimental results confirm that the proposed design achieved
improvements in terms of the area-time product and exhibited lower
execution time.
- This paper technically presents a hardware architecture of
CRYSTALS-Dilithium, a digital signature algorithm designed to
Author description of withstand quantum and classical attacks. - The proposed
how this manuscript fits architecture is flexible and configurable, capable of performing the
within the scope of IEEE key algorithms of Dilithium and seamlessly integrating with other
Access: hardware and software platforms. It accelerates processes such as
random value generation for matrices or vectors and supports
computation through dedicated modules within the design.
- This paper proposes a low-latency hardware architecture for
Dilithium that supports all three algorithms and allows for
customizable security levels. Our design focuses on minimizing
Author description latency and achieves the lowest execution time for all algorithms
detailing the unique simultaneously, while balancing resource utilization. - The proposed
contribution of the sub-modules are optimized specifically for Dilithium, improving the
manuscript related to overall system performance. - Comparative analysis shows that our
existing literature: NTT architectures exhibit lower execution time while maintaining
comparable area-time efficiency compared to previous works. This
makes our hardware architecture highly applicable for digital
signature applications in the post-quantum era.
For Review Only
IEEE Access
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
Page 2 of 11
Date of publication xxxx 00, 0000, date of current version xxxx 00, 0000.
Digital Object Identifier 00.0000/ACCESS.2020.DOI
Efficient Low-Latency Hardware
Architecture for CRYSTALS-Dilithium
QUANG DANG TRUONG , (Graduate Student Member, IEEE), PHAP DUONG-NGOC ,
(Member, IEEE), AND HANHO LEE , (Senior Member, IEEE)
Department of Electrical and Computer Engineering, Inha University, Incheon 22212, South Korea (e-mail: dangquang@inha.edu, dnphap@inha.ac.kr,
hhlee@inha.ac.kr)
Corresponding author: Hanho Lee (e-mail: hhlee@inha.ac.kr).
This work was supported by the MSIT(Ministry of Science and ICT), Korea, under the ITRC support program (IITP-2021-0-02052)
supervised by the IITP(Institute for Information & Communications Technology Planning & Evaluation), in part by the National Research
Foundation of Korea (NRF) grant funded by the Korea government(MSIT) (No. 2021R1A2C1011232), and in part by the IITP grant
funded by the Korea government(MSIT) (No.20210007790012003).
ABSTRACT The rapid advancement of powerful quantum computers poses a significant security risk to
current public-key cryptosystems, which heavily rely on the computational complexity of problems such
as discrete logarithms and integer factorization. As a result, there is an increasing demand for alternative
algorithms that can withstand both quantum and classical attacks, leading to ongoing evaluations for future
standardization. Among these contenders, CRYSTALS-Dilithium, a lattice-based digital signature scheme,
has emerged as a primary candidate in the NIST Post-Quantum Cryptography competition and is currently
being considered for standardization. This paper proposes an efficient low-latency hardware architecture for
CRYSTALS-Dilithium by leveraging optimized timing schedules for its three key operations. We further
enhance the performance of the pipelined number theoretic transform module, as well as the hashing
and sampling module, which are the most time-intensive sub-modules. Our FPGA-based implementation
outperforms state-of-the-art studies, achieving an improvement of 1.1∼1.9× in terms of the area-time
product and exhibiting lower execution time. Consequently, the proposed hardware architecture holds
practical applicability for digital signature applications in the post-quantum era.
INDEX TERMS Post-quantum cryptography (PQC), digital signature, CRYSTALS-Dilithium, latticebased cryptography (LBC), number theoretic transform (NTT).
I. INTRODUCTION
N recent years, the rise of quantum computing has posed
a significant threat to traditional public-key cryptographic
schemes. Shor’s algorithm [1], when executed on a sufficiently powerful quantum computer, has the capability to
efficiently solve complex mathematical problems that form
the basis of widely used cryptographic algorithms such as
RSA and ECC. This realization has prompted the National
Institute of Standards and Technology (NIST) to launch the
Post-Quantum Cryptography (PQC) standardization process
in 2016, with the aim of developing new public-key standards
that are resistant to quantum computer attacks.
After three rounds of evaluation and review, NIST has
selected Kyber for key-encapsulation mechanism (KEM) and
CRYSTALS-Dilithium, Falcon and SPHINCS+ for digital
signature standardization to the fourth round of the PQC
standardization process [2]. Among the finalists, Dilithium
I
VOLUME X, 2023
a lattice-based digital signature scheme has been recommended as the primary algorithm to be implemented. Besides, with three of four finalists are based on lattice structure,
that show the potential to against both quantum and classical
attacks. Lattice-based cryptography relies on the worst-case
hardness of lattice problems, such as the Learning with Errors
(LWE) problem, to provide security guarantees. Dilithium,
along with other lattice-based schemes like Kyber, offers
advantages such as fast arithmetic operations and compact
key, ciphertext, and signature sizes. Given the potential of
Dilithium to become a standardized post-quantum cryptographic algorithm, which is ready to replace current digital
signature standard [3], there is a growing interest in developing efficient implementations of the scheme both in hardware and software. Most existing works on implementing
and evaluating Dilithium have used pure software methods
or hardware-software codesign methods. Developing and
For Review Only
1
Page 3 of 11
IEEE Access
Author et al.: Preparation of Papers for IEEE TRANSACTIONS and JOURNALS
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
benchmarking software implementations of cryptographic algorithms like post-quantum cryptography (PQC) is relatively
straightforward. However, when it comes to hardware implementations, particularly on FPGA (Field-Programmable
Gate Array), it requires significant time and design effort
to assess their efficiency in terms of speed, area utilization,
power consumption, and energy efficiency. FPGA implementations of PQC submissions are crucial for gaining valuable
insights into the cost and performance characteristics of these
algorithms when deployed in hardware environments. By
exploring FPGA implementations, researchers can realize
their ideas and apply the trade-off method, thereby optimizing the approach for hardware designs to achieve desired
performance goals.
There have been several previous works optimizing full
hardware implementations for Round 3 Dilithium on FPGA
platforms. Based on our research, we have observed that
these implementations can be divided into two approaches.
The first approach is a combined architecture that supports
all security levels, as demonstrated in the designs presented
in [4] and [5]. This architecture is capable of handling the
required main operations for each security level within a
single design. The second approach involves three individual
designs, each dedicated to a specific security level. These
designs are capable of performing all the required operations
in Dilithium for their respective security levels, as shown in
the works discussed in [6], [7], [8], [9]. In addition, some
authors have proposed unified architectures that combine
multiple schemes into one. For example, Aikata et al. introduced KaLi [10], a unified architecture that supports both
CRYSTALS-Kyber and Dilithium. In hardware design, it is
important to evaluate performance and resource consumption. Therefore, these implementations can prioritize different objectives, such as maximizing performance, minimizing area and power consumption, or achieving efficiency in
area/performance trade-offs. As a result, the implementations
are often categorized as High-performance or Lightweight,
although the distinction is not always clear-cut, as designs
may aim for various trade-offs or other specific directions. In
terms of High-performance designs, notable works include
those by [4], [5]. On the other hand, the Lightweight approach is exemplified by works such as [8], [10] . Additionally, there are High-Efficiency designs, as seen in [6].
In this paper, we present a low-latency and efficient hardware architecture for version 3.1 of Dilithium. Our contributions are summarized as follows:
1) We propose a combined architecture that supports all
three algorithms of Dilithium and allows for customizable security levels. With a focus on minimizing latency, our design achieves the lowest execution time for
all algorithms simultaneously, while also striking a balance between execution time and resource utilization.
This design can operate independently for each task
of Dilithium, such as key generation, signature generation, and verification. Furthermore, it can interface with
other hardware and software platforms to accelerate
2
their respective processes, such as generating random
values for matrices or vectors in Dilithium or supporting computation through corresponding modules
implemented in this design.
2) We also propose several optimization ideas for the
polynomial arithmetic module and hashing module,
specifically tailored to the characteristics of Dilithium.
These two modules represent the largest contributors to
the overall execution time in the architecture, and optimizing them significantly improves the performance
of the entire system.
3) By utilizing flexible modules for the entire architecture, we propose a rational timing diagram and an
approach to leverage sub-modules to achieve resource
utilization balance when implementing different algorithms.
The remainder of this paper is organized as follows. Section II provides a brief background of Dilithium and number
theoretic transform. Section III proposes our design decisions
and details our hardware architecture as well as the optimized
modules. The implementation results and comparison with
the state-of-the-art works are given in Section IV. Finally,
Section V summarizes and concludes the paper.
II. PRELIMINARIES
A. DILITHIUM OVERVIEW
Dilithium, along with the Key Encapsulation Mechanism
(KEM) Kyber, is a member of the Cryptographic Suite for
Algebraic Lattices (CRYSTALS) and is a digital signature
scheme based on the "Fiat-Shamir with Aborts" approach
[12]. Different from other lattice-based algorithms recommended in round 4: Kyber and Falcon, Dilithium uses uniform sampling rather than discrete Gaussian distribution for
secret randomness generation. This approach greatly simplifies polynomial generation and is easily implemented in
constant time. The hard problems underlying the security
of Dilithium schemes are Module Learning with Errors (MLWE) and Module Shortest Integer Solution (M-SIS) formally proved in [11]. M-LWE problem needed to protect
against key-recovery [13] and M-SIS problem for strong
unforgeability [14]. The M-LWE problem can be briefly
described as follows: Let A ∈ Rqk×l be uniformly chosen s1
∈ Rql , s2 ∈ Rqk . Then, the standard M-LWE problem is the
public key (A, t = As1 + s2 ) is indistinguishable from (A, t)
where t is chosen uniformly at random. The M-SIS problem
in Dilithium is expressed through: from a uniformly chosen
matrix A ∈ Rqk×l , that there exist z and u such that Az+u =
0, where ∥z∥∞ ≤ 2(γ1 − β), ∥u∥∞ ≤ 4γ2 + 2 (β, γ1 , γ2 and
other parameters could be chosen corresponding with NIST
security levels in Table 1). In addition to three proposed NIST
security levels, the security of Dilithium can be adjust by
change the parameter inside. The most straightforward way
to enhance or reduce security is by modifying the values of
(k, l) and then adjusting η, β, and ω accordingly as shown
in Table 1. Another approach is to lower the values of γ1
and/or γ2 , which makes forging signatures more challenging
For Review Only
VOLUME X, 2023
IEEE Access
Page 4 of 11
Author et al.: Preparation of Papers for IEEE TRANSACTIONS and JOURNALS
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
TABLE 1. Parameters of Dilithium in version 3.1.
Parameters
2
q [modulus]
d [dropped bits from t]
τ [# of ±1’s in c]
γ1 [y coefficient range]
γ2 [low-order rounding range]
(k, l) [dimension of A]
η [secret key range]
β [τ · η]
ω [max # of 1’s in hint]
NIST security level
3
8380417
13
39
217
(q-1)/88
(4,4)
2
78
80
8380417
13
49
219
(q-1)/32
(6,5)
4
196
55
5
8380417
13
60
219
(q-1)/32
(8,7)
2
120
75
since it relies on the underlying SIS problem. Increasing (k, l)
significantly strengthens security, while adjusting γi is more
suitable for minor security tweaks.
Key generation (Keygen), Signature generation (Sign) and
Verification (Verify), three core algorithms of Dilithium, are
shown in Algorithms 1-3.
KeyGen(): Key generation algorithm generates a keypair
consisting of a private signing key (sk) and public verification
key (pk) used for signature creation and verification. In this
algorithm, two uniform seeds: public seed (ρ) and noise seed
(ρ′ ) are mapped to generate polynomial matrix A and vector
s1 , s2 . It then computes t = As1 +s2 to generate the final
keypair. To keep size small, the public key includes the seed
(ρ) instead of matrix A and the upper bit of t. The secret key
packs the lower d bits of t, seed ρ and two byte-arrays K, tr.
Sign(): This algorithm takes message M and secret key
to generate the signature attached with the message before
sending to verifier. Signature generation begins by generating
masking vector y and matrix A and then compute w =
Ay. It is based for generating challenge c by hashing the
message with the high order bit of w, signature as z =
y+cs1 then and also the hints h for the verifier. The potential
signature σ consisting of (c, z, h) is checked by adapting
the condition of the polynomial max norms are within an
acceptable predefined range and the maximum number of 1’s
in hint that the signature can support which are defined by
choosing parameter depend on the security level. Otherwise,
the signature must be rejected, and a new attempt is made.
Verify(): The high-order bit of w1 is recovered from
message, public key and signature. It is used as a part of hash
function to generate challenge c which will be compared to
the c̃ provided in the signature to confirm the authentication
and integrity of the original message.
B. NUMBER THEORETIC TRANSFORM
The Number Theoretic Transform (NTT) is a variant of the
Fast Fourier Transform (FFT) that operates on a finite field
and provides an efficient method for polynomial multiplication. It reduces the complexity of polynomial multiplication from O(n2 ) by using school-book method to O(nlogn)
by using point-wise multiplication method. The NTT-based
multiplication can be perform as: c=INTT(NTT(a)◦NTT(b)).
Where a, b, c could be polynomial matrix or vector ∈ Rq , ◦
denotes point-wise multiplication of polynomial coefficients.
VOLUME X, 2023
Algorithm 1 Key Generation KeyGen() [11]
Output: pk = (ρ, t1 ), sk = (ρ, K, tr, s1 , s2 , t0 )
256
1: ζ ← {0, 1}
′
2: (ρ, ρ , K) ← H(ζ)
3: A ∈ Rqk×l := ExpandA(ρ′ )
4: (s1 , s2 ) ∈ Sηl × Sηk
5: t:= As1 + s2
6: (t0 , t1 ) := Power2Roundq (t, d)
256
7: tr ∈ {0, 1}
:= H(ρ ∥ t1 )
8: return (pk, sk)
Algorithm 2 Signature Generation Sign(sk, M ) [11]
∗
Input: sk = (ρ, K, tr, s1 , s2 , t0 ), M ∈ {0, 1}
Output: σ = (c̃, z, h)
1: A ∈ Rqk×l :=ExpandA(ρ′ )
2: µ ← H(tr ∥ M ), ρ′ ← H(K ∥ µ)
3: κ := 0, (z, h) :=⊥
4: while (z, h) :=⊥ do
5:
y ∈ S̃γl 1 := ExpandMask(ρ′ , κ)
6:
w := Ay
7:
w1 := HighBitsq (w, 2γ2 )
8:
c̃ ← H(µ ∥ w1 ), c := SampleInBall(c̃)
9:
z := y + cs1
10:
r0 := LowBitsq (w, 2γ2 )
11:
if ∥z∥∞ ≥ γ1 − β or ∥r0 ∥∞ ≥ γ2 − β then
12:
(z, h) :=⊥
13:
else
14:
h := MakeHintq (−ct
P 0 , w − cs2 + ct0 , 2γ2 )
15:
if ∥ct0 ∥∞ ≥ γ2 or hi > ω then
16:
(z, h) :=⊥
17:
end if
18:
end if
19: end while
20: return σ = (c̃, z, h)
Dilithium, as a lattice-based cryptosystems, utilizes the
NTT for matrix-vector and vector-vector multiplications. The
NTT is performed in the ring Rq = Zq [x]/(xn + 1), where
n is a power of 2 and q is a prime number. In Dilithium, the
modulus q is chosen such that there exists a 2n-th root of
unit (ϕ2n = 1753). There are two common algorithms for implementing NTT, Cooley-Tukey decimal-in-time (DIT) algorithm for NTT and Gentleman-Sande decimal-in-frequency
(DIF) algorithm for INTT. By applying the combination of
these algorithms accelerate computations and eliminate the
need of bit-reversal step [15].
III. ARCHITECTURE AND DESIGN DECISIONS
As the design’s goal is efficient low-latency implementation
for CRYSTALS-Dilithium on the FPGA SoC platform, the
proposed architecture is supported to perform all Dilithium’s
main operations at three security levels as defined in its specification. This design supports all processes including packing
and sending signature and keypair, which can cooperate with
For Review Only
3
Page 5 of 11
IEEE Access
Author et al.: Preparation of Papers for IEEE TRANSACTIONS and JOURNALS
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
Algorithm 3 Verification Verify() [11]
∗
Input: pk = (ρ, t1 ), M ∈ {0, 1} , σ = (c̃, z, h)
Output: Valid or Invalid
1: A ∈ Rqk×l := ExpandA(ρ′ )
2: µ ← H( H(ρ ∥ t1 ) ∥ M )
3: c := SampleInBall(c̃)
′
d
4: w1
:= UseHintq (h, Az
P − ct1 ·2 , 2γ2 )
5: if ∥z∥∞ < γ1 − β &
hi ≤ ω & c̃ = H(µ ∥ w1′ ) then
6:
return Valid
7: end if
8: return Invalid
other software or hardware platform easily. This section will
introduce the overall high-level architecture and its optimized
modules used to improve latency. Accompany with the efficient hardware implementation, the timing schedules in Figs.
2, 3, 4 are proposed to optimize the work of entire architecture and get low-latency aspect which take advantage of the
most important and time-consuming sub-modules (arithmetic
and hashing module specify for Dilithium).
A. HIGH-LEVEL CONFIGURABLE ARCHITECTURE
The high-level architecture of our hardware design is shown
in Fig. 1. As a configurable architecture implements three
main operations of Dilithium, this design depicts all components used. To connect outside, the bus width is 64 bits
to suitable with some software. The block inside is working
in 92 bits to perform four coefficients at the same time.
When executing one of those algorithms individually, some
sub-modules can be removed. A brief description of these
modules and their works are explained below.
Control unit: Working as a finite-state machine (FSM),
the control unit is responsible for controlling different modules based on the required functionality and sequencing the
entire hardware module to execute all main operations of
Dilithium in their corresponding sequences.
Polynomial BRAM: This module contains six groups
of three dual-port 36Kb BRAMs used to store intermediate value values during the computation. In the Dilithium
scheme, the modulus q is 8380417, allowing for coefficients
to be represented within 23 bits. The Xilinx FPGA supports
dual-port 36Kb BRAM memory on chip [16]. Instead of
storing each coefficient in a separate BRAM address, this
module is designed as a group of three BRAMs, each configured as a 36x1024 memory. This configuration allows for
four coefficients, with a bandwidth requirement of 4x23=92
bits, to be stored in a single address of the group BRAMs,
resulting in a 25% reduction in BRAM utilization compared
to the straightforward approach. For security level 5, where
matrix A contains 56 polynomials, the BRAM configuration
is set to have 92bits dual-port data width and depth of 4096.
Additionally, five 92x1024 bits BRAMs are used to store
intermediate values during the computation process.
Polynomial arithmetic module: used to all arithmetic
operations in Dilithium, including NTT, INTT, point-wise
4
FIGURE 1. High-level configurable architecture for Dilithium.
multiplications, additions and subtractions for polynomial
matrices and vectors.
The Keccak core: includes the SHAKE 128 module and
SHAKE 256 module, which function as hash functions or
Extendable-Output Functions (XOF) [17]. These modules
generate hash values of messages, random seeds, and pseudorandom data, which are then used for polynomial sampling
in the Dilithium scheme.
Sampling unit: This module consists of four sub-modules
responsible for generating different types of pseudorandom
samples used in Dilithium. ExpandMatrix maps a uniform
seed to matrix A, UniformEta for generating the secret vector
s1 , s2 , ExpandMask to make the masking vector y and
SampeInBall to create challenge c.
Decomposer: This module breaks up a finite field element
r in Zq into their “high-order” bits and “low-order” bits, such
that r = r1 · 2γ + r0 where 0 ≤ r0 < 2γ.
Hint: used to describes the work of both MakeHint and
UseHint functions in Algorithm 2 and 3. Given an arbitrary
element r ∈ Zq and a small element z ∈ Zq , "MakeHint"
records a 1-bit hint h as the "carry" that enables the computation of the higher-order bits of r + z using only r and h.
Conversely, "UseHint" utilizes the hint to recover the highorder bits of the sum.
Encoder/decoder: To interface with other hardware or
software platforms, which typically have a bandwidth of 2n ,
this implementation integrates encoder and decoder units.
The encoder is responsible for packing polynomial vectors
into byte strings. The different vectors, along with their corresponding bit widths, are packed into fixed 8-byte strings.
The decoder can then recover the original vectors from these
byte strings.
Depending on the algorithm being performed, certain processing units can operate in parallel to reduce overall latency,
provided there is no data flow conflict between them. From
the Algorithm 1-3, the polynomial generation and computation phases in the Dilithium scheme are the most time-
For Review Only
VOLUME X, 2023
IEEE Access
Page 6 of 11
Author et al.: Preparation of Papers for IEEE TRANSACTIONS and JOURNALS
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
FIGURE 2. Timing diagram of Key generation phase.
FIGURE 3. Timing diagram of Signature generation phase.
FIGURE 4. Timing diagram of Verification phase.
consuming processes. To improve performance, this design
optimizes the sequencing of these tasks, ensuring they are
executed continuously as much as possible.
From the algorithms and the proposed architecture, we
have developed an optimized timing schedule that maximizes the utilization of sub-modules and maintains constant
progress on the critical path of the operations. In our architecture, the controller operates as a finite-state machine,
ensuring sequential processing of all hardware modules. To
enhance clarity, we present the optimized schedule specifically for level 5 of the three algorithms.
Key generation and Verification: The FSM schedule for
Key generation and Verification operations are described
straightforwardly stage by stage in Figs. 2 and 4 corresponding with Algorithm 1 and 3, respectively. The longest
path among them is the computation, hashing and sampling,
which occupies around 90% total time execution. The packing/unpacking phases are immediately executed in parallel
with the generation matrix, vectors and computations. Pack
VOLUME X, 2023
t0 , Pack t1 perform the function Power2Round, the straightforward bit-wise way to break up t into 13-bits high-order
t1 and 10-bits low-order t0 , in line 6 of Algorithm 1. The
key generation progress is finished after packing their ingredients. In verification, the validity of the signature depends on
the result of the comparison described in line 5 of Algorithm
3, which is implemented in stage 7 as depicted in Fig. 4.
Signature generation: This is the most complex main
operation in Dilithium. It consists of two phases: precomputation and rejection loop, as shown in Fig. 3. During the precomputation phase, the secret key and message are unpacked
and the necessary calculations are performed prior to the
while loop in Algorithm 2. This phase is executed only once.
The rejection loop is responsible for generating the signature
until it satisfies the requirements specified in the algorithm.
The average number of loops can be found in the Table 1.
Inspired by the approach proposed by Beckwith et al. [4], the
polynomial arithmetic unit is divided into two independent
parts, functioning as described in mode 2 of this module.
For Review Only
5
Page 7 of 11
IEEE Access
Author et al.: Preparation of Papers for IEEE TRANSACTIONS and JOURNALS
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
The first part is used to create the vector w and executes
the functions from line 5 to 7, while the remaining functions
within the loop are performed in the second module.
B. HASHING AND SAMPLING
Dilithium scheme utilizes SHAKE 128 and SHAKE 256 as
its hashing functions. Both of these functions are ExtendableOutput Functions (XOFs) from the SHA-3 family, based on
the same Keccak permutation with a state size of 1600 bits.
However, they have different rates in the sponge construction,
namely 1344 and 1088, respectively [17]. More precisely, the
ExpandA function uses the output stream of SHAKE 128 to
map a uniform seed to matrix A ∈ Rqk×l . SHAKE 256 is used
for generating a random seed from an initial seed, polynomial
secret vectors s1 and s2 , and masking vector y. It is also
used for generating the challenge c through the SampleInBall
function and other hashing tasks in the Dilithium scheme.
Due to the shared Keccak-f[1600] core with a 24-round
permutation, the previous works use unified Keccak module
which can perform both of SHAKE128/256. This approach
saves area compared to implementing individual units, but
it has some drawbacks. Firstly, the sequential generation of
the polynomial matrix A and other vectors in the Dilithium
algorithms consumes considerable time, while the computation phase can only proceed once the sampling process
is complete. Besides, the hardness of the module-LWE in
Dilithium primarily depends on the dimensions (k, l) of
the vectors and matrices in the scheme, leading to the requirement of a large amount of pseudorandom data, corresponding to longer execution times in this phase and entire
main operation. Secondly, as the combined architecture for
SHAKE 128/256, the complexity of this module could limit
the maximum achievable frequency of entire architecture. To
mitigate this and reach the low latency target, we propose
using independent modules SHAKE 128 and SHAKE 256.
Both of that module has the same 64 bits data-path core,
which performs 24-round Keccak permutation.
For the requirement of a large pseudorandom data for matrix A, we use double 64 bits data-path cores in SHAKE128
module, while the SHAKE 256 module uses a single core.
Our modules are modified from the high-performance Keccak implementation core developed by the Keccak team [18].
This implementation supports for bus widths of 8, 16, 32,
64 or 128 bits, and we choose a 64-bit data path as it is
the greatest common divisor of the rates (r = 1344 and
1088) in SHAKE 128/256 and is also suitable for connecting
with other software/hardware platforms. The Keccak control
unit manages the internal signals to control the module’s
operations and facilitate the transition between the absorb
and squeeze phases. In the case of sampling matrix A, depict
in Fig. 5, SHAKE-128 module needs 5 cycles to absorb
256 bit of seed and 16-bit nonce before transferring to the
squeeze phase, where the sampling and hashing processes
work continuously. With the condition of coefficient in matrix A is within Zq (where q is the modulus), this module
requires 204 cycles to generate two polynomials with 256
6
FIGURE 5. Sampling for matrix A.
coefficients (without taking into account the possibility of
samples being rejected due to violating the condition). Using
double permutation cores speeds up the time for sampling
matrix A which up to 56 polynomials in security level 5.
The remaining hashing tasks and sampling vectors in
Dilithium are implemented in the SHAKE 256 core in a
similar manner. This module consists of a single data-path
core controlled by the Keccak control unit, specifically designed for SHAKE 256. It interfaces with the corresponding
sampling units to generate the required vectors based on their
rejection conditions and received data widths, as outlined in
Table 1.
C. FLEXIBLE POLYNOMIAL ARITHMETIC MODULE
There are two popular variants of the NTT architecture:
memory-based and pipelined architectures. The pipelined
architecture requires a large number of shift registers to store
delay units, which depend on the chosen scheme [19]. On
the other hand, the memory-based architecture utilizes butterfly units to perform calculations layer-by-layer according
to a diagram [4], [6], [7], [8]. This approach can directly
connect with memory, eliminating the need for delay units.
However, each butterfly unit needs to read and write back two
coefficients per cycle, which limits the number of butterfly
units that can be used to accelerate the speed of NTT due to
the memory port limitations. Moreover, the memory access
order for both intermediate values and twiddle factors is
complex, requiring a complicated control module as seen
in the 2x2 NTT module [4] and the dual butterfly unit [7],
[10]. In terms of latency, the pipelined architecture only
requires a long delay time from the beginning of input to output while the memory-based architecture requires a certain
stall delay length between changing stages when performing NTT/INTT. For long serial NTT/INTT operations like
Dilithium, the pipelined architecture exhibits better latency
than the memory-based architecture.
Another important consideration is the extensive computational requirements of the signature generation process in
Dilithium compared to the other two processes, particularly
during the rejection loop phase. To improve performance in
For Review Only
VOLUME X, 2023
IEEE Access
Page 8 of 11
Author et al.: Preparation of Papers for IEEE TRANSACTIONS and JOURNALS
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
FIGURE 6. Flexible polynomial arithmetic module in mode 1.
FIGURE 7. Flexible polynomial arithmetic module in mode 2.
this phase, we propose an idea inspired by [4] to distribute the
calculations across two polynomial arithmetic (PolyArith)
modules. The former module calculates functions from line
5 to 7 of Algorithm 2, while the latter module handles the
remaining calculations. Based on the above data analysis,
our arithmetic module utilizes the radix-2 MDC pipelined
architecture [19] with 8 butterfly units to handle all the
polynomial arithmetic operations in Dilithium. This module
offers two modes that can be flexibly adapted to suit the
characteristics of Dilithium, both of them are configurable
to support NTT/INTT operations with the data flow pass
through the black and faint lines, respectively, in Figs. 6, 7.
1. Fully pipelined mode: In this mode, we use eight
butterfly units corresponding to the eight layers of Dilithium
for N=256. The stall delay length from input to output is
153 clock cycles, after that each polynomial requires 128
cycles for NTT/INTT. The butterfly units can be reused for all
polynomial arithmetic, processing 8 coefficients in parallel.
This mode is employed in the Keygen and Verify operations.
2. Folding transform mode: In this mode, we divide our
module into two independent modules, each equipped with
four butterfly units. By applying the folding transformation
method [20], each butterfly unit is responsible for calculating two adjacent NTT layers in a time-sliced fashion. This
enables us to employ four butterfly units to implement radix2 MDC in Dilithium, similar to the approach in [5]. In this
mode, during NTT/INTT, the butterfly units handle the odd
stages in odd cycles and the even stages in even cycles. This
mode has a delay of 281 cycles from input to output and
requires 256 cycles to transfer a single polynomial during
NTT/INTT. It also supports the processing of 4 coefficients
per cycle during polynomial arithmetic. This mode is used
for Sign operation, one module to compute vector y and w,
the other for remaining computations as shown in Fig. 3.
This module utilizes 6 registers to store twiddle factors
connected with processing elements (PE) 1 and PE4. Additionally, we use two 36Kb BRAMs configurations in a
72x512 mode to store twiddle factors connected with the
remaining 6 PEs. Shift registers are used to implement a
FIFO unit for saving intermediate values during NTT/INTT,
consuming (64+32+...+2+1) x 4 x 23 bits of FFs resources.
VOLUME X, 2023
1) Butterfly unit.
The Butterfly unit (BU) plays a crucial role in the arithmetic
module for performing various operations. In the proposed
design, a configurable BU architecture is presented, supporting both the CT BU and GS BU methods, as depicted in
Fig. 8. The proposed butterfly unit consists of one modular
multiplier, one modular adder, and one modular subtractor,
without requiring any additional modular multiplier, adder, or
subtractor compared to a dedicated CT or GS butterfly unit.
As shown in Fig. 8 the proposed butterfly unit takes a, b
and ω as input, the NTT/INTT control signal in the blue line
acts as a selection signal for the multiplexers, configuring the
BU to operate as either a GS or CT butterfly unit. the BU
architecture enables it to handle point-wise multiplication,
addition, and subtraction through the corresponding internal
unit. A and B denotes the output ports, when performing
point-wise multiplication or subtraction, the result is taken
from port B, while port A is utilized for addition using the
CT butterfly configuration.
In [21] Zhang et al. proposed a technique to eliminate the
multiplication of resulting coefficients with n−1 (mod q) after
the INTT operation. The output can be seen as (A, B) :=
(2-1 (a + b), (a - b)ω). For an odd prime q, x/2 (mod q) can
be performed as shown in Eqn. 1. In this work, we adopt this
technique by insert divide 2 operation integrated in addition
unit, and pre-processed the twiddle factor for the inverse NTT
to incorporate the factor 2-1 .
x/2(modq) = (x ≫ 1) + x[0].(q + 1)/2
(1)
2) Modular multiplication.
Modular multiplication, addition and subtraction are the fundamental arithmetic components inside the butterfly unit. The
reference implementation on software platforms of Dilithium
[11] uses Montgomery reduction after multiplication and
Barrett reduction for addition and subtraction. Both of these
For Review Only
7
Page 9 of 11
IEEE Access
Author et al.: Preparation of Papers for IEEE TRANSACTIONS and JOURNALS
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
TABLE 2. Resource utilization of sub-modules in the architecture.
Sub-modules
LUT
FIGURE 8. Proposed butterfly unit.
Resource Utilization
FF
DSP
RAM
Polynomial RAM
Hint
Encoder
Decoder
Decomposer
PolyArith
SHAKE 128
SHAKE 256
Sampling
Other
0
10476
2183
2683
1565
5819
10278
5126
8009
5682
0
3776
464
251
728
3527
6641
3941
2921
1844
0
0
0
0
0
16
0
0
0
0
25.5
0
0
0
0
2
0
0
0
0
Combined Architecture
51821
24093
16
27.5
IV. IMPLEMENTATION RESULTS AND COMPARISON
A. RESOURCES UTILIZATION AND PERFORMANCE
RESULTS
FIGURE 9. Architecture for modular reduction unit.
can apply on hardware and has been adopted in some related
works but not seem to be a best choice. While Montgomery
needs to transfer from normal domain to Montgomery domain leading to additional multiplications, Barret reduction
does not exploit the specification of modulus q in Dilithium.
In [8], Land proposes the reduction method by recursively
exploit the relation 223 ≡ 213 − 1.
s[45 : 0] ≡ 223 s[45 : 23] + s[22 : 0]
≡ 213 s[45 : 23] − s[45 : 23] + s[22 : 0]
≡ 223 s[45 : 33] + 213 s[32 : 23] − s[45 : 23] + z
≡ 213 (s[45 : 43] + s[42 : 33] + s[32 : 23])
− (s[45 : 43] + s[45 : 33] + s[45 : 23]) + z
(2)
As we know, the subtraction is not friendly with hardware
like addition, we use additive inverse with s̄ is negative s, so
we transfer the equation above to:
s[45 : 0] ≡ s̄[45 : 23] + (213 s[32 : 23] + s̄[45 : 33])
+ (213 (s[42 : 33] + s̄[45 : 43]) + 213 s[45 : 43]
+ z + 50331648
≡ s̄[45 : 23] + {s[42 : 33], 10′ b0, s̄[45 : 43]}
+ {s[32 : 23], s̄[45 : 33]} + 213 s[45 : 43] + z + m
(3)
The result of the reduction at this point is observed within
the interval (-q, 3q), we can simply obtain the hardware
architecture for modular multiplication in Fig. 9 where m =
50331648 and {} denotes concatenated function.
8
Table 2 provides an overview of the resource consumption for
the entire design and its sub-modules. The design as a whole
utilizes 51821 LUTs, 24093 FFs, 27.5 BRAMs, and 16 DSPs,
working in the maximum frequency 280 MHz. To achieve the
low latency target, we selected the Virtex Ultrascale+ platform, and all the resource data was generated using Xilinx
Vivado 2022.2. Due to our data-path employing 4 parallel
paths to execute 4 coefficients per cycle, the resource usage
is higher compared to other lightweight target works. The
two Keccak units account for approximately 30% and 44%
of the LUT usage, making them the most resource-intensive
components. The design utilizes a total of 27.5 BRAMs,
with 10.5 BRAMs allocated for storing matrix A and 15
BRAMs used for intermediate values during computation.
The polynomial arithmetic module requires 2 BRAMs to
store twiddle factors. As a combined architecture, some units
are only utilized in specific operations and are not called
when performing other tasks. For example, the Hint module
is not involved in the Keygen operation. However, the Keccak
core, ExpandA, PolyArith, encoder/decoder, and polynomial
RAM can be fully shared among different operations, consuming 54% LUTs, 65% FFs, and 100% DSPs.
In term of latency, we have compiled statistics on the
overall execution time, including the data input, unpack and
pack signature and keypair processes. From the average
statistics of 1000 executions, we obtained the following specific parameters. The Keygen operation requires an average
of 3830, 5925, and 9930 clock cycles (CCs) for the three
security levels. The interval between each sample depends
on the rejection rate of matrix A and secret vectors s1 but
remains relatively small due to the high acceptance rate.
The Verify operation consumes 5161, 7382, and 10110 CCs
for the three NIST security levels. In this operation, only
valid samples are calculated, because the invalid sample is
processed much faster and does not show total time execution
of this operation. The interval time also depends on the
For Review Only
VOLUME X, 2023
IEEE Access
Page 10 of 11
Author et al.: Preparation of Papers for IEEE TRANSACTIONS and JOURNALS
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
TABLE 3. Comparison of FPGA-based designs for Dilithium signature scheme.
Sec.
Design
Device
lvl
Aikata [10]a
II
III
V
Freq.
Area
Keygen
Sign
Verify
(MHz)
LUT
FF
DSP
BRAM
cycles
µs
cycles
µs
cycles
µs
23277
9798
4
24
14594
54
31662
117
15423
57
ZUs+
270
Land [8]b
Artix-7
163
27433
10681
45
15
18761
115
76613
470
19687
121
Wang [6]b
Zynq-7000
159
18558
7342
10
17
7757
49
52038
327
7675
48
Zhao [5]c
Artix-7
96.9
29998
10366
10
11
4172
43
8361/31600
326.1
4422
45.6
Beckwith [4]c
VUs+
256
53907
28435
16
29
4875
19
10945/29876
43/117
6582
26
This workc
VUs+
280
51821
24093
16
27.5
3830
13.7
10800/29215
38.6/104.3
5161
18.4
Aikata [10]a
ZUs+
270
23277
9798
4
24
23619
87
48446
171
26124
97
Land [8]b
Artix-7
145
30900
11372
45
21
33102
229
123218
852
32050
222
Wang [6]b
Zynq-7000
159
19614
8466
10
21
12982
82
89213
561
11232
71
Zhao [5]c
Artix-7
96.9
29998
10366
10
11
5851
60.4
11399/49496
510.8
6181
63.8
Beckwith [4]c
VUs+
256
53907
28435
16
29
8291
32
16178/49437
63/193
9724
39
This workc
VUs+
280
51821
24093
16
27.5
5925
21.2
15130/46545
54/166.2
7382
26.4
Aikata [10]a
ZUs+
270
23277
9798
4
24
39737
147
70179
260
46671
173
Naina [7]b
ZUs+
391
13975
6845
4
35
63.2k
162
113.9k
291
67.9k
174
Land [8]b
Artix-7
140
44653
13814
45
31
50892
364
145912
1042
52712
377
Wang [6]b
Zynq-7000
159
20973
9677
10
28
20185
127
93708
589
15875
10
Zhao [5]c
Artix-7
96.9
29998
10366
10
11
8765
90.5
15790/55321
570.9
9039
93.3
Beckwith [4]c
VUs+
256
53907
28435
16
29
14037
55
24358/55070
95/215
14642
57
This workc
VUs+
280
51821
24093
16
27.5
9930
35.5
22898/52547
81.8/187.7
10110
36.1
a : Multiple schemes, b : Supports dependent level, c : Supports all levels.
rejection condition in generating matrix A. The number of
cycles required for the Sign operation varies widely due to the
rejection loop in signature generation. The best-case scenario
occurs when the signature is accepted after the first iteration,
resulting in a time of 10800, 15130, and 22898 CCs for the
three security levels. The processing time of input message is
determined by the size of the message and operates at a rate
of 8 bytes per cycle. Based on the parameters in the Dilithium
specification, an average of 4.25 attempts is required for level
2, 5.1 attempts for level 3, and 3.85 attempts for level 5.
The average time to execute one attempt are 5665, 7658, and
10340 CCs, leading to average times of signature generation
are 29215, 46545, and 52547 CCs for three security levels.
B. COMPARISON WITH RELATED WORKS
Among the fully hardware implementations for Round-3
Dilithium based on FPGA platforms, several state-of-the-art
works with the best results are listed in Table 3. As mentioned
above, these implementations vary in their approaches, such
as Lightweight, High-performance, or others. To compare
the effectiveness of these implementations, we employ the
area-time trade-off (ATP) metric. The ATP values are calculated by multiplying the latency with the number of utilized LUTs, FFs, and DSPs, denoted as ATP_LUT, ATP_FF,
and ATP_DSP, respectively. Additionally, the utilization of
BRAM is also an important factor to consider. In this metric,
VOLUME X, 2023
our work serves as the "center point" for comparison, with a
value set to 1. Lower values indicate better performance, as
both area and time consumption should be minimized.
In Table 3, our work and the implementation in [4], [5]
represents fully hardware designs for all three security levels
of Dilithium. In [6], [8], the authors propose independent
designs for each security level. Additionally, with the same
lattice-based structure, Aikata et al. [10] propose a combined
architecture for CRYSTALS-Kyber and Dilithium scheme.
To facilitate the evaluation of design effectiveness, Figs. 1012 present a comparison using the ATP metric for corresponding parameters of level 5, which is the highest complexity supported by Dilithium. We list the four best stateof-the-art designs published to form a comparison table of
ATP_LUT, ATP_FF, and ATP_DSP parameters.
The work of Land [8] is one of the earliest designs for
Round 3 Dilithium. This design has several non-optimized
aspects, such as the use of separate NTT/INTT blocks and
multiplication blocks, which could be combined to achieve
efficiency without increasing algorithm execution time. Thus,
from the data in Table 3, it is evident that our design achieves
significantly better performance than this design.
In [7], Naina proposed a lightweight implementation for
level 5 Dilithium on the Zynq Ultrascale+ hardware platform. With a focus on minimizing hardware utilization, this
design achieves the best results in terms of LUTs and FFs.
For Review Only
9
Page 11 of 11
IEEE Access
Author et al.: Preparation of Papers for IEEE TRANSACTIONS and JOURNALS
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
FIGURE 10. Comparison of Area×Time (LUTs × s).
FIGURE 11. Comparison of Area×Time (FFs × s).
However, regarding execution time, this design only provides
data for the best-case scenario of the Sign operation without
specific information about the average execution time of this
algorithm. Despite the differences in design approach, as our
design supports various security levels, our implementation
outperforms this design by 21% in terms of BRAM usage
and demonstrates better efficiency in almost statistics when
evaluated using the ATP metric, as shown in Fig. 10-12.
Wang et al. [6] utilized three independent cores for the
corresponding security levels of Dilithium. This work is software/hardware codesign, where bit packing and unpacking
are not included in the hardware part. Therefore, when comparing with the ATP metric for level 5, a direct comparison
may not be totally fair as our design, which supports 3
levels, is more complex and resource-consuming. However,
our implementation shows a slight improvement in terms of
memory utilization and demonstrates an enhancement of up
to 1.1∼2.2× in ATP parameters, as depicted in Fig 10 to 12.
The work of [5] presents a compact design for Dilithium
on the low-end Artix-7 platform. Although it also supports all
three levels, it lacks the flexibility of our design. Unlike our
design, its results are stored only in its memory, and it does
not support the packing and sending of results to connect with
external hardware/software platforms. Additionally, their average execution time statistics do not include the data input
receiving and sending processes. The execution time for
Verify operations is calculated only in cases where the public
key remains unchanged. In similar cases, our work only
requires 4377, 5846, and 8062 CCs for the three levels. The
results depicted in Fig. 10-12 also show our implementation
is better when evaluated by the ATP metric.
Beckwith et al. [4] proposed a design that, similar to
ours, provides support for all levels of Dilithium. However,
they use the Minerva tool [22] to determine the maximum
frequency, which represents a different benchmark compared
to other designs. As shown in Table 3, our work outperforms
theirs in terms of execution time and resource utilization.
In summary, our implementation exhibits the lowest latency compared to the other designs. Furthermore, when
compared to the combined architecture supporting all security levels of Dilithium in [4] and [5], our design demonstrates efficiency improvements ranging from 1.1∼1.9×
10
FIGURE 12. Comparison of Area×Time (DSPs × s).
evaluating by ATP metric.
In comparison to the software implementation, our hardware implementation in this work demonstrates significant
performance improvements. Seo et al. [23] reported an optimized NEON implementation on an 8-core ARM v8.2 64-bit
CPU mounted on Jetson AGX Xavier. For Dilithium level
5, their implementation takes 542µs for Keygen, 625µs for
Verify, and 1001µs for the best-case scenario of Sign. In the
same tasks, our hardware implementation outperforms their
software implementation, achieving a speed-up of 15.3×,
17.3×, and 12.2×, respectively. Additionally, Becker et al.
[24] presented a software implementation on high-end CPUs,
specifically the ARMv8 Apple M1 Firestorm core at 3.2GHz.
Their reported execution times for the three main operations
of Dilithium-III are 48µs, 114µs, and 33µs. In terms of
time execution in microseconds, our hardware implementation achieves improvements of 2.3×, 0.7×, and 1.25× for
Keygen, Sign, and Verify, respectively.
V. CONCLUSION
In this paper, we present an efficient and low-latency hardware implementation of round 3 CRYSTALS-Dilithium,
which supports all main operations at three security levels
of NIST. Our implementation enables the complete execution of all Dilithium tasks independently or in interaction
with other software/hardware platforms to accelerate specific
processes. When compared with the state-of-the-art hardware architecture for all three security levels of Dilithium,
For Review Only
VOLUME X, 2023
IEEE Access
Page 12 of 11
Author et al.: Preparation of Papers for IEEE TRANSACTIONS and JOURNALS
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
we achieve the lowest latency and efficiency in terms of
area/performance trade-offs, with a significant improvement
of 1.1 to 1.9× as evaluated by the ATP metric. Additionally,
this work introduces optimized arithmetic operations and
Hashing & Sampling modules tailored to the characteristics
of Dilithium, enhancing the overall efficiency of the scheme.
REFERENCES
[1] P. W. Shor, “Polynomial-time algorithms for prime factorization and
discrete logarithms on a quantum computer,” SIAM Journal on Computing,
vol. 26, no. 5, pp. 1484–1509, Oct. 1997.
[2] NIST, “Status report on the third round of the NIST-PQC standardization Process,”. Available: https://csrc.nist.gov/projects/post-quantumcryptography.
[3] NIST, “FIPS 186-4 - Digital Signature Standard (DSS),” url:
https://nvlpubs.nist.gov/nistpubs/fips/nist.fips.186-4.pdf.
[4] L. Beckwith, D. T. Nguyen, and K. Gaj, “High-performance hardware
implementation of CRYSTALS-Dilithium,” Crypto. ePrint Arch., Report
2021/1451, 2021.
[5] C. Zhao, N. Zhang, H. Wang, B. Yang, W. Zhu, Z. Li, M. Zhu, S. Yin, S.
Wei, and L. Liu, “A compact and high-performance hardware architecture
for CRYSTALS-Dilithium,” IACR Trans. Cryptogr. Hardw. Embed.Syst.,
vol. 2022, no. 1, pp. 270–295, 2022.
[6] T. Wang, C. Zhang, P. Cao and D. Gu, “Efficient implementation of
Dilithium signature scheme on FPGA SoC Platform,” IEEE Transactions
on Very Large Integration (VLSI) systems, vol: 30, Sep. 2022.
[7] N. Gupta, A. Jati, A. Chattopadhyay, and G. Jha, “Lightweight Hardware
Accelerator for Post-Quantum Digital Signature CRYSTALS-Dilithium,”
Cryptology ePrint Archive, Paper 2022/496, 28 April 2022.
[8] G. Land, P. Sasdrich, and T. Guneysu, “Hard Crystal-Implementing
Dilithium on Reconfigurable Hardware,” Cryptology ePrint Archive
2021/355, Mar. 2021.
[9] S. Ricci, L. Malina, P. Jedlicka, D. Smekal, 1. Hajny, P. Cibik, and
P. Dobias, “Implementing CRYSTALSDilithium Signature Scheme on
FPGAs,” Cryptology ePrint Archive 2021/108, Jan. 2021.
[10] A. Aikata, C. Mert, M. Imran, S. Pagliarini and S. Roy, “KaLi: A crystal
for post-quantum security using Kyber and Dilithium,” IEEE Transactions
on Circuits and Systems I, vol: 70, Feb. 2023.
[11] S. Bai, L. Ducas, E. Kiltz, T. Lepoint, V. Lyubashevsky, P. Schwabe,
G. Seiler, and D. Stehl´e, “CRYSTALS-Dilithium,” Proposal to NIST
PQC Standardization, Round3, 2021, https://csrc.nist.gov/Projects/postquantum-cryptography/round-3-submissions.
[12] Vadim Lyubashevsky, “Fiat-shamir with aborts: Applications to lattice and
factoring-based signatures,” in International Conference on the Theory
and Application of Cryptology and Information Security, Springer Berlin
Heidelberg, 2009, pp. 598–616.
[13] Y. Wang and M.Wang, “Module-LWE versus Ring-LWE, Revisited,”
Cryptology ePrint Archive 2019/930, Aug. 2019.
[14] E. Kiltz, V. Lyubashevsky, and C. Schaffner,“A concrete treatment
of Fiat–Shamir signatures in the quantum random-oracle model,” in
Proc.37th Annu. Int. Conf. Theory Appl. Cryptograph. Techn., pp.
552–586, Tel Aviv, Israel, Apr. 2018.
[15] P. Longa and M. Naehrig, “Speeding up the number theoretic transform for
faster ideal lattice-based cryptography,” in Proc. 15th Int. Conf. Cryptol.
Netw. Secur., pp. 124–139, Milan, Italy, Nov. 2016.
[16] “UltraScale Architecture Memory Resources Advance Specification User
Guide,” UG573 (v1.1) August 14, 2014.
[17] National Institute of Standards and Technology, in FIPS PUB 202: SHA3 Standard: Permutation-Based Hash and Extendable-Output Functions.
Aug. 2015.
[18] G. Bertoni, J. Daemen, S. Hoffert, M. Peeters, and G. V. Assche.(2020).,
Keccak in VHDL. [Online]. Available:https://keccak.team/hardware.html
[19] S. He and M. Torkelson, “A new approach to pipeline FFT processor,” in
Proceedings of International Conference on Parallel Processing, pp. 766770, April 1996.
[20] K. K. Parhi, C.-Y. Wang, and A. P. Brown, “Synthesis of control circuits in
folded pipelined DSP architectures,” IEEE Journal of Solid-State Circuits,
vol. 27, no. 1, pp. 29– 43, Jan. 1992.
[21] N. Zhang, B. Yang, C. Chen, S. Yin, S. Wei and L.Liu, “Highly Efficient Architecture of NewHope-NIST on FPGA using Low-Complexity
NTT/INTT,” IACR Trans. on CHES, vol. 2020, no. 2, pp. 49–72, 2020.
VOLUME X, 2023
[22] F. Farahmand, A. Ferozpuri, W. Diehl, and K. Gaj, “Minerva: Automated
hardware optimization tool,” in 2017 International Conference on ReConFigurable Computing and FPGAs, ReConFig 2017, pp. 1-8, Cancun:
IEEE, Dec. 2017.
[23] S. C. Seo, S. W. An, “Parallel implementation of CRYSTALS-Dilithium
for effective signing and verification in autonomous driving environment,”,
ICT Express, vol. 9, issue 1, pp. 100–105, Feb. 2023.
[24] H. Becker, V. Hwang, M. J. Kannwischer, B.-Y. Yang, and S.-Y. Yang,
“Neon NTT: Faster Dilithium, Kyber, and Saber on Cortex-A72 and Apple
M1,” Cryptology ePrint Archive 2021/986, Jul. 2021.
QUANG TRUONG-DANG (Graduate Student
Member, IEEE) received his B.S. degree in Electrical and Electronic Engineering from Le Quy
Don University of Technology, Vietnam, in 2019.
He is currently pursuing his M.S. in Electrical
and Computer Engineering from Inha University.
His research interests include algorithm and architecture design for post-quantum cryptography and
homomorphic encryption.
PHAP DUONG-NGOC (Member, IEEE) received the M.S. degree in electronic engineering
from the University of Danang, Vietnam, in 2015,
and the Ph.D. degree in Electrical and Computer
Engineering from Inha University, South Korea, in
2022. He is currently a Post-Doctoral Researcher
with the Artificial Intelligence System-on-Chip
Research Center, Inha University. His research
interests include algorithm and architecture for
digital signal processing, hardware acceleration,
post-quantum cryptography, and homomorphic encryption.
HANHO LEE (Senior Member, IEEE) received
Ph.D. and M.Sc. degrees in Electrical and Computer Engineering, from the University of Minnesota, Minneapolis in 2000 and 1996, respectively. He was a member of the technical staff
at Lucent Technologies (Bell Labs Innovations),
Allentown from April 2000 to August 2002, and
Assistant Professor in the Department of Electrical and Computer Engineering, University of
Connecticut, USA from August 2002 to August
2004. He has been with the Department of Information and Communication
Engineering, Inha University since August 2004, where he is currently a
Professor. He was visiting scholar at Bell Labs, Alcatel-Lucent, Murray Hill,
New Jersey, USA from August 2010 to August 2011. His research interests
include algorithm and architecture design for cryptographic, forward error
correction coding, artificial intelligence and digital signal processing.
For Review Only
11
Download