IEEE Access Efficient Low-Latency Hardware Architecture for CRYSTALSDilithium Journal: IEEE Access Manuscript ID Access-2023-18854 Manuscript Type: Regular Manuscript Date Submitted by the 30-Jun-2023 Author: Complete List of Authors: Lee, Hanho; Inha University College of Engineering TRUONG, QUANG DANG; Inha University College of Engineering PHAP, DUONG-NGOC; Inha University College of Engineering Subject Category<br>Please select at least two subject Circuits and systems, Computers and information processing categories that best reflect the scope of your manuscript: Keywords: <b>Please choose keywords carefully as they Information security, Cryptography, Hardware acceleration help us find the most suitable Editor to review</b>: Additional Manuscript Keywords: Note: The following files were submitted by the author for peer review, but cannot be converted to PDF. You must view these files (e.g. movies) online. figure.rar For Review Only Page 1 of 11 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 IEEE Access AUTHOR RESPONSES TO IEEE ACCESS SUBMISSION QUESTIONS Author chosen Research Article manuscript type: - This paper proposes a low-latency hardware architecture that supports all three algorithms of CRYSTALS-Dilithium while allowing for customizable security levels. The design focuses on minimizing latency and achieves the lowest execution time for all algorithms Author explanation simultaneously, while also balancing execution time and resource /justification for utilization. - The proposed architecture proposes optimized choosing this sub-modules specifically tailored to the characteristics of Dilithium, manuscript type: enhancing the overall performance of the system. - The experimental results confirm that the proposed design achieved improvements in terms of the area-time product and exhibited lower execution time. - This paper technically presents a hardware architecture of CRYSTALS-Dilithium, a digital signature algorithm designed to Author description of withstand quantum and classical attacks. - The proposed how this manuscript fits architecture is flexible and configurable, capable of performing the within the scope of IEEE key algorithms of Dilithium and seamlessly integrating with other Access: hardware and software platforms. It accelerates processes such as random value generation for matrices or vectors and supports computation through dedicated modules within the design. - This paper proposes a low-latency hardware architecture for Dilithium that supports all three algorithms and allows for customizable security levels. Our design focuses on minimizing Author description latency and achieves the lowest execution time for all algorithms detailing the unique simultaneously, while balancing resource utilization. - The proposed contribution of the sub-modules are optimized specifically for Dilithium, improving the manuscript related to overall system performance. - Comparative analysis shows that our existing literature: NTT architectures exhibit lower execution time while maintaining comparable area-time efficiency compared to previous works. This makes our hardware architecture highly applicable for digital signature applications in the post-quantum era. For Review Only IEEE Access 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 Page 2 of 11 Date of publication xxxx 00, 0000, date of current version xxxx 00, 0000. Digital Object Identifier 00.0000/ACCESS.2020.DOI Efficient Low-Latency Hardware Architecture for CRYSTALS-Dilithium QUANG DANG TRUONG , (Graduate Student Member, IEEE), PHAP DUONG-NGOC , (Member, IEEE), AND HANHO LEE , (Senior Member, IEEE) Department of Electrical and Computer Engineering, Inha University, Incheon 22212, South Korea (e-mail: dangquang@inha.edu, dnphap@inha.ac.kr, hhlee@inha.ac.kr) Corresponding author: Hanho Lee (e-mail: hhlee@inha.ac.kr). This work was supported by the MSIT(Ministry of Science and ICT), Korea, under the ITRC support program (IITP-2021-0-02052) supervised by the IITP(Institute for Information & Communications Technology Planning & Evaluation), in part by the National Research Foundation of Korea (NRF) grant funded by the Korea government(MSIT) (No. 2021R1A2C1011232), and in part by the IITP grant funded by the Korea government(MSIT) (No.20210007790012003). ABSTRACT The rapid advancement of powerful quantum computers poses a significant security risk to current public-key cryptosystems, which heavily rely on the computational complexity of problems such as discrete logarithms and integer factorization. As a result, there is an increasing demand for alternative algorithms that can withstand both quantum and classical attacks, leading to ongoing evaluations for future standardization. Among these contenders, CRYSTALS-Dilithium, a lattice-based digital signature scheme, has emerged as a primary candidate in the NIST Post-Quantum Cryptography competition and is currently being considered for standardization. This paper proposes an efficient low-latency hardware architecture for CRYSTALS-Dilithium by leveraging optimized timing schedules for its three key operations. We further enhance the performance of the pipelined number theoretic transform module, as well as the hashing and sampling module, which are the most time-intensive sub-modules. Our FPGA-based implementation outperforms state-of-the-art studies, achieving an improvement of 1.1∼1.9× in terms of the area-time product and exhibiting lower execution time. Consequently, the proposed hardware architecture holds practical applicability for digital signature applications in the post-quantum era. INDEX TERMS Post-quantum cryptography (PQC), digital signature, CRYSTALS-Dilithium, latticebased cryptography (LBC), number theoretic transform (NTT). I. INTRODUCTION N recent years, the rise of quantum computing has posed a significant threat to traditional public-key cryptographic schemes. Shor’s algorithm [1], when executed on a sufficiently powerful quantum computer, has the capability to efficiently solve complex mathematical problems that form the basis of widely used cryptographic algorithms such as RSA and ECC. This realization has prompted the National Institute of Standards and Technology (NIST) to launch the Post-Quantum Cryptography (PQC) standardization process in 2016, with the aim of developing new public-key standards that are resistant to quantum computer attacks. After three rounds of evaluation and review, NIST has selected Kyber for key-encapsulation mechanism (KEM) and CRYSTALS-Dilithium, Falcon and SPHINCS+ for digital signature standardization to the fourth round of the PQC standardization process [2]. Among the finalists, Dilithium I VOLUME X, 2023 a lattice-based digital signature scheme has been recommended as the primary algorithm to be implemented. Besides, with three of four finalists are based on lattice structure, that show the potential to against both quantum and classical attacks. Lattice-based cryptography relies on the worst-case hardness of lattice problems, such as the Learning with Errors (LWE) problem, to provide security guarantees. Dilithium, along with other lattice-based schemes like Kyber, offers advantages such as fast arithmetic operations and compact key, ciphertext, and signature sizes. Given the potential of Dilithium to become a standardized post-quantum cryptographic algorithm, which is ready to replace current digital signature standard [3], there is a growing interest in developing efficient implementations of the scheme both in hardware and software. Most existing works on implementing and evaluating Dilithium have used pure software methods or hardware-software codesign methods. Developing and For Review Only 1 Page 3 of 11 IEEE Access Author et al.: Preparation of Papers for IEEE TRANSACTIONS and JOURNALS 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 benchmarking software implementations of cryptographic algorithms like post-quantum cryptography (PQC) is relatively straightforward. However, when it comes to hardware implementations, particularly on FPGA (Field-Programmable Gate Array), it requires significant time and design effort to assess their efficiency in terms of speed, area utilization, power consumption, and energy efficiency. FPGA implementations of PQC submissions are crucial for gaining valuable insights into the cost and performance characteristics of these algorithms when deployed in hardware environments. By exploring FPGA implementations, researchers can realize their ideas and apply the trade-off method, thereby optimizing the approach for hardware designs to achieve desired performance goals. There have been several previous works optimizing full hardware implementations for Round 3 Dilithium on FPGA platforms. Based on our research, we have observed that these implementations can be divided into two approaches. The first approach is a combined architecture that supports all security levels, as demonstrated in the designs presented in [4] and [5]. This architecture is capable of handling the required main operations for each security level within a single design. The second approach involves three individual designs, each dedicated to a specific security level. These designs are capable of performing all the required operations in Dilithium for their respective security levels, as shown in the works discussed in [6], [7], [8], [9]. In addition, some authors have proposed unified architectures that combine multiple schemes into one. For example, Aikata et al. introduced KaLi [10], a unified architecture that supports both CRYSTALS-Kyber and Dilithium. In hardware design, it is important to evaluate performance and resource consumption. Therefore, these implementations can prioritize different objectives, such as maximizing performance, minimizing area and power consumption, or achieving efficiency in area/performance trade-offs. As a result, the implementations are often categorized as High-performance or Lightweight, although the distinction is not always clear-cut, as designs may aim for various trade-offs or other specific directions. In terms of High-performance designs, notable works include those by [4], [5]. On the other hand, the Lightweight approach is exemplified by works such as [8], [10] . Additionally, there are High-Efficiency designs, as seen in [6]. In this paper, we present a low-latency and efficient hardware architecture for version 3.1 of Dilithium. Our contributions are summarized as follows: 1) We propose a combined architecture that supports all three algorithms of Dilithium and allows for customizable security levels. With a focus on minimizing latency, our design achieves the lowest execution time for all algorithms simultaneously, while also striking a balance between execution time and resource utilization. This design can operate independently for each task of Dilithium, such as key generation, signature generation, and verification. Furthermore, it can interface with other hardware and software platforms to accelerate 2 their respective processes, such as generating random values for matrices or vectors in Dilithium or supporting computation through corresponding modules implemented in this design. 2) We also propose several optimization ideas for the polynomial arithmetic module and hashing module, specifically tailored to the characteristics of Dilithium. These two modules represent the largest contributors to the overall execution time in the architecture, and optimizing them significantly improves the performance of the entire system. 3) By utilizing flexible modules for the entire architecture, we propose a rational timing diagram and an approach to leverage sub-modules to achieve resource utilization balance when implementing different algorithms. The remainder of this paper is organized as follows. Section II provides a brief background of Dilithium and number theoretic transform. Section III proposes our design decisions and details our hardware architecture as well as the optimized modules. The implementation results and comparison with the state-of-the-art works are given in Section IV. Finally, Section V summarizes and concludes the paper. II. PRELIMINARIES A. DILITHIUM OVERVIEW Dilithium, along with the Key Encapsulation Mechanism (KEM) Kyber, is a member of the Cryptographic Suite for Algebraic Lattices (CRYSTALS) and is a digital signature scheme based on the "Fiat-Shamir with Aborts" approach [12]. Different from other lattice-based algorithms recommended in round 4: Kyber and Falcon, Dilithium uses uniform sampling rather than discrete Gaussian distribution for secret randomness generation. This approach greatly simplifies polynomial generation and is easily implemented in constant time. The hard problems underlying the security of Dilithium schemes are Module Learning with Errors (MLWE) and Module Shortest Integer Solution (M-SIS) formally proved in [11]. M-LWE problem needed to protect against key-recovery [13] and M-SIS problem for strong unforgeability [14]. The M-LWE problem can be briefly described as follows: Let A ∈ Rqk×l be uniformly chosen s1 ∈ Rql , s2 ∈ Rqk . Then, the standard M-LWE problem is the public key (A, t = As1 + s2 ) is indistinguishable from (A, t) where t is chosen uniformly at random. The M-SIS problem in Dilithium is expressed through: from a uniformly chosen matrix A ∈ Rqk×l , that there exist z and u such that Az+u = 0, where ∥z∥∞ ≤ 2(γ1 − β), ∥u∥∞ ≤ 4γ2 + 2 (β, γ1 , γ2 and other parameters could be chosen corresponding with NIST security levels in Table 1). In addition to three proposed NIST security levels, the security of Dilithium can be adjust by change the parameter inside. The most straightforward way to enhance or reduce security is by modifying the values of (k, l) and then adjusting η, β, and ω accordingly as shown in Table 1. Another approach is to lower the values of γ1 and/or γ2 , which makes forging signatures more challenging For Review Only VOLUME X, 2023 IEEE Access Page 4 of 11 Author et al.: Preparation of Papers for IEEE TRANSACTIONS and JOURNALS 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 TABLE 1. Parameters of Dilithium in version 3.1. Parameters 2 q [modulus] d [dropped bits from t] τ [# of ±1’s in c] γ1 [y coefficient range] γ2 [low-order rounding range] (k, l) [dimension of A] η [secret key range] β [τ · η] ω [max # of 1’s in hint] NIST security level 3 8380417 13 39 217 (q-1)/88 (4,4) 2 78 80 8380417 13 49 219 (q-1)/32 (6,5) 4 196 55 5 8380417 13 60 219 (q-1)/32 (8,7) 2 120 75 since it relies on the underlying SIS problem. Increasing (k, l) significantly strengthens security, while adjusting γi is more suitable for minor security tweaks. Key generation (Keygen), Signature generation (Sign) and Verification (Verify), three core algorithms of Dilithium, are shown in Algorithms 1-3. KeyGen(): Key generation algorithm generates a keypair consisting of a private signing key (sk) and public verification key (pk) used for signature creation and verification. In this algorithm, two uniform seeds: public seed (ρ) and noise seed (ρ′ ) are mapped to generate polynomial matrix A and vector s1 , s2 . It then computes t = As1 +s2 to generate the final keypair. To keep size small, the public key includes the seed (ρ) instead of matrix A and the upper bit of t. The secret key packs the lower d bits of t, seed ρ and two byte-arrays K, tr. Sign(): This algorithm takes message M and secret key to generate the signature attached with the message before sending to verifier. Signature generation begins by generating masking vector y and matrix A and then compute w = Ay. It is based for generating challenge c by hashing the message with the high order bit of w, signature as z = y+cs1 then and also the hints h for the verifier. The potential signature σ consisting of (c, z, h) is checked by adapting the condition of the polynomial max norms are within an acceptable predefined range and the maximum number of 1’s in hint that the signature can support which are defined by choosing parameter depend on the security level. Otherwise, the signature must be rejected, and a new attempt is made. Verify(): The high-order bit of w1 is recovered from message, public key and signature. It is used as a part of hash function to generate challenge c which will be compared to the c̃ provided in the signature to confirm the authentication and integrity of the original message. B. NUMBER THEORETIC TRANSFORM The Number Theoretic Transform (NTT) is a variant of the Fast Fourier Transform (FFT) that operates on a finite field and provides an efficient method for polynomial multiplication. It reduces the complexity of polynomial multiplication from O(n2 ) by using school-book method to O(nlogn) by using point-wise multiplication method. The NTT-based multiplication can be perform as: c=INTT(NTT(a)◦NTT(b)). Where a, b, c could be polynomial matrix or vector ∈ Rq , ◦ denotes point-wise multiplication of polynomial coefficients. VOLUME X, 2023 Algorithm 1 Key Generation KeyGen() [11] Output: pk = (ρ, t1 ), sk = (ρ, K, tr, s1 , s2 , t0 ) 256 1: ζ ← {0, 1} ′ 2: (ρ, ρ , K) ← H(ζ) 3: A ∈ Rqk×l := ExpandA(ρ′ ) 4: (s1 , s2 ) ∈ Sηl × Sηk 5: t:= As1 + s2 6: (t0 , t1 ) := Power2Roundq (t, d) 256 7: tr ∈ {0, 1} := H(ρ ∥ t1 ) 8: return (pk, sk) Algorithm 2 Signature Generation Sign(sk, M ) [11] ∗ Input: sk = (ρ, K, tr, s1 , s2 , t0 ), M ∈ {0, 1} Output: σ = (c̃, z, h) 1: A ∈ Rqk×l :=ExpandA(ρ′ ) 2: µ ← H(tr ∥ M ), ρ′ ← H(K ∥ µ) 3: κ := 0, (z, h) :=⊥ 4: while (z, h) :=⊥ do 5: y ∈ S̃γl 1 := ExpandMask(ρ′ , κ) 6: w := Ay 7: w1 := HighBitsq (w, 2γ2 ) 8: c̃ ← H(µ ∥ w1 ), c := SampleInBall(c̃) 9: z := y + cs1 10: r0 := LowBitsq (w, 2γ2 ) 11: if ∥z∥∞ ≥ γ1 − β or ∥r0 ∥∞ ≥ γ2 − β then 12: (z, h) :=⊥ 13: else 14: h := MakeHintq (−ct P 0 , w − cs2 + ct0 , 2γ2 ) 15: if ∥ct0 ∥∞ ≥ γ2 or hi > ω then 16: (z, h) :=⊥ 17: end if 18: end if 19: end while 20: return σ = (c̃, z, h) Dilithium, as a lattice-based cryptosystems, utilizes the NTT for matrix-vector and vector-vector multiplications. The NTT is performed in the ring Rq = Zq [x]/(xn + 1), where n is a power of 2 and q is a prime number. In Dilithium, the modulus q is chosen such that there exists a 2n-th root of unit (ϕ2n = 1753). There are two common algorithms for implementing NTT, Cooley-Tukey decimal-in-time (DIT) algorithm for NTT and Gentleman-Sande decimal-in-frequency (DIF) algorithm for INTT. By applying the combination of these algorithms accelerate computations and eliminate the need of bit-reversal step [15]. III. ARCHITECTURE AND DESIGN DECISIONS As the design’s goal is efficient low-latency implementation for CRYSTALS-Dilithium on the FPGA SoC platform, the proposed architecture is supported to perform all Dilithium’s main operations at three security levels as defined in its specification. This design supports all processes including packing and sending signature and keypair, which can cooperate with For Review Only 3 Page 5 of 11 IEEE Access Author et al.: Preparation of Papers for IEEE TRANSACTIONS and JOURNALS 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 Algorithm 3 Verification Verify() [11] ∗ Input: pk = (ρ, t1 ), M ∈ {0, 1} , σ = (c̃, z, h) Output: Valid or Invalid 1: A ∈ Rqk×l := ExpandA(ρ′ ) 2: µ ← H( H(ρ ∥ t1 ) ∥ M ) 3: c := SampleInBall(c̃) ′ d 4: w1 := UseHintq (h, Az P − ct1 ·2 , 2γ2 ) 5: if ∥z∥∞ < γ1 − β & hi ≤ ω & c̃ = H(µ ∥ w1′ ) then 6: return Valid 7: end if 8: return Invalid other software or hardware platform easily. This section will introduce the overall high-level architecture and its optimized modules used to improve latency. Accompany with the efficient hardware implementation, the timing schedules in Figs. 2, 3, 4 are proposed to optimize the work of entire architecture and get low-latency aspect which take advantage of the most important and time-consuming sub-modules (arithmetic and hashing module specify for Dilithium). A. HIGH-LEVEL CONFIGURABLE ARCHITECTURE The high-level architecture of our hardware design is shown in Fig. 1. As a configurable architecture implements three main operations of Dilithium, this design depicts all components used. To connect outside, the bus width is 64 bits to suitable with some software. The block inside is working in 92 bits to perform four coefficients at the same time. When executing one of those algorithms individually, some sub-modules can be removed. A brief description of these modules and their works are explained below. Control unit: Working as a finite-state machine (FSM), the control unit is responsible for controlling different modules based on the required functionality and sequencing the entire hardware module to execute all main operations of Dilithium in their corresponding sequences. Polynomial BRAM: This module contains six groups of three dual-port 36Kb BRAMs used to store intermediate value values during the computation. In the Dilithium scheme, the modulus q is 8380417, allowing for coefficients to be represented within 23 bits. The Xilinx FPGA supports dual-port 36Kb BRAM memory on chip [16]. Instead of storing each coefficient in a separate BRAM address, this module is designed as a group of three BRAMs, each configured as a 36x1024 memory. This configuration allows for four coefficients, with a bandwidth requirement of 4x23=92 bits, to be stored in a single address of the group BRAMs, resulting in a 25% reduction in BRAM utilization compared to the straightforward approach. For security level 5, where matrix A contains 56 polynomials, the BRAM configuration is set to have 92bits dual-port data width and depth of 4096. Additionally, five 92x1024 bits BRAMs are used to store intermediate values during the computation process. Polynomial arithmetic module: used to all arithmetic operations in Dilithium, including NTT, INTT, point-wise 4 FIGURE 1. High-level configurable architecture for Dilithium. multiplications, additions and subtractions for polynomial matrices and vectors. The Keccak core: includes the SHAKE 128 module and SHAKE 256 module, which function as hash functions or Extendable-Output Functions (XOF) [17]. These modules generate hash values of messages, random seeds, and pseudorandom data, which are then used for polynomial sampling in the Dilithium scheme. Sampling unit: This module consists of four sub-modules responsible for generating different types of pseudorandom samples used in Dilithium. ExpandMatrix maps a uniform seed to matrix A, UniformEta for generating the secret vector s1 , s2 , ExpandMask to make the masking vector y and SampeInBall to create challenge c. Decomposer: This module breaks up a finite field element r in Zq into their “high-order” bits and “low-order” bits, such that r = r1 · 2γ + r0 where 0 ≤ r0 < 2γ. Hint: used to describes the work of both MakeHint and UseHint functions in Algorithm 2 and 3. Given an arbitrary element r ∈ Zq and a small element z ∈ Zq , "MakeHint" records a 1-bit hint h as the "carry" that enables the computation of the higher-order bits of r + z using only r and h. Conversely, "UseHint" utilizes the hint to recover the highorder bits of the sum. Encoder/decoder: To interface with other hardware or software platforms, which typically have a bandwidth of 2n , this implementation integrates encoder and decoder units. The encoder is responsible for packing polynomial vectors into byte strings. The different vectors, along with their corresponding bit widths, are packed into fixed 8-byte strings. The decoder can then recover the original vectors from these byte strings. Depending on the algorithm being performed, certain processing units can operate in parallel to reduce overall latency, provided there is no data flow conflict between them. From the Algorithm 1-3, the polynomial generation and computation phases in the Dilithium scheme are the most time- For Review Only VOLUME X, 2023 IEEE Access Page 6 of 11 Author et al.: Preparation of Papers for IEEE TRANSACTIONS and JOURNALS 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 FIGURE 2. Timing diagram of Key generation phase. FIGURE 3. Timing diagram of Signature generation phase. FIGURE 4. Timing diagram of Verification phase. consuming processes. To improve performance, this design optimizes the sequencing of these tasks, ensuring they are executed continuously as much as possible. From the algorithms and the proposed architecture, we have developed an optimized timing schedule that maximizes the utilization of sub-modules and maintains constant progress on the critical path of the operations. In our architecture, the controller operates as a finite-state machine, ensuring sequential processing of all hardware modules. To enhance clarity, we present the optimized schedule specifically for level 5 of the three algorithms. Key generation and Verification: The FSM schedule for Key generation and Verification operations are described straightforwardly stage by stage in Figs. 2 and 4 corresponding with Algorithm 1 and 3, respectively. The longest path among them is the computation, hashing and sampling, which occupies around 90% total time execution. The packing/unpacking phases are immediately executed in parallel with the generation matrix, vectors and computations. Pack VOLUME X, 2023 t0 , Pack t1 perform the function Power2Round, the straightforward bit-wise way to break up t into 13-bits high-order t1 and 10-bits low-order t0 , in line 6 of Algorithm 1. The key generation progress is finished after packing their ingredients. In verification, the validity of the signature depends on the result of the comparison described in line 5 of Algorithm 3, which is implemented in stage 7 as depicted in Fig. 4. Signature generation: This is the most complex main operation in Dilithium. It consists of two phases: precomputation and rejection loop, as shown in Fig. 3. During the precomputation phase, the secret key and message are unpacked and the necessary calculations are performed prior to the while loop in Algorithm 2. This phase is executed only once. The rejection loop is responsible for generating the signature until it satisfies the requirements specified in the algorithm. The average number of loops can be found in the Table 1. Inspired by the approach proposed by Beckwith et al. [4], the polynomial arithmetic unit is divided into two independent parts, functioning as described in mode 2 of this module. For Review Only 5 Page 7 of 11 IEEE Access Author et al.: Preparation of Papers for IEEE TRANSACTIONS and JOURNALS 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 The first part is used to create the vector w and executes the functions from line 5 to 7, while the remaining functions within the loop are performed in the second module. B. HASHING AND SAMPLING Dilithium scheme utilizes SHAKE 128 and SHAKE 256 as its hashing functions. Both of these functions are ExtendableOutput Functions (XOFs) from the SHA-3 family, based on the same Keccak permutation with a state size of 1600 bits. However, they have different rates in the sponge construction, namely 1344 and 1088, respectively [17]. More precisely, the ExpandA function uses the output stream of SHAKE 128 to map a uniform seed to matrix A ∈ Rqk×l . SHAKE 256 is used for generating a random seed from an initial seed, polynomial secret vectors s1 and s2 , and masking vector y. It is also used for generating the challenge c through the SampleInBall function and other hashing tasks in the Dilithium scheme. Due to the shared Keccak-f[1600] core with a 24-round permutation, the previous works use unified Keccak module which can perform both of SHAKE128/256. This approach saves area compared to implementing individual units, but it has some drawbacks. Firstly, the sequential generation of the polynomial matrix A and other vectors in the Dilithium algorithms consumes considerable time, while the computation phase can only proceed once the sampling process is complete. Besides, the hardness of the module-LWE in Dilithium primarily depends on the dimensions (k, l) of the vectors and matrices in the scheme, leading to the requirement of a large amount of pseudorandom data, corresponding to longer execution times in this phase and entire main operation. Secondly, as the combined architecture for SHAKE 128/256, the complexity of this module could limit the maximum achievable frequency of entire architecture. To mitigate this and reach the low latency target, we propose using independent modules SHAKE 128 and SHAKE 256. Both of that module has the same 64 bits data-path core, which performs 24-round Keccak permutation. For the requirement of a large pseudorandom data for matrix A, we use double 64 bits data-path cores in SHAKE128 module, while the SHAKE 256 module uses a single core. Our modules are modified from the high-performance Keccak implementation core developed by the Keccak team [18]. This implementation supports for bus widths of 8, 16, 32, 64 or 128 bits, and we choose a 64-bit data path as it is the greatest common divisor of the rates (r = 1344 and 1088) in SHAKE 128/256 and is also suitable for connecting with other software/hardware platforms. The Keccak control unit manages the internal signals to control the module’s operations and facilitate the transition between the absorb and squeeze phases. In the case of sampling matrix A, depict in Fig. 5, SHAKE-128 module needs 5 cycles to absorb 256 bit of seed and 16-bit nonce before transferring to the squeeze phase, where the sampling and hashing processes work continuously. With the condition of coefficient in matrix A is within Zq (where q is the modulus), this module requires 204 cycles to generate two polynomials with 256 6 FIGURE 5. Sampling for matrix A. coefficients (without taking into account the possibility of samples being rejected due to violating the condition). Using double permutation cores speeds up the time for sampling matrix A which up to 56 polynomials in security level 5. The remaining hashing tasks and sampling vectors in Dilithium are implemented in the SHAKE 256 core in a similar manner. This module consists of a single data-path core controlled by the Keccak control unit, specifically designed for SHAKE 256. It interfaces with the corresponding sampling units to generate the required vectors based on their rejection conditions and received data widths, as outlined in Table 1. C. FLEXIBLE POLYNOMIAL ARITHMETIC MODULE There are two popular variants of the NTT architecture: memory-based and pipelined architectures. The pipelined architecture requires a large number of shift registers to store delay units, which depend on the chosen scheme [19]. On the other hand, the memory-based architecture utilizes butterfly units to perform calculations layer-by-layer according to a diagram [4], [6], [7], [8]. This approach can directly connect with memory, eliminating the need for delay units. However, each butterfly unit needs to read and write back two coefficients per cycle, which limits the number of butterfly units that can be used to accelerate the speed of NTT due to the memory port limitations. Moreover, the memory access order for both intermediate values and twiddle factors is complex, requiring a complicated control module as seen in the 2x2 NTT module [4] and the dual butterfly unit [7], [10]. In terms of latency, the pipelined architecture only requires a long delay time from the beginning of input to output while the memory-based architecture requires a certain stall delay length between changing stages when performing NTT/INTT. For long serial NTT/INTT operations like Dilithium, the pipelined architecture exhibits better latency than the memory-based architecture. Another important consideration is the extensive computational requirements of the signature generation process in Dilithium compared to the other two processes, particularly during the rejection loop phase. To improve performance in For Review Only VOLUME X, 2023 IEEE Access Page 8 of 11 Author et al.: Preparation of Papers for IEEE TRANSACTIONS and JOURNALS 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 FIGURE 6. Flexible polynomial arithmetic module in mode 1. FIGURE 7. Flexible polynomial arithmetic module in mode 2. this phase, we propose an idea inspired by [4] to distribute the calculations across two polynomial arithmetic (PolyArith) modules. The former module calculates functions from line 5 to 7 of Algorithm 2, while the latter module handles the remaining calculations. Based on the above data analysis, our arithmetic module utilizes the radix-2 MDC pipelined architecture [19] with 8 butterfly units to handle all the polynomial arithmetic operations in Dilithium. This module offers two modes that can be flexibly adapted to suit the characteristics of Dilithium, both of them are configurable to support NTT/INTT operations with the data flow pass through the black and faint lines, respectively, in Figs. 6, 7. 1. Fully pipelined mode: In this mode, we use eight butterfly units corresponding to the eight layers of Dilithium for N=256. The stall delay length from input to output is 153 clock cycles, after that each polynomial requires 128 cycles for NTT/INTT. The butterfly units can be reused for all polynomial arithmetic, processing 8 coefficients in parallel. This mode is employed in the Keygen and Verify operations. 2. Folding transform mode: In this mode, we divide our module into two independent modules, each equipped with four butterfly units. By applying the folding transformation method [20], each butterfly unit is responsible for calculating two adjacent NTT layers in a time-sliced fashion. This enables us to employ four butterfly units to implement radix2 MDC in Dilithium, similar to the approach in [5]. In this mode, during NTT/INTT, the butterfly units handle the odd stages in odd cycles and the even stages in even cycles. This mode has a delay of 281 cycles from input to output and requires 256 cycles to transfer a single polynomial during NTT/INTT. It also supports the processing of 4 coefficients per cycle during polynomial arithmetic. This mode is used for Sign operation, one module to compute vector y and w, the other for remaining computations as shown in Fig. 3. This module utilizes 6 registers to store twiddle factors connected with processing elements (PE) 1 and PE4. Additionally, we use two 36Kb BRAMs configurations in a 72x512 mode to store twiddle factors connected with the remaining 6 PEs. Shift registers are used to implement a FIFO unit for saving intermediate values during NTT/INTT, consuming (64+32+...+2+1) x 4 x 23 bits of FFs resources. VOLUME X, 2023 1) Butterfly unit. The Butterfly unit (BU) plays a crucial role in the arithmetic module for performing various operations. In the proposed design, a configurable BU architecture is presented, supporting both the CT BU and GS BU methods, as depicted in Fig. 8. The proposed butterfly unit consists of one modular multiplier, one modular adder, and one modular subtractor, without requiring any additional modular multiplier, adder, or subtractor compared to a dedicated CT or GS butterfly unit. As shown in Fig. 8 the proposed butterfly unit takes a, b and ω as input, the NTT/INTT control signal in the blue line acts as a selection signal for the multiplexers, configuring the BU to operate as either a GS or CT butterfly unit. the BU architecture enables it to handle point-wise multiplication, addition, and subtraction through the corresponding internal unit. A and B denotes the output ports, when performing point-wise multiplication or subtraction, the result is taken from port B, while port A is utilized for addition using the CT butterfly configuration. In [21] Zhang et al. proposed a technique to eliminate the multiplication of resulting coefficients with n−1 (mod q) after the INTT operation. The output can be seen as (A, B) := (2-1 (a + b), (a - b)ω). For an odd prime q, x/2 (mod q) can be performed as shown in Eqn. 1. In this work, we adopt this technique by insert divide 2 operation integrated in addition unit, and pre-processed the twiddle factor for the inverse NTT to incorporate the factor 2-1 . x/2(modq) = (x ≫ 1) + x[0].(q + 1)/2 (1) 2) Modular multiplication. Modular multiplication, addition and subtraction are the fundamental arithmetic components inside the butterfly unit. The reference implementation on software platforms of Dilithium [11] uses Montgomery reduction after multiplication and Barrett reduction for addition and subtraction. Both of these For Review Only 7 Page 9 of 11 IEEE Access Author et al.: Preparation of Papers for IEEE TRANSACTIONS and JOURNALS 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 TABLE 2. Resource utilization of sub-modules in the architecture. Sub-modules LUT FIGURE 8. Proposed butterfly unit. Resource Utilization FF DSP RAM Polynomial RAM Hint Encoder Decoder Decomposer PolyArith SHAKE 128 SHAKE 256 Sampling Other 0 10476 2183 2683 1565 5819 10278 5126 8009 5682 0 3776 464 251 728 3527 6641 3941 2921 1844 0 0 0 0 0 16 0 0 0 0 25.5 0 0 0 0 2 0 0 0 0 Combined Architecture 51821 24093 16 27.5 IV. IMPLEMENTATION RESULTS AND COMPARISON A. RESOURCES UTILIZATION AND PERFORMANCE RESULTS FIGURE 9. Architecture for modular reduction unit. can apply on hardware and has been adopted in some related works but not seem to be a best choice. While Montgomery needs to transfer from normal domain to Montgomery domain leading to additional multiplications, Barret reduction does not exploit the specification of modulus q in Dilithium. In [8], Land proposes the reduction method by recursively exploit the relation 223 ≡ 213 − 1. s[45 : 0] ≡ 223 s[45 : 23] + s[22 : 0] ≡ 213 s[45 : 23] − s[45 : 23] + s[22 : 0] ≡ 223 s[45 : 33] + 213 s[32 : 23] − s[45 : 23] + z ≡ 213 (s[45 : 43] + s[42 : 33] + s[32 : 23]) − (s[45 : 43] + s[45 : 33] + s[45 : 23]) + z (2) As we know, the subtraction is not friendly with hardware like addition, we use additive inverse with s̄ is negative s, so we transfer the equation above to: s[45 : 0] ≡ s̄[45 : 23] + (213 s[32 : 23] + s̄[45 : 33]) + (213 (s[42 : 33] + s̄[45 : 43]) + 213 s[45 : 43] + z + 50331648 ≡ s̄[45 : 23] + {s[42 : 33], 10′ b0, s̄[45 : 43]} + {s[32 : 23], s̄[45 : 33]} + 213 s[45 : 43] + z + m (3) The result of the reduction at this point is observed within the interval (-q, 3q), we can simply obtain the hardware architecture for modular multiplication in Fig. 9 where m = 50331648 and {} denotes concatenated function. 8 Table 2 provides an overview of the resource consumption for the entire design and its sub-modules. The design as a whole utilizes 51821 LUTs, 24093 FFs, 27.5 BRAMs, and 16 DSPs, working in the maximum frequency 280 MHz. To achieve the low latency target, we selected the Virtex Ultrascale+ platform, and all the resource data was generated using Xilinx Vivado 2022.2. Due to our data-path employing 4 parallel paths to execute 4 coefficients per cycle, the resource usage is higher compared to other lightweight target works. The two Keccak units account for approximately 30% and 44% of the LUT usage, making them the most resource-intensive components. The design utilizes a total of 27.5 BRAMs, with 10.5 BRAMs allocated for storing matrix A and 15 BRAMs used for intermediate values during computation. The polynomial arithmetic module requires 2 BRAMs to store twiddle factors. As a combined architecture, some units are only utilized in specific operations and are not called when performing other tasks. For example, the Hint module is not involved in the Keygen operation. However, the Keccak core, ExpandA, PolyArith, encoder/decoder, and polynomial RAM can be fully shared among different operations, consuming 54% LUTs, 65% FFs, and 100% DSPs. In term of latency, we have compiled statistics on the overall execution time, including the data input, unpack and pack signature and keypair processes. From the average statistics of 1000 executions, we obtained the following specific parameters. The Keygen operation requires an average of 3830, 5925, and 9930 clock cycles (CCs) for the three security levels. The interval between each sample depends on the rejection rate of matrix A and secret vectors s1 but remains relatively small due to the high acceptance rate. The Verify operation consumes 5161, 7382, and 10110 CCs for the three NIST security levels. In this operation, only valid samples are calculated, because the invalid sample is processed much faster and does not show total time execution of this operation. The interval time also depends on the For Review Only VOLUME X, 2023 IEEE Access Page 10 of 11 Author et al.: Preparation of Papers for IEEE TRANSACTIONS and JOURNALS 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 TABLE 3. Comparison of FPGA-based designs for Dilithium signature scheme. Sec. Design Device lvl Aikata [10]a II III V Freq. Area Keygen Sign Verify (MHz) LUT FF DSP BRAM cycles µs cycles µs cycles µs 23277 9798 4 24 14594 54 31662 117 15423 57 ZUs+ 270 Land [8]b Artix-7 163 27433 10681 45 15 18761 115 76613 470 19687 121 Wang [6]b Zynq-7000 159 18558 7342 10 17 7757 49 52038 327 7675 48 Zhao [5]c Artix-7 96.9 29998 10366 10 11 4172 43 8361/31600 326.1 4422 45.6 Beckwith [4]c VUs+ 256 53907 28435 16 29 4875 19 10945/29876 43/117 6582 26 This workc VUs+ 280 51821 24093 16 27.5 3830 13.7 10800/29215 38.6/104.3 5161 18.4 Aikata [10]a ZUs+ 270 23277 9798 4 24 23619 87 48446 171 26124 97 Land [8]b Artix-7 145 30900 11372 45 21 33102 229 123218 852 32050 222 Wang [6]b Zynq-7000 159 19614 8466 10 21 12982 82 89213 561 11232 71 Zhao [5]c Artix-7 96.9 29998 10366 10 11 5851 60.4 11399/49496 510.8 6181 63.8 Beckwith [4]c VUs+ 256 53907 28435 16 29 8291 32 16178/49437 63/193 9724 39 This workc VUs+ 280 51821 24093 16 27.5 5925 21.2 15130/46545 54/166.2 7382 26.4 Aikata [10]a ZUs+ 270 23277 9798 4 24 39737 147 70179 260 46671 173 Naina [7]b ZUs+ 391 13975 6845 4 35 63.2k 162 113.9k 291 67.9k 174 Land [8]b Artix-7 140 44653 13814 45 31 50892 364 145912 1042 52712 377 Wang [6]b Zynq-7000 159 20973 9677 10 28 20185 127 93708 589 15875 10 Zhao [5]c Artix-7 96.9 29998 10366 10 11 8765 90.5 15790/55321 570.9 9039 93.3 Beckwith [4]c VUs+ 256 53907 28435 16 29 14037 55 24358/55070 95/215 14642 57 This workc VUs+ 280 51821 24093 16 27.5 9930 35.5 22898/52547 81.8/187.7 10110 36.1 a : Multiple schemes, b : Supports dependent level, c : Supports all levels. rejection condition in generating matrix A. The number of cycles required for the Sign operation varies widely due to the rejection loop in signature generation. The best-case scenario occurs when the signature is accepted after the first iteration, resulting in a time of 10800, 15130, and 22898 CCs for the three security levels. The processing time of input message is determined by the size of the message and operates at a rate of 8 bytes per cycle. Based on the parameters in the Dilithium specification, an average of 4.25 attempts is required for level 2, 5.1 attempts for level 3, and 3.85 attempts for level 5. The average time to execute one attempt are 5665, 7658, and 10340 CCs, leading to average times of signature generation are 29215, 46545, and 52547 CCs for three security levels. B. COMPARISON WITH RELATED WORKS Among the fully hardware implementations for Round-3 Dilithium based on FPGA platforms, several state-of-the-art works with the best results are listed in Table 3. As mentioned above, these implementations vary in their approaches, such as Lightweight, High-performance, or others. To compare the effectiveness of these implementations, we employ the area-time trade-off (ATP) metric. The ATP values are calculated by multiplying the latency with the number of utilized LUTs, FFs, and DSPs, denoted as ATP_LUT, ATP_FF, and ATP_DSP, respectively. Additionally, the utilization of BRAM is also an important factor to consider. In this metric, VOLUME X, 2023 our work serves as the "center point" for comparison, with a value set to 1. Lower values indicate better performance, as both area and time consumption should be minimized. In Table 3, our work and the implementation in [4], [5] represents fully hardware designs for all three security levels of Dilithium. In [6], [8], the authors propose independent designs for each security level. Additionally, with the same lattice-based structure, Aikata et al. [10] propose a combined architecture for CRYSTALS-Kyber and Dilithium scheme. To facilitate the evaluation of design effectiveness, Figs. 1012 present a comparison using the ATP metric for corresponding parameters of level 5, which is the highest complexity supported by Dilithium. We list the four best stateof-the-art designs published to form a comparison table of ATP_LUT, ATP_FF, and ATP_DSP parameters. The work of Land [8] is one of the earliest designs for Round 3 Dilithium. This design has several non-optimized aspects, such as the use of separate NTT/INTT blocks and multiplication blocks, which could be combined to achieve efficiency without increasing algorithm execution time. Thus, from the data in Table 3, it is evident that our design achieves significantly better performance than this design. In [7], Naina proposed a lightweight implementation for level 5 Dilithium on the Zynq Ultrascale+ hardware platform. With a focus on minimizing hardware utilization, this design achieves the best results in terms of LUTs and FFs. For Review Only 9 Page 11 of 11 IEEE Access Author et al.: Preparation of Papers for IEEE TRANSACTIONS and JOURNALS 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 FIGURE 10. Comparison of Area×Time (LUTs × s). FIGURE 11. Comparison of Area×Time (FFs × s). However, regarding execution time, this design only provides data for the best-case scenario of the Sign operation without specific information about the average execution time of this algorithm. Despite the differences in design approach, as our design supports various security levels, our implementation outperforms this design by 21% in terms of BRAM usage and demonstrates better efficiency in almost statistics when evaluated using the ATP metric, as shown in Fig. 10-12. Wang et al. [6] utilized three independent cores for the corresponding security levels of Dilithium. This work is software/hardware codesign, where bit packing and unpacking are not included in the hardware part. Therefore, when comparing with the ATP metric for level 5, a direct comparison may not be totally fair as our design, which supports 3 levels, is more complex and resource-consuming. However, our implementation shows a slight improvement in terms of memory utilization and demonstrates an enhancement of up to 1.1∼2.2× in ATP parameters, as depicted in Fig 10 to 12. The work of [5] presents a compact design for Dilithium on the low-end Artix-7 platform. Although it also supports all three levels, it lacks the flexibility of our design. Unlike our design, its results are stored only in its memory, and it does not support the packing and sending of results to connect with external hardware/software platforms. Additionally, their average execution time statistics do not include the data input receiving and sending processes. The execution time for Verify operations is calculated only in cases where the public key remains unchanged. In similar cases, our work only requires 4377, 5846, and 8062 CCs for the three levels. The results depicted in Fig. 10-12 also show our implementation is better when evaluated by the ATP metric. Beckwith et al. [4] proposed a design that, similar to ours, provides support for all levels of Dilithium. However, they use the Minerva tool [22] to determine the maximum frequency, which represents a different benchmark compared to other designs. As shown in Table 3, our work outperforms theirs in terms of execution time and resource utilization. In summary, our implementation exhibits the lowest latency compared to the other designs. Furthermore, when compared to the combined architecture supporting all security levels of Dilithium in [4] and [5], our design demonstrates efficiency improvements ranging from 1.1∼1.9× 10 FIGURE 12. Comparison of Area×Time (DSPs × s). evaluating by ATP metric. In comparison to the software implementation, our hardware implementation in this work demonstrates significant performance improvements. Seo et al. [23] reported an optimized NEON implementation on an 8-core ARM v8.2 64-bit CPU mounted on Jetson AGX Xavier. For Dilithium level 5, their implementation takes 542µs for Keygen, 625µs for Verify, and 1001µs for the best-case scenario of Sign. In the same tasks, our hardware implementation outperforms their software implementation, achieving a speed-up of 15.3×, 17.3×, and 12.2×, respectively. Additionally, Becker et al. [24] presented a software implementation on high-end CPUs, specifically the ARMv8 Apple M1 Firestorm core at 3.2GHz. Their reported execution times for the three main operations of Dilithium-III are 48µs, 114µs, and 33µs. In terms of time execution in microseconds, our hardware implementation achieves improvements of 2.3×, 0.7×, and 1.25× for Keygen, Sign, and Verify, respectively. V. CONCLUSION In this paper, we present an efficient and low-latency hardware implementation of round 3 CRYSTALS-Dilithium, which supports all main operations at three security levels of NIST. Our implementation enables the complete execution of all Dilithium tasks independently or in interaction with other software/hardware platforms to accelerate specific processes. When compared with the state-of-the-art hardware architecture for all three security levels of Dilithium, For Review Only VOLUME X, 2023 IEEE Access Page 12 of 11 Author et al.: Preparation of Papers for IEEE TRANSACTIONS and JOURNALS 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 we achieve the lowest latency and efficiency in terms of area/performance trade-offs, with a significant improvement of 1.1 to 1.9× as evaluated by the ATP metric. Additionally, this work introduces optimized arithmetic operations and Hashing & Sampling modules tailored to the characteristics of Dilithium, enhancing the overall efficiency of the scheme. REFERENCES [1] P. W. Shor, “Polynomial-time algorithms for prime factorization and discrete logarithms on a quantum computer,” SIAM Journal on Computing, vol. 26, no. 5, pp. 1484–1509, Oct. 1997. [2] NIST, “Status report on the third round of the NIST-PQC standardization Process,”. Available: https://csrc.nist.gov/projects/post-quantumcryptography. [3] NIST, “FIPS 186-4 - Digital Signature Standard (DSS),” url: https://nvlpubs.nist.gov/nistpubs/fips/nist.fips.186-4.pdf. [4] L. Beckwith, D. T. Nguyen, and K. Gaj, “High-performance hardware implementation of CRYSTALS-Dilithium,” Crypto. ePrint Arch., Report 2021/1451, 2021. [5] C. Zhao, N. Zhang, H. Wang, B. Yang, W. Zhu, Z. Li, M. Zhu, S. Yin, S. Wei, and L. Liu, “A compact and high-performance hardware architecture for CRYSTALS-Dilithium,” IACR Trans. Cryptogr. Hardw. Embed.Syst., vol. 2022, no. 1, pp. 270–295, 2022. [6] T. Wang, C. Zhang, P. Cao and D. Gu, “Efficient implementation of Dilithium signature scheme on FPGA SoC Platform,” IEEE Transactions on Very Large Integration (VLSI) systems, vol: 30, Sep. 2022. [7] N. Gupta, A. Jati, A. Chattopadhyay, and G. Jha, “Lightweight Hardware Accelerator for Post-Quantum Digital Signature CRYSTALS-Dilithium,” Cryptology ePrint Archive, Paper 2022/496, 28 April 2022. [8] G. Land, P. Sasdrich, and T. Guneysu, “Hard Crystal-Implementing Dilithium on Reconfigurable Hardware,” Cryptology ePrint Archive 2021/355, Mar. 2021. [9] S. Ricci, L. Malina, P. Jedlicka, D. Smekal, 1. Hajny, P. Cibik, and P. Dobias, “Implementing CRYSTALSDilithium Signature Scheme on FPGAs,” Cryptology ePrint Archive 2021/108, Jan. 2021. [10] A. Aikata, C. Mert, M. Imran, S. Pagliarini and S. Roy, “KaLi: A crystal for post-quantum security using Kyber and Dilithium,” IEEE Transactions on Circuits and Systems I, vol: 70, Feb. 2023. [11] S. Bai, L. Ducas, E. Kiltz, T. Lepoint, V. Lyubashevsky, P. Schwabe, G. Seiler, and D. Stehl´e, “CRYSTALS-Dilithium,” Proposal to NIST PQC Standardization, Round3, 2021, https://csrc.nist.gov/Projects/postquantum-cryptography/round-3-submissions. [12] Vadim Lyubashevsky, “Fiat-shamir with aborts: Applications to lattice and factoring-based signatures,” in International Conference on the Theory and Application of Cryptology and Information Security, Springer Berlin Heidelberg, 2009, pp. 598–616. [13] Y. Wang and M.Wang, “Module-LWE versus Ring-LWE, Revisited,” Cryptology ePrint Archive 2019/930, Aug. 2019. [14] E. Kiltz, V. Lyubashevsky, and C. Schaffner,“A concrete treatment of Fiat–Shamir signatures in the quantum random-oracle model,” in Proc.37th Annu. Int. Conf. Theory Appl. Cryptograph. Techn., pp. 552–586, Tel Aviv, Israel, Apr. 2018. [15] P. Longa and M. Naehrig, “Speeding up the number theoretic transform for faster ideal lattice-based cryptography,” in Proc. 15th Int. Conf. Cryptol. Netw. Secur., pp. 124–139, Milan, Italy, Nov. 2016. [16] “UltraScale Architecture Memory Resources Advance Specification User Guide,” UG573 (v1.1) August 14, 2014. [17] National Institute of Standards and Technology, in FIPS PUB 202: SHA3 Standard: Permutation-Based Hash and Extendable-Output Functions. Aug. 2015. [18] G. Bertoni, J. Daemen, S. Hoffert, M. Peeters, and G. V. Assche.(2020)., Keccak in VHDL. [Online]. Available:https://keccak.team/hardware.html [19] S. He and M. Torkelson, “A new approach to pipeline FFT processor,” in Proceedings of International Conference on Parallel Processing, pp. 766770, April 1996. [20] K. K. Parhi, C.-Y. Wang, and A. P. Brown, “Synthesis of control circuits in folded pipelined DSP architectures,” IEEE Journal of Solid-State Circuits, vol. 27, no. 1, pp. 29– 43, Jan. 1992. [21] N. Zhang, B. Yang, C. Chen, S. Yin, S. Wei and L.Liu, “Highly Efficient Architecture of NewHope-NIST on FPGA using Low-Complexity NTT/INTT,” IACR Trans. on CHES, vol. 2020, no. 2, pp. 49–72, 2020. VOLUME X, 2023 [22] F. Farahmand, A. Ferozpuri, W. Diehl, and K. Gaj, “Minerva: Automated hardware optimization tool,” in 2017 International Conference on ReConFigurable Computing and FPGAs, ReConFig 2017, pp. 1-8, Cancun: IEEE, Dec. 2017. [23] S. C. Seo, S. W. An, “Parallel implementation of CRYSTALS-Dilithium for effective signing and verification in autonomous driving environment,”, ICT Express, vol. 9, issue 1, pp. 100–105, Feb. 2023. [24] H. Becker, V. Hwang, M. J. Kannwischer, B.-Y. Yang, and S.-Y. Yang, “Neon NTT: Faster Dilithium, Kyber, and Saber on Cortex-A72 and Apple M1,” Cryptology ePrint Archive 2021/986, Jul. 2021. QUANG TRUONG-DANG (Graduate Student Member, IEEE) received his B.S. degree in Electrical and Electronic Engineering from Le Quy Don University of Technology, Vietnam, in 2019. He is currently pursuing his M.S. in Electrical and Computer Engineering from Inha University. His research interests include algorithm and architecture design for post-quantum cryptography and homomorphic encryption. PHAP DUONG-NGOC (Member, IEEE) received the M.S. degree in electronic engineering from the University of Danang, Vietnam, in 2015, and the Ph.D. degree in Electrical and Computer Engineering from Inha University, South Korea, in 2022. He is currently a Post-Doctoral Researcher with the Artificial Intelligence System-on-Chip Research Center, Inha University. His research interests include algorithm and architecture for digital signal processing, hardware acceleration, post-quantum cryptography, and homomorphic encryption. HANHO LEE (Senior Member, IEEE) received Ph.D. and M.Sc. degrees in Electrical and Computer Engineering, from the University of Minnesota, Minneapolis in 2000 and 1996, respectively. He was a member of the technical staff at Lucent Technologies (Bell Labs Innovations), Allentown from April 2000 to August 2002, and Assistant Professor in the Department of Electrical and Computer Engineering, University of Connecticut, USA from August 2002 to August 2004. He has been with the Department of Information and Communication Engineering, Inha University since August 2004, where he is currently a Professor. He was visiting scholar at Bell Labs, Alcatel-Lucent, Murray Hill, New Jersey, USA from August 2010 to August 2011. His research interests include algorithm and architecture design for cryptographic, forward error correction coding, artificial intelligence and digital signal processing. For Review Only 11