Received: 13 August 2020 | Revised: 19 October 2020 | Accepted: 23 November 2020
DOI: 10.1002/cta.2921

ORIGINAL PAPER

Reconfigurable FIFO memory circuit for synchronous and asynchronous communication

Saleh Abdel-hafeez1,2 | Ann Gordon-Ross3

1 Department of Computer Engineering, Jordan University of Science and Technology, Irbid, Jordan
2 Sabbatical at Department of Computer Engineering, College of Computer, Qassim University, Qassim, Buraydah, Saudi Arabia
3 Department of Electrical and Computer Engineering, University of Florida (UF), Gainesville, Florida, USA

Correspondence
Saleh Abdel-hafeez, Department of Computer Engineering, Jordan University of Science and Technology, Irbid 22110, Jordan and Sabbatical at Department of Computer Engineering, College of Computer, Qassim University, Qassim, Buraydah, Saudi Arabia.
Email: sabdel@just.edu.jo

Abstract
We present a new FIFO (first-in first-out) architecture for both synchronous and asynchronous communication for high-speed and low-power operation. Our FIFO design is reconfigurable and scalable using a separate datapath with an 8T-Cell SRAM and control circuits, which enables specialization for different application requirements. The datapath uses a two-phase clock system of nonoverlapping signals such that one signal increments the address pointer, while the other signal activates the memory decoder for data reading and writing. This structure halves the critical path delay and simplifies the timing operations between the memory decoder and address pointer while maintaining robustness against process-voltage-temperature (PVT) variations. Our design uses two alternative control circuits to manage separate synchronous and asynchronous operations by generating nonoverlapping control signals that drive the datapath circuit. The empty-full flag circuitry records only the state of the address pointers' rollover, independent of the memory size, and thus improves scalability and reconfigurability.
Compared to prior works, our design is 5X faster with 2.3X lower power consumption and has a throughput of 1 Giga-Word/s for a 64-bit word size with a free latency cycle. Additionally, our design functions clocklessly with a synthesizable structure for asynchronous communication, which suits Internet of Things (IoT) and Networks on Chip (NoCs) applications.

KEYWORDS
8T-Cell SRAM, asynchronous and synchronous controlled datapath, FIFO, high-speed and low-power, IoT and NoCs, two-phased clock

Int J Circ Theor Appl. 2021;49:938–952. wileyonlinelibrary.com/journal/cta © 2021 John Wiley & Sons, Ltd.

1 | INTRODUCTION

Most modern integrated circuits (ICs) have first-in first-out (FIFO) buffers that orchestrate handshaking of information between two different communicating components. FIFOs can operate either asynchronously (independent of any clock signal) or synchronously with a clock signal. Since both implementations have advantages and disadvantages, the greatest design consideration is to determine the most appropriate implementation based on system and/or application requirements.

Typically, synchronous FIFOs are used in microprocessors to manage handshaking communications between on-chip components, such as queue management,1 reorder buffers,2 arithmetic logic units (ALUs),3 etc., and more recently, in Artificial Intelligence (AI),4,5 graphics cards,6,7 etc. These synchronous implementations leverage the existing clock generator resources within the chip (e.g., delay-locked loop (DLL) and phase-locked loop (PLL)) and the same on-chip component and fabrication technology. Consequently, synchronous FIFOs typically operate at very high speeds on the order of 0.5 GHz. Alternatively, asynchronous FIFOs are typically used in systems that require very low power consumption, where the frequency is usually in the MHz range. These FIFOs usually orchestrate communication between components that may be using different fabrication technologies.
Many asynchronous FIFOs play a major role in the Internet of Things (IoT)8 and communication devices such as routers, Networks on Chip (NoCs)9 globally asynchronous locally synchronous (GALS) networks, etc.10,11 In order to flip a FIFO's operation between synchronous and asynchronous, an entirely new circuit design is required, which leaves little design flexibility. In this work, we extend prior work on FIFO design1,8 and present a flexible, reconfigurable FIFO that can be easily flipped between synchronous and asynchronous. The datapath remains the same for both synchronous and asynchronous operation using an arbitrarily sized SRAM-based memory with two separate read and write ports, while the control unit activates different signals based on the operation configuration (i.e., synchronous or asynchronous). Minimal circuitry changes are required when changing the memory size. The SRAM uses 8T-Cell technology that is commonly known for high-speed, low-power operation with adaptability to low power supply voltage for continued technology scaling.12,13 The read and write circuitry uses separate address-decoder and address-pointer signals with two mutually exclusive activation signals that make the memory operation timing simple and reliable with cost-effective reconfigurability, a factor that is needed in continued chip technology scaling.14 Our design obviates the need for any special delay circuitry in the control unit that would require a large design layout cost and does not require brute-force D-type flip-flops (DFFs) to capture unsynchronized data with respect to the clock signal, as is required in other designs,15,16 which usually has a negative impact on throughput and performance. A novel feature of our proposed design is the method for generating the empty-full flag signal using circuitry that checks for an address roll-over instead of tracking every storage address or data register in the FIFO. 
The empty-full flag requires only four DFFs, regardless of the memory size, thus limiting the cost and hardware overheads associated with larger FIFOs. This structure eliminates any arithmetic operation circuitry, such as a subtractor and gray code mapping,17 a long right-left shift register,18 or a logic-expensive up/down counter,19 and obviates the chain of DFFs required in other designs.20 Two control units are used for synchronous and asynchronous operation. The synchronous control circuit uses the main system clock (CLK) to generate two phase-clock signals (CLKN and CLKP) that are the exact complement of each other and whose edges are isolated with an idle-margin. CLKN and CLKP are generated from CLK using a simple feedback NAND circuit21 or a more advanced technique22–24 that can be susceptible to process-voltage-temperature (PVT) variations. However, the design goal is not to eliminate the PVT variations, but rather to preserve the formation of the mutually exclusive signals with an idle-margin for any of the corners' technologies (e.g., Fast-Fast, Slow-Slow, Typical-Typical, etc.). Asynchronous FIFO operation is based on a handshaking protocol between the sender (active) and receiver (passive) and is synchronized by request and acknowledge handshaking signals.25,26 The control design uses cascaded DFFs as the delay element for generating the mutually exclusive signals with an idle-margin, wherein the cascaded DFFs depend on the active signals' edges, a property that maintains the two generated signals within a certain pulse range and preserves the idle-margin between request and acknowledge. The request and acknowledge signals have a slightly longer cycle time due to the delay required for the signal to propagate across the channel and reach the other communicating component. Due to these design aspects, the asynchronous control circuitry trades off low power consumption for low-speed operation. 
In summary, previous designs suffer from several drawbacks: high power consumption, low-speed operation, multicycle latency computation, custom structures that are unsuitable for continued technology scaling, long time to market due to separate synchronous and asynchronous handshake communication, and irregular VLSI configurability that cannot serve a wide range of applications. To alleviate some of these drawbacks, in this paper we leverage low-cost standard CMOS cells at 65 nm and a 1 V power supply to architect our proposed FIFO circuit design with the following key features:

1- Our design is appropriate for a wide range of synchronous and asynchronous communication handshaking applications using a divided control unit with asynchronous and synchronous circuits. The datapath unit remains the same for both asynchronous and synchronous control operations.

2- Our design is robust against PVT variations since the memory array and address counter are activated from two opposite phases with a nonoverlapping margin between the control signals, which are generated from the control circuits, similar to state-of-the-art ARM architectures.

3- Our design has a high degree of configurability, which shortens the time-to-market window due to the novelty of the empty-full flag detector circuit that is independent of the depth of the memory array. In addition, our design components are constructed using a standard cell library without the need for a specialized ASIC design. Thus, our design is compatible with electronic design automation (EDA) tools.
4- Our design has a free latency cycle and enjoys high-speed operation with a throughput of 1 Giga-Word/s with a 64-bit word size for the synchronous handshake communication. In addition, our design can be leveraged in a wide range of asynchronous low-speed handshaking communications that are prevalent in many IoT and NoC applications. Additionally, our design provides reliability in cases where the handshaking is disconnected during the transfer operation.

5- Our design provides efficient low power consumption due to several factors: low power supply voltage, a memory cell (8T-Cell) that obviates sensing amplifiers and has separate read/write ports, the design activations spread over two phases of nonoverlapping signals, and simple CMOS logic gates with minimum geometry sizes and maximum fan-in/fan-out of four.

6- Our design affords continued technology scaling, which is considered a high degree of design geometry scalability, since the FIFO circuit is built from a low-cost standard CMOS cell library. Additionally, the memory array is based on the Intel 8T-Cell, which is attractive for more advanced technologies.

The remainder of this paper is organized as follows. Section 2 describes the datapath circuitry along with the activation signals, while Section 3 derives the critical path based on a one-unit gate delay for the read and write operations. This critical path determines the minimum assertion time of the nonoverlapping signals to ensure proper read or write operation. Section 4 presents the synchronous control circuit, while Section 5 presents the asynchronous control circuit.

2 | DATAPATH CIRCUIT DESIGN

Figure 1 depicts the architectural hardware block structure for our FIFO's datapath, which is the same for both synchronous and asynchronous operation. Table 1 depicts the input/output signals' abbreviations and descriptions.
The datapath memory is an SRAM 8T-Cell array with two separate read and write ports to provide independent operations.12,13 Our design's read and write operations use separate address-pointers and address-decoders that operate as mutually exclusive timing events. This mutual exclusion ensures that the address-pointer increments the address while the address-decoder is disabled, and similarly, the address-decoder activates the memory's row while the address-pointer is disabled; thus, there is no overlapping activity.

FIGURE 1 Architectural diagram for our proposed first-in first-out (FIFO)'s asynchronous and synchronous datapath

TABLE 1 Datapath signal definitions

| Input-output signals | Representations |
|---|---|
| WAP | Write address pointer |
| WAD | Write address decoder |
| RAP | Read address pointer |
| RAD | Read address decoder |
| EMPTY | Flag for the empty buffer |
| FULL | Flag for the full buffer |
| DIN [63:0] | Input data bus of size 64 bits |
| DOUT [63:0] | Output data bus of size 64 bits |

Therefore, the datapath operates with a two-phase clocking system similar to the ARM architecture,14 where one phase is responsible for activating the address-pointer, while the other phase is responsible for activating the address-decoder. This structure also alleviates the race conditions between the address-pointer and address-decoder circuits to which conventional memories can be sensitive. Those conventional circuits include a delay circuit to ensure that there is a large enough time margin to stabilize the address-decoder value, so that the correct memory word line is activated and memory glitches are avoided.
However, due to PVT variations, this race condition in the delay circuitry continues to worsen, making correct operation more difficult to ensure.

To provide high-speed operation for high-performance applications, our design's address-pointers are constructed using a simple look-ahead parallel-state up-counter, which operates at GHz speed with low power consumption.27 The counter is triggered with opposite (nonoverlapping) address-pointer signals (the write address-pointer [WAP] and read address-pointer [RAP]) and address-decoder signals (the write address-decoder [WAD] and read address-decoder [RAD]). The address-pointers are up-counters that roll over to the counter's minimum value when the counter reaches its saturation point (i.e., the maximum value defined by the memory's logarithmic depth). Using only an up-counter simplifies the address-pointer's circuit as compared to prior work that used up-down counters, which imposed a large circuit area and slow performance with higher power consumption. Our address-decoder uses simple NAND logic gates with a prefix-tree structure where the last NAND gate's stage is gated with the nonoverlapping control signals (WAD/RAD) in order to balance the critical path within each active signal, as analyzed in the datapath timing (Section 3).

Figure 2 depicts our empty-full flag circuitry, which is designed to detect the memory status with respect to the memory's size. The novelty of this design is the circuitry's ability to reconfigure for different memory sizes without requiring additional DFFs or a newly restructured logic circuit, as compared to prior works that used either an up-down, ring, or gray code counter. The empty-full flag detects the status of the memory based on the value of the read and write address-pointers, which dictates the empty-full flags' values.
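The roll-over idea behind the empty-full flags can be illustrated behaviourally. The Python sketch below is an illustration under our own assumptions, not the circuit of Figure 2: it models each pointer's roll-over record as a single wrap bit that toggles whenever the pointer returns to zero, and the names `RolloverFlags`, `wwrap`, and `rwrap` are hypothetical.

```python
class RolloverFlags:
    """Behavioural sketch of roll-over based empty/full detection.

    Assumption (ours): each pointer keeps one wrap bit that toggles on
    roll-over. The flag state is two bits regardless of `depth`.
    """

    def __init__(self, depth: int) -> None:
        self.depth = depth      # number of memory rows
        self.wptr = 0           # write address-pointer (up-counter)
        self.rptr = 0           # read address-pointer (up-counter)
        self.wwrap = 0          # write-side roll-over record
        self.rwrap = 0          # read-side roll-over record

    @property
    def empty(self) -> bool:
        # Pointers equal and roll-over records equal: nothing stored.
        return self.wptr == self.rptr and self.wwrap == self.rwrap

    @property
    def full(self) -> bool:
        # Pointers equal but the writer is one roll-over ahead.
        return self.wptr == self.rptr and self.wwrap != self.rwrap

    def write(self) -> bool:
        """Advance the write pointer; return False if the FIFO is full."""
        if self.full:
            return False
        self.wptr += 1
        if self.wptr == self.depth:   # saturation point: roll over
            self.wptr = 0
            self.wwrap ^= 1
        return True

    def read(self) -> bool:
        """Advance the read pointer; return False if the FIFO is empty."""
        if self.empty:
            return False
        self.rptr += 1
        if self.rptr == self.depth:   # saturation point: roll over
            self.rptr = 0
            self.rwrap ^= 1
        return True


if __name__ == "__main__":
    flags = RolloverFlags(depth=64)
    for _ in range(64):
        flags.write()
    print(flags.full, flags.empty)
```

Changing `depth` changes only the pointer width in this model; the amount of flag state stays fixed, which is the scalability property the text attributes to the proposed circuit.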
The empty-full flag circuitry is constructed using two serially connected DFFs for each operation (i.e., one pair of DFFs for the read operation and another pair for the write operation). Both pairs of serially connected DFFs are initialized to “10.” The least significant bits (LSBs) are XNOR'ed to evaluate the flags' statuses in combination with the read and write address-pointers. If the address-pointers' values are equal and the serially connected DFFs' LSBs are equal, the empty flag is asserted; alternatively, if the address-pointers' values are equal and the serially connected DFFs' LSBs are different, the full flag is asserted. When the associated read/write address-pointer reaches the full state, the associated pair of serially connected DFFs changes state to “01” on the rising RAD/WAD edge, which detects the roll-over case between the two address pointers.

3 | DATAPATH TIMING ANALYSIS

We evaluate the scalability of our approach using the timing of all critical paths with respect to a one-unit gate delay (GD), which provides an analysis that is independent of technology factors for direct comparison purposes. Our design uses only a basic CMOS logic gate structure with basic width/length sizes for design layout clarity and cost-effectiveness for continued technology scaling. We evaluate our FIFO's design using an 8T-Cell SRAM structure of size 64-row × 64-bit, which is a common size for many synchronous and asynchronous applications.
FIGURE 2 Proposed empty-full flag circuitry

As discussed in Section 2, the datapath is activated by two nonoverlapping signals: the WAP signal is nonoverlapped with the WAD signal for the write operation, and the RAP signal is nonoverlapped with the RAD signal for the read operation. This mutually exclusive behavior halves the critical path delay and precludes the critical path from accumulating all delays from input to output for a complete cycle. Thus, the worst-case delay is the longest half-cycle delay of the WAP, WAD, RAP, or RAD signals. To the best of our knowledge, our design is the first implementation of a FIFO that uses a two-phase clocking methodology.

To prevent falling and rising edges of signals from intersecting with each other, a delay margin, referred to as the idle-margin, between the WAP and WAD signals or between the RAP and RAD signals is guaranteed by the control unit, and has a worst-case timing of 3 GDs:

$$\text{Idle-margin} = 3\ \text{GDs} \tag{1}$$

Essentially, there are three GDs between the rising/falling and falling/rising edges of the RAP and RAD signals, and similarly, there are three GDs between the rising/falling and falling/rising edges of the WAP and WAD signals. The idle-margin is maintained to avoid memory race conditions, such that the address-decoder deassertion has completely settled before the address-pointer is asserted and vice versa. This idle-margin can vary from 1.5 to 3 GDs due to PVT variations; however, the goal is to maintain this idle-margin as greater than 1 GD as a minimum timing requirement to disable the address-decoder circuit and enable the address-pointer circuit or vice versa.
Therefore, the PVT variation is not an issue as long as the two signal pairs (WAD/WAP) and (RAD/RAP) have a minimum idle-margin greater than 1 GD for all technology corners.

Figure 3 depicts the critical paths for the write and read operations. The write operation begins when the WAP signal triggers the write address-pointer circuit, which is a parallel up-counter constructed as a state-lookahead structure with a counting delay of a single DFF. Thus, the delay of the address-pointer's circuit is approximated as 2 GDs as a precaution for the worst-case timing scenario:

$$\text{Address-pointer} = 2\ \text{GDs} \tag{2}$$

Additionally, a portion of the address-decoder's logic, except for the last stage, is integrated into the address-pointer stage, giving an additional 3 GDs. Thus, the total delay for generating the memory word line address prior to the last stage of the address-decoder is

$$\text{W-1} = \text{Address-pointer} + 3\ \text{GDs} = 5\ \text{GDs} \tag{3}$$

FIGURE 3 Critical paths for the read and write operations

Furthermore, tracing the critical path from the WAP signal in the opposite direction (i.e., the empty-full flag circuitry), there are two branches. One branch is directed to the XNOR logic (W-2A) and the other branch is directed to the large fan-in AND gate (W-2B), resulting in

$$\text{W-2A} = \text{Address-pointer} + 2\ \text{GDs (XNORs)} = 4\ \text{GDs} \tag{4}$$

$$\text{W-2B} = \text{Address-pointer} + 2\ \text{GDs (fan-in AND)} = 4\ \text{GDs} \tag{5}$$

Alternatively, the nonoverlapping WAD signal triggers the write address-decoder, which is part of the last stage of the address-decoder and is assumed to take 1 GD.
Thus, following W-3 for the write operation, 6 GDs are required to turn on the selected memory row and one additional GD is required to store the data into the selected row, since the data are already on the memory bus and waiting for the selected row to be activated. Therefore, the critical path delay starts when WAD is activated and ends when the data are stored into the selected row:

$$\text{W-3} = 1\ \text{GD} + 6\ \text{GDs} + 1\ \text{GD} = 8\ \text{GDs} \tag{6}$$

Comparing Equations 3, 4, 5, and 6, the write operation's total delay is the worst of the nonoverlapping signals' path delays (Equation 6), plus the idle-margin delay:

$$\text{Write-Delay} = \text{W-3} + \text{Idle-margin} = 8\ \text{GDs} + 3\ \text{GDs} = 11\ \text{GDs} \tag{7}$$

As a result, the WAD signal is designed to be asserted for a minimum of 11 GDs while the WAP signal is deasserted, and vice versa, the WAP signal is asserted for a minimum of 11 GDs while the WAD signal is deasserted. Therefore, the write cycle time is approximated as 22 GDs, where 11 GDs are asserted and 11 GDs are deasserted. Including a precaution time of 3 GDs for slew rate effects, we can assume the write cycle time is 25 GDs:

$$\text{Write-Cycle} = 2 \times \text{Write-Delay} + \text{Slew-Rate} = 2 \times 11\ \text{GDs} + 3\ \text{GDs} = 25\ \text{GDs} \tag{8}$$

Similarly, the read operation has the same address-pointer delay as the write operation, starting from the RAP signal initiating R-1, R-2A, and R-2B (Equations 3, 4, and 5 and Figure 3), since both are identical circuits but operate separately and independently of each other. On the other hand, the read address-decoder path requires more delay than the write address-decoder path (Equation 6).
The additional delay for the read operation is the time required to read the data from the selected row and store the data into the output latches. The time to select the row is the R-3 delay path, which is the same as W-3 in Equation 6. Thus, tracing the RAD signal through R-3 and R-4 to read the data from the memory gives

$$\text{R-4} = \text{R-3} + 9\ \text{GDs} + 1\ \text{GD} = 18\ \text{GDs} \tag{9}$$

Therefore, the read operation's total time is the worst case of the nonoverlapping signals' path delays, which is Equation 3 or Equation 9, plus the idle-margin delay:

$$\text{Read-Delay} = \text{R-4} + \text{Idle-margin} = 18\ \text{GDs} + 3\ \text{GDs} = 21\ \text{GDs} \tag{10}$$

As a result, the RAD signal is designed to be asserted for a minimum of 21 GDs while the RAP signal is deasserted, and vice versa, the RAP signal is asserted for 21 GDs while the RAD signal is deasserted. Including a precaution time for slew rate effects, the read cycle time is 45 GDs:

$$\text{Read-Cycle} = 2 \times \text{Read-Delay} + 3\ \text{GDs} = 45\ \text{GDs} \tag{11}$$

Since the write cycle time (Equation 8) and the read cycle time (Equation 11) must be uniform, we adopt the worst-case delay (Equation 11) for both the read and write operations. That is, the WAP and WAD signals have a nonoverlapping cycle time of 45 GDs, and the RAP and RAD signals have a nonoverlapping cycle time of 45 GDs.

We evaluate the actual physical delay using a 65 nm CMOS technology operating with a 1 V power supply. Each standard GD can be approximated as 0.005 ns,28 but we assume 0.02 ns as a precautionary measure even though we use the minimum sizes and distribute all fan-in and fan-out circuits over a four-gate tree structure. Therefore, the actual physical cycle time of 45 GDs is

$$\text{Cycle-Time} = 0.02\ \text{ns/GD} \times 45\ \text{GDs} = 0.9\ \text{ns} \tag{12}$$

In summary, the datapath circuit can safely operate at approximately 1 GHz, which could be further improved by using a faster SRAM read access, as is widely available in the literature,12,13 or by migrating the design to a more advanced CMOS technology.
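The gate-delay budget of Equations 1 to 12 is simple arithmetic and can be checked mechanically. The following Python sketch recomputes it from the per-stage values stated in the text; the function name `timing_budget` and the dictionary keys are our own naming for illustration.

```python
IDLE_MARGIN_GD = 3       # Equation 1
ADDRESS_POINTER_GD = 2   # Equation 2
SLEW_GD = 3              # precaution for slew rate effects


def timing_budget(gd_ns: float = 0.02) -> dict:
    """Recompute the write/read cycle budget in gate delays (GDs)."""
    w1 = ADDRESS_POINTER_GD + 3        # Equation 3: pointer + pre-decode
    w2a = ADDRESS_POINTER_GD + 2       # Equation 4: XNOR branch
    w2b = ADDRESS_POINTER_GD + 2       # Equation 5: fan-in AND branch
    w3 = 1 + 6 + 1                     # Equation 6: last stage + row + store
    write_delay = max(w1, w2a, w2b, w3) + IDLE_MARGIN_GD   # Equation 7
    write_cycle = 2 * write_delay + SLEW_GD                # Equation 8

    r4 = w3 + 9 + 1                    # Equation 9: R-3 equals W-3
    read_delay = max(w1, w2a, w2b, r4) + IDLE_MARGIN_GD    # Equation 10
    read_cycle = 2 * read_delay + SLEW_GD                  # Equation 11

    cycle_gd = max(write_cycle, read_cycle)  # worst case used for both
    cycle_ns = cycle_gd * gd_ns              # Equation 12
    return {
        "write_delay_gd": write_delay,
        "write_cycle_gd": write_cycle,
        "read_delay_gd": read_delay,
        "read_cycle_gd": read_cycle,
        "cycle_ns": cycle_ns,
        "max_freq_ghz": 1.0 / cycle_ns,
    }


if __name__ == "__main__":
    for key, value in timing_budget().items():
        print(f"{key}: {value}")
```

Passing a different `gd_ns` (for example the 0.005 ns approximation mentioned above) shows how much of the 0.9 ns figure comes from the conservative per-gate assumption rather than from the GD count itself.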
4 | SYNCHRONOUS CONTROL CIRCUIT

The objective of the synchronous control circuit is to generate the nonoverlapping, two-phase signals CLKN and CLKP, which are derived from the main system clock CLK, and represent CLK's negative and positive variations, respectively. The ideal case is depicted in Figure 4. CLKP and CLKN are at the opposite phase of each other with an idle-margin between the transition edges.

FIGURE 4 Timing diagram depicting the nonoverlapping, two-phase signals with respect to the main system clock

The CLK source and traditional back-to-back NAND two-phase clocking systems21 are usually susceptible to PVT variations, as well as noise jitter variation. All of these variations can have a negative impact on the duty cycle of the two phases and the nonoverlapping margin, thus negatively impacting overall circuit timing performance. However, the idea is not to eliminate the PVT variations, but rather to preserve consistent generation of the signals and, consequently, the idle-margin. Therefore, as long as CLKP and CLKN are strictly opposite for a minimum of 21 GDs (as derived in Equation 10) with an idle-margin of at least 3 GDs for all technology corners, the circuit operation is stable and correct with no need for additional large circuit cost to maintain correct operation.
In the literature, a delay-locked-loop (DLL) circuit23,24 is used to control the PVT variation and preserve the clock skew; however, due to the DLL's complexity, power consumption, and large silicon area, particularly in many-core architectures, we use a simpler circuit that achieves similar results. Simulations reported in Reference 22 over all PVT corners with nonideal clock signals show that this circuit provides nonoverlapping phases with low root mean square (RMS) jitter.

Figure 5 depicts the control circuit for the synchronous FIFO operation with the input/output signals' abbreviations and descriptions defined in Table 2.

FIGURE 5 Synchronous first-in first-out (FIFO) operation's control circuit

TABLE 2 Control circuit signal definitions

| Input-output signals | Representations |
|---|---|
| CLK | System clock |
| WE | Write enable |
| RE | Read enable |
| RESET | System reset |
| CLKN | Negative phase of system clock |
| CLKP | Positive phase of system clock |
| WEC | Write enable clean |
| REC | Read enable clean |

WE is evaluated with the full flag signal (asserted by the empty-full flag circuitry), and the write-enable-clean (WEC) signal is asserted if the memory has enough room to store the data; otherwise, WEC is deasserted. Similarly, RE is evaluated with the empty flag signal, and the read-enable-clean (REC) signal is asserted if the memory has data to read; otherwise, REC is deasserted. As a precaution, if both RE and WE are asserted simultaneously, both REC and WEC are deactivated and the FIFO performs neither operation. Finally, gating CLKP/CLKN with WEC and REC generates WAD/WAP and RAD/RAP, respectively.
These signals are inputs to the datapath unit and orchestrate all necessary read and write operations.

5 | ASYNCHRONOUS CONTROL CIRCUIT

Figure 6 illustrates the timing for asynchronous communication, with numbers indicating key transition points as described below. This timing is based on a handshaking protocol between the sender (active) and receiver (passive) components.25,26 The aim of the asynchronous control unit is to assert request-clean (REQC) upon receiving an assertion of REQ in order to start the read or write operation; deasserting REQC indicates the end of the read or write operation. Additionally, the control unit is required to assert acknowledge-clean (ACKC), which is mutually exclusive with REQC and maintains the idle-margin with the falling edge of REQC, thus avoiding race conditions.

Figure 6 also illustrates the generation of the signals along with the signals' actions for asynchronous timing. REQ (1) initiates a handshake to the receiver, with the control circuit generating REQC (1′). The assertion of REQC (1′) is used to enable the datapath's memory for a read or write operation (i.e., enabling the address-decoder circuit), after which the control circuit deasserts REQC (1″) after a minimum pulse width given by Equation 10, which is 21 GDs. Next, the control circuit generates ACKC (2) with a sufficient idle-margin from REQC (1″), since ACKC is used to activate the address counting (i.e., ACKC activates the address-pointer and eliminates race conditions). Once the sender receives ACKC (2), the sender deasserts REQ (3), which is used to reset the receiver, thereby causing the falling edge of ACKC (4) and preparing for a new handshake cycle.

Figure 7 depicts the asynchronous circuit with conventional DFFs used as the delay necessary for generating REQC and ACKC.
Even though DFFs are commonly known for “overall good” robustness against PVT variations and continued technology scaling,21–24 we do not propose to eliminate PVT variations but instead focus on preserving the shape of the generated signals, such that REQC and ACKC do not overlap with each other under any technology corner. For example, the REQC pulse (from the rising edge 1′ to the falling edge 1″ in Figure 6) might vary from a shorter pulse width to a wider pulse width across technology corners (e.g., FF, SS, TT, SF, etc.); however, our design goal is to ensure that the minimum pulse width of REQC across all corners is not less than 21 GDs (Equation 10). Thus, the read or write operation occurs within this pulse with enough delay to process and complete. Additionally, several cascaded DFFs are added between ACKC (2) and REQC (1″) (Figure 6) to ensure that the idle-margin is preserved for a minimum of 3 GDs for all technology corners.

Initially in Figure 7, all DFFs are in a reset state with deasserted outputs. Once the asserted REQ is detected, REQC is asserted and activates the memory cells for a read or write operation. After the delay chain of DFFs, which is equal to the write or read access time (21 GDs), REQC is deasserted (i.e., feedback line-1 in Figure 7) in order to close the memory cells, such that no data will be written or read. Then, after the idle-margin (i.e., two cascaded DFFs), ACKC is asserted and propagated across the channel to inform the active component that the operation has completed. Once the active component receives the asserted ACKC, the active component deasserts REQ back to the passive component.

FIGURE 6 Generating a request-clean signal (REQC) and an acknowledge-clean signal (ACKC) within asynchronous communication timing diagram.
Numbers indicate key transition points, as described

FIGURE 7 Asynchronous cascaded DFFs circuit for generating a request-clean signal (REQC) and an acknowledge-clean signal (ACKC)

FIGURE 8 Asynchronous control logic with clockless input/output signals to the datapath

Once the circuit in Figure 7 receives the falling edge of REQ, the circuit resets all DFFs, and thus, ACKC is deasserted. Subsequently, the asynchronous control circuit returns to its initial state and waits for a new request.

Figure 8 depicts the overall circuitry of the asynchronous control logic, with the abbreviations and descriptions of the clockless input/output signals defined in Table 3. During the write operation, the sender asserts the write-request (WREQ) signal, whereupon the receiver asserts the write-request-clean (WREQC) signal, which is forwarded to the datapath unit as WAD to activate the write address-decoder, and thus activates the memory for a write operation. Once WREQC is deasserted, WAD is deasserted, which deactivates the memory's write operation. After some idle-margin, the write-acknowledge-clean (WACKC) signal is asserted, which activates WAP. This assertion activates the up counter for the address-pointer in the datapath unit. Finally, the sender deasserts WREQ upon receiving the asserted WACKC, and thus, the receiver's control unit resets all of the DFFs and deasserts WACKC to wait for a new handshake operation. The read operation proceeds similarly to the write operation.
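The handshake ordering described above can be sketched as a small event trace. This is only an illustrative model, not the authors' circuit: time is counted in gate delays (GDs), the 21 GD pulse width and 3 GD idle-margin are taken from the text, and the names `handshake_cycle`, `req_ack_overlap`, and the `sender_delay_gd` parameter are hypothetical. Gate delays between REQ and REQC, and between the falling REQ and the falling ACKC, are idealized as zero.

```python
from dataclasses import dataclass

PULSE_GD = 21  # minimum REQC pulse width in gate delays (value from the text)
IDLE_GD = 3    # minimum idle-margin between REQC falling and ACKC rising


@dataclass
class Event:
    time_gd: int  # time stamp in gate delays
    signal: str   # "REQ", "REQC" or "ACKC"
    level: int    # 1 = asserted, 0 = deasserted


def handshake_cycle(t_req_rise, sender_delay_gd=2):
    """Return the ordered events of one handshake (idealized, zero wire delay).

    sender_delay_gd is an assumed sender reaction time, not a value from the paper.
    """
    events = []
    t = t_req_rise
    events.append(Event(t, "REQ", 1))    # (1)  sender asserts REQ
    events.append(Event(t, "REQC", 1))   # (1') control asserts REQC, decoder enabled
    t += PULSE_GD
    events.append(Event(t, "REQC", 0))   # (1'') REQC deasserted, memory closed
    t += IDLE_GD
    events.append(Event(t, "ACKC", 1))   # (2)  ACKC asserted, address-pointer counts
    t += sender_delay_gd
    events.append(Event(t, "REQ", 0))    # (3)  sender deasserts REQ
    events.append(Event(t, "ACKC", 0))   # (4)  DFFs reset, ACKC falls
    return events


def req_ack_overlap(events):
    """True if REQC and ACKC are ever asserted at the same time."""
    level = {"REQ": 0, "REQC": 0, "ACKC": 0}
    for e in events:
        level[e.signal] = e.level
        if level["REQC"] and level["ACKC"]:
            return True
    return False


if __name__ == "__main__":
    trace = handshake_cycle(0)
    for e in trace:
        print(f"t={e.time_gd:3d} GD  {e.signal:4s} -> {e.level}")
    print("REQC/ACKC overlap:", req_ack_overlap(trace))
```

In this model the mutual exclusion of REQC and ACKC follows directly from inserting the idle-margin before ACKC rises, which mirrors the role of the cascaded DFFs in Figure 7.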
6 | RESULTS AND COMPARISON ANALYSIS

We implement and test a datapath unit of size 64-row × 64-bit using SRAM 8T-Cells, which is a common FIFO size in many chips.1–4,7,8 Additionally, two control units, one with synchronous and the other with asynchronous circuitry, are implemented and integrated with the datapath unit for testing purposes. We use a cost-effective transistor-level CMOS 65-nm Taiwan Semiconductor Manufacturing Company (TSMC) technology with a 1 V power supply.28 We gathered timing delay values, total power consumption, and total transistor counts using HSPICE29 simulations.

TABLE 3 Asynchronous control circuit signal definitions

| Input–output signals | Representations |
| --- | --- |
| WREQ | Write request |
| WREQC | Write-request-clean |
| RREQ | Read request |
| RREQC | Read-request-clean |
| WACK | Write acknowledge |
| WACKC | Write-acknowledge-clean |
| RACK | Read-acknowledge |
| RACKC | Read-acknowledge-clean |
| RESET | System reset |

The datapath's worst case critical path delay is measured for the read operation, showing a maximum operating frequency of 1.25 GHz (a 0.8 ns cycle time), which is very close to the cycle time derived in Equation 11. Equation 11 gives a more conservative result since it assumes the worst case scenario, which adds an additional safety margin delay. In both cases, the datapath can safely operate at 1 GHz for both the read and write operations. Consequently, the complete cycle of the two nonoverlapping signals (WAD/WAP and RAD/RAP) that orchestrate the write and read operations is completed within 1 ns with a slew rate of 0.1 ns/V.
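The timing figures quoted above can be cross-checked with a few lines of arithmetic. The sketch below uses only the values stated in the text (1.25 GHz measured, 1 GHz operating frequency, 64-bit words, 1 V supply, 0.1 ns/V slew rate); the helper name `period_ns` and the assumption of a linear full-rail edge are illustrative and not taken from the paper.

```python
MEASURED_FREQ_HZ = 1.25e9    # worst-case read path, as measured in the text
OPERATING_FREQ_HZ = 1.0e9    # stated safe operating frequency
WORD_BITS = 64               # word size of the 64-row x 64-bit array
SUPPLY_V = 1.0               # power supply voltage
SLEW_NS_PER_V = 0.1          # slew rate quoted in the text


def period_ns(freq_hz):
    """Cycle time in nanoseconds for a given frequency in hertz."""
    return 1e9 / freq_hz


measured_period_ns = period_ns(MEASURED_FREQ_HZ)     # about 0.8 ns
operating_period_ns = period_ns(OPERATING_FREQ_HZ)   # 1 ns
slack_ns = operating_period_ns - measured_period_ns  # margin left at 1 GHz

# One word per cycle at the operating frequency.
throughput_gword_s = OPERATING_FREQ_HZ / 1e9
throughput_gbit_s = WORD_BITS * OPERATING_FREQ_HZ / 1e9

# Assumption: a linear edge swinging the full supply rail.
edge_time_ns = SLEW_NS_PER_V * SUPPLY_V

if __name__ == "__main__":
    print(f"measured period  : {measured_period_ns:.2f} ns")
    print(f"operating period : {operating_period_ns:.2f} ns")
    print(f"slack at 1 GHz   : {slack_ns:.2f} ns")
    print(f"throughput       : {throughput_gword_s:.0f} Gword/s, {throughput_gbit_s:.0f} Gbit/s")
    print(f"full-swing edge  : {edge_time_ns:.2f} ns")
```

Under these numbers, running at 1 GHz leaves roughly a fifth of the cycle as margin over the measured critical path, which is consistent with the safety margin the text attributes to Equation 11.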
Furthermore, the critical path given by Equation 11 affects scalability: as the storage area (FIFO array) increases, the operating frequency decreases, and vice versa. This relationship between scalability and frequency is due only to the FIFO array, not the control unit or the empty-full flag circuitry. As previously mentioned, the empty-full flag circuitry depends on the rollover of the FIFO array instead of the depth. Besides, the control unit circuitry always maintains the two-phase clocking system with a nonoverlapping margin, where each phase is set by the worst case critical path delay plus a safety margin for the slew rate, as shown in Equation 11. As a result, the FIFO array of 8T-Cells determines the maximum running frequency. Currently, deep FIFOs of up to a 10-word depth can be considered sufficient for high performance throughput30; however, for further scalability to future applications, we propose a far more flexible FIFO design with a depth of 64 words, where each word is 64 bits wide. Consequently, the trade-off between 8T-Cell SRAM size and operating frequency across a wide range of CMOS technologies becomes an SRAM design issue, and the literature provides various details on memory array scalability against operating frequency.12,13

Our design has several factors that help reduce the power consumption, with the major factor being that the datapath operates with a 1 V power supply. Additionally, all components are constructed using CMOS transistors with a 65 nm channel length and widths ranging from 3 to 5 μm, except for the inverter drivers, whose widths are Wp = 15 μm and Wn = 10 μm. Another major contributing factor for low power design is the use of an SRAM memory array that is based on the 8T-Cell with a standard geometry size on 65 nm from Intel.12 The 8T-Cell SRAM is commonly known for low power since this cell type operates with no biasing current.
Additionally, the read and write ports are separated to avoid charging contention and to obviate the sense amplifier, which is considered a major source of power consumption. A further factor is related to the ARM architecture,14 which toggles the circuit with two nonoverlapping signals whose rising and falling edges are separated by an idle-margin, thus minimizing the short-circuit path through the logic gates from the power supply to ground, which tends to reduce the dynamic power. Finally, our design ensures that the components are not all active at the same time.

An additional advantage is that the datapath processes write and read operations without the latency cycles that other designs4–20 usually incur from brute-force, cascaded DFFs used to synchronize the incoming data with the internal clock. Thus, the datapath achieves high overall performance and processes data on every cycle (i.e., every pair of nonoverlapping signals processes new data).

Both the synchronous and asynchronous control unit circuits use standard-library digital logic gates with sizes similar to those of the datapath logic gates. The main objective of the control unit is to generate the two complemented signals for the read operation and the two complemented signals for the write operation with an idle-margin at the edges of the signals. The control unit circuit's design is not intended to eliminate PVT variations, but rather to preserve the complement of the two signals in accordance with the idle-margin. The variation of the complemented signals' duty cycle is not important so long as the signals do not overlap, each active pulse lasts a minimum of 21 GDs, and there is a minimum idle-margin of 3 GDs. We conducted several HSPICE simulations at different technology corners
(i.e., FF, SS, TT, SF, etc.) to evaluate the complement of the signals under several constraints. We experimentally verified the design's robustness and validated the generation of the complement signals.

Tables 4 and 5 summarize the comparison between several state-of-the-art synchronous and asynchronous FIFO structures, respectively. Key factors such as power consumption, operating speed, performance in terms of throughput per cycle, and reconfigurability are evaluated. This comparison evaluates the designs' complexities and scalabilities independent of the underlying technology, since it is challenging to find comparable designs with the same technology parameters and specifics; however, the comparison still provides insights about relative power consumption, speed, scalability, and design complexity.

Table 4 summarizes the characteristics of the synchronous FIFO circuits that we compare to, listed by reference number. The design in Hsu et al.31 is usually classified as a very low power design since that design operates at 0.4 V; however, that design trades speed for low power, running at only 20 to 30 MHz, which limits its use to applications with these requirements and/or restrictions. Despite the lower power and performance, that design also has a large design cost for a self-timed management unit that controls the timing between the memory-decoder and address-pointer to overcome the PVT variations that might violate the setup/hold time constraints at the gated clock memory component. Thus, that design is not compatible with EDA synthesis tools and requires a large design cost.
Additionally, that design has a costly up/down counter of size log2(N) for the empty-full flag circuitry, where N is the number of memory rows, which results in a nonscalable design.

The design in Rahmani et al.32 operates at the same 1 V power supply as our proposed design; however, that design operates 5X slower than our proposed work. Besides, it uses separate clocks for read and write operations, which requires the extra cost of a dual-port memory array. That design uses a Johnson counter for the empty-full flag circuit and an adder with a feedback register for the binary address-pointer to minimize the overhead of the DFFs and reduce power consumption. However, it incurs further design cost because binary addresses must be converted to Gray code to be synchronized with the corresponding clock domain and used for empty-full flag detection. Thus, the design is not cost efficient for scalability and requires ASIC components.

The design in Taghi Adl and Mohammadi11 presents an elastic method for a dual clock FIFO that operates across different clock domains and is compatible with EDA tools. The storage area consists of elastic buffers, where every buffer is formed from three DFFs since it has three states, which yields a large data storage area and results in large power consumption. Besides, the latency depends on the read and write activation scenario, which leads to uncertain throughput with large variations. The token ring address-pointer requires N buffers for N elastic buffers, which has the status of 2 N of data; still, the address pointer is not considered power efficient. The control unit consumes considerable power and requires a large sequential machine to control the status of the empty-full flag circuit and the elastic buffers.
Generally, the design in Taghi Adl and Mohammadi11 is efficient in neither power consumption nor high-speed operation, and it has low throughput as the penalty for managing two different operation cycles, which can instead be managed with our proposed asynchronous FIFO structure.

TABLE 4 Comparison between prior works and our proposed synchronous FIFO design

| Ref. | Address-pointer | Control unit | Storage unit | Flag unit | Freq/voltage | Latency |
| --- | --- | --- | --- | --- | --- | --- |
| 31 | Up/down counter | Self-timed management units and power switched enable signal | 10T-cell SRAM 256-row × 16-bit | Up/down indicators | 22.7 MHz/0.4 V | One data/cycle |
| 32 | Adder and binary registers | Counter control unit | Dual-port RAM 128-row × 32-bit | Johnson counter | 200 MHz/1 V | One data/cycle |
| 11 | Ring-one-hot | Elastic modification module | Elastic registers | Slow sequential machine | 800–600 MHz/1 V | 3 to 4 |
| Our work | Carry look-ahead state counter | Non-overlapping two-phase clock system with idle-margin between edges | 8T-cell SRAM 64-row × 64-bit | Rollover detection and comparison circuit | 1 GHz/1 V | One data/cycle |

Abbreviation: FIFO, first-in first-out.

TABLE 5 Comparison between prior works and our proposed asynchronous FIFO design

| Ref. | Address-pointer | Control unit | Storage unit | Flag unit | Freq/voltage/power/tech. | Latency |
| --- | --- | --- | --- | --- | --- | --- |
| 33 | Gray code with control timing circuit | Slow low-cost clock system generator with metastability avoidance circuit | SRAM 8T-cell 128-row × 16-bit | Up/down indicators | 150 MHz/1 V/2.31 mW/28 nm | One data/three cycles |
| 34 | Token ring with one-hot bubble-encoding and shift register | Global island clock system for read and write | High-cost N buffers-queue requiring N*(N-1) latches for N lanes | Large cost domino logic detectors to avoid circuit deadlocks | 200 MHz/1 V/3.7 mW/90 nm | One data/N-1 cycles |
| Our work | Carry look-ahead state counter | Clockless system with non-overlapping generation of request and acknowledge signals | 8T-Cell SRAM 64-row × 64-bit | Rollover detection and comparison circuit | 1 GHz/1 V/2 mW/65 nm | One data/one cycle |

Abbreviation: FIFO, first-in first-out.

Table 5 summarizes the characteristics of the asynchronous FIFO circuits that we compare to, listed by reference number. This comparison uses most of the datapath characteristics that have been discussed for Table 4; therefore, we only discuss the new characteristics. The design in Keller et al.33 generates a variant of an internal clock signal to synchronize the incoming data with the internal clock. That design incurs extra cost due to Gray code conversion, which only supports capacities that are powers of two and thus limits the usable address range. Besides, it imposes a three-cycle data latency to avoid metastability during Gray code synchronization, which increases the data buffer area and still harms performance. Consequently, that design is considered to operate at a low frequency to alleviate all synchronization issues between the incoming data and the internally generated pausible clock.

The design in Nguyen and Tran34 uses registers for data storage, with the global clock fed from a Globally Asynchronous Locally Synchronous (GALS) NoC structure.
The address pointer is a high-cost sequence of cascaded FFs whose length depends on the data depth, wherein the circulating FFs are triggered by the global clock and the read and write operations take turns on different registers. That design suffers from a large latency, which increases further to ensure correct write and read functionality when one operation is far faster than the other; thus, performance degradation becomes prominent in that design. Consequently, the data capacity needed under long latency becomes a major limitation of that design. Another restriction is the use of bubble encoding for the token rings to detect the empty-full flag of the queue registers. Thus, the detector in that design is known to be costly, requiring a special custom circuit to compensate for the different timing of read and write operations.

In summary, for the aforementioned comparable designs, the trade-off is the requirement of a Gray code address counter with a brute-force mechanism using serial DFFs to synchronize the incoming data with the edge of the generated internal clock. Thus, that structure requires several clock cycles of latency to ensure correct read and write operations, which affects the overall performance. Furthermore, generating the internal clock requires long timing channel delays with a large power consumption, which could narrow the range of suitable applications. In contrast, our proposed work manages the handshaking signals in a completely clockless environment and communication channel link, such that intermediate request-clean and acknowledge-clean signals are generated as mutually exclusive events. The pulse width of the request-clean signal activates the read or write operation, while the acknowledge-clean signal activates the address counting operation. Therefore, this design avoids the trade-offs of previous structures concerning longer latency and higher power consumption.
Besides, the design uses a standard-cell CMOS library, is synthesizable with EDA tools, and does not require any special asynchronous circuit.

7 | CONCLUSION

In this work, we proposed a FIFO circuit design that is suitable for both asynchronous and synchronous communication applications. Our design separates the control unit and datapath unit, which facilitates easy reconfigurability and scalability. In both types of applications, our proposed datapath circuit remains the same, while our proposed control circuit comprises two separate internal structures that can handle the different handshaking communication protocols with a simple reconfiguration. The datapath is structured with a two-ported 8T-Cell SRAM-based memory array that operates on two-phase nonoverlapping signals, such that one signal activates the memory decoder, while the other signal activates the memory pointer. This design eliminates race conditions and eases the design of the controlling circuit structure. Subsequently, both types of control circuits, asynchronous and synchronous, generate the appropriate nonoverlapping signals to the datapath based on the required handshaking communication protocol. Our circuit design's other unique feature is the method of detecting the status of the memory storage using a novel empty-full flag circuitry that counts the rollover of the memory pointer using only four D-type Flip-Flops (DFFs) regardless of the memory size, as compared to prior work that uses a large-area counter. This structure affords good scalability with minimized power consumption.
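To make the idea of rollover-based empty-full detection concrete, the following behavioral sketch uses the widely known wrap-bit technique: each pointer carries a one-bit rollover flag that toggles when the pointer wraps, so the flag logic does not grow with the memory depth. This is an assumption for illustration only; it is not the authors' four-DFF circuit, and the class name `RolloverFifo` and its attributes are hypothetical.

```python
class RolloverFifo:
    """Behavioral FIFO model whose flags depend only on pointer rollover state."""

    def __init__(self, depth=64):
        self.depth = depth
        self.mem = [None] * depth
        self.wptr = 0   # write address-pointer
        self.rptr = 0   # read address-pointer
        self.wroll = 0  # toggles each time the write pointer wraps
        self.rroll = 0  # toggles each time the read pointer wraps

    @property
    def empty(self):
        # Same address and same rollover parity: reader has caught up.
        return self.wptr == self.rptr and self.wroll == self.rroll

    @property
    def full(self):
        # Same address but different parity: writer is one lap ahead.
        return self.wptr == self.rptr and self.wroll != self.rroll

    def write(self, word):
        if self.full:
            raise OverflowError("FIFO full")
        self.mem[self.wptr] = word   # first phase: decoder active, data written
        self.wptr += 1               # second phase: address-pointer increments
        if self.wptr == self.depth:
            self.wptr = 0
            self.wroll ^= 1

    def read(self):
        if self.empty:
            raise IndexError("FIFO empty")
        word = self.mem[self.rptr]   # first phase: decoder active, data read
        self.rptr += 1               # second phase: address-pointer increments
        if self.rptr == self.depth:
            self.rptr = 0
            self.rroll ^= 1
        return word


if __name__ == "__main__":
    fifo = RolloverFifo(depth=4)
    for value in range(4):
        fifo.write(value)
    print("full:", fifo.full)
    print("read order:", [fifo.read() for _ in range(4)])
    print("empty:", fifo.empty)
```

Because the flags compare only the pointers and two parity bits, the same logic serves any depth, which is the scalability property the text attributes to rollover detection.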
Through extensive simulations, our results show that our datapath operates at 1 GHz and can process data (read or write) once per control cycle, which surpasses most state-of-the-art designs by 3X to 5X. Furthermore, our design operates with a 1 V power supply and offers continued technology scaling as an attractive feature for low-power design.

ORCID
Saleh Abdel-hafeez https://orcid.org/0000-0003-2988-1609

REFERENCES

1. Gordon-Ross A, Abdel-hafeez S, Alsafrjalni MH. A one-cycle FIFO buffer for memory management units in Manycore System. IEEE Computer Society Annual Symposium on VLSI, July 2019;265-270. https://doi.org/10.1109/ISVLSI.2019.00056
2. Bae Y, Park S, Park I. A single-chip programmable platform based on a multithreaded processor and configurable logic clusters. IEEE J Solid-State Circuits. Oct. 2003;38(10):1703-1711. https://ieeexplore.ieee.org/document/1233767
3. Shibata N, Watanabe M, Tanabe Y. A current-sensed high-speed and low-power first-in-first-out memory using a Wordline/Bitline-swapped dual-port SRAM cell. IEEE J Solid-State Circuits. June 2002;37(6):735-750.
4. Hu Y, Liang S, Yu J, Wang Y, Yang H. On-chip instruction generation for cross-layer CNN accelerator on FPGA. IEEE Computer Society Annual Symposium on VLSI, July 2019.
5. Alcin M, Koyuncu I, Tuna M, Varan M, Pehlivan I. A novel high speed artificial neural network-based chaotic true random number generator on field programmable gate Array. Int J Circuit Theory Appl, Wiley Press. Nov 8, 2018;47(3):365-378. https://doi.org/10.1002/cta.2581
6. Teymouri M. A multipurpose circuit to read out and digitize pixel signal for low-power CMOS imagers. Int J Circuit Theory Appl, Wiley Press. Aug 4, 2020;48(11):1887-1899. https://doi.org/10.1002/cta.2854
7. Zeinolabedin SMA, Zhou J, Liu X, Tae-Hyoung Kim T. An area- and energy-efficient FIFO design using error-reduced data compression and near-threshold operation for image/video applications. IEEE Trans Very Large Scale Integr VLSI Syst. 2015;23(11):2408-2416.
8. Abdel-Hafeez S, Quwaider MQ. A one-cycle asynchronous FIFO queue buffer circuit. 11th International Conference on Information and Communication Systems (ICICS), Irbid, Jordan, 2020;388-393. https://doi.org/10.1109/ICICS49469.2020.239548
9. Ezz-Edin R, El-Moursy MA, Hamed HFA. High throughput asynchronous NoCs design under high process variation. Integr VLSI J. 2015;49:1-13.
10. Ashour H. Design, simulation and realization of a parametrizable, configurable and modular asynchronous FIFO. 2015. https://ieeexplore.ieee.org/document/7237325
11. Taghi Adl SM, Mohammadi S. A high-performance dual clock elastic FIFO network Interface for GALS NoC. Microelectron J, Elsevier. 2018;76:69-80.
12. Nii K, Masuda Y, Yabuuchi M, et al. A 65 nm ultra-high-density dual-port SRAM with 0.71 μm² 8T-Cell for SoC. Symposium on VLSI Circuits, Digest of Technical Papers, Honolulu, HI, USA, 2006;17-18. https://doi.org/10.1109/VLSIC.2006.1705344
13. Abdel-Hafeez S, Shatnawi M, Gordon-Ross A. A double data rate 8T-Cell SRAM architecture for systems-on-chip. IEEE 14th International Symposium on System-on-Chip 2012, Tampere, Finland, October 11-12, 2012.
14. Furber S. Chapter 4: ARM Organization and Implementation. ARM: system-on-chip architecture. 2nd ed. Harlow, England: Addison-Wesley; 2000;74-101. https://www.pearsoned.co.uk
15. Sheibanyrad A, Greiner A. Two efficient synchronous-asynchronous converters well-suited for networks-on-Chip in GALS architectures. Integr VLSI J. 2008;41(1):17-26.
16. Panades IM, Greiner A. Bi-synchronous FIFO for synchronous circuit well suited for network-on-chip in GALS architectures. Proceedings of the First International Symposium on Networks-on-Chip (NOCS'07), May 2007;83-94. https://doi.org/10.1109/NOCS.2007.14
17. Chang MT, Huang PY, Hwang W. A robust ultra-low power asynchronous FIFO memory with self-adaptive power control. IEEE International SOC Conference, Newport Beach, CA, USA, 2008;175-178. https://doi.org/10.1109/SOCC.2008.4641505
18. Jeon D, Henry MB, Kim Y, et al. An energy efficient full-frame feature extraction accelerator with shift-latch FIFO in 28 nm CMOS. IEEE J Solid-State Circuits. May 2014;49(5):1271-1283.
19. Chelcea T, Nowick SM. A low-latency FIFO for mixed-clock systems. Proceedings IEEE Computer Society Workshop on VLSI 2000. System Design for a System-on-Chip Era, April 2000;119-126. https://doi.org/10.1109/IWV.2000.844540
20. Fattah M, Manian A, Rahimi A, Mohammadi S. A high throughput low power FIFO used for GALS NoC buffers. IEEE Computer Society Annual Symposium on VLSI, July 2010.
21. Lin W, Black Jr. WC. A low-jitter skew-calibrated multi-phase clock generator for time-interleaved applications. Solid-State Circuits Conference, 2001. Digest of Technical Papers. ISSCC. 2001, IEEE International; 2001;396-397.
22. Nowacki B, Paulino N, Goes J. A simple 1 GHz non-overlapping two-phase clock generators for SC circuits. 20th International Conference on Mixed Design of Integrated Circuits and Systems (MIXDES), June 20-22, Gdynia, Poland: IEEE; 2013;174-178.
23. Zhang D, Yang HG, Zhu W, et al. A multiphase DLL with a novel fast-locking fine-code time-to-digital converter. IEEE Trans Very Large Scale Integr VLSI Syst. 2015;23(11):2680-2684.
24. Abdel-hafeez S, Harb SM, Lee KM. On-chip jitter measurement architecture using a delay-locked Loop with Vernier delay line to the order of giga hertz. Proceedings of the 18th International Conference Mixed Design of Integrated Circuits and System (MIXDES), IEEE; June 2011;502-506.
25. Martin AJ, Nyström M. Asynchronous techniques for system-on-chip design. Proc IEEE. 2006;94(6):1089-1120.
26. Kessels J. Register-communication between mutually asynchronous domains. IEEE International Symposium on Asynchronous Circuits and Systems, March 2005;66-75. https://doi.org/10.1109/ASYNC.2005.27
27. Abdel-Hafeez S, Gordon-Ross A. A digital CMOS parallel counter architecture based on state look-ahead logic. IEEE Trans Very Large Scale Integr VLSI Syst. May 23, 2011;19(6):1023-1034.
28. 0.65 μm CMOS ASIC Process Digests, Taiwan Semiconductor Manufacturing Corporation, Hsinchu, Taiwan, 2005.
29. Synopsys. HSPICE, Mountain View, CA [Online]. 2016. Available: http://www.synopsys.com
30. Psarras A, Paschou M, Nicopoulos C, Dimitrakopoulos G. A dual-clock multiple-queue shared buffer. IEEE Trans Comput. Oct. 2017;10(66):1809-1815.
31. Hsu W, Huang P, Wu S, et al. 8nm ultra-low power near-/Sub-threshold first-in-first-out (FIFO) memory for multi-bio-signal sensing platforms. International Symposium on Automation and Test VLSI Design (VLSI-DAT), Hsinchu, Taiwan, April 2016;1-4.
32. Rahmani A, Liljeberg P, Plosila J, Tenhunen H. Design and implementation of reconfigurable FIFOs for voltage/Frequency Island-based networks-on-Chip. Microprocess Microsyst. June–July 2013;37(4-5):432-445.
33. Keller B, Fojtik M, Khailany B. A pausible bisynchronous FIFO for GALS systems. 21st IEEE International Symposium on Asynchronous Circuits and Systems, California, May 2015;1-8.
34. Nguyen TT, Tran XT. A novel asynchronous first-in-first-out adapting to multi synchronous network-on-chips. 2014 International Conference on Advanced Technologies for Communications (ATC 2014), Hanoi, Vietnam, Feb. 2014;365-370. https://doi.org/10.1109/ATC.2014.7043413

AUTHOR BIOGRAPHIES

Saleh Abdel-hafeez received his BSEE, MSEE, and Ph.D. in Computer Engineering in the field of VLSI design.
In 1997, he joined S3.inc as a member of their technical staff, where he performed IC circuit design related to cache memory, digital I/O and ADCs. He has three patents (6,265,509; 6,356,509; 20040211982A1) in the field of IC design. Currently, he is a Professor in the college of Computer and Information Technology, University of Science and Technology, Jordan. His research interests include circuits and architectures for low-power and high-performance VLSI. Dr. Abdel-hafeez is a former chairman of the computer engineering department.

Ann Gordon-Ross (M'00) received her B.S. and Ph.D. degrees in computer science and engineering from the University of California, Riverside, CA, USA, in 2000 and 2007, respectively. She is currently an Associate Professor of electrical and computer engineering with the University of Florida, Gainesville, FL, USA. She is also a Faculty Advisor of the women in Electrical and Computer Engineering and the Phi Sigma Rho National Society for women in engineering and engineering technology. Her current research interests include embedded systems, computer architecture, low-power design, reconfigurable computing, dynamic optimizations, hardware design, real-time systems, and multicore platforms.

SUPPORTING INFORMATION

Additional supporting information may be found online in the Supporting Information section at the end of this article.

How to cite this article: Abdel-hafeez S, Gordon-Ross A. Reconfigurable FIFO memory circuit for synchronous and asynchronous communication. Int J Circ Theor Appl. 2021;49:938–952. https://doi.org/10.1002/cta.2921