
Circuit Theory Apps - 2021 - Abdel‐hafeez - Reconfigurable FIFO memory circuit for synchronous and asynchronous

Received: 13 August 2020
Revised: 19 October 2020
Accepted: 23 November 2020
DOI: 10.1002/cta.2921
ORIGINAL PAPER
Reconfigurable FIFO memory circuit for synchronous and
asynchronous communication
Saleh Abdel-hafeez1,2 | Ann Gordon-Ross3
1 Department of Computer Engineering, Jordan University of Science and Technology, Irbid, Jordan
2 Sabbatical at Department of Computer Engineering, College of Computer, Qassim University, Qassim, Buraydah, Saudi Arabia
3 Department of Electrical and Computer Engineering, University of Florida (UF), Gainesville, Florida, USA
Correspondence
Saleh Abdel-hafeez, Department of
Computer Engineering, Jordan University
of Science and Technology, Irbid 22110,
Jordan and Sabbatical at Department of
Computer Engineering, College of
Computer, Qassim University, Qassim,
Buraydah, Saudi Arabia.
Email: sabdel@just.edu.jo
Abstract
We present a new FIFO (first-in first-out) architecture for both synchronous
and asynchronous communication for high-speed and low-power operation.
Our FIFO design is reconfigurable and scalable using a separate datapath with
an 8T-Cell SRAM and control circuits, which enables specialization for different application requirements. The datapath uses a two-phase clock system of
nonoverlapping signals such that one signal increments the address pointer,
while the other signal activates the memory decoder for data reading and
writing. This structure halves the critical path delay and simplifies the
timing operations between the memory decoder and address pointer while
maintaining robustness against process-voltage-temperature (PVT) variations.
Our design uses two alternative control circuits to manage separate synchronous and asynchronous operations by generating nonoverlapping control signals that drive the datapath circuit. The empty-full flag circuitry records only
the state of the address pointers' rollover independent of the memory size, and,
thus, improves scalability and reconfigurability. Compared to prior works, our
design is 5X faster with 2.3X lower power consumption and has a throughput
of 1 Giga-Word/s for a 64-bit word size with a free latency cycle. Additionally,
our design functions clocklessly with the synthesizable structure for asynchronous communication that leverages Internet of Things (IoT) and Networks on
Chip (NoCs) applications.
KEYWORDS
8T-Cell SRAM, asynchronous and synchronous controlled datapath, FIFO, high-speed and low-power, IoT and NoCs, two-phased clock
1 | INTRODUCTION
Most modern integrated circuits (ICs) have first-in first-out (FIFO) buffers that orchestrate handshaking of information
between two different communicating components. FIFOs can either operate asynchronously (independent of any
clock signal) or synchronously with a clock signal. Since both implementations have advantages and disadvantages, the
greatest design consideration is to determine the most appropriate implementation based on system and/or application
requirements. Typically, synchronous FIFOs are used in microprocessors to manage handshaking communications
between on-chip components, such as queue management,1 reorder buffers,2 arithmetic logic units (ALUs),3 etc., and
more recently, in Artificial Intelligence (AI),4,5 graphics cards,6,7 etc. These synchronous implementations leverage the
existing clock generator resources within the chip (e.g., delay-locked loop (DLL) and phase-locked loop (PLL)) and the
© 2021 John Wiley & Sons, Ltd.
wileyonlinelibrary.com/journal/cta
Int J Circ Theor Appl. 2021;49:938–952.
same on-chip component and fabrication technology. Consequently, synchronous FIFOs typically operate at very high
speeds on the order of 0.5 GHz.
Alternatively, asynchronous FIFOs are typically used in systems that require very low power consumption where the
frequency is usually in the range of MHz. These FIFOs usually orchestrate communication between components that
may be using different fabrication technology. Many asynchronous FIFOs play a major role in the Internet of Things
(IoT)8 and communication devices such as routers, Networks on Chip (NoCs),9 globally asynchronous locally synchronous (GALS) networks, etc.10,11
In order to flip a FIFO's operation between synchronous and asynchronous, an entirely new circuit design is
required, which leaves little design flexibility. In this work, we extend prior work on FIFO design1,8 and present a flexible, reconfigurable FIFO that can be easily flipped between synchronous and asynchronous. The datapath remains the
same for both synchronous and asynchronous operation using an arbitrarily sized SRAM-based memory with two separate read and write ports, while the control unit activates different signals based on the operation configuration
(i.e., synchronous or asynchronous). Minimal circuitry changes are required when changing the memory size. The
SRAM uses 8T-Cell technology that is commonly known for high-speed, low-power operation with adaptability to low
power supply voltage for continued technology scaling.12,13
The read and write circuitry uses separate address-decoder and address-pointer signals with two mutually exclusive
activation signals that make the memory operation timing simple and reliable with cost-effective reconfigurability, a
factor that is needed in continued chip technology scaling.14 Our design obviates the need for any special delay circuitry
in the control unit that would require a large design layout cost and does not require brute-force D-type flip-flops
(DFFs) to capture unsynchronized data with respect to the clock signal, as is required in other designs,15,16 which usually has a negative impact on throughput and performance.
A novel feature of our proposed design is the method for generating the empty-full flag signal using circuitry that
checks for an address roll-over instead of tracking every storage address or data register in the FIFO. The empty-full flag
requires only four DFFs, regardless of the memory size, thus limiting the cost and hardware overheads associated with
larger FIFOs. This structure eliminates any arithmetic operation circuitry, such as a subtractor and gray code
mapping,17 a long right-left shift register,18 or a logic-expensive up/down counter,19 and obviates the chain of DFFs
required in other designs.20
Two control units are used for synchronous and asynchronous operation. The synchronous control circuit uses the
main system clock (CLK) to generate two phase-clock signals (CLKN and CLKP) that are the exact complement of each
other and whose edges are isolated with an idle-margin. CLKN and CLKP are generated from CLK using a simple feedback
NAND circuit21 or a more advanced technique22–24 that can be susceptible to process-voltage-temperature (PVT) variations.
However, the design goal is not to eliminate the PVT variations, but rather to preserve the formation of the mutually exclusive signals with an idle-margin for any of the corners' technologies (e.g., Fast-Fast, Slow-Slow, Typical-Typical, etc.).
Asynchronous FIFO operation is based on a handshaking protocol between the sender (active) and receiver
(passive) and is synchronized by request and acknowledge handshaking signals.25,26 The control design uses cascaded
DFFs as the delay element for generating the mutually exclusive signals with an idle-margin, wherein the cascaded
DFFs depend on the active signals' edges, a property that maintains the two generated signals within a certain pulse
range and preserves the idle-margin between request and acknowledge. The request and acknowledge signals have a
slightly longer cycle time due to the delay required for the signal to propagate across the channel and reach the other
communicating component. Due to these design aspects, the asynchronous control circuitry trades off low power
consumption for low-speed operation.
In summary, to alleviate some drawbacks in previous designs, such as high power consumption, low-speed operation, multicycle latency computation, custom structures that are unsuitable for continued technology scaling, long time
to market due to synchronous and asynchronous handshake communication, and irregular VLSI configurability to
serve a wide range of applications, in this paper we leverage low-cost standard CMOS cells at 65 nm and 1 V power supply to architect our proposed FIFO circuit design with the following key features:
1- Our design is appropriate for a wide range of synchronous and asynchronous communication handshaking applications using a divided control unit with asynchronous and synchronous circuits. The datapath unit remains the same
for both asynchronous and synchronous control operations.
2- Our design is robust against PVT variations since the memory array and address counter are activated from two
opposite phases with nonoverlapping margin of the control signals, which are generated from the control circuits,
similar to state-of-the-art ARM architectures.
1097007x, 2021, 4, Downloaded from https://onlinelibrary.wiley.com/doi/10.1002/cta.2921 by Tubitak Ulakbim, Wiley Online Library on [06/10/2023]. See the Terms and Conditions (https://onlinelibrary.wiley.com/terms-and-conditions) on Wiley Online Library for rules of use; OA articles are governed by the applicable Creative Commons License
ABDEL-HAFEEZ AND GORDON-ROSS
3- Our design has a high degree of configurability, which shortens the time-to-market window due to the novelty of
the empty-full flag detector circuit that is independent of the depth of the memory array. In addition, our design
components are constructed using a standard cell library without the need for a specialized ASIC design. Thus, our
design is compatible with Electronic-Design-Automation (EDA) tools.
4- Our design has a free latency cycle and enjoys high-speed operation with a throughput of 1 Giga-Word/s with a
64-bit word size for the synchronous handshake communication. In addition, our design can be leveraged in a wide
range of asynchronous low-speed handshaking communications that are prevalent in many IoT and NoC applications. Additionally, our design provides reliability in cases where the handshaking is disconnected during the transfer operation.
5- Our design provides efficient low power consumption due to several factors: low power supply voltage, a memory
cell (8T-Cell) that obviates sensing amplifiers and has separate read/write ports, the design activations spread over
two phases of nonoverlapping signals, and simple CMOS logic gates with minimum geometry sizes and maximum
fan-in/fan-out of four.
6- Our design affords continued technology scaling, which is considered a high degree of design geometry scalability
since the FIFO circuit constitutes a low-cost standard CMOS cell library. Additionally, the memory array is based on
the Intel 8T-Cell, which is attractive for more advanced technologies.
The remainder of this paper is organized as follows. Section 2 presents the datapath circuitry along with the activation signals, while Section 3 analyzes
the critical path based on a one-unit gate delay for the read and write operations. This critical path determines the minimum assertion time of the nonoverlapping signals to ensure proper read or write operation. Section 4 presents
the synchronous control circuit, while Section 5 presents the asynchronous control circuit.
2 | DATAPATH CIRCUIT DESIGN
Figure 1 depicts the architectural hardware block structure for our FIFO's datapath, which is the same for both synchronous and asynchronous operation. Table 1 depicts the input/output signals' abbreviations and descriptions. The
datapath memory is an SRAM 8T-Cell array with two separate read and write ports to provide independent operations.12,13 Our design's read and write operations use separate address-pointers and address-decoders that operate as
mutually exclusive timing events. This mutual exclusion ensures that the address-pointer increments the address while
the address-decoder is disabled, and similarly, the address-decoder activates the memory's row while the address-pointer is disabled, thus there is no overlapping activity. Therefore, the datapath operates with a two-phase clocking
system similar to the ARM architecture,14 where one phase is responsible for activating the address-pointer, while the
other phase is responsible for activating the address-decoder.
FIGURE 1 Architectural diagram for our proposed first-in first-out (FIFO)'s asynchronous and synchronous datapath
TABLE 1 Datapath signal definitions
Input-output signals    Representations
WAP                     Write address pointer
WAD                     Write address decoder
RAP                     Read address pointer
RAD                     Read address decoder
EMPTY                   Flag for the empty buffer
FULL                    Flag for the full buffer
DIN [63:0]              Input data bus of size 64 bits
DOUT [63:0]             Output data bus of size 64 bits
This structure also alleviates the race conditions that conventional memories can be sensitive to between the
address-pointer and address-decoder circuits. Those conventional circuits include a delay circuit to ensure that there is
a large enough time margin to stabilize the address-decoder value so that the correct memory word line is activated and
memory glitches are avoided. However, due to PVT variations, this race condition in the delay circuitry continues to worsen,
becoming more difficult to ensure correct operation.
To provide high-speed operation for high-performance applications, our design's address-pointers are constructed
using a simple look-ahead parallel-state up counter, which operates at GHz speed with low power operation.27 The
counter is triggered with opposite (nonoverlapping) address-pointer (the write address-pointer [WAP] and read
address-pointer [RAP]) and address-decoder signals (the write address-decoder [WAD] and read address-decoder
[RAD]). The address-pointers are up-counters that roll over to the counter's minimum value when the counter reaches
the counter's saturation point (i.e., the maximum value defined by the memory's logarithmic depth). Using only an up-counter simplifies the address-pointer's circuit as compared to prior work that used up-down counters, which imposed
large circuit area and slow performance with higher power consumption. Our address-decoder uses simple NAND logic
gates with a prefix-tree structure where the last NAND gate's stage is gated with the nonoverlapping control signals
(WAD/RAD) in order to balance the critical path within each active signal, as is analyzed in the datapath timing
(Section 3).
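The pointer and decoder behavior described above can be sketched as a small behavioral model. This is an illustrative Python sketch, not the authors' circuit: the class name `AddressPointer`, the function `decode`, and the power-of-two depth restriction are assumptions introduced here, and the model ignores gate-level timing entirely.

```python
from typing import List


class AddressPointer:
    """Behavioral up-counter that rolls over to 0 after reaching depth - 1."""

    def __init__(self, depth: int) -> None:
        # Assumption: depth is a power of two, so the counter needs log2(depth) bits.
        if depth < 2 or depth & (depth - 1):
            raise ValueError("depth must be a power of two (2 or larger)")
        self.depth = depth
        self.value = 0

    def pulse(self) -> bool:
        """Apply one pointer-phase pulse (WAP or RAP); return True on roll-over."""
        rolled_over = self.value == self.depth - 1
        self.value = (self.value + 1) % self.depth
        return rolled_over


def decode(address: int, depth: int, enable: bool) -> List[int]:
    """Return one-hot word lines; every line stays low while the decoder phase is low."""
    return [int(enable and row == address) for row in range(depth)]


# Usage: four pointer pulses on a 4-row memory wrap the pointer back to row 0.
wap = AddressPointer(depth=4)
rollovers = [wap.pulse() for _ in range(4)]
word_lines = decode(address=2, depth=4, enable=True)
```

The `enable` argument stands in for the WAD/RAD gating of the last decoder stage: the pointer value may change freely while `enable` is low because no word line can be selected during that phase.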
Figure 2 depicts our empty-full flag circuitry that is designed to detect the memory status with respect to the
memory's size. The novelty of this design is the circuitry's ability to reconfigure for different memory sizes without
requiring additional DFFs or a newly restructured logic circuit, as compared to prior works that used either an up-down, ring, or gray code counter. The empty-full flag detects the status of the memory based on the value of the read
and write address-pointers, which dictates the empty-full flags' values.
The empty-full flag circuitry is constructed using two serially connected DFFs for each operation (i.e., one pair of
DFFs for the read operation and another pair for the write operation). Both pairs of serially connected DFFs are initialized to “10.” The least significant bits (LSBs) are XNOR'ed to evaluate the flags' statuses in combination with the read
and write address-pointers. If the address-pointers' values are equal and the serially connected DFFs' LSBs are equal,
the empty flag is asserted, and alternatively, if the address-pointers' values are equal and the serially connected DFFs'
LSBs are different, full is asserted. When the associated read/write address-pointer reaches the full state, then the associated pair of serially connected DFFs changes state to “01” on the rising RAD/WAD edge, which detects the roll-over
case between the two address pointers.
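The roll-over based flag logic can be illustrated with a short behavioral model. This is a hedged sketch rather than the circuit of Figure 2: the class `FifoFlags` and its method names are hypothetical, and it assumes that each DFF pair behaves as a two-bit ring that swaps its contents on every roll-over (the text states only the first transition, from "10" to "01").

```python
class FifoFlags:
    """Empty/full detection from pointer equality plus one roll-over bit per side."""

    def __init__(self, depth: int) -> None:
        self.depth = depth
        self.wptr = 0
        self.rptr = 0
        # Two serially connected DFFs per side, both initialized to "10".
        self.wpair = (1, 0)
        self.rpair = (1, 0)

    @staticmethod
    def _swap(pair):
        # Assumption: the pair acts as a 2-bit ring, "10" -> "01" -> "10" ...
        return (pair[1], pair[0])

    @property
    def empty(self) -> bool:
        # Pointers equal and LSBs equal (XNOR is 1) means empty.
        return self.wptr == self.rptr and self.wpair[1] == self.rpair[1]

    @property
    def full(self) -> bool:
        # Pointers equal and LSBs different (XNOR is 0) means full.
        return self.wptr == self.rptr and self.wpair[1] != self.rpair[1]

    def write(self) -> None:
        if self.full:
            raise OverflowError("write attempted while FULL is asserted")
        self.wptr += 1
        if self.wptr == self.depth:  # write pointer rolls over
            self.wptr = 0
            self.wpair = self._swap(self.wpair)

    def read(self) -> None:
        if self.empty:
            raise IndexError("read attempted while EMPTY is asserted")
        self.rptr += 1
        if self.rptr == self.depth:  # read pointer rolls over
            self.rptr = 0
            self.rpair = self._swap(self.rpair)
```

Only the two pair registers carry state beyond the pointers themselves, so changing `depth` changes the pointer width but not the flag logic, which is the scalability property the text describes.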
3 | DATAPATH TIMING ANALYSIS
We evaluate the scalability of our approach using the timing of all critical paths with respect to a one-unit gate delay
(GD), which provides an analysis that is independent of technology factors for direct comparison purposes. Our design
uses only a basic CMOS logic gate structure with basic width/length sizes for design layout clarity and cost-effectiveness
for continued technology scaling. We evaluate our FIFO's design using an 8T-Cell SRAM structure of size
64-row × 64-bit, which is a common size for many synchronous and asynchronous applications.
FIGURE 2 Proposed empty-full flag circuitry
As discussed in Section 2, the datapath is activated by two nonoverlapping signals—the WAP signal is nonoverlapped with the WAD signal for the write operation, and the RAP is nonoverlapped with the RAD for the read
operation. This mutually exclusive behavior halves the critical path delay and precludes the critical path from accumulating all delays from the input-to-output for a complete cycle. Thus, the worst-case delay is the longest half-cycle delay
of the WAP, WAD, RAP, or RAD signals. To the best of our knowledge, our design is the first implementation of a FIFO
that uses a two-phase clocking methodology. To prevent falling and rising edges of signals from intersecting with each
other, a delay margin, referred to as the idle-margin, between the WAP and WAD signals or between the RAP and
RAD signals is guaranteed by the control unit, and has a worst-case timing of 3 GDs:
$$\text{Idle-margin} = 3\ \text{GDs} \tag{1}$$
Essentially, there are three GDs between the rising/falling and falling/rising of the RAP and RAD signals, and
similarly, there are three GDs between the rising/falling and falling/rising of the WAP and WAD signals. The
idle-margin is maintained to avoid memory race conditions, such that the address-decoder deassertion has completely
settled before the address-pointer is asserted and vice versa. This idle-margin can vary from 1.5 to 3 GDs due to PVT
variations; however, the goal is to maintain this idle-margin as greater than 1 GD as a minimum timing requirement to
disable the address-decoder circuit and enable the address-pointer circuit or vice versa. Therefore, the PVT variation is
not an issue as long as the two signals (WAD/WAP) and (RAD/RAP) have a minimum idle-margin greater than 1 GD
for all different technology corners.
Figure 3 depicts the critical paths for write and read operation, wherein the write operation begins when the WAP
signal triggers the write address-pointer circuit, which is a parallel up-counter constructed as a state-lookahead structure with a counting delay of a single DFF. Thus, the delay of the address-pointer's circuit is approximated as 2 GDs as
a precaution for the worst-case timing scenario:
$$\text{Address-pointer} = 2\ \text{GDs} \tag{2}$$
Additionally, a portion of the address-decoder's logic, except for the last stage, is integrated into the address-pointer
stage, giving an additional 3 GDs. Thus, the total GD for generating the memory word line address prior to the last stage
of the address-decoder is
$$\text{W-1} = \text{Address-pointer} + 3\ \text{GDs} = 5\ \text{GDs} \tag{3}$$
FIGURE 3 Critical paths for the read and write operations
Furthermore, tracing the critical path from the WAP signal in the opposite direction (i.e., the empty-full flag
circuitry), there are two branches. One branch is directed to the XNOR logic (W-2A) and the other branch is directed to
the large fan-in AND gate (W-2B), resulting in
$$\text{W-2A} = \text{Address-pointer} + 2\ \text{GDs (XNORs)} = 4\ \text{GDs} \tag{4}$$
$$\text{W-2B} = \text{Address-pointer} + 2\ \text{GDs (fan-in AND)} = 4\ \text{GDs} \tag{5}$$
Alternatively, the nonoverlapping WAD signal triggers the write address-decoder, which is part of the last stage of
the address-decoder and is assumed as 1 GD. Thus, following W-3 for the write operation, 6 GDs are required to turn
on the selected memory row and one additional GD is required to store the data into the selected row since the data are
already on the memory bus and waiting for the selected row to be activated. Therefore, the critical path delay starts
when WAD is activated and ends when the data are stored into the selected row:
$$\text{W-3} = 1\ \text{GD} + 6\ \text{GDs} + 1\ \text{GD} = 8\ \text{GDs} \tag{6}$$
Comparing Equations 3, 4, 5, and 6, the write operation's total delay (Equation 6) is the worst of the nonoverlapping
signals' path delays, plus the addition of the idle-margin delay:
$$\text{Write-Delay} = \text{W-3} + \text{idle-margin} = 8\ \text{GDs} + 3\ \text{GDs} = 11\ \text{GDs} \tag{7}$$
As a result, the WAD signal is designed to be active for a minimum of 11 GDs, while the WAP signal is deasserted,
and vice versa, the WAP signal is asserted for a minimum of 11 GDs while the WAD signal is deasserted. Therefore,
the write cycle time is approximated as 22 GDs, where 11 GDs are asserted and 11 GDs are deasserted. Including a
precaution time for slew rate effects of 3 GDs, we can assume the write cycle time is 25 GDs:
$$\text{Write-Cycle} = 2 \times \text{Write-Delay} + \text{Slew-Rate} = 25\ \text{GDs} \tag{8}$$
Similarly, the read operation has the same address-pointer delay as the write operation, starting from the RAP signal initiating
R-1, R-2A, and R-2B (Equations 3, 4, and 5 and Figure 3), since both are identical circuits but operate separately and
independently of each other. On the other hand, the read address-decoder path requires more delay than the write
address-decoder path (Equation 6). The additional delay for the read operation is the time required to read the data
from the selected row and store the data into the output latches. The selected row time is the R-3 delay path, which is
the same as W-3 depicted in Equation 6. Thus, tracing the RAD signal through R-3 and R-4 to read the data from the
memory is
$$\text{R-4} = \text{R-3} + 9\ \text{GDs} + 1\ \text{GD} = 18\ \text{GDs} \tag{9}$$
Therefore, the read operation's total time is the worst case of the nonoverlapping signals' path delays, which is
Equation 3 or Equation 9, plus the addition of the idle-margin delay:
$$\text{Read-Delay} = \text{R-4} + \text{idle-margin} = 18\ \text{GDs} + 3\ \text{GDs} = 21\ \text{GDs} \tag{10}$$
As a result, the RAD signal is designed to be asserted for a minimum of 21 GDs, while the RAP signal is deasserted,
and vice versa, the RAP signal is asserted for a minimum of 21 GDs while the RAD signal is deasserted. Including a precaution time of 3 GDs for slew rate effects, the read cycle time is 45 GDs:
$$\text{Read-Cycle} = 2 \times \text{Read-Delay} + 3\ \text{GDs} = 45\ \text{GDs} \tag{11}$$
Since homogeneity is required in comparing the write cycle time (Equation 8) and the read cycle time
(Equation 11), we consider the worst case delay (Equation 11) for both the read and write operations. That is, the WAP
and WAD signals have a nonoverlapping cycle time of 45 GDs, and the RAP and RAD signals have a nonoverlapping
cycle time of 45 GDs.
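The gate-delay budget of Equations 1 to 11 can be reproduced with a few lines of arithmetic. The sketch below is only a bookkeeping aid under the delay values stated in this section; the helper names `phase_delay` and `cycle_time` are introduced here for illustration.

```python
IDLE_MARGIN = 3       # Equation 1, in gate delays (GDs)
ADDRESS_POINTER = 2   # Equation 2
SLEW_ALLOWANCE = 3    # slew-rate precaution used in Equations 8 and 11

# Pointer-phase paths, identical for the write (W) and read (R) sides.
path_1 = ADDRESS_POINTER + 3    # Equation 3: W-1 / R-1
path_2a = ADDRESS_POINTER + 2   # Equation 4: W-2A / R-2A (XNORs)
path_2b = ADDRESS_POINTER + 2   # Equation 5: W-2B / R-2B (fan-in AND)

# Decoder-phase paths.
w3 = 1 + 6 + 1     # Equation 6: last decoder stage + row activation + store
r4 = w3 + 9 + 1    # Equation 9: R-3 equals W-3, plus read-out and output latch


def phase_delay(paths):
    """Worst path of either phase plus the idle-margin (Equations 7 and 10)."""
    return max(paths) + IDLE_MARGIN


def cycle_time(delay):
    """Two equal phases plus the slew-rate allowance (Equations 8 and 11)."""
    return 2 * delay + SLEW_ALLOWANCE


write_delay = phase_delay([path_1, path_2a, path_2b, w3])
read_delay = phase_delay([path_1, path_2a, path_2b, r4])
write_cycle = cycle_time(write_delay)
read_cycle = cycle_time(read_delay)

# Both operations adopt the slower (read) cycle for homogeneity.
common_cycle = max(write_cycle, read_cycle)
print(write_delay, write_cycle, read_delay, read_cycle, common_cycle)
```

Taking the maximum over all paths makes explicit that the decoder-phase path (W-3 or R-4) dominates the pointer-phase paths in both operations.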
We evaluate the actual physical delay using a 65 nm CMOS technology operating with a 1 V power supply. Each
standard GD can be approximated as 0.005 ns,28 but we assume 0.02 ns as a precaution measure even though we use
the minimum sizes and distribute all fan-in and fan-out circuits for a four-gate tree structure. Therefore, the actual
physical cycle time of 45 GDs is
$$\text{Cycle-Time} = 0.02\ \text{ns/GD} \times 45\ \text{GDs} = 0.9\ \text{ns} \tag{12}$$
In summary, the datapath circuit can safely operate at approximately 1 GHz, which could be further improved by
using a faster SRAM read access scheme, as is widely available in the literature,12,13 or by migrating the design to a more advanced CMOS technology.
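As a worked check of Equation 12, the conversion from gate delays to physical time and word rate can be sketched as follows. The constant and function names are introduced here for illustration, and the raw bit rate is simply the word rate multiplied by the 64-bit word size, a derived figure rather than a measured result.

```python
GD_NS = 0.02      # conservative gate delay assumed in the text, in ns
CYCLE_GDS = 45    # worst-case cycle time from Equation 11, in GDs
WORD_BITS = 64    # word size of the evaluated FIFO


def cycle_time_ns(cycle_gds: int, gd_ns: float) -> float:
    """Physical cycle time in ns for a cycle expressed in gate delays."""
    return cycle_gds * gd_ns


def words_per_second(cycle_gds: int, gd_ns: float) -> float:
    """One word is transferred per cycle, so the rate is 1 / cycle time."""
    return 1.0 / (cycle_time_ns(cycle_gds, gd_ns) * 1e-9)


t_cycle = cycle_time_ns(CYCLE_GDS, GD_NS)
rate = words_per_second(CYCLE_GDS, GD_NS)

print(f"cycle time  : {t_cycle:.2f} ns")
print(f"word rate   : {rate / 1e9:.2f} GWord/s")
print(f"raw bit rate: {rate * WORD_BITS / 1e9:.1f} Gbit/s")
```

With the conservative 0.02 ns gate delay, the resulting word rate lands slightly above 1 GWord/s, which is consistent with the approximately 1 GHz operating point stated above.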
4 | SYNCHRONOUS CONTROL CIRCUIT
The objective of the synchronous control circuit is to generate the nonoverlapping, two-phase signals CLKN and CLKP, which are derived from the main system clock CLK, and represent CLK's negative and positive variations, respectively. The ideal case is depicted in Figure 4. CLKP and CLKN are at the opposite phase of each other with an idle-margin between the transition edges. The CLK source and traditional back-to-back NAND two-phase clocking systems21 are usually susceptible to PVT variations, as well as noise jitter variation. All of these variations can have a negative impact on the duty cycle of the two phases and the nonoverlapping margin, thus negatively impacting overall circuit timing performance. However, the idea is not to eliminate the PVT variations, but rather to preserve consistent generation of the signals and, consequently, the idle-margin. Therefore, as long as CLKP and CLKN are strictly opposite for a minimum of 21 GDs with an idle-margin of at least 3 GDs (as derived in Equation 10) for all technology corners, the circuit operation is stable and correct, with no need for additional large circuit cost to maintain correct operation.
FIGURE 4 Timing diagram depicting the nonoverlapping, two-phase signals with respect to the main system clock
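The nonoverlap requirement can be illustrated with a discrete-time behavioral model in which one list element represents one gate delay. This sketch is not the NAND-feedback generator of the cited work: it assumes a simpler textbook construction in which CLKP is CLK AND a delayed copy of CLK and CLKN is the NOR of the two, and the half-period of 24 GDs is an illustrative value chosen so that each phase is high for 21 GDs with a 3 GD idle-margin.

```python
from typing import List, Tuple


def two_phase(clk: List[int], margin: int) -> Tuple[List[int], List[int]]:
    """Derive nonoverlapping CLKP/CLKN from CLK using a delay of `margin` steps."""
    # Delayed copy of CLK; the first `margin` samples repeat the initial level.
    delayed = [clk[0]] * margin + clk[:len(clk) - margin]
    clkp = [int(c and d) for c, d in zip(clk, delayed)]          # high only when both are high
    clkn = [int(not c and not d) for c, d in zip(clk, delayed)]  # high only when both are low
    return clkp, clkn


def overlaps(a: List[int], b: List[int]) -> bool:
    """True if the two phases are ever high in the same time step."""
    return any(x and y for x, y in zip(a, b))


HALF_PERIOD = 24   # illustrative: 21 GDs active + 3 GDs idle-margin
MARGIN = 3         # idle-margin in GDs (Equation 1)

clk = ([1] * HALF_PERIOD + [0] * HALF_PERIOD) * 2
clkp, clkn = two_phase(clk, MARGIN)
```

After the first half-period, which is distorted by the initial condition of the delayed copy, every edge of CLK is followed by `MARGIN` steps in which both phases are low, so the address-pointer and address-decoder are never enabled together.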
In the literature, a delay-locked loop (DLL) circuit23,24 is used to control the PVT variation and preserve the clock skew;
however, due to the DLL's complexity, power consumption, and large silicon area, particularly in many-core architectures, we use a simpler circuit with similar results. Simulations of this circuit over all PVT corners with
nonideal clock signals22 show that it provides nonoverlapping phases with low values of root mean square (RMS) jitter.
Figure 5 depicts the control circuit for the synchronous FIFO operation with the input/output signals' abbreviations and descriptions defined in Table 2. WE is evaluated with the full flag signal (asserted by the empty-full flag circuitry), and the write-enable-clean (WEC) signal is asserted if the memory has enough room to store the data; otherwise, WEC is deasserted. Similarly, RE is evaluated with the empty flag signal, and the read-enable-clean (REC) signal is asserted if the memory has data to read; otherwise, REC is deasserted. As a precaution, if both RE and WE are asserted simultaneously, both REC and WEC are deactivated and the FIFO performs neither operation. Finally, gating CLKP/CLKN with WEC and REC generates WAD/WAP and RAD/RAP, respectively. These signals are inputs to the datapath unit and orchestrate all necessary read and write operations.
FIGURE 5 Synchronous first-in first-out (FIFO) operation's control circuit
TABLE 2 Control circuit signal definitions
Input-output signals    Representations
CLK                     System clock
WE                      Write enable
RE                      Read enable
RESET                   System reset
CLKN                    Negative phase of system clock
CLKP                    Positive phase of system clock
WEC                     Write enable clean
REC                     Read enable clean
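The enable-cleaning and clock-gating rules of the synchronous control circuit can be written as plain Boolean functions. This is a minimal sketch of the logic as described in the text, not a transcription of Figure 5. In particular, the text does not state which clock phase drives which strobe, so the mapping of CLKP to the decoder strobes (WAD/RAD) and CLKN to the pointer strobes (WAP/RAP) is an assumption made here.

```python
from typing import Dict, Tuple


def clean_enables(we: bool, re: bool, full: bool, empty: bool) -> Tuple[bool, bool]:
    """Return (WEC, REC) from the raw enables and the empty-full flags."""
    both = we and re                      # simultaneous requests: do neither operation
    wec = we and not full and not both    # write only if there is room
    rec = re and not empty and not both   # read only if there is data
    return wec, rec


def datapath_strobes(clkp: bool, clkn: bool, wec: bool, rec: bool) -> Dict[str, bool]:
    """Gate the two clock phases with the clean enables.

    ASSUMPTION: CLKP gates the address-decoders and CLKN gates the address-pointers.
    """
    return {
        "WAD": clkp and wec,
        "WAP": clkn and wec,
        "RAD": clkp and rec,
        "RAP": clkn and rec,
    }


# Usage: a write request to a non-full FIFO during the CLKP phase.
wec, rec = clean_enables(we=True, re=False, full=False, empty=True)
strobes = datapath_strobes(clkp=True, clkn=False, wec=wec, rec=rec)
```

Because CLKP and CLKN are never high together, the decoder and pointer strobes of each port inherit the same mutual exclusion regardless of which phase is assigned to which strobe.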
5 | ASYNCHRONOUS CONTROL CIRCUIT
Figure 6 illustrates the timing for asynchronous communication, with numbers indicating key transition points as described below. This timing is based on a handshaking protocol between the sender (active) and receiver (passive) components.25,26 The aim of the asynchronous control unit is to assert request-clean (REQC) upon receiving an assertion of REQ in order to start the read or write operation; deasserting REQC indicates the end of the read or write operation. Additionally, the control unit is required to assert acknowledge-clean (ACKC), which is mutually exclusive with REQC and maintains the idle-margin with the falling edge of REQC, thus avoiding race conditions. Figure 6 also illustrates the generation of the signals along with the signals' actions for asynchronous timing. REQ (1) initiates a handshake to the receiver, with the control circuit generating REQC (1′). The assertion of REQC (1′) is used to enable the datapath's memory for a read or write operation (i.e., enabling the address-decoder circuit), after which the control circuit deasserts REQC (1″) after a minimum pulse width given by Equation 10, which is 21 GDs. Next, the control circuit generates ACKC (2) with a sufficient idle-margin from REQC (1″), since ACKC is used to activate the address counting (i.e., ACKC activates the address-pointer and eliminates race conditions). Once the sender receives ACKC (2), the sender deasserts REQ (3), which resets the receiver, thus causing the falling edge of ACKC (4) and preparing for a new handshake cycle.
Figure 7 depicts the asynchronous circuit with conventional DFFs used as the delay necessary for generating REQC and ACKC. Even though DFFs are commonly known for “overall good” robustness against PVT variations and continued technology scaling,21–24 we do not propose to eliminate PVT variations but instead focus on preserving the shape of the generated signals, such that REQC and ACKC do not overlap with each other under any technology corner. For example, the REQC pulse (the rising 1′ to the falling 1″ in Figure 6) might vary from a shorter pulse width to a wider pulse width due to different technology corners (e.g., FF, SS, TT, SF, etc.); however, our design goal is to ensure that the minimum pulse width of REQC within all corners is not less than 21 GDs (Equation 10). Thus, the read or write operation occurs during this pulse with enough time to process and complete. Additionally, several cascaded DFFs are added between ACKC (2) and REQC (1″) (Figure 6) to ensure that the idle-margin is preserved for a minimum of 3 GDs for all technology corners.
Initially in Figure 7, all DFFs are at a reset state with deasserted outputs. Once the asserted REQ is detected, REQC
is asserted and activates the memory cells for a read or write operation. After the delay chain of DFFs, which is equal to
the write or read access time (21 GDs), REQC is deasserted (i.e., feedback line-1 in Figure 7) in order to close the
memory cells, such that no data will be written or read. Then, within the idle-margin (i.e., two cascaded DFFs), ACKC is
asserted and propagated across the channel to inform the active component that the operation has completed. Once the
active component receives the asserted ACKC, the active component deactivates REQ back to the passive component.
FIGURE 6 Generating a request-clean signal (REQC) and an acknowledge-clean signal (ACKC) within asynchronous communication timing diagram. Numbers indicate key transition points, as described
FIGURE 7 Asynchronous cascaded DFFs circuit for generating a request-clean signal (REQC) and an acknowledge-clean signal (ACKC)
FIGURE 8 Asynchronous control logic with clockless input/output signals to the datapath
Once the circuit in Figure 7 receives the falling edge of REQ, the circuit resets all DFFs, and thus, ACKC is deactivated.
Subsequently, the asynchronous control circuit returns back to the circuit's initial state and waits for a new request.
Figure 8 depicts the overall circuitry of the asynchronous control logic, with the abbreviations and descriptions of the clockless input/output signals defined in Table 3. During the write operation, the sender asserts the write-request (WREQ) signal, whereupon the receiver asserts the write-request-clean (WREQC) signal, which is forwarded to the datapath unit as WAD to activate the write address-decoder and thus the memory for a write operation. Once WREQC is deasserted, WAD is deasserted, which deactivates the memory's write operation. After an idle-margin, the write-acknowledge-clean (WACKC) signal is asserted, which activates WAP and thereby the up counter for the address-pointer in the datapath unit. Finally, the sender deactivates WREQ upon receiving the asserted WACKC, and thus the receiver's control unit resets all of the DFFs and deasserts WACKC to wait for a new handshake operation. The read operation proceeds similarly to the write operation.
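The order of events in one write handshake can be traced with a short behavioural sketch in Python. The sketch only models the order of signal changes described above, not gate-level timing; the function name `write_handshake_trace` and the step labels are hypothetical, and WAD and WAP are treated as direct copies of WREQC and WACKC.

```python
def write_handshake_trace():
    """Return the ordered signal states of one write handshake (behavioural model)."""
    signals = {"WREQ": 0, "WAD": 0, "WAP": 0}
    trace = []

    def step(label, **changes):
        # Apply the signal changes for this event and record a snapshot.
        signals.update(changes)
        trace.append((label, dict(signals)))

    step("idle: all DFFs reset")
    step("sender asserts WREQ; WREQC drives WAD (memory write active)", WREQ=1, WAD=1)
    step("access time elapsed; WREQC and WAD deasserted", WAD=0)
    step("idle-margin elapsed; WACKC drives WAP (pointer increments)", WAP=1)
    step("sender sees WACKC and deasserts WREQ", WREQ=0)
    step("control unit resets DFFs; WACKC and WAP deasserted", WAP=0)
    return trace


trace = write_handshake_trace()
for label, state in trace:
    print(f"{state}  {label}")
```

In every recorded state at most one of WAD and WAP is asserted, which is the mutually exclusive behaviour that the datapath relies on, and the final state equals the initial idle state.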
6 | RESULTS AND COMPARISON ANALYSIS
We implemented and tested a datapath unit of size 64-row × 64-bit using 8T-Cell SRAM, which is a common FIFO size in many chips.1–4,7,8 Additionally, two control units, one with synchronous and the other with asynchronous circuitry, were implemented and integrated with the datapath unit for testing purposes. We used a cost-effective, transistor-level 65-nm CMOS technology from Taiwan Semiconductor Manufacturing Company (TSMC) with a 1 V power supply.28 We gathered timing delay values, total power consumption, and total transistor counts using HSPICE29 simulations.
TABLE 3 Asynchronous control circuit signal definitions
Input–output signal | Representation
WREQ | Write request
WREQC | Write-request-clean
RREQ | Read request
RREQC | Read-request-clean
WACK | Write acknowledge
WACKC | Write-acknowledge-clean
RACK | Read-acknowledge
RACKC | Read-acknowledge-clean
RESET | System reset
The datapath's worst case critical path delay, measured for the read operation, corresponds to a maximum operating frequency of 1.25 GHz (a 0.8 ns cycle time), which is very close to the cycle time derived in Equation 11. Equation 11 gives more conservative results since we assume the worst case scenario, which adds an additional safety margin delay. In both cases, the datapath can safely operate at 1 GHz for both the read and write operations. Consequently, the complete cycle of the two nonoverlapping signals (WAD/WAP and RAD/RAP) that orchestrate the write and read operations is completed within 1 ns with a slew rate of 0.1 ns/V. Furthermore, the critical path illustrated by Equation 11 affects the scalability: as the storage area (FIFO array) increases, the operating frequency decreases, and vice versa. This relationship between scalability and frequency is due only to the FIFO array, not the control unit or the empty-full flag circuitry. As previously mentioned, the empty-full flag circuitry depends on the rollover of the address pointers rather than on the depth of the FIFO array. In addition, the control unit circuitry always maintains the two-phase clocking system with a nonoverlapping margin, where each phase is sized by the worst case critical path delay plus a safety margin for the slew rate, as shown in Equation 11. As a result, the FIFO array of 8T-Cells determines the maximum running frequency. Currently, FIFOs up to a 10-word depth can be considered sufficient for high performance throughput30; however, for scalability to future applications, we propose a far more flexible FIFO design with a depth of 64 words, where each word is 64 bits. The scaling of 8T-Cell SRAM size against operating frequency across a wide range of CMOS technologies is a general issue for SRAM designs, and the literature provides various details on memory array scalability against operating frequency.12,13
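As an illustration of why flag detection based on pointer rollover does not grow with the FIFO depth, the Python sketch below uses the widely known wrap-bit scheme: each pointer carries one extra bit that toggles on rollover, and the flags follow from comparing the pointers and these bits. This is a behavioural sketch under that assumption, with the hypothetical class name `RolloverFlags`; it is not a description of the flag circuit itself.

```python
class RolloverFlags:
    """Empty/full detection from pointer equality plus one rollover bit per pointer."""

    def __init__(self, depth: int) -> None:
        self.depth = depth
        self.wptr = 0
        self.rptr = 0
        self.wroll = 0  # toggles each time the write pointer wraps to row 0
        self.rroll = 0  # toggles each time the read pointer wraps to row 0

    @property
    def empty(self) -> bool:
        # Same row and same number of rollovers (modulo 2): nothing stored.
        return self.wptr == self.rptr and self.wroll == self.rroll

    @property
    def full(self) -> bool:
        # Same row but the writer is one rollover ahead: every row is occupied.
        return self.wptr == self.rptr and self.wroll != self.rroll

    def write(self) -> None:
        if self.full:
            raise OverflowError("write to a full FIFO")
        self.wptr += 1
        if self.wptr == self.depth:
            self.wptr, self.wroll = 0, self.wroll ^ 1

    def read(self) -> None:
        if self.empty:
            raise IndexError("read from an empty FIFO")
        self.rptr += 1
        if self.rptr == self.depth:
            self.rptr, self.rroll = 0, self.rroll ^ 1


fifo = RolloverFlags(depth=64)
for _ in range(64):
    fifo.write()
print(fifo.full, fifo.empty)
```

Only the two rollover bits are added for flag detection, whatever value `depth` takes, whereas an occupancy counter needs on the order of log2(N) bits for N rows.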
Our design has several factors that help reduce the power consumption, with the major factor being that the datapath operates with a 1 V power supply. Additionally, all components are constructed using CMOS transistors with a 65 nm channel length and widths ranging from 3 to 5 μm, except for the inverter drivers, whose widths are Wp = 15 μm and Wn = 10 μm. Another major contributing factor for low power design is the use of an SRAM memory array that is based on the 8T-Cell with a standard geometry size on 65 nm from Intel.12 The 8T-Cell SRAM is commonly known for low power since this cell type operates with no biasing current. Additionally, the read and write ports are separated to avoid charging contention, which also obviates the need for a sense amplifier, a component that is considered a major source of power consumption. A further factor is related to the ARM architecture,14 which has circuit toggling for two nonoverlapping signals, where the rising and falling edges of the signals are separated with an idle-margin, thus minimizing the short-circuit path through the logic gates from the power supply to ground, which tends to reduce the dynamic power. Finally, our design ensures that not all components are active at the same time.
An additional advantage is that the datapath processes write and read operations without the latency that other designs4–20 usually incur by using brute-force, cascaded DFFs to synchronize the incoming data with the internal clock. Thus, the datapath achieves high overall performance and processes data on every cycle (i.e., every pair of nonoverlapping signals processes new data).
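The one-word-per-cycle behaviour can be sketched as a toy Python model in which each cycle consists of an address-decoder phase followed by an address-pointer phase. The class `TwoPhaseDatapath` and the helper functions are hypothetical names for illustration; the model ignores timing, the idle-margin, and the empty-full flags.

```python
class TwoPhaseDatapath:
    """Toy model: the AD phase accesses memory, the AP phase moves the pointer."""

    def __init__(self, rows: int = 64) -> None:
        self.mem = [None] * rows
        self.wptr = 0
        self.rptr = 0

    def wad(self, word) -> None:
        # Write phase 1: the write address-decoder is active and the word is stored.
        self.mem[self.wptr] = word

    def wap(self) -> None:
        # Write phase 2: the write address-pointer increments (with wrap-around).
        self.wptr = (self.wptr + 1) % len(self.mem)

    def rad(self):
        # Read phase 1: the read address-decoder is active and the word is read.
        return self.mem[self.rptr]

    def rap(self) -> None:
        # Read phase 2: the read address-pointer increments (with wrap-around).
        self.rptr = (self.rptr + 1) % len(self.mem)


def write_cycle(dp: TwoPhaseDatapath, word) -> None:
    """One write cycle: WAD pulse, then WAP pulse."""
    dp.wad(word)
    dp.wap()


def read_cycle(dp: TwoPhaseDatapath):
    """One read cycle: RAD pulse, then RAP pulse; returns the word read."""
    word = dp.rad()
    dp.rap()
    return word


dp = TwoPhaseDatapath()
for w in (10, 20, 30):
    write_cycle(dp, w)
out = [read_cycle(dp) for _ in range(3)]
```

Because the memory access and the pointer update never happen in the same phase, each cycle handles exactly one word and no extra synchronization stage appears in the model.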
Both the synchronous and asynchronous control unit circuits use standard-library digital logic gates with sizes similar to those of the datapath logic gates. The main objective of the control unit is to generate the two complemented signals for the read operation and the two complemented signals for the write operation with an idle-margin at the edges of the signals. The control unit circuit's design is not intended to eliminate PVT variations, but rather to preserve the complement of the two signals in accordance with the idle-margin. Variation in the duty cycle of the complemented signals is not important so long as the signals do not overlap, each active pulse lasts a minimum of 21 GDs, and there is a minimum idle-margin of 3 GDs. We conducted several HSPICE simulations at different technology corners
(e.g., FF, SS, TT, and SF) to evaluate the complement of the signals under several constraints. We experimentally verified the design's robustness and validated the generation of the complement signals.
Tables 4 and 5 summarize the comparison between several state-of-the-art synchronous and asynchronous FIFO structures, respectively. Key factors such as power consumption, operating speed, throughput per cycle, and reconfigurability are evaluated. This comparison evaluates the designs' complexities and scalabilities independent of the underlying technology since it is challenging to find comparable designs with the same technology parameters and specifics; however, the comparison still provides insights about relative power consumption, speed, scalability, and design complexity.
Table 4 summarizes the characteristics of the synchronous FIFO circuits that we compare to, listed by reference number. The design in Hsu et al.31 is usually classified as a very low power design since it operates at 0.4 V; however, that design trades speed for low power, with speeds ranging from 20 to 30 MHz, which limits its use to applications with these requirements and/or restrictions. Due to the lower power and performance, that design also incurs a large design cost for a self-timed management unit that controls the timing between the memory-decoder and address-pointer to overcome PVT variations that might violate the setup/hold time constraints at the gated clock memory component. Thus, that design is not compatible with EDA synthesis tools. Additionally, that design requires a costly up/down counter of size log2(N) for the empty-full flag circuitry, where N is the number of memory rows, which results in a nonscalable design.
The design in Rahmani et al.32 operates at the same 1 V power supply as our proposed design; however, that design operates 5X slower than our proposed work. In addition, it uses separate clocks for read and write operations, which requires the extra cost of a dual-port memory array. That design uses a Johnson counter for the empty-full flag circuit and an adder with a feedback register for the binary address-pointer to minimize the overhead of the DFFs and reduce power consumption. However, it incurs further design cost because the binary addresses must be converted to Gray code before they are synchronized with the corresponding clock domain and used for empty-full flag detection. Thus, the design is not cost efficient for scalability and requires ASIC components.
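For readers unfamiliar with the conversion step mentioned above, the following Python sketch shows the standard binary-to-Gray and Gray-to-binary mappings. It illustrates the general technique only and is not taken from the compared design; the function names are hypothetical, and the 7-bit range corresponds to the 128-row depth listed for that design in Table 4.

```python
def binary_to_gray(n: int) -> int:
    """Standard reflected Gray code: adjacent values differ in exactly one bit."""
    return n ^ (n >> 1)


def gray_to_binary(g: int) -> int:
    """Invert the mapping by XOR-ing together all right-shifted copies of g."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n


# A 7-bit pointer covers the 128 rows of the compared design.
for addr in range(4):
    print(f"{addr:07b} -> {binary_to_gray(addr):07b}")
```

Since only one bit changes between consecutive Gray-coded addresses, a pointer sampled in another clock domain is off by at most one position, which is why such designs pay for the extra conversion logic.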
The design in Taghi Adl and Mohammadi11 presents an elastic method for a dual-clock FIFO that operates across different clock domains and is compatible with EDA tools. The storage area consists of elastic buffers, where every buffer comprises three DFFs since it has three states, yielding a large storage area that results in large power consumption. In addition, the latency depends on the read and write activation scenario, which leads to uncertain throughput with large variations. The token ring address-pointer requires N buffers for N elastic buffers, which has the status of 2 N of data; still, the address pointer is not considered power efficient. The control unit consumes considerable power and requires a large sequential machine to control the status of the empty-full flag circuit and elastic buffers. Generally, the design in Taghi Adl and Mohammadi11 is efficient in neither power consumption nor high-speed operation, and it has low throughput as the penalty for managing two different operation cycles, which can instead be managed as we propose with our asynchronous FIFO structure.
TABLE 4 Comparison between prior works and our proposed synchronous FIFO design
Reference | Address-pointer | Control unit | Storage unit | Flag unit | Freq/voltage | Latency
31 | Up/down counter | Self-timed management units and power switched enable signal | 10T-cell SRAM, 256-row × 16-bit | Up/down indicators | 22.7 MHz/0.4 V | One data/cycle
32 | Adder and binary registers | Counter control unit | Dual-port RAM, 128-row × 32-bit | Johnson counter | 200 MHz/1 V | One data/cycle
11 | Ring-one-hot | Elastic modification module | Elastic registers | Slow sequential machine | 800–600 MHz/1 V | 3 to 4
Our work | Carry look-ahead state counter | Non-overlapping two-phase clock system with idle-margin between edges | 8T-cell SRAM, 64-row × 64-bit | Rollover detection and comparison circuit | 1 GHz/1 V | One data/cycle
Abbreviation: FIFO, first-in first-out.
TABLE 5 Comparison between prior works and our proposed asynchronous FIFO design
Reference | Address-pointer | Control unit | Storage unit | Flag unit | Freq/voltage/power/tech. | Latency
33 | Gray code with control timing circuit | Slow low-cost clock system generator with metastability avoidance circuit | 8T-cell SRAM, 128-row × 16-bit | Up/down indicators | 150 MHz/1 V/2.31 mW/28 nm | One data/three cycles
34 | Token ring with one-hot bubble-encoding and shift register | Global island clock system for read and write | High-cost N buffers-queue requiring N*(N-1) latches for N lanes | Large cost domino logic detectors to avoid circuit deadlocks | 200 MHz/1 V/3.7 mW/90 nm | One data/N-1 cycles
Our work | Carry look-ahead state counter | Clockless system with non-overlapping generation of request and acknowledge signals | 8T-Cell SRAM, 64-row × 64-bit | Rollover detection and comparison circuit | 1 GHz/1 V/2 mW/65 nm | One data/one cycle
Abbreviation: FIFO, first-in first-out.
Table 5 summarizes the characteristics of the asynchronous FIFO circuits that we compare to, listed by reference number. This comparison shares most of the datapath characteristics discussed for Table 4; therefore, we only discuss the new characteristics. The design in Keller et al.33 generates a variant of an internal clock signal to synchronize the incoming data with the internal clock. That design has extra cost due to Gray code conversion, which requires the capacity to be a power of two and thus limits the addresses. In addition, it imposes a latency of three cycles per datum to avoid metastability due to Gray code synchronization, which increases the data buffer area while still harming performance. Consequently, that design operates at a low frequency to alleviate all synchronization issues between the incoming data and the internally generated pausible clock.
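The power-of-two restriction can be seen with a short Python check that counts how many bits change when a Gray-coded pointer wraps from the last row back to row 0. This is a general property of the reflected Gray code and not an analysis of the compared circuit; the function names and the example depths are chosen here for illustration.

```python
def binary_to_gray(n: int) -> int:
    """Standard reflected Gray code."""
    return n ^ (n >> 1)


def wrap_bit_flips(depth: int) -> int:
    """Number of bits that change when the pointer wraps from depth-1 back to 0."""
    return bin(binary_to_gray(depth - 1) ^ binary_to_gray(0)).count("1")


for depth in (8, 128, 6, 100):
    print(depth, wrap_bit_flips(depth))
```

For the power-of-two depths (8 and 128) exactly one bit changes at the wrap, whereas for depths 6 and 100 three bits change at once, so a pointer sampled during the wrap in another clock domain could be read as an unrelated address.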
The design in Nguyen and Tran34 uses registers for data storage, with the global clock fed from a Globally Asynchronous Locally Synchronous (GALS) structure of NoCs. The address pointer is a costly sequence of cascaded FFs whose length depends on the data depth, wherein the circulating FFs are triggered by the global clock and the read and write operations take turns on different registers. That design suffers from a large cycle latency, and the effort to assure correct write and read functionality increases when one of them is far faster than the other; thus, performance degradation becomes prominent in that design. Subsequently, the data capacity, given the long latency, becomes a major limitation of that design. Another restriction is the use of bubble encoding for the token rings to detect the empty-full flag of the queue registers. Thus, the detector of that design is commonly known to have a large cost, requiring a special custom circuit to compensate for the different timing of the read and write operations.
In summary, the trade-off in the aforementioned comparable designs is the requirement of a Gray code address counter with a brute-force mechanism using serial DFFs to synchronize the incoming data with the edge of the generated internal clock. Thus, those structures require several clock cycles of latency to ensure correct read and write operations, which affects the overall performance. Furthermore, generating the internal clock requires long timing channel delays that consume a large amount of power, which could narrow the range of applications. On the contrary, our proposed work manages the handshaking signals in a completely clockless environment and communication channel link, such that intermediate request-clean and acknowledge-clean signals are generated as mutually exclusive events. The pulse width of the request-clean signal activates the read or write operation, while the acknowledge-clean signal activates the address counting operation. Therefore, this design avoids the trade-offs of previous structures concerning longer latency and higher power consumption. In addition, the design uses a standard-cell CMOS library that is synthesizable with EDA tools and does not require any special asynchronous circuit.
7 | CONCLUSION
In this work, we proposed a FIFO circuit design that is suitable for both asynchronous and synchronous application communications. Our design separates the control unit and datapath unit, which facilitates easy reconfigurability and scalability. In both types of applications, our proposed datapath circuit remains the same, while our proposed control circuit
comprises two separate internal structures that can handle the different handshaking communication protocols with a simple reconfiguration. The datapath is structured with a two-ported 8T-Cell SRAM-based memory array that operates on two-phase nonoverlapping signals, such that one signal activates the memory decoder, while the other signal activates the memory pointer. This design eliminates race conditions and eases the design of the control circuit structure. Both types of control circuits, asynchronous and synchronous, generate the appropriate nonoverlapping signals to the datapath based on the required handshaking communication protocol. Another unique feature of our circuit design is the method of detecting the status of the memory storage using a novel empty-full flag circuitry that counts the rollover of the memory pointer using only four D-type flip-flops (DFFs) regardless of the memory size, as compared to prior work that uses a large-area counter. This structure affords good scalability with minimized power consumption. Through extensive simulations, our results show that our datapath operates at 1 GHz and can process data (read or write) once per control cycle, which surpasses most state-of-the-art designs by 3X to 5X. Furthermore, our design operates with a 1 V power supply and offers continued technology scaling as an attractive feature for low-power design.
ORCID
Saleh Abdel-hafeez
https://orcid.org/0000-0003-2988-1609
REFERENCES
1. Gordon-Ross A, Abdel-hafeez S, Alsafrjalni MH. A one-cycle FIFO buffer for memory management units in Manycore System. IEEE
Computer Society Annual Symposium on VLSI, July 2019;265-270. https://doi.org/10.1109/ISVLSI.2019.00056
2. Bae Y, Park S, Park I. A single-chip programmable platform based on a multithreaded processor and configurable logic clusters. IEEE J
Solid-State Circuits. Oct. 2003;38(10):1703-1711. https://ieeexplore.ieee.org/document/1233767
3. Shibata N, Watanabe M, Tanabe Y. A current-sensed high-speed and low-power first-in-first-out memory using a Wordline/
Bitline-swapped dual-port SRAM cell. IEEE J Solid-State Circuits. June 2002;37(6):735-750.
4. Hu Y, Liang S, Yu J, Wang Y, Yang H. On-chip instruction generation for cross-layer CNN accelerator on FPGA. IEEE Computer Society
Annual Symposium on VLSI, July 2019.
5. Alcin M, Koyuncu I, Tuna M, Varan M, Pehljyan I. A novel high speed artificial neural network-based chaotic true random number generator on field programmable gate Array. Int J Circuit Theory Appl, Wiley Press. Nov 8, 2018;47(3):365-378. https://doi.org/10.1002/cta.2581
6. Teymouri M. A multipurpose circuit to read out and digitize pixel signal for low-power CMOS imagers. Int J Circuit Theory Appl, Wiley
Press. Aug 4, 2020;48(11):1887-1899. https://doi.org/10.1002/cta.2854
7. Zeinolabedin SMA, Zhou J, Liu X, Tae-Hyoung Kim T. An area- and energy-efficient FIFO design using error-reduced data compression
and near-threshold operation for image/video applications. IEEE Trans Very Large Scale Integr VLSI Syst. 2015;23(11):2408-2416.
8. Abdel-Hafeez S, Quwaider MQ. A one-cycle asynchronous FIFO queue buffer circuit. 11th International Conference on Information and
Communication Systems (ICICS), Irbid, Jordan, 2020;388-393. https://doi.org/10.1109/ICICS49469.2020.239548
9. Ezz-Edin R, El-Moursy MA, Hamed HFA. High throughput asynchronous NoCs design under high process variation. Integr VLSI J.
2015;49:1-13.
10. Ashour H. Design, simulation and realization of a parametrizable, configurable and modular asynchronous FIFO. 2015. https://ieeexplore.ieee.org/document/7237325
11. Taghi Adl SM, Mohammadi S. A high-performance dual clock elastic FIFO network Interface for GALS NoC. Microelectron J, Elsevier.
2018;76:69-80.
12. Nii K, Masuda Y, Yabuuchi M, et al. A 65 nm ultra-high-density dual-port SRAM with 0.71 μm² 8T-Cell for SoC. Symposium on
VLSI Circuits, Digest of Technical Papers, Honolulu, HI, USA, 2006;17-18. https://doi.org/10.1109/VLSIC.2006.1705344
13. Abdel-Hafeez S, Shatnawi M, Gordon-Ross A. A double data rate 8T-Cell SRAM architecture for systems-on-chip. IEEE 14th International Symposium on System-on-Chip 2012, Tampere, Finland, October 11-12, 2012.
14. Furber S. Chapter 4: ARM Organization and Implementation. ARM: system-on-chip architecture. 2nd ed. Harlow, England: Addison-Wesley; 2000;74-101. https://www.pearsoned.co.uk
15. Sheibanyrad A, Greiner A. Two efficient synchronous—Asynchronous converters well-suited for networks-on-Chip in GALS architectures. Integr VLSI J. 2008;41(1):17-26.
16. Panades IM, Greiner A. Bi-synchronous FIFO for synchronous circuit well suited for network-on-chip in GALS architectures. Proceedings of the First International Symposium on Networks-on-Chip (NOCS'07), May 2007;83-94. https://doi.org/10.1109/NOCS.2007.14
17. Chang MT, Huang PY, Hwang W. A robust ultra-low power asynchronous FIFO memory with self-adaptive power control. IEEE International SOC Conference, Newport Beach, CA, USA, 2008;175-178. https://doi.org/10.1109/SOCC.2008.4641505
18. Jeon D, Henry MB, Kim Y, et al. An energy efficient full-frame feature extraction accelerator with shift-latch FIFO in 28 nm CMOS.
IEEE J Solid-State Circuits. May 2014;49(5):1271-1283.
19. Chelcea T, Nowick SM. A low-latency FIFO for mixed-clock systems. Proceedings IEEE Computer Society Workshop on VLSI 2000.
System Design for a System-on-Chip Era, April 2000;119-126. https://doi.org/10.1109/IWV.2000.844540
20. Fattah M, Manian A, Rahimi A, Mohammadi S. A high throughput low power FIFO used for GALS NoC buffers. IEEE Computer
Society Annual Symposium on VLSI, July 2010.
21. Lin W, Black Jr. WC. A low-jitter skew-calibrated multi-phase clock generator for time-interleaved applications. Solid-State Circuits
Conference, 2001. Digest of Technical Papers. ISSCC. 2001, IEEE International; 2001;396-397.
22. Nowacki B, Paulino N, Goes J. A simple 1 GHz non-overlapping two-phase clock generators for SC circuits. 20th International Conference on Mixed Design of Integrated Circuits and Systems (MIXDES), June 20-22, Gdynia, Poland: IEEE; 2013;174-178.
23. Zhang D, Yang HG, Zhu W, et al. A multiphase DLL with a novel fast-locking fine-code time-to-digital converter. IEEE Trans Very Large
Scale Integr VLSI Syst. 2015;23(11):2680-2684.
24. Abdel-hafeez S, Harb SM, Lee KM. On-chip jitter measurement architecture using a delay-locked Loop with Vernier delay line to the
order of giga hertz. Proceedings of the 18th International Conference Mixed Design of Integrated Circuits and System (MIXDES), IEEE;
June 2011;502-506.
25. Martin AJ, Nyström M. Asynchronous techniques for system-on-chip design. Proc IEEE. 2006;94(6):1089-1120.
26. Kessels J. Register-communication between mutually asynchronous domains. IEEE International Symposium on Asynchronous Circuits
and Systems, March 2005;66-75. https://doi.org/10.1109/ASYNC.2005.27
27. Abdel-Hafeez S, Gordon-Ross A. A digital CMOS parallel counter architecture based on state look-ahead logic. J IEEE Trans Very Large
Scale Integr VLSI Syst. May 23, 2011;19(6):1023-1034.
28. 0.65 μm CMOS ASIC Process Digests, Taiwan Semiconductor Manufacturing Corporation, Hsinchu, Taiwan, 2005.
29. Synopsys. HSPICE, Mountain View, CA [Online]. 2016. Available: http://www.synopsys.com
30. Psarras A, Paschou M, Nicopoulos C, Dimitrakopoulos G. A dual-clock multiple-queue shared buffer. IEEE Trans Comput. Oct. 2017;66(10):1809-1815.
31. Hsu W, Huang P, Wu S, et al. 8nm ultra-low power near-/Sub-threshold first-in-first-out (FIFO) memory for multi-bio-signal sensing
platforms. International Symposium on Automation and Test VLSI Design (VLSI-DAT), Hsinchu, Taiwan, April 2016;1-4.
32. Rahmani A, Liljeberg P, Plosila J, Tenhunen H. Design and implementation of reconfigurable FIFOs for voltage/Frequency Island-based
networks-on-Chip. Microprocess Microsyst. June–July 2013;37(4-5):432-445.
33. Keller B, Fojtik M, Khailany B. A pausible bisynchronous FIFO for GALS systems. 21st IEEE International Symposium on Asynchronous Circuits and Systems, California, May 2015;1-8.
34. Nguyen TT, Tran XT. A novel asynchronous first-in-first-out adapting to multi synchronous network-on-chips. 2014 International Conference on Advanced Technologies for Communications (ATC 2014), Hanoi, Vietnam, Feb. 2014;365-370. https://doi.org/10.1109/ATC.2014.7043413
AUTHOR BIOGRAPHIES
Saleh Abdel-hafeez received his BSEE, MSEE, and Ph.D. in Computer Engineering in the
field of VLSI design. In 1997, he joined S3.inc as a member of their technical staff, where he
performed IC circuit design related to cache memory, digital I/O and ADCs. He has three patents (6,265,509; 6,356,509; 20040211982A1) in the field of IC design. Currently, he is a Professor
in the college of Computer and Information Technology, University of Science and Technology,
Jordan. His research interests include circuits and architectures for low-power and high-performance VLSI. Dr. Abdel-hafeez is a former chairman of the computer engineering
department.
Ann Gordon-Ross (M'00) received her B.S. and Ph.D. degrees in computer science and engineering from the University of California, Riverside, CA, USA, in 2000 and 2007, respectively. She is
currently an Associate Professor of electrical and computer engineering with the University of
Florida, Gainesville, FL, USA. She is also a Faculty Advisor of the women in Electrical and
Computer Engineering and the Phi Sigma Rho National Society for women in engineering and
engineering technology. Her current research interests include embedded systems, computer
architecture, low-power design, reconfigurable computing, dynamic optimizations, hardware
design, real-time systems, and multicore platforms.
SUPPORTING INFORMATION
Additional supporting information may be found online in the Supporting Information section at the end of this article.
How to cite this article: Abdel-hafeez S, Gordon-Ross A. Reconfigurable FIFO memory circuit for synchronous
and asynchronous communication. Int J Circ Theor Appl. 2021;49:938–952. https://doi.org/10.1002/cta.2921