An ATM Routing and Concentration Chip
for a Scalable Multicast ATM Switch
H. Jonathan Chao, Senior Member, IEEE, and Necdet Uzun, Member, IEEE
Abstract—We have proposed a new architecture for building a
scalable multicast ATM switch from a few tens to a few thousands
of input/output ports. The switch, called the Abacus switch,
employs input and output buffering schemes. Cell replication,
cell routing, and output contention resolution are all performed
in a distributed way so that the switch can be scaled up to a
large size. The Abacus switch adopts a novel algorithm to resolve
the contention of both multicast and unicast cells destined for
the same output port (or output module). The switch can also
handle multiple priority traffic by routing cells according to
their priority levels. This paper describes a key ASIC chip for
building the Abacus switch. The chip, called the ATM routing
and concentration (ARC) chip, contains a two-dimensional array (32 × 32) of switch elements that are arranged in a cross-bar
structure. It provides the flexibility of configuring the chip into
different group sizes to accommodate different ATM switch sizes.
The ARC chip has been designed and fabricated using 0.8-µm
CMOS technology and tested to operate correctly at 240 MHz.
Although the ARC chip was designed to handle the line rate at
OC-3 (155 Mb/s), the Abacus switch can accommodate a much
higher line rate at OC-12 (622 Mb/s) or OC-48 (2.5 Gb/s) by
using a bit-sliced technique or distributing cells in a cyclic order
to different inputs of the ARC chip. When the latter scheme is
used, the cell sequence is retained at the output of the Abacus
switch.
Index Terms— Asynchronous transfer mode, ATM concentration, ATM switching, ATM routing, distributed contention resolution, multicast switching.
I. INTRODUCTION
THERE are several approaches to building a large-scale ATM switch. For instance, one can use small ATM switch modules (e.g., 32 × 32) as building blocks and connect them in a multistage structure (e.g., a Clos-type interconnection) [1]–[5], or use high-speed technology to switch cells at multiple-Gb/s rates in a core switch [6]–[9]. Among them,
output buffering (including shared-memory output buffering)
has been proven to provide the best delay and throughput
performance. However, as the switch grows beyond a certain size,
memory speed may become a bottleneck, or the technology
used to implement such memory may become too costly.
For instance, for a shared-memory switch with 256 input and
output ports at 155 Mb/s input rate, the memory cycle time has
to be less than 5.5 ns [2.8 µs/(256 × 2)]. One way to eliminate
the memory’s speed constraint is to temporarily store some
Manuscript received May 2, 1996; revised September 11, 1996.
The authors are with the Department of Electrical Engineering, Polytechnic
University, Brooklyn, NY 11201 USA.
Publisher Item Identifier S 0018-9200(97)03840-7.
cells destined for the same output port in the input buffers.
Input buffering's well-known head-of-line (HOL) blocking drawback can be alleviated by speeding up the internal links'
bandwidth (e.g., three to four times the input line’s) or
increasing the number of routing links to each output port.
For instance, if the speed-up factor or the number of routing links per output port is chosen to be four, the throughput increases from the 58% limit imposed by HOL blocking to 99%.
Since multiple cells can arrive in one time slot at each output
port while only one cell can be transmitted to the network,
an output buffer is required. The input-and-output buffering
approach provides satisfactory performance while eliminating
memory speed limitation. Our performance study [10] has
shown that for a satisfactory cell loss probability, the output
buffer should be much larger than the input buffer, e.g., a few
thousands versus a few tens of cells. Examples of input-and-output buffered ATM switches are NTT's and BNR's 160-Gb/s switches [8], [9].
The challenge of implementing input-and-output buffered
switches is to resolve the output port contention among the
input cells destined for the same output port (or the same
output module for a two-stage architecture). The contention
resolution function is usually handled by an arbiter. The
bottleneck caused by the memory speed is shifted to the arbiter.
However, if parallel processing and pipeline techniques can
be intelligently applied to implement the arbiter, a large-scale
switch is feasible.
We have proposed a new modular architecture for building
a scalable multicast ATM switch from a few tens to a few
thousands of input/output ports. The switch, called the Abacus switch, employs an input and output buffering scheme.
The Abacus switch adopts a novel algorithm to resolve the
contention of both multicast and unicast cells destined for the
same output port (or output module). It provides the features
of: 1) sharing input buffers, 2) providing fairness among the
input ports, and 3) supporting multicast call splitting. The
call splitting function allows a multicast cell to be delivered
to subsets of destined output ports in multiple cycles, thus
increasing the system throughput.
The Abacus switch can be built using an ASIC, called the
ATM routing and concentration (ARC) chip, which consists of 32 × 32 switch elements (SWE's) arranged in a cross-bar structure. The ARC chip routes cells to the proper output
ports (or groups) based on their routing information. The
chip performs the functions such as routing, multicasting, and
concentration in a distributed manner so that a large-scale
switch can be implemented by cascading the ARC chips in two
Fig. 1. The architecture of a 64 × 64 Abacus switch.
Fig. 2. The multicast grouping network (MGN).
dimensions. When multiple cells are contending for the same
output group, they are routed based on their priority levels.
The chip has been designed and fabricated using 0.8-µm
CMOS technology and tested to operate correctly at 240 MHz.
Although the ARC chip was designed to handle the line rate of
OC-3 (155 Mb/s), the Abacus switch can accommodate a much
higher line rate of OC-12 (622 Mb/s) or OC-48 (2.5 Gb/s) by
using a bit-sliced technique or distributing cells in a cyclic order
to different inputs of the ARC chip. When the latter scheme is
used, the cell sequence is retained at the output of the Abacus
switch.
In Section II, we describe the architecture of the Abacus
switch and show how the switch can be built using the ARC
chips. Section III describes the detailed design of the ARC
chip. Section IV shows the timing diagram of the ARC chip
when operating with different group sizes. Section V presents
testing and simulation results. Section VI gives conclusions.
II. A 64 × 64 ABACUS SWITCH
To simplify the explanation of the ARC chip operations,
we will only show the Abacus switch with 64 inputs and 64
outputs, as shown in Fig. 1. A large-scale architecture with
thousands of input and output ports can be found in [10]. The 64 × 64 switch consists of input port controllers (IPC's), a
multicast grouping network (MGN), and output buffers.
The IPC’s terminate input signals from the network, look up
necessary information in a translation table, and attach routing
information to the front of each cell before it is routed in the
MGN. The IPC’s also resolve contention among input cells
that are destined for the same output port and provide a buffer
for those cells losing contention. The routing information
includes address and priority fields. The address field can be an
output port’s physical address for a unicast case, or a multicast
pattern (MP) for a multicast case. An MP is a bit map of all
output ports, each bit indicating if the cell is to be sent to the
associated output port. For a 64 × 64 switch, the MP has 64 b.
The priority field carries cells’ priority levels that are used to
assist cell routing in the switch fabric when contention occurs.
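As a concrete illustration of this header, the following sketch assembles a multicast routing header in software. The field order and the 7-b priority width are assumptions (the paper's test in Section IV happens to use 7-b priorities); the 64-b multicast pattern, one bit per output port, and the "smaller value means higher priority" convention follow the text.

```python
# Illustrative sketch only (not the chip's exact encoding): an IPC prepends a
# routing header consisting of a multicast pattern and a priority field.
NUM_PORTS = 64          # 64 x 64 Abacus switch
PRIORITY_BITS = 7       # assumed width

def build_routing_header(dest_ports, priority):
    """Return the routing bits: multicast pattern followed by the priority field."""
    mp = ['0'] * NUM_PORTS
    for p in dest_ports:                 # set one bit per destined output port
        mp[p] = '1'
    prio = format(priority, '0{}b'.format(PRIORITY_BITS))  # smaller value = higher priority
    return ''.join(mp) + prio

# A multicast cell destined for output ports 3, 17, and 40 with priority 5:
header = build_routing_header([3, 17, 40], 5)
```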
The MGN consists of 64 routing modules (RM’s). Each
RM contains a two-dimensional array of SWE’s arranged in
a cross-bar structure, as shown in Fig. 2. It has 64 horizontal
input lines and four vertical routing links. Up to four cells
from inputs can arrive at an output port simultaneously and
can be arbitrarily routed to any one of the four routing links.
Cell replication and routing functions are performed by all
SWE’s simultaneously, resulting in a scalable architecture.
Cell replication is achieved by broadcasting incoming cells
to all RM’s, which then route cells to their output links. The
SWE routes cells from the west and north to east and south,
respectively, when it is at cross state, or to south and east,
respectively, when it is at toggle state. The SWE’s state is
determined from the comparison of address bits and priority
bits of cells from west and north.
Each RM performs three functions in parallel: 1) cell
address filtering, 2) cell concentration, and 3) priority sorting.
Cells whose addresses match with the output address are
allowed to compete for a limited number of output links. For
instance, up to 64 input cells may compete for four output
links. During the competition, high priority cells are chosen
over low priority cells. Cells that lose contention will retry for
the next time slot until they have been successfully transmitted
to all desired output port(s).
The multicast contention resolution algorithm we proposed
in [10] achieves fairness among input ports during cell contention by dynamically assigning a priority level to the HOL
cell of each input port. The priority level, called local priority
(LP), is unique for each HOL cell and changes from cell slot
to cell slot. After cells are routed through the RM, they are
sorted at the output links of the RM according to their priority
levels from left to right in a descending order (see Fig. 2). The
cell that appears at the rightmost output link has the lowest
priority level among the cells that have been routed through
this RM. This lowest priority information is broadcast to all
IPC’s. Each IPC will then compare its HOL cell’s LP with
the feedback priority (FP) of the output port(s) for which the
HOL cell is destined to determine if its HOL cell has been
routed through the RM. If the FP is lower than or equal to the
LP, the IPC knows that its HOL cell has reached one of the output links of the RM. Otherwise, the HOL cell must have been discarded in the RM due to loss of contention and will be retransmitted in the next time slot. Since it is not known whether the HOL cell will win the contention when it is sent to the RM, the cell is temporarily stored in a one-cell buffer for a possible retry.
Fig. 3. An example of a 64 × 64 Abacus switch at OC-3 line rate.
Fig. 4. An example of a 64 × 64 Abacus switch at OC-12 line rate.
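A minimal behavioral sketch of this feedback check follows. The function and field names are ours; the hardware performs the comparison per output group with dedicated logic rather than in software. The "smaller value means higher priority" convention is from the paper, so "FP lower than or equal to LP" translates into a numeric greater-or-equal test.

```python
# Behavioral sketch of the IPC retry decision (names and data layout assumed).
def hol_cell_was_routed(local_priority, feedback_priority):
    """True if the HOL cell must have reached one of the RM's output links."""
    return feedback_priority >= local_priority   # larger value = lower priority

def ipc_time_slot(hol_cell, feedback_priorities):
    """One time slot of IPC bookkeeping for a (possibly multicast) HOL cell.

    hol_cell = {"lp": local priority, "groups": output groups still to reach}.
    Call splitting: groups already reached are dropped; the cell is retried
    for the remaining groups in the next time slot.
    """
    hol_cell["groups"] = [g for g in hol_cell["groups"]
                          if not hol_cell_was_routed(hol_cell["lp"],
                                                     feedback_priorities[g])]
    return len(hol_cell["groups"]) == 0   # True: cell fully delivered

# Example: a cell with LP 12 destined for groups 2 and 5; group 2 fed back
# priority 30 (cell got through), group 5 fed back priority 7 (cell lost).
done = ipc_time_slot({"lp": 12, "groups": [2, 5]}, {2: 30, 5: 7})
```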
Here, we will show an example of connecting the ARC
chips to construct a 64 × 64 Abacus switch, where the output
group size is chosen to be four, meaning each output will
receive up to four cells in one cell time slot. The ARC chip
can be configured for different group sizes, e.g., 4, 8, 16, or
32 links per group, to accommodate different switch sizes.
By cascading the ARC chips in the X and Y directions, a larger switch size can be obtained. As shown in Fig. 3, a 64 × 64 Abacus switch is implemented by connecting the ARC chips in two rows, each row with eight chips. All signals indicating the cell's address or priority fields are used by the SWE for properly routing cells in the SWE array; their usage is explained in detail in the following sections. The 64 input signals are broadcast horizontally to the chips. The south outputs of the first row's chips are connected to the north inputs of the second row's chips, and the north inputs of the first row's chips are tied to high for multicast applications. The south outputs of the second row's chips go to the output buffers, where in one cell slot up to four cells are received and one cell is transmitted. The architecture shown in Fig. 3 handles a line rate of OC-3 (155 Mb/s). By using the so-called bit-sliced technique, the Abacus switch is able to accommodate a line rate higher than OC-3. For instance, Fig. 4 shows a 64 × 64 Abacus switch with an input line rate of OC-12 (622 Mb/s), where four 64 × 64 switch planes are connected in parallel. The OC-12 bit stream is converted to four serial bit streams, and each switch plane (Fig. 3) handles one bit stream. Since the same routing information is attached to each bit stream, the bit streams of the same cell will appear at the same output link of each switch plane. Therefore, they can easily be grouped to the same output port, as shown in Fig. 4.
Fig. 5. Block diagram of the ARC chip.
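To make the slicing idea concrete, here is a small sketch. It is illustrative only: the round-robin distribution of bits across planes is our assumption, while the attachment of the same routing header to every slice, which is what keeps the slices on identical paths, is as stated above.

```python
# Illustrative bit-slicing of one OC-12 cell across four OC-3 switch planes.
def slice_cell(cell_bits, routing_header, num_planes=4):
    """Split a cell's bits over num_planes slices (round-robin assumed) and
    attach the same routing header to each, so all slices follow identical
    paths through their respective switch planes."""
    slices = [cell_bits[i::num_planes] for i in range(num_planes)]
    return [routing_header + s for s in slices]

# The four returned streams feed the four 64 x 64 planes of Fig. 4 and are
# regrouped at the same output port, since their routing headers are identical.
```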
III. ARC CHIP DESIGN
A. Block Diagram and Pin Signals
Fig. 5 shows the ARC chip’s block diagram. Each block’s
function and design are explained in detail in the following
sections. The ARC chip contains 32 × 32 SWE's, which are partitioned into eight SWE arrays, each with 32 × 4 SWE's. One set of input data signals comes from the IPC's. Another set of input data signals either comes from the south outputs of the chips on the row above or is tied to high for the chips on the first row (in the multicast case). A set of output signals either goes to the north inputs of the chips one row below or goes to the output buffer. A precharge (initialization) signal is broadcast to all SWE's to initialize each SWE to a cross state, in which the west input passes to the east and the north input passes to the south. One field indication signal specifies the address bit(s) used for routing cells, while another specifies the priority field. Other output signals propagate along with cells to the adjacent chips on the east or south side.
Fig. 6. 32 × 4 SWE array.
Fig. 7. Two states of the switch element.
Fig. 8. Two-to-one multiplexer structure.
TABLE I
TRUTH TABLE FOR DIFFERENT OPERATION MODES
The group-size configuration signals are used to configure the chip into four different group sizes, as shown in Table I: 1) eight groups, each with four output links, 2) four groups, each with eight output links, 3) two groups, each with 16 output links, and 4) one group with 32 output links. A separate mode signal configures the chip for either unicast or multicast operation; it is set to zero for the unicast case and to one for the multicast case.
Fig. 9. Four different configurations of the ARC chip: (a) eight groups, (b) four groups, (c) two groups, and (d) one group.
B. 32 × 4 SWE Array
As shown in Fig. 6, the SWE’s are arranged in a cross-bar
structure, where signals only communicate between adjacent
SWE's, easing the synchronization problem. ATM cells propagate in the SWE array like a wave traveling diagonally toward the bottom right corner. The address and priority field indication signals are applied from the top left of the SWE array, and each SWE distributes them to its east and south neighbors. This requires the signals arriving at each SWE to have the same phase. The field indication signals are passed to the neighboring SWE's (east and south) after a one-clock-cycle delay, as are the data signals. The precharge signal is broadcast to all SWE's (not shown in Fig. 6) to precharge an internal node in the SWE in every cell cycle. An output signal is provided to identify the address bit position of the cells in the first SWE array of the next adjacent chip.
Fig. 10. Byte-alignment circuit for input cells.
The timing diagram of the SWE input signals and its two possible states are shown in Fig. 7. Two bit-aligned cells, one from the west and one from the north, are applied to the SWE along with the address and priority indication signals, which delimit the address and priority fields of the input cells. The SWE has two states: cross and toggle. Initially, the SWE is set to the cross state by the precharge signal, i.e., cells from the north side are routed to the south side, and cells from the west side are routed to the east side. When the address of the cell from the west matches the address of the cell from the north, and the west cell's priority level is higher than the north cell's, the SWE is toggled: the cell from the west side is then routed to the south side, and the cell from the north is routed to the east. Otherwise, the SWE remains in the cross state. Note that the address indication signal is 1 b long when operating in the multicast mode, but is as long as the address field when operating in the unicast mode. This is because a flattened multicast pattern is used in the multicast mode, where each bit of the multicast pattern is associated with one output group, whereas log2 N address bits are required in the unicast mode, where N is the number of output groups.
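The cross/toggle rule just described can be summarized by the following behavioral sketch. It is a software model under an assumed cell representation, not the chip's bit-serial dynamic-logic implementation (Section III-F); a smaller priority value means a higher priority, as elsewhere in the paper.

```python
# Behavioral model of one SWE for a single cell slot (assumed representation:
# each cell is a dict with "addr" and "prio" fields).
def swe(west, north):
    """Return (east, south) outputs according to the cross/toggle rule."""
    toggle = (west is not None and north is not None
              and west["addr"] == north["addr"]      # addresses match
              and west["prio"] < north["prio"])      # west's priority is higher
    if toggle:
        return north, west      # toggle state: west -> south, north -> east
    return west, north          # cross state:  west -> east,  north -> south

# Example: the west cell wins the southbound link because its priority is higher.
east, south = swe({"addr": 1, "prio": 3}, {"addr": 1, "prio": 9})
assert south == {"addr": 1, "prio": 3}
```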
C. Two-to-One Multiplexer
The 2-to-1 mux between every two SWE arrays selects the data signals and the field indication signals based on the chip configuration (Fig. 8). Different configurations are obtained by properly controlling the multiplexer select signals, which are generated from the group-size configuration inputs. The group size is configured to four, eight, 16, or 32 links according to the settings of these inputs listed in Table I, and the internal group-size control signals are derived directly from them. The mode signal controls a 1-b delay for the address indication signal when the chip is operated in the multicast case.
As shown in Fig. 9, when SWE arrays are combined into
a bigger SWE array [e.g., from four links per group in
Fig. 9(a) to eight links per group in Fig. 9(b)], the data
signals and field indication signals of each SWE array are fed from the associated outputs of the SWE array on the left. Note that these signals are latched by the DFF's between the SWE arrays (indicated as a D in the figure). This extra latching reduces the propagation time between two SWE's, thus increasing the switch system's clock speed. When SWE arrays are not combined, the data signals and field indication signals of each SWE array are provided from the global inputs. Also note that in the multicast case, the address indication signal is delayed by 1 b (a DFF) between every two SWE arrays, which allows the cells' address bits to be identified correctly in the next SWE array. In the unicast situation, the DFF is bypassed, as shown by a dashed line in Fig. 9.
The west data inputs and the priority indication signal of each SWE array are chosen either from the global inputs (nongrouped case) or from the outputs of the 32 × 4 SWE array on the left (grouped case). The address indication signal of each SWE array is selected from three possible inputs: 1) directly from the global input (unicast, nongrouped case), 2) from the global input with a delay of some bits (multicast, nongrouped case), or 3) from the output of the SWE array on the left with a 1-b delay (grouped case for either unicast or multicast).
Fig. 11. Clock distribution and drivers.
D. Byte-Alignment Circuit
The byte alignment circuit (Fig. 10) is basically a set of shift
registers. It ensures that cells are aligned at bit level when they
arrive at each SWE even though they are byte-aligned at the
chip inputs. Since each SWE introduces a 1-b delay, input cells would have to be skewed bit by bit at the inputs of the chip if there were no byte-alignment circuit. In order to ease
the synchronization of the cells from IPC’s, the byte-alignment
circuit is used to allow the delay between cells from IPC’s to
be a multiple of eight bits for every set of eight data inputs.
For example, the delay between the first set of eight inputs and the second set is one byte, the delay between the second set and the third set is one byte, the delay between the third set and the fourth set is one byte, and so on. However, cells arriving at the eight inputs within the first set are synchronized, cells within the second set are synchronized, cells within the third set are synchronized, and cells within the fourth set are synchronized.
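A small sketch of the delay bookkeeping implied by this arrangement follows. The direction of the skew (input 0 needing the least total delay) and the per-input delay assignment are our assumptions; the circuit itself is just the shift registers described above.

```python
# Illustrative model of the byte-alignment shift registers (assumptions noted
# in the lead-in).  Externally, inputs are only byte-aligned in sets of eight;
# internally every input i must be skewed by i bits so cells meet each SWE
# bit-aligned.
def internal_shift_bits(i):
    """Bits of delay the on-chip shift register adds for input i (0..31)."""
    total_needed = i                    # one extra bit of skew per input
    applied_externally = 8 * (i // 8)   # byte-granular skew applied at the pins
    return total_needed - applied_externally   # equals i % 8

# Inputs 0-7 get 0..7 bits of internal delay, inputs 8-15 again 0..7, and so on,
# because each successive set of eight already arrives one byte later.
assert [internal_shift_bits(i) for i in range(10)] == [0, 1, 2, 3, 4, 5, 6, 7, 0, 1]
```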
E. Clock Distribution
The chip's clock inputs are applied with positive 1-V peak-to-peak pseudo emitter-coupled logic (ECL) signals. This is because most off-the-shelf ECL components are capable of generating and distributing clock signals at a few hundred MHz, and these ECL signals can easily be shifted from negative to positive levels when interfacing with the ARC chip. The clock distribution circuit of the ARC chip is shown in Fig. 11. The differential pseudo-ECL clock inputs are first converted to CMOS levels by a two-stage differential amplifier followed by a three-stage CMOS buffer. The final two stages of the clock buffers are distributed along the bottom of the die to obtain a smaller clock skew. A 28× buffer is used to drive the SWE's in each column, where 1× is the smallest inverter used in the chip.
Fig. 12. Switch element (SWE) circuit.
The load on the clock signals in each SWE is calculated to be about 56.8 fF; 40.8 fF is due to the loading of four DFF's (each with 10.2 fF), and 16 fF is due to the local wire's loading. The 5-mm global clock wire from the 28× driver to the SWE's in the same column has a total estimated capacitance of 0.53 pF. This results in a total capacitance per column of 2.36 pF (32 × 56.8 fF + 0.53 pF). The distributed capacitance of the 5-mm metal wire is therefore 0.47 fF/µm (2.36 pF/5 mm). The delay caused by the long wire can be formulated as t_d = (1/2) r c L^2 [11], where r, c, and L are the wire resistance per unit length, the distributed capacitance per unit length, and the total wire length, respectively. For the given unit resistance of the metal wire, c = 0.47 fF/µm, and L = 5 mm, the maximum delay skew for the clock signal at different SWE's can be as large as 530 ps. Note that this delay would be about 5 ns if only one big clock driver distributed the clock signal globally. In order to further reduce the clock skew to 132 ps, we added additional small inverters as clock drivers in each SWE (see Fig. 12). This reduces the capacitance distributed along the global clock wire to 0.17 fF/µm (about 0.85 pF over 5 mm). Since signals only flow between adjacent SWE's, the clock skew of 132 ps will not cause a problem at the desired operation speed, e.g., 200 to 300 MHz.
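The column-capacitance figures above can be checked with a few lines of arithmetic. The values are restated from the text; the per-unit wire resistance is not repeated here, so the delay formula t_d = 0.5 r c L^2 is left symbolic.

```python
# Quick check of the clock-loading arithmetic quoted in this section.
C_DFF = 10.2e-15                     # per-DFF clock load (F)
C_LOCAL = 16e-15                     # local wiring per SWE (F)
C_SWE = 4 * C_DFF + C_LOCAL          # 56.8 fF per SWE
C_WIRE = 0.53e-12                    # 5-mm global clock wire (F)
L_UM = 5000                          # wire length in micrometers

c_col = 32 * C_SWE + C_WIRE                          # one column of 32 SWE's
print(round(c_col * 1e12, 2), "pF per column")       # ~2.35 pF (quoted as 2.36 pF)
print(round(c_col / L_UM * 1e15, 2), "fF/um")        # ~0.47 fF/um distributed cap
```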
Fig. 13. Timing diagram of the ARC chip with group size four.
Fig. 14. Timing diagram of the ARC chip with group size eight.
Fig. 15. Timing diagram of the ARC chip with group size 32.
F. Switch Element
The circuit diagram of the SWE is shown in Fig. 12. The control input signals are applied to the top and left sides of the SWE. These signals, which have identical waveforms, are the outputs of two other SWE's: one on the left and one on the top. Because the corresponding signals at the south and east sides of the SWE also have identical waveforms, they are tied together. The data inputs and control inputs are first latched by DFF's; both their inverted and noninverted versions are available. The principle of SWE operation is as follows. At the beginning of each cell period, the precharge signal goes low to precharge node A to high through T1 (T5, T6, and T7 are all off). Node B is likewise precharged to high through T3 (T8 and T9 are off), forcing the SWE to enter (or stay in) the cross state (i.e., the "cross" signal is high). Once node A (or B) is precharged to the threshold of the following inverter, inv1 (or inv2), the output of the inverter goes low, which in turn turns on the weak p-transistor T2 (or T4). This keeps node A (or B) high as long as there is no path that pulls the node voltage to GND through the group of transistors led by T5 or T6 (or T8).
When the address indication signal is asserted, the addresses of the north and west inputs are compared bit by bit by the transistor group led by T5. If the north and west address bits differ while the address indication signal is high, the T5 transistor group discharges node A. The output of the inverter inv1 then goes high once the threshold of inv1 is reached. It turns on T10 and forms a two-inverter latch, a positive feedback loop. This pulls down node A immediately, which then turns off T12 and thus keeps node B high and the SWE in the cross state. Note that T7 is always on except when node A is being precharged, i.e., while the precharge signal is
low. At each SWE, if the addresses of input cells match, their
priorities are then compared bit by bit. The priority comparison stops as soon as a bit position is reached where the north cell's priority bit is 0 and the west cell's is 1; this means the north cell's priority level has been detected to be higher than the west cell's (a smaller value means a higher priority), and there is no need to compare the following priority bits. This causes node A to be discharged to low through the T6 transistor group, which in turn prevents node B from being discharged and thus keeps the SWE in the cross state. If the west cell's priority is detected to be higher than the north cell's (i.e., the west bit is 0 and the north bit is 1 at the first differing position), node B is discharged
through the T8 transistor group, toggling the cross state into
a noncross state. The transistor T11 plays the same role in
discharging node B as T10 for node A.
P and N transistors are mixed in evaluation paths in order
to phase align all gate signals for the transistors on the same
evaluation path. Any small overlap between the gate signals
may accidentally discharge node A (or B). This effect can
easily be seen by SPICE simulation if P transistors are replaced
with N transistors (accordingly, some gate signals are required
to be inverted). Using P transistors may increase the layout area and slow down the SWE slightly, but the effect is insignificant. The performance of the SWE could be improved if the P transistors in the evaluation paths were replaced by N transistors and two additional D flip-flops were added in each SWE to latch the inverses of the data inputs. This approach requires more transistors and dissipates more dynamic power; thus, it was not adopted in our design.
Fig. 16. Alignment of input signals when cascading two ARC chips.
Fig. 13 shows a timing diagram of the input and output signals when the ARC chip operates in the multicast mode and is configured with eight groups of four links per group. The group-size configuration signals are connected to low while the mode signal is connected to high. The outputs of the first, second, ..., and eighth groups are taken from the first, second, ..., and eighth sets of four output links, respectively. A 2-b-long precharge signal is asserted low just before the beginning of each cell cycle (with respect to the data inputs) to precharge an internal node of the SWE's to high. A 1-b-long address indication signal, the routing bit position indicator with respect to the data input, is applied to the chip to inform the controller circuit in each SWE of the multicast bit position of a specific group. The priority indication signal is applied to the chip to specify the range of the priority field in the data input. Cell inputs are applied to the chip in a byte-aligned manner, as shown in Fig. 13, in order to guarantee the alignment of the cells when they arrive at every SWE. Cell outputs are bit aligned with a 1-b skew between consecutive outputs within the same group. The address indication signal at the output is delayed in proportion to the number of groups in the ARC chip, in this case 8 b.
Fig. 14 shows the timing diagram of the I/O signals when the ARC chip operates in the multicast mode and is configured with four groups of eight links per group (with the corresponding setting of the configuration signals in Table I). The main difference between group size 8 and group size 4 is that for group size 8, any two consecutive outputs within the same group are bit aligned with a skew of 1 b, or with a skew of 2 b if they lie in two different 32 × 4 SWE arrays. The extra bit of skew is due to the additional DFF in the multiplexer circuits between every two 32 × 4 SWE arrays. Fig. 15 shows the timing diagram of the I/O signals when the ARC chip is configured with a group size of 32 links. The skews at the outputs are similar to those in Fig. 14. The timing diagram of the ARC chip with group size 16 is not given here since it is similar to those in Figs. 14 and 15.
When ARC chips are vertically cascaded, the input signals
are still aligned with respect to the byte clock as shown in
Fig. 16. Inputs that belong to the same set of eight (modulo 8) are applied to the chip with the same phase. Between successive sets of eight inputs, a 1-byte delay is required, as shown in Figs. 13–15, if they are applied to the same chip; otherwise, a 2-byte delay is required. This is due to the additional DFF's added at the input and output buffers of the chip, as shown in Fig. 16. The seven DFF's at the south outputs are added to ensure that the delay skew between chips is referenced to the byte clock instead of the bit clock. An additional input, called bypass, can short
one of these seven DFF’s. This may be needed to adjust the
delay from the last DFF in the south outputs of the top ARC
chip to the first DFF in the north inputs of the next ARC chip
(see Fig. 16).
IV. TESTING AND SIMULATION RESULTS
Fig. 17. Photograph of the ARC chip.
The 32 × 32 ARC chip has been designed and fabricated using 0.8-µm CMOS technology with a die size of 6.6 × 6.6 mm. Its photograph is shown in Fig. 17. Note that this
chip is pad limited. The chip has been tested successfully up to
240 MHz by using a high-speed oscilloscope, timing analyzer,
and a pattern generator capable of generating signals up to
1 GHz. Fig. 18 shows a testing result, in which the priority field indication output and the first few south outputs are shown from top to bottom. The range of the priority field of the cells at the output is specified by the priority field indication signal, which is chosen to be 7 b in this test. Since this signal is taken from the bottom left of the SWE from which the leftmost south output comes out, it is aligned with that output. The next south output is delayed by one clock cycle with respect to it, and the following one is delayed by two clock cycles. In this test, cells are applied to the west inputs, while the north inputs are tied high. It is observed that the south outputs come out in sorted priority order: the first output carries the priority 1000100, which is the highest among all inputs; the next output carries 1000101, which is the second highest; and the following output carries 1000110, which is the third highest. Note that the cell length used here is kept short in order to be able to see one cell cycle in the viewing window of the oscilloscope.
Fig. 18. Photograph of a testing result with 5 ns per grid.
We have developed a rigorous testing methodology to
identify bad chips in the first one or two test cases.
1) Basic address test. This test is to verify that the address comparison circuits in each SWE, the transistor
path starting with T5 of Fig. 12, have no defects, such
as stuck-to-one or stuck-to-zero. These defects could
be caused by a transistor failure or a short on wires.
Applying a unique address to each north and west input
while setting a unique identifier in the cell payload
in order to distinguish cells at the outputs results in
a cross state for all SWE’s. The unique address of
the west or north inputs is set to their port numbers,
while the priority of the north inputs is set higher than that of the west inputs. Thus, whatever is applied to a north input should come out from the south output of the same column. If any south output differs from the cell applied to the corresponding north input, one of the SWE's in that column must have a defect (a sketch of this vector generation is given after this list).
Fig. 19. SPICE simulation of the SWE.
2) Basic priority test. This test is similar to the basic
address test except it now concentrates on the priority
comparison circuit in the SWE's, i.e., the evaluation paths beginning with transistors T6 and T8 in
Fig. 12. We applied west input cells with an identical
address, but unique priorities and unique identifiers
in the cell payload. We also applied the north input
cells with the same address as the west inputs, but with
the lowest priority. Cells that appear at south outputs
should be sorted according to their priority levels. Any
output that has a misordered priority will provide clues
for tracing possible failed SWE’s.
3) Location test. This test scans through all SWE’s one
by one by applying specific priorities to test the two
priority evaluation paths in each SWE (led by T6 and
T8 transistors). To test the SWE at the ith row and jth column, we apply both the west and north inputs with the same addresses and priorities as in the basic address test case, except for the ith west input and the jth north input. The priorities of these two inputs are set to a specific pattern in which the leading bits are identical and, at the first differing bit, the north cell's priority is higher, in order to test the evaluation path from T6 to GND; the remaining bits are do-not-care. Another pattern, in which the first differing bit makes the west cell's priority higher, is used to test the evaluation path from T8 to GND. Note that the results of this test are monitored at the corresponding output.
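As referenced in item 1 above, the following sketch generates the basic address test vectors. The cell representation is assumed; the pass/fail criterion restates the text, under the paper's convention that a smaller priority value is a higher priority.

```python
# Sketch of the basic address test vectors (data layout assumed).  Each input's
# address equals its port number, payloads are unique identifiers, and north
# cells get a higher priority (smaller value) than west cells so that every
# SWE stays in the cross state.
NUM_PORTS = 32   # one ARC chip

def basic_address_test_vectors():
    west = [{"addr": i, "prio": 1, "payload": "W%d" % i} for i in range(NUM_PORTS)]
    north = [{"addr": i, "prio": 0, "payload": "N%d" % i} for i in range(NUM_PORTS)]
    return west, north

# Expected outcome: with all SWE's in the cross state, south output i carries
# exactly the cell applied to north input i; any mismatch flags a defective
# SWE somewhere in column i.
```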
V. TIMING DIAGRAM
Fig. 19 shows the SPICE simulation results of one SWE
operating at 333 MHz. In the first cell slot, addresses of the
north and west inputs match so that the “match” signal stays
high. However, the “cross” signal goes low as soon as the
priority of west input is detected to be higher (i.e., smaller
in value) than the priority of north input, which toggles the
SWE. In the second cell slot, addresses of north and west
inputs do not match, which makes "match" go low and "cross" stay high. In the third cell slot, even though the addresses match, "match" goes low as soon as the north cell's priority is detected to be higher in the priority field (i.e., the priority comparison stops), thus keeping "cross" high.
all the different input combinations. The length of the address
field is chosen as 1 b in this simulation since in the multicasting
situation it requires only one bit of address match. However,
the SWE can handle any length of address and priority fields.
VI. CONCLUSION
The Abacus switch that we proposed is scalable from a few
tens to a few thousands of input/output ports. It can handle
multicasting, call splitting, high input line rate (e.g., OC-48),
and channel grouping with cell sequence preservation. The
ARC chip described in this paper is the key component of
building the Abacus switch. By cascading the ARC chips in
two dimensions, any practical size of the ATM switch can
be implemented. The ARC chip has been designed, fabricated using 0.8-µm CMOS technology, and tested successfully at 240 MHz. The chip's characteristics are summarized in
Table II.
TABLE II
CHIP SUMMARY

REFERENCES
[1] T. Kozaki, N. Endo, Y. Sakurai, O. Matsubara, M. Mizukami, and K. Asano, "32 × 32 shared buffer type ATM switch VLSI's for B-ISDN's," IEEE J. Select. Areas Commun., pp. 1239–1247, Oct. 1991.
[2] Y. Shobatake, M. Motoyama, E. Shobatake, T. Kamitake, S. Shimizu, M. Noda, and K. Sakaue, "A one-chip scalable 8 × 8 ATM switch LSI employing shared buffer architecture," IEEE J. Select. Areas Commun., pp. 1248–1254, Oct. 1991.
[3] T. R. Banniza, G. J. Eilenberger, B. Pauwels, and Y. Therasse, "Design and technology aspects of VLSI's for ATM switches," IEEE J. Select. Areas Commun., Oct. 1991.
[4] A. Itoh, W. Takahashi, H. Nagano, M. Kurisaki, and S. Iwasaki, "Practical implementation and packaging technologies for a large-scale ATM switching system," IEEE J. Select. Areas Commun., vol. 9, pp. 1280–1288, Oct. 1991.
[5] W. Fischer, O. Fundneider, E.-H. Goeldner, and K. A. Lutz, "A scalable ATM switching system architecture," IEEE J. Select. Areas Commun., vol. 8, pp. 1299–1307, Oct. 1991.
[6] K. Y. Eng, M. A. Pashan, R. A. Spanke, M. J. Karol, and G. D. Martin, "A high-performance prototype 2.5 Gb/s ATM switch for broadband applications," in Proc. ICC'89, June 1989, pp. 111–117.
[7] Y. Kato, T. Shimoe, and K. Murakami, "A development of a high speed ATM switching LSIC," in Proc. ICC'90, Apr. 1990, pp. 562–566.
[8] K. Genda, Y. Doi, T. Kawamura, K. Endo, and S. Sasaki, "A 160-Gb/s ATM switching system using an internal speed-up crossbar switch," in Proc. GLOBECOM'94, Nov. 1994, pp. 123–133.
[9] E. Munter, "A high capacity ATM switch based on advanced electronic and optical technologies," in Proc. ISS'95, Berlin, Germany, Apr. 1995, pp. 389–393.
[10] H. J. Chao, B. S. Choe, J. S. Park, and N. Uzun, "Abacus switch: A scalable multicast ATM switch," presented at GLOBECOM'96.
[11] N. H. Weste and K. Eshraghian, Principles of CMOS VLSI Design: A Systems Perspective, 2nd ed. Reading, MA: Addison-Wesley, 1993.
H. Jonathan Chao (S’82–M’85–SM’95) received
the B.S.E.E. and M.S.E.E. degrees from National
Chiao Tung University, Taiwan, in 1977 and 1980,
respectively, and the Ph.D. degree in electrical engineering from the Ohio State University, Columbus,
in 1985.
He is an Associate Professor of Electrical Engineering at Polytechnic University, Brooklyn, NY,
which he joined in January 1992. His areas of research include large-scale multicast ATM switches,
photonic ATM switches, multimedia communications, and congestion/flow control in ATM networks. He holds 14 patents
and has published over 60 journal and conference papers in the above areas.
From 1985 to 1991, he was a Member of Technical Staff at Bellcore, NJ,
where he proposed various architectures and implemented several VLSI chips
for SONET/ATM-based broadband networks. From 1977 to 1981, he worked
at Taiwan Telecommunications Laboratories, where he was engaged in the
development of a digital switching system.
Necdet Uzun (M’96) received the B.S. and the M.S.
degrees in electrical engineering from Technical
University of Istanbul in 1983 and 1986, respectively. He received the Ph.D. degree in electrical
engineering from Polytechnic University, Brooklyn,
NY, in 1993.
He is an Industry Associate Professor of Electrical
Engineering at Polytechnic University, Brooklyn,
NY. He was with Bellcore, Red Bank, NJ, from
1990 to 1992, as a Consultant in the Multiplex &
Multiaccess Technology Group. His R&D activities
in high-speed networking include electronic and photonic ATM switches,
admission and congestion control, ATM LAN’s, and high-speed VLSI architectures of ATM switching systems. He has also been involved with the
analysis and modeling of the quantization effects in filter banks.