IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 32, NO. 6, JUNE 1997

An ATM Routing and Concentration Chip for a Scalable Multicast ATM Switch

H. Jonathan Chao, Senior Member, IEEE, and Necdet Uzun, Member, IEEE

Manuscript received May 2, 1996; revised September 11, 1996. The authors are with the Department of Electrical Engineering, Polytechnic University, Brooklyn, NY 11201 USA.

Abstract—We have proposed a new architecture for building a scalable multicast ATM switch from a few tens to a few thousands of input/output ports. The switch, called the Abacus switch, employs input and output buffering schemes. Cell replication, cell routing, and output contention resolution are all performed in a distributed way so that the switch can be scaled up to a large size. The Abacus switch adopts a novel algorithm to resolve the contention of both multicast and unicast cells destined for the same output port (or output module). The switch can also handle multiple priority traffic by routing cells according to their priority levels. This paper describes a key ASIC chip for building the Abacus switch. The chip, called the ATM routing and concentration (ARC) chip, contains a two-dimensional array (32 × 32) of switch elements arranged in a crossbar structure. It provides the flexibility of configuring the chip into different group sizes to accommodate different ATM switch sizes. The ARC chip has been designed and fabricated using 0.8-µm CMOS technology and tested to operate correctly at 240 MHz. Although the ARC chip was designed to handle the line rate of OC-3 (155 Mb/s), the Abacus switch can accommodate a much higher line rate of OC-12 (622 Mb/s) or OC-48 (2.5 Gb/s) by using a bit-sliced technique or by distributing cells in a cyclic order to different inputs of the ARC chip. When the latter scheme is used, the cell sequence is retained at the output of the Abacus switch.

Index Terms—Asynchronous transfer mode, ATM concentration, ATM switching, ATM routing, distributed contention resolution, multicast switching.

I. INTRODUCTION

THERE are several approaches to building a large-scale ATM switch: for instance, using small ATM switch modules (e.g., 32 × 32) as building blocks and connecting them in a multistage structure (e.g., a Clos-type interconnection) [1]–[5], or using high-speed technology to switch cells at multiple Gb/s in a core switch [6]–[9]. Among the buffering strategies, output buffering (including shared-memory output buffering) has been proven to provide the best delay and throughput performance. But as the switch grows beyond a certain size, memory speed may become a bottleneck, or the technology used to implement such memory may become too costly. For instance, for a shared-memory switch with 256 input and output ports at a 155-Mb/s line rate, the memory cycle time has to be less than 5.5 ns [2.8 µs/(256 × 2)]. One way to eliminate the memory's speed constraint is to temporarily store some cells destined for the same output port in the input buffers. Input buffering's well-known head-of-line (HOL) blocking drawback can be alleviated by speeding up the internal links' bandwidth (e.g., to three to four times the input line's) or by increasing the number of routing links to each output port. For instance, if the speed-up factor or the number of routing links per output port is chosen to be four, the throughput increases from the 58% imposed by HOL blocking to 99% (a small simulation sketch below illustrates this effect).
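The throughput figures quoted above can be reproduced with a small behavioral simulation. The sketch below (Python; a simplified model of our own, not from the paper: saturated inputs, uniform independent destinations, random winner selection) estimates the saturation throughput of an input-buffered switch for a given number of routing links per output port; with one link it converges toward roughly 58%, and with four links toward 99%.

```python
import random

def hol_throughput(n_ports=64, links_per_output=1, slots=20000, seed=1):
    """Estimate saturation throughput of an input-buffered switch.

    Each input always has a head-of-line (HOL) cell with a uniformly
    random destination.  In every time slot an output port can accept
    at most `links_per_output` HOL cells; the rest stay blocked and
    retry in the next slot (HOL blocking).
    """
    rng = random.Random(seed)
    hol = [rng.randrange(n_ports) for _ in range(n_ports)]  # HOL destinations
    delivered = 0
    for _ in range(slots):
        contenders = {}
        for port, dest in enumerate(hol):
            contenders.setdefault(dest, []).append(port)
        for dest, ports in contenders.items():
            rng.shuffle(ports)                      # random winner selection
            for port in ports[:links_per_output]:   # winners leave the queue
                delivered += 1
                hol[port] = rng.randrange(n_ports)  # next cell becomes HOL
    return delivered / (n_ports * slots)

if __name__ == "__main__":
    print("1 link/output :", round(hol_throughput(links_per_output=1), 3))  # ~0.59
    print("4 links/output:", round(hol_throughput(links_per_output=4), 3))  # ~0.99
```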
Since multiple cells can arrive in one time slot at each output port while only one cell can be transmitted to the network, an output buffer is required. The input-and-output buffering approach provides satisfactory performance while eliminating the memory speed limitation. Our performance study [10] has shown that for a satisfactory cell loss probability, the output buffer should be much larger than the input buffer, e.g., a few thousand versus a few tens of cells. Examples of input-and-output buffered ATM switches are NTT's 160-Gb/s switch [8] and BNR's switch [9].

The challenge of implementing input-and-output buffered switches is to resolve the output port contention among the input cells destined for the same output port (or the same output module for a two-stage architecture). The contention resolution function is usually handled by an arbiter, so the bottleneck caused by the memory speed is shifted to the arbiter. However, if parallel processing and pipelining techniques are intelligently applied to implement the arbiter, a large-scale switch is feasible.

We have proposed a new modular architecture for building a scalable multicast ATM switch from a few tens to a few thousands of input/output ports. The switch, called the Abacus switch, employs an input and output buffering scheme. The Abacus switch adopts a novel algorithm to resolve the contention of both multicast and unicast cells destined for the same output port (or output module). It provides the features of 1) sharing input buffers, 2) providing fairness among the input ports, and 3) supporting multicast call splitting. The call splitting function allows a multicast cell to be delivered to subsets of the destined output ports in multiple cycles, thus increasing the system throughput.

The Abacus switch can be built using an ASIC, called the ATM routing and concentration (ARC) chip, which consists of a 32 × 32 array of switch elements (SWE's) arranged in a crossbar structure. The ARC chip routes cells to the proper output ports (or groups) based on their routing information. The chip performs functions such as routing, multicasting, and concentration in a distributed manner, so that a large-scale switch can be implemented by cascading ARC chips in two dimensions. When multiple cells contend for the same output group, they are routed based on their priority levels. The chip has been designed and fabricated using 0.8-µm CMOS technology and tested to operate correctly at 240 MHz. Although the ARC chip was designed to handle the line rate of OC-3 (155 Mb/s), the Abacus switch can accommodate a much higher line rate of OC-12 (622 Mb/s) or OC-48 (2.5 Gb/s) by using a bit-sliced technique or by distributing cells in a cyclic order to different inputs of the ARC chip. When the latter scheme is used, the cell sequence is retained at the output of the Abacus switch.

In Section II, we describe the architecture of the Abacus switch and show how the switch can be built using ARC chips. Section III describes the detailed design of the ARC chip. Section IV shows the timing diagrams of the ARC chip when operating with different group sizes. Section V presents testing and simulation results. Section VI gives conclusions.
II. A 64 × 64 ABACUS SWITCH

To simplify the explanation of the ARC chip operations, we will only show the Abacus switch with 64 inputs and 64 outputs, as shown in Fig. 1. A large-scale architecture with thousands of input and output ports can be found in [10].

Fig. 1. The architecture of a 64 × 64 Abacus switch.

The 64 × 64 switch consists of input port controllers (IPC's), a multicast grouping network (MGN), and output buffers. The IPC's terminate input signals from the network, look up necessary information in a translation table, and attach routing information to the front of each cell before it is routed in the MGN. The IPC's also resolve contention among input cells that are destined for the same output port and provide a buffer for those cells losing contention. The routing information includes address and priority fields. The address field can be an output port's physical address for a unicast cell, or a multicast pattern (MP) for a multicast cell. An MP is a bit map of all output ports, each bit indicating whether the cell is to be sent to the associated output port. For a 64 × 64 switch, the MP has 64 b. The priority field carries the cell's priority level, which is used to assist cell routing in the switch fabric when contention occurs.

The MGN consists of 64 routing modules (RM's). Each RM contains a two-dimensional array of SWE's arranged in a crossbar structure, as shown in Fig. 2. It has 64 horizontal input lines and four vertical routing links. Up to four cells from the inputs can arrive at an output port simultaneously, and they can be arbitrarily routed to any one of the four routing links. Cell replication and routing functions are performed by all SWE's simultaneously, resulting in a scalable architecture. Cell replication is achieved by broadcasting incoming cells to all RM's, which then route the cells to their output links. The SWE routes cells from the west and north to the east and south, respectively, when it is in the cross state, or to the south and east, respectively, when it is in the toggle state. The SWE's state is determined by comparing the address bits and priority bits of the cells from the west and north.

Fig. 2. The multicast grouping network (MGN).

Each RM performs three functions in parallel: 1) cell address filtering, 2) cell concentration, and 3) priority sorting. Cells whose addresses match the output address are allowed to compete for a limited number of output links; for instance, up to 64 input cells may compete for four output links. During the competition, high-priority cells are chosen over low-priority cells. Cells that lose contention retry in the next time slot until they have been successfully transmitted to all desired output port(s).

The multicast contention resolution algorithm we proposed in [10] achieves fairness among input ports during cell contention by dynamically assigning a priority level to the HOL cell of each input port. The priority level, called the local priority (LP), is unique for each HOL cell and changes from cell slot to cell slot. After cells are routed through the RM, they are sorted at the output links of the RM according to their priority levels, from left to right in descending order (see Fig. 2). The cell that appears at the rightmost output link has the lowest priority level among the cells that have been routed through this RM. This lowest priority information is broadcast to all IPC's. Each IPC then compares its HOL cell's LP with the feedback priority (FP) of the output port(s) for which the HOL cell is destined to determine whether its HOL cell has been routed through the RM. If the FP is lower than or equal to the LP, the IPC concludes that its HOL cell has reached one of the output links of the RM. Otherwise, the HOL cell must have been discarded in the RM due to loss of contention and will be retransmitted in the next time slot. Since it is not known whether the HOL cell will win the contention when it is sent to the RM, the cell is temporarily stored in a one-cell buffer for a possible retry; a behavioral sketch of this feedback decision is given below.
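The following minimal sketch (Python; the function names and the numeric priority encoding are illustrative choices of ours, assuming, as in the SPICE example later in the paper, that a numerically smaller value denotes a higher priority) shows how an IPC can use the feedback priority and how call splitting falls out of the same comparison.

```python
def hol_cell_resolved(lp: int, fp: int) -> bool:
    """Return True if the HOL cell with local priority `lp` must have
    reached one of the routing links of the RM that fed back `fp`.

    Convention assumed here: a numerically smaller value is a *higher*
    priority.  `fp` is the priority of the cell on the rightmost routing
    link, i.e., the lowest priority that still won a link in this slot.
    """
    return fp >= lp   # FP lower than (or equal to) LP -> our cell got through


def ipc_slot(hol_cell, feedback):
    """One IPC decision per cell slot (illustrative only).

    `hol_cell` carries the output groups named in its multicast pattern
    and the LP assigned for this slot; `feedback` maps each output group
    to the FP broadcast by its RM.
    """
    remaining = []
    for group in hol_cell["groups"]:           # groups still to be reached
        if not hol_cell_resolved(hol_cell["lp"], feedback[group]):
            remaining.append(group)            # lost contention: retry next slot
    hol_cell["groups"] = remaining             # call splitting: keep only losers
    return len(remaining) == 0                 # True -> HOL cell fully delivered
```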
Here, we show an example of connecting ARC chips to construct a 64 × 64 Abacus switch, where the output group size is chosen to be four, meaning each output can receive up to four cells in one cell time slot. The ARC chip can be configured for different group sizes, e.g., 4, 8, 16, or 32 links per group, to accommodate different switch sizes. By cascading the ARC chips in the X and Y directions, a larger switch size can be obtained. As shown in Fig. 3, a 64 × 64 Abacus switch is implemented by connecting the ARC chips in two rows, each row with eight chips. The signals indicating the cell's address and priority fields are used by the SWE's for properly routing cells in the SWE array; their usage is explained in detail in the following sections. The 64 input signals from the IPC's are broadcast horizontally to the chips. The south outputs of the first row's chips are connected to the north inputs of the second row's chips, while the north inputs of the first row's chips are tied high for multicast applications. The south outputs of the second row's chips go to the output buffers, where in one cell slot up to four cells are received and one cell is transmitted.

Fig. 3. An example of a 64 × 64 Abacus switch at the OC-3 line rate.
Fig. 4. An example of a 64 × 64 Abacus switch at the OC-12 line rate.

The architecture shown in Fig. 3 handles a line rate of OC-3 (155 Mb/s). By using the so-called bit-sliced technique, the Abacus switch is able to accommodate a line rate higher than OC-3. For instance, Fig. 4 shows a 64 × 64 Abacus switch with an input line rate of OC-12 (622 Mb/s), where four 64 × 64 switch planes are connected in parallel. The OC-12 bit stream is converted to four serial bit streams, and each plane (as in Fig. 3) handles one bit stream. Since the same routing information is attached to each bit stream, the bit streams of the same cell appear at the same output link of each switch plane. Therefore, they can easily be grouped to the same output port, as shown in Fig. 4; a small sketch of this slicing and regrouping follows.
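The bit-sliced operation can be pictured with a short sketch (Python; purely illustrative, with names of our own choosing, and assuming the payload length is a multiple of the plane count). The same routing header is replicated onto every slice, the payload bits are dealt out cyclically to the four planes, and the slices are recombined at the corresponding output of each plane.

```python
NUM_PLANES = 4  # four OC-3 planes carry one OC-12 stream

def slice_cell(header_bits, payload_bits, planes=NUM_PLANES):
    """Split one cell into `planes` slices for parallel switch planes.

    The same routing header is prepended to every slice, so all slices
    of a cell are routed identically and emerge at the same output link
    of their respective planes.  Payload length is assumed to be a
    multiple of `planes`.
    """
    slices = [[] for _ in range(planes)]
    for i, bit in enumerate(payload_bits):
        slices[i % planes].append(bit)          # cyclic bit distribution
    return [header_bits + s for s in slices]

def merge_slices(slices, header_len):
    """Reassemble the payload from slices taken at the same output link."""
    payloads = [s[header_len:] for s in slices]
    merged = []
    for i in range(sum(len(p) for p in payloads)):
        merged.append(payloads[i % len(payloads)][i // len(payloads)])
    return merged

if __name__ == "__main__":
    header = [1, 0, 1]                      # hypothetical routing header
    payload = [int(b) for b in "110100101100"]
    slices = slice_cell(header, payload)
    assert merge_slices(slices, len(header)) == payload
```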
III. ARC CHIP DESIGN

A. Block Diagram and Pin Signals

Fig. 5 shows the ARC chip's block diagram. Each block's function and design are explained in detail in the following sections. The ARC chip contains 32 × 32 SWE's, which are partitioned into eight SWE arrays, each with 32 × 4 SWE's. One set of input data signals comes from the IPC's. Another set of input data signals either comes from the outputs of the chips on the row above or is tied high for the chips on the first row (in the multicast case). The output signals either go to the north inputs of the chips one row below or go to the output buffer.

Fig. 5. Block diagram of the ARC chip.

An initialization signal is broadcast to all SWE's to set each SWE to the cross state, in which the west input passes to the east and the north input passes to the south. One field-indicator signal specifies the address bit(s) used for routing cells, while another specifies the priority field. Other output signals propagate along with the cells to the adjacent chips on the east or south side. Configuration signals select one of four different group sizes, as shown in Table I: 1) eight groups, each with four output links; 2) four groups, each with eight output links; 3) two groups, each with 16 output links; and 4) one group with 32 output links. A mode signal configures the chip for either unicast or multicast operation; it is set to zero for the unicast case and to one for the multicast case.

TABLE I. Truth table for the different operation modes.

B. 32 × 4 SWE Array

As shown in Fig. 6, the SWE's are arranged in a crossbar structure in which signals only travel between adjacent SWE's, easing the synchronization problem. ATM cells propagate through the SWE array like a wave moving diagonally toward the bottom-right corner. The field-indicator signals are applied at the top left of the SWE array, and each SWE distributes them to its east and south neighbors; they must therefore arrive at each SWE in phase with the cells. These signals are passed to the neighboring SWE's (east and south) after a one-clock-cycle delay, as are the data signals. A precharge signal is broadcast to all SWE's (not shown in Fig. 6) to precharge an internal node in each SWE in every cell cycle. An output indicator signal is used to identify the address bit position of the cells for the first SWE array of the next adjacent chip.

Fig. 6. 32 × 4 SWE array.
Fig. 7. Two states of the switch element.

The timing diagram of the SWE input signals and the SWE's two possible states are shown in Fig. 7. Two bit-aligned cells, one from the west and one from the north, are applied to the SWE along with the field-indicator signals, which delimit the address and priority fields of the input cells. The SWE has two states: cross and toggle. Initially, the SWE is set to the cross state by the initialization signal, i.e., cells from the north side are routed to the south side, and cells from the west side are routed to the east side. When the address of the cell from the west matches the address of the cell from the north, and the west cell's priority level is higher than the north cell's, the SWE toggles: the cell from the west side is then routed to the south side, and the cell from the north is routed to the east. Otherwise, the SWE remains in the cross state. Note that the address-indicator signal is 1 b long when operating in the multicast mode but as long as the address field when operating in the unicast mode. This is because a flattened multicast pattern is used in the multicast mode, where each bit of the multicast pattern is associated with one output group, whereas in the unicast mode the binary address requires log2 K bits, where K is the number of output groups. A behavioral sketch of the SWE decision rule, and of an entire routing module built from it, is given below.
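To make the cross/toggle rule and the three RM functions (address filtering, concentration, and priority sorting) concrete, the following sketch models one routing module functionally. It is an abstraction of ours, not the circuit: cells are small records, the address match is a single multicast-pattern bit, idle north inputs are modeled as "tied high" cells of lowest priority, and a numerically smaller priority value is treated as higher priority.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Cell:
    mp_bit: int          # this RM's bit of the multicast pattern (1 = cell wants this group)
    prio: float          # smaller value = higher priority (convention assumed here)
    payload: str = ""

def route_rm(inputs: List[Optional[Cell]], links: int = 4):
    """Functional model of one routing module (RM).

    Returns (winners, fp): `winners[j]` is the cell on routing link j,
    highest priority on the left; `fp` is the priority of the rightmost
    (lowest-priority) winner, i.e., the feedback priority.
    """
    idle = lambda: Cell(mp_bit=1, prio=float("inf"))   # north inputs tied high
    cols = [idle() for _ in range(links)]
    for cell in inputs:                                # rows, top to bottom
        west = cell if cell is not None else idle()
        for j in range(links):
            if west.mp_bit == cols[j].mp_bit and west.prio < cols[j].prio:
                west, cols[j] = cols[j], west          # toggle: west drops south
            # otherwise cross: west keeps moving east toward the next link
        # a real cell still moving east past the last link has lost contention
    return cols, cols[-1].prio

if __name__ == "__main__":
    cells = [Cell(1, 5, "A"), Cell(0, 2, "B"),         # B is not addressed to this group
             Cell(1, 3, "C"), Cell(1, 7, "D"),
             Cell(1, 1, "E"), Cell(1, 6, "F")]
    winners, fp = route_rm(cells)
    print([w.payload for w in winners], "FP =", fp)    # ['E', 'C', 'A', 'F'] FP = 6
```

Running the example, cell B is filtered out (its multicast bit is zero), cell D loses contention, and the four winners emerge sorted left to right in descending priority, with the rightmost winner's priority serving as the feedback priority.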
C. Two-to-One Multiplexer

The 2-to-1 multiplexers between every two SWE arrays select the data and field-indicator signals according to the chip configuration (Fig. 8). The different configurations are obtained by properly controlling the multiplexers' select signals, which are generated from the chip's configuration inputs; the group size is thereby set to 4, 8, 16, or 32 links per group, with the group-size control signals derived from the configuration inputs as listed in Table I. In the multicast case, one control signal also inserts a 1-b delay in the address-indicator signal.

Fig. 8. Two-to-one multiplexer structure.
Fig. 9. Four different configurations of the ARC chip: (a) eight groups, (b) four groups, (c) two groups, and (d) one group.

As shown in Fig. 9, when SWE arrays are combined into a bigger SWE array [e.g., from four links per group in Fig. 9(a) to eight links per group in Fig. 9(b)], the data signals and field-indicator signals of each SWE array are fed from the associated outputs of the SWE array on its left. Note that these signals are latched by the DFF's between the SWE arrays (indicated as a D in the figure). This extra latching reduces the propagation time between two SWE's, thus increasing the switch system's clock speed. When SWE arrays are not combined, the data and field-indicator signals of each SWE array are provided from the global inputs. Also note that in the multicast case, the address-indicator signal is delayed by 1 b (one DFF) between every two SWE arrays, which allows the cells' address bits to be identified correctly in the next SWE array; in the unicast case, this DFF is bypassed, as shown by a dashed line in Fig. 9. In summary, the west data inputs and the priority-indicator signal of each SWE array are chosen either from the global inputs (nongrouped case) or from the outputs of the 32 × 4 SWE array on the left (grouped case). The address-indicator signal of each SWE array is selected from three possible inputs: 1) directly from the global input (unicast, nongrouped case); 2) from the global input with a delay of some bits (multicast, nongrouped case); or 3) from the output of the SWE array on the left with a 1-b delay (grouped case, for either unicast or multicast).

D. Byte-Alignment Circuit

Fig. 10. Byte-alignment circuit for input cells.

The byte-alignment circuit (Fig. 10) is basically a set of shift registers. It ensures that cells are aligned at the bit level when they arrive at each SWE even though they are only byte-aligned at the chip inputs. Since each SWE introduces a 1-b delay, input cells would have to be skewed bit by bit at the inputs of the chip if there were no byte-alignment circuit. To ease the synchronization of the cells from the IPC's, the byte-alignment circuit allows the delay between cells from the IPC's to be a multiple of eight bits for every set of eight data inputs. For example, the delay between the first and second sets of eight inputs is one byte, the delay between the second and third sets is one byte, and so on, while cells arriving within the same set of eight inputs are synchronized with one another. A small sketch of the resulting per-input skew is given below.
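The net effect of the byte-alignment scheme can be summarized with a tiny calculation (Python; an illustrative reading of ours rather than the exact register count of Fig. 10). Under the assumption stated in the text, input i must reach the SWE array with a total skew of i bits; the IPC supplies that skew only in whole bytes per group of eight inputs, and the on-chip shift registers make up the remaining bits.

```python
def alignment_delays(n_inputs=32, group=8):
    """Per-input skew budget implied by the byte-alignment scheme (sketch).

    Each SWE row adds a 1-bit delay, so input i is assumed to need a skew
    of i bit times relative to input 0.  The IPC only provides whole-byte
    skew (one byte per group of eight inputs); the on-chip shift registers
    supply the remaining bits.
    """
    plan = []
    for i in range(n_inputs):
        total_bits = i                        # required skew at the SWE array
        external_bits = (i // group) * group  # supplied by the IPC, in bytes
        internal_bits = total_bits - external_bits  # supplied on chip (0..7)
        plan.append((i, external_bits // 8, internal_bits))
    return plan

if __name__ == "__main__":
    for i, ext_bytes, int_bits in alignment_delays(12):
        print(f"input {i:2d}: {ext_bytes} byte(s) external + {int_bits} bit(s) internal")
```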
E. Clock Distribution

The chip's clock inputs are applied as positive 1-V peak-to-peak pseudo emitter-coupled logic (ECL) signals. This is because most off-the-shelf ECL components are capable of generating and distributing clock signals at a few hundred MHz, and these ECL signals can easily be shifted from negative to positive levels when interfacing with the ARC chip. The clock distribution circuit of the ARC chip is shown in Fig. 11. The differential pseudo-ECL clock inputs are first converted to CMOS levels by a two-stage differential amplifier followed by a three-stage CMOS buffer. The final two stages of the clock buffers are distributed along the bottom of the die to obtain a smaller clock skew.

Fig. 11. Clock distribution and drivers.

A 28× buffer is used to drive the SWE's on each column, where 1× is the smallest inverter used in the chip. The load on the clock signals in each SWE is calculated to be about 56.8 fF: 40.8 fF is due to the loading of four DFF's (10.2 fF each), and 16 fF is due to the local wire's loading. The 5-mm global clock wire from the 28× driver to the SWE's on the same column has a total estimated capacitance of 0.53 pF. This results in a total capacitance of about 2.36 pF per column (the 0.53-pF wire plus 32 SWE loads of 56.8 fF each), so the distributed capacitance of the 5-mm metal wire becomes 0.47 fF/µm (2.36 pF/5 mm). The delay caused by the long wire can be formulated as t = (1/2) r c l^2 [11], where r, c, and l are the wire resistance per unit length, the distributed capacitance per unit length, and the total wire length, respectively. For the metal wire's unit resistance, c = 0.47 fF/µm, and l = 5 mm, the maximum delay skew for the clock signal at different SWE's can be as large as 530 ps. Note that this delay would be about 5 ns if only one big clock driver distributed the clock signal globally. To further reduce the clock skew, to 132 ps, we added additional small inverters as clock drivers in each SWE (see Fig. 12); this reduces the distributed capacitance seen by the column driver to 0.17 fF/µm. Since signals only flow between adjacent SWE's, a clock skew of 132 ps does not cause a problem at the desired operation speed of 200 to 300 MHz. A short numerical check of this clocking budget is given below.
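As a sanity check on the numbers above, the short script below (Python; ours, not from the paper) recomputes the per-column load and, using the quoted 530-ps skew together with the distributed-RC expression t = r c l^2/2, backs out the wire resistance per unit length that those figures imply. The resulting r is an inference of ours, not a value quoted in the paper.

```python
# Clock-budget arithmetic for one 32-SWE column (values quoted in the text).
DFF_LOAD_FF    = 10.2        # per D flip-flop
LOCAL_WIRE_FF  = 16.0        # local wiring inside one SWE
SWE_PER_COLUMN = 32
GLOBAL_WIRE_PF = 0.53        # 5-mm column wire
WIRE_LEN_UM    = 5000.0
QUOTED_SKEW_PS = 530.0

swe_load_ff = 4 * DFF_LOAD_FF + LOCAL_WIRE_FF                          # 56.8 fF per SWE
column_pf   = GLOBAL_WIRE_PF + SWE_PER_COLUMN * swe_load_ff / 1000.0   # ~2.35 pF
c_ff_per_um = column_pf * 1000.0 / WIRE_LEN_UM                         # ~0.47 fF/um

# t = r*c*l^2/2  ->  r = 2*t/(c*l^2); this r is *inferred*, not a quoted value.
r_ohm_per_um = (2 * QUOTED_SKEW_PS * 1e-12) / (c_ff_per_um * 1e-15 * WIRE_LEN_UM ** 2)

print(f"per-SWE clock load : {swe_load_ff:.1f} fF")
print(f"per-column load    : {column_pf:.2f} pF")
print(f"distributed C      : {c_ff_per_um:.2f} fF/um")
print(f"implied r          : {r_ohm_per_um:.3f} ohm/um")
```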
F. Switch Element

Fig. 12. Switch element (SWE) circuit.

The circuit diagram of the SWE is shown in Fig. 12. The input data and field-indicator signals are applied to the top and left sides of the SWE; they are the outputs of two other SWE's, one on the left and one on the top. Because the indicator signals at the south and east sides of the SWE have identical waveforms, the corresponding outputs are tied together. The data and control inputs are first latched by DFF's, so that both their inverted and noninverted versions are available. The principle of SWE operation is as follows. At the beginning of each cell period, the precharge signal goes low to precharge node A to high through T1 (T5, T6, and T7 are all off). Node B is likewise precharged to high through T3 (T8 and T9 are off), forcing the SWE to enter (or stay in) the cross state (i.e., the "cross" signal is high). Once node A (or B) is precharged past the threshold of the following inverter, inv1 (or inv2), the output of the inverter goes low, which in turn turns on the weak p-transistor T2 (or T4). This keeps node A (or B) high as long as there is no path pulling the node to GND through the transistor groups led by T5 or T6 (or T8).

When the address-indicator signal is asserted, the addresses of the north and west inputs are compared bit by bit by the transistor group led by T5. If the two address bits differ while the address-indicator signal is high, the T5 transistor group discharges node A. The output of inverter inv1 then goes high once the threshold of inv1 is reached; this turns on T10 and forms a two-inverter latch, a positive feedback loop, which pulls node A down immediately, turns off T12, and thus keeps node B high and the SWE in the cross state. Note that T7 is always on except while node A is being precharged.

At each SWE, if the addresses of the input cells match, their priorities are then compared bit by bit, most significant bit first. The priority comparison stops as soon as a bit position is reached where the west cell's priority bit is one and the north cell's is zero; since a smaller value means a higher priority, this means the north cell's priority level has been detected to be higher than the west cell's, and there is no need to compare the remaining priority bits. This condition discharges node A through the T6 transistor group, which in turn prevents node B from being discharged and thus keeps the SWE in the cross state. If, instead, the west cell's priority is detected to be higher than the north cell's, node B is discharged through the T8 transistor group, toggling the cross state into the noncross state. Transistor T11 plays the same role in discharging node B as T10 does for node A.

P and N transistors are mixed in the evaluation paths in order to phase-align all gate signals for the transistors on the same evaluation path; any small overlap between the gate signals could accidentally discharge node A (or B). This effect can easily be seen in SPICE simulation if the P transistors are replaced with N transistors (with some gate signals inverted accordingly). Using P transistors may increase the layout area and slow the SWE slightly, but the effect is insignificant. The performance of the SWE could be improved if the P transistors in the evaluation paths were replaced by N transistors and two additional D flip-flops were added in each SWE to latch the inverses of the data inputs, but this approach requires more transistors and dissipates more dynamic power; thus, it was not adopted in our design.

IV. TIMING DIAGRAMS OF THE ARC CHIP

Fig. 13. Timing diagram of the ARC chip with group size four.
Fig. 14. Timing diagram of the ARC chip with group size eight.

Fig. 13 shows a timing diagram of the input and output signals when the ARC chip operates in the multicast mode and is configured as eight groups with four links per group; the group-size signals are connected low while the multicast signal is connected high. The outputs of the first, second, ..., and eighth groups are taken from the first, second, ..., and eighth sets of four output pins, respectively. A 2-b-long precharge signal is asserted low just before the beginning of each cell cycle (with respect to the first west or north input) to precharge an internal node of the SWE's to high. A 1-b-long routing-bit-position indicator, also referenced to the first west or north input, is applied to the chip to inform the control circuit in each SWE of the multicast bit position of a specific group. A priority-indicator signal is applied to the chip to specify the range of the priority field, again with respect to the first west or north input. Cell inputs are applied to the chip in a byte-aligned manner, as shown in Fig. 13, in order to guarantee the alignment of the cells when they arrive at every SWE. Cell outputs are bit-aligned, with a 1-b skew between consecutive outputs within the same group. The routing-bit-position indicator at the output is delayed in proportion to the number of groups in the ARC chip, in this case by 8 b.

Fig. 14 shows the timing diagram of the I/O signals when the ARC chip operates in the multicast mode and is configured as four groups with eight links per group. The main difference from the group-size-4 case is that, for group size 8, two consecutive outputs within the same group are bit-aligned with a skew of 1 b when they belong to the same 32 × 4 SWE array, but with a skew of 2 b when they lie in two different 32 × 4 SWE arrays, owing to the additional DFF in the multiplexer circuits between every two 32 × 4 SWE arrays.
Fig. 15. Timing diagram of the ARC chip with group size 32.
Fig. 16. Alignment of input signals when cascading two ARC chips.

Fig. 15 shows the timing diagram of the I/O signals when the ARC chip is configured with a group size of 32 links. The skews at the outputs are similar to those in Fig. 14. The timing diagram of the ARC chip with group size 16 is not given here, since it is similar to those in Figs. 14 and 15.

When ARC chips are vertically cascaded, the input signals are still aligned with respect to the byte clock, as shown in Fig. 16. Inputs that belong to the same set of eight (modulo 8) are applied to the chip with the same phase. Between successive sets of eight inputs, a 1-byte delay is required, as shown in Figs. 13–15, if they are applied to the same chip; otherwise, a 2-byte delay is required. This is due to the additional DFF's added at the input and output buffers of the chip, as shown in Fig. 16. The seven DFF's at the south outputs are added to ensure that the delay skew between chips is referenced to the byte clock instead of the bit clock. An additional input, called bypass, can short out one of these seven DFF's; this may be needed to adjust the delay from the last DFF at the south outputs of the upper ARC chip to the first DFF at the north inputs of the next ARC chip (see Fig. 16).

V. TESTING AND SIMULATION RESULTS

Fig. 17. Photograph of the ARC chip.
Fig. 18. Photograph of a testing result with 5 ns per grid.

The 32 × 32 ARC chip has been designed and fabricated using 0.8-µm CMOS technology with a die size of 6.6 × 6.6 mm. Its photograph is shown in Fig. 17; note that the chip is pad limited. The chip has been tested successfully up to 240 MHz using a high-speed oscilloscope, a timing analyzer, and a pattern generator capable of generating signals up to 1 GHz. Fig. 18 shows a testing result in which the output priority-field indicator and the first few south outputs are shown from top to bottom. The range of the priority field of the cells at the output is specified by the indicator, which is chosen to be 7 b in this test. Since the indicator is taken from the bottom-left corner of the SWE array, from which the first south output comes out, it is aligned with that output; the second south output is delayed by one clock cycle with respect to the first, and the third by two clock cycles. In this test, cells are applied to the west inputs while the north inputs are tied high. It is observed that the south outputs come out in sorted priority order: the priority of the first output is 1000100, the highest among all inputs; the priority of the second is 1000101, the second highest; and the priority of the third is 1000110, the third highest. Note that the cell length used here is kept short in order to fit one cell cycle in the viewing window of the oscilloscope.

We have developed a rigorous testing methodology to identify bad chips within the first one or two test cases; a sketch of how the corresponding test vectors can be generated follows the list below.

1) Basic address test. This test verifies that the address comparison circuits in each SWE, i.e., the transistor path starting with T5 in Fig. 12, have no defects such as stuck-at-one or stuck-at-zero, which could be caused by a transistor failure or a short between wires. A unique address is applied to each north and west input, and a unique identifier is placed in each cell payload to distinguish the cells at the outputs. The unique addresses of the west and north inputs are set to their port numbers, while the priority of the north inputs is set higher than that of the west inputs, so that every SWE remains in the cross state. Thus, whatever is applied to a north input should come out from the south output of the same column; if any south output disagrees with the cell applied to its column, one of the SWE's in that column must have a defect.

2) Basic priority test. This test is similar to the basic address test except that it concentrates on the priority comparison circuits in the SWE's, i.e., the evaluation paths beginning with transistors T6 and T8 in Fig. 12. We applied west input cells with an identical address but unique priorities and unique identifiers in the cell payload, and north input cells with the same address as the west inputs but with the lowest priority. The cells appearing at the south outputs should then be sorted according to their priority levels; any output with a misordered priority provides clues for tracing possible failed SWE's.

3) Location test. This test scans through all SWE's one by one by applying specific priorities to exercise the two priority evaluation paths in each SWE (led by the T6 and T8 transistors). To test the SWE at a given row and column, we applied the same addresses and priorities to the west and north inputs as in the basic address test, except for the two inputs that meet at that SWE. The priorities of these two inputs are set to a specific pattern, in which the leading bits are identical and the remaining bits are "do not care," chosen to exercise the evaluation path from T6 to GND; another such pattern is used to exercise the evaluation path from T8 to GND. Note that the results of this test are monitored at the corresponding south output.
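The following sketch (Python; the vector format, port count, and field width are illustrative choices of ours) generates stimuli in the spirit of the basic address test and states the expected pass criterion.

```python
def basic_address_test_vectors(n_ports=32, prio_bits=7):
    """Generate stimuli for the basic address test (illustrative).

    Every west and north input gets its port number as a unique address
    and a unique payload identifier; north priorities are made higher
    (numerically smaller) than west priorities, so even where a west and
    a north address coincide the SWE stays in the cross state.  Hence the
    cell fed into north input j is expected unchanged at south output j.
    """
    west, north = [], []
    for port in range(n_ports):
        west.append({"addr": port, "prio": (1 << prio_bits) - 1,  # lowest priority
                     "ident": f"W{port}"})
        north.append({"addr": port, "prio": 0,                     # highest priority
                      "ident": f"N{port}"})
    expected_south = [cell["ident"] for cell in north]
    return west, north, expected_south

def check_outputs(observed_south, expected_south):
    """Return the columns whose SWE's are suspect."""
    return [col for col, (obs, exp) in enumerate(zip(observed_south, expected_south))
            if obs != exp]
```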
Fig. 19. SPICE simulation of the SWE.

Fig. 19 shows the SPICE simulation results of one SWE operating at 333 MHz. In the first cell slot, the addresses of the north and west inputs match, so the "match" signal stays high; however, the "cross" signal goes low as soon as the priority of the west input is detected to be higher (i.e., smaller in value) than the priority of the north input, which toggles the SWE. In the second cell slot, the addresses of the north and west inputs do not match, which makes "match" go low and "cross" stay high. In the third cell slot, even though the addresses match, "match" goes low when the stop condition occurs in the priority field (i.e., the north cell's priority is detected to be higher and the priority comparison stops), thus retaining "cross" high. These three cases cover almost all of the different input combinations. The length of the address field is chosen as 1 b in this simulation since, in the multicast situation, only one bit of address match is required; however, the SWE can handle any length of address and priority fields.

VI. CONCLUSION

The Abacus switch that we proposed is scalable from a few tens to a few thousands of input/output ports. It can handle multicasting, call splitting, high input line rates (e.g., OC-48), and channel grouping with cell sequence preservation. The ARC chip described in this paper is the key component for building the Abacus switch. By cascading ARC chips in two dimensions, any practical size of ATM switch can be implemented. The ARC chip has been designed, fabricated using 0.8-µm CMOS technology, and tested successfully at 240 MHz. The chip's characteristics are summarized in Table II.

TABLE II. Chip summary.

REFERENCES

[1] T. Kozaki, N. Endo, Y. Sakurai, O. Matsubara, M. Mizukami, and K. Asano, "32 × 32 shared buffer type ATM switch VLSI's for B-ISDN's," IEEE J. Select. Areas Commun., pp. 1239–1247, Oct. 1991.
[2] Y. Shobatake, M. Motoyama, E. Shobatake, T. Kamitake, S. Shimizu, M. Noda, and K. Sakaue, "A one-chip scalable 8 × 8 ATM switch LSI employing shared buffer architecture," IEEE J. Select. Areas Commun., pp. 1248–1254, Oct. 1991.
[3] T. R. Banniza, G. J. Eilenberger, B. Pauwels, and Y. Therasse, "Design and technology aspects of VLSI's for ATM switches," IEEE J. Select. Areas Commun., Oct. 1991.
[4] A. Itoh, W. Takahashi, H. Nagano, M. Kurisaki, and S. Iwasaki, "Practical implementation and packaging technologies for a large-scale ATM switching system," IEEE J. Select. Areas Commun., vol. 9, pp. 1280–1288, Oct. 1991.
[5] W. Fischer, O. Fundneider, E.-H. Goeldner, and K. A. Lutz, "A scalable ATM switching system architecture," IEEE J. Select. Areas Commun., vol. 8, pp. 1299–1307, Oct. 1991.
[6] K. Y. Eng, M. A. Pashan, R. A. Spanke, M. J. Karol, and G. D. Martin, "A high-performance prototype 2.5 Gb/s ATM switch for broadband applications," in Proc. ICC'89, June 1989, pp. 111–117.
[7] Y. Kato, T. Shimoe, and K. Murakami, "A development of a high speed ATM switching LSIC," in Proc. ICC'90, Apr. 1990, pp. 562–566.
[8] K. Genda, Y. Doi, T. Kawamura, K. Endo, and S. Sasaki, "A 160-Gb/s ATM switching system using an internal speed-up crossbar switch," in Proc. GLOBECOM'94, Nov. 1994, pp. 123–133.
[9] E. Munter, "A high capacity ATM switch based on advanced electronic and optical technologies," in Proc. ISS'95, Berlin, Germany, Apr. 1995, pp. 389–393.
[10] H. J. Chao, B. S. Choe, J. S. Park, and N. Uzun, "Abacus switch: A scalable multicast ATM switch," presented at GLOBECOM'96.
[11] N. H. Weste and K. Eshraghian, Principles of CMOS VLSI Design: A Systems Perspective, 2nd ed. Reading, MA: Addison-Wesley, 1993.

H. Jonathan Chao (S'82–M'85–SM'95) received the B.S.E.E. and M.S.E.E. degrees from National Chiao Tung University, Taiwan, in 1977 and 1980, respectively, and the Ph.D. degree in electrical engineering from the Ohio State University, Columbus, in 1985.
He is an Associate Professor of Electrical Engineering at Polytechnic University, Brooklyn, NY, which he joined in January 1992. His areas of research include large-scale multicast ATM switches, photonic ATM switches, multimedia communications, and congestion/flow control in ATM networks. He holds 14 patents and has published over 60 journal and conference papers in these areas. From 1985 to 1991, he was a Member of Technical Staff at Bellcore, NJ, where he proposed various architectures and implemented several VLSI chips for SONET/ATM-based broadband networks. From 1977 to 1981, he worked at Taiwan Telecommunications Laboratories, where he was engaged in the development of a digital switching system.

Necdet Uzun (M'96) received the B.S. and M.S. degrees in electrical engineering from the Technical University of Istanbul in 1983 and 1986, respectively, and the Ph.D. degree in electrical engineering from Polytechnic University, Brooklyn, NY, in 1993.
He is an Industry Associate Professor of Electrical Engineering at Polytechnic University, Brooklyn, NY. He was with Bellcore, Red Bank, NJ, from 1990 to 1992, as a Consultant in the Multiplex & Multiaccess Technology Group. His R&D activities in high-speed networking include electronic and photonic ATM switches, admission and congestion control, ATM LAN's, and high-speed VLSI architectures of ATM switching systems. He has also been involved with the analysis and modeling of quantization effects in filter banks.