COMPUTER TECHNOLOGY INSTITUTE
TECHNICAL REPORT No. TR2000/04/04

"IMPLEMENTING LAYER-2, CONNECTION-ORIENTED QoS ON A 3-STAGE CLOS SWITCHING ARCHITECTURE"

Fotios K. Liotopoulos & Mohsen Guizani
14/04/2000

Abstract

Effective Quality of Service (QoS) guarantees for user services must be implemented end-to-end and through all layers of the OSI hierarchy at each network node. In this paper, we consider the layer-2 design of an ATM switch, built as a scalable, 3-stage Clos switching network, and discuss architectural choices for the implementation of QoS support on this switch. An ATM-like cell format is used to encapsulate and propagate traffic and resource management information through the switch. Call Admission Control, fast inter-stage cell transfers and priority-based queue arbitration are used as congestion control mechanisms. Simulation results are presented to show the effect of architectural choices (internal buffering) on QoS parameters (CDV).

1. Introduction

Asynchronous Transfer Mode (ATM) switch architectures can be built around a variety of core media (e.g., shared memory, ring/bus architectures, or networks of switching elements) and a variety of buffering schemes (input, output, or both) [9]. When designing a layer-2 ATM switch as a network of switching elements, one can apply typical network theory and design principles, such as Quality of Service (QoS) and traffic management support, which are usually applied in higher layers of the OSI hierarchy [1,2,3,4].

In order for QoS to be meaningful and effective for user services, it must: a) be end-to-end, from source to destination, and b) span and propagate through all layers of the OSI hierarchy at each network node. Therefore, every layer-2 switch design must be aware of, and support, the QoS properties offered by the upper layers. In this paper, we refer to a proposed ATM switch architecture based on 3-stage Clos networks and discuss issues related to the implementation of QoS properties in the layer-2 design of such a switch.

1.1 QoS Overview

The ability of ATM networks to provide large bandwidth and versatile Quality of Service (QoS) guarantees can only be realized by applying effective traffic management strategies. Traffic management includes Call Admission Control (CAC), Congestion Control (CC) and Virtual Path / Virtual Channel (VP/VC) Routing [5,6,8]. The combination of Bandwidth Allocation, Call Admission Control and Congestion Control mechanisms is used to manage network resources so as to maintain the network performance needed to meet the QoS required by the user [5,6,7,10].
Standardized QoS parameters include:

  CLR:  Cell Loss Ratio
  CTD:  Cell Transfer Delay
  CDV:  Cell Delay Variation

Multiple QoS guarantees are provided by means of multiple service classes, including:

  UBR:  Unspecified Bit Rate
  CBR:  Constant Bit Rate (or DBR: Deterministic Bit Rate)
  VBR:  Variable Bit Rate (RT / NRT) (or SBR: Statistical Bit Rate)
  ABR:  Available Bit Rate

For each service class, the user is requested to specify a set of traffic parameters, such as:

  PCR:  Peak Cell Rate
  SCR:  Sustainable Cell Rate
  CDVT: Cell Delay Variation Tolerance
  BT:   Burst Tolerance

All of the aforementioned QoS principles must be considered in the design of a layer-2 ATM switch architecture such as the one proposed in this paper.

1.2 Traffic Characteristics

Traffic is characterized by parameters that describe how it must be transported, and these characteristics should be taken into account in a good network design. There are four predominant traffic characteristics: the required capacity (or required throughput) of a call, delay tolerance, response time and burstiness. These characteristics can be combined to derive the performance requirements of an application. In addition, other requirements, such as call setup response time and routing algorithms, are also important during the design phase.

a) Capacity and throughput: The capacity requirement is the actual amount of resources required by an application across a given path. The required throughput is a measure of how much application data must pass across the switch fabric and arrive at the destination within a stated period of time. Typically, this refers to user data. Some technologies use the available bandwidth more efficiently than others, thus providing higher throughput. A good switching network should be designed with high throughput capability and extra available capacity, in order to avoid congestion effects under high traffic conditions.

b) Response time: Variations in delay lead to variations in response time. Applications and users require different response times, ranging from real-time applications, such as video conferencing, to batch applications, such as file transfers or electronic mail. The key question is how important response time is to the user and the application.

c) Traffic delay tolerance: Traffic delay tolerance defines how well the application can tolerate cell delay. The bottom line is to determine the maximum delay the application and user can tolerate.

d) Traffic burstiness: Burstiness is a commonly used measure of how infrequently a source sends traffic. A source that sends traffic infrequently is said to be bursty, whereas a source that always sends at the same rate is said to be non-bursty. Burstiness is defined as the ratio of the peak rate to the average rate of the traffic, based on a specific sampling period for the data:

  Burstiness = Peak Rate / Average Rate
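As a simple illustration of this definition, the following minimal Python sketch (our own illustration, not part of the proposed switch design; the function name and the per-window cell counts are hypothetical) estimates burstiness from cell counts observed over fixed-length sampling windows.

    def burstiness(cells_per_window, window_sec):
        """Estimate burstiness as Peak Rate / Average Rate.

        cells_per_window : number of cells observed in each sampling window
        window_sec       : length of one sampling window, in seconds
        """
        rates = [c / window_sec for c in cells_per_window]   # cells/sec per window
        peak_rate = max(rates)
        avg_rate = sum(rates) / len(rates)
        return peak_rate / avg_rate

    # Example (hypothetical numbers): a bursty source vs. a CBR-like source
    print(burstiness([1200, 0, 0, 1200, 0, 0], 0.1))         # ~3.0 (bursty)
    print(burstiness([400, 400, 400, 400, 400, 400], 0.1))   # 1.0 (non-bursty)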
In Section 2, we briefly present the architectural characteristics of the proposed switch. In Section 3, we outline architectural choices that can be adopted in order to provide QoS support for this specific switch architecture. In Section 4, we discuss how resource management functions should take into account and exploit layer-2 QoS support, in order to provide more reliable and concrete user services. Finally, in Section 5, we present simulation results and an analysis of the effect of switch fabric buffering on cell delay variation, and hence on QoS.

2. The Proposed Switch Architecture

In this paper, we propose a scalable switch architecture based on a 3-stage Clos network of switching elements. Given this design model, we can apply traditional network design principles and analysis to it (such as QoS guarantees), in order to produce a network element (i.e., a switch) that blends transparently with existing networks (LANs, WANs, etc.).

2.1 Three-Stage Clos Networks

Fig. 1. A Symmetric 3-stage Clos Network.

Three-stage Clos networks have been used for multiprocessor interconnection networks, as well as for data switching fabrics. For example, the Memphis switch of the IBM GF-11 parallel computer uses a symmetric three-stage Clos network to interconnect 576 processors. A variety of 3-stage Clos network designs have also been proposed for high-performance switching systems, including Fujitsu's FETEX-150, NEC's ATOM and Hitachi's hyper-distributed switching system. Clos networks can be used to efficiently implement low-latency, high-bandwidth, connection-oriented ATM switching. Clos-type switching fabrics have also been found to be economically and technically attractive. Combined with their inherent fault-tolerance and multi-path routing properties, this makes them a very appealing choice for reliable broadband switching.

A three-stage Clos network consists of three successive stages of switching elements, which are interconnected by point-to-point links. In a symmetric three-stage network, all switching elements in a stage are uniform (see Figure 1). In such a symmetric Clos network, there are r switches of size (n x m) in the first stage, m switches of size (r x r) in the second stage and r switches of size (m x n) in the third stage. This network thus interconnects n*r input ports with n*r output ports in one direction. Bidirectional switching is achieved by using half of the input ports for the reverse direction.
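For reference, the classical single-rate, point-to-point conditions for such a symmetric Clos network are m >= 2n-1 for strictly nonblocking operation (Clos) and m >= n for rearrangeably nonblocking operation (Slepian-Duguid); the multirate CAC conditions referred to in Section 3.1 are more involved. The following minimal sketch (our own illustration; the function name is hypothetical, and n=16 for the 8x16x8 configuration of Section 5.1 is inferred from its 128 i/o ports) computes the port count, the aggregate capacity and this classical classification.

    def clos_summary(n, m, r, port_rate_mbps=155.52):
        """Symmetric Clos network C(n, m, r): r first-stage (n x m) switches,
        m middle-stage (r x r) switches, r third-stage (m x n) switches.
        Classification uses the classical single-rate conditions only."""
        ports = n * r
        if m >= 2 * n - 1:
            blocking = "strictly nonblocking (single-rate)"
        elif m >= n:
            blocking = "rearrangeably nonblocking (single-rate)"
        else:
            blocking = "blocking"
        capacity_gbps = ports * port_rate_mbps / 1000.0
        return ports, round(capacity_gbps, 1), blocking

    # The 8x16x8 configuration simulated in Section 5.1 (n=16 assumed):
    print(clos_summary(n=16, m=16, r=8))
    # -> (128, 19.9, 'rearrangeably nonblocking (single-rate)')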
2.2 A Novel Clos Switch Architecture

Figure 2 depicts the basic components of a recently proposed 3-stage Clos switch architecture, scalable to 32 switching elements in each stage. Each switching element consists of up to 32 input ports and 32 output ports (n=r=32), with internal input and output buffering. Such a switching element is constructed in a modular and scalable way, consisting of up to 8 Core Switch Modules (CSMs), described below.

Fig. 2. A scalable, modular, 3-stage Clos Switch Architecture (switching elements built from Core Switch Modules interconnected by local busses (LBus) and a global bus (GBus), with local and global input/output FIFOs: LiQ, LoQ, GiQ, GoQ).

The Core Switch Module (CSM)

Figure 2 also shows the modular implementation of a typical switching element (i.e., a stage-switch) of the Clos switching fabric. It consists of 2 (and up to 8) modules (CSMs) operating in parallel, interconnected via a (global) shared bus. Each CSM has four serial input ports and four serial output ports (e.g., ATM ports operating at 155 Mbps (OC-3), also referred to as Synchronous Transport Signal level 3 (STS-3c)). The serial data stream at each input port is transformed into parallel 32-bit words by a Serial-In-Parallel-Out (SIPO) shift register. These words are subsequently stored in a (local) input FIFO queue (LiQ).

Cells can be switched from any input FIFO to any output FIFO within the same module or across different modules of the same switch. In this design, we break the single shared bus down into a two-level hierarchy of shared busses: one local to the module (LBus) and one global to the entire switching element (GBus). Modules of the same switch communicate with each other via the GBus, by means of a pair of (global) input and output FIFOs (GiQ, GoQ). A cell destined to an output port of the same module is transferred directly to the corresponding local FIFO (LoQ), via the local bus (LBus). If the destination of the cell is an output port of a different module within the same switch, then the cell is first transferred to the GoQ FIFO and, through the GBus, to the GiQ of the target module. At the remote module, the cell is transferred from the GiQ to the appropriate LoQ, before it exits the current switching element and moves to the next stage.

Fig. 3. An example of cell routing through 3 switching elements, one per stage.

With this design approach, each module contributes only one input load and one output load to the total load of the GBus. Therefore, this design can scale to 8 or even 16 modules per switching element. Given that each module has a switching capacity of 4 x 155.52 Mbps ≈ 622 Mbps, the switching element can scale up to 10 Gbps. A 3-stage Clos network consisting of such switching elements can therefore achieve up to 160 Gbps of strictly nonblocking switching capacity (in a [32 x 64 x 32] configuration), or up to 320 Gbps of rearrangeably nonblocking switching capacity (in a [64 x 64 x 64] configuration).

With respect to the LBus, arbitration is performed among the four LiQ FIFOs and the GiQ FIFO, feeding the four LoQ FIFOs and the GoQ FIFO. Therefore, the LBus is electrically loaded with only 5 inputs and 5 outputs, which is well below its electrical limits, but is kept at this level for overall performance reasons. For a 16-module switch implementation, the corresponding global bus loading is 8 or 16 inputs and 8 or 16 outputs, depending on the implementation. A central scheduler performs control and bus arbitration over the entire switch and transfers one ATM cell (14 32-bit words) at a time from an input FIFO (LiQ) to an output FIFO (LoQ), provided that the former has at least one cell to send and the latter has adequate free buffer space to accommodate it.

Cells from the output FIFOs are then transformed into a serial stream by a Parallel-In-Serial-Out (PISO) shift register, in order to be switched to the next stage via an internal link, or to an output port. It is often desirable for the internal links to have higher capacity than the input or output ports. This is usually implemented with either wider data paths or higher transfer rates.
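The following minimal Python sketch (our own illustration; the class, queue and function names are hypothetical) summarizes the forwarding decision described above inside one switching element: a cell stays on the local bus (LBus) if its destination output port belongs to the same CSM, and otherwise takes a detour over the global bus (GBus) via the GoQ/GiQ FIFO pair.

    from collections import deque

    PORTS_PER_MODULE = 4          # four serial output ports per CSM

    class SwitchingElement:
        def __init__(self, num_modules=8):
            self.loq = [deque() for _ in range(num_modules * PORTS_PER_MODULE)]
            self.goq = [deque() for _ in range(num_modules)]   # one GoQ per CSM
            self.giq = [deque() for _ in range(num_modules)]   # one GiQ per CSM

        def forward(self, cell, in_module, out_port):
            out_module = out_port // PORTS_PER_MODULE
            if out_module == in_module:
                self.loq[out_port].append(cell)        # LBus: direct local transfer
                return "local"
            # GBus detour: LiQ -> GoQ -> GBus -> GiQ -> LoQ of the remote module
            self.goq[in_module].append(cell)
            self.giq[out_module].append(self.goq[in_module].popleft())
            self.loq[out_port].append(self.giq[out_module].popleft())
            return "global-bus detour"

    se = SwitchingElement()
    print(se.forward("cell-A", in_module=0, out_port=2))    # local
    print(se.forward("cell-B", in_module=0, out_port=13))   # global-bus detour

Such detours are exactly the events penalized, and compensated for, by the prioritization scheme of Section 3.2.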
3. Layer-2 QoS Issues

3.1 Call Admission Control (CAC)

Data stream transport through the switching network is achieved by means of Connection-Oriented Network Services (CONS). CONS require the establishment of a connection between the origin and the destination before any data is transmitted. The connection is established by pre-allocating bandwidth on a path of physical links through intermediate nodes of the network. Once the connection is established, all data travels over this same path in the network. Control signalling is used to establish and tear down a connection dynamically (Switched Virtual Circuits (SVC)).

For the proposed switch architecture, we can apply a variety of Call Admission Control (CAC) algorithms for Strictly Nonblocking (SNB), Wide-Sense Nonblocking (WSN), Semi-Rearrangeably Nonblocking (SRN), or Rearrangeably Nonblocking (RNB) operation, as proposed by numerous researchers for generalized 3-stage Clos switching networks in the multirate environment. Note that the term "nonblocking" is used here to refer to the nonblocking property of the switching network at call setup time. Even an SNB switch can still block (at cell transport time) due to resource conflicts, although this probability is generally small for a moderately utilized switch.

CBR services are often sensitive to delay variations (or "jitter"). For such services, it is useful to modify the CAC function so as to avoid routing via global bus "detours" as much as possible. For example, we can save such a detour in the 3rd stage by choosing an appropriate middle-stage switch to route through.

3.2 ATM Data Switching

Data transfers from a source (some input port) to a destination (some output port) are performed according to the Asynchronous Transfer Mode (ATM) switching paradigm. Assuming 32-bit wide intra-stage data transfers, Figure 4 shows the structure of an ATM cell and its parallel transmission from an input queue to an output queue within a switch module (CSM). The small, fixed cell size and the statistical multiplexing provide better cell delay variation (CDV) behavior and thus better QoS.

The cell header contains information such as the cell's Virtual Path Identifier (VPI, 16 bits), Virtual Channel Identifier (VCI, 8 bits), Generic Flow Control information (GFC, 4 bits), Payload Type (PT, 3 bits), Cell Loss Priority (CLP, 1 bit), Header Error Check (HEC, 8 bits) and Payload Error Check (PEC, 24 bits). The PEC field is not part of the standard ATM cell format, but we have included it for enhanced reliability and so that each cell begins at a 32-bit word boundary.

Fig. 4. The proposed structure of the ATM cell: a first 32-bit header word holding the VPI "Tag" (6 + 5 + 5 bits: 2nd-stage switch, 3rd-stage switch, output port), VCI (8 bits), GFC (4 bits), PT (3 bits) and CLP (1 bit); a second 32-bit word holding HEC-0 and PEC-0..PEC-2; followed by twelve 32-bit payload words (Octet-01 .. Octet-48).
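Using the field widths listed above and the three VPI sub-fields shown in Figure 4 (detailed in the next subsection), the first header word can be packed and unpacked as in the minimal Python sketch below. This is only an illustration: the bit ordering within the word and the function/field names are our own assumptions, not taken from the report.

    def pack_header_word(mid_sw, third_sw, out_port, vci, gfc, pt, clp):
        assert mid_sw < 64 and third_sw < 32 and out_port < 32
        assert vci < 256 and gfc < 16 and pt < 8 and clp < 2
        vpi = (mid_sw << 10) | (third_sw << 5) | out_port       # 16-bit VPI (6+5+5)
        return (vpi << 16) | (vci << 8) | (gfc << 4) | (pt << 1) | clp

    def unpack_header_word(word):
        vpi = (word >> 16) & 0xFFFF
        return {
            "mid_sw":   (vpi >> 10) & 0x3F,   # VPI bits 0..5  : middle-stage switch
            "third_sw": (vpi >> 5) & 0x1F,    # VPI bits 6..10 : third-stage switch
            "out_port": vpi & 0x1F,           # VPI bits 11..15: third-stage output port
            "vci":      (word >> 8) & 0xFF,
            "gfc":      (word >> 4) & 0x0F,
            "pt":       (word >> 1) & 0x07,
            "clp":      word & 0x01,
        }

    w = pack_header_word(mid_sw=17, third_sw=4, out_port=9, vci=200, gfc=0, pt=1, clp=0)
    print(unpack_header_word(w))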
VPI-based Self-Routing

In the proposed cell format, the VPI field contains all the routing information the cell needs in order to reach its destination output port at the third stage of the switching network. The first 6 bits of this field (VPI0:5) denote the middle-stage switch the cell is chosen to go through; this, in effect, identifies the output port of the first-stage switch to which the cell should be switched. The next 5 bits (VPI6:10) denote the destination third-stage switch or, equivalently, the output port number of the middle-stage switch. Finally, the last 5 bits (VPI11:15) denote the output port number of the third-stage switch. The VCI field contains a virtual circuit ID number, which identifies a virtual circuit (or service) relative to its corresponding ATM virtual path. Thus, the Absolute Service Identifier (ASI) is derived from the concatenation of VPI6:15 and VCI, forming an 18-bit identifier.

The functionality of the aforementioned fields, with respect to their QoS significance, is described below:

VPI: It carries the routing information used to self-route the cell through the switching network. The routing function can be implemented such that the overall CTD and CDV are reduced for those classes of services that have the strictest CDVT requirements.

VCI: It can be used to implement VC-based prioritization and "per-VC" scheduling within the switching fabric.

GFC: This field is used to carry priority information based on the specific route a cell follows. In particular, cells that are routed through global FIFOs (i.e., cells that migrate between modules of the same switch, taking a detour via the global bus) are penalized with higher CTD and CDV. To compensate for this overhead, these cells need to be treated with a higher priority, inversely proportional to their penalty. Moreover, simulation results indicate that, due to the increased congestion in the early stages, the earlier the penalty occurs, the higher it is. Therefore, the priorities must be assigned as shown in Table 1 (a lookup sketch is given after this field list):

  1st-stage GBus detour | 2nd-stage GBus detour | 3rd-stage GBus detour | Priority (bigger = higher)
           X            |          X            |          X            |     0
           X            |          X            |          -            |     1
           X            |          -            |          X            |     2
           -            |          X            |          X            |     3
           X            |          -            |          -            |     4
           -            |          X            |          -            |     5
           -            |          -            |          X            |     6
           -            |          -            |          -            |     7

  Table 1. Assigned priorities based on Global Bus detours per stage (X: no detour, -: with detour).

This prioritization scheme is called "Penalty Compensation Prioritization" (PCP) and can be used to provide more consistent QoS parameter values, by compensating for architecture-imposed QoS penalties.

PT: The "Payload Type" field encodes the service class type in 3 bits. This information is used to perform "per-service-class" prioritization and scheduling. Values 0-5 encode up to 6 service-class types (i.e., user data), a value of PT=6 indicates a signalling cell, and PT=7 indicates a maintenance cell.

CLP: This 1-bit field indicates whether the specific cell should be discarded during network congestion. CLP is set to 0 to indicate higher priority.

HEC, PEC: The main function of these fields is to detect multiple-bit errors and correct single-bit errors in the cell header and the cell payload, respectively.
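The following small sketch (the function and table names are hypothetical) simply encodes Table 1 as a lookup: the PCP priority of a cell is determined by which stages forced a global-bus detour, with more detours, and detours in earlier stages, yielding a higher compensating priority.

    PCP_PRIORITY = {
        # (detour_1st, detour_2nd, detour_3rd) -> priority (bigger = higher)
        (False, False, False): 0,
        (False, False, True):  1,
        (False, True,  False): 2,
        (True,  False, False): 3,
        (False, True,  True):  4,
        (True,  False, True):  5,
        (True,  True,  False): 6,
        (True,  True,  True):  7,
    }

    def pcp_priority(detour_1st, detour_2nd, detour_3rd):
        return PCP_PRIORITY[(detour_1st, detour_2nd, detour_3rd)]

    print(pcp_priority(False, False, False))  # 0: no detours
    print(pcp_priority(True, False, False))   # 3: single detour, in the 1st stage
    print(pcp_priority(True, True, True))     # 7: detours in all three stages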
Data Buffering Scheme

Both input and output buffering are used, in order to isolate the switching functions of each stage. Internal blocking is drastically reduced, or even avoided, by emptying the FIFO buffers of each stage as fast as possible. This is achieved by speeding up all inter-stage (serial) cell transfers, as well as all intra-stage (bus) transfers. Priority-based arbitration of the shared busses and queue multiplexing are used to enforce QoS requirements, by implementing various types of priority-class scheduling, including "per-class" and "per-VC" scheduling.

Simulation results indicate that small buffers result in better switch performance (i.e., lower CTD and CDV); see Section 5. In order to compensate for hot-spot effects, progressively larger buffer space in each stage (i.e., larger FIFOs in stage S+1 than in stage S) may be appropriate, but this issue is still under investigation.

4. Resource Management for Congestion Control

4.1 Congestion Control

Congestion can be defined as the state of network elements in which the network cannot provide the agreed-upon QoS to already existing or newly requested connections. In other words, congestion occurs when the load offered by the user exceeds the network resources available to guarantee the required QoS. Network resources whose exhaustion can cause congestion include switch ports, buffers, ATM adaptation layer processors, and CAC processors. Two types of congestion may occur in an ATM network: long-term congestion, caused by a traffic load higher than the network can handle, and short-term congestion, caused by burstiness in the traffic. Techniques used for congestion control include admission control, resource reservation, and rate-based congestion control.

1) Admission Control: In this scheme, once congestion is detected by the network, no new virtual connections are accepted. This technique is widely used and easy to implement. For example, in a telephone network, once congestion occurs, no new phone calls can be made. Admission control may still allow new connections if the network can find a route that avoids the congested ports and provides the demanded QoS. If no such route can be found, the connection request fails.

2) Resource Reservation: A congestion control scheme similar to admission control is to reserve resources in advance. In this scheme, the network and the user establish an agreement on the traffic flow, QoS, peak bandwidth, average bandwidth, and other parameters. When a virtual connection is established, the required network resources are reserved for it. Hence, congestion rarely occurs. On the other hand, its main disadvantage is low bandwidth utilization, as not all of the resources allocated to a connection may be in use at the same time.
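To illustrate how admission control and resource reservation interact at a single link or port, the following minimal sketch (our own; the class name and the 90% ceiling, borrowed from the example operating point discussed in Section 5, are illustrative and not the CAC algorithm of the proposed switch) admits a new virtual connection only while the total reserved bandwidth stays below a configured utilization ceiling.

    class LinkCAC:
        """Toy per-link admission check: reserve peak bandwidth per VC and admit a
        new VC only while total reservations stay below a utilization ceiling."""

        def __init__(self, capacity_mbps, max_utilization=0.9):
            self.capacity = capacity_mbps
            self.max_utilization = max_utilization
            self.reserved = 0.0

        def admit(self, peak_rate_mbps):
            if self.reserved + peak_rate_mbps <= self.max_utilization * self.capacity:
                self.reserved += peak_rate_mbps
                return True
            return False      # reject: would exceed the congestion-safe operating point

        def release(self, peak_rate_mbps):
            self.reserved = max(0.0, self.reserved - peak_rate_mbps)

    link = LinkCAC(capacity_mbps=155.52)
    print(link.admit(100.0))   # True  (100.0 Mbps of the ~140 Mbps ceiling reserved)
    print(link.admit(50.0))    # False (would exceed the 90% ceiling)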
3) Rate-based Congestion Control: Due to the real-time nature of CBR and VBR traffic, congestion in these classes cannot be resolved by slowing the sources down. When congestion occurs in UBR traffic, the excess cells are simply dropped. For ABR traffic, however, it is possible to inform the sender to slow down, so closed-loop congestion control can be applied. Three procedures for rate-based congestion control have been considered:

a) Whenever the sender wants to transmit a burst of data, it must first obtain an acknowledgement. This procedure was rejected because it introduces a long delay before sending.

b) When congestion occurs, the sender is notified with a notifier cell and must then reduce its cell flow by half. This procedure was also rejected because the notifier cell may be lost.

c) Congested cells are discarded in a highly selective way: the switch scans for the end of the incoming packet and discards all cells of the next packet. This procedure was also rejected because the discarded packet may not be the one causing the congestion.

The accepted procedure is that, after every N data cells, each sender sends a resource management (RM) cell. This cell travels along the path of the data cells and is treated in a special manner by the switches. When the RM cell arrives at the destination, it is examined, updated, and returned to the sender.

Reactive congestion control may be used to provide more effective QoS. Reactive control is achieved via network feedback information collected by network management and maintenance functions. "Probe cells" (with PT=7), much like RM cells, are injected into the network at one end and collected at the other end, having registered QoS-related measurements in between. This information can then be used to perform traffic management, evaluate the congestion levels of the switching fabric and estimate the cell loss probability. Additionally, preventive congestion control may be applied by means of multiple priority queues, which are arbitrated over each shared bus by priority arbiters. Cell streams are FIFO-queued based on the header information discussed above.
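A minimal sketch of such preventive, priority-based arbitration follows (our own illustration; the function name and queue contents are hypothetical, and the actual arbiter is a hardware scheduler rather than software): in each bus slot, the arbiter grants the non-empty queue with the highest priority, e.g., the PCP priority carried in the GFC field, while cells within a queue keep FIFO order.

    def arbitrate(queues):
        """queues: dict mapping priority (bigger = higher) -> list of queued cells.
        Returns the cell granted the next bus slot, or None if all queues are empty."""
        for prio in sorted(queues, reverse=True):      # highest priority first
            if queues[prio]:
                return queues[prio].pop(0)             # FIFO order within a queue
        return None

    queues = {0: ["c1", "c2"], 3: ["c3"], 7: []}
    print(arbitrate(queues))   # 'c3' (priority 3 beats priority 0; priority 7 is empty)
    print(arbitrate(queues))   # 'c1'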
4.2 Network Resource Management

The main role of network resource management is to allocate network resources to different applications according to their service characteristics. Virtual paths can be used as a tool to achieve traffic control and resource management. Virtual path connections (VPCs) are similar to virtual circuits; hence, traffic control capabilities such as Call Admission Control (CAC) and Usage/Network Parameter Control (UPC/NPC) are simplified. Different types of network end connections lead to the following cases:

1) User-to-user applications: Specifying the QoS of the connection between two user network interfaces is the responsibility of the end users, not the network.

2) User-to-network applications: Because the VPC is a connection between a network user and the network itself, the network has to know the QoS of the internal virtual circuits.

3) Network-to-network applications: The VPC is a connection between two networks. The network has to know the QoS of the virtual circuits.

The performance of virtual circuit connections, and hence of virtual path connections, depends mainly on the allocation of network resources. This allocation affects QoS parameters such as the cell loss ratio, the peak-to-peak cell delay variation and the maximum cell transfer delay. Virtual circuits that are given the same quality of service will behave similarly. The network can allocate its resources to the different virtual paths according to the following algorithms:

a) Aggregate peak demand: In this algorithm, the total network resources allocated to one virtual path equal the total peak resource requirements of its virtual circuits. This scheme has the advantage of ensuring that all virtual circuits contained in the virtual path can operate under their peak conditions. On the other hand, it does not provide full utilization of the network resources under normal operation.

b) Statistical multiplexing: In this algorithm, the total network resources allocated to one virtual path are approximately equal to the average capacity of all of its virtual circuits. The main disadvantage of this scheme is that it causes increased cell delay variation and greater cell transfer delay, but it has the advantage of providing better utilization than aggregate peak demand.

c) Traffic shaping: Traffic shaping enhances traffic flow, reduces cell delay, and allows better network resource allocation. This is accomplished by properly spacing the cells of virtual circuit connections. Traffic shaping has to preserve the cell sequence integrity of an ATM connection, and it can be used by both the network operator and the user. It is usually employed for cost-effective network dimensioning. Examples of traffic shaping algorithms include the leaky bucket (closely related to the token bucket), which controls the cell flow so that it remains compliant with its traffic contract; a simple sketch is given below.
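The sketch below is a generic, textbook-style token-bucket conformance test (not taken from the report; names and parameters are illustrative). Tokens accumulate at a configured rate up to a maximum depth; a cell conforms only if a full token is available, otherwise it must be delayed (shaping) or marked/dropped (policing).

    class TokenBucket:
        def __init__(self, rate, depth):
            self.rate = rate          # tokens (cells) per second
            self.depth = depth        # maximum burst size, in cells
            self.tokens = depth
            self.last = 0.0           # time of the last update, in seconds

        def conforms(self, now):
            # Replenish tokens for the elapsed time, capped at the bucket depth.
            self.tokens = min(self.depth, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1.0:
                self.tokens -= 1.0
                return True           # conforming cell
            return False              # non-conforming: delay (shape) or mark/drop (police)

    tb = TokenBucket(rate=1000.0, depth=3)          # ~1000 cells/s, bursts of up to 3 cells
    arrivals = [0.0000, 0.0001, 0.0002, 0.0003, 0.0100]
    print([tb.conforms(t) for t in arrivals])       # [True, True, True, False, True]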
5. QoS Performance Analysis

Analytical models of the proposed switch, as well as simulation studies, are very useful tools for estimating the performance of the switch and its QoS tolerances and limitations. Performance analysis results can be used to define optimal operating points of the switch. For example, if simulation results indicate that for switch capacity utilizations greater than 90% the cell loss probability (CLP) becomes unacceptable (e.g., >10^-4), then the CAC function is set to accept new calls only if the overall capacity utilization is less than 90%. Another set of performance values that can be derived by analysis and/or simulation, and that is useful for determining the appropriate operating points of the switch, includes typical and maximum (or x-percentile) values of cell latency and cell delay variation. Knowing these values, we can accept QoS requirements and provide QoS guarantees with a given degree of confidence. In the following, we present such a set of simulation results, which is part of a more thorough study of how specific QoS parameters are affected by architectural choices of the switch design. In particular, we study the effect of internal FIFO sizes on the observed cell delay variation (CDV).

5.1 Simulation Results

We have simulated an 8x16x8 Clos switching network, with continuous CBR traffic consisting of multiple ATM cell streams. The simulated configuration had 128 input ports and 128 output ports. Assuming 155 Mbps i/o links, this implies a total switching capacity of approximately 20 Gbps, while with 622 Mbps i/o links the total switching capacity becomes approximately 80 Gbps. We use the "cell cycle" as the time unit, in order to make our results independent of the assumed i/o rate. A cell cycle is the inverse of the i/o cell rate, i.e., the time needed to transmit one cell (~3 usec for 155 Mbps (OC-3) i/o ports). All simulation experiments were run for 100,000 cell cycles, and in all cases steady state was reached early enough.

In order to evaluate the effect of the internal buffer sizes on the cell delay (or latency) variation (CDV), i.e., how the switch architecture "shapes" the CBR traffic, we ran several experiments varying the sizes of the local input/output FIFOs (LiQ, LoQ; parameter LQ) and the global input/output FIFOs (GiQ, GoQ; parameter GQ). In particular, we tested various combinations of LQ = {2, 4, 8, 16} cells and GQ = {4, 6, 8, 10, 12, 16, 20, 24} cells, and measured the internal cell blocking, the goodput(*) and the end-to-end cell latency. Next, we present the results for two such cases: one exhibiting strong congestion effects, which shows more clearly the effect of the FIFO sizes on CDV, and one with near-zero internal cell blocking, which shows that it is possible to provide strong CDV guarantees for CBR services. In both cases, interesting conclusions can also be drawn regarding the optimum sizes of the internal FIFOs.

(*) Goodput is defined as the ratio of the transferred cell rate to the offered cell rate. Goodput is 100% if no internal blocking occurs, so that all incoming cells are transferred through the switching fabric unhindered. Goodput is 50% if, in the time it takes to offer two incoming cells, only one actually exits the switching fabric, due to internal blocking.

Fig. 5. Cell latency distribution for LQ=4 cells and various Global FIFO sizes (GQ = 4, 6, 8, 10, 12). Plot: "Cell Latency Distribution vs. Queue Size (8x16x8, 128 OC-3 i/o Clos switch)"; x-axis: cell cycles (1 cyc = 3 usec), y-axis: % of transferred cells.

Case I: Internal bus speedup = 8 x (i/o rate), inter-stage link speedup = 1 x (i/o rate)

Due to the relatively small speedups, this case exhibited high internal cell blocking, yielding goodput values between 63% and 64% for all FIFO sizes simulated. Figure 5 shows the cell delay distribution curves for LQ=4 cells and GQ={4, 6, 8, 10, 12} cells. Similar results were observed for LQ=2 and the same GQ values, suggesting that the combination of LQ=2 and GQ=4 is the best choice if we aim at reducing CDV. Most of the curves have two "peaks", the former of which appears to correspond to the LiQ/LoQ delays and the latter to the GiQ/GoQ delays. Since most of the conflicts are resolved in the first stage, the two-peak pattern does not repeat three times. Figure 6 shows the cell delay distribution curves for LQ=16 cells and GQ={8, 12, 16, 20, 24} cells. Similar results were observed for LQ=8 and the same GQ values, suggesting that the combination of LQ=8 and GQ=8 is the best choice. In general, we observe that in a congested situation larger FIFO sizes do not result in higher throughput (or goodput), but they certainly increase CTD and CDV.

Fig. 6. Cell latency distribution for LQ=16 cells and various Global FIFO sizes (GQ = 8, 12, 16, 20, 24). Plot: "Cell Latency Distribution vs. Queue Size (8x16x8, 128 OC-3 i/o Clos switch)"; x-axis: cell cycles (1 cyc = 3 usec), y-axis: % of transferred cells.

Case II: Internal bus speedup = 16 x (i/o rate), inter-stage link speedup = 2 x (i/o rate)

In this case, due to the doubling of the speedups, the measured goodput reached 100% (i.e., no congestion effects were observed). The cell latency distributions for the same sets of FIFO-size parameters as in Case I are shown in Table 2.
This table shows that for LQ=8 we obtain a very narrow cell latency distribution (Probability[cell latency > 6 cell cycles] = 0). The goodput and internal cell blocking measurements indicate that in all four cases the goodput is near 100%, and that for LQ=8 and LQ=16 the internal cell blocking is practically zero (i.e., no cell blocking occurred during the simulation).

  Cell latency   LQ=2, GQ=4,8,10,12,16   LQ=4, GQ=4,8,10,12,16   LQ=8, GQ=8,12,16,20,24   LQ=16, GQ=8,12,16,20,24
  3-4 cycles     ~61.19%                 73.55%                  73.40%                   73.40%
  4-5 cycles     ~32.47%                 24.57%                  24.73%                   24.73%
  5-6 cycles     ~5.48%                  1.72%                   1.87%                    1.87%
  6-7 cycles     ~0.86%                  0.16%                   0.00%                    0.00%
  >7 cycles      0.00%                   0.00%                   0.00%                    0.00%

  Table 2. Cell latency distribution results for 16x internal bus speedup and 2x inter-stage link speedup [goodput ~100%].

6. Conclusions

In this paper, we propose the design of a switch as a scalable network of switching elements. This concept has the advantage that we can apply traditional network design and analysis principles during the design of the switch. In this way, the switch itself, as a macro-network (LAN/WAN) component, can carry QoS properties through to all OSI layers in a more natural way. To demonstrate this concept, we discuss QoS issues and their implementation in the layer-2 design of a novel 3-stage Clos switching architecture. Since internal buffering plays a crucial role in the performance of the switch, we study how the proposed buffer design choices affect QoS properties, such as cell delay and cell delay variation. Simulation results and analysis are used to support our claims. Traffic management and resource management are essential for congestion prevention and control, and the related issues are briefly touched upon.

Clearly, many more issues could be discussed, and the analysis in this paper does not exhaust the subject; it does, however, provide good motivation and insight for further work. Our immediate plans include extending the current study to other QoS measures and properties, by means of simulation and analytical techniques, as well as evaluating other architectural models and switch designs in a similar manner.

7. References

[1] Special issue on "Flow and congestion control", IEEE Comm. Mag., vol. 34, no. 11, Nov. 1997.
[2] Special issue on "Bandwidth allocation in ATM networks", IEEE Comm. Mag., vol. 35, no. 5, May 1997.
[3] U. Black, QoS in Wide Area Networks, Prentice Hall, January 2000.
[4] D. Black, W. Leitner, Building Switched Networks: Multilayer Switching, QoS, IP Multicast, Network Policy, and Service Level Agreements, Addison Wesley Longman, January 1999.
[5] A. Croll, E. Packman, Managing Bandwidth: Deploying QoS across Enterprise Networks, Prentice Hall, April 1999.
[6] D. McDysan, QoS and Traffic Management in IP and ATM Networks, McGraw-Hill, November 1999.
[7] P. Ferguson, G. Huston and C. Long, Quality of Service: Delivering QoS on the Internet and in Corporate Networks, John Wiley & Sons, January 1998.
[8] M. Murata, "Requirements on ATM Switch Architectures for Quality-of-Service Guarantees", IEICE Trans. on Communications, vol. E81-B, no. 2, pp. 138-151, February 1998.
[9] J.S. Turner and N. Yamanaka, "Architectural Choices in Large Scale ATM Switches", IEICE Trans. on Communications, vol. E81-B, no. 2, pp. 120-137, February 1998.
[10] Related WWW link: http://www.qosforum.com/docs/faq/