Initial studies of SCI LAN topologies for local area clustering

Haakon Bryhni* and Bin Wu**, University of Oslo, Norway
* Department of Informatics. Email: bryhni@ifi.uio.no
** Department of Physics. Email: bin.wu@fys.uio.no

Abstract

A Local Area Network (LAN) can be built using the ANSI/IEEE Std 1596-1992 Scalable Coherent Interface (SCI) as the underlying hardware protocol. For such a LAN, SCI acts as a new physical layer, and traditional protocols (e.g. TCP/IP) may be run on top of SCI. In a first approach, SCI is used to implement message passing, while new protocols and functions that take advantage of the shared memory functionality can be developed. In this paper we present initial studies of throughput and latency at the physical layer for some SCI LAN candidate topologies. This performance is a hard upper bound for application performance. SCI LANs based on a ring, a switched star and a mixed topology of hubs with a switched backbone are considered, and simulations show how the performance metrics throughput and latency vary as functions of topology and the physical transmission distance of the network. The new point in this study is the focus on the effect of increasing transmission distance, showing that a LAN, as well as a closely coupled system, can be built using the SCI interconnect.

1. Introduction

The ANSI/IEEE 1596-1992 Scalable Coherent Interface is a standard giving computer-bus-like services to nodes in a distributed environment. SCI scales well, and avoids the limitations of buses by using many point-to-point links and a hardware-embedded packet protocol to provide coherent and non-coherent shared memory to the nodes. The nodes in SCI may be processors, memories and I/O units, or complete workstations (WS) connected to SCI by means of a Cluster Adapter (CLAD) performing protocol conversion between the WS bus and the SCI interconnect, as shown in figure 1 [CLAD].

Figure 1: Local area computing environment (workstations 1..N, each with CPU, memory and I/O, attached through CLADs to an SCI LAN acting as a local-area "bus" with topology X)

The flexibility of the interconnect allows a wide range of topologies, ranging from tightly to loosely coupled systems. In this paper, we examine some candidate topologies for a loosely coupled system of SCI-connected workstations forming a Local Area Computing Environment (LACE). In a LAN environment, distances between nodes are typically in the range of 10 to 1000 m. Most studies of topologies for SCI so far have been done on tightly coupled systems [e.g. Bothner93], where the effect of varying the physical distance between nodes is not considered in detail. The new point in this study is the focus on the effect of increasing transmission distance, showing that a LAN, as well as a closely coupled system, can be built using the SCI interconnect.

A Local Area Network is characterized by a number of factors: it covers a small geographic region, the communication channels between the interconnected computers are usually privately owned (one administrative domain, often with trusted nodes), and the channels have a low bit error rate and relatively high capacity. SCI matches all these requirements, and adds features such as hardware support for coherent shared memory, high-end throughput in the Gbyte/s range, and good architectural support for the design of multiprocessor systems.

For computer communication, there exist two distinct methods of communication: message passing and shared memory [Tanenbaum87].
The idea of an SCI LAN is closely connected to the distinction between these communication paradigms. All Local Area Networks in widespread use are based on message passing, but the advent of hardware-supported coherent shared memory (implemented by transactions in the interconnect) allows computers in a LACE to communicate using shared memory instead of message passing. The complexity of providing coherent shared memory in such a system is hidden by hardware, and the communication protocols may exploit the simpler shared memory communication paradigm. Use of the coherence mechanisms may further increase both performance and functionality. Using shared memory for interprocess communication is well known from multiprocessor systems [Delp88, Cheriton94], but is it possible to get the benefit of these features also in a loosely coupled, LAN-like system? This will require high throughput and low latency even when the transmission distance increases. In our opinion, latency is the most crucial parameter, and it is discussed in the following sections.

We use a simulation model to compare the candidate topologies. Our modelling and model assumptions are discussed in section 2. In section 3 the selected topologies are presented, and in section 4 the simulation results are discussed. Section 5 contains our conclusions.

2. Modelling and Model Assumptions

In this section, we briefly discuss the model assumptions for the simulations.

2.1 Simulation model

The simulation model is a discrete-event simulator, implemented in the Modsim II programming language [MODSIM].

2.2 Traffic

We choose one common arrival intensity λ, and a negative exponential arrival distribution is used. λ is selected to give a heavy traffic load, but low enough to give reasonable latency. We use a simple traffic model, where burstiness, variations in application frame lengths and the properties of segmentation and reassembly of packets are not considered. All traffic is modelled as memory accesses with a 64-byte cache line granularity. Since we do not know the locality of SCI LAN accesses, we limit our study to the two cases shown in figure 2: 1) uniformly distributed targetIds, and 2) the central server case, where one server takes 90% of the total traffic (typically a multimedia server), while the rest of the traffic is uniformly distributed over all nodes.

Figure 2: Two cases of traffic distribution

We model the traffic between the workstations and the server as SCI DMOVE64 transactions. This is an SCI transaction moving 64 bytes of data with 16 bytes of overhead. This transaction does not require the SCI response subaction.

2.3 Physical interface

We have chosen to use simulation parameters from the NodeChip™ SCI interface from Dolphin Interconnect Solutions, since this chip can be used in early SCI LAN applications. We simulated the 62.5 MHz CMOS chip, with 125 Mbyte/s link speed. Results for other implementations may be obtained by scaling the results to other interface chip speeds. We note, however, that the overhead introduced by higher-layer protocol processing in the end-systems usually dominates over latencies introduced in the physical layer. Different approaches may be taken to overcome the protocol processing bottleneck, but in this paper we study latency and throughput at the physical layer only, since this gives us a hard upper bound on application performance.
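As an illustration of this traffic model, the following Python sketch generates DMOVE64 arrivals for the two target distributions of figure 2. It is only a sketch of the assumptions in sections 2.2-2.3, not the Modsim II simulator used for the results; the names (next_transaction, LAMBDA) and the per-node, per-second interpretation of λ are ours.

import random

# Illustrative sketch of the traffic model in section 2.2 (not the Modsim II code).
# Every transaction is a DMOVE64: 64 bytes of data plus 16 bytes of overhead.
PAYLOAD_BYTES = 64
PACKET_BYTES = 80      # 64-byte payload + 16-byte overhead
LAMBDA = 1e6           # arrival intensity from Table 1; assumed here to be transactions per second per node

def next_transaction(now, node, n_nodes, central_server=None):
    """Return (send_time, source, target) for the next DMOVE64 generated by `node`.

    Inter-arrival times are negative exponentially distributed with rate LAMBDA.
    Targets are either uniformly distributed (case 1) or, in the central server
    case (case 2), directed to the server with probability 0.9; the remaining
    traffic is spread uniformly over the other nodes (any node except the sender).
    """
    send_time = now + random.expovariate(LAMBDA)
    if central_server is not None and node != central_server and random.random() < 0.9:
        target = central_server
    else:
        target = random.choice([n for n in range(n_nodes) if n != node])
    return send_time, node, target

With the central server case, roughly 90% of all generated packets are addressed to the server node, which is what creates the bottleneck analysed in section 4.2.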
A switch can be made from N node interface chips interconnected by e.g. a non-blocking switch core [Wu94]. We do not go into detail about the switch architecture, but assume that a 16-port switch can be designed by e.g. interconnecting 2x2 or 4x4 switches. We give some rough parameters in Table 1. Hubs interconnect several SCI links, and may be implemented as either rings or switches. The simulations will show the benefits of the different topologies. For the physical interface, we use a propagation speed of 2.0 × 10^8 m/s, which is derived from physical measurements, and we assume error-free transmission.

2.4 Parameters

The parameters are set as realistically as possible: 16 nodes is a reasonable (but still small) number of workstations in a LACE, and we believe that both ring and switch topologies can be realized using CMOS technology. Table 1 summarizes the model parameters used in the simulation.

Table 1: Simulation parameters

  Parameter                                             Value
  Number of nodes N                                     16 (sections 3.1, 3.2) and 32 (section 3.3)
  NodeChip clock speed C                                62.5 MHz
  Transmission speed R on serial fiber-optical link     1 Gbit/s
  Transaction arrival rate λ                            10^6
  Distance L                                            10-1000 m
  Signal propagation velocity                           2 × 10^8 m/s
  SCI transactions used                                 DMOVE64

  CLAD parameters
  τ_CLAD (WS bus response time)                         400 ns
  NodeChip bypass FIFO delay                            16 ns
  Number of 4 × 80-byte buffer sets (input and output,
  request and response) in the node interface           1

  Switch parameters
  Switch architecture                                   Non-blocking crossbar
  Routing strategy                                      Virtual cut-through
  τ_Decode (switch address decoding)                    10 ns
  τ_Switch (internal port-to-port delay)                100 ns
  Number of 4 × 80-byte buffer sets (input and output,
  request and response) in the switch                   2

The parameter τ_CLAD is the memory cycle time in the receiver. This parameter gives an upper bound on the receive performance of each node. Cut-through routing [Bertsekas92] gives the advantage of pipelining the switching process.

2.5 Performance metrics

Throughput X and latency τ are simulated as experienced on the bus between an SCI interface and the rest of the CLAD logic (typically local bus "glue" logic). We assume a CLAD can follow the maximum link throughput. There are two common definitions of latency, including and excluding the queuing delay in the network adapter (here: the CLAD). In the first definition, latency is calculated from the generation of a packet until it is received in the destination processor or memory. In this case, network latency, interface speed and processor/memory latency are all considered. This is interesting from a user's point of view, since the user experiences only application performance. From a network designer's view, the performance of the interconnect itself is an important upper bound on the overall system performance. To study the interconnect in isolation from the interconnected systems, we exclude the queuing delay in the network adapter, and measure from the time a packet enters the SCI interface until it is received in the interface at the destination (see figure 3).

Figure 3: Point of measurement for the definition of throughput and latency (at the interface between the SCI interface and the CLAD logic in each workstation)

We have chosen to show the overall mean throughput and the mean latency experienced by all nodes. X is then the total number of packets sent successfully per second over all nodes, and τ is the average time from when a packet enters the SCI interface in a workstation until it is received in the SCI interface in the target workstation:

  X = Σ_{i=1..N} X_i        τ = (1/N) Σ_{i=1..N} τ_i

For all measurements, gross throughput X is given; net throughput (SCI user payload) is always 64/80 = 80% of this figure. Demands for throughput and latency are given by the applications and will vary between the different traffic sources. Generally, we would like to maximize X and minimize τ. Isochronous channels (e.g. audio and video) in particular demand low latency and little variation in latency (jitter), but can to some extent tolerate loss of packets. Asynchronous channels (e.g. memory accesses) can tolerate jitter but want high throughput, low average τ and no packet loss. The fact that the transfer of large data blocks is favoured as the transmission distance increases leads to the use of SCI move transactions, which do not have response subactions. Protocols ensuring guaranteed delivery are left to the upper layers as an end-to-end responsibility, to minimize network processing of each packet and maximize throughput.

3. SCI LAN Topologies

In traditional LANs, there is a development away from bus-based topologies as used in e.g. IEEE 802.3 (CSMA/CD LAN). New infrastructure in office buildings is based on structured cabling systems using a star topology, with twisted-pair or fiber-optical cabling for point-to-point connections between central hubs and the distributed workstations. Even Ethernet is now preferred as a point-to-point link, avoiding the inherent bottleneck of shared media. Physical star topologies (as with structured cabling systems) easily allow logical ring topologies. Ring-based topologies have a number of benefits and are often a good compromise between throughput, latency and reliability. A ring uses point-to-point connections and shares the transmission capacity of the ring. The ever-increasing demand for throughput, however, will in the long run force a development towards switched networks at the cost of switch-capable hubs.

We have chosen to consider three topologies: a ring, a star using a central switch, and a mixed topology of hierarchically interconnected hubs. A hub is a component for transparently connecting two or more similar buses [HUBS]. The hubs can be designed as e.g. rings or stars.

3.1 SCI LAN ring-based hub

We use the ring topology in figure 4 as the starting point of our simulation, since this is the default SCI interconnection, and it can be realized as a "patch-panel hub" in a wire center. The new point in this simulation compared to other simulations of the SCI ring topology is again the focus on the transmission distance. We show how the performance metrics vary as a function of the workstation-to-hub cable length L. The SCI ring topology has been studied in a number of different simulations and has also been treated analytically [Scott92]. For small values of L, this experiment also serves as an informal verification of the simulation model.

Figure 4: SCI LAN ring topology

3.2 SCI LAN switch-based hub

In figure 5, the hub is designed by means of a switch. Each connection is now a point-to-point dedicated communication link.

Figure 5: SCI LAN star topology

3.3 SCI LAN hubs and backbone

In figure 6, a mixed topology of interconnected hubs is presented. We assume no contention in the backbone switches, and do not simulate the shaded part of the backbone network. Contention in the central switches will of course degrade the overall throughput X and increase the latency τ.

Figure 6: SCI LAN hubs and switched backbone
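Before turning to the simulation results, a rough zero-load estimate indicates how the distance L enters the latency figures. The Python sketch below uses the Table 1 parameters but ignores queuing, contention and the details of the SCI echo protocol, and it assumes that each ring hop in the patch-panel hub crosses 2L of cable (workstation to hub and back); it is a back-of-envelope illustration, not the simulation model, and the function names are ours.

# Back-of-envelope, zero-load latency estimates for the SCI LAN hub topologies.
# Simplified sketch using Table 1 parameters; queuing and contention are ignored,
# so the numbers only indicate orders of magnitude.

PACKET_BYTES  = 80        # DMOVE64: 64-byte payload + 16-byte overhead
LINK_RATE     = 125e6     # bytes/s (1 Gbit/s serial link)
V_PROP        = 2.0e8     # m/s signal propagation velocity
T_BYPASS_FIFO = 16e-9     # s, NodeChip bypass FIFO delay
T_DECODE      = 10e-9     # s, switch address decoding
T_SWITCH      = 100e-9    # s, switch internal port-to-port delay

T_SER = PACKET_BYTES / LINK_RATE   # serialization time of one packet (640 ns)

def switch_one_way_latency(L):
    """Zero-load latency through a switch-based hub: two cables of length L
    plus the switch delays (CLAD and queuing excluded, as in section 2.5)."""
    return T_SER + 2 * L / V_PROP + T_DECODE + T_SWITCH

def ring_one_way_latency(L, hops, n_nodes=16):
    """Zero-load latency on a ring-based (patch-panel) hub for a packet that
    travels `hops` ring segments. Assumption: each hop crosses 2*L of cable
    (workstation to hub and back) and one bypass FIFO per intermediate node."""
    return T_SER + hops * 2 * L / V_PROP + max(hops - 1, 0) * T_BYPASS_FIFO

if __name__ == "__main__":
    for L in (10, 100, 1000):
        print(f"L = {L:5d} m: switch {switch_one_way_latency(L)*1e6:6.2f} us, "
              f"ring (8 hops) {ring_one_way_latency(L, hops=8)*1e6:6.2f} us")

At L = 1000 m these expressions give roughly 11 µs for the switch path and about 80 µs for an 8-hop ring path, which is the same order of magnitude as the simulated latencies reported in sections 4.1 and 5.1.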
4. Results and Analysis

We start with the two suggested implementations of a hub, a ring-based and a switch-based scheme, and discuss these for the random traffic and the central server cases. The mixed topology of interconnected hubs is then considered for the same two traffic distributions, random and central server, respectively.

4.1 Ring and switch topology with random traffic

Figure 7 shows the raw throughput X of the system versus the workstation-to-hub length L. When the traffic pattern in the system is totally random, i.e. every node sends packets randomly to the other nodes with a uniform distribution, the system with 16 nodes connected by a switch has, as might be expected, much higher throughput than the ring-based topology. The results for the ring topology conform with analytical work and with simulations performed using other simulators [Scott92, Bothner93]. The maximum throughput cannot exceed approximately 1.5 × 125 Mbyte/s, i.e. about 190 Mbyte/s, for low values of L. It is also clear from the figure that even for a system with a workstation-to-hub distance of up to 1000 meters, the total raw throughput of the switch-based system can still approach 150 Mbyte/s.

Figure 7: Raw system throughput versus WS-to-hub distance, random traffic

The latency for the two cases (figure 8) shows an approximately linear increase for the switch-based system, which can be explained by the physical line delay. Recall that in our model the propagation speed is 20 cm/ns, so a long distance will keep the system from reaching saturation: the echo has to travel a long way back to acknowledge the successful send, and only then is the FIFO in the sender's interface freed. The latency for the switch-based system is only some tens of microseconds.

Figure 8: Average system latency versus WS-to-hub distance, random traffic

4.2 Ring and switch topology, central server case

For the central server case, we suspect the input and output links of the server to be the bottleneck. We are interested in studying how the ring-based and the switch-based systems tolerate a traffic pattern where 90% of the total traffic is taken by the server, while the remaining 10% is evenly distributed among the other 15 nodes (16 nodes in total).

We are not too surprised to see that even the switch-based system does not achieve a high throughput in this configuration; the 220 Mbyte/s maximum in figure 9 is reasonable. The 90% of the total traffic cannot all be taken by the server: 125 Mbyte/s is the peak throughput that can be consumed on this single link, while the remaining 95 Mbyte/s is contributed by the other links. The ring-based system does not suffer very much from the client-server traffic pattern. The observed throughput reduction is due to the lower utilization of the ring segment after the server, because most of the packets are received by the server. This leads to a bottleneck on the input link of the server, while its output link has a very low utilization.

Figure 9: Raw system throughput versus WS-to-hub distance, 90% load on server

We are surprised to see the much longer latency for the switch-based system in figure 10, while the ring-based system enjoys the same order of latency as it has under random traffic.
The explanation is that when a switch is used, each client sends its packets to the server via the switch; the packets are buffered in the queues of the switch, and an echo is returned immediately. This echo frees the client queue and results in a new packet being stored in the queue. However, the packets already stored in the switch queues cannot be forwarded to the destination, because of the contention at the server. Thus, the new packets in the clients' queues will receive echo.busy all the time, until the corresponding queue in the switch is freed (each packet does get through eventually, since a round-robin scheme is used). So the reason for the high latency of the switch-based system lies in our definition of latency in section 2.5: busied packets in the switch contribute to the latency, while busied packets in the CLADs do not. We could achieve better throughput and latency if the central server were situated close to the switch. The link to the server would still be a bottleneck, but we would get faster access to the server.

Figure 10: Average system latency versus WS-to-hub distance, 90% load on server

4.3 Mixed topology of interconnected hubs

It is likely that each hub-based system will be contained within a building, while remote sites are connected using a bidirectional link. We discussed such a model in section 3.3. For the mixed topology, we compare two traffic scenarios, as earlier: 1) a random destination is selected, and 2) the traffic generated in each node consists of 90% traffic to a local server (on the local hub) and 10% traffic to a destination connected to the remote hub. In both cases, we choose the local hubs to be switch-based. There is one server and 15 clients in each local hub-based network. The workstation-to-hub distance is set to 10 meters.

The throughput of the two scenarios, compared in figure 11, shows that a system with the random traffic pattern has a very high throughput for short switch-to-switch distances L2, but this throughput decreases very quickly with increasing L2.

Figure 11: Raw system throughput versus distance between the two switches

Figure 12 shows that a system with the random traffic pattern experiences the highest latency. This is because the random traffic is uniform over all nodes in the system, which means that about fifty percent of the traffic goes across the long-distance connection; this makes the average latency larger than in the 90% local traffic case, where most of the packets are sent to a local destination.

Figure 12: Average system latency versus distance between the two switches
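The fifty-percent figure for the random traffic case can be checked with a few lines of Python; this is our own illustrative calculation for the configuration of section 3.3 (two switch-based hubs with 16 nodes each), not output from the simulator.

# Fraction of generated traffic that must cross the inter-hub (backbone) link
# in the mixed topology of section 3.3: two hubs with 16 nodes each.
# Illustrative calculation only; names and structure are ours.

NODES_PER_HUB = 16
TOTAL_NODES = 2 * NODES_PER_HUB

def cross_hub_fraction(local_server_share=None):
    """Probability that a packet leaves its local hub.

    local_server_share=None models the uniform random traffic case; a value
    such as 0.9 models the case where that share of each node's traffic goes
    to a server on the local hub and the rest goes to the remote hub.
    """
    if local_server_share is None:
        # Uniform choice among the other TOTAL_NODES - 1 nodes:
        # NODES_PER_HUB of them sit behind the remote hub.
        return NODES_PER_HUB / (TOTAL_NODES - 1)
    # Central server case as simulated in section 4.3: 90% to the local
    # server, 10% to a destination on the remote hub.
    return 1.0 - local_server_share

print(f"random traffic:   {cross_hub_fraction():.2%} of packets cross the backbone")
print(f"90% local server: {cross_hub_fraction(0.9):.2%} of packets cross the backbone")

About half of the uniformly distributed traffic thus has to cross the backbone link, while only one tenth does so in the 90% local server case, which is consistent with the sensitivity to L2 seen in figures 11 and 12.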
5. Conclusion

Generally, all topologies give poor performance when the workstation-to-hub distance L increases. This is due to low link utilization when the transmission distance increases. The throughput could be increased by means of larger FIFO buffers, but this would again increase latency. The detailed conclusions for the simulated topologies are given in the following sections.

5.1 Ring/switch-based hubs, random traffic pattern

A switch-based hub gives much better throughput than a ring when the traffic is randomly distributed. For short workstation-to-hub distances, X exceeds 1.4 Gbyte/s, while the ring-based system cannot achieve more than 190 Mbyte/s; this is still a high total throughput for the 16 nodes in the system. With a workstation-to-hub distance of 1000 m, the switch-based hub throughput is still more than 150 Mbyte/s with a latency of about 14 µs. In contrast, the ring-based hub gives very low throughput at this distance, but the mean latency experienced is still fairly low (about 70 µs).

5.2 Ring/switch-based hubs, central server case

With a workstation-to-hub distance of 1000 m, the latency experienced is about 160 µs for the switch case and about 70 µs for the ring case. The corresponding system throughput is 30 Mbyte/s and 10 Mbyte/s, respectively. This is much lower than in the random traffic case, but the central server case is thought to be the more realistic traffic scenario. Again, switch-based hubs give the best throughput performance, but suffer a latency penalty because of server contention (see section 4.2 for a discussion). For short workstation-to-hub distances, latency decreases to less than 30 µs, and throughput increases to more than 200 Mbyte/s and 130 Mbyte/s for the switch and ring cases, respectively. One solution to the bottleneck problem in the central server case would be to distribute the server responsibility among the attached nodes. Sharing the server responsibility in this way would lead to a more even traffic pattern, and thus increase performance.

5.3 Mixed topology with interconnected hubs

The analysis of the mixed topology with interconnected hubs leads to the following conclusions. Traffic must be kept as local as possible, to utilize local throughput and avoid the bidirectional link interconnecting the hub-based local systems. A random traffic pattern leads to much traffic between the two hubs, and as L2 increases, both X and τ are strongly affected by the interconnection bottleneck. In the central server case, more of the traffic is local, and the interconnection bottleneck is not as pronounced as in the random traffic case.

5.4 General conclusion

A general conclusion is to keep L as low as possible in all topologies. In some cases, we would prefer rings to keep L short, but this can conflict with modern cabling principles (e.g. structured cabling is star-based), and as we have seen, switched hub solutions generally give better throughput and latency. The simulated throughput and latencies of a distributed SCI interconnect using the candidate topologies show that an SCI LAN can be designed and that high performance can be achieved. The main challenge now is to find a good architecture for cluster adapters that gives applications in the end-systems access to this estimated interconnect performance.

Acknowledgements

We give special thanks to professor Stein Gjessing, University of Oslo, for comments on the paper. Both authors are supported by the Norwegian Research Council.

Terminology

CSMA/CD    Carrier Sense Multiple Access/Collision Detect
FIFO       A buffer implementing a First In, First Out queue
LACE       Local Area Computing Environment
LAN        Local Area Network
NodeChip™  SCI interface chip from Dolphin Interconnect Solutions (Dolphin ICS AS, Norway)
SCI        Scalable Coherent Interface
TCP/IP     Transmission Control Protocol / Internet Protocol

References

[Bertsekas92] Bertsekas, D., Gallager, R., Data Networks, 2nd edition, Prentice Hall, 1992, ISBN 0-13-201674-5, p. 373.
[Bothner93] Bothner, J. W., Hulaas, T. I., Topologies for SCI-based systems with up to a few hundred nodes, Master's thesis, University of Oslo, Norway, 1993.
[Cheriton94] Cheriton, D. R., Kutter, R. A., Optimized Memory-Based Messaging: Leveraging the Memory System for High-Performance Communication, Technical Report CS-TR-94-1506, Stanford University, 1994.
[CLAD] SCI Cluster Adapter (a P1596 study-group activity), SCI two-page summary from the hplsci.hpl.hp.com server, 21 April 1994.
[Delp88] Delp, G., The Architecture and Implementation of Memnet: a High-Speed Shared-Memory Computer Communication Network, PhD thesis, University of Delaware, Department of Electrical Engineering, Newark, DE, 1988.
[HUBS] Hubs: Collections of SCI Bridges (a P1596 study-group activity), SCI two-page summary from the hplsci.hpl.hp.com server, 21 April 1994.
[MODSIM] Modsim II Reference Manual, CACI Inc., USA, 1991.
[Scott92] Scott, S. L., Goodman, J. R., Vernon, M. K., Performance of the SCI Ring, International Symposium on Computer Architecture, May 1992.
[Tanenbaum87] Tanenbaum, A., Operating Systems: Design and Implementation, Prentice Hall, 1987, ISBN 0-13-637331-3, pp. 51-70.
[Wu94] Wu, B., Bogaerts, A., Kristiansen, E. H., Muller, H., Ernesto, P., Skaali, B., Applications of the Scalable Coherent Interface in Multistage Networks, to appear in Proc. IEEE TENCON'94, "Frontiers of Computer Technology", Aug. 22-26, 1994, Singapore. (Paper available on the anonymous ftp server fidibus.uio.no, under /incoming/SCI.)