Providing Quality of Service Networking to High Performance GRID Applications
Miguel Rio, Javier Orellana, Andrea di Donato, Frank Saka, Ian Bridge, Peter Clarke
Networked Systems Centre of Excellence, University College London

ABSTRACT
This paper reports on QoS experiments and demonstrations done in the MB-NG and DataTAG EU projects. These are leading-edge network research projects involving more than 50 researchers in the UK, Europe and North America, concerned with the development and testing of protocols and standards for the next generation of high-speed networks. We implemented and tested the Differentiated Services architecture (DiffServ) in a multi-domain 2.5 Gbit/s network (the first such deployment), defining appropriate Service Level Agreements (SLAs) to be used between administrative domains to guarantee end-to-end Quality of Service. We also investigated the behaviour of DiffServ on high-bandwidth, high-delay development networks connecting Europe and North America, using a variety of manufacturers' equipment. These quality of service tests also included innovative MPLS (Multi-Protocol Label Switching) experiments to establish guaranteed-bandwidth connections for GRID applications in a fast and efficient way. Finally, we report on experiences delivering Quality of Service networking to high-performance applications such as particle physics data transfer and high-performance computation. This included the implementation and development of middleware, incorporated in the Globus toolkit, that enables these applications to use these network services easily.

I. INTRODUCTION
Recent years have seen the appearance of a wide range of scientific applications with extremely high demands from the network. Once more, scientists require the network to be pushed to its limits, challenging the idea that bandwidth availability is no longer a problem. Unlike traditional applications such as email, the WWW or even peer-to-peer systems, these applications require reliable file transfers on the order of 1 Gb/s and often have tight delay requirements. High Energy Physics, Radio Astronomy and High Performance Steered Simulations cannot achieve their goals in a sustainable, efficient and reliable way with current production networks, which are almost entirely based on a best-effort service model. Although part of the networking community believes that bandwidth over-provisioning will always solve every network problem, our work shows that Quality of Service enabled networks play a vital role in supporting high-performance applications efficiently, inexpensively and with little additional configuration work.

QoS performance has been studied exhaustively through analytical work and simulation (see for example [1]). These works are of extreme importance and relevance, but they have two drawbacks. On the one hand, simulation models have proved to be incomplete [2,3] because they fail to represent all possible real configurations of the network. On the other hand, they fail to account for several implementation details of "real" networks, including operating system tuning, driver configurations, and memory and CPU overflows. Testbed networks therefore play a vital role in network research as a way of consolidating technology and, through exhaustive debugging and testing, providing implementation guidelines to the QoS network user community.

The work reported here used two testbeds with different characteristics: a United Kingdom testbed used in the context of the MB-NG project, and a European/transatlantic one used in the context of the DataTAG project. The Managed Bandwidth - Next Generation (MB-NG) project [4] created a pan-UK networking and Grid testbed focused on advanced networking issues and interoperability of administrative domains. The project addressed the issues which arise in the sector of high-performance inter-Grid networking, including sustained and reliable high-performance data replication and end-to-end advanced network services. The MB-NG testbed can be seen in Figure 1. It consists of a triangle connecting RAL, the University of Manchester and London at OC-48 (2.5 Gb/s) speeds using Cisco 12000 routers. Each of the edge domains is built with two Cisco 7600s interconnected at 1 Gb/s.

Figure 1: MB-NG Network

The DataTAG project [5] created a large-scale intercontinental Grid testbed involving the European DataGrid project, several national projects in Europe, and related Grid projects in the USA. It involves more than 40 people and, among other things, is researching inter-domain Quality of Service and high-throughput transfers over high-delay networks (making use of an intercontinental link connecting Geneva to Chicago, which can be seen in Figure 2).

Figure 2: DataTAG Network

This paper is organized as follows. In the next section we describe the Differentiated Services architecture and some experimental results implementing it in our testbeds. In section III we describe MPLS and why it is a useful technology for GRID applications. In section IV we discuss the definition of Service Level Agreements between administrative domains, followed by the description of middleware for the control plane in section V. We describe some GRID demonstrations using our QoS testbeds in section VI and present conclusions and further work in section VII.

II. DIFFERENTIATED SERVICES
In the last decade there have been many attempts to provide a Quality of Service enabled network that extends the current best-effort Internet by allowing applications to request specific bandwidth, delay or loss guarantees. These included completely new networks, such as ATM [6], and "extensions" to TCP/IP, such as the Integrated Services architecture [7]. Both of these approaches required state to be stored, for every connection, in every router along the entire path. It was soon realized that the core routers would not be able to cope and that these architectures would not scale.
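To make the scaling contrast concrete, a rough back-of-envelope comparison can help. The flow count below is an illustrative assumption, not a measurement; only the class count ("approximately 10") comes from the text:

```python
# IntServ keeps one reservation entry per flow in every router on the path,
# whereas a DiffServ core router only keeps per-class scheduling state.
active_flows = 500_000      # illustrative flow count for a busy core link
diffserv_classes = 10       # classes typically handled in the DiffServ core

intserv_state = active_flows            # entries an IntServ core router holds
diffserv_state = diffserv_classes       # queues a DiffServ core router holds
print(intserv_state // diffserv_state)  # core state reduced by this factor
```

Under these assumptions the core holds four to five orders of magnitude less state, which is the essence of why DiffServ scales where per-flow reservation does not.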
With the Differentiated Services architecture [8] a simpler solution was proposed. Traffic at the edges is classified into classes, and routers in the core only have to schedule the traffic among these classes. Typically routers in the core deal with only approximately 10 classes, which makes the architecture easier and more manageable to implement. Routers at the edges may have to maintain per-flow state (especially the first router in the path), but since the amount of traffic there is several orders of magnitude smaller, this is not a significant problem. However, applications can now only choose the class into which they want to be classified, as opposed to giving a complete traffic specification as in the IntServ architecture. This makes experimentation in testbeds crucial to understand how applications should make use of a DiffServ enabled network.

Our first tests used iperf [9], a traffic generation tool (which we used to generate UDP flows), to understand the end-to-end effects produced by such a network. We tested the implementation of two new services, Less than Best-effort (LBE) and Expedited Forwarding (EF), which will be of use to different kinds of applications. Each class was tested individually against best-effort traffic.

The Less than Best-effort tests can be seen in Figure 3, where we injected traffic in two classes, increasing the offered load on both of them simultaneously. Here the scheduling mechanisms guarantee at least 1% of the link capacity to LBE and the rest to normal best-effort. When there is no congestion, both classes share the link equally. When the link gets saturated, LBE traffic gets dropped and, in the extreme, only the guaranteed 1% of the link capacity is given to LBE.

Figure 3: Less than Best-effort

Expedited Forwarding is the premium service in the DiffServ architecture. Whatever the congestion level of the link, EF traffic should always receive the same treatment. In our example (see Figure 4) 10% of the link capacity is allocated to EF and this is always guaranteed. If EF traffic exceeds that percentage, the excess traffic is dropped or, less frequently, remarked to a lower priority.

III. MPLS
MPLS (Multiprotocol Label Switching) [10] has its origins in the IP-over-ATM effort. It was soon realized, however, that a label switching technology could be extended to other layer 2 technologies and complement IP on a global scale. MPLS is considered by some a layer 2.5 technology, since it resides between IP and the underlying medium. It adds a small label to each packet, and the forwarding decision is made solely on the basis of this label. To establish these labels a signaling protocol, such as RSVP-TE [11], must be used. These signaling protocols allow the route that MPLS flows will take to be specified explicitly, bypassing normal routing tables (see Figure 5). Here Label Edge Routers (LERs) classify traffic into a specific label, and Label Switch Routers (LSRs) forward traffic according to these labels.

MPLS has two major uses: traffic engineering and VPNs (Virtual Private Networks). In this work we were mainly concerned with the former. The ability to switch based on a label, as opposed to traditional IP forwarding based solely on the IP destination address, allows us to manage available bandwidth more efficiently. Since GRID applications have traffic flows orders of magnitude bigger than traditional applications, but will represent a small percentage of the flows in the network, it becomes cost-efficient to select dedicated paths for their flows. This way we can be sure that the network complies with the QoS constraints and that the bandwidth available in the network is used in a more efficient way.
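The per-class behaviour observed in the LBE and EF tests (Figures 3 and 4) can be approximated with a small model. The function below is a sketch of the scheduling policy, not of the routers' actual algorithms; the share values simply restate the 1% LBE guarantee and 10% EF allocation used in the tests:

```python
def serve(capacity, ef, be, lbe):
    """Approximate per-class throughput (Mbit/s) on one link.
    EF has strict priority but is policed to 10% of the link;
    BE and LBE share the remainder, with LBE guaranteed only 1%."""
    ef_tx = min(ef, 0.10 * capacity)      # excess EF is dropped (or remarked)
    rest = capacity - ef_tx
    if be + lbe <= rest:                  # no congestion: both classes pass
        return ef_tx, be, lbe
    lbe_tx = min(lbe, max(0.01 * rest, rest - be))  # LBE keeps >= 1% of rest
    be_tx = min(be, rest - lbe_tx)
    return ef_tx, be_tx, lbe_tx

# Uncongested: BE and LBE share the link equally (flat region of Figure 3).
print(serve(2500, 0, 800, 800))
# Saturated: LBE is squeezed down towards its 1% guarantee.
print(serve(2500, 0, 3000, 3000))
```

The two calls reproduce the qualitative shape of Figure 3: equal sharing before saturation, then LBE collapsing to its small guaranteed share while best-effort takes the rest.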
Figure 4: Expedited Forwarding

Figure 5: MPLS Example

Unfortunately, at the time of writing, the MPLS implementations we used did not allow MPLS traffic to be isolated in the case of congestion. Although the signaling protocol allows a bandwidth to be specified for the flow, this will not be a guaranteed bandwidth in the case of congested links. To enforce this bandwidth guarantee, traffic needs to be policed at the edge of the network.

In MB-NG we executed tests to verify whether MPLS could be used to reserve bandwidth for a given TCP flow. In Figure 6 we can see the result of reserving an MPLS tunnel for a given TCP connection using explicit routes. Not only could we optimize the network bandwidth, but we could guarantee with very good precision a 400 Mbit/s connection to our application (in this case simulated traffic generated with iperf). The typical TCP saw-tooth behaviour did not prevent us from having a very stable TCP connection in a congested network.

Figure 6: MPLS for Flow Reservation (a single TCP connection over an MPLS tunnel, competing with background traffic)

IV. SERVICE LEVEL AGREEMENTS
An important issue in the implementation of an end-to-end Differentiated Services enabled network is the definition of Service Level Agreements (SLAs) between administrative domains. Because individual flows are only inspected at the edge of the network, strong, enforceable agreements about the traffic aggregates need to be made at each border of every pair of domains, so that all the guarantees made by all the providers can be met. One of the goals of MB-NG, as a leading multi-domain DiffServ experimental network, is to provide guidelines for the definition and implementation of Service Level Agreements.

As our first definition, we are trying to standardize the definition of IP Premium (or EF, Expedited Forwarding [12], in the DiffServ literature). This is to be used by applications that require tight delay bounds. The SLA is divided into two parts: an administrative part and a Service Level Specification (SLS) part. The SLS contains information about:
• Scope - defines the topological region to which the IP Premium service will be provided
• Flow Description – indicates to which IP packets the QoS guarantees of the SLS will be applied
• Performance Guarantees – depicts the guarantees that the network offers to the customer for the packet stream described by the flow descriptor, over the topological extent given by the scope value. The suggested performance parameters for IP Premium are:
  o One-way delay
  o Inter-packet delay variation
  o One-way packet loss
  o Capacity
  o Maximum Transfer Unit
• Traffic Envelope and Traffic Conformance
• Excess treatment
• Service Schedule
• Reliability
• User-visible SLS metrics

This SLA template is inspired by the one DANTE defined in [13] and can be read in more detail in [14].

V. MIDDLEWARE FOR THE CONTROL PLANE
The final piece in delivering the potential of QoS enabled networks to applications is a usable and efficient control plane. Even when the network is configured to support Differentiated Services and/or MPLS, it is unreasonable to assume human intervention for every flow request. There has to be a way for applications to request resources from the network. In the Integrated Services architecture, applications would use a signaling protocol such as RSVP [15] to allocate resources. In a DiffServ network, resources are not allocated along the entire path for a specific flow. The literature [16] describes two mechanisms to achieve this: in the first, RSVP is used and DiffServ clouds are seen as single hops; in the second, a Bandwidth Broker [8] per domain is used.
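The admission decision such a per-domain Bandwidth Broker makes can be sketched as follows. This is a toy model under our own naming (the class, method and link names are illustrative), not GARA's actual interface:

```python
class BandwidthBroker:
    """Toy per-domain bandwidth broker: admit a reservation only if every
    link on the requested path still has spare capacity in its EF budget."""
    def __init__(self, link_budget_mbps):
        self.budget = dict(link_budget_mbps)             # link -> EF budget
        self.reserved = {link: 0 for link in self.budget}

    def request(self, path, mbps):
        """Reserve `mbps` on each link of `path`, all-or-nothing."""
        if all(self.reserved[l] + mbps <= self.budget[l] for l in path):
            for l in path:
                self.reserved[l] += mbps
            return True     # admitted: edge routers would now be configured
        return False        # rejected: some link's EF budget is exhausted

# Hypothetical two-link domain with a 250 Mbit/s EF budget per link.
bb = BandwidthBroker({"edge-core": 250, "core-border": 250})
print(bb.request(["edge-core", "core-border"], 200))   # admitted
print(bb.request(["edge-core"], 100))                  # rejected: 50 left
```

A real broker must additionally configure policers at the edge and, for inter-domain reservations, honour the SLA agreed with the neighbouring domain, but the bookkeeping at its heart is this admission test.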
The application "contacts" the Bandwidth Broker, which is responsible for checking, guaranteeing and possibly reserving resources. It is still unclear how a network of Bandwidth Brokers will work in the future, and serious doubts about its scalability are frequently raised. In our context, with a small number of users, on a small number of computers, using a small number of administrative domains, we can postpone the scalability issue and implement a bandwidth broker architecture that works in our testbeds. We are currently researching the implementation of two architectures: the GARA architecture [17] and, to a smaller extent, the GRS architecture [18].

GARA was first presented in [17] and is tightly connected to the Globus toolkit middleware package [19] (although in the future it may be made standalone). It is designed to be the module that reserves resources (mainly, but not only, network resources) in GRIDs. GARA follows the Bandwidth Broker architecture that can be seen in Figure 7: a program called the Bandwidth Broker (BB) receives requests from applications and reserves, when possible, appropriate resources in the network. This Bandwidth Broker is designed to interact with heterogeneous networks, hiding the particulars of each router/switch implementation from the final users. Applications are linked with the client part of the middleware to interact with the BB.

In both MB-NG and DataTAG we are contributing to the development of GARA and implementing it on the testbeds, running trials with simple applications. Since applications need to be modified to interact with GARA, it is unreasonable to expect all applications to work with it. We are porting simple file transfer applications and creating documentation on how to port other ones.

Figure 7: Bandwidth Broker Architecture

VI. DEMONSTRATIONS
The concluding part of the work reported here was to execute demonstrations of high-performance applications on our QoS testbed.
These kinds of applications will drive future network research and are, therefore, a vital piece of our work. We worked with two applications: High Performance Visualisation and High Energy Particle Physics.

High Performance Computing (HPC) visualisation applications have markedly different requirements from pure data transfer applications. Because they are frequently interactive, they have tight delay constraints in both directions of the communication. When these requirements are coupled with high transfer rates (between 500 Mbit/s and 1 Gb/s), the necessity of new network paradigms becomes evident. The RealityGRID project (see Figure 8) aims to grid-enable the realistic modeling and simulation of complex condensed matter structures at the meso- and nanoscale levels, as well as the discovery of new materials. The project also involves applications in bioinformatics, and its long-term ambition is to provide generic technology for grid-based scientific, medical and commercial applications. Our tests with RealityGRID consisted of transferring visualization data from a remote high-performance graphics server to a user's client interface. The Differentiated Services enabled network allows the application to run seamlessly across multiple domains under several degrees of network congestion.

Our tests in an HEP environment tend to be bulk data transfers, where the delay requirements are not as tight. The use of LBE (Less than Best-effort) is appropriate for this scenario, since we can use spare capacity when available without affecting the rest of the traffic when the network is congested or near congestion.

Figure 8: Communications between simulation, visualisation and client in the RealityGRID project

Figure 9: HEP Data Collection

The second application area on which we focused our tests is High Energy Particle Physics.
By the nature of its large international collaborations and data-intensive experiments, particle physics has long been in the vanguard of computer networking for scientific research. The need for particle physics to engage in the development of high-performance networks for research is becoming stronger in the latest generation of experiments, which are producing, or will produce, so much data that the traditional model of data production and analysis, centred on the laboratories at which the experiments are located, is no longer viable; the exploitation of the experiments can only be performed through the creation of large distributed computing facilities. These facilities need to exchange very large volumes of data on a controlled production schedule, often with loose real-time requirements on characteristics such as delay or jitter, but with very high aggregate throughputs. The requirements of HEP in data transport and management are one of the high-profile motivations for "hybrid service networks". The next generation of collider experiments will produce vast datasets, measured in petabytes, that can only be processed by globally distributed computing resources (see Figure 9). High-bandwidth data transport between federated processing centres is therefore an essential component of the reconstruction and analysis of events recorded by HEP experiments.

VII. CONCLUSIONS AND FURTHER WORK
Experimental work in network testbeds plays a crucial part in network research. Many problems that go undetected in analytical and simulation work are found by practical experiments at an early stage. Testbeds also provide good feedback for new topics of theoretical research. In our experiments we concluded that quality of service networks will play an important role in future GRIDs and in IP networks in general. Bandwidth over-provisioning of the core network does not solve all the problems, and it is impossible to guarantee end to end.
Both Differentiated Services and MPLS allow the creation of valuable services for the scientific community with no major extra administration effort. We successfully demonstrated high-performance applications in a multi-domain QoS network, showing major qualitative improvements in the quality perceived by the final users.

In current work we are trying to integrate the GARA middleware into the OGSA [20] architecture and researching how the Bandwidth Broker architecture can be scaled to several domains. Work is also being done on the integration of AAA (Authentication, Authorization and Accounting) mechanisms into the GARA framework, to solve crucial security problems arising in the GRID community. Our QoS tests are also being extended to research the behaviour of new proposals for TCP in a Differentiated Services enabled network. This will enable us to use applications that require reliable transfers more efficiently in a QoS network.

VIII. REFERENCES
[1] Best-Effort versus Reservations: A Simple Comparative Analysis, Lee Breslau and Scott Shenker, in Proceedings of SIGCOMM 98, September 1998.
[2] Difficulties in Simulating the Internet, Sally Floyd and Vern Paxson, IEEE/ACM Transactions on Networking, volume 9, number 4, 2001.
[3] Internet Research Needs Better Models, Sally Floyd and Eddie Kohler, in Proceedings of HOTNETS-I, October 2002.
[4] http://www.mb-ng.net.org
[5] http://www.datatag.org
[6] Essentials of ATM Networks and Services, Oliver Ibe, Addison Wesley, 1997.
[7] RFC 1633 – Integrated Services in the Internet Architecture: an Overview, R. Braden, D. Clark, S. Shenker, June 1994.
[8] RFC 2475 – An Architecture for Differentiated Services, S. Blake, D. Black, M. Carlson, E. Davies, Z. Wang, W. Weiss, December 1998.
[9] http://dast.nlanr.net/Projects/Iperf
[10] MPLS: Technology and Applications, Bruce S. Davie and Yakov Rekhter, Morgan Kaufmann Series in Networking, May 2000.
[11] RSVP Signaling Extensions for MPLS Traffic Engineering, White Paper, Juniper Networks, May 2001.
[12] RFC 2598 – An Expedited Forwarding PHB, V. Jacobson, K. Nichols, K. Poduri, June 1999.
[13] SLA Definition for the Provision of an EF-based Service, Christos Bouras, Mauro Campanella and Afrodite Sevasti, Technical Report.
[14] SLA Definition – MB-NG Technical Report (work in progress).
[15] RSVP – A New Resource Reservation Protocol, Lixia Zhang, Steve Deering, Deborah Estrin, Scott Shenker, Daniel Zappala, IEEE Network, September 1993, volume 5, number 5.
[16] Internet Quality of Service: Architectures and Mechanisms, Zheng Wang, March 2001.
[17] A Quality of Service Architecture that Combines Resource Reservation and Application Adaptation, Ian Foster, A. Roy, V. Sander, in Proceedings of the 8th International Workshop on Quality of Service (IWQoS 2000), June 2000.
[18] Decentralised QoS Reservations for Protected Network Capacity, S. N. Bhatti, S. A. Sorenson, P. Clarke and J. Crowcroft, in Proceedings of TERENA Networking Conference 2003, 19-22 May 2003.
[19] http://www.globus.org
[20] The Physiology of the GRID: An Open Grid Services Architecture for Distributed Systems Integration, Ian Foster, Carl Kesselman, Jeffrey M. Nick and Steven Tuecke.

APPENDIX A: CONFIGURATION EXAMPLE
The following example shows the configuration of DiffServ in Cisco IOS. As can be seen, the amount of configuration needed in each router is minimal.

class-map match-any EF
 match ip dscp 46
class-map match-any BE
 match ip dscp 0
class-map match-any LBE
 match ip dscp 8
!
policy-map UCL_policy
 class BE
  bandwidth percent 88
 class LBE
  bandwidth percent 1
 class EF
  priority percent 10
!
interface POS4/1
 …
 service-policy output UCL_policy
 …
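For readers running Linux hosts or software routers, a roughly equivalent configuration can be sketched with Linux traffic control. This is an illustrative sketch we did not run on the testbed: the interface name and link speed are assumptions, and HTB approximates the bandwidth shares but does not reproduce Cisco's `priority percent` behaviour exactly.

```shell
# Assumed 1 Gbit/s interface eth0; shares mirror UCL_policy above.
tc qdisc add dev eth0 root handle 1: htb default 10
tc class add dev eth0 parent 1:  classid 1:1  htb rate 1000mbit
tc class add dev eth0 parent 1:1 classid 1:10 htb rate 880mbit ceil 1000mbit         # BE  (DSCP 0)
tc class add dev eth0 parent 1:1 classid 1:20 htb rate 10mbit  ceil 1000mbit prio 7  # LBE (DSCP 8)
tc class add dev eth0 parent 1:1 classid 1:30 htb rate 100mbit ceil 100mbit  prio 0  # EF  (DSCP 46), capped
# Classify on the DSCP bits of the TOS byte (DSCP 46 -> 0xb8, DSCP 8 -> 0x20).
tc filter add dev eth0 parent 1: protocol ip u32 match ip tos 0xb8 0xfc flowid 1:30
tc filter add dev eth0 parent 1: protocol ip u32 match ip tos 0x20 0xfc flowid 1:20
```

Note the `ceil` on the EF class: like the IOS `priority percent 10` policer, it caps EF at its allocation rather than letting it starve the other classes.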