SCHEDULING FILE TRANSFERS ON A CIRCUIT-SWITCHED NETWORK

DISSERTATION

Submitted in Partial Fulfillment of the Requirements for the Degree of
DOCTOR OF PHILOSOPHY (Electrical Engineering)
at the
POLYTECHNIC UNIVERSITY
by
Hojun Lee
May 2004

Approved: __________________________ Department Head
__________________________ Date
Copy No. ____

Approved by the Guidance Committee:

Major: Electrical Engineering
________________________ Malathi Veeraraghavan, Professor of Electrical and Computer Engineering
________________________ Date

Major: Electrical Engineering
________________________ Shivendra S. Panwar, Professor of Electrical and Computer Engineering
________________________ Date

Minor: Electrical Engineering
________________________ Torsten Suel, Professor of Computer and Information Science
________________________ Date

Minor: Electrical Engineering
________________________ Edwin K. P. Chong, Professor of Electrical and Computer Engineering, Colorado State University
________________________ Date

Microfilm or other copies of this dissertation are obtainable from:
UMI Dissertations Publishing
Bell & Howell Information and Learning
300 North Zeeb Road, P.O. Box 1346
Ann Arbor, Michigan 48106-1346

VITA

Hojun Lee was born in Pusan, Korea, on February 22, 1971. He received the B.S. degree in Electrical Engineering from Polytechnic University, Brooklyn, in 1997, and the M.S. degree in Electrical Engineering from Columbia University, New York, in 1999. He is currently working toward the Ph.D. degree in Electrical Engineering at Polytechnic University. He held a fellowship from Village Networks, Eatontown, NJ, from 2001 to 2002, working on a network throughput comparison of optical metro ring architectures. Mr. Lee is a student member of the IEEE, a member of KSEA (Korean-American Scientists and Engineers Association), and a member of KSA (Korean Student Association) at Polytechnic University.

With grateful thanks to my parents

ACKNOWLEDGEMENTS

I would like to thank my Ph.D. thesis supervisor, Professor Malathi Veeraraghavan, who influenced me most during my years at Polytechnic. Professor Veeraraghavan taught me how to look for new areas of research, how to understand the state of the art quickly, how to write good technical papers, and how to present my ideas effectively. I thank Professor E. K. P. Chong, Professor S. Panwar, and Professor T. Suel for serving on my defense committee. A special thanks to Hua Li from Colorado State University, who not only provided help with packet-switched system simulations but also offered valuable insights into my dissertation, especially on discrete-time unit simulation. I would like to acknowledge my fellow students at Polytechnic and the University of Virginia, including Jaewoo Park, Seunghun Cha, Sungjun Lee, Jongha Lee, Sangwook Suh, Jeff Tao, Xuan Zheng, and Tao Li. I also owe deep thanks to my friends in New York who supported me during my studies at Polytechnic: Jenny Kim, Seiwoon Kim, and Jaehuk Lee. Pursuing a Ph.D. requires not only technical skill but also a tremendous amount of stamina and courage. I would like to thank my parents, Kiehwa Lee and Yangja Yoo, and my sister, Inha Lee, for sharing their unconditional love with me and giving me the courage required to pursue my goals at Polytechnic.
AN ABSTRACT

SCHEDULING FILE TRANSFERS ON A CIRCUIT-SWITCHED NETWORK

by Hojun Lee
Advisor: Malathi Veeraraghavan

Submitted in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy (Electrical Engineering)

May 2004

In the current Internet, files are typically transferred using application-layer protocols such as http and ftp, with TCP as the transport protocol. TCP has been studied and extended extensively, proving its worth as a reliable protocol under a variety of network conditions and applications. However, current TCP implementations are not adequate to support the high performance required by applications such as those encountered in eScience projects, which involve large file transfers. While others are working to improve TCP for high-speed networks, we propose an end-to-end optical circuit-switched solution called Circuit-switched High-speed End-to-End Transport ArcHitecture (CHEETAH). This solution is proposed on an add-on basis to the basic Internet service already available to end hosts, which has significant advantages. It allows the optical circuit-switched network to be run in a call-blocking mode: given the presence of the primary path through the Internet, an end host can fall back to the TCP/IP path if its call request is blocked. We analyze this mode of operation. We also define a call-scheduling mode of operation for the optical circuit-switched network. This scheme is based on a new varying-bandwidth list scheduling approach, which overcomes the main drawback of using circuit-switched networks for file transfers. Adopting this scheme in CHEETAH instead of the call-blocking mode can yield a further gain, i.e., lower file-transfer delay. For example, for a large file (e.g., 1TB), an end host can complete the transfer far sooner over a scheduled end-to-end circuit than over the TCP/IP backup path.

Table of Contents

Chapter 1: Background and Problem Statement
Chapter 2: Proposed CHEETAH service and its application to file transfers
2.1 Equipment
2.2 Optical Connectivity Service (OCS)
2.3 Hardware acceleration of signaling protocol implementations
2.4 Transport protocol used over the Ethernet/EoS circuit
Chapter 3: Operating the optical circuit-switched network in call-blocking mode
3.1 Analytical model
3.2 Numerical results
3.2.1 Numerical results for transfer delays of "large" files
3.2.2 Numerical results for transfer delays of "small" files
3.2.3 Optical circuit-switched network utilization considerations
3.2.4 Implementation of routing decision algorithm
Chapter 4: Call-queueing/scheduling mode
4.1 Scheduling file transfers on a single link
4.1.1 Varying-Bandwidth List Scheduling (VBLS) overview
4.1.2 Detailed description of VBLS
4.1.3 VBLS with Channel Allocation (VBLS/CA) overview
4.1.4 Detailed description of VBLS/CA
4.2 Analysis and simulation results
4.2.1 Traffic model
4.2.2 Validation of simulation against analytical results
4.2.3 Sensitivity analysis
4.2.4 Comparison of VBLS with FBLS and PS
4.2.4.1 Numerical results
4.2.5 Practical considerations
Chapter 5: Multiple-link cases (centralized and distributed)
5.1 VBLS algorithm for multiple-link case
5.2 Analysis and simulation results
5.2.1 Traffic model
5.2.2 Sensitivity analysis
Chapter 6: Conclusions and future work
Appendices

List of Figures

Figure 1. Current architecture: IP routers interconnect different types of networks. CHEETAH enables direct Ethernet/EoS circuits between hosts (see dashed lines and text in italics). File transfers between end hosts in enterprise building 1 and enterprise building 2 have a choice of two paths: (i) the TCP/IP path through the primary NICs, Ethernet switches, leased circuits I and II, and IP router I; (ii) the Ethernet/EoS circuit through the secondary NICs, MSPPs, and the optical circuit-switched network
Figure 2. Plot of equation (2) for large files with a link rate of 100Mbps, $\rho_{sig} = \rho_{sp} = 0.7$, k = 20
Figure 3. Plot of equation (2) for large files with a link rate of 1Gbps, $\rho_{sig} = \rho_{sp} = 0.7$, k = 20
Figure 4. Plot of equation (2) for small files with a link rate of 100Mbps, $\rho_{sig} = \rho_{sp} = 0.7$, k = 4
Figure 5. Plot of equation (2) for small files with a link rate of 100Mbps, $\rho_{sig} = \rho_{sp} = 0.7$, k = 20
Figure 6. Plot of utilization u with a link rate of 100Mbps, $\rho_{sig} = \rho_{sp} = 0.7$, k = 20
Figure 7. Single link mode
Figure 8. Example of $\mu(t)$, with $P_1 = 0$, $\mu(0) = 0$, $P_2 = 10$, $z_{max} = 9$, and $P_{z_{max}} = P_9 = 80$
Figure 9. Shaded area shows the allocation for the example 50MB file transfer, with $T^i_{req} = 50$ and $R^i_{max} = 2$ channels. The per-channel link capacity is 1Gbps and one time unit is 10ms
Figure 10. Per-channel availability; $G_{11} = 10$, $H_{11} = 30$, $n_1 = 2$; the $\mu(t)$ shown in Figure 8 is derived from this $A_j(t)$.
Lines indicate time ranges when the channel is available
Figure 11. Dashed lines show the allocation of resources for the example 75MB file transfer described above, with $T^i_{req} = 10$ and $R^i_{max} = 2$ channels
Figure 12. File latency comparison between analytical and simulation results
Figure 13. A comparison of VBLS file latency and mean file transfer delay for files requesting the same $R^i_{max}$; the link capacity is 100 channels
Figure 14. File latency comparison for two different values of k (500MB and 10GB) while keeping p constant at 10GB
Figure 15. Frequencies for different file size ranges (total number of bins = 50; bin size = 1.99GB or 1.8GB)
Figure 16. File throughput metric for files requesting three different values for $R^i_{max}$: 1, 2, and 4 channels (channel rate = 10Gbps)
Figure 17. File throughput metric for files requesting three different values for $R^i_{max}$: 1, 5, and 10 channels (channel rate = 1Gbps)
Figure 18. File throughput comparison for different values of $T_{discrete}$ (0.05, 0.5, 1, and 2 sec)
Figure 19. Utilization comparison for different values of $T_{discrete}$ (0.05, 0.5, 1, and 2 sec)
Figure 20. File throughput metric for files requesting an $R^i_{max}$ of 1 channel (10Gbps)
Figure 21. File throughput metric for files requesting an $R^i_{max}$ of 5 channels (50Gbps)
Figure 22. File throughput metric for files requesting an $R^i_{max}$ of 10 channels (100Gbps)
Figure 23. VBLS on a path of K hops
Figure 24. Example of $\mu_1(t)$, with $P_1 = 0$, $\mu_1(0) = 0$, $P_2 = 10$, $z_{max} = 9$, and $P_{z_{max}} = P_9 = 80$ - Link 1 (shaded area shows the allocation for the example 25MB file transfer)
Figure 25. Example of $\mu_2(t)$, with $P_1 = 0$, $\mu_2(0) = 0$, $P_2 = 10$, $z_{max} = 9$, and $P_{z_{max}} = P_9 = 80$ - Link 2 (shaded area shows the allocation for the example 25MB file transfer)
Figure 26. Network model
Figure 27. Percentage of blocked calls comparison for different values of M when $T_{discrete}$ = 0.01sec and $p_{12} = p_{23} = 5$ms
Figure 28. File throughput comparison for different values of M when $T_{discrete}$ = 0.01sec and $p_{12} = p_{23} = 5$ms
Figure 29.
Percentage of blocked calls comparison for different values of $T_{discrete}$ when M = 3 and $p_{12} = p_{23} = 5$ms
Figure 30. File throughput comparison for different values of $T_{discrete}$ when M = 3 and $p_{12} = p_{23} = 5$ms
Figure 31. Dashed lines show the allocation of resources for the example 15.625MB file transfer described above, with $T^i_{req} = 15$ and $R^i_{max} = 3$
Figure 32. Dashed lines show the allocation of resources for the example 3.125MB file transfer described above, with $T^i_{req} = 32$ and $R^i_{max} = 1$

List of Tables

Table 1. Input parameters plus the time to transfer a 1GB file and a 1TB file
Table 2. Crossover file sizes in the [5MB, 1GB] range when r = 1Gbps and $T_{prop}$ = 0.1ms, k = 20
Table 3. Crossover file sizes when r = 100Mbps and $T_{prop}$ = 0.1ms
Table 4. Notations for VBLS
Table 5. Additional notations
Table 6. Notations for multiple-link case
Table 7. Input parameters for example 1
Table 8. TRL vectors for each round (Example 1)
Table 9. Input parameters for example 2
Table 10. TRL vectors for each round (Example 2)

Chapter 1. Background and Problem Statement

Files are commonly transferred on the Internet using application-layer protocols such as http and ftp, with TCP as their transport protocol. Since file transfers do not have a stringent delay requirement, it is quite acceptable to incur the retransmission delays associated with TCP's error-correction mechanism or the rate slowdowns associated with TCP's congestion-control mechanism. Since file transfers are typically small (an average of 10KB per flow has been cited in [1]), even with retransmissions the total transfer time remains small enough that these excess delays can be ignored. However, there are some applications that require the transfer of large files [2]-[6]. Of particular interest is the effective throughput of large-file transfers, e.g., the terabyte- and petabyte-sized ($10^{15}$ bytes) files created in particle physics, earth observation, bioinformatics, radio astronomy, and other scientific studies, for which the current TCP has been shown to be inadequate [7]-[8]. One set of solutions calls for enhancing TCP to improve end-to-end throughput, thus limiting upgrades to the end hosts. Such improvements can be made via congestion control [9]-[11] and/or flow control [12]-[14]. A second set of solutions requires upgrades to routers within the Internet. For example, Mathis [15] proposes the use of a larger Maximum Transmission Unit (MTU) to improve end-to-end throughput.
Given that the Internet is a global network of IP routers interconnecting different types of networks, as illustrated in Figure 1, research aimed at improving file-transfer performance between any two hosts connected via the Internet must focus on enhancing TCP and/or IP, as is currently being done in these two sets of solutions.

We propose a third set of solutions, which we call Circuit-switched High-speed End-to-End Transport ArcHitecture (CHEETAH). This solution leverages the dominance of Ethernet in LANs and SONET in MANs and WANs, the deployment of optical fiber to enterprises, and the deployment of a new network system called a Multi-Service Provisioning Platform (MSPP) in enterprises. In this solution, hosts would be equipped with second (high-speed) Ethernet NICs, which would be connected directly to ports of enterprise MSPPs. These MSPPs have the capability of encapsulating Ethernet frames into SONET frames for transport on wide-area segments. By establishing a wide-area SONET/SDH circuit between the two enterprise MSPPs, and mapping Ethernet signals from/to hosts to/from this wide-area circuit, we can realize end-to-end high-speed "Ethernet/EoS circuits." Clearly this solution has limited applicability when compared to the first two sets of solutions, because it can only be used when both ends have access to the CHEETAH service and are interconnected by the same optical network. However, if a given network has a large coverage area, this solution will be useful for many transfers. Current-day SONET/SDH/WDM circuit-switched networks of different service providers are largely isolated, but as standards evolve, these can be interconnected directly to achieve larger coverage areas. A few research optical networks, such as the Canarie network [16], already extend coast-to-coast across Canada.

Figure 1: Current architecture: IP routers interconnect different types of networks. CHEETAH enables direct Ethernet/EoS circuits between hosts (see dashed lines and text in italics). File transfers between end hosts in enterprise building 1 and enterprise building 2 have a choice of two paths: (i) the TCP/IP path through the primary NICs, Ethernet switches, leased circuits I and II, and IP router I; (ii) the Ethernet/EoS circuit through the secondary NICs, MSPPs, and the optical circuit-switched network

Since we cannot provision wide-area Ethernet-over-SONET circuits between every pair of end hosts that need to communicate, the circuit-switched network has to support dynamic circuit setup and release. Not only have standards been specified for signaling protocols for SONET/SDH/WDM networks [17], these signaling protocols have also been implemented in many vendors' switches [18]. If the optical network is operated on a dynamic shared basis, then it is possible that circuit resources may not be available when a call request is being processed. In this case, the network can either (i) block the call, i.e., reject the call setup request, or (ii) queue the call.

Blocking calls is indeed an option in our proposed mode of operation, in which hosts using the CHEETAH service are equipped with second Ethernet cards so as not to interfere with the host's primary Internet service. The implication is that if a call is blocked for lack of resources through the optical circuit-switched network, the end host can fall back to the TCP/IP path through its primary NIC. There are many advantages to this mode of operation, which, of course, comes at the cost of the additional Ethernet cards.
We describe this solution in detail in Chapter 3, in which we define the conditions under which an Ethernet/EoS circuit setup should be attempted. For example, if the file is very small, we recommend that a circuit setup not be attempted, because of the setup overhead and the negative impact on utilization. Based on the loading conditions on the two paths, it may happen that even for large files, the TCP/IP path is the preferred one. However, there will be conditions under which it is worth attempting a circuit setup, because if it succeeds, the file-transfer delay can be significantly lower than if the transfer uses the TCP/IP path. For example, a 1GB file transfer on a TCP/IP path with a round-trip time of 50ms, a link rate of 1Gbps, and a loss probability of 0.0001 takes 395.7 sec, while on a circuit with the same link rate, the transfer time is only 8.08 sec.

For very large files, e.g., on the order of terabytes, this solution of attempting a circuit setup and falling back to the TCP/IP path on failure may not be a good one. Instead, if the circuit-switched network is operated in call-queueing mode, there is a likelihood that the total file-transfer delay, even after being queued for a circuit, could be lower than the delay through the TCP/IP path. For example, we computed that we need over 4 days and 15 hours to transfer a 1TB file with TCP if the round-trip propagation delay is 50ms and the bottleneck link rate is 1Gbps, even if the packet loss rate on the end-to-end path is as low as 0.0001. On the other hand, if we successfully set up a 1Gbps circuit, we can complete the transfer in 2.3 hours. Hence, for this case (large file transfers), we propose that the optical circuit-switched network be run in a call-queueing mode.

Prior work on call queueing is fairly limited. In practice, call-queueing systems have not been implemented or tested extensively. Furthermore, when call holding times are large, as can be expected in eScience applications, where a remote scientist needs a circuit for a few hours of experimentation, call queueing is not practical. Instead, we propose "scheduling calls." We discovered that this is indeed possible for file transfers if the network is provided with file sizes.

Prior work on scheduling file transfers/calls, making "book-ahead" reservations, and packing algorithms includes [19]-[30]. In [19], Coffman describes file transfer scheduling schemes for a star network, assuming all transfers are of unit bandwidth but arbitrary durations. The paper by Erlebach and Jansen [20] extends this work to star and tree networks with arbitrary bandwidths and arbitrary durations. Both papers obtain competitive ratios for on-line greedy heuristics called list-scheduling algorithms, characterizing their performance when compared to off-line optimal solutions. The metric being compared is makespan, which is the total time needed to transfer a set of files. The basic list scheduling (LS) algorithm works as follows: if there is a call i such that its required bandwidth $b_i$ is available on all links of the end-to-end path, LS schedules the first such call in the list of all calls; if not, it waits until one of the active transfers completes (see the sketch below). This heuristic is extended for arbitrary bandwidths in a star network as Decreasing Bandwidth List Scheduling (DBLS), and for trees as the Level-List Scheduling (LLS) and List Scheduling by Levels (LSL) schemes [20]. Both [19] and [20] require file transfer requests to specify a bandwidth requirement, and schedule the constant bandwidth requested for each transfer.
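As a concrete illustration, the following is a minimal sketch of the basic LS heuristic on a single shared link, written in Python. The event-handling structure and all names are our own; the per-link checks of the multi-link case in [19]-[20] and the DBLS/LLS/LSL variants are not shown.

```python
# Minimal sketch of basic list scheduling (LS) on one shared link: scan the
# waiting list in order and start the first call whose fixed bandwidth fits;
# if none fits, advance time to the next departure.
import heapq

def list_schedule(calls, capacity):
    """calls: list of (bandwidth, duration) pairs; returns the makespan."""
    waiting = list(calls)
    active = []                      # min-heap of (finish_time, bandwidth)
    now, used, makespan = 0.0, 0.0, 0.0
    while waiting or active:
        started = False
        for i, (bw, dur) in enumerate(waiting):
            if used + bw <= capacity:        # first call in the list that fits
                waiting.pop(i)
                used += bw
                heapq.heappush(active, (now + dur, bw))
                makespan = max(makespan, now + dur)
                started = True
                break
        if not started:                      # wait for an active transfer to end
            now, bw = heapq.heappop(active)
            used -= bw
    return makespan

# Four calls (bandwidth, duration) on a link of capacity 8.
print(list_schedule([(4, 10), (3, 5), (5, 8), (2, 20)], capacity=8))
```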
For arbitrary topologies, the file-scheduling problem has the added dimension of route selection. In [21], four greedy algorithms are proposed, all of which appear to be focused on the route selection problem rather than on file scheduling. Although the file transfer request specifies the file size (as in our approach), the network considered is packet-switched: the whole link bandwidth is used for each transfer, and buffers are assumed at all switches. Thus, it is different from our problem, which is aimed at circuit-switched networks.

There has been some work on book-ahead (BA) reservations, in which a user requests that a connection be set up at some future time. In such reservations, the duration of the call is typically specified, either as a deterministic number or as a distribution. Papers on this topic include [22]-[25]. All of these papers develop and analyze algorithms, centralized or distributed, to perform BA reservations, but only in conjunction with call-blocking; none addresses the issue of call queueing or scheduling. In other words, when a user makes a BA request, the network analyzes the probability of honoring this request at the desired future start time if the call is admitted now. If this probability is below some threshold, the network simply blocks the call. None of these papers addresses the concept of giving the BA call a delayed start time relative to the requested start time. Some of these papers [22]-[25] analyze schemes that allow for the sharing of resources between IR calls and BA calls. They typically assume that IR calls do not specify their expected durations. In contrast, we require all calls, IR and BA, to specify their file sizes.

In bin-packing problems, the goal is to pack a given set of blocks into finite-sized bins using the smallest number of bins [26]. In container-loading problems, the goal is to fill a single bin of infinite height to a minimum possible height [27]. In the knapsack-loading problem, each block has an associated profit, and the problem is to select those blocks that will maximize profit when loaded into the bin [28]. This classification of packing problems is obtained from [29]. In none of these problems can the blocks be broken up into pieces to fit into bins, which is what happens when we schedule files with varying bandwidth allocations in different time ranges. Thus, the heuristics proposed for these problems do not directly apply. Solutions to the job-shop scheduling problem [30] typically compute optimal schedules for the offline problem equivalent to our online problem. Our problem is online because resources have to be scheduled for a file transfer without information about file transfer requests that arrive after the request being scheduled.

We examine the problem of call scheduling with knowledge of file sizes in detail in Chapters 4 and 5. If such a mechanism is implemented in the optical circuit-switched network, the end host application can attempt a circuit setup for very large files, obtain a completion time for the transfer based on the schedule allocated, and then decide whether it should resort to the TCP/IP path. If the optical circuit-switched network is heavily loaded, then the 1TB file transfer request may indeed result in a completion time from the network that is greater than 4 days and 15 hours. In this case, the end host can resort to the TCP/IP path.
The CHEETAH approach of growing in a network solution as an add-on to basic Internet access is comparable to providing people with multiple transportation options. For example, a traveler between New York and Washington DC has a choice of flying, riding a train, or driving, all of which are more-or-less comparable from a time perspective. Based on conditions on these three transportation networks, the traveler can choose. With CHEETAH, we provide a similar choice to end host applications.

Chapter 2. Proposed CHEETAH service and its application to file transfers

Our solution calls for equipping end hosts with second (high-speed) Ethernet NICs and connecting these NICs directly to MSPPs, as illustrated in Figure 1. MSPPs are then interconnected across wide-area networks using EoS circuits. The circuits are established and released dynamically using signaling protocols. Section 2.1 describes the equipment needed to support the CHEETAH service. Since the CHEETAH service can only be used for communication between end hosts located on an optical circuit-switched network, a host requires some support to first determine whether its correspondent end host (the end host with which it is communicating) is reachable via an end-to-end Ethernet/EoS circuit. In Section 2.2, we describe a support service for this purpose called the "Optical Connectivity Service (OCS)."

Next, we consider the question of how to use the CHEETAH service for file-transfer applications. File-transfer sessions require the exchange of many back-and-forth messages in addition to the actual file transfer. We propose using a TCP connection via the primary Internet path for such short exchanges, and limiting the use of end-to-end Ethernet/EoS circuits to the actual file transfers. To achieve high utilization of the circuit-switched network, we propose (i) setting up the end-to-end high-speed Ethernet/EoS circuit just prior to the actual transfer and releasing it immediately after the file transfer, (ii) operating the circuit-switched network in call-blocking mode, (iii) using circuits only for certain transfers, and (iv) using a unidirectional EoS circuit from the server to the client (since this is the primary direction of data flow).

The implication of holding circuits only for the duration of file transfers is that call holding times can be quite small. For example, a 1MB transfer on a 100Mbps link incurs a transmission delay of only 80ms. This means call setup delays should be kept low and the call handling capacities of switches should be high. Therefore, we recommend a hardware-accelerated implementation of signaling protocols at MSPPs, Add/Drop Multiplexers (ADMs), crossconnects, and other optical circuit switches. Section 2.3 describes our current work on hardware-accelerated signaling implementations. In Section 2.4, we consider the question of transport protocols for end-to-end Ethernet/EoS circuits. We found that a transport protocol called Scheduled Transfer (ST), an ANSI standard [31], is ideally suited for end-to-end Ethernet/EoS circuits. Section 2.4 describes our data-transport approach.

2.1 Equipment

Due to the "add-on" characteristic of the CHEETAH service, hosts that want access to this service should be equipped with second Ethernet NICs that are connected "directly" to the MSPP Ethernet cards, as shown in Figure 1. Some of the MSPPs and SONET/SDH/WDM switches (crossconnects, ADMs) should be enhanced with signaling protocol engines to handle dynamic call setup and release.
Circuits can be provisioned between nodes that do not have signaling capability. Adding signaling engines to MSPPs allows for concentration on access links from enterprises. Furthermore, application software in end hosts should be upgraded to interface with the CHEETAH service.

2.2 Optical Connectivity Service (OCS)

A support service called the "Optical Connectivity Service (OCS)" is proposed to provide end hosts a mechanism to determine whether or not their correspondent end hosts have access to the CHEETAH service. OCS can be implemented much like the Domain Name Service (DNS), with enterprises and service provider networks maintaining servers with information on end hosts that have access to the CHEETAH service. These servers would answer queries from end hosts in much the same manner as DNS servers answer queries for IP addresses and other information. With caching, the delay incurred in this step can be reduced.

2.3 Hardware acceleration of signaling protocol implementations

Processing signaling protocol messages involves many data table reads/writes, parsing/constructing complex messages, maintaining state information, managing timers, etc. For example, consider call setup. Upon receiving a call setup message, a call processor needs to parse out parameters, such as the destination address, the bandwidth requested, etc., and then perform several actions. First, it determines the next-hop switch through which to reach the destination, typically by consulting a precomputed routing table (similar to the longest-prefix-match operation in IP routers). Second, it selects an interface connected to the selected next-hop switch on which sufficient bandwidth is available. Third, it selects free time-slots and/or wavelengths on the selected interface. Finally, it programs the switch fabric by writing a switch configuration table. This table is used by the switch to route data bits received on a given timeslot/wavelength on an incoming interface to a given timeslot/wavelength on a corresponding outgoing interface. Other actions performed by the signaling protocol processing engines at switches include updating state information and constructing the outgoing signaling message. Similar actions are performed for circuit release.

Accelerating signaling protocol processing engines is a challenging task. In [32], our colleagues designed a signaling protocol specifically for SONET networks, with a goal of achieving high performance rather than flexibility. They implemented the basic and frequently used operations in Field Programmable Gate Arrays (FPGAs), and relegated the complex and infrequently used operations (e.g., the processing of optional parameters and error handling) to software. They modeled the signaling protocol in VHDL and then mapped it onto two FPGAs on the WILDFORCE™ reconfigurable board: a Xilinx XC4036XLA with 62% resource utilization and an XC4013XLA with 8% resource utilization. The hardware implementation handles four messages: Setup, Setup-success, Release, and Release-confirm. From the timing simulations, done using the ModelSim simulator, call setup message processing consumes between 77 and 101 clock cycles. Assuming a 25MHz clock, this translates into 3.08 to 4.04 microseconds. Compare this with the millisecond-range software implementations of signaling protocols [33].
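To make the per-message work concrete, here is a rough sketch of the four call-setup actions listed above. All table layouts and names are hypothetical simplifications; a real signaling engine (and the FPGA design of [32]) operates on binary message formats and per-interface timeslot maps.

```python
# Hypothetical sketch of the call-setup actions described above: route lookup,
# interface selection, timeslot selection, and switch-fabric programming.

def process_setup(msg, routing_table, interfaces, fabric_table):
    """msg: dict with 'dest', 'bandwidth' (in timeslots), 'in_port', 'in_slots'."""
    # 1. Route lookup: find the next-hop switch for the destination.
    next_hop = routing_table[msg["dest"]]
    # 2. Interface selection: pick a link to the next hop with enough capacity.
    for intf in interfaces[next_hop]:
        free = [ts for ts, busy in enumerate(intf["slots"]) if not busy]
        if len(free) >= msg["bandwidth"]:
            # 3. Timeslot selection: reserve the required free timeslots.
            chosen = free[:msg["bandwidth"]]
            for ts in chosen:
                intf["slots"][ts] = True
            # 4. Fabric programming: map incoming slots to outgoing slots.
            fabric_table[(msg["in_port"], tuple(msg["in_slots"]))] = (
                intf["port"], tuple(chosen))
            return {"type": "setup-success", "out_slots": chosen}
    return {"type": "release", "reason": "no resources"}   # block the call

routing = {"hostB": "sw2"}
links = {"sw2": [{"port": 1, "slots": [False] * 4}]}
fabric = {}
setup = {"dest": "hostB", "bandwidth": 2, "in_port": 0, "in_slots": [0, 1]}
print(process_setup(setup, routing, links, fabric))
```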
2.4 Transport protocol used over the Ethernet/EoS circuit

In this section, we consider the question of what transport protocol to use on these end-to-end high-speed Ethernet/EoS circuits. TCP is a poor choice for dedicated end-to-end circuits because of its slow-start and congestion-avoidance algorithms. Also, TCP's window-based flow control and positive-ACK-based error control scheme are not well suited to dedicated end-to-end circuits. Hence we considered a number of other transport protocols, including high-speed transport protocols such as [34]-[35] and OS-bypass protocols [36]-[38]. Of these, we selected the Scheduled Transfer (ST) protocol, which is an ANSI standard [31] and is ideally suited for end-to-end circuits carrying Ethernet frames.

ST provides sufficient hooks to allow for a high-speed, OS-bypass implementation, a feature that is necessary to achieve true high-speed end-to-end throughput. It does this by having the sender specify a receiver memory address in the data block header, which causes the receiving NIC to simply write the received payload using Direct Memory Access (DMA) into the specified memory location. This results in a low end-host transport-layer delay. ST offers flexibility in its flow control and error control schemes. For flow control, we propose using a rate-control approach in which the circuit rate is selected taking into account the rate at which the receiving application can process received data from memory. An alternative is to have the receiver allocate a large-enough buffer space for the entire file prior to the start of the transfer. This solution, however, limits the maximum size of files that can be transferred, a limit that may anyway be necessary from a network circuit-sharing perspective. This means we limit file sizes to a Maximum File Transfer Size (MFTS) per session. For error control, we propose using ST's support for negative acknowledgments (NAKs), given that data blocks will be delivered in sequence on the Ethernet/EoS circuit. Missing/errored blocks resulting from bit errors will need to be retransmitted. ST supports the selective-repeat approach.

Chapter 3. Operating the optical circuit-switched network in call-blocking mode

In this chapter, we analyze the case when the optical circuit-switched network is operated in call-blocking mode. If a call is blocked, the end host application falls back to the TCP/IP path. Given that end hosts with access to the CHEETAH service have two paths for certain transfers, they have to make a routing decision on whether or not to attempt setting up an Ethernet/EoS circuit. Such an attempt is not always a good idea, as we show through the analysis below. On the other hand, in some circumstances, when there is a large difference in the delay on the two paths, it may be well worth attempting the circuit setup, as we also show in the analysis below.

3.1 Analytical model

Let $E[T_{cheetah}]$ be the mean delay incurred if an Ethernet/EoS circuit setup is attempted prior to the file transfer:

$$E[T_{cheetah}] = (1 - P_b)(E[T_{setup}] + T_{transfer}) + P_b(E[T_{fail}] + E[T_{tcp}]) \qquad (1)$$

where $P_b$ is the call-blocking probability on the optical circuit-switched network, $E[T_{setup}]$ is the mean call-setup delay of a successful circuit setup, $T_{transfer}$ is the time to transfer the file on the Ethernet/EoS circuit, $E[T_{fail}]$ is the mean delay incurred in a failed call setup attempt, and $E[T_{tcp}]$ is the mean delay incurred in sending the file on the TCP/IP path.
If the call is not blocked, the mean delay experienced is $E[T_{setup}] + T_{transfer}$; but if it is blocked, then after incurring a cost $E[T_{fail}]$, the end host has to use the TCP/IP path and hence will incur the $E[T_{tcp}]$ delay. Comparing $E[T_{tcp}]$, the delay incurred if a circuit setup is not attempted, with $E[T_{cheetah}]$, the delay incurred if a circuit setup is attempted, and approximating $E[T_{fail}]$ to be equal to $E[T_{setup}]$, results in:

$$\text{use TCP/IP path if } \frac{E[T_{setup}]}{1 - P_b} \ge E[T_{tcp}] - T_{transfer}; \qquad \text{attempt circuit setup if } \frac{E[T_{setup}]}{1 - P_b} < E[T_{tcp}] - T_{transfer} \qquad (2)$$

Next, we obtain expressions for $E[T_{tcp}]$, $E[T_{setup}]$, and $T_{transfer}$. $E[T_{tcp}]$ is obtained using the models of [39]-[40], which capture the time spent in slow start, $E[T_{ss}]$; the time spent in congestion avoidance, $E[T_{ca}]$; the expected cost of a recovery following the first loss, $E[T_{loss}]$; and the time to delay the ACK for the initial segment, $E[T_{delayack}]$:

$$E[T_{tcp}] = E[T_{ss}] + E[T_{loss}] + E[T_{ca}] + E[T_{delayack}] \qquad (3)$$

$E[T_{ss}]$ is a function of the Round-Trip Time (RTT); $W_{max}$, a limit posed by the sender or receiver window; $w_1$, the initial congestion window; $P_{loss}$, the loss rate; the number of data segments in the file transfer; and the number of segments for which an ACK is generated (for example, if an ACK-every-other-segment strategy is used, this number is 2). The $E[T_{loss}]$ term is a function of $T_o$, the average duration of a first time-out in a sequence of one or more time-outs, of RTT, and of the probability of the first loss being detected with a retransmission time-out rather than with a triple-duplicate ACK. The reader is referred to [39] for the details of the $E[T_{ss}]$ and $E[T_{loss}]$ terms. $E[T_{ca}]$ is a function of the number of data segments in the file transfer, $P_{loss}$, RTT, $T_o$, and $W_{max}$, as derived in [39]. We set the final term $E[T_{delayack}]$ to 0 because we assume a starting initial window size of 2 [41] and the ACK-every-other-segment strategy. We do not include the TCP connection setup time, assuming that the connection is already open (the messaging needed prior to the actual file transfer, such as the name of the file being requested, would require the TCP connection to be opened first). If this is the first file transfer within an application session, then the actual file transfer starts in the slow start phase. For subsequent transfers, the window could potentially be in a larger state, but idle periods between transfers are often long, and [41] specifies that the congestion window should be reset to a Restart Window size (2 segments) whenever the session is idle for more than one retransmission timeout. Hence we assume that all file transfers start in the slow start phase.

The mean call setup delay $E[T_{setup}]$ includes the mean signaling-message transmission delays, the mean call processing delays (to process signaling protocol messages), and a round-trip propagation delay:

$$E[T_{setup}] = \left(1 + \frac{\rho_{sig}}{2(1-\rho_{sig})}\right)\frac{m_{sig}}{r_s}(k+1) + \left(1 + \frac{\rho_{sp}}{2(1-\rho_{sp})}\right)T_{sp}\,k + T_{prop} \qquad (4)$$

where $m_{sig}$ is the cumulative size of the signaling messages used in a call setup, $r_s$ is the signaling link rate, k is the number of switches on the end-to-end path, $T_{sp}$ is the signaling message processing time incurred at each switch, and $T_{prop}$ is the round-trip propagation delay. We model the queueing delay on the signaling links as an M/D/1 queue at load $\rho_{sig}$, and the queueing delay at the call processors also as an M/D/1 queue at load $\rho_{sp}$.
M/D/1 queueing models are quite accurate here, since inter-arrival times between file transfers have been shown to be exponentially distributed [42], and signaling message lengths and call processing delays are more-or-less constant. The second component, $T_{transfer}$, is the actual file-transfer delay:

$$T_{transfer} = \frac{f}{r_c} + \frac{T_{prop}}{2} \qquad (5)$$

where f is the size of the file being transferred and $r_c$ is the data rate of the circuit. We have not included retransmission delays here because, on Ethernet/EoS circuits, retransmissions are only required when random bit errors affect a block of data, and these types of errors also impact delays on the TCP/IP path. Since our approach is to compare delays on the TCP/IP path and on Ethernet/EoS circuits before deciding whether or not to attempt a circuit setup, we have omitted retransmission delays due to bit errors on both paths. Including this delay would in fact favor using the Ethernet/EoS circuit, because bit errors on the TCP/IP path would be misinterpreted as packet losses caused by congestion, leading to reductions in the sending rate.

3.2 Numerical results

3.2.1 Numerical results for transfer delays of "large" files

The input parameter values assumed for the numerical computation are shown in Table 1. We assume four values for $P_{loss}$, two values for the bottleneck link rate r, and three values of the round-trip propagation delay $T_{prop}$, to create a total of 24 cases. RTT is computed from $T_{prop}$ and a rough estimate of the queueing plus service delay at the bottleneck link. We derive this estimate by determining the load at which an M/D/1/k system will experience the assumed $P_{loss}$ values. (While the packet transmission (service) time is more-or-less deterministic because of MTU restrictions, the packet arrival process at a buffer feeding the bottleneck link is known not to be a Poisson process. However, we use this approximate model to obtain a rough estimate of the queueing plus service delay; as seen from the numerical values, this component is not significant.) $W_{max}$, as stated earlier, is determined by limitations on the sender or receiver window. For all the cases, we set $W_{max}$ to the delay-bandwidth product, i.e., $W_{max} = RTT \times r$. When the congestion window reaches $W_{max}$, any further increase is irrelevant, because the system will reach a streaming state in which ACKs are received in time to permit further packet transmissions before the sender completes emitting its current congestion window.

Using the input parameters shown in Table 1, we compute $E[T_{tcp}]$ given by (3) for a 1GB file and a 1TB file, and list the values in the last two columns of Table 1. The round-trip propagation delay $T_{prop}$ has a significant impact on the total file-transfer delay. For example, for a 1GB file transfer, increasing $T_{prop}$ from 5ms to 50ms results in a considerable increase in $E[T_{tcp}]$, from 89.45s to 396.5s. Also, at a large value of the round-trip propagation delay $T_{prop}$ (50ms), for a given $P_{loss}$, there is not much benefit gained from increasing the bottleneck link rate from 100Mbps to 1Gbps: compare 396.5s for a 100Mbps link with 395.7s for a 1Gbps link for the 1GB file transfer. Increasing the bottleneck link rate has value when the propagation delay is small; the higher the rate, the smaller the propagation delay at which this benefit can be seen. The loss probability $P_{loss}$ also plays an important role. Even in a low propagation delay environment ($T_{prop}$ of 0.1ms), $E[T_{tcp}]$ jumps from 82.25s to 283.56s for the 1GB file transfer when $P_{loss}$ increases from 0.0001 to 0.1.
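To make the routing decision concrete, the following sketch evaluates equations (4) and (5) and applies the decision rule (2). The parameter values are illustrative (the signaling link rate, in particular, is our assumption), and $E[T_{tcp}]$ is taken as an input because the TCP latency model of [39]-[40] is not reproduced here.

```python
# Sketch of the routing decision of equation (2), using equations (4) and (5).
# Values are illustrative; r_s (signaling link rate) is an assumed parameter,
# and e_tcp must come from the TCP latency model of [39]-[40].

def mean_setup_delay(m_sig, r_s, k, t_sp, t_prop, rho_sig, rho_sp):
    """Equation (4): mean call-setup delay with M/D/1 waiting terms."""
    link_term = (1 + rho_sig / (2 * (1 - rho_sig))) * (m_sig / r_s) * (k + 1)
    proc_term = (1 + rho_sp / (2 * (1 - rho_sp))) * t_sp * k
    return link_term + proc_term + t_prop

def transfer_delay(f_bits, r_c, t_prop):
    """Equation (5): transmission time plus one-way propagation delay."""
    return f_bits / r_c + t_prop / 2

def attempt_circuit(e_setup, p_b, e_tcp, t_transfer):
    """Rule (2): attempt setup iff E[Tsetup]/(1-Pb) < E[Ttcp] - Ttransfer."""
    return e_setup / (1 - p_b) < e_tcp - t_transfer

# Example: 1GB file, 1Gbps circuit, k = 20 switches, 100-byte signaling
# messages, 4us per-switch processing, Tprop = 50ms, loads of 0.8.
e_setup = mean_setup_delay(m_sig=100 * 8, r_s=100e6, k=20, t_sp=4e-6,
                           t_prop=0.050, rho_sig=0.8, rho_sp=0.8)
t_xfer = transfer_delay(f_bits=8e9, r_c=1e9, t_prop=0.050)
print(attempt_circuit(e_setup, p_b=0.1, e_tcp=395.7, t_transfer=t_xfer))
```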
If an end-to-end GbE/EoS circuit is established for the 1GB file transfer, the sum $E[T_{setup}] + T_{transfer}$ is 80.08 sec when the link rate is 100Mbps and 8.08 sec when the link rate is 1Gbps. These numbers are obtained assuming that both $\rho_{sp}$ and $\rho_{sig}$ are 0.8, $T_{prop}$ is 50ms, and there are 20 switches on the end-to-end path. The major component of these values is $T_{transfer}$; $E[T_{setup}]$ is only 55.3ms. The total message length for the call-setup-related signaling messages is assumed to be 100 bytes, and the call processing delay per switch is assumed to be 4 microseconds, given our hardware-accelerated signaling implementations (see Section 2.3).

Table 1: Input parameters plus the time to transfer a 1GB file and a 1TB file

Case | Ploss | Rate r | Tprop | Queueing plus service delay | RTT (ms) | Wmax (pkts) | E[Ttcp] for a 1GB file (sec) | E[Ttcp] for a 1TB file
Case 1 | 0.0001 | 100Mbps | 0.1ms | 0.2ms | 0.3 | 2.5 | 82.25 | 22.9 hours
Case 2 | 0.0001 | 100Mbps | 5ms | 0.2ms | 5.2 | 41 | 89.45 | 1 day and 1.3 hours
Case 3 | 0.0001 | 100Mbps | 50ms | 0.2ms | 50.2 | 418 | 396.5 | 4 days and 15.3 hours
Case 4 | 0.0001 | 1Gbps | 0.1ms | 0.02ms | 0.12 | 10 | 8.25 | 2.3 hours
Case 5 | 0.0001 | 1Gbps | 5ms | 0.02ms | 5.02 | 418 | 39.6 | 11.1 hours
Case 6 | 0.0001 | 1Gbps | 50ms | 0.02ms | 50.02 | 4168 | 395.7 | 4 days and 14.9 hours
Case 7 | 0.001 | 100Mbps | 0.1ms | 0.26ms | 0.36 | 3 | 82.93 | 22.9 hours
Case 8 | 0.001 | 100Mbps | 5ms | 0.26ms | 5.26 | 43.8 | 135.4 | 1 day and 0.1 hours
Case 9 | 0.001 | 100Mbps | 50ms | 0.26ms | 50.26 | 418.8 | 1293 | 4 days and 15.4 hours
Case 10 | 0.001 | 1Gbps | 0.1ms | 0.026ms | 0.13 | 10.8 | 8.64 | 2.3 hours
Case 11 | 0.001 | 1Gbps | 5ms | 0.026ms | 5.03 | 419 | 129.4 | 11.1 hours
Case 12 | 0.001 | 1Gbps | 50ms | 0.026ms | 50.03 | 4169 | 1287 | 4 days and 14.9 hours
Case 13 | 0.01 | 100Mbps | 0.1ms | 0.38ms | 0.48 | 4 | 92.41 | 22.9 hours
Case 14 | 0.01 | 100Mbps | 5ms | 0.38ms | 5.38 | 44.8 | 471.7 | 1 day and 0.2 hours
Case 15 | 0.01 | 100Mbps | 50ms | 0.38ms | 50.38 | 419.8 | 4417 | 4 days and 15.7 hours
Case 16 | 0.01 | 1Gbps | 0.1ms | 0.038ms | 0.138 | 11.5 | 12.43 | 2.3 hours
Case 17 | 0.01 | 1Gbps | 5ms | 0.038ms | 5.038 | 419.8 | 441.7 | 11.2 hours
Case 18 | 0.01 | 1Gbps | 50ms | 0.038ms | 50.04 | 4169.8 | 4387 | 4 days and 14.9 hours
Case 19 | 0.1 | 100Mbps | 0.1ms | 0.68ms | 0.78 | 6.5 | 283.56 | 22.9 hours
Case 20 | 0.1 | 100Mbps | 5ms | 0.68ms | 5.68 | 47.33 | 2064.9 | 1 day and 0.3 hours
Case 21 | 0.1 | 100Mbps | 50ms | 0.68ms | 50.68 | 422.33 | 18424 | 4 days and 16.3 hours
Case 22 | 0.1 | 1Gbps | 0.1ms | 0.068ms | 0.168 | 14 | 61.07 | 2.3 hours
Case 23 | 0.1 | 1Gbps | 5ms | 0.068ms | 5.068 | 422.33 | 1842.4 | 11.2 hours
Case 24 | 0.1 | 1Gbps | 50ms | 0.068ms | 50.07 | 4172.3 | 18202 | 4 days and 15 hours

Compare the file-transfer delays for a 1TB file shown in Table 1 with the delays on an end-to-end high-speed Ethernet/EoS circuit. For example, on a 1Gbps Ethernet/EoS circuit, a 1TB file will take about 2.2 hours, which is comparable to the TCP/IP path numbers for the low propagation delay environment where $T_{prop}$ is 0.1ms, but significantly less than the TCP/IP path numbers when $T_{prop}$ is 5 or 50ms. The bulk of the 2.2 hours is the file transfer time $T_{transfer}$; $E[T_{setup}]$ is on the order of milliseconds, as shown above. This is not a surprising result, but the low delay on the end-to-end circuit is possible only if the call is not blocked. (Once a circuit is set up, its rate is not reduced by competition from other users.) To take blocking probability into account, we plot (2), the basis for the routing decision, in Figure 2 and Figure 3 for the 100Mbps and 1Gbps link rates, respectively. For the three horizontal lines on which $P_b$ values are listed, the y-axis is the left-hand side of (2), i.e., $E[T_{setup}]/(1-P_b)$. For the remaining three lines, which are marked "Difference" with $P_{loss}$ values, the y-axis is the right-hand side of (2), i.e., $E[T_{tcp}] - T_{transfer}$. In Figure 2, when the link rate is 100Mbps, an Ethernet/EoS circuit setup should be attempted over the entire file range (5MB, 1GB) for the $P_b$ and $P_{loss}$ values shown.
This is because $E[T_{setup}]/(1-P_b)$ is always less than the difference term $E[T_{tcp}] - T_{transfer}$ (see (2)). However, when the bottleneck link rate increases to 1Gbps (see Figure 3), while we see a similar pattern when $T_{prop}$ is 50ms (a WAN environment), in a lower propagation delay environment (Figure 3(a), in which $T_{prop}$ = 0.1ms) we see that there are crossover file sizes below which an end host should resort directly to the TCP/IP path and above which it should attempt an Ethernet/EoS circuit setup. These crossover file sizes are listed in Table 2.

Figure 2: Plot of equation (2) for large files with a link rate of 100Mbps, $\rho_{sig} = \rho_{sp} = 0.7$, k = 20

Figure 3: Plot of equation (2) for large files with a link rate of 1Gbps, $\rho_{sig} = \rho_{sp} = 0.7$, k = 20

Table 2: Crossover file sizes in the [5MB, 1GB] range when r = 1Gbps and Tprop = 0.1ms, k = 20 (rows: TCP/IP path loading Ploss; columns: circuit-switched network loading Pb)

Ploss | Pb = 0.01 | Pb = 0.1 | Pb = 0.3
0.0001 | 22MB | 24MB | 30MB
0.001 | 9MB | 10MB | 12MB
0.01 | <5MB | <5MB | <5MB

In the current-day Internet, where bottleneck link rates are on the order of Mbps for enterprise users, it is worthwhile attempting a circuit setup for files of 5MB and over in most MAN and WAN environments ($T_{prop}$ of 0.1ms, 5ms, 50ms). This holds true even as rates increase to 100Mbps. But as links are upgraded to Gbps rates, such circuit attempts should be made mainly in wide-area environments or for larger files.

3.2.2 Numerical results for transfer delays of "small" files

Even though our motivation for this work comes from high-end scientific applications with very large files, we wanted to understand whether the CHEETAH service could be used for smaller files (100KB to 5MB). Unlike for larger files, where we studied the impact of the link rate, here we study the impact of the number of switches on the end-to-end path, keeping the link rate at 100Mbps. Figure 4 plots the results for the case when the number of switches on the end-to-end path, k, is 4, and Figure 5 plots the k = 20 case.

Figure 4: Plot of equation (2) for small files with a link rate of 100Mbps, $\rho_{sig} = \rho_{sp} = 0.7$, k = 4; (a) Tprop is 0.1ms, (b) Tprop is 50ms

Our first observation is that in the wide-area network scenarios (Figure 4(b) and Figure 5(b)), an Ethernet/EoS circuit setup should be attempted over the entire file range (100KB, 5MB) for all the considered values of $P_b$ (0.01, 0.1, 0.3) and $P_{loss}$ (0.0001, 0.001, 0.01). This is because the difference term $E[T_{tcp}] - T_{transfer}$ is always greater than $E[T_{setup}]/(1-P_b)$.

In a lower propagation delay environment, e.g., $T_{prop}$ of 0.1ms (Figure 4(a) and Figure 5(a)), we see crossover file sizes below which an end host should resort directly to the TCP/IP path and above which it should attempt an Ethernet/EoS circuit setup. These crossover file sizes are listed in Table 3. The number of switches on the end-to-end path, k, has little impact on the total transfer times, but it does affect $E[T_{setup}]$, especially when $T_{prop}$ is 0.1ms. As a result, the crossover file sizes in Figure 5(a) are much larger than those in Figure 4(a), as seen in Table 3.

Figure 5: Plot of equation (2) for small files with a link rate of 100Mbps, $\rho_{sig} = \rho_{sp} = 0.7$, k = 20; (a) Tprop is 0.1ms, (b) Tprop is 50ms

In summary, in the current-day Internet, where bottleneck link rates are on the order of Mbps for enterprise users, it is worthwhile attempting a circuit setup for files of 5MB and over in most MAN and WAN environments ($T_{prop}$ of 0.1ms, 5ms, 50ms). This holds true even as rates increase to 100Mbps.
But as links are upgraded to Gbps rates, such circuit attempts should be made mainly in wide-area environments or for larger files.

Table 3: Crossover file sizes when r = 100Mbps and Tprop = 0.1ms (rows: Ploss; columns: Pb, for k = 4 and k = 20 switches on the path)

Ploss | k = 4: Pb = 0.01 | Pb = 0.1 | Pb = 0.3 | k = 20: Pb = 0.01 | Pb = 0.1 | Pb = 0.3
0.0001 | 610KB | 640KB | 840KB | 2.4MB | 2.65MB | 3.4MB
0.001 | 490KB | 550KB | 730KB | 2MB | 2.2MB | 2.8MB
0.01 | 120KB | 140KB | 140KB | 500KB | 550KB | 650KB

3.2.3 Optical circuit-switched network utilization considerations

While file-transfer delay is an important user measure for making the routing decision of whether or not to attempt a circuit setup, service-provider measures such as utilization should also be considered, since utilization ultimately does impact users through the prices charged. Total network utilization has two components, the aggregate network utilization $u_a$ and the per-circuit utilization $u_c$, which are given by:

$$u_a = \frac{(1 - P_b)\,a}{m}, \quad \text{where } P_b = \frac{a^m/m!}{\sum_{k=0}^{m} a^k/k!} \text{ (Erlang-B formula)} \qquad (6)$$

$$u_c = \frac{E[T_{transfer}]}{E[T_{setup}] + E[T_{transfer}]}, \quad \text{where } E[T_{transfer}] = \frac{E[X]}{r_c} \qquad (7)$$

where a is the offered traffic, m is the number of circuits, E[X] is the average file size, and $r_c$ is the circuit rate. Restricting transfers on the circuit-switched network to files larger than some crossover file size $\phi$, we can compute the fractional offered load $a'$ and the average file size $E[X \mid X > \phi]$ if we know the distribution of file sizes. Reference [43] suggests a Pareto distribution for file sizes. Using this distribution, we compute the fractional offered load $a'$ as:

$$a' = a \cdot P(X > \phi)\,\frac{E[X \mid X > \phi]}{E[X]} = a\left(\frac{k}{\phi}\right)^{\alpha - 1} \qquad (8)$$

where $\alpha$, the shape parameter, is 1.06, k, the scale parameter, is 1000 bytes, as computed in [43], and a is the total offered load. We note that the offered load decreases as $\phi$ increases, which means the aggregate utilization $u_a$ decreases for a given $P_b$. However, as $\phi$ increases, the per-circuit utilization $u_c$ increases. Combining the two components of utilization, we obtain the total utilization u as follows:

$$u = \frac{(1 - P_b)\,a'}{m} \cdot \frac{E[X \mid X > \phi]/r_c}{E[T_{setup}] + E[X \mid X > \phi]/r_c} \qquad (9)$$

We plot the total utilization u in Figure 6 for different call-blocking probabilities $P_b$ and different values of $\phi$ and $T_{prop}$. As the crossover file size $\phi$ is increased, the plots show utilization increasing because of the second factor, i.e., the per-circuit utilization increases. However, the drop in the offered load, and the corresponding drop in the aggregate utilization, slows this growth in total utilization, leveling it off at some value below 1 or even causing it to drop slightly. In these plots, to keep $P_b$ constant as $\phi$ is increased, we compute m for each value of $\phi$ using the second equation of (6). The "zigzag" pattern of the plots occurs because m has to be an integer.

Figure 6: Plot of utilization u with a link rate of 100Mbps, $\rho_{sig} = \rho_{sp} = 0.7$, k = 20; (a) Pb = 0.3, (b) Pb = 0.01

From our file-transfer delay analysis, we did not obtain a crossover file size when $T_{prop}$ is large (e.g., 50ms), but from the utilization analysis here we see the need to place a lower bound on file sizes. Without such a lower bound, the per-circuit utilization can be poor. For example, for a 100KB file transfer on a 100Mbps circuit with 4 switches on the end-to-end path, we need 50.158ms of setup time for 8ms of total transfer time. As a result, the per-circuit utilization is only 13.7%, which is why the 50ms plots are at a lower utilization than the 0.1ms plots in Figure 6.
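The utilization computation in (6), (8), and (9) is easy to reproduce numerically. Below is a sketch with illustrative parameter values; the Erlang-B probability is computed with the standard recursion rather than the factorial form of (6), and the conditional Pareto mean $E[X \mid X > \phi] = \alpha\phi/(\alpha - 1)$ is used.

```python
# Sketch of the total-utilization computation of equations (6), (8), and (9).
# Parameter values are illustrative; alpha and the 1000-byte scale follow [43].

def erlang_b(a, m):
    """Erlang-B blocking probability via the stable recursion
    B(a, j) = a*B(a, j-1) / (j + a*B(a, j-1)), with B(a, 0) = 1."""
    b = 1.0
    for j in range(1, m + 1):
        b = a * b / (j + a * b)
    return b

def total_utilization(a, m, phi, e_setup, r_c, alpha=1.06, k_scale=1000.0):
    """Equation (9): aggregate utilization times per-circuit utilization for
    Pareto(alpha, k_scale) file sizes restricted to X > phi (bytes)."""
    a_prime = a * (k_scale / phi) ** (alpha - 1)      # equation (8)
    p_b = erlang_b(a_prime, m)
    mean_size = alpha * phi / (alpha - 1)             # E[X | X > phi], bytes
    e_transfer = mean_size * 8 / r_c                  # seconds on the circuit
    return ((1 - p_b) * a_prime / m) * (e_transfer / (e_setup + e_transfer))

# Example: offered load of 50 Erlangs, 150KB crossover size, 100Mbps circuits,
# ~50ms setup time; m is swept to see the utilization/blocking trade-off.
for m in (20, 40, 60):
    print(m, round(total_utilization(a=50, m=m, phi=150e3,
                                     e_setup=0.050, r_c=100e6), 3))
```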
Another observation is that high utilization is possible by operating the network at a high call-blocking probability (30%). For example, with an offered load of 50 Erlangs and $T_{prop}$ = 0.1ms, at a blocking probability of 30% we can achieve a 90% utilization at a crossover file size of 150KB, while at a low blocking probability (1%) we can only achieve a 73% utilization for the same crossover file size (150KB). When the CHEETAH service is first introduced, the number of end hosts equipped with second NICs and of enterprises equipped with MSPPs will be small. The network can then be operated at a high utilization and a high call-blocking probability, with many file transfers resorting to the TCP/IP path upon rejection from the optical network. But with growth in the number of CHEETAH service participants (as the offered load a increases), lower call-blocking probabilities can be achieved while maintaining high utilization.

These plots were generated assuming either that all calls are of the long-distance variety ($T_{prop}$ is 50ms) or that all calls are in small propagation-delay environments ($T_{prop}$ is 0.1ms). In reality, different file transfers will experience different round-trip propagation delays. This means the routing decision algorithm should use different crossover file sizes for different end-to-end paths.

3.2.4 Implementation of routing decision algorithm

The routing decision algorithm implemented at an end host could use dynamically obtained values of the RTT, $P_b$, $P_{loss}$, and link rate. However, such a dynamic algorithm could be complex. While RTT measurements can be made during the TCP connection establishment handshake, the other parameters are harder to estimate. Tomography experiments have shown that $P_{loss}$ can be estimated by end hosts [44]. Other options are to have network management stations track these values and respond to queries from end hosts. Since the benefit of using Ethernet/EoS circuits may not be significant for small file sizes, we need to carefully study the value of introducing this complexity. Alternatively, we could define static values for the RTT and the crossover file size based on nominal operating conditions of the two networks, and thereby simplify the routing decision algorithm implemented at end hosts. This needs experimental study.

Another question is whether the CHEETAH service should be implemented from IP router to IP router rather than end-to-end. We note that the routing decision on whether or not to attempt an Ethernet/EoS circuit is difficult to make within an IP router, because it is hard to extract information on the file size and the RTT at a router that supports many flows, and both of these parameters are important in making this decision. Other attempts have been made in the past to perform flow classification within routers and then trigger cut-through connections between routers [45]. Given the difficulties with these solutions, we conclude that the routing decision is best made at the end hosts, where it is easier to determine these parameters, and hence we propose CHEETAH as an end-to-end service. This work was presented at the PFLDNET 2003 workshop [46] and at Opticomm 2003 [47].

Chapter 4. Call-queueing/scheduling mode

In Table 1 (see Section 3.2.1), we list the delays for a 1TB file transfer. The numbers show that we need over 4 days and 15 hours with TCP if the round-trip propagation delay is 50ms, almost independent of the bottleneck link rate (100Mbps or 1Gbps) and $P_{loss}$. On the other hand, if we set up a 1Gbps circuit, we can complete the transfer in 2.3 hours.
This means that with such large files, we should not adopt an "attempt-circuit-setup-and-if-rejected-fall-back-to-Internet-path" approach (blocking mode). Instead, if the circuit-switched network offered some form of call queueing, then perhaps the wait time for a circuit could be shorter than the 4-day 15-hour TCP time. Hence we started looking for call-queueing algorithms. However, we quickly concluded that call queueing is not really feasible on a multiple-hop circuit, because utilization would suffer significantly if an upstream switch held resources while the call was queued at a downstream switch. Call scheduling, on the other hand, is possible if file-size information is provided to the network.

Before designing call-scheduling schemes for multiple-hop paths, we design call-scheduling algorithms for the single-link case. In the single-link problem, the main question is what bandwidth to assign a file transfer. Given that file transfers can occur at any rate, this is a key question that needs to be answered. We propose a scheme in which a file transfer is provided a vector of Time-Range-Capacity (TRC) allocations when admitted, where the capacity allocation varies from time range to time range. This is unlike the fixed-bandwidth allocation mode, where a fixed assignment of bandwidth is made for the entire duration of the transfer. To enable our proposed TRC allocation mode, we require end hosts to provide the network with the sizes of the files to be transferred. With information on the size of the file that an end host wants to transfer, the network can fit this file into time ranges when bandwidth is available, based on the TRC allocations of ongoing transfers. This allows the network to offer an incoming file transfer an increased amount of bandwidth in future time ranges when there are fewer competing transfers. Besides file size, we require end hosts requesting a file transfer i to specify two more parameters: R^i_max, a maximum bandwidth limit for the transfer, and T^i_req, a desired start time for the transfer.

First consider R^i_max. In principle, any amount of bandwidth can be allocated to a new transfer, because file transfers do not have an inherent bandwidth requirement. In practice, however, end hosts engaging in file transfers have communication link interface bandwidth limitations and/or processing limitations. Making a bandwidth allocation larger than R^i_max will only result in wasted bandwidth. Hence, we require end hosts to provide the network with this information. If a given transfer has no such limit (in other words, its limit is larger than the shared link bandwidth), then R^i_max is simply set equal to the link bandwidth.

Second, consider T^i_req. An end-host application may want to make a reservation for a circuit to transfer a file at a later time, say when it expects to have more resources. By booking ahead, it should have a greater probability of receiving the maximum bandwidth it requests from its requested start time. As in any reservation system, end users that alert the resource arbitrator ahead of time should be encouraged, to allow better management of resources. Hence, in solving the problem of resource allocation for file transfers, we allow both "immediate-request" (IR) and "book-ahead" (BA) calls.
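The request and allocation abstractions just introduced can be captured in a few lines of Python. This is a hedged sketch with names of our own choosing; the dissertation formalizes the corresponding notation in Table 4 below.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class TransferRequest:
    """A file-transfer request as characterized above."""
    file_size: int     # F^i: bytes remaining to transfer
    t_req: float       # T^i_req: desired start time (immediate or book-ahead)
    r_max: int         # R^i_max: maximum usable bandwidth, in channels

@dataclass
class TRCRange:
    """One Time-Range-Capacity entry: `capacity` channels held on [begin, end)."""
    begin: float       # B^i_k
    end: float         # E^i_k
    capacity: int      # C^i_k (holds a channel number instead for TRL entries)

# A TRC (or TRL) allocation is an ordered list of such ranges:
TRCVector = List[TRCRange]
```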
To understand the impact of these parameters, consider the following. If we disallow BA calls and there is no R^i_max constraint, the single-link sharing problem becomes simple. Without R^i_max constraints, all transfers can take advantage of the full link capacity, so each file is simply transmitted one after the other. In comparing such a solution with packet-by-packet statistical multiplexing, the main issue becomes fairness: if a very large file grabs the full resources of the link, smaller transfers will end up with unduly large queueing delays. A solution is to limit the maximum file size to some number, which we call the Maximum File Transfer Size (MFTS). This is analogous to the Maximum Transmission Unit (MTU) used in packet-switched networks to solve the same fairness problem at the packet level. The smaller the MFTS, the fairer the solution; in the limit, if the MFTS is equal to the MTU, the scheme reduces to packet-by-packet statistical multiplexing.

A slightly more complex problem is one in which we still disallow BA calls but allow R^i_max to vary from transfer to transfer. In this case, the switch has to find the time instant beyond which each channel becomes free and try to assign up to R^i_max channels in each time range. The problem is still not that complex, because an incoming call will not find any "holes" in the allocation; that is, there will be no free time ranges followed by reserved time ranges on any channel past the instant of its arrival. To find a TRC vector for this incoming transfer, the network can simply assign the minimum of the available capacity and R^i_max in each time range. Adding in BA calls creates holes in the future schedules, making the problem more complex.

In summary, the goal of this work is to develop a scheme to schedule file transfers characterized by (F^i, T^i_req, R^i_max), where F^i is the size of the file, T^i_req is the requested starting time, and R^i_max is the requested maximum bandwidth, on a single link of capacity C. To visualize a model of our system, see Figure 7. Source hosts make requests to transfer files to a destination D through a shared link leading out of the switch. We assume that the shared single link consists of m channels. File transfer requests arrive and depart dynamically. The R^i_max constraint can be thought of as arising from the access links connecting each source node S_i to the switch. A small part of the shared link resources is set aside for signaling, and the remaining part is allocated to the actual file transfers. We expect the delay impact incurred from signaling (call setup delay plus the increased transfer delay resulting from the bandwidth set aside for signaling) to be comparable to the packet-header overhead incurred in the packet-by-packet statistical multiplexing scheme.

We present two heuristic schemes for this scheduling problem. In the first scheme, our goal is to determine the capacity allocation in different time ranges, generating a TRC vector for a transfer. In the second scheme, we additionally determine the exact channels allocated to the transfer in each time range, and thus find a Time-Range-channeL (TRL) allocation vector. This becomes important when we extend our schemes from the single-link TDM/FDM multiplexing problem to the multiple-link circuit-switched network problem, since allocating resources to an end-to-end circuit traversing multiple links requires an allocation of channels on each link of the end-to-end path.
In some networks, there is a constraint that the same channel number be maintained on all links of the end-to-end path, e.g., in optical wavelength-division multiplexed (WDM) networks where the switches are not equipped with wavelength converters. Therefore, in addition to the total capacity allocation problem, we address the problem of how to allocate exact channel numbers in our single-link context.

[Figure 7: Single-link model: source hosts S_1, ..., S_N connect through a circuit switch to destination D over a shared link.]

4.1 Scheduling file transfers on a single link

We can think of three ways to assign bandwidth to file requests as they arrive:

- Greedy scheme: allocates the maximum available bandwidth that is less than or equal to R^i_max.
- Socialistic scheme: readjusts the bandwidth allocation of all ongoing transfers every time a new call is admitted, so that the link capacity C is divided equally at all times among the active file transfers.
- Capitalistic scheme: requires requestors to provide information on the price they are willing to pay for each transfer, and uses this information to allocate bandwidth in a manner that maximizes revenue for the owner of the shared link.

In this dissertation, we describe only a greedy scheme for scheduling calls. Details of the scheme are provided in Sections 4.1.1-4.1.4 below. The socialistic scheme may be feasible on a single link, but would be hard to implement in the multiple-link scenario, because there the bandwidth allocated to a circuit is the minimum available among all the links on the end-to-end path. This constraint will leave some bandwidth lying idle on certain links and may also cause shorter-route calls to get a larger share of the cumulative bandwidth. The packet-by-packet statistical multiplexing scheme indeed achieves socialistic scheduling, automatically admitting any new transfer and dividing the link bandwidth equally among all ongoing transfers; this is hard, if not impossible, to achieve in circuit-switched networks. In contrast, a capitalistic scheme is hard to implement in the packet-by-packet statistical multiplexing mode, while a capitalistic scheme that allocates bandwidth according to how much a user is willing to pay is more feasible in circuit-switched networks. Given the increasing interest in enabling service providers to earn revenues on the Internet by providing differentiated services [48], we believe there will be an increasing level of interest in using circuit-switched networks for file transfers, since the latter appear better suited to capitalistic resource sharing. Capitalistic schemes are also possible with extended versions of packet switching, e.g., with priority queueing or other forms of scheduling; such schemes lie between the extremes of the complete-sharing packet-switched scheme and the complete-partitioning fixed-bandwidth TDM/FDM scheme.

Before we can develop socialistic and capitalistic schemes in packet-switched and circuit-switched networks, and then compare them to either validate or disprove our intuition that packet-switched networks are better suited to socialistic scheduling and circuit-switched networks to capitalistic scheduling, we start by developing a greedy heuristic that overcomes the fundamental disadvantages of fixed-bandwidth TDM/FDM. We define two greedy schemes (list scheduling schemes) for an m-channel single link. The first is called Varying-Bandwidth List Scheduling (VBLS) and the second, a special case of practical interest, is called VBLS with Channel Allocation (VBLS/CA).
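Both schemes operate on the link's capacity availability function φ(t), formalized in Table 4 below. One simple way to represent it (our choice for these sketches, not necessarily the dissertation's implementation) is as a sorted list of change points with the step value held on each interval:

```python
import bisect
from typing import List

class CapacityAvailability:
    """Piecewise-constant phi(t): m_z channels available on [P_z, P_{z+1}).
    Past the last change point, all m channels of the link are available.
    Queries assume t >= points[0]."""

    def __init__(self, points: List[float], values: List[int], m: int):
        self.points = points      # change points P_1, P_2, ..., P_zmax
        self.values = values      # step values on the intervals between them
        self.m = m                # total number of channels on the link

    def phi(self, t: float) -> int:
        z = bisect.bisect_right(self.points, t) - 1
        return self.values[z] if z < len(self.values) else self.m

    def next_change(self, t: float) -> float:
        z = bisect.bisect_right(self.points, t)
        return self.points[z] if z < len(self.points) else float("inf")

# Roughly the phi(t) of Figure 8, with step values inferred from the worked
# examples that follow (treat these numbers as approximate):
phi = CapacityAvailability([0, 10, 20, 30, 60, 70, 80],
                           [0, 3, 2, 1, 2, 3], m=4)
```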
VBLS is the basic heuristic, in which we maintain only the total bandwidth allocated to a file transfer in different time ranges, while VBLS/CA takes practical considerations into account and tracks the actual channel allocations in different time ranges. Table 4 lists the notation used in the algorithm.

Table 4: Notation for VBLS

  F^i: File transfer size requested by call i.
  T^i_req: Start time requested for call i.
  R^i_max: Maximum rate requested for call i, expressed as a number of channels; typically limited by the access-link rate or end-host processing rates.
  TRC_i = {(B^i_k, E^i_k, C^i_k), k = 1, ..., η_i}: Time-Range-Capacity allocation. Capacity C^i_k is assigned to call i in time range k, starting at B^i_k and ending at E^i_k.
  φ(t): Capacity availability function, the total number of available channels at time t. φ(t) is expressed in the form φ(t) = m_z for P_z ≤ t < P_{z+1}, where m_z ≤ m and z = 1, 2, ..., z_max; z_max denotes the number of times φ(t) changes value before reaching m at t = P_{z_max}, after which all m channels of the link remain available. See Figure 8 for an example.
  β: Per-channel bandwidth.
  T_discrete: Discrete time unit.

[Figure 8: Example of φ(t) on a 4-channel link, with P_1 = 0, φ(0) = 0, P_2 = 10, z_max = 9, and P_{z_max} = P_9 = 80.]

4.1.1 Varying-Bandwidth List Scheduling (VBLS) overview

The scheduler maintains an available capacity function φ(t). Since it knows the TRC allocations of all ongoing file transfers, it knows when, and how much, link capacity is available for a new request. A request i specifies (F^i, T^i_req, R^i_max); the switch's response is TRC_i, an allocation of capacity over different time ranges for request i. Capacity allocation for a new request is made on a round-by-round basis, where a round consists of the procedures used to allocate capacity for a time range that extends between two consecutive change points in φ(t). In the time range between two consecutive change points, we determine (i) whether the entire remaining file can be transferred before the range ends, and (ii) whether the available capacity is greater than, or less than or equal to, R^i_max. We define four cases corresponding to the four possible outcomes of these two decisions. At the end of each round, we compute the remaining size of the file and start the next round.

4.1.2 Detailed description of VBLS

Start algorithm: set the time τ = T^i_req, the remaining file size F = F^i, and k = 1.

Repeat loop (start next round): find z such that P_z ≤ τ < P_{z+1} in the capacity availability function φ(t). If φ(τ) = 0, then reset τ = P_{z+1} and continue the repeat loop (start the next round).

Case 1: the number of available channels is less than or equal to R^i_max, and the whole file can be transmitted before the next change in the available capacity curve, i.e., φ(τ) ≤ R^i_max and F ≤ φ(τ)·β·(P_{z+1} − τ). Then set B^i_k = τ, E^i_k = τ + F/(φ(τ)·β), and C^i_k = φ(τ) (the begin time, end time, and capacity allocation of the kth range of file transfer i). Set η_i = k (the total number of time ranges allocated to file transfer i). Terminate the repeat loop.

Case 2: the number of available channels is less than or equal to R^i_max, and the whole file cannot be transmitted before the next change in the available capacity curve, i.e., φ(τ) ≤ R^i_max and F > φ(τ)·β·(P_{z+1} − τ). Then set B^i_k = τ, E^i_k = P_{z+1}, and C^i_k = φ(τ). Set F = F − φ(τ)·β·(P_{z+1} − τ), k = k + 1, and τ = P_{z+1}. Continue the repeat loop (start the next round).
Case 3: the number of available channels is greater than R^i_max, and the whole file can be transmitted before the next change in the available capacity curve, i.e., φ(τ) > R^i_max and F ≤ R^i_max·β·(P_{z+1} − τ). Then set B^i_k = τ, E^i_k = τ + F/(R^i_max·β), and C^i_k = R^i_max. Set η_i = k. Terminate the repeat loop.

Case 4: the number of available channels is greater than R^i_max, and the whole file cannot be transmitted before the next change in the available capacity curve, i.e., φ(τ) > R^i_max and F > R^i_max·β·(P_{z+1} − τ). Then set B^i_k = τ, E^i_k = P_{z+1}, and C^i_k = R^i_max. Set F = F − R^i_max·β·(P_{z+1} − τ), k = k + 1, and τ = P_{z+1}. Continue the repeat loop (start the next round).

End repeat loop.

As an example of VBLS, consider scheduling the transfer of a 5GB file with a T^i_req of 50 and an R^i_max of 2. Let β, the per-channel link capacity, be 10Gbps, and let each unit of time correspond to 100ms. Assume the 4-channel link state is as shown in Figure 8. In the time range 50 ≤ t < 60, we can schedule 1 channel for the transfer; within this range, 1.25GB (= 10Gbps × 10 × 100ms) can be transferred. In the 60 ≤ t < 70 range, we can allocate 2 channels, since R^i_max is 2, and can therefore transfer 2.5GB. The remaining 1.25GB can be assigned to 2 channels past t = 70. Even though the available capacity is 3 channels in the range 70 ≤ t < 80, we can only assign two channels because of the R^i_max limit. Therefore the TRC vector is {(50, 60, 1), (60, 70, 2), (70, 75, 2)}, where each tuple is of the form (B^i_k, E^i_k, C^i_k), and the number of ranges η_i for this call is 3. The TRC allocation is indicated by the shaded area in Figure 9.

[Figure 9: Shaded area shows the TRC allocation for the example file transfer with T^i_req = 50 and R^i_max = 2 channels.]
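The four cases above reduce to one short loop. The sketch below reuses the TransferRequest/TRCRange dataclasses and the CapacityAvailability class sketched earlier; it computes the TRC vector for one request and leaves the commit step (subtracting the allocation from φ(t)) to the caller. It is our reading of the procedure, not code from the dissertation.

```python
from typing import List

def vbls_allocate(req: TransferRequest, phi: CapacityAvailability,
                  beta: float) -> List[TRCRange]:
    """Varying-Bandwidth List Scheduling for one request (Section 4.1.2).
    beta is the per-channel bandwidth in bytes per time unit."""
    trc: List[TRCRange] = []
    tau, remaining = req.t_req, float(req.file_size)
    while remaining > 0:
        avail, nxt = phi.phi(tau), phi.next_change(tau)
        if avail == 0:               # nothing free: jump to the next change point
            tau = nxt
            continue
        cap = min(avail, req.r_max)  # Cases 1/2 (avail <= R_max) vs 3/4 (avail > R_max)
        if remaining <= cap * beta * (nxt - tau):  # Cases 1 and 3: file completes here
            trc.append(TRCRange(tau, tau + remaining / (cap * beta), cap))
            return trc
        trc.append(TRCRange(tau, nxt, cap))        # Cases 2 and 4: fill the whole range
        remaining -= cap * beta * (nxt - tau)
        tau = nxt
    return trc
```

On the example above, `vbls_allocate(TransferRequest(int(5e9), 50, 2), phi, beta=0.125e9)` (β written as 0.125 GB per time unit, i.e., 10 Gbps × 100 ms) returns the ranges (50, 60, 1), (60, 70, 2), and (70, 75, 2), matching the hand-derived TRC vector.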
4.1.3 VBLS with Channel Allocation (VBLS/CA) overview

VBLS/CA is a specific instance of VBLS. In this heuristic, the network keeps track of the occupancy of each of the link's m channels individually as a function of time, unlike before, where the network tracked only the total available capacity over time (φ(t)). This is required for practical reasons: in electronic TDM switches, the channel numbers are needed to establish the crossconnections at the right instants in time, while in all-optical WDM networks without wavelength converters, the same channel additionally has to be selected on multiple links in the multiple-link problem. We introduce additional notation in Table 5 to handle the tracking of channels, and then describe VBLS/CA.

Table 5: Additional notation

  A_j(t): Indicator of the availability of channel C_j; A_j(t) = 1 for G^d_j ≤ t < H^d_j, and 0 otherwise. G^d_j and H^d_j are the start and end points of time range d during which channel C_j is available, for 1 ≤ d ≤ n_j and 1 ≤ j ≤ m; the end point of the last time range, H^{n_j}_j, is necessarily ∞. See Figure 10 for an example.
  C(t): The exact set of channels available at instant t.
  TRL_i = {(B^i_k, E^i_k, L^i_k), k = 1, ..., η_i}: Time-Range-channeL allocation. Channel L^i_k is assigned to call i in time range k, starting at B^i_k and ending at E^i_k.

[Figure 10: Per-channel availability A_j(t) for channels 1-4; G^1_1 = 10, H^1_1 = 30, n_1 = 2. The φ(t) shown in Figure 8 is derived from this A_j(t). Lines indicate the time ranges in which each channel is available.]

A summary of the additions needed to account for channels is as follows.

First, we track the channel availability A_j(t) over time t for each channel j, in addition to tracking φ(t), the total available bandwidth. We also track the set of available channels C(t). Time ranges are demarcated by changes in φ(t) or C(t). For example, in the φ(t) of Figure 8, one channel is available in the range 30 ≤ t < 60. However, different channels are available in different sub-ranges within this range: channel 4 is available in the range 30 ≤ t < 40, and channel 1 is available in the range 40 ≤ t < 60. Thus t = 40 should be regarded as a change point.

Second, we keep track of a set called C_open, the set of channels that remain available past a time-range demarcation point in φ(t) or C(t). The reason for tracking these channels is that they can be allocated again in the next time range, saving switch-programming time.

Third, if multiple channels are allocated within the same time range, we count each such allocation as a separate entry in the Time-Range-channeL (TRL) vector. Therefore, a round of the algorithm can advance the range-tracking variable k by more than 1, unlike in the basic VBLS scheme, where it always advances by exactly 1.

Fourth, when there are choices, i.e., when there are more candidate channels than the heuristic needs, we propose two selection rules (sketched in code below). If the file transfer completes within the time range where the choice must be made, we choose the channels with the smallest remaining available time; if it does not complete, we choose the channels with the largest remaining available time. The rationale for the former is to limit "holes" in the time-range allocations of channels, while the rationale for the latter is to decrease the number of switch-reprogramming actions needed.

There are other practical details, such as accounting for the time needed to change the crossconnections established through the switch at the boundary of every time range in a file's TRL allocation. In electronic switches, the time to set up or release a crossconnection is in nanoseconds, while in all-optical switches this time is currently in milliseconds. Capacity allocations in TRL vectors should take this overhead into account.
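The two selection rules in the fourth point can be written compactly. The sketch assumes each candidate channel's remaining available time (the H^d_j of Table 5, measured from the current time) has already been computed; the function name is ours.

```python
from typing import Dict, List

def pick_channels(leftover: Dict[int, float], count: int,
                  completes_here: bool) -> List[int]:
    """Choose `count` channels from {channel_number: remaining_available_time}.
    If the transfer finishes within this time range, prefer the channels with
    the SMALLEST leftover availability (limits the holes left behind);
    otherwise prefer the LARGEST leftover (fewer switch-reprogramming actions
    later). Ties are broken arbitrarily by the sort order."""
    ordered = sorted(leftover, key=leftover.get, reverse=not completes_here)
    return ordered[:count]
```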
4.1.4 Detailed description of VBLS/CA

Start algorithm: set the time τ = T^i_req, the remaining file size F = F^i, k = 1, and C_open = ∅.

Repeat loop (start next round): find z such that P_z ≤ τ < P_{z+1}. If φ(τ) = 0, then reset τ = P_{z+1} and continue the repeat loop.

Case 1: the number of available channels is less than or equal to R^i_max, and the whole file can be transmitted before the next change in the available capacity curve, i.e., φ(τ) ≤ R^i_max and F ≤ φ(τ)·β·(P_{z+1} − τ). Then:
- Set C^i_{k'} = C_open[k' − k + 1] for k' = k, ..., k + |C_open| − 1 (continue using the open channels from the previous time range).
- Check whether the remaining file can be transmitted on the open channels alone before P_{z+1}:
  - If true, set B^i_{k'} = τ and E^i_{k'} = τ + F/(|C_open|·β) for k' = k, ..., k + |C_open| − 1. Set η_i = k + |C_open| − 1. Terminate the repeat loop.
  - If false, choose all channels in C(τ) \ C_open and set these channel numbers into C^i_{k'} for k' = k + |C_open|, ..., k + φ(τ) − 1. Set B^i_{k'} = τ and E^i_{k'} = τ + F/(φ(τ)·β) for k' = k, ..., k + φ(τ) − 1. Set η_i = k + φ(τ) − 1. Terminate the repeat loop.

Case 2: the number of available channels is less than or equal to R^i_max, but the whole file cannot be transmitted before the next change in the available capacity curve, i.e., φ(τ) ≤ R^i_max and F > φ(τ)·β·(P_{z+1} − τ). Then:
- Set C^i_{k'} = C_open[k' − k + 1] for k' = k, ..., k + |C_open| − 1 (continue using the open channels from the previous time range).
- Choose all channels in C(τ) \ C_open and set these channel numbers into C^i_{k'} for k' = k + |C_open|, ..., k + φ(τ) − 1.
- Set B^i_{k'} = τ and E^i_{k'} = P_{z+1} for k' = k, ..., k + φ(τ) − 1.
- Store the set of still-open channels as C_open by removing the channels for which H^d_j = P_{z+1} from the set of channels C^i_{k'}, k' = k, ..., k + φ(τ) − 1.
- Set F = F − φ(τ)·β·(P_{z+1} − τ), k = k + φ(τ), and τ = P_{z+1}. Continue the repeat loop (start the next round).

Case 3: the number of available channels is greater than R^i_max, and the whole file can be transmitted before the next change in the available capacity curve, i.e., φ(τ) > R^i_max and F ≤ R^i_max·β·(P_{z+1} − τ). Then check whether |C_open| ≥ R^i_max:
- If true, choose the R^i_max channels out of C_open with the smallest leftover available ranges, by comparing H^d_j over all j ∈ C_open. The reason for choosing the channels with the smallest leftover available ranges is to leave the smallest gaps; if multiple channels have the same smallest leftover range, choose at random. Set these channel numbers into C^i_{k'} for k' = k, ..., k + R^i_max − 1. Set B^i_{k'} = τ and E^i_{k'} = τ + F/(R^i_max·β) for k' = k, ..., k + R^i_max − 1. Set η_i = k + R^i_max − 1. Terminate the repeat loop.
- If false, set C^i_{k'} = C_open[k' − k + 1] for k' = k, ..., k + |C_open| − 1 (continue using the open channels from the previous time range), and check whether the remaining file can be transmitted on these open channels alone before P_{z+1}:
  - If true, set B^i_{k'} = τ and E^i_{k'} = τ + F/(|C_open|·β) for k' = k, ..., k + |C_open| − 1. Set η_i = k + |C_open| − 1. Terminate the repeat loop.
  - If false, from the channels in C(τ) \ C_open, choose the (R^i_max − |C_open|) channels with the smallest leftover available ranges, by comparing H^d_j over all j ∈ C(τ), j ∉ C_open (again to leave the smallest gaps, breaking ties at random). Set these channel numbers into C^i_{k'} for k' = k + |C_open|, ..., k + R^i_max − 1. Set B^i_{k'} = τ and E^i_{k'} = τ + F/(R^i_max·β) for k' = k, ..., k + R^i_max − 1. Set η_i = k + R^i_max − 1. Terminate the repeat loop.

Case 4: the number of available channels is greater than R^i_max, but the whole file cannot be transmitted before the next change in the available capacity curve, i.e., φ(τ) > R^i_max and F > R^i_max·β·(P_{z+1} − τ). Then check whether |C_open| ≥ R^i_max:
- If true, choose the R^i_max channels from C_open with the largest leftover available ranges, by comparing H^d_j over all j ∈ C_open. The reason for choosing the channels with the largest leftover available ranges is that fewer switch-reprogramming actions are then required; if multiple channels have the same largest leftover range, choose at random. Set these channel numbers into C^i_{k'} for k' = k, ..., k + R^i_max − 1. Set B^i_{k'} = τ and E^i_{k'} = P_{z+1} for k' = k, ..., k + R^i_max − 1. Store the set of still-open channels as C_open by removing the channels for which H^d_j = P_{z+1} from the set of channels C^i_{k'}, k' = k, ..., k + R^i_max − 1. Set F = F − R^i_max·β·(P_{z+1} − τ), k = k + R^i_max, and τ = P_{z+1}. Continue the repeat loop (start the next round).
- If false, set C^i_{k'} = C_open[k' − k + 1] for k' = k, ..., k + |C_open| − 1 (continue using the open channels from the previous time range). From the channels in C(τ) \ C_open, choose the (R^i_max − |C_open|) channels with the largest leftover available ranges, by comparing H^d_j over all j ∈ C(τ), j ∉ C_open.
The reason for choosing the channels with the largest leftover available ranges is, again, that fewer switch-reprogramming actions are required; if multiple channels have the same largest leftover range, choose at random. Set these channel numbers into C^i_{k'} for k' = k + |C_open|, ..., k + R^i_max − 1. Set B^i_{k'} = τ and E^i_{k'} = P_{z+1} for k' = k, ..., k + R^i_max − 1. Store the set of still-open channels as C_open by removing the channels for which H^d_j = P_{z+1} from the set of channels C^i_{k'}, k' = k, ..., k + R^i_max − 1. Set F = F − R^i_max·β·(P_{z+1} − τ), k = k + R^i_max, and τ = P_{z+1}. Continue the repeat loop.

End repeat loop.

Merge time ranges on a given channel: find the subset of time ranges in TRL_i corresponding to each channel m'. Order the time ranges in this subset so that B^i_k ≤ E^i_k ≤ B^i_{k+1} for k = 1, 2, ...; if E^i_k = B^i_{k+1}, merge the kth and (k+1)th ranges. At the end of this operation, we obtain the final η_i time ranges of TRL_i.

End algorithm.
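The merge pass is straightforward. A sketch, reusing the TRCRange dataclass from earlier with `capacity` holding the channel number of each TRL entry:

```python
from typing import List

def merge_trl(trl: List[TRCRange]) -> List[TRCRange]:
    """Merge back-to-back time ranges allocated on the same channel, so the
    switch is programmed once per contiguous interval. Output is grouped by
    channel number rather than globally time-ordered."""
    merged: List[TRCRange] = []
    for ch in sorted({r.capacity for r in trl}):
        runs = sorted((r for r in trl if r.capacity == ch), key=lambda r: r.begin)
        for r in runs:
            if merged and merged[-1].capacity == ch and merged[-1].end == r.begin:
                merged[-1] = TRCRange(merged[-1].begin, r.end, ch)  # extend previous run
            else:
                merged.append(r)
    return merged
```

Applied to the worked example that follows, this turns the six raw entries into {(10, 30, 1), (40, 50, 1), (10, 40, 4)}, the same final vector (up to ordering) derived below.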
As an example of VBLS/CA, assume that the link is in the same starting state as shown in Figure 8 and Figure 10, and that the file to be scheduled is a 75MB file with a T^i_req of 10 and an R^i_max of 2. Let β, the per-channel link bandwidth, be 1Gbps, and let each unit of time correspond to 10ms, as before. In the time range 10 ≤ t < 20, we can schedule 2 channels for the transfer. This falls into Case 4, where φ(τ) > R^i_max and the whole file cannot be transmitted before the next change in φ(t), which occurs at t = 20. Here a choice needs to be made from among the three available channels 1, 2, and 4 (see Figure 10). We choose channels 1 and 4 because they have the largest leftover times in their availability ranges. At the end of the first round, we have two time-range-channel allocations, {(10, 20, 1), (10, 20, 4)}, where 10 is the begin time, 20 is the end time, and the channel numbers are 1 and 4, respectively. With these allocations, we can transmit 25MB of the file.

The second round begins with the set C_open containing channels (1, 4), because these channels continue to be available in the next z-range of φ(t), which is 20 ≤ t < 30. Since the remaining part of the file cannot be fully transmitted in this range, and the available capacity is equal to R^i_max (2 channels), we fall into Case 2. At the end of this second round, we have two more time-range-channel allocations, {(20, 30, 1), (20, 30, 4)}, and can again transmit 25MB of the file.

The third round begins with the set C_open consisting of only one channel (4). An interesting case occurs here: the end of this time range is at t = 40, even though in the φ(t) curve of Figure 8 it appears to stretch to t = 60. The reason is that C(t), the set of available channels, changes at t = 40: the one available channel changes at this point from channel 4 to channel 1 (see Figure 10). This third round, which also falls into Case 2, therefore results in one time-range-channel allocation, {(30, 40, 4)}, in which we can schedule 12.5MB of the remainder of the file, leaving only 12.5MB to schedule.

The fourth round starts with no channels in the C_open set. Since we can finish the transfer before the next change point, which occurs at t = 60, and the available capacity is less than R^i_max, we select the one channel available, which is channel 1. This is a Case 1 round, and the last allocation becomes {(40, 50, 1)}. At this point, when we finish the repeat loop of the heuristic, the TRL allocation vector is {(10, 20, 1), (10, 20, 4), (20, 30, 1), (20, 30, 4), (30, 40, 4), (40, 50, 1)}. To illustrate the final merge function of the VBLS/CA heuristic, we now merge consecutive time ranges on a given channel, avoiding extra switch-reprogramming actions. Following this merge, the final TRL vector is {(10, 30, 1), (10, 40, 4), (40, 50, 1)}, and the number of time ranges η_i for the transfer is 3. The allocation is illustrated in Figure 11.

[Figure 11: Dashed lines show the allocation of resources for the example 75MB file transfer described above, with T^i_req = 10 and R^i_max = 2 channels.]

4.2 Analysis and simulation results

We describe our traffic model in Section 4.2.1. In Section 4.2.2, we show how we validated our simulation against an analytical model. To understand the VBLS system better, we carried out four sensitivity-analysis experiments, described in Section 4.2.3. In Section 4.2.4, we describe a simulation comparison of VBLS with a fixed-bandwidth list scheduling (FBLS) scheme and a packet-switched (PS) system; it shows that VBLS can overcome the classical drawback of circuit switching described at the beginning of Chapter 4 and achieve close-to-PS performance. Finally, we describe some practical considerations regarding the implementation of VBLS in Section 4.2.5.

4.2.1 Traffic model

We assume that file transfer requests arrive according to a Poisson process with rate λ. The requested start time T^i_req of every transfer is assumed to be equal to the corresponding call arrival time, i.e., all calls are of the "immediate-request" type. We assume that file sizes are distributed according to a bounded Pareto distribution [49]. Specifically, the file-size probability density function is given by (10):

$$f_X(x) = \frac{\alpha k^\alpha x^{-\alpha-1}}{1 - (k/p)^\alpha}, \qquad k \le x \le p$$   (10)

where α is the shape parameter, and k and p are the lower and upper bounds, respectively, of the allowed file-size range.

4.2.2 Validation of simulation against analytical results

To validate our VBLS simulator, we model and simulate a simple case in which all file requests set their R^i_max to match the link capacity C. In this case, we can model the VBLS system as an M/G/1 queue. In an M/G/1 system where the arrival rate is λ and X is a random variable representing the service time, the average waiting time E[W] is [50]:

$$E[W] = \frac{\lambda E[X^2]}{2(1-\rho)}$$   (11)

where E[X²] is the second moment of the service-time distribution and ρ = λE[X] is the system load. Using the file-size distribution specified in (10), and taking the circuit rate to be the full link capacity βC (footnote 3), we obtain:

$$E[X^2] = \frac{1}{(\beta C)^2}\int_k^p x^2 f_X(x)\,dx = \frac{1}{(\beta C)^2}\cdot\frac{\alpha k^\alpha}{1-(k/p)^\alpha}\cdot\frac{p^{2-\alpha}-k^{2-\alpha}}{2-\alpha}$$   (12)

[Footnote 3: Here we make the idealized assumption that the service time is the file size divided by the circuit rate. Realistically, retransmissions due to link errors and/or flow-control buffer overflows will be needed.]

[Figure 12: File latency comparison between analytical and simulation results.]

Figure 12 shows our numerical results comparing the mean waiting time of the analytical model of (11)-(12) with our simulation results. The input parameter values are k = 500MB, p = 100GB, and α = 1.1.
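For reference, the analytical curve of Figure 12 can be evaluated directly from (11) and (12). A minimal sketch with our own function names, using the idealized service time of footnote 3:

```python
def bp_moment(n: float, k: float, p: float, alpha: float) -> float:
    """n-th raw moment of the bounded Pareto(k, p, alpha) of (10); assumes n != alpha."""
    norm = alpha * k**alpha / (1 - (k / p)**alpha)
    return norm * (p**(n - alpha) - k**(n - alpha)) / (n - alpha)

def mg1_wait(lam: float, k: float, p: float, alpha: float, rate: float) -> float:
    """Mean M/G/1 waiting time (11), with service time = file size / circuit rate."""
    es = bp_moment(1, k, p, alpha) / rate        # E[X] of the service time
    es2 = bp_moment(2, k, p, alpha) / rate**2    # E[X^2], the closed form of (12)
    rho = lam * es                               # system load
    assert rho < 1, "load must be below 1 for stability"
    return lam * es2 / (2 * (1 - rho))

# Example: k = 500 MB, p = 100 GB, alpha = 1.1, circuit rate 100 x 10 Gbps.
rate = 100 * 10e9 / 8                                  # bytes per second
lam = 0.9 / (bp_moment(1, 500e6, 100e9, 1.1) / rate)   # arrival rate at load 0.9
print(mg1_wait(lam, 500e6, 100e9, 1.1, rate))
```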
We define the term file latency to denote the mean waiting time across all files transferred. The waiting time of a given file transfer i is defined as the duration between T^i_req (which is the call arrival time in our simulations) and the instant at which the first bit of the file is transmitted, i.e., served. The system load ρ is defined as λ multiplied by the mean service time E[X]. The latter is fixed, because the mean file size and the link capacity are fixed; this means that to generate an increasing system load, we simply increase the call arrival rate λ. Note that for stability the system load must be below 1. As can be seen in Figure 12, the analytical and simulation results match closely. In other words, our simulation code is validated and can be used to obtain more interesting results in which calls specify varying values of R^i_max, and/or to compare VBLS with other scheduling schemes.

4.2.3 Sensitivity analysis

In this section, we carry out four experiments: (i) to understand the impact of R^i_max when all calls request the same constant R^i_max; (ii) to understand the impact of the allowed file-size range (i.e., the parameters k and p); (iii) to understand the impact of R^i_max when calls request three different values of R^i_max; and (iv) to understand the impact of the size of T_discrete (the discrete time unit).

For the first experiment, we still assume that all calls request the same R^i_max, but study the system performance under different values of R^i_max (1, 5, 10, and 100 channels) when the link capacity C is 100 channels. What matters is not the actual rate of a channel but the ratio between R^i_max and C. As can be expected, if R^i_max for all calls is 1 channel, we obtain lower file latency than in the case when all files request an R^i_max equal to the link capacity; in other words, the mean waiting time grows with R^i_max, as shown in Figure 13(a). However, if we consider the mean file-transfer delay, which includes both the file latency (mean waiting time) and the mean service time (transmission delay), we get the opposite result, as shown in Figure 13(b): the case in which all calls request 100 channels outperforms the case in which all calls request 1 channel. This is effectively the same as the well-known result that the mean response time of an M/M/1 system with service rate mμ is lower than that of a system of m M/M/1 servers each operating at service rate μ [50].

[Figure 13: A comparison of VBLS (a) file latency and (b) mean file-transfer delay for files requesting the same R^i_max (1, 5, 10, or 100 channels); link capacity is 100 channels.]

For the second experiment, we studied the sensitivity of the results to the lower and upper bounds of the allowed file-size range, k and p in equation (10). This experiment is useful in helping us determine the file-size range suitable for dedicated circuits.
Clearly, the use of dedicated circuits for small files is not recommended, because the round-trip propagation delay incurred in establishing a wide-area circuit could add significant overhead to the total transfer delay, especially as link rates increase. Therefore, we compare two cases (only the lower bound k is varied):

Case 1: k = 500MB, p = 100GB, α = 1.1
Case 2: k = 10GB, p = 100GB, α = 1.1

For each case, we ran three sets of simulations with R^i_max equal to 1 channel, 5 channels, and 10 channels, respectively, for all calls. The link capacity was assumed to be 100 channels.

[Figure 14: File latency comparison for two different values of k (500MB and 10GB) while keeping p constant at 100GB.]

As shown in Figure 14, the file latency (mean waiting time) is lower for Case 1 than for Case 2. This is an interesting result, because it is the opposite of what we expected. Since the file-size range is smaller in Case 2, we expected the service-time variance to be lower, and this in turn to result in a lower file latency (see equation (11)). This expectation is incorrect because of the form of the Pareto distribution: the variance is in fact greater in Case 2. We demonstrate this feature by an analysis of the simulated values. We divided the entire file-size range of each case into 50 bins and counted the frequency of occurrence of file sizes in each bin over all the generated file-transfer requests. As can be seen in Figure 15, most of the files belong to the small file-size bins (the first or second bins) in Case 1. This means the probabilities associated with these small file-size bins are quite high, and the probabilities of files belonging to the larger-size bins are small. Since Var[X] = E[X²] − (E[X])², and E[X] in Case 1 is quite small at 2.27GB, the variance is actually lower in Case 1 than in Case 2.

[Figure 15: Frequencies for different file-size ranges (total number of bins = 50; bin size = 1.99GB for Case 1 and 1.8GB for Case 2); the mean file size is 2.27GB in Case 1 and 24.6GB in Case 2.]

From this analysis, it is clear why the file latency is lower if we decrease the lower bound k while keeping the upper bound p constant. The opposite holds if we keep the lower bound k constant but increase the upper bound p: the density flattens out as p is increased, increasing the variance and, correspondingly, the file latency.
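The variance claim is easy to check numerically with the bounded-Pareto moment helper sketched in Section 4.2.2 (again an illustration of ours, not the dissertation's code):

```python
def bp_variance(k: float, p: float, alpha: float = 1.1) -> float:
    """Var[X] = E[X^2] - (E[X])^2 for the bounded Pareto of (10)."""
    return bp_moment(2, k, p, alpha) - bp_moment(1, k, p, alpha) ** 2

case1 = bp_variance(500e6, 100e9)   # k = 500 MB, p = 100 GB
case2 = bp_variance(10e9, 100e9)    # k = 10 GB,  p = 100 GB
print(case1 < case2)                # True: Case 2 has the larger variance
```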
For the third experiment, we studied the impact of R^i_max when calls are allowed three different values of R^i_max. We carried out the following two cases:

Case 1: β (per-channel rate) = 10Gbps, C = 1Tbps (100 channels); R^i_max = 1 channel (30% of calls), R^i_max = 2 (30%), and R^i_max = 4 (40%)
Case 2: β (per-channel rate) = 1Gbps, C = 100Gbps (100 channels); R^i_max = 1 channel (30%), R^i_max = 5 (30%), and R^i_max = 10 (40%)

For the input parameters, we chose α = 1.1, k = 100MB, and p = 1TB. In both cases, the link capacity was 100 channels.

For this experiment, we used a new performance metric called file throughput, which we define as the long-run average of the file size divided by the file-transfer delay. The file-transfer delay of file i is defined as the duration between T^i_req and the instant when the transmission of file i is complete.

[Figure 16: File throughput metric for files requesting three different values of R^i_max (1, 2, and 4 channels; β = 10Gbps).]

[Figure 17: File throughput metric for files requesting three different values of R^i_max (1, 5, and 10 channels; β = 1Gbps).]

As shown in Figure 16, file transfers with the smallest value of R^i_max (1 in our example) suffer the least degradation of file throughput at a given system load, and the differences among the three values of R^i_max become more apparent as the system load increases. For example, at a load of 0.91, the system performance degradation for file transfers with R^i_max = 1 is 28.5%, while for file transfers with R^i_max = 2 and R^i_max = 4 these numbers are 33.3% and 37.9%, respectively. Similar results are obtained in Figure 17: at a load of 0.92, the degradation for file transfers with R^i_max = 1 is 13.3%, while for file transfers with R^i_max = 5 and R^i_max = 10 these numbers are 29.4% and 39.5%, respectively. From this experiment, we learn that when different values of R^i_max are allowed, file requests with larger values of R^i_max experience more performance degradation than those with smaller values as the system load increases.

For the fourth experiment, we studied the impact of the discrete time unit under different values of T_discrete: 0.05, 0.5, 1, and 2 sec. We assume that all calls request the same R^i_max (1 channel in our experiment), and the link capacity C is assumed to be 100 channels. Figure 18 shows the file throughput comparison for the different values of T_discrete. As the value of T_discrete is increased, the file throughput worsens with increasing system load, because discretization occupies more unused time ranges, causing succeeding file transfers to experience larger file latency. For example, if all calls request the same R^i_max of 1 channel (10Gbps), then with a discrete time unit of 1 sec, the most we could end up over-allocating for each file in the worst case is 1.25GB. Since the file-size range is from 500MB to 100GB, this 1.25GB is significant per file transfer, so the impact on file throughput is large. In addition, a large value of T_discrete (say 2 sec) sacrifices utilization, as shown in Figure 19. One interesting finding, however, is that utilization does not drop much under low-load conditions. This is because at low load each file does not experience the delay caused by other ongoing file transfers; the file delay incurred at low load depends only on the length of the discrete time unit. This delay lowers the system performance, but it does not significantly impact utilization. In Figure 19, the utilization curves stop increasing beyond a certain load.
This means that the system is already full from that load onward.

[Figure 18: File throughput comparison for different values of T_discrete (0.05, 0.5, 1, and 2 sec).]

[Figure 19: Utilization comparison for different values of T_discrete (0.05, 0.5, 1, and 2 sec).]

To understand the impact on utilization theoretically, we derived the following upper bound for the utilization penalty due to discretization. Suppose that in a discrete-time implementation of VBLS the discrete time unit is T_discrete seconds, i.e., the scheduler is invoked every T_discrete seconds, and that a file transfer taking T seconds under ideal non-discrete-time VBLS needs ⌈T/T_discrete⌉ time slots to complete under discrete-time VBLS. The system then loses (⌈T/T_discrete⌉·T_discrete − T)/T × 100 percent of utilization on this transfer. It is interesting to find an upper bound for this penalty. Denote the sequence of sizes of the files arriving at the system by {F^i, i = 1, 2, ...} and the times taken by the files to get through the system by {T^i, i = 1, 2, ...}. Clearly T^i ≥ F^i/(R^i_max·β) for all i, since the speed of the ith file transfer is bounded by R^i_max·β, where β is the per-channel bandwidth. The utilization loss U^i_loss for the ith file, computed as (⌈T^i/T_discrete⌉·T_discrete − T^i)/T^i, can be bounded by T_discrete/T^i, because ⌈T^i/T_discrete⌉·T_discrete − T^i ≤ T_discrete. Further, U^i_loss can be bounded by:

$$U^i_{loss} \le \frac{T_{discrete}}{F^i/(R^i_{max}\beta)} = \frac{R^i_{max}\,\beta\,T_{discrete}}{F^i}$$   (13)

Therefore, the average utilization loss over n files,

$$\bar{U}_{loss} = \frac{1}{n}\sum_{i=1}^{n} U^i_{loss}$$   (14)

is upper bounded by

$$\frac{1}{n}\sum_{i=1}^{n} \frac{R^i_{max}\,\beta\,T_{discrete}}{F^i}$$   (15)

If the F^i, i = 1, 2, ..., are identically distributed, then

$$\frac{1}{n}\sum_{i=1}^{n} \frac{R^i_{max}\,\beta\,T_{discrete}}{F^i} \to R^i_{max}\,\beta\,T_{discrete}\,E\!\left[\frac{1}{F^i}\right] \quad \text{as } n \to \infty$$   (16)

In other words, we can compute the upper bound on the system utilization loss incurred by discretization as:

$$\bar{U}_{loss} \le R^i_{max}\,\beta\,T_{discrete}\,E\!\left[\frac{1}{F^i}\right]$$   (17)

For example, in our experiment, when all files request the same R^i_max of 1 channel, where the per-channel bandwidth is 10Gbps, we can calculate the upper bound on the system utilization loss from (17). When T_discrete equals 0.05 sec, the utilization loss incurred by discretization is always less than 0.6567%. This bound grows with T_discrete: under the same simulation settings, when T_discrete equals 2 sec, the upper bound on the utilization loss is 26.26% from (17).
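The bound (17) is a one-liner given the bounded-Pareto moment helper of Section 4.2.2 (E[1/F] is the n = -1 raw moment). The function is our sketch; note that the 0.6567% and 26.26% figures quoted above are reproduced with a per-channel rate of 1Gbps:

```python
def discretization_loss_bound(r_max_beta_bps: float, t_discrete: float,
                              k: float, p: float, alpha: float = 1.1) -> float:
    """Upper bound (17) on average utilization loss:
    R_max * beta * T_discrete * E[1/F], with file sizes in bytes."""
    e_inv_f = bp_moment(-1, k, p, alpha)        # E[1/F]
    return (r_max_beta_bps / 8) * t_discrete * e_inv_f

print(discretization_loss_bound(1e9, 0.05, 500e6, 100e9))  # ~0.0066 (0.66%)
print(discretization_loss_bound(1e9, 2.0, 500e6, 100e9))   # ~0.26   (26%)
```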
4.2.4 Comparison of VBLS with FBLS and PS

The primary objective of this simulation study is to compare the performance of VBLS with that of two alternative file-transfer schemes serving the same file requests: (1) a packet-switched (PS) system, and (2) a fixed-bandwidth list scheduling (FBLS) scheme. (FBLS is the greedy scheme that schedules each file request to start as soon as possible, using a fixed bandwidth of R^i_max.) The rationale for this comparison is to illustrate that although VBLS is a circuit-based resource-sharing scheme, its throughput behavior (on a file-by-file basis) more closely mimics packet switching than it does FBLS.

As pointed out before, standard circuit switching using FBLS is expected to produce significantly lower file throughputs than packet switching, simply because ongoing file transfers cannot exploit the bandwidth released by completed file transfers. The variable-bandwidth nature of VBLS mitigates this loss of throughput; indeed, by design VBLS exploits the bandwidth released by completed transfers.

In the simulation of the PS system, files are divided into packets of length 1500 bytes, which arrive at an infinite packet buffer at a constant packet rate equal to R^i_max·β divided by the packet length; in other words, the packet interarrival time of a file is the packet length divided by the file's R^i_max·β value. For the input parameters, we choose α = 1.1, k = 500MB, and p = 100GB. Instead of a constant R^i_max for all calls, here we allow three values of R^i_max (1, 5, and 10 channels) with corresponding probabilities of 0.3, 0.3, and 0.4, respectively. The link capacity C is 100 channels. We use the same performance metric as in Section 4.2.3, file throughput.

4.2.4.1 Numerical results

Figures 20-22 show plots of file throughput versus system load for VBLS, FBLS, and packet switching (PS). The plots are categorized according to the value of R^i_max; the rationale is that file throughputs (especially at low loads) are naturally limited by their R^i_max values, so comparing throughputs of files with different R^i_max values is inappropriate.

[Figure 20: File throughput metric for files requesting R^i_max of 1 channel (10Gbps).]

[Figure 21: File throughput metric for files requesting R^i_max of 5 channels (50Gbps).]

[Figure 22: File throughput metric for files requesting R^i_max of 10 channels (100Gbps).]

As we can see in Figures 20-22, VBLS achieves throughput values that lie above those of FBLS, as expected. Significantly, the throughput performance of VBLS is indistinguishable from packet switching. This illustrates our main point: by taking into account file sizes and varying the bandwidth allocation of each transfer over its duration, we mitigate the performance degradation usually associated with circuit-based methods.

We note that our simulation of the PS scheme is of an infinite-buffer system. This is clearly an idealized packet-switching scenario. In practice, buffers are finite, which means packet losses will occur under congestion. Mechanisms such as TCP's congestion control are then required to recover from these losses with retransmissions and rate adjustments, and TCP mechanisms can add significant delays to the total file-transfer delay [47].

4.2.5 Practical considerations

VBLS achieves close-to-PS performance at the cost of complexity relative to FBLS. First, as noted in Section 4.1.1, VBLS requires that the circuit switches be reprogrammed multiple times within a transfer, unlike the fixed-bandwidth allocation mode, where the switch is programmed only at the start and end of a call.
With electronic TDM switches, where switch-programming times are on the order of nanoseconds [32], the impact of reprogramming on utilization will be smaller than with the slower optical Micro-Electro-Mechanical Systems (MEMS) switches [51]. If VBLS is used only for very large files, these overheads are relatively insignificant.

Second, to make the capacity availability function φ(t) practical to implement, we need to limit the number of bandwidth change points z_max. For this purpose, we discretize time and only allow bandwidth changes to fall on discrete time instants. The smaller the discrete time unit, the larger the storage needed for φ(t); the larger this unit, the worse the utilization, because bandwidth cannot be reassigned in the middle of a time range. Details such as these will be explored in an implementation of VBLS.

Propagation delays and clock-synchronization issues become important in distributed implementations of this scheduling algorithm. For now, since the dynamic provisioning of circuits is handled in a centralized manner, the VBLS scheduler can also be centralized, avoiding propagation-delay and clock-synchronization problems. Maintenance of the schedules for switch reprogramming should ideally be at the switches themselves, with timer-based triggers.

Adopting the VBLS scheme in CHEETAH instead of the call-blocking mode could yield a further gain, i.e., lower file-transfer delay. For example, when a very large file arrives (for example 1TB; see the beginning of Chapter 4), the end host could receive high-speed service via the end-to-end circuit rather than the TCP/IP backup path. In addition, the routing decisions of Chapter 3 might have to be adapted. This work was presented in [52]-[55].

Chapter 5. Multiple-link cases (centralized and distributed)

The centralized on-line greedy scheme for the multiple-link case is a straightforward extension of the single-link case: we can create a new φ(t) reflecting the bandwidth available across all links. In the distributed on-line greedy scheme for the multiple-link case, however, each switch has to keep the same information, such as TRC_i and TRL_i, in a distributed manner. In this chapter, we describe only the distributed on-line greedy scheme for the multiple-link case. Details of the scheme are provided in Section 5.1; in Section 5.2, we describe the analysis and simulation results.

5.1 VBLS algorithm for the multiple-link case

A few practical issues need to be solved before the call-scheduling algorithm can be implemented in the multiple-link case. For example, we need to deal with time synchronization: given that multiple end hosts and switches have to interpret the allocated times, such as B^i_k and E^i_k, in the same manner, some synchronization method is needed. An extensive survey done in 1995 reported that most machines using NTP (Network Time Protocol) to synchronize their clocks are within 21ms of their synchronization sources, and all are within 29ms on average [56]. Mechanisms that use relative time values, or some similar approach, are needed to deal with these differences. A call setup delay of 50ms is not in itself an issue, because other calls can use the circuit while a call is being scheduled. However, consider an 8ms transfer (100KB at 100Mbps) from Boston to California: by the time the first bit arrives in CA, 25ms have passed during which the resources at the Boston switch are held but not utilized. Hence we need staggered setup; conveniently, folding the propagation delay and the clock-synchronization offset into a single correction, as below, naturally suits the staggered-setup approach.
A rule: a time instant specified in a message sent by a switch S_n is expressed according to the clock of that switch.

[Figure 23: VBLS on a path of K hops: switch S_1 sends (TRC^i_{1f}, TS_{1f}) forward to S_2, which sends (TRC^i_{2f}, TS_{2f}) to S_3, and so on up to S_N; reverse messages (TRC^i_{1r}, TS_{2r}), (TRC^i_{2r}, TS_{3r}), ... release unused reservations.]

Consider the first time range (B^i_{11}, E^i_{11}, C^i_1) in the TRC vector TRC^i_{1f} carried forward from switch S_1 to switch S_2 in Figure 23. The call-scheduling message suffers a certain propagation delay p_12. Furthermore, the clocks at switches S_1 and S_2 may not be synchronized; let this clock difference be denoted δ_12, with switch S_2's clock ahead of S_1's clock (footnote 4). Switch S_1 places a time stamp TS_{1f} into the message just before it emits the call-scheduling message onto the link. When switch S_2 receives the message, it immediately places a time stamp TS_{2f}, the current time by switch S_2's clock, into the message buffer along with the message. Then:

$$TS_{2f} = TS_{1f} + \delta_{12} + p_{12}$$   (18)

[Footnote 4: If S_1's clock is ahead of S_2's clock, δ_12 will be negative.]

Switch S_2 can now interpret the times carried in TRC^i_{1f} for the ith call and check whether the bandwidth C^i_1 is available within the time range (PB^i_{12}, PE^i_{12}), where:

$$PB^i_{12} = B^i_{11} + \delta_{12} + p_{12} = B^i_{11} + TS_{2f} - TS_{1f}$$   (19)

$$PE^i_{12} = E^i_{11} + \delta_{12} + p_{12} = E^i_{11} + TS_{2f} - TS_{1f}$$   (20)

This action identifies time ranges that are subsets (proper or equal) of the time ranges selected in the TRC vector received from the upstream switch. Assuming C^i_1 ≤ R^i_max, the goal is to find a set of ranges (sub-ranges of (PB^i_{12}, PE^i_{12})) in which C^i_j is less than or equal to C^i_1 for j ≥ 2. We define the following operations:

$$TRC^i_{(n+1)f} = TRC^i_{nf} \otimes \varphi_{n+1}(t)$$   (21)

$$TRC^i_{nr} = TRC^i_{nf} \ominus TRC^i_{(n+1)f}$$   (22)

where ⊗ restricts the forward TRC vector to the capacity available on the downstream link, and the reverse vector TRC^i_{nr} captures the portion of the reservation at switch n that the downstream allocation does not use and that can therefore be released. A range (B^{if}_{k'(n+1)}, E^{if}_{k'(n+1)}, C^{if}_{k'(n+1)}) belongs to TRC^i_{(n+1)f} if:

$$PB^i_{kn} \le B^{if}_{k'(n+1)} \le E^{if}_{k'(n+1)} \le PE^i_{kn}$$   (23)

$$C^{if}_{k'(n+1)} \le C^{if}_{kn}$$   (24)

and a range (B^{ir}_{kn}, E^{ir}_{kn}, C^{ir}_{kn}) belongs to TRC^i_{nr} if:

$$C^{ir}_{kn} = C^{if}_{kn} - C^{if}_{k(n+1)} > 0 \quad \text{for } B^{ir}_{kn} \le t \le E^{ir}_{kn}$$   (25)

Table 6 lists our notation for the multiple-link case.

Table 6: Notation for the multiple-link case

  TRC^i_{nf} = {(B^{if}_{kn}, E^{if}_{kn}, C^{if}_{kn}), k = 1, ..., η_{ni}}, n = 1, 2, ..., N: Time-Range-Capacity allocation. Capacity C^{if}_{kn} is assigned to call i by switch n in the time range starting at B^{if}_{kn} and ending at E^{if}_{kn}. Since the number of time ranges can change from link to link, we add the subscript n to η_i.
  TRC^i_{nr} = {(B^{ir}_{kn}, E^{ir}_{kn}, C^{ir}_{kn}), k = 1, ..., η_{ni}}: Time-Range-Capacity allocation to be released. Capacity C^{ir}_{kn} is to be released at the upstream switch, starting at B^{ir}_{kn} and ending at E^{ir}_{kn}.
  φ_n(t): Capacity availability function of link n, the total number of channels available at time t at switch n. φ_n(t) is expressed in the form φ_n(t) = m_z for P_z ≤ t < P_{z+1}, where m_z ≤ m_n and z = 1, 2, ..., z_max; z_max denotes the number of times φ_n(t) changes value before reaching m_n at t = P_{z_max}, after which all m_n channels of link n remain available.
  β: Per-channel bandwidth.
  (PB^i_{kn}, PE^i_{kn}): Potential begin and end times of the kth range on the nth link.
  M: Multiplicative factor used in reserving TRCs; if M = 5, the TRC vector reserved is 5 times the TRC allocation needed to transfer the file.
  r_n: The rth range on the (n−1)th link becomes r_n ranges on the nth link.
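Equations (18)-(20) amount to shifting every received time range by the measured timestamp difference. A sketch (names ours), reusing the TRCRange dataclass:

```python
from typing import List

def shift_trc(trc_in: List[TRCRange], ts_upstream: float,
              ts_local: float) -> List[TRCRange]:
    """Re-express an upstream TRC vector in the local clock, per (19)-(20).
    ts_local - ts_upstream = delta + p captures the clock offset AND the
    one-way propagation delay together, so neither needs to be known
    separately."""
    shift = ts_local - ts_upstream
    return [TRCRange(r.begin + shift, r.end + shift, r.capacity) for r in trc_in]
```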
As an example of multiple-link VBLS, consider scheduling the transfer of a 50MB file with a T^i_req of 50 and an R^i_max of 2. Let β, the per-channel link bandwidth, be 1Gbps, and let each unit of discrete time correspond to 10ms. Assume the 4-channel state of link 1 is as shown in Figure 24. In the time range 50 ≤ t < 60, we can schedule 1 channel for the transfer; within this range, 12.5MB (= 1Gbps × 10 × 10ms) can be transferred. In the 60 ≤ t < 70 range, we can allocate 2 channels, since R^i_max is 2, and can therefore transfer 25MB. The remaining 12.5MB can be assigned to 2 channels past t = 70. Even though the available bandwidth is 3 channels in the range 70 ≤ t < 80, we can only assign two channels because of the R^i_max limit. Therefore, the TRC vector is {(50, 60, 1), (60, 75, 2)}, where each tuple is of the form (B^i_k, E^i_k, C^i_k), and the number of ranges for this call is 2 if the multiplicative factor M is 1. If M is 2, however, then TRC^i_{1f} = {(50, 60, 1), (60, 95, 2)}, enough to transfer 100MB when we really only need a TRC allocation sufficient to transfer 50MB. The TRC allocation is indicated by the shaded area in Figure 24.

[Figure 24: Example of φ_1(t) on link 1, with P_1 = 0, φ(0) = 0, P_2 = 10, z_max = 9, P_{z_max} = P_9 = 80, and M = 2; the shaded area shows the allocation for the example file transfer.]

Switch 2 compares the received TRC^i_{1f} with φ_2(t) and creates a new TRC vector. Assuming the propagation delay is 10ms and the clocks are synchronized, as shown in Figure 25, the TRC vectors at switch 2 are TRC^i_{2f} = {(51, 80, 1), (80, 80.5, 2)} and TRC^i_{2r} = {(60, 80, 1), (85.5, 95, 2)}.

[Figure 25: Example of φ_2(t) on link 2, with P_1 = 0, φ(0) = 0, P_2 = 10, z_max = 9, and P_{z_max} = P_9 = 80; the shaded area shows the allocation for the example file transfer.]

5.2 Analysis and simulation results

We describe our traffic model in Section 5.2.1. To understand the VBLS system better in multiple-link scenarios, we carried out two sensitivity-analysis experiments, described in Section 5.2.2.

5.2.1 Traffic model

The network model studied is presented in Figure 26. It consists of three switches, three sources (Source, Src1, and Src2), and three destinations (Dest, Dest1, and Dest2). Both link capacities, C_12 (between SW1 and SW2) and C_23 (between SW2 and SW3), are assumed to be 100 channels (per-channel bandwidth 10Gbps). The file transfers between Source and Dest are the ones studied ("study traffic"), while the file transfers between Srcx and Destx are treated as "interference traffic" to simulate cross traffic. As in the single-link case (see Section 4.2.1), we assume that file transfer requests from all sources arrive according to a Poisson process, and that the requested start time T^i_req of each file equals its arrival time. The size of each file is distributed according to a bounded Pareto distribution with a mean of 2.27GB; for the input parameters, we choose α = 1.1, k = 500MB, and p = 100GB. To simulate the scheduling scheme under different load conditions, the mean file arrival rate of the study traffic is kept constant through all simulations while the mean call arrival rate of the interference traffic is varied. The mean call arrival rate used by Source is 10 files/sec, which introduces a system load of 18% into the network. The mean call arrival rates used for the interference traffic (generated by Src1 and Src2) take the values 5, 10, 15, 20, 25, 30, 35, and 40 files/sec. Under these load conditions, the links between the switches experience a total load varying from 27% to 91%.
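The quoted 18% study-traffic load follows directly from these parameters:

$$\rho_{study} = \frac{\lambda\,E[X]}{C_{12}} = \frac{10\ \text{files/s} \times 2.27\ \text{GB} \times 8\ \text{bits/byte}}{1\ \text{Tbps}} \approx 0.18.$$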
5.2.2 Sensitivity analysis

In this section, we carry out two experiments: (i) to understand the impact of $M$ (the multiplicative factor), and (ii) to understand the impact of the discrete time unit.

For the first experiment, we assume that all calls request the same $R_{max}^i$ (= 1 channel), the link capacity is 100 channels, the discrete time unit and the propagation delay are fixed at 10 ms and 5 ms respectively, and the clocks are synchronized. Under different values of $M$, i.e., $M = 2$, $M = 3$, and $M = 4$, we study the percentage of blocked calls and the file throughput as the interference traffic load increases. As shown in Figure 27, increasing the value of $M$ decreases the call-blocking probability significantly: when SW1 reserves with a large $M$, the chance that a succeeding switch can successfully find a set of time ranges within the ranges allocated by SW1 increases. However, as the value of $M$ increases, file throughput drops significantly, as shown in Figure 28. This is because successive file transfers at SW1 suffer large latency caused by the size of $M$: if $M$ equals four, SW1 must reserve four times $F^i$, and because of these large occupied time ranges, succeeding file transfers experience large delays.

[Figure 27: Percentage of blocked calls for different values of $M$, with $T_{discrete} = 0.01$ sec and $p_{12} = p_{23} = 5$ ms]

[Figure 28: File throughput (Gbps) for different values of $M$, with $T_{discrete} = 0.01$ sec and $p_{12} = p_{23} = 5$ ms]

For the second experiment, we again assume that all calls request the same $R_{max}^i$ (= 1 channel) and that the link capacity is 100 channels. The values of $M$ and the propagation delay are fixed at 3 and 5 ms respectively, and the clocks are synchronized. Under different values of $T_{discrete}$ (the discrete time unit), i.e., $T_{discrete} = 0.01$ sec, $T_{discrete} = 0.1$ sec, and $T_{discrete} = 1$ sec, we study the percentage of blocked calls and the file throughput as the interference traffic load increases. As shown in Figure 29, as the value of $T_{discrete}$ increases, the call-blocking percentage of our VBLS scheme increases as well. This is because if $T_{discrete}$ is large, much larger time ranges are occupied at SW1 and the succeeding switches; when SW2 receives the TRC vector from SW1, the scheduler at SW2 has less chance of finding subsets of the time ranges selected by SW1, because the available empty ranges are smaller. As already noted in the single-link case, there is a tradeoff between the storage needed for $\alpha(t)$ and system performance. As expected, file throughput drops significantly as the system load increases (see Figure 30). This result is consistent with those found in the single-link case (see Section 4.2.3).

[Figure 29: Percentage of blocked calls for different values of $T_{discrete}$, with $M = 3$ and $p_{12} = p_{23} = 5$ ms]

[Figure 30: File throughput (Gbps) for different values of $T_{discrete}$, with $M = 3$ and $p_{12} = p_{23} = 5$ ms]
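To make the role of $M$ in these experiments concrete, the following sketch (ours) reproduces the single-link reservation of the Chapter 5 example (50 MB file, $T_{req}^i = 50$, $R_{max}^i = 2$, 1 Gbps channels, 10 ms units) for $M = 1$ and $M = 2$. The availability profile is an assumption read off Figure 24 for $t \ge 50$, and the fractional trimming of the final range (e.g., the 80.5 endpoint at switch 2) is omitted for simplicity.

```python
UNIT_MB = 1.25                          # 1 Gbps x 10 ms per channel-unit
AVAIL = {50: 1, 60: 2, 70: 3, 80: 4}    # free channels from each instant (assumed)

def channels_at(t):
    """Free channels at time t (valid for t >= 50 in this example)."""
    return max(a for s, a in AVAIL.items() if s <= t)

def reserve_trc(treq, rmax, file_mb, m_factor):
    """Greedily reserve m_factor times the volume needed for the file."""
    slots = round(m_factor * file_mb / UNIT_MB)   # channel-units to reserve
    trc, t = [], treq
    while slots > 0:
        c = min(channels_at(t), rmax)             # capacity for this unit
        if trc and trc[-1][1] == t and trc[-1][2] == c:
            trc[-1] = (trc[-1][0], t + 1, c)      # extend the open range
        else:
            trc.append((t, t + 1, c))             # start a new range
        slots -= c
        t += 1
    return trc

print(reserve_trc(50, 2, 50, 1))   # [(50, 60, 1), (60, 75, 2)]
print(reserve_trc(50, 2, 50, 2))   # [(50, 60, 1), (60, 95, 2)]
```

The second line shows how $M = 2$ doubles the reserved volume: the tail range stretches from $t = 75$ to $t = 95$, which is exactly the over-reservation that the downstream switches trim and release via the $TRC^{ir}$ vectors.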
Chapter 6: Conclusions and future work

In Chapters 2 and 3, we propose improving the delay performance of file transfers by using intra-network circuit paths where possible. Specifically, we propose a service called CHEETAH in which pairs of end hosts are connected on a call-by-call basis via high-speed end-to-end Ethernet/EoS circuits. This is feasible today given the deployment of fiber to enterprises, of MSPPs in enterprises, and of EoS technologies within these MSPPs. Seeking to achieve high utilization, we propose setting up unidirectional EoS circuits and holding circuits only for the duration of the actual file transfers. The CHEETAH service is proposed as an add-on to basic Internet access service. This allows the optical circuit-switched network to be operated in call-blocking mode: if the circuit setup is blocked, an end host can fall back to the TCP/IP path. If the circuit setup is successful, there is a huge advantage in total delay, especially in wide-area environments. For example, a 1TB file requires 2.2 hours on a 1Gbps end-to-end circuit, but could take more than 4 days on a TCP/IP path in a WAN environment.

We analyzed the conditions under which a circuit setup should be attempted. For WAN environments and large files, it is clear that a circuit setup should be attempted. We also found that for medium-sized files (MBs) in WAN environments, the attempt is worthwhile. In lower propagation-delay environments, if bottleneck link rates are on the order of 100Mbps, attempting a circuit setup becomes worthwhile for files larger than 3.5MB. For higher link rates (1Gbps) or smaller files, one should consider the loading conditions on the two paths, the probability of packet loss on the TCP/IP path, and the call-blocking probability through the circuit-switched network before deciding whether or not to attempt the circuit setup.

In Chapters 4 and 5, instead of the call-blocking mode of CHEETAH, we propose a call-scheduling algorithm in which varying levels of capacity are allocated to a file transfer using knowledge of the file sizes of all admitted transfers. We call this heuristic for scheduling file transfers Varying-Bandwidth List Scheduling (VBLS). By having end-host applications provide file sizes to the VBLS scheduler, it is possible to make a Time-Range-Capacity (TRC) vector allocation for file transfers. This approach overcomes a well-known drawback of using circuits for file transfers: a fixed-bandwidth allocation mode fails to let users take advantage of bandwidth that becomes available after a transfer has started. We demonstrate through simulation that VBLS improves performance significantly over fixed-bandwidth schemes, making it indistinguishable from packet switching. Adopting the VBLS scheme in CHEETAH instead of the call-blocking mode can thus further improve the gain, i.e., reduce file-transfer delay.
For example, when a very large file arrives (e.g., 1TB; see the beginning of Chapter 4), the end host can receive high-speed service via the end-to-end circuit rather than over the TCP/IP backup path. In addition, the routing decisions made in Chapter 3 might have to be adapted.

As one part of our future work, we can include a second class of user requests, specifically targeted at interactive (long-holding-time) applications such as remote visualization and simulation steering. Such requests will be specified as $(H^i, R_{min}^i, R_{max}^i, T_{req}^i)$, where $H^i$ is the holding time, $R_{min}^i$ and $R_{max}^i$ are the minimum and maximum bandwidth acceptable to the user, and $T_{req}^i$ is the requested start time. The VBLS scheduler can handle such requests in the same manner as it does file-transfer requests: it allocates a TRC vector starting at some time $T_{start}^i$ and ending at $T_{start}^i + H^i$. During this interval $T_{start}^i \le t \le T_{start}^i + H^i$, varying levels of bandwidth will be allocated in a TRC vector such that the capacity assigned in any time range $k$ is not less than $R_{min}^i$ and not greater than $R_{max}^i$. A sketch of this handling is given at the end of this chapter.

As another part of our future work, we can extend the simulations for the multiple-link case. One possible set of simulations extends the sensitivity analysis of VBLS by varying the propagation delays while keeping the other parameters, i.e., $M$ and $T_{discrete}$, fixed. Another possible set of simulations is a comparison between VBLS and TCP/IP. In our previous simulations comparing VBLS with a packet-switched system (PS), we assumed infinite buffers, and hence no packet loss; in practice buffers are finite, packets are lost to buffer overflow, and lost packets must be retransmitted. The time spent retransmitting lost packets degrades the performance of a packet-switched system.
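The following illustrative Python fragment (our own design for the future-work item above, with a hypothetical discrete availability list) finds the earliest start time at which at least $R_{min}$ channels remain free for $H$ consecutive time units, then allocates $\min(\mathrm{available}, R_{max})$ in each unit, which is exactly the clamping rule stated above.

```python
# Illustrative sketch of the interactive-request extension
# (H, Rmin, Rmax, Treq). The discrete availability list is hypothetical.

def schedule_interactive(avail, H, rmin, rmax, treq):
    """avail[t]: free channels at time t. Returns (Tstart, TRC vector)."""
    for t0 in range(treq, len(avail) - H + 1):
        if min(avail[t0:t0 + H]) >= rmin:          # feasible start found
            trc = []
            for t in range(t0, t0 + H):
                c = min(avail[t], rmax)            # clamp into [rmin, rmax]
                if trc and trc[-1][1] == t and trc[-1][2] == c:
                    trc[-1] = (trc[-1][0], t + 1, c)   # extend range
                else:
                    trc.append((t, t + 1, c))          # new range
            return t0, trc
    return None, []                                 # no feasible window

avail = [0, 0, 1, 2, 2, 3, 3, 1, 2]
print(schedule_interactive(avail, H=4, rmin=2, rmax=3, treq=0))
# (3, [(3, 5, 2), (5, 7, 3)])
```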
Appendices

A-1. Examples of VBLS/CA

In this section, to illustrate the VBLS/CA algorithm more thoroughly, we provide two examples. For both, assume the granularity of our discrete time is 1 ms and the per-channel rate $\beta$ is 1 Gbps, so one channel carries 0.125 MB per slot.

For the first example, assume the file being scheduled is 15.625 MB, which means it requires 125 slots. The other parameters of the file-transfer request are listed in Table 7, and the state of the single link is shown in Figure 31.

Table 7: Input parameters for example 1

Parameter | Value
$T_{req}^i$ | 15
$R_{max}^i$ | 3 channels
$m$ | 5 channels

[Figure 31: $A^i(t)$ for the 5-channel link; dashed lines show the allocation of resources for the example 15.625 MB file transfer with $T_{req}^i = 15$ and $R_{max}^i = 3$]

Table 8 shows the TRL allocation vector for each round. After Round 10, all the time ranges have been identified for the file transfer. Now we need to merge ranges on the same channel. From Table 8, we see that for channel 2, the ranges (35-40), (40-45), (45-50), (50-55), and (55-60) can be merged. For channel 3, the ranges (30-35) and (35-40) can be merged, as can (45-50) and (50-55). After also merging the ranges for channels 4 and 5, the final allocation is $TRL^i$ = {$(B_1^i=15, E_1^i=30, L_1^i=5)$, $(B_2^i=15, E_2^i=25, L_2^i=4)$, $(B_3^i=15, E_3^i=20, L_3^i=2)$, $(B_4^i=30, E_4^i=40, L_4^i=3)$, $(B_5^i=35, E_5^i=60, L_5^i=2)$, $(B_6^i=40, E_6^i=68.3, L_6^i=4)$, $(B_7^i=45, E_7^i=55, L_7^i=3)$, $(B_8^i=55, E_8^i=68.3, L_8^i=5)$, $(B_9^i=60, E_9^i=68.3, L_9^i=1)$}, where $L_k^i$ is the channel assigned to the $k$th range. This allocation is shown with dashed lines in Figure 31.

Table 8: TRL vectors for each round (Example 1)

Round | Slots remaining | $z$ | $t$ | $C_{open}$ | Ranges added to $TRL^i$ $(B, E, L)$
Round 0 (initial) | 125 | 0 | 15 | {} | {}
Round 1 | 110 | 1 | 20 | {4, 5} | (15, 20, 5), (15, 20, 4), (15, 20, 2)
Round 2 | 100 | 2 | 25 | {5} | (20, 25, 4), (20, 25, 5)
Round 3 | 95 | 3 | 30 | {} | (25, 30, 5)
Round 4 | 90 | 4 | 35 | {3} | (30, 35, 3)
Round 5 | 80 | 5 | 40 | {2} | (35, 40, 3), (35, 40, 2)
Round 6 | 70 | 6 | 45 | {2, 4} | (40, 45, 2), (40, 45, 4)
Round 7 | 55 | 7 | 50 | {2, 3, 4} | (45, 50, 2), (45, 50, 3), (45, 50, 4)
Round 8 | 40 | 8 | 55 | {1, 2, 4} | (50, 55, 2), (50, 55, 3), (50, 55, 4)
Round 9 | 25 | 9 | 60 | {1, 4, 5} | (55, 60, 2), (55, 60, 4), (55, 60, 5)
Round 10 (exit loop) | 0 | 10 | 70 | {5} | (60, 68.3, 1), (60, 68.3, 4), (60, 68.3, 5)

For the second example, assume the file being scheduled is 3.125 MB, which means it requires 25 slots. The other parameters are listed in Table 9, and the state of the single link is shown in Figure 32.

Table 9: Input parameters for example 2

Parameter | Value
$T_{req}^i$ | 32
$R_{max}^i$ | 1 channel
$m$ | 7 channels

[Figure 32: $A^i(t)$ for the 7-channel link; dashed lines show the allocation of resources for the example 3.125 MB file transfer with $T_{req}^i = 32$ and $R_{max}^i = 1$]

Table 10 shows the TRL allocation vector for each round. After Round 6, all the time ranges have been identified. Now we merge ranges on the same channel, if any. From Table 10, we see that for channel 2 the ranges (35-40), (40-45), (45-50), (50-55), and (55-60) can be merged into one time range. Therefore, $TRL^i$ = {$(B_1^i=35, E_1^i=60, L_1^i=2)$}. The final allocation is illustrated with dashed lines in Figure 32.

Table 10: TRL vectors for each round (Example 2)

Round | Slots remaining | $z$ | $t$ | $C_{open}$ | Ranges added to $TRL^i$ $(B, E, L)$
Round 0 (initial) | 25 | 0 | 32 | {} | {}
Round 1 | 25 | 1 | 35 | {} | {}
Round 2 | 20 | 2 | 40 | {2, 3} | (35, 40, 2)
Round 3 | 15 | 3 | 45 | {2, 4, 6, 7} | (40, 45, 2)
Round 4 | 10 | 4 | 50 | {2, 4, 6, 7} | (45, 50, 2)
Round 5 | 5 | 5 | 55 | {1, 2, 3, 4, 6} | (50, 55, 2)
Round 6 | 0 | 6 | 60 | {1, 4, 5} | (55, 60, 2)
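The merge step at the end of both examples is mechanical; the following sketch (ours, with illustrative names) coalesces contiguous per-round ranges on each channel and reproduces example 2's result.

```python
# Sketch of the per-channel merge step used at the end of VBLS/CA
# (Appendix A-1): adjacent rounds that landed on the same channel are
# coalesced into single time ranges.

from collections import defaultdict

def merge_trl(ranges):
    """ranges: list of (begin, end, channel); returns the merged TRL."""
    by_channel = defaultdict(list)
    for b, e, ch in ranges:
        by_channel[ch].append((b, e))
    merged = []
    for ch, spans in by_channel.items():
        spans.sort()
        cur_b, cur_e = spans[0]
        for b, e in spans[1:]:
            if b == cur_e:                  # contiguous: extend
                cur_e = e
            else:                           # gap: emit and restart
                merged.append((cur_b, cur_e, ch))
                cur_b, cur_e = b, e
        merged.append((cur_b, cur_e, ch))
    return sorted(merged)

# Example 2's per-round ranges, all on channel 2 (Table 10):
rounds = [(35, 40, 2), (40, 45, 2), (45, 50, 2), (50, 55, 2), (55, 60, 2)]
print(merge_trl(rounds))   # [(35, 60, 2)], matching TRL^i above
```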
A-2. Some characteristics of the bounded Pareto distribution

In our simulations, for the file-size distribution, we used the bounded Pareto distribution rather than the exponential distribution. In this section, we briefly review some important characteristics of the bounded (truncated) Pareto distribution.

The probability density function (PDF) of the file size is

$$f_X(x) = \frac{\alpha k^{\alpha}}{1 - (k/p)^{\alpha}}\, x^{-\alpha-1}, \qquad k \le x \le p, \qquad (26)$$

where $\alpha$ is the shape parameter, $k$ the lower bound, and $p$ the upper bound.

The mean of this distribution (first moment), for $\alpha \ne 1$, is

$$E[X] = \int_k^p x f_X(x)\,dx = \frac{k^{\alpha}}{1 - (k/p)^{\alpha}} \cdot \frac{\alpha}{\alpha - 1} \left( \frac{1}{k^{\alpha-1}} - \frac{1}{p^{\alpha-1}} \right). \qquad (27)$$

The second moment of this distribution, for $\alpha \ne 2$, is

$$E[X^2] = \int_k^p x^2 f_X(x)\,dx = \frac{k^{\alpha}}{1 - (k/p)^{\alpha}} \cdot \frac{\alpha}{2 - \alpha} \left( p^{\,2-\alpha} - k^{\,2-\alpha} \right). \qquad (28)$$

The general formula for the $j$th moment of the bounded Pareto distribution is

$$E[X^j] = \int_k^p x^j f_X(x)\,dx = \begin{cases} \dfrac{\alpha k^{\alpha}}{1 - (k/p)^{\alpha}} \cdot \dfrac{p^{\,j-\alpha} - k^{\,j-\alpha}}{j - \alpha}, & j \ne \alpha, \\[2ex] \dfrac{\alpha k^{\alpha}}{1 - (k/p)^{\alpha}} \left( \ln p - \ln k \right), & j = \alpha. \end{cases} \qquad (29)$$

The variance of the file-size distribution is

$$\sigma_X^2 = \int_k^p \left( x - E[X] \right)^2 f_X(x)\,dx = E[X^2] - \left( E[X] \right)^2. \qquad (30)$$
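As a numerical check on these formulas, the following sketch (ours) samples file sizes by inverse-transform sampling from the bounded Pareto CDF and compares the empirical mean with the value given by equation (27) for the simulation parameters of Section 5.2.1 ($\alpha = 1.1$, $k = 500$ MB, $p = 100$ GB); both come out near the quoted 2.27 GB.

```python
# Inverse-transform sampling from the bounded Pareto distribution, with a
# check of equation (27) for alpha = 1.1, k = 0.5 GB, p = 100 GB.

import random

def bounded_pareto(alpha, k, p):
    """One sample via the inverse CDF x = k / (1 - u*(1-(k/p)^alpha))^(1/alpha)."""
    u = random.random()
    return k / (1.0 - u * (1.0 - (k / p) ** alpha)) ** (1.0 / alpha)

alpha, k, p = 1.1, 0.5, 100.0           # sizes in GB
mean = (k ** alpha / (1 - (k / p) ** alpha)) * (alpha / (alpha - 1)) \
       * (1 / k ** (alpha - 1) - 1 / p ** (alpha - 1))
print(mean)                              # ~2.27 GB, per equation (27)

samples = [bounded_pareto(alpha, k, p) for _ in range(200_000)]
print(sum(samples) / len(samples))       # empirical mean, close to 2.27
```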
Bibliography

[1] K. Thompson, G. J. Miller, and R. Wilder, "Wide-area Internet traffic patterns and characteristics," IEEE Network, vol. 11, no. 6, Nov. 1997.

[2] T. Ndousse, "High-performance Networks Research Program Overview," http://www.sc.doe.gov/ascr/mics/hpn/index.html.

[3] Napster, http://www.napster.com.

[4] Gnutella, http://www.wego.com.

[5] H. B. Newman, M. H. Ellisman, and J. A. Orcutt, "Data-intensive e-science frontier research," Communications of the ACM, vol. 46, no. 11, pp. 68-77, Nov. 2003.

[6] T. DeFanti, C. D. Laat, J. Mambretti, K. Neggers, and B. St. Arnaud, "TransLight: a global-scale LambdaGrid for e-science," Communications of the ACM, vol. 46, no. 11, pp. 34-41, Nov. 2003.

[7] First International Workshop on Protocols for Fast Long-Distance Networks, PFLDnet 2003, http://datatag.web.cern.ch/datatag/pfldnet2003/, Feb. 3-4, 2003, Geneva, Switzerland.

[8] W. Feng and P. Tinnakornsrisupapa, "The Failure of TCP in High-Performance Computational Grids," Proc. of SC2000: High-Performance Network and Computing Conference, Dallas, TX, Nov. 2000.

[9] S. Floyd, "High Speed TCP and Quick-Start for Fast Long-Distance Networks," PFLDnet 2003, http://datatag.web.cern.ch/datatag/pfldnet2003/, Feb. 3-4, 2003, Geneva, Switzerland.

[10] C. Jin, D. Wei, S. Low, J. Bunn, D. H. Choe, J. C. Doyle, H. Newman, S. Ravot, S. Singh, G. Buhrmaster, R. M. A. Cottrell, and F. Paganini, "FAST Kernel: Background Theory and Experimental Results," PFLDnet 2003, http://datatag.web.cern.ch/datatag/pfldnet2003/, Feb. 3-4, 2003, Geneva, Switzerland.

[11] T. Kelly, "Scalable TCP: Improving Performance in High Speed Wide Area Networks," PFLDnet 2003, http://datatag.web.cern.ch/datatag/pfldnet2003/, Feb. 3-4, 2003, Geneva, Switzerland.

[12] J. Semke, J. Mahdavi, and M. Mathis, "Automatic TCP Buffer Tuning," Proc. of ACM SIGCOMM 1998, 28(4), Oct. 1998.

[13] W. Feng, M. Gardner, M. Fisk, and E. Weigle, "Automatic Flow-Control Adaptation for Enhancing Network Performance in Computational Grids," Journal of Grid Computing, 2003.

[14] M. Gardner, W. Feng, and M. Fisk, "Dynamic Right-Sizing in FTP (drsFTP): An Automatic Technique for Enhancing Grid Performance," Proc. of the IEEE Symposium on High-Performance Distributed Computing, July 2002.

[15] M. Mathis, http://www.psc.edu/~mathis/MTU.

[16] B. St. Arnaud, "Proposed CA*net 4 Network Design and Research Program," Revision no. 8, April 2, 2002.

[17] P. Ashwood-Smith et al., "Generalized MPLS - RSVP-TE Extensions," IETF Internet Draft, draft-ietf-mpls-generalized-rsvp-te-04.txt, July 2001.

[18] Optical Internetworking Forum, "User Network Interface (UNI) 1.0 Signaling Specification," Oct. 1, 2001, http://www.oiforum.com/public/documents/OIF-UNI01.0.pdf.

[19] E. G. Coffman, Jr., M. R. Garey, D. S. Johnson, and A. S. LaPaugh, "Scheduling File Transfers," SIAM Journal on Computing, vol. 14, no. 3, pp. 744-780, Aug. 1985.

[20] T. Erlebach and K. Jansen, "Off-line and On-line Call Scheduling in Stars and Trees," Proc. of the 23rd International Workshop on Graph-Theoretic Concepts in Computer Science (WG '97), LNCS 1335, pp. 195-213, Springer-Verlag, 1997.

[21] J. T. Havil, W. Mao, and R. Simha, "A Lower Bound for On-line File Transfer Routing and Scheduling," Proc. of the 1997 Conference on Information Sciences and Systems, pp. 225-230, 1997.

[22] D. Wischik and A. Greenberg, "Admission control for booking ahead shared resources," Proc. of IEEE Infocom 1998, pp. 873-882.

[23] D. Ferrari, A. Gupta, and G. Ventre, "Distributed advanced reservation of real-time connections," Tech. Rep. TR-95-008, International Computer Science Institute, Berkeley, March 1995.

[24] W. Reinhardt, "Advanced resource reservation and its impact on reservation protocols," Proc. of IWACA '94.

[25] L. C. Wolf, L. Delgrossi, R. Steinmetz, S. Schaller, and H. Wittig, "Issues of reserving resources in advance," Proc. of NOSSDAV '95, 1995.

[26] S. Martello and D. Vigo, "Exact solution of the two-dimensional bin packing problem," Management Science, 1998.

[27] E. E. Bischoff and M. D. Marriott, "A Comparative Evaluation of Heuristics for Container Loading," European Journal of Operational Research, 44:267-276, 1990.

[28] M. Gehring, K. Menscher, and M. Meyer, "A Computer-Based Heuristic for Packing Pooled Shipment Containers," European Journal of Operational Research, 44:277-288, 1990.

[29] S. Martello, D. Pisinger, and D. Vigo, "The Three-Dimensional Bin Packing Problem," Operations Research, 48:256-267, 2000.

[30] M. Pinedo, Scheduling: Theory, Algorithms and Systems, Prentice Hall, 1995.

[31] I. R. Philip and Y.-L. Liong, "The Scheduled Transfer (ST) Protocol," 3rd Intl. Workshop on Communications, Architecture and Applications for Network-Based Parallel Computing (CANC '99), Lecture Notes in Computer Science, vol. 1502, Jan. 1999.

[32] H. Wang, M. Veeraraghavan, and R. Karri, "A hardware implementation of a signaling protocol," Proc. of Opticomm 2002, July 29-Aug. 2, 2002, Boston, MA.

[33] S. K. Long, R. R. Pillai, J. Biswas, and T. C. Khong, "Call Performance Studies on the ATM Forum UNI Signaling," http://www.krdl.org.sg/Research/Publications/Papers/pillai_uni_perf.pdf.

[34] W. Doeringer, D. Dykeman, M. Kaiserswerth, B. W. Meister, H. Rudin, and R. Williamson, "A survey of light-weight transport protocols for high-speed networks," IEEE Trans. Comm., 38(11):2025-2039, Nov. 1990.

[35] S. Iren, P. D. Amer, and P. T. Conrad, "The Transport Layer: Tutorial and Survey," ACM Computing Surveys, vol. 31, no. 4, Dec. 1999.

[36] M. Blumrich, C. Dubnicki, E. Felten, and K. Li, "Protected, User-Level DMA for the SHRIMP Network Interface," Proc. of the 2nd International Symposium on High-Performance Computer Architecture, San Jose, CA, Feb. 3-7, 1996, pp. 154-165.

[37] P. Druschel, L. L. Peterson, and B. S. Davie, "Experiences with a High-Speed Network Adapter: A Software Perspective," Proc. of ACM SIGCOMM '94, Aug. 1994.

[38] S. Pakin, M. Lauria, and A. Chien, "High Performance Messaging on Workstations: Illinois Fast Messages (FM) for Myrinet," Proc. of Supercomputing '95, San Diego, CA, 1995.

[39] N. Cardwell, S. Savage, and T. Anderson, "Modeling TCP Latency," Proc. of IEEE Infocom, Mar. 26-30, 2000, Tel-Aviv, Israel, pp. 1724-1751.
[40] J. Padhye, V. Firoiu, D. Towsley, and J. Kurose, "Modeling TCP Throughput: A Simple Model and its Empirical Validation," Proc. of ACM SIGCOMM '98, Aug. 31-Sep. 4, 1998, Vancouver, Canada, pp. 303-314.

[41] M. Allman, V. Paxson, and W. Stevens, "TCP Congestion Control," IETF RFC 2581, Apr. 1999.

[42] V. Paxson and S. Floyd, "Wide-area traffic: The failure of Poisson modeling," IEEE/ACM Trans. Networking, vol. 3, pp. 226-244, June 1995.

[43] M. E. Crovella and A. Bestavros, "Self-similarity in World Wide Web Traffic: Evidence and Possible Causes," Proc. of the SPIE International Conference on Performance and Control of Network Systems, Nov. 1997.

[44] T. Bu, N. G. Duffield, F. Lo Presti, and D. Towsley, "Network Tomography on General Topologies," Proc. of ACM SIGMETRICS 2002.

[45] P. Newman, G. Minshall, T. Lyon, and L. Huston, "Flow Labeled IP: A Connectionless Approach to ATM," Proc. of IEEE Infocom 1996.

[46] M. Veeraraghavan, H. Lee, and X. Zheng, "File transfers across optical circuit-switched networks," PFLDnet 2003, Feb. 3-4, 2003, Geneva, Switzerland.

[47] M. Veeraraghavan, X. Zheng, H. Lee, M. Gardner, and W. Feng, "CHEETAH: Circuit-switched High-speed End-to-End Transport ArcHitecture," accepted for publication in Proc. of Opticomm 2003, Oct. 13-17, 2003, Dallas, TX.

[48] J. Walrand, "Managing QoS in Heterogeneous Networks, Internet Next Generation," http://robotics.eecs.berkeley.edu/~wlr/Presentatioins/Managing%20QoS.pdf, Porquerolles, 2003.

[49] M. E. Crovella, M. Harchol-Balter, and C. D. Murta, "Task Assignment in a Distributed System: Improving Performance by Unbalancing Load," BUCS-TR-1997-018, Oct. 31, 1997.

[50] D. Bertsekas and R. Gallager, Data Networks, Prentice Hall: New Jersey, 1986.

[51] R. Van der Meer, "MEMS Madness," June 19, 2001, http://www.lightreading.com/document.asp?doc_id=6149&site-trading.

[52] M. Veeraraghavan, H. Lee, E. K. P. Chong, and H. Li, "A varying-bandwidth list scheduling heuristic for file transfers," Proc. of ICC 2004, June 20-24, 2004, Paris, France.

[53] M. Veeraraghavan, X. Zheng, W. Feng, H. Lee, E. K. P. Chong, and H. Li, "Scheduling and Transport for File Transfers on High-speed Optical Circuits," PFLDnet 2004, Feb. 16-17, 2004, Argonne, Illinois, http://www.didc.lbl.gov/PFLDnet2004/.

[54] H. Lee, M. Veeraraghavan, E. K. P. Chong, and H. Li, "Lambda scheduling algorithm for file transfers on high-speed optical circuits," Proc. of the Workshop on Grids and Advanced Networks (GAN '04), part of the IEEE International Symposium on Cluster Computing and the Grid (CCGrid 2004), April 19-22, 2004, Chicago, Illinois.

[55] M. Veeraraghavan, X. Zheng, W. Feng, H. Lee, E. K. P. Chong, and H. Li, "Scheduling and Transport for File Transfers on High-speed Optical Circuits," Journal of Grid Computing, Special Issue on High Performance Networking, to appear.

[56] D. L. Mills, A. Thyagarjan, and B. C. Huffman, "Internet timekeeping around the globe," Proc. of the Precision Time and Time Interval (PTTI) Applications and Planning Meeting, Long Beach, CA, Dec. 1997.