A Novel Approach to IP Traffic Splitting Using Flowlets

by Shantanu K Sinha

Submitted to the Department of Electrical Engineering and Computer Science on September 17, 2004, in partial fulfillment of the requirements for the degree of Master of Engineering in Computer Science and Engineering at the Massachusetts Institute of Technology.

© Massachusetts Institute of Technology 2004. All rights reserved.

Author: Department of Electrical Engineering and Computer Science, September 10, 2004.
Certified by: Dina Katabi, Assistant Professor, Thesis Supervisor.
Accepted by: Arthur C. Smith, Chairman, Department Committee on Graduate Students.

Abstract

TCP's burstiness is usually regarded as harmful, or at best, inconvenient. Instead, this thesis suggests a new perspective and examines whether TCP's burstiness is useful for certain applications. It claims that burstiness can be harnessed to insulate traffic from the packet reordering caused by route changes. We introduce flowlets, a new abstraction for a burst of packets from a particular flow followed by an idle interval. We apply flowlets to the routing of traffic along multiple paths and develop a scheme that uses flowlet switching to split traffic across multiple parallel paths. Flowlet switching is well suited to load balancing traffic across multiple paths, as it achieves the accuracy of packet switching and the robustness to packet reordering of flow switching. This research evaluates the accuracy, simplicity, overhead, and robustness to reordering that flowlet switching entails. Using a combination of trace analysis and network simulation, we demonstrate the feasibility of implementing flowlet-based switching.

Thesis Supervisor: Dina Katabi
Title: Assistant Professor

Acknowledgments

I consider myself to have reached this point only because of the support and influence of many people around me. First and foremost, I give my infinite thanks to my advisor, Professor Dina Katabi. Her intelligence, problem-solving ability, and ability to boil complex technical problems down to only the most essential details provided me with invaluable insight. But in addition, I thank her for the intensity with which she works and the attention she gives to her students (not to mention the great late-night dinners). Next, I thank my colleague for the last year, Srikanth Kandula. He is among the quickest problem solvers and engineers I have ever met. Oftentimes, when working with him on a particular problem, I would find myself requiring days to get to the point he reaches within minutes. More than that, I cannot even put a price on the value of his vast knowledge of the intricacies of Unix and Linux, without which I would have most certainly been a lost cause! My life for the last 7 years in Boston could not be complete without my friends. Particularly, I thank Nosh Petigara, Vinay Pulim, Jeremy Roy, and Hemant Taneja for their support, much-needed times of respite, inspiration, and a healthy competition spurring me through my return to school.
I especially could not have been here if it were not for Laura Lurati and Tanaz Petigara, who continued to kick me until I finally returned to school. I am truly grateful to Hillary Eklund for being in the right place at the right time. Over the last three months, she has infinitely improved my quality of life, and I know that I am only at the beginning of an upward slope as I move on to the next phase. Finally, I cannot begin to thank my mother for the sacrifices and compromises she has made for me over the last 25 years. Without her compassion, warmth, support, work ethic, and ability to accomplish anything she sets out to do, I could not be here. I can only hope to achieve the same qualities she carries with her every day. I dedicate this thesis to her.

Contents

1 Introduction
  1.1 Motivation
  1.2 TCP Burstiness
  1.3 Harnessing TCP Burstiness Through Flowlet Switching
  1.4 Adaptive Multipath Routing
  1.5 Contributions
  1.6 Organization

2 Related Work
  2.1 TCP Reordering
  2.2 TCP Burstiness
  2.3 Multipath Routing
  2.4 TeXCP
  2.5 Traffic Splitting

3 FLARE
  3.1 The Splitting Problem
  3.2 Design
    3.2.1 Token-Counting Algorithm
    3.2.2 Flowlet Assignment Algorithm

4 Traffic Splitting Evaluation
  4.1 Experimental Environment
    4.1.1 Packet Traces
    4.1.2 Packet Trace Analyzer
    4.1.3 Measuring Accuracy
    4.1.4 Measuring TCP Disturbance
    4.1.5 Analyzing Splitting Schemes
  4.2 Comparison with Other Splitting Schemes
  4.3 Accuracy of Flowlet Splitting
  4.4 TCP's Disturbance
  4.5 Overhead of Flowlet Splitting

5 Interaction Between Traffic Splitting and TCP Dynamics
  5.1 FLARE-TeXCP
    5.1.1 Flow Profiles
  5.2 Verifying the Benefits of Traffic Splitting
  5.3 Impact to TCP Retransmission Timer
  5.4 End-to-End Performance
  5.5 Switching Opportunities
  5.6 Enabling Adaptive Multipath Routing

6 Flowlets
  6.1 The Origin of Flowlets
  6.2 Why Flowlet Splitting is Accurate
  6.3 Why Flowlet Tracking Requires a Small Table

7 Future Work and Conclusions
  7.1 Future Work
  7.2 Conclusion
A Handle Packet Departure in Flow Trace Analyzer

List of Figures

1-1 Switching Flow Traffic Without Introducing Reordering
4-1 Visualization of Splitting Scheme Comparison
4-2 Flow- vs. Flowlet-Switched Tracking of Time-Varying Split Vector
4-3 Flowlet-Switching Accuracy as a Function of Timeout
4-4 Reordering vs. MTBS and Flowlet Timeout
4-5 Flowlet-Switching Accuracy as a Function of Flowlet Table Size
5-1 FLARE-TeXCP ns-2 Architecture
5-2 Simulation Network Topology
5-3 Goodput Comparison of a Single Flow
5-4 Single Flow cwnd Comparison
5-5 RTX Count versus MTBS
5-6 Mean RTT Variance versus MTBS
5-7 Goodput versus MTBS
5-8 Average Per-Flow Goodput (Static Split Vector)
5-9 CDF of Flow Goodputs
5-10 Traffic Splitting Accuracy: Static and Dynamic Split Vectors
5-11 Average Per-Flow Goodput (Dynamic Split Vector)
5-12 Flowlet-Switched Traffic Rebalancing with Cross Traffic
5-13 Flow-Switched Traffic Rebalancing with Cross Traffic
5-14 Error Comparison During Traffic Shock
6-1 Sub-RTT Nature of Flowlet Inter-arrival Times
6-2 60ms-Flowlet Size Distribution

List of Tables

4.1 Datasets Used by Packet Trace Analyzer
4.2 Accuracy and Error Comparison of Various Splitting Algorithms
5.1 Simulation Flow Profiles
5.2 Comparison of Splitting Algorithm Switching Opportunities
6.1 Comparison of Arrival Rates and Concurrency of Flows and Flowlets

Chapter 1

Introduction

Splitting traffic across multiple paths or links according to some desired ratios is an important functionality for network management. Many commercial router vendors, such as Cisco and Juniper, provide basic support for it in their products [12, 23]. It is also a key enabling technology for much research in the areas of traffic engineering [8, 39] and adaptive multipath routing [40, 24]. Adaptive multipath routers balance incoming load across multiple paths to reduce congestion and increase availability. Another potential application for traffic splitting is adaptive multihoming, which allows a stub domain to adaptively split its traffic across multiple access links connected to different ISPs to optimize performance and cost [2, 17, 18].

1.1 Motivation

Traffic splitting is a challenging problem due to the tradeoff between achieving low deviation from the desired traffic ratios (i.e., high accuracy) and avoiding packet reordering, which hinders TCP performance. Because more than 80% of IP traffic flowing through the Internet consists of TCP traffic, understanding the impact of traffic splitting vis-à-vis this tradeoff is an important consideration.
If traffic splitting across multiple paths could be efficiently implemented without introducing significant reordering, the implementation and use of multipath routing would come one step closer to reality.

Traditionally, systems use one of two general approaches for splitting traffic. The first is packet-based splitting, which assigns each packet to a path with a probability proportional to the path's desired traffic share and independent of the assignment of other packets [8, 30]. This method ensures the resulting allocation accurately matches the desired split ratios, but it may allocate packets from the same flow to different paths, causing reordering and confusing TCP congestion control. Some proposals aim to make TCP less vulnerable to reordered packets [28, 7, 44], which, if widely deployed, would make packet-based splitting more robust. But prior experience suggests such wide-scale deployment is unlikely in the near future.

Instead, routers [12, 23] use variations of flow-based splitting, assigning all packets of a flow to a single path. In contrast to packet-based splitting, this approach avoids reordering but cannot accurately achieve the desired splitting ratios [35]. Distributing traffic in units of flows rather than packets reduces the resolution of the splitting scheme. Further, flows differ greatly in their rates [27, 46, 32, 36]. Assigning an entirely new TCP flow to a path makes the change to the path's rate unpredictable. Prior work tried to estimate the rate of each flow and used these estimates when mapping flows to paths, but found these rates to be unstable and to change quickly during the lifetime of a flow [32]. The inaccurate traffic splitting that results from pinning a flow to a particular path leads to an unbalanced load and potentially worse performance. It may also lead to extra cost if the domain is charged differently for sending traffic on different parallel links, as in adaptive multihoming [17].

Ideally, one would like to combine the accuracy and low overhead of packet-based splitting with the ability of flow-based splitting to avoid reordering TCP packets. This thesis demonstrates the feasibility of achieving both of these goals through the use of flowlet switching.

1.2 TCP Burstiness

To understand what flowlets are, we must first characterize the bursty nature of TCP. Although TCP is designed to fully utilize any bandwidth available to it, prevailing network characteristics produce flow transmissions that take an on-off pattern: a burst of packets followed by an idle period [22]. Prior work has shown that TCP senders tend to send an entire congestion window in one burst or a few clustered bursts and then wait idle for the rest of the RTT. This behavior is caused by multiple factors, such as ack compression, slow start, and others [22, 43, 47, 38]. Prior work either focuses on characterizing TCP's burstiness [43, 22, 47, 5, 11, 38, 14] or proposes mechanisms for smoothing it [1, 26, 41]. We suggest a new perspective and explore whether TCP's burstiness can be useful for certain applications. We claim that the idle periods between the arrivals of flowlets, if large enough, can be used to facilitate switching at a granularity finer than flow-based switching.

1.3 Harnessing TCP Burstiness Through Flowlet Switching

A flowlet is a burst of packets from a given TCP flow. Flowlets are characterized by a timeout value, δ, which is the minimum inter-flowlet spacing; i.e., packet spacing within a flowlet is smaller than δ.
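To make the definition concrete, the following minimal sketch (our illustration, not code from the thesis) segments one flow's packet arrival times into flowlets for a given timeout δ; the function name and example timestamps are invented.

```python
# Sketch: segmenting one flow's packet arrival times into flowlets.
# Illustrates the definition above; names and example timestamps
# are ours, not the thesis's.

def split_into_flowlets(arrival_times, delta):
    """Group a flow's packet timestamps (seconds) into flowlets.

    A new flowlet begins whenever the gap between consecutive
    packets is at least `delta`; within a flowlet, spacing < delta.
    """
    flowlets = []
    for t in sorted(arrival_times):
        if flowlets and t - flowlets[-1][-1] < delta:
            flowlets[-1].append(t)   # spacing < delta: same flowlet
        else:
            flowlets.append([t])     # idle gap >= delta: new flowlet
    return flowlets

# Three bursts separated by idle periods, with delta = 60 ms.
times = [0.000, 0.002, 0.004, 0.150, 0.152, 0.400]
assert len(split_into_flowlets(times, delta=0.060)) == 3
```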
Flowlet-based splitting exploits a simple observation. Consider a set of parallel paths, which diverge at a particular point and converge later, each containing some number of hops. Given two consecutive packets in a TCP flow, if the first packet leaves the convergence point before the second packet reaches the divergence point, one can route the second packet, and subsequent packets from this flow, on to any available path with no threat of reordering, as in Fig. 1-1. Thus, by picking a flowlet timeout larger than the maximum latency of the set of parallel paths, consecutive flowlets can be switched independently with no danger of packet reordering. In fact, for any set of parallel paths, we can further tighten the timeout value to the difference between the maximum and minimum path latencies. We call this maximum delay difference the Minimum Time Before Switch-ability (MTBS). For example, if the slowest of the parallel paths has a one-way latency of 60 ms and the fastest 25 ms, the MTBS is 35 ms. As long as the flowlet timeout, δ, is larger than the MTBS, flowlet switching does not cause packet reordering.

Figure 1-1: If the first packet leaves the convergence point before the second packet reaches the divergence point, one can assign the second packet to a new path without risking TCP packet reordering.

1.4 Adaptive Multipath Routing

Commercial router vendors, such as Cisco and Juniper [12, 23], include support for basic multipath routing in their products. Multipath routing typically consists of assigning a set of static split ratios to each available path and allocating traffic onto those paths by some scheme. Schemes that fall into this category include OSPF and IS-IS. A recent area of research gaining much attention focuses on adaptive multipath routing. An adaptive multipath router dynamically adjusts its split ratios to accommodate changing network conditions. Schemes like MATE and TeXCP [13, 24] fall into this category. Adaptive multipath routers present a compelling application of a flowlet-switching traffic splitter.

Adaptive multipath routing can be divided into three layers. Layer 1 determines the correct values of the splitting ratios given the current network conditions. This layer can be implemented using an online algorithm, such as MATE or TeXCP, or by an offline optimizer, such as OSPF/IS-IS. Layer 2 splits traffic according to the split ratios defined by Layer 1. Given a set of paths through the network and a set of desired split ratios, the traffic splitter attempts to achieve the desired traffic allocations along the paths. Traffic splitting can be implemented through packet switching, flow switching, or, as we propose, flowlet switching. We show, in this thesis, that packet switching has adverse effects on TCP performance and that flow switching is not accurate, in addition to sometimes requiring an infeasible amount of state. Layer 3, the final layer, handles the physical delivery of each packet to its destination. Layers 1 and 2 will usually rely on Layer 3 to provide a list of available paths. This layer may be implemented using a scheme such as MPLS.

The FLARE-TeXCP implementation, described in chapter 5, is a system for Layers 1 and 2. TeXCP is an adaptive multipath router which uses feedback from the network to adapt the split ratios over time. We apply flowlet switching to adaptive multipath routing in order to examine its impact on TCP congestion control dynamics. We show that flowlet switching enables TeXCP to balance traffic load effectively, while ensuring that TCP goodput remains high. Finally, by implementing and simulating FLARE-TeXCP, we show that traffic splitting has a benign impact on TCP dynamics.
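The three-layer decomposition described above can be expressed as interfaces. The following sketch is our own reading of the prose (all class and method names are illustrative), not an API defined by the thesis.

```python
# Our sketch of the three-layer decomposition described above.
# All names are illustrative.

from abc import ABC, abstractmethod

class SplitRatioController(ABC):
    """Layer 1: computes the desired split ratios (e.g., TeXCP, MATE,
    or an offline OSPF/IS-IS optimizer)."""
    @abstractmethod
    def split_vector(self) -> list[float]:
        """Desired traffic fraction per path; entries sum to 1."""

class TrafficSplitter(ABC):
    """Layer 2: maps packets to paths to realize the split ratios
    (packet-, flow-, or flowlet-switched)."""
    @abstractmethod
    def choose_path(self, packet, split_vector: list[float]) -> int:
        """Return the index of the path this packet should take."""

class PathDelivery(ABC):
    """Layer 3: provides the available paths and delivers packets
    (e.g., MPLS label-switched paths)."""
    @abstractmethod
    def paths(self) -> list:
        ...
    @abstractmethod
    def send(self, packet, path_index: int) -> None:
        ...
```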
1.5 Contributions

This thesis develops FLARE, a scheme that uses flowlet switching to split traffic across multiple paths or links according to some desired split ratio. It evaluates the scheme in two stages. First, it analyzes the traffic splitter in isolation, determining whether or not IP traffic can be split into flowlets. It analyzes the scheme using traces collected from a major peering point, a stub domain border router, and various backbone routers, and then compares the new scheme to the current splitting methods. Second, it embeds FLARE within the TeXCP multipath routing system and simulates the impact on TCP of splitting traffic across multiple paths. In particular, this work makes the following contributions:

- Introduces the concept of flowlets, a useful new abstraction for a burst of packets within a flow
- Investigates the origins and properties of flowlets
- Presents a low-overhead traffic splitting algorithm based on flowlet switching that, in implementation, promises to:
  - allocate traffic with only small deviations from desired levels
  - significantly reduce the occurrence of packet reordering
- Evaluates the impact of TCP feedback on traffic splitting
- Examines the impact of traffic splitting on TCP congestion control

1.6 Organization

The remainder of this thesis is organized as follows. Chapter 2 covers background and related research relevant to the development of flowlet-switched traffic splitting. Chapter 3 describes the design and implementation of the proposed traffic splitter. In chapter 4, we evaluate the performance of the traffic splitting algorithm in isolation, analyzing its performance using trace-based simulations. In chapter 5, we analyze the performance of FLARE within the context of TeXCP using ns-2 simulations. Next, we investigate the origins and properties of flowlets in chapter 6. Finally, we look at future research remaining to be done and conclude this work in chapter 7.

Chapter 2

Related Work

2.1 TCP Reordering

Packet reordering can negatively impact the performance of TCP. When packets within a TCP flow arrive out of order, a sender may spuriously perform a fast retransmit, subsequently reducing its congestion window. Much prior work has focused on improving TCP's robustness to packet reordering. Typically these schemes consist of updating TCP end hosts to become more robust to packet reordering [28, 7, 44]. They mitigate the effects of occasional packet reordering through a variety of different schemes, such as congestion window rollback or conservative window reduction. If a future TCP were to be completely immune to packet reordering, a packet-based splitting technique would outperform all other approaches. However, an end-host solution would require wide-spread deployment, and there is no indication such a TCP will be deployed in the near future. Until then, the network is the most feasible point at which to prevent packet reordering.

2.2 TCP Burstiness

Due to prevailing network characteristics, TCP flows are typically bursty [22]. A bursty TCP flow is characterized by a sender transmission taking an on-off pattern: a burst of packets followed by an idle period [22]. TCP's burstiness emerges from a combination of factors, including ack compression, application transmission irregularity, and others. Much prior work advocates mechanisms to smooth TCP burstiness [1, 26].
While a paced TCP is useful for many applications [33, 41], the prospect of deploying paced TCP end hosts is limited by the same reasons preventing the deployment of TCP hosts robust to reordering. At present, the bursty nature of TCP provides an opportunity to use it to our advantage.

2.3 Multipath Routing

Multipath routing algorithms have recently garnered researchers' attention. The majority of proposed approaches to multipath routing require a method for splitting traffic across various parallel paths. Multipath routing sends traffic on multiple paths to balance the load and minimize the possibility of congestion [29, 19, 24, 13, 40, 42, 15]. Some of the work in this area focuses on adaptive approaches where the desired splitting vector varies with time and reacts to the observed network conditions [13, 24]. This capability further constrains the splitting mechanism to be able to track a changing split vector, in addition to the basic requirements of achieving accuracy and maintaining packet order.

2.4 TeXCP

TeXCP is an online, distributed multipath routing system that routes traffic in a network to minimize the maximum link utilization. Like the offline MPLS route optimizer, TeXCP assumes that multiple label switched paths (LSPs) have been established between each ingress-egress pair, either using a standard protocol like CR-LDP [21] or RSVP-TE [6]. TeXCP adaptively splits the traffic between these LSPs to minimize the maximum utilization. Because it reacts in real time, TeXCP can deal with traffic dynamics and unpredictable events, such as link failures and traffic spikes, which occur frequently in today's Internet [20, 10]. TeXCP employs a control loop with feedback to periodically adjust the desired traffic allocations along the available paths. It uses periodic probe packets to determine present network conditions (within some time window of accuracy). The protocol uses ideas from XCP to ensure that it remains stable in the presence of network delays and other TeXCP agents [24]. TeXCP provides a convenient simulation environment in which to evaluate the efficacy of flowlet-switched traffic splitting. In chapter 5, we describe a flowlet-switched TeXCP implementation within ns-2 and evaluate its performance characteristics.

2.5 Traffic Splitting

Early work on traffic splitting considers forwarding packets onto multiple paths using some form of weighted round-robin or deficit round-robin [37] scheduling. Others avoid packet reordering by consistently mapping packets to paths based on their endpoint information. Commercial routers [12, 23] implement the Equal-Cost Multipath (ECMP) feature of routing protocols such as OSPF and IS-IS. Hash-based versions of ECMP divide their hash space into equal-size partitions corresponding to the outbound paths, hash packets based on their endpoint information, and forward them onto the path whose boundaries envelop the packet's hash value [9, 40]. A few papers analyze the performance of various splitting schemes. Cao et al. evaluate the performance of a number of hashing functions on hash-based traffic splitting [9]. Rost and Balakrishnan [35] evaluate different traffic splitting policies, including rate-adaptive splitting methods. They identify high flow rate skew and the number of concurrent flows comprising the traffic aggregate as major factors affecting splitting performance.
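To illustrate the hash-space partitioning described above, here is a small sketch of our own (not taken from any router implementation). It uses a weighted partition, as in the static-hash scheme evaluated in Chapter 4, and substitutes Python's zlib.crc32 truncated to 16 bits for a hardware CRC-16.

```python
# Our sketch of hash-based splitting over a weighted hash space.
# Illustrative only; real routers hash the endpoint information in
# hardware (e.g., CRC-16), and we stand in with a truncated CRC-32.

import zlib

def hash_split(src_ip, dst_ip, src_port, dst_port, shares):
    """Map a packet's endpoint information to a path index.

    `shares` holds the desired traffic fraction per path; the hash
    space [0, 2^16) is partitioned proportionally to the shares.
    """
    key = f"{src_ip},{dst_ip},{src_port},{dst_port}".encode()
    h = zlib.crc32(key) & 0xFFFF           # 16-bit hash of endpoints
    boundary = 0.0
    for path, share in enumerate(shares):
        boundary += share * 0x10000
        if h < boundary:                    # falls in this partition
            return path
    return len(shares) - 1                  # guard against rounding

# Every packet of a flow hashes to the same path, so no reordering,
# but the realized split only approximates the desired shares.
print(hash_split("1.2.3.4", "5.6.7.8", 1234, 80, [0.3, 0.3, 0.4]))
```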
Chapter 3

FLARE

In this chapter, we present the design and implementation of FLARE, the FLowlet Aware Routing Engine. FLARE accurately splits traffic across multiple paths, while minimizing TCP packet reordering. FLARE resides on a router that feeds multiple parallel paths and takes as input a split vector which can vary over time. FLARE could be implemented in any router responsible for routing traffic along multiple links or paths.

3.1 The Splitting Problem

The traffic splitting problem is formalized as follows [35]. The aggregate IP traffic arriving at a router, at rate R, is composed of a number of distinct transport-layer flows of varying rates. Given N disjoint paths that can be used concurrently, and a split vector F = (F_1, F_2, ..., F_N), where F_i ∈ [0, 1] and Σ_i F_i = 1, split the aggregate traffic into N portions such that the traffic rate flowing on path i is equal to F_i × R.

However, this description of the traffic splitting problem does not consider the effect of splitting traffic on transport-layer flows. In reality, the majority of packets on the Internet belong to TCP flows [46]. We formalize the amount of reordering introduced by a traffic splitter as the probability that some flow will experience a congestion event. Thus, in addition to achieving the desired traffic rates along each path, we seek to minimize this probability.

The splitting problem is a key component of the general problem of load balancing. In addition to a traffic splitter, balancing the load across multiple paths requires a mechanism to find the splitting vector F. Depending on the environment, the network administrator may set F to a static value or use an adaptive routing protocol to dynamically adapt F to the state of the network.

3.2 Design

Upon receiving a packet, FLARE determines the best path along which to route the packet to achieve the desired split vector, and forwards the packet to the appropriate link.¹ FLARE relies on the flowlet abstraction to accurately split TCP traffic along multiple paths without causing reordering. The network administrator configures FLARE with a flowlet timeout value δ. The administrator uses knowledge of the network to pick a δ larger than the MTBS, the maximum delay difference between the set of parallel routes under consideration. This choice of δ enables FLARE to assign two flowlets of the same flow to different parallel paths without causing TCP packet reordering.

Packets for which transport-layer performance is unaffected by reordering may be allocated to any path. For simplicity, we refer to these packets as non-TCP packets. Since routing flowlets will typically be slightly less accurate than a packet-switched splitter, FLARE uses non-TCP packets to balance the residual error that arises from routing flowlets. FLARE is configured with a flowlet timeout δ and has two components: a token-counting algorithm and a flowlet assignment algorithm.

¹FLARE actually hands the packet to the next stage toward transmission on the appropriate link (e.g., an output queue).

3.2.1 Token-Counting Algorithm

FLARE assigns a token counter, t_i, to each path i of the set of parallel paths. For every packet of size b bytes, all token counters are updated as follows:

t_i ← t_i + F_i × b,  for all i,

where F_i is the fraction of the load to be sent on path i. If the packet is a non-TCP packet, it is assigned to the path with the maximum number of tokens. Otherwise, it is assigned according to the flowlet-to-path assignment algorithm. In either case, once the packet has been assigned to a particular path j, the corresponding token counter is decremented by the size of the packet: t_j ← t_j − b.
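The following sketch restates the token-counting update in code, together with the flowlet-table lookup it defers to for TCP packets (described next, in Section 3.2.2). It is our illustration of the two components, not the FLARE implementation; Python's built-in hash stands in for the CRC-16 the design recommends.

```python
# Our sketch of FLARE's two components: the token-counting update
# (Section 3.2.1) and the flowlet-to-path table (Section 3.2.2).
# Not the thesis's implementation; hash() stands in for CRC-16.

class FlowletTable:
    """Maps active flowlets to paths: each slot holds
    (last_seen_time, path_id), keyed by a hash of the source and
    destination IPs and ports."""
    def __init__(self, delta, n_slots=2**10):
        self.delta = delta
        self.slots = [None] * n_slots

    def assign(self, flow_key, tokens, now):
        idx = hash(flow_key) % len(self.slots)
        entry = self.slots[idx]
        if entry is not None and now - entry[0] < self.delta:
            path = entry[1]      # flowlet still in flight: keep its path
        else:
            # New flowlet: free to move; take the neediest path.
            path = max(range(len(tokens)), key=tokens.__getitem__)
        self.slots[idx] = (now, path)
        return path

class FlareSplitter:
    def __init__(self, split_vector, delta):
        self.split = split_vector                 # F_i per path
        self.tokens = [0.0] * len(split_vector)   # t_i per path
        self.table = FlowletTable(delta)

    def route(self, flow_key, size_bytes, is_tcp, now):
        # Credit every path in proportion to its desired share.
        for i, share in enumerate(self.split):
            self.tokens[i] += share * size_bytes
        if is_tcp:
            path = self.table.assign(flow_key, self.tokens, now)
        else:
            # Non-TCP packets absorb the residual error: send them
            # to the path with the most tokens.
            path = max(range(len(self.tokens)),
                       key=self.tokens.__getitem__)
        self.tokens[path] -= size_bytes           # debit chosen path
        return path
```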
The key feature of this token-counting algorithm is that packets belonging to TCP flows affect the per-path token counters in the same way that packets belonging to non-TCP flows do. At the same time, packets belonging to TCP flows get mapped using the flowlet assignment algorithm, rather than the token counters. The implication of this feature is that non-TCP packets are routed so as to compensate for the errors that emerge from switching flowlets rather than packets.

3.2.2 Flowlet Assignment Algorithm

Tracking flowlets is not entirely different from tracking flows. Note that any particular flow may contain several flowlets (i.e., it may contain several distinct packet bursts), each separated by time periods of length at least δ. Each of these flowlets will have the same flow id, if we compute the flow id as the hash of the source IP, destination IP, source port, and destination port of a packet within the flowlet. However, for any particular flow, at any point in time, only one flowlet can have packets in transmission. At most, we must maintain one entry per flow to be able to track a flowlet. In chapter 4, we show that at any point in time the number of flowlets in a network is very small, and thus we need only track a small number of flowlets, rather than one per flow.

FLARE uses a hash table to map flowlets to paths. Each table entry contains two fields, last-seen-time and path-id. When a packet arrives, FLARE computes a hash of the source IP, destination IP, source port, and destination port. The authors of [9] recommend a CRC-16 hash, as it is fast and efficiently implemented in hardware. FLARE uses this hash as a key into the flowlet table. If the current time is smaller than last-seen-time + δ, then this packet belongs to a flowlet in transmission. The packet is sent on the path identified by path-id, and last-seen-time is set to the current time. Otherwise, the entry in the flowlet table represents a previous flowlet, and this packet marks the arrival of a new flowlet. As a result, it may be assigned to any of the available paths. Once assigned, FLARE sets path-id to the new path id and sets last-seen-time to the current time. Though any scheme could be used, the scheme we found to work best is to assign new flowlets to the path with the maximum number of tokens.

Chapter 4

Traffic Splitting Evaluation

The first step in our analysis of FLARE is to determine whether the FLARE traffic splitting scheme is feasible. To do this, we examine the traffic splitting component in isolation by analyzing the characteristics of packet traces. In this chapter, we investigate multiple issues. First, we attempt to determine whether or not flowlet-switched traffic splitting can enable packets to be effectively switched. Then we determine whether or not they can be switched to achieve a time-varying split vector. Next, we analyze the amount of overhead necessary to implement flowlet switching. Finally, we attempt to infer the amount of disturbance that flowlet switching will incur. We compare all of these properties to other traffic splitting schemes.

4.1 Experimental Environment

4.1.1 Packet Traces

We use traffic traces from four sources. First, the Peering trace is collected at multiple 622 Mbps peering links from the same router connecting a Tier-1 ISP to two large ISPs. Second, the LCSout trace is collected at the border router connecting MIT's Computer Science and Artificial Intelligence Lab to the Internet over a 100 Mbps link.
Finally, NLANR-1 and NLANR-2 are sets of backbone traces collected by NLANR on OC12 and OC3 links [31], respectively. Table 4.1 summarizes relevant information about these traces (flow rates are computed according to [46]). In all traces, TCP constitutes over 85% of the traffic, with the LCSout trace being 97% TCP.

| Trace   | Date       | Time | Duration   | # Packets | # Flows | Avg. Flow Rate (Kbps) | Max. Flow Rate (Mbps) | % Bytes Non-TCP |
|---------|------------|------|------------|-----------|---------|-----------------------|-----------------------|-----------------|
| Peering | 03/05/2003 | 7 PM | 12 minutes | 9.15 M    | 454K    | 1.83                  | 6.01                  | 7.97%           |
| LCSout  | 05/09/2003 | 6 PM | 1 hour     | 25.4 M    | 426K    | 13.73                 | 75.29                 | 2.80%           |
| NLANR-1 | 03/07/2004 | 3 AM | 90 seconds | 7.3 M     | 340.5K  | 7.74                  | 50.89                 | 13.1%           |
| NLANR-2 | 04/15/2003 | 8 PM | 90 seconds | 1.69 M    | 10K     | 31                    | 98.1                  | 12.1%           |

Table 4.1: Datasets used in the packet trace analysis.

4.1.2 Packet Trace Analyzer

The Packet Trace Analyzer is a tool that enabled us to study the feasibility of traffic splitting. We imagine the router at which the trace is collected to be feeding multiple parallel paths and splitting the traffic among them according to a desired split vector. The tool takes a trace file, a network topology, and a traffic splitting scheme as inputs. For each available path specified by the network topology, the tool maintains state on the amount of traffic that has been delivered to it. The analyzer processes each packet, determining the path to which the packet would be allocated and then updating the state for the selected path. For static split ratios, the network topology includes the desired traffic allocations for each of the paths. To evaluate dynamic split ratios, the tool takes an optional input specifying a split function. This split function produces a time-varying split vector, which the splitting scheme accesses when making path selection decisions. The tool produces data which we use to evaluate the degree to which FLARE tracks the desired splits and avoids TCP reordering. The experiments depend on these parameters:

- F, the split vector, specifying the fractions at which incoming traffic needs to be split. In our experiments, we use both a static vector F_s = (0.3, 0.3, 0.4) and a dynamic vector F_d(t) = 0.13(1, 1, 1) + 0.6(sin⁴x, sin²x cos²x, cos²x), where x(t) = πt/p. We use two dynamic vectors: F_d1, which reflects changes over long time scales (p = 40 min), and F_d2, which reflects changes over short time scales (p = 4 min).

- δ, the flowlet timeout interval. Unless specified otherwise, δ = 60 ms. This choice of value means that we are analyzing a situation in which the administrator thinks that the delay difference between the various parallel paths is less than 60 ms. Given current values for one-way delay in the Internet (e.g., coast-to-coast is typically < 40 ms), a delay difference of 60 ms or less should be applicable to many possible cases of parallel paths.

- MTBS, the actual maximum delay difference between the parallel paths. Unless specified otherwise, MTBS = 80 ms. By making MTBS different from δ, we mimic errors in the administrator's estimate of MTBS.

- T_avg, the time window over which the paths' rates are computed to measure whether they match the desired split. This is a measurement parameter irrelevant to the operation of FLARE. We fix T_avg = 0.3 s.¹

- S_hash, the size of the hash table used by FLARE. Unless otherwise specified, we set S_hash = 2¹⁰ entries.

¹The exact value of this parameter is not important as long as it is small enough to show the instantaneous variability of the load. We chose T_avg = 0.3 s because this is the update interval of TeXCP [24], an adaptive multipath routing protocol.
Also, routers can typically buffer about 250 ms worth of data [5].

4.1.3 Measuring Accuracy

An optimal traffic splitting policy ensures that path i receives a fraction F_i of the traffic on any timescale; the actual fraction of traffic sent on path i is F_i′. We measure the splitting error as:

Error = (1/N) Σ_i |F_i − F_i′| / F_i,    (4.1)

where N is the number of parallel paths among which the traffic is divided. The graphs report the average error over non-overlapping windows of size T_avg. Accuracy is 1 − Error.

4.1.4 Measuring TCP Disturbance

We estimate the disturbance of a certain splitting algorithm as the probability that a packet triggers 3 dup-acks due to the reordering caused by the splitting scheme. To determine instances of reordering that may lead to 3 dup-acks, the Packet Trace Analyzer uses the following scheme. First, the packet's flow id is calculated. Then, the packet is given an index number. For a given flow, the set of index numbers indicates the arrival order of the packets. After the splitting scheme selects a path for this packet, a departure time is calculated using the propagation delay of the selected path, as specified by the network topology. Once calculated, the handle-packet-departure function is called.

The handle-packet-departure function maintains a per-flow state table. Each entry in this table contains three fields: last_index_in_order, out_of_order_queue, and dup_ack_count. When the function is called on the departing packet, if the index number of the packet is equal to last_index_in_order + 1, then this packet has departed in order, and last_index_in_order is incremented to the index number of this packet. If not, this packet is inserted into the out_of_order_queue and dup_ack_count is incremented. As handle-packet-departure is called on subsequently departing packets, as long as the next index number has not arrived, packets are inserted into out_of_order_queue. Once a packet finally departs that increments last_index_in_order, the out_of_order_queue is emptied until no packets in the queue will increment last_index_in_order. The pseudocode for the handle-packet-departure function is included in appendix A.
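Appendix A (referenced above) is not reproduced in this excerpt; the following is our own sketch of the logic just described, counting a packet as a potential dup-ack trigger whenever it departs out of order.

```python
# Our sketch of the per-flow handle-packet-departure logic described
# above (the thesis's pseudocode lives in Appendix A, not reproduced
# here). A min-heap plays the role of the out-of-order queue.

import heapq

class FlowReorderState:
    def __init__(self):
        self.last_index_in_order = 0   # highest index seen in order
        self.out_of_order_queue = []   # min-heap of early departures
        self.dup_ack_count = 0         # out-of-order departures seen

    def handle_packet_departure(self, index):
        if index == self.last_index_in_order + 1:
            self.last_index_in_order = index
            # Drain queued packets that are now in order.
            while (self.out_of_order_queue and
                   self.out_of_order_queue[0] == self.last_index_in_order + 1):
                self.last_index_in_order = heapq.heappop(self.out_of_order_queue)
        else:
            heapq.heappush(self.out_of_order_queue, index)
            self.dup_ack_count += 1

# Packets 1 and 2 swap in flight: one out-of-order departure.
state = FlowReorderState()
for i in (2, 1, 3):
    state.handle_packet_departure(i)
assert state.last_index_in_order == 3 and state.dup_ack_count == 1
```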
Finally, the SHASH scheme, though 10 times less accurate than FLARE, has a reasonable splitting accuracy for a static split vector, but does not react to dynamic split vectors. 4.3 Accuracy of Flowlet Splitting Flowlet-based splitting is accurate for realistic values of 6. Fig 4-3 shows the error as a function of flowlet timeout for our four traces. The figure shows results for 31 tFLARE 6 0. 0 FFLOW 0 PACKET S-HASH A 0 - 24 0 L.b - ~ o 0- 0 0 20 40 60 80 100 % Error Figure 4-1: Visualization of results in Table 4.2. Points near the origin represent performance with low error and low reordering. FLARE's performance falls within this region. The performance of S-HASH is close, but S-HASH cannot be used when the split vector is dynamic. Trace FLARE Peering LCSout NLANR-1 NLANR-2 0.47% 7.50% 0.55% 0.70% (0.01%) (0.07%) (0.02%) (0.05%) Trace Peering LCSout NLANR-1 NLANR-2 Trace Peering LCSout NLANR-1 NLANR-2 FLARE 0.75% (0.00%) 13.48% (0.07%) 0.82% (0.01%) 0.78% (0.04%) FLARE 0.74% (0.00%) 13.26% (0.07%) 0.90% (0.02%) 1.56% (0.05%) Static FLOW PACKET 4.54% (0%) 0.12% (0.71%) 0.07% (6.17%) 37.98% (0%) 0.02% (4.72%) 30.60% (0%) 0.03% (6.29%) 35.36% (0%) Mildly Dynamic FLOW PACKET 0.11% (0.69%) 7.82% (0%) 0.09% (5.34%) 65.10% (0%) 0.03% (5.43%) 16.86% (0%) 0.05% (6.65%) 73.55% (0%) Dynamic FLOW PACKET 29.33% (0%) 0.12% (0.52%) 63.54% (0%) 0.09% (5.36%) 61.05% (0%) 0.02% (4.30%) 87.47% (0%) 0.03% (5.83%) S-HASH 6.47% (0%) 31.69% (0%) 6.79% (0%) 3.08% (0%) S-HASH - S-HASH - Table 4.2: FLARE's accuracy is an order of magnitude higher than flowbased and static-hash splitting schemes and its robustness to reordering is an order of magnitude higher than packet-based splitting. The values outside the parenthesis are errors, while the numbers inside the parenthesis are the probability of mistakenly triggering 3 dup acks. Note that this experiment utilized the values, 6=60 ms, MTBS=80 ms. FLARE's reordering arise from this slight mis-configuration of 6. When 6=MTBS, FLARE shows no reordering. 32 Flo-based t- Splitting Pa ih'e 0.6 Time (sacs) 0 a_ 0.4 1 0.2 Path 0 Desired Time (seces) Figure 4-2: In contrast to flow-based splitting, FLARE is suitable for adap- tive multipath routing protocol as it can accurately track a varying split vector. Graphs are for the peering trace, 6Oms flowlet timeout, and 2 paths with a sinusoidal splitting function. For simplicity, we show the load on one path. the split vectors: F,, Fd1, and Fd2. The figure shows that on all traces other than the LCSout trace, fiowlet-based splitting achieves an accuracy comparable to packetbased splitting, as long as 6 <100 ins. Given typical values for one-way delay in the Internet (e.g., coast-to-coast delay is less than 4Oms), a delay difference in the range [50, 100]ms should apply to many possible sets of parallel paths. The errors on the LCSout trace are higher. For this domain, the administrator might want to pick 6=6Oms, which results in an error of 7%-14% depending on how quickly the split vector changes. Despite the relatively high error, other schemes that do not reorder packets have substantially higher errors on the LCSout trace (see Table 4.2). We attribute the higher errors in the LCSout trace to two factors. This trace contains a large amount of local intra-MIT traffic with a small RTT. Also, it has a low fraction of non-TCP traffic (less than 3% of the bytes), which FLARE uses to correct residual errors. 
We note that FLARE does not require the presence of non-TCP traffic to perform well, because FLARE performs well on the other three traces even when we remove all non-TCP traffic from them. But the LCSout trace has a higher residual error because of the large fraction of local traffic (RTT < δ) and the lack of non-TCP traffic; together, these prevent FLARE from compensating for the residual errors.

Figure 4-3: A flowlet timeout in the range [50, 100] ms produces good accuracy. Errors as a function of the flowlet timeout interval δ for the static split F_s, the mildly dynamic split F_d1, and the dynamic split F_d2.

Figure 4-2 compares how flowlet- and flow-based splitting track a changing split vector in real time, a feature required by adaptive multipath routing [13, 24]. The figure represents δ = 60 ms and two paths with a split that varies along a sinusoidal wave with a period of 2 minutes (F = 0.2(1, 1) + 0.6(sin²x, cos²x)). In this experiment, FLARE tracks the desired split much more closely than the flow-based splitter.

Figure 4-2: In contrast to flow-based splitting, FLARE is suitable for an adaptive multipath routing protocol, as it can accurately track a varying split vector. Graphs are for the Peering trace, a 60 ms flowlet timeout, and two paths with a sinusoidal splitting function. For simplicity, we show the load on one path.

4.4 TCP's Disturbance

We also evaluated FLARE's sensitivity to flowlet timeout values smaller than the actual MTBS. Such a choice of δ will result in TCP packet reordering. Two parameters control the occurrence of reordering: the flowlet timeout, δ, and the difference between the MTBS and the timeout, MTBS − δ. In particular, a reordering event happens only if two conditions are satisfied. First, FLARE switches the flow from one path to another. The frequency of such switching is determined by δ; i.e., a larger flowlet timeout entails fewer opportunities to switch a flow from one path to another. Second, given that a flow is switched from one path to another, reordering will occur only if the delay difference between the two paths is larger than δ, i.e., MTBS − δ > 0.
In other words, a mis-configured FLARE using a flowlet timeout smaller than the actual MTBS, may continue to perform reasonably well. 35 - -- STATC 100 80 - STATIC Peering STATIC NLANR-1 STATIC NLANR-2 o DYNAMIC Peering DYNAMIC NLANR DYNAMIC NLANR-2 6 Ca 40- aD 0 - 20 - 2 4 6 10 8 Hash Length (# of bits) 12 14 16 Figure 4-5: Error as a function of the flowlet table size for both static F, and dynamic Fd2 split vectors. FLARE achieves low errors without storing much state. A table of 210 five-byte entries is enough for the studied traces. 4.5 Overhead of Flowlet Splitting One of the most surprising results of this paper is the little overhead incurred by flowlet-based splitting. It requires edge routers to perform a single hash operation per packet and maintain a flowlet hash table, a few KB in size, which easily fits into the router's cache. We have estimated the required hash table size by plotting the splitting error, averaged over time windows Tg = 0.3s, as a function of the hash length. For example, a hash length of 10 bits results in table of 210 entries. Fig. 4-5 shows the error in our traces for both the static split vector F, and the dynamic sinusoidal vector Fd2. It reveals that the errors converge for a table size as small as 210 entries. Section 6.3 provides an explanation for the reasons behind these results. 36 Chapter 5 Interaction Between Traffic Splitting and TCP Dynamics Although FLARE may provide compelling benefits to network operators, we wish to investigate its interaction with TCP dynamics when incorporated into a full network environment. To accomplish this goal, we implment FLARE in ns-2 by integrating it into the TeXCP adaptive multipath router, calling the new implementation FLARETeXCP. In this chapter, we evaluate results of FLARE-TeXCP simulations. While the previous chapter demonstrates that traffic splitting could be feasibly implemented and can potentially deliver better performance with low overhead, this chapter investigates two issues. First, because TCP traffic is typically transmitted along a single path, we investigate the impact of varying path latencies (i.e. MTBS) on TCP's retransmission timer in order to determine if splitting traffic introduces adverse effects on TCP's performance. Second, it examines whether TCP's feedback and congestion control mechanism impacts the accuracy and reordering-robustness of flowlet-switching. The results we obtain from our experimentation in this chapter are still preliminary and further examination is needed to fully understand FLARE's performance in such environments. 37 TCP, TCP 2 j TPik / Path1 * Path 2 TCP~nk21 TeXCP TC, - Controller T 0 0 0 * Snk 0 0 0 TCP,, ~PathkTC TCPink 0 0 0 in, Figure 5-1: FLARE-TeXCP Architecture 5.1 FLARE-TeXCP Figure 5-1 illustrates the architecture of the FLARE-enabled TeXCP, called FLARETeXCP. A TCP agent is created for each flow in the system. The TeXCP controller from [24] was extended to support traffic originating from a separate transport layer. The controller previously was a transport-layer object, packetizing bytes delivered to it from the application layer, whereas the extended controller resides below the transport layer, receiving packetized data. Each TeXCP controller maintains a set of Path objects; each path in the simulation model processes probe information, feeds it back to the TeXCP controller, which then determines the appropriate traffic split vector to minimize the maximum link utilization. 
Finally, the controller and the TCP agents are connected to respective sink objects. The TeXCP Sink demultiplexes flow packets arriving from multiple paths in the network and delivers them to the appropriate TCP Sink. The TCP Sink handles the data and generates ACKs. Traffic is generated by FTP senders. The data is packetized by the TCP agent. These packets are sent to the controller. The controller applies the token-counting algorithm described in chapter 3. Next, the packet is mapped to a path using a traffic splitting scheme. Once a path is selected, the packet is queued into the path's ingress queue. The path object removes the packet from the ingress queue and delivers it to the network. The size of this queue is configured to be slightly larger than the bandwidth-delay product of the path. The TeXCP sink receives data packets from the network and routes them to the appropriate TCP sink. When a TCP sink generates an ACK, it delivers it back through the TeXCP Sink, which returns the ACK along the path from which the last data packet from this flow arrived.

Figure 5-2 represents the network topology used in the simulations of FLARE-TeXCP. To simplify our analysis, we only model flows crossing a single Autonomous System (AS). We assume that the network characteristics between the TCP end-hosts and our network remain constant through our simulations (i.e., packets generated by the FTP senders are directly delivered to the ingress router of our network). In reality, this assumption is of course not true, since a TCP connection will traverse several hops and ASes having varying network characteristics. However, this model does facilitate an analysis of how any particular AS in the path of a connection can impact end-to-end performance.

Figure 5-2: Simulation network topology (three parallel paths, Path 1 through Path 3, between the ingress and egress routers).

5.1.1 Flow Profiles

We wish to simulate real-world flow characteristics when generating traffic for our simulator. Because packet traces do not contain the dynamics of each of the flows, we use the traces to create flow profiles. These flow profiles specify the arrival times and sizes of each of the flows contained in a trace, using the method described in [46]. Each flow was identified by classifying packets by their source and destination IP addresses and ports. The flow profiles are used to generate the FTP senders used in our simulations.

| Flow Profile | Number of Flows | Total Data Transferred | Time of Arrival of Last Flow |
|--------------|-----------------|------------------------|------------------------------|
| Peering      | 12786           | 622 Mb                 | 200 sec                      |
| LCSout       | 5767            | 86.69 Gb               | 300 sec                      |
| NLANR-1      | 6024            | 2.53 Gb                | 90 sec                       |
| NLANR-2      | 1893            | 9.10 Gb                | 90 sec                       |

Table 5.1: Flow profiles used in simulations.

Because the flow profiles only indicate the arrival time and size of each flow, the dynamics of the flow transmissions are determined by their behavior within the network and will not correspond to the characteristics observed in the packet traces. Table 5.1 describes the flow profiles used through the remainder of this chapter. Flow profiles generated from the full list of flows from these packet traces required excessive simulation times. As a result, the flow profiles were down-sampled. The link bandwidths in the network topologies were scaled appropriately to the sample rate used to generate the flow profiles.
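Extracting a flow profile from a trace amounts to reducing each flow (classified by its endpoint addresses and ports) to an arrival time and a byte count. A minimal sketch of that reduction (ours, with illustrative field names) is:

```python
# Our sketch of flow-profile extraction: reduce a packet trace to
# (arrival_time, total_bytes) per flow, which is all the simulator
# needs to create its FTP senders. Field names are illustrative.

from collections import defaultdict

def build_flow_profile(packets):
    """`packets`: iterable of (timestamp, src, dst, sport, dport, size)."""
    first_seen = {}
    total_bytes = defaultdict(int)
    for ts, src, dst, sport, dport, size in packets:
        key = (src, dst, sport, dport)
        first_seen.setdefault(key, ts)   # flow arrival = first packet
        total_bytes[key] += size         # flow size = sum of packet sizes
    # One FTP sender per flow: start at first_seen, send total_bytes.
    return sorted((first_seen[k], total_bytes[k]) for k in first_seen)
```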
5.2 Verifying the Benefits of Traffic Splitting

First, we attempt to verify that TCP continues to perform well in the presence of traffic splitting across multiple paths. We analyze this problem along two dimensions. First, how does TCP congestion control react to the transmission of traffic over multiple paths with varying bandwidth and delay properties? Second, how does the TCP RTT estimator behave, and subsequently impact TCP retransmissions, in the presence of traffic splitting? Section 5.3 investigates the second question.

To investigate the behavior of TCP congestion control, we simulated a single flow split over three paths using FLARE-TeXCP and the packet-switched splitting scheme (referred to as PACKET), on the topology described in Figure 5-2. Each link of Paths 1 and 2 has a capacity of 10 Mbps and a delay of 15 ms, for a total path propagation delay of 45 ms. Each link of Path 3 has a capacity of 15 Mbps with a delay of 35 ms. A single FTP flow transfers 100 Megabytes across these paths. We compare these results to the case where the single flow transmits data across a single path of capacity 15 Mbps.

Figure 5-3: Goodput from splitting a single flow across multiple paths, compared to single-path transmission. Packet-switched splitting introduces a significant reduction in data transfer rate, clearly due to the amount of reordering-induced retransmissions. On the other hand, FLARE-TeXCP performs very closely to the level of a single flow on a single path.

Figure 5-3 shows that the packet-switched splitting scheme severely reduces TCP's performance. TCP's goodput using this splitting scheme is significantly lower than TCP's goodput over a single path or over FLARE-TeXCP. As Figure 5-4(c) shows, the flow's congestion window under the packet-switched splitter is never able to reach the congestion window that either of the other two schemes obtains, requiring a significantly longer period of time to transfer the equivalent amount of data. On the other hand, the single-flow results show that TCP on FLARE-TeXCP performs comparably to TCP over a single path.

Figure 5-4: Single-flow cwnd comparison: (a) single path, (b) FLARE-TeXCP, (c) packet-switched. Typical TCP transmissions occur over a single path, with windows similar to the one shown in panel (a). The window performs similarly when traffic is split using FLARE-TeXCP, shown in panel (b). However, panel (c) shows that packet-switched traffic splitting does not perform very well.

A natural question to consider is why FLARE-TeXCP should be used at all, if it only provides performance equivalent to TCP over a single path. First, FLARE-TeXCP enables a greater degree of reliability. When a link fails or experiences a shock of traffic, the adaptive multipath router will automatically balance the orphaned traffic across all available paths [24], rather than just failing over to the next path in its routing table. Next, as we demonstrate later in this chapter, FLARE-TeXCP can produce better per-flow performance as the number of flows increases, sometimes outperforming both flow- and packet-switching. Using FLARE, a network operator can ensure that end-host performance remains good through path failures and network traffic impulses.

We note that FLARE-TeXCP does not attempt to aggregate bandwidth from multiple paths, like a protocol such as mTCP [45]. Rather, due to the way it uses δ, at any point along the links in the network, packets from a flow will only exist on a single path. In other words, with all else being equal, a single flow split across multiple paths with FLARE-TeXCP cannot perform better than a single flow on a single path. However, as the number of flows increases, the interaction among the flows provides an opportunity for FLARE-TeXCP to outperform single-path routing.
5.3 Impact to TCP Retransmission Timer

In this section, we show that TCP retransmission timeouts are predictably affected by the splitting of traffic along multiple paths. We investigated the retransmission timer by simulating each of the flow profiles, varying the MTBS of the topology. In all simulations, δ was configured to MTBS + 10 ms, in order to ensure that no reordering occurs. For each value of MTBS, the total number of timeout-based retransmissions was counted. Figure 5-5 plots the retransmission counts versus the MTBS.

Figure 5-5: Surprisingly, the number of timeout-based retransmissions decreases with an increase in MTBS. We attribute this observation to an increasing RTT variance.

Somewhat surprisingly, increasing the MTBS reduces the number of timeout-based retransmissions. However, this result can be explained through Figure 5-6, which plots the mean RTT variance of all the flows in the simulation. The mean RTT variance increases at a rate proportional to the MTBS. Further, the retransmission timer is configured to expire according to the formula SRTT + 4 × RTTVAR, where SRTT is the smoothed RTT estimate and RTTVAR is the measured variance in the RTT. Thus, the increasing RTT variance gives the retransmit timer a larger time boundary before it triggers packet retransmissions.

Figure 5-6: The mean RTT variance increases proportionally to MTBS.

However, this result does not imply that larger MTBS values are desirable, only that they do not produce adverse effects on TCP through excessive retransmissions. Figure 5-7 plots the average per-flow goodput as a function of MTBS. It shows that the average per-flow goodput is inversely proportional to MTBS. Because δ = MTBS + 10 ms, very little to no reordering occurs. Since TCP throughput is inversely proportional to RTT, we conclude that the decline in per-flow goodput is due to the longer path to the egress.

Figure 5-7: Although the number of timeout-based retransmissions decreases with higher MTBS values, so does the average per-flow goodput.
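The timer behavior above follows from the standard TCP estimator. As a reference point, here is a sketch of the usual EWMA update with RFC 6298 constants (our illustration, not the simulator's code); alternating RTT samples from two paths keep RTTVAR, and hence the timeout, large.

```python
# Sketch of the standard TCP retransmission-timeout estimator
# (RFC 6298 constants; our illustration, not the ns-2 code).
# Splitting across paths of different latency inflates RTTVAR,
# which widens the retransmission timeout.

def update_rto(srtt, rttvar, sample, alpha=1/8, beta=1/4):
    if srtt is None:                     # first RTT measurement
        srtt, rttvar = sample, sample / 2
    else:
        rttvar = (1 - beta) * rttvar + beta * abs(srtt - sample)
        srtt = (1 - alpha) * srtt + alpha * sample
    return srtt, rttvar, srtt + 4 * rttvar

# Alternating 45 ms / 105 ms samples (two paths, 60 ms apart).
srtt = rttvar = None
for sample in [0.045, 0.105] * 10:
    srtt, rttvar, rto = update_rto(srtt, rttvar, sample)
print(f"RTO = {rto * 1000:.0f} ms")      # stays well above the max RTT
```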
5.4 End-to-End Performance

While the Packet Trace Analyzer can only predict the impact of reordering, FLARE-TeXCP can fully simulate TCP dynamics. We measured the relative performance of the traffic splitting schemes by comparing average per-flow goodput values. We evaluate the schemes by simulating FLARE-TeXCP on flow profiles generated from the traces described in table 4.1.

Figure 5-8: For each flow profile, the average per-flow goodput measured in simulation. FLARE-TeXCP typically performs as well as or better than a flow-switched splitting algorithm, but with much less overhead. In all circumstances, FLARE-TeXCP outperforms a packet-switched algorithm.

Figure 5-8 shows the average goodput achieved by each of the traffic splitting schemes. Clearly, flowlet-switching through FLARE-TeXCP outperforms both of the other schemes. Figure 5-9 shows the distribution of flow goodputs obtained by the various traffic splitting schemes for the different flow profiles. Flowlet-switching typically produces a larger fraction of flows with higher goodputs. The peering flow profile represents an instance in which the differences between the three splitting schemes are not apparent. While in certain simulations flowlet-switching and flow-switching have comparable performance (e.g., peering, NLANR1), section 5.6 shows that flowlet-switching yields better traffic splitting accuracy and is more reactive to network changes, critical features for adaptive multipath routing.

Figure 5-9: CDF comparison of the splitting algorithms on the (a) Peering, (b) LCSout, (c) NLANR-1, and (d) NLANR-2 flow profiles. In general, packet-switching tends to produce goodput distributions in which most flows have rates below 24-28 Kbps, leading to lower average goodputs. Flow-switched and flowlet-switched algorithms produce a larger number of flows with higher goodputs, with flowlet-switching typically outperforming, if not tracking, flow-switching.

              Peering    LCSout     NLANR-1    NLANR-2
    FLARE      15389      150631     40589      60666
    FLOW        9048        5767      6024       1893
    PACKET     80688     4173475    237172     367501

Table 5.2: The number of switching opportunities available to a router under each of the three traffic splitting schemes. In general, traffic patterns contain significantly more flowlets than flows, so FLARE provides a router with significantly more switching opportunities than flow-switching.

5.5 Switching Opportunities

One of the advantages of flowlet-switched traffic splitting over flow-switching is that flowlets arrive more frequently. Chapter 4 showed that, for most traces, an order of magnitude more flowlets arrive than flows. Because the arrival of a flowlet is an opportunity for the router to readjust traffic allocations in response to changing network conditions, flowlet-switching provides more switching opportunities. Table 5.2 shows the number of switching opportunities available to the router in the FLARE-TeXCP simulations and compares it to the number provided by packet- and flow-switching. As expected, the flow-switched traffic splitter produced a number of switching opportunities equal to the number of arriving flows; similarly, the packet-switched splitter produces a number of switching opportunities equal to the number of arriving packets. FLARE-TeXCP, on the other hand, offers nearly an order of magnitude more switching opportunities than the flow-switched splitter. In other words, FLARE-TeXCP gives a router greater flexibility to make routing decisions in a multipath environment.
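To illustrate how the counts in Table 5.2 arise, the sketch below tallies switching opportunities for the three schemes from a list of (flow id, timestamp) packet records: every packet for packet-switching, every new flow for flow-switching, and every packet that begins a new flowlet (an idle gap of at least δ within its flow) for flowlet-switching. The function and variable names are our own, and the 60 ms δ is just one of the values studied in chapter 4.

    # Count switching opportunities for packet-, flow-, and flowlet-switching
    # over a trace of (flow_id, timestamp) pairs, sorted by timestamp.
    # delta is the flowlet idle threshold in seconds (60 ms here).

    def switching_opportunities(trace, delta=0.060):
        last_seen = {}                 # flow_id -> timestamp of previous packet
        packets = flows = flowlets = 0
        for flow_id, ts in trace:
            packets += 1
            prev = last_seen.get(flow_id)
            if prev is None:
                flows += 1             # first packet of a new flow
                flowlets += 1          # a new flow also begins a new flowlet
            elif ts - prev >= delta:
                flowlets += 1          # idle gap long enough: new flowlet
            last_seen[flow_id] = ts
        return packets, flows, flowlets

    trace = [("a", 0.00), ("a", 0.01), ("b", 0.02), ("a", 0.20), ("b", 0.30)]
    print(switching_opportunities(trace))   # -> (5, 2, 4)

As the toy trace shows, the flowlet count always falls between the flow count and the packet count, which is exactly the regime Table 5.2 exhibits at scale.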
One interesting observation about these figures is that all simulated traffic consisted of TCP traffic. Even in a simulation environment, the nature of TCP's transmission mechanism produces the burstiness that leads to the existence of flowlets. In other words, no external factors in the network were needed to produce flowlet counts similar to those the Packet Trace Analyzer observed.

5.6 Enabling Adaptive Multipath Routing

Finally, we demonstrate that FLARE-TeXCP can potentially enable an efficient adaptive multipath router. In this section, we compare flowlet-switching performance to flow- and packet-switching performance. In all simulations, TeXCP produces the desired split vectors used as inputs to the traffic splitting algorithms.

To enable adaptive multipath routing, a traffic splitting mechanism must support three key features. First, it must be able to accurately track a desired split vector. An adaptive multipath router produces a split vector based on some optimization function (e.g., minimizing the maximum link utilization or minimizing transmission delays). In FLARE-TeXCP, the TeXCP component provides the split ratio that becomes the input to the traffic splitter. Chapter 4 shows that a flowlet-switched traffic splitter can accurately track static as well as dynamic, time-varying split vectors. Figures 5-10(a) and 5-10(b) compare the accuracy of FLARE-TeXCP, flow-switching, and packet-switching for static and dynamic split vectors. We compute errors using the same accuracy metric defined in section 4.1.3.

Figure 5-10: Splitting errors with (a) static and (b) dynamic split vectors. The packet-switched traffic splitter clearly achieves the best accuracy when splitting traffic. Flow-switching leads to the worst accuracy, while flowlet-switching achieves error rates in the middle.

The accuracy measurements exhibit the same trend we observed with the Packet Trace Analyzer. Packet-switching leads to the most accurate splitting, flow-switched splitting leads to the worst, and flowlet-switching produces an accuracy somewhere in the middle. However, a few results differ from those we predicted with the Packet Trace Analyzer. For example, the accuracy achieved on the LCSout flow profile is significantly better than the accuracy obtained on the peering flow profile, which is the reverse of what we observed using the PTA. Another difference is that the error rates from simulation are higher than what the PTA predicted. We attribute these discrepancies to two factors. First, the peering flow profile does not contain much traffic. As table 5.1 shows, the total amount of traffic transferred by almost 13,000 flows sums to only 621 MB, with flows arriving over 200 s. When link utilizations are low enough, TeXCP does not attempt to balance traffic loads and instead adapts the split vectors to send all traffic through a single path. This feature of TeXCP would lead to higher error rates for the dynamic split vectors.
Second, the traffic generated in the simulation does not accurately model the traffic patterns in the packet traces, since all of our senders were FTP senders. In reality, the traffic patterns observed in the actual packet traces have different characteristics due to the nature of the applications generating the traffic.

The next requirement is that the traffic-splitting algorithm must perform well with a time-varying split vector. Figure 5-11 compares the average per-flow goodputs of FLARE-TeXCP, flow-switching, and packet-switching. Clearly, flowlet-switching produces higher average per-flow goodputs than the other schemes.

Figure 5-11: FLARE-TeXCP outperforms flow-switching and packet-switching, achieving higher per-flow goodput rates with dynamic split vectors.

Finally, a traffic-splitting algorithm must react to changing network conditions. A key advantage of multipath routing is that during a path failure or other traffic shock event, the router can react quickly by redistributing the traffic intended for the failed link across all or some of the other links. Figure 5-12 shows how FLARE-TeXCP quickly rebalances traffic onto the available links when a link on path 1 experiences a traffic shock. We model a traffic shock by introducing a square wave of cross traffic on one of the links along path 1. FLARE-TeXCP redistributes the traffic from path 1 equally between paths 2 and 3. Figure 5-13 graphs the same experiment with the flow-switched traffic splitter. Clearly, the flow-switched traffic splitter is unable to balance the traffic as well as FLARE-TeXCP. When the traffic shock arrives on path 1, the flow-switched splitter cannot effectively rebalance the traffic. Figure 5-14 shows how much more accurately the flowlet-switched traffic splitter distributes the traffic when a traffic shock occurs.

Figure 5-12: The left column (panels (a), (c), (e) for paths 1, 2, 3) shows the utilization, desired split vector, and actual split values for FLARE-TeXCP on the NLANR1 flow profile. The right column (panels (b), (d), (f)) shows the same simulation when a traffic shock occurs on path 1. The traffic that was previously transmitted along path 1 is distributed equally along paths 2 and 3.

Figure 5-13: The left column shows the utilization, desired split vector, and actual split values for the flow-switched traffic splitter on the NLANR1 flow profile. The right column shows the same simulation when a traffic shock occurs on path 1. Clearly, the traffic splitter has difficulty allocating traffic according to the desired split vectors.

Figure 5-14: When a traffic shock occurs on path 1, flow-switching produces a traffic split with relatively high error, unlike FLARE.
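The rebalancing shown in figures 5-12 through 5-14 rests on the splitter's ability to steer each new flowlet toward the path that is furthest behind its desired share. The sketch below illustrates one way to do so with per-path token counters, in the spirit of the token-counting algorithm of chapter 3; the class and its update rule are a simplified illustration of the idea rather than the exact FLARE implementation.

    # Token-counting path assignment sketch. Each path keeps a token counter
    # that earns credit at the path's desired share and pays for bytes
    # actually sent on it. A new flowlet goes to the path with the most
    # tokens, i.e. the path furthest behind its desired allocation.

    class TokenSplitter:
        def __init__(self, split_vector):
            self.split = split_vector            # desired share per path, sums to 1
            self.tokens = [0.0] * len(split_vector)

        def on_packet(self, path, size):
            """Account for a packet of `size` bytes sent on `path`."""
            for i, share in enumerate(self.split):
                self.tokens[i] += share * size   # every path earns its share
            self.tokens[path] -= size            # the chosen path pays for the packet

        def assign_new_flowlet(self):
            """Pick the path that is furthest behind its desired share."""
            return max(range(len(self.tokens)), key=lambda i: self.tokens[i])

    splitter = TokenSplitter([0.5, 0.3, 0.2])
    path = splitter.assign_new_flowlet()
    splitter.on_packet(path, 1500)
    print(path, splitter.tokens)

When TeXCP updates the split vector, only the desired shares change; the counters then bias subsequent flowlet assignments toward the newly favored paths, which is why frequent flowlet arrivals translate directly into fast rebalancing.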
Chapter 6

Flowlets

The idea underlying flowlet-based splitting is simple: instead of switching paths at the granularity of a packet or a flow, allow the router to switch bursts of packets from the same flow, as long as they are separated by a large-enough idle interval. Switching bursts of a few packets provides a higher switching resolution than flow-based switching, resulting in better accuracy. But a natural question to ask is why it is possible to divide most TCP flows into short flowlets, particularly the long ones, which contain the majority of the total traffic [46]. Another is why tracking flowlets requires very little state, even though the number of flowlets is larger than the number of flows. This chapter shows that by harnessing TCP's burstiness, flowlet-based splitting achieves an effectiveness that might appear puzzling at first.

6.1 The Origin of Flowlets

Flowlets do not emerge solely from the existence of short flows, flows with small windows of one or two packets, or flows that are suffering timeouts. If they did, these sources alone would not produce a significant number of flowlets among the long TCP flows that carry most of the bytes [46, 16], and flowlet splitting could not be as effective as chapter 4 shows.

In fact, the main reason for the existence of flowlets is the burstiness of TCP at RTT and sub-RTT scales. Prior work has shown that a TCP sender tends to send a whole congestion window in one burst, or a few clustered bursts, and then wait idle for the rest of its RTT. This behavior is caused by ack compression, slow-start, and other factors [22, 43, 47, 38]. FLARE exploits this burstiness when it processes a long TCP flow as a concatenation of short flowlets separated by idle periods.

Figures 6-1 and 6-2 support this argument. Both figures were computed using the peering trace for δ = 60 ms. Figure 6-1 plots the time between arrivals of two consecutive flowlets from the same flow, normalized by the RTT of the flow (RTT is computed using the MYSTERY TCP analyzer [25]). The graph shows that the vast majority of flowlets are separated by less than an RTT, indicating that a flowlet is usually a congestion window or a portion of one.

Figure 6-1: CDF of flowlet inter-arrival time normalized by flow RTT. About 68% of the 60ms-flowlets have sub-RTT inter-arrivals, indicating that most of these flowlets are a whole congestion window or a portion of one.

    Trace      Flow Arrivals (/sec)   Flowlet Arrivals (/sec)   Concurrent Flows     Concurrent Flowlets
    LCSout     143.16                 1454.98                   1450.42 (2030)       18.41 (49)
    Peering    611.95                 8661.43                   8477.33 (8959)       28.08 (56)
    NLANR-1    3784.10                35287.04                  47883.33 (57860)     240.12 (309)
    NLANR-2    111.33                 2848.76                   1559.33 (1796)       50.66 (71)

Table 6.1: 60ms-flowlets arrive at a much higher rate than flows, but there are far fewer concurrent flowlets than concurrent flows. Values outside parentheses are averages; values in parentheses are maxima.

Figure 6-2 shows that flowlets do effectively split long TCP flows into sequences of short flowlets. The figure shows that while 70% of the bytes are in flows larger than 10 KB, only 20% of the bytes are in flowlets larger than 10 KB.
Figure 6-2: More than 70% of bytes are in 60ms-flowlets of size smaller than 2KB. This indicates that the concept of flowlets shifts most of the bytes into small flowlets, which can be independently switched.

6.2 Why Flowlet Splitting is Accurate

Flowlet-based splitting is accurate for two reasons. First, there are many more flowlets than flows, leading to many opportunities to rebalance an imbalanced load. Table 6.1 shows that, in our traces, flowlet arrival rates are an order of magnitude higher than flow arrival rates. This means that in every second, flowlet-based splitting provides an order of magnitude more opportunities to correct an inaccurate split than flow-based splitting does. Second, as shown in Fig. 6-2, most of the bytes are in small flowlets, allowing load rebalancing to occur at a much finer granularity than the size of a flow.

6.3 Why Flowlet Tracking Requires a Small Table

Despite the large number of flowlets in a trace, FLARE only needs to maintain state for flowlets with packets in transmission, i.e., flowlets that currently have packets in the network. Table 6.1 shows that the average number of concurrent flowlets is two orders of magnitude smaller than the number of concurrent flows. Indeed, the maximum number of concurrent flowlets in our traces never exceeds 400. To track these flowlets without collision, the router needs a hash table containing approximately a thousand entries, which is consistent with the results in chapter 4 (a minimal sketch of such a table appears at the end of this chapter).

TCP enables one to divide each long flow into multiple short flowlets. Moreover, only a small number of these flowlets concurrently have packets in transmission. Since TCP is bursty, and is likely to remain bursty for the near future, it is worth exploring where TCP burstiness can be useful. FLARE harnesses TCP burstiness to improve the performance of traffic splitting across multiple paths. Other applications that depend on TCP's burstiness may exist and could potentially take advantage of flowlets.
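To illustrate why so little state suffices, the following sketch tracks flowlets in a small fixed-size table: each slot stores only a last-seen time and an assigned path, and a slot whose idle gap has exceeded δ is treated as expired and simply reused. The table size and field names are illustrative assumptions, and Python's CRC-32 stands in for the shorter hash a router might use.

    # Minimal flowlet-tracking sketch: a fixed-size table indexed by a hash
    # of the flow identifier. A slot only matters while its flowlet has
    # packets in flight; once the inter-packet gap exceeds DELTA, the slot
    # is stale and can be reassigned, so ~1000 entries suffice even for
    # traces containing millions of flowlets.

    import zlib

    TABLE_SIZE = 1024
    DELTA = 0.060                       # flowlet idle threshold, 60 ms

    table = [{"last_seen": -1.0, "path": None} for _ in range(TABLE_SIZE)]

    def route_packet(flow_id, now, pick_path):
        """Return the path for this packet; start a new flowlet after an idle gap."""
        slot = table[zlib.crc32(flow_id.encode()) % TABLE_SIZE]
        if slot["path"] is None or now - slot["last_seen"] > DELTA:
            slot["path"] = pick_path()  # new flowlet: free to choose a new path
        slot["last_seen"] = now         # ongoing flowlet: stay on the same path
        return slot["path"]

    path = route_packet("10.0.0.1:80->10.0.0.2:5000", 1.000, lambda: 2)

Two flows that collide in a slot simply share a path decision until one of them goes idle; because so few flowlets are concurrently active, such collisions are rare, which is why a table of roughly a thousand entries is enough.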
Chapter 7

Future Work and Conclusions

7.1 Future Work

The work done thus far represents only the first steps toward an understanding of traffic splitting and multipath routing; much work remains to be completed. Packet reordering is the first hurdle to overcome when spreading a single flow across multiple paths. An important next step is to characterize the types of paths across which TCP traffic may be safely spread. For example, we wish to bound the differences in path delays, loss rates, capacities, and other characteristics in order to provide precise constraints on when splitting TCP traffic is advantageous. Another necessary task is comparing the performance of FLARE-TeXCP against FLARE coupled with other multipath routing schemes, such as OSPF/IS-IS (even though in that case no split vector adaptation occurs). A further area of research is the relationship between link capacities, congestion levels, and end-to-end TCP performance under the different splitting schemes. We observed that congestion levels and relative link utilizations have noticeable effects on overall performance and accuracy. We believe this may be because TeXCP relies on XCP to manage link congestion levels. The interaction between XCP, TeXCP, and FLARE must be characterized before this research can be considered complete. A particularly interesting next step would be to isolate the components, simplify each piece, and systematically investigate the interaction between each layer. A final area of future work lies in finding other interesting applications of flowlets.

7.2 Conclusion

To our knowledge, we are the first to introduce the concept of flowlet-switching and to develop an algorithm that utilizes it. Our work reveals several interesting conclusions. First, highly accurate traffic splitting can be implemented with little to no impact on TCP packet reordering and without imposing a significant state requirement. Next, flowlets can be used to make adaptive multipath routing more practical. Our simulations of full TCP dynamics support the conclusions we drew from analyzing FLARE on its own, although many questions still remain. Finally, the existence and usefulness of flowlets show that TCP burstiness is not necessarily a bad thing, and can in fact be used advantageously.

Appendix A

Handle Packet Departure in Flow Trace Analyzer

Require: Packet P

    flow_id <- CRC16(P)
    idx <- P.index_number
    entry <- flow_table.lookup(flow_id)
    td <- P.departure_time
    if idx == entry.last_index_in_order + 1 then
        entry.last_index_in_order++
        if entry.out_of_order_que.isEmpty() then
            return
        end if
        if entry.dup_ack_count >= 3 then
            CountCongestionEvent()        {TCP's DUPACK threshold is 3}
        end if
        entry.dup_ack_count <- 0
        {The out-of-order queue is maintained in sorted order by packet index}
        while entry.out_of_order_que.notEmpty() and
              entry.out_of_order_que.top().idx == entry.last_index_in_order + 1 do
            entry.out_of_order_que.pop()
            entry.last_index_in_order++
        end while
    else
        entry.dup_ack_count++
        entry.out_of_order_que.push(P)
    end if
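For readers who prefer an executable form, the following is a direct Python transcription of the handler above. The FlowEntry class, the heap-based out-of-order queue, and the congestion_events counter (standing in for CountCongestionEvent) are our own names; the per-packet flow-table lookup is elided.

    # Python transcription of the departure handler above.

    import heapq

    class FlowEntry:
        def __init__(self):
            self.last_index_in_order = 0
            self.dup_ack_count = 0
            self.out_of_order = []          # min-heap of out-of-order packet indices

    congestion_events = 0

    def handle_departure(entry, idx):
        """Process one departing packet with sequence index `idx`."""
        global congestion_events
        if idx == entry.last_index_in_order + 1:
            entry.last_index_in_order += 1
            if not entry.out_of_order:
                return
            if entry.dup_ack_count >= 3:    # TCP's DUPACK threshold is 3
                congestion_events += 1
            entry.dup_ack_count = 0
            # Drain any queued packets that are now in order.
            while entry.out_of_order and entry.out_of_order[0] == entry.last_index_in_order + 1:
                heapq.heappop(entry.out_of_order)
                entry.last_index_in_order += 1
        else:
            entry.dup_ack_count += 1
            heapq.heappush(entry.out_of_order, idx)

    entry = FlowEntry()
    for i in [1, 3, 4, 2, 5]:               # packet 2 arrives out of order
        handle_departure(entry, i)
    print(entry.last_index_in_order)        # -> 5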
Bibliography

[1] Amit Aggarwal, Stefan Savage, and Thomas Anderson. Understanding the performance of TCP pacing. In INFOCOM, 2000.

[2] A. Akella, B. Maggs, S. Seshan, A. Shaikh, and R. Sitaraman. A measurement-based analysis of multihoming. In ACM SIGCOMM, 2003.

[3] M. Allman, W. Eddy, and S. Ostermann. Estimating loss rates with TCP. ACM Performance Evaluation Review, 2003.

[4] D. Andersen, A. Snoeren, and H. Balakrishnan. Best-path vs. multi-path overlay routing. In ACM IMC, 2003.

[5] Guido Appenzeller, Isaac Keslassy, and Nick McKeown. Sizing router buffers. In SIGCOMM, 2004.

[6] D. Awduche, L. Berger, D. Gan, T. Li, V. Srinivasan, and G. Swallow. LSP tunnels, 2001. IETF RFC 3209.

[7] E. Blanton and M. Allman. On making TCP more robust to packet reordering. ACM Computer Communication Review, 2002.

[8] James E. Burns, Teunis J. Ott, Anthony E. Krzesinski, and Karen E. Miller. Path selection and bandwidth allocation in MPLS networks. Performance Evaluation, 2003.

[9] Zhiruo Cao, Zheng Wang, and Ellen W. Zegura. Performance of hashing-based schemes for internet load balancing. In IEEE INFOCOM, 2000.

[10] C. N. Chuah and C. Diot. A tier-1 ISP perspective: Design principles & observations of routing behavior. In PAM Workshop on Large-Scale Communications Networks, 2002.

[11] Wu-chun Feng and Peerapol Tinnakornsrisuphap. The failure of TCP in high-performance computational grids. In Supercomputing, 2000.

[12] Cisco Express Forwarding (CEF). Cisco white paper, Cisco Systems, July 2002.

[13] Anwar Elwalid, Cheng Jin, Steven H. Low, and Indra Widjaja. MATE: MPLS adaptive traffic engineering. In IEEE INFOCOM, 2001.

[14] A. Feldmann, A. Greenberg, C. Lund, N. Reingold, and J. Rexford. Deriving traffic demands from operational IP networks: Methodology and experience. IEEE/ACM Transactions on Networking, 2001.

[15] B. Fortz and Mikkel Thorup. Internet traffic engineering by optimizing OSPF weights in a changing world. In IEEE INFOCOM, 2000.

[16] C. Fraleigh, S. Moon, B. Lyles, C. Cotton, M. Khan, D. Moll, R. Rockell, T. Seely, and C. Diot. Packet-level traffic measurements from the Sprint IP backbone. IEEE Network, 2003.

[17] David K. Goldenberg, Lili Qiu, Haiyong Xie, Yang Richard Yang, and Yin Zhang. Optimizing cost and performance for multihoming. In ACM SIGCOMM, 2004.

[18] Fanglu Guo, Jiawu Chen, Wei Li, and Tzi-cker Chiueh. Experiences in building a multihoming load balancing system. In INFOCOM, 2004.

[19] E. Gustafsson and G. Karlsson. A literature survey on traffic dispersion. IEEE Network, 1997.

[20] G. Iannaccone, C. Chuah, R. Mortier, S. Bhattacharyya, and C. Diot. Analysis of link failures in an IP backbone. In ACM SIGCOMM Internet Measurement Workshop, Marseille, France, November 2002.

[21] B. Jamoussi et al. Constraint-based LSP setup using LDP, 2002. IETF RFC 3212.

[22] H. Jiang and C. Dovrolis. The origin of TCP traffic burstiness in short time scales. Technical report, Georgia Tech, 2004.

[23] JUNOS 6.3 Internet software routing protocols configuration guide. www.juniper.net/techpubs/software/junos/junos63/swconfig63-routing/html/.

[24] S. Kandula, A. Qureshi, S. Sinha, and D. Katabi. TeXCP: Intra-domain online traffic engineering with an XCP-like protocol. nms.lcs.mit.edu/~dina/texcp.

[25] Sachin Katti, Charles Blake, Dina Katabi, Eddie Kohler, and Jacob Strauss. M&M: Passive measurement tools for internet modeling. In ACM IMC, 2004.

[26] J. Kulik, R. Coulter, D. Rockwell, and C. Partridge. A simulation study of paced TCP. BBN Technical Memorandum 1218, BBN, 1999.

[27] Kun-chan Lan and John Heidemann. On the correlation of internet flow characteristics. Technical Report ISI-TR-574, USC/ISI, July 2003.

[28] R. Ludwig and R. Katz. The Eifel algorithm: Making TCP robust against spurious retransmissions. ACM Computer Communication Review, 2000.

[29] N. F. Maxemchuk. Dispersity routing. In IEEE ICC, 1975.

[30] D. Mitra and K. G. Ramakrishnan. A case study of multiservice, multipriority traffic engineering design. In IEEE GLOBECOM, 1999.

[31] National Laboratory for Applied Network Research. http://pma.nlanr.net/.

[32] Konstantina Papagiannaki, Nina Taft, and Christophe Diot. Impact of flow dynamics on traffic engineering design principles. In IEEE INFOCOM, Hong Kong, March 2004.

[33] Craig Partridge. ACK spacing for high delay-bandwidth paths with insufficient buffering, 1997. Internet Draft.

[34] Vern Paxson. End-to-end internet packet dynamics. IEEE/ACM Transactions on Networking, 1999.

[35] S. Rost and H. Balakrishnan. Rate-aware splitting of aggregate traffic. Technical report, MIT, 2003.

[36] Matthew Roughan, Albert Greenberg, Charles Kalmanek, Michael Rumsewicz, Jennifer Yates, and Yin Zhang. Experience in measuring backbone traffic variability: Models, metrics, measurements and meaning. In ACM Internet Measurement Workshop, 2002.

[37] M. Shreedhar and George Varghese. Efficient fair queueing using deficit round robin. In SIGCOMM, 1995.

[38] Andras Veres and Miklos Boda. The chaotic nature of TCP congestion control. In INFOCOM, 2000.

[39] Curtis Villamizar. MPLS optimized multipath (MPLS-OMP), 1999. Internet Draft.

[40] Curtis Villamizar. OSPF optimized multipath (OSPF-OMP), 1999. Internet Draft.
[41] V. Visweswaraiah and J. Heidemann. Improving restart of idle TCP connections. Technical report, 1997.

[42] Y. Wang and Z. Wang. Explicit routing algorithms for internet traffic engineering. In IEEE ICCCN, 1999.

[43] L. Zhang, S. Shenker, and D. D. Clark. Observations on the dynamics of a congestion control algorithm. In SIGCOMM, 1991.

[44] M. Zhang, B. Karp, and S. Floyd. RR-TCP: A reordering-robust TCP with DSACK. In IEEE ICNP, 2003.

[45] Ming Zhang, Junwen Lai, Arvind Krishnamurthy, Larry Peterson, and Randolph Wang. A transport layer approach for improving end-to-end performance and robustness using redundant paths. In USENIX, 2004.

[46] Y. Zhang, L. Breslau, V. Paxson, and S. Shenker. On the characteristics and origins of internet flow rates. In SIGCOMM, 2002.

[47] Zhi-Li Zhang, V. Ribeiro, S. Moon, and C. Diot. Small-time scaling behaviors of internet backbone traffic: An empirical study. In INFOCOM, 2003.