TCP Performance Improvement over Heterogeneous Networks

by David Lecumberri

B.S., Telecommunications Engineering, Universidad Pública de Navarra, 1997
B.S., Electrical Engineering, Institut National Polytechnique de Grenoble, 1996

Submitted to the Department of Electrical Engineering and Computer Science in Partial Fulfillment of the Requirements for the Degree of Master of Science in Electrical Engineering and Computer Science at the Massachusetts Institute of Technology, June 2000.

© 2000 David Lecumberri. All rights reserved. The author hereby grants to MIT permission to reproduce and to distribute publicly paper and electronic copies of this thesis document in whole or in part.

Signature of Author: Department of Electrical Engineering and Computer Science, March 10, 2000

Certified by: Kai-Yeung Siu, Associate Professor of Mechanical Engineering, Thesis Supervisor

Accepted by: Arthur C. Smith, Chairman, Committee on Graduate Students, Department of Electrical Engineering and Computer Science

Abstract

TCP/IP is one of the most popular protocol suites in use today. However, the desire for and availability of increased bandwidth and differentiated applications have shown that the operation of current TCP implementations can have limitations. In this work, we address some of these limitations, in particular those derived from the TCP window size, the presence of traffic in asymmetric networks, and the optimization of TCP over high-speed flow switching networks.
We propose different solutions that lead to improved performance, either by introducing software modules that optimize packet handling, or by proposing slight modifications to current TCP implementations. We implement three of these solutions, showing results that demonstrate how they improve TCP performance.

Thesis Supervisor: Kai-Yeung Siu
Title: Associate Professor of Mechanical Engineering

Acknowledgements

I wish to thank Professor Kai-Yeung Siu for allowing me to be part of his research group for almost two years, and for his constant support and insights throughout this work. My officemates also made my work much more enjoyable and valuable. My special gratitude to Paolo for his friendship and his guidance on what it means to do research at MIT. I also want to thank Mike Patrick and Whay Lee for their support, help and enthusiasm for large parts of this work. Working with all of them has been one of my best experiences and I hope to be able to repeat it in the future.

To "la Caixa" Fellowship Program for the financial support of my studies at MIT, and to Josep-Anton Monfort and Anne Klarich for taking such good care of their fellows.

To all my friends in Boston, who helped me enjoy living here, and to all those in the distance who helped me appreciate what it is to be far from the ones we love, my deep gratitude.

To my parents, because none of this would have been possible without them.

And last, but not least, my love and gratitude to Maria, for always being there for me and for her constant love and support during the good and the hard times.

Index

INTRODUCTION
    TCP main characteristics
SOME ADVERSE EFFECTS IN TCP
    The effect of the receiver window size
    TCP in asymmetric networks
    Flow switching in high-speed networks
PRELIMINARY ANALYSIS
    Test scenario in asymmetric networks
    The effect of the receiver window size
    The effect of cross traffic
    Some possible solutions
        ACK Suppression (AS) at CM
        Window modification at CMTS or CM
        ACK arrival estimation at CMTS
        ACK reconstruction (AR) at CMTS
    TCP with high-speed flow switching
        Mode switching
        The recovery phase
IMPLEMENTATION OF SOLUTIONS PROPOSED
    Window size modifier (WM)
    ACK suppression (AS)
    WM and AS Implementation Details
        Window Modifier Operation
        ACK Suppression operation
        Interface description
        Operation details
    High Performance TCP
        Basic algorithms
DESCRIPTION OF RESULTS
    Window Modifier
    ACK Suppression
    High Performance TCP and high speed flow switching
CONCLUSIONS AND FUTURE WORK
REFERENCES

Introduction

TCP/IP, or Transport Control Protocol/Internet Protocol, is one of the most popular protocol suites in use today. Since its inception in the early 70's [1], it has been the base upon which the Internet and many other network implementations have been deployed. Its reliability and robustness have made it almost the de facto standard in network interconnection. However, the desire for and availability of increased bandwidth and differentiated applications have shown that the operation of current TCP implementations can have limitations.

The popularity of the Internet has led to a variety of scenarios and configurations that present new challenges to the versatility that TCP/IP has demonstrated in the past. It is an enormous challenge to adapt TCP/IP to the increasing demands for high bandwidth in the network backbone, as well as to provide good performance for residential access, both at high and low bandwidth. We are interested in analyzing some of the scenarios that might lead to poor TCP performance. These are situations with which TCP is confronted every day.
They lead to different problems that we can consider independently and which are not mutually exclusive, i.e. they can be experienced simultaneously with little or no interaction. There are a number of causes that can lead to this poor performance, and we address several of them in this work. We will first describe the most relevant characteristics of TCP. Then we will describe the scenarios in which we are interested in evaluating TCP performance, pointing out the adverse effects that can be experienced. We will then outline the solutions that we intend to develop in order to relieve some of these adverse effects on TCP performance. Finally, we will implement some of these solutions and discuss the results obtained.

TCP main characteristics

TCP is a very complex protocol and we do not aim to explain it here in full detail, since the complete specifications are readily available in the literature ([1], [12], [18], [24], [27]). However, we will try to describe its main characteristics in order to argue later how they affect performance under the different scenarios that we are going to analyze.

First, let us mention that there are a number of functions that a lossless protocol has to implement in order to be correct and functional. They can be divided into two main categories:

* Flow control and Recovery: Flow control is the function that ensures that the flow of information between the two ends occurs in a controlled manner. It has to ensure that the sender will not transmit data faster than the receiver can process it. It also has to guarantee that the information is processed in the correct order, and that packet loss in the network is detected and solved. Recovery is the designation of the mechanism that a protocol uses to recover from packet loss (whether due to network congestion or to isolated losses) and to ensure that this loss will be remedied, usually by retransmission.
* Congestion control: This is the mechanism by which a protocol reacts to network conditions, namely congestion. The congestion control mechanism can be implemented in many places, but when it is implemented within the protocol, it reacts to indications of network congestion by adapting the transmission rate to the new conditions.

Flow control in TCP is achieved through a sliding window mechanism. The concept of a sliding window has been around for a very long time; in order to understand it, we have to imagine the packets that a connection has to send as aligned in a sequence. In order to provide lossless communication, a TCP sender has to keep a copy of any packet sent until it has confirmation that the other end has correctly received that packet. TCP packets with a special flag, called ACK packets, offer that confirmation, carrying the sequence number of the next packet that the receiver is waiting for. This implicitly indicates that any packet with a lower sequence number has been correctly received. Each sender maintains a window of packets that correspond to those that could have been sent (but are not yet acknowledged) at a given point in time. When new packets are acknowledged, we say that the window slides, allowing more packets to be sent. The position of the window on the aligned sequence of packets therefore depends on the number of packets that have been acknowledged by the receiver at the other end.

[Figure 1: The TCP sliding window scheme, showing the snduna, sndnxt and sndcwnd parameters]

The size of the window itself varies in time according to a series of algorithms and heuristics that have been developed over time, which we will describe later in this introduction. They aim to ensure that the buffering space required at the sender is kept to a minimum, and that recovery from errors can be achieved with the least possible disturbance to the ongoing transmission.
In order to ensure that this sliding window mechanism is effectively implemented, each TCP sender maintains a number of window parameters for a given connection. The most relevant for our purpose are shown in Figure 1 and are the following:

* snduna: a pointer to the first packet that was sent but has not yet been acknowledged.
* sndnxt: a pointer to the next packet that will be sent whenever the window mechanism allows it.
* sndcwnd: the window size value that TCP is using at the moment.

snduna is updated every time an ACK packet is received, and is set to the packet whose sequence number is demanded by the ACK. The maximum number of packets that can be outstanding (i.e. sent but not yet acknowledged) is given by sndcwnd. Therefore, sndnxt can advance over the stream of packets until it reaches snduna + sndcwnd. In this case, we say that the window is exhausted, and no other packet can be sent until we receive a new ACK that increases snduna (and therefore slides the window).

The maximum throughput that a TCP connection can achieve will depend on the network state at that particular time, but it will be bounded, among other things, by the maximum TCP window size (sometimes called the socket size). The window size value is responsible for the congestion control mechanism in the sense that while there is no congestion in the network, the window size will steadily grow until the connection experiences some packet loss. Each end of the connection maintains a maximum value for this window. On the sender side, different control algorithms ([2], [3], [23], [18]) determine the window size, and this value varies depending on the losses detected and the frequency at which ACKs are received. The window size usually starts with a small value and grows exponentially during the slow start phase; it is then steadily increased during the congestion avoidance phase.
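The window bookkeeping described above can be sketched with a toy model. This is a hypothetical simplification (it counts whole packets rather than bytes, and ignores sequence-number wraparound), not the thesis implementation:

```python
# Toy model of the TCP sender's sliding window (hypothetical sketch;
# real stacks track bytes, not packets, and handle wraparound).

class SlidingWindowSender:
    def __init__(self, cwnd):
        self.snd_una = 0      # first packet sent but not yet acknowledged
        self.snd_nxt = 0      # next packet to be sent
        self.snd_cwnd = cwnd  # current window size, in packets

    def can_send(self):
        # The window is exhausted once snd_nxt reaches snd_una + snd_cwnd.
        return self.snd_nxt < self.snd_una + self.snd_cwnd

    def send_packet(self):
        if self.can_send():
            self.snd_nxt += 1

    def on_ack(self, ack):
        # Cumulative ACK: 'ack' is the sequence number the receiver expects
        # next, so every packet below it is implicitly acknowledged.
        self.snd_una = max(self.snd_una, ack)

sender = SlidingWindowSender(cwnd=4)
while sender.can_send():
    sender.send_packet()
print(sender.snd_nxt)     # 4: a full window is now outstanding
sender.on_ack(2)          # packets 0 and 1 acknowledged; the window slides
print(sender.can_send())  # True: two more packets may now be sent
```

Note how no amount of data beyond snd_una + snd_cwnd can ever be in flight, which is exactly the window limitation analyzed in the next chapter.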
On the receiver side, it is usually a fixed value and its sole purpose is to avoid overflowing the receiver's buffers, rather than to react to network congestion or losses. The effective socket size at which the TCP connection will work is the minimum of the sender window size and the receiver window size. In most practical realizations of a TCP connection, the sender window is the actual value that bounds the maximum window size, and it is determined by the slow start and congestion avoidance phases.

Let us describe how TCP performs the three key requirements for a lossless transport protocol:

* Flow control: In regular TCP, flow control is implemented through the receiver window. The receiver, depending on its buffer population, can modify this value. Its purpose is to ensure that packets received out of order or after a loss will not overflow the receiver.
* Congestion control: It is also implemented through the window settings, but in this case on the sender. TCP takes a conservative approach (congestion avoidance) through which it slowly grows the window provided it does not experience losses. This mechanism is aimed at avoiding network congestion, since whenever the sender sees losses, it will shrink this window.
* Recovery: TCP recognizes losses in the network through repeated ACK sequence numbers. When a packet is lost, the destination will keep sending ACKs with that packet's sequence number, allowing the source to know that this packet did not arrive.

Early implementations of TCP (Tahoe) did not distinguish between isolated losses (due to a transient error) and losses due to congestion, and were therefore too sensitive to lossy links. In successive flavors of TCP (Reno), a new mechanism called fast recovery was devised, targeted at recovering from isolated losses (i.e. usually not due to congestion) without slowing down TCP throughput.
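A rough sketch of the two loss reactions, counting the window in packets (a hypothetical simplification of the byte-based bookkeeping a real stack performs):

```python
# Sketch of Reno-style congestion window reactions to loss
# (hypothetical simplification: windows counted in packets).

def window_after_fast_recovery(cwnd, dup_acks=3):
    # Fast recovery: on three duplicate ACKs, halve the window, then
    # inflate it by one packet per duplicate ACK seen, so that a few
    # further packets may still be sent during recovery.
    return cwnd // 2 + dup_acks

def window_after_timeout(cwnd):
    # Timeout (persistent loss): shrink the window to a single packet;
    # everything outstanding from snduna onward is retransmitted.
    return 1

print(window_after_fast_recovery(16))  # 11 = 16 // 2 + 3
print(window_after_timeout(16))        # 1
```

The contrast is the point: an isolated loss costs half the window, while a timeout restarts the window from one packet.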
In the TCP version we are going to work with, TCP Reno (probably the most widely deployed), the recovery information is implied by the ACK sequence number. When the sender receives three consecutive repeated ACKs, the fast recovery phase starts: the demanded packet is retransmitted and the window is halved, but it is also inflated by the number of repeated ACKs received so far, thus allowing transmission of further packets. This mechanism tries to cope with a single isolated loss. If the losses persist, the sender will experience a timeout; the window is then shrunk to one, and every outstanding packet (starting from snduna) is retransmitted.

Some Adverse Effects in TCP

The effect of the receiver window size

One of the characteristics of sliding window protocols like TCP is that they have to wait for acknowledgements of previously transmitted packets in order to slide their window and transmit new ones. A basic theoretical limit is that such a protocol cannot send more data than is contained in a full window before receiving an ACK for some packet in this window. Therefore, for TCP to be working at the maximum rate at which it can transmit, and not be limited by the window size, we need to ensure that the roundtrip time is less than the time it would take to transmit a full window.

In Figure 2, we depict the theoretical limitation that the receiver window size can place on a TCP connection. For each roundtrip time on the horizontal axis, and for a given maximum window value, we can find the maximum throughput that a connection can achieve, regardless of the maximum speed of the bottleneck link. Let us give a quick example: assume the roundtrip time is 100 ms (a value quite common in today's Internet).
As an example, if the maximum window size is 8,760 bytes (generally the value that Microsoft Windows uses by default), we find that:

Maximum throughput = window size (bytes) / roundtrip time (sec.) = 8,760 / 0.1 = 87.6 Kbytes/sec.

That is the maximum throughput that we can expect, and in some cases it will introduce an artificial limitation to our connection. Traditionally, before the popularity of the public Internet, users could be placed in two main categories: local LAN users and modem users. For users sitting on a LAN who want to access local resources, the window size is not generally an issue, since the roundtrip time is very small (on the order of a few milliseconds) and ACKs for packets within a window come back before the full window is exhausted. For users accessing a network from a modem, the speed of the telephone line (56 kbps) itself is the bottleneck, so the window size is again not an issue.

[Figure 2: The theoretical window limitation in TCP — maximum throughput vs. roundtrip time for several window sizes, with typical LAN, modem and xDSL access speeds marked]

But with the popularity of the Internet and the ability to access remote sites, together with new access methods such as cable modem or xDSL at much faster rates, users start to be concerned about the influence of an ill-chosen maximum window size. This is not a problem if we are accessing the Internet from a modem, but if our connection can run at some Mbps, then we are clearly limiting the speed unnecessarily. As a solution, we will explore an algorithm whose main purpose is to break the limitation of improperly tuned receive windows.
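The bound plotted in Figure 2 follows directly from the formula above. A quick check for several window sizes at the 100 ms roundtrip time of the example:

```python
def max_throughput_kbytes(window_bytes, rtt_ms):
    """Window-limited throughput bound: one full window per roundtrip."""
    return window_bytes / (rtt_ms / 1000.0) / 1000.0

# Evaluate the bound at a 100 ms roundtrip for a few window sizes.
for window in (8760, 17520, 35040):
    print(window, max_throughput_kbytes(window, 100))
# 8,760 bytes gives 87.6 Kbytes/sec, and doubling the window
# doubles the bound, regardless of the bottleneck link speed.
```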
Because of the sliding window nature of TCP, we can only send a window's worth of data every roundtrip time from the source to the destination. The main idea of this modification is to split the roundtrip time into two components: from the source up to a switch close to the destination, and from this switch to the destination. By doing this, we confine the limitation of the small receiver window on the destination to the switch-destination system only. In order to do this, we will install a Window Modifier (WM) at the switch, which will modify the value of the receive window advertised by the destination to a value configurable at the switch. Modifying the advertised window size has been explored in previous research [5], but with a focus on traffic management and limitation. We intend to implement this mechanism and demonstrate how this approach improves TCP performance.

TCP in asymmetric networks

We will place special emphasis on analyzing the negative effects on TCP in asymmetric networks, in which the bandwidth available in one direction is different from that available in the other. Among these types of networks, some of the most popular and promising are those used to provide residential access to data and Internet services. In particular, we will focus on Hybrid Fiber-Coaxial (HFC) networks, generally those used by the cable TV industry, and on the cable modem / headend system. Those systems are characterized by downstream transmission speeds of up to 40 Mbps and a shared upstream channel with speeds of up to 5 Mbps. In the downstream channel, all cable modems hear the same signals, but access to the shared upstream channel is controlled by the headend. These kinds of systems have been the subject of numerous research works [4]. The HFC system is a shared medium, and therefore user population and usage will have a great impact on the performance seen by a user.
Under a scenario in which the majority of the traffic flows from the network to the users, i.e. downstream, the fact that the upstream channel is shared may not decrease performance dramatically, since the bandwidth needed upstream is much smaller than that needed downstream. For example, say we have a downstream channel of 10 Mbps and an upstream channel of 1 Mbps. The maximum packet size is 1,500 bytes and the ACK packets are 40 bytes. We assume that the receiver is generating an ACK for each data packet, although this is not usually the case, since most TCP implementations use delayed ACKs and generate an ACK every two or more packets. That means that the downstream/upstream bandwidth ratio is 37.5, i.e. the upstream channel could support a downstream channel of up to 37.5 Mbps. Therefore, for a downstream channel of 10 Mbps, an upstream speed of some 267 Kbps should be enough to accommodate the ACK stream.

However, if any of the users on a different cable modem on the same headend receiver is trying to send data upstream, then the ACK stream for the other users that are obtaining data from the network is likely to be affected. In the previous example, the ACK stream from a user getting data downstream would typically need some 10-20% of the upstream bandwidth. That seems small enough not to be adversely affected by the upstream data trying to get onto the remaining 80%. But in some cases this effect is worse than expected, because the 1,500-byte packets now being sent upstream compete with the 40-byte ACK packets and introduce a larger delay for the ACKs. If that source of upstream data is persistent, then it introduces a further possible disadvantage for the ACK stream.
In that case, the data stream never needs to go to contention, because it can use piggybacking (informing the headend with every packet that it has further data ready to send), whereas the ACK stream will eventually need to go to contention. This translates into further ACK delays that can lead to adverse effects due to the window size limitation mentioned previously, and it dramatically degrades the speed of downstream transfers (even though they need very little bandwidth upstream). We will obtain measurements confirming these effects, but as an example, the roundtrip time can go from around 30 ms without cross traffic up to around 140 ms with cross traffic, values at which the window size limitation may take effect. This is due to the fact that under cross traffic conditions, an ACK is likely to arrive with many other ACKs before it, and has to wait for these ACKs to be transmitted upstream before it can be transmitted itself.

Let us recall now that ACKs in the same TCP connection are cumulative, i.e. an ACK with a higher number carries the same information as an ACK with a lower one. Therefore, if we have two (or more) ACKs for the same connection in a buffer, we can disregard the information of the lower-numbered ones, since it is contained in the highest-numbered ACK. This was first proposed in [4], and it will help solve exactly the problem we have here. Based on this idea, we intend to demonstrate that its application is beneficial under this particular scenario, and extensible to many others.

Flow switching in high-speed networks

As was stated before, TCP is a sliding window protocol, and this nature makes it not very efficient in terms of bandwidth utilization over high-latency networks or networks in which the available bandwidth varies rapidly over time. Ideally, TCP is not able to send more than one window's worth of data every roundtrip time, so having a long delay on a very fast link is a worst-case scenario for TCP.
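To make the long-delay/fast-link problem concrete, consider a hypothetical 1 Gbps path with a 20 ms roundtrip time (illustrative figures):

```python
def window_needed_bytes(link_bps, rtt_s):
    # Bandwidth-delay product: the window needed to keep the link full.
    return link_bps * rtt_s / 8

def window_limited_bps(window_bytes, rtt_s):
    # Throughput actually achievable with a given window.
    return window_bytes * 8 / rtt_s

# A 1 Gbps link with a 20 ms roundtrip needs a 2.5 Mbyte window...
print(window_needed_bytes(1e9, 0.020) / 1e6)   # 2.5 (Mbytes)

# ...while a default 8,760-byte window yields only a few Mbps.
print(window_limited_bps(8760, 0.020) / 1e6)   # 3.504 (Mbps)
```

With an ordinary window, the connection uses well under 1% of such a link, which is why the high-speed case deserves separate treatment.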
TCP is also characterized by the coupling between flow control (i.e. the mechanism that prevents overflowing the endpoints), congestion control (i.e. the mechanism by which TCP reacts to the state of the network), and recovery (i.e. the mechanism by which TCP recovers from losses in the network). This coupling may reduce its throughput, because a loss can come from any of these sources. In addition, the congestion avoidance mechanism can prove to be excessively conservative, forcing the connection to be slower than it could be. All of these control methods operate by means of the ACK packets received from the destination, and this may be a source of problems.

In order to modify TCP, we want to take advantage of two factors: the possibility of having the network inform the sources of the available bandwidth (the accuracy and interface of this information can be discussed later), and the notion of decoupling flow and congestion control from the recovery mechanism. Using these mechanisms, we will be able to improve flow control under circumstances that make current TCP inefficient. Our proposal is based on the two key ideas mentioned before: decoupling the flow and congestion control mechanisms from recovery, and taking advantage of the bandwidth information provided by the network.

In order to achieve these goals, small changes are required in the TCP stack at the source (the receiver stack remains as is). The recovery mechanism remains intact (i.e. duplicated ACKs trigger fast recovery and timeouts cause full retransmissions). The flow and congestion control, however, are going to be decoupled, and will no longer depend on ACK reception. We are going to introduce the notion of TCP modes. Our modified TCP stack will be able to work in two modes: Regular TCP mode and High Performance (HTCP) mode.
The Regular TCP mode is simply the current implementation of traditional TCP stacks. The HTCP mode is characterized by the assumption that the network is able to provide bandwidth information to the source, although how this information is provided is not the object of this description. When the source has this information, it will go into HTCP mode. In this mode, the timing of packet transmission is no longer controlled by the ACKs. We switch from window-based operation to rate-based operation, and the TCP source sends packets according to the rate it has been given by the network. Since we now have information from the network about the available bandwidth, the congestion avoidance phase is no longer needed. The practical result of this change of mode is the overriding of the congestion avoidance phase, thus permitting TCP to transfer at full speed.

The operation of our sender in HTCP mode, however, will still make use of the ACKs received from the destination, in order to ensure lossless transmission. For this purpose, all the recovery functionality of regular TCP is kept unchanged. Repeated ACKs will still trigger fast retransmission of lost packets, and timeouts will result in retransmitting all the outstanding packets (starting from snduna). Therefore, the recovery function is guaranteed to remain functional. We must also ensure that the amount of outstanding data is less than or equal to the receiver window. Therefore, even though the packets are now sent according to a clock self-timed to the bandwidth given by the network, if the data sent and not yet acknowledged grows up to the receiver window, we cannot send any further packets until we receive an ACK. This implies that the receiver window is going to be of great importance and should be set to a fairly large value.
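A minimal sketch of the rate-based sending loop in HTCP mode. The names (send_packet, acked_bytes) and the fixed packet size are hypothetical stand-ins for the stack's real bookkeeping; recovery is unchanged and not shown:

```python
import time

# Hypothetical sketch of an HTCP-mode sending loop: packets are paced by
# the network-supplied rate, but the amount of unacknowledged data may
# never exceed the receiver window (flow control is preserved).

def htcp_send_loop(packets, rate_bps, rcv_window, send_packet, acked_bytes):
    pkt_size = 1500                     # bytes, illustrative
    interval = pkt_size * 8 / rate_bps  # self-timed clock period
    sent = 0                            # bytes handed to the network
    for pkt in packets:
        # Stall while a full receiver window of data is outstanding,
        # waiting for ACKs to free space (acked_bytes() reflects the
        # cumulative ACKs the regular recovery machinery processes).
        while sent - acked_bytes() + pkt_size > rcv_window:
            time.sleep(interval)
        send_packet(pkt)
        sent += pkt_size
        time.sleep(interval)            # pace to the given rate
    return sent
```

The key design point of the mode is visible here: the clock, not the ACK arrivals, schedules transmissions, while ACKs retain their flow-control and recovery roles.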
Let us remember that a basic theoretical limit is that such a protocol cannot send more data than is contained in a full window before receiving an ACK for some packet in this window. Consider the available bandwidth in our high-speed optical network to be on the order of 1 Gbps, with roundtrip delays on the order of tens of milliseconds (say 20 ms). The buffer needed in order to allow full speed on the link would be 20 Mbits (2.5 Mbytes), which is not unfeasible. We should also note that since only one connection at a time may be using the full speed of the link, this buffer space is needed on a per-link basis, not on a per-connection basis, which allows for much more convenient scaling.

Preliminary Analysis

In this section we describe some experiments that helped us understand and further analyze the adverse effects described in the previous sections, in particular the window size limitation and the effect of cross traffic in asymmetric networks. We will first describe the test scenario, and we will then comment on the different measurements that we obtained to support our interpretation of the problems. We will pay special attention here to the type of networks we described before as asymmetric networks, since it is one of the cases in which the two described phenomena may appear most evidently. In particular, we are going to obtain measurements in an HFC (Hybrid Fiber-Coaxial) system, similar to the configuration that can be observed in cable systems, in which a cable router on the premises of a cable provider serves a number of residential users. Those users have access to the network through a cable modem, a device hooked to the standard coaxial cable that provides cable TV service. This Internet access service is already running in some parts of the U.S. and some countries in Europe, so we consider it a realistic scenario for our purposes.
Test scenario in asymmetric networks

In this section we are going to describe the laboratory test we devised, as depicted in Figure 3. As mentioned before, it corresponds to a typical setup one can encounter when analyzing a cable data system. We use two Cable Modems (CM), connected through separate coaxial cables to a Cable Modem Termination System (CMTS). On the same transmitter, we have two coaxial lines with a cable modem each, as if they were two different users served by the same transmitter at the CMTS. One of the cable modems (CM2) has an upstream traffic generator that we will use in certain tests to emulate a user with heavy upstream traffic. The other cable modem (CM1) has a local Ethernet (downstream Ethernet) to which two computers are attached. One of them will be our test subject, on which we are going to monitor TCP performance. The other PC (running Linux and tcpdump) will act as a packet sniffer and will allow us to obtain traffic traces. The packet sniffer also has a second network interface card (eth1) that listens to the Ethernet segment to which the network interface of the CMTS is connected (upstream Ethernet). This setup, with a common time reference, will allow us to track packets and obtain statistics about their delay within the HFC system. We can track the time elapsed for a packet from when it is visible in the upstream Ethernet segment until it appears in the downstream Ethernet (downstream delay) and vice-versa (upstream delay).

[Figure 3: Test scenario in asymmetric networks — a CMTS serving CM1 (a downstream Ethernet with a Win95 test PC and a dual Win95/Linux sniffer PC) and CM2 (with an upstream traffic generator) over RF coax, with the upstream Ethernet connecting to a corporate network and an HP-UX server]

The two arrows in Figure 3 represent the different types of connections that we are going to monitor. They will be from our user PC in the downstream Ethernet to either a machine that is directly on the upstream Ethernet (to minimize the effects of further network segments), or to a machine sitting across some network (which may be the Internet itself). This latter type of connection will be the most interesting, given that it is similar to the kind of TCP connection a residential user will establish. Our tests will consist of measuring the time elapsed in downloading a file using FTP from a server located across a corporate network. The roundtrip time from the user's machine to that server is on the order of 30 milliseconds. We will run the test for different upstream bandwidths, varying the maximum window value that the receiver advertises, and we will add the effect of another user sending traffic upstream on the other cable modem, so we can see the effects it causes on TCP throughput.

The effect of the receiver window size

To isolate the effects of the window size, we do not use the upstream traffic generator in CM2.

[Figure 4: Effect of window size on TCP throughput — maximum throughput vs. upstream bandwidth for several receiver window sizes (8,760 to 35,040 bytes)]

As we can see in Figure 4, the throughput that a TCP session can achieve is going to be limited by the receiver window size settings. The horizontal axis shows the bandwidth available for the upstream channel (which is entirely available for the FTP session, since there is no other traffic on the channel).
Let us recall that the upstream bandwidth in the most restrictive scenario is 640 Kbps, enough to carry a 24 Mbps stream downstream (considering 40-byte ACK packets, 1,500-byte data packets and no delayed ACK), although the downstream is limited to 10 Mbps by the Ethernet segment. Using our formula for the theoretical maximum throughput, for a window of 8,760 bytes and a roundtrip time of 30 ms we obtain a maximum throughput of 292 Kbytes/s, which is clearly consistent with the values shown in Figure 4. The default receiver window value in Microsoft Windows' TCP stack is usually 8,760 bytes, so this is the most likely scenario for a residential user browsing the Internet from his home PC. We can see that simply by increasing that default value, the throughput increases accordingly. When the window size is doubled to 17,520 bytes, we observe the maximum throughput to double as well, which confirms that the reason for such a low throughput with a smaller window is indeed the window value. Further increases of the window do not produce such dramatic results because other effects appear. Therefore we can conclude that an ill-chosen setting for the receiver window size can create an artificial bottleneck.

The effect of cross traffic

In the same test scenario we want to verify the adverse effects of having other users sending data upstream. In an ideal scenario for residential high-speed access (such as a cable system), the users would primarily be browsing the Internet, and therefore the bulk of data transmission would take place in the downstream direction, i.e. from the network to the residential users. That is why an asymmetric cable system would be effective, since the larger bandwidth is offered in the direction that needs it the most, the downstream path. In that case, the TCP packets flowing in the upstream direction would mainly be ACK packets with no data, thus needing much less bandwidth than the downstream channel.
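Both figures above — the 24 Mbps of downstream data whose acknowledgements fill a 640 Kbps upstream channel, and the 292 Kbytes/s ceiling of an 8,760-byte window over the measured ~30 ms roundtrip — follow from simple arithmetic. The following sketch is our own illustration (the function names are not from the thesis):

```python
def ack_sustained_downstream_bps(upstream_bps, data_pkt=1500, ack_pkt=40):
    """Downstream rate (bits/s) whose ACK traffic exactly fills the
    upstream channel, with one 40-byte ACK per 1,500-byte data packet
    (no delayed ACK)."""
    acks_per_s = upstream_bps / (ack_pkt * 8)
    return acks_per_s * data_pkt * 8

def window_ceiling_bytes_per_s(window_bytes, rtt_s):
    """Throughput ceiling imposed by the sliding window: at most one
    window worth of data can be in flight per roundtrip time."""
    return window_bytes / rtt_s

# A 640 Kbps upstream can acknowledge a 24 Mbps downstream flow, yet an
# 8,760-byte receiver window over ~30 ms caps the flow near 292 Kbytes/s.
downstream = ack_sustained_downstream_bps(640_000)
ceiling = window_ceiling_bytes_per_s(8760, 0.030)
```

Note how the artificial bottleneck comes entirely from the window, not from the asymmetric channel itself.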
A possible adverse effect of having a smaller upstream channel (which is also shared by all users on the same CMTS transmitter) arises when some users do not follow the patterns described above. When one of the residential users is sending data in the upstream direction (e.g. running a web server), we have in the upstream channel 1,500-byte data packets together with the 40-byte ACK packets from other users. Let us remember that in this type of cable system there is the issue of media access to the upstream channel. This means that every packet experiences an access delay, i.e. the time from when the cable modem requests a grant from the cable router until the packet is effectively transmitted. This access delay does not depend on the size of the packet itself, but it is affected by the size of other packets: both kinds of packets need the same amount of time to get a grant for transmitting upstream, but the larger the other packets are, the longer they have to wait. Therefore, having larger packets competing for upstream bandwidth increases the delay experienced by the smaller ACK packets.

[Figure 5: Effect of Cross Traffic in TCP throughput. Maximum throughput (Kbytes/s) versus upstream bandwidth (Kbits/s) for receiver windows of 8,760, 17,520, 32,120 and 35,040 bytes, with cross traffic present.]

In order to prove the effect of cross traffic on the performance of regular TCP sessions, we introduced a traffic generator connected to CM2 (see Figure 3).
This traffic generator sends TCP data traffic upstream at the highest rate available. As we can see in Figure 5, when the upstream channel is large the effect is not very important, but as soon as we use smaller channels the effect of cross traffic can prove devastating. We also characterize the effect of different window sizes, and we can verify that for small upstream channels there is no gain in increasing the window size. In particular, for the smallest upstream channel (640 Kbps), the maximum throughput we obtain is no larger than 250 Kbytes/s, as opposed to the nearly 680 Kbytes/s we were obtaining when there was no cross traffic. We want to find an explanation for such degradation, especially considering that for a 680 Kbytes/s stream (approximately 450 data packets per second), the bandwidth used in the upstream channel is less than 150 Kbps (less than 25% of the available bandwidth). Therefore, the mere presence of more traffic should not cause such degradation, since a fair share of bandwidth would still allocate the same bandwidth to the regular TCP connection. We proceed to analyze several traffic traces, trying to characterize the delay experienced by ACK packets on their upstream trip. We measure the upstream delay in the HFC system, i.e. from when an ACK packet is seen at the local Ethernet connected to the cable modem until it appears in the Ethernet segment attached to the cable modem termination system (CMTS, or cable router).

[Figure 6: Upstream delay for ACK packets. Probability density of the CMTS-CM upstream delay (ms), with cross traffic (205.0 Kbytes/s) and without cross traffic (609.7 Kbytes/s).]

In Figure 6, we represent the probability density function (pdf) of this ACK delay, i.e. the probability that an ACK packet will experience a given delay in milliseconds.
As we can see in the figure, when there is no cross traffic in CM2, the bulk of ACK packets spend about 25 ms within the CM-CMTS system. Considering that the whole roundtrip time for this particular setup is on the order of 35 ms, we can see that this delay is the major contributor, but it does not affect TCP performance significantly, since the window size limitation is far from occurring. However, when we switch on the upstream data traffic on CM2, we see the delay dramatically increase up to 140 ms, thus falling within the window size limitation. Now the roundtrip time is close to 150 ms, which would only allow a theoretical maximum throughput of 233 Kbytes/s with a window of 35,040 bytes. Let us examine this effect in more detail. In order to do so, we will establish a breakdown of the CM-CMTS delay into different components:

* Queuing delay: the time elapsed from when an ACK packet is seen on the Ethernet segment near the cable modem until it reaches the head of the upstream waiting queue, therefore allowing the CM to ask for a new transmission grant.

* Access delay: the time elapsed between the grant request and the actual moment at which transmission starts.

* Transmission delay: the transmission time (which depends on the upstream bandwidth) plus the propagation delay on the coaxial segment between the CM and the CMTS.

* Processing delay: the time it takes for the CMTS to process and send the packet to the appropriate network interface (i.e., until the packet is seen in the Ethernet segment to which the CMTS is connected).

We can estimate how important each of these contributions is to the overall delay. At first glance, we can suggest that the transmission delay is going to be independent of the amount of traffic in the upstream channel.
We can also assume that the processing delay variation in the CMTS is negligible, and therefore it will not be dramatically affected by the difference in traffic volume. However, the queuing and access delays are going to increase when we have cross traffic. The access delay increases because the upstream channel is now utilized for longer periods of time, thus reducing the granularity with which new requests can be accommodated. As a side effect of this phenomenon, since the time that the packet currently being served spends at the head of the queue increases, a packet arriving at the queue is more likely to encounter more packets, thus also increasing the queuing delay. We do not have a direct way to separate the effects of the two delays, but we can measure the number of packets that are simultaneously inside the CM-CMTS system to get an idea of how the queue size at a cable modem increases due to cross traffic. In Figure 7 we picture the evolution of two TCP connections: one of them is a regular TCP session with no other traffic, compared to a second connection under the effect of cross traffic. In the isolated connection we see how the number of ACK packets within the system has an average value of approximately 5 packets, but the variation is very fast, indicating that ACKs may be generated in bursts. This assumption is justified by the fact that under a high-speed connection, and due to TCP slow start characteristics, it will not be uncommon to have bursts of data packets, thus leading to bursts of ACK packets. On the other hand, when we have cross traffic, the number of packets in transit increases systematically up to 12 packets, with little variation. This indicates that, since the transmission and processing delays are considered invariant, the queuing and access delays dramatically increase. Since the access time increases due to the bigger data packets present at CM2, the queues at the cable modem grow bigger.
We can try to extract more information on how the delay is characterized, in order to conclude definitively about the causes of such a dramatic increase in delay. To do that, we want to depict the delay that an ACK packet experiences as a function of the number of ACK packets that still remain within the HFC system. We show such a chart in Figure 8, in which the horizontal axis is the number of ACK packets present in the system when a given ACK packet arrives at the CM, and the vertical axis corresponds to the delay in milliseconds.

[Figure 7: ACK packets inside the CM-CMTS system. Number of ACK packets in transit versus time (s), with and without cross traffic.]

[Figure 8: Delay of ACK packets (ms) as a function of the number of ACK packets inside the CM-CMTS system upon arrival, with and without cross traffic.]

We can estimate the different contributions to the delay from the shape of the data points. Each ACK packet has a fixed delay component, corresponding to the transmission and processing delays. This value is the common offset that we find in both sets of data, and can be estimated at around 15-20 ms. On top of it comes the part of the delay that varies when we have cross traffic, namely the queuing delay and the access delay. We can estimate the access delay through the slope of the mean delay as a function of the number of ACK packets present upon arrival. In the case where there is no cross traffic, when there is just one packet the total delay experienced is around 20 milliseconds; when there are 10 ACK packets ahead of a new packet, the delay is around 40 ms, which gives us an estimated access delay of 2 ms per packet. Let us now look at the case in which there is upstream data traffic on CM2. The delay when there is only one ACK packet is around 30 ms, whereas the delay when there are 10 more packets in the queue is around 140 ms. This gives us an access time of around 10 ms per packet. We confirm this estimate by noting that the difference between the two cases in the delay for just one packet is around 8 ms, as shown in the figure. These numbers confirm our fear that the effects caused by upstream data traffic are much worse than simply adding more traffic to the upstream channel: the access delay per packet is increased four or five times, creating a situation in which several ACK packets for the same TCP connection have to wait in the same queue.

This effect suggests an immediate improvement, which is the use of concatenation. Concatenation is a feature that a cable system may implement in which grants for upstream transmission are issued not on a packet-by-packet basis, but rather depending on the amount of data available for transmission. In systems without concatenation, grants for transmission time on the upstream channel are given on a packet-by-packet basis, regardless of the size of the packet. This leads to adverse situations, especially under the scenario with cross traffic in another cable modem. The effect is mitigated with the use of a technique called piggybacking, which allows the cable modem to inform the cable router (the element that actually does the scheduling), within an upstream data transmission, that there is more data waiting to be transmitted. This mechanism allows persistent sources to save some time in contention for the upstream channel, but it does not prevent the larger access time due to the mismatch in packet sizes.
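The benefit of concatenation can be illustrated with a toy grant count. This is purely our own simplification (the `max_burst` parameter is an assumption; real cable systems bound concatenated grants by bytes, not packets):

```python
def upstream_grants(queue_pkts, concatenation, max_burst=4):
    """Number of grant requests needed to drain a cable modem's upstream
    queue. Without concatenation every packet needs its own grant; with
    it, one grant can cover up to max_burst queued packets."""
    if not concatenation:
        return queue_pkts
    return -(-queue_pkts // max_burst)  # ceiling division

# Draining the 12 queued ACKs observed under cross traffic takes 12
# separate grants packet-by-packet, but only 3 grants when up to 4
# packets can be concatenated per grant.
grants_plain = upstream_grants(12, concatenation=False)
grants_concat = upstream_grants(12, concatenation=True)
```

Since each grant carries the full access delay, cutting the number of grants directly cuts the accumulated access delay of queued ACKs.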
For the measurements presented in this work, the cable modem was using piggybacking but not concatenation. The fact that the access time increases with cross traffic, and that this leads to an increasing queue size in the cable modem, suggests that concatenation could be useful in mitigating this effect, although it would increase the processing complexity at the cable modem.

Some possible solutions

In this subsection we describe some of the solutions that can be devised to relieve some of the adverse effects seen in the previous sections. Then, in the next section, we will develop those that are of particular interest for us, describing in detail their functionality and degree of improvement.

ACK Suppression (AS) at CM

This is a method for alleviating the problem caused by accumulated ACKs during cross traffic. It was proposed in [4], after an idea of P. Karn, and is based on the fact that TCP ACK packets are cumulative, so when we have several of them from the same connection in the same queue, only the most recent one is relevant. The main characteristics of this solution are:

* No extra buffering required: we only need an efficient management of the queue that allows for quick removal.

* May create burstiness: when we eliminate intermediate acknowledgements, the TCP source may tend to become bursty.

* Maintaining per-flow state would be ideal, but it is not strictly necessary.

Window modification at CMTS or CM

This solution is aimed at solving the window size limitation described in the previous sections. The main idea is to reduce the scope of the window size limitation to a smaller part of the whole path, by installing somewhere along that path a module that modifies the receiver window advertised by one end of the TCP connection.
By increasing this value, we allow the sender to transmit data faster and avoid the typical stalling of the TCP connection due to the mismatch between window size, roundtrip time and link speed. This solution may require some amount of buffering, and it needs to maintain per-flow state, so it adds a degree of complexity that we have to deal with. On the other hand, and as opposed to other similar full-proxy solutions that have been proposed, this mechanism would be soft state, i.e. in the event of a failure of the Window Modifier software, the TCP connection would not be damaged and would recover from the failure by itself.

ACK arrival estimation at CMTS

Another solution for the ACK compression problem takes into account that the CM-CMTS segment of the TCP connection is common to both directions: the upstream and downstream paths traverse the same network elements (a property that cannot be guaranteed in the public Internet segment of the connection). Due to this property, and since the CMTS is able to monitor the data packets flowing downstream, it could estimate when the cable modem will have an ACK ready to be sent upstream. With this estimate, the CMTS could issue an unsolicited grant to the CM before the CM has to ask for the grant itself, therefore reducing the access delay an ACK packet would experience. This solution has a major drawback in its complexity, since we not only need to keep per-flow state at the CMTS, but must also devote processing time to estimating the CMTS-CM roundtrip times. We also have to alter the scheduling process at the CMTS, therefore introducing other problems that could reduce the scheduling efficiency.

ACK reconstruction (AR) at CMTS

This method was also proposed in [4]. It basically tries to reconstruct a constant stream of ACKs in such a way that the TCP sender transmits data in a non-bursty manner.
This method would be quite efficient for smoothing a stream of ACKs that would otherwise cause the TCP sender to behave in bursts, by delaying the ACKs and releasing them according to an estimated rate. However, if the problem is the opposite, and we need not only to delay but occasionally to generate ACKs, then this mechanism becomes hard state, with strong implications in case of failure.

TCP with high-speed flow switching

In this section we describe some scenarios in which the use of high-speed flow switching does not allow TCP to fully utilize all the optical bandwidth it has available. As described in the introduction, our objective here is twofold. On one hand, we want to ensure that the transfer protocol can use all the bandwidth it has available. On the other hand, we want to maintain the features that make TCP so robust and which helped it become a de facto standard in networking. In TCP as we know it today, however, the mechanisms that control the flow of data (i.e. using the bandwidth efficiently) and those that ensure that the transfer is correct and complete (i.e. robustness as a lossless protocol) are combined. They obtain their information from a unique source, the ACK packets. These packets not only inform TCP of which packets have been correctly received, but at the same time (due to the sliding window mechanism) control how fast new information is transmitted. In order to accomplish our first objective efficiently, we have to come up with a scheme that allows separating those two mechanisms. However, in order to do so, we need some information about the current characteristics of the network, and the possession or not of this information will define two possible modes of operation.
One is regular TCP as we know it today; the second mode (which we call High-performance TCP, or HTCP), while maintaining TCP's recovery features, allows the flow control to be governed by the information the network has provided. In this mode, we no longer use the standard sliding window mechanism, but send data at the rate specified by the network, limited only by the amount of information that can be stored in the sender's buffers. We are especially interested in a scenario in which we have several concurrent TCP transfers from separate sources over a bottleneck link, as shown in Figure 9. This bottleneck link actually consists of a high-speed optical fiber (e.g., OC-12) in parallel with a much slower conventional wired link. Traffic can be routed to the high-speed fiber only on a per-incoming-link basis, which means that in our scenario only one of the TCP transfers can be routed over a given wavelength in the fiber.

[Figure 9: Scenario for optical flow switching and TCP. Sources 1 to n connect through switches to a bottleneck consisting of an OC-12 (622 Mbps) optical link in parallel with a 10 Mbps wired link.]

All other transfers are routed to the remaining wired link, on which they have to share a bandwidth significantly smaller than that of the optical link. Under this scenario, in order to achieve a fair share of the bottleneck link, we should establish some kind of rotation for the set of TCP connections sharing the link. We use a simple round-robin scheme in which access to the optical link is given alternately to a single connection (although the way in which this is scheduled is not the object of this work). We will rather focus on devising mechanisms by which a TCP connection can efficiently be switched from the shared wired link to the high-speed optical link and vice-versa. We have to take into account that, since the bandwidth difference between the two links is fairly large, TCP will have problems adapting its rate to the new bandwidth availability.
In particular, when TCP is switched to a faster link, it might take too much time to realize that extra bandwidth is available. As a rule of thumb, a TCP transfer in congestion avoidance mode can increase its window by one segment every roundtrip time. Therefore, it will need several roundtrip times to increase the window enough to fully utilize the new bandwidth. In addition, if the time allotted to a TCP connection is not much larger than several roundtrip times, TCP's own congestion avoidance mechanism may not allow the connection to reach the throughput it could achieve on the high-speed optical link. Worse, by the time TCP can utilize such bandwidth, it may not be there anymore, clearly causing a waste of resources. On the other hand, the transition from a fast link to a slower one can take effect in a short period of time, since the TCP stack will detect the eventual loss of some packets and slow down the transfer. However, if the adaptation is performed without caution, the TCP transfer may end up sending too many packets, therefore experiencing too many losses and, more importantly, causing losses to others. Therefore, it is crucial to carefully devise the mechanisms that allow an efficient transition between the two modes. In the following subsections we describe in more detail the peculiarities of each transition.

Mode switching

We define mode switching to be the transient phase between the two modes. The transition from regular mode to HTCP mode is done simply whenever we receive information from the network about the available bandwidth. The reverse transition (from HTCP back to regular TCP mode) is done when the network explicitly tells the sender so, or when the sender can no longer trust the received information. It is not the object of this work to define the mechanisms by which this information is communicated, but a simple information packet exchange can be thought of.
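The rule of thumb above (one segment of window growth per roundtrip time in congestion avoidance) lets us estimate how long regular TCP would need before it could fill the optical link. A back-of-the-envelope sketch, with names of our own choosing:

```python
import math

def rtts_to_fill(link_bps, rtt_s, start_window_pkts, pkt_bytes=1500):
    """Roundtrip times needed for congestion avoidance (+1 segment per
    RTT) to grow the window from start_window_pkts up to the
    bandwidth-delay product of the new link."""
    bdp_pkts = link_bps * rtt_s / (8 * pkt_bytes)
    return max(0, math.ceil(bdp_pkts - start_window_pkts))

# Growing from a 10-packet window to fill a 622 Mbps link at a 10 ms
# roundtrip takes on the order of 500 roundtrips, i.e. several seconds;
# if the connection's turn on the optical link is shorter than that,
# the bandwidth is never fully used.
n = rtts_to_fill(622e6, 0.010, 10)
```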
The second kind of transition, back to regular TCP, is, as we have explained before, slightly more complicated and critical, because there will be a larger amount of data in flight, and this may cause problems due to the bandwidth mismatch. We describe here the main characteristics of the two transition phases:

* Regular TCP -> HTCP: When switching to High-performance TCP mode, we only need to keep the last value of sndcwnd for its eventual use when switching back to regular TCP mode. Depending on the strategy we follow for the other transition, the value of sndcwnd may or may not be modified during HTCP mode.

* HTCP -> Regular TCP: This is the most delicate part, since now we have to return to regular TCP mode. We have a large amount of data in flight, so the frequency of ACKs is going to be high, although the link is back to a lower rate. There is a number of approaches we can take, but they will be focused on two objectives: preserving the transfer's own efficiency (i.e., avoiding causing losses to the TCP connection itself), and preserving other transfers' evolution (i.e., avoiding causing extraordinary losses to other connections). The basic scenario that we encounter upon this transition is a large window (i.e. a high number of outstanding packets) that has to be decreased to a much smaller one. In addition, we have the inconvenience that during the first moments of regular TCP mode, there will be a great number of still-unacknowledged packets that were sent during HTCP mode. The timing of those ACK packets has nothing to do with the state of the newly recovered TCP connection, and they will arrive at a much higher rate than they would under the newly adopted connection characteristics. This can lead to the source sending too many packets too quickly, causing overflow in the bottleneck link buffer, which will translate into losses and stall the ongoing connection.
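The bookkeeping across the two transitions can be summarized as a small state machine. The sketch below is our own illustration of the description above (class and variable names are ours; the signaling by which the network communicates its rate is not modeled):

```python
REGULAR, HTCP = "regular", "htcp"

class ModeSwitcher:
    """Tracks the sender mode and preserves sndcwnd across transitions."""

    def __init__(self, sndcwnd):
        self.mode = REGULAR
        self.sndcwnd = sndcwnd
        self.saved_cwnd = None
        self.network_rate = None

    def enter_htcp(self, rate_bps):
        # Regular TCP -> HTCP: keep the last sndcwnd for the way back;
        # flow control is now driven by the network-provided rate.
        self.saved_cwnd = self.sndcwnd
        self.network_rate = rate_bps
        self.mode = HTCP

    def leave_htcp(self):
        # HTCP -> Regular TCP: restore the saved sndcwnd (one of the
        # possible strategies) and let the recovery phase begin.
        self.sndcwnd = self.saved_cwnd
        self.network_rate = None
        self.mode = REGULAR
```

Restoring the saved sndcwnd corresponds to one of several return strategies; the recovery phase discussed next deals with the packets still in flight.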
The recovery phase

We devote this subsection to the most critical part of the mode switching mechanism in our proposal: the window size recovery in the transition from HTCP to regular TCP. When this transition occurs, the most likely scenario has the following characteristics:

* The number of packets outstanding at this point can be as large as the maximum window size, since we are overriding the congestion avoidance mode. During the HTCP phase, data is sent at a rate according to the information provided by the network, not according to the timing provided by the ACK packets. Therefore the number of outstanding packets can grow to reach the maximum window size or to fill the bandwidth-delay product, whichever is smaller.

* Since a number of outstanding packets have been sent at high speed, when the ACKs for those packets arrive at the sender, they will do so at a rate corresponding to that at which the packets were sent. If this rate is much higher than the rate the new regime can accommodate, and we slide the TCP window according to those ACKs, we are likely to send data at too fast a rate.

* The value of the window size that will bound the throughput in the new regime is somewhat unknown, although we can use different methods to estimate it. In any case, it is most likely that this new window is going to be much smaller than the number of outstanding packets at the time of the HTCP -> TCP mode switching. This implies that there are a number of strategies we can think of in order to bring the number of outstanding packets down to the value we have estimated for the new regime.

According to this description, we define some phases of the transition that will ease the description of our proposed solution. First, we define what we call the recovery phase.
It is the lapse of time from the reception of the information that suspends HTCP (or the moment at which, according to that information, we decide to end that mode) until the instant at which we deem the connection recovered and switch back to fully regular TCP. Within the recovery phase we distinguish two separate phases: the window adaptation phase and the rate adjustment phase. The transfer is in the window adaptation phase while it is adjusting the number of outstanding packets we had at the end of HTCP mode down to the window size at which we want to bring the transfer in the new regime. As an example, if at the end of HTCP mode there were 100 outstanding packets and we wish to bring the connection to a window of 20 packets, the window adaptation phase extends over the time in which the number of outstanding packets is reduced from 100 to 20. However, even once we have reached the desired window size, the remaining 20 ACKs for packets sent during HTCP mode are likely to arrive at a higher rate than is actually available on the new, slower link (since those ACKs correspond to data packets sent over the high-speed link). Therefore, we define the transfer to be in the rate adjustment phase while the ACKs being received correspond to packets sent during HTCP mode. As soon as we receive an ACK for a packet that was sent under the newly resumed TCP regime, we end the rate adjustment phase (and, by consequence, the recovery phase). We have to note that, depending on the particular circumstances of the connection, the breakdown of the recovery phase may differ. In particular, when the number of outstanding packets at the end of HTCP mode is equal to or less than the new desired window, there will be no window adaptation phase.
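The numeric example above (100 outstanding packets brought down to a 20-packet window) can be traced explicitly. A sketch of the phase accounting under the conservative strategy (no data sent while the window is being reduced); the function name is ours:

```python
def recovery_phases(outstanding, new_window):
    """ACKs consumed by each phase of the recovery, assuming no data is
    sent while the window shrinks. Window adaptation lasts until the
    number of outstanding packets drops to new_window; rate adjustment
    covers the remaining ACKs for packets sent in HTCP mode."""
    adaptation_acks = max(0, outstanding - new_window)
    rate_adjust_acks = min(outstanding, new_window)
    return adaptation_acks, rate_adjust_acks

# 100 packets in flight, target window of 20: the first 80 ACKs only
# shrink the window; the last 20 still arrive with high-speed timing.
phases = recovery_phases(100, 20)
```

Note that when the outstanding count is already at or below the target window, the adaptation phase is empty, matching the degenerate case described above.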
Similarly, when the newly desired window is equal to one packet (corresponding to the most conservative solution, in which the TCP regime starts practically from scratch), there will be no rate adjustment phase.

There are a number of strategies that we can follow in order to make both window adaptation and rate adjustment efficient and not prone to losses. Some of them have been proposed in the literature ([11], [25]), although they are applied in different scenarios than the one we are dealing with. One of the critical factors is the desired value of the TCP window that we want to return to. There are a number of choices, differing in their degree of aggressiveness. The most conservative is setting the window to one packet, which is almost equivalent to restarting TCP in the new regime from scratch. If we want to be more aggressive, we can choose to come back to the value the window had prior to switching to HTCP mode. An even more aggressive mechanism is to grow the window during HTCP as if we were in regular TCP mode. Finally, if we want to be almost suicidal, we can simply set the window immediately to the current number of outstanding packets, hoping that the new regime can sustain that rate (which is highly improbable). After deciding the value of the new window to which we want to reduce the rate, which can be estimated using some of the propositions in [25], we are faced with the problem of effectively reducing the current window to the desired value. During the window adaptation phase we can take the following approaches, depending on the trade-off we want to make between safety and speed in recovering the desired window value:

* Conservative: In order not to cause any additional loss due to the window mismatch, we will not send any data packet until the desired window is reached.
We will receive ACKs during this period, but we only incorporate the information they carry about correctly received packets. They will not trigger the transmission of new packets and will only help close the window down.

* Aggressive: In order to avoid the stall caused by the conservative approach, we can think of sending data packets interspersed between the ACKs. We need to set a parameter (htcpack_intersp_) that controls how many ACKs we let pass before sending a data packet (which obviously has to be greater than one if we want to decrease the window). However, since the timing of the ACKs is completely unrelated to the actual conditions of the slower link, even a high parameter value may lead to an excess of data packets.

There are also some subtleties as to how the initial conditions of the window adaptation phase are set. As an example, suppose we decide that the new window is going to be w_n packets, while the current number of outstanding packets is much larger, say w_h. We need to send at least one more packet before entering the window adaptation phase, in order to allow the new regime to provide the first ACK with the delay characteristics of the new situation. However, we can send more than one packet right before entering this mode. In that case, we would have sent some fraction of the newly desired window under the new characteristics, and this may help the TCP connection resume its normal behavior faster. Say that we send n packets (with n <= w_n) before entering the window adaptation mode. Of the w_h + n packets now outstanding, the ACKs for those sent during HTCP mode will still arrive under the high-speed timing, but the rate adjustment phase becomes shorter (w_n - n ACKs). In the extreme case, when we send in the new regime as many packets as the new window allows (n = w_n), the rate adjustment phase disappears. Similarly, during the rate adjustment phase, we may apply some of the same approaches described above, or a combination of them.
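The aggressive approach can be sketched as an ACK handler that releases one data packet every so many ACKs during window adaptation. This is our own illustration; the parameter mirrors the thesis's htcpack_intersp_ but the function and its return values are assumptions:

```python
def window_adaptation(outstanding, new_window, htcp_ack_intersp=4):
    """Process ACKs until the number of outstanding packets reaches
    new_window, sending one data packet per htcp_ack_intersp ACKs.
    Returns (acks_consumed, packets_sent). With intersp > 1 the window
    shrinks by a net (intersp - 1) packets per intersp ACKs."""
    assert htcp_ack_intersp > 1, "must be > 1 or the window never closes"
    acks = sent = 0
    while outstanding > new_window:
        acks += 1
        outstanding -= 1          # an ACK removes one packet from flight
        if acks % htcp_ack_intersp == 0:
            sent += 1
            outstanding += 1      # interspersed data packet re-enters flight
    return acks, sent
```

The conservative approach is the limiting case where no packets are interspersed at all: the window closes by one packet per ACK, at the cost of stalling the transfer.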
If, at the instant we reach the rate adjustment phase (i.e., when the number of outstanding packets equals the desired window size, but we have not yet received ACKs for packets sent during the new regime), we start "trusting" the timing of the incoming acknowledgements, we might still experience problems. This is due to the fact that these ACKs still do not carry any information about the current conditions of the new regime. We will receive w_n - n such acknowledgements, and when this number is still too large for what the new regime can accommodate, the TCP transfer may still experience losses. Therefore, for the rate adjustment phase we can apply an interspersing mechanism similar to the one described above for the window adaptation phase.

Implementation of Solutions Proposed

In this section we will describe the implementation of two of the solutions proposed in the previous section. We will describe in detail the mechanisms involved in each of the solutions and how they can mitigate the effects of the problems they aim to solve. We will then describe the practical implications of their implementation. We will finish by presenting the results we collected to prove the benefits of these solutions.

Window size modifier (WM)

The main purpose of this algorithm is to break the limitation imposed by improperly tuned receiver windows. As explained in the previous sections, because of the sliding-window nature of TCP, we can roughly send only one window's worth of data every roundtrip time between the TCP sender and receiver. The main idea of this modification is to split the roundtrip time into two components, Source-CM and CM-Receiver, and to keep the limitation of the small receiver window only in the CM-Receiver system.
A number of proposals have been made to use the receiver window to adapt TCP behavior to the conditions of the network, but they have focused on rate control, i.e. slowing down overly aggressive TCP transfers. A proposal described in [5] implements this idea by reducing the receiver window size and therefore explicitly reducing the actual TCP window size. Our approach is different in the sense that it is not used to reduce the TCP throughput, but rather to allow TCP to access the available bandwidth more efficiently. However, our scheme can also be used as a rate limiter, thus giving the service provider more control over the quality of service perceived by the users. In order to accomplish this split of the window size limitation, we will install a Window Modifier (WM) at the CM, which will change the value of the receive window advertised by the CPE to a value configurable at the CM. In order to preserve TCP end-to-end semantics and remain a soft-state solution, the WM will not generate any spurious ACK nor retransmit any packet. It will eavesdrop on ongoing TCP connections and change the value of the receive window advertised by the receiver. It will monitor the data coming downstream and will hold (buffer) those packets that are outside the true advertised window on the CM-Receiver system, until a new ACK makes room for them. When a packet is buffered at the CM, it means that it has not been sent yet to the receiver. As soon as it is sent, the CM no longer has it, nor does it keep a copy. Therefore, we are not implementing a proxy or retransmission service at the CM, but only providing temporary buffering to avoid receiver window overflows. No spurious ACK packets are generated, and buffering is only needed up to the difference between the true advertised window and the modified value.
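The downstream forward-or-buffer decision just described can be sketched as follows (a minimal illustration with names of our choosing, not the cable modem code): a data packet is forwarded only if it fits inside the true window advertised by the receiver.

```python
# Hedged sketch of the WM downstream decision.  seq/size are in bytes;
# last_ack is the highest byte the CPE has acknowledged; true_window is
# the window the CPE really advertised.

def wm_downstream(seq, size, last_ack, true_window, buffer):
    """Return 'forward' or 'buffer' for a downstream data packet."""
    if seq + size <= last_ack + true_window:
        return "forward"            # within the receiver's true window
    buffer.append((seq, size))      # held until an upstream ACK slides the window
    return "buffer"

buf = []
# True window of 8,760 bytes; receiver has ACKed up to byte 10,000.
print(wm_downstream(10000, 1460, 10000, 8760, buf))   # forward
print(wm_downstream(18000, 1460, 10000, 8760, buf))   # buffer: ends past 18,760
```

Buffered packets would later be released by the upstream ACK processing, which is what keeps the scheme a pure delay (no retransmission state).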
However, we do need to maintain per-flow state, but the overhead involved is reduced to identifying the flow, modifying the window size value within the TCP packet and recalculating the TCP checksum, since the MAC CRC is automatically recomputed by the hardware. This approach allows the WM to be considered soft state. In the event of a WM failure, TCP's own recovery mechanisms will be enough to continue the data flow. A full TCP proxy is considered hard state, and the implications of a failure are much worse than in the WM case. In Figure 10 we present an example of operation of the Window Modifier mechanism. In a system with a receiver window value of 8,760 bytes (the default in most Microsoft Windows systems) over a roundtrip time of 40 milliseconds, the maximum throughput we can achieve is 219 Kbytes/s (i.e. 8,760 bytes every roundtrip time). In a cable modem system in which the downstream speed can be up to 40 Mbps (5 Mbytes/s), this is clearly a waste. If we place a WM module in the cable modem, with a modified value of 35,040 bytes, then the limitation only takes place in the segment with minimum theoretical throughput. In the Receiver-CM segment (roundtrip of about 5 ms), the limitation is 8,760 / 0.005 = 1.75 Mbytes/s. In the CM-Sender segment (roundtrip of about 35 ms), the limitation is 35,040 / 0.035 = 1 Mbytes/s. Therefore, the limiting segment is now the CM-Sender segment, with a maximum window size of 35,040 bytes, which would allow a maximum throughput of 35,040 / 0.04 = 876 Kbytes/s, i.e. almost 4 times the previous throughput.

[Figure 10: Functionality of the Window Modifier]

We tested the performance after implementing this feature and it works as expected.
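The arithmetic in the example above can be checked with a short script. The maximum TCP throughput of a segment is window / RTT, and the end-to-end limit is the smallest of the per-segment limits; the 5 ms / 35 ms roundtrip split is the one implied by the example, and the function name is ours.

```python
# Sketch of the window/RTT throughput arithmetic from the WM example.

def max_throughput(window_bytes, rtt_seconds):
    return window_bytes / rtt_seconds            # bytes per second

# Without the WM: one 8,760-byte window per 40 ms roundtrip.
print(max_throughput(8760, 0.040) / 1000)        # about 219 Kbytes/s

# With the WM at the CM (5 ms Receiver-CM, 35 ms CM-Sender roundtrips):
receiver_cm = max_throughput(8760, 0.005)        # about 1.75 Mbytes/s
cm_sender   = max_throughput(35040, 0.035)       # about 1.0 Mbytes/s
print(min(receiver_cm, cm_sender) / 1e6)         # limiting segment, Mbytes/s
```

The limiting segment is the CM-Sender one, consistent with the roughly fourfold improvement (219 to 876 Kbytes/s) quoted in the text.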
We will show the exact details in the results subsection, but as an example, for a PC with a true receiver window of 8,760 bytes, it speeds up an FTP transfer from around 260 Kbytes/s (without the WM) to 650 Kbytes/s with a fake window of 35,040. It is clear that the WM feature relies on a misconfiguration of the customer PC, and if the user were able to set this value in the Windows registry, under the key HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\VxD\MSTCP\DefaultRcvWindow, the WM feature would be of little effect; but for many users this simple setting may be beyond their knowledge. On the other hand, more knowledgeable users may exploit this configuration to get better service than others, even setting this value to an impractically large value. This could create an adverse effect, possibly allowing intermediate buffers to overflow. Therefore, it is interesting to have a means of controlling this setting in a way that is transparent to the user. It may be thought of as a rate limiter, to avoid false expectations of service when there are more users in the future (this is of great importance for the cable companies) and/or to provide a uniform service to all users. Also, given the way it is implemented, it would be relatively easy to introduce different classes of service: larger windows for some users, smaller for others, no matter what the actual settings on their PCs are. Everything can be easily controlled at the cable modem (and consequently, from the CMTS). This method was first thought to be implemented at the cable router, but it was finally decided to add it to the cable modem for two reasons:

* Since we need to keep per-flow state, having it at the CM makes it much more easily scalable, since the number of flows at the CM is limited by the number of computers connected to it (usually one) and the number of simultaneous flows per computer (usually some tens).
* Assuming the small window is at the user equipment, the closer to it, the better. Also, most of the delay is on the CM-CMTS upstream path, so placing it at the CM allows us to constrain the small window to a very low delay (basically, the Ethernet delay between the computer and the CM).

ACK suppression (AS)

One interesting phenomenon we saw in the analysis of the traces under cross traffic was the queuing effect on the CM-CMTS system. This refers to the time between seeing a packet on the user's Ethernet segment (the one to which the user's computer and the cable modem are connected) and on the headend network interface (the segment to which the headend is attached on the network side). Without cross traffic, the typical delay is on the order of 20 milliseconds. But with cross traffic, this delay can grow up to 120 ms. There is also a correlation between this delay and the number of previous ACKs still in the system when a given ACK arrives, i.e. the more ACKs in the system, the greater the delay. As is shown in Figure 8, under cross traffic conditions, an ACK is likely to arrive with around 10 ACKs before it, and it has to wait for these ACKs to be transmitted upstream before it can be transmitted itself. Let us recall now that ACKs in the same TCP connection are cumulative, i.e. an ACK with a higher number carries the same information as an ACK with a lower one. Therefore, if we have two (or more) ACKs for the same connection in a buffer, we can disregard the information of the lower numbered ones, since it is contained in the highest numbered ACK. We only drop ACKs with no data associated, and we apply a special treatment to ACKs that may trigger fast retransmission.
Repeated ACKs serve as a means for the sender to know whether there is congestion or an error during the transmission; although they carry no new explicit information, the mere fact of receiving a repeated ACK is very important implicit information. Therefore, we will not purge any repeated ACK, only those that have not been repeated and are superseded by a higher numbered ACK. The possible drawback of applying this algorithm is an increase in the burstiness of the TCP connection. Since we would be receiving fewer ACKs, each acknowledging more data, it is more likely that the source will send bursts of packets, rather than the constant stream it would send with an ideal ACK stream. However, the possible adverse effects of this increased burstiness are outweighed by the performance improvement brought by the reduction in queuing delay.

WM and AS Implementation Details

We will describe here the implementation details of the solutions proposed above. Since both (and possibly any approach to improve TCP performance) need to operate on a per-TCP-flow basis, we created a common set of routines to characterize and log TCP flows. On top of these, we run the two new performance improvement methods. As explained in the introduction, both solutions are independent and mutually exclusive, so they will have little or no interaction. As shown in Figure 11, we define the upstream path in the CM to be the set of procedures applied to a packet in transit from the Ethernet interface to the coaxial interface. Similarly, the downstream path is defined to be the set of procedures applied to a packet in transit from the coaxial interface to the Ethernet interface. The operation of the different solutions also depends on the path we are considering.
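The suppression rule described above (purge only dataless, non-duplicate ACKs that are superseded by a higher cumulative ACK) can be condensed into a small decision function. This is an illustrative sketch with names of our choosing, not the module's code.

```python
# Hedged sketch of the ACK suppression rule.  ACK numbers are cumulative
# byte sequence numbers; queued_ack is the ACK currently cached for this
# connection (None if there is none in the queue).

def as_decision(new_ack, queued_ack, has_data, is_duplicate):
    """Return what to do with an upstream ACK packet."""
    if has_data or is_duplicate:
        return "enqueue"            # never touch these: data / fast-retransmit
    if queued_ack is None:
        return "enqueue"            # nothing in the queue to supersede
    if new_ack > queued_ack:
        return "overwrite"          # cumulative: the old ACK is redundant
    return "enqueue"

print(as_decision(5000, None, False, False))   # enqueue
print(as_decision(5000, 4000, False, False))   # overwrite
print(as_decision(4000, 4000, False, True))    # enqueue (duplicate preserved)
```

Only the "overwrite" branch removes an ACK from the queue, which is why duplicate ACKs, and therefore fast retransmit, are unaffected.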
The basic operations for both modules are as follows:

Window Modifier Operation

The operation of the window modifier differs depending on the path. On the upstream path we have to modify the advertised receiver window value, and on the downstream path we have to ensure that the user's machine does not get overflowed with packets outside its true window.

[Figure 11: Cable Modem Block Diagram]

* Upstream path: The module extracts the value of the window advertised by the CPE on every packet flowing upstream and modifies it to the window value specified by the configuration. This involves recalculating the TCP checksum. If the packet is an ACK, it computes the new window available at the CPE and, if there are packets buffered, it sends those within the new window.

* Downstream path: The module checks whether the packet is within the CPE window. It buffers the packet if it is outside that window (it would overflow the CPE if forwarded immediately) or simply forwards it if it is within the window.

ACK Suppression Operation

This module is only applied to ACK packets with no data in the upstream path. Since going through the entire queue every time is time consuming and difficult to implement (and we don't want to alter the queue management), this module keeps a pointer to the most recent ACK in the queue (if any). Upon receipt of an ACK, if there is no pointer associated with this connection, that means there are no ACKs for this connection in the queue, so we simply update the pointer and enqueue the ACK packet.
If the pointer is not null, we verify that the ACK in the queue can be substituted by the current ACK, overwrite the contents of the queued ACK with the current ACK, and drop the buffer holding the current packet (since its contents have been copied ahead in the queue). This implementation needs a callback in order to null the pointer every time an ACK that is pointed to leaves the queue.

Interface description

The whole set of routines is self-contained in a TCP enhancement module (tcpenh.c), and the existing platform interfaces with the new module through a small set of routines:

* TCPinit: This routine initializes all variables used by the common set of routines. It should be called once at boot time.

* TCPctl: This is the single point of entry to the TCP enhancement module. It is passed a pointer to the packet and a value indicating the originator of the packet: CMTS (downstream path) or CPE (upstream path). It returns 0 when the values have been successfully updated and the packet can continue to be processed. It returns -1 when there is no further processing needed for this packet. In the downstream path, that means the packet has been buffered and will be sent to the CPE later. In the upstream path, it means that an ACK has been dropped or its contents have overwritten a previous one, so the buffer containing the current ACK packet can be freed.

* TCPack2HW: This routine is a callback to warn the AS module that an ACK which was pointed to has left the queue. This ensures that the pointer to the latest ACK always holds a valid reference.

Operation details

In this subsection we will explain in detail how each component of the TCP enhancement module works. As explained before, since both solutions need information on a per-TCP-flow basis, we created a common platform for logging and keeping information about active TCP connections.
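As a sketch of how a forwarding path might consume the 0 / -1 return convention described above, consider the following. All names here are hypothetical stand-ins; the real entry point is the C routine in tcpenh.c, which is only stubbed.

```python
# Hedged sketch of the TCPctl return convention: 0 means the packet
# continues down the data path, -1 means the module consumed it
# (WM buffered it downstream, or AS superseded it upstream).

FROM_CMTS, FROM_CPE = 0, 1      # downstream / upstream originators

def tcp_ctl(packet, origin):
    """Stub standing in for the real tcpenh.c entry point."""
    if origin == FROM_CPE and packet.get("superseded"):
        return -1               # AS: contents copied into a queued ACK
    if origin == FROM_CMTS and packet.get("out_of_window"):
        return -1               # WM: packet buffered for later delivery
    return 0

def forward(packet, origin):
    if tcp_ctl(packet, origin) == -1:
        return "dropped/buffered by TCP module"
    return "forwarded on data path"

print(forward({"superseded": True}, FROM_CPE))    # consumed by AS
print(forward({}, FROM_CMTS))                     # forwarded normally
```

The single entry point keeps the integration with the existing data path down to one call per packet, which is what makes the module easy to retrofit.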
The purpose of the common platform is to maintain an accurate view of the ongoing TCP connections, so that WM and AS can operate error free. It listens to TCP traffic on both the downstream and upstream paths and extracts the necessary information for each flow, namely the advertised window size and the last packet acknowledged, among other things. There is a unified entry point to this information through a variable named TCPstate, a pointer to a structure with the following fields:

    S16 numlogged;       /* number of logged connections */
    U8  WMenabled;       /* status of WM (TRUE/FALSE) */
    U8  ASenabled;       /* status of AS (TRUE/FALSE) */
    U8  TCPenabled;      /* TCP status (AS || WM TRUE/FALSE) */
    U32 ASsuppressed;    /* # of acks suppressed by AS */
    U32 ASprocessed;     /* # of acks processed by the TCP module */
    U32 ASunique;        /* # of unique acks processed by the TCP module */
    U32 lasttimeout;     /* instant of last cleaning timeout */
    connstate_ptr cstate[TCPMAXCONN]; /* array of per-connection state */

The last field is an array of pointers to connection state records, each of which has the following fields:

    U16 connid;          /* connection id -- unique */
    U8  cpe_state;       /* state of TCP connection (CPE side) */
    U8  cmts_state;      /* state of TCP connection (CMTS side) */
    U32 dsaddr;          /* downstream IP address (CPE side) */
    U32 usaddr;          /* upstream IP address (CMTS side) */
    U16 ds_port;         /* downstream port (CPE side) */
    U16 us_port;         /* upstream port (CMTS side) */
    U16 last_win;        /* last receive window size from CPE side */
    U16 TCP_win;         /* window size to be advertised */
    U32 lastack;         /* last ack # received from CPE side */
    U32 lastseq;         /* highest seq # received from CMTS side */
    U32 isn;             /* initial seq no for this connection */
    S16 data;            /* amount of data buffered (bytes) */
    U32 activity;        /* instant of last activity */
    U8  connstate;       /* status of the TCP connection */
    U32 cpe_fin;         /* sequence number of FIN packet sent by CPE */
    U32 cmts_fin;        /* sequence number of FIN packet sent by CMTS */
    UsDataBuffer *queuedACK;     /* pointer to the last queued ACK */
    U32 queuedACKnumber;         /* value of the last queued ACK */
    U8  ASapplied;               /* TRUE/FALSE AS applied to this session */
    DataBuffer *databuffer;      /* pointer to the first buffered packet */
    DataBuffer *databuffer_last; /* pointer to the last buffered packet */

The field connstate indicates the status of the connection (in the eyes of the TCP enhancement module, which can differ from the actual state at each end). It can be in any of the following states:

* TCP_CLOSED: no info for this session
* TCP_HALFOPEN: info for just one end
* TCP_ALIVE: info for both ends
* TCP_HALFCLOSED: session is closing (one end has sent a FIN)

WM and AS will only operate (buffering data or dropping ACKs) when the connection is in the TCP_ALIVE state. Similarly, cmts_state and cpe_state hold the status information for the CMTS and CPE sides, respectively. They can be:

CPE:
* TCP_CPE_ALIVE: there is info from this side.
* TCP_FIN: this side sent a FIN and awaits the FIN ACK.
* TCP_CLOSED: no info for this side.

CMTS:
* TCP_CMTS_ALIVE: there is info from this side.
* TCP_FIN: this side sent a FIN and awaits the FIN ACK.
* TCP_CLOSED: no info for this side.

last_win is the latest receive window size advertised by the CPE. It is updated with every upstream packet for this particular connection.

[Figure 12: TCP flow state flowchart]

TCP_win is the value that the CM will insert in every upstream packet, and is the value seen by the end on the CMTS side. lastack is the latest (and highest) ACK received from the CPE. Together with last_win, it allows us to tell whether a downstream data packet is within the CPE window or not. data, *databuffer and *databuffer_last hold the information about the data packets queued while waiting to fall within the CPE window. They form a linked list with FIFO discipline. *databuffer points to the head of the queue (the oldest packet, to be dequeued first) and *databuffer_last points to the end, where new incoming packets will be buffered if necessary. The algorithm used by this common platform is quite simple. The WM and AS processing routines are different for each of the data paths. The algorithm differs from upstream to downstream in that, upstream, we care about the window size and ACK number of the packet, whereas downstream we look at the sequence number and size of data packets. However, the most important task of the common algorithm is ensuring that the per-flow state is efficiently maintained. In Figure 12 we show the flowchart of the software module that controls and maintains this information. Basically, we have to ensure that the image the TCP enhancement module has of the TCP connection is consistent with the true state of the connection, and therefore we must keep track of SYN packets, as well as FIN and RST packets.
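A simplified sketch of this per-flow state tracking is shown below. It is our condensed reading of the Figure 12 flowchart, not the actual tcpenh.c logic: SYNs open the flow, FIN and RST packets close it, and WM/AS act only while the flow is fully alive.

```python
# Hedged sketch of the connection state machine maintained by the
# common platform.  State names follow the text; transitions are simplified.

CLOSED, HALF_OPEN, ALIVE, HALF_CLOSED = range(4)

def next_state(state, pkt):
    """pkt is one of 'SYN', 'FIN', 'RST', or plain 'DATA'/'ACK' traffic."""
    if pkt == "RST":
        return CLOSED                  # abort: drop all flow state
    if pkt == "SYN":
        return HALF_OPEN if state == CLOSED else ALIVE
    if pkt == "FIN":
        return HALF_CLOSED if state == ALIVE else CLOSED
    return state                       # data/ACK traffic: no state change

s = CLOSED
for p in ["SYN", "SYN", "ACK", "FIN", "FIN"]:
    s = next_state(s, p)
print(s == CLOSED)   # True: flow fully torn down
```

Restricting WM and AS to the ALIVE state means the module never buffers data or drops ACKs for a connection it has only partially observed.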
Since we only perform modifications for TCP sessions for which we have complete information from both sides of the connection, before performing any action under the WM or AS mechanisms we run different tests to identify the previous information we have about that session. Let us recall that a TCP flow is completely identified by its IP address/TCP port pairs (source and destination). The routine TCPctl is inserted in the TCP/IP block in both the upstream and downstream paths (see Figure 11). It either returns 0, meaning that the packet will continue its normal processing on the data path (although its contents may have been modified), or it returns -1, meaning that there is no further processing pending for that packet. The flowchart depicted in Figure 12 is followed in both the upstream and downstream paths. The differences between them are specified within the WM and AS processing. Those two procedures are depicted in Figures 13 and 14. At this stage, per-flow information has already been recovered and we are accessing the records pertaining only to each TCP session.

[Figure 13: WM processing block diagram (upstream)]

In Figure 13, we show the Window Modifier processing block diagram corresponding to the upstream path. In it we monitor ACK sequence numbers flowing upstream, and we release any packet previously buffered in the downstream path when the true receiver window slides so as to allow out-of-window packets to be forwarded downstream. The downstream path itself is quite simple, since we only check whether data packets are outside the true advertised window.
This may occur when the sender has transmitted packets according to the window advertised by the WM engine, but which are not within the true window of the cable modem. In this case, data packets are buffered at the cable modem until an ACK flowing upstream slides the receiver window, allowing the data packets to be sent without overflowing the receiver. In Figure 14 we show the AS processing block diagram. AS processing is only applied on the upstream path, and it basically maintains a cache (a pointer) to the last ACK seen on each TCP connection. When a new ACK is received, we verify whether there is another ACK packet cached for that connection. If it is not a repeated ACK, we simply overwrite the contents of the previous ACK and interrupt the processing of the buffer containing the packet, since its contents have been copied to another buffer already in the queue. In all other cases, we simply update the cache accordingly and continue the processing of the packet.

[Figure 14: AS processing block diagram (upstream)]

High Performance TCP

Basic algorithms

We will show here the changes that are needed in order to implement the new proposed algorithm. Although the operation of TCP changes, the actual modifications to the TCP code should not be very costly. We will first describe the algorithm for regular TCP, followed by the new algorithm, showing the changes between the two. In regular TCP, ACK reception triggers the transmission of new packets, since it may open the window.
The basic idea is to send as many packets as possible, so as to exhaust the current sender window (snd_cwnd) or the receiver window (rcv_wnd):

    When receiving an ACK
        if (ACK sequence number >= snd_una) {
            /* New ACK */
            snd_una <- ACK sequence number
            update snd_cwnd
            while (snd_nxt <= (snd_una + min(rcv_wnd, snd_cwnd))) {
                send(snd_nxt)
                increment snd_nxt
            }
        } else {
            if (ACK sequence number == snd_una)
                Repeated ACK (fast retransmit processing)
            else
                ACK out of date, ignore.
        }

During High-performance TCP (HTCP), ACK reception no longer controls whether new packets are transmitted. Only the variable snd_una gets updated. The recovery mechanism remains untouched, and we actually have two separate mechanisms that perform flow control and packet transmission. The task that performs the flow control (i.e. ensuring that all packets are received and scheduling retransmissions if needed) is practically the same as the one used in regular TCP, only removing the fragment that transmits new packets:

    When receiving an ACK
        if (ACK sequence number >= snd_una) {
            /* New ACK */
            snd_una <- ACK sequence number
            /* Do not do anything else */
        } else {
            if (ACK sequence number == snd_una)
                Repeated ACK (fast retransmit processing)
            else
                ACK out of date, ignore.
        }

Another task regularly queries for packets to be transmitted. The idea here is that, whenever the channel is free, we check whether we can send another packet. As long as the packet to be sent is within the window determined solely by the receiver window (rcv_wnd) it will be transmitted; we therefore ignore the current value of snd_cwnd. The algorithm for this routine should be as follows:

    Whenever the optical transmitter is idle
        while (snd_nxt <= (snd_una + rcv_wnd)) {
            send(snd_nxt)
            increment snd_nxt
        }

Actually, we can simplify the implementation of this second task by scheduling the times at which new packets are to be sent.
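As a sketch of this scheduling (illustrative only, with names of our choosing and a fixed-rate link assumed), the time of the next transmission is simply the current time plus the serialization time of the packet just sent:

```python
# Hedged sketch: schedule the next HTCP transmission instead of polling the
# transmitter for idleness.  The 677.080 Mbps rate is the optical link speed
# used in the simulations described later.

def next_send_time(now, packet_bytes, link_bps):
    """Time at which the following packet may be sent on the optical link."""
    return now + packet_bytes * 8.0 / link_bps

# 1,000-byte packets on the 677.080 Mbps link are about 11.8 us apart.
t = next_send_time(0.0, 1000, 677.080e6)
print(round(t * 1e6, 2))   # serialization gap in microseconds
```

Scheduling an interrupt at this computed instant replaces the busy check of the transmitter with a single timer per outstanding packet.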
When we send a new packet, since we know its size and the rate of the optical link, we can calculate at what time in the future the next packet has to be sent. Therefore, we can schedule an interrupt at that time to check whether there is more data to be sent.

Description of Results

In this section we will describe the results we obtained by applying the mechanisms described so far. We used the same test scenario that we used for obtaining measurements (see Figure 3). We implemented the Window Modifier and the ACK Suppression methods in software in a prototype cable modem, as described in the previous section. In all the results presented here for these two solutions, we had a roundtrip time of approximately 30 milliseconds and we were using an upstream channel of 1280 Kbps.

Window Modifier

For the window modifier we do not use the traffic generator attached to CM2 (see Figure 3), but simply monitor FTP transfers from the PC attached to CM1 to a server across a corporate network.

[Figure 15: TCP transfer evolution with Window Modifier]

The way we are going to demonstrate the improvements of this mechanism is by monitoring the TCP sequence number in the data packets and showing its evolution over time. The relevant parameter in this kind of chart is the slope of the sequence number evolution, which corresponds to the instantaneous throughput of the TCP transfer.
We will show just a sample of the many connections we have observed, in all of which the behavior is equal to the samples described here. In Figure 15, we show such a chart, in which we have depicted the evolution of three TCP transfers. One corresponds to a transfer in which the receiver window size is set to 8,760 bytes. We can verify how the slope of the sequence number corresponds to a throughput of 280 Kbytes/s (which is consistent with the theoretical limitation on the maximum throughput posed by the receiver window size). We also show a transfer with a receiver window of 35,040 bytes, in which we see how the throughput increases up to 600 Kbytes/s. As an aside, we see how this particular transfer experienced a stall of 1.5 seconds due to consecutive losses within the same window. TCP cannot handle this type of loss (fast recovery can handle isolated losses as long as they are not within the same window), and a timeout occurs before the TCP transfer resumes. However, the point of the figure is not dealing with losses (which is an enormous task in and of itself), but looking at how the TCP throughput evolves over time with the different schemes. Finally, the third set of data corresponds to a TCP transfer in which the receiver has its window set to 8,760 bytes (the same as the first trace shown), but we used the window modifier mechanism in the cable modem (CM1), with a simulated window of 35,040. We can verify how this transfer achieves a throughput comparable to the one obtained with a true window, as if the actual window size parameter governing the transfer were the modified 35,040 instead of the true 8,760. The first conclusion we can extract from Figure 15 is that the window modifier mechanism works perfectly well, and for a specific modified window size achieves a throughput comparable to the one that would be achieved if the same value were the true value of the window.
This is achieved in a manner that is completely transparent to the user, and independent of the value that this user has specified in his TCP configuration. This is especially important to provide a uniform service to many users, since it would help those with too small a window size to achieve higher throughputs. It would also prevent users who deliberately set too large a value for their window size from interfering with and decreasing the performance of other users. A possible drawback of this mechanism, as described in the previous section, may be the additional delay introduced by the downstream buffering in the cable modem when a data packet is momentarily outside the true receiver window. In Figure 16 we evaluate how important this effect is, showing the probability density function of the delay that a data packet experiences on the downstream path within the CMTS-CM system. When we do not use the window modifier mechanism, the delay is confined between 5 and 10 milliseconds.

[Figure 16: Downstream delay]

When we apply the window modifier, the downstream delay increases, but, as we expected, not dramatically. The mean increase in delay due to the window modifier can be approximated as about 10 milliseconds, which is a value that can easily be tolerated and will not cause any major problem for the connection.

ACK Suppression

For the ACK Suppression mechanism we use the same setup that we used when obtaining the measurements that confirmed the problem. In Figure 17 we show the probability density function of the delay experienced by an ACK packet on the upstream path.
We include the two data sets similar to those we showed in Figure 6, corresponding to the delay with and without cross traffic. In Figure 17 we also include the delay when we actually apply the ACK Suppression mechanism to a TCP transfer competing with cross traffic from a different cable modem. We see how the mean delay decreases to about half its value under cross traffic. This decrease in delay mitigates the effect of the window size limitation that we were experiencing and increases the throughput from 310 Kbytes/s to some 420 Kbytes/s.

Figure 17: Upstream delay CM-CMTS with ACK Suppression (no cross traffic, cross traffic, and cross traffic with ACK Suppression)

In Figure 18 we show the evolution over time of the same TCP transfers shown in Figure 17. Again, the important feature of the figure is the slope of each data set, which gives us an indication of the throughput. For the connection without cross traffic we see a throughput of nearly 600 Kbytes/s, whereas for the transfer with cross traffic the throughput decreases to 310 Kbytes/s. However, when we introduce the ACK Suppression mechanism, since we decrease the delay significantly, under the same cross traffic conditions the throughput increases from 310 Kbytes/s to 420 Kbytes/s.

Figure 18: TCP transfer evolution with ACK Suppression
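The core of the suppression idea can be sketched in a few lines (the packet representation below is hypothetical, not the actual cable modem code): because TCP acknowledgements are cumulative, an ACK already waiting in the upstream queue carries no information once a newer ACK for the same connection arrives, so it can safely be dropped at enqueue time.

```python
from collections import deque

def enqueue_with_ack_suppression(queue: deque, pkt: dict) -> None:
    """Upstream enqueue that drops superseded pure ACKs.

    A packet is modeled as a dict with 'flow' (connection id) and 'ack'
    (cumulative ACK number, or None for a data packet).  When a new pure
    ACK arrives, older queued pure ACKs of the same flow are removed,
    since the newer cumulative ACK subsumes them.
    """
    if pkt["ack"] is not None:
        survivors = [p for p in queue
                     if p["ack"] is None or p["flow"] != pkt["flow"]]
        queue.clear()
        queue.extend(survivors)
    queue.append(pkt)
```

Suppressing redundant ACKs shortens the upstream queue seen by fresh ACKs, which is exactly the delay reduction measured in Figure 17; note that dropping cumulative ACKs never loses acknowledgement information, it only thins its timing.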
It does not completely eliminate the effect of cross traffic, but we see how it increases TCP performance by nearly 40%. These results also suggest, as we expected, that the use of concatenation in a cable modem may be highly beneficial.

High Performance TCP and high speed flow switching

In this section we describe the parameters we used in our simulations and the results and conclusions we can extract from them. We used the ns simulator package from UC Berkeley [28] and we performed our measurements over the topology shown in Figure 9, in which the bottleneck link under study is composed of an OC-12 optical link at 677.080 Mbps and a traditional electronic link at 10 Mbps. All other links are large enough to accommodate all the traffic they would ever need to carry. The propagation delay on the bottleneck link is 5 ms, and on the other links (from the sources to the switches and from the switches to the destinations) it is 1 ms. This gives a roundtrip time (not counting queuing delay) of 14 milliseconds. Given the optical link speed, the bandwidth-delay product amounts to about 9.5 Mbits of data that can be outstanding for a given TCP connection. With packet sizes of 1,000 bytes, this corresponds to 1,185 packets, which is the amount of buffering space the endpoints would need in order to operate at full speed. We selected these values to allow the mentioned full utilization while maintaining reasonable values for the buffering space at both ends. The propagation delay corresponds to a distance of about 1,000 miles, which can be thought of as a trunk link between two major US cities. During our simulations, a variable number of TCP sources shared the electronic bottleneck link (at 10 Mbps).
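The bandwidth-delay arithmetic above can be checked with a short calculation (a sketch; the 677.080 Mbps link speed, 14 ms roundtrip time, and 1,000-byte packets are the values of our setup):

```python
def bdp_packets(link_bps: float, rtt_s: float, pkt_bytes: int) -> int:
    """Number of packets that must be in flight to keep a link of the
    given speed full over one roundtrip time."""
    return round(link_bps * rtt_s / (pkt_bytes * 8))

# 677.080 Mbps optical bottleneck, 14 ms roundtrip, 1,000-byte packets
print(bdp_packets(677.080e6, 0.014, 1000))  # -> 1185
```

This is where the 1,185-packet buffering requirement quoted above comes from.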
Since we want to achieve a relatively fair share of the bandwidth in the optical link, but we can only route one of the entries to the switch onto that link, we perform a sort of time sharing among all the sources on the fast optical link. We implement this in our simulations by alternately switching flows from the slow electronic link to the fast optical link and vice versa. In this way, the bandwidth available to a single TCP connection suddenly increases from a few Mbps to more than 600 Mbps. The connection will have this bandwidth for a period of about one second, and then it will come back to the pool of connections using the electronic link. Regular TCP is unable to use all the bandwidth immediately, and it would take much more than one second to grow its window enough to allow that throughput. With HTCP, on the other hand, we should be able to immediately use all the bandwidth that is available. We want to compare this performance with the one we would see in the same scenario when we use regular TCP all along. We will show how the throughput obtained in the two cases is dramatically different, suggesting that HTCP is a promising idea that can help achieve better utilization of such a flow switching mechanism. In Figure 19 we compare the throughput we obtain when a connection uses HTCP mode with the performance we would obtain if a regular TCP connection were given the same bandwidth allocation, but without using our HTCP scheme. We can see how the connection using HTCP is able to utilize the full OC-12 bandwidth during the one second that the TCP transfer is given the optical link. On the other hand, the connection not using HTCP uses almost none of the extra bandwidth available to it.

Figure 19: Throughput comparison between HTCP and regular TCP (loss-free link)
This is because at the time the connection was switched over to the fast optical link, its current window size was appropriate for the slow link. Now that it has much more bandwidth available, it takes some time to take advantage of it. TCP in particular would grow the window size by one packet every roundtrip time, which means that during the one second that the 677 Mbps are available, the TCP window will be increased by approximately 70 packets. This increase in the window may allow an increase of 40 Mbps in the maximum bandwidth, which is clearly not enough to fully utilize the available bandwidth. Among the objectives of our simulations, we want to ensure that the new TCP mode we have devised retains the recovery characteristics that TCP had. In particular, we want to see how our new HTCP mode behaves in case of losses. Let us recall that if, while in HTCP mode, the sender receives three duplicate acknowledgements, it will retransmit the packet in question while still sending new packets. When we are in HTCP mode, the limitation on outstanding packets is the maximum value of the TCP window, not the current value controlled by the congestion avoidance mechanism. Therefore, as long as the maximum window is not exhausted, we keep sending further packets, as opposed to regular TCP, in which the congestion window would be halved and new packets would be sent only after the number of repeated ACKs allows the window to grow back to its previous value. In Figure 20 we observe the behavior of a TCP transfer during the HTCP mode, the blue points being the data packets sent by the source, and the red crosses being the ACK packets received, also at the source.

Figure 20: Evolution of a TCP transfer under losses
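The 70-packet, 40 Mbps estimate above follows from the one-packet-per-roundtrip growth of congestion avoidance, and can be verified with a short calculation (a sketch using the parameters of our setup):

```python
def extra_bandwidth_bps(burst_s: float, rtt_s: float, pkt_bytes: int) -> float:
    """Additional throughput regular TCP congestion avoidance can reach
    after holding a faster link for burst_s seconds, growing its window
    by roughly one packet per roundtrip time."""
    extra_pkts = burst_s / rtt_s                 # ~70 window packets in 1 s
    return extra_pkts * pkt_bytes * 8 / rtt_s    # extra window sent per RTT

# one-second burst, 14 ms RTT, 1,000-byte packets
print(extra_bandwidth_bps(1.0, 0.014, 1000) / 1e6)  # ~40.8 Mbps
```

So even a full second on the optical link buys regular TCP only about 40 Mbps of additional rate, a small fraction of the 677 Mbps on offer.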
The optical link has a low error rate, but one high enough to cause the fast retransmit mechanism to be invoked. We can see how at 5.058 s, after three repeated ACKs, we retransmit a packet that had been lost earlier, while we keep sending packets until the maximum window (2,000 packets) is reached, at which point we stop. A roundtrip time after retransmitting the packet, we start receiving new ACKs, but they are again repeated because of another loss. A new retransmission is done while new packets are still being sent. This behavior repeats several times, but we see how the connection is never interrupted and does not need a timeout to resume. In the worst case, the connection may be idle for one roundtrip time. The only ways in which HTCP would need a timeout to resume are when the retransmitted packet is itself lost (which would also be critical for regular TCP), or when two packets are lost and the distance between them is less than three packets. In the latter case, after the first retransmission we will not be able to send enough packets to generate the three repeated ACKs that would trigger the retransmission of the second lost packet. This behavior avoids the eventual stall and timeout that regular TCP may suffer when the window is halved and no more packets can be sent unless new ACKs are received, while at the same time the receiver is unable to send more ACKs due to the lack of incoming data packets. Therefore, our proposal does not perform any worse than current versions of TCP, and in some cases it is even more robust, as there is less chance of the connection having to resume after a timeout.
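The duplicate-ACK policy just described can be sketched as follows (the state fields and function name are hypothetical; this illustrates the HTCP behavior described above, not our simulator code):

```python
def on_dup_ack(state: dict) -> list:
    """HTCP-mode handling of a duplicate ACK: on the third duplicate the
    lost packet is retransmitted, and new packets keep flowing as long as
    the outstanding count stays below the maximum window -- the congestion
    window is NOT halved, unlike regular TCP fast recovery.

    State fields: dupacks, lost_seq, next_seq, outstanding, max_window.
    Returns the sequence numbers to transmit in response to this ACK.
    """
    to_send = []
    state["dupacks"] += 1
    if state["dupacks"] == 3:
        to_send.append(state["lost_seq"])           # fast retransmit
    if state["outstanding"] < state["max_window"]:  # keep sending new data
        to_send.append(state["next_seq"])
        state["next_seq"] += 1
        state["outstanding"] += 1
    return to_send
```

The key design choice is visible in the second branch: sending is bounded only by the maximum window (2,000 packets in our simulations), which is why the transfer in Figure 20 continues through repeated losses without a timeout.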
Regarding the amount of buffering needed at the source, if we take a closer look in Figure 21 at the number of outstanding packets (sent but not yet acknowledged) during the connection, we see how for most of the connection that figure is 1,185, which is what we had previously calculated for completely filling the optical link. During the periods that we are in fast retransmit, the number of outstanding packets increases up to the maximum value that we allow (2,000 packets). Of course, the receiver window has to be increased accordingly, in order to allow that much outstanding data.

Figure 21: Outstanding packets during HTCP

Finally, another area in which we want to understand which options fit best is the mode switching, especially during the transition from HTCP back to regular TCP. The options we can play with are basically the window value to which we want the connection to revert after the transition to regular mode, and the strategy we choose for interspersing the sending of data packets while receiving ACKs at the fast timing. We found during the simulations that, since the timing of the ACKs for packets sent during HTCP mode has no relation to the current state of the slow link, sending any packet according to that timing can be devastating. The time during which the connection may be inactive will be one roundtrip time, as opposed to a much larger stall period in the event of massive losses leading to a timeout. Therefore it is preferable to be conservative, and for this reason the best solution is to remain inactive during the arrival of the ACKs for the packets sent during HTCP.
Using techniques like rate halving [11] may be beneficial to prevent an excessive slowdown, but in our case it can lead to excessive losses that can cause a timeout for the current connection and, more importantly, for other connections. As to the value to which we should set the TCP window when we come back to regular mode, we have tried three approaches:

- Outstanding packets during HTCP: Setting the window to the current number of outstanding packets at the end of the HTCP mode has proven to be devastating. At that time, that value is very high and the bandwidth in the slow link is much smaller, so the excess of packets in the slow link consistently leads to buffer overflows that cause the transitioning connection to stall, as well as other ongoing connections.

- Previous value of the TCP window: We can keep the value the window had before switching to HTCP mode and use it as the target window value when returning to regular TCP mode. However, by the time the connection is switched back to the slow link, the other ongoing connections will have adapted their rates to the new scenario (without the switched connection). Therefore the previous window value is relevant, but not truly realistic given the current conditions on the slow link. For this reason, we see how mode switching to regular TCP using this option still leads to losses for the connection itself, although it causes fewer disruptions to other ongoing connections.

- Restart window: The third option we tried is to reinitialize the window to the value it would take at the beginning of a new connection. In some sense this is actually the case, since the slow link is shared among other connections and the connection that is switched back needs to be reintroduced as if it were a fresh new connection.
However, if switching a connection back to regular TCP mode implies switching one of the ongoing connections on the slow link to HTCP, then the option of the previous value of the TCP window is the more appropriate, since in this case the connection being switched back does not need to start fresh, but can take the place of the connection that has been moved to HTCP.

Conclusions and future work

We have presented in this work different scenarios that lead to inefficiencies in TCP operation. These are situations TCP confronts in quite simple scenarios, and they lead to different problems that we can consider independently. They are not mutually exclusive, so they can be experienced simultaneously, but they will have little or no interaction. We have shown how an incorrect setting of the receiver window size can cause an artificial limitation on TCP throughput. We have proposed a method that, without introducing any changes to current TCP implementations, allows us to overcome this problem, while at the same time offering a tool for optimizing the fair allocation of bandwidth over a shared network. Future work in this area would include more extensive trials on heavily utilized networks, and analyzing the best methods to optimize the value at which the receiver window should be set, in order to avoid additional losses caused by increasing the window. We have also identified a problem with the compression of ACKs due to the mismatch in data and ACK packet sizes on shared channels. This problem causes more disruption than the mere addition of traffic and can significantly degrade TCP performance. We have looked at different simple methods that could help relieve this effect, and we have adopted a previously proposed solution and demonstrated how it improves TCP throughput.
Future work in this area may include optimizing the mechanism by which we identify and maintain the statistics of TCP connections. Finally, we have presented a modification to TCP that helps optimize its performance over high latency networks, or networks in which the available bandwidth varies rapidly over time. Our proposal allows efficient use of the bandwidth that is offered, by using the information provided by the network to send data at a corresponding rate, thereby overriding the need for TCP to slowly probe for the available bandwidth. Probing is a fine solution when the bandwidth is unknown, but it is not very efficient when the available bandwidth is very large and can be specified. We have shown that our scheme preserves the robustness that characterizes TCP, and behaves slightly better in the presence of losses. Future work would include describing the best methods to efficiently communicate the available bandwidth to the sources, defining the scheduling policy at the switch and how it affects performance, and probably establishing a framework for the use of this proposal over high-speed optical networks.

References

[1] V. Cerf and R. Kahn, A Protocol for Packet Network Interconnection. IEEE Transactions on Communications, COM-22, pp. 637-641, 1974.
[2] V. Jacobson, Congestion Avoidance and Control. In Proc. ACM SIGCOMM '88.
[3] D.-M. Chiu and R. Jain, Analysis of the Increase and Decrease Algorithms for Congestion Avoidance in Computer Networks. Computer Networks and ISDN Systems, Vol. 17, pp. 1-14, 1989.
[4] H. Balakrishnan, S. Seshan, and R. Katz, Improving Reliable Transport and Handoff Performance over Wireless Networks. ACM Wireless Networks, Vol. 1, No. 4, pp. 469-481, December 1995.
[5] L. Kalampoukas, A. Varma and K. K. Ramakrishnan, Explicit Window Adaptation: A Method to Enhance TCP Performance. Proc. of Infocom '98.
[6] H. Balakrishnan, V. Padmanabhan and R. Katz.
The Effects of Asymmetry on TCP Performance. ACM Mobile Networks and Applications (MONET), 1999 (to appear).
[7] H. Balakrishnan, V. Padmanabhan, S. Seshan and R. Katz. A Comparison of Mechanisms for Improving TCP Performance over Wireless Links. IEEE/ACM Transactions on Networking, December 1997.
[8] S. Floyd and V. Jacobson. Random Early Detection Gateways for Congestion Avoidance. IEEE/ACM Transactions on Networking, Vol. 1, No. 4, August 1993.
[9] W. R. Stevens. TCP Slow Start, Congestion Avoidance, Fast Retransmit, and Fast Recovery Algorithms. RFC 2001, January 1997.
[10] M. Mathis, J. Mahdavi, Forward Acknowledgment: Refining TCP Congestion Control. Proceedings of SIGCOMM '96, August 1996.
[11] M. Mathis, J. Mahdavi, TCP Rate-Halving Algorithm for TCP Congestion Control. Draft, June 1999.
[12] W. R. Stevens, TCP/IP Illustrated, Vol. 1. Addison-Wesley Publishing Company, 1994.
[13] V. Jacobson, R. Braden, and D. Borman, TCP Extensions for High Performance. RFC 1323, May 1992.
[14] A. Mankin, Random Drop Congestion Control. In Proceedings of ACM SIGCOMM '90, pp. 1-7, September 1990.
[15] T. V. Lakshman and U. Madhow, Performance Analysis of Window-Based Flow Control Using TCP/IP: The Effect of High Bandwidth-Delay Products and Random Loss. In Proc. of High Performance Networking V, IFIP TC6/WG6.4 Fifth International Conference, pp. 135-149, June 1994.
[16] L. Zhang, S. Shenker, and D. D. Clark, Observations on the Dynamics of a Congestion Control Algorithm: The Effects of Two-Way Traffic. In Proceedings of ACM SIGCOMM '91, pp. 133-147, September 1991.
[17] S. Floyd, TCP and Explicit Congestion Notification. Computer Communication Review, Vol. 24, No. 5, pp. 8-23, October 1994.
[18] L. S. Brakmo and L. L. Peterson, TCP Vegas: End to End Congestion Avoidance on a Global Internet. IEEE Journal on Selected Areas in Communications, Vol. 13, No. 8, pp. 1465-1480, October 1995.
[19] R.
Jain, Myths about Congestion Management in High-Speed Networks. Internetworking: Research and Experience, Vol. 3, No. 3, pp. 101-113, September 1992.
[20] B. Bakshi, P. Krishna, N. Vaidya and D. Pradhan, Improving Performance of TCP over Wireless Networks. 17th International Conference on Distributed Computing Systems (ICDCS), May 1997.
[21] S. Floyd and K. Fall. Promoting the Use of End-to-End Congestion Control in the Internet. Submitted to IEEE Transactions on Networking.
[22] S. Johnson. Increasing TCP Throughput by Using an Extended Acknowledgment Interval. Master's Thesis, Ohio University, June 1995.
[23] M. Mathis, J. Semke, J. Mahdavi, T. Ott, The Macroscopic Behavior of the TCP Congestion Avoidance Algorithm. Computer Communication Review, Vol. 27, No. 3, July 1997.
[24] J. Postel. Transmission Control Protocol. RFC 793, September 1981.
[25] V. Visweswaraiah and J. Heidemann. Improving Restart of Idle TCP Connections. Technical Report 97-661, University of Southern California, 1997.
[26] M. Allman. On the Generation and Use of TCP Acknowledgments. ACM Computer Communication Review, 28(5), October 1998.
[27] M. Allman, V. Paxson, W. R. Stevens. TCP Congestion Control. RFC 2581, April 1999.