Balancing TCP Buffer vs Parallel Streams in Application Level Throughput Optimization

Esma Yildirim, Dengpan Yin, Tevfik Kosar
Department of Computer Science & CCT, Louisiana State University, Baton Rouge, LA 70803, USA
esma@cct.lsu.edu, dyin@cct.lsu.edu, kosar@cct.lsu.edu

ABSTRACT

The end-to-end performance of TCP over wide-area networks may be a major bottleneck for large-scale network-based applications. Two practical ways of increasing TCP performance at the application layer are using multiple parallel streams and tuning the buffer size. Tuning the buffer size can lead to a significant increase in the throughput of the application. However, using multiple parallel streams generally gives better results than an optimized buffer size with a single stream. Parallel streams tend to recover from failures more quickly and are more likely to steal bandwidth from the other streams sharing the network. Moreover, our experiments show that proper usage of a tuned buffer size together with parallel streams can increase the throughput even beyond the cases where only tuned buffers or only parallel streams are used. In that sense, balancing the tuned buffer size and the number of parallel streams, and defining the optimal values for those parameters, is very important. In this paper, we analyze the results of different techniques to balance the TCP buffer and parallel streams at the same time and present the initial steps toward a balanced modeling of throughput based on these optimized parameters.

Categories and Subject Descriptors
C.4 [Performance of Systems]: Measurement Techniques, Modeling Techniques, Performance Attributes

General Terms
Design, Measurement, Performance

Keywords
Data transfer, File Transfer, GridFTP, Parallel TCP, Parallelism, Buffer size

1. INTRODUCTION

Large-scale network-based applications spanning multiple sites over wide area networks heavily depend on the underlying data transfer protocol for their data handling, and their end-to-end performance may suffer significantly if the underlying protocol does not use the available bandwidth effectively. Most of the widely used transfer protocols are based on TCP, which may sacrifice performance for the sake of fairness. There has been considerable research on enhancing TCP as well as tuning its parameters for improved performance [6, 12, 5, 14, 16, 15, 13]. At the application layer, opening parallel streams and tuning the buffer size can relieve the bottlenecks of TCP performance. Parallel streams achieve high throughput by mimicking the behavior of individual streams and getting an unfair share of the available bandwidth [23, 16, 2, 8, 6, 12, 17]. On the other hand, using too many simultaneous connections drives the network to a congestion point, and beyond that threshold the achievable throughput starts to drop. Unfortunately, it is difficult to predict the point of congestion, which varies with parameters that are unique in both time and domain. There are a few studies that try to find the optimal number of streams, and they are mostly based on approximate theoretical models [7, 18, 1, 14]. They all have specific constraints and assumptions and cannot predict the complex behavior of throughput in the presence of congestion. Also, the correctness of the proposed models is mostly demonstrated with simulation results only. Hacker et al. claim that the total number of streams behaves like one giant stream that transfers at the capacity of the total of each stream's achievable throughput [7]. However, this model only works for uncongested networks, and thus it cannot provide a feasible solution for congested networks.
Another study [5] supports the same claim, but additionally develops a protocol that also provides fairness. Dinda et al. [18] model the bandwidth of multiple streams with a partial second order equation and require two throughput measurements at different stream numbers to predict the throughput of the others. However, this model cannot predict the optimal number of parallel streams necessary to achieve the best transfer throughput. In yet another model [1], the total throughput always shows the same characteristics as the number of streams increases, depending on the capacity of the connection, and 3 streams are reported to be sufficient to reach 90% utilization. A newer protocol study [14], which adjusts the sending rate according to a calculated backlog, presents a model to predict the current number of flows, which could be useful for predicting the future number of flows. In [26], we developed a model that can predict the behavior of parallel stream throughput with as few as 3 throughput values measured at different parallelism levels. The model presented in that study gives very accurate results when the correct data points are selected. Hence we need an algorithm that selects the correct data points intelligently and at the same time maximizes the accuracy.

Another important parameter to be tuned for high throughput is the TCP buffer size. It is important in the sense that it limits the maximum number of bits that can be in flight before an acknowledgement is received. Usually the buffer size is tuned by setting it to twice the value of the bandwidth × delay product (BDP) [11]. However, there are different interpretations of the bandwidth and delay concepts. Most of the current buffer size tuning work either requires changes to the kernel stack [4], [22], [24], [25] or is based on estimations of bandwidth and delay made at the application level [11], [21], [9], [19].

The techniques that need modifications to the kernel [4], [22], [24], [25] are usually based on dynamic changes to the buffer size during the transfer, driven by the congestion window or flow control window parameters. The approach presented in [4] requires changes to the kernel stack: based on the current congestion window, RTT and server read time, it calculates the next congestion window and sets the buffer variables based on the current and next congestion window sizes. Another study [22] is similar, in that the sender buffer size is adjusted based on the congestion window; in addition, the buffer memory is fairly shared among the connections, and the receive buffer is kept sufficiently large so that it does not limit the throughput. Two other widely used techniques are Dynamic Right-Sizing (DRS) [25] and Linux 2.4 auto-tuning [24]. DRS is basically a receiver-based approach in which the receiver tries to estimate the bandwidth × delay product by using TCP packet header information and time stamps. Instead of using a static flow control window, the advertised receive window is changed dynamically so that the sender is not limited by flow control. Linux auto-tuning, on the other hand, is a memory management technique that continuously increases or decreases the window size based on the available memory and socket buffer space.

The techniques that are applied at the application level without changing the kernel or the protocol [11], [21], [9], [19] are usually static: once the buffer size is set, it does not change throughout the transfer. In [9], the estimated throughput of a connection is calculated from the packet loss probability, RTT and retransmission timeout based on the model presented in [20]. Then, using the BDP, the buffer size is determined:

Buffersize = EstimatedThroughput × RTT    (1)

In [19], a series of ICMP messages is sent out; the packet length is divided by the time difference between two consecutive packets and by the RTT to calculate the available bandwidth of the bottleneck link, and multiplying it by the RTT gives the buffer size. The studies in [11] and [21] separate the buffer of a congested path from that of a non-congested path. The approach they propose finds the efficient buffer size especially for non-congested paths, where the optimal buffer size is found by considering the existing cross traffic. They propose an application-layer tool called SOBAS: on uncongested paths it limits the buffer size so that the buffers of the bottleneck link do not overflow, while on congested paths it does not limit the transfer window, so the buffer size grows and the transfer becomes congestion limited. They send periodic UDP probing packets and calculate the throughput based on those probes; when the throughput becomes stable for some period of time, they limit the receiver buffer according to that value.

Figure 1: Optimal Buffer Selection (throughput vs. buffer size, parallel stream = 1).

Although the window size parameter is properly tuned in these approaches, they do not perform better than using parallel streams, because parallel streams recover from packet loss more quickly than a buffer-tuned single stream. There are a few studies that try to derive a mathematical model of the relationship between the buffer size and the number of parallel streams [10, 3]. The study in [10] models the throughput of parallel streams with multiple continuous-time models of the TCP congestion control mechanism. Another study represents the relationship between the number of parallel streams, the buffer size and the round trip time with a single regression equation [3]. However, this equation again assumes a no-loss network. In general, these models are derived for no-loss networks and do not provide optimal values for both parameters.

In this study, we introduce models and algorithms that calculate the throughput of parallel streams accurately, and we define and apply techniques that balance the buffer size parameter with the optimal stream number to gain the highest throughput. Our experiments show that a reasonably tuned buffer size based on network samplings, combined with our parallel stream optimization models, can give higher throughput than buffer size tuning alone or parallel stream tuning alone.
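To make the estimation-based, application-level tuning surveyed above concrete, the sketch below computes a buffer size from Equation 1, pairing it with a Padhye-style steady-state throughput estimate in the spirit of [20]. This is only an illustrative sketch, not the implementation used in [9]; all function and parameter names, and the example numbers, are ours.

from math import sqrt

def estimated_throughput(mss, rtt, p, rto, b=2):
    """Padhye-style steady-state TCP throughput estimate (bytes/s), as in [20].

    mss: maximum segment size (bytes), rtt: round trip time (s),
    p: loss event probability, rto: retransmission timeout (s),
    b: packets acknowledged per ACK.
    """
    denom = rtt * sqrt(2 * b * p / 3) + \
        rto * min(1, 3 * sqrt(3 * b * p / 8)) * p * (1 + 32 * p ** 2)
    return mss / denom

def buffer_size(mss, rtt, p, rto):
    """Equation 1: buffer size = estimated throughput x RTT (bytes)."""
    return estimated_throughput(mss, rtt, p, rto) * rtt

# Hypothetical example: 1460-byte segments, 40 ms RTT, 0.01% loss, 200 ms RTO.
print(buffer_size(1460, 0.040, 0.0001, 0.200))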
2. BUFFER SIZE OPTIMIZATION

The buffer size parameter determines the maximum number of packets that can be in flight before the sender has to wait for an acknowledgement. If a network buffer is undersized, the network cannot be fully utilized. However, if it is oversized, the throughput may also degrade because of packet losses and the resulting window reductions. Usually the buffer size is tuned manually by the application users or at the kernel level of the operating system. A common method is to set the buffer size to twice the value of the bandwidth × delay product (BDP). However, the assumption that the buffer size should be as large as twice the BDP only holds when there is no cross traffic along the path, which is rarely the case. This raises the question of whether to use the capacity of the network or the available bandwidth, and whether to use the minimum RTT or the maximum RTT. Hence there is quite a variety in the interpretations of the bandwidth and delay concepts. Below is a list of different meanings of BDP [11]:

• BDP1: B = C × RTT_max
• BDP2: B = C × RTT_min
• BDP3: B = A × RTT_max
• BDP4: B = A × RTT_min
• BDP5: B = BTC × RTT_ave
• BDP6: B = B_inf

In the above equations, B represents the buffer size, C the capacity of the link, A the available bandwidth, and RTT the round trip time. BTC in BDP5 is the average throughput of a bulk congestion-limited transfer, calculated based on the congestion window size. Finally, B_inf in BDP6 is a large value which is always greater than the congestion window, so that the connection is always congestion-limited.

Most of the tuning methods described in the literature do the tuning at the kernel level, and approaches that use application-level auto-tuning are few [11], [21]. The methods that use one of the above equations usually rely on tools to measure the available bandwidth and RTT, and they do not consider the effect of cross traffic or of the congestion created by using large buffer sizes. A practical way to choose the buffer size that hides all the details of the cross traffic and congestion factors is to perform samplings and measure the point at which the throughput stops increasing and becomes stable. In Figure 1, the buffer size parameter is increased exponentially, and around 1MB the throughput becomes stable. So rather than using a BDP-related equation, finding the point at which the throughput stops increasing is a good choice, because it encapsulates all the related issues regarding cross traffic and the congestion created.
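A minimal sketch of this plateau-based selection is shown below. It assumes a user-supplied sampling routine (here called sample_throughput, standing for any short probing transfer with the given buffer size) and an arbitrary 2% stabilization threshold; neither is part of our experimental setup.

def pick_buffer_size(sample_throughput, start_kb=128, max_kb=8192, tol=0.02):
    """Double the buffer size until throughput stops improving by more than tol.

    sample_throughput(buf_kb) -> measured throughput in Mbps (user supplied).
    Returns the smallest buffer size (KB) at which throughput has stabilized.
    """
    buf = start_kb
    prev = sample_throughput(buf)
    while buf * 2 <= max_kb:
        cur = sample_throughput(buf * 2)
        if cur <= prev * (1 + tol):      # no meaningful gain: plateau reached
            return buf
        buf, prev = buf * 2, cur
    return buf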
3. PARALLEL STREAM OPTIMIZATION

In a previous study [26], we presented mathematical models to predict the throughput of parallel streams. The model called the Newton's Method model was able to predict the peak point of the throughput, and hence the optimal parallel stream number, with only three data points known on the throughput curve. If the correct data points are used, the accuracy of the model is very high. In this section we introduce another model, called the Full Second Order model, which can increase the prediction accuracy further if the selection of the data points is appropriate. The model derivation is similar to the one presented in [26]. According to [7], the throughput of n streams is:

Th_n <= n × (MSS/RTT_n) × (c/sqrt(p_n))    (2)

where RTT represents the round trip time, MSS the maximum segment size, p the packet loss rate, n the number of streams and c a constant. We define the relation among RTT, n and p with a new variable p'_n. We assume that p'_n is related to a full second order polynomial, and the following equations are derived to be used in this model:

p'_n = p_n RTT_n^2 / (c^2 MSS^2) = a'n^2 + b'n + c'    (3)

According to Equation 3, we derive:

Th_n = n / sqrt(p'_n) = n / sqrt(a'n^2 + b'n + c')    (4)

In order to obtain the values of a', b' and c' in Equation 4, we need the throughput values of three different parallelism levels (Th_n1, Th_n2, Th_n3), which can be obtained from the predictions of network measurement tools or from past data transfers:

Th_n1 = n1 / sqrt(a'n1^2 + b'n1 + c')    (5)
Th_n2 = n2 / sqrt(a'n2^2 + b'n2 + c')    (6)
Th_n3 = n3 / sqrt(a'n3^2 + b'n3 + c')    (7)

By solving these three equations we can place the a', b' and c' values into Equation 4 to calculate the throughput of any parallelism level:

a' = [ (n3^2/Th_n3^2 - n1^2/Th_n1^2)/(n3 - n1) - (n2^2/Th_n2^2 - n1^2/Th_n1^2)/(n2 - n1) ] / (n3 - n2)    (8)

b' = (n2^2/Th_n2^2 - n1^2/Th_n1^2)/(n2 - n1) - (n1 + n2)a'    (9)

c' = n1^2/Th_n1^2 - n1^2 a' - n1 b'    (10)

Based on Equations 8, 9 and 10, the values of a', b' and c' can be calculated easily. As we have mentioned before, the selection of the correct data points for the prediction is very important, since it increases the accuracy of the model as well as prevents unlikely behaviors of the throughput prediction curve. Figure 2 clearly shows the ineffectiveness of a random data point selection strategy. The graphs show the throughput of wide area data transfers for parallel streams, and three prediction models are applied with randomly selected points. In the worst-case scenario, the prediction curves do not reflect the characteristics of the throughput curve at all.

Figure 2: Ineffective combination of Models: (a) Dinda et al. model, (b) Newton's Method model, (c) Full Second Order model, (d) model comparison (throughput in Mbps vs. number of parallel streams).

Therefore we should avoid a random selection strategy and make intelligent decisions.
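Before turning to the delimitation of the coefficients, Equations 5-10 can be made concrete with a short numerical sketch. The helper names and the three sample points below are hypothetical; the code simply solves for a', b' and c' and evaluates Equation 4 at other parallelism levels.

from math import sqrt

def full_second_order_coeffs(samples):
    """samples: three (n, Th_n) pairs. Returns (a', b', c') of Equation 3."""
    (n1, t1), (n2, t2), (n3, t3) = samples
    f = lambda n, t: n * n / (t * t)          # n^2 / Th_n^2 = a'n^2 + b'n + c'
    d21 = (f(n2, t2) - f(n1, t1)) / (n2 - n1)
    d31 = (f(n3, t3) - f(n1, t1)) / (n3 - n1)
    a = (d31 - d21) / (n3 - n2)               # Equation 8
    b = d21 - (n1 + n2) * a                   # Equation 9
    c = f(n1, t1) - n1 * n1 * a - n1 * b      # Equation 10
    return a, b, c

def predict(n, a, b, c):
    """Equation 4: Th_n = n / sqrt(a'n^2 + b'n + c')."""
    return n / sqrt(a * n * n + b * n + c)

# Hypothetical sample points (parallelism level, measured Mbps):
a, b, c = full_second_order_coeffs([(1, 9.5), (4, 27.0), (8, 31.0)])
print([round(predict(n, a, b, c), 1) for n in range(1, 17)])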
3.1 Delimitation of Coefficients

To figure out whether a certain combination of data points is good before we use it to calculate the optimal parallel stream number, we examine the coefficients a', b' and c' derived from the application of the model to that combination of data points. If the coefficients meet certain requirements for each model, then we can use this combination to calculate the optimal parallel stream number; otherwise we need another combination of data points. We have to keep in mind, however, that we may still be far from the combination that gives the most accurate model. In the following subsections we give the requirements that the coefficients should meet so that they correctly reflect the relationship between the throughput and the parallel stream number for each model. Since we have based our model improvements on the Dinda et al. model [18], we consider this model as a base while comparing our models. We also present a short proof for the Full Second Order model.

3.1.1 Statements of the coefficient requirements

i) For the Dinda et al. model:
• a' > 0
• b' > 0

ii) For the Newton's Method model:
• a' > 0
• b' > 0
• c' >= 2
• (2b' / (a'(c' - 2)))^(1/c') > 1

iii) For the Full Second Order model:
• a' > 0
• b' < 0
• c' > 0
• 2c' + b' > 0

3.1.2 Proof of the Full Second Order model coefficient requirements

In order to get a curve which first increases and then decreases monotonically, we should guarantee that the derivative of the throughput function is positive first, becomes zero when the curve reaches its peak point, and is negative afterwards:

Th'_ful(n) = ((b'/2)n + c') / (a'n^2 + b'n + c')^(3/2)
  > 0  for n < optnum
  = 0  for n = optnum
  < 0  for n > optnum

Since a'n^2 + b'n + c' > 0 for all n in N+, this is equivalent to:

(b'/2)n + c' > 0  for n < optnum
(b'/2)n + c' = 0  for n = optnum
(b'/2)n + c' < 0  for n > optnum

which implies:

a' > 0,  b' < 0,  c' > 0,  optnum = -2c'/b' > 1

and hence:

a' > 0,  b' < 0,  c' > 0,  2c' + b' > 0

Furthermore, since

lim_{n->inf} n / sqrt(a'n^2 + b'n + c') = lim_{n->inf} 1 / sqrt(a' + b'/n + c'/n^2) = sqrt(a')/a'    (11)

we conclude that the Full Second Order model curve increases first, then reaches a peak value, and later decreases with a lower bound of sqrt(a')/a'.

3.2 Exponential Increasing Steps Solution

In this section we present an exponentially increasing algorithm that intelligently selects a sequence of stream numbers 1, 2, 2^2, 2^3, ..., 2^k. Each time we double the number of streams until the throughput of the corresponding stream number begins to drop, or increases only very slightly compared with the previous one. After k+1 steps, we have k+1 data pairs that are selected intelligently. The ExpSelection algorithm takes a set T of throughput values for different parallelism levels. These values can be gathered from historical transfers or through instant samplings performed at the time of algorithm execution. First the algorithm sets the accuracy to a predefined small value [Line 2], then starts with a single stream and adds the stream number and the corresponding throughput value to the output set [Lines 6-7]. The next parallelism level is calculated as twice the previous parallelism value [Line 9]. The slope between the current and the previous throughput values is calculated [Line 11]. If the slope is less than the accuracy value, the loop stops and we obtain our selected set of stream number and throughput pairs (O).

ExpSelection(T)
⊲ Input: T
⊲ Output: O[i][j]
 1  Begin
 2    accuracy ← α
 3    i ← 1
 4    streamno1 ← 1
 5    throughput1 ← T_streamno1
 6    O[i][1] ← streamno1
 7    O[i][2] ← throughput1
 8    do
 9      streamno2 ← 2 × streamno1
10      throughput2 ← T_streamno2
11      slope ← (throughput2 - throughput1) / (streamno2 - streamno1)
12      i ← i + 1
13      O[i][1] ← streamno2
14      O[i][2] ← throughput2
15      streamno1 ← streamno2
16      throughput1 ← throughput2
17    while slope > accuracy
18  End
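A direct Python transcription of ExpSelection might look as follows. It is a sketch under our own assumptions: the throughput values T are given as a dictionary keyed by stream number (gathered beforehand from history or sampling), and the default accuracy value is arbitrary.

def exp_selection(T, accuracy=0.5):
    """ExpSelection: double the stream number until the throughput slope
    between consecutive samples drops to `accuracy` or below.

    T: dict mapping parallelism level (1, 2, 4, 8, ...) to throughput (Mbps).
    Returns the list O of selected (stream number, throughput) pairs.
    """
    streams, thr = 1, T[1]
    O = [(streams, thr)]
    while True:
        nxt = 2 * streams
        if nxt not in T:                       # ran out of samples
            break
        slope = (T[nxt] - thr) / (nxt - streams)
        O.append((nxt, T[nxt]))
        streams, thr = nxt, T[nxt]
        if slope <= accuracy:                  # throughput flattened or dropped
            break
    return O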
Now that we have our selected set O, we apply a comparison algorithm that selects the best combination based on both the coefficient values and the error rate of the selected combination. The BestCmb algorithm takes as input the selected set O, the number of items in the set (n) and the model to be applied. For every combination of three data points, the coefficients are calculated and, if they are in the effective range presented in the previous section, the total difference between the actual throughput and the predicted throughput is calculated [Lines 6-8]. The combination with the minimum error is selected and returned.

BestCmb(O, n, model)
⊲ Input: O, n
⊲ Output: a, b, c, optnum
 1  Begin
 2    errm ← init
 3    for i ← 1 to (n - 2) do
 4      for j ← (i + 1) to (n - 1) do
 5        for k ← (j + 1) to n do
 6          a', b', c' ← CalCoe(O, i, j, k, model)
 7          if a', b', c' are effective then
 8            err ← (1/n) × Σ_{t=1..n} |O[t][2] - Th_pre(O[t][1])|
 9            if errm = init || err < errm then
10              errm ← err
11              a ← a'
12              b ← b'
13              c ← c'
14            end if
15          end if
16        end for
17      end for
18    end for
19    optnum ← CalOptStreamNo(a, b, c, model)
20    return optnum
21  End

In Figure 3, we give the experimental results of the application of our algorithms. Two different experimental environments are used. In Figures 3.a and 3.b, the experiments are conducted over the LONI network with 1Gbps NICs, while in Figures 3.c and 3.d, wide area transfers are conducted over 100Mbps NICs. Our models are compared with the Dinda et al. model [18].

Figure 3: Application of prediction models over past data transfers: (a) LONI-LONI Full Second Order model, (b) LONI-LONI Newton's Method model, (c) LAN-WAN Full Second Order model, (d) LAN-WAN Newton's Method model.

We have seen that the ExpSelection algorithm combined with the BestCmb algorithm gives very accurate predictions. With these algorithms we minimize the amount of data we need for an accurate prediction while maximizing the accuracy at the same time. The cost of the algorithms is low enough that the relevant data points for an accurate optimization can even be gathered through instant samplings rather than being selected from past historical transfers.
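The BestCmb search can be sketched in Python as shown below, instantiated here only for the Full Second Order model: the coefficient check is requirement (iii) of Section 3.1.1, the error is the mean absolute deviation of Lines 6-8, and the optimal stream number is optnum = -2c'/b' from Section 3.1.2. The helpers full_second_order_coeffs and predict are the ones sketched after Equation 10; the CalCoe/CalOptStreamNo dispatching for the other models is omitted.

from itertools import combinations

def best_cmb(O):
    """Pick the 3-point combination whose Full Second Order fit has the
    smallest mean absolute error over all selected points O, subject to the
    coefficient requirements a' > 0, b' < 0, c' > 0, 2c' + b' > 0.

    O: list of (stream number, throughput) pairs returned by exp_selection.
    Returns (a', b', c', optimal stream number) or None if nothing qualifies.
    """
    best = None
    for combo in combinations(O, 3):
        a, b, c = full_second_order_coeffs(list(combo))
        if not (a > 0 and b < 0 and c > 0 and 2 * c + b > 0):
            continue                            # outside the effective range
        try:
            err = sum(abs(t - predict(n, a, b, c)) for n, t in O) / len(O)
        except (ValueError, ZeroDivisionError):  # fit undefined at some point
            continue
        if best is None or err < best[0]:
            best = (err, a, b, c)
    if best is None:
        return None
    _, a, b, c = best
    return a, b, c, -2 * c / b                  # optnum = -2c'/b'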
4. BALANCING THE BUFFER SIZE AND PARALLEL STREAM NUMBER

There has been a large number of studies in the area of buffer optimization, and our results in parallel stream optimization are very promising. However, a good combination of a tuned buffer size and parallel streams can give even better results than the individual application of these two techniques. Unfortunately, there is no practical work on balancing the buffer size and the parallel stream number to achieve the optimal throughput. In this section, we present the results and discussion of simulations performed in NS-2 and of a set of experiments in a real network environment, combined with the application of our models, to describe the way a balance should be set. We conducted the real-network experiments on the LONI network with 10G NICs and a 6 ms delay. The network traffic of the LONI network varies a lot, hence we can observe the effect of applying both techniques better.

Figure 7: Topology.

4.1 Simulation Results and Discussion

In our simulations we tried different scenarios by changing the buffer size and the number of parallel streams. The network topology we used is represented in Figure 7. The bottleneck link bandwidth is 100Mbps with a 20ms delay. The sources (Sr0, Sr1) for our transfers and the cross traffic are connected to the bottleneck link router R0 with 1Gbps bandwidth and 5ms delay, while the destination nodes (Ds0, Ds1) are connected to R1 with 1Gbps bandwidth and 4ms delay. The cross traffic flows from Sr0 to Ds0, while the actual transfer occurs between Sr1 and Ds1.

In the first set of experiments we did not use any cross traffic and varied the buffer size together with the number of parallel streams. With a very small buffer size such as 16KB, full utilization of the network can be achieved only with a very large number of streams (Figure 4.a). After reaching its peak point, the throughput starts to fall around 45 streams. Increasing the buffer size to 32KB pulls the peak throughput point back to 22 streams. Further increasing the buffer size to 64KB pulls the stream number down to 10 streams. However, for buffer sizes larger than 64KB the throughput can never reach its maximum point. The reasonable selection in this case is to use a buffer size around 16KB-64KB and, respectively, around 45-10 parallel streams to get the highest throughput. Further increasing the buffer size does not help, but rather causes a decrease in the achieved throughput. The maximum throughput values can be gained with a buffer size smaller than the BDP combined with parallel streams. The figure shows a wave-like behavior of the throughput: as we increase the buffer size and decrease the number of parallel streams, the peak throughput eventually decreases. Figure 4.b shows the same comparison from the buffer size perspective: larger stream numbers gain more throughput at smaller buffer sizes.

Figure 4: Effect of parallel streams and buffer size under no cross traffic: (a) throughput vs. number of streams for 16KB-256KB buffers, (b) throughput vs. buffer size.

In the second set of experiments, we ran the simulations in the presence of a non-congesting cross traffic of 5 streams with 64KB buffer sizes. The results were quite interesting. With a very small buffer size of 16KB, the throughput increases linearly up to the point of congestion with the cross traffic as we increase the number of parallel streams (Figure 5.a); the throughput is shared between the two traffics. Further increasing the buffer size pulls the peak throughput point to smaller stream numbers. This behavior is similar to the previous case without cross traffic, except that the throughput is shared. In the meantime, there is no effect on the cross traffic as the parallel stream number increases (Figure 5.b). However, a further increase in the parallel stream number of our transfer causes the cross traffic to lose the fight: as the total throughput of our traffic starts to increase, the throughput of the cross traffic starts to decrease.

Figure 5: Effect of parallel streams and buffer size with a cross traffic of 5 streams with 64KB buffer size: (a) parallel transfer, (b) cross traffic.
The best throughput without affecting the cross traffic in this case is achieved with a 32KB-64KB buffer size and a parallel stream range of 6-13 streams.

In the third case, there is a congesting cross traffic of 12 streams with 64KB buffer sizes. Interestingly, tuning the buffer size has no effect on the cross traffic, since all of the curves show similar characteristics. As the number of streams increases, the throughput increases as well (Figure 6.a). In the meantime, our transfer starts to steal from the throughput of the cross traffic, which gradually decreases (Figure 6.b). The only way to gain more throughput in this case is to open parallel streams.

Figure 6: Effect of parallel streams and buffer size with a cross traffic of 12 streams with 64KB buffer size: (a) parallel transfer, (b) cross traffic.

4.2 LONI Experiments

Although the simulation results gave us a good idea about the behavior of tuning the buffer size and parallel streams, we would like to see the effects of both in a real network environment. We applied two different techniques to decide which one is more effective in gaining maximum throughput.

4.2.1 1st Settings

In the first technique, we measure the effect of buffer size tuning on a transfer that is already tuned with the optimal parallel stream number given by our model. In Figure 8.a, we conducted transfers with an untuned buffer size of 256K and applied the Newton's Method model. According to the model, the optimal stream number that gives the peak point is 14, and an average throughput of around 1.7 Gbps is gained. Using this optimal stream number, we then ranged the buffer size from 128K to 4M (Figure 8.b). We have seen that the highest throughput results were again gained with the default buffer size of 256K that we used to optimize the stream number. There was no improvement in the average throughput values.

Figure 8: First Technique: (a) bs=256K, throughput vs. number of parallel streams; (b) optimal parallel stream=14, throughput vs. buffer size.

4.2.2 2nd Settings

In the second technique, we applied the procedure in the reverse order. We conducted sampling transfers by increasing the buffer size exponentially using only a single stream. In Figure 9.a, the throughput curve started to become stable around a 1M buffer size. So we picked this value as the optimal buffer size and conducted parallel transfers with it. The results can be seen in Figure 9.b. This time we applied the Full Second Order model. The optimal stream number with the Full Second Order model was 4, and we were able to get a throughput value of more than 2 Gbps. With this technique we obtained a higher throughput value and opened a smaller number of streams for optimization, hence placing less burden on the end systems.

Figure 9: Second Technique: (a) parallel stream=1, throughput vs. buffer size; (b) optimal bs=1M, throughput vs. number of parallel streams.

To better compare the techniques we present, we prepared a comparison graph that shows both the average and the maximum throughput results of the techniques applied. In Figure 10, the first column represents the settings where only a single stream was used with an untuned buffer size. In the second column, the result of an optimized throughput of parallel streams with our models and an untuned buffer size is presented. In the last column, the result of the second technique is presented. Clearly the last technique was able to get more throughput for both average and maximum values.

Figure 10: Balanced Optimization: comparison of average and maximum throughput for the unoptimized, stream-only-optimized and fully balanced settings.
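The second technique can be summarized end to end with the hedged sketch below, which chains the helpers sketched in Sections 2 and 3 (pick_buffer_size, exp_selection, best_cmb). The sampling routine sample_throughput again stands for whatever probing mechanism is available (e.g., short GridFTP transfers) and is not part of our experimental code.

def balanced_tuning(sample_throughput, parallelism_levels=(1, 2, 4, 8, 16, 32)):
    """Second technique of Section 4.2.2: first fix the buffer size at the
    plateau point measured with a single stream, then find the optimal
    number of parallel streams for that buffer size with the
    Full Second Order model.

    sample_throughput(buf_kb, streams) -> throughput in Mbps (user supplied).
    Returns (buffer size in KB, predicted optimal stream number).
    """
    # Step 1: single-stream buffer sampling (as in Figure 9.a).
    buf = pick_buffer_size(lambda kb: sample_throughput(kb, 1))
    # Step 2: parallel-stream sampling and model fit at that buffer size (Figure 9.b).
    T = {n: sample_throughput(buf, n) for n in parallelism_levels}
    O = exp_selection(T)
    fit = best_cmb(O)
    opt_streams = round(fit[3]) if fit else max(T, key=T.get)
    return buf, opt_streams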
5. CONCLUSION

Tuning the buffer size and the parallel stream number are two ways of optimizing application-level throughput. When the correct balance is found, these techniques applied together can improve the end-to-end throughput compared to the individual application of each technique. In this study, we presented a mathematical model and an algorithm that can predict the throughput of parallel streams very accurately. We also presented the steps for constructing a balance between tuned buffer sizes and parallel streams. Our experiments showed that the usage of parallel streams on fairly tuned buffers gives higher throughput values, and that fewer streams are needed for optimization.

Acknowledgment

This project is in part sponsored by the National Science Foundation under award numbers CNS-0846052 (CAREER), CNS-0619843 (PetaShare) and EPS-0701491 (CyberTools), and by the Board of Regents, State of Louisiana, under Contract Numbers DOE/LEQSF (2004-07), NSF/LEQSF (2007-10)-CyberRII-01, and LEQSF(2007-12)-ENH-PKSFIPRS-03.

6. REFERENCES

[1] E. Altman, D. Barman, B. Tuffin, and M. Vojnovic. Parallel TCP sockets: Simple model, throughput and validation. In Proc. IEEE Conference on Computer Communications (INFOCOM'06), pages 1-12, Apr. 2006.
[2] H. Balakrishnan, V. N. Padmanabhan, S. Seshan, M. Stemm, and R. H. Katz. TCP behavior of a busy Internet server: Analysis and improvements. In Proc. IEEE Conference on Computer Communications (INFOCOM'98), pages 252-262, California, USA, Mar. 1998.
[3] K. M. Choi, E. Huh, and H. Choo. Efficient resource management scheme of TCP buffer tuned parallel stream to optimize system performance. In Proc. Embedded and Ubiquitous Computing, Nagasaki, Japan, Dec. 2005.
[4] A. Cohen and R. Cohen. A dynamic approach for efficient TCP buffer allocation. IEEE Transactions on Computers, 51(3):303-312, Mar. 2002.
[5] J. Crowcroft and P. Oechslin. Differentiated end-to-end Internet services using a weighted proportional fair sharing TCP. ACM SIGCOMM Computer Communication Review, 28(3):53-69, July 1998.
[6] L. Eggert, J. Heidemann, and J. Touch. Effects of ensemble TCP. ACM Computer Communication Review, 30(1):15-29, Jan. 2000.
[7] T. J. Hacker, B. D. Noble, and B. D. Athey. The end-to-end performance effects of parallel TCP sockets on a lossy wide area network. In Proc. IEEE International Symposium on Parallel and Distributed Processing (IPDPS'02), pages 434-443, 2002.
[8] T. J. Hacker, B. D. Noble, and B. D. Athey. Adaptive data block scheduling for parallel streams. In Proc. IEEE International Symposium on High Performance Distributed Computing (HPDC'05), pages 265-275, July 2005.
[9] G. Hasegawa, T. Terai, T. Okamoto, and M. Murata. Scalable socket buffer tuning for high-performance web servers. In Proc. International Conference on Network Protocols (ICNP'01), page 281, 2001.
[10] T. Ito, H. Ohsaki, and M. Imase. On parameter tuning of data transfer protocol GridFTP for wide-area networks. International Journal of Computer Science and Engineering, 2(4):177-183, Sept. 2008.
[11] M. Jain, R. S. Prasad, and C. Dovrolis. The TCP bandwidth-delay product revisited: network buffering, cross traffic, and socket buffer auto-sizing. Technical report, Georgia Institute of Technology, 2003.
[12] R. P. Karrer, J. Park, and J. Kim. Adaptive data block scheduling for parallel streams. Technical report, Deutsche Telekom Laboratories, 2006.
[13] G. Kola, T. Kosar, and M. Livny. Run-time adaptation of grid data-placement jobs. Scalable Computing: Practice and Experience, 6(3):33-43, 2005.
[14] G. Kola and M. K. Vernon. Target bandwidth sharing using endhost measures. Performance Evaluation, 64(9-12):948-964, Oct. 2007.
[15] T. Kosar and M. Livny. Stork: Making data placement a first class citizen in the grid. In Proc. IEEE International Conference on Distributed Computing Systems (ICDCS'04), pages 342-349, 2004.
[16] J. Lee, D. Gunter, B. Tierney, B. Allcock, J. Bester, J. Bresnahan, and S. Tuecke. Applied techniques for high bandwidth data transfers across wide area networks. In Proc. International Conference on Computing in High Energy and Nuclear Physics (CHEP'01), Beijing, China, Sept. 2001.
[17] D. Lu, Y. Qiao, and P. A. Dinda. Characterizing and predicting TCP throughput on the wide area network. In Proc. IEEE International Conference on Distributed Computing Systems (ICDCS'05), pages 414-424, June 2005.
[18] D. Lu, Y. Qiao, P. A. Dinda, and F. E. Bustamante. Modeling and taming parallel TCP on the wide area network. In Proc. IEEE International Symposium on Parallel and Distributed Processing (IPDPS'05), page 68b, Apr. 2005.
[19] A. Morajko. Dynamic Tuning of Parallel/Distributed Applications. PhD thesis, Universitat Autonoma de Barcelona, 2004.
[20] J. Padhye, V. Firoiu, D. Towsley, and J. Kurose. Modeling TCP throughput: A simple model and its empirical validation. ACM SIGCOMM'98, 28(4):303-314, Oct. 1998.
[21] R. S. Prasad, M. Jain, and C. Dovrolis. Socket buffer auto-sizing for high-performance data transfers. Journal of Grid Computing, 1(4):361-376, Aug. 2004.
[22] J. Semke, J. Mahdavi, and M. Mathis. Automatic TCP buffer tuning. ACM SIGCOMM'98, 28(4):315-323, Oct. 1998.
[23] H. Sivakumar, S. Bailey, and R. L. Grossman. PSockets: The case for application-level network striping for data intensive applications using high speed wide area networks. In Proc. IEEE Supercomputing Conference (SC'00), pages 63-63, Texas, USA, Nov. 2000.
[24] L. Torvalds and the free software community. The Linux kernel. http://www.kernel.org.
[25] E. Weigle and W. Feng. Dynamic right-sizing: A simulation study. In Proc. IEEE International Conference on Computer Communications and Networks (ICCCN'01), 2001.
[26] E. Yildirim, M. Balman, and T. Kosar. Dynamically tuning level of parallelism in wide area data transfers. In Proc. IEEE Data-Aware Distributed Computing Workshop (DADC'08), June 2008.