>> John Dunagan: So it's my pleasure to introduce Kashi Vishwanath, who will be talking to us today about demystifying Internet traffic. Take it away. >> Kashi V. Vishwanath: Do I need to stand here? >> John Dunagan: No, it will record you there. >> Kashi V. Vishwanath: All right. Thanks, John. Thanks for the introduction. It's a pleasure to be here and to talk to you today about my experiences trying to unravel some of the mysteries surrounding Internet traffic, at least in the context that I'm looking at it. So traffic is an important phenomenon that, as you may know, is useful in a variety of different settings. For example, Internet Service Providers, ISPs, are interested in understanding what traffic demands look like at specific points in the network, not just for today but also projected into the future, before they take capacity planning measures. Similarly, people like you and us, who are trying to do measurement studies and trying to evaluate services, are interested in understanding what kind of traffic patterns exist out there. So in this talk I'm going to be focusing on yet another example, which I'll keep coming back to, which is the following. Let us say that we are interested in designing the next generation Internet architecture, protocols, applications, services, one of these. So we want to evaluate that when we make certain design choices on the whiteboard, and they eventually translate to a real running service on the Internet, what does the performance look like when individual clients are placing requests for the data that it contains? Now, the Internet obviously has a large number of users running a number of different applications, protocols, et cetera. So it is clear that the performance of this application will be governed by what kind of traffic exists out there. What is not clear, however, is what is the extent of this impact. So in this talk, towards the end, hopefully I will have convinced you of the following things.
Obviously the first one is that existing Internet traffic is extremely rich in its structure. It has a number of interesting statistical properties that one might wish to capture, for example burstiness patterns at different time scales. Similarly, the impact it will have on individual applications is extremely difficult to predict a priori, which in turn means that you need a more systematic method. I will then tell you about one method I have thought about for arguing about what kind of impact this traffic will have on the individual applications and protocols that you are trying to evaluate. So with that being the general theme of the talk, let me now give you the specific problem that I've tried to solve and the solution I've reached. So let us go back to this problem. So we are trying to evaluate something that we have deployed: how does it interact with Internet traffic, and what is the impact? So one obvious way to approach this problem is that we would get access to a number of different machines on the Internet. We would then deploy the prototype we are trying to evaluate on each of these machines. Now, since we are running on the real Internet, there is no question about how realistic the experiment is. The big problem, as you can reckon, is how do you get access to these machines. Even if you get access to a number of different machines, how do you ensure that they are representative? And finally, experiments on the Internet are not quite reproducible, as you might know. So to get around all of these challenges, we researchers somehow like to get away from running experiments on the Internet; instead we would like to run an equivalent experiment by configuring machines that we have complete ownership and control of, in a local testbed, a local set of machines in our own racks. So now we take the prototype that we are trying to evaluate and run it on the local cluster.
Similarly we reproduce the set of clients that were accessing content on the Internet on the local cluster again, and in this fashion we have replicated the communication characteristics on a set of machines that we completely own. At this point, we might naively declare that the evaluation of the prototype is complete. But as you have seen, the picture on the right is missing a key thing: it does not quite capture the complexity of the Internet. So a high-level motivation for my research in the recent past, and hopefully in the near future, is the following question. What would it take to create meaningful, simplified snapshots of the Internet so that you could try your best to reproduce that snapshot in a local cluster? Now, once you do that by appropriately configuring machines that can send data to each other, when you evaluate this new service in the local cluster there is some hope that the numbers that you draw would have some correlation to the actual numbers on the Internet. So in short, I want to expose individual application evaluations in the local cluster to realistic Internet-like settings. So indeed I started my research by trying to build a better Internet, but pretty soon, when you start reading the evaluation sections of a number of papers, you see that the evaluation methodology lies somewhere along this spectrum: either you do an analytical study of what the Internet looks like and what the individual components look like, or you run experiments on the real Internet. But most often these decisions are governed by how much time do I have, what kind of expert intuition do my colleagues and I have, et cetera. So I have the desire to make experiments in a local testbed more realistic. And as you can imagine, there are a number of different ingredients that would go into it. Some of these I have been fortunate to work on, but a lot of them still remain. In this talk I'll be focusing on one of these.
So what would it take to reproduce Internet-like traffic conditions in a local testbed? And more importantly, I'll try to argue that the individual applications that you are trying to evaluate do indeed care about that traffic, and if you expose individual applications to different kinds of traffic, you could reach conflicting results on what the performance looks like. So let us go back to the same problem again. Yes? >>: (Inaudible). >> Kashi V. Vishwanath: I'm sorry? >>: (Inaudible). >> Kashi V. Vishwanath: What is the Internet? >>: Yes, as defined (inaudible). >> Kashi V. Vishwanath: So the Internet is defined based on the problem that you have at hand. For example, if you're interested in doing a certain kind of analysis on how better to route all traffic within an organization to Web sites located outside, your view of the Internet in that case would be the border link of Microsoft Research; everything outside would be represented as a cloud, and everything inside would be represented by a number of clients. In the same way, if you are an ISP and you want to take capacity planning measures, then the Internet again is based on your view of the world. So in that case it would be: how do I best manage traffic within my organization, and how do I best route things which are coming from outside, which I can model using a black box. So it's really a definition based on what is the problem you have at hand. So the kind of argument I'm trying to make here is that even if you have a crisp definition of the problem you have at hand, how do you then ensure that for that Internet, that definition of the Internet, you reproduce Internet-like conditions in a local testbed, and what can be done about it? So I was saying that we have now somewhat reduced the problem to trying to understand Internet traffic at every single link on the Internet, and this somewhat goes back to (phonetic)'s question. That is a seemingly impossible and daunting task.
So before we get there, there's a much simpler, more tractable version of the problem, which we still do not have complete control over, and which is what I'm going to be focusing on in this talk. So as I said, let us say there were specific links on the Internet that you're interested in, for example the border link of MSR. How do you understand what kind of traffic flows across that border link, and how does that influence the flows of the individual applications that I'm trying to evaluate which share that link? That is the focus of this talk: trying to understand traffic at specific links on the Internet and then trying to argue about how that impacts performance. For example, in this case, I've configured a local set of machines to generate background traffic for that specific link. I've then configured individual prototypes that I'm trying to evaluate, which are generating foreground traffic for the same link. So in this fashion I hope to understand traffic at this link and the impact it has on the application I'm trying to evaluate, all in a controlled setting. With that being said, there are a few specific goals that I think any traffic generator should have in this setting, which are realism and responsiveness. By realism I basically mean this: if I'm trying to reproduce traffic for a specific link, the generated traffic should at least look like the original traffic for whatever metrics of success you have. More importantly, it should be responsive. And I mean two things by this. First of all, the generated traffic and application traffic should really interact with each other, exactly like they would do on the Internet, and not simply kill each other. More importantly, I want to be in a position where, by turning meaningful knobs based on your intelligent estimates, let's say, of what the network will look like in the future, I should be able to translate that into what the traffic on this link will look like in the future.
For example, what does everything look like after an upgrade of the access link by my ISP? So there should be a meaningful way in which I could express certain changes like that. >>: (Inaudible) talking about a single physical link or are you talking about a logical channel which is going to have the properties of the communication path between two hosts that might -- >> Kashi V. Vishwanath: That is the general goal. But for this talk, yes, I am actually focusing on a single physical link. >>: So everything else that would affect that logical communication path, such as routing changes and DNS, everything like that is outside the scope of this talk (inaudible). >> Kashi V. Vishwanath: That is outside the scope of this talk. So anything that exists on the end-to-end path will be something I'll be abstracting. So if there are a thousand hosts on one side of the link and 10,000 hosts on the other side of the link, I would try my best to capture the distributions of what is going on on the end-to-end path across these thousand-by-ten-thousand pairs. But every other external influence would be something that I would be looking into in the future. Time permitting, I could tell you a little later how I approach getting around that problem. So there are a number of challenges in trying to achieve this goal. First of all is the fact that individual applications themselves are changing over time. For example, consider something very simple like Web traffic. Back in the day, we used Web pages which had simple content, limited layout, et cetera. You visit the same Web page today and it's inundated with text, media, ads, portfolios, et cetera. So the traffic generator should be able to understand and reproduce this phenomenon. Then again, individual applications' popularity is constantly changing over time.
For example, if you look at this link, it's a transpacific link which runs between Japan and the United States, and I'm trying to show here, across a five-year window, sample data on the popularity of three different applications measured in bytes. So as you can see here, the popularity of individual applications is constantly changing over time. So even if you had this information, there should be a way in which you could express this in the generated traffic that you're coming up with. Finally, I've been talking about this rich structure that is present in Internet traffic and how it influences individual application behavior. Yes? >>: (Inaudible). >> Kashi V. Vishwanath: So perhaps they were using that protocol to tunnel peer-to-peer traffic. That's my guess there. So I don't have access to the payload, so I cannot really tell. Yeah. So here I'm showing a sample of the same trace that I was looking at earlier, but this is for a specific day. And I'm looking at throughput, which is measured in megabits per second, for one-second intervals across a 15-minute window, and I'm showing two kinds of traffic here, Web traffic and Napster, which used to be a popular peer-to-peer file sharing service. So one thing you can see here is that Napster, which is the bottom of the two curves, is relatively smooth compared to Web traffic when I'm looking at one-second time bins. Now, if I zoom in on the first few time bins and look at traffic at a much finer time scale, at hundred-millisecond intervals, we can see that Napster traffic is no longer as smooth as it once appeared to be. It is still way better than Web traffic. If I further focus in on the first few time bins and look at a much finer level of granularity, then at ten-millisecond intervals there is hardly any difference between Napster traffic and Web traffic.
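As a rough illustration of the point about time scales, here is a small sketch (not from the talk; the trace and numbers are made up) that bins a toy packet trace at the one-second, hundred-millisecond, and ten-millisecond scales mentioned above. The mean throughput is the same at every scale, but the variance of the throughput series grows as the bins shrink, which is exactly why "is this traffic bursty?" has no single answer.

```python
from collections import defaultdict

def throughput_series(packets, bin_secs):
    """Aggregate (timestamp_secs, size_bytes) records into per-bin
    throughput in megabits per second."""
    bins = defaultdict(int)
    for ts, size in packets:
        bins[int(ts / bin_secs)] += size
    lo, hi = min(bins), max(bins)
    # bytes per bin -> Mbps: x8 for bits, /1e6 for mega, /bin_secs per second
    return [bins.get(i, 0) * 8 / 1e6 / bin_secs for i in range(lo, hi + 1)]

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# Toy trace: a 1500-byte packet every 5 ms, with an occasional large burst.
packets = [(i * 0.005, 30000 if i % 7 == 0 else 1500) for i in range(6000)]
for width in (1.0, 0.1, 0.01):  # the 1 s, 100 ms, 10 ms scales from the talk
    series = throughput_series(packets, width)
    print(f"bin={width}s  variance={variance(series):.2f}")
```

Running this shows the variance climbing by orders of magnitude as the bin width drops, even though the underlying packets are identical.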
So to the already existing list of challenges in trying to come up with a traffic generator, we add one more, which is the fact that Internet traffic exhibits this rich burstiness property, and the traffic generator should be able to understand and reproduce it. Yes? >>: (Inaudible) the users? >> Kashi V. Vishwanath: There's a single link, a single aggregated link, the transpacific link between Japan and the United States. Yes. So more importantly, the problem is exacerbated by the fact that the burstiness of Internet traffic is actually a function of what time scale you observe it at. So there is no single answer you could arrive at which says Internet traffic is bursty; it has to be a much finer definition, and the traffic generator should be able to both understand and reproduce this. So with that being said, let me give you an overview of the solution I've reached to this problem before giving more details. So I start by observing all packets that are entering and exiting a specific link that I'm interested in. Starting from this complex structure, I extract key properties that one needs to understand if one has to argue about Internet traffic. Starting from those, I build parsimonious models which explain these properties. Armed with these models, I then configure individual hosts in a local testbed that understand these models and use them to communicate with each other. I have built all of this into a single tool which I call Swing. So Swing starts from observing packet traces for a specific link, builds all of these models, extracts distributions, and configures hosts appropriately to generate real traffic in a local testbed. At that point I can do a sanity check and compare whether the generated traffic looks similar to the original traffic that I started with. But more importantly, I can now envision the kind of things I was bringing out in the motivation, which is deploying the service in the original scenario.
Corresponding to that, I can get an analog in the generated testbed here, so I can introduce the service and evaluate the system in the presence of realistic traffic. Yes? >>: (Inaudible) responsive (inaudible). >> Kashi V. Vishwanath: That's based on -- so I'll get to that in a minute, but the quick answer is, first of all, use TCP to get congestion responsiveness; then you build the complex feedback loops one on top of the other, exactly like they would exist on the Internet. But I would have to give more details on that. More importantly, instead of just evaluating individual systems against a specific kind of traffic, you could now turn meaningful knobs in the generated traffic to project traffic demands into alternate scenarios and again evaluate your application in the presence of such future scenarios. So with that being said, here are the key contributions of this work. Starting from this complex phenomenon, I tried to extract key properties that one needs to get a handle on in order to understand what Internet traffic really looks like for specific links. I then built all of this into an automated tool called Swing, which can look at Internet traces and reproduce that traffic in a local testbed. I then argue in my work how individual applications really care about such bursty Internet traffic, and that if you do not expose individual applications to realistic Internet-like traffic patterns, you could reach conflicting results on what the performance of individual applications should look like. Yes? >>: (Inaudible) what are your metrics for whether you get (inaudible). >> Kashi V. Vishwanath: Metrics for getting the traffic correct? >>: Yeah. How will you know that it's indistinguishable? So, like, a cryptographer would have their definition of whether two messages are -- or whether a message is indistinguishable -- >> Kashi V. Vishwanath: I see. >>: -- from random; you want to have some level of confidence of whether you got this right. >> Kashi V. Vishwanath: I see.
>>: What is the metric that tells us that's going to be used? >> Kashi V. Vishwanath: I see. So that is actually a great question. So at the very least, when you generate traffic and compare it to the original traffic, the first thing you will do, of course, is look at first-order metrics. So you could say: what are the mean and variance of the traffic process; do they look similar to what I saw earlier? Because, as you can imagine, a lot of measurement studies care about exactly that. Then you could go one level higher and try to look at traffic at individual time scales, because I've observed in my experiments that that property actually impacts individual application behavior. So the natural answer here is that if there are individual applications that you are exposing to Internet-like traffic patterns, then what properties of Internet traffic do those applications see in the evaluation? If the generated traffic has similar effects for those applications, then the generated traffic is meaningful enough. So if I can reproduce Internet-like experiments in a controlled Swing setting, then the generated Swing trace is meaningful and good enough. That's the answer. >>: (Inaudible) when you're saying I have a model of what the traffic looks like, and then I reproduce it; if it matches my model, then I succeed. >> Kashi V. Vishwanath: Yes. >>: But what you haven't captured is -- the whole point of trying to make things more realistic is that your controlled environment somehow captures everything; there's not going to be some feature of the real Internet traffic that you haven't captured that would affect applications in a way you don't expect, and the only way to validate that is to try your application out on the real Internet. So I don't understand how you can -- it seems like your metric of success is equal to the thing you're modelling.
You're saying, well, we know it's right if the traffic fits a particular distribution, which is the distribution we used to generate the traffic. >> Kashi V. Vishwanath: Okay. So I apologize if that was not clear earlier in the talk. So you start with a given link, and based on a number of measurements and heuristics you decide whether the experiment that you ran on the Internet is, within reasonable bounds, meaningful or not. Then you try your best to reproduce that scenario in the local testbed. That's just a sanity check. Now, with the Swing setting that you have in the local testbed, you can turn meaningful knobs. For example, you could say: everything else being fixed, if the application protocol changes in a certain way, how would that affect the traffic that is being generated, and how would that impact the applications that I'm trying to evaluate? So if I did not do a meaningful sanity check at the first step, there is no hope that any numbers that I draw out of the changed and projected traffic would carry any meaningful information. >>: (Inaudible) I agree you've done a necessary condition and you want to achieve hope, but there's more to hope -- there's more to the solution than -- >> Kashi V. Vishwanath: So can you give me one specific example so that I can better understand the scenario? >>: So I guess one way to validate this would be: if you projected three months into the future -- if these changes happen, this is what we get -- make 10 predictions, see which way Internet traffic actually changes, and then see whether you match. >> Kashi V. Vishwanath: That would be -- I would be happy to talk about some of those results. Yeah, that's a great question. That is exactly the kind of validation I have done. So what I was saying is that before you get there, you have to validate a specific scenario. >>: Let's come back to that later. >> Kashi V. Vishwanath: Yeah.
So hopefully I've convinced you that understanding Internet traffic is a complex and challenging problem, as you guys are also uncovering some of the mystery surrounding it. So now I'm going to give you the main insights I've tried to use, which will answer some of your questions on how I attack responsiveness, too. So I've already been talking about the first of the four main insights that I have, which is that we need a hybrid approach to understanding Internet traffic. By that I mean we should be able to run the real code of the prototype that we are trying to evaluate, interacting with realistic packets, real packets, exchanged with each other in a controlled setting. But somehow the traffic itself needs to be modelled for better control. So once you have that, what would some of the initial simple traffic generators look like? For example, you could say: I look at traffic on an existing link, say the border link of MSR, and I simply try to reproduce the timing of those packets using UDP replay, unlike most of the applications on the Internet. So if I do that, you can see that depending upon what level of granularity I chose to reproduce, I can be fairly realistic in the kind of traffic that I generate. However, since Internet traffic is not quite dominated by UDP protocols, and even if it were, since we do not have meaningful observations to infer the high-level application behavior, there is no hope that the generated traffic will be as responsive as you want it to be. So in order to fix that, what you would do is go back to what most applications that run on the Internet actually do: for example, they use TCP, so leverage this. Following this argument, I'll say that if it's the TCP protocol that we are trying to model here, we should use TCP as the underlying mechanism. In this fashion, you can see that the generated traffic will really interact with application traffic similar to how it would have on the Internet. Yes?
>>: (Inaudible). >> Kashi V. Vishwanath: Oh, yeah, sure. >>: (Inaudible). >> Kashi V. Vishwanath: Exactly. Exactly. >>: That depends upon how much research -- how many resources you have in your testbed. >> Kashi V. Vishwanath: True. >>: So far you haven't described one requirement you have on the testbed, and if your testbed has only (inaudible) no matter what you have. >> Kashi V. Vishwanath: So that's a good question. So there are two real answers to this. The honest answer is I'm kind of presuming that you are not starved of resources for the kind of experiments that you're trying to do. But even if you are -- let's say there is no way you could get the number of hosts that you wanted -- then there are other approaches you could try. For example, one of the projects that I worked on talks about how you could configure individual virtual machines in a local testbed and run them appropriately dilated at a much slower timeframe, so that when 10 real-world clock seconds pass, the VM only thinks that one second has passed. So if you exchange data at a slower rate within that virtual machine, you could perhaps get, in the relative timeframe, a much higher throughput. So those are some techniques which I could think about. If you really did not have the physical resources and that was the only limiting factor, then perhaps you might not mind paying in terms of the time it takes to run the experiment. So as you can see now, if I use TCP instead, then, yes, I could be fairly responsive, I could try interacting with the applications that I'm trying to evaluate, et cetera. It is not clear whether I can still project things into the future, but let us not worry about that for the moment. Now, clearly, once I have delegated the task of putting individual packets on the wire, it is no longer clear, as (inaudible) was pointing out, that unlike with UDP replay there is any hope that you can reproduce these packet timings, for example.
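The time-dilation arithmetic mentioned above can be sketched in a few lines. This is a hypothetical illustration only; the actual project virtualizes clocks inside the VM monitor rather than computing rates after the fact. The idea is that with a dilation factor of 10, ten real seconds are perceived as one virtual second, so any transfer appears ten times faster to the dilated VM.

```python
def perceived_throughput_mbps(bytes_sent, real_secs, tdf):
    """Throughput as seen inside a VM dilated by factor `tdf`:
    `tdf` real-world seconds are perceived as one virtual second,
    so a transfer appears `tdf` times faster than it physically was."""
    virtual_secs = real_secs / tdf
    return bytes_sent * 8 / 1e6 / virtual_secs

# 125 MB moved in 10 real seconds is 100 Mbps physically...
physical = perceived_throughput_mbps(125_000_000, 10, tdf=1)   # → 100.0
# ...but looks like 1000 Mbps to a VM dilated by a factor of 10.
dilated = perceived_throughput_mbps(125_000_000, 10, tdf=10)   # → 1000.0
```

This is how a modest testbed can emulate faster links than it physically has, at the cost of experiments taking proportionally longer in wall-clock time.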
So that really is the second insight: the fact that you should leverage the TCP that is available on the hosts that you are trying to run the experiment on. Yes? >>: (Inaudible) the TCP is going to adapt to your controlled setting. How do you know how much to offer (inaudible). When you're doing UDP replay, you know: you send a packet when the trace tells you to send the packet. With TCP, if the link is suddenly more available, you should send more data. >> Kashi V. Vishwanath: That's exactly the point I was trying to bring up. It is not clear that just because you use TCP you can be realistic in the way you are reproducing packets. >>: (Inaudible) TCP is exactly no more powerful than UDP. >> Kashi V. Vishwanath: That is not quite true. So the fact of the matter is that TCP is a fairly complex (inaudible) diagram to try to simulate in any other setting. So the observation I have here, and this will be much clearer in the results I show, is that since TCP plays a very critical role in determining what kind of burstiness properties exist in Internet traffic, if you try your best to simulate that in any other world, you will be hard pressed to reproduce that behavior. So the best hope you have is to leverage the TCP that's actually running on the individual hosts that are available for experimentation. >>: (Inaudible) a little bit depending on what else is going on around it. >> Kashi V. Vishwanath: Yes. >>: How does TCP make the decision between: I should send a packet now because my congestion window says so, which is what you'd expect it to do, and I should maybe not send a packet now because I actually don't have any data available from the application? >> Kashi V. Vishwanath: Oh, so that will be another aspect of the work that I will get into very soon. So if you'll hold your question for like five minutes, maybe I can answer; otherwise, I can come back to it. So once you have the second observation, you're absolutely right.
Just because you have TCP and you layer some sort of a simple model on top of it, pretty soon you will realize that any of those simple models is not going to satisfy the goals -- exactly the point you're trying to make: either it will not be realistic or it will not be responsive, but you cannot meet both of these. To get to the third, and second-to-last, insight, we ask again: what does Internet traffic really depend upon? And when you try to understand this, there is a complex interaction at multiple layers. At the top-most layer you've got these users that are trying to engage themselves in a variety of different activities. Then you've got the applications that the users are running below them, and finally the network, which is responsible for carrying the large number of packets that the applications are generating. So the hope here is that if you somehow capture these three layers independently and then model the interaction between them, you would be able to reproduce that in a much more controlled setting. So for individual users, I'm interested in understanding what are their periods of activity, what applications do they prefer, et cetera. For individual applications, what do the semantics look like, how do they differ from each other, how are they similar to each other? And in the same way, what is the fate of individual packets when applications put them on the wire? Do they get delayed indefinitely, do they go all the way to the other end -- exactly like the questions that you are trying to raise. So the hope here is that somehow you capture the interaction between these three layers, though, more importantly, it's not going to be super accurate, given that you're doing this solely based on the observation of a single link, the view of the world from that single link.
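One way to picture the three-layer decomposition just described is as three separate bags of parameters, extracted independently from the trace and then composed when generating traffic. The field names below are my own illustrative guesses at the kind of quantities involved, not Swing's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class UserModel:        # top layer: when users are active, how much they do
    think_times: list = field(default_factory=list)        # idle gaps (secs)
    rres_per_session: list = field(default_factory=list)   # activity volume

@dataclass
class AppModel:         # middle layer: per-connection semantics
    request_sizes: list = field(default_factory=list)      # bytes sent out
    response_sizes: list = field(default_factory=list)     # bytes coming back
    server_waits: list = field(default_factory=list)       # server-side delays

@dataclass
class NetModel:         # bottom layer: path seen from the traced link
    capacity_mbps: float = 0.0
    latency_ms: float = 0.0
    loss_rate: float = 0.0

@dataclass
class LinkModel:        # everything a Swing-style generator would replay
    users: UserModel = field(default_factory=UserModel)
    app: AppModel = field(default_factory=AppModel)
    paths: dict = field(default_factory=dict)   # remote IP -> NetModel
```

The point of the structure is separability: you can swap out one layer (say, double every path's capacity in `paths`) while holding the user and application layers fixed, which is exactly the kind of meaningful knob-turning the talk argues for.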
So that really is the third insight that goes into this work: that if you have any hope, you would perhaps leverage the recommendations from structural modelling and generate traffic as a three-level hierarchy. Then we have the final insight. Let me go back to this picture again. In trying to reproduce traffic using this three-level hierarchy, it turns out that there is a complex feedback loop going on. For example, at the top-most layer, it's only when you get the entire HTML page and parse the contents that you decide to click on individual URLs that are embedded within. Similarly for TCP, only when previous data is acknowledged from the other side do you decide, based again on your state diagram, how much more data to put on the wire. So in order to generate traffic, we should be able to capture and model this closed-loop traffic generation process at these three layers. So with these four insights, let me now describe how you generate traffic. Again going back to this picture, I'm going to extract all of these properties, build the models, and then tell you how I configure individual hosts to generate traffic. So look at this similar picture again. Start by observing all packets entering and exiting a given link. At that point, the first step is to identify the individual applications that are present in the original trace. For this step I'm hoping to leverage existing work that exists out there -- for example, some of the people here are involved in doing it -- on how you look at individual packets and extract what applications they belong to. After this, my research starts. I'm going to focus on individual applications, but the method mostly generalizes. I'm going to tell you how I extract user, application, and network properties for a single application. So I look at these trace packets, and based on TCP/IP source and destination addresses and port numbers, I extract flow information from these packets. So there's a high-level description -- yes? >>: (Inaudible).
>> Kashi V. Vishwanath: No. So what I did not say here is that I'm not assuming anything about the underlying protocol at a fine level of detail. So if you have TCP-level headers on those packets, I'll try my best to reverse-engineer the protocol that was going on. >>: (Inaudible). >> Kashi V. Vishwanath: So I can infer the request-response patterns that are going on. For example, I can infer that a certain number of bytes are going to the other end. You're absolutely right: if there's much higher-level behavior which is human dominated and there is a complex feedback control loop there, I might not be able to capture that. But at the very lowest level, I might be able to capture how much data is going across, what happens to the data, how much response comes back, how long do I wait after that. So I'll be able to extract patterns like that. That is my hope for this work. So then I look at individual flows, and based on TCP sequence numbers and (inaudible) numbers and time stamps, I try to extract information on a per-connection basis. For example, in this case: what is the request-response exchange, what is the total number of request-response pairs, what is the timing separation between them, et cetera. Then I attempt to group together individual flows into a single cluster, a quantity which is known as an RRE, or request-response exchange. So think of this as analogous to all the requests that are going out for the (inaudible) objects in an HTML page. At the next step, I attempt to group together individual RREs into what is known as a session. This is again analogous to, say, all the Web pages you might browse in a single visit to a Web site, or all the movies that you would download in a single peer-to-peer session, for example. So at this step, I've described at a high level how I extract the user and application properties. The only thing that remains is the network.
To do that I go back to the individual flows that I extracted earlier. And now I will use variants of passive TCP trace measurement techniques, not necessarily innovating in this space. The aim of these techniques is to look at the individual IP addresses present in the original trace, model a single logical link connecting each IP address to the link under observation, and ask what static values of capacity, loss rate, and latency I can attribute to it. So at this step I've described at a high level how I build these models. The key thing to note here is that I need not be super accurate in any of these techniques, although it would not hurt to be. The main question is whether I can extract sufficient information for all of these models that I can start explaining why Internet traffic looks the way it does in the original trace and how I can use it to drive my simulation studies. So to summarize: I looked at packets, grouped them into applications, extracted flow information, RRE information, and session information, and then used measurement techniques to extract capacity, latency, and loss rate. I should again note here that much of this is work in progress, and it will continue to evolve as we come up with more applications. In particular, the last one, for instance, has (inaudible) for a lot of people to provide a rich source of (inaudible) pieces in the past. So let me go into a little more detail on some of these. So how do you extract individual flows from packets? That is relatively straightforward. You look at the TCP dump information present in the original trace; without going into too much detail, it is information on where the packet came from, where it went to, what the timing was, what the TCP sequence headers were, et cetera. Based on that, I will try to reverse-engineer the conversation that is going on in the trace.
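To make the flow-extraction step concrete, here is a minimal sketch that groups packet records into bidirectional flows keyed by the TCP/IP endpoints and summarizes each flow's initiation time, termination time, and byte count. The record fields ('src', 'sport', 'ts', and so on) are illustrative assumptions, not the actual trace format used in this work.

```python
from collections import defaultdict

def extract_flows(packets):
    """Group raw packet records into flows keyed by the TCP/IP endpoints.

    `packets` is assumed to be an iterable of dicts with keys
    'src', 'dst', 'sport', 'dport', 'ts', 'size' -- a simplification
    of real TCP dump records.
    """
    flows = defaultdict(list)
    for pkt in packets:
        # Canonicalize the endpoint pair so both directions map to one flow.
        ends = sorted([(pkt['src'], pkt['sport']), (pkt['dst'], pkt['dport'])])
        flows[(ends[0], ends[1])].append(pkt)
    # Per-flow summary: initiation time, termination time, total bytes.
    summary = {}
    for key, pkts in flows.items():
        times = [p['ts'] for p in pkts]
        summary[key] = {
            'start': min(times),
            'end': max(times),
            'bytes': sum(p['size'] for p in pkts),
        }
    return summary
```

From this summary, the per-connection request-response structure can then be recovered by walking each flow's packets in timestamp order.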
For example, what did the TCP (inaudible) look like, what was the request data that was sent across, how long did the server on the other side wait, when was the response sent back, et cetera. So at the lowest level, I will extract flow initiation and termination times and then also more detailed information, for example the request-response semantics. Once I have that as a basic building block, I can then try to figure out RRE and session information. And this is how it works at a high level. I look at all flows and sort them in increasing order of connection initiation. I look at the first flow and mark the beginning of the first session and the first RRE in that session. I will then attempt to take more flows that are concurrent or overlapping with this flow and attribute them to the same RRE. For example, I might get one new flow which is time-separated by a certain amount, and that is the inter-connection time for this particular flow. I will then attempt to group more flows which are also concurrent with this flow. Once in a while, I will get a flow which is separated in time from this cluster by a certain amount. At this point, I will declare that the previous RRE has ended, a new one has started, and this is the inter-RRE time. I will then -- yes? >>: (Inaudible). >> Kashi V. Vishwanath: Yes. >>: (Inaudible). >> Kashi V. Vishwanath: Sorry. For a specific IP address. So for a pair of endpoints between which the exchange is going on. So let's say I'm sending some data to a Web server and I'm trying to download a number of HTML pages, a number of (inaudible) objects within them. But if I'm talking to some other Web server which is not causally related to this, then that will not be part of the same ->>: (Inaudible). >> Kashi V. Vishwanath: Yes, for now. Correct. >>: (Inaudible). >> Kashi V. Vishwanath: (Inaudible). >>: If the Web server at the same time I'm looking at other stuff, right. (Inaudible) same servers. >> Kashi V.
Vishwanath: So the source is the same and the destination is the same? >>: Yes. >> Kashi V. Vishwanath: So that's a great question. Let's say every time you access the Web server you are actually also watching a video; then it makes sense that when I extract this RRE information I embed that as some sort of high-level behavior that the user is expressing. If every time the Web site is visited a video is watched in parallel, it would not hurt to group them together into one single cluster. However, if it happens a very low percentage of the time, then based on the distributions I extract, it will get weeded out. >>: (Inaudible) sometimes like there are certain -- for example the dependency between this traffic, and the dependency may occur or may not occur. >> Kashi V. Vishwanath: That's a great question. So the answer in that case would be: what is the level of granularity that you are interested in? If you believe that you want to characterize individual users who may or may not access multiple data on the same Web server, then you might be interested in categorizing them into different classes. So it really depends on whether the kind of fidelity you get from the traffic generator is a bottleneck to the kind of application studies that you're carrying out. If a large percentage of users start exhibiting that behavior, then there is some reason to believe that it will influence the application behavior. And at that point I might want to characterize the fact that there are two different conversations going on but they are semantically different in nature. As for the alternative, what would you do instead? There is no better option than either putting them into different categories or grouping them into a single one. At the end of the day, I want to capture how two hosts talk to each other.
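The RRE and session grouping just described -- sort flows by initiation time, put concurrent or nearby flows into the same RRE, start a new RRE on a moderate gap, and a new session on a large one -- can be sketched as follows. The gap thresholds here are illustrative assumptions, not the actual cutoffs used in the system.

```python
def group_flows(flows, rre_gap=2.0, session_gap=60.0):
    """Cluster flows into RREs and sessions by inter-flow time gaps.

    `flows` is a list of (start, end) times for one host pair.
    Returns a nested list: sessions -> RREs -> flows.
    """
    flows = sorted(flows)
    sessions = [[[flows[0]]]]  # first session, first RRE, first flow
    for flow in flows[1:]:
        last_rre = sessions[-1][-1]
        last_end = max(f[1] for f in last_rre)
        gap = flow[0] - last_end
        if gap <= rre_gap:
            last_rre.append(flow)        # concurrent/overlapping: same RRE
        elif gap <= session_gap:
            sessions[-1].append([flow])  # new RRE, same session
        else:
            sessions.append([[flow]])    # far enough apart: new session
    return sessions
```

Distributions for inter-connection, inter-RRE, and inter-session times then fall out of the gaps observed between these clusters.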
If a large number of times it happens to be an HTML page and a video page, that is what I want to capture in this interaction. Similarly, at some point there will be a new flow which is so far separated in time that I don't even want to put it into the same cluster that I'm working on. At this step I will say that the session, quote/unquote, has ended and a new one has started. So the point here is that I will try to extract distributions for all of these in the underlying model. Then, in the end, I extract the network data using variants of well-known techniques, so not necessarily innovating in this space. For example, to extract a latency estimate for a link, I will look at the separation between a data packet and the corresponding ACK coming back from the other end. Similarly, I'll use variants of packet-pair measurement techniques to attribute a capacity value to the link. And finally, I extract loss rates in a similar fashion. So the point is that at the end of this, I would hopefully have extracted meaningful distributions corresponding to the parameters of the model that I've built. So the next step -- yes? >>: (Inaudible) packet size? >> Kashi V. Vishwanath: Yes, I do. In fact, in some of the experiments I've found that if you do not do it then you can reach lopsided results. So in the beginning I wanted to have a parsimonious model which was sufficient for explaining what the traffic looks like. But in the end it was driven by studies of what applications cared about, and it turned out that packet size was a particular one, so I ended up (inaudible) yes. So I start with a network emulator that talks to a number of different hosts to carry out the emulation, and in this case I use ModelNet. The network emulator basically knows what the Internet looks like based on what you tell it.
So in this case we tell it that it's the border link that we are interested in, and consequently we would know the characteristics of this link. We would then look at the ModelNet link, and it talks to a number of different hosts to carry out the emulation. We divide the task of carrying out the emulation based on the popularity in the original trace. So if a third of the hosts in the original trace belonged to HTTP, a third of these would be responsible for emulating the HTTP part of the background traffic. The next step is to take these individual hosts and connect them on either side of the dumbbell link. Again, that is based on the popularity in the original trace. So if two-thirds of the bytes ended up flowing from left to right, or, if you're counting hosts, if two-thirds of the hosts were on the left side of this dumbbell link, you would take two-thirds of these and place them on the left side. You do this for all of the hosts. At this step in the emulator you now want to connect the hosts to the dumbbell link. To get values for the link properties you go back to the distributions that you extracted earlier. For example, this link here might get a 10 megabits per second capacity, 10 millisecond delay, and two percent loss rate. You do this for all links and connect them to the dumbbell link. At this step, configuring the topology is complete. The only thing remaining is telling the hosts about the user and application properties. Yes? >>: So if I understand, are you assuming that for any two flows the only shared link is this one link, and that a subset of the flows don't share some other link and might interact with each other ->> Kashi V. Vishwanath: Yes, yes. >>: Because of whatever (inaudible). >> Kashi V. Vishwanath: Yes. You're absolutely right. So in reality that's going to happen, but as an approximation in this case I'm assuming that there is no shared path (inaudible).
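The host placement and per-link configuration described a moment ago might be sketched as below: each emulated host is assigned an application and a side of the dumbbell according to trace popularity, and gets access-link properties sampled from the extracted distributions. All names are illustrative, and plain lists of observed values stand in for the fitted distributions.

```python
import random

def configure_hosts(n_hosts, app_popularity, left_fraction, dists):
    """Assign each emulated host an application, a side of the dumbbell
    link, and sampled access-link properties (capacity, latency, loss).

    `app_popularity` maps application name -> relative weight from the
    trace; `dists` maps property name -> list of observed values.
    """
    random.seed(0)  # fixed seed so the emulated topology is reproducible
    apps = list(app_popularity)
    weights = [app_popularity[a] for a in apps]
    hosts = []
    for _ in range(n_hosts):
        hosts.append({
            'app': random.choices(apps, weights=weights)[0],
            'side': 'left' if random.random() < left_fraction else 'right',
            'capacity_mbps': random.choice(dists['capacity_mbps']),
            'latency_ms': random.choice(dists['latency_ms']),
            'loss_pct': random.choice(dists['loss_pct']),
        })
    return hosts
```

Each resulting record is what would be handed to the emulator to attach one host to the dumbbell link.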
But you can imagine extending this using variants of the packet-pair extensions that (inaudible), for example, has worked on, which try to detect congestion on shared paths and use that to understand how I can safely attribute flows to a different level of the hierarchy. For example, there could be a tree spanning out of here, and there could be multiple hosts; if I could extract information on what prefixes they belong to, I could put them all into a single cluster and feed them through a separate link. But you're absolutely right, for these experiments I'm not modelling shared congestion; I'm modelling the hosts independently. >>: (Inaudible) model of loss rate (inaudible) so as long as you don't change that assumption no matter of (inaudible) is going to get you interaction that's happening? >> Kashi V. Vishwanath: I see. So that's a great point. So the loss-rate assumption here is for flows that I do not have visibility into. For example, if you're behind the border link of MSR and you're trying to connect to me, a server outside, then consider the link that you use to connect to me: let's say I cannot observe other traffic that goes through that link. For example, there might be hosts here which share this link, and I cannot observe that interaction. In order to approximate that interaction I'm assigning loss-rate values. But if there are multiple flows, all of which flow through that link and are part of my controlled experiment, then any induced loss would be captured. So the loss-rate setting is for capturing something that I cannot passively observe because I do not have visibility. And in a similar way, if I had visibility, say multiple routers within the same organization, maybe I could augment the topology. They are like two different aspects of loss rate. >>: (Inaudible) you need a big picture. (Inaudible). >> Kashi V. Vishwanath: Sure.
But hopefully if you are designing a network and you have complete control over it, there is slightly more information that you would have than I had to work with. For example, I worked with individual traces that I had no control over how they were collected. If you are within a network, not only would you perhaps have more traces at different points in the network which you could correlate, but you would also have routing information that you could leverage to understand how traffic goes from one point to the other. So again, quickly: look at a single application and figure out, based on the probability distributions, how many sessions to generate; for each session, how many RREs; and for each RRE, the total number of connections. So let us look more closely at what an individual connection emulation might look like. As you recall, it has rich information on the request-response exchanges that are going on. So for this connection, again based on a distribution, I would pick a pair of hosts in the topology and establish a real TCP connection between them. The sender would then take a request size, create a dummy payload in a real TCP packet, and send it across. The server on the other end would generate a response, which would come back. The client would wait for a certain amount of time, generate a new request, send it across, and wait for the final response on this connection. Upon receiving this last response, the sender would then finish the connection and terminate it. So at this point, at the lowest level, we have finished emulating a single TCP connection. We proceed by doing the other connections in this RRE, then other RREs and other sessions. So when all of this real TCP data is flowing back and forth across this link, we can now imagine introducing the service that we are trying to evaluate.
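The single-connection replay just described -- dummy requests and responses exchanged over a real TCP connection, with server think times and client waits -- can be sketched as below. The exchange tuple layout and the localhost client/server pair are illustrative assumptions, not the emulator's actual mechanism.

```python
import socket
import threading
import time

def recv_exact(conn, n):
    """Read exactly n bytes from conn (or stop at EOF); return bytes read."""
    got = 0
    while got < n:
        chunk = conn.recv(min(65536, n - got))
        if not chunk:
            break
        got += len(chunk)
    return got

def run_connection(exchanges, port=9191):
    """Replay one connection's request-response exchanges over TCP.

    `exchanges` is a list of tuples
    (request_bytes, server_think_s, response_bytes, client_wait_s).
    Returns the total response bytes the client received.
    """
    def server():
        srv = socket.socket()
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind(('127.0.0.1', port))
        srv.listen(1)
        conn, _ = srv.accept()
        for req, think, resp, _ in exchanges:
            recv_exact(conn, req)        # absorb the dummy request
            time.sleep(think)            # emulate server processing time
            conn.sendall(b'x' * resp)    # dummy response payload
        conn.close()
        srv.close()

    t = threading.Thread(target=server)
    t.start()
    for _ in range(50):                  # retry until the server has bound
        try:
            cli = socket.create_connection(('127.0.0.1', port), timeout=2)
            break
        except OSError:
            time.sleep(0.05)
    received = 0
    for req, _, resp, wait in exchanges:
        cli.sendall(b'x' * req)          # dummy request payload
        received += recv_exact(cli, resp)
        time.sleep(wait)                 # emulate client think time
    cli.close()
    t.join()
    return received
```

In the real setup the two endpoints would sit on different emulated hosts behind their configured access links, so the packets actually traverse the dumbbell.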
At this step, the flows belonging to this service would interact with the flows belonging to the background traffic, and then we could put numbers on how the flows are influenced by background traffic on that specific link. So let me now try to give you details of some of the evaluation that I carried out. Basically I'm interested in answering two questions. Can Swing look at Internet traffic traces and reproduce them as closely as possible in a local testbed? And if it does that, does that influence the individual applications that I'm trying to evaluate? So to answer the first question -- yes? >>: (Inaudible) can you do something far simpler than Swing, such as simply, you know, characterize let's say some property of the (inaudible). >> Kashi V. Vishwanath: That's a good question. >>: And just (inaudible) traffic and (inaudible). >> Kashi V. Vishwanath: That's a great question. Yes. So I actually answer some of that question. So this question here is really trying to capture that fact. When I say background traffic -- I'm hiding this because I didn't want to mess up the slide -- I also mean realistic and responsive traffic. So that's a good question. So for the first question I'm looking at existing traffic traces. Part of the point is that I don't control these traffic repositories. They come from different parts of the world, have different properties, different link speeds, different applications, et cetera, and I experimented with all three of these. So basically I'm interested in going back to my goals and seeing whether the traffic that I generate based on the (inaudible) approach that I described is realistic and responsive. In the interest of time, let me skip some of the initial results, which basically ask whether you can match the coarse (inaudible) behavior. As you can imagine, even simple techniques would be fairly well equipped to do that.
Instead I will focus on reproducing a certain aspect of Internet traffic that I did not explicitly model, which is really at the heart of the whole concept, and that is the burstiness of Internet traffic. In the literature, there's a large body of work devoted to understanding why Internet traffic is bursty and what you can do about it. In this talk I'm going to use one of the many metrics that exist out there, firstly because of its popularity but also because of its visual appeal, and that is wavelet-based multi-resolution analysis. So I'll give you a quick one-minute overview of what this means for the sake of completeness, but then I'll follow it up with the visual appeal I'm talking about. So imagine looking at all traffic entering a specific link and trying to observe how many bytes arrive in a single millisecond. So for every (inaudible) millisecond you calculate the total number of bytes. At the smallest time scale you can imagine computing interesting metrics of interest, for example the mean and variance of the traffic process. You can then start coarsening the time scale in powers of two: you combine adjacent time bins and count how many bytes arrive in those coarser bins over the duration of the trace. So again, at the next level of granularity, you can compute mean and variance, for example. You keep doing that, so on and so forth. Now, in addition to computing the mean and variance, I will compute a quantity which is known as the energy of the traffic at a particular time scale. For the sake of completeness, this is computed using the detail coefficients of the (inaudible) transform of the original traffic process. Intuitively speaking, this captures how bursty traffic is at a particular time scale, but I'll also give you more visual appeal a little later. So once I compute the energy at all of the time scales that I'm interested in, I can plot it in a simple diagram. On the X axis, I'm showing the powers of two that I talked about.
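The energy computation just outlined can be written down directly: repeatedly aggregate the byte counts in powers of two and, at each scale, take the average squared Haar detail coefficient. This is a simplified stand-in for the full multi-resolution analysis, assuming a Haar wavelet and a byte-count series at the finest bin width.

```python
import math

def energy_plot(byte_counts):
    """Energy of a traffic byte-count series at successive dyadic scales.

    Returns a list of (scale, log2-energy) pairs, where scale 1 is the
    finest time scale (the bin width of `byte_counts`).
    """
    x = list(byte_counts)
    plot = []
    scale = 1
    while len(x) >= 2:
        pairs = [(x[2 * i], x[2 * i + 1]) for i in range(len(x) // 2)]
        # Haar detail coefficients capture variability at this scale.
        details = [(a - b) / math.sqrt(2) for a, b in pairs]
        energy = sum(d * d for d in details) / len(details)
        plot.append((scale, math.log2(energy) if energy > 0 else float('-inf')))
        # Haar approximation coefficients (scaled sums) feed the next,
        # coarser scale.
        x = [(a + b) / math.sqrt(2) for a, b in pairs]
        scale += 1
    return plot
```

A perfectly smooth series yields zero energy at every scale, while a series that alternates sharply between bins shows high energy at the finest scale -- which is exactly the burstiness signature the plot is meant to expose.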
So as you go from left to right, the time scale gets coarser. Similarly, as you go from bottom to top on the Y axis, where I'm showing the log of the energy, traffic becomes more bursty. So if, of two traffic plots at a particular time scale, one is higher than the other, then at that time scale that traffic is more bursty. Yes? >>: (Inaudible) milliseconds. >> Kashi V. Vishwanath: Yes, exactly. >>: (Inaudible). >> Kashi V. Vishwanath: Minus one because there's -- absolutely. Yes. >>: So (inaudible) time (inaudible) of how the (inaudible) you just give a full number for the ->> Kashi V. Vishwanath: Oh, because then I would have to show 15 different time series, and visually looking at them might not be that meaningful. This condenses the exact effect that you're trying to observe. It synthesizes all of those series into individual numbers, one per time scale. You're absolutely right, we want to look at that, but we would be hard pressed to visually compare and conclude whether one is better than the other. >>: (Inaudible) mean, standard deviation, variance; I can come up with many different shapes that have the same (inaudible). >> Kashi V. Vishwanath: That's absolutely right, which is why you're computing the energy of the traffic process. It is not exactly the mean and variance. >>: You answered that by saying okay so (inaudible). >> Kashi V. Vishwanath: So I did not pick this. So this is something ->>: (Inaudible) capturing and you wouldn't know it because you are using it around the parameters you happen to be actually capturing. >> Kashi V. Vishwanath: I'm sorry. Please repeat the question. >>: The concern was that you're condensing this essentially real-valued function into a scalar. >> Kashi V. Vishwanath: Okay. >>: And you addressed that concern by saying I didn't just condense it into those two scalars, I condensed it into this third scalar called energy. How do you know you've got the right scalars? >> Kashi V.
Vishwanath: So what I was trying to say is that Internet traffic does exhibit burstiness. In the networking literature a lot of researchers have focused on understanding the burstiness properties of Internet traffic, and the energy plot is one means to capture how bursty Internet traffic is. And there are a number of visual appeals to the energy plot, which I'll describe in a couple of minutes. Which is why I asked: without explicitly modelling burstiness, can I reproduce these salient properties that are present? That is why I chose this metric. If there were another metric on which the signatures of two traffic processes differed, and individual applications interacting with this traffic were sensitive to the difference, then yes, I would have had to reproduce that metric too. >>: (Inaudible). >> Kashi V. Vishwanath: By running the experiments. >>: (Inaudible). >> Kashi V. Vishwanath: So for example, if you come up with a new ->>: (Inaudible). >> Kashi V. Vishwanath: Okay. So I'm going to give you a one minute ->>: I think actually you have more experiments about validating your work later. >> Kashi V. Vishwanath: Yes. >>: Let's get to the end of the talk about that. >> Kashi V. Vishwanath: Okay. So here is the energy plot corresponding to one of the original traffic traces that I downloaded, for example the (inaudible), and exactly as you were saying, a time scale of one is one millisecond, but going up in powers of two, a time scale of nine corresponds to 256-millisecond time bins. Now, there are some well-known properties of this energy plot, which will come up a little later, that may not have been apparent if you just looked at the time series or some other metrics, which is why researchers prefer to go with it. For example, this dip that you're seeing here in the energy plot has commonly been attributed to the self-clocking nature of TCP.
For example, as you know, TCP works in a complex feedback control loop where you place data on the wire and wait for it to be acknowledged before deciding when to put more data on. This in turn means that there is some amount of periodicity associated with how long it takes to get data all the way across and get the acknowledgment back. This periodicity is nothing but the opposite of burstiness, or variance, in the traffic process. So in some sense you might intuitively expect to see some periodicity around the round-trip time of flows. And indeed, if you look at the original (inaudible) trace, a majority of flows show a round-trip time in the vicinity of 200 to 250 milliseconds, which explains why I see a dip in the energy at that particular time scale. This in turn means that if I came across a new trace and there was indeed some regular pattern, just plotting the energy and looking at the dip would give me a first-order approximation of the round-trip times of the majority of flows in the original trace. So this is one of the many reasons why I picked the energy plot and why other researchers have used it in the past. There are other reasons. For example, people have argued that Internet traffic has a self-similar nature, and they use the slope of the plot at large time scales to come up with an approximation of self-similarity in terms of the so-called Hurst parameter. So those are some of the visual appeals that the energy plot has. In my work, instead of trying to understand and reproduce those properties analytically, I asked myself a very simple question: can I capture Internet traffic's properties using a first-principles approach to sufficient accuracy that, for a handful of traces, I can reproduce the traffic, extract the energy plot of the reproduced trace, and match it to the original energy plot, without explicitly modelling what self-similarity is, for example? So here is what I do.
So in the first experiment, instead of generating traffic using the full power of Swing, I will cripple it, also to put a little more perspective on related work and what other people have done. In this case I'm extracting properties of users and applications, but I'm completely omitting the network, which means I'm running this experiment on a (inaudible) Internet; I'm not using the capacity and loss-rate values, whether or not the extraction was accurate. So the first thing to say here is that even in this case, as you pointed out, the coarse-bin behavior basically matches, so there is not much deviation. But if you look at the energy plot, it has nothing to do with the original traces, which means that there is scope for improvement. Now I will relax the assumption that I do not know about these properties, because indeed I extracted them. If I add in the capacity estimates, as you can see, the generated energy plot now moves a little closer to the original plot, especially in the first few time scales. If I relax this assumption further and add in the latency estimates, I get a much better match. And if you bought the argument that I gave a little while back about the round-trip time and the dip at the corresponding time scale, I should be able to reproduce that, because I'm now reproducing the distribution of latency values, and indeed I can do that. There is still a lot of scope for improvement in the first few time scales. And, as you guessed, to get that I add in the loss-rate estimates. And now, as shown by the green pair of curves, intentionally chosen to be of the same color to confuse (laughter), we can match the burstiness plot of the original traffic process by generating traffic in this fashion. Yes? >>: (Inaudible). >> Kashi V. Vishwanath: Concurrent? >>: Yes. >> Kashi V. Vishwanath: So this was around a thousand flows concurrent at any given time. >>: (Inaudible). >> Kashi V.
Vishwanath: Which is why you're seeing this. So actually, as the number of flows increases -- and I've experimented with such a trace -- you do not quite see this rich structure in the first few time scales, but you see it much later, so it is relatively flat. >>: Are you going (inaudible). >> Kashi V. Vishwanath: Yes. >>: Okay. I'll wait. >>: (Inaudible). >> Kashi V. Vishwanath: I see. >>: (Inaudible). >> Kashi V. Vishwanath: Yeah. >>: Generate (inaudible). >> Kashi V. Vishwanath: I see. >>: Traffic or through traffic. >> Kashi V. Vishwanath: True. >>: Or (inaudible). >> Kashi V. Vishwanath: True. >>: So why not create a number of (inaudible) and run some of those tools, assuming that (inaudible) a good job and it was all right, and why not do it that way (inaudible). >> Kashi V. Vishwanath: So this is where Harpoon is. That's a great question. So this is where my assumption about Harpoon comes in. I assume that Harpoon is equally accurate in extracting the user and application properties using a first-principles approach, but it does nothing about the network, or about superimposing any of these, or about establishing a complex feedback control loop between these layers. >>: (Inaudible) try to achieve (inaudible). It's not great that you actually get it completely right (inaudible) I think experts (inaudible) you try to send traffic and modelling that and you say I fit that better. >> Kashi V. Vishwanath: As a sanity check, yes. >>: But really your (inaudible) isn't about that (inaudible). >> Kashi V. Vishwanath: Yes. >>: It's (inaudible). >> Kashi V. Vishwanath: Yes. >>: (Inaudible). >> Kashi V. Vishwanath: Exactly. >>: (Inaudible). >> Kashi V. Vishwanath: Exactly. >>: And the other (inaudible). >> Kashi V. Vishwanath: I see. >>: And things (inaudible). >> Kashi V. Vishwanath: I see. >>: So if you put all of them together. >> Kashi V. Vishwanath: I see. >>: You approximately get the same behavior ->> Kashi V. Vishwanath: I see. >>: (Inaudible). >> Kashi V.
Vishwanath: Okay. Okay. So that's a great question. And I think you're absolutely right. It may actually turn out that some of the accuracy that I'm shooting for does not matter in some sense. But it turns out -- and I actually know this answer because I did a number of different experiments -- that there are specific cases where even small deviations in this plot actually matter to the application that you're trying to evaluate. So if I did not have a good match starting out, then how would I scientifically answer the question of whether, even without matching, I still captured the impact on individual applications accurately? So to get closer to that question, there is some desire that if I run a controlled version of the Internet, then at least for that I should be able to reproduce the properties that I saw. >>: (Inaudible) on this trafficking (inaudible). >> Kashi V. Vishwanath: I see. >>: (Inaudible). >> Kashi V. Vishwanath: Yeah. >>: (Inaudible). >> Kashi V. Vishwanath: Right. >>: And the environment is a lot more dynamic. >> Kashi V. Vishwanath: Sure. >>: (Inaudible). >> Kashi V. Vishwanath: And ->>: I mean whatever you do in the lab, for example, you take it out in the real world, it doesn't work (laughter) (inaudible) it doesn't work so ->> Kashi V. Vishwanath: So time permitting I'll give you an example of a controlled experiment that I did where I knew exactly what sources were generating the background traffic, and then I completely ignored that, looked at the TCP dump trace, and reproduced the traffic using Swing. I was trying to reverse-engineer what I already should have known. I ran some black-box experiments on top of this; they were interacting with the real traffic in one environment and with the reproduced traffic in the other. And I showed that the numbers matched. Then I played out the scenario that you're describing. I said the behavior of this application is going to evolve over time.
And I emulated that evolution of behavior by tuning the Swing parameters without changing the extraction process. Then I ran the black box again in this new scenario, and I showed that the application behavior matches, which gives some hope, basically, that at least for the kinds of things I did it makes sense. But you're absolutely right, in the expanded, generalized case you really don't know what aspects are important to capture. So in this work, I'm trying to argue that the aspects that I tried to capture at least make sense for common applications. But the general question is, I think, up for debate. >>: (Inaudible) of the link? >> Kashi V. Vishwanath: So this link is around five and a half megabits per second. So it is fairly underutilized -- so they said it's an OC-3 link, but the university capped it to around 20 to 25 megabits per second, so 15 to 20 percent utilized, not more than that. So were you going to say that the performance numbers look like a heavily utilized link, what ->>: Yeah, (inaudible) model application (inaudible) really matter what the traffic is doing? >> Kashi V. Vishwanath: Yes, actually the answer is yes. It depends on the application that you're trying to evaluate. For example, if you're trying to build an available-(inaudible) tool, even for a relatively underutilized link with bursty traffic patterns, it can reach potentially conflicting results based on... (End of audio)