>>Dave Maltz: Hi. I'm very pleased today to introduce George Varghese, visiting with us from the University of California at San Diego. He's been here with us for about a week and a half, and he'll be here for another week and a half, so if you'd like to chat with him, please send him email and set up a time. George has a very long background of doing very novel and innovative things: combining algorithms and network hardware, getting tremendous performance out of systems that people didn't think was possible, and he's going to be sharing some of those insights with us today. >>George Varghese: Thanks, Dave. All right. So I call this talk The Edge: Randomized Algorithms for Network Monitoring. And if the title sounds a little like some kind of band, don't worry, we'll try to explain what The Edge means. And there are two main themes in this talk. One is that hardware people are so used to the deterministic drumbeat of a clock that it's actually surprising to them that randomized algorithms can be done at all. And so when we went to Cisco, we found that there was a culture change that had to happen, but it can be done. And sometimes, when it really gives you benefits, people will consider doing it. And the second thing is that I've spent a lot of my life trying to make networking hardware fast, and the main functions were forwarding packets and trying to do packet classification and lookups, but increasingly I'm beginning to think that the next generation of problems is in monitoring and trying to figure out how well a network is doing. So those are the two themes. And I'm going to try and tell you more. So to explain these themes: when the internet first started, it was this very simple network where everybody knew each other. They called up each other when there was trouble, and they had long hair too. And the main problem was to try to make it more flexible, try to make sure it was fast, and try to make sure it was scalable. So I guess they succeeded. It's pretty flexible. It runs all kinds of things. It's pretty big. It's pretty fast. But it's complicated. And so the big problems now are things like overloads. The symptoms of success, you know. It's like when people, after a certain point, have blood pressure; it's often overloads, attacks, and failures. So the next -- if you look at measurement and control, surprisingly, in the internet there is nothing built in for measurement and control. Even things like traceroute, which you may have heard of, were kind of hacked in after the fact, using completely unintended mechanisms for those things. And so the claim here, though, is that you would like engineered solutions, and the current generation of router vendors are actually open to this, because it turns out that routing has become commoditized and there's Broadcom and Marvell and various others selling these chips. And so almost anybody can build a router today. So the window is actually open now for a router vendor to say, okay, I'd like to add these features that would differentiate me from everybody else. And if you look at sophisticated systems like the Boeing 747 or bridges, today almost everybody builds in some sort of monitoring and control. So it's the natural thing to be more autonomic. So this talk is also going to talk about this other theme of using randomized algorithms in network chips for basically doing these kinds of engineered solutions for performance and monitoring, and since this is a very general thing, I'd like to sort of give you three specific instances. 
And I know there are theoreticians in this audience, and I think each of these three problems can be abstracted and encapsulated very simply, and I will try to show you the relevance to the real systems problems. The first one is a well-studied one, but I'll go through it very quickly just to illustrate the benefits and the edge of a randomized algorithm: finding heavy bandwidth flows, or heavy hitters, a very common problem -- I think most theoreticians know this. The second one is measuring microsecond latencies in networks, and that's a little surprising for people who think of the internet as being a millisecond or even a second. And the third problem is logging all infected nodes during an attack given certain constraints like limited memory and bandwidth. So those are three problems. And the main point -- from the theoretician's point of view, and this is where The Edge comes in -- is that in each case a simple sampling scheme will do the job. The only thing, though, is that if you add a little bit of hardware and put a little bit of processing in the router, you get an edge. You do something better in some quantifiable way. And that's what I mean by the edge. So I'm going to try to demonstrate the edge, and at the end, what is common to all of these is I'll show you the edge over simple sampling. So wait for that. And let's just start focusing on the -- but there are some rules here that are different from standard algorithms. So we need to specify the rules of the game. So the first rule of the game is that we have comparatively small amounts of memory. If you're used to a PC and think of the infinite amount of memory a PC has, you say, what is the problem? Why are these guys worried? But it turns out that in routers we have to do everything on chip, and so it's more like cache memory, which is limited at every level of technology. So it's not like we have ten registers. It's on-chip cache, so it is on-chip SRAM. And so we have maybe numbers like 32 megabits, which is quite a bit. But nevertheless, it's not scaling with the number of concurrent internet flows, so you can't just remember everybody. You have to do something smarter than just remembering everybody. Also, we have very little processing. I like to tell people that we live in the world of constant time complexity. Even log n is slow for us, right? So if you look at the memories we have, a 40-byte packet has 8 nanoseconds to be processed. Even the most aggressive on-chip memory is, like, 1 nanosecond. So you get 8 reads or writes to memory. Now, that's cheating a little bit because there's a certain amount of parallelism. You can do things in parallel. But I haven't seen chips at more than a factor of 30 parallelism. So from beginning to end you get 240 reads or writes to memory. So it's a very limited canvas that you have. And within that canvas you have to do very simple things. Yes? >>: I just want to ask, in general in this field, 40 bytes is the absolute worst case. Are people actually designing hardware for 40 bytes? >>George Varghese: Yes, they do. So you could argue that that's not the right thing. But there are tests today for wire speed, and Light Reading has a certain test where they'll do bake-offs between Cisco and Juniper routers, and if one of them fails the wire speed test, then that's big news. Everybody seems to have bought into this Kool-Aid. So you can argue, but that's a whole other topic. 
So let's just believe this for now. Whatever it is, there is some number that ->>: [inaudible] >>George Varghese: Half the packets are that size, right? 50 percent of the packets. So that's a good point. Do you have a question? Okay. All right. So let's start with a specific problem, right? Enough of the generalities. And this is a very simple problem, and the problem is what is called heavy hitters. You are a router and you're sitting here and you're watching a stream of packets flow through you, and the packets have some key. And let's assume it's a source address, okay? So as they're marching through you, you see S2, S1, S5, S2, S6, S2, and if you had plenty of memory or you just looked at the stream, it's easy to see that S2 is occurring quite often compared to the rest. And so S2 is a heavy hitter, an elephant, and the rest are mice. Now, we know that if we had memory for all of them, a simple hash table suffices. Every time we see a new source address, we start a new entry, and then we bump a counter every time we see that source address again. And so at the end of the interval we see S2 is 3, and if the threshold is, say, 2, we declare S2 as the heavy hitter and we're done. The problem is, by our constraints, we do not have memory for all the possible source addresses. We could have millions of source addresses in a one-second interval. So what we would like to do is somehow directly extract the heavy hitters without storing all of them. And this is a well-known problem. So here's where the statistician comes in and says, oh, that's easy, I simply sample. Right? Depending on the size of my memory -- let's say I have 10,000 pieces of memory and a million flows -- I could sample maybe 1 in 100 or 1 in 1,000, and with high probability, the heavy people will float into my sample. And I can count how many times they occur in the sample and scale up by the inverse of the sampling factor to estimate how much they sent. But there's a lot of variance in that estimate of the amount they sent. You don't just want to find them, you want to find out how much they sent. Now, we have a very, very simple scheme which we call sample and hold. Sample and hold says that we sample as before, as in the sampling scheme, but then once we sample a source, we store it in a hash table and we watch all the packets sent by that source. So think about it this way. The Gallup poll estimates whether Bush or Gore will win by sampling households, a thousand households, and says Bush, Gore, Bush, Gore, and then they make an estimate of who's likely to win. But the problem for the Gallup poll is that the expense is the sampling, going to those households. For us sampling is not an expense, because the households are tramping right through the polling booth, so to speak. So we might as well, once we've sampled somebody, watch all of their packets. And intuitively what happens is it's a variance reduction technique, because the variance comes in only the first time. Once a source floats into the cache, we've got it exactly and it's deterministic. So you can prove this. You can prove that the standard error is basically order 1 over M -- M is the memory -- while in standard sampling it's 1 over square root of M. It's not hard to believe because variance goes as a square root, right? And this is actually quite significant, right? So if memory is 10,000, the error is 1 over 10,000 in one case and 1 over 100 in the other. So that's orders of magnitude. 
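A minimal sketch of the sample-and-hold idea just described, assuming a fixed sampling probability p and using an ordinary Python dictionary to stand in for the small on-chip flow table; the names and the toy stream are only for illustration:

```python
import random

def sample_and_hold(packets, p, memory_slots):
    """Sample-and-hold sketch: once a source is sampled into the table,
    every later packet from it is counted exactly (no more sampling noise)."""
    table = {}  # source -> exact count since the source entered the table
    for src in packets:
        if src in table:
            table[src] += 1                           # held: count deterministically
        elif len(table) < memory_slots and random.random() < p:
            table[src] = 1                            # sampled for the first time: start holding
    return table

# Toy stream from the talk: S2 is the heavy hitter.
stream = ["S2", "S1", "S5", "S2", "S6", "S2"]
print(sample_and_hold(stream, p=0.5, memory_slots=4))
```

The variance comes only from the packets a source sends before it happens to be sampled; after that the count is exact, which is the intuition behind the order 1 over M standard error.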
Now, [inaudible] Prabhakar at Stanford and some students found a simple improvement of this, which you actually find now in a Cisco chip, where he basically noticed that in sample and hold, the small guys come in and pollute the cache. So all we need to do is recycle the small guys and get rid of them, and so he calls that elephant traps, which is basically sample and hold plus a recycling scheme. So, again, the point here is the theme. Why is it important to find heavy hitters? Because a lot of people would like to figure out when attacks happen, and attacks are often caused by people who send a significantly larger traffic footprint than the rest. So that's why it's useful. And there are many other reasons. You would like to do traffic engineering for the very largest flows because 80 percent of the traffic is dominated by them -- so that's a systems issue. And it's easy to do if you have infinite memory, but you don't have it. Simple sampling will work, but you can do better by adding a little bit of memory and processing. So this just illustrates our theme, and now let's get to the first of the two bigger topics, which is measuring microsecond latency. All right. So basically when you tell people that people now care in the internet about measuring latencies down to microseconds, nobody can believe it, because you say, look, my email took me ten seconds! What are you talking about, microseconds? Well, in fact, the first wave of latency-critical applications appeared about 10 years ago, and they were things like voice over IP, IPTV, and gamers. Gamers care. But if you look hard at the numbers, most people believe that as long as you get within two hundred milliseconds of latency, this class of applications is perfectly happy. So why microseconds? Well, you may not believe it, but there's at least a sense that these are not the drivers for microsecond latency. In all of these cases, it's a human being that's reacting to a message. And you can imagine, how can a human being tell the difference between a millisecond and a microsecond? That's impossible. The fundamental shift occurs when machines are sending messages to machines. And they can tell the difference. So the classic example, somewhat discredited after the crash, is automated financial programs, right? So why do automated financial programs care? So the Dow Jones feed is giving the feed of stock prices, and you have two bankers, Merrill Lynch and, let's say, Smith Barney, and one of them has a lower latency link by maybe 100 microseconds. Because it's a program, it can go ahead and see a cheap stock and bid up that stock before the other guy gets to it, and so there's arbitrage, and they're very conscious of that. You could look at the stock exchange, look at all the financial press, and they talk about microsecond traders and edges and all kinds of stuff. So this is very serious. I mean, the traders do care. In fact, the biggest market for low latency is in these automated systems -- in the stock exchange, and verticals -- like Cisco has verticals. So they are actually talking about less than 100 microseconds of latency. They care. >>: [inaudible] >>George Varghese: It depends. >>: [inaudible] >>George Varghese: Right. The average is 11 seconds, because they do a lot of short-term trading, which is bad for the overall economy, but nevertheless, let's keep that aside [laughter]. 
And, also, they want very small loss, because it turns out that if you lose one packet, the protocols tend to have 200-millisecond timeouts, and that's a disaster from the point of view of all of these people. But to me the more fundamental reason is high-performance computing. In high-performance computing, computers are working together with other computers in clusters to do large parallel jobs. That would be true at FedEx doing route discovery, or Southwest finding routes. So the big Fortune 500 companies are doing these things. Now, a cluster cares because every time you wait 100 microseconds, a computer is wasting so many instructions which it could have used to do the job faster. That is fundamental. Now, hitherto, these two interconnects have been separate. The computers have had their own interconnects, like InfiniBand and Fibre Channel, and the internet has had Ethernet. But Ethernet has been fundamentally cheaper, so there's been a huge shift in the last two years to try to put all of this on top of Ethernet. But then Ethernet needs to have low latency and low loss. And, of course, Microsoft knows that, and so the group I'm working with is learning tremendously about all these things. But either way, if you look at this, you really at least need to measure to know whether you're getting low latency or low loss. So with all of that motivation, let's define the problem. I don't want to talk about these things except that, for those of you who know networking, there are things like SNMP and NetFlow, but they're too coarse. They have millisecond or second resolution, and you can store all the records in NetFlow, but you often lose a lot of them, so there's no really good solution. All right. Now, what people do today is they go ahead and send test messages periodically -- they call them probes -- from one end to another and see how long it takes, and they time stamp it, and that's their sort of external check that things are working well. But you can't send these test messages every microsecond. You can send them every millisecond, because they take a lot of bandwidth -- and then if you want to find out where the problem is, you have to do what is known as a join. You have to say, hey, I've sent these two messages, and both of them are seeing high latency, so possibly in the intersection of the two paths there is a bottleneck. But that's all inference -- now, what we're going to try to do is invert this whole process and directly engineer this, as opposed to treating the network as a black box. So our model is basically going to be that every router in the world is going to add a little bit of hardware to every link in the path so that it can measure the latency on every link. So now, first of all, if there's a problem somewhere in the middle of the network, you directly know the answer. You don't have to infer the answer from a black box. Secondly, if you want a path metric, you just add up the links. Right? Okay. So basically, in order to do that, what I want to do in this part of the talk is show you a simple data structure that can help you find latencies on a link. It's randomized, or hash-based, and we're going to calculate loss, average delay, and variance. And so in order to do this I have to explain and abstract the problem. So I'll give you a model, I'll show you why simple data structures will not work, and then I'll explain the algorithm. 
All right. So here's what we're going to do: we're going to segment the link. And, remember, this is between two routers. Or it may not be between two routers. And, actually, the more common model -- and this is important, I haven't drawn it -- is that you have two things, a router and a link. The links are often very good. They hardly have any delay and they hardly have any loss. Most of the loss is between the input port of the router and the output port of the router, because there are queues and fabrics and stuff. So when I say a link, this is the more important link I'm talking about: the link between the input port of a router and the output port of a router. So don't think of the fiber-optic link between routers, because that has almost no delay and almost no loss. So be aware that that's where most of the loss is. So that's what I mean exactly by link, but it looks like a link, so don't get confused. So you have three packets: a black, a gray, and a white. And packets always travel unidirectionally, so you have to do it in two directions separately to find the delays in both directions. And you divide time into equal bins; we typically use some measurement interval. So the packets go across, and the idea is that both S and R are going to maintain some state. The sender maintains DS and the receiver maintains DR. And at the end of the interval S transfers its state to R, and then the receiver computes the measure based on some function of DS and DR. Okay? Is that a simple model? Easy to understand? Okay. So we're going to make a couple of assumptions. We're going to assume the link is FIFO, first in, first out. And we can extend it. Now, within routers you actually sometimes do stripe packets, but you stripe them internally, and before they come out you resequence because TCP doesn't like reordering. So if you do it at certain points, you can make this assumption. We're going to assume that clocks are synchronized. That is, if this guy says 12 o'clock, then this guy is going to say 12 o'clock to within a microsecond. Now, that wasn't true a few years ago. Increasingly routers are doing this. At least the Cisco Nexus does this, and it's microsecond synchronization. Now, across links, between this end and this end, it's a little harder, because if you use something called NTP it has millisecond accuracy -- and you say, what is this? We need microseconds. But there is a new protocol called IEEE 1588 that is hardware-based, that directly takes time stamps in hardware and bypasses a lot of the software. So some of these assumptions are becoming true, and we're going to assume that a little bit of hardware can be put into routers. And the last assumption is very important. If you could put a time stamp in every packet, the problem is trivial. All you do is record the sent time at the input port, and the receiver takes a look at the sent time and looks at its current time. The trouble is that there is no such time stamp field, right? And even if you did it, it's essentially linear overhead. You're adding this thing to every packet. We're going to try to do better than linear overhead. So let's start by taking this ->>: Wait, George ->>George Varghese: Yes. Go ahead. >>: If I had time sync across all the routers, then I could sample, say I'll put this extra header in occasionally or I'll put a pacing packet in occasionally ->>George Varghese: You could do that too. But that's sampling. So let's talk about the [inaudible]. Both of them will come to the same thing. 
Either you put it on everything or you sample. And both of those are the existing work, and I'm going to tell you why they're not as good. Right? Okay. So we're going to assume very little high-speed memory. We're also going to assume limited processing -- but the other assumption is that you can't really send a lot of messages between S and R. Because if you could, you could just send arbitrary amounts of information. And routers don't generally like to send a lot of control information between this port and this port. So you can send a message every millisecond but not every microsecond. So just to get the numbers right, if you consider a 10-gigabit-per-second link, in one second you can get 5 million packets. So the numbers are large. You can't afford to keep 5 million pieces of memory. And you can assume about one control packet per millisecond, possibly even per second. Okay. So let's start with the simplest scheme, which is computing loss. And loss is so trivial, you wonder why these router vendors don't do this. And, actually, I don't know the answer. So, for example, if you have three packets and one of them is lost, it's easy. You simply keep a counter at the sender of how much you sent, a counter at the receiver of how much you received, and at the end of the interval S sends 3 to the other end, the guy subtracts and says 1 out of 3 was lost. Big deal. Okay? So why can't you do that with latency? Well, you kind of could, but the simplest way would be to simply keep time stamps at the sender -- remember, there's no time stamp in the packet. So you keep a time stamp at the sender, 10, right? And it was received at 23. The second packet was sent at 12, received at 26. Remember, they're synchronized. And 15 and 35. And at the end of the interval you ship the entire truckload to the other end, subtract and divide, and that's your average. But this is 3 packets; the real number is 5 million. So as soon as you see 5 million you say, ah, well, maybe not. All right. And that's quite a lot of overhead. So it is high, and remember that's a time stamp per packet -- and within a router you could add a time stamp to each packet internally, but it's still similar overhead. You're adding a time stamp, which is quite large, like 4 bytes or 8 bytes per packet, and for 40-byte packets that is quite a lot of overhead even if you're doing it per router. So you would like to do better than this. All right. Now, obviously, as I told you at the beginning, there is a simple sampling solution to everything. So obviously you could sample. So don't store all the time stamps, but let's say you sample the first and the third. So the first one is 10 and 23, and the third is 15 and 35, and you have to have a reasonable way to sample. But that's easy. You can do content-based sampling. And now you take the two samples, send them to the other end, match them with the corresponding received time stamps, subtract, divide by 2. So what's the problem with sampling? Typically, to get the numbers reasonable, you want something like 1 in 100 samples. And the number of samples is quite small in the end. You get about 250 to make your memory reasonable. Now, the error reduces as the square root of the sample size, so when you're trying to get to microsecond detection, these small sample sizes do matter. You would ideally like all the samples if you could. Go ahead. >>: Are you assuming that there's no loss of the samples? 
>>George Varghese: Good point. There is loss. So in this case I'm just saying if there is loss, then some of the samples are lost, and you decide when you go to the other end ->>: [inaudible] >>George Varghese: Let's talk about that when we talk about the real scheme. I don't want to talk about all the purported schemes before I talk about the -- okay. So this is okay. But we can do better. We can get much better standard error. So now the simplest scheme you can do is to simply keep a counter which is the sum of all the sender time stamps -- 10, 12, 15 -- and a counter which is the sum of all the receiver time stamps. So far no loss. Assume no loss for everything. So you store the time stamp sums, and you also store the number -- I didn't show you in the animation -- 3, right? And now you simply take 23 plus 26 plus 35, minus 10 plus 12 plus 15 -- it's just a counter, possibly 64 bits -- subtract, CR minus CS divided by N, and you have the number. So it's like, what is this talk about? Well, as [inaudible] pointed out, the whole issue is loss. So loss can actually hurt pretty badly. And let's see how badly. So the simplest counterexample: consider an interval of time T. T is big, like one second. Two packets, one sent at half the interval, at half a second, and the second one sent at 1 second. Both of them take zero time. Right? The first packet is lost. The sender's sum is a half plus 1, 1 and a half, right? The receiver just measures 1. So when you subtract the two, you get half a second, and if you don't do any other kind of thing -- if you divide by 1, you get half a second when the actual delay is zero. So you can be off by orders of magnitude because of loss. And simple approaches like Bloom filters don't quite work. >>: [inaudible] >>George Varghese: So you could do error detection, but not error correction. So you can detect that you have a problem, but you would rather get some estimate when there's loss as opposed to saying sorry -- so, yeah, by simply sending the counter you can detect that there's a problem, but you can't [inaudible] correct. So now you're almost at the obvious idea. So what's the obvious idea? We know that a single copy of this algorithm can be detected to be wrong, but it can't be corrected. So, well, you run multiple copies. Right? So it's pretty easy. But before we do that, let's start with the theoretical perspective. From the theory point of view, this is a streaming algorithm, but it's not quite the same as a normal streaming algorithm. In a normal streaming algorithm you have a stream of data and you have a single point that computes a function on that stream. Max, min, median, whatever you want. Right? Here what happens is you have two streams. You have a stream of sent time stamps and a stream of received time stamps. And what you want to do is compute a function that coordinates corresponding points in the two streams. So you want to take, for example, the first received minus the first sent, the second received minus the second sent, and so on and so forth. So you have to coordinate those. And secondly, you can have loss. So these are two small things, but, for example, just to show the theoreticians here, max is completely trivial in a standard streaming model. You simply keep the max. But here you can prove, using a reduction to communication complexity, that it provably takes linear time and linear storage. 
So if you wanted to compute the maximum delay, you can prove that you have to keep all the time stamps, pretty much. So that's a big difference in model, right? So that's just for the theoreticians. If you're a systems person, you don't care, forget it. Go ahead. >>: [inaudible] >>George Varghese: Just to compute the max of all -- the maximum latency. So imagine the maximum latency is the packet -- there's one packet that takes a long time to arrive among this million packets, the needle in the haystack. So is there a simple summarizing function? No. You can show this by a reduction from [inaudible], et cetera. All right. So what do we do for delay in the presence of loss? What we're going to do is simply have multiple hash buckets, and the obvious idea is we're going to take packets and spread them across buckets. So rather than put all our eggs in one basket, or one bucket, we're going to spread packets across buckets. And now if some buckets get destroyed by loss, well, hopefully other buckets will survive. So that's the main idea. So you take your first packet, it happens to hash onto the first bucket, and you store exactly the original [inaudible], you store the sum of the time stamps and the number. So actually you store 10, and the second packet hashes onto the second bucket, so 12 goes to the second bucket. The third packet happens to go onto the first bucket again. When it collides, there's no collision resolution; you simply add it to the sum. Go ahead. >>: [inaudible] >>George Varghese: We'll talk about all the [inaudible]. That's coming. So the first thing is, if there are at most 10 losses and you have 11 buckets, at least one bucket will survive. >>: [inaudible] >>George Varghese: It doesn't really. Because you're simply hashing. You're taking the packets and hashing based on their content. Now, this is very important. You take some part of a packet that doesn't change, and that's important because the sender and receiver must hash in exactly the same way so that the correspondence is built in. Otherwise they'd go to different buckets and you'd get the wrong estimates. Okay? So the fourth packet comes in and it goes into the second bucket. And so the first bucket has the sum of two sent time stamps, 10 plus 15, and the second one has 12 plus 17. Now, what's happened here? Well, it's easy. The white packet has been lost, and so it wasn't counted in its bucket at the receiver. So the summary data structure, the synopsis you keep, is the sum of the time stamps, 25, but you also keep the number, 2 -- and 29 and 2 for the other bucket -- while on the receiver side the bucket that lost the packet only has a count of 1. So you ship the summary to the other end, and now your combining operator is easy. Whenever you find the numbers don't match, you ignore that bucket. If the numbers match, you simply aggregate all of them and divide by the total number. So in this case there's only one bucket that works. You take 65 minus 29, divide by 2. And that's basically the idea. Now, you have this very simple idea, and then you write a paper that complicates it in the following senses, right? To make it work, you have to add a little more complexity, but the idea is very simple. So, first of all, you can't afford a lot of memory. So how many buckets could you have? Maybe a hundred, right? So that suggests that you can only deal with 99 losses. But in real life, losses are rare, but once in a while you might lose all packets or you could lose a lot. 
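A minimal sketch of the bucket idea just described, assuming synchronized clocks, a content-based hash on invariant packet bytes (zlib.crc32 stands in for the hardware hash), and a single bank of buckets; bucket pairs whose packet counts disagree are simply discarded:

```python
import zlib

NUM_BUCKETS = 4

def bucket_of(packet_bytes):
    # Hash on packet content that doesn't change in flight, so sender and
    # receiver put each packet in the same bucket.
    return zlib.crc32(packet_bytes) % NUM_BUCKETS

class Bank:
    def __init__(self):
        self.sum = [0] * NUM_BUCKETS    # sum of timestamps per bucket
        self.count = [0] * NUM_BUCKETS  # packets per bucket

    def record(self, packet_bytes, timestamp):
        b = bucket_of(packet_bytes)
        self.sum[b] += timestamp
        self.count[b] += 1

def average_delay(sender, receiver):
    # Use only buckets whose counts match, i.e. buckets untouched by loss.
    delay_sum, pkts = 0, 0
    for b in range(NUM_BUCKETS):
        if sender.count[b] == receiver.count[b] and sender.count[b] > 0:
            delay_sum += receiver.sum[b] - sender.sum[b]
            pkts += sender.count[b]
    return delay_sum / pkts if pkts else None

# Toy run: four packets, the third is lost in flight.
snd, rcv = Bank(), Bank()
for content, t_send, t_recv in [(b"p1", 10, 23), (b"p2", 12, 26),
                                (b"p3", 15, None), (b"p4", 17, 30)]:
    snd.record(content, t_send)
    if t_recv is not None:
        rcv.record(content, t_recv)
print(average_delay(snd, rcv))  # buckets hit by the lost packet are ignored
```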
So you really can't tell what the losses are, so what we now do is bring back sampling into the mix, right? We're dealing with an adversary who's giving us unknown loss. So we're going to control the loss after sampling such that at least half the buckets will survive. So we'll get a good estimate from those. And now the next problem is we don't know the loss rate -- if we knew the loss rate, we could adjust the sampling rate accordingly. But we don't know the loss rate in advance. So now all we do is run a logarithmic number of copies of this -- tuned for high loss, low loss, medium loss. So now we're completely resilient. In the internet environment, we hate to make assumptions. We don't know anything. And the net result is quite good: with about, like, 500 buckets and reasonable amounts of chip real estate, you can do a lot of this. >>: [inaudible] >>George Varghese: The summary itself could get lost. You're right. So the idea is ->>: [inaudible] >>George Varghese: No, you're right. You could have done that, but you're right. If the summary gets lost, you do lose the estimate for that interval. So you're right. In that case you'll -- yeah, you're right. So let's see. So the difference is ->>: [inaudible] >>George Varghese: You could send it five times. That's right. Random times and -- yeah. But that's probably something we should have said. Yes. >>: Do you perhaps want to exploit the fact that you have time in your favor, where you could compute the summary at regular intervals? Because once you compute the summary, you can store that information and you have that information, correct? >>George Varghese: Right. >>: Suppose you keep collecting information in these buckets for a long period of time, I guess ->>George Varghese: You don't. So every interval -- I should tell you, every interval you reset everything. >>: So you just control that to [inaudible]. >>George Varghese: Right. So typically the intervals depend on some manager, who will say, you know, I want the interval to be a second, right? And I want the average latency in a second, and I also want the variance. Right. So those intervals come from higher levels. But every interval you reset everything and start the whole algorithm again. All the counters go to zero. All right. >>: And that's triggered by the control packet? >>George Varghese: Yes. Exactly. Every time the control packets come in, we redo this. And we do have resiliency mechanisms -- we put sequence numbers on the control packets so we can detect losses and stay synchronized. But those are our control packets, so we can synchronize them. The point behind this slide is simply that if you compare this to the active probe kind of approach, you get an order of magnitude difference given the bandwidth limits. The red ones are just much smaller errors than you would get by sending messages periodically, because periodic messages have only so much control bandwidth, so your sample size is small. Especially where the loss is small, you pretty much get all the samples, so you get a perfect estimate. As the loss goes up, you can see the error goes up, but not as badly. At some point, though, the error gets very bad, and then it degenerates to simple sampling -- it's like sampling once every control packet. All right. So far you say, all right, the average is kind of obvious, right? 
Because instead of taking R1 minus S1, plus R2 minus S2, and so on, and dividing by N, you could simply take, you know, sigma Ri minus sigma Si and divide by N. And that's obvious because it's a linear operator. But what about variance? It's much more interesting, right? Because a lot of people would like variance, because it's not enough to know that your average delay is 10 microseconds. You'd really like to say that, look, in the 99 percent case, it never exceeds so much. You can't get the max, it's too hard, but at least getting a variance is useful. All right. So variance is not a linear operator. You've got to take the received time stamp minus the sent time stamp, square it, plus R2 minus S2, squared, plus R3 minus S3, squared. So it's not obvious how you could simply take the sum of the received and the sum of the sent and square them. And most of the simple things don't work. Fortunately, we just steal this idea from AMS for estimating the second moment, and I'll just describe it operationally. Normally what we did is, at the receiver side and the sender side, we added the time stamps for everybody in the bucket. Now we're going to simply have another hash which takes the packet and decides whether to add or subtract. So we're going to take a time stamp and add or subtract it, with equal probability, from this counter. And so we're going to keep this up/down counter for each bucket, and a similar corresponding one at the receiver. And now we're going to take the two, subtract them, square, and divide by N, and that's going to be an estimate of the variance. So why does this work? Intuitively, it's fairly straightforward to see what's happening. Each of these counters, instead of taking the send 1 counter plus send 2 plus send 3 plus send 4, is basically taking S1 plus or minus S2 plus or minus S3, and the other is taking R1 plus or minus R2 plus or minus R3, with the same signs. So when you subtract these two, you can collect the terms together as plus or minus R1 minus S1, plus or minus R2 minus S2, plus or minus R3 minus S3, and so on. So when you square, what's going to happen is you're going to get all the right terms. You're going to get exactly R1 minus S1 squared, which you want, but all the cross-product terms are going to disappear because you're going to get plus with equal probability and minus with equal probability. So in the end you're simply going to get the sum of the Ri minus Si squared, which is exactly what you need for the variance. So that's the AMS result, and so you directly get it. But from the hardware point of view, it's no more complicated than the other one. You just have to add one more hash, look at one bit, and subtract now instead of adding. So you need a subtractor as well as an adder. Yes? >>: Is the plus or minus also consistent? >>George Varghese: It has to be consistent. So that's essential. If you don't, you're screwed, right? So you're right. So the hash is, again, based on the packet. So when you keep R1 plus or minus R2 and he keeps S1 plus or minus S2, that's what makes sure that the signs of corresponding terms are the same when you subtract, and that's important. Otherwise it doesn't work. 
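A minimal sketch of the sign-hash extension just described, the AMS-style trick, again assuming a content-based hash so the sender and receiver pick the same sign for each packet; the squared difference of the signed sums estimates the second moment of the delay, and subtracting the square of the average delay then gives the variance:

```python
import zlib

def sign_of(packet_bytes):
    # A second content-based hash decides +1 or -1; both ends must agree,
    # so it is derived from invariant packet bytes (crc32 with a different
    # starting value stands in for the second hardware hash).
    return 1 if (zlib.crc32(packet_bytes, 0xFEED) & 1) else -1

class SignedCounter:
    """Per-bucket up/down counter kept alongside the plain sum and count."""
    def __init__(self):
        self.signed_sum = 0
        self.count = 0

    def record(self, packet_bytes, timestamp):
        self.signed_sum += sign_of(packet_bytes) * timestamp
        self.count += 1

def second_moment_estimate(sender, receiver):
    # (sum_i s_i * (R_i - S_i))^2 has expectation sum_i (R_i - S_i)^2,
    # because the cross terms carry random signs and cancel in expectation.
    if sender.count == 0 or sender.count != receiver.count:
        return None  # bucket hit by loss: discard, just as for the average
    diff = receiver.signed_sum - sender.signed_sum
    return diff * diff / sender.count
```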
All right. So basically you can do this -- we checked with our friends at Cisco, and there are standard chip sizes that are very conservative, 95 square millimeters, and this scheme will take, like, one percent of such a [inaudible]. Now, it turns out that most chips have lots of gates today. So they do have gates to play with, and it's a very reasonable proposition to ask them to do something like this. The FIFO model is still true between the input ports and output ports of a router. You can deploy this by starting within single routers, where most of the delay is, and then later go across routers as time synchronization gets done. All right. So, again, let me just summarize this part. With the rise of automated trading and video, fine-grained latency -- if you forget everything else, remember that suddenly microsecond latency is becoming interesting to us. We used to be throughput-focused in networks. Suddenly we're getting beaten up about latency, right? And it's really interesting. It's changed a lot of the way -- and we proposed LDAs. It's very simple to implement and deploy, and it's capable of measuring average delay and variance, loss, and microbursts. And what's the edge? The edge is basically the number of samples. It's really a sample amplifier. It gives you a lot more samples than you would otherwise get, right? If there's no loss, you effectively have a million samples, while with simple sampling you would have M samples, let's say a thousand. So it can reduce the error by the square root of the ratio, which is quite a bit. All right. >>: Is it actually true that most of the delay is inside routers [inaudible] almost all of the loss is inside routers ->>George Varghese: So that's a very good point. The setting here is often data centers, where the propagation delay is not significant. That's very important. Once you go to the wide area, that's not true at all. But banking and high-performance computing can be all within data centers. So, actually, that context should have been set before. My mistake. >>: [inaudible] >>George Varghese: We don't do the reverse hashing. In networking we follow the [inaudible] style stuff. We completely ignore their hashing and just do what we can do in hardware. And our experiments say it just works great. And Mike has a lot of results with [inaudible] saying that various theoretical -- we've not bothered to verify that. We just simply do it. That's a good point. The kind of hashing we do is typically like this. One of the simplest ways is we consider it as multiplication of the key by a random matrix of zeros and ones, and that seems to just work really well. And it's really easy because it's a network of XOR gates, and we can implement this really fast. >>: [inaudible] >>George Varghese: It probably applies. Most networking people have cheerfully ignored the reasons for [inaudible] independence and we just simply do it. But it's a good point. At some point somebody should verify there is enough entropy to make this happen. We expect there is. We should have measured real traffic streams and verified what we did. Sorry. So much for [inaudible] independence. 
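A minimal sketch of the hash just mentioned -- multiplying the key by a random 0/1 matrix over GF(2), which in hardware is just a network of XOR gates; each output bit is the parity of a random subset of the key bits, and the bit widths here are illustrative:

```python
import random

KEY_BITS, HASH_BITS = 32, 8

# One random row per output bit; bit j of the row selects key bit j.
ROWS = [random.getrandbits(KEY_BITS) for _ in range(HASH_BITS)]

def matrix_hash(key):
    h = 0
    for row in ROWS:
        parity = bin(key & row).count("1") & 1   # XOR of the selected key bits
        h = (h << 1) | parity
    return h

print(matrix_hash(0xDEADBEEF))
```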
So now I want to completely change context, and the only way I can change context is to tell you a story so that you can erase all your memories of the previous part, because we're going to change the context, change the measures, change the problem, and if you get it confused with the previous part, we're in trouble. Okay. So my friend -- I'm from India, and I had a friend from India whose mother came to visit Old [inaudible] Farm, I think in Massachusetts, and she had this little circle on her forehead that Indian women call a bindi, right? So an American came up to this friend's mother and said, that's a beautiful thing, how do you do this? And so my friend launched into this spiel about how Indian women are trained from birth to draw these beautiful circles and they practice hard every day before a mirror. And as he was saying this, his mother peeled this paste-on thing off her forehead, and he was so embarrassed. So the thing is, he was right. In the old days they used to do it like that, right? But his model of the universe had shifted, right? So, similarly, if you're not careful, we could completely change our model. So be careful. It's a totally new problem. So this is with Mike Mitzenmacher and Terry Lam, and it appeared in NSDI. It's called Carousel. So the problem at hand is that we have this deluge of interesting events coming into networks. Attacks are happening, packets are being dropped, and you have this manager who's trying to understand, to make sense of, this deluge of data. And the standard approach to getting a coherent view is basically sampling and summarizing. You either take a sample of some of the events, or you summarize and you just keep a count. But there are certain contexts where you really want complete information. So let me give you an example. So imagine a worm breaks out and attacks Microsoft -- well, actually, a university is better. It cannot, by definition, attack Microsoft. Microsoft is sacrosanct. Worms know better than that [laughter]. But let's say a university. Fair game. So they come to UCSD and they start infecting all these machines, and this is Code Red, and it shows you how fast it took over the world and how many machines it took over, right? So at that point, during the attack, managers would like to know not a sample of which machines were infected, not a count of how many machines were infected; they want to know which machines are infected. Your machine is infected, yours is not. And so they can remediate. So they want a collection of all of them. Another example is list all the stations of a particular type: the ones deploying IPv6, the ones deploying some TCP feature. And a problem that I encountered many years ago and was always haunted by is list all the MAC addresses. And so what's the issue, right? What's the issue with this? So before we go on, let's make our problem concrete, although it applies to many settings, with the following simple setting. The simple setting is that you're trying to detect bad messages, and there is a device called an intrusion detection device. It's a detector of bad packets. Now, each packet has a key, which is its source, let's say A, and some content which is somehow bad. There's a bad program there. Now, fortunately, the device has a signature, which is a regular expression on the content, which decides that this is Slammer, which is a worm, or this is Witty. 
So the idea is, as the packets are passing, the intrusion detection device taps the link and simply applies all of these signatures in parallel, and if a packet matches the red signature, the idea is it should send a little message with a flag saying it's red and the source A. Similarly, it should send B and it should send C. So at the end, you should be able to get all of your infected ones and also know which ones are infected, so that you can go ahead and remediate them. All right. So this is a classic kind of issue. Most of the work in intrusion detection is focused on fast matching. As you can see, that's a problem. But very few people have worried about the logging. You think, ah, it's a boring thing, just log it. But under certain assumptions, logging is actually hard. So let's see why. So the model, the queueing model: there are N sources, and N is large, maybe a million. They're arriving at a queue at some very large rate, capital B, which is, let's say, 10 gigabits per second. The intrusion detection device is a hardware device and has limited memory. So it might have, say, 10,000 pieces of memory. And now it wants to eventually log everything to a sink that has infinite memory, but the logging speed is small. It's, let's say, 10 megabits per second. So think about the two extremes. If you had infinite memory, or N memory, it's easy: you simply store them and then, at leisure, log them to the sink. If you had infinite logging speed, it's easy, because as they come, you simply log them, and you don't need a queue. But when you have both constraints, you're in trouble. So just to make sure that you get the parameter space: the logging bandwidth is much smaller, several orders of magnitude smaller than the arrival rate, and the memory is much smaller than the number of sources. So think of the memory as a hundred and the sources as a million. What could you do? Okay. Well, fortunately, there's one assumption. Systems people always cheat, right? So they change the model. And the assumption is that sources are persistent. An infected machine is going to keep coming back. So if you miss it at this opportunity to log it, as long as you log it later within a reasonable time, then you're fine -- so you'd like to make up by playing games with time, so to speak. But how do you do this systematically? That's the question. So, again, you can imagine that simple sampling will work. Just keep a finite memory, and by randomness of things, things should eventually land over here, and if you have enough time, you'll probably get everybody. But it's like coupon collecting. It's not an efficient way to grab all your coupons. There are better things you can do. Okay? So in fact, the coupon collector tells you immediately that even in a simple random model, you have a log N kind of inflation, and you would like to do better than that. All right. So our result is we're going to show you a scheme called Carousel which basically forgoes this log N factor, and in fact, in any model of arrivals, we're going to log all sources in close to optimal time. What is optimal time? N, a million, divided by the logging speed. We'll be within a factor of 2 of that. And the standard approach, coupon collection, is N ln N. And adding Bloom filters and other things doesn't really help. So now here is the animation. So it's easy to see. What's the problem? 
So in this animation, you have two pieces of memory, and the sink is far away and has slow speed, and the packets are arriving here. So packet 1 arrives and is stored in memory. No problem. We have space for it. Packet 4 arrives before we've logged packet 1 because the logging speed is slow, so it's stored too. Packet 3 comes in, oops, we don't have room for it, we drop it. Packet 4 comes in again; it doesn't really matter because you already have it. Packet 1 now finally gets a chance to be logged. Packet 4 moves up in the queue, packet 1 comes in and is stored, and packet 2 comes in and is dropped. So notice that in this example you had four sources and you only logged two of them. And I could repeat this example ad nauseam. So if the timing was controlled by an adversary, it could make sure that some sources are never logged and never remediated. So this method is extremely susceptible to problems. If you assume a random model, yes, there's only a log N inflation. But if you assume a deterministic adversary, it can be really bad. So basically sources 2 and 3 are never collected, 1 is logged many times, and in the worst case N minus M sources can be missed. And, remember, N is a million, M is 10,000. So you miss almost all sources. It's unlikely, but it at least could happen. And in security systems you have to be really careful, right? Because these guys could conspire on timing and do stuff, so you really can't assume they're trying to help you get them logged. That's not their job. It's not part of their job description. Right. They won't get fired for -- okay. So, now, this is amazing -- theoreticians don't have this problem, but in systems there is this amazing fallacy that Bloom filters solve everything. Somehow people are just excited about Bloom filters. Now, a Bloom filter is basically a linear piece of memory. All it does is reduce the constant factor. It's like a hash table, except that it reduces the constant factor by maybe a factor of 10. So if you have 10,000 pieces of memory, how in the world is a Bloom filter going to help you log a million sources? It's not possible. But nevertheless, I have to go through this slide often because others will say, why didn't you use a Bloom filter? So let's just do it. So imagine 1 comes in. It's stored in the Bloom filter. There are two pieces of memory. And the Bloom filter just records a little trace that 1 was there, and then it's stored in the queue. 4 comes in. 4 is stored in the second slot, right? And now 3 comes in and it is dropped because the Bloom filter is full. 4 comes in and rightly is dropped because the Bloom filter says it's already there. 1 is logged. 4 comes in. 1 is in the Bloom filter. 2 comes in. The Bloom filter is full. Gone. Okay. 4 goes on. And now here's the problem. At some point you can't just keep 1 and 4 in the Bloom filter forever, so you have to clear it. If you clear it, I can repeat the same timing again and cause the same loss, so I don't even want to go through all these hundreds of pieces of animation. Hopefully between friends you will agree [laughter]. If there are adversaries in the audience, we can go through this again. Okay. So whatever. It takes too long. Okay. So you can prove that it's really similar performance to the standard scheme. It really doesn't help anything. Right? And so the main point is the Bloom filter is necessarily small, and it reduces the duplicates by some marginal factor. That's all it does. All right. Okay. 
So now we need to solve the problem, in case I didn't mention that. So let's be inspired by an example from networking, which is something called admission control. In networking, it's a very common paradigm: if you can only handle, say, 10 megabits and you have 100 megabits of arriving traffic, what you really want to do is admit only 10 megabits per second, because that's what you can give good service to. And then the other guys have to wait at the edge of the network. It's like highway traffic; even highways do that. So the difficulty here is that in standard admission control, the sources cooperate. They help you. They say, all right, we'll wait at the entrance ramp. But here, these guys will be busting the ramps if they can. These are, like, infected nodes. They're not likely to follow any protocol you set. So what we would like is unilateral admission control. The question we ask is, what can a poor resource do regardless of what the sources are doing? The sources can do any old thing. And so our approach is what we call randomized admission control. In essence, we're going to take the sources and break them up into random sets that are small enough to handle, and then, to be fair, we're going to cycle through all the sets to give everybody a fair chance. So that's the essence; here's the animation to see how it works. So now we have some colors, which is great. So we have the same sources, we have the same Bloom filter -- yeah, it's useful, at least for this -- but we now have a little color filter too, right? And the color filter is going to start with red, and what we're going to do is magically color all the sources red and blue for now. And how do we color them? Think of hashing the keys: if the low-order bit of the hash is a zero, it's red; if it's a 1, it's blue. So we don't have to do anything special to paint these guys; it just happens. And now we have a color filter which says, for the next x seconds we're going to only log red sources. So ->>: Does this require that you be able to have enough different colors? >>George Varghese: Let's watch. The next question is how many colors. So you're one slide ahead again. You noticed how magically it was 2. But we'll get to that on the next slide. So what happens is 1 comes in, it's red, we are logging red, the source is fine, put him in the Bloom filter, put him in the queue, and so on. 2 comes in and 2 is dropped because it's not the right color, right? So 3 -- just in case you didn't notice, the slide telegraphs our intentions -- 3 comes in, and 3 is the right color, so it's admitted and stored. And 4 comes in. 4 is dropped. 2 comes in, it's dropped. Blue. And 3 comes in, 3 is logged, and now 1 comes in and it is rightly dropped because it's already in the Bloom filter, so there's no point in doing it again. And then 3 comes in and it is again dropped because of the Bloom filter. 4 comes in and it's dropped. And so now what happens is that you've finally got all of these done, you've logged everybody you saw in your memory, and now what you have to do is rotate the carousel. So you simply change colors and become blue. And from now on you log all blue sources. >>: [inaudible] [laughter]. >>George Varghese: Okay. And now you log blue sources and now, you know -- et cetera, et cetera. This is too painful. But anyway, 3 is dropped and so on. Okay. So -- what did I do wrong? >>: How many colors? >>George Varghese: Oh, how many colors. Okay. Okay. 
Sorry. So now I come back to your question, right? So it could be that you picked the wrong number of colors. So you generally start with -- I'm sorry, one color. You start with one color. And then you go to two and four. So the question is, when do you know that you don't have enough colors, and when do you reduce? >>: What actually worried me was an adversary can choose which color he would like to participate in. >>George Varghese: No, because the hash is under your control. He doesn't know it. >>: Okay. [inaudible]. >>George Varghese: I think that's a better model, because given the [inaudible] in the room, like, we won't -- yeah, go ahead. >>: [inaudible] >>George Varghese: It's a good point. So we're actually using the Bloom filter in two ways. One is we do use it to suppress duplicates within a color phase. So if somebody comes in that's already in the queue, we don't log him again -- that's actually not a big effect, because the memory is such a small number that the probability of getting a duplicate out of N is very small. But there's a more important reason. We use it to estimate when we are in trouble. Because if the Bloom filter gets full, we know we don't have enough colors. And then we increase the number of colors. So it's a knob that we use ->>: [inaudible] >>George Varghese: You could think of it as a hash table. To me it's just simply a smaller, constant-factor hash table, that's all. See, there's a hash table and a queue. You have to keep a hash table to estimate the number of people currently in a -- who are ->>: [inaudible] >>George Varghese: Multiple numbers hashed -- yes. We can. But a Bloom filter has more false positives -- just think of it as a hash table. >>: [inaudible] >>George Varghese: Just think of it as a hash table. So in the end what really matters is the number of entries in this hash table. >>: [inaudible] >>George Varghese: It's easy to do, because you can change colors in a microsecond by simply using one more low-order bit of your hash [inaudible], so nothing has to be changed. The sources just get recolored automatically. Not the ones that have already gone past, though. And you have to clear everything and start again. So let's see what happens. 7 comes in. 7 is the wrong color. 3 comes in, and 3 is the right color. It's stored. 4 comes in and -- now 5 comes in and something interesting happens, right? So 5 comes in and 5 is red, but there's only room for two sources, and so you know you're in trouble. You need room for three at least, right? And so at that point what you do is you say, oops, Bloom filter full, and you increase the carousel colors and you automatically recolor to four colors, and now all these guys get recolored and then the right thing happens. So you'll keep recoloring until it works. And the point is, from a theoretician's point of view, it's nothing. It's a small little counter that you bump from one bit to two bits to three bits, and the hardware just does the right thing. There's no phasing. Nothing has to be done. There's no going back and recoloring all the ones that have flown past. But you have to clear everything. And you sort of forgo your investment in logging so far. You say, ah, I don't care about that, right? And competitively speaking, you'll lose that [inaudible] factor of 2 because you're doubling, right? 1, 2, 4, 8. So like in any binary doubling, you'll only lose a factor of 2. 
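A minimal sketch of the carousel logic just described, with a Python set standing in for the small Bloom filter, zlib.crc32 with a secret seed standing in for the keyed hardware hash, and a crude "filter is full" test driving the doubling of colors; the names here are illustrative, not the actual implementation:

```python
import random
import zlib

class Carousel:
    def __init__(self, memory_slots):
        self.memory_slots = memory_slots
        self.k = 0                 # color bits: 2**k colors in the current phase
        self.color = 0             # color currently being logged
        self.seen = set()          # stands in for the small Bloom filter
        self.logged = []           # what has been sent to the sink so far
        self.seed = random.getrandbits(32)   # the adversary doesn't know the hash

    def _color_of(self, src):
        return zlib.crc32(src.encode(), self.seed) & ((1 << self.k) - 1)

    def packet(self, src):
        if self._color_of(src) != self.color:
            return                 # wrong color: drop, a later phase will catch it
        if src in self.seen:
            return                 # duplicate within this phase
        if len(self.seen) >= self.memory_slots:
            self.k += 1            # filter full: not enough colors, so double them
            self.seen.clear()      # forgo the current phase and start over
            return
        self.seen.add(src)
        self.logged.append(src)    # in reality, drained to the sink at rate L

    def rotate(self):
        # Called every T = M / L seconds: move the carousel to the next color.
        self.color = (self.color + 1) & ((1 << self.k) - 1)
        self.seen.clear()
```

Persistent sources that are dropped in one phase come back and get logged when their color comes around, which is where the within-a-factor-of-2-of-optimal bound mentioned in the talk comes from.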
>>: You'd expect the number of colors you need to be bounded by the ratio between the incoming rate and the logging rate. >>George Varghese: The number of colors is actually easy. You can do M at a time, so the number of phases you need is going to be at least N over M, and the counter that picks the colors is basically a log to base 2 of that. So very simple math will give you the number of colors based on these parameters. Yes? >>: [inaudible] the only question -- >>George Varghese: Think of them as a hash table, a slightly compressed hash table that, you know, sometimes will give you the wrong answer. >>: The only question I have about it is: I can tell I can't add this entry to the Bloom filter, but I know -- I know it's not in the Bloom filter. Is that the simple [inaudible]. >>George Varghese: Yes. Yes. Yes. Yes. So there's a little bit of a probabilistic statement there, but nevertheless, that's probably true. And now you'll redo this, and let's skip through all of this because you guys have sort of figured it out, so -- yeah. All right. So let's just summarize this algorithm. The algorithm is basically three steps. There's a partition step, which is a hash: based on the color, the lower bits of the hash function, you partition. More formally, you take H_k(S), the lower k bits of H(S), a hash of the source, and you divide the population into partitions with the same hash value. Then you iterate, because once you've done one color, you have to do all the other colors, right? And what is the phase time? It's a pretty obvious thing: if the memory is M and the logging rate is L, you might as well give enough time for that memory to be logged, so each iteration lasts T = M/L seconds. And so after that you want [inaudible]. So the constants actually fall out very nicely. Each iteration lasts T seconds, and the Bloom filter does weed out duplicates -- it's marginally useful. If you didn't have any other way to estimate the number, it would probably work; there are other ways, [inaudible] and other kinds of schemes, to estimate the number of distinct elements in a set, and those would probably work fine too. If that number is too small, you can act on it. And then, finally, you monitor. It turns out I've only shown you the uptick, but you also have to have a downtick -- the case where there are too many colors. If the worm outbreak grew and then fell, then you really want to speed up again; otherwise you'll waste time if you're trying to be optimal. So you want to make sure of all that. Okay. So you increase scale when the Bloom filter is too full, and you decrease scale when the Bloom filter is too empty. All right. So we did a number of experiments, and I'll just tell you about one of them. What we did was take a real intrusion detection system called Snort, which is publicly available -- you can go ahead and take a Linux box and load it in. We would have liked to use 10-gigabit links, but we couldn't, so we scaled the system down: we took 100 megabit per second links, we took a traffic generator that was sending traffic from 10,000 sources, and the logging rate was 1 megabit per second. We ideally would have liked higher numbers on both ends, but because we couldn't get those numbers, we scaled down the system. And we took two cases.
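Putting the three steps -- partition, iterate, monitor -- together, here is a toy end-to-end sketch. All numbers, thresholds, and names below are my own, a plain Python set plays the role of the Bloom filter, and scale changes happen only at phase boundaries, which is a simplification of what was just described:

```python
import hashlib, random

M = 1000          # memory: sources that fit per phase (assumed)
L = 100           # logging rate: sources per second (assumed)
T = M / L         # phase length: just enough time to drain the memory

def color_of(source, k):
    h = int.from_bytes(hashlib.sha256(source.encode()).digest()[:8], "big")
    return h & ((1 << k) - 1)

def carousel(packets, k=1):
    """packets yields (arrival_time, source); yields each source it manages to log."""
    color, phase_end, seen = 0, T, set()
    for t, src in packets:
        while t >= phase_end:                     # rotate the carousel
            if len(seen) >= M and k < 32:         # too full: not enough colors
                k += 1
            elif len(seen) < M // 4 and k > 1:    # too empty: too many colors
                k -= 1
            color = (color + 1) % (1 << k)
            seen, phase_end = set(), phase_end + T
        if color_of(src, k) == color and src not in seen and len(seen) < M:
            seen.add(src)
            yield src                             # hand off to the slow logger

# Synthetic check: about 5,000 persistent sources arriving far faster than L.
stream = ((i * 0.001, "10.0.%d.%d" % (random.randrange(20), random.randrange(250)))
          for i in range(200_000))
print(len(set(carousel(stream))), "distinct sources logged")
```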
In one case the source S was picked randomly on each packet; in the other case we stepped through sources 1, 2, 3, 4, up to 10,000 and then repeated. If the sources are working in concert, that periodic pattern might well occur, so it's not as unrealistic as you might think. And neither of these is completely, deterministically adversarial: the random case is closer to benign, and the periodic case is somewhat closer to the bad cases. So now, what's the measure? The measure is how long it takes to collect almost all sources as the number of sources goes up. So let's look at 10,000 sources. With Carousel, it looks like within a little bit of time you log all the sources, but if you go ahead and use standard Snort, which is basically random, there's this long tail. Just like in all coupon-collector problems, towards the end it takes increasingly long to collect the rare stamps or the rare coupons. And that's not surprising, right? And notice that the effect is much worse with the periodic pattern, which again is not surprising: random arrivals are what a coupon collector likes, but with this periodic stuff it does much worse. So if you look at the numbers, we pretty much get everything by 300 seconds; standard Snort gets it by about 1,500 seconds in the random case and much later in the periodic case, so basically it's 5 times faster with random arrivals and 100 times faster with the periodic pattern. You could argue which is the right model, but we just show you the range. Right? There are lots more experiments in the paper, lots more stuff, but I don't want to talk about it. Okay? So what are the big disadvantages of Carousel? Two big things that you have to remember; it doesn't solve every problem. First of all, it doesn't deal with one-shot logging. If a source sends something once and never speaks again, too bad, we can't deal with it. For infected nodes that's okay, because if the infected node goes away, then perhaps we're not interested in it, so we can possibly make that argument. But you've got to be careful; we don't get something for free, right? And, similarly, it's probabilistic. The final theorem is that we get 99.9 percent of all sources within a competitive time. The last one or two we might still lose. So we get almost all, but it's not perfect logging. So it is, we believe, applicable to a wide range of monitoring tasks with high line speed, low memory, and a small logging speed, and where the sources are persistent. And it's a form of randomized admission control. >>: [inaudible] >>George Varghese: There are. There are. So [inaudible] is close to 10 gigabits. So there are a few, and they are hardware, and they're doing pretty well. The company that Juniper bought, NetScreen, is also pretty good. So people are increasingly beginning to think that this should be packaged into routers. Cisco's IDS has been struggling for a long time to go from two to ten gigabits; they'll get there. Okay. So let me just tell you about the edge and then I'll be done. >>: [inaudible] >>George Varghese: So what you do is you simply lose all -- you just ignore all the logs that you've sent so far. It's like you start a new experiment. >>: [inaudible] >>George Varghese: You have to be careful. You want to make sure that you're assuming the distribution is semi-stable. If the population is changing rapidly, you could thrash. So there's an assumption built in here that the number of sources changes slowly, and we react quite slowly to changes in that number, too.
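That long tail is exactly what a coupon-collector argument predicts. A toy comparison (my own code and rates, not the experiment's setup) shows the shape of the gap between random logging and a carousel sweep:

```python
import math, random

N, M = 10_000, 10    # sources, and sources the logger can accept per tick (made up)

def random_logger(target=0.999):
    """Snort-like: each tick logs M arrivals drawn uniformly from all N sources."""
    seen, ticks = set(), 0
    while len(seen) < target * N:
        seen.update(random.randrange(N) for _ in range(M))
        ticks += 1
    return ticks

def carousel_logger():
    """Carousel-like: sweep N/M disjoint hash partitions, one partition per tick."""
    return math.ceil(N / M)        # one sweep logs every persistent source once

print("random  :", random_logger(), "ticks")   # roughly (N/M) * ln(1000), ~7,000
print("carousel:", carousel_logger(), "ticks") # 1,000
```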
So if it does change fast, we'll work with some upper bound on it. We might be slower, but it's not clear what optimal means in that case. A really clever algorithm could maybe do significantly better in a period of great [inaudible], when the number of sources is constantly changing, because when we sense a change we take some bound and work with it for a while, and we come down only much later. So there is room for play here in a theoretical sense. Okay. So let's finish up and say what all this is about the edge, right? If you look at Carousel, it's within a factor of 2 of the optimal time to log all sources, versus the standard coupon collector, which is roughly ln N times N over M. Remember, the actual edge could be infinite in an adversarial model, but let's put that aside. So if you take N equal to a million and M small, the edge is close to 14: it could be 14 times faster to log almost all sources. Now, compared to the random-arrival model, which is like simple sampling, LDA effectively gives you N samples versus M samples. If N is a million and M is 10,000, the edge is close to 10. Significant. Why 10? Because the error shrinks with the square root of the number of samples, so you take the square root of that ratio. And sample and hold has error of order 1 over M versus 1 over the square root of M, so if M is 10,000, the edge is 100. So, again, the edge is simply: compared to the corresponding simple-sampling scheme, what is the advantage of doing a little bit more in hardware? And the claim is that it's significant. You don't add too much in gates, so why shouldn't you do it? In terms of LDA, there's lots of work on streaming algorithms, but less work that I know of -- and I've asked [inaudible] and a few other people -- on two-party streaming algorithms where there's some coordination between the two ends. There's a lot of work in network tomography, which joins the results of black-box measurements, but [inaudible] directly instrumenting things as opposed to inferring where the problem is. And Carousel is a very simple idea -- the theoreticians are really interested. Basically it's randomly partitioning a large set into small enough pieces, and then the main idea is iterating through the sets, right? The only place I've seen it before, and I would love to hear of more, is the Alto Scavenger. Butler Lampson many years ago wrote a paper where they had too much stuff on disk and they were trying to rebuild the file index. They had to do this random divide and conquer: they took random partitions into smaller and smaller pieces until the number of file sets could fit into existing memory, then they rebuilt the rest and cycled without [inaudible]. But it's a different problem, because we have two limits -- memory and bandwidth -- and they had only one limit, memory. Still, there's some idea there. Okay. So, to summarize the big thing -- no, let's go back to the 20,000-foot level. For the students here, I think monitoring networks for performance is a big deal. That's what the world cares about. Pushing packets faster is nice, and people will keep doing it, but all this monitoring stuff is still unknown. And there are a number of problems; it's not just keeping a counter, right? You have to do interesting things, like finding latency. And randomized -- really, hash-based -- streaming algorithms can offer solutions that are cheap in gates at high speeds. So it would be nice -- and I described two simple hash-based algorithms I hope you'll remember.
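The arithmetic behind those edge numbers, as I read them (a quick sanity check, not taken from the slides):

```python
import math

N, M = 1_000_000, 10_000

# Logging: random collection pays roughly a coupon-collector factor of ln N
# over a carousel-style sweep, and ln(10^6) is about 14.
print(round(math.log(N), 1))       # 13.8

# LDA vs simple sampling: averaging over ~N timestamps instead of ~M samples
# shrinks the relative error by sqrt(N/M).
print(math.sqrt(N / M))            # 10.0

# Sample and hold: error ~ 1/M instead of ~ 1/sqrt(M), so the edge is sqrt(M).
print(math.sqrt(M))                # 100.0
```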
If you can remember a one-line summary of LDA, all you're doing is aggregating the time stamps by summing, and you're hashing to withstand loss. That's the one-line summary. And the main idea in Carousel is that you hash to partition the input set into smaller sets, and you cycle through the sets for fairness. Okay. That's the main idea. And remember, I suggested a band? So there should have been some music here, but I don't know how to play it. So this is a silent band, and it's called The Edge. Okay? I hope you'll remember that. Thanks. [applause] >>George Varghese: Any questions? Yes? >>: [inaudible] >>George Varghese: Right. I don't assume a good source of randomness at all; [inaudible] hash-based. The only thing -- in the analysis, we don't assume any [inaudible]; we assume the perfect classic [inaudible], that the hash is completely uniform. Right? And ideally all these results could probably be strengthened, by twisting Michael's arm a little bit, to four-way or two-way or six-way independence, but we just ignored that and used simple uniform hashing. So the only one that might require a real source of randomness is sample and hold. In sample and hold you're actually sampling a packet, and that does require a source of randomness, but ideally even that could be hash-based: you could look at the packet, take some piece of it, like the source address, do a hash on that, and if the hash value falls into some set of values, you sample. So in some sense you don't need a source of randomness. I think networking people prefer that, because our experience, even from doing Ethernet and other things, has been that it's very hard to do physical -- the only real randomness that works is some source of physical randomness like shot noise, and that's very hard. The first Ethernet implementations all locked up, so our experience has been really terrifying. So we would rather use hash-based algorithms. Hashing seems to be fine, assuming enough entropy in the stream. That's been our uniform experience for, you know, 10 or 15 years: it just seems to be fine. We should go and dot the i's and cross the t's and check more, but we haven't done it. Any other questions? While the band keeps playing silently [laughter]. >>Dave Maltz: Thank you very much. George is going to be here until next Friday. [applause]
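For completeness, a toy reading of the LDA one-liner above (my own code and numbers, not code from the talk or the paper): sender and receiver both sum timestamps into hashed buckets, and buckets whose packet counts disagree were hit by loss and are simply thrown away.

```python
import random

BUCKETS = 1024

def bucket(pkt_id: int) -> int:
    return (pkt_id * 2654435761) % BUCKETS      # stand-in for the hardware hash

send = [[0.0, 0] for _ in range(BUCKETS)]       # per bucket: [timestamp sum, count]
recv = [[0.0, 0] for _ in range(BUCKETS)]

for pkt_id in range(10_000):
    t = pkt_id * 0.001                          # send time
    b = bucket(pkt_id)
    send[b][0] += t; send[b][1] += 1
    if random.random() > 0.01:                  # 1% loss on the link
        recv[b][0] += t + 250e-6; recv[b][1] += 1   # 250 microsecond one-way delay

# Only buckets with matching counts (no loss) contribute to the estimate.
num = sum(r[0] - s[0] for s, r in zip(send, recv) if s[1] == r[1] > 0)
den = sum(r[1] for s, r in zip(send, recv) if s[1] == r[1] > 0)
print("estimated average latency:", num / den)  # close to 0.00025 seconds
```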