>> Jim Larus: It's my pleasure today to welcome Navendu Jain from UT Austin, who he tells me actually has his Ph.D., maybe not in hand, but he's finished it, turned it in, and gotten it signed, so congratulations. That's always good to hear.
>> Navendu Jain: Thank you.
>> Jim Larus: And he's going to talk to us about scalable monitoring today.
>> Navendu Jain: Thank you very much, Jim, and thank you very much for inviting me. So today I'm going to talk about PRISM, which is a scalable monitoring service we have designed and implemented for monitoring as well as managing large-scale distributed systems. This is joint work with my colleagues at Texas, both my seniors and juniors, and the main motivation for this work arises from the fundamental challenge of how you manage large-scale distributed systems. I want to make a quick disclaimer that even though I'm using an Apple and Keynote for this presentation, please do not hold it against me. Okay.
So to give you some background, in April 2003 on PlanetLab, which is a distributed research testbed comprising about 800 machines spread all over the world, a user login was compromised. Within a short period of time, half of these nodes were port scanning external sites. And as you would expect, it generated lots of complaints. So to prevent any further damage, PlanetLab had to be shut down temporarily. Now, of course you can just blame it on the hackers and say, well, what can we do about it, but that would give them too much credit, okay. Because in reality it's not just the hackers; there are several other reasons why our systems break, such as security vulnerabilities, hardware errors, and operator mistakes. And I'm sure everyone in the audience here, as well as those watching online, must have written a buggy application at some point in time. Okay.
So the key problem here is that fundamentally distributed systems are hard to manage. They are unreliable, they break in complex or even unexpected ways, and it's really hard to know what's going on in the system, okay. So in order to manage these systems, what we need to do is continuously monitor them and check if there are any problems in our underlying systems, okay.
So I'm going to give you a 10,000 foot view of the problem. This is an oversimplified view, but it captures the necessary details. Here we want to monitor a large-scale distributed system comprising tens of thousands of nodes. At each of these nodes there are several events happening at any point in time. So, for example, an event could be a CPU load spike, or a (inaudible) arrival, or a memory bug or memory leak, because our operating systems aren't perfect, right. So our aim here is to correlate these events together and find out what the loud or the big events are on a global scale. By loud events I mean those events that, when you look at each node individually, may be small, but when you look collectively across the entire network they constitute a large volume. So, for example, a network operator here might be interested in which network links are heavily congested, okay, or which web servers are heavily loaded, or whether there are any buggy applications in my system that are consuming lots of resources, okay. And even from a security perspective, monitoring these systems is very important.
Now, what makes this monitoring problem fundamentally hard are three key challenges. Our first challenge here is scalability.
We need a monitoring solution that will scale both with respect to the number of nodes as well as the amount of data volume. Now, we can do centralized monitoring that will scale to hundreds of nodes, but what we need is a solution that will scale as these systems go to tens of thousands, and the vision is to go towards millions of nodes, okay. And similarly, the data volume of these events can be massive; there are applications today that generate hundreds of gigabytes of data each day, and we want to move to even petabytes of data. So we need scalability both with respect to the number of nodes as well as the amount of data volume. Second, we want to quickly respond to these events, to detect any performance problems as well as find out about any security threats. So we want to do monitoring in real time. And finally, we need a solution that is robust against both node and network failures and still gives us an accurate view of the system. Okay. So the bottom line here is that for monitoring these systems, we need to process large volumes of data spanning tens of thousands of these nodes, and we want to do it in real time. Now, of course our actual problem is much harder because this picture is not drawn to scale, okay.
So to define the problem more broadly, our vision here is to develop a distributed monitoring framework that monitors the underlying system state, performs queries, and reacts to global events. And numerous applications which I've shown here have similar requirements, such as network monitoring, grid monitoring, storage management, sensor monitoring and so on. In the later part of this talk I'm going to show you results from some of these applications that we have built and in some cases even deployed. Okay.
So to realize this vision, we have designed and implemented PRISM. PRISM is a scalable monitoring service for managing large-scale distributed systems. Now, you might say, well, there's been a lot of work in monitoring, what's new here. Our key contribution and our key distinguishing factor here is to define precision as a new fundamental abstraction to enable scalable monitoring. So what do I mean by that? Specifically, our two big goals are to achieve high scalability and high performance, and to ensure correctness, that is, give a robust solution that can tolerate node or network failures. By defining this new precision abstraction, we will achieve these two goals. Specifically, to achieve scalability and high performance we will trade precision for performance. So think of it as getting an answer about the global state of the system: instead of giving an exact answer we will give you an approximate answer, okay. But we will bound the degree of approximation, and further we will adapt these approximation bounds to handle large-scale dynamic workloads. However, in practice failures can cause the system to violate these approximation bounds. Therefore, to ensure correctness we'll quantify how accurate our results really are and how to improve this accuracy despite large-scale failures. Okay.
>>: Is the accuracy known a priori or after the fact?
>> Navendu Jain: Yes, I'm going to talk about this in the very next slide. So here are our two big goals: to achieve scalability and high performance, and to ensure accuracy, or ensure correctness, despite failures. The way we are going to realize these two goals is to define precision in the form of three dimensions: arithmetic, temporal and network imprecision.
So arithmetic imprecision, or AI, bounds the numerical error of a query result, okay. Take as an example: instead of giving an exact answer with value a hundred, we'll give you an approximate answer, say the global query result in the system has a value of a hundred plus or minus 10 percent, and the answer is guaranteed to have a maximum numerical error of at most 10 percent. Okay. Similarly, temporal imprecision, or TI, bounds the staleness of a query result. So as an example, a query result has a maximum staleness of at most 30 seconds; that is, it reflects all events that have happened in the system up until 30 seconds ago. Okay.
So although arithmetic imprecision bounds the numerical errors and TI bounds staleness, in practice failures can cause the system to violate these guarantees. Okay. Therefore, we define a fundamentally new abstraction, network imprecision, that bounds this uncertainty or this ambiguity due to failures. Very simply, think of NI as a good or a bad flag. So when you get an answer, you get this flag, good or bad. If the flag says good, then these AI and TI bounds are guaranteed to hold. Otherwise, they may not hold and you cannot rely on the accuracy of the reported answer. For simplicity I'm showing this as a flag, but in reality it is a continuum as to what extent the system actually provides these guarantees to the end user. Yes, please.
>>: Frequently in randomized algorithms and other areas of statistics one makes theorems of the form, you know, do the following and your answer will be within one plus or minus delta of the true answer with probability greater than or equal to one minus epsilon, and I feel like network imprecision seems like epsilon and arithmetic imprecision seems like delta. So you say it's a fundamentally new abstraction, but it sounds a lot like epsilon-delta theorems, of which I've seen a bunch.
>> Navendu Jain: So okay. Let me take it piecewise. So you're right in essentially saying that the one plus or minus delta abstraction is actually very similar to arithmetic imprecision. However, for network imprecision, yes, you can provide (inaudible) guarantees, but there are essentially two cases that I'm going to talk about: one, there are (inaudible) impossibility results that say you actually cannot give the absolute bounds, and there are cases where you can give the absolute bounds but they become very expensive. So essentially it is impossible at worst and expensive at best, which is what I'm going to talk about. So in that perspective, this new abstraction is actually fundamentally different. I'm going to give more details later.
>>: Okay.
>> Navendu Jain: Okay. So these three dimensions nicely complement each other to enable our goal of scalable monitoring. Although the basic ideas of AI and TI are not new, okay, what we will give are new scalable implementations of these metrics, and we will show that by carefully managing them we can reduce the monitoring load by several orders of magnitude. And by using NI we'll characterize how accurate the results really are and how to improve this accuracy by an order of magnitude. The combination of these three dimensions is really powerful, and my thesis (inaudible) is how this unified precision abstraction enables scalable monitoring. Another question?
>>: Is an order of magnitude really enough to get you to a million machines of (inaudible) each?
>> Navendu Jain: Are you talking about this?
>>: Yes.
I mean 100X seems like (inaudible) close to what you need.
>> Navendu Jain: So I'm also going to show the absolute numbers in terms of, as the system size increases, what the performance requirements are. Okay. So I've given you an overview and the basic motivation of the problem of scalable monitoring. Next I'll present the overall PRISM architecture and the key technologies we use to build PRISM. Then I'll describe how PRISM uses AI and TI to achieve scalability, and how by using NI we will ensure accuracy of results and how to improve this accuracy despite failures. Okay.
So let me first give you an overview of PRISM. The key idea in PRISM is to define precision as a new unified abstraction, and we define this precision abstraction in the form of three dimensions: arithmetic imprecision, which bounds numerical errors; temporal imprecision, which bounds staleness; and network imprecision, which bounds the uncertainty due to failures, okay. So AI and TI allow us to achieve high scalability by trading precision for performance. And NI addresses the fundamental challenge of providing consistency guarantees despite failures. Now, each of these dimensions makes sense individually, but how they relate can be confusing. So let me walk you through an example.
So suppose very simply we have a security monitoring application that wants to detect the number of port scan attempts on a given port across, say, all of the machines in this building, right. This in particular is the UDP port for the Microsoft IAS (inaudible), okay. So the query here is that we want to detect the number of port scan attempts on this port, and our precision requirements are: give me the answer within at most 10 percent of the true value, and the staleness should be at most 30 seconds. So Albert, does that answer your question? We are specifying the precision requirements a priori. There's also related work where the system gives the best possible precision it can provide within a fixed monitoring budget.
>>: I guess I'm a little surprised. (Inaudible)
>> Navendu Jain: That's kind of a dual problem. I can go into that later. So this is our basic query. So given this query, the system returns back an answer saying the number of port scan attempts on this port across all the machines in this building is 500 per second, and it characterizes the accuracy of that answer using NI, okay.
To understand this, suppose our system had only AI. Then we guarantee that the answer lies in this range, assuming no failures and negligible propagation delays. Okay. And we use AI to reduce the monitoring load by (inaudible) updates as long as the answer stays within this range. So even if your answer becomes 451 or 549, you don't need to send any new updates, right, because it still satisfies our guarantees. Okay. Now, suppose we had only TI. Then we guarantee that the answer value is 500 and the staleness is at most 30 seconds; that means this answer reflects all events that happened 30 or more seconds ago. Younger events, that is, between now and at most 30 seconds ago, may or may not be reflected. But everything before 30 seconds is reflected here. Okay. And again here I'm assuming that nodes in the network are reliable and links have negligible propagation delays. And we use TI to reduce load by combining multiple updates together and sending a single batch. So very simple idea.
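To make the AI culling and TI batching just described concrete, here is a minimal sketch of what a single node might do. This code is not from the talk, and the class and method names are assumptions: an update is culled while it stays inside the cached AI range, and out-of-range values are held back until the TI staleness bound is about to be at risk.

    import time

    class PrecisionFilter:
        """Culls updates inside an AI range and batches the rest under a TI bound."""

        def __init__(self, ai_fraction=0.10, ti_seconds=30.0):
            self.ai_fraction = ai_fraction   # e.g. 10 percent numerical error
            self.ti_seconds = ti_seconds     # e.g. 30 seconds maximum staleness
            self.low = self.high = None      # range promised to the parent
            self.pending = None              # out-of-range value waiting to be batched
            self.last_sent = 0.0

        def observe(self, value, now=None):
            now = time.time() if now is None else now
            if self.low is not None and self.low <= value <= self.high:
                return None                  # AI: still inside the promised range, cull it
            self.pending = value
            if now - self.last_sent < self.ti_seconds:
                return None                  # TI: hold it and batch with later changes
            return self.flush(now)

        def flush(self, now):
            # Recenter the range on the pending value and report it upward.
            half = abs(self.pending) * self.ai_fraction
            self.low, self.high = self.pending - half, self.pending + half
            self.last_sent, update = now, self.pending
            self.pending = None
            return update

A real node would also flush on a timer so that a pending value never gets staler than the TI bound; this sketch only flushes when a new observation arrives.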
Now, when you combine AI and TI, then we guarantee that the answer lies in this range based on inputs that are no more than 30 seconds stale. And by using this combination we can further reduce the load by first combining multiple updates together and then sending a batched update only if it drives our answer out of this range. So we are combining both numerical filtering due to AI as well as temporal batching due to TI to further reduce the monitoring load. Now, note that for both AI and TI, I said that I'm assuming nodes (inaudible) and links have small delays. And NI handles cases when these assumptions do not hold. So when my NI flag says good, then these AI and TI bounds are guaranteed to hold. Right. So this gives you absolute guarantees that the answer you're getting satisfies these accuracy bounds. However, when the NI flag says bad, then these bounds may not hold, and you cannot trust the accuracy of the reported answer. So it would be great if I could always give you this green light that, yes, you can trust this answer, versus the red light that, no, you cannot trust this answer. Right? However, in reality large-scale systems are never really 100 percent stable, so you would never get the case that your answer always satisfies your bounds versus never satisfies; it lies somewhere in between. So NI actually provides this metric as to what extent these AI and TI bounds hold. So this is an overview of the overall PRISM architecture. Yes, please.
>>: So earlier when you said something to Albert about this is the -- I specify plus or minus 10 percent in my query, by the end of your slide I thought that feels like it's the output. You're saying the output of the query is 1434: 500 plus or minus 10 percent, because otherwise I would never specify that, right, I would never specify that as an input, so that record is an output?
>> Navendu Jain: So these two are part of the query. The system outputs the answer and characterizes its accuracy using NI.
>>: Okay. Got it. Great.
>> Navendu Jain: Any other questions?
>>: (Inaudible). It may be simpler to say good or bad, but probably (inaudible).
>> Navendu Jain: Yes, so essentially that's exactly what I'm trying to -- so when I say good or bad, it's a flag, it's a zero-one flag, right? So good means great, everything's perfect, but systems are never (inaudible) stable. So I'm going to give you a continuum as to what extent.
Okay. So this is the overall architecture. Now I want to talk about the (inaudible) we use to build the system. The key abstraction we're going to use for building scalable monitoring is aggregation. Aggregation very simply is the ability to summarize information, okay. We define this aggregation abstraction in the form of an aggregation tree that spans all the nodes in the system. So in this particular example, I'm computing a sum aggregate of the inputs at the leaf nodes. We perform this in-network aggregation and get the global aggregate value of the inputs at the root. So that's our basic approach. We're going to use these aggregation trees to collect or aggregate the global state of the system.
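As a minimal illustration of the aggregation abstraction just described (a sketch, not PRISM's code; the class name is made up), a sum aggregate can be computed bottom-up over a tree whose leaves hold the local values:

    class AggNode:
        """A node in an aggregation tree: leaves hold local values, internal nodes combine children."""

        def __init__(self, local_value=0, children=None):
            self.local_value = local_value
            self.children = children or []

        def aggregate(self):
            # In-network aggregation: each node combines its children's partial sums,
            # so the root obtains the global sum without ever seeing the raw leaf data.
            return self.local_value + sum(child.aggregate() for child in self.children)

    # Four physical leaves, two virtual internal nodes, one root.
    root = AggNode(children=[AggNode(children=[AggNode(3), AggNode(5)]),
                             AggNode(children=[AggNode(2), AggNode(7)])])
    assert root.aggregate() == 17

In the design described next, each internal (virtual) node is mapped onto a physical machine by the DHT, and one such tree is built per attribute being monitored.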
So a natural question is how do you build these trees in a distributed environment? Okay. So to build these trees, a key technology we're going to use is distributed hash tables, or DHTs. And certainly this audience doesn't need an introduction to DHTs. Very simply, a DHT is a scalable data structure that has recently become popular in the systems community. It provides important properties of scalability, self-organization, and robustness. Now, I'm not going to go into the details of these DHTs, but the key point, well, the only point I want you to remember from this slide is we're going to use DHTs to build a random aggregation tree, and for load balancing we're going to build multiple such trees. So for example you can build one aggregation tree that keeps track of the traffic sent to a given destination, you can build another aggregation tree that keeps track of which are the most heavily loaded machines in the network, and yet another tree that keeps track of which nodes are storing which files. All right. So our basic approach is aggregation, and we're going to use these aggregation trees to collect the global state of the system.
So now let me take an example that ties all of this together. Recall my first motivating slide where essentially PlanetLab was being used to launch (inaudible). So we formulate that problem in the form of this query to find the top hundred destination IPs that are receiving the highest traffic from all the PlanetLab nodes; there are roughly 800 PlanetLab nodes right now. So think of it as finding the top hundred destination IPs: if there's an attack going on against a victim site, then it should likely be in the top 100 destination IPs, because we are looking at the aggregate volume of the outgoing traffic. Okay. So how do we compute this query? We're going to compute it in two steps. In the first step we're going to compute the aggregate traffic sent to each destination IP from all nodes in the system. So in this aggregation tree, the physical nodes are the leaves and the internal nodes are simply what I call virtual nodes that are mapped to different physical nodes, okay. So in the first step we're going to compute, for each destination IP, the aggregate traffic sent to that destination from all the nodes in my system. This is my first step. So for all the IP addresses I'm going to compute this aggregate traffic. And in the second step I'm going to compute a top 100 aggregate function, and again by doing this in-network aggregation I'll get the global top 100 list. Sure, go ahead.
>>: You're going to take every single IP address that every PlanetLab node is sending data to and compute -- and create a separate aggregation tree on it?
>>: It's okay. (Inaudible). (Laughter).
>>: And therefore it's scalable.
>> Navendu Jain: I'm going to actually hit that point right on the nail. So actually I'm going to reverse the question. So do you know how many IP addresses there are at any point in time?
>>: Well, it's only IPv4, so.
>>: (Inaudible).
>> Navendu Jain: In practice?
>>: In the world?
>> Navendu Jain: Yeah. So at any -- if you take a snap --
>>: I have no idea.
>> Navendu Jain: Take a guess.
>>: I have no idea.
>>: 100,000.
>>: Oh, no, it's much more than that.
>>: (Inaudible) millions of --
>> Navendu Jain: So take a guess, so how many.
>>: (Inaudible). (Brief talking over).
>> Navendu Jain: Well, actually it's about a million roughly, so.
>>: There's only a million live IP addresses?
>> Navendu Jain: So if you take a -- at any point in time if you take a snapshot of the traffic that is going outside, collected from all the PlanetLab nodes, the number of unique destination IPs --
(Brief talking over).
>>: That wasn't the question we thought you asked.
>> Navendu Jain: So this is the destination IPs that are being contacted by these 800 PlanetLab nodes located around the world.
>>: I'm actually surprised it's that large.
>> Navendu Jain: If you take a snapshot, actually it is --
>>: I believe you, but I'm surprised.
>> Navendu Jain: Okay.
>>: So there are a million such aggregation trees.
>> Navendu Jain: In principle -- yes. In principle we're going to build a million trees, right. But now a key observation is that a majority of these IP addresses receive very few updates. So remember we are computing the top hundred list.
>>: I agree.
>> Navendu Jain: So essentially a majority of these IP addresses see very few updates, and if you set your precision or your accuracy requirement to say, give me these flows within say one percent of my maximum flow value, right --
>>: (Inaudible).
>> Navendu Jain: I'm sorry?
>>: How do you know that?
>>: (Inaudible).
>> Navendu Jain: So if you take one percent of the maximum flow, which is say roughly about 3.5 megabytes per second, right, then you can actually filter out more than 99 percent of these destination IPs. Right. Which is important for scalability: although in principle you need to compute the aggregate values for all of these million IP addresses, essentially by using a small amount of imprecision, a small amount of error that you are willing to tolerate in your answer, you can actually filter out a majority of these flows, and you would still get the top hundred list with bounded accuracy.
>>: So you're filtering out -- what I've got to understand here is, you don't know how much is being sent to a particular IP address until you construct the tree and bring it up to the root, right?
>> Navendu Jain: Well, not quite. But please go ahead. Which I'm going to actually cover --
>>: If this is going to be part of your talk, I'm not going to go into it.
>> Navendu Jain: Yes, I'm going to go into it in the very next slide, a couple of slides. So this is my -- I'm going to use this as a running example, so I'm going to keep coming back to it. But yes, we'll be on the right track.
So essentially I'm going to talk about how we use this arithmetic imprecision to achieve high scalability. So arithmetic imprecision, or AI, a quick recall, bounds the numerical error of a query result; that is, instead of giving you an exact answer with value a hundred, we are giving you an approximate answer: it's a hundred plus or minus 10 percent, and the answer is guaranteed to have a maximum numerical error of at most 10 percent, right. So when applications don't really need exact answers, AI allows us to reduce load by caching old values and filtering small changes from these values. So even if my actual true answer becomes, say, 91 or 109, I don't need to send an update because it still meets my error guarantees. The two key issues in implementing AI are the mechanism question of how you take this 10 percent error and divide it among the nodes in the system, as you were pointing out, and the policy question of how you do it optimally. Right. So I'm going to show a flexible mechanism in which you can take a total budget and divide it in any manner across the system, and I'm going to show you an adaptive algorithm that performs self-tuning of these budgets to minimize the monitoring load.
So the way PRISM uses AI is by installing these filters at each node in an aggregation tree. So remember we are aggregating the global state of the system using aggregation trees, right.
Each such filter, at each node in the aggregation tree, denotes a bounded range [low, high] such that the actual value lies in this range, and the width of this range is bounded by the error budget delta. Okay. So in this example, I'm setting these filter widths for a sum aggregate, right. Now when you get a new update, we only need to send this update if it violates the filter. So for example if you get an update here with value five, we don't need to send it because it already lies within our guarantee, that is, the answer lies between four and six, so you can simply cull it, or filter it. However, if you get an update that lies outside the range, then we need to send it up to the parent, adjusting the filter widths along the way.
So the two key issues for AI are the mechanism question of how you take the total budget delta and divide it among the nodes in the system, and the policy question of how you do it optimally. Given a total budget delta at the root, you essentially have the flexibility to divide it in any manner. For example the root can keep some part of the budget for itself and divide the remaining among its children, okay. And the children do the same: they keep some part of the budget for themselves and divide the remaining among their children, and so on. So in this way you have tremendous flexibility to divide a total budget in any manner across the nodes, right. So the intuitive question is how do you decide which way to go, or what is the best possible setting, right?
>>: I have a question.
>> Navendu Jain: Yes, please.
>>: If you specify it as a relative error rather than an absolute error, how do you divide it? It seems that you'd have to give the same relative error to both of your children unless you have an assumption about the magnitude of the values that are given to you.
>> Navendu Jain: Right. So there are essentially two ways of looking at the error. One is, as I said, the absolute error, right. Essentially the talk I'm giving today is based on the absolute error. But there's an equivalent notion of relative error as well, where you can also specify the error relative to the absolute value. And there's a separate piece of work on that which I haven't covered, but I can give you more details when we talk.
>>: Okay. I'll think absolute error for the rest of the talk.
>> Navendu Jain: Okay. Great. So now the policy question is how do you do it optimally, right. So remember our goal is scalable monitoring, and our key constraint is that for scalability we want to minimize the monitoring overhead; essentially we want to minimize the number of updates in the system. So ideally you want to set these filters such that you cull as much data as possible, not just at the leaves but at the internal nodes as well. Right. So you want to minimize the total amount (inaudible). So if your input distribution, the input workload, is uniform, then you might want to divide the budget uniformly across all nodes in the system, okay. But if your input is skewed, for example if your right subtree is generating a lot more updates than the left subtree, you might want to give a larger width to the right subtree and take it from the left, right. And even within the same tree, sometimes you may want to give larger widths to the leaf nodes and smaller to the internal nodes, and sometimes larger to the internal nodes and smaller to the leaf nodes. So what's the right thing to do here?
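Before the policy question is answered, here is a minimal sketch of the division mechanism itself, reusing the AggNode tree from the earlier sketch. The even split shown, with each internal node keeping half of what it receives, is just one arbitrary policy chosen for illustration, not PRISM's policy:

    def divide_budget(node, delta):
        """Recursively divide an error budget delta over an aggregation (sub)tree."""
        if not node.children:
            node.delta = delta                 # leaves keep whatever they are given
            return
        node.delta = delta / 2                 # slack kept at this node (a policy choice)
        share = (delta - node.delta) / len(node.children)
        for child in node.children:
            divide_budget(child, share)

    divide_budget(root, delta=9.0)             # root keeps 4.5, each child subtree receives 2.25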
And these are highly dynamic situations that are present in the applications. Our goal here is to do self-tuning of these filters.
>>: Let me ask a question. Why would you (inaudible) budget at any of the internal nodes rather than the leaves? Is it just so you could work with the distributed -- or because they don't generate data, so is that the answer?
>> Navendu Jain: No. It's a very good question. So a very simple example where you actually want to give larger widths to the internal nodes is, for example, think of it as you're getting consecutive updates: from one child you get a value V1 and the next update is V1 plus a hundred, and from the other child you get V2 and then V2 minus a hundred, right, and I'm essentially computing a sum aggregate. Right. So if I give these leaf filter widths as one unit, then there's no way I can filter those updates, right, because the next value differs from the previous value by 100 units. Does it make sense? So, okay, the high-level answer is that when you combine these updates they can cancel each other out.
>>: Sure, but they can cancel each other out whether you push the value -- or whether you push budget down or --
>> Navendu Jain: Not quite, because in one case essentially I can give larger widths, larger budgets here, but the value itself may still (inaudible). So think of the first value being V, the next value being V plus a million. So one thing goes up by a large amount, the other thing comes back down. And think of it as a coarse filter.
>>: And in of them (inaudible).
>> Navendu Jain: Or more generally, the aggregate value stays in some -- has a very small change. So you can essentially cancel that very small change by having a larger filter width at the internal --
>>: So first (inaudible) but the large numbers gives you small numbers. (Inaudible).
>>: I have the same question as Bill. I similarly don't understand why you wouldn't spend the budget on the leaves.
>>: Because if it's burst at the leaves, then you don't have enough budget to --
>>: But it's not one budget. It's that if you don't spend it at the leaves, it doesn't give you more at the internal nodes, as far as I understand.
>>: As far as I could tell.
>>: You don't get more at the internal nodes for not spending it at the leaves.
>> Navendu Jain: Absolutely -- because the total budget is fixed here. Essentially you are taking this budget from the leaf nodes and putting it here. So the idea is, suppose I don't keep anything internal and push everything onto the leaves; then the cost of the system where I keep some of it here is at most the same as the cost if you push all the budget to the leaves. Does that make sense?
>>: The thing that you guys are disagreeing about is that you're saying if an internal node has a certain amount of budget and it gives some to the leaves, it loses that budget, and John is saying --
>>: I'm saying I don't understand why.
>>: But you have to report -- if you're an internal node, you have some idea of your values, right, and in particular you know that your children are all within the range of --
>> Navendu Jain: Yes, guarantees.
>>: (Inaudible) right. So when I get an update from a leaf that says it's moved out of its range, then I know -- okay. I see what you're doing.
>> Navendu Jain: So essentially then you are also propagating these updates up as well.
>>: (Inaudible).
>> Navendu Jain: So is John --
>>: I'm still confused. I'm still confused about why there's a trade-off in budget between the parent and the child.
>> Navendu Jain: So the trade-off essentially is the following. Suppose I keep the entire budget here and give nothing to the leaves.
>>: I don't understand why it's a trade-off, why you can't -- if I have the range zero to ten, and I give zero to five to one child and five to ten to another child, right, then I still have zero to ten.
>>: Because imagine --
>> Navendu Jain: So the trade-off is really in terms of the fact that I'm trying to minimize the updates, right. So what you are really asking is, by setting those widths in such a way, if I get a message from my children, when I combine them, when I aggregate them, do I still need to propagate it to my parent as well?
>>: Right.
>> Navendu Jain: Right. So the point there is that sometimes you may not want to give, or give only a very small amount of, this budget to the children, but essentially keep your own error window larger.
>>: I understand keeping my error window larger. Why does that ever mean it is good to give less to my children?
>> Navendu Jain: Okay. So actually since I want to -- it's 58, so let me give you a very quick answer to that. So I'm going to (inaudible). So here I'm essentially saying the children's ranges are 4 to 6 and 3 to 4, right. My total budget is five: the parent has two, this one has two, this one has one, right. So we assemble the sum aggregate, which essentially takes the sum of the lows and the sum of the highs, and then I apply my local filtering here, the local filtering essentially expanding this range by delta over two on both sides. So this is seven and ten, and my delta over two is one, so I'm expanding it to six and eleven. Makes perfect sense?
>>: Yes.
>> Navendu Jain: Right. So now I get a new update: for this one the value is six, and for that one the value is five, right. So for this one I don't need to do anything here because it already lies in my range.
>>: Great.
>> Navendu Jain: So you simply filter it out.
>>: Great.
>> Navendu Jain: The other one you need to send up. Essentially you send it up here and you recompute the aggregate: you use the cached value from here, and you use the updated value from there. Now, your value is eight and eleven. And eight and eleven already lies in the previous range, because six and eleven is what I reported to my parent. Right? So now, because of keeping the budget here, I'm able to cull that update. If I kept no budget here, then every time a leaf changes I would have to report it to my parent. Yes, precisely.
>>: In other words, the -- so if --
>>: You can't spend the margin of error twice (inaudible).
>> Navendu Jain: Okay. If you --
>>: So it's that I didn't know whether to give that budget to the left node or the right node.
>> Navendu Jain: Or keep it. I mean I'm showing you a (inaudible) but it's actually a (inaudible).
>>: If you keep it yourself there's a decent chance that you could cull the thing (inaudible). If you push everything down to the leaves, then any time anything ever exceeds its budget, it has to propagate all the way (inaudible).
>> Navendu Jain: So if you still have concerns we can meet --
(Inaudible discussion.)
>> Navendu Jain: I'm going to skip that. Right.
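Here is the worked example above written out as a small sketch, with the same numbers as the slide. The helper names are mine, and the new range of 4 to 5 for the right child is an assumption chosen to match the slide's arithmetic: the parent sums its children's cached ranges, widens the sum by its own slack, and reports upward only when the recomputed range escapes what it last reported.

    def summed(ranges):
        lows, highs = zip(*ranges)
        return (sum(lows), sum(highs))

    def widen(rng, slack):
        return (rng[0] - slack / 2, rng[1] + slack / 2)

    def contains(outer, inner):
        return outer[0] <= inner[0] and inner[1] <= outer[1]

    # Children's filter ranges plus the parent's own slack: total budget 2 + 1 + 2 = 5.
    children = {"left": (4, 6), "right": (3, 4)}
    parent_slack = 2
    reported = widen(summed(children.values()), parent_slack)   # (7, 10) widened to (6, 11)

    # The left child's value moves to 6: inside (4, 6), so it is culled locally.
    # The right child's value moves to 5: outside (3, 4), so it reports a new width-1 range.
    children["right"] = (4, 5)
    recomputed = summed(children.values())                      # (8, 11)

    # (8, 11) still lies inside the (6, 11) the parent already reported, so the parent's
    # slack absorbs the change and nothing is propagated further up the tree.
    assert contains(reported, recomputed)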
So now, well, I guess the problem is not just the optimization; actually it's even harder, because in order to adjust these widths we have to send messages. So essentially you want to -- I want to minimize the number of messages that are being propagated, that is, the monitoring overhead, but if you want to adjust these widths optimally, you also end up paying the cost of sending messages to adjust them.
>>: (Inaudible).
>> Navendu Jain: The total number of messages, both going up as well as going down.
>>: Popular metric.
>> Navendu Jain: Okay. So now to address these challenges, we have implemented -- well, designed and implemented -- a principled, self-tuning and near-optimal solution. Okay. Note that we cannot apply heuristics here because heuristics may work for one workload but may not work for another workload. So what we need is a principled solution that uses the workload properties themselves to guide the optimal setting of these filter widths. Further, we need a self-tuning solution to adapt to dynamic workloads, because workloads can change over a period of time, and to make sure that the benefits we get by sending these messages exceed the costs of sending them. Right. So you want to make sure your benefits always exceed your costs. And finally, we have done a theoretical analysis of this problem and shown that our solution is the optimal online algorithm.
So first I want to show you our workload-aware solution to estimate the optimal filter widths. For any given workload, the key thing to consider is the variability or the noise in the workload, right: if you have very high noise in the workload then you need larger filter widths to cull its updates. Same point we were talking about: if you have value V and the next value is V plus (inaudible). So if you have more noise in the workload you need larger filter widths to cull its updates. So I'm expressing this notion in this graph, where the monitoring load on the Y axis is expressed in terms of the total budget delta, that is, the error we are willing to tolerate, relative to the standard deviation of the input workload. Okay. So when you look at the left of this graph, when your error budget delta is much smaller compared to the noise in the workload, then you expect to filter very few updates because most of them will actually lie outside your range. However, on the right, the error budget delta is much larger than the noise in the workload. So think of a Gaussian: for a Gaussian, if you have delta as, say, plus or minus three times the standard deviation, you can filter out roughly 99 point some percent of its updates, right. So you expect the load to decrease quickly until a point where a majority of these updates get filtered. Okay. We capture this mathematically using (inaudible) inequality, which allows us to express the expected message cost of escaping any such filter in terms of the variance and the update rate of the input workload as well as our error tolerance. And the whole idea of doing this mathematical modeling is to estimate the optimal filter widths that we are setting at each node in an aggregation tree.
>>: (Inaudible) normal distribution?
>> Navendu Jain: No. Actually, that's why we're using the (inaudible) inequality, which doesn't make any a priori assumption about the input distribution.
>>: (Inaudible).
>>: Well, that's why it also doesn't quickly decay to 99.9 percent, right? If you set, you know, like he said, delta equal to three sigma, it's only one ninth.
>> Navendu Jain: So essentially this is not a very tight bound, but it doesn't make any assumption about the workload. Okay. So using this model we formulate an optimization problem for a one-level tree where we want to minimize the number of messages being sent from the children to the root such that our total budget delta T is fixed, right. So your error tolerance is fixed; given that constraint you want to minimize the number of messages being sent from the children to the root, okay. Solving this optimization problem, we get a nice closed-form solution. Now, I don't expect you to understand this equation, but one thing I want to point out here is that the optimal settings of these filter widths depend directly on the (inaudible) of the workload, that is, both the variance as well as the update rates. And we extend this approach to a general aggregation tree. However, computing this optimal is not enough, because we need to send messages to adjust these filters. Yes, please.
>>: Update rate really doesn't seem very well defined to me.
>> Navendu Jain: The update rate is the number of -- for example, in a network monitoring application, the number of packets you're receiving per unit of time.
>>: I understand that these are discrete systems, but it seems sort of abstract that -- the value function is varying continuously throughout. The update rates come almost as quickly as you're willing to look at them.
>> Navendu Jain: Essentially you are describing some sort of a time window over which you're looking at how these things are changing.
>>: (Inaudible).
>> Navendu Jain: No, no, no. Well, we're essentially defining the update rate over a time window, right. So essentially those are time windows. You take the number of updates you have received in a given time window and you divide by the length of the window. Right. Does that make sense?
>>: No. Because I think what you're measuring over that time window is some underlying, let's call it continuous, function that's updating (inaudible), right?
>> Navendu Jain: So essentially this is an approximation of how the continuous signal is actually being (inaudible). So essentially as --
>>: I guess what I'm trying to say is, intuitively the standard deviation is well defined for any reasonable thing you're measuring, but the update rate is something you chose, right; you said I would like to measure these every ten seconds. I mean the update rate is (inaudible) value, and if the function is varying, the value function is varying, it's not constant, then it will update at every window.
>> Navendu Jain: So what's really happening is that when you receive a data packet, right, essentially you are computing the (inaudible) standard deviation, so this is being computed in an online manner, and you are keeping track of the update rate as well. So think of it as, in one hour I'm keeping track of all the updates I'm getting, essentially how many times this value is being refreshed. Right, so I'm just picking the time window over which I'm looking at the system; I'm not actually picking how the input data is being generated, because that is part of the input workload.
>>: I'll worry about that.
>> Navendu Jain: Okay.
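The inaudible inequality above is consistent with Chebyshev's inequality, though that is an assumption on my part based on the "no distributional assumption" remark. Under that reading, an update escapes a filter of width delta only if the value strays at least delta/2 from the filter's center, which Chebyshev bounds by (2*sigma/delta)^2, so the expected message rate from source i is roughly u_i * min(1, (2*sigma_i/delta_i)^2). Minimizing the sum of these costs for a one-level tree under a fixed total budget gives widths proportional to (u_i * sigma_i^2)^(1/3); this closed form is derived from that simplified cost model and is not necessarily the exact formula on the slide.

    def optimal_widths(update_rates, std_devs, delta_total):
        """Estimate per-source filter widths from workload statistics (assumed cost model).

        Cost model: expected messages from source i ~ u_i * (2 * sigma_i / delta_i) ** 2.
        Minimizing the summed cost subject to sum(delta_i) = delta_total gives, via a
        Lagrange-multiplier argument, delta_i proportional to (u_i * sigma_i ** 2) ** (1/3).
        """
        weights = [(u * s * s) ** (1.0 / 3.0) for u, s in zip(update_rates, std_devs)]
        total = sum(weights) or 1.0
        return [delta_total * w / total for w in weights]

    # Noisy, chatty sources get most of the budget; quiet ones get almost none.
    print(optimal_widths(update_rates=[100, 100, 1], std_devs=[5.0, 0.1, 5.0], delta_total=9.0))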
So computing this optimal is not enough. We need to send messages to adjust the filters. So imagine we do this periodically: say every five minutes we go and compute the optimal settings of these filter widths and we send messages to adjust them. Now, as you increase the frequency of redistribution, that is, you go from say five minutes to a minute to 30 seconds and so on, you are essentially getting closer and closer to the optimal. Whenever there's a difference between the current setting and the optimal setting, I'm going to send messages to fix it, right. So you are getting more and more optimal and your filtering becomes more effective. Right. However, there is a point of (inaudible) after which the benefits you get are very marginal: you keep sending these messages to fix the imbalance between the current setting and the optimal setting, but the benefits you get are very marginal compared to the cost of sending these messages. So in general there is this trade-off between the message overhead and the frequency of budget redistribution. So read this X axis here as going from five minutes to a minute to 30 seconds, 10 seconds and so on. As we go from left to right, we are increasing the frequency of the redistribution; we are sending messages much more frequently. So our total monitoring load, shown in blue, decreases, right, because our filtering becomes more effective. However, there's a point of diminishing returns after which the redistribution cost, shown in red, starts to dominate. Right. Is the point coming across? So the idea here is to make sure that the benefits we get by redistributing always exceed the costs of redistribution.
>>: So there's an assumption here that global redistribution, as opposed to say just redistributing -- this is the one parent in --
>> Navendu Jain: No, this is a general hierarchy.
>>: (Inaudible).
>> Navendu Jain: Yes.
>>: So unless I'm incorrect, the way you're describing this, it sounds like there's an assumption that you're globally redistributing the whole tree, and one could imagine systems where you only redistribute some nodes that are the most out of whack, which would reduce the redistribution cost possibly (inaudible).
>> Navendu Jain: Right, so essentially -- which is precisely the point of when do you redistribute, right? When I computed the optimal, I said by how much you redistribute, what the optimal setting is. So then the logical question is when do you redistribute. So essentially you want to move the distribution of the entire tree towards the optimal.
>>: Right. But what I'm saying is that I think that's an assumption you're making that's not necessarily -- that at least requires an explanation of why it's a valid assumption. Because you could imagine, say you have a big tree, you know, a binary tree, and the left half turns out to have its internal stuff completely out of whack and the right half is completely close to right. So you can reduce your redistribution costs by half by just not affecting, just leaving the right half the way it is. Now, you're not optimal, but you've cut down on the red cost substantially while bringing the black cost -- getting most of the advantage of the black cost.
>> Navendu Jain: Right. Exactly. That's precisely the point, that essentially how do you make that decision.
>>: Okay. So you're (inaudible).
>> Navendu Jain: Yes.
Precisely. That's precisely the point. So essentially what we're really doing is applying this cost-benefit (inaudible) that says, you know, if you don't really need to redistribute, then don't. So essentially how do you make that decision.
>>: Making a different -- there's two questions, right. One is are you far enough off optimal that you just don't want to redistribute at all, but the second question is it may make sense to make a partial redistribution, and it sounded, at least, and it still kind of sounds like you're making an assumption that you either redistribute completely or not at all.
>> Navendu Jain: So, right, the redistribution is happening at each internal node: each node, which is the parent of its underlying children, is essentially making this decision of whether I need to redistribute amongst my children, and so on. Every internal node is doing that. So there is no notion of a global -- right. Exactly.
>>: (Inaudible).
>> Navendu Jain: So everything is happening at each parent individually, you know. Think of it as each internal node is a parent of its underlying children, and each of them is making a decision that, for my subtree, I want to minimize the number of updates that are happening. So here we redistribute the budgets only when either there is a large load imbalance, that means there is a big difference between the current setting and the optimal setting, or the load imbalance itself is small but over time it has accumulated, so it becomes a long-lasting imbalance. So essentially we redistribute and make sure that our benefits always exceed our costs, and we have done a theoretical analysis of our solution and shown that our solution actually matches the optimal online solution, and for this problem no constant-competitive algorithm exists. Yes, please.
>>: Could you describe how you choose the distribution of error budgets between a parent and children? From what I could tell it didn't take into account any difference that would make in the later cost of redistribution.
>> Navendu Jain: Right. So --
>>: So I felt like you assumed a static workload and said if I never have to redistribute this is the correct thing, and now I feel like you're saying, well, now I'm going to make redistributions, but I'm not going to make those redistributions taking into account my possible desire to make future redistributions cheaper. And redistribution seems like actually a very strong argument for keeping error budget at parent nodes.
>> Navendu Jain: Right. So essentially when you're computing the optimal, you're looking at how the distribution has behaved over this time. So yeah, there is some notion that the workload is going to look like it has looked, so I'm computing the optimal based on that, right, and then exactly as you said, this is taking in the dynamic aspects: essentially, how do you make sure that when you redistribute, your benefits are always going to be better than the costs?
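A minimal sketch of the cost-benefit test just described, as one internal node might apply it to its own subtree; the specific thresholding shown here is an assumption, not the exact rule from the talk:

    def maybe_redistribute(current_cost, optimal_cost, accumulated, adjustment_cost):
        """Decide whether a parent should re-divide budgets among its children.

        current_cost / optimal_cost: estimated message rates under the current and the
        optimal division of this subtree's budget. Redistribute only when the saving,
        either immediate or accumulated over time, exceeds the cost of the adjustment
        messages themselves; otherwise carry the imbalance forward.
        """
        saving = current_cost - optimal_cost
        if saving > adjustment_cost:
            return True, 0.0                    # large imbalance: fix it now
        accumulated += max(saving, 0.0)
        if accumulated > adjustment_cost:
            return True, 0.0                    # small but long-lasting imbalance
        return False, accumulated               # benefit does not yet exceed cost

Each internal node would run this check independently over its own children, which is the sense in which there is no global redistribution step.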
>>: Couldn't you do better by also taking the cost of future redistributions into account when you assign --
>> Navendu Jain: Essentially then you have to build more of a predictive model of how the distribution is going to behave, right. So the idea here was that we are building a general framework such that no matter what your distribution does, the system is still applicable in all cases.
>>: (Inaudible). You kind of keep going until the cost of fixing it is equal to what you've already paid. A lot of them.
>> Navendu Jain: So to see if this is effective, I'm going to first show you some quick simulation results, and then I'm going to show you results from a (inaudible) implementation. So in this experiment we are using a 90 (inaudible) workload where 90 percent of the sources have zero noise, that is, they are giving the same input value over and over again, and 10 percent of the nodes have very high noise. (Inaudible) So you expect your optimal algorithm to take all the budget from these zero-noise sources and give it to this noisy 10 percent, right. In this graph I'm showing you the normalized message overhead on the Y axis, and on the X axis I'm showing you the ratio of our budget delta to the (inaudible); so think of it again, if you have delta as plus or minus three times the standard deviation you can filter 99 percent of the input updates. And again here lower numbers mean better, right. So first of all, compared to uniform allocation, which gives equal filter widths to all the nodes, we reduce overhead by an order of magnitude. And compared to adaptive filters, which is the state of the art in this area, we reduce overhead by an order of magnitude and even beyond, because what adaptive filters does is periodically redistribute these budgets, and hence it wastes messages on useless adjustments, whereas in our case we only redistribute if benefits outweigh costs. Okay. To see if self-tuning approximates the ideal, we use a uniform workload. That means all the data --
>>: (Inaudible).
>> Navendu Jain: I'm sorry?
>>: (Inaudible) uniform better than adaptive.
>>: Because the (inaudible) --
>>: The adaptive blew a budget on. (Brief talking over.)
>>: Even though it was very unbalanced.
>> Navendu Jain: Right. And actually I'm going to show you that the real workloads are actually very, very skewed.
>>: (Inaudible).
>> Navendu Jain: And for a uniform workload the optimal policy is to give these filter widths uniformly. So again our self-tuning solution approximates the uniform allocation. We are sometimes slightly better and sometimes slightly worse, because the uniform allocation is the best offline policy; it is not necessarily the optimal online solution. And again compared to adaptive filters we reduce overhead by several orders of magnitude.
So I'm going to quickly show you some results from the implementation. I've implemented a prototype of PRISM on top of the aggregation system, and we use FreePastry as the underlying DHT. And for this empirical evaluation I'm going to perform a query that finds the top hundred destination IPs receiving the highest aggregate traffic from all nodes in the system. The input workload here is taken from (inaudible). So before I show you results, let me quickly give you an overview of the input data distribution. To process this workload, we need to handle about 80,000 flow attributes that send roughly 25 million updates in an hour.
So a quick (inaudible) would tell you that a centralized system needs to process about 7,000 updates per second, in terms of bandwidth and processing. And further, the distribution is heavy-tailed. If you look at the bytes distribution, then 60 percent of the flows send less than a kilobyte of traffic and 99 percent of these flows send less than 400 kilobytes of traffic. However, the distribution has a heavy tail, and the maximum flow is more than 200 megabytes of traffic. And you see similar patterns in the packet distribution, right. So now the key challenge here is that for doing self-tuning of these budgets, we need to manage or self-tune budgets for these tens of thousands to millions of attributes. Tying it back to our running example: when 99 percent of these flows send less than, say, half a megabyte of traffic, and if you take an error budget of even one percent or 0.1 percent of the maximum flow, then we can filter more than 99 percent of these small flows. So what I'm calling the mice flows would send very few updates compared to the elephant flows, which are really large and which are the actual flows we are really interested in, the top hundred heavy hitters. Yes, please.
>>: How did you get the maximum flow?
>> Navendu Jain: So there are several techniques. You can actually start up the system by computing an estimate, right. The system allows you to update the error value as the system progresses; as you're running you can input new values at any point in time. So you can think of it in absolute or relative terms. In relative terms you can always say within one percent of the maximum. In absolute terms you take, say, one percent of, say, a hundred megabytes per second, and you compute the value, and if you see, well, it is something different, then you can re-input the value back into the system.
>>: Right. So this actually brings us back to a question that I had earlier, which is did it make sense to build aggregation trees (inaudible) as opposed to building an aggregation tree over the 800 hosts that you have where each host sends off its hundred, its 100 best things and then it just -- I mean, it doesn't give you the authorization (inaudible), but as an engineering solution it might actually work better.
>> Navendu Jain: Actually my claim is the correctness actually gets hurt, so correctness is violated even in your solution, because the global top hundred actually might not be in the top hundred of each individual node, right.
>>: It must be.
>> Navendu Jain: Not necessarily. You can actually have small values at each of them. (Brief talking over.)
>>: (Inaudible).
>> Navendu Jain: So this graph shows the results. Compared to a centralized system that incurs this cost -- these are the absolute numbers, about 7,000 messages per second -- if you take an error budget of even five percent of the maximum flow, we can reduce this monitoring overhead by an order of magnitude, and by using a key optimization you get another order of magnitude improvement. So I'm not going to go into the details of this optimization. The punch line here is that by doing this self-tuning of the error budget we can reduce the monitoring overhead by several orders of magnitude, which is really important for scalability. So that's the key point. Yes, please.
>>: (Inaudible) on identifying the most popular anything, the most popular million URLs or the most popular IP addresses, how does this compare to all those others?
>> Navendu Jain: So, again, see, (inaudible) popular in the databases community and kind of in the networking community as well. Essentially the most common notion there has been that you compute the top hundred at each individual site, and the idea is to send the aggregate of that to a central point; essentially think of a (inaudible) tree. In this system, this is a completely distributed environment where essentially the top hundred at each local site actually doesn't suffice.
>>: I guess with the harbor network sites that we have here installed I could do top hundred heavy hitters and (inaudible) right now.
>> Navendu Jain: Okay. So I guess we can go into (inaudible) more detail, but essentially the idea is that we are building this in a completely distributed environment, so the top hundred is aggregated globally across all the systems. But if there's published work I'll be interested in it.
>>: (Inaudible).
>> Navendu Jain: Okay. So to summarize the contributions of AI: we are giving you a flexible mechanism in which you can take a global budget and divide it in any manner across the nodes in the system, and an adaptive algorithm that performs self-tuning of these budgets to minimize the monitoring overhead. Our solution has two key ideas: we estimate the optimal based on the variability in the workload itself, and we only redistribute if the benefits exceed the costs. And we show that our solution is the optimal online algorithm. Using experiments we see that we can get a significant reduction in the monitoring load by using a very small AI, an AI of one percent to five percent; it gives you several orders of magnitude reduction in the monitoring overhead. Which is in fact the case for real workloads. If you really think about it, if your distribution is really uniform, then you don't really need this sort of fancy optimization. But real-world workloads are skewed.
So I want to very quickly touch on the second dimension of precision, which is temporal imprecision. And again here I'm assuming that nodes in the network are reliable and links have small (inaudible), and I'll describe how NI handles cases when these assumptions do not hold. The key point I want to make from this part of the talk is that by combining AI and TI we can get a significant reduction in the (inaudible), which is really important for scalability. So using AI and TI we have built this application, which is currently running on PlanetLab on about 500 nodes, which detects the top 100 destination IPs that receive the highest traffic from the PlanetLab nodes, right; this application is currently running on the PlanetLab infrastructure. So in the (inaudible) I'm showing you the results from this application, where the Y axis shows the message overhead and the X axis shows the temporal or TI budget, and these different lines show the different AI settings. And again here lower numbers are better. Okay. So if you compare going from an AI of zero to an AI of one percent, we get a reduction in monitoring overhead by a factor of 30. And you get another order of magnitude load reduction by going from an AI of one percent to an AI of 20 percent. So essentially we are getting these orders of magnitude reduction in the monitoring overhead by using a small AI error tolerance.
Similarly, by using TI, if you go from 15 seconds to 60 seconds you get roughly an order of magnitude reduction. For this application, having a TI of 60 seconds is reasonably good, and having an AI of 10 percent is reasonably good, because we are interested in the top heavy hitters list. So by combining AI and TI we are getting several orders of magnitude of reduction in the monitoring overhead. Another advantage of combining AI and TI is that we get highly responsive monitoring. For example, for approximately the same cost as an AI of one percent and a TI of five minutes, we can give you 20 times more responsive monitoring at an AI of 10 percent and a TI of 15 seconds. So this AI and TI combination is really powerful, as it gives us several orders of magnitude of reduction in the monitoring overhead as well as highly responsive monitoring. Okay. So until now, I have talked about AI and TI, how they give us high scalability and how they give us strong accuracy guarantees. However, this is all nice and great in an ideal world, because in practice failures can cause the system to violate these guarantees. Therefore we define a new abstraction, network imprecision, that bounds uncertainty due to failures. Since this is a new abstraction, it will take me a couple of slides to define it; let me first (inaudible) why NI is important. NI is important for three reasons. First, it allows us to fulfill the guarantees of AI and TI; in comparison, existing systems today only give you best effort, and I'm going to show you that best effort can be arbitrarily bad, or have very high error, in practice. Second, in the presence of failures, AI and TI can actually increase your risk of errors, right, so NI characterizes how accurate the results really are. And finally, using NI, the AI and TI mechanisms can now assume that nodes and the network are reliable, so their implementations get simplified, and NI handles the cases when these assumptions do not hold. So the key motivation for NI is that failures can cause the system to violate the AI and TI guarantees. In this very simple example we have a monitoring application that's returning a query result of 120 requests per second, and note that we are using aggregation trees to compute the global result. Here the aggregate values of the subtrees are being cached at the parent, right; we do this to minimize cost, because if we don't get any update then we have the guarantee that these subtree values lie in that range. Right. So given these AI and TI requirements we're getting this answer, and we're caching the subtree aggregate values, right. So now, when you get this result in the presence of failures, what does it really mean? Does it mean that the load is between 100 and 130? Or is the load actually much higher, but a disruption prevented the new update from reaching the root? The root node is still using the cached value of the subtree, thereby reporting an incorrect answer. Or the load could be much smaller, but a subtree moved to a new parent, so a reconfiguration caused the root to count the subtree's value twice: it's caching the subtree's value here as well as getting the aggregate value provided by its right subtree. So here reconfiguration causes double counting. Earlier we made the claim that we can give you strong accuracy guarantees in the absence of failures, but in practice, when failures happen, what guarantees can we offer?
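To make these two failure modes concrete, here is a toy Python sketch; the tree shape and the numbers are invented for illustration, and the point is only how a root that caches subtree aggregates can serve a stale value after a disconnection or double-count a subtree after a reconfiguration.

```python
class Root:
    """Toy root of an aggregation tree that caches each child's last
    reported subtree aggregate and sums the cache to answer queries."""

    def __init__(self):
        self.cache = {}                     # child_id -> last reported aggregate

    def on_update(self, child_id, value):
        self.cache[child_id] = value

    def answer(self):
        return sum(self.cache.values())


root = Root()
root.on_update("left", 70)
root.on_update("right", 50)
print(root.answer())    # 120: fine while everyone is connected

# Failure mode 1: the right subtree disconnects and its load grows to 100,
# but no update reaches the root, which keeps serving the stale cached 50.
print(root.answer())    # still 120 (silently stale)

# Failure mode 2: the right subtree reconnects under the left child, so its
# value now also flows in through "left" while the old cache entry remains.
root.on_update("left", 70 + 100)
print(root.answer())    # 220: the right subtree is counted twice
```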
To see how bad it is in practice, we built and deployed this application on PlanetLab, where we keep track of the experiments' global resource usage on all 800 PlanetLab machines. On the Y axis here I'm showing you the CDF of answers; on the X axis I have the difference of the two, so essentially we are taking the difference of the reported value from the true value, and you get the true value by doing offline processing of the logs. Okay. So two things to note here are that half of the reports deviate by at least 30 percent from the true values, and about 20 percent, one-fifth of the reports, deviate by roughly more than 70 percent. So now, when you get an answer, where does the answer really lie? Does it actually lie here, or near, or even beyond? So best effort can be arbitrarily bad in practice, okay. So what's the solution? Do we just give up on any accuracy guarantees in the presence of failures? Do you have an answer or do you have a question? >>: I definitely don't understand what the point of comparison was again -- what does the best-effort PRISM do? >> Navendu Jain: So the point of comparison essentially is that we're taking the true value. This is the oracle value. Essentially we log everything, and when we get an answer we do offline parsing of the logs to compute what the answer should have been versus what the system is telling us. >>: And the system -- what is the system that you were measuring? >> Navendu Jain: This is our PRISM system, which is deployed on PlanetLab. >>: (Inaudible). >> Navendu Jain: I'm sorry? >>: (Inaudible) a little better, but it was -- so you get a uniform, you get a straight line. Correct? >> Navendu Jain: And we actually get (inaudible) -- essentially these answers got (inaudible) by a lot. >>: Depends on whether they're relative errors or not. >>: Right. But you're (inaudible) -- so they're relative errors, right, in this graph? So if I just made up -- >>: If you made up random numbers between zero and a hundred and the truth was always one, you would be off by more than 50 percent. >>: Well, this hundred obviously has a special meaning, right, because it's a coincidence that it happens to arrive at 100, or what he's doing is he's putting the smaller number always (inaudible). >> Navendu Jain: Right. So essentially it would be normalized. (Inaudible). >>: So I mean, what Bill's saying is if I make up a number between -- >>: Yup, yup. >>: Okay. >> Navendu Jain: So what's the solution? Do we just give up any accuracy guarantees in the presence of failures? Sounds great. The real thing to understand is that we have to accept that our systems are unreliable, and therefore we cannot guarantee to always give the right answer. Right. So instead of always guaranteeing to give the right answer, our idea is to quantify the stability of the system when an answer is reported, right. So go back to our intuition of this stable good-or-bad flag: when you get an answer, you also get this stability flag. The bit says that the system is stable and you can trust the accuracy of the reported answer; otherwise our AI and TI bounds may not hold and you cannot trust the accuracy of your answer. Right. So it would be great if I could always give you this green light, yes, you can trust this answer, versus a red light, no, you cannot trust the answer; however, in reality large-scale systems are never really 100 percent stable. >>: (Inaudible). >> Navendu Jain: Yes, precisely. Thank you.
So large-scale systems are never really 100 percent stable, so we quantify how stable the system is when an answer is computed, okay. And the way we quantify system stability is using three simple metrics: N all, N reachable, and N dupe. N all is simply the number of live nodes in the system, okay. N reachable gives you a bound on the number of nodes that are meeting the TI guarantees, that is, what part of your network is reachable, meaning from what part of the network you are getting recent updates. >>: How can you tell? You just put a lot of effort into not sending updates. You put all this effort into never sending messages, so it's theoretically impossible to differentiate between a broken network link and "I suppressed the messages because you did such a great job of allocating the budget," so how can you -- >> Navendu Jain: I'll come back to that (inaudible); give me a couple of slides. Thanks. >> Jim Larus: I know you guys like this (inaudible) lecture, but we've got like 15 minutes left, so maybe we let him get through his talk. >> Navendu Jain: But thank you very much for all the enthusiastic questions. Thank you. So the three metrics we use to quantify system stability are N all, N reachable, and N dupe. As I said, N all is the number of live nodes in the system, N reachable gives you a bound on the number of nodes whose recent inputs are being used in the reported answer, and N dupe gives you a bound on the number of nodes whose inputs may be doubly counted, right. These three metrics together characterize the accuracy of (inaudible). So in the example on the left, where 99 percent of your network is reachable, meaning the answer you're getting reflects recent updates from 99 percent of the nodes in the system, and zero inputs have been doubly counted, it is highly likely that the answer you get reflects the true state of the system. Right. But if you compare the example on the right, only 10 percent of the nodes are reachable and half of the inputs may be doubly counted; that means the system may be reporting either a highly stale answer or one that overcounts the inputs of the nodes, right, therefore you cannot trust it. Tying this back to our earlier examples: when the right subtree is disconnected, our N reachable would indicate that only 40 percent of your network is reachable, right, because a large subtree has been disconnected, and therefore the answer you are getting may be highly stale, because a parent may be caching the disconnected subtree's value, and therefore you cannot trust this stale answer. And similarly, when a subtree reconfigures, or joins a new parent, even though all of your network is still reachable, meaning I'm getting recent updates from all my nodes, half of these inputs may be doubly counted, so the answer may have the contributions of some nodes multiply counted; the answer might be overcounted and you shouldn't trust it, right. So even though we cannot guarantee to always give the right answer, NI is still useful because it tells us how these disruptions are affecting the accuracy of the global result. And we used these metrics to characterize the state of PlanetLab. This data is actually a couple of years old, but a graph I regenerated around the deadline doesn't look a whole lot different.
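As one way to picture what these counts summarize, here is a hedged Python sketch of an internal node combining its children's NI metrics; the real PRISM implementation (leases, probing, per-tree reuse) is more involved, so the structure below is only an assumption for illustration.

```python
from dataclasses import dataclass

@dataclass
class NIMetrics:
    n_all: int        # live nodes in the subtree
    n_reachable: int  # nodes whose recent (within-TI) inputs reached us
    n_dup: int        # nodes whose inputs may be counted twice

def combine(children):
    """Sum the NI metrics reported by a node's children.

    `children` is a list of (metrics, fresh, maybe_duplicated) tuples, where
    `fresh` is False if the child's report is older than the TI bound (so its
    whole subtree counts as unreachable) and `maybe_duplicated` is True if the
    child recently switched parents (so its inputs may also be counted under
    another parent)."""
    total = NIMetrics(0, 0, 0)
    for metrics, fresh, maybe_duplicated in children:
        total.n_all += metrics.n_all                               # from the child's last report
        total.n_reachable += metrics.n_reachable if fresh else 0   # stale subtree = unreachable
        total.n_dup += metrics.n_all if maybe_duplicated else metrics.n_dup
    return total


# Toy usage: a fresh left child, a stale right child, and a re-parented child.
print(combine([
    (NIMetrics(40, 40, 0), True, False),
    (NIMetrics(40, 40, 0), False, False),
    (NIMetrics(20, 20, 0), True, True),
]))
```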
So here what we actually see is that you get lots of disruptions on the PlanetLab nodes even though you have very few physical failures. Out of a hundred nodes, only about five percent have actually failed, but you still get lots of disruptions. This supports our claim that real-world systems are never really a hundred percent stable, they're always in the yellow light, and these disruptions have a big impact on the accuracy of your answers. Okay. So now we have seen that NI is useful to characterize the accuracy. Can we actually use it to improve this accuracy? A simple technique that we use is NI-based filtering. This is part of a series of techniques, but I'm going to show you results from just one technique today. In this NI-based filtering we mark an answer's quality as good or bad based on the number of unreachable and doubly counted inputs, okay. And for simplicity I'm condensing my three NI metrics into a single number. So NI small means things are fine, and NI large means things are imperfect; there's too much churn in the system. And these different lines show the different NI thresholds. So, compared to the best effort of our system, where 80 percent of reports can be off by as much as 65 percent, by using this very simple technique we can guarantee that 80 percent of the reports have at most 15 percent error. A further benefit is that when we get an answer that's tagged with NI, I know whether my answer lies in this range or this range or this range; whereas with best effort, we give you an answer and say, well, it could be here or it could be here or it could be somewhere else, I don't know. So by using NI, I'm able both to characterize the accuracy and to improve it. So now we see that NI is useful, but can it be computed efficiently? The good news is yes, these NI metrics are conceptually simple to implement (inaudible). N all is the number of nodes in the system, N reachable is the number of nodes whose recent inputs are being used, and N dupe is the number of nodes whose inputs may be doubly counted. However, they are difficult to implement efficiently. As you rightly pointed out, there's a big scalability challenge. In particular, the big challenges are that we need to compute NI for each aggregation tree in our system, and, second, we require active probing of each parent-child edge in each aggregation tree. Okay. So the first challenge is that we need to compute NI for each aggregation tree in the system, and the reason is that a failure can have different effects on different trees. For example, here the failure of this node only disconnects a leaf from this aggregation tree, right; however, in a different tree the failure of the same node disconnects an entire subtree. Since a failure affects different trees differently, you need to quantify its impact individually for each aggregation tree. The second challenge in computing these NI metrics scalably is that we need to perform active probing of each parent-child edge in each aggregation tree, and we need to do that to satisfy the TI guarantees. A naive way of doing this requires order-n messages per node per second, and for a thousand-node system this is about a hundred messages per node per second. Okay. So what's the solution?
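Before turning to that solution, the NI-based filtering just described can be illustrated in a few lines; the condensed NI score and the 10 percent threshold below are placeholder choices for the sketch, not values from the talk.

```python
def filter_reports(reports, ni_threshold):
    """Split a stream of (value, ni_score) reports into trusted and suspect,
    where ni_score is a single condensed measure of disruption (for example,
    the fraction of inputs that were unreachable or possibly double-counted)."""
    trusted = [(v, ni) for v, ni in reports if ni <= ni_threshold]
    suspect = [(v, ni) for v, ni in reports if ni > ni_threshold]
    return trusted, suspect


# Example: with a 10% NI threshold, only answers computed while the system
# was reasonably stable are kept for alerting or logging.
reports = [(120, 0.01), (480, 0.55), (131, 0.08)]
trusted, suspect = filter_reports(reports, ni_threshold=0.10)
print(trusted)   # [(120, 0.01), (131, 0.08)]
print(suspect)   # [(480, 0.55)]
```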
To address these challenges we exploit a particular property of our DHT-based system: the forest of DHT trees forms an approximate butterfly network. In particular, I'm showing you this butterfly network for a 16-node system, and it encodes all the aggregation trees in the system, for example an aggregation tree like this and another aggregation tree like that. So, for scalability, our key idea is to reuse common calculations across different trees; these common calculations are for computing the NI metrics. In particular, each node in this butterfly network is part of two trees: an aggregation tree of children underlying it, which computes an aggregate value, and an aggregation tree of parents above it, which depends on this underlying tree as input, right. So the idea for scalability is that, rather than recomputing the aggregate value of this blue tree separately for each parent in the red tree, I'm going to compute it only once and then forward that value to each parent in the red tree, and we do this across the entire butterfly network in our DHT system, right? By doing this we reduce the cost from order-n to order-log-n messages per node per second, and for a thousand-node system this only requires about five messages per node per second. Okay. We verified this experimentally: as you increase the number of nodes, the per-tree cost grows linearly, whereas using the dual-tree approach this cost grows logarithmically, and for a 1,000-node system we reduce it from about a hundred messages per node per second to only about five messages per node per second. Okay. Yes, please. >>: (Inaudible). >> Navendu Jain: Okay. We also have a one-on-one later, so I can answer that then. Okay. So to summarize the contributions of NI: NI addresses the fundamental challenge that failures can cause the system to violate our AI and TI guarantees. Since we cannot guarantee to always give the right answer, our key idea is to quantify the stability of the system when an answer is computed, right. We generalize this notion of a stability bit into the N all, N reachable, and N dupe metrics. And our system provides a scalable implementation of these metrics by reducing the cost from order-n messages to only about order-log-n messages per node per second. By using NI we can improve the accuracy of the monitoring results by up to an order of magnitude for the workloads we consider. Okay. So in this talk I've covered a bunch of things: how we use AI and TI to achieve high scalability, and how we use NI to ensure correctness of results. So let me tie this back to the bigger picture of my work. The big goal is to do scalable monitoring, and in doing that we face two big challenges: scalability to large systems and ensuring accuracy despite failures. To address these challenges, our key idea is to define precision as a new unified abstraction. This abstraction has good properties: it combines the known ideas of AI and TI in a sensible way, and it gives you big scalability benefits. And NI is a fundamentally new abstraction that enables the AI and TI guarantees to hold in practice and greatly simplifies their implementation. And PRISM provides a scalable implementation of these metrics.
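To give a feel for the dual-tree reuse, here is a small sketch with invented structure: one subtree's aggregate is computed a single time and the same result is handed to every parent tree that consumes it, instead of being recomputed per tree.

```python
def aggregate_once_and_share(subtree_inputs, parent_trees, combine=sum):
    """Compute one subtree's aggregate exactly once and forward the same
    result to every parent tree that uses this subtree as an input, rather
    than recomputing it separately for each parent."""
    value = combine(subtree_inputs)                     # computed once
    deliveries = {parent: value for parent in parent_trees}
    return value, deliveries


# Example: one subtree feeding three different aggregation trees in the
# butterfly; all three receive the same cached aggregate.
value, deliveries = aggregate_once_and_share(
    subtree_inputs=[10, 25, 7],
    parent_trees=["tree_A", "tree_B", "tree_C"],
)
print(value, deliveries)
```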
For AI, we provide a self-tuning, near-optimal solution to adjust these filter widths based on dynamic workloads, and for NI, by exploiting the symmetry of our dual-tree butterfly network, we can reduce the implementation overhead by several orders of magnitude. Okay. Now, apart from PRISM, I've worked on several other projects and built other systems that I haven't covered in this talk, but I'll be happy to talk about any of them offline. That concludes my talk, and I'm happy to answer any questions. Yes, please. >>: So when you talk about NI, it seems like it says, I have this value, but don't trust me because the system is unstable at the moment, right? But the thing is that the interesting event you want to monitor may be correlated with system instability, so all the important things are happening exactly in the timeframe where (inaudible) the value of this NI. >> Navendu Jain: Exactly. The point is that when you're detecting some sort of anomaly in the system, you are detecting that anomaly based on, say, the query answer; an anomaly might be that there's been a lot of traffic from node A to node B, which is not (inaudible), but essentially a huge chunk, so the question is how you actually get the correct or accurate value of that. So one aspect of NI is to characterize when the system is unstable. The other aspect, which I briefly touched on and can give you more flavor of very quickly, is how you improve this accuracy, right. One technique I talked about was NI-based filtering. Another simple technique we use is redundancy. The idea is simple: when we compute an aggregate, instead of using one tree we use K such trees, and you pick the best result. And practice shows, and there's also some theoretical analysis, that even with a very small K, say four or five trees, you improve the accuracy by a lot. Right. So essentially we are computing an aggregate value across multiple trees and picking the best result, okay. And this graph shows the improvement: just using multiple-tree aggregation you can improve the accuracy by 5X, and when you combine multiple trees with the filtering you can improve the accuracy by 10X. So, yes, we are approaching the problem you just mentioned by trying to improve the accuracy as much as possible when there are reconfigurations. That's one plausible way to know if there are any problems in the underlying system. Does that answer your question? Okay. Yes, please. >>: So another way to cut the data rate would be to do statistical sampling of the -- I mean, that's the (inaudible) -- do you throw that in, too? >> Navendu Jain: Yes. I want to make the claim that sampling is actually kind of plug-and-play in the system, because when you're getting inputs from the underlying system, right, they already have some error, say sampling error (inaudible), depending on what sampling you use. So our error actually becomes additive, in the sense that you have error because of the sampled data, and on top of that you are introducing additional filtering.
So for the system itself it's completely transparent: your input distribution, rather than being tracked update by update, is tracked from sampled data, and I'm updating these statistics based on the samples. Yes, please. >>: An interesting question that might be worth exploring is, if you do system-wide sampling and send samples to a central location and things like that, are there competitive approaches from an engineering standpoint that don't have the complexity of the trees and still end up -- in other words, it seems like there's a straightforward, strong -- >> Navendu Jain: Absolutely. That's a good point. Yes, think of the approach where you sample periodically; every five minutes or so you go to a central site and sum everything up and correlate. Yes, that is plausible. The argument has been that now I want to do monitoring not at a scale of five minutes but at a scale of five seconds. Right. >>: So I guess what I meant is, it seems like (inaudible) if you're looking for the largest events, then you can just keep making the sampling sparser and sparser; if the event is big, it's always going to stick out -- >> Navendu Jain: Well, but the other point, right, is that for the query I mentioned, the system right now actually detects it in 15 seconds. So if there is a flow that becomes very large over a period of 15 seconds, then the system detects it very quickly, whereas with something that is more of a periodic logging at each site, you now have to do this all the time. Right. So even though my local events may not be important, I have to send them anyway, and send them much more frequently as the time (inaudible), so there's a big scalability issue in that. >> Jim Larus: Okay. I think we all should thank our speaker. >> Navendu Jain: Thank you. (Applause)