
>> Jim Larus: It's my pleasure today to welcome Navendu Jain from UT
Austin, who he tells me he actually has his Ph.D., maybe not in hand, but he's
finished, turned it in, gotten it signed, so congratulations. That's always good to
hear.
>> Navendu Jain: Thank you.
>> Jim Larus: And he's going to talk to us about scalable monitoring today.
>> Navendu Jain: Thank you very much, Jim, and thank you very much for
inviting me. So today I'm going to talk about PRISM, which is a scalable
monitoring service we have designed and implemented for monitoring as well as
managing large-scale distributed systems.
This is joint work with my colleagues at Texas, both my seniors and juniors,
and the main motivation for this work arises from the fundamental challenge of
how you manage large-scale distributed systems.
I want to make a quick disclaimer that even though I'm using an Apple and
keynote for this presentation, please do not hold it against me. Okay.
Okay. So to give you some background, in April 2003 on PlanetLab, which is a
distributed research testbed comprising about 800 machines that are spread all
over the world, a user login was compromised. And within a short period of time,
half of these nodes were port scanning external sites. And as you would expect,
it generated lots of complaints.
So to prevent any further damage, PlanetLab had to be shut down temporarily.
Now, of course you can just blame it on the hackers and say, well, what can we
do about it, but that would give them too much credit, okay. Because in reality
it's not just the hackers, but there are several other reasons why our systems
break, such as security vulnerabilities, hardware errors, operator mistakes. And
I'm sure everyone in the audience here, as well as persons who are watching
online, must have written a buggy application at some point in time. Okay.
So the key problem here is that fundamentally distributed systems are hard to
manage, they are unreliable, they break in complex or even unexpected ways,
and it's really hard to know what's going on in the system, okay. So in order to
manage the systems what we need to do is continuously monitor and check if
there are any problems in our underlying systems, okay.
So I'm going to give you a 10,000 foot view of the problem. This is an
oversimplified view but it captures the necessary details. So here we want to
monitor a large-scale distributed system comprising tens of thousands of
nodes. At each of these nodes there are several events happening at any point
in time. So, for example, an event could be a CPU load spike, or a (inaudible)
arrival, or a memory bug or memory leak, because our operating systems
aren't perfect, right. So our aim here is to correlate these events together and find
out what are these loud or the big events on a global scale. By loud events I
mean those events that when you look at each node individually, they may be
small, but when you look collectively across the entire network they constitute
a large volume.
So, for example, a network operator here might be interested in which
network links are heavily congested, okay, or which web servers are heavily
loaded, or are there any buggy applications in my system that are consuming lots
of resources, okay. And even from a security perspective, monitoring these
systems is very important.
Now, what makes this monitoring problem fundamentally hard are three key
challenges. Our first challenge here is scalability. We need a monitoring
solution that will scale both with respect to the number of nodes as well as the
amount of data volume.
Now, we can do centralized monitoring that will scale to hundreds of nodes, but
what we need is a solution that will scale as these systems grow to tens of
thousands of nodes, and the vision is to go towards millions of nodes, okay.
And similarly, the data volume here of these events could be massive. There
are applications today that generate hundreds of gigabytes of data each day, and we
want to move to even petabytes of data. So we need scalability both with
respect to the number of nodes as well as the amount of data volume.
Second, we want to quickly respond to these events, to detect any
performance problems as well as any security threats. So we want to
do monitoring in real time.
And finally, we need a solution that is robust against both node and network
failures and still gives us an accurate view of the system.
Okay. So the bottom line here is that for monitoring these systems, we need to
process large volumes of data spanning tens of thousands of these nodes, and
we want to do it in real time.
Now, of course our actual problem is much harder because this picture is not
drawn to scale, okay. So to define the problem more broadly, our vision here is to
develop a distributed monitoring framework that monitors the underlying
system state, performs queries, and reacts to global events. And numerous
applications which I've shown here have similar requirements, such as network
monitoring, grid monitoring, storage management, sensor monitoring and so on.
In the later part of this talk I'm going to show you results from some of these
applications that we have built and in some cases even deployed. Okay.
So to realize this vision, we have designed and implemented PRISM. PRISM is
a scalable monitoring service for managing large-scale distributed systems.
Now, you might say, well, there's been a lot of work in monitoring, what's new
here. Our key contribution and our key distinguishing factor here is to define
precision as a new fundamental abstraction to enable scalable monitoring. So
what do I mean by that?
So specifically our two big goals are to achieve high scalability and high
performance and to ensure correctness, that is, give a robust solution that can
tolerate node or network failures. So by defining this new precision abstraction,
we will achieve these two goals; specifically, to achieve scalability and high
performance we'll trade precision for performance.
So think of it as getting an answer about the global state of the system. Instead
of giving an exact answer, we will give you an approximate answer, okay. But we will bound
the degree of approximation, and further we'll adapt these approximation bounds to
handle large-scale dynamic workloads.
However, in practice failures can violate the -- can cause the system to violate
these approximation bounds. Therefore, to ensure correctness we'll quantify
how accurate our results really are and how to improve this accuracy despite
large-scale system failures. Okay.
>>: Is the accuracy known a priori or after the fact?
>> Navendu Jain: Yes, I'm going to talk about this in the very next slide. So here
are our two big goals: to achieve scalability and high performance, and to ensure
accuracy or ensure correctness despite failures. So the way we are going to
realize these two goals is to define precision in the form of three dimensions:
arithmetic, temporal, and network imprecision.
So arithmetic imprecision or AI bounds the numerical error of a query result,
okay. So instead of -- take as an example, instead of giving an exact answer
with value 100, we'll give you an approximate answer, say the global query in
the system has a value of 100 plus or minus 10 percent, and the answer is
guaranteed to have a maximum numerical error of at most 10 percent. Okay.
Similarly, temporal imprecision or TI bounds the staleness of a query result. So as
an example, a query result has a maximum staleness of at most 30 seconds, that is, it
reflects all events that have happened in the system up until 30 seconds before right
now. Okay. So although AI, arithmetic imprecision, bounds the numerical errors
and TI bounds staleness, in practice failures can cause the system to
violate these guarantees. Okay. Therefore, we define a fundamentally new
abstraction, network imprecision, that bounds this uncertainty or this ambiguity
because of failures.
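To make these three dimensions concrete, here is a minimal Python sketch of what a PRISM-style query specification and answer check might look like; the field names and the helper function are illustrative assumptions, not PRISM's actual interface.

    from dataclasses import dataclass

    @dataclass
    class Query:
        attribute: str       # e.g. "traffic:port-1434"
        ai_fraction: float   # arithmetic imprecision, e.g. 0.10 for +/- 10 percent
        ti_seconds: float    # temporal imprecision, e.g. 30 seconds of staleness

    @dataclass
    class Answer:
        low: float           # lower bound on the true aggregate
        high: float          # upper bound on the true aggregate
        staleness: float     # staleness in seconds of the oldest reflected input
        ni_ok: bool          # simplified NI "flag": were nodes and links stable?

    def answer_is_trustworthy(q: Query, a: Answer) -> bool:
        # AI: the reported range must be no wider than the requested error.
        width_ok = (a.high - a.low) <= q.ai_fraction * max(abs(a.high), 1e-9)
        # TI: everything older than ti_seconds must already be reflected.
        fresh_ok = a.staleness <= q.ti_seconds
        # NI: only if the flag says "good" do the AI and TI bounds actually hold.
        return a.ni_ok and width_ok and fresh_ok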
So very simply so think of this NI as a good or a bad flag. So when you get an
answer, you get this flag, good or bad. Say if the flag says good, then these AI
and TI bounds are guaranteed to hold. Otherwise, they may not hold and you
cannot rely on the accuracy of the reported answer.
So for simplicity I'm showing this as a flag, but in reality it is a continuum as to
what extent the system actually provides these guarantees to the end user. Yes,
please.
>>: Frequently in randomized algorithms and other areas of statistics one makes
theorems of the form, you know, do the following and your answer will be within
one plus or minus delta of the true answer with probability greater than or equal
to one minus epsilon, and I feel like network imprecision seems like epsilon and
arithmetic imprecision seems like delta. So when you say it's a fundamentally new
abstraction, it sounds a lot like epsilon-delta theorems, of which I've seen a
bunch.
>> Navendu Jain: So okay. Let me take it piecewise. So you're right
essentially in saying that the one plus or minus delta abstraction is actually very similar to
arithmetic imprecision. However, for network imprecision, yes, you can
provide (inaudible) guarantees, but there are essentially two cases that
I'm going to talk about. One is there's actually an (inaudible)
impossibility result that you actually cannot give these absolute bounds.
And there are cases where actually you can give the absolute bounds but they
become very expensive. So essentially it is impossible at worst and it is expensive
at best, which I'm going to talk about. So in that perspective,
this new abstraction is actually fundamentally different. So I'm going to give
more details later.
>>: Okay.
>> Navendu Jain: Okay. So these three dimensions nicely complement each
other to enable our goal of scalable monitoring. Although the basic ideas of AI
and TI are not new, okay, what we will give are new scalable implementations of these
metrics, and show that by carefully managing them we can reduce the monitoring
load by several orders of magnitude. And by using NI we'll characterize how
accurate the results really are and how to improve this accuracy.
Now, the combination of these three dimensions is really
powerful, and my main (inaudible) is how this unified precision abstraction
enables scalable monitoring. Yes, another question?
>>: Is an order of magnitude really enough to get you to a million machines of (inaudible)
each?
>> Navendu Jain: Are you talking about this?
>>: Yes. I mean 100X seems like (inaudible) close to what you need.
>> Navendu Jain: So the -- so I'm also going to -- so this 100X is actually a
relative term, right. I'm also going to show the absolute numbers in terms of, as the
system size increases, what is the absolute -- what are the performance
requirements.
Okay. So I've given you an overview and the basic motivation of the problem of
scalable monitoring, and I'll present the PRISM architecture and the
key technologies we use to build PRISM. Then I'll describe how PRISM uses AI
and TI to achieve scalability, and how by using NI we'll ensure accuracy of
results and how to improve this accuracy despite failures. Okay. So let me first
give you an overview of PRISM.
The key idea in PRISM is to define precision as a new unified abstraction, and
we define this precision abstraction in the form of three dimensions: arithmetic
imprecision, which bounds numerical errors; temporal imprecision, which bounds staleness;
and network imprecision, which bounds the uncertainty due to failures, okay. So AI and
TI allow us to achieve high scalability by trading precision for performance. And
NI addresses the fundamental challenge
of providing consistency guarantees despite failures.
Now each of these dimensions makes sense individually but how they relate can
be confusing.
So let me walk you through an example. So suppose, very simply, we have a
security monitoring application that wants to detect the number of scan
attempts on a given port across, say, all of the machines in this building, right. So
this in particular is the UDP port for the Microsoft SQL Server (inaudible),
okay. So the query here is that we want to detect the number of scan
attempts on this port, and our precision requirements are: give me the answer
within at most 10 percent of the true value, and the staleness should be at
most 30 seconds. So, Albert, does that answer your question? You specify the
precision requirements a priori.
There's also the dual, where the system actually gives us the best possible precision
it can provide within a limited monitoring budget.
>>: I guess I'm a little surprised. (Inaudible)
>> Navendu Jain: That's kind of a dual problem. I can go into that later.
So this is our basic query. So given this query, the system returns back an
answer saying the number of scan attempts on this port across all the
machines in this building is 500 per second, and it characterizes its accuracy using
NI, okay. So to understand this, suppose our system had only AI. Then we
guarantee that the answer lies in this range, assuming no failures and negligible
propagation delays. Okay. And we use AI to reduce the monitoring load by
(inaudible) updates as long as the answer lies in this range.
So even if your answer becomes 451 or 549, you don't need to send any new
updates, right, because it still satisfies our guarantees. Okay.
Now, suppose we had only TI. Then we guarantee that the answer value is 500
and the staleness is at most 30 seconds; that means this answer reflects all events
that have happened 30 or more seconds ago. Younger events, that is, between
now and at most 30 seconds ago, may or may not be reflected. But everything
before 30 seconds is reflected here. Okay.
And again, here I'm assuming that nodes in the network are reliable and links
have negligible propagation delays. And we use TI to reduce load by combining
multiple updates together and sending a single batch. So it's a very simple idea.
Now, when you combine AI and TI, then we guarantee that the answer lies in this
range based on inputs that are no more than 30 seconds stale. And by using this
combination we can further reduce the load by first combining multiple updates
together and then sending a batched update only if it drives our answer out of
this range. So we are combining both numerical culling due to AI as well as temporal
batching due to TI to further reduce the
monitoring load.
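A rough sketch of how AI culling and TI batching might combine at one node, under the no-failure assumptions above; the class, the send_to_parent callback, and the timing model are hypothetical placeholders, not PRISM code.

    import time

    class AITIFilter:
        def __init__(self, low, high, ti_seconds, send_to_parent):
            self.low, self.high = low, high      # AI range for the local aggregate
            self.ti = ti_seconds                 # staleness budget used for batching
            self.send = send_to_parent           # placeholder callback
            self.pending = None                  # newest out-of-range value, not yet sent
            self.last_sent = time.time()

        def on_update(self, value):
            if self.low <= value <= self.high:
                return                           # AI culling: still within the guarantee
            self.pending = value                 # remember it for the next batch
            self.flush_if_due()

        def flush_if_due(self):
            # TI batching: hold out-of-range values and send them as one update,
            # but never hold them longer than the staleness budget allows.
            if self.pending is not None and time.time() - self.last_sent >= self.ti:
                self.send(self.pending)
                self.pending = None
                self.last_sent = time.time()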
Now, note that for both AI and TI, I said that I'm assuming nodes are reliable and links
have small delays. And NI handles cases when these assumptions do not hold.
So when my NI flag says good, then these AI and TI bounds are guaranteed to
hold. Right. So this gives you absolute guarantees that the answer you're
getting satisfies these accuracy bounds. However, when the NI flag says bad, then
these bounds may not hold, and you cannot trust the accuracy of the reported
answer.
So it would be great if I could always give you this green light that, yes, you can trust
this answer, versus the red light that, no, you cannot trust this answer. Right?
However, in reality large-scale systems are never really 100 percent stable, so
essentially you never get the case that your answer always satisfies your
bounds versus never satisfies; it's not that your
answer always lies in that range. So NI actually provides this metric as to what
extent these AI and TI bounds hold.
So this is an overview of the overall PRISM architecture. Yes, please.
>>: So earlier when you said something to Albert about this is the -- I specify
plus or minus 10 percent in my query, by the end of your slide I thought that feels
like it's the output. You're saying the output of the query is, for 1434, 500 plus or
minus 10 percent, because otherwise I would never specify that, right, I would
never specify that as an input, so that second part is an output?
>> Navendu Jain: So this is -- these two are part of the query. The system
outputs the answer and characterizes its accuracy using NI.
>>: Okay. Got it. Great.
>> Navendu Jain: Any other questions?
>>: (Inaudible). It may be simpler to say good or bad, but probably (inaudible).
>> Navendu Jain: Yes, so essentially that's exactly what I'm trying to -- so when
I say good or bad, it's a flag, it's a zero-one flag, right? So good means great,
everything's perfect, but systems are never (inaudible) stable. So I'm going to
give you a continuum as to what extent. Okay.
So this is the overall architecture. Now I want to talk about the key technologies we use to build the
system. So the key abstraction we're going to use for building scalable monitoring
is aggregation. Aggregation, very simply, is the ability to summarize
information, okay.
We define this aggregation abstraction in the form of an aggregation tree that
spans all the nodes in the system. So in this particular example, I'm computing
a sum aggregate of the inputs at the leaf nodes. We perform this in-network
aggregation and get the global aggregate value of the inputs at the root. So that's
our basic approach. We're going to use these aggregation trees to collect or
aggregate the global state of the system.
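As a rough illustration of the in-network aggregation just described, here is a minimal Python sketch that computes a SUM aggregate up a tree; the hard-coded tree layout and values are purely hypothetical.

    class Node:
        def __init__(self, local_value=0, children=()):
            self.local_value = local_value
            self.children = list(children)

        def aggregate_sum(self):
            # Each node combines its own input with the partial sums of its
            # children, so only one aggregate value flows up each tree edge.
            return self.local_value + sum(c.aggregate_sum() for c in self.children)

    # Leaves hold the raw inputs; internal (virtual) nodes only combine them.
    leaves = [Node(3), Node(5), Node(2), Node(7)]
    root = Node(0, [Node(0, leaves[:2]), Node(0, leaves[2:])])
    print(root.aggregate_sum())   # 17, the global SUM at the root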
So the natural question is, how do you build these trees in a distributed environment?
Okay. So to build these trees, a key technology we're going to use is distributed
hash tables or DHTs. And certainly this audience doesn't need an introduction to
DHTs. Very simply, a DHT is a scalable data structure that has become recently
popular in the systems community. It provides important properties of
scalability, self-organization, and robustness.
Now, I'm not going to go into the details of these DHTs, but the key point, well, the
only point I want you to remember from this slide is that we're going to use DHTs to
build a random aggregation tree, and for load balancing we're going to build
multiple such trees. So for example, you can build one aggregation tree that
keeps track of the traffic sent to a given destination, you can build another
aggregation tree that keeps track of which are the most heavily loaded machines
in the network, and yet another tree that keeps track of which nodes are storing
which files. All right.
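A toy sketch of that load-balancing idea: each monitored attribute hashes to a key, and the key picks which tree, and hence which root node, is responsible for it. The hashing scheme and the fixed node list below are illustrative assumptions, not the actual DHT routing.

    import hashlib

    NODES = ["node%d" % i for i in range(8)]   # hypothetical physical nodes

    def tree_root_for(attribute):
        # Hash the attribute name to a key; the node that owns that key region
        # (simple modulo here, in place of real DHT routing) becomes the root
        # of that attribute's aggregation tree, spreading load across nodes.
        key = int(hashlib.sha1(attribute.encode()).hexdigest(), 16)
        return NODES[key % len(NODES)]

    print(tree_root_for("traffic:10.0.0.1"))   # tree for one destination IP
    print(tree_root_for("cpu_load"))           # different attribute, likely a different root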
So our basic approach is aggregation, and we're going to use these aggregation
trees to collect the global state of the system. So now let me take an example
that ties all of this together.
So recall my first motivating slide where essentially PlanetLab was being used to
launch (inaudible). So we formulate that problem in the form of this query: find out
the top hundred destination IPs that are receiving the highest traffic from all the
PlanetLab nodes; there are roughly 800 PlanetLab nodes right now.
So to answer this query -- so essentially, think of it as finding out the top
hundred destination IPs. That means that if there's a likely attack going on on a
victim site, then it should likely be in the top 100 destination IPs, because we are
looking at the aggregate volume of the outgoing traffic. Okay. So how do we
compute this query? We're going to compute it in two steps. In the first step
we're going to compute the aggregate traffic sent to each destination IP from all
nodes in the system. So in this aggregation tree, the physical nodes are the
leaves and the internal nodes are simply what I call virtual nodes that are
mapped to different physical nodes, okay.
So in the first step we're going to compute, for each destination IP, the aggregate
traffic sent to that destination from all the nodes in my system. This is my first
step. So for all the IP addresses I'm going to compute this aggregate traffic. And
in the second step I'm going to compute a top-100 aggregate function, and again
doing this in-network aggregation I'll get the global top-100 list.
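A centralized toy version of the two-step query just described, assuming per-node traffic counts are already available; in PRISM both steps would run in-network across the aggregation trees, so this only shows the data flow.

    from collections import Counter
    import heapq

    def top_k_destinations(per_node_traffic, k=100):
        # Step 1: aggregate the traffic sent to each destination IP across all
        # nodes (conceptually, one SUM aggregation per destination attribute).
        totals = Counter()
        for node_counts in per_node_traffic:       # one Counter per physical node
            totals.update(node_counts)
        # Step 2: a TOP-k aggregate over the per-destination totals.
        return heapq.nlargest(k, totals.items(), key=lambda kv: kv[1])

    per_node_traffic = [
        Counter({"1.2.3.4": 900, "5.6.7.8": 10}),
        Counter({"1.2.3.4": 800, "9.9.9.9": 40}),
    ]
    print(top_k_destinations(per_node_traffic, k=2))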
Sure, go ahead.
>>: You're going to take every single IP address that every PlanetLab node is
sending data to and compute -- and create a separate aggregation tree on it?
>>: It's okay. (Inaudible). (Laughter).
>>: And therefore it's scalable.
>> Navendu Jain: I'm going to actually hit that point on the very nail. So actually
I'm going to reverse the question. So do you know how many IP addresses there
are at any point in time?
>>: Well, it's only IPv4, so.
>>: (Inaudible).
>> Navendu Jain: In practice?
>>: In the world?
>> Navendu Jain: Yeah. So at any -- if you take a snap --
>>: I have no idea.
>> Navendu Jain: Take a guess.
>>: I have no idea.
>>: 100,000.
>>: Oh, no, it's much more than that.
>>: (Inaudible) millions of --
>> Navendu Jain: So take a guess, so how many.
>>: (Inaudible).
(Brief talking over).
>> Navendu Jain: Well, actually it's about a million roughly, so.
>>: There's only a million live IP addresses?
>> Navendu Jain: So if you take a -- at any point in time if you take a snapshot
of the traffic that is going outside, collected from all the PlanetLab nodes, the
number of unique destination IPs --
(Brief talking over).
>>: That wasn't the question we thought you asked.
>> Navendu Jain: So this is the number of destination IPs that are being contacted by these
800 PlanetLab nodes located around the world.
>>: I'm actually surprised it's that large.
>> Navendu Jain: If you take a snapshot, actually it is --
>>: I believe you, but I'm surprised.
>> Navendu Jain: Okay.
>>: So there are a million such aggregation trees.
>> Navendu Jain: And in principle -- yes. In principle we're going to build a
million trees, right. But now a key observation is that a majority of these IP
addresses receive very few updates. So remember we are computing the top
hundred list.
>>: I agree.
>> Navendu Jain: So essentially a majority of these IP addresses see very few
updates, and if you set your precision or your accuracy requirement to say, give me
these flows within, say, one percent of my maximum flow value, right.
>>: (Inaudible).
>> Navendu Jain: I'm sorry?
>>: How do you know that?
>>: (Inaudible).
>> Navendu Jain: So if you take one percent of the maximum flow, which is
say roughly about 3.5 megabytes per second, right, then you can actually filter
more than 99 percent of these destination IPs. Right. So this is important for
scalability: although in principle you need to compute the aggregate values for all of
these millions of IP addresses,
essentially by using a small amount of imprecision, a small amount
of error that you are willing to tolerate in your answer, you can actually filter out
the majority of these nodes, and you would still get the top hundred list with bounded
accuracy.
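A back-of-the-envelope sketch of that filtering argument: with an error budget of one percent of the maximum flow, any destination whose aggregate traffic stays under the threshold never needs to report. The numbers here are made up.

    def culled_fraction(flow_totals, max_flow, budget_fraction=0.01):
        # A destination whose aggregate volume stays within the error budget can
        # be filtered entirely without violating the accuracy guarantee.
        threshold = budget_fraction * max_flow
        mice = sum(1 for v in flow_totals if v < threshold)
        return mice / len(flow_totals)

    # Hypothetical per-destination totals in bytes per second.
    flow_totals = [100, 2000, 50, 3_500_000, 800, 10, 40_000]
    print(culled_fraction(flow_totals, max_flow=3_500_000))   # most destinations are mice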
>>: So you're filtering out, I've got to understand here, you don't know how much
is being sent to a particular IP address until you construct the tree and bring it up
to the root, right?
>> Navendu Jain: Well, not quite. But please go ahead. Which I'm going to
actually cover --
>>: If this is going to be part of your talk, I'm not going to go into it.
>> Navendu Jain: Yes, I'm going to go into it in the very next slide, couple of
slides.
So this is my -- I'm going to use this as a running example, so I'm going to keep coming back
to it. But yes, we're on the right track.
So essentially I'm going to talk about how we use this arithmetic imprecision to
achieve high scalability. So arithmetic imprecision or AI, quick recall, bounds the
numerical error of a query result; that is, instead of giving you an exact answer
with value 100, we are giving you an approximate answer, say 100 plus or
minus 10 percent, but the answer is guaranteed to have a maximum numerical
error of at most 10 percent, right.
So when applications don't really need exact answers, AI allows us to reduce
load by caching old values and filtering small changes from these values. So
even if my actual true answer becomes, say, 91 or 109, I don't need to send an
update because it still meets my error guarantees.
The two key issues in implementing AI are the mechanism question of how do
you take this 10 percent error and divide it among the nodes in the system, as
you were pointing out, and the policy question of how do you do it optimally.
Right. So I'm going to show a flexible mechanism in which you can take a total budget
and divide it in any manner across the system, and I'm going to show you an
adaptive algorithm that performs self-tuning of these budgets to minimize the
monitoring load. So the way PRISM uses AI is by installing these filters at each
node in the aggregation tree. So remember we are aggregating the global state
of the system using aggregation trees, right.
So each such filter, which is at each node in the aggregation tree, denotes a
bounded range [low, high] such that the actual value lies in this range and the
width of this range is bounded by the error budget delta. Okay.
So in this example, I'm setting these filter widths for a sum aggregate, right. So
now when we get a new update, we only need to send this update if it violates
the filter. So for example, if you get an update here with value five, we don't need
to send it because it already lies within our guarantee, that is, the answer lies
between four and six, so you can simply cull it or filter it.
However, if you get an update that lies outside the range, then we need to propagate it to
the parent, adjusting the filter widths along the way.
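A minimal sketch of the per-node filter just described: keep a bounded range [low, high] of width at most the local budget delta, cull updates that stay inside it, and re-center and report when an update escapes. The reporting callback is a placeholder, not PRISM's API.

    class AIFilter:
        def __init__(self, delta, report_to_parent):
            self.delta = delta               # this node's share of the error budget
            self.low = None                  # current guaranteed range [low, high]
            self.high = None
            self.report = report_to_parent   # placeholder callback

        def on_update(self, value):
            if self.low is not None and self.low <= value <= self.high:
                return                       # culled: the cached range still covers it
            # Violation: re-center the range around the new value and tell the
            # parent, which will recompute its own aggregate range.
            self.low = value - self.delta / 2.0
            self.high = value + self.delta / 2.0
            self.report(self.low, self.high)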
So the two key issues for AI are the mechanism question of how do you take the
total budget delta and divide it among the nodes in the system. So given a total
budget delta at the root, you essentially have the flexibility to divide it in any manner. So
for example, the root can keep some part of the budget for itself and divide the
remaining among its children, okay. And the children do the same: they
keep some part of the budget for themselves and divide the remaining among their
children, and so on.
So in this way you have tremendous flexibility to divide the total budget in any
manner across the nodes, right. So the intuitive question is, how do you decide
which way to go, or what is the best possible setting, right? So --
>>: I have a question.
>> Navendu Jain: Yes, please.
>>: If you specify it is a relative error rather than absolute error, how do you
divide? It seems that you'd have to give the same relative error to both of your
children unless you have an assumption about the magnitude of the values that
were given you.
>> Navendu Jain: Right. So there are essentially two ways of looking at the
error. One is, as I said, the absolute error, right. So essentially the talk I'm
giving today is actually based on the absolute error. But there's an analogous notion
of relative error as well, where you can also
specify the error relative to the actual value. And there's a separate
piece of work on that which I haven't covered, but essentially I can give you
more details when we talk.
>>: Okay. I'll think absolute error for the rest of the talk.
>> Navendu Jain: Okay. Great. So now the policy question is, how do you do it
optimally, right? So remember our goal is scalable monitoring, and our key
constraint is that for scalability we want to minimize the monitoring overhead;
essentially we want to minimize the number of updates in the system.
So ideally you want to set these
filters such that you cull as much data as possible, not just at the leaves but at
the internal nodes as well. Right. So you want to minimize the total amount
(inaudible).
So if your input distribution, the input workload, is uniform, then you might want to
divide the budget uniformly across all nodes in the system, okay. However, if your input is
skewed, so for example if your right subtree is generating a lot more updates
than the left subtree, you might want to give a larger width to the right subtree and take it
from the left, right. And even within the same tree, sometimes you may want to
give larger widths to the leaf nodes and smaller to the internal nodes, and
sometimes larger to the internal nodes and smaller to the leaf nodes. So
what's the right thing to do here? And these are quickly changing, dynamic situations
that are present in real applications.
So our goal here is to do self-tuning of these filter widths.
>>: Let me ask a question. Why would you (inaudible) budget at any of the
internal nodes, rather than leaves, is it just so you could work with the distributed
or -- because they don't generate data, so is that the answer?
>> Navendu Jain: No. It's a very good question. So a very simple example
where you actually want to give larger widths to the internal nodes is, for example,
think of it as you're getting consecutive updates. So one child gets an update
with value V1, and the next update it gets is V1 plus 100, okay. And
the other child gets a value V2, and then
V2 minus 100, right. And I'm essentially computing a sum aggregate.
Right.
So if I give this filter width as one unit, then there's no way I can filter that update,
right, because the next value differs from the previous value by 100 units.
Does it make sense? So essentially, okay, so the high-level answer is that when
you combine these updates they can cancel each other out.
>>: Sure, but they can cancel each other out whether you push the value -- or
whether you push the budget down or --
>> Navendu Jain: Not quite, because in one case essentially I can give larger
widths, larger budgets here, but still the value itself may still (inaudible). So think
of it as the first value being V, the next value being V plus a million. So one
thing goes up by a large amount, the other thing comes back by a small amount.
And think of it as a process filter.
>>: And in one of them (inaudible).
>> Navendu Jain: Or the aggregate value, more generally -- the aggregate value
stays in some -- has a very small change. So you can essentially cancel that very
small change by having a larger filter width at the internal --
>>: So first (inaudible) but the large numbers gives you small numbers.
(Inaudible).
>>: I have the same question as Bill. I similarly don't understand why you
wouldn't spend the budget on the leaves.
>>: Because if it bursts at the leaves, then you don't have enough budget to --
>>: But it's not one budget. It's that if you don't spend it at the leaves, it
doesn't give you more at the internal nodes, as far as I understand.
>>: As far as I could tell.
>>: You don't get more at the internal nodes for not spending it at the leaves.
>> Navendu Jain: Absolutely. You're not gaining, because the total budget is fixed
here. Essentially you are taking this budget from the leaf nodes and putting it
here. So the idea is, suppose I don't keep anything internally and push
everything onto the leaves; then keeping some budget internally costs
at most the same as pushing all the budget to the leaves. Does
that make sense?
>>: The thing that you guys are disagreeing about is that you're saying if an
internal node has a certain amount of budget and it gives some to the leaves, it
loses that budget, and John is saying --
>>: I'm saying I don't understand why.
>>: But you have to report -- if you're an internal node, you have some idea of
your values, right, and in particular you know that your children are all within the
range of --
>> Navendu Jain: Yes, guarantees.
>>: (Inaudible) right. So when I get an update from a leaf that says it's moved
out of its range, then I know -- okay. I see what you're doing.
>> Navendu Jain: So essentially then you are also propagating these updates
up as well.
>>: (Inaudible).
>> Navendu Jain: So is John --
>>: I'm still confused. I'm still confused about why there's a trade-off in budget
between the parent and the child.
>> Navendu Jain: So the trade-off essentially is the following. Suppose I can
keep the entire budget here and give nothing to the leaves.
>>: I don't understand why it's a trade-off, why you can't -- if I have the range
zero to 10, and I give zero to five to one child and five to ten to another child,
right, then I still have zero to ten.
>>: Because imagine --
>> Navendu Jain: So the trade-off essentially is in terms of the following: I'm
trying to minimize the updates, right. So what you are really asking is, by setting
those widths in a certain way, when I get a message from my children and I
combine them, when
I aggregate them, do I still need to propagate it to my parent as well?
>>: Right.
>> Navendu Jain: Right. So now the point there essentially being that
sometimes you may want to give only a very small amount of this budget to
the children but essentially keep your own error window larger.
>>: I understand keeping my error window larger. Why does that ever mean I
am -- it is good to give less to my children?
>> Navendu Jain: Okay. So actually, since I'm running short on time, let me give you a
very quick answer to that. So I'm going to (inaudible). So here I'm essentially
saying the children's ranges are 4 to 6 and 3 to 4, right. My total budget is five; the parent has two, this one has two,
this one has one, right. So we assemble the sum aggregate, which essentially takes the
sum of the lows and the sum of the highs, and then I apply my local filtering here.
Right, the local filtering essentially expands this range by, you know, delta over
two on both sides. So this is seven and ten, and my delta over two is one,
so essentially I'm expanding it to six and 11. Makes perfect sense?
>>: Yes.
>> Navendu Jain: Right. So now I get a new update: this value becomes six
and that one becomes five, right. So now I don't need to do anything here because
it already lies in my range.
>>: Great.
>> Navendu Jain: So you simply filter it out.
>>: Great.
>> Navendu Jain: This one you need to send up. Essentially you send it up here and
you recompute the aggregate. You are using the cached value from
here, and you use the updated value from there. Now, your value is eight and
11. So eight and 11 already lies in the previous range, because it was six and 11
that I reported to my parent. Right? So now, essentially because of keeping the
budget here, right, I'm able to cull that update. If I kept no budget here, then every
time a leaf changes I have to report it to my parent. Yes, precisely.
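To tie the numbers in this exchange together, here is the same example as a small script; the ranges and budget split are taken from the discussion above, and the combination rule shown is a simplified illustration of what the filters do.

    # Children's ranges and the parent's own slice of the budget, as in the
    # example: total budget 5 = 2 (parent) + 2 (left child) + 1 (right child).
    left, right, parent_delta = (4, 6), (3, 4), 2

    def combine(a, b):
        # SUM aggregate over two bounded ranges.
        return (a[0] + b[0], a[1] + b[1])

    def widen(rng, delta):
        # The parent spends its own budget by widening the range it reports up.
        return (rng[0] - delta / 2, rng[1] + delta / 2)

    reported = widen(combine(left, right), parent_delta)
    print(reported)              # (6.0, 11.0), the range sent to the grandparent

    # New updates arrive: the left leaf's value 6 stays inside (4, 6) and is
    # culled; the right leaf's value 5 escapes (3, 4) and sends up a new range,
    # say (4, 5).  The recombined children's range is (8, 11), still inside the
    # previously reported (6.0, 11.0), so nothing propagates further up.
    new_children = combine(left, (4, 5))
    print(new_children)          # (8, 11)
    print(reported[0] <= new_children[0] and new_children[1] <= reported[1])   # True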
>>: In other words, the -- so if --
>>: You can't spend the margin of error twice (inaudible).
>> Navendu Jain: Okay. If you --
>>: So it's that I didn't know whether to give that budget to the left node or the
right node.
>> Navendu Jain: Or keep it. I mean I'm showing you a (inaudible) but it's
actually a (inaudible).
>>: If you keep it yourself there's a decent chance that you could cull the thing
(inaudible). If you push everything down to the leaves, any time anything ever
exceeds its budget, it has to propagate all the way (inaudible).
>> Navendu Jain: So if you still have concerns we can meet --
(Inaudible discussion.)
>> Navendu Jain: I'm going to skip that. Right. So now it's not, well, I guess the
problem is not just the optimization, but actually it's even harder, because in order
to adjust these widths we have to send messages. So essentially you want to -- I
want to minimize the number of messages that are being propagated, the
monitoring overhead, but if you want to adjust these widths optimally,
you also end up paying the cost of sending messages to adjust them.
>>: (Inaudible).
>> Navendu Jain: Total number of messages. Both the going up as well as
going down.
>>: Popular metric.
>> Navendu Jain: Okay. So now, to address these challenges, we have
implemented -- well, designed and implemented a principled, self-tuning, and
near-optimal solution. Okay. Note that we cannot apply heuristics here,
because heuristics may work for one workload but may not work for another
workload. So what we essentially need is a principled solution that uses the
workload properties themselves to guide the optimal setting of these filter widths.
Further, we need a self-tuning solution to adapt to dynamic workloads, because
workloads can change over a period of time, and to make sure that the benefits we
get by sending these messages exceed the costs of sending them. Right. So
you want to make sure your benefits always exceed your costs, right.
And finally, we have done theoretical analysis of this problem and shown that our
solution matches the optimum online algorithm. So first I want to show you a workload-aware
solution to estimate the optimal filter widths. So for any given workload, the
key thing to consider is the variability or the noise in the workload, right;
essentially if you have very high noise in the workload then you need a larger filter
width to cull its updates. It's the same point we were talking about: if you have value V
and the next value is V plus (inaudible). So if you have more noise in the
workload you need a larger filter width to cull its updates.
So I'm expressing this notion in this graph, where the monitoring load on the Y
axis is expressed in terms of the total budget delta, that is, the error we are willing
to tolerate, and the standard deviation or the variance in the input workload. Okay.
So when you look at the left of this graph, when your error budget delta is much
smaller compared to the noise in the workload, then you expect to filter very few
updates because most of them will actually lie outside our range. However, on
the right, the error budget delta is much larger than the noise in the workload.
So think of it as a Gaussian. So for a Gaussian, if you have delta as, say, plus or
minus three times the standard deviation, you can filter out roughly 99 point some
percent of its updates, right. So now you expect the load to decrease quickly until
a point where a majority of these updates get filtered. Okay.
We capture this mathematically using the (inaudible) inequality, which allows
us to express the expected message cost of any such filter in terms of
the variance and the update rate of the input workload, as well as our input error
tolerance. And the whole idea of doing this mathematical modeling is to
estimate the optimal filter widths that we are setting at each node in an
aggregation tree.
>>: (Inaudible) normal distribution?
>> Navendu Jain: No. Actually, so that's why essentially we're using
the (inaudible) inequality, which doesn't make any a priori
assumption about the input distribution.
>>: (Inaudible).
>>: Well, that's why it also doesn't quickly decay to 99.9 percent, right; if you set,
you know, like he said, delta equal to three sigma, it's only one ninth.
>> Navendu Jain: So essentially, so this is not a very tight bound, but
it doesn't make any assumption about the workload. Okay.
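If the (inaudible) inequality here is Chebyshev's inequality, which is the standard distribution-free bound, a rough sketch of the cost model might look like the following; the bound and the allocation rule are my own illustrative simplification, not the paper's exact closed-form solution.

    def expected_cost(update_rate, stddev, width):
        # Chebyshev: P(|X - mean| >= width/2) <= (stddev / (width/2))**2, which
        # upper-bounds the fraction of updates that escape a filter of that width.
        if width <= 0:
            return update_rate
        return update_rate * min(1.0, (2.0 * stddev / width) ** 2)

    def allocate_widths(nodes, total_budget):
        # Minimizing the summed Chebyshev bound subject to a fixed total width
        # gives width_i proportional to (rate_i * stddev_i**2) ** (1/3); this is
        # an illustrative allocation, not PRISM's actual formula.
        weights = [(rate * sd * sd) ** (1.0 / 3.0) for rate, sd in nodes]
        scale = total_budget / sum(weights)
        return [w * scale for w in weights]

    nodes = [(100.0, 5.0), (10.0, 1.0), (1.0, 0.1)]   # (update rate, stddev), made up
    widths = allocate_widths(nodes, total_budget=30.0)
    print([round(w, 2) for w in widths])
    print([round(expected_cost(r, s, w), 2) for (r, s), w in zip(nodes, widths)])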
So using this model we formulate an optimization problem for a one-level tree,
where we want to minimize the number of messages being sent from the children
to the root such that our total budget delta-T is fixed, right. So your input error
tolerance is fixed; given that constraint you want to minimize the number of
messages being sent from the children to the root, okay.
So now, solving this optimization problem, we essentially get a nice closed-form
solution. Now, I don't expect you to understand this equation, but one thing I
want to point out here is that the optimal setting of these filter widths depends directly on
the (inaudible) of the workload, that is, both the variance as well as the update rates.
And we extend this approach to a general aggregation tree.
However, computing this optimal is not enough, because we need to send
messages to adjust these filters. Yes, please.
>>: Update rate really doesn't seem very well defined to me.
>> Navendu Jain: The update rate is the number of -- for
example, for a network monitoring application, the number of packets you're receiving
per unit time.
>>: I understand that these are discrete systems, but it seems sort of abstract;
the value function is essentially continuous throughout. The
update rates come almost as quickly as you're willing to look at them.
>> Navendu Jain: Essentially you are describing some sort of a time window
over which you're looking at how these things are changing. So for example --
>>: (Inaudible).
>> Navendu Jain: No, no, no. Well, we're essentially defining the update rate over a
time window, right. So essentially those are time windows. Essentially you count the number of
updates you have received in a given time window and you divide by the length of the
window. Right. Does that make sense?
>>: No. Because I think you're measuring over that time window is some
underlying let's call it continuous function that's updating (inaudible) right?
>> Navendu Jain: So essentially this is an approximation of how the continuous
signal is actually being (inaudible). So essentially as --
>>: I guess what I'm trying to -- intuitively the standard deviation is well defined
for any reasonable thing you're measuring, but the update rate is something you
chose, right, you said I would like to measure these every ten seconds. I mean
the update rate is (inaudible) value, and if the function is varying, the value
function is varying, it's not constant, then it will update at every window.
>> Navendu Jain: So what's really happening is that when you essentially
receive a data packet, right, essentially you are computing the (inaudible)
standard deviation, so this is being computed in an online manner; essentially you
are keeping track of the update rate as well. So essentially think of it as, in one
hour, essentially I'm keeping track of what all updates I'm getting, essentially how
many times this value is being refreshed. Right, so I'm not actually picking -- I'm
just picking the time window over which I'm looking at the system; I'm not actually
picking how the input data is
being generated, because that is part of the input workload.
>>: I'll worry about that.
>> Navendu Jain: Okay. So essentially, computing this optimal is not enough.
We need to send messages to adjust these filters. So imagine we do this periodically.
So say every five minutes we go and we compute the optimal settings of these
filter widths and we send messages to adjust them. Now, as you increase the
frequency of redistribution, that is, you go from say five minutes to a minute to 30
seconds and so on, you are essentially getting closer and closer to the
optimal. Whenever there's a difference between the current setting and the
optimal setting, I'm going to send messages to fix it, right. So you are
essentially getting more and more optimal and your filtering becomes more
effective. Right. However, there is a point of (inaudible) after which the benefits
you get are very marginal. Essentially you keep sending these messages to fix
the imbalance between the current setting and the optimal setting, but the benefits
you get are very marginal compared to the cost of sending these messages. So in
general there is this trade-off between the message overhead and the frequency
of redistribution. So read this X axis here as going from five minutes to a
minute to 30 seconds, 10 seconds and so on. So as we go from left to right, we
are increasing the frequency of redistribution. We are sending messages much
more frequently.
So essentially then our total monitoring load, shown in blue, decreases, right, because
our filtering becomes more effective. However, there's a point of diminishing
returns after which this redistribution cost, shown in red, starts to dominate. Right.
Is the point coming across? So essentially the idea here is to make sure that the
benefits we get by redistributing always exceed the costs of redistribution.
>>: So there's an assumption here of global redistribution as opposed to, say,
just redistributing -- this is for one parent and --
>> Navendu Jain: No, this is a general hierarchy.
>>: (Inaudible).
>> Navendu Jain: Yes.
>>: So unless I'm incorrect, the way you're describing this, it sounds like there's
an assumption that you're globally redistributing the whole tree, and one could
imagine these systems where you only redistribute some nodes that are the most
out of whack, which would reduce the redistribution cost possibly (inaudible).
>> Navendu Jain: Right, so essentially -- which is precisely the point of when
do you redistribute, right? So when I computed the optimal, I said by how much
you redistribute, what is the optimal setting. So the next logical question is
when do you redistribute. So essentially you want to move the
distribution of the entire
tree towards the optimal.
>>: Right. But what I'm saying is that I think that's an assumption you're making
that's not necessarily -- that requires that at least requires an explanation of why
it's a valid assumption. Because you could imagine, say you have a big tree, you
know, a binary tree and the left half turns out to have its internal stuff completely
out of whack and the right half is completely close to right. So you can reduce
your redistribution costs by half by just not touching the right half, just leaving it the
way it is. Now, you're not optimal, but you've cut down on the red cost
substantially while bringing the black cost -- getting most of the advantage of the
black cost.
>> Navendu Jain: Right. Exactly. That's precisely the point that essentially how
do you make that decision.
>>: Okay. So you're (inaudible).
>> Navendu Jain: Yes. Precisely. That's precisely the point. So essentially
what we're really doing is applying this cost benefit (inaudible), said that, you
know, if you don't really need to redistribute then don't. So essentially how do
you make that decision.
>>: Making a different -- there's two questions, right. One is are you far enough
off optimal that you just don't want to redistribute at all, but the second question is
it may make sense to make a partial redistribution, and it sounded at least and it
still kind of sounds like you're making an assumption that you either redistribute
completely or not at all.
>> Navendu Jain: So, right, so the redistribution is happening at each internal
node: each parent of its underlying children is essentially making this
decision of whether I need to redistribute among my children, and so on.
Every internal node is doing that. So there is no notion of a global -- right.
Exactly. So essentially what we're doing is applying this very simple key idea
of cost-benefit --
>>: (Inaudible).
>> Navendu Jain: It's not even global; everything is happening locally, each parent
deciding individually, you know. So think of it as each internal node is a parent
of its underlying children, and essentially each of them is making a decision that for
my subtree I want to minimize the number of updates that are happening.
So here essentially we redistribute the budgets only when either there is a large
load imbalance, that means there is a big difference between the current setting
and the optimal setting, or the load imbalance itself is small, but over time it has
accumulated, so it becomes a long-lasting imbalance.
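A tiny sketch of that cost-benefit rule; the benefit and cost estimates are placeholders for whatever model each internal node keeps, and the handling of accumulated small imbalances is simplified.

    class RedistributionPolicy:
        def __init__(self, redistribution_cost_msgs):
            self.cost = redistribution_cost_msgs   # messages needed to push new widths down
            self.accumulated_benefit = 0.0

        def should_redistribute(self, current_load, optimal_load):
            # Benefit: messages we expect to save by moving from the current
            # width setting to the estimated optimal one this period.
            benefit = max(0.0, current_load - optimal_load)
            self.accumulated_benefit += benefit
            # Fire either on a large imbalance, or when a small but long-lasting
            # imbalance has accumulated past the cost of sending adjustments.
            if self.accumulated_benefit > self.cost:
                self.accumulated_benefit = 0.0
                return True
            return False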
So essentially we redistribute and make sure that our benefits always exceed
our costs, and we have done theoretical analysis of our solution and shown that
our solution actually matches the optimum online solution, and for this problem
no constant-competitive algorithm exists. Yes, please.
>>: Would you describe how to choose the distribution of error budgets between
a parent and children? It -- from what I could tell it didn't take into account any
difference that would make in the later cost of redistribution.
>> Navendu Jain: Right. So --
>>: So I felt like you assumed a static workload and said if I never have to
redistribute this is now the correct thing and now I feel like you're saying well now
I'm going to make redistributions but I'm not going to make those redistributions
taking into account my possible desire to make future redistributions cheaper.
And redistribution seems like actually a very strong argument for keeping error
budget at parent nodes.
>> Navendu Jain: Right. So essentially -- right, so when you're essentially
computing the optimal, you're looking at how the distribution
has behaved over this time. So yeah, so essentially there's some notion that the
workload is going to look like it has looked so far, so I'm
computing the optimal based on that, right. And then, exactly as you
said, this essentially takes in the dynamic aspects: essentially, how do you make
sure that when you redistribute, your benefits are
always going to be better than the costs?
>>: Couldn't you do better by also taking the cost of future redistributions into
account when you assign --
>> Navendu Jain: Essentially then you have to build more of a predictor model,
how the distribution is going to behave, right, so the idea here essentially was we
are building a general framework that essentially no matter what your distribution
does the system is still applicable in all cases.
>>: (Inaudible). You kind of keep going until the cost of fixing it is equal to what
you've already paid. A lot of them.
>> Navendu Jain: So to see if this is effective, I'm going to first show you some
quick simulation results, and then I'm going to show you results from a (inaudible)
implementation.
So in this experiment, essentially we are using a 90 (inaudible) workload where 90
percent of the sources have zero noise, that is, essentially they are giving the same
input value over and over again, and 10 percent of the nodes have very high
noise. (Inaudible)
So you expect your optimal algorithm to take all the budget from these zero-noise
sources and give it to this noisy 10 percent, right. So in this graph I'm
showing you the normalized message overhead on the Y axis, and on the X axis
I'm showing you the ratio of our budget delta to the (inaudible), so think of it again:
if you have delta as plus or minus three times the standard deviation you can filter
99 percent of the input updates. And again, here lower numbers mean better,
right.
So first of all, compared to uniform allocation, which gives equal filter
widths to all the nodes, we reduce overhead by an order of magnitude. And
compared to adaptive filters, which is the state of the art in this area, we reduce
overhead by an order of magnitude and even beyond, because what adaptive
filters does is periodically redistribute these budgets, and hence it spends
messages on useless adjustments, whereas in our case we only redistribute if
benefits outweigh costs. Okay.
To see if self-tuning approximates the ideal, we use a uniform workload. That
means all the data --
>>: (Inaudible).
>> Navendu Jain: I'm sorry?
>>: (Inaudible) uniform better than adaptive.
>>: Because the (inaudible) --
>>: The adaptive blew a budget on.
(Brief talking over.)
>>: Even though it was the very unbalanced.
>> Navendu Jain: Right. And actually I'm going to show you the real workloads
are actually very, very skewed.
>>: (Inaudible).
>> Navendu Jain: So for a uniform workload, the optimal policy is to give
these filter widths uniformly. So again, our self-tuning solution
approximates the uniform allocation. We are sometimes slightly better and we are
sometimes slightly worse, because the uniform allocation is the best policy here;
it is not necessarily the optimum online solution.
And again, compared to adaptive filters we reduce overhead by several orders of
magnitude. So I'm going to quickly show you some results from the implementation.
So I've implemented a prototype of PRISM on top of the aggregation system, and
we use FreePastry as the underlying DHT. And for this particular evaluation I'm
going to perform a distributed query that finds the top hundred
destination IPs receiving the highest aggregate traffic from all nodes in the
system. And the input workload here is taken from (inaudible).
So before I show you the results, let me quickly give you a quick overview of the input
data distribution. So to process this workload, we need to handle about 80,000
flow attributes that send roughly 25 million updates in an hour. So a quick
(inaudible) would tell you that a centralized system needs to process about 7,000
updates per second, in terms of bandwidth and processing. And further, the
distribution is heavy-tailed. So if you look at the bytes distribution, then 60 percent
of the flows send less than a kilobyte of traffic and 99 percent of these flows send
less than 400 kilobytes of traffic.
However, the distribution has a heavy tail, and the maximum flow is more than
200 megabytes of traffic. And you see similar patterns in the packets
distribution, right. So now the key challenge here is that for doing self-tuning of these
budgets, we need to manage or self-tune budgets for these tens of thousands to
millions of attributes.
So now, tying it back to our running example: when 99 percent of these
flows send less than, say, half a megabyte of traffic, and if you take an error
budget that is even, say, one percent or 0.1 percent of the maximum flow, then we
can filter more than 99 percent of these small flows. So what I'm calling the
mice flows would send very few updates compared to the elephant flows, which
are really large and which are the actual flows we are really interested in, the top
hundred heavy hitters. Yes, please.
>>: How do you get the maximum flow?
>> Navendu Jain: So there are several techniques. You can actually
start up the system by essentially computing an estimate, right. The system
allows you to actually update the error value as the system progresses;
essentially, as you're running, you can input new values at any
point in time. And you can think of the budget in absolute or relative terms.
In relative terms you can always say, within one percent of the maximum.
In absolute terms you take, say, one percent of, say, a hundred
megabytes per second, and you compute the value, and if it turns out to be
something different, then you can feed the new value back into the system.
>>: Right. So this actually brings us back to a question that I had earlier, which
is did it make sense to build aggregation trees (inaudible) as opposed to building
an aggregation tree over the 800 hosts that you have where each host sends off
its hundred, its 100 best things and then it just -- I mean, it doesn't give you the
authorization (inaudible) but as an engineering solution it might actually work
better.
>> Navendu Jain: Actually, my claim is the correctness actually gets hurt;
correctness is violated even in your solution, because the global top hundred actually
might not be in the top hundred of each individual node, right.
>>: It must be.
>> Navendu Jain: Not necessarily. You can actually have small values at each
of them.
(Brief talking over.)
>>: (Inaudible). Kids school different.
>> Navendu Jain: So this graph shows the results. So compare to a centralized
system, which incurs the cost of -- these are the absolute numbers -- about 7,000
messages per second. If you take an error budget of even five percent of the
maximum flow, we can reduce this monitoring overhead by an order of
magnitude, and by using a key optimization you get another order of magnitude
improvement. So I'm not going to go into details of this optimization. The punch
line here is that by doing this self-tuning of the error budget we can reduce the
monitoring overhead by several orders of magnitude, which is really important for
scalability. So that's the key point. Yes, please.
>>: (Inaudible) on identifying the most popular anything, the most popular million
URLs or the most popular IP addresses, how does this compare to all those
others?
>> Navendu Jain: So, again, see, this is (inaudible) popular in the databases community and
the networking community as well. Essentially the idea there has been -- the
most common notion is essentially doing this top hundred at
each individual site, and the idea essentially is to then take the aggregate of that at a
central point; essentially think of a (inaudible) tree. So again, in
this system essentially we assume a completely distributed
environment, where essentially the top hundred at each local site actually doesn't
suffice.
>>: I guess for our network sites that we have here installed, I could do the top
hundred heavy hitters and (inaudible) right now.
>> Navendu Jain: Okay. So I guess we can (inaudible) in more detail, but
essentially the idea is that we are building this in a completely distributed
environment, so the top hundred is globally aggregated across all the systems. But if
there's published work I'd be interested in it.
>>: (Inaudible).
>> Navendu Jain: Okay. So to summarize the contributions of AI, we give you a flexible mechanism in which you can take a global error budget and divide it in any manner across the nodes in the system, and adaptive algorithms that perform self-tuning of these budgets to minimize the monitoring overhead. Our solution has two key ideas:
we estimate the optimal setting based on the variability in the workload itself, and we only redistribute the budget if the benefit exceeds the cost. Our analysis shows that our solution is close to the optimal online algorithm. Using experiments we show that we can get a significant reduction in the monitoring load by using a very small AI, an AI of one percent or five percent, which gives you several orders of magnitude reduction in the monitoring overhead. And this is in fact the case for real workloads. If you really think about it, if your distribution is really uniform, then you don't really need this sort of fancy optimization, but real-world workloads aren't uniform.
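A minimal sketch, not taken from the talk, of the two AI ideas just summarized: size each node's local error budget from its observed variability, and redistribute the global budget only if the estimated benefit (messages saved) exceeds the cost of pushing new budgets out. The variability-proportional split and the redistribution cost constant are illustrative assumptions.

def propose_budgets(global_budget, variability):
    """Split the global AI budget in proportion to each node's variability."""
    total = sum(variability.values()) or 1.0
    return {node: global_budget * v / total for node, v in variability.items()}

def estimated_savings(old_budgets, new_budgets, update_rate):
    """Rough count of messages/sec saved where a wider filter suppresses more updates."""
    saved = 0.0
    for node, new_b in new_budgets.items():
        old_b = old_budgets.get(node, 0.0)
        if new_b > old_b:                      # wider filter -> fewer reports sent up
            saved += update_rate.get(node, 0.0) * (1.0 - old_b / new_b)
    return saved

REDISTRIBUTION_COST = 50.0  # assumed cost (in messages) of sending new budgets to nodes

def maybe_redistribute(global_budget, old_budgets, variability, update_rate):
    """Redistribute only when the estimated benefit exceeds the redistribution cost."""
    new_budgets = propose_budgets(global_budget, variability)
    if estimated_savings(old_budgets, new_budgets, update_rate) > REDISTRIBUTION_COST:
        return new_budgets                     # benefit exceeds cost: push the new split
    return old_budgets                         # otherwise keep the current split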
So I want to very quickly touch on the second dimension of precision, which is temporal imprecision. And again here I'm assuming that nodes in the network are reliable and links have small (inaudible); later I'll describe how NI handles the cases when these assumptions do not hold.
So the key point I want to make from this part of the talk is that by combining AI and TI we can get a significant reduction in the (inaudible), which is really important for scaleability. Using AI and TI we have built an application, currently running on about 500 PlanetLab nodes, which detects the top 100 destination IPs that receive the highest traffic from the PlanetLab nodes.
So in this graph I'm showing you the results from this application, where the Y axis shows the message overhead and the X axis shows the temporal, or TI, budget, and these different lines show the different AI settings. And again here lower numbers are better.
Okay. So if you compare going from an AI of zero to an AI of one percent, we get a reduction in monitoring overhead by a factor of 30. And you get another order of magnitude load reduction by going from an AI of one percent to an AI of 20 percent, so essentially we are getting these orders of magnitude reductions in the monitoring overhead by using a small AI error tolerance.
Similarly, using TI, if you go from 15 seconds to 60 seconds you get roughly an order of magnitude reduction. For this application, having a TI of 60 seconds is reasonably good and having an AI of 10 percent is reasonably good, because we are interested in the top heavy hitters list. So by combining AI and TI we are getting several orders of magnitude reduction in the monitoring overhead.
Another advantage of combining AI and TI is that we get highly responsive monitoring. So, for example, for approximately the same cost as an AI of one percent and a TI of five minutes, we can give you 20 times more responsive monitoring at an AI of 10 percent and a TI of 15 seconds. So this AI and TI combination is really powerful, as it gives us several orders of magnitude reduction in the monitoring overhead as well as highly responsive monitoring. Okay.
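A minimal sketch, under assumptions, of how a node-local filter might combine an AI budget (suppress updates that stay within plus or minus the budget of the last reported value) with a TI budget (never go longer than the TI window without reporting). The names and structure are illustrative, not PRISM's implementation.

import time

class AiTiFilter:
    def __init__(self, ai_budget, ti_seconds):
        self.ai_budget = ai_budget
        self.ti_seconds = ti_seconds
        self.last_reported = None
        self.last_report_time = 0.0

    def should_report(self, value, now=None):
        now = time.time() if now is None else now
        stale = (now - self.last_report_time) >= self.ti_seconds
        drifted = (self.last_reported is None or
                   abs(value - self.last_reported) > self.ai_budget)
        if stale or drifted:
            self.last_reported = value
            self.last_report_time = now
            return True          # send the update up the aggregation tree
        return False             # suppress: the parent's cached range still holds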
So until now, I have talked about AI and TI, how they give us high scaleability and how they give us strong accuracy guarantees. However, this is all nice and great in an ideal world, because in practice failures can cause the system to violate these guarantees. Therefore we define a new abstraction, network imprecision, that bounds the uncertainty due to failures. Since this is a new abstraction it will take me a couple of slides to define it; let me first (inaudible) why NI is important.
NI is important for three reasons. First, it allows us to fulfill the guarantees of AI and TI; in comparison, existing systems today only give you best-effort results, and I'm going to show you that best effort can be arbitrarily bad, or can have very high error, in practice.
Second, in the presence of failures, AI and TI can actually increase your risk of errors, right. So NI characterizes how accurate the results really are.
And finally, using NI, the AI and TI mechanisms can now assume that nodes in the network are reliable, so their implementations get simplified, and NI handles the cases when these assumptions do not hold. So the key motivation for NI is that failures can cause the system to violate the AI and TI guarantees.
So in this very simple example we have a monitoring application that's returning a query result of 120 requests per second, and note that we are essentially using aggregation trees to compute the global result, and here the aggregate values of the subtrees are being cached at the parent, right. We're doing this to minimize the cost, because if we don't get any update then we guarantee that these subtree values lie in that range. Right.
So given these AI and TI requirements we're getting this answer, and we're caching the subtree aggregate values, right.
So now when you get this result in the presence of failures, what does it really mean? Does it mean that the load is between 100 and 130, or is the load actually much higher but there was a disruption which prevented the new update from reaching the root? In that case the root node is still using the cached value of the subtree, thereby reporting you an incorrect answer.
Or the load could be much smaller, but a subtree moved to a new parent, so a reconfiguration caused the root to count the subtree's value twice: it's caching the subtree's value here as well as the aggregate value being provided by its right subtree. So here reconfiguration causes double counting.
So earlier we made the claim that we can give you strong accuracy guarantees without failures, but in practice, when failures happen, what guarantees can we offer?
To see how bad it is in practice, we built and deployed this application on PlanetLab, where we're keeping track of each experiment's global resource usage on all 800 PlanetLab machines.
On the Y axis here I'm showing you the CDF of answers; on the X axis I have the difference of the two, so essentially we are taking the difference of the reported value from the true value, and you get the true value by doing offline processing of the logs. Okay. So two things to note here are that half of your reports are off by at least 30 percent from the true values, and about 20 percent, one-fifth of the reports, are off by roughly more than 70 percent.
So now when you get an answer, where does the answer really lie? Does it actually lie here, or near it, or actually even beyond?
So this best effort can be arbitrarily bad in practice, okay. So what's the solution? Do we just give up on any accuracy guarantees in the presence of failures? Do you have an answer or do you have a question?
>>: I definitely don't understand what the point of comparison was again; what does the best-effort PRISM do?
>> Navendu Jain: So the point of comparison essentially is that we're taking the true value; this is the oracle value. Essentially we log everything, and when we get an answer we do offline parsing of the logs to compute what the answer should have been versus what the system is telling us.
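A minimal sketch of the evaluation methodology just described: compare the answers the live system reported against an oracle recomputed offline from the full logs, and look at the distribution (CDF) of the relative differences. The data layout is an assumption for illustration.

def relative_errors(reported, oracle):
    """reported/oracle: dicts keyed by timestamp -> aggregate value."""
    errs = []
    for t, true_val in oracle.items():
        if t in reported and true_val != 0:
            errs.append(abs(reported[t] - true_val) / abs(true_val))
    return sorted(errs)

def cdf_point(errs, fraction):
    """Error bound that the given fraction of reports stays within."""
    return errs[int(fraction * (len(errs) - 1))] if errs else None

# e.g. cdf_point(errs, 0.5) around 0.30 would correspond to "half the reports
# are off by at least 30 percent" in the best-effort experiment above.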
>>: And the system, what is the system that you were saying?
>> Navendu Jain: This is our PRISM system which is deployed on the
PlanetLab.
>>: (Inaudible).
>> Navendu Jain: I'm sorry?
>>: (Inaudible) a little better but it was -- so you get a uniform, you get a straight
line. Correct?
>> Navendu Jain: And we actually get (inaudible) essentially these answers got
(inaudible) by a lot.
>>: Depends on whether they're relative errors or not.
>>: Right. But you're (inaudible) so their relative errors, right, in this graph? So
if I just made up --
>>: If you made up random numbers between zero and hundred and the truth
was always one you would be off by more than 50 percent.
>>: Well, this hundred obviously has a special meaning, right, because it's a
coincidence that it happens to arrive at 100 or what he's doing is he's putting the
smaller number always (inaudible).
>> Navendu Jain: Right. So essentially it would be normalized. (Inaudible).
>>: So I mean, what Bill's saying if I make up a number between --
>>: Yup, yup.
>>: Okay.
>> Navendu Jain: So what's the solution? Do we just give up any accuracy guarantees in the presence of failures? Sounds great. The real thing to understand is that we have to accept that our systems are unreliable, and therefore we cannot guarantee to always give the right answer. So instead of always guaranteeing to give the right answer, our idea here is essentially to quantify the stability of the system when an answer is reported, right. So go back to our intuition of this stable, good-or-bad flag: when you get an answer you get this good-or-bad stable bit.
The bit says that the system is stable and you can trust the accuracy of the reported answer; otherwise our AI and TI bounds may not hold and you cannot trust the accuracy of your answer. Right. So it would be great if I could always give you a green light that, yes, you can trust this answer, versus a red light that, no, you cannot trust the answer; however, in reality large-scale systems are never really 100 percent stable.
>>: (Inaudible).
>> Navendu Jain: Yes, precisely. Thank you. So large-scale systems are never really 100 percent stable, so we quantify how stable the system is when an answer is computed, okay. And the way we quantify system stability is using three simple metrics: N all, N reachable and N dupe. N all is simply the number of live nodes in the system, okay. N reachable gives you a bound on the number of nodes that are meeting the TI guarantees, that means what part of your network is reachable, that is, from what part of the network you are getting recent updates from the nodes.
>>: How can you tell? You just put a lot of effort into not sending updates. You put all this effort into never sending messages, and so it's theoretically impossible to differentiate between a broken network link and messages I suppressed because you did such a great job of allocating the budget, so how can you --
>> Navendu Jain: I'll come back to that (inaudible), give me a couple of slides. Thanks.
>> Jim Larus: I know you guys like this (inaudible) lecture, but we've got like 15
minutes left so maybe we let him get through his talk.
>> Navendu Jain: But thank you very much for all the enthusiastic questions.
Thank you.
So the three metrics we use to quantify system stability are N all, N reachable and N dupe. As I said, N all is the number of live nodes in the system, N reachable gives you a bound on the number of nodes whose recent inputs are being used in the reported answer, and N dupe gives you a bound on the number of nodes whose inputs may be doubly counted, right. So these three metrics together characterize the accuracy of (inaudible).
So in the example on the left, when 99 percent of your network is reachable, that means the answer you're getting reflects recent updates from 99 percent of the nodes in the system and zero inputs have been doubly counted, so it is highly likely that the answer reflects the true state of the system. Right.
But if you compare the example on the right, only 10 percent of the nodes are reachable and half of the inputs may be doubly counted, which means the system may be reporting you either a highly stale answer or one that overcounts the inputs of the nodes, right, and therefore you cannot trust it.
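A minimal sketch of how a client might use the three NI metrics to decide whether to trust a reported answer. The thresholds are illustrative assumptions; the talk only says that a small NI means you can trust the answer and a large NI means you cannot.

def ni_quality(n_all, n_reachable, n_dupe,
               min_reachable_frac=0.9, max_dupe_frac=0.0):
    """Return True if the answer reflects enough fresh, non-duplicated inputs."""
    reachable_frac = n_reachable / n_all if n_all else 0.0
    dupe_frac = n_dupe / n_all if n_all else 1.0
    return reachable_frac >= min_reachable_frac and dupe_frac <= max_dupe_frac

# The example on the left:  ni_quality(100, 99, 0)  -> True  (trust the answer)
# The example on the right: ni_quality(100, 10, 50) -> False (stale or overcounted)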
So tying it to the (inaudible) examples, when the right subtree is disconnected from the node, N reachable would indicate that only 40 percent of your network is reachable, right, because a large subtree has been disconnected, and therefore the answer you are getting may be highly stale, because a parent may be caching the disconnected subtree's value, and therefore you cannot trust this highly stale answer.
And similarly, when a subtree reconfigures, or when a subtree joins a new parent, even though all of your network is still reachable, meaning I'm getting recent updates from all my nodes, half of these inputs may be doubly counted; therefore the answer you're getting may have the input contributions of nodes multiply counted, so the answer might be overcounted and you shouldn't trust it, right. So even though we cannot guarantee to always give the right answer, NI is still useful because it's now telling us how these disruptions are affecting the accuracy of the global result.
And we use these metrics to characterize the state of PlanetLab. This data is actually a couple of years old, but the graph I generated for the recent deadline doesn't look a whole lot different. So here what we see is that you get lots of disruptions on the PlanetLab nodes even though you have very few physical failures. Out of a hundred nodes only about five percent have actually failed, but you still get lots of disruptions. This supports our claim that real-world systems are not really a hundred percent stable, they're always in the yellow light, and these disruptions have a big impact on the accuracy of your answers.
Okay. So now we have seen that NI is useful to characterize the accuracy. Can we actually use it to improve this accuracy? A simple technique that we use is NI filtering. This is part of a series of techniques, but I'm going to show you results from just one technique today. In this NI-based filtering we mark an answer's quality as good or bad based on the number of unreachable and doubly counted inputs, okay. And for simplicity I'm condensing my three NI metrics into a single number, so a small NI means things are perfect and a large NI means things are imperfect, there's too much churn in the system. And these different lines show the different NI thresholds.
So compare to the best-effort version of our system, where 80 percent of reports can be off by as much as 65 percent: by using this very simple technique, we can guarantee that 80 percent of the reports have at most 15 percent error. A further benefit is that when we get an answer that's tagged with NI, I know whether my answer lies in this range or this range or this range.
Whereas with best effort, we give you an answer and I say, well, it could be here or could be here or could be somewhere else, I don't know.
So by using NI, I'm able both to characterize the accuracy and to improve it.
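A minimal sketch of NI-based filtering as just described: condense the NI metrics into one score and drop reports whose score exceeds a threshold, so the reports you keep carry a bounded fraction of unreachable or doubly counted inputs. The scoring formula is an assumed simplification, not PRISM's exact definition.

def ni_score(n_all, n_reachable, n_dupe):
    """0.0 means perfectly stable; larger means more disrupted."""
    if n_all == 0:
        return 1.0
    return (n_all - n_reachable + n_dupe) / n_all

def filter_reports(reports, threshold=0.1):
    """Keep reports tagged with small NI; reports are (value, n_all, n_reachable, n_dupe)."""
    return [(value, ni_score(n_all, n_reach, n_dupe))
            for (value, n_all, n_reach, n_dupe) in reports
            if ni_score(n_all, n_reach, n_dupe) <= threshold]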
So now we see that NI is useful, but can it be computed efficiently? The good news is yes, these NI metrics are conceptually simple to implement (inaudible). N all is the number of nodes in the system, N reachable is the number of nodes whose (inaudible) are being used, and N dupe is the number of nodes whose inputs may be doubly counted.
However, they are difficult to implement efficiently. As you rightly pointed out, there's a big scaleability challenge. In particular, the big challenges are that we need to compute NI for each aggregation tree in our system, and secondly we require active probing of each parent-child edge in each aggregation tree. Okay.
So the first challenge here is that we need to compute NI for each aggregation tree in the system. The reason we need to do that is that a failure can have different effects on different trees. For example, here, the failure of this node only disconnects a leaf from this aggregation tree, right; however, in a different tree the failure of the same node disconnects an entire subtree.
So since a failure affects different trees differently, you need to quantify its impact individually for each aggregation tree. That's the first challenge in computing these NI metrics scaleably.
The second challenge in computing these NI metrics scaleably is that we need to perform active probing of each parent-child edge in each aggregation tree, and we need to do that to satisfy the TI guarantees. A naive way of doing this requires on the order of N messages per node per second, and for a thousand-node system this comes to about a hundred messages per node per second. Okay.
So what's the solution? To address these challenges we are going to use a particular property of our DHT system, namely that the forest of DHT trees forms an approximate butterfly network.
In particular, I'm showing you the butterfly network for a 16-node system, and this butterfly network encodes all the aggregation trees in the system. So, for example, an aggregation tree like this, and another aggregation tree like that.
So for scaleability, our key idea is to reuse common calculations across different trees; these common calculations are for computing the NI metrics. In particular, each node in this butterfly network is part of two trees: an aggregation tree of children underlying it that computes an aggregate value, and a tree of parents above it that depend on this underlying tree as input, right.
So now the idea for scaleability here is that rather than recomputing the aggregate value of this blue tree separately for each parent in the red tree, I'm going to compute it only once, and then forward that value to each parent in the red tree, and we do this for the entire butterfly network in our DHT system, right? And by doing this we reduce the cost from order N to order log N messages per node per second.
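A minimal sketch of this reuse idea, not the real DHT code: a node computes the aggregate of its underlying subtree once, then forwards that single value to every parent that uses this subtree as input, instead of recomputing and re-probing the subtree separately for each aggregation tree. The aggregation function (sum) and the transport primitive are assumptions.

def aggregate_once_and_forward(child_values, parents, send):
    """child_values: inputs from the subtree below this butterfly node.
    parents: the nodes above that consume this subtree's aggregate.
    send(parent, value): assumed transport primitive."""
    subtree_aggregate = sum(child_values)      # computed exactly once
    for parent in parents:                     # reused by every tree above it
        send(parent, subtree_aggregate)
    return subtree_aggregate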
And for a thousand-node system this only requires about five messages per node per second. Okay. We verified this experimentally: as you increase the number of nodes, the naive per-tree cost grows linearly, whereas using the dual-tree approach this cost grows logarithmically, and for a 1,000-node system we reduce it from about a hundred messages per node per second to only about five messages per node per second. Okay. Yes, please.
>>: (Inaudible).
>> Navendu Jain: Okay. We also have a one-on-one later, so I can answer that then. Okay.
So to summarize the contributions of NI: NI addresses the fundamental challenge that failures can cause the system to violate our AI and TI guarantees. Since we cannot guarantee to always give the right answer, our key idea is to quantify the stability of the system when an answer is computed, right. We generalize this notion of a stable bit with the N all, N reachable and N dupe metrics.
And our system provides a scalable implementation of these metrics by reducing the cost from order N messages to only about order log N messages per node per second. And by using NI we can improve the accuracy of the monitoring results by up to an order of magnitude for the workloads we consider. Okay.
So in this talk I've talked about a bunch of things: how we use AI and TI to achieve high scaleability, and how we use NI to ensure correctness of results. So let me tie this back to the bigger picture of my work. The big goal is to do scalable monitoring, and in order to do that we face two big challenges: scaleability to large systems and ensuring accuracy despite failures.
To address these challenges our key idea is to define precision as a new, unified abstraction. This abstraction has good properties: it combines the known ideas of AI and TI in a sensible way, and it gives you big scaleability benefits. And NI is a fundamentally new abstraction that enables AI and TI to be guaranteed in practice and greatly simplifies their implementation, and (inaudible) provides a scaleable implementation of these metrics. For AI we provide a self-tuning, near-optimal solution to adjust these filter widths based on dynamic workloads, and for NI, by exploiting the symmetry of our dual-tree butterfly network, we can reduce the implementation overhead by several orders of magnitude. Okay.
Now, apart from PRISM, I've worked on several other projects and built other
systems that I haven't covered in this talk, but I'll be happy to talk about any of
them offline. That will conclude my talk, and happy to answer any questions.
Yes, please.
>>: So when you talk about NI, it seems like this: I have this value, but don't trust me because the system is unstable at the moment, right? But the thing is that the interesting events that we're going to monitor may be correlated with system instability, so all the important things are happening especially in that timeframe, which (inaudible) the value of this NI.
>> Navendu Jain: So exactly. The point being that when you're detecting some sort of anomaly in the system, you are detecting that anomaly based on, say, the query answer; an anomaly might be that there's been a lot of traffic from node A to node B, which is not (inaudible), but essentially a huge chunk, so the point is how do you actually get the correct or accurate value of that.
So one aspect of NI is to characterize when the system is unstable. The other aspect, which I briefly touched on and can give you more flavor of very quickly, is the notion of how you improve this accuracy, right.
So one technique I talked about was NI filtering. Another simple technique we use is redundancy to improve the accuracy. The idea here is simple: when we are computing an aggregation tree, instead of using one tree we're going to use K such trees, and you are going to pick the best result. Practice shows, and there's also some theoretical analysis, that if you use a very small K, four or five trees, you actually improve the accuracy by a lot. Right. So essentially what we're really doing is computing an aggregate value across multiple trees and picking the best result, okay.
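A minimal sketch of this redundancy idea: run the same aggregation over K different trees and keep the result whose NI indicates the least disruption. Choosing the "best" result as the one with the smallest NI score is an assumption consistent with the description; the talk just says to pick the best result.

def best_of_k(results):
    """results: list of (value, ni_score) pairs, one per aggregation tree."""
    return min(results, key=lambda r: r[1])    # smallest NI -> most trustworthy result

# e.g. best_of_k([(120, 0.4), (95, 0.05), (310, 0.6)]) -> (95, 0.05)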
And this graph shows the improvement: just using the multiple-tree aggregation you can improve the accuracy by 5X, and when you combine the multiple trees with the filtering you can improve the accuracy by 10X. So essentially, yes, we are approaching the problem that you just mentioned by trying to improve the accuracy as much as possible when there are reconfigurations. That's one plausible way to know if there are any problems in the underlying system. Does that answer your question? Okay. Yes, please.
Yes, please.
>>: So another way to cut the data rate would be to do statistical sampling of
the -- I mean, that's the (inaudible) do you throw that in, too.
>> Navendu Jain: Yes, so essentially I want to make the claim that sampling is actually kind of plug-and-play in the system, because when you're getting inputs from the underlying system, right, they already have some error, say a sampling error of one (inaudible), depending on what sampling you use. So our error actually becomes additive, in the sense that you now have error because of the sampled data, and on top of that you are introducing additional filtering error.
So for the system itself it's completely transparent: rather than keeping track of each and every update of your input distribution, you keep track of sampled data. Right? So when I'm updating these statistics, they are actually based on the sample.
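A minimal sketch of the plug-and-play point just made: if the inputs are themselves sampled with some error bound, that error simply composes with the AI filtering error and the aggregation layer is unchanged. The additive composition shown here is an illustrative, conservative assumption rather than a stated result.

def combined_error_bound(sampling_error, ai_budget):
    """Conservative bound on total error when AI filtering is applied to sampled inputs."""
    return sampling_error + ai_budget

# e.g. a 1% sampling error plus a 5% AI budget gives at most about 6% total error
# under this conservative additive view.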
Yes, please.
>>: An interesting question that might be worth exploring is, if you do system-wide sampling and send samples to a central location and things like that, are there competitive approaches along those lines that don't have the complexity of the trees but still end up -- in other words, it seems like there's a straightforward strong --
>> Navendu Jain: Absolutely. So that's a good point. Yes, think of this very approach where you sample, say, periodically every five minutes or so, you go to a central site and correlate everything. Yes, so that is plausible. The argument has been that now I want to do monitoring not at a scale of five minutes but at a scale of five seconds. Right.
>>: So I guess what I meant is it seems like (inaudible) if you're looking for the largest event then you can just keep making the sampling sparser and sparser; if the event is big, it's always going to stick out --
>> Navendu Jain: Well, the point is also about responsiveness, right. For the query that I mentioned, the system right now actually detects it within 15 seconds. So if there is a flow that becomes very large over a period of 15 seconds, then the system detects it very quickly, whereas for something which is more like periodic logging at a site, you now have to do this all the time. Right. So even though my local events may not be important, I have to send them anyway, and send them much more frequently as the time (inaudible), so there's a big scaleability issue in that.
>> Jim Larus: Okay. I think we all should thank our speaker.
>> Navendu Jain: Thank you.
(Applause)