>>Dave Maltz: Hi. I'm very pleased today to introduce George Varghese visiting with
us from University of California at San Diego. He's been here with us for about a week
and a half, and he'll be here for another week and a half, so if you'd like to chat with
him, please send him email and set up a time.
George has a very long background of doing very novel and innovative things: combining
algorithms and network hardware, getting tremendous performance out of systems that
people didn't think was possible. And he's going to be sharing some of those insights
with us today.
>>George Varghese: Thanks, Dave.
All right. So I call this talk The Edge, Randomized Algorithms for Network Monitoring.
And if the title sounds a little like some kind of band, don't worry, we'll try to explain what
The Edge means.
And there are two main themes in this talk. One is that hardware people are so used
to the deterministic drum of a clock that randomized algorithms are actually surprising
to them -- that they can actually be done. And so when we went to Cisco, we found that
there was a culture change that had to happen, but it can be done. And sometimes, when
it really gives you benefits, then people will consider doing it.
And the second thing is that I've spent a lot of my life trying to make networking
hardware fast, and the main functions were forwarding packets and trying to do packet
classification and lookups, but increasingly I'm beginning to think that the next
generation of problems is in monitoring and trying to figure out how well a network is
doing. So those are the two themes. And I'm going to try and tell you more.
So to explain these themes: when the internet first started, it was this very simple
network where everybody knew each other. They called up each other when there was
trouble, and they had long hair too. And the main problem was to try to make it more
flexible, try to make sure it was fast and try to make sure it was scalable.
So I guess they succeeded. It's pretty flexible. It runs all kinds of things. It's pretty big.
It's pretty fast. But it's complicated. And so the big problems now are things like
overloads. The symptoms of success, you know. It's like when people, after a certain
point they have blood pressure, and it's often overloads, attacks, and failures.
So the next -- if you look at measurement and control, surprisingly there is nothing
built into the internet for measurement and control. Even things like traceroute, which
you may have heard of, were kind of hacked in after the fact, using completely
unexpected mechanisms for those things.
And so the claim here, though, is that you would like engineered solutions, and the
current generation of router vendors are actually open to this, because it turns out
that routing has become commoditized and there is Broadcom and there's Marvell and
various others selling these chips. And so almost anybody can build a router today.
So the router vendors are actually open now to saying, okay, I'd like to add these
features that would differentiate me from everybody else. And if you look at
sophisticated systems like the Boeing 747 or bridges, today almost everybody builds in
sort of control and monitoring. So it's the natural thing to be more autonomic.
So this talk is also going to cover this other theme of using randomized algorithms
in network chips for basically doing these kinds of engineered solutions for
performance and monitoring, and since this is a very general thing, I'd like to give
you three specific instances. And I know there are theoreticians in this audience, and I
think each of these three problems can be abstracted and encapsulated very simply, and I
will try to show you the relevance to the real systems problems.
The first one is a well-studied one, but I'll go through it very quickly to illustrate
the benefits and the edge of a randomized algorithm: finding heavy bandwidth flows or
heavy hitters, a very common problem -- I think most theoreticians know this.
The second one is measuring microsecond latencies in networks, and that's a little
surprising for people who think of the internet as being a millisecond or even a second.
And the third problem is logging all infected nodes during an attack given certain
constraints like limited memory and bandwidth.
So those are three problems. And the main point, from the theoretician's point of view,
and this is where The Edge comes in, is that in each case a simple sampling scheme will
do the job. The only thing, though, is that if you add a little bit of hardware and put
in a little bit of processing in the router, you get an edge. You do something
quantifiably better. And that's what I mean by the edge.
So I'm going to try to demonstrate the edge, and at the end what is common to all of
these is I'll show you the edge over simple sampling. So wait for that. And let's just
start focusing on the problems -- but there are some rules here that we have that are
different from standard algorithms. So we need to specify the rules of the game.
So the first rule of the game is that we have comparatively small amounts of memory. If
you're used to a PC and think of the infinite amount of memory a PC has, you say, what
is the problem? Why are these guys worried? But it turns out that in routers we have to
do everything on chip, and so it's more like cache memory, which is limited at every
level of technology.
So it's not like we have ten registers. It's on-chip cache, so it is on-chip SRAM. And so
we have maybe, like, numbers like 32 megabits, which is quite a bit. But nevertheless,
it's not scaling with the number of concurrent internet flows so you can't just sort of
remember everybody. So you have to do something smarter than just remembering
everybody.
Also, we have very little processing. I like to tell people that we live in the world of
constant time complexity. Order log n is slow for us, right? So if you look at the
memories we have, a 40-byte packet has 8 nanoseconds to be processed. Even
the most aggressive on-chip memory is, like, 1 nanosecond. So you get 8 reads or
writes to memory.
Now, that's cheating a little bit because there's a certain amount of parallelism. You can
do something that's parallel. But I haven't seen chips at more than a factor of 30
parallelism. So from beginning to end you get 240 reads or writes to memory.
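Spelling out the arithmetic behind those numbers (assuming a 40 Gb/s line rate, which is an inference from the 8 ns figure rather than something stated here):

\[
t_{\text{pkt}} = \frac{40 \times 8\ \text{bits}}{40\ \text{Gb/s}} = 8\,\text{ns},
\qquad
\frac{8\,\text{ns}}{1\,\text{ns per access}} = 8\ \text{accesses},
\qquad
8 \times 30 = 240\ \text{accesses with parallelism.}
\]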
So it's a very limited canvas that you have. And within that canvas you have to do very
simple things.
Yes?
>>: I just want to ask, in general in this field, 40 bytes is the absolute worst case. Are
people actually designing hardware for 40 bytes?
>>George Varghese: Yes, they do. So you could argue that that's not the right thing.
But there are wire speed tests today -- Light Reading has a certain test where they'll
do bake-offs between Cisco and Juniper routers, and if one of them fails the wire speed
test, then it's big news. Everybody seems to have bought into this Kool-Aid. So you can
argue, but that's a whole other topic. So let's just believe this for now.
Whatever it is, there is some number that --
>>: [inaudible]
>>George Varghese: Half the packets are x, right? 50 percent of the packets are x. So
that's a good point.
Do you have a question? Okay. All right.
So let's start with a specific problem, right? Enough of the generalities. And this is a
very simple problem, the problem of what is called heavy hitters. You are a router
and you're sitting here and you're watching a stream of packets flow through you, and
the packets have some key. And let's assume it's a source address, okay?
So as they're marching to you, you see S2, S1, S5, S2, S6, S2, and if you had plenty of
memory or you just initially looked at the stream, it's easy to see that S2 is occurring
quite often compared to the rest. And so S2 is a heavy hitter, an elephant, and the rest
are mice.
Now, we know that if we had memory for all of them, a simple hash table suffices. Every
time we see a new source address, we start a new entry, and then we bump a counter
every time we see that source address again. And so at the end of the interval we see S2
is 3, and if the threshold is, say, 2, we declare S2 a heavy hitter and we're
done.
The problem is by our constraints we do not have memory for all the possible source
addresses. We could have millions of source addresses in a second interval.
So what we would like to do is somehow directly extract the heavy hitters without storing
all of them. And this is a well-known problem. Now, here's where the statistician comes
in and says, oh, that's easy, I simply sample.
Right? Depending on the size of my memory -- let's say I have 10,000 pieces of
memory and a million flows -- I could sample maybe 1 in 100 or 1 in 1,000, and with
high probability, the heavy people will float into my sample. And I can count how many
times they occur in the sample and scale up by the inverse of the sampling factor to
estimate how much they sent.
But there's a lot of variance in that estimate of the amount they sent. You don't just
want to find them; you want to find out how much they sent.
Now, we have a very, very simple scheme which we call sample and hold. Now sample
and hold says that we sample as before, as in the sampling scheme, but then once we
sample, we store it in a hash table and we watch all the packets sent after that by that
source. So think about it this way. The Gallup poll estimates Bush versus Gore
by sampling households, a thousand households, tallying Bush, Gore, Bush, Gore,
and then they make an estimate of who's likely to win.
But the problem for the Gallup poll is that the expense is the sampling, going to those
households. For us sampling is not an expense because the households are tramping
right through the polling booth, so to speak.
So we might as well, once we've sampled somebody, watch all of them. And intuitively
what happens is it's a variance reduction technique because the variance comes in only
the first time. Once it floats into the cache, we've got it exactly and it's deterministic. So
you can prove this. You can prove that the standard error is basically order 1 over M -- M
is the memory -- while in standard sampling it's 1 over square root of M. It's not hard to
believe, because variance goes as the square root, right?
And this is actually quite significant, right? So if memory is 10,000, the error is 1 over
10,000 versus 1 over 100. So that's two orders of magnitude.
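A minimal sketch of sample and hold in Python rather than hardware; the sampling probability, table size, and threshold below are illustrative values, not numbers from the talk:

import random

def sample_and_hold(packets, sample_prob=0.01, max_entries=10_000):
    # packets: iterable of source identifiers, one per packet.
    # Once a source is sampled into the table, every later packet from it
    # is counted exactly; that is where the variance reduction comes from.
    counts = {}
    for src in packets:
        if src in counts:
            counts[src] += 1                     # already held: exact count
        elif len(counts) < max_entries and random.random() < sample_prob:
            counts[src] = 1                      # sampled in: start holding
    return counts

# Example: S2 dominates the stream, so it almost surely gets caught early.
stream = ["S2", "S1", "S5", "S2", "S6", "S2"] * 1000
heavy = {s: c for s, c in sample_and_hold(stream).items() if c > 500}
print(heavy)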
Now, [inaudible] Prabhakar at Stanford and some students found a simple
improvement of this, which we actually find in the Cisco chip, where he
basically noticed that with sample and hold, the small guys come in and they pollute the
cache. So all we need to do is recycle the small guys and get rid of them, and so he
calls that elephant traps, which is basically sample and hold plus a recycling scheme.
So, again, the point here is the theme. Why is it important to find heavy hitters?
Because a lot of people would like to figure out when attacks happen, and attacks are
often evidenced by people who send a significantly larger traffic footprint than the
rest. So that's why it's useful. And there are many other reasons. You would like to do
traffic engineering for the very largest flows, because 80 percent of the traffic is
dominated by them -- so that's a systems issue.
And it's not easy to do -- it's easy to do if you have infinite memory, but you don't have
it. Simple sampling will work, but you can do better by adding a little bit of memory and
processing. So this illustrates our theme, and now let's get to the first of the
two bigger topics, which is measuring microsecond latency.
All right. So basically when you tell people that people now care in the internet about
measuring latencies down to microseconds, nobody can believe it because you say,
look, my email took me ten seconds! What are you talking, microseconds?
Well, in fact, the first wave of latency-critical applications appeared about 10 years ago,
and they were things like voice over IP, IPTV and gaming. Gamers care.
But if you look hard at the numbers, most people believe that as long as you get within
two hundred milliseconds of latency, this whole class of applications is perfectly happy. So
why microseconds? Well, you may not believe it, but there's at least a sense that they're
not the drivers for microsecond latency.
In all of these cases, it's a human being that's reacting to a message. And you can
imagine how can a human being tell the difference between a millisecond and a
microsecond? That's impossible.
The fundamental shift occurs when machines are sending messages to
machines. And they can tell the difference. So the classic example, somewhat
discredited after the crash, is automated financial programs, right? So why do automated
financial programs care? The Dow Jones feed is giving the feed of stock prices, and
you have two bankers, Merrill Lynch and, let's say, Smith Barney, and one of them has
a lower latency link by maybe 100 microseconds.
Because it's a program, they can go ahead and see a cheap stock, bid up that stock
before the other guy gets to it, and so there's arbitrage, and they're very conscious of that.
You could look at the stock exchanges, look at all the financial press, and they talk about
microsecond traders and edges and all kinds of stuff. So this is very serious. I mean,
the traders do care. In fact, the biggest market for detecting low latency is in these
automated traders -- in the stock exchange and similar verticals; Cisco has verticals. So they
are actually talking about less than 100 microseconds of latency. They care.
>>: [inaudible]
>>George Varghese: It depends.
>>: [inaudible]
>>George Varghese: Right. The average holding time is 11 seconds, because they do a lot of
short-term trading, which is bad for the overall economy, but nevertheless, let's keep
that aside [laughter].
And, also, they want very small loss, because it turns out that if you lose one packet,
the transport protocols tend to have 200 millisecond timeouts, and
that's a disaster from the point of view of all of these people.
But to me the more fundamental reason is high-performance computing. In
high-performance computing, computers are working together with other computers in
clusters to do large parallel jobs. That would be true at FedEx, or route discovery --
Southwest might be finding routes. So the big Fortune 500 companies are doing these
things.
Now, a cluster cares because every time you wait 100 microseconds, a computer is
wasting so many instructions that it could have used to do the job faster. That is
fundamental.
Now, hitherto, these two interconnects have been separate. The computers have
had their own interconnects, like InfiniBand and Fibre Channel, and the internet has
had Ethernet. But Ethernet has been fundamentally cheaper, so there's been a huge
shift in the last two years to try to put all of this on top of Ethernet. But then Ethernet
needs to have low latency and low loss. And, of course, Microsoft knows that, and so
the group I'm working with is learning tremendously about all these things.
But either way, if you look at this, you really at least need to measure to know whether
you're getting low latency or low loss. So with all of that motivation, let's define -- it turns
out -- I don't want to talk about these things except that, for those of you who know
networking, there are things like SNMP and NetFlow, but they're too coarse. They have
millisecond or second resolution, and you can't store all the records -- you often lose
a lot of them in NetFlow -- so there's no really good solution.
All right. Now, what people do today is they go ahead and send these messages
periodically as test messages -- they call them probes -- from one end to another and
see how long it takes, and they time stamp it, and that's their sort of external check that
things are working well.
But you can't send these test messages every microsecond. You can send them every
millisecond, because they take a lot of bandwidth. And then if you want to find out where
the problem is, you have to do what is known as a join. You have to say, hey, I've sent
these two messages, and both of them are seeing latency, so possibly in the intersection
of the two paths there is a bottleneck. Now, what we're going to try to do is invert this
whole process and directly engineer it, as opposed to treating the network as a black
box.
So our model is basically going to be that every router in the world adds a little bit of
hardware to every link in the path so that it can measure the latency on every link.
So now -- first of all, if there's a problem somewhere in the middle of the network, you
directly know the answer. You don't have to infer the answer from a black box.
Secondly, if you want a path metric, you just add up the links. Right? Okay. So
basically in order to do that, what I want to do in this part of the talk is I want to show
you a simple data structure that can help you find latencies on a link, and it's
randomized or hash-based, and we're going to calculate loss, delay, average, and
variance.
And so in order to do this I have to explain and abstract the problem. So I'll give you a
model, I'll show you why simple data structures will not work, and then I'll explain the
algorithm.
All right. So here is -- what we're going to do is we're going to segment the link. And,
remember, this is between two routers. Or it may not be between two routers. And,
actually, the more common model -- and this is important. I haven't drawn it. So you
have two things, a router and a link. The links are often very good. They hardly have
any delays and they hardly have any loss. Most of the loss is between the input port of
the router and the output port of a router because there are queues and fabrics and
stuff.
So when I think a link, this is the more important link I'm talking about. The link between
the input port of a router and the output port of a router. So don't think of the fiberoptic
link between routers, because that has almost no delay and almost no loss. So be
aware that that's where most of the loss is.
So that's what I mean exactly by a link, but it looks like a link, so don't get confused.
So you have three packets; a black, a gray and a white. And so packets always travel
unidirectionally, so you have to do it in two directions separately to find the delays on
both links. And you put a divide time into equal bins and we typically use some
measurement interval. So the packets go across, and the idea is that both S and R are
going to maintain some state.
So the sender maintains DS and the receiver maintains DR. And at the end of the
interval S transfers its state to -- and so -- and then the receiver computes the measure
based on some function of DS and DR. Okay? Is that a simple model? Easy to
understand?
Okay. So we're going to make a couple of assumptions. So we're going to assume the
link is FIFO, first in, first out. And we can extend it. Now, within routers you actually
sometimes do stripe packets, but you often stripe it internally, and before they come out
you resequence because TCP doesn't like it. So if you do it in certain points, you can
make this assumption.
We're going to assume that clocks are synchronized. That is, if this guy says 12
o'clock, then this guy is going to say 12 o'clock to within a microsecond. Now, it wasn't
true a few years ago. Increasingly routers are doing this. At least the Cisco Nexus
does this, and it's microsecond synchronization.
Now, across links, between this and this end, it's a little harder because if you use
something called NTP it has millisecond -- you say what is this? Microseconds? But
there is a new protocol called 1588 that is hardware based that directly gets time
stamps in hardware and bypasses a lot of the software which is doing this too. So
some of these assumptions are becoming true, and we're going to assume that a little
bit of hardware can be put into routers.
And the fifth assumption is very important. If you could put a time stamp in every
packet, the problem would be trivial. All you do is record the sent time at the input port,
and the receiver takes a look at the sent time and looks at its current time. The trouble is
that there is no such time stamp in the packet, right? And even if you added it, it's
essentially linear overhead. You're adding this thing to every packet. We're going to try
to do better than linear overhead.
So let's start by taking this --
>>: Wait, George --
>>George Varghese: Yes. Go ahead.
>>: If I had time sync across all the routers, then I could sample -- say I'll put this extra
header in occasionally, or I'll put a pacing packet in occasionally --
>>George Varghese: You could do that too. But that's sampling. So let's talk about the
[inaudible]. Both of them will come to the same thing. Either you put it on
everything or you sample. And both of those are the existing work, and I'm going to
tell you why it's not as good. Right? Okay.
So we're going to assume very little high-speed memory. We're also going to assume
that processing -- but the other assumption is that you can't really send a lot of
messages between S and R. Because if you could, you could just send arbitrary
amount of information. And so routers don't generally like to send a lot of control
information between this port and this port. So you can send a message every
millisecond but not every microsecond.
So just to get the numbers right, if you consider a 10-gigabit-per-second link, in one
second you've basically -- you can get 5 million packets. So the numbers are large.
You can't afford to keep 5 million pieces of memory. And you can assume there's one
control packet per millisecond, possibly even per second. Okay.
So let's start with the simplest scheme, which is computing loss. And loss is so
trivial that you wonder why these router vendors don't do this. And, actually, I don't know
the answer.
So, for example, if you have three packets and one of them is lost, it's easy. You simply
keep a counter at the sender how much you sent, a counter at the receiver how much
you received, and at the end of the interval S sends 3 to the other end, the guy
subtracts and says 1 out of 3 was lost. Big deal. Okay?
So why can't you do that with latency? Well, you kind of could, but the simplest way
would be you simply keep time stamps at the sender -- remember, there's no time
stamp in the packet. So you keep a time stamp at the sender, 10, right? And it was
received at 23. The second packet was sent at 12, received at 26. Remember, they're
synchronized. And 15 and 35. And at the end of the interval you ship the entire
truckload to the other end, subtract and divide, and that's your average.
But this is 3, but the real number is 5 million. So as soon as you see 5 million you say,
ah, well, maybe not. All right. And that's quite a lot of overhead.
So it is high, and remember that's a time stamp per packet -- and within a router you could
add another time stamp per packet, but it's still similar overhead. You're adding,
like, a time stamp, which is quite large, like 4 bytes or 8 bytes per packet, and for
40-byte packets that is quite a lot of overhead even if you're doing it per router. So you
would like to do better than this.
All right. Now, obviously, as I told you at the beginning, there is a simple sampling
solution to everything. So obviously you could sample. So don't store all the time
stamps, but let's say you sample the first and the third. So the first one is 10 and 23,
and the third one is 15 and 35, and you have to have a reasonable way to sample.
But that's easy. You can do content-based sampling.
And now you take the two samples, send them to the other end, match them up
with the corresponding received time stamps, subtract, divide by 2.
So what's the problem with sampling? Typically, to keep the numbers reasonable,
you want something like 1 in 100 samples. And the number of samples is quite small in the
end -- you get about 250 to make your memory reasonable.
Now, the error reduces as the square root of the sample size, so when you're trying to
get to microsecond detection, these small sample sizes do matter. You would ideally
like all the samples if you could.
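The reason those sample sizes hurt, in symbols: the standard error of a sampled mean falls only as the square root of the number of samples,

\[
\text{error} \approx \frac{\sigma}{\sqrt{n}},
\]

so roughly 250 probe samples versus the 5 million packets on the link is a factor of about \(\sqrt{5{,}000{,}000 / 250} \approx 140\) in error.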
Go ahead.
>>: Are you assuming that there's no loss from the samples?
>>George Varghese: Good point. There is loss. So in this case I'm just saying if there
is loss, then some of the samples are lost, and you decide when you go to the other
end --
>>: [inaudible]
>>George Varghese: Let's talk about that when we talk about the real scheme. I don't
want to talk about all the purported schemes before I talk about the -- okay.
So this is okay. But we can do better. We can get a much better standard error. So now
the simplest scheme you can do is to simply keep a counter which is
the sum of all the sender time stamps -- 10, 12, 15 -- and the sum of
all the receiver time stamps. So far, no loss. Assume no loss for
everything.
So now you store the time stamp sums and you also store the number -- I didn't show
you in the animation -- 3, right? And now you simply take 23 plus 26 plus 35 minus 10
plus 12 plus 15 -- it's just a counter, possibly 64 bits -- subtract, CR minus CS divided by N,
and you have the number. So you're asking, what is this talk about? Well, as [inaudible]
pointed out, the whole issue is loss.
So loss can actually hurt pretty badly. And let's see how badly. Here's the
simplest counter-example. Consider an interval of time T. T is big, like one second.
Two packets: one sent at half the interval, at half a second, and the second one
sent at 1 second. Both of them take zero time. Right?
The first packet is lost. The sender's sum is half plus 1, 1 and a half, right? The
receiver's sum is just 1. So when you subtract the two, you get half a second, and if
you don't do any other kind of correction -- if you divide by
1, you get half a second when the actual delay is zero. So you can be off by orders of
magnitude because of loss.
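The arithmetic of that counter-example:

\[
C_S = 0.5 + 1.0 = 1.5\,\text{s}, \qquad
C_R = 1.0\,\text{s} \ \text{(the first packet never arrives)},
\]
\[
\hat{d} = \frac{|C_S - C_R|}{1} = 0.5\,\text{s} \quad \text{versus a true delay of } 0 .
\]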
And simple approaches like Bloom filters don't quite work.
>>: [inaudible]
>>George Varghese: So you could do error detection, but not error correction. So
you can detect that you have a problem, but you would rather get some estimate when
there's loss as opposed to saying sorry -- so, yeah, by simply sending the counter you
can detect that there's a problem, but you can't also [inaudible] correct.
So now we're almost at the obvious idea. So what's the obvious idea? We know
that a single copy of this algorithm can be detected to be wrong, but it can't be
corrected. So, well, you run multiple copies. Right? So it's pretty easy.
But before we do that, let's start with the theoretical perspective. From the theory point of
view, this is a streaming algorithm, but it's not quite the same as a normal streaming
algorithm. In a normal streaming algorithm you have a stream of data and you have a
single point that computes a function on that stream. Max, min, median, whatever you
want. Right?
Here what happens is you have two streams. You have a stream of sent time stamps
and a stream of received time stamps. And what you want to do is to actually do a
function that coordinates corresponding points in the two streams. So you want to take,
for example, the send -- the first received minus the first send, the second received
minus the second send and so on and so forth. So you have to coordinate those. And
secondly, you can have loss.
So these are two small things, but, for example, just to show the theoreticians here, max
is completely trivial in a standard stream. You simply keep the max. But here you can
prove, using a reduction to communication complexity, that it provably takes linear
storage. So if you wanted to compute the maximum delay, you can prove
that you have to keep all the time stamps, pretty much. So that's a big difference in
model, right?
So that's just for the theoreticians. If you're a systems person, you don't care, forget it.
Go ahead.
>>: [inaudible]
>>George Varghese: Just to compute the max of all -- the maximum latency. So
imagine the maximum latency is one packet that takes a long time
to arrive among this million packets, the needle in the haystack. So is there a simple
summarizing function? No. You can show this by a reduction from [inaudible], et cetera.
All right. So what do we do for delay in the presence of loss? What we're going to
do is simply have multiple hash buckets, and the obvious idea is we're going to take
packets and spread them across buckets. So rather than put all our eggs in one
basket or one bucket, we're going to spread packets across buckets. And now if some
buckets get destroyed by loss, well, hopefully other buckets will survive. So that's the
main idea.
So you take your first packet, it happens to hash onto the first bucket, and you store
exactly the original [inaudible], you store the sum of the time stamps and the number.
So actually you store 10, and the second packet hashes onto the second bucket, so 12
goes to the second bucket. The third packet happens to go onto the first bucket again.
When it collides, there's no collision resolution; you simply sum.
Go ahead.
>>: [inaudible]
>>George Varghese: We'll talk about all the [inaudible]. That's coming. So the first
thing is, if there are at most 10 losses and you have 11 buckets, at least one will
survive.
>>: [inaudible]
>>George Varghese: It doesn't really. Because you're simply hashing. You're taking
the packets and hashing based on their content.
Now, this is very important. You take some part of the packet that doesn't change, and
that's important because the sender and receiver must hash in exactly the same way so
that the correspondence is built in. Otherwise, packets could go to different buckets at
the two ends and you'd get the wrong estimates. Okay?
So the next packet comes in, and it goes into the second bucket. And so the first bucket
has the sum of the two sent time stamps, 10 plus 15, and the second
one is 12 plus 17. Now, what's happened here? Well, it's easy. The white packet has
been lost, and so it wasn't in that bucket.
So the summary data structure, the synopsis you keep, is the sum of the time
stamps and the number of packets in each bucket -- 25 and 29, each with a count of 2 at
the sender -- but at the receiver one of these buckets is short by one packet.
So you ship the summary to the other end, and now your combining operator is easy.
Whenever you find the counts don't match, you ignore that bucket. If the counts
match, you aggregate all of them and divide by the number of packets. So in this case
there's only one bucket that works. You take 65 minus 29, divide by 2. And that's
basically the idea.
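A minimal sketch of the bucketed scheme in Python; the hash function, bucket count, and float timestamps are illustrative choices, and the sampling banks for unknown loss rates (described next) are left out:

import zlib

NUM_BUCKETS = 4

def bucket_of(invariant_bytes):
    # Hash some part of the packet that does not change in flight, so that
    # sender and receiver put the same packet into the same bucket.
    return zlib.crc32(invariant_bytes) % NUM_BUCKETS

class LatencyBank:
    def __init__(self):
        self.ts_sum = [0.0] * NUM_BUCKETS   # sum of timestamps per bucket
        self.count = [0] * NUM_BUCKETS      # number of packets per bucket

    def update(self, pkt_bytes, timestamp):
        b = bucket_of(pkt_bytes)
        self.ts_sum[b] += timestamp
        self.count[b] += 1

def average_delay(sender, receiver):
    total, n = 0.0, 0
    for b in range(NUM_BUCKETS):
        if sender.count[b] == receiver.count[b] and sender.count[b] > 0:
            total += receiver.ts_sum[b] - sender.ts_sum[b]   # usable bucket
            n += receiver.count[b]
        # a count mismatch means loss hit this bucket, so it is ignored
    return total / n if n else float("nan")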
Now, it turns out that you have this very simple idea, and then you write a paper that
complicates it in the following senses, right? To
make it work, you have to add a little more complexity, but the idea is very simple.
So, first of all, this works -- you can't afford a lot of memory. So how many buckets
could you have? Maybe a hundred, right? So that suggests that you can only deal with
99 losses. But in real life, losses are rare, but once in a while you might lose all packets
or you could lose a lot.
So you really can't tell what the losses are, so what we now do is we bring back
sampling into the mix, right? We're dealing with an adversary who's giving us unknown
loss. So we're going to control the loss after sampling such that at least half the buckets
will survive. So we'll get a good estimate from there.
And now the next problem is we don't know the loss -- if we knew the loss rate, we
could adjust the sampling rate to match. But we don't know the loss rate in advance.
So all we do is run several copies of the structure -- for high loss, low loss, medium loss.
So now we're completely resilient. In the internet environment, we hate to make
assumptions. We don't know anything. And the net result is quite good: with about
500 buckets and reasonable amounts of chip real estate, you can do a lot of this.
>>: [inaudible]
>>George Varghese: The summary could also get lost. You're right. So the idea is --
>>: [inaudible]
>>George Varghese: No, you're right. You could have done that. If
the summary gets lost, you do lose the estimate for that interval. So
you're right. In that case you'll -- yeah, you're right. So let's see. So the difference is --
>>: [inaudible]
>>George Varghese: You could send it five times. That's right. Random times and --
yeah. But that's probably something we should have said. Yes.
>>: Do you perhaps want to exploit the fact that you have time in your favor, where you
could compute the summary at regular intervals? Because once you compute the
summary, you can store that information and you have that information, correct?
>>George Varghese: Right.
>>: Suppose if you keep collecting information in these buckets for a long period of
time, I guess --
>>George Varghese: You don't. So every interval -- I should tell you, every interval you
reset everything.
>>: So you just control that to [inaudible].
>>George Varghese: Right. So typically the intervals depend on some manager will
say, you know, I want the intervals in a second, right? And I want the average latency
in a second, and I also want the variance. Right. So -- yeah. So those intervals come
from higher levels.
But every interval you reset everything and start the whole algorithm again. All the
counters go to zero. All right.
>>: And that's triggered by the control packet?
>>George Varghese: Yes. Exactly. Every time the control packets come in, we redo
this. And we do have resiliency mechanisms -- to make sure they stay synchronized in
case of losses, we put sequence numbers on the control packets. But those are
our control packets, so we can synchronize them.
The point behind this slide is simply that if you compare this to the active probe kind of
approach, you get a big order-of-magnitude difference given the bandwidth limits. To
do the same kind of estimates, the red ones are just much
smaller errors than you would get by sending messages periodically, because
periodic messages have only so much control bandwidth, so your sample size is
small. Especially where the loss is small, we pretty much get all the
samples, so you get a nearly perfect estimate.
As the loss goes up, you can see the error goes up, but not as badly. At some point,
though, the error gets very bad, and it degenerates to simple sampling -- it's like
sampling once every control packet.
All right. So far you say, all right, the average is kind of obvious, right? Because
instead of taking R1 minus S1, plus R2 minus S2, and so on, and dividing by
N, you could simply take sigma Ri minus sigma Si and divide by N. And
that's obvious because it's a linear operator.
But what about variance? It's much more interesting, right? Because a lot of people
would like variance, because it's not enough to know that your average delay is 10
microseconds. You'd really like to say that, look, in the 99 percent case, it never
exceeds so much. You can't get the max -- it's too hard -- but at least getting a variance is
useful.
All right. So variance is not a linear operator. You've got to take the received time
stamp minus the sent time stamp and square it, plus R2 minus S2
squared, plus R3 minus S3 squared. So it's not obvious how you could simply take the
sum of the received and the sum of the sent and square them. And most of the simple
things don't work.
Fortunately, we just steal this idea from AMS for estimating the second moment, and
the idea is -- I'll just describe it operationally. Normally what we did is, at the
receiver side and the sender side, we added the time stamps of everybody in the
bucket. Now we're going to simply have another hash which takes the packet and
decides whether to add or subtract. So we're going to take a time stamp and add or
subtract it with equal probability from this counter.
And so we're going to keep this up/down counter for each bucket and a similar
corresponding one at the receiver. And now we're going to take the two and subtract
them, and we're going to simply divide by N, and that's going to be an estimate of the
variance, and then you can get -- so why does this work? Intuitively, it's fairly
straightforward to see what's happening. Each of these counters, instead of being the
send-1 counter plus send-2 counter plus send-3 counter plus send-4 counter, is
S1 plus or minus S2 plus or minus S3, and the receiver's counter is R1 plus or minus R2
plus or minus R3, with the same signs.
So when you subtract these two, you can collect terms together as plus or
minus (R1 minus S1), plus or minus (R2 minus S2), plus or minus (R3 minus S3), and so on.
So when you square, what's going to happen is you're going to get all the right terms.
You're going to get exactly (R1 minus S1) squared, which you want, and all the
cross-product terms are going to disappear because they come out plus with equal
probability and minus with equal probability. So in the end you simply get the sum of
the (Ri minus Si) squared, which is exactly the second moment you need for the
variance. So that's the AMS result, and you directly get it.
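A sketch of the signed-counter idea, again in Python with an illustrative hash; this estimates the second moment of the delay, from which the variance follows by subtracting the square of the average delay obtained from the earlier counters:

import zlib

def sign_of(pkt_bytes):
    # Content-based hash, so sender and receiver assign the same +1/-1
    # to the same packet.
    return 1 if zlib.crc32(b"sign" + pkt_bytes) & 1 else -1

class SignedCounter:
    def __init__(self):
        self.acc = 0.0
        self.n = 0

    def update(self, pkt_bytes, timestamp):
        self.acc += sign_of(pkt_bytes) * timestamp
        self.n += 1

def second_moment(sender, receiver):
    # receiver.acc - sender.acc = sum_i sign_i * (r_i - s_i); squaring kills
    # the cross terms in expectation and leaves sum_i (r_i - s_i)^2.
    d = receiver.acc - sender.acc
    return d * d / receiver.n

In the real scheme this runs per bucket, alongside the sums above, so loss is handled the same way: buckets whose counts disagree are ignored.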
But from the hardware point of view it's no more complicated than the other one. You
just have to add one more hash, look at a bit, and subtract now instead of adding. So
you need a subtractor as well as an adder.
Yes?
>>: Is the plus or minus also per packet?
>>George Varghese: It has to be consistent. So that's essential. If you don't do that, you're
screwed, right? So you're right. The hash is, again, based on the packet. So when you
keep R1 plus or minus R2 and he keeps S1 plus or minus S2, that's what makes sure the
signs of these two are the same when you subtract, and that's important. Otherwise it
doesn't work.
All right. So basically you can do this with -- we checked with our friends at Cisco, and
there are standard chip sizes that are very conservative, 95 square millimeters, and this
scheme will take, like, one percent of such a [inaudible].
Now, it turns out that most chips have lots of gates today. So they do have gates to
play with, and it's a very reasonable proposition to ask them to do something like this.
The FIFO model is still true between the in ports and out ports of a router. You can
deploy this by starting within single routers, where most of the delay is, and then
later extend it across routers as time synchronization gets done.
All right. So, again, let me just summarize this part. With the rise of automated trading
and video, fine-grain latency matters -- if you forget everything else, remember that
suddenly microsecond latency is becoming interesting to us. We used to be all about
throughput in networks. Suddenly we're getting beaten up for latency, right? And it's
really interesting. It's changed a lot of the way -- and we proposed LDAs. It's very simple
to implement and deploy, and it's capable of measuring average delay and variance,
loss and microbursts. And what's the edge?
So the edge is basically the number of samples. It's really a sample
amplifier. It gives you a lot more samples than you would otherwise get. If there's no
loss, you effectively have a million samples, while plain sampling would give you M
samples, let's say a thousand. And so it can reduce the
error by the square root of that ratio, which is quite a bit. All right.
>>: Is it actually true that most of the delay is inside routers [inaudible] almost all of the
loss is inside routers --
>>George Varghese: So that's a very good point. The setting here is often data
centers, where the propagation delay is not significant. That's very important. When you
go to the wide area, that's not true at all. But banking and high-performance computing
can be all within data centers. So, actually, that context should have been stated before.
My mistake.
>>: [inaudible]
>>George Varghese: We don't do the reverse hashing. In networking we follow the
[inaudible] style stuff. We completely ignore their hashing and just do what we can do
fast in hardware. And our experiments say it just works great. And Mike has a lot of
results with [inaudible] saying that various theoretical -- we've not bothered to verify that.
We just simply do it. That's a good point. We don't -- the kind of hashing we do is
typically like this.
One of the simplest ways is we take -- we consider it as multiplication of the key by a
random matrix of zeros and ones, and that seems to just work well, really well. And it's
really easy because it's a network of XOR gates, and we can implement this
really fast.
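A sketch of that kind of hash: each output bit is the parity (an XOR) of the key bits selected by one row of a fixed random 0/1 matrix. The widths and the example key below are arbitrary:

import random

KEY_BITS, HASH_BITS = 32, 16
MATRIX = [random.getrandbits(KEY_BITS) for _ in range(HASH_BITS)]  # chosen once

def matrix_hash(key):
    # Multiply the key by the random binary matrix over GF(2): each row mask
    # selects key bits whose XOR (parity) becomes one output bit.
    out = 0
    for i, row in enumerate(MATRIX):
        parity = bin(key & row).count("1") & 1
        out |= parity << i
    return out

print(matrix_hash(0xDEADBEEF))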
>>: [inaudible]
>>George Varghese: It probably applies. Most networking people have cheerfully
ignored the reasons for [inaudible] independence and we just simply do it. But it's a
good point. At some point somebody should verify there is enough entropy to make this
happen. We expect that. We should have measured real traffic streams and did what
we did. Sorry. So much for [inaudible] independence.
So now I want to completely change context, and the only way I can change context is
to tell you a story so that you can erase all your memories of this previous slide because
we're going to change the context, change the measures, change the problem, and if
you get confused with the previous part, we're in trouble. Okay.
So my friend -- I'm from India, and I had a friend from India whose mother came to visit
old [inaudible] farm, I think in Massachusetts, and she had this little circle on her head
that Indian women call a bindi, right? So an American came up to this friend's mother
and said that's a beautiful thing. How do you do this? And so my friend launched into
this spiel about how Indian women are sort of trained from birth to draw these beautiful
circles and they practice hard every -- before a mirror. And as he was saying this, his
mother sort of unpeeled this paste thing from her forehead, and he was so
embarrassed.
So the thing is he was right. In the old days they used to do things like this, right? But
his model of the universe had shifted, right? So, similarly, if you're not careful, we could
completely change our model. So be careful. It's a totally new problem.
So this is with Mike Mitzenmacher and Terry Lam, and it appeared in NSDI. So it's
called Carousel.
So the problem at hand is that we have this deluge of interesting events coming into
networks. Attacks are happening, packets are being dropped, and you have this
manager who's trying to understand, to make sense of, this deluge of data.
And the standard approach to getting a coherent view is basically sampling and
summarizing. You either take a sample of some of the events, or you summarize and
keep a count.
But there are certain contexts where you really want complete information. So let's give
you an example.
So imagine a worm breaks out and attacks Microsoft -- well, actually, a university is
better. It cannot, by definition, attack Microsoft. Microsoft is sacrosanct. Worms know
better than that [laughter].
But let's say a university. Fair game. So they come to UCSD and they start infecting all
these machines, and this is Code Red, and this shows you how fast it took over the world
and how many machines it took over, right? So at that point, during
the attack, managers would like to know not a sample of which machines were infected,
not a count of how many machines were infected -- they want to know which machines
are infected. Your machine is infected, yours is not. So they can remediate. So
they want a collection of all of them.
Another example is: list all the stations of a particular type. The ones deploying IPv6,
deploying a TCP hack. And a problem that I encountered many years ago and was
always haunted by is: list all the MAC addresses.
And so what's the issue, right? What's the issue with this? So before we go on, let's
make our problem concrete, although it applies to many settings, with the following
simple setting. The simple setting is you're trying to detect bad messages,
and there is a device called an intrusion detection device. It's a detector of bad packets.
Now, each packet has a key, which is its source, let's say A, and some content which is
somehow bad. There's a bad program there. Now, fortunately, the device has a
signature, which is a regular expression on the content, which decides that this is
Slammer, which is a worm, or this is Witty [phonetic].
So the idea is, as the packets are passing, the intrusion detection device taps the link
and simply applies all of these signatures in parallel, and if a packet matches the
red one, the idea is it should send a little message with a flag saying it's red
and the source A. Similarly, it should send B and it should send C.
So at the end, the result is you should be able to get all of your infected ones and also
which ones are infected, so that you can go ahead and remediate them.
All right. So this is a classic kind of issue. Most of the work in intrusion detection
is focused on fast matching. As you can see, that's a problem. But very few people
have worried about the logging. You think, ah, it's a boring thing. Just log it.
But under certain assumptions, logging is actually hard. So let's see why.
So the model, the queueing model -- there are N sources, and N is large. Maybe a
million. They're arriving at a queue at some very large rate, capital B, which is,
let's say, 10 gigabits per second. The intrusion detection
device is a hardware device and has limited memory. So it might have, say, 10,000
pieces of memory.
And now it wants to eventually log everything to a sink that has infinite memory,
but the logging speed is small. It's, let's say, 10
megabits per second. So think about the two extremes. If you had infinite memory, or
N memory, it's easy: you simply store them and then, at leisure, log them to the
sink.
If you had infinite speed, it's easy, because as they come, you simply log them, and you
don't need a queue. But when you have two such constraints, you're in trouble. So just
to make sure that you get the parameter space: the logging bandwidth is much smaller,
several orders of magnitude smaller, than the arrival rate, and the memory is much
smaller than the number of sources.
So think of the memory as a hundred and the number of sources as a million. What can
you do?
Okay. Well, fortunately, there's one assumption. Systems people will always cheat, right?
So they change the model. The assumption is that sources are persistent. An
infected machine is going to keep coming back. So if you miss it at this opportunity to
log it, as long as you log it later within a reasonable time, you're fine. So you'd
like to make up by playing games with time, so to speak.
But how do you do this systematically? That's the question. So, again, you can
imagine that simple sampling will work. Just keep a finite memory, and by the randomness
of things, you should eventually put things over here, and if you
have enough time, you'll probably get everybody. But it's like the coupon collector. It's not
an efficient way to grab all your coupons. There are better things you can do. Okay?
So in fact, the coupon collector bound tells you immediately that even in a simple random
model, you have a log N kind of inflation, and you would like to do better than that.
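The coupon-collector bound he's referring to:

\[
\mathbb{E}[\text{arrivals needed to see all } N \text{ sources}] = N \cdot H_N \approx N \ln N ,
\]

so a memoryless scheme that logs whatever happens to be in its buffer pays roughly a factor of \(\ln N\) over the ideal of logging each source once.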
All right. So our result is we're going to show you a scheme called
Carousel which basically avoids this log N inflation, and in fact under any model of arrivals
we're going to log all sources in close to optimal time. What is optimal time? N, a
million, divided by the logging speed. We'll be within a factor of 2 of that.
And the standard approach is the coupon collector, right? N ln N. And adding Bloom filters
and other things doesn't really help. So now here is the animation. So it's easy to see.
What's the problem?
So in this animation, you have two pieces of memory, and the sink is far away and it's
going to be slow speed, and the packets are arriving here. So packet 1 arrives and it's
stored in memory. No problem. We have space for it. Packet 4 arrives before we've
logged packet 1, because the logging speed is slow -- we don't get a chance to log it. Packet 3
comes in; oops, we don't have room for it, we drop it. Packet 4 comes in again; it
doesn't really matter because we already have it. Packet 1 now finally gets a chance to
be logged. Packet 4 moves up in the queue, packet 1 comes in and is stored, and
packet 2 comes in and is dropped.
So notice that in this example you had four sources and you only logged two of
them. And I could repeat this example ad nauseam. So if the timing was controlled by an
adversary, he could make sure that some sources are never logged and never
remediated. So this method is extremely susceptible to problems.
If you assume a random model, yes, there's an ln N of inflation. But if you assume
a deterministic adversary, it can be really bad.
So basically sources 2 and 3 are never collected, 1 is logged many times, and in the
worst case N minus M sources can be missed. And, remember, N is a million, M is
10,000. So you miss almost all sources. It's unlikely, but it at least could happen.
And in security systems you have to be really careful, right? Because these guys could
conspire on timing and do stuff, so you really can't assume they're trying to help you
get logged. That's not their job. It's not part of their job description. Right. They
won't get fired for it -- okay.
So, now, this is amazing -- theoreticians don't have this problem, but in systems, there is
an amazing fallacy that Bloom filters solve everything. Somehow people are just excited
about Bloom filters.
Now, a Bloom filter is basically a linear piece of memory. All it does is reduce the
constant factor. It's like a hash table, except that it reduces the constant factor by a factor of
10. So if you have 10,000 pieces of memory, how in the world is a Bloom filter going to
help you log a million sources? It's not possible. But nevertheless, I have to go through
this slide often, because otherwise people will say, why didn't you use a Bloom filter? So
let's just do it.
So imagine 1 comes in. And it's stored in the Bloom filter. There are two pieces of
memory. And the Bloom filter just records a little trace that 1 was there, and then it's
stored in the queue. 4 comes in. 4 is stored in the second slot, right? And now 3 comes
in and it is dropped because the Bloom filter is full. 4 comes in and rightly is dropped
because the Bloom filter says it's already there. 1 is logged. 4 comes in; 1 is in the
Bloom filter. 2 comes in; the Bloom filter is full. Gone. Okay. 4 goes on. And now
what happens -- now here's the problem. At some point you can't just keep 1 and 4 in
the Bloom filter forever, so you have to clear it.
If you clear it, I can repeat the same timing again and cause the same loss, so I don't
even want to go through all these hundreds of pieces of animations. Hopefully between
friends you will agree [laughter]. If there are adversaries in the audience, we can go
through this again.
Okay. So whatever. It takes too long. Okay.
So you can prove that it has really similar performance to the standard scheme. It really
doesn't help anything. Right? And so the main point is the Bloom filter is necessarily small,
and it reduces the duplicates by some marginal factor. That's all it does. All right.
Okay. So now we need to solve the problem, in case I didn't mention that. So let's be
inspired by an example from networking, which is something called admission control.
In networking, it's a very common paradigm: if you can only handle, say, 10 megabits
and you have 100 megabits of arriving traffic, what you really want to do is admit only
10 megabits per second, because you can give that service. And then the other guys have
to wait on the edge of the network. Even highway traffic does that with metering ramps.
So the difficulty here is that in standard admission control, the sources cooperate. They help
you. They say, all right, we'll wait at the entrance ramp. But here, these guys will
be busting the ramps if they can. These are, like, infected nodes. They're not likely to
follow any protocol you set.
So what we would like is unilateral admission control. The question we ask is what can
a poor resource do regardless of what the sources are doing? So the sources can do
any old thing.
And so our approach is what we would call randomized admission control. And in
essence what we do is take the sources, break them up into random sets
that are small enough to handle, and then, to be fair, cycle
through all the sets to give everybody a fair chance.
So that's the essence. Here's the animation to see how it works.
So now we have some colors, which is great. So we have the same sources, we have
the same Bloom filter. Yeah, it's useful, at least for -- but we now have a color -- a little
color filter too, right? And the color filter is going to start with red, and what we're going
to do is magically color all the sources red and blue for now. And how do
we color them? Think of hashing the keys: if the low-order bit of the hash is a zero, it's
red; if it's a 1, it's blue. So we don't have to do anything special to paint these guys;
it just happens.
And now we have a color filter which says, for the next x seconds we're going to
only log red sources. So --
>>: Does this require that you be able to have enough different colors?
>>George Varghese: Let's watch. The next question is how many colors. So let's -- so
you're one slide ahead again. You noticed how magically it was 2. But it will go to the
next slide.
So what happens is 1 comes in, it's red, we are logging red, the source is fine, put him in
the Bloom filter, put him in the queue and so on. 2 comes in and 2 is dropped
because it's not the right color, right? So 3 -- just in case you didn't notice, the slide
telegraphs its intentions by saying this -- 3 comes in, and 3 is allowed and 3 is
stored. And 4 comes in. 4 is dropped. 2 comes in, it's dropped. Blue. And 3 comes
in, 3 is logged, and now 1 comes in and it is rightly dropped because it's already in the
Bloom filter, so there's no point in doing it again.
And then 3 comes in and it is again dropped because of the Bloom filter. 4 comes in
and it's dropped.
And so now what happens is that you've finally got all of these done, you've logged
everybody you saw in your memory, and now what you have to do is rotate
the carousel. So you simply change colors and become blue. And from now on
you log all blue sources.
>>: [inaudible] [laughter].
>>George Varghese: Okay. And now you log blue sources and now, you know -- et
cetera, et cetera. This is too painful. But anyway, 3 is dropped and so on.
Okay. So -- what did I do wrong?
>>: How many colors?
>>George Varghese: Oh, how many colors. Okay. Okay. Sorry.
So now I come back to your question, right? So it could be that you pick the wrong
number of colors. So you generally start with zero colors and then -- I'm
sorry, one color. You start with one color. And then you go to two and four.
So the question is, when do you know that your number of colors is not enough, and then
how do you adjust?
>>: What actually worried me was an adversary can choose which color he would like
to participate in.
>>George Varghese: No, because the hash is under your control. He doesn't know it.
>>: Okay. [inaudible].
>>George Varghese: I think that's a better model, because given the [inaudible] in the
room, like, we won't -- yeah, go ahead.
>>: [inaudible]
>>George Varghese: It's a good point. So we're actually using them in two ways. One
is we do use it to suppress duplicates within a color phase. So if somebody comes in
that's already in the queue, we don't log it again -- though that's actually not a big effect,
because the set per phase is so small that the probability of getting a duplicate out of N is
very small. But there's a more important reason. We use it to sort of estimate when we
are in trouble. Because what happens, it gives us -- if the Bloom filter gets full, we know
we don't have enough colors. And then we increase the number of colors.
So it's kind of -- it's a knob that we use --
>>: [inaudible]
>>George Varghese: You could think of it as a hash table. To me it's just simply a
smaller constant hash table, that's all.
See, there's a hash table and a queue. You have to keep a hash table to estimate the
number of people currently in a -- who are --
>>: [inaudible]
>>George Varghese: Multiple numbers hashed -- yes. We can. But Bloom filters,
that's more false -- just think of it as a hash table.
>>: [inaudible]
>>George Varghese: Just think of it as a hash table.
So in the end what really matters is the number of entries in this hash
table.
>>: [inaudible]
>>George Varghese: It's easy to do because you can change colors in a microsecond
by simply saying that if they're using the lower of one bit of your hash used [inaudible]
so nothing has to be changed. The sources just get recolored automatically.
So not the existing ones. And you have to clear everything and start again.
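As a rough sketch of that answer (my own code, not the speaker's): doubling the colors is just masking one more bit of the same hash, and the only explicit work is clearing the per-phase state. The state dictionary and its fields here are hypothetical names for illustration.

```python
def increase_scale(state: dict) -> None:
    """Double the number of colors: just use one more low bit of the same hash.

    No per-source work is needed; a source's new color falls out the next
    time it is hashed. The only explicit cost is clearing the per-phase
    state, i.e. the Bloom filter (a plain set here) and the log queue.
    """
    state["k"] += 1          # 2 colors -> 4 -> 8 ...
    state["phase"] = 0       # restart from color 0
    state["bloom"].clear()   # forget who was logged in the aborted phase
    state["queue"].clear()   # forego the investment in logging so far

# Usage sketch:
state = {"k": 1, "phase": 0, "bloom": set(), "queue": []}
increase_scale(state)        # now 4 colors (2 bits of the hash) are in play
```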
So let's see what happens. 7 comes in. 7 is wrong color. 3 comes in, and 3 is the right
color. It's stored. 4 comes in and -- now 5 comes in and something interesting
happens, right? So 5 comes in and 5 is red, but there's only room for two sources, and so you
know you're in trouble. You need three at least, right?
And so at that point what you do is you say, oops, Bloom filter full, and you say increase
the carousel colors and you automatically recolor to four, and now you have to recolor
all these guys and then the right thing happens. So you'll keep recoloring until it fits.
And the point is -- from a theoretician's point of view, it's nothing. It's a small little counter that
you bump from one bit to two bits to three bits and the hardware just does the right
thing. There's no phasing. Nothing has to be done. There's no going back and
recoloring all the ones that have flown past. But you have to clear everything. And you
sort of forego your investment in logging so far. You say, ah, I don't care about that,
right?
And competitively speaking, you'll lose that [inaudible] factor of 2 because you're
doubling, right? 1, 2, 4, 8. So like in any binary tree, you'll only lose a factor of 2.
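A rough way to spell out that binary-tree argument (my notation, not from the slides): the phases spent at all the smaller scales form a geometric series dominated by the final scale.

```latex
% Rough version of the doubling argument. If the right number of colors
% turns out to be 2^{k}, then the phases spent at all scales add up to at most
1 + 2 + 4 + \cdots + 2^{k} \;=\; 2^{k+1} - 1 \;<\; 2 \cdot 2^{k},
% i.e. fewer than twice the 2^{k} phases you would have needed had you
% known the right scale up front -- hence the factor of 2.
```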
>>: You'd expect the number of colors you need to be bounded by the ratio between
the incoming rate versus the logging rate.
>>George Varghese: The number of colors is -- it's actually easy. So it's basically log
to base. You can do M at a time. And so what happens is you can do M at a time and
so the number of phases you need is going to be at least N by M. Right?
Now, in N by M, what you need to do -- so just very simple math will give you the
number of colors based on these parameters.
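Filling in that simple math with illustrative numbers of my own (not from the slides): with N active sources, M memory slots, and a logging rate of L records per second,

```latex
% Length of one color phase: long enough to drain one memory's worth of records.
T = \frac{M}{L}, \qquad
% Enough colors that each color class fits in memory:
2^{k} \gtrsim \frac{N}{M} \;\Longrightarrow\; k = \Big\lceil \log_{2}\tfrac{N}{M} \Big\rceil .
% Example (illustrative): N = 10^{6},\ M = 10^{4},\ L = 10^{3}\ \text{records/s}:
\frac{N}{M} = 100, \qquad k = 7 \ (128\ \text{colors}), \qquad
T = 10\ \text{s}, \qquad \text{one full rotation} \approx 2^{k} T = 1280\ \text{s}.
```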
Yes?
>>: [inaudible] the only question --
>>George Varghese: Think of them as a hash table, a slightly compressed hash table
that, you know, sometimes will give you the wrong answer.
>>: The only question I have about it is I can tell I can't add this entry to the Bloom filter,
but I know that the Bloom filter -- but I know it's not in the Bloom filter. Is that the simple
[inaudible].
>>George Varghese: Yes. Yes. Yes. Yes. So there is some -- there's a little bit of a
probabilistic statement there. But nevertheless, that's probably true.
And now you'll redo this, and let's skip through all of this because you guys sort of
figured this out and so -- yeah.
All right. So let's just summarize this algorithm. So what is the algorithm? The
algorithm is basically three steps. It's a partition step, which is a hash: based on
the color, the lower bits of the hash function, you partition. So more formally, you take
H_k(S), which is the lower k bits of H(S), a hash function of the source. You divide the
population into partitions with the same hash value, and then you iterate.
Because once you've logged one color, you have to go through all the other colors, right?
And so you simply take -- and what is the time? It's a pretty obvious thing. If the
memory is M, you might as well give enough time for that memory to be logged, M over L.
And so after that you want [inaudible]. So the constants actually fall very nicely. Each
iteration lasts T seconds, and the Bloom filter does weed out duplicates. It's marginally
useful. If you didn't have any other way to estimate the number, it would probably work.
So there are other ways; you can use [inaudible] and other kinds of schemes to actually
just estimate the number of distinct elements in a set, and that will probably just work
fine too. If that number is too small, you can do it.
And then you finally monitor. It turns out -- I've only shown you the up-tick, but you also
have to have a down-tick, for when there are too many colors. If the worm
outbreak sort of grew and then fell, then you really want to speed back up. Otherwise you'll
waste time if you're trying to be optimal. So you want to make sure of all that.
Okay. And so you increase scale if the Bloom filter is too full, and you decrease scale if the
Bloom filter is too empty. All right.
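Putting the three steps together, here is a minimal Python sketch of the whole loop, under simplifying assumptions of my own: the Bloom filter is modeled as a plain set (so no false positives), phases are timed with a wall clock, and only the up-tick is shown; the class and parameter names are made up for illustration.

```python
import hashlib
import time

class Carousel:
    """Sketch of Carousel-style randomized admission control.

    Simplifications: the Bloom filter is a plain Python set; phases are
    timed with a wall clock; only the "up-tick" (increase scale) is shown.
    A real implementation would also decrease scale when the filter stays
    too empty.
    """

    def __init__(self, memory_slots: int, phase_seconds: float):
        self.memory_slots = memory_slots    # M: sources that fit per phase
        self.phase_seconds = phase_seconds  # T, roughly M / L in the talk
        self.k = 0                          # 2**k colors; start with one
        self.phase = 0                      # the color currently being logged
        self.seen = set()                   # stand-in for the Bloom filter
        self.phase_start = time.monotonic()

    def _color(self, source: str) -> int:
        digest = hashlib.sha256(source.encode()).digest()
        return int.from_bytes(digest[:8], "big") & ((1 << self.k) - 1)

    def _rotate(self) -> None:
        """Iterate: move the carousel to the next color and forget the phase."""
        self.phase = (self.phase + 1) % (1 << self.k)
        self.seen.clear()
        self.phase_start = time.monotonic()

    def _increase_scale(self) -> None:
        """Monitor: the filter filled up, so double the colors and restart."""
        self.k += 1
        self.phase = 0
        self.seen.clear()
        self.phase_start = time.monotonic()

    def admit(self, source: str) -> bool:
        """Partition + admit: return True if this source should be logged now."""
        if time.monotonic() - self.phase_start >= self.phase_seconds:
            self._rotate()
        if self._color(source) != self.phase:
            return False                    # wrong color: drop for now
        if source in self.seen:
            return False                    # duplicate within this phase
        if len(self.seen) >= self.memory_slots:
            self._increase_scale()          # too many sources of one color
            return False
        self.seen.add(source)
        return True

# Usage sketch: log at most 2 sources per 0.5-second phase.
carousel = Carousel(memory_slots=2, phase_seconds=0.5)
for src in ["1", "2", "3", "4", "1", "3"]:
    print(src, carousel.admit(src))
```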
So we did a number of experiments, and I'll just tell you one of them. Right? So what
we did was we took a real intrusion detection system called Snort. That is publicly
available. You can go ahead and take a Linux box and load it in. We wanted 10
gig links, but we couldn't get them, so what did we do? We basically had to scale down --
sorry. We took 100 megabits per second links, we took a traffic generator that was
sending stuff in with 10,000 sources, and the logging rate was 1
megabit per second. We ideally would have liked higher numbers here and higher
numbers here, but because we couldn't get those numbers, we scaled down the
system.
And we took two cases. In one case the source S was picked randomly on each packet, and in the
other case we stepped through 1, 2, 3, 4, up to 10,000 and then repeated. And if
the sources are working in concert, this might occur, so it's not as unrealistic as you might think. So
neither of these is completely deterministically adversarial. One is closer to random and
the other is somewhat closer to the bad cases.
And so now what happens is if you look at what's the measure, the measure is how long
does it take to collect almost all sources. So as the number of sources goes up -- so
let's look at 10,000.
So with Carousel, it looks like within a little bit of time you log all sources, but if you
use -- if you go ahead and use standard Snort, which is basically random, there's this
long tail. And afterwards, just like all coupon collector problems, towards the end it gets
increasingly longer to collect the rare stamps or the rare coupons. And that's not
surprising, right?
And notice that the effect is much worse with the periodic pattern. And that's, again, not
surprising. Random arrivals are what a coupon collector likes, but
when you have this periodic stuff, it does much worse.
So if you look at the numbers, we pretty much get
everything by 300 seconds, and they get it by around 1500 seconds, and so basically it's
5 times faster with the random pattern and 100 times faster with the periodic one. You could argue
which is the right model, but we just show you the range. Right?
So there's lots more experiments in the paper, lots more stuff, but I don't want to talk
about it. Okay?
So what's the big disadvantage of Carousel? Two big things that you have to
remember. It doesn't solve every problem. First of all, it doesn't deal with one-shot
logging. If a source shows up once and never speaks again, too bad, we
can't deal with it. For infected nodes it's okay, because if the infected node goes away,
then perhaps we're not interested in it. So we can possibly make that argument. But
you've got to be careful. We don't get something for free, right?
And, similarly, it's probabilistic. The final theorem is we get 99.9 percent of all sources
within this competitive time bound. The last one or two we might still lose. So we get
almost all, right? But it's not perfect logging.
So it is, we believe, applicable to a wide range of monitoring tasks with high line speed,
low memory, and small logging speed, and where the sources are persistent, right?
And it's a form of randomized admission control.
>>: [inaudible]
>>George Varghese: There are. There are. So [inaudible] is close to 10 gigabits. So
there are a few, and they are hardware, and they're doing pretty well. The company that
Juniper bought, NetScreen, is also pretty good. So people are beginning -- and people
are increasingly beginning to think that it should be packaged into routers. Cisco IDS
has been struggling for a long time to go from two to ten. They'll get there.
Okay. So let me just tell you the edge and then I'll be done.
>>: [inaudible]
>>George Varghese: So what you do is you simply lose all -- you just ignore all the logs
that you've sent so far. It's like you start a new experiment.
>>: [inaudible]
>>George Varghese: You have to be careful. So you want to make sure that you're
assuming that the distribution is semi-stable. And if the population is changing rapidly,
you could thrash. So there's some assumption built in here that the number of sources
changes slowly, and we react very slowly to that number of sources too. So if it does
change fast, we'll work on some upper bound on that. So we might be slower, but it's
not clear what optimal means in that case. So a really amazing algorithm could
maybe do significantly better in a period of great [inaudible] that is constantly changing the
number of sources, because when we sense it, we will take some bound for a while and
work with it, and then we'll come down only much later. So there is room for play here in a
theoretical sense.
Okay. So let's just finish up and say what all this is about the edge, right? So if you
look at Carousel, it's within a factor of 2 of the optimal time to log all sources, versus the standard
coupon collector, which is ln N times N over M. So, remember, the actual edge could be
infinite in an adversarial model, but let's put that aside. So if you take N
equal to a million and M small, the edge is close to 14. It could be 14 times faster to log
almost all sources.
Now, compared to the random arrival model, which is like simple sampling, LDA sort of
gives you N samples versus M samples. If N is a million and M is 10,000, the edge is
close to 10. Significant. Why 10? Because you take the square root of that ratio.
And sample and hold is order 1 over M versus 1 over the square root of M, so if M is 10,000,
the edge is 100. So, again, the edge is simply: what's the corresponding measure for
simple sampling, and what is the advantage of doing this little bit in hardware? And
the claim is it's significant. You don't add too much in gates, and so why shouldn't you
do it?
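To spell out those back-of-the-envelope edges (my arithmetic, using the numbers quoted in the talk):

```latex
% Carousel vs. naive coupon collection: collecting N sources M at a time
% takes on the order of (N/M)\ln N phases for the coupon collector versus
% roughly 2\,(N/M) for Carousel, so the edge is on the order of \ln N:
N = 10^{6}: \quad \ln N \approx 13.8 \approx 14.
% LDA vs. simple sampling: accuracy improves with the square root of the
% number of usable samples, so the edge is \sqrt{N/M}:
N = 10^{6},\ M = 10^{4}: \quad \sqrt{N/M} = \sqrt{100} = 10.
% Sample and hold vs. sampling: relative error O(1/M) versus O(1/\sqrt{M}),
% so the edge is \sqrt{M}:
M = 10^{4}: \quad \sqrt{M} = 100.
```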
And so in terms of LDA, lots of work on streaming algorithms, less work that I know
of -- and I've asked [inaudible] and a few other people -- on two-party streaming algorithms
where there's some coordination between these guys. There's a lot of work in network
tomography which joins the results of black box measurements, but [inaudible] directly
instrumenting things as opposed to inferring where the problem is.
So Carousel, it's a very simple idea. The theoreticians are really interested. Basically
it's randomly partitioning a large set into small enough sizes. And then the main idea is
iterating through the sets, right?
The only place I've seen it, and I would love to get more examples, is the Alto Scavenger. Butler
Lampson many years ago wrote a paper where they got too much stuff on
disk, and they were trying to rebuild the file index. They had to do this random divide
and conquer, and they sort of took random partitions in smaller and smaller amounts
until the size, the number of file sets, could fit into existing memory. Then they rebuilt it
for the rest and they cycled without [inaudible].
But it's a different problem because we have two limits. They had only one limit.
Memory. We have memory and bandwidth. But there's some idea there.
Okay. So the summary is the big thing -- no, let's go back to the 20,000-foot level. I think --
for the students here, I think monitoring networks for performance is a big deal. Right?
That's what the world cares about. Pushing packets faster is nice, and people will do
that, but all this stuff is still unknown. And there's a number of problems. It's not quite
just keep a counter, right? You have to do interesting things, right? Find latency.
And randomized, or really hash-based, streaming algorithms really can offer solutions that are cheap
in gates at high speeds. So it would be nice -- and I described two simple
hash-based algorithms I hope you'll remember. If you can remember a one-line
summary of LDA, all you're doing is aggregating the time stamps by summing and
you're hashing to withstand loss. That's the one-line summary.
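As a reminder of what that one-line summary means, here is a minimal sketch of the LDA idea under my own assumptions: both ends hash each packet to one of a few buckets, each bucket keeps a timestamp sum and a packet count, and only buckets whose counts agree at both ends are used, which is how hashing withstands loss. The bucket count and names are illustrative.

```python
import hashlib

NUM_BUCKETS = 8  # illustrative; the real choice depends on the loss rate

def bucket_of(packet_id: str) -> int:
    digest = hashlib.sha256(packet_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_BUCKETS

def record(buckets, packet_id: str, timestamp: float) -> None:
    """Each side keeps, per bucket, a sum of timestamps and a packet count."""
    b = bucket_of(packet_id)
    buckets[b][0] += timestamp
    buckets[b][1] += 1

def average_latency(sender, receiver):
    """Use only buckets whose packet counts match at both ends; a lost packet
    poisons just its own bucket, so the other buckets still give the answer."""
    delay_sum, count = 0.0, 0
    for (s_sum, s_cnt), (r_sum, r_cnt) in zip(sender, receiver):
        if s_cnt == r_cnt and s_cnt > 0:
            delay_sum += r_sum - s_sum
            count += s_cnt
    return delay_sum / count if count else None

# Usage sketch:
sender = [[0.0, 0] for _ in range(NUM_BUCKETS)]
receiver = [[0.0, 0] for _ in range(NUM_BUCKETS)]
record(sender, "pkt-1", 0.000)
record(receiver, "pkt-1", 0.004)   # arrived 4 ms later
print(average_latency(sender, receiver))
```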
And the main idea in Carousel is you hash to partition the input set into smaller sets and
you cycle through the sets for fairness.
Okay. That's the main idea.
And so -- and I have to -- remember, I suggested a band? So there should have been
some music here, but I don't know how to play it. So this is a silent band, and it's called
The Edge. Okay? So I hope you'll remember that.
Thanks.
[applause]
>>George Varghese: Any questions? Yes?
>>: [inaudible]
>>George Varghese: Right. I don't assume a good source of randomness at all
[inaudible] hash-based. So the only thing -- in the analysis, we assume that -- we don't
assume any [inaudible] we assume perfect classic [inaudible] that the hash is
completely uniform. Right? And so ideally probably all these results could be
strengthened with twisting Michael's arm a little bit to four-way independence or
two-way or six-way, but we just ignored that. So we just used simple uniform -- so the
only one that might require a source of randomness is sample and hold. Sample and
hold you're actually sampling a packet, and that one requires a source of randomness,
but ideally, even that could be hash-based. You could look at the packet and take some
piece of the packet, like the source address, and you could do a hash on that, and if the hash value is,
you know, less than some threshold -- within some set of values -- you sample. So
in some sense you don't need a source of randomness. So I think networking people
prefer that because our experience has been even from doing Ethernet and other stuff,
it's very hard to do physical -- the only real randomness that works is to use some
source of physical randomness like shot noise or something, and it's very hard. So the
first Ethernet implementations all locked up, so our experience has been really terrifying. So
we would rather use hash-based algorithms.
So hashing seems to be fine, assuming enough entropy in the stream. So that's our
uniform experience from, you know, 10, 15 years. That just seems to be fine. And we
should go and dot the i's and cross the t's and check more, but we haven't done it.
Any other questions? While the band keeps playing silently [laughter].
>>Dave Maltz: Thank you very much. George is going to be here until next Friday.
[applause]