>> Jin Li: It is my great pleasure to have Jun Xu from Georgia Institute of
Technology to come to Microsoft Research and give us a talk on his research
work.
What's happening?
>>: I should get power down.
>> Jin Li: Okay. Okay. Jun got his Ph.D. from Ohio State University in the year
2000. In 2003 he received the NSF CAREER award. For his contributions to
establishing performance bounds for networking, in the year, I believe it's
2006 --
>> Jun Xu: IBM gave me some.
>> Jin Li: -- he received an IBM Faculty Award. Jun has done a series of
wonderful work related to computer network performance evaluation and
measurement, establishing bounds, and so on.
Without further ado, let's hear what Jun has to talk about in network data
streaming and his journey in signal processing.
>> Jun Xu: Thanks. Thanks for the nice introduction. I've actually been giving
this talk many times, but the last time I came to Microsoft Research to give a
talk, it was in a different building, in December 2004. I'm not sure how many
of you were in that audience. None. Which is good. So I think we were using --
>>: [inaudible].
>> Jun Xu: We were using a different title actually, because I didn't think of
this title until probably 2005 or 2006. So it was a different title.
And actually, by December 2004 I had done only about a quarter -- less than a
quarter -- of my data streaming work, so when I gave the talk last time, a lot
of this work had not been done. But I may spend the first 20 minutes talking
about the work I talked about in 2004, which I guess is okay since [inaudible].
So the work described in this slide file is joint work with my former Ph.D.
students Abhishek Kumar, Minho Sung, and Qi Zhao [phonetic], and with
collaborators from IBM, like Tim Lu [phonetic], and Jia Wang [phonetic] from
AT&T. I'm at Georgia Tech, on the faculty of the Networking and
Telecommunications Group.
The reason I call it A Computer Scientist's Journey in Signal Processing is
that my educational background is a hundred percent CS -- I've never taken any
course in electrical engineering -- but it turns out that when I study the
problems in this area, I have to familiarize myself with a lot of electrical
engineering concepts, like coding theory and information theory. They may not
be explicit in the work, but you have to have that type of understanding to
pursue this line of research. Of course, after you do the research and describe
the results, you may not use any of the terms from those theories.
But we do use techniques from them. So now let's talk about the motivation for
network monitoring. Basically, we want to [inaudible] why we need data
streaming to monitor a computer network -- why traditional monitoring
techniques will not work. The reason is that we need to monitor computer
networks for a lot of different quantities. One is elephant flows: you want to
find the largest flows in your network, meaning the flows that account for the
majority of the packets.
Sometimes we want to count the distinct flows -- how many flows are there --
and sometimes we want to find the average flow size. Actually, the number of
flows and the average flow size are equivalent, because when you multiply them
together you get the total number of packets in a timeframe, which you can
easily count. So if you know one quantity, you know the other. There are also
other quantities you might be interested in, like the flow size distribution.
All these have applications. For example, if you find the elephant flows, you
can do traffic engineering [inaudible], which means you want to route the
elephant flows properly; you don't care too much about the mice, because they
don't carry much traffic anyway. And numbers like the count of distinct flows
are useful for queue management: when you divide the bandwidth, you divide it
among these flows, so you need to know how many flows you are dividing this
limited bandwidth by.
Also, the flow size distribution is useful for anomaly detection. The flow size
distribution means you want to know how many flows have size one -- size one
meaning one packet -- how many flows have size two, how many have size three,
and so on. One typical application I often talk about: in the old days, when
you had a virus or worm, it usually had a fixed size, like 1,000 bytes or
something. The typical MTU, the maximum transmission unit, in the Internet is
like 500 -- 512 bytes plus the [inaudible] -- all these things.
>>: [inaudible].
>> Jun Xu: [inaudible] 576, I think that's a socket -- that's really a socket
[inaudible], which differs from one OS to another. Some OSes choose to send out
512-byte chunks, some choose to send out 536-byte chunks. Some --
>>: [inaudible].
>> Jun Xu: Yeah. So a thousand-byte worm will be cut into two packets, which
means all these worm flows have size two. So suddenly, if you have an epidemic
of a particular type of worm which is two packets long, you will see a lot of
flows of size two. If your profile says flows of size two number about a
thousand within a five-minute timeframe, and you suddenly see 10,000 flows of
size two within a five-minute timeframe, that may indicate some kind of worm
propagation behavior.
And there are lots of other quantities you want to measure, like per-flow
traffic volume: given any flow, you want to know, approximately, how much
traffic it contains. That's also useful for [inaudible] detection. And
oftentimes we also need to measure the entropy of the traffic; it turns out
entropy is a very important quantity.
I visited CMU, I think in 2004, and they talked about the need for measuring
the entropy of the traffic at a certain node. The reason they want to do it is
that they have a network of something like a thousand nodes, but they only have
enough computation power to monitor all the traffic at 10 nodes at a time. You
can think of a town of a thousand people: you don't hire a thousand policemen
to protect a thousand people, right? You cannot afford that. Usually a town of
a thousand people has only 10 policemen, right?
So their idea is: hopefully you can design a data streaming type of algorithm
to measure the entropy of the traffic, and when the entropy of the traffic
looks suspicious, you ship all the traffic at that particular node off for
further analysis.
It's just like citizens calling 911 only when the town is threatened, and then
the policemen come; usually 10 policemen are enough to handle all the 911
calls. That's basically their picture. But they needed the entropy estimation
algorithm, and they posed this problem. I came back from CMU, worked on it with
my student, and we solved the problem in about a month, and then we actually
wrote a paper together with the CMU team. So these all come from real
applications. It's not like I lock myself in my office, think about the paper I
have to send out next year, and come up with a problem. These are really not
artificial problems; they come from real applications.
By the way, some of you may wonder why traffic entropy can indicate anomalies.
For example, when you have a [inaudible] attack, you have lots of singleton
flows, and singleton flows increase the entropy of the traffic.
It also turns out there are a lot of unlikely applications of data streaming.
Traffic matrix estimation can be viewed as a data streaming problem.
Peer-to-peer routing can be viewed as a data streaming problem -- peer-to-peer
routing in unstructured peer-to-peer networks, not DHTs, but the BitTorrent
type of network -- it can be formulated as a data streaming problem; we
actually have a paper on that. Even IP traceback can be viewed as a data
streaming problem.
And we have papers on all these things. The challenge of high-speed network
monitoring is tremendous, simply because packets arrive so fast: a packet
arrives every eight nanoseconds on a 40 gigabit per second link, if we're
talking about minimum-size packets. I wrote 25 nanoseconds here because I'm
assuming a thousand bits per packet, which I call Craig Partridge's constant,
because Craig Partridge's paper assumed a thousand bits per packet, so I feel
safe putting 1,000 and nobody will criticize it.
But in reality, if you assume minimum packet length, we're talking about eight
nanoseconds per packet. So you have to use SRAM for per-packet processing.
Unfortunately, per-flow state is too large to be put into SRAM, because you
could have millions of flows, and if you put all the flow state into SRAM, it's
just too large.
The traditional solution, because your SRAM is too small for per-flow
processing, is to use sampling: you sample only a small percentage of the
packets, and because the traffic stream becomes much slower after sampling,
DRAM is fast enough to handle the packets that remain. So you process the
sampled packets with per-flow state stored in slow memory (DRAM). But of course
you didn't see all the traffic, which means your per-flow state does not record
all the traffic -- it records, for example, only one out of every 100 packets
or so. So you have to use some kind of scaling to recover the original
statistics, and if the sampling rate is very low, the blowup is large and the
accuracy is poor, because in scaling up you also scale up the noise.
And we are fighting a losing battle, because link speeds get faster and faster.
For example, right now at AT&T the sampling rate is, I think, one over 500 --
they sample 1 out of every 500 packets -- but as link speeds go up, they are
talking about one out of every 5,000 packets. With that kind of sampling rate,
your accuracy is not going to be very high.
Network data streaming is a smarter solution. The computation model for network
data streaming is to process a long stream of data in one pass: you look at a
packet and decide whether to use it to change your internal state. If you
decide not to change the internal state, the packet goes away, and you cannot
change your mind -- you cannot say, oh, I think I saw this packet a second ago,
but it's gone, can I get it back? You cannot. You have one pass, and you have
to make a real-time decision.
Of course, if you had an infinite amount of memory it would be no big deal,
because you would just store everything. But it turns out you have a very
limited amount of memory, so you have to be very judicious about what you store
in this limited amount of high-speed memory.
So the problem to solve is that you need to answer queries. If some of you are
video coding people: this problem can actually be viewed as an online
rate-distortion problem. You want to compress the data, but you have to make
real-time decisions -- not like image compression, where you can scan the image
a thousand times. You have to make a real-time rate-distortion decision.
The problem to solve is to answer some queries about the stream, at the end or
continuously. You can think of the queries as the goal, and you can actually
translate the goal into a distortion function, the rate-distortion function. So
the trick is to remember the most important information about the stream
pertinent to the queries. You are very much goal-oriented: remembering
everything is not possible, by information theory, so you try to remember the
most important information pertinent to the queries. Compared with sampling,
streaming peruses every piece of data for the most important information,
whereas sampling digests only a small percentage of the data but absorbs all
the information within that sample. That's the difference between sampling and
streaming.
My analogy is always as follows. When I was an undergraduate, studying was not
my only interest -- actually a minor part of my interests -- so I didn't spend
too much time studying my courses, but I still wanted to get reasonably good
grades at the end. So what I did was actually streaming: I had a very thick
book which I didn't read the whole semester, but at the end I would stream
through the book. I only had time for one pass; I didn't even have time to go
back and forth.
And my memory for one day is pretty small, so the data streaming part is that I
needed to answer queries relevant to the exam, which means I had some idea
about the queries that would be given in the exam. So I would peruse the book
-- go through it in one pass -- trying to remember the most important
information relevant to the exam the next day. That's basically the algorithm.
Obviously the sampling algorithm does not work: if I read page 1 of the book,
page 11, page 21, I'm not going to pass the exam, because I don't get the right
context. But by doing this kind of smart streaming, I think I managed to get
through my undergraduate years without studying too much. Without actually
learning too much.
Now I want to give you a [inaudible] example of data streaming. Given a long
stream of data, you want to count the number of distinct elements. This is a
1985 problem, so a very old problem, and it turns out more than a dozen
solutions have already been proposed; there are pros and cons to all these data
streaming algorithms, and they are good in different contexts. I'm not going to
go through the list of them; I'll only talk about the classical one.
Say the data stream is A B C A, C B D A; obviously the number of distinct
elements is four. But if I'm throwing a huge stream of elements at you, and
maybe millions of them are distinct, you're not going to be able to count the
number directly. So I want to talk about a simple algorithm. You choose a hash
function H whose output is uniformly distributed between zero and one --
basically it hashes into a uniform random variable. Let d1, d2, d3, ... be the
data items. You hash each and every data item, and you always remember the
smallest hash value. Isn't that easy? You keep one register, initialized to
plus infinity; you hash each item, compare the hash value with the register,
and set the register to the smaller of the two. You keep doing that, so you
always remember the minimum of the hash values seen so far.
Then we can easily prove that the expectation of this quantity -- you can view
it as a random variable, because the randomness comes from the hashing -- is
one over (F0 plus one), where F0 is the number of distinct elements in the data
stream. You can prove that: the minimum is actually a Beta random variable, and
you can do the expectation, yes.
Obviously, with one such hash function your accuracy will not be very good, but
you can always have, say, a hundred different hash functions and average the
hundred resulting estimates. To keep things simple I'll just say averaging, but
for those of you who have studied statistics, you know the median estimator
often behaves better than the mean estimator, and in some scenarios people have
come up with more advanced estimators, such as the harmonic mean estimator for
stable distributions with small p, things like that. Averaging is just one
generic way to say it, but with the many ways of averaging, you can make the
accuracy much higher.
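To make the classical algorithm concrete, here is a minimal sketch in Python --
my illustration, not code from the talk; salting one cryptographic hash with an
index stands in for the hundred independent hash functions:

```python
import hashlib
import statistics

def estimate_distinct(stream, k=100):
    """Min-hash distinct counting: for each of k salted hash functions,
    remember the smallest hash value seen; E[min] = 1 / (F0 + 1)."""
    mins = [1.0] * k
    for item in stream:
        for i in range(k):
            digest = hashlib.sha1(f"{i}:{item}".encode()).hexdigest()
            u = int(digest, 16) / 16.0 ** 40   # uniform value in [0, 1)
            if u < mins[i]:
                mins[i] = u
    # Invert E[min] = 1/(F0 + 1); averaging the k minima reduces the variance.
    return 1.0 / statistics.mean(mins) - 1.0

print(round(estimate_distinct("ABCACBDA")))  # four distinct items -> about 4
```

As just noted, replacing the mean of the minima with a median (or a fancier
estimator) typically behaves even better.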
So now we're ready to talk about our own data streaming work. This was one of
our earliest data streaming works, and it turned out to be an instant success:
it was actually the second piece of our data streaming work, and it received
the Best Paper award at ACM SIGMETRICS 2004.
The problem is to estimate the probability distribution of flow sizes:
basically, you want to know how many flows have size one, how many flows have
size two, how many have size three, and so on. The applications we already
talked about -- traffic characterization and engineering, billing and
accounting, anomaly detection -- all depend on this flow size distribution
estimation.
It's also very important because once you have the distribution, you have
everything else: if you want the first moment, you have the first moment; the
second moment, you have the second moment; whatever you want, you can estimate
from the distribution.
The definition of a flow is very flexible. Typically we talk about a flow as a
bunch of packets with the same source IP, destination IP, source port,
[inaudible]. That's the definition of a flow, but it could be different; it's a
generic definition. For example, sometimes we just use source IP and
destination IP to define a flow, and don't care about the port numbers. That
could also be a flow. So the definition is very flexible.
The architectural solution is very simple; it's basically what we call a
[inaudible] data structure. We maintain an array of counters in fast memory. An
array of counters just means each element of the array is a counter: it counts
from zero, and each time you increment it by one. It's just that simple.
For each packet, a counter is chosen through hashing and incremented -- I'm
going to show an animation of that. And there is no attempt to detect or
resolve collisions: if two flows hash to the same counter, we let it be, and we
will have some statistical means to recover from this kind of collision.
And with some additional innovation, which comes later, every 64-bit counter
uses only four bits of SRAM. The thing is, some flows can be very large, so a
counter's value can go beyond four billion -- more than 32 bits -- so you have
to provision each counter to some maximum size, and 64 bits is typically large
enough. But if you make every counter 64 bits, it's too wasteful, because these
counters are in SRAM. So you want to reduce the size of the counters, and there
are systematic means to do it, which I'm going to talk about in the second half
of the talk.
Data collection is [inaudible] but very fast. Here is the animation. Think
about this processor and this array of counters. A packet comes; you hash its
flow ID; it goes to a particular location; you increment that counter by one,
so the counter value goes from zero to one. Then another packet comes, from a
different flow; it goes to a different counter, whose value also goes from zero
to one. Then packets belonging to the same flow come; they go to the same
counter, and obviously the value goes to two. Then a new flow, a third flow,
comes. Unfortunately, its flow ID hashes to the same location as the red one.
Here I show you that there are two red and one yellow, but in reality you don't
have these tags; you only have the value, so you don't know about the
collision, and you still change the value from two to three. Make sense?
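The whole online encoding fits in a few lines. A minimal sketch, with a salted
SHA-1 standing in for the hardware hash function:

```python
import hashlib

def encode(packet_flow_ids, m):
    """One pass over the packet stream: hash each packet's flow ID to one of
    m counters and increment it; colliding flows silently share a counter."""
    counters = [0] * m
    for flow_id in packet_flow_ids:
        idx = int(hashlib.sha1(flow_id.encode()).hexdigest(), 16) % m
        counters[idx] += 1   # no flow IDs or tags are stored, only counts
    return counters

# A counter value of 2 could be one flow of size two or two colliding size-one
# flows -- exactly the ambiguity the decoding step must resolve statistically.
print(encode(["red", "red", "yellow", "blue"], m=8))
```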
So you have a collision, but you don't know it, okay? That's basically the
encoding algorithm. What matters is the decoding. I want to show you the raw
counter values after all the increments. We're talking about a packet trace
from AT&T -- I'm sorry, this is a one-hour trace, and it has about a million
flows in it. This curve is the actual flow size distribution: this axis is the
flow size, and this is the frequency. You can read it as, say, 300,000 flows of
size one, and it goes down like this.
The other curves are the raw counter value distributions. This one is when you
have one million counters; this one is with half a million counters; this one
with a quarter million; and so on.
You may say the difference is not that large, especially between these two
curves, but that's not the case, because this is a log scale. We're talking
about the difference between decoding something like 200,000 flows when it's
actually something like 400,000 flows. There's a huge difference down there,
because of the log scale.
So the --
>>: [inaudible] not that big, right, you don't really care about the
[inaudible].
>> Jun Xu: Well, when you estimate the number of flows of size, say, 80, you
don't get too much error, to a certain extent. But when you estimate the number
of flows of size one, your error is huge.
>>: Do you care about --
>> Jun Xu: In some situations you have to care about it. For example, if you
want to know whether you are under a DDoS attack -- especially ISPs like AT&T
care, because they provide a scrubbing service: they scrub the DDoS traffic for
their customers, so they first have to detect it for those customers.
For example, the usual number of flows of size one might be 200,000, but during
an attack it could be 400,000. If you cannot distinguish between 200,000 and
400,000, you may miss a DDoS attack.
>>: You don't know what kind of [inaudible] you are getting for [inaudible].
>> Jun Xu: This is just the raw counter values. I just want to give you some
idea: if you treat the raw counter values as the flow size distribution, this
is the kind of error you're going to get. Of course we're not going to treat
them as they are; we're going to estimate the flow size distribution from these
raw counter values. That's all we've got, so we have to do some estimation.
First, some quick and dirty estimates.
Let the total number of counters be m -- a known quantity -- and let the number
of counters with value zero be m0, also a known quantity: you can simply see
how many counters have value zero. Then immediately you can estimate the total
number of flows as n-hat = m ln(m / m0), where ln is the natural log. These are
all known quantities, and this is a pretty good estimator.
Starting from there, you can estimate the number of flows of size one. How? Let
the number of counters with value one be y1, also a known quantity: you scan
the counters and count how many ones there are. Then you can come up with a
pretty good estimate of the number of flows of size one as n1-hat = y1 *
e^(n-hat / m), where n-hat is the estimator above.
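Put in one place, the two quick-and-dirty estimators just described -- a sketch
with my own variable names, not the original code:

```python
import math
from collections import Counter

def quick_estimates(counters):
    """Estimate the total flow count and the number of size-1 flows from the
    raw counter array. Under Poisson hashing, P(counter stays 0) ~ e^(-n/m)
    and P(counter is exactly 1) ~ (n1/m) * e^(-n/m)."""
    m = len(counters)
    hist = Counter(counters)
    m0, y1 = hist[0], hist[1]           # counters at value 0 and at value 1
    n_hat = m * math.log(m / m0)        # total flows: m * ln(m / m0)
    n1_hat = y1 * math.exp(n_hat / m)   # size-1 flows: y1 * e^(n_hat / m)
    return n_hat, n1_hat
```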
Conceivably you can generalize this process: once you have n1-hat, you can
construct n2-hat, then n3-hat, and so on. But this approach is not going to
work. Why? Because, as you can see, n1-hat has n-hat inside it, and indeed
n2-hat would have n1-hat and n-hat inside it. By the time you estimate
n100-hat, it would have n99-hat, n98-hat, and all the others inside it. So the
errors accumulate, which is no good. You need a holistic solution, which is
joint estimation using the expectation-maximization (EM) algorithm: instead of
estimating sequentially, you estimate in a holistic fashion.
So you estimate the entire distribution using expectation maximization. The
solution is basically very common-sensical. It's an EM algorithm, so you begin
with a guess of the flow size distribution. This guess induces a probability
space. Then you look at all the counter values and reason about how each
counter value would split, probabilistically -- and how it splits depends on
the probability space induced by the guess.
Based on the ways a particular counter value can split and the respective
probabilities of those events, you compute a refined estimate of the
distribution: when you actually split the counters according to these
[inaudible] statistics, you get a new distribution, and this new distribution
is used to do the whole thing again. You repeat this multiple times, which
allows the estimate to converge to a local maximum. That's basically the spirit
of expectation maximization.
To give an example: a counter value of three could be caused by three kinds of
events. "Three equals three" means no collision: a counter value of three is
indeed one flow of size three. "One plus two" means a flow of size one collided
with a flow of size two, showing up as three. And "one plus one plus one" means
three flows of size one collided into the same location, showing up as three.
Suppose the respective probabilities of these three events are 0.5, 0.3, and
0.2. How do you compute these probabilities? Remember, you have the initial
guess of the distribution, so you can compute the a priori probabilities, and
then, based on what you observe, you can compute the posterior probabilities --
Bayes' formula stuff. It's pretty standard.
So suppose the posterior probabilities are 0.5, 0.3, and 0.2, and suppose you
have 1,000 counters with value three. Then you claim that 500, 300, and 200 of
these counters split in these three ways, respectively. That's the expectation.
And what do we mean by split? For example, the 500 counters that split as
"three" contribute no fragments of size one -- a pure size-three flow has no
size-one fragment. The 300 counters that split as "one plus two" each
contribute one fragment of size one. And the 200 that split as "one plus one
plus one" each contribute three fragments of size one: 200 times three. So you
credit 900 to your count of flows of size one. Of course, this is just for
counter value three; you do this for all the counters of value two, value
three, value four, and so on, and you get a bunch of new counts, and you treat
these new counts as a new distribution. You do this over and over until it
converges.
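The bookkeeping in this worked example is easy to mechanize. A toy sketch of
the expectation-step credit assignment, using exactly the numbers from the
slide:

```python
from collections import defaultdict

def redistribute(num_counters, splits):
    """Split counters of one observed value into flow-size fragments in
    expectation: `splits` maps each composition (a tuple of flow sizes)
    to its posterior probability."""
    credits = defaultdict(float)
    for composition, prob in splits.items():
        for size in composition:
            credits[size] += prob * num_counters
    return dict(credits)

# 1,000 counters of value 3, splitting as {3}, {1+2}, {1+1+1} with posterior
# probabilities 0.5, 0.3, 0.2:
print(redistribute(1000, {(3,): 0.5, (1, 2): 0.3, (1, 1, 1): 0.2}))
# -> {3: 500.0, 1: 900.0, 2: 300.0}   (900 credited to size one, as above)
```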
>>: [inaudible] sure about the convergence. So basically after this analysis
they credit 900 to [inaudible] returns.
>> Jun Xu: No, let's see -- 500 to n3, and 300 to n2, and 900 to n1. Yes.
>>: [inaudible].
>> Jun Xu: Let me see if I have that slide. I think -- I'm not sure if I have it.
Okay. I don't have that slide. Okay.
>>: [inaudible].
>> Jun Xu: Once you have all these new counts n1, n2, n3, you normalize them
into probability values -- the percentages. And it really goes back to how you
compute the a priori distribution. If you look at a particular location, you're
talking about the hashing process, and the hashing process follows a Poisson
model [inaudible]. So you can think of these n1, n2 values as being translated
into the lambdas inside these Poisson distributions. All these terms multiply
together, and when they are all multiplied together and you do a normalization,
it becomes the posterior distribution.
So basically you start with a guess of the flow size distribution, and you get
all these lambdas, and then --
>>: [inaudible].
>> Jun Xu: Then you get new lambdas -- by normalizing you get new lambdas --
and you keep doing this.
>>: Okay. Okay.
>> Jun Xu: And then of course you have an M step, which computes -- the result
of the M step is this. That's basically the spirit of the algorithm. But I
don't have that slide here. It's an EM algorithm, and people have proved that
the EM algorithm is guaranteed to converge -- though unfortunately not
necessarily to the maximum likelihood estimate.
This one shows you the actual results. Remember, I already showed these curves
before: this one is the actual flow size distribution, and this is the raw
counter values when you have one million counters -- almost as many counters as
flows -- and still the difference is quite large.
We were supposed to show three curves, but I only show two, and the reason is
that our estimate overlaps with the actual distribution pretty well. Of course,
if it were completely identical it would look fishy, so to avoid that we added
some glitches here and there -- so there is some difference, but it overlaps
pretty well. Some of you may say I hide the difference by drawing on a log
scale. I do draw it on a log scale, but the actual difference is bounded by
like two, three, five percent -- not that much. Overall, I think the difference
is around two to three percent.
>>: Two to three percent is [inaudible].
>> Jun Xu: Yes. You could think, for example: if the actual number of flows of
size one is 400,000 and you make an error of around 80,000, that 80,000 would
be 20 percent.
>>: I don't think [inaudible].
>> Jun Xu: It's not a bound, it's just empirical -- empirically like five
percent difference. Even [inaudible] cannot bound its performance. So our
estimation works really well. We have a bunch of other experiments; this one is
a comparison with sampling. This curve means you sample with probability 0.1 --
one out of ten -- and estimate from the samples. Nick Duffield's [phonetic]
group at AT&T has an algorithm for inverting the sampling, which also uses the
EM algorithm; this is inverting from ten percent sampling, and this is
inverting from one percent sampling. You can see that the actual flow size
distribution is covered very well by our algorithm, but inverting from sampling
performs very badly, because it's really a moving average: it cannot follow the
ups and downs in this kind of flow size distribution, which can be important
for anomaly detection purposes.
The reason it's a moving average: if you recover from sampling at, say, one
over 100, and you see a flow of size one, you don't really know whether the
actual flow size is one, or 10, or 100. You have no idea where it came from.
That's the reason sampling will not work very well.
I'm going to skip the next -- that's the only one I put here. Good. So that's
the flow size distribution work. One technical problem that came out of this
flow size distribution work is that we had to use an array of counters, and
each counter is 64 bits, which is very expensive. People had designed certain
algorithms earlier; it turns out the algorithm from George Varghese's
[phonetic] group requires nine plus two bits -- a nine-bit counter plus two
bits of [inaudible], which is 11 bits. And 11 is a number computer science
people don't like: you don't have 11-bit SRAM or things like that. We like nice
numbers, like four.
So I felt we could do better. The problem statement: we want to maintain a
large array of counters that need to be incremented by one in arbitrary
fashion. Think about maintaining this array A; the customer comes with indices
i1, i2, ..., and each time the customer comes with an index, that entry is
incremented by one. This sounds like a trivial problem, but it becomes
non-trivial when the customer specifies such an increment every eight
nanoseconds, and also when you want to spend as little money as possible.
So increments come very fast, and the values of some counters may be large --
some flows are large, so their counters can be large. If you provision every
counter for the worst case, you give everybody 64 bits, which can be very
expensive: fitting everything into an array of long SRAM counters costs a lot.
And some of you may say caching might work, but caching will not work, because
the access sequence may have no locality.
There's lots of motivation. Now that I put things in this context, you'd think
the motivation comes entirely from data streaming, but it has lots of
applications beyond data streaming. The first motivation is of course this
hash-and-increment pattern: you hash something and increment a counter. But
there are other motivations. For example, routers may need to keep track of
many different counts -- say, counts of packets for different source IPs
[inaudible], or for source-prefix, destination-prefix pairs.
And I feel these kinds of counter arrays can help us implement millions of
[inaudible] in routers, so Cisco people may be interested in this as well. And
it's extensible to other, non-CS applications, like sewage management.
Think about it: if your subdivision has a hundred thousand people, think of a
hundred thousand toilets. You don't want to design the cesspool for a hundred
thousand toilets flushing at once; you want the cesspool to be smaller than the
total size of all these toilets, right? Then this kind of algorithm will work.
And our work is basically able to get the effect of 16 SRAM bits out of one
[inaudible]. Let me first describe the main framework, which comes from the
earlier work. The idea is that all the counter increments go to very short SRAM
counters, and when a short SRAM counter overflows -- for example, an 8-bit SRAM
counter approaching 255, 256 -- a counter management algorithm flushes the
overflowing counter to DRAM. If you have one million SRAM counters, you also
have one million DRAM counters, in one-to-one correspondence. Whenever a
counter overflows, you just flush it over: if this one overflows from 255 to
256, you reset it to zero and increment the DRAM counter by 256. It's just that
simple.
It turns out this algorithm is very hard to design. Why? Because you have to
work with arbitrary increment patterns. For example, you could come up with a
policy that says: if you are half full, then you flush. But the adversary can
bring every counter to, say, 127, and then hit 128 on all of them
simultaneously -- then you're dead.
No matter what policy you come up with, you have to imagine an adversarial case
that makes your strategy fail. So this thing turns out to be extremely hard to
design.
>>: [inaudible] how often did you have to go to the [inaudible]?
>> Jun Xu: How often? Well, it depends on how large your counter is -- whether
the counter is shorter or longer [inaudible]. Of course you want to design
something as short as possible; that's basically the competition: if we design
something much shorter than the previous schemes, we win. But to make it much
shorter, we have to be much smarter about how we do it. That's basically the
challenge.
Previous approaches used different kinds of counter management algorithms --
CMA, which in real life also stands for condo management association or
something; think of my subdivision analogy. Anyway.
Nick McKeown's [phonetic] team has a scheme that basically implements what I
call fullest first -- largest counter first: if your counter value is the
largest, you have the highest priority to flush. But the algorithms aren't very
efficient. For example, even when the SRAM-to-DRAM speed difference is 12, four
bits should be enough: four bits reduces the flushing urgency by a factor of
16, and 16 is bigger than 12, so it's enough.
But fullest first needs eight bits, which is twice four bits. Another problem
is that they implement the whole thing as a heap, and with a heap you have to
have pointers. With one million counters, the log of one million is about 20,
so the pointer -- I think you probably only need one pointer -- takes 20 bits
per counter, which is very wasteful.
So the total is like 28 bits per counter, even though the theoretical minimum
is four. But they declared victory because they went down from 64 to 28. And
they need a pipelined implementation of a heap, which is non-trivial; and
energy-wise it's not good, because any time you have a pipeline, it costs some
energy.
There is later work by George Varghese [phonetic] whose CMA is much more
efficient. I'm not going to go through the details, but they go down to eight
SRAM bits per counter plus two bits for the control logic. So Nick McKeown's
team is like 8 plus 20, and they can do 8 plus 2. And the hardware logic is
probably simpler than Nick McKeown's team's solution.
Our scheme does much better: we need only four bits, which is the minimum. And
our idea is very simple: we flush only when the SRAM counter is completely
full. Only a completely full toilet may flush -- that's basically the policy.
And we use a small SRAM FIFO buffer to hold the counters waiting to be flushed
to DRAM; think of a very small cesspool receiving all the flushes from the
toilets. Then there's the key innovation: a simple randomized algorithm ensures
that counter overflows never come in a burst large enough to overflow this
FIFO buffer -- because if every household flushed its toilet simultaneously, we
would be dead.
So how do we do it? It's very simple: we set the initial values of the SRAM
counters to independent random variables. It's like bootstrapping: at the very
beginning, you set the initial counter values to uniform random variables,
uniformly distributed between zero and 15. Why 15? Because we're talking about
four bits, so the value range is only zero to 15.
But of course you are doing counting, and counts have to start from zero, so
you have to remember the seed value. How do you remember it? Very simple: you
initialize the DRAM counter to the negative of the SRAM seed value, so the
total starts from zero. And this is analyzed in the online algorithm setting:
the adversary knows our randomized scheme but not the initial values of the
SRAM counters. That's how it works.
And we prove rigorously that a small FIFO queue ensures overflow happens with
only a very small probability. So basically this community has a city
ordinance: it says that when a new homeowner moves in, you have to set the
water level in your toilet to a uniform random value. And then a small cesspool
will work for a very long time.
Let's look at a numeric example: one million counters, four-bit SRAM counters,
64-bit DRAM counters, and a speed difference of 12, with parameters like these.
For one million counters you don't need one million buffer slots; you need
only, for example, 300 slots in the FIFO queue for the counters waiting to be
flushed. Then after 10 to the power 12 -- a trillion -- counter increments in
arbitrary fashion -- think about eight hours of 40-gigabit-per-second traffic
at 40 million packets per second -- the probability of overflowing this FIFO
queue is like 10 to the minus 14, which I think is exactly the digital fountain
undecodability probability. So the mean time between failures for one of these
counter arrays is a hundred billion years, which is longer than the time since
the Big Bang -- I think the Big Bang was, I don't know, 13-point-something
billion years ago.
I actually went to Cisco to give this talk, and they told me that they are
usually nervous about randomized algorithms and their failure probabilities,
but [inaudible] said he looked at [inaudible] and felt pretty comfortable. He
said if it were once every thousand years -- I would think even once every year
is fairly reasonable -- but this is once every hundred billion years, so it's
even better. And even if it happened once a day, you could [inaudible] with it:
you lose one counter value. And I told them, this is California: the mean time
between earthquakes is, I don't know, a hundred years, so anything beyond a
hundred years will work in California. And this place as well, right? The mean
time between [inaudible] here is probably a hundred years or something.
So the conclusion for this part is that data streaming is a very powerful tool
for network monitoring -- I'm sorry, I don't know why I still have this typo
after so many years. Challenging research problems arise due to the stringent
space and speed requirements.
Let's see -- actually, we didn't talk about distributed data streaming, and we
probably won't have time to cover it. But we do distributed data streaming as
well. Distributed data streaming means you have lots of data streams and you
want to find some statistics about their union without actually computing the
union -- because computing the union is just like shipping all the data from
the datacenter machines to a single machine for processing, and you just cannot
do that. You want to compute some statistics over the data across all these
datacenter nodes without shipping all the data together, without reading every
piece of data.
That's actually a pretty challenging thing. And I do have a second slide file,
which is mathematically very exciting.
>>: Before you continue, let me ask some questions.
>> Jun Xu: [inaudible] data streaming.
>>: [inaudible].
>> Jun Xu: It is data streaming. But go ahead.
>>: Actually [inaudible].
>> Jun Xu: Yes.
>>: What's your comment on that [inaudible]?
>> Jun Xu: Well, Counter Braids -- let me first summarize the Counter Braids
idea. Counter Braids basically wants to be able to decode the size of each and
every flow. That means they need a sketch -- basically you have a sketch -- and
they also gather all the flow IDs offline.
>>: So basically, in addition to your hashing from the first part of the talk,
they hash [inaudible] and they try to [inaudible].
>> Jun Xu: Exactly. Exactly. Yes. And also, they do variable-length encoding,
which means they have fewer and fewer counters -- a pyramid type of scheme with
fewer and fewer of the high-order counter bits -- so they can save space. And
when a packet comes in, you hash its ID to multiple locations and increment
them.
>>: But I think they argue they can do exact [inaudible], while here you are
basically doing [inaudible] estimation.
>> Jun Xu: They can -- well, they have more input than I do, and they have more
output than I do. They need all the flow IDs to be stored in order to decode,
because the flow IDs are translated into the decoding matrix.
>>: Okay.
>> Jun Xu: Without the decoding matrix -- of course you can map the flow IDs to
the decoding matrix and then decode, but having the decoding matrix is
basically the same as having all the IDs. So they need all the IDs, and with
the IDs they can find the flow size of each and every ID. That's basically what
they do.
And --
>>: [inaudible] but we can ask them later on. Because that's actually pretty
[inaudible] stuff, as far as requirements, if you have to do that.
>> Jun Xu: Okay. But their decoding is not instant: if you want to decode each
and every flow in real time, you cannot. You have to spend like 30 seconds at
the end to decode everyone -- decoding a batch for everyone costs you 30
seconds. If you [inaudible] over all the counters, it may work.
>>: [inaudible] question I will ask you [inaudible].
>> Jun Xu: Yeah. They actually cited several of our works, including the
counter array -- because our work did, to a certain extent, blow a hole in
their work, in the sense that we can go from 64 bits to four bits; with their
counter scheme, there's no way they can go from 64 bits to four bits. So they
have to argue that in certain applications you don't want SRAM-to-DRAM traffic,
because that costs you bus traffic and things like that. In those situations,
their solution is entirely SRAM.
>>: I think you also have some kind of [inaudible], I mean, they also
[inaudible].
>> Jun Xu: But they don't have SRAM-to-DRAM traffic. So I think -- but I think
we --
>>: I think they actually should be able to do that as well, because they
potentially have a second stage. So they [inaudible], I mean, the second part
of [inaudible].
>> Jun Xu: Yeah.
>>: [inaudible].
>> Jun Xu: Oh, no -- yes, they can use mine, but they specifically say they
don't want to use mine. Because if they used mine, I think nobody would want
their work, since their work involves complicated encoding and decoding. So
they have --
>>: They have, they [inaudible].
>> Jun Xu: You will get exact counts, yeah. But basically what they could do is
gather all the flow IDs and also use, for example, our algorithm -- they could
just use our counter array to do it. So to [inaudible] their work, they have to
say that in certain situations you don't want SRAM-to-DRAM traffic; so they
have an SRAM-only solution. Their solution is SRAM only.
>>: One of the basic things I'm not clear on is whether they actually need to
know all the IDs of the flows, or they just need to basically --
>> Jun Xu: [inaudible].
>>: [inaudible].
>> Jun Xu: Right now they need all the flow IDs, and they're working on some
kind of partial decoding -- there's some partial decoding work going on. But I
had a very long conversation with the authors, both [inaudible] and the first
author, the student.
>>: Another question of mine is about the snapshot -- is this actually the
snapshot [inaudible]? I mean, for [inaudible] transferring from SRAM to DRAM
very [inaudible], but basically sometimes you need to get the data out.
>> Jun Xu: I see.
>>: I mean, for [inaudible], the Counter Braids guys definitely need to get
their data out. They basically need to take a snapshot, and the only thing I
can think of is that during the snapshot you have to stop.
>> Jun Xu: Well, you could have what they call a ping-pong buffer, which means
you're throwing in twice as many resources -- you need like --
>>: Basically a shadow memory.
>> Jun Xu: [inaudible] shadowing or something.
I tend not to deal with things like this, because that's a separate issue. You
could have some generic solution for things like that.
>>: [inaudible].
>> Jun Xu: It may not be [inaudible] -- you can think there could be some
coding, there could even be a coding solution, but that's a different
dimension. The shadow period is very small compared to the [inaudible] period,
because you are doing a sequential read -- the way you read the data out is a
sequential read, which is much faster than random-access reads -- so you can
think of it as only one-tenth. You could even imagine some error-correcting
code type of solution for it. But that's --
>>: [inaudible].
>> Jun Xu: But I want to isolate this as a generic problem -- not just for this
work, but for many other such ping-pong buffer type questions.
So now I'm going to switch to a different slide file -- let's see, here it is;
it's from my student's [inaudible] talk. It's a data streaming algorithm for
estimating the entropy of flows, and I chose to talk about this because, first,
it's actually distributed data streaming -- that's the first thing. Second, it
uses some fancy math, and it turns out this fancy math is very easy to
understand. And it also connects very well with a classical theory result.
Sometimes, if I don't have this kind of connection, I feel very bad, because
theory people are doing deep theory work and there can be some suspicion that
one is duplicating theory results through randomization or by obfuscating the
terminology. But this one connects very well with the theory result, and I did
check carefully that our work is not just a permutation or obfuscation of an
existing theory result. And I have to be honest with you: the real value of
this work is, I think, debatable. But mathematically, I think it's beautiful
work.
The problem is very simple. We had a piece of work done in 2005 estimating the
entropy; I'm not going to talk about that today, because the technique there is
pretty standard in theory. This problem is a much harder one. Our entropy
estimation solution was one of the earliest; around the same time there were
about a dozen solutions for entropy estimation. None of these solutions is able
to estimate the entropy of the union or the intersection of two data streams:
you can estimate the entropy of a single data stream S1, but you are not able
to estimate the entropy of the intersection of data streams S1 and S2.
But it turns out that estimating the entropy of the intersection of data
streams is very important. For example, consider OD flows -- all the traffic
that goes from an origin to a destination -- and you want to estimate the
entropy of an OD flow. What is an OD flow? It's basically the intersection of
all the traffic that enters at a particular origin with all the traffic that
exits at a particular destination. So you can see these are intersections.
>>: So these are the --
>> Jun Xu: These are the ingress points; these are the egress points of the
ISP.
>>: [inaudible] we are looking at all the flows from basically the --
>> Jun Xu: For a particular i and a particular j -- a particular O and a
particular D -- we're talking about the entropy of all this OD flow traffic.
>>: Okay.
>> Jun Xu: Which is the intersection of all the ingress traffic here with all
the egress traffic there -- because some traffic originates here but goes
somewhere else, and some traffic exits here but did not originate there. So
it's really an intersection.
>>: What information do you have [inaudible]?
>> Jun Xu: What information do we have? The only information we have is that
everybody here sees all the traffic that comes in, and everybody there sees all
the traffic that goes out. That's the only information. We don't need any kind
of back-channel communication during the measurement. The solution is
basically: you digest the data stream here and get a sketch; you digest the
data stream there and get a sketch; and you compare the two sketches. We do
have some communication, but it's basically piecing together the two sketches,
which is much smaller than the actual traffic.
So now let's talk about entropy. Entropy is basically this quantity and this
equation -- this slide was made by my student; I don't have as much time for
polishing, but he makes much better slides than I usually do. This is the
entropy definition, and you can think of p_i as the fraction of traffic from
the i-th flow.
What we actually measure is the empirical entropy, not the actual entropy; the
actual entropy is different -- it requires a specific model and things like
that -- so we measure the empirical entropy of a data stream. It turns out that
in many situations, estimating the entropy is equivalent to estimating
something called the entropy norm: since p_i is equal to m_i over M, where M is
the total number of packets and m_i is the frequency of the i-th flow, instead
of estimating the sum of (m_i / M) log (m_i / M), you can estimate the sum of
m_i log m_i. If you know M, the two are interconvertible.
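To spell out the conversion being alluded to, with M the total packet count and
m_i the size of flow i:

    H = - sum_i (m_i / M) log (m_i / M)
      = log M - (1 / M) sum_i m_i log m_i

so the empirical entropy is recoverable from the entropy norm, sum_i m_i log
m_i, together with M, and vice versa.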
So it's equivalent to estimate the entropy norm, and it turns out the entropy
norm is much easier to manipulate than the entropy. As for motivation, there's
lots of motivation for entropy -- anomaly detection, traffic clustering;
everybody is very interested in entropy for DDoS attack detection and traffic
engineering.
The theory we use is the theory of stable distributions. A distribution D is
called p-stable if, for any constants a_1 through a_n and i.i.d. random
variables X, X_1 through X_n drawn from this p-stable distribution, the sum
a_1 X_1 + ... + a_n X_n has the same distribution as the l_p norm of
(a_1, ..., a_n) times X, that is, (|a_1|^p + ... + |a_n|^p)^(1/p) times X. For
example, the Gaussian is 2-stable: if X_1 through X_n are standard Gaussians,
then the sum is distributed as (a_1^2 + ... + a_n^2)^(1/2) times a standard
Gaussian -- the variances add up, and of course you take the square root. This
was discovered in 1907 by Paul Levy -- sometimes you regret not being born a
hundred, two hundred years ago; then these results could have been discovered
by you.
So the property of stable distribution we talk about it. So Gaussian is two stable,
Coshie [phonetic] is one stable, and cos forms are only known for these three
values and there are -- but there are known formulas for generating samples,
samples from P stable distributions, it's called Chambers Formula.
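For reference, here is a minimal Python sketch of that sampling recipe, the
Chambers-Mallows-Stuck formula for the symmetric case (my own illustration,
not code from the talk; the function name sym_stable is mine):

    import math
    import random

    def sym_stable(p, rng=random):
        """One sample from a symmetric p-stable distribution, 0 < p <= 2,
        via the Chambers-Mallows-Stuck formula."""
        theta = rng.uniform(-math.pi / 2, math.pi / 2)  # uniform angle
        w = rng.expovariate(1.0)                        # standard exponential
        if abs(p - 1.0) < 1e-9:
            return math.tan(theta)                      # p = 1 is exactly Cauchy
        return (math.sin(p * theta) / math.cos(theta) ** (1.0 / p)
                * (math.cos((1.0 - p) * theta) / w) ** ((1.0 - p) / p))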
Now, there is an existing piece of work -- that's not my work, that's
Indyk's. This is his journal paper; I think his conference paper was around
1999, much earlier. Basically he figured out how to estimate the p-th
frequency moment of a data stream, which is this quantity: you can think of
m_i as the number of packets in flow i, and his method allows you to estimate
the sum over i of m_i raised to the power p,
using the stable distribution. Basically this is the algorithm: for each flow
you draw a stable-distributed random variable X_i, and the counter starts from
zero. Then for every packet you see, you increment the counter by that
packet's X_i -- you can think of a hash function which maps the flow ID to a
stable random variable. You can see that after you have processed all the
packets, the final counter value is distributed as the L_p norm of the
flow-size vector times a standard stable random variable.
I mean, it is easy to verify that; this part was already done several years
ago. And then this becomes a parameter estimation problem, right? And how do
you do parameter estimation? You run this experiment multiple times and then
you extract this quantity. By the way, oftentimes we think of the mean
estimator, but the mean estimator does not work here, because stable
distributions are heavy-tailed and the sample mean does not converge.
But this can be done using the median estimator, and the median estimator
works very well. It turns out, though -- some researchers recently discovered
this -- that when p approaches zero, even the median estimator does not work
that well; it turns out the harmonic mean estimator works very well there. So
there has been a lot of research on what's the best estimator in these kinds
of contexts.
But anyway --
>>: [inaudible] a number of these basic counters, which is basically
[inaudible] number of times --
>> Jun Xu: Exactly, exactly, yeah. And [inaudible] hash functions [inaudible]
function, whatever.
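Putting those pieces together, here is a minimal Python sketch of the scheme
as described above (my reading, not the paper's code; stable_sketch and
estimate_lp_norm are my names, sym_stable is reused from the earlier snippet,
and the per-packet re-derivation of the stable variables stands in for a
proper pseudorandom generator):

    import random

    def stable_sketch(packets, p, k, seed=0):
        """k independent p-stable counters; packets is an iterable of
        flow IDs, one entry per packet."""
        counters = [0.0] * k
        for flow_id in packets:
            for j in range(k):
                # Deterministically re-derive the same X_ij for every
                # packet of flow i.
                rng = random.Random(hash((flow_id, j, seed)))
                counters[j] += sym_stable(p, rng)
        return counters

    def estimate_lp_norm(counters, p, calib=10001):
        """Median estimator: counter_j ~ ||m||_p * S_j with S_j a standard
        p-stable variable, so median|counter| / median|S| estimates ||m||_p."""
        rng = random.Random(42)
        samples = sorted(abs(sym_stable(p, rng)) for _ in range(calib))
        scale = samples[calib // 2]          # Monte Carlo median of |S|
        obs = sorted(abs(c) for c in counters)
        return obs[len(obs) // 2] / scale

    # The p-th frequency moment is then estimate_lp_norm(...) ** p.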
And this is exactly what I call the Christmas conjecture -- I thought of the
conjecture on Christmas Eve. I wanted to estimate the entropy, so I was
thinking: is it possible to represent this entropy function, or the entropy
norm function, by a linear combination of what we call power functions,
something like this? It turns out we can. I think my student actually figured
it out in one day. I had written a small program and played with it in MATLAB,
but I always had a hump -- I could fit the regression very well on most of the
points, but there was always a hump somewhere, which I hated a lot. So I asked
my student to find out whether you can do it; at that time I asked him for an
approximation with four power functions, but he actually returned a much
better result: you can approximate with two.
>>: Okay.
>> Jun Xu: It turns out a function like x log x can be approximated very well
by a family of functions of this form: x raised to the power 1 plus alpha,
minus x raised to the power 1 minus alpha, multiplied by a constant c. It
turns out this c is actually equal to 1 over (2 times alpha). So it's a family
of approximations, and it turns out the proof is straightforward using Taylor
expansion. It's just a Taylor expansion.
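For the record, the Taylor expansion behind this (assuming natural log) is:

\[
\frac{x^{1+\alpha} - x^{1-\alpha}}{2\alpha}
= x\,\frac{e^{\alpha \ln x} - e^{-\alpha \ln x}}{2\alpha}
= x\,\frac{\sinh(\alpha \ln x)}{\alpha}
= x \ln x + \frac{\alpha^{2}}{6}\, x (\ln x)^{3} + O(\alpha^{4}),
\]

so the relative error is roughly \(\alpha^{2} (\ln x)^{2} / 6\), which stays
small as long as \(\alpha \ln x\) is small.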
And just look at the numerical values. I'm sorry, I apologize -- this is not
over 2; this is basically 10 times (x to the power 1.05 minus x to the power
0.95). That's a typo on the slide. So let's see how well we approximate
x log x with this function on this interval. You can see it approximates
pretty well, right? One curve is the approximation, the other is the actual
function, and there's not much difference.
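A quick numeric check of that claim (my own illustration, assuming natural log
and the slide's alpha = 0.05):

    import math

    ALPHA = 0.05

    def f(x):
        return x * math.log(x)                  # x log x, natural log

    def approx(x):
        # (x^(1+alpha) - x^(1-alpha)) / (2 alpha)
        return (x ** (1 + ALPHA) - x ** (1 - ALPHA)) / (2 * ALPHA)

    worst = max(abs(approx(x) - f(x)) / f(x) for x in range(2, 5001))
    print(f"max relative error on [2, 5000]: {worst:.4f}")  # about 3%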
And in fact, it goes pretty well out to about 5,000, I think. So you can
approximate it pretty well on something like zero to 5,000. And if you can
approximate, that's good, because m_i log m_i is exactly the term of the
entropy norm: if each m_i log m_i can be approximated this way, then you add
up everything and you get the entropy norm. On one side you end up with
something like the L_{1+alpha} norm -- except raised to the power 1 plus
alpha -- and on the other side something like the L_{1-alpha} norm. But we
already know how to estimate the L_{1+alpha} norm and the L_{1-alpha} norm,
so we can just take the difference.
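In symbols, summing the approximation over all flows ties the entropy norm to
two frequency moments (with \(F_p = \sum_i m_i^p\)):

\[
\sum_i m_i \log m_i \;\approx\;
\frac{1}{2\alpha}\Bigl(\sum_i m_i^{1+\alpha} - \sum_i m_i^{1-\alpha}\Bigr)
\;=\; \frac{F_{1+\alpha} - F_{1-\alpha}}{2\alpha}.
\]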
And by the way, because the approximation stops at around 5,000 -- I guess it
gets worse beyond 5,000 -- anything beyond 5,000 is an elephant, right? It's a
large flow. So we can have an elephant detection module which handles all the
flows larger than this threshold exactly, and everything below it we handle by
the approximation.
>>: I assume you use sampling to do elephant detection?
>> Jun Xu: Yes, I think you could use some [inaudible] sampling, some kind of
data streaming algorithm with which you can isolate almost all the elephants.
>>: [inaudible].
>> Jun Xu: And that works pretty well -- these two pieces together. Then it
turns out this method, the LP norm method, actually extends very well to the
intersection case, and this is the extension. You can think about it at the
origin side: based on this sketch you can estimate the LP norm of all the
ingress traffic, right? Now suppose this is the OD flow, meaning the traffic
common to both O and D, while this is the cross-traffic at O and this is the
cross-traffic at D. If you estimate the LP norm at the O side you get this,
and if you estimate it for the traffic at the D side you get this. These
sketches are linear, which means you can think of a sketch as a vector of
values and do pointwise subtraction and pointwise addition on it. It turns out
that if you do the pointwise subtraction and estimate the LP norm from the
result, the two OD flow contributions cancel out, and you basically get both
cross-traffic parts added up.

And if you do a pointwise addition and estimate the LP norm from that, you get
the sum. So you can see that we can come up with two different estimators. One
is just O plus D minus all the cross-traffic, which gives you one estimator of
the LP norm of the OD flow traffic; or you can subtract these two quantities,
and both will give you an estimate of the LP norm of the OD flow traffic.

All the rest is cross-traffic -- this is cross-traffic, and this is
cross-traffic. So the LP norm machinery extends very well to the intersection
case, okay? And then the entropy norm of the intersection is just the
difference between these LP norms: we let the LP norms handle the
intersection, and then we take the difference, so we know the entropy norm of
the intersection. This works very well, and we have a bunch of results.
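To make the linearity argument concrete, here is a minimal Python sketch of
the two OD estimators as I read them from this passage (not the paper's code;
sketch_O, sketch_D, and the disjointness assumption are spelled out in the
comments, and estimate_lp_norm comes from the earlier snippet):

    # Assumes sketch_O and sketch_D were built by stable_sketch() above, at
    # the origin and destination routers, with the SAME seed (so the same
    # flow gets the same stable variables at both ends).
    p = 1.05  # for example

    def sub(a, b):
        return [x - y for x, y in zip(a, b)]

    def add(a, b):
        return [x + y for x, y in zip(a, b)]

    def Fp(sketch):
        # p-th frequency moment via the median estimator defined earlier
        return estimate_lp_norm(sketch, p) ** p

    # In O - D the OD-flow contribution cancels, leaving only the two
    # cross-traffic moments; in O + D the OD flow is doubled.  With the
    # cross traffic disjoint from the OD flow, both lines below estimate
    # Fp of the OD flow:
    est1 = (Fp(sketch_O) + Fp(sketch_D) - Fp(sub(sketch_O, sketch_D))) / 2
    est2 = (Fp(add(sketch_O, sketch_D)) - Fp(sub(sketch_O, sketch_D))) / 2 ** p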
And then, it turns out that to go from the entropy norm to the entropy, we
need to know the total number of packets during the timeframe. That's trivial
for a single data stream, right? You just keep one counter and count. So it's
trivial. But in the OD flow context, the total size of the OD flow traffic is
what's called a traffic matrix element, and estimating it -- traffic matrix
estimation -- is not trivial.

Traffic matrix estimation is actually a problem in itself; there have been a
lot of papers on it -- I actually have a paper on traffic matrix estimation
myself. Now it turns out that the OD flow traffic volume, which is the traffic
matrix element, is exactly the L1 norm. So we could actually use the Cauchy
distribution, which is the 1-stable distribution, to estimate the L1 norm
using Indyk's method. Yes, we could do that.
But then we would be doing something extra, right? Beyond the L_{1+alpha} norm
and the L_{1-alpha} norm we would have to compute the L1 norm, which is extra
overhead. But it turns out that because the alpha we use is typically very
small -- like the L1.05 norm and the L0.95 norm -- when alpha is small, the
average of these two power functions approximates x very well.
So therefore the sum -- you can think of the L1.05 norm plus the L0.95 norm,
divided by two -- approximates the L1 norm pretty well. So based on that, we
are basically killing two birds with one stone, right? Once we have these two
norms, when you add them up and divide by two you get the L1 norm, and when
you take the difference you get the entropy norm. So then we have both
components we need to estimate the entropy of the OD flow traffic.
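The same Taylor-expansion argument as before backs this up (assuming natural
log):

\[
\frac{x^{1+\alpha} + x^{1-\alpha}}{2} = x \cosh(\alpha \ln x)
= x + \frac{\alpha^{2}}{2}\, x (\ln x)^{2} + O(\alpha^{4}),
\qquad \text{so} \qquad
\frac{F_{1+\alpha} + F_{1-\alpha}}{2} \;\approx\; \sum_i m_i = m.
\]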
And then we ran a lot of experiments with AT&T -- we have a huge trace -- and
these are the experiments. This is basically the relative error for the
experiments; you can see the relative error is mostly less than 0.15, that is,
15 percent. And what are these curves? These two curves correspond to the
sampling we have to use when estimating the elephants -- these are the
sampling probabilities for elephants. It turns out the result is not very
sensitive to this sampling probability, which means how good your elephant
detection is does not impact the result very much. What's the intuition? The
intuition is that if you are a large elephant, then you are going to impact
our result significantly, right? But the larger the elephant, the higher the
probability that it gets detected, right?
But if you are a small elephant -- think of a small elephant as, say, 5,500 or
6,000 -- then yes, the probability that you are not detected is pretty high,
but the error you cause is also small. So you can see that this error behavior
of elephant detection works very well with our entropy estimation. That's
basically the moral of the story.
I have other experimental results, but I would rather skip them. So basically,
we were able to extend the study of entropy to OD flows as a new tool for
network tomography; we presented algorithms that solve the problem in practice
with low relative error and reasonable resource usage; and we introduced a new
type of distributed streaming problem, namely estimating statistics of
origin-destination flows. And I like this work. These are the conclusions as
written by my student -- I would probably write them differently -- but the
thing I like best is that we were able to connect to the mainstream data
streaming work, namely the theoretical LP norm estimation.

For example, Indyk has had this paper for more than 8 or 9 years, and of
course he has a bunch of applications, but these are all theory applications,
which means you use a theory result to prove another theory result. To a
certain extent it's like a self-licking ice cream cone.
But we were able to find something that comes from a real application need,
and we were able to tap into [inaudible] -- we found a very concrete practical
application of his theoretical result, which is very exciting. And the
connection is very simple after the fact: if I tell you the connection, it's
very simple, but it's probably not that easy to think of beforehand.

So that's the reason I think I like this work very much. Personally.
>>: It's a very nice piece of work.
>> Jun Xu: Thanks.
>>: I saw you [inaudible] I mean [inaudible] try to see what kind of
applications it can be used in [inaudible].
>> Jun Xu: There has been some -- we have a [inaudible] which has been quite
useful for peer-to-peer search, peer-to-peer content search type applications.
And actually people have been applying our results to things like ad hoc
network routing, where the identity of a node in the ad hoc network is like an
object that needs to be searched.
>>: We can probably talk offline.
>> Jun Xu: Yes.
>> Jin Li: Are there any other questions from the audience?
>> Jun Xu: So how many statisticians do we have here? I think you can be
liberal on the definition; I just want to get some sense.
>> Jin Li: Okay. Let's thank the speaker.
>> Jun Xu: Oh, thanks.
[applause]