>> Jin Li: It is my great pleasure to have Jun Xu from the Georgia Institute of Technology come to Microsoft Research and give us a talk on his research work. What's happening? >>: I should power down. >> Jin Li: Okay. Jun got his Ph.D. from Ohio State University in the year 2000. In 2003 he received the NSF CAREER award for his contributions to establishing performance bounds for networking, and in the year, I believe, 2006 -- >> Jun Xu: IBM gave me something. >> Jin Li: He received an IBM faculty award. Jun has done a series of wonderful works related to computer network performance evaluation and measurement, establishing bounds, and so on. Without further ado, let's hear what Jun has to say about network data streaming and his journey in signal processing. >> Jun Xu: Thanks. Thanks for the nice introduction. I have given this talk many times, but the last time I came to Microsoft Research to give a talk, it was in a different building, in December 2004. I'm not sure how many of you were in the audience then. None. Which is good. So I think we were using -- >>: [inaudible]. >> Jun Xu: We were using a different title, actually, because I didn't think of this title until probably 2005 or 2006. And at that time, by December 2004, I had done less than a quarter of my data streaming work, so a lot of this work had not been done when I gave that talk. But I may still spend the first 20 minutes on the work I talked about in 2004, which I guess is okay since [inaudible]. The work described in this slide file is joint work with my former Ph.D. students Abhishek Kumar [phonetic], Minho Sung [phonetic], and Qi Zhao [phonetic], and with collaborators such as Tim Lu [phonetic] from IBM and Jia Wang [phonetic] from AT&T. I'm at Georgia Tech, and my faculty group is the networking and telecom group. The reason I call this talk A Computer Scientist's Journey in Signal Processing is that my educational background is one hundred percent CS: I have never taken any course in electrical engineering. But it turns out that to study the problems in this area, I had to familiarize myself with a lot of electrical engineering concepts, like coding theory and information theory. They may not be explicit in the work, but you have to have that type of understanding to pursue this line of research. Of course, after you do the research and describe the results, you may not use any of the terms from those theories, but we do use their techniques. So now let's talk about the motivation for network monitoring. Basically we want to [inaudible] why we need data streaming to monitor a computer network -- why won't traditional monitoring techniques work? The reason is that we need to monitor computer networks for a lot of different quantities. Some quantities are like elephant flows: you want to find the largest flows in your network, the flows that carry the majority of the packets. Sometimes we want to count the distinct flows -- how many flows there are -- and sometimes we want to find the average flow size. Actually, the number of flows and the average flow size are equivalent, because multiplied together they give the total number of packets in a timeframe, which you can easily count.
So if you know one quantity, you know the other. There are also other quantities you might be interested in, like the flow size distribution. All of these have applications. For example, if you find the elephant flows you can do traffic [inaudible], which means you want to shape the elephant flows properly; you don't care too much about the mice, because they don't carry much traffic anyway. And numbers like the count of distinct flows are useful for queue management: when you divide the bandwidth among flows, you need to know how many flows there are -- you divide the limited bandwidth by the number of flows. The flow size distribution is useful for attack detection. The flow size distribution means you want to know how many flows have size one -- size one meaning one packet -- how many flows have size two, how many have size three, and so on. One typical application I often mention: in the old days a virus or worm usually had a fixed size, say 1,000 bytes, while the typical MTU, the maximum transmission unit, in the Internet is around 500 bytes -- 512 bytes plus the [inaudible], all these things. >>: [inaudible]. >> Jun Xu: [inaudible] 576; I think that's really a socket [inaudible], which differs from one OS to another. Some OSes choose to send out 512-byte chunks, some choose 536-byte chunks. Some -- >>: [inaudible]. >> Jun Xu: Yeah. So a thousand-byte worm will be cut into two packets, which means all these worm flows have size two. Suddenly, if you have an epidemic of a particular type of worm that is two packets long, you will see a lot of flows of size two. If your usual profile is a thousand flows of size two within a five-minute timeframe and you suddenly see 10,000 flows of size two within five minutes, that may indicate some kind of worm-propagation behavior. There are lots of other quantities you want to measure, like per-flow traffic volume: given any flow, you want to know approximately how much traffic it contains. That is also useful for [inaudible] detection. And we also often need to measure the entropy of traffic; it turns out entropy is a very important quantity. I visited CMU, I think in 2004, and they talked about the need to measure the entropy of traffic at a certain node. The reason they wanted it is that they had a network of something like a thousand nodes, but only enough computational power to monitor all the traffic at 10 nodes at a time. Think about a town of a thousand people: you don't hire a thousand policemen to protect a thousand people -- you cannot afford that. Usually a town of a thousand people has only 10 policemen, right? So their idea was to design a data-streaming type of algorithm to measure the entropy of the traffic, and when the entropy of the traffic at a particular node looks suspicious, you ship all the traffic of that node off for further analysis. Just as a citizen calls 911 only when he feels his town is threatened, and then the policemen come -- usually 10 policemen are enough to handle all the 911 calls.
So that's basically their picture, but they needed the entropy estimation algorithm, and they posed this problem. I came back from CMU, worked on it with my student, and we solved the problem in about a month; we then wrote a paper together with the CMU team. So these all come from real applications. It's not as if I lock myself in my office, think about the papers I have to submit next year, and make up problems; they are not artificial problems, they come from real applications. By the way, some of you may wonder why traffic entropy can indicate anomalies. For example, when you have a [inaudible] attack you have lots of singleton flows, and singleton flows increase the entropy of the traffic. It turns out there are a lot of unlikely-looking applications of data streaming: traffic matrix estimation can be viewed as a data streaming problem; peer-to-peer routing can be viewed as a data streaming problem -- peer-to-peer routing in an unstructured peer-to-peer network, not a DHT but a BitTorrent-type network, can be formulated as a data streaming problem, and we actually have a paper on that. Even IP traceback can be viewed as a data streaming problem. We have papers on all of these. The challenge of high-speed network monitoring is tremendous, simply because packets arrive so fast: it turns out a packet arrives every eight nanoseconds on a 40 Gb/s link if we're talking about minimum-size packets. I wrote 25 nanoseconds here because I assumed a thousand bits per packet, which I call Craig Partridge's [phonetic] constant, because Craig Partridge's paper used a thousand bits per packet, so I feel safe putting down 1,000 and nobody will criticize it. But in reality, if you assume minimum packet length, we're talking about eight nanoseconds per packet. So you have to use [SRAM] for per-packet processing. Unfortunately, per-flow state is too large to fit into [SRAM], because you could have millions of flows, and putting all the flow state into SRAM is just too much. So the traditional solution, because your SRAM is too small for per-flow processing, is sampling: you sample only a small percentage of the packets, and because the traffic stream becomes much slower after sampling, DRAM is fast enough to handle the sampled packets. You process the sampled packets using per-flow state stored in slow memory [DRAM]. But of course you didn't see all the traffic, which means your per-flow state does not record all the traffic -- it records, for example, only one out of every 100 packets. So you have to use some type of scaling to recover the [original] statistics, and if the sampling rate is very low, you have a blow-up of that many times, which causes low accuracy, because in upscaling you also scale the noise. And we are fighting a losing cause, because link speeds get faster and faster. For example, right now at AT&T the sampling rate is, I think, one in 500 -- they sample 1 out of every 500 packets -- but as link speeds go up they are talking about one in 5,000 packets. With that kind of sampling rate your accuracy is not going to be very high. So network data streaming is the smarter solution.
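For concreteness, the arithmetic behind the two inter-arrival figures just quoted, assuming the 40 Gb/s link rate that both numbers imply (an assumption; the transcript does not name the rate explicitly):

```latex
\underbrace{\frac{40\ \text{bytes} \times 8\ \text{bits/byte}}{40\times 10^{9}\ \text{b/s}}}_{\text{minimum-size packets}} = 8\ \text{ns},
\qquad
\underbrace{\frac{1000\ \text{bits}}{40\times 10^{9}\ \text{b/s}}}_{\text{1000-bit packets}} = 25\ \text{ns}.
```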
So the computation model for network data streaming is basically to process a long stream of data in one pass: you look at a packet and decide whether to use it to change your internal state. If you decide not to change the internal state, the packet goes away and you cannot change your mind -- you cannot say, "I think I saw this packet a second ago, but it's gone; can I get it back?" You cannot. You have one pass, and you have to make a real-time decision. Of course, with an infinite amount of memory it's no big deal, because you just store everything. But it turns out you have a very limited amount of memory, so you have to be very judicious about what you store in this limited amount of high-speed memory. If there are video coding people here: this problem can actually be viewed as an online rate-distortion problem. You want to compress the data, but you have to make real-time decisions -- not like image compression, where you can scan the image a thousand times; here you have to make real-time rate-distortion decisions. So the problem to solve: you need to answer some queries about the stream, at the end or continuously. You can think of the goal as translating the queries into the distortion function of a rate-distortion problem. The trick is to remember the most important information about the stream pertinent to the queries. You are very much goal-oriented: you cannot remember everything -- information theory forbids it -- so you try to remember the most important information pertinent to the queries. Compared with sampling: streaming peruses every piece of data for the most important information, whereas sampling digests only a small percentage of the data but absorbs all the information within that sample. That's the difference between sampling and streaming. My analogy is always as follows. When I was an undergraduate, studying was not my only interest -- actually a minor part of my interests -- so I didn't spend too much time studying for my courses, but I still wanted reasonably good grades at the end. So what I did was exactly streaming: I had a very thick book which I didn't read the whole semester, but at the end I streamed through the book. I only had time for one pass; I didn't even have time to go back and forth. My memory for one day is pretty small, so I had to remember only what mattered. Data streaming basically means I need to answer queries relevant to the exam, which means I [inaudible] have some idea about the queries that will be given in the exam. So I go through the book in one pass, trying to remember the most important information relevant to the exam the next day. That's basically the algorithm. Obviously the sampling algorithm does not work: if I read page 1 of the book, page 11, page 21, I'm not going to pass the exam, because I don't get the right context. But by doing this kind of smart streaming, I managed to get through my undergraduate years without studying too much -- without actually learning too much. So now I want to give you a [inaudible] example of data streaming.
So think about being given a long stream of data and wanting to count the number of distinct elements. This is a 1985 problem, so it's a very old problem, and it turns out more than a dozen solutions have already been proposed; there are pros and cons to all of these data streaming algorithms -- they are good in different contexts, et cetera. I'm not going to go through the list; I'll only talk about the classical one. Say the data stream is A, B, C, A, C, B, D, A; obviously the number of distinct elements is four, right? But if I throw a huge stream of elements at you, among which maybe millions are distinct, you're not going to be able to count the number exactly. So I want to talk about a simple algorithm. You choose a hash function H whose output is uniform in (0, 1] -- basically it hashes each item to a uniform random variable. Let d1, d2, d3, ... be the data items. You hash each and every data item, and you always remember the smallest hash value. Isn't that easy? You keep one register, initialized to plus infinity; for each item you hash it, compare with the register, and set the register to the smaller of the two. Keep doing that and you always remember the minimum of the hash values seen so far. Then we can easily prove that the expectation of this minimum -- you can view it as a random variable, because the randomness comes from the hashing -- is 1/(F0 + 1), where F0 is the number of distinct elements in the data stream. You can prove that: the minimum is actually a beta random variable, and you can compute its expectation. Obviously, with one such hash function your accuracy will not be very good, but you can use, say, a hundred different hash functions and then average the hundred estimates, as in the sketch after this paragraph. To keep things simple I just say averaging, but for those of you who have studied statistics, the median estimator often works better than the mean estimator, and in some scenarios people have come up with more advanced estimators, such as the harmonic mean estimator for stable distributions with small p, things like that. Averaging is just the generic way to say it. So you can make the accuracy much higher.
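A minimal Python sketch of the minimum-tracking distinct counter just described, combining many registers with a median rather than a mean; `uniform_hash` is an illustrative stand-in for a family of independent uniform hash functions, not anything from the talk:

```python
import hashlib
import statistics

def uniform_hash(item, salt):
    """Illustrative stand-in for a uniform hash H: items -> (0, 1]."""
    digest = hashlib.sha256(f"{salt}:{item}".encode()).digest()
    return (int.from_bytes(digest[:8], "big") + 1) / 2.0**64

class DistinctCounter:
    def __init__(self, k=100):
        self.salts = range(k)     # k "independent" hash functions
        self.mins = [1.0] * k     # running minimum per register

    def add(self, item):
        for i, salt in enumerate(self.salts):
            h = uniform_hash(item, salt)
            if h < self.mins[i]:
                self.mins[i] = h

    def estimate(self):
        # E[min] = 1/(F0 + 1), so each register suggests F0 ~ 1/min - 1;
        # the median over registers tames the heavy right tail.
        return statistics.median(1.0 / m - 1.0 for m in self.mins)

counter = DistinctCounter()
for item in "ABCACBDA":
    counter.add(item)
print(counter.estimate())         # a rough (noisy, slightly biased) 4
```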
So now we're ready to talk about our own data streaming work. This is one of our earliest data streaming works -- actually the second piece of our data streaming work -- and it turned out to be an instant success: it received the best paper award at ACM SIGMETRICS 2004. The problem is to estimate the probability distribution of flow sizes: basically you want to know how many flows have size one, how many flows have size two, how many have size three, and so on. The applications we already talked about -- traffic characterization and engineering, billing, accounting, anomaly detection -- all depend on flow size distribution estimation. It's also very important because once you have the distribution you have everything else: if you want the first moment you have the first moment, if you want the second moment you have the second moment; whatever you want, you can estimate from the distribution. The definition of a flow is very flexible. Typically we talk about a flow as a bunch of packets with the same source IP, destination IP, source port, [inaudible] -- that's the usual definition, but it could be different; it's a generic definition. For example, sometimes we just use the source IP and destination IP to define a flow and don't care about the port numbers; that could also be a flow. So the definition is basically very flexible. The architectural solution is very simple -- basically a [inaudible] data structure. We maintain an array of counters in fast memory: each element of the array is a counter that counts from zero, and every update increments it by one. It's just that simple. For each packet, a counter is chosen through hashing and incremented; I'll show an animation of that. And there is no attempt to detect or resolve collisions: if two flows hash to the same counter, we let it be, and we will have some statistical means to recover from this kind of collision. And every 64-bit counter only uses four bits of SRAM -- we have an additional innovation, which I'll describe later, that makes this possible. The thing is, some flows can be very large, so a counter value can go beyond four billion, which needs more than 32 bits; you have to give each counter some maximum size, and 64 bits is typically large enough. But if you make every counter a full 64 bits it's too wasteful, because these counters live in SRAM; you want to reduce the counter size, and there are systematic ways to do it, which I'll talk about in the second half of the talk. The data collection is [inaudible] but very fast. Here is the animation. Think of the processor and the array of counters: a packet comes, you hash its flow ID, it goes to a particular location, and the counter is incremented by one -- its value goes from zero to one. Then another packet comes, from a different flow; it goes to a different counter, whose value goes from zero to one. Then a packet belonging to the first flow comes; it goes to the same counter, and obviously the value goes to two. Then a new, third flow comes, and unfortunately its flow ID hashes to the same location as the red one. Here I show you two reds and one yellow, but in reality you don't have these tags -- you only have the counter value, so you don't know about the collision; you still change the value from two to three. Makes sense? You have a collision, but you don't know it. That's basically the encoding algorithm, sketched in code right after this paragraph.
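The encoding path really is just one hash and one increment per packet. A minimal sketch, assuming a single hash function over flow IDs; collisions are deliberately left unresolved, exactly as in the animation:

```python
import hashlib

M = 1 << 20                            # number of counters (~1 million)
counters = [0] * M

def counter_index(flow_id):
    """Hash a flow ID to one of the M counters."""
    digest = hashlib.sha256(flow_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % M

def on_packet(flow_id):
    # One hash plus one increment per packet; colliding flows share a
    # counter, and the collisions are undone statistically at decode time.
    counters[counter_index(flow_id)] += 1
```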
What matters is the decoding. I want to show you the raw counter values after all the increments. We're talking about a packet trace from AT&T -- I'm sorry, this is a one-hour trace, and it has about one million flows in it. This curve is the actual flow size distribution: the x-axis is flow size, the y-axis is frequency; you can read off, for example, about 300,000 flows of size one, and it goes down like this. The other curves are the raw counter value distributions: this one is what you get with one million counters, this one with half a million, this one with a quarter million, and so on. You may say the difference is not that large, especially between these two curves, but that's not the case, because this is a log scale. You might decode something like 200,000 flows when it's actually something like 400,000 flows -- there's a huge difference down there, because of the log scale. So the -- >>: [inaudible] not that big, right, you don't really care about the [inaudible]. >> Jun Xu: Well, when you estimate the number of flows of size, say, 80, to a certain extent you don't get too much error. But when you estimate the number of flows of size one, your error is huge. >>: Do you care about -- >> Jun Xu: In some situations you have to care about it. For example, if you want to know whether you are under a DDoS attack. ISPs like AT&T care, because they provide a scrubbing service: they scrub the DDoS traffic for their customers, so they first have to detect it for those customers. If the usual number of flows of size one is 200,000 but during an attack it could be 400,000, and you cannot distinguish between 200,000 and 400,000, you may miss a DDoS attack. >>: You don't know what kind of [inaudible] you are getting for [inaudible]. >> Jun Xu: This is just the raw counter values; I just want to give you some idea. If you treat the raw counter values as the flow size distribution, this is the kind of error you're going to get. Of course we're not going to treat them as-is; we're going to estimate the flow size distribution from these raw counter values. But that's all we've got, so we have to do some estimation. First, a quick and dirty estimate. Let the total number of counters be M, a known quantity, and let the number of counters with value zero be M0, also a known quantity -- we can see how many counters have value zero. Then you can immediately estimate the total number of flows as N-hat = M ln(M/M0): M times the natural log of M over M0, all known quantities. This is a pretty good estimator. Starting from there, you can estimate the number of flows of size one. How? Let the number of counters with value one be Y1, again a known quantity: you scan the counters and count how many ones there are. Then a pretty good estimate of the number of flows of size one is N1-hat = Y1 * e^(N-hat/M), where N-hat is the estimator we just computed and M is known. Conceivably you can generalize this process: once you have N1-hat you can construct N2-hat, then N3-hat, and so on and so forth. But this approach is not going to work.
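Both formulas follow in one line from the Poisson approximation implicit here (each counter attracts flows at rate N/M, a model the talk invokes explicitly later):

```latex
\Pr[\text{counter empty}] = e^{-N/M} \approx \frac{M_0}{M}
\;\Longrightarrow\; \hat{N} = M \ln\frac{M}{M_0};
\qquad
\mathbb{E}[Y_1] = N_1\, e^{-N/M}
\;\Longrightarrow\; \hat{N}_1 = Y_1\, e^{\hat{N}/M},
```

since a counter shows value one exactly when one size-1 flow and no other flow landed on it.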
Why not? Because N1-hat has N-hat inside it, and if you estimate N2-hat you're going to have N1-hat and N-hat inside it. By the time you estimate N100-hat, it will have N99-hat, N98-hat, and all the rest inside it. So your error accumulates, which is no good. You need a holistic solution, which is joint estimation using the expectation-maximization (EM) algorithm: instead of doing this sequentially, you estimate the entire distribution in a holistic fashion using expectation maximization. The solution is basically very common-sensical. It's an EM algorithm, so you begin with a guess of the flow size distribution. This guess induces a probability space. Then you look at all the counter values and reason about how each counter value could split probabilistically; how it splits depends on the probability space induced by the guess. Based on the possible splittings of a particular counter value and the respective probabilities of those events, you can compute a refined estimate of the distribution: when you actually split the counters according to these [posterior] statistics, you get a new distribution, and this new distribution is used to do the whole thing again. Repeating this multiple times allows the estimate to converge to a local maximum; that's the spirit of expectation maximization. To give you an example, a counter value of three could be caused by three kinds of events. "Three" alone means that the counter value three is indeed a single flow of size three. "One plus two" means a flow of size one collided with a flow of size two, showing up as three. "One plus one plus one" means three flows of size one collided into the same location. Suppose the respective probabilities of these three events are 0.5, 0.3, and 0.2. How do you compute these probabilities? Remember you have the initial guess of the distribution, so you can compute the a priori probabilities, and based on what you observe you can compute the posteriors with the [Bayes] formula -- it's pretty standard. Now suppose you have 1,000 counters with value three; then you claim that 500, 300, and 200 of these counters split in these three ways, respectively. That's the expectation step. And what do we mean by split? The 500 counters that split as a single size-3 flow contribute no fragments of size one -- a plain three has no flow of size one in it. The 300 that split as one plus two each contribute one fragment of size one. And the 200 that split as one plus one plus one each contribute three fragments of size one: 200 times three. So you credit 300 + 600 = 900 to your count of flows of size one. Of course this was just for counter value three; you do the same for all the counters of value two, value three, value four, and so on, and you get a bunch of new counts, which you treat as the new distribution. You iterate this over and over until it converges; the credit pass is sketched in code after this paragraph.
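A compact sketch of that expectation-plus-credit pass, under the Poisson collision model mentioned later in the talk. The partition enumeration is exponential in the counter value, so a practical decoder would cap it and treat large counters separately; all names here are illustrative, not the paper's:

```python
import math
from collections import Counter

def partitions(v, max_part=None):
    """All multisets of flow sizes that could sum to counter value v."""
    max_part = v if max_part is None else max_part
    if v == 0:
        yield ()
        return
    for first in range(min(v, max_part), 0, -1):
        for rest in partitions(v - first, first):
            yield (first,) + rest

def em_credit_pass(counter_hist, lam):
    """One EM iteration over the counter-value histogram.

    counter_hist: {counter value v: number of counters holding v}
    lam:          {flow size s: current estimate of (flows of size s) / M}
    Returns fractional flow-size counts for the next iteration."""
    credits = Counter()
    for v, num in counter_hist.items():
        splits = list(partitions(v))
        # Posterior weight of each split under independent Poisson(lam[s])
        # arrivals of size-s flows; the factor exp(-sum of lam) is common
        # to all splits and cancels in the normalization.
        weights = []
        for split in splits:
            w = 1.0
            for s, c in Counter(split).items():
                w *= lam.get(s, 1e-12) ** c / math.factorial(c)
            weights.append(w)
        total = sum(weights)
        for split, w in zip(splits, weights):
            for s in split:                    # credit every fragment
                credits[s] += num * w / total
    return credits

# The talk's example: 1,000 counters of value 3 whose splits {3}, {1,2},
# {1,1,1} carry posterior weights 0.5, 0.3, 0.2 credit
# 1000 * (0.3*1 + 0.2*3) = 900 fragments to size one.
```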
>>: [inaudible] sure about the convergence. So basically after this analysis they credit 900 to [inaudible] -- >> Jun Xu: No, let's see: it's 500 to N3 -- I'm sorry, 500 counters count as single size-3 flows, 300 split as one plus two, and so on. >>: [inaudible]. >> Jun Xu: Let me see if I have that slide. I'm not sure I have it. Okay, I don't have that slide. >>: [inaudible]. >> Jun Xu: Once you have all these counts N1, N2, N3, it turns out you normalize them into probability values -- the percentages. And it really goes into how you compute the a priori distribution. For the a priori distribution, if you look at a particular counter location, you're talking about the hashing process, and the hashing process is modeled as a [Poisson] arrival model. So you can think of these N1, N2 values as translating into the lambdas inside these [Poisson distributions]. All these [Poisson] terms are multiplied together, and when you do a normalization it becomes something called a Dirichlet [phonetic] distribution. So basically you start with an initial guess of the flow size distribution, you get all these lambdas -- >>: [inaudible]. >> Jun Xu: Then you get new lambdas by normalizing, and you keep doing this. >>: Okay. Okay. >> Jun Xu: And then of course you have a [maximization] step, which computes -- whose result is just this. That's basically the spirit of the algorithm. But I don't have the slide here. It's an EM algorithm, and EM algorithms are proven to converge -- unfortunately not necessarily to the maximum-likelihood estimate. This slide shows the actual results. I already showed these curves before: this curve is the actual flow size distribution, and this is the raw counter values when you have one million counters -- almost as many counters as flows -- and still the difference is quite large. We were supposed to show three curves, but you only see two, and the reason is that our estimate overlaps the actual distribution pretty well. Of course if it were completely identical it would look fishy, so to avoid that I think we added some glitches here and there -- there is some small difference in there, but it overlaps pretty well. And of course some of you may say I hide the difference by drawing it on a log scale. I do draw it on a log scale, but the actual difference is bounded by a few percent -- two percent, three percent, not that much; overall I think the difference is around two to three percent. >>: Two to three percent is [inaudible]. >> Jun Xu: Yes. For example, if your actual number of flows of size one is 400,000, you might make an error of around 80,000, and 80,000 would be 20 percent. >>: I don't think [inaudible]. >> Jun Xu: It's not a bound, it's just empirical -- empirically around five percent difference. Even [inaudible] cannot bound its performance. So our estimation works really well. We have a bunch of other experiments; this one is the comparison with sampling.
You can see this curve means you sample with probability 0.1 -- one in ten -- and estimate from the samples. Nick Duffield's [phonetic] group at AT&T has an algorithm for inverting the sampling, which also uses EM; this is inverting from ten percent sampling, and this is inverting from one percent sampling. You can see that the actual flow size distribution is covered very well by our algorithm, but inverting from sampling performs very badly, because it's really a moving average: it cannot follow the ups and downs in the flow size distribution, which can be important for anomaly detection purposes. The reason it's a moving average: if you recover from, say, one-in-100 sampling and you see a sampled flow of size one, you don't really know whether the actual flow size is one, or 10, or 100 -- you have no idea where it came from. That's why sampling will not work very well. I'm going to skip the next -- that's the only one I put here. Good. So that's the flow size distribution work. One technical problem that came out of this flow size distribution work is that we had to use an array of counters, each 64 bits, which is very expensive. People had designed algorithms for this earlier; it turns out the algorithm by George Varghese's [phonetic] group, I think, required nine plus two bits -- a nine-bit counter and two bits of [inaudible] -- so 11 bits. And 11 is a number computer science people don't like: you don't have 11-bit SRAM words or things like that; we like nice numbers like four. So I felt we could do better. The problem statement: we want to maintain a large array of counters that need to be incremented by one in arbitrary fashion. Think of maintaining an array A, and the customer comes with indices i1, i2, ...; when the customer comes with an index, you increment that entry by one. This sounds like a trivial problem, but it becomes non-trivial when the customer specifies such an increment every eight nanoseconds, and also when you want to spend as little money as possible. So the increments come very fast, and the values of some counters may be large, because some flows are large, so the counters can grow large. If you provision every counter for the worst case, you give everybody 64 bits, and fitting everything into an array of long SRAM [counters] is very expensive. Some of you may say caching would work, but it won't, because the access sequence may have no locality. There is lots of motivation. Now that I put things in this context you might think it comes entirely from data streaming, but it has lots of applications other than data streaming. The first motivation comes from exactly this hash-and-increment structure: you hash something and increment the counter. But there are other motivations: routers may need to keep track of many different counts -- for example, counting the packets between different source IP [prefix], destination IP [prefix] pairs.
And it turns out this kind of counter array can help implement millions of [inaudible] in routers, so Cisco people may be interested in this as well. It's even extensible to non-CS applications in sewage management: think about it -- if your subdivision has a hundred thousand toilets, you don't want to design the cesspool for a hundred thousand toilets; you want the cesspool to be smaller than the total size of all those toilets. This type of algorithm will work there. And our work is basically able to make one SRAM bit do the work of 16 -- the [inaudible] of the 21st century. The main idea of the framework -- which I haven't even described yet -- in the [inaudible] earlier work is that all the counter increments go to a very short SRAM counter, and when the short SRAM counter is about to overflow -- for example, an 8-bit SRAM counter approaching 255, 256 -- a counter-management algorithm flushes these overflowing counters to DRAM. Flushing means: if you have one million SRAM counters, you also have one million DRAM counters, in one-to-one correspondence, and anything that overflows you just flush over. For example, if this counter overflows from 255 to 256, you reset it to zero and increment the DRAM counter by 256, as in the sketch after this paragraph. It's just that simple. It turns out this algorithm is very hard to design. Why? Because you have to work with arbitrary increment patterns. For example, you could come up with a policy saying "if you are half full, flush," but the [adversary] can drive every counter to 127 and then have them all hit 128 simultaneously -- then you're dead. No matter what policy you come up with, you have to imagine an [adversarial] case that makes your strategy fail. So this thing turns out to be extremely hard to design. >>: [inaudible] how often did you have to go to the [inaudible]? >> Jun Xu: How often? Well, it depends on how large your counter is, whether the counter is shorter or longer. Of course you want to design something as short as possible -- that's basically the competition: if we design something much shorter than the previous schemes, we win. But to make it much shorter we have to be much smarter about how we do it; that's basically the challenge. Previous approaches used different kinds of counter-management algorithms -- CMA, which in real life also stands for condo management association or something, so my subdivision analogy goes well with that. Anyway. Nick McKeown's [phonetic] team had this work where they basically implement what I call "fullest first": if your counter value is the largest, you have the highest priority to flush -- the fullest, tallest counter flushes first.
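Before the efficiency comparison, here is the hybrid SRAM/DRAM framework just described, in code form -- a minimal sketch with one narrow SRAM counter per wide DRAM counter, which deliberately sidesteps the hard part, namely scheduling the flushes against slow DRAM under adversarial increments:

```python
SRAM_BITS = 8
SRAM_CAPACITY = 1 << SRAM_BITS     # 256: one increment past 255 overflows

class HybridCounter:
    """One short SRAM counter backed by a wide (e.g., 64-bit) DRAM counter."""

    def __init__(self):
        self.sram = 0               # fast, narrow, on-chip
        self.dram = 0               # slow, wide, off-chip

    def increment(self):
        self.sram += 1
        if self.sram == SRAM_CAPACITY:     # 255 -> 256: overflow
            self.dram += SRAM_CAPACITY     # flush 256 into DRAM ...
            self.sram = 0                  # ... and reset the SRAM counter

    def value(self):
        return self.dram + self.sram       # exact total count
```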
But these algorithms aren't very efficient. For example, even when the DRAM-to-SRAM speed difference is 12, four bits should be enough: four bits reduces the urgency by a factor of 16, and 16 is bigger than 12, so it's enough. But fullest-first needs eight bits, which is twice four bits. Another problem is that they implement the whole thing as a heap, and with a heap you have to have pointers: with one million counters, log of one million is about 20, so the pointer you need for the heap takes 20 bits per counter, which is very wasteful. The total is about 28 bits per counter, even though the theoretical minimum is four. But they declared victory, because they went down from 64 to 28. And they need a [pipelined] implementation of a heap, which is non-trivial; energy-wise it's not good either, because any pipeline costs some energy. Then there is later CMA work by George Varghese [phonetic], which is much more efficient: they go down -- I won't go through the details -- to eight SRAM bits per counter plus two bits for the control logic. So Nick McKeown's [phonetic] team is 8 plus 20, and they can do 8 plus 2, and the hardware logic actually seems simpler than McKeown's team's solution. Our scheme does much better still: we only need four bits, which is the minimum. And our idea is very simple: we flush only when the SRAM counter is completely full -- only a completely full toilet may flush; that's the policy. And we use a small SRAM buffer to hold the [flushes headed to] DRAM: think of a very small cesspool serving all the toilets. Then there is the key innovation: a simple randomized algorithm ensures that the counters do not overflow in bursts large enough to overflow this small buffer -- because if every household flushed its toilet simultaneously, we would be dead. So how do we do it? Very simple: we set the initial values of the SRAM counters to independent random variables. It's like bootstrapping: at the very beginning you set the initial counter values to uniform random values between zero and 15. Why 15? Because we're talking about four bits, so the values only range from 0 to 15. But of course you are counting, and a count has to start from zero, so you have to remember the seed value. How do you remember it? Very simple: you [initialize the DRAM] counter to the negative of the SRAM seed value, so the total starts from zero. And this is stated in an online-algorithm context: the [adversary] knows our randomization scheme but not the initial values of the SRAM counters -- how about that? And we prove rigorously that a small FIFO queue ensures overflow happens only with very small probability. So basically this scheme is like a community ordinance -- a city ordinance saying that when new homeowners move in, they have to set the water level in their toilet to a uniform random value. And then a small cesspool will work for a very long time.
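A sketch of the randomized scheme itself, with the talk's parameters (4-bit SRAM counters, a ~300-slot shared flush queue); the class name and the explicit `drain_one` interface are illustrative, not the paper's. Seeding each DRAM counter with the negative of its SRAM counter's random initial value keeps `dram + sram` equal to the true count once queued flushes have drained:

```python
import random
from collections import deque

SRAM_BITS = 4
SRAM_CAPACITY = 1 << SRAM_BITS            # counter states 0..15

class RandomizedCounterArray:
    def __init__(self, m, queue_slots=300):
        # Bootstrap: independent uniform initial values in {0, ..., 15}.
        self.sram = [random.randrange(SRAM_CAPACITY) for _ in range(m)]
        # Remember the seeds by seeding DRAM with their negatives,
        # so every counter's total still starts at zero.
        self.dram = [-v for v in self.sram]
        self.queue = deque()              # small on-chip flush buffer
        self.queue_slots = queue_slots

    def increment(self, i):
        self.sram[i] += 1
        if self.sram[i] == SRAM_CAPACITY:   # flush only when completely full
            if len(self.queue) >= self.queue_slots:
                raise OverflowError("flush queue full (provably very rare)")
            self.queue.append(i)
            self.sram[i] = 0

    def drain_one(self):
        """Run at DRAM speed: retire one queued flush per DRAM cycle."""
        if self.queue:
            self.dram[self.queue.popleft()] += SRAM_CAPACITY

    def value(self, i):
        # Exact once the flushes queued for counter i have drained.
        return self.dram[i] + self.sram[i]
```

The random offsets desynchronize the counters' overflow times, which is what lets a few hundred queue slots serve a million counters with the failure probabilities quoted next.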
Let's look at the numeric example: one million counters, 4-bit SRAM counters, 64-bit DRAM counters, a speed difference of 12, and so on for the other parameters. For one million counters you don't need one million buffer slots; you only need, for example, 300 buffer slots in the FIFO queue for storing the indices to be flushed. After 10 to the power 12 counter increments -- a trillion increments, in arbitrary order, so think of eight hours of 40 Gb/s traffic at around 40 million packets per second -- the probability of overflowing this 300-slot FIFO queue is about 10 to the minus 14, which I think is exactly the digital fountain undecodability probability. So the mean time between failures for one of these counter arrays is a hundred billion years, which is longer than the time since the Big Bang -- I think the Big Bang was, I don't know, thirteen point seven billion years ago or something. I actually went to Cisco to give this talk, and they told me they are usually nervous about randomized algorithms and failure probabilities, but [inaudible] said he looked at it and felt pretty comfortable. I would think once every thousand years -- even once every year -- is fairly reasonable, but this is once every hundred billion years, so it's even better. Even once a day you could [live] with it: you lose one counter value. And I told them, this is California: the mean time between earthquakes is, I don't know, a hundred years, so anything above a hundred years will work in California. And this place as well, right? The mean time between [inaudible] here is probably a hundred years or something. So the conclusion for this part: data streaming is a very powerful tool for network monitoring -- I'm sorry, I don't know why I still have this typo after so many years -- and challenging research problems arise due to the stringent space and speed requirements. Actually, we didn't talk about distributed streaming; we probably won't have time to talk about it, but we actually do distributed data streaming as well. Distributed data streaming means you have lots of data streams and you want to find some statistics about their union without actually taking the union -- because taking the union is just like shipping all the data from the datacenter machines to a single machine for processing, and you just cannot do that. You want to compute some statistics across all the datacenter nodes without shipping all the data together, without reading every piece of data. That's actually a pretty challenging thing. And I do have a second slide file which is mathematically very exciting. >>: Before you continue, let me ask some questions. >> Jun Xu: [inaudible] data streaming. >>: [inaudible]. >> Jun Xu: It is data streaming. But go ahead. >>: Actually [inaudible]. >> Jun Xu: Yes. >>: What's your comment on that [inaudible]? >> Jun Xu: Well, Counter Braids [phonetic] -- let me first summarize the Counter Braids idea.
The Counter Braids idea is basically that they want to be able to decode the size of each and every flow. So they have a sketch, and they also gather all the flow IDs offline. >>: So basically, in addition to the hashing in the first part of the talk, they hash [inaudible] and they try to [inaudible]. >> Jun Xu: Exactly. Exactly. Yes. And also, they do variable-[length] encoding, which means they have fewer and fewer counters -- a pyramid-type scheme with fewer and fewer of the high-order counter bits -- so they can save space. And when a packet comes in, you hash its flow ID to multiple locations and increment them. >>: But I think they argue they can do exact [counting], whereas here you are basically doing estimation. I mean -- >> Jun Xu: They can -- well, they have more input than I do, and they have more output than I do. They feed in basically all the IDs: they need all the flow IDs to be stored to be able to decode, because the flow IDs translate into the decoding matrix. >>: Okay. >> Jun Xu: Without the decoding matrix -- of course you can map the flow IDs to the decoding matrix and then decode, but having the decoding matrix is basically the same as having all the IDs. So they need all the IDs, and with the IDs they can find the flow size of each and every ID. That's basically what they do. >>: [inaudible] but we can ask them later on. Because that's actually pretty [steep] as far as requirements go, if you have to do that. >> Jun Xu: Okay. But their decoding is not instant: if you want to decode each and every flow in real time, you cannot. You have to spend something like 30 seconds at the end to decode everyone -- a batch decode of everyone costs 30 seconds. If you [amortize it] over all the counters, it may work. >>: [inaudible] question I will ask you [inaudible]. >> Jun Xu: Yeah. They actually cited several of our works, I think, including the counter array work -- because, to a certain extent, we did blow a hole in their work, in the sense that we can go from 64 bits to four bits; with their counter scheme, there's no way they can go from 64 bits to four bits. So they have to argue that in certain applications you don't want SRAM-to-DRAM traffic, because that costs you bus traffic and things like that; in those situations their solution is entirely SRAM. >>: I think you also have some kind of [inaudible] I mean they also [inaudible]. >> Jun Xu: But they don't have SRAM-to-DRAM traffic. >>: I think they actually should be able to do that as well, because they potentially have a second stage. So they [inaudible] I mean the second part of [inaudible]. >> Jun Xu: Yeah. >>: [inaudible]. >> Jun Xu: Oh, they -- no. I think -- yes, they could use mine, but they specifically say they don't want to use mine, because -- if they could use mine, I think nobody would want their work, because their work has complicated encoding and decoding.
So they have -- >>: They have, they [inaudible]. >> Jun Xu: You will get exact counts, yeah. But basically what they could do is get all the flow IDs and also use, for example, our algorithm -- they could just use our counter array to do it. So to [justify] their work, they have to say that in certain situations you don't want SRAM-to-DRAM traffic; they have an SRAM-only solution. Their solution is SRAM-only. >>: One basic thing I'm not clear about is whether they actually need to know all the IDs of the flows, or they just need to basically -- >> Jun Xu: [inaudible]. >>: [inaudible]. >> Jun Xu: Right now they need all the flow IDs, and they're working on some kind of partial decoding -- there's some partial decoding work going on. But I had a very long conversation with the authors, both [inaudible] and the first author, a student. >>: Another question, basically: [inaudible] snapshot -- is this actually the snapshot [inaudible]? I mean, for [inaudible], transferring from SRAM to DRAM very fast [inaudible], but basically sometimes you need to get the data out. >> Jun Xu: I see. >>: I mean, for [inaudible], the Counter Braids guys definitely need to get their data out; they basically need to take a snapshot. And the only thing I can think of is that during the snapshot you have to stop. >> Jun Xu: Well, you could have what they call a ping-pong buffer, which means you throw in twice as many resources -- you need like -- >>: Basically a shadow memory. >> Jun Xu: [inaudible] shadowing or something. I tend not to deal with things like this, because that's a separate issue; you could have some generic solution for things like that. >>: [inaudible]. >> Jun Xu: It may not be [inaudible]; you can imagine there could even be a coding solution, but that's a different dimension, because the shadow period is very small compared to the [measurement] period -- the way you read out is a sequential read, which is much faster than random-access reads, so you can think of it as only one-tenth of the time; you can even imagine some error-correcting-code type of solution for it. But that's -- >>: [inaudible]. >> Jun Xu: I want to isolate this as a generic problem, not just for this work but for many other such ping-pong-buffer type questions. So now I'm going to switch to a different slide deck -- it's my student's [inaudible] talk. It's a data streaming algorithm for estimating the entropy of OD flows, and I chose to talk about this because, first, it's distributed data streaming, and second, it uses some fancy math that turns out to be very easy to understand, and it connects very well with a classical theory result. Sometimes, if I don't have this kind of connection, I feel bad: it's as if theory people are doing deep theory work and there's some suspicion that I'm duplicating theory results by doing some randomization, some obfuscation of the terms. But this one connects very well with the theory result -- and I did check carefully that our work is not a permutation or obfuscation of existing theory.
And I have to be honest with you: the real value of this work is debatable, but mathematically it's beautiful work. The problem is very simple. We had a piece of work done in 2005 on estimating entropy -- we always talk about that one, and I'm not going to talk about it today, because the technique there is pretty standard in theory. This is a much harder problem. Our entropy estimation solution was one of the earliest; around the same time there were about a dozen solutions for entropy estimation, and none of these solutions is able to estimate the entropy of the union or intersection of two data streams. You can estimate the entropy of a single data stream S1, but you're not able to estimate the entropy of the intersection of data streams S1 and S2. It turns out that estimating the entropy of the intersection of data streams is very important. For example, take an OD flow, meaning all the traffic that goes from an origin to a destination: you want to estimate the entropy of the OD flow. What is an OD flow? It is the intersection of all the traffic that enters at a particular origin with all the traffic that exits at a particular destination. So you can see it's an intersection of two data streams. >>: So these are the -- >> Jun Xu: [inaudible] these are the ingress points; these are the egress points of an ISP. >>: [inaudible] we are looking at all the flows from, basically, the -- >> Jun Xu: A particular i and a particular j -- a particular origin O and a particular destination D; we are talking about the entropy of all the OD flow traffic. >>: Okay. >> Jun Xu: Which is the intersection of all the ingress traffic here and all the egress traffic there -- because some traffic originates here but goes somewhere else, and some traffic exits here but did not originate there. So it's really an intersection. >>: What information do you have [inaudible]? >> Jun Xu: What information do we have? The only information we have is that this node sees all the traffic that comes in, and that node sees all the traffic that goes out. That's the only information. We don't need any communication between them. The solution is basically: you digest the data stream here and get a sketch, you digest the data stream there and get a sketch, and you compare the two sketches. There is some communication -- basically piecing together the two sketches -- but a sketch is much smaller than the actual traffic. So now let's talk about entropy. Entropy is this quantity on the slide -- this slide was made by my student, and he makes much better slides than I usually do. This is the entropy definition, and you can think of p_i as the fraction of traffic contributed by the i-th flow. What we actually measure is the empirical entropy; it's not the true entropy, because the true entropy requires a specific [source] model and things like that. We measure the empirical entropy of a data stream.
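In symbols, with m total packets of which flow i contributes m_i, the quantity on the slide and the entropy norm that the next step converts it to are related by (the notation ||S||_H for the entropy norm is ours here, just for this note):

```latex
H = -\sum_i p_i \log p_i, \quad p_i = \frac{m_i}{m};
\qquad
\|S\|_H \equiv \sum_i m_i \log m_i;
\qquad
H = \log m - \frac{1}{m}\,\|S\|_H .
```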
It turns out that in many situations estimating the entropy is equivalent to estimating the entropy norm. Since p_i = m_i/m, where m is the total number of packets and m_i is the frequency of the i-th flow, instead of estimating the sum of (m_i/m) log(m_i/m) you can estimate the sum of m_i log m_i; if you know m, these are convertible. So it is also equivalent to estimate the entropy norm, and it turns out the entropy norm is much more manipulable than the entropy. As for motivation, there is lots of motivation for entropy -- anomaly detection, traffic clustering; everybody is very interested in entropy, DDoS attack detection, and so on. The theory we use is the theory of stable distributions. A distribution X is called p-stable if for any constants a1 through an, and random variables X1 through Xn drawn independently from this p-stable distribution, a1 X1 + ... + an Xn has the same distribution as the Lp norm of (a1, ..., an) times X. For example, the Gaussian is stable: if X1 through Xn are standard Gaussians, then this sum is distributed as the L2 norm of the coefficients times a standard Gaussian -- the Gaussian is 2-stable, because the variances add up, and of course you take the square root at the end. This was discovered in 1907 by Paul Levy [phonetic]; sometimes you regret not being born a hundred -- two hundred -- years ago, because then these results could have been discovered by you. The properties of stable distributions, as we discussed: the Gaussian is 2-stable, the Cauchy [phonetic] is 1-stable, and closed forms are known only for three values of p, but there are known formulas for generating samples from p-stable distributions -- the Chambers formula. Now, there is an existing piece of work -- not mine; it's by Indyk [phonetic], this is his [JACM] paper, and I think his conference paper was in 1999, much earlier -- where he figured out how to estimate the p-th frequency moment of a data stream, which has this form: with m_i the number of packets in flow i, he allows you to estimate the sum of the m_i raised to the power p, using stable distributions. The algorithm is: for each flow you draw a stable-distributed X_i, and a counter starts from zero; for every packet you see, you increment the counter by the X_i of its flow -- you can think of a hash function that maps the flow ID to a stable-distributed random variable. After you process all the packets, the final counter value is distributed as the Lp norm of the flow size vector times a standard stable random variable. It is easy to verify; this was all done several years ago. Then it becomes a parameter estimation problem, right? And how do you do the parameter estimation? You run this experiment multiple times [with independent variates] and extract this scale parameter, as in the sketch after this paragraph. By the way, we often reach for the mean estimator, but the mean estimator does not work here, because stable distributions are so heavy-tailed that the mean does not even converge. But it can be done with the median estimator; the median estimator works very well.
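A runnable sketch of this Indyk-style Lp sketch (my reconstruction, not the talk's code). `stable_sample` is the Chambers-Mallows-Stuck generator the speaker mentions, and the per-flow seeded RNG stands in for hashing a flow ID to its k stable variates:

```python
import math
import random
import statistics

def stable_sample(p, rng):
    """Chambers-Mallows-Stuck sampler for a symmetric p-stable variate."""
    theta = rng.uniform(-math.pi / 2, math.pi / 2)
    w = -math.log(1.0 - rng.random())     # Exp(1)
    return (math.sin(p * theta) / math.cos(theta) ** (1 / p)) * \
           (math.cos((1 - p) * theta) / w) ** ((1 - p) / p)

class LpSketch:
    """Indyk-style sketch: k counters, each accumulating one p-stable
    variate per packet, derived deterministically from the flow ID."""

    def __init__(self, p, k=100, seed=0):
        self.p, self.k, self.seed = p, k, seed
        self.counters = [0.0] * k

    def _variates(self, flow_id):
        # Stand-in for k hash functions: a per-flow seeded RNG makes every
        # packet of the same flow reuse the same k stable variates.
        rng = random.Random(f"{self.seed}:{flow_id}")
        return [stable_sample(self.p, rng) for _ in range(self.k)]

    def add_packet(self, flow_id):
        for i, x in enumerate(self._variates(flow_id)):
            self.counters[i] += x

    def lp_norm(self):
        # Each counter is distributed as ||m||_p * X with X standard
        # p-stable, so the median of |counter| over the k copies estimates
        # ||m||_p times median(|X|) (a known constant, omitted for brevity).
        return statistics.median(abs(c) for c in self.counters)
```

For p = 1 the variate reduces to tan(theta), a Cauchy sample, and the median of |X| is exactly 1, so `lp_norm()` needs no correction constant in that case.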
But it turns out -- some researchers discovered recently -- that when p approaches zero even the median estimator does not work that well; there the harmonic mean estimator works very well. So there has been a lot of research on what the best estimator is in this kind of context. But anyway --
>>: [inaudible] a number of basic counters, each of which is basically [inaudible] number of times --
>> Jun Xu: Exactly, exactly, yeah. And [inaudible] hash functions [inaudible], whatever. And this is exactly what I call the Christmas conjecture -- I thought of the conjecture on Christmas Eve. I wanted to estimate the entropy, so I was wondering: is it possible to approximate the entropy norm function by a linear combination of what we call power functions, something like this? It turns out we can. My student actually figured it out in one day. I had written a small program and played with MATLAB, but I always got a hump -- the regression fit most of the points very well, but there was always a hump somewhere, which I hated. So I asked my student to try; at that time I asked him for an approximation within a factor of four, but he came back with something much better: within a factor of two.
>>: Okay.
>> Jun Xu: It turns out a function like x log x can be approximated very well by a family of functions of the form c (x^(1+alpha) - x^(1-alpha)), and this constant c is actually equal to 1 over (2 alpha). The proof is straightforward using a Taylor expansion -- it is just a Taylor expansion; just look at the numerical values. I'm sorry, I apologize: there is a typo on the slide. The constant is not over 2; with alpha = 0.05 it is 10 times (x^1.05 - x^0.95). So let's see how well we approximate x log x with this function on this interval. You can see the fit is pretty good: one curve is the approximation, the other is the actual function, and there is not much difference. And in fact it goes pretty well out to 5,000, I think, so you can approximate well on [0, 5000]. And if we can approximate each term, we are done: each m_i log m_i -- the entropy norm summand -- is approximated by this power-function difference, so when you add everything up you get the entropy norm on one side, and on the other side something like the L_(1+alpha) norm raised to the power 1+alpha, minus the L_(1-alpha) norm raised to the power 1-alpha. But we already know how to estimate the L_(1+alpha) and L_(1-alpha) norms, so we can just take the difference. And by the way, because the approximation stops being good beyond about 5,000, but anything beyond 5,000 packets is an elephant -- a large flow -- we add an elephant detection module which handles all flows larger than that threshold exactly, and for the rest we do the approximation.
>>: I assume you use sampling to do elephant detection?
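The Taylor step goes as follows: writing x^(1 +/- alpha) = x e^(+/- alpha ln x) and expanding, x^(1+alpha) - x^(1-alpha) = 2x sinh(alpha ln x) ~ 2 alpha x ln x when alpha ln x is small, so dividing by 2 alpha recovers x log x. With alpha = 0.05 and x up to 5,000, alpha ln x stays below about 0.43, which keeps the error small. A quick numerical check (assuming numpy):

    import numpy as np

    alpha = 0.05                 # so c = 1 / (2 * alpha) = 10, as on the slide
    x = np.linspace(1.0, 5000.0, 100_000)
    approx = (x ** (1 + alpha) - x ** (1 - alpha)) / (2 * alpha)
    exact = x * np.log(x)
    rel_err = np.abs(approx - exact)[1:] / exact[1:]   # skip x = 1, where exact = 0
    print(rel_err.max())         # stays within a few percent on [1, 5000]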
>> Jun Xu: Yes, you can use some [inaudible] sampling, some kind of data streaming algorithm which isolates almost all the elephants.
>>: [inaudible].
>> Jun Xu: And that works pretty well -- those two pieces together. And then this Lp norm method actually extends very well to the intersection case, and here is the extension. At the origin side, based on this sketch, you can estimate the Lp norm of all the ingress traffic. Suppose this is the OD flow -- the traffic common to O and D -- while this is the cross traffic at O and that is the cross traffic at D. If you estimate the Lp norm at the O side you get one quantity, and at the D side you get another. Now, these sketches are linear: you can think of a sketch as a vector of values, so you can do point-wise subtraction and point-wise addition on them. It turns out that if you do the point-wise subtraction and estimate the Lp norm from the result, the common OD part cancels and you get the two cross-traffic parts added up. And if you do the point-wise addition and estimate the Lp norm, you get the sum. So we can come up with two different estimators, and combining them -- O plus D with all the cross traffic subtracted off -- gives an estimate of the Lp norm of the OD flow traffic. Which means the Lp norm machinery extends very well to the intersection case. Okay? And then ours is just the difference between these Lp norms: we let the Lp norm sketches handle the intersection, then we take the difference, and we get the entropy norm of the intersection. That works very well, and we have a bunch of plots. Now, it turns out that to go from the entropy norm to the entropy we need to know the total number of packets during the timeframe. That is trivial for a single data stream: you just keep a counter. But in the OD flow context the total OD traffic is what is called a traffic matrix element, and traffic matrix estimation is itself not trivial -- there have been lots of papers on it, including one of my own. It turns out the OD flow traffic volume -- the traffic matrix element -- is exactly the L1 norm of the OD flow vector. So we could use the Cauchy distribution, which is the 1-stable distribution, to estimate the L1 norm with Indyk's method. Yes, we could do that, but it would be extra work: beyond the L_(1+alpha) norm and the L_(1-alpha) norm we would also have to estimate the L1 norm, which is extra overhead. But it turns out that because the alpha we use is typically very small -- the L_1.05 norm and the L_0.95 norm -- you can see that when alpha is small, the function (x^(1+alpha) + x^(1-alpha)) / 2 approximates x very well.
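One way to write down the identity behind this point-wise sketch algebra (a toy check in numpy on exact flow vectors with hypothetical names; the real algorithm applies the same additions and subtractions to the linear stable sketches and estimates the moments from those, rather than touching the raw vectors): the OD flows, the cross traffic at O, and the cross traffic at D are disjoint sets of flows, so their p-th frequency moments simply add, and the OD part can be isolated.

    import numpy as np

    rng = np.random.default_rng(0)
    n, p = 1000, 1.05
    # Disjoint flow populations: OD flows, cross traffic at O, cross traffic at D.
    S  = np.zeros(n); S[:300]     = rng.integers(1, 50, 300)  # origin-to-destination
    cO = np.zeros(n); cO[300:600] = rng.integers(1, 50, 300)  # enters O, exits elsewhere
    cD = np.zeros(n); cD[600:900] = rng.integers(1, 50, 300)  # exits D, entered elsewhere

    O, D = S + cO, S + cD     # what the ingress and egress routers each observe

    def fp(v):                # p-th frequency moment, i.e. (Lp norm)^p
        return (np.abs(v) ** p).sum()

    # Point-wise difference cancels the OD part: fp(O - D) = fp(cO) + fp(cD).
    # Point-wise sum keeps it: fp(O + D) = 2^p * fp(S) + fp(cO) + fp(cD).
    recovered = (fp(O + D) - fp(O - D)) / 2 ** p
    assert np.isclose(recovered, fp(S))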
>> Jun Xu: So therefore the sum -- the L_1.05 norm plus the L_0.95 norm, divided by two -- approximates the L1 norm pretty well. So we are basically killing two birds with one stone: once we have these two norms, adding them and dividing by two gives the L1 norm, and taking the difference gives the entropy norm. Then we have both components we need to estimate the entropy of the OD flow traffic. We ran a lot of experiments with AT&T on a huge trace, and these are the results: this is the relative error [inaudible] for the experiments. You can see the relative error is [inaudible] less than 0.15 -- 15 percent. What are these two curves? When we estimate the elephants we have to use some sampling, so these are two sampling probabilities for elephants. It turns out the result is not very sensitive to the sampling probability, which means how good your elephant detection is does not impact the result very much. What is the intuition? If you are a large elephant, you would impact our result significantly, right? But a larger elephant has a higher probability of being detected. And if you are a small elephant -- think about 5,500 or 6,000 packets -- then yes, the probability that you go undetected is fairly high, but the error you cause is also small. So this error dynamic of elephant detection works very well with our entropy estimation. That is basically the moral of the story. We have other experiments, but I would rather skip those results. To sum up: we extend the study of entropy to OD flows as a new tool for network tomography; we present an algorithm that solves the problem in practice with low relative error and reasonable resource usage; and we introduce a new type of distributed streaming problem, estimating statistics of origin-destination flows. These conclusions were written by my students -- I would probably write them differently -- but the thing I like best is that we were able to connect to the mainstream data streaming work, the theoretical Lp norm line. For example, Indyk has had this paper for more than eight or nine years, and he has a bunch of applications, but they are all theory applications -- you use a theory result to prove another theory result, so to a certain extent it is like a self-licking ice cream cone. We were able to find something that comes from a real application need; we found a very concrete practical application of his theoretical result, which is very exciting. And the connection is very simple after the fact: if I tell you the connection it looks simple, but it was probably not that easy to think of beforehand. That is the reason I like this work very much. Personally.
>>: It's a very nice piece of work.
>> Jun Xu: Thanks.
>>: I saw you [inaudible] I mean [inaudible] try to see what kind of applications it can be used for [inaudible].
>> Jun Xu: There has been some -- we have a [inaudible] which has been quite useful for peer-to-peer search, peer-to-peer content search type applications. And people have been applying our result to ad hoc network routing, where the identity of a node in the ad hoc network is like an object that needs to be searched for.
>>: We can probably talk offline.
>> Jun Xu: Yes.
>> Jin Li: Are there any other questions from the audience?
>> Jun Xu: So how many statisticians do we have here? You can be liberal on the definition -- I just want to get some sense.
>> Jin Li: Okay. We can thank the speaker.
>> Jun Xu: Oh, thanks.
[applause]