>> Sudipta Sengupta: I am Sudipta Sengupta, and it is my pleasure to introduce Shreeshankar Bodas, who goes by the name of Shree. Shree is currently a post-doc at the Laboratory for Information and Decision Systems at MIT. The LIDS lab is the same lab which is home to folks like Bob Gallager, Dimitri Bertsekas and John [inaudible]. Shree did his Ph.D. in ECE from the University of Texas at Austin in September of 2010, and his research interests include resource allocation in data centers and wireless scheduling algorithms. Today he will talk about his work on fast averaging, which appeared in this year's ISIT, the International Symposium on Information Theory, and he will also point out applications to Map-Reduce. I'll hand it over to Shree.
>> Shreeshankar Bodas: All right. Thanks for the introduction, Sudipta. Good morning, everyone, and thanks for attending this talk on fast averaging with applications to Map-Reduce and consensus on graphs. Before we begin, let me tell you that most of this talk is going to be about applications to Map-Reduce, and after the main body of the talk we are also going to consider consensus on graphs as an application of fast averaging. This is joint work with Devavrat Shah. The particular problem I'd like you to keep at the back of your minds throughout this talk is this problem of autocomplete suggestions for a search engine. That is, there is a user, this is a snapshot from Bing, and I started typing a query with F, and these are the suggestions that Bing gave me: Facebook, Firefox, Foxwoods, and so on. So the question is, how does a search engine come up with such suggestions? These are based on the query logs from different users in the recent past, and based on the value of a given keyword, the suggestions are going to be presented to me in some order. So we want an efficient, fast, and online way of coming up with these suggestions. This is the application that has motivated us. As we're going to see, this application has certain features; first of all, it is possible to approximate the computation that goes on behind it without hampering the results. So this is the particular application that I would like you to keep at the back of your minds. Let's see what happens when someone starts typing with F and waits for the autocomplete to suggest something. This is a picture of the back end of the system, where there is, let's say, a master server, there is a bunch of slave servers, and there are the different query logs in the form of files. So when a query comes in, this master server wants to know, let's say for the time being that there are only three possibilities here, Facebook, Firefox and FedEx, the relative ranking or relative importance of these keywords, based on the previous search queries. So a simple number that I want is the following: of the search queries, how many of them contain the word "Facebook," how many contain the word "Firefox," and how many contain "FedEx"? Based on those counts I'll rank them and present them to the user. We want to speed up this operation. That's the way I'm going to present the fast averaging algorithms to you, and we are going to see how they are applicable in this setup, because this is essentially a counting problem: you want to count the frequency of something, and that count can be approximate.
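To make the counting behind such suggestions concrete, here is a minimal Python sketch of the exact, brute-force version of the computation just described. The candidate list, the toy query logs, and the function names are illustrative, not the actual search-engine backend: each slave counts the candidate completions in its share of the logs, and the master adds up the counts and ranks.

```python
from collections import Counter

# Toy sketch of the exact (brute-force) counting behind autocomplete ranking.
# All names here are illustrative, not an actual search-engine backend.

CANDIDATES = ["facebook", "firefox", "fedex"]  # completions for the prefix "f"

def count_in_log(log_lines, candidates):
    """Slave-side work: count how many queries mention each candidate."""
    counts = Counter()
    for query in log_lines:
        q = query.lower()
        for word in candidates:
            if word in q:
                counts[word] += 1
    return counts

def rank_suggestions(per_slave_counts):
    """Master-side work: add up the slave counts and sort by frequency."""
    total = Counter()
    for c in per_slave_counts:
        total.update(c)
    return [w for w, _ in total.most_common()]

# Example with three tiny "query logs", one per slave server.
logs = [
    ["facebook login", "firefox download", "facebook"],
    ["fedex tracking", "facebook messenger"],
    ["firefox update", "facebook photos", "facebook groups"],
]
print(rank_suggestions([count_in_log(log, CANDIDATES) for log in logs]))
# -> ['facebook', 'firefox', 'fedex']
```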
So at the heart of the problem is the following mathematical operation: I'm given a long vector of N numbers, and think of N as large, because I have these huge query logs and I want to search over them fast. I have this long vector of N numbers, and what I'm interested in is the quantity summation of XI. So XI represents the number of times a given keyword, let's say Firefox, came up in the search queries. It's 0, 1, 2, 3, some number of times --
>>: Do you know N ahead of time? If you do, then the average and the sum are the same problem.
>> Shreeshankar Bodas: Are the same problem. Yes, if you know it, then they are the same problem. But for mathematical reasons we're going to have this one over N in front, because we want a multiplicative approximation error as opposed to an additive error. So, as you said, they're the same problem. We're interested in this summation, and equivalently we're interested in one over N times this summation. So mathematically this problem is not hard. This problem just wants you to add up all these N numbers and divide by N. And you can do it in a distributed fashion, as I've shown here: there are these N numbers, and suppose I've got a hundred servers that can do the task. Then I'm going to give, let's say, one-hundredth of my numbers to the first server, the second server, the hundredth server, and so on. These servers are going to compute the intermediate sums and then report them to the central guy. The central guy will add up all the hundred numbers and then divide by N, and that's your mu. So this is the basic, simplest algorithm you can use. Right now I have shown two levels of hierarchy here, but it could have more levels of hierarchy. So mathematically the problem is not hard. What makes the problem hard is the data size and your delay constraints: you want to respond quickly and you want to process large volumes of data. Nevertheless, this deterministic algorithm has certain advantages and certain issues that we need to address. The advantages are, first, it returns an exact answer, which is always a good feature of any algorithm. It's clearly a distributed algorithm; as I've shown here, it can be done in parallel by a hundred servers, or by multiple servers in a multi-level hierarchy, because the addition operation is an inherently distributed operation. So this algorithm can be implemented in a distributed form. What are some of the issues with this algorithm? The first issue is complexity, or latency. We know that if you want to sum up N numbers, then there is a Theta(N) complexity bound: you cannot do with fewer than N minus 1 additions if you want the exact answer in the worst case. But the point here is that you may not care for the exact answer, because this addition operation, or the averaging operation, can be a means to an end. Even if my answer to this averaging problem is not exact, the final thing I'm interested in, which is the ranking of my results, may still be good. So if I'm willing to trade off some accuracy for speed, then I want to beat this latency bound, because this Theta(N) may be prohibitively large. That's one issue we need to address. Another issue is robustness. If one of these servers is slow or one of these servers fails at its task, then in the implementation that I've shown here, first of all, the bottleneck is the slowest server. I cannot, on the fly, decide that, okay, this server should not report, because I'm interested in the exact answer in this algorithm.
Or if this server fails, then somebody else has to complete its task, and it's going to affect your answer or add to your network bottlenecks. So these are a couple of issues, complexity and robustness, that we need to address for this algorithm. Let's look at the issue of latency in a little more detail. Suppose I've got these file servers and I have a bunch of numbers I want to compute the average of, and this is the central server who is going to do the final averaging operation. What happens here is, suppose there is this server that is the slowest server. It's always the case that the slowest server determines when your answer is ready. But if this guy is overloaded, because there is some background loading in the datacenter, it is doing some other operations, or there are some network bottlenecks for which this guy is too slow to respond, then although all the other response times are in single-digit milliseconds, this guy takes 50 milliseconds, and your time to finish the operation is, in this case, 54 milliseconds. But what happens is, if I'm not interested in the exact answer, maybe I can say, okay, forget this guy; I'm going to compute the average of the rest of the servers and suitably normalize it, and maybe my answer is going to be approximate, but my time to finish is 13 milliseconds for the numbers I have chosen here. This is a recurring theme: as we know from the simple M/M/1 queue Markov chain analysis, if you are very close to capacity and you slightly reduce the load, then your delay comes down by a large amount. The same thing is happening here: if you reduce the number of servers that need to respond by, let's say, one percent or five percent, then you get a dramatic speedup, because it is the slowest servers that determine how fast your answer is ready. It's true that the slowest server is always the bottleneck, but you're making the bottleneck wider in this case. And for all we know, that may not be a bad thing for the averaging operation, because our numbers are not just arbitrary numbers; they could have some correlation between them. So it depends upon the structure of the underlying data and how much error you are willing to tolerate; that determines how much error you're going to get in your final ranking. So suppose, in the extreme case, I know that the XIs, the underlying numbers I want to average, are all equal. Suppose all of them are equal to some number. I don't know what they're equal to, but in the extreme case where they're all equal I can just sample one of them, and that's my final answer. So this complexity bound of Theta(N) is actually for the worst case; that is, the given numbers can be arbitrary, with no distribution underlying them. But if the numbers exhibit some regularity, then I can get away with a smaller amount of sampling. That's the main point here. One thing I would like to mention here: this business of averaging by sampling, with or without replacement, has been around forever, like people have beaten it to death, and you can find classic large-deviations results on how many samples you need if you want to estimate the mean to within a plus or minus epsilon guarantee. But most of those results, in fact I should say all the results that are known, at least in classic textbooks, are stated in terms of the entries of the given vector.
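For reference, this is roughly the kind of classical, entry-based statement being alluded to; a standard Hoeffding-type bound (my wording, not the speaker's slide), for an additive error epsilon when the sampled values are known to lie in a range [a, b]:

```latex
% Classical sample-complexity bound stated in terms of the entries of the vector:
% if each sampled value lies in a known range [a, b], then for the average
% \hat{\mu} of R independent samples, Hoeffding's inequality gives
\[
  \Pr\bigl(\,|\hat{\mu} - \mu| \ge \epsilon\,\bigr)
  \;\le\; 2\exp\!\left(-\frac{2R\epsilon^{2}}{(b-a)^{2}}\right),
  \qquad\text{so}\qquad
  R \;\ge\; \frac{(b-a)^{2}}{2\epsilon^{2}}\,\log\frac{2}{\delta}
\]
% samples suffice for an (epsilon, delta) guarantee.  The constant depends on
% the range (b - a) of the entries, not on the moments of the underlying sequence.
```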
So as opposed to that, we are presenting results in terms of the moments of the given underlying sequence. So it's a different take on the problem. And for the Map-Reduce kind of problems that we are interested in, this is a new dimension in which you can trade off accuracy for speed, and if the averaging operation is a means to an end, then it can be used without affecting the final result. So here is the summary of our contribution to the problem. We propose a simple randomized algorithm for averaging, and we analyze its performance. We analyze the trade-off between accuracy and latency; basically we're counting latency in terms of how many samples you need for getting an approximate answer. We improve job completion times of Map-Reduce applications using these ideas from fast averaging. And the main intuition I'd like you to take away from this talk, forgetting about the actual system or the exact result, is this one thing: if the numbers are regular, if the numbers are concentrated, at least in an approximate sense, around their mean, then the mean computation is an easy thing. I don't have to sample each and every one; if I can sample, let's say, half of them, or root N of them, even that can give me a good answer. So, without keeping you in any more suspense, here's a centralized version of our algorithm. It's a very simple algorithm; it's the first thing that would come to your mind. Given this sequence of N numbers, I'm going to pick any R out of them; R is a parameter that we will choose later. But pick a fraction of these numbers, average them, and report their average. So this is a very simple algorithm. One slight modification that I would like to make for the final version of the algorithm is that instead of picking R numbers out of N, let's pick every number with some probability, so that this becomes a distributable algorithm. Because if I had to pick exactly R numbers out of N deterministically, then it's difficult to distribute the choices among different servers. So the algorithm is this: pick every number independently at random with some probability, according to some random variable, and report the average of the picked numbers. The advantage of this approach is that each number is going to be picked with some probability, but the probability that you pick any given number is close to 0. Because this number N is large, I will still be averaging enough of them, but I can substantially reduce the amount of sampling I need. So this is the algorithm: take a random subset, average it, and report that average as the estimate. Mathematically, this is what is going on here. These are the given N numbers, X1, X2, up to XN. This is a fixed, given underlying sequence; this is the number of occurrences of your keyword in the files. This is not random. The randomness comes from the algorithm. These Y1, Y2, up to YN are our random variables; these are Bernoulli random variables or binomial random variables, and you're going to sample these numbers X1, X2, up to XN these YI number of times. You can think of the YIs as the weights: compute the weighted sum of the XIs, divide by the total weight you put into the system, and that is your estimate of mu; that is your mu hat. So this is a nice, simple algorithm. As presented here it's a centralized algorithm, but it's clear it could be implemented in a distributed way: because of the binomial random variables, and because these choices are independent, you can distribute this algorithm over multiple servers.
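A minimal sketch of the centralized version just described, as I understand it: flip an independent Bernoulli coin for each entry and report the weighted sum divided by the total weight. The sampling probability, the test data, and the fallback for the (unlikely) empty sample are illustrative choices, not part of the stated algorithm.

```python
import random

def approx_mean(x, p, seed=None):
    """Estimate the mean of x by Bernoulli sampling.

    Each entry x[i] is picked independently with probability p (weight
    y[i] in {0, 1}); the estimate is sum(y[i] * x[i]) / sum(y[i]).
    With more general (e.g. binomial) weights the same weighted-average
    formula applies.
    """
    rng = random.Random(seed)
    weighted_sum, total_weight = 0.0, 0
    for xi in x:
        yi = 1 if rng.random() < p else 0   # Bernoulli(p) coin toss for this entry
        weighted_sum += yi * xi
        total_weight += yi
    if total_weight == 0:                   # unlikely for large N*p; fall back to exact
        return sum(x) / len(x)
    return weighted_sum / total_weight

# Toy usage: a "regular" sequence, concentrated around its mean of about 100.
data = [100 + random.gauss(0, 10) for _ in range(1_000_000)]
print(sum(data) / len(data))        # exact mean
print(approx_mean(data, p=0.01))    # estimate from roughly 1% of the entries
```

For this tightly concentrated toy data, sampling about one percent of the entries already puts the estimate within a fraction of a percent of the true mean; a heavy-tailed sequence would need a much larger sampling probability.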
So here are some other features of this algorithm. Although it's a centralized algorithm as -- you had a question?
>>: You need to distribute the summation of the YIs.
>> Shreeshankar Bodas: Exactly.
>>: So --
>> Shreeshankar Bodas: These YIs -- so think of it as follows. XI is one of the entries in the given vector. I've got this bunch of slave servers that are going to do the computation, that are going to compute the XIs, or aggregate the XIs. Say that for every server and every file I flip a Bernoulli coin and decide whether to sample that file; then the total number of servers that sample a given file is binomial. In that way the algorithm is distributable.
>>: Is there a hidden assumption that the variables are 0 and 1? Because if they're widely skewed, like if you try to estimate the mean amount of money in Medina and miss Gates, one missing sample can leave you totally out of whack.
>> Shreeshankar Bodas: That's true. That's true. So the assumption is this. Let's talk about what we are going to do with these samples. We are interested in giving autocomplete kind of suggestions. So even under the undersampling, heavy hitters like Firefox and Facebook are not going to be underrepresented. Let's say there are a million entries and thousands of Facebooks; if I undersample by 50 percent, then the relative ranking of Facebook and Firefox with high probability will be what I want. If you had some obscure entry, like FQ Q of X, that's not going to be presented, but that's okay. Okay. So this algorithm can be distributed; that's the good news. This algorithm is an online algorithm, so I don't need to know the system statistics or anything. If that information is available, it can be used, but the algorithm can nevertheless be run online; I don't need to know the statistics of anything else. It's just a sampling algorithm. This algorithm can clearly trade off accuracy for speed, because if my number of samples is very small I cannot give good guarantees on my performance, but my speedup will be large, for two reasons: first, I'm doing less computation, and second, the traffic is reduced; I'm reporting a smaller amount of data over the network. If the network bottleneck is the issue, even that bottleneck is widened. And this algorithm, because of its randomized nature, is robust to failures. So here is the main result for the algorithm. Now, I typically don't like to put equations on my slides, but given this is MSR, and given that this expression actually tells us something, let's go over it. It says that under our algorithm, if R, the number of samples, is greater than this one over epsilon squared, times log of 2 over delta, times some constant, then we get a probably approximately correct answer. That is, for the estimate of your mean, mu hat, the probability that the absolute value of mu hat minus mu, divided by mu, exceeds a given constant epsilon is no more than delta. And this constant depends upon how regular your underlying sequence is, the number of finite moments of the given sequence. So we're relating the probability of a successful, probably approximately correct answer to the number of moments of the underlying sequence. That is, if the sequence is extremely regular, suppose it has all finite moments, like an exponential distribution, then this constant will be small. If the number of finite moments is small, like three or four, then this constant will be larger. Nevertheless, it's the right asymptotic scaling in the following way.
Suppose I want to estimate the mean of an unknown distribution; there's an unknown distribution from which I can draw samples, and I want such a probably approximately correct estimate of the mean. Then how many samples do I need? We know from Chernoff bounds that the number of samples you need scales approximately like one over epsilon squared, times log of one over delta, and this expression matches almost exactly with our expression. So what this says is that the algorithm we are presenting, although it's a nice, simple, randomized algorithm, is essentially unimprovable beyond constant factors: if you could improve this algorithm, you could improve a classic bound like the Chernoff bound, which would itself be a big result. So this is essentially the best that you can do; the algorithm has the correct asymptotic scaling. So, so far so good. We have this randomized averaging problem, and we have come up with an algorithm that looks like it's giving the right asymptotic scaling. How can we use this in a datacenter network? This is what a large datacenter network looks like. Conceptually it's a big network of thousands of interconnected servers and disks, and the queries, or jobs, come in from outside. They get processed by one, or typically more, of these servers, and the results are reported out. These are just the canonical setups for anybody who needs to process large volumes of data, like terabytes or petabytes of data: Microsoft, Google, anybody. One key feature of these systems is that your job needs to be processed within a certain time, because you're sensitive to delays, and you want to make sure that when the system is underloaded the response times are good, as in a networking setup. So there are several questions about this datacenter. One natural question is what topology is right for these servers, because, as you might be aware, the servers come in the form of racks, and inter-rack communication is much slower than intra-rack communication; so what is the right topology from a networking point of view? That's definitely one problem that I'd like to understand more. But for the purpose of this talk we are going to focus on what functions we can compute in this framework, when I have a datacenter. And the particular computational framework that has emerged over the years is Map-Reduce, which is a distributed computation framework. It's a very simple abstraction of the kind of computations you can do here. There are two functions, a function called map and a function called reduce, both user-defined. Once these functions are specified, the black box takes care of the rest. It works well with thousands of servers; it's extremely scalable. And according to some rather old numbers, Google was processing something like 20 petabytes of data per day. That was January 2008; in May 2011 this must be closer to, what, 50, I would guess. So Map-Reduce is here to stay, and we want to understand what all Map-Reduce can do. If we can speed up at least some applications that use Map-Reduce, then that would be a good thing to do. So this is Map-Reduce. There are two parts to the computation: the map part and the reduce part, and these are user-defined functions. The map part is basically the number-crunching part: you divide the input, give it to your slave servers, and these slave servers will do computations based upon the map function.
The reduce function is a simple combination of the outputs of the slave servers, of the intermediate outputs, and it's typically a counting operation like summation, or some simple operation that just combines the data. Map-Reduce can do other operations that are not of this form, like median computation or percentile computation, but the majority of the applications are summation-based, and even for these other applications, as we're going to see later, although the algorithm has not been analyzed for them, the algorithm can be extended and you could get results. So this is Map-Reduce. Let's go over a simple example of Map-Reduce. Suppose the search query is, let's say, "federal," and as a first, zeroth-level approximation of how important the keyword is, I want to count the number of occurrences of this given keyword in a large dataset. There are these files whose contents are given to me. The map function is going to just read each one of these files and count the number of occurrences, and the reduce function is going to add them up; to introduce some terminology, there's a key called "federal," and the reduce function is going to add all the values, the intermediate outputs, corresponding to this particular key "federal." A simple operation. This is the natural thing you would do if you wanted to distribute this operation. And this reduce function typically outputs just zero or one value, and it does simple operations like summation. So let's get back to our problem of autocomplete suggestions. I have started typing with F, and these are the 11 results according to Bing; this is a snapshot. How do we come up with these suggestions fast? This is what will happen in the datacenter framework. This query starting with F comes to the datacenter, and I want to find the relative ranks of Facebook, Firefox and FedEx over the large query log. One way of doing this would be the brute-force way, that is, count the number of times Facebook was accessed, the number of times Firefox was accessed, and so on. But if all that I'm interested in is the relative ranking, then I don't need to do this brute-force computation. And coming back to the question that was asked a couple of minutes ago, what happens if my data is extremely skewed? That's not going to be a problem if the autocomplete suggestions you're going to give are at least for famous or important queries. Even if we undersample the data by 50 percent, it's believable that the relative rankings are not going to change. You can make that precise mathematically, but that is the idea: the heavy hitters are well represented even under undersampling, and that's what you care about. So this, basically, is a summary of what Map-Reduce is typically used for: word count, or URL access count, or searching for text, or generating a reverse web-link graph, that is, who all is linking to a given web page, to count its importance, or the max of an array; basically all the operations you can do if you can compute a histogram of your given input. That is, corresponding to key one, I want to know how many times key one occurred in my data, and key two, key three, and so on. And if I can compute this histogram, then basically any function that can be computed from a histogram can be computed using Map-Reduce. And getting the histogram is basically a summation operation: I want to count the frequency of each one of these keywords.
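Here is a minimal Python stand-in for the word-count example just described; the real framework handles the distribution, shuffling, and fault tolerance, so this only illustrates the user-defined map and reduce functions and the shape of the data flowing between them (the file contents and names are made up):

```python
from collections import defaultdict

# Toy stand-in for a Map-Reduce word count; a real framework distributes the
# map tasks, shuffles the intermediate (key, value) pairs, and runs reduce.

def map_fn(file_contents):
    """Map: read one file and emit a (key, 1) pair for every word occurrence."""
    for word in file_contents.lower().split():
        yield (word, 1)

def reduce_fn(key, values):
    """Reduce: combine all intermediate values for a key -- here, a plain sum."""
    return key, sum(values)

def run_word_count(files):
    intermediate = defaultdict(list)
    for contents in files:                  # the "map" phase, one task per file
        for key, value in map_fn(contents):
            intermediate[key].append(value)
    return dict(reduce_fn(k, v) for k, v in intermediate.items())  # "reduce" phase

files = ["federal reserve news", "federal budget", "weather today"]
print(run_word_count(files)["federal"])     # -> 2
```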
So most of the operations under Map-Reduce, most of the interesting operations, fall under this category where the reduce function is just a simple combination, which is just a summation. That's what brings us to the motivation that fast averaging can clearly benefit this kind of Map-Reduce setup. Here's what's going to happen in the datacenter. When a query comes in from outside, there is a central server; this will do the reduce operation. The central server has a bunch of slave servers connected to it. There are these files F1, F2, F3, up to FN, and these numbers are query-specific numbers. So suppose my query was "federal": for file F1, the number X1 is the number of times that the query came up in this particular file, and similarly for file 2, file 3, and so on. Typically what happens is that the number of servers is much smaller than the number of files; think of the asymptotic scaling of M as something like the square root of N, so we have a thousand servers and a million files, or even more. What you're interested in is the summation of the XIs, or equivalently the average of the XIs. So let's see how the proposed algorithm for approximate computation in a Map-Reduce setup will work. The query comes in to a slave server. This slave server is going to decide randomly, out of these N files, which files it is going to sample. It's going to sample, in this case, file 2 and file N. There's no approximation beyond that point: it's going to exactly compute the numbers X2 and XN, and it's going to report to the central server two numbers: X2 plus XN, that's the total weight of its contribution, and the number of files that it sampled, that is, two. And the central server, as mentioned before, is going to sum up the XI YI and divide that by the summation of the YI, the number of files you sampled, the total weight that you assigned. And that is going to be your approximate estimate of the mean. So how does this work? Why do we expect this to work? As mentioned before, if the underlying sequence is regular, then the earlier result applies and you can get away with far fewer samples than what is given to you. This makes intuitive sense. Suppose I'm given a uniform random variable between, let's say, 0 and 1. If I'm given a million samples of this, I can say, okay, I don't need a million samples; maybe I can get away with a thousand samples and estimate the mean, and that mean is not going to be too far away from what I would get if I just did the brute-force computation of the mean over the million numbers. So because these numbers are coming from famous queries like Facebook, the heavy hitters are well represented, even under the undersampling. If all we're interested in is the relative ranking of the important results, then it can be quickly computed. So far we had looked at -- by the way, were there any questions so far?
>>: I had one question. X1 to XN, that number N is usually not known when you process those files. How do you distribute those files, the files and servers?
>> Shreeshankar Bodas: Sorry, so these --
>>: That's where you have this number, right?
>> Shreeshankar Bodas: Yeah.
>>: When you are processing your data, do you know what N is?
>> Shreeshankar Bodas: Yeah.
>>: Then how do you distribute those files to different servers? Say, okay, you have N, and you have N different words, for example, and then you use M machines to process, and you choose N files, because N is unknown. How do you do that?
>> Shreeshankar Bodas: Is the concern that this N is not known?
>>: Yes.
Because the problem is, when you have a lot of files, you don't know how many there are, or the values that you have.
>> Shreeshankar Bodas: You mean the number of keywords I have is not known?
>>: Yeah.
>> Shreeshankar Bodas: So this is, I would say, a module for computing how many keywords I have. It's like this: this is an example that I've shown for one particular keyword. But in practice, I understand your concern that the keywords themselves are not known and their relative frequencies are not known. So I want the following. There are two keywords, let's say keyword one and keyword two; there could be many more. I want to count the relative frequencies of these keywords, right? So this is for a particular query. The idea is that you just randomly sample a subset of the files and compute the histogram, or an approximation to the histogram. So you sample a bunch of files, you get an approximate word count, or an approximate count of the number of hits for the given query, and you do this in a distributed fashion on each of the slave servers and merge those histograms. Because each one of those histograms is approximately right, at least for the heavy hitters, when you merge the results, the result remains approximately right. So although you don't know the keywords to begin with, you don't know what you're looking for, you compute the approximate histogram.
>>: Could you approximate the stage, first stage was and is [phonetic], and in the second stage it could do that, an operation.
>>: Those files, N is the number of logs.
>>: So N is logs. So XN is a keyword count.
>> Shreeshankar Bodas: XN is the number of occurrences of my query in the given log file. These are not files on the Internet or anything.
>>: Number of keywords.
>> Shreeshankar Bodas: Number of keywords in the file. That's right.
>>: Number of keywords in the file.
>> Shreeshankar Bodas: Number of keywords corresponding to my interest.
>>: I just don't understand how that -- and N is known.
>> Shreeshankar Bodas: N is known. Sorry for the confusion. Was that a question?
>>: Should that be a vector?
>>: I have a question. This sampling technique: clearly some statistics are relatively stable and some statistics are not; basically, let's just say, average salary, I think.
>> Shreeshankar Bodas: Yeah.
>>: The average figure is actually very sensitive to how you sample it; with very few people accounted for, the final average is going to have a significant swing, but if you come --
>> Shreeshankar Bodas: I understand. So again, going back to the motivation: whether your application allows for an approximate computation is one question.
>>: For many of the complex queries, at least when you just do the query itself, it may not be easy to distinguish how sensitive the query is to that. I mean, let's say I give you two quite complex queries to submit; how do I know whether this query, I mean, basically, when I use the sampling technique, how accurate my result --
>> Shreeshankar Bodas: I see what you're saying. So if you want to compute the mean of a distribution which is heavy-tailed, where there are a few people that have huge values and there are a lot of people that have small values, then you want to make sure that these people with huge values, or huge salaries, are sampled. Otherwise you're going to be totally off your target. So that's true. That's a property of your underlying numbers, and it's a mathematical problem.
You can't deal with it, because what is going to happen in that case is that your underlying sequence is not regular enough; that is, it does not have finite second or third moments, if you were to extend it to a distribution. So that's true: your underlying sequence has to have a certain regularity for these approximate computations to work.
>>: Here you can kind of expect the Xs to be almost identical, very close numbers. The spread could be --
>> Shreeshankar Bodas: The range could be small. And so what happens is, the typical value of the XIs might be, let's say, 100, and the range might be, let's say, plus or minus 10 or 20 or even more; but it's true that only if the XIs are close to their mean can you estimate the mean by some sparse sampling.
>>: Based on the log of yesterday, the previous day.
>> Shreeshankar Bodas: And the XIs are not --
>>: The XIs, a fraction of fairly [inaudible] not using.
>> Shreeshankar Bodas: Yes, but if this log is from 2000, then the distribution --
>>: The XIs are not going to be heavy tailed.
>> Shreeshankar Bodas: That's true. The distribution of the XIs is not going to be heavy tailed for the applications we have in mind. But it could be heavy tailed, in which case you have to be careful while applying the result. That's true. That is the assumption here: not everything can be computed this way, but there are certain things that you can speed up. Yes?
>>: Do you sample the servers, the slave machines?
>> Shreeshankar Bodas: Yes. So every machine samples one of the files with a Bernoulli coin toss.
>>: Every machine?
>> Shreeshankar Bodas: Yes, every machine samples every file with a Bernoulli coin toss. Suppose there are ten servers and 100 files; I'll have ten times 100 coin tosses, and that's going to determine who samples which file.
>>: [inaudible] this is just a picture of one slave.
>> Shreeshankar Bodas: This is a picture of one slave server.
>>: You have this many files relying on this machine?
>> Shreeshankar Bodas: Yes. Yes?
>>: The queries that you talk about, by their nature they have an argmax at the end; like in the [inaudible] case you don't just want the averages, you want the argmax, and that's helping you a lot because it's orienting you toward the head of the distribution. Do you think you could somehow abstract out, sort of talk about, at a second layer, what are the types of queries for which you'll have these nice properties versus others? If you're doing min, obviously you're screwed because [inaudible]. Is there any way to organize the space of queries to think of classes that would be well served, so you could analyze each class?
>> Shreeshankar Bodas: So that's a good question. I would say that clearly you have to have heavy hitters; otherwise it's not going to work. The approximation works for autocomplete, for these kinds of things that are not sensitive to one or two entries skewing the distribution. What other kinds of queries? Off the top of my head, let's see: if I want to compute PageRank, then I want to count the number of incoming links as a basic building block. Suppose I take a web page like the New York Times; if I approximately compute the number of incoming links, even that is going to be good enough, because there are going to be millions of links pointing to it. If somebody does not have enough links pointing to them, then even if I do an exact computation, that person, or that site, is not important. So maybe for these applications it can be used. Was there another question? Okay.
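Before the dynamic setting, here is a rough Python sketch that pulls together the sampled Map-Reduce scheme described above: each slave flips a Bernoulli coin per file, exactly counts keywords in the files it sampled, and reports a partial histogram together with the number of files it sampled; the master merges the partial histograms and scales them back up. The keyword list, sampling probability, and toy data are illustrative, not the analyzed system.

```python
import random
from collections import Counter

def slave_task(files, keywords, p, rng):
    """One slave: Bernoulli-sample files with probability p, count keywords
    exactly in the sampled files, and report (partial_counts, files_sampled)."""
    counts, sampled = Counter(), 0
    for contents in files:
        if rng.random() < p:                # coin toss per (slave, file) pair
            sampled += 1
            for kw in keywords:
                counts[kw] += contents.lower().count(kw)
    return counts, sampled

def master_estimate(reports, n_files):
    """Master: merge the partial histograms, then scale the per-sampled-file
    averages back up to estimated totals over all n_files."""
    total, sampled = Counter(), 0
    for counts, s in reports:
        total.update(counts)
        sampled += s
    if sampled == 0:
        return Counter()
    return Counter({kw: c * n_files / sampled for kw, c in total.items()})

# Toy usage: 3 slaves, each holding a slice of the query-log files.
rng = random.Random(0)
keywords = ["facebook", "firefox", "fedex"]
all_files = [f"facebook query {i}" if i % 3 else f"firefox query {i}" for i in range(300)]
slices = [all_files[0:100], all_files[100:200], all_files[200:300]]
reports = [slave_task(s, keywords, p=0.2, rng=rng) for s in slices]
print(master_estimate(reports, n_files=len(all_files)).most_common())
```

Even though only about a fifth of the files are touched, the heavy hitters keep their relative order in the merged estimate, which is the property the talk relies on.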
So far we have seen this approximate computation in a static setup; now we're going to briefly go over what happens if there are system dynamics. So far, what we have talked about assumes all the servers are empty; I just wanted to see what happens if I submit a given query, what happens to its completion time, and how many samples I need. It could happen that these servers are already loaded by some earlier queries. That is, I have this yellow query coming in from outside, and it brings in a certain amount of work for each one of these servers, because server one is going to sample a subset of the files, and server two is going to sample a subset of the files, and so on. It's like there's a queue in front of these servers. Every incoming query is going to bring in some amount of work, and these servers could have earlier queries that are not yet complete. Suppose this is the amount of residual work that is there from the previous queries. So this query comes in; it brings in two units of work, or two packets, into the first queue, and so on. These files, as I mentioned, are the search query logs. So what happens in this dynamic setup? Is our system, first of all, stable? Does it give good queueing performance, good delay performance? And how can we extend the algorithm, or how can we even analyze it, in a dynamic setup? There is one simple way of bounding the response time that nevertheless looks useful in this case. Let's assume that I'm going to slightly modify the system and batch the incoming queries into units of time T. So these are some of the green queries and the blue queries and so on, and I'm going to give precisely T units of time for each batch of queries. Remember that because there is underlying randomness in the algorithm, even if the number of queries per unit time is constant, I'll have a random amount of workload coming into the system. I'm going to say, okay, you are given T units of time for your overall processing to complete; if there is some amount of work that is left in the system, then I'm going to just terminate all that work and start processing the new queries. This brings in two points. First of all, it gives deterministic delay guarantees for the response time, but it might additionally introduce some errors in your estimate, on top of the sampling error, because of this termination event. It turns out that even this system can be analyzed in more or less closed form, so you can get expressions for things like the probability of error. In the earlier case this was a fixed number; now it is a function of epsilon, the confidence you want; T, the amount of time you give; and the sampling parameter, p. This function F can be computed in closed form, but I'm not putting it up here. The main message is that you can trade off accuracy, response time, and confidence. If you put T equal to infinity, then you get the earlier result back. But even if you're constrained by how much time you can afford to give a particular job, you can analyze the trade-off between your confidence and your response time.
>>: Would the [inaudible] norm, for example, come into --
>> Shreeshankar Bodas: Right. So that's a good question. What happens here is that because we are doing this binomial kind of sampling, and because a binomial random variable concentrates sharply around its mean, with variance of the same order, there is not much variation here.
But it is true that if your sampling is a burstier scheme, if it's not just a Bernoulli coin toss, if it's more complicated sampling, then the burstiness of your arrival process will determine how your function F changes.
>>: I understand this logic of T. What if each server did its sampling with a probability based on its load, or what link --
>> Shreeshankar Bodas: That's load-dependent sampling, you're saying?
>>: Each or some, that's right, with your distribution. So server one just picks two files, because it's [inaudible], and server two is lightly loaded and picks five files.
>> Shreeshankar Bodas: That's a good point. That would even out the load in a better way; that would give a sharper result as a function of the loading. But we have not analyzed that. I should think about the sampling strategy. Okay. So this is what we have seen so far. Map-Reduce is a simple distributed computing framework where, for most of the cases of interest, the reduce function, at least in its exact form, is a summation. We believe this summation can be approximated, at least in some settings, which gives us, first of all, network congestion benefits and also completion-time benefits, and there are simple randomized algorithms for mean computation that are robust and fast and that can trade off accuracy, response time and confidence. Again, not everything can be sped up this way; if you are extremely sensitive to your estimation error, then this kind of approach is not the right one. But there are certain applications for which this kind of randomized calculation of the mean is good enough, and it saves you computation time. So let's see if this idea of mean computation can be extended beyond these Map-Reduce type computations. There's a whole community that worries about consensus on graphs. You can think of the graph as a communication graph. These could be, let's say, sensors in a field, and these XIs are the numbers: each XI is a real number, every node has a real number, and you want to compute the summation of the XIs at each of these nodes. So why is this problem interesting, why do we care about it? It could be, for example, that these are some unmanned vehicles or robots that are moving in a certain direction and they just want to compute a consensus, that is, "I'm going in this particular direction." Or this consensus can be used in CSMA-type applications, where it is a black box for doing more complicated scheduling operations. Or these could be, as I mentioned, sensors in a field, and you want to measure the temperature. The underlying feature is that these XIs, at least in the sensor-network setup, could be correlated with each other: the temperatures in a given room are not going to be that different from each other, at least we can expect that. If that assumption is valid, then you can use these ideas of approximate mean computation to quickly count them, or quickly measure the average of the XIs. So the idea is simple. We know that I can take a random walk on this graph according to the graph structure; that is, an edge represents the possibility that if I'm on this node at time 0, I can go to one of its neighbors at time 1. So I'm going to take a random walk on this graph, and I'm going to have a token with me. This token is going to sample the values of the XIs as and when it visits those nodes, not all the time, but, let's say, every T-mix steps, where T-mix is the mixing time of the underlying Markov chain. I'm going to take a random walk; a rough sketch of this token idea is given below.
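A rough sketch of that token idea, under an assumption made explicit in the comments: a lazy random walk on a regular graph (such as the ring discussed in a moment) has a uniform stationary distribution, so recording the node's value once every mixing time's worth of steps gives roughly uniform samples of the XIs. The graph, the mixing-time setting, and the number of samples below are illustrative, not the analyzed scheme.

```python
import random

def token_average(neighbors, values, t_mix, num_samples, seed=0):
    """Token-based averaging via a lazy random walk on a graph.

    neighbors[v] lists the neighbours of node v, and values[v] is the number
    x_v held at node v.  The token walks the graph and records the value at
    its current node once every t_mix steps; the average of those samples is
    the estimate of the mean.  This assumes the walk's stationary distribution
    is (close to) uniform, which holds for a lazy walk on a regular graph.
    """
    rng = random.Random(seed)
    node = rng.randrange(len(values))
    samples = []
    while len(samples) < num_samples:
        for _ in range(t_mix):                 # let the walk mix between samples
            if rng.random() < 0.5:             # lazy step: stay put half the time
                continue
            node = rng.choice(neighbors[node])
        samples.append(values[node])           # one (roughly) uniform sample
    return sum(samples) / len(samples)

# Toy usage: a ring of n nodes holding values concentrated around 20.0.
# t_mix should be on the order of the walk's mixing time (about n^2 for a ring).
n = 100
ring = {v: [(v - 1) % n, (v + 1) % n] for v in range(n)}
vals = [20.0 + random.gauss(0, 1) for _ in range(n)]
print(sum(vals) / n)                                      # exact average
print(token_average(ring, vals, t_mix=n * n, num_samples=200))
```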
And if the underlying Markov chain is nice and carefully constructed, in particular if it has a uniform stationary distribution, then I'm sampling from the uniform distribution over the XIs. And that is the main idea here: you have this token, it's just going to take a random walk, and as it passes over the graph, it's going to collect samples and average them. We have some theoretical results for this part as well. You can show that for certain graph structures, like the ring graph, this approximate computation of the mean can improve on the earlier known bounds, but, of course, you have to assume that the underlying XIs have a certain distribution; they are not completely arbitrary numbers. So this is just one small application showing that these ideas of approximate mean computation can also be used in other fields. So this is what I'd like to work on in the future as regards the Map-Reduce computations. One thing, and I'm sure all of you have already asked this question, is what happens on real data; that is, it's all okay if it works in theory, but does it work in practice? That's one thing I'll be focusing on. Another question is smarter randomized algorithms. This is the basic vanilla version of the algorithm, where it's all uniform sampling; maybe you want to do smarter things because there are network bottlenecks, or because it's a heterogeneous environment where server speeds are inherently unequal, or there could be other smarter things you can do to save on sampling. So these are a couple of directions that I want to take in the future. Here is the summary of what we have talked about. The main message is that if the underlying sequence is regular, then mean computation is an easy problem: your numbers are nicely concentrated around the mean, so just sample a few of them and you'll get a good estimate of the mean. This idea can be used in a datacenter framework, in a Map-Reduce setup, where Map-Reduce does simple mathematical operations like averaging that have this feature that the reduce function is a simple summation, and then that can be sped up. And this randomized algorithm for fast averaging gives you a trade-off between accuracy and completion time, which is one direction in which, at least as far as I know, the trade-off has not been studied; it's one more dimension in your toolbox. So that's it. Thank you. If you have any questions, I'll be glad to answer.
>>: If your canonical versions are essentially 0-1 variables, it seems to be standard random sampling, the kind that's been made famous.
>> Shreeshankar Bodas: It is. It is exactly that, but your XIs -- exactly. If the XIs are just coming from 0, 1, or any finite support, then given lots of samples, it's not going to benefit my -- there's just a diminishing return to how accurately you can estimate, that's true, exactly. We're not doing something that is mathematically heavyweight or anything; it's just a nice observation that can be used in these setups to reduce the completion time. Any other questions? All right, then, thank you.
[applause]