
>> Sudipta Sengupta: I am Sudipta Sengupta, and it is my pleasure to introduce
Shreeshankar Bodas, who goes by the name of Shree. Shree is currently a
post-doc at Laboratory for Information and Decision Systems at MIT.
The LIDS lab is the same lab which is home to folks like Robert Gallager,
Dimitri Bertsekas and John [inaudible].
Shree did his Ph.D. in ECE from the University of Texas at Austin in September of 2010 and
his research interests include resource allocation in data centers and wireless
scheduling algorithms.
Today he will talk about his work on fast averaging, which appeared in this year's
ISIT, International Symposium on Information Theory, and he will also point out
applications to Map-Reduce. I'll hand it over to Shree.
>> Shreeshankar Bodas: All right. Thanks for the introduction Sudipta.
And good morning, everyone, and thanks for attending this talk on fast averaging
with applications to Map-Reduce and consensus on graphs.
So before we begin, let me tell you that most of this talk is going to be about
applications to Map-Reduce. And after the main body of the talk we are going to
also consider the consensus on graphs as an application of fast averaging.
This is joint work with Devavrat Shah. So the particular problem I'd like you to
keep at the back of your minds throughout this talk is this problem of
autocomplete suggestions for a search engine. This is a snapshot from Bing: I
started typing a query with F, and these are the suggestions that Bing gave me --
Facebook, Firefox, Foxwoods, and so on. And so the question is how does a
search engine give me such queries? These are based on the query logs in the
recent past by different users. And based on the value of a given keyword, these
are going to be presented to me in some order.
So we want to determine, in an efficient, fast, and online way, how to come
up with these suggestions. So this is the application that has motivated us.
So this application has certain features, as we're going to see. First of all, it
is possible to approximate the computation that goes on behind it without
hampering the results. So this is the particular application that I would like you to
keep at the back of your minds.
So let's see what happens when someone starts typing with F and waits for the
autocomplete to suggest something.
So this is the picture of the back end of the system, where there is, let's say, a
master server. There is a bunch of slave servers, and these are the
different query logs, in the form of files.
So when a query comes in, this master server wants to know -- let's say, for the
time being, that there are only three possibilities here: Facebook,
Firefox and FedEx. I want to see the relative ranking, or relative importance, of
these keywords based on the previous search queries.
So a simple number that I want is the following: of the recent search queries, how
many of them contain the word "Facebook," how many contain the word
"Firefox," and how many contain "FedEx"? Based on those results I'll rank them and present
them to the user. We want to speed up this operation. That's the way I'm going to
present the fast averaging algorithms to you, and we are going to see how they
are applicable in this setup.
Because this is essentially a counting problem. You want to count the frequency
of something and that can be approximate.
So at the heart of the problem is the following mathematical operation: I'm
given a long vector of N numbers -- think of N as large, because I have these huge
query logs and I want to search over them fast. I have this long vector of N
numbers, and what I'm interested in is this quantity, the sum of the Xi's. So Xi represents the
number of times a given keyword, let's say Firefox, came up in the search queries.
It's 0, 1, 2, 3 -- the number of times --
>>: Do you know N ahead of time? If you do, then the average and the sum are the same.
>> Shreeshankar Bodas: They are the same problem, yes. If you know N, then
the sum and the average are the same problem. But for mathematical reasons we're going to
have this one-over-N in front, because we want a multiplicative approximation error as
opposed to an additive error. So, as you said, they're the same problem.
We're interested in this summation, and equivalently we're interested in 1/N
times this summation. So mathematically this problem is not hard: it just asks
you to add up all these N numbers and divide by N.
And you can do it in a distributed fashion, as I've shown here. There are these
N numbers; suppose I've got 100 servers that can do the task. Then I'm
going to give, let's say, one-hundredth of my numbers to the first
server, one-hundredth to the second server, and so on up to the hundredth server.
These guys are going to compute the intermediate sums, and then they'll report the
sums to the central guy. The central guy will add up all the hundred numbers
and then divide by N. And that's your mu.
So this is the basic simplest algorithm you can do. And right now I have shown
here two levels of hierarchy, but it could have more levels of hierarchy.
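The two-level scheme just described can be sketched as follows. This is a minimal in-memory sketch; the round-robin sharding and the server count are illustrative assumptions, not the actual system:

```python
import random

def exact_distributed_average(numbers, num_servers=100):
    """Exact two-level averaging: each 'server' sums its shard of the
    input, then a central server adds the partial sums and divides by N."""
    n = len(numbers)
    # Round-robin sharding: roughly 1/num_servers of the numbers per server.
    shards = [numbers[i::num_servers] for i in range(num_servers)]
    partial_sums = [sum(shard) for shard in shards]  # done in parallel in practice
    return sum(partial_sums) / n

data = [random.randint(0, 10) for _ in range(10_000)]
mu = exact_distributed_average(data)  # exactly sum(data) / len(data)
```

Because addition is associative, the same sketch extends to more levels of hierarchy by summing the partial sums in groups before the final division.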
So mathematically the problem is not hard. What makes the problem hard is
the data size and your delay constraints: you want to respond quickly, and
you want to process large volumes of data.
Nevertheless, so this deterministic algorithm has certain advantages and certain
issues that we need to address. So the advantages are first it returns an exact
answer, which is always a good feature of any algorithm.
It's clearly a distributed algorithm: as I've shown here, it can be done in parallel
by a hundred servers, or multiple servers in a multi-level hierarchy, because the
addition operation is inherently a distributed operation. So this algorithm can be
implemented in a distributed form.
What are some of the issues with this algorithm? The first issue is complexity, or
latency. We know that if you want to sum up N numbers, then there is a
Theta(N) complexity bound: you cannot do with fewer than N minus 1
additions if you want the exact answer in the worst case.
But the point here is that you may not care for the exact answer,
because this addition, or averaging, operation can be a means to an
end. Even if my answer to this averaging problem is not exact, the
final thing I'm interested in, which is the ranking of my results, may still be good.
So if I'm willing to trade off some accuracy for speed, then I want to beat this
latency bound, because this Theta(N) may be prohibitively large. That's one
issue we need to address. Another issue is robustness.
If one of these servers is slow, or one of these servers fails at its task, then in the
implementation I've shown here, first of all, the bottleneck is the slowest
server. I cannot decide on the fly that some server should not report,
because I'm interested in the exact answer in this algorithm.
Or if a server fails, then somebody else has to complete its task, and that's
going to affect your answer or add to your network bottlenecks. So complexity and
robustness are a couple of issues that we need to address for this algorithm.
So let's look at the issue of latency in a little more detail. So suppose I've got
these file servers and I have a bunch of numbers I want to compute the average
of. This is the central server who is going to do the final averaging operation.
So what happens here is suppose there is this server that is the slowest server.
It's always the case that the slowest server determines when your answer is ready.
If this guy is overloaded -- because there's some background load in the
datacenter, it is doing some other operations, or there are some network
bottlenecks -- and this guy is too slow to respond, then although all the other
response times are in single-digit milliseconds, this guy takes 50 milliseconds. Your
time to finish the operation is, in this case, 54 milliseconds.
But, hey, what happens is if I'm not interested in the exact answer, maybe I can
say that, okay, forget this guy. I'm going to compute the average of the rest of
the people and suitably normalize it, and maybe my answer is going to be
approximate, but my time to finish is 13 milliseconds for the numbers I have
chosen here.
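The straggler-dropping idea can be sketched like this. The response tuples, the deadline, and the timings are illustrative numbers in the spirit of the example on the slide, not real measurements:

```python
def average_ignoring_stragglers(responses, deadline_ms):
    """responses: list of (partial_sum, count, finish_time_ms), one per server.
    Keep only servers that finish by the deadline, then renormalize by the
    number of values actually included -- an approximate, faster answer."""
    on_time = [(s, c) for (s, c, t) in responses if t <= deadline_ms]
    total = sum(s for s, _ in on_time)
    count = sum(c for _, c in on_time)
    return total / count

# Three fast servers and one 50 ms straggler: waiting for everyone costs
# ~50 ms, while dropping the straggler answers in ~13 ms.
responses = [(500, 100, 9), (510, 100, 12), (495, 100, 8), (2000, 100, 50)]
approx_mu = average_ignoring_stragglers(responses, deadline_ms=13)
```

Note the renormalization by `count`: dropping a server changes the denominator too, so the estimate stays an average rather than a scaled-down sum.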
This is a recurring theme. As we know from the simple M/M/1 queue Markov chain
analysis, if you are very close to capacity and you slightly reduce the load, then
your delay comes down by a large amount.
So that's the same thing that is happening here; that if you reduce the number of
servers that you need to respond by, let's say, one percent or five percent, then
you get dramatic speedup because these are the slowest servers that are going
to determine how fast your answer is really.
It's true that the slowest server is always the bottleneck, but you're making the
bottleneck wider in this case. And for all we know, that may not be a bad thing for the
averaging operation, because our numbers are not just arbitrary numbers;
they could have some correlation between them. So it depends upon the
structure of the underlying data and how much error you are willing to tolerate;
that determines how much error you're going to get in your final ranking.
So suppose, in the extreme case, I know that the Xi's, the underlying numbers I
want to average, are all equal. Suppose all of them are equal to some number. I
don't know what they're equal to, but in this extreme case I can
just sample one of them, and that's it -- that's my final answer.
So this complexity bound of Theta(N) is actually a worst-case bound. That is,
the given numbers can be arbitrary, with no distribution underlying them.
But if the numbers exhibit some regularity, then I can get away with a smaller
amount of sampling. That's the main point here.
One thing I would like to mention here: this business of sampling with or
without replacement has been around forever -- people have beaten it to
death -- and you can find classic large-deviations results on how many samples
you need if you want to estimate the mean within plus or minus epsilon.
But most of those results -- in fact, I should say all the results that are known, at
least in classic textbooks -- are in terms of the entries of the given vector.
As opposed to that, we are presenting results in terms of the moments of the
underlying sequence.
So it's a different take on the problem. And for the Map-Reduce kind of problems
that we are interested in, this is a new dimension in which you can trade off
accuracy for speed, which, if the averaging operation is a means to an end, can
be used without affecting the final result. So here
is the summary of our contribution to the problem.
We propose a simple randomized algorithm for averaging, and we analyze its
performance. We analyze the trade-off between accuracy and latency. Basically
we're counting latency in terms of how many samples do you need for getting an
approximate answer.
We improved job completion times of Map-Reduce applications using these
ideas from fast averaging. And the main intuition I'd like you to take away from
this talk -- forget about the actual system, forget about the exact result -- is this:
if the numbers are regular, concentrated at least in an approximate
sense around their mean, then the mean computation is an easy
thing. I don't have to sample each and every one; if I can sample, let's say,
half of them, or root-N of them, even that can give me a good answer.
So without keeping you in any more suspense, here's a centralized
version of our algorithm. It's a very simple algorithm -- it's the first thing that
would come to your mind.
Given this sequence of N numbers, I'm going to pick
any R out of them. R is a parameter that we'll choose later. But pick a
fraction of these numbers, average them, and report their average. So this is
a very simple algorithm.
One slight modification that I would like to make to the final version of the
algorithm is that instead of picking R numbers out of N, let's pick every number
with some probability. So that this becomes a distributable algorithm. Because if
I had to pick R numbers out of N deterministically, then it's difficult to distribute
the choices among different servers.
So the algorithm is this: pick every number independently at random with some
probability, according to some random variable, and report the average of the picked numbers.
The advantage of this approach is that each number is going to be picked with
some probability, but the probability that you pick a given number can be close to 0.
Because this number N is large, I will still be averaging enough of them, but I can
substantially reduce the amount of sampling I need.
So this is the algorithm. Take a subset, average it, and report its average as a
sample. Mathematically, this is what is going on here.
These are the given N numbers, X1, X2, ..., XN. This is a fixed, given underlying
sequence -- the number of occurrences of your keyword in each file. This
is not random; the randomness comes from the algorithm.
The Y1, Y2, ..., YN are our random variables -- these are Bernoulli random variables
or binomial random variables -- and you're going to sample each number Xi,
Yi times. You can think of the Yi's as weights.
Compute the weighted sum of the Xi's, divide by the total weight you put into the
system, and that is your estimate of mu. That is your mu hat.
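A minimal sketch of this estimator with Bernoulli weights Yi in {0, 1}; the helper name and the sampling probability `p` are illustrative assumptions:

```python
import random

def sampled_mean(xs, p, rng=random):
    """Estimate the mean of xs: pick each entry independently with
    probability p (weight Y_i in {0, 1}) and report
    sum(Y_i * X_i) / sum(Y_i)."""
    weights = [1 if rng.random() < p else 0 for _ in xs]
    total_weight = sum(weights)
    if total_weight == 0:  # vanishingly unlikely when N * p is large
        return None
    return sum(w * x for w, x in zip(xs, weights)) / total_weight
```

With `p = 1` every weight is 1 and the estimate is exact; a smaller `p` trades accuracy for fewer samples, which is exactly the accuracy-versus-latency knob of the talk.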
So this is a nice, simple algorithm. As presented here it's a centralized
algorithm, but it's clear it could be implemented in a distributed way: because of the
binomial random variables, and because the choices are independent, you can
distribute this algorithm over multiple servers. So here are some other features
of this algorithm.
Although it's a centralized algorithm as -- you had a question?
>>: You need to distribute the summation of the Yi's.
>> Shreeshankar Bodas: Exactly.
>>: So --
>> Shreeshankar Bodas: These Yi's -- so think of it as follows. Xi is one of
the entries in the given vector. I've got this bunch of slave servers that are
going to do the computation, that are going to compute or aggregate the
Xi's. For every server and every file, I flip a
Bernoulli coin and decide whether to sample that file. Then the total number of servers
sampling a given file is binomial. In that way the algorithm is distributable.
>>: Is there a hidden assumption that the variables are 0 and 1? Because if
they're widely skewed -- like if you try to estimate the mean amount of money in
Medina, with Gates there -- one missing sample can throw you totally out of whack.
>> Shreeshankar Bodas: That's true. That's true. So the assumption is this.
Let's talk about what we are going to do with these samples. We are
interested in giving autocomplete-style suggestions. So even under the
undersampling, heavy hitters like Firefox and Facebook are not going to be
underrepresented. Let's say there are a million entries and thousands of Facebooks. If I
sample at 50 percent, then the relative ranking of Facebook and Firefox will, with
high probability, be what I want. If there was some obscure entry, that's not going
to be represented, but that's okay.
Okay. So this algorithm can be distributed -- that's the good news. This
algorithm is an online algorithm: I don't need to know the system statistics
or anything. If they are available, they can be used, but it can nevertheless be an online
algorithm. I don't need to know the statistics of anything else; it's just a sampling
algorithm. This algorithm can clearly trade off accuracy for speed, because if my
number of samples is very small, I cannot give good guarantees for my accuracy.
But my speedup will be large, for two reasons: first, I'm doing less computation,
and second, the traffic is reduced -- I'm reporting a smaller amount of data over the
network. If the network bottleneck is the issue, even that bottleneck is widened. And
this algorithm, because of its randomized nature, is robust to failures.
So here is the main result about the algorithm. Now, I typically don't like to put
equations on my slides. But given this is MSR, and given this expression actually
tells us something, let's go over it.
It says that under our algorithm, if R, the number of samples, is
greater than (1/epsilon^2) * log(2/delta) times some constant,
then we get a probably approximately correct answer.
That is, the probability that the relative error of your estimate -- mu hat minus mu,
divided by mu -- exceeds a given constant epsilon is no more than delta.
And this constant depends upon how regular your underlying sequence is -- the
number of finite moments of the given sequence. So we're relating the
probability of a successful, probably approximately correct answer to the number
of moments of the underlying sequence.
That is, if this sequence is extremely regular, suppose it has all finite moments,
like it's an exponential distribution, then this constant will be small.
If the number of finite moments is small, like three or four, then this constant will
be larger. Nevertheless, it's the right asymptotic scaling in the following way.
Suppose I want to estimate the mean of an unknown distribution -- there's an
unknown distribution from which I can draw samples, and I want such a probably
approximately correct estimate of the mean.
Then how many samples do I need? We know from Chernoff bounds
that the number of samples you need scales approximately like
(1/epsilon^2) * log(1/delta), and our expression matches this almost exactly.
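As a rough calculator for the bound just stated. The constant `c` stands in for the unspecified moment-dependent constant from the result; `c = 1.0` below is a placeholder assumption, not the actual constant:

```python
import math

def samples_needed(epsilon, delta, c=1.0):
    """R >= c * (1/eps^2) * log(2/delta) samples suffice for a probably
    approximately correct relative-error estimate of the mean.
    Note that the bound does not depend on N, the length of the vector."""
    return math.ceil(c * (1.0 / epsilon ** 2) * math.log(2.0 / delta))

# 10% relative error with 99% confidence, independent of N:
r = samples_needed(epsilon=0.1, delta=0.01)  # 530 (for c = 1)
```

The key qualitative point is visible in the formula: tightening epsilon is quadratically expensive, while tightening delta is only logarithmically expensive, and N never appears.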
So what this says is that the algorithm we are presenting, although it's a nice,
simple, randomized algorithm, is essentially unimprovable beyond constant
factors. If you could improve this algorithm, you could improve a classic
bound like the Chernoff bound, which itself would be a big result.
So this is essentially the best that you can do; the algorithm has the
correct asymptotic scaling. So far, so good. We have this randomized
averaging problem and we have come up with an algorithm that looks like it's
giving the right asymptotic scaling.
So how can we use this in a datacenter network? This is what a large
datacenter network looks like. Conceptually, it's a big network of thousands of
interconnected servers and disks, and the queries, or jobs,
come in from outside. They get processed by one or, typically, more of these
servers, and the results are reported out.
These are just some canonical situations for anybody who needs to process
large volumes of data -- terabytes or petabytes -- like Microsoft, Google, and so on.
So one key feature of these systems is that your job needs to be processed in
a certain way, because you're sensitive to delays and you want to make sure
that the response times are good.
So some of the questions in this datacenter: one natural question is what
topology is right for these servers, because, as you might be aware, the
servers come in the form of racks, and inter-rack communication is much slower than
intra-rack communication. So what is the right topology from a networking
point of view? That's definitely one problem worth understanding.
But for the purpose of this talk we are going to focus on what functions we can
compute in this framework, when I have a datacenter. And the particular
computational framework that has emerged over the years is Map-Reduce, which
is a distributed computation framework.
It's a very simple abstraction of the kind of computations you can do here. So
there are two functions, function called map and function called reduce,
user-defined functions. Once these functions are specified, the black box takes
care of the rest. It works well with thousands of servers; it's extremely scalable.
And according to some of these rather old numbers Google was processing
something like 20 petabytes of data per day. This is January 2008 -- in
May 2011 this must be closer to, what, 50, I would guess.
So Map-Reduce is here to stay and we want to understand what all Map-Reduce
can do. If we can speed up at least some applications that use Map-Reduce,
then that would be a good thing to do.
This is Map-Reduce. There are two parts to the computation: the map part and the
reduce part, and these are user-defined functions. The map part is basically the
number-crunching part: you divide the input, give it to your slave servers, and
these slave servers do computations based on the map function.
The reduce function is a simple combination of the outputs of the slave servers --
the intermediate outputs -- and it's typically a counting operation like summation,
or a simple operation that just combines the data.
Map-Reduce can do other operations that are not naturally distributed -- you can do
median computation or percentile computation -- but the majority of the
applications are summation-based. And even for those other applications,
as we're going to see later, although the algorithm has not been analyzed for
them, it can be extended and you could get results.
This is Map-Reduce. Let's go over a simple example. Suppose
the search query is, let's say, "federal." As a first, zeroth-level
approximation of how important the keyword is, we want to count the number of
occurrences of this given keyword in this large dataset.
There are these files whose contents are given to me. The map function is
going to read each one of these files and count the number of occurrences.
And the reduce function -- let's introduce some terminology: there's
a key, called "federal," and the reduce function is going to add up all the values, the
intermediate outputs, corresponding to this particular key.
Simple operation. This is the natural thing you would do if you want to distribute
your operation. And this reduce function typically outputs just zero or one value, and
it does simple operations like summation.
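The word-count example can be sketched in a few lines. This is a toy in-memory sketch, not an actual Map-Reduce job, and the file contents are made up:

```python
from collections import Counter

def map_fn(file_contents):
    """Map: emit (word, count) pairs for one file."""
    return Counter(file_contents.split())

def reduce_fn(mapped_outputs):
    """Reduce: sum the intermediate values for each key."""
    totals = Counter()
    for counts in mapped_outputs:
        totals.update(counts)
    return totals

files = ["federal reserve news", "federal budget", "weather news"]
counts = reduce_fn(map_fn(f) for f in files)  # counts["federal"] == 2
```

Here the reduce step is exactly the simple summation the talk describes: for each key, it just adds up the per-file counts.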
So let's get back to our problem of autocomplete suggestions. I have
started typing with F, and these are the 11 results according to Bing -- this is a snapshot.
So how do we come up with these suggestions fast? This is what will happen
in the datacenter framework. This query starting with F comes to
the datacenter. I want to find the relative ranks of Facebook, Firefox and FedEx
over the large query log. One way of doing this would be to compute the brute
force way, that is count the number of times Facebook was accessed and the
number of times this was accessed and so on.
But if all that I'm interested in is the relative ranking, then I don't need to do this
brute-force computation. And coming back to the question that was asked a
couple of minutes ago -- what happens if my data is extremely skewed? That's
not going to happen if the autocomplete suggestions you're giving are at
least for famous or important queries. Even if we undersample the data by 50
percent, it's believable that the relative rankings are not going to change. You
can make this precise mathematically, but that is the idea: the heavy hitters are
represented even under undersampling. That's what you care about.
So this is basically a summary of Map-Reduce and what it's typically used for:
word count, URL access count, searching for text, or
generating the reverse web-link graph -- who all is pointing to a given web page, to
count its importance -- or the max of an array. Basically, these are all the operations you
can do if you can compute a histogram of your given input: corresponding to
key one, I want to know how many times key one occurred in my data, and likewise
key two, key three, and so on. And if I can compute this histogram, then
basically any function that can use a histogram can be computed using Map-Reduce.
And getting the histogram is basically a summation operation. I want to count the
frequency of each one of these keywords. So most of the operations under
Map-Reduce, most of the interesting operations fall under this category where
reduce function is just a simple combination, which is just a summation. That's
what brings us to the motivation that fast averaging can clearly benefit this kind of
Map-Reduce setup.
Here's what's going to happen in the datacenter. When a query comes in from
outside, there is a central server. This will do the reduce operation. The central
server has a bunch of slave servers connected to it. There's these files F1, F2,
F3, FN and these numbers are query-specific numbers.
So suppose my query was "federal." For file F1, the number X1 is the number of
times the query came up in that particular file; similarly for file 2, file 3, and so on.
Typically the number of servers is much smaller than the number of
files; think of the asymptotic scaling of M as the square root of N. So we have
a thousand servers and a million files, or even more. What you're interested in is the
summation of the Xi's, or equivalently the average of the Xi's. So
let's see how the proposed algorithm for approximate computation
will work in this Map-Reduce setup.
A query comes in to a slave server. This slave server is going to decide randomly
which of these N files it is going to sample -- in this
case, file 2 and file N. There's no approximation beyond that point.
It's going to exactly compute the numbers X2 and XN, and it's going to report
two numbers to the central server: X2 plus XN -- the total weight of my
contribution -- and the number of files that I sampled, which is two.
And the central server, as mentioned before, is going to
sum up the Xi * Yi and divide by the summation of the Yi -- the number of files you sampled,
the total weight you assigned. And that is going to be your approximate
estimate of the mean.
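Putting the pieces together, the sampled Map-Reduce flow might look like this. A sketch under stated assumptions: the per-file counts and the probability `p` are illustrative, and the helper names are hypothetical:

```python
import random

def slave_report(file_counts, p, rng=random):
    """Each slave flips a Bernoulli(p) coin per file, exactly counts the
    sampled files, and reports (partial_sum, files_sampled)."""
    sampled = [x for x in file_counts if rng.random() < p]
    return sum(sampled), len(sampled)

def central_estimate(reports):
    """Central server: sum(Y_i * X_i) / sum(Y_i) over all slave reports."""
    total = sum(s for s, _ in reports)
    weight = sum(c for _, c in reports)
    return total / weight if weight else None
```

Each report is just two numbers, so the sampling reduces both the computation on the slaves and the traffic back to the central server.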
So how does this work? Why do we expect this to work? As mentioned
before, if the underlying sequence is regular, then the earlier result applies and
you can get away with far fewer samples than what is given to you. This makes
intuitive sense. Suppose I'm given a uniform random variable between, let's say,
0 and 1. If I'm given a million samples of it, I can say, okay, I don't need a million
samples; maybe I can get away with a thousand samples and estimate the mean.
And that mean is not going to be too far from what I would get if I just did
the brute-force computation of the mean over a million numbers.
So because these numbers are coming from famous queries like Facebook,
the heavy hitters are well represented, even under the undersampling. So if all
we're interested in is the relative ranking of the important results, then it can be
quickly computed.
So far we had looked at -- by the way, so were there any questions so far?
>>: I had one question. X1 to XN -- that number N is usually not known when you
process those files. How do you distribute those files among the servers?
>> Shreeshankar Bodas: Sorry, so these --
>>: Is that how you have this number, right?
>> Shreeshankar Bodas: Yeah.
>>: When you are processing your data, do you know what N is?
>> Shreeshankar Bodas: Yeah.
>>: Then how do you distribute those files to different servers? Say you
have N files and N different words, for example, and then you use M
machines to process them and you choose N files -- but N is unknown. How do
you do that?
>> Shreeshankar Bodas: Is the concern that this N is not known?
>>: Yes. Because the problem is, when you have a lot of files, you don't know
how many have the values that you want.
>> Shreeshankar Bodas: You mean the number of keywords I have is not known?
>>: Yeah.
>> Shreeshankar Bodas: So this is, I would say, a module for computing
how many keywords I have. It's like this: what I've shown is an example
for one particular keyword.
But in practice, I understand your concern that the keywords themselves are not
known, and the relative frequencies are not known. So I want the following.
It's like this: there are two keywords, let's say keyword one and
keyword two -- there could be many more. I want to count the relative
frequencies of these keywords, right? So this is for a particular query.
So the idea is that you just randomly sample a subset of the files and
compute the histogram -- it's an approximation to the histogram. You sample a
bunch of files, you get an approximate word count, an approximate count of the
number of hits for the given query, and you do this in a distributed fashion on
each of the slave servers and merge those histograms.
So because each one of those histograms is approximately right, at least for the
heavy hitters, when you merge the result, it remains approximately right. So
although you don't know the keywords to begin with, you don't know what you're
looking for, you compute the approximate histogram.
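That merge step could be sketched as follows. The helper names are hypothetical; with `p = 1` the merged histogram is exact, and a smaller `p` gives the approximate version described above:

```python
from collections import Counter
import random

def sample_histogram(query_log, p, rng=random):
    """Approximate per-server histogram: keep each log entry with
    probability p, so heavy hitters survive the undersampling."""
    return Counter(q for q in query_log if rng.random() < p)

def merge_histograms(histograms):
    """Merge per-server approximate histograms by adding counts per key.
    The relative ranking of the heavy hitters stays approximately right."""
    merged = Counter()
    for h in histograms:
        merged.update(h)
    return merged
```

Because each per-server histogram is approximately right for the heavy hitters, the merged histogram is too, even though no server knew the set of keywords in advance.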
>>: Could you approximate the stage, first stage was and is [phonetic] and under
second stage it could do that, an operation.
>>: Those files -- N is the number of logs.
>>: So N is the logs. So XN is the keyword count?
>> Shreeshankar Bodas: XN is the number of occurrences of my query in the
given log file. These are not files on the Internet or anything.
>>: Number of keywords.
>> Shreeshankar Bodas: Number of keywords in the file. That's right.
>>: Number of keywords in the file.
>> Shreeshankar Bodas: Number of keywords corresponding to my interest.
>>: I just don't understand how that -- and N is known.
>> Shreeshankar Bodas: N is known. Sorry for the confusion. Was that a question?
>>: Should that be a vector?
>>: I have a question. This sampling technique -- clearly some of the
statistics are relatively stable, and some of the statistics are not. Basically, let's just say
average salary, I think.
>> Shreeshankar Bodas: Yeah.
>>: The average figure is actually very sensitive to how you sample it. Very few people
in the file can cause the average to swing significantly, but if you
come --
>> Shreeshankar Bodas: I understand. So going back to the motivation:
does your application allow for an approximate computation? That's one question.
>>: For many of the complex queries, at least when you just submit the query itself, it may
not be easy to tell how sensitive the query is to that. Say I
give you two quite complex queries and you submit them -- basically, when I use
this sampling technique, how do I know how accurate my --
>> Shreeshankar Bodas: I see what you're saying. So if you want to compute
the mean of a distribution which is heavy-tailed -- where a few people
have huge values and a lot of people have small values -- then
you want to make sure that those people with huge values, or huge salaries, are
sampled. Otherwise you're going to be totally off your target. So
that's true.
That's a property of your underlying numbers, and it's a mathematical
problem; you can't get around it. What is going to happen in that case
is that your underlying sequence is not regular enough -- that is, it does not have
finite second or third moments, if you were to extend it to a distribution.
So that's true. Your underlying sequence has to have certain regularity for these
approximate computations to work.
>>: Here you can kind of expect the Xi's to be all nearly identical, the numbers
very close to each other. The spread could be --
>> Shreeshankar Bodas: The range could be small. So what happens is, if
the typical value of the Xi's is, let's say, 100, and the range is, let's say, plus or minus
10 or 20, or even more -- true, only if the Xi's are close to their mean can you
estimate the mean by some sparse sampling.
>>: Say you have the log of yesterday, the previous day.
>> Shreeshankar Bodas: And the Xi's are not --
>>: The Xi's, a fraction of fairly [inaudible] not using.
>> Shreeshankar Bodas: Yes, but if this log is from 2000, then the distribution --
>>: The Xi's are not going to be heavy tailed.
>> Shreeshankar Bodas: That's true. The distribution of the Xi's is not going to be
heavy tailed for the applications we have in mind. But it could be heavy tailed, in
which case you have to be careful while applying the result. That's true. That is
the assumption here: not everything can be computed in this way, but there are
certain things that you can speed up. Yes?
>>: Do you sample the servers, the site machines?
>> Shreeshankar Bodas: Yes. So every machine samples each of the files with a
Bernoulli coin toss.
>>: Every machine?
>> Shreeshankar Bodas: Yes, each machine samples every file with a Bernoulli
coin toss. Suppose there are ten servers and 100 files; I'll have ten times 100 coin
tosses, and that's going to determine who samples which file.
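The coin-toss assignment just described might be sketched as follows; the server and file counts come from the example in the talk, while the sampling probability p is a placeholder.

```python
import random

random.seed(1)

n_servers, n_files, p = 10, 100, 0.1   # p is an illustrative assumption

# One independent coin toss per (server, file) pair:
# 10 * 100 = 1000 tosses decide who samples which file.
assignment = {
    s: [f for f in range(n_files) if random.random() < p]
    for s in range(n_servers)
}

n_tosses = n_servers * n_files
print(n_tosses, sum(len(files) for files in assignment.values()))
```

Each server then reads only its own short list of files, which is what makes the aggregate work per query small.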
>>: [inaudible] this is just a picture of one slave.
>> Shreeshankar Bodas: This is a picture of one slave server.
>>: You have this many files residing on this machine?
>> Shreeshankar Bodas: Yes. Yes?
>>: Since the queries that you talk about, by their nature, have an arg max at the
end, like in the [inaudible] case you want the averages but you really want the arg
max, that's helping you a lot because it's orienting you toward the head of the
distribution. Do you think you could somehow abstract out, sort of talk about a
second layer of what are the types of queries for which you'll have these nice
statistical properties versus others? If you're doing min, obviously you're screwed
because [inaudible]. Is there any way to organize the space of queries into
classes that would be well served, so you could analyze each class?
>> Shreeshankar Bodas: So that's a good question. I would say that clearly, for
this heavy-hitter case, you have to have heavy hitters, otherwise it's not going to
work; the approximation works for auto-complete and these kinds of things that
are not sensitive to one or two people skewing the distribution.
These kinds of queries, off the top of my head: if I want to compute PageRank,
then I want to count the number of incoming links as a basic building block.
Suppose I take a web page like the New York Times; if I approximately compute
the number of incoming links, even that is going to be good enough, because
there are going to be millions of links pointing to it. If somebody does not have
enough links pointing to them, then even with an exact computation that person
or site is not important. So maybe for these applications it can be used.
Was there another question? Okay. So far we have seen this approximate
computation in a static setup. Now we're going to briefly go over what happens
if there are system dynamics; so far, what we have talked about assumes all the
servers are empty.
I just want to see what happens if I submit a given query and what happens to its
completion time and how many samples I need. It could happen that these
servers are already loaded by earlier queries.
That is, I have this yellow query coming in from outside, and it brings in a certain
amount of work for each one of these servers: server one is going to sample a
subset of the files, server two is going to sample a subset of the files, and so on.
It's like there's a queue in front of each of these servers. Every incoming query
brings in some amount of work, and the servers could have earlier queries that
are not yet complete. Suppose this is the amount of residual work left from the
previous queries.
So this query comes in; it brings two units of work, or two packets, into the first
queue, and so on. These files, as I mentioned, are the search query logs. So
what happens in this dynamic setup? Is our system, first of all, stable? Does it
give good queueing performance, good delivery performance? And how can we
extend the algorithm, or even analyze it, in a dynamic setup? There is one
simple way of bounding the response time that is nevertheless useful in this
setting.
So let's assume that I'm going to slightly modify the system and I'm going to
batch the incoming queries into units of time T. So these are some of the green
queries and the blue queries and so on.
And I'm going to give precisely T units of time to each batch of queries.
So remember that because of the underlying randomness in the algorithm, even
if the number of queries per unit time is constant, a random amount of workload
comes into the system.
I'm going to say that, okay, you are given some T units of time for your overall
process to complete. If there is some amount of work that is left in the system,
then I'm going to just terminate all that work and start processing the new
queries. This brings in two points. First, it gives a deterministic delay guarantee
on the response time; but it might also introduce some additional error in your
estimate, on top of the sampling error, because of this termination event.
So it turns out that even this system can be analyzed in more or less closed form.
You can get expressions for things like the probability of error. In the earlier
case this was a fixed number; now it is a function of epsilon, the confidence you
want, T, the amount of time you give, and the sampling parameter p. This
function F can be computed in closed form, but I'm not putting it up here. The
main message is that you can trade off accuracy, response time, and
confidence. If you set T equal to infinity, then you get the
earlier result back. But even if you're constrained by how much time you can
afford to give for a particular job, you can analyze the trade-off between your
confidence and your response time.
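The trade-off just described can be illustrated with a small Monte Carlo sketch. The data model, parameters, and deadline values below are assumptions for illustration, not the closed-form function F from the talk; the deadline T simply truncates how many sampled values feed the estimate.

```python
import random

random.seed(2)

n = 1000
xs = [random.uniform(80, 120) for _ in range(n)]   # illustrative file values
true_mean = sum(xs) / n
p, eps, trials = 0.2, 2.0, 2000    # sampling rate, error tolerance, MC runs

def error_prob(T):
    """Fraction of runs whose deadline-truncated estimate misses by > eps."""
    errors = 0
    for _ in range(trials):
        drawn = [x for x in xs if random.random() < p]  # Bernoulli sampling
        used = drawn[:T]          # deadline: at most T values get processed
        if not used or abs(sum(used) / len(used) - true_mean) > eps:
            errors += 1
    return errors / trials

e_tight, e_loose = error_prob(20), error_prob(200)
print(e_tight, e_loose)   # a larger deadline buys a smaller error probability
```

Sweeping T in this way traces out exactly the accuracy versus response-time curve the speaker mentions.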
>>: Would the [inaudible] norm, for example, come into --
>> Shreeshankar Bodas: Right. So that's a good question. What happens here is
that because we are doing this binomial kind of sampling, and a binomial random
variable concentrates sharply around its mean, there is not much variation here.
But it's true that if your sampling is a bursty scheme, if it's not just a Bernoulli coin
toss but something more complicated, then the burstiness of your arrival process
will determine how your function F changes.
>>: I understand this logic of T. But what if each server sampled with a
probability based on its own load?
>> Shreeshankar Bodas: That's a [inaudible] sampling, you're saying?
>>: Right, each with its own distribution. So server one just picks two files
because it's [inaudible], and server two, being lightly loaded, picks five files.
>> Shreeshankar Bodas: That's a good point. That will even out the load in a
better way. That will give a sharper -- that will give us a better trade-off. But we
have not analyzed that; I should think about that sampling strategy. Okay.
So this is what we have seen so far. Map-Reduce is a simple distributed
computing framework where, for most of the cases of interest, the reduce
function is, at least in its exact form, a summation. We believe this summation
can be approximated, at least in some cases, which gives us, first of all, network
congestion benefits and also completion-time benefits. And there are simple
randomized algorithms for mean computation that are robust and fast and that
can trade off accuracy, response time, and confidence.
Again, not everything can be sped up this way; if you are extremely
sensitive to your estimation error, then this kind of approach is not the right one.
But there are certain applications for which this kind of randomized calculation of
the mean is good enough and it saves you computation time.
So let's see if this idea of mean computation can be extended beyond these
Map-Reduce-type computations. There's a whole community that worries about
consensus on graphs. You can think of the graph as a communication graph.
These could be, let's say, sensors in a field, and each node i has a real number
Xi. You want to compute the sum of the Xi's at each of these nodes. So why is
this problem interesting, why do we care about it? It could be, for example, that
these are unmanned vehicles or robots moving in a certain direction, and they
just want to reach a consensus; that is, "I'm going in this particular direction."
Or this consensus can be used in CSMI-type applications, where it is a black
box for doing more complicated operations. Or, as I mentioned, these could be
sensors in a field and you want to measure the average temperature.
The underlying feature is that these Xi's, at least in the sensor network setup,
could be correlated with each other: the temperatures in a given room are not
going to be that different from each other, at least we can expect that.
If that assumption is valid, then you can use these ideas of approximate mean
computation to quickly measure the average of the Xi's.
So the idea is simple. I can take a random walk on this graph according to its
structure; that is, an edge means that if I'm at a node at time 0, I can go to one
of its neighbors at time 1. So I'm going to take a random walk on this graph, and
I'm going to carry a token with me. This token samples the values of the Xi's as
and when it visits those nodes, not at every step but, let's say, every t_mix steps,
where t_mix is the mixing time of the underlying Markov chain.
If the underlying Markov chain is nice and carefully constructed, in particular if it
has a uniform stationary distribution, then I'm sampling uniformly from the Xi's.
So that's the main idea: you have this token, it takes a random walk, and as it
passes over the graph it collects samples and averages them.
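A toy sketch of the token idea on a ring graph follows; the node values, walk length, and the sampling interval t_mix are illustrative assumptions, and a real implementation would use the chain's actual mixing time.

```python
import random

random.seed(3)

# Ring graph with one value per node; values cluster around 100,
# mimicking correlated sensor readings.
n = 200
values = [100 + random.uniform(-10, 10) for _ in range(n)]
t_mix = 50        # sample only every t_mix steps of the walk (assumption)
n_samples = 200

node, samples, step = 0, [], 0
while len(samples) < n_samples:
    node = (node + random.choice([-1, 1])) % n  # hop to a random neighbor
    step += 1
    if step % t_mix == 0:
        samples.append(values[node])            # token samples this node

estimate = sum(samples) / len(samples)
true_avg = sum(values) / n
print(f"token estimate {estimate:.2f}, true average {true_avg:.2f}")
```

The symmetric walk on the ring has a uniform stationary distribution, so spacing the samples by the mixing time makes them approximately uniform draws from the Xi's, which is exactly what the averaging argument needs.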
We have some theoretical results for this part as well. You can show that, for
certain graph structures like the ring graph, this approximate computation of the
mean improves on the earlier known bounds; but, of course, you have to assume
that the underlying Xi's have a certain distribution, that they are not completely
arbitrary numbers. So this is just one small application showing that these ideas
of approximate mean computation can also be used in other fields.
So here is what I'd like to work on in the future as regards the Map-Reduce
computations. One thing, and I'm sure all of you have already asked this
question, is what happens on real data; that is, it's all well and good if it works in
theory, but does it work in practice? That's one thing I'll be focusing on.
Another direction is smarter randomized algorithms. This is the basic vanilla
version of the algorithm, where it's all uniform sampling; maybe you want to do
smarter things because there are network bottlenecks, or because it's a
heterogeneous environment where server speeds are inherently unequal, or
there could be other things you can do to save on sampling. So these are a
couple of directions I want to take in the future.
So here is the summary of what we have talked about so far: The main message
here is that if the underlying sequence is regular, then mean computation is an
easy problem.
Your numbers are nicely concentrated around the mean. Just sample a few of
them and you'll get a good estimate of the mean. This idea can be used in a
datacenter framework, in a Map-Reduce setup, for simple mathematical
operations like averaging that have the feature that the reduce function is a
simple summation, which can then be sped up.
And this randomized algorithm for fast averaging gives you a trade-off between
accuracy and completion time, a trade-off which, as far as I know, has not been
studied; it's one more dimension in your toolbox.
So that's it. Thank you. If you have any questions I'll be glad to answer.
>>: If your canonical examples are essentially 0-1 variables, this seems to be
standard random sampling.
>> Shreeshankar Bodas: It is. It is exactly that. If the Xi's are just coming from
{0, 1}, or any finite support, then beyond a certain point more samples are not
going to help; there are just diminishing returns to how accurately you can
estimate the mean. That's true, exactly.
We're not doing anything mathematically heavyweight; it's just a nice observation
that can be used in these setups to reduce the completion time. Any other
questions? All right, then, thank you.