>> Nikolaj Bjorner: All right. It's my pleasure to invite back Assaf Schuster, who
was here this summer. Now he's here for a few days. And if you are interested
and have time to meet with him after the talk, please let us know.
This time he will talk about ExPert, which, according to the abstract, coordinates
backup tasks for bags of tasks. I hope there will be a bag of tricks for those
tasks in the talk. Thanks. Please go ahead.
>> Assaf Schuster: Thank you Nikolaj. So it's a pleasure to be here again. I will
talk about ExPert; it's pareto-efficient replicated task execution in unreliable
environments. It's joint work with two of my students, Orna Agmon Ben-Yehuda
and Mark Silberstein, and with Alexandru Iosup from TU Delft.
And I'll go through the motivation and our solution. I'll show some details of the
solution in terms of examples and graphs -- I will not delve into all the tiny
details -- and then summarize.
All right. So what's the problem we're dealing with? We have a game that is
played between the grid users and the grid owners. The grid owner has lots of
resources on top of which the grid user wants to perform large bags of tasks. The
grid user may have bags of millions of tasks or even more that she wants to
perform, and she's sending this bag of asynchronous tasks for execution on the
grid.
The grid owner of course has a maintenance cost, and he passes some of that cost
back on to the grid user. And in order to try to optimize the whole thing, he also
enforces some policies in order to provide some quality of service for the
prioritized users. In return the grid user, because she wants to optimize the
turnaround time of her execution and also wants to lower the cost, has some
strategies. And the strategies say, okay, I'm going to use the unreliable resources
that are provided by the grid owner. They are unreliable because from time to time
a more prioritized user can come in, and the grid owner will give him priority,
will evacuate my task, will stop my task and will let the other task execute. So in
the view of the grid user, the grid is an unreliable resource.
And the grid user has some set of strategies with which she wants to minimize the
cost and the turnaround time of her execution in this unreliable environment.
The grid user also usually has some reliable alternative at her disposal. A
reliable alternative may be a local cluster, or it may be a cloud where she can
purchase some dedicated resources, but the dedicated resources are usually
slower and costlier than the alternatives that a user can find on the grid.
The grid owner, on the other hand, would like to minimize operational costs. And
when I talk about costs in this talk, the best way to understand it is to think about
energy. Okay? But cost can be anything, okay? It can even be money. But when I
say minimize cost, the best way to understand this is to think about minimizing the
energy expenditure. And the grid owner would also like to minimize the number of
nodes in effective use, because he wants to make sure that he can provide quality
of service when a prioritized user comes in.
So in order to better understand what is going on, I would like to show you some
slides from actual executions. This is a slide that we showed last week at
Supercomputing. This is a real run on top of a system that we developed at the
Technion called Superlink Online. You can see that at the beginning there are
many, many tasks in the queue. And there is a steady decline, meaning the
environment can execute the tasks at a specific rate, so the user can expect
completion at this point. However, a phenomenon that we see many times is that
there's a tail: once the number of tasks becomes smaller than the number of
resources, the tail starts stretching and stretching, and the actual completion is
way later than the expected completion point. Such a phenomenon was also
observed in more reliable environments such as the Google cloud and in other
places.
So it's a well known phenomenon. And we think of the initial phase, where the
number of tasks is very high relative to the number of resources, as the high
throughput phase. And we think of the tail as starting when the number of tasks
drops below the number of resources. Just to give you a hint about the variety of
environments that we deal with -- this is a production system that serves hundreds
of geneticists from all over the world -- we run on top of very large clusters,
tens of thousands of machines, at the University of Wisconsin Madison. And you can
see the preemption rate. You can see why this is an unreliable environment.
We are not the prioritized user -- this is an understatement at the University of
Wisconsin. So we are preempted: every fifth task that we submit is preempted.
And there's also a high failure rate. In the Open Science Grid, which is composed
of 160,000 machines all across the United States, we also get some time to
perform our tasks, and the preemption rate is half of that, like 10 percent; it's
still relatively high. In the European grid for eScience, which is a megagrid that
is supposed to support the operation of the Large Hadron Collider, we get a
preemption rate of seven percent, but the failure rate is twice as high.
And then we have our own dedicated, so to speak reliable -- so-called reliable --
resources, our own small set of machines at the Technion that are dedicated to our
system, and there's still a preemption rate there, which drops down to two
percent, and a failure rate which is much, much lower.
And we also run on community grids. We have a hundred thousand machines that
are registered, but 15,000 of them are active. And you can see the preemption
rate in the community grids. Community grids are like SETI@home: they are
composed of the PCs of people who donate their computation time to the
execution.
So the preemption rate is very, very low. But the failure rate is very, very high.
So these are unreliable resources, and we'll consider this one as a reliable one.
>>: [inaudible]. So these are tasks and they all should have completed
[inaudible]?
>> Assaf Schuster: Yeah. Well, the whole execution --
>>: So all [inaudible] tasks?
>> Assaf Schuster: More or less, yes. More or less. In this talk, for simplicity,
I'll assume that all of them have the same execution time and so on.
>>: But then some tasks took three days to complete?
>> Assaf Schuster: Oh, this is the number of tasks. And the decline is of the tasks
that are sent for execution: they find a machine, they complete, and then the
number of tasks drops.
>>: And by this [inaudible] all grids or on secure grids?
>> Assaf Schuster: And we work on all of them. We consider this one as a
reliable resource, and the rest unreliable.
>>: So is the tail attributed to a particular grid, subgrid, or --
>> Assaf Schuster: No, no. Actually the tail is there because we proceed working on
all of them; whenever we find a free machine and we have a task, we send it to
that free machine.
>>: And how many machines are there on the other grids? The Technion?
>> Assaf Schuster: This one -- it depends on the availability of the grid. So it
goes up and down in a very erratic way, jittering by hundreds or thousands of
machines.
This one has thousands of machines. This one 160,000. This one has a hundred
thousand machines dispersed across Europe. This one is across the world. And so
on. This one is at the Technion; we have it at our disposal.
But still, even though you can see that the reality is a bit more complicated than
what I will describe next, we will refer in the rest of the talk to reliable
resources as fully reliable, meaning that if we submit a task to a machine that we
consider reliable, the task will always finish execution and return the result.
Okay?
>>: This makes your real results stronger?
>> Assaf Schuster: We can refine the results to better suit reality, but then I
would not finish the talk in time.
>>: What explains such a long tail?
>> Assaf Schuster: Okay. So this is a good question, actually. The tasks that
remain here are sometimes scheduled on unreliable machines, and these machines do
not come back; you don't know what's happening. You wait and wait and wait, and
this can take a lot of time. And if the result does not come, you have to
reschedule them. Here you have other tasks that finish -- most of the tasks
finish -- and this decline is for the tasks that do finish.
Here you have tasks that do not finish, so they have to be rescheduled and
rescheduled again, and sometimes these are the problematic tasks that need a very
stable and well configured machine in order to actually operate well.
There was another question? Okay. So what we want to do in order to solve it -- I
guess it's already clear to you what we are going to do. In the high throughput
phase we are going to use very cheap unreliable resources like grids, community
grids, non-dedicated clusters, whatever we can find. And in the high performance
phase, in the tail, where we want to shorten the turnaround time of the whole
execution, we are going to use two methods. First, we are going to use reliable
resources -- which, as I told you, we are now going to assume always complete the
task -- in terms of a dedicated cluster or in terms of purchasing execution on
clouds.
And second, we are going to replicate. We are going to use unreliable resources,
but we are going to send a few copies to maximize the probability that one of them
returns, okay? What is the problem here? The problem here is that replication is
very wasteful in terms of resources. If you have millions of tasks you are wasting
a lot of execution here, and if we are thinking about the cost as energy, we are
wasting a lot of electricity.
And second, reliable and dedicated resources are expensive anyway, even in
terms of money. Okay? So this is a problem. Just to show you that when using
these methods we can actually succeed, this is an actual execution on top of
Superlink Online with a little bit more than three million tasks that were
performed in 21 days. You can see that by replication and dedicated resources we
were able to shorten the tail and to finish on the anticipated time schedule.
The question is what are the best strategies to employ with these costly dedicated
resources and with the replication that wastes resources, in order to both
minimize the cost and minimize the total execution time.
Okay. So now let me return to the theory. So we start by considering the user's
bank of strategies, okay? The user has a lot of strategies. We are going to
formalize the strategies in terms of three parameters. The first parameter is N.
N is the number of times we are going to invoke an unreliable resource before we
give up on unreliable resources and invoke reliable resources.
The second parameter is T. T is a deadline after which we do not trust the
unreliable resource anymore to produce the result, and we either start another
unreliable resource or we call the reliable resource.
The third parameter is D. D is a deadline for the unreliable resource after which
we do not wait anymore for the unreliable resource to return the result, and the
unreliable resource actually knows that it is not expected to return a result and
it stops execution. These parameters are supported by all the environments, such
as BOINC, Condor and so on. And these are the three parameters that we are going
to play with for different strategies. Okay?
So here is a strategy. For instance, N equals 3. So we are going to try
unreliable resources three times, and only give up when we don't get a result and
invoke the reliable resource. So at the beginning we invoke an unreliable
resource. If we reach the time T and no result has returned, we are going to
invoke another unreliable resource. Okay? And until time D we still wait for the
first unreliable resource to return the result.
Then, with the second unreliable resource, if no result returns we wait until time
2T for the second unreliable resource to return a result -- and then still wait a
little bit more for it to return the result -- but if by time 2T no result has
returned, we start another unreliable task. And we start the other unreliable task
on the same machine as the first unreliable task, because we restrict ourselves in
this presentation to the situation where we use only two unreliable resources.
Only two unreliable machines. If by time 3T no result has returned from the
unreliable resources, we give up and we call the reliable resource. And the
reliable resource, as I told you, we assume here in this simplistic presentation
always returns a result after the time TR, the time it takes the reliable resource
to compute the task. Okay?
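As a concrete reading of that timeline, here is a minimal Python sketch. It is not
ExPert's code; the function name and the interpretation of D as a deadline relative
to each copy's own launch are my assumptions.

```python
def strategy_schedule(N, T, D, T_R):
    """Launch times and give-up times implied by a pure (N, T, D) strategy.

    One reading of the talk: unreliable copy i is launched at i*T and is
    abandoned D after its own launch; the reliable machine is invoked at
    N*T and is assumed to always return after T_R.
    """
    events = [(i * T, i * T + D, f"unreliable copy {i + 1}") for i in range(N)]
    events.append((N * T, N * T + T_R, "reliable copy (guaranteed)"))
    return events

if __name__ == "__main__":
    # Times in hours; T_R = 5 * 7 h because the reliable machine is ~5x slower.
    for launch, until, who in strategy_schedule(N=3, T=8, D=14, T_R=35):
        print(f"{who}: launched at t={launch} h, resolved by t={until} h")
```

With N equal to 3 and only two unreliable machines, the third copy would reuse the
machine of the first one, exactly as described above.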
So, two simultaneous instances at most: for now we are going to restrict the
strategies in this way. The reliable machine is used to ensure that we can always
complete the task execution. T is the replication time. D is a deadline. And the
whole thing is interesting only when the number of machines is larger than the
number of unfinished tasks, of course.
Okay. So this was an example of a strategy with N equals 3. But you can see that
there are a lot of such strategies. And the question is how to choose among them.
So what does the user care about? The user cares about the cost, as the mean cost
per task that the user has to pay. And the user also cares about the execution
time, and the execution time is minimized when the throughput is maximized. So I'm
going to talk about throughput and not about execution time. Okay? So the user
wants to maximize the mean peak throughput, and the mean peak throughput I am
going to measure on top of two unreliable machines and an additional reliable
machine -- that is an execution unit for the time being. Okay? And then I can
scale it to the whole grid.
But this is the way I am going to measure. I'm going to measure two parameters:
the cost per task and the mean throughput. And now the user may have any kind of
utility function she wants to optimize. For instance, the user may have a deadline
beyond which she cannot proceed with the execution, so the user may want to
maximize the throughput to meet that deadline. The user may have a budget, so the
user may want to minimize the cost in order to meet that budget.
The user may want to minimize the cost that she pays for every unit of
throughput. Okay? This is like the energy-delay parameter that is usually
referred to in computer architecture. And the user may have other utility
functions that she wants to optimize. Any function that is expressed by means of
cost and throughput can be expressed in this way.
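All of these utility functions are just functions of the (cost per task,
throughput) pair; a small sketch, with illustrative names of my own:

```python
# Illustrative utility functions over (cost per task, throughput) points.
# The names are mine; the talk only describes the objectives informally.

def cost_per_throughput(cost, throughput):
    return cost / throughput

def cost_squared_per_throughput(cost, throughput):
    # The "energy-delay"-like objective mentioned in the talk.
    return cost ** 2 / throughput

def cheapest_meeting_deadline(points, min_throughput):
    # Deadline case: keep only points fast enough, then take the cheapest.
    feasible = [p for p in points if p[1] >= min_throughput]
    return min(feasible, key=lambda p: p[0]) if feasible else None
```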
>>: I'm confused about the [inaudible] in the definition. [inaudible].
>> Assaf Schuster: You can think about it -- this is simply the [inaudible] of the
time at which the first result of the task execution returns. We call it the peak
throughput, and we measure it for a unit of two unreliable machines and an
additional reliable machine; that just simplifies the presentation here. It gets
more complicated later, but I don't want to present that -- I want to present the
general approach.
All right. Again, back to the motivation: we have the game that is played between
the owner of the grid and the user. The user wants to optimize her execution, so
she tries to find, out of the world of possible strategies, a strategy that will
optimize the utility function. The user usually does that by means of intuition
and trial and error. So the behavior of the user that is seen by the owner is
erratic and not stable.
So the owner, who considers the user erratic, not stable and irrational, tries
also to optimize her own environment in a trial and error manner, and the whole
thing does not converge. We see this many times. This is the reason why very
large environments, such as EGEE, have erratic and unanticipated behavior and take
years to converge to a specified way of work, to a [inaudible] of work.
What we want to do is break this cycle by giving the user means and tools to
optimize her strategies. In this way, once the user is optimized and is working,
in the view of the owner, in a rational way, the owner can also optimize her own
environment, and the whole thing is energy optimized or cost optimized or
optimized for whatever cost function you have. Okay. So this is what we want to
do. In terms of related work, because of the importance of this problem there are
relatively many recent works. All of these works consider specific utility
functions in specific configurations and do not suggest a general framework into
which you can plug your configuration and solve for any environment.
Okay. So what is the solution concept that we have in mind? We are going to have
this system, ExPert. The first thing that ExPert does is it extracts from the user
some parameters, some data that is needed for the computation of the best
strategies. The data is, for instance, the cost it costs the user to send a task
to an unreliable machine, the cost of maintaining the reliable machine, and so on.
And ExPert also needs additional data from the large grid, from the unreliable
resource pool.
We get the grid parameters in terms of analysis of traces that are coming from
the grid. I'll show you later, more or less at a high level, how it's done. Once
we have all of this data, ExPert produces, at a fine enough resolution, the space
of all throughput-times-cost points, and in this space Pareto efficiency helps the
user to choose those points that are most efficient. The points that are most
efficient are picked very simply by using the Pareto frontier, which I'm going to
show you in a second.
Once the user knows what the best working points are, the user chooses, according
to her utility function, the throughput at which she wants to work. That
throughput corresponds to a certain cost which the user computed to be the best
from her point of view. Once the throughput is known, we can go back to the space
of throughput and all the possible strategies and extract the corresponding
strategy that will provide this throughput at the best cost. The best strategy is
given again in terms of N, D, and T, the parameters that I showed you before.
Once we have N, D, and T, they are fed back into the scheduler, and the scheduler
can now replicate into the unreliable system or submit to the reliable system
according to the selected strategy. So this is how we think the system should
work. This is how ExPert works. And it provides the user with a well defined
framework to choose the best strategies.
Okay. So in order to show you some details, I want to do it in terms of examples
and graphs, and the examples and graphs will be based on a specific unreliable
resource. The specific unreliable resource that I want to mention is called
Auvergrid. The traces of Auvergrid from 2006 are available online. And the trace
holds a line per task with the following fields: the status of the task, whether
it succeeded or failed after being sent to the unreliable resource; the runtime of
the task -- but we get this information only for successful tasks, because for
unsuccessful tasks it may be that the machine never returns to us, so we do not
know what happened; and the waiting time from submission to the start of the run
itself. Okay? So the result time is the runtime plus the wait time. And this is
what actually interests the user. Sometimes the wait time is known only for those
tasks that did not fail, in certain environments. But we don't care. These are
the three elements that we need in order to compute the probability of a task that
is sent to the system returning within time D.
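A sketch of reading such a per-task trace. The CSV layout and column names here
are illustrative only; the real Auvergrid trace in the Grid Workloads Archive has
its own format.

```python
import csv

def load_trace(path):
    """Read a per-task trace with the three fields described above.

    Column names are hypothetical; the point is the derived quantity:
    result time = wait time + run time, defined only for successful tasks.
    """
    records = []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            ok = row["status"] == "success"
            wait = float(row["wait_time"])
            run = float(row["run_time"]) if ok else None   # unknown for failures
            records.append({
                "ok": ok,
                "wait": wait,
                "run": run,
                "result": wait + run if ok else None,      # what the user sees
            })
    return records
```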
How do we do that? First we define some data and parameters that we get, as I told
you, from the user and from the traces. For instance, the time to execute a task
on an unreliable machine with Auvergrid, for the traces that we consider, used to
be around seven hours. Okay? A single task. Okay? This is the average task
length in Auvergrid.
For the cost of the unreliable system, what we need in order to compute the
strategies is the cost of time. The cost of time is taken to be the average --
here the example is with energy, so we take the average cost of electricity around
Europe, which is around 10 cents per kilowatt-hour, at 100 watts, which is
approximately the power that you need for a single resource, for a single CPU. To
compute the total cost, the cost of time is usually multiplied, for a successful
task, by the time it takes to compute the task itself.
And then we also have a fixed overhead, which we call C sub OH, for overhead. This
is a fixed overhead which we have to pay for every task that is submitted to the
grid, regardless of whether the task was successful or not. It is needed for
preparing the task and submitting it. So we estimate this fixed overhead at the
cost of 10 seconds' worth of work.
For the reliable system properties, we also need to know what we have at our
disposal, so we took the Technion local cluster as an example. The Technion local
cluster is composed of fairly old machines. Although it's reliable, the machines
are old. So they are five times slower than the average machine in the unreliable
system, and the time to execute on a reliable machine is five times the time it
takes to execute on an unreliable machine. And with the cost of maintaining the
cluster and everything, the cost of executing a task there is estimated at five
times the cost of executing it on an unreliable machine.
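Putting the numbers just quoted together -- the unit conversions and the per-task
reading of the "five times" cost are my own arithmetic, so treat this as a sketch
of the cost model rather than the paper's exact one:

```python
HOUR = 3600.0

# Example parameters quoted in the talk, written out as named constants.
T_UR = 7 * HOUR                                  # average unreliable task time
PRICE_CENTS_PER_KWH = 10.0                       # average European electricity price
CPU_POWER_KW = 0.1                               # ~100 W per CPU
COST_PER_SEC = PRICE_CENTS_PER_KWH * CPU_POWER_KW / HOUR   # ~1 cent per CPU-hour
C_OH = 10 * COST_PER_SEC                         # fixed overhead: ~10 s of work

T_R = 5 * T_UR                                   # reliable machines are ~5x slower
COST_RELIABLE_TASK = 5 * COST_PER_SEC * T_UR     # and ~5x the per-task cost

if __name__ == "__main__":
    print(f"unreliable task: ~{COST_PER_SEC * T_UR:.1f} cents + {C_OH:.3f} cents overhead")
    print(f"reliable task:   ~{COST_RELIABLE_TASK:.1f} cents, taking {T_R / HOUR:.0f} hours")
```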
Okay. The other information that we need is the probability that the unreliable
system returns the task successfully within time D. How do we compute it? So we
take the wait times, the graph of the CDF of the wait times, okay. We decide that
a failed task has an infinite wait time: we model a failed task as if it waited an
infinite time to be scheduled. So we multiply the reliability factor of Auvergrid,
which is .86, by the CDF of the wait times, and we get this graph. And now we have
the CDF: the probability of the task returning successfully from Auvergrid by time
T is the reliability factor times the CDF of the wait times, shifted by the amount
of time that is needed to execute the task on the unreliable resource.
So overall the CDF that we get --
>>: [inaudible] CDF?
>> Assaf Schuster: The cumulative probability function.
>>: [inaudible].
>> Assaf Schuster: Distribution function. Cumulative distribution function. So
that's the green line -- the green graph that we get here is the probability,
after a time T, of a successful result being returned to us by the unreliable
resource. Okay? This is what we get.
And this is a simplistic computation. A more involved computation can be
plugged in. Again, this is only for the purpose of presentation here. All right.
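A minimal sketch of that construction (my own code, not ExPert's): scale the
empirical wait-time CDF by the reliability factor, because failures are modeled as
infinite waits, and shift it right by the execution time on the unreliable machine.

```python
import bisect

def make_success_cdf(success_wait_times, reliability, t_exec):
    """Probability that an unreliable copy returns a result by time t."""
    waits = sorted(success_wait_times)

    def cdf(t):
        shifted = t - t_exec                     # shift by the execution time
        if shifted < 0:
            return 0.0
        frac = bisect.bisect_right(waits, shifted) / len(waits)
        return reliability * frac                # failures count as infinite wait

    return cdf

if __name__ == "__main__":
    hour = 3600.0
    # Toy wait-time sample; Auvergrid's reliability factor was 0.86.
    cdf = make_success_cdf([0.2 * hour, 1 * hour, 3 * hour, 9 * hour], 0.86, 7 * hour)
    print(cdf(8 * hour), cdf(20 * hour))
```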
So once we have all of these details in place, we want to see some results. So we
first look at the cost expectation values. And let's take N equals one -- let's
take all the strategies for N equals one. So to remind you, N equals one means
that we submit once to the unreliable grid, and if the unreliable grid does not
return the result by time T, we give up and we submit to the reliable grid. To the
reliable resource. Okay?
So what do we see here? The Y axis is the ratio of the time T that we wait until
we start the reliable execution to the time of execution on the unreliable
resource. Okay?
The X axis is D, the time that we are still going to wait for a result from the
unreliable grid, normalized again by T unreliable, the time it takes for the
unreliable resource to compute the task. And the cost is in cents per task. And
what we can see here is that if T is less than one, meaning that we do not give
the unreliable grid a chance to return a result before we start the reliable
execution, then the costs are very high. Okay? Because we always invoke the
reliable resource.
But once T crosses this threshold, once T becomes larger than the time it takes to
execute the task on an unreliable resource, the cost becomes lower, but it is
still relatively high, because the unreliable resource sometimes does not return
the result in time. And the cost gets smaller and smaller as we grow T. So lower
cost means higher T. Okay?
This is for cost. Let's look at the throughput. What happens with the throughput?
Again, we have as the Y axis T normalized by the time that it takes to execute on
an unreliable grid, and the X axis is D normalized by T unreliable. And the units
here for throughput are tasks per 100,000 seconds. Okay?
So when D is very small, close to one, again we do not give the unreliable
resource sufficient time to return a result. Okay? So you can see that the
throughput is very low, because the unreliable grid -- this is again for N equals
one -- does not return the result on time and stops execution, because at D the
unreliable resource stops execution. So we have to wait for the reliable machine
to provide the result. But the reliable machine is five times slower than the
unreliable machines. And so the throughput is very low.
Once we grow D towards 2, the throughput becomes higher and higher, and if we grow
it further, the larger the D, the larger the throughput. And if we also look at T,
the lower the T, the higher the throughput. Why is that? Because the lower the T,
the faster we invoke the reliable resource. The faster we invoke the reliable
resource, the faster the reliable resource comes back with a result when the
unreliable resource fails. So the throughput grows with shrinking T and growing
D -- the opposite of what we had for the cost. And this is a tradeoff that we want
to hit.
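The tradeoff just described can be reproduced with a small Monte Carlo sketch of
one execution unit. Everything here is a simplification under my own assumptions:
an exponential stand-in for the trace-derived result-time distribution, CPU time
charged until a copy returns or is cut off at D, and the reliable machine charged
at five times the unreliable per-task cost only when its result is needed.

```python
import random

HOUR = 3600.0
T_UR = 7 * HOUR                        # average unreliable task time (from the talk)
T_R = 5 * T_UR                         # reliable machine is ~5x slower
COST_PER_SEC = 10.0 * 0.1 / HOUR       # 10 cents/kWh at 100 W ~= 1 cent per CPU-hour
C_OH = 10 * COST_PER_SEC               # fixed per-submission overhead (~10 s of work)
RELIABILITY = 0.86

def sample_result_time():
    # Hypothetical stand-in for the trace-derived result-time distribution.
    if random.random() > RELIABILITY:
        return float("inf")                              # copy never returns
    return T_UR + random.expovariate(1.0 / (2 * HOUR))   # run time + some wait

def run_once(N, T, D):
    """One task under strategy (N, T, D); returns (completion time, cost)."""
    cost, copies = 0.0, []
    for i in range(N):
        launch = i * T
        done = [l + r for l, r in copies if r <= D]
        if done and min(done) <= launch:
            break                                        # a result already arrived
        r = sample_result_time()
        copies.append((launch, r))
        cost += C_OH + COST_PER_SEC * min(r, D)          # runs until result or cutoff
    done = [l + r for l, r in copies if r <= D]
    if done and min(done) <= N * T:
        return min(done), cost                           # no reliable copy needed
    cost += 5 * (C_OH + COST_PER_SEC * T_UR)             # reliable copy invoked at N*T
    return min(done + [N * T + T_R]), cost

def estimate(N, T, D, trials=20000):
    runs = [run_once(N, T, D) for _ in range(trials)]
    mean_time = sum(t for t, _ in runs) / trials
    mean_cost = sum(c for _, c in runs) / trials
    return mean_cost, 100_000.0 / mean_time              # tasks per 100,000 s

if __name__ == "__main__":
    for t_frac, d_frac in [(0.5, 0.9), (1.5, 2.0), (3.0, 4.0)]:
        cost, thr = estimate(1, t_frac * T_UR, d_frac * T_UR)
        print(f"N=1 T={t_frac}*T_ur D={d_frac}*T_ur: "
              f"cost {cost:.1f} cents, throughput {thr:.2f}")
```

With D below T unreliable the sketch always falls back to the reliable machine, so
it shows the high-cost, low-throughput corner; growing D and T moves it along the
tradeoff described above.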
Okay. So now that we understand the tradeoff, let's look at several possible N's.
Okay? So we take all the possible strategies that we can think of. We sample them
at a certain granularity. And we take only the first four values of N: 0, 1, 2, 3.
For the rest of them we can try invoking the unreliable grid more and more times,
but the considerations will be the same.
So we took only N equals 0, 1, 2, 3. N equals 0 means that we do not use the
unreliable grid at all, we only use reliable. We can see the cost: it is
relatively high. And the throughput is relatively low, because the reliable grid
is very slow. If you use unreliable resources, for instance with N equals 1, the
red points here, we call the unreliable machine once, and then if it fails, we
call the reliable one. We can see that there are two clusters of strategies, one
that is clustered here and one that is clustered above it. Okay?
Why is this gap in cost? This is a gap in cost -- remember, the Y axis is cost.
Why this large gap? This is because T crosses the threshold of T unreliable, the
time to compute on an unreliable resource. When T is less than T unreliable, we
don't give the unreliable resource a chance to complete and we resort to the
reliable resource, so we always pay the extra cost of the reliable system, so we
are here.
When T is greater than T unreliable, we will be in this cluster.
>>: But the throughput is higher.
>> Assaf Schuster: The throughput is just a little bit higher, yes. We buy
ourselves just a little bit higher throughput. Why is that so? Because we call --
>>: [inaudible] higher than the [inaudible].
>> Assaf Schuster: [inaudible]?
>>: But the throughput is much higher than the reliable -- for the N equals 0.
>> Assaf Schuster: No, I'm talking just about N equals one, the red points.
>>: Okay.
>> Assaf Schuster: Okay. So the throughput is a little bit higher. Why? Because
the reliable machine sometimes comes in with a result that was actually needed.
Even though the reliable resource was called too early, the unreliable resource
failed, and so the actual result came from the reliable resource, and we actually
needed it. So the throughput is higher. Okay?
For N equals 2, you can see two gaps, okay? N equals 2 is the blue circles. You
can see a cluster of blue circles here, a cluster here and a cluster there. The
first gap is when T crosses the threshold of T unreliable, okay? So the first
unreliable resource does not finish and we call the second unreliable resource.
The second gap comes from 2 times T being less than T unreliable, so we call the
reliable resource before we let the two unreliable resources submit their results.
Okay?
But still, you can see that this gets more throughput, like Nikolaj said before,
because sometimes it's needed. Okay? But at a much higher cost. So what do we do?
We draw what we call the pareto frontier.
>>: [inaudible] N equals 3.
>> Assaf Schuster: For N equals 3 --
>>: I would expect three gaps.
>> Assaf Schuster: Yeah, right. But remember that we restrict ourselves to work on
only two unreliable resources, so we don't have those points drawn here; otherwise
it would be a mess. But you are right. For low cost, low throughput, we choose N
equals 3, okay? We choose these points. For high cost, high throughput, we choose
N equals 2, here and there. And in general, what we do is we draw the pareto
frontier. The pareto frontier goes through all of these points. This is also
sometimes called the skyline, and it goes through all the dominating strategies.
Why are they dominating? They're dominating in the sense that for each of the
strategies that you find on the frontier, for that throughput, you cannot get
another strategy with a lower cost; and for that cost, you cannot get another
strategy with a higher throughput.
This is why all the strategies that are drawn on the skyline, on the frontier, are
dominating. What are they dominating? For instance, you can see that they are
dominating all the strategies that have N equal 1. N equal 1 does not participate
in the best strategies in the configurations that I showed you, in the examples
that I showed you. So any pick of a good strategy must be with N equals 2 for high
throughput, or with N equals 3, in the configuration and with the data that I gave
you. Okay?
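The frontier extraction itself is simple; a sketch over (cost, throughput) points
with made-up numbers, keeping exactly the non-dominated points as defined above:

```python
def pareto_frontier(points):
    """Keep the points for which no other point is at least as cheap and
    at least as fast (the 'skyline' of the strategy space)."""
    return sorted(
        (p for p in points
         if not any(q[0] <= p[0] and q[1] >= p[1] and q != p for q in points)),
        key=lambda p: p[1],
    )

if __name__ == "__main__":
    # (cost in cents/task, throughput in tasks per 100,000 s) -- toy values.
    pts = [(2.2, 3.1), (2.4, 3.4), (3.0, 3.3), (2.1, 2.8), (2.6, 3.4)]
    print(pareto_frontier(pts))   # -> [(2.1, 2.8), (2.2, 3.1), (2.4, 3.4)]
```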
So what ExPert does is it extracts all of these points, and these are the points
that the user wants to play with. How does the user play with these points? The
user has, as I told you before, a certain utility function that she wants to
optimize. The utility function can be any of these or any other utility function.
She takes all the points of the efficient strategies from the pareto frontier and
draws the graph according to her utility function.
So for instance, if the utility function is the square of the cost per unit of
throughput, then the minimum is going to be found here. The minimum is in terms of
throughput, the number of tasks per hundred thousand seconds in this case. So the
user is going to pick perhaps 3.3 as the throughput at which she wants the system
to execute.
This throughput determines a precise strategy on the pareto frontier. Okay? The
precise strategy in turn determines the N, T, and D that tell the system what to
do. If the user wants to optimize the cost per unit of throughput and not the
square of the cost per unit of throughput, then the minimum is found at 3.395.
And so on. Okay?
So this is how the user works. She has this tool that helps, working backwards, to
find the best working point.
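That selection step is just an argmin of the utility over the frontier points; a
sketch with invented frontier values (the attached (N, T, D) triples are
placeholders, not ExPert's output):

```python
# Toy frontier: (cost cents/task, throughput tasks per 100,000 s, (N, T, D) hours).
frontier = [
    (1.0, 1.0, (3, 10, 16)),
    (1.5, 2.2, (3, 7, 14)),
    (2.5, 3.3, (2, 0, 15)),
    (4.0, 3.6, (2, 0, 20)),
]

def pick(frontier, utility):
    # The user scans the efficient points with her utility and takes the best one.
    return min(frontier, key=lambda p: utility(p[0], p[1]))

print("cost^2 / throughput ->", pick(frontier, lambda c, t: c * c / t))
print("cost / throughput   ->", pick(frontier, lambda c, t: c / t))
# Different utilities select different working points, hence different (N, T, D).
```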
Something else that we can do: of course the user can choose any of these points,
any of these throughputs, to work with, and there is a strategy that supplies that
throughput, if the user decides she wants to pay the related cost. However, there
are some points on the pareto frontier that do not have pure strategies that can
support them. Okay? These are this gap, for instance, or this gap. What can we
do?
So what we can do, if the user decides that she wants exactly this throughput and
she wants to work here, is design an optimal strategy, a pareto-frontier-optimal
strategy, that will support this throughput. What we can do is mix several pure
strategies together: a linear combination of pure strategies will always give a
result on the pareto frontier. And we can use this mixed strategy, and it's an
optimal working point.
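A sketch of that mixing, in my own formulation of what the talk describes: to hit
a target throughput that falls between two pure frontier strategies A and B, pick
the mixing weight so the interpolated point has the target throughput; the cost
interpolates with the same weight.

```python
def mix_for_throughput(a, b, target):
    """a, b: (cost, throughput) of two pure strategies on the frontier.

    Returns (alpha, cost): run a fraction alpha of the tasks with strategy a
    and 1 - alpha with strategy b, so the combined point interpolates linearly.
    """
    (cost_a, thr_a), (cost_b, thr_b) = a, b
    alpha = (target - thr_b) / (thr_a - thr_b)
    assert 0.0 <= alpha <= 1.0, "target must lie between the two pure strategies"
    return alpha, alpha * cost_a + (1 - alpha) * cost_b

alpha, cost = mix_for_throughput((1.5, 2.2), (2.5, 3.3), target=2.8)
print(f"run {alpha:.0%} of tasks with A and {1 - alpha:.0%} with B -> cost {cost:.2f}")
```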
Okay. So once the user decides on the throughput that she wants, that she picks,
now the system has to extract the values of N, T, and D. How does the system do
that? So here on the X axis we have the throughput. And remember that the user
said, I want to work here, or I want to work there. Okay? And on the Y axis there
is the parameter value. And we can see in our specific example that the throughput
is divided into two parts: the high cost, high throughput part and the low cost,
low throughput part. In the high cost, high throughput part, everything is fairly
simple. T is always equal to zero. All the strategies on the pareto frontier have
N equals 2, so we call the unreliable resource twice, then the reliable one. T
always equals zero, which means that immediately after starting with the first
unreliable copy we start with the second unreliable copy and immediately start
with the reliable one. And D grows -- D gets a growing value in order to grow the
throughput further. Yes?
>>: [inaudible] this pareto stuff. Is it always convex?
>> Assaf Schuster: Yes.
>>: It cannot be concave?
>> Assaf Schuster: It's defined as convex, yes.
>>: By definition?
>> Assaf Schuster: By definition. Exactly. I didn't want to get into this, but --
>>: So it is essentially the best that you can reach with mixed strategies?
>> Assaf Schuster: Yes.
>>: [inaudible].
>> Assaf Schuster: Exactly. Exactly. Sorry it was not clear. But anyway, back to
choosing the parameters. Okay. So it's easy with the high cost, high throughput
part. In the low cost, low throughput part, all of the strategies that define the
pareto frontier have N equals 3: three times invoking the unreliable resource
before invoking the reliable one. And we can see that it's strongly dependent on
the value of T, which is the blue circles, and it's also strongly dependent on D,
until a certain point where D stabilizes and remains fairly high, and then D
starts to fluctuate and jump between different values, for reasons I don't want to
get into.
So you can see this is the way that the system picks the values of N, T, and D.
It's very easy for the system, it's very fast, it can be done online.
So a question that can come up is how much can we gain from using it? Okay. So we
defined the cost ratio to be the ratio of the cost of a certain strategy to the
cost of the best strategy that is found using the pareto frontier. Okay? And the
maximal pure strategy cost ratio is the ratio of the worst strategy that you can
choose to the best strategy that ExPert chooses. Okay?
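In code, these two ratios are one-liners over the costs of the strategies under
consideration (a sketch with made-up costs):

```python
def cost_ratio(strategy_cost, best_cost):
    # How much more a given strategy costs than the best one on the frontier.
    return strategy_cost / best_cost

def max_cost_ratio(costs):
    # Worst pure strategy a user could pick vs. the best one ExPert would pick.
    return max(costs) / min(costs)

print(max_cost_ratio([2.0, 2.3, 2.1, 2.9]))   # -> 1.45
```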
So here's an example, an example graph. The Y axis is the ratio of the cost of the
reliable machine to the unreliable machine. The X axis is the ratio of the time it
takes to execute a task on the reliable machine to the time it takes to execute it
on an unreliable machine. So higher up means a higher cost ratio of reliable to
unreliable; further right means a higher computation-time ratio of reliable to
unreliable.
And let's look at different parts of the space.
In the upper right part of the space, we know that the reliable resource is both
very slow and costly. Okay? So on the one hand, if you make a mistake in picking a
strategy, this mistake is very harmful, okay? The max cost ratio is going to be
very, very large from the worst-case strategy to the best strategy.
On the other hand, it's relatively easy for the user not to make mistakes here.
Because the reliable resource is both slow and costly, it will be deferred as much
as possible.
If we look at the other corner, okay, where the reliable resource is both cheap
and fast, we have the same situation. A mistake here can be very harmful -- the
max cost ratio will be very high. However, it is fairly easy for the user not to
make mistakes here, because the reliable resource is very fast and very cheap.
Okay? In the other places, where the reliable resource is costly but fast or
where the reliable resource is slow but cheap, it's more involved to make a
decision. Okay? It's harder for the user to make a decision. It's easier for the
user to make a mistake and to choose a non-optimal strategy. Okay?
So how much can we gain? The max cost ratio here is not very high. It's 1.142, so
we can gain something like 15 percent by using the best strategy, assuming that
the user would otherwise choose the worst-case strategy. Here in this place we can
already gain 1.77 for the max cost ratio. Overall we estimate that what a user
will gain from using this optimization scheme is around 50 percent to 70 percent.
And if you consider that this is the energy cost for executing millions and
hundreds of millions of tasks, it becomes substantial.
Okay. So to conclude, the pareto frontier enables a rationally optimizing
[inaudible] to optimize her utility function and find the best strategy for
execution. Realistic savings are on the order of 50 to 70 percent. And what I hid
from you, what I did not expose, is a lot of details that have to do with
realistic systems.
For instance, at the beginning of the tail, the ratio of the number of tasks to
the number of available unreliable machines is relatively high. At the end of the
tail, it's relatively low. So it makes sense to change the replication strategies
as we go along the tail. Okay? So we can do that. We're still working on
optimizing that. It's not trivial.
One other thing that you would like to do -- because you have a choice, because
the unreliable resource is not homogeneous like I told you here, but sometimes
it's [inaudible] and is made out of many different machines with different
reliability factors and different execution times and different costs -- is to ask
for a resource and, when we get the parameters of this resource, compute the best
strategy for working with this specific resource: the best N, T, and D for the
specific time in the execution of the bag of tasks and with respect to the
specific parameters of the resource. The whole thing becomes more and more
complicated, the graphs become more and more complicated, and this is current and
future work. And that's it.
>>: [inaudible] so but still you have relatively few parameters to play with, N,
T, and D, so everything is all [inaudible].
>> Assaf Schuster: The basic scheme, I agree, is relatively easy. Adapting it to
actual systems becomes very hairy, because you have a choice of several systems
and the resources are not homogeneous, and so on and so forth. So in realistic
situations, even though we have relatively few parameters, the computation is not
very simple. The approach is simple. And this is what I wanted to show. Because
it's not currently done, and a saving of 50 percent is really substantial. If you
need a nuclear reactor -- they say now that in grids you can send a bag of tasks
and activate a nuclear reactor with just a click -- a saving of 50 percent is
substantial.
>>: [inaudible] resource [inaudible] is the cost of your computation. The more
complicated the strategy you use, the more expensive the computations. And they
may themselves come to some numbers.
>> Assaf Schuster: Yeah, but I --
>>: [inaudible].
>> Assaf Schuster: You're right, but I thought that Nikolaj said that you can
reach the optimal point of execution with some numeric approximation.
>>: [inaudible].
>> Assaf Schuster: You have to do that. Yes. The computations here are very easy.
They do not take a lot of computational time. We can do them on a laptop in tens
of seconds. When you start taking into account the fact that you have a few
unreliable sources and you want to do adaptation of N, T, and D as you proceed --
>>: [inaudible].
>> Assaf Schuster: Yeah, the dynamic -- then it starts to be a little more
complicated.
>>: [inaudible] using the [inaudible] what about from [inaudible]? So there was
[inaudible] which somehow related to cost.
>> Assaf Schuster: Yeah, you're right. I mean, this is an excellent question.
Many people work on these issues. They are still unresolved, meaning that, for
instance, a large company such as Intel, when it has a peak in the demand for
resources, is still confined to its own grid. They have a grid of 200,000 CPUs
that are spread all over the world. And they have a system that's called
[inaudible] that is sending the jobs, and they do not use external clouds, and
they do not use external grids, although they could have used them at a very low
cost. Sometimes they are in very high demand. And the reason is precisely what
you said, the security. They do not trust the security means.
There's also another issue of cheating. Some resources would take a lot of
computation and return wrong results. So there are some methods that people use in
order to check that a resource is reliable and to put it on a blacklist when it's
unreliable.
Usually what people try to develop these days in order to ensure security is
working on top of virtual machines. So the resource will receive a task. The task
is already wrapped in a virtual machine. The virtual machine starts executing.
The task cannot do any harm to the environment. And hopefully the environment
cannot infer from the task anything about the full bag.
>>: [inaudible] comes with its own virtual machine.
>> Assaf Schuster: Yeah. The task comes with its own virtual machine. The problem,
of course, is that virtual machines have a computational overhead which people are
not willing to pay at this point, and there's a lot of work in these directions,
too.
>> Nikolaj Bjorner: Thank you.
>> Assaf Schuster: Thank you.
[applause]