>> Nikolaj Bjorner: All right. It's my pleasure to invite back Assaf Schuster, who was here this summer. Now he's here for a few days, and if you are interested and have time to meet with him after the talk, please let us know. This time he will talk about ExPert, which, as I understand it, coordinates backup tasks. I hope there will be a bag of tricks for bags of tasks in the talk. Thanks. Please go ahead. >> Assaf Schuster: Thank you, Nikolaj. So it's a pleasure to be here again. I will talk about ExPert; it's Pareto-efficient replicated task execution in unreliable environments. It's joint work with two of my students, Orna Agmon Ben-Yehuda and Mark Silberstein, and with Alexandru Iosup from TU Delft. I'll go through the motivation and our solution, I'll show some details of the solution in terms of examples and graphs -- I will not delve into all the tiny details -- and summarize. All right. So what's the problem we're dealing with? We have a game that is played between the grid users and the grid owners. The grid owner has lots of resources, and the grid user has large bags of tasks that she wants to perform on top of them. The grid user may have bags of millions of tasks or even more, and she sends this bag of asynchronous tasks for execution on the grid. The grid owner of course has a maintenance cost, and he passes some of that cost back to the grid user. Also, in order to try to optimize the whole thing, he enforces some policies in order to provide some quality of service for the prioritized users. In return the grid user, who wants to optimize the turnaround time of her execution and also wants to lower the cost, has some strategies. The strategies say, okay, I'm going to use the unreliable resources that are provided by the grid owner. They are unreliable because from time to time a more prioritized user can come in, and the grid owner will give him priority, will evacuate my task, will stop my task and will let the other task execute. So in the view of the grid user, the grid is an unreliable resource. And the grid user has some set of strategies by which she wants to minimize the cost and the turnaround time of her execution in this unreliable environment. The grid user also usually has at her disposal some reliable alternative. A reliable alternative may be a local cluster, or it may be a cloud where she can purchase some dedicated resources, but the dedicated resources are usually slower and costlier than the alternatives that a user can find on the grid. The grid owner, on the other hand, would like to minimize operational costs. And when I talk about costs in this talk, the best way to understand it is to think about energy. Okay? Cost can be anything -- it can even be money -- but when I say minimize cost, the best way to understand this is to think about minimizing the energy expenditure. The grid owner would also like to minimize the load on the nodes, because he wants to make sure that he can provide quality of service when a prioritized user comes in. So in order to better understand what is going on, I would like to show you some slides from actual executions. This is a slide that we showed last week at Supercomputing. This is a real run on top of a system that we developed at the Technion called Superlink Online.
You can see that at the beginning there are many, many tasks in the queue, and there is a steady decline, meaning the environment can execute the tasks at a specific rate, so the user can expect completion at this point. However, a phenomenon that we see many times is that there's a tail: once the number of tasks becomes smaller than the number of resources, the tail starts stretching and stretching, and the actual completion is way beyond -- way later than the expected completion point. Such a phenomenon was also observed in more reliable environments such as the Google cloud and in other places, so it's a well known phenomenon. The way we think about it, the initial phase, where the number of tasks is very high relative to the number of resources, is the high throughput phase, and the tail starts when the number of tasks drops below the number of resources. Just to give you a hint about the variety of environments that we deal with -- this is a production system that serves hundreds of geneticists from all over the world -- we run on top of very large clusters, tens of thousands of machines, at the University of Wisconsin-Madison. And you can see the preemption rate; you can see why it is an unreliable environment. We are not the prioritized user -- this is an understatement at the University of Wisconsin -- so we are preempted: every fifth task that we submit is preempted. And there's also a high failure rate. In the Open Science Grid, which is composed of 160,000 machines all across the United States, we also get some time to perform our tasks, and the preemption rate is half that, it's like 10 percent, which is still relatively high. In EGEE, the European grid for eScience, which is a megagrid that is supposed to support the operation of the Large Hadron Collider, we get a preemption rate of seven percent, but the failure rate is twice as high. And then we have our own dedicated, so-called reliable resources, our own small set of machines at the Technion that are dedicated to our system. There's still a preemption rate there, which drops down to two percent, and the failure rate is much, much lower. And we also run on community grids. We have a hundred thousand machines that are registered, but 15,000 of them are active. A community grid is like SETI@home: it is composed of the PCs of people who donate their computation time to the execution. So the preemption rate is very, very low, but the failure rate is very, very high. So these are unreliable resources, and we'll consider this one as the reliable one. >>: [inaudible]. So these are tasks and they all should have completed [inaudible]? >> Assaf Schuster: Yeah. Well, the whole execution -- >>: So all [inaudible] tasks? >> Assaf Schuster: More or less, yes. More or less. In this talk, for simplicity, I'll assume that all of them have the same execution time and so on. >>: But then some tasks took three days to complete? >> Assaf Schuster: Oh, this is the number of tasks, and this is the decline of the tasks: they are sent for execution, they find a machine, they complete, and then the number of remaining tasks drops. >>: And is this [inaudible] on all grids or on specific grids? >> Assaf Schuster: We work on all of them. We consider this one as a reliable resource, and the rest unreliable. >>: So is the tail executed on a particular grid, a subgrid, or -- >> Assaf Schuster: No, no.
Actually the tail is because we keep working on all of them: whenever we find a free machine and we have a task, we send it to that free machine. >>: And how many machines are there on the other grids? The Technion? >> Assaf Schuster: This one -- it depends on the availability of the grid, so it goes up and down in a very erratic way, jittering by hundreds of thousands. This one has thousands of machines. This one 160,000. This one has a hundred thousand machines dispersed across Europe. This one is across the world. And so on. This one is at the Technion; we have it at our disposal. But still, even though you can see that the reality is a bit more complicated than what I will describe next, we will refer in the rest of the talk to reliable resources as fully reliable, meaning that if we submit a task to a machine that we consider reliable, the task will always finish execution and return the result. Okay? >>: This makes your real results stronger? >> Assaf Schuster: We can refine the results to better suit the reality, but then I would not finish the talk in time. >>: What explains such a long tail? >> Assaf Schuster: Okay, so this is a good question, actually. The tasks that remain here are sometimes scheduled on unreliable machines, and these machines do not come back; you don't know what's happening. You wait and wait and wait, and this can take a lot of time, and if the result does not come, you have to reschedule them. Here you have other tasks that finish -- most of the tasks finish, and this decline is for the tasks that do finish. Here you have tasks that do not finish, so they have to be rescheduled again and again, and sometimes these are the problematic tasks that need a very stable and well configured machine in order to actually operate well. There was another question? Okay. So what we want to do in order to solve it -- I guess it's already clear to you what we are going to do. In the high throughput phase we are going to use very cheap unreliable resources like grids, community grids, non-dedicated clusters, whatever we can find. And in the high performance phase, in the tail, where we want to shorten the turnaround time of the whole execution, we are going to use two methods. First, we are going to use reliable resources -- which, as I told you, we are going to assume always complete the task -- in terms of a dedicated cluster or in terms of purchasing execution on clouds. And second, we are going to replicate. We are going to use unreliable resources, but we are going to send a few copies to maximize the probability that one of them returns, okay? What is the problem here? The problem here is that replication is very wasteful in terms of resources. If you have millions of tasks you are wasting a lot of execution here, and if we are thinking about the cost as energy, we are wasting a lot of electricity. And second, reliable and dedicated resources are expensive anyway, even in terms of money. Okay? So this is a problem. Just to show you that when using these methods we can actually succeed, this is an actual execution on top of Superlink Online with a little bit more than three million tasks that we performed in 21 days. You can see that by replication and dedicated resources we were able to shorten the tail and finish on the anticipated time schedule.
The question is, what are the best strategies to employ with these costly dedicated resources and with the replication that wastes resources, in order to both minimize the cost and minimize the total execution time? Okay. So now let me return to the theory. We start by considering the user's bank of strategies. The user has a lot of strategies. We are going to formalize the strategies in terms of three parameters. The first parameter is N. N is the number of times we are going to invoke an unreliable resource before we give up on unreliable resources and invoke a reliable resource. The second parameter is T. T is a deadline after which we do not trust the unreliable resource anymore to produce the result, and we either start another unreliable resource or we call the reliable resource. The third parameter is D. D is a deadline for the unreliable resource, after which we do not wait anymore for the unreliable resource to return the result; the unreliable resource actually knows that it is not expected to return a result, and it stops execution. These parameters are supported by all the environments, such as BOINC, Condor, and so on. These are the three parameters that we are going to play with for the different strategies. Okay? So here is a strategy, for instance, with N equals 3. We are going to try unreliable resources three times, and only when we don't get the result do we give up and invoke the reliable resource. So at the beginning we invoke an unreliable resource. If we reach time T and no result has returned, we are going to invoke another unreliable resource, and until time D we still wait for the first unreliable resource to return the result. Then, with the second unreliable resource, if no result returns, we wait until time 2T for it to return a result, and still a little bit more, but if by time 2T no result has returned, we start another unreliable task. And we start the other unreliable task on the same machine as the first unreliable task, because we restrict ourselves in this presentation to the situation where we use only two unreliable resources, only two unreliable machines. If until time 3T no result has returned from the unreliable resources, we give up and call the reliable resource, and the reliable resource, as I told you, we assume here in this simplistic presentation always returns a result after time Tr, the time it takes the reliable resource to compute the task. Okay? So two simultaneous unreliable instances at most; we are going to restrict the strategies in this way. The reliable machine is used to ensure that we can always complete the task execution. T is a replication time, D is a deadline, and the whole thing is interesting only when the number of machines is larger than the number of unfinished tasks, of course. Okay. So this is an example of a strategy with N equals 3. But you can see that there are a lot of such strategies, and the question is how to choose among them.
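To make the N, T, D parameterization concrete, here is a minimal sketch in Python. It is my own illustration, not ExPert's code: the names Strategy and submission_schedule are hypothetical, D is interpreted as a deadline measured from each replica's own start time, and the restriction to two simultaneous unreliable machines is ignored.

```python
from dataclasses import dataclass

@dataclass
class Strategy:
    """A replication strategy as described in the talk (illustrative sketch)."""
    n: int      # number of unreliable replicas before falling back to the reliable resource
    t: float    # interval after which we stop trusting a replica and launch the next one
    d: float    # deadline after which an unreliable replica stops and is abandoned

def submission_schedule(s: Strategy):
    """Return (start_time, kind, give_up_time) for every submission of one task.

    Unreliable replica i is sent at time i*T and abandoned at i*T + D;
    the reliable resource is invoked at time N*T if nothing has returned by then.
    """
    events = [(i * s.t, "unreliable", i * s.t + s.d) for i in range(s.n)]
    events.append((s.n * s.t, "reliable", None))  # assumed to always finish
    return events

# Example: the N = 3 strategy from the talk, with made-up T and D values in hours.
for start, kind, give_up in submission_schedule(Strategy(n=3, t=2.0, d=3.0)):
    print(f"t={start:4.1f}h  submit {kind:10s}  abandon at {give_up}")
```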
So what does the user care about? The user cares about the cost, as the mean cost per task that the user has to pay. The user also cares about the execution time, and the execution time is minimized when the throughput is maximized. So I'm going to talk about throughput and not about execution time. Okay? The user wants to maximize the mean peak throughput, and the mean peak throughput I am going to measure on top of two unreliable machines and an additional reliable machine as an execution unit for the time being. Okay? Then I can scale it to the whole grid, but this is the way I am going to measure. I'm going to measure two parameters: the cost per task and the mean throughput. And now the user may have all kinds of utility functions she wants to optimize. For instance, the user may have a deadline after which she cannot proceed with the execution, so the user may want to maximize the throughput to meet that deadline. The user may have a budget, so the user may want to minimize the cost in order to meet the budget. The user may want to minimize the cost that she pays for every unit of throughput -- this is like the energy-delay metric that is usually referred to in computer architecture. And the user may have other utility functions that she wants to optimize. Any other function that is expressed by means of cost and throughput can be expressed in this way. >>: I'm confused about the [inaudible] in the definition. [inaudible]. >> Assaf Schuster: You can think about it as a simple [inaudible] of the time at which the first result of the task execution returns. We call it the peak throughput, and we measure it for a unit of two unreliable machines and an additional reliable machine; it just simplifies the presentation here. It gets more complicated later, but I don't want to present that -- I want to present the general approach. All right. Again, back to the motivation: we have the game that is played between the owner of the grid and the user. The user wants to optimize his execution, so he tries to find, out of the world of possible strategies, a strategy that will optimize the utility function. The user usually does that by means of intuition and trial and error. So the behavior of the user that is seen by the owner is erratic and not stable. The owner considers the user erratic, unstable and irrational, tries to also optimize her own environment in a trial-and-error manner, and the whole thing does not converge. We see this many times. This is the reason why very large environments such as EGEE have erratic and unanticipated behavior and take years to converge to a specified way of work, to a [inaudible] of work. What we want to do is break this cycle by giving the user the means and tools to optimize her strategies. In this way, once the user is optimized and is working, in the view of the owner, in a rational way, the owner can also optimize her own environment, and the whole thing is energy optimized or cost optimized or whatever cost function you have. Okay. So this is what we want to do. In terms of related work, because of the importance of this problem there are relatively many recent works. All of these works consider specific utility functions in specific configurations and do not suggest a general framework into which you can plug your configuration and solve for any environment.
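Before going on to the solution, here is a small sketch of how the different utility goals just mentioned can all be expressed over the same two quantities, cost per task and throughput. This is my own illustration; the numbers and function names are made up, and the cost-squared variant mirrors the energy-delay-style metric mentioned above.

```python
# Candidate working points: (mean cost per task in cents, throughput in tasks per 100,000 s).
# The numbers are made up for illustration only.
points = [(2.1, 2.8), (2.4, 3.1), (3.0, 3.3), (4.9, 3.9)]

# Each utility is "smaller is better" and uses only cost and throughput.
utilities = {
    "cost_per_throughput": lambda c, r: c / r,
    "cost_sq_per_throughput": lambda c, r: c * c / r,               # energy-delay-like metric
    "meet_deadline": lambda c, r: c if r >= 3.2 else float("inf"),  # min cost s.t. throughput target
    "meet_budget": lambda c, r: -r if c <= 2.5 else float("inf"),   # max throughput s.t. budget
}

for name, utility in utilities.items():
    best = min(points, key=lambda p: utility(*p))
    print(f"{name:24s} -> cost={best[0]}, throughput={best[1]}")
```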
Okay. So what is the solution concept that we have in mind? We are going to have this system, ExPert. The first thing ExPert does is extract from the user some data that is needed for the computation of the best strategies. The data is, for instance, the cost for the user to submit a task to an unreliable machine, the cost of maintaining the reliable machine, and so on. ExPert also needs additional data from the large grid, from the unreliable resource pool. We get the grid parameters from an analysis of traces that are coming from the grid; I'll show you later, at a high level, more or less how it's done. Once we have all of this data, ExPert produces, at a fine enough resolution, the space of all throughput-times-cost points, and in this space it helps the user choose those points that are most efficient. The points that are most efficient are picked very simply by using the Pareto frontier, which I'm going to show you in a second. Once the user knows what the best working points are, the user chooses, according to her utility function, the throughput at which she wants to work. The throughput corresponds to a certain cost, which the user computed to be the best from her point of view. Once the throughput is known, we can go to the space of throughputs and all the possible strategies and extract the corresponding strategy that will provide this throughput at the best cost. The best strategy is given, again, in terms of N, D, and T, the parameters that I showed you before. Once we have them, N, D, and T are fed back into the scheduler, and the scheduler can now replicate into the unreliable system or into the reliable system according to the selected strategy. So this is how we think the system should work; this is how ExPert works. And it provides the user with a well-defined framework to choose the best strategies. Okay. In order to show you some details, I want to do it in terms of examples and graphs, and the examples and graphs will be taken on a specific unreliable resource. The specific unreliable resource that I want to mention is called Auvergrid. The traces of Auvergrid from 2006 are available online. The trace holds a line per task with the following fields: the status of the task, whether it succeeded or failed after being sent to the unreliable resource; the runtime of the task, but we get this information only for successful tasks, because for unsuccessful tasks it may be that the machine never returns to us, so we do not know what happened; and the waiting time from submission to the start of the run itself. Okay? So the result time is the runtime plus the wait time, and this is what actually interests the user. The wait time is known only for those tasks that did not fail in certain environments, but we don't care. These are the three elements that we need in order to compute the probability that a task sent to the system returns within time D. How do we do that? First we define some data and parameters that, as I told you, ExPert gets from the user and from the traces. For instance, the time to execute a task on an unreliable machine in Auvergrid, in the traces that we considered, used to be around seven hours for a single task. This is the average task length in Auvergrid.
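Here is a minimal sketch of the trace analysis step as I understand it from the description above: read a per-task trace with status, runtime and wait time, estimate the reliability factor as the fraction of successful tasks, and build the empirical CDF of wait times, treating failed tasks as having an infinite wait time. The field names and CSV layout are my assumptions, not the actual Auvergrid trace format.

```python
import csv

def analyze_trace(path):
    """Estimate (reliability factor, sorted wait times of successful tasks) from a task trace.

    Assumed columns: status ("ok"/"failed"), runtime_s, wait_s.
    Failed tasks are treated as having infinite wait time, so they only lower the reliability.
    """
    waits, total, succeeded = [], 0, 0
    with open(path) as f:
        for row in csv.DictReader(f):
            total += 1
            if row["status"] == "ok":
                succeeded += 1
                waits.append(float(row["wait_s"]))
    reliability = succeeded / total if total else 0.0
    return reliability, sorted(waits)

def wait_cdf(sorted_waits):
    """Empirical CDF of wait times: fraction of successful tasks that started within t seconds."""
    def cdf(t):
        if not sorted_waits:
            return 0.0
        return sum(w <= t for w in sorted_waits) / len(sorted_waits)
    return cdf
```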
For the cost of the unreliable system, what ExPert needs in order to compute the strategies is the cost of time. The cost of time here -- the example is with energy -- is taken as the average cost of electricity around Europe, which is around 10 cents per kilowatt-hour, at 100 watts, which is approximately the power that you need for a single resource, a single CPU. To compute the total cost, the cost of time is multiplied, for a successful task, by the time it takes to compute the task itself. And then we also have a fixed overhead, which we call C sub OH, for overhead. This is a fixed overhead that we have to pay for every task that is submitted to the grid, regardless of whether this task was successful or not; it is needed for preparing the task and submitting it. We estimate it at the cost of 10 seconds' worth of work. For the reliable system properties we also need to know what we have at our disposal, and we took the Technion local cluster as an example. The Technion local cluster is composed of fairly old machines. Although it's reliable, the machines are old, so they are five times slower than the average machine in the unreliable system. So the time to execute on a reliable machine is five times the time it takes to execute on an unreliable machine. And the cost of maintaining the cluster and everything is estimated such that executing a task there is five times more expensive than executing it on an unreliable machine. Okay. The other information that we need is the probability that the unreliable system returns the task successfully by time t. How do we compute it? We take the graph of the CDF of the wait times. We decide that a failed task has an infinite wait time -- we treat a failed task as if it waited an infinite time to be scheduled. So we multiply the reliability factor of Auvergrid, which is 0.86, by the CDF of the wait times, and we get this graph. And now we have the CDF -- the probability of the task returning successfully from Auvergrid by time t -- as the reliability factor times the CDF of the wait times, shifted by the amount of time that is needed to execute the task on the unreliable resource. So overall the CDF that we get -- >>: [inaudible] CDF? >> Assaf Schuster: The cumulative probability function. >>: [inaudible]. >> Assaf Schuster: Distribution function. Cumulative distribution function. So the green graph that we get here is the probability that by time t a successful result has been returned to us by the unreliable resource. Okay? This is what we get. And this is a simplistic computation; a more involved computation can be plugged in. Again, this is only for the purpose of the presentation here.
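The success probability just described can be written compactly; here is a small sketch of my own, under the same simplifying assumptions: the probability that one unreliable replica has returned a result by time t is the reliability factor times the wait-time CDF shifted by the unreliable execution time.

```python
def success_cdf(reliability, wait_cdf, t_unreliable):
    """P(result returned by time t) for one replica on the unreliable system.

    Model from the talk: a replica succeeds with probability `reliability`,
    and if it succeeds its result arrives after wait_time + t_unreliable,
    so F(t) = reliability * CDF_wait(t - t_unreliable).
    """
    def F(t):
        return reliability * wait_cdf(t - t_unreliable) if t >= t_unreliable else 0.0
    return F

# Example with Auvergrid-like numbers: reliability 0.86, ~7-hour task on an unreliable machine.
# toy_wait_cdf is a stand-in; in practice the CDF comes from the trace analysis above.
toy_wait_cdf = lambda t: min(max(t / 3600.0, 0.0), 1.0)   # toy: all waits uniform in [0, 1 h]
F = success_cdf(0.86, toy_wait_cdf, t_unreliable=7 * 3600)
print(F(7.5 * 3600))   # probability of having a result within 7.5 hours
```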
All right. So once we have all of these details in place, we want to see some results. We first look at the cost expectation values. Let's take N equals 1 -- all the strategies for N equals 1. To remind you, N equals 1 means that we submit once to the unreliable grid, and if the unreliable grid does not return the result after time T, we give up and submit to the reliable resource. Okay? So what do we see here? The Y axis is the ratio of the time T that we wait until we start the reliable execution to the time of execution on the unreliable resource. Okay? The X axis is D, the time we are still going to wait for a result from the unreliable grid, normalized again by T unreliable, the time it takes the unreliable resource to compute the task. The cost is in cents per task. And what we can see here is that if T is less than one, meaning that we do not give the unreliable grid a chance to return a result before we start the reliable execution, then the costs are very high. Okay? Because we always invoke the reliable resource. But once T crosses this threshold -- once T becomes larger than the time it takes to execute the task on an unreliable resource -- the costs become lower, but they're still relatively high, because the unreliable resource sometimes does not return the result in time. And the cost gets smaller and smaller as we grow T. So lower cost means higher T. Okay? This is for cost. Let's look at the throughput. What happens with the throughput? Again, we have as the Y axis T normalized by the time it takes to execute on the unreliable grid, and the X axis is D normalized by T unreliable. The units here for throughput are tasks per 100,000 seconds. And what we can see is that when D is very small, close to one, again we do not give the unreliable resource sufficient time to return a result. Okay? So you can see that the throughput is very low -- this is again for N equals 1 -- because the unreliable resource does not return the result on time and stops execution, because that's what D means: the unreliable resource stops execution. So we have to wait for the reliable machine to provide the result, but the reliable machine is five times slower than the unreliable machines, and so the throughput is very low. Once we grow D towards 2, the throughput becomes higher and higher, and if we grow it further, the larger the D, the larger the throughput. And if we also look at T, the lower the T, the higher the throughput. Why is that? Because the lower the T, the faster we invoke the reliable resource; the faster we invoke the reliable resource, the faster it comes back with a result when the unreliable resource fails. So the throughput grows with shrinking T and growing D -- the opposite of what we had for the cost. And this is the tradeoff that we want to hit.
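As a rough sketch of how the mean cost per task and the throughput of a given (N, T, D) strategy could be estimated, here is a small Monte Carlo simulation of the model above. This is my own illustration, not ExPert's computation (which works analytically with the CDF), and the cost accounting is deliberately simplified as noted in the comments.

```python
import random

def simulate(n, t, d, reliability, sample_wait, t_ur, t_r,
             cost_rate, c_oh, n_runs=100_000, seed=0):
    """Monte Carlo estimate of (mean cost per task, throughput in tasks per 100,000 s).

    Simplified accounting (my assumption): every submission pays the fixed overhead c_oh;
    an unreliable replica pays cost_rate * t_ur only if its result arrives before its
    deadline; the reliable machine, if invoked at time n*t, pays cost_rate * t_r.
    """
    rng = random.Random(seed)
    total_cost, total_time = 0.0, 0.0
    for _ in range(n_runs):
        cost, got_result_by = 0.0, float("inf")
        for i in range(n):
            start = i * t
            if got_result_by <= start:          # a result already arrived; stop replicating
                break
            cost += c_oh
            if rng.random() < reliability:      # this replica will eventually succeed
                ready = start + sample_wait(rng) + t_ur
                if ready <= start + d:          # result arrives before the replica's deadline
                    cost += cost_rate * t_ur
                    got_result_by = min(got_result_by, ready)
        if got_result_by > n * t:               # no result yet: fall back to the reliable machine
            cost += c_oh + cost_rate * t_r
            got_result_by = min(got_result_by, n * t + t_r)
        total_cost += cost
        total_time += got_result_by
    mean_time = total_time / n_runs
    return total_cost / n_runs, 100_000.0 / mean_time

# Example call (commented out), with rough Auvergrid/Technion-like numbers:
# cost, tput = simulate(1, 1.2 * 7 * 3600, 2 * 7 * 3600, 0.86,
#                       lambda r: r.uniform(0, 3600), 7 * 3600, 35 * 3600,
#                       cost_rate=1.0 / 3600,   # ~1 cent per CPU-hour (10 c/kWh at 100 W)
#                       c_oh=10.0 / 3600)       # 10 seconds' worth of work
```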
Okay. So now that we understand the tradeoff, let's look at several possible Ns. Okay? We take all the possible strategies, we sample them at a certain granularity, and we take only the first four values, N equals 0, 1, 2, 3. For the rest of them we can try invoking the unreliable grid more and more times, but the considerations will be the same. So we took only N equals 0, 1, 2, 3. N equals 0 means that we do not use the unreliable grid at all, we only use the reliable one. We can see what the cost is: the cost is relatively high, and the throughput is relatively low, because the reliable grid is very slow. If you use unreliable resources, for instance with N equals 1, the red points here, we call the unreliable machine once, and then if it fails we call the reliable one. We can see that there are two clusters of strategies, one that is clustered here and one that is clustered above it. Okay? Why is there this gap in cost? It is a gap in cost -- remember, the Y axis is cost. Why this large gap? This is because T crosses the threshold of T unreliable, the time to compute a task on an unreliable resource. When T is less than T unreliable, we don't give the unreliable resource a chance to complete and we resort to the reliable resource, so we always pay the extra cost of the reliable system, and we are here. When T is greater than T unreliable, we will be in this cluster. >>: But the throughput is higher. >> Assaf Schuster: The throughput is just a little bit higher, yes. We buy ourselves just a little bit higher throughput. Why is that so? Because we call -- >>: [inaudible] higher than the [inaudible]. >> Assaf Schuster: [inaudible]? >>: But the throughput is much higher than the reliable -- for the N equals 0. >> Assaf Schuster: No, I'm talking just about N equals 1, the red points. >>: Okay. >> Assaf Schuster: Okay. So the throughput is a little bit higher. Why? Because the reliable resource sometimes comes in with a result that was actually needed: even though the reliable resource was called too early, the unreliable resource failed, so the actual result came from the reliable resource and we actually needed it. So the throughput is higher. Okay? For N equals 2, you can see two gaps, okay? N equals 2 are the blue circles. You can see a cluster of blue circles here, a cluster here and a cluster there. The first gap is when T crosses the threshold of T unreliable, okay? The first unreliable resource does not finish, so we call the second unreliable resource. The second gap comes from 2T being less than T unreliable, so we call the reliable resource before we let the two unreliable resources submit their results. Okay? But still, you can see that this gets more throughput, like Nikolaj said before, because sometimes it's needed -- but at a much higher cost. So what do we do? We draw what we call the Pareto frontier. >>: [inaudible] N equals 3. >> Assaf Schuster: For N equals 3 -- >>: I would expect three gaps. >> Assaf Schuster: Yeah, right. But remember that we restrict ourselves to work on only two unreliable resources, so we don't have those points drawn here; otherwise it would be a mess. But you are right. For low cost, low throughput, we choose N equals 3, okay? We choose these points. For high cost, high throughput, we choose N equals 2, here and there. And in general what we do, we draw the Pareto frontier; the Pareto frontier goes through all of these points. It is also sometimes called the skyline, and it goes through all the dominating strategies. Why are they dominating? They are dominating in the sense that for each of the strategies that you find on the frontier, for its throughput, you cannot get another strategy with a lower cost, and for its cost, you cannot get another strategy with a higher throughput. This is why all the strategies that are drawn on the skyline, on the frontier, are the dominating ones.
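A minimal sketch of extracting the Pareto frontier (skyline) from a set of (cost, throughput) strategy points, in the sense of domination just defined: keep a point only if no cheaper point achieves at least the same throughput. This is my own illustration of the concept, not ExPert's implementation, and the example points are made up.

```python
def pareto_frontier(points):
    """Return the dominating (cost, throughput, strategy) points: lower cost and higher throughput win.

    The strategy tag is carried along so a chosen frontier point can be mapped back to its (N, T, D).
    """
    # Sort by cost ascending; for equal cost prefer higher throughput.
    ordered = sorted(points, key=lambda p: (p[0], -p[1]))
    frontier, best_throughput = [], float("-inf")
    for cost, tput, strat in ordered:
        if tput > best_throughput:          # not dominated by any cheaper point
            frontier.append((cost, tput, strat))
            best_throughput = tput
    return frontier

# Example with made-up strategy points (cost in cents per task, throughput in tasks per 100,000 s):
pts = [(2.0, 2.7, "N=3,T=2,D=3"), (2.1, 3.0, "N=3,T=1.5,D=3"),
       (2.1, 2.6, "N=2,T=2,D=2"), (4.8, 3.9, "N=2,T=0,D=4")]
print(pareto_frontier(pts))
```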
For instance, you can see that these frontier strategies dominate all the strategies that have N equals 1. N equals 1 strategies are not among the best strategies in the configuration and examples that I showed you. So any pick of a good strategy must be with N equals 2 for high throughput or with N equals 3, in the configuration and with the data that I gave you. Okay? So what ExPert does is extract all of these points, and these are the points that the user wants to play with. How does the user play with these points? The user has, as I told you before, a certain utility function that she wants to optimize. The utility function can be any of these or any other utility function. She takes all the points of the efficient strategies from the Pareto frontier and draws the graph according to her utility function. So for instance, if the utility function is the square of the cost per unit of throughput, then the minimum is going to be found here. The minimum is in terms of throughput -- number of tasks per hundred thousand seconds in this case. So the user is going to pick, perhaps, 3.3 as the throughput at which she wants the system to execute. This throughput determines a precise strategy on the Pareto frontier. Okay? The precise strategy in turn determines the N, T, and D that tell the system what to do. If the user wants to optimize the cost per unit of throughput and not the square of the cost per unit of throughput, then a different minimum is found, at 3.395, and so on. Okay? So this is how the user works: she has this tool that helps, in the reverse way, to find the best working point. Something else that we can do: of course the user can choose any of these points, any of these throughputs to work with, and there is a strategy that supplies this throughput if the user decides that she wants to pay the related cost. However, there are some points on the Pareto frontier that do not have pure strategies that support them. These are this gap, for instance, or this gap. What can we do? If the user decides that she wants exactly this throughput and she wants to work here, we need to design an optimal strategy, a Pareto-frontier-optimal strategy, that will support this throughput. What we can do is mix several pure strategies together; a linear combination of pure strategies will always give a result on the Pareto frontier, and we can use this mixed strategy as an optimal working point.
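Here is a sketch, again my own illustration rather than ExPert's code, of the two steps just described: choose the frontier point that minimizes the user's utility function, and, if the desired throughput falls between two pure frontier points, form a mixed strategy as a linear combination of them, that is, run a fraction alpha of the tasks with one pure strategy and the rest with the other, so that cost and throughput interpolate under the convexity assumption discussed below.

```python
def best_frontier_point(frontier, utility):
    """Pick the frontier point minimizing utility(cost, throughput); returns (cost, tput, strategy)."""
    return min(frontier, key=lambda p: utility(p[0], p[1]))

def mixed_strategy(frontier, target_tput):
    """Mix of two adjacent pure frontier strategies achieving `target_tput`.

    Returns (alpha, strat_lo, strat_hi): a fraction alpha of tasks run with strat_lo and
    (1 - alpha) with strat_hi, interpolating linearly between the two pure points.
    """
    frontier = sorted(frontier, key=lambda p: p[1])          # sort by throughput
    for (c_lo, t_lo, s_lo), (c_hi, t_hi, s_hi) in zip(frontier, frontier[1:]):
        if t_lo <= target_tput <= t_hi:
            alpha = (t_hi - target_tput) / (t_hi - t_lo)     # weight of the low-throughput strategy
            return alpha, s_lo, s_hi
    raise ValueError("target throughput outside the frontier range")

# Example, using a frontier like the one from the previous sketch:
frontier = [(2.0, 2.7, "N=3,T=2,D=3"), (2.1, 3.0, "N=3,T=1.5,D=3"), (4.8, 3.9, "N=2,T=0,D=4")]
print(best_frontier_point(frontier, lambda c, r: c * c / r))  # square of cost per unit throughput
print(mixed_strategy(frontier, target_tput=3.5))              # mix of the two nearest pure strategies
```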
Okay. So once the user decides on the throughput that she wants, the system has to extract the values of N, T, and D. How does the system do that? Here on the X axis we have the throughput -- remember that the user said, I want to work here, or I want to work there -- and on the Y axis there is the parameter value. And we can see, in our specific example, that the throughput range is divided into two parts: the high cost, high throughput part and the low cost, low throughput part. In the high cost, high throughput part, everything is fairly simple. All the strategies on the Pareto frontier have N equals 2, so we call the unreliable resource twice, [inaudible] the reliable one. T always equals zero, which means that immediately after starting the first unreliable copy we start a second unreliable copy and immediately also start a reliable one. And D grows -- D gets a growing value in order to grow the throughput further. Yes. >>: [inaudible] this Pareto stuff. Is it always convex? >> Assaf Schuster: Yes. >>: It cannot be concave? >> Assaf Schuster: It's defined as convex, yes. >>: By definition? >> Assaf Schuster: By definition, exactly. I didn't want to get into this, but -- >>: So it is essentially the best that you can reach with mixed strategies? >> Assaf Schuster: Yes. >>: [inaudible]. >> Assaf Schuster: Exactly. Exactly. Sorry it was not clear. But anyway, back to choosing the parameters. Okay. So it's easy with the high cost, high throughput part. In the low cost, low throughput part, N equals 3 defines the Pareto frontier at any cost -- three times invoking the unreliable resource before invoking the reliable one. And we can see that it depends strongly on the value of T, which is the blue circles, and also on D, up to a certain point where D stabilizes and remains fairly high; then D starts to fluctuate and jump between different values, for reasons I don't want to get into. So you can see this is the way that the system picks the values of N, T, and D. It's very easy for the system, it's very fast, it can be done online. So a question that can come up is, how much can we gain from using it? Okay. So we defined the cost ratio to be the ratio of the cost of a certain strategy to the cost of the best strategy that would be found using the Pareto frontier. Okay? And the max cost ratio for pure strategies will be the ratio of the worst strategy that you can choose to the best strategy that ExPert chooses. Okay? So here's an example graph. The Y axis is the ratio of the cost of the reliable machine to the unreliable machine. The X axis is the ratio of the time it takes to execute a task on the reliable machine to the time it takes to execute it on an unreliable machine. So higher up means a higher cost ratio of reliable to unreliable; further right means a higher ratio of the computation time of reliable to unreliable. And let's look at different parts of the space. In the upper right part of the space, the reliable resource is both very slow and costly. Okay? So on the one hand, if you make a mistake in picking a strategy, this mistake is very harmful: the max cost ratio is going to be very, very large, from the worst-case strategy to the best strategy. On the other hand, it's relatively easy for the user not to make mistakes here, because the reliable resource is both slow and costly, so it will be deferred as much as possible. If we look at the other corner, where the reliable resource is both cheap and fast, we have the same situation: a mistake here can be very harmful, the max cost ratio will be very high; however, it is fairly easy for the user not to make mistakes, because the reliable resource is very fast and very cheap. Okay? In the other places, where the reliable resource is costly but fast, or where the reliable resource is slow but cheap, it's more involved to make a decision. It's harder for the user to make a decision; it's easier for the user to make a mistake and to choose a non-optimal strategy. Okay?
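A tiny sketch of the cost-ratio metric as defined above, in my own formulation: the cost ratio of a strategy is its cost divided by the cost of the best strategy ExPert would pick, and the max cost ratio compares the worst pure strategy the user could have chosen to that best one. The example numbers are made up.

```python
def cost_ratio(strategy_cost, best_cost):
    """Cost of a given strategy relative to the best strategy found via the Pareto frontier."""
    return strategy_cost / best_cost

def max_cost_ratio(all_pure_strategy_costs, best_cost):
    """Worst-case penalty of picking a pure strategy by hand instead of using ExPert."""
    return max(all_pure_strategy_costs) / best_cost

# Example: a max cost ratio of about 1.77 means a user who picked the worst pure strategy
# would pay roughly 77 percent more per task than with the ExPert-chosen strategy.
print(max_cost_ratio([2.1, 2.6, 3.72], best_cost=2.1))
```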
So how much can we gain? The max cost ratio here is not very high, it's 1.142, so we can gain something like 15 percent by using the best strategy, assuming that the user would otherwise choose the worst-case strategy. Here, in this region, the max cost ratio is already 1.77. Overall we estimate that the gain from using this optimization scheme is around 50 percent to 70 percent, and if you consider that this is the energy cost for executing millions and hundreds of millions of tasks, it becomes substantial. Okay. So to conclude, the Pareto frontier enables a user to rationally optimize [inaudible] the utility function and find the best strategy for execution. Realistic savings are on the order of 50 to 70 percent. And what I hid from you, what I did not expose, is a lot of details that have to do with realistic systems. For instance, at the beginning of the tail, the ratio of the number of tasks to the number of available unreliable machines is relatively high; at the end of the tail, it's relatively low. So it makes sense to change the replication strategies as we go along the tail. Okay? So we can do that. We're still working on optimizing that; it's not trivial. One other thing that you would like to do -- because you have a choice, because the unreliable resource is not homogeneous like what I told you here, but is sometimes [inaudible] and is made out of many different machines with different reliability factors, different execution times and different costs -- is to ask, for a resource, when we get the parameters of this resource, to compute the best strategy for working with this specific resource: the best N, T, and D for the specific time in the execution of the bag of tasks and with respect to the specific parameters of the resource. The whole thing becomes more and more complicated, the graphs become more and more complicated, and this is current and future work. And that's it. >>: [inaudible] so but still you have relatively few parameters to play with, N, T, and D, so everything is all [inaudible]. >> Assaf Schuster: The basic scheme, I agree, is relatively easy. Adapting it to actual systems becomes very hairy, because you have a choice of several systems and the resources are not homogeneous and so on and so forth. So in realistic situations, even though we have relatively few parameters, the computation is not very simple. The approach is simple, and this is what I wanted to show. Because it's not currently done, and a saving of 50 percent is really substantial. If, as they say now, in grids you can send a bag of tasks and activate a nuclear reactor with just a click, a saving of 50 percent is substantial. >>: [inaudible] resource [inaudible] is the cost of your computation. The more complicated the strategy you use, the more expensive the computations, and they may themselves come to some numbers. >> Assaf Schuster: Yeah, but I -- >>: [inaudible]. >> Assaf Schuster: You're right, but I thought that Nikolaj said that you can reach the optimal point of execution with some numeric approximation. >>: [inaudible]. >> Assaf Schuster: You have to do that, yes. The computations here are very easy; they do not take a lot of computational time. We can do them on a laptop in tens of seconds. When you start taking into account the fact that you have a few unreliable sources and you want to do adaptation of N, T, and D as you proceed -- >>: [inaudible].
>> Assaf Schuster: Yeah, the dynamic case -- then it starts to be a little more complicated. >>: [inaudible] using the [inaudible], what about from [inaudible]? So there was [inaudible] which is somehow related to cost. >> Assaf Schuster: Yeah, you're right. I mean, this is an excellent question. Many people work on these issues. They are still unresolved, meaning that, for instance, large companies such as Intel, when they have a peak in the demand for resources, are still confined to their own grid. They have a grid of 200,000 CPUs that are spread all over the world, and they have a system, called [inaudible], to which they send the jobs, and they do not use external clouds and they do not use external grids, although they could have used them at a very low cost when they are in very high demand. And the reason is precisely what you said: the security. They do not trust the security means. There's also another issue of cheating: some resources would take a lot of computation and return wrong results. So there are some methods that people use in order to check that a resource is reliable and to put it on a blacklist when it's unreliable. Usually what people try to develop these days in order to ensure security is working on top of virtual machines. So the resource will receive a task, the task is already wrapped in a virtual machine, the virtual machine starts executing, the task cannot do any harm to the environment, and hopefully the environment cannot infer from the task anything about the full bag of tasks. >>: [inaudible] comes with its own virtual machine. >> Assaf Schuster: Yeah, the task comes with its own virtual machine. The problem, of course, is that virtual machines have a computational overhead which people are not willing to pay at this point, and there's a lot of work in these directions, too. >> Nikolaj Bjorner: Thank you. >> Assaf Schuster: Thank you. [applause]