>> Galen Hunt: Good morning. It's my pleasure to introduce Daniel Shelepov, who did his undergraduate work at Simon Fraser with Alexandra Fedorova, who was his advisor and is a co-author on this work. They've been looking at heterogeneous multi-core, a topic close at hand, something that many of us care about and something we've been looking at in the Helios work, as have some other people. The other thing I would mention is that they have a publication on this that will appear in the April issue of Operating Systems Review. >> Daniel Shelepov: Okay. Thank you, Galen. Is this on? Okay. Great. So my topic today is HASS, a scheduler for heterogeneous multi-core systems. This is work I have done with several other people, including Alexandra Fedorova, my supervisor. She's also underlined there because she's the author of these slides. Most of the people on the team are from Simon Fraser University, although we've had one person from the University of Madrid and one from the University of Waterloo. Okay. So, the executive summary. This talk is about an operating system scheduler called HASS. HASS stands for Heterogeneity-Aware Signature-Supported. The first part means we're targeting a particular type of emerging architecture, heterogeneous multi-core architectures. The second part means that our scheduler relies on a certain piece of data collected offline, called a signature, which we use to reduce scheduling overhead. Of course, I will be going over this in more detail later on. The objective of our scheduler is to increase overall system throughput compared to a default operating system scheduler. So, a brief introduction to heterogeneous multi-core, although I'm sure some of you know it already. Basically, it's a multi-core architecture where the cores are not identical. For example, in this case we have one big red core and three small blue ones. Besides the color and size difference, the fast core would have a faster clock and maybe a more complex execution core; it's also physically larger, contains a perhaps larger cache, and consumes a lot more power, compared to a smaller, simpler core, which is slower, might be single-issue and in-order, but very power efficient. In this talk we're also assuming a situation where all these cores share the same instruction set, which means they're all general purpose. So why is heterogeneity important? Well, we want to support a diversity of applications. Ordinarily, simple cores might offer you better power savings, but some applications, especially those that are not parallelizable, can only be sped up by running on those fast cores, first because they might be sequential and second because they are able to use those high-performance resources really efficiently. We call such applications sensitive. In this talk I'm going to focus on sensitivity to clock speed differences. On the other hand, we might have insensitive programs, programs that are not really affected by changes in clock speed, for example because they have a high memory access rate. They spend a lot of time stalled waiting for memory requests, and as a result they really don't care whether the CPU is fast or not. For example, these are real figures: if we have a sensitive application and we run it on two cores, one of which is two times as fast as the other, we can expect the performance to also be two times as fast.
On the other hand, if we have an insensitive application, we might increase the clock speed by 100 percent and get a performance increase of only 30 percent. So there you go. The objective, then, is to schedule these programs appropriately: have the sensitive application get the high-speed CPU and give the slower CPU to the insensitive application, like that. And we want to maximize the overall throughput, which in this case equates with performance. Everything I've said up to now has been shown in previous research, so I'm not going to dwell on it, but it has been shown that heterogeneous architectures of this kind are in fact efficient and that the optimal scheduling is of this kind. The problem statement, then, is to match threads with cores, basically by deciding which threads are sensitive and which aren't and creating a good match based on that. When I talk about heterogeneous architectures, most people think I'm talking about architectures that are asymmetric by design, which means the manufacturer has decided at design time to put some powerful cores and some less powerful cores on the chip. But, in fact, some kinds of asymmetry can be encountered even now. For example, process variation, which is a phenomenon related to manufacturing where, due to minor differences in microscale features, some cores that are supposed to be identical are faster than others in a single package. At the same time, we can also have DFS scenarios where the operating system is able to select the clock speed that each core runs at. So these kinds of things will give you asymmetry in today's systems. Now, the rest of the talk will be as follows. First, I'm going to talk about some of the existing approaches to heterogeneous scheduling, then describe our solution in more detail, go over some performance results that we have, then do a side-by-side analysis and comparison with the other approaches, and then summarize. Many of the existing approaches work as follows: as a thread comes in, it is run on both types of cores, and that determines its comparative performance. We can get some kind of performance ratio of fast performance versus slow, and that number then determines whether the application is sensitive or not. Now, the key thing is that we have to run the application on both types of cores to get this kind of information. After we compute these ratios, we assign threads to cores as appropriate, and we are also able to detect phase changes by monitoring the performance of threads continuously. Phase changes, for example, can trigger a repeat of this loop. The advantages of such a scheme include being flexible and accurately reflecting the conditions in the current environment, as well as being phase aware and input aware. >>: How do you measure performance? >> Daniel Shelepov: For example, we look at instructions per second or something like that. By measuring it on different types of cores, you can compare them. >>: Execute it? >> Daniel Shelepov: Yeah. There are some disadvantages, though, because this type of monitoring, where you are required to run each thread on every type of core, means that there is more complexity, and this complexity actually grows as you add more cores, more threads, and especially more types of cores.
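To make the shape of these dynamic-monitoring schemes concrete, here is a minimal sketch in C of the decision step they share. It is an illustration only, not any particular published scheduler; the sampled instruction rates and the threshold are hypothetical stand-ins for readings a real scheduler would take from hardware performance counters while the thread runs on each core type.

```c
#include <stdio.h>

/* Illustrative sketch of the dynamic-monitoring idea: the thread is
 * first run on each core type so that an instruction rate can be
 * sampled there, and the ratio of the two rates decides placement.
 * (Hypothetical values and threshold; not a real scheduler.) */

enum core_type { FAST_CORE, SLOW_CORE };

/* ips_fast, ips_slow: instruction rates sampled while the thread ran
 * on a fast core and on a slow core, respectively. */
static enum core_type choose_core(double ips_fast, double ips_slow,
                                  double sensitivity_threshold)
{
    double ratio = ips_fast / ips_slow;  /* near the clock ratio => sensitive */
    return (ratio > sensitivity_threshold) ? FAST_CORE : SLOW_CORE;
}

int main(void)
{
    /* Two hypothetical threads sampled on cores whose clock ratio is 2x. */
    printf("thread A -> %s\n",
           choose_core(1.9e9, 1.0e9, 1.5) == FAST_CORE ? "fast core" : "slow core");
    printf("thread B -> %s\n",
           choose_core(1.3e9, 1.0e9, 1.5) == FAST_CORE ? "fast core" : "slow core");
    return 0;
}
```

The point to notice is that the two rates can only come from actually running the thread on both core types, and a phase change invalidates them; that is where the complexity discussed next comes from.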
So, as a result of this high complexity, such algorithms scale poorly and are not really expected to work in many-core scenarios, where you have dozens or hundreds of cores. In contrast, our approach works on the basis of something called a signature, which is a summary of an application's architectural properties. Basically, without going into too much detail at this point, it is a description of how a program operates and how it uses resources, and using that information the operating system is able to determine at scheduling time whether the application is sensitive or insensitive. Ideally, this type of signature would be microarchitecture independent, meaning we should be able to apply one signature to pretty much any core we can encounter with the same instruction set. It has to be universal. And the key part about the signature is that if it's obtained statically, for example at development time, it can be shipped along with the binary, and as a result all of this performance monitoring work at scheduling time is completely avoided. Now, going into some more detail about the signatures: we know they must be able to predict sensitivity to changes in clock speed. How would they do that? For our model in this work, we have assumed a domain of CPUs that differ by cache size and clock frequency. For this scenario the signature estimates the performance ratio based on those two parameters only. It estimates performance for different cores, say one core with a cache size of X and another with a cache size of Y, and also perhaps different clock speeds. So what we do is estimate the performance of the program on the one core and then on the other, and by comparing those two values we determine a sensitivity index that we can use to compare different jobs against each other when matching. Now, I'm going to go through this backwards. From the previous slide, we know we want to predict performance dynamics using the signature, and we are looking at cache size and clock frequency. It turns out we can actually do this estimation if we have the cache miss rate of the thread. The cache miss rate works in this case because, by knowing how many cache misses the CPU has to serve, you can estimate how much time the program will spend stalled while it's on the CPU, and as a result you can figure out how much of an improvement it will get from a higher clock speed. Now, estimating the cache miss rate is a separate problem that is shown here in this blue region, which, by the way, represents work done offline. It can be estimated from something called a memory reuse distance profile. I'm deliberately not going to go into too much detail on this one because it's a separate problem related to cache modeling. Suffice it to say that it's easy to run the program once, collect this memory reuse distance profile, and from that estimate the cache miss rate for any cache configuration you are interested in. So if you do this step for every common cache configuration, and there is a fairly limited set of cache sizes you can expect in a real system, you can actually build a matrix of cache miss rates for common configurations, and this is the content of the signature. Then at scheduling time the operating system is able to look up the correct cache configuration and predict performance based on that.
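As a rough illustration of that offline step, the sketch below derives a miss-rate table from a reuse distance histogram using the textbook LRU stack-distance approximation: an access whose reuse distance, measured in distinct cache lines, does not fit in the cache is counted as a miss. This is a simplification for illustration rather than the exact model used in HASS, and the cache sizes, line size, and histogram layout are assumptions of mine.

```c
/* Offline signature construction, sketched: hist[d] counts memory
 * accesses whose reuse distance is d distinct cache lines.  Accesses
 * whose distance does not fit in the cache are treated as misses
 * (fully-associative LRU approximation). */

#define MAX_DIST   (1 << 17)     /* assumed histogram length */
#define LINE_SIZE  64            /* assumed cache line size in bytes */
#define NUM_SIZES  4             /* common cache sizes covered below */

static const long cache_sizes[NUM_SIZES] = {
    512 * 1024, 1024 * 1024, 2 * 1024 * 1024, 4 * 1024 * 1024
};

double miss_rate_for(const long hist[MAX_DIST], long cache_bytes)
{
    long lines = cache_bytes / LINE_SIZE;
    long total = 0, misses = 0;

    for (long d = 0; d < MAX_DIST; d++) {
        total += hist[d];
        if (d >= lines)              /* reuse would not fit in an LRU cache */
            misses += hist[d];
    }
    return total ? (double)misses / (double)total : 0.0;
}

/* The "signature": one estimated miss rate per common cache size,
 * computed once offline and shipped with the binary for the scheduler
 * to look up at run time. */
void build_signature(const long hist[MAX_DIST], double signature[NUM_SIZES])
{
    for (int i = 0; i < NUM_SIZES; i++)
        signature[i] = miss_rate_for(hist, cache_sizes[i]);
}
```

Here the histogram itself would come from the one-time profiling run the talk describes.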
Lastly, I'll say that to do this profiling work we have used the Pin binary instrumentation framework from Intel. I don't know if you've heard about it. Probably. And, finally, I would also like to mention that this method, although accurate, relies on knowing the cache size in advance, and the cache has to be exclusive for this to work, because if the cache is shared you cannot effectively say how much cache you will actually have when you run on it. So this method relies on an exclusive cache, and I'll talk about this later on when I discuss the results. So at this point I'm going to pause. Are there any questions so far? All right. Great. So how accurate is all this? Like I mentioned before, the objective here is to calculate ratios of performance on different types of cores. So, for example, if one core is two times as fast, we would like to be able to predict that a program there runs two times as fast as well, or whatever fraction of that it actually achieves. These diagrams show how well our estimations match the real ratios. On the X axis here, we see the actual observed ratio, meaning, for a given change in clock speed from one to two times as much, how much the performance actually changed for a given program. Each dot is a program. So for mcf, for example, we know that when we increased the clock speed from X to 2X, the performance increased by a factor of approximately 1.35. The two graphs represent two different systems. In this system we increased the clock speed by a factor of two; here, by a factor of one and a half. And each dot is a benchmark from SPEC CPU 2000. The Y axis here is the same ratio as here, except it's estimated using our signature method. So in the perfect scenario, all the dots would be on this diagonal line, and that would mean our estimations are perfectly accurate. Our estimations are, of course, not perfectly accurate, but you can see that in general we keep this diagonal trend, especially in the area above. >>: The performance ratio, is it per cycle or per second? >> Daniel Shelepov: It's per cycle. No. Instructions per second, you're right. And in that respect you can actually compare against an absolute baseline, because if you had cycles here and something was two times slower, you would not notice it. Actually, if you work out the math you can see that you can convert between the two, and the changes between these two values, instructions per second, would also indicate how much time is being wasted on stuff other than executing. It's a side note. You can basically assume we're talking about instructions per second in this comparison. >>: So if you're looking at a transition in frequency, and the change in frequency is disproportionate, instructions per second might not give you it in terms of -- >> Daniel Shelepov: Actually, it will, because, if you remember, we're looking at how much time a thread is in a stalled state while it's waiting for memory requests, and that portion is going to be approximately the same regardless of the clock speed. So instructions per second would also reflect the dynamics, because you can increase the clock speed, but if you spend some time stalled waiting for memory requests you won't get a proportional increase. You can get this sort of information from IPC as well, but it's less intuitive, I would say. All right.
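The clock-sensitivity model this discussion keeps returning to can be captured in a few lines: compute time shrinks with a faster clock, memory-stall time does not. The sketch below is a simplified illustration of that reasoning with made-up per-instruction numbers, not the authors' exact formula.

```c
#include <stdio.h>

/* Simplified model: time per instruction = compute cycles / frequency
 * + memory stall time, where the stall time (in ns) is roughly
 * independent of the core's clock.  All numbers are illustrative. */
static double speedup(double cycles_per_instr, double stall_ns_per_instr,
                      double f_old_ghz, double f_new_ghz)
{
    double t_old = cycles_per_instr / f_old_ghz + stall_ns_per_instr;
    double t_new = cycles_per_instr / f_new_ghz + stall_ns_per_instr;
    return t_old / t_new;
}

int main(void)
{
    /* Sensitive (CPU-bound) program: almost no stalls -> ~2x from a 2x clock. */
    printf("sensitive:   %.2fx\n", speedup(1.0, 0.0, 1.0, 2.0));
    /* Insensitive (memory-bound) program: stalls dominate -> far less than 2x. */
    printf("insensitive: %.2fx\n", speedup(1.0, 2.0, 1.0, 2.0));
    return 0;
}
```

This is essentially why a program that stalls a lot looks flat on those graphs even when the clock doubles.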
So now, with the signature stuff behind us, I can actually go on to explain the actual scheduling algorithm. We start with the concept of a partition. So here are our CPUs, different types of cores, and a partition is just a group of cores, which can be arbitrarily defined with the condition that it has to contain only cores of one type. So, for example, a red core and a blue core could not be in the same partition. Then threads are assigned to partitions, and within partitions they're assigned to CPUs. So it's a two-stage kind of process. What happens is, let's say a new thread arrives in the system. It takes its architectural signature, looks up the type of core in a partition, and estimates its performance for that particular type of core. Using that information, as well as the load factor of the partition, it can estimate its performance in that partition were it scheduled there. By doing the same process on every partition, it can generate an expected performance figure for every partition, and then, by selecting the maximum, it is able to bind optimally. So in this respect the thread acts greedily: it tries to find the best partition to bind to and then goes there. Within partitions, threads can be scheduled using normal operating system scheduling techniques. There's no difference in this scenario, because all the cores inside a partition are identical, so we're not going to talk about that. So each thread picks the best partition. And then, to accommodate possible load changes, this process can be repeated periodically, let's say every half a second: for every half second of CPU time, each thread wakes up and goes through the assignment again. Finally, because greedy actions alone cannot bring us to a globally optimal solution in general, there is also a mechanism built in where a thread can swap with another thread if it decides that this will increase overall system efficiency. This point allows us to achieve globally optimal solutions, at least in theory, whereas greedy decision making alone doesn't. So, yeah, the algorithm itself is really simple. And what are the advantages? Of course, that huge chunk of complexity related to dynamically monitoring threads is completely removed, moved offline, basically. The implementation is really simple, the algorithm is not complex by itself, and as a result it scales well. Of course, this doesn't come for free. Because we assume only one signature per thread, we cannot accommodate different phases inside that thread; the operating system views the thread as something uniform. Just one signature per thread, and we cannot accommodate for that. Now, in fact this is in theory reconcilable, because we can think of a system where a programmer would set up certain points inside a program to mark a phase change, and then signatures could be generated, for example, for each phase, and at scheduling time the correct signature could be passed to the scheduler appropriately. But this has not been implemented yet. Another point is that one signature per application does not allow for changes in program input. It does allow for changes in the program, but the signature can only assume one input. And in theory this could seem like a big problem, because sometimes a program can have huge behavioral differences based on input.
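Returning to the assignment step described a moment ago, here is a minimal sketch of the greedy partition choice. It is an illustration rather than the Solaris implementation: predicted_ips is a hypothetical stand-in for the signature lookup (the estimated instruction rate of this thread on that partition's core type), and the load factor is modeled crudely as the share of a core the thread would get.

```c
#include <stddef.h>

struct partition {
    double clock_ghz;    /* clock speed of this partition's core type */
    long   cache_bytes;  /* per-core (exclusive) cache size */
    int    ncores;       /* cores in the partition */
    int    nthreads;     /* threads already assigned to it */
};

/* Hypothetical helper: predicted instructions/second for this thread on
 * the given core type, derived from the signature entry for cache_bytes
 * and scaled by the clock speed. */
double predicted_ips(const double *signature, double clock_ghz,
                     long cache_bytes);

/* Greedy step: the thread estimates its throughput in each partition,
 * discounted by load, and binds to the best one. */
size_t best_partition(const double *signature,
                      const struct partition *parts, size_t nparts)
{
    size_t best = 0;
    double best_score = -1.0;

    for (size_t i = 0; i < nparts; i++) {
        double ips = predicted_ips(signature, parts[i].clock_ghz,
                                   parts[i].cache_bytes);
        /* crude load factor: share of a core this thread would get there */
        double share = (double)parts[i].ncores / (parts[i].nthreads + 1);
        if (share > 1.0)
            share = 1.0;

        double score = ips * share;
        if (score > best_score) {
            best_score = score;
            best = i;
        }
    }
    return best;
}
```

In the talk's scheme this choice is revisited every half second of CPU time, and occasional thread swaps patch up cases where purely greedy choices are not globally optimal.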
In our testing, somewhat limited in scope, on SPEC CPU 2000 programs, we found that input actually does not make that much of a difference for the majority of programs. So a typical input will give you a signature that is more or less applicable in a broad variety of scenarios. However, we cannot say for certain that the same pattern holds for all benchmarks or all programs out there. In our experience it has not been a very significant barrier. Now, two big points are the following. Multi-threaded applications: being aware of parallelism, for one thing, because if the jobs are interdependent, you could have certain priorities which our algorithm is not aware of. Secondly, in a multi-threaded scenario you could have data sharing, which would be an additional factor to consider in the optimality of a schedule, to reduce, for example, cache coherence traffic. This is the subject of our current work, and there are some things that we're trying out. I could talk about them in more detail later on, if there is time. Shared caching is another big problem. Like I said, our method assumes an exclusive cache for performance estimation, and that is completely compromised in shared cache scenarios. At the same time, we know with a high degree of certainty that future heterogeneous systems will in fact involve shared caches, so it's a problem we must tackle. Fortunately, it doesn't appear to be insurmountable. Optimal scheduling under shared cache conditions is a separate problem; our challenge is taking that solution and combining it with our algorithm, because the optimality criteria can conflict in those scenarios. This is also the subject of current work and extensive investigation. So these are substantial limitations. And now I'm going to move on to the performance evaluation. The experimental platform we used consisted of two machines. One was a 16-core Barcelona machine where we could scale the frequency from 1.15 to 2.3 GHz. The second machine was an eight-core Xeon, also with DFS, varying the frequency from 2 to 3 GHz. The algorithm itself I implemented in the Solaris kernel. The experiment I'm going to show you here is just one of a set of experiments, and you can read more about them in the upcoming paper if you want, or I can describe them in the Q&A session. This experiment is as follows. We have four cores, two of them fast, two times as fast as the slow cores, and we have four applications. Two of them are considered sensitive, meaning that they react strongly to changes in frequency, and two of them are insensitive. So there's a good matching here, except the algorithm has to be able to determine it and exploit it. And I'm going to show two workloads. One is highly heterogeneous, meaning there is a big difference between these two categories, and in the other case the difference is smaller. So let me show you the results of that experiment. We ran that set of four benchmarks simultaneously on four cores, so there were four benchmarks running at the same time and the system was fully utilized. The graph should be read as follows. The 100 percent line represents the expected completion time of that particular program under a default scheduler. The bars then represent the difference in expected completion time using our scheduler or the best static assignment.
So, for example, we know that sixtrack, had it run under our scheduler, would complete 33 percent faster than under the default scheduler on average. mcf, on the other hand, would complete 18 percent slower. And this bar here represents the geometric mean, which stands for the overall change in system throughput. So as you can see, these highly sensitive benchmarks were forced by the scheduler onto faster cores, and as a result they got large decreases in completion time, up to 33 percent in this case. These two benchmarks, on the other hand, were forced onto slower cores, but because they are insensitive, the performance hit they experienced was comparatively smaller than the gain here. As a result the overall completion time decreased by 13 percent in this case, which means that the overall system throughput increased. And if you look at the blue bar here, it shows that our algorithm in this case matched the best static assignment. This is a mildly heterogeneous scenario, where the setup is very similar, except that here the two benchmarks considered insensitive are less insensitive than they were before. As a result, they experienced a larger performance hit when they were placed on the slower partition, on slower cores, so the total win we get is smaller. But it still matches the best static assignment in the geometric mean. Any questions about this graph? No? >>: Can you explain the 100 -- the baseline better? >> Daniel Shelepov: How did we come up with this number? >>: Yes, what's the baseline number? >> Daniel Shelepov: The baseline -- we consider a heterogeneity-agnostic scheduler, basically a scheduler that doesn't know about any differences and assigns programs randomly. And the 100 percent would be a statistical composite representing the expected completion time you would get if you were actually scheduling that way. >>: You ran two of the processors fast and two slow. >> Daniel Shelepov: Yes. >>: The default scheduler would just randomly assign? >> Daniel Shelepov: Yes. >>: So statistically each thread would spend half the time on a fast processor and half the time on a slow processor, that's the baseline? >> Daniel Shelepov: Well, it's [inaudible], but in fact there is a little bit of a problem there. The metric you just described is one metric, but it can be too pessimistic in reality, because when you schedule threads on fast and slow cores, on average threads retire instructions faster on the faster cores, so on average they get more of their work done on a faster core. So what you could instead do is consider an ideal round-robin scheduler where the time shares are equal; if you count the instructions retired on the fast core versus the slow core, you get a tilt toward the faster core. So we had to use two of those metrics, in fact, because on one hand the default one is too pessimistic, but the ideal round robin we believe is too optimistic, because in fact ensuring that the programs get equal time shares is very difficult; you cannot really expect the default scheduler to behave in a round-robin fashion. So we found that the ideal round robin on its own is not a valid metric. What we did was take the two metrics and combine them, take an average.
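For what it's worth, here is one plausible way to read that composite baseline, with made-up completion times; the authors' exact formula may differ. The assumption here is that the pessimistic metric splits the program's work evenly across core types, the optimistic ideal round robin splits CPU time evenly (so more instructions retire on the fast core), and the 100 percent baseline is the average of the two.

```c
#include <stdio.h>

/* t_fast, t_slow: completion times if the program ran entirely on a
 * fast or entirely on a slow core.  Numbers below are hypothetical. */

/* Pessimistic reading: half of the work is done on each core type. */
static double random_assignment(double t_fast, double t_slow)
{
    return 0.5 * (t_fast + t_slow);                     /* arithmetic mean */
}

/* Optimistic reading: equal CPU time on each core type, so more
 * instructions retire on the fast core. */
static double ideal_round_robin(double t_fast, double t_slow)
{
    return 2.0 * t_fast * t_slow / (t_fast + t_slow);   /* harmonic mean */
}

int main(void)
{
    double t_fast = 100.0, t_slow = 200.0;              /* hypothetical seconds */
    double baseline = 0.5 * (random_assignment(t_fast, t_slow) +
                             ideal_round_robin(t_fast, t_slow));
    printf("composite baseline: %.1f s\n", baseline);
    return 0;
}
```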
I can tell you right away that the ideal round-robin metric is usually faster than this one by, let's say, under five percent, like two to five percent. So you would still get a benefit compared to that, but it would be smaller. But it's a very good observation. So, 13 versus 9 percent speedup. Let me point out that this is the best you can get with a static assignment on this platform. Were the heterogeneity between the cores more strongly pronounced, you would get more benefit. For example, if you had, I don't know, different cache sizes as well as clock speeds, perhaps you could get a larger speedup because the contrast between the cores would be larger. All right. We talked about that. So this was just one experiment among others. We tried different workloads and different setups, and on average we gained a speedup of about the same magnitude as the one I just showed you. Most of the time we actually matched the static assignment, although in a few cases the inaccuracies of our estimation method led us to an assignment that was not optimal, because we misestimated the performance of a benchmark. Those cases were relatively few. We also tested the algorithm in relatively complex scenarios, up to 12 cores at the same time and up to 60 threads, and in all those cases the overhead of the algorithm was fairly small, under one percent. So it scales well. However, because of that part I was talking about, about shared caches, optimality is really compromised in those scenarios; in that case HASS doesn't do very well, because what it thinks is optimal might not be, due to conflicting cache accesses. All right. So let's go on to compare this with the other algorithm. We also decided to implement a dynamic monitoring algorithm, because previously those were only simulated; we didn't find an actual implementation of an algorithm that would measure performance in real time and use it in scheduling. So we decided to do that. For the baseline we chose the same algorithm I was mentioning in the first half of the presentation, where we would monitor IPC on different core types, compare them to get a performance ratio, and based on that performance ratio find the optimal matching. And because we have this monitoring component in place, we're able to watch for sudden, drastic IPC changes, which would indicate a phase change inside a program. Once a phase change is detected we can redo the scheduling, so we don't get stuck with an obsolete schedule that's no longer optimal. Okay. So this algorithm was not developed in-house; it was actually proposed in previous work by other researchers, but in their case they were doing it on a simulator. We decided to implement it in Solaris. And, of course, we expected it to outperform HASS, because it's aware of phase changes and it sounds more flexible, so it should work better. In fact, that is not what we got. Here's a new bar added for the dynamic algorithm. You can see, first of all, that the assignments it ends up making are a little bit different, but the bottom line is that it doesn't improve performance by as much as we expected it to. So why was this occurring? There were several problems with this approach. The first problem had to do with phase change detection.
For example, the original algorithm suggested a mechanism to detect phase changes; once a phase change was detected, a new round of performance measurement was started and those IPC ratios were generated, and as you remember we had to generate IPC ratios for both types of cores. So, for example, we would detect a phase change, measure the performance at this point, then immediately switch to the other core, and on that other core we would measure performance at that point. As you can see, by the time the context switch is made, the program has continued through its phase change, and as a result the two measurements are not equivalent. This means we should not only detect the beginnings of phase changes, we should also detect their ends, which is a tricky problem to solve in general. As a result, the ratios we got were mostly bogus. And this problem is expected to get even worse as you get more core types, because in that case you'd have to take four points or whatever, so it would be even more difficult to synchronize them all. >>: [Indiscernible] So instead of reacting to a drastic IPC change, measure over longer periods of time? >> Daniel Shelepov: Yeah, basically you're talking about a history buffer that would detect a long-term change. >>: So a longer time [inaudible]. >> Daniel Shelepov: That could work, but it has the downside of forcing you to measure for a longer time. And not only does that make it more difficult to actually take the measurement, it also means the measurement is going to be more invasive to the overall workload, because while you're measuring this IPC, you can also be in conflict with other threads that were supposed to be running on that core legitimately. And this thrashing, as I will show you on the next slide, is very detrimental to performance, because by having threads move around all the time, you are prone to wasted CPU cycles when some cores are left idle, as well as artifacts such as destroyed cache working sets and whatnot. So overall these phase-change-driven migrations are in themselves very disruptive, and making the measurements longer would only increase that problem. And, of course, as you increase the number of cores and threads, the problem gets worse. So the outcome of this comparison was that we found that monitoring not only caused a performance degradation relative to the best static scenario, but in some cases was even worse than the default scheduler, which I didn't show on the diagram. It's really difficult to make the system work, because you have to keep a very fragile balance of flexibility versus stability, and in our variant we weren't able to achieve that balance. We feel that this occurred because the original algorithm that was proposed was only evaluated in simulation. So it's really important to validate hypotheses not only in simulators but in real systems. In retrospect, we see that HASS, by being blind to these phase changes, avoided all that thrashing around and instead was relatively stable. Okay. So, in conclusion, I presented HASS, which is a new scheduling algorithm based on architectural signatures. It trades some accuracy for simplicity, robustness, and scalability.
It's not aware of phases, but we saw that algorithms that are aware of phases are not necessarily better when they're built on a dynamic monitoring scheme. Dynamic monitoring was found to be pretty difficult to implement in this domain; we were not able to implement it so that it works correctly. HASS, meanwhile, gives us a robust implementation that captures the performance gains available from optimal static assignments. In future work, we would like to extend our signatures to types of heterogeneity other than clock speed. For example, cache size should already be supported by our scheme, because we already estimate performance based on cache size; we just weren't able to find test machines with CPUs of different cache sizes in the same system. It's easy to extend it there. But we would also like to provide support for pipeline heterogeneity, so signatures that would indicate sensitivity to parameters such as issue width, pipeline depth, or out-of-order execution. We would also like to add shared caches into the problem, which is an orthogonal problem and creates the additional challenge of combining the two approaches. And support for multi-threaded applications. So here's a reference to the upcoming paper, which will be published in ACM Operating Systems Review this April, and contact information for me and Alexandra. Thank you. [Applause] >>: What's in your signature right now? >> Daniel Shelepov: Like I said, it's a matrix of cache miss rates for common cache configurations -- if you go back to that slide. Yeah. So using this single profile, we can estimate cache misses for a variety of cache configurations. So by doing that for a fairly limited set of common cache sizes -- you only have cache sizes in powers of two, maybe between, like, 512K, one meg, two meg, whatever -- we can build a matrix that has in each field the cache miss rate for that particular cache size. Then at scheduling time we pretty much look up the correct cache configuration and extract the miss rate from there. From that miss rate we can estimate how much time is spent stalled on memory requests, how sensitive the thread is to a change in clock speed, and so on. >>: In a sense your signatures are about cache sizes, but what you varied on the machine was clock speed, right? >> Daniel Shelepov: Okay. Well, maybe I can show you how it works. We know that if this is the time that a thread spends on a CPU, we can estimate, using the cache miss rate, how much time will be spent, let's say, servicing memory requests, by taking the cache miss rate and multiplying it by the memory access time. So you can estimate this portion, and you know this portion will be constant with respect to the number of instructions executed in this portion. Because we know, for example, that if the cache miss rate is 10 percent, one cache miss per 10 instructions, then we'll have 10 instructions here no matter what. So by taking the clock speed as a parameter, we can estimate how much this part will vary in comparison to this portion, and that takes care of the difference in clock speed in this model. Other questions? >>: You talked about how sometimes the signature can be incorrect. Is there any mechanism for correcting it or modifying it? Because you could have a situation where you know this workload is going to be very different from that workload. >> Daniel Shelepov: Well, that's a great question. So far there isn't a mechanism.
We just have to stick with what we have. But you could build some sort of feedback mechanism that would validate your initial guess and correct it if necessary. And you could do it to different degrees: you could do periodic adjustment, but that would move you in the direction of dynamic profiling, more or less. So, yes, you can build certain parts where you have feedback adjusting those initial guesses, but it's important not to overdo it. It can be done, and I think if implemented correctly it would increase the efficiency of the algorithm. >>: One of the problems with doing this profiling is that you cannot really predict the target architecture up front. You never know where the software is actually going to run. Could you do the signature generation at install time, for instance, like the first time you run the application on a given platform? Or is it too complicated to do this instrumentation and generation? >> Daniel Shelepov: No, I think from a technical standpoint it's possible to do this profiling at installation. But the point here is that the data we collect in this particular case depends only on the instruction set; it doesn't depend on the actual hardware. This memory reuse distance profile is going to be the same regardless of the microarchitecture, because it does not use any microarchitecture-specific data. Now, of course, when you enter less deterministic scenarios such as multi-threaded programs it might change, but the overall idea is that you should be able to run some sort of analysis that is completely independent of the microarchitecture. That said, doing it at installation time would definitely increase the accuracy. >>: So if you start going to a new architecture, new latencies, would that affect how you make your -- >> Daniel Shelepov: Yes, that's a good question. Memory latency is actually, I guess, discovered by the OS at runtime, so you could just plug it into the calculation, because memory latency is used in this stuff here. So if you get access to that value right there, you're good to go. Really it's not that simple, because you could have NUMA architectures, and that would create a whole next level of complexity. But for non-NUMA you can get it right there. For NUMA, we didn't do a deep analysis. One of our machines was in fact NUMA, the 16-core AMD one, and we found that it was pretty much a matter of configuring the initial core assignment, and by that I mean figuring out which cores to set slower and which cores to set faster, because we found the scheduler was quite smart about keeping local memory close to the program. However, the difference was in things like shared resources and whatnot. So by manipulating the NUMA configuration combined with the DFS settings, you could create a scheme where memory access latency can be approximated by a constant value even for NUMA. For more complex scenarios, we don't know. But, like I said, to come back to your question, the memory access latency can be looked up at that stage. Does that answer your question, more or less? >>: Yes. >> Daniel Shelepov: All right. Great. Anything else? >>: I'd like to thank the speaker. [Applause]