>> Galen Hunt: Good morning. It's my pleasure to introduce Daniel Shelepov, who did his undergraduate work at Simon Fraser with Alexandra Fedorova, who was the advisor and who is a co-author on this work.
And they've been looking at heterogeneous multi-core, a topic close at hand, something that many of us care about. Something we've been looking at in the HELIOS work, and some other people have as well.
And I guess the other thing I would mention is that they do have a publication on this that will appear in the April issue of Operating Systems Review.
>> Daniel Shelepov: Okay. Thank you, Galen. Is this on? Okay. Great. So, yeah, my topic today is HASS, a scheduler for heterogeneous multi-core systems. And this is work that I have done with several other people, including Alexandra Fedorova, my supervisor. She's also underlined there because she's the author of these slides.
Most of the people on the team are from Simon Fraser University, although we've also had one person from the University of Madrid and a person from the University of Waterloo. Okay.
So the executive summary. This talk is about an operating system scheduler that is called HASS. HASS stands for Heterogeneity-Aware Signature-Supported. The first part means we're targeting a particular type of emerging architecture, heterogeneous multi-core architectures. And the second part means that our scheduler relies on a certain piece of data collected offline, which is called a signature, and we use it to reduce the scheduling overhead.
Of course, I will be going over this in more detail later on. The objective of our scheduler
is to increase the overall system throughput as compared to a default operating system
scheduler.
So a brief introduction to heterogeneous multi-core, although I'm sure some of you might know it already. Basically it's a multi-core architecture where the cores are not identical. For example, in this case we have one big red core and three small blue ones. And besides the color and size difference, the fast core would have a faster clock and maybe a more complex execution core.
It's also physically larger, contains perhaps a larger cache, and consumes a lot more power, compared to a smaller, simpler core, which is slower. It might be single-issue, in-order, but very power efficient.
And in this talk we're also assuming a situation where all these cores share the same instruction set, which means they're all general purpose.
So why is heterogeneity important? Well, we want to support a diversity of applications. Ordinarily simple cores might offer you better power savings, but some applications, especially those that are not parallelizable, can only be sped up by running on those fast cores, because, first of all, they might be sequential and because they are able to use these high-performance resources really efficiently.
So we call such applications sensitive. And in this talk I'm going to focus on sensitivity to clock speed differences.
On the other hand, we might have insensitive programs, programs that are not really affected by changes in clock speed, for example, because they have a high memory access rate. So they spend a lot of time being stalled, waiting for memory requests, and as a result they really don't care if the CPU is fast or not.
For example, these are real figures: if we have a sensitive application and we run it on two cores, one of which is two times as fast as the other, we can expect the performance to scale by just as much, two times.
On the other hand, if we have an insensitive application, we might increase the clock speed by 100 percent and get a performance increase of only 30 percent. So there you go.
So the objective, then, is to schedule these programs appropriately: have the sensitive application get the high-speed CPU and give the slower CPU to the insensitive application. Like that. And we want to maximize the overall throughput, which in this case equates with performance.
And, I mean, everything I've said up to now has been shown in previous research. I'm not going to dwell on this. But it has been shown that heterogeneous architectures of this kind are in fact efficient and that the optimal scheduling is of this kind.
So the problem statement is then to match threads with cores, basically by deciding which threads are sensitive and which aren't and creating a good match based on that.
When I talk about heterogeneous architectures, most people think that I'm talking about architectures that are asymmetrical by design, which means that the manufacturer has decided to actually put some powerful cores and some less powerful cores on the chip at design time.
But, in fact, some kinds of asymmetry can be encountered even now. For example,
process variation, which is a phenomenon related to manufacturing where due to minor
differences in microscale features, some cores that are supposed to be identical are
faster than others in a single package unit.
At the same time we can also have DFS scenarios, dynamic frequency scaling, where the operating system is able to select the clock speed that the chip runs on. So these kinds of things will give you asymmetry in today's systems.
Now, the rest of the talk will be as follows. First of all, I'm going to talk about some of the existing approaches to heterogeneous scheduling, describe our solution in more detail, go over some performance results that we have, then do a side-by-side sort of analysis and comparison with the other approaches, and then summarize.
So many of the existing approaches work as follows: As the thread comes in, it is run on
both types of cores and that determines the comparative performance. We can get
some kind of a performance ratio of fast performance versus slow. And that number
then determines whether the application is sensitive or not.
Now, the key thing is that we have to run this application on both types of cores to get
this kind of information. After we compute these ratios, we assign them cores as
appropriate, and we are also able to detect phase changes by monitoring performance
of threads continuously. And, for example, phase changes can trigger a repeat of this
loop.
The advantages of such a scheme include being flexible in accurately reflecting the
conditions in the current environment, as well as being phase aware and input aware.
>>: How do you measure performance?
>> Daniel Shelepov: Like, for example, we look at the instructions per second or something like that. By measuring that on different types of cores, you can compare them.
>>: Execute it?
>> Daniel Shelepov: Yeah. So there are some disadvantages, though, because this
type of monitoring, when you are required to run each thread on every type of core, it
means that there is more complexity, and this complexity actually grows as you add
more cores, more threads, and especially more types of cores.
So as a result of this high complexity, such algorithms scale poorly and are not really expected to work in many-core scenarios, where you have dozens or hundreds of cores.
In contrast, our approach works on the basis of something called a signature, which is a summary of an application's architectural properties. Basically, without going into details too much at this point, it would be a description of how a program operates, how it uses resources, and using that information the operating system would be able to determine at scheduling time whether the application is sensitive or insensitive.
Ideally, this type of signature would be microarchitecture independent, meaning we should be able to apply one signature to pretty much any core we can encounter with the same instruction set.
It has to be universal. And the key part about the signature is that if it's obtained
statically, for example, at development time, it can be shipped along with the binary, and
as a result all this performance monitoring work during scheduling time will be
completely avoided.
Now, going into some more detail about the signatures, we know that they must be able
to predict sensitivity to changes in clock speed. Now, how would they do that?
For our model in this work, we have assumed a domain of CPUs that differ by cache size and clock frequency. For this scenario the signature would then estimate the performance ratio based on those two parameters only. And it will estimate performance for different cores, say one core with a cache size of X and another with a cache size of Y, and also perhaps different clock speeds.
So what we would do is estimate the performance of the program on one core and then on the other, and by comparing those two values determine a sensitivity index that we could use to compare different jobs against each other in matching.
Now, I'm going to go through this backwards. From the previous slide, we know we want
to predict performance dynamics sort of using the signature. And we are looking at a
cache size and the clock frequency.
It turns out that we can actually do this estimation if we have the cache miss rate of the thread. The cache miss rate works in this case because, by knowing how many cache misses the CPU has to serve, you can estimate how much time the program will spend stalled when it's on the CPU. And as a result you can figure out how much of an improvement it will get from having a higher clock speed.
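[Editor's note: a minimal sketch of the stall-time reasoning above, not the authors' code. The constants CPI_EXEC and MEM_LATENCY_NS and the example miss rates and frequencies are illustrative assumptions.]

#include <stdio.h>

#define CPI_EXEC       1.0    /* assumed cycles per instruction when not stalled */
#define MEM_LATENCY_NS 100.0  /* assumed memory access latency in nanoseconds */

/* Estimated time (ns) per instruction on a core at freq_ghz, given the
   thread's cache miss rate (misses per instruction). */
static double time_per_insn(double miss_rate, double freq_ghz)
{
    double exec_ns  = CPI_EXEC / freq_ghz;        /* shrinks with clock speed */
    double stall_ns = miss_rate * MEM_LATENCY_NS; /* roughly clock-independent */
    return exec_ns + stall_ns;
}

int main(void)
{
    double slow = 1.15, fast = 2.3;               /* GHz, roughly the Barcelona range */
    double sensitive = 0.001, insensitive = 0.02; /* misses per instruction, made up */

    printf("predicted speedup, sensitive thread:   %.2fx\n",
           time_per_insn(sensitive, slow) / time_per_insn(sensitive, fast));
    printf("predicted speedup, insensitive thread: %.2fx\n",
           time_per_insn(insensitive, slow) / time_per_insn(insensitive, fast));
    return 0;
}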
Now, estimating the cache miss rate is sort of a separate problem that is displayed here in this blue region, which, by the way, represents work done offline. It can be estimated from something called a memory reuse distance profile. And I'm deliberately not going to go into too much detail on this one, because it's a separate problem related to cache modeling. Suffice it to say that it's easy to run the program once, collect this memory reuse distance profile, and from that estimate the cache miss rate for any cache configuration you are interested in.
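[Editor's note: an illustrative sketch of turning a reuse-distance histogram into a miss rate, assuming a fully associative LRU cache; a simplification, not the authors' Pin-based tool.]

#include <stddef.h>

/* hist[d] = number of accesses whose reuse distance is d cache lines;
   accesses with distance >= max_dist are lumped into hist[max_dist - 1]. */
double miss_rate_for_cache(const unsigned long *hist, size_t max_dist,
                           size_t cache_lines)
{
    unsigned long total = 0, misses = 0;
    for (size_t d = 0; d < max_dist; d++) {
        total += hist[d];
        if (d >= cache_lines)   /* reuse distance exceeds capacity: a miss */
            misses += hist[d];
    }
    return total ? (double)misses / (double)total : 0.0;
}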
So if you do this step for every common cache configuration, and there is a fairly limited set of cache sizes you can expect in a real system, you can actually build a matrix of cache miss rates for common configurations, and this would be the content of the signature. And then at scheduling time the operating system will be able to look up the correct cache configuration and predict performance based on that.
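[Editor's note: a hypothetical layout for such a signature, just to make the lookup concrete; the structure and field names are not from the paper.]

#include <stddef.h>

#define N_CACHE_SIZES 4

struct hass_signature {
    size_t cache_kb[N_CACHE_SIZES];   /* e.g. 512, 1024, 2048, 4096, ascending */
    double miss_rate[N_CACHE_SIZES];  /* misses per instruction for each size */
};

/* At scheduling time: pick the miss rate for the largest tabulated cache size
   that does not exceed the core's actual cache. */
double signature_lookup(const struct hass_signature *sig, size_t core_cache_kb)
{
    double rate = sig->miss_rate[0];
    for (int i = 0; i < N_CACHE_SIZES; i++)
        if (sig->cache_kb[i] <= core_cache_kb)
            rate = sig->miss_rate[i];
    return rate;
}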
Lastly, I'm going to say that to make this process work we have used the Pin binary instrumentation framework from Intel. I don't know if you've heard about it. Probably. And, finally, I also would like to mention that this method, although accurate, relies on knowing the cache size in advance. And this cache has to be exclusive for this to work, because if the cache is shared you cannot effectively say how much cache you will actually have when you run on the shared cache.
So this method relies on exclusive cache, and I'll talk about this later on when I discuss
the results. So at this point I'm going to pause. Are there any questions so far? All
right. Great.
So how accurate is all this? Like I mentioned before, the objective here is to calculate ratios of the performance on different types of cores. So, for example, if a program runs two times as fast on one core as on another, we would like to be able to predict that it's two times as fast, not one and a half times as fast.
So these diagrams show how well our estimations match the real ratios. So on the X
axis here, we see the actual observed ratio, which means that for a given change in
clock speed from one to two times as much, how much did the performance actually
change on a given program. So each dot is a program.
So for MCF, for example, we know that when we increase the clock speed from X to 2X, the performance increased by a factor of approximately 1.35. And the two graphs represent two different systems.
So in this system we increased the clock speed by a factor of two, and here by a factor of one and a half. And each dot is a benchmark from SPEC CPU2000. The Y axis here is the same ratio as here, except it's estimated using our signature method. So in the perfect scenario, all the dots would be on this diagonal line, and that would mean that our estimations are perfectly accurate.
Our estimations are, of course, not perfectly accurate, but you can see that in general we keep this diagonal trend, especially in the area above.
>>: The performance ratio, is it per cycle or per second?
>> Daniel Shelepov: It's per cycle. No, instructions per second, you're right. And in that respect you can actually compare against an absolute baseline, because if you had cycles here and something ran two times slower, you would kind of not notice it. Actually, if you work out the math you can see that you can convert between these two, and the change between these two values, instructions per second, would also indicate how much time is being wasted on things other than executing. It's a side note. You can basically assume we're actually talking about instructions per second here in this comparison.
>>: You're saying if you're looking at a transition in frequency, where the change in frequency is disproportionate, instructions per second might not give it to you in terms of --
>> Daniel Shelepov: Actually, it will, because we're looking at, if you remember, we're looking at how much time a thread spends in a stall state when it's waiting for memory requests. And that portion is going to be approximately the same regardless of the clock speed. So the instructions per second would also reflect that dynamic, because you can increase the clock speed, but if you spend some time stalled waiting for memory requests you won't get a proportional increase. You can get the same sort of information from IPC, but it's less intuitive, I would say.
All right. So now with the signature stuff behind us, I can actually go on to explain the algorithm, the actual scheduling algorithm. We start with the concept of a partition. So here are our CPUs, different types of cores, and a partition is just a group of cores, which can be arbitrarily defined with the condition that it has to contain only cores of one type. So, for example, a red core and a blue core could not be in the same partition.
And then threads are assigned to partitions, and within partitions they're assigned to CPUs. So it's a two-stage kind of process. So what happens is, let's say a new thread arrives in the system. It would take its architectural signature, look up the type of core in a partition and then estimate its performance for that particular type of core. Using that information as well as the load factor of the partition, it can estimate its performance were it scheduled in that partition.
And by doing the same for every partition, it can generate an expected performance figure for every partition. And then by selecting the maximum, it is able to bind optimally.
So in this respect the thread sort of acts greedily. It tries to find out the best partition to
bind to and then goes there.
And then within partitions threads can be scheduled using normal operating system scheduling techniques. There's no difference in this scenario, because all the cores inside a partition are identical, so we're not going to talk about that. And then, to accommodate possible load changes, this process can be repeated periodically, let's say every, I don't know, half a second. For every half second of CPU time, each thread wakes up and goes through the assignment again.
Finally, because greedy actions alone cannot bring us to a globally optimal solution in general, there is also a mechanism built in where a thread can swap with another thread if it decides that this will increase the overall system efficiency.
So this point here allows us to achieve globally optimal solutions, at least in theory.
Whereas greedy decision making doesn't.
So, yeah, the algorithm itself is really simple. And what are the advantages? Of course, that huge chunk of complexity related to dynamically monitoring threads is completely removed, moved offline, basically. The implementation is really simple. The algorithm is not complex by itself. And as a result it scales well. Of course, it doesn't come for free. Because we assume only one signature per thread, we cannot accommodate phase changes inside that thread. The operating system views the thread as something uniform: just one signature per thread, and we cannot accommodate for that.
Now, in fact this is in theory reconcilable, because we can think of a system where a
programmer would set up certain points inside a program and then define a phase
change. And then the signatures could be generated, for example, for each phase. And
then at scheduling time the correct signature could be passed to the scheduler
appropriately.
But this has not been implemented yet. Another point is that one signature per application does not allow for changes in program input. It does allow for changes in the program, but the signature can only assume one input. And in theory this could seem like a big problem, because sometimes a program, it seems, can have huge behavioral differences based on input.
By testing, somewhat limited in scope, on SPEC CPU2000 programs, we found that input actually does not make that much of a difference for the majority of programs.
So a typical input will give you a signature that is more or less applicable in a broad variety of scenarios. However, we cannot say for certain that the same pattern holds for all benchmarks or all programs out there. But in our experience it has not been a very significant barrier.
Now, two big points are the following. Multi-threaded applications and being aware of parallelism, for one thing, because if the jobs are interdependent, you could have certain priorities which our algorithm is not aware of. Secondly, in a multi-threaded scenario, you could have data sharing, which would be an additional factor to consider in the optimality of a schedule, to reduce, for example, cache coherence traffic.
This is the subject of our current work, and there are some things that we're trying out. I guess I could talk about them in more detail later on, if there is time. Shared caching is another big problem. Like I said, our method assumes an exclusive cache for performance estimation, and that is completely compromised in shared cache scenarios. At the same time we know with a high degree of certainty that future heterogeneous systems will in fact involve shared caches, so it's a problem we must tackle.
Fortunately, it doesn't appear to be insurmountable. It's a separate problem, optimal scheduling under shared cache conditions. But our challenge is taking that solution and combining it with our algorithm, because the two notions of optimality, of course, can conflict in those scenarios.
This is also the subject of current work, extensive investigation and all that. So these are substantial limitations. And now I'm going to move on to the performance evaluation.
So the experimental platform we used consisted of two machines. One was a 16-core Barcelona machine where we could scale the frequency from 1.15 to 2.3 gigahertz. The second machine was an eight-core Xeon, also with DFS, where we could vary the frequency from 2 to 3 gigahertz.
The algorithm itself I've implemented in the Solaris kernel. So the experiment that I'm going to point out to you here is just one of a set of experiments, and you can read more about them in the upcoming paper if you want, or I can describe them in the Q&A session.
But this experiment is as follows. We have four cores. Two of them are fast, two times as fast as the slow cores, and we have four applications. Two of them are considered to be sensitive, meaning that they react strongly to changes in frequency. Two of them are insensitive.
So there's a good matching here, except the algorithm has to be able to determine and exploit it. And I'm going to show two workloads. One is highly heterogeneous, meaning that there is a big difference between these two categories, and in the other the difference is smaller.
So let me show you the results with that experiment. So we ran that set of four
benchmarks simultaneously on four cores. So there were four benchmarks running at
the same time. The system was fully utilized, and the graph should be read as follows.
The 100 percent line represents the expected completion time of that particular program
in a default scheduler. And then the bars represent the difference in expected
completion time using our scheduler or the best static assignment.
So, for example, we know that if, let's say, sixtrack had run under our scheduler, it would complete 33 percent faster than under the default scheduler on average.
MCF, on the other hand, would complete slower, by 18 percent. And then this bar here represents the geometric mean, which stands for the overall system throughput change.
So as you can see, these highly sensitive benchmarks were forced by the scheduler onto faster cores, and as a result they got large completion time decreases, up to 33 percent in this case.
These two benchmarks, on the other hand, were forced onto slower cores, but because they are insensitive, the performance hit they experienced is comparatively smaller than this gain here.
And as a result the overall completion time has decreased by 13 percent in this case. So what this means is that the overall system throughput has increased. And if you look at the blue bar here, it shows that our algorithm in this case has matched the best static assignment.
This is a mildly heterogeneous scenario, where the setup is very much similar, except in this case the two benchmarks that are considered to be insensitive are less insensitive than they were here. As a result, they experienced a larger performance hit when they were placed on the slower partition, on the slower cores.
So as a result the total win that we have is smaller. But we still match the best static assignment's geometric mean. Any questions about this graph? No?
>>: Can you explain the 100 -- the baseline better?
>> Daniel Shelepov: How did we come up with this number?
>>: Yes, what's the baseline number?
>> Daniel Shelepov: The baseline -- we consider a heterogeneity-agnostic scheduler, basically a scheduler that doesn't know about any difference and assigns programs randomly. And the 100 percent would be a statistical composite, which would represent the expected completion time you would get were you actually scheduling that way.
>>: You ran two of the processors fast and two slow.
>> Daniel Shelepov: Yes.
>>: The default scheduler would just randomly assign?
>> Daniel Shelepov: Yes.
>>: So statistically each thread would spend half the time on a fast processor and half the time on a slow processor, that's the baseline?
>> Daniel Shelepov: Well, it's [inaudible] but in fact there is a little bit of a problem there. Because the metric that you just described is one metric, but it can be too pessimistic in reality, because when you schedule threads on fast and slow cores, on average threads will retire faster from the faster cores.
So actually threads will spend on average more time -- I mean, yeah, they spend more time on a faster core. So what you should actually do is consider an ideal round robin scheduler where the time shares will be equal. But if you count the instructions retired on the fast core versus the slow core, you would have a tilt towards the faster core.
So we have to use two of those metrics, in fact, because on one hand the default is too pessimistic, but the round robin, the ideal thing, we believe is too optimistic, because in fact ensuring that the programs get equal time sharing is very difficult. So you cannot really expect the default scheduler to behave in a round robin fashion.
So we found that the ideal round robin alone is not a valid metric. So what we did was actually take two of those metrics and combine them somehow, take an average.
I can tell you right away that the ideal round robin metric is usually faster than this one by, let's say, under five percent. Like two to five percent.
So you would still get the benefits compared to that, but they would be smaller. But it's a very good observation you make there.
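[Editor's note: an illustrative calculation of the two baselines being discussed, for one program with assumed standalone completion times on the fast and slow cores; the numbers are made up.]

#include <stdio.h>

int main(void)
{
    double t_fast = 100.0, t_slow = 135.0;  /* e.g. an insensitive benchmark */

    /* Heterogeneity-agnostic random placement: equal chance of either core type. */
    double t_random = 0.5 * t_fast + 0.5 * t_slow;

    /* Ideal round robin: equal CPU time on each core type, so work retires at the
       average of the two rates (the harmonic mean of the completion times). */
    double t_rr = 2.0 / (1.0 / t_fast + 1.0 / t_slow);

    printf("random baseline: %.1f, ideal round robin: %.1f (%.1f%% faster)\n",
           t_random, t_rr, 100.0 * (t_random - t_rr) / t_random);
    return 0;
}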
So, 13 versus nine percent speedup. Let me point out to you that this is the best you can get with a static assignment on this platform. Were the heterogeneity more strongly pronounced in the cores, you'd get more benefit.
For example, if you had, I don't know, different cache sizes as well as clock speeds, perhaps you could get a larger speedup, because the contrast between the cores would be larger.
All right. Talked about that. So this was just one experiment; in others we tried different workloads and different setups. On average the speedup we gained was about the same as the one that I just showed you.
Most of the time we actually match the static assignment. Although in a few cases the
inaccuracies of our estimation method made us develop an assignment that was not
optimal, because we misestimated the performance of the benchmark. Those are
relatively few.
Now, we also tested this algorithm in relatively complex scenarios, up to 12 cores at the same time, up to 60 threads, and in all those cases the overhead of the algorithm was fairly small, under one percent.
So it scales well. However, because of that part I was talking about, about shared caches, the optimality is really compromised in those scenarios. In that case HASS doesn't do very well, because what it thinks is optimal might not be, due to conflicting cache accesses.
All right. So let's go on to compare this with the other algorithm. We have also decided to implement a dynamic monitoring algorithm, because previously those were only simulated; we didn't find an actual implementation of an algorithm that would measure performance in real time and use it in scheduling. So we decided to do that.
As the baseline we have chosen the same algorithm that I was mentioning in the first half of the presentation, where we would monitor IPC on different core types, compare them to get a performance ratio, and based on that performance ratio find the optimal matching.
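[Editor's note: a rough sketch of that IPC-ratio baseline; the structure and names are illustrative, not taken from the prior work or from this implementation.]

struct thread_sample {
    double ipc_fast;   /* instructions per cycle measured on a fast core */
    double ipc_slow;   /* instructions per cycle measured on a slow core */
};

/* Sensitivity estimate: how much faster the thread actually ran on the fast
   core.  Threads with the largest ratios get bound to the fast cores. */
double perf_ratio(const struct thread_sample *s,
                  double freq_fast_ghz, double freq_slow_ghz)
{
    double ips_fast = s->ipc_fast * freq_fast_ghz;  /* proportional to instructions per second */
    double ips_slow = s->ipc_slow * freq_slow_ghz;
    return ips_fast / ips_slow;
}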
And because we have this monitor component in place we're able to actually monitor for
IPC changes, for sudden drastic IPC changes, which would indicate a phase change
inside a program. Once a phase change is detected we could redo the scheduling.
So we don't get stuck with an obsolete schedule that's no longer optimal. Okay. So this algorithm was not developed in-house. It was actually proposed in previous work
algorithm is not, was not developed in-house. It was actually proposed in previous work
by another researcher. But in their case they were doing it on a simulator. We decided
to implement it in Solaris.
And, of course, we expected it to outperform HASS because it's aware of phase
changes and it sounds more flexible and it should work better.
In fact, we did not get that. Here's a new bar added for the dynamic algorithm. You can see, first of all, the assignments that it ends up making are a little bit different. But the bottom line is that it doesn't improve performance by as much as we expected it to.
So why was that? Why was this occurring? There were several problems with this approach. The first problem had to do with phase change detection. The original algorithm suggested a mechanism to detect phase changes, and once a phase change was detected, a new round of performance measurement was started and those IPC ratios were generated; and as you remember, we had to generate IPC ratios for two types of cores.
So, for example, we would detect a phase change, measure the performance at this point, then immediately switch to another core. And on that other core we would measure performance at that point.
And as you can see, by the time the context switch is made, the program has continued in its phase change, and as a result the two measurements are not equivalent. So this means that we should not only detect phase change beginnings, like starts, we should also detect phase change ends.
This is a tricky problem to solve in general. As a result, the ratios that we got were mostly bogus. And this problem is expected to get even worse when you have more core types, because in that case you'd have to take four points or whatever, so it would be even more difficult to synchronize them all together.
>>: [Indiscernible] are you saying that instead of looking for a drastic IPC change, you could look over longer periods of time?
>> Daniel Shelepov: Yeah, basically you're talking about like a buffer, history buffer that
would detect a long term change.
>>: So long time [inaudible].
>> Daniel Shelepov: That could work, but that has a downside of forcing you to
measure for a longer time. And not only does it make it more difficult to actually take the
measurement, but it also means that the measurement is going to be more invasive to
the overall, to the overall workload, because when you're measuring this IPC, you can
also be in conflict with other threads that were supposed to be running on that core
legitimately.
And this thrashing, as I will actually show you on the next slide, is very detrimental to performance, because by having threads move around all the time, you are prone to wasted CPU cycles when some cores are left idle, as well as artifacts such as destroying cache working sets and whatnot. So overall these migrations are in themselves very disruptive, and by making the measurements longer you would only increase that problem.
And, of course, as you increase the number of cores and threads, this problem gets worse. So the outcome of this comparison was that we found that monitoring not only caused a performance degradation relative to the best static scenario, but in some cases was even worse than the default scheduler, which I didn't show on any diagram.
And it's really difficult to make the system work, because you really have to keep this very fragile balance of flexibility versus stability. And in our variant we weren't actually able to achieve that balance. And we feel that this occurred because the original algorithm that was proposed was only based on simulations. So it's really important to validate hypotheses not only in simulators, but in real systems.
In retrospect, we see that HASS, by being blind to these phase changes, avoided all that thrashing around and instead was relatively stable.
Okay. So in conclusion, I presented HASS, which is a new scheduling algorithm based on architectural signatures. It trades some accuracy for simplicity, robustness, and scalability.
It's not aware of phases, but we see that algorithms that are aware of phases are not necessarily better when they're built on a dynamic monitoring scheme. Dynamic monitoring was found to be pretty difficult to implement in this domain; we were not able to implement it so that it works correctly.
HASS, meanwhile, gives us a robust implementation that exploits the guaranteed performance gains of optimal static assignments.
In future work, we would like to extend our signatures to types of heterogeneity other
than clock speed. For example, cache size should already be supported by our scheme
because we already estimate performance based on cache size.
But we just weren't able to find test machines where we could find CPUs with different
cache sizes at the same time. It's easy to extend it there.
But we would also like to provide support for pipeline heterogeneity, so signatures that would indicate sensitivity to parameters such as issue width, pipeline depth, or out-of-order execution.
We would also like to add shared caches into the problem, which is an orthogonal problem and creates the additional challenge of combining the two approaches together. And support for multi-threaded applications.
So here's a reference to the upcoming paper, which will be published in ACM Operating Systems Review this April. And contact information for me and Alexandra. Thank you.
[Applause]
>>: What's in your signature right now?
>> Daniel Shelepov: Like I said, it's a matrix of cache miss rates computed for common cache configurations. If you go back to that slide --
Yeah. So using this single profile, we can estimate cache misses for a variety of cache configurations. We do that for a fairly limited set of common caches; you only have cache sizes in powers of two, maybe between like 512K, one meg, two meg, whatever.
So we can build the matrix that would have in each field the cache miss rate for that particular cache size, and then at scheduling time we would pretty much look up the correct cache configuration and extract the value of the miss rate from there. From that miss rate we can estimate how much time is being spent stalled on memory requests and how sensitive the thread is to a change in clock speed and so on.
>>: In a sense your signatures are about cache sizes but what you varied on the
machine was clock speed, right?
>> Daniel Shelepov: Okay. Well, maybe I can show you how it works. So we know that if this is the time that a thread spends on a CPU, we can estimate, using the cache miss rate, how much time will be spent, let's say, servicing memory requests, by taking the cache miss rate and multiplying it by the memory access time. So you can estimate this portion. And you know this portion will be constant with respect to the number of instructions executed in this portion. Because we know, for example, if the cache miss rate is 10 percent, one cache miss per 10 instructions, we'll know that here we'll have 10 instructions no matter what.
So by taking the clock speed as a parameter, we can estimate how much this other portion will vary in comparison to this one. And that will take care of the difference in clock speed in this model.
Other questions?
>>: So you talked about how sometimes the signature can be incorrect. Is there any mechanism for correcting it or modifying it? Because you could have a situation where you know this workload is going to be very different from that workload.
>> Daniel Shelepov: Well, that's a great question. So far there isn't a mechanism. We
just have to stick with what we have. But you could build some sort of a feedback
mechanism that would validate your initial guess perhaps and correct it if necessary.
And you could, I guess, do it to a different degree. You could do a periodic adjustment
but that would move you into the direction of dynamic profiling more or less.
So, yes, you can build certain parts where you can have a feedback, adjusting those
initial guesses but it's important not to overdo it, but it can be done.
And I think if implemented correctly it will increase the efficiency of the algorithm.
>>: One of the problems of doing this profiling is that you cannot really predict the target architecture ahead of time. You never know where the software is actually going to run. Could you do the signature generation at install time, for instance, like the first time you run the application on a machine? Or is it too complicated to do this instrumentation and generation?
>> Daniel Shelepov: No, I think from a technical standpoint it's possible to do this profiling at installation time. But the point here is that the data that we collect in this particular case only depends on the instruction set. It doesn't depend on the actual hardware.
Because this memory reuse distance profile is actually going to be the same regardless of the microarchitecture, because it does not use any microarchitecture-specific data.
Now, of course, when you enter into less deterministic scenarios such as multi-threaded ones it might change, but the overall idea is you should be able to run some sort of analysis that is completely independent of the microarchitecture. But I would say that doing it at installation time would increase the accuracy, definitely.
>>: So if you start going to a new architecture, new latencies, would that affect how you make your --
>> Daniel Shelepov: Yes, that's a good question. So memory latency is actually, I guess, discovered by the OS at runtime. So you could just plug it into the calculation. Because memory latency is used in this stuff here. So if you get access to that value right there, you're good to go. Really it's not that simple, because you could have NUMA architectures, and that would create a whole next level of complexity.
But for non-NUMA you can get it right there. For NUMA we didn't do a deep analysis. One of our machines was in fact NUMA, the 16-core AMD one, and we found that it was pretty much a matter of configuring the initial sort of core assignment, and by that I mean figuring out which cores to set slower and which cores to set faster, because the scheduler, we found, was quite smart in keeping the local memory close to the program.
However, the differences were in things like shared resources and whatnot. So by manipulating the NUMA configuration combined with the DFS settings, you could create a scheme where the memory access latency could be approximated by a constant value even for NUMA. But for more complex scenarios, we don't know.
But, like I said, back to your question, the memory access latency can be looked up at this stage.
Does that answer your question, more or less?
>>: Yes.
>> Daniel Shelepov: All right. Great. Anything else?
>>: I'd like to thank the speaker.
[Applause]