>> Kathryn McKinley: I'm Kathryn McKinley. It's my pleasure to introduce Xi
Yang, who has been working for two years, while he was a Master's student at
ANU, with Professor Steve Blackburn, my colleague who I've collaborated with
for a long time. He just switched last month to being a Ph.D. student, but before
he was even a Ph.D. student he produced two pieces of really interesting work.
And we get to hear about one of them today.
> Xi Yang: And [inaudible] nature, and it's interesting. My mom sometimes asks
me a question: you said you are doing some research; what are you doing? I
said, garbage collection. And my mom got quite unhappy: garbage collection?
And I said, in the computer sense.
So I showed these slides to my mom. This is nature. It does something that has
the same word as me, but it's also different. Yeah. So nature talks about a lot of
[inaudible], and today we're going to talk about why nothing matters. [inaudible].
So I'm Xi Yang. This is joint work with Steve, Daniel, Jennifer and Kathryn.
So, zero initialization. A common problem of old systems, like embedded
operating systems and native languages, is that, for example, whether a global
value is initialized or not depends on the system. There's no specification about
that.
So there is [inaudible], a small operating system which is used in satellites and
basically embedded systems. One day somebody asked whether the system
zeroes the BSS section or not. And the maintainer's answer was kind of like: it
depends on the version of the BSP, and there are no rules about that; you should
take care of it yourself, otherwise if your satellite crashes it doesn't matter to me.
A satellite crash is not a good thing. New languages don't have this problem,
because Java has a specification that says a variable must have a value before it
is used. And it's not only Java; modern languages all have such specifications,
and they try to avoid these problems.
So application programmers don't need to think about that. They naturally know
that all of the values are initialized. For local variables, the compiler can check
that a local variable is initialized -- assigned first, before you start to use it. But
for the global -- for the non-local variables, this work is typically performed by the
garbage collector, which zero-initializes all of them.
So, for example, you have an application. The application starts, [inaudible], and
your garbage collector comes along, and before you deliver this array to the
application you have to zero the array for the application programmer, so that it
gets clean data. This is what we call zero initialization. Simple. So, does this
matter?
In the common sense you would think this should not be hard, this should not be
very high overhead, and it's hard for people to notice it. Here's a story: I gave this
talk at OOPSLA last week, and after the talk a guy came to me, a guy who is
doing hash algorithms for commercial systems. And he said that he had a
problem similar to what this paper mentions. He had designed a fancy hash
algorithm.
And he started to optimize this algorithm. But he found out that it didn't matter
which kind of algorithm he chose; the system performed really badly. And
because he's not a computer architecture guy, he asked another friend, who
plugged in the performance counters and found that half of the CPU cycles were
spent on zeroing the packets.
And this kind of problem is just not easily noticed by a programmer. So here we
measure the percentage of CPU cycles spent on zeroing across Java
benchmarks. You can see that one benchmark is weird: it spends about 50
percent of its CPU cycles.
>>: Percentage or a fraction?
> Xi Yang: It's 50 percent.
>>: So it's a fraction?
> Xi Yang: Yes, it's a fraction. Half of the CPU cycles do nothing.
>>: Is this in Jikes or HotSpot?
> Xi Yang: Jikes. And we also validated it on HotSpot; it's true there as well. Of
course, it's the nature of the program, not of the JVM.
>>: Is this -- when I request memory, does it get zeroed in the same thread, or is
it some other thread that's actually doing the zeroing? Can it be hidden by --
> Xi Yang: I will talk about it.
>>: When you say cycles, on a multicore system --
> Xi Yang: We measure the total; good question. So this does not represent the
percentage of execution time. This is the percentage of the total work you do: we
measure how many CPU cycles. For example, with two CPUs, if one of them is
zeroing the whole time, then in total 50 percent of your CPU cycles go to zeroing.
And this is interesting. We dug into this problem, and this problem comes from a
bug.
Because lusearch is a benchmark which uses Lucene to search something.
Lucene is quite popular in the search area. And that bug resided in Lucene for
one year. The reason for the bug is kind of like this: when you try to parse a
string into tokens, they have a quite complex system to handle it.
And one guy came to the system saying, I'm trying to optimize parsing strings
into tokens, and he refactored the whole thing. But he forgot that deep inside the
system there is a class that allocates a large array every time you create one of
these objects. He forgot that, because it's really deep, and he didn't understand
the whole system when he started refactoring.
So the bug came in. And they had a lot of problems, and they tried to find it.
They knew about the bug, but they didn't find it until one year later.
And we fixed that bug, and lusearch came down to 8 percent. So it was still high.
On average you get 4 percent of CPU cycles, and some benchmarks get 12
percent.
>>: If you spend that much money, why put in something as boring as zero? You
could put in a number that raises an exception as soon as it is seen in a register,
so it would do some useful work, or the language would stop you.
> Xi Yang: Sorry?
>>: Yes.
>>: Ages ago we put 5555 in memory, and anytime that cropped up in a 16-bit
register we stopped the simulation. Found lots of interesting bugs.
> Xi Yang: Yeah, true.
>>: Not zero.
>>: We did that for null pointer exceptions. We steal a bit so that you can trace
back the source of where that null pointer came from. So that's very useful --
there was a whole paper just about that idea.
>>: Quiet man.
>>: Right.
>>: Propagate, tell you where they come from.
>>: That's right. And in Java, because you're four-byte aligned, you can steal a
bit, so you know that it's not some other value.
> Xi Yang: Yeah. And how does it work? Currently there are two approaches. In
the first slides we have seen that the key point is that your garbage collector has
to deliver clean data to the application, and you can do the zeroing anywhere
before you deliver the data. The first approach is bulk zeroing. The idea is that
you zero quite early, but you zero a large block: if you have a whole block full of
garbage, you zero the whole block at one time and then create objects inside this
block.
And this -- this [inaudible] happens within one thread. So nursery allocation has
two steps: one allocates the big block, and the other allocates the small objects
inside the block. This zeroing happens when you allocate the block. The other
approach is hot-path zeroing. The idea is that you delay the zeroing as late as
possible.
You have a block of garbage and you don't zero the whole block at one time.
When you create an object, you just zero the object's size -- you just zero what
you need. Then you zero the next one, and the next one. These are the two
common designs in production systems. And it looks like people prefer hot-path
zeroing: a lot of production JVMs use hot-path zeroing. One day I talked with a
guy working at Azul about my idea, and he thought my idea was really bad.
People assume that hot-path zeroing is the best approach to handle zeroing. For
bulk zeroing, because you zero the whole block at one time, you reference the
whole block sequentially, so you definitely get good spatial locality.
But the problem is that you zero a whole block, so there's a lot of distance
between the zeroing and the first time you reference the objects you create in it.
And for bulk zeroing, because the block is large enough, you are allowed to use
a function such as memset to zero it, so the instruction footprint is pretty low. For
hot-path zeroing, because the nursery space is sequential and a Java program is
likely to allocate a large number of small objects quite fast -- the language
encourages you to do that -- it still has good spatial locality. And it has a good
reuse distance, because you just zero what you need and immediately start to
consume the object you zeroed.
But, you know, the allocation code is normally inlined into the program, and the
zeroing instructions sit inside the allocation sequence, so the zeroing instructions
get inlined everywhere and you get a larger instruction footprint. Another thing is
that, for example, on the left side you zero a 32 KB block and you don't need
many control instructions inside that loop -- you can hand-write assembly to
control it. But the code generated by the compiler at run time zeroes quite small
blocks, and for each one you get some control-instruction overhead.
Okay. So what happened? This is one of the guys who got the Nobel Prize at
our university, and he got a [inaudible] telescope [inaudible]. We computer guys
don't have [inaudible]. We have microbenchmarks.
We use microbenchmarks to magnify the effect of zeroing. This is a stupid
benchmark: it basically allocates as many objects as it can and then consumes
those objects. But it's perfect here, because zeroing is really important in it, and
in some ways it reflects real Java workloads, where you allocate a lot of small
objects sequentially, use them, and throw them away.
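A minimal, self-contained C sketch of such a benchmark might look like the
following; the sizes and the structure are my own illustrative assumptions, not the
benchmark actually used in this work. It repeatedly zero-initializes a 32 MB
nursery and then walks it, "allocating" and touching one small object at a time, so
the cost of zeroing dominates.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    #define NURSERY  (32u * 1024 * 1024)   /* 32 MB nursery, as in the talk         */
    #define OBJ_SIZE 64                    /* small objects, typical for Java code  */
    #define ROUNDS   64                    /* re-fill the nursery this many times   */

    int main(void) {
        char *nursery = malloc(NURSERY);
        volatile long sink = 0;            /* keeps the "consume" loads alive       */

        clock_t t0 = clock();
        for (int r = 0; r < ROUNDS; r++) {
            memset(nursery, 0, NURSERY);                /* zero-initialize ("GC")   */
            for (size_t off = 0; off + OBJ_SIZE <= NURSERY; off += OBJ_SIZE) {
                char *obj = nursery + off;              /* "allocate" an object     */
                obj[0] = 1;                             /* initialize one field     */
                sink += obj[1];                         /* consume it once          */
            }
        }
        clock_t t1 = clock();

        printf("%.2fs (sink=%ld)\n", (double)(t1 - t0) / CLOCKS_PER_SEC, sink);
        free(nursery);
        return 0;
    }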
We analyzed this microbenchmark on top of an Intel Core 2. Some guys asked
why we chose such an old machine. The idea is that it has a front-side bus,
which gives you a chance to break down all the transactions and understand the
problem deeply. We used a generational configuration, a 32 megabyte nursery,
and a 32 KB allocation block.
So bulk zeroing allocates a 32 KB block, zeroes the block at one time, and then
starts to allocate objects inside the block. Hot-path zeroing gets a 32 KB block
and does not touch it; it allocates the first object and zeroes it, allocates another
one and zeroes it, until you need the next 32 KB block. This is for nursery
allocation.
And it's interesting: in the HotSpot JVM they do have what they call the
garbage-first collector, and it has concurrent zeroing. And I asked some guys
probably related to that, I said, could we concurrently zero the nursery? He said,
that's a bad idea, don't do that.
>>: I have a question. In hot-path zeroing, do you inline the zeroing into the
allocation code?
> Xi Yang: Into the allocation code. The zeroing instructions are basically in the
hot-path allocation code, and the hot-path allocation code is inlined into the
application code that does the allocation.
>>: So --
>>: Based on the JIT passes.
>>: Sure.
>>: But in that situation, is there the possibility that your standard [inaudible]
optimization can eliminate the zeroing, because you know you're guaranteed to
write before you read?
> Xi Yang: Guaranteed to write before you read.
>>: You're writing to J, to fresh sub-J, before you actually read from that
particular location. I was just curious whether the compiler is smart enough, the
JIT is smart enough, to remove the zeroing code that would happen before this.
> Xi Yang: Yeah, we made sure of that. Another optimization here would be to
just do a dead-code analysis and delete the whole loop; you can also do that. But
we tried to make the variables static, and we checked the generated code, in
both Jikes RVM and the HotSpot JVM, to make sure they do what we want. For
Jikes RVM we use replay compilation to control the JIT, so what we checked is
the code they actually generated.
>>: But HotSpot had that optimization in there, and it slows down the program.
> Xi Yang: HotSpot has an optimization, but the idea is not this one. The idea is
that if you know -- for example, when you allocate an array and then initialize it in
the program -- it's initialized [inaudible], you have an array which is equal to
something -- HotSpot tries to analyze the program, pick out the things you write
there, and avoid zeroing them, which does not work.
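The kind of case that optimization targets looks roughly like this hypothetical C
analogue (not HotSpot code): if every word of the new object is provably written
before it is read, the up-front zeroing is redundant and could in principle be
skipped.

    #include <stdlib.h>

    typedef struct { int x, y, z; } Point;

    /* The allocation path zeroes the whole object (the Java guarantee)...   */
    static Point *new_point(int x, int y, int z) {
        Point *p = calloc(1, sizeof *p);  /* zero-initialization                    */
        p->x = x;                         /* ...but every field is immediately      */
        p->y = y;                         /* overwritten, so a compiler that can    */
        p->z = z;                         /* prove this could drop the zeroing      */
        return p;
    }

    int main(void) {
        Point *p = new_point(1, 2, 3);
        int ok = (p->x + p->y + p->z) == 6;  /* nothing is read before it is written */
        free(p);
        return ok ? 0 : 1;
    }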
And this is quite interesting. Sometimes you think that it works; sometimes you
think it avoids a number of write instructions. But on modern architectures the
number of write instructions is not that important. The most important thing is
data locality. So all of these optimizations, I think, should be based on
quantitative analysis first, not on the optimization first.
Yeah. So which one is better? At first we looked at the number of retired
instructions, and we found that hot-path retires about 20 percent more
instructions, which is what we expected, because it inlines so much more code.
But here's the interesting part. The total bar represents the execution time, and
the red part represents the percentage of CPU cycles spent on zeroing for bulk
zeroing. But for hot-path, because it's inlined, we cannot easily measure the CPU
cycles spent on zeroing, so we made it light blue. That's the idea.
So we found that hot-path retires 20 percent more instructions and runs 15
percent faster. So optimizations that just reduce the number of instructions are
not that important on powerful machines.
We tried to understand this problem: why? The first suspect is definitely
something about the memory subsystem, so we measured the number of bus
transactions -- all the data here is normalized to bulk. Both of them generate the
same number of bus transactions, so that does not answer our question. The
write transactions are the same, and the fetch transactions are still the same, so
that still does not answer our question. And here it is. The most important part:
for bulk zeroing, a lot of the fetch transactions are due to your program missing
the cache and starting a fetch.
But for hot-path, most of the fetch transactions are actually generated by the
prefetcher. The hardware automatically has a prefetcher that watches your
memory references, and the patterns it detects are really simple: if your
references are sequential, it fetches the data ahead of the demand request. The
other condition is that there has to be some free memory bandwidth, otherwise
the prefetcher stops working. This tells us that hot-path really takes advantage of
the prefetcher.
Why does it take advantage? This is quite simple. With bulk, you zero the whole
block, quite sequentially and fast, and all the store instructions are pushed to the
memory subsystem, so your prefetcher cannot catch up. There is also no free
resource. The prefetcher just does not work in that situation.
But for hot-path, you allocate the first object and start to consume it, then
allocate another one and start consuming it. The key point is that the nursery
space is sequential, and because you allocate sequentially on a sequential
address range, the block actually shapes your application's behavior and makes
your accesses sequential, so the prefetcher can catch up. You create an object
and consume it a little bit, which gives the prefetcher time to prefetch the next
one. That's why, for example, the next object is already in the cache. Fast.
So what happens when we disable the prefetcher? That's the question. We
have said that hot-path takes advantage of the prefetcher but bulk does not, so
what happens to them then? We're lucky: on the Core 2 there are two registers
with which you can disable the two kinds of prefetchers. On the Core 2 there are
two policies: one is that if you miss, it fetches the next line beside the miss
address; the other detects sequential accesses by your application. So we
disabled both of them. All the data is normalized to bulk. We can see that when
we disable the prefetcher, bulk performance is not affected anymore, because it
does not need help from the prefetcher; it already pushes its stores to the
memory subsystem. But if we look at hot-path, it goes from a 15 percent speedup
to being 10 percent slower. So that answers the question.
So the most important point is that on current architectures the prefetcher helps
references that are sequential but not too fast. There's a tension here, and it
makes things complicated: hot-path zeroing, and the optimization you mentioned,
make the whole system quite complicated, and it's actually not faster sometimes.
So how can we make it simple and faster? That's the idea.
And the answer is pretty easy: we just replace one instruction, the store
instruction, with a nontemporal instruction. What is a nontemporal instruction?
Normal store instructions go through the write-back cache. When the guys who
design the CPU, there are really two principles they learn from applications: one
is temporal locality and the other is spatial locality. So with a write-back cache, if
you have really good temporal locality, you keep consuming the data in the cache
and rarely write it back.
But sometimes you have to write it back, because the cache is small. That's the
normal store instruction.
And here's the nontemporal instruction. Nontemporal instructions are designed
basically for devices like the GPU. For example, when you want to generate an
image, you have to write the image to a memory buffer, and the order in which
you write the pixels is not important, because the [inaudible] cannot distinguish
the order you write the pixels in. Also, the device is on the other side of the bus,
so you want to dump all the data to memory first. That's why you have
nontemporal instructions: you write straight to memory without reading first.
So the idea is that you can get double the throughput: you take advantage of all
the memory bandwidth and you bypass the cache. But there are two drawbacks
to nontemporal instructions. One is that nontemporal stores are only weakly
ordered with respect to normal stores, so you cannot use them fine-grained; you
must zero a big block. The other is that because you bypass the cache and write
directly to memory, if you want to consume the data immediately you have to
fetch it back into the cache and pay a penalty. So that's the nontemporal
instruction: if you do not need to pay that penalty, it is perfect for you. What we
learned from the previous result is that hot-path actually misses in the cache but
runs faster, because the prefetcher takes advantage of it -- the prefetcher
identifies your sequential pattern and prefetches the data into the cache first. So
it's okay: with the prefetcher, we can safely use nontemporal instructions to zero
a block, and later, when we start to consume the objects, the prefetcher helps
us.
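A minimal sketch of such a zeroing routine using SSE2 nontemporal stores is
shown below. It is my own illustrative code, not the paper's implementation: the
function name, the requirement that the block be 16-byte aligned and a multiple
of 16 bytes, and the trailing fence are assumptions.

    #include <emmintrin.h>   /* SSE2 intrinsics: _mm_setzero_si128, _mm_stream_si128 */
    #include <stdlib.h>

    /* Zero `bytes` bytes at `p` with nontemporal (streaming) 16-byte stores.
     * Assumes p is 16-byte aligned and bytes is a multiple of 16. */
    static void memset_nt_zero(void *p, size_t bytes) {
        __m128i zero = _mm_setzero_si128();
        __m128i *dst = (__m128i *)p;
        for (size_t i = 0; i < bytes / 16; i++)
            _mm_stream_si128(dst + i, zero);  /* bypasses the cache on the way out  */
        _mm_sfence();                         /* order the weakly ordered stores
                                                 before later normal stores          */
    }

    int main(void) {
        size_t bytes = 32 * 1024;                  /* one 32 KB allocation block    */
        void *block = aligned_alloc(16, bytes);
        memset_nt_zero(block, bytes);
        free(block);
        return 0;
    }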
What's the result? There it is. The red part represents the CPU cycles spent on
zeroing. The nontemporal version is quite simple: we just replace memset with a
hand-written assembly function, called memset-NT or something. And we can
see that it gets double the throughput. So the white bars get --
>>: Are you doing SIMD writes, or are you using single -- are the nontemporal
writes using SIMD?
> Xi Yang: Basically each time you write about 16 bytes together, with the XMM
registers. So, yeah. And the interesting part here is that if you compare it with
bulk, nontemporal bulk didn't grow -- it grew only a little bit. People don't expect
that. I talked with a guy and he worried about this part; he thought the cost would
show up somewhere else. Of course, he thought that when you reference the
data you still have to fetch it, you still have to pay the penalty. But he forgot that
the prefetcher can actually help us. And this is also one advantage of Java
compared with some old C systems which use free-list malloc.
>>: So we actually tried to use the wider instructions, but because they weren't
aligned -- you have to use aligned stores.
>>: If you're doing bulk loads, can't you just do the regular unaligned part and
then deal with the aligned part? Start with the first aligned address.
>>: That's right. That's right.
> Xi Yang: But that's just another limitation of hot-path zeroing.
>>: No random parts of memory.
> Xi Yang: That's right. But sometimes, if you look at the red bar, it's pretty clear:
for the real workloads it still takes 3 or 4 percent of CPU cycles to do the zeroing.
So the idea is that because we write directly to memory, we get an advantage --
sorry, another advantage of nontemporal bulk zeroing: because you bypass the
cache, on CMP multicore systems you avoid cache pollution. You can safely do
concurrent zeroing, whereas with normal store instructions, if you concurrently
zero a big piece of memory you flush all of the data in the cache. So sometimes,
when one core is sleeping, we give it a job: zeroing blocks for the consumers. We
call this concurrent zeroing, and it is very simple. The idea is that at the end of
GC your zeroing thread starts and zeroes -- this is another advantage of Java,
sorry, of garbage collection: you can set up the nursery space as a fixed-size,
sequential block, and your concurrent zeroing thread just directly zeroes that
block. You don't need to care about anything else.
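A hedged C sketch of that concurrent zeroing with pthreads follows. The block
queue, the names, the sizes, and the spin-wait are all illustrative assumptions;
the real mechanism lives inside the JVM's garbage collector. After a collection, a
background thread walks the fixed, contiguous nursery and zeroes each 32 KB
block with nontemporal stores, while the allocator only hands out blocks that are
already clean.

    #include <emmintrin.h>
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdlib.h>

    #define BLOCK    (32u * 1024)           /* 32 KB allocation block               */
    #define NURSERY  (32u * 1024 * 1024)    /* fixed-size, contiguous nursery       */
    #define NBLOCKS  (NURSERY / BLOCK)

    static char *nursery;
    static atomic_size_t zeroed_blocks;     /* how many blocks are clean so far     */

    static void zero_block_nt(char *p) {    /* nontemporal zeroing of one block     */
        __m128i z = _mm_setzero_si128();
        for (size_t i = 0; i < BLOCK; i += 16)
            _mm_stream_si128((__m128i *)(p + i), z);
        _mm_sfence();
    }

    /* Background thread, started at the end of GC: zeroes the whole nursery. */
    static void *zeroing_thread(void *arg) {
        (void)arg;
        for (size_t b = 0; b < NBLOCKS; b++) {
            zero_block_nt(nursery + b * BLOCK);
            atomic_store(&zeroed_blocks, b + 1);  /* publish progress to allocators */
        }
        return NULL;
    }

    /* Allocator side: claim the next block, waiting until it has been zeroed. */
    static char *claim_block(size_t b) {
        while (atomic_load(&zeroed_blocks) <= b)
            ;                                 /* spin; a real GC would park or help */
        return nursery + b * BLOCK;
    }

    int main(void) {
        nursery = aligned_alloc(16, NURSERY);
        pthread_t t;
        pthread_create(&t, NULL, zeroing_thread, NULL);
        for (size_t b = 0; b < NBLOCKS; b++) {
            char *block = claim_block(b);
            block[0] = 1;                     /* allocate and consume a clean block */
        }
        pthread_join(t, NULL);
        free(nursery);
        return 0;
    }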
So, concurrent zeroing. It gets faster, and it's better than all of them. And let's
look at the zeroing itself. You can see that we offload the zeroing to another core.
It also consumes more CPU cycles for zeroing, but that's okay -- it's under
control; because you have contention in the memory system, the zeroing takes a
longer time.
And astronomers -- astronomers get telescopes. This also tries to answer my
mom's question: sometimes I have to do some science, I'm not just recycling
rubbish. So we have to look at real workloads. We don't have a telescope, but we
have real workloads.
This is across 19 Java benchmarks, which mix some from [inaudible] and some
from SPECjvm. And these are the current two designs -- sorry, this is hot-path
zeroing, normalized to bulk zeroing. We can see that sometimes hot-path zeroing
does pretty well, and sometimes worse. So overall it's just roughly the same on
our systems, and the key point is that with hot-path zeroing you make your
system complicated, you limit the optimizations you can do, and your JIT
compiler gets complicated. And it's interesting: if you look at the current HotSpot
source code, they do have bulk zeroing, but it's not implemented with memset --
it just has some plain writes and is pretty slow. And it has a comment that this
bulk zeroing is used for debugging. Which means they have hot-path zeroing,
and when there is a problem that might be about zeroing, they switch to bulk and
check whether it is a zeroing problem. And this is our nontemporal bulk and
[inaudible] result. You can see that for some benchmarks which consume a lot of
CPU cycles on zeroing, our nontemporal bulk zeroing gets faster, because you
reduce the cost of zeroing. And concurrent zeroing gets even faster, because you
offload the zeroing to another core; but the problem is that when there are no
other cores free and all of the threads are busy, eager to consume the zeroed
memory, concurrent zeroing gets worse. For optimizations this is really important:
you should not slow anybody down. And --
>>: In the area.
>>: Zero.
>>: We're trying here.
> Xi Yang: So we tried to make an adaptive zeroing, which is quite simple: we
switch between them based on some simple conditions, like checking the
number of application threads and comparing it with the number of cores we
have, and making a decision between the [inaudible] and bulk. And there's an
open question here. In a JVM you could actually do the thread scheduling; your
JVM can actually do better than the operating system, because you understand
the semantics of your application, the semantics of your threads. But with the
current system design it's not that easy to control that from the JVM, and your
JVM does not understand the whole system's workload. So that's why we chose
this quite simple decision. It works quite well, but if you are on a busy system,
with the [inaudible] running alongside other workloads, maybe it does not work as
well as this. We always choose the best among them, or at least we don't slow
down the system. That's a very important part.
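The adaptive decision he describes could be as small as the following C sketch;
the "leave a spare core" rule, the names, and the use of sysconf are my
assumptions, not the actual Jikes RVM heuristic.

    #include <unistd.h>   /* sysconf(_SC_NPROCESSORS_ONLN) */

    typedef enum { ZERO_CONCURRENT, ZERO_NONTEMPORAL_BULK } ZeroPolicy;

    /* Pick a zeroing strategy at GC time. `mutator_threads` would come from the
     * JVM's own bookkeeping of live application threads. */
    static ZeroPolicy choose_zeroing(int mutator_threads) {
        long cores = sysconf(_SC_NPROCESSORS_ONLN);
        if (mutator_threads < cores)      /* a core is free to absorb the zeroing   */
            return ZERO_CONCURRENT;
        return ZERO_NONTEMPORAL_BULK;     /* everyone is busy: don't slow them down */
    }

    int main(void) {
        return (int)choose_zeroing(2);    /* e.g., two mutator threads running      */
    }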
So we found out that zeroing matters: it takes a significant fraction of CPU cycles
with the current designs. We tried to deeply understand the current designs; we
ran a lot of experiments to find out the reasons and broke down the bus
transactions to help us understand how the hardware and software work
together. And based on our understanding we proposed two designs, very
simple, very easy to implement. To identify and analyze this problem is hard, but
after you know the reason, in Jikes RVM it probably takes about one hour to
finish the coding. And we got an implementation which gets a 3 percent speedup.
Sometimes 3 percent is a small number, but you just need one hour. And most
importantly, for HotSpot, they have thousands of lines of code to do the
optimization you mentioned,
and you could delete them. And --
>>: Not good for the developer who wrote those thousand lines of code. It's a
huge function, too, because you have to analyze the bytecode, and there are a lot
of corner cases where you have to make sure that what you avoid zeroing is safe
-- that's why we have bulk zeroing for debugging.
> Xi Yang: So I sent the paper to the mailing list, and they said they are
interested, but I don't know whether they have time to do that. That takes time to
finish.
And we have merged them into the next version of Jikes RVM, so it will have our
version. But for HotSpot, it would be good for them to finish it. That's it. And back
to nature, to show that I'm doing some science.
>>: Somehow you didn't make clear that you did most of it in HotSpot also, and
got similar results.
> Xi Yang: Yes. I had to check the HotSpot source code and see whether they do
some fancy, smart things. But it looks like they don't do the experiments first;
they just write code.
>>: So how would this change if you had an architectural instruction that let you
basically zero a cache line without ever having to fetch it?
> Xi Yang: Exactly. Some architectures like PowerPC have a special instruction
called DCBZ, and that instruction will directly zero a block without fetching the
data -- it zeroes it directly in the cache. But the problem with DCBZ -- I talked with
IBM guys, and they don't like to use that instruction. The reason is that, as we
mentioned, the prefetcher already prefetches the data into the cache for you, but
if you use DCBZ, the prefetcher cannot see your operations, because you don't
actually touch memory; you just zero in the cache. That's why they don't use
DCBZ. Because, for example, if you zero a large block of memory with DCBZ, it's
slower than fetching the data from memory into the cache. But it will avoid --
>>: Is that because of how it's implemented, or is that fundamental?
> Xi Yang: One reason is how it's implemented [inaudible]. Another one is that in
some cases people ignore the hardware prefetcher. The hardware prefetcher
sometimes can help you, and if you use DCBZ it affects the prefetcher: the
prefetcher cannot identify the patterns. And Azul, they do have a special
instruction which makes DCBZ kind of like a prefetch instruction. But importantly,
you are still executing zeroing instructions. With concurrent zeroing you offload
all of that to other cores; if you zero on the hot-path you still pay the penalty, and
when you try to do concurrent zeroing you cannot use these instructions,
because they bring the data into the cache, and if you bring it into the cache you
have to flush the cache, which is a penalty.
>>: So to make DCBZ zeroing worthwhile, you have to do it in the hot-path.
> Xi Yang: Yes, otherwise you flush the cache. You don't want, for example, my
program to be running and another thread to come along, just reference another
block, and flush all of my program's data out of the cache; when I go back to
work I have to fetch it again.
>>: Can't you tell the block [inaudible].
> Xi Yang: This -- well, that's the point.
>>: But for bulk zeroing you have to do that. The trick is in the allocation
sequence: if you're doing it in the hot-path, you have to bring it in right before you
want it, right? And it has to be cache-line aligned, so you essentially have to do
math to figure out where the last object is and whether you need to go get the
next cache line. So it basically introduces a lot of junk that didn't have to be
cache-line aware into your hot-path allocation. Hot-path allocation just says: put
zeros here, aligned or unaligned on the cache line, whatever, I don't care. But
when you use DCBZ it has to be cache-line aligned, and you have to do the math
to figure out where the zeros for the next one go.
> Xi Yang: For concurrent zeroing, for the nursery space, we actually use quite a
small nursery: four times the size of the last-level cache, which is 32 megabytes.
But if you look at HotSpot, they use about 100 megabytes of nursery space,
which means one thread is going to zero 100 megabytes of memory while
another thread is working. If you bring 100 megabytes of data into the cache, you
can imagine what happens to the cache.
>>: But I think that for an architecture that does fetch-on-write, you know, even in
Norm [inaudible] paper like 20 years ago, he shows that you shouldn't fetch the
data, that you should provide valid bits instead. Then you get around both
problems. But nobody has that.
>>: You initialize to zero so any reference to uninitialized data could be trapped
in the debugger later. Now that trap is not quite as urgent as stray code. Could
you essentially do something clever, detecting that uninitialized read early
enough to stop the deployment of the ICBM?
> Xi Yang: Theoretically you can. But the point is, how can you make it simple
and make it work? Right? So currently -- wow, it would be good. We could avoid
all of that -- you could analyze the bytecode and see what happens there.
>>: I agree we shouldn't be moving all these zeros back and forth to memory.
That's a smarter architecture.
> Xi Yang: If you were free to design an architecture -- if you can imagine an
architecture, you can get rid of all of these problems. But the point is that the
reality is not like that; this is what the majority of machines look like. And we
should make a trade-off: simple and effective -- I mean, for x86, after this change
it's simple and effective. If you make it complex just to be super fast, I think that's
worse. But the point is -- you understand, I think you understand.
>>: So the problem that you talked about is application-specific, right? It comes
from the level of the semantics that says when I get a piece of data it's
guaranteed to be zero. That is to say --
> Xi Yang: Yes, that's the language specification.
>>: The language specification. It's going to happen whether I'm on x86 or not;
it's not an architectural problem. So the problem exists on other architectures, for
instance phones, and I'm curious whether or not you think the solutions are going
to be architecture-specific.
> Xi Yang: I think this is an architecture problem too, because when people
design the architecture, they have to provide the service for applications. The
architecture by itself doesn't matter: if the user is happy, your system is good; if
the user is not happy, it does not matter whether your architecture is fancy or not.
Do you agree with that?
>>: I guess my question was: what's the solution for ARM? Is it the same
solution -- can we use nontemporal instructions for ARM and it will solve it?
> Xi Yang: ARM has interesting instructions to play with. I don't think they have
nontemporal instructions; I will check later. But I remember it does not have them
-- I checked.
>>: Do they have the bench allocate? That would work, too.
> Xi Yang: PowerPC has fetch without -- sorry, there's no write without fetch;
there's zero without fetch. And ARM has some interesting instructions for array
boundary checks and some faster exceptions. But it looks like they don't have a
nontemporal instruction. I will check that. I'm not sure; I'm not 100 percent sure.
>>: Another way to say it is: if the results are to be believed and they hold on
that architecture, then a large amount of power is being consumed by our phones
just zeroing memory.
> Xi Yang: Yes.
>>: Which is ridiculous.
>>: Do you have any results about the energy efficiency of the various --
> Xi Yang: We tried that. As we showed, we have a slide where we disable the
prefetcher and enable the prefetcher, and we built a current sensor for the x86
machine to measure the power of the CPU chip. We disabled and enabled the
prefetcher on the Core 2 Quad and measured the energy, and there's no
difference. So there are two answers to this question. One is that we don't
understand how it's implemented: you can disable the prefetcher, but that part of
the logic is probably not powered off. The other one is interesting: maybe a lot of
the power is not drawn by the CPU but by dynamic power in the memory system,
because you reference memory a lot. And on the Core 2 Quad there's a problem:
the memory controller is not inside the CPU, the memory controller is in the north
bridge chip. And our tools are very simple; we cannot measure the north bridge
power consumption. So we think there is some effect, but we didn't prove it.
>>: So in the future, hopefully this design should be better for energy. But right
now it's hard to tell, because essentially the idle energy from turning off these
things doesn't show up when we measure it very well. The chip doesn't have
fine-grained power control: power to the different features isn't selectively turned
down or up; you run at one voltage and everything runs at that voltage. So the
voltage domains don't let you do fine-grained things like turning off the prefetcher
and getting a big energy saving from it. You don't see that.
>>: And Intel is going to provide some tools in the new [inaudible] Bridge
machines: you get sensors, one measuring the whole package power and one
measuring the core power -- there are four sensors. Another one measures the
uncore part, and a special one measures the memory part of the power
consumption. But that machine is not out yet -- we can't depend on that machine
this year; it is going to be maybe next year. So we need more tools to measure
the energy effect.
>> Kathryn McKinley: Let's thank our speaker.
[applause]