>> Kathryn McKinley: I'm Kathryn McKinley. It's my pleasure to introduce Xi Yang, who has been working for two years, as a Master's student at ANU, with Professor Steve Blackburn, my colleague and long-time collaborator. He just switched last month to being a Ph.D. student, but even before that he had produced two pieces of really interesting work, and we get to hear about one of them today.

> Xi Yang: [Slide shows the journal Nature.] This is interesting. My mom sometimes asks me: you said you are doing research -- what exactly are you doing? I said garbage collection. My mom got quite unhappy: garbage collection? And I said, in the computer sense. So I showed her this slide. This is Nature; it uses the same words I do, but means something different. Nature talks about a lot of things, and today we're going to talk about why nothing matters -- the impact of zeroing. I'm Xi Yang, and this is joint work with Steve, Daniel, Jennifer and Kathryn, on zero initialization.

In older systems, like embedded operating systems and native languages, initialization is a common problem. For example, whether a global variable is initialized or not depends on your system; there is no specification about it. In one small operating system used in satellites and embedded systems, somebody asked whether the system zeroes the BSS section or not. The maintainer's answer was something like: it depends on the version of the BSP, there are no rules about it, you should take care of it yourself, otherwise your system will crash and that's not my problem. That kind of crash is not a good thing.

Newer languages don't have this problem, because Java has a specification: every variable must have a value before it is used. And it's not only Java; modern languages generally have specifications like this and try to avoid these problems, so application programmers don't need to think about it -- they simply know that all values are initialized. For local variables, the compiler can check that the variable is initialized before its first use. But for non-local data this work is typically performed by the memory manager, which zero-initializes all of the heap memory it hands out. For example, your application starts allocating, and the garbage collector, before it delivers an array to the application, has to zero that array so the application programmer gets clean data. We call this zero initialization. Simple.

So does this matter? Common sense says this should not be high overhead, and people hardly ever notice it. Here is a story. I gave this talk at OOPSLA last week, and afterwards a guy came up to me who works on a hash algorithm for a commercial system. He said he had a problem similar to the one the paper mentions: he designed a fancy hash algorithm and started to optimize it, but he found that no matter which algorithm he chose, the system performed really badly.
Because he's not a computer architecture guy, he asked a friend, who plugged in the performance counters and found that half of the CPU cycles were being spent zeroing the packets. This kind of problem is just not easily noticed by a programmer.

So we measured the percentage of CPU cycles spent on zeroing across the Java benchmarks. You can see that one benchmark is weird: it spends about 50 percent of its CPU cycles on zeroing.
>>: Percentage or a fraction?
> Xi Yang: It's 50 percent.
>>: So it's a fraction?
> Xi Yang: Yes, it's a fraction. Half of the CPU cycles do nothing.
>>: Is this in Jikes or HotSpot?
> Xi Yang: Jikes. We also validated that it's true in HotSpot. Of course, it's in the nature of the program; it's not the JVM.
>>: When I request memory, does it get zeroed in the same thread, or is some other thread actually doing the zeroing? Can it be hidden by --
> Xi Yang: I will talk about that.
>>: When you say cycles, on a multicore system --
> Xi Yang: Good question. We measure the total, so this does not represent a percentage of execution time; it is a percentage of the work you do. We measure how many CPU cycles: for example, with two CPUs, each spending half its time zeroing, in total half of your CPU cycles go to zeroing.

And this is interesting. We dug into this, and the problem comes from a bug. Lusearch is a benchmark that uses Lucene to search for something; Lucene is quite popular in the search area. That bug sat in Lucene for a year. The cause is that parsing a string into tokens goes through a quite complex system, and one guy came to the project saying: I'm trying to optimize parsing strings into tokens. He refactored the whole thing, but he forgot that deep inside the system there is a class that allocates a large array every time you create one of its objects. He forgot because it's really deep, and he didn't understand the whole system before he started refactoring. So the bug came in; they had a lot of problems, they knew there was a bug, but they didn't find it until a year later. We fixed that bug, and the zeroing came down to 8 percent, which is still high. On average the benchmarks spend about 4 percent of their CPU cycles zeroing, and some spend 12 percent.
>>: If you spend that much time, why put in something as boring as zero? You could put in a value that raises an exception as soon as it reaches a register; it could do some useful work, or the language could stop you.
> Xi Yang: Sorry?
>>: Yes.
>>: Ages ago we put 5555 in memory, and any time that cropped up in a 16-bit register we stopped the simulation. It found lots of interesting bugs.
> Xi Yang: Yeah, true.
>>: Not zero.
>>: We did that for null pointer exceptions. We steal a bit so that you can trace back the source of where that null pointer came from. That's very useful -- there was a whole paper just about that idea.
>>: [inaudible]
>>: Right. They propagate, and tell you where they came from.
>>: That's right. And in Java, because you're four-byte aligned, you can steal a bit, so you know it's not some other value.
> Xi Yang: Yeah. So how does zeroing work? Currently there are two approaches; we saw the setting in the first slides.
The key point is that your garbage collector has to deliver clean data to the application, and you can do the zeroing anywhere before you deliver the data.

The first design is bulk zeroing. The idea is that you zero quite early, but you zero a large block: you have a whole block full of garbage, you zero the whole block at one time, and then you create objects inside that block. This all happens within one thread. Nursery allocation has two steps: one allocates the big block, and the other allocates small objects inside the block. Bulk zeroing happens when you allocate the block.

The other design is hot-path zeroing. The idea is to delay the zeroing as late as possible: you have a block of garbage but you do not zero the whole block at once. When you create an object, you zero just that object's size -- you zero only what you need -- then the next one, and the next one.

These are the two common designs in all production systems, and it looks like people prefer hot-path zeroing: most production JVMs use it. One day I talked with a guy working at Azul about my idea, and he thought my idea was really bad. People assume that hot-path zeroing is the best approach to zeroing.

For bulk zeroing, because you zero the whole block at one time, you reference the whole block sequentially, so you definitely get good spatial locality. The problem is that because you zero a whole block, there is a large reuse distance between the zeroing and the first time you reference the object you create on top of it. Also, because the block is large, you can use a function such as memset to zero it, so the instruction footprint is pretty low.

For hot-path zeroing, because the nursery space is contiguous and a Java program tends to allocate a large number of small objects very quickly -- the language encourages you to do that -- it still has good spatial locality, and it has a short reuse distance, because you zero just what you need and immediately start consuming the object you zeroed. But the allocation code is normally inlined into the application, so the zeroing instructions inside the allocation get inlined everywhere, which gives a larger instruction footprint. Another issue is that on the bulk side you zero, say, a 32 KB block with very few control instructions -- you can hand-tune it in assembly -- whereas the compiler-generated, per-object zeroing covers quite small regions, so each one carries some control-instruction overhead.

Okay, so what actually happens? This is one of the guys who got the Nobel Prize at our university, and he has a telescope. We computer guys don't have instruments like that -- we have micro benchmarks. We use a micro benchmark to enlarge the effect of zeroing. It's a stupid benchmark: it basically allocates as many objects as it can and then consumes those objects. But it's perfect here, because zeroing really matters in it, and in some ways it reflects real Java workloads: you allocate a long sequence of small objects, use them, and throw them away. We analyzed this micro benchmark on an Intel Core 2. Some people asked why we chose such an old machine.
The reason is that it has a front-side bus, which gives you a chance to break down all the bus transactions and understand the problem deeply. We use the usual configuration: a 32 MB nursery and 32 KB allocation blocks. For bulk zeroing, the allocator takes a 32 KB block, zeroes the whole block at once, and then starts allocating objects inside it. For hot-path zeroing, it takes a 32 KB block and does not touch it; it allocates the first object and zeroes it, allocates another and zeroes it, and so on until the block is used up and it takes the next 32 KB block. This is for nursery allocation. Interestingly, HotSpot does have the garbage-first (G1) collector, which has concurrent zeroing, and I asked some people probably related to that whether we could concurrently zero the nursery. They said that's a bad idea, don't do that.
>>: I have a question. In hot-path zeroing, do you inline the zeroing into the allocation code?
> Xi Yang: Yes. The zeroing instructions are in the hot-path allocation code, and the hot-path allocation code is inlined into the application code that does the allocation.
>>: So --
>>: By the JIT.
>>: Sure.
>>: But in that situation, is there the possibility that your standard compiler optimizations can eliminate the zeroing, because you know you're guaranteed to write before you read?
> Xi Yang: Guaranteed to write before you read?
>>: You're writing to the fresh object, to fresh[j], before you actually read from that particular location. I was just curious whether the JIT is smart enough to remove the zeroing code that would happen before this.
> Xi Yang: Yeah, we made sure of that. A related concern is that the compiler could do an escape analysis and delete the whole thing; you could do that too. But we made the variables static, and we checked the generated code from both Jikes RVM and HotSpot to make sure they do what we want. For Jikes RVM we use replay compilation to control the JIT, so the code we check is exactly the code that gets generated.
>>: HotSpot had that optimization in there, but it slows down the program.
> Xi Yang: HotSpot has an optimization, but the idea is different: when you allocate an array and it is then initialized by the program, HotSpot tries to analyze the program, find what you write there, and avoid zeroing those parts -- which does not work well. This is quite interesting. Sometimes you think it works, because it avoids some write instructions, but on modern architectures the number of write instructions is not that important; what matters most is data locality. All of these optimizations, I think, should be based on quantitative analysis first, not on optimizing first.

So which design is better? First we looked at the number of retired instructions, and we found that hot-path zeroing retires about 20 percent more instructions, which is what we expected, because it inlines so much more code. In this figure, the left bar represents total execution time, and the right bar represents the percentage of CPU cycles spent on zeroing for bulk zeroing. For hot-path zeroing, because the zeroing is inlined, we cannot easily measure the CPU cycles spent on it, so we shade it light blue. That's the idea.
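To make the two designs concrete, here is a minimal C sketch of a bump-pointer nursery allocator with the zeroing placed in the two positions just described. It is only an illustration under the configuration above, not the Jikes RVM or HotSpot code; acquire_block and the constants are assumed names.

    /* Sketch of the two zeroing designs for a bump-pointer nursery allocator
     * handing out 32 KB blocks.  acquire_block() stands in for the GC handing
     * out raw nursery blocks; assumes size <= BLOCK_SIZE. */
    #include <stddef.h>
    #include <string.h>

    #define BLOCK_SIZE (32 * 1024)

    typedef struct {
        char *cursor;   /* next free byte in the current block */
        char *limit;    /* end of the current block */
    } Allocator;

    extern char *acquire_block(void);   /* provided by the GC (assumed) */

    /* Bulk zeroing: zero the whole block once, when the block is acquired. */
    static void *alloc_bulk(Allocator *a, size_t size) {
        if (a->cursor + size > a->limit) {
            a->cursor = acquire_block();
            a->limit  = a->cursor + BLOCK_SIZE;
            memset(a->cursor, 0, BLOCK_SIZE);   /* one big sequential pass */
        }
        void *obj = a->cursor;                  /* memory is already zeroed */
        a->cursor += size;
        return obj;
    }

    /* Hot-path zeroing: zero only each object's bytes, at allocation time.
     * This is the code the JIT inlines into the application. */
    static void *alloc_hot_path(Allocator *a, size_t size) {
        if (a->cursor + size > a->limit) {
            a->cursor = acquire_block();        /* block is left dirty */
            a->limit  = a->cursor + BLOCK_SIZE;
        }
        void *obj = a->cursor;
        a->cursor += size;
        memset(obj, 0, size);                   /* zero just what this object needs */
        return obj;
    }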
So we found that the hot path retires 20 percent more instructions and runs 15 percent faster. The optimization of removing instructions is just not that important on powerful machines. We tried to understand why. The first suspect is obviously the memory subsystem, so we measured the number of bus transactions, with all the data normalized to bulk. Both designs generate the same total number of bus transactions, so that does not answer the question. The write transactions are the same, and the fetch transactions are also the same in number -- still no answer. And here it is: what matters is where the fetches come from. For bulk zeroing, a lot of the fetch transactions are demand fetches: your program misses in the cache and has to fetch. But for the hot path, most of the fetch transactions are actually generated by the hardware prefetcher. The prefetcher watches your memory references and detects simple patterns: if the accesses are sequential, it fetches the data ahead of time. The other condition is that there must be free memory bandwidth; otherwise the prefetcher stops working. This tells us that the hot path really takes advantage of the prefetcher.

Why does it get that advantage? It's quite simple. With bulk zeroing you zero the whole block sequentially and very fast: all the stores are pushed at the memory subsystem at once, the prefetcher cannot keep up, and there are no free resources, so the prefetcher just does not work in that situation. But with the hot path, you allocate an object and start to consume it, then allocate another and consume it. The key point is that the nursery space is contiguous, and because you allocate sequentially from a sequential address range, the allocator actually shapes your application's behavior and makes its accesses sequential -- so the prefetcher can keep up. You create an object and consume it a little, which gives the prefetcher time to prefetch the next one. That's why the next object is already in the cache. Fast.

So what happens when we disable the prefetcher? We have said that the hot path takes advantage of the prefetcher but bulk does not. We're lucky: on the Core 2 there are two register bits that let you disable the two kinds of prefetching. One policy is that when you miss, it also fetches the next line beside the miss address; the other detects sequential access in your application. We disabled both of them. All the data is normalized to bulk. You can see that when we disable the prefetcher, bulk performance is not affected, because it does not need help from the prefetcher -- it has already pushed its stores to the memory subsystem. But look at the hot path: it goes from being 15 percent faster to being 10 percent slower.

So, to answer the question: on current architectures the prefetcher helps references that are sequential, but you cannot go too fast -- there is a tension there. And hot-path zeroing, together with the optimizations I mentioned, makes the whole system quite complicated, and sometimes it's not actually faster. So how can we make it simple and also faster? That's the idea.
And the answer is pretty easy: we just replace one instruction -- the store instruction -- with a non-temporal store. What is a non-temporal store? Normal stores go through the write-back cache. When people design a CPU there are really two principles they learn from applications: temporal locality and spatial locality. With a write-back cache, if you have really good temporal locality, you keep consuming the line in the cache and rarely write it back -- though sometimes you have to, because the cache is small. That's a normal store. Non-temporal stores were designed basically for devices like GPUs. For example, to generate an image you write it into a memory buffer, and the order of the writes is not important because the display cannot tell what order you wrote the pixels in; and the device sits on the other side of the bus, so you want to dump all the data to memory first. That's why non-temporal stores exist: you write to memory without reading the line first. The idea is that you can get double the throughput -- you use all of the memory bandwidth and you bypass the cache.

But there are two drawbacks. One is that non-temporal stores are only weakly ordered with respect to normal stores, so you cannot use them in a fine-grained way; you must zero a big block. The other is that because the data bypasses the cache and goes directly to memory, if you want to consume it immediately you have to fetch it back into the cache and pay a penalty. So if you do not need to pay that penalty, non-temporal stores are perfect for you. What we learned from the previous result is that the hot path actually misses in the cache but still runs faster, because the prefetcher identifies the sequential pattern and prefetches the data into the cache first. So it's okay: we can safely use non-temporal stores to zero a block, and later, when we start to consume those objects, the prefetcher helps us.

What's the result? There it is. The right bars represent the CPU cycles spent on zeroing. Non-temporal bulk zeroing is quite simple: we just replace memset with a hand-written assembly function, call it memset-NT-something. And you can see that it gets double the throughput.
>>: Are you doing SIMD writes, or single writes? There are non-temporal writes that use SIMD.
> Xi Yang: Basically each write is 16 bytes at a time, using the XMM registers.

The interesting part is that, comparing bulk with non-temporal bulk, the rest of the time didn't grow -- or grew only a little bit. People assume it would. I talked with a guy who worried about this part; he thought the cost would just show up somewhere else -- of course, when you reference the memory you still have to fetch it and pay the penalty. But he forgot that the prefetcher actually helps us. This is also one advantage of Java compared with some old C systems that use free-list-style malloc.
>>: We actually tried to use the wider instructions, but because they weren't aligned -- they have to be aligned.
>>: If you're doing bulk stores, can't you just do the unaligned part with regular stores and deal with the aligned part separately? Start at the first aligned address.
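A sketch of what such a non-temporal zeroing routine might look like with SSE2 streaming stores: plain stores for the unaligned head and tail, 16-byte cache-bypassing stores for the aligned middle. The name memset_nt_zero is illustrative, not the actual function in the paper's implementation.

    /* Zero a region using non-temporal (streaming) stores. */
    #include <emmintrin.h>   /* SSE2: _mm_stream_si128, _mm_sfence */
    #include <stddef.h>
    #include <stdint.h>

    void memset_nt_zero(void *start, size_t bytes) {
        char *p   = (char *)start;
        char *end = p + bytes;

        /* Head: plain stores up to the first 16-byte boundary. */
        while (((uintptr_t)p & 15) && p < end)
            *p++ = 0;

        /* Middle: non-temporal 16-byte stores that bypass the cache. */
        __m128i zero = _mm_setzero_si128();
        for (; p + 16 <= end; p += 16)
            _mm_stream_si128((__m128i *)p, zero);

        /* Tail: plain stores for the remainder. */
        while (p < end)
            *p++ = 0;

        /* Streaming stores are weakly ordered, so fence before the zeroed
         * memory is handed over to allocating threads. */
        _mm_sfence();
    }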
>>: That's right. That's right.
> Xi Yang: But that's just another limitation of hot-path zeroing.
>>: [inaudible] random parts of memory.
> Xi Yang: That's right. But if you look at the right bars, for the real work it still takes 3 or 4 percent of CPU cycles to do the zeroing. Another advantage of non-temporal bulk zeroing is that, because you bypass the cache, on CMP multicore systems you avoid cache pollution. So you can safely do concurrent zeroing; with normal store instructions, if you concurrently zero a big region of memory you flush all of the useful data out of the cache. So when a core is sitting idle, we give it a job: zeroing blocks for the consumers. We call this concurrent zeroing, and it's very simple. At the end of GC the zeroing thread starts and zeroes the nursery. This is another advantage of garbage collection: you can set up the nursery as a fixed-size, contiguous region, and the concurrent zeroing thread just zeroes that region directly; you don't need to worry about anything else.

So concurrent zeroing gets faster, and it's better than all of the others. If you look at the zeroing itself, we offload it to another core. It actually consumes more CPU cycles for zeroing -- because of contention in the memory system the zeroing takes longer -- but that's okay, because it's off the application's critical path.

And astronomers -- astronomers have telescopes. This also tries to answer my mom's question: sometimes I have to do some science; I'm not just recycling rubbish. We don't have a telescope, but we have real workloads. This is across 19 Java benchmarks, a mix from DaCapo and SPECjvm. Here are the two current designs: this is hot-path zeroing, normalized to bulk zeroing. You can see that sometimes hot-path zeroing does pretty well and sometimes worse; overall it's roughly the same on our system. The key point is that with hot-path zeroing you make your system complicated, you limit the optimizations you can do, and your JIT compiler gets complicated. It's also interesting that if you look at the current HotSpot source code, they do have bulk zeroing, but it doesn't use memset -- it's just a simple loop and pretty slow -- and it has a comment saying that bulk zeroing is there for debugging. That means they normally use hot-path zeroing, and when they suspect a zeroing problem they switch to bulk to check.

And these are our non-temporal bulk and concurrent zeroing results. You can see that for benchmarks that spend a lot of CPU cycles on zeroing, our non-temporal bulk zeroing gets faster, because you reduce the cost of the zeroing. Concurrent zeroing gets even faster, because you offload the zeroing to another core. But the problem is that when there are no spare cores, and all the threads are busy and eager to consume the zeroed memory, concurrent zeroing gets worse. For optimizations this is really important: you should not slow anybody down.
>>: [inaudible]
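A rough sketch of the concurrent zeroing idea just described, assuming the contiguous fixed-size nursery and the memset_nt_zero sketch above; the names and the synchronization scheme are illustrative, not the paper's implementation.

    #include <pthread.h>
    #include <sched.h>
    #include <stdatomic.h>
    #include <stddef.h>

    #define NURSERY_BYTES (32u * 1024 * 1024)   /* ~4x the last-level cache */
    #define BLOCK_SIZE    (32u * 1024)

    void memset_nt_zero(void *start, size_t bytes);  /* from the sketch above */

    static char *nursery;                   /* contiguous nursery space */
    static atomic_size_t zeroed_bytes;      /* how far the zeroer has got */
    static pthread_t zeroer;

    /* Walk the whole nursery and zero it ahead of the allocators, using
     * cache-bypassing stores so it does not pollute the mutators' caches. */
    static void *zeroing_thread(void *arg) {
        (void)arg;
        for (size_t off = 0; off < NURSERY_BYTES; off += BLOCK_SIZE) {
            memset_nt_zero(nursery + off, BLOCK_SIZE);
            atomic_store_explicit(&zeroed_bytes, off + BLOCK_SIZE,
                                  memory_order_release);
        }
        return NULL;
    }

    /* Called at the end of each nursery collection. */
    static void start_concurrent_zeroing(void) {
        atomic_store_explicit(&zeroed_bytes, 0, memory_order_release);
        pthread_create(&zeroer, NULL, zeroing_thread, NULL);
    }

    /* Allocators call this before handing out a block; normally the zeroer
     * is far enough ahead that this never actually waits. */
    static void wait_until_zeroed(size_t needed_bytes) {
        while (atomic_load_explicit(&zeroed_bytes, memory_order_acquire)
               < needed_bytes)
            sched_yield();
    }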
> Xi Yang: So we made an adaptive zeroing, which is quite simple: we switch between the designs based on simple environment checks, like comparing the number of application threads with the number of cores we have, and then decide between concurrent and bulk. There's an open question here. The JVM could actually do the thread scheduling itself, and it could do better than the operating system, because it understands the semantics of its application and its threads. But with the current system design it's not that easy to control that from the JVM, and the JVM does not understand the whole system's workload. That's why we chose this very simple decision. It works quite well, but on a busy system, running alongside other workloads, maybe it doesn't work as well as this. We either choose the best among the designs or at least do not slow the system down; that's a very important part.

So, to sum up: we found that zeroing takes a significant fraction of CPU cycles. We tried to deeply understand the current designs: we ran a lot of experiments to find the reasons and broke down the bus transactions to understand how the hardware and software work together. Based on that understanding we proposed two designs that are very simple and very easy to implement. Identifying and analyzing the problem is hard, but once you know the reason, in Jikes RVM it took about one hour to finish the coding, and we got an implementation with about a 3.2 percent average speedup. 3.2 percent may sound like a small number, but you only need one hour. And importantly, HotSpot has thousands of lines of code to do the optimization I mentioned, and you could delete them.
>>: Not good for the developer who wrote those thousand lines of code.
> Xi Yang: It's a huge function, too, because you have to analyze the bytecode, and there are a lot of corner cases you have to get right: when you avoid zeroing something, you must make sure it's safe -- that's why they have bulk zeroing for debugging. I sent the paper to their mailing list, and they say they are interested, but I don't know whether they have time to do it. That's coffee time -- you could finish it. It's going into the next version of Jikes RVM, which will have our version; for HotSpot it would be good if they finished it too. That's it. And back to Nature, to show that I'm doing some science.
>>: Somehow you didn't make clear that you did most of it in HotSpot also, and got similar results.
> Xi Yang: Yes. I had to check the HotSpot source code to see whether they do something fancy and smart, but it looks like they write the code without doing the experiments first.
>>: How would this change if you had an architectural feature that let you zero a cache line without ever having to fetch it?
> Xi Yang: Exactly. Some architectures, like PowerPC, have a special instruction called DCBZ, which directly zeroes a cache block without fetching the data -- it zeroes it in the cache. But when I talked with IBM guys, they said they don't like to use that instruction. The reason is that, as we mentioned, the prefetcher would normally prefetch the data into the cache for you, but if you use DCBZ the prefetcher cannot see your operations, because you never actually touch memory; you just zero the cache. That's why they don't use DCBZ.
For example, if you write a large block of memory, using DCBZ can end up slower than fetching the data from memory into the cache. But it avoids --
>>: Is that the implementation, or is that fundamental?
> Xi Yang: One reason is the implementation [inaudible]; another is that in some cases people ignore the hardware prefetcher. The hardware prefetcher can sometimes help you, but if you use DCBZ it affects the prefetcher: the prefetcher cannot identify your pattern anymore. Azul do have a special instruction that makes DCBZ behave kind of like a prefetch. But importantly, you are still executing the zeroing instructions yourself. With concurrent zeroing you offload all of that to other cores; if you zero on the hot path you still pay the cost. And if you try to do concurrent zeroing, you cannot use these instructions, because they bring the lines into the cache -- and if a concurrent thread brings them into the cache, it flushes the application's data out and you pay a penalty.
>>: So to make DCBZ zeroing worthwhile, you have to do it in the hot path.
> Xi Yang: Yes, otherwise you flush the cache. You don't want, for example, to be running a program while another thread comes along, references a different block, and flushes all of my program's data out of the cache; when I keep working I have to fetch it all again.
>>: Can't tell the block [inaudible].
> Xi Yang: Well -- that's the point.
>>: For bulk zeroing you have to do that. But the trick is in the allocation sequence: if you're doing it in the hot path, you have to zero right before you want the memory, right? And it has to be cache-line aligned, so you essentially have to do math to figure out where the last object is and whether you need to go get the next cache line. So it introduces a lot of junk that didn't have to be cache-line aware into your hot-path allocation. Hot-path allocation just says: put zeros here, aligned or unaligned on the cache line, I don't care. But when you use DCBZ it has to be cache-line aligned, so you have to do the math to figure out where the zeros go next.
> Xi Yang: For concurrent zeroing we use a quite small nursery, four times the size of the last-level cache, which is 32 megabytes. But if you look at HotSpot, they use about a 100-megabyte nursery, which means one thread would be zeroing 100 megabytes of memory while other threads are working. If you brought 100 megabytes of data into the cache, you can imagine what would happen to the cache.
>>: But for an architecture that does the fetch on the write -- even in Norm's [inaudible] paper about 20 years ago, he shows that you shouldn't fetch the data; you should provide valid bits instead. Then you get around both problems. But nobody has that.
>>: You initialize to zero so that any reference to uninitialized data could be trapped to a debugger later. Now that trap is not quite as urgent as stray code. Could you essentially do something clever, so that detecting the uninitialized read happens early enough to stop the deployment of the ICBM, and --
> Xi Yang: Theoretically you can. But the point is: how can you make it simple and make it work? It would be good -- you could avoid all of this; you could analyze the bytecode and see what happens there.
>>: Zeros back and forth to memory -- I agree we shouldn't be moving all these zeros back and forth to memory.
That would be the smarter architecture.
> Xi Yang: If you were free to design an architecture, you could imagine one that gets rid of all of these problems. But the reality is not like that; this is what the majority of systems have. We should make a trade-off: I mean, for x86, after this work, it's simple and effective. If you make it complex to go super fast, I think that's worse -- but the point is we don't have that. You understand, I think you understand.
>>: The problem you talked about is application-specific, right? It comes from the level of the semantics that says when I get a piece of data it's guaranteed to be zero. That is to say --
> Xi Yang: Yes, that's the language specification.
>>: The language specification. It's going to happen whether I'm on x86 or not; it's not an architectural problem. So the problem exists on other architectures too -- for instance, phones -- and I'm curious whether you think the solutions are going to be architecture-specific.
> Xi Yang: I think this is an architecture problem too, because when people design an architecture they have to provide a service for applications. If the user is happy, your system is good; if the user is not happy, it does not matter how fancy your architecture is. Do you agree with that?
>>: I guess my question was: what's the solution for ARM? Is it the same solution -- can we use non-temporal instructions on ARM and it will solve it?
> Xi Yang: ARM has interesting instructions to play with. I don't think it has non-temporal stores; I will check later, but I remember it does not.
>>: Do they have the block allocate? That would work, too.
> Xi Yang: PPC has -- sorry, there's no write without fetch; there's zero without fetch. ARM has some interesting instructions for array bounds checks and some faster exceptions, but it looks like they don't have non-temporal stores. I will check that; I'm not 100 percent sure.
>>: Another way to say it: if the results are to be believed and they hold on that architecture, then a large amount of power is being consumed by our phones just zeroing memory.
> Xi Yang: Yes.
>>: Which is ridiculous.
>>: Do you have any results about the energy efficiency of the various --
> Xi Yang: We tried that. We have a slide that shows disabling and enabling the prefetcher, and we built a current sensor to measure the power of the CPU chip. We disabled and enabled the prefetcher on the Core 2 Quad and measured the energy, and there's no difference. There are two possible answers. One is that you can disable the prefetcher, but that part of the logic probably isn't actually powered off. The other, which is interesting, is that maybe a lot of the power is not drawn by the CPU at all; it's dynamic power in the memory system, because you reference so much data. And on the Core 2 Quad there's a problem: the memory controller is not inside the CPU, it's in the north-bridge chip, and our tools are very simple -- we cannot measure the north bridge's power consumption. So we think there is some effect, but we didn't prove it.
>>: So in the future, hopefully this design should be better for energy.
But right now it's hard to tell, because essentially the idle energy -- the savings from turning things off -- doesn't show up very well when we measure it. The chip doesn't have fine-grained power control; power to the different features isn't selectively turned down or up. You run at one voltage and everything runs at that voltage. The voltage domains don't let you do fine-grained things like turning off the prefetcher and getting a big energy saving from it. You don't see that.
>>: Intel is going to provide some tools in the new Sandy Bridge machines: you can get four sensors, one measuring the whole package power, one measuring the core power, another measuring the uncore part, and a special one measuring the memory part of the power consumption. But we don't really have that machine this year; maybe next year. So we need more tools to quantify the energy effect.
>> Kathryn McKinley: Let's thank our speaker. [applause]