>> Ben Zorn: Good morning. It is a great pleasure this morning to introduce Chenggang Wu from the Institute of Computing Technology (ICT) in the Chinese Academy of Sciences. Chenggang received his PhD in computer science from ICT in 2001, and he has been working on dynamic binary optimization, computer architecture, and virtual machines. Today he is going to be talking about a binary translation system for the Loongson processor. Please welcome him. Thank you. >> Chenggang Wu: Okay, thanks for that introduction. Can you hear me? Okay. I cannot hear it [laughter]. >> Ben Zorn: The microphone is mostly for the recording. >> Chenggang Wu: Oh, okay. Because my English is not good, if you cannot understand anything I say, you can interrupt me at any time. I am from ICT, and I have focused on binary translation for about eight years. Our task is to develop a binary translation system for the Chinese-made processor named the Loongson processor. Now I will introduce my work to you. Before that, I want to introduce my group. These are the members of my group: one faculty member, three staff members, five PhD students, and the others are Masters students. My research area is dynamic compilation, including binary translation and dynamic optimization, and recently we have started research on program reliability. We are trying to use dynamic [inaudible] to find concurrency bugs in parallel programs. We have only just started, in order to do deeper research in this area. This is the outline of my presentation. First, I want to introduce the requirements of the Chinese-made processors. Second, I will introduce a system called DigitalBridge and present some experiences and progress. The Loongson processor is developed at ICT, our institute, but not by our group; it is another group, another lab. It adopts a MIPS-like instruction set: it inherits the whole MIPS instruction set and extends it with some instructions for binary translation, because they want to support binary translation.
However, as I will explain, we do not use many of those extensions. Loongson has developed three generations of processors. The first generation is a single-core, single-issue, fairly simple processor. Then they developed the second generation, from 2A and 2B up to 2F, each with one core. Now they are developing the 3A and 3B: the 3A has four cores and the 3B has eight cores, and they may offer 16 cores in the 3C. The frequency of the processors is not very high, about 800 MHz to 1 GHz. This is a picture of the Loongson processors. Now let me introduce our system. Because the Loongson processor uses the MIPS instruction set, and few applications support the MIPS machine, we need to migrate x86 applications to the Loongson processor. This is the mechanism of binary translation: we migrate x86 applications onto the Loongson processor and use our virtual machine to support them. Let me introduce our progress. The performance of our binary translation system is 67% of the performance of binaries generated natively by GCC 4.4. Here is how we gathered this data: we compiled the SPEC 2006 benchmarks for x86 with GCC's -O3 flag, ran them through our binary translation system on Loongson, and collected performance data; then we compiled the same benchmarks natively with GCC -O3 and compared the two. Our system now runs Adobe Flash Player, Adobe Reader, Apache, and MySQL. These are real applications. Adobe Flash Player is now available on the Loongson processor through our system, so this is the main contribution we have made to Loongson. This is the performance data. Let me show you the user experience of our system. Consider the requirements for the Adobe software: they require a Pentium II at 800 MHz, and the Loongson 2F has a similar frequency, so it meets the minimum hardware requirement. Let me show you this. [video begins]. [inaudible].
>> Chenggang Wu: Because of the speed, the quality of the translation is most important, so I will show you the user experience. This, I think, is [inaudible] the Loongson [inaudible] opening ceremony. This is the result; it shows the user experience. This is for video, and for animation it can be even better. [video begins]. >>: Dancing through the snow, in a one horse open sleigh, over the fields we go, laughing all the way. >> Chenggang Wu: It is better than the video. >>: Making spirits bright, what fun it is to ride and sing a sleighing song… >> Chenggang Wu: Let me continue. On how to get good performance, let me share our experience. The most important thing we found for improving performance is not new technology; it is simply selecting, carefully, the target instructions used to simulate each guest instruction. This was the most important thing. It is engineering work, pure engineering work, but it yields good performance: with it alone, performance can reach 40% of native code, 40%. So it is very useful. My second lesson is to leverage native resources as much as possible. If the system has a native library, we leverage it; we do not translate it, because the native library has much better performance than translated code, and the same holds for native software frameworks. Take Flash Player: it consists of the player and its browser plug-in. We first tried to translate the player and the plug-in together, but the performance was not good, so we translate only the plug-in and use the native player, which gives [inaudible] performance. These two lessons are our main experience. I do not need to introduce them in more detail; they are very simple, although they involve a lot of engineering work. As for research progress, I think we have four important techniques. The first one is handling misaligned data accesses efficiently.
It was published in CGO 2009. We also improved data locality by pool allocation, published in CGO 2010; we improved data locality by on-the-fly structure splitting, published in HiPEAC this year; and we promote local variables into registers, published in CGO 2011. In the following I will introduce these four techniques. For the first one: because x86 hardware supports misaligned data accesses, misalignment does not introduce big overhead there, so the compiler does not turn on alignment enforcement and there are a lot of misaligned accesses in the binary. On other machines, misalignment introduces great overhead. What is misalignment? I do not think I need to introduce it, since you are experts. But modern RISC machines, especially Alpha and MIPS, do not have this kind of hardware support: when the processor executes a misaligned access, it triggers an exception, and the exception costs on the order of 1000 cycles for one instruction. So the cost is very high. Here is statistical data on SPEC 2000 and 2006. We found that on average there are 1.44 misaligned accesses [inaudible] in a program, but for some programs, like [inaudible] and Art and some others, the misalignment rate is very high, so it introduces much more overhead. Before we… >>: Can I ask about that? Are your programs originally compiled with GCC? Is that right? So does the compiler itself try to avoid misaligned accesses when it… >> Chenggang Wu: The compiler can deal with misalignment [inaudible], but that option is not turned on. >>: Okay. >> Chenggang Wu: With the -O3 option it is not [inaudible]. >>: I see. So you are just assuming that people won't do that optimization? >> Chenggang Wu: We assume that people do not do it. And indeed, in real applications, the real applications are [inaudible].
There exist a lot of misaligned data accesses, and there already exist some methods to handle them. One method is called the direct method; it is used in [inaudible] virtual machine. When it meets a misaligned data access, it translates it into a sequence of code. This code sequence does not trigger a misalignment exception, but it introduces much more overhead than a single instruction. QEMU translates every data access instruction into this kind of code sequence, so for data access instructions [inaudible], it introduces great overhead. Another method is called static profiling. This method is used in the FX!32 system. They instrument the application, execute it, and gather misalignment information for each instruction; then, when they statically translate the application, they translate only the instructions found to be misaligned during profiling. But we found that with different inputs, the misaligned instructions are different, so it is not a complete solution; this method can miss instructions. Okay? >>: Obviously your results from the profile mean you can't prove that a location will not have a misalignment, so do you keep the original instruction in that case? >> Chenggang Wu: Yeah, you're right. >>: So your work just has a really expensive solution at runtime? >> Chenggang Wu: Yeah, yeah. With some inputs one instruction will trigger a misaligned access, but with another input a different instruction triggers a misalignment, an instruction that the profiling found did not create misalignments. >>: The profiler thinks it won't. >> Chenggang Wu: Yeah. Then there is a method we call dynamic profiling, used in the IA-32 Execution Layer developed by Intel Shanghai.
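The code sequence the direct method emits can be sketched in C. This is a minimal sketch, not the talk's actual translation: the function name `load32_any` is ours, and the byte-wise pattern stands in for the MIPS sequence a translator would generate for one 32-bit load.

```c
/* Sketch of the "direct method": replace one (possibly misaligned)
   32-bit load with a byte-wise sequence that can never fault on a
   strict-alignment machine. Function name is ours. */
#include <stdint.h>

uint32_t load32_any(const uint8_t *p) {
    /* four single-byte loads plus shifts and ors: roughly the
       overhead the talk attributes to this method */
    return (uint32_t)p[0]
         | (uint32_t)p[1] << 8
         | (uint32_t)p[2] << 16
         | (uint32_t)p[3] << 24;     /* little-endian, as on x86 */
}
```

A misaligned pointer such as `buf + 1` is safe here because each access touches a single byte, which is why applying this to every load, as QEMU does, costs so much.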
In this method they use an interpreter to interpret the program, or, in the IA-32 Execution Layer, an instrumentation method: they instrument the binary code, and when they find during the first 20 iterations of a loop that an instruction triggers a misaligned access, they translate it into the safe code sequence. This method works very well for some applications, but we found there exist applications where the iteration at which misalignment first appears exceeds the threshold; I forget the exact number. Let me check. Yeah: for this application it takes 266 iterations before the first misaligned data access is triggered, and after that it accesses misaligned data frequently. So this method cannot work well there. So we propose another method, named exception handling. It is very simple. We use the translator to translate the data access code into a MIPS image directly, and before translation we register an exception handler. When the code triggers an unaligned-access exception, this handler modifies the code according to the information we get. We also do some optimizations. If we only used the exception handler to patch the code, it would destroy code locality, so we rearrange the code, moving the patched code to improve locality. And we combine the dynamic profiling method with our method: [inaudible] we use the interpreter to interpret the first several iterations to find some misaligned data accesses and translate those into the code we want, and then we use our exception handler to catch the misalignments we did not find during profiling. And we do another optimization: we found that in a basic block there are often several misaligned accesses, so we translate them together rather than translating one instruction at a time, to improve code locality.
And for some code we use multi-version checks: we generate code to check whether the instruction will trigger a misalignment, and according to the check we use either the original instruction or the misalignment-safe [inaudible] sequence. It is a very simple method, but it is very useful: it increases our system's performance by 20%. Compared to the other three methods, we achieve 21% better than dynamic profiling, 14% better than static profiling, and 76% better than the direct method. Okay? >>: Do you have a measure of how much it increases the size of the code, because you're adding [inaudible]? >> Chenggang Wu: I have forgotten the exact data, because this paper was published several years ago, but it was not very high. >>: Okay. >> Chenggang Wu: We can calculate it from the instructions: for one instruction with a misaligned data access, we generate seven to eleven instructions, so when you multiply that by the misalignment ratio it is very small. Our next piece of progress is improving data locality by pool allocation. Dynamic heap memory allocation is widely used in modern programs, but general-purpose heap allocators focus more on runtime overhead and memory utilization; they do not focus on data locality. Let me give an example. Suppose there are three data structures: one list, another list, and a tree. The program's code calls the general-purpose allocator, so the program may allocate one node for list one, then one node for list two, then one node for the tree, and so on, interleaved. This layout has poor locality. So we use pool allocation to aggregate heap objects into separate memory pools at the time of their allocation. As the figure shows, when the program runs, allocations for list one go in this part, allocations for list two in this part, and the tree in this part.
This improves data locality, because when the program accesses a data structure, it always accesses it in this sequence. How do we do it? Before our work there existed another approach. It uses the allocation site to segregate the data: for each allocation site they assign a pool, and data allocated at that site goes into that pool, so different sites map to different pools. But sometimes there is a problem. Modern programmers often use wrappers, like this code in twolf. They use safe_malloc. This function calls the system library function malloc at one site; if it can allocate data from the heap it returns the pointer, and if not it returns a null pointer. For the twolf benchmark, all data is allocated through this [inaudible] wrapper, so if we used the allocation site alone, we would put all of the data into one pool. That is the problem. This other application shows the same problem, and here is yet another example of it. So our solution is to use the call chain, and we found we have three choices. The first is to use the full call chain, from the allocation site up to the main function; but the call chain is very long and introduces much overhead, and with recursive calls the overhead is much worse. Another choice is a fixed-length call chain, say length two; but in some applications wrappers have several layers, not one, so it does not work well. So we use an adaptive partial call chain: we analyze the binary code to determine whether a function is a wrapper, and if it is, we extend the call chain, so it works very well. There remains the case where different allocation sites allocate the data of one data structure.
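The adaptive call-chain idea can be sketched as follows. This is our own minimal illustration, not the DigitalBridge implementation: the pool is keyed by a hash of call-site IDs, so a single wrapper like twolf's safe_malloc still yields a distinct pool per caller. All names (`pool_alloc`, `chain_hash`) and sizes are assumptions.

```c
/* Sketch: bump-pointer pools keyed by a (partial) call chain.
   All names and constants are ours. */
#include <stddef.h>
#include <stdlib.h>

#define MAX_POOLS  16
#define POOL_BYTES (1 << 16)

typedef struct {
    unsigned long key;      /* hash of the call chain */
    char *base, *next;      /* bump-pointer arena */
} pool;

static pool pools[MAX_POOLS];
static int  npools;

static unsigned long chain_hash(const unsigned long *chain, size_t len) {
    unsigned long h = 1469598103934665603UL;      /* FNV-1a */
    for (size_t i = 0; i < len; i++)
        h = (h ^ chain[i]) * 1099511628211UL;
    return h;
}

/* Allocate from the pool owned by this call chain, creating the
   pool on first use. */
void *pool_alloc(const unsigned long *chain, size_t len, size_t sz) {
    unsigned long key = chain_hash(chain, len);
    sz = (sz + 7) & ~(size_t)7;                   /* 8-byte align */
    for (int i = 0; i < npools; i++)
        if (pools[i].key == key) {
            void *p = pools[i].next;
            pools[i].next += sz;
            return p;
        }
    pool *pl = &pools[npools++];                  /* assume < MAX_POOLS */
    pl->key  = key;
    pl->base = pl->next = malloc(POOL_BYTES);
    pl->next += sz;
    return pl->base;
}
```

With a fixed chain length of one, both callers of the wrapper would hash to the same pool; including the caller's site ID in the chain separates them, which is the point of extending the chain through wrappers.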
These different allocation sites allocate nodes that are all linked together into one structure. To solve this problem, we define two kinds of affinity, named type I and type II. Type I affinity means that if two objects are of the same type and are linked together, we assume they have type I affinity; like these, they have the same type and are connected by a pointer. Type II affinity means that these nodes are not connected to each other directly, but they are pointed to, through the same field, by nodes that have type I affinity. So we assume they have affinity too. Then we use pool one to put the first kind of node together and pool two to put the second kind together. When we find that, say, this node and this node are not allocated at the same site but are affine, we merge them together. Before publishing this paper we… >>: Can you go back? So is this done statically? Or are you… >> Chenggang Wu: Dynamically. >>: Dynamically. >> Chenggang Wu: At runtime. >>: How do you know that it is a type one or a type two… >> Chenggang Wu: We do not know the type. What we know is the allocation size. If the sizes are always the same, we assume the objects have the same type. >>: No, but how do you put them in pool one and pool two? How do you decide which pool to put them in? >> Chenggang Wu: At first, when we see a node, we try to find whether any existing node has affinity with it. If it is affine to some node that already exists… >>: [inaudible] based on the linking. >> Chenggang Wu: Based on the linking. >>: So how do you monitor that? >> Chenggang Wu: How do I… >>: You do that dynamically. >> Chenggang Wu: Yeah. In the beginning there is one pool; we set up only one pool.
So when the program allocates the first node, we put it in this pool. When it allocates another node, we compare: we check whether it has affinity with an existing node. If it has affinity with an existing node, we allocate it in that node's pool. If not, we create another pool and allocate the node there. >>: If you don't have type information, how do you know — what is the affinity, what is the dynamic test that you do? >> Chenggang Wu: What is what? >>: What is the test, the affinity test? What is it? What do you check? >> Chenggang Wu: We check two attributes. One is the size. >>: Right. >> Chenggang Wu: We do not know the type, but we know the size. We can check whether they have the same size. That's the first one. Second, we check whether there is a pointer from one to the other. >>: How can you check that if you are allocating the second [inaudible] and there are no pointers into it or out of it yet, right? There is no… >> Chenggang Wu: When it is allocated, the function must return the address [inaudible] in a register. The value of the register must be assigned to some existing variable or… >>: What he is asking is how do you know whether there is a pointer between the two? How do you know there is a pointer between the two? >> Chenggang Wu: Okay, I see, I see. This wrapper returns an address, and in this code the address is returned to its caller. In the caller we check the code to find… >>: You anticipate where it's going to be stored. >> Chenggang Wu: What? >>: You anticipate where it is going to be stored by doing a program analysis? You find out which pool it's going to be stored in. >> Chenggang Wu: Yeah, yeah, yeah. >>: So that is a static analysis before… >> Chenggang Wu: We do not do static analysis. >>: Then how do you have that knowledge when you do the allocation?
>> Chenggang Wu: When we reach the allocation, we start our analysis. The analysis is [inaudible]: we try to find which instruction assigns the returned address to a pointer. >>: So basically, you look at the instructions [inaudible] that follow the call… >> Chenggang Wu: Yeah, yeah, yeah. We look at the instructions that follow the call, at runtime, on the fly. So now… >>: [inaudible] in a sense, because you're doing it at runtime but you're doing it before you run the code, right? >> Chenggang Wu: Not before; we do it after. >>: But you've already allocated it, so you can't move it from one pool to another once you've… >> Chenggang Wu: To another pool. >>: Right. So you need to know before you allocate. >> Chenggang Wu: No. We intercept the allocator; we do not allocate the data immediately. We do not allocate the data until we have found the code that assigns the address to the pointer. >>: So you delay the allocation? >> Chenggang Wu: Because we use our own allocator, we intercept this allocation. We do not allocate the data right away; we intercept it and we delay it. >>: You delay it? And that's the trick: you wait until you know where it's stored, and then you allocate. >> Chenggang Wu: Yes. >>: Ahh. >> Chenggang Wu: And we only do the analysis one time. Because this allocation site may be called a million times, we only analyze it once. >>: Oh, and that is based on the call stack… >> Chenggang Wu: Yes. If the call stack changes, we do the analysis another time; otherwise we only do it once. >>: [inaudible]. Thanks. >> Chenggang Wu: Because we wanted to publish the paper, we did our experiments on x86 machines, not on the binary translation system; but we do use it in our system. We tested on two machines: one Intel Pentium 4 and one Intel Xeon. Is my pronunciation right?
The main difference is the level-two cache: the level-two cache of this one is very small, and of this one is bigger. We chose 12 benchmarks spanning SPEC 2000 and 2006. The selection policy is: the memory allocated in pools divided by the memory allocated on the heap must exceed 1%, which selects this set of benchmarks. For some benchmarks we get great improvement. For this one, on the second platform, we get an 80% speedup; for this one, on the first machine, we get a 50% speedup. The main reasons for the speedups differ: for this one the gain comes from greatly reducing TLB misses, and for this one we reduce cache misses greatly. >>: Just one question: you use this for binary translation, but does it have to be? It seems like it could be used for any program, right? Could you apply your technique when you are not translating? Is your trick of delaying the allocation the reason you're doing it? >> Chenggang Wu: I am not sure what your… >>: It is a general technique, right? >> Chenggang Wu: Do you mean can I use this technique for other… >>: Yes. Optimizing an existing allocator, some other allocator? >> Chenggang Wu: I think you can, but for some programs it would not gain performance. If the working set is very large or very small, it cannot gain performance. For some applications the working set is so large that no matter how much we improve the locality, the cache misses and TLB misses remain great; and for an application with a small working set it cannot help either. But for some applications that trigger many cache misses and TLB misses before we optimize — wait a minute.
After we optimize, the working set can fit in the cache, and we get great performance gains. >>: [inaudible]. Can you go to the slide that has the [inaudible]? What is your baseline here? Because this is on Intel hardware, are you doing the binary [inaudible] -O2? [inaudible]. >> Chenggang Wu: I have forgotten; it was with some optimization option, but I have forgotten which. I would have to check the paper. >>: Are you writing out x86 when you do this? Do you write out x86, or — how do you actually implement this at runtime? >> Chenggang Wu: [inaudible] x86 and not a [inaudible]; otherwise we implement it in our system. >>: [inaudible]. But as you said, you could probably do this statically, once per call site, per call to malloc. >> Chenggang Wu: [inaudible] statically. Yeah, that works, but it doesn't work well, because the same allocation site can allocate different structures. >>: So have you looked at inlining? These wrapper functions are pretty small, generally. You could imagine just inlining all of the wrappers, and that would eliminate… >> Chenggang Wu: Inlining the wrapper. I remember it is not a problem; we discussed inlining before we published the paper. No, that is another problem: with the wrapper inlined, the call chain [inaudible] work well. >>: So that would save the overhead to measure the… >> Chenggang Wu: From the binary we cannot identify which code has been inlined. >>: Right. I understand. Okay. I was suggesting that if you actually had the source and you inlined the wrappers, then you wouldn't have to go up multiple frames in the call chain; it would save that effort, and then you could do a static analysis at every call site, which is what I was suggesting. And you wouldn't need to do the dynamic analysis. >> Chenggang Wu: I am sorry.
I did not catch what you… >>: Okay, we can talk afterwards. >> Chenggang Wu: We can talk, okay, yeah; my English is [laughter]. >>: No, it's okay. >> Chenggang Wu: We did discuss inlining; distinguishing by call site works very well. This is the third technique: improving data locality by on-the-fly structure splitting. We always do things on the fly [laughter]. The memory wall has long been a main barrier that limits system performance, and many techniques have been proposed to improve the data layout, such as structure splitting and field reordering. But this kind of technique has some problems. It can improve performance for many applications, but for some applications it slows them down; I will show the data in a following slide. Also, this kind of optimization is not inherently safe, so it needs careful runtime checks; otherwise we introduce bugs into the system. For these performance and safety concerns, the method has not been applied in many compilers. So we designed an on-the-fly, dynamic approach to apply the structure splitting optimization. Do I need to introduce what structure splitting is? No, I think not [laughter]. Okay. This is the framework of our system. The orange part is the program's binary code, and this side is our system. When the program runs, it may allocate data. We intercept the allocation request and take over the allocation, using our own allocator. Our allocator includes a large-array allocator and an object-pool allocator; the object pools use the mechanism I introduced just now. When we find this kind of data, a large array or a large object pool, we treat it as an optimization candidate.
The allocator then invokes the analyzer to do pointer analysis to find the possible access instructions (PAIs) to the allocated data, to try to recognize the size of the structure and its fields, and to do benefit estimation. If the result shows we can get a benefit, we allocate two spaces. One is the original space: after we allocate it, we page-protect it with non-readable and non-writable flags. At the same time we allocate another space, namely the splitting space, used to store the split data. Because the original space is page-protected, when a PAI accesses the data, it triggers an exception. Our exception handler catches it and invokes the analyzer a second time; the difference is that now it has runtime information about the variables and registers, so it can use that information to generate code. We replace the original possible access instructions and redirect them to access the split data: we patch the code, put the generated code into the code cache, and patch it into the program code. Afterward the program can run without triggering exceptions. >>: So it doesn't seem safe as it [inaudible]. >> Chenggang Wu: You are right. I will introduce the safety guarantee mechanism; safety is very important for this optimization. We also designed a rollback module: when we find a violation of the optimization's assumptions, we roll back, which solves the safety problem. And we set up a monitor to watch the profiling counters: whenever we generate code, we instrument it to record how often the region is accessed, and the monitor checks this. If we find the optimization is not effective, we roll back. Okay. Let me introduce the analyzer first. The analyzer does four jobs. The first one is to find the PAIs, the possible access instructions.
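The page-protection trap described above can be sketched in C on POSIX systems. This is our own minimal, Linux-specific illustration, not the DigitalBridge code: the original space is protected, the first access faults, and the handler records the faulting address (where the real system would invoke the analyzer and patch the instruction) before unprotecting so the access is retried.

```c
/* Sketch of trapping the first access to a protected "original space".
   Linux/POSIX only; all names are ours. */
#define _GNU_SOURCE
#include <signal.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static char *space;            /* the page-protected original space */
static long  pagesz;
static void *fault_addr;       /* address the handler observed */

static void on_fault(int sig, siginfo_t *si, void *ctx) {
    (void)sig; (void)ctx;
    fault_addr = si->si_addr;  /* real system: analyze + patch here */
    mprotect(space, pagesz, PROT_READ | PROT_WRITE);
}

int run_demo(void) {
    pagesz = sysconf(_SC_PAGESIZE);
    space = mmap(NULL, pagesz, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_fault;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, NULL);

    mprotect(space, pagesz, PROT_NONE);   /* protect original space */

    volatile int *p = (int *)space;
    *p = 42;        /* traps once; handler unprotects; access retried */
    return (fault_addr == (void *)p) ? *p : -1;
}
```

In the real system the handler would not simply unprotect the page; it would redirect the patched instruction into the splitting space so later accesses never fault at all.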
If EAX is calculated from the allocation site's return value, we mark the instruction as a PAI. An important issue is how to find this relationship between EAX and the call site. We can use two techniques. One is pointer analysis. Pointer analysis is difficult even for a static compiler, which has the source code; it is much more difficult for us, because we do not have the source code, and in position-independent code there are a lot of [inaudible] in the binary. So we choose the other: a lightweight, flow-sensitive propagation method that just propagates the address along the [inaudible] flow. If we find a def-use relationship, we recognize the instruction as a PAI. This method finds some PAIs, but others it cannot find; undiscovered PAIs will be caught by our safety mechanism. The next job is recognizing fields. It has two tasks: one is getting the structure size, and the other is getting each field's offset and size. For the structure size, we assume each PAI always accesses the same field. This assumption is not always true, so we must check it; we have a mechanism to check it. Under this premise, we collect the strides of the PAIs in all the loops and calculate the greatest common divisor of the strides, and we take the result as the structure size. It may be a multiple of the true structure size, but that does not affect the optimization. The other task is to get each field's offset and size. It is very simple: suppose EAX is the start of the allocated space; then from this instruction we know the field has offset 4 and is 4 bytes wide, and we can get the other fields' information the same way. To estimate the benefit we use a heuristic method. It is very simple.
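The stride-GCD step above can be sketched in a few lines of C. This is our own illustration of the idea; the function name `infer_struct_size` is an assumption, not the system's API.

```c
/* Sketch: infer the structure size as the GCD of the per-loop
   access strides of the PAIs, as described above. Names are ours. */
#include <stddef.h>

static long gcd(long a, long b) {
    while (b) { long t = a % b; a = b; b = t; }
    return a;
}

/* strides: the byte strides observed for the PAIs in each loop.
   Returns the inferred structure size (or a multiple of it). */
long infer_struct_size(const long *strides, size_t n) {
    long g = 0;                    /* gcd(0, x) == x */
    for (size_t i = 0; i < n; i++)
        g = gcd(g, strides[i]);
    return g;
}
```

For example, strides of 24, 36, and 60 bytes yield an inferred size of 12; as the talk notes, even if the true structure is 12 bytes and the GCD came out as 24, the optimization would still be consistent.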
We only calculate the field coverage: the number of bytes of the structure touched by the loop or trace, divided by the structure size T. We believe the smaller the field coverage, the better splitting will pay off. The next job is memory allocation: if we find no benefit for some data, we allocate it with the system's default allocator; for the rest we allocate the original space and the splitting space. In the exception handler's generated code, we use this equation to calculate the offset; it is efficient to calculate the new address in the splitting space, and for one original PAI we generate instructions like those on this slide. For safety, we check whether P, the original pointer, is in the optimized space; if not, we let the code execute the original PAI rather than the translated one. The other check is on the offset: I said we assume every PAI accesses the same field, but that is not always true, so we generate code to check it, and if it does not hold, we roll back. >>: So do you use a [inaudible] space for every type of allocation to do the [inaudible]? >> Chenggang Wu: What? >>: Is there a different optimized space for every different [inaudible]? >> Chenggang Wu: If we estimate that we can get a benefit, we use it; if not, we do not. >>: I am thinking about the case where we have two different optimizations in the cache. >> Chenggang Wu: Two different optimizations… >>: [inaudible] you can be in an optimized space but be a different kind of object [inaudible]. >> Chenggang Wu: I am sorry. What do you mean? >>: The check doesn't seem… >> Chenggang Wu: Which one, this one? >>: The first one. >> Chenggang Wu: Okay. Does it what? >>: It is not clear that it catches all possible things that can go wrong. It doesn't seem safe enough. >> Chenggang Wu: Why? >>: I, I… >> Chenggang Wu: Why [inaudible]?
>>: [inaudible] field access site if it is [inaudible]… >> Chennaggang Wu: I know what you mean. You mean that if some PAI accesses the optimized space, it will be wrong. Is that what you mean? >>: If there are two optimizations, different optimizations, that arrived at the same… >> Chennaggang Wu: Two different optimizations? >>: Well, so objects are allocated in different places, different call sites, but then they get passed to this routine, which accesses the object. >> Chennaggang Wu: You mean this code here may access this part of the data and also access that part of the data. Am I right? >>: No. No. >> Ben Zorn: We should do this off-line. >> Chennaggang Wu: What does… >>: We can talk afterwards. It seems a little complicated. I think he would need an example to [inaudible]. >> Chennaggang Wu: I am sorry. I think I must improve my English [laughter]. I study English here, two hours per day, but I really should improve my English. Okay. After the talk we can discuss it, or you could write it down, draw the problem for me, okay? Okay. For this optimization we use 21 instructions to replace one PAI; that is overhead. So we use several means to reduce the overhead. First, we use liveness analysis of the registers: if we find some registers are not used at that point, we use them, which reduces the instruction count. We amortize checks across subsequent PAIs: if we find several PAIs, we let them share the same check. And we do strength reduction and code promotion. Code promotion means we promote some code outside the loop. Then we do persistent profiling, where we set a monitor here. We use metrics, namely a ratio, to calculate the penalty cycles and the reduction of cache lines touched. For that we just estimate using the number and complexity of the instructions.
We count the instructions, and for some [inaudible] divide instructions we [inaudible]. And we check whether the instruction can be promoted outside of the loop. And we use the field coverage as the f in this calculation. When the optimization is not effective, or there exists a field violation in the C program — a union or a type cast — we will do the rollback. In any other case that our system cannot handle, we also roll back. The rollback is very simple: we restore the read and write privileges of the original space, merge the splitting space together, move the data back to the original space, and free the splitting space. For the safety problem — I am not sure if this will answer your question — for an original instruction that accesses unoptimized data, that is the original behavior of the program, so no problem. For an original instruction that accesses optimized data, because that data is page protected, the instruction will be caught by the exception handler and will be transformed into a PAI in the [inaudible] code, so no problem. For a translated instruction accessing the optimized data, if there is no field violation and the PAI always accesses the same field, no problem; but if one of these two conditions does not hold — if there exists a violation — we will roll back. For a translated instruction that accesses data outside the optimized space, we check the address and let it execute the original PAI, and that solves the problem. This is the speedup. We implemented the optimization and passed all of the benchmarks, but only five benchmarks trigger the optimization. For the other benchmarks there is no speedup and no slowdown. These programs can achieve a speedup, but this one slows down. Let me explain the three bars. The red part is the pool allocation method.
The green part is the OSS result without the monitor, and this one is OSS with the monitor. So without the monitor, it slows down on this benchmark. The reason is that this benchmark does not access the data sequentially; it accesses the data randomly, so it slows down. With the monitor, we find no performance gain, we roll back, and there is no slowdown. I think the speedup of 0.9 may just be measurement accuracy, so this one does not really slow down either. It also proves that this optimization cannot work very well if applied statically, because there is no monitor to trigger a rollback. >>: So historically I think many of the papers that try to do this have a similar result, and basically it only works for certain programs and, you know, there is cost. So given your experience, would you actually recommend implementing this optimization, now that you have done all the work and you see the results? >> Chennaggang Wu: We have already published the work. Is that what you mean? >>: I mean just based on your results here, would you recommend that other people implement this optimization, or do you consider it not a good enough benefit? >> Chennaggang Wu: I think the benefit for some applications, like art, is very high, and it is beautiful. We found that the static compiler GCC 4.4 has this kind of optimization, but it is not turned on by default. I think the reason is that for some benchmarks it would create enough overhead to slow them down; but in our system we do not make any user slow down. >>: Right. >> Chennaggang Wu: Yeah? >>: So some of the reasons you think that it may not work well is that the X86 processor is very good at prefetching data for you, right? It does a good job, so it is hiding some of those latency issues that you run into, but it seems like a simpler processor that might be in-order, like one of the earlier versions of the Loongson Processor — this might have a huge benefit.
So do you have the same graph for an earlier processor? >> Chennaggang Wu: No. We only did it on the new processors. The last one is promoting — do I have time? >> Ben Zorn: Not much. >> Chennaggang Wu: Not much? >> Ben Zorn: About 5 minutes. >> Chennaggang Wu: 5 minutes, okay. The last one is promoting local variables into registers, because when we translate X86 [inaudible] local variables into registers [inaudible]. For X86, because there are too few registers, the register allocator must spill local variables to the stack, so we try to promote the local variables back into registers to reduce the overhead. Since I do not have enough time: our method works on these yellow instructions. We did all of this work on the Intel 64 platform, not on MIPS; the reason is so we can compare with the published papers. For these three instructions, we found that they access this stack data, and on Intel 64 we have eight extra registers, so we can use them to preload and restore. So in the loop we can replace the stack variables by the new registers. This optimization is simple, but the most important thing is how to guarantee safety. Okay. No time. The main reason why we target stack variables is that they are simpler than other variables: they are explicitly referenced, and it is relatively easy to disambiguate explicit references. But there is one safety issue: how to detect memory aliases. We use memory protection to do this job: when we promote a stack variable from a page, we protect the page and promote the variable out. But this method introduces another issue: because we do not have enough registers to promote all of the local variables, how do we handle the other variables on the protected page? Okay. The solution is that we use mremap to create a shadow stack. We translate the explicit memory access instructions to target the shadow stack [inaudible] physical stack.
These two parts of memory share the same physical memory, so we redirect the accesses to the shadow stack, and the problem is solved. But there is one more problem: the original static stack does not permit the mremap operation, so we must allocate a new stack first, move the stack contents onto it, and then do the mremap. I think I have finally run out of time. Okay. So let me show the results for this optimization. For some applications we can achieve a 45% speedup; on average we achieve 7.3%. Okay. I am sorry for my poor English; I cannot explain it well. [applause]. >> Ben Zorn: Thank you. That was a lot of great material. [laughter]. Any further questions? So Chennaggang is going to be with us through the day and he is going to join me for lunch. If you want to come to lunch with us, you are welcome to, and we could probably find some room in his schedule later if you want to talk to him. So just let me know.