>> Ben Zorn: Good morning. It is a great pleasure this morning to introduce
Chennaggang Wu from the Institute of Computing Technology in the Chinese Academy
of Sciences. Chennaggang received his PhD in computer science from ICT in 2001 and
he has been working on dynamic binary optimization, computer architecture and virtual
machines. And today he is going to be talking about a binary translation system for the
Loongson Processor. Please welcome him. Thank you.
>> Chennaggang Wu: Okay, thanks for that introduction. Can you hear me? Okay. I
cannot hear it [laughter].
>> Ben Zorn: The microphone is mostly for the keyboard.
>> Chennaggang Wu: Oh, okay. Because my English is not good, if you cannot
understand anything I say you can interrupt me anytime. I am from ICT. I have focused on
binary translation for about eight years. Our task is to develop a binary translation system
for the China-made processor named the Loongson Processor. Now I will introduce my work to
you. Before I introduce my work I want to introduce my group. These are all members
of my group. We have four, one faculty, three staffers, five PhD's and the others are
My research area is dynamic compilation, including binary translation and dynamic
optimization, and recently we started research on program reliability. We try to use
dynamic [inaudible] to find the concurrency bugs of parallel programs. We have just started,
in order to do deeper research in this area. This is the outline of my presentation. In the
first part I want to introduce the requirements of the China-made processors. In the
second I will introduce a system called DigitalBridge, and then introduce some
experiences and progress. The Loongson Processor is developed by ICT, our
institute, not by our group but by another group, another lab. They adopted a MIPS-like
instruction set: they inherited all of the MIPS instruction set and extended it with some
instructions for binary translation, because they want to support binary translation. But I will
explain that we do not use many of the extensions. Loongson has
developed three generations of processors. The first generation is one core, single issue, just
a simple processor. Then they developed the second generation, from 2A and 2B to 2F, and
they are one-core processors, one core for each processor. Now they are developing 3A
and 3B. 3A has four cores and 3B has eight cores, and they may offer 16 cores for 3C.
The frequency of the processor is not very high. It is about 800 MHz to 1 GHz.
This is a picture of the Loongson Processors. Now let me introduce our system. Because
the Loongson Processor uses the MIPS instruction set and there are fewer applications that support
MIPS machines, they need us to migrate x86 applications to the Loongson Processor.
This is the mechanism of the binary translation. We want to migrate x86 applications
to the Loongson Processor and use our virtual machine to support them. Let me introduce our
progress. The performance of our binary translation system is 67% of the performance of the
binary generated by GCC 4.4. How we gathered this data: [inaudible] we
compiled the SPEC 2006 benchmarks with the GCC -O3 flag to generate the code, and then
we used the binary translation system to run it on a Loongson and gathered the
performance data. The other binary was also generated by GCC with -O3,
and we compared them and gathered the data. Our system now supports Adobe Flash Player,
Adobe Reader, Apache and MySQL. These are real applications. Adobe Flash Player is
now available on the Loongson Processor, so this is the main contribution we made to
Let me show, this is the performance data. Let me show you the user experience of our
system. According to the requirements from Adobe, they require a Pentium II
with 800 MHz, and the Loongson 2F also has this kind of frequency, so this is the
minimum hardware requirement. Let me show you this.
[video begins]. [inaudible].
>> Chennaggang Wu: Because the speed of the binary translation is the most important thing,
I show you the user experience. This, I think, is some of the [inaudible] Loongson
[inaudible] opening ceremony. This is a result. This can show the user
experience. This is for the videos, and for the animation it can be better.
[video begins].
>>: Dancing through the snow, in a one horse open sleigh, over the fields we go,
laughing all the way.
>> Chennaggang Wu: It is better than the video.
>>: Making spirits bright, what fun it is to ride and sing a sleighing song…
>> Chennaggang Wu: Let me continue. To get good performance, let me
introduce my experience. The most important thing we found for improving
performance is not new technology. It is just selecting the target instructions to
simulate each guest instruction carefully. This is the most important thing. When we did this
work it was engineering work, but it can give good performance.
After we did it, the performance can reach 40% of the native code, 40%.
So it is very useful. Then my second experience is to leverage the local resources as much
as possible. If the system has a local library we
leverage it. We do not translate it, because the local library has
better performance than the translated code, and so does the local software framework. For example, the
Flash Player is a plug-in of the browser. We tried to translate the
browser and the plug-in together but the performance was not good, so we
only translate the plug-in and we use the local browser, so it had [inaudible]
performance. These two [inaudible] are our experience. Actually I need not introduce them
in detail. It is very simple, although it includes a lot of engineering tasks.
For the progress I think we have four important technologies. The first one is handling
misaligned data accesses efficiently. It was published in CGO 2009. And we improved the
data locality by pool allocation, published in CGO 2010. And we improved the data
locality by on-the-fly structure splitting, published in HiPEAC this year. And we promote
local variables into registers, published in CGO 2011.
In the following I will introduce these four techniques. For the first one: because on x86 the
hardware supports misaligned data access, it does not introduce big overhead. So the
compiler does not turn on the alignment option, and there are a lot of misalignments in
the program, in the binary. The misalignment will introduce greater overhead. What is
misalignment? I think I need not introduce it because you are experts, so I
am not going to introduce it. But the modern RISC machines, especially Alpha
and MIPS, don't have this kind of support. When the processor
runs a misaligned access it will trigger an exception, and the exception will take
1000 cycles for one instruction. So the cost is very high.
This is statistical data on SPEC 2000 and 2006. We found that on average there are
1.44 misalignments [inaudible] in the program. But for some programs like [inaudible]
and Art and some others the misalignment will be very high, so it will
introduce much more overhead. Before we…
>>: Can I ask about that? So is that, are your programs originally compiled with GCC?
Is that right? So does the compiler itself try to avoid misaligned accesses when it…
>> Chennaggang Wu: The compiler can deal with the misalignment [inaudible] but the
option is not turned on.
>>: Okay.
>> Chennaggang Wu: For the -O3 option it is not [inaudible].
>>: I see. So you are just assuming that people won't do that optimization?
>> Chennaggang Wu: We are assuming that people do not do it. But indeed in the real
applications, the real applications are [inaudible]. There exist a lot of misaligned data
[inaudible] accesses, so there already exist some methods. One method is called the
direct method. This method is used in QEMU [inaudible], in this virtual
machine. It handles it like this: when it meets a misaligned data access, it translates it into a
sequence of code. This code does not trigger misalignment exceptions, but it introduces much
more overhead than a single instruction. QEMU translates all of the data access
instructions into this kind of code sequence, so for the data access instructions
[inaudible], with limited success it will introduce greater overhead.
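[Editor's sketch, not part of the talk: the "direct method" described above replaces a word load with a sequence that reads one byte at a time, so it can never raise an alignment exception but costs several operations per access. The Python below only simulates that idea on a `bytearray`; the function name, the 32-bit word size and the little-endian byte order are assumptions for illustration, not the code any real translator emits.]

```python
def load_word_bytewise(memory: bytearray, addr: int) -> int:
    """Load a 32-bit little-endian word one byte at a time.

    Each byte access is trivially aligned, so no address can fault,
    but four loads, shifts and ORs replace one load instruction.
    """
    value = 0
    for i in range(4):
        value |= memory[addr + i] << (8 * i)
    return value


memory = bytearray(range(16))
# Address 1 is not a multiple of 4, so a plain word load would be misaligned.
word = load_word_bytewise(memory, 1)
```

Applying this sequence to every load, aligned or not, is what makes the direct method slow for programs that rarely misalign.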
Another method is named static profiling. This method is used in the FX!32 system. They
instrumented the application and executed it, and they
[inaudible] misalignment information for the instructions, and then when they translated
the application, they only translated the instructions with the misalignments they
found. But for this method we found that for different inputs the misalignment will be
different. The instructions will be different, so it is not a good solution. This method
cannot be made. So, okay?
>>: Obviously your results from the profile are that you can't prove that something will
not, that a location will not have a misalignment so do you have an instruction in that
>> Chennaggang Wu: Yeah, you're right.
>>: So your work it just has a really expensive solution at runtime?
>> Chennaggang Wu: Yeah, yeah. For some input, some instruction will trigger a
misaligned access, but when using another input, another instruction will
trigger a misalignment, but for this instruction the profiling do not create
>>: The profiler thinks it won't.
>> Chennaggang Wu: Yeah. So there is a method that we call dynamic profiling, and this
method is used in the IA-32 Execution Layer developed by Intel Shanghai. In this method they
use an interpreter to interpret the program, or use an instrumentation
method. For the IA-32 Execution Layer, they instrument the binary code. When they find that
in the first 20 iterations of a loop an instruction triggers a
misaligned access, they will translate it into the code sequence. This method
works very well for some applications, but we found there exist some
applications where the iteration count of the loop will exceed, how much, I forgot it. Let me try.
Yeah. For this application it needs 266 iterations to trigger the first misaligned data
access; after it, it will access the misaligned data frequently. So this method cannot
work well.
So we propose another method named exception handling. It is very simple. We use the
translator to translate the data access code into MIPS instructions, and before
translation we register an exception handler. When the code triggers
a misalignment, it will trigger an unaligned exception, and this handler will modify the code
according to the information we get. We also do some optimization. If we use
the exception handler to patch the code it will destroy the locality, so we
rearrange the code, we move this code, to improve the locality.
And we combine the dynamic profiling method with our method [inaudible]: we
use the interpreter to interpret the iterations several times to find some misaligned data
accesses and translate them [inaudible] into the code we want, and then we use our exception
handler to find the misalignments we do not find during the profiling time. And
we do another optimization. We found that in a basic block there are often a lot of
misalignments, so we translate them together
instead of translating one instruction at a time, to improve the code locality.
And for some code we use multi-version code: we generate code to
check if this instruction will trigger a misalignment, and according to the check we use the
original instruction or the misalignment [inaudible] sequence. It is a very simple method but it
is very useful. It increases our system's performance by 20%. Compared to the
other three methods, our system achieves 21% better than dynamic profiling, 14% better than
static profiling and 76% better than the direct method. Okay?
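[Editor's sketch, not part of the talk: the exception-handling method can be pictured as a code cache in which every load site starts with the fast single-instruction version, and the handler rewrites only the sites that actually fault. The Python below is a simulation under assumptions: `MisalignedAccess` stands in for the hardware unaligned-access exception, and `CodeCache`, `aligned_load` and `safe_load` are hypothetical names, not parts of the system described.]

```python
class MisalignedAccess(Exception):
    """Stands in for the hardware unaligned-access exception."""


def aligned_load(memory, addr):
    # Fast path: one "instruction", but it faults on a misaligned address.
    if addr % 4 != 0:
        raise MisalignedAccess(addr)
    return int.from_bytes(memory[addr:addr + 4], "little")


def safe_load(memory, addr):
    # Slow path: byte-by-byte sequence that never faults.
    value = 0
    for i in range(4):
        value |= memory[addr + i] << (8 * i)
    return value


class CodeCache:
    def __init__(self):
        self.sites = {}   # load-site id -> implementation currently installed
        self.faults = 0   # how many times the handler had to run

    def load(self, site, memory, addr):
        impl = self.sites.setdefault(site, aligned_load)
        try:
            return impl(memory, addr)
        except MisalignedAccess:
            # "Exception handler": patch this one site, then redo the access.
            self.faults += 1
            self.sites[site] = safe_load
            return safe_load(memory, addr)


memory = bytearray(range(32))
cache = CodeCache()
values = [cache.load("site_a", memory, a) for a in (0, 1, 5, 9)]
```

In this simulation the expensive exception is paid once per faulting site; later misaligned accesses at the same site go straight to the patched sequence, while sites that never fault keep the fast version.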
>>: Do you have a measure of how much it increases the size of the code because you're
adding [inaudible]?
>> Chennaggang Wu: We do not, because we, we do not, I have forgotten the exact data.
Because this paper was published several years ago I do not remember the… It
was not very high.
>>: Okay.
>> Chennaggang Wu: We can calculate it from the instructions. For one
instruction with a misaligned data access, we generate
seven to eleven instructions, so when you multiply this with--very small.
Another progress is improving data locality by pool allocation. Dynamic heap memory
allocation is widely used in modern programs. But the general-purpose heap allocators
focus more on runtime overhead and memory utilization. They do not focus on data
locality. Let me give an example. Suppose there are three data structures. One is a
list, another is a list, and one is a tree. The program allocates them through the
general-purpose allocator, so the program may allocate one
node for list one, one node for list two and one node for the tree, and
so on. So this layout has poor locality. So we use pool allocation to aggregate heap
objects into separate memory pools at the time of their allocation. We set up pools for
this: when the program runs [inaudible], it allocates list one in this part, and then
allocates list two in this part and then the tree in this part. So it can improve the data
locality, because when the program accesses a data structure it always
accesses it in this sequence. How to do it.
Before we did this work there existed another work. It uses the allocation site to locate the data:
when they find an allocation site they assign a pool to this site, and the data allocated at
this site goes to this pool. So they use different sites to set up different pools. But sometimes
there exist some problems. You know, modern programmers always use
the [inaudible], like this code in twolf. They use safe malloc. In this function they call
the library function malloc at this site. If it can malloc data from the
heap it will return a pointer. If not it will return a null pointer. For the twolf benchmark all
data is allocated by this [inaudible], so it introduces a problem. If we use this
allocation site, we will allocate all of the data into one pool. So this is a problem.
This is another application that also has this kind of problem, and this
is an example of the problem.
So our solution is to use the call chain. But we found that we have three kinds of
choices. The first one is to use the full call chain from the allocation site to the main
function. But the call chain is very long and it will introduce much more overhead. If
there is recursion the overhead will be much worse. Another choice is to use a
fixed-length call chain. We can use a call chain of length two. But for this application some wrappers have
several layers, not one layer, so it cannot work very well. With the adaptive partial call
chain we analyze the code of the binary and try to find whether it uses a
wrapper. If it uses a wrapper, we enlarge the call chain, so it can work very well. For some
data [inaudible], different allocation sites allocate the data for one data structure. This
allocation site and this allocation site allocate different nodes, and they are all
connected together to form one structure.
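[Editor's sketch, not part of the talk: one way to picture the adaptive partial call chain is a pool key that starts at the caller of malloc and grows past every wrapper frame until it reaches the first frame that is not a wrapper. The Python below is a simulation under assumptions: the set of wrapper names is simply given (the talk says it is found by analyzing the binary), and `PoolAllocator`, `pool_key`, `build_list` and `build_tree` are hypothetical names.]

```python
from collections import defaultdict


def pool_key(call_chain, wrappers):
    """Return the partial call chain used to pick a pool.

    call_chain lists the callers of malloc, innermost first. The key is
    extended through every wrapper frame and stops at the first real caller.
    """
    key = []
    for frame in call_chain:
        key.append(frame)
        if frame not in wrappers:
            break
    return tuple(key)


class PoolAllocator:
    def __init__(self, wrappers):
        self.wrappers = set(wrappers)
        self.pools = defaultdict(list)  # pool key -> sizes allocated there

    def allocate(self, call_chain, size):
        key = pool_key(call_chain, self.wrappers)
        self.pools[key].append(size)
        return key


alloc = PoolAllocator({"safe_malloc"})
k1 = alloc.allocate(["safe_malloc", "build_list", "main"], 16)
k2 = alloc.allocate(["safe_malloc", "build_tree", "main"], 24)
k3 = alloc.allocate(["safe_malloc", "build_list", "main"], 16)
```

With the allocation site alone, all three requests would share the single malloc call inside the wrapper and land in one pool; with the extended key, list nodes and tree nodes are separated while the chain stays short.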
To solve this problem we define two kinds of affinity,
named type I and type II. Type I affinity means that if the objects are of the
same type and they are linked together, we assume they have type I affinity. Just
like this: they have the same type and are connected together by pointers, so this is the
first, the type I affinity. Type II affinity means these nodes do not
connect to each other, but they are pointed to by nodes with the same type I affinity through the
same field. So we assume they have the affinity. So we use pool one to put these
together and use pool two to put this kind of node together. Suppose this
node and this node are not allocated at the same site; we will combine them together. We
will merge them together. To publish this paper we…
>>: Can you go back? So is this done statically? Or are you…
>> Chennaggang Wu: Dynamically.
>>: Dynamically.
>> Chennaggang Wu: In the runtime.
>>: How do you know that it, that it is a type one or a type two…
>> Chennaggang Wu: We do not know the type. What we know is only the size
allocated. If the sizes are always the same, we assume they have
the same type.
>>: No but how do you put them in pool one and pool two? How do you decide which
pool to put them in?
>> Chennaggang Wu: At first, when we allocate this node, we try to find if
any others have the same affinity. If the data has affinity with some others already existing…
>>: [inaudible] based on the linking.
>> Chennaggang Wu: Based on the linking.
>>: So how do you monitor that?
>> Chennaggang Wu: How do I…
>>: You do that dynamically.
>> Chennaggang Wu: Yeah. In the beginning we
only set up one pool. So when the program allocates the first node we put it
in this pool. When it allocates another node, we will compare whether it
has an affinity with the existing node. If it has an affinity with the existing node we
allocate it in this pool. If not, we will set up another
pool and allocate the node in that pool.
>>: If you don't have type information, how do you know--so what is the affinity, what is
the dynamic test that you do?
>> Chennaggang Wu: What is what?
>>: What is the test, the affinity test? What is it? What do you check?
>> Chennaggang Wu: We check one, two, two things, two attributes. One is the site.
>>: Right.
>> Chennaggang Wu: We do not know the type but we know the site. We can
check whether they are from the same site. That's the first one. The second one, we check if there is a
pointer from this one pointing to this one. We check it.
>>: How can you check that if you are allocating the second [inaudible] and there are no
pointers into it or out of it, right? There is no…
>> Chennaggang Wu: When it is allocated, the function must return a [inaudible] in
a register. The [inaudible] of the register must be assigned to an existing variable or existing…
>>: What he is asking is how do you know whether there is a pointer between the two?
How do you know there is a pointer between the two?
>> Chennaggang Wu: Okay, I see, I see. This wrapper returns an address, and in this
code the address will be returned to its caller. For its caller we will check the
code to find…
>>: You anticipate where it's going to be stored.
>> Chennaggang Wu: What?
>>: You anticipate where it is going to be stored by doing a program analysis? You find
out which pool it's going to be stored in.
>> Chennaggang Wu: Yeah, yeah, yeah.
>>: So that is a static analysis before…
>> Chennaggang Wu: We do not do static analysis.
>>: Then how do I mean [inaudible] knowledge when you do the allocation?
>> Chennaggang Wu: When it allocates the data we start our analysis. The analysis is
[inaudible]. We try to find which instruction does this task, assigning the returned [inaudible]
address to a pointer.
>>: So basically, you look at instructions [inaudible] that follows the caller…
>> Chennaggang Wu: Yeah, yeah, yeah. We look at the instructions that follow the
call. At runtime, on-the-fly. So now…
>>: [inaudible] in a sense because you're just doing it at runtime but you're doing it before
you run the code, right?
>> Chennaggang Wu: Not before, after we do it.
>>: But you've already allocated it so you can't move it from one pool to another once
>> Chennaggang Wu: To another node.
>>: Right. So you need to know before you allocate.
>> Chennaggang Wu: No. We catch the allocator; we do not allocate the data. We do not
allocate the data until we have found the code that assigns the address
to the pointer.
>>: So you delay the allocation?
>> Chennaggang Wu: Because we use our allocator, we intercept this allocator. We do not
literally allocate the data. We do not literally allocate the data. We intercept
the request and we delay it.
>>: You delay it? And that's the trick. You wait until you know where it's stored and
then you allocate.
>> Chennaggang Wu: Yes.
>>: Ahh.
>> Chennaggang Wu: We only do it one time. Only do it one time. Even if
this allocation site is called 1 million times, we only do it one time.
>>: Oh, and that is based on the call stack…
>> Chennaggang Wu: Yes. If the call stack changes then we do it another time.
Otherwise we don't have to change it. We only do it one time.
>>: [inaudible]. Thanks.
>> Chennaggang Wu: Because we wanted to publish the paper, we did our experiments
on x86 machines, not on the binary translation system. But we use it in our system. We
tested on two machines, one was an Intel Pentium 4 and the other an Intel Xeon. My
pronunciation is right? The difference is the level two cache: the level two cache for
this one was very small; for this one it was bigger. That is the main difference. We
used 12 benchmarks from SPEC 2000 and 2006. The policy is that
we use the memory used in pools divided by the memory used in the
heap, [inaudible] 1%, to choose this kind of benchmark. For some
benchmarks we can get great improvement. For this one on the second platform we
can get 80% speedup. For this one on the first machine we can get 50% speedup.
The main reasons for the speedup are different. For this one the performance
comes from reducing TLB misses. We reduce the TLB misses greatly. For this
one we reduce the cache misses greatly.
>>: Just one question, so can use this for, this is a [inaudible] translation, but does it have
to be? I mean it seems like it could be used for any program right? [inaudible] apply
your technique if you are not translating because of your tricks of delaying the allocation,
is that the reason that you're doing it?
>> Chennaggang Wu: I am not sure what your…
>>: [inaudible] general technique, right?
>> Chennaggang Wu: Do you mean can I use this technique for other…
>>: Yes. Optimizing an existing allocation, some other allocator?
>> Chennaggang Wu: I think you can, but for some programs it would not get a
performance gain. If the workload is very large or very small it cannot
get the performance gain. If it is very large we cannot reduce the misses, because for some
applications the workload is very large. No matter how we improve the locality,
the cache misses and the TLB misses will be great. And for the
application with a small workload it cannot work well either. But for some
applications, before we optimize they trigger [inaudible] misses and TLB misses, and after we
optimize, wait a minute. After we optimize, the workload can fit in the cache and it
can get a great performance gain.
>>: [inaudible]. Can you let your slide that has the [inaudible]? So what is your
baseline here? Because this is on the Intel hardware so are you doing the binary
[inaudible] 02? [inaudible].
>> Chennaggang Wu: I have forgotten it because, I have forgotten. [inaudible] I have
forgotten it. With the [inaudible] optimization option but I have forgotten it. I could
probably check the paper.
>>: Are you writing out X86 in during, when you do the DPA? Do you write out X86 or
do you write out--so how to actually implement this at runtime?
>> Chennaggang Wu: [inaudible] X86 and not a [inaudible] otherwise we implement it
in our system.
>>: [inaudible]. But as you said, you could probably do this statically once per call site,
per call to malloc.
>> Chennaggang Wu: [inaudible] call statically. Yeah, [inaudible] works but it doesn't
work well due to the same allocation [inaudible] different structure.
>>: So have you looked at inlining? I mean these wrapper functions are pretty small
generally. You could imagine just inlining all of the wrappers and that would
>> Chennaggang Wu: Inlining wrapper. I remember it is not a problem. We discuss the
in-line before we publish in the paper. Inlining. No, that is another problem. Wouldn't
inlining the wrapper, the cost [inaudible] work well.
>>: So that would save overhead to measure the…
>> Chennaggang Wu: We do not, from the [inaudible], we do not know, we cannot
identify which code is inlined.
>>: Right. I understand. Okay. So I was suggesting if you actually had the source and
you inlined the wrappers, then you wouldn't have to go up multiple calls, multiple frames
in the call site. And it would save the effort to do that and then you could do a static
analysis at every call site, which is what I was suggesting. And you wouldn't need to do
the dynamic analysis.
>> Chennaggang Wu: I am sorry. I do not catch what you…
>>: Okay so, we can talk afterwards.
>> Chennaggang Wu: We can talk, okay yeah, my English is [laughter].
>>: No. It's okay.
>> Chennaggang Wu: We discussed the inlining. It works very well by the call site.
This is the second, this is the third, this is the second method. The second method is
to improve the data locality by online, on-the-fly structure splitting. We always do the
[inaudible] on-the-fly [laughter]. So the memory wall has long been the main barrier that
limits system performance. Many technologies have been
proposed to improve the data layout, such as structure splitting and field reordering. But
this kind of technology has some problems. It can improve
the performance of many applications, but for some applications it can slow them down.
I will show the data in the following slide. And this kind of optimization is not very safe,
so it needs careful checks at runtime, otherwise it introduces bugs into the system. So
this method has not been applied in many compilers, for performance and safety concerns.
So we designed an on-the-fly dynamic approach to apply the data structure splitting
optimization. Need I introduce structure splitting? Do I need to introduce what
structure splitting is? No. I think not [laughter]. Okay.
This is the framework of our system. This is the, the orange is the binary code, program
code, and this side is our system. When the program runs it may allocate data. We
intercept the allocation request and then we catch the allocation. We use our allocator. Our
allocator includes a large array allocator and an object pool allocator. The object pool
is the operation I introduced just now. When we find a large array or a large object pool,
we will use it as an optimization candidate. The allocator will
invoke the analyzer to do pointer analysis to find the possible access instructions to the
allocated data, to try to recognize the size of the structure and the fields of the data, and
to do benefit estimation. If the result shows we can get a benefit we will allocate two spaces,
two blocks of space. One is the original space. After we allocate it we
protect it: we protect the space with non-readable and non-writable flags. At
the same time we allocate another space, namely the splitting space, to use and to store space
Because this kind of data is space protected, when a PAI accesses the data it will trigger
an exception. So our exception handler catches it, and the exception handler
will invoke the analyzer another time. The difference is that now it has the runtime information
of the variables and the registers, so it can find this kind of information and generate the
code. We replace the original possible access instructions and allow
them to access the splitting data. So we put the new code into the code
cache and patch it into the program code. Afterward the program can run without
triggering exceptions.
>>: So it doesn't seem safe as it [inaudible].
>> Chennaggang Wu: You are right. I will introduce the safety guarantee mechanism.
Safety is very important for this optimization. We also designed a
rollback module. When we find some violation of the optimization we are able to roll
back and solve the safety problem. And we set up a monitor to monitor the
path counter. When we generate code we instrument the code to record the
number of times the allocated region is accessed. The monitor will check it. If we find that
the optimization is not efficient then we will roll back. Okay. Let me introduce the
analyzer first. The analyzer does four jobs. The first one is to find the PAIs, the possible
access instructions. If EAX is calculated from the allocation site, from the allocation
[inaudible], we [inaudible] the instruction a PAI.
Another important issue is how we find the relationship between EAX and
the call site. We can use two technologies. One is points-to analysis. Points-to
analysis is very difficult even for a static compiler, although the static compiler has
the source code. So it is much more difficult for us because we do not have
the source code. And there in the independent code there are a lot of empty [inaudible]
mode in the binary. So we choose another, a lightweight flow-sensitive
propagation method, which just propagates the address according to the [inaudible]
flow. If we find it has a dependence relationship we recognize this instruction as a PAI. This
method can find some PAIs but it cannot find some others. Undiscovered PAIs
will be found by our safety mechanism.
Another task is to recognize fields. This has two parts. One is getting the structure
size and another is getting each field’s offset and size. For the structure size we assume each PAI
always accesses the same field. This assumption is not always true. It is not always true.
We must check. We have a mechanism to check it. But we assume it. So under this premise
we get the strides of the PAIs in all of the loops and calculate the greatest common
divisor of the strides. We assume this result is the structure size. It may be a
multiple of the structure size. It may be a multiple, but it is not influence a factor of the
Another task is to get the field offset and size. It is very simple. Suppose EAX is the
start of the allocated space. Then from this instruction the field has offset four and is four bytes.
We can get the other information in the same way. Then we estimate the benefit,
and we use a heuristic method. It is very simple. We only calculate the field coverage:
the number of bytes of the structure touched by the loop or
trace, relative to the structure size. We believe the smaller the field coverage, the better for us.
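[Editor's sketch, not part of the talk: the two estimates described here can be written down directly. The structure size is taken as the greatest common divisor of the strides observed for the access instructions, and field coverage is computed as the fraction of the structure's bytes that a loop touches. The function names and the sample strides and fields are hypothetical, and reading coverage as touched bytes divided by structure size is an interpretation of the spoken description.]

```python
from functools import reduce
from math import gcd


def estimate_structure_size(strides):
    """GCD of the address strides seen for the access instructions in loops."""
    return reduce(gcd, (abs(s) for s in strides))


def field_coverage(touched_fields, structure_size):
    """Fraction of the structure's bytes touched by a loop.

    touched_fields is a list of (offset, size) pairs, one per accessed field.
    """
    touched = set()
    for offset, size in touched_fields:
        touched.update(range(offset, offset + size))
    return len(touched) / structure_size


size = estimate_structure_size([48, 72, 24])
coverage = field_coverage([(0, 4), (8, 4)], size)
```

As the talk notes, the GCD may come out as a multiple of the real structure size, for example when every loop steps over two elements at a time. A low coverage means the loop drags mostly unused bytes through the cache, which is the case where splitting is expected to help.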
Another job is to allocate memory. For some data, if we find no benefit we will allocate it with
the default allocator of the system. For the others we allocate the original space and the splitting
For the exception handler, that is, the generated code, we use this equation to calculate the
offset, and it is efficient to calculate the new address in the splitting space. So
we generate instructions like this for one original PAI. We will
try to generate instructions like those on this slide.
For safety, we check P, where P is the original pointer. We check if P is in the optimized
space; if not, we let it run the original PAI. We do not replace all the PAIs. Another
check is the offset. I just assumed every PAI accesses the same field, but it is
not always true. So we use code to check it. If it is not true we will roll back.
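[Editor's sketch, not part of the talk: the equation on the slide is not in the transcript, so the mapping below is an assumed, textbook array-of-structures to structure-of-arrays translation, shown only to illustrate the two checks just described: is the pointer inside the optimized space, and does it hit an expected field offset. `translate`, `Rollback` and all the numbers are hypothetical.]

```python
class Rollback(Exception):
    """Signals that the splitting assumption was violated."""


def translate(p, base, struct_size, count, fields, split_base):
    """Map an address in the original array of structures to the split layout.

    fields maps a field's offset inside the structure to its width in bytes.
    Returns None when p is outside the optimized region, meaning the
    original access should run unchanged.
    """
    if not (base <= p < base + struct_size * count):
        return None  # first check: pointer is not in the optimized space
    index, offset = divmod(p - base, struct_size)
    if offset not in fields:
        # second check: the instruction touched an unexpected field
        raise Rollback(f"unexpected offset {offset}")
    # Assumed layout: each field gets its own array starting at count * offset.
    return split_base + count * offset + index * fields[offset]


layout = {0: 4, 8: 8}  # a 4-byte field at offset 0, an 8-byte field at offset 8
new_addr = translate(1040, 1000, 16, 4, layout, 5000)
outside = translate(900, 1000, 16, 4, layout, 5000)
```

In this assumed layout, all copies of one field sit next to each other in the splitting space, so a loop that reads only that field touches far fewer cache lines than it would in the original layout.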
>>: So do you use a [inaudible] space for every type of allocation to do the [inaudible]?
>> Chennaggang Wu: What?
>>: Is there a different optimize space for every different [inaudible]?
>> Chennaggang Wu: If we estimate that we can get a benefit we will use it. If not
we do not use it.
>>: I am thinking about the case we have two different optimizations in the cache.
>> Chennaggang Wu: Two different optimizations…
>>: [inaudible] you can be an optimize space but you can be a different kind of object
>> Chennaggang Wu: I am sorry. What do you mean?
>>: The check doesn't seem…
>> Chennaggang Wu: Which one, this one?
>>: The first one.
>> Chennaggang Wu: Okay. Does it what?
>>: It is not clear that it catches all possible things that can go wrong. It doesn't seem safe
>> Chennaggang Wu: Why?
>>: I, I…
>> Chennaggang Wu: Why [inaudible]?
>>: [inaudible] field access site if it is [inaudible]…
>> Chennaggang Wu: I know what you mean. You mean if even some other PAI
accesses the optimized space then it will be wrong. Is that what you mean?
>>: If there are two optimizations, different optimizations that arrived at the same…
>> Chennaggang Wu: Why is all I data what is another? Two different optimizations?
>>: Well, so objects are allocated in different places, different call sites, but then they
get passed to this routine where the access to the object is.
>> Chennaggang Wu: You mean this role here may access this part of data and access
this part of data. Am I right?
>>: No. No.
>> Ben Zorn: We should do this off-line.
>> Chennaggang Wu: What does…
>>: We can talk afterwards. It seems a little complicated. I think he would need an
example to [inaudible].
>> Chennaggang Wu: I am sorry. I think I must improve my English [laughter]. I study
English here, two hours per day, but I think I really should improve my
English. Okay. After the talk we can discuss it, or on paper you could write or draw
the problem for me, okay?
Okay. For the optimization we use 21 instructions to replace one PAI. That is overhead, so
we use several means to reduce it. First of all we use liveness analysis of the
registers. If we find some registers are not used at that point we use them, so it reduces the
instructions. We amortize over the subsequent PAIs: if we find several PAIs, we
amortize them by using the same check. And we do strength reduction and code
promotion. Code promotion means we promote some code outside the loop.
Okay. Then we do persistent profiling, where we set a monitor here.
We use a metric, namely a ratio, to calculate
the penalty cycles and the reduction of cache lines touched. For it we just estimate
using the number and complexity of the instructions. We call it the instructions and
therefore some [inaudible] divided instruction and we think the [inaudible].
And we check whether this instruction can be promoted outside of the loop. And we use
the field coverage as this f to do the calculation. When the optimization is not effective, or
there exists a field violation (in C programs, a union or a type cast), we
will do the rollback. The other case is one that our system cannot handle, and there we also
roll back. The rollback is very simple: we restore read and write privilege of the
original space, merge the splitting space together, move the data to the original
space and free the splitting space.
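The merge step of the rollback can be sketched as copying each split field array back into the original array-of-structures layout. This is only an illustration on byte buffers: restoring page permissions and freeing the splitting space are operating-system calls and appear only as comments. The function name and the sample data are hypothetical.

```python
def merge_split_fields(original, struct_size, fields):
    """Copy split field arrays back into the original structure layout.

    original: bytearray holding count * struct_size bytes.
    fields:   list of (offset, size, split_buffer) triples, where
              split_buffer holds count densely packed copies of that field.
    """
    # Step 1 (not shown): restore read/write permission on the original space.
    count = len(original) // struct_size
    for offset, size, split in fields:
        for i in range(count):
            dst = i * struct_size + offset
            original[dst:dst + size] = split[i * size:(i + 1) * size]
    # Step 3 (not shown): free the splitting space.
    return original


# Two hypothetical 4-byte structures: a 1-byte field at offset 0 and a
# 3-byte field at offset 1, each kept in its own dense array while split.
restored = merge_split_fields(
    bytearray(8),
    struct_size=4,
    fields=[(0, 1, b"\x01\x02"), (1, 3, b"abcdef")],
)
```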
For the safety problem, I am not sure if this will answer your question. For an original
instruction, if it accesses unoptimized data it is good; it is the original behavior of the
program, so no problem. For an original instruction, if it accesses optimized data, because
this data is page protected the instruction will be captured by the exception handler and will be
transformed into a PAI of the [inaudible] code. So no problem. And for a translated
instruction accessing the optimized data, if there is no field violation and the PAI always
accesses the same field, no problem. But if these two conditions do not hold, if there exists a
violation, we will roll back. For a translated instruction, if it accesses
unoptimized data, we check the address and let it execute the original PAI,
[inaudible] PAI, and that will solve the problem.
This is the speedup. We implemented the optimization and it passed all of the
benchmarks, but we found only five benchmarks that trigger the optimization. For the other
benchmarks there is no speedup and no slowdown. For these programs we can achieve a
speedup, but for this one there is a slowdown. I
should introduce the three parts. The red part is the pool allocation method. The green
part is the OSS result without the monitor. And this one
is OSS with the monitor. So without the monitor, it will slow down for this benchmark.
The reason is this benchmark does not access the data sequentially. It accesses the data
randomly, so it will slow down. With the monitor, when we find no performance gain we
roll back, so there is no slowdown. I think the speedup of 0.9 may be within the accuracy
of the measurement, so for this one there is no slowdown. It also proves that if we use these
optimizations statically they cannot work very well, because there is no monitor to roll back.
>>: So historically I think many of the papers that try to do this have a similar result and
basically it only works for certain programs and, you know, there is cost. So given your
experience would you actually recommend implementing this optimization now that you
have done all the work and you see the results?
>> Chennaggang Wu: We have done all the work and already published it. Is that what
you mean?
>>: I mean just based on your results here, would you recommend that other people
implement this optimization or you consider it not a good enough result. It is not a good
enough benefit?
>> Chennaggang Wu: I think that the benefit for some applications like art is very high
and it is beautiful. We found in our study that in the compiler GCC 4.4 there
exists this kind of optimization. But it is not on always. It is not on. I think the
reason is that for some benchmarks it creates greater overhead and a slowdown, but for
ours we do not make users slow down.
>>: Right.
>> Chennaggang Wu: Yeah?
>>: So some of the reasons you think that it may not work well is the X86 processor is
very good at prefetching data for you, right? It does a good job so it is hiding some of
those latency issues that you run into but it seems like a simpler processor that might be
in order like one of your earlier versions of the Loongson Processor, this might have a
huge benefit. So do you have the same graph for [inaudible] earlier processor?
>> Chennaggang Wu: No. We only do it in the new processors. Last one is promote-do I have time?
>> Ben Zorn: Not much.
>> Chennaggang Wu: Not much?
>> Ben Zorn: About 5 minutes.
>> Chennaggang Wu: 5 minutes, okay. The last one is promoting local variables into
registers, because X86 we translate into [inaudible] promote local variables into registers
and [inaudible] on it. So for X86, due to the reason must to
register, [inaudible] register locator register. So [inaudible] we try to promote the local
variable into the register to reduce the overhead. Since I do not have enough time: our
method is for these yellow instructions. We do all of this work on the Intel 64
platform, not on MIPS. The reason is it is usually the published paper. So for these three
instructions we found the instructions access this stack data. On Intel 64 we
have eight extra registers, so we can use them to preload and to restore. So in the loop we
can replace stack variables by the new registers. This optimization is simple, but
the most important thing is how to guarantee the safety.
Okay. No time. The main reason why we target stack variables is that they are
simpler than other variables. They are explicitly referenced, and it is relatively easy to
disambiguate explicit references. But there is one safety issue: how to detect
memory aliases. We use memory protection to do this job. When we promote the stack
variable from this page we protect the page and promote the variable outside. But this
method introduces another issue. Because we do not have enough
registers to promote all of the local variables, how do we handle the other variables?
Okay. The solution is we use mremap to map a shadow stack.
We retarget the explicit memory access instructions to the
shadow stack [inaudible] physical stack. These two parts of memory share the same physical
memory, so we change them to the shadow stack and the problem can be solved.
But there exists another problem: the original static stack does not permit the
mremap operation, so we must allocate a major stack first, move the stack
onto that stack, and then do the mremap. I think finally I have run out of time. Okay.
So let me show the result for the applications. For some applications we can achieve a 45%
speedup. On average we can achieve 7.3%. Okay. I am sorry for my poor English. I
cannot explain.
>> Ben Zorn: Thank you. That was a lot of great material. [laughter]. Any further
questions? So Chennaggang is going to be with us through the day and he is going to
join me for lunch if you want to come to lunch with us, you are welcome to and we could
probably find some room in his schedule later if you want to talk to him. So just let me