>> Onur Mutlu: Hi, welcome all. It's a pleasure to introduce JaeWoong Chung from Stanford University. JaeWoong is finishing his Ph.D. pretty soon on transactional memory and its applications. He's done a lot of work on hardware transactional memory and software support for it, as well as applying transactional memory techniques to different applications, and I think he's going to talk about one of those applications today. >> JaeWoong Chung: Thank you, Onur. Good morning. My name is JaeWoong Chung. Today I'm going to talk about dynamic binary translation using transactional memory. This talk was given at HPCA this year, a few months ago. I was wondering which talk would be better to give today, my thesis talk or my conference talk, and considering the high technical level of this specific audience I decided to delve into one issue. So let's start. Dynamic binary translation. I'm pretty sure that most of you have heard about it and know how it works. Basically, dynamic binary translation converts binary A to binary B, and it does so with a combination of a DBT framework, a software framework that provides basic primitives for translation and for analyzing the application, and a DBT tool that performs the specific translation. There are many interesting use cases of dynamic binary translation. Cross-platform translation is a good example: binary A compiled for platform A can be converted at runtime to binary B running on platform B. Just-in-time compilation for languages like Java is another good example. We also use dynamic binary translation for virtual machine support. And there have been many proposals on how to use dynamic binary translation for profiling, debugging, security, reliability and so on. So let me give you an example of how we can use dynamic binary translation for security. Dynamic information flow tracking, DIFT, is a very interesting technique for security. It basically tracks the flow of untrusted data in applications. Let's say that we have a server application. We cannot trust any of the data coming from the network, so DIFT first tags the input received from the network, and wherever that data goes in the application's execution it follows the flow, and if later on this untrusted data is used for a security-critical operation such as a system call, we trigger a security exception. So let me give you an example of how we can implement this with dynamic binary translation. With dynamic binary translation we add metadata bits called taint bits, one bit per memory byte. It doesn't need to be one bit per memory byte, but typically it is, and we also add the instructions shown in yellow on this slide to maintain the metadata bits. Let me walk you through this example. In the first step, let's say that we received untrusted data from the network and assigned it to variable T. Now, in this box we see that the untrusted data is in T. We execute the added instruction to set the corresponding taint bit, and if the data flows to another variable with a store, the next added instruction copies the taint bit as well, so the value is assigned and the taint bit is assigned with it. The key point is that the taint bit follows exactly where the data goes in the application's execution, and at the end, if variables U1 and U2 are used for security-critical operations, then by checking the taint bits the security policy can kill the process or trigger a security exception. So DBT has been working fine and has proven its value, and it's been working very well with single-threaded applications.
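For illustration, here is a minimal sketch, in plain C++ rather than the injected x86 the talk describes, of the kind of taint instrumentation DIFT adds. The shadow array, the one-taint-bit-per-byte packing, and the helper names are assumptions made for this example, not the actual tool code.

    #include <cstddef>
    #include <cstdint>
    #include <cstdio>
    #include <cstdlib>

    // Hypothetical shadow memory: one taint bit per data byte, packed 8 per byte.
    static const size_t MEM_SIZE = 1024;
    static uint8_t memory[MEM_SIZE];     // application data
    static uint8_t taint[MEM_SIZE / 8];  // taint bits (metadata)

    static bool get_taint(size_t addr) {
        return (taint[addr / 8] >> (addr % 8)) & 1;
    }

    static void set_taint(size_t addr, bool t) {
        if (t) taint[addr / 8] |=  (uint8_t)(1u << (addr % 8));
        else   taint[addr / 8] &= (uint8_t)~(1u << (addr % 8));
    }

    // An instrumented move: the DBT emits the taint copy right after the data
    // copy, expecting the pair to behave like a single atomic instruction.
    static void mov_instrumented(size_t dst, size_t src) {
        memory[dst] = memory[src];       // original instruction
        set_taint(dst, get_taint(src));  // added instruction: taint follows data
    }

    // Security checkpoint before a security-critical operation (e.g. a syscall).
    static void check_before_syscall(size_t addr) {
        if (get_taint(addr)) {
            fprintf(stderr, "security exception: tainted data reaches syscall\n");
            abort();
        }
    }

    int main() {
        set_taint(0, true);          // byte 0 arrived from the network: untrusted
        mov_instrumented(1, 0);      // the taint bit propagates with the data
        check_before_syscall(1);     // triggers the security exception
    }

The race the talk turns to next sits exactly between the two lines of mov_instrumented: with multiple threads, another thread can slip in between the data copy and the taint copy.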
The problem is that now we are in the era of multicores, and there will be multithreaded applications given to DBT as input. The problem is: how can you preserve the original semantics of atomicity in the translated code? For example, let's say that we use dynamic binary translation for binary compatibility, meaning converting a binary from platform A to platform B. Platform A provided a compare-and-swap instruction, but on platform B we emit one instruction for the compare and another instruction for the swap. On a single-threaded core it's fine; the compare and the swap will happen one after the other. But with a multithreaded application, there can be a race and an interleaving of the compare instruction and the swap instruction. Another example: let's say that we add additional instructions to implement a feature. For DIFT, for example, we add instructions to maintain the taint bits. What we expect is that these taint bit instructions act as a side effect of the original instruction, so the taint bit movement happens atomically, following the untrusted data movement. And it's okay with a single-threaded application, because the instruction for the taint bit will be executed right after the original instruction. I'm going to give you an example on the next slide. But with multithreaded input, there can be a race. So how do we deal with this? Up to now, DBT solutions such as StarDBT simply don't allow multithreaded input: no multithreaded input. Another example is Valgrind: it just serializes multithreaded execution, so only one thread executes at a time. >>: (Inaudible) question. I don't understand why on single cores there are no difficulties with multithreaded programs, because my understanding is that one thread can run for a little while and then another thread can run for a little while. >> JaeWoong Chung: Yes. If we just multiplex multithreaded input onto one single core and we don't pay any attention to when the threads context switch, there will be the same problem; you're right. But what we are focusing on here is that in that case we can deal with it just by paying attention to context switch timing. With real multithreaded execution it's not enough to do that. We need some kind of architectural support or software technique to deal with the problem. That's the key. >>: Okay. What (inaudible) -- does Pin do this? >> JaeWoong Chung: Pin provides -- I'm going to go over it. Actually we use Pin for this task, so that's good. Pin provides multithreading support for the DBT framework itself, so it allows multithreaded access to the code cache and multithreaded access to trace generation, but it has nothing to do with the application code and the instrumentation code that run with the DBT framework. And we are focusing on running the multithreaded application code and the additional instructions. >>: How does your model (inaudible) where the tainted data (inaudible). >> JaeWoong Chung: Well, it actually doesn't matter for us, because we're going to come back to that case, but what we try to make sure is that whatever the instruction is, even if there is a branch, the taint bit update, the taint bit check and the branch happen together in the same way.
For example, let's say there is a branch. If the branch is not taken, the transaction just goes ahead, but there will still be a taint bit check after the branch; and if the branch is taken, we put the checking instruction at the beginning of the target code and check it there. Does that answer your question? >>: Well, I'm worried about the tainted data changing the control flow of your program. >> JaeWoong Chung: Still, we're going to guarantee the atomicity of checking the bits and then taking the branch, and if it turns out to be a dangerous execution, it's the responsibility of the implemented feature that checks it. What we try to address here is the atomicity issue -- >>: I understand. What your model is assuming just says okay, if you change it, that's fine -- >> JaeWoong Chung: Oh, so there are points where we have to check security, that is, where we have to check taint bits. One example is jumps and return addresses. Another example is system calls. So basically, if there is a transfer of control flow, we have to check the value. Does it -- >>: Yes. >> JaeWoong Chung: That's a security checkpoint. >>: What's your model for taint? Because some people only check (inaudible) instructions. >> JaeWoong Chung: Yes. >>: If your branch is based on a tainted address, then all remaining execution is tainted and you can't make any decisions about whether or not a security violation has occurred. >> JaeWoong Chung: Well -- >>: You talk about your model (inaudible). >> JaeWoong Chung: Yeah, yeah, I see. So the points we check, in our experiments, are any branch instructions and any system calls. That's basically the two. There are more minor cases, but basically it's two: any control flow change and any system call. That's it. Okay. Any questions? Good. As expected, a tough audience. Okay. Let's move on. So let me show you why it's a hard problem, going back to the original trace and repeating the problem. The problem is: how can you guarantee atomicity? Even after we use DIFT to add instructions, we still expect the original instruction and the instrumented instruction to execute like one atomic instruction, but with multithreaded input there can be a race. This is a very watered-down example that we actually got from (inaudible) cache, and they avoided using locks by relying on the fact that the swap instruction is atomic and an (inaudible) store is also atomic on most modern processors. I'll skip all the implementation code; the key is still here. Thread one is executing the swap, thread two is executing the (inaudible). It doesn't matter which instruction is executed first; it's still a lock-free data structure. Now, we did the same instrumentation for DIFT, and we added instructions as we've done for a single-threaded application. We added the taint bits, we added the instructions. In the beginning, variable T has untrusted data and its taint bit is set. Now, in the first step we store the data, and when we instrumented the code we expected the taint bit update to happen right after the original instruction, as it would in a single-threaded application. But with a multithreaded application there can be a race, so the store to the (inaudible) happens first; from the original program's point of view that's okay, it doesn't matter.
But once we add instructions for security, it becomes a problem, because the taint bit is supposed to be copied following the variable assignment, but here nothing has happened yet, and then the swap's taint bit store happens, so at the end, if the variable is used for a security-critical system call, the security policy will check the taint bit, find it not set, and allow the system call. It's a security breach. Again, the key point: due to the race, there is a chance that the atomicity expected by the feature implementer will break, and then there will be cases where the metadata does not follow the original user data, with significant consequences such as, in this case, a security breach. >>: I have one more question. Do you address which classes of memory consistency models would have this vulnerability and which would not? >> JaeWoong Chung: Oh, in this case we just assume sequential consistency. >>: So which ones would not (inaudible). >> JaeWoong Chung: We haven't studied the other consistency models like weak consistency, because we just stick to the model where each instruction is atomic and all instructions appear in one global history. That's it. Does that address your question? >>: Yes. >> JaeWoong Chung: Hopefully. I see some doubt in your face, but we can talk offline. Okay. Any other questions? Okay. So what about the atomicity issue? Well, we've been dealing with atomicity issues with locks. Can we deal with this one with locks? Of course we can. The basic idea is simple: use lock and unlock to enclose the data access and the metadata access, and that's actually it in terms of correctness. But it turns out to be really hard to meet a good trade-off between performance and programming ease. First, if we go for fine-grain locking, where we use one lock to protect just one original instruction and one additional instruction, we have huge overhead just to acquire and release the lock, and there will be the typical difficulties of using fine-grain locks such as (inaudible). But what we found is that the bigger performance impact is that DBT actually optimizes the code after it instruments new instructions, with passes such as copy propagation, inlining and register renaming, and the lock and unlock instructions act as memory barriers that limit the scope of DBT optimization. That really hurts. And if you don't like fine-grain locking, the alternative is obviously coarse-grain locking. But the typical difficulty of using coarse-grain locks for DBT is that we instrument all of the code, which basically makes the entire code region a critical section. If we start to use a coarse-grain lock for this purpose, it's easy to see that there will be huge performance degradation. But the trickiest thing about this problem, I think, is lock nesting. We are dealing with multithreaded applications, and the user application may already be dealing with its own atomicity issues with its own locks. Now we are adding DBT locks to address the atomicity issue between user data accesses and metadata accesses, and it's not going to be easy in general to prove that there will be no deadlock due to lock nesting and the order of acquiring user locks and DBT locks. >>: (Inaudible) acquired another lock and DBT lock? >> JaeWoong Chung: Well, if we understand a hundred percent of what's going on with the application semantics, yes, we can avoid it. But typically, in our study, it's going to be hard to do in general. >>: Okay.
But if you were at least using the fine-grain version -- >> JaeWoong Chung: Oh, yeah, no problem, no problem. Basically, if we use one lock for one original instruction and one instrumented instruction, as you said, no problem. >>: (Inaudible) coarse grain, okay, right. >> JaeWoong Chung: Yes, yes, okay, so I got your question. Good. So the hardest thing, actually, is that lock programming for code where all the code is a critical section really requires coming up with a smart multithreaded program, and if we ask feature implementers, such as the DIFT implementer for security in the previous example, to also deal with multithreading issues, we are asking too much of those people. Or the DBT tool developer would have to be not only an expert on the specific feature but also an expert on multithreading. So our whole solution is basically very simple: use transactional memory. Transactional memory -- I'm pretty sure most of you have heard about it. Basically, it provides atomic and isolated execution of a group of instructions. Atomicity is the typical all-or-nothing: either all the instructions in the group execute or none do. And isolation is typical again: no intermediate result from the instructions is exposed to the rest of the system until all of the instructions have executed. So from the programmer's point of view, you just use a transaction to enclose the instructions and can think of that transaction as logically executing sequentially with respect to the other transactions and the non-transactional accesses. >>: Is it possible -- so you can't put a statement that affects I/O in a transaction? >> JaeWoong Chung: I/O in a transaction. We can do that. There have been ideas on how to deal with it, and actually we will deal with it later, but the basic idea is this. The problem with I/O is that once it's done -- let's say output is done and out of the system -- we cannot undo it. That's the key point. One way to deal with it is to defer the I/O until the moment we're sure that the current transaction is going to commit. That's one way. If we cannot guarantee that, and we cannot know whether the transaction will commit or not, then the system forces that transaction to commit. For example, if there's any other transaction that conflicts with this transaction, we make sure the other transaction is the one rolled back. >>: Well, that's not what I'm worried about. So if you have a read that has side effects (inaudible) read is necessary to the transaction. And you can't roll it back. >> JaeWoong Chung: Oh, a read that has side effects, right. In that case -- >>: Let me finish. Okay. So is it possible in your system that you have one of those reads in the same area where you've got to do this taint bit checking, so you can't compose it in a transaction because you'll screw up your I/O, but since you can't encapsulate it within a transaction you're exposed to race conditions that will allow a security violation? >> JaeWoong Chung: Two answers. First, in this typical example, we're going to add transactions to deal with our own issue, and a transaction can be as small as one original instruction and one taint bit instruction, so we don't need to worry about having one long transaction that contains all the complex issues; we can split transactions as small as possible.
And going back -- well, if we use transactions to deal with, let's say, DIFT, then that I/O operation is actually a security checkpoint, so we're going to check the taint bit before the I/O operation, and we can safely take that instruction out of the transaction boundary. That's another way to do it. >>: Well, I guess what I'm asking is: can there be an I/O operation, a swap in memory-mapped I/O space, that requires a taint bit to be checked as the next instruction? The generator of the taint bit as well as the thing it needs to be coupled with -- >> JaeWoong Chung: As long as the data we are accessing is in memory, we can data-version it. If it's outside, we cannot deal with it; in that case we move it outside the transaction. That's one way to do it. If we can defer it, we defer it; if we cannot defer it, we have to take it out of the transaction. Or, yeah, we can make sure that the transaction does not abort; it still falls into those cases. Whichever way, the big problem with I/O in a transaction is that we cannot undo the side effect. The way we handle it is to make sure we never need to undo the side effect: make sure the transaction commits. Any questions? All right. Let me speed up a little bit. So, going back to the point again: from the programmer's point of view, just use a transaction to enclose the instructions and believe that the system will deliver a sequential history of transactional and non-transactional accesses. Underneath, the transactional memory system runs transactions in parallel as long as they don't conflict. Two transactions conflict if they access the same address and one of them writes. To detect conflicts, every time a transaction accesses a new address, we check whether it's read or written and remember it, and for writes we also save the old value, so that if we have to roll back, we can restore the memory state. We also take a register checkpoint, so that if we have to roll back, we can restore the register values as well. There have been proposals on how to implement this style of transactional memory system in hardware, in software, and in a hybrid manner. So how can we use transactions for DBT? The basic idea is embarrassingly simple. If there is a memory access and a metadata access, just use a transaction to enclose them, and the transactional memory system, as a programming model, guarantees that each transaction executes atomically, while underneath it runs transactions in parallel, with concurrency control, as long as two transactions don't conflict. And it knows how to deal with nested transactions, a transaction inside another transaction. So basically it is simple, but you know, real life ain't easy like that. There is a list of problems we have to deal with. First, what would be the best way to add transactions? Is it good enough to add a transaction per original and instrumented instruction pair? No, that's going to be really heavy; transaction begins and ends are not free. What if the application already has user locks to deal with its own atomicity issues? By adding DBT transactions, do we break the original semantics of the user locks? That's another question. What if there are user transactions that overlap with DBT transactions? What if there is I/O?
We already covered it: I/O inside a DBT transaction; we just handled it. And the last thing is really tricky: what if programmers are so smart that instead of relying on standard parallel primitives such as locks and barriers, they decide to come up with their own synchronization code, and it's hard for us to understand all the semantics? How do we deal with that? I'm going to give you an example. And we also need to deal with the runtime overhead of using transactions. So let's go over the problems one by one. First, the granularity of transaction instrumentation. >>: Are you using software transactional memory? >> JaeWoong Chung: Oh, yeah. In the beginning I'm going to give you results with a software transactional memory system, and later in this talk I'll also provide numbers for hardware acceleration. So the starting point is a software transactional memory system. So, the granularity of transaction instrumentation. Well, if we add a transaction per original instruction, the overhead is too high; that's easy to understand. A better way is a transaction per basic block, so that we can amortize the transaction begin overhead. And some DBT frameworks provide trace instrumentation. You know what a trace is: multiple basic blocks that happen to execute back to back, so we can amortize the overhead further. But the problem is that because we are making transactions longer, we will see a higher chance of transaction conflicts. So the final version is profile-based: we start with a transaction per trace, and if the transaction conflict rate goes too high, we re-optimize the code with dynamic binary translation to shrink the transaction size. All right. So what if there are user locks overlapping a DBT transaction? We first consider two cases. One: the locked region is so small that it falls entirely inside the transaction. In that case, the lock variable is just a normal shared variable, so the lock acquisition code touching the lock variable will incur a transaction conflict. If two threads try to enter this locked region inside transactions, it turns into a transaction conflict, and we roll back one of the transactions at the point where it tries to acquire the lock. That way we can guarantee that only one thread executes the critical section protected by the lock, and we preserve the original semantics of the lock. The other way around: if the locked region is really big and the transaction is entirely inside it, the lock already provides a pessimistic concurrency control mechanism and guarantees that only one thread executes inside the locked region, so what's the harm in running some part of that execution within a transaction, with its additional optimistic concurrency control? It's fine. Don't worry about it. The problem is: what if they overlap partially? The basic idea is that we cannot do anything about the user lock, because it's part of the original program. So we split the DBT transaction into two smaller transactions: one enclosed by the locked region, the other outside. >>: (Inaudible) of the DBT framework, that it provides essentially only per-instruction operations, that it doesn't provide any way of adding metadata that spans multiple instructions?
And the reason I ask is because that seems to me -- is that an assumption of the talk, right, that in the framework you get one instruction, you add some metadata instructions, but you never get multiple instructions at once that you can operate on? >> JaeWoong Chung: No, no, no. If we go for a transaction per basic block or per trace, that's when we have multiple instructions and multiple metadata accesses, and actually that's what we did in the experiments. >>: So this is really about the semantics given to the person using the DBT framework? The reason I ask is because I see the appeal of being able to take multiple instructions in the input and then have a single metadata instruction at the end that executes in an all-or-nothing manner, but then saying that we're going to split transactions whenever necessary in order to solve other problems -- it seems like there has to be some granularity at which you stop splitting. >> JaeWoong Chung: Oh, that perfectly makes sense. In this case we didn't consider the semantic requirements of the instrumented instructions. We only provide that transactions can be made bigger to cover such cases, but we didn't actually study how far we can split them into smaller parts. In this study, yes, you're right, we assume that we can split as small as possible. And actually the smallest transaction size we used in our experiments was a transaction per basic block. Okay. Moving on to the next one: user transactions overlapping with DBT transactions. Well, that's the same story. If they are fully nested, it doesn't matter whether it's the user transaction inside the DBT transaction or the other way around; the transactional memory system knows how to deal with nested transactions. If they partially overlap, again, we split the DBT transaction, so that one part goes inside the user transaction and the other stays outside it. I/O: we already covered it. One way to deal with I/O in a transactional memory system in general is to defer the I/O until the moment we're sure the transaction will commit. The other way is to make sure that the transaction with I/O commits, so that we never need to abort the I/O operations. This effectively causes serialization of transactions. It's not good in terms of performance, but at least it guarantees correctness. Synchronization is tricky. Let me walk you through this simple example. There are two threads, and with these simple instructions what they try to achieve is a barrier. Thread two sets the done bit, saying "I'm done." And then thread one reads the bit and comes out of the while loop. In this way they have an implicit dependency. >>: I'm sorry to interrupt. I want to go back to that point. Maybe I'm beating a dead horse here. >> JaeWoong Chung: That's okay. I/O. Okay. >>: You said that one way to deal with it is to guarantee the transaction completes. What if you have two transactions on different threads that both get that guarantee, and they have a conflict? >> JaeWoong Chung: Then we have to serialize the transactions. Again, it's serialization: we make sure that one transaction commits, which means we cannot give the same guarantee to another transaction. There will be only one thread executing I/O with that guarantee. >>: How do you provide that serialization? >> JaeWoong Chung: Well, by making sure that if there are two transactions that did I/O operations, we're going to just roll one back.
So ideally -- >>: If you have side effects -- >> JaeWoong Chung: Okay. Even before we execute the side effect, somehow we have to detect it first; that's a good question. When we did the implementation, the Pin framework actually provided an API to decide whether an access goes out or stays in, and in our applications we didn't see many such cases, because these multithreaded applications mainly go to memory. But if we can detect it even before we allow the side effect, we make sure that that transaction is the only transaction with that guarantee in the system, at least for the process. >>: So you address the problem by not having it in your experiments? >> JaeWoong Chung: In the experiments, yes, you're right. In these experiments we haven't seen the problem. But dealing with I/O, you can easily imagine, is a big issue in the transactional memory literature generally, and there seem to be pretty reasonable ways to deal with it. I can refer you to the paper by Craig Zilles (phonetic) analyzing transactional memory systems and critical sections. So, back to the barrier example: the dependency goes one way between one pair of instructions and the other way around between the other pair, and this circular dependency is what implements the semantics of the barrier. The problem is, if we don't understand what's going on at the application layer and just put transactions around this code, we're in trouble, because no matter what we do, the transactional memory system serializes transactions, so in the serialized history either instructions two and three come first, or one and four come first. There is no way to serialize these two transactions. So it's a problem, and typically in a real implementation it shows up as livelock: two transactions kill each other without making progress. The way we deal with it is to first detect that such a case is happening. The way we detect it is a timeout: we constantly check whether a specific transaction is making progress by counting how many times it has rolled back in a row. If it reaches five or ten, we say there is no hope for this transaction to make progress, and we re-optimize the code to use a transaction per basic block. By the definition of a basic block, once we get to the beginning of the basic block we can reach its end, where there's a transaction end, so we don't have the circular dependency inside one transaction. >>: (Inaudible). >> JaeWoong Chung: Yes. Yes. We use (inaudible) not only for performance but also to detect these kinds of cases. >>: Can you guarantee that you cover all possible scenarios? >> JaeWoong Chung: The important thing is that if we go down to a transaction per basic block, we are guaranteed that there will be no such dependency. Look at this while loop, the waiting condition: it actually consists of multiple basic blocks. And the key is -- okay, this is better: from the transactional memory perspective, as long as we start a transaction and finish the transaction, it's okay, we are making forward progress. Right? So if we add a transaction per basic block, then by the definition of a basic block, once we jump to its beginning we are guaranteed to go down through it to its end, and the transaction begin and end will be at the beginning and the end of the basic block. Okay. >>: (Inaudible). >> JaeWoong Chung: High-level question, open question.
Well, you know, that's the reason why I put it as the last one. It's not easy to deal with. But at least we have a solution for this case. Once it happens, the system is going to slow down, because we have a transaction per basic block, but -- the best expression is "for the greater good," what can I say: for the common case we optimize, and this is the exception. >>: Could you deal with it some other way? Like, why do you have to use transactions? >> JaeWoong Chung: Basically, we don't want the tool programmer to worry about finding this pattern and trying to use another mechanism for it. We wanted to make sure that the tool developer just adds transactions, and we provide a way to dynamically optimize the transaction size, not only for performance but also to deal with these cases, so that from the tool developer's point of view they don't need to find this case and handle it separately. That was the key point: making DBT tool development easy, with reasonable performance degradation. That's the key, I think. Of course there is a chance to use another mechanism. >>: Different question. >> JaeWoong Chung: Okay. So we're done with how to instrument the transactions and how they interact with user code. Now let's talk about baseline performance. We used Pin with multithreading support, we implemented DIFT as a Pin tool, and we implemented a software transactional memory system to begin with. We used x86 (inaudible), and we ran the experiments on a real machine. We used nine applications: six from SPLASH-2, three from SPEComp. In the graph, the X axis is the application and the Y axis is the runtime overhead, where we compare -- >>: Do these applications have taint (inaudible)? >> JaeWoong Chung: Taint? The security -- >>: (Inaudible). >> JaeWoong Chung: Oh, in this case, yes. Actually it reads from files. We didn't start from the network; it only reads from outside sources: it reads from files, and it reads from the command line. We started tainting from there. But not from the network, that's right. The key is that it doesn't matter where the data comes from; what's important is to check whether it's data from outside or not. Yeah. Okay. Good, good. Any other questions? Okay. So, wow, time is running fast. So the X axis is the application, the Y axis is the overhead, where we compare two things. The baseline is the DBT framework running with the application and the DIFT tool, and then we add the software transactional memory system and measure the overhead of adding the software transactional memory system to deal with the atomicity issue. >>: (Inaudible). >> JaeWoong Chung: Yeah. Okay. So without the software transactional memory system, the baseline is running the Pin tool with the DIFT implementation and the application. That's the baseline performance, meaning the results are normalized to that execution. >>: (Inaudible). >> JaeWoong Chung: Yes. Yes. Or in our case it depends on the application, but in our case, yes. And we compare against that the overhead that comes from using the framework, DIFT, the application, and the software transactional memory system to deal with the atomicity issue, and we measure how much overhead we add by doing that. >>: (Inaudible). >> JaeWoong Chung: (Inaudible). >>: (Inaudible). >> JaeWoong Chung: Yes. >>: So then the zero percent overhead, that's a system that's incorrect, because it doesn't -- >> JaeWoong Chung: Yes. We didn't do anything for the atomicity issue. That's the baseline. Yes.
Well, you know, with the baseline we can hope that it runs correctly, but no guarantee. So, 41 percent -- in this example we actually ran -- oh, wait a minute. Going back to your question: was the overhead DIFT? I'm sorry, that was the wrong answer. In both cases we do run DIFT. But in the case where we measure the overhead, we additionally use the software transactional memory system. So the overhead is not DIFT; the overhead is the software TM solution. >>: That's what I thought you said. >> JaeWoong Chung: Good, you corrected my answer naturally. >>: The amount of tainted data that is actually shared between threads -- do you have a sense of that? >> JaeWoong Chung: Yes, yes, yes. It depends on how you pack the taint bits. It's not only the sharing pattern of the user accesses; it's also how you pack the taint bits, so even if accesses go to different words, their taint bits can end up packed into one word. In our case we packed the bits densely, one taint bit per byte of data, so even without user-level shared accesses there is some level of sharing of the taint bits. That's the first thing. And, yes, in these applications there's frankly not much sharing in the data accesses, but we instrumented all of the code, for example even the barriers and the other synchronization variables, and yes, we were able to see conflicts on shared metadata accesses. Anyway, the baseline result: 41 percent. 41 percent is actually not that good -- not too bad, not too good. Well, considering the other option, serializing the multithreaded input and executing threads one by one, this is much better. But there is still a chance for optimization. So let's first see where this transaction overhead comes from. Three categories, basically. First, the cost to start and end a transaction: we have to take a register checkpoint and we have to initialize the transaction metadata. Second, per memory access inside the transaction, we have to remember which addresses are read or written by the transaction, so that we can detect conflicts between transactions, and for writes we also have to remember the old values in memory, so that if we have to roll back, we can restore the memory image. The third type is transaction abort, where an actual conflict occurs and we have to undo all the work done by the aborting transaction, plus the work of the abort itself, for example applying the saved values we preserved and restoring the checkpoint. In our tests these applications are very fair, so the transaction abort overhead really goes down to 0.03 percent. So we focus on the first two overheads. First, the transaction begin and end overhead. This is quite easy; the basic strategy is to amortize it. In this graph, the X axis is the number of basic blocks per transaction, the Y axis is the same normalized runtime overhead, and it's fairly obvious that the more basic blocks we have in a transaction, the lower the overhead goes, due to amortization. Next, the per-memory-access overhead. Up to this point I just said: to use transactions, all the DBT has to do is add a transaction begin and end. That's true for a hardware transactional memory system, but it's not true for a software transactional memory system. Because the software transactional memory system must know which addresses are written or read, we have to add software barriers to let the software transactional memory system know about those memory accesses.
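For illustration, here is a minimal sketch of what such software barriers might record, using an undo-log style of data versioning. The structure and names are invented for this example and are far simpler than a real STM.

    #include <cstdint>
    #include <set>
    #include <vector>

    // Hypothetical per-transaction metadata for a simple undo-log STM.
    struct Transaction {
        std::set<uintptr_t> read_set;    // addresses read, for conflict detection
        std::set<uintptr_t> write_set;   // addresses written, for conflict detection
        struct UndoEntry { uint8_t* addr; uint8_t old_value; };
        std::vector<UndoEntry> undo_log; // old values, for data versioning
    };

    // Read barrier: remember the address so a conflicting writer can be detected.
    static uint8_t stm_read(Transaction& tx, uint8_t* addr) {
        tx.read_set.insert(reinterpret_cast<uintptr_t>(addr));
        return *addr;
    }

    // Write barrier: remember the address and save the old value before updating.
    static void stm_write(Transaction& tx, uint8_t* addr, uint8_t value) {
        tx.write_set.insert(reinterpret_cast<uintptr_t>(addr));
        tx.undo_log.push_back(Transaction::UndoEntry{addr, *addr});
        *addr = value;
    }

    // On abort, restore the saved values in reverse order and forget the sets.
    static void stm_abort(Transaction& tx) {
        for (auto it = tx.undo_log.rbegin(); it != tx.undo_log.rend(); ++it)
            *(it->addr) = it->old_value;
        tx.read_set.clear();
        tx.write_set.clear();
        tx.undo_log.clear();
    }

    int main() {
        uint8_t x = 1;
        Transaction tx;
        uint8_t v = stm_read(tx, &x);  // records &x in the read set
        stm_write(tx, &x, v + 1);      // x becomes 2; old value 1 is logged
        stm_abort(tx);                 // rollback: x is restored to 1
        return x == 1 ? 0 : 1;
    }

Every load and store inside a transaction has to be rewritten by the DBT to go through barriers like these, which is exactly the per-memory-access overhead discussed next.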
So we add barriers, and this adds performance overhead. Typically two kinds of work happen in a barrier: first, conflict detection, by recording the addresses read or written by the transaction; and second, data versioning, by saving the old value when the transaction writes to an address. But this definition gives us two hints. First, if the variable itself is not shared -- it's private -- then from the beginning we don't need to do anything for conflict detection. And the second hint is, if it's a stack variable, maybe just by adjusting the stack pointer we can simply discard any modification done to the stack variable. So there are optimization opportunities. So we came up with a categorization tree for the different memory access types, where a yellow dot shows that we need data versioning for that case and a red dot shows that we need conflict detection for that case. First we analyze whether a variable is a stack variable or not. If it is, we check whether that stack variable is allocated after the transaction begins or not. If it's a stack variable allocated after the transaction begins, we don't need conflict detection, because it's private, and we don't need data versioning, because if we have to roll back this transaction we are going to restore the register values, which include the stack pointer, and by restoring the stack pointer we naturally undo the allocation of the stack variable. So we don't need to remember anything about modifications to that stack variable. But if the stack variable was allocated before the transaction started, then even after we roll back the transaction the stack variable will still be there, so we have to remember any modifications to it. If it's a stack variable that escapes, or it's not a stack variable, we check whether it's private. There are many analyses to decide whether a variable is private or not. For example, in object-oriented programming we allocate an object with new. At the moment the object is allocated, it's private, and until the moment it is publicized through a globally reachable reference, it's still private. So in such ways we find that a variable is private up to some point. And if it's private, we only do data versioning; we don't need to do conflict detection, because it's private. If it's not private, we check whether it's protected with locks in the user code. If it's protected with a lock, it is already pessimistically protected, so we don't need to do conflict detection; the lock semantics guarantee that only one thread accesses that variable. So we don't do conflict detection, we only do data versioning. >>: (Inaudible). >> JaeWoong Chung: I'm sorry? >>: If the locks are tracked in the application -- just because you see a lock doesn't mean the variable is not being accessed somewhere else without a lock. >> JaeWoong Chung: Yes. Well, the goal here is not to fix problems in the application. Assuming the original application is correct, race free, we do this. >>: (Inaudible) security guarantees? >> JaeWoong Chung: Yes. >>: Okay. >> JaeWoong Chung: No -- well, I think the best way to answer your question is: yes, if the feature implementation requires additional strictness beyond the user code while using transactions, yes, we have to admit that.
But in this case we are not adding transactions for security features in general; we are trying to add transactions to -- >>: (Inaudible) performance, but if I did something like adding a fine-grain lock for each memory access to that address, then I would at least know that my change to the taint bit would be atomic with the access to the address. You're making an assumption here about the user-level locks. >> JaeWoong Chung: Yes, user-level locks. >>: Optimizing for performance might be breaking the taint analysis. If I (inaudible) a lock randomly in one of my threads, you know, that won't make my program incorrect if the lock doesn't need to be there, but now I'm exposing myself to races on the taints. >> JaeWoong Chung: Yes. Oh, the important thing is -- >>: Is that what you're saying? >> JaeWoong Chung: Yes. The important thing is that we have to distinguish two things: races on the user data and races on the metadata. This categorization tree has nothing to do with the metadata. >>: Right. >> JaeWoong Chung: It's only for the user data. >>: (Inaudible). >> JaeWoong Chung: Yeah. So that's the reason why we still have to do data versioning. Let's say that this fine-grain locked region, as you said, is inside a bigger transaction, and for this specific access to the shared variable we don't need conflict detection. But there might be a conflict due to the metadata access right next to the original access, and if there is a conflict we have to detect it and roll back, and for the rollback we still have to do data versioning. But conflict detection on the user data, no: the lock guarantees only one thread touches it. Does that make sense? >>: (Inaudible). >> JaeWoong Chung: Okay. So, when all hope is gone, we have to do both things, both data versioning and conflict detection. The key point here is that any time we cannot analyze a specific case, we go to the right side of the tree to be on the safe side. So we didn't explore every chance for software optimization, but we tested several cases: with stack optimization it's around 36 percent, with benign races it goes down to 34 percent, and with additional optimizations it goes down further. >>: Question. (Inaudible). >> JaeWoong Chung: Yes, (inaudible) application and (inaudible). >>: The original SPLASH-2 from (inaudible) -- like, how big is your working set? >> JaeWoong Chung: Yes, yes. >>: A megabyte working set. >> JaeWoong Chung: The data size. >>: (Inaudible), right? >> JaeWoong Chung: I don't quite -- >>: 10, 15 years ago? >>: What I used in my thesis -- the machines were so small. (Brief talking over) >> JaeWoong Chung: Well, frankly, the way I did it is I increased the data set as long as it still ran within a day, so it's -- >>: How did you choose these, rather than running a traditional DBT application like a Web server or an FTP server? >> JaeWoong Chung: Oh, the reason is that those applications are more vulnerable in terms of security, but our key point is to detect and eliminate races on shared data, and those applications are embarrassingly (inaudible). So there's -- >>: The (inaudible). >> JaeWoong Chung: Yes. Maybe I introduced confusion by using the security example. I just used it as an example of how to use DBT and what kinds of problems there can be. But from our perspective we are more interested in the races, and these applications, with their shared variables, push our system further.
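For illustration, the categorization tree described a moment ago could be encoded roughly as below. The predicate names stand in for the DBT's analyses and are assumptions for this example; the point is that any case that cannot be analyzed falls through to the conservative default of doing both kinds of work.

    #include <cstdio>

    // Which STM work a memory access needs. In the slide's notation, a yellow
    // dot meant data versioning and a red dot meant conflict detection.
    struct BarrierActions {
        bool data_versioning;
        bool conflict_detection;
    };

    // Facts the DBT's analyses would have to establish about the access.
    struct AccessInfo {
        bool is_stack_variable;
        bool allocated_after_tx_begin;  // stack slot created inside the transaction
        bool escapes;                   // address leaks outside the current thread
        bool known_private;             // e.g. freshly allocated, not yet published
        bool lock_protected;            // accessed only under a user lock
    };

    static BarrierActions classify(const AccessInfo& a) {
        if (a.is_stack_variable && !a.escapes) {
            if (a.allocated_after_tx_begin)
                return {false, false};  // rollback restores the stack pointer anyway
            return {true, false};       // the slot outlives the transaction: version it
        }
        if (a.known_private)
            return {true, false};       // private: no other thread can conflict
        if (a.lock_protected)
            return {true, false};       // the user lock already serializes access
        return {true, true};            // unanalyzable: be on the safe side
    }

    int main() {
        AccessInfo unknown_shared = {false, false, true, false, false};
        BarrierActions acts = classify(unknown_shared);
        printf("versioning=%d detection=%d\n",
               acts.data_versioning, acts.conflict_detection);
    }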
So, anyway, the basic idea is: after we submitted this paper we did additional experiments, and with software optimization it can go down to 30 percent. If you're happy with 30 percent, move on. But if you're not happy, we start thinking about adding hardware support. From here we did simulation work, and we compared basically three types of hardware acceleration. One, an STM plus hardware acceleration for register checkpointing in a single instruction. Two, a hybrid TM that provides hardware support for checkpointing and conflict detection, so the hardware remembers the read set and write set and checks for conflicts. And the third type is a full hardware implementation, where everything happens in hardware. The quick summary: with the STM plus checkpointing support, the overhead is around 28 percent; with the hybrid TM it's around 12 percent; with the full HTM it goes down to 6 percent, which is pretty good. But, you know, comparing 12 percent and 6 percent, half of it is gone, yet from the total execution time point of view 6 percent is not that impressive. So we are definitely seeing some kind of diminishing return, and maybe, for using transactional memory for DBT, the hybrid approach makes sense. >>: (Inaudible) on top of the -- so how much is the actual non-DIFT execution time? In other words, if you took the basic (inaudible) without DIFT at all, what (inaudible) -- >> JaeWoong Chung: Oh, what is the overhead of DIFT itself? We actually didn't measure it. The reason is that we don't care how efficient DIFT itself is. What we care about is, once such an implementation is there, how much overhead we add to deal with the atomicity issue. That was our key point. It's an interesting question, yeah. Another question? Okay. So, done. >>: (Inaudible). On the DIFT effect I suspect the answer is no, but does DIFT affect multithreaded applications worse than single-threaded applications? >> JaeWoong Chung: DIFT -- I didn't get your question. Meaning worse in terms of performance? >>: Yeah. >> JaeWoong Chung: Of course, of course -- well, multithreaded application versus single-threaded application: without worrying about the atomicity issue, they're instrumented the same way for a multithreaded application and a single-threaded application, so it's the same. Forgetting about the atomicity issues in the multithreaded application, basically DIFT is the same without worrying about -- >>: (Inaudible) so the number of, let's say, (inaudible) the number of processors at which performance is optimal or highest -- does that number change when I apply DIFT, do you think? >> JaeWoong Chung: I think it depends on how the metadata bits are packed and shared. Typically, up to now, the DIFT papers packed the taint bits densely, eight bits per byte, to minimize the memory footprint. With a single thread -- and most of those studies were done on single-threaded applications -- there's no reason to worry about the atomicity issue. But in our case that will hurt, because the metadata is actually shared by multiple threads and (inaudible) in the cache. >>: I'm not sure if I missed it, but how do you handle tainted data and (inaudible) issues? >> JaeWoong Chung: Point of (inaudible)? >>: Yes. Do you have (inaudible) to obtain the data and then you (inaudible). >> JaeWoong Chung: Oh, well, it doesn't matter whether at the application level it is (inaudible) or not. What we see is the eventual memory access.
So in the program there can be a pointer: variable A eventually turns out to be variable B, and for that variable access it's hard for the compiler to understand. But from the transactional memory system's point of view, it's just normal load and store instructions. We don't care how it is expressed in the original source code. Questions? Okay. Wow, I drew a tough audience. That's great. Well, any questions, I'll be happy to talk about after this talk. And I guess that's it. >>: (Inaudible). >> JaeWoong Chung: Oh, let me finish, okay. So, quickly: multithreaded executables are headed for DBT. We use transactions to deal with the atomicity issue. The baseline is 41 percent overhead. With software optimization it goes down to 30 percent. With hardware optimization it can go down to 6 percent, but pay attention to the diminishing returns; a hybrid approach makes sense in terms of the trade-off with (inaudible) hardware cost. That's it. Thank you. (Applause)