>> Onur Mutlu: Hi, welcome all. It's a pleasure to introduce Jae Woong Chung from
Stanford University. Jae Woong is finishing his Ph.D. pretty soon on transactional
memory and its applications, and he's done a lot of work on hardware transactional
memory and software support for it as well as applying transactional memory techniques
for different applications, and I think he's going to talk about one of those applications
today.
>> JaeWoong Chung: Thank you, Onur. Good morning. My name is JaeWoong Chung.
Today I'm going to talk about dynamic binary translation using transactional memory.
This talk was given at HPCA this year, a few months ago.
I was wondering which talk would be better to give today, my thesis talk or my conference talk, and considering the high technical level of this specific audience I decided to delve into one issue. So let's start.
Dynamic binary translation. I'm pretty sure that most of you have heard about it and know how it works. Basically, dynamic binary translation converts binary A to binary B, and the conversion is done by a combination of a DBT framework, a kind of software toolkit that provides the basic primitives for translation and application analysis, and a DBT tool that performs the specific translation.
So there are many interesting use cases of dynamic binary translation. First, cross-platform translation is a good example: binary A compiled for platform A can be converted at runtime to binary B that works on platform B. Dynamic compilation for Java and C++ is another good example. We also use dynamic binary translation for virtual machine support. And there have been many proposals on how to use dynamic binary translation for profiling, debugging, security, reliability and so on.
So let me give you an example of how we can use dynamic binary translation for security. Dynamic information flow tracking, DIFT, is a very interesting technique for security. It basically tracks the flow of untrusted data in an application. Let's say we have a server application. We cannot trust the data coming from the network, so DIFT first marks the input received from the network, and wherever that data goes in the application's execution flow, DIFT follows it, and if later on this untrusted data is used for a security-critical operation such as a system call, we trigger a security exception.
So let me give you an example of how we can implement this with dynamic binary translation. With dynamic binary translation we add metadata bits called taint bits. It is one bit per memory byte. It doesn't need to be one bit per memory byte, but typically it is, and we also add the instructions shown in yellow on this slide to maintain these metadata bits.
And let me walk you through this example. In the first step, let's say we receive untrusted data from the network and assign it to variable t. Now, in this box we see that the untrusted data is there. We execute the added taint instruction to set the corresponding taint bit, and if the data flows to another variable with a store, the next added instruction also updates the taint bits: the value is assigned, and the taint bit is assigned along with it.
The key point is that the taint bit is added and follows exactly where the data goes in the application's execution, and at the end, if variables u1 and u2 are used for security-critical operations, then by checking the taint bits the security policy can kill the process or trigger a security exception.
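(For reference, a minimal C++ sketch of the taint propagation just described, assuming a byte-granular shadow array; the names shadow, set_taint and get_taint are illustrative, not from the talk or from any real DIFT implementation.)

    #include <cstddef>
    #include <cstdint>
    #include <cstdio>

    // Hypothetical shadow memory: one taint flag per application byte.
    // Real DIFT packs one bit per byte; a whole byte is used here for clarity.
    static uint8_t shadow[1 << 16];

    static void set_taint(const void* addr, size_t n, uint8_t t) {
        for (size_t i = 0; i < n; ++i)
            shadow[(reinterpret_cast<uintptr_t>(addr) + i) & 0xffff] = t;
    }

    static uint8_t get_taint(const void* addr, size_t n) {
        uint8_t t = 0;
        for (size_t i = 0; i < n; ++i)
            t |= shadow[(reinterpret_cast<uintptr_t>(addr) + i) & 0xffff];
        return t;
    }

    int main() {
        int t = 12345;                 // pretend this arrived from the network
        set_taint(&t, sizeof t, 1);    // added instruction: mark t untrusted

        int u = t;                                         // original instruction
        set_taint(&u, sizeof u, get_taint(&t, sizeof t));  // added: taint follows

        if (get_taint(&u, sizeof u))   // security checkpoint before a syscall
            std::fprintf(stderr, "u is tainted: block the system call\n");
        return 0;
    }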
So DBT has been working fine and has proven its value, and it works very well with single-threaded applications. The problem is that now we are in the era of multicores, and multithreaded applications will be given to DBT as input. The problem is: how can you preserve the original semantics of atomicity in the translated code?
For example, let's say we use dynamic binary translation for binary compatibility, meaning converting a binary from platform A to platform B. Platform A provides a compare-and-swap instruction, but on platform B we have to emit one instruction for the compare and another instruction for the swap.
On a single-threaded core it's fine: the compare and the swap happen one after another. But with a multithreaded application, there can be a race and a mix of compare instructions and swap instructions from different threads.
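(A small C++ sketch of why the split is dangerous, assuming the translated compare and swap become two separate steps; translated_cas and the thread setup are illustrative. Each step is atomic on its own, but both threads can pass the compare before either swaps, which a real compare-and-swap would never allow.)

    #include <atomic>
    #include <cstdio>
    #include <thread>

    static std::atomic<int> x{0};

    // Hypothetical translation of platform A's atomic compare-and-swap into
    // two separate platform B instructions. Each step is atomic on its own,
    // but the pair is not: both threads can pass the compare before either swaps.
    static bool translated_cas(int expected, int desired) {
        if (x.load() == expected) {   // "compare" instruction
            x.store(desired);         // "swap" instruction (race window above)
            return true;
        }
        return false;
    }

    int main() {
        int both = 0;
        for (int trial = 0; trial < 1000; ++trial) {
            x.store(0);
            bool a = false, b = false;
            std::thread t1([&] { a = translated_cas(0, 1); });
            std::thread t2([&] { b = translated_cas(0, 2); });
            t1.join(); t2.join();
            if (a && b) ++both;   // a real CAS would never let both succeed
        }
        std::printf("both CASes 'succeeded' in %d of 1000 trials\n", both);
        return 0;
    }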
Another example: let's say we add additional instructions to add features. For example, in DIFT we add instructions to maintain the taint bits. What we expect is that each taint bit instruction is a kind of side effect of the original instruction, so the taint bit movement happens atomically, following the untrusted data movement.
And it's okay with a single-threaded application, because the instruction for the taint bit will be executed right after the original instruction. I'm going to give you an example in the next slide. But with multithreaded input, there will be races.
So how do existing tools deal with this? Up to now, DBT solutions such as StarDBT just don't allow multithreaded input -- no multithreaded input at all. Another example is Valgrind: it just serializes multithreaded execution, so only one thread executes at a time.
>>: (Inaudible) question. I don't understand why on single cores there are no difficulties with multithreaded programs, because my understanding is that one thread can run for a little while and then another thread can run for a little while.
>> JaeWoong Chung: Yes. If we just multiplex the multithreaded input onto one single core and we don't pay any attention to where the context switches happen, there will be the same problem. Yes, you're right.
But what we are focusing on here is that, well, in that case we can deal with it just by paying attention to the context switch timing. But with real multithreaded execution it's not enough to do that. We need some kind of architectural support or software technique to deal with the problem. That's the key.
>>: Okay. What (inaudible) Pin do this.
>> JaeWoong Chung: Pin provides -- I'm going to go over it. Actually we used Pin for this work, so that's good. Pin provides multithreading support for the DBT framework itself: it allows multithreaded access to the code cache, and it allows multithreaded access to trace generation, but it does nothing for the application code and the instrumentation code that run on top of the DBT framework.
And we are focusing on the code that runs on the framework -- we are focusing on running multithreaded application code together with the additional instructions.
>>: How does your model (inaudible) where the tainted data (inaudible).
>> JaeWoong Chung: Well, it actually doesn't matter for us -- we'll come back to this case -- but what we try to make sure is that, whatever the instruction is, even if there is a branch, the taint bit update, the taint bit check, and the branch happen together.
For example, let's say there is a branch. If it's a not-taken branch, execution just goes straight ahead, but there will still be a taint bit check after the branch; and if the branch is taken, we put a checking instruction at the beginning of the target code and check it there. Does that answer your question?
>>: Well, I'm worried about the tainted data changing the control flow of your program.
>> JaeWoong Chung: We're still going to guarantee the atomicity of checking the bits and then taking the branch, and if it turns out to be a dangerous execution, it's the responsibility of the implemented feature to take care of it. The problem we try to address here is the atomicity issue --
>>: I understand. What you're trying to model, you're assuming, just says okay, if you change it, that's fine --
>> JaeWoong Chung: Oh, so there are points where we have to check security. For example, where do we have to check the taint bits? One example is a jump through a return address. Another example is a system call. So basically, if there is a transfer of control flow, we have to check the value. Does it --
>>: Yes.
>> JaeWoong Chung: Those are the security checkpoints.
>>: What's your model for taint? Because some people only check (inaudible) instructions.
>> JaeWoong Chung: Yes.
>>: If your branch is based on a tainted address, then all remaining execution is tainted, and you can't make any decisions about whether or not a security violation has occurred.
>> JaeWoong Chung: Well --
>>: Talk about your model (inaudible).
>> JaeWoong Chung: Yeah, yeah, I see. So the points we check, in our experiments, are any branch instruction and any system call. That's basically the two. There are more minor cases, but basically it's two: any control flow change and any system call. That's it.
Okay. Any questions? Good.
As expected, a tough audience. Okay, let's move on. So let me show you why it's a hard problem. Again, repeating the problem, going back to the original case: the problem is how we can guarantee atomicity after we use DIFT to add instructions. We expect the original instruction and the instrumented instruction to execute like one atomic instruction, but with multithreaded input there will be a race.
This example is a very watered-down example that we actually got from (inaudible), and they avoided using locks, relying on the fact that the swap instruction is atomic and that an (inaudible) store is also atomic on most modern processors.
I'll skip all the implementation code; the key is here. Thread one is executing a swap, thread two is executing an (inaudible) store. It doesn't matter which instruction is executed first; the data structure stays consistent.
Now, we did the same instrumentation for DIFT, and we added instructions just as we would for a single-threaded application: we added the taint bits, we added the instructions. In the beginning, variable t has untrusted data and its taint bit is set.
Now, in the first step we store the data. When we instrumented the code, we expected the taint bit update to happen right next to the original instruction, and that is what happens in a single-threaded application. But with a multithreaded application this is a shared structure, so there can be a race, and the store to the (inaudible) happens in between. From the original program's point of view that's okay; it doesn't matter.
But once we add the instructions for security, it becomes a problem, because the taint bit should be copied following the variable assignment, but nothing has happened yet, and then the swap's taint bit store happens. So at the end, if variable u2 is used for a security-sensitive system call and the security policy checks the taint bit, the bit is not set, the system call will be allowed, and it's a security breach.
Again, the key point: due to the race, there is a chance that the atomicity expected by the feature implementer will break, and then there will be cases where the metadata does not follow the original user data, with significant consequences such as, in this case, a security breach.
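(A C++ sketch of the interleaving just described, with illustrative variable names; atomics are used only to keep the demonstration well defined. Each thread runs an original store followed by its taint update, and because the pair is not atomic, the data and its taint bit can end up inconsistent.)

    #include <atomic>
    #include <cstdint>
    #include <cstdio>
    #include <thread>

    static std::atomic<int>     u_data{0};
    static std::atomic<uint8_t> u_taint{0};   // shadow taint bit for u

    int main() {
        // Thread 1: instrumented copy of untrusted data into u.
        std::thread t1([] {
            u_data.store(42);            // original instruction (42 is tainted)
            std::this_thread::yield();   // widen the race window
            u_taint.store(1);            // added instruction: set taint bit
        });
        // Thread 2: instrumented store of a trusted constant into u.
        std::thread t2([] {
            u_data.store(0);             // original instruction
            u_taint.store(0);            // added instruction: clear taint bit
        });
        t1.join(); t2.join();
        // Possible bad outcome: u_data == 42 (untrusted) with u_taint == 0,
        // so a later security check would let the system call through.
        std::printf("u_data=%d u_taint=%d\n", u_data.load(), (int)u_taint.load());
        return 0;
    }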
>>: I have one more question. Do you address which classes of memory consistency
models would have this vulnerability and which would not?
>> JaeWoong Chung: Oh, in this case, we just assume sequential consistency.
>>: So which ones would not (inaudible).
>> JaeWoong Chung: We haven't studied other consistency models like weak consistency, because we just stick to sequential consistency: one instruction at a time, and all instructions appear in order in one global history. That's it. Does that address your question?
>>: Yes.
>> JaeWoong Chung: Hopefully. I see some doubt on your face, but we can talk offline. Okay. Any other questions?
Okay. So, why not deal with this atomicity issue with locks? We've been dealing with atomicity issues with locks forever. Can we do it with locks? Of course we can. The basic idea is simple: use lock and unlock to enclose the data access and the metadata access, and that's actually it in terms of correctness. But it turns out to be really hard to meet a good trade-off between performance and ease of programming.
First, if we go for fine-grain locking, where we use one lock to cover just one original instruction and one added instruction, we have a huge overhead just to acquire and release the lock, and there will be the typical difficulties of using fine-grain locks, such as (inaudible). But what we found is that the bigger performance impact is that DBT actually optimizes the code after it instruments the new instructions -- copy propagation, inlining, register renaming -- and the lock and unlock instructions act as memory barriers and limit the scope of DBT optimization. That really hurts.
And if you don't like fine-grain locking, the alternative is obviously coarse-grain locking. But the difficulty of using coarse locks for DBT is that we instrument all of the code, which basically makes the entire code region a critical section. If we use a coarse-grain lock for this purpose, it's easy to predict a huge performance degradation.
But the really tricky thing, I think, is lock nesting. We are dealing with multithreaded applications, and the user application must already be dealing with its own atomicity issues using its own locks. Now we are adding DBT locks to deal with our own problem, the atomicity between user data accesses and metadata accesses, and in general it's not going to be easy to prove that there will be no deadlock due to lock nesting and the order of acquiring user locks and DBT locks.
>>: (Inaudible) acquired another lock and DBT lock?
>> JaeWoong Chung: Well, if we understand one hundred percent what's going on with the application semantics, yes, we can avoid it. But from our study, it's going to be hard to do in general.
>>: Okay. But if you were at least using a fine grain version --
>> JaeWoong Chung: Oh, yeah, no problem, no problem. Basically, if we use one lock for one original instruction and one instrumented instruction, as you said, no problem.
>>: (Inaudible) coarse grains, okay, right.
>> JaeWoong Chung: Yes, yes. Okay, so I got your question. Good.
So the hardest thing, actually, is lock programming for code where all of the code is a critical section: it's really hard to come up with a smart multithreaded program there. If we ask the feature implementer -- such as the DIFT implementer for security in the previous example -- to also deal with the multithreading issues, we are asking too much of those people.
Otherwise the DBT tool developer would have to be not only an expert on the specific feature but also an expert on multithreading.
So our solution is basically very simple: use transactional memory. Transactional memory -- I'm pretty sure most of you have heard about it. Basically, it provides atomic and isolated execution of a group of instructions. Atomicity is the typical all-or-nothing: either all of the instructions in the group execute or none do. And isolation is typical again: no intermediate result from the instructions is exposed to the rest of the system until all of the instructions have executed.
So from the programmer's point of view, you just use a transaction to enclose a group of instructions, and you can think of that transaction as logically executed sequentially against the other transactions and the non-transactional accesses.
>>: Is it possible -- so you can't put a statement that affects I/O in a transaction?
>> JaeWoong Chung: I/O in a transaction. We can do that; there have been ideas about how to deal with it. We'll actually come back to it, but the problem with I/O is that once it's done -- let's say output has gone out of the system -- we cannot undo it. That's the key point.
And one way to deal with it is to defer the I/O until the moment we're pretty sure the current transaction is going to commit. That's one way. If we cannot defer it, and we cannot know whether the transaction will commit or not, then the system will force that transaction to commit. For example, if there's any other transaction that conflicts with this transaction, we'll make sure the other transaction is the one that gets rolled back.
>>: Well, that's not what I'm worried about. So if you have a read that has side effects, (inaudible) read is necessary to the transaction. And you can't roll it back.
>> JaeWoong Chung: Oh, a read that has side effects, right. In that case --
>>: Let me finish. Okay. So is it possible in your system that you have one of those reads in the same area where you've got to do this taint bit checking, so you can't compose it in a transaction because you'll screw up your I/O, but since you can't encapsulate it within a transaction, you're exposed to race conditions that will allow a security violation?
>> JaeWoong Chung: Two answers. First, in this typical example, we're going to add transactions to deal with our own issue, and a transaction can be as small as containing one original instruction and one taint bit instruction, so we don't need to worry about having a long transaction with all those complex issues; we can split transactions as small as possible. And going back -- well, if we use transactions to deal with, let's say, DIFT, then that I/O operation is actually a security checkpoint, so we're going to check the taint bit before the I/O operation, and then we can safely take that operation out of the transaction boundary. That's another way to do it.
>>: Well, I guess what I'm asking is, there can be an I/O operation -- a swap in memory-mapped I/O space -- that will require a taint bit to be checked as the next instruction. Is the generator of the taint bit, as well as the thing it needs to be coupled with --
>> JaeWoong Chung: As long as the data we are accessing is in memory, we can data-version it. If it's outside, we cannot deal with it.
In this case we move it outside of the transaction. That's one way to do it.
If we can defer it, we defer it; if we cannot defer it, we have to take it out of the transaction. Or we can make sure that the transaction does commit; it still falls into that case. Whichever it is, the big problem with I/O in a transaction is that we cannot undo the side effect. The way we deal with it is to make sure we never need to undo the side effect: make sure the transaction commits.
Any questions? All right. Let me speed up a little bit.
So, going back to the point again: from the programmer's point of view, you just use transactions to enclose instructions and believe there will be a sequential history of transactional and non-transactional accesses. But under the hood, the transactional memory system runs transactions in parallel as long as they don't conflict. Two transactions conflict if they access the same address and at least one of them writes it.
To detect conflicts, every time a transaction accesses a new address we check whether it is read or written and remember it, and for writes we also save the old value, so that if we have to roll back, we can restore the memory state.
We also take a register checkpoint, so that if we have to roll back, we can restore the register values as well.
There have been proposals on how to implement this style of transactional memory system in hardware, in software, and in hybrid manners.
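(A toy C++ illustration of this bookkeeping -- a read set and write set for conflict detection, an undo log for data versioning, and setjmp standing in for the register checkpoint; it is shaped by the description above, not by any particular TM system.)

    #include <csetjmp>
    #include <cstdint>
    #include <map>
    #include <set>

    struct Txn {
        std::set<uintptr_t>           read_set;    // for conflict detection
        std::set<uintptr_t>           write_set;
        std::map<uintptr_t, intptr_t> undo_log;    // address -> old value
        jmp_buf                       checkpoint;  // stands in for registers
    };

    static intptr_t tx_read(Txn& tx, intptr_t* addr) {
        tx.read_set.insert(reinterpret_cast<uintptr_t>(addr));
        return *addr;
    }

    static void tx_write(Txn& tx, intptr_t* addr, intptr_t v) {
        tx.write_set.insert(reinterpret_cast<uintptr_t>(addr));
        tx.undo_log.emplace(reinterpret_cast<uintptr_t>(addr), *addr);
        *addr = v;   // eager update; the old value was saved just above
    }

    static void tx_abort(Txn& tx) {
        for (auto& [addr, old] : tx.undo_log)      // restore memory image
            *reinterpret_cast<intptr_t*>(addr) = old;
        std::longjmp(tx.checkpoint, 1);            // "restore the registers"
    }

    int main() {
        static intptr_t x = 1;
        static Txn tx;                 // static: safe to touch after longjmp
        if (setjmp(tx.checkpoint) == 0) {
            tx_write(tx, &x, 42);
            tx_abort(tx);              // pretend a conflict was detected
        }
        return x == 1 ? 0 : 1;         // x was restored by the undo log
    }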
So how can we use transactions for DBT? The basic idea is embarrassingly simple. If there is a memory access and a metadata access, just use a transaction to enclose them, and the transactional memory system, as a programming model, will guarantee that each transaction executes atomically. Under the hood, the transactional memory system runs transactions in parallel, with optimistic concurrency control, as long as two transactions don't conflict. And it knows how to deal with nested transactions -- a transaction inside another transaction.
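(A minimal C++ sketch of the idea, where a single global mutex stands in for transaction begin and end. It gives the atomicity described here, though a real transactional memory system would run non-conflicting transactions in parallel instead of serializing them like this lock does.)

    #include <cstdint>
    #include <mutex>

    static std::mutex tm;                    // stand-in for the TM system
    static int t = 42, u = 0;                // application data (t is untrusted)
    static uint8_t taint_t = 1, taint_u = 0; // shadow taint bits

    static void translated_copy() {
        std::lock_guard<std::mutex> txn(tm); // txn_begin ... txn_end
        u = t;                               // original instruction
        taint_u = taint_t;                   // metadata instruction, same txn
    }

    int main() {
        translated_copy();
        return (taint_u == 1) ? 0 : 1;       // taint followed the data
    }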
So basically it is simple, but you know, real life ain't easy like that. There is a list of problems we have to deal with. First, what is the best way to add transactions? Is it good enough to add a transaction per original-plus-instrumented instruction pair? No, that's going to be really heavy; transaction begins and ends are not free.
What if the application already has user locks to deal with its own atomicity issues? By adding DBT transactions, do we break the original semantics of the user locks? That's another question. What if user transactions overlap with DBT transactions?
What if there is I/O inside a DBT transaction? We just covered the general idea.
And the last one is really tricky: what if the programmers are so smart that, instead of relying on standard parallel primitives such as locks and barriers, they decide to come up with their own synchronization code, and it's hard for us to understand its semantics? How do we deal with that? I'm going to give you an example.
And we also need to deal with the runtime overhead of using transactions.
So let's go over the problems one by one. First, granularity of transaction instrumentation.
>>: Are you using software transactional memory?
>> JaeWoong Chung: Oh, yeah. In the beginning we're going to give you results with a software transactional memory system, and later in this talk we'll also provide numbers with hardware acceleration.
So the starting point is a software transactional memory system. Granularity of transaction instrumentation: if we add a transaction per original instruction, the overhead is too high; that's easy to understand. A better way is a transaction per basic block, so that we can amortize the transaction begin overhead. And some DBT frameworks provide trace instrumentation -- you know what a trace is, multiple basic blocks that happen to execute back to back -- so we can amortize the overhead further. But the problem is that because we are making transactions longer, we see a higher chance of transaction conflicts.
So the final version is profile-based: we start with a transaction per trace, and if the transaction conflict rate goes too high, we re-optimize the code with dynamic binary translation to shrink the transaction size.
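(A C++ sketch of this profile-driven policy; the structure, names and the threshold of ten are illustrative, not taken from the paper.)

    #include <cstdio>

    // Per-trace profile: if a trace's transaction keeps aborting, re-instrument
    // the trace with one transaction per basic block instead of per trace.
    struct TraceProfile {
        int  consecutive_aborts = 0;
        bool per_basic_block    = false;
    };

    static void on_commit(TraceProfile& p) { p.consecutive_aborts = 0; }

    static void on_abort(TraceProfile& p) {
        if (++p.consecutive_aborts > 10 && !p.per_basic_block) {
            p.per_basic_block = true;   // trigger DBT re-optimization
            std::printf("shrinking: one transaction per basic block\n");
        }
    }

    int main() {
        TraceProfile p;
        for (int i = 0; i < 12; ++i) on_abort(p);  // simulated conflict storm
        on_commit(p);
        return 0;
    }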
All right. So what if there are user locks overlapping the DBT transactions? We first consider two cases. One: the locked region is so small that it falls entirely inside a transaction. In that case, the lock variable is just a normal shared variable, and the lock acquisition code that accesses the lock variable will incur a conflict.
So if two threads try to access this locked region inside transactions, it turns into a transaction conflict, and we roll back one of the transactions at the point where it tries to acquire the lock. That way we guarantee that only one thread executes the critical section protected by the lock, and we preserve the original semantics of the lock.
The other way around: if the locked region is really big and the transaction is inside it, the lock already provides a pessimistic concurrency control mechanism and guarantees that only one thread executes inside the locked region. What's the harm of running some part of that execution with a transaction, with additional concurrency control on top? It's fine; don't worry about it.
The problem is: what if they overlap partially? The basic idea is that we cannot do anything about the user lock, because it's part of the original program. So we split the DBT transaction into two smaller transactions: one enclosed by the locked region, the other outside.
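(A C++ sketch of the partial-overlap split, again with a mutex standing in for the DBT transaction; the only point here is where the transaction boundaries land relative to the user's own lock.)

    #include <mutex>

    static std::mutex tm;          // stand-in txn_begin/txn_end, as before
    static std::mutex user_lock;   // the application's own lock, untouched

    // The DBT transaction originally spanned the user's lock() call. Since we
    // cannot move the user's lock, the DBT transaction is split in two: one
    // part ending before lock(), one part running inside the locked region.
    static void instrumented_region() {
        {
            std::lock_guard<std::mutex> txn1(tm);   // DBT transaction #1
            // original instruction + metadata instruction ...
        }
        std::lock_guard<std::mutex> ul(user_lock);  // user lock, unchanged
        {
            std::lock_guard<std::mutex> txn2(tm);   // DBT transaction #2
            // original instruction + metadata instruction ...
        }
    }

    int main() { instrumented_region(); return 0; }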
>>: (Inaudible) of the DBT framework that it provides essentially only per-instruction operations, that it doesn't provide any way of adding metadata that spans multiple instructions? And the reason I ask is because that seems to me -- and is that an assumption of the talk, right, that in the framework you get one instruction, you add some metadata instructions, but you don't ever get multiple instructions at once that you can operate on?
>> JaeWoong Chung: No, no, no. If we go for a transaction per basic block or per trace, that's when we have multiple instructions and multiple metadata instructions, and actually that's what we did in the experiments.
>>: So this is really about the semantics that are given to the person using the DBT framework? The reason I ask is because I see the appeal of being able to take multiple instructions in the input and then have a single metadata instruction at the end that executes in an all-or-nothing manner. But then saying that we're going to split transactions whenever it's necessary in order to solve other problems -- we no longer have that. It seems like there has to be some granularity at which you stop splitting.
>> JaeWoong Chung: Oh, that makes perfect sense. In this case we didn't consider the semantic requirements of the instrumented instructions. We only provide that transactions can be made bigger to cover such cases, but we didn't actually study how far we can split them into smaller parts. In this study, you're right, we assume we can split as small as needed. And actually the smallest transaction size we used in our experiments was a transaction per basic block.
Okay, moving on to the next one: user transactions overlapping with DBT transactions. Well, it's the same story. If they are fully nested, it doesn't matter whether the user transaction is inside the DBT transaction or the other way around; the transactional memory system knows how to deal with nested transactions. If they partially overlap, again we split the DBT transaction, so that one part goes into the user transaction and the other comes out of it.
I/O: we already covered it. One way to deal with I/O in general in a transactional memory system is to defer the I/O until the moment we're sure the transaction will commit. The other way is to make sure that the transaction containing the I/O commits, so that we never need to abort the I/O operations. This actually causes serialization of transactions; it's not good in terms of performance, but at least it guarantees correctness.
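(A small C++ sketch of the defer-until-commit idea; Txn, tx_commit and the other names are illustrative, not an API from the talk.)

    #include <cstdio>
    #include <functional>
    #include <vector>

    // Illustrative transaction record: output performed inside a transaction
    // is queued, and the queue is flushed only once commit is certain, so the
    // I/O never needs to be undone.
    struct Txn {
        std::vector<std::function<void()>> deferred_io;
    };

    static void tx_write_stdout(Txn& tx, const char* msg) {
        tx.deferred_io.push_back([msg] { std::fputs(msg, stdout); });
    }

    static void tx_commit(Txn& tx) {
        for (auto& io : tx.deferred_io) io();   // safe: commit is certain now
        tx.deferred_io.clear();
    }

    static void tx_abort(Txn& tx) {
        tx.deferred_io.clear();   // nothing ever reached the outside world
    }

    int main() {
        Txn ok;
        tx_write_stdout(ok, "committed output appears\n");
        tx_commit(ok);

        Txn bad;
        tx_write_stdout(bad, "aborted output never appears\n");
        tx_abort(bad);
        return 0;
    }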
Synchronization is tricky. Let me walk you through this simple example: two threads, and with simple instructions what they try to achieve is a barrier. Thread two sets the done bit, saying "I'm done." And then thread one reads the bit and comes out of the while loop. In this way they have an implicit dependency.
>>: I'm sorry to interrupt. I want to go back to that point. Maybe I'm beating a dead
horse here.
>> JaeWoong Chung: That's okay. I/O. Okay.
>>: You said that one way to deal with it is to guarantee it completes. What if you have two transactions on different threads that both need that guarantee, and they have a conflict?
>> JaeWoong Chung: Then we have to serialize the transactions. Again, it's serialization: we make sure that one transaction commits. That means we cannot provide the same guarantee to another transaction at the same time; there will be only one thread executing I/O with that guarantee.
>>: How do you provide that serialization?
>> JaeWoong Chung: Well, by making sure that if another transaction conflicts with the transaction doing I/O, we just roll that other one back. So ideally --
>>: If you have side effects --
>> JaeWoong Chung: Okay. Even before we execute the side effect, somehow we have to detect it first; that's a good question. When we did the implementation, the Pin framework actually provided an API to decide whether an access goes out to I/O or stays in memory, and in our applications we didn't see many such cases, because these multithreaded applications mainly go to memory. But if we can detect it before we allow the side effect, we make sure that that transaction is the only transaction with that guarantee in the system -- at least for the process.
>>: So you address the problem by not having it in your experiments?
>> JaeWoong Chung: In the experiments, yes, you're right: we haven't seen the problem there. But you can easily imagine that dealing with I/O is a big issue generally in the transactional memory literature, and the proposals there seem reasonable for dealing with it.
I can refer you to the paper by Craig Zilles analyzing transactional memory systems and critical sections.
So the dependency goes one way for the store and the other way for the load, and this circular dependency provides the semantics of the barrier. The problem is that if we don't understand what's going on at the application level and just put transactions around this code, we're in trouble, because no matter what we do, the transactional memory system serializes transactions: in the serialized history, either instructions two and three come before one and four, or one and four come before two and three, and neither order satisfies the dependency in both directions. So there is no way to serialize these two transactions. It's a problem, and in a real implementation it typically shows up as livelock: the two transactions kill each other without making progress.
So the way we deal with it is to first detect that such a case is happening. We detect it with a timeout: what we actually did is constantly check whether a specific transaction is making progress, by counting how many times it has rolled back in a row. If that reaches five or ten, we say there is no hope for this transaction to make progress, and we re-optimize the code to use a transaction per basic block. By the definition of a basic block, once we enter the beginning of a basic block we can finish it -- the transaction ends at the end of the block -- so we don't have a circular dependency trapped inside one transaction.
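(The hand-rolled barrier from the slide, sketched in C++ with illustrative names. If one transaction covered thread one's whole spin loop, it could not commit until it observed thread two's store, yet that store conflicts with it -- the livelock just described. With a transaction per basic block, each loop iteration commits on its own, so both sides make progress.)

    #include <atomic>
    #include <thread>

    static std::atomic<bool> done{false};

    static void thread2_work() {
        done.store(true, std::memory_order_release);   // "I'm done"
    }

    static void thread1_wait() {
        // Each iteration of this loop is one basic block; instrumenting one
        // transaction per basic block lets every probe of the flag commit.
        while (!done.load(std::memory_order_acquire)) {
        }
    }

    int main() {
        std::thread t1(thread1_wait), t2(thread2_work);
        t1.join(); t2.join();
        return 0;
    }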
>>: (Inaudible).
>> JaeWoong Chung: Yes. Yes. We use (inaudible) not only for performance but also to detect these kinds of cases.
>>: Can you guarantee that you cover all possible scenarios?
>> JaeWoong Chung: The important thing is that if we go down to a transaction per basic block, we are guaranteed that there is no such dependency. Look at this while loop, the waiting condition: it actually consists of multiple basic blocks. And the key is -- okay, this is the better way to put it: from the transactional memory perspective, as long as we start a transaction and finish the transaction, it's okay; we are making forward progress. Right?
So if we add a transaction per basic block, then by the definition of a basic block we are guaranteed to finish what we start: once we jump to the beginning of a basic block, we go straight down to its end, and the transaction begin and end are at the beginning and end of the basic block. Okay.
>>: (Inaudible).
>> JaeWoong Chung: That's a high-level question, an open question.
Well, you know, that's the reason I put it as the last one: it's not easy to deal with. But at least we have a solution for this case. Once it happens, the system is going to slow down, because we have a transaction per basic block, but -- the best expression is "for the greater good," what can I say -- we take the hit in this rare case to keep the common case fast.
>>: (Inaudible) deal with it some other way? Like, why do you have to use transactions?
>> JaeWoong Chung: Basically, we don't want the programmer to worry about finding this pattern and using another mechanism for it. We wanted to make sure the tool developer just adds transactions, and we provide a way to dynamically optimize the transaction size, not only for performance but also to deal with these cases, so that from the tool developer's point of view, they don't need to find this case and handle it separately. That was the key point: making DBT tool development easy, with reasonable performance degradation. That's the key, I think.
Of course, there is a chance to use another mechanism.
>>: Different question.
>> JaeWoong Chung: Okay. So we're done with how to instrument the transactions and how they interact with user code. Now let's talk about baseline performance. We used Pin with multithreading support, we implemented DIFT as a Pin tool, and we implemented a software transactional memory system to begin with. We used x86 -- x86 software with (inaudible) -- and we ran the experiments on a real machine. We used nine applications: six from SPLASH-2, three from SPEC OMP.
In the graph, the X axis is the applications and the Y axis is the overhead, where we compare --
>>: These applications have tainted (inaudible)?
>> JaeWoong Chung: Taint? The security --
>>: (Inaudible).
>> JaeWoong Chung: Oh, in this case, yes. Actually, it reads from a file. We said the network, but it only reads from outside sources: it reads from a file, and it reads from the command line. We start tainting from there. But no network data, that's right.
The key is that it doesn't matter where the data comes from; what's important is to check whether it's data from outside or not. Yeah. Okay. Good, good.
Any other questions?
Okay. So, wow, time is running fast. So the X axis is the applications and the Y axis is the overhead, where we compare two things: the baseline performance is the DBT framework running with the application, and the Y axis shows the overhead when we add the software transactional memory system on top of the DBT framework and the application -- the overhead of adding the software transactional memory system to deal with the atomicity issues. Do you get this part?
>>: (Inaudible).
>> JaeWoong Chung: Yeah. Okay. So, without the software transactional memory system, the baseline performance is running the Pin tool with the DIFT implementation and the application. That's the baseline performance, meaning the numbers are normalized to that execution.
>>: (Inaudible).
>> JaeWoong Chung: Yes. Yes. Or in our case -- it depends on the application, but in our case, yes. And we compare against that the overhead that comes from using the framework, DIFT, the application, and the software transactional memory system to deal with the atomicity issue. We measure how much overhead we add by doing that.
>>: (Inaudible).
>> JaeWoong Chung: (Inaudible).
>>: (Inaudible).
>> JaeWoong Chung: Yes.
>>: So then the zero percent overhead, that's a system that's incorrect, because it doesn't --
>> JaeWoong Chung: Yes. We didn't do anything for the atomicity issue. That's the baseline, yes. Well, you know, with luck it might run correctly, but there's no guarantee.
So, 41 percent -- in this example we actually ran -- oh, wait a minute. Going back to your question: was the overhead DIFT? I'm sorry, that was the wrong answer. In both cases we do run DIFT. But in the case where we measure the overhead, we additionally use the software transactional memory system -- so the overhead is not DIFT, the overhead is the software TM solution.
>>: That's what I thought you said.
>> JaeWoong Chung: Good, you corrected my answer naturally.
>>: The amount of the tainted data that is actually shared between threads -- do you have a sense of that?
>> JaeWoong Chung: Yes, yes, yes. It depends on how you pack the taint bits. It's not only the user-level sharing pattern; it's also how you pack the taint bits, so taint bits for different data words can end up packed into the same metadata word. In our case we packed the bits tightly, so even without user-level shared accesses there is some sharing of the taint bits. That's the first thing. And yes, frankly, these applications don't share much on the data access side, but we instrumented all of the code -- even the barriers and the other synchronization variables -- so yes, we were able to see conflicts on shared metadata accesses.
Anyway, the baseline: 41 percent. 41 percent is actually not that good -- not too bad, not too good. But considering the other option, totally serializing the multithreaded input and executing threads one by one, this is much better. Still, there is room for optimization.
So let's first see where this transaction overhead comes from. Three categories, basically. First, the cost to start and end a transaction: we have to take a register checkpoint and initialize the transaction metadata. Second, per memory access inside the transaction, we have to remember which addresses are read or written by the transaction so that we can detect conflicts between transactions, and for writes we also have to remember the old values in memory, so that if we have to roll back, we can restore the memory image. The third type is transaction abort, where an actual conflict occurs and we have to undo all the work done by the aborting transaction; we also have to do the work of the abort itself, for example applying the saved values we preserved and restoring the checkpoint.
In our tests these are fairly parallel applications, so the transaction abort overhead goes down to 0.03 percent. So we focus on the first two overheads.
First, the transaction begin and end overhead. It's quite easy: the basic strategy is to amortize it. In this graph, the X axis is the number of basic blocks per transaction and the Y axis is the same normalized runtime overhead, and it's kind of obvious that the more basic blocks we have in a transaction, the lower the overhead goes, due to amortization.
Next, the per-memory-access overhead. Up to this point, I just said that to use transactions in DBT, all we have to do is add transaction begin and end. That's true for a hardware transactional memory system, but it's not true for a software transactional memory system. Because the software transactional memory system has to know which addresses are read and written, we have to add software barriers to tell it about those memory accesses. So we add barriers, and that costs performance. Typically two kinds of work happen in a barrier: first, conflict detection, by recording the addresses read or written by the transaction; and second, data versioning, by saving the old value when the transaction writes to an address.
But this definition gives us two hints. First, if the variable itself is not shared -- it's private -- then from the beginning we don't need to do anything for conflict detection. And the second hint is that if it is a stack variable, then maybe, just by adjusting the stack pointer, we can simply discard any modification done to the stack variable. So there can be some optimization opportunities.
So we came up with a categorization tree for the different memory access types, where a yellow dot shows that we need data versioning for that case and a red dot shows that we need conflict detection for that case.
First, we analyze whether a variable is a stack variable or not. If it is, we check whether the stack variable is allocated after the transaction begins or not.
If it's a stack variable allocated after the transaction begins, we don't need conflict detection, because it's private, and we don't need data versioning, because if we have to roll back this transaction, we are going to restore the register values, which include the stack pointer, and by restoring the stack pointer we naturally undo the allocation of the stack variable. So we don't need to remember anything about modifications to that stack variable.
But if the stack variable was allocated before the transaction started, then even after we roll back the transaction the stack variable will still be there, so we have to remember any modification to it. If it's a stack variable that escapes, or it's not a stack variable, we check whether it's private. There are many analyses to decide whether a variable is private or not. For example, in object-oriented programming we allocate an object with new; at the moment the object is allocated, it's private, and until the moment it is published through a global reference variable, it's still private. So in that way we can find that a variable is private up to some point.
If it's private, we only do data versioning; we don't need conflict detection, by the definition of private. If it's not private, we check whether it's protected with locks in the user code. If it's protected with a lock, it is already pessimistically protected: the lock semantics guarantee that only one thread accesses that variable. So we don't do conflict detection; we only do data versioning.
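(A C++ sketch of this categorization as a decision function; the category names are illustrative. Anything the analysis cannot classify falls through to the safe default of doing both kinds of work, which matches the "go to the right side to be on the safe side" rule mentioned in a moment.)

    #include <cstdio>

    enum Access {
        STACK_AFTER_TXN_BEGIN,   // stack variable allocated inside the txn
        STACK_BEFORE_TXN_BEGIN,  // stack variable that outlives a rollback
        PRIVATE_OBJECT,          // not yet published to other threads
        LOCK_PROTECTED,          // user lock already excludes other threads
        SHARED                   // anything we cannot prove otherwise
    };

    struct BarrierWork { bool versioning; bool conflict_detection; };

    static BarrierWork classify(Access a) {
        switch (a) {
        case STACK_AFTER_TXN_BEGIN:  return {false, false}; // undone via stack ptr
        case STACK_BEFORE_TXN_BEGIN: return {true,  false};
        case PRIVATE_OBJECT:         return {true,  false};
        case LOCK_PROTECTED:         return {true,  false};
        default:                     return {true,  true};  // safe side
        }
    }

    int main() {
        BarrierWork w = classify(SHARED);
        std::printf("versioning=%d conflict_detection=%d\n",
                    (int)w.versioning, (int)w.conflict_detection);
        return 0;
    }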
>>: (Inaudible).
>> JaeWoong Chung: I'm sorry?
>>: If the locks are tracked in the application.
>> JaeWoong Chung: Locks are --
>>: Because you see a lock doesn't mean the variable is not being accessed somewhere else without a lock?
>> JaeWoong Chung: Yes. Well, the goal here is not to fix problems in the application. Assuming the original application is correct and race free, we do this.
>>: (Inaudible) security guarantees?
>>: (Inaudible) security guarantees?
>> JaeWoong Chung: Yes.
>>: Okay.
>> JaeWoong Chung: No. Well, I think the best way to answer your question is: yes, if the implemented feature requires additional strength when using transactions, we have to admit that. But in this case we are not adding transactions for the security feature specifically; in general we are trying to add transactions to --
>>: (Inaudible) performance, but if I did something like adding a fine-grain lock for each memory access to that address, then I would at least know that my change to the taint would be atomic with the access to the address. You're making an assumption here about the user-level locks.
>> JaeWoong Chung: Yes, user-level locks.
>>: Performance might be breaking the taint analysis. If I (inaudible) a lock randomly in one of my threads, you know, that won't make my program incorrect if the lock doesn't need to be there, but now I'm exposing myself to races on the taints.
>> JaeWoong Chung: Yes. Oh, the important thing is -- the important thing is --
>>: Is that what you're saying?
>> JaeWoong Chung: Yes. The important thing is that we have to distinguish two things: races on the user data and races on the metadata. This categorization tree has nothing to do with the metadata.
>>: Right.
>> JaeWoong Chung: It's only for the user data.
>>: (Inaudible).
>> JaeWoong Chung: Yeah. So that's the reason we still have to do data versioning. Let's say this fine-grain locked region, as you said, is inside a bigger transaction, and for this specific access to the shared variable we don't need conflict detection. But there might be a conflict due to the metadata right next to the original access, and if there is a conflict we have to detect it and roll back, and for the rollback we still have to do data versioning. But conflict detection on that user access, no: the lock guarantees that only one thread touches it.
Does that make sense?
>>: (Inaudible).
>> JaeWoong Chung: Okay. So when all hope is gone, we have to do both things: data versioning and conflict detection. The key point here is that any time we cannot analyze a specific case, we go to the right side of the tree, to be on the safe side.
So we didn't explore every chance for software optimization, but we tested several cases: with the stack optimization the overhead goes down to 36 percent, with benign races it goes down to 34 percent, and with additional optimizations it goes down to --
>>: Question. (Inaudible).
>> JaeWoong Chung: Yes, (inaudible) application and (inaudible).
>>: Original SPLASH-2 from (inaudible) -- like how big is your working set?
>> JaeWoong Chung: Yes, yes.
>>: Megabyte working set.
>> JaeWoong Chung: Data size.
>>: (Inaudible), right?
>> JaeWoong Chung: I don't quite --
>>: 10, 15 years ago?
>>: What I used in my thesis -- the machines were so small.
(Brief talking over)
>> JaeWoong Chung: Well, frankly, the way I did it is that I increased the data set as long as the run still finished within a day, so it's --
>>: How did you choose these, rather than running a traditional DBT application like a Web server or an FTP server?
Web server and FTP server ->> JaeWoong Chung: Oh, the reason is that those applications are more vulnerable for
security. But our key point is to detect, to eliminate the race in terms of sharing data.
Those applications are embarrassingly (inaudible). So there's ->>: The (inaudible).
>> JaeWoong Chung: Yes. Maybe I introduced confusion by using the security example; I just used it as an example of how to use DBT and what kinds of problems there can be.
But from our perspective, we are more interested in the races. These applications, with their shared variables, push our system further.
So, anyway, after we submitted this paper we did additional experiments, and with software optimization it can go down to 30 percent. If you're happy with 30 percent, move on. But if you're not happy, we start thinking about adding hardware support. From here we did simulation work, and we compared basically three types of hardware acceleration. One is STM-plus, which has hardware acceleration for register checkpointing in one instruction. Then there is a hybrid TM, where hardware provides support for checkpointing and conflict detection: we remember the read set and the write set in (inaudible) and check for conflicts.
And the third type is a full hardware implementation, where everything happens in hardware.
So, the quick summary: with STM-plus the overhead is 28 percent, with hybrid TM it's 12 percent, and with full HTM it goes down to 6 percent, which is pretty good. Comparing 12 percent and 6 percent, half of it is gone; but from the total execution time point of view, 6 percent versus 12 percent is not that impressive. So we are definitely seeing some kind of diminishing return, and maybe the hybrid approach to transactional memory for DBT makes sense in terms of hardware cost.
>>: (Inaudible) on top of the -- so how much of the actual non-DIFT execution time -- in other words, if you took the basic (inaudible) without DIFT at all, what (inaudible) --
>> JaeWoong Chung: Oh, what is the overhead of DIFT itself? We actually didn't measure it. The reason is that we don't care how efficient DIFT itself is. What we care about is: once such an implementation is there, how much overhead do we add to deal with the atomicity issue? That was our key point.
It's an interesting question, yeah. Another question? Okay. So, done.
>>: (Inaudible). On the DIFT effect, I suspect the answer is no, but does DIFT affect multithreaded applications worse than single-threaded applications?
>> JaeWoong Chung: DIFT -- I didn't get your question. Worse in terms of performance?
>>: Yeah.
>> JaeWoong Chung: Of course, of course -- multithreaded application versus single-threaded application. Well, setting aside the atomicity issue, the code is instrumented the same way for a multithreaded application and a single-threaded application; it's the same. Forgetting about the atomicity issues in the multithreaded application, basic DIFT, without worrying about --
>>: (Inaudible) so the number of, let's say, the number of processors at which performance is optimal or highest -- does that number change when I apply DIFT, do you think?
>> JaeWoong Chung: I think it depends on how you pack the metadata bits that get shared. Typically, up to now, the DIFT papers packed the taint bits tightly -- one bit per application byte, eight taint bits per metadata byte -- so that they minimized the memory footprint. And since those studies were mostly done on single-threaded applications, there was no reason to worry about the atomicity issue. But in our case it will hurt, because the metadata is actually going to be shared by multiple threads and (inaudible) in the cache.
>>: I'm not sure if I missed it, but how do you handle tainted data and pointer (inaudible) issues?
>> JaeWoong Chung: Pointer (inaudible)?
>>: Yes. Do you have (inaudible) to obtain the data and then you (inaudible).
>> JaeWoong Chung: Oh, well, it doesn't matter whether at the application level it is a (inaudible) or not. What we see is the eventual memory access. In the program, there can be a pointer variable A that eventually turns out to access variable B, and it's hard for the compiler to understand that. But from the transactional memory system's point of view, it's just a normal load and store instruction. We don't care how it is expressed in the original source code.
Questions?
Okay. Wow, I drew a tough audience -- it's great. Well, I'll be happy to talk about any questions after the talk. And I guess that's it.
>>: (Inaudible).
>> JaeWoong Chung: Oh, let me finish, okay. So, quickly: multithreaded executables are headed for DBT. We use transactions to deal with the atomicity issue. The baseline overhead is 41 percent; with software optimization it goes down to 30 percent; with hardware acceleration it can go down to 6 percent, but pay attention to the diminishing returns -- the hybrid approach makes sense in terms of the trade-off between performance and hardware cost. That's it. Thank you.
(Applause)