>> Doug Burger: So it's my great pleasure to introduce Professor Josep Torrellas from
the University of Illinois today. Josep is a very well-known, distinguished and senior
member of our community. He's an IEEE fellow. He received his PhD from Stanford in
the early '90s and he's done seminal work in speculative multi-threaded architectures,
processor and memory architectures, debuggability and programmable reliable
architectures. And his new work looks at bulk sequential consistency and is really
exciting stuff that's going to help us to tackle the parallel programming problem.
So I'd like to thank Josep for coming and sharing his most recent work with us today and
it's a real pleasure to have him here.
I will. Sorry, we don't want that vicious Pacific Northwest sun burning you as you give
your talk.
>> Josep Torrellas: Very good. Thank you very much. Thank you very much, Doug.
I'm very pleased to be here. Thank you for hosting the visit. Thank you to all the
attendees. Hopefully you will find some of this work interesting and useful.
The work I'm going to present is a big project that I have with my students on designing
the new multi-core architecture of the future, hardware and software with a big emphasis
on programmability.
Many people contributed to this work, who are not at Illinois anymore. These are some of
my students who contributed to this work. In particular, you will see a lot of work by Luis
Ceze and James Tuck. So the project is called the Bulk Multicore for Programmability.
Let me start -- and by the way, please stop me any time if I say something wrong and it
doesn't make sense. Okay? I'd like to have a discussion, just a simple one.
Let me start with what I think are the challenges that we are facing as we move toward
the multicore era. The first one, probably on the minds of everybody here, is that 100 cores
per chip are coming and there's precious little parallel software. Okay. Certainly on the
client side. Which means that for system architects, both hardware and software, our
main priority has to be to design a system that supports a programmable environment.
Of course, this is very cheap to say, but what does it need? The most
user-friendly concurrency and consistency models that we know of, we should be
supporting them efficiently. And also, I think this is very important, provide an always-on
production-run debugging environment, because most of the people who will be using
these machines are not going to be experts, so we need to help them a lot, hold their
hands a lot.
Another challenge that I see is that decreasing transistor sizes are making designing
cores harder and harder. The reason for this is that as you decrease the feature size
there are more and more design rules that we need to support, which means that if you
want to satisfy them all it's getting harder and harder, which means there
will be this tendency of taking the core and to some extent shrinking it, which means that
in my opinion, cores are becoming commodities more and more. There will be big cores,
small cores and specialized cores.
And the innovation and the value added of the chips will be mostly in the cache hierarchy,
network, power management and so forth. I think this is an opportunity for us because
designers are going to be, I think, willing to put this extra support and certainly able
because there's enough silicon. You'll see in the design of the Bulk Multicore why this is
important.
Another issue is that we will be adding accelerators. This is something that we see
coming and really we're going to see that these accelerators want to have fine-grain
sharing with the main cores, which means that the accelerators will need to have ideally
the same simple interface with the cache-coherent fabric as the main processors. Okay.
Which means that if you want to exploit them, we're going to need to have in my opinion
simple memory consistency models. So in this context, let me give you a brief overview
of kind of a possible vision of the multicore of seven to ten years from now: 128-plus
cores per chip.
Since people will be providing shared-memory programming models, what we want from the
hardware is support for shared memory. Certainly I will assume this: support shared memory,
perhaps with groups of processors, islands of processors there. Certainly we can have
the hardware enforce interleaving restrictions imposed by the language and the
concurrency model. Both the hardware and the language and concurrency model will
have to be designed together.
And I will make the case for sequential memory consistency. Certainly we are able to
provide hardware to support this at a very reasonable cost. At the same time, the
second point is an always-on debugging environment that can be on all the time, even for
production runs. And this has many components, some of them near and dear to the
work done here: deterministic replay of parallel programs, which we can do with practically
no log at all; data race detection, which we can do at production-run speeds with very little
slowdown; and pervasive program monitoring, as I will discuss.
So with this context then this talk is about the Bulk Multicore. We even have a picture of
the Bulk Multicore. So what's the main idea of the Bulk Multicore? It's very simple.
Let's say that we eliminate the ability to commit individual instructions one at a time. Okay.
We don't provide architectural state after every single instruction. This
was, and still is, a fundamental assumption of current processors.
Current processors execute out of order, but instructions commit in order, and after
every instruction you've got to be able to provide the architectural state to the rest
of the machine.
So as we move to multiprocessors, this becomes harder and harder, and this is the root
of all the problems of memory consistency. Okay. Loads and stores now take a long
while, you have many outstanding, and while this happens something else may get
messed up. And because you want to keep this in-between-instruction architectural state,
then you have to worry about memory consistency models, stalls, ordering, overlaps and
so on.
Let's say we (inaudible) and instead what we have is processors only commit chunks of
instructions at a time. Okay. Say for example 2000 dynamic instructions. The key word
here is "dynamic instructions." Okay. Because this is not something that is visible to the
software. You give a thread to a processor, the processor starts
executing, say, 2000 instructions, and then says, okay, now it's time for me to make this
state visible, and if somebody wants, here is my architectural state. Then I continue
executing 2000 instructions and so forth. So that's the idea.
>> Question: What if the synchronization operation's in there?
>> Josep Torrellas: Doesn't matter. In fact, you'll see that the sync becomes a no-op;
a fence becomes a no-op. Synchronization will still work, except that you're not going to
know where inside the chunk the synchronization occurs. So that's the idea.
Chunks execute atomically and in isolation. Okay. People know what this means, but just
to make sure atomically means that nothing that I do within this 2000 instructions will be
visible to the rest of the world until the end. So there is a lock, if you grab a lock, nobody
is going to see that you grabbed the lock until the end, when you make the state visible,
or if you release a lock, same thing.
Isolation means that while you're executing this chunk the machine state cannot change
on you, which means that if you're executing a chunk and you read a value, and
while you're executing the chunk somebody else writes that data and commits, you're
gonna get squashed. You just cannot let this thing keep executing.
Okay. Atomicity and isolation. At the basic level we'll use buffering of the data within the
chunk, and if you have to squash, you undo. But most importantly we're going to use
something called hardware signatures, which is a very simple hardware primitive that will
enable efficient execution, atomic and in isolation.
Now if this sounds familiar to you, this seems like transactional memory, it's not the
same. It's quite different. You could think of this as being transactions all the time in
hardware, invisible to the software. So this doesn't say anything about the software. You
can have object-oriented programs on top. You can have functional programming,
anything you want. This is invisible to the software. Okay.
It's just a way, as you will see, to execute efficiently. And it's not a way to parallelize
programs, either. Okay. What I do here is I give a thread to a processor. Okay. You
take a parallel program, I give a thread to each processor, and each
processor will chunk it to execute efficiently, and this should work with synchronization. It
should work with locks, barriers, it works.
You will see that the advantages over current processors are that you improve
programmability quite significantly, you get higher performance than current processors, and
simpler hardware. If this seems too good to be true, maybe you can wait until I say some of
these things and then you can argue with me if you don't agree. Okay. So that is what I
will do in the rest of the talk: try to convince you that if we just get rid of that thing that is
kind of a fundamental tenet of current processors, we get these three and we don't need
anything else.
>> Question: So to go back to synchronization point, can you address how you handle
something like a swap instruction that has read, modify, write semantics?
>> Josep Torrellas: Oh, that's fine. That's fine. If two processors do
this, okay, only one will succeed. Right. Which one will succeed? The first one that
commits its chunk.
>> Question: Right. So you're basically saying that a swap instruction,
anything that has atomic semantics to memory, is in essence going to be cached.
>> Josep Torrellas: Yes. Yes. Everything is cached until commit, as you will see. This
works with locks, barriers, whatever. The only difference is that a chunk now may have
spins; right.
>> Question: Uh-huh.
>> Josep Torrellas: So you execute useless instructions, you're spinning longer. So if the guy
who releases the lock doesn't release it until the end of the chunk, then the other guy may
spin longer, or you may have squashes. Rich?
>> Question: I was going to say, this seems very optimistic and I'm just wondering how
well it's going to work if you have a lot of contention?
>> Josep Torrellas: Okay. So the high-level idea is: the finer the grain of the sharing, the
more squashes you will have. You know, think about it. The way this works is that if you
have a program that works like this, right, this guy sets something, this one reads it, this
one sets, you're not going to get that interleaving. You're going to get either this, or this. And
what you will have is spinning in these other chunks. Okay. Or squashes. But I want to live
with this. I want to live in an environment where if you have fine grain you're going to
reduce the chunk size. Right? If you have coarser grain you move to bigger chunks
because it's much more efficient.
>> Question: (Inaudible) --
>> Josep Torrellas: You could. In this case we don't do it, but you could do that. So
currently we just use 2000 dynamic instructions. Right? If you find a lot of squashing you
could dynamically reduce it.
>> Question: Are you going to talk about exceptions and I/O?
>> Josep Torrellas: I can very simply say what the idea is.
>> Question: Yeah.
>> Josep Torrellas: So, I'll get to it in a second. Okay. Let me just go over this first.
So in the rest of the talk I will explain what the Bulk Multicore is, how it can improve
programmability, and then revisit the vision of the multicore.
So how does this work? There is something called hardware signatures. What's
the idea? Imagine that every time a load or store retires from the reorder buffer there is
hardware that takes the address that you're loading from or storing to and automatically
and transparently hash-encodes it into a signature. Okay. So something like this:
take the address, do some permutation, and then hash it into here. You set
multiple bits for a given address. Okay. Another load or store, you do the same thing.
Okay.
So this is a Bloom filter, really. This signature collects kind of a footprint of what
addresses you've read or written. Okay. You have a read signature and a write
signature. All the reads go to one place, all the writes go to another place. And you have
this summary of the footprint of a chunk. So each chunk has a read signature and
a write signature. Again, transparent to the software; the software doesn't see the
signatures.
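To make the signature idea concrete, here is a minimal C sketch of this Bloom-filter
encoding. The signature width, the number of hash functions, and the hash constants are
illustrative assumptions, not the actual Bulk hardware parameters, which would be fixed
combinational wiring.

```c
#include <stdint.h>
#include <string.h>

#define SIG_WORDS 32   /* 32 x 64 bits = a 2048-bit signature (assumed size) */

typedef struct { uint64_t bits[SIG_WORDS]; } signature_t;

static void sig_clear(signature_t *s) { memset(s->bits, 0, sizeof s->bits); }

/* Two multiplicative hashes standing in for the fixed permute-and-hash
   wiring; each yields an 11-bit index into the 2048-bit signature. */
static unsigned hash1(uint64_t a) { return (unsigned)(((a >> 6) * 0x9E3779B97F4A7C15ULL) >> 53); }
static unsigned hash2(uint64_t a) { return (unsigned)(((a >> 6) * 0xC2B2AE3D27D4EB4FULL) >> 53); }

/* Conceptually invoked when a load or store retires from the reorder
   buffer: set multiple bits for the accessed line, Bloom-filter style. */
static void sig_insert(signature_t *s, uint64_t addr)
{
    unsigned b1 = hash1(addr), b2 = hash2(addr);
    s->bits[b1 >> 6] |= 1ULL << (b1 & 63);
    s->bits[b2 >> 6] |= 1ULL << (b2 & 63);
}
```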
Suppose at the same time that in the cache hierarchy of the processor, close to the L1,
not in the core, but in the cache hierarchy, I have very simple functional units that
operate very efficiently on the signatures. Okay.
Suppose I have a bunch of AND gates that take the intersection of two signatures. All right.
So this thing now will contain the addresses that are common, or kind of an encoding of
the addresses that are common, to both signatures.
I can check if this is zero with a bunch of OR gates and a NOR gate. If any of the fields are
zero, I know this signature is empty, which means that there is no intersection between
these two chunks. Uh-huh. I can also test membership: is this address part of the
signature? Well, I take the address, I use the previous picture to encode it into a
signature, then I intersect it with the signature, test with zero, and that tells me if
this address could be in the signature or not. Okay.
So what I have here is an efficient way of operating on groups of addresses. Okay. Very
efficient, low overhead. There are some false positives, but it's very efficient. If I have this,
then atomic and isolated execution becomes trivial. Okay.
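Building on the sketch above, the functional units just described amount to a few lines of
bit manipulation; this is a software stand-in for what would be simple combinational logic.

```c
/* Intersection: one AND gate per bit position. */
static signature_t sig_intersect(const signature_t *a, const signature_t *b)
{
    signature_t r;
    for (int i = 0; i < SIG_WORDS; i++) r.bits[i] = a->bits[i] & b->bits[i];
    return r;
}

/* Is-empty test: an OR-reduce of all the bits, then a NOT. */
static int sig_is_empty(const signature_t *s)
{
    uint64_t acc = 0;
    for (int i = 0; i < SIG_WORDS; i++) acc |= s->bits[i];
    return acc == 0;
}

/* Membership test, exactly as described in the talk: encode the address
   into a one-element signature, intersect, test for zero. False positives
   are possible (hash collisions); false negatives are not. */
static int sig_may_contain(const signature_t *s, uint64_t addr)
{
    signature_t probe, hit;
    sig_clear(&probe);
    sig_insert(&probe, addr);
    hit = sig_intersect(s, &probe);
    return !sig_is_empty(&hit);
}
```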
Here's how it works. Suppose I have two parallel threads. Right? As
this guy executes, it issues loads and stores; because I want this to be atomic, I don't let
the stores be seen outside. Everything that I do is buffered in the cache. Right? Okay. I
don't send invalidations or anything. But the hardware automatically generates a write
signature and a read signature for this chunk. Right.
At the same time, this other processor executes its own chunk. As it loads, it loads the
correct data. Remember, it cannot load this here because it is invisible to it. It reads the
correct data. It generates its signatures.
Now what does it mean to commit atomically? When this guy wants to commit, all he
needs to do is send the write signature to this guy, and then clear his own
read and write signatures. Everything is done. He has committed.
Does that make sense? So commit simply means sending the signature to everybody
else, in this case the other processor, and clearing your own signatures.
Suddenly the writes it has been caching become visible to the rest of the world. Uh-huh.
You don't write back all the data. No. The signatures themselves should be enough to keep
the whole system consistent. Okay. So that's atomic execution.
Isolation. In the cache hierarchy of this processor, the signature comes in, and in this
functional unit in hardware it is intersected against the local read and write
signatures. Okay. In this case what you can see is that this one will be null, but this one is
not null, because this guy stored B and this guy loaded B; therefore this guy will get
squashed. So if this thing is not null, then this guy has to be squashed if I want isolation.
Hmm.
So with this support then I have atomicity and isolation. Uh-huh.
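A minimal sketch of the commit/squash protocol just described, reusing the signature
helpers above. The arbiter and network plumbing are elided, and the structure names are
made up for illustration.

```c
/* Per-processor chunk state; the buffered (uncommitted) stores live in the
   local cache and are not modeled here. */
typedef struct {
    signature_t R, W;   /* read and write signatures of the running chunk */
    int         squashed;
} chunk_t;

/* Receive side: another chunk's committed write signature arrives. If it
   intersects my read or write footprint, my chunk loses isolation and is
   squashed, to be re-executed. */
static void on_commit_from_other(chunk_t *mine, const signature_t *their_W)
{
    signature_t rw = sig_intersect(&mine->R, their_W);
    signature_t ww = sig_intersect(&mine->W, their_W);
    if (!sig_is_empty(&rw) || !sig_is_empty(&ww))
        mine->squashed = 1;
}

/* Commit side: send W out (via the arbiter, which forwards it to the
   relevant caches), then clear both signatures and start the next chunk. */
static void commit_chunk(chunk_t *mine)
{
    /* ... send mine->W through the arbiter/network, not modeled ... */
    sig_clear(&mine->R);
    sig_clear(&mine->W);
}
```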
>> Question: So are you required to broadcast the write signatures to all the processors,
or does this scale as you grow the number of processors?
>> Josep Torrellas: Okay. So I would not actually broadcast it. On a
bus, I would just put the signatures on the bus and everybody would see them. In reality you
would have an arbiter that is connected with a directory. You would send it to the arbiter,
and, all this works with cache coherence, by the way, the signatures are
enough. So then the arbiter says, oh, I know that only processors 10 and 16 are interested in
the data that could be in this signature. And then this thing gets only forwarded
to those caches.
So the basics of the Bulk Multicore are the combination of chunk operation and
signatures. We execute, as I said, each chunk atomically and in isolation, like this,
right? There is a distributed arbiter that ensures a total order of chunk commits. So you don't
send it really to the processors, you send it to the arbiter. The arbiter very quickly detects
whether you can commit, and if so it says, okay, you are committed. Then you can clear
the signatures, and in parallel it will change the state of the directories and
so on. Okay.
So with this, really what happens is we have a certain set of interleavings, not all of them,
of course, but a set of interleavings, as if we had this guy selecting chunks from each
processor. Okay. Now if you have atomic execution, isolation, and a total
order of commits, then you have sequential consistency at the chunk level. All right.
And if you have sequential consistency at the chunk level, you have it by definition at the
instruction level. So now you have sequential consistency in this machine. Okay. You don't
allow all possible interleavings, but the interleavings you have are legal. Right. And
because I choose the size of the chunks so that the overhead of commit and so on is small, I
have high performance. So the key is you support sequential consistency with low
hardware complexity and high performance. Why do I say low hardware complexity?
In current processors, to support sequential consistency you have to live with
reorderings of loads and stores. Current processors issue a load, and from the time the
load is issued until it retires there is a window of vulnerability. You have to watch for incoming
invalidations in current processors to support sequential consistency or any model that
allows you to do speculative work.
Whenever you have an invalidation on the bus, it has to be snooped all the way up to
the load buffer of the processor. Okay. So that is an intrusive effect of
consistency into the core itself. Moreover, you cannot displace anything from the cache,
okay. If you displace something from the cache, then you could miss an invalidation coming
in. Therefore, if you displace something from the cache, the processor has to squash the
load and all its successors.
If an incoming invalidation snoops into (inaudible) and finds a hit, you have to squash the
load and all the successors. Okay. What I'm proposing here is: ignore all this stuff.
Okay. Let the loads and stores execute out of order as long as they make it into the
signature, which is in the cache hierarchy. Okay.
And if in the meantime somebody else sends you a signature, then you intersect that
signature and you get squashed. So it's either everything or nothing. You can see that we
have moved all the complexity away from the core into the cache hierarchy.
>> Question: I don't see how you -- I guess I don't understand how you can recover the
addresses that you need. I mean, the Bloom filter encodes the read and write sets
into an approximate hardware hash and you can't recover which cache lines were read
or written.
>> Josep Torrellas: Okay, so --
>> Question: Right. So, you know, it's highly probable
that in these future machines, if you get to 128 cores, you're not going to have a central bus.
You will have a directory at the L2, so how do you convert those signatures into an
invalidation or a --
>> Josep Torrellas: Okay.
>> Question: Squashes?
>> Josep Torrellas: Very good. So from the signatures you have to be able to generate a
superset, a safe superset, of things that you want to either invalidate or squash.
Correct? All right. So I showed you a bunch of functional units that operate on
signatures before.
>> Question: Uh-huh.
>> Josep Torrellas: There is one that we call expansion and expansion generates a
conservative set of cache sets that can contain data that belongs to the signature.
>> Question: Right. Super set.
>> Josep Torrellas: Super set. So all you need to do is to ensure that this function
cannot squash or cannot invalidate any data that you shouldn't do for correctness. Okay.
But if you invalidate something and, you know, at most you're going to have a miss on
that thing.
>> Question: So if you assume that a load or store has to bring the cache line
across before it can operate on it, then at the directory you can say: let me generate all
possible cache lines, then you could intersect with the signature, and any processor which
is caching that line needs to be squashed.
>> Josep Torrellas: Well, so not squashed. Okay. So first of all, when a processor
receives a signature, you do two operations. Okay. First thing is you intersect the
signatures to see if you need to squash that chunk.
>> Question: That's assuming you're broadcasting the signatures.
>> Josep Torrellas: No, no, it doesn't matter. If you send it to a processor, you send it
to a processor.
>> Question: Uh-huh.
>> Josep Torrellas: Okay. The processor receives the signature.
>> Question: Yes.
>> Josep Torrellas: The first thing you need to know is: one, do I need to squash this
guy? You have the intersection with your read and write signatures, and if it says yes, I
need to squash this one, then what you do is you squash the thread, and then you use the
expansion on that write signature to figure out what are the dirty lines in the cache that
need to be invalidated. Because when you squash a thread --
>> Question: I see. I see.
>> Josep Torrellas: So you do this expansion using your own write signature and you
invalidate those lines.
>> Question: It's a local operation, so --
>> Josep Torrellas: Yeah. In addition, irrespective of whether you get
squashed or not, okay, when you receive a signature, okay, you need to invalidate from
your cache any lines that guy could have written, simply because they become stale.
Right?
>> Question: Uh-huh.
>> Josep Torrellas: So you get the signature and then you apply a second expansion on
the incoming write signature and you invalidate those lines from the cache.
>> Question: Yeah.
>> Josep Torrellas: So there are two operations: one you always do, and the second
one only if you squash the thing.
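Here is a rough software stand-in for this receive side, including a conservative
"expansion" step, reusing the chunk_t sketch above. A real implementation would exploit
the hash wiring to enumerate candidate cache sets directly; this sketch simply tests each
cached line against the signature, which is equally conservative. The cache geometry is an
assumption.

```c
#define N_SETS 1024   /* assumed cache geometry: one tag per set, for brevity */

/* Conservative "expansion": invalidate any cached line that may belong to
   the signature. Over-invalidation is safe; it can only cost an extra miss. */
static void expand_and_invalidate(const signature_t *w,
                                  const uint64_t tag[N_SETS], int valid[N_SETS])
{
    for (int set = 0; set < N_SETS; set++)
        if (valid[set] && sig_may_contain(w, tag[set]))
            valid[set] = 0;
}

/* The two receive-side operations: always invalidate possibly stale lines;
   squash the local chunk only on a footprint conflict (in which case the
   local write signature is also expanded, not shown, to discard the
   chunk's own dirty lines). */
static void on_incoming_signature(chunk_t *mine, const signature_t *their_W,
                                  const uint64_t tag[N_SETS], int valid[N_SETS])
{
    expand_and_invalidate(their_W, tag, valid);  /* 1: always */
    on_commit_from_other(mine, their_W);         /* 2: squash check */
}
```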
>> Question: So I misunderstood you earlier, I thought you were talking about using the
coherence protocol to prevent broadcast of your write set.
>> Josep Torrellas: You do. You do.
>> Question: Okay. All processors.
>> Josep Torrellas: Whenever you send a signature to the arbiter, the arbiter expands it
against the directory, same idea, figuring out what the possible lines are that it maps to.
Then for each line it knows which processors contain the data.
>> Question: Right.
>> Josep Torrellas: And it sends the signature only to those. All right. So you always
have to design the system to be on the conservative side. You cannot miss anybody, but
if you invalidate something that is clean, a cache --
>> Question: Are you going to show us results for a large number of cores about what
fraction of lines get -- you know what fraction of the cores a signature gets expanded into
in fact.
>> Josep Torrellas: This talk doesn't have any data. (Laughter)
>> Question: Thank you.
>> Josep Torrellas: So that's one thing. Second is high performance. This is important
because one could design to very low complexity, get sequential consistency, and have
poor performance. Not so, because instructions within a chunk are fully reordered by the
hardware. In fact, loads and stores can be reordered in any way you want inside a chunk. If
you have a fence, that fence is a no-op, because the fence is there to prevent
instructions after the fence from being reordered before the fence. Here, because
nobody is going to see anything in the chunk anyway, you can reorder anything and ignore
the fences. Okay.
In fact, I will even show that you can issue accesses from two different
chunks within a thread. Okay. As long as each chunk has its own signatures and the
chunks commit in order, you can have requests from different chunks out of order
within a thread.
So to summarize what we're discussing here, the gains of this machine are in
simplicity, performance and programmability. Why is that? Hardware simplicity. You can
see -- oops, sorry. Something on this thing went back.
Okay. So, hardware simplicity. Remember, consistency support is moved away from the
core. I could design a core without worrying about whether it has to
support x86 consistency or Power consistency. It doesn't matter. Give me a core,
reorder the loads and the stores any way you want; as long as you have signatures and it
works with chunks, then the memory system, the cache hierarchy, will take care of the
consistency. And it will support sequential consistency. Okay.
So designers don't have to worry about this. So that's good because we move toward
commodity cores. We take a core, do anything you want, and it is going to work in your
system because now the consistency is in the cache hierarchy. Same for accelerators. Say
we want to put in an accelerator that works very well with, you know, fine-grain sharing,
with graphics or whatever. Do I need to worry about, oh, this is going to be for the next x86
machine so I need to worry about fences? No. Here, give me what you want, do any
reordering you want, as long as it works with chunks. Okay. As long as it works with chunks,
fences or no fences, it works.
High performance, as I said: full reordering within the chunk. Okay. Now what about
programmability? All this is invisible to the software. That's the first thing. You can
have any language, any (inaudible) model. Anything you want is fine because this is
under the covers.
>> Question: That's -- sorry. The only exception I can think of is if you have a read to
(inaudible).
>> Josep Torrellas: Okay. So what happens if you have a (inaudible)? Now is a
good time to do this. I/O and speculation don't like each other very well. Right. So
whenever you have an I/O, you have to commit the chunk, then do the I/O, then
start a new chunk. So that's a forced cut.
For exceptions it's not necessary, okay, as long as you can (inaudible) the thing here. So it's
similar to what transactional memory does, to some extent.
Now, we support sequential consistency, I argue, with low hardware overhead and
high performance. Why is this useful? Okay. So why is this useful? Jim may disagree
or not; let's see. I think that the main reason why this may be useful is that many
correctness tools will assume sequential consistency. Okay.
As we move toward an environment where all the machines will be parallel, one
would expect there will be more and more parallel software. Many tools will appear,
hopefully, that will say: I can prove that this is correct. But they have to assume
sequential consistency. Some of these tools exist today. If you run that code,
the subroutine that you proved correct, on a machine that supports x86 consistency, then,
you know, maybe it's not going to work.
It would be nice to have a machine where whatever you prove correct is correct. I will get
to this more later. The second important thing is that it enables novel always-on debugging
techniques. The insight here is that because we only keep per-chunk state, okay,
we don't keep load and store information anymore, you can cut down a lot of the
state that tools need to support. Okay. So this means that for deterministic replay of
parallel programs, as I will discuss, you don't need to worry about this load happened
here, this store there; all you need to know is chunks. Okay.
So we'll see that we can eliminate the whole log, actually. And also in data race
detection, right, here we're going to have to cut chunks at sync points. This is the only
exception as far as the software-visible effect is concerned.
But if you do that, then you don't need to worry about the stuff within the
synchronization block, between synchronization points. And then there is something else,
if I have time, I want to talk about, which is the next step: what if I make the signatures
visible to the software? So far I said the signatures are not visible. What if they are?
Then I can do lots of fancy stuff that has to do with watching addresses. Okay.
Pervasive monitoring for debugging, and even novel compiler optimizations like, you know,
disambiguating ranges of addresses automatically.
>> Question: (Inaudible) -- wonderful section on the text.
>> Josep Torrellas: I'm sorry.
>> Question: You'll also introduce wonderful new side channel attacks for any of these
chunks that are running cryptographic operations.
>> Josep Torrellas: If you need the signatures available, you mean?
>> Question: Yes.
>> Josep Torrellas: Yes. So that's an issue. You get -- well, you'll have to do it in a
way that you always -- okay. I cannot give you an opinion on the security and attacks,
but certainly you could mess up things in not doing it very efficiently.
>> Question: (Inaudible) -- privileged operation.
>> Josep Torrellas: Probably not (inaudible). Anyway, so many --
>> Question: (Inaudible) -- give up liberty for a little security deserve neither.
>> Josep Torrellas: Okay.
>> Question: So --
>> Josep Torrellas: Huh? And I think you could think of many new tools that
give you new functionality thanks to having the signatures visible. Anyway, let me
just go over these programmability issues first and then conclude the talk.
All right. So, the issue of sequential consistency. As I said before, formal verification tools
hopefully will become more and more pervasive, and they like to have sequential
consistency; otherwise, the state explodes.
Another thing is that there are all these issues about semantics whenever you have data
races. Okay. If you have sequential consistency, all these discussions of what is the
semantics when I have data races, what does the Java memory model say, and so on, all
this disappears, because it's very clear what a data race does in a sequentially consistent
machine.
Okay. Especially for safe languages, all this stuff disappears.
Another thing is that it is much easier to debug parallel code. Try to debug
a parallel code that has some kind of a data race
problem on a sequentially consistent machine versus one that is not. So that's kind of
another main reason.
And finally, as Jim Larus always tells me, there are some guys who always want to use
hand-crafted synchronization, and of course they don't like to have unexpected results.
All right, and having sequential consistency (inaudible). Anyway, so here is a machine
that could support this at low cost.
Now let's look at some of the tools that are enabled by working on chunks. Okay. One is
deterministic replay of parallel programs, very efficiently. So what is deterministic replay?
During the execution of a parallel program, the hardware
records into a log the order of the dependences that existed between the threads. Okay.
So I run a code and I have this hardware that logs all the dependences. The log in
a sense has captured the interleaving of the threads. All right. So then when I replay the
code, I simply re-execute the program enforcing the dependence order
encoded in the log. Okay. So that's the idea of deterministic replay of parallel
programs.
>> Question: (Inaudible) you're recording the order in which the chunks are dependent?
>> Josep Torrellas: Well, I'll get to it, but this is the general idea. The general idea is
you have to record the data interleaving of the threads. Right? So in conventional
schemes what you have is arrows, dependences across threads. So here it says that in
instruction N1, processor 1 wrote A, and in instruction M1, processor 2 read A. So
there is this arrow; if I observe this in the initial execution, then the log of processor 2 has
to have an entry that says: when I get to M1, I have to stall until P1 has finished
N1. Okay.
So I have an entry for each of these arrows, and that encodes what happened in the
original execution. Of course, people have come up with very fancy ways of combining
arrows so that you have fewer entries. Okay. But this is what you need, and you potentially
have very large logs because you need to worry about all the dependences.
Now if you have Bulk, then the log that is necessary becomes minuscule. Okay. During
execution I don't worry about dependences, I just do chunks. So if a bunch of
arrows go from this guy to this guy, and this is a chunk and this is another chunk, I don't
need to store those arrows. All I need to store is the order of chunks. All right.
So my log, the log of all the processors together, not just of processor 2, is simply
the history of chunk commits. Okay. Chunk from P1, then P2, then Pi and so forth. Okay.
So that gives the log an order-of-magnitude reduction. However, I can be even smarter
now. I could say, you know, what if the arbiter, the one that takes the chunks from
different processors, has a built-in algorithm, okay, and says: I'm going to take chunks
round robin. A chunk from this guy, a chunk from this one, from this one, from this
one. Of course, I'm going to pay with some performance overhead because I may have
load imbalance. The processor whose turn it is to commit may not have its chunk done. Okay.
But if I do this, say I fix the chunk commit interleaving, then I have completely eliminated
the log. There is no log necessary, because the arbiter knows how to replay the code,
okay; it just says: everybody execute, and I'm going to take this chunk first and this one,
this one, this one and so forth. So what you could do is, whenever you start
debugging the program, right, use this mode, okay, to find many bugs. You may have to
run for a day to find the bug. Once you find the bug you deterministically replay and you
fix it. Hmm.
And once you fix most of the bugs then you can move to the other mode where you have
a bigger log. Okay. Does it make sense?
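A small sketch of the two replay modes just described. The function names and the
arbiter interface are hypothetical; the point is only that the log shrinks to a sequence of
processor IDs, and disappears entirely under a fixed round-robin commit policy.

```c
#define MAX_LOG (1 << 20)
static int  commit_log[MAX_LOG];   /* recorded commit order: processor IDs */
static long log_len, replay_pos;

/* Recording mode: the arbiter appends one small entry per chunk commit. */
static void record_commit(int proc) { commit_log[log_len++] = proc; }

/* Replay mode: the arbiter grants a commit only in the recorded order;
   the caller advances replay_pos once the granted commit completes. */
static int may_commit_replay(int proc)
{
    return replay_pos < log_len && commit_log[replay_pos] == proc;
}

/* Fixed round-robin mode: the schedule itself is the "log", so nothing is
   recorded at all, at some cost in load imbalance. */
static int may_commit_round_robin(int proc, int n_procs, long commits_so_far)
{
    return proc == (int)(commits_so_far % n_procs);
}
```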
If you think about it, what I'm doing here is limiting the possible interleavings.
Certainly at a fundamental level chunking already limits the interleavings. Okay.
These are the only possible -- this is one of the possible interleavings, okay. So in one mode
I limit interleavings at largely no performance cost and I get a small log.
The other mode has a performance cost, but it eliminates the log completely.
So that's one example of what is possible. Another one: if you want to detect data
races, here what I would do is cut the chunks at sync points. So suppose I have
synchronizations like this, locks and unlocks and so forth. I would cut the chunk every
time there is a synchronization operation. Okay.
Then I assign an ID to each of the chunks, okay, that follows the happens-before
order. So the ID of this guy is a successor of this guy, which is a successor of the other
guy. Within a thread it's clear. Across threads I use the synchronization operations to
order chunks. So this guy has to be a successor of this one here: the guy that releases
the lock has to be a predecessor of the one that acquires the lock next. So how is this
done?
You amend the software data structure of the lock to have
an extra field that specifies an ID. When a chunk is about to release the lock, it stores its
own ID into the software structure; then the next chunk that grabs the lock, after grabbing
the lock, reads this ID and sets itself to be a successor of this guy. Okay. So this guy's
ID is a successor of both this one here and its own predecessor in the thread. You
have a partial order of chunks, basically, right?
Because of transitivity, this chunk now is a successor of this one up here. Okay. So this
guy is a successor. But this guy here is not a successor of this one right here. So how do I
detect data races? When I detect communication between chunks, and I can
use signatures to detect that, okay, or I could use plain loads and stores, but let's say I
use signatures, then if the signatures intersect, I see these two guys have communicated,
and I compare the IDs. If the IDs are ordered, it means that there is a
synchronization between these two chunks and therefore it is not a data race.
While if they are unordered, then I found a data race between these two chunks. All right.
The question then will be: what exactly are the accesses that caused the data race? There's
some subtlety to it. If you work with signatures you may have some false positives. Okay.
You have to do some extra work. But that's the idea. All the references
between two synchronizations belong to the same equivalence class,
and you can treat them together.
I can do the same thing for flags and for barriers. In the case of barriers, all the successor
chunks in all the threads become successors of all the chunks that led to the barrier.
Make sense?
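The talk only requires chunk IDs that capture the happens-before order; one concrete way
to realize that in a sketch is to use vector clocks as the IDs, with the lock structure
augmented to carry the releaser's ID. The thread count and all names here are
assumptions, not the hardware encoding.

```c
#define NT 8   /* number of threads, assumed */

typedef struct { unsigned c[NT]; } chunk_id_t;          /* vector clock as ID */
typedef struct { chunk_id_t release_id; } bulk_lock_t;  /* lock + extra field */

static void id_join(chunk_id_t *d, const chunk_id_t *s)
{ for (int i = 0; i < NT; i++) if (s->c[i] > d->c[i]) d->c[i] = s->c[i]; }

static int id_before(const chunk_id_t *a, const chunk_id_t *b)   /* a <= b? */
{ for (int i = 0; i < NT; i++) if (a->c[i] > b->c[i]) return 0; return 1; }

/* Chunks are cut at these points. Releasing stores my ID into the lock;
   acquiring makes my next chunk a successor of the releaser's chunk. */
static void on_release(bulk_lock_t *l, chunk_id_t *me, int tid)
{ me->c[tid]++; l->release_id = *me; }

static void on_acquire(const bulk_lock_t *l, chunk_id_t *me, int tid)
{ id_join(me, &l->release_id); me->c[tid]++; }

/* When the signatures of two chunks intersect (they touched common data):
   ordered IDs mean a synchronization separates them; unordered IDs mean a
   potential data race, modulo signature false positives. */
static int is_data_race(const chunk_id_t *a, const chunk_id_t *b)
{ return !id_before(a, b) && !id_before(b, a); }
```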
All right. So now I'm ready to get to the final part, and that is: what if we make the
signatures visible to the software, okay, through the ISA? Okay. And I claim here that
you can have pervasive monitoring, okay, and support numerous watchpoints and so
on. Or you can have novel compiler optimizations such as -- I'll show you -- code motion
and so forth. Okay.
So let's look at the two things one at a time: the monitoring abilities and the code
optimizations. So in pervasive monitoring the idea is very simple. If you have the ability
to insert an address into a signature in software and then disambiguate this address
against all the accesses that you do, then you're effectively watching this address. Okay.
So here is the idea. You would say: I want to watch this memory location and trigger a
monitoring function that the user wrote every time somebody touches this address; okay,
this monitoring function could be checking an assertion or whatever you want. Okay.
So you would have something like this. Watch this address. You would have a system
call: watch this address and execute this monitor that the user wrote. Okay. Every
time somebody, a stray pointer or whatever, touches this thing. Uh-huh.
So when you execute the system call, then inside the system call you would have an
instruction that says: stuff this address into a signature. Okay. That would put it there,
and from now on keep watching this address. If anybody reads or writes this
address, give me an exception, okay, or something. And so you go here, and suppose
this stray pointer ends up updating this location, okay. At this point you get an
exception, and you could choose, for example, to execute this monitor in a different thread,
which could be, say, an assertion, checking an assertion that the state is still
consistent. Okay.
So that gives you the ability to watch a certain location, or a range of locations, or a
group of locations, with some false positives, of course. So the first thing you want to do is
to check that indeed this is an address that you want to watch.
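A hypothetical sketch of this watchpoint flow, reusing the signature helpers above. The
API names (watch_address, the monitor hook) are invented for illustration; in the proposal
the insertion and the check would be an ISA instruction and a hardware exception.

```c
typedef void (*monitor_fn)(uint64_t addr);

static signature_t watch_sig;     /* software-visible signature register */
static monitor_fn  user_monitor;

/* The system call's core: stuff the address into the watch signature. */
static void watch_address(uint64_t addr, monitor_fn fn)
{
    user_monitor = fn;
    sig_insert(&watch_sig, addr);
}

/* Conceptually done by hardware on every load/store; here the "exception"
   is a plain call, possibly dispatched to another thread in the proposal. */
static void on_access(uint64_t addr)
{
    if (user_monitor && sig_may_contain(&watch_sig, addr))
        user_monitor(addr);   /* monitor should first re-check the exact
                                 address: false positives are possible */
}
```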
>> Question: How do you encode a range into the (inaudible) --
>> Josep Torrellas: You would then have to have some additional logic. You would
have to have, say, a min and a max. Okay. The logic currently is very simple, right? It's
just an intersection of signatures. You could think of slightly more complicated
functional units that simply check for max and min. Right?
Is this less than the max and greater than the min, within this range, and then give the
exception.
>> Question: How do you (inaudible) --
>> Josep Torrellas: How do you turn it off? Oh, very simple. If you have
instructions, you can do anything you want. Right?
>> Question: (Inaudible) -- take it out of the set, because of the collisions.
>> Josep Torrellas: Sorry?
>> Question: You can't take it out of your set because there could have been a collision
on (inaudible) --
>> Josep Torrellas: Okay. You mean, what if I want to take a single address out of here, a
single one, and keep the rest in there. Right?
>> Question: We had a watch and now I'd like to (inaudible) again.
>> Josep Torrellas: Okay. So you'd get a false positive, right? From here
you cannot generate the exact set of addresses, so you cannot remove this address
from here. All right. So you'd have to live with that and do a check every time you hit
it.
>> Question: You can do it if you have counters for (inaudible) --
>> Josep Torrellas: You can, if you count --
>> Question: (Inaudible) --
>> Josep Torrellas: If you complicate the thing, yes. So if you have a counter of how
many addresses map to each bit, you can remove the thing by decrementing the count by
one. Uh-huh.
Okay. Compiler optimizations. Suppose you have new instructions that say begin and
end collecting addresses. All right. So I have a section of code and I say: begin collecting
addresses into this signature. Of course, you have a small register file of signatures,
very few. All right. And I want the signatures to be visible to the software for
compiler optimizations. Okay.
Mostly for disambiguation of groups of addresses at run time. Okay.
Ranges -- or not ranges, sets of addresses. So I can have begin collect, end
collect. What this will do is, at run time, all these addresses will be hash-encoded
into the signature. And you could say only the reads perhaps, or only the writes, or both.
Right?
Then another instruction is begin disambiguate. And the idea is that you say: from now
on, start disambiguating against the signature. Okay. And end disambiguate. What
this does is, at run time, any of these loads and stores will be intersected against
the signature, and if there is any collision you can get an exception or you can set a bit,
for example. All right.
That's local disambiguation. You can also have remote disambiguation, where you
start remote-disambiguating against the signature. That means any incoming
invalidations or any incoming reads, coherence actions from another processor, let them
intersect against the signature.
So then if somebody does some stuff that collides with the signature, you can again get
an exception or set a bit. This is very useful for, say, transactional memory. Right?
What you would do is begin collect, end collect, so you would
keep collecting into your signatures the addresses that you're accessing, and at the
same time begin remote disambiguate, and you would be watching for incoming
things that collide with your addresses. All right.
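Since these instructions don't exist in any real ISA, here is a set of stub "intrinsics" that
model their behavior in software, plus a transactional-memory-flavored usage. Everything
here (the names, the four-entry signature register file, the conflict bits) is an assumption
for illustration; real hardware would do the collect and intersect steps implicitly on every
access.

```c
static signature_t sig_reg[4];      /* assumed signature register file */
static int         conflict_bit[4];

static void begin_collect(int r)              { sig_clear(&sig_reg[r]); }
static void collect(int r, uint64_t addr)     { sig_insert(&sig_reg[r], addr); }
static void end_collect(int r)                { (void)r; }

/* Local disambiguation: hardware would run this check on each local access. */
static void begin_disambiguate(int r)         { conflict_bit[r] = 0; }
static void local_access(int r, uint64_t addr)
{ if (sig_may_contain(&sig_reg[r], addr)) conflict_bit[r] = 1; }
static int  end_disambiguate(int r)           { return conflict_bit[r]; }

/* Remote disambiguation: the same check on incoming coherence actions. */
static void begin_remote_disambiguate(int r)  { conflict_bit[r] = 0; }
static void remote_access(int r, uint64_t addr)
{ if (sig_may_contain(&sig_reg[r], addr)) conflict_bit[r] = 1; }
static int  end_remote_disambiguate(int r)    { return conflict_bit[r]; }

/* Transactional-memory-flavored usage: collect my footprint and watch for
   colliding remote accesses; a set conflict bit means abort and retry. */
static int run_transaction(void (*body)(void))
{
    begin_collect(0);
    begin_remote_disambiguate(0);
    body();               /* hardware calls collect(0, addr) per access */
    end_collect(0);
    return end_remote_disambiguate(0) == 0;   /* 1 = committed, 0 = retry */
}
```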
So what can you do with this? I gave the example of transactional memory. Another one
would be function memoization. Okay. So you have something like this: could I
eliminate this call to foo, you know, skip it? I called it here with a certain
input, and I call it again, and suppose it's the same input. Okay. Can I just not call it
again and just, you know, use whatever I generated last time, even though there are a
bunch of pointer accesses in here? Can I do that?
Well, I can say begin collect, end collect. Okay. That's going to put all the addresses
that this foo is accessing into a signature. Then begin disambiguate, end
disambiguate.
If by the time I get to this point there hasn't been any conflict, it means that this code
didn't mess up any of the state that foo generated. So I could just skip the call to foo,
if it's the same input.
Okay. There's one detail here that is missing. What if foo
overwrites a value, right? Then I could not do this. If foo reads something
and overwrites it, then I cannot just skip it here, because it has side effects on
itself. Hmm. Does it make sense?
So how do I do this? Rather than collecting into one signature, I collect all the reads and
all the writes separately, and then I intersect the two, and if the intersection is nil, then I
know it doesn't have this overwriting effect. Uh-huh. So there is an extra condition to be
able to do this. Okay.
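A sketch of this memoization guard under the stated conditions: same input, read/write
signatures collected during the previous execution, no conflict flagged since, and an
empty read-write intersection (no self-overwriting). foo() here is a toy stand-in for the
memoized function, and the collect step is shown as comments where hardware would fill
the signatures.

```c
static long foo(long x) { return x * 2; }   /* toy stand-in for the function */

static signature_t foo_R, foo_W;  /* collected while foo ran (by "hardware") */
static int  foot_conflict;        /* set if anyone touched foo's footprint   */
static int  have_cached;
static long cached_in, cached_out;

static long memoized_foo(long x)
{
    if (have_cached && x == cached_in && !foot_conflict) {
        /* Extra condition from the talk: foo must not overwrite what it
           reads, i.e. the read/write signatures must not intersect. */
        signature_t self = sig_intersect(&foo_R, &foo_W);
        if (sig_is_empty(&self))
            return cached_out;    /* skip the call, reuse the last result */
    }
    sig_clear(&foo_R); sig_clear(&foo_W);
    /* begin collect: each read/write inside foo would sig_insert() into
       foo_R / foo_W; afterwards, any store elsewhere that hits them would
       set foot_conflict (the disambiguation step). */
    cached_out = foo(x);
    /* end collect, begin disambiguate */
    foot_conflict = 0;
    cached_in = x; have_cached = 1;
    return cached_out;
}
```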
>> Question: (Inaudible) -- intersection, given the (inaudible).
>> Josep Torrellas: Oh, well, this was a sequential program. If you have a parallel
program then you have to do remote disambiguation: watch that nobody else messes
this up, right? And, you know, keep receiving the invalidations for that.
>> Question: Foo reads and writes the stack, so this is always going to look like it
made (inaudible)
>> Josep Torrellas: Right. Right. So you mean it writes and then reads the stack,
right? If it has a private variable that it writes and then reads, true. If it does that, then
you could skip the second call, certainly, but using the signatures it would appear
that you have a dependence there. Make sense? That is what your point is, right?
Fine.
So you would not memoize it. But you could give this (inaudible) and say: look, I'm not
going to put into this signature this range of addresses that are stack addresses; it's
private stuff, I'm not going to put it into the signature. Some day hardware designers will
realize that this is very useful for software designers. It's very simple to do this.
Another example is -- I guess I mentioned this already -- having many watchpoints. So
here you can say: put this address into the signature, put that address, put all the
addresses that foo is accessing, and then watch whether anybody
accesses them.
Another one: loop-invariant code motion. Can I move this expression out of
this loop? Okay. It would be good, because then my (inaudible) would speed up 5%, for
example. Right? Well, let me move it out, and then begin collect, end collect;
begin disambiguate, end disambiguate; and then check: did I get any
conflict?
And if I did get a conflict, you know, I would have to roll back, certainly. Otherwise, I'm
done.
Okay. So signatures can be used in two ways. One is: get to a point, check the signature,
and based on that decide whether I go this way or that way; that's the simplest way. The
second one is: do something that is unsafe, check the signature, and then decide whether
what I've done is incorrect and I have to redo it in a different way.
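And a sketch of the guarded loop-invariant code motion, using the stub intrinsics from the
earlier sketch. expensive() is a toy stand-in for the hoisted expression; the explicit
collect/local_access calls model what the hardware would do implicitly, and the conflict
bit triggers the conservative re-execution just mentioned.

```c
/* Toy stand-in for the candidate loop-invariant expression. */
static double expensive(const double *p) { return (*p) * (*p); }

static double sum_with_hoist(double *a, const double *p, int n)
{
    begin_collect(1);                 /* record what expensive() reads */
    collect(1, (uint64_t)(uintptr_t)p);         /* done by hardware */
    double inv = expensive(p);
    end_collect(1);

    begin_disambiguate(1);            /* check the loop's stores against it */
    double sum = 0.0;
    for (int i = 0; i < n; i++) {
        local_access(1, (uint64_t)(uintptr_t)&a[i]);  /* done by hardware */
        a[i] = inv;
        sum += a[i];
    }
    if (end_disambiguate(1)) {        /* a store aliased expensive()'s input: */
        sum = 0.0;                    /* the hoist was unsafe, redo it plainly */
        for (int i = 0; i < n; i++) { a[i] = expensive(p); sum += a[i]; }
    }
    return sum;
}
```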
>> Question: The rollback is kind of a gotcha there.
>> Josep Torrellas: It's a (inaudible). You can --
>> Question: That's a lot of mechanism out there before you can get the rollback.
>> Josep Torrellas: Okay. So the question here is: yes, you need rollback. So you can
use shadow variables, perhaps, at the cost of some extra copies and so forth. Or
otherwise don't use signatures in this mode; use them in the other mode, where you do the
check first and then act. That wouldn't work in this case here. Okay.
All right. So you can think of many optimizations. We're just getting started here. Many
compiler transformations rely on run-time disambiguation of groups of addresses. This
gives you a nice primitive (inaudible).
All right. To summarize, this is what we are after here. We're after a machine
that has hundreds, a hundred processors, all cache coherent in this case, okay, or you
could just have groups of cache-coherent processors, okay, where the novelty is not in the
cores. The cores actually can be designed quickly: take the old processor from a
company that is your competitor; it doesn't even matter what memory consistency model
it supports. Shrink it. Stuff it in there. As long as it supports the bulk operation,
everything is moved to the cache hierarchy.
You get high performance. This is completely transparent to the software. You don't need
to change the software stack at all. Okay. Signatures are used for disambiguation, cache
coherence (I gave you a glimpse of how the whole system works with
signatures) and then, if you want, even for compiler optimization. The machine supports
sequential consistency without any fancy hardware, no snoops, no costly operations in
the core.
There is room for many new tools that focus on chunks, okay. For example, as I said
before, deterministic replay with no log, data race detection with very low overhead,
pervasive monitoring, this watching of addresses, and then compiler optimizations if you
make the signatures visible through the ISA itself.
That's kind of the high-level view of the project. As you can see, there are many different
parts. We're even building an FPGA prototype of this, believe it or not. Each student is
working on a different part of this project. There's room to collaborate with Microsoft if
there is interest, and I think that many programming and compiler issues remain open.
You can see that the emphasis of this work has so far been the hardware techniques to
do this. There is all the software that we haven't touched. Okay.
So the next step would be: what if you make some of this stuff visible to the
software?
>> Doug Burger: Great. Thank you very much.
(applause)
We should have time for some questions if you have them. They were all asked very
methodically.
>> Question: So is your FPGA basically a single core or multiple cores,
how much are you going to try to do?
>> Josep Torrellas: Okay. So this is a very simple core. It's a multi-threaded single
issue core.
(Laughter)
In order. Right. And so this is a very simple thing; you need multiple of those. Right?
You just have to chunk them. Okay. And have a way of ordering the chunks in the
cache hierarchy. Most of the work is in the cache area.
>> Question: Do you think at the end of the day this makes running parallel programs
more or less energy efficient?
>> Josep Torrellas: Well, so it has some sources of energy inefficiency. Right?
>> Question: And some efficiencies, you know, in a sense, depending on the workload.
>> Josep Torrellas: So the inefficiency comes from -- did you ask about sequential
code or parallel code?
>> Question: Parallel.
>> Josep Torrellas: Parallel code. If you have fine-grain sharing then you run the risk
of squashing. Right? So that's one of the problems. Right? This is where I think most
of the extra consumption will come from. What are the savings?
First, no snooping of the core, really. So this part gets simplified.
How much simplification? That is an open question. You don't need to snoop things.
Invalidations don't have to come all the way up; they only have to look up the cache.
So that's a major gain.
Working with signatures also has an advantage: you don't send individual invalidations,
okay. Although the fact that you have signatures means that you invalidate additional
locations that you shouldn't, potentially causing extra misses, so some energy is lost
there.
Commit is very efficient, okay. We didn't have time to talk about this, but
multiple chunks can be committing at a time. Okay. Updating directories, sending the
signatures around. As long as those chunks don't have any intersection. So all the arbiter
has to do is keep, in the arbiter hardware, the set of currently committing
signatures, and when a new one comes in requesting to commit, simply intersect it with
all of them. If there is no intersection, you say, okay, you're done, you're committed, and
in parallel do the propagation.
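A sketch of that arbiter-side overlap check, which is what permits concurrent commits;
the table size and the interface are assumptions.

```c
#define MAX_INFLIGHT 8   /* assumed arbiter capacity */

static signature_t committing[MAX_INFLIGHT];
static int         slot_busy[MAX_INFLIGHT];

/* A new commit request proceeds in parallel iff its write signature does
   not intersect any currently committing signature. Returns the granted
   slot, or -1 to retry later. */
static int arbiter_try_commit(const signature_t *w)
{
    for (int i = 0; i < MAX_INFLIGHT; i++) {
        if (!slot_busy[i]) continue;
        signature_t x = sig_intersect(w, &committing[i]);
        if (!sig_is_empty(&x)) return -1;   /* overlap: wait and retry */
    }
    for (int i = 0; i < MAX_INFLIGHT; i++)
        if (!slot_busy[i]) { committing[i] = *w; slot_busy[i] = 1; return i; }
    return -1;                              /* table full: retry later */
}

static void arbiter_commit_done(int slot) { slot_busy[slot] = 0; }
```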
So one thing that we're watching carefully is that this should not slow down single-thread
performance. There is arbitration that occurs irrespective of whether you have a
single thread or multiple threads; this should not slow down the system.
>>: Great, that's it. Thank our speaker again.
(applause)
>> Josep Torrellas: Thank you.