>> Doug Burger: So it's my great pleasure to introduce Professor Josep Torrellas from the University of Illinois today. Josep is a very well-known, distinguished and senior member of our community. He's an IEEE Fellow. He received his PhD from Stanford in the early '90s, and he's done seminal work in speculative multithreaded architectures, processor and memory architectures, debuggability, and programmable reliable architectures. His new work looks at bulk sequential consistency and is really exciting stuff that's going to help us tackle the parallel programming problem. So I'd like to thank Josep for coming and sharing his most recent work with us today; it's a real pleasure to have him here. I will. Sorry, we don't want that vicious Pacific Northwest sun burning you as you give your talk.

>> Josep Torrellas: Very good. Thank you very much. Thank you very much, Doug. I'm very pleased to be here. Thank you for hosting the visit, and thank you to all the attendees. It's a small group here; hopefully you will find some of this work interesting and useful. The work I'm going to present is a big project that I have with my students on designing the new multicore architecture of the future, hardware and software, with a big emphasis on programmability. Many people contributed to this work who are not at Illinois anymore. These are some of my students who contributed; in particular, you will see a lot of work by Luis Ceze and James Tuck. So the project is called the Bulk Multicore for Programmability. Let me start -- and by the way, please stop me any time if I say something wrong or it doesn't make sense, okay? I'd like to have a discussion, just a simple one. Let me start with what I think are the challenges that we are facing as we move toward the multicore era. The first one, probably in the minds of everybody here: 100 cores per chip are coming and there's precious little parallel software -- certainly on the client side. Which means that for system architects, both hardware and software, our main priority has to be to design a system that supports a programmable environment. Of course, this is very cheap to say, but what does it need? The most user-friendly concurrency and consistency models that we know of, and we should be supporting them efficiently. And also, I think this is very important, provide an always-on, production-run debugging environment, because most of the people who will be using these machines are not going to be experts, so we need to help them a lot, hold their hands a lot. Another challenge that I see is that decreasing transistor sizes are making cores harder and harder to design. The reason is that as you decrease the feature size there are more and more design rules that we need to satisfy, and satisfying them all is getting harder and harder. So there will be a tendency to take an existing core and, to a large extent, just shrink it, which means that in my opinion cores are becoming commodities more and more. There will be big cores, small cores and specialized cores, and the innovation and the value added of the chips will be mostly in the cache hierarchy, the network, power management and so forth. I think this is an opportunity for us, because designers are going to be, I think, willing to put in this extra support, and certainly able, because there's enough silicon. You'll see in the design of the Bulk Multicore why this is important.
Another issue is that we will be adding accelerators. This is something that we see coming, and these accelerators will want to have fine-grain sharing with the main cores, which means that the accelerators will ideally need the same simple interface to the cache-coherent fabric as the main processors. Which means that if you want to exploit them, we're going to need, in my opinion, simple memory consistency models. So in this context, let me give you a brief overview of a possible vision of the multicore of seven to ten years from now. 128-plus cores per chip. Since people will be providing shared-memory programming models, what we want from the hardware is to support shared memory. Certainly I would support shared memory -- perhaps groups of processors, islands of processors. Certainly we can have hardware enforce interleaving restrictions imposed by the language and the concurrency model; the hardware and the language and concurrency model will have to be designed together. And I will make the case for sequential memory consistency: we are able to provide hardware to support this at a very reasonable cost. At the same time, there is an always-on debugging environment that can be on all the time, even for production runs. This has many components, some of them near and dear to the work done here: deterministic replay of parallel programs -- we can do it with practically no log at all; data race detection -- we can do it at production-run speeds with very little slowdown; and pervasive program monitoring, as I will discuss. So with this context, this talk is about the Bulk Multicore. We even have a picture of the Bulk Multicore. So what's the main idea of the Bulk Multicore? It's very simple. Let's say that we eliminate the ability to commit individual instructions one at a time, and we don't provide architectural state after every single instruction. This was, and is, a fundamental assumption of current processors: you execute out of order, but instructions commit in order, and after every instruction you've got to be able to provide the architectural state to the rest of the machine. As we move to multiprocessors, this becomes harder and harder, and this is the root of all the problems of memory consistency. Loads and stores now take a long time, you have many of them outstanding, and while this happens something else may get messed up. And because you want to keep this between-instruction architectural state, you have to worry about memory consistency models, stalls, ordering, overlaps and so on. Let's say we (inaudible), and instead what we have is processors that only commit chunks of instructions at a time -- say, for example, 2000 dynamic instructions. The key phrase here is "dynamic instructions," because this is not something that is visible to the software. You give a thread to a processor; the processor starts executing, say, 2000 instructions, and then says: okay, now it's time for me to make this state visible, and if somebody wants it, here is my architectural state. Then I continue executing 2000 more instructions and so forth. So that's the idea.

>> Question: What if a synchronization operation is in there?

>> Josep Torrellas: Doesn't matter. In fact, you'll see that a fence becomes a no-op.
Synchronization will still work, except that you're not going to know where inside the chunk the synchronization occurs. So that's the idea: chunks execute atomically and in isolation. People know what this means, but just to make sure: atomically means that nothing that I do within these 2000 instructions will be visible to the rest of the world until the end. So if you grab a lock, nobody is going to see that you grabbed the lock until the end, when you make the state visible; if you release a lock, same thing. Isolation means that while you're executing this chunk the machine state cannot change under you, which means that if you're executing a chunk and you read a value, and while you're executing the chunk somebody else writes that location and commits, you're going to get squashed. You just cannot let that chunk complete. Okay, atomicity and isolation. At the basic level we'll use buffering of the chunk's data, and squash and redo if we have to. But most importantly, we're going to use something called hardware signatures, which are a very simple hardware primitive that will enable efficient execution, atomically and in isolation. Now, if this sounds familiar to you, like transactional memory -- it's not the same. It's quite different. You could think of this as being transactions all the time, in hardware, invisible to the software. So this doesn't say anything about the software. You can have object-oriented programs on top, you can have functional programming, anything you want; this is invisible to the software. It's just a way, as you will see, to execute efficiently. And it's not a way to parallelize programs, either. What I do here is I give a thread to a processor: you take a parallel program, I give a thread to each processor, and each processor will chunk it to execute efficiently -- and this should work with synchronization. It works with locks, barriers, it works. You will see that the advantages over current processors are that you improve programmability quite significantly, you get higher performance than current processors, and simpler hardware. If this seems too good to be true, maybe you can wait until I say some of these things and then you can argue with me if you don't agree. So that is what I will do in the rest of the talk: try to convince you that if we just get rid of that thing that is kind of a fundamental tenet of current processors, we get these three and we don't need anything else.

>> Question: So to go back to the synchronization point, can you address how you handle something like a swap instruction that has read-modify-write semantics?

>> Josep Torrellas: Oh, that's fine. If two processors do this, only one will succeed. Right? Which one will succeed? The first one that commits its chunk.

>> Question: Right. So you're basically saying that a swap instruction, anything that has atomic semantics to memory, is in essence going to be cached.

>> Josep Torrellas: Yes. Everything is cached until commit, as you will see. This works with locks, barriers, whatever. The only difference is that a chunk now may have spins, right?

>> Question: Uh-huh.

>> Josep Torrellas: So, useless instructions -- you're spinning longer. If the guy who releases the lock doesn't release it until the end of his chunk, then this guy may spin longer, or there may be squashes. Rich?
>> Question: I was going to say, this seems very optimistic, and I'm just wondering how well it's going to work if you have a lot of contention?

>> Josep Torrellas: Okay. So the high-level idea is: the finer the grain of the sharing, the more squashes you will have. Think about it: if you have a program that works like this -- this guy sets something, this one reads it, this one sets -- you're not going to get that exact interleaving. You're going to get either this, or this, and what you will have is spinning in these other chunks, or squashes. But I want to live with this. I want to live in an environment where, if you have fine-grain sharing, you're going to reduce the chunk size, right? If you have coarser grain, you move to bigger chunks, because it's much more efficient.

>> Question: (Inaudible) --

>> Josep Torrellas: You could. In this case we don't do it -- currently we just use 2000 dynamic instructions -- but if you find a lot of squashing, you could dynamically reduce it.

>> Question: Are you going to talk about exceptions and I/O?

>> Josep Torrellas: I can very briefly say what the idea is.

>> Question: Yeah.

>> Josep Torrellas: I'll get to it in a second. Let me just go over this first. So in the rest of the talk I will explain what the Bulk Multicore is, how it improves programmability, and then revisit the vision of the multicore. So how does this work? We use something called hardware signatures. What's the idea? Imagine that every time a load or store retires from the reorder buffer, there is hardware that takes the address you're loading from or storing to and automatically and transparently hash-encodes it into a signature. Something like this: take the address, do some permutation, and then hash it into here. You set multiple bits for a given address. Another load or store, you do the same thing. So this is a Bloom filter, really. This signature collects a kind of footprint of the addresses you've read or written. You have a read signature and a write signature: all the reads go into one, all the writes go into the other, and these summarize the footprint of a chunk. So each chunk has a read signature and a write signature. Again, transparent to the software; the software doesn't see the signatures. Suppose at the same time that in the cache hierarchy of the processor, close to the L1 -- not in the core, but in the cache hierarchy -- I have some very simple functional units that operate very efficiently on the signatures. Suppose I have a bunch of AND gates that take the intersection of two signatures. This result now contains an encoding of the addresses that are common to both signatures. I can check whether it is zero with a few more gates: if any of the fields is zero, I know this signature is empty, which means there is no intersection between the two chunks. I can have membership: is this address part of the signature? Well, I take the address, I use the previous hardware to encode it into a signature, I intersect that with this signature and test for zero, and that tells me whether this address could be in the signature or not. So what I have here is an efficient way of operating on groups of addresses -- very efficient, low overhead. There are some false positives, but it's very efficient.
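As a concrete picture of what this signature hardware computes, here is a minimal sketch in Python. The field count, field width, and hash function are illustrative placeholders, not the actual Bulk encoding, which is a fixed permute-and-hash network in hardware:

```python
# Sketch of a Bulk-style address signature (a banked Bloom filter).
NUM_FIELDS = 4     # independent hash fields of the signature (assumed)
FIELD_BITS = 64    # bits per field (assumed)
EMPTY = (0,) * NUM_FIELDS

def encode(addr):
    """Hash-encode one address: set one bit in each field."""
    return tuple(1 << (hash((addr, i)) % FIELD_BITS) for i in range(NUM_FIELDS))

def insert(sig, addr):
    """Done by hardware as each load/store retires from the reorder buffer."""
    return tuple(s | b for s, b in zip(sig, encode(addr)))

def intersect(a, b):
    """Bitwise AND of two signatures (the 'bunch of AND gates')."""
    return tuple(x & y for x, y in zip(a, b))

def is_empty(sig):
    """If any field has no bits set, no address can be in the signature."""
    return any(field == 0 for field in sig)

def may_contain(sig, addr):
    """Membership test: no false negatives, occasional false positives."""
    return not is_empty(intersect(sig, encode(addr)))
```

A chunk's read and write signatures are then just two such values built up by insert as accesses retire; the commit and squash checks in the sketches that follow reduce to intersect and is_empty.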
If I have this, then atomic and isolated execution becomes trivial. Here's how it works. Suppose I have two parallel threads. As this guy executes, he issues loads and stores. Because I want this to be atomic, I don't let the stores be seen outside: everything that I do is buffered in the cache. I don't send invalidations or anything. But the hardware automatically generates a write signature and a read signature for this chunk. At the same time, this other processor executes its own chunk. As it loads, it loads the correct data -- remember, it cannot load this one here, because that is invisible to it -- it reads the correct data, and it generates its own signatures. Now, what does it mean to execute atomically? When this guy wants to commit, all he needs to do is send his write signature to the other guy and then clear his own read and write signatures. Everything is done; he has committed. Does that make sense? So commit simply means sending the write signature to everybody else -- in this case the other processor -- and clearing your own signatures. Suddenly the writes it had been caching become visible to the rest of the world. You don't write back all the data; the signature itself is enough to keep the whole system consistent. So that's atomic execution. Isolation: in the cache hierarchy of this processor, the signature comes in, and in this functional unit in hardware it is intersected against its own read and its own write signature. In this case you can see that this one will be null, but this one is not null, because this guy stored B and this guy loaded B; therefore this guy will get squashed. If the intersection is not null, this guy has to be squashed if I want isolation. So with this support I have (inaudible) and isolation.

>> Question: So are you required to broadcast the write signature to all the processors, or can you make this scale?

>> Josep Torrellas: Okay. So I would not actually broadcast it. On a bus, I'd just put the signatures on the bus and everybody would see them. In reality, you would have an arbiter that is connected with a directory. You send the signature to the arbiter -- all this works with cache coherence, by the way; the signatures are enough -- and it says: oh, I know that only processors 10 and 16 are interested in the data that could be in this signature, and the signature gets forwarded only to those caches. So the basis of the Bulk Multicore is the combination of chunked execution and signatures. We execute each chunk atomically and in isolation, like this, right? There is a distributed arbiter that ensures a total order of chunk commits. So you don't really send the signature to the processors; you send it to the arbiter. The arbiter very quickly detects whether you can commit, and if so it says: okay, you are committed; then you can clear the signatures, and in parallel it will change the state of the directories and so on. So with this, what happens is we get a certain set of interleavings -- not all of them, of course, but a set of interleavings -- as if we had this guy selecting chunks from each processor. Now, if you have atomic execution, isolation, and a total order of commits, then you have sequential consistency at the chunk level. And if you have sequential consistency at the chunk level, you have it by definition at the instruction level.
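Putting the pieces together, here is a rough sketch of that commit-and-squash flow, reusing the signature helpers from the previous sketch. The Processor class and its method names are invented for illustration, and a real commit would go through the arbiter just described:

```python
# Sketch of chunk commit and the isolation check; illustrative structure only.
class Processor:
    def __init__(self):
        self.read_sig = EMPTY     # footprint of the current chunk's loads
        self.write_sig = EMPTY    # footprint of the current chunk's stores

    def on_load(self, addr):
        self.read_sig = insert(self.read_sig, addr)

    def on_store(self, addr):
        # The store's data stays buffered in the local cache until commit;
        # no invalidations are sent while the chunk executes.
        self.write_sig = insert(self.write_sig, addr)

    def receive_commit(self, incoming_write_sig):
        """Isolation: squash if the committer wrote anything this chunk
        has read or written."""
        if (not is_empty(intersect(incoming_write_sig, self.read_sig)) or
                not is_empty(intersect(incoming_write_sig, self.write_sig))):
            self.squash()

    def squash(self):
        # Discard the buffered chunk state and restart the chunk (elided).
        self.read_sig = self.write_sig = EMPTY

def commit(committer, others):
    """Commit = send the write signature (not the data!) to the other
    caches, then clear the committer's own signatures."""
    for p in others:
        p.receive_commit(committer.write_sig)
    committer.read_sig = committer.write_sig = EMPTY
```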
So now you have sequential consistency in this machine. You don't allow all possible interleavings, but the interleavings you do get are legal. And because I choose the chunk size so that the overhead of commit and so on is small, I have high performance. So the key is: you support sequential consistency with low hardware complexity and high performance. Why do I say low hardware complexity? In current processors, to support sequential consistency you still want to reorder loads and stores. A current processor issues a load, and from the time the load is issued until it retires there is a window of vulnerability: you have to watch for incoming invalidations to support sequential consistency, or any model that allows you to do speculative reordering. Whenever there is an invalidation on the bus, it has to be snooped all the way up to the load buffer of the core. So that is an intrusive effect of consistency on the core itself. Moreover, you cannot freely displace things from the cache: if you displace something from the cache, you could miss an invalidation coming in, so the processor has to squash the load and all its successors. And if an incoming invalidation snoops into (inaudible) and finds a hit, you have to squash the load and all its successors. What I'm proposing here is: ignore all this stuff. Let the loads and stores execute out of order, as long as they make it into the signature, which is in the cache hierarchy. And if in the meantime somebody else sends you a signature, you intersect it, and maybe you get squashed -- it's either everything or nothing. You can see that we have moved all the consistency complexity away from the core and into the cache hierarchy.

>> Question: I guess I don't understand how you recover the addresses that you need. I mean, the Bloom filter encodes the read and write sets into an approximate hardware hash, and you can't recover which cache lines were read or written.

>> Josep Torrellas: Okay, so --

>> Question: Right. And it's highly probable that in the future, if you get to 128 cores, you're not going to have a central bus. You will have a directory at the L2. So how do you convert those signatures into an invalidation, or a --

>> Josep Torrellas: Okay.

>> Question: Squashes?

>> Josep Torrellas: Very good. So from the signatures you have to be able to generate a safe superset of the things that you want to either invalidate or squash. Correct? I showed you a bunch of functional units that operate on signatures before. There is one that we call expansion, and expansion generates a conservative set of cache sets that can contain data belonging to the signature.

>> Question: Right. A superset.

>> Josep Torrellas: A superset. So all you need to do is ensure that this function never misses any data that should be squashed or invalidated for correctness. If you invalidate something extra, at most you're going to have a miss on that line.
>> Question: So if you assume that a load or store has to bring the cache line in before it can operate on it, then at the directory you can say: let me generate all possible cache lines, intersect with the signature, and any processor that is caching one of those lines needs to be squashed.

>> Josep Torrellas: Well, not squashed. First of all, when a processor receives a signature, it does two operations. The first thing is you intersect the signatures to see if you need to squash the chunk.

>> Question: That's assuming you're broadcasting the signatures.

>> Josep Torrellas: No, no, it doesn't matter. If you send it to a processor, you send it to a processor.

>> Question: Uh-huh.

>> Josep Torrellas: Okay. The processor receives the signature.

>> Question: Yes.

>> Josep Torrellas: The first thing you need to know is: do I need to squash this guy? You intersect with your read and write signatures, and if it says yes, you need to squash this one. Then what you do is you squash the thread, and you use expansion on your own write signature to figure out which dirty lines in the cache need to be invalidated. Because when you squash a thread --

>> Question: I see. I see.

>> Josep Torrellas: So you do this expansion using your own write signature and you invalidate those lines.

>> Question: It's a local operation, so --

>> Josep Torrellas: Yeah. In addition, irrespective of whether you get squashed or not, when you receive a signature you need to invalidate from your cache any lines the committing guy could have written, simply because they have become stale. Right?

>> Question: Uh-huh.

>> Josep Torrellas: So you get the signature and you apply a second expansion, on the incoming write signature, and you invalidate those lines from the cache.

>> Question: Yeah.

>> Josep Torrellas: So there are two operations: one you always do, and the second one only if you squash the chunk.

>> Question: So I misunderstood you earlier; I thought you were talking about using the coherence protocol to prevent broadcast of your write set.

>> Josep Torrellas: You do. You do.

>> Question: Okay. To all processors.

>> Josep Torrellas: Whenever you send a signature to the arbiter, the arbiter expands it into the directory -- same idea -- figuring out the possible lines it maps to. Then for each line it knows which processors contain the data.

>> Question: Right.

>> Josep Torrellas: And it sends the signature only to those. You always have to design the system to be on the conservative side: you cannot miss anybody, but if you invalidate something that is clean, a cache --

>> Question: Are you going to show us results, for a large number of cores, on what fraction of the cores a signature gets expanded into, in fact?

>> Josep Torrellas: This talk doesn't have any data. (Laughter)

>> Question: Thank you.
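Those two receiver-side operations can be sketched with the same helpers. Here the cache is a toy dict from line address to a dirty flag, and instead of modeling the expansion unit (which would visit only candidate cache sets), the sketch simply tests every resident line against the signatures:

```python
def handle_incoming(proc, incoming_write_sig, cache):
    """What a cache does on receiving a committed chunk's write signature.
    `proc` is a Processor from the earlier sketch; `cache` maps line
    addresses to a dirty bit (toy stand-in for the L1)."""
    # Operation 1 (always): drop local copies of any line the committer
    # may have written -- they are stale. A false positive costs a miss.
    for line in [a for a in cache if may_contain(incoming_write_sig, a)]:
        del cache[line]
    # Operation 2 (only on conflict): if the incoming writes overlap this
    # chunk's reads or writes, squash it, and use the chunk's own write
    # signature to invalidate its speculatively written (dirty) lines.
    if (not is_empty(intersect(incoming_write_sig, proc.read_sig)) or
            not is_empty(intersect(incoming_write_sig, proc.write_sig))):
        for line in [a for a, dirty in cache.items()
                     if dirty and may_contain(proc.write_sig, a)]:
            del cache[line]
        proc.squash()
```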
>> Josep Torrellas: So that's one thing. The second thing is high performance. This is important, because one could design something with very low complexity and sequential consistency but poor performance. Not so here, because instructions within a chunk are fully reordered by the hardware. Loads and stores can be issued in any way you want inside a chunk. If you have a fence, that fence is a no-op: the fence is there to prevent instructions after the fence from being reordered before the fence, but here, because nobody is going to see anything inside the chunk anyway, you can reorder anything and ignore the fences. In fact, within a thread you can even issue accesses from two different chunks out of order, as long as each chunk has its own signatures and the chunks commit in order. So to summarize what we're discussing here: the gains of this machine are in simplicity, performance and programmability. Why? Hardware simplicity -- you can see -- oops, sorry, something on this thing went back. Okay. Hardware simplicity: remember, consistency support has moved away from the core. I can design a core without worrying about whether it has to support x86 consistency or POWER consistency. It doesn't matter. Give me a core, reorder the loads and stores however you want; as long as it has signatures and it works with chunks, the cache hierarchy will take care of the consistency, and it will support sequential consistency. Designers don't have to worry about this. That's good, because we move toward commodity cores: take a core, do anything you want, it is going to work in your system, because the consistency is now in the cache hierarchy. Same for accelerators. Say we want to put in an accelerator that works very well with fine-grain sharing, for graphics or whatever. Do I need to worry about -- oh, this is going to be for the next x86 machine, so I need to worry about fences? No. Give me what you want, do any reordering you want; as long as it works with chunks, it works. High performance: as I said, full reordering within the chunk. Now, what about programmability? All this is invisible to the software. That's the first thing: you can have any language, any (inaudible) model, anything you want is fine, because this is under the covers.

>> Question: That's -- sorry. The only exception I can think of is if you have a read to (inaudible).

>> Josep Torrellas: Okay, so what happens if you have a (inaudible)? Now is a good time for this. I/O and speculation don't like each other very well. So whenever you have to do I/O, you have to commit the current chunk, then do the I/O, then start a new chunk. That's a forced chunk cut. For exceptions it's not necessary, okay, as long as you can (inaudible) thing here. So, similar to what transactional memory does, to some extent. Now, we support sequential consistency -- I argue -- with low hardware overhead and high performance. Why is this useful? Jim may disagree or not; let's see. I think the main reason this may be useful is that many correctness tools will assume sequential consistency. As we move to an environment where all the machines are parallel, one would expect more and more parallel software, and many tools will hopefully appear that can say: I can prove that this code is correct -- but they have to assume sequential consistency. Some of these tools exist today. If you run the subroutine that you proved correct on a machine that supports x86 consistency, then maybe it's not going to work. It would be nice to have a machine where whatever you prove correct is correct. I will get to this more later. The second important thing is that it enables novel always-on debugging techniques. The insight here is that because we only keep per-chunk state, we don't keep per-load and per-store information anymore.
You can cut down a lot of the state that tools need to keep. This means that for deterministic replay of parallel programs, as I will discuss, you don't need to worry about "this load happened here, this store there" and all those arrows; all you need to record is chunks. We'll see that we can actually eliminate the whole log. The same goes for data race detection: here we're going to have to cut chunks at synchronization points -- this is the only exception as far as software visibility is concerned -- but if you do that, you don't need to worry about the stuff inside the chunks, between synchronization points. And then there is something else I want to talk about if I have time, which is the next step: what if I make the signatures visible to the software? So far I've said the signatures are not visible. What if they are? Then I can do lots of fancy stuff that has to do with watching addresses: pervasive monitoring for debugging, and even novel compiler optimizations, like disambiguating sets of addresses automatically.

>> Question: (Inaudible).

>> Josep Torrellas: I'm sorry?

>> Question: You'll also introduce wonderful new side-channel attacks for any of these chunks that are running cryptographic operations.

>> Josep Torrellas: If you make the signatures visible, you mean?

>> Question: Yes.

>> Josep Torrellas: Yes, so that's an issue. You'd have to do it in a way that you always -- okay, I cannot give you an opinion on the security and the attacks, but certainly you could mess things up by not doing it carefully.

>> Question: (Inaudible) -- privileged operation.

>> Josep Torrellas: Probably not (inaudible). Anyway, so many --

>> Question: (Inaudible) -- give up liberty for a little security, deserve neither.

>> Josep Torrellas: Okay.

>> Question: So --

>> Josep Torrellas: And I think you could think of many new tools that give you new functionality thanks to having these signatures. But let me just go over these programmability issues first and then conclude the talk. All right. So, the issue of sequential consistency: as I said before, formal verification tools hopefully will become more and more pervasive, and they like to have sequential consistency; otherwise the state explodes. Another thing is all these issues about semantics whenever you have data races. If you have sequential consistency, all these discussions of "what are the semantics when I have data races? what does Java say?" and so on -- all this disappears, because it's very clear what a data race does in a sequentially consistent machine. Especially for safe languages, all this stuff disappears. Another thing is that it is much easier to debug parallel code: compare trying to debug a parallel code that has some kind of data-race problem on a sequentially consistent machine versus on one that is not. That's another main reason. And finally, as Jim Larus always tells me, there are some guys who always want to use hand-crafted synchronization, and of course they don't like to have unexpected results; having sequential consistency (inaudible). Anyway, so here is a machine that could support this at low cost. Now let's look at some of the tools that are enabled by working on chunks. One is deterministic replay of parallel programs, very efficiently. So what is deterministic replay?
During the execution of a parallel program, the hardware transparently records into a log the order of the dependences that existed between the threads. So I run the code, and this hardware logs all the dependences; the log, in a sense, has captured the interleaving of the threads. Then when I replay the code, I simply re-execute the program enforcing the dependence orders encoded in the log. That's the idea of deterministic replay of parallel programs.

>> Question: (Inaudible) you're recording the order in which the chunks are dependent?

>> Josep Torrellas: Well, I'll get to it, but this is the general idea: you have to record the data interleaving of the threads. In conventional schemes, what you have is arrows -- dependences between threads. Here it says processor 1 wrote A in instruction N1, and in instruction M1 processor 2 read A. So there is this arrow, and if I observe this in the initial execution, then the log of processor 2 has to have an entry that says: when I get to M1, I have to stall until P1 has finished N1. So I have an entry for each of these arrows, and that encodes what happened in the original execution. Of course, people have come up with very fancy ways of combining arrows so that you have fewer entries, but this is what you need, and potentially you have very large logs, because you need to worry about all the dependences. Now, if you have Bulk, the necessary log becomes minuscule. During execution I don't worry about dependences; I just do chunks. A bunch of arrows go from this chunk to this other chunk, but I don't need to store those arrows. All I need to store is the order of chunk commits. So my log -- the log of all the processors together, not just of processor 2 -- is simply the history of chunk commits: a chunk from P1, then P2, then Pi and so forth. So that gives an order of (inaudible) reduction in the log. However, I can be even smarter. What if the arbiter, the one that takes the chunks from the different processors, has a built-in algorithm and says: I'm going to take chunks round-robin -- a chunk from this guy, then this one, then this one, then this one. Of course, I'm going to pay with some performance overhead, because I may have load imbalance: the processor whose turn it is to commit may not have its chunk done. But if I do this -- if I fix the chunk-commit interleaving -- then I have completely eliminated the log. There is no log necessary, because the arbiter knows the algorithm; when you replay the code, it just says: everybody execute, and I'm going to take this chunk first, then this one, this one, this one and so forth. So what you could do is: whenever you start debugging the program, use this mode to find many bugs. You may have to run for a day to find the bug; once you find it, you deterministically replay and fix it. And once you've fixed most of the bugs, you can move to the other mode, where you have a bigger log. Does it make sense? If you think about it, what I'm doing here is limiting the possible interleavings. At a fundamental level, chunking already limits the interleavings -- those are the only possible ones. So in one mode I limit interleavings at largely no performance cost and I get a small log. The other mode has a performance cost, but it eliminates the log completely.
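A sketch of these two modes, with the arbiter reduced to just its commit-ordering decision; the class names and interfaces are illustrative:

```python
from collections import deque

# Mode 1: record. The log for the whole machine is just the sequence of
# committing processor IDs, appended as the arbiter grants commits.
class RecordingArbiter:
    def __init__(self):
        self.log = []
    def grant(self, proc_id):
        self.log.append(proc_id)          # e.g. [0, 2, 0, 1, ...]

# Replay of mode 1: only the processor whose ID is next in the log may
# commit; everyone else stalls at the end of their chunk.
class ReplayArbiter:
    def __init__(self, log):
        self.pending = deque(log)
    def may_commit(self, proc_id):
        return bool(self.pending) and self.pending[0] == proc_id
    def grant(self, proc_id):
        assert self.pending.popleft() == proc_id

# Mode 2: no log. Commits are granted round-robin; the interleaving is a
# function of the algorithm itself, so replay needs no recorded state.
# The price is load imbalance: a processor may wait for its turn.
class RoundRobinArbiter:
    def __init__(self, num_procs):
        self.turn, self.n = 0, num_procs
    def may_commit(self, proc_id):
        return proc_id == self.turn
    def grant(self, proc_id):
        assert proc_id == self.turn
        self.turn = (self.turn + 1) % self.n
```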
So that's one example of what is possible. Another one: if you want to detect data races, what I would do is cut the chunks at synchronization points. Suppose I have synchronization like this -- locks and unlocks and so forth. I would cut the chunk every time there is a synchronization operation. Then I assign an ID to each of the chunks that follows the happens-before order. So the ID of this chunk is a successor of this one's, which is a successor of the other one's. Within a thread it's clear; across threads I use the synchronization operations to order chunks. So this chunk has to be a successor of this one here: the one that releases the lock has to be a predecessor of the one that acquires the lock next. How is this done? You amend the software data structure of the log -- of the lock, sorry -- to have an extra field that holds an ID. When a chunk is about to release the lock, it stores its own ID into that structure; then the next chunk that grabs the lock, after grabbing it, reads this ID and sets itself to be a successor of that chunk. So this chunk's ID is a successor of both this one here and its own predecessor in the thread. You have a partial order of chunks, basically, right? Because of transitivity, this chunk is now a successor of this one up here. But this chunk here is not a successor of this one right here. So how do I detect data races? When I detect communication between chunks -- I can use signatures to detect that, or I could use plain loads and stores, but let's say I use signatures -- then if by signature intersection I see these two chunks have communicated, I compare the IDs. If the IDs are ordered, it means that there is synchronization between these two chunks, and therefore it is not a data race. If they are unordered, then I've found a data race between these two chunks. The question then will be: what exactly are the accesses that caused the data race? There's some work there -- if you work with signatures you may have some false positives, so you have to do some extra checking. But that's the idea. All the references between two synchronizations belong to the same equivalence class, and I can treat them together. I can do the same thing for flags and for barriers; in the case of barriers, all the chunks after the barrier, in all the threads, become successors of all the chunks that led to the barrier. Makes sense?
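A sketch of this scheme, reusing the earlier signature helpers. Representing the happens-before relation as explicit transitive predecessor sets, and the lock's extra field as a plain attribute, are illustrative simplifications:

```python
class Chunk:
    """A chunk cut at synchronization points, with signatures and an ID."""
    def __init__(self, cid, prev_in_thread=None):
        self.cid = cid
        self.read_sig, self.write_sig = EMPTY, EMPTY
        self.preds = set()                 # transitive happens-before set
        if prev_in_thread is not None:     # program order within a thread
            self.preds |= {prev_in_thread.cid} | prev_in_thread.preds

class Lock:
    def __init__(self):
        self.last_releaser = None          # the extra field in the lock

def on_release(lock, chunk):
    lock.last_releaser = chunk             # store the releaser's identity

def on_acquire(lock, chunk):
    rel = lock.last_releaser               # order across threads
    if rel is not None:
        chunk.preds |= {rel.cid} | rel.preds

def possible_race(c1, c2):
    """Communication between unordered chunks => report a data race.
    Signature hits can be false positives, so a report would still be
    rechecked against exact addresses."""
    communicate = (not is_empty(intersect(c1.write_sig, c2.read_sig)) or
                   not is_empty(intersect(c1.write_sig, c2.write_sig)) or
                   not is_empty(intersect(c1.read_sig, c2.write_sig)))
    ordered = (c1.cid in c2.preds) or (c2.cid in c1.preds)
    return communicate and not ordered
```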
All right. So now I'm ready to get to the final part, and that is: what if we make the signatures visible to the software, through the ISA? I claim that you can have pervasive monitoring -- supporting numerous watchpoints and so on -- or novel compiler optimizations such as, I'll show you, code motion and so forth. Let's look at the two one at a time: the monitoring abilities and the code optimizations. In pervasive monitoring, the idea is very simple: if you have the ability to insert an address into a signature in software, and then disambiguate that address against all the accesses that you do, then you're effectively watching this address. So here is the idea. You would have a system call: watch this address, and execute this monitor that the user wrote every time somebody -- a stray pointer or whatever -- touches this thing. The monitoring function could be checking an assertion or whatever you want. Inside the system call you would have an instruction that says: stuff this address into a signature. That would put it there, and from now on the hardware keeps watching the address: if anybody reads or writes this address, give me an exception, or something. So you go on, and suppose a stray pointer ends up updating this location. At that point you get an exception, and you could choose, for example, to execute this monitor in a different thread -- which could be, say, checking an assertion that the state is still consistent. So that gives you the ability to watch a certain location, or a range of locations, or a group of locations -- with some false positives, of course. So the first thing the handler wants to do is check that this is indeed an address that you want to watch.

>> Question: How do you encode a range into the (inaudible) --

>> Josep Torrellas: You would have to have some additional logic -- say, a min and a max. The logic currently is very simple, right? It's just an intersection of signatures. You could think of slightly more complicated functional units that simply check against the max and min: is this address within this range? And then give the exception.

>> Question: How do you (inaudible) --

>> Josep Torrellas: How do you turn it off? Oh, very simple: if you have instructions, you can do anything you want, right?

>> Question: (Inaudible) -- take it out of the set, because of collisions.

>> Josep Torrellas: Sorry?

>> Question: You can't take it out of your set, because there could have been a collision on (inaudible) --

>> Josep Torrellas: Okay. You mean: can I take a single address out of here and keep the rest in there, right?

>> Question: We have a watch and now I'd like to (inaudible) again.

>> Josep Torrellas: Okay. So, false positives, right? From the signature you cannot regenerate the exact set of addresses, so you cannot remove this one address from it. You'd have to live with that and do a check every time.

>> Question: You can do it if you have counters for (inaudible) --

>> Josep Torrellas: You can, if you count --

>> Question: (Inaudible) --

>> Josep Torrellas: If you complicate the thing, yes. If you have a counter of how many addresses map to each bit, you can remove an address by decrementing the count by one. Uh-huh.
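As a sketch of this watchpoint mechanism -- the watch "system call", the monitor table, and the exact-address recheck are all illustrative; in hardware, the intersection against the watch signature would happen on every retiring access:

```python
watch_sig = EMPTY   # models a signature register holding watched addresses
monitors = {}       # addr -> user-written monitor function

def watch(addr, monitor):
    """The 'system call': stuff addr into the signature, register a monitor."""
    global watch_sig
    watch_sig = insert(watch_sig, addr)
    monitors[addr] = monitor

def on_access(addr):
    """Conceptually done by hardware for every load/store that retires."""
    if may_contain(watch_sig, addr):      # signature hit -> exception
        # A hit may be a false positive, so the handler first checks that
        # this really is a watched address before running the monitor
        # (possibly in a different thread).
        if addr in monitors:
            monitors[addr](addr)          # e.g. check an assertion

# Usage: watch(0x1000, lambda a: print("watched location", hex(a), "touched"))
```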
Okay. Compiler optimizations. Suppose you have new instructions that say: begin and end collecting addresses. So I have a section of code and I say: begin collecting addresses into a signature. Of course, you have a small register file of signatures -- very few. I want the signatures to be visible to the software for compiler optimizations, mostly for disambiguation of groups of addresses at run time -- not ranges, but sets of addresses. So I can have begin-collect and end-collect. What this does is: at run time, all these addresses get hash-encoded into the signature -- and you could say only the reads, or only the writes, or both. Then another instruction is begin-disambiguate. The idea is that you say: from now on, start disambiguating against this signature -- and then end-disambiguate. What this does is: at run time, any of these loads and stores will be intersected against the signature, and if there is any collision, you can get an exception, or you can set a bit, for example. That's local disambiguation. You can also have remote disambiguation: start remote-disambiguating against the signature. That means any incoming invalidations or incoming reads -- coherence actions from another processor -- are intersected against the signature, and if something collides with it, you again get an exception or set a bit. This is very useful for, say, transactional memory, right? You would begin-collect and end-collect, so you keep collecting in your signatures the addresses that you're accessing, and at the same time begin-remote-disambiguate, so you're watching for incoming actions that collide with your addresses. So what can you do with this? I gave the example of transactional memory. Another one would be function memoization. You have something like this: could I eliminate this second call to foo -- you know, skip it? I called it here with a certain input, and I call it again, and suppose it's the same input. Can I just not call it again and use whatever I generated last time? And it does a bunch of pointer accesses in there. Can I do that? Well, I can say begin-collect, end-collect: that's going to put all the addresses that foo is accessing into a signature. Then begin-disambiguate, end-disambiguate: if by the time I get to this point there hasn't been a conflict, it means that the code in between didn't mess up any of the state that the first call used or generated, so I could just skip the call, if it's the same input. There's one detail missing here: what if foo overwrites a value? If foo reads something and then overwrites it, I cannot just skip it, because it has side effects on itself. So how do I handle this? Rather than collecting into one signature, I collect all the reads and all the writes separately, and then I intersect the two; if the intersection is nil, I know it doesn't have this overwriting effect. So there is an extra condition for being able to do this.

>> Question: (Inaudible) -- intersection, given the (inaudible).

>> Josep Torrellas: Oh, well, this was a sequential program. If you have a parallel program, then you have to do remote disambiguation -- watch that nobody else messes this up, right? You keep receiving the invalidations for that.

>> Question: Foo reads and writes the stack, so this is always going to look like it made (inaudible).

>> Josep Torrellas: Right, right. You mean it writes and then reads the stack, right? If it has a private variable that it writes and then reads -- true. In that case you could actually skip the second call, but using the signatures it would appear that you have a dependence. That's your point, right? Fine. So you would not memoize it -- or you could tell the hardware: look, I'm not going to put this range of addresses, the stack addresses, into the signature. It's private stuff; I'm not going to put it into the signature.
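The memoization test can be written down compactly. In this sketch the effects of begin/end-collect and begin/end-disambiguate are modeled as explicit values -- two signatures from foo's first run, and a conflict flag that the disambiguation hardware would set:

```python
def can_skip_second_call(read_sig, write_sig, conflict_flag):
    """True if a repeated call to foo with the same inputs can be elided.

    read_sig, write_sig: signatures collected over foo's first execution
    (begin-collect .. end-collect), reads and writes kept separately.
    conflict_flag: set if any store between the two calls collided with
    those signatures (begin-disambiguate .. end-disambiguate).
    Stack-private accesses would be excluded from the signatures, per the
    exchange above.
    """
    no_interference = not conflict_flag
    # foo must not read a location and then overwrite it, or skipping the
    # call would lose a self side-effect: read/write sets must be disjoint.
    no_self_overwrite = is_empty(intersect(read_sig, write_sig))
    return no_interference and no_self_overwrite
```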
Some day hardware designers will realize that this is very useful for software designers; it's very simple to do. Another example -- I guess I mentioned this already -- is having many watchpoints: you can say, put this address into the signature, and this one, and all the addresses that foo is accessing, and then watch whether anybody accesses them. Another one: loop-invariant code motion. Can I move this expression out of this loop? It would be good, because then my (inaudible) would speed up 5%, for example. Well, let me move it out, and then collect all its reads -- begin-collect, end-collect -- then begin-disambiguate and end-disambiguate around the loop, and then check: did I get any conflict? If I did get a conflict, I would have to relax, certainly; otherwise, I'm done. So signatures can be used in two ways. One is: get to a point, check the signature, and based on that decide whether I go this way or that way -- that's the simplest way. The other is: do something that is unsafe, check the signature, and then decide whether what I've done is incorrect and I have to redo it in a different way.

>> Question: The rollback is kind of a gotcha there.

>> Josep Torrellas: It's a (inaudible). You can --

>> Question: That's a lot of mechanism you need for the rollback.

>> Josep Torrellas: Okay, so the question is: yes, you need rollback. You can use shadow variables, perhaps, at the cost of some extra copies and so forth. Or otherwise don't use signatures in this mode; use them in the other mode, where you do the check first and then decide -- though that wouldn't work in this case here. Okay. So you can think of many optimizations; we're just getting started here. Many compiler transformations rely on run-time disambiguation of groups of addresses, and this gives you a nice primitive (inaudible).
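A sketch of that speculative code-motion pattern; the signature plumbing is again modeled explicitly, with the loop body reporting whether one of its stores collided with the hoisted expression's read signature (in hardware, a sticky bit set by the disambiguation units):

```python
def licm_speculative(n, invariant_expr, body):
    """Hoist invariant_expr out of the loop, run the loop while
    disambiguating, and check at the end."""
    v = invariant_expr()               # hoisted: reads would be collected
    conflict = False
    for i in range(n):
        conflict |= body(i, v)         # True if a store hit the read set
    if conflict:                       # end-disambiguate: check the bit
        # 'Relax': re-run unoptimized. As the question above notes, a real
        # use also needs to roll back the speculative loop's side effects,
        # e.g. via shadow copies.
        for i in range(n):
            body(i, invariant_expr())
```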
All right. To summarize, what we are after here is a machine that has hundreds of processors -- all cache-coherent in this case, although you could have groups of cache-coherent processors -- where the novelty is not in the cores. The cores can be designed quickly: take the old processor from a company that is your competitor; it doesn't even matter what memory consistency model it supports. Shrink it, stuff it in there. As long as it supports the bulk operations, everything has moved to the cache hierarchy. You get high performance, and this is completely transparent to the software: you don't need to change the software stack at all. Signatures are used for disambiguation and cache coherence -- I gave you a glimpse of how the whole system works with signatures -- and, if you want, even for compiler optimization. The machine supports sequential consistency without any fancy hardware: no snoops, no costly operations in the core. There is room for many new tools that focus on chunks -- for example, as I said before, deterministic replay with no log, data race detection with very low overhead, pervasive monitoring by watching addresses, and compiler optimizations if you make the signatures visible through the ISA. That's the high-level view of the project. As you can see, there are many different parts; we're even building an FPGA prototype of this, believe it or not. Each student is working on a different part of the project. There's room to collaborate with Microsoft if there is interest, and I think that many programming and compiler issues remain open. You can see that the emphasis of this work has so far been the hardware techniques; there is all the software that we haven't touched. So the next step would be: what if you make some of this stuff visible to the software?

>> Doug Burger: Great. Thank you very much. (applause) We should have time for some questions if you have them -- they were all asked very methodically along the way.

>> Question: So is your FPGA basically starting with a single core, or multiple cores? How much are you going to try to do?

>> Josep Torrellas: Okay, so this is a very simple core: a multithreaded, single-issue core. (Laughter) In order. Right. And since it is a very simple thing, you put multiple of those in, you just have to chunk them, and you have a way of ordering the chunks in the cache hierarchy. Most of the work is in the cache area.

>> Question: Do you think, at the end of the day, this makes running parallel programs more or less energy efficient?

>> Josep Torrellas: Well, it has some sources of energy inefficiency, right?

>> Question: And some efficiencies. You know, in the sense of the workload.

>> Josep Torrellas: So the inefficiency comes from -- did you ask about sequential code or parallel code?

>> Question: Parallel.

>> Josep Torrellas: Parallel code. If you have fine-grain sharing, then you run the risk of squashing. That's one of the problems; this is where I think most of the extra consumption will come from. What are the savings? For one, no snooping of the core, really, so that part gets simplified -- how much simplification is an open question. You don't need to snoop things; invalidations don't have to come all the way up into the core. That's a major gain. Working with signatures also helps: you don't send individual invalidations. Although the fact that you have signatures means that you invalidate additional locations that you shouldn't, potentially causing extra misses and some energy lost there. Commit is very efficient: we didn't have time to talk about this, but multiple chunks can be committing at a time, updating directories, sending the signatures around, as long as those chunks don't have any intersection. All the arbiter has to do is keep, in the arbiter hardware, the set of currently committing signatures, and when a new one comes in requesting to commit, simply intersect it with all of them. If there is no intersection, you say: okay, you're done, you're committed -- and in parallel do the propagation. One thing that we're watching carefully is that this should not slow down single-thread performance: there is arbitration that occurs irrespective of whether you have a single thread or multiple threads, and it should not slow down the system.

>>: Great, that's it. Thank our speaker again. (applause)

>> Josep Torrellas: Thank you.