>> Juan Vargas: The last session of this marathon workshop is going to be on architectures. I will have Josep Torrellas from the University of Illinois presenting the bulk architecture, followed by Krste Asanovic, who is going to talk about 21st century architecture research, and Tim Mattson on a topic that is a mystery.

>> : Because you never want to talk about it.

>> Juan Vargas: Right. Immediately I feel bad. We are going to move some chairs around to start the panel, and the panel is going to be, "Can Industry and Academia Collaborations be Effective?" And we are going to have David Patterson, Burton Smith, Jim Larus, Tim Mattson...

>> : [Inaudible] we're getting started. Can you [inaudible]?

>> Juan Vargas: ...Josep Torrellas, and we hope to have a very good discussion to end the workshop on fire, you know, having real fun. So, Josep?

>> Josep Torrellas: All right. How do I get this guy out of my way?

>> : [Inaudible].

>> Josep Torrellas: Good. Okay. Welcome to the last session, the architecture session. In this talk I'll give an overview of the work that we've done over the last four years on the bulk multicore architecture. I apologize if some of you guys have heard some of these topics before, but what I'm going to do is focus on the vision and on the most recent work as well. Okay?

So when we started this project about four years ago, we were thinking about what it is that makes a programmable architecture. That was the goal at the very beginning: "Let's make a programmable architecture." And what we've come to think is that a programmable architecture is one that attains high efficiency without a lot of work from the programmer, without the programmer having to do low-level tasks like placing the data or worrying about the coherence, and at the same time one that [inaudible] can minimize the chance of parallel programming errors. So we came up with this bulk multicore, and this is a general purpose multiprocessor architecture which is cache-coherent. And it uses two novel primitives for cache coherence. One is signatures, which are Bloom filters, and the other one is chunks, or blocks of instructions that execute atomically. And with these we build the whole cache coherence. And because of that it relieves the programmer and the runtime from having to manage the shared data. At the same time, because we work on big chunks of instructions, it provides high-performance sequential memory consistency, and that provides a software-friendly environment for a bunch of tools -- for debuggers, for watchpoints and whatnot. And then we augmented the architecture with a set of hardware primitives for low-overhead program development and debugging. So features such as race detection, deterministic replay, and address disambiguation are embedded as part of this architecture through special-purpose hardware. And this, we claim, helps reduce the chance of parallel programming errors, and it has an overhead low enough to be "on" during production runs.

So that's the high-level view. I need to just go over a couple more slides to give the basic concepts. The bulk multicore is an example of a blocked-execution architecture, an architecture where processors continuously execute and commit chunks of instructions at a time. So we are kind of abstracting away the individual instruction; we don't do this anymore. We don't worry about updating the architectural state after each instruction. Instead processors will execute a thousand, ten thousand instructions and then commit. And for that they need to have buffering.
They buffer the state during these thousand instructions, and then they commit. And when they commit, it's not that they write back streams of data; all they do is send the signature, and that should be enough to make sure that the state is coherent. So this is the example that we have here. So we have a bunch of blocks, and they commit in this form. You can either have the processor doing the chunking automatically, driven by the hardware itself -- this is not unlike transactional memory, but it's implicit transactions, the processor doing this in the background because it likes to do that, right? -- or you can have the software itself cut the -- provide hints of where you want to cut the code. And with this you have higher performance, because instructions inside these chunks are reordered by the hardware, and also the compiler can aggressively optimize the code inside, doing things that would otherwise be illegal. Okay?

And just the last slide here on this high-level discussion: a big problem of this architecture is the squashes. Whenever you are executing this chunk and you find that somebody else has changed state that you rely on -- you have somebody writing data that you read, right -- you have to squash. And since each chunk is big, it's a problem if you squash. And for that we use a lazy approach. At the end of the chunk we have to check that there are no conflicts with anybody; we use the signatures for this. Okay, so what we do is: suppose you have this processor executing this chunk, this one executing this chunk. No state goes out in the meantime. This one wants to commit. When it commits, all it needs to do is send the signature; it doesn't have to send any data. And the signature is checked against the signature of this other one, and if there's a collision, like in this case where the processor read something that this one wrote, basically [inaudible] you squash this one. And a chunk squash is quite expensive, so we're going to try to avoid it.

So throughout these four years what we tried to do is build the whole ecosystem around these chunks. We started with the architecture, the hardware here; a Communications of the ACM 2009 paper explains the basic architecture. Since then we have been looking at different aspects of the architecture. We also worked on feeding this architecture with blocked code. Okay? We have a dynamic compiler that takes code and generates these chunks that are used by the hardware. And we're working on a static compiler. And then we can also have a profiling pass that runs the code and figures out what the communications are, and based on that it gives hints on how to chunk the code. So the good thing about -- And there is also another part of the work, which is this additional hardware that does all the [inaudible] detection and the [inaudible] violation detection and so on that uses signatures and hashing hardware. So the interesting thing here is that you start with unmodified source code; you don't start with code that the user has instrumented with transactions. Instead you start with locks, barriers, flags. You have this compiler, you pass it through here, and then this hardware executes this code efficiently. Okay? So what I would like to focus -- And so these ideas have been out, and we've basically told Intel about this several times, so if they are interested, these are the ideas and they can take them.
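To make the signature mechanism above concrete, here is a minimal C sketch of a Bloom-filter-style signature: addresses touched by a chunk are hashed into a bit vector, and at commit time the committer's write signature is intersected with other chunks' read and write signatures. The hash functions and sizes are purely illustrative, not the actual bulk multicore hardware; a set bit in the intersection may be a false positive, which only causes an unnecessary squash, never incorrect execution.

```c
#include <stdint.h>
#include <stdbool.h>
#include <string.h>

#define SIG_BITS  1024                 /* illustrative signature size */
#define SIG_WORDS (SIG_BITS / 64)

typedef struct { uint64_t bits[SIG_WORDS]; } signature_t;

/* Two toy hash functions over an address; real hardware would use fixed
 * permutations of address bits. */
static unsigned h1(uintptr_t a) { return (unsigned)((a >> 3) % SIG_BITS); }
static unsigned h2(uintptr_t a) { return (unsigned)(((a >> 3) * 2654435761u) % SIG_BITS); }

static void sig_clear(signature_t *s) { memset(s->bits, 0, sizeof s->bits); }

/* Record an address read or written by the chunk. */
static void sig_insert(signature_t *s, uintptr_t addr) {
    unsigned b1 = h1(addr), b2 = h2(addr);
    s->bits[b1 / 64] |= 1ull << (b1 % 64);
    s->bits[b2 / 64] |= 1ull << (b2 % 64);
}

/* Non-empty intersection means a *possible* conflict (conservative). */
static bool sig_intersect(const signature_t *a, const signature_t *b) {
    for (int i = 0; i < SIG_WORDS; i++)
        if (a->bits[i] & b->bits[i]) return true;
    return false;
}

/* When another chunk commits, it broadcasts only its write signature;
 * each in-flight chunk checks it against its own read and write sets. */
static bool must_squash(const signature_t *committer_writes,
                        const signature_t *my_reads,
                        const signature_t *my_writes) {
    return sig_intersect(committer_writes, my_reads) ||
           sig_intersect(committer_writes, my_writes);
}
```

The key property this illustrates is the one Josep emphasizes: commit traffic is just a small fixed-size bit vector, not the data itself.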
So what I want to focus on is giving you some idea of what we've been looking at recently, which is some of these architecture issues in HPCA and some of the compiler passes here. So these are the recent accomplishments on the architecture side. One of the papers we published a couple of months ago is on allowing simultaneous multithreaded processors, meaning hyper-threaded processors, to work in this mode of chunks, so all the threads inside an SMT processor work in chunks. We also extended the architecture to support determinism, deterministic execution. And we have some work on an architecture to record the program as it's running and deterministically replay it later on. On the compiler side we had a recent paper last year on the interaction of chunks with synchronization: what is the optimal place to cut the chunks? And we're working on using alias speculation, another optimization [inaudible] using these atomic blocks. Okay? So I'll give just a hint of a couple of things.

So on the architecture side, what we've done is use this concept of executing chunks inside an SMT processor, a simultaneous multithreaded processor. The reason is that many, many processors are simultaneous multithreaded these days, and you use the hardware better with them. So as you may know, Intel has recently announced that Haswell has support for transactions, and it also works, as far as we understand, on an SMT processor. But what we do here, because you have multiple threads running on the same core, is tolerate dependent threads. Meaning, even if you have a collision between the two chunks that are running on two different threads, we're going to tolerate this and continue executing. Okay, this is the advantage of having this processor with the threads so close that you can keep state nearby. Okay? So that's the main idea here. So we claim this is the first hyper-threaded design that supports atomic chunks. And we analyze what happens when you have a conflict between the different threads that run on the same SMT processor: you can either squash, you can stall, or you can order and continue. Okay? And we also designed a way of having a many-core, a multicore, where each core is an SMT processor and all processors run in this mode of chunks. And this is obviously more cost-effective than running a single thread per processor: you have higher performance for the same core count, and reasonable performance for a quarter of the hardware.

So here's kind of the crux of the problem: even in an SMT processor, all the threads share the first-level cache. Okay? And as a result, if I have this processor executing a transaction or a block and writing to the cache, then suppose that another thread reads from this location; now we have a dependence. Okay? Traditionally what you would do is squash one of them. Okay? Now we want to let them continue. And we take advantage of the fact that the two threads are so close that they can keep state. So I can squash, stall, or order. Okay? So this is a pictorial representation of this. So suppose I have two threads. This guy is executing this dark block. This one reads the cache. We find the dependence through the tags of the cache, right, or through the load [inaudible]. And then what you observe here is you can either squash this one and restart -- it would take this long -- or we can stall the processor after the [inaudible], stall.
And then when this one commits, since we know when that happens, okay, then we can continue this thread. And this has sped up this program from here to here. Or we can actually record that there is a dependence and keep a small table that says, "This thread now depends on this one. I cannot let this one commit first; instead I'm going to have that one commit first, and when it commits, have this one commit." Okay? So that's the idea. And the significance is that you can run dependent threads in parallel. Now I won't bother you with some of the hardware that is needed. These are the different types of mechanisms that we use. You can do it in hardware and software; this is a hardware implementation and this is [inaudible].

Let me move on quickly to code generation. So the big picture of how you generate code for a blocked architecture is very simple. What you need to do is have some software entity -- compiler, profiler, whatever -- insert chunk boundaries at strategic points in the code, okay, and then perform aggressive optimizations between these boundaries, knowing that the hardware will guarantee that, no matter what, it's going to execute atomically. Okay? And since a chunk may repeatedly fail because somebody keeps messing with your state, you need to create a safe version of the same chunk somewhere, so that if you fail many times you can go off and execute the plain code with locks and whatnot. All right? That's what we call the safe version. And there are two key ideas, only two ideas: the first one is maximize your gains; the second, minimize your losses. How do you maximize your gains? You want to form the chunks around the code where you think the compiler can do the most. Okay? So what are these places? For example, pointer-intensive areas where you know the compiler cannot do anything in a safe manner. That's where you want to put the blocks, do a lot of transformations, and then do a check before the block finishes and either scrap everything or continue. Or areas of the code where there are many branches but you know there is typically a path that is highly likely, so you optimize for this path and on the exits you squash. Or whenever you have many critical sections that are not frequently contended: rather than having them in different blocks, you put them together and you remove the locks. And then the second idea, to minimize the losses, is to make sure you cut at points where there is likely communication between the threads. So we can allow threads to communicate. What we cannot allow is two blocks that are concurrently executing to communicate; that's what we're trying to avoid.

So here's an example of the first optimization, trying to group things where the compiler can do little. Okay, so look at this thing. This is a piece of Barnes code where I have a while loop and I have lots of pointers, if statements and so on. And in each iteration I have two critical sections that grab the same lock, but the lock is different in each iteration. Okay? But there is lots of -- This is a very complicated thing. You have to [inaudible] a couple of pointers at offsets and so on, so lots of instructions here. We wonder: if we put this in a chunk, can we hoist code out of the loop? Okay? So the first thing we're going to do is lock elision.
Lock elision is an optimization that removes the lock and unlock and replaces them with a plain read: a while loop with a plain load of the lock, to see if the lock is free. If the lock is not free, it keeps spinning; otherwise, it continues. This is called lock elision. Why do I have to have that and not just remove the lock and unlock? If all the threads were executing a block, then I could remove the lock and unlock, because they either fully succeed or they fail, right? But because I'm going to have this safe version of the code, and sometimes some of them will fail and go execute the safe version, and in the safe version one of them will grab the lock, I need to make sure that they don't execute this while somebody has the lock. So I'm going to check. If somebody has the lock, I'm going to spin here. When that guy releases the lock, I'm going to get squashed and restarted. Okay? This is a common optimization people do.

The bottom line is that by removing the locks, we now give this to the compiler and the compiler says, "Whoa, there are no locks here. I can start doing common subexpression elimination." And the first thing it sees is that these are different locks in every iteration, so it cannot optimize that. However, look at this thing: it needs to be computed multiple times inside the same iteration and across iterations. So let me do code motion here. So all the stuff that was in here is computed just once, at the top of the loop, and put in a register. And then based on that, it just uses the register to access this thing. Okay? So notice that I have removed a lot of code. In fact, the dynamic total number of instructions per lock moves from 9,000 to 7,700. Okay, and some of them are loads and stores. And what happens if somebody collides with me? Because I have this thing here, if somebody grabs the lock I'm going to get squashed and restarted. So that's an example of how this thing works. Large chunks are beneficial.

So that's one thing. Now how do I cut the chunks to minimize squashes? That's the second part of the optimization. We need a profiling pass to figure out what we call squash hazards. Okay? A squash hazard is something equivalent to a [inaudible] hazard, but we call them squash hazards because -- given a synchronization access, you don't know if this is a highly contended synchronization; if it's not, it's not a squash hazard. Squash hazards are operations that frequently cause squashes. Typically it's the first communication in a code region with multiple communications. Okay? So typically highly contended synchronizations and data races -- not ordinary shared data accesses, because those are protected by locks. So once you find these things, the sync hazards -- or the squash hazards, rather -- you transform the code with tailored squash-removing algorithms. So for certain types of locks you want a certain type of algorithm that minimizes the squashes. Okay? The goal is to prevent two concurrently executing chunks from communicating on the fly, and that typically means cutting the chunk before the hazard. Okay? So [inaudible] explains this thing.
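As a minimal sketch of the lock-elision idea described above: the critical section runs inside a chunk with the lock replaced by a plain read, and if the chunk keeps getting squashed it falls back to a safe, lock-based version. The begin_chunk()/commit_chunk() interface is hypothetical, standing in for whatever the hardware or runtime actually exposes.

```c
#include <stdbool.h>

/* Hypothetical chunk interface: begin_chunk() starts speculative execution
 * and returns false once this chunk has been squashed too many times, in
 * which case we run the safe (locked) version instead. */
extern bool begin_chunk(void);
extern void commit_chunk(void);

typedef struct { volatile int lock; /* 0 = free */ int counter; } cell_t;

static void acquire(volatile int *l) { while (__sync_lock_test_and_set(l, 1)) ; }
static void release(volatile int *l) { __sync_lock_release(l); }

void update_cell(cell_t *c)
{
    if (begin_chunk()) {
        /* Elided lock: just read it. If some thread is in the safe version
         * and holds the lock, spin; when that thread releases the lock, its
         * write conflicts with our read and this chunk is squashed and
         * retried, which is exactly the behavior Josep describes. */
        while (c->lock != 0)
            ;
        /* Critical section with no lock taken; because there is no lock or
         * unlock acting as a barrier, the compiler is free to hoist
         * loop-invariant pointer arithmetic out of surrounding loops. */
        c->counter++;
        commit_chunk();            /* conflicts are detected via signatures */
    } else {
        /* Safe version: plain locking, guaranteed to make progress. */
        acquire(&c->lock);
        c->counter++;
        release(&c->lock);
    }
}
```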
So this is all about the base technology for chunks. What we have been building for several years now, actually, is a hardware prototype for deterministic record and replay, with Intel Labs. And the idea here is to augment the cache hierarchy of a simple multicore with a memory record and replay engine. It's a small thing, but what it does is: as the programs execute, it basically cuts the execution of the code whenever there is communication between threads, and then it stores information about the execution -- the number of instructions between communications. Okay? And then after a while it dumps it into a log in memory. Okay? So we collect chunks of instructions -- not speculative chunks, but chunks of instructions between communications -- and then we can replay them. So it's a primitive to recreate past states in a computer. It records chunks of instructions until the next communication. And we hope that this will be the basis of a lot of tools, debugging tools, because once you have the execution between communications, you can build tools, say, to detect [inaudible] violations.

Okay, so I want to just -- Let's ignore this thing. I want to just talk about a couple more projects, architecture projects, that we have at UIUC as part of UPCRC. One is DeNovo; this is my colleague Sarita Adve's project. What it does is hardware for disciplined parallelism. So the idea is, if I knew that my code followed determinism -- you know, the simple code that DPJ, Deterministic Parallel Java, produces -- then I could tailor my hardware to be much more efficient. Right? So if I know that I have structured parallel control and explicit effects, then I can simplify the hardware. I can simplify the cache-coherence protocol. I don't need to worry about invalidations to the other processors, because only one processor can be accessing the data at a time. All the transient states are removed. I can optimize the protocol better and more easily. And then many of the things that current cache coherence has -- invalidations, acknowledgements, indirection through the directory, even false sharing -- become unnecessary, because you know at all times that the data that you want nobody else wants. Okay? So you cannot have false sharing. And the second project is Rigel. Rigel is a project that my colleague Sanjay Patel has been working on, also for UPCRC. And the idea is to build a thousand-core chip for visual computing. So these are very simple cores and they don't have cache coherence. And because the applications are so regular, you're able to optimize the transfer of the data and it becomes very efficient. Okay. So that's a summary of the work on architecture at UPCRC. Thank you.

>> Juan Vargas: Thank you.

[ Audience applause ]

>> Juan Vargas: Do we have questions for Josep? Well then, thank you very much, Josep. And let's move to the next speaker, Krste Asanovic. He's going to talk about 21st century computer architecture research.

>> Krste Asanovic: Thank you.

[ Silence ]

>> Krste Asanovic: Okay. Is the microphone on? Can you hear me? Okay, good. All right, so -- Because I only have 20 minutes, I thought I'd just pick one fun thing to talk about. So this is like a mini-rant about architecture research, looking ahead at what we'll need to do in architecture research: what's the next big challenge? Well, technology is slowing down or stopping; everybody knows this. There's no device technology out there that's going to save you. You know, sorry. End of technology scaling. Parallelism is a great idea, but it's a one-time gain. Right? Basically there are two tricks we've played to use parallelism to save energy. You either use simpler cores, but that's pretty limited by how small you can make the core -- you can only make it so simple. The other one is running at a different operating point, lower Vdd, lower frequency, and making it up through parallelism.
But that's limited by Vdd/Vt scaling and just the errors you get as you lower the Vdd. So that was the one-time gain from parallelism. That's great. Now what? Everybody seems to agree that we're going to improve efficiency with more specialized hardware, and I agree with that as well. But -- when we did Par Lab, we spent a lot of time working on the software stack; actually most of the money went to the software stack, unlike most parallel computing projects. But now attention is really focused back on the architecture, because we're going to have to change the architecture substantially if we're going to get more energy-efficient processors. So we've got to ask these architects to go do research in, you know, very efficient specialized architectures.

So how is research done today? This is a mini-rant. A little while ago I went and looked at one of the ISCA proceedings; this is actually 2010. This is our top conference in computer architecture. So I analyzed, you know, how do people evaluate their architecture ideas? And I kind of broke it down; this was just me doing it, so the caveat is I could be doing this wrong. But I went through all the papers, and for about two-thirds of them the way they were evaluating I thought sort of made sense for the ideas. A few papers had no numbers in them at all, which surprised me -- reading ISCA you usually see lots of bar graphs. A bunch of them actually used some real machines. Some are working on new device technologies, where it's very hard to build models at all, so I gave those guys a free pass. You know? It's very hard to do work in those areas. Some papers were about the outer memory system or traces and whatever; that seemed to work fine, and I didn't have a problem with them. But the ones I was really focusing on were the ones saying, "We're going to build more specialized architecture. There are going to be new kinds of pipelines, new kinds of caches." So I looked at those papers. And about a third of them were in that kind of area, the inner workings of the core. And of those I took a look, and, you know, only two of them actually had RTL down to the level of: do you know where all your bits are, do you know where all the wires are? Right? Only two of the papers had that. One of them was a Stanford paper which used Tensilica. They don't actually use the RTL, but they used a model that was built from the RTL Tensilica has. And they were happy that it was within roughly 30% of having done the RTL: they back-created a model, and it was within 30% or so is what they claimed. There was one industry datacenter paper based on a product, so that was okay. So the other 16 were academic C simulations without real designs, and a lot of improvements in the kind of 20% range.

So do you see a problem here? You know, doing cycle counts -- the cycle counts they got may have been somewhat representative. But remember, these are the papers that are focusing on pipeline [inaudible] cache design. They're really focusing on that part of the thing, and they're doing C simulations of it. And remember, the guys who had the RTL and built the model, they're only within 30%. Right? So, you know, what's going on here? And cycle time/area/energy? Some of the guys, you know, the way they modeled area was to take some die photos -- which are usually incorrectly labeled -- from ISSCC proceedings or whatever, and then use that to build their area model. Right, just, you know, completely bogus.
So my take on this was that for most of these papers the evaluations were completely bogus. Actually they were a waste of paper. Not to say the ideas were bad; the ideas could've been great. So I'm not commenting on the ideas, just on the evaluation methodology. Right? So again, the ideas could've been fantastic. I don't want to upset my colleagues. It's just -- I know why they did this, but the evaluations are probably completely bogus. And this is a bit worrying, because now we move into this era where it's all about energy efficiency and building the most specialized cores. So this is a, you know, famous painting: this is not a pipe, it's a picture of a pipe. And, you know, those models are not really processors. They're not pipelines. Right. They're just models of them. So we have to get a bit more real.

Now it's hard. New computer architecture models are really hard to create. Cycle counts: you need a microarchitecture. You need to actually have a microarchitecture, and long runs. Cycle time/area/energy: lots of interactions -- how you design the thing, process technology, design. Another thing is that the design space is really important. Why do you think your one design that you put up as a candidate is actually a good point in the design space of that family of architectures? Now industry has a big advantage. They keep doing the same designs over and over again, so they get very good modeling, because they basically have all the software running on last generation's chips with counters and everything. They can really tune their new models to match it well. And they have real layout. They have real designs. And usually -- you know, sorry, Intel, but your designs are basically tweaks. You know? They've looked the same since, you know, the Pentium Pro days almost. You know how to build that stuff, so you can build pretty accurate models. But doing far-out research in academia where you don't have that experience, how are you going to get anything reasonable as a model? You don't have that luxury. Well, basically I can't see any other way -- I'd love to know if there's a better way -- you have to do real designs, you have to actually go design it for real.

So my claims as to why you have to do this: you can't really create, or even use, an existing model correctly if you haven't built processors. And, you know, undergrad hardware class projects are a good start -- so we start educating architects -- but you have to keep doing it to get good. And you won't get this experience by modifying sort of C-based models. Another claim is that only bad models are actually easier to build than actual designs. Good models are harder to build than real designs, all right, because you have to build lots of actual designs to build a good model. Right? So you kind of have to do real designs.

Another thing I'll put up is this slide, a little experience I loved from when I was at MIT. We taught this class with [Inaudible], a simple hardware design class. So we gave the students a little project. One of the labs was to do a 2-stage RISC pipeline. You know? Design the RTL, synthesize it out. And we thought it was such a simple task -- we even gave them bits of the code -- we thought it was such a simple task that all the students would have basically the same answer coming out of this design lab. Right? This was the result. Right, so this axis is clock period in nanoseconds. So designs ranged from 3 nanoseconds to 12 nanoseconds cycle time. Right? That's a factor of 4.
Area: this is going from ninety thousand square microns up to a hundred and fifty thousand, so, you know, a bit less than a factor of two difference in area. This is for just a 2-stage RISC pipeline where we'd given them a lot of the code already. All right? So first thing this tells you -- you know, what's the best design on here?

>> : [Inaudible].

>> Krste Asanovic: Right. So what are the bad designs on here?

>> : [Inaudible].

>> Krste Asanovic: Anything that's [inaudible], you know, off the pareto-optimal curve. So there are a lot of contrasting design points. Right? So you have this pareto-optimal curve. So now go back and think about people doing these architecture papers. Your student sits there and does a prototype -- so this is the pipeline modeling this core. How good do you think your student is? Which of those points did they pick [inaudible] for the baseline or for their candidate work, right? Right? How would you know? How would you know how well they've done? The other thing is, there is a big design space, so any simple -- This is a very simple processor; look at the big design space we got. The more complicated ones, specialized architectures -- you've just amplified the design space by orders of magnitude. Right? So to get a good design point, you have to do design space exploration.

So, you know, what we've been working on at Berkeley is: can we make it easy to develop lots of real designs to do the design space exploration? Construct real RTL: if you have the actual RTL -- you actually know where all the wires are, know where the bits are -- you can get accurate cycle counts. And if you can actually go through to layout -- in our case we're just using synthesized layout -- you get cycle time, area, energy. Right? You need to have real-world physics pushing back at you when you make a design decision. Another thing, to get cycle counts for long-running programs, is that we generate FPGA models automatically as well, so we can actually run things for longer. But the big thing is that by doing real designs you educate students to be the next generation of architects who actually understand how to build things. A little unfortunate now: sometimes I talk to people and it's clear they've never built something, and their advisor has never built something, and maybe their grand-advisor never built anything. All right? So -- and some very simple concepts just don't -- they don't understand why this is bad or why this is good. So that's unfortunate. So we want to actually train people.

So to help us do all this, we've built this language called Chisel. It stands for Constructing Hardware In a Scala-Embedded Language. This is an embedded DSL -- it's kind of [inaudible] but embedded in Scala. We built a hardware-description language, and basically hardware is just a data structure in Scala. And the really great thing about this is we can use the full power of Scala, which is a nice language -- it has all these nice language features to write generators -- and we also use it to layer higher levels of language description on top of this base level. So we can do some very powerful things that build generators. Jonathan Bachrach at UCB is the main developer of Chisel. So what does Chisel look like? Well, if you have a design description in Chisel, the Chisel compiler can then output any one of these from the same description. We can output very fast C++ code and get a cycle simulator out of that. Or we can generate FPGA emulation, going through the standard FPGA tools to generate FPGAs. Or we can generate Verilog.
We can synthesize to get real layout so we can go fabricate chips. And we also use that layout to extract cycle time and energy numbers from designs we did. Okay, so one of the -- You know, isn't this too hard? Isn't building this stuff for real too expensive? Well, first thing, you have to be able to design. You've just got to be able -- If you can't design microprocessors, I don't know why you're trying to tell other people how to do theirs better. Right? It's just -- I'm sorry, but you have to be able to design microprocessors. Right? Advances in tools make this more tractable, especially the synthesis tools. So we've been -- Also we've been working on better libraries and tools and [inaudible] released it open source. I should have said Chisel's actually out there. You can go on the website. You can download it. Go to the GitHub project and get hold of it. We've been putting up a lot more stuff over the summer, more cores and things, so that people can download and use this. And the other thing is you don't actually have to fabricate every design. The point is you design it to the point where you almost could fabricate it. And the tools are pretty good; once you have the layout you can extract and get very good numbers out of it. All right?

But what I've learned over the years is that building chips is actually fun. And the main reason you build chips these days is morale. It's actually the number one reason. And credibility. People really like building chips. So I started a long time ago building chips. This was my thesis chip; it was a vector chip. I was involved a little with this VIRAM vector chip here at Berkeley; I worked on the vector part. You can detect a trend here. Then at MIT we built SCALE; it was a vector chip. Notice the gap between these -- there are quite a few years in between these chips. More recently we taped out something at 28 nanometers, another vector-style core we're experimenting with. We just taped out this other one at 45 nanometers. There's another one coming out in August. What you might notice is there are a lot more chips. All right? And the same few students are doing all of these chips.

All right, so how are we turning out all these chips? One advantage is that all of them are similar but not the same; they're being used in different projects. All right? We've been trying to push a more agile hardware development methodology, which basically consists of going through the entire process automatically, really automating as much of that flow as possible. So, like, you know, there's no such thing as an RTL freeze in our development methodology. We just go, "We'll wade through this stack continuously and keep pushing through [inaudible]." Well, and you might notice these are all different process technologies, which is actually the biggest hurdle. So in building chips, the RTL's the really easy part -- that's actually trivial. Physical design is the big challenge, and actually getting these things finished and working and fab-able. But so it's fun to build stuff -- but you don't need to fab to do most of the research. I think this is just -- Actually these are being fabbed for other purposes, like they're doing work on low-voltage resiliency with [inaudible] DC converters, and the other ones with [inaudible] integrated. So the [inaudible] are just kind of an afterthought, but they need them to run the rest of the experiment, and we get the fabbing as a result. Someone commented about this, you know: "Is synthesizable ASIC design close enough?"
Well, for handhelds, the handheld/SoC kind of space, that is how people do designs in industry. There is the Intel class, you know -- Intel, sort of IBM, the stalwarts keep doing custom design: five or six years of careful engineering, lots of detailed synthesis. Those glory days are kind of passing. Right? The glory days of custom circuit design are over, for many reasons on the slide; I won't go through them. But I think as an architecture researcher it's close enough; you get the insight you need from doing this level of design. It actually matches what you would do in industry for those kinds of cores, the kinds of cores we're going to see coming out. Anyway, that was a mini-rant on, you know, people should build stuff. It's a lot easier to build stuff than it's ever been -- a lot closer to industry, probably, than we've ever been when we do this stuff in academia. So people should be doing that.

So I just want to say a little about a bigger project that we've sort of been starting off -- starting from [inaudible], the new project ASPIRE. So basically starting with applications. You know, how are we going to -- what's our story on this specialized hardware? Everybody is working in this area. We're going to leverage all the work we've been doing [inaudible] on patterns and the pattern-based software decomposition. So we start with applications and we break them down into patterns. And we've been sort of pushing this idea that the way to do heterogeneous processing is to have a central processor with a sort of satellite array of specialized engines. So it's more a coprocessor model, rather than the idea that you have a sea of GPU stuff over on the other side of the chip. So augment -- it's kind of like you have the [inaudible] units on a regular processor; we go even further, with more kinds of engines stacked around the central core. We think there are a lot of good reasons to build it this way. We call this ESP, Ensembles of Specialized Processors. The idea is that this ensemble of specialized engines can execute any kind of app you throw at it with greater efficiency. So you might imagine having a, you know, standard core, an LP engine, and then the idea is that these side coprocessors are actually targeted at a given pattern. So, you know, [inaudible] keeps saying these patterns are the things that recur; they're the common operations. So we'll build the engines to match those patterns. So that's our idea of building something that is more efficient than regular cores but still has the coverage and flexibility and programmability to cover the application space. And what's nice is that we already know how to program this, because we build these specializers for each pattern. And building a specializer for a pattern to target an engine designed to execute that pattern should, we think, be pretty straightforward, and we won't have to change the application code, right, whether you have that accelerator there or not. This sort of pattern-specific accelerator is kind of what we've been working on. So we're going to generate [inaudible] together. And what's nice is we can actually take these designs and push them all the way down, through either emulation or down to ASIC. And then the [inaudible] important part is doing this design space exploration, not just around the hardware but around the whole stack as well.
So on the software level you're autotuning for a given architecture design point, figuring out the numbers you get there, and then iterating the whole loop -- so two levels of design space exploration. So this is kind of what we're exploring in this next project: how do we drive up the efficiency of these specialized engines? Okay. So in summary: you know, we've got to focus on efficiency in 21st century architecture. There's a huge range of specialized architectures. And really I can't see any other way to evaluate them except actually having real layout. So how are we going to produce that? I told you our approach. One of the good things is we're trying to open source all of this under BSD, and I think the important thing is to have people actually study your RTL and say, "Well, your core has, you know, this really dumb idea in this part of the core. Why don't you do this?" And you say, "Sure, we'll do that." We're not claiming we know how to build the best cores, but we'll put them out and host them and have other people contribute. And hopefully we'll drive towards a commonly agreed good baseline design from one of these design points. Yeah, and this last thing: it'd be great if, you know, people did papers that actually improved fabricatable designs. That would be really good. Okay, that's it.

[ Audience applause ]

>> Juan Vargas: Do you have questions for Krste? Yes, John?

>> : It's more a comment. One of the things that I really like about Chisel, which you didn't actually bring up, is that there is a common description of your hardware in Chisel, in the Scala language, that allows you to push it in any of those directions. And so for the design space exploration, I can [inaudible] on the C++ side and I know that that same design [inaudible] in RTL or [inaudible]. So that's a real powerful thing: [inaudible] different set of tools for each level [inaudible]. And this really unifies [inaudible].

>> Krste Asanovic: Yeah, that's actually one of the driving irritations from previous stuff we did. Robert?

>> : My understanding is that most of your methodology is geared toward exploring microarchitectures, but it also lends itself to exploring [inaudible].

>> Krste Asanovic: Oh, no, definitely. So these specialized architectures -- those are completely new [inaudible], so very rich in [inaudible]. Like graph engines, [inaudible] graphs and things like that.

>> : [Inaudible] exploring [inaudible] to do is write code in the new proposed instructions [inaudible] compilers can generate [inaudible]...

>> Krste Asanovic: But that's where our specializers come in. So we're targeting them through the specializers.

>> : I think you have another advantage going on here. You know, there are lots of different specializations you showed there. For example, sparse and graph -- maybe they're not different. How do you find out? Well, you design them both, I suppose, but then you've got to program them both. And you've got to take the same problem and target both of those architectures to decide. For example, we don't need any dynamic type discrimination support in the architecture; we can do it all in RISC. I remember [inaudible]. Right? So --. But you have an advantage here with the programming technology you've got. You could write SILOs that would go to both targets and compare them from the same source code on both of the [inaudible] without nearly as much trouble as you would have inventing new languages and recoding things and other [inaudible]. I think [inaudible] pretty interesting.

>> Krste Asanovic: Okay. All right.
>> Juan Vargas: The last speaker today is Tim Mattson from Intel. So Tim is going to end the session with fireworks, talking about the many-core processors at Intel. He's going to have message passing for the future. And after this we will get some chairs and start the panel. And thank you very much for staying with us for so long. It has been a very long day, but I hope you enjoyed it and I hope you had a lot of good insights. So, Tim?

>> Tim Mattson: Okay. Thank you. So this is really what I would like to talk to you about: kayaking and kayak surfing. But, no.

>> : [Inaudible] for that?

>> Tim Mattson: Yeah. But we won't. I have to have this disclosure, though. You know, I work for Intel, but these are my views, not necessarily Intel's views. You will learn absolutely nothing about any Intel product from what I have to say. This is a team effort, but if I say anything stupid, I own it, not my teammates. And I want to emphasize: it's my job to challenge dogma and to explore alternate realities. So don't for a second think that I'm telling you about any kind of future Intel product. Period.

All right. So this is my favorite slide. And I apologize -- there are some of you here from Berkeley who've heard me give this talk many times. You guys can go ahead and lay your heads down, take a nap; I won't be offended. But this is a slide I pulled out of an Intel executive's deck from 2006. I just love it because it's talking about this great vision of many-core and how we're going to have -- you know, we had the single core, now we've got dual core, and in the future we're going to have lots and lots of cores with a shared cache and a local cache, and how cool this is. But notice the implicit assumption, just the automatic assumption, that of course there's a cache-coherent shared address space.

Now I want you to think back on the talks you heard today. We heard about heroic research to try and find races. We heard about bizarre tool chains to try and prove that you could do lock elision. We heard about this weird bulk multicore architecture which could do rollback if you had a memory conflict. All this complexity, this insane complexity, so that we could have a shared address space. I think a shared address space perhaps is just a big mistake. The fact of the matter is that if you get expert parallel programmers -- and I've been there; I know what I'm talking about here -- if you get expert parallel programmers who have been doing it for decades and ask them to explain the relaxed memory models they work on, ninety-nine percent of the time they will get it wrong. All right? So we're going to build our future on a model the experts can't understand? How smart is that? I think that's pretty stupid. Also, the only programming models that we know for a fact will scale to hundreds or thousands of cores for non-trivial -- I mean non-embarrassingly-parallel -- workloads, the only proof points we have, are based on distributed-memory, message-passing-style programming. And furthermore -- I know people tell me I'm wrong; I know architects who are very, very clever tell me I'm wrong; that's fine, they have that right, but I'm the one up here speaking now -- as you add the circuitry and the chip area and the power to manage that shared cache-coherent state, that's not scalable. Amdahl's Law is a real law. It's going to eat up some of your overhead. It's going to become expensive. So at some point I think we're going to have to get rid of it anyway.
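The claim that experts get relaxed memory models wrong is often illustrated with the classic store-buffering litmus test, sketched below in C11. Under sequential consistency at least one of r1, r2 must be 1, yet on real hardware (x86 included) both can be 0, because stores sit in a store buffer before becoming globally visible. This is a standard textbook example, not code from the talk; the outcome is rare in practice with per-iteration thread creation, but the memory model permits it.

```c
#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

atomic_int x, y;
int r1, r2;

void *t1(void *arg) {
    atomic_store_explicit(&x, 1, memory_order_relaxed);
    r1 = atomic_load_explicit(&y, memory_order_relaxed);
    return NULL;
}

void *t2(void *arg) {
    atomic_store_explicit(&y, 1, memory_order_relaxed);
    r2 = atomic_load_explicit(&x, memory_order_relaxed);
    return NULL;
}

int main(void) {
    for (int i = 0; i < 1000000; i++) {
        atomic_store(&x, 0);
        atomic_store(&y, 0);
        pthread_t a, b;
        pthread_create(&a, NULL, t1, NULL);
        pthread_create(&b, NULL, t2, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        /* Forbidden under sequential consistency, allowed by the hardware: */
        if (r1 == 0 && r2 == 0)
            printf("iteration %d: r1 == r2 == 0\n", i);
    }
    return 0;
}
```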
So I want to ask the question: maybe, just maybe, we should bite the bullet and recognize that the cache-coherent shared address space was a stupid idea, and the sooner we get rid of it the better. So we had a research program at Intel where we actually said, "Okay, let's build some chips that are arbitrarily scalable, meaning we can scale them as far as our process technology will take us, and that do not have cache-coherent shared address spaces." So we built two of these. We have an 80-core FLOP monster, also called the Polaris chip, and a 48-core chip, the Single-chip Cloud Computer -- probably the worst-named chip in history, but that's what it was called. This was created as a software research platform. So let me say a little bit about these chips, and I'm being very conscious of the time. And of course I have to credit my collaborators, in particular Rob van der Wijngaart, who I worked very closely with on all of these projects, but also the hardware team -- you know, Jason, Sriram and Saurabh have been just delightful to work with.

So let me tell you a little bit about this 80-core terascale processor. The goal of this project was: could we get a TeraFLOP for under 100 watts? And we basically did that. It was a 65-nanometer process, which at the time, back when we built this in 2006, 2007, was the leading process technology; 8 by 10 tiles, mesochronous clocking. Offline I can go into all sorts of low-level details there, but in the ten minutes I have left I can't do that. But the interesting part to me was what the cores looked like on this chip. Now this is not a general purpose processor -- of course not -- but from the point of view of an old HPC hacker, I love this chip. It's marvelous. Okay? Two floating-point multiply-accumulate units, and a five-port router so I could do XY routing between tiles and I could go directly into the data memory or the instruction memory. So I could write things into the memory, whether instruction or data, without interrupting the core. If you want to build highly, highly scalable architectures, that's a wonderful feature. And if we went through all the little tiny itty-bitty numbers here and added them all up, what you would find is that this is a perfectly balanced processor: I can move stuff from the instruction memory fast enough to drive the chip at peak speed, and I can pull things out of the data memory fast enough to keep those floating-point units fully occupied. So it's a very well balanced chip. There's no division. There's no integer unit because, you know, those are for weenies. If you're a real programmer, all you need is a floating-point multiply-add. The interesting thing is there are 256 96-bit instructions per core. This is not cache; it's memory. And I can hold a just stellar 512 single-precision words per core. So this isn't a general purpose chip. All right? We know that.

So we wrote some kernels for it -- I won't call them applications -- but, you know, we got a stencil kernel going, we got a matrix multiply going. And we said it was impossible to do 2D FFTs, so of course Michael Frumkin, working with us, had to do a 2D FFT just to show us how wrong we were. But we got this stencil code running at a TeraFLOP for 97 watts. We're cool. We rock. For me this was really cool because I was involved in the first TeraFLOP supercomputer in 1997, where we had one megawatt of electricity and 1600 square feet of floor space, and ten years later we're at 97 watts in 275 square millimeters. That's pretty cool in one person's career, in ten years. I think it's pretty cool.
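A quick back-of-the-envelope check of the TeraFLOP claim, using only the numbers above (80 tiles, two FMAC units per core) and the usual assumption that each FMAC retires one multiply-add, i.e. two flops, per cycle; the clock rate is derived, not a quoted spec.

```c
#include <stdio.h>

int main(void) {
    const double tiles = 80, fmacs_per_tile = 2, flops_per_fmac = 2;
    const double flops_per_cycle = tiles * fmacs_per_tile * flops_per_fmac; /* 320 */
    const double target = 1e12;                                             /* 1 TFLOP/s */
    /* Clock needed to hit one TeraFLOP at peak: ~3.1 GHz. */
    printf("needed clock: %.2f GHz\n", target / flops_per_cycle / 1e9);
    return 0;
}
```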
But the problem with it, of course, is that it was a stunt. We know that. It was a stunt. You know? No one could do any real, serious software with a chip like that. So the next one in the family, the 48-core chip, was built as a software research vehicle. Now, Krste Asanovic talked about Intel with its hand layout and all that stuff; that's indeed what we do with our products. But for this we wanted to go directly synthesizable off the RTL. And there were only a few cores we had that you could directly synthesize from the RTL when we went to tape out this chip, and the one core we had to do that with was the P54C. The P54C is an original Pentium -- an ancient core, we know that. But the idea is we wanted something we could grab off the shelf and drop in there, x86 so people could write real software, and it's a message-passing architecture.

Now here's the interesting thing about this chip: we have 2 cores per tile, 24 tiles. They have their regular cache architecture for the individual core. Then there's a message-passing buffer, which is a scratch -- it's a terrible name, message-passing buffer, because what it is, is a scratch space. It's scratchpad memory. So to the programmer this is what the chip looks like: I have 48 cores, each with an L1 and L2 and private DRAM. Then I have this message-passing buffer, which is a high-speed, on-die scratch memory space. And then I have this off-chip shared DRAM, but there's absolutely no cache coherence on that off-chip DRAM. So what we're trying to do with this chip is explore an alternate design where you have some shared memory. But remember, a lot of the problems with shared memory come not from sharing the memory; they come from the shared address space with accidental sharing, where you can accidentally stumble over each other's addresses. Here all of that is managed in software. There is no non-scalable cache coherency protocol. This is us trying to have our cake and eat it too: there's just a little bit of shared memory, but hopefully without the bad parts.

So we built this chip. And another interesting thing about this chip is that we have explicit control of the power. We have voltage control -- I'm sorry -- frequency control at the tile level. You can individually vary the voltage on the interconnect. And then you have these blocks of 8 cores per voltage domain, so I can vary the voltage on these voltage domains. And all of that is exposed to the programmer. So we can do research with people experimenting with explicit control of the voltage and the frequency. Marvelous research platform. And in a long version of this talk I would go through research results and talk about it. Invite me to come back some time and I can talk to you a whole bunch about that.
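For a feel of what software-managed sharing through an on-die scratchpad looks like, here is a hypothetical C sketch of a flag-based transfer through a message-passing buffer. The mpb_base() call, the slot layout, and the flag protocol are illustrative assumptions, not the actual SCC or RCCE API; the point is that all the data movement and synchronization are explicit in software, with no cache coherence protocol underneath.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical: returns the address of a core's slot in the on-die
 * message-passing buffer (scratchpad). */
extern void *mpb_base(int core);

typedef struct {
    volatile int full;      /* 0 = slot empty, 1 = payload ready */
    char payload[256];
} mpb_slot_t;

/* Producer: copy into the consumer's MPB slot, then raise the flag. */
void mpb_send(int dest_core, const void *buf, size_t n)
{
    mpb_slot_t *slot = (mpb_slot_t *)mpb_base(dest_core);
    while (slot->full)                  /* wait until consumer drained it */
        ;
    memcpy(slot->payload, buf, n);
    __sync_synchronize();               /* payload visible before the flag */
    slot->full = 1;
}

/* Consumer: spin on the flag in the scratchpad, then copy out. */
void mpb_recv(int my_core, void *buf, size_t n)
{
    mpb_slot_t *slot = (mpb_slot_t *)mpb_base(my_core);
    while (!slot->full)
        ;
    memcpy(buf, slot->payload, n);
    __sync_synchronize();
    slot->full = 0;                     /* hand the slot back */
}
```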
But in the five minutes I have left, I want to talk to you about something else. So I hear again and again: shared memory is so much better than message passing; shared memory is just the only way to go. And the people who say that, I submit, haven't written much if any message-passing code and shared-memory code. Or they base that conclusion on the following: they take a matrix multiply code, a matrix multiply kernel, and they run it. Or they take some other trivial little toy program. All right?

What I want to point out is that I've done professional software development of products that use both message passing and a shared address space, and you measure from when you start the project to when you deliver optimized, validated code -- not little toy demonstration code, I mean real applications. And what you find with message passing, plotting effort over time, is that you indeed have a brutal learning curve. And if you never do the optimization and validation -- if you're just looking at this beginning piece of the curve -- then indeed you're going to conclude that shared address space programming is much easier, because you can just sort of, you know, add a directive here or there or do a couple of fork-joins. It's real easy to sort of sneak in that concurrency control. Whereas with message passing, I have to break up my data structures and decide how they're going to be distributed. I've got to do a lot of work up front. But the point is -- and here's the key thing -- when you look at real code that you're going to deliver, [inaudible] an application that must be optimized and validated, then after you've done that initial parallelization you have to go through and optimize and validate. And I'll tell you, once you've broken up those data structures into chunks -- because so much of optimization is managing data locality -- that job's pretty much -- you've done the hard part of it already. What we find with shared address space programming is that you don't do that work up front, but by the time you get it optimized you've done that work. You have to cache block. You have to break things up. So I submit: look, just do it up front. In the long run you're better off, and you don't have race conditions.

If you're disciplined in how you use the message passing -- and I can tell you exactly what I mean by that; it's basically don't use wildcards -- if you're disciplined in your use of message passing, you write code that is almost automatically race-free. Whereas of course with shared address space programming, proving you're race-free is NP-complete, and then you change the data set and the program that your tool told you was race-free is all of a sudden full of races. So I submit that someday we as a community will wake up and go, "I can't believe we were insane enough to ever push this shared address space cache coherence," and we will all change to a message-passing model.

And I want to -- okay, so I want to emphasize two things and then I'll be done. And I'm watching the clock; I'm going to be done exactly at four-thirty. All right.

>> : You're always going to be last.

>> Tim Mattson: I know. That's what I get for that. A lot of people criticize message passing because they say, "Well, gosh, no one wants to put all these sends and receives in their programs." Well, if you look at a serious message-passing program, there aren't very many sends and receives. And I have an example project here in the slides I can talk to you about, where they build a linear algebra library. What they do is break their job down and decompose it into panels of matrices. And they're doing dense linear algebra, so they have domain-specific collective communications: all-gathers, reduce-scatters. This really -- they call it send-receive; really, exchange is a better name. The point is that in the real application you have collective communications that map to a domain-specific library. There are very, very few instances of MPI send, MPI receive -- few to almost none.
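Here is a minimal MPI sketch in the spirit of that point: a block-distributed matrix-vector product and a global norm written entirely with collectives. It is an illustrative example, not the library project from the slides; note there are no MPI_Send/MPI_Recv calls and no MPI_ANY_SOURCE wildcards, so the communication pattern is deterministic.

```c
#include <mpi.h>

/* y_local = A_local * x, with A distributed by blocks of rows and x
 * distributed by blocks; the only communication is one collective. */
void matvec(const double *A_local,  /* n_local x n, row-major            */
            const double *x_local,  /* my n_local entries of x           */
            double *x_full,         /* scratch, length n                 */
            double *y_local,        /* my n_local entries of the result  */
            int n_local, int n)
{
    /* Every rank gets the full x: one all-gather, no point-to-point. */
    MPI_Allgather(x_local, n_local, MPI_DOUBLE,
                  x_full, n_local, MPI_DOUBLE, MPI_COMM_WORLD);

    for (int i = 0; i < n_local; i++) {
        double sum = 0.0;
        for (int j = 0; j < n; j++)
            sum += A_local[i * n + j] * x_full[j];
        y_local[i] = sum;
    }
}

/* Global residual norm squared: one reduction, again no sends/receives. */
double norm2(const double *y_local, int n_local)
{
    double local = 0.0, global = 0.0;
    for (int i = 0; i < n_local; i++)
        local += y_local[i] * y_local[i];
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    return global;
}
```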
And so I'm telling you, message passing, if it's done right, is usually structured. Like in the ScaLAPACK world, they have the BLACS, the Basic Linear Algebra Communication Subprograms. Same thing there. Experienced message-passing programmers do not write a lot of sends and receives. They do these global, these collective, communication operations. It's actually not as hard as many of you think. Give it a try. You might find you like it.

And then the other thing I'll close with is Barrelfish. I love Barrelfish. All right, why do I love Barrelfish? Because they take the concept of an operating system that's split between a host and devices, and it's a message-based operating system that gives you a consistent view across both host and device. It's fundamentally based on message passing, not a shared address space with cache coherence underneath, and therefore it maps onto these modern heterogeneous platforms. I love it. And I hope to spend a lot more time with it. It's four-thirty. I have to finish. Thank you.

[ Audience applause ]

>> Juan Vargas: Well, that was very nice to wake up again.

[ Inaudible audience talking and laughter ]

>> Tim Mattson: We have a panel.

>> : That's okay. Let's [inaudible]...

>> : It's good that you don't speak for Intel, because all of the Intel processors are cache coherent.

>> Tim Mattson: You notice how many times I said at the beginning that I don't speak for Intel.

>> : So I'm sorry, Tim, but I have to [inaudible] the following: I've gone on the record many times saying, "Shared memory is the world's best compiler target and the world's worst programming model."

>> Tim Mattson: Okay.

>> : Okay? Don't confuse the two. Don't say, "Well, we have to build our architectures to pass messages because we need the handcuffs to keep us from killing ourselves with data races and things like that." How about this? Don't use coherence for everything. Don't use shared memory for everything. Instead, write in a functional or mostly functional [inaudible] as little updating [inaudible] as possible, and do everything with messages at the programming level, or maybe even at a higher-level map. But don't blame it on the fact that the hardware underneath is shared memory. The great thing about shared memory is that it allows load balancing and it allows dynamic localization. And if you use it for that, it'll be your friend and you won't kill yourself with cache coherence. If you use it the way you were using it today, I agree with you. But we have to stop doing that at least.

>> Tim Mattson: Excellent point. You may very well be right, though I'm not necessarily ready to concede it. But I think where you and I do agree is -- I'm a software guy, I'm not a hardware guy. What I'm really talking about is...

>> : Yeah, the software.

>> Tim Mattson: ...the people writing the software should write MPI. They should write message passing. They should write a code -- yeah, that's what they should write. And if underneath, you hardware guys who are much smarter than me -- I know that -- figure out that the best way to support it is with a shared address space, you go for it. But I think programmers are just not going to get the shared address space straight.

>> Juan Vargas: Okay, so there are more questions.

>> : Sir, I think you maybe [inaudible] examples in the beginning, but aren't patterns supposed to make -- hide a lot of these details from most programmers?

>> Tim Mattson: So the question was: won't patterns hide most of these details from programmers? That's the ultimate dream.
But let me be really clear on software development and the patterns world because I'm kind of one of the people pushing that really hard. When I go deeper and I talk about the whole software stack, I also stress the importance of supporting opportunistic refinement so that the developers on your teams can drop lower. So even though that top-level domain specialist programmer probably never will go below the patterns, there will be plenty of people on the team who need to go all the way down to the lowest level. And so it's not true that when you look at the whole software development stack the people are not going to have to pay attention to the low-level details. And, therefore, it is important what programming model we expose to those people, the efficiency programmers. >> Juan Vargas: John also had a question. >> : Oh no, it was [Inaudible]. >> : Yeah, so I mean I think the [inaudible] are kind of like the driver's manual. And what we're really looking for is guardrails. And the message passing model can [inaudible] a lot of those guardrails, right? It stops you as a programmer doing the things that are going to really hurt you in the long run with, say, shared memory just... >> Tim Mattson: Here's the -- It does that. But here's the other thing it does: it forces you to make the hard decisions up front. And what I'm saying is by the time you optimize a highly scalable code, you're going to make those decisions anyway. You know? I've written and optimized more OpenMP code than I could ever count. And I'm telling you, by the time you're done, you have figured out how the data is going to block, what the locality is. I mean, you know, same thing with Pthreads code. You know? You're going to have to figure all that stuff out anyway. Kind of what I'm saying is, "Okay, do it up front at the design phase when you're starting out. That's the best place to do it. In the long run you're going to be better off." >> : So I agree with that statement, but then you make the conclusion that I should be doing it in MPI. >> Tim Mattson: Well, okay, fair enough. Fair enough. My conclusion is you should do it with a programming model that does not expose a shared address space. [ Multiple audience voices ] >> Tim Mattson: You're right. >> : Suppose it was a single-assignment language, Tim. Okay? Suppose it was a shared address space, but you couldn't overwrite anything. All you could do is collect it when it was dead. >> Tim Mattson: So... >> : SISAL. >> : Like SISAL, yeah. >> Tim Mattson: Like SISAL. >> : Like SISAL. >> Tim Mattson: And you know, SISAL was such a successful language. I mean, how many people are writing SISAL right now? Huh? Really? Interesting. You know, I was involved with PARLOG. How many people are writing PARLOG right now? You know, the fact of the matter is, we're not going to be able to dictate what language people are going to use, so we have to come up with a collection of abstractions that work for the families of languages people actually use. >> : Yeah, but if you look at languages that are emerging and are successful and so on -- like Scala, for example. Scala is a very good language for a lot of reasons, but one of them is it has a very powerful functional subset. You don't have to write it in this imperative style if you don't want to. >> Tim Mattson: Right. >> : And there are others like that. So I [inaudible]. >> Juan Vargas: So John [inaudible] had another question.
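(A minimal sketch of the "you end up blocking anyway" argument, not Tim's code: a cache-blocked matrix multiply. The tile size and the choice of how the arrays are laid out are exactly the data-decomposition decisions that message passing forces at design time and that shared-memory tuning forces later. N, B, and the row-major layout are arbitrary assumptions for the illustration.)

    /* Cache-blocked matrix multiply: C += A * B on row-major N x N matrices.
     * Assumes the caller has zero-initialized C. */
    #include <stddef.h>

    #define N 512
    #define BLK 64   /* tile chosen to fit in cache; an assumed value */

    void matmul_blocked(const double *A, const double *B, double *C)
    {
        for (size_t ii = 0; ii < N; ii += BLK)
            for (size_t kk = 0; kk < N; kk += BLK)
                for (size_t jj = 0; jj < N; jj += BLK)
                    /* Work on one BLK x BLK tile at a time for locality. */
                    for (size_t i = ii; i < ii + BLK; i++)
                        for (size_t k = kk; k < kk + BLK; k++) {
                            double a = A[i * N + k];
                            for (size_t j = jj; j < jj + BLK; j++)
                                C[i * N + j] += a * B[k * N + j];
                        }
    }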
>> : So [inaudible] brought this up before, but it seemed to me that there should be a separation between shared memory and cache coherence. And you link the two. And I say that from the standpoint that I've worked on projects where I'm explicitly using message passing to pass tokens to data, but I'm leveraging the shared memory so that I don't have to copy stuff. I think you're muddling things by saying shared memory is evil when what you really mean is shared memory plus cache coherency. >> Tim Mattson: Right. I try to be careful and I do slip up. So let me be clear. What I'm attacking is a cache-coherent shared address space. The most productive programming model I've ever worked in is Global Arrays. Now granted, that's my application domain so it's -- But that's a shared memory model. >> : Right. Right. So that's why I'm saying it's very interesting... >> Tim Mattson: Yeah, you are absolutely right. Let's be clear. I'm attacking a shared address space programming model. >> : [Inaudible] is good. >> Juan Vargas: So I think we have to stop here. And I learned something after listening to Tim: instead of having breaks, we just have him talking during the breaks. >> Tim Mattson: So no one has to listen to me. That's good. [ Multiple audience voices ] >> Tim Mattson: Yeah. >> Juan Vargas: Okay. So now we go to a panel, and we need to put some chairs so that we can start with the [inaudible]. [ Multiple audience voices ] [ Silence ]
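(A minimal sketch of the last questioner's approach, my illustration rather than their actual code: message passing carries only a token -- here just a pointer -- while the payload stays in shared memory, so nothing is copied. The one-slot mailbox built from a Pthreads mutex and condition variable, and the array size, are invented for the example.)

    /* Pass a token through a mailbox; the data itself is never copied. */
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    typedef struct {
        pthread_mutex_t lock;
        pthread_cond_t  ready;
        double         *token;   /* NULL = empty; non-NULL = message waiting */
    } mailbox_t;

    static mailbox_t box = { PTHREAD_MUTEX_INITIALIZER,
                             PTHREAD_COND_INITIALIZER, NULL };

    static void *producer(void *arg)
    {
        (void)arg;
        double *data = malloc(1000 * sizeof *data);   /* lives in shared memory */
        for (int i = 0; i < 1000; i++) data[i] = i;

        pthread_mutex_lock(&box.lock);
        box.token = data;                 /* send only the token, not the data */
        pthread_cond_signal(&box.ready);
        pthread_mutex_unlock(&box.lock);
        return NULL;
    }

    static void *consumer(void *arg)
    {
        (void)arg;
        pthread_mutex_lock(&box.lock);
        while (box.token == NULL)
            pthread_cond_wait(&box.ready, &box.lock);
        double *data = box.token;         /* ownership transfers with the token */
        box.token = NULL;
        pthread_mutex_unlock(&box.lock);

        double sum = 0.0;
        for (int i = 0; i < 1000; i++) sum += data[i];   /* no copy was made */
        printf("sum = %f\n", sum);
        free(data);
        return NULL;
    }

    int main(void)
    {
        pthread_t p, c;
        pthread_create(&c, NULL, consumer, NULL);
        pthread_create(&p, NULL, producer, NULL);
        pthread_join(p, NULL);
        pthread_join(c, NULL);
        return 0;
    }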