>> Darko Kirovski: Welcome. Today I would like to welcome Professor Cong from the UCLA Computer Science Department to Microsoft Research. He'll be talking about converting C code to FPGAs. I want to present Professor Cong as one of, I would say, the brightest stars of electronic design automation over the past probably 15 years, in all of academia as well as industry. He has done pretty much everything that there is in academia. He's the chair of the department. He is an IEEE Fellow, and has produced a couple of students who have ended up at Wisconsin, UT Austin, UIUC, et cetera. He's the chief technology advisor to Magma. There are just tons of things; I could spend pretty much the entire hour introducing Professor Cong. I think the best thing is if we get down to the technical talk. Thank you, Jason.
>> Jason Cong: Darko, thanks very much for the introduction. It's my pleasure to be here. Even before walking into the room Darko had lowered my expectations: you may not even see people, because people can sign up remotely, and so on. In fact that was my expectation as well, because I always associate the name Microsoft with leading software research. I didn't even expect the topic would be of general interest, so I appreciate your being here. But I think it's getting to the point that the boundary between software and hardware is blurred quite a bit. There's also a lot of opportunity to use some of the techniques I'm talking about for accelerated computing as a new computing platform in general. So as we go through that, I'll be happy to take questions and have some discussion. Since this is my first time talking at Microsoft Research, I'll give you a quick overview of some of the things we are doing. We are looking at a number of topics, from physical synthesis with scalability and variability to novel interconnects for communication.
In particular we're looking at 3D designs and also at using RF interconnect for on-chip communication. This is actually very useful for multicore types of designs. 3D will cut down the communication between the different cores, and core to cache, and gives you new types of configurations. The RF, which I'll talk about in a moment, gives you the capability to multiplex multiple channels on the same physical medium, so it allows you to have a reconfigurable interconnect. And the topic I'm going to talk more about today is electronic system-level synthesis from the behavior level and system level. Let's see. I'll just come down to this slide and talk a little bit about 3D. It's a (inaudible) project at UCLA that started about five or six years ago. We have developed a very complete system. By the way, today if you go to any EDA vendor or CAD tool vendor wanting to buy a tool suite to do 3D design, that does not exist. This is typically how it works: DARPA or NSF will put in research dollars to enable the technology first, and then industry will follow. In this case, in the second phase, we have a joint project with IBM, Penn State and UCLA. We pretty much created a physical design flow from 3D floorplanning and placement to detailed routing, and that interfaces with a commercial router from Cadence. And all of this is on an open source data model called OpenAccess. If any people here are interested in doing some architecture exploration, that's actually very useful. We have several papers using this as a kind of physical prototyping tool, just to say: if we can actually design things in 3D instead of 2D, how much will that impact latency, power, and so forth? This is a quick introduction to the RF part, where we have done a lot of work looking at novel interconnect schemes. What we have so far is this: you can see that today, because of the power limitation, we are limited in our communication bandwidth.
I mean, the processor runs at more or less four or five gigahertz. However, if you look at the basic transmission line today, even in a 90 nanometer process, it can carry somewhere around 100 to 200 gigahertz of signal. So we are only using five percent, or less than five percent, of the bandwidth available. The idea here is to multiplex multiple communication channels on the physical wire, which is a well-terminated transmission line. By doing that, I can actually have a totally reconfigurable system. Of course you also get the benefit that it goes at the speed of light, right, and it's a much higher bandwidth, but the more important part for a multicore or many-core system is that I can talk to you on demand. We can create a channel at a particular frequency; if we need to talk a lot more we can grab more frequencies, and when we're done we can release them so someone else can talk. This work was just published at the HPCA conference back in February and received the best paper award, so the community seems to really like the concept and is looking in this direction. So this is the idea. You can still have the same mesh network-on-chip arrangement, but you can superimpose a freeway on top of that. With that, different access points can create different shortcuts for you to have dedicated communication. At a very high level you can think of it this way: you have the surface streets, you build the freeway on top, and you can have reconfigurable interconnects between any two points. Or you can have multiple pairs of them talking simultaneously. The number of pairs is determined by the number of channels we can multiplex on the high-speed transmission line. Okay.
But where I want to spend a little bit of time today is on electronic system-level design, where what we have done is raise the design abstraction from the current VHDL and Verilog to a higher level. So why are we interested in doing that? First, if you really design a customized chip, which (inaudible) has been our research focus at UCLA in my lab, and we work with almost all the major semiconductor companies very closely, you are looking at 10 million gates to 50 million gates. But even 10 million gates takes you half a million lines of RTL. So it's a lot of code to create. I'll show you later on that by moving up the level of abstraction you can reduce that by roughly a factor of 10. Also, almost all the interesting SoC designs, system-on-chip designs, have either one or multiple processors in there, right? So it would be much nicer if you could use a common specification in a C-like language, so that part of it goes to the embedded processors and the rest can be compiled and synthesized into hardware. Moreover, there's a lot of push in the hardware community just to be able to have executable specifications. No one is willing to dive into writing half a million lines of VHDL or Verilog code from a PowerPoint specification. They want to have a higher-level executable model first, validate that model, and then refine it into RTL. So if you walk into almost any major semiconductor company today, they actually have a lot of models written in C or C++ or a variation of C called SystemC. That's basically a C++ program with a special set of extension libraries to describe hardware behavior: communication, synchronization, clocking, and so forth.
So in a way, the community or the industry is also ready for something of this kind, because it reminds me of the revolution we went through 10 or 12 years ago. Remember, when the Verilog language was first introduced it was not intended as a synthesis language, right? Because Verilog allows you to write models at the RT level, the register transfer level, which can simulate 100X or 200X faster than at the gate level, a lot of people started coding things in Verilog, or later on VHDL, just for the purpose of simulation. But once you have all those models available, the question is: why should I refine them manually into a gate-level specification, a netlist? Can we automate that process? That's where Synopsys started the revolution, so we can all design at the RT level now. I think we are at a similar transition point. If we can enable people to design circuits using just C or C++ and then refine that into traditional RTL, that's where we want to go. And finally I'll show you that once we have this technology, it also creates the capability of doing so-called accelerated computing or reconfigurable computing, where you can couple CPUs with FPGAs as a platform to enable much higher performance at a much more reasonable power budget. Okay. So that's the motivation. The research started at UCLA in the year 2000, when we started looking in this direction. I personally started working in this area for another reason as well, because clearly in chip design today the performance is dominated by the interconnect wires. There's a lot of concern that we cannot scale the performance any further, which is true, but at that time we also realized that not all long wires are bad. The long wires you have to cross in a single clock cycle are the performance bottleneck. If a long wire can be crossed in multiple clock cycles, that's fine.
I said okay, this is a great idea, maybe I should design a synthesis system so that I allocate multiple clock cycles for every long wire. We looked in that direction a bit further, but it turned out that if you want to do that kind of optimization in synthesis, you cannot start with the existing RTL specification, because by definition RTL is a cycle-accurate specification. It tells you what to do at every clock cycle. You may actually have made a mistake as a designer, and you don't give me the freedom to optimize. I really want to have total freedom to schedule the computation and the communication simultaneously. In order to do that, I have to move up to the behavior level. Of course, once we started doing that, we also saw a lot of benefits of moving to the higher level for design productivity, for verification, and for a number of other things. So this is the prototype system that we have developed at UCLA, and later on I can also mention a little bit about the commercialization effort. We have a system that can take C, C++, or SystemC as input, and on the other side it can take the platform description: for example, whether this maps to an FPGA or to an ASIC, and if it's an ASIC, whether it's 90 nanometer or 65 nanometer. Then we compile it into a system-level data model, where we have the capability (is there a pointer I can use?) to map it into three different alternatives. One is that you go to customized hardware through behavior synthesis. The other one is to go to customized processors or standard processors; it doesn't have to be one, it could be multiple of them. And finally you can also generate the interfaces to put everything together. The part I want to talk about a little further today is this automated hardware synthesis part.
So from a C-level behavior specification, first we go through a number of compiler-like optimizations and transformations: for example loop unrolling, pipelining, strength reduction, bitwidth analysis and optimization, memory analysis. I didn't list all of them. For example, we also have interprocedural analysis, and so forth. On the compiler side we decided to take a hybrid approach. In a lot of previous efforts people started by writing a dedicated compiler. I think that's probably wrong, because this is where the community has done a lot of research. It's best to leverage an open source compiler, where you can take all the existing effort and then add the hardware-specific passes on top of that. That's the approach we took. I don't know how many of you are familiar with the LLVM compiler from the University of Illinois. That was the one we actually used as the base. It seems to me that it's very big at Apple. It's an open source effort, and 10 to 20 engineers from Apple check in code all the time. It's a very active effort. From what I understand, that is the next compiler framework they're going to move to. And then of course the core research here is scheduling and resource binding: how do you generate very efficient, competitive hardware? I guess no one will argue with me about the motivation for doing something of this kind. The question is how competitive you can be, right? That's where I put in a lot of effort. I will tell you a little bit about the algorithms we use, at a very high level, and if I have some one-on-one time I'll go into more detail. And finally we generate the VHDL or Verilog. We can also generate SystemC, so you can simulate it without any license, because those are C++ programs. The target can be FPGAs or ASICs. We also generate the constraints. The constraints could be multicycle constraints or false path constraints.
We also plan to do very aggressive power optimization. By the way, this is a unique advantage when you do synthesis from the behavior level. Power is a big constraint, a big concern. So how do you minimize it? Really the most effective way is just to shut things down when you are not using them, right? Today we are doing coarse-grain power optimization at the OS level or system level, so if I am not using the disk for 30 seconds I can shut off the disk. But in fact, even when you are using the disk, if you look at the disk controller, it might be that only 15 percent of the circuits are active at any given point and the other 80 percent are not active. But the designer does not have the capability to go in and shut down the circuits one by one at a very fine-grain level. That's the technology we can provide, because we do the entire scheduling, remember. We schedule the computation onto all the resources. We know exactly what is going to be on and what is going to be off. We had that idea for a long time. It used to be the case that there was no way to communicate it to the downstream implementation tool. I can say okay, I'm going to shut down this multiplier for 16 clock cycles, and I'm going to leave it on for another four clock cycles, right. How do I tell the implementation tool? There was no way. Only very recently, a few months ago, the industry got together, Synopsys, Cadence, Magma, and created a format called UPF. How many have heard about that? Unified Power Format. It allows us to pass the design intent down from the high level. They created the format not for this kind of automated tool but really for human designers, right. If you're an RTL designer, and if you're really very careful and very clever, you can say okay, these things go to high VDD, these go to low VDD, this one will be off, and so on. So we think that is actually ideal for us to pass the intent to the implementation tool. That's also a very active research area.
Let me show you one slide of results first, just to convince you that we can do this. For example, this is a result from two years ago, from one of our industry partners who gave us an MPEG-4 profile decoder. They chopped the design up into a number of blocks. These are the modules within that design, these are the submodules, and these are the numbers of lines of C code. You have, for example, a copy controller, motion compensation, a parser, a decoder, and so forth. All together it's about 5,000 lines of C code. We can take that and synthesize it into VHDL in an automated fashion. Altogether we get about 56,000 or 57,000 lines of code. So in terms of code density we get a reduction of about 10 to 1 or 11 to 1. Not only that, this one works when we map it to a Xilinx FPGA. These are the numbers of slices you have. All together it's less than half of the smallest device available at that time for Virtex-II. And it also works at the frame rate required. So the design can also be very competitive. That's one example. Later on I'm also going to show you some designs compared to manual designs. So this is the logic we use, and this is the embedded memory we use.
>>: How (inaudible).
>> Jason Cong: I'm sorry?
>>: How are you shutting off components?
>> Jason Cong: In this one we are not doing that. On the FPGA you cannot do that, but with a customized ASIC design we can. Today power is a big concern, so when you go down to 65 nanometer or 45 nanometer, they provide you with standard cell libraries, but they also provide you with so-called MTCMOS, where you have the capability to have your virtual power rail connect to the real power rail through a sleep transistor. So when you are not using this part of the logic, I can just turn off this sleep transistor, and this small portion of the circuit will be off. Okay.
>>: (Inaudible).
>> Jason Cong: Yeah, so that's the way to solve the leakage. You can do a lot of things to help with dynamic power. For leakage, the only way is to turn things off when you are not using them. Right. But you can see, I can do it at a very fine-grain level. I mean, there are three multipliers, a chunk of logic, and let's say I'm not doing motion compensation right now, so I can turn it off. This is very scary for a human designer to do. They just cannot think at that level of detail, right. So this kind of thing is perfect for automation. The advantages are quite clear. As I said, first, if you write your model at the higher level you typically get a 100X to 1,000X speedup in simulation time. I showed you that you get better code density, and more importantly you can do very rapid exploration. I have some examples to show. There are two kinds of exploration you can do. One is that you can explore the boundary between hardware and software. In that MPEG-4 design I showed you, maybe not all the modules are equally important. I can implement some of them on the embedded processor and some of them in customized logic. If you start with a uniform C-like specification you can experiment with that. The second thing is that even for the same behavior model you can try different latencies or frequencies. With RTL you just say okay, I want to implement this thing at 250 megahertz or at 500 megahertz, so you set a target, a team of people works on it for several months, and you get a design. But with this one, we can say okay, what happens if you go with 500 megahertz, what happens if you go with 400 megahertz, right? Each one of them will give you a different area, and more importantly a different latency, a different throughput, and also different power. So I actually believe we can get better results than human designers, for multiple reasons.
One is what I told you about power: it is very difficult for designers to deal with that fine-grain level and decide what is on and what is off. Also, you have to make a lot of decisions when you write RTL code. Some designers of course are very experienced. They say okay, I know every trade-off. But a lot of them are just starting and have limited experience in IP design. Say you have two computations, and you have to decide whether to share an adder or not, or share a multiplier or not. If you share it, what does that mean? You have to have a multiplexer right in front of it, and then you decide at what time each data value goes through. But you have to know: is the area of the multiplexer bigger, or is the area of the adder bigger? If I ask you that question, even if you have done a lot of RTL designs you may not know, because it depends on the platform, whether you go to a Xilinx FPGA or to TSMC 65 nanometer. We actually model all this platform information so we can make those decisions when we generate the RTL code. Whether you share or don't share, you get different RTL code. So with all these things, we believe that once we automate them we can actually give you better results. We are getting there. Let me give you some examples of the types of optimization we have. One very important step is going from a behavior-level model to a cycle-accurate RTL model. In the behavior model there's no concept of cycles; it's just the computation that needs to be done, whereas in the RTL model you have to say what computation needs to be carried out in every clock cycle. An analogy would be this: you may have Bill Gates or (inaudible) wanting to reach a hundred billion dollars next year. That's a very high-level goal. So the COO and the president are going to refine the target quarter by quarter, maybe even month by month and week by week, right? That's what we call scheduling.
Someone will probably also do the allocation, saying division by division which one should achieve which target, then department by department, and so forth. That's what we call resource allocation. So scheduling is an important step. Given a computation graph, how do I figure out what computation to carry out in what clock cycle? What makes it really difficult is that first you have very different applications, data intensive, compute intensive, memory intensive, and there are also a lot of constraints. For example, you have resource constraints. I only have (inaudible), and not only that, once you go to an FPGA you have different types of resource constraints. I have a number of logic slices, right, I also have built-in DSPs, and I have built-in memory. All of those will be constraints. There are latency constraints, frequency constraints, relative timing constraints. And then if you want to do better optimization you also have to be careful. Some operations can be done together in one clock cycle. We call that operation chaining. Some operations actually have to spread out over multiple clock cycles, so those are called multicycle operations. And some of them have to be handled with pipelining, right, so that even before the first operation is finished I pump in another data item. Of course if you are an experienced designer you can possibly do all of those, but still there's a limit. When you get to half a million lines of code, imagine how much optimization you can do for every one of them. So we see this as an ideal opportunity for optimization. However, we looked at the literature very carefully, and really the problem is very hard. There are basically two kinds of approaches. One is that people are shown the (inaudible), so all they can do is use very simple-minded heuristics. The most common one, which you have probably heard of, is called list scheduling. It's not only used for hardware synthesis, it's also used for multiprocessors.
You somehow order the operations or tasks in a linear order, and with some heuristic weighting functions or priority functions you decide which one to execute next and which resource it goes to. So it's efficient, but it's greedy. The other way is to try to capture all the constraints, so it becomes a linear programming formulation. It turns out you can do some reasonable approximations so that almost all the constraints will be linear. The challenging part, however, is that the variables are integer variables, because you cannot say this operation happens at step 2.7. You have to say either clock 3 or clock 4, right, so they have to be integers. That's where you have the difficulty. We actually made very significant progress a couple of years ago. We came up with a new type of approach using a so-called system of difference constraints. Let me illustrate at a very high level. Let's say you have an adder and you have a multiplier. The adder is faster, two nanoseconds; the multiplier is five nanoseconds. I have a target clock cycle of 10 nanoseconds, which is 100 megahertz, a slow design on an FPGA. And let's say, just for illustration purposes, the resources are only one multiplier and one adder; that is all that is available. First you have the so-called data dependency constraints. Those are not too difficult to write down. You have x3 minus x1 at least zero, which basically says v3 cannot execute before v1, right? I can also have x3 minus x2 at least zero. So that covers the dependency constraints. Then you can look at the frequency constraint. Okay. If I have a multiplication, a multiplication, and an addition (and subtraction is the same thing), can I have all three of them in the same clock cycle? No, because if you add five nanoseconds, five nanoseconds, and two, you exceed the clock period.
So now I can say x5 minus x2 has to be at least 1, because they have to differ by at least one clock cycle, right? That's the type of frequency constraint I have. You also have a resource constraint, because I only have one multiplier. In that case, these two cannot be in the same clock cycle, so I can say x3 minus x2 is at least 1, because they have to be separated by at least one clock cycle. If you write out all these constraints you get a matrix and a variable vector. The variables are what we call the scheduling steps, or control steps, of those operations. And these are basically linear constraints. The difficulty comes from the fact that these have to be integer variables. In general, if you write it this way it is very difficult to solve. But you probably noticed that when I generated the constraints I was very careful: all my constraints have the form x_i minus x_j no larger than some constant. If you do that kind of modeling you can show this matrix has a nice property called total unimodularity. Once you have total unimodularity, you can just solve the system using standard linear programming techniques and you are guaranteed to get integer solutions. So that's the approach we take. Once we have this basic framework, the question is how we approximate all the constraints by this linear system of difference constraints. It turns out we can do that very efficiently. Let me give you another example of the kind of optimization we put in. After you schedule, you have operations in early cycles feeding into operations in later clock cycles. Once you cross the clock boundary, what do you need to do? You've got to store the data, right, before it disappears. So you actually introduce a few registers. This is a very sort of (inaudible) problem called register binding.
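To make the difference-constraint scheduling idea described above concrete, here is a minimal sketch in Python. It is not the UCLA tool's implementation: the function name `solve_sdc`, the four-operation example, and the fixed ordering chosen for the single multiplier are all hypothetical. Instead of a linear programming solver it uses Bellman-Ford-style longest-path relaxation, which is a standard way to solve a system of difference constraints. Because every constant is an integer, the control steps it finds are integers, which is the property the talk attributes to total unimodularity.

```python
def solve_sdc(num_ops, constraints):
    """Solve constraints of the form x[i] - x[j] >= c with all x >= 0.

    Returns the earliest (ASAP) integer control step for each operation,
    or raises ValueError if the constraints contradict each other.
    """
    x = [0] * num_ops  # every operation may start at step 0 at the earliest
    # Longest-path relaxation: num_ops + 1 passes are enough to converge
    # unless the constraint graph contains a positive cycle.
    for _ in range(num_ops + 1):
        changed = False
        for i, j, c in constraints:
            if x[j] + c > x[i]:
                x[i] = x[j] + c
                changed = True
        if not changed:
            return x
    raise ValueError("constraints are infeasible")


# Hypothetical example: op0 and op1 are chained multiplies (5 ns each),
# op2 is an add (2 ns) that uses op1, op3 is an independent multiply.
# Clock period 10 ns, one multiplier available.
constraints = [
    (1, 0, 0),  # dependency: op1 cannot start before op0
    (2, 1, 0),  # dependency: op2 cannot start before op1
    (2, 0, 1),  # timing: 5 + 5 + 2 ns > 10 ns, so op2 is at least 1 step after op0
    (1, 0, 1),  # resource: op0 and op1 share the multiplier (assumed order)
    (3, 1, 1),  # resource: op3 uses the multiplier after op1 (assumed order)
]
schedule = solve_sdc(4, constraints)
print(schedule)
```

In this sketch the resource constraints are turned into difference constraints by fixing an order among the multiplies up front; that ordering is an assumption of the example, chosen only to keep every constraint in the x_i minus x_j form.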
You have to figure out: I have all these variables, these are their lifetime intervals, how many registers do I need, and who goes to which register? We do that in computer architecture design as well. Except in hardware design, how you assign variables not only affects the number of registers, it also affects the amount of interconnect, how the wires are connected, because I have to introduce multiplexers when I have two registers feeding the same functional unit. So this is actually the best possible binding you can have: I have one execution unit, I have two registers, and I have two multiplexers, one with three inputs and the other with two. But don't be deceived by this picture. Each wire here is not just one bit, right? It could be 64 bits of data coming in, so those multiplexers are huge. Actually we found that 30 or 40 percent of the chip area could be consumed just by this steering logic. But it turns out that if I have a register file then my structure is very simple: I have an execution unit, two read ports, and one write port. Okay. So that's the model I have. And is it easy or difficult to get register files, let's say in FPGA designs in our case? It's actually not too difficult, because for those of you who have dealt with FPGAs, they have this concept called block RAM, right, distributed throughout the chip. Let me show you. Yeah, I used to have a figure; I probably deleted it. So in (inaudible) cases they all have tens to hundreds of embedded memory blocks. Those embedded memory blocks are ideal for implementing the register files. Now here you have an interesting question. Can I divide my entire design into a number of islands? Within each island I have a local register file, right, I have some data routing logic, and the rest will be the computational units. So that's the architecture I want to create.
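The lifetime-interval view of register binding mentioned above can be illustrated with the classic left-edge algorithm. This is a swapped-in textbook technique, not the binding algorithm from the talk: it only minimizes the number of registers and ignores the multiplexer and interconnect cost that the speaker stresses. The function name `left_edge_binding` and the lifetimes in the example are hypothetical.

```python
def left_edge_binding(lifetimes):
    """Bind variables to registers with the left-edge algorithm.

    lifetimes maps a variable name to (start, end), meaning the value is
    live in control steps start <= t < end. Returns {variable: register}.
    """
    binding = {}
    reg_free_at = []  # reg_free_at[r] is the step at which register r is free
    # Visit variables in order of increasing start time (the "left edge").
    for var in sorted(lifetimes, key=lambda v: lifetimes[v]):
        start, end = lifetimes[var]
        for reg, free_at in enumerate(reg_free_at):
            if free_at <= start:  # register is no longer holding a live value
                reg_free_at[reg] = end
                binding[var] = reg
                break
        else:
            # No existing register is free, so allocate a new one.
            reg_free_at.append(end)
            binding[var] = len(reg_free_at) - 1
    return binding


# Hypothetical lifetimes for four values produced by a schedule.
lifetimes = {"a": (0, 2), "b": (1, 3), "c": (2, 4), "d": (3, 5)}
binding = left_edge_binding(lifetimes)
print(binding)
```

Here at most two values are live at any step, so two registers suffice. A binding that also accounts for multiplexer area, as the talk describes, would need a cost model on top of this.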
Of course you can put everything into one register file, but most likely that will not be very good, because you have a limited number of read and write ports, so your concurrency is very poor. What you want to do is partition your computation into a number of islands. So that was another algorithm we developed, to explore the computation and communication locality and make efficient use of the embedded memories on chip. We get a very significant improvement with the distributed register file approach, DRF, versus the conventional discrete register approach, right? We can reduce the area by almost 50 percent because you make use of the embedded memories. And not only that, when you remove that much logic, the clock is also somewhat improved. There's also the question of how you generate these register file islands, right? There are different approaches. We experimented with more efficient algorithms versus less efficient algorithms, and we can see a 12 or 13 percent difference there. So this is another example of the type of optimization you can do. Finally I'll give you yet another example, about how we do the communication between different modules. Your design is not going to be just one module; it is going to have multiple modules. They communicate. How do they communicate? The common ways are FIFOs, buses, and shared memories. Shared memory is a different kind of animal, but FIFOs and buses share a common property. What's the commonality between them? I call them sequential communication media, because the order in which you write into the bus is exactly the order in which you read out of it, right, and the same with FIFOs. So that's one property we exploit.
Now, once you have different modules, typically today, by the way, when a (inaudible) does a design or Intel does a design, they will divide it among a number of teams. Each team will be responsible for one block. So each block itself is optimized, but the communication between the blocks can be poor. It turns out that how you generate the data from one module matters, because it's going to feed into the next module through FIFOs, right? You generate the custom logic block for one module and the custom logic block for another, and in a lot of cases people use a divide and conquer approach and design each one independently. But it turns out that the order in which you generate data is actually crucial for the guys after you, because a different ordering can enable a different type of implementation and also changes the latency of the communication. The idea is that the consumer will have some dependencies on the data; if you can generate those data first, the rest can follow later on. So that's one type of optimization we can have. What we did is show that you can take the control data flow graphs of all the modules together and model the communication channels as part of your scheduled resources. Then you actually have a much more global picture. By doing this we get a behavior and communication co-optimization, and you can end up with a much better result. In this case we have a number of DSP-like applications. This is the traditional approach, where you design every module separately and then hook them up together using FIFOs, versus this case, where we do the behavior synthesis and the communication synthesis simultaneously. You can see in some cases we get a very good reduction, 30 percent, 40 percent. And a big part of that is how you handle arrays.
Well, when you have array data, do you just produce the data row by row or column by column, or are there some clever schemes where you can produce the data in some specialized order? So that's the kind of ordering we explore. When you actually pass that to the downstream implementation, basically you have a different way to encode your matrix, right? But you don't really want to have very unstructured data passed to the next one. So you reorder the matrix, but you still want to compress it again. So basically we also have a compression technique to show that when you have this unstructured data, I can reorder it and I can still make it into a set of smaller arrays or some specially structured array. So if you don't do any optimization you can end up with lots of unstructured data, and once you do that optimization it's a lot more reasonable. So that's another type of optimization we put into the system. Okay. So I think all together we have been doing this research for six, seven years. The results look very promising. Some of my students graduated two years ago, actually two and a half years ago, so we were exploring what to do. So one decision they made is they actually licensed the tool from UCLA for commercialization. So there's a startup company not too far away from UCLA called AutoESL, and I'm also an advisor to the company. So they are doing the commercialization of this technology. So that's one thing I want to mention. What they have done a much better job on is that they have very robust language coverage now, with complete support of C, C++, and SystemC. They also have much better support for all the platforms. For us at the university we tend to focus more on the algorithmic part of that, right, so the core optimization engines.
They also came up with a simulation and verification flow where, let's say you have the test programs and test data for the C models, you can reuse that to test your RTL code. That's actually very significant. And for the whole tool, you can write the C code in any other way, but the default for them was to provide you with this Eclipse-based framework where you can do the editing, you can organize your designs, and then you can see the output of that. So that was one design they have. So here, for example, is another design where they were taking that part of the motion compensation to an ASIC implementation. This shows the power of doing architecture exploration, right? You can set the frequency to be one gigahertz, down to half a gigahertz, a quarter gigahertz. Try different ones. You can see you generate a very different number of cycles. So basically you get a totally different RTL implementation. The cell count differs by more than a factor of two, the area also differs quite significantly, and of course the critical path, the frequency, and the total latency are all different, because the latency is a combination of the frequency and the overall number of cycles. So that kind of exploration is also in our opinion very important. That actually affects design quality very significantly. Yes? >>: (Inaudible) how can you imagine with 900 megahertz? (Inaudible). >> Jason Cong: How do you -- so this is in picoseconds, so this is basically. >>: One nanosecond (inaudible). >> Jason Cong: Yes. So basically that's the resulting frequency, right, 900, so I think it's pretty close to a gigahertz. Not exactly. >>: But you (inaudible)? Isn't the constraint broken? >> Jason Cong: Yes, the constraint is broken. So here is kind of a classical problem, right. You tell the design team to work very hard on timing closure by giving them RTL. There's no guarantee that the RTL can be used to meet the constraint, right?
So this case basically shows you that you probably wanted to go as fast as possible in frequency and the downstream tool cannot achieve that. That's actually one data point. You can say okay, now I can back off, and then there's a different implementation. So I think that's (inaudible) something we look at as a result. And then by the way, it's also not the case that the highest-frequency design will be the most efficient. Because first it can burn a lot of power, right, and also, if you look at the frequency together with the cycle count, that is what actually determines the performance. What I want to show is that by having a single behavior model you can do this type of exploration very easily, in a matter of hours. They also have the flow to take it from a SystemC specification; in this case it is a data security type of application. It's a very similar kind of reduction: you start from a couple thousand lines of C code, SystemC code in this case, you generate RTL code of 23,000 lines, and in the TSMC 90 nanometer process this was about a 70K gate count. I mentioned there's a simulation and verification flow where you can reuse the behavioral test vectors and test bench to test your RTL code. One of the things I want to mention a little bit is that having a tool like this has actually enabled very interesting applications. Of course the primary use when we developed this technology was for hardware design. It turns out that once we have this capability we can also use it for computation very efficiently. Let me see -- I want to just switch to another presentation real quickly. I want to say a few words about that, because I want to show you one slide, because I promised to show something about the quality of the results of the tool. So this is actually one fairly recent data point; there's a large defense contractor working with them.
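The point that latency depends on both the clock and the cycle count can be made concrete with a small Python sketch. The three cycle counts below are hypothetical placeholders, not numbers from the slide; only the relationship latency = cycles × clock period is taken from the discussion.

```python
def latency_ns(cycles, clock_mhz):
    """Total latency in nanoseconds: cycle count times clock period."""
    period_ns = 1000.0 / clock_mhz
    return cycles * period_ns


# Hypothetical exploration results: (cycle count, achieved clock in MHz).
# A faster clock target typically forces the schedule to use more cycles.
designs = {
    "1000 MHz target": (40, 1000.0),
    "500 MHz target": (18, 500.0),
    "250 MHz target": (12, 250.0),
}

for name, (cycles, mhz) in designs.items():
    print(name, latency_ns(cycles, mhz), "ns")

# The lowest-latency point is not necessarily the highest-frequency one.
best = min(designs, key=lambda name: latency_ns(*designs[name]))
print("lowest latency:", best)
```

With these made-up numbers the middle design wins, which is one reason to sweep several clock targets from the same behavioral model rather than fixing the frequency up front.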
They want to implement triple DES on an FPGA, and they actually had a manual reference design. It's basically coming from an IP vendor as a reference point. So they have a C-level model and they want to see how that actually impacts the design. So it turned out that this is their manual implementation. It's an unpipelined implementation. Well, they give the behavior model to the tool. The tool can also generate a very competitive design. You can see that in terms of frequency and throughput it will be the same, in terms of lookup table area the logic is actually less, and in terms of the number of flip-flops it is slightly higher. So this is basically very comparable. That tells you the tool can match the actual refined manual RTL design. But what's more interesting is that with the same tool you can start experimenting. This is unpipelined. You can say okay, if I want to pipeline it, I'm going to feed in data every 18 clock cycles. I believe the whole thing, if you don't do anything, is actually 56 clock cycles. You have the latency there, right; you feed in every 18 clock cycles, and now your design is larger in area. You can see that the logic is about the same but the number of flip-flops increases because these are the pipeline stages, right? Now the throughput goes from 200 to 600. You can say I want to put in data every 6 clock cycles. Now you can get something which is 10X the throughput, and of course with more area and more logic. So first, you can match the manual designs in a very competitive way, but moreover it also gives you a lot of alternative solutions. I think that is what it enables. Yes. >>: What is the input? You said that it was basically a GCC program? How modified was it before feeding it into this process? I'm assuming it wasn't just an unmodified software implementation. >> Jason Cong: There's two things. One is that in terms of language coverage the tool is actually very good. I'll show you one slide.
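The throughput figures quoted for the pipelined variants follow from the initiation interval, that is, how often a new input can be fed in. The Python sketch below checks that arithmetic using the 56-cycle latency and the 18-cycle and 6-cycle intervals mentioned in the talk; the function names and the simple completion-count model are assumptions for illustration.

```python
def relative_throughput(latency_cycles, initiation_interval):
    """Throughput relative to the unpipelined design, which accepts a new
    input only after the previous one has finished (interval == latency)."""
    return latency_cycles / initiation_interval


def results_completed(total_cycles, latency_cycles, initiation_interval):
    """Number of results finished within total_cycles when a new input
    enters every initiation_interval cycles, starting at cycle 0."""
    if total_cycles < latency_cycles:
        return 0
    return (total_cycles - latency_cycles) // initiation_interval + 1


LATENCY = 56  # cycles for one block, as quoted in the talk

for ii in (56, 18, 6):
    print(ii, round(relative_throughput(LATENCY, ii), 2),
          results_completed(560, LATENCY, ii))
```

An interval of 18 gives roughly 3X and an interval of 6 roughly 9X to 10X, consistent with the 200 to 600 step and the "10X" figure described above.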
So one of the applications of this tool is at an investment bank, which actually surprised us initially. So there's a group of people at the investment bank who want to implement customized circuits to do stock option calculation, financial engineering, all of those things. Okay. Imagine that's probably the (inaudible) that financial system meltdown. (Laughter) They are trying to be too smart. So they actually took totally different code, right, I mean code written for a very different purpose, and were able to go through the system. We actually got a very strong recommendation and endorsement from them. So in terms of language coverage it is very good. But it doesn't mean that every piece of program, once you pass it through, will give you a good result. So you also have to think about the concurrency and some of the structures. For example, I will show you also some UCLA experience we have using the (inaudible) computing; how you structure the code is important. Now, so in terms of language coverage it is very good. It also gives you options. For example, with this pipelining you can actually add (inaudible) to the code, saying this loop I want to pipeline and here's the initiation interval. If you don't do anything, the tool will pick one, right? But then you can also give your own input to that. So it will be C code. The good thing is that the C code you feed into AutoPilot can also be passed to GCC, right? But it gives you some (inaudible) that you can also use as directives to the synthesis. >>: Was this financial institution using your software or the startup's modifications? >> Jason Cong: It is (inaudible). When it comes to real applications, it typically goes to the startup. This one is also the same thing, because in terms of language coverage, all of this, they provide much better support. Yes?
>>: (Inaudible) pretty much (inaudible) for C, nothing is of the right size, would (inaudible) if you have something more sensible like (inaudible) addressing rather than funny boxes with six wires in and four wires out? >> Jason Cong: So I did show a slide with AES. We also tried it with another one. So in that particular application they chose SystemC. So this is the difference between C and SystemC. SystemC is more intended as a C++ extension for hardware designers, right; it gives you the control to have arbitrary precision, bit-level manipulation, and some of the synchronization. So the reason is that when you want to have more control, you probably don't want to use the vanilla C language; you want something that gives you more power to do that. Yes? >>: (Inaudible) because someone is looking at (inaudible) validations themselves. How (inaudible) application (inaudible)? >> Jason Cong: You use a traditional profiler to do the profiling; that's very easy. And not only that, typically it's an iterative process, right: you come up with some model, you go to the tool, you get to the FPGA, you do some simulation or even just run at speed, and you can say okay, this module is too slow, and then you can come back to fix that. So it's more of an iterative process. They will actually give you the information about the simulation and the latency of every module. Yeah, these are some of the marketing slides; I will skip them. They basically talk about what the unique strengths of the tool are. I think there are just two parts I will highlight. One is the language coverage, because we take a very different approach. A lot of these EDA companies start by using their own custom compilers to do that. I think that's actually the wrong approach. You want to leverage all the research in the compiler community; you take that and build things on top of it.
The second thing is obviously the set of optimizations we are doing, with scheduling, with resource binding, with how to optimize communication and computation. This fine-grained power optimization I still think is a very important part of what differentiates it from a lot of other tools. So I want to say a little bit about one thing it enables in (inaudible) computing. So you have seen these slides many times. Clearly power is the barrier. So the industry's current approach is parallelization. So I stop the frequency scaling and I will have more cores at a lower frequency. And the question is how do we go from there. Because you can view these as very generic systems with multiple cores, or you can actually have a large system with many of these compute units, either in a cabinet, in a cluster, whatever. So what we are saying at UCLA is yeah, that's one way to go, but we think there's a lot more opportunity to gain power efficiency by customization. So we think that could be the next trend after parallelization. The reason I'm thinking that way is that I think we actually have sufficient computing power for most of the applications if you boil it down. I mean this presentation, MSN Messenger, email, you really don't need to run at very high frequency, very high performance. However, if you boil it down to every individual customer or every single enterprise, they do have one or two, a handful, of computing-intensive applications that need a lot of performance. So I think the natural thing to think about is whether we have a way to speed up those applications for that particular user or particular enterprise, instead of just raising the level of general-purpose computing. So the computing-hungry class can be accelerated, for example, using dedicated hardware. You look at Sun: in their latest announcement they already include some encryption engines, right, customized hardware for that.
You can have customized processors. We actually did research at UCLA on what are called ASIPs, application-specific instruction-set processors. And you can use a GPU. A GPU is not very power friendly, but it can give you a lot of performance. So we in particular are looking at the possibility of FPGAs, because they are programmable, right; they give you total customization of the implementation. So this actually has a lot of promise. If you look at the literature, there's a conference dedicated to this area called FCCM. Field -- what does it stand for? I forgot the acronym, but it's basically customized computing, field customized computing machines or something. But there are two kinds of barriers. I actually directed an effort even 12 years ago at UCLA, but I gave it up very quickly. First, at that time the FPGA sat on the peripheral bus of a (inaudible) workstation. When it goes through the SBus to get to the processor and get the data back, it's like taking years, so any gain you got from the FPGA is useless, right? The second thing is, if you look at all these papers, they have very impressive speedups on several numerical computations, (inaudible). But all of those were written by lower-paid graduate students using VHDL. Real programmers will not program in Verilog; you have to actually get to C and C++. So I think both barriers are kind of disappearing in a way. One is on the communication side; there's a lot of exciting chip-to-chip communication. So for example this is the HyperTransport bus that AMD enabled, which allows you to go between a processor and co-processors very efficiently, and Intel also opened up their front-side bus, so you can actually have very high speed communication. Even PCI Express has actually quite decent bandwidth for connecting processors with a co-processor. So that barrier has actually become very small.
So the other one -- for example, this is the unit we have at UCLA, and I will show you some applications we've developed on it. Here you have the CPU, this is an Opteron, which can have its own DDR memory, and this was originally intended for Opteron systems, right, so you can replace one of them with an FPGA and they will be communicating over the HyperTransport bus at very high speed. The second thing is that with the kind of compilation technology I talked about, you really can go to a hardware implementation very efficiently. So this is one of the famous algorithms, called the Black-Scholes algorithm, for stock option pricing. And when we were working with this financial institution, at the beginning they were skeptical. They said, I will not give you any proprietary data. Why don't you just go to this Website, download this piece of code, and tell me the result. So it was obviously not written for synthesis. But it turned out the only thing we needed to rewrite was this power function, and right now you don't even need to rewrite that. Now the tool supports x to the power p directly. You can take that into the FPGA. Really there's no modification of that. You can get an implementation on the FPGA. This thing runs at only a hundred megahertz roughly, because the clock here is 10 nanoseconds. It's basically a customized machine going through 400-some clock cycles, and every 12 clock cycles you can pump in data. So we get about a 30X speedup over the implementation on the AMD machine. And not only that, look at the power implication. On the FPGA running at 100 megahertz you consume about six watts of power; if you run it on the AMD processor, it takes 68 watts of power, so there's about a 10X difference there. If you really wanted to bridge the gap by having 10 Opteron workstations, whatever, CPUs, you are talking about a difference of almost hundreds of X in power.
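For readers unfamiliar with the kernel being accelerated, here is a minimal Python sketch of the standard closed-form Black-Scholes price of a European call option. The talk does not show the code that was downloaded and synthesized, so this is only the textbook formula, not that program; the function and parameter names are chosen for illustration.

```python
import math


def norm_cdf(x):
    """Standard normal cumulative distribution function via math.erf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def black_scholes_call(spot, strike, rate, vol, years):
    """Textbook Black-Scholes price of a European call (no dividends).

    spot: current asset price, strike: exercise price,
    rate: continuously compounded risk-free rate,
    vol: annualized volatility, years: time to expiry.
    """
    sqrt_t = math.sqrt(years)
    d1 = (math.log(spot / strike) + (rate + 0.5 * vol * vol) * years) / (vol * sqrt_t)
    d2 = d1 - vol * sqrt_t
    return spot * norm_cdf(d1) - strike * math.exp(-rate * years) * norm_cdf(d2)


price = black_scholes_call(spot=100.0, strike=100.0, rate=0.05, vol=0.2, years=1.0)
print(round(price, 4))
```

The body is a short, fixed chain of floating-point operations (log, exp, square root, multiplies) with no data-dependent control flow, which is the sort of code that maps naturally onto a deep hardware pipeline fed with a new option every few cycles.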
So that's where we see the potential of what we call customized computing. If that's the primary interest of this client or this enterprise, maybe we don't have to build very powerful general-purpose computing for that; we can actually accelerate a lot of applications with co-processors. Yes? >>: Is that FPGA do (inaudible) IEEE (inaudible)? >> Jason Cong: Yes. I can show you. This is what we did. In fact, again, this is the effort done at the AutoESL startup company. So what they did was they actually added full support of IEEE 754 floating point. So this is even better than a GPU because it supports both single precision and double precision. And what we also added is bit-accurate fixed-point arithmetic. So this is not only a specification at the input; the program will automatically propagate all the bit-width information to the intermediate variables, so you can get (inaudible) efficient hardware. And then the good thing is that for them, the investment bank, this is music, because they really don't want to learn anything about VHDL or Verilog, they have no interest in anything about the FPGA, and the tool actually supports C++. So he gives a very glowing call and he says okay, it does whatever. By the way, there are a number of companies that claim they can do C to RTL, but you should ask some detailed questions. For example, half of the companies, more than half, will say we do not support structures. So that's actually a show stopper right away. A lot of them -- I mean then, (inaudible) cannot support array structures, structures of arrays. By the way, there are some fundamental limitations, things you cannot do. For example, you cannot support dynamic memory allocation, because you cannot generate transistors on the fly, right, and you cannot support unbounded recursion, because everything is a finite resource. So those are considered to be fundamental limitations; you have to rewrite the code somewhat.
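To give a feel for what propagating bit-width information to intermediate variables means, here is a minimal Python sketch using the usual worst-case rules for unsigned integers. The talk does not describe the tool's actual propagation rules, so the two rules and the function names below are assumptions for illustration only.

```python
def add_width(a_bits, b_bits):
    """Bits needed to hold the sum of two unsigned values without overflow."""
    return max(a_bits, b_bits) + 1


def mul_width(a_bits, b_bits):
    """Bits needed to hold the product of two unsigned values."""
    return a_bits + b_bits


def fits(value, bits):
    """True if a non-negative integer is representable in the given width."""
    return value.bit_length() <= bits


# Example: y = a * b + c with 8-bit a and b and a 12-bit c.
product_bits = mul_width(8, 8)        # width of the intermediate a * b
y_bits = add_width(product_bits, 12)  # width of the final sum
print(product_bits, y_bits)

# Sanity check of the rules against the largest possible operands.
max_a, max_b, max_c = 2**8 - 1, 2**8 - 1, 2**12 - 1
print(fits(max_a * max_b, product_bits), fits(max_a * max_b + max_c, y_bits))
```

Sizing each intermediate to exactly what it needs, rather than to a fixed 32-bit `int`, is where the hardware savings come from; a real tool may tighten these worst-case bounds further using value-range analysis.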
But all the others are really kind of artificial constraints that a lot of the tools impose; we don't have to face those problems. And at UCLA, I do not know too much about financial applications. When I saw this, I said this is very interesting, so here is a particular application I'm familiar with, I mean in the hardware design area. It turns out that one very time-consuming task today is called lithography simulation. We use a 193 nanometer laser to produce 45 nanometer transistors, right? So it's a miracle, because you actually use a very wide brush to paint very fine features, right? The small font basically can do that. But it's through a lot of numerical techniques. So in order to do that, you have to solve the so-called Hopkins equation, and this can take hours or days. So I asked a student to do an implementation. Originally, before the tool was available, the research started in the early part of last year. So he spent about four months doing RTL coding. He sort of got the system working. But then a good part is that towards last summer, when the tool was available, he was able to try the tool going just from C to RTL. He did it in about a week or two. So this is the core part of the loop you have to deal with. So just from this project we also gained a lot of experience, because there was a question asked earlier about what kind of C you can take. So his initial C code can pass the tool without any problem and get the RTL. But it did not get very good performance. It turned out you have several loops, right: you loop over the different rectangles, the layout rectangles. You also loop through the different kernels; this is from kernel decomposition in optical theory. And finally you loop over the different pixels, right? This is a very natural way to write it mathematically, but it's not a very good way to implement in hardware.
That is also not a good way to implement it on multicore CPUs, because you do not consider the data locality, right? So what you want is for the kernels to be reused as much as possible, because that just affects your precision. You use just one kernel to do all the computation and then come back to do another kernel. So that actually is easy to do. You can change the loop ordering so that you move the kernel loop to the outermost iteration, and within that you have to do some analysis: for every rectangle do I process all the pixels, or for each pixel do I consider all the rectangles, right? So that kind of code rewriting is still the human, creative part. So it's not saying that once you have the tool you can just write any C code and try the implementation, and we don't need designers. We still need good designers who can think analytically or strategically about where you can make use of the tool to get the best result. So he actually did that, and eventually we got about a 15X speedup, again over the Opteron workstation. So this is about 15 watts we consume compared to 86 watts. So you get about 6X times 15, so that's about a 100X gain in power-performance efficiency. It's also a fairly successful example we have there. So some of these data points really make you think there's a lot to be gained by this concept of domain-specific computing, right, so you can actually create customized computers, customized machines with much higher performance but a much lower power profile for specialized applications or specialized application domains. So I think that's the material I wanted to cover. I'll be happy to use the rest of the time for some discussion. Very quickly, I just wanted to show you that there's an ongoing effort at UCLA on processor-based design optimization. First is that we can create so-called application-specific instruction-set processors.
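The loop reordering described for the lithography kernel can be sketched in Python as follows. This is a toy stand-in, not the student's code: the integer "contribution" formula, the data values, and the idea of counting a kernel load each time the active kernel changes are all assumptions used to show that the two loop orders compute the same image while touching the kernels very differently.

```python
def accumulate(rects, kernels, num_pixels, kernel_outermost):
    """Accumulate a toy image over rectangles, kernels, and pixels.

    Returns (image, kernel_loads), where kernel_loads counts how often a
    different kernel has to be brought in, a rough proxy for data reuse.
    """
    if kernel_outermost:
        # Reordered version: finish all work for one kernel before the next.
        order = ((k, r) for k in range(len(kernels)) for r in range(len(rects)))
    else:
        # "Mathematically natural" version: rectangles outermost.
        order = ((k, r) for r in range(len(rects)) for k in range(len(kernels)))

    image = [0] * num_pixels
    kernel_loads = 0
    active_kernel = None
    for k, r in order:
        if k != active_kernel:
            kernel_loads += 1
            active_kernel = k
        for p in range(num_pixels):
            # Placeholder integer contribution so results compare exactly.
            image[p] += kernels[k] * rects[r] * (p + 1)
    return image, kernel_loads


rects = [3, 5, 7]   # hypothetical per-rectangle weights
kernels = [2, 4]    # hypothetical per-kernel weights

naive_image, naive_loads = accumulate(rects, kernels, 4, kernel_outermost=False)
tuned_image, tuned_loads = accumulate(rects, kernels, 4, kernel_outermost=True)
print(naive_image == tuned_image, naive_loads, tuned_loads)
```

Because the accumulation is a plain sum, interchanging the loops is legal, and with the kernel loop outermost each kernel is brought in once instead of once per rectangle.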
It's the same thing: with the FPGA, with some kind of programmable logic, we can generate customized instructions. So these customized instructions can implement a sequence of operations that you execute very commonly, right. So we can identify those and we can create the pattern libraries. Finally we can generate the new programs you need to have. So this is what we call the pattern generation and the pattern selection, that is, among these, which set of patterns should be mapped to specialized instructions. And finally you can generate the new program to do that. So not only that, we are looking at the possibility of synthesizing the entire design using this kind of methodology into basically an application-specific processor network. So the difference here is that you definitely have the customized hardware logic using xPilot, AutoPilot. But a lot of the others can just be specialized processors: processors with different data widths, with different instructions, with different register file and cache sizes. So my view is, if you look at the complexity of the design, you can easily put in 1,000 to 2,000 reasonably sized processors. Remember that the Core Duo is a billion-transistor design. I remember when I entered graduate school the 386 came out. That was a big achievement. That's a one million transistor design. So you can put 1,000 386s there without any problem. So my view is that the processors are just the future standard cells, the building blocks, and the question is, can you synthesize your design into that very efficiently? So we actually had very nice progress just recently. We can synthesize an application into a homogeneous set of processors. This is just one illustration. These are the task graphs you have, and you can do it the naive way, so you have one processor for each one of the tasks, you have a 10 nanosecond stage delay, and five stages. So this is one implementation.
So these two can actually be one processor, because I have 10 nanoseconds, so it's okay to map them to one processor, actually two of them. So in this case I get a four-stage pipeline, right? So finally we can actually come up with an even more clever implementation where I get three stages and still satisfy the 10 nanosecond pipeline stage delay. So the idea is that if you think about the processors as your building blocks, you have to think about how you map computation to a network of processors very efficiently. So that's the ongoing research we are doing at UCLA. So this paper was published in FPGA '07. So we're looking at extensions in this direction. So this is another example, where we start looking at a heterogeneous system. For example, part of it can be in MicroBlaze, that's the Xilinx embedded processor, but part of it can go into customized hardware. You can see a significant performance gain from doing this trade-off. But I think in today's designs what is more important is the power gain. Typically when you go from a processor-based implementation to a customized-hardware-based implementation you can gain power efficiency by somewhere between 100X and 1,000X. So that's actually what will make a difference. So I think the conclusion I want to share with everyone is that the behavior synthesis technology today is mature enough to handle a lot of applications. The improvement comes from several areas: from doing platform-based modeling, from looking at the communication and the interconnect optimization simultaneously, and from a lot of these advanced algorithms we developed, which also help to improve the quality of results. I think at the end we can actually get RTL code very comparable to that of human designers. Hopefully we'll be better. As I said, in terms of power, routability, some of these very late metrics, the RTL designer has a hard time dealing with them.
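The idea of merging tasks onto fewer processors while respecting a stage delay can be illustrated with a small Python sketch. This is not the UCLA mapping algorithm: it only handles a simple linear chain of tasks with a greedy rule, and the task times below are hypothetical values chosen for illustration.

```python
def pack_chain(task_ns, stage_budget_ns):
    """Greedily group consecutive tasks of a linear pipeline into stages.

    Each stage runs on one processor, and the tasks assigned to it must
    finish within stage_budget_ns so the throughput target is still met.
    Returns a list of stages, each a list of task times.
    """
    stages = []
    current, used = [], 0
    for t in task_ns:
        if t > stage_budget_ns:
            raise ValueError("a single task exceeds the stage budget")
        if used + t > stage_budget_ns:
            # The current processor is full; start a new stage.
            stages.append(current)
            current, used = [], 0
        current.append(t)
        used += t
    if current:
        stages.append(current)
    return stages


# Hypothetical execution times (ns) for seven tasks in a streaming chain.
tasks = [4, 5, 3, 6, 2, 7, 3]
stages = pack_chain(tasks, stage_budget_ns=10)
print(len(tasks), "tasks ->", len(stages), "processors:", stages)
```

The naive mapping would use one processor per task; grouping tasks whose combined time still fits in the 10 ns stage delay keeps the throughput while using fewer processors and fewer pipeline stages. A general task graph, as in the talk, needs a more careful formulation than this chain-only greedy pass.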
We are also looking beyond this, at synthesis at the system level, where hopefully we can make use of both processors and customized hardware to get a much more efficient implementation. So that concludes my talk. (Applause) >> Darko Kirovski: Questions? >>: So the last thing you mentioned, about the processors and the nice graph showing tasks going from one processor to another, can you describe the kind of applications that you are basically trying to get into this paradigm? >> Jason Cong: Yeah. So here, I guess you are referring to this example. So this could be a task graph, like if you have a stream type of application like video encoding, MPEG-4; I think motion JPEG was the application we used. You can typically decompose your design into a number of modules. And it's stream processing, because data will keep coming in, and it will be implemented in a pipelined fashion, so typically you have a stage delay that determines the throughput, right. So that's given as a constraint. The question is, given that throughput, how many processors do I need to have, and what's the minimum number of processors? What's the minimum stage latency, how many stages? I'm just showing you that this was one implementation, a naive implementation. You need, what, seven processors, five stages. And if you are a bit more clever, in this case it is five processors, four stages, and eventually you can do it in four processors, right, three stages. So that's the kind of trade-off we're looking at. >>: Designing tools are not exactly great in that respect, like routing clocks (inaudible), so your numbers are taking the tools as they are, correct? So actually (inaudible) you'll be getting a lot better than 300, right. >> Jason Cong: Sure. >>: Do you have an estimate on that? >> Jason Cong: So the FPGA, to be honest, is not very efficient for power in general, right; that was not their effort.
But actually we did a lot of work at UCLA just on FPGA architecture design. We published in a sequence of workshops. In fact, you can introduce the concept of multi-Vdd FPGAs, right, and you can also have regional clock gating, power gating, all of those. So I believe they are following that very carefully. We actually have a very good relationship with both companies. They have funded UCLA research for about 16 years now, from the early 1990s when they were very small. So they are taking a lot of it. But in the current commercial implementations, none of them has that capability. So really the best you can do is to optimize some transition activity, right, and also to improve some data and memory locality in doing that. So I absolutely agree with you, there's room for improvement. Clock gating you can implement on an FPGA fairly easily, because they do have flip-flops with enable signals. So it's the same concept as when I say fine-grained optimization. If you cannot shut off the power, at least you can shut off the clock, right, so that will help the dynamic power. And that's actually fairly easy to do. I believe the tool does that already; it can actually just generate the clock gating signals. >>: Have you tried and seen like one of those results -- >> Jason Cong: Yeah. I don't have the slides, but they have slides. I think it was quite dramatic; it was a 20, 30 percent reduction in power. >>: I have a couple of questions here. So when (inaudible). >> Jason Cong: When you go to the FPGA to debug, we do not provide specific tools. But I think there's a very nice tool called ChipScope. I don't know how many of you have used that. It can actually capture basically all the runtime information, and you can read it out; you can actually look at the signal values. >>: (Inaudible) I assume that you have some tools that (inaudible) right? >> Jason Cong: Yes. So the -- >>: (Inaudible). >> Jason Cong: (Inaudible) yes.
So in general it's not an easy problem, because remember I kept saying that we want to leverage all these advanced compilers, which is very good for the quality of results, but it also messes up the correspondence, because they do a lot of transformations. So what we are thinking is that we can give you an option: you can compile using, let's say, a -g kind of option, so I do less optimization, which allows you to see a better correspondence. And then once you get it correct you can go back and do more aggressive optimization. >>: One more question. This is more of a vision question. So there's a lot of speculation (inaudible) thinking there will be a hundred thousand CPUs on a single chip, et cetera. There was a paper from a couple of years ago where they looked at the (inaudible) that you're going to have on the surface of the chip. And the conclusion, from what I recall, was that if all CPUs are under stress, the thermal power and just the general assumptions are such that you stop getting any gains from the multicore factor beyond, I think the result was around (inaudible). So where do you see -- mainly because of the leakage, right. So what do you see here, what is your vision there? Do you think that it's going to be so easy to get around the leakage, you know? I mean, I know that's one of the major problems right now. >> Jason Cong: Yeah. It's not easy to do that. In fact -- where was I? I was just thinking that that's where customization will be very useful, right? And also you can see that the type of customization we think about is that each processor can be customized in terms of bit width versus (inaudible), clock frequency, supply voltage. Maybe you can have all these cores; they don't have to be at the same clock rate, they don't have to be at the same clock frequency, right? And then there's the instruction set and the degree of reliability. So there are a lot of parameters. And what you also want to do is maybe very fine-grained power control.
But even though you are doing the computation, maybe not every cycle will be active. So you can do a lot more. So that probably can extend your -- always there will be some limitations in the power thermal constraint. >>: So do you see like a thousand core chips (inaudible). >> Jason Cong: As I said, in terms of complexity there's no difficulty to put in a thousand-core chip, right. Where is the bottleneck? In fact one is the communication I see and the other one is the power thermal behavior. So in fact some of the work I mentioned early on about RF was also to address the communication bottleneck. >>: (Inaudible). >> Jason Cong: (Inaudible) very power efficient because we are transmitting waves instead of charging, discharging the RS feed. >> Darko Kirovski: Any other questions? Yes. >>: (Inaudible) how do they figure out that the (inaudible) is how do they monitor that better? From doing what it's supposed to do, (inaudible). >> Jason Cong: So how do you know that FPGA -- so you can do, they have actually test patterns you can use, maybe at the power-up time you can just, the same thing as you test memories, you can test all the logic and interconnect quickly and then you can start uploading your application into that. >>: (Inaudible). >> Jason Cong: I think you can get -- as a research topic we even looked at the possibility that you can have some small logic just keep testing your entire at the one time, but I do not know whether they provide that or not, but that's some of the ideas we have. Yes? >>: I would just point the first (inaudible) exactly that, they did not have an (inaudible) they had a test (inaudible) pretty unreliable. >> Jason Cong: It's sort of tricky. Basically we're thinking that you have to be able to migrate some logic, right, so you want to test this part of the logic here and then I can test that. So we wrote some papers on this type. I mean, I'm not sure -- it's not really in kind of any real system here. 
In fact some of the discussion was a (inaudible). >>: (Inaudible). >> Jason Cong: I'm sorry? >>: The 3D IC design, actually tubes coming out (inaudible). >> Jason Cong: We did a couple of tape-outs. The first one, we have limited access to the foundry, so the foundry we can use is (inaudible) lab. This is also funded by the (inaudible). At UCLA we did one chip that was using that interesting concept called capacitive coupling, so actually you don't need to have a hole that -- >>: (Inaudible). >> Jason Cong: (Inaudible). Actually that was IFCC paper last year, was very high frequency at gigahertz range. So it was actually a successful tape-out. We did a second tape-out where we have two microprocessors, a cache, so that one, we got the chip back but it had I/O problems in the design. So we haven't been able to kind of a (inaudible) chip yet. But what I'm saying is that you can certainly do that today. By the way the key part of a (inaudible) design is using thermal (inaudible) because it can be very hot, right. We did a calculation, if we don't do anything it is very difficult to take the heat off. So first you can do optimization, which active layer should be close, which one should be further away. But not only that, you also need to turn out this (inaudible) process is SOI process. Thermal friendly. So we have to insert these vias. The via is the connection between different layers. That's very good for thermal conduction. When you don't have enough of those we'll put in actually thermal vias ourselves. That's part of the optimization. >>: (Inaudible). Enormous amount of capacity (inaudible). >> Jason Cong: So we are just putting, actually we are doing nothing. >>: Oh, I see. >> Jason Cong: So that's why power down optimization will do. There's some interesting studies we can show. You can do that either during placement or after placement. >>: (Inaudible). >> Jason Cong: (Inaudible) it's not that bad. 
We actually have papers, we have data to show you that it's still reasonable and the amount is manageable. >> Darko Kirovski: Any more questions? (Applause)
