>> Doug Burger: Good morning, I'm Doug Burger. I'll pass on the opportunity to say anything inappropriate given there's only a few people in the room, but there may be people watching online. So it's my pleasure today to introduce Hadi Esmaeilzadeh. How was my pronunciation? >> Hadi Esmaeilzadeh: That's perfect. >> Doug Burger: Okay, thank you. I try hard. Hadi is a PhD student at the University of Washington who started with me at the University of Texas, moved to Washington, and is now working with myself, Luis Ceze, and several of the other graduate students in Luis' group. He has had an amazing track record of output and top-quality research over the past couple of years, with several award papers and work mentioned on the front page of the New York Times. And he now has an incredibly interesting result in a new area that we believe is going to be very important. And it's new. It could be profoundly important; it could be completely irrelevant. We just don't know, but it's very exciting. So thank you for spending your summer here with us doing your internship.... >> Hadi Esmaeilzadeh: Thank you [inaudible]. >> Doug Burger: You've gotten a lot done, and I'm really looking forward to seeing the results of all your hard work. >> Hadi Esmaeilzadeh: Okay. Good morning. Today I'm going to talk about a new class of accelerators that use machine learning to accelerate approximate general-purpose programs. This is a project done in collaboration between the University of Washington, Microsoft Research, and the University of Texas at Austin; I have collaborators from all three institutions. Parts of this talk are going to appear in the next International Symposium on Microarchitecture. So people have been using accelerators, and there are different kinds of accelerators, like GPUs and FPGAs, each of which takes advantage of certain characteristics of the program to accelerate it. FPGAs can accelerate applications that have an abundance of parallelism, but they are [inaudible]. And GPUs are good for applications with a lot of parallelism, but they don't work well with applications that have very divergent control flow. And there have been recent proposals in the literature for different accelerators that augment the processor and try to, let's say, synthesize parts of the program onto hardware or a user-configurable fabric to accelerate the program. What we are looking at here is this: we want to take advantage of a characteristic of many applications in diverse domains, which is that full precision is not always required. Either there is no single certain answer for the application, or [inaudible] at the output. The best example is graphics, and we see it in JPEG compression: it's tolerable, and human perception plays a role in the process. So... >> : I have a question. >> Hadi Esmaeilzadeh: Sure. >> : Is it your belief that the opportunities for approximate computing and applications have increased? >> Hadi Esmaeilzadeh: I think because machine learning is taking over and we are facing the era of big data, we need to do a lot of computation on vast amounts of data. The... >> : Most of your applications are drawn from big data really. >> Hadi Esmaeilzadeh: Search is one of them, if you look at data mining. >> : No, I'm talking about in your paper. >> Hadi Esmaeilzadeh: Right.
>> : But so, I guess the root of my question is: has there been some trend that is making approximate computing more possible, or is it the case that we could always have done it? For example, HPC workloads in numerical simulations with, you know, convergent algorithms, gradient-descent-type approaches — and we've only just realized this now that our backs are against the wall. I mean, has it always been there, or has it grown because of the emergence of some of these areas? >> Hadi Esmaeilzadeh: So there is another trend that is emerging. Maybe it's because of the challenges that we face in silicon. >> : That's not driving the applications. >> Hadi Esmaeilzadeh: Right. That's, you know, the energy efficiency.... >> : I mean [inaudible] in what applications we can do and drive them that way, but that's not what's driving a lot of these. Big data is a buzzword. Yes, it's important, but --. >> Hadi Esmaeilzadeh: Right. >> : So do you think something has changed? >> Hadi Esmaeilzadeh: In the application domain? >> : Are there a few obviously identifiable trends? I mean, machine learning is clearly one. >> Hadi Esmaeilzadeh: Machine learning... >> : [Inaudible] popular but... >> Hadi Esmaeilzadeh: ...is one of the things. And then another trend that we can maybe look at is doing computation with a limited amount of energy, and then trading off compute power -- sorry -- trading off battery or harvested energy for the quality of results that [inaudible]... >> : That's an opportunity, but that's not what's driving a lot of these emerging applications today. Vision is emerging despite energy limits, not because of them. >> Hadi Esmaeilzadeh: Right. >> : Okay. Let's put that on the queue. >> : So isn't it that, algorithmically, things like approximate algorithms are gaining in their acceptance and use? >> Hadi Esmaeilzadeh: That might be the case. That might be the case. Because a lot of, let's say, recognition or even gaming applications, at the algorithmic level they don't require perfect, you know --. >> : Right. And those are used to help reduce overall computation time or... >> Hadi Esmaeilzadeh: That's true. >> : ...[inaudible] various properties about... >> Hadi Esmaeilzadeh: That's true. >> : ...convergence or things like that. >> Hadi Esmaeilzadeh: Yeah. But I'm not sure if that answers Doug's question. >> : I just wonder if it's that, you know, if you're summing up your checkbook you'd want it to be precise, but whenever you're interacting with the real world, the real world doesn't have a precise digital representation. And so there's necessarily going to be fuzziness. So as we do more things that interact with the world — vision and augmented reality, trying to extract trends from the web — they're all in some sense approximable because they don't have a precise answer. >> Hadi Esmaeilzadeh: Yes. >> : You could also go and say that there's an acceptance of doing that. So ten years ago we were working on compression of mass images and the answer was, yeah, you could do it, it doesn't hurt the [inaudible], but they wouldn't accept it because, ooh, it's not exact and, therefore, it's wrong. And I don't know that that's necessarily the case anymore. >> : Well, they're older now and their vision's gotten worse, so they can't [inaudible]. [ Laughter ] >> : Yeah. >> Hadi Esmaeilzadeh: Okay.
So if we look at the design space of processors, we always trade off power and performance and move over different points in this design space. Each point in this design space is a processor, and the line shows the Pareto frontier. What we are doing is exploring a new dimension, which is approximation. We want to compromise a little bit of accuracy and hopefully get better performance and lower power by compromising on the amount of accuracy. And for this we are proposing a new class of accelerators that need to be trained rather than programmed. What I mean by a trainable accelerator is this: if you look at an imperative code, you pick a target region that is hot — the application spends a lot of time in it. Then you put a learning algorithm beside it and run the application. The learning algorithm will observe the inputs and the outputs of that target code. And after it reaches a certain level of accuracy, you replace that target region with a call to that machine learning algorithm. So we call that algorithmic transformation a Parrot transformation — it mimics the target region of code. That allows us to run the program on the processor core while accelerating the machine learning algorithm on dedicated hardware. Since this replaces the target region, we are indirectly accelerating the whole application by doing this transformation. Sure. >> : Have you learned the regions where you shouldn't apply this, so you can drop back to the old code for those situations? >> Hadi Esmaeilzadeh: So we have safety criteria. Let's say you are doing JPEG, right? When you are writing the header of the JPEG, it should be precise, because if you mess up even one bit, the image is not readable. But when you are processing the pixels and the bits, then you can do approximation in that code. So what we require from the programmer is to annotate regions of code that are approximable, without being conscious of which approximation technique is going to be used. Just tell me which regions are safe to approximate — safe to have a little bit of inaccuracy and error at the output. >> : So using your JPEG example, do you have a notion of a quality factor? Like in JPEG compression, I can set a quality factor on the compression. Or are you so small in your approximation that it's just a little bit of noise around this coarser grain? >> Hadi Esmaeilzadeh: That's an interesting point, but right now I don't have it for this technique. We have other techniques that use different approximation methods where we can actually tune the approximation. But not for this one. >> : [Inaudible]. There's a whole body of work on synthesis where, you know, programming by example is becoming kind of in vogue, right? And I'm really struck by your title here, "Acceleration Should Be Trained not Programmed." Using some of the ideas from [inaudible] on synthesis, you can think about programming in general, for a large class of programs, as being trained not programmed, right? You specify very simple, poor-quality algorithms and then train whole bodies of work, and it's not really necessarily about accelerators. Your accelerator here is just a general... >> Hadi Esmaeilzadeh: Sure. >> : ...[inaudible]. Right? >> Hadi Esmaeilzadeh: Sure. Sure. >> : So I think the generality here is great. >> Hadi Esmaeilzadeh: That's actually very interesting. >> : So how do you --. Sorry [inaudible]. This is really interesting.
So I think when we get a little bit deeper we do have a non-CPU accelerator. But how do you tie program synthesis to this training that you can use to make things more efficient? How are you doing this algorithmic transformation? Is there a tie there, or are they just intellectually interesting? >> : No, I think there is a tie. I mean... >> : Yeah, I'm trying to understand that. >> : ...you can think about synthesis as training, right? So the programming-by-example work — that's much like the learning problem you have here. >> : Yep. >> : And in the same vein, those examples are used by the synthesis engine to come up with a huge -- You know, you can think about synthesizing a large number of possible solutions and picking the one that balances approximation versus power. >> : I see. I see. So it's more like we can think about the ones that are most amenable to this transformation. >> : That's right. >> : Okay. >> : But for general-purpose code, right? >> : Right. Right, right. >> : Are you going to remember that? >> Hadi Esmaeilzadeh: Yes. >> : Okay. >> Hadi Esmaeilzadeh: Can you... >> : I'll take it down. >> Hadi Esmaeilzadeh: ...actually send me the links and... >> : Yes. >> Hadi Esmaeilzadeh: ...you know, papers [inaudible]. >> : I'll talk to you afterwards, because I know we're actually working on a similar problem for synthesis in this example. >> Hadi Esmaeilzadeh: Oh, okay. [ Multiple audience comments simultaneously ] >> : But your talk's already a success. [ Laughter ] >> Hadi Esmaeilzadeh: So to do the Parrot transformation -- So this has been an internal debate between me and Doug about the naming of the Parrot transformation. So if you have a better suggestion, I'm willing to hear it. >> : So who likes Parrot and who doesn't like Parrot? >> Hadi Esmaeilzadeh: I like the Parrot. He doesn't like the Parrot. [ Multiple audience comments simultaneously ] >> : Yeah, I like Parrot. [ Laughter ] >> Doug Burger: Screw you guys. [ Laughter ] >> Doug Burger: [Inaudible]. >> Hadi Esmaeilzadeh: So to realize the Parrot transformation we need to come up with a learning algorithm that can learn regions of imperative code. Then we have to come up with a programming model that lets the programmer think about this transformation and annotate the code for us. Then we need a compilation workflow that actually carries out this transformation. And then we need to think about the implementation of the learning algorithm. As I'm going to show you — and I think I gave it away with the title — we empirically found that neural networks can mimic parts of imperative code. And the good thing about neural networks is that we know how to implement them very efficiently in hardware, and they are inherently parallel, so they're a good target for the acceleration at the end. I'm going to talk about the different opportunities that using neural networks provides us. But in general, I think potentially any regression mechanism can be used for the Parrot transformation. I don't know if it would be beneficial, because I haven't tried them, but potentially you can regress any region of code with, say, support vector machines or logistic regression or any of these methods.
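To make the transformation concrete, here is a minimal sketch of its shape on a toy region, in C. The pragma syntax, the names, and the 4-4-1 topology are illustrative assumptions only; the talk's actual annotation interface is pragma-based, but its syntax isn't shown, and the real topology and weights come out of the training and search step described next.

```c
#include <math.h>

/* Toy candidate region: pure (no side effects), hot, with fixed-size
   inputs and one output. The hypothetical pragma only marks it as safe
   to approximate; it says nothing about *how* it will be approximated. */
#pragma parrot(approximable)
float region(const float in[4])
{
    return sqrtf(in[0] * in[0] + in[1] * in[1]) * cosf(in[2]) + in[3];
}

/* What the transformation substitutes: a small trained multilayer
   perceptron (here 4->4->1; the real topology and weights come out of
   the Parrot generator's training and search step). */
static float w_h[4][5]; /* hidden-layer weights; last column is the bias */
static float w_o[5];    /* output weights; last entry is the bias        */

static float sigmoid(float x) { return 1.0f / (1.0f + expf(-x)); }

float region_parrot(const float in[4])
{
    float h[4];
    for (int j = 0; j < 4; j++) {
        float s = w_h[j][4];                  /* bias */
        for (int i = 0; i < 4; i++)
            s += w_h[j][i] * in[i];
        h[j] = sigmoid(s);
    }
    float y = w_o[4];                         /* bias */
    for (int j = 0; j < 4; j++)
        y += w_o[j] * h[j];
    return y; /* linear output unit; the activation choice is part of the search */
}
```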
First, let's talk about the programming model. The developer should be involved because of safety: as I told you guys before, this is important because the programmer has the only view of which parts of the algorithm are safe to approximate. So the programmer needs to annotate the source code. But which kind of code is a good target for the Parrot transformation? Aside from being approximable, it needs to be hot code, because we are bound by Amdahl's Law. And since we are limited by the topology of the neural network when we do the Parrot transformation, that region needs to have fixed-size inputs and outputs that I, as a compiler, can identify statically at compilation time. So let's look at an example. Here I'm showing a whole edge-detection algorithm. What you do is take an image, convert it to grayscale, and then slide a window over it and do the convolution operation. The convolution estimates the gradient at one pixel with respect to its surrounding pixels, so it's inherently approximate. So the programmer is going to [inaudible] that this is approximable. The programmer is not aware that I am going to use a neural network — just that this is approximable. And this is good code for us because it has a fixed number of inputs and just one output. So I'm going to do the Parrot transformation in three steps. The first step is that I take the annotated source code, put probes on the inputs and outputs of the candidate function, run it with a certain amount of input data, and collect the training data, which is the inputs and outputs of that candidate. Then I feed that training data to the Parrot generator, which is going to explore the topology space of neural networks — we have a search space of neural networks, and we try to find one that fits. The Parrot generator gives us the topology and a set of weights. And then we take that annotated source code again, take the neural network, and replace the calls to the candidate function with calls to the neural network. And if we have special hardware, right, we are just invoking the special hardware behind the scenes instead of running the original function. So in our example, our Parrot generator has found a 9-to-8-to-1 topology — these are the neurons, these are the inputs, and that's the output neuron — that can mimic that region. So what happens here is that instead of calling the original function, I'm sending the inputs to the neural network and receiving the output from it instead. So you can run this on an image — the neural network is trained on a different image. This is running with the original code; this is running with the Parrot-transformed code. There are slight differences — I don't know if you can see them here — but I think perceptually this passes. >> : Can I ask a question [inaudible]? >> Hadi Esmaeilzadeh: Sure. >> : I mean, I realize that you're looking at applications where there is this kind of perceptual aspect to it. But because you're just doing a transformation, right, on these apps, is there any way to reason about the error bounds? >> Hadi Esmaeilzadeh: Yeah, so for the neural networks I'm going to show you different applications and how we reasoned [inaudible]. So this is a good example for showing to the audience. But you can reason about the error of the neural network — the standard measure is MSE, the mean squared error — and then at the application level, depending on the application, you can define an error metric, like an average error or a root-mean-squared error at the application output. That gives you a mathematical framework to reason about the error in the application.
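A sketch of the kind of application-level metric being described — mean relative error between the precise and Parrot-transformed outputs over an unseen test set. This is hypothetical harness code, not the actual evaluation infrastructure:

```c
#include <math.h>
#include <stddef.h>

/* Run the precise and the Parrot-transformed versions on the same unseen
   test inputs, then compare the outputs. Mean relative error is the metric
   used for FFT and inverse kinematics; the image benchmarks use average
   pixel difference instead. */
float mean_relative_error(const float *precise, const float *approx, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += fabs((double)(approx[i] - precise[i])) /
               (fabs((double)precise[i]) + 1e-9);   /* guard divide-by-zero */
    return (float)(sum / (double)n);
}
```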
>> : But you couldn't do a max error? >> Hadi Esmaeilzadeh: You can define -- So it's your -- As a developer you can define any error metric that suits you. Right? You can do maximum error, and if it doesn't... >> : But you only do maximum error over the training set, not over any possible input set, right? I would assume. >> Hadi Esmaeilzadeh: So when we are doing this, we have inputs for training and we have unseen inputs for test and evaluation. You define the error metric over that test and evaluation data. >> : But you could -- If your data set never passes a negative number to a square root, you will never know that the code needs to see that and flag an error. >> Hadi Esmaeilzadeh: Right. >> : In either of the sets.... >> Hadi Esmaeilzadeh: Right, right. >> : But later on it's a different... >> Hadi Esmaeilzadeh: Right, right. So, similar to any application that uses learning, I can't bound the error and say that I mathematically guarantee the error is going to be less than some value. What I can do is say that that's going to happen infrequently, and hopefully, since your application is approximate, the final output is still going to be okay. But we can do one more step, right? We can put a predictor with the NPU which predicts whether the error is going to be too large or the input is unseen, and then decide whether I should execute the original code instead of running the NPU. That's one of the approaches we are thinking about; I haven't implemented it yet. >> : This also has to do with after the fact too, right? And throw an exception, right? And then... >> Hadi Esmaeilzadeh: Exactly. >> : And then use the original [inaudible]. >> Hadi Esmaeilzadeh: Exactly. Exactly. That's actually been done -- there is a work called Relax which does approximation for different regions of code, and they throw an exception and roll back if the error is more than a certain level. So another application is inverse kinematics. You have a two-joint arm, and you have the X's and Y's, and you want to figure out the angles of the joints. If I can play this, that would be awesome. Okay. The circle is the original application, the original code, but the arm is moving with the Parrot-transformed code. So we applied this to six different applications: FFT from signal processing; inverse kinematics from robotics — I didn't do the three-joint arm, because the three-joint arm doesn't have a closed-form solution and the [inaudible] are huge, because you can just learn where the arm goes. We did it for part of the jmeint kernel from a game engine, which figures out whether two triangles in 3D are intersecting or not. We did it for parts of JPEG. We used it for K-Means and the Sobel filter. For K-Means, to make it easier to understand the errors, I used K-Means for image segmentation. And these are the neural network MSE errors, and these are the application-level errors that we see. For FFT... >> : What does it mean? What is the application-level error... >> Hadi Esmaeilzadeh: Right. >> : Is that... >> Hadi Esmaeilzadeh: So for FFT and inverse kinematics, I used the relative error as the metric. For jmeint, you're either hitting or missing — the two triangles are intersecting or not — so this is the miss rate. And for JPEG, K-Means, and Sobel, I'm using the image difference between the original code and the Parrot-transformed code: the average pixel difference at the output.
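A sketch of the guarded-invocation idea mentioned above — predict when the error would be too large and fall back to the precise code. He notes this is explicitly not implemented; every name here is hypothetical:

```c
/* All three callees are hypothetical: a trained predictor, the original
   precise region, and its neural (Parrot) replacement. */
extern int   predicted_error_too_large(const float in[9]);
extern float region_precise(const float in[9]);
extern float region_parrot(const float in[9]);

float region_guarded(const float in[9])
{
    /* If the input looks far from the training distribution, re-execute
       the precise code; otherwise invoke the neural version. */
    if (predicted_error_too_large(in))
        return region_precise(in);
    return region_parrot(in);
}
```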
>> : For JPEG, are the errors usually clustered into a few really bad ones, or is it kind of a smooth error? >> Hadi Esmaeilzadeh: It's a smooth error. It's a smooth error. So we can do the Parrot transformation, but what opportunity is there when we are doing the -- Sure. >> : I'm sorry. Can you go back one slide? >> Hadi Esmaeilzadeh: Sure. >> : If I were to do a normal JPEG compression on an image, do you have any idea how that relates to your error metric? >> Hadi Esmaeilzadeh: I haven't measured that. But this image is worse than normal JPEG. >> : Okay. >> Hadi Esmaeilzadeh: So the quality... >> : Is it [inaudible] or -- Is it something you can visualize, you can see the difference? Or is it... >> Hadi Esmaeilzadeh: I can visualize the image difference and see... >> : Okay. >> Hadi Esmaeilzadeh: ...how that looks. >> : Can you post-process to compute a quality factor with JPEG? Like the final... >> Hadi Esmaeilzadeh: Yes. >> : ...compressed images to... >> Hadi Esmaeilzadeh: Yeah. >> : ...use their metric? >> Hadi Esmaeilzadeh: Yeah. Actually, the average or MSE difference is the image-difference metric that image processing people use; that's the reason I used it. And this is between, you know, after you do the JPEG compression and after you do the Parrot transformation — here you see an average of about 10% error. So now that we have this Parrot transformation, we can have different implementations of the neural network. We can do it as a library — just call a neural network library — or I can use a digital NPU. This should say analog — sorry, I fixed the graph but I didn't update the... >> : That's all right. We can use analog [inaudible]. >> Hadi Esmaeilzadeh: Yeah, so this should be analog. That's -- [Laughter] Yeah. So that's why I came up with this idea: I wanted to do analog computation while keeping the normal digital programming interfaces but using analog circuitry. And there is a whole body of research that shows we can do analog implementations of neural networks. So we can use analog, and I'm going to talk about parts of that. So let's do the software first and see what happens. If I use a library — this is the application slowdown, and these are the different applications. I'm using the Fast Artificial Neural Network library; it's a widely used open-source C++ implementation. You see that this is kind of -- You know, at least the CPU implementation is not working. But -- Sure. >> : Is this how much the code slows down if you apply your technique? >> Hadi Esmaeilzadeh: Yes. If I just use a library of neural networks without any hardware support. >> : Fine. Go ahead. >> : This is how not to do approximate computing. >> Hadi Esmaeilzadeh: Yes. [ Laughter and multiple comments ] >> : But you leak a lot more and so you consume a lot more energy because you're running so slowly, and so it's a huge opportunity for speedups and improvements. Straw man. >> : It's like power consumption branch predictors. >> Hadi Esmaeilzadeh: I'm trying to make a case for two things. One is how we can change the processor so that we can actually gain benefit from this without adding any extra hardware to the processor — without adding the actual big accelerator to the processor. >> : And no algorithmic changes.
>> Hadi Esmaeilzadeh: With no algorithmic changes. >> : Can we flip back one, please? >> Hadi Esmaeilzadeh: Sure. >> : Does the difference in the heights relate to the size of the input or the computation size? Or what's the difference between 5 and 75, typically? >> Hadi Esmaeilzadeh: So it depends on how much of the computation goes into that region of code, right? In the FFT we are spending like 30% of the computation in that region, so I'm slowing down that region, and then Amdahl's Law translates that to this 4.5. In jmeint, that region is most of the application, and the implementation is actually based on a paper — it is a very efficient implementation. And I'm using a large neural network, because it's a very complicated, control-flow-intensive region to approximate. So I'm trading off a little bit of computation for a large [inaudible] computation, and I see a huge slowdown. Does that make sense? >> : Yeah. I guess I was expecting a statement along the lines of: as the number of instructions I'm replacing goes up, my penalty goes down, because the ANN is sort of fixed-time, so you want to increase the amount of code you cover. >> Hadi Esmaeilzadeh: Exactly. That's one of the things. But there is another side to it: how much computation you are replacing in that region, right? As you enlarge the region, the neural network potentially grows, and you may not end up with a gain there. >> : So the example you gave where you had the approximate annotation on the code didn't have a loop, I don't think. It was just straight floating-point computation. If I have a loop, then I can potentially have a program that doesn't terminate. So I'm trying to get to the point of when this breaks down and when you cannot learn a particular function. Have you guys tried to classify -- I mean, effectively you could reduce this down to the halting problem, I suspect, right? Because now you have a neural network that's going to tell me whether or not my program halts. >> Hadi Esmaeilzadeh: Right. >> : So when does this break down? But for some programs you can actually deal with loops? >> Hadi Esmaeilzadeh: Right. Right. For JPEG there is a loop inside that goes over the 64 pixels — you don't implement it as straight-line code, but it's not a loop that can run unbounded. So the region of code should not change any state besides its outputs. The neural network that I am replacing the region of code with takes a bunch of inputs and generates a bunch of outputs, right? And if that code is changing something besides the outputs, either I have to hoist it and do a bunch of computation to come up with it so that this region of code is pure, or I can't do it. >> : I think it's a much simpler answer. So you can -- Now slap me if this is just way off base, but -- Let me move to the side here. You can grow the boundaries of the region you're considering for the neural network until you find a program region that is well structured with this input and output behavior. So there might be lots of internal state and communication within it, you know, if you [inaudible]. And in your example, you have some inputs, which is the program, and then an output, which is whether the program halted. Right? And you could try to train the neural network to do that. And that has well-structured inputs and outputs, but you're not going to be able to train the network to do it.
So I think, to your point about the halting problem, it's really just that some functions are amenable to neural networks and some are not. And the less amenable they are, the more error you're going to get. And with the halting problem you're just going to get pure entropy: you're going to get a random result, and you basically have zero signal there, because you can't train the network to do it. So there's really just a spectrum of how much error you're going to have, and for some things the error will be effectively infinite. >> Hadi Esmaeilzadeh: Usually it's random — 0.5. If it's a classification problem, the neural network gives you a random answer. >> : Right. So is that right, do you think? >> : Yeah, I think that is right. I mean, I think the idea is that if you have N inputs that you're training on, you're going to get some answer from the network. If you use N plus 1, the entropy stays the same. The amount of error stays the same; it doesn't go down. Right? >> : Right. >> : As you add that training in. If you keep adding and adding and adding, you're never going to get to the position where you actually converge. >> : That's right. >> : It's really just a question of how learnable the function is. >> : Yeah. >> Hadi Esmaeilzadeh: And the dimensionality of the input space — if you grow the dimensionality, then training is going to get harder and harder. >> : I have another question, and maybe you're going to get to it. But it kind of comes back to the algorithmic aspect: if these applications know that they can tolerate a certain approximation in their result, where would changing the software implementation get you — not using your library to learn and train a neural network, but changing the implementation of the code to tolerate that much less quality in the result set? Right? I mean, I'm trying to get at the gain that you get from invoking the neural network with some special hardware or whatever, right, versus algorithmically doing less work and getting a less accurate --. >> Hadi Esmaeilzadeh: So I can give you one example answer. When I was doing JPEG, I downloaded a JPEG code and I was using it, and when I did the Parrot transformation I was getting like a 100X speedup. The reason was that the DCT part of that JPEG was an exact cosine transformation, and that's very, very slow. Then I changed the implementation — I found another implementation whose DCT was an approximation of the exact one — and my speedup was around 70%. So even at the algorithmic level you can do approximation, but there is a limit, and this goes beyond that. The other thing is, we wrote another paper — we proposed another architecture — which changes the processor itself and supports approximation instructions. That's one of the ways you can deal with approximation. The gains there were no speedup and something like a 20% to 40% energy reduction. So there we were going after like a 20% energy reduction, and here we are seeing around a 2X speedup and 3X energy reduction. But one of my colleagues, Adrian — you know Adrian — is working on compiler optimizations, unsafe compiler optimizations that you can apply to regions that are approximate, to execute less code and see how much you can get away with. I don't have a head-to-head comparison with that technique.
So during my internship -- this is what I did before -- I started with generated code: instead of using a library, I can take the neural network and generate code for it which is efficient. Then — as you're going to see in the results — one of the things that causes this slowdown is the sigmoid function, which takes a large part of the time. And then — I finished this like two days ago — I did AVX code generation with the Intel Ivy Bridge AVX extension, to see if we can use vectorization and get better results. So before I show you the results, I want to talk about the AVX code generation. We can do the vectorization in two different ways. One is that each neuron is summing a multiply [inaudible], so I can use the parallelism inside the neuron and do the vectorization like this: different inputs getting multiplied. Or I can go across neurons and put one input for different neurons in each vector. I took this one because I think it gives a better vectorization approach: with the first, at the end I have to do a ladder of additions to get the final result, whereas here I can do the additions for different neurons at the same time. So I implemented this one. Let's look at the results. Before I show you the slowdown — a byproduct of doing this work is, let's say we want to accelerate neural network execution regardless of the Parrot transformation: how much gain can we get with these techniques? So this is speedup over the FANN library when you are running neural network applications. This is the generated code; you get around a 1.5X speedup, 50%. And when you add the sigmoid instruction you see a huge bump in the geometric mean — the speedup goes up to around 7.3X. And then when you do the vectorization with vector support for sigmoid, you get an order-of-magnitude speedup for neural execution. >> : [Inaudible] floats? >> Hadi Esmaeilzadeh: These are floats, single precision. And one thing that I found surprising is that AVX sometimes also results in slowdowns for small networks. But this is the biggest network that I had — 18, 32, 8, 2 — and I see a huge bump here, around a 13X speedup over the generated code with the hardware sigmoid. So this is the application slowdown when we apply these techniques. After I did the code generation and gained something, I added the sigmoid instruction to the processor — and actually two of the applications speed up even without any hardware support for neural execution. Just the sigmoid instruction and the vectorization take the slowdown from 15.7X to 2.0X. So this kind of makes the case for using dedicated hardware to do the neural network execution.
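Before the hardware discussion, a sketch of the across-neuron vectorization just described, for one 8-neuron layer using AVX intrinsics (Ivy Bridge AVX has no FMA, so multiply and add are separate instructions). The weight layout and the scalar sigmoid tail are illustrative assumptions:

```c
#include <immintrin.h>
#include <math.h>

/* One 8-neuron layer, vectorized across neurons: a single 256-bit vector
   holds one partial sum per neuron. Each input is broadcast and multiplied
   by the eight neurons' weights for that input, so no final ladder of
   horizontal additions is needed. Compile with -mavx. */
void layer8_avx(const float *in, int n_in,
                const float (*w)[8],  /* w[i][j]: weight from input i to neuron j */
                const float bias[8],
                float out[8])
{
    __m256 acc = _mm256_loadu_ps(bias);
    for (int i = 0; i < n_in; i++) {
        __m256 x  = _mm256_set1_ps(in[i]);   /* broadcast input i            */
        __m256 w8 = _mm256_loadu_ps(w[i]);   /* the 8 neurons' weights for i */
        acc = _mm256_add_ps(acc, _mm256_mul_ps(x, w8));
    }
    float s[8];
    _mm256_storeu_ps(s, acc);
    for (int j = 0; j < 8; j++)
        /* Scalar sigmoid tail: this is the part the proposed sigmoid
           instruction (and its vector form) speeds up. */
        out[j] = 1.0f / (1.0f + expf(-s[j]));
}
```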
So I'm going to talk about the digital hardware implementation. For this we needed configurable hardware that can realize different neural networks — different regions of code require different neural networks, so the hardware implementation needs to be reconfigurable. Then we needed a microarchitectural interface between the accelerator and the processor, and ISA extensions that let the processor communicate with the NPU. And at the end — I thought that I was talking [inaudible] — since we are doing very fine-grained acceleration, this integration of the hardware neural network should not hinder speculative execution or out-of-order execution in the processor. So we designed a reconfigurable digital NPU. Each neuron is essentially a multiply-add unit with a weight cache that just crunches through the multiplies and adds, plus a hardware sigmoid unit. And these are the three FIFOs that are exposed to the processor: the processor sends inputs to this FIFO, reads results from that FIFO, or configures the NPU and sends the weights through this one. I used the MARSSx86 cycle-accurate simulator, configured very closely to an Intel Core architecture, with an 8-PE NPU. And I compiled the applications with -O3 so that I don't bias the results. So these are the application speedups. The dark part shows the actual speedup with the NPU that I showed you; the light part shows the ideal speedup that I would have gotten with a zero-delay NPU. So [inaudible] is 2.3 here. In one of the applications I actually see a slowdown even with the hardware, because that region of code for K-Means is the [inaudible] calculation — it's a very small region, it spends like 30% of the computation in there — so even though the network is small, the actual code is very efficient. And I [inaudible] around 3X energy savings here for the applications, but I would have gotten 3.9X if I had a zero-energy NPU.
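A sketch of what the processor side of the three-FIFO interface described above might look like to a transformed binary. The talk specifies input, output, and configuration FIFOs reached through ISA extensions; the intrinsic names here are hypothetical stand-ins for those instructions:

```c
/* Hypothetical intrinsics standing in for the ISA extensions that reach
   the NPU's three FIFOs: configuration, input, and output. */
extern void  npu_config_enqueue(unsigned word); /* topology, then weights */
extern void  npu_input_enqueue(float x);
extern float npu_output_dequeue(void);

/* Configure once per target region. */
void npu_setup(const unsigned *cfg, int n_cfg)
{
    for (int i = 0; i < n_cfg; i++)
        npu_config_enqueue(cfg[i]);
}

/* Per invocation of a 9-input, 1-output region: stream the inputs into
   the input FIFO, then block on the output FIFO for the result. */
float region_npu(const float in[9])
{
    for (int i = 0; i < 9; i++)
        npu_input_enqueue(in[i]);
    return npu_output_dequeue();
}
```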
So the question is: can we move this further? Can we push it further? This is actually the analog NPU. As part of my internship, I studied the feasibility of moving toward an analog implementation of neural networks. To do that, as I said, the ANPU needs to be reconfigurable. So what we are going to do here is the computation in analog, and the storage and the communication between the units of analog computation in digital. We call each of the computational units that carries out the computations of a neuron in analog a PE, a processing engine. So we can have an array of APEs, analog PEs, and then we have to figure out how to map the neural network onto them. One option is to time-multiplex neurons over the APEs: we do the computations of neurons 1, 2, and 3 first, and then use the same APEs to do 4 and 5. Right? The other approach is to have a two-dimensional array of APEs and geometrically map the neural network onto it. The good opportunity with the geometric design is that the communication between the APEs can be analog, instead of converting to digital and communicating it to the... >> : It's not [inaudible]. You can time-multiplex it if you do a D-to-A and A-to-D conversion for the analog units. >> Hadi Esmaeilzadeh: You can do it, but here you don't have to do it. >> : Yeah, if you want to stay in the analog domain then you need the multiplex design. >> Hadi Esmaeilzadeh: If you want... >> : And you get the geometric design. >> Hadi Esmaeilzadeh: Yes. >> : Yeah. >> Hadi Esmaeilzadeh: Yes, exactly. And there are other factors like resource utilization and fault tolerance. >> : So I'm a software guy. Can you give me the little two-second version of why I'd want to use analog rather than digital [inaudible]? >> Hadi Esmaeilzadeh: So the reason is that with analog you can do addition just by having a point where multiple wires come together, and using Kirchhoff's law to do the addition. You don't have to convert things to bits and so on. And you can do the multiplication — here we are not actually doing multiplication, we are scaling the input: you can use a resistor ladder, pass a current through it, use the resistance to scale that current, and then do the addition. So you're just using Kirchhoff's law to do the multiplication and the addition. That's much more efficient than doing the digital computation. >> : I think a broader answer is that in analog circuits you can implement much more efficient computational primitives. I mean, you can do integration, addition, just by building a circuit that physically mimics the function. You just don't get digital precision. Right? I have two wires with currents on them and I tie them together, and that's an add. It's pretty efficient. >> : There's a lot of work on that. >> : Yeah [inaudible]. What's that? >> : Get analog precision because it's arguably better in some scenarios. >> : Let's take that one offline. [ Laughter ] >> : It's not wrong; it's complex. >> : Yeah. >> Hadi Esmaeilzadeh: So what we are going to do is have the communication between the neurons in digital but the computation inside the APEs in analog. So we have to decide how many inputs we're going to feed to each analog unit. Conceptually you can have as many wires coming into that point to do the addition as you like, but analog circuits tend to work in a certain small-signal region: you have this region of current where everything is linear and you're getting that addition effect, but if you blow past that region, the nonlinearities of the analog circuit kick in and, you know, ruin your precision. So one design decision is the number of inputs we feed to the analog PE. The other is the number of bits you choose to represent a number. With single-precision floating point you have a very large dynamic range, but with the analog circuit you are moving toward fixed-point operation, and as you increase the number of bits, the speed and the energy change drastically. I'm going to show you some results; before that, I'm going to show you a little bit of the circuitry. What we do is convert the inputs — the bits — to currents. You can have a current source, and another source which is two times that one, and four times, and eight times, and when you have ones in particular bit positions, the current through those sources gets multiplied by the corresponding factor; then you have a current value which is representative of the bits that you had at the input. Then you can do the scaling, the multiplication, with a resistor ladder. And if you want to subtract or add — to do the addition, you have to choose whether you take the negative current or the positive current. So this unit does the multiplication for eight inputs. Then you have the addition, which is just tying the wires together. And then you have the A-to-D conversion, which also naturally applies the sigmoid, and you get the output. So we went with the time-multiplexed ANPU design. This is a conceptual design; it's not realized yet. But you have these APEs with eight inputs, these are the input-output FIFOs, and the communication between these units is done digitally. Our methodology for the design-space exploration is that we do Cadence transistor-level simulations and then feed them into a software simulator that realizes the entire ANPU. And the first thing we did is to see how far we can push the bit-width. We want to identify how many bits we can use.
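A software-level sketch of the bit-width exploration being set up here: quantize values to n-bit fixed point, evaluate the network, and record the application-level error — a stand-in for the Cadence-plus-simulator flow. The harness function is hypothetical:

```c
#include <math.h>
#include <stdio.h>

/* Quantize x (assumed scaled into [-1, 1]) to a signed fixed-point value
   with `bits` total bits and back — a stand-in for the precision the
   analog current sources and resistor ladder can actually represent. */
float quantize(float x, int bits)
{
    float scale = (float)(1 << (bits - 1));
    float q = roundf(x * scale);
    if (q >  scale - 1.0f) q =  scale - 1.0f; /* saturate */
    if (q < -scale)        q = -scale;
    return q / scale;
}

/* Hypothetical harness: evaluate the quantized network on the unseen
   test set and return the application-level error at each bit-width. */
extern float app_error_at_bit_width(int bits);

void sweep_bit_widths(void)
{
    for (int bits = 2; bits <= 12; bits++)
        printf("%2d bits -> application error %.4f\n",
               bits, app_error_at_bit_width(bits));
}
```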
This is single precision and that's the error, right? And this is the number of bits for the inputs, and this is the number of bits for the weights, which go through the resistors. You can see — it's hiding behind these lines — that 8 bits is enough. Right? Okay, let's look at the energy projections. This is energy, and this is the number of bits that you use in the APE. Because of the design of the digital-to-analog conversion, the sizes of the current sources increase exponentially, so the energy goes up exponentially as you increase the number of bits. If we look here, this is the 16-bit digital FP at two different frequencies. Around 8 bits of input we see a 10X energy reduction — and this is given that we are doing A-to-D and D-to-A between the neurons and doing the communication in the digital domain. If we did the geometric design, this would be well beyond 100X energy efficiency with analog. So for this I have worked with Doug, Luis, and Professor Hassibi from UT Austin on the analog parts, and I have also worked with Adrian from the University of Washington and Renee from the University of Texas. We have a new guy at the University of Washington, Thierry, who is working on an FPGA implementation of this, so that we have a conceptual design that actually accelerates an ARM core on the FPGA. We got the board and we are pushing that forward. We have a webpage for this project, and we're going to provide the compilation workflow and the tools that I developed — the forward code generation for neural networks. I think that's important because, you know, Google had this project where they ran a very large-scale neural network on their clusters. The work I did was toward using neural networks to accelerate general-purpose code, but the byproduct — the code generation, using AVX, and things like that — can be beneficial for such projects when they are doing neural networks, or if Microsoft is interested in this. The thing that I didn't talk about is that for the [inaudible], I was the compiler. Part of the internship was developing the compilation workflow. Right now we are doing it pragma-based, so the user uses a pragma, and I made it more flexible to let the developer specify errors, ranges of inputs, and other things that can be used. And during my internship I worked on the [inaudible] for these two papers as well. So that's all I've got. This is "The Fifth Day of Creation." That's kind of where we are. >> : All right. So can I give you some advice about your talk? >> Hadi Esmaeilzadeh: Sure. >> : Because you're going to be giving a variant of this when you go out on the interview circuit. You should render this with the approximation and end on that. And not tell them, and then flip to the original and say, "Here's the version that was done digitally." >> Hadi Esmaeilzadeh: Okay. >> : Right? Because at every talk you give at a major university, people are going to jump on you and say, "I don't believe you can give up error. How can you give up digital precision?" Blah, blah, blah. And they're probably right. So that little trick will sort of anticipate that objection and head it off at the pass. And you can just say, "Well, as I've just shown you, there are cases where you can't tell the difference." I mean, you do it with the monkey, and you might want to leave that in. You know, it would just show that you've anticipated it, and then you can do it with a little smile. And, you know, that would be a really nice way to...
>> Hadi Esmaeilzadeh: Okay. >> : ...end it. >> : I have a question. >> Hadi Esmaeilzadeh: Sure. >> : If you go back to the slides where you have the -- No, next one. This one. Right, where you've got the energy. So you could use this to indicate how much the algorithmic changes would have to -- Right? -- how much you'd have to reduce the algorithmic complexity in order to match your energy savings. >> Hadi Esmaeilzadeh: Right, right. >> : Right? >> Hadi Esmaeilzadeh: Right, right. Because we haven't, like, done this implementation. We are planning to do it for ISCA this year, you know, the analog implementation. So I was a little bit cautious about talking too much about it. Right. No, but you're right. >> : I mean, for example, JPEG. JPEG is a good one because it's got the ability to tune it, right? It's built into the algorithm. Right? To get the same 10% error that you had, right, set the quality factor at that, and use that as the thing that you normalize to... >> Hadi Esmaeilzadeh: I see. >> : ...to show what your energy savings are. >> Hadi Esmaeilzadeh: Sure, sure. Yeah [inaudible].... >> : Right? So the algorithmic changes actually occur and you want to beat that. >> Hadi Esmaeilzadeh: Okay. >> : Right? >> : So I'm curious if you've thought about following this up. The examples that you gave are very -- they're... >> Hadi Esmaeilzadeh: Small. >> : ...numerical, and the approximation is pretty obvious. Right? And maybe this goes a little bit to Doug's point at the very beginning of the talk. There are a lot of situations on the phone, for instance, where everything I sense -- My phone is effectively sensing the world around me all the time, and all of my programming models that currently exist for that sensing data are discrete and work on facts. When the reality of the situation is that those sensors are not giving me factual input. Right? They're telling me my approximate location. >> : That's exactly right. >> : And so I wonder if you've thought about how you change -- I mean, as a programmer, how do we start to talk about this? Error is great, but I don't think it's the right solution because it's very problem-dependent. So how, as a programmer, do we start talking about dealing with this — following this kind of approximation up to a programmer at the level of the type of people that are writing JavaScript? Right? How do we allow people who are writing these very high-level APIs to reason about approximate computations? This is a hard question and I don't mean -- I mean, I'm just curious if you've given any thought to it. >> : Can I add to the question? So I want to make the problem bigger. Okay? So you're building an app for a mobile phone, and we're trying to do a lot of stuff in our group around inference and, you know, extraction of these high-level semantic signals from these noisy sensors and things you do, like browsing the web and all that. Okay? So we have a really good understanding of -- or appreciation for the problem. In this energy-limited world, in some sense what you want to do, which we don't know how to do either, is give the programmer an energy budget and say, "What's the best answer I can get with this energy budget?" And it's not that your thing is going to add error. Your thing might allow them to run algorithms in that fixed energy budget that give them a much better result with a lower error. But if it's the digital representation versus the trained representation with error, you're exactly right, it's worse.
But the answer is you'll be able to do much better stuff with this, because you're energy-limited. And so how do you say to the programmer, "Here's a hundred joules, and you want to figure out whether the user's at work or at home"? Right? Now this will actually probably let them do a better job, but it's not about error, because the algorithm you would use if you had to do it digitally would have more error. And so I think that's what you're saying, right? >> : Yes, that's exactly right. Yep. >> : And so in some sense you want to give the programmer a bag of energy and say, "Here you go. Now you have all these different choices you can make." And you kind of want to run it through a tool flow and say, you know, "This meets your energy budget. This doesn't." And maybe it's the desynchronization of your GPS sampling. Actually, I haven't seen any work on this, and this would be really cool. >> : So energy versus accuracy? >> : Giving a programmer -- In Visual Studio you have an energy model and a model of the system and processor. And you have an energy budget. >> : That's static. >> : Yeah, I have a static energy budget. You know, you've got some -- You know, I've provisioned a million joules, and you have a bunch of templates, sketches, right, and you'll compile the code. You'll do the analysis. You'll run it against the model. And then the system can automatically adjust, you know, the desynchronization of the data, the number of loops, and scale the sketch down to meet your budget. And then you can say, "Approach X, Y, Z, A, B, C — which one gives me the best results?" You know, that's what we're going to have to do in the future. >> Hadi Esmaeilzadeh: Yes. Exactly. >> : And it's a compilation problem, too. >> : It's a compilation problem. Actually, this would be a really interesting project. >> : It sounds like you're turning normal software into FPGA computation and timing closure. >> : No. No, no, no, no. I mean, there's an element of that... >> : It's approximate energy. >> : Yeah, yeah. But you've... >> : [Inaudible]. >> : ...got a model. But you've just got to expose the model. >> : Right. >> : Right? >> : Right. >> : I mean, you can slice it two ways. You can say, "Here's --." I can have some trace and I can do some profiling. And I can take these different sketches and say, "Here's the quality of the answer you get, and here's the amount of energy that each of them consumes." And you want the... >> Hadi Esmaeilzadeh: But something like gprof which, instead of giving you the time that you spend in each function, gives you the amount of energy that each unit is spending. >> : It's Eprof. Energy and error. It's like [inaudible] problem. >> : So we've started doing a piece of this, Kathryn McKinley and myself. We've been instrumenting Windows Phone to provide effectively exactly what you talked about as an energy limit. We haven't gotten to the point where -- And then that budget is used to inform. And it's predictive: you have to predict what the budget is for today. I think that normally you turn on your power -- or, I'm sorry, you plug in your phone at six o'clock at night. >> : Yep. >> : And so that gives me now an -- I look at the battery. Now I say, "Okay, I've got an estimated amount of power that I have to get through the day." And then you -- I'm sorry, energy. >> : Yes. >> : And then... >> : I'm sorry, that's a pet peeve of mine. >> : And then you have to make decisions based on that, right? But we're nowhere near, I think, what you just described.
>> : Yeah, yeah. >> Hadi Esmaeilzadeh: I think like... >> : [Inaudible] ecosystem that we did a long time ago. Do you know what that is? >> : I think I've heard of it, yeah. >> : Yeah, yeah. >> : So this [inaudible]... >> : [Inaudible].... >> : We have the right set of people in the room. >> : And then force it. >> : You did a great job today, Hadi. Good talk. >> Hadi Esmaeilzadeh: Thank you. [ Audience applause and commenting ]