>> Doug Burger: Good morning, I'm Doug Burger. I'll pass on the
opportunity to say anything inappropriate given there's only a few
people in the room, but there may be people watching online. So it's my
pleasure today to introduce Hadi Esmaeilzadeh. How was my
pronunciation?
>> Hadi Esmaeilzadeh: That's perfect.
>> Doug Burger: Okay. Thank you. I try hard. Hadi is a PhD student at
the University of Washington who started with me at the University of
Texas, moved to Washington, and is now working with myself and Luis
Ceze and several of the other graduate students in Luis' group. He has
had an amazing track record of output and top-quality research over the past couple of years, with several award papers and mention of your work on the front page of the New York Times. And he now has an incredibly
interesting result in a new area that we believe is going to be very
important. And it's new. It could be profoundly important. It could be
completely irrelevant. We just don't know, but it's very exciting. So
thank you for spending your summer here with us doing your
internship....
>> Hadi Esmaeilzadeh: Thank you [inaudible].
>> Doug Burger: You've gotten a lot done. And I'm really looking
forward to seeing the results of all your hard work.
>> Hadi Esmaeilzadeh: Okay. Good morning. Today I'm going to talk about
a new class of accelerators that use machine learning to accelerate
approximate general-purpose programs. This is a project which is done
in collaboration with the University of Washington and Microsoft
Research and the University of Texas at Austin. I have collaborators
from these three institutions. Parts of this talk are going to appear
in the next International Symposium on Microarchitecture.
So people have been using accelerators, and there are different kinds of accelerators, like GPUs and FPGAs, each of which takes advantage of certain characteristics in the program to accelerate it. Like, FPGAs can accelerate applications that have an abundance of parallelism, but they are [inaudible]. And GPUs are good for applications with a lot of parallelism, but they can't, you know, work well with applications that have very divergent control flow. And there have been recent, you know, proposals in the literature for different accelerators that augment the processor and try to, let's say, synthesize parts of the program on hardware or a user-configurable fabric to actually accelerate the programs. What we are looking at here is the -- You know, we want to take advantage of a characteristic of many applications in diverse domains: that full precision is not always required.
Either there is no one certain answer for the application, or [inaudible] at the output. The best example is graphics, and we see it, like, in JPEG compression. It's tolerable, and the perception of, you know, the human plays a role in this process. So...
>> : I have a question.
>> Hadi Esmaeilzadeh: Sure.
>> : Is it your belief that the opportunities for approximate computing
and applications have increased?
>> Hadi Esmaeilzadeh: I think it's because, like, machine learning is taking over and we are facing the era of big data, where we need to do a lot of, you know, computation on a vast amount of data. The...
>> : Most of your applications are drawn from big data really.
>> Hadi Esmaeilzadeh: Search is one of them, if you look at data mining.
>> : No, I'm talking about in your paper.
>> Hadi Esmaeilzadeh: Right.
>> : But so, I guess the root of my question is has there been some
trend that is making approximate computing more possible or is it the
case that we could always have done it? For example, HPC workloads in numerical simulations with, you know, convergent algorithms, gradient descent type approaches -- and we've only just realized this now that our backs are against the wall. I mean, has it always been there or has it grown because of the emergence of some of these areas?
>> Hadi Esmaeilzadeh: So there is another trend that is emerging. Maybe
it's because of the challenges that we face in the silicon.
>> : That's not driving the applications.
>> Hadi Esmaeilzadeh: Right. That's like -- You know? That's the energy
efficiency....
>> : I mean [inaudible] in what applications we can do and drive them
that way, but that's not what's driving a lot of these. Big data is a
buzz word. Yes, it's important but --.
>> Hadi Esmaeilzadeh: Right.
>> : So do you think something has changed?
>> Hadi Esmaeilzadeh: In the application domain?
>> : Are there a few obviously identifiable trends? I mean machine
learning is clearly one.
>> Hadi Esmaeilzadeh: Machine learning...
>> : [Inaudible] popular but...
>> Hadi Esmaeilzadeh: ...is one of the things. And then, you know -- So
another trend that we can maybe look at is the -- you know, doing
computation with limited amount of energy and then trading off compute
power -- Sorry -- trading off battery or harvested energy for the
quality of results that [inaudible]...
>> : That's an opportunity but that's not what's driving a lot of these
emerging applications today. Vision is emerging despite energy limits
not because of it.
>> Hadi Esmaeilzadeh: Right.
>> : Okay. Let's put that on the queue.
>> : So isn't it that algorithmically things like approximate
algorithms are gaining in their acceptance in use of these things?
>> Hadi Esmaeilzadeh: That might be the case. That might be the case. Because a lot of, you know, let's say recognition or, you know, even gaming, at the algorithmic level they don't require, like, perfect, you know --.
>> : Right. And those are used to help reduce overall computation time
or...
>> Hadi Esmaeilzadeh: That's true.
>> : ...[inaudible] various properties about...
>> Hadi Esmaeilzadeh: That's true.
>> : ...convergence or things like that.
>> Hadi Esmaeilzadeh: Yeah. But I'm not sure if that answers Doug's
question.
>> : I just wonder if it's that, you know, if you're summing up your
checkbook you'd want it to be precise, but whenever you're interacting
with the real world, the real world doesn't have a precise digital
representation. And so there's necessarily going to be fuzziness. So as
we do more things that interact with the world, Vision and Augmented
Reality, trying to extract trends from the web, they're all in some
sense approximable because they don't have a precise answer.
>> Hadi Esmaeilzadeh: Yes.
>> : You could also go and say that there's an acceptance of doing
that. So ten years ago we were working on compression of mass images
and the answer was, yeah, you could do it. It doesn't hurt the
[inaudible] but they wouldn't accept it because, ooh, it's not exact
and, therefore, it's wrong. And I don't know that's necessarily the
case anymore.
>> : Well, they're older now and their vision's gotten worse so that they can't [inaudible].
[ Laughter ]
>> : Yeah.
>> Hadi Esmaeilzadeh: Okay. So what we are doing: if we look at this design space of processors, we always trade off power and performance and move over different points in this design space. Each point in this design space is a processor, and the line shows the Pareto frontier. What we are doing is exploring a new dimension, which is approximation. What we want to do is compromise a little bit of accuracy and hopefully get better performance and lower power by giving up, you know, some amount of accuracy.
And for this we are proposing a new class of accelerators that need to be trained rather than programmed. So what I mean by a trainable accelerator is that if you look at imperative code, you pick a target region that is hot, where the application spends a lot of time. Then you put a learning algorithm beside it and run the application. The learning algorithm will observe the inputs and the outputs of that target code. And after it reaches a certain level of accuracy, you replace that target region with a call to that machine learning algorithm. So we call that algorithmic transformation a Parrot transformation, which mimics the target region of code. That allows us to run the program on the core, on the processor, while accelerating the machine learning algorithm on dedicated hardware. Since this is replacing the target region, we are indirectly accelerating the whole application by doing this transformation. Sure.
>> : Have you learned the regions that you shouldn't apply this and
drop back to the old code for those situations?
>> Hadi Esmaeilzadeh: So we have safety criteria. Like, let's say you are doing JPEG, right? So when you're writing the header of the JPEG, it should be precise, because if you even, like, mess up one bit the image is not readable. But when you are processing the pixels and the bits, then you can do approximation in that code. So what we require from the programmer is to annotate regions of code that are approximable, without being conscious about which technique of approximation we are going to use. Just tell me which regions are safe to approximate and can have a little bit of inaccuracy and error at the output.
>> : So using your JPEG example, do you have a notion of a quality
factor? Like JPEG compression I can set a quality factor on the
compression. Or are you so small in your approximation that it's just a
little bit of noise around this coarser grain?
>> Hadi Esmaeilzadeh: That's an interesting point, but right now I
don't have it for this technique. We have other techniques that use,
you know, different approximation methods that we can actually tune up
the approximation. But not for this.
>> : [Inaudible]. There's a whole body of work on synthesis where, you know, programming by example is becoming kind of in vogue, right? And I'm really struck by your title here, "Acceleration Should Be Trained not Programmed." You could almost use some of the ideas from [inaudible] on synthesis here. You can think about programming in general, for a large class of programs, as being trained not programmed, right? You specify very simple, poor-quality algorithms and then train whole bodies of work, and it's not really necessarily about accelerators. Your accelerator here is just a general...
>> Hadi Esmaeilzadeh: Sure.
>> : ...[inaudible]. Right?
>> Hadi Esmaeilzadeh: Sure. Sure.
>> : So I think the generality here is great.
>> Hadi Esmaeilzadeh: That's actually very interesting.
>> : So how do you --. Sorry [inaudible]. This is really interesting.
So I think when we get a little bit deeper we do have a non-CPU
accelerator. But how do you tie program synthesis to this training that
you can use to make things more efficient? How are you doing this
algorithmic transformation? Is there a tie there or are they just
intellectually interesting?
>> : No, I think there is a tie. I mean...
>> : Yeah, I'm trying to understand that.
>> : ...you can think about synthesis as training, right? So the programming-by-example -- that example is much like you have a learning problem here.
>> : Yep.
>> : And in the same vein, those examples are used by the synthesis
engine to come up with a huge -- You know, you can think about
synthesizing a large number of possible solutions and picking the one
that balances approximation versus power.
>> : I see. I see. So it's more like we can think about the ones that
are most amenable to this transformation.
>> : That's right.
>> : Okay.
>> : But for general purpose code, right?
>> : Right. Right, right.
>> : Are you going to remember that?
>> Hadi Esmaeilzadeh: Yes.
>> : Okay.
>> Hadi Esmaeilzadeh: Can you...
>> : I'll take it down.
>> Hadi Esmaeilzadeh: ...actually send me like the links and...
>> : Yes.
>> Hadi Esmaeilzadeh: ...you know, papers [inaudible].
>> : I'll talk to you afterwards because I know we're actually working
on a similar problem for synthesis in this example.
>> Hadi Esmaeilzadeh: Oh, okay.
[ Multiple audience comments simultaneously ]
>> : But your talk's already a success.
[ Laughter ]
>> Hadi Esmaeilzadeh: So to do the Parrot transformation -- So this has been an internal debate between me and Doug about the naming of the Parrot transformation. So if you have a better suggestion, I'm willing to, like, hear it.
>> : So who likes Parrot and who doesn't like Parrot?
>> Hadi Esmaeilzadeh: I like the Parrot. He doesn't like the Parrot.
[ Multiple audience comments simultaneously ]
>> : Yeah, I like Parrot.
[ Laughter ]
>> Doug Burger: Screw you guys.
[ Laughter ]
>> Doug Burger: [Inaudible].
>> Hadi Esmaeilzadeh: So to realize the Parrot transformation we actually, you know, need to come up with a learning algorithm that can learn regions of imperative code. And then we have to come up with a programming model that lets the programmer, you know, think about this transformation and annotate the code for us. And then we need to come up with a compilation workflow that actually carries out this transformation. And then we need to think about the implementation of the learning algorithm. As I'm going to show you -- and I think I gave it away, you know, with the title -- we empirically found out that neural networks can actually mimic parts of imperative code. And the good thing about neural networks is that we know how to implement them very efficiently in hardware, and they are inherently parallel. So they're a good target for the acceleration at the end. And I'm going to talk about the different opportunities that using neural networks will, you know, provide us.
But in general I think potentially any regression mechanism can be used for the Parrot transformation. I don't know if it's going to be beneficial because I haven't tried them, but I think potentially you can regress any region of code with, you know, let's say support vector machines or logistic regression or any of these kinds of models.
First let's talk about the programming model. So the developer should be involved because of safety. As I told you guys before, it's important because the programmer has the only view of which part of the algorithm is safe to approximate. So the programmer needs to annotate the source code. But which kind of source code is good for the Parrot transformation? So aside from being, you know, approximable, it needs to be hot code, because we want to, you know -- we are bound by Amdahl's Law. And since we are limited by the topology of the neural network when we are doing the Parrot transformation, that region needs to have fixed-size inputs and outputs that I can, as a compiler, identify statically at compilation time.
So let's look at an example. So here I'm showing this whole, you know, edge detection algorithm. What you do is you take an image, you convert it to gray scale, and then you slide a window over it and do the convolution operation. The convolution actually, you know, estimates the gradient for one pixel with respect to its surrounding pixels. So it's inherently approximate. So the programmer is going to do [inaudible] that this is approximable. The programmer is not aware that I am going to use a neural network, just that this is approximable. And this is good code for us because it has a fixed number of inputs and just one output. So I'm going to do the Parrot transformation in three steps. The first part is that I take the annotated source code, put probes on the inputs and outputs of the candidate function, run it with a certain number of input data, and collect the training data, which are the inputs and outputs to that candidate. Then I feed that training data to the Parrot generator, which is going to explore the topology space of neural networks and do a search. We have a search space of neural networks and we try to find one. And the Parrot generator is going to give us the topology and then a number of weights. And then we take that annotated source code again, take the neural network, and replace the calls to the candidate function with calls to the neural network. And if we have special hardware, right, we are just invoking the special hardware behind the scenes instead of running the original function.
So in our example, our Parrot generator has found a 9 to 8 to 1 topology that can mimic, you know, that region. These are the neurons; these are the inputs and that's the output neuron. And -- Sorry. So what happens here is that instead of calling the original function, I'm sending the inputs to the neural network and receiving the output from it instead. So we run this on an image; the neural network was trained on a different image. This is running it with the original code, and this is running it with the Parrot transformed code. There are a few small differences -- I don't know if you can see them here -- but I think perceptually this passes.
>> : Can I ask a question [inaudible]?
>> Hadi Esmaeilzadeh: Sure.
>> : I mean, I realize that you're looking at applications where there
is this kind of perceptual aspect to it. But because you're just doing
transformation, right, on these apps is there any way to reason about
the error bounds?
>> Hadi Esmaeilzadeh: Yeah, so for the neural networks I'm going to show you, like, different applications and how we reasoned [inaudible]. So this is a good example for showing to the audience. But you can reason about the error of the neural network. The standard measure is MSE, mean squared error, and then at the application level, depending on the application, you can define an error metric, like an average error or root mean squared error at the application output level. That gives you a mathematical framework to reason about the error in the application.
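The metrics named here are standard; a small sketch of what they look like over a set of exact/approximate output pairs (assumes nonzero reference values for the relative error):

```cpp
#include <cmath>
#include <vector>

// NN-level metric: mean squared error between exact and approximate outputs.
double mse(const std::vector<double>& exact, const std::vector<double>& approx) {
    double s = 0.0;
    for (size_t i = 0; i < exact.size(); ++i) {
        double d = exact[i] - approx[i];
        s += d * d;
    }
    return s / exact.size();
}

// Application-level metric: average relative error at the program output
// (the metric used later in the talk for FFT and inverse kinematics).
double avgRelError(const std::vector<double>& exact,
                   const std::vector<double>& approx) {
    double s = 0.0;
    for (size_t i = 0; i < exact.size(); ++i)
        s += std::fabs((exact[i] - approx[i]) / exact[i]);  // exact[i] != 0
    return s / exact.size();
}
```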
>> : But you couldn't do a max error?
>> Hadi Esmaeilzadeh: You can define -- So it's your -- As a developer
you can define any error metric that suits you. Right? You can do
maximum error and if it doesn't...
>> : But you only do maximum error over the training set not over any
possible input set, right? I would assume.
>> Hadi Esmaeilzadeh: So when we are doing this we have input for training, and we have unseen input for test and evaluation. So you define the error metric over that test and evaluation set. You know, for that test and evaluation data.
>> : But you could -- If your data set never passes a negative number to a square root, you will never know that the code needs to see that and flag an error.
>> Hadi Esmaeilzadeh: Right.
>> : In either of the sets....
>> Hadi Esmaeilzadeh: Right, right.
>> : But later on it's a different...
>> Hadi Esmaeilzadeh: Right, right. So I can't -- So, like, similar to any application that uses learning, I can't, you know, bound the error and say that I will mathematically guarantee that the error is going to be less than this. What I can do is say that that's going to happen infrequently. And hopefully, since your application is approximate, the final output is still going to be okay. But, like, we can do one more step, right? We can put next to the NPU a predictor, right, which predicts if this error is going to be too large or the input is unseen. Right? And then decide if I should fall back to the original code instead of running the NPU. So that's one of the approaches that we are thinking about. I haven't implemented it yet.
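A sketch of that guard idea, which the talk explicitly marks as unimplemented; the range test is a deliberately naive stand-in for a real trained predictor, and both functions here are invented examples:

```cpp
// Stand-in target region and its approximate replacement (invented examples).
float originalFunction(float x) { return x * x; }
float npuInvoke(float x)        { return 0.99f * x * x; }

// Naive "unseen input" predictor: anything outside the training range is
// suspect. A real predictor would itself be learned from the training data.
bool likelyUnseen(float x, float trainMin, float trainMax) {
    return x < trainMin || x > trainMax;
}

// Guarded invocation: precise fallback when the approximation is risky.
float guardedInvoke(float x, float trainMin, float trainMax) {
    return likelyUnseen(x, trainMin, trainMax) ? originalFunction(x)
                                               : npuInvoke(x);
}
```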
>> : This also has to do with after the fact too, right? And throw an
exception, right? And then...
>> Hadi Esmaeilzadeh: Exactly.
>> : And then use the original [inaudible].
>> Hadi Esmaeilzadeh: Exactly. Exactly. That's actually done -- Like, there is a work called Relax which does approximation for different regions of code, and they throw an exception and roll back if the error is, like, more than a certain level.
So another application is inverse kinematics. You have a two-joint arm, and you have the X's and Y's and you want to figure out the angles of the joints. If I can play this, that would be awesome. Okay. The circle is the original application, the original code, but the arm is moving with the Parrot transformed code. So we applied this to six different applications: FFT from signal processing, inverse kinematics from robotics. I didn't do a three-joint arm, because a three-joint arm doesn't have, like, a closed-form solution and the [inaudible] are huge, because you can just learn where the arm goes.
We did it for part of the jmeint game engine, in which you figure out whether two triangles in 3D are intersecting or not. We did it for parts of JPEG. We used it for K-Means and the Sobel filter. For K-Means, to make the errors easier to understand, I used K-Means for image segmentation. And these are the neural network MSE errors, and these are the application-level errors that we see. For FFT...
>> : What does it mean? What is the application level error...
>> Hadi Esmaeilzadeh: Right.
>> : Is that...
>> Hadi Esmaeilzadeh: So for FFT and inverse kinematics, I used the relative error as the metric. For jmeint, you either hit or miss whether two triangles are intersecting, so this is the miss rate. And for JPEG, K-Means and Sobel, I'm using the image difference between the output of the original code and the Parrot transformed code: the pixel differences, averaged over the pixels of the output.
>> : For JPEG are the errors usually like clustered to a few really bad
ones or is it just kind of a smooth error?
>> Hadi Esmaeilzadeh: It's a smooth error. It's a smooth error. So we
can do Parrot transformation but what opportunity is there for when we
are doing the -- Sure.
>> : I'm sorry. Can you go back one slide?
>> Hadi Esmaeilzadeh: Sure.
>> : If I were to do a normal JPEG compression to an image, do you have
any idea how that relates to your error metric?
>> Hadi Esmaeilzadeh: I haven't measured that. But this image is worse
than normal JPEG.
>> : Okay.
>> Hadi Esmaeilzadeh: So the quality...
>> : Is it [inaudible] or -- Is it something you can visualize, you can
see the difference? Or is it...
>> Hadi Esmaeilzadeh: I can visualize the image difference and see...
>> : Okay.
>> Hadi Esmaeilzadeh: ...how that looks.
>> : Can you post process to computer quality factor with JPEG? Like
the final...
>> Hadi Esmaeilzadeh: Yes.
>> : ...compressed images to...
>> Hadi Esmaeilzadeh: Yeah.
>> : ...use their metric?
>> Hadi Esmaeilzadeh: Yeah. Actually the average or MSE difference is the image difference metric that, you know, image processing people use. That's the reason I used it. And this is between, like, you know, after you do the JPEG compression and after you do the Parrot transformation; here you see, like, an average of 10% error. So now that we have this Parrot transformation, we can have different implementations of the neural network. We can do it as a library -- we can just call a neural network library -- or I can use a digital NPU.
This should be analog. Sorry. I fixed the graph but I didn't update
the...
>> : That's all right. We can use analog [inaudible].
>> Hadi Esmaeilzadeh: Yeah, so this should be analog. That's --
[ Laughter ]
Yeah, so. So that's why I came up with this idea: I wanted to do analog computation while having normal digital programming interfaces but using analog circuitry. And there is a whole body of research that shows we can do analog implementations of neural networks. So we can use analog. And I'm going to talk about parts of it.
So let's do the software first and see what happens when we do the software. So say I use a library. This is application slowdown and these are the, you know, different applications. And I'm using the Fast Artificial Neural Network library. It's a widely used open source C++ implementation. You see that this is kind of -- You know, at least the CPU implementation is not working. But -- Sure.
>> : Is this if you do your technique how much the code slows down?
>> Hadi Esmaeilzadeh: Yes. If I just use a library of neural networks
without any hardware support.
>> : Fine. Go ahead.
>> : This is how not to do approximate computing.
>> Hadi Esmaeilzadeh: Yes.
[ Laughter and multiple comments ]
>> : But you leak a lot more and so you consume a lot more energy
because you're running so slowly and so it's a huge opportunity for
speed ups and improvements. Straw man.
>> : It's like power consumption branch predictors.
>> Hadi Esmaeilzadeh: I'm trying to make a case for two things. One is how we can change the processors so that we can, you know, actually gain benefit from this without adding any extra hardware to the processor, without adding an actual big accelerator to the processor.
>> : And no algorithmic changes.
>> Hadi Esmaeilzadeh: With no algorithmic changes.
>> : Can we flip back one, please?
>> Hadi Esmaeilzadeh: Sure.
>> : Does the difference in the heights relate to the size of the input
or the computation size? Or what's the difference between 5 and 75
typically?
>> Hadi Esmaeilzadeh: So it depends on how much of the computation goes into that region of code, right? In the FFT we are spending, like, you know, 30% of the computation in that region. So I'm slowing down that region, and then Amdahl's Law translates that to this 4.5. In, like, jmeint, that region is most of the application, and the implementation actually is based on a paper; it is a very efficient implementation. And I'm using a large neural network because it's a very complicated, control-flow-intensive region to approximate. So I'm trading off a little bit of computation for a large [inaudible] computation, and I see a huge slowdown.
Does that make sense?
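A quick back-of-the-envelope version of that answer, running Amdahl's Law in reverse on the FFT numbers above (the roughly 12.7X factor for the region itself is inferred here, not quoted in the talk):

```latex
\text{overall slowdown} = (1 - p) + p \cdot s,
\qquad p = 0.3:\quad 0.7 + 0.3\,s \approx 4.5
\;\Rightarrow\; s \approx 12.7
```

So if 30% of the runtime sits in a region that gets about 12.7X slower under the software library, the whole program lands at the reported 4.5X.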
>> : Yeah. I guess I was expecting a statement along the lines of as
the number of instructions I'm replacing goes up, my penalty goes down
because the ANN is sort of fixed time so you want to increase the
amount of code you cover.
>> Hadi Esmaeilzadeh: Exactly. That's one of the things. But there is another side to it: how much computation you are replacing in that region. Right? As you enlarge the region, the neural network potentially grows, and you may not end up with a gain there.
>> : So the example you gave where you had the approximate annotation
on the code didn't have a loop, I don't think. It was just straight
floating point computation. If I have a loop then I can potentially
have a program that doesn't terminate. So I'm trying to get to the
point of when does this break down and when can you not learn a
particular function? Have you guys tried to classify -- I mean
effectively you could reduce this down to the halting problem I
suspect, right? Because now you have a neural network that's going to
tell me whether or not my program halts.
>> Hadi Esmaeilzadeh: Right.
>> : So when does this break down and when can you -- But for some
programs I can actually deal with loops?
>> Hadi Esmaeilzadeh: Right. Right. For JPEG there is a loop inside...
>> : Yes.
>> Hadi Esmaeilzadeh: ...that goes over the pixels, like 64 of them. You don't, like, implement it linearly. But it's not a loop that, you know, goes around forever. So the region of code should not, you know, change any state besides its outputs. The neural network that I am replacing the region of code with, right, takes a bunch of inputs and generates a bunch of outputs. Right? And if that code is changing something besides the outputs, either I have to hoist it out and do a bunch of computation to account for it, so that this region of code is pure, or I can't do it.
>> : I think it's a much simpler answer. So you can -- Now slap me if this is just way off base, but I think you can -- Let me move to the side here. I mean, you can grow the bounds of the region you were considering for the neural network until you found a program region that was well structured with this input and output behavior. So there might be lots of internal state communication within it, you know, if you [inaudible]. And in your example, you know, you could have -- you know, you have some inputs, which is the program, and then an output, which is, you know, whether the program halted. Right? And you could try to train the neural network to do that. And that has well structured inputs and outputs, but you're not going to be able to train the network to do it. So I think, to your point about the halting problem, it's really just that some functions are amenable to neural networking and some are not. And the less they're amenable, the more error you're going to get. And with the halting problem you're just going to get pure entropy. You're going to get a random result, and you basically have zero signal there because you can't train the network to do it. And so it's really just, you know, there's a spectrum of how much error you're going to have. And for some things, you know, the error will be infinite.
>> Hadi Esmaeilzadeh: Usually it's random, 0.5. You can have, like, a random neural network. If it's, like, a classification problem, you get a random...
>> : Right. So is that right, do you think?
>> : Yeah, I think that is right. I mean, I think the idea is that if you have N inputs that you're training on, you're going to get some answer from the network. If you use N plus 1, the entropy stays the same. The amount of error stays the same. It doesn't go down. Right?
>> : Right.
>> : As you add that training in. If you keep adding and adding and
adding, you're never going to get to the position where you actually
converge.
>> : That's right.
>> : It's really just a question of how learnable is the function?
>> : Yeah.
>> Hadi Esmaeilzadeh: And the dimensionality of the space of the data, like the input: if you grow the dimensionality, then training is going to be, you know, harder and harder.
>> : I have another question, and maybe you're going to get to it. But it kind of comes back to the algorithmic aspect: if these applications know that they can tolerate a certain approximation in their result, where would changing the software implementation get you -- not using your library to learn and train a neural network, but changing the implementation of the code to tolerate that much less quality in the result set? Right? I mean, I'm trying to get at the gain that you get from invoking the neural network with some special hardware or whatever, right, versus algorithmically doing less work and getting a less accurate --.
>> Hadi Esmaeilzadeh: So I can give you one example answer. When I was doing JPEG, I downloaded a JPEG code and I was using it. And when I did the Parrot transformation I was getting, like, 100X speedup. The reason was that the DCT part of the JPEG was, like, an exact cosine transformation, and that's very, very slow. Then I changed the implementation; I found another implementation where the DCT was an approximation of the exact one, and my speedup was around, like, 70%. So even at the algorithmic level you can do approximation, but there is a limit, and this goes beyond that. The other thing is, we wrote another paper -- we proposed another architecture which changes the processor itself and supports approximation instructions. That's one of the ways you can deal with approximation. The gains there were no speedup, but, like, 20% to 40% energy reduction. So there we were going after, like, 20% energy reduction. And here we are seeing around, like, 2X speedup, 3X energy reduction when we are doing this.
But one of my colleagues, Adrian -- you know Adrian -- is working on compiler optimization, unsafe compiler optimizations that you can apply to regions that are approximate, to execute less code and see how much you can get away with. I don't have a head-to-head comparison to that technique.
So during my internship -- so this is what I did before -- I started with generated code: instead of using the library, I can take the neural network and generate code which is efficient. Then, as you're going to see in the results, one of the things that causes this slowdown is the sigmoid function, which takes a large part of the time. And then -- like, I finished it two days ago -- I did AVX code generation with the Intel Ivy Bridge AVX extension to see if we can use vectorization and get better results.
So before I show you the results I want to talk about the AVX code generation. We can do vectorization in two different models. One is that each neuron is summing and multiplying [inaudible]. So I can use the parallelism inside the neuron and do the vectorization like this, with different inputs getting multiplied. Or I can go across neurons, with just one input for different neurons in each vector. I took the latter because I think it gives a better vectorization approach: with the first one, at the end I have to do a ladder of additions to get the final result, whereas here I can do the additions for different neurons at the same time. So I implemented this one. Let's look at the results.
So before I show you the slowdown: a byproduct of doing this work is that, let's say we want to do neural network execution regardless of the Parrot transformation -- how much gain can we get with these techniques? So this is speedup over the FANN library when you are doing neural network execution. This is the generated code; you get around a 1.5X speedup, 50%. And when you add the sigmoid instruction you see a huge bump in the geometric mean; the speedup goes up to around 7.3X. And then when you do the vectorization with vector support for sigmoid, you get an order of magnitude speedup for neural execution.
>> : [Inaudible] floats?
>> Hadi Esmaeilzadeh: These are floats, single precision. And one thing that I found surprising is that AVX sometimes also results in slowdowns for small networks. But this is the biggest network that I had, 18, 32, 8, 2, and I see a huge bump here, around, you know, 13X speedup over, you know, the generated code with the hardware sigmoid. So this is the application slowdown when we apply these techniques. After I did the code generation and gained something, I added the sigmoid instruction to the processor; actually two of the applications speed up even without any hardware support for neural execution. Just the sigmoid instruction and the vectorization take the slowdown from 15.7 to 2.0.
So this kind of makes the case for using dedicated hardware to do the neural network execution. So I'm going to talk about the digital hardware implementation. For this we needed configurable hardware that can realize different neural networks: different regions of code require different neural networks, so the hardware implementation needs to be reconfigurable. Then we needed a microarchitectural interface between the accelerator and the processor. And we needed ISA extensions that let the processor communicate with the NPU. And at the end -- I thought that I was talking [inaudible]. So at the end, since we are doing very fine-grained acceleration, this integration of the hardware neural network should not hinder speculative execution or out-of-order execution in the processor. So we designed a reconfigurable digital NPU. Each neuron essentially is a multiply-add unit with a weight cache that just crunches through the multiplies and adds, and then a hardware sigmoid unit. And then these are the three FIFOs that are exposed to the processor. So the processor sends inputs to this FIFO, reads results from that FIFO, or configures the NPU and sends it the, you know, weights.
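A software model of that three-FIFO interface, just to make the dataflow concrete; the real design exposes these queues through ISA extensions, which are not reproduced here:

```cpp
#include <queue>

// Software stand-in for the processor/NPU boundary: the core pushes a
// configuration (topology and weights) and inputs, and pops outputs.
struct NpuInterface {
    std::queue<float> configFifo;  // topology and weights
    std::queue<float> inputFifo;   // inputs of the replaced region
    std::queue<float> outputFifo;  // results produced by the NPU

    void sendConfig(float w) { configFifo.push(w); }
    void sendInput(float x)  { inputFifo.push(x); }
    float readOutput() {           // a real core would stall here if empty
        float y = outputFifo.front();
        outputFifo.pop();
        return y;
    }
};
```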
So I used the MARSSx86 cycle-accurate simulator, configured very closely to an Intel Core architecture, with an 8-PE NPU. And I compiled the applications with -O3 so that I don't bias the results. That's fine. So these are the application speedups. The dark part shows the actual speedup with the NPU that I showed you. The light part shows the ideal speedup that I would have gotten if I had a zero-delay NPU. So [inaudible] 2.3 here. In one of the applications I actually see a slowdown even with hardware, because that region of the code for K-Means is that [inaudible] calculation. It's a very fine-grained region; the application spends like 30% of the computation in there, so even though the network is small, the actual code is very efficient. So I [inaudible] on 3X energy savings here for the applications, but I would have gotten 3.9X if I had a zero-energy NPU.
So the question is, can we move this further? Can we push it further? This is actually an analog NPU. For part of my internship I studied the feasibility of moving toward an analog implementation of neural networks. To do that, as I said, the ANPU needs to be reconfigurable. So what we are going to do here is do the computation in analog, and the storage and communication between the units of analog computation in digital. We call each of the computational units that carries out the computations of a neuron in analog a PE, a processing engine.
So we can have an array of APEs, analog PEs, and then we have to figure out how to map the neural network onto it. One way is to time-multiplex neurons over the APEs. In that case we're going to do the computations of neurons 1, 2 and 3 first and then use the same APEs to do 4 and 5. Right? The other approach is to have a two-dimensional array of APEs and geometrically map the neural network to it. The good opportunity with the geometric design is that the communication between the APEs can be analog, instead of converting it to digital and then communicating it to the...
>> : It's not [inaudible]. You can time multiplex it if you do a D to A
and A to D conversion for the analog units.
>> Hadi Esmaeilzadeh: You can do it but here you don't have to do it.
>> : Yeah, if you want to stay in the analog domain then you need the
multiplex design.
>> Hadi Esmaeilzadeh: If you want...
>> : And you get the geometric design.
>> Hadi Esmaeilzadeh: Yes.
>> : Yeah.
>> Hadi Esmaeilzadeh: Yes, exactly. And there are other factors like
resource utilization and fault tolerance.
>> : So I'm a software guy. Can you give me a little two-second understanding as to why I'd want to use analog rather than the digital [inaudible]?
>> Hadi Esmaeilzadeh: So the reason is that with analog you can do addition by just having a point here -- like, you have multiple wires coming to one point -- and using Kirchhoff's Law to do the addition. So you don't have to convert it to bits and, you know, things like that. And you can actually do the multiplication. Here we are not actually doing multiplication, we are scaling the input: you can use a, you know, resistor ladder, pass a current through it, and use that resistance to scale that current, and then do the addition. So you're just using Kirchhoff's Law to do the multiplication and addition. That's much, you know, more efficient than doing the digital computation.
>> : I think a broader answer is that in analog circuits you can
implement much more efficient computational primitives. I mean you can
do integration, addition just by building a circuit that physically
mimics the function. You just don't get digital precision. Right? I
have two wires with currents on them and I tie them together and that's
an add. It's pretty efficient.
>> : There's a lot of work on that.
>> : Yeah [inaudible]. What's that?
>> : Get analog precision because it's arguably better in some
scenarios.
>> : Let's take that one offline.
[ Laughter ]
>> : It's not wrong; it's complex.
>> : Yeah.
>> Hadi Esmaeilzadeh: So what we are going to do is have the communication between the neurons in digital but the computations inside the APEs in analog. So we have to decide how many inputs we're going to feed to that analog unit; like, conceptually, [inaudible] you can have multiple -- as many wires as you want coming to that point to do the addition. But analog circuits tend to work in a certain small-signal region. So you have this region of current where everything is linear and you're getting that, you know, addition effect. But if you go beyond that region, then the nonlinearities in the analog circuit will kick in and, you know, ruin your precision. So one of the things is the computation width: how many inputs we are feeding to that analog PE. The other thing is the number of bits that you're going to choose to represent a number. Like, with single precision, with your 32 bits, you have a very large dynamic range. But with the analog circuit you are moving toward fixed-point operations, and as you increase the number of bits, the speed and the energy are going to change drastically. I'm going to show you some results.
Before I actually show you the results, I'm going to show you a little bit of the circuitry. So what we do is convert the inputs, the bits, to currents. Right? You can have a current source, and another source which is two times the first, and four times, eight times, and when you have ones in those bit positions, the current that goes through those sources gets multiplied by that factor, and then you have a current value which is representative of the bits that you had in the input. Then you can do the scaling, or the multiplication, with a resistor ladder. And then if you want to subtract or add -- do the addition -- you have to choose whether you want to take the negative current or the positive current. And then you -- So this unit does the multiplication for eight inputs. And then you have the addition, which is just tying the wires together. And then you have the A-to-D conversion, which also naturally applies the sigmoid. And you get the output.
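Putting that circuit description into equations -- an idealized sketch that holds only inside the linear small-signal region discussed above:

```latex
I_{\mathrm{in}} = I_0 \sum_{k=0}^{n-1} b_k 2^k
  \quad \text{(binary-weighted current sources: bits become a current)}

I_j = w_j \, I_{\mathrm{in},j}
  \quad \text{(resistor ladder scales the current, i.e. the multiply)}

I_{\mathrm{node}} = \sum_j \pm I_j
  \quad \text{(Kirchhoff's current law: wires tied at a node add)}

y = \sigma(I_{\mathrm{node}})
  \quad \text{(the A-to-D stage applies the sigmoid)}
```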
So we went with the time-multiplexed ANPU design. This is a conceptual design; it's not, like, realized yet. But you have these APEs with eight inputs. These are the input-output FIFOs, and then you, you know, do the communication digitally between these units. Our methodology for the design space exploration is that we do Cadence transistor-level simulations and then feed them to a software simulator that realizes the entire ANPU. And the first thing that we did was to see how far we can push the bit-width. Right? We want to identify how small a bit-width we can use. This is the single precision result and that's the error. Right? And this is the number of bits for the inputs, and this is the number of bits for the weights, which go through the resistors. You can see -- it's, like, behind these lines -- that 8 bits is enough. Right?
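A sketch of what such a bit-width sweep looks like in software, assuming simple uniform fixed-point quantization of values in [-1, 1]; the actual study drove this from Cadence transistor-level simulations:

```cpp
#include <cmath>
#include <cstdio>

// Quantize x in [-1, 1] to signed fixed point with the given total bits.
float quantize(float x, int bits) {
    float levels = float(1 << (bits - 1));  // one bit reserved for sign
    return std::round(x * levels) / levels;
}

int main() {
    // Representation error of one sample value at each candidate width.
    const float x = 0.6180f;
    for (int bits = 2; bits <= 10; ++bits)
        std::printf("%2d bits -> |error| = %.6f\n",
                    bits, std::fabs(x - quantize(x, bits)));
}
```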
Okay, let's look at the energy projections. This is energy, and this is the number of bits that you use in the APE. Because of the design of this digital-to-analog conversion, the sizes of the current sources increase exponentially, so the energy is going to grow exponentially as you increase the number of bits.
If we look here, this is a 16-bit digital FP design at two different frequencies. Around 8 bits of input we see a 10X energy reduction. This is given the fact that we are doing A-to-D and D-to-A between the neurons and doing the communication in the digital domain. If we do the geometric design, this is well beyond 100X, you know, energy efficiency with analog.
So for this I have worked with Doug, Luis and Professor Hassibi from UT Austin on the analog parts, and I have also worked with Adrian from the University of Washington and Renee from the University of Texas. We have a new guy at the University of Washington; Thierry is working on a future implementation of this, so that we have a conceptual design that actually accelerates an ARM core which is on the FPGA. We got the board and we are, you know, pushing that forward. We have a webpage for this project, and we're going to, you know, provide the compilation workflows, the tools that I developed, the forward code generation of neural networks. I think that's important because, you know, Google had this project where they did a very large scale neural network on their clusters. So the work that I did was toward using neural networks for accelerating general-purpose code, but the byproduct of the code generation and using AVX and things like that can be beneficial for such projects when they are doing, you know, neural networks, or if Microsoft is interested in this.
The thing that I didn't talk about is that for the [inaudible], I was the compiler. But part of the, you know, internship was developing the compilation workflow, and right now it is pragma-based, so the user uses pragmas. And I made it more flexible, to let the, you know, developer specify errors, ranges of inputs or different things that, you know, can be used. And during my internship I worked on the [inaudible] for these two papers as well. So that's all I've got. This is "The Fifth Day of Creation." That's kind of where we are.
>> : All right. So can I give you some advice about your talk?
>> Hadi Esmaeilzadeh: Sure.
>> : Because you're going to be giving a variant of this when you go
out on the interview circuit. You should render this with the
approximation and end on that. And not tell them and then flip to the
original and say, "Here's the version that was done digitally."
>> Hadi Esmaeilzadeh: Okay.
>> : Right? Because every talk you give at a major university, people are going to jump on you and say, "I don't believe you can give up accuracy. How can you give up digital precision?" Blah, blah, blah, blah. And they're probably right. So that little trick will sort of kind of anticipate that objection and head it off at the pass. And you can just say, "Well, as I've just shown you, you know, there are cases where you can't tell the difference." I mean, you do it with the monkey and you might want to leave that in. You know, it would just show that you've anticipated it, and then you can do it with a little smile. And, you know, that would be a really nice way to...
>> Hadi Esmaeilzadeh: Okay.
>> : ...end it.
>> : I have a question.
>> Hadi Esmaeilzadeh: Sure.
>> : If you go back to the slides where you have the -- No, next one. This one. Right, where you've got the energy. So you could use this to indicate how much the algorithmic changes would have to -- Right? -- how much you'd have to reduce the algorithmic complexity in order to match your energy savings.
>> Hadi Esmaeilzadeh: Right, right.
>> : Right?
>> Hadi Esmaeilzadeh: Right, right. Because we haven't, like, done this
implementation. We are planning to do it for ISCA this year, you know,
the analog implementation. So I was a little bit cautious about talking
too much about it. Right. No, but you were right.
>> : I mean, for example, JPEG. JPEG is a good one because it's got the
ability to tune it, right? It's built into the algorithm. Right? To get
the same 10% error that you had, right, if I set the quality factor at
that, use that as the thing that you normalize to...
>> Hadi Esmaeilzadeh: I see.
>> : ...show what your energy savings are.
>> Hadi Esmaeilzadeh: Sure, sure. Yeah [inaudible]....
>> : Right? So the algorithmic changes actually occur and you want to
beat that.
>> Hadi Esmaeilzadeh: Okay.
>> : Right?
>> : So I'm curious if you thought about following this up. The
examples that you gave are very -- they're...
>> Hadi Esmaeilzadeh: Small.
>> : ...numerical and the approximation is pretty obvious. Right? And
maybe this goes a little bit to Doug's point at the very beginning of
the talk. There's a lot of situations on the phone, for instance, where
everything I sense -- My phone is effectively sensing the world around
me all the time, and all of my programming models that currently exist
for that sensing data are discrete and work on facts. When the reality
of the situation is that those sensors are not giving me factual input.
Right? They're telling me my approximate location.
>> : That's exactly right.
>> : And so I wonder if you thought about how do you change -- I mean,
as a programmer how do we start to talk about this? Error is great but
I don't think it's the right solution because it's very problem
dependent. So how, as a programmer, do we start talking about dealing with this -- like, surfacing this kind of approximation up to a programmer at the level of the type of people that are writing JavaScript? Right? How do
we allow people who are writing these very high-level APIs to reason
about approximate computations? This is a hard question and I don't
mean -- I mean, I'm just curious if you've given any thought to it.
>> : Can I add to the question? So I want to make the problem bigger.
Okay? So if you're building an app for a mobile phone and we're trying
to do a lot of stuff in our group around inference and, you know,
extraction of these high-level semantic signals from these noisy
sensors and things you do like browse the web and all that. Okay? So we
have a really good understanding -- or appreciation for the problem. In
this energy-limited world in some sense what you want to do, which we
don't know how to do either, is give the programmer an energy budget
and say, "What's the best answer I can get with this energy budget?"
And it's not that your thing is going to add error. Your thing might
allow them to do algorithms in that fixed energy budget to give them a
much better result with a lower error. But if it's the digital
representation versus the trained representation with error, it's
exactly right it's worse. But the answer is you'll be able to do much
better stuff with this because you're energy limited. And so how do you
say to the programmer, "Here's a hundred joules, and you want to figure
out whether the user's at work or at home." Right? Now this will
actually probably let them do a better job but it's not about error
because the algorithm you would use if you had to do it digitally would
have more error. And so I think that's what you're saying, right?
>> : Yes, that's exactly right. Yep.
>> : And so in some sense you want to give the programmer a bag of
energy and say, "Here you go." You know? "Now you have all these
different choices you can do." And you kind of want to run it through a
tool flow and say, you know, "This meets your energy budget. This
doesn't." And maybe it's the downsampling of your GPS sampling.
Actually, I haven't seen any work on this and this would be really
cool.
>> : So energy versus accuracy?
>> : Giving a programmer -- In Visual Studio you have an energy model
and a model of the system and processor. And you have an energy budget.
>> : That's static.
>> : Yeah, I have a static energy budget. You know, you've got some -- You know, I've provisioned a million joules and you have a bunch of
templates, sketches, right, and you'll compile the code. You'll do the
analysis. You'll run it against the model. And then the system can
automatically adjust, you know, the downsampling of the data, the number of loops, and scale the sketch down to meet your budget. And then you
can say, "Approach X, Y, Z, A, B, C. Which one gives me the best
results?" You know, that's what we're going to have to do in the
future.
>> Hadi Esmaeilzadeh: Yes. Exactly.
>> : And it's a compilation problem, too.
>> : It's a compilation problem. Actually this would be a really
interesting project.
>> : It sounds like you're turning normal software into FPGA
computation and timing closure.
>> : No. No, no, no, no. I mean there's an element of that...
>> : It's approximate energy.
>> : Yeah, yeah. But you've...
>> : [Inaudible].
>> : ...got a model. But you've just got to expose the model.
>> : Right.
>> : Right?
>> : Right.
>> : I mean you can slice it two ways. You can say, "Here's --." I can
have some trace and I can do some profiling. And I can take these
different sketches and say, "Here's the quality of the answer you get
and here's the amount of energy that each of them consumes." And you
want the...
>> Hadi Esmaeilzadeh: But something like Gprof which, instead of giving you the time that you, you know, spend in each function, gives you the amount of energy that each unit is spending.
>> : It's Eprof. Energy and error. It's like [inaudible] problem.
>> : So we've started doing a piece of this, Kathryn McKinley and
myself. We've been instrumenting Windows phone to provide effectively
exactly what you talked about as an energy limit. We haven't got to the
point where -- And then that budget then is used to inform. And it's
predictable. You know, you have to predict what the budget is for
today. I think that normally you turn on your power -- or, I'm sorry,
you plug in your phone at six o'clock at night.
>> : Yep.
>> : And so that gives me now an -- I look at the battery. Now I say,
"Okay, I've got an estimated amount of power that I have to get through
the day." And then you -- I'm sorry energy.
>> : Yes.
>> : And then...
>> : I'm sorry, that's a pet peeve of mine.
>> : And then you have to make decisions based on that, right? But
we're nowhere near, I think, what you just described.
>> : Yeah, yeah.
>> Hadi Esmaeilzadeh: I think like...
>> : [Inaudible] ecosystem that we did a long time ago. Do you know
what that is?
>> : I think I've heard of it, yeah.
>> : Yeah, yeah.
>> : So this [inaudible]...
>> : [Inaudible]....
>> : We have the right set of people in the room.
>> : And then force it.
>> : You did a great job today, Hadi. Good talk.
>> Hadi Esmaeilzadeh: Thank you.
[ Audience applause and commenting ]