>> Aaron Smith: All right. So today we have
Mehrzad Samadi from the University of Michigan.
He's advised by Scott Mahlke. He's done two
great internships here with us at MSR, and
recently won a Microsoft best paper award.
>> Mehrzad Samadi: Yeah.
>> Aaron Smith: Congratulations on that.
>> Mehrzad Samadi: Thank you.
>> Aaron Smith: He has a lot of other really
interesting research in major conferences, and
hopefully we'll get him here as a postdoc working
on the EDGE architecture topic. So today's talk
is titled Dynamic Orchestration of Massively
Data Parallel Execution. I have a feeling it's
going to be about GPUs and heterogeneous
computing.
>> Mehrzad Samadi: Yep.
>> Aaron Smith: So --
>> Mehrzad Samadi: Thanks a lot, Aaron. It's
great, really great to be back here, and it's a
good time to be here because of yesterday's game.
Congratulations on that.
Okay. Let me get started. This is the project
that I've done during my studies at the University
of Michigan. It's called dynamic orchestration of
massively data parallel execution. The main goal
of this project is to get the best performance
for massively data parallel applications,
specifically on GPUs, which are designed to
accelerate these types of workloads. But let's see
why GPUs.
GPUs are everywhere these days, from
supercomputers, servers, desktops, and even in your
cell phones you can find GPUs. And their job is
to give you good performance for data parallel
[indiscernible]. How do they do that? They have
many cores working all together on different
data sets. And as a programmer, what you need to
do is to launch thousands to millions of threads
to get the best performance out of that, but --
>>: Can you answer a question about GPUs that
I've never understood, which is we're in an
energy limited era. Is this model of thousands
to millions of threads the best, the most energy
efficient way to extract this sort of DLP?
>> Mehrzad Samadi: Okay. That's really good
question. I don't know about the whole thing,
but so far, if you have enough data that you can
feed these many cores, it is pretty energy
efficient.
>>: But don't GPUs, when you use that model,
require you to do enormous amounts of data
movement in and out, on and off the chip from
DRAM, which is why they provide all that
bandwidth and that data movement burns a lot of
energy.
>> Mehrzad Samadi: Yes, that's completely true.
And that's why they are trying to mix these
together, put the CPU next to the GPU. That's
what they're trying to do.
>>: Nothing about the graphics model, the GPU
model with, you know, highly, highly threaded
codes that latency tolerance, right? So, you
know, I just wonder whether, if you're energy
constrained, whether this model is actually the
right way to get the right -- the highest
performance.
>> Mehrzad Samadi: It's -- okay. It's two
questions. Is it the right model to get the
highest performance or energy --
>>: [inaudible].
>> Mehrzad Samadi: Oh, if you're energy limited.
>>: Which we are.
>> Mehrzad Samadi: Yes, that's true. So far
it's pretty energy efficient if you are not
energy limited. I mean, if you have a cable next
to this and compute energy, you are good. But on
the cell phone and something like that, I think --
>>: You're energy limited in servers, too.
>> Mehrzad Samadi: On servers, yeah, that's true.
>>: Everywhere.
>> Mehrzad Samadi: Yeah, that's true.
>>: Probably not desktops.
>>: But still, one of the highest compute per
joule or whatever.
>> Mehrzad Samadi: Yeah, if you compute that,
it's pretty high, but --
>>: It's the highest compute. I don't think
it's the highest compute per joule.
>>: There are alternatives, right? Anyways, so
there's a presupposition here that this is the
right model, and this model to me seems
fundamentally energy inefficient, and we're in an
energy-dominated era, and that's something I've
never really understood, whether there's a better
alternative or this is the only way to do it, pay
the energy tax. So why don't we move on.
>> Mehrzad Samadi: Okay. Let me say something.
At least in desktops I know that if you don't
consider moving data from CPU to GPU --
>>: No, [inaudible] I'm talking about moving it
in and out of the DRAM.
>> Mehrzad Samadi: Oh. I see.
>>: I'm not talking about migrating [inaudible].
>> Mehrzad Samadi: Okay. [indiscernible].
>>: I'm saying that, you know, if you have the
data residing on chip and you can beat on it,
it's really efficient. The model here is you're
going to move -- you're presuming that everything
is going to be a cache [inaudible], which is why
you need so many threads, and so you're doing
massive amounts of data movement on and off the
chip.
>> Mehrzad Samadi: Yes. Yes.
>>: And that seems inefficient to me. So that's
all I'm saying.
>> Mehrzad Samadi: Okay. Sure. Okay. One
problem is it might not be energy efficient, like
I said. You have these millions of threads. How
do you want to manage them together, manage them
to get the best performance?
So in this work, I want to show you how to manage
all these threads to get the best performance out
of them. So basically what I want to do is this
guy's job.
So let's see why GPU programming is hard. Here I
show the peak performance of NVIDIA GPUs over the
past seven years. And as you can see, it grows
rapidly: we went from 500 gigaflops to now around
five teraflops of peak performance. Okay.
Let's see what we can get with software. Here I
took matrix multiplication from the CUBLAS
library. The CUBLAS library is written by Evan
[indiscernible]. It's highly optimized. It's the
best code available for matrix multiplication
right now.
As you can see, they get really good performance
growth, but still there is a gap between what GPU
can provide and what you can get writing your own
software. But this is not the only problem of
GPU programming. Another problem which I am
trying to attack is performance portability.
>>: Can I get [inaudible] --
>> Mehrzad Samadi: Sure.
>>: Why doesn't it follow more the GTX 680
point? The CUBLAS cycle seems to be kind of
linear with the 480/580.
>> Mehrzad Samadi: Oh, what's happening with the
680?
>>: Yeah.
>> Mehrzad Samadi: Okay. I don't know if I can
say it on camera or not, but I talked with
one NVIDIA engineer and I talked with another
NVIDIA engineer, and I heard two stories. So I
will tell you both of the stories afterwards,
okay? So, go ahead, sir.
>>: So is that the [indiscernible] limitation or
is this just a software-level limitation that you
don't see as much as the peak performance?
>> Mehrzad Samadi: Okay. It's -- there is no
hardware limitation. It's probably all software,
like --
>>: I mean, the code is like, [inaudible]
limitations or why is that [inaudible] --
>> Mehrzad Samadi: Oh, if you -- yeah. There
are both. Actually, this one is pretty
optimistic. All you have to do here is
multiply-add, back to back.
>>: Yes.
>> Mehrzad Samadi: But in matrix multiplication
you have other stuff and --
>>: Yeah.
>> Mehrzad Samadi: But still, if it's not --
matrix multiplication is the best example. If
you go to other workloads, it will be, yeah.
>>:
[inaudible].
>> Mehrzad Samadi: Yeah. Okay. Another problem
is performance portability. When you write a
program, you have several assumptions in your
mind: this is the GPU I want to optimize for,
this is my input size, and so on. You write one
implementation, which I call fixed implementation
code, and optimize for those assumptions. What
will happen if those assumptions change when you
are running? So --
>>: [inaudible] is it really a problem with the
GPU? Because the same thing is true about CPU
code, as well.
>> Mehrzad Samadi: That's exactly correct.
That's exactly --
>>: You can look at them [inaudible] --
>> Mehrzad Samadi: You have the exact same
problems in the CPU world, but here I will show
you my results. The problem is more important
because by changing one of these, you lose all
your performance. Okay?
>>: CPU -- the GPU is more sensitive.
>>: [inaudible] they're clipped.
>> Mehrzad Samadi: Yeah, exactly. Deeper cliff.
Okay. Let's see what are the problems of fixed
implementation code. The first one is device
portability. You optimize for one GPU, but when
you run it on another GPU, it doesn't give you
the performance that you want.
For example, I took matrix multiplication
from the NVIDIA SDK. It's not as optimized as the
CUBLAS library, but it's pretty decent code.
Basically, after taking a three-month course on
CUDA programming, you can write this matrix
multiplication.
And as you can see, this is only one
implementation; I ran it on three different GPUs.
As you can see, the GPUs are getting better and
better, but the code doesn't show any performance
improvement.
Portability is not specific to the device. There
is another problem which I call input
portability. When you write your code for some
input size or even input shape -- for example,
you write your code for a square-like matrix and
you run it for some other shapes or sizes -- you
will not get good performance.
Here I have matrix vector multiplication from
CUBLAS Library, and matrix size is the same. I
just changed the shape. So here is a rectangle.
Here is close to a square and again a rectangle.
As you can see, it gets -- it gives you the best
performance for close to the square, but you
won't get -- you will get almost zero performance
when you have longer matrices.
The third problem is when you have
irregular dependencies, something like indirect
memory accesses. What will happen then? You
don't know if your code is parallel or not, and
if you want to write one fixed implementation,
you need to be conservative and write sequential
CPU code.
So then what will happen if your code is actually
parallel? You are losing all that GPU
computation power.
Here I show you several applications that have
indirect memory accesses, and if I want to be --
if I want to be conservative, I will run them on
the CPU. Then this is the performance of the CPU.
And I normalized it to the GPU: 100 percent is
GPU performance. So by being conservative, you
are losing this much performance, because you
don't know this -- I figured out they are
parallel, but the compiler doesn't know they are
parallel because they have indirect memory
accesses in them.
And the last one is value portability. Even
the values that you are processing can have an
impact on your performance. Let me show you an
example.
Here I have the performance of the atomic add
operation in CUDA programming. Atomic operations
are those where you update the same element
atomically. So what the GPU does if several
threads access the same element is serialize
those accesses. So on the X axis I have conflicts
per warp, or 32 threads. As the number of
conflicts goes up, you can see that performance
drops rapidly.
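A minimal CUDA histogram sketch of that conflict behavior (an illustration, not the slide's code; NUM_BINS and the pixel layout are assumptions):

```cuda
#include <cuda_runtime.h>

#define NUM_BINS 256

// Each thread reads one pixel and atomically increments its bucket.
// If many threads in the same 32-thread warp hit the same bucket,
// the hardware serializes those updates, which is the conflict
// behavior plotted against conflicts per warp.
__global__ void histogram(const unsigned char *pixels, int n,
                          unsigned int *bins) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        atomicAdd(&bins[pixels[i]], 1u);   // serialized on conflicts
    }
}
```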
Okay. What is the solution to all these
problems? One solution is to ask the programmer
to write different implementations for different
GPUs, different input shapes, parallel or not
parallel, and so on, which is not practical.
What I want to propose in this work is using a
static-dynamic compiler framework. As a
programmer, you write your parallel code once and
give it to my compiler. We use several
optimizations which are designed to target
performance portability, and we generate several
implementations with tuning parameters for you.
And during runtime, based on these runtime
properties -- device, input, whether it's
parallel or not, and value -- we choose which
kernel to run and tune it for you.
That's the main idea of this work.
And I tried to attack all these problems during
my Ph.D. I will talk about these three briefly,
and then I will go to the most recent one, value
portability, which might be interesting for you.
>>:
How do you generate all the kernels?
>> Mehrzad Samadi: Statically. I generate them
next to each other. During runtime I decide
which one to launch. So it will expand the code
size.
>>: So, but who produces them? Is this
something that the compiler will do
transformations and generate N kernels? Does
the programmer have to write all N kernels?
>> Mehrzad Samadi: No, no, no. The programmer
just writes one and the compiler generates
multiple.
>>: Okay.
>> Mehrzad Samadi: So I have different
optimizations. I apply this optimization or not,
I increase this optimization or --
>>: There's no man behind the curtain? There's
no wizard of kernel writing all the kernels, you
know, that presumes that somebody takes the pain?
You can generate them automatically?
>> Mehrzad Samadi: Yes.
>>: Okay.
>> Mehrzad Samadi: I took the pain to write the
code, yeah.
>>: You took the pain to write the framework
that generates the kernels or --
>> Mehrzad Samadi: No.
>>: -- the kernels?
>> Mehrzad Samadi: I wrote the framework that
generates the kernels.
>>: Okay. That's fine.
>> Mehrzad Samadi: That's --
>>: That's [inaudible]. You're allowed -- you
should be paying that.
>>: So I guess this doesn't sound like a new
idea, because I've heard for a couple years now
about work on auto-tuners and --
>> Mehrzad Samadi: No, it's not a new idea. The
combination of problem and solution is new.
>>: But I mean, there's this notion of searching
differently. So are you doing something different
than just trying different kernels?
>> Mehrzad Samadi: We -- yes, yes. That's
the main solution of this project, but inside
each of these there are many new contributions.
The way that we search, the way that we generate,
the way that we tune -- these are all new.
>>: Okay. I'll wait for the --
>> Mehrzad Samadi: Okay. Sure.
Okay. The first one is Sponge, which targets
device portability. What we do here is ask the
programmer to write the code in StreamIt.
StreamIt is a language designed by MIT. It's a
function-based language: you have different
functions and you know the explicit communication
between them. So you know how many elements this
function produces and how many elements that
function consumes.
We get the GPU type, and we use several
optimizations and generate CUDA code for you.
Based on the GPU type -- for example, if it has
more registers, we use more register optimization
to utilize more registers to accelerate the
program for you.
The second one is Adaptic, which targets input
portability. It's based on Sponge, but this time
we added input portability optimizations, too.
So this is the graph that I showed you in the
motivation. I show the results for three
different input matrices; the blue line is
Adaptic and the other one is the CUBLAS library.
And these are different shapes for those matrices.
As you can see, we are doing really great. We
are better than CUBLAS, and the reason is that
it's not actually one kernel like this. It's
composed of five, six different kernels. So this
one is optimized for just this range. After that
it will drop. And this one is optimized for just
this range, and after that it will drop.
That's -- by using all of those, we can do better
than their code.
Any questions so far?
>>: The input size changes over the course of
execution? Do you also change the kernel or did
you run --
>> Mehrzad Samadi: No.
>>: -- 20 different kernels here, one for each
point, or is it dynamically adapting if the input
size is changing, or is the input size always
fixed?
>> Mehrzad Samadi: The input size is fixed when
you launch the kernel. So when you are launching
a kernel inside your code, we check the input
size and change the kernel that you want to
launch. That's what will launch for you.
>>: So do you know which kernel runs the fastest
for the specific shape statically or is that
something that you run dynamically and find out?
>> Mehrzad Samadi: We can do both. We had some
GPU model, which wasn't that great, but it was
working for us because we just want to know if
this code is faster or not. It doesn't give you
actual cycles that it [inaudible]. And profiling
always helps.
>>: So if you put the ideal numbers on top of
this here, the fastest kernels you're able to
find, I mean, how close, on average, does adaptic
get?
>> Mehrzad Samadi: For this I said it's pretty
close, because it's -- matrix vector
multiplication, it's a small kernel. So the way
that we define it in a [indiscernible] is pretty
efficient. And the best that you can get is
close to this, at least to my knowledge as I can
write. But if you get bigger kernels, then we
have problem because it's not that efficient in
the [indiscernible].
Second -- the third one is irregular
dependencies, for which we came up with Paragon.
It's -- you don't know if your loop is parallel
or not. What do you want to do? What we did --
>>: If you can go back to this.
>> Mehrzad Samadi: This one?
>>: So for the places where there are big gaps,
what did you learn? What was -- what was so
optimal about CUBLAS in those cases versus your
Adaptic approach? Was it like a memory layout
issue? Register issue? The two disparities in
performance or --
>> Mehrzad Samadi: Yes, exactly. For example,
if you have a small number of rows or a small
number of columns, your rows are small. I can
put several of them into shared memory, but when
they are writing one code, you can't assume that.
So the way that we utilize resources on the GPU
is different from the CUBLAS library.
>>: So I imagine that you're doing things like
tiling and blocking?
>> Mehrzad Samadi: Yes.
>>: Big difference.
>> Mehrzad Samadi: Yeah, we fuse the work of one
thread -- several threads together to make them
bigger, or we make them smaller. We can do both.
The good thing about our input language is that
it's a domain-specific language, so we can do all
of them. We can fuse. We can fuse this way,
horizontally, and we can do all of those, yes.
>>: I guess I'm just trying to picture this. It
seems like there's a large parameter space, like
every flag I can set a different value, and even
if I say I had a way of blocking, like there's
tons of different parameters.
>> Mehrzad Samadi: It's a --
>>: -- [inaudible] experiment over, and so are
you --
>> Mehrzad Samadi: Yes.
>>: Is there a pre-search step or is this, you
know, best effort when you fire up this thing?
>> Mehrzad Samadi: It's best effort,
basically. And the thing is, it works really
well for smaller kernels, and when you go to big
kernels, the domain will become bigger and
bigger. But good for us, when you write your GPU
code, usually you launch small kernels one after
another, not all the time.
Okay. Paragon. You have your do-all loop, which
you know you want to run on the GPU. You have
the possibly parallel loop, which you don't know
if it's parallel or not. And you also have the
sequential loop.
The idea is pretty simple. When you see a
possibly parallel loop, run it on both CPU and
GPU at the same time.
After Fermi, NVIDIA GPUs I think are able to
launch concurrent execution; it's called
concurrent execution. So at the same time we run
CPU and GPU together, and after the GPU finishes,
we start checking for conflicts. If we don't
find any conflicts, we stop the CPU execution and
continue the execution with the GPU data. If we
do find conflicts, we just throw the GPU results
in the trash.
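A much-simplified host-side sketch of that scheme, assuming the possibly parallel loop is an indirect update a[idx[i]] += b[i]; the names are illustrative, the conflict check is folded into the speculative kernel for brevity (not how Paragon itself is structured), and the caller is assumed to pass zero-initialized write_count and conflict buffers plus identical initial data on host and device:

```cuda
#include <cuda_runtime.h>
#include <atomic>
#include <thread>

// Speculative GPU version of the possibly-parallel loop. The body races
// if two iterations share idx[i], but in that case the conflict flag is
// set and the GPU result is discarded anyway.
__global__ void speculative_loop(float *a, const float *b, const int *idx,
                                 int *write_count, int *conflict, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        a[idx[i]] += b[i];
        if (atomicAdd(&write_count[idx[i]], 1) > 0)
            *conflict = 1;             // two writes hit the same element
    }
}

void run_possibly_parallel(float *h_a, const float *h_b, const int *h_idx,
                           float *d_a, const float *d_b, const int *d_idx,
                           int *d_write_count, int *d_conflict, int n) {
    std::atomic<bool> stop_cpu{false};

    // CPU runs the safe sequential version while the GPU speculates.
    std::thread cpu([&] {
        for (int i = 0; i < n && !stop_cpu.load(); ++i)
            h_a[h_idx[i]] += h_b[i];
    });

    speculative_loop<<<(n + 255) / 256, 256>>>(d_a, d_b, d_idx,
                                               d_write_count, d_conflict, n);
    int conflict = 0;
    cudaMemcpy(&conflict, d_conflict, sizeof(int), cudaMemcpyDeviceToHost);

    if (conflict == 0)
        stop_cpu = true;   // loop was parallel: keep GPU data, cancel CPU
    cpu.join();            // else: GPU output is discarded, CPU result wins
}
```

The one-byte "done" flag and the separate checking kernel mentioned later in the talk would replace the inlined conflict counter here; the control flow is the same.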
>>: What's the mechanism in the CPU by which you
take -- you interrupt the running kernel and tell
it to ignore its results and jump on to the next
one?
>> Mehrzad Samadi: It's pretty similar,
actually. There is one thread which is
responsible to talk to GPU, and other threads are
working threads. So that thread will get the
results from the GPU and see if it's a conflict
or not. Then sends a flag, basically, to others.
>>: So you have one thread computing the L2
kernel on the CPU and you can just kill it and
read the results from a different buffer on the
communication thread?
>> Mehrzad Samadi: I have many threads. L2 and
I have one thread --
>>: On the CPU.
>> Mehrzad Samadi: On the CPU, I have many
threads doing L2.
>>: All right. So the sequential L2 and
sequential are all in the same --
>> Mehrzad Samadi: Oh, no. Yeah, yeah, yeah,
you are right. I have two threads. One is
sequential L2 and one is managing GPU-CPU.
>>: Right. But what I'm saying is, do you --
if you have the case where the GPU
correctly executes the code and you want to
cancel the CPU L2 execution, what you're labeling
as "stop" up there --
>> Mehrzad Samadi: Yes.
>>: -- are you just killing the thread?
>> Mehrzad Samadi: Yes.
>>: On the CPU?
>> Mehrzad Samadi: Yes.
>>: And then the next sequential thing will pull
the result, the L2 results out of a different
address space [indiscernible].
>> Mehrzad Samadi: Yes, because they are using
different address spaces, CPU and GPU. Yeah.
>>: Perfect. Thank you.
>>:
So how are you detecting conflicts?
>> Mehrzad Samadi: Okay. That's a longer story.
But basically what we are doing is check all the
writes and all the reads. So if we see two
writes, it's a conflict. If we see a write and a
read to the same element, it's a conflict. So
we instrument this loop on the GPU to mark
memory elements when they are written or read.
>>:
Who does that marking?
>> Mehrzad Samadi: I generate -- I instrument
the code to do the writing, to do the marking for
me.
>>: So it's like an automated tool, or do you
ask the programmer to mark this?
>> Mehrzad Samadi: No, no, no. I'm -- as a
programmer, you don't know anything. You just
write your loop. I will put those marking
instructions after your writes or reads.
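A sketch of what that generated instrumentation could look like for a loop that reads a[rd_idx[i]] and writes a[wr_idx[i]]; the shadow arrays and the conservative conflict rule are assumptions of this sketch, not Paragon's exact encoding:

```cuda
#include <cuda_runtime.h>

// Shadow state per data element (assumed zero-initialized by the caller):
// how many iterations wrote it, and whether any iteration read it.
__global__ void instrumented_loop(float *a, const int *rd_idx,
                                  const int *wr_idx, int *write_count,
                                  int *read_mark, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = a[rd_idx[i]];                // original read
        atomicExch(&read_mark[rd_idx[i]], 1);  // ...marked
        a[wr_idx[i]] = v + 1.0f;               // original write
        atomicAdd(&write_count[wr_idx[i]], 1); // ...counted
    }
}

// Separate checking kernel, run after the loop: two writes to one element,
// or a write plus a read of it, is flagged as a conflict (conservative,
// as described in the talk).
__global__ void check_marks(const int *write_count, const int *read_mark,
                            int num_elems, int *conflict) {
    int e = blockIdx.x * blockDim.x + threadIdx.x;
    if (e < num_elems) {
        if (write_count[e] > 1 || (write_count[e] == 1 && read_mark[e]))
            atomicExch(conflict, 1);
    }
}
```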
>>: So that probably generates overhead in terms
of the instructions you're adding, so have you
measured how much that slows down the parallel
code you're running? Is it worth parallelizing
with all the instrumentation?
>> Mehrzad Samadi: Actually, yes. I have all
the results. I can show you. Can I show you
after my --
>>: Do that [indiscernible].
>> Mehrzad Samadi: Yeah, yeah, yeah, sure.
Sure. I have the results in the paper, yeah.
That's absolutely true. It has high overhead,
but even with that high overhead, it's better
than sequential CPU.
>>: I have another question. So do you check
for all threads and all the parallel computation
or do you check in the beginning and then you
just let the rest of the computation happen on
the GPU?
>> Mehrzad Samadi: Marking happens during the L2
execution and then checking happens after with
another kernel that checks everything.
>>: But you do the whole execution
[indiscernible] --
>> Mehrzad Samadi: Yeah, I do the whole
execution. Then check everything.
>>: Got it. Okay.
>> Mehrzad Samadi: Thank you. Yeah. Sure.
>>: So even given the migration cost, it still
doesn't make sense to just run everything on the
GPU? Because you've got to go over
[indiscernible], you've got those edges between.
>> Mehrzad Samadi: This is nothing. This is
just a one-byte done flag. Stop or nothing. Stop
or --
>>: You're running -- I'm assuming the output of
kernel one.
>> Mehrzad Samadi: Yes.
>>: Yeah, between L1 and L2.
>> Mehrzad Samadi: Yes.
>>: Yeah, between L1 and L2, so there's a cost
there, right?
>> Mehrzad Samadi: There's a cost there, yes.
Sometimes it's good for us. Sometimes it's bad.
Because if -- okay. One thing is if you don't
change the input, you can launch while you're
transferring at the same time. That's one thing.
And sometimes -- and even if that's only one of
these might be a problem, because until then you
find the conflicts and you don't do the extra
one. Yeah. Even considering that, it will be
useful compared to sequential code.
>>: But alternatively, you could have parallel
code on the CPU doing similar things, right?
Say, with transactional memory support.
>>: So is it better than that, as well?
>> Mehrzad Samadi: Yeah, we compare it in our
paper.
>>: Oh, you do?
>> Mehrzad Samadi: Yes. Yeah.
>>: Okay. Great.
>> Mehrzad Samadi: Okay. Value portability.
What we did here, we used approximate computing.
Basically we said that if this value is too
expensive for you to compute, just ignore it.
Drop it. But we need to make sure that it
doesn't impact the output quality that much.
So approximate computing is used in many domains,
machine learning, image processing, video
processing, and if I change the quality from 100
percent to 90 percent, it is still a good picture.
And the question is, can I use -- can I do less
work and produce this 90 percent image while the
user is happy? Because less work is always good.
It gives you higher performance, lower power
consumption. It's always good.
>>: How is approximate computing used in machine
learning?
>> Mehrzad Samadi: For example, sampling data:
you're working on a sample of the data, not the
whole data, or something like that. Sampling,
basically. For example, you want to do K-means.
>>: Okay. You call that sampling, not
approximate computing. Sorry.
>> Mehrzad Samadi: Yeah.
>>: That's statistics.
>> Mehrzad Samadi: Yeah, one of the techniques
of approximate computing that we use is sampling.
>>:
Okay.
>> Mehrzad Samadi: It might not be the exact
answer with sampling. So it might have error.
Okay. Here I don't want to lose 10 percent
quality and give you 10 percent performance.
What I want to do is give you a 2x or 3x speedup
by losing 10 percent quality.
How can I do that? By doing two things. One is
simplifying or skipping the processing of those
input sets that are expensive for the GPU to
compute. And the other thing is ignoring those
that have the lowest impact on the output quality.
So we propose SAGE, which is trying to use
approximation on graphics engines. This picture
shows what we do not want to happen with SAGE.
It's approximation -- we want to use
approximation while the user is satisfied.
So, again, as a user you write the program once.
We automatically generate approximate kernels and
we monitor it during runtime.
>>: So how do I as a programmer, when I write my
program, how do I specify what's approximate and
what's not? So for instance, your example, I
don't want to screw up the image header, but it's
reasonable to screw up a couple pixel values, for
instance.
>> Mehrzad Samadi: Yes. What we have is we
get -- that's in my slides, too.
>>:
[inaudible].
>> Mehrzad Samadi: Let me give you a brief
answer. We will get an output quality evaluation
metric from you, so we know what is valuable for
you. So while we are approximating, we check that
evaluation metric to see if this approximation is
good or not. But it's always great to get hints
from the programmer, too. I'm not working on
that part, actually. I'm just working on the
approximation part, but there are great papers
that target that.
Okay. Let's look at an overview of SAGE. You
write your program. The static compiler generates
multiple kernels for you, multiple approximate
kernels, with different tuning parameters,
and during runtime we monitor the quality and
we use those tuning parameters to control
quality.
If you look at the static part in detail, you
get -- I get the input CUDA code. I get
something we call target output quality, or TOQ.
I will use this TOQ many times during my talk.
It means that if it's 90 percent, you're willing
to lose 10 percent quality for speedup. So 90
percent is good for me.
And as I said, we get the evaluation metric from
the user, like how do we compute the quality.
And we have three approximation techniques that I
will talk about, and we generate approximate
kernels with tuning parameters for you.
What will happen at runtime? We have three main
units. The first one is preprocessing, which is
done on the CPU [indiscernible] GPU and makes the
data ready for the approximate kernel. Then we
need to find a configuration that gives us
quality better than TOQ with good performance.
For that we use tuning.
We start from 100 percent quality, exact version.
We start approximating more and more. You can
see the quality drops. At the same time,
hopefully speedup goes up and we will stop when
we reach close to TOQ. Sure.
>>: So can you give me -- I know you're probably
going to get to it, but I could use a little
high-level overview. What are the knobs that
you're tuning for approximation?
>> Mehrzad Samadi: Okay. Let me say something
else. These are completely independent
invocations of the same kernel. So suppose you
are doing face detection on a database of images.
Each point is one complete face detection, and
the knobs that I have are, for example, sampling
rate.
>>:
[inaudible].
>> Mehrzad Samadi: Sampling rate, I will say in
my -- how many samples of data -- of my input
sets I will look at. Like I will look at only 80
percent of my image for doing face detection.
That's the easiest approximation method. And --
>>: So they're domain specific? I mean, that
seems like a domain-specific approximation.
>> Mehrzad Samadi: They are based on some
assumptions about the domain, but the good thing
is we can apply those approximations, and if they
don't show any good quality, we can throw them
out -- we can always go back to the exact
version. That's the good thing about this. So we
can make mistakes.
>>: So Martin -- I think Martin Rinard from MIT
had worked on code perforation.
>> Mehrzad Samadi: Yes.
>>: Is it -- can you compare and contrast?
>> Mehrzad Samadi: Yes. That's -- I have the
slide for that.
>>: Is that a yes or a no?
>> Mehrzad Samadi: Yes. Oh, yeah, yeah.
[laughter]. I have those results in one slide,
yeah.
>>: Okay. I'll wait for it.
>> Mehrzad Samadi: Yeah. Code perforation or
loop perforation is a well-known technique. What
they do is skip iterations of a loop. That's one
way. So you can skip more iterations or fewer
iterations. That's the knob that you can change.
>>: I think part of the problem you're running
into here is that approximate computing is a hot
topic now and everyone thinks of it a little bit
differently because it's not really yet well
defined because there's not a good taxonomy.
And so you can think of it as doing
transformations that cause loss of data. You can
think of it as using lower precision operations.
You can think of it as, you know, sampling fewer
data points. You can think about iterating less
on some gradient descent algorithm.
>> Mehrzad Samadi: That's a great way to sum up
the approximation. We are really new in this
thing, and everyone is saying different things.
So we are trying to make sense of that.
>>: Yeah. So maybe if you -- I think -- I mean,
are you limiting yourself to subsampling in your
approximation optimizations?
>> Mehrzad Samadi:
I have precision --
>>: Or what are the classes of approximation
that you're leveraging? Let's put it that way,
just so we're all on the same page.
>> Mehrzad Samadi: Okay. If you give me two
minutes, I will go to those slides, like what
are -- that's okay?
>>:
Sure.
>> Mehrzad Samadi: Okay. The main goal here is I
want to do better than loop perforation. I don't
want to just drop iterations without knowing the
hardware. I want to come up with approximation
methods that --
>>: I get that. I was just trying to help you
get a clearer -- so that everyone is not hearing
you talk and then hearing -- listening to you
talk and hearing something different.
>> Mehrzad Samadi: Okay. Sure. Let me give you a
quick overview. I will do loop perforation on
atomic operations, for example, but I will drop
those atomic operations that have more conflicts.
So I'm not dropping just randomly; I'm dropping
those that have more conflicts.
Precision is the second one. If I don't need
precision, I can always compress. And compressing
more means reducing quality.
>>: So what's the mechanism by which you lose
precision?
>> Mehrzad Samadi: We use fixed point, like we do
quantization.
>>: Good. Operations?
>> Mehrzad Samadi: What's that?
>>: The type of operations, you're doing lower
precision operations.
>> Mehrzad Samadi: Yes, yes. Yeah, exactly.
Exactly. And the last one is, sometimes you're
working on the same element -- different threads
working on similar elements. Just drop those,
and just one thread does the work and produces
the output for all of them.
>>: So you can think of that as sampling
reduction, because you're doing fewer points but
you're trying to be smart about how you're using
them?
>> Mehrzad Samadi: Yes. Yes.
>>: So do you have a well thought-through way to
choose when you apply which approximation
technique? Is it all manually identified? How
do you know when to use -- you just articulated
three classes of approximation techniques, and
there are certainly others that people in this
room are thinking about.
>> Mehrzad Samadi: Yes.
>>: How do you know when to use which?
>> Mehrzad Samadi: I'm -- okay. Right now I
don't think about, like, is it safe to apply the
optimization or not, which is a really great
topic for research. My goal is just doing this
for performance. So when I apply those
optimizations for getting performance, I have an
algorithm that I will explain.
>>: But for a given kernel, are all of those
algorithms in play, all those different types of
approximation, or do you use -- do you select
different ones based on the algorithm?
>> Mehrzad Samadi: Based on opportunities. If I
see the opportunity for the first one or second
one or third one, I --
>>: Who decides whether you see the opportunity?
Is that your tool? Is it the programmer?
>> Mehrzad Samadi: It's my tool. It's all my
tool.
>>: So I find that surprising, because that seems
like a lot -- a lot of insight for the tool to
have, that it's safe to drop this down into
fixed point or you can drop atomic operations. I
mean --
>> Mehrzad Samadi: Okay. Here is the thing. I
will use fixed point when I'm reading a large
matrix. I don't know if it impacts the quality
that much or not.
>>: All right. So you have some heuristics that
let the tool that say this is when to turn this
class of approximation optimizations on.
>> Mehrzad Samadi: I just look, can I apply
this -- I don't care about quality. Can I apply
this approximation on this or not? Then I have a
runtime to tell me that you shouldn't use this
approximation.
>>: You have to have some heuristics to guide
the tool to know when to do it.
>>: I believe that you ask the user to write a
function that you basically call to evaluate
whether something still meets a quality bar or
not?
>> Mehrzad Samadi:
Yes.
>>: I'm not even talking about the quality bar.
I'm just saying, you know, if I give it a kernel
and I've got floating point, do you just -- do
you say, hey, I'm going to try converting all the
floating points to fixed point for any kernel, or
do you have some smarts that let you say, hey,
this might be a good place to try this
optimization?
>> Mehrzad Samadi: But this [indiscernible]
usually has two or three input matrices, for
example. And I know which one I'm reading in a
loop. So that might be useful. I have some
heuristics to find out which approximations might
give you good performance.
>>:
Okay.
>> Mehrzad Samadi: But I don't know -- don't
have any idea which approximation gives you good
quality.
>>:
Yep.
>> Mehrzad Samadi: Quality will be done in
the --
>>: You're always [inaudible] turn them on
pretty aggressively when it seems like it's
possible.
>> Mehrzad Samadi: Yes. It's not -- yeah. I
have heuristics for that, but --
>>: Okay.
>>: Can you walk us through the --
>> Mehrzad Samadi: Sure. Okay. Okay. It will
stop when we reach close to the TOQ, and we
continue the execution with that configuration.
But this quality might change for different
invocations, so we have a calibration part. We
check the quality every [indiscernible] N
invocations.
So if it's better than TOQ, we increase the
interval between two checks, because we want to
reduce the overhead of calibration.
But what will happen if it goes below TOQ, like
here? We do two things. First of all, we
decrease the aggressiveness of approximation, so
hopefully quality will go up and at the same time
the speedup will drop. And we reset the
calibration interval to the minimum, because we
want to check more often to make sure that our
new decision is good. Okay.
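A host-side sketch of that calibration policy; the struct, fields, and quality callback are assumptions for illustration, not SAGE's actual API:

```cuda
#include <algorithm>

struct Calibrator {
    double toq;             // target output quality, e.g. 0.90
    int interval = 1;       // invocations between quality checks
    int min_interval = 1;
    int max_interval = 64;
    int since_check = 0;
    int knob = 0;           // current aggressiveness along the tuning path

    // quality(knob) is assumed to run both the exact and approximate
    // kernels on this invocation's input and apply the user's metric.
    template <typename QualityFn>
    void on_invocation(QualityFn quality) {
        if (++since_check < interval) return;
        since_check = 0;
        if (quality(knob) >= toq) {
            // Quality is fine: check less often to cut calibration overhead.
            interval = std::min(interval * 2, max_interval);
        } else {
            // Quality dipped: back off one step on the tuning path and
            // go back to checking frequently until confidence returns.
            knob = std::max(knob - 1, 0);
            interval = min_interval;
        }
    }
};
```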
>>:
Can you repeat the section that fell below?
>> Mehrzad Samadi: I will talk about this, but
for measuring quality I will execute the exact
version and the approximate version and compare
the results. For the user, actually, here it is
100 percent quality because I actually ran the
exact version.
>>: Yeah. On that sample, but it's possible
that you --
>> Mehrzad Samadi: No, I missed those. I missed
those. I will talk about that. Yeah.
>>: So for your quality function that you use to
measure how accurate your results are, you have
the actual output, right? You have the correct
100 percent quality of the output?
>> Mehrzad Samadi: When I'm checking -- when I'm
checking the quality, yes.
>>: Why are you running -- I mean, if you have
to compare the quality, then why are you running
the code at all? Right. Because if you have the
[indiscernible], then there is no need to run the
code because you have the [indiscernible].
>> Mehrzad Samadi: That's the reason we are
checking every -- invocation. If I had the
quality, I wouldn't need to check anything. For
these points I have the quality. For this one I
don't have any quality. So you are now the
oracle -- you know everything right now -- but my
framework will see just these points. Sure.
>>: So you can probably put a statistical bound
on the likelihood that you're going to
actually -- between these T invocations, you're
going to miss a quality check.
>> Mehrzad Samadi: I will try to do that, see if
that satisfies you or not.
Okay. Let's go to approximation methods.
>>: [inaudible].
>> Mehrzad Samadi: Okay. Great. The first one is
atomic operations. As I said, atomic operations
are those that update the same element
atomically. So here is an example of computing
the histogram of one image. You have a for loop
that goes over the different pixels. You read
the color of that pixel and add to the bucket
corresponding to that color.
What will happen if several pixels next to each
other are trying to access the same bucket? You
have conflicts: several threads are accessing the
same element.
What the GPU does is serialize this, so it will
do them one after the other. These subsets are
all conflict free. So as I said, more conflicts,
lower performance. So again, we have the
histogram here. How can I approximate this? If
you look at this loop, it goes over all pixels,
which I call iterations.
And if I launch several threads, each thread will
be responsible for doing some of these
iterations. These threads will do two pixels
each, for example. We execute the computation for
two pixels.
So my approximation method will drop the one
iteration per thread that has the maximum number
of conflicts. It computes the number of conflicts
for each and drops the one that has the maximum
conflicts. I will show you how.
So if I drop one iteration per thread, I'm
dropping actually 50 percent of the iterations.
How can I change that? I need a tuning knob,
right? How can I change this dropping rate? I
can reduce the number of threads, for example.
Now each thread is doing more pixels, and
dropping one means 25 percent.
Okay. How can I drop the iteration with the
maximum number of conflicts per thread? Here in
this example I have four iterations, and you can
see that the number of conflicts is written next
to each iteration. So I need to skip iteration
two because it has the maximum number of
conflicts. You don't know this when you run your
program.
So we came up with software conflict detection,
which has lower overhead than the actual atomic
operation. We use some of the PTX instructions of
CUDA programming, and we keep the maximum
conflict so far as we're executing. So we check
the conflicts for iteration zero. Right now it
has the maximum number of conflicts. Then we go
check the conflicts for iteration one; this has
more conflicts, so the maximum conflicts will be
iteration one. So now we can run iteration zero,
because it's not the maximum anymore.
Again, check the conflicts for iteration two. It
is the maximum, so we can run iteration one
because it's not the maximum anymore. Check
for -- check the conflicts for iteration three.
It's not greater than iteration two, so we can
run that iteration.
So we basically skip iteration two, which has
the maximum number of conflicts. And it will
give you good performance if this conflict
checking has really light overhead.
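A sketch of that per-thread maximum-conflict skipping for the histogram example. It assumes a GPU where the __match_any_sync warp intrinsic can stand in for the PTX-level conflict detection mentioned in the talk; ITERS_PER_THREAD is the tuning knob:

```cuda
#include <cuda_runtime.h>

#define ITERS_PER_THREAD 4   // dropping 1 of 4 iterations = 25 percent

__global__ void approx_histogram(const unsigned char *pixels, int n,
                                 unsigned int *bins) {
    int base = (blockIdx.x * blockDim.x + threadIdx.x) * ITERS_PER_THREAD;
    int max_conf = -1, pending = -1;     // deferred max-conflict iteration

    for (int k = 0; k < ITERS_PER_THREAD; ++k) {
        int i = base + k;
        if (i >= n) continue;
        // Count lanes of this warp about to hit the same bucket
        // (requires compute capability 7.0+ for __match_any_sync).
        unsigned mask = __activemask();
        int c = __popc(__match_any_sync(mask, (int)pixels[i]));
        if (c > max_conf) {
            // New maximum: the previously deferred iteration is safe to run.
            if (pending >= 0) atomicAdd(&bins[pixels[pending]], 1u);
            max_conf = c;
            pending = i;
        } else {
            atomicAdd(&bins[pixels[i]], 1u);
        }
    }
    // 'pending' (this thread's maximum-conflict iteration) is skipped.
}
```

Reducing the number of launched threads makes each thread cover more iterations, so skipping one per thread drops a smaller fraction, which is exactly the tuning knob described above.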
>>: Who's checking?
>> Mehrzad Samadi: What's that?
>>: Who is checking this?
>> Mehrzad Samadi: This is -- it's our
instructions inside the code that we put.
>>: On CPU?
>> Mehrzad Samadi: On GPU.
>>: On GPU.
>> Mehrzad Samadi: On GPU, yeah. For -- before
doing each atomic operation we have instructions
that check the conflicts.
Okay. The second one is data packing. This is
the one where I said I reduce the precision.
Sometimes you don't need full precision, so what
you can do is use half of the bits, for example,
and put them together.
Now you have fewer memory requests to access the
whole input set. And this technique is really
good for memory-bound kernels, because it just
reduces the memory accesses. It doesn't change
the computation.
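One way to sketch data packing is with half precision, so each 32-bit load returns two values; SAGE's packing is more general than this, and the kernel names here are illustrative:

```cuda
#include <cuda_fp16.h>

// Packing: store two neighboring values in one 32-bit __half2, so each
// memory request brings back twice the data at reduced precision.
__global__ void pack_fp32_to_half2(const float *in, __half2 *out,
                                   int n_pairs) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n_pairs)
        out[i] = __floats2half2_rn(in[2 * i], in[2 * i + 1]);
}

// A memory-bound consumer now issues half as many loads; the computation
// itself is unchanged after unpacking back to float.
__global__ void scale_packed(const __half2 *in, float *out, float s,
                             int n_pairs) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n_pairs) {
        float2 v = __half22float2(in[i]);   // unpack
        out[2 * i]     = v.x * s;
        out[2 * i + 1] = v.y * s;
    }
}
```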
And the third one is thread fusion. In some
domains, neighboring elements have similar
values. For example, in [indiscernible]
processing, you have your gray area, white area
in the image. So those have similar values.
So what you will end up having is two threads
working on similar values, doing the same
computation and generating similar output. So
what I can do is fuse these two threads together,
and now I read only one element, do the
computation once, and write two outputs.
So I can change the aggressiveness of the
approximation by how many threads I want to fuse.
And it's really good for computation-bound
kernels, because it doesn't change the memory
traffic that much -- these are in the same cache
line. So when you access this one, that one will
be in your cache, too.
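A minimal sketch of thread fusion, where FUSE neighboring outputs share one read and one computation; expensive_fn is a stand-in for the real kernel body:

```cuda
#define FUSE 2   // how many threads to fuse: the aggressiveness knob

__device__ float expensive_fn(float x) { return sqrtf(x) * 0.5f + x * x; }

__global__ void fused_map(const float *in, float *out, int n) {
    int base = (blockIdx.x * blockDim.x + threadIdx.x) * FUSE;
    if (base < n) {
        float v = expensive_fn(in[base]);      // read + compute once
        for (int k = 0; k < FUSE && base + k < n; ++k)
            out[base + k] = v;                 // reuse for the neighbors
    }
}
```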
Okay. Let me show you how the runtime actually
works. First, how do we compute the output
quality? We run the approximate version, we run
the exact version, then we run the evaluation
metric.
It has huge overhead. It's not possible to do it
every time, because then we could just run the
exact version only.
And it's really important for tuning, because
during tuning you want to check the quality every
time, because we want to converge to a good
solution, so we use a greedy algorithm.
Let me explain our greedy algorithm. In this
example you have a TOQ of 90 percent. It means
that you are willing to lose 10 percent of
quality; losing 10 percent is okay.
We start from the exact version, with a quality
of 100 percent and a speedup of 1x. There is no
speedup. K(x,y) means that you have one kernel,
you apply two approximation methods on it, and x
is the tuning parameter of the first
approximation method and y is the tuning
parameter of the second approximation method.
So more means more aggressive. So we start from
the exact version and check its children. Each
child is the parent with one parameter
incremented, meaning it's more aggressive than
the parent for one of the approximation methods.
So we check both children, and both of them have
better quality than TOQ, so we choose the one
with the better speedup, because it's a greedy
algorithm. So we choose the one on the right.
We continue the process with this one. We again
check the children. Both have better quality
than TOQ, so we choose the one that has the
better speedup. So 2.5x speedup.
And finally, when we check its children, we can
see that both of them have lower quality than
TOQ, so both of them are not good, even though
they give us better performance. So this is our
final choice. Go ahead.
>>: So is this over a single input [inaudible]?
>> Mehrzad Samadi: It's not over a single.
It's --
>>: It's over many inputs [inaudible].
>> Mehrzad Samadi: It's over -- one input at a
time.
>>: Ah. So is there -- you're assuming here
then that both quality and speedup are not
necessarily a function of the input, right? They
generalize --
>> Mehrzad Samadi: They are actually a function
of the input, and that causes errors in our
system, yeah.
>>: Okay. Have you quantified how much that
error is?
>> Mehrzad Samadi: I will show you -- I will
show you the run for a hundred --
>>: Great.
>> Mehrzad Samadi: But the thing is, when you
launch this for two different inputs, you might
see differences. But this and this -- this one
is always better than that one for different --
even different inputs.
>>: So it's just to say that qualitatively --
quantitatively, it really doesn't matter if
qualitatively the results are [inaudible]?
>> Mehrzad Samadi: Yes, yes. That's -- yeah.
So this is our final choice. We continue the
execution with this one. But we store this
tuning path, because during calibration, when we
see that, okay, quality goes below TOQ, we can go
back one step, one node in this path, to make
sure that quality becomes better.
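A host-side sketch of that greedy descent and the stored tuning path; the measure(x, y) callback, which is assumed to launch K(x, y) plus the exact version and return its speedup and quality, and all the names are illustrative, not SAGE's:

```cuda
#include <vector>
#include <utility>

struct Sample { double speedup, quality; };

// Greedy tuning over two knobs (x, y) of one kernel K(x, y); larger
// values mean more aggressive approximation.
template <typename MeasureFn>
std::vector<std::pair<int, int>> greedy_tune(MeasureFn measure, double toq) {
    std::vector<std::pair<int, int>> path{{0, 0}};   // exact version
    int x = 0, y = 0;
    while (true) {
        Sample cx = measure(x + 1, y);   // child: more aggressive in x
        Sample cy = measure(x, y + 1);   // child: more aggressive in y
        bool ok_x = cx.quality >= toq, ok_y = cy.quality >= toq;
        if (!ok_x && !ok_y) break;       // both children violate TOQ: stop
        if (ok_x && (!ok_y || cx.speedup >= cy.speedup)) ++x; else ++y;
        path.push_back({x, y});          // kept for calibration back-off
    }
    return path;                         // last node = chosen configuration
}
```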
So let me show you some results. For evaluation
we changed the back end of the Cetus compiler.
It's a source-to-source compiler; it compiles
C-like code to C-like code. We ran it on a GTX
560, which is Fermi, with an Intel Core i7, and
we chose several benchmarks from image processing
and machine learning.
These are the results for K-Means for 100 input
sets, and as you can see, here you have
cumulative speedup, and then you have output
quality on top, and the TOQ is 90 percent. So I
start with tuning. As you can see, you have the
approximate version and the exact version. That's
why it goes to 100 percent quality at both nodes.
And for speedup, you can see that during tuning
you have a 2x slowdown because you do it two
times. Basically each input set is done two
times. So we start from the exact version and we
come down until we reach close to TOQ.
At that time we continue the execution, and we
check every ten invocations here. The quality is
better than TOQ, so we increase the calibration
interval. We increase the calibration interval,
we increase the calibration interval, but we are
losing this, basically, because we don't check
all the invocations.
>>: [inaudible] there is no guarantee that you
meet the TOQ, what was the --
>> Mehrzad Samadi: Yes.
>>: The guarantee for the quality. You don't
have that?
>> Mehrzad Samadi: Yeah. We don't have that.
>>: Okay.
>> Mehrzad Samadi: Yeah, we don't have that. I
have two solutions for that. I will say one of
them right now and I will keep one of them for
the future work.
Okay. One way that we can make ourselves
confident about our work is to check these --
do more calibration at the beginning. So what we
do is check more often to increase the confidence
about the quality of your signal, then increase
the calibration interval.
For example here, I assume that your quality is
uniform over some range -- has a uniform
probability over some range -- and the likelihood
is binomial: is it better than TOQ or is it lower
than TOQ? So this is a confidence interval.
If I check 50 times, I will be 93 percent sure
that 95 percent of my invocations meet the TOQ.
So if I check more often, I can increase the
confidence level of the user. Then I can increase
the calibration interval. But still, there is no
guarantee. I'm just -- it's statically --
statistically.
And let me show you the calibration interval --
the calibration overhead, too. I show it for two
of our benchmarks: Gaussian filtering, which is a
blur on the image, and K-Means.
This is the number of invocations between two
checks. Around 40 and 50 we have about 5 percent
overhead from checking: if we check every 40 or
50 invocations, we have 5 percent overhead.
And for the overall results, I show you two TOQs,
one 95 percent and one 90 percent. And I compare
it with loop perforation. Loop perforation is a
great technique, but the thing is, it doesn't
know about the hardware. That's why SAGE gives
you better performance than loop perforation
itself, because we kind of know which iterations
to drop compared to their work.
And you can see that we get a 2x speedup by
losing 5 percent quality and a 2.5x speedup by
losing 10 percent quality.
>>: So you have a pretty weak assumption on
the -- which is good -- on the quality is
uniformly distributed, right?
>> Mehrzad Samadi:
Yes.
>>: So you could make these numbers -- and if
that's what the numbers are using to set these
priorities --
>> Mehrzad Samadi: Yes.
>>:
-- if you actually looked at the quality of
your output, you could probably do a lot better,
right?
>> Mehrzad Samadi: Yeah, but computing that
quality is another problem.
>>: Sure, but do it offline, right? If you
already have an assumption of -- that your inputs
are representative across all other inputs.
>> Mehrzad Samadi:
Yes.
>>: Right? So if you could estimate that
quality by some subset of the inputs --
>> Mehrzad Samadi: Yes.
>>: -- for a particular domain --
>> Mehrzad Samadi: Yes.
>>: -- that then gets you away from your uniform
[indiscernible], which provides no information at
all. Right?
>> Mehrzad Samadi: Yes. Basically what we are
doing is a similar thing to profiling, but we are
doing it over time. Instead of profiling at the
beginning offline, we are doing it while we are
running. So, yeah, that's true. We can do it
offline and come up with a static solution.
Okay. As a conclusion, we generate approximate
kernels and we monitor those for you during
runtime, and we got 2.5x by losing only 10
percent quality with SAGE. But there are two
limitations with SAGE. One is, if you have atomic
operations inside your kernel, it's perfect: you
use SAGE and that approximation method will give
you the best performance.
But what if you don't have that? The
applicability of SAGE is limited, so I tried to
solve this one. And the other one is the ones
that we are missing, which I will talk about in
future work.
First, let's see how I tried to address this and
increase the applicability of SAGE. We propose a
new framework which we call Paraprox. In Paraprox
we look for common patterns in data parallel
applications: map, partitioning or tiling,
reduction, scatter/gather, stencil, or scan. It's
based on the book by Professor Bacul [phonetic],
actually.
We detect these patterns inside your kernel, and
we have an approximation method for each pattern.
So when we run it, we use that approximation
method for each pattern.
Let me show you, like, three of these
approximation methods that we use. For map, where
you read several elements, do computation, and
generate several other [indiscernible] -- you
don't do reduction or something like that -- we
replace these functions with a lookup table,
basically. It's called approximate memoization.
For that, the function should be pure, so it
shouldn't have any side effects, so that we can
replace it. So how do we detect pure functions?
Finding pure sections inside the code is really
hard. It can be done, but it's really hard.
What we do is look at functions that are written
by the programmer. If they are pure, we replace
them with those lookup tables. And for the lookup
table, we first use quantization. So each input
that comes in has some quantization levels. We
find the nearest quantization level and output
the number corresponding to that quantization
level.
So there are several bits assigned to each input,
and when we put all those bits next to each
other, we have one big address which goes into
the lookup table and gives us the result. And
this lookup table is filled before the execution,
so it will be filled offline.
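A sketch of that approximate memoization for a pure two-input function; BITS_PER_INPUT, the table layout, and the profiled min/max parameters are assumptions for illustration:

```cuda
#include <cuda_runtime.h>

#define BITS_PER_INPUT 4                    // 16 quantization levels per input
#define LEVELS (1 << BITS_PER_INPUT)

// Quantize one input to its nearest level index, given profiled min/max.
__device__ int quantize(float v, float vmin, float vmax) {
    float t = (v - vmin) / (vmax - vmin);
    int level = (int)(t * (LEVELS - 1) + 0.5f);
    return min(max(level, 0), LEVELS - 1);  // clamp out-of-range inputs
}

// The table of LEVELS*LEVELS precomputed outputs is filled offline and
// replaces the original pure function f(a, b).
__global__ void memoized_map(const float *a, const float *b, float *out,
                             const float *table,
                             float amin, float amax, float bmin, float bmax,
                             int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int addr = (quantize(a[i], amin, amax) << BITS_PER_INPUT)
                 |  quantize(b[i], bmin, bmax);
        out[i] = table[addr];               // lookup instead of compute
    }
}
```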
>>: Is your quantization just, I chop off bits,
or is it you look at particular bits in the input
pattern?
>> Mehrzad Samadi: We know min and max, okay?
We quantize this --
>>: I see.
>> Mehrzad Samadi: -- range, and if an input
comes, it's first level, it's second level, or
something like that.
>>: Do you know that min and max from profiling
or --
>> Mehrzad Samadi: Yes, from profiling.
>>: Okay.
>> Mehrzad Samadi: And if we are wrong, it --
it doesn't matter. It can be max or min if it's
out of the range.
>>: [inaudible] multiple inputs, you're
quantizing different inputs into different
ranges, right? This is not really shown there,
or is it --
>> Mehrzad Samadi: Yeah, yeah, these are --
>>: [inaudible] wrong.
>> Mehrzad Samadi: Yeah. It should be like one
quantization here, one here, one here, one here,
yes.
>>: [inaudible].
>> Mehrzad Samadi: Those bits -- those bits --
those numbers of bits are different.
>>: Okay. But what [inaudible] operations
you're doing, I see a --
>> Mehrzad Samadi: It's a shift. Basically I
have quantization level for this input, I shift
it over the next one. I put them together,
basically.
>>:
[inaudible].
>> Mehrzad Samadi: Put everything together to
make it.
>>: So that will, for different inputs and
different number of inputs, it's going to
generate the different number of bits in the
address, right? How do you manage that?
>> Mehrzad Samadi: For different number of
inputs, I know the number of inputs from the
kernel.
>>:
Okay.
>> Mehrzad Samadi: This is -- but for different
inputs I don't change anything, right? I know
these numbers from profiling. And the good thing
about this is you can assign more bits to more
important inputs and fewer bits to less important
inputs.
For example, if one of those inputs -- excuse me.
Sorry. One of those inputs is always constant
during your profiling. You can assign zero bits
to that and put that constant so it will be
there.
>>: So you decide the different importance of
the inputs during the profiling phase?
>> Mehrzad Samadi: Yes.
>>: Okay.
>> Mehrzad Samadi: Yeah, exactly.
>>: But by shifting and ordering, it's a pretty
restricted domain that you can actually --
functions you can approximate. So if I have an X
or --
>> Mehrzad Samadi: No, no, no. No, this is not
the approximation part.
>>: It's the index into a function.
>> Mehrzad Samadi: It will go to this -- it's
just an address to this table. My pre-computed
results is already in that table.
>>:
Okay.
>> Mehrzad Samadi: So I'm just making this
address to look in that table and, so this, the
whole thing will be replaced by this.
>>:
Got it.
>> Mehrzad Samadi:
>>:
This is fixed in our other --
Yeah.
>> Mehrzad Samadi:
Okay. Awesome.
For other kernels too.
The second one is similar to what we have in
SAGE. In something like image processing,
neighboring elements have similar values. So here
I'm showing the difference of one pixel with its
neighbors, for ten different images.
What this graph shows is that about 80 percent of
pixels have less than 10 percent difference with
their neighbors, so most of them are similar,
basically. So what I can do for tiling is,
instead of accessing the whole tile, I can access
the center of it or one row or one column, and
that might be a good representative of the whole
tile.
And for reduction we use loop perforation:
instead of adding N elements, we add N over two
elements and then multiply the result by two.
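Minimal sketches of those two approximations -- tile-center substitution and reduction perforation -- with parameter names of my own choosing:

```cuda
#define TILE 4

// Tile approximation: write the tile's center value to every output pixel
// of a TILE x TILE tile instead of processing the whole tile.
__global__ void tile_center(const float *in, float *out,
                            int width, int height) {
    int x0 = (blockIdx.x * blockDim.x + threadIdx.x) * TILE;   // tile origin
    int y0 = (blockIdx.y * blockDim.y + threadIdx.y) * TILE;
    if (x0 >= width || y0 >= height) return;
    int cx = min(x0 + TILE / 2, width - 1);
    int cy = min(y0 + TILE / 2, height - 1);
    float center = in[cy * width + cx];
    for (int dy = 0; dy < TILE && y0 + dy < height; ++dy)
        for (int dx = 0; dx < TILE && x0 + dx < width; ++dx)
            out[(y0 + dy) * width + (x0 + dx)] = center;
}

// Reduction perforation: add every other element, then scale by two.
__global__ void perforated_sum(const float *in, float *partial, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (2 * i < n)
        atomicAdd(partial, 2.0f * in[2 * i]);   // N/2 adds instead of N
}
```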
For this one, we use Clang as the compiler. We
get the CUDA code, we visit the AST, we find
those patterns, and our action generator
generates those approximate kernels and rewrites
them.
We use the same GPU and Intel Core i7, and this
time we have a wider range of applications,
because we increased the applicability of the
approximation methods.
And this time we generate OpenCL code too, so we
can run on the CPU too. And here I'm not
comparing GPU to CPU: the approximate version on
the GPU is compared to the exact GPU, and the
approximate CPU is compared to the exact CPU.
So the TOQ is 90 percent, and you can see that
on the GPU and CPU we get similar speedups: 2.6,
2.7x speedup by losing 10 percent of output
quality.
>>: I just want to make sure that the numbers we
see are wall clock time?
>> Mehrzad Samadi: Yes.
>>: You can compare CPU-GPU.
>> Mehrzad Samadi: Yeah, exactly. Okay. These
are my works, and let me tell you a little bit
about the future work that I want to do.
>>:
[inaudible] ten minutes.
>> Mehrzad Samadi: Okay. Ten minutes. Okay.
How can I solve this? The way that I decided to
solve it in SAGE was actually that we were really
conservative. Every time we saw that the output
quality was lower than TOQ, we dropped the
speedup -- in SAGE, in the one that I proposed.
So we never increase the aggressiveness, because
that might cause so many problems.
So one way to do it is, when quality is too
good -- we check the quality and it's really
good -- you increase the aggressiveness and the
speedup will go up. This is what I call
nonconservative. This might miss more invocations
because of bad quality and so on.
So just keep this in mind. What I'm proposing
here is collaborative CPU-GPU output quality
monitoring. I'm doing monitoring on CPU while
running on GPU.
The first problem is CPU cannot keep up with GPU.
I can't expect CPU to run the exact version,
approximate version, compare those two, and give
the results to the GPU.
So I will have partial output quality monitoring,
which is another approximation level on top of
another approximation. It means that we are doing
approximate checking of the approximation.
So instead of checking, for example, the whole
image, I just choose a tile of that image in the
CPU, apply the exact kernel, apply the
approximate kernel, and compare the results.
The first question that I need to answer is how I can generate this code. I don't know yet, but I can use something like Paraprox. And how to choose which tile to use for partial output quality monitoring is another question, but right now we are using uniformly distributed sampling. So we don't choose a tile; we probe one pixel here, one pixel here, one pixel here, one pixel here.
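A minimal sketch of that partial check on the host, assuming the exact and approximate outputs for the sampled pixels are already computed on the CPU; the sampling stride and the relative-error metric are illustrative choices.

    #include <algorithm>
    #include <cmath>
    #include <cstddef>

    // Returns a quality score in [0, 1]; 1.0 means the sampled approximate
    // pixels match the exact ones. Only every `stride`-th pixel is checked,
    // which is the uniformly distributed probing described above.
    double partial_quality(const float *exact, const float *approx,
                           std::size_t n_pixels, std::size_t stride) {
        double err = 0.0;
        std::size_t count = 0;
        for (std::size_t i = 0; i < n_pixels; i += stride, ++count) {
            double denom = std::fabs(exact[i]) > 1e-6 ? std::fabs(exact[i]) : 1e-6;
            err += std::fabs(exact[i] - approx[i]) / denom;
        }
        return count ? std::max(0.0, 1.0 - err / count) : 1.0;
    }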
And how do you make decisions? Suppose your quality comes out at 80 percent. Do you increase, do you decrease, something like that? Right now I'm checking three different configurations at the same time on the CPU, and based on their quality I decide which one to use for that invocation. So each time I check data set one while the GPU is running on data set zero. So I'm checking ahead, basically.
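A rough host-side sketch of that check-ahead loop; the helper functions are stubs invented for illustration, and a real version would overlap the CPU scoring with an asynchronous GPU launch.

    #include <vector>

    // Stubs for illustration only; real versions would launch the chosen kernel
    // variant asynchronously and run the exact/approximate pair on a pixel sample.
    void   launch_on_gpu(int /*data_set*/, int /*level*/) {}
    double score_on_cpu_sample(int /*data_set*/, int /*level*/) { return 1.0; }

    // Pick the most aggressive candidate level whose sampled quality meets TOQ.
    int pick_level(int next_set, const std::vector<int> &candidates, double toq) {
        int best = 0;                                  // level 0 = exact, always safe
        for (int lvl : candidates) {
            if (score_on_cpu_sample(next_set, lvl) >= toq && lvl > best)
                best = lvl;
        }
        return best;
    }

    void run_all(int num_sets, double toq) {
        std::vector<int> candidates = {1, 2, 3};       // three configurations, as in the talk
        int level = 0;
        for (int i = 0; i < num_sets; ++i) {
            launch_on_gpu(i, level);                   // GPU works on data set i ...
            if (i + 1 < num_sets)                      // ... while the CPU checks ahead on i+1
                level = pick_level(i + 1, candidates, toq);
        }
    }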
So I have some preliminary results which I can show you. I use two benchmarks: mean filtering, which is used for blurring images, and mosaic, in which you make one big image from small images. The bottom figure is speedup and the top figure is how many images you missed: basically, the quality is lower than TOQ, but you didn't catch those. And I ran these benchmarks on 1,600 images of flowers that I downloaded.
So what are these implementations? First I have conservative fixed interval: I just drop the approximation level, and the interval between two checks is fixed.
Then I have conservative adaptive interval, which is what we proposed in SAGE. With nonconservative fixed interval you can increase the aggressiveness, but the checking interval is still fixed. And then nonconservative adaptive interval.
And the last one is CPU and GPU. As you can see, the conservative ones show really low speedup compared to the nonconservative ones, but they miss fewer images. The nonconservative ones miss more than 30 percent, at some points 50 percent, of images that don't have good quality.
CCG, where we use the CPU for quality monitoring, gives you better speedup than all of them, and they lose about 5 percent of images that don't show good quality.
Another future work that we are currently doing: right now I just talked about single kernel, single device. At each time you have only one GPU and you run one kernel on it, but what about single kernel, multiple devices?
We ask the programmer to write the code as if he has only one GPU in the system. But what we can do for him is generate code for the different GPUs that are in the system and also for the CPU. These work in parallel, and at the end we merge the results for the programmer. And we might do multiple kernels, multiple devices, too.
In conclusion: programming GPUs is hard. It's really hard to write efficient code, and we can't ask the programmer to write it. So what we want to do is help the programmer by generating multiple optimized versions, and if we are allowed to use approximation, we can show some good performance.
Thanks a lot. Sorry, too much.
[applause]
>> Mehrzad Samadi: Thank you. I appreciate it.
>>: It's an interesting talk. I'm still hung up on this TQ.
>> Mehrzad Samadi: Thanks. TOQ?
>>: Yeah. I mean, who decides that and --
>> Mehrzad Samadi: Oh, yeah. Yeah. That's a great question, and -- yeah.
>>: Because my -- you know, if you go back to image processing, if you're talking about 90 percent of the pixels being the same or -- I mean, maybe my eye can't tell the difference. That metric seems very specific to the user.
>> Mehrzad Samadi: I completely agree with you. It's not settled at all. It's something really important, and right now we are just assuming that the programmer provides that. But that should be --
>>: It makes a lot of sense that the user writes a function that, based on the approximated output, computes the TO -- computes the quality of the output, like -- I don't know, like something. I don't know. Like for DF -- based on the values you computed, you measure the quality of your output.
>>: Yeah. I mean, I think one of the tricky parts, though, is that you're measuring the quality of a small piece of the output, not -- well --
>>: Well, what if that small piece depends on the whole computation? Like metrics [inaudible], if you have a bunch of metrics [inaudible].
>>:
Right.
>>: That small piece, you could have done that whole computation --
>> Mehrzad Samadi: Yeah, that's another problem, yeah. You can't --
>>: So here is a question, back to the image.
What if I give you an image that's 90 percent the
color blue and 10 percent very detailed.
>> Mehrzad Samadi: Yeah. That's the important part.
>>: Would that really mess up your analysis? Because if you go down 10 percent in quality, the part of the picture that you actually care about might be horrible, but the rest of it is just blue, regardless of what you do.
>> Mehrzad Samadi: Yes. The thing that I'm using right now is the uniform average of the errors of all pixels, so --
>>: So was one of those results you showed at
the end in future work where you have different
intervals and would one of those solve that
issue?
>> Mehrzad Samadi: It will not generate an evaluation metric for you. I don't know how to generate that yet. But it's a great research idea, how we can do that, yeah.
>>: So did you look at -- so suppose that -- I like this example of the image header being non-approximable, versus the image pixels which I display on the screen, which reasonably seems like something I could approximate. So suppose I write a really braindead program that updates the image header using atomic variables?
>> Mehrzad Samadi:
Okay.
>>: Which I don't know why you would do that, but imagine that I do, right?
>> Mehrzad Samadi:
Yeah.
>>: So pretty -- statically it should be pretty
clear. If my approximation is effectively saying
something about, you know, the pixel values, it
should be pretty clear that your technique
won't -- you know, even though you're going to
try turning these knobs to try and reduce the
amount of atomic updates I do on the image
header, every time you do that, you're going to
screw up the image.
>> Mehrzad Samadi:
>>:
Yes.
And so I'm going to have a bad output,
right?
>> Mehrzad Samadi:
Yes.
>>: So have you thought of any kind of static
techniques to let you sit above your ideas to
weed out those things that are guaranteed not to
work?
>> Mehrzad Samadi: Okay. The example that you gave is not really applicable to GPUs, because you usually write kernels to work on the real data on the GPU. So that header handling usually happens on the CPU. The good thing is that my work is on the GPU, so I don't need to worry about those headers.
>>:
Sure.
>> Mehrzad Samadi: Another thing that I need to wor-- a similar thing that I need to worry about: sometimes you use atomic operations for synchronization, for a lock, for example.
>>:
Yes.
>> Mehrzad Samadi: At those places it would be a disaster if I did approximation. So what we did is use a really simple compiler analysis: if you use the result of that atomic operation inside a branch in that kernel, don't approximate it. So if you have an if based on that result, don't do it.
So it's pretty easy to provide safety, or come up with a good heuristic for when to apply approximation methods, within one kernel, because it's pretty straightforward; it's a small piece of code. But providing safety that you don't do anything crazy across the whole benchmark, considering all kernels, is hard, and we haven't done that.
We have several heuristics for each kernel, so we
can provide safety for you inside the kernel.
But the interaction between GPU and CPU and so
on, I don't know yet.
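As a rough illustration of that branch-on-atomic rule, here are two invented CUDA kernels, not ones from the benchmarks: the first uses the atomic's return value to decide who writes a header, so it is excluded; the second only accumulates output, so it stays a candidate.

    #include <cuda_runtime.h>

    struct Header { int width, height; };

    __global__ void unsafe_header_update(int *ticket, Header *hdr, int w, int h) {
        // The value returned by the atomic feeds a branch: exactly one thread is
        // supposed to win and write the header. Skipping threads or iterations
        // changes who wins (or whether anyone does), so the simple analysis
        // marks this kernel as not approximable.
        if (atomicAdd(ticket, 1) == 0) {
            hdr->width  = w;
            hdr->height = h;
        }
    }

    __global__ void safe_histogram(int *bins, const unsigned char *pixels, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        // The atomic's return value is ignored; it only accumulates output, so
        // the kernel remains a candidate for the data-access approximations.
        atomicAdd(&bins[pixels[i]], 1);
    }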
>>: Cool. I think we're out of time.
>> Mehrzad Samadi: Thank you. Thanks a lot. I appreciate it.
[End]