>> Aaron Smith: All right. So today we have Mehrzad Samadi from the University of Michigan. He's advised by Scott Mahlke. He's done two great internships here with us at MSR, and recently won Microsoft best paper. Yeah. Congratulations on that. >> Mehrzad Samadi: Thank you. >> Aaron Smith: He has a lot of other really interesting research in major conferences, and hopefully we'll get him here as a postdoc working on the Edge architecture. So today's talk title is Dynamic Orchestration of Massively Parallel Execution. I have a feeling it's going to be about GPUs, heterogeneous computing. Yep. So -- >> Mehrzad Samadi: Thanks a lot, Aaron. It's great, really great to be back here, and it's a good time to be here because of yesterday's game. Congratulations on that. Okay. Let me get started. This is the project that I've done during my studies at the University of Michigan. It's called dynamic orchestration of massively data parallel execution. The main goal of this project is to get the best performance for massively data parallel applications, specifically on GPUs, which are designed to accelerate these types of workloads. But let's see why GPUs. GPUs are everywhere these days, from supercomputers, servers, desktops, and even in your cell phones you can find GPUs. And their job is to give you good performance for data parallel [indiscernible]. How do they do that? They have many cores working all together on different data sets. And as a programmer, what you need to do is to launch thousands to millions of threads to get the best performance out of that, but -- >>: Can you answer a question about GPUs that I've never understood, which is we're in an energy limited era. Is this model of thousands to millions of threads the best, the most energy efficient way to extract this sort of DLP? >> Mehrzad Samadi: Okay. That's a really good question. I don't know about the whole thing, but so far, if you have enough data that you can feed these many cores, it is pretty energy efficient. >>: But don't GPUs, when you use that model, require you to do enormous amounts of data movement in and out, on and off the chip from DRAM, which is why they provide all that bandwidth? And that data movement burns a lot of energy. >> Mehrzad Samadi: Yes, that's completely true. And that's why they are trying to mix these together, put the CPU next to the GPU. That's what they're trying to do. >>: But what about the graphics model, the GPU model with, you know, highly, highly threaded code, that latency tolerance, right? So, you know, I just wonder whether, if you're energy constrained, whether this model is actually the right way to get the right -- the highest performance. >> Mehrzad Samadi: It's -- okay. It's two questions. Is it the right model to get the highest performance or energy -- >>: [inaudible]. >> Mehrzad Samadi: Oh, if you're energy limited. >>: Which we are. >> Mehrzad Samadi: Yes, that's true. So far it's pretty energy efficient if you are not energy limited. I mean, if you have a cable next to this and you compute, energy-wise you are good. But on the cell phone and something like that, I think -- >>: You're energy limited in servers, too. >> Mehrzad Samadi: On servers, yeah, that's true. >>: Everywhere. >> Mehrzad Samadi: Yeah, that's true. Probably not desktops. >>: But still, one of the highest compute per joule or whatever. >> Mehrzad Samadi: Yeah, if you compute that, it's pretty high, but -- >>: It's the highest compute.
I don't think it's the highest compute per joule. >>: There are alternatives, right? Anyways, so there's a presupposition here that this is the right model, and this model to me seems fundamentally energy inefficient, and we're in a energy-dominated era, and that's something I've never really understood, whether there's a better alternative or this is the only way to do it, pay the energy tax. So why don't we move on. >> Mehrzad Samadi: Okay. Let me say something. At least in desktops I know that if you don't consider moving data from CPU to GPU ->>: No, [inaudible] I'm talking about moving it in and out of the DRAM. >> Mehrzad Samadi: >>: Oh. I see. I'm not talking about migrating [inaudible]. >> Mehrzad Samadi: [indiscernible]. Okay. >>: I'm saying that, you know, if you have the data residing on chip and you can beat on it, it's really efficient. The model here is you're going to move -- you're presuming that everything is going to be a cache [inaudible], which is why you need so many threads, and so you're doing massive amounts of data movement into on and off chip. >> Mehrzad Samadi: Yes. Yes. >>: And that seems inefficient to me. all I'm saying. So that's >> Mehrzad Samadi: Okay. Sure. Okay. One problem is it might not be energy efficient, like I said. You have these millions of threads. How do you want to manage them together, manage them to get the best performance. So in this work, I want to show you how to manage all these threads to get the best performance out of them. So basically what I want to do is doing this guy's job. So let's see why GPU programming is hard. Here I show peak performance of NVIDIA GPUs over past seven years. And as you can see, it grows rapidly and we've been to 500 gigaflops, now we are around five teraflops for peak performance. Okay. Let's see what we can get with software. Here I took matrix multiplication from CUBLAS Library. CUBLAS Library is written by Evan [indiscernible]. It's highly optimized. It's best code available for metrics modification right now. As you can see, they get really good performance growth, but still there is a gap between what GPU can provide and what you can get writing your own software. But this is not the only problem of GPU programming. Another problem which I am trying to attack is performance portability. >>: Can I get [inaudible] -- >> Mehrzad Samadi: Sure. >>: Why doesn't it follow more the GTX 680 point? The CUBLAS cycle seems to be kind of linear with what the 480/580. >> Mehrzad Samadi: >>: Oh, what's happening in 680? Yeah. >> Mehrzad Samadi: Okay. I don't know if I can say it in the camera or not, but I talked with one NVIDIA engineer and I talked with another NVIDIA engineer, I heard two stories. So I will tell you both the stories afterwards, okay? So, go ahead, sir. >>: So is that the [indiscernible] limitation or is this just a software level limitation that you don't see as much as the peak performance? >> Mehrzad Samadi: Okay. It's -- there is no hardware limitation. It's probably all software, like ->>: I mean, the code is like, [inaudible] limitations or why is that [inaudible] ->> Mehrzad Samadi: Oh, if you -- yeah. There are both. Actually, this one is pretty optimistic. All you have to do here is multiply, add batch to batch. >>: Yes. >> Mehrzad Samadi: But in matrix multiplication you have other stuff and ->>: Yeah. >> Mehrzad Samadi: But still, if it's not -matrix multiplication is the best example. If you go other workloads, it will be, yeah. >>: [inaudible]. >> Mehrzad Samadi: Yeah. Okay. 
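For readers following the peak-performance discussion above: numbers like those on the plot come from counting back-to-back fused multiply-adds across all cores. The formula and the example cards below are an illustration added here, not figures read off the slide.

```latex
\text{Peak SP} \;\approx\; N_{\text{cores}} \times 2\ \tfrac{\text{FLOP}}{\text{FMA}} \times f_{\text{shader}}
```

For instance, a GTX 580 gives roughly 512 x 2 x 1.544 GHz, about 1.58 TFLOPS, and a GTX 680 roughly 1536 x 2 x 1.006 GHz, about 3.09 TFLOPS, which is the scale of the peak numbers the achieved CUBLAS results are being compared against.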
Another problem is performance portability. When you write a program, you have several assumptions in your mind. This is the GPU I want to optimize for, this is my input size, and so on. You write one implementation, which I call fixed implementation code, and optimize it for those assumptions. What will happen if those assumptions change when you are running? So -- >>: [inaudible] is it really a problem with the GPU? Because the same thing is true about CPU code, as well. >> Mehrzad Samadi: That's exactly -- >>: That's exactly correct. You can look at them [inaudible] -- >> Mehrzad Samadi: You have the exact same problems in the CPU world, but here, as I will show you in my results, the problem is more important because by changing one of these, you lose all your performance. Okay? >>: CPU -- the GPU is more sensitive. >>: [inaudible] there's a cliff. >> Mehrzad Samadi: Yeah, exactly. A deeper cliff. Okay. Let's see what the problems of fixed implementation code are. First one is device portability. You optimize for one GPU, but when you run it on another GPU, it doesn't give you the performance that you want. For example, I took matrix multiplication from the NVIDIA SDK. It's not as optimized as the CUBLAS Library, but it's pretty decent code. Basically, after taking a three-month course on CUDA programming, you can write this matrix multiplication. And as you can see, this is only one implementation; I ran it on three different GPUs. As you can see, your GPUs are getting better and better, but your code doesn't show any performance improvement. Portability is not specific to the device. There is another problem which I call input portability. When you write your code for some input size or even input shape -- for example, you write your code for a square-like matrix and you run it for some other shapes or sizes -- you will not get good performance. Here I have matrix vector multiplication from the CUBLAS Library, and the matrix size is the same. I just changed the shape. So here is a rectangle. Here is close to a square, and again a rectangle. As you can see, it gives you the best performance close to the square, but you will get almost zero performance when you have longer matrices. Third problem, third problem is when you have irregular dependencies. Something like indirect memory accesses. What will happen then? You don't know if your code is parallel or not, and if you want to write one fixed implementation, you need to be conservative and write sequential CPU code. So then what will happen if your code is actually parallel? You are losing all that GPU computation power. Here I show you several applications that have indirect memory accesses, and if I want to be conservative, I will run them on the CPU. Then this is the performance of the CPU, and I normalized it to the GPU; 100 percent is GPU performance. So by being conservative, you are losing this much performance, because these loops actually are parallel, but the compiler doesn't know they are parallel because they have indirect memory accesses in them. And the last one is value portability. Even the values that you are processing can have an impact on your performance. Let me show you an example. Here I have the performance of the atomic add operation in CUDA programming. Atomic operations are those with which you can update the same element atomically. So what the GPU does if several threads are accessing the same element is serialize those accesses.
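To make the atomic-conflict point concrete before the numbers are discussed, here is a minimal CUDA histogram kernel of the kind being described -- a sketch added here, not code from the talk. When several threads in the same warp read pixels of the same color, their atomicAdd calls hit the same bucket and the hardware serializes them.

```cuda
// One thread per pixel; conflicting atomicAdds on a bucket are serialized.
__global__ void histogram(const unsigned char *pixels,
                          unsigned int *buckets, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        unsigned char color = pixels[i];   // read this pixel's color
        atomicAdd(&buckets[color], 1u);    // serialized when colors collide
    }
}
```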
So on the X axis I have conflicts per warp, or 32 threads. As the number of conflicts goes up, you can see that performance drops down rapidly. Okay. What is the solution to all these problems? One solution is to ask the programmer to write different implementations for different GPUs, different input shapes, and parallel or not parallel, and write all of this, which is not practical. What I want to propose in this work is using a static-dynamic compiler framework. As a programmer, you write your parallel code once and give it to my compiler. We use several optimizations which are designed to target performance portability, and we generate several implementations with tuning parameters for you. And during runtime, based on these runtime properties -- device, input, whether it's parallel or not, and value -- we choose which kernel to run and tune that for you. That's the main idea of this work. And I tried to attack all these problems during my Ph.D., and I will talk about these three briefly and then go to the most recent one, value portability, which might be interesting for you. >>: How do you generate all the kernels? >> Mehrzad Samadi: Statically. I generate them next to each other. During runtime I decide which one to launch. So it will expand the code size. >>: So, but who produces them? Is this something that the compiler will do -- transformations and generate kernels? Or does the programmer have to write all of the kernels? >> Mehrzad Samadi: No, no, no. The programmer just writes one and the compiler generates multiple. >>: Okay. >> Mehrzad Samadi: So I have different optimizations. I apply this optimization or not. I increase this optimization or -- >>: There's no man behind the curtain? There's no wizard of kernel writing all the kernels, you know, that presumes that somebody takes the pain? You can generate them automatically? >> Mehrzad Samadi: Yes. >>: Okay. >> Mehrzad Samadi: I took the pain to write the code, yeah. >>: You took the pain to write the framework that generates the kernels or -- >> Mehrzad Samadi: No. >>: -- the kernels? >> Mehrzad Samadi: I wrote the framework that generates the kernels. >>: Okay. That's fine. >> Mehrzad Samadi: That's -- >>: That's [inaudible]. You're allowed -- you should be paying that. >>: So I guess this doesn't sound like a new idea, because I've heard for a couple years now about work on auto-tuners and -- >> Mehrzad Samadi: No, it's not a new idea. The combination of problem and solution is new. >>: But I mean, there's this notion of searching. So are you doing something different than just trying different kernels? >> Mehrzad Samadi: We -- yes, yes. But that's the main solution of this project. But inside each of these there are many new contributions. The way that we search, the way that we generate, the way that we tune, these are all new stuff. >>: Okay. I'll wait for the -- >> Mehrzad Samadi: Okay. Sure. Okay. First one is Sponge, which targets device portability. What we do here is we ask the programmer to write the code in StreamIt. StreamIt is a language designed at MIT. It's a function-based language. You have different functions and you know the explicit communication between those. So you know how many elements this function produces and how many elements this function consumes. And we get the GPU type, and we use several optimizations and generate CUDA code for you. Based on the GPU type, for example, if it has more registers, we use more register optimization to utilize more registers to accelerate the program for you. Second one is Adaptic. We saw this input portability problem.
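A rough sketch of the structure just described -- several statically generated variants sitting next to each other, with a small launch wrapper that picks one at runtime from properties such as the input shape. The variants and the threshold here are hypothetical stand-ins, not the actual Sponge or Adaptic output.

```cuda
// Variant 1: one thread per row -- fine when there are many short rows.
__global__ void mv_thread_per_row(const float *A, const float *x,
                                  float *y, int rows, int cols) {
    int r = blockIdx.x * blockDim.x + threadIdx.x;
    if (r < rows) {
        float sum = 0.0f;
        for (int c = 0; c < cols; ++c) sum += A[r * cols + c] * x[c];
        y[r] = sum;
    }
}

// Variant 2: one block per row with a shared-memory reduction --
// better when rows are long and there are few of them.
__global__ void mv_block_per_row(const float *A, const float *x,
                                 float *y, int cols) {
    __shared__ float partial[256];
    int r = blockIdx.x;
    float sum = 0.0f;
    for (int c = threadIdx.x; c < cols; c += blockDim.x)
        sum += A[r * cols + c] * x[c];
    partial[threadIdx.x] = sum;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {   // tree reduction
        if (threadIdx.x < s) partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) y[r] = partial[0];
}

// Runtime dispatch: the "which kernel to launch" decision made at launch time.
void launch_mv(const float *A, const float *x, float *y, int rows, int cols) {
    if (cols > 8 * rows)          // wide, short matrix: parallelize inside rows
        mv_block_per_row<<<rows, 256>>>(A, x, y, cols);
    else                          // tall matrix: one thread per row is enough
        mv_thread_per_row<<<(rows + 255) / 256, 256>>>(A, x, y, rows, cols);
}
```

Adaptic's real selection is among several such variants, each optimized for a range of shapes, which is what the plot discussed next shows.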
It's based on the sponge, but this time we had input portable -- input portability optimizations too. So this is the graph that I show you in the motivation. I show the results for three different input matrices, and blue line is adaptic and other one is CUBLAS Library. And this is different shapes for those matrices. As you can see, we are doing really great. We are better than CUBLAS, and the reason is that it's not actually one kernel like this. It's composed of five, six different kernels. So this one is optimized for just this range. After that it will drop. And this one is optimized for just this range, and after that it will drop. That's -- by using all of those, we can do better than their code. Any questions so far? >>: The input size changes over the course of execution? Do you also change the kernel or did you run >> Mehrzad Samadi: No. >>: -- 20 different kernels here, one for each point, or is it dynamically adapting if the input size is changing, or is the input size always fixed? >> Mehrzad Samadi: Input size is fixed when you launch the kernel. So when you are launching kernel inside your code, we check the input size and change the kernel that you want to launch. That will launch for you. >>: So do you know which kernel runs the fastest for the specific shape statically or is that something that you run dynamically and find out? >> Mehrzad Samadi: We can do both. We had some GPU model, which wasn't that great, but it was working for us because we just want to know if this code is faster or not. It doesn't give you actual cycles that it [inaudible]. And profiling always help. >>: So if you put the ideal numbers on top of this here, the fastest kernels you're able to find, I mean, how close, on average, does adaptic get? >> Mehrzad Samadi: For this I said it's pretty close, because it's -- matrix vector multiplication, it's a small kernel. So the way that we define it in a [indiscernible] is pretty efficient. And the best that you can get is close to this, at least to my knowledge as I can write. But if you get bigger kernels, then we have problem because it's not that efficient in the [indiscernible]. Second -- third one is irregular dependencies, which we come up with Paragon. It's -- you don't know your loop is parallel or not. What do you want to do? What we did ->>: If you can go back to this. >> Mehrzad Samadi: This one? >>: So for the places where there are big gaps, what did you learn? What was -- what was so optimal about the CUBLAS in those cases that you're adaptic approach? Was it like a memory layout issue? Registry issue? The two disparities in performance or ->> Mehrzad Samadi: Yes, exactly. For example, if you have few number of rows or few number of columns, your rows are small. I can put several of them into shared memory, but when they are writing one code, you can't assume that. So the way that we utilized resources on the GPU is different from CUBLAS Library. >>: So I imagine that you're doing things like piling and blocking? >> Mehrzad Samadi: >>: Yes. Big difference. >> Mehrzad Samadi: Yeah, we fuse the work of one thread -- several threads together to make them bigger or we make them smaller. Both we can do. The good thing about our input language is domain a specific language, so we can do all of them. We can fuse. We can fuse this way, horizontal, and we can do all of those, yes. >>: I guess I'm just trying to picture this. 
It seems like there's a large parameter space, like every flag I can set a different value, and even if I say I had a way of blocking, like there's tons of different parameters. >> Mehrzad Samadi: It's a -- >>: -- [inaudible] experiment over, and so are you -- >> Mehrzad Samadi: Yes. >>: Is there a pre-search step or is this, you know, best effort when you fire up this thing? >> Mehrzad Samadi: It's a best effort, basically. And the thing is, it works really well for smaller kernels, and when you go to big kernels, the domain will become bigger and bigger. But good for us, when you write your GPU code, usually you launch small kernels one after each other, not all the time. Okay. Paragon. You have your DoAll loop, which you know you want to run on the GPU. You have your possibly parallel loop, which you don't know if it's parallel or not. And you also have the sequential loop. It's -- the idea is pretty simple. When you see a possibly parallel loop, run it on both CPU and GPU at the same time. After Fermi, NVIDIA GPUs I think are able to launch concurrent execution -- it's called concurrent kernel execution. So at the same time we run the CPU and GPU together, and after the GPU finishes, we start checking for conflicts. If we don't find any conflicts, we stop the CPU execution and continue the execution with the GPU data. If we do find conflicts, we just throw the GPU results in the trash. >>: What's the mechanism in the CPU by which you take -- you interrupt the running kernel and tell it to ignore its results and jump on to the next one? >> Mehrzad Samadi: It's pretty simple, actually. There is one thread which is responsible for talking to the GPU, and the other threads are working threads. So that thread will get the results from the GPU and see if there is a conflict or not. Then it sends a flag, basically, to the others. >>: So you have one thread computing the L2 kernel on the CPU and you can just kill it and read the results from a different buffer on the communication thread? >> Mehrzad Samadi: I have many threads. I have one thread -- >>: L2 and -- on the CPU? >> Mehrzad Samadi: On the CPU, I have many threads doing L2. >>: All right. So the sequential L2 and sequential are all in the same -- >> Mehrzad Samadi: Oh, no. Yeah, yeah, yeah, you are right. I have two threads. One is sequential L2 and one is managing GPU-CPU. >>: Right. But what I'm saying is do you -- to -- if you have the case where the GPU correctly executes the code and you want to cancel the CPU L2 execution, what you're labeling as "stop" up there -- are you just killing the thread? >> Mehrzad Samadi: Yes. Yes. >>: On the CPU? >> Mehrzad Samadi: Yes. >>: And then the next sequential thing will pull the result, the L2 results, out of a different address space [indiscernible]. >> Mehrzad Samadi: Yes, because they are using different address spaces, CPU and GPU. Yeah. >>: Perfect. Thank you. >>: So how are you detecting conflicts? >> Mehrzad Samadi: Okay. That's a longer story. But basically what we are doing is we check all the writes and all the reads. So if we see two writes, it's a conflict. If we see a write and a read of the same element, it's a conflict. So we instrument this loop on the GPU to mark memory elements when there are writes or reads. >>: Who does that marking? >> Mehrzad Samadi: I generate -- I instrument the code to do the writing, to do the marking for me. >>: So it's like an automated tool, or do you use the programmer to mark this? >> Mehrzad Samadi: No, no, no. I'm -- as a programmer, you don't know anything.
You just write your loop. I will put that mark instrument instructions after your write or reads. >>: So that probably generates overhead in terms of the instructions you're adding, so have you measured how much that slows down the parallel code you're running? Is it worth parallelizing with all the instrumentation? >> Mehrzad Samadi: Actually, yes. I have all the results. I can show you. Can I show you after my ->>: Do that [indiscernible]. >> Mehrzad Samadi: Yeah, yeah, yeah, sure. Sure. I have the results in the paper, yeah. That's absolutely true. It has high overhead, but even with that high overhead, it's better than sequential CPU. >>: I have another question. So do you check for all threads and all the parallel computation or do you check in the beginning and then you just let the rest of the computation happen on the GPU? >> Mehrzad Samadi: Marking happens during the L2 execution and then checking happens after with another kernel that checks everything. >>: But you do the whole execution [indiscernible] ->> Mehrzad Samadi: Yeah, I do the whole execution. Then check everything. >>: Got it. Okay. >> Mehrzad Samadi: Thank you. Yeah. Sure. >>: So even given the migration cost, it still doesn't make sense to just run everything on the GPU? Because you've got to go over [indiscernible], you've got those edges between. >> Mehrzad Samadi: This is nothing. This is just one byte done flat. Stop or nothing. Stop or ->>: You're running -- I'm assuming the output of kernel one. >> Mehrzad Samadi: >>: Yes. Yeah, between L1 and L2. >> Mehrzad Samadi: Yes. >>: Yeah, between L1 and L2, so there's a cost there, right? >> Mehrzad Samadi: There's a cost there, yes. Sometimes it's good for us. Sometimes it's bad. Because if -- okay. One thing is if you don't change the input, you can launch while you're transferring at the same time. That's one thing. And sometimes -- and even if that's only one of these might be problem, because until then you find the conflicts and you don't do the extra one. Yeah. Even considering that, it will be useful compared to sequential code. >>: But alternatively, you could have parallel code on the CPU doing similar things, right? Say what transaction memory support. >> Mehrzad Samadi: >>: So is it better than that, as well? >> Mehrzad Samadi: paper. >>: Yeah, we compare it in our Oh, you do? >> Mehrzad Samadi: >>: Yes. Okay. Yeah. Great. >> Mehrzad Samadi: Okay. Value portability. What we did here, we used approximate computing. Basically we said that if this value is too expensive for you to compute, just ignore it. Drop it. But we need to make sure that it doesn't impact the output quality that much. So approximate computing is used in many domains, machine learning, image processing, video processing, and if I change the quality from 100 percent to 90 percent, still is a good picture. And the caution is can I use -- can I do less work and produce this 90 percent image while user is happy, because less work is always good. It gives you higher performance, lower power consumption. It's always good. >>: How is it possible computing use in machine learning? >> Mehrzad Samadi: For example, not sampling data, you're working on sample of data, not the whole data or something like that. Sampling, basically. For example, you want to do K means. >>: Okay. You call that sampling, not approximate computing. Sorry. >> Mehrzad Samadi: >>: Yeah. That's statistics. >> Mehrzad Samadi: Yeah, one of the techniques of approximate computing that we use is sampling. >>: Okay. 
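Stepping back to the Paragon conflict checking discussed a moment ago: a minimal sketch of what the read/write marking and the later conflict check might look like. The shadow-log layout and names are a simplification for illustration -- the real instrumentation is generated by the compiler and also has to disambiguate accesses made by the same iteration.

```cuda
// Speculative version of a possibly-parallel loop: do the work, and also
// log which elements each iteration reads and writes.
__global__ void loop_speculative(float *data, const int *idx,
                                 unsigned *writeLog, unsigned *readLog, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int src = idx[i];                           // indirect access, unknown statically
        if (src != i) atomicAdd(&readLog[src], 1u); // mark the read
        atomicAdd(&writeLog[i], 1u);                // mark the write
        data[i] = data[src] * 2.0f;                 // the speculative work itself
    }
}

// Checking kernel, run after the loop finishes: two writes to one element,
// or a read and a write of the same element, means the loop was not parallel.
__global__ void check_conflicts(const unsigned *writeLog, const unsigned *readLog,
                                int *conflict, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && (writeLog[i] > 1 || (writeLog[i] > 0 && readLog[i] > 0)))
        atomicExch(conflict, 1);   // one flag is enough to fall back to the CPU run
}
```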
>> Mehrzad Samadi: It might not be the exact answer by sampling. So it might have error. Okay. Here I don't want to lose 10 percent quality and give you 10 percent performance. What I want to do, give you two or three speedup by losing 10 percent quality. How can I do that? By doing two things. One is simplify or skipping processing those inputs sets that are expensive for GPU to compute. And the other thing is ignoring those that have lowest impact on the output quality. So we propose SAGE, which is trying to use approximation on graphics engine. This picture shows what we do not want to happen with SAGE. It's approximation -- we want to use approximation while the user is satisfied. So, again, as a user you write the program once. We automatically generate approximate kernels and we monitor it during runtime. >>: So how do I as a programmer, when I write my program, how do I specify what's approximate and what's not? So for instance, your example, I don't want to screw up the image header, but it's reasonable to screw up a couple pixel values, for instance. >> Mehrzad Samadi: Yes. What we have is we get -- that's in my slides, too. >>: [inaudible]. >> Mehrzad Samadi: Let me give you brief answer. We will get output code evaluation metric from you, so we know what is valuable for you. So while we are approximating, we check that evaluation metric, see if this approximation is good or not. But always it's great to get hints from the programmer, too. I'm not working on that part, actually. I'm just working on the approximation part. but there are great papers that target that. Okay. Let's look at overview of SAGE. You write your program. Statistic compiler generates multiple kernels for you, multiple approximate kernels, and with different tuning parameters, and during runtime we monitor the quality and you'll use those tuning parameters to control quality. If you look at the static part in detail, you get -- I get the input CUDA code. I get something what we call target output quality or TOQ. Target output quality, I will use this TOQ many times during my talk. So it means that if it's 90 percent, it means that you're willing to lose 10 percent quality for speedup. So 90 percent is good for me. And as I said, we get the evaluation metric from the user, like how do we compute the quality. And we have three approximation techniques that I will talk about, and we generate approximate kernels with tuning parameters for you. What will happen in runtime. We have three main units. First one is preprocessing which is done on CPU [indiscernible] GPU, makes the data ready for approximate kernel. Then we need to find configuration that gives us a quality better than TOQ with good performance. For that we use tuning. We start from 100 percent quality, exact version. We start approximating more and more. You can see the quality drops. At the same time, hopefully speedup goes up and we will stop when we reach close to TOQ. Sure. >>: So can you give me -- I know you're probably going to get to it, but I could use a little high-level overview. What are the knobs that you're tuning for approximation? >> Mehrzad Samadi: Okay. Let me say something else. These are completely independent invocation of the same kernel. So suppose you are doing face detection on a database of images. So each point is one complete face detection, and the knobs that I have is, for example, sampling rate. >>: [inaudible]. 
>> Mehrzad Samadi: Sampling rate, I will say in my -- how many samples of data -- of my input sets I will look at. Like I will look at only 80 percent of my image for doing face detection. That's the easiest approximation method. And -- >>: So they're domain specific? I mean, that seems like a domain specific approximation. >> Mehrzad Samadi: They are based on some assumptions on the domain, but the good thing is we can apply those approximations, and if they don't show any good quality, we can throw them out -- we can always go back to the exact version. That's the good thing about this. So we can make mistakes. >>: So Martin -- I think Martin Rinard from MIT had worked on code perforation. >> Mehrzad Samadi: Yes. >>: Is it -- can you compare and contrast? >> Mehrzad Samadi: Yes. That's -- yes, I have the slide for that. >>: Is that a yes or a no? >> Mehrzad Samadi: Yes. Oh, yeah, yeah. [laughter] I have those results in one slide, yeah. >>: Okay. I'll wait for it. >> Mehrzad Samadi: Yeah. Code perforation or loop perforation is a well known technique. What they do is they skip iterations of a loop. That's one way. So you can skip more iterations or fewer iterations. That's the knob that you can change. >>: I think part of the problem you're running into here is that approximate computing is a hot topic now and everyone thinks of it a little bit differently, because it's not really yet well defined, because there's not a good taxonomy. And so you can think of it as doing transformations that cause loss of data. You can think of it as using lower precision operations. You can think of it as, you know, sampling fewer data points. You can think about iterating less on some gradient descent algorithm. >> Mehrzad Samadi: That's a great way to sum up the approximation. We are really new in this thing, and everyone is saying different things. So we are trying to make sense of that. >>: Yeah. So maybe if you -- I think -- I mean, are you limiting yourself to subsampling in your approximation optimizations? >> Mehrzad Samadi: I have precision -- >>: Or what are the classes of approximation that you're leveraging? Let's put it that way, just so we're all on the same page. >> Mehrzad Samadi: Okay. If you give me two minutes, I will go to those slides, like what are -- is that okay? >>: Sure. >> Mehrzad Samadi: Okay. The main goal here, I want to do better than loop perforation. I don't want to just drop iterations without knowing the hardware. I want to come up with approximation methods that -- >>: I get that. I was just trying to help you get a clearer -- so that everyone is not hearing you talk and then hearing -- listening to you talk and hearing something different. >> Mehrzad Samadi: Okay. Sure. Let me give you a quick overview. I will do loop perforation on atomic operations, for example. But I will drop those atomic operations that have more conflicts. So I'm not dropping just randomly. I'm dropping those that have more conflicts. Precision is the second one. If I don't need precision, I can always compress. And compressing more means reducing quality. >>: So what's the mechanism by which you lose precision? >> Mehrzad Samadi: We use fixed point, like we do quantization. >>: Good. Operations? >> Mehrzad Samadi: Yes. What's that? >>: The type of operations -- you're doing lower precision operations. >> Mehrzad Samadi: Yes, yes. Yeah, exactly. Exactly. And the last one is sometimes you're working on the same element, different threads working on similar elements.
Just drop those, and just one thread doing work produces those for all of them. >>: So you can think of that as sampling reduction, because you're doing fewer points but you're trying to be smart about how you're using them? >> Mehrzad Samadi: Yes. Yes. >>: So do you have a well thought-through way to choose when you apply which approximation technique? Is it all manually identified? How do you know when to use -- you just articulated three classes of approximation techniques, and there are certainly others that people in this room are thinking about. >> Mehrzad Samadi: >>: Yes. How do you know when to use which? >> Mehrzad Samadi: I'm -- okay. Right now I don't think about, like, is it safe to apply the optimization or not, which is really great topic for research. What my goal is, is just doing this for performance. So when I apply those optimization for getting performance, I have an algorithm that I will explain. >>: But for a given kernel, are all of those algorithms in play, all those different types of approximation, or do you use -- do you select different ones based on the algorithm? >> Mehrzad Samadi: Based on opportunities. If I see the opportunity for first one or second one or third one, I ->>: Who decides whether you see the opportunity? Is that your tool? Is it the programmer? >> Mehrzad Samadi: tool. It's my tool. It's all my >>: So I find that surprising because that seems like a lot -- a lot of insight for the tool I have that it's safe to drop this down into a fixed point or you can drop atomic operations. I mean ->> Mehrzad Samadi: Okay. Here is the thing. I will use fixed point when I'm reading a large matrix. I don't know if it impacts the quality that much or not. >>: All right. So you have some heuristics that let the tool that say this is when to turn this class of approximation optimizations on. >> Mehrzad Samadi: I just look, can I apply this -- I don't care about quality. Can I apply this approximation on this or not? Then I have a runtime to tell me that you shouldn't use this approximation. >>: You have to have some heuristics to guide the tool to know when to do it. >>: I believe that you ask the user to write a function that you basically call to evaluate whether something still meets a quality bar or not? >> Mehrzad Samadi: Yes. >>: I'm not even talking about the quality bar. I'm just saying, you know, if I give it a kernel and I've got floating point, do you just -- do you say, hey, I'm going to try converting all the floating points to fixed point for any kernel, or do you have some smarts that let's you say, hey, this might be a good place to try this optimization? >> Mehrzad Samadi: But this [indiscernible] usually have two or three input matrix, for example. And I know which one I'm reading in a loop. So that might be useful. I have some heuristics to find out which approximations might give you good performance. >>: Okay. >> Mehrzad Samadi: But I don't know -- don't have any idea which approximation gives you good quality. >>: Yep. >> Mehrzad Samadi: the -- Quality will be done in >>: You're always [inaudible] turn them on pretty aggressively when it seems like it's possible. >> Mehrzad Samadi: Yes. It's not -- yeah. have heuristics for that, but ->>: Okay. >>: Can you walk us through the -- I >> Mehrzad Samadi: Sure. Okay. Okay. It will stop when we reach close the TOQ, and we continue the execution with that configuration. But this quality might change for different invocations, so we have calibration part. 
We check the quality every [indiscernible] N invocations. So if it's better than TOQ, we increase the interval between two checks, because we want to reduce the overhead of calibration. But what will happen if it goes below TOQ, like here? We do two things. First of all, we decrease the aggressiveness of approximation, so hopefully quality will go up and at the same time the speedup will drop. And we reset the calibration interval to the minimum, because we want to check more often to make sure that our new decision is good. Okay. >>: Can you repeat the section that fell below? >> Mehrzad Samadi: I will talk about this, but for measuring quality I will execute the exact version and the approximate version and compare the results. For the user, actually, here it is 100 percent quality, because I actually ran the exact version. >>: Yeah. On that sample, but it's possible that you -- >> Mehrzad Samadi: Yeah, I missed those. I missed those. I will talk about that. >>: So for your quality function that you use to measure how accurate your results are, you have the actual output, right? You have the correct 100 percent quality output? >> Mehrzad Samadi: When I'm checking -- when I'm checking the quality, yes. >>: Why are you running -- I mean, if you have to compare the quality, then why are you running the code at all? Right? Because if you have the [indiscernible], then there is no need to run the code, because you have the [indiscernible]. >> Mehrzad Samadi: That's the reason we are checking every Nth invocation. If I had the quality, I wouldn't need to check anything. For these points I have the quality. For these I don't have any quality. So you are now an oracle -- you know everything right now -- but my framework will see just these points. Sure. >>: So you can probably put a statistical bound on the likelihood that you're going to actually -- between these T invocations, you're going to miss a quality check. >> Mehrzad Samadi: I will try to do that, see if that satisfies you or not. Okay. >>: Let's go to approximation methods. [inaudible]. >> Mehrzad Samadi: Okay. Great. First one is atomic operations. As I said, atomic operations are those that update the same element atomically. So here is an example of computing the histogram of one image. You have a for loop that goes over different pixels. You read the color of that pixel and add to the bucket corresponding to that color. What will happen if several pixels next to each other are trying to access the same bucket? You have conflicts. Several threads are accessing the same element. What the GPU does is serialize this, so it will do them one after each other. These subsets are all conflict free. So as I said, more conflicts means lower performance. So again, we have the histogram here. How can I approximate this? If you look at this loop, it goes over all pixels, which I call iterations. And if I launch several threads, each thread will be responsible for doing some of these iterations. These threads will do two pixels each, for example. We execute the computation for two pixels. So my approximation method will drop, per thread, the one iteration that has the maximum number of conflicts. It computes the number of conflicts for each and drops the one that has the maximum conflicts. I will show you how. So if I drop one iteration per thread, I'm actually dropping 50 percent of the iterations. How can I change that? I need a tuning knob, right? How can I change this dropping rate? I can reduce the number of threads, for example.
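A sketch of that drop-the-worst-iteration idea for the histogram, using warp match intrinsics (compute capability 7.0 and up) as a stand-in for the PTX-level conflict counting mentioned a bit later in the talk; the bookkeeping follows the running-maximum scheme that is walked through next. The names and the ITERS knob are illustrative, not SAGE's generated code.

```cuda
#define ITERS 4   // tuning knob: more iterations per thread = lower drop rate

// Each thread handles ITERS pixels and skips the single iteration whose
// bucket saw the most intra-warp conflicts.
__global__ void histogram_approx(const unsigned char *pixels,
                                 unsigned int *buckets, int n) {
    int base = (blockIdx.x * blockDim.x + threadIdx.x) * ITERS;
    int pendingColor = -1;         // deferred iteration: the conflict max so far
    int pendingConflicts = -1;
    for (int k = 0; k < ITERS; ++k) {
        int i = base + k;
        if (i >= n) break;
        int color = pixels[i];
        // How many active lanes in this warp are hitting the same bucket?
        int conflicts = __popc(__match_any_sync(__activemask(), color));
        if (conflicts > pendingConflicts) {
            // New worst offender: run the previously deferred one, defer this one.
            if (pendingColor >= 0) atomicAdd(&buckets[pendingColor], 1u);
            pendingColor = color;
            pendingConflicts = conflicts;
        } else {
            atomicAdd(&buckets[color], 1u);   // not the maximum, execute it
        }
    }
    // The iteration still deferred here had the most conflicts: it is dropped.
}
```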
Now each thread is doing more pixels and dropping one means 25 percent. Okay. How can I drop the iteration with the maximum number of conflicts per thread? Here in this example I have four iteration, and you can see that number of conflicts is written next to each iteration. So I need to skip iteration two because it has the maximum number of conflicts. You don't know this when you run your program. So we come up with the software conflict detection, which has lower overhead than actual atomic operation. We use some of the PTX instruction of CUDA programming, and we keep the maximum conflict so far as we're executing. So we check the conflicts for iterations zero. Right now it has the maximum number of conflicts. Then we go to check the conflicts for iteration one, so this has more conflicts, so maximum conflicts will be iteration one. So now we can run iterations zero, because it's not maximum anymore. Again, check the conflicts for iteration two. It is the maximum, so we can run iteration one because it's not the maximum anymore. Check for -- check the conflicts for iteration three. It's not greater than iteration two, so we can run that iteration. So we basically escape iteration two which has the maximum number of conflicts. And it will give you good performance if these checking conflicts is really light overhead, has really light overhead. >>: Who's checking? >> Mehrzad Samadi: >>: What's that? Who is checking this? >> Mehrzad Samadi: This is -- it's our instructions inside the code that we put. >>: On CPU? >> Mehrzad Samadi: >>: On GPU. On GPU. >> Mehrzad Samadi: On GPU, yeah. For -- before doing each atomic operation we have instructions that check the conflicts. Okay. Second one is data packing. The one that I said I reduce the precision. Sometimes you don't need full precision, so what you can do is use half of the beats, for example, and put them together. Now you have fewer memory requests to access the whole input sets. And this technique is really good for memory-bound kernels because it just reduced the memory access. It doesn't change the computation. And the third one is thread fusion. In some domains neighbor elements have similar values. For example, in the [indiscernible] processing, you have your gray area, white area in the image. So those have similar values. So what you will end up having two threads working on similar values doing the same computation and generating similar output. So what I can do is fuse these two threads together and now I read only one element, do computation once and write two outputs. So I can change the approximation, aggressiveness approximation for how many threads I want to fuse. And it's really good for computation-bound kernels because it doesn't change that much memory because these are in the same cache land. So when you access this one, this one will be in your cache, too. Okay. Let me show you how runtime actually works. First, how do we compute the output quality? We run the approximate version, we run the exact version, then we run the evaluation metric. It has huge overhead. It's not possible to do it for every time because we can run it, run the exact version only. And it's really important for tuning, because during tuning you want to check the quality every time because we want to converge to the good solution, so we use the Greedy algorithm. Let me explain our Greedy algorithm. In this example you have TOQ called 90 percent. It means that you are willing to lose 10 percent of quality. 10 percent is lucky. 
We start from the exact version with quality of 100 percent and a speedup of 1x. There is no speedup. K(x,y) means that you have one kernel, you apply two approximations methods on it, and x and y -- x is tuning parameter of the first approximation method and y is tuning parameter of the second approximation method. So more means more aggressive. So we start from the exact version and check its children. Each child is the parent returning one more, means it's more aggressive than the parent for one of the approximation methods. So we check both children, and both of them have better quality than TOQ, so we choose the one with better speedup because it's a Greedy algorism. So we choose the one on the right. Continue the process with this one. We again check the children. Both have the better quality than TOQ, so we choose the one that has better speedup. So 2.5 x the speedup. And finally, when we check its children, we can see that both of them have lower quality than TOQ, so both of them are not good, so although they give us better performance. So this is our final choice. Go ahead. >>: So is this over a single input [inaudible]? >> Mehrzad Samadi: It's ->>: It's not over a single. It's over many inputs [inaudible]. >> Mehrzad Samadi: time. It's over -- one input at a >>: Ah. So is there -- you're assuming here then that both quality and speedup are not necessarily a function of the input, right? They generalize ->> Mehrzad Samadi: They are actually a function of input, and that cause errors in our system, yeah. >>: Okay. error is? Have you qualified how much that >> Mehrzad Samadi: I will show you -- I will show you the run for hundred ->>: Great. >> Mehrzad Samadi: But the thing is, when you launch this for two different inputs, you might see the differences. But this and this, this one is always better than that one for different -even different inputs. >>: So it's just to say that qualitatively -quantitatively, it really doesn't matter if qualitatively the results are [inaudible]? >> Mehrzad Samadi: Yes, yes. That's -- yeah. So this is our final choice. We continue the execution with this one. But we store this tuning pass because during calibration, when we see that, okay, quality goes below TOQ, we can go back one step, one node in this pass to make sure that quality becomes better. So let me show you some results. For evaluation we change the back end of Cetus compiler. It's a source-to-source compiler. It's compiled C like code to C like code, and we ran it on GTX 560. It's the Fermi Intel Core i7, and we choose several benchmarks from image processing and machine learning. This is the results for K-Means for 100 input sets, and as you can see, here you have accumulative speedup, and then you have output quality on top, and TOQ is 90 percent. So I start with tuning. As you can see, you have approximate version and exact version. That's why it goes to hundred percent quality at both nodes. And at speedup you can see during tuning is like you have two x slowdown because you do two times. Basically each input set is done two times. So we start from exact version and we come down until we reach close to TOQ. At that time we continue the execution, and we check every ten invocation here. So quality is better than TOQ, so we increase the calibration interval. We increase the calibration interval, we increase the calibration interval, but we are losing this, basically, because we don't check all the invocations. 
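A host-side sketch of the tuning-then-calibration loop just described. Every name below (Config, launchApprox, measureQuality, and so on) is hypothetical scaffolding, not SAGE's real interface, and the greedy choice between the two children is collapsed into a single helper.

```cuda
#include <algorithm>

// Hypothetical host-side interfaces standing in for SAGE's runtime.
struct Config {};                            // one point in the tuning space, K(x,y)
Config exactConfig();                        // no approximation applied
Config moreAggressiveChild(Config);          // bump one tuning knob (greedy child)
Config lessAggressiveParent(Config);         // undo one step along the tuning path
void   launchApprox(Config, int invocation); // run the approximate kernel
void   launchExact(int invocation);          // run the exact kernel on the same input
float  measureQuality(int invocation);       // compare the two outputs
const int MAX_INTERVAL = 64;

void runWithSage(int invocations, float toq) {
    Config cfg = exactConfig();
    int inv = 0;

    // Tuning: run approximate and exact versions on every input (hence the 2x
    // slowdown in the plot), getting more aggressive until quality nears the TOQ.
    for (; inv < invocations; ++inv) {
        launchApprox(cfg, inv);
        launchExact(inv);
        if (measureQuality(inv) < toq) { cfg = lessAggressiveParent(cfg); break; }
        cfg = moreAggressiveChild(cfg);
    }

    // Calibration: afterwards, re-check only every `interval` invocations,
    // growing the interval while quality holds and resetting it on a miss.
    int interval = 1, untilCheck = 1;
    for (; inv < invocations; ++inv) {
        launchApprox(cfg, inv);
        if (--untilCheck == 0) {
            launchExact(inv);
            if (measureQuality(inv) >= toq) {
                interval = std::min(2 * interval, MAX_INTERVAL);
            } else {
                cfg = lessAggressiveParent(cfg);   // back off one node
                interval = 1;                      // and check again soon
            }
            untilCheck = interval;
        }
    }
}
```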
>>: [inaudible] there is no guarantee that you meet the TOQ, what was the -- >> Mehrzad Samadi: The guarantee for the quality? >>: Yes. You don't have that? >> Mehrzad Samadi: Yeah. We don't have that. >>: Okay. >> Mehrzad Samadi: Yeah, we don't have that. I have two solutions for that. I will say one of them right now and I will keep one of them for the future work. Okay. One way that we can make ourselves confident about our work is to do more calibration at the beginning. So what we do is check more often to increase the confidence about the quality, then increase the calibration interval. For example here, I assume that your quality is uniform over some range -- has uniform probability over some range -- and the likelihood is binomial: is it better than TOQ or is it lower than TOQ? So this is the confidence interval. If I check 50 times, I will be 93 percent sure that 95 percent of my invocations meet the TOQ. So if I check more often, I can increase the confidence level of the user. Then I can increase the calibration interval. But still, there is no guarantee. I'm just -- it's statistical. And let me show you the calibration interval, the calibration overhead, too. I show it for two of our benchmarks, Gaussian filtering, which is a blurring of the image, and K-Means. This is the number of invocations between two checks. Around 40 and 50 we have about 5 percent overhead from checking. If we check every 40 or 50 invocations, we have 5 percent overhead. And for overall results, I show you two TOQs, one 95 percent and one 90 percent. And I compare it with loop perforation. Loop perforation is a great technique, but the thing is it doesn't know about the hardware. That's why SAGE gives you better performance than loop perforation itself, because we kind of know which iterations to drop compared to their work. And you can see that we get 2x speedup by losing 5 percent quality and 2.5x speedup by losing 10 percent quality. >>: So you have a pretty weak assumption on the -- which is good -- on the quality is uniformly distributed, right? >> Mehrzad Samadi: Yes. >>: So you could make these numbers -- and if that's what the numbers are using to set these priors -- >> Mehrzad Samadi: Yes. >>: -- if you actually looked at the quality of your output, you could probably do a lot better, right? >> Mehrzad Samadi: Yeah, but computing that quality is another problem. >>: Sure, but do it offline, right? If you already have an assumption of -- that your inputs are representative across all other inputs. >> Mehrzad Samadi: Yes. >>: Right? So if you could estimate that quality by some subset of the inputs -- >> Mehrzad Samadi: Yes. >>: -- for a particular domain -- >> Mehrzad Samadi: Yes. >>: -- that then gets you away from your uniform [indiscernible], which provides no information at all. Right? >> Mehrzad Samadi: Yes. Basically what we are doing is a similar thing to profiling, but we are doing it over time. Instead of profiling at the beginning, offline, we are doing it while we are running. So, yeah, that's true. We can do it offline and come up with a static solution. Okay. As a conclusion, we generate approximate kernels and we monitor those for you during runtime, and we got 2.5x by losing only 10 percent quality with SAGE. But there are two limitations with SAGE. One is, if you have atomic operations inside your kernel, it's perfect. You use SAGE and that approximation method will give you the best performance. But what if you don't have that?
If you don't, the applicability of SAGE is limited, so I tried to solve this one. And the other one is the invocations that we are missing, which I will talk about in the future work. First, let's see how I tried to increase the applicability of SAGE. We propose a new framework which we call Paraprox. In Paraprox we look for common patterns in data parallel applications: map, partitioning or tiling, reduction, scatter/gather, stencil, or scan. It's based on the book by Professor Bacul [phonetic], actually. We detect these patterns inside your kernel and we have an approximation method for each pattern. So we run that. We use that approximation method for each pattern. Let me show you, like, three of these approximation methods that we use. For map, where you read several elements, do computation, and generate several other [indiscernible] -- you don't do reduction or something like that -- we replace these functions with a lookup table, basically. It's called approximate memoization. For that, this function should be pure, so it shouldn't have any side effects, so that we can replace it. So how do we detect pure functions? Finding pure sections inside the code is really hard. It can be done, but it's really hard. What we do is look at functions that are written by the programmer. If they are pure, we replace them with those lookup tables. And if you look at the lookup table, first we use quantization. So for each input that comes, there are some quantization levels. We find the nearest quantization level and output that quantization level's corresponding number. So there are several bits assigned to each input, and when we put all those bits next to each other, we have one big address which goes to the lookup table and gives us the result. And this lookup table is filled before the execution, so it will be filled offline. >>: Is your quantization just I chop off bits, or is it you look at particular bits in the input pattern? >> Mehrzad Samadi: We know the min and max, okay? We quantize this -- >>: I see. >> Mehrzad Samadi: -- range, and if an input comes, it's the first level, it's the second level, or something like that. >>: Do you know that min and max from profiling or -- >> Mehrzad Samadi: Yes, from profiling. >>: Okay. >> Mehrzad Samadi: And if we are wrong, it's -- it doesn't matter. It can be the max or min if it's out of the range. >>: [inaudible] multiple inputs, you're quantizing different inputs into different ranges, right? This is not really shown there, or is it -- >> Mehrzad Samadi: Yeah, yeah, these are -- >>: [inaudible] wrong. >> Mehrzad Samadi: Yeah. It should be like one quantization here, one here, one here, one here, yes. >>: [inaudible]. >> Mehrzad Samadi: Those bits -- those bits -- the numbers of bits are different. >>: Okay. But what [inaudible] operations you're doing, I see a -- >> Mehrzad Samadi: It's a shift. Basically I have the quantization level for this input, I shift it over the next one. I put them together, basically. >>: [inaudible]. >> Mehrzad Samadi: Put everything together to make it. >>: So that will, for different inputs and different numbers of inputs, it's going to generate a different number of bits in the address, right? How do you manage that? >> Mehrzad Samadi: For different numbers of inputs, I know the number of inputs from the kernel. >>: Okay. >> Mehrzad Samadi: This is -- but for different inputs I don't change anything, right? I know these numbers from profiling.
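A sketch of the lookup-table replacement (approximate memoization) being described, with made-up bit widths, ranges, and a two-input function; Paraprox's pattern detection and table filling are more involved. Each input is quantized against a profiled min/max range, the level bits are shifted together into one address, and the precomputed table entry stands in for the original pure function.

```cuda
#define BITS_A 6                       // quantization bits per input (hypothetical)
#define BITS_B 6
#define A_MIN (-1.0f)                  // profiled input ranges (hypothetical)
#define A_MAX ( 1.0f)
#define B_MIN ( 0.0f)
#define B_MAX (10.0f)
#define TABLE_SIZE (1 << (BITS_A + BITS_B))

// Filled offline: lut[addr] = pure_function(level_a, level_b) for every
// pair of quantization levels, then copied up with cudaMemcpyToSymbol.
__constant__ float lut[TABLE_SIZE];

__device__ int quantize(float v, float vmin, float vmax, int bits) {
    int levels = (1 << bits) - 1;
    float t = (v - vmin) / (vmax - vmin);
    t = fminf(fmaxf(t, 0.0f), 1.0f);        // out-of-range inputs clamp to min/max
    return __float2int_rn(t * levels);      // nearest quantization level
}

// Map pattern: out[i] = pure_function(a[i], b[i]) becomes a table lookup.
__global__ void map_memoized(const float *a, const float *b, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int qa = quantize(a[i], A_MIN, A_MAX, BITS_A);
        int qb = quantize(b[i], B_MIN, B_MAX, BITS_B);
        int addr = (qa << BITS_B) | qb;     // shift the level bits together
        out[i] = lut[addr];                 // precomputed, approximate result
    }
}
```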
And the good thing about this is you can assign more bits to more important inputs and fewer bits to less important inputs. For example, if one of those inputs -- excuse me. Sorry. One of those inputs is always constant during your profiling. You can assign zero bits to that and put that constant so it will be there. >>: So you decide to run different importance of the input string of profiling place? >> Mehrzad Samadi: >>: Okay. >> Mehrzad Samadi: >>: Yes. Yeah, exactly. But by shifting and ordering, it's a pretty restricted domain that you can actually -functions you can approximate. So if I have an X or ->> Mehrzad Samadi: No, no, no. the approximation part. >>: No, this is not It's the index into a function. >> Mehrzad Samadi: It will go to this -- it's just an address to this table. My pre-computed results is already in that table. >>: Okay. >> Mehrzad Samadi: So I'm just making this address to look in that table and, so this, the whole thing will be replaced by this. >>: Got it. >> Mehrzad Samadi: >>: This is fixed in our other -- Yeah. >> Mehrzad Samadi: Okay. Awesome. For other kernels too. The second one is similar to what we have in SAGE. Something like image processing, neighbor limits have similar values. So here I have -I'm showing the difference of one pixel with its neighbors, so for ten different images. So what this graph shows, about 80 percent of pixels have less than 10 percent difference with their neighbors, so most of them are similar, basically. So what I can do for tiling, instead of accessing this whole tile, I can access the center of it or one row or one column, and that might be a good representative of the whole tile. And for reduction we use loop perforation. We -instead of adding N elements, we add N over two elements and then multiply results by two. For this one, we use Clang as a compiler. We get the CUDA code. We visit the AST. We find those patterns and our action generator generated those approximate kernels and rewrite those. Use the same GPU Intel Core i7, and this time we have wider range of applications because we can do -- we increase the applicability of the approximation methods. And this time we generate open CL code two, so we can run on the CPU two, and here I'm not comparing GPU to CPU. Approximate version of GPU compared to exact GPU, approximate CPU compared to exact CPU. So that's TOQ is 90 percent, and you can see that. On the GPU and CPU we get similar speedup, 2.6, 2.7 x speedup by losing 10 percent of output quality. >>: I just want to make sure that we see the numbers are wall clock time? >> Mehrzad Samadi: >>: Yes. You can compare CPU-GPU. >> Mehrzad Samadi: Yeah, exactly. Okay. These are my works, and let me tell you a little bit about the future work that I want to do. >>: [inaudible] ten minutes. >> Mehrzad Samadi: Okay. Ten minutes. Okay. How can I solve this? The way that I decided to solve in SAGE was actually we were really conservative. Every time we saw that output quality is lower than TOQ, we dropped the speedup in SAGE in the one that I propose. So we never increase the aggressiveness because that might cause so many problems. So one way to do that is when quality is too good, we check the quality, it's really good, you drop the -- you increase the aggressiveness and speedup will go up. This is what I call it nonconservative. This might miss more invocations because of bad quality and so on. So just keep this in mind. What I'm proposing here is collaborative CPU-GPU output quality monitoring. 
I'm doing monitoring on CPU while running on GPU. The first problem is CPU cannot keep up with GPU. I can't expect CPU to run the exact version, approximate version, compare those two, and give the results to the GPU. So I will have partial output quality monitoring, which is another approximation level on top another approximation. Means that we are doing approximate checking for approximation. So instead of checking, for example, the whole image, I just choose a tile of that image in the CPU, apply the exact kernel, apply the approximate kernel, and compare the results. The first question that I need to answer is how can I generate its code. I don't know yet, but I can use something like that Paraprox. And how to choose which tile to use for partial output quality monitoring, and that's another question, but right now we are using uniformly distributed. So we don't choose tile. We prove one pixel here, one pixel here, one pixel here, one pixel here. And how do you make decisions? Suppose your quality comes 80 percent. Do you increase, do you decrease, something like that. Right now I'm checking for three different configurations at the same time on the CPU, and based on which one to use, based on their quality, I will use which one to use for that invocation. So each time I check for data set one while running GPU is running for data set zero. So I'm checking ahead, basically. So I have some preliminary results which I can show you. I use two benchmarks, mean filtering, which is used for blending images and mosaic, which you make one big image with small images. And this bottom figure is split up and top figure is how many images you missed. Basically the quality is lower than TOQ, but you didn't see those. And I ran these benchmarks on 1600 image of flowers. Basically. I downloaded it. So what are these implementation? First I have conservative fixed interval. I just drop the approximation and interval between two checkings are fixed. Then I have conservative adaptic interval, which is what we proposed in SAGE. Nonconservative fixed interval you can increase the aggressiveness, but still checking a fix. And nonconservative adaptic interval. And the last one is CPU and GPU. As you can see, these conservative ones show really low speedup compared to nonconservative ones, but they're missing fewer images. But nonconservative ones is like missing like more than 30 percent, like at some point 50 percent of images don't have good quality. CCG, like which we use CPU for quality monitoring, gives you good, better speedup than all of them, and about -- they are losing 5percent of images, basically, that they don't show any good quality. So I -- another future work that we are currently doing is right now I just talked about single kernel, single device. So what -- at each time you have only one GPU and you run one device, one kernel on it, but what about single-channel multiple devices? We asked the programmer to write the code like he has only one GPU in the system. But what we can do for him is to generate code for different GPUs that are in the system and also on the CPU. These two work in parallel and at the end we measure results for the programmer. And we might do multiple kernel, multiple devices, too. And conclusion. Programming GPUs is hard. It's really hard to write an efficient code, and we can't ask the programmer to write those. So we -- what we want to do is to ask the programmer and we want to help the programmer to generate multiple optimal versions. 
And the conclusion: programming GPUs is hard. It's really hard to write efficient code, and we can't ask the programmer to write all those versions. So we -- what we want to do is help the programmer by generating multiple optimized versions. And if we are allowed to use approximation, we can show some good performance. Thanks a lot. Sorry -- too much.

[applause]

>> Mehrzad Samadi: Thank you. Thanks. I appreciate it.

>>: It's an interesting talk. I'm still hung up on this TOQ.

>> Mehrzad Samadi: Yeah.

>>: TOQ? I mean, who decides that and --

>> Mehrzad Samadi: Oh, yeah. Yeah. That's a great question, and -- yeah.

>>: Because my -- you know, if you go back to image processing, if you're talking about 90 percent of the pixels being the same or -- I mean, maybe my eye can't tell the difference. It seems very -- that metric seems very specific to the user.

>> Mehrzad Samadi: I completely agree with you. It's not one setting for the whole world at all. It's something really important, and right now we are just assuming that the programmer provides that. But that should be --

>>: It makes a lot of sense that the user writes a function that, based on the approximated output, computes the TO -- computes the quality of the output, like -- I don't know, like something. Like for DF -- based on the values you computed, you measure the quality of your output.

>>: Yeah. I mean, I think one of the tricky parts, though, is that you're measuring the quality of a small piece of the output, not -- well --

>>: Well, what if that small piece depends on the whole computation? Like matrix [inaudible], if you have a bunch of matrices [inaudible].

>>: Right.

>>: For that small piece, you could have done that whole computation --

>> Mehrzad Samadi: Yeah, that's another problem, yeah. You can't --

>>: So here is a question, back to the image. What if I give you an image that's 90 percent the color blue and 10 percent very detailed?

>> Mehrzad Samadi: Yeah. That's the important part.

>>: Would that really mess up your analysis? Because if you go down 10 percent in quality, the part of the picture that you actually care about might be horrible, but the rest of it is just blue, regardless of what you do.

>> Mehrzad Samadi: Yes. The thing that I'm using right now is the uniform average of the errors of all pixels, so --

>>: So, one of those results you showed at the end in future work, where you have different intervals -- would one of those solve that issue?

>> Mehrzad Samadi: It will not generate the evaluation metric for you. I don't know how to generate that now. But it's a great research idea -- how we can use that, yeah.

>>: So did you look at -- so suppose that -- I like this example of the image header being nonapproximable, versus the image pixels, which I display on the screen, so that reasonably seems like something I could approximate. So suppose I write a really braindead program that updates the image header using atomic variables?

>> Mehrzad Samadi: Okay.

>>: Which I don't know why you would do, but imagine that I do, right?

>> Mehrzad Samadi: Yeah.

>>: So statically it should be pretty clear. If my approximation is effectively saying something about, you know, the pixel values, it should be pretty clear that your technique won't -- you know, even though you're going to try turning these knobs to reduce the number of atomic updates I do on the image header, every time you do that, you're going to screw up the image.

>> Mehrzad Samadi: Yes.

>>: And so I'm going to have a bad output, right?

>> Mehrzad Samadi: Yes.
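To make the scenario in this exchange concrete, here is a contrived CUDA kernel -- the names and the "header flag" are invented for illustration, not taken from any benchmark -- in which the return value of an atomic feeds a branch. Reducing or dropping atomic updates here would change control flow, not just nudge pixel values, which is why a pattern like this is not a safe target for approximation:

    // Only one thread wins the atomic exchange and initializes the "header";
    // every other thread does ordinary, approximable data work.
    __global__ void tag_first_writer(int* header_flag, float* pixels, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        // atomicExch returns the old value; exactly one thread sees 0 here.
        int old = atomicExch(header_flag, 1);
        if (old == 0) {
            // Control flow depends on the atomic's result, so an approximation
            // that skips or reorders atomics would break this path entirely.
            pixels[0] = -1.0f;              // pretend this writes a header field
        } else {
            pixels[i] = pixels[i] * 0.5f;   // regular pixel work
        }
    }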
>>: So have you thought of any kind of static techniques that sit above your ideas to weed out the things that are guaranteed not to work?

>> Mehrzad Samadi: Okay. The example that you gave is not really applicable to GPUs, because you usually write kernels to work on the real data on the GPU, so that header handling usually happens on the CPU. The good thing is that my work is on the GPU, so I don't need to worry about those headers.

>>: Sure.

>> Mehrzad Samadi: Another thing that I need to wor -- a similar thing that I need to worry about is that sometimes you use atomic operations for synchronization, for a lock, for example.

>>: Yes.

>> Mehrzad Samadi: In those places it will be a disaster if I do approximation. So what we did is we used a really simple compiler analysis: if you use the result of that atomic operation in that kernel, inside a branch, we don't do it. So if you have an if based on the result, we don't do it. So it's pretty easy to provide safety, or come up with a good heuristic for when to apply approximation methods to one kernel, because it's pretty straightforward -- it's a small piece of code and something like that. But providing safety that you don't do anything crazy on the whole benchmark, considering all the kernels, is hard, and we haven't done that. We have several heuristics for each kernel, so we can provide safety for you inside the kernel. But the interaction between GPU and CPU and so on, I don't know yet.

>>: Cool. I think we're out of time.

>> Mehrzad Samadi: Thank you. Thanks a lot. I appreciate it.

[End]