# Bill Dally - Accelerating AI
# https://www.youtube.com/watch/EkHnyuW_U7o

So I'm really pleased to be here; this is a great venue and a great topic. Unless you've been hiding under a rock, you've realized that AI, and in particular deep learning, is changing almost every aspect of human life. On the internet, every time you upload a photo to Facebook they run it through half a dozen deep networks looking for inappropriate content, copyrighted content, tagging faces, and so on. When you talk to Google you're talking to a neural network, and that's why the speech recognition is so much better. When you travel, you can translate language by just pointing your phone at the text, and it comes out in whatever language you're most comfortable with. It's making doctors more effective: the goal here is not to replace doctors with neural networks but to empower them, so that a clinician with a neural network can interpret medical images better, whether it's for skin cancer, breast cancer, or looking at retinas for various forms of eye disease, and networks are also mining data on symptoms, signs, and patient histories to help doctors make better diagnoses. In media and entertainment, it's completely changing how people produce content. In security and defense, it's easy to train a network on what's normal; anything abnormal is an anomaly and gets flagged. And one thing we're very excited about at NVIDIA is the prospect of saving some of the 1.3 million lives lost on highways worldwide each year by applying neural networks to autonomous vehicles.

As a computer architect and a hardware designer, what's particularly exciting to me is that this revolution was enabled by hardware. Most of the algorithms we use (deep neural networks, convolutional neural networks, training them with backpropagation and stochastic gradient descent) were all around in the 1980s. I actually took a course on neural networks when I was a PhD student at Caltech from John Hopfield, who was visiting Caltech at the time, built a lot of these networks, and concluded that we didn't have enough computing power to make them useful, which was correct in the 1980s. The other ingredient you need to make these things work is lots of data, and we had that by the early 2000s; the ImageNet competition was running for a long time before things really took off, with human-labeled data sets. But the missing ingredient, the spark that really ignited the fire of the current revolution in deep learning, was having enough compute horsepower to train these models on these data sets in a reasonable amount of time, where people defined reasonable as two weeks. Once that happened, it took off.

Now, since it's taken off, we have this issue, and two of the previous speakers have touched on this little cycle of better models, more data, better results: deep learning is now gated by hardware. We need to continue to produce faster solutions for both training and inference to enable people to run bigger models on more data. This chart shows the increase in compute demand: for images, going from 2012 to 2016, a four-year period, you need three hundred and fifty times as much performance; for speech, over a shorter period, it's 30x; and for machine translation, 10x. So we have to continue to deliver better hardware solutions. Dave Patterson actually showed you this exact slide this morning.
Despite the fact that a particular company is in denial about it, Moore's law is dead. If two Turing Award winners call it, I think it's a done deal; they can put the defibrillators away.

To give you an idea of what's going on (one of the earlier speakers talked about a bunch of this in his talk), let me show you one brief example of how we use machine learning and what the compute complexity is. We use a network called DriveNet, which is very similar to ResNet, for our deep learning perception. We have 12 cameras on the car, and for each camera we do many things: we do detection; we detect free space, so we have two independent estimates of where it's safe to go, one by detecting what is drivable and a second by detecting what is not; we find lanes; and we predict a path that goes into our path planner as a suggested route. We do this in all directions from 12 cameras (this view is putting all three of those together), and we do it in all kinds of weather: at night, during the day, in the rain, in the snow, and we have tested it on lots of real and synthetic scenarios that happen all the time.

Now look at the computational complexity of doing this. Let's start with ResNet-50, because I don't want to release the details of DriveNet, and ResNet-50 is very comparable. To run an ImageNet-sized image, it's roughly eight billion operations to feed one image in and get an inference out. To do that at thirty frames per second is about a quarter of a tera-op per second to run one camera through one network. But our cameras aren't ImageNet cameras; in fact, I don't think anybody has ImageNet-sized cameras that are 224 by 224. Our cameras are HD cameras, and in the very near future they're going to be moving up to 4K. At HD it's about ten tera-ops per second per camera per network, and we have 12 cameras and four networks (actually we have more than four networks, depending on how you look at it, but 12 times 4 is a good number). So the computational complexity here is really enormous.
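As a rough back-of-the-envelope check of those numbers, here is a small sketch in Python. The per-image operation count, frame rate, and camera and network counts are the figures quoted above; the assumption that compute scales roughly linearly with pixel count when going from an ImageNet-sized crop to an HD frame is mine, added only for illustration.

```python
# Back-of-the-envelope perception compute, using the figures quoted in the talk.
ops_per_image = 8e9                   # ~8 billion ops for one ResNet-50-class inference
fps = 30
imagenet_pixels = 224 * 224
hd_pixels = 1920 * 1080               # HD frame; 4K would be roughly 4x more again

per_camera_imagenet = ops_per_image * fps                            # ~0.24 TOPS
per_camera_hd = per_camera_imagenet * hd_pixels / imagenet_pixels    # ~10 TOPS (assumes ops scale with pixels)
total = per_camera_hd * 12 * 4                                       # 12 cameras x ~4 networks

print(f"ImageNet-sized camera, 30 fps: {per_camera_imagenet / 1e12:.2f} TOPS")
print(f"HD camera, 30 fps:             {per_camera_hd / 1e12:.1f} TOPS")
print(f"12 cameras x 4 networks:       {total / 1e12:.0f} TOPS")
```

Running it reproduces the quarter-tera-op and roughly ten-tera-op-per-camera figures from the talk, and makes clear why the total lands in the hundreds of tera-ops.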
How do we solve that? Today we do it with GPUs, and I'll talk a little bit about the evolution of GPUs for deep learning; it's actually an interesting kind of co-evolution that's been going on, and the current GPUs are very much designed to be deep learning accelerators. If you're training, the absolute best machine on the planet for training, as certified by MLPerf (and I'll get to the MLPerf results shortly), is the Tesla V100, the Volta tensor-core GPU. What makes it really good for training are a couple of numbers on this slide; I'm not going to go through all the specs. One is that it has a hundred and twenty-five teraflops of tensor-core performance, and that's FP16 performance. We provide a software package called AMP, automatic mixed precision, that does the scaling: it will take almost any model (in fact, we have not yet found one that it doesn't do reasonably well on) and automatically scale things so you can run the training in FP16. The other numbers that are really important are the nearly one terabyte per second of HBM bandwidth, because in a lot of these models, especially as your batch sizes get up there, the activations don't fit on chip and you're actually running the activations through that memory interface, and a conventional memory interface can't keep up; and then, for scalability, 300 gigabytes per second of NVLink bandwidth, so you can connect these together into a training supercomputer. I'll show the MLPerf scalability results shortly, where we won every category we submitted to.

Now, the reason this is so good at training is that it's very much a special-purpose device for training, but unlike a TPU, where Google basically built a big 256-by-256 matrix multiply unit, what we decided to do was take a very flexible programmable engine that had a very capable memory system, which is half the battle, and add an instruction to it. The instruction we added was HMMA, half-precision matrix multiply-accumulate, and it does what's shown here: it takes two FP16 matrices, multiplies them together, and adds the result to an FP32 matrix. Because it does 128 floating-point operations, almost all of the energy goes to doing the math; the overhead of programmability, of fetching the instruction, decoding the instruction, fetching the operands, all the things that give you the programmability that is so valuable for doing new models, new layers, different normalization schemes, sparsity masks, dropout, whatever you want to do, you essentially get for free.

(The slide's not advancing... OK, there we go.) The Turing part: Volta was announced in May of 2017; Turing we announced last summer, in August of 2018, and it does the same thing for inference. Volta is a training GPU; Turing is actually a universal GPU. It matches Volta's performance (not this version; this is the T4, which is very low power, at 65 teraflops of FP16; the Quadro 8000 is 130 teraflops of FP16, actually slightly faster than Volta). What's key here is that we've also provided INT8 and INT4 tensor cores, so for inference we can provide an enormous number of ops per watt, in fact quite a bit better than any of the special-purpose parts I've seen to date. Here's a little graphic of how this works: in Pascal we had a dot-product instruction, so you could do four operations at a time; with Volta we went to the FP16 tensor cores that do 128 ops at a time; and then with INT8 it's 256 and with INT4 it's 512, so it just multiplies up the performance.

When you look at the numbers for the FP16 version, the fully loaded cost of instruction fetch, decode, and operand fetch (and by the way, even in a TPU you have to get the operands from somewhere), and I have to say I was not allowed to show the actual 12-nanometer numbers, so these are scaled from publicly available 45-nanometer numbers, running the RTL through and getting the extracted energies out, the overhead of programmability is basically 27 percent. What that means is that if you threw everything off the GPU except the actual math units themselves (and of course there's going to be some overhead even for a dedicated accelerator like a TPU), you couldn't do any better than 27 percent better, and that's assuming you can make floating-point units as good as ours, and we've been at it for a while now.
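Here is a toy illustration, in NumPy, of what one HMMA-style tensor-core operation computes. The 4x4 tile size is chosen only so the arithmetic count matches the 128 operations mentioned above; the actual warp-level interface and hardware tiling are different, so treat this purely as a sketch of the mixed-precision semantics.

```python
import numpy as np

# D = A x B + C with FP16 inputs and an FP32 accumulator.
# For 4x4 tiles this is 64 multiplies plus 64 adds = 128 floating-point ops,
# which is the per-instruction count quoted in the talk.
A = np.random.randn(4, 4).astype(np.float16)   # FP16 operand
B = np.random.randn(4, 4).astype(np.float16)   # FP16 operand
C = np.random.randn(4, 4).astype(np.float32)   # FP32 accumulator

# The inputs are FP16 values, but the products and running sums are kept in
# FP32, which is what keeps long dot products from losing accuracy.
D = A.astype(np.float32) @ B.astype(np.float32) + C
print(D.dtype)   # float32
```

The point visible even in this sketch is that one instruction amortizes its fetch, decode, and operand-fetch overhead over 128 math operations, which is the mechanism behind the small programmability overhead discussed above.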
One thing I like to show is the evolution of our performance over the years; this is the chart for inference, and there's a comparable chart for training. Everybody always likes to compare against Kepler; it's kind of like the way people used to compare against the VAX-11/780, because it's a good target. But Kepler is a part we started shipping in 2012 (and the K20 wasn't even the first one); we completed a lot of the initial stages of that design in 2009, and that was before we'd really identified deep learning as a target area. Even so, it was for a long period of time the platform people ran deep learning training on. And just to show you that Moore's law doesn't matter here: the M40, which is in exactly the same technology (they're both TSMC 28 nanometer), we announced a couple of years later with almost double the performance per watt. From there we went to Pascal; by the way, Kepler and the M40 were both doing FP32 operations, and we didn't have FP16 until we moved to Pascal, so the jump here is FP16 on Pascal. The color here, by the way, encodes the process node: these three are all on 16 nanometer (I know we call the last two 12, but it's 16-nanometer metal rules, so it's the same energy as the original 16 nanometer, just slightly faster transistors). By going to FP16 we get a big jump, 3x in performance; by adding the tensor cores, still FP16, we get another big jump of about 5x, and this is in energy per op; and then adding INT8 (this is for INT8 inference) we get another huge jump.

Now, Dave Patterson made the point this morning that the TPU-1 compared against Kepler was 10x or better on something; I don't know exactly what they were measuring, it may have been per watt. I will observe, and I'll show you the data on this in a couple of slides, that they were comparing an INT8 engine against an FP32 engine, and the difference in energy between an INT8 multiply and an FP32 multiply is 16x. Almost the entire advantage of the TPU-1 over Kepler is the difference between FP32 and INT8; the rest of it is all just waving hands. This chart shows a similar thing for single-chip inference performance, but what really matters is the performance per watt, which is what this one shows: over the past six years, since we identified deep learning as an important use case, we've improved our performance per watt by a hundred and ten X. We are not a stationary target for all these numerous startups out there who would like to eat our lunch; it's a very rapidly moving lunch.

What's interesting is that the very same chip that is the leading chip for deep learning is also the chip that powers the most powerful supercomputer in the world, the Summit supercomputer at Oak Ridge National Laboratory, the most powerful supercomputer in Europe, Piz Daint at the Swiss Federal Institute of Technology, and a large fraction of the top 20 on the Top500 list. And they've made the point that while Summit is rated at 200 petaflops, if you're going to count FP16 flops it's about a 3-exaflop machine, and one of the winners of the Gordon Bell prize at Supercomputing last year, which included authors from both the DOE and from NVIDIA, actually sustained over an exaflop on a deep learning task associated with interpreting weather patterns for climate change. So we've been able to sustain an exaflop on real deep learning tasks using this machine.

What's great about combining the FP64 performance you need for a lot of the historic HPC applications with the FP16 you need for deep learning is that deep learning has revolutionized scientific computing in two ways. One, as that Gordon Bell prize shows, is using deep learning to interpret the results: you run some huge climate simulation and you want the answer (you know, is the Earth doomed?), and the way to extract that is to look for patterns of currents and typhoons and the like, and by finding those you can actually interpret the results. It's hard for humans to find that in a huge data set; with deep learning you can build pattern recognizers that find those things. The other application is actually doing the simulation itself: rather than simulating the constituent equations, you can take previous simulations, or real ground-truth data, and use them to train a neural network to predict the next state of the simulation. People have gotten speedups as high as 10^5 on quantum chemistry codes by taking the original density functional theory codes, taking the results of running them, training networks on those results, and then running the networks, getting results of equal accuracy with 10^5 fewer operations. So this marriage of high-performance computing with deep learning is a huge thing.
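A minimal sketch of that "learned simulator" idea, with everything in it hypothetical: a damped harmonic oscillator stepped with Euler stands in for the expensive physics code, and a small scikit-learn MLP is trained to map one state to the next and then rolled forward. The talk does not describe any specific system or network; this is only to make the pattern concrete.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# "Reference simulator": one Euler step of a damped harmonic oscillator.
def step(state, dt=0.05, k=1.0, damping=0.1):
    x, v = state
    a = -k * x - damping * v
    return np.array([x + dt * v, v + dt * a])

# Build (state_t -> state_{t+1}) training pairs from the reference simulator.
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 2))
Y = np.array([step(s) for s in X])

# Train a small network as a surrogate for the simulator's update rule.
surrogate = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=3000).fit(X, Y)

# Roll the surrogate forward instead of the original code.
s = np.array([1.0, 0.0])
for _ in range(100):
    s = surrogate.predict(s.reshape(1, -1))[0]
print("surrogate state after 100 steps:", s)
```

The wins quoted in the talk come when the reference step is something like a density functional theory calculation, so that each learned step replaces an enormous amount of arithmetic.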
Now, in the embedded space, we make an SoC called Xavier. It's designed primarily for our self-driving cars, but we also apply it to video analytics and robotics. It's basically a tenth of a Volta: it's got 512 Volta cores instead of 5120, but it has a bunch of accelerators in addition to that. We have accelerators for video and the like, so we can process those 12 cameras coming in, but we also have a deep learning accelerator, and the reason we did this is that in the embedded space, that 27 percent matters; you actually want to get every little bit of overhead out of there. Our deep learning accelerator actually looks a lot like the Google TPU in that it's a big MAC array in the middle. It's not quite as big as theirs (it's 2K INT8 multiplies per cycle rather than 64K), but it has two things that their deep learning accelerator doesn't have. One is support for sparsity, because, as I'll talk about in a minute, most networks are sparse and the activations are sparse, and by moving around only the nonzero values you can save an enormous amount of bandwidth, and by not burning any energy trying to multiply things by zero you can save an enormous amount of energy. We also support Winograd transforms natively, which makes convolutions much more efficient: if I have to do a 3x3 convolution and I do it in the spatial domain, it takes nine multiplies; if I go to a transform domain, I can do that convolution with a single multiply. So there's a big advantage to moving to the Winograd domain.
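To make the sparsity point concrete, here is a minimal sketch (mine, not the DLA's actual datapath) of the zero-skipping idea: store only the nonzero weights together with their indices, and only do the multiplies whose operands are both nonzero. The answer is identical to the dense dot product; what hardware saves is the bandwidth and energy of moving and multiplying zeros.

```python
import numpy as np

def compress(v):
    """Return (indices, values) for the nonzero entries of v."""
    idx = np.flatnonzero(v)
    return idx, v[idx]

weights     = np.array([0.0, 0.7, 0.0, 0.0, -0.3, 0.0, 0.0, 0.9])   # pruned weights
activations = np.array([0.2, 0.0, 0.0, 1.1,  0.5, 0.0, 0.3, 0.0])   # post-ReLU activations

w_idx, w_val = compress(weights)
acc = 0.0
for i, w in zip(w_idx, w_val):
    if activations[i] != 0.0:        # skip zero activations too
        acc += w * activations[i]

assert np.isclose(acc, weights @ activations)   # same result as the dense computation
```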
So let's talk about MLPerf. When I was a teenager we used to race cars, and you only knew somebody was serious if they would race you for pink slips, so I think we should have raced Google for pink slips on this one. There were seven categories; we submitted to six of them (the reinforcement learning category wasn't really appropriate for what GPUs are good at), and the six we submitted to, we won, both for single node and for scalability, and it was really the scalability that mattered. The Intel person mentioned the recommender system, their submission being 1.6 times better than a Pascal. So I went to the MLPerf site (it's mlperf.org, by the way; you can all go there and look the numbers up for yourselves): our submission was 116 times a P100, just to put that in perspective, and that's training in 0.4 minutes. By the way, that trained fast enough that we didn't do multi-node for that one; if the training takes, what is that, 20 seconds or so, it doesn't make any sense to make it go faster by going to multiple nodes. These are all on a single DGX-2H; you can get out your checkbooks and buy one right now if you want, and I'll take orders after the talk.

If that isn't fast enough for you, these are the numbers at scale, and we got very good speedups. These are on a bunch of different clusters, most of which are actually clusters of DGX-1s, because we just had a larger cluster of those available at the time we ran these, but we got very good scalability on all of them because of our NVLink network within the DGX-1 and the InfiniBand network connecting the nodes together. So this is where we are today: we are the fastest in the world at training networks, at least as reported by MLPerf; we submitted to more categories than anybody else, we submitted to six, and won six.

Let's talk about where we go from here. We feel a responsibility, since deep learning is gated by performance and hardware, to keep those curves I showed you going. We actually have pretty much the next two points on those curves loaded already, and we're still going up at a comparable rate, but where is that performance going to come from? Some of the previous speakers hinted at this. One place is number representation. When you choose a number representation, you're really selecting two things: how much dynamic range you have and how much accuracy you have, and it's important to use as few bits as you can, for two reasons. One is that the energy goes quadratically with the number of bits: when you do a multiply, going from an 8-bit multiplier to a 32-bit multiplier is not four times more energy, it's 16 times more energy. Again, that was the entire difference in energy between the TPU-1 and a K80: comparing INT8 with FP32. The other reason it's really important is that you want your data, both the weights and the activations, to be small, so you can fit a lot of it in whatever on-chip memory you have and keep it right next to the arithmetic units, because moving data around is very expensive energetically.

So one thing you want to do is look at ways of encoding your data that make the most of each bit; a bit is a terrible thing to waste. This shows, after pruning (which I'll talk about in a minute), the distribution of weight values in a network; I believe this was VGG-16. What you see here is that if I have 4 bits to use, I could choose to use them with an integer representation, which would give me the evenly spaced symbols shown by the green X's here. You can see that's a horrible way to use your symbols, because I've got a bunch of symbols out here where nothing is happening, and I'm sampling the place where everything is happening very sparsely. A better way to use your symbols, if you have 16 of them, is to do what the red dots show here: put them very densely where the interesting things are happening, and not waste any out here on the outliers. The data here is from a paper I published with my former graduate student Song Han, who's now a professor at MIT, at ICLR in 2016 (actually on arXiv in 2015), and this is one way of doing it. It turns out this particular way is energetically inefficient, because it requires a codebook lookup and a full 16-bit multiply, but there are very clever numerical encodings that get you a similar efficiency, so we can wind up getting 8-bit accuracy with roughly four bits per symbol and do it in a very energy-efficient way.
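Here is a toy sketch of the codebook ("trained quantization") idea referenced above: cluster the weights with one-dimensional k-means so that 16 shared values (4 bits per weight) sit densely where the weight distribution actually has mass, instead of being spaced uniformly the way a plain 4-bit integer format would place them. The Laplacian weight distribution and the plain Lloyd iteration are stand-ins I chose for illustration; the paper additionally fine-tunes the codebook with backpropagation, which is not shown here.

```python
import numpy as np

def kmeans_1d(x, k=16, iters=50):
    """Plain 1-D Lloyd's algorithm: returns (codebook, per-weight code)."""
    centers = np.quantile(x, np.linspace(0.0, 1.0, k))    # spread initial centers over the data
    for _ in range(iters):
        assign = np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)
        for j in range(k):
            if np.any(assign == j):
                centers[j] = x[assign == j].mean()
    return centers, assign

# Peaked, heavy-tailed stand-in for a pruned layer's surviving weights.
weights = np.random.default_rng(0).laplace(scale=0.05, size=10_000)

codebook, codes = kmeans_1d(weights, k=16)     # each code fits in 4 bits
decoded = codebook[codes]

# Compare against 16 evenly spaced levels (the "green X" style quantizer).
grid = np.linspace(weights.min(), weights.max(), 16)
uniform = grid[np.argmin(np.abs(weights[:, None] - grid[None, :]), axis=1)]

print("k-means codebook error:", np.abs(decoded - weights).mean())
print("uniform grid error:    ", np.abs(uniform - weights).mean())
```

The k-means codebook lands a much lower error with the same 16 symbols, which is exactly the non-uniform placement argument made on the slide.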
This is another figure from that 2015 paper, and it shows that for convolutional neural networks we're able to get down to 6 bits per symbol with no loss of accuracy (that's no loss of accuracy compared to FP32), and for fully connected layers, for multilayer perceptrons, we're able to get down to four with no loss of accuracy; you don't really fall off the accuracy cliff until two bits. Since that time we've done a lot of work that is not published yet which is actually even better. So you can use the minimum number of bits required, and that number is trending down toward four. A lot of people will say, let's go binary, and it turns out that winds up not being such a great idea: you end up losing more accuracy and having to earn it back in ways that are more expensive, and four seems to be kind of the sweet spot in energy for a given level of accuracy for inference.

So let's talk about pruning. One of my favorite sayings in life is "never put off till tomorrow what you can put off forever," and that goes especially for multiplies. It turns out that, just as biological brains are sparse (they do not have every neuron connected to every neuron, and they're even sparser dynamically in terms of where the firing occurs), artificial brains can be the same way, and in fact you can lop out most of the neurons in a network and not lose any accuracy. For fully connected layers we've been able to repeatedly knock out 90 percent of the neurons, leaving only 10 percent remaining, with no loss of accuracy, and for convolutional networks between 60 and 70 percent can be pruned, leaving thirty to forty percent. Now, to do this you have to retrain the network. What you do is train the network, then lobotomize it, lopping out some large number of its neurons, and you get the performance indicated by this purple line here. Dave, I think, said a one percent loss of accuracy is considered catastrophic; we actually consider a 0.1 percent loss of accuracy catastrophic, so without retraining we would have to stop pruning at around fifty percent. But with retraining we get the green line here, where we can easily get out to over eighty percent pruned without substantial loss of accuracy. And anything that's fun to do once you should do multiple times, and that's true for pruning as well: if you iteratively prune (prune to one level, retrain, prune, retrain), after three iterations you get the red line here, where we're out to 90 percent pruned, ten percent density left, without loss of accuracy.

Again, the reason this combination of reduced precision and pruning is really important is that it lets us fit things into a really small local memory, and the cost of accessing memory goes up an order of magnitude every time you move up the memory hierarchy. If I can fetch my data from a really small local memory, that's five picojoules per word. If I have to go across the chip to get it from a memory that's on chip but not really local, that's 50, and by the way, that memory is built out of a bunch of these little memories, and the other 45 is crossing the wires to get there and back; it's the communication that really burns the energy. If I have to go off chip, even with energy-efficient LPDDR3 memory, it's yet another order of magnitude.
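A minimal sketch of iterative magnitude pruning, the prune-retrain-repeat loop described above. A linear least-squares model stands in for the network so that "retraining" is exact and cheap; in practice the retraining step is more epochs of SGD on the surviving weights, and the 50/25/10 percent schedule here is just an illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 64))
true_w = rng.standard_normal(64) * (rng.random(64) < 0.2)   # mostly-zero "ground truth"
y = X @ true_w + 0.01 * rng.standard_normal(500)

mask = np.ones(64, dtype=bool)
for density in (0.5, 0.25, 0.1):                            # three prune/retrain rounds
    w = np.zeros(64)
    w[mask] = np.linalg.lstsq(X[:, mask], y, rcond=None)[0] # "retrain" the surviving weights
    keep = int(density * mask.size)
    threshold = np.sort(np.abs(w))[::-1][keep - 1]
    mask &= np.abs(w) >= threshold                          # drop the smallest-magnitude weights

w_final = np.zeros(64)
w_final[mask] = np.linalg.lstsq(X[:, mask], y, rcond=None)[0]
print(f"density {mask.mean():.0%}, residual {np.linalg.norm(X @ w_final - y):.3f}")
```

The key design point the talk makes is that the retraining after each pruning round is what lets the density keep dropping without the accuracy cliff that one-shot pruning hits.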
So let me switch gears here and talk a little bit about deep learning accelerators. We've been building these in NVIDIA Research for a number of years; let me talk about a few. The first is one I actually did with a number of my colleagues at Stanford: EIE. The reason we did this is that we were playing with these sparse networks, and the conventional wisdom from every person we talked to, especially the numerical analysts, was "you're in the uncomfortable range of sparsity." Most people have sparse matrix packages; we have one called cuSPARSE, and cuSPARSE starts beating cuBLAS at a density of about half a percent. In other words, if you're about half a percent dense, you're better off using the sparse matrix package; denser than that, you're actually better off just doing the dense calculation. And we're in this range of ten to thirty percent density, depending on whether you're an MLP layer or a convolutional network. So we said, OK, that's true if you're running on conventional hardware, but we can make sparsity work by building a pipeline that basically walks the compressed sparse column structure in hardware. It takes almost no area and almost no energy to do this, and this is the beauty of domain-specific hardware: we can wire in things that do that. So we built this accelerator to show that we could make sparsity, and that trained quantization (there's a codebook lookup here, where it says "weight decoder"), essentially free, and to show people that, yes, when you're building the hardware you can do things that are not possible if you're just writing code. One thing that's really interesting to me, looking at this plot: EIE was an array of processing elements, and each processing element is almost all RAM; the non-RAM stuff is the little thing in the middle labeled "arithmetic." This is actually true of many accelerators; they tend to be completely memory dominated. In this case the memory is needed to hold the sparse matrix and to hold the compressed sparse column structure that has the pointers into the sparse matrix.

So with EIE we showed we can do sparsity really well for fully connected layers. Then, in NVIDIA Research, we did a project called SCNN, where we looked at the problem for convolutional layers. There you have an activation map that looks kind of like this: after applying ReLU all the negatives turn to zero, so you wind up with maybe thirty to forty percent nonzeros, and you're convolving that with, after pruning, a kernel of weights that looks like this. Of course the output fills in almost completely until you apply the ReLU again, and then that red one will look like this. So we looked at the most efficient way to do this, and the approach we took in SCNN was basically to pack the operands: only the green (nonzero) squares are stored, we get rid of all the white, but with each green square we keep its x and y coordinates, and similarly for the weights. Every one of these green squares has to be multiplied by every one of those blue squares, so we just do that: we read W weights at a time and I inputs at a time (I think it was four and four for the baseline configuration), multiply every one by every one, so we produce sixteen products, and then we sort them all out on the output side by taking their coordinates, adding the coordinates together, and that basically tells us where to accumulate the results. This wound up giving about a 3x boost in energy efficiency over running the computation dense.
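Here is a toy sketch of that Cartesian-product dataflow: keep only the nonzero activations and nonzero weights, each tagged with its coordinates, multiply every nonzero activation by every nonzero weight, and use coordinate arithmetic to decide where each product accumulates. The coordinate rule below (a subtraction, giving a "valid" cross-correlation) is my simplification of the coordinate computation described above; the real SCNN dataflow also tiles the work, multicasts operands, and banks the accumulators.

```python
import numpy as np

def sparse_conv2d(act, ker):
    """Valid cross-correlation computed only over nonzero activation/weight pairs."""
    H, W = act.shape
    R, S = ker.shape
    out = np.zeros((H - R + 1, W - S + 1))
    for ay, ax in np.argwhere(act != 0):            # nonzero activations with coordinates
        for wy, wx in np.argwhere(ker != 0):        # nonzero weights with coordinates
            oy, ox = ay - wy, ax - wx               # output coordinate from the two input coordinates
            if 0 <= oy < out.shape[0] and 0 <= ox < out.shape[1]:
                out[oy, ox] += act[ay, ax] * ker[wy, wx]   # scatter-accumulate
    return out

rng = np.random.default_rng(0)
act = rng.standard_normal((6, 6))
act *= act > 0.5                                    # ReLU-like sparse activations
ker = np.array([[0.0, 1.0, 0.0],
                [2.0, 0.0, -1.0],
                [0.0, 0.5, 0.0]])                   # pruned 3x3 kernel

dense = np.array([[np.sum(act[i:i+3, j:j+3] * ker) for j in range(4)] for i in range(4)])
assert np.allclose(sparse_conv2d(act, ker), dense)  # same answer, far fewer multiplies
```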
So we're looking at where we're going to get the next 10x. You can get a lot from better number representation, there's probably a 3x in there from sparsity, and there are some other things. This is actually a die we just got back recently, where we're experimenting with a bunch of things. One of them is scalability: it's a 36-die MCM, and at each location on the MCM we have an array of 16 processing elements, so we can scale from a single processing element, which you could use for some small IoT device, up to something that does in excess of 2,000 images per second on ResNet-50 at batch size one. Its energy efficiency is about 105 femtojoules per op, and it's about 50 tera-ops peak (excuse me, I keep confusing this with the later one): about 10 tera-ops per watt for doing 8-bit deep learning inference. So this demonstrates a bunch of neat technologies for communicating between the dies on the MCM, but a lot of its deep learning performance comes from partitioning the weights and activations so that things stay local: the weights are local on each block; we take the input activations and partition them by row, both over the chips of the MCM and over the PEs of the array on each chip; and then we partition the output activations by column, so everything is very local communication flowing through the system.

More recently we've done a study of doing this type of tiling, as the numerical analysts call it, at multiple levels, where in addition to reading out of a very small weight buffer we actually have a weight collector, a small four- or eight-entry structure that holds just a few weights. By doing this (I'm going to jump forward, as I'm getting a little behind schedule) we were able to demonstrate, again in the same technology (these are all 16-nanometer numbers), thirty tera-ops per watt on inference. So we're constantly trying to raise that bar ourselves, to see where we're going to get the next jump in inference performance.

One of the things that's not in this chip yet, but that we're looking at very carefully, is making the on-chip communication more efficient; remember, moving this data around is really expensive. One way of doing this is to use every electron twice. Normally you start at the power supply, which is, say, 0.8 or 1 volt, and you get to use that electron once as it drops to ground to signal a one or a zero through an inverter. What we do here is introduce a mid-rail at half the supply voltage, so if the supply is 0.8 this is 0.4, and we get to use the electron once on the top for sending this bit across, and then use it again on the bottom for sending this bit across. We actually gain 4x by doing this, because the energy goes as the square of the voltage; it seems like you're getting something for free here, it ought to be 2x, but it's actually 4x. There's a paper on this at ISSCC in 2016, and again, it's one of the cards we have on the table to keep those plots going upward.
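A back-of-the-envelope sketch of why that factor is 4x rather than 2x. This is my reconstruction of the standard charge-recycling argument, not numbers from the ISSCC paper: halving the signal swing cuts the energy per transition by 2x, and letting the same charge packet signal a bit on each of the two stacked links cuts it by another 2x.

```python
# Signaling energy drawn from the supply per transition is roughly
# C * Vswing * Vsupply (charge moved, times the supply it came from).
C = 1.0            # normalized wire capacitance
Vdd = 0.8          # supply voltage in volts

conventional_per_bit = C * Vdd * Vdd          # full-swing link, one bit per charge packet

recycled_per_packet = C * (Vdd / 2) * Vdd     # half-swing packet drawn from the supply...
recycled_per_bit = recycled_per_packet / 2    # ...reused by the stacked link, so it signals two bits

print("energy ratio:", conventional_per_bit / recycled_per_bit)   # -> 4.0
```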
Now, I kept getting pestered by our board members about analog, because many of them had invested in various startups doing analog deep learning. So I did what I usually do: I went over and talked to my colleagues at Stanford. Boris Murmann, in fact, is a great resource on this; he's done a lot of analog neural-network chips, and after talking to him for a couple of hours I realized that it's all about the data conversion. Even if I spot all these people that they can do a really efficient vector-matrix multiply in analog, by taking a vector of activations as voltages and a matrix of weights as conductances and applying Ohm's law to get current equals conductance times voltage, I'm going to give them that for free and charge them only for the cost of doing the data conversion back at the end of the layer. But I'm going to be very scrupulous about precision: if you want to match 8-bit inference accuracy, you have to do the data conversion with enough bits that, after you've summed up a bunch of things, you're not throwing away significant bits. It turns out that Boris actually maintains a database of all the analog-to-digital converters anybody has ever published; those are all the points on this chart, and they're bounded by two fundamental limits: it costs something to do a conversion no matter how many bits it is, and it costs exponentially more to do conversions as the number of bits goes up above about ten. So if you take that data, assume the multiplies are free, but still charge for a conversion per layer, it turns out that however many multiplies you sum together before you do the conversion, you can't win that way. This is one of the ways physics just works amazingly well: however many products you have to sum together, you just wind up needing more bits of precision, by exactly the same slope, so you wind up on the same line. And it turns out that to match 8-bit inference accuracy takes over a picojoule per MAC. Remember, we have demonstrated, in chips that are in the lab right now, 200 femtojoules per MAC, so they're off by a factor of five, and they're not even counting the energy they use to do the multiplies and everything else in the system, just the A-to-D conversion. The thing that's interesting, though, is that if you start walking down the precision, from 8 to 7 to 6 to 5, then around 4 or 3 bits of precision analog might make sense. Of course, at that level the digital gets a lot less expensive as well; remember, we're getting 70 femtojoules per MAC in that chip I showed you a little while back, digitally.
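A sketch of the bit-growth argument, in my own paraphrase rather than data from the ADC survey: every doubling of the number of products you sum in the analog domain adds roughly one more bit of resolution the converter needs if you don't want to throw away significant bits, and above roughly ten bits ADC energy grows roughly exponentially per extra bit, which is what cancels the benefit of amortizing one conversion over many MACs. The eight-bit base resolution below is an illustrative placeholder.

```python
import math

base_bits = 8                     # resolution needed for one product (illustrative assumption)
for n in (16, 64, 256, 1024):
    bits = base_bits + math.ceil(math.log2(n))     # ~1 extra bit per doubling of summed products
    print(f"summing {n:5d} products before the ADC -> roughly {bits} bits of conversion")
```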
Now, one reason we're able to keep pushing the edge on deep learning hardware at NVIDIA is that we eat our own dog food: we have a big research organization that takes our deep learning hardware, frameworks, and software at each level and tries to push the state of the art with it. So we have among the best semantic segmentation systems. We currently have the top of the leaderboard in optical flow; this is actually a very interesting network, and I encourage you to go read the CVPR paper from last year, because it did not take the end-to-end approach which many people took for deep learning optical flow. Instead we took twenty years of computer vision research on optical flow, took the best ideas of how to do optical flow without deep learning, and applied deep learning to them, getting really kind of the best of both worlds. And because we need to process lidar data from our self-driving cars, we have among the best 3D segmentation for point clouds.

Now, our core business is graphics, so we have also made a practice of applying AI to graphics. This comes in really two main categories, and I'll only talk a little bit about one, which is content creation. A lot of people don't appreciate it, but the video game industry is a bigger industry than the motion picture industry by revenue, and typically producing a triple-A video game costs significantly more than producing a major motion picture; most of the triple-A titles break a billion dollars, and most of that money goes to artists, it's artists' time. So the key thing to enable the industry to do more is to automate content creation. These are some figures out of a paper we did with people at one of the game developers. If you wanted to animate a face, they would typically use what's called a rig, with about 300 control points that move different muscles of the face, and producing something like a ten-second segment takes about three weeks of artist time, laboriously controlling the rig until it looks correct. What we did is train a neural network with an actor, and then we're able to take an audio track and animate the face speaking, and in a double-blind experiment you couldn't tell it from the one somebody spent three weeks producing; it basically ran in a few seconds on a GPU.

The other three here I'll talk about in a little more detail. (This never works, so I have the backup queued up... I just need to select that and mirror my displays... there, you should see it now.) This is how we take the tensor cores on Turing and make them useful for graphics. Everybody in deep learning is always worried about the tiny bit of area we have devoted to graphics that doesn't make deep learning faster; we're always worried about the other thing, which is all the area we devote to deep learning that doesn't make graphics faster. This is a technology I'm very happy about, because we developed it in NVIDIA Research, which is the organization I run: it's called DLSS, deep learning super sampling. What we do is render the image at one resolution and then feed it into a deep network that up-rezzes it. The typical case is 1440p to 4K; because this is an HD monitor, I believe this demo is 720p to HD. You can't really appreciate it until I run this slider across (DLSS on is on the left): you see how things to the left of the slider, like those little features on the back of this booth, just pop out as the slider goes across, and these little glass ornaments, you can't really see what's inside them until DLSS turns on going over them. So this is one great way of taking deep learning performance and making graphics better. Now, I'm showing you the easy case here, a still image; anybody can do a still image. What makes this hard is that if you apply this to a video sequence, and you're hallucinating pixels that were never shaded, you have to do it consistently frame to frame or it looks really objectionable: you get little wigglies, you get things flashing on and off, and that's a no-no; you will get drummed out very quickly if you do that.

(So let's go back to the main talk. Extend desktop... why is that not coming back? That is bad. Let's try this again... it keeps coming back to that... there we go. Well, I've got it here but I don't have it over there. This just never works. The other thing I can do is just quit the thing that's taken over... OK, that should give me control back, I hope.)

This is another example of how to apply deep learning to graphics. If you watch a major motion picture that's either animated or live-action with CGI segments in it, the artificial portions are done with a technique called ray tracing.
You cast rays from the eyeball into the scene, see what they bounce off of, and then ultimately, after many bounces, you try to connect them to a light source, and then you run that backwards to compute what color you get. A major motion picture will probably cast about ten thousand rays per pixel, and that takes many hours on a farm of hundreds to thousands of CPUs. We don't have many hours or hundreds of thousands of CPUs; we have one GPU and 16 milliseconds. So what we do is cast five rays per pixel to get an image that looks kind of like this: it looks like what your camera would produce if you set your ISO to 200,000, a very grainy, noisy image. Then we feed it through a deep neural network and we get this beautiful image. Again, in double-blind experiments people can't tell the difference, and this is what's enabled us, with our RTX technology, to bring ray tracing and physically based rendering to real-time graphics. We couldn't do it if we had to do ten thousand rays per pixel, but if we can do a few rays per pixel and clean it up with a deep network, it makes a whole other level of realism possible in graphics.

We've also been doing a lot of work with GANs. About a year ago we developed this technology called progressive GAN. Again, you have these two networks: the generator network here and the discriminator network. The generator network has never seen a picture of a person; all it sees are these random variables called latent variables. We feed them in, it produces a picture, and it gets one bit of feedback, good or bad ("good, you fooled me, I thought that was a real person," or "bad, that's obviously a fake"), from the discriminator network, which is trying to learn at the same time. Nobody before we did this had been able to produce good high-resolution images; they started getting blurry, and the reason is that you just have too many free variables: you've got this huge generator network and this huge discriminator network, both initialized to random values, all trying to learn this stuff with neither having any idea what it's doing. So we thought about this, and we realized the right way to do it is the way you would teach a kid mathematics: you don't start with differential geometry and give them the big black hole and tell them about general relativity; you start with arithmetic, or maybe set theory, and work your way up. So we start with a very simple network: we learn 4x4, and the discriminator discriminates 4x4, and when it gets 4x4 right we move up to 8x8, and then 16x16, and so on. By doing this in a progressive way, what's often called curriculum training, we're able to get the network to be stable and converge to something that produces really good high-resolution images. Here's the movie of the training; it shows the resolution it's currently training at in the bottom left and the number of days (I think on a single V100) in the middle, and you see that by the time it gets to 256 by 256 the images are looking pretty good. We can go to a thousand by a thousand and the images are still crisp, even though, remember, this generator has actually never seen a picture of a real person; all it gets is feedback about whether the images it produces are judged real or fake.

Now, just recently we improved on this significantly by developing a new network called StyleGAN. What we show here on the left is the way our progressive GANs work: we feed the latent variable in at the top, the image comes out the bottom, and it kind of flows down through the layers.
Well, it turns out we can get much better results if we don't do it that way. Instead, we just feed a constant into the top, and we take our latent variable, feed it through a multilayer perceptron, factor it into pieces, and feed those into different layers. By doing this we can take different parts of the latent variable and use them to control different aspects of the image we're producing; we can control different resolutions, different aspects of the face, so we can independently control things like hair color, eyes, teeth, and the like. These are images coming out of StyleGAN as we vary that latent variable at different scales.

Let me see how we're doing on time... OK, let me wrap up; actually I'm ahead of time, which is amazing. Deep learning has fueled a revolution in almost every aspect of our lives: transportation, healthcare, education, and graphics. As a hardware designer, I'm really happy about this, because it's been enabled by hardware. It's been a co-design between hardware and algorithms, but the core algorithms had been around for 30 years waiting for the hardware to catch up with them, so it's kind of great that the hardware finally did. And now that we're here, progress is gated by hardware: if we don't continue to build faster and more efficient hardware, we will plateau, we will stagnate, because we can't run bigger models on bigger data sets without better machines to run them on. We've been able to increase the performance of GPUs by an average of about 3x per generation since the Kepler generation, without much help from Moore's law; we had one process jump in there, from 28 to 16 nanometer, that gave us maybe 20 percent, so it didn't help much on that 3x. It's been by being clever about architecture and circuits: going from doing FP32 to doing INT8 on Turing, and also adding the dot-product instructions, the tensor cores, and a lot of other features that have been geared toward deep learning. That's where we are today, with Volta at 125 teraflops for training, and Turing at 261 tera-ops of INT8 (that's the Quadro 8000) and 512 tera-ops of INT4, at about 3.6 tera-ops per watt for inference. For embedded systems we have the Xavier SoC, which has 20 tera-ops of deep learning performance, part of which is in the Volta cores and part of which is in the two deep learning accelerators. By the way, the deep learning accelerator is open source: you can go to nvdla.org and download the RTL; many people are using it in their own SoCs.

The question now, and what I wake up very excited about working on every morning, is how we continue scaling this performance from the pretty impressive point we're at today, without any help from Moore's law, and I've shared with you some of the things we're going to be doing. Precision: we need to use the fewest number of bits that we can and make each of those bits count, which means we don't do uniform sampling, but we need to do it in a way where the arithmetic operations are efficient; we can't afford codebook lookups and high-precision multiplies like our original trained-quantization work. We need to move to supporting sparsity better; we already support sparsity quite well in our DLA, by the way, which has support for sparsity both in data gating of the multiplier array and in doing all the data transfers in compressed form with the zeros squeezed out.
And then accelerators are really how we're prototyping a bunch of these concepts, with the idea that in the long run they'll go both into future versions of the DLA and into future tensor cores that would then support new data types, support sparsity, and other things going forward. What keeps us honest about this is that we use our own stuff, we eat our own dog food, and I gave you some examples of how we apply deep learning to perception, to graphics, and to image synthesis. And with that, I'd be happy to take some questions.

(Session chair) We have lots of time for questions, and I'm sure there will be many. This one?

Q: To what extent do you feel like the restriction, or gating, on performance is due to the GPU versus not owning the host processor?

A: So, to what extent is the gating on performance due to the GPU versus the host processor? Very little. When we first started doing some image work we wound up being a little bit limited by the host processor doing things like decoding the JPEGs and shuffling the images, because when you do the epochs you need each batch to be independently sampled, but we moved all of that to the GPU with our package called DALI, and now we're not constrained. So on a DGX-2, where we have, I think, pretty good Xeons, those are not the gating factor; it's GPU limited.

(Session chair) Actually, I have an announcement to make: there's a lost wallet, name Gem Lal Salim, at registration, just so you know. Bill is here, so if you have any other questions please come up and talk to him. We're now going to coffee break; we're back at 3:30 p.m., so we have 45 minutes. See you then. Thanks... oh, one question over there, sorry.

Q: A question about the architecture. You and Dave both mentioned that Moore's law is going away and it's basically now up to the architecture. What do you foresee as the Moore's law of architecture? Is there something you foresee happening to the architecture in the future, similar to what happened to the process?

A: Yeah, that's a great question. Unfortunately, what was great about Moore's law, and Dennard scaling even more so, was that it was the gift that kept giving: every generation you'd shrink the line widths and everything would get more efficient, and you'd be able to turn some of that efficiency into better performance, and Moore's law, which is really about the economics of semiconductor manufacturing, said every generation the transistors get cheaper, and the next generation they got cheaper again. The problem with architecture is that every clever idea you think up as an architect is a one-time play. You play that card and it's played: you add tensor cores, OK, we've played the tensor core card, what are we going to do next time? Now, I think the one trend, and I think Dave said this as well, is that we're moving to more domain-specific architectures, because we can get more efficiency that way, but even so, each thing you do with a domain-specific architecture is still a one-time play. So to continue to scale performance, to get the three to five x we've been giving you on every generation of GPU for deep learning, the really easy, obvious ideas have been done; we have to think of the harder or less obvious ideas, and just have a lot of smart people trying a lot of things and seeing what works.
Eventually we will reach the point of diminishing returns, where it gets harder to find those things. Right now we're not at that point; we have lots of good ideas, and we can see our way through the next several data points, but it will get harder as time goes on.

Q: Thank you.

(Session chair) We're warming up on questions now. Go ahead, please.

Q: You mentioned that some of the compression methods that require a table lookup for the weights are not energy efficient. Can you elaborate a bit more on that?

A: Yeah. In the paper that Song Han and I published at ICLR in 2016, we basically wanted to find out what, sort of theoretically, were the fewest bits we could use to quantize a weight. We observed, as many people have, that anything you can differentiate you can train using backprop, so we basically did backprop into our codebook and trained the codebook to find the optimal set of values, given a number of values, to represent the weights. To do this you do some k-means clustering to group the weights together and then use backprop to find the value for each cluster. Now, the reason that's not energetically efficient is what you have to do to perform the arithmetic operation: first you have to do a lookup from a small table (we use a latch array for that, so it's not as bad as the two and a half picojoules you'd spend reading sixteen bits from a RAM, but it still winds up being hundreds of femtojoules), and then we have to do a 16-bit multiply, because we decoded the weight to sixteen bits so we wouldn't lose any accuracy, and that 16-bit multiply is four times as expensive as doing an 8-bit multiply. So that pipeline of table lookup plus 16-bit multiply is a very expensive thing. It turns out that if you really want to use a small number of bits efficiently, you need to come up with a numeric representation that samples non-uniformly, putting most of the symbols where the interesting things happen, but that also admits very inexpensive arithmetic, and there are such things, but I'm not prepared to talk about them yet.

(Session chair) In general, if you have questions, please go up to the microphones. I think part of the confusion with questions has been that there are three microphones, one, two, or three; if you have questions, please go there.

Q: How do you see the advent of non-volatile memory changing the GPU landscape, things like memristor technology, stuff like that?

A: I don't see any really substantial change there. We're always looking for better memory technologies; we have many active collaborations going on with the major memory companies, and we'll take what we can get, in either more capacity or fewer dollars per bit. What really matters to us, though, is bandwidth and energy per bit for moving the data on and off, and I don't see the non-volatile memories making a big dent there. Many people are applying non-volatile technology, both flash and memristors, to trying to do these analog multiplies, and so we kicked off a project to try to replicate some of those results. The result of that was the paper that will be in DAC this year, which I showed two figures from, where we found that we were completely dominated by the A-to-D converters and that, using a given noise model, we wound up with energy per bit for a given accuracy level substantially higher than what we could do with digital.

Q: GPUs are virtualized now, so what are the techniques being used to make the context switching times smaller and smaller?

A: They're virtualized now, but the context switching time is not very small, nor do we feel a lot of pressure for it to be small, because we don't do a lot of very rapid context switching.
Q: OK, so what's the order of...?

A: I would have to guess; I don't have that number at the top of my head. I would have to get it for you.

Q: Thank you.

(Session chair) Let's thank Bill one more time. Thanks, Bill. [Applause]