# Brice Lecture 2019 - "The Future of Computing: Domain-Specific Accelerators" - William Dally
# https://www.youtube.com/watch/fnd05AeeFN4

[Host] Bill is chief scientist and senior vice president of research at Nvidia, and he is also a research professor and a former chair of the computer science department at Stanford University. He has been at Stanford for a number of years, and before that he was at MIT. He has a stellar research record, building bold, blue-sky computer systems, and many of the papers he writes these days impress you with their insight and their mathematical rigor. He works on developing computing systems for genuinely demanding applications, some of which you will hear about today. He is a member of the National Academy of Engineering and a fellow of the ACM and the American Academy of Arts and Sciences, and he has won pretty much all the major awards a computer architect could ever dream about. He is also a distinguished educator and mentor, and has graduated many students who have become professors everywhere, including here — in fact, I just heard that the keynote speaker we had on Monday was one of his first students — so you can tell how far his influence reaches. So, today, a talk about the future of computing and domain-specific accelerators. Please welcome him. [Applause]

[William Dally] Thank you. I'm honored to be here. My talk may be a little bit jet-lagged — I flew in from Tel Aviv via New York this morning; I'm sure Moshe does it all the time. I want to share with you one of the things that I think makes it really exciting to be a computer scientist and computer architect these days, which is this revitalization of computing hardware. We've hit a point where what we've been doing for the last 40 years doesn't work anymore, and domain-specific accelerators, I think, are what is going to revolutionize not just hardware architecture but all of computing — and I'll tell you why. (This photo, by the way, I took on my commute to work: if this mouse works — I live about there, in the corner, and commute down to the Bay Area every week or so to meet with students and for meetings at Nvidia.)

Back to the subject at hand. The thing that has really driven not just the computing industry but productivity and society in general has been faster computers. People sort of know this as the popular version of Moore's law — that isn't actually what Moore wrote about in his 1965 paper; in that paper he wrote about the scaling of the number of transistors it is economical to integrate, and that has also come to an end — but the scaling of computing performance from conventional serial computers has come to an end. Historically, creating more value and driving innovation across many industries has been powered by better computing performance: more value has taken faster computers, better algorithms, and more data. We still have better algorithms and more data, but we're not getting faster computers anymore. There was a great talk at the DARPA ERI summit last summer by an economist from CMU named Fuchs, who pointed out with pretty rigorous data that most of the productivity gains across many US industries are actually driven by computing, and that without continued scaling of computing a lot of that productivity is threatened to stagnate rather than continue to increase. So I'll posit that we, as computer scientists, need to continue delivering better performance and better performance per watt.
Historically we got this almost free ride from Moore's law; now we have to be clever — we have to think about how to do this. It used to be that we got it from process technology, but now Moore's law is dead. As evidence of this I'll present a figure from — and by the way, there is still one award that I aspire to, or at least dream about — our latest Turing Award winners: John Hennessy and David Patterson published this in the latest edition of their textbook. It shows the scaling of computing performance over the period from 1980 to 2020, and what you see is that back during the heyday of Moore's law, the big green area of this curve, computing performance doubled every 1.5 years — a 52 percent per year increase in performance. It's now essentially zero: under three percent per year. The old way of increasing computing performance was just to wait — take your old dusty decks, run them on the new computer, and they run faster. That doesn't work anymore. So how are we going to continue to scale computing performance? In the Turing Award lecture they gave at ISCA last year, they point out that domain-specific accelerators are the most promising way to continue to scale performance and performance per watt. I was very happy to hear that, because I've been making accelerators since 1985. This is a subset of the papers I've written on accelerators; they run the gamut from simulation accelerators to things that do signal and image processing, deep learning, and more recently genomics, and I'm going to draw examples from these projects over the years to talk about the nature of domain-specific accelerators and how they make you rethink computation in general.

So, to ask the question: why do accelerators do better than general-purpose computers? Specialization is certainly one of the characteristics, but I don't even put it first, because when you look at the performance increase from these accelerators, probably the biggest common denominator is that they are massively parallel. If you get a conventional CPU chip these days, it might have sixteen processors on it at the high end, or four at the low end. That's parallel, but it's not massively parallel. Many of these accelerators get parallelism in the thousands, and that really is what accounts for a lot of their increase in performance. But if you got that factor of thousands and didn't change the power equation, you wouldn't be able to run it in a reasonable amount of power, or do the computation in the form factor you'd like. What gets you the efficiency is typically the specialized data types and operations. I'll give you an example in a few minutes of a dynamic programming problem from gene sequencing where the specialized unit runs 37 times faster than a general-purpose processor — a modest increase in performance — but it's 26,000 times more energy efficient. That is really where specialization wins: making things more energy efficient. Part of that — I'll jump over the memory point for now — is because you get rid of the overhead. A modern high-performance CPU has very high management overhead; it spends the bulk of its energy, well over 99 percent, on administrative overhead: fetching instructions, decoding them, deciding what order they're going to execute in, renaming registers, running things out of order, unwinding speculation when a branch prediction is wrong. That's very expensive. Specialized accelerators get rid of that huge overhead and spend all of their energy actually doing the core computation.
Let me come back, then, to memory. It turns out that almost every computation we do is memory-centric, in the sense that the bulk of the area and the bulk of the power is taken up by representing state and accessing that state, so core to all these accelerators is how we deal with memory. In fact — and I'll talk a little bit about how we want to change how we think about algorithms — when you took your basic algorithms course, they taught you about counting operations, big-O notation: it's more efficient to sort this way because it's order n log n rather than order n squared. It turns out the operations are almost free today; what's really expensive is accessing memory. So we typically wind up redesigning our algorithms for acceleration to optimize how memory works: to keep the footprint of the highest-bandwidth memory accesses small, so we can serve those accesses from small local memory arrays and not from a big DRAM array. If you're limited by DRAM bandwidth, you're not going to get anything out of an accelerator — it's going to be memory limited, and that's going to be the end of your performance. Many algorithms, when we start accelerating them, have that characteristic: they're memory limited, and we have to restructure the algorithm. I'll give a couple of examples of how we've done that, drawing from the genomics accelerator, to make algorithms optimizable on the memory side.

That really gets to the bottom line here, which is that it's almost impossible to take an existing algorithm unchanged — unless you're extremely lucky — accelerate it, and get huge performance improvements. Usually it requires restructuring the algorithm for the constraints, and particularly the memory constraints, of accelerators. This is algorithm-hardware co-design, and what I'll suggest as a hope for the future is that as people develop algorithms, they will target them for accelerators from day one and do these optimizations to begin with.

Now, accelerators aren't new. Specialized hardware is everywhere, but it's largely invisible. Most of us have something that looks a little bit like this in our pockets, and if you look at the processor — this is an Apple iPhone; I think it has an A10 or A11 processor chip in it — it has a few ARM cores on it. Those ARM cores are used to run the complex but not demanding part of the computation, mostly the user interface and things like that. If you look at where most of the operations performed on that iPhone go, they're in specialized accelerators: it has accelerators for the radio modems, for still and moving image codecs, for the front end of the camera image processing — the demosaicing, white balance, and color balance — it has a deep neural network accelerator, and it has graphics accelerators. My company, Nvidia, has been building graphics chips for 25 years. The cores of graphics chips these days are extremely programmable, but again the programmability provides flexibility; the really heavy lifting is done by accelerators. So the rasterization that turns triangles into pixels, texture filtering, compositing — a lot of these core operations are done by accelerators. A lot of the accelerators are even hidden: it turns out we have compression and decompression accelerators on our memory channels.
This is actually not to make things take less space in memory but to make them take less memory bandwidth, so we can access them more quickly. It's a compression that doesn't reduce the memory allocation, but when the compression is successful we can fetch textures — and in fact fetch things from surfaces — at many times the bandwidth we would get without that compression. Most recently, with our Turing generation of GPUs, we've launched acceleration for ray tracing, trying to make photorealistic images. We can do an order of magnitude better than even the very efficient computation we had on our regular GPUs by accelerating a tree traversal over the bounding volume hierarchy: when you cast a ray in space, you want to find which polygons it intersects first, so you walk a tree that divides space up until you get down to a piece of space that has one triangle in it and test whether the ray hits it or not. In fact we also have a special-purpose accelerator for that intersection test, so we can do the ray-triangle intersection very quickly. That's what makes the RTX feature of the Turing GPUs possible; it wouldn't be possible without that acceleration.

So let me start walking through that list of things that accelerators have, beginning with specialized operations. When you think about specialized operations, you need to know what to compare against, and what most computers provide are integer and floating-point operations. If you think about it, floating-point operations are really specialized operations for scientific computing — a data representation that's been specialized over the years for scientific computing. But when you want to do a computation that doesn't fit well into integer or floating-point operations, emulating it on conventional processors is expensive.

So let me start with an example from bioinformatics. It's actually a really exciting area these days, because the performance of the sequencing machines — machines that will take a bit of your saliva and from that produce a gene sequence — has increased faster than Moore's law and is still increasing; it's an area where exponential scaling is still holding. It's gotten to the point now where the interesting cost is actually in doing the assembly. These machines produce what are called reads: contiguous sequences of bases ranging from a few hundred bases for the second-generation technology made by companies like Illumina, up to 10 or 20 thousand bases for the third-generation technology from companies like PacBio — which was just acquired by Illumina — or Oxford Nanopore. The assembly process then takes a bunch of these reads and figures out how they fit together. It's like a big jigsaw puzzle: the finished puzzle is your genome, and you have all of these pieces sitting around, except they all kind of look the same — they don't have the funny edges real puzzle pieces do — and you have to assemble them. There are two ways of doing this assembly. One is to say: most people are like person X, so let's start with person X's genome and line these pieces up against it. That's called reference-based assembly, and it turns out to be very biased — it won't catch certain variants that person X doesn't have, but will kind of reject them, because there's no place to put those puzzle pieces. So if you really want the best diagnostic, you do what's called a de novo assembly, where you don't use any reference information: you simply take the puzzle pieces, see where they overlap, and come up with a maximum-likelihood assembly.
Now, what makes this more difficult than it may sound is that these reads are very noisy. For the long-read technology in particular, some of the technologies have 60 percent accuracy: for any one base in the read, you only have a 60 percent chance of it being right and a 40 percent chance of it being wrong, which makes the assembly even more difficult. If you look at the core operation for doing this, it's a very fundamental computer science algorithm: dynamic programming. I have the reference sequence along one edge and the query sequence on the other — for de novo assembly, the "reference" is really all of the query sequences concatenated together, and you're just trying to see where they overlap. You're trying to find where two sequences line up, where they match. I start up in the corner, and if the characters match, like the two G's you see there, I get a score for matching; if there's a mismatch, I get a penalty for mismatching. I also have the possibility that there's an insertion or a deletion, in which case I go horizontally or vertically. So for every square in the dynamic programming matrix there are three possible ways of arriving there, depending on whether the highest score came from a match, an insertion, or a deletion, and this computation is expressed by the three recurrence equations on the right. For affine gap penalties you keep separate scores for insertion and deletion, so you can charge more for starting an insertion than for continuing it, and then H(i,j) is the maximum of: nothing, if you're at one of the edges; the insertion score, if I end an insertion at this point; the deletion score, if I end a deletion; or the score to the upper left plus the matching score, where I get a positive score if I match and a negative score if I mismatch.

Doing this on an Intel CPU — and I'm even going to handicap this by letting Intel use 14-nanometer technology while our accelerator uses 40-nanometer technology — takes 37 cycles and 81 nanojoules. The little accelerator we built does this computation in one cycle, so it's 37 times faster, and it takes 3.1 picojoules, which is 26,000 times more energy efficient. And if you peel the computation apart, it turns out the logic computing those three recurrence equations takes 300 femtojoules — a tiny fraction of the energy. The bulk of the energy, 90 percent, goes into storing the traceback pointer: the pointer recording which of the three conditions gave you the maximum, so you can later reconstruct the matching of the two sequences. This is an example of specialization giving modest improvements in performance and huge improvements in efficiency.
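For reference, the three recurrences described above are essentially the standard affine-gap (Smith-Waterman/Gotoh) formulation. Written out — with penalty names and notation that are mine, not necessarily what was on the slide — they look like this:

$$
\begin{aligned}
I_{i,j} &= \max\bigl(H_{i-1,j} - o,\; I_{i-1,j} - e\bigr)\\
D_{i,j} &= \max\bigl(H_{i,j-1} - o,\; D_{i,j-1} - e\bigr)\\
H_{i,j} &= \max\bigl(0,\; I_{i,j},\; D_{i,j},\; H_{i-1,j-1} + W(r_i, q_j)\bigr)
\end{aligned}
$$

where $o$ is the gap-open penalty, $e$ the gap-extend penalty, and $W(r_i, q_j)$ is the positive match / negative mismatch score. The case that wins the max in $H_{i,j}$ is the traceback pointer — and storing that pointer is what dominates the energy.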
Now, that efficiency is really what accelerator design is all about, and to design an accelerator well you have to have a good model of cost. A very simple model of cost — one that's actually amazingly accurate — is on this slide. Arithmetic is free, particularly at low precision; it's so inexpensive it almost doesn't count, and in fact a good first-order way of estimating the area and energy consumption of an accelerator is just to look at the memory — you'll usually come up with a number that's in the ballpark. Memory is expensive: accessing even a small memory costs way more than doing an arithmetic operation. And communication is prohibitively expensive; actually, a lot of what we think of as memory cost today is really communication cost. Basic memory arrays are small, 8 kilobytes or so. If I build a large on-chip SRAM, I build it out of little 8-kilobyte arrays, and the cost of accessing the big memory array is almost entirely the communication cost of getting the address to the selected sub-bank and getting the data back; the cost of accessing the 8-kilobyte array itself is roughly the same either way. I'll give some actual numbers a bit later.

Here's the cheat sheet I tend to use when I want to compare how expensive arithmetic operations are, and it tells us a bunch of things about precision. In something like a neural network accelerator, we're dominated by doing multiply-adds, and the multiplies are more expensive than the adds. If you look at the costs here, you realize that the cost of doing a multiply increases quadratically with the number of bits. This should make sense: when you learned in grade school to multiply numbers, you computed a bunch of partial products — a column for each digit and a row for each digit, so n-squared partial products — and then summed them all up. The adds increase linearly, but tend to be small enough that when you're doing multiply-adds, the multiplies dominate. So there's a big push to reduce precision, because we win quadratically in arithmetic energy. But if you look down a little further, you realize that even a large-precision operation, say a 32-bit multiply, is less expensive than reading those 32 bits from even a very small SRAM, and the lower-precision operations are way less expensive. This feeds that first concept: arithmetic is free, memory is expensive. And if I want to do a DRAM read — which is actually mostly communication, going off chip — it's orders of magnitude more expensive. So when we get to doing the co-design, we'll see that we really need to restructure our algorithms so that we do as few memory operations as possible, and those memory operations are done out of small arrays. A good rule of thumb is that every time I go up a level of the memory hierarchy, the cost increases by an order of magnitude. Accessing a small local SRAM array costs about 5 picojoules per 32-bit word. If I build a global on-chip SRAM array, something on the order of a few megabytes, that's 50 picojoules per word — and remember, 5 picojoules of that is really the memory access to the base SRAM array; the other 45 picojoules is communication. And if I go off chip, even to LPDDR3, which is one of the most energy-efficient DRAM families, it's another order of magnitude more expensive than that.

Moreover, as we look forward and scale technology, say from 40 nanometers to 10 nanometers, the arithmetic — this is the energy of a double-precision floating-point multiply-accumulate — improves roughly linearly, so it's about four times as efficient, but it's only about 30 percent more efficient to send a 32-bit word over 10 millimeters of wire in that same technology. The communication energy is scaling at a much slower rate than the arithmetic energy. So whereas arithmetic is free today, it's really free tomorrow: its energy is scaling down faster than the competing energy of communication.
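As a back-of-the-envelope, you can turn these rules of thumb into a tiny first-order energy estimator. This is just a sketch: the per-access numbers below are the rough 32-bit-word figures quoted above, and the DRAM value and the per-op arithmetic cost are assumed round numbers.

```python
# First-order energy model: count memory accesses at each level and
# almost ignore arithmetic. Per-access energies are rough 32-bit-word
# figures from the talk (the DRAM value is an assumed round number).
ENERGY_PJ = {
    "local_sram":  5.0,    # small on-chip SRAM array
    "global_sram": 50.0,   # few-MB on-chip SRAM (mostly wire energy)
    "dram":        500.0,  # off-chip LPDDR, another order of magnitude
}

def estimate_energy_uj(access_counts, arithmetic_ops=0, pj_per_op=0.1):
    """Estimated kernel energy in microjoules.

    access_counts: dict level -> number of 32-bit-word accesses
    arithmetic_ops: usually negligible next to the memory terms
    """
    pj = sum(ENERGY_PJ[level] * n for level, n in access_counts.items())
    pj += arithmetic_ops * pj_per_op
    return pj / 1e6

# Example: stream 1M words from DRAM vs. restructure the algorithm so
# 90% of the accesses hit a small local SRAM tile instead.
naive = estimate_energy_uj({"dram": 1_000_000})
tiled = estimate_energy_uj({"dram": 100_000, "local_sram": 900_000})
print(f"naive: {naive:.0f} uJ, tiled: {tiled:.0f} uJ")
```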
So let's talk a little bit about co-design, and I'll start with an example from our genomics accelerator. When we first started looking at this, we chose long-read assembly, because the long-read technology really has the ability to do things like detect structural variants that have diagnostic value — things you don't get from single-nucleotide polymorphisms and that you'll just miss with short reads, because the short reads are too small to span the whole variant. At the time, the best piece of software for doing long-read assembly was called GraphMap, and it turns out that because alignment — which uses dynamic programming — is really expensive on conventional processors, GraphMap spends almost all of its computation time doing what's called filtration. This is where you take seeds, maybe 11-base-pair subsequences from the query sequence, and use an index of the reference sequence to find where those seeds appear; by doing this you can winnow down the number of cases you have to test, so that almost every one you test works. In fact GraphMap gets only two false positives for every true positive. But what you see here is that it spends almost all of its time doing filtration, which is blue, and very little time doing alignment — there actually is a little bit if you look carefully at the end.

When we looked at this, what we realized is that we can do alignment blindingly fast: alignment is dynamic programming, and we could build this dynamic programming engine that is 26,000 times more efficient — and because we're energy limited, that means ultimately we could make it run 26,000 times faster at the same energy. So we were willing to trade a few more false positives to spend less time doing filtering and more time doing alignment. Why not make the filtering faster instead? Because filtering is fundamentally memory limited. You take the reference sequence — which for reference-based alignment is a three-billion-base-pair sequence, person X's genome — and you index it: you compute a big table, many tens of gigabytes in memory, with the locations of every 11-mer (there are 4^11, about four million, possible 11-mers), so it's a huge table, and you're essentially making random accesses into it. You are going to be limited by DRAM bandwidth on filtration; there's no fundamental way of accelerating that algorithm — you just have to do random accesses into a large table.

With our Darwin accelerator, we decided to do a faster but less precise filtration, an algorithm we called D-SOFT — I'll talk a little more about it later. It lets through an enormous number of false positives, almost 1,700 false positives for every true positive, and that's okay: we filter them out really fast with alignment. Then we do the alignment with a variation of straight dynamic programming we call GACT; I'll go into what exactly that is in a minute. If we just did this in software, it would make everything run two times slower, but what it did was trade filtration, which is fundamentally limited by memory bandwidth, for alignment, which we can accelerate very easily and which has a very small memory footprint. The next thing we did was build a hardware engine for alignment, with a degree of parallelism of 4,000. The combination of that parallelism of 4,000 and the speedup of 37 we get from specialization actually exceeds the efficiency gain, and it essentially makes that red bar go to zero. We traded something expensive for something cheap, then exploited the inexpensive nature of that and made it go blindingly fast in hardware.
We could have actually stopped there — a 380-times speedup would have been acceptable — but there are a couple of other optimizations. One is that when you're doing memory accesses, you want to keep all the memory channels busy all the time. It turns out that, especially for random accesses into big tables, conventional processors don't do this: they have memory systems that are optimized for latency, not throughput, and they typically lock up after a relatively small number of outstanding misses, like 8. Instead we optimized the memory system to keep four DRAM channels busy simultaneously, and compared to the CPU implementation I think that's about a 4x improvement in performance.

We then did two other optimizations. One: when we looked at our memory channels, we realized that once you find the right place in these seed tables, the accesses are actually linear, as you read every location where that seed could possibly be — but these were getting interrupted by incrementing the bins. (When you find where a seed hits, you count how many seeds hit that location, because each one is more evidence and increases your chance of a hit.) So we factored those bin tables out, as shown on the right here, into dedicated bin-count SRAMs — it's a small enough data structure that we could put it on chip — and it actually gave a speedup greater than the amount of memory traffic we took out, because in addition to making that traffic take essentially zero time, it removed the interference with the sequential accesses and allowed them to go at full speed. At this point the red bar is big enough that it still shows, so we pipelined the filtering with the alignment and got a final 1.4x speedup. The total is fifteen thousand on a reference-based assembly.

So the co-design really mattered: you had to make some fundamental changes to the algorithm, and it requires working with people who are experts in the domain, because for the biologists who use our tools to trust this, they had to know that the changes we made didn't cause us to give them the wrong answer — especially for something that is a very statistical process. We tested this very rigorously, making sure we had equal or better sensitivity at each step of the process, and in many cases doing even more work than we needed to in order to get better sensitivity than the baseline algorithm we were competing with.

So let's talk a little bit about memory. Memory dominates in several ways. The first is that it dominates power and area. Here are the area and power for each part of the Darwin accelerator, and what you see is that for the dynamic programming part, the GACT part, memory is almost 80 percent of the area and over three-quarters of the power. For the filtering it's actually even more than that — 98 percent of the area and 96 percent of the power — and even a little higher if you throw the DRAM in as well. If you're designing an accelerator, this actually makes it really easy to do first-order estimates. Very often you're doing design exploration, trying to decide: do I take approach A or approach B? Rather than having the graduate student actually implement all the RTL, synthesize it, place and route it, and get very accurate measurements, you can get numbers that are probably 80 to 90 percent accurate just by figuring out how big the memory arrays are and how many accesses you make to each of them, counting those up, choosing among your alternative embodiments that way, and then fine-tuning at the end by doing the detailed design.
Another way memory dominates is that it actually drives what algorithms you can use: there are a lot of great algorithms that you can't make work in an accelerator because they become memory limited. So let's talk about dynamic programming. For these long reads, a typical assembly has 30x coverage and 10,000-base-pair reads, which ends up being 15 million reads, and if we do these alignments with straight dynamic programming, the array we have to fill in is on the order of 10,000 by 10,000 entries — and that ends up being too much to put in on-chip storage. People have tried to come up with smaller-memory-footprint approaches in the past by doing what's called banded Smith-Waterman, where you compute only a band around the diagonal of the array. The problem is that this doesn't work, and the reason is — you'll notice for the one sequence I show here, which is a real sequence, that it doesn't end up in the bottom-right corner. The probability of an insert is not the same as the probability of a delete, so over time there's a bias, and the actual match between the two sequences wanders, often very far off the diagonal. You would have to use an enormous band to make the probability of the alignment wandering outside the band small.

So what we did instead was come up with a tiling approach — GACT, where the T stands for tiled. When we first did this, we tried to tile rigidly, forcing the alignment to go through the bottom-right corner of every block, and that of course didn't work; we did not get optimal alignments that way. The thing that made GACT work was realizing that if we overlapped the blocks, we could do an alignment on one tile, say the upper left, find its maximum-scoring exit point, and then overlap the next tile with that one, starting back a ways — the typical sizes are maybe a 500-by-500 tile overlapped by 100 to 200, so we restart the alignment not from where we left off but from 200 back. That always finds the correct alignment: we have not found a single assembly where our alignment does not match the alignment produced by the full Smith-Waterman. But instead of having a memory footprint of the whole matrix to fill in, the footprint is now the size of a tile — 500 by 500, about 250,000 cells — which is easy to put in a very small RAM, and in fact you can have many very small RAMs.

Once we've reduced the memory footprint by that amount — and the logic here is very inexpensive — we can have a lot of these. We actually have 4,000 processing elements that compute the dynamic programming: 64 arrays of 64 elements each. We're doing these assemblies with 15 million reads, so there are 15 million alignments to do — there's plenty of parallelism in that outer loop — so we start 64 of them at a time, and then within each array 64 processing elements walk a diagonal wavefront down the dynamic programming matrix, and each processing element has its own little private SRAM where the traceback gets stored every cycle.
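To make the tiling idea concrete, here is a deliberately simplified software sketch of overlapped tiled alignment in the spirit of GACT. It uses linear gap penalties rather than the affine recurrences above, made-up score values, and smaller tiles than the real 500-by-500 with 100-200 overlap, so treat it as an illustration of the tile-and-restart idea, not a reimplementation of Darwin.

```python
import numpy as np

MATCH, MISMATCH, GAP = 2, -3, -4   # placeholder linear-gap scores
TILE, OVERLAP = 128, 32            # small tiles for the sketch

def tile_dp(r, q):
    """DP over one tile; returns scores H and traceback pointers
    (1 = diagonal, 2 = up / gap in q, 3 = left / gap in r)."""
    H = np.zeros((len(r) + 1, len(q) + 1), dtype=np.int32)
    ptr = np.zeros_like(H, dtype=np.int8)
    for i in range(1, len(r) + 1):
        for j in range(1, len(q) + 1):
            s = MATCH if r[i - 1] == q[j - 1] else MISMATCH
            cands = (H[i-1, j-1] + s, H[i-1, j] + GAP, H[i, j-1] + GAP)
            best = int(np.argmax(cands))
            H[i, j], ptr[i, j] = cands[best], best + 1
    return H, ptr

def traceback(ptr, i, j):
    """Follow pointers from (i, j) back toward the tile origin."""
    path = []
    while i > 0 and j > 0 and ptr[i, j]:
        path.append((i, j))
        if ptr[i, j] == 1:   i, j = i - 1, j - 1
        elif ptr[i, j] == 2: i -= 1
        else:                j -= 1
    path.reverse()
    return path

def gact_like_align(ref, query):
    """Tile-by-tile: commit all but the last OVERLAP steps of each
    tile's traceback, then restart the next tile from that point."""
    ri = qi = 0
    committed = []
    while ri < len(ref) and qi < len(query):
        H, ptr = tile_dp(ref[ri:ri + TILE], query[qi:qi + TILE])
        i, j = np.unravel_index(int(np.argmax(H)), H.shape)  # best exit
        path = traceback(ptr, i, j)
        if not path:
            break
        keep = path[:-OVERLAP] if len(path) > OVERLAP else path
        committed += [(int(ri + a), int(qi + b)) for a, b in keep]
        ri, qi = committed[-1]
    return committed

pairs = gact_like_align("ACGTACGTGGA" * 30, "ACGTACGTGCA" * 30)
print(len(pairs), pairs[:2], pairs[-2:])
```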
Now, this is an example of a systolic-array algorithm, and the great thing about systolic-array algorithms is that they simplify two of the hard things about parallelism: communication and synchronization. What we're communicating each cycle here are the I, D, and H values from those recurrence equations, and those are nearest-neighbor communications out of registers — no memory accesses required. And the synchronization is lockstep: because we're walking down the diagonal, if we want the values from the previous cells — above us, to the left, and up-and-to-the-left — those are in registers from the last cycle, and no special synchronization is needed, as there would be if two things were running on their own and you needed to signal not only that the data is there but when it's there. So the operation is extremely efficient; it winds up being about a hundred and fifty thousand times faster than a CPU for this part of the computation.

The final way memory is important is that it really drives cost, and it drives cost in an unusual way. For example, when I took those bin-count memories in the Darwin implementation and put them in SRAM rather than DRAM, I was replacing a storage technology with one that's probably about a hundred times more expensive per bit. On-chip SRAM is kind of like real estate in a city like Houston: if you're located in a really good place, maybe near the university, that real estate is probably a lot more expensive per acre than something 50 miles out of town. Same thing here: SRAM right near the computation is very expensive real estate, at least 100 times more expensive than DRAM. But even at a hundred times more expensive per bit, it can be cheaper. To explain why, let me show you the D-SOFT algorithm and then do a cost computation.

In D-SOFT, we've got these seeds coming in — essentially, which locations in the reference this little piece of the query could hit — and we're trying to compute which bins accumulate enough hits, enough evidence that two sequences might be similar, that we should then do the alignment between them. This is really a toy example: typically you'd use 11-mers; I'm going to use 2-mers. I start out with the query sequence at the left; the reference sequence runs along the bottom. I ask: how many places does GT occur? I go to the pointer table and look up GT, which then causes me to look in the position table — both of these tables are in DRAM — and it says the locations are 12 and 31. So I increment the bins for 12 and 31 — shown on the right, the green bin and the tan bin — by two, because two base pairs match. Then I take my next seed, which is GC. It's not overlapping, so I'm going to get full counts: it matches four places, as shown by the position table, and those four bins get incremented by two, so now I have three twos and a four. The next one is overlapping — it's CT — so where it hits, if I also hit with GC, I increment by one, otherwise I increment by two. I keep doing this until I get to the end of the sequence, and then I check the bins against a threshold. In this case, with the threshold being six, two of these bins pass and the rest don't, and those two go off to the alignment stage.
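Here is a minimal software sketch of that bin-counting idea. The seed length and threshold follow the toy example above, but the bin width is an assumption, the bins here are keyed by the implied start position of the query in the reference, and the real D-SOFT keeps the pointer and position tables in DRAM and the bin counts in banked on-chip SRAM.

```python
from collections import defaultdict

SEED = 2        # the toy example uses 2-mers; Darwin uses ~11-mers
BIN = 16        # bin width in reference positions (assumed)
THRESHOLD = 6   # bases of evidence needed to send a candidate onward

def build_index(reference):
    """Pointer/position tables rolled into one: seed -> positions."""
    index = defaultdict(list)
    for p in range(len(reference) - SEED + 1):
        index[reference[p:p + SEED]].append(p)
    return index

def dsoft_candidates(index, query):
    """Accumulate seed-hit evidence per bin; return bins over threshold."""
    bins = defaultdict(int)
    covered_to = 0
    for q in range(len(query) - SEED + 1):
        seed = query[q:q + SEED]
        # overlapping seeds only contribute the bases not already counted
        novel = min(SEED, q + SEED - covered_to)
        covered_to = q + SEED
        for p in index.get(seed, ()):
            bins[(p - q) // BIN] += novel   # bin by implied start position
    return [b for b, count in bins.items() if count >= THRESHOLD]

ref = "ACGTGGATCCGTACGTAGCTAGGATCCGTAACGT"
qry = "GGATCCGTAC"
print(dsoft_candidates(build_index(ref), qry))  # candidate bins to align
```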
The way we do this in hardware is to build a structure with separate bin-count RAMs, partitioned sixteen ways — even though, with four DRAM channels, we can get at most four queries into this at a time. For the places we need to increment, a little on-chip network routes the increments to the appropriate bin-count arrays, and those bin counts get incremented. The reason we have 16x over-provisioning here is to make the probability of two increments hitting the same array small enough that a small FIFO takes out the variation, so we never fall behind — we want to keep those DRAMs busy all the time. Then there are two other SRAMs here. One is the non-zero-bins SRAM: every time we do an alignment we need to start this process over, and we need to reset all the bins we may have incremented — there are a lot of them, and we don't want to scan through the whole array setting them all back to zero. So the first time we increment a bin, we push that bin's index onto a stack in the non-zero-bins SRAM, and when we reinitialize we just pop that stack and zero only the bins that need to be zeroed, bank by bank. And the bins that exceed the threshold, as soon as they exceed it, go out to the arbiter and get fed off to the alignment engine, so the alignment gets pipelined with this.

So let's do a cost computation. Suppose the cost multiplier for on-chip SRAM is 100. For the bin-count SRAMs I have 64 megabytes at a cost of 100, which is 6.4 giga-units of cost. The DRAM has a cost of 1, and I have 128 gigabytes of it, so the cost of my total memory here is about 134 giga-units: 128 giga-units for the DRAM and 6.4 for the SRAM. Now, you would think that's more than the DRAM-only system — but there's a time component to cost. Suppose I have an unending stream of sequences to filter. I can filter 15.6 times faster on this array, so to match that performance I would need 15.6 copies of the DRAM-only system. So the DRAM-only system is, in some sense — in memory units times time — 15.6 times as expensive. Even though it has less memory cost, its total cost is actually about fifteen times as much: roughly two tera-units of cost as opposed to 134 giga-units.
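The arithmetic behind that comparison is worth writing out. A quick sketch using the numbers quoted above (the cost units are relative, and the 100x SRAM-to-DRAM per-byte ratio is the assumed multiplier from the talk):

```python
SRAM_COST_PER_BYTE = 100       # relative units: assumed ~100x DRAM
DRAM_COST_PER_BYTE = 1
SPEEDUP = 15.6                 # filtration speedup with on-chip bin counts

# Accelerator: 64 MB of bin-count SRAM plus 128 GB of DRAM
accel = 64e6 * SRAM_COST_PER_BYTE + 128e9 * DRAM_COST_PER_BYTE

# DRAM-only baseline: matching the accelerator's throughput on an
# unending stream of reads takes SPEEDUP copies of it
baseline = SPEEDUP * 128e9 * DRAM_COST_PER_BYTE

print(f"accelerator: {accel / 1e9:7.1f} G-units")    # ~134.4
print(f"DRAM-only:   {baseline / 1e9:7.1f} G-units")  # ~1996.8 (~2 T)
print(f"ratio:       {baseline / accel:.1f}x")        # ~14.9
```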
Now, when you look at trying to make the most of memory, it drives you to do things like using sparse structures and compression, and this is where domain-specific architectures play a big role. It turns out that if you go grab a standard sparse linear algebra package, you'll find you have to be really sparse for that package to run faster than the dense package — like less than one percent dense, sometimes less than a tenth of a percent dense, depending on the implementation. My former student Song Han and I were looking at neural networks, and we discovered that they are between 10 and 30 percent dense if you prune away the unneeded weights of the matrices. The conventional wisdom was that this is too dense to use a sparse package, and that would be true if you had to implement that sparse package on conventional hardware — but domain-specific hardware can make the overhead of handling that sparsity essentially go away. So we built a hardware accelerator that walks the tables of the compressed-sparse-column format — those tables, of course, sitting in separate small memories — and that basically allowed us to get improved performance at densities up to 50 and 60 percent. There is still some overhead of walking those pointers, but it's taken out of the critical path.

We also realized that if you really want to compress things down, you're wasting bits if you sample values uniformly. Suppose I have a probability distribution — this is an actual distribution of weights in a neural network — and I have four bits to represent these values. If I simply use a binary encoding for my symbols, the x's shown here, they're equally spaced, and what you see is that I'm wasting a lot of x's out where nothing interesting is happening, and sampling relatively sparsely under the lobe where lots of interesting stuff is happening. Instead, you can train a codebook — it turns out that with neural networks, anything you can take a derivative of you can train with stochastic gradient descent, so we can in fact train the codebook to find the optimal set of symbols to put in it. If we train a four-bit codebook, we get the red dots here: they aren't wasted out where nothing interesting is happening; they're spent where the interesting stuff happens under these curves. Now, this would be prohibitively expensive on conventional hardware, because you would have to do the decoding on every multiply-add — you get the weight index and have to look up in the codebook what the actual value is — and that would slow things down quite a bit. But in a domain-specific architecture, both of these things — the sparsity and the codebook — are almost free.

The way the efficient inference engine (EIE) we built works, the codebook sits in a separate pipeline stage, a weight decoder — it's a small SRAM — and because it isn't taking cycles away from the main compute engine, and doesn't have the overhead of nearly a nanojoule per instruction fetch swamping everything out, it adds essentially nothing to the cost of the computation. The way we handle the sparse matrices is that we get a column index and read from a pair of SRAMs: one tells us where in the compressed-sparse-column structure our column starts, and the other where the next column starts. We read those column start and end addresses from a RAM — it's dual-ported, so we can do this every cycle — and that indexes our sparse-matrix SRAM to read the actual weight values out, and since we know where the column ends, we know how far to read before stopping. And to drive home the point that memory dominates: this is the overall architecture of EIE, a 2D array of processing elements, and this is one of those processing elements. The green areas are the sparse-matrix RAM — the thing storing the actual weights of the sparse matrix — and the pointer-even and pointer-odd RAMs are the two memories that store the starting pointers of the even and odd columns; together they account for something like 90 percent of the area. The arithmetic, all the logic in the middle, is again less than 10 percent.
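Here is a software sketch of what one EIE-style processing element does conceptually: skip zero activations, walk one column of a compressed-sparse-column matrix between its start and end pointers, and decode each stored small index through the shared codebook. The array names, sizes, and the tiny example values are illustrative, not taken from EIE itself.

```python
import numpy as np

def csc_codebook_matvec(n_rows, col_ptr, row_idx, weight_idx, codebook, x):
    """y = W @ x with W stored in compressed-sparse-column form.

    col_ptr[j] .. col_ptr[j+1] : range of nonzeros in column j
                                 (the role of the pointer even/odd RAMs)
    row_idx[k]                 : row of the k-th nonzero
    weight_idx[k]              : small index into the shared codebook
    Zero activations are skipped entirely (activation sparsity).
    """
    y = np.zeros(n_rows, dtype=np.float32)
    for j, xj in enumerate(x):
        if xj == 0.0:
            continue
        for k in range(col_ptr[j], col_ptr[j + 1]):
            y[row_idx[k]] += codebook[weight_idx[k]] * xj
    return y

# Tiny example: a 4x3 matrix with 5 nonzeros and a 4-entry codebook
codebook   = np.array([-0.5, 0.1, 0.3, 0.8], dtype=np.float32)
col_ptr    = [0, 2, 3, 5]
row_idx    = [0, 3, 1, 0, 2]
weight_idx = [3, 1, 0, 2, 2]
x          = np.array([1.0, 0.0, 2.0], dtype=np.float32)
print(csc_codebook_matvec(4, col_ptr, row_idx, weight_idx, codebook, x))
```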
Another thing we've observed over time in building these accelerators is that sometimes very clever algorithms wind up being slower, and the example I'll use here is SAT, the Boolean satisfiability problem. It's actually a really important problem — an NP-complete problem that people solve all the time because they have to. It's the core of many hardware and software verification algorithms, and it's also something that, for example, if you want to do logical inference, you can take a number of logical clauses and pose them as a SAT problem, and it winds up being a very efficient way of doing logical inference. There's a SAT competition every year, and the programs that have won it in recent years are derivatives of a program called MiniSat that came out in the early 2000s. MiniSat and all these derivatives take advantage of two things. One, developed in 1996, is conflict-driven clause learning: you're guessing values for Boolean variables working down the tree, and when you hit a point where a Boolean variable has to be both 1 and 0 at the same time, that's a conflict and you have to backtrack up the tree. If, when you hit those conflicts, you create a new clause to add to your original set of SAT clauses to remember that conflict, it makes your search of the space much more efficient. So everybody does conflict-driven clause learning. The other thing all the recent algorithms do is an innovation from some folks at Princeton in 2000, where they keep a very compressed data structure with only two watched variables per clause; as you assign a variable, if it's not one of those two, you know you couldn't possibly have driven that clause to be unit, so it optimizes the search.

Well, we tried to implement that — we figured we should use the most efficient algorithms — and it was blindingly slow, because it serializes things. Those data structures — the indices and the two-watched structure that Chaff, the Princeton program, used — made it much slower. So ultimately what we decided to do was just implement an array, as shown here. Say I decide I'm going to set variable A to 1. I then send a message into this array and it propagates: each of these blue squares is what's called a clause unit, which holds a very large number, many thousands, of clauses, and it checks all of those clauses to see if they contain variable A, and if so sets it to 1 in those clauses. This propagates down until this green one here indicates that I've driven a clause to be unit, meaning that having set A to 1, there's only one remaining unbound variable in that clause — so for that clause to be satisfied, that variable has to be whatever its polarity is in that clause, 1 or 0. So I've now determined a derived variable: I've determined that, say, B needs to be 1, and I start propagating B equals 1 from this point — these are the purple ones — until, for example, I detect a conflict: setting B equal to 1 caused a conflict. By getting rid of the serial nature of walking the two-watched structure and the indices required to maintain it, I get tremendous parallelism. So sometimes you actually want to do more operations than the minimum order-n of the algorithm in order to unlock that acceleration.
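For concreteness, here is a minimal sketch of the Boolean constraint propagation being parallelized here, written the naive way — every clause is checked on every pass, which is exactly the simple formulation that maps onto an array of clause units, rather than the serial two-watched-literal scheme. The tiny CNF example at the end is made up.

```python
def propagate(clauses, assignment, trail):
    """Naive unit propagation over CNF clauses.

    clauses:    list of clauses, each a list of DIMACS-style literals
                (+v means variable v is true, -v means it is false)
    assignment: dict var -> bool of currently assigned variables
    trail:      literals assigned so far (decisions plus implications)
    Returns a conflicting clause, or None if propagation completes.
    """
    changed = True
    while changed:
        changed = False
        for clause in clauses:          # a clause unit checks these in parallel
            unassigned, satisfied = [], False
            for lit in clause:
                var, want = abs(lit), lit > 0
                if var not in assignment:
                    unassigned.append(lit)
                elif assignment[var] == want:
                    satisfied = True
                    break
            if satisfied:
                continue
            if not unassigned:
                return clause           # conflict: clause fully falsified
            if len(unassigned) == 1:    # unit clause: forced (derived) variable
                lit = unassigned[0]
                assignment[abs(lit)] = lit > 0
                trail.append(lit)
                changed = True
    return None

# (A or B), (not A or C), (not B or not C), (not A or not C)
cnf = [[1, 2], [-1, 3], [-2, -3], [-1, -3]]
assignment, trail = {1: True}, [1]      # decide A = 1
print(propagate(cnf, assignment, trail), assignment)
```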
The other thing we found in looking at SAT is that many people have accelerated just the part I showed you, which is called propagation — setting a variable and propagating all the consequences of that. In fact we were very fast at that: 300 to 500 times faster than a CPU at Boolean constraint propagation. But if that was all we did, we would have accelerated SAT by about 4x, and the reason is shown in this bar graph. These are a bunch of the benchmarks from the most recent SAT competition, and what you see is that propagation — the green part of the bar — is less than half of the total for some problems, and on average about 70 or 80 percent. If you don't accelerate the remaining percentage — this is the Amdahl's law of domain-specific hardware — and you run it on a standard CPU, it completely limits you. So in fact we had to extend our array so that in addition to doing the forward propagation, it also does the clause learning and the clause simplification. We did keep much of the algorithm — the more subtle but less throughput-demanding parts — in software. For example, maintaining the data that determines which variable to choose next we kept in software, so we could duplicate the exact heuristics of the existing algorithms; deciding when to do a restart we kept in software; and likewise deciding which clauses to discard — it turns out over time you wind up learning too many clauses and have to throw a bunch of them away. So we can keep all of the existing heuristics unchanged, but make them go blindingly fast with hardware.

Now, a key thing that comes up in building accelerators is over-specialization. Very often you're implementing exactly one algorithm: you make that algorithm go really fast, but then you're done, and somebody comes up with a slightly different algorithm and it won't run on your hardware. Some of the first accelerators I built, back in the 1980s, had this problem. I built an accelerator called the MARS simulation engine when I was a graduate student at Caltech. I had the unfortunate circumstance that my first PhD thesis advisor was Randy Bryant, and after about a year at Caltech, when I was well into what was going to be my PhD thesis, Randy comes to me one day and says, "I'm moving to Pittsburgh to go to CMU — you're coming with me, right?" I go: what, move from Los Angeles to Pittsburgh? I don't think so. So I was immediately on the market for a new PhD advisor, did a reset, threw that thesis away, and started another one. But at the same time I was being supported in graduate school by Bell Labs, where I'd worked before going back for my PhD, and I was spending a week a month at Murray Hill working with people there; they saw that project — I actually finished it — and wanted a copy of it. In the original design I had built hardware blocks for each part of what was basically a switch-level simulator, and it was blindingly fast — it did exactly that algorithm. But when I went around Bell Labs talking to people about what they wanted out of a simulator, I found some people wanted a switch-level simulator, but other people just wanted a logic simulator; some wanted a logic simulator with unit delays and others wanted multiple delays; and some people wanted to do fault simulation. My head was starting to hurt as I was drawing all the boxes I was going to have to implement. Then I realized that the main piece of performance I was going to get out of this was parallelism, and as for specialization, it turned out there were some common operations that all of these boxes shared.
So rather than building something that looked like the original MARS simulation engine — I actually have a huge wire-wrapped circuit board, which is the way you did things back in those days — I architected it at a high level to look like this: each of the boxes from the previous design (one box, or a small number of boxes, load-balanced) was mapped onto one of these processing elements. I don't know if I have a photo of it here; I should put one in the slides. Each of those processing elements was a custom microprocessor. This was back in the day when the microprocessor was designed by me and one layout tech: I was hand-drawing schematics, pencil on paper, handing them to the layout tech, and he was feeding them into the layout system. The common operations were operations on small fields — there was a field-operation unit where you could pull an arbitrary bit field, anywhere from 1 up to about 16 bits, out of a word, do an operation on it, and insert it back into a word — and table lookups, so there was a very efficient address-arithmetic unit to do table lookups. And to reduce the overhead of communication and synchronization — which is what we get from systolic arrays, though this is not completely systolic, since the different operations take different amounts of time — I wanted very efficient communication and synchronization. Remember, these things form one big pipeline, so everything starts by reading a record from your input and ends by writing a record to your output. There was a queue unit that managed those records: you would read a record from your input as if it were a register — the queue unit was register-mapped — do a couple of operations, and then write to the output queue, specifying to which processing element that message should be sent, fill out the record, and send it. It wound up that, whereas in the previous design all those hardware boxes ran in exactly one cycle, the unit-delay logic simulator on MARS ran four cycles per step — it was limited by the slowest pipeline stage — but this machine was still something like a thousand times faster than running that simulator on the IBM mainframe of the time. They also implemented a multiple-delay logic simulator, a fault simulator, and a switch-level simulator on the hardware.

I did the original design in 1.25-micron CMOS, in, I think, 1987, and I found out much later that they actually revised it through five generations of technology; the final version was done around 2000. The way I discovered this was kind of interesting. I was teaching the introductory logic design course at Stanford, and a student in my class comes up to me after class and says, "My dad knows you — he worked with you at Bell Labs." I ask his name, and I don't remember working with him. He says, "He inherited your simulator." It turns out this guy did — and I pity him, because I had drawn those schematics so that I could read them and the layout tech could read them; there was not great documentation for this machine. He says, yeah, he reimplemented it in half-micron CMOS, and then 0.35, and then 0.25, and so on. I wound up calling the guy up and had some great chats about it. But it shows how, if something is built to be somewhat general, it winds up with relatively good longevity, whereas if you over-specialize, it usually gets used for a short period of time and gets tossed out as soon as the algorithms improve.
That leads me to this concept of platforms for acceleration. If you look at the common denominator between the different accelerators I've talked about — I must be talking too long, people are starting to leave; I'm nearly done — they need a few things. First of all, they need a very high-bandwidth, hierarchical memory system: we wind up needing these small memory arrays, whether they're the traceback memories in the dynamic programming engine, the bin-count arrays in the filtering engine, or the arrays storing the sparse matrix in our neural-network accelerator. Then we need very programmable control and operand delivery, and we need simple places to bolt on domain-specific hardware — so we have an engine for doing neural networks really fast, an engine for doing dynamic programming really fast, and you can bolt these on in a couple of ways. The easiest way is to define a new instruction: I could have an instruction which is a dynamic-programming step. Another way would be to have it be a memory client: I load up its problem in memory and kick it off; it reads the problem from memory and writes the result back to memory.

So I will posit that a GPU is the perfect platform on which to build accelerators. It has a wonderful underlying memory system, it has great places to bolt on domain-specific accelerators, and it has very programmable control and operand delivery. In fact we've been using it in this way, and one example I like to draw on is to look at the Volta V100 and compare it to something like the Google TPU. When you do that comparison, the part of Volta that's important is what we call the Tensor Cores. "Tensor Core" is the name the marketing people made up; in engineering we refer to it as HMMA, half-precision matrix multiply-accumulate. The operation is illustrated here: it takes two 4-by-4 FP16 matrices, multiplies them together, and adds the result into, typically, an FP32 matrix, so it does 128 arithmetic operations in the course of a single instruction. The importance of that is that it amortizes out the overhead. If I go through previous generations: when the Google TPU people published their comparison, they compared against our Kepler GPU, which is a GPU we started designing around 2009 — it came out in 2012 — before we really considered deep learning as a special case, so it didn't even have int8 operations, let alone FP16 operations; they were basically comparing their int8 operations against our FP32 operations. The first GPU that had any real support for doing this well was Maxwell, where we had a half-precision fused multiply-accumulate instruction that does two ops taking about 1.5 picojoules of energy — but compared against the 30 picojoules it takes to fetch and decode the instruction and fetch the operands, our overhead was 2,000 percent. In case you think that's a lot, remember that a general-purpose CPU has an overhead of about a hundred thousand percent, so this actually isn't that bad in comparison. By the time we moved to the Pascal generation we had a dot-product instruction, so the overhead is down to 500 percent. And with HMMA in Volta we're doing 128 ops for about 110 picojoules of energy, so the 30 picojoules of overhead, amortized out, is only about 30 percent.
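The semantics of one such HMMA-style operation, sketched in numpy — the real instruction operates on register fragments shared across a warp, so this only shows the arithmetic being amortized against one instruction's worth of overhead:

```python
import numpy as np

def hmma_4x4x4(a_fp16, b_fp16, c_fp32):
    """D = A @ B + C with FP16 inputs and FP32 accumulation.
    One 4x4x4 op is 64 multiplies and 64 adds (including the accumulate
    into C): 128 arithmetic operations per instruction."""
    return a_fp16.astype(np.float32) @ b_fp16.astype(np.float32) + c_fp32

a = np.random.rand(4, 4).astype(np.float16)
b = np.random.rand(4, 4).astype(np.float16)
c = np.zeros((4, 4), dtype=np.float32)
print(hmma_4x4x4(a, b, c))
```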
What this means is that if you built a dedicated deep-learning chip — a TPU — in the same technology, using the same degree of craftsmanship for the arithmetic units, you would do at most about 27 percent better. We actually evaluated doing that and decided the 27 percent isn't worth giving up the flexibility we get from being able to program arbitrary layers using the programmable parts of the chip. So again, it's about getting the degree of programmability right — and that, by the way, is what a fair comparison of a dedicated TPU to a GPU with Tensor Cores should show as the advantage: about 27 percent in energy.

Our vision for the future is that people will code their programs thinking of them not as something that gets mapped down purely into software, but into a combination of software and hardware. We will write a program — say it's an n-body program where I'm mapping a force over pairs of particles — and I may even include some guidance, some mapping directives, about how I think this ought to be mapped onto hardware. There will be software — a mapper and a runtime — that does data and task placement, and for the tasks it will have the option of handing them to a hardware synthesizer to build specialized blocks, to get that advantage of specialization, which can be fed back in. You can think of this as synthesizing new tensor cores or ray-tracing cores or whatever we want to speed up the algorithms of the future. This gets mapped onto a platform that has a very high-bandwidth memory and communication system, some general-purpose cores — like the streaming multiprocessors in a GPU — and then, selectively, these specialized units dropped in to accelerate the problem at hand. The cost of doing this is much lower than building a specialized engine from scratch, because we're able to leverage the platform for 90 percent of what's on the chip — the memory system, the communication system, the general-purpose control — and all we have to do is customize those blocks. I think there's really interesting research to be done in how to automate this mapping and runtime, and in how to decide which parts need to be accelerated. And realize it's not a one-way flow: you're going to write the program and some mapping directives, see what happens, and then do some design-space exploration where you try out different algorithms, different mappings, different ways of decomposing the problem. But I think this is the way computation will be done in the future.

So let me wrap up, since I'm probably over time. We need to continue scaling performance per watt; our economic growth depends on it, and machine learning depends on it. A lot of people don't appreciate this, but all the algorithms everybody uses for machine learning have been around since the 1980s — convolutional neural networks, stochastic gradient descent, backprop, all from the 1980s. I actually took a course at Caltech with John Hopfield when he was on sabbatical there, and built neural networks, and concluded that the machines weren't fast enough to make them work — and in fact that was the case. It wasn't until we had GPUs that were fast enough that machine learning took off, and today machine learning progress is gated by our ability to train larger models. People want to build bigger models and train them on more data, and that takes quadratically more compute to train; so they're limited not by how much data they have, or by how creative they are in building models, but by the compute to train them.
So we need to continue scaling performance to make that happen, and with Moore's law being over, the most promising way of scaling performance is to build domain-specific accelerators. There are a bunch of principles I hope I've shared with you, sort of through case studies, about how to do that. The first one is co-design: you can very rarely take the same algorithm and just accelerate it. You have to rethink the algorithm in terms of parallelism and in terms of an efficient memory footprint; it's often the right way to rethink it even to run it on a conventional processor, but the algorithm often has to change. Memory dominates: it takes up the bulk of the area and the power, and you often have to restructure your algorithms to make them fit in a reasonable amount of memory. You get performance when you can make the high-bandwidth accesses hit small, fast, on-chip memories; we can have multiple arrays and get parallelism out of that memory system. You're limited by how much global memory bandwidth you have, and you have to use it very carefully. And finally, simple parallelism often wins: often you're better off not using complex, very serial data structures, and instead paying a higher computational complexity, because operations are nearly free, for something with more parallelism, because that may wind up giving you a faster solution time.

To make this economically feasible in the future, what we need to do is factor out the common parts of the system. Memory doesn't care what its bits are storing: they could be storing genes, little two-bit base-pair sequences; they could be storing weights and activations for neural networks; they could be storing pressures and velocities for a fluid dynamics computation. If you have a very fast, very parallel on-chip memory system, it can be used for any of those things. So you can leverage that, leverage some general-purpose processing cores, and add just the amount of logic you need for the specialization of your problem, and GPUs wind up being an optimal way of doing this, whether you're adding instructions or memory clients. And I think that to make this possible we need better tools to explore this design space of accelerators. So thank you, I'd be happy to take any questions. [Applause]

I actually have two questions. One is, is there any chance of improving memory latency? This has been ailing us for decades, it's been driving much of classical computer architecture, and it's a major issue in what you're discussing. Is there any chance we can do something about memory latency, perhaps with some new electronics? And the second question is that, not surprisingly, FPGAs were not mentioned.

Okay, let me take both of those. Thank you, Moshe, those are both really good questions. So let me talk about memory latency. I'm actually much more worried about memory bandwidth than I am about latency, and let me tell you why. Memory latency I can cover with more parallelism: if I take the gene sequence accelerator, I've got 15 million reads that I can all handle in parallel, so I've got an enormous amount of parallelism with which I can cover whatever memory latency you have. But today I'm limited by memory chips having a certain amount of memory bandwidth, somewhere around 20 gigabytes per second per channel, and that may actually be a little optimistic, maybe 15, and that's all I get. That winds up being the bottleneck, so it's actually the bandwidth I worry about more.
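A back-of-the-envelope way to see why parallelism covers latency but not bandwidth is Little's law. In the Python sketch below, the 20 GB/s per channel and 15 million independent reads are figures from the talk; the 100 ns round-trip latency and 32-byte access size are assumptions chosen only to make the arithmetic concrete.

```python
# Little's law: concurrency needed = bandwidth * latency / bytes_per_access.
# 20 GB/s per channel and 15 million reads are from the talk;
# the 100 ns latency and 32-byte access size are assumed.
bandwidth_per_channel = 20e9      # bytes/s
latency = 100e-9                  # seconds, assumed round-trip DRAM latency
bytes_per_access = 32             # assumed access granularity

outstanding_needed = bandwidth_per_channel * latency / bytes_per_access
parallel_reads = 15e6

print(f"accesses in flight to saturate one channel: {outstanding_needed:.0f}")   # ~63
print(f"available parallelism vs. needed concurrency: "
      f"{parallel_reads / outstanding_needed:.0f}x")                             # ~240,000x
# A few dozen outstanding accesses hide the latency; the 20 GB/s channel
# itself is the hard limit no matter how much parallelism is available.
```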
You can do things about memory bandwidth, but it finally boils down to energy, because accessing an LPDDR DRAM costs probably around eight to ten picojoules per bit, and that winds up getting expensive: you just multiply bits per second by picojoules per bit and that gives you your power dissipation in watts. It's sort of a fundamental thing that when I go off chip it takes more energy to fetch those bits. You can engineer around that to a degree: we use high-bandwidth memories in our high-end GPUs, the Volta V100 and the Pascal P100, and they have the characteristic of very low energy per bit, probably on the order of three or four picojoules per bit, but you still wind up being energy limited to get that bandwidth.

Latency boils down almost to a speed-of-light calculation; it's all about communication. Remember, actually reading the memory location takes almost no time; it's getting the request to the memory chip, to the proper bank of the memory chip, and getting the result back. And these on-chip wires are actually much slower than the speed of light, like an order of magnitude slower, because they're RC transmission lines rather than LC transmission lines. So one thing we could do to make it better, first of all, is to make all the distances smaller. To first approximation that's largely a cooling problem, because now we have all this energy being dissipated concentrated in a small space, and we need very effective liquid cooling to get the cooling density up. The other thing we could do is have good transmission lines: rather than running these signals around on RC transmission lines, we can fabricate LC transmission lines, either with very fat, high-conductivity metal layers on the chips themselves, or by running the signals down to an organic substrate and routing them there. If we can exploit locality, we can also imagine doing stacking, where we have the DRAM we're accessing right next to the processing element that's accessing it. All of these things could potentially reduce the latency by making things electrically closer to one another.
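To make the "bits per second times picojoules per bit" arithmetic concrete, here is a small Python sketch. The energy-per-bit values are the ones quoted above; the 900 GB/s figure is an assumed HBM-class bandwidth used only for illustration, not a number from the talk.

```python
# Power = (bits per second) x (energy per bit).
# Energy-per-bit values are quoted in the talk; the 900 GB/s bandwidth
# below is an assumed HBM-class figure used only for illustration.
def memory_power_watts(bandwidth_bytes_per_s, picojoules_per_bit):
    bits_per_s = bandwidth_bytes_per_s * 8
    return bits_per_s * picojoules_per_bit * 1e-12   # pJ -> J

bw = 900e9  # assumed ~900 GB/s of memory bandwidth
print(f"LPDDR-class at 9 pJ/bit:  {memory_power_watts(bw, 9):.0f} W")    # ~65 W
print(f"HBM-class at 3.5 pJ/bit:  {memory_power_watts(bw, 3.5):.0f} W")  # ~25 W
```

Even with the lower-energy HBM-style interface, tens of watts go to moving bits, which is why the bandwidth, rather than the latency, is the worry.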
Now, the other question you asked was about FPGAs. There's a really simple way of somewhat summing up an FPGA, which is that an FPGA is just a bad ASIC. The way to think about an FPGA is that they have some special-purpose units on them, some of them have floating-point units, some of them have DSPs, and those special-purpose units are very highly engineered; they're as good as the ones you'd build on an ASIC. But the rest of the FPGA, what makes it feel programmable, is the look-up tables, the LUTs. Most of the modern ones are six-LUTs: you feed six bits in, it looks up one of 64 locations and tells you what the output is, so you get an arbitrary six-input logic function. We have benchmarked, on many different applications, that comparing an ASIC in the same technology to the FPGA's LUTs is almost exactly a hundred x in both area and power: a hundred times more area for the LUT, a hundred times more energy per operation for the LUT. So for any application that people really care about, they will take whatever they might prototype on an FPGA and ultimately build the ASIC. That said, our Darwin gene sequence accelerator is right now available both on Microsoft Azure and on Amazon F1 instances, because while it's a 15,000X improvement over a CPU if you actually do the full custom accelerator, it's still 150X running on an FPGA, down by a factor of 100. So if you get enough out of your accelerator, you can tolerate that hundred-x overhead of the FPGA.

Thank you for the great talk. For domain-specific accelerators, processing in memory has been shown to have great potential to reduce the high cost of memory access. Could you share your comments on this emerging architecture?

Yeah, that's a great question, thank you. If we go back to my slide on "memory dominates": processing in memory is really the point I was trying to get at when I said memory dominates. When most people say processing in memory, what they really mean is processing next to memory. For example, in our GACT array we have an array of processing elements, actually 64 arrays of 64 elements each, and each processing element has a little SRAM array right next to it, and all of the high memory bandwidth, this one-per-cycle trace-back pointer, is getting stored into that SRAM array. The small amount of memory bandwidth, which is basically loading the next reference sequence and the next query sequence, takes place over standard memory channels and is almost in the noise. So I think what processing in memory is really about is co-locating processing elements with small memory arrays, and in fact historically all of the PIM chips that people built, the one that Peter Kogge built at IBM, whose name is escaping me right now, EXECUBE I think it was, our J-Machine at MIT, all of these were really about putting memory next to processing elements so that you would have very high-bandwidth access to local memories rather than having to make global memory references. All right, thank you. [Applause]
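A rough illustration of why co-locating processing elements with small SRAMs pays off, as a Python sketch. The 64x64 PE count and the one trace-back pointer stored per PE per cycle come from the talk, and the ~20 GB/s global channel figure from earlier in the Q&A; the 1 GHz clock and 4-bit pointer width are assumptions made only for the comparison.

```python
# Aggregate local-SRAM bandwidth of a PE array vs. one global DRAM channel.
# 64 arrays x 64 PEs, one trace-back pointer per PE per cycle, and the
# ~20 GB/s channel are from the talk; clock rate and pointer width are assumed.
arrays, pes_per_array = 64, 64
clock_hz = 1e9                 # assumed accelerator clock
pointer_bits = 4               # assumed trace-back pointer width

local_bw_bytes = arrays * pes_per_array * clock_hz * pointer_bits / 8
global_channel_bytes = 20e9    # one external memory channel

print(f"aggregate local SRAM bandwidth: {local_bw_bytes / 1e12:.1f} TB/s")       # ~2.0 TB/s
print(f"ratio to one global channel:    {local_bw_bytes / global_channel_bytes:.0f}x")  # ~100x
```

Under these assumed numbers, the co-located SRAMs supply on the order of a hundred times the bandwidth of a single external channel, which is the point of keeping the high-bandwidth traffic local and sending only the reference and query sequences over the standard memory channels.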