# tactiq.io free youtube transcript
# Brice Lecture 2019 - "The Future of Computing: Domain-Specific Accelerators" William Dally
# https://www.youtube.com/watch/fnd05AeeFN4
Our speaker is the chief scientist and senior vice president of research at Nvidia, and he is also a professor and a former chair of the computer science department at Stanford University. He has been at Stanford for a number of years, and before that he was at MIT. He has a stellar research record: he is known for his work on pioneering parallel computer systems, and many of his papers are classics; when you read them today you are impressed by both the architectural insight and the mathematical rigor. These days he works on developing computing systems for demanding applications, and we will hear about some of that research today. He is a member of the National Academy of Engineering and a fellow of the ACM and the American Academy of Arts and Sciences, and he has won pretty much all of the major awards a computer architect could ever dream about. He is also a distinguished educator and mentor who has graduated students who have become professors everywhere, including here; I just heard that the seminar speaker we had on Monday, who is now in Chicago, was actually his first student, so you can tell how many students he has trained. So today he will talk about the future of computing and domain-specific accelerators. Please welcome him.
[Applause]
Thank you, I'm honored to be here. My talk may be a little bit jet-lagged; I flew in from Tel Aviv via New York this morning. I'm sure many of you do that all the time, but I want to share with you one of the things I think makes it really exciting to be a computer scientist and a computer architect these days, which is this revitalization of computing hardware, because we've hit a point where what we've been doing for really the last 40 years doesn't work anymore. Domain-specific accelerators, I think, are what's going to revolutionize not just hardware architecture but all of computing, and I'll tell you why. This photo, by the way, I took on my commute to work: if this mouse works, I live about there in the corner, and I commute down to the Bay Area every week or so to meet with students and for meetings at Nvidia. Anyway,
back to the subject at hand. The thing that has really driven not just the computing industry but productivity and society in general has been faster computers. People sort of know this as the popular version of Moore's law, but that isn't actually what Moore wrote about in his 1965 paper; in that paper he talked about the scaling of the number of transistors it is economical to put on a chip. That has also come to an end, but the scaling of computing performance from conventional serial computers has come to an end as well. Historically, creating more value and driving innovation across many industries has depended on better computing performance: more value has taken faster computing, better algorithms, and more data. We still have better algorithms and more data, but we're not getting faster computing anymore. There was a great talk at the DARPA ERI summit last summer by an economist from CMU named Fuchs, who pointed out with pretty rigorous data that most of the productivity gains across many US industries are actually driven by computing, and that without the continued scaling of computing a lot of that productivity is threatened to stagnate rather than keep increasing. So I'll posit that we as computer scientists need to continue delivering better performance and better performance per watt. We historically got this as an almost free ride from Moore's law; now we have to be clever and think about how to do it ourselves. It used to be that we got it from process technology, but now Moore's law is dead, and as evidence of this I'll present a figure.
(By the way, there is still one award that I aspire to, or at least dream about.) Our latest Turing Award winners, John Hennessy and David Patterson, published this figure in the latest edition of their textbook: the scaling of computing performance over the period from 1980 to 2020. What you see is that back during the heyday of Moore's law, the big green region of the curve, computing performance doubled every 1.5 years, a fifty-two percent per year increase. It's now essentially flat, under three percent per year. So the old way of increasing computing performance, which was just to wait, take your old dusty decks, and run them on the new computer so they'd run faster, doesn't work anymore. How, then, are we going to continue to scale computing performance? In the Turing Award lecture they gave at ISCA last year, Hennessy and Patterson basically point out that domain-specific accelerators are the most promising way to continue to scale performance and performance per watt. I was very happy to hear that, because I've been making accelerators since 1985. This is a subset of the papers I've written on accelerators, and they run the gamut from simulation accelerators to things that do signal and image processing, deep learning, and more recently genomics. I'm going to draw examples from these various projects over the years to talk a little bit about the nature of domain-specific accelerators and how they make you rethink computation in general.
So, to ask the question: why do accelerators do better than general-purpose computers? Specialization is certainly one of the characteristics, but I don't even put it first, because when you look at the performance increase from these accelerators, probably the biggest common denominator is that they are massively parallel. If you buy a conventional CPU chip these days, it might have sixteen processors on it at the high end, or four at the low end. That's parallel, but it's not massively parallel; many of these accelerators get parallelism in the thousands, and that really is what accounts for a lot of their increase in performance. But if you got that factor of thousands and didn't change the power equation, you wouldn't be able to run within a reasonable power budget or do the computation in the form factor you'd like. What gets you the efficiency is typically special data types and operations. I'll give you an example in a few minutes of a dynamic programming problem from gene sequencing where the specialized unit runs 37 times faster than a general-purpose processor, a modest increase in performance, but it's 26,000 times more energy efficient, and that's really where specialization wins: making things more energy efficient. Part of that (I'll jump over the memory item for now) is because you get rid of the overhead. A modern high-performance CPU has very high management overhead: it actually spends the bulk of its energy, well over 99%, on administrative overhead, on fetching instructions, decoding them, deciding what order they're going to execute in, renaming registers, running things out of order, and unwinding speculation when a branch prediction is wrong. That's very expensive. Specialized accelerators get rid of that huge overhead and spend essentially all of their energy actually doing the core computation. Let me come back, then, to memory.
It turns out that almost every computation we do is memory-centric, in the sense that the bulk of the area and the bulk of the power is taken up by representing state and accessing that state, and so core to all of these accelerators is how we deal with memory. In fact, and this is where I want to change how we think about algorithms: when you took your basic algorithms course, they taught you things about counting operations and big-O notation, that it's more efficient to do a sort this way because it's order n log n rather than order n squared. It turns out that operations are almost free today; what's really expensive is accessing memory. So we typically wind up redesigning our algorithms for acceleration to optimize how memory is used, basically keeping the footprint of the highest-bandwidth memory accesses small so that we can serve those accesses from small local memory arrays rather than from a big DRAM array. If you're limited by DRAM bandwidth, you're not going to get anything out of an accelerator; it's going to be memory limited, and that's going to be the end of your performance. Many algorithms, when we start accelerating them, have that characteristic, they're memory limited, and we have to restructure the algorithm. I'll give you a couple of examples of how we've done that, drawing from the genomics accelerator, to make them optimizable on the memory side. That really gets to the bottom line here, which is that it's almost impossible to take an existing algorithm unchanged, unless you're extremely lucky, accelerate it, and get huge performance improvements. Usually it requires restructuring the algorithm for the constraints, and particularly the memory constraints, of accelerators. This is algorithm-hardware co-design, and what I'll suggest, sort of a hope for the future, is that as people develop algorithms in the future they will target them for accelerators from day one, and as a result they'll do these optimizations to begin with.
Now, accelerators aren't new; specialized hardware is everywhere, but it's largely invisible. Most of us have something that looks a little bit like this in our pockets, and if you look at the processor, this is an Apple iPhone, I think it has an A10 or A11 processor chip in it. Yes, that processor chip has a few ARM cores on it, and those ARM cores run the complex but not demanding part of the computation, mostly the user interface and things like that. But if you look at where most of the operations performed on that iPhone go, they're in specialized accelerators. It has accelerators for the radio modems, for still and moving image codecs, for the front end of camera image processing (the demosaicing, white balance, and color balance); it has a deep neural network accelerator, and it has graphics accelerators. My company Nvidia has been building graphics chips for 25 years. Graphics chips these days are extremely programmable, but again, the programmability provides flexibility; the really heavy lifting is done by accelerators. The rasterization that turns triangles into pixels, texture filtering, compositing, a lot of these core operations are done by accelerators. A lot of the accelerators are even hidden: it turns out we have compression and decompression accelerators on our memory channels. This is actually not to make things take less space in memory but to make them take less memory bandwidth so we can access them more quickly. It's a compression that doesn't reduce the memory allocation, but when the compression is successful we can fetch textures, and in fact fetch things from surfaces, at many times the bandwidth we would get if we didn't do that compression. Most recently, with our Turing generation of GPUs, we've launched acceleration for ray tracing, for making photorealistic images. We can do an order of magnitude better than even the very efficient computation we had on our regular GPUs by accelerating a tree traversal of the bounding volume hierarchy: when you cast a ray into space and want to find which polygon it intersects first, you basically walk a tree that divides space up until you get down to a piece of space that has one triangle in it, and then see whether the ray hits it or not. In fact, we also have a special-purpose accelerator to do that intersection test, so we can do the ray-triangle intersection very quickly, and that's what makes the RTX feature of the Turing GPUs possible; it wouldn't be possible without that acceleration.
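To make that traversal concrete, here is a minimal Python sketch of walking a bounding volume hierarchy: skip any box the ray misses, and at a leaf run an exact ray-triangle test. The node layout, helper names, and toy example are my own illustrative assumptions; this is not the Turing RT-core implementation, which does the same walk in fixed-function hardware.

```python
def sub(a, b):   return (a[0]-b[0], a[1]-b[1], a[2]-b[2])
def dot(a, b):   return a[0]*b[0] + a[1]*b[1] + a[2]*b[2]
def cross(a, b): return (a[1]*b[2]-a[2]*b[1], a[2]*b[0]-a[0]*b[2], a[0]*b[1]-a[1]*b[0])

def ray_hits_box(o, d, bmin, bmax, eps=1e-12):
    """Slab test: does the ray enter the axis-aligned box at all?"""
    tmin, tmax = 0.0, float("inf")
    for i in range(3):
        if abs(d[i]) < eps:
            if o[i] < bmin[i] or o[i] > bmax[i]:
                return False
        else:
            t1, t2 = (bmin[i] - o[i]) / d[i], (bmax[i] - o[i]) / d[i]
            tmin, tmax = max(tmin, min(t1, t2)), min(tmax, max(t1, t2))
            if tmin > tmax:
                return False
    return True

def ray_hits_triangle(o, d, tri, eps=1e-9):
    """Moller-Trumbore ray/triangle test; returns hit distance t or None."""
    v0, v1, v2 = tri
    e1, e2 = sub(v1, v0), sub(v2, v0)
    p = cross(d, e2)
    det = dot(e1, p)
    if abs(det) < eps:
        return None
    inv = 1.0 / det
    s = sub(o, v0)
    u = dot(s, p) * inv
    if u < 0.0 or u > 1.0:
        return None
    q = cross(s, e1)
    v = dot(d, q) * inv
    if v < 0.0 or u + v > 1.0:
        return None
    t = dot(e2, q) * inv
    return t if t > eps else None

class BVHNode:
    """Interior node (left/right set) or a leaf holding one triangle."""
    def __init__(self, bmin, bmax, left=None, right=None, triangle=None):
        self.bmin, self.bmax = bmin, bmax
        self.left, self.right, self.triangle = left, right, triangle

def closest_hit(o, d, root):
    """Walk the BVH: prune any subtree whose box the ray misses; at a leaf,
    run the exact ray-triangle test and keep the nearest hit."""
    best, stack = None, [root]
    while stack:
        n = stack.pop()
        if not ray_hits_box(o, d, n.bmin, n.bmax):
            continue
        if n.triangle is not None:
            t = ray_hits_triangle(o, d, n.triangle)
            if t is not None and (best is None or t < best[0]):
                best = (t, n.triangle)
        else:
            stack += [c for c in (n.left, n.right) if c]
    return best

# Tiny usage example: one leaf triangle in the plane z = 5.
tri = ((0, 0, 5), (1, 0, 5), (0, 1, 5))
leaf = BVHNode((0, 0, 5), (1, 1, 5), triangle=tri)
print(closest_hit((0.2, 0.2, 0), (0, 0, 1), leaf))   # hits at t = 5
```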
So let me start by walking through that list of things accelerators have, beginning with specialized operations. When you think about specialized operations, you need to know what to compare against: what most computers provide are integer and floating-point operations. If you think about it, floating-point operations are really specialized operations for scientific computing; they're a data representation that has been specialized over the years for scientific computing. But when you want to do a computation that doesn't fit well into integer or floating-point operations, emulating it on conventional processors is expensive. So let me start with an example from bioinformatics. It's actually a really exciting area these days, because the performance of the sequencing machines, the machines that will take a bit of your saliva and from it produce a gene sequence, has increased faster than Moore's law and is still on that trajectory; it's an area where exponential scaling is still holding, and it's at the point now where the limiting cost is actually doing the assembly. These machines produce what are called reads, contiguous sequences of bases, ranging from a few hundred bases for the second-generation technology made by companies like Illumina up to 10,000 or 20,000 bases for the third-generation technology from companies like PacBio, which was actually just acquired by Illumina, or Oxford Nanopore.
Then there is the process of taking a bunch of these reads and figuring out the genome; it's like a big jigsaw puzzle, where the finished puzzle is your genome and you have all of these pieces sitting around, except they all kind of look the same: they don't have funny edges the way real puzzle pieces do, and you have to assemble them all together. There are two ways of doing this assembly. One is to say that most people are like person X, so let's start with person X's genome and line these pieces up against it; that's called reference-based assembly. It turns out to be biased: it won't catch certain variants that person X doesn't have, but will instead tend to reject them, because there's no place to fit those puzzle pieces. If you really want the best diagnostic, you do what's called a de novo assembly, where you don't use any reference information: you simply take the puzzle pieces, see where they overlap, and come up with a maximum likelihood assembly. What makes this more difficult than it may sound is that these reads are very noisy. For the long-read technology in particular, some of the technologies have about 60% accuracy, meaning that for any one base in the read you only have a 60% chance of it being right and a 40% chance of it being wrong, which makes the assembly even more difficult.
If you look at the core operation for doing this, it's a very fundamental computer science algorithm: dynamic programming. I have the reference sequence along one edge and the query sequence along the other (for de novo assembly, the reference sequence effectively becomes all of the query sequences concatenated together, because you're just trying to see where they overlap). You're trying to find where two sequences line up, where they match. I start up in the corner, and if the characters match, like the two G's you see there, I get a score for matching; if there's a mismatch, I get a penalty for mismatching. I also have the possibility that there's an insertion or a deletion, in which case I move horizontally or vertically. So for every square in the dynamic programming matrix there are three possible ways of arriving there, depending on whether the highest score came from a match, an insertion, or a deletion, and the computation is expressed by the three recurrence equations on the right. For affine gap penalties you keep separate scores for insertion and deletion, so you can charge more for starting an insertion than for continuing it. Then H(i,j) is the maximum of: zero, if you're at one of the edges; the insertion score, if I end an insertion at this point; the deletion score, if I end a deletion; or the previous score to the upper left plus the matching score, where I get a positive score if I match and a negative score if I mismatch.
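As a concrete rendering of those three recurrences, here is a minimal Python sketch of affine-gap alignment scoring in the Smith-Waterman style, filling the H, I (insertion), and D (deletion) matrices and recording a traceback code per cell. The specific score values are illustrative assumptions, not the ones used in the Darwin accelerator.

```python
def affine_gap_align(ref, query, match=2, mismatch=-2, gap_open=-3, gap_extend=-1):
    """Fill the affine-gap dynamic-programming matrices.
    H[i][j]: best score ending at (i, j); I/D: best score ending in an
    insertion/deletion; tb[i][j] records which case won (the traceback
    pointer whose storage dominates the accelerator's energy)."""
    m, n = len(ref), len(query)
    NEG = float("-inf")
    H = [[0.0] * (n + 1) for _ in range(m + 1)]
    I = [[NEG] * (n + 1) for _ in range(m + 1)]
    D = [[NEG] * (n + 1) for _ in range(m + 1)]
    tb = [[0] * (n + 1) for _ in range(m + 1)]   # 0=stop, 1=match, 2=ins, 3=del
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            I[i][j] = max(H[i][j-1] + gap_open, I[i][j-1] + gap_extend)
            D[i][j] = max(H[i-1][j] + gap_open, D[i-1][j] + gap_extend)
            diag = H[i-1][j-1] + (match if ref[i-1] == query[j-1] else mismatch)
            H[i][j], tb[i][j] = max((0.0, 0), (diag, 1), (I[i][j], 2), (D[i][j], 3))
    return H, tb

# Example: score two short reads against each other.
H, tb = affine_gap_align("GATTACA", "GCATTAC")
print(max(max(row) for row in H))
```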
Doing this on an Intel CPU, and I'm even going to handicap the comparison by letting Intel use 14-nanometer technology while our accelerator uses 40-nanometer technology, takes 37 cycles and 81 nanojoules. The little accelerator we built does this computation in one cycle, so it's 37 times faster, and it takes 3.1 picojoules, which is 26,000 times more efficient. Now, if you peel apart the computation, it turns out the logic computing those three recurrence equations takes 300 femtojoules, a tiny fraction of the energy; the bulk of the energy, 90% of it, goes to storing the traceback pointer, the pointer recording which of the three conditions gave the maximum, so you can later reconstruct the matching of the two sequences. This is an example of specialization giving modest improvements in performance and huge improvements in efficiency. That efficiency is really what accelerator design is about, and to design an accelerator well you need a good model of cost. A very simple model of cost, one that's actually amazingly accurate, is on this slide: arithmetic is free, particularly at low precision; it's so inexpensive it almost doesn't count, and in fact a good first-order way of estimating the area and energy consumption of an accelerator is just to look at the memory, and you'll usually come up with a number that's in the ballpark. Memory is expensive: accessing even a small memory costs far more than doing an arithmetic operation. And communication is prohibitively expensive; in fact, a lot of what we think of as memory cost today is really communication cost. Basic memory arrays are small, 8 kilobytes or so, and if I build, say, a large on-chip SRAM, I build it out of little 8-kilobyte arrays; the cost of accessing the big memory is almost entirely the communication cost of getting the address to the selected sub-bank and getting the data back, while the actual cost of accessing the 8-kilobyte array stays roughly the same. I'll give some actual numbers on that later.
Here's a cheat sheet I tend to use when I want to compare how expensive arithmetic operations are, and it tells us a bunch of things about precision. For example, in something like a neural network accelerator we're dominated by multiply-adds, and the multiplies are more expensive than the adds. If you look at the costs here, you realize that the cost of a multiply increases quadratically with the number of bits. That should make sense: when you learn in grade school to multiply numbers, you compute a bunch of partial products, a column for each digit and a row for each digit, which is n-squared partial products, and then you sum them all up. The adds increase linearly, but they tend to be small enough that when you're doing multiply-adds, the multiplies dominate. So there's a big push to reduce precision, because you win quadratically in the arithmetic energy. But if you look a little further down the table, you realize that even a fairly high-precision operation, say a 32-bit multiply, is less expensive than reading those 32 bits from even a very small SRAM, and the lower-precision operations are far less expensive still. This feeds that first concept: arithmetic is free, memory is expensive. And a DRAM read, which is actually mostly communication, going off chip, is orders of magnitude more expensive. So when we get to the co-design, we'll see that we really need to restructure our algorithms so that we do as few memory operations as possible, and so that those memory operations are done out of small arrays.
In fact, a good rule of thumb is that every time I go up a level of the memory hierarchy, the cost increases by an order of magnitude. Accessing a small local SRAM array costs about five picojoules per 32-bit word. If I build a global on-chip SRAM, something on the order of a few megabytes, that's 50 picojoules per word, and remember that five picojoules of that is really the memory, accessing the base SRAM array; the other 45 picojoules is communication. If I go off chip, even to LPDDR3 DRAM, which is one of the most energy-efficient DRAM families, it's another order of magnitude more expensive than that. Moreover, as we look forward and scale technology, say from 40 nanometers to 10 nanometers, the arithmetic (this is the energy of a double-precision floating-point multiply-accumulate) scales roughly linearly, so it becomes about four times as efficient, but it's only about 30% more efficient to send a 32-bit word over 10 millimeters of wire in that same technology. The communication energy is scaling at a much slower rate than the arithmetic energy, so whereas arithmetic is free today, it's really free tomorrow: its energy is scaling down faster than the competing energy of communication.
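A sketch of how such a first-order energy estimate might look in code, using the rough per-access numbers quoted above (about 5 pJ for a local SRAM word, about 50 pJ for a global on-chip SRAM word, and roughly another order of magnitude for off-chip DRAM). The numbers and the helper function are illustrative back-of-the-envelope assumptions, not a calibrated model.

```python
# Rough energy per 32-bit access, in picojoules, following the talk's
# rule of thumb that each memory level costs about 10x the one below it.
ENERGY_PJ = {
    "local_sram": 5.0,    # small array right next to the datapath
    "global_sram": 50.0,  # few-MB on-chip SRAM; most of this is wire energy
    "dram": 500.0,        # off chip (e.g. LPDDR), an order of magnitude above that
}

def estimate_energy(access_counts, op_count=0, op_energy_pj=1.0):
    """First-order accelerator energy estimate: count memory accesses at each
    level, optionally add arithmetic, and note that the memory terms dominate."""
    memory = sum(ENERGY_PJ[level] * n for level, n in access_counts.items())
    return memory + op_count * op_energy_pj

# Example: 1M local reads, 100K global reads, 10K DRAM words, 1M multiply-adds
# (the ~1 pJ per multiply-add is itself just an illustrative placeholder).
print(estimate_energy(
    {"local_sram": 1_000_000, "global_sram": 100_000, "dram": 10_000},
    op_count=1_000_000))
```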
So let's talk a little bit about co-design, and I'll start with an example from our genomics accelerator. When we first started looking at this, we decided we were doing long-read assembly, because the long-read technology really has the ability to do things like detect structural variants that have diagnostic value you don't get from single-nucleotide polymorphisms, things you'll simply miss with short reads because the short reads are too small to see the whole variant. At the time, the best piece of software for long-read assembly was called GraphMap, and it turns out that because alignment, which uses dynamic programming, is really expensive on conventional processors, GraphMap spends almost all of its computation time doing what's called filtration. This is where you take seeds, maybe 11-base-pair sequences, from the query sequence, index the reference sequence, and find where those seeds appear; by doing this you can winnow down the number of cases you have to test, so that almost every one you test works out. In fact GraphMap only gets two false positives for every true positive. But what you see here is that it spends almost all of its time doing filtration, which is the blue, and very little time doing alignment (there actually is a little bit, if you look carefully at the end). When we looked at this, what we realized is that we can do alignment blindingly fast: alignment is dynamic programming, and we could build a dynamic programming engine that is 26,000 times more efficient, and because we're energy limited, that means we could ultimately make it run 26,000 times faster at the same energy. So we're willing to trade a few more false positives to spend less time doing filtering and more time doing alignment.
Now, why don't we just make the filtering faster? Because filtering is fundamentally memory limited. You take the reference sequence, which for reference-based alignment is a three-billion-base-pair sequence, person X's genome, and you index it: you compute a big, many-tens-of-gigabytes table in memory containing the locations of every 11-mer (there are 2^22 possible 11-mers, so it's a huge table), and you're essentially making random accesses into it. So you are going to be limited by DRAM bandwidth on filtration; there's no fundamental way of accelerating that algorithm, you just have to do random accesses into a large table. With our Darwin accelerator, we decided to do a faster but less precise filtration, an algorithm we call D-SOFT, which I'll talk more about later. It generates an enormous number of false positives, almost 1,700 for every true positive, but that's okay: we filter them out really fast with alignment, and we do the alignment with a variation on straight dynamic programming that we call GACT, which I'll explain in a minute. If we just did this in software, it would make everything run two times slower, but what it does is trade filtration, which is fundamentally limited by memory bandwidth, for alignment, which we can accelerate very easily because it has a very small memory footprint. The next thing we did was build a hardware engine for alignment with a degree of parallelism of 4,000; the combination of that 4,000-way parallelism and the 37x speed-up from specialization more than makes up for the software slowdown and essentially makes that red bar go to zero. We basically traded something expensive for something cheap, exploited the inexpensive nature of the cheap thing, and made it go blindingly fast in hardware.
We could actually have stopped there, and a 380-times speed-up would have been acceptable, but there were a couple of other optimizations. One is that when you're doing memory accesses, you want to keep all the memory channels busy all the time. It turns out that, especially for random accesses into big tables, conventional processors don't do this: they have memory systems optimized for latency, not throughput, and they typically lock up after a relatively small number of outstanding misses, like eight. Instead, we optimized the memory system to keep four DRAM channels busy simultaneously, which compared to the CPU implementation is, I think, about a 4x improvement in performance. We then did two other optimizations. One is that when we looked at our memory channels, we realized that once you find the right place in the seed tables, the rest is actually linear accesses as you read every location where that seed could be, but these were getting interrupted by incrementing the bins: when you find where a seed hits, you count how many seeds hit that location, because each hit is more evidence for that location and increases the likelihood of a match. So we factored those bin tables out, as shown on the right here, into dedicated bin-count SRAMs, since it's a small enough data structure to put on chip, and that gave a speed-up greater than the amount of memory traffic we removed, because in addition to making that bin traffic take essentially zero time, it removed the interference with the sequential accesses and let them run at full speed. At this point the red bar is big enough that it still shows, so we pipelined the filtering with the alignment and got a final 1.4x speed-up; the total is fifteen thousand times on a reference-based assembly. So the co-design really mattered: you had to make some fundamental changes to the algorithm, and it requires working with people who are experts in the domain, because for the biologists who use our tools to trust this, they had to know that the changes we made didn't cause us to give them the wrong answer, especially for something that is a very statistical process. So we tested this very rigorously, making sure that we had equal or better sensitivity at each step of the process, and in many cases actually doing even more work than we needed to in order to get better sensitivity than the baseline algorithm we were competing with.
So let's talk a little bit about memory. Memory dominates in several ways. The first is that it dominates power and area. Here are the area and power for each part of the Darwin accelerator, and what you see is that for the dynamic programming part, the GACT part, memory is almost 80% of the area and over three-quarters of the power; for the filtering it's even more than that, 98% of the area and 96% of the power, and it would be higher still if you threw the DRAM in as well. If you're designing an accelerator, this actually makes it really easy to do first-order estimates. Very often you need these when doing design exploration, trying to decide whether to take approach A or approach B: rather than having the graduate student go implement all of the RTL, synthesize it, place and route it, and get very accurate measurements, you can get numbers that are probably 80 to 90% accurate just by figuring out how big the memory arrays are and how many accesses you make to each of them, counting those up, choosing among your alternative embodiments that way, and then fine-tuning at the end with the detailed design. Another way memory dominates is that it drives which algorithms you can use: there are a lot of great algorithms that you can't make work in an accelerator because they become memory limited.
So let's talk about dynamic programming. If we're doing these long reads, a typical assembly has 30-times coverage with 10,000-base-pair reads, which winds up being 15 million reads, and if we do these all with straight dynamic programming, the array we have to fill in has 10 million entries, which winds up being too much to put in on-chip storage. People have tried to come up with smaller-memory-footprint approaches in the past by doing what's called banded Smith-Waterman, where you compute only a band around the diagonal of the array. The problem is that it doesn't work, and the reason is, as you'll notice for the one sequence I show here, which is a real sequence, it doesn't end in the bottom-right corner: the probability of an insert is not the same as the probability of a delete, so over time there's a bias, and the actual match of the two sequences will wander, often very far off the diagonal. So you would have to use an enormous band if you want to reduce the probability that the alignment wanders outside of it.
So what we did instead was come up with a tiling approach, and that is GACT; the T stands for tiled. When we first did this we tried to tile rigidly, forcing the alignment to go through the bottom-right corner of every block, and that of course didn't work; we did not get optimal alignments that way. The thing that made GACT work was realizing that if we overlap the blocks, we can do an alignment on one tile, say the upper left, find its maximum-scoring exit point, and then overlap the next tile with it. The typical sizes are maybe a 500-by-500 tile overlapped by 100 to 200, and then we restart the alignment not from where we left off but from 200 back. That always finds the correct alignment; we have not found a single assembly where our alignment does not match the alignment produced by the full Smith-Waterman. But instead of having a memory footprint of 10 million matrix cells to fill in, the footprint is now the size of a tile, 500 by 500, which is easy enough to put in a very small RAM, and in fact you can have many very small RAMs.
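A minimal sketch of that tiling idea as I understand it from the description: align one tile, find its maximum-scoring exit point, then start the next tile overlapped back from that point. The inner scorer here is a plain (non-affine) Smith-Waterman to keep the sketch short, and the tile/overlap handling is an illustrative assumption, not the Darwin hardware's exact stitching.

```python
def sw_scores(ref, query, match=2, mismatch=-2, gap=-3):
    """Plain Smith-Waterman scoring matrix (no affine gaps, to keep this short)."""
    H = [[0] * (len(query) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        for j in range(1, len(query) + 1):
            diag = H[i-1][j-1] + (match if ref[i-1] == query[j-1] else mismatch)
            H[i][j] = max(0, diag, H[i-1][j] + gap, H[i][j-1] + gap)
    return H

def gact_tiled_align(ref, query, tile=500, overlap=200):
    """GACT-style tiling: align one tile at a time, find its best-scoring exit
    point, then start the next tile 'overlap' cells back from that point, so the
    DP footprint is one tile (tile x tile) instead of len(ref) x len(query)."""
    ri = qi = 0
    tiles = []                               # (ref_start, query_start, exit_i, exit_j)
    while ri < len(ref) and qi < len(query):
        H = sw_scores(ref[ri:ri + tile], query[qi:qi + tile])
        best_i, best_j, best = 0, 0, -1
        for i, row in enumerate(H):          # maximum-scoring exit point of the tile
            for j, s in enumerate(row):
                if s > best:
                    best_i, best_j, best = i, j, s
        tiles.append((ri, qi, best_i, best_j))
        if best_i == 0 or best_j == 0:
            break                            # no forward progress: stop
        ri += max(best_i - overlap, 1)       # restart 'overlap' cells back
        qi += max(best_j - overlap, 1)
    return tiles

# Toy usage with small tiles so several overlapping tiles are produced.
print(gact_tiled_align("GATTACAGATTACAGATTACA", "GATTACAGATTACAGATTACA",
                       tile=8, overlap=3))
```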
Once we've reduced the memory footprint by that amount, and the logic here is very inexpensive, we can have a lot of these, so we actually have 4,000 of these processing elements computing the dynamic programming: 64 arrays of 64 elements per array. We're doing these assemblies with 15 million reads, so there are 15 million alignments to do; there's plenty of parallelism at that outer loop, so we start 64 of them at a time, and then we have 64 processing elements that walk a diagonal down the dynamic programming matrix, and each processing element has its own little private SRAM where the traceback gets stored every cycle. This is an example of a systolic array algorithm, and the great thing about systolic array algorithms is that they simplify the two hard things about parallelism: communication and synchronization. What we're communicating each cycle here are the I, D, and H values from those recurrence equations, and those are nearest-neighbor communications out of registers; no memory accesses are required. The synchronization is lockstep: because we're walking down the diagonal, the values we need from the cells above, to the left, and to the upper left are sitting in registers from the last cycle, and there's no special synchronization needed, as there might be if two things were running on their own and you had to signal not only that the data is there but when it's there. So the operation is extremely efficient; that's why it winds up being about a hundred and fifty thousand times faster than a CPU for this part of the computation.
Then the final way memory is important is that it really drives cost, and it drives cost in an unusual way. For example, when I took those bin-count memories in the Darwin implementation and put them in SRAM rather than in DRAM, I was replacing a storage technology with one that's probably about a hundred times more expensive per bit. On-chip SRAM is kind of like real estate in a city like Houston: if you're located in a really good place, maybe near the university, that real estate is probably a lot more expensive per acre than something 50 miles out of town, and it's the same thing for an SRAM right next to the computation; that's very expensive real estate, at least 100 times more expensive than DRAM. But even being a hundred times more expensive per bit, it can be cheaper. To explain why, let me show you the D-SOFT algorithm and then do a cost computation.
In D-SOFT, we've got these seeds coming in, each telling us which locations this piece of the query could hit, and we're trying to compute which bins accumulate enough hits, enough evidence that the two sequences might be similar, that we should then do the alignment between them. This is really a toy example: typically you would use 11-mers, but here I'm going to use 2-mers. I start out with the query sequence on the left and the reference sequence running along the bottom, and I ask how many places GT occurs. I go to the pointer table, look up GT, which causes me to make a lookup in the position table (both of these tables are in DRAM), and it says the locations are twelve and thirty-one. So I increment the bins for twelve and thirty-one, shown on the right here in the green bin and the tan bin, by two, because two base pairs match. Then I take my next seed, which is GC; it's not overlapping, so I get full counts. It matches four places, as shown by the position table, and those four places get incremented by two, so now I have three twos and a four. The next seed is overlapping; it's CT, so where it hits, if I also hit with GC I increment by one, otherwise I increment by two. I continue doing this until I get to the end of the query sequence, and then I check the bins against a threshold; in this case, with the threshold being six, two of these bins pass and the rest don't, and those two go off to the alignment stage.
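A minimal Python sketch of that counting step: look up each seed of the query in an index of the reference, accumulate evidence per bin, and keep the bins that cross a threshold as alignment candidates. This is a simplification (each seed hit adds 1, and bins are diagonal bands), whereas Darwin credits distinct matching bases; the table and parameter names are mine.

```python
from collections import defaultdict
import random

def build_seed_index(reference, k=11):
    """Index of every k-mer in the reference -> list of positions.
    (In Darwin this is the large table that lives in DRAM.)"""
    index = defaultdict(list)
    for p in range(len(reference) - k + 1):
        index[reference[p:p + k]].append(p)
    return index

def dsoft_candidates(query, index, k=11, bin_size=64, threshold=8):
    """Simplified D-SOFT-style filtration: accumulate evidence per diagonal
    band and keep the bands whose count crosses the threshold."""
    bins = defaultdict(int)
    for q in range(len(query) - k + 1):
        for p in index.get(query[q:q + k], ()):
            bins[(p - q) // bin_size] += 1     # band of the diagonal p - q
    return [band for band, count in bins.items() if count >= threshold]

# Toy usage: a read drawn from the middle of a small random "reference".
random.seed(0)
ref = "".join(random.choice("ACGT") for _ in range(2000))
read = ref[600:700]
idx = build_seed_index(ref, k=11)
print(dsoft_candidates(read, idx, k=11, bin_size=64, threshold=8))
```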
The way we do this in hardware is we build a structure with separate bin-count RAMs, and we actually partition them into 16 banks, even though, because we have four DRAM channels, we can get at most 4 queries into this at a time. For the places we need to increment, a little on-chip network routes those increments to the appropriate bin-count array, and the bin counts get incremented. The reason we have a 16x over-provisioning here is to make the probability of two increments hitting the same array small enough that a small FIFO absorbs the variation, so we never fall behind; we want to keep those DRAMs busy all the time. Then there are two other SRAMs here. One is the nonzero-bins SRAM: every time we do an alignment we need to start this process over again, and we need to find all of the bins we may have incremented; there are a lot of them, and we don't want to scan through the whole array setting them all back to zero. So the first time we increment a bin, we tell the nonzero-bins SRAM to push that bin index onto a stack, and when we reinitialize, we just pop that stack and zero only the entries that need to be zeroed, and those are done bank by bank. And the bins that exceed the threshold, as soon as they exceed it, go out to the arbiter and get fed off to the alignment engine, so the alignment gets pipelined with this.
So let's do a cost computation. Suppose the cost multiplier for on-chip SRAM is 100. For the bin-count SRAMs I have 64 megabytes, and at a cost of 100 that's 6,400 mega-units, or 6.4 giga-units of cost. The DRAM has a cost of 1, and I have 128 gigabytes of it, so the cost of my total memory here is 134.4 giga-units: 128 giga-units for the DRAM and 6.4 for the SRAM. You would think that would be more than a DRAM-only system, but there's a time component to cost. Suppose I have an unending stream of sequences to filter: I can filter 15.6 times faster on this array, so to match its performance I would need 15.6 copies of the DRAM-only system. So the DRAM-only system, measured in memory units times time, is 15.6 times as expensive; even though it has a lower memory cost, its total cost is about fifteen times as much, roughly two tera-units of cost as opposed to 134.4 giga-units.
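The arithmetic of that comparison, under the cost multipliers and speedup stated above, comes out roughly like this (a back-of-the-envelope sketch, not a real cost model):

```python
# Relative cost: DRAM = 1 unit per byte, on-chip SRAM ~ 100 units per byte
# (the multiplier assumed in the talk).
sram_units = 100 * 64e6        # 64 MB of bin-count SRAM  -> 6.4e9 units
dram_units = 1 * 128e9         # 128 GB of DRAM           -> 128e9 units
hybrid = sram_units + dram_units

# Cost has a time component: the DRAM-only system filters 15.6x slower,
# so matching throughput takes 15.6 copies of it.
dram_only_equal_throughput = 15.6 * dram_units

print(f"SRAM+DRAM design: {hybrid / 1e9:.1f} G units")                        # ~134.4 G
print(f"DRAM-only, equal throughput: {dram_only_equal_throughput / 1e12:.2f} T units")  # ~2.0 T
```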
Now, when you look at trying to make the most of memory, it drives you to things like sparse structures and compression, and this is where domain-specific architectures play a big role. It turns out that if you grab a standard sparse linear algebra package, you'll find that you have to be really sparse for that package to run faster than the dense one, less than one percent dense, sometimes less than a tenth of a percent dense, depending on the implementation. My former student Song Han and I were looking at neural networks, and we discovered that they were between 10% and 30% dense after you prune away the unneeded weights of the matrices. The conventional wisdom was that that was too dense to use a sparse package, and that would be true if you had to implement the sparse package on conventional hardware, but domain-specific hardware can make the overhead of handling that sparsity essentially go away. So we built a hardware accelerator that walks the tables of the compressed sparse column format, with those tables kept in separate small memories, and that allowed us to get improved performance at densities up to 50 and 60 percent. There is still some overhead in walking those pointers, but it comes off the critical path.
We also realized that if you really want to compress things down, you're wasting bits if you sample values uniformly. Suppose I have the probability distribution shown here, which is actually the real distribution of weights in a neural network, and I have four bits to represent these values. If I simply use a binary encoding, my symbols, the X's shown here, are equally spaced, and what you see is that I'm wasting a lot of X's out where nothing interesting is happening and sampling relatively sparsely under the lobe where lots of interesting stuff is happening. Instead you can train a codebook: it turns out that with neural networks, anything you can take a derivative of you can train with stochastic gradient descent, so we can in fact train the codebook to find the optimal set of symbols to put in it. If we train a codebook with four bits we get the red dots here: they aren't wasting any dots where nothing interesting is happening, and the dots are instead being spent where the interesting stuff happens under these curves.
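A minimal sketch of the codebook idea: quantize each weight to the nearest of 2^4 = 16 learned centroids, store only the 4-bit index, and move the centroids toward the weights they represent so symbols concentrate where the distribution has mass. This toy uses a k-means-style update standing in for the gradient-trained codebook described above; it is not the exact Deep Compression procedure.

```python
import numpy as np

def train_codebook(weights, bits=4, iters=50):
    """Learn a 2^bits-entry codebook: assign each weight to its nearest centroid
    (that assignment is the stored 4-bit index), then move each centroid toward
    the weights it represents, instead of spacing symbols uniformly."""
    k = 2 ** bits
    codebook = np.linspace(weights.min(), weights.max(), k)  # uniform init ("X" encoding)
    for _ in range(iters):
        idx = np.argmin(np.abs(weights[:, None] - codebook[None, :]), axis=1)
        for c in range(k):
            members = weights[idx == c]
            if members.size:
                codebook[c] = members.mean()   # centroid update (k-means step)
    return codebook, idx

# Toy weight distribution with two lobes, loosely like a pruned layer's weights.
rng = np.random.default_rng(0)
w = np.concatenate([rng.normal(-0.15, 0.03, 5000), rng.normal(0.12, 0.05, 5000)])
codebook, idx = train_codebook(w)
decoded = codebook[idx]                        # what a decoder stage would output
print("codebook:", np.round(codebook, 3))
print("mean quantization error:", np.abs(w - decoded).mean())
```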
Now, this would be prohibitively expensive to do on conventional hardware, because you would have to do the decoding on every multiply-add: you get the weight and have to look up in the codebook what the actual value is, and that would slow things down quite a bit. But in a domain-specific architecture both of these things, the sparsity and the codebook, are almost free. In the efficient inference engine (EIE) that we built, the codebook is a separate pipeline stage, the weight decoder; it's a small SRAM, and because it isn't taking cycles away from the main compute engine, and doesn't carry the overhead of nearly a nanojoule per instruction fetch swamping everything out, it essentially adds nothing to the cost of doing this computation.
The way we handle the sparse matrices is: we get a column index and then read from a pair of SRAMs, one that tells us where in the compressed sparse column structure our column starts and one that tells us where the next column starts. We read those column start and end addresses from a RAM; the RAM is dual-ported so we can do this every cycle, and that indexes our sparse matrix SRAM to read the actual weight values out, and since we know where the column ends, we know how far to read before stopping.
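For reference, here is a small Python sketch of that lookup pattern: per column, read the start pointer and the next column's start pointer, then walk the values and row indices between them, which is a standard compressed-sparse-column (CSC) matrix-vector product. The data layout is generic CSC, not the exact EIE encoding (which also stores 4-bit codebook indices and relative row offsets).

```python
def csc_matvec(col_ptr, row_idx, values, x):
    """y = A @ x with A stored in compressed-sparse-column form.
    col_ptr[j] and col_ptr[j+1] bound column j's entries, exactly the pair of
    'column start' / 'next column start' pointers read from the two SRAMs."""
    n_rows = max(row_idx, default=-1) + 1
    y = [0.0] * n_rows
    for j, xj in enumerate(x):
        if xj == 0.0:
            continue                    # skip work for zero activations
        for k in range(col_ptr[j], col_ptr[j + 1]):
            y[row_idx[k]] += values[k] * xj
    return y

# Toy 3x3 matrix [[1,0,2],[0,0,3],[4,0,0]] in CSC form.
col_ptr = [0, 2, 2, 4]
row_idx = [0, 2, 0, 1]
values  = [1.0, 4.0, 2.0, 3.0]
print(csc_matvec(col_ptr, row_idx, values, [1.0, 5.0, 2.0]))   # [5.0, 6.0, 4.0]
```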
Now, to drive home the point that memory dominates: the overall architecture of the EIE is a 2D array of processing elements, and this is one of them. The green areas here are the sparse matrix RAM, which stores the actual weights of the sparse matrix, and the pointer-even and pointer-odd RAMs, the two memories that store the starting pointers of the even and odd columns. They account for something like 90% of the area; the arithmetic and all the logic in the middle is again less than 10%.
Another thing we've observed over time in building these accelerators is that sometimes very clever algorithms wind up being slower. The example I'll use here is SAT, the Boolean satisfiability problem. It's actually a really important problem: it's an NP-complete problem that people solve all the time because they have to; it's the core of many hardware and software verification algorithms, and it's also something that, for example, if you want to do logical inference, lets you take a number of logical clauses, pose them as a SAT problem, and get a very efficient way of doing that inference. There's a SAT competition every year, and the programs that have won it in recent years are derivatives of a program called MiniSat that came out in the early 2000s. MiniSat and all these derivatives take advantage of two things. One was actually developed in 1996: conflict-driven clause learning. As you guess values for Boolean variables working down the tree, when you hit a point where a Boolean variable has to be both 1 and 0 at the same time, that's a conflict and you have to backtrack up the tree; if, when you hit those conflicts, you create a new clause to augment your original set of SAT clauses to remember that conflict, it makes your search of the space much more efficient. So everybody does conflict-driven clause learning. The other thing all of the recent solvers do is an innovation from some folks at Princeton in 2000, where they keep a very compressed data structure in which they watch only two variables per clause; as you assign a variable, if it's not one of those two, you know you couldn't possibly have driven that clause to be unit, so it optimizes the search.
Well, it turns out that when we tried to implement that, since we decided we wanted to use the most efficient algorithms, it was blindingly slow, because it serializes things: these data structures, the indices and the two-watched-literal structures that Chaff (the name of the Princeton program) used, made things much slower. So ultimately what we decided to do was just implement an array, as shown here. Say I decide to set variable a to 1; I then send a message into this array that propagates it. Each of these blue squares is what's called a clause unit, which holds a very large number, many thousands, of clauses, and it checks all of those clauses to see if they contain variable a and, if so, sets it to 1 in those clauses. This propagates down until this green element here indicates that I've driven a clause to be unit, which means that having set a to 1, there's only one remaining unbound variable in that clause, and for the clause to be satisfied that variable has to be whatever its polarity says, 1 or 0. So I've now determined a derived variable; say I determine that b needs to be 1, and I start propagating b = 1 from this point, the purple ones here, until, for example, I detect a conflict: setting b to 1 caused a conflict. By getting rid of the serial nature of walking the two-watched structures and the indices required to maintain them, I get tremendous parallelism. So sometimes you actually want to do more operations than the minimum order-n of the algorithm to unlock that acceleration.
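To make the propagation step concrete, here is a small Python sketch of brute-force Boolean constraint propagation in the spirit described: on each assignment, scan every clause (which the clause-unit array does in parallel in hardware, but serially here), detect clauses that have become unit and imply their last literal, and detect conflicts. It deliberately omits the two-watched-literal indexing, the decision heuristics, and clause learning; the names are mine.

```python
def propagate(clauses, assignment):
    """Brute-force unit propagation. clauses: list of lists of ints, where
    literal +v means variable v is true and -v means false. assignment: dict
    var -> bool. Returns ("conflict", clause) or ("ok", assignment)."""
    changed = True
    while changed:
        changed = False
        for clause in clauses:              # hardware checks all clauses at once
            unassigned = []
            satisfied = False
            for lit in clause:
                var, want = abs(lit), lit > 0
                if var not in assignment:
                    unassigned.append(lit)
                elif assignment[var] == want:
                    satisfied = True
                    break
            if satisfied:
                continue
            if not unassigned:
                return "conflict", clause   # every literal is false
            if len(unassigned) == 1:        # unit clause: the literal is forced
                lit = unassigned[0]
                assignment[abs(lit)] = lit > 0
                changed = True
    return "ok", assignment

# Toy instance: (a or b) and (not a or c) and (not b or not c), with a = True.
clauses = [[1, 2], [-1, 3], [-2, -3]]
print(propagate(clauses, {1: True}))   # forces c = True, then b = False
```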
The other thing we found in looking at SAT is that many people accelerate only the part I just showed you, which is called propagation: setting a variable and propagating all the consequences of that. In fact we were very fast at doing that, 300 to 500 times faster than a CPU at the Boolean constraint propagation, but if that was all we did, we would have accelerated SAT by only about 4x. The reason is shown in this bar graph: these are a bunch of benchmarks from the most recent SAT competition, and propagate is the green part of each bar; for some problems propagate is less than half of the total, and on average it's about 70 or 80 percent. If you don't accelerate the remaining percentage, this is sort of the Amdahl's law of domain-specific hardware, running it on the standard CPU completely limits you.
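The Amdahl's-law arithmetic behind that roughly-4x number, as a quick sanity check using the fractions quoted above (a sketch, not a measurement):

```python
def amdahl_speedup(accelerated_fraction, accel_factor):
    """Overall speedup when only part of the work is accelerated."""
    return 1.0 / ((1 - accelerated_fraction) + accelerated_fraction / accel_factor)

# Propagation is roughly 75% of runtime; even a 400x propagation engine
# gives only about 4x overall, which is why the rest had to move to hardware too.
print(amdahl_speedup(0.75, 400))   # ~3.97
```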
So in fact we had to extend our array so that, in addition to the forward propagation, it also does the clause learning and the clause simplification. We did keep much of the algorithm, the more subtle but less throughput-demanding parts, in software. For example, maintaining the information that determines which variable to choose next we kept in software, so we could duplicate the exact heuristics of the existing algorithms; deciding when to do a restart we kept in software; and deciding which learned clauses to discard (it turns out that over time you wind up learning too many clauses and have to throw a bunch of them away) we kept in software as well. So we can keep all of the existing heuristics unchanged but make them go blindingly fast in hardware.
Now, a key thing that comes up in building accelerators is over-specialization. Very often you're implementing exactly one algorithm: you make that algorithm go really fast, but then you're done, and somebody comes up with a slightly different algorithm that won't run on your hardware. Some of the first accelerators I built, back in the 1980s, had this problem. I built an accelerator called the MARS simulation engine when I was a graduate student at Caltech. I had the unfortunate circumstance that my first PhD thesis advisor was Randy Bryant, and after about a year at Caltech, when I was well into what was going to be my PhD thesis, Randy comes to me one day and says, "I'm moving to Pittsburgh to go to CMU; you're coming with me, right?" I said, "What? Move from Los Angeles to Pittsburgh? I don't think so." So I was immediately on the market for a new PhD advisor; I did a reset, threw that thesis away, and started another one. But at the same time I was being supported in graduate school by Bell Labs, where I had worked before going back for my PhD, and I was spending a week a month at Murray Hill working with people there. They saw that project, which I actually did finish, and wanted a copy of it.
In the original design I built hardware blocks for each part of what was basically a switch-level simulator, and it was blindingly fast: it did exactly that algorithm. But when I went around Bell Labs talking to people about what they wanted out of a simulator, I found that some people wanted a switch-level simulator, other people just wanted a logic simulator, some wanted a logic simulator with unit delays and others wanted multiple delays, and some people wanted to do fault simulation. My head was starting to hurt as I drew all the boxes I was going to have to implement. Then I realized that the main piece of performance I was going to get out of this was parallelism, and if I wanted specialization, it turned out there were some common operations that all of these boxes shared. So rather than building something that looked like this, which is what the original MARS simulation engine looked like (I actually have a huge wire-wrapped circuit board, which is the way you did things back in those days), I architected it at a high level to look like this, and each of those boxes from the previous design, one or a small number of them, load-balanced, was mapped onto one of these processing elements. I don't know if I have a photo of it here; I should put one in the slides.
Each of those processing elements was a custom microprocessor. This was back in the day when a microprocessor like this was designed by me and one layout tech: I was hand-drawing schematics, pencil on paper, handing them to the layout tech, and he was feeding them into the layout system. The common operations were operations on small fields, so there was a field operation unit where you could pull an arbitrary bit field, anywhere from 1 up to about 16 bits, out of a word, do an operation on it, and insert it back into a word, and table lookups, so there was a very efficient address arithmetic unit for those. And to reduce the overhead of communication and synchronization, which is what we get from systolic arrays (this is not completely systolic, since the different operations take different amounts of time), I wanted very efficient communication and synchronization. Remember, these things form a big pipeline, so everything starts by reading a record from your input and ends by writing a record to your output. There was a queue unit that managed those records: you would read a record from your input as if it were a register (the queue unit was register-mapped), do a couple of operations, and then write to the output queue; when you wrote to the output queue you would specify to which processing element the message should be sent, fill out the record, and send it.
It wound up that, whereas in the previous design all those hardware boxes ran in exactly one cycle, the unit-delay logic simulator on MARS ran at four cycles per step; it was limited by the slowest pipeline stage. But this machine was still something like a thousand times faster than running that simulator on the IBM mainframe of the time, and they also implemented a multiple-delay logic simulator, a fault simulator, and a switch-level simulator on the hardware. I found out much later that, although I did the original design in 1.25-micron CMOS in, I think, 1987, they actually revised it through five generations of logic technology, and the final version was done around 2000.
The way I discovered this was kind of interesting. I was teaching the introductory logic design course at Stanford, and a student in my class comes up to me after class and says, "My dad knows you; he worked with you at Bell Labs." I asked his name, and I didn't remember working with him. He says, "He inherited your simulator." So it turns out this guy did, and I pity him, because I drew those schematics so that I could read them and the layout tech could read them; there was not great documentation for this machine. He said his father reimplemented it in half-micron CMOS, and then 0.35, and then 0.25, and so on. I wound up calling the guy up and had some great chats about it. But it shows that if something is built to be general, it winds up with relatively good longevity, whereas if you over-specialize, it usually gets used for a short period of time, and as soon as the algorithms improve it gets tossed out.
That leads me to this concept of platforms for acceleration. If you look at the common denominator among the different accelerators I've talked about (I must be talking too long, people are starting to leave; I'm nearly done), they need a few things. First of all, they need a very high bandwidth hierarchical memory system: we keep needing these small memory arrays, whether they're the traceback memories in the dynamic programming engine, the bin-count arrays in the filtering engine, or the arrays storing the sparse matrix in our neural network accelerator. Then we need very programmable control and operand delivery, and we need simple places to bolt on domain-specific hardware. So we have an engine for doing neural networks really fast, we have an engine for doing dynamic programming really fast, and you can bolt these on in a couple of ways. The easiest is to define a new instruction: I could have an instruction that is a dynamic programming step. Another way to bolt one on would be to have it be a memory client: I load its problem into memory, kick it off, and it reads the problem from memory and writes the result back to memory.
So I will posit that a GPU is the perfect platform on which to build accelerators. It has a wonderful underlying memory system, it has great places to bolt on domain-specific accelerators, and it has very programmable control and operand delivery, and in fact we've been using it in exactly this way. One example I like to draw on is to look at the Volta V100 and compare it to something like the Google TPU. When you do that comparison, the part of Volta that matters is what we call the tensor cores. "Tensor core" is the name the marketing people made up; in engineering we refer to it as HMMA, half-precision matrix multiply accumulate, and the operation is illustrated here: it takes two 4x4 FP16 matrices, multiplies them together, and adds the result into a matrix that is typically FP32. That does 128 arithmetic operations in the course of a single instruction, and the importance of that is that it amortizes out the overhead. To go through previous generations: when the Google TPU people published, they did a comparison against our Kepler GPU, a GPU we started designing around 2009 that came out in 2012, before we really considered deep learning as a special use case, so it didn't even have the 8-bit integer or FP16 operations you would want; they were basically comparing their int8 operations against our FP32 operations. The first GPU that had any real support for this was Maxwell, where we had a half-precision floating-point multiply-accumulate instruction that would do two ops at 1.5 picojoules of energy; but compared against the 30 picojoules it takes to fetch and decode the instruction and fetch the operands, our overhead was 2,000 percent. And lest you think that's a lot, remember that a general-purpose CPU has an overhead of about a hundred thousand percent, so it actually isn't that bad in comparison. By the time we moved to the Pascal generation we had a dot-product instruction, so the overhead was down to 500 percent, and with HMMA in Volta we're doing 128 ops for about 110 picojoules of energy, so amortizing out the 30 picojoules of overhead comes to only about 30 percent.
a dedicated GPU you with the same
technology using the same degree of
of craftsmanship for the different
arithmetic units you would do at most 27
percent better and we actually evaluated
doing that and have decided that the 27
percent isn't worth it for the
flexibility we get of being able to
program arbitrary layers using the
programmable parts of this so again it's
getting that degree of program ability
right and that by the way is sort of in
a fair comparison of a dedicated TPU to
a GPU with tensor cores what the
advantage should be which is about 27
percent in energy so our vision for the
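As a quick check on those numbers, using only the per-instruction energies quoted above (30 pJ of instruction fetch, decode, and operand-delivery overhead throughout, 1.5 pJ of useful arithmetic for Maxwell's two-op FP16 multiply-accumulate, and 110 pJ for Volta's 128-op HMMA):

\[
\text{Maxwell: } \frac{30\ \text{pJ}}{1.5\ \text{pJ}} = 2000\%, \qquad
\text{Volta HMMA: } \frac{30\ \text{pJ}}{110\ \text{pJ}} \approx 27\%, \qquad
\frac{110 + 30}{110} \approx 1.27\times
\]

so stripping the overhead away entirely, as a fully dedicated accelerator would, buys at most about 27 percent more operations per joule.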
Our vision for the future is that people will code their programs thinking of them not as just getting mapped down into software, but as getting mapped into a combination of software and hardware. So we will write a program, say an n-body program where I'm mapping a force over pairs of particles, and I may even have some guidance, some mapping directives, about how I think this ought to be delivered onto hardware. I'll have some software, a mapper and a runtime, that will do data and task placement, and for the tasks it will have the option of running them out to a hardware synthesizer to build specialized blocks, to get that advantage of specialization, which can be fed back in. So you can think of this as synthesizing new tensor cores or ray-tracing cores or whatever the algorithms are that we want to speed up in the future. This will get mapped to a platform which has a very high-bandwidth memory and communication system, which has some general-purpose cores like the streaming multiprocessors in a GPU, and then selectively we will drop in these specialized units to accelerate the problem that we're trying to solve. The cost of doing this is much lower than building a specialized engine from scratch, because we're able to leverage the platform for 90 percent of what is on this chip: the memory system, the communication system, the general-purpose control. All we have to do is customize those blocks. I think there's really interesting research to be done in how to automate this mapping and runtime, and how to decide which parts need to be accelerated. And realize that it's not a one-way flow: you're going to write the program with some mapping directives, see what happens, and then do some design-space exploration where you try out different approaches to algorithms, to mappings, to ways of decomposing the problem. But I think this is the way computation will be done in the future.
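To make that n-body example concrete, here is a minimal sketch of the kind of kernel such a program would hand to the general-purpose cores before any specialized block is synthesized. The kernel name, the use of float4 with the mass in the w component, and the softening term are all illustrative choices, not anything from the talk:

```cuda
#include <cuda_runtime.h>

// Each thread owns one particle and accumulates the force contributed by
// every other particle. 'softening' avoids the singularity when two
// particles coincide.
__global__ void pairwise_forces(const float4 *pos, float3 *force, int n,
                                float softening) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float4 pi = pos[i];                          // w holds the particle mass
    float3 f = make_float3(0.f, 0.f, 0.f);
    for (int j = 0; j < n; ++j) {
        float4 pj = pos[j];
        float dx = pj.x - pi.x, dy = pj.y - pi.y, dz = pj.z - pi.z;
        float r2 = dx * dx + dy * dy + dz * dz + softening;
        float inv_r = rsqrtf(r2);
        float s = pj.w * inv_r * inv_r * inv_r;  // m_j / r^3
        f.x += dx * s; f.y += dy * s; f.z += dz * s;
    }
    force[i] = f;
}
```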
So let me wrap up, since I think I'm actually probably over time. We need to continue scaling performance per watt; our economic growth depends on it, and machine learning depends on it. A lot of people don't appreciate this, but all the algorithms everybody uses for machine learning have been around since the 1980s: convolutional neural networks, stochastic gradient descent, backprop, all in the 1980s. I actually took a course at Caltech with John Hopfield when he was on sabbatical there, built neural networks, and concluded that the machines weren't fast enough to make them work, and in fact that was the case. It wasn't until we had GPUs that were fast enough that machine learning took off, and today machine learning progress is gated by our ability to train larger models. People want to build bigger models and train them on more data, and that takes quadratically more compute to actually train a model, so they're limited not by how much data they have, or by how creative they are at building models, but by the compute to train them. So we need to continue scaling performance to make that happen, and with Moore's law being over, the most promising way of scaling performance is to build domain-specific accelerators. There's a bunch of principles I hope I've shared with you, sort of through case studies, about how to do that. The first one is co-design: you can very rarely take the same algorithm and just accelerate it; you have to rethink the algorithm in terms of parallelism and in terms of efficient memory footprint. It's often the right way to rethink it even to run it on a conventional processor, but the algorithm often has to change. Memory dominates: it takes up the bulk of the area and the power, and you often have to restructure your algorithms to make them fit in a reasonable amount of memory. You get performance when you can make the high-bandwidth accesses to memory hit in small, fast, on-chip memories, where we can have multiple arrays and get parallelism out of that memory system; you're limited by how much global memory bandwidth you have, and you have to use that very carefully (the sketch just after these principles illustrates that pattern). And then finally, simple parallelism often wins: you're often better off not using complex data structures that are very serial, but instead paying higher computational complexity, because operations are free, for something with more parallelism, because that may wind up giving you a faster solution time.
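Here is a minimal sketch of that pattern in CUDA, keeping the high-bandwidth accesses in a small on-chip array, using a privatized bin-count (histogram) kernel of the same general flavor as the bin-count arrays mentioned earlier. The kernel name and the modulo guard are illustrative; launch it with num_bins * sizeof(unsigned int) bytes of dynamic shared memory:

```cuda
#include <cuda_runtime.h>

// The bin counts live in a small on-chip (shared-memory) array, so the
// high-bandwidth increment traffic never leaves the chip; only one merged
// copy of the histogram per block goes out to global memory at the end.
__global__ void bin_count(const unsigned int *keys, int n,
                          unsigned int *global_bins, int num_bins) {
    extern __shared__ unsigned int local_bins[];   // one small array per block

    for (int b = threadIdx.x; b < num_bins; b += blockDim.x)
        local_bins[b] = 0;
    __syncthreads();

    // High-bandwidth phase: every increment hits the on-chip array.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        atomicAdd(&local_bins[keys[i] % num_bins], 1u);
    __syncthreads();

    // Low-bandwidth phase: one merge per block into global memory.
    for (int b = threadIdx.x; b < num_bins; b += blockDim.x)
        atomicAdd(&global_bins[b], local_bins[b]);
}
```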
To make this economically feasible in the future, what we need to do is factor out the common parts of the system. Memory doesn't care what its bits are storing: they could be storing genes (little two-bit base-pair sequences), they could be storing weights and activations for neural networks, they could be storing pressures and velocities for a fluid dynamics computation. If you have a very fast, very parallel on-chip memory system, it can be used for any of those things. So you can leverage that, leverage some general-purpose processing cores, and add just the amount of logic you need for the specialization of your problem, and GPUs wind up being an optimal way of doing this, whether you're adding instructions or memory clients. I think that to make this possible we need better tools to explore this design space of accelerators. So thank you; I'd be happy to take any questions.
[Applause]
I actually have two questions. One is: is there any chance of improving memory latency? I mean, this has been with us now for decades, and partly it's been driving much of classical architecture, and it's sort of a major issue in what you're discussing; is there any chance we can do something about memory latency with some new electronics? And the second question is about FPGAs, which, not surprisingly, you didn't mention.
Okay, let me take those. Thank you, Moshe; those are both really good questions. So let me talk about memory latency. Actually, I'm much more worried about memory bandwidth than I am about latency, and let me tell you why. Memory latency I can cover with more parallelism, right? If I take the gene sequence accelerator, I've got 15 million reads that I can all handle in parallel, so I've got an enormous amount of parallelism with which I can cover whatever memory latency you have. But today I'm limited by today's memory chips having a certain amount of memory bandwidth; it's somewhere around 20 gigabytes per second per channel (that may actually be a little optimistic, maybe 15), and that's all I get, and that winds up being the bottleneck. So it's actually the bandwidth I worry about more. You can do things about memory bandwidth, but it ultimately boils down to energy, because accessing an LPDDR DRAM is probably around eight to ten picojoules per bit, and that winds up getting expensive: you just multiply bits per second by picojoules per bit and that gives you your power dissipation in watts. It's sort of a fundamental thing that when I go off chip, it takes more energy to fetch those bits. You can engineer chips around that to a degree; we use high-bandwidth memories in our high-end GPUs, in the Volta V100 and the Pascal P100, and they have this characteristic of having very low energy per bit, probably on the order of three or four picojoules per bit, but you still wind up being energy-limited to get that bandwidth.
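As a worked example with the rough figures just quoted (approximate numbers, not datasheet values), one channel at 20 GB/s and 10 pJ/bit dissipates

\[
20\ \tfrac{\text{GB}}{\text{s}} \times 8\ \tfrac{\text{bits}}{\text{byte}} \times 10\ \tfrac{\text{pJ}}{\text{bit}} = 1.6\ \text{W},
\]

and the power scales linearly with total bandwidth, which is why aggregate bandwidth is ultimately an energy (and cooling) budget; at the 3 to 4 pJ/bit of high-bandwidth memory, the same bandwidth costs roughly a third of that energy.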
Latency boils down to almost a speed-of-light calculation, so it's all about communication. Actually reading the memory location takes almost no time; it's getting the request to the memory chip, to the proper bank of the memory chip, and getting the result back. And these on-chip wires are actually much slower than the speed of light, like an order of magnitude slower, because they're RC transmission lines rather than LC transmission lines. So one thing that we could do to make it better, first of all, is to make all the distances smaller. That's largely a cooling problem to first approximation, because now we have all this energy being dissipated concentrated in a small space, and we need very effective liquid cooling to get the cooling density up. The other thing that we could do is have good transmission lines: rather than running these things around on RC transmission lines, we can fabricate LC transmission lines, either with very fat, high-conductivity metal layers on the chips themselves, or by running them down to an organic substrate and routing them around there. If we can exploit locality, we can also imagine doing stacking, where we have the DRAM we're accessing right next to the processing element that's accessing it. All of these things could potentially reduce the latency by making things electrically closer to one another.
Now, the other question you asked was about FPGAs. There's a really simple way of somewhat summing up an FPGA, which is that an FPGA is just a bad ASIC. The way to think about an FPGA is that they have some special-purpose units on them (some of them have floating-point units, some of them have DSPs), and those special-purpose units are very highly engineered; they're as good as the ones you would build on an ASIC. But the rest of the FPGA, what makes it field-programmable, is the look-up tables, the LUTs, and most of the modern ones are six-LUTs: you feed six bits in, it looks up one of 64 locations and tells you what the output is, so you get an arbitrary six-input logic function. We have benchmarked on many different applications that comparing an ASIC in the same technology to the FPGA's LUTs is almost exactly a hundred x in both area and power: a hundred times more area for the LUT and a hundred times more energy per operation for the LUT.
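A minimal sketch of what a six-input LUT computes (the function name and example are mine): the 64-entry truth table is just a 64-bit word, and the six input bits select which bit of it to return.

```cuda
#include <cstdint>

// A 6-LUT is a 64-bit truth table; the six input bits pick one of its 64
// entries, so any six-input Boolean function fits in one LUT.
__host__ __device__ inline unsigned lut6(uint64_t truth_table, unsigned inputs) {
    return static_cast<unsigned>((truth_table >> (inputs & 0x3F)) & 1u);
}

// Example: a six-input AND is the table with only entry 63 set.
//   uint64_t and6 = 1ULL << 63;
//   lut6(and6, 0x3F) == 1;   // all six inputs high
//   lut6(and6, 0x1F) == 0;   // one input low
```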
And so for any application that people really care about, they will take whatever they might prototype on an FPGA and ultimately build the ASIC. That said, our Darwin gene sequence accelerator is right now available both on Microsoft Azure and on Amazon F1 instances, because while it's a 15,000x improvement over a CPU if you actually build the full-custom accelerator, it's still 150x running on an FPGA, down by a factor of 100. So if you get enough out of your accelerator, you can tolerate that hundred-x overhead of the FPGA.
Yeah, thank you for the great talk. For domain-specific accelerators, processing in memory has been shown to have great potential to reduce the high cost of memory access. Could you share your comments on this emerging architecture?
Yeah, that's a great question, thank you. Let's go back to my slides on memory dominates: processing in memory is really the point I was trying to get at when I said memory dominates. When most people say processing in memory, what they really mean is processing next to memory. For example, in our GACT array we have an array of processing elements (actually 64 arrays of 64 elements each), and each processing element has this little SRAM array right next to it, and all of the high memory bandwidth, that one-per-cycle traceback pointer, is getting stored into that SRAM array. The small amount of memory bandwidth, which is basically loading the next reference sequence and loading the next query sequence, takes place over standard memory channels and is almost down in the noise. So I think that what processing in memory is really about is co-locating processing elements with small memory arrays, and in fact, historically, all of the PIM chips that people built, the one that Peter Kogge built at IBM whose name is escaping me right now (EXECUBE, I think it was), and our J-Machine at MIT, all of these were really about putting memory next to processing elements, so that you would have very high-bandwidth access to local memories rather than having to make global memory references. All right, thank you.
[Applause]