Uploaded by gacarel813

Accelerating AI: Hardware for Deep Learning

# Bill Dally - Accelerating AI
# https://www.youtube.com/watch/EkHnyuW_U7o
so I'm really pleased to be here this is
a great venue and a great topic
unless you've been hiding under a rock
you've realized that AI and in particular
deep learning is changing almost every
aspect of human life
you know in the internet every time you
upload a photo to Facebook they run it
through a half a dozen deep networks
looking for inappropriate content
copyrighted content tagging faces etc
when you you know talk to Google you're
talking to a neural network and that's
why the speech recognition is so much
better when you travel you can translate
language by just you know pointing your
phone at the text and it comes out in
whatever language you're most
comfortable with
it's making doctors more effective the
goal here is not to replace doctors with
neural networks but to empower them so
that a clinician with a neural network
can interpret you know medical images
better whether it's for skin cancer
whether it's for breast cancer you know
whether it's you know looking at retinas
for various forms of eye disease and
also they're mining data of various
symptoms and and you know signs patient
histories and helping doctors make
better diagnoses media entertainment
it's you know completely changing how
people produce content security and
defense it's easy to train a network for
what's normal anything abnormal is an
anomaly gets flagged and one thing we're
very excited about at Nvidia is the
prospect of saving some of the 1.3
million lives lost on highways worldwide
each year by applying neural networks to
autonomous vehicles as a computer
architect and a hardware designer what's
particularly exciting to me is that this
revolution was enabled by Hardware most
of the algorithms we use deep neural
networks convolutional neural networks
training them with backpropagation
stochastic gradient descent were all
around in the 1980s I actually took a
course when I was a PhD student at
Caltech from John Hopfield who was
visiting Caltech at that time on neural
networks and built a lot of these
networks and concluded that we didn't
have enough computing power to make them
useful which was correct in the 1980s
the other ingredient you need to make
these things work is lots of data we had
that around the early 2000s and the
imagenet competition was running for a
long time before the Burnham took off
with me and him
labeled sets but the missing ingredient
and the spark that really ignited the
fire that is the current revolution in deep
learning was having enough compute
horsepower to train these models on
these data sets in a reasonable amount
of time where people defined reasonable
as two weeks and once you know once that
happened it took off now since it's
taken off we have this issue and
two of the previous speakers have had
this little cycle of better models more
data you know better results that deep
learning is now gated by hardware we need
to continue to produce faster solutions
for both training and inference to
enable people to run you know bigger
models on more data you know this shows
sort of the increased compute complexity
for you know images going from 2012 to
2016 so a four year period you need
three hundred and fifty times as much
performance for speech over a shorter
period it's 30x for machine translation
10x and so we have to continue to
deliver better hardware solutions but
Dave Patterson actually showed you this
exact slide this morning despite the
fact that a particular company is in
denial about this fact Moore's law is
dead you know if two Turing Award
winners you know sort of call it I think
I think it's it's a done deal
they can put the defibrillators away now
to give you an idea of a little bit of
what's going on and cool not talked
about a bunch of this in his talk let me
just show you one brief example of how
we use machine learning and what the
compute complexity is so we use a
network called DriveNet which is
very similar to ResNet for our deep
learning perception we have 12 cameras
on the car for each camera we do many
things we do detection we detect free
space we have two independent calls of
where it's safe to go one by detecting
where it's safe and another one detecting where
it's not we find lanes and predict a
path that goes into our path planner as
a suggested route and we do this in all
directions from 12 cameras this is
putting all three of those together and
we do it in all kinds of weather so we
do it you know at night during the day
in the rain in the snow and have tested
it on
lots of real and synthetic scenarios
that that happen all the time now if you
look at the computational complexity of
doing this let's start with ResNet-50
because I don't want to release the
details of DriveNet and ResNet-50 is
very comparable to run an ImageNet
sized image that's roughly eight billion
operations to feed one image in and get an
inference coming out to do that at thirty
frames per second it's a quarter of a
tera op to run one camera through one
network but our cameras aren't image net
cameras in fact I don't think anybody
has image net sized cameras that are 225
by 225 our cameras are HD cameras and in
the very near future they're going to be
moving up to 4k but at HD it's about ten
tera ops per camera per network we have
12 cameras four networks we actually have
more than four networks depending on how
you look at it but but the the 12 times
four is a good number so this is the
computational complexity here is really
enormous how do we solve that
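
The arithmetic behind those figures can be reproduced with a short sketch. The operation count, frame rate, camera count and network count come from the talk; the exact resolutions (224 x 224 for an ImageNet-sized image, 1920 x 1080 for HD) and the idea that operation count scales linearly with pixel count are assumptions made here for illustration.

```python
# Rough compute budget for camera perception, using the figures quoted above.
OPS_PER_IMAGE = 8e9          # ~8 billion operations per ImageNet-sized ResNet-50 inference
FRAMES_PER_SECOND = 30
CAMERAS = 12
NETWORKS = 4

# Assumptions (not stated in the talk): exact resolutions, and linear scaling
# of operation count with pixel count.
IMAGENET_PIXELS = 224 * 224
HD_PIXELS = 1920 * 1080


def tera_ops_per_second(ops_per_image, fps, pixel_scale=1.0):
    """Sustained tera-operations per second for one camera through one network."""
    return ops_per_image * fps * pixel_scale / 1e12


small = tera_ops_per_second(OPS_PER_IMAGE, FRAMES_PER_SECOND)
hd = tera_ops_per_second(OPS_PER_IMAGE, FRAMES_PER_SECOND, HD_PIXELS / IMAGENET_PIXELS)
total = hd * CAMERAS * NETWORKS

print(f"ImageNet-sized input: {small:.2f} TOPS per camera per network")
print(f"HD input:             {hd:.1f} TOPS per camera per network")
print(f"12 cameras x 4 nets:  {total:.0f} TOPS")
```

Under these assumptions the sketch gives about 0.24 TOPS for the small image and about 9.9 TOPS at HD, in line with the "quarter of a tera op" and "about ten tera ops" quoted in the talk.
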
well today we do it with GPUs and I'll
talk a little bit about the evolution of
GPUs for deep learning it's actually an
interesting kind of coevolution that
that's going on and the current GPUs are
very much designed to be deep learning
accelerators if you're training the
absolute best machine on the planet for
training and I'll get to the MLPerf
results shortly but as certified by
MLPerf is a Tesla you know V100 Volta
tensor core GPU and what makes it
really good for training are two numbers
on this I'm not gonna go through all the
specs one is it has a hundred and
twenty five teraflops of tensor core
performance that's FP16
performance and we provide a software
package called AMP automatic mixed
precision that does scaling so it will
take almost any in fact we have not yet
found one that it doesn't do reasonably
well on almost any model and
automatically scale things so you can
run the training in FP sixteen the other
number here that's really important
actually two of them that are important
is the nearly a terabyte per second of
HBM bandwidth because a lot of these
you know models especially as your batch
sizes get up there the activations don't
fit on chip and you're actually running
the activations
through that that memory interface and a
conventional memory interface can't keep
up and then for scalability we have 300
gigabytes per second of NVLink
bandwidth so you
can connect these together into
a training supercomputer and I'll show
the MLPerf scalability results
shortly where we won every category we
submitted to now the reason this is so
good at training is it's very much a
built for purpose device for training but
unlike a TPU where you know Google
basically built a big you know 256 by
256 matrix multiply unit what we decided
to do is to take a very flexible
programmable engine which had a very
capable memory system which is half the
battle and add an instruction to it and
the instruction we added was HMMA half
precision matrix multiply accumulate and
it does what's shown here in that it
takes two F P 16 matrices multiplies
them together and adds them to an FP 32
matrix and because it does 128
floating-point operations
almost all of the energy goes to doing
the math the overhead of programmability
of fetching the instruction decoding the
instruction fetching the operands all
the things that give you the
programmability which is so valuable for
doing new models new layers doing
different normalization schemes doing
sparsity masks doing drop out whatever
you want to do you can do with the
programmability you essentially get for
free it's not advancing ok that's it
this way the Turing part so
Volta was announced in May of 2017
Turing we announced last summer in
August of 2018 does the same thing for
inference Volta is a training GPU
turing is actually a universal GPU it
it matches Voltas performance not this
version this is the t4 which is very low
power and 65 teraflops FP16 the
Quadro 8000 is 130 teraflops of FP16 it's
actually slightly faster than Volta but
what's key here is that we've also
provided INT8 and INT4 tensor cores
so for inference we can provide an
enormous number of ops per watt for
inference in fact it's quite a bit
better than any of the special-purpose
parts that I've seen to date and a
little graphic of sort of how this works
in Pascal we had a dot product
instruction so you can sort of do four
operations at a time with Volta we went
to the FP16 tensor cores that do 128
ops at a time and then in INT8 it's 256
and in INT4 it's 512 so it just
multiplies up the performance and
when you look and these are the numbers
for the FP 16 version the the fully
loaded cost of instruction fetch decode
and operand fetch and by the way even in
a TPU you have to get the operands
from somewhere and I have to say I was
not allowed to show the actual 12
nanometer numbers so these numbers are
scaled from publicly available 45
nanometer numbers running the RTL code
through and getting the extracted
energies out but basically the overhead
of programmability is 27% what that
means is that if you threw everything
else off the GPU except for the actual
math units themselves and of course
there's going to be some overhead even
for a dedicated accelerator like a TPU
you couldn't do any better than 27%
better and that's assuming you can make
floating-point units as good as ours and
we've been at it for a while now one
thing I like to show is the evolution of
our performance over the years and this
is the chart for inference there's a
comparable chart for training and
everybody always likes to compare
against Kepler it's kind of like the way
they used to always compare against the
VAX 11 780 because it's a good target
but Kepler is a part that we started
shipping in fact the K 20 wasn't even
the first one but we started shipping
the K 20 in 2012 we you know completed a
lot of the initial stages of this design
in 2009 and that was sort of before we'd
really identified deep learning as a
target area to make things better even
so it was for a long period of time the
platform people ran deep learning
training on and just to show you that
Moore's law doesn't matter here M40
which is in exactly the same technology
they're both in TSMC 28 nanometer we
announced a couple years later with
almost double the performance per watt
and from there on we went to Pascal and
by the way these were both doing FP 32
operations we didn't have the FP 16
until we moved to Pascal the jump here is
FP16 on Pascal the color here by
the way encodes the process
node these three are all on 16 nanometer
I know we call the last two 12 but it's
16 nanometer metal rules so it's the
same energy as the original 16
nanometer with slightly faster
transistors but by going to FP 16 we get
a big jump 3x in in performance by
adding the tensor cores still FP 16 we
get another big jump about 5x and this
is in efficiency energy per op and
then adding the INT8 this is for INT8
inference we get another huge jump now
you know Dave Patterson made the point
this morning that the TPU one compared
against Kepler was 10x better at
something I don't know exactly what they're
measuring may have been per watt I will
observe and I'll show you the data on
this in a couple slides that they were
comparing an INT8 engine against an FP32
engine and the difference between the
energy to do an INT8 multiply and an
FP32 multiply is 16x almost the entire
advantage of the TPU 1 over Kepler is the
difference between FP32 and INT8 the
rest of it is all just waving hands this
this chart shows a similar thing for the
single chip inference performance but
what really matters actually is the
performance per watt which is what this
is showing and so over the past six
years since we identified deep learning
as an important use case we've improved
our performance per watt by a hundred
and ten X we are not a stationary target
for all these numerous startups out
there who would like to eat our lunch
it's a very rapidly moving lunch
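
To put the hundred-and-ten-times figure in perspective, here is a small worked calculation of the implied yearly improvement. The 110x gain and the six-year span are from the talk; the comparison baseline of one doubling every two years is an assumption added here as a common reading of Moore's law.

```python
import math


def annual_factor(total_gain, years):
    """Constant yearly multiplier that compounds to total_gain over the period."""
    return total_gain ** (1.0 / years)


def doubling_time_years(total_gain, years):
    """Time for performance per watt to double at that compound rate."""
    return years * math.log(2) / math.log(total_gain)


gain, years = 110.0, 6.0
yearly = annual_factor(gain, years)
doubling = doubling_time_years(gain, years)

# Assumed baseline for comparison: one doubling every two years.
baseline_gain = 2 ** (years / 2.0)

print(f"yearly improvement: {yearly:.2f}x")
print(f"doubling time:      {doubling * 12:.1f} months")
print(f"baseline over the same period: {baseline_gain:.0f}x")
```

The point of the comparison is that most of the gain has to come from architecture and number representation rather than from process scaling, which is the argument the talk makes.
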
but what's what's what's interesting is
that the very same chip that is the
leading chip for deep learning is also
the chip that powers the most powerful
supercomputer in the world the Summit
supercomputer at Oak Ridge National
Laboratory the most powerful
supercomputer in Europe Piz Daint at
the Swiss Federal Institute of
Technology and and a large fraction of
the top 20 on the top 500 list and you
know they've made the point of no
training at 200 petaflops Summit is
an exaflop machine if you're gonna count
FP16 flops it's 3 exaflops and one of
the winners of the Gordon Bell prize at
supercomputing last year which included
co-authors from both the DOE
and from Nvidia actually sustained over
an exaflop on a deep learning task
associated with interpreting weather
patterns for climate change so we've
been able to sustain exaflop on real
deep learning tasks using this machine
what's great about combining the FP 64
performance you need for a lot of the
historic HPC applications with the FP 16
you need for deep learning is that deep
learning has revolutionized scientific
computing in two ways one is as that Gordon
Bell prize shows by using deep learning
to interpret the results you run some
huge climate simulation and and you want
the answer you know is the earth doomed
and and the way to extract that is to
look for things in that to look for
patterns of currents and and typhoons
and the like and by finding those you
can actually interpret the results it's
hard for humans to find that in a huge
data set with deep learning you can build
pattern recognizers that find those
things the other application is actually
doing the simulation itself rather than
simulating the constituent equations you
can basically take previous simulations
or real ground truth data and use them
to train a neural network to predict the
next state of the simulation and people
have gotten speed ups as high as 10 to
the fifth on quantum chemistry codes by
taking the original density functional
Theory codes taking results of them
running training networks on them and
then running the networks and getting
results that are equal accuracy with 10
to the fifth fewer operations
so this marriage of high-performance
computing with deep learning is a huge
thing now in the embedded space
we make an SoC called Xavier it's
designed primarily for our self-driving
cars but we also apply it to video
analytics and robotics and it's
basically a tenth of a Volta it's got 512
Volta cores instead of 5120 but it
has a bunch of accelerators in addition
to that and so we have accelerators for
video and things we can process those 12
cameras coming in but we also have a
deep learning accelerator and the reason we
did this is in the embedded
space that 27 percent matters and you
actually want to get every little bit of
overhead out of there
and so our deep learning accelerator
actually looks a lot like the Google TPU
there's a big MAC array in the middle
it's not quite as big as theirs it's
2K INT8 multiplies per cycle rather
than 64K but it has two things that
their deep learning accelerators don't
have one is it has support for sparsity
because as I'll talk about in a minute
most networks are sparse and the
activations are sparse and by moving
around only the nonzero values you can
save an enormous amount of bandwidth and
by not burning any energy trying to
multiply things by zero you can save an
enormous amount of energy we also
support Winograd transforms natively in
this space that makes convolutions much
more efficient if I have to do a 3x3
convolution and I do it in the spatial
domain it takes nine multiplies if I go to a
transform domain I can do that
convolution with a single multiply and
so there's a big big advantage for
moving to the Winograd domain so let's
talk about MLPerf so you know when I was
a teenager we used to race cars and you
only knew if somebody was serious if
they would race you for pink slips and
so you know I think we should have raced
Google for pink slips on this one and
you know basically there were seven
categories we submitted to six of them
the reinforcement learning category
wasn't really appropriate for what GPUs
are very good at but the six we
submitted to we won both
for single node and for scalability and
it was really the scalability one that
mattered the Intel person mentioned the
recommender system as being their
submission was 1.6 times better than
Pascal so I went to the MLPerf site by
the way it's mlperf.org you guys can
all just go there and look the numbers
up for yourself our
submission was 116 times a P100 just to
put it put that in perspective and
that's training in point four minutes by
the way that trained fast enough that we
didn't do multi-node for that one it
didn't make any sense if you're taking
you know what is that 20 seconds or
something like that to do the training
you don't need to make it go faster by
going to multiple nodes so these are all
in a single DGX-2H you guys can just get
out your checkbooks and buy one right
now if you want I'll take orders after
the talk if you if this isn't fast
enough for you these are the numbers at
scale and we got very very good
performance speed ups and these are on a
bunch of different clusters most of
which are clusters of actually DGX-1s
ones because we just had a larger
cluster of those available at the time
we ran these to run on but we got very
good scalability on all of these because
of our NVLink network
within the DGX-1 and then the
InfiniBand network connecting the nodes
together so this is where we are today
where we are the fastest in the world at
training networks at least as reported
by MLPerf we submitted to more
categories than anybody else we submitted
to six won six let's talk about where we
go from here we feel this responsibility
since deep learning is gated by
performance and hardware to keep you
know to keep those curves I showed you
going right we actually have pretty much
the next two points on those curves
pretty much loaded already and we're
still going you know up at a at a
comparable rate but but where is that
performance going to come from and so
some of the previous speakers hinted at
this one is from number representation
and when you choose a number
representation what you're really doing
is you're selecting two things one is
how much dynamic range do you have and
how much accuracy do you have and it's
important to use as few bits as you can
for two reasons one is that the energy
goes quadratically with the number of
bits right when you do a multiply
you're going from doing an 8 bit
multiplier to a 32 bit multiplier is not
four times more energy it's 16 times
more energy again that was the entire
difference in energy between the TPU one
and a k80 is comparing int 8 with FP 32
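
The quadratic rule stated above can be written as a one-line model. This is only the first-order approximation used in the talk (multiplier energy proportional to the square of the operand width); treating FP32 as a plain 32-bit width is part of that simplification, and the function name is made up for this sketch.

```python
def relative_multiply_energy(bits, reference_bits):
    """First-order model: multiply energy grows with the square of operand width."""
    return (bits / reference_bits) ** 2


# 32-bit versus 8-bit operands: four times the bits, sixteen times the energy.
print(relative_multiply_energy(32, 8))   # 16.0

# Going from 8-bit to 4-bit operands under the same model.
print(relative_multiply_energy(4, 8))    # 0.25
```
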
but the other reason that's really
important is that you want your your
data to be small both the weights and
the activations so you can fit a lot of
whatever on-chip memory you have and
keep it right next to the arithmetic
units because moving data around is very
expensive energetically and so one thing
you want to do is is to look at ways of
encoding your data that makes the most
of each bit a bit is a horrible thing to waste
and this shows after pruning which I'll
talk about in a minute the distribution
of weight values in a network I believe this
was in VGG-16 and what you see here is
that if I have 4 bits to use I could
choose to use them with an integer
representation which would give me the
evenly spaced symbols as shown by the
green X's here you can see that's a
horrible way to waste your symbols right
because I've got a bunch of symbols out
here where nothing is happening and I'm
sampling the place where everything is
happening very sparsely a better way to
use your symbols if you have
16 symbols is to do what the red dots
show here is to put them very densely
where the interesting things are
happening and not waste any out here
for the outliers and the data
here is from a paper that I published
with my former graduate student Song
Han who's now a professor at MIT in ICLR
actually in 2016 oh and on arXiv in
2015 and this is one way of doing it it
turns out this is actually an
energetically inefficient way of doing
it because it requires a codebook
lookup and a full 16-bit multiply there
are actually very clever numerical
encodings that get you a similar
efficiency we can wind up getting 8-bit
accuracy with sort of four bits per
symbol and do it in a very energetically
efficient way this is another figure
from from that 2015 paper which shows
you know for convolutional neural
networks we're able to get down to 6
bits per symbol with no loss of accuracy
that's no loss of accuracy compared to
FP32 and for fully connected layers from
multilayer perceptrons we're able to
get down to four with no loss of
accuracy you don't really fall off a
cliff of accuracy until two bits and
actually since this time we've done done
a lot of work that is not published yet
which is actually even better so you can
you can use the minimum number of bits
required and that number is is trending
down toward four a lot of people will say
let's go binary and it turns out that
winds up not being such a great idea you
end up losing more accuracy and having
to earn it back in ways that are more
expensive and four seems to be kind of
the sweet spot in energy of
for doing things to get a given level of
accuracy for inference so let's talk
about pruning one of my favorite sayings
in life is never put off till tomorrow
what you can put off forever
and this is especially true for
multiplies so it turns out that you just
just as you know biological brains are
sparse right they they do not have every
neuron connected to every neuron and then
they're even sparser dynamically in
terms of where the firing occurs
artificial brains can be the same way
and in fact you can lop out most of the
neurons in a network and not lose any
accuracy for fully connected layers
we've been able to repeatedly knock out
90% of the neurons leaving only 10%
remaining with no loss of accuracy and
for convolutional networks between 60
and 70 percent can be pruned leaving
thirty to forty percent now to do this
you then have to retrain the network
right so what you do is you train the
network you then lobotomize it lopping
out you know some large number of its
neurons and you get performance that's
indicated by this purple line here right
so you know Dave I think said a one
percent loss of accuracy is considered
catastrophic we actually consider a
point one percent loss of accuracy
catastrophic and so we would stop
pruning at around fifty percent with
without retraining but with retraining
we get the green line here where we can
get out to you know easily over eighty
percent without substantial loss of
accuracy and then anything that's fun
doing once you should do multiple times
and that's true for pruning as well so
if you iteratively prune you prune to
one level retrain prune and retrain
after three iterations you get the red
line here where we're out to 90 percent
pruned with ten percent density
left without loss of accuracy
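
The prune, retrain, prune again loop described above can be sketched as follows. This assumes magnitude-based pruning (dropping the smallest weights first), which the talk does not spell out, and `placeholder_retrain` is a hypothetical stand-in for a real training step; the three-step schedule ending at 90 percent is chosen to mirror the three iterations mentioned.

```python
import random


def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction `sparsity` of the weights.

    Returns the pruned weights and a 0/1 mask marking the survivors.
    """
    n_prune = int(round(len(weights) * sparsity))
    by_magnitude = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    mask = [1] * len(weights)
    for i in by_magnitude[:n_prune]:
        mask[i] = 0
    return [w * m for w, m in zip(weights, mask)], mask


def iterative_prune(weights, schedule, retrain):
    """Prune to each sparsity level in `schedule`, retraining after every step."""
    mask = [1] * len(weights)
    for sparsity in schedule:
        weights, mask = magnitude_prune(weights, sparsity)
        weights = retrain(weights, mask)
        # Pruned connections must stay pruned after retraining.
        weights = [w * m for w, m in zip(weights, mask)]
    return weights, mask


def placeholder_retrain(weights, mask):
    """Hypothetical stand-in: a real version would run gradient descent
    on the surviving weights only."""
    return list(weights)


random.seed(0)
dense = [random.gauss(0.0, 1.0) for _ in range(1000)]
pruned, mask = iterative_prune(dense, [0.5, 0.75, 0.9], placeholder_retrain)
density = sum(mask) / len(mask)
print(f"remaining density: {density:.0%}")
```

The sketch only shows the bookkeeping; whether accuracy survives at each level depends entirely on the real retraining step that the placeholder replaces.
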
again the reason why this is really
important this combination of reducing
precision with pruning is that it lets us
fit things into a really small local
memory and the cost of accessing memory
goes up an order of magnitude every time
you move up the memory hierarchy so if I
can fetch my data from a really small
local memory that's five picojoules per
word if I have to go across the chip to
get it from an SRAM that's on chip
but not really local that's 50 by the
way this memory is built out of a bunch
of these little memories and the other
45 is crossing wires to get there and
back it's the communication that really
burns the energy if I have to go off
chip even with LPDDR3 energy efficient
memory it's yet another order of
magnitude so let me switch gears here and
talk a little bit about deep learning
accelerators we've been building these
at Nvidia research for a number of
years let me talk about a few the first
is one that I actually did with a number
of my colleagues at Stanford EIE and
the reason we did this is we were
playing with these sparse networks and
and the conventional wisdom from every
person we talked to especially the
numerical analysts was you're in the
uncomfortable range of sparsity so it
turns out that you know most people have
sparse matrix packages we have one
called cuSPARSE and cuSPARSE starts
beating cuBLAS at a sparsity level of
about a half a percent in other words if
you're a half a percent dense you're
better off using the sparse matrix
package denser than that you're actually
better off just doing the dense
calculation and we're in this range of
ten to thirty percent depending on
whether you're an MLP or a
convolutional network so what we said is ok
that's great if you're running on you
know conventional hardware but we can
make sparsity work by building a
pipeline where we basically walk the you
know compressed sparse column structure
in hardware it takes almost no area
almost no energy to to do this and this
is the beauty of domain-specific
hardware we can wire in things that do
that so we basically did this
accelerator to show that we could make
sparsity and that trained quantization
there's a codebook lookup here where it
says weight decoder essentially for free
and show people yes when you're building
the hardware you can do things that are
not possible if you're just writing code
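
For readers who want to see what walking a compressed sparse column structure with a codebook lookup amounts to, here is a software sketch of a sparse matrix-vector product. It is a simplified illustration, not the EIE hardware design: the function names, the four-entry codebook and the tiny example matrix are all made up for this example, and the zero-activation skip reflects the activation sparsity mentioned earlier in the talk.

```python
def to_csc(dense, codebook):
    """Encode a dense matrix (list of rows) as compressed sparse column.

    Nonzero values are stored as indices into `codebook` rather than as values.
    """
    codes, row_idx, col_ptr = [], [], [0]
    for c in range(len(dense[0])):
        for r in range(len(dense)):
            if dense[r][c] != 0:
                codes.append(codebook.index(dense[r][c]))
                row_idx.append(r)
        col_ptr.append(len(codes))
    return codes, row_idx, col_ptr


def spmv(codes, row_idx, col_ptr, codebook, x, n_rows):
    """y = W @ x, touching only nonzero weights and nonzero activations."""
    y = [0.0] * n_rows
    for c, activation in enumerate(x):
        if activation == 0:
            continue  # a zero activation contributes nothing, so skip its column
        for k in range(col_ptr[c], col_ptr[c + 1]):
            weight = codebook[codes[k]]  # the "weight decoder" lookup
            y[row_idx[k]] += weight * activation
    return y


codebook = [-1.0, -0.5, 0.5, 1.0]
W = [
    [0.5, 0.0, 0.0, -1.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, -0.5, 0.0, 0.5],
]
x = [2.0, 0.0, 1.0, 4.0]

codes, row_idx, col_ptr = to_csc(W, codebook)
y = spmv(codes, row_idx, col_ptr, codebook, x, n_rows=len(W))
print(y)
```

In software the pointer chasing and the lookup cost instructions on every element, which is the overhead the talk says dedicated hardware can wire away.
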
one thing that's really interesting to
me looking at this plot which is the
EIE was an array of processing elements
and if you look at each processing element
it's all RAM right the non RAM stuff is
a little thing in the middle labeled
arithmetic and and this is actually true
of many accelerators they tend to be
completely memory dominated in this case
the memory is needed to hold the sparse
matrix and to hold that compressed sparse
column structure that has the pointers
into the sparse matrix so with EIE we
showed we can do sparsity really well
for fully connected layers we then at
Nvidia research did a project called
SCNN where we looked at the problem for
convolutional layers where you have an input activation
that looks kind of like this it's got
maybe you know after after applying relu
all the negatives turn to zero and so
you wind up with maybe thirty to forty
percent non zeros
we're convolving that with after pruning
a kernel of weights that looks like this
and of course it fills in almost
completely until you apply the relu again
and then that red one will look like
this and so we looked at the most
efficient way to do this and the
approach we took in the SCNN was
basically just to read a bunch of weights
so we basically pack the
green so only the green ones are there
we get rid of all the white but with
each green square we have its
coordinates x and y similarly for the
this and every one of these green
squares has to be multiplied by every
one of those blue squares so we just do
that we read W weights at a time and
I inputs at a time and I think it was four
and four for the baseline
configuration we multiply everyone by
everyone so we produce sixteen products
and then we sort them all out on the
output by taking their coordinates
adding the coordinates together and that
basically says where to accumulate the
results this wound up giving about a 3x
boost in energy efficiency over over
running the computation dense so there's
you know we're looking at where we're
gonna get the next 10 X you can get a
lot from better number representation
there's probably a three X in there from
sparsity there's some other things
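
The SCNN idea described a little earlier (keep only the nonzero weights and nonzero activations with their coordinates, multiply every one by every one, and add the coordinates to find where each product accumulates) can be sketched in a few lines. This is a single-channel, stride-one illustration with made-up data; the real design processes small groups of weights and inputs per step and handles many channels, none of which is modelled here.

```python
def nonzeros(grid):
    """Pack a 2D grid into a list of (row, col, value) for its nonzero entries."""
    return [(r, c, v) for r, row in enumerate(grid) for c, v in enumerate(row) if v != 0]


def sparse_full_conv(acts, weights):
    """All-pairs products of nonzeros, scattered to output[row sum][col sum]."""
    out_h = len(acts) + len(weights) - 1
    out_w = len(acts[0]) + len(weights[0]) - 1
    out = [[0] * out_w for _ in range(out_h)]
    multiplies = 0
    for ar, ac, a in nonzeros(acts):
        for wr, wc, w in nonzeros(weights):
            out[ar + wr][ac + wc] += a * w  # coordinates add to give the target
            multiplies += 1
    return out, multiplies


def dense_full_conv(acts, weights):
    """Reference version that multiplies every pair, zeros included."""
    out_h = len(acts) + len(weights) - 1
    out_w = len(acts[0]) + len(weights[0]) - 1
    out = [[0] * out_w for _ in range(out_h)]
    multiplies = 0
    for ar, row in enumerate(acts):
        for ac, a in enumerate(row):
            for wr, wrow in enumerate(weights):
                for wc, w in enumerate(wrow):
                    out[ar + wr][ac + wc] += a * w
                    multiplies += 1
    return out, multiplies


acts = [[1, 0, 2], [0, 0, 0], [0, 3, 0]]   # sparse after ReLU
weights = [[1, 0], [0, -1]]                # sparse after pruning

sparse_out, sparse_mults = sparse_full_conv(acts, weights)
dense_out, dense_mults = dense_full_conv(acts, weights)
print(sparse_out == dense_out, sparse_mults, dense_mults)
```

In this toy case the sparse version does 6 multiplies against 36 for the dense one while producing the same output; real savings depend on the actual densities and on the cost of the scatter step.
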
this is actually a die that we just got
back recently we're experimenting with a
bunch of things one of them is
scalability so it's actually a 36 die
MCM and on each of the locations in the
MCM we have an array of 16 processing
elements and so we can scale this from a
single processing element which you
could use for some small IOT device to
something which does something in excess
of 2,000 images per second on ResNet 50
batch size one and its energy
efficiency is a hundred and five
femtojoules per op so about 50 teraops
it's 50 tera no excuse me I'm confusing
this with the later one this is excuse
me about 10 teraops per watt for doing
8-bit deep learning inference and so
this demonstrates a bunch of neat
technologies for communicating between
the MCMs but where it's getting a lot of
its performance for deep learning
from is partitioning the weights
and activation
so that things stay local so the weights
are actually local on each block we take
the input activations and partition them
by row both over the chips of the MCM
and the PEs of the array per MCM and
then partition the output activations by
column so everything is very local
communication flowing through the system
more recently we've done a study of
doing this type of tiling as numerical
analysts call it at multiple levels where
in addition to sort of reading out of a
very small weight buffer we actually have
a weight collector it's like a four to
eight element thing that holds just a
few weights and by doing this I'm gonna
jump forward just getting a little bit
behind my schedule here we're able to
demonstrate it again in the same
technology these are all 16 nanometer
numbers thirty teraops per watt on
inference and and so you know we're
constantly trying to raise that bar
ourselves to see where we're gonna get
that next jump in inference performance
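
The efficiency figures in this part of the talk are quoted both as energy per operation and as teraops per watt, and the two are reciprocals of each other. A small conversion sketch follows; the helper names are invented for this example, and counting one multiply-accumulate as two operations is a common convention assumed here rather than something stated in the talk.

```python
def tops_per_watt(femtojoules_per_op):
    """1 TOPS/W is 1e12 operations per joule, i.e. 1000 fJ per operation."""
    return 1000.0 / femtojoules_per_op


def femtojoules_per_op(tops_w):
    """Inverse conversion: energy per operation for a given TOPS/W figure."""
    return 1000.0 / tops_w


# 105 fJ per op is a little under 10 TOPS/W.
print(f"{tops_per_watt(105):.1f} TOPS/W")

# 30 TOPS/W corresponds to about 33 fJ per op,
# or about 67 fJ per MAC if one MAC is counted as two ops (assumption).
energy = femtojoules_per_op(30)
print(f"{energy:.1f} fJ/op, {2 * energy:.1f} fJ/MAC")
```
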
one of the things that's not even in
this chip yet but we're looking very
carefully at is making the on-chip
communication more efficient remember
moving this data around is really
expensive and one way of doing this is
to use every electron twice so normally
you start at the power supply that's
like you know point eight or one volt
and you get to use that electron once
as it drops to ground to signal a one or
a zero you know through an inverter what
we do here is we introduce a mid plane
which is half the supply voltage so this
is 0.8 that's 0.4 we get to use the
electron once on the top for sending
this bit across we use it again on the
bottom for sending this bit across and
we actually gain 4x by doing this
because the energy goes as the square of
the voltage it seems like you're getting
something for free here it ought to be
2x but it's actually 4 there's a paper
on this at ISSCC in 2016 and again it's
one of the cards we have on the table to
keep those plots going upward so I kept
on getting pestered by our board members
about analog because many of them had
invested in these various startups doing
analog deep learning so I did what I
usually do is I go over and I talk to my
colleagues at Stanford and in fact Boris
Murmann is a great resource on this
he's done a lot of analog neural
network chips and after
talking to him for a couple hours I
realized that it's all about the data
conversion so even if I spot all these
people that they can do really
efficient vector matrix multiply in
analog by taking a vector of
activations as voltages a matrix of
weights as conductances and then
applying Ohm's law to get current equal
conductance times voltage I'm going to
give that to them for free and charge
them only for the cost of doing the data
conversion back at the end of the layer
but I'm going to be very scrupulous
about precision if you want to match
8-bit inference performance you have to
do the data conversion with enough bits
that you're not you know after you
summed up a bunch of things you're not
throwing away significant bits so it
turns out that Boris actually maintains
this database of all the analog to
digital converters anybody has ever done
that's all the points on this chart and
it turns out that they're they're
bounded by two fundamental limits it
costs something to do a conversion no
matter how many bits it is and then it
costs exponentially more to do
conversions as the number of bits goes
up above about ten and so if you take
that data you assume that the
multipliers are free but you still have
to do a conversion per layer well
it turns out that even if you can sum as
many multiplies together as you like before you do
the conversion you can't win that way
and this is one of the ways that physics
just works amazingly well however many
you have to sum together here you just
wind up needing more bits of precision
by exactly the same slope so you wind up
on the same line and it turns out that
to match 8-bit inference performance
takes over a picojoule per MAC remember
we have demonstrated you know in chips
that are in the lab right now 200
femtojoules per MAC so they're off by a
factor of five and that's not counting the
energy they're using to do the
multiply and everything else in the
system just doing the A to D conversion
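The converter argument above can be sketched numerically. This is a minimal sketch under stated assumptions, not the analysis from the talk: the function names are hypothetical, and the bit count is a worst-case bound for unsigned operands that the talk does not spell out. It shows the idealized analog dot product (Ohm's law per element, currents adding on a shared wire), how the bits needed at the converter grow by about one for every doubling of the number of summed products, and the factor of five between the quoted figures of over a picojoule and 200 femtojoules per MAC.

```python
def analog_dot(voltages, conductances):
    """Idealized analog dot product: each element contributes I = G * V
    (Ohm's law) and the currents add on a shared output wire."""
    return sum(g * v for g, v in zip(conductances, voltages))


def adc_bits_needed(operand_bits, n_terms):
    """Bits needed to digitize a sum of n_terms products of two unsigned
    operand_bits-wide values without discarding any bits.
    Assumption: worst case, every operand at its maximum value."""
    max_product = (2 ** operand_bits - 1) ** 2
    return (n_terms * max_product).bit_length()


def energy_ratio(analog_fj_per_mac, digital_fj_per_mac):
    """Ratio of two energy-per-MAC figures, both given in femtojoules."""
    return analog_fj_per_mac / digital_fj_per_mac


if __name__ == "__main__":
    # Two activations (volts) times two weights (siemens) gives a current.
    print(analog_dot([0.5, 0.25], [2.0, 4.0]))
    # Summing more products before converting needs more converter bits.
    for n in (1, 16, 256, 512):
        print(n, adc_bits_needed(8, n))
    # 1 pJ (1000 fJ) for conversion alone versus 200 fJ for a digital MAC.
    print(energy_ratio(1000, 200))
```

Under this bound, summing more products before the conversion does not remove the cost; it moves the converter toward the region where each extra bit is exponentially more expensive.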
the thing that's interesting though is
if you start walking up some of these if
you went from you know you know 8 to 7
to 6 to 5 around 4 or 3 bits of
precision analog might make sense of
course at that level the digital is a
lot less expensive as well remember
we're getting 70 femtojoules per mac and
that your one chip that I showed you a
little while back digitally now one
reason we're able to keep pushing the
edge on deep learning hardware at Nvidia
is that we eat our own dog food and
so we have a big research organization
that takes our deep learning hardware
and frameworks and
software at each level and tries to push
the state-of-the-art with it so we have
among the best semantic segmentation
systems we have currently the top of the
leaderboard in optical flow this is
actually a very interesting network I
encourage you to go to the CVPR paper
last year because you know it did not
take the end-to-end approach which many
people took for deep learning optical
flow instead we took you know twenty
years of computer vision research and
optical flow took the best ideas of how
to do optical flow without deep learning
and applied deep learning to them and
wound up getting really kind of the best
of both worlds and then because we need
to process lidar data from our
self-driving cars we have among the best
3d segmentation for point clouds now our
core business is graphics so we also
have made a practice of applying
AI to graphics this comes in really two
main categories and I'll only talk a
little bit about one which is content
creation a lot of people don't
appreciate it but the video game
industry is a bigger industry than the
motion picture industry by revenue and
typically producing a triple-A video
game costs significantly more than
producing a major motion picture
most of the triple-A titles break a
billion dollars and most of that money
goes to artists it's artist time and so
the key thing to enable the industry to
do more is to automate content creation
these are some figures out of a paper we
had with people at one of the game
developers where instead of having you
know if you wanted to animate a face
they would typically use what's called a
rig with about 300 control points to
move different muscles of the face and
producing something like a 10-minute
actually maybe a ten-second
segment would take about three weeks of
artist time laboriously controlling the
rig until it looks correct what we did
is we trained a neural network with an
actor and then we're able to take an
audio trace to animate the face speaking
and in a double-blind experiment you
couldn't tell it from the one that
somebody spent three weeks producing and
it basically ran in a few seconds on a
GPU the other three here I'll talk about
in a little more detail this never works
and so I have the backup queued up if I
can only oh I see I've got to select that
and then I need to mirror my displays
and you should see this so this is how
we take the tensor cores on Turing
and make them useful for graphics
everybody in deep learning is always
worried about the tiny little bit of
area we have devoted to graphics and
that it doesn't make deep learning
faster we're always worried about the
other thing which is all the area we
devote to deep learning that doesn't make
graphics faster this is the technology
that I'm very happy about because we
developed it in NV research which is the
organization I run it's called deep
learning super sampling and so what we do
is we take this image and we actually
render it at one resolution and then
feed it into a deep network that
up-resolutions it and you know the
typical case is 1440p to 4k because this
is an HD monitor I believe this is 720p
to HD and you can't really appreciate
until I run this slider across but the
DLSS on is on the left and so you see
how things to the left of this like
those little features on the back of
this booth just pop out when the slider
goes across and where are these little
glass ornaments you can't really see
what's inside them until you turn the DLSS
on going over it this is one great
way of taking deep learning performance
and making graphics better now I'm
showing you the easy case here with a
still image you know anybody can do a
still image what makes this hard is that
if you apply this to a video sequence
and you're hallucinating those pixels
that were never shaded you have to do
that consistently frame-to-frame
or it looks really really objectionable
you get little wigglies and you get
things flashing on and off and and and
that's a no-no you will get you know
drummed out very very quickly if you do
that so let's go back to the main talk
extend desktop why is that not coming
back that is bad let's try this again go
back to here
it keeps coming back to that there we go
well I've got this here but I don't have
it over there huh it just never works
rebooted the other thing I can do is
just quit the thing that's taken over
let me go back to here I will quit that
hold that to quit okay
now that will give me control back I
hope this is another example of how to
apply deep learning to graphics so if
you watch a major motion picture that's
either animated or one that's
live-action but it has CGI segments in
it the the artificial portions of it are
done with the technique called ray
tracing where you actually cast rays
from the eyeball into the scene see what
they bounce off of and you know then
ultimately you try to after many bounces
connect them to a light source and then
run that backwards to compute what what
color you get a major motion picture
will probably cast about ten thousand
rays per pixel and that will take many
hours on a farm of hundreds to thousands
of CPUs we don't have many hours or
hundreds to thousands of CPUs we have
one GPU and 16 milliseconds so what we do is
we cast five rays per pixel to get an
image that looks kind of like this so
this looks like what your camera would
take if you set your ISO to like 200
thousand right it's a very grainy noisy
image and then we feed it through a deep
neural network and we get this beautiful
image again double-blind experiments
people can't tell the difference and
this is what's enabled us with our RTX
technology to bring ray tracing and
physically based rendering to real-time
graphics we couldn't do it if we had to
do ten thousand rays per pixel but if we
can do a few rays per pixel and clean it
up with a deep network it makes a whole
nother level of realism in in graphics
possible we've also been doing a
lot of work with GANs about a year ago
we developed this technology called
progressive GAN and the
thing we realized is that you know in a
GAN we have these two networks you
have the generator Network here and the
discriminator network
the generator network's never seen a
picture of somebody all it sees are
these random variables called latent
variables we feed in and it produces a
picture and it gets one bit of feedback
good or bad right good you fooled me I
thought that was a real person or bad
that's obviously a fake by the
discriminator Network which is trying to
learn at the same time and nobody before
we had done this had been able to do good
high-resolution images they started
getting blurry and the reason is you
just have too many free variables you've
got this huge generator Network and huge
discriminator network initialized to
random variables all trying to learn the
stuff with neither having any idea of
what it's doing so we thought about this
we realized the right way to do this is
the way you would teach you know you
know a kid you know mathematics you
don't start with differential geometry
and and you know give them the big black
hole and tell them about general
relativity you start with you know
arithmetic or maybe set theory or
something like that and work your way up
so we started with a very simple network
we learned four by four and the
discriminator will discriminate four by
four and when it gets four by four right
we'll move up to 8 by 8 and then 16 by
16 and so on and by doing this in a
progressive way what's often called
curriculum training we're able to get
the network to be stable and converge to
something that produces really good high
resolution images here's the movie of
the training it shows the resolution its
training currently at the bottom left
and the number of days I think on a
single v100 in the middle and you see
that you know by the time it gets to 256 by
256 the images are looking pretty good
we can go to a thousand by a thousand
and the images are still crisp even
though remember this generator has
actually never seen a picture of a real
person although it's getting this
feedback about whether the images it
produces are real or fake now just
recently we improved on this
significantly by developing a new
network called StyleGAN and what we
show here on the left side is the
way our progressive GAN and most GANs work we
feed the latent variable in the top and
your image comes out the bottom and it's
kind of a flow down between the layers
well it turns out that we can get much
better results if we don't do it that
way and so we just feed a constant into
the top and instead we take our latent
variable we feed it to a multi-layer
perceptron and then we factor it into
pieces we feed into different layers
and by doing this we can take different
parts of the latent variable and use
them to control different aspects of the
image we're producing we can control
different resolutions different aspects
of of the face so we can independently
control things like hair color eyes
teeth and the like these are our images
coming out of the StyleGAN as we
interpret that latent variable at
different scales let me see how we're
doing on time here okay so let me wrap
up actually I'm ahead of time it's
amazing so you know deep learning has
fueled you know a revolution in
almost every aspect of our life
transportation healthcare education and
graphics and as a hardware designer yeah
I'm really happy about this because it's
been enabled by hardware it's been this
co-design between hardware and
algorithms but the core algorithms have
been around for 30 years waiting for the
hardware to catch up with them and so
it's kind of great that the hardware
finally did and now that we're here
progress is gated by hardware if we
don't continue to build faster and more
efficient hardware we will plateau we
will stagnate because we can't run
bigger models and bigger networks
without better machines to run them on
so we've been able to increase the
performance of GPUs probably by an
average of about 3 X per generation you
know since the Kepler generation without
any help from Moore's Law right we had
one process jump in there where we
jumped from 28 to 16 nanometer that gave
us maybe 20% it didn't help much on that
3 X and it's been by being clever about
architecture and circuits going from
doing you know FP 32 on Volta to doing
int8 on Turing but also adding you
know the the dot product instructions
and the tensor cores and a lot of other
features that have been geared toward
toward deep learning and that's where we
get today with you know Volta at 125
teraflops for training and Turing at 261
tera-ops that's a Quadro 8000 of
int8 and 512 tera-ops of int4 and at
int4 we're about 3.6 tera-ops
per watt for inference for embedded
systems we have our Xavier SoC
which has 20 tera-ops of deep learning
performance part of which is in the
Volta cores part of which is in two deep
learning accelerators by the way the
deep learning accelerator is open source
you can go to nvdla.org
download the RTL many people are using
it in their own SoCs and the
question now and what I sort of wake up
very excited about working at every
morning is how do we continue scaling
this performance from actually this
pretty impressive point we're at today
without any help from Moore's Law and
I've shared with you some of the things
we're going to be doing precision we
need to use the fewest number of bits
that we can and make each of those bits
count which means we don't do uniform
sampling but we need to do it in
a way that the arithmetic operations are
efficient we can't afford codebook
lookups and and high precision
multiplies like our original
trained quantization work we need to
move to supporting sparsity better we
already support sparsity quite well in
our DLA by the way which which has
support for sparsity in both data gating
of the multiplier array and in all the
data transfers being in compressed form
with the zeros squeezed out and then
accelerators are really sort of how
we're prototyping a bunch of these
concepts but with the idea that in the
long run they'll they'll go both into
future versions of the DLA and future
tensor cores that would then support new
data types support sparsity and other
things going forward now what keeps us
honest about this is we use our own
stuff we eat our own dog food and I gave
you some examples of how we apply deep
learning to perception to graphics and
to image synthesis and with that
I'd be happy to take some questions
we have lots of time for questions and
I'm sure there will be there'll be many
this one Oh
so to what extent do you feel like the
restriction or gating on the
performance is due to the GPU versus not
owning the host processor oh so to what
extent is the gating on performance
the GPU and not the host processor so
very little when we first started doing
some image work we wound up being a
little bit limited by the host processor
doing things like decoding the JPEG and
and shuffling the images because when
you do the epochs you need to have
each batch be independently sampled and
so we moved all that to the GPU with
with our package called DALI and now
we're not you know we're not constrained
so like on a DGX-2 where we have I think
you know pretty good Xeons those
are not the gating factor
it's GPU limited actually an
announcement to make
there's a lost wallet named gem Lal
Salim at registration just so you know
yeah and I believe so but yeah so bill
is here if you have any other questions
please come up and talk to him we're now
going to coffee break we're back at 3:30
p.m. so we have 45 minutes and see you
Thanks oh one question over there sorry
okay yeah so question about the
architecture so you mentioned about
Moore's law you and Dave both mentioned that
Moore's law is going away and
it's basically now up to
the architecture and what do you foresee
as the Moore's law of architecture
is there
something you foresee in future
happening to the architecture
similar to the process yeah
yeah that's a great question so
unfortunately what was great
about Moore's law
and Dennard scaling even more so it was
sort of the gift that kept giving right
every generation you know you'd shrink
line widths and everything would get
more efficient and and you'd be able to
turn some of that efficiency into better
performance and Moore's law which is
really about the economics of
semiconductor manufacturing said every
generation the transistors get cheaper
in the next generation they got cheaper
again the problem with architecture is
every clever idea you think up as an
architect is a one one-time play you
play that card and it's played right you
add tensor cores okay we've done that we've
played the tensor core card what are we
going to do next time
now I think that the one trend and I
think Dave said it as well is that we're
moving to more domain-specific
architectures because we can get more
efficiency that way but even saying that
each thing you do with the
domain-specific architecture is still a
one-time play right so to continue to
scale performance to get you know the
three to five X we've been giving you on
every generation of GPU for deep
learning the really easy obvious ideas
have been done right we have to think of
the harder or less obvious ideas and
just have a lot of smart people trying a
lot of things and seeing what works and
eventually we will reach the point of
diminishing returns where it gets harder
to find those things right now we're not
at that point we have lots of good ideas
we can see our way through the next
several data points but it will get
harder as time goes on thank you
we're warming up on questions now go
ahead please you mentioned some of the
compression methods that require a table
lookup for the weights are not energy
efficient can you elaborate a bit more
on that yeah so the paper that Song
Han and I published at ICLR in 2016 we
basically wanted to find out what sort
of theoretically were the fewest bits we
could use to quantize a weight so we we
observed as many people have that
anything you can differentiate you can
train using back prop and so we
basically did back prop into our
codebook and trained a codebook to
basically find the optimal set of
weights given a number of weights to
represent the weights and to do this
you had to do some k-means clustering to
group the weights together and then back prop
what value would do that now the
reason that's not energetically
efficient is if you look at what you
have to do now to do the arithmetic
operation first you have to do a lookup
from a small table right we use a latch
array for that so it's not as bad as you
know the two and a half picojoules you
would use reading sixteen bits from a
ram but it still winds up being hundreds
of femtojoules and then we have to do a
16-bit multiply because decoding the weight
we decoded it to sixteen bits so we
wouldn't lose any accuracy there and
that 16-bit multiply is four times as
expensive as doing an 8-bit multiply so
that pipeline of table lookup 16-bit
multiply is a very expensive thing so it
turns out that if you really want to use
a small number of bits efficiently you
need to come up with a numeric
representation which samples
non-uniformly putting most of
the symbols where the interesting things
happen but also admits very inexpensive
arithmetic and there are such things but
I'm not prepared to talk about them yet
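The pipeline described in this answer (cluster the weights with k-means, keep a small codebook plus one index per weight, look each weight up before multiplying, and back prop into the codebook) can be sketched as follows. This is an illustrative sketch only: the function names, the evenly spaced initialization and the plain-Python 1-D k-means are assumptions, not the code from the paper.

```python
def assign_to_nearest(weights, centroids):
    """Index of the closest codebook entry for every weight."""
    return [min(range(len(centroids)), key=lambda c: abs(w - centroids[c]))
            for w in weights]


def kmeans_codebook(weights, k, iters=20):
    """1-D k-means over the weights (k >= 2).
    Assumption: centroids start evenly spaced between min and max."""
    lo, hi = min(weights), max(weights)
    centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    for _ in range(iters):
        idx = assign_to_nearest(weights, centroids)
        for c in range(k):
            members = [w for w, a in zip(weights, idx) if a == c]
            if members:  # leave an empty cluster where it is
                centroids[c] = sum(members) / len(members)
    return centroids, assign_to_nearest(weights, centroids)


def decode(centroids, idx):
    """The table lookup: turn stored indices back into weight values."""
    return [centroids[i] for i in idx]


def codebook_gradient(weight_grads, idx, k):
    """Back prop into the codebook: by the chain rule, an entry's gradient
    is the sum of the gradients of all weights that share that entry."""
    grads = [0.0] * k
    for g, i in zip(weight_grads, idx):
        grads[i] += g
    return grads


if __name__ == "__main__":
    weights = [-1.0, -0.9, 0.0, 0.1, 0.9, 1.0]
    codebook, idx = kmeans_codebook(weights, k=3)
    print(codebook, idx)
    print(decode(codebook, idx))
    print(codebook_gradient([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], idx, 3))
```

Each stored weight becomes a small index, but the arithmetic still runs on the decoded higher-precision value, which is the table lookup followed by a wide multiply that the answer calls expensive.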
in general if you have questions please
go up to the microphones I think part of
the confusion with questions has been
there are three microphones one two or
three if you have questions please
please go there yeah how do you see the
advent of non-volatile memory changing
the GPU landscape things like memristor
technology stuff like that I don't see
any really substantial change there
we're always looking for better memory
technologies we have many active
collaborations going on with the major
memory companies and so we'll take what
we can get in either more capacity or
fewer dollars per bit what really
matters to us though is bandwidth and
energy per bit moving moving the data on
and off and I don't see the non-volatile
memories making a big dent there many
people are applying non-volatile
technology both flash and memristor to
trying to do these analog multiplies and
so we kicked off a project to try to
replicate some of those results and the
result of that was this paper that will
be in DAC this year which I showed two
figures from where we found that we were
completely dominated by the A to D
converters and that you know using a
given noise model we wound up with
energy per bit for certain accuracy
levels substantially higher than
we could do with digital
GPUs are virtualized now and so what are the
techniques being used to make the
context switching times smaller and
smaller so they're virtualized now but
the context switching time is not very
small nor do we feel a lot of pressure
for it to be small because we don't do a
lot of very rapid context switching okay
so what's the order of I would have to
guess I don't have that on the top of my
head I would have to get that number
thank you yeah let's thank Bill one more
time thanks Bill [Applause]