>> Kenneth Tran: Okay.
So let's get started. Hello, everyone. It is my pleasure to introduce
Tianqi and Eric from the University of Washington. Even though they are
still students, they are working on a very cool project: a deep learning
toolkit, a very powerful toolkit in the open source world, MXNet. For
your information, Tianqi is also the main author of another very powerful
open source machine learning package called XGBoost for gradient boosted
tree learning. So welcome, Tianqi and Eric, to Microsoft to give a talk
on MXNet.
[applause.]
>> Tianqi Chen: Okay. Thank you, Ken, for the introduction. It is a
great pleasure to give a talk here. Today we are mainly going to talk
about MXNet: the insights that we learned from building the system, the
motivation, and why we designed it the way we did. And Eric is going to
talk about some of the detailed usage and the typical use cases.
First I'm going to acknowledge our collaborators, including contributors.
We especially have two Microsoft contributors here. Chuntao is from
Microsoft Research, and Tianjun is sitting in the audience; he is now on
the Microsoft Bing Image team. They joined us at the very beginning.
And if you have questions related to the internals, you are more than
welcome to ask them if I am not the right person to answer.
And we also want to thank Chuntao for suggesting the back end API, which
is used for language understanding and has come in very handy recently.
So this is the outline for the talk. Today we are going to talk about
deep learning systems. As you know, there are a lot of different deep
learning frameworks these days, and they each come with their own unique
features. Ken gave a very nice overview of the advantages and
disadvantages of all these systems. MXNet is not there yet, but
hopefully it will be. Today I want to discuss a very different
perspective. All the existing surveys of these systems ask the question:
what are the nice features of each system in its current stage? Let's
ask a different question: how will each of these systems evolve, and what
will it eventually become, given enough engineering power and all the
effort you can throw into it? Different systems provide different user
interfaces, and those interfaces will eventually limit their
capabilities. The interfaces are like the bounds of the systems, and the
engineering efforts are the means within those bounds. These bounds
eventually determine the limit you can reach in terms of flexibility and
the power of those systems.
So today I am going to talk from that perspective, starting with
programming models. Normally, if you think about deep learning or any
task in machine learning, you need to program it; that's the interface
you have. There are two programming models that you can use. One type
is imperative programming. Here is an imperative program in NumPy style,
which gives you the result E here, a very simple result.
This is not deep learning yet, but as you all know, Microsoft has a
toolkit called CNTK, the Computational Network Toolkit, and Google has
another famous framework called TensorFlow. All of these operate on
computational graphs. And bear with me, this is the basic pattern that a
deep learning toolkit needs to be able to execute efficiently.
What imperative programming does is this: you implement all those
operations, like matrix multiplication; in neural networks there will be
more complicated things like convolution and element-wise addition. An
imperative program does those things step by step. As you write your
program and execute each statement, it basically translates into a kernel
call on the GPU or some CPU function.
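A minimal sketch of the imperative style in plain NumPy, in the spirit of
the example on the slide (the exact variables there are not reproduced
here):

    import numpy as np

    # Imperative style: every statement executes as soon as it is reached.
    a = np.ones(10)
    b = np.ones(10) * 2
    c = b * a          # computed right here, as a kernel/CPU call
    d = c + 1          # computed right here as well
    print(d)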
On the other hand, the other, more typical way in most deep learning
platforms is called declarative programming. In declarative programming,
you first declare the computations you want, and then you send the
computational graph to some compile function or an equivalent of a
compile function. Then you can use the compiled function to get your
result.
So the difference from imperative programming is that in declarative
programming you first declare everything you need, and the computation
only happens after you do the compilation. A typical example of
declarative programming outside of the deep learning context is SQL for
databases, which I think many of you are familiar with. Many of the
existing deep learning frameworks use this declarative style. For
example, configuration-file-based frameworks like Caffe or CNTK have you
declare a network; that is basically the declarative part. Other
frameworks like TensorFlow or Theano offer a Python API, which is not
very different from a configuration file but gives you a more flexible
way of declaring the computation you need.
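For contrast, here is a hedged sketch of the same kind of computation in
a declarative style, using the classic MXNet 0.x symbolic API (the shapes
and values are placeholders):

    import mxnet as mx

    # Declare the computation first; nothing runs yet.
    a = mx.sym.Variable('a')
    b = mx.sym.Variable('b')
    c = b * a
    d = c + 1

    # Compile/bind: only now are memory and kernels planned and allocated.
    exe = d.simple_bind(ctx=mx.cpu(), a=(10,), b=(10,))
    exe.arg_dict['a'][:] = 1
    exe.arg_dict['b'][:] = 2
    print(exe.forward()[0].asnumpy())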
So what are the differences between imperative programming and
declarative programming in the context of deep learning? Think about
deep neural networks. Here is a simple neural network. What declarative
programming gives you is essentially a black box: usually you declare
what the computational graph is, here a one-layer neural network, then
you throw it into the toolkit; the toolkit takes the input, gives you the
output and the gradient, and with this the toolkit is able to give you
whatever you require.
On the other hand, if you do imperative programming, you write the script
layer by layer. Basically you have some functions like layer-one forward
and layer-two forward, and then you call back propagation. Of course,
some popular systems do the wrapping for you, so that, for example, in
Torch you can build all those configurations and the imperative part is
hidden in the code. You can always dig into the code and hack, if you
are a good hacker or researcher, because for any advanced need you will
have to.
So these are two programming styles. What are the differences? One
advantage I'm going to argue for declarative programming first, because
most of the frameworks follow the declarative paradigm, including
basically all of the deep learning frameworks you see except for Torch, I
guess.
The declarative style is more efficient, mainly because it allows the
system to do more optimization: you don't specify the procedure, you only
say what you need. Take this for example. Assume that I want to do
these calculation steps. The left side is the imperative program and the
right side is the declarative program. How much memory do you need to
finish the calculation, say in a Python console?
On the left side you can see that there are four variables, and each
variable is an array of, say, ten 64-bit floats. Normally it would take
four arrays to finish this calculation. On the other hand, with a
declarative program, the system can find that C and D can actually share
memory, because C is an intermediate value that isn't seen by the user.
So it can optimize and use three copies of memory instead of four. This
might be a very naive example, but it demonstrates the difference between
declarative and imperative.
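A hedged sketch of that four-versus-three-array argument in plain NumPy
(the shapes are placeholders; a real declarative planner does this
rewriting automatically):

    import numpy as np

    # Imperative: four live arrays, because the runtime cannot know
    # whether c will be read again later.
    a = np.ones(10)
    b = np.ones(10) * 2
    c = b * a
    d = c + 1           # a, b, c, d all alive: four arrays

    # What a declarative compiler may do once it sees the whole graph and
    # knows c is never seen by the user: write d into c's buffer.
    a = np.ones(10)
    b = np.ones(10) * 2
    c = b * a
    np.add(c, 1, out=c)  # "d" reuses c's memory: three arrays in total
    d = c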
In the imperative world you specify the calculation procedure, and that
might not be optimal. Another key property is that the imperative
program needs to be prepared for all possible futures. Say you are
writing this program in a Python console. At this point, the imperative
program doesn't know whether C will be used in the future or not, so it
cannot directly reuse the memory. Of course, the Python garbage
collector can recycle memory once it is no longer referenced, but this is
somewhat limited in the deep learning setting, for which we will give a
more concrete example on the next slide.
On the other hand, in the declarative setting, when you do the
compilation the declarative program sees the entire computational graph
as well as its boundary. By boundary I mean that the program can see
what the inputs and outputs are and what the intermediate stages are.
All the intermediate stages cannot be seen by the user, which means they
can be freely optimized away, rewritten, or subjected to the memory
optimization that we mentioned earlier.
A more realistic example in the deep learning world is this case study of
the memory cost needed to run an N-layer neural network, if you only want
to do prediction in a production environment instead of training.
If you use the imperative programming style, normally you will call these
N forward functions. Because most deep learning frameworks are optimized
for both forward prediction and back propagation, the intermediate states
are kept in those data structures, so they cannot be freed.
This is for the same reason as before: the imperative program needs to be
prepared for the possible future of back propagation. On the other hand,
in a declarative program you declare only the forward computational graph
and call the compilation function; the compilation function sees the
graph and knows that all those memories can be shared in place. As a
result, you get a much lower memory cost when running a declarative
program for prediction: normally you only need two copies of memory that
alternate with each other. Yes?
>> Audience: Can you explain why the memory can be shared across
different layers? Because when you do back propagation, A, B, C belong
to --
>> Tianqi Chen: Exactly. In our use case we are only doing prediction
instead of back propagation. So imagine that in this use case you only
want to do prediction. This computational graph is not about back
propagation, so this memory can be saved.
That's exactly why an imperative program costs more memory: it wants to
be prepared both for the possibility of back propagation and for the
possibility of only doing forward prediction. Of course, you can
optimize the program by adding special cases like "I only want to do
forward prediction," but that adds another layer of complication to an
imperative program.
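A hedged sketch of the prediction-only case with the old MXNet 0.x API:
binding with grad_req='null' tells the planner that no gradients will
ever be needed, so intermediate buffers can be recycled aggressively (the
network and shapes below are placeholders):

    import mxnet as mx

    data = mx.sym.Variable('data')
    net = mx.sym.FullyConnected(data=data, name='fc1', num_hidden=128)
    net = mx.sym.Activation(data=net, name='relu1', act_type='relu')
    net = mx.sym.FullyConnected(data=net, name='fc2', num_hidden=10)
    net = mx.sym.SoftmaxOutput(data=net, name='softmax')

    # grad_req='null': back propagation will never happen, so the compiled
    # executor is free to share memory across layers.
    exe = net.simple_bind(ctx=mx.cpu(), data=(32, 100), grad_req='null')
    exe.arg_dict['data'][:] = mx.nd.ones((32, 100))
    prob = exe.forward(is_train=False)[0]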
So there are a bunch of optimizations you can do in a declarative program
because of this advantage: it sees the entire global scope of the
computational graph instead of a single local step. One example is
dependency pruning. Here the left side is the forward propagation step
and the right side is the graph corresponding to the back propagation
step. Because there is an addition here, you can find that back
propagation actually doesn't depend on the variable F, so that memory can
be freed as early as the forward propagation step; we don't need to keep
it until the back propagation step.
And if you only need some intermediate value such as variable C, you
don't even need to carry out all the computations. This is a more
advanced version of dependency pruning. Yes, you can hard-code it into
your framework, but declarative programming gives it to you for free if
you have a good optimizer.
Another thing you can do is operator fusion. Here, because C is an
intermediate variable that is not referenced by the user, you can easily
optimize it away. Say you have a fused kernel that does the matrix
multiplication plus a bias addition, which is commonly available in many
frameworks: you can do this kind of operator rewriting on the program.
However, you cannot do that in the imperative setting, for the same
reason: there is a possible future in which C might be referenced.
And there is another optimization that I mentioned already, which is the
memory sharing here. So I have advocated a lot for declarative
programming so far, and it is good that many of the existing deep
learning frameworks do follow this declarative use case, in the sense
that you declare the neural network and then the system optimizes the
computational structure. So why do we still need imperative programming?
Otherwise everything so far is good and we could end the talk here. So
why do we still need imperative programming? Imagine the same situation
for SQL and databases. Imagine you want to develop a web app that runs
an online banking system, and you have a good, secure server in Microsoft
SQL Server.
It does all the good jobs for you: it does query optimization and gives
you results quickly. But you still need some front end language like C#
and JavaScript to interact with it. So why do web developers still need
these front end languages? The reason is that a declarative language is
limited in what it can do. There are certain things that declarative
languages are not good at. Take, for example, how you would do a
parameter update in a declarative language. It's possible, but you need
to extend the language: TensorFlow needs to support mutation and control
flow in the computational graph, which is not very natural and requires
some engineering effort.
Other more complicated examples include optimizations like a
limited-memory BFGS step, which requires a line search; think about how
you would write a line search program in SQL. I don't know how to write
that; maybe there are some extensions to SQL that can do it. There are
other use cases like variable input lengths, which are not very easy to
describe with a predetermined computational graph and need the advanced
features of an imperative language to do the dynamic unrolling. One of
the advantages of an imperative program is that it is deeply embedded in
the host language. TensorFlow and Theano also claim to support Python,
but actually, when you write TensorFlow or Theano programs, you are
writing another language that is called TensorFlow. It is like writing
SQL in Python: you are not really writing Python, you are writing the SQL
language.
So while you are writing those declarative programs in Python, you are
actually writing in the context of that declarative language. However,
if you write imperative programs in Python, you are actually writing
Python, because you can take advantage of all the advanced features like
conditionals and loops. Here is a very naive example. Say I have two
neural networks: for input data that is bigger than ten I use the first
type of network architecture, and for the other inputs I use the second
type of architecture.
If you have an imperative program, you can very easily do the run-time
dispatching and select the path that you want. There are more
complicated examples: you might want to look at the output of a neural
network and decide which path you want to go down, which is very common
in reinforcement learning. And in the case of recurrent neural networks,
say you want to do prediction: you want to dynamically unroll your
predictions, and the length of the predicted sequence is not determined
in the beginning.
Of course, all these things can be hard-coded in your system and
optimized for specific use cases. But to support them in a generic way,
we do need some flavor of imperative programming in order to do that
easily.
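A minimal sketch of that run-time dispatch idea in plain Python; the two
"networks" and the size threshold are toy stand-ins, not real MXNet
executors:

    def small_net(x):
        return [v * 2 for v in x]        # stand-in for a shallow architecture

    def large_net(x):
        return [v * v + 1 for v in x]    # stand-in for a deeper architecture

    def forward(x):
        # Ordinary Python control flow decides, at run time, which graph runs.
        net = large_net if len(x) > 10 else small_net
        return net(x)

    print(forward([1, 2, 3]))            # dispatches to small_net
    print(forward(list(range(20))))      # dispatches to large_net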
So that's why we want to advocate for a mixed flavor, and MXNet kind of
stands for "mixed flavor." We want to combine the power of imperative
programming with the power of optimized declarative programming.
Hopefully you are now convinced that imperative programming is more
flexible and lets you write whatever you need, say inserting your own
update step into your training loop. On the other hand, declarative
programs can be quite powerful in the sense that the system can optimize
the memory, the computation, and everything else behind the scenes for
you.
In MXNet we provide two sets of APIs that are coupled together, in the
sense that they work together as one piece. We have NDArray, a parallel,
imperative array library that basically executes as you go; that's the
imperative programming part. And we have a symbolic API that does the
declarative programming part.
Putting it together, this is an example of the basic training loop in
MXNet. What you basically do is write an imperative loop here, and we
have a dictionary from the string key of each input to the corresponding
array. Every iteration, you copy your input into the corresponding
arrays; that is an imperative step. The architecture itself is compiled
from the declarative part: you declare your neural network and compile it
into an executor, and you call the executor to go forward and backward.
That is the optimized part, optimized for memory.
Then you use an imperative step. Here it is an imperative SGD step, but
we can definitely put in more advanced steps, like the variance reduction
methods developed here, or other types of optimizers. The fact that we
are able to use imperative steps actually helps us a lot: after we
released MXNet, in the beginning we only had the SGD and momentum
optimizers, and after a few months people contributed other optimizers
like RMSProp, AdaGrad, and so on, and these people were not the original
developers of the framework. That shows that MXNet is easier to extend
thanks to this imperative programming.
On the other hand, it is still very efficient, because all the heavy
portions are optimized by the declarative programming part.
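A hedged sketch of that mixed training loop with the old MXNet 0.x API.
The network, shapes, and learning rate are placeholders, my_data_source()
is a hypothetical iterator, and real code would also initialize the
weights:

    import mxnet as mx

    # Declarative part: declare the network and compile it into an executor.
    data = mx.sym.Variable('data')
    net = mx.sym.FullyConnected(data=data, name='fc1', num_hidden=64)
    net = mx.sym.SoftmaxOutput(data=net, name='softmax')
    exe = net.simple_bind(ctx=mx.cpu(), data=(32, 100))

    # Imperative part: an ordinary Python loop around the compiled executor.
    lr = 0.01
    for batch_data, batch_label in my_data_source():   # hypothetical iterator
        exe.arg_dict['data'][:] = batch_data            # imperative copy-in
        exe.arg_dict['softmax_label'][:] = batch_label
        exe.forward(is_train=True)                      # optimized graph
        exe.backward()
        for name, grad in zip(net.list_arguments(), exe.grad_arrays):
            if name in ('data', 'softmax_label'):
                continue
            weight = exe.arg_dict[name]
            weight[:] = weight - lr * grad              # imperative SGD step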
Another example that is actually quite exciting for me is the support for
variable length or variable size input. This actually came from a
proposal by Jacob, who is at Microsoft. As you know, if you do sequence
training, normally you have sequences of different lengths, and they
usually correspond to network structures of different lengths or shapes.
Thanks to the imperative execution model that we have, this was done by
Chiyuan Xie: he added support for the bucketing API, which basically
switches on the input length and picks a different executor. Each
executor corresponds to a different network for that specific input, and
all these executors can share memory.
According to him, it only cost around ten lines of Python code, which may
be a little bit exaggerated, but I believe him. It is basically a map or
a switch in Python, and that's all you need to support variable length
input.
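A hedged sketch of the bucketing idea in plain Python: keep one compiled
executor per bucket of sequence lengths and dispatch on the input length.
The bucket sizes and build_executor are toy placeholders, not MXNet's
actual bucketing module:

    BUCKETS = [10, 20, 30, 40]           # allowed unrolled lengths
    executors = {}                        # bucket length -> compiled "executor"

    def build_executor(bucket_len):
        # Stand-in for "declare an RNN unrolled to bucket_len steps and bind
        # it"; in MXNet, executors for different buckets can share memory.
        return lambda seq: sum(seq[:bucket_len])

    def forward(seq):
        # Pick the smallest bucket that fits, padding the sequence up to it.
        bucket = next(b for b in BUCKETS if b >= len(seq))
        if bucket not in executors:
            executors[bucket] = build_executor(bucket)
        padded = list(seq) + [0] * (bucket - len(seq))
        return executors[bucket](padded)

    print(forward([1, 2, 3]))             # uses the length-10 bucket
    print(forward(list(range(25))))       # uses the length-30 bucket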
So in summary, in terms of programming models, hopefully you are now
convinced that there are two styles of programming models in deep
learning frameworks. You can do declarative programming, which is very
optimizable; engineers always like that, and it is good in production
systems. On the other hand, imperative programming gives you the
flexibility to insert whatever you need into the control flow and to do
run-time dynamic dispatching of the operations.
The combination of the two can maximize your productivity, for one
reason. The best thing I learned in my undergrad is Amdahl's law from
computer architecture: basically, you only want to optimize the
bottlenecks in your system. It is the same for deep learning. Usually
when we are extending a deep learning program or architecture, the part
that we extend may not be the bottleneck of the computation. Usually we
can keep many of the existing pieces in the declarative language and use
the imperative language to smoothly add the things we want, and still get
the maximum performance. That is what motivates this kind of mixed
programming design.
Okay. So now we have talked about programming models. Another thing
that is very important is to develop a system that effectively supports
these two kinds of programming models and allows them to mix together,
especially in a multiple-device environment. As we all know, you can now
put four GPUs into a single machine, and in a distributed environment
there are even more resources to utilize.
It is very important to automatically utilize all the resources to
maximize your performance. However, it's quite painful to write
multi-threaded or concurrent programs; although I like them very much, it
is very painful to write them, especially for specific cases.
So one thing that we did is build a dependency scheduling engine that
automatically parallelizes all your operations for you.
Here is an example. These are four imperative calls of operations, and
you can easily find that these two operations can run in parallel. What
we actually have is a generic dependency engine that takes all
operations, both the imperative and the declarative executions, chains
them together, and automatically parallelizes all the independent
operations you have.
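A hedged sketch of such an imperative sequence with MXNet NDArray (old
0.x API). Each call below returns immediately; the engine is free to run
the two independent operations concurrently and only synchronizes when
asnumpy() copies the result back:

    import mxnet as mx

    a = mx.nd.ones((1000, 1000), ctx=mx.cpu())
    # These two operations only read `a` and write different outputs,
    # so the dependency engine can run them in parallel.
    b = a * 2
    c = a + 1
    # This one depends on both b and c, so it runs after them.
    d = b + c
    print(d.asnumpy()[0, 0])   # asnumpy() blocks until d is actually computed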
Here is a more realistic example. This is a bunch of code that you don't
have to read, but the basic idea is that this is a two-layer neural
network that runs on two GPUs, and you need to copy the data to the CPU
for synchronization, do the update, and copy it back. This is the
generic code of the data parallel paradigm, and it is Python code that
you can actually write in MXNet.
It corresponds to this computational graph, and there are many
interesting aspects to it. You can see that there is a computational
path on the left side, on GPU zero, and on the right side is the
computational path on GPU one. You do the forward/backward propagation,
and after the back propagation of the top layer, as soon as you get its
gradient you can copy the data to the CPU, do the update, and copy it
back.
So one optimization that you want for this kind of parallel computation
is to overlap your computation with the communication. As soon as you
get this result, this memory copy can happen, and concurrently you can do
the next step of back propagation.
You find a similar pattern in the model parallel setting, for example a
multi-layer LSTM: as soon as you get the first time step ready, you can
directly copy the result, the second GPU can kick in, and all these
things can run concurrently.
However, it is quite hard to hard-code all these patterns and do the
asynchronous communication by hand. We do need a dependency tracker that
checks all these imperative and declarative operations. One key property
I want to emphasize in the design of our scheduling engine is that it is
designed to be transparent, in the sense that we don't make assumptions
about what kind of operation you want to schedule. Basically it means
that we can schedule anything, including code that wasn't written by us.
Okay.
Eric will give a more detailed example of that, actually.
However, there is a lot of trouble when you want to have such a
dependency engine. One trouble, surprisingly, is memory recycling. Here
are two dependency paths: you have two operations that depend on C and B,
and then we have a memory deletion that is automatically triggered by
garbage collection. However, this memory deletion cannot happen before
those operations are carried out. This is an implicit dependency that
you really need to be able to track.
Another interesting example is random number generation. Here you have
two random number generations in your code. Seemingly they can run in
parallel. However, if you are familiar with random number generators,
usually you have to run them in a serial manner, either for
reproducibility, because every run should give the same result, or
because the random number generator has internal state that cannot be
shared and has to be serialized.
In order to support these two kinds of operations, you really need to
extend the dependency engine from an immutable scheduler, which is a
technical term you don't have to understand, to something that supports
mutation. In our framework, we try to do things in an abstract way. For
every possible resource, memory, random number generators, and so on, we
allocate a variable, or a tag; it's like the tag on your bag for a
flight. You attach one to each of the resources. When you push an
operation to the engine, you use these tags to specify which resources
you want to read or mutate. The engine treats the operation as a black
box, in the sense that it simply uses those tags to track the
dependencies and execute the operation when it is ready, without actually
knowing what is inside the function.
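To make the tag idea concrete, here is a toy, serial illustration in
Python. It is not MXNet's actual engine (that one is C++ and dispatches
operations asynchronously to worker threads); it only shows how the
read/mutate tags determine which earlier operations an op has to wait
for:

    class ToyEngine:
        def __init__(self):
            self.last_writer = {}  # tag -> id of the last op that mutated it
            self.n_ops = 0

        def push(self, fn, read_tags=(), mutate_tags=()):
            # Simplified rule: wait for every op that previously mutated
            # anything this op reads or mutates.
            deps = sorted({self.last_writer[t]
                           for t in list(read_tags) + list(mutate_tags)
                           if t in self.last_writer})
            op_id = self.n_ops
            self.n_ops += 1
            print(f"op {op_id} waits for ops {deps}")
            fn()                   # the engine treats fn as a black box
            for t in mutate_tags:
                self.last_writer[t] = op_id

    engine = ToyEngine()
    A, B, C = "A", "B", "C"                                      # resource tags
    engine.push(lambda: None, read_tags=[A], mutate_tags=[B])    # B = f(A)
    engine.push(lambda: None, read_tags=[A], mutate_tags=[C])    # C = g(A): independent of op 0
    engine.push(lambda: None, read_tags=[B, C], mutate_tags=[A]) # A = h(B, C): waits for 0 and 1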
So all these code examples get translated into asynchronous pushes into
the dependency engine, and you can plug in arbitrary functions here as
well, including ones developed here. This is a big picture showing how
it works. I am not going to walk through it, but you can always read it
in our blog post; we have a detailed design doc of what we did in those
systems. Basically we have multiple dispatching queues that are aware of
the read and write dependencies. So to recap this section: you really
need a good dependency tracker that checks all the dependencies if you
want to build a powerful system for multiple GPUs or multiple resources.
Having such a dependency engine makes things easier for you, in the sense
that you don't have to hard-code all the data parallelization patterns.
If you want more parallelism, you simply write the Python code in a
serial manner, and the engine will execute each operation as soon as its
dependencies are ready, which gives you the maximum performance.
It can also be used in combination with other optimizations, such as data
compression, which optimizes the cost of each individual operation; the
dependency tracker helps you run more things in a concurrent manner.
Okay. I have mentioned the two major motivations for us to build the
system: we want to mix the programming models for maximum optimization
and flexibility, and we have built the dependency scheduling engine to
schedule arbitrary operations so that you can use arbitrary
parallelization patterns more easily.
I will briefly walk through other features that might be of interest.
MXNet comes with low memory consumption, mainly because of the
declaratively optimized memory. As an example, one of our friends
actually trained on the entire ImageNet data set on a single machine with
four GTX 980 GPUs, which are not very high-end GPUs, because we can use
less memory than the existing systems and so fit bigger neural networks
onto the same hardware.
It is also lightweight and portable, in the sense that it is available in
many languages and on various platforms, including Windows, of course,
and you can run it on mobile devices. One interesting thing that we
encourage you to check out is that we have a JavaScript version, which
actually runs in your browser. You can open this link now if you have a
laptop; there is an image classification demo that runs in your browser.
Microsoft does have an advantage here, in the sense that Microsoft Edge
actually runs it, I think, eight times faster than Chrome, because they
didn't optimize for the asm.js part.
This is really due to the fact that we made things very portable: all
dependencies are minimized, so we can compile the entire project into one
file and then compile it into JavaScript.
And this comes --
>> Audience: Does JavaScript use the same back end code?
>> Tianqi Chen: It uses the same back end code, yes.
>> Audience: [indiscernible] -- runs the C++ code?
>> Tianqi Chen: Yes, exactly. It's Emscripten that compiles the C++ code
into JavaScript. But you do need to isolate all the dependencies in
order to be able to do that.
And this comes in very handy when you are running demos, for example. If
you have a new model that you want to show to your colleagues, you can
basically run a demo in their browser, and you don't need to have a
dedicated server for whatever purpose.
It also runs on the major cloud platforms. We have built-in support for
AWS S3, and I think there is an ongoing project on supporting Azure
integration as well. It runs on major scheduling engines such as MPI,
Sun Grid Engine, and Hadoop YARN.
This is a comparison on a single-GPU vision task. I think the major
takeaway of this slide is that although we support a mixed style of
imperative programming and declarative programming, we are on par with
all the major systems, because we share the same kernels and all those
things. And MXNet is really optimized for those Inception-style or
residual-style architectures that have multiple paths, which gives us
much more memory saving than existing frameworks, and we have a new
optimization that is able to save another factor of two or three using
graph rewriting techniques.
Here is a comparative example of forward-only N-layer computation versus
forward plus backward, which answers a previous question about why
forward-only computation costs less memory: simply because there are
fewer memory dependencies, so you can save more in a forward-only
computation compared with forward plus backward, and declarative
programming gives you that for free.
On the other hand, imperative programming also gives you all the power
you need to write your research code easily and port it into a production
environment.
This is the overall picture of the system. It is quite different from
existing systems because we built it in a way such that we want to have
an NDArray model to support imperative programming, as well as the
symbolic execution model to support declarative programming.
Okay. So I think I have finished the overview of the system and the
general ideas behind it. Here are some take-home messages that hopefully
you can get from my part of the talk; Eric will give some tutorials about
the detailed usage.
MXNet stands for mixed programming, and the MX is also about maximizing
productivity. One important thing we do is automatic parallelism, which
lets you fully use all the parallelism patterns you want without hard
coding them. And one of the things that I want to emphasize is that we
are an open system, like Lego bricks: you can plug MXNet into various
parts and plug other bricks into MXNet. One important example is that
currently we actually work with fully functioning Torch modules, which
means that you can take any function in Torch and use it in MXNet. And
it also balances the flexibility of declarative (symbolic) programming
and imperative programming to give you the maximum performance and
flexibility.
Okay. So I will hand over to Eric for the next part of the talk.
>> Junyuan Xie: Thanks. So Tianqi has talked a lot about MXNet's design
and its core features. Hopefully by now you are convinced it is a good
system. Now I am going to talk about how you can use it.
Since this is a fairly technical audience, I will assume knowledge of
machine learning, Python, NumPy, and such. We actually do have other
front end language wrappers like Scala, Julia, R, et cetera, but Python
is the best supported and it is the language I use, so I am going to give
examples in Python.
So by the way, feel free to interrupt me. I know that some of you may
have played with MXNet already. So if you have more, you have deeper
questions, you can interrupt me anywhere to ask.
So first the symbolic graph -- is the mic working? Okay.
So first, MXNet's basic usage is the symbolic graph. Symbols are the
basic units of computation, and you can define symbols like this with a
function call. A symbol has inputs, it has a name, and a number of
parameters, like how many hidden units this fully connected layer has for
its output.
This is like a Caffe layer; it is a coarse-grained symbol. We also
provide finer-grained symbols like plus and multiply, and you can define,
say, FC1 plus one assigned to FC2.
That is actually recorded in the symbolic graph instead of being just an
imperative statement.
Then graphs are defined by composing symbols. You first have X, which is
a variable that basically serves as a placeholder for your input data,
and then you feed it through the network: you have the fully connected
layer, the activation, and the softmax. This gives you a network, which
can be instantiated with memory to give an executor. Here we use simple
bind, which allocates all of the memory buffers for you. You can also
call bind with the symbol and provide your own arrays as buffers and
gradients.
If you use the bind method, everything that you don't bind won't be
computed or updated. If you don't bind a gradient for an array, it won't
be computed; you won't be able to see it. Anything that you want as
output, you have to bind; otherwise it will be optimized out. After
that, the executor works pretty much like NumPy arrays, with some limited
features. You can assign to the input, which is the data, and then you
do a forward and then a backward.
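A hedged sketch of that flow with the old MXNet 0.x symbolic API (the
layer sizes, batch size, and random data are placeholders):

    import mxnet as mx
    import numpy as np

    x = mx.sym.Variable('data')                      # placeholder for input
    fc1 = mx.sym.FullyConnected(data=x, name='fc1', num_hidden=128)
    act = mx.sym.Activation(data=fc1, name='relu1', act_type='relu')
    fc2 = mx.sym.FullyConnected(data=act, name='fc2', num_hidden=10)
    net = mx.sym.SoftmaxOutput(data=fc2, name='softmax')

    # simple_bind allocates argument, gradient, and output buffers for you.
    exe = net.simple_bind(ctx=mx.cpu(), data=(32, 100))

    exe.arg_dict['data'][:] = np.random.rand(32, 100)        # assign the input
    exe.arg_dict['softmax_label'][:] = np.zeros(32)          # and the labels
    exe.forward(is_train=True)
    exe.backward()
    print(exe.outputs[0].shape)                              # (32, 10)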
>> Audience: So after you declare a symbol and then you do some so-called
imperative programming on those symbols, it is still declarative
programming, right? Because you still program on the symbol.
>>: So it cannot be called a mixed program. It is still programming on
the framework.
>> Junyuan Xie: This is the symbolic graph. Here it is bound, and you
can go forward and backward through it. These are two black-box symbolic
computational graphs. In the middle you can do anything you want with
them. The output array is just an NDArray. You can take that, feed it
to a CRF, and do some inference in the CRF, then take the gradient back
from the CRF and feed it to the backward pass. You can add arguments
here.
>> Audience: Is that after you run forward, meaning you called to
execute?
>>: Yes, the executions ... [speaker away from microphone.]
>> Audience: In other frameworks you can also call eval or something to
get the values, and then you can feed those values into whatever --
>> Tianqi Chen: The difference is that in most of the frameworks, after
you call eval, you get a NumPy variable. You need imperative GPU arrays
to explore all the different operations. The second difference is that
all those operations will be parallelized; in the existing frameworks,
once you take that NumPy array, all those operations are not
parallelized. So those imperative programs are actually [indiscernible]
[speaker away from microphone.]
>> Junyuan Xie: Okay, so --
>> Audience: I think this is really just a library function call in any
language. What is the difference?
>> Junyuan Xie: All right, so let's speed forward --
>> Audience: [speaker away from microphone.]
>> Junyuan Xie: Yeah, it pushes.
>> Audience: It is varied output?
>> Junyuan Xie: Not really. So after you compile this graph, the forward
actually pushes a whole bunch of operations to the engine and then it
returns without waiting for them to finish. So you push a whole lot of
stuff into the engine; then say the network has ten branches and you have
ten outputs. The forward immediately returns, and you can take each
output and do anything you want with it without waiting for the other
nine.
You see what I mean?
>> Audience: [speaker away from microphone.]
>>: -- all the things that are parallelized. And including, I think,
[indiscernible].
>> Audience: That is still -- the parallelization still happens inside
the library, right?
>> Tianqi Chen: No, it is not happening -- it happens inside the function
call as well as outside the function call. So after you call forward and
backward, you do a parameter update. That parameter update can be
parallelized together with the forward and backward calls.
As soon as you get a gradient in the backward call, the parameter update
can happen [indiscernible]; that's the major difference between the
serial execution and the parallel execution.
>> Audience: But parallelization still happens inside your library,
right?
>> Junyuan Xie: No, it happens in Python.
>> Audience: [indiscernible] it happens in Python?
>> Junyuan Xie: It happens in the engine, but it is issued -- you can
issue everything from Python.
>> Tianqi Chen: Basically you issue everything from Python in an
asynchronous manner. That's the difference. Most libraries allow you to
declare a graph and do whatever parallelization or optimization on that
graph, but most libraries don't allow you to do parallelization across
different graph executions. You execute the graph, you get the gradient,
you do the parameter update; that's a different operation. Most
libraries don't allow parallelization between those two operations.
But if you want to really support imperative operations well, you do want
to be able to check all those dependencies and parallelize all those
operations as a whole, because all these operations also run on the GPU
and they need to -- for example, the parameter update needs to be done as
soon as you get the gradient in the backward step, instead of waiting for
all the backward steps to finish.
>> Junyuan Xie: So basically here you have these network layers, and
each one has a few parameters. After you do a forward, you do a backward
pass. Since this is a linear structure, it will execute from this one to
this one and then backwards. And the backward actually returns
immediately after it pushes all the operations into the engine. Then
when you do the update, you call update on each parameter, and if this
were just a normal function call, you would need to wait for the backward
to compute the gradient for each of the parameters before you could do
the update, before it returns.
>> Audience: [speaker away from microphone.]
>> Junyuan Xie: Uh-huh.
>> Audience: You can do a function call asynchronously, immediately,
right?
>> Junyuan Xie: Yes. But then you need the dependency engine. So
basically that's why the dependency engine is good: because it registers
all of the read and write operations on each of the parameters, you can
issue everything without waiting for anything to finish. It will get
automatically dispatched when it is ready to be run.
>> Audience: But the computing devices are limited.
>> Junyuan Xie: Uh-huh.
>> Audience: Okay? And while you run some [indiscernible] on that
device, that device is occupied. You can run another --
>> Junyuan Xie: So --
>> Audience: At the same time. So based on that, you have the limitation
of the device. The only things you can actually parallelize are the
memory copies [indiscernible] and some communication, right?
>> Junyuan Xie: There are two cases. First, if your network has multiple
branches and some layers are small, running one layer doesn't occupy the
entire GPU; then, if you can run two branches in parallel, it gives you a
speed-up.
The other is that this extends naturally to multiple GPUs. You can
easily do model parallelism across multiple GPUs.
>> Audience: Yes, but in the past this typically will slow down your
overall computing ... [speaker away from microphone.]
Every time you call for communication between the GPU devices or between
GPU and CPU --
>> Junyuan Xie: Well, you need to --
>> Tianqi Chen: That's why you need dependency engines.
>> Audience: No, the dependency engine itself is not enough.
>> Junyuan Xie: [overlapping speech.]
>> Audience: You need a special model structure for that, not just the
dependency engine.
>> Junyuan Xie: No. So basically, if you just take a VGG network and put
the first convolution on one GPU and the second on the other GPU, it
won't work very well. But if you have a more decoupled structure, it
will work better. Say, these days there are two-stream video recognition
networks that have one stream for RGB and the other for optical flow, and
the two streams merge at the top. Then you can easily parallelize the
two parts onto two GPUs.
>> Audience: But still, if you want to get optimal speed, you need some
human hint to decide how you want to parallelize, where you want to
launch from.
>> Junyuan Xie: Well, you just need to say which symbol should be
allocated to which device.
>> Audience: You basically tell the system what it is.
[overlapping speech.]
>> Junyuan Xie: Yes, yes, you say which symbol goes to which device.
>> Audience: Then I don't see any benefit other than the parallelism
between [indiscernible], because you still need the human to decide
exactly which one goes where. It is still [indiscernible]; you can also
put things in ...
[overlapping speech.]
>> Junyuan Xie: Well, this is --
>> Audience: Two separate branches. You can compute each one
independently on different GPUs and just have a merge at the top, so it
is very similar. You can basically implement it yourself.
[overlapping speech.]
>> Junyuan Xie: Yeah, but that requires manual coding for each specific
case, and once you get to four GPUs with crazy network structures, you
will find yourself writing hundreds of lines of code for each specific
network.
>> Audience: No, no, no, that's not true. Actually, once you have more
than four GPUs, these kinds of -- typically you will slow down your
[indiscernible] significantly. That's why in all our current
implementations across machines or clusters [indiscernible] we need to
use some other techniques. You cannot attempt to use -- in your case,
this is [indiscernible] across multiple machines, right? This is
typically where we see the slowdown, especially when you cannot increase
the minibatch size larger than [indiscernible].
So this is, if I understand what you are telling me right now.
>> Junyuan Xie: Okay. So one thing is that the Pascal GPUs are coming
out in probably half a year, if they keep their promises, and they
advertise something like ten times faster GPU-to-GPU communication with
NVLink. So I think it will probably get to a point where inter-GPU
communication is similar in speed to within-GPU communication, because
they have these 3D memories.
>> Audience: For example, in your case we can reduce the communication
cost [indiscernible] by 32 times by [indiscernible] the gradients.
>> Junyuan Xie: Well, you can.
>> Audience: Even that is [indiscernible] now. Even [indiscernible] by
32 times, it is not enough. You understand?
>> Junyuan Xie: It depends on your network structure. It depends on
your application. For some applications it doesn't work. For some it
will.
>> Audience: Yes, but basically --
>> Tianqi Chen: For the --
>>: I don't understand. Are you saying that data parallelism doesn't
work, model parallelism doesn't work, and one-bit gradients don't work?
What works?
>> Tianqi Chen: One bit works if you can increase the minibatch size to
be very large.
>>: Isn't there a toolkit? Shouldn't [indiscernible] turn this on for
any toolkit?
>>: No, if you increase the minibatch size too much, your final model is
bad.
>>: I don't see what that has to do with the toolkit. The toolkit
supports model parallelism, data parallelism and, you know, it could
support one-bit gradients. I don't see.
>> Tianqi Chen: No, what I mean is, even one bit is too large. What he
said is that as long as we can improve the speed of the communication
between GPU and GPU, it is enough. I mean, it is not enough. That is
why we have some [indiscernible] algorithms beyond one-bit quantization
to make this work. In all those experiments we found that this kind of
[indiscernible] is not enough. This is what I want to say.
>> Audience: Yes, what I want to say is that you need to increase the
[indiscernible] dynamically to increase the communication channel so you
can get the benefit. That is the whole point: you need to increase the
minibatch size.
I have a related question. I actually like the dynamic scheduling for
operations and for memory, but what concerns me is the table that you had
in the earlier part of the talk, where you compare your performance to
something like Torch. Torch is kind of on the other side of the
spectrum, where the researcher or developer is building all these
optimizations by hand.
>> Tianqi Chen: Uh-huh.
>> Audience: And the kind of argument is that Torch offers the
flexibility, while MXNet and other symbolic or declarative programs offer
the optimization.
>> Tianqi Chen: Yes.
>> Audience: But in these tables I didn't see a huge difference in terms
of speed-up or memory. One part of that may be because all these
networks are standard and everyone is using cuDNN, for example, for
convolution and so on. But the other part of the question is whether you
have some experiments for something like machine translation models or
LSTM language models, where you can do model parallelism and show a
difference, if you have some benchmarks.
>> Tianqi Chen:
Yeah, yeah.
>> Audience: The other part of the question is, if I would do the
standard stuff, all the toolkits are the same but for the nonstandard
stuff maybe, and in parallel the program would be easier to work with
rather than a declarative program and if the difference in speed then and
memory is not that big anyways, why should I go for the declarative
programs?
>> Junyuan Xie: So basically for the standard program, for the standard
networks people like, people working on Torch put a lot of engineering
efforts to optimize them and so basically in the end get to the,
basically similar optimization as we do here automatically.
>> Audience: Their layers, their operation implementations, they have
them by hand.
[overlapping speech.]
>> Junyuan Xie: So, for example, for ResNet, the network is so big, with
so many layers, that it doesn't fit into memory. So what they did is, in
the backward pass they take one gradient blob and walk it down through
the network, so that you don't need a gradient blob for every layer.
We do this automatically, and we do many more optimizations on top of
that. But in Torch, as a user you need to code the optimization for each
network: you need to figure out what to do, and sometimes you need
significant changes to the infrastructure to do that.
You certainly can, but it's kind of too much trouble. Otherwise you can
also add a gazillion options to the library so you can turn on each
optimization, but that quickly becomes unmanageable.
>> Audience: So actually, kind of a follow-up question. So you are
saying that for Torch they did the co-share, what you called co-share, by
hand, right?
>> Junyuan Xie: Yeah.
>> Audience: So did you try to run the ResNet with MXNet on the same
kind of GPU card with the same memory limitations? Would the automatic
co-share be runnable, or would it run out of memory?
>> Junyuan Xie: Hmm, I personally haven't. We have people who have --
>> Audience: [speaker away from microphone.] that is the six, the six
gig card, I see?
>> Junyuan Xie: Yeah, so we do most of the memory sharing that you can
do automatically, and this requires less effort when you are exploring
new network structures.
>> Audience: [speaker away from microphone.]
>> Audience: Your last layer is not separate from the --
>> Tianqi Chen: That means the [indiscernible] loss and activation.
>> Junyuan Xie: Yeah.
>> Audience: How hard is it to change the loss to a custom loss that you
want to define in MXNet?
>> Junyuan Xie: You can do it with -- it depends on how complicated your
loss is. We have SoftMax activation, which is only the activation
instead of the activation with the loss attached to it. And you can do
anything on top of that by either creating new symbols or composing
existing ones.
>> Audience: A question along that line. So you couple the SoftMax
activation and the loss together. After you train the model, how can you
use the model at prediction time?
>> Junyuan Xie: At prediction time, you just don't call backward and it
will work.
>> Audience: But if you don't call backward, you still call forward,
right?
>> Junyuan Xie: Forward doesn't require a label.
>> Audience: So that is a [indiscernible], so going forward it is not
output to the loss function; it is just the normal output. So it is --
[overlapping speech.]
>> Audience: Is this a special case or --
>> Junyuan Xie: No. In all of our output layers we don't output the
loss; the loss is computed separately. They all just output whatever
they are supposed to output instead of the loss. The loss is computed
separately if you want it.
>> Audience: But how do you compute the gradient coming from the loss to
the model?
>> Junyuan Xie: The gradient actually doesn't come from the loss.
Imagine you are doing SoftMax: for each node you have the activation,
right? The gradient is just that activation, minus one for the true
class. So the gradient doesn't require you to explicitly compute the
loss. So --
>> Audience: [speaker away from microphone.]
>> Junyuan Xie: So you don't need to.
>> Audience: But sometimes the loss function might impact the way that
you compute the gradient.
>> Junyuan Xie: Well, for that layer you can compute the loss. So in
this case you don't need to.
>> Audience: Okay.
>>: Is there a way to, if you wanted to chain different graphs together
because you wanted to do a [indiscernible] network?
>> Junyuan Xie: Exactly. We are going to talk about this in the
following slide.
So I want to talk a little bit about naming. In MXNet, every symbol and
every array in the symbolic graph has a name, and the names are not
arbitrary: we have certain naming conventions that can impact behavior.
You can create a variable placeholder by declaring a variable, and all
the other output arrays and named input arrays will be named by the
symbol name plus a default suffix.
For example, fc1 weight and fc1 output, which didn't appear on the
previous slide, will be allocated and named by default.
With the default model and updater -- you don't have to use this, but if
you are just using a standard feed-forward neural network it will
probably make things much easier -- all the arrays whose names end with
"weight" will be treated as weights: they will get weight decay and the
weight initialization, while the biases will be initialized to zero and
won't be decayed.
And data blobs from the data iterator will be copied to the NDArrays with
the corresponding names. So these are like variable names in your
program: they are not just for display; they are not arbitrary.
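A hedged sketch of those naming conventions with the old MXNet 0.x API
(the layer size is a placeholder):

    import mxnet as mx

    x = mx.sym.Variable('data')                   # explicit placeholder name
    fc1 = mx.sym.FullyConnected(data=x, name='fc1', num_hidden=128)

    print(fc1.list_arguments())   # ['data', 'fc1_weight', 'fc1_bias']
    print(fc1.list_outputs())     # ['fc1_output']
    # Arrays ending in '_weight' / '_bias' get the default initialization
    # and weight-decay treatment, and the data iterator copies each batch
    # into the NDArray whose name matches its data name (e.g. 'data').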
I just want to quickly mention that I/O in MXNet is different from
TensorFlow and Caffe. In TensorFlow and Caffe, I/O is just a layer in
the graph: it takes nothing and spits out data. Here we separate the
computation graph from the data iterator, because we want to keep forward
and backward clean, so that they act as functions that take input and
spit out output. These are separated, and during runtime you copy the
data into the model, into the executor. We provide a number of default
data iterators, like the NDArray iterator, which allows you to use NumPy
arrays as input.
We also have an image iterator for ImageNet and other image-based data
sets, and we also have one for MNIST. You can easily create other data
iterators in the front end language, like Python, by subclassing the base
data iterator class. Uh-huh?
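A hedged sketch using the NDArray-backed iterator (the old
mx.io.NDArrayIter API; the random data is a placeholder):

    import mxnet as mx
    import numpy as np

    X = np.random.rand(1000, 100).astype('float32')
    y = np.random.randint(0, 10, size=1000)

    # Wrap in-memory NumPy arrays as a data iterator; shuffle each epoch.
    train_iter = mx.io.NDArrayIter(data=X, label=y, batch_size=32,
                                   shuffle=True)

    for batch in train_iter:
        data_nd = batch.data[0]      # NDArray of shape (32, 100)
        label_nd = batch.label[0]    # NDArray of shape (32,)
        # ... copy into an executor's arg_dict and call forward/backward ...
    train_iter.reset()               # rewind before the next epoch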
>> Audience: So does the data iterator support shuffling, or do users
have to shuffle the data [indiscernible]?
>> Junyuan Xie: Well, it does some local shuffling but not a global one,
because that is only efficient on an SSD. So --
>> Audience: [speaker away from microphone.]
>> Junyuan Xie: So we take 1,000 examples and shuffle those, instead of
taking the whole data set and doing a complete shuffle.
>> Audience: You do it blob by blob?
>> Junyuan Xie: Basically, for this [indiscernible].
>>: It also really [indiscernible] global random shuffling.
>> Audience: Okay.
>> Junyuan Xie: So it is not the full shuffle you would do with an
in-memory array; that can be very inefficient on an HDD.
>> Audience: [speaker away from microphone] few iterations?
>> Junyuan Xie: It is specific to --
[overlapping speech.]
>> Audience: Is it across every --
>> Junyuan Xie: Well, this one is just in memory, so you can do any
shuffling you want. This one is on disk, so for large images we only do
local shuffling and block shuffling. And this one, I think, is not
shuffled, because it is just an example; it just makes it convenient to
show MNIST results.
When you are subclassing this and creating your own data iterator, you
can do whatever you want.
>> Audience: But then how do you know which iterator supports shuffling
[indiscernible]?
>> Junyuan Xie: It's in the documentation, I think.
>> Audience: Okay.
>> Audience: By the way, does it handle sequences nicely or not?
>> Junyuan Xie: Yeah, recently we added bucketing for sequences, which I
will explain shortly.
So after you have the data and the graph, you can train it pretty easily
by first creating a model object. You specify the contexts you want it
to run on: you can give it GPU zero through GPU three and it will
parallelize over them for you.
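A hedged sketch of that high-level path with the old mx.model.FeedForward
API. The symbol net and iterator train_iter are assumed from the earlier
sketches, and the four GPUs, epochs, and learning rate are placeholders:

    import mxnet as mx

    # Data parallelism over four GPUs: just list the contexts.
    model = mx.model.FeedForward(
        symbol=net,                      # the symbolic network defined earlier
        ctx=[mx.gpu(i) for i in range(4)],
        num_epoch=10,
        learning_rate=0.1)
    model.fit(X=train_iter)              # the engine overlaps compute and copies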
So here is an example. If you use VGG or GoogLeNet -- GoogLeNet actually
has very few parameters but a lot of computation -- on a single machine
with four GPUs we get basically a linear speed-up with multiple GPUs.
And all of that is automatic; you don't need any extra coding for it.
I will also cover recurrent networks and more complicated examples later.
Since MXNet's executor will optimize everything out if you don't need it,
this can make debugging harder, because sometimes when your model is
blowing up and you don't know why, you want to print the intermediate
gradients or weight arrays during training to see whether some of them
are doing weird things.
To do that, we provide a monitor class that hooks into the executor:
immediately after each array is computed, we push another operation that
takes that array, computes some statistics on it, and returns the
statistics.
So you create a monitor saying print every ten batches with this
statistic function -- here it is just the standard deviation -- and I
want to compute it on every array that matches this regular expression,
sorted by name. Then you bind the executor and install the monitor on
it, or you provide the monitor to the fit function, and you will see
these statistics printed every ten batches.
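A hedged sketch of that usage (the interval, statistic, and name pattern
are placeholders, and model / train_iter are assumed from the earlier
sketches):

    import mxnet as mx

    # Print a statistic of every array whose name contains 'weight', every
    # 10 batches. The stat function runs inside the engine, so it should
    # use only NDArray ops and must not block (no asnumpy() here).
    mon = mx.mon.Monitor(
        interval=10,
        stat_func=lambda arr: mx.nd.norm(arr) / (arr.size ** 0.5),
        pattern='.*weight',
        sort=True)

    model.fit(X=train_iter, monitor=mon)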
>> Audience: I have a question. I think debugging is very important for
a lot of things. So do you also offer some kind of way to set some
[indiscernible] or whatever, some other --
>> Junyuan Xie: A what?
>> Audience: It is very interesting how -- so you want to look at and
print the values, and then you have steps --
>> Junyuan Xie: Well, here you can return arbitrary things and we will
print them for you. So whatever is here -- remember that this is pushed
into the engine, so you cannot do blocking calls here. Basically,
anything asynchronous and arithmetic is fine; just don't call asnumpy
here.
So don't take the NDArray data and try to copy it to the Python front
end. It will be copied and printed later, but don't copy it here.
>> Audience: I also want to see, if I want to do gradient checking
[indiscernible], do we have the code already?
>> Junyuan Xie: We do have gradient checking. It is in Python code.
>> Audience: Oh, okay. It is not --
>> Junyuan Xie: Also, this does add overhead, because you are computing
all these statistics and it blocks the computation at each step. Only use
it when you need it, and probably set the interval to a bigger value.
Okay, so we built MXNet with the idea that it should be an open system
that is easily extensible. We already have some examples of that. We can
call Torch tensor functions -- ones, SVD, all those Torch functions -- on
MXNet NDArrays, and they are executed in the engine asynchronously. It is
pretty transparent; it is just as if you are using MXNet's own functions.
We can also integrate Torch nn layers into the MXNet symbolic graph. You
can take a Lua script that defines a layer -- it has forward, backward and
gradient -- and embed it into the MXNet symbolic graph as one symbol.
This allows you to migrate any existing work you have in Torch to MXNet
pretty easily. Basically, when you develop new algorithms, you pretty much
define a set of new layers that you want to use, usually two or three. You
can put those into MXNet and you won't lose any existing work.
And we are also working on Caffe integration. It is harder because Caffe
doesn't have a full scripting interface, so we need to fake Caffe's
headers and compile against them; it is a lot of hacking. For Torch, and
potentially TensorFlow, it is pretty easy because they have a scripting
interface.
So here we show how you can call Torch tensor functions. They are provided
in the mxnet.th module. You can create random arrays and call arithmetic
operations on them. You can create multiple arrays and then do arithmetic
computations. These are all asynchronous, so when you call them they
return immediately and all the execution happens in the engine. You force
synchronization whenever you take the output and try to get it from
Python as numpy arrays.
These just take MXNet's NDArray structure, which is similar to numpy; a
lot of the time it can be used with numpy arrays transparently, as if they
were numpy arrays. And all of these are pushed to the engine.
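For reference, a hedged sketch of the pattern described, assuming MXNet
was built with the Torch plugin and that the functions live under an mx.th
module as mentioned above (treat the exact function names as assumptions):

    import mxnet as mx

    # create NDArrays through Torch's tensor functions (names assumed from
    # the Torch plugin); these calls are pushed to the engine and return
    # immediately
    a = mx.th.randn(2, 2)
    b = mx.th.ones(2, 2)
    c = a + b                 # ordinary MXNet NDArray arithmetic also works

    # asnumpy() is the synchronization point: it blocks until the engine
    # has finished computing c, then copies the result to a numpy array
    print(c.asnumpy())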
Here we show how you can use Torch nn layers as MXNet symbols. These are
just like any other symbols in MXNet: you have TorchModule for the normal
layers and TorchCriterion for the Torch loss layers.
The Lua string is an initializer -- a Lua command or function that returns
a Lua object representing a neural network layer. After you construct this
graph, it works pretty much the same as any other MXNet graph. This lets
you migrate existing work, but since Torch is imperative, we cannot do any
memory optimization with it. These layers are not memory optimized;
everything will stay there and won't be freed.
Unless you are only doing forward, in which case we know for certain that
you are not going to call backward on it again and it will be freed.
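For reference, a hedged sketch of wrapping Torch nn layers as symbols; the
TorchModule name and its lua_string/num_* arguments follow the Torch
plugin's documentation of the time, but treat the details as assumptions:

    import mxnet as mx

    data = mx.symbol.Variable('data')
    # wrap a Torch nn layer: the Lua string constructs the layer object
    fc1 = mx.symbol.TorchModule(data_0=data,
                                lua_string='nn.Linear(784, 128)',
                                num_data=1, num_params=2, num_outputs=1,
                                name='fc1')
    act1 = mx.symbol.TorchModule(data_0=fc1,
                                 lua_string='nn.ReLU(false)',
                                 num_data=1, num_params=0, num_outputs=1,
                                 name='relu1')
    # after this point it is an ordinary MXNet symbol graph
    net = mx.symbol.SoftmaxOutput(data=act1, name='softmax')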
>> Junyuan Xie: So use MXNet's layers when you can; if MXNet doesn't have
a layer you need, call Torch. Here we show how you can write a custom
training loop. This is more advanced, for when you are doing something
other than plain feed-forward networks. Say you have a fully convolutional
network that operates on different-sized images and you want to do the
spatial pooling thing and generate a different number of outputs from each
batch; you can do that with your own training loop. The basic loop looks
like this: you take a symbol and bind it to create an executor. You take a
data iterator; for each epoch you reset it and then enumerate the batches
from it. For each batch you load the data into the executor -- this is
basically a batch of assignments -- and then you call forward and then
backward.
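For reference, a minimal sketch of such a custom training loop, assuming
the executor API (simple_bind, forward, backward) and a symbol with inputs
named 'data' and 'softmax_label'; net, train_iter and num_epochs are
placeholders, and the SGD update is deliberately bare-bones:

    import mxnet as mx

    batch_size, lr = 100, 0.1
    exe = net.simple_bind(ctx=mx.gpu(0), data=(batch_size, 784), grad_req='write')

    for epoch in range(num_epochs):
        train_iter.reset()
        for batch in train_iter:
            # load the batch: basically a batch of assignments
            exe.arg_dict['data'][:] = batch.data[0]
            exe.arg_dict['softmax_label'][:] = batch.label[0]
            exe.forward(is_train=True)
            # anything can happen here with exe.outputs before backward
            exe.backward()
            # plain SGD on every learnable argument
            for name in exe.arg_dict:
                if name not in ('data', 'softmax_label'):
                    exe.arg_dict[name][:] = exe.arg_dict[name] - lr * exe.grad_dict[name]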
In the middle, between forward and backward, and before and after, you can
do anything you want with the output, like applying a CRF to it, or taking
this executor's forward output, feeding it into the next executor's input,
calling that executor's backward, and feeding the gradient back into this
executor's backward as an argument.
Here it doesn't take any argument because this MNIST network has a loss
layer at the end, so it doesn't need a gradient input, and then --
>> Audience: So how does it know which ones are missing -- basically the
terminal nodes in the graph? Like if it tries to execute and one is
missing, will it throw an exception at runtime or something? You know what
I'm saying? There are some nodes in the dag -- in this scenario there are
some nodes that don't have anything coming into them, right? But you still
have to pass in the partial derivative with respect to that output for
those guys.
>> Junyuan Xie: Well, so here whenever you bind, you need to provide every
input and the gradient array for it. If you don't provide one, it will be
optimized out if it can be.
>> Audience:
I see.
>> Junyuan Xie: If they can. So if you don't provide a weight, it is
pretty bad -- it is going to be random --
>> Audience: But let's say you have a network where you run an
[indiscernible] output over a sequence and you take the final hidden state
and pass it as input to another network, right? Then you need to pass the
gradient [indiscernible] back to that final hidden state. Where do you
tell the network, hey, I'm going to do this? Do you have to bind a
variable to that --
>> Junyuan Xie: Hmm, the backward of the other executor will give you the
gradient for every input.
>> Audience:
Okay.
>> Junyuan Xie: For every input you require it to fill in; the others will
be optimized out. For the ones you do have, you can take that gradient and
assign it to the head gradient here.
>> Audience: Then the [indiscernible] first one knows that it is missing,
right? If you didn't provide it ... [speaker away from microphone.]
>> Junyuan Xie: So if the final layer is not a loss, it will need a head
gradient to do backward; if you don't provide one, it will throw an error.
>> Audience: So it assumes that, in terms of the actual dag, whatever the
final thing is, that is the one you provide it to?
>>: [speaker away from microphone.]
>> Junyuan Xie: Yeah, so you need to specify which output you want, and
that will be the head gradient you need to assign to.
>> Audience:
Okay.
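To make the head-gradient discussion concrete, a rough sketch of chaining
two executors, where the second network's input named 'feature' receives
the first network's output; the names are placeholders, exec2 is assumed
to end in a loss, and exec2 is assumed to have been bound with a gradient
array for 'feature':

    # forward through the first executor, then feed its output onward
    exec1.forward(is_train=True)
    exec2.arg_dict['feature'][:] = exec1.outputs[0]
    exec2.forward(is_train=True)

    # exec2 ends in a loss layer, so its backward needs no head gradient
    exec2.backward()

    # exec1's final output is not a loss, so we pass the gradient that
    # exec2 computed for its 'feature' input as exec1's head gradient
    exec1.backward(out_grads=exec2.grad_dict['feature'])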
>>: So are you [indiscernible] the forward, backward, update somehow? Or
do you stop each time before going to the next? Is there some internal
queuing in the iterator?
>> Tianqi Chen: It depends on the engine [speaker away from microphone.]
>> Junyuan Xie: So all of these are scheduled. The forward will return
immediately, this will return immediately, and the [indiscernible] also
pushes the computation of the momentum and the weight updates into the
engine.
The only thing that is synchronized is the metric update, because you need
to compute the metric and print it out. So that is synchronized, and it is
synchronized on the output, so it doesn't depend on the backward. As soon
as the operations in the forward finish, this returns and we start the
next batch. So the backward could still be running while we are already
doing the forward for the next batch.
>> Audience: Right, right. So when you are blocked on an update, you are
not doing any I/O, right? You are --
[overlapping speech.]
>> Audience: -- to get the next --
>> Tianqi Chen: [speaker away from microphone.]
>> Audience: Well, the next batch.
>> Junyuan Xie: That's what the data iterator does. It does prefetching.
>> Audience: Just prefetching?
>> Junyuan Xie: Yeah.
>> Audience: Do you have an example where we can see the things that you
put in the middle of forward [indiscernible]?
>> Junyuan Xie: Not in --
>> Audience: Nothing that tells us how you [indiscernible]?
>> Junyuan Xie: Not in the slides.
>> Audience: Okay.
>> Junyuan Xie: It is basically pretty easy. You take these outputs.
These are just arrays. And you do whatever you want with them.
>> Audience:
Then that is the output from the forward?
>> Junyuan Xie: Yeah. And backward is these grad arrays.
So as I said, you can chain multiple executors and insert some imperative
computation in between, and we also have a data parallel executor manager
that allows you to easily do data parallelism across multiple devices or
multiple machines.
>> Audience: Is there a way to ... [indiscernible] to look at a particular
frame? Do you have that?
>> Junyuan Xie:
For [indiscernible]?
>> Audience: [indiscernible] frames, so now just pick all the frames,
the second frame also.
>> Tianqi Chen: Yes, so you --
[overlapping speech.]
>> Junyuan Xie: The frame? The layer?
>> Tianqi Chen: [speaker away from microphone.]
>> Audience: The time, you know.
>>: The video streams?
>>: So you already --
>> Audience: Yes, for the [indiscernible] stream.
>> Junyuan Xie: Basically you don't bind whatever you don't need, and it
will be optimized out. If you don't need to compute something -- if it is
not in the middle of other things you want -- you just don't compute it.
>> Audience:
Even for samples [indiscernible] tables?
>> Junyuan Xie: Well, it depends on how you define your graph structure.
Basically it is a dag. Say you need this output here; then whatever leads
only to an output you don't need -- that branch won't be computed if you
don't bind it.
>> Audience: [speaker away from microphone.] If you have a sequence as
[indiscernible], so you have sequences, are you treating the whole
sequence as a unit or do you treat each sample as a unit?
>> Junyuan Xie: Well, you can do either. Basically, if you treat each
sample as a dag, then you need to feed one to the next in Python. If you
bind them as one entire dag, then you just take the input and get the
output. Whatever suits your application.
>> Audience: So maybe ...
>>: So this will be different. For example, if you treat this as a whole
sequence.
>> Junyuan Xie: Uh-huh.
>> Audience: In your whole graph, for example, if you have a recurrent
connection here, you can certainly do batched computation for the lower
layers and the upper layers, but not for that layer. Does it handle that
or not?
>> Junyuan Xie:
I don't quite ... what do you mean, so --
>>: This is the kind of application that you can write.
[overlapping speech.]
>>: You need to write this in Python to do this batching by yourself?
>> Tianqi Chen: If you need the batch --
>>: No, this can be optimized inside the engine, but you are not doing
that right now, right?
>> Junyuan Xie: What exactly can be optimized? So you have -- you are
saying --
[overlapping speech.]
>> Audience:
You have a recurrent network.
>> Junyuan Xie:
Uh-huh.
>> Audience: Let's consider a simple [indiscernible] where only a certain
group of hidden layers -- you have five hidden layers, and you only have
self-recurrency at one particular hidden layer.
>> Junyuan Xie:
Uh-huh.
>> Audience: And if you treat each sample as an independent
[indiscernible], you need to unroll the whole graph into a very big dag
and compute each sample independently. So you can only parallelize over
[indiscernible], and each one is a very small computation.
>> Junyuan Xie: That's where the dependency engine comes in.
[overlapping speech.]
So.
>> Audience: What I mean is, this is very slow. A different solution is to
compute everything below that certain layer as one huge batch computation,
which is very efficient. Everything above that is also a whole batch
[indiscernible] dependency. Only the layer that has the recurrent
connection needs to compute sample by sample. This cannot be --
[overlapping speech.]
>> Junyuan Xie: Well, the symbolic graph doesn't impose a batch size
across symbols, so you can change the batch size wherever you need.
And with the dependency engine, if you bind one executor for each time
step, the forward will return immediately after it pushes all the
execution into the engine. You can then directly use the output, even
though it hasn't been computed yet, and feed it to the next executor, and
the dependency engine will schedule everything for you. You don't need to
do any synchronization manually.
>> Audience:
I am not talking about [indiscernible].
We need to talk --
>>: This is too much of a subtlety. I know what you are talking about, but
I think it's a very subtle issue that is hard to explain. Once you draw it
out, I think we can talk about it afterwards, right?
>>:
Yeah, we probably need to [indiscernible] that.
>> Junyuan Xie: So, bucketing. Tianqi already talked a little bit about
how we do bucketing for variable-length or variable-size inputs. If you
have a fully convolutional neural network with variable image sizes, for
example for detection, you can allocate multiple executors and have them
share memory, given that you are not going to call them in parallel.
Because they share memory, your memory cost goes to the cost of the
largest executor.
You can have one executor for each image size, cache them, and whenever
you get an image you put it into the corresponding executor.
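For reference, a rough sketch of the per-size executor cache, assuming sym
is a fully convolutional symbol without auxiliary states whose weights
don't depend on the input size; this version only shares the weight and
gradient arrays across sizes (the workspace sharing described above is
handled inside the library and is not reproduced here):

    import mxnet as mx

    ctx = mx.gpu(0)
    shared_args, shared_grads, executors = {}, {}, {}

    def get_executor(h, w, batch_size=1):
        """Return a cached executor for one input size, creating it on first use."""
        key = (h, w)
        if key not in executors:
            arg_shapes, _, _ = sym.infer_shape(data=(batch_size, 3, h, w))
            args, grads = {}, {}
            for name, shape in zip(sym.list_arguments(), arg_shapes):
                if name == 'data':
                    args[name] = mx.nd.zeros(shape, ctx)   # per-size input buffer
                else:
                    # one copy of each weight/gradient, reused by every bucket
                    args[name] = shared_args.setdefault(name, mx.nd.zeros(shape, ctx))
                    grads[name] = shared_grads.setdefault(name, mx.nd.zeros(shape, ctx))
            executors[key] = sym.bind(ctx, args, args_grad=grads, grad_req='write')
        return executors[key]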
Question?
>> Audience:
Yes, what is the overhead of spinning up an executor?
>> Junyuan Xie: If you are sharing memory, it is pretty fast. We didn't
measure it exactly, but it is basically some graph search -- a DFS over,
say, a 100- or 200-node graph -- to compute the dependency structure.
>> Audience: [speaker away from microphone.]
>>: No, it's done ahead of time.
>> Junyuan Xie: No, for each executor. If every mini-batch has a
completely new size, so that no two mini-batches share the same input
size, you will need to create one for each.
>> Audience: [speaker away from microphone] one bucket per mini-batch. You
have one bucket for each mini-batch?
>>: Well, you have one bucket type for each mini-batch, but the cost is
paid ahead of time. You define how many buckets you have ahead of time and
then you assign each batch to one, but it's a one-time cost, so it is not
really anything.
>> Audience: When you do the computation, do you do it one by one? Or
everything in one batch, in one kernel operation?
>> Junyuan Xie: Well, it depends on what you want to do. One kernel
feed-forward cannot operate on different-sized images, right? If you have
different image sizes within one batch, you will have to go with the
maximum: make the batch the maximum size and then pad the smaller images.
The GPU just doesn't work any other way. You can do the forward --
>> Audience: [speaker away from microphone.] buckets of five or ten,
right? So you --
[overlapping speech.]
>> Junyuan Xie: Yes, you can assign images with very similar size to
each batch and save some overhead.
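For completeness, a tiny numpy-only sketch of the padding idea; the HxWxC
layout is an arbitrary choice:

    import numpy as np

    def pad_to_batch(images, pad_val=0):
        """Pad a list of HxWxC images to the largest H and W so they stack into one batch."""
        max_h = max(img.shape[0] for img in images)
        max_w = max(img.shape[1] for img in images)
        batch = np.full((len(images), max_h, max_w, images[0].shape[2]),
                        pad_val, dtype=images[0].dtype)
        for i, img in enumerate(images):
            batch[i, :img.shape[0], :img.shape[1], :] = img
        return batch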
>> Audience: Yeah, actually in [indiscernible] we group everything
together with [indiscernible]. So we use not just one but up to 200 in one
computation, one for [indiscernible] processing. But that is only for the
speech case.
>> Audience: Not just speech, everything.
>>: It is --
[overlapping speech.]
>> Junyuan Xie: Suppose you have two images. One is 400 by 800, one is 200
by 400. Then your kernel will need to operate on 400 by 800.
>> Audience: No, no, you can -- basically you can do padding.
>> Junyuan Xie: Yeah, that's what I'm talking about. So if you have images
with variable sizes within a batch, you have to pad the smaller images to
the larger one.
>> Audience: Yes, this is what I'm asking. So are you batching as much of
this together into one kernel, or do you actually run single batches
multiple times?
>> Junyuan Xie: It depends on what you want in Python. It can be done
within like ten lines of Python code. If we find that most people want
some specific thing, we can add it to the library. So --
>> Audience: I have a question. So maybe we have not the ... [speaker away
from microphone.] So [indiscernible] asked me that. We have this tensor
model. Is it only the input that is variable, or is the output also
variable in length? It can be [speaker away from microphone.]
>> Junyuan Xie: It doesn't matter. Each bucket only has a fixed network
structure: a fixed number of inputs and outputs. Say your network can have
ten inputs and 20 outputs, or five inputs and ten outputs; you just create
a bucket for each possible case.
>> Audience:
[speaker away from microphone.]
>> Junyuan Xie:
Uh-huh.
>> Audience: I wanted to push on that, but [indiscernible]. You can do any
sort of weird graph you want; it's just a joint graph, right? So in this
case the attention graph says every target connects to every source. From
the bucket of size 40 to 50 you have a connector -- you have one target
connecting 50 things -- and for the bucket of 30 to 40 you have a
connector of 40 things. It's very difficult to do this in a declarative
style, like the CNTK style where you don't explicitly unroll the loop,
because then all of these cases, like one-to-many -- it's fine for
feed-forward and bidirectional, but once you have something like a parse
tree, where for a sequence of length N the actual graph is a complete
binary parse tree of size 2N, right, where the parse is [indiscernible] --
doing something like that in a simple meta-language that says how you want
to unroll it is not really possible. But this method lets you define this
graph and this graph and this graph, and say this is the one I really want
to use at runtime.
>> Junyuan Xie: Yes, basically you can have any number of graphs, they can
be totally disjoint, and they just share GPU memory. And there won't be
any synchronization problem, because they share the same NDArrays and that
is scheduled by the engine. So you don't need to wait for one executor,
one bucket, to finish before starting the next one; you can just go ahead
and do it.
Also, if you find yourself in need of new symbols that are not in MXNet,
you can create them in multiple ways. First, we have basic arithmetic
operations; you can just compose those to get your new symbol.
Or, if that doesn't work or you want more efficiency, you can write a
symbol in C++ by writing CUDA kernels, or with mshadow, which is similar
to Eigen -- it is a template matrix library -- and it will be able to run
on both CPU and GPU.
You can also define a symbol in a front-end language, in Python. It is a
little less efficient, but it is much easier to write than C++ and CUDA
code.
You can just take the NDArrays, get numpy arrays in Python, do whatever
you want with them, and then feed the result back.
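For reference, a sketch of defining a symbol in the Python front end,
modeled on the NumpyOp-style custom operator interface of that MXNet
generation (method names follow its documented examples, but treat the
details as assumptions):

    import numpy as np
    import mxnet as mx

    class NumpySoftmax(mx.operator.NumpyOp):
        """Softmax loss written entirely in numpy on the Python side."""
        def __init__(self):
            super(NumpySoftmax, self).__init__(need_top_grad=False)

        def list_arguments(self):
            return ['data', 'label']

        def list_outputs(self):
            return ['output']

        def infer_shape(self, in_shape):
            data_shape = in_shape[0]
            label_shape = (in_shape[0][0],)
            return [data_shape, label_shape], [data_shape]

        def forward(self, in_data, out_data):
            x, y = in_data[0], out_data[0]
            y[:] = np.exp(x - x.max(axis=1, keepdims=True))
            y /= y.sum(axis=1, keepdims=True)

        def backward(self, out_grad, in_data, out_data, in_grad):
            label, y, dx = in_data[1], out_data[0], in_grad[0]
            dx[:] = y
            dx[np.arange(label.shape[0]), label.astype(int)] -= 1.0

    # used like any other symbol
    data = mx.symbol.Variable('data')
    net = NumpySoftmax()(data=data, name='softmax')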
We also support writing CUDA kernels in the front end with runtime
compilation. NVIDIA has the NVRTC library that allows you to compile
kernels at runtime, and we can compile kernels for NDArrays at runtime.
We also support writing layers in Lua with Torch and then calling them
from MXNet [indiscernible].
So there are a number of ways you can do it. And thank you. That's my
talk.
[applause.]
[End of transcription.]