>> Kenneth Tran: Okay. So let's get started. Hello, everyone. It is my pleasure to introduce Tianqi and Eric from the University of Washington. Even though they are still students, they are working on a very cool project: a deep learning toolkit, a very powerful toolkit in the open source world, MXNet. For your information, Tianqi is also the main author of another very powerful open source machine learning package called XGBoost, for gradient boosted trees. So welcome, Tianqi and Eric, to Microsoft to give a talk on MXNet. [applause.] >> Tianqi Chen: Okay. Thank you, Ken, for the introduction. It is a great pleasure to give a talk here. Today we are going to mainly talk about MXNet, the insights that we learned from the system, and mainly the motivation, why we designed it the way we did. And Eric is going to talk about some [indiscernible] usage in the [indiscernible] use cases. First I'm going to acknowledge our collaborators, including contributors. In particular we have two Microsoft contributors here. Chuntao is from Microsoft Research and Tianjun is sitting in the audience; he is now in the Microsoft Bing Image Team. They joined us in the very beginning. And if you have questions related to internal [indiscernible], you are more than welcome to ask them if I am not the appropriate person to answer. We also want to thank Chuntao for suggesting the back end API, which is used for language understanding and has come in very handy recently. So this is the outline for the talk. Today we are going to talk about deep learning systems. As you know, there are a lot of different recent [indiscernible]s, and they each come with their own unique features. Ken had a very nice overview of the advantages and disadvantages of all these systems. MXNet is not there yet, but hopefully it will be. Today I want to discuss a very different perspective. All the existing surveys about these systems ask the question: what are the nice features of all these systems in their current stage? Let's ask a different question: how will each system evolve and what will it eventually become, given the full engineering power and all the things that you can throw into it? Because different systems provide different user interfaces, and those user interfaces will eventually limit their capabilities. The interfaces are like bounds on the systems, and the engineering efforts are like the means in the system, and these bounds will eventually determine the limit you can reach in terms of flexibility and the power of those systems. So today I am going to talk from this perspective. I'm going to talk about programming models. Normally, if you think about deep learning or any task in machine learning, you need to program it; that's the interface you have. There are two programming models that you can use. One type is imperative programming. Here this is an imperative program in numpy which gives you the result E, which is a very simple result. This is not deep learning yet, but as you all know, Microsoft has a toolkit called CNTK, the Computational Network Toolkit, and Google has another famous framework called TensorFlow. All of these operate on computational graphs. And bear with me, this is the basic thing that a deep learning toolkit needs to be able to do efficiently.
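As a rough illustration of the two programming models just mentioned, here is a minimal sketch (assuming NumPy and MXNet's Python package are installed; the variable names are only illustrative, not the exact slide code):

    import numpy as np
    import mxnet as mx

    # Imperative style: every statement executes as soon as it is written.
    a = np.ones((10, 10))
    b = np.ones((10, 10)) * 2
    c = b * a            # computed now
    e = c + 1            # computed now

    # Declarative style: declare the graph first, "compile" (bind) it, then run it.
    A = mx.sym.Variable('A')
    B = mx.sym.Variable('B')
    C = B * A            # nothing computed yet
    E = C + 1            # still just a graph description
    exe = E.bind(ctx=mx.cpu(),
                 args={'A': mx.nd.ones((10, 10)),
                       'B': mx.nd.ones((10, 10)) * 2})
    exe.forward()                      # computation happens here
    print(exe.outputs[0].asnumpy())    # array of 3s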
What imperative programming does is this: you implement all those operations, like matrix multiplication -- in neural networks there will be more complicated things like convolution and element-wise addition -- and the imperative program does those things step by step. So you write your program, and as you execute each clause, it basically translates into a kernel call on the GPU or some CPU function. On the other hand, another way that is actually more typical in deep learning platforms is called declarative programming. In declarative programming, you first declare the computations you want, and then you send those computational graphs to some compile function, or the equivalent of a compile function. Then you can use those compiled functions to get your result. So the difference from imperative programming is that in declarative programming you normally first declare everything you need, and the computation only happens after you do the compilation. A typical example of declarative programming outside of the deep learning context is SQL for databases, which I think many of you might be familiar with. Many of the existing deep learning frameworks use declarative programming. For example, if you think about configuration-file-based frameworks like Caffe or CNTK, you declare a network; that's basically the declarative part. In other frameworks like TensorFlow or Theano, the Python API is not very different from a configuration file, but it gives you a more flexible way of declaring the computations you need. So what are the differences between imperative programming and declarative programming in the context of deep learning? Think about deep neural networks. Here this is a simple neural network. What declarative programming gives you is actually a black box: usually you declare what the computational graph is -- here this is a one-layer neural network -- then you throw it into the toolkit, and the toolkit gives you the gradient, gives you the output. With this, the toolkit is able to give you whatever you require. On the other hand, if you do imperative programming, you need to write the script layer by layer. Basically here you have some functions like layer1.forward, layer2.forward, and you call back propagation. Of course, some popular systems will do the wrapping for you so that, for example, in Torch you can build all those configurations and the imperative part is actually hidden in the code. You can always dig into the code and hack it if you are a good hacker or researcher, for any advanced needs you have. So these are two programming styles. What are the differences? One of the advantages I'm trying to argue for declarative programming first -- because most of the frameworks follow the declarative programming paradigm, including all of the frameworks you see except for Torch, I guess, in the deep learning world -- is that the declarative program is more efficient, mainly because it allows more optimization, because you don't need to specify what procedure you need; instead you only say what you need. Take this, for example. Assume that I want to do these calculation steps. The left side is the imperative program and the right side is the declarative program. How much memory do you need to finish this calculation, basically in a Python [indiscernible]? On the left side you can see that there are four variables here.
And each variable is an array of, say, ten 64-bit floats. Normally it would take four arrays to finish this calculation. On the other hand, if you have a declarative program, the declarative program will find that C and D can actually share memory, because C is an intermediate value that isn't seen by the user. You can optimize and get three copies of memory instead of four copies of memory. This might be a very naive example, but it demonstrates the difference between declarative and imperative: in the imperative world, you specify the calculation procedure, and that might not be optimal. Another key property is that the imperative program needs to be prepared for all possible futures. Say that you are writing this program in a Python console. When you write up to this point, the imperative program doesn't know whether a variable will be used in the future or not, so it cannot directly reuse memory. Of course, the Python garbage collector can kind of recycle memory. But this is somewhat limited in the deep learning setting, for which we will give a more concrete example on the next slide. On the other hand, in the declarative case, when you do the compilation, the declarative program sees the entire computational graph as well as the boundaries. By boundary I mean that the program can see what the inputs and outputs are and what the intermediate stages are. And all these intermediate stages cannot be seen by the user, which means they can be freely optimized out, rewritten, or given the memory optimization we mentioned earlier. A more realistic example in the deep learning world is this case study about the memory cost you need to run an N-layer neural network if you only want to do prediction, in a production environment, instead of training. If you do imperative programming, normally what you will do is call these N forward functions here. Because most deep learning frameworks are optimized for both forward prediction and back propagation, normally the intermediate states are kept in those data structures, so they cannot be freed. And this is for the same reason: the imperative program needs to be prepared for the possible future of back propagation. On the other hand, in a declarative program, you declare the forward computational graph and call the compilation function; the compilation function sees the graph and knows that all those memories can be shared in place. As a result, you get a much lower memory cost when running a declarative program for prediction; normally you only need two copies of memory that alternate with each other. Yes? >> Audience: Can you explain why the memory can be shared across different layers? Because when you do back propagation, A, B, C belong to -- >> Tianqi Chen: Exactly. In this use case we don't do back propagation. So imagine we only want to do prediction; this is not about back propagation. If you are only doing prediction, in this use case, in this computational graph, this memory can be saved. That's exactly why the imperative program costs more memory: because it wants to be prepared for the possibility of back propagation as well as the possibility of only doing forward prediction.
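To make the prediction-only case concrete, here is a minimal sketch (not the slide code; the layer names and shapes are only illustrative, and the parameters are left uninitialized since only the memory behavior matters). Passing grad_req='null' tells the compiled executor that backward will never be requested, which is what lets intermediate buffers be reused:

    import mxnet as mx

    # Declare a small feed-forward net.
    data = mx.sym.Variable('data')
    net  = mx.sym.FullyConnected(data, name='fc1', num_hidden=128)
    net  = mx.sym.Activation(net, name='relu1', act_type='relu')
    net  = mx.sym.FullyConnected(net, name='fc2', num_hidden=10)
    net  = mx.sym.SoftmaxOutput(net, name='softmax')

    # grad_req='null': no gradients will ever be needed, so the planner
    # can share intermediate memory (the "two alternating copies" case).
    exe = net.simple_bind(ctx=mx.cpu(), grad_req='null', data=(32, 784))
    exe.forward(is_train=False, data=mx.nd.zeros((32, 784)))
    print(exe.outputs[0].shape)   # (32, 10)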
Of course, you can optimize this program by adding special cases like "I only want to do forward prediction," but that adds another layer of complication to an imperative program. So there are a bunch of optimizations that you can do for a declarative program because of its advantage: it sees the entire global scope of the computational graph instead of a local step. One example is dependency pruning. Here this is a forward propagation step, and on the right side is the graph corresponding to the back propagation step. Because there is an addition here, you can find that back propagation actually doesn't depend on the variable F. So this memory can be freed as early as the forward propagation step; we don't need to postpone that until the back propagation step. And if you only need some intermediate value of variable C, you don't even need to carry out all the computations. This is kind of a more advanced version of dependency pruning. Yes, you can hard code this into your framework, but declarative programming can give you that for free if you have a good optimizer. Another thing that you can do is operator fusion. Here, because C is an intermediate variable that is not referenced by the user, you can easily optimize it out. Say you provide a fused kernel that does the matrix multiplication plus some bias, which is commonly available in many frameworks; you can do this kind of operator rewriting to rewrite the program. However, you cannot do that imperatively, for the same reason: in the imperative program there is a possible future in which C might be referenced. And there is another optimization that I mentioned, and that is memory sharing. So I have advocated a lot for declarative programming so far, and it is good that many of the existing deep learning frameworks do follow the declarative programming pattern, in the sense that you declare the neural network and then the framework optimizes the computational structure. So why do we still need imperative programming? Otherwise everything so far is good and we could end this talk here. So why do we still need imperative programming? Imagine the same case for SQL in the database world. Imagine that you want to develop a web app for an online banking system, and you have a good SQL server like Microsoft SQL Server. It does all the good jobs for you: it does query optimization and gives you results quickly. But you still need some front end language like C# and JavaScript to interact with it. So why do web developers still need these front end languages? The reason is that declarative languages are kind of limited in what they can do; there are certain things that declarative languages are not good at. Take, for example, how you can do a parameter update in the declarative language. It's possible, but you do need to extend the language in some ways. TensorFlow needs to support mutation and [indiscernible] in the computational graph, which is not very natural and requires some engineering effort. Other more complicated examples include things like optimizations such as a limited-memory BFGS step, which requires a line search; think about how you would write a line search program in SQL, which I don't know how to write. Maybe there are some extended languages in the SQL world that can do that.
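As a tiny sketch of the parameter-update point above: imperatively, an update is just ordinary array code, with no need to express mutation inside a graph (the shapes and values here are made up for illustration):

    import mxnet as mx

    lr = 0.01
    weight = mx.nd.ones((256, 128))        # hypothetical parameter
    grad   = mx.nd.ones((256, 128)) * 0.5  # pretend this came from backward()
    weight[:] = weight - lr * grad         # in-place SGD step, issued asynchronously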
And there are other use cases, like variable input lengths, which are not very easy to describe with a predetermined computational graph, and which need advanced features in the imperative language to do the dynamic unrolling. One of the advantages of an imperative program is that it is deeply embedded into the host language. Although TensorFlow and Theano are claimed to support Python, actually, when you write TensorFlow and Theano programs, you are writing another language. It is like writing SQL in Python: it is not like you are writing the Python language, you are writing the SQL language. So while you are writing those declarative programs in Python, you are actually writing in the context of those declarative programs. However, if you write imperative programs in Python, you are actually writing in Python, because you can take advantage of all those advanced features like conditionals and [indiscernible]. Here is a very naive example. Say I have two neural networks: for input data that is bigger than ten, I use the first type of network architecture, and for the other inputs I use the second type of architecture. If you have an imperative program, you can very easily do the runtime dispatching and select the path that you want. Now, more complicated examples: you might want to look at the output of a neural network and decide which path you want to go down; this can be very common in reinforcement learning. And in the case of recurrent neural networks, say you want to do prediction: you want to dynamically unroll your predictions, and the length of the predicted sequence is not determined in the beginning. Of course, all these things can be hard coded into your system and optimized for specific use cases, but to support these things generically, we do need some flavor of imperative programming to do it easily. That's why we want to advocate a mixed flavor. MXNet kind of stands for this mixed flavor: we want to combine the power of imperative programming together with the optimized declarative programming. Hopefully you are now convinced that imperative programming is kind of more flexible and allows you to write whatever you need, say if you want to insert your own step into your training loop. On the other hand, the declarative program can be quite powerful in the sense that the system can optimize the memory, the computation, and everything behind the scenes for you. In MXNet we provide two sets of APIs that are coupled together in the sense that they work together as a whole piece. We have our parallel NDArray API that basically executes as you go; that's the imperative programming part. And we have a symbolic API that does the declarative programming part. So this is the example of the basic training loop in MXNet. What you basically do is write an imperative loop here, and we have a dictionary from the string key of each input to the corresponding array. Every iteration, you copy your input into the corresponding arrays; that is the imperative step. The architecture itself is compiled from the declarative language: you declare your whole neural network and compile it into the executor. You call the executor to go forward and backward. This is the optimized step that is optimized for memory.
Then you use the imperative step. This is the imperative step of SGD, but we can definitely put a more advanced step here, like the variance reduction methods developed here, or other types of optimizers. The fact that we are able to use the imperative step actually helps us a lot, in the sense that after we released MXNet, in the beginning we only had the momentum optimizer, and after a few months people contributed other optimizers like RMSProp and AdaGrad, and all these people are not original developers of the framework. Which shows that MXNet is kind of easier to use in terms of this imperative programming. On the other hand, it is still very efficient in the sense that all those portions are optimized by the declarative programming part. Another example that is actually quite exciting for me is the support for variable-length or variable-size input. Recently, this was actually the proposal of Jacob, who is in Microsoft. As you know, if you do sequence training, normally you have sequences of different lengths, and they usually correspond to network structures of different lengths or different shapes. Thanks to the imperative execution model that we have -- this was actually contributed by Chiyuan Xie -- he added support for the bucketing API, which basically switches by the input length, and you pick different executors. Each executor corresponds to a different network for that specific input, and all these executors can share memory. According to him, it only cost around ten lines of Python code, which may be a little bit exaggerated, but I truly believe him. It is basically a map or a switch in Python, and that's all you need to support variable-length input. So in summary, in terms of programming models, hopefully you are now convinced that there are two styles of programming models in deep learning languages. You can do declarative programming, which is very optimized -- engineers always like that, and it is good for production systems. On the other hand, imperative programming gives you the flexibility of inserting whatever you need into the control flow and doing runtime dynamic dispatching of the operations. The combination of these two can maximize your productivity, for one reason. The best thing that I learned from my undergrad is Amdahl's law from computer architecture: basically, you only want to optimize the bottlenecks in your system. This is the same for deep learning. Usually when we do extensions to a deep learning program or a deep learning architecture, the part that we extend may not be the bottleneck of the computation. Usually we can keep many of the existing pieces in the declarative language and use the imperative language to smoothly add the things we want to add, and still get the maximum performance. That is what motivates us to use this kind of mixed programming design. Okay. So now we have talked about programming models. Another thing that is very important is to develop a system that effectively supports these two kinds of programming models and allows them to mix together, especially in a multiple device environment. As we all know, when you have a machine, you can now put four GPUs into it, and in a distributed environment there are even more resources to utilize. It is very important to automatically utilize all resources to maximize your performance.
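Before moving on to the scheduling engine, here is a rough sketch of the mixed training loop just described (not the slide code; the network, shapes, and the dummy data below are all made-up illustrations). The network is compiled declaratively into an executor, while data copying and the SGD step stay as ordinary imperative Python:

    import mxnet as mx
    import numpy as np

    # Declarative part: declare the network and compile it into an executor.
    data  = mx.sym.Variable('data')
    label = mx.sym.Variable('softmax_label')
    fc    = mx.sym.FullyConnected(data, name='fc', num_hidden=10)
    net   = mx.sym.SoftmaxOutput(fc, label=label, name='softmax')
    exe   = net.simple_bind(ctx=mx.cpu(), data=(32, 784), softmax_label=(32,))

    # Dummy data standing in for a real data iterator.
    batches = [(np.random.rand(32, 784).astype('float32'),
                np.random.randint(0, 10, (32,)).astype('float32'))
               for _ in range(5)]

    # Imperative part: the training loop is plain Python.
    lr = 0.1
    for batch_data, batch_label in batches:
        exe.arg_dict['data'][:] = batch_data              # copy inputs in
        exe.arg_dict['softmax_label'][:] = batch_label
        exe.forward(is_train=True)
        exe.backward()
        for name, grad in zip(net.list_arguments(), exe.grad_arrays):
            if grad is None or name in ('data', 'softmax_label'):
                continue
            w = exe.arg_dict[name]
            w[:] = w - lr * grad   # plain imperative SGD; any custom rule could go here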
However, it's quite painful to write parallel or concurrent programs -- although I like them very much, it is very painful to write them, especially for specific cases. So one thing that we did is build a dependency checking engine that automatically parallelizes all the operations for you. Here is an example. These are four imperative calls of operations, and you can easily find that these two operations can run in parallel. What we actually have is a generic dependency engine that takes all operations, including the imperative and declarative executions, chains them together, and automatically parallelizes all the [indiscernible] operations you have. Here is a more realistic example. This is a bunch of code that you don't have to read, but the basic idea is that this is a two-layer neural network that runs on two GPUs, and you need to copy the data to the CPU for synchronization, do the update, and copy it back. This is the generic code of the data parallel paradigm. This is Python code that you can actually write in MXNet, and it corresponds to this computational graph. There are many interesting aspects of this computational graph. You can find that there is a computational path on the left side on GPU one -- on GPU zero, actually -- and on the right side is the computational path on GPU one. You do the forward/backward propagation, and after the first back propagation of the top layer, as soon as you get the gradient, you can copy the data to the CPU, do the update, and copy it back. One optimization that you want to do for this kind of parallel computation is to overlap your computation with the communication: as soon as you get this result, this memory copy can happen, and concurrently you can do another step of back propagation. You can find a similar pattern in the model parallel sense, in a multi-layer LSTM: as soon as you get the first time step ready, you can directly copy the result, the second GPU can kick off, and all these things can start concurrently. However, it is quite hard to hard code all these patterns and do those asynchronous communications. We do need a dependency checker to check all these imperative and declarative operations. One other key property I want to emphasize in our design of the scheduling engine is that it is designed to be transparent, in the sense that we don't want to make assumptions about what kind of operation you want to schedule. Basically it means that we can schedule anything, including code that isn't written by us. Okay. Eric will give a more detailed example of that, actually. However, there are a lot of troubles when you want to have such a dependency engine. One trouble is about memory recycling, surprisingly. Here are two dependency paths: you have two operations that depend on C and B, and then we have a memory deletion that is automatically triggered by the garbage collection. However, this memory deletion cannot happen before those operations are carried out. This is an implicit dependency that you really need to be able to check. Another interesting example is random number generation. Here you have two random number generations in your code. Seemingly they can run in parallel.
However, if you are familiar with random number generators, usually you have to run them in a serial manner, either for reproducibility -- so that every time you run, the code gives the same result -- or because the random number generator has state that cannot be shared and has to be serialized. In order to support these two kinds of operations, you really need to expand the dependency engine from an immutable scheduler -- which is a technical term you don't have to understand -- to something that supports mutation. In our framework, what we do is try to handle things in an abstract way. For all the possible resources -- memory, random number generators, and all those things -- we allocate a variable or a tag to attach to them. It's like the tag on your bag on a flight: you attach them to each of the resources. When you push an operation to the engine, you use these tags to specify which resources you want to read or mutate. The engine treats the operations as black boxes, in the sense that it simply uses those tags to track the dependencies and execute the operations when they are ready, but it doesn't actually know the content of the function. All these code examples will be translated into asynchronous calls that push into the dependency engine. You can plug in auxiliary functions here as well, including components that you developed here. This is a big picture showing how we do it. I am not going to walk through it, but you can always read about it in our blog post; we have a detailed design doc of what we did in the system. Basically we have multiple dispatching queues that are aware of the read and write dependencies. To recap this section: you really need a good dependency checker that checks all the dependencies if you want to build a powerful system for multiple GPUs or multiple resources. Having such a dependency engine makes things easier for you, in the sense that you don't have to hard code all the data parallelization patterns. If you want more parallelism, you simply write the Python code in a serial manner, and it will execute the dependent operations as soon as they are ready, which gives you the maximum performance. It can also be used in combination with other optimizations such as data compression, which optimizes the cost of each individual operation, in the sense that such a dependency checker helps you run more things in a concurrent manner. Okay. I mentioned two major motivations for us to build the system: we want to mix the programming models for maximum optimization and flexibility, and we have built the dependency scheduling engine to schedule arbitrary operations so that you can implement arbitrary parallelization patterns more easily. I will briefly walk through other features that might be of interest. MXNet comes with low memory consumption, mainly because we have that declaratively optimized memory. As an example, one of our friends actually trained on the entire ImageNet data set on a single machine with four GTX 980 GPUs, which are not very good GPUs, because we can use less memory than the existing systems. So we can fit bigger neural networks into those systems. It is also lightweight and portable, in the sense that it is available in all possible languages and on various platforms, and also available on Windows, of course. And you can run it on mobile devices.
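As a small sketch of the automatic parallelization just described (assuming a machine with at least two GPUs; the shapes are arbitrary): the two chains below touch different arrays on different devices, so the dependency engine can run them concurrently even though the user code is written serially, with no explicit threads.

    import mxnet as mx

    a = mx.nd.ones((2048, 2048), ctx=mx.gpu(0))
    b = mx.nd.ones((2048, 2048), ctx=mx.gpu(1))

    c = mx.nd.dot(a, a) + 1     # queued on GPU 0
    d = mx.nd.dot(b, b) + 1     # queued on GPU 1; no dependency on c

    mx.nd.waitall()             # block until everything pushed so far has finished
    print(c.shape, d.shape)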
One interesting thing that we encourage you to check out: we have a JavaScript version, which actually runs in your browser. You can open this link now if you have a laptop, and there is an image classification demo that runs in your browser. Microsoft does have an advantage here, in the sense that Microsoft Edge actually runs, I think, eight times faster than Chrome, because Chrome didn't optimize for the asm.js part. This is really due to the fact that the code is really portable: all dependencies are minimized, and we can compile the entire project into one file and compile that into JavaScript. And this comes -- >> Audience: Does the JavaScript version use the same back end code? >> Tianqi Chen: It uses the same back end code, yes. >> Audience: [indiscernible] -- runs the C++ code? >> Tianqi Chen: Yes, exactly. It's Emscripten that compiles the C++ code into JavaScript. But you do need to isolate all the dependencies in order to be able to do that. And this comes in very handy when you are running demos, for example. If you have a new model that you want to show to your colleagues, you can run a demo in their browser and you don't need a dedicated server for that purpose. It also runs on major cloud platforms. We have built-in support for AWS S3, and I think there is an ongoing project on supporting Azure integration as well. It runs on major scheduling engines such as MPI, Sun Grid Engine, and Hadoop YARN. This is a comparison on a single-GPU vision task. I think the major takeaway of this slide is that although we support a mixed style of imperative programming and declarative programming, we are on par with all the major systems, because we share the same kernels and all those things. And MXNet really optimizes for those Inception-style or residual-style architectures that have multiple paths, which gives us much, much more memory saving than existing frameworks, and we have a new optimization that is able to further optimize by another factor of two or three using graph rewriting techniques. Here is a comparative example of forward-only computation and forward plus backward, which answers a previous question about why forward-only computation costs less memory: simply because there are fewer memory dependencies, you can save more in a forward-only computation as compared with forward plus backward computation, and declarative programming gives that to you. On the other hand, imperative programming also gives you all the power you need to write your research code easily and port it into the production environment. This is the overall overview of the system. It is kind of very different from existing systems, because we built it in a way such that we want to have an NDArray model to support imperative programming, as well as the symbolic execution model to support declarative programming. Okay. I think I have finished the overview of the system and the general idea behind it. Here are some take-home messages that hopefully you can get from my part of the talk; Eric will give some tutorials about the detailed usage. MXNet stands for mixed programming and maximum productivity -- the MX. One important thing we do is automatic parallelism, the way to powerfully use all the parallelization patterns you can use without hard coding. And one of the things that I want to emphasize is that we are an open system.
It is like Lego bricks: you can plug MXNet into various parts and plug other bricks into MXNet. One important example is that we currently work with fully functioning Torch modules, which means that you can take any functions in Torch and use them in MXNet. And it also balances the flexibility of declarative programming and imperative programming to give you the maximum performance and flexibility. Okay. So I will hand over to Eric for the next part of the talk. >> Junyuan Xie: Thanks. So Tianqi has talked a lot about MXNet's design and its core features. Hopefully by now you are convinced it is a good system. Now I am going to talk about how you can use it. Since this is a fairly technical audience, I will assume knowledge of machine learning, Python, numpy, and such. We actually do have other front end language wrappers like Scala, Julia, R, et cetera, but Python is the most supported and it is the language I use, so I am going to give examples in Python. By the way, feel free to interrupt me. I know that some of you may have played with MXNet already, so if you have deeper questions, you can interrupt me anywhere to ask. So first, the symbolic graph -- is the mic working? Okay. So first, the basic usage of MXNet is the symbolic graph. Symbols are the basic units of computation. You can define symbols like this with a function call, and a symbol has an input, a name, and a number of parameters, like how many hidden units this fully connected layer has for its output. This is like a Caffe layer, and it is a coarse-granularity symbol. We also provide finer-granularity symbols like plus and multiply, and you can write FC1 plus one assigned to FC2. That is actually handled in the symbolic engine instead of just being an imperative clause. Graphs are defined by composing symbols. You first have X, which is a variable that basically serves as a placeholder for your input data. Then you feed it through the network: you have the fully connected layer, the activation, and the softmax. This gives you a network, which can be instantiated with memory to give an executor; here we use simple_bind, which allocates all of the memory buffers for you. You can also call bind with the symbol and provide your own arrays as buffers and gradients. If you use the bind method, everything you don't bind won't be computed if it can be avoided. If you don't bind a gradient for an array, it won't be computed and you won't be able to see it. Anything that you want as output, you have to bind it; otherwise it will be optimized out. After that, these basically work pretty much like numpy arrays, with some limited features. You can assign to the input, which is the data, and then you do a forward and then a backward. >> Audience: So after you declare a symbol and then you do some so-called imperative programming on those symbols, it is still declarative programming, right? Because you still program on the symbol. So it cannot be called a mixed program. It is still programming on the framework. >> Junyuan Xie: This is the symbolic graph. Here it is bound, and you can go forward and backwards through it. These are two black-box symbolic computational graphs. In the middle you can do anything you want with it. The output array is just an NDArray. You can take that, feed it to a CRF, and do some inference in the CRF.
Then you take the gradient back from the CRF and feed it to the backward; you can add arguments here. >> Audience: Is that after you run forward, meaning you called execute? >> Junyuan Xie: Yes, the executions ... [speaker away from microphone.] >> Audience: In other frameworks you can also call eval or something to get the values and then you can feed those values into whatever -- >> Tianqi Chen: The difference is that in most of the frameworks, after you call eval, you get a numpy variable. You need [indiscernible] GPU arrays to explore all the different operations. The second difference is that all those operations will be parallelized, in the sense that in the existing frameworks what you do is take that numpy array, but all those operations are not parallelized. So those imperative programs are actually [indiscernible] [speaker away from microphone.] >> Junyuan Xie: Okay, so -- >> Audience: I think this is really just a library function call in any language. What's the difference? >> Junyuan Xie: [speaker away from microphone.] >> Audience: All right, so let's say we do forward -- is it [indiscernible] output? >> Junyuan Xie: Yeah, it pushes. Not really. So after you compile this graph, the forward actually pushes a whole bunch of operations to the engine and then it returns without waiting for them to finish. So you push a whole lot of stuff into the engine, and then say the network has ten branches and you have ten outputs: the forward immediately returns, and you can take each output and do anything you want with it without waiting for the other nine. You see what I mean? >> Audience: [speaker away from microphone.] >>: -- all the things that are parallelized. [indiscernible]. Including, I think -- >> Audience: That is still, the parallelization still happens inside the library, right? >> Tianqi Chen: No, it is not only happening there; it happens inside the function call as well as outside the function call. After you call forward and backward, you do a parameter update. That parameter update can be parallelized together with the forward and backward calls. As soon as you get a gradient in the backward call, the parameter update can happen [indiscernible]. That's the major difference between the serial execution and the parallel execution. >> Audience: But parallelization still happens inside your library, right? >> Junyuan Xie: No, it happens in Python. >> Audience: [indiscernible] it happens in Python? >> Junyuan Xie: It happens in the engine, but it is issued -- you can issue everything in Python. >> Tianqi Chen: Basically you issue everything in Python in an imperative manner. That's the difference. Most libraries allow you to declare a graph and do whatever parallelization or optimization on that graph, but most libraries don't allow you to do parallelization across different graph executions. You execute the graph, you get the gradient, you do the parameter update -- that's a different operation. Most libraries don't allow parallelization between those two operations. But if you want to really support imperative operations well, you do want to be able to check all those dependencies and parallelize all those operations as a whole piece, because all these operations also run on the GPU, and -- for example, the parameter update needs to be done as soon as you get the gradient in the backward step, instead of waiting for all the backward steps to finish.
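A minimal sketch of the symbol composition, binding, and asynchronous forward/backward being discussed (not the slide code; the names, shapes, and the all-ones head gradient are just illustrative assumptions):

    import mxnet as mx

    x   = mx.sym.Variable('x')
    fc1 = mx.sym.FullyConnected(x, name='fc1', num_hidden=64)
    fc2 = fc1 + 1                     # fine-grained symbols compose like expressions

    exe = fc2.simple_bind(ctx=mx.cpu(), x=(16, 32))
    exe.arg_dict['x'][:] = mx.nd.ones((16, 32))
    exe.forward(is_train=True)        # returns as soon as the ops are queued

    out = exe.outputs[0]              # an NDArray; any imperative work can happen here
    head_grad = mx.nd.ones(out.shape) # pretend gradient coming back from outside
    exe.backward(head_grad)           # feed the output gradient back in
    print(exe.grad_dict['fc1_weight'].shape)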
>> Junyuan Xie: So basically here you have these network layers and each one has a few parameters. After you do a forward you can do a backward pass. Since this is a linear structure, it will execute from this one to this one and then backwards. The backward actually returns immediately after it pushes all the operations to the engine. Then when you do the update, you call update on each parameter, and if this were just a normal function call, you would need to wait for the backward to compute the gradient for each of the parameters before you could do the update, before the update returns. >> Audience: [speaker away from microphone.] >> Junyuan Xie: Uh-huh. >> Audience: You can do a function call asynchronously immediately, right? >> Junyuan Xie: Yes. But then you need the dependency engine. Basically that's why the dependency engine is good: it registers all of the read and write operations on each of the parameters, so you can issue everything without waiting for anything to finish. It will get automatically dispatched when it is ready to run. >> Audience: But the computing device is limited. >> Junyuan Xie: Uh-huh. >> Audience: Okay? And while you run something on that device, that device is occupied. You can't run another -- >> Junyuan Xie: So -- >> Audience: -- at the same time. So based on that you have the limitation of the device. The only thing you can actually parallelize is the memory copy [indiscernible] and some communication, right? >> Junyuan Xie: There are two cases. First, your network has multiple branches and some layers are small; running such a layer doesn't occupy the entire GPU, so if you can run two branches in parallel, it gives you a speed-up. The other is that this extends naturally to multiple GPUs: you can easily do model parallelism across multiple GPUs. >> Audience: Yes, but in the past this typically will slow down your overall computation ... [speaker away from microphone.] Every time you communicate between the GPU devices, or GPU and CPU -- >> Junyuan Xie: Well, you need to -- >> Tianqi Chen: That's why you need dependency engines. >> Audience: No, the dependency engine itself is not enough. [overlapping speech.] You need a special model structure for that, not just the dependency engine. >> Junyuan Xie: No. So basically if you just take a VGG network and put the first convolution on one GPU and the second on the other GPU, it won't work very well. But if you have a more decoupled structure, it will work better. Say these days they have the two-stream video recognition network that has one stream for RGB and the other for optical flow, and the two streams merge at the top. Then you can easily parallelize the two parts onto two GPUs. >> Audience: But if you want to get optimal speed, you still need some human hint to decide how you want to parallelize, where you want to launch from. >> Junyuan Xie: Well, you just need to say which symbol should be allocated to which device. >> Audience: You basically tell the system what is -- [overlapping speech.] >> Junyuan Xie: Yes, yes, you say which symbol goes to which device. >> Audience: Then I don't see any benefit other than parallelism between [indiscernible], because you still need the human to decide exactly which one goes where. It is still [indiscernible] you can also put things in ... [overlapping speech.] >> Junyuan Xie: Well, this is -- >> Audience: Two separate drawings.
You can compute each one independently on different GPUs and just have a merge at the top, so it is very similar. You can basically implement it and say -- [overlapping speech.] >> Junyuan Xie: Yeah, but that requires manual coding for each specific case, and once you get to four GPUs with crazy network structures, you will find yourself writing hundreds of lines of code for each specific network. >> Audience: No, no, no, that's not true. Actually, once you have more than four GPUs, typically you will slow down your [indiscernible] significantly. That's why in all our current implementations of cross-machine or cluster [indiscernible] we need to use some other techniques. You cannot attempt to use -- in your case, this is [indiscernible] across multiple machines, right? This is typically where the devices lag, especially when you cannot increase the minibatch size larger than [indiscernible]. So this is, if I understand what they are telling me right now. >> Junyuan Xie: Okay. One thing is that the Pascal GPUs are coming out in probably half a year, if they keep their promises, and they advertise something like ten times faster GPU-to-GPU communication with NVLink. So I think it will probably get to a point where inter-GPU communication is similar in speed to within-GPU communication, because they have these 3D memories. >> Audience: For example, in your case we can reduce communication [indiscernible] by 32 times by quantizing the gradients. >> Junyuan Xie: Well, you can. >> Audience: Even that is [indiscernible] now. Even reducing by 32 times, it is not enough. You understand? >> Junyuan Xie: It depends on your network structure. It depends on your application. For some applications it doesn't work; for some it will. >> Audience: Yes, but basically -- >> Tianqi Chen: For the -- >>: I don't understand. Are you saying that data parallelism doesn't work, and model parallelism and one-bit gradients don't work? What works? >> Tianqi Chen: One bit works if you can increase the minibatch size to very large. >>: Isn't there a toolkit? Shouldn't [indiscernible] for any toolkit? >>: No, if you increase the minibatch size too much, your final model is bad. >>: I don't see what that has to do with the toolkit. The toolkit supports model parallelism, data parallelism and -- you know, it could support one-bit gradients. I don't see. >> Tianqi Chen: No, what I mean is, even one bit is large. But what he said is that as long as we can improve the speed of the communication between GPU and GPU, it is enough. I mean it is not enough. That is why we have some [indiscernible] algorithms beyond one-bit quantization to make this work. In all those experiments, this kind of [indiscernible] is not enough. This is what I want to say. >> Audience: Yes, I want to say you need to increase the [indiscernible] dynamically, to increase the communication channel, so you can get the benefit. So this is the whole point: you need to increase the minibatch size. I have a related question. Actually, I like the dynamic scheduling for operations and for memory, but what concerns me is the table that you had in the earlier part of the talk where you compare your performance to something like Torch. Torch is kind of the other side of the spectrum, where the researcher or developer is building all these optimizations by hand. >> Tianqi Chen: Uh-huh.
>> Audience: And the kind of argument is that Torch should offer the flexibility, but MXNet and other symbolic or declarative programs would offer the optimization. >> Tianqi Chen: Yes. >> Audience: But in these tables I didn't see a huge difference in terms of speed-ups or memory. One part of that is maybe because all these networks are standard and everyone is using [indiscernible] for convolution and so on, but the other part of the question is whether you have experiments for something like machine translation models or LSTM language models, where you can do model parallelism and show a difference, if you have some benchmarks. >> Tianqi Chen: Yeah, yeah. >> Audience: The other part of the question is: if I do the standard stuff, all the toolkits are the same, but for the nonstandard stuff maybe an imperative program would be easier to work with than a declarative program, and if the difference in speed and memory is not that big anyway, why should I go for the declarative programs? >> Junyuan Xie: So basically for the standard networks, people working on Torch put a lot of engineering effort into optimizing them, so in the end they get to basically similar optimizations as we do here automatically. >> Audience: Their layers, their operation transitions, they have -- [overlapping speech.] >> Junyuan Xie: So, for example, for the ResNet, the network is so big and has so many layers that it doesn't fit into memory. What they did is, in the backward pass, they take a gradient blob and walk it down through the network so that you don't need a gradient blob for every layer. We do this automatically and we do many more optimizations on top of that, but for Torch, as a user you need to code the optimization for each network. You need to figure out what to do, and then you sometimes need significant changes to the infrastructure to do that. You certainly can, but it's kind of too much trouble. Otherwise you can add a gazillion options to the library so you can turn on each optimization, but it quickly becomes unmanageable. >> Audience: So actually, kind of a follow-up question. What you are saying is that for Torch they did the co-share, what you called co-share, by hand, right? >> Junyuan Xie: Yeah. >> Audience: So did you try to run the ResNet with MXNet on the same kind of GPU card, with the same memory limitations? Would the automatic co-share be runnable, or would it run out of memory? >> Junyuan Xie: Hmm, I personally haven't. We have people who have -- >> Audience: [speaker away from microphone.] that is the six, the six gig card, I see? >> Junyuan Xie: Yeah, so here we do most of the memory sharing that you can do, automatically. So this requires less effort when you are exploring new network structures. >> Audience: [speaker away from microphone.] >> Audience: Your last layer is not separate from the -- >> Tianqi Chen: That means [indiscernible] loss and activation. >> Junyuan Xie: Yeah. >> Audience: How hard is it to change the loss to a [indiscernible] loss that you want to define in MXNet? >> Junyuan Xie: You can do it -- it depends on how complicated your loss is. We have a softmax activation, which is only the activation instead of the activation with the loss attached to it.
And you can do anything on top of that by either creating new symbols or composing existing ones. >> Audience: A question along that line. So you couple the softmax activation and [indiscernible] together. After you train the model, how can you use the model at test time? >> Junyuan Xie: At test time, you just don't call backprop and it will work. >> Audience: But if you don't call backprop, you still call forward, right? >> Junyuan Xie: Forward doesn't require a label. >> Audience: So that is a [indiscernible], so going forward it is not output to a loss function. It is output to -- so it is -- [overlapping speech.] >> Audience: Is this a special case or -- >> Junyuan Xie: No. In all of our output layers we don't output the loss. The loss is computed separately, and all these layers just output whatever they are supposed to output instead of the loss. The loss is computed separately if you want it. >> Audience: But how do you compute the gradient coming from the loss to the model? >> Junyuan Xie: The gradient actually doesn't come from the loss. Imagine you are doing SoftMax. For each node you have the activation, right? The gradient is just that, minus one, right? So the gradient doesn't require you to explicitly compute the loss. >> Audience: [speaker away from microphone.] >> Junyuan Xie: So you don't need to. >> Audience: But sometimes the loss function might impact the way that you compute the gradient. >> Junyuan Xie: Well, for that layer you can compute the loss. In this case you don't need to. >> Audience: Okay. >>: Is there a way to chain different graphs together, if you wanted to do a [indiscernible] network? >> Junyuan Xie: Exactly. We are going to talk about this in a following slide. So I want to talk a little bit about naming. In MXNet, every symbol and every array in the symbolic graph has a name, and the names are not arbitrary: we have certain naming conventions that can impact behavior. You can create a variable placeholder by declaring a variable, and all the other output arrays and input arrays will be named by the symbol name plus a default array name. For example, fc1_weight and fc1_output, which didn't appear in the previous slide, will be allocated and named by default. In the default model and update, you don't have to use this, but if you are just using a standard feed-forward neural network, this will probably make things much easier. If you are using this, all the arrays whose names end with "weight" will be treated as weights: they will get weight decay and be initialized as weights, while the biases will be initialized to zero and won't be decayed. Data blobs from the data iterator will be copied to the NDArrays with the corresponding names. So these are like variable names in your program; they are not just for display, and they are not arbitrary. I also just want to quickly mention that I/O in MXNet is different from TensorFlow and Caffe, in that in TensorFlow and Caffe it is just a layer in the graph: it takes nothing and spits out data. Here we separate the computational graph from the data iterator, because we want to make the forward and backward cleaner, so that it acts as a function that takes input and spits out output. These are separated, and at runtime you copy the data into the model, into the executor. We provide a number of default data iterators, like the NDArray iterator, which allows you to use numpy arrays as input.
We also have an image iterator for ImageNet and other image-based data sets, and we also have an MNIST iterator. You can easily create other data iterators in the front end language, like Python, by subclassing the base data iterator class. Uh-huh? >> Audience: So does the data iterator support [indiscernible], or do users have to handle the data [indiscernible]? >> Junyuan Xie: Well, it does some local shuffling but not a global one, because a global shuffle can only be efficient on SSD. So -- >> Audience: [speaker away from microphone.] >> Junyuan Xie: So we take 1,000 examples and shuffle those, instead of taking the whole data set and doing a complete shuffle. >> Audience: You do it blob by blob? >> Junyuan Xie: Basically, for this [indiscernible], it also does [indiscernible] global random shuffling. >> Audience: Okay. >> Junyuan Xie: So it is not the full shuffling you would do with a random-access array; that can be very inefficient on an HDD disk. >> Audience: [speaker away from microphone] few iterations? >> Junyuan Xie: It is specific to -- [overlapping speech.] >> Audience: Is it across every -- >> Junyuan Xie: Well, this one is just in memory, so you can do any shuffling you want. And this one is on disk, for large images, so we only do local shuffling and block shuffling. And this one I think is not shuffled, because it is just an example; it makes it convenient to show MNIST results. When you subclass this and create your own data iterator, you can do whatever you want. >> Audience: But how do you know which iterator supports shuffling [indiscernible]? >> Junyuan Xie: It's in the documentation, I think. >> Audience: Okay. >> Audience: By the way, does it handle sequences nicely or not? >> Junyuan Xie: Yeah, recently we added bucketing for sequences, which I will explain shortly. So after you have the data and the graph, you can train pretty easily by first creating a model object. You specify the contexts you want it to run on: you can feed it GPU zero to GPU four and it will parallelize over them for you. So here is an example. If you use VGG or GoogLeNet -- this is actually GoogLeNet, which has very few parameters but a lot of computation -- on a single machine with four GPUs we get basically linear speed-up with multiple GPUs. And all of that is automatic; you don't need any extra coding for that. I will also cover recurrent networks and more complicated examples later. And since the executor will optimize everything out if you don't need it, this can make debugging harder, because sometimes when your model is blowing up and you don't know why, you want to print the intermediate gradients or weight arrays during training to see whether some of them are doing weird things. To do that, we provide a monitor class that hooks into the executor, and immediately after each array is computed we push another operation saying: take this array, compute some statistics on it, and return the statistics. So you create a monitor saying print every ten batches with this statistics function -- this one is just the standard deviation -- and compute it on every array that matches this pattern; this is regular expression matching, and the results are sorted by name. Then you bind the executor and install the monitor on it, or you can provide the monitor to the fit function, and you will see these statistics printed out every ten batches. >> Audience: I have a question. So I think debugging is very important for this, for a lot of things.
So do you also offer some kind of way to have some [indiscernible] or whatever, some other -- >> Junyuan Xie: A what? >> Audience: [indiscernible] -- so you want to look at it, print the value, and then you have steps -- >> Junyuan Xie: Well, here you can return arbitrary things and we will print them for you. But whatever you do here -- remember that this is pushed into the engine, so you cannot do blocking calls here. Basically anything asynchronous or arithmetic is fine; just don't call asnumpy here. Don't take the NDArray data and try to copy it to the Python front end. It will be copied and printed later, but don't copy it here. >> Audience: I also see, if I want to bridge [indiscernible], so we have the code already. >> Junyuan Xie: We do have gradient checking. >> Audience: It is in Python code. >> Audience: Oh, okay. It is not -- >> Junyuan Xie: Also, this does add a performance cost, because you are computing all this stuff and it is blocking the computation at each step. Only use it when you need it, and probably set this interval to a bigger value. Okay, so we built MXNet with the idea that it should be an open system that is easily extensible, and we already have some examples of that. We can call Torch tensor functions -- this would be ones or SVD or all those Torch functions -- with MXNet NDArrays, and they are executed in the engine asynchronously. It is pretty transparent; it is just as if you were using MXNet's own functions. We can also integrate Torch nn layers into the MXNet symbolic graph, so you can take a Torch script that defines a layer -- it has forward, backward, and gradient -- and embed it into the MXNet symbolic graph as one symbol. This allows you to migrate any existing work you have in Torch to MXNet pretty easily. Basically, when you develop new algorithms, you pretty much define a set of new layers that you want to use, maybe two or three. You can put them into MXNet and you won't lose any existing work. We are also working on Caffe integration. It is harder because Caffe doesn't have a fully functional scripting interface, so we need to fake Caffe's headers and compile with it. It is a lot of hacking. By the way, for Torch and potentially TensorFlow it is pretty easy, because they have this scripting interface. Here we show how you can call Torch tensor functions. They are provided in the mxnet.th module. You can create random arrays and call arithmetic operations on them. You can create multiple lines and then do arithmetic computations. These are all async, so when you call this, it returns immediately and all the execution happens in the engine. You force synchronization whenever you take the output and try to get it from Python as numpy arrays. These take MXNet's NDArray structure, which is similar to numpy; a lot of the time it can be used with numpy arrays transparently, as if they were numpy arrays. And all of these are pushed to the engine. Here we show how you can use Torch nn layers as MXNet symbols. These are just like any other symbols in MXNet. You have a Torch module symbol for the normal layers and a Torch criterion symbol for the Torch loss layers. The Lua string is an initializer: a Lua command or function that returns a Lua object representing a neural network layer. After you construct this graph, it works pretty much the same as any other MXNet graph.
So this allows you to migrate existing work, but since Torch is imperative, we cannot do any memory optimization with it. These layers are not memory optimized: everything stays allocated and won't be freed, unless you are only doing forward, in which case we know for certain that you are not going to call backward on it again and it will be freed. So use MXNet's layers when you can; if you don't have an MXNet layer for something, call Torch. So here we show how you can write a custom training loop. This is more advanced, for when you are doing something other than plain feed-forward networks. Say you have a fully convolutional network that operates on different-size images and you want to do the spatial pooling thing and generate a different number of outputs from each batch; you can do that with your own training loop. The basic loop looks like this. You take a symbol and bind it to create an executor. You take a data iterator; for each epoch you reset it and then enumerate the batches from the data iterator. For each batch you load the data into the executor -- this is basically a bunch of assignments -- and then you call forward and then backward. In the middle, between forward and backward, and before and after, you can do anything you want with the output, like apply a CRF to it, or take this executor's forward output and feed it to the next executor's input, then call that executor's backward and feed the gradient back into this executor's backward as the argument here. Here it doesn't have any argument because the MNIST network has a loss layer at the end, so it doesn't need a gradient input, and then -- >> Audience: So how does it know which ones are missing from the -- basically the terminal nodes in the graph? Like if it tries to execute, something is missing and it will throw an exception at runtime or something? You know what I'm saying? There are some nodes in the DAG -- in this scenario there are some nodes that don't have anything coming into them, right? But you still -- that's where you have to pass in the partial derivative with respect to that output for those guys. >> Junyuan Xie: Well, so here whenever you bind, you will need to provide every input and the gradient for it. If you don't provide them, they will be optimized out if possible. >> Audience: I see. >> Junyuan Xie: If they can be. So if you don't provide a weight, it is pretty bad; it is going to be a random -- >> Audience: But let's say you have a softmax network, or you run a [indiscernible] over a sequence and you take the final hidden state and pass it as input to another network, right? Then you need to pass the gradient back to that final hidden state. Where do you tell the network, hey, I'm going to do this? Do you have to bind a variable to that -- >> Junyuan Xie: Hmm, the backward of the other executor will give you the gradient for every input. >> Audience: Okay. >> Junyuan Xie: For every input you require it to fill; the others will be optimized out. For the ones you have, you can take that and assign it to the head gradient here. >> Audience: Then the [indiscernible] first one knows that it is missing, right? If you didn't provide it ... [speaker away from microphone.] >> Junyuan Xie: So if the final layer is not a loss, it will need a head gradient to do backward; if you don't provide one, it will throw an error.
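Sketched out, the custom training loop described above looks roughly like the following. Here net is assumed to be a symbol that ends in a loss layer (say, a SoftmaxOutput named "softmax", so its label input is "softmax_label" and backward needs no head gradient), train_iter is a data iterator producing flat batches, and num_epochs and lr are placeholder hyperparameters.

```python
import mxnet as mx

# Bind the symbol once to create an executor; the kwargs give the input shape.
batch_size = 128
exe = net.simple_bind(ctx=mx.gpu(0), grad_req='write', data=(batch_size, 784))

for epoch in range(num_epochs):
    train_iter.reset()
    for batch in train_iter:
        # Loading the batch is just a set of assignments into the bound arrays.
        exe.arg_dict['data'][:] = batch.data[0]
        exe.arg_dict['softmax_label'][:] = batch.label[0]

        exe.forward(is_train=True)
        # Any imperative computation on exe.outputs can go here.
        exe.backward()   # no head gradient needed: the graph ends in a loss

        # Plain SGD update; grad_dict lines up with arg_dict by name.
        for name, grad in exe.grad_dict.items():
            if grad is not None and name not in ('data', 'softmax_label'):
                exe.arg_dict[name][:] -= lr * grad
```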
>> Audience: So it assumes that whatever the final thing in the actual DAG is, that is going to be the one you provide it to? >>: [speaker away from microphone.] >> Junyuan Xie: Yeah, so you need to specify which one you want, and then that will be the head gradient and you need to assign to it. >> Audience: Okay. >>: So are you [indiscernible] with the forward-backward update somehow? Or do you stop each time it goes to the next one? Is there some internal queuing in the iterator? >> Tianqi Chen: It depends on the engine [speaker away from microphone.] >> Junyuan Xie: So all of these are scheduled. The forward will return immediately, this will return immediately, and the [indiscernible] there also pushes the computation of the momentum and the weight updates into the engine. The only thing that is synchronized is the metric update, because you need to compute the metric and print it out. So this is synchronized, and it is synchronized on the output, so it doesn't depend on the backward. As soon as the operations in the forward finish, this returns and we start the next batch. So the backward could still be running while we are already doing the forward for the next batch. >> Audience: Right, right. So when you are blocked on an update, you are not doing any I/O, right? You are -- [overlapping speech.] >> Audience: -- to get the next -- >> Tianqi Chen: [speaker away from microphone.] >> Audience: Well, the next batch. >> Junyuan Xie: That's what the data iterator does. It does the fetching. >> Audience: Just the fetching? >> Junyuan Xie: Yeah. >> Audience: Do you have an example we can see of the things that you put in the middle of forward [indiscernible]? >> Junyuan Xie: [indiscernible]. >> Audience: Nothing that tells us you have [indiscernible]? >> Junyuan Xie: Not in the slides. >> Audience: Okay. >> Junyuan Xie: It is basically pretty easy. You take these outputs -- these are just arrays -- and you do whatever you want with them. >> Audience: And that is the output from the forward? >> Junyuan Xie: Yeah. And for backward it is these grad arrays. So as I said, you can chain multiple executors and insert some imperative computation, and we also have a data-parallel executor manager that allows you to easily do data parallelism across multiple GPUs or multiple machines. >> Audience: Is there a way to ... [indiscernible] to look at the particular frame? Do you have that? >> Junyuan Xie: [indiscernible]? >> Audience: [indiscernible] frames, so now just pick all the frames, the second frame also. >> Tianqi Chen: Yes, so you -- [overlapping speech.] >> Tianqi Chen: The frame? Layer? [speaker away from microphone.] >> Audience: The time, you know. >>: The video streams? >> Audience: Yes, for the [indiscernible] stream. >> Junyuan Xie: Basically, you don't bind whatever you don't need, and it will be optimized out. If you don't need to compute something, and it is not in the middle of other things you want, you just don't compute it. >> Audience: Even for samples [indiscernible] tables? >> Junyuan Xie: Well, it depends on how you define your graph structure. Basically it is a DAG: say you need this output here; then whatever leads only to an output that you don't need, that branch won't be computed if you don't bind it.
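For the executor-chaining case discussed above, the pattern is roughly the following sketch. Here exe1 is assumed to end in a plain output rather than a loss, exe2 consumes that output through an input named "data" and does end in a loss, and exe2 is bound so that the gradient for its "data" input is written.

```python
# Forward through the first executor, feed its output into the second.
exe1.forward(is_train=True)
exe2.arg_dict['data'][:] = exe1.outputs[0]
exe2.forward(is_train=True)

# The second graph ends in a loss, so its backward needs no head gradient.
exe2.backward()

# exe2's gradient w.r.t. its input is exactly the head gradient exe1 needs.
exe1.backward(out_grads=[exe2.grad_dict['data']])
```

All of these calls are pushed to the engine asynchronously, so no manual synchronization is needed between the two executors.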
>> Audience: [speaker away from microphone.] If you have a sequence as [indiscernible], so you have sequences -- are you treating the whole sequence as a unit, or do you treat each sample as a unit? >> Junyuan Xie: Well, you can do either. Basically, if you treat each sample as a DAG, then you need to feed one to the next in Python. If you bind the entire sequence as one DAG, then you just provide the input and get the output -- whatever suits your application. >> Audience: So maybe ... >>: So this will be different, for example, if you treat this as a whole sequence. >> Junyuan Xie: Uh-huh. >> Audience: In your whole graph, you probably only have -- for example, if you have a recurrent connection here, certainly you can do batch computation for the lower layers and the upper layers, but not that layer. Does this handle it or not? >> Junyuan Xie: I don't quite ... as far as you mean, so -- >>: This is the kind of application that you can write. [overlapping speech.] >>: You need to write this in Python to do this batching by yourself? >> Tianqi Chen: If you need the batch -- >>: No, this could be optimized inside the engine, but you are not doing that right now, right? >> Junyuan Xie: What exactly can be optimized? So you have, you are saying -- [overlapping speech.] >> Audience: You have a recurrent network. >> Junyuan Xie: Uh-huh. >> Audience: Let's consider a simplistic [indiscernible] where only a certain hidden layer is recurrent. You have five hidden layers, and you only have self-recurrency at a certain hidden layer. >> Junyuan Xie: Uh-huh. >> Audience: And if you treat each sample as an independent graph, you need to unroll the whole thing into a very big DAG and compute each sample independently. So you can parallelize, but each one is a very small computation [indiscernible]. >> Junyuan Xie: That's where the dependency engine comes in. [overlapping speech.] So -- >> Audience: What I mean is, this is very slow. An alternative solution is to compute everything below that certain layer as one huge batch computation, so it is very efficient. Everything above that is also a whole batch [indiscernible] dependency. Only the layer which has the recurrent connection is computed sample by sample. This cannot be -- [overlapping speech.] >> Junyuan Xie: Well, the symbolic graph doesn't impose a batch size across symbols, so you can change the batch size wherever you need. And as for the dependency engine, if you bind one executor for each time step, the forward will return immediately after it pushes all the execution into the engine, and then you can directly use the output, even though it hasn't been computed yet. You can feed it to the next executor and the dependency engine will schedule everything for you; you don't need to do any synchronization manually. >> Audience: I am not talking about [indiscernible]. We need to talk -- >>: This is too much of a subtlety. I know what you are talking about, but I think it's a very subtle issue that is hard to explain. Once you draw it out, I think we can talk about it afterwards, right? >>: Yeah, we probably need to [indiscernible] that. >> Junyuan Xie: So, bucketing. Tianqi already talked a little bit about how we do bucketing for variable-length or variable-size inputs. If you have a fully convolutional network with variable image sizes, for example for image detection, you can allocate multiple executors and have them share memory, given that you are not going to call them in parallel.
They share memory, so your memory cost will be the cost of the largest executor. You can have one executor for each image size, cache them, and whenever you get an image you put it into the corresponding executor. Question? >> Audience: Yes, what is the overhead of spinning up an executor? >> Junyuan Xie: If you are sharing memory, it is pretty fast. We didn't measure it exactly, but it is basically some graph search -- a DFS on, say, a 100- or 200-node graph. We need to compute the dependency structure. >> Audience: [speaker away from microphone.] >>: No, it's done ahead of time. >> Junyuan Xie: No, for each executor. If every mini-batch has a completely new size, so that no two mini-batches share the same input size, you will need to create one for each. >> Audience: [speaker away from microphone] one bucket, your mini-batch. Well, you have one bucket for each mini-batch? >>: Well, you have one bucket type for each mini-batch, but the cost is paid ahead of time. So you define how many buckets you have ahead of time and then you assign each batch to one, but still it's a one-time cost, so it is not anything. >> Audience: When you do the computation, do you do it one by one? Or everything in one batch, in one kernel operation? >> Junyuan Xie: Well, it depends on what you want to do. One kernel feed-forward cannot operate on different-sized images, right? If you have different image sizes within one batch, you will have to go with the maximum: make the batch the maximum size and then pad the smaller images, because the GPU doesn't work that way. You can do the forward -- >> Audience: [speaker away from microphone.] buckets of five or ten, right? So you -- [overlapping speech.] >> Junyuan Xie: Yes, you can assign images with very similar sizes to each batch and save some overhead. >> Audience: Yeah, actually in [indiscernible] we group everything together with [indiscernible]. So we use not just one, to do 200s, one computation, one for [indiscernible] processing. But that is only for the speech cases. >> Audience: Not just speech, everything. >>: It is -- [overlapping speech.] >> Junyuan Xie: Suppose you have two images: one is 400 by 800, one is 200 by 400. Then your kernel will need to operate on 400 by 800. >> Audience: No, no, you can -- basically you can do padding. >> Junyuan Xie: Yeah, that's what I'm talking about. So if you have images with variable size within a batch, you have to pad the smaller images to the larger one. >> Audience: Yes, this is what I am asking you. So are you batching this together, as much as possible, into one kernel call, or do you actually run single batches multiple times? >> Junyuan Xie: It depends on what you want to do in Python. It can be done within like ten lines of Python code. If we find that most people want some specific thing, we can add it to the library. So -- >> Audience: I have a question. So maybe we have not the ... [speaker away from microphone.] So [indiscernible] asked me that. We have this tensor model. Is it only the input that is variable, or is the output also variable length? It can be [speaker away from microphone.] >> Junyuan Xie: It doesn't matter. Each bucket just has a fixed network structure: it has a fixed number of inputs and outputs. Say your network can have ten inputs and 20 outputs, or five inputs and ten outputs. You just create a bucket for each possible case.
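A rough sketch of the one-executor-per-input-size pattern follows. Here net is a placeholder fully convolutional symbol, image_batch is a placeholder input NDArray, the parameters are freshly allocated for brevity (in practice you would reuse your trained parameter NDArrays), and the shared_exec argument to bind is assumed to be the memory-sharing mechanism being described; treat the exact keywords as version-dependent.

```python
import mxnet as mx

ctx = mx.gpu(0)
executors = {}   # one cached executor per input size

def get_executor(h, w):
    key = (h, w)
    if key not in executors:
        arg_shapes, _, aux_shapes = net.infer_shape(data=(1, 3, h, w))
        args = [mx.nd.zeros(s, ctx) for s in arg_shapes]   # placeholder params
        auxs = [mx.nd.zeros(s, ctx) for s in aux_shapes]
        shared = next(iter(executors.values()), None)      # reuse the memory pool
        executors[key] = net.bind(ctx, args, aux_states=auxs,
                                  grad_req='null', shared_exec=shared)
    return executors[key]

# Route each incoming image to the executor matching its size.
exe = get_executor(400, 800)
exe.arg_dict['data'][:] = image_batch     # image_batch: a (1, 3, 400, 800) NDArray
exe.forward(is_train=False)
prediction = exe.outputs[0]
```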
>> Audience: [speaker away from microphone.] >> Junyuan Xie: Uh-huh. >> Audience: I wanted to push that, but [indiscernible]. You can do any sort of weird graph you want; it's just a disjoint graph, right? So in this case the attention graph says every target connects to every source. So for the bucket of size 40 to 50 you have a connector -- you have one target connecting to 50 things; for the bucket of 30 to 40, you have a connector of 40 things. It's very difficult to do in the declarative style, like the CNTK style, where you don't explicitly unroll the loop, because then all of these cases like one-to-many -- it's fine for feed-forward and bidirectional. But once you have things like, say, a parse tree, where for a sequence of length N the actual graph is a complete binary parse tree with 2N nodes, right, where the parse is [indiscernible] -- doing something like that in a simple meta-language saying how you want to unroll it is not really possible. But this method lets you define this graph and this graph and this graph and this graph, and this is the one I really want to use at runtime. >> Junyuan Xie: Yes, basically you can have any number of graphs, and they can be totally disjoint; they just share GPU memory. And there won't be any synchronization problem, because they share the same NDArrays and those are scheduled by the engine. So you don't need to wait for one executor, one bucket, to finish before starting the next one; you can just go ahead and do it. Also, if you find yourself in need of new symbols that are not in MXNet, you can create them in multiple ways. First, we have basic arithmetic operations; you can just compose those and get your new symbol. Or, if that doesn't work or you want more efficiency, you can write a symbol in C++ by writing CUDA kernels, or with mshadow, which is similar to Eigen -- it is a template matrix library -- and what you write will be able to run on both CPU and GPU. You can also define a symbol in the front-end language, in Python. It is a little less efficient, but it is much easier to write than C++ and CUDA code: you just take the NDArrays as numpy arrays in Python, do whatever you want with them, and feed the result back. We also support writing CUDA kernels in the front end with runtime compilation: NVIDIA has the NVRTC library that allows you to compile kernels at runtime, and we can compile kernels for NDArrays at runtime. And we also support writing layers in Lua with Torch and then calling them through the Torch [indiscernible]. So there are a number of ways you can do it. And thank you. That's my talk. [applause.] (End of transcription.)
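To illustrate two of those routes, here is a small sketch: a new symbol composed purely from built-in arithmetic operators, and a front-end operator written in Python against the NumpyOp-style interface of that era. The class name, the need_top_grad flag, and the exact method signatures are assumptions based on the front-end operator mechanism being described, so check them against your MXNet version.

```python
import mxnet as mx
import numpy as np

# Route 1: compose existing operators into a new symbol (here a simple L2 loss).
pred  = mx.sym.Variable('pred')
label = mx.sym.Variable('label')
l2_loss = mx.sym.MakeLoss(mx.sym.square(pred - label))

# Route 2: define an operator in the Python front end. The data is copied out
# as numpy arrays, processed, and written back, so it is slower than C++/CUDA.
class NumpySigmoid(mx.operator.NumpyOp):
    def __init__(self):
        super(NumpySigmoid, self).__init__(need_top_grad=True)
    def list_arguments(self):
        return ['data']
    def list_outputs(self):
        return ['output']
    def infer_shape(self, in_shape):
        return [in_shape[0]], [in_shape[0]]
    def forward(self, in_data, out_data):
        out_data[0][:] = 1.0 / (1.0 + np.exp(-in_data[0]))
    def backward(self, out_grad, in_data, out_data, in_grad):
        y = out_data[0]
        in_grad[0][:] = out_grad[0] * y * (1.0 - y)

sigmoid = NumpySigmoid()
net = sigmoid(data=mx.sym.Variable('data'), name='custom_sigmoid')
```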