>>: It's my pleasure to introduce Yujia. He's finishing his Ph.D. work at the University of Toronto, working with Rich Zemel in the machine learning group. He is not new to Microsoft. He did an internship in 2014 in the speech and dialog group and in 2015 at MSR Cambridge in the machine learning group. His work in the speech group in 2014 got wide media coverage, and his Ph.D. work is published in top-tier machine learning and vision conferences. He's focused mostly on unsupervised learning and learning structured models. So please welcome Yujia Li.

>> Yujia Li: Thank you.

[applause]

>> Yujia Li: Thank you, Gupta [phonetic], for the kind introduction. My name is Yujia. Before I start, a little bit of introduction about myself. I did my undergrad at Tsinghua University in China studying computer science, which is where I started my exploration of machine learning and artificial intelligence. During my undergrad I did an internship at Baidu, the leading search engine in China, where I applied the machine learning techniques I had learned to real-world problems. I then went on to do a master's degree at the University of Toronto and continued on to a Ph.D., as was just mentioned. I did two internships in the past two years at Microsoft Research: one in 2014 in the speech group and another in 2015 at MSR Cambridge, which is where the project I'm presenting right now comes from. And I expect to graduate this fall.

During my Ph.D. I've worked on a few different things, with two key focuses. The first is developing structured models and learning methods for structured problems. I started with graphical models, studied the use of high-order potentials, and developed more expressive potentials as well as efficient inference and learning algorithms for them. More recently, I moved on to study structured neural network models and tried to combine and connect graphical models with these kinds of structured neural network models for structured problems. The other focus of my research is unlabeled data and unsupervised learning. I've worked on generative models of data, developing new training algorithms for learning generative models and learning representations of data. Now I'm working on analyzing and relating different learning algorithms for generative modeling. I've also worked a little bit on semisupervised learning, making use of unlabeled data to improve prediction performance, and on semisupervised learning for structured prediction problems as well.

In this talk, I'm going to talk about gated graph sequence neural networks for making predictions, and sequences of predictions, on graphs. This is something I did last summer at MSR Cambridge, and I've been working on it ever since. There are many different forms of data and problems that have graph structure. The molecules studied in computational biology and chemistry are graphs of atoms and chemical bonds, and people are interested in finding out whether a certain molecule has a certain property or not. Knowledge bases are graphs of concepts and their relations, and we may be interested in learning representations for the facts such that they can be used in other places more easily. Even logical reasoning can sometimes be formulated as a graph prediction problem: pairwise predicates relate two variables and therefore form a graph, and reasoning can then be formulated as making predictions on that graph.
The motivating application for this project, however, is analyzing dynamic data structures created in heap memory, for example linked lists and trees, where pointers link memory nodes together and naturally form heap memory graphs. Analyzing the heap memory states can tell us how the program behaves, and that can be very useful for verifying the correctness of the program and analyzing its properties. So that's the motivating application for the project, but many other applications are also possible with graph-structured data and problems.

To make predictions on graphs, it's necessary to have a good representation for the graphs and the different components in the graphs. The most straightforward way to come up with representations for graph-structured data is to use handcrafted features, like graph fingerprints and others. Usually people design some simple graph properties as features, and those features are used in different tasks. The problem is that whenever you want to solve a new task, you usually need to design new features adapted to that task, and this is very time-consuming. On the other hand, handcrafted features are usually very simple and therefore not powerful enough to solve more complicated tasks. There are also approaches that define graph kernels instead of defining the features themselves. This is a little bit better than designing features, but they are still quite limited in what kinds of problems they can solve. Another interesting approach is to use random walks on graphs. The idea is actually very interesting: you run a bunch of random walks on the graph, and each random walk gives you a sequence of nodes. You can then use the word representation learning algorithms that are used to learn word representations from sentences to learn graph node representations from the sequences of random walks. However, even though it's an interesting idea, it is still quite limited. For this work, we based our research on the graph neural networks model, which I'll explain in more detail and which is based on powerful neural network formulations. More recently there are also other neural-net-based graph representation learning algorithms, like neural graph fingerprints and conv nets on graphs, which are also related.

So, graph neural networks are a powerful model that learns representations for graphs and then makes predictions based on those learned representations. The model has two components. The first is a propagation model, which uses a propagation process on the graph to learn node representations. The second is an output model that takes those learned node representations and makes predictions for the graph from them.

First let me dive into more details of the propagation model. The propagation model learns representations for nodes using a propagation process. Here I'm showing an example graph that is directed and has two types of edges, denoted by different colors. Edge type is an important piece of information that tells us the different relations between the entities in the graph. In the propagation model, each node has a representation at each propagation step, which is a vector. You can think of it as hidden units in a neural network.
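To make that concrete, here is a minimal sketch, in Python with NumPy, of how such a graph might be represented: a list of typed, directed edges plus one hidden state vector per node that the propagation model will update. The class name, field names, and example edges are illustrative assumptions of mine, not the original implementation.

```python
import numpy as np

class TypedGraph:
    """A directed graph with typed edges and one state vector per node."""
    def __init__(self, num_nodes, edges, state_dim):
        # edges: list of (source, target, edge_type) triples
        self.num_nodes = num_nodes
        self.edges = edges
        # hidden representation of every node, updated at each propagation step
        self.node_states = np.zeros((num_nodes, state_dim))

# a small example graph with two edge types (0 = "black", 1 = "red")
g = TypedGraph(num_nodes=4,
               edges=[(0, 1, 0), (3, 2, 0), (1, 2, 1)],
               state_dim=8)
```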
The propagation model then propagates those representations along the edges. At each propagation step, the representation for one particular node is transformed along an edge and passed to its neighbor on that edge. The transformation behavior is determined by both the edge type and the direction of the edge. Here we consider propagation in both directions to facilitate the flow of information on the graph. More concretely, this is the propagation step used in the original graph neural network paper. The F functions are transformation functions that transform the representation of a node into another form based on the edge type and direction. After the representations are transformed, each node collects all the incoming representations and sums them up to use as its representation for the next iteration. The simplest example of a transformation function is a linear function parameterized by a transformation matrix A and a bias b, where those two parameters depend on the edge type and direction. That's the simplest possible transformation function, but usually we don't use the linear transformation directly; instead we apply some nonlinearity on top of it.

>>: What are [inaudible]?

>> Yujia Li: [inaudible] are -- as this arrow says, the edge type and direction. So it's saying you have a set of parameters for each edge type and direction.

>>: Oh, I see. So it's like a clustering mechanism.

>> Yujia Li: It's not doing clustering directly. It's --

>>: Oh, how many different values can L take?

>> Yujia Li: For example, here, for this graph, you have two edge types, black and red, and you have two directions, forward and reverse. So then you have four different Ls.

>>: [inaudible] different on the nodes, right? [inaudible] is a node.

>> Yujia Li: It all depends on the edge. I mean, the edge between two nodes.

>>: Yes. So you actually have four nodes. So you'll have like four different arrows, [inaudible] direction you actually have eight different arrows.

>> Yujia Li: So each edge has a type and a direction, no matter which nodes it connects. As long as it has the same edge type and the same direction, then you have the same set of parameters. So this is not used as an input in the standard way for a neural network. It just denotes which set of parameters is used for this edge type and direction.

>>: But the parameter is a type, right?

>> Yujia Li: Yes, yes.

>>: Okay. So, for instance, between edges --

>> Yujia Li: So, for example, this edge from 1 to 2 has the same parameters as this edge, from 4 to 3. Okay?

>>: [inaudible] this type [inaudible] predicate.

>> Yujia Li: Yes.

>>: [inaudible].

>> Yujia Li: Yes, yes, yes. It's learning the behavior depending on the edge type and direction.

Okay. So that's the propagation model, and it's an iterative process; this is a recurrence equation. Usually this is run for a number of steps, and then we stop there and get the representation for each of the nodes. That's your node representation for this graph.

>>: [inaudible] whether this is kind of stabilized, because you have loops?

>> Yujia Li: Yeah, I'll talk about that later. But in general it's not even guaranteed to converge if you don't restrict the parameters in any way. But --

>>: You're learning it with a fixed number of steps, anyhow.
>> Yujia Li: Yeah, we're learning it with a fixed number of steps. But I'll go into that later.

So that's the propagation model. Then, once we get the node representations, there's an output model that maps each of the node representations to an output. This mapping function, G, can also be a neural network, and this model is able to make predictions for each of the nodes. So that is the output model.

Both the output model and the propagation model have a number of parameters, and we need to learn those parameters from the data. In this model, the propagation model can actually be unrolled as a recurrent network, because it is just running the iterative recurrence equation over and over again, and the whole thing can therefore be trained with back propagation through time. However, if you want to use back propagation through time, you have to keep track of all the intermediate states, and it might take a long time and many iterations for the propagation process to converge. So, as proposed in the original paper by the authors, back propagation through time seems expensive. But there is a way to apply a restriction to the propagation model such that the propagation function becomes a contraction map. What that means is that one step of propagation will map node representations that are close to each other to representations that are even closer, or at least as close, in the next step. Once it satisfies that condition, the propagation model becomes a contraction map. If the propagation model has this property, then it has interesting behavior: it is guaranteed to converge, first of all, and it has a unique fixed point that has nothing to do with the initialization. Those are interesting properties that can be exploited to develop a more efficient training algorithm. The authors propose, based on this restriction, to run the propagation until convergence and then just do training around the fixed point using the classic Almeida–Pineda algorithm. You don't have to back propagate all the way through the recurrence; you just need to train in the vicinity of the fixed point. So that's the proposed approach to train this.

Our modification of this model, which we called gated graph neural networks, is actually not a very good name, because we not only added a gating mechanism to the network but, most importantly, we unrolled the recurrence for a fixed number of steps and just used back propagation through time with some modern optimization methods. Doing this actually has a number of benefits. First of all, we don't restrict the propagation model to be a contraction map, and therefore the model has a lot more power and capacity to solve more complicated problems. We have actually shown and proved that if we restrict the propagation model to be a contraction map, then the model will have trouble modeling long-range dependencies, which are important in many different cases. Second, as I mentioned before, if the propagation model is a contraction map, then the initialization doesn't matter at the fixed point. But now we don't have this restriction anymore, so the representation you end up with after a fixed number of steps actually depends on how you initialize the propagation process.
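To make the propagation just described concrete before moving on to initialization, here is a minimal sketch of the un-gated propagation unrolled for a fixed number of steps, with one linear transformation (A, b) per edge type and direction and a nonlinearity applied to the summed incoming messages. All names here are my own illustrative assumptions rather than the exact model code, and the real model replaces the plain tanh update with the gated update described later.

```python
import numpy as np

def propagate(node_states, edges, params, num_steps):
    """Un-gated propagation, unrolled for a fixed number of steps (trainable with BPTT).

    node_states: (num_nodes, d) array of initial node representations
    edges:       list of (src, dst, edge_type) triples
    params:      dict mapping (edge_type, direction) -> (A, b), A: (d, d), b: (d,)
    """
    h = node_states.copy()
    for _ in range(num_steps):                      # fixed T steps
        msg = np.zeros_like(h)
        for src, dst, etype in edges:
            A_f, b_f = params[(etype, 'forward')]   # message along the edge direction
            msg[dst] += A_f @ h[src] + b_f
            A_r, b_r = params[(etype, 'reverse')]   # message against the edge direction
            msg[src] += A_r @ h[dst] + b_r
        h = np.tanh(msg)                            # sum of incoming messages + nonlinearity
    return h
```

The initial node_states passed in here are exactly where the problem-specific information discussed next enters the model.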
That initialization can be exploited to feed problem-specific information into the model and make the model produce problem-specific node representations, which is helpful in a lot of different cases. Third, we are learning to compute representations within a fixed time budget: during training we unroll the propagation process for a fixed number of steps, and at test time we also unroll for a fixed number of steps. That creates a better alignment between the training process and the test process, which is beneficial. And in the end we also found empirically that adding a gating mechanism does make the propagation model a little bit better and training a bit more stable.

Next I'm going to talk about some of the details of this gated graph neural network model. The first thing is the initialization. As I mentioned, here the initialization actually matters, and we can use this to feed problem-specific information into the model, which we call node annotations. Just to give you an example of what those node annotations are, think about a simple reachability problem on a graph. The problem is: given a graph and two special nodes -- one is the source node, the other is the target node -- can the target node be reached from the source node? Here there are two special nodes, the source and the target, and we have to somehow encode this information and give it to the model. Otherwise, there's no hope of solving this problem. Probably the most straightforward way to encode this information is to use a two-dimensional vector for each node: the source node A is encoded as 1, 0, the target node as 0, 1, and all the other nodes as 0, 0. That seems to be a very straightforward way to encode that information and distinguish those two special nodes from everything else. Those are the kind of node annotations that we are talking about.

If we initialize the node representations this way, then it's easy to imagine that a propagation model can learn to, for example, copy the first bit of every node to its neighbors. After the propagation, the node representations will look something like this, and if we want to check whether the target can be reached from the source, we can just check if the target node has a particular pattern with 1 and 1. An output model classifier can easily learn that pattern. So this is something we can imagine: if we initialize the nodes in this way, then we can imagine the propagation model learning this behavior. Here's another example where the target is not reachable from the source because of the directions of the edges. The point here is that we're using the same node annotation mechanism, the same propagation model, and the same output model, but we're able to solve different instances of the same problem for different graph prediction instances. So that's what node annotations are. It's very simple: just encode all the necessary information and feed that into the propagation model.

>>: So this doesn't mention [inaudible] this 0, 1, where does it go? Does it go into your matrix A, matrix B?

>> Yujia Li: This goes into the node representations.

>>: [inaudible] is two dimensional, or is it n dimensional?

>> Yujia Li: So for each node here, if we just initialize each of the node representations to be this two-dimensional vector, it will be two dimensional.

>>: So each of the nodes has a separate h?
>> Yujia Li: Yes.

>>: And -- and in the neural network, you have this n-dimensional vector going into it with the output model?

>> Yujia Li: Um --

>>: Or does your output model take --

>> Yujia Li: I'll talk about the output model later. But in the simplest case, an output model can just, for each node, take the node representation for that node and make a prediction for that node.

>>: Okay. I see. So the output model is per node.

>> Yujia Li: This is like the output model used in the previous work. But we've expanded beyond this single output model by adding more output models. I'll talk about that later. Okay?

So this is node annotations. But in practice, what we actually did was pad those problem-specific node annotations with some extra 0s to increase the dimensionality of the node representations, which adds more capacity to the node representations, which is useful for containing more information. Yes?

>>: You could also have had a learned representation for each of the three types, right? Either this is the reachable one, this is the not reachable one. Or this is green, red, or white, right, basically, right? And then had like three embeddings for --

>> Yujia Li: Yes. Uh-huh.

>>: Were there any projects where you tried something like that?

>> Yujia Li: We haven't tried that. But, I mean, we've found these simple, straightforward node annotations to work pretty well. I guess in more complicated tasks, it might be beneficial to learn them.

>>: I'm not understanding what you were saying on the previous slide. What do you mean by the reachability?

>> Yujia Li: Reachability is saying: does there exist a path from a source node to a target node. If there is a path connecting them, then the target is reachable from the source.

>>: Yeah, so then you -- then you encode that in addition? Want to encode it?

>> Yujia Li: You want to encode which one is the source node and which one is the target node. We have to know that. Otherwise there's no hope of solving this problem.

>>: So the input is this thing, and then at training time the loss function says in this instance I want green -- the first dimension of green to be zero, whereas in the previous slide, the loss function says I want the first dimension of green to be 1, right? That's basically how you're encoding it, right?

>> Yujia Li: So it has two processes. The first is the propagation process. After the propagation, you get some final node representations. Those node representations are fed into, in this case, the output model, which is a classifier -- a binary classifier that tells you whether it's reachable or not. Then you have the loss coming in at the end of the classifier, and that back propagates all the way through the propagation model.

>>: Right.

>>: Maybe I'll understand later when you show the examples. Because where is the input to the model? Is the green node the input to the model?

>> Yujia Li: Oh, the input to the model is -- so the model has to know the graph structure. It basically takes a list of edges and constructs the graph. So now you have the graph. Then you have to have some problem-specific information that tells you which node is the source and which node is the target. And then the expected output is whether the source node --

>>: Is the graph random?

>> Yujia Li: The graph is given to you.

>>: Okay. So I guess I'll understand this once you get to the applications. Right now I can't map it to a problem in my head.
>>: Well, but in this problem, the -- everything -- can you go back a slide. This is a little -- go back one -- wait. Yeah. So this -- this is the input, right? At training and test time, this is the input. But then at training time the additional information you get is that in this particular instance green is reachable. So that's where the loss function comes in: it's telling you, for this target, you want to make the first dimension of green be 1 after doing the whole propagation. Right? And that's what you learn at training time, that you --

>>: So I think this is, you know, you don't need to come to the training stages, like propagation stage. You also graph [inaudible] you need to follow --

>>: If there's no propagation, it doesn't know what function to learn. The only reason why it's learning this propagation function is because the loss function tells it this is the function I want you to learn. If it were learning a different function, it wouldn't learn how to propagate.

>>: But here it's actually just a check. So if you allow several propagations and I initially initialize them 1, 0, and then you check after several steps: if your target node has the first bit as one, then it's reachable.

>>: But that particular function? No, there is [inaudible] that particular function is not intrinsic to the thing.

>> Yujia Li: So this simple example is a simple supervised learning task where the input is a graph and two nodes, one the source and one the target. And the output is you want to predict whether on this graph the target can be reached from the source or not.

>>: So you have a binary label for each [inaudible].

>> Yujia Li: Yes.

>>: Which is yes or no for each green -- for the green node.

>> Yujia Li: Yes, yes, yes. And this is a particular way to encode the source and target information in this graph propagation framework. That's what we call node annotations, which is a very straightforward translation of the --

>>: So in this graph you're going to be learning those A and B parameters?

>> Yujia Li: Yes. Yes.

>>: And then, again, how are you going to know if it's reachable or not?

>> Yujia Li: It's a supervised learning task. So during training you will be given those correct outputs.

>>: But there is only one set of outputs right now.

>> Yujia Li: The output -- so for each graph and a pair of nodes --

>>: You only have one [inaudible] there, so there is no data yet in my mind.

>> Yujia Li: Oh. Oh, during training you will have a bunch of graphs, a bunch of different nodes. For each of them you have a desired output: whether one is reachable from the other.

>>: Okay.

>> Yujia Li: So this is a very general framework. You don't have to specify what the property is you want to have. You just give the model a bunch of data and the model will learn the correct concept corresponding to this task. Okay?

So this is the initialization. Then, for the propagation model, we also made some changes: we added gating mechanisms and some other minor differences compared to the original graph neural network propagation model. If we unroll one step of the propagation, we get something like this feed-forward layer that maps the node representations at one step to the next. But this feed-forward layer has a very special, very sparse connection structure.
Each of those connections might also have shared weights and shared parameters, depending on which edge type and which direction it belongs to. So if we concatenate all the node representations into a big vector, then we can actually write this more compactly using matrix operations. Here the transformation matrix A also takes a very special block structure with a lot of sparsity in it, and each block can share parameters with some other blocks depending on the graph structure. Okay.

>>: You could have [inaudible], right? A standard recurrent neural network would basically just be a single self-loop, right?

>> Yujia Li: Yes. Yes. Yeah. So you can also think about this as a recurrent neural network with a structured transition matrix instead of a fully connected one. And this is how we formulated it. It looks exactly like a recurrent neural network.

So this is more like a vanilla neural network kind of propagation model. But instead, we use something more like a gated recurrent neural network, which uses a gating mechanism: we added a reset gate and an update gate to modulate the information propagation process on the graph, which we found to be helpful. We could also use more complicated ones like LSTM, long short-term memory recurrent neural networks, but we found this to be a bit simpler, and it works as well or even better in many cases. So that's the changed propagation model.

For the output model, as before, we still have the per-node output, which is the same as in the graph neural network model, but we also added two other output models to make our model able to solve more complicated tasks. The second one is the node selection output, which, instead of making a prediction for each node, selects one of the nodes for each graph. The way it does that is it uses the output model to produce a score for each node in the graph and then passes the scores through a softmax to select a node for that graph. This is very useful in a lot of different problems, and it provides a way to make predictions where the output is not one of a fixed number of classes: here the output space depends on the number of nodes in the graph, and the node selection output is able to handle that. The third one is the graph-level output. It makes a prediction at the graph level, either classifying what category the graph belongs to or doing some other kind of regression problem like that. The way we did that was to first come up with a graph representation vector, which is a weighted sum of all the node representations in the graph. The weight for each node is given by another neural network, which is learned and takes the node representations and initial annotations as input to compute the weight. The graph representation vector can then be used in any standard machine learning task, like classification or regression, in the standard way. So those are the three different output models that we considered and developed for the different tasks we experimented with in this project.

The whole network then looks like this. On the left side there's the initialization, where we feed in problem-specific node annotations and then pad with 0s to initialize the node representations. Then we pass those into the propagation model, which is run for a fixed number of T steps.
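The following is a condensed sketch, under my own naming and not the original implementation, of the pieces just described: initializing node states from zero-padded annotations, a GRU-style gated update applied to the aggregated incoming messages (computed, for example, as in the earlier propagate sketch), and the node selection and graph-level output heads. The entries of the parameter dictionary p are assumed shapes and names.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_states(annotations, state_dim):
    # problem-specific annotations (e.g. the 1,0 / 0,1 bits) padded with zeros
    n, k = annotations.shape
    return np.concatenate([annotations, np.zeros((n, state_dim - k))], axis=1)

def gated_step(h, a, p):
    # GRU-style update; h: (num_nodes, d) node states,
    # a: (num_nodes, d) aggregated incoming messages for this step
    z = sigmoid(a @ p['Wz'] + h @ p['Uz'])                  # update gate
    r = sigmoid(a @ p['Wr'] + h @ p['Ur'])                  # reset gate
    h_candidate = np.tanh(a @ p['Wh'] + (r * h) @ p['Uh'])
    return (1 - z) * h + z * h_candidate

def node_selection_output(h, annotations, p):
    # one score per node, then a softmax over the nodes of this graph
    scores = np.concatenate([h, annotations], axis=1) @ p['w_score']
    e = np.exp(scores - scores.max())
    return e / e.sum()

def graph_level_output(h, annotations, p):
    # weighted sum of node representations; the per-node weights come from a
    # learned network that sees the node states and the initial annotations
    x = np.concatenate([h, annotations], axis=1)
    weights = sigmoid(x @ p['w_gate'])
    return (weights[:, None] * np.tanh(h @ p['W_embed'])).sum(axis=0)
```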
Once that finishes, the node representations are fed into the output model to compute an output for each graph prediction problem instance. The whole thing is trainable with back propagation, and that's exactly what we did.

To test whether this gated graph neural network makes sense and whether it can learn something useful, we started by testing it on some toy graph property tasks: the reachability task I mentioned before; a sharing task, whether two nodes can reach the same node or not; a cyclicity task, checking whether a node is in a cycle or not; and reaching cyclicity, whether a node can reach a cycle or not. Those four are very simple graph-theoretic properties. For all four tasks, we generated some random graphs, picked some random nodes, and generated data to simulate tasks like that. And for all four tasks, our model is able to learn from only a few tens of training examples, using a very small model with around 100 to 200 parameters, and do perfectly on all four tasks.

After that, we moved on to some more complicated, but still toy, tasks. Some tasks come from the bAbI synthetic dataset proposed by Jason Weston and others, which is a set of natural language reasoning tasks that require reading natural language input, doing some reasoning, and answering a question. Here we used the symbolic format of the data, so we're exclusively focusing on the reasoning part of the problem and ignoring the parsing part. The results are therefore not directly comparable with other people's published results, but they still provide some interesting insight into how this model works.

One example bAbI task here is about basic deduction. This is how the input data looks. The input data is composed of a bunch of different instances. For each instance, the input is given to you as a list of facts, and here, for this task in particular, each fact is one edge that encodes one relation between two entities. So here it is saying D is A, B is E, A has fear of F, G is F, E has fear of H. Once all those facts are listed, there's a question near the end that you have to do some reasoning over those facts to answer. Here it is asking which entity B has fear of. If the data is given in this format, it's very easy to transform it into a graph; it's a very straightforward conversion process. Here is the graph converted from this input data --

>>: Here when you talk about the symbolic format of the [inaudible], are you saying that you just assume that you have encoded its logical form [inaudible] logical form [inaudible]? I'm just curious. Because originally [inaudible] so you -- I mean, whatever is in the natural language in the original data versus what you use as the symbols to represent those entities, how is the [inaudible]? Maybe say more about how to translate the original questions into the representation you're talking about, how to get the label here and the --

>> Yujia Li: Yeah, yeah.

>>: What kind of assumptions are you making here.

>> Yujia Li: Yeah. So the original format of this bAbI dataset is that each fact is represented by a simple sentence. For example, D is A --

>>: What would be the original sentence that corresponds to [inaudible].

>> Yujia Li: So here it might be something like -- for example, D is A might be something like J is a sheep.

>>: Okay.
>> Yujia Li: And here B is E, A has fear of F, maybe like sheep are afraid of wolves, something like that.

>>: Okay. And you make sure that all these entities are correct in terms of [inaudible].

>> Yujia Li: So we -- actually generating the symbolic data --

>>: Modify that, of course, you know, you have different tasks.

>> Yujia Li: Yeah, yeah. So here we just removed the parsing part and assumed that we've already got that. And it's actually not very hard to get, because the dataset is a synthetically generated dataset. To generate the bAbI data, what the authors did was start with this kind of symbolic format and then render it into natural language using some very simple rules. So I could just add a switch to the data generation process which gives me this kind of symbolic data.

>>: Okay.

>>: So in this case are you showing here one data point, or are you showing a representation of the entire dataset? What is going to be the data you're training on? Are you going to have some examples of D has fear of G that you're going to try to fit to?

>> Yujia Li: Oh, okay. So this is just one instance, one training example. This list of facts is the input, and this question at the end, with the answer, is the output.

>>: Oh. So I didn't see that. So the bottom --

>>: When you do this graph, you know -- do you do it sort of one by one, or do you pool the entire, you know, task 5 data together to come up with a fixed set of [inaudible] and then create one single graph?

>> Yujia Li: So we create different graphs for different examples.

>>: [inaudible] I see.

>> Yujia Li: The things we learn are the things that can be generalized across different graphs. For each of those graphs, we're solving the same problem. It's just that the graph is different and the input is different. But the knowledge that is applied to those graphs is the same.

>>: Is the same.

>> Yujia Li: And that's what we learn: the parameters of the propagation model and the output model.

>>: Okay.

>>: But you don't bind anything -- like A, all the letters are totally arbitrary and you don't learn those at all [inaudible] models. Right?

>> Yujia Li: No, no.

>>: Okay.

>>: So in this particular case you would label B as 0, 1, and H as 1, 0 if you know for this example that B and D has fear of H. And then you would run your -- you would set up every example like that and then train. So in this case, this is labeled, right? So you know if B has fear of H or not.

>> Yujia Li: Mm-hmm.

>>: And if it does, then you would label it as 1, 0, 0, 1. If they don't, you label in some other way and then train. Is that what you are --

>> Yujia Li: Here is how we encoded this. We have the graph, constructed from the set of edges and edge types. And here there's only one special node, which is B, and we label it with some special annotation, say 1, 0.

>>: Does it depend --

>>: The answer is --

>>: -- on the real answer?

>> Yujia Li: The answer is the output that we want to predict. So how we encode this doesn't depend on the output.

>>: So you're saying the person is going to tell you what question they care about before the propagation step, but you don't know what the actual answer is except at training time?

>> Yujia Li: You have to know what question is asked, but you don't know what the output is when you start the propagation model.

>>: But during the learning, is this unsupervised in some way or is it supervised?
>> Yujia Li: This is truly supervised.

>>: So supervised. So you had an example. Do you know if that sentence is in the training, if this is a training sample?

>> Yujia Li: Yes.

>>: Do you know if the sentence is correct or not?

>> Yujia Li: Yes.

>>: And based on that you're going to label the nodes?

>> Yujia Li: Yes.

>>: If it's incorrect, you're going to label it differently, and then you're going to train on these examples of graphs [inaudible] is that what you're doing, or something else?

>> Yujia Li: You don't label the nodes based on the output. You label the nodes based on the question.

>>: That's -- that's the output.

>> Yujia Li: The question will be --

>>: So based on the question you're going to label the nodes.

>> Yujia Li: You -- based on the question --

>>: Whether the question is correct -- the answer to the question is correct or not, you're going to label differently.

>> Yujia Li: No, you just label the nodes based on the question, not based on the answer to the question. The answer is the desired target output.

>>: Well, but is it a binary classifier, where sometimes the question is B has fear of H and sometimes the answer is yes, sometimes the answer is no, or is it: what is the node which B has fear of?

>> Yujia Li: This is more like a node selection problem.

>>: Okay. So you know that B is going to have fear of one node, and it's just saying which one is it.

>> Yujia Li: Yes.

>>: During training time you know the -- you give the supervised target, right? You give that answer.

>> Yujia Li: Yes. Yes.

>>: Are there any incorrect answers that --

>> Yujia Li: If it is the true answer -- no, there are no incorrect answers.

>>: Okay.

>> Yujia Li: It's a purely supervised learning task.

>>: Yeah, so, for example, I forgot how many training instances there are for task 5. I mean, a hundred or something like that? For this task.

>> Yujia Li: Usually people use like a thousand examples.

>>: A thousand. Okay. So a thousand similar examples, and you extract the same kind of has-fear or, you know, these kinds of relations out of those. So suppose you have 1,000 training instances. Do you train 1,000 graphs or just one single graph that simulates --

>> Yujia Li: 1,000 graphs.

>>: 1,000 graphs.

>> Yujia Li: Yes.

>>: Oh, that's [inaudible] in a certain case the test set doesn't have -- it's missing [inaudible].

>> Yujia Li: [inaudible]

>>: 1,000, you have to generalize that.

>> Yujia Li: Uh --

>>: Oh, okay. It learns and learns --

>> Yujia Li: It learns the transformation -- it learns the propagation.

>>: One for each class. No --

>>: There's only two classes, which is "is" and "has fear". Everything else is just randomly initialized.

>>: Well, basically the vector is [inaudible] never be updated.

>>: Oh, I see.

>>: The only [inaudible].

>>: [inaudible] C, D, and E are.

>>: Right.

>>: [inaudible].

>> Yujia Li: Yes.

>>: Basically is fixed [inaudible].

>>: He's updating the hidden representation as you go.

>>: You're going to also learn the vector representation of [inaudible]?

>> Yujia Li: You learn all the representations for each graph. But those node representations don't have to be the same across different ones.

>>: [inaudible] is that a vector, or is it just [inaudible] representation? So when you have the graph B there. I mean, [inaudible] is the weights, right? [inaudible] is a weight?

>>: So D is the recurrent state. But do you initialize it to be all 0s, or do you initialize it to be random?
>> Yujia Li: We initialize to all 0s.

>>: Okay. So it's just a recurrent state that's initialized to all 0s.

>>: So this can be considered as a recurrent network [inaudible] this is a recurrent network. So what you need to learn is the edge-specific transition weights [inaudible] has fear with another parameter. So those would be learned. And then the representations of each node are actually either initialized as knowledge-specific things or just 0.

>>: Okay, you don't [inaudible] vector.

>> Yujia Li: No, no, that's not [inaudible].

>>: If this was a test point, then, what would you do exactly?

>> Yujia Li: If this was a test point, then we would be given everything here but not the final answer. Okay. So we have all those facts, which we use to build the graph, and then we have B, which is what we've been asked about -- the special node that's related to the question -- and we just initialize the node representation for B to be something special, like putting 1 in one or two bits. Now, for all the other nodes --

>>: That's what you do in training as well for [inaudible], right?

>> Yujia Li: Yes.

>>: Okay. Or not B, but whatever is the first --

>> Yujia Li: Yes, yes, yes.

>>: And then?

>> Yujia Li: And then we run the propagation process. And then for this one, I think it's a node selection output. So we feed those final node representations into the node selection output model, which gives a score for each of the nodes, and then we choose the node that has the highest score as the output.

>>: So for these examples [inaudible] B has fear [inaudible] the last sentence, B has fear of whatever it knows. So, for example, the output you have [inaudible] nodes or something.

>> Yujia Li: Mm-hmm.

>>: Like [inaudible] and then you need to -- in the training you need to have the right prediction on the H.

>> Yujia Li: Yeah, yeah, yeah.

>>: Okay.

>> Yujia Li: Yeah.

>>: So, for example, here, if you have like B has fear of E and E is H?

>> Yujia Li: Okay.

>>: Okay? So in this case your output would be like two, like E and H, and you would expect that E and H both will have -- I [inaudible] assume that you have one output node.

>> Yujia Li: Yes. So this problem, in general, will have some ambiguous answers if you have those kind of --

>>: Multiple [inaudible] answers.

>> Yujia Li: Yeah, yeah. But in this dataset, I believe, there is only one unique answer for each of those instances. That's guaranteed by the data generation process.

>>: So then you are not really encoding H with a special symbol like you do B. B is encoded with a special thing, 1, 0, for instance, and so on.

>> Yujia Li: Right.

>>: And H is not.

>> Yujia Li: H is not. H is something you will predict.

>>: So in training, then, how do you -- if this is a training point now, how do you encode the information that B does fear H?

>> Yujia Li: H will be used as the correct output target for this model. And the output is a node selection output. You can think of it as a softmax over all the nodes, and that's your correct node. So you can think about it as kind of like a multiclass classification thing.

>>: But you could have also just used a special value for H. In all these cases, you could have just used a special value for H, and then in the end you just check which values in your graph have that -- which nodes in your graph have that particular vector that encodes fear of B.
>> Yujia Li: That's kind of like the node selection, right? You select a node that matches that kind of pattern.

>>: Yeah.

>>: So it's saying [inaudible].

>> Yujia Li: Kind of. Yeah. Okay.

>>: And one other problem that I see is that your positive instances will be way fewer than the negative instances, because the negative examples are almost all possible pairs, right, or N squared, whereas the positive examples are only one. So you have this skewed training data.

>>: What's your training classifier? If you have 10,000 labels, you have one that's positive and 900 that are negative.

>> Yujia Li: It's the same as in a multiclass classification problem. For each instance you just have one of, say, a thousand classes as your target.

>>: But in the case here, have you balanced hopefully the -- I mean, I'm just asking about your training [inaudible].

>>: It doesn't have any negative examples. The only negativity here comes from the fact that the output is H and not [inaudible].

>> Yujia Li: Yes.

>>: [inaudible].

>> Yujia Li: Yes. Okay. So this is just the setup for these toy tasks. We tried the gated graph neural nets on four of the bAbI tasks. Three of them are node selection tasks, and one is a graph-level classification task. This model is able to solve all of them to a hundred percent accuracy with only 50 training examples, and for each of them, the model has less than 600 parameters.

>>: [inaudible] the T?

>> Yujia Li: T I think for --

>>: The time step that's [inaudible]?

>> Yujia Li: Yes. Yes. I think for this we used something like five or ten.

>>: And was this fixed for all the tasks, or was it like, knowing the task, you select T?

>> Yujia Li: You have to know roughly what size of graphs you're working on. But for all of those tasks, the graphs are fairly small, and I think we used, say, ten for all of those different tasks and didn't tune it much.

>>: [inaudible] is 17 and 19. Have you tried that?

>> Yujia Li: Sorry? I missed that.

>>: You tried four tasks out of 20. Right? So how difficult is 17 and 19 [inaudible]?

>> Yujia Li: It's not there. 17?

>>: So basically you tried it and it didn't succeed, or you were not able to [inaudible] the graph.

>> Yujia Li: Oh, we tried -- I think we tried 19. I'll talk about it later. That's making a sequence of predictions, which is the most challenging task.

>>: Okay.

>> Yujia Li: But there are some other tasks -- we didn't claim to solve all the bAbI tasks, because there are some tasks that are not naturally formulated in this graph prediction framework. So we didn't try those. Question?

>>: So I guess I was going to ask a similar thing: what prompted you to choose these four? Did you look at these four, decide they could probably be done with graphs, and then try them?

>> Yujia Li: Yes.

>>: And the other ones you don't think could be, or you tried them and they didn't work?

>> Yujia Li: We looked at those tasks as well. But some of them do not naturally fit into this framework.

>>: Can you give an example?

>> Yujia Li: One example is temporal reasoning. It's a scenario where you have a bunch of different people -- each person is an entity -- and you have facts like A moved somewhere and picked up something and then moved somewhere else and then dropped that something there. And at the end you want to know where that something is. So you have to reason about this temporal aspect.
We thought about that for a while and didn't figure out a way to encode this temporal aspect in a static graph. But, I mean, there are maybe some other things we could try to make this framework suitable for those tasks as well; we just didn't push that too hard.

>>: So are you saying that you don't see how to set the problem up in the graph, or that you don't see how the graph could learn to solve the problem? Is it that --

>> Yujia Li: It's more that we don't know how to set up the problems to use the graphs.

>>: But maybe -- maybe it would work, because you do have some notion of time in your whole -- in your whole learning procedure. So it might be biased anyway to take time into account.

>> Yujia Li: It could. Yes. We did think about that and had some ideas, but we didn't try too hard. For this project, when I did this, our main goal was to solve the program verification problem. We did all this after we solved that problem, just to demonstrate that this model can solve some other problems as well and has great potential. So we didn't even try to solve all those bAbI tasks.

Okay. So that's the gated graph neural nets model. Then we wanted to see whether those results are really significant or not, so we decided to use standard RNNs/LSTMs as reference baselines. Those RNNs/LSTMs are trained on token streams: the input is a sequence of tokens and the output is another token for that example. The RNNs/LSTMs have about 5,000 and 30,000 parameters each. The training setup is that you have a thousand examples for training and validation and a thousand for testing. And the [inaudible] is that we start with only 50 training examples and then keep using more training examples until the test accuracy reaches 95 percent or above. That's the protocol. Here are the results. In the brackets we're showing the number of examples needed to reach that level of accuracy. For all those four tasks, the gated graph neural nets model is able to reach perfect accuracy with only 50 training examples. But for the RNNs/LSTMs, on some of the tasks they really struggle and are not able to get good performance even using all the training examples.

>>: But if you look at the memory network paper, for all four of these tasks they got about 95, 98 or something. Maybe they're using some early --

>> Yujia Li: I don't think for all --

>>: Did you get a chance to look at the results for memory networks?

>> Yujia Li: I did. But I don't think they got like 90 percent for all of these. Maybe -- I mean, there are different setups for this. Some people use a thousand training examples.

>>: Oh, yeah, yeah.

>> Yujia Li: Some people use 10,000 training examples. So that will change the difficulty of the problem a lot. And here I'm just showing the reference results from the paper that proposed this dataset, which used an LSTM on the text input; that is not directly comparable to ours, just a reference point. They used all the thousand training examples and got this level of performance.

>>: How did you represent the nodes in the LSTM? Do you learn embeddings for the symbols, because aren't they arbitrary?

>> Yujia Li: Okay. So the nodes are not completely arbitrary. Across the graphs, the entities still have the same names. So if you use, say, J or A as the name for an entity, then in the next graph J is still the name for that entity. It's just that the relations between those entities are different.

>>: Okay.
>>: Oh, so this means that you're initializing different examples from previous instances, from the same nodes, using different previous examples, right?

>> Yujia Li: Uh --

>>: So in this case, like, you have J that showed up in some example a couple of graphs ago. You would keep the same representation and use it when you're initializing for the next example?

>> Yujia Li: For the RNNs/LSTMs, that information is encoded in the word embeddings. But our model doesn't use any of that information at all.

>>: So you always start from 0.

>> Yujia Li: Always start from 0. Yes.

>>: Can you give a sense of how different the setup had to be for the four tasks? I mean, the tasks -- if you wanted to just write code to solve the tasks, they are pretty simple to solve also, right? So generally you want a model that could solve the tasks without too much change, probably.

>> Yujia Li: Right.

>>: So did you have to -- to do these four tasks, is it really different, the way you set it up?

>> Yujia Li: The only difference is -- let me see. The only difference is, when you want to use the gated graph neural nets to solve those problems, the particular thing for each of those tasks is that the relations might be encoded using different words. So here "is" is one of the relations, but in the next task you might use some other words to encode that relation. So you have to figure that out, which is fairly simple, and that's just a transformation to turn it into a graph, which is quite straightforward to do. The next thing you need to do is figure out how to encode the correct output. For some tasks you need a node selection output; for some tasks you need the graph-level classification output. So that's another thing you need to do. And other than these two, there's not much you need to do.

>>: Okay.

>> Yujia Li: So you just need to figure out the task-specific things, and the model will learn to do the task by itself. Okay? So that's this. A few conclusions about the results: we used the symbolic format of the data, which does make the task easier, but the tasks are still nontrivial. And we don't claim that gated graph neural nets can beat RNNs/LSTMs, because we did use more structure in the problems. But this shows that if there is structure in the problems and we can exploit that structure, then the problems can be made a lot easier to solve.

I guess I've already used one hour, so I'll speed up. Next I'm going to talk about the sequential variant of this model, which we call gated graph sequence neural networks. This comes from the setting of making a sequence of predictions on graphs. Many problems require this kind of sequence of predictions, and here I'm showing two examples. The first is the shortest path example: a path is a sequence of nodes on the graph, so this is clearly a sequence prediction problem. The second one is the kind of problem we would see in program verification. We want to analyze the heap states, find out what the component substructures in the heap are, and parse the whole graph into those subgraphs. We will then have a list of substructures, which, again, can be formulated as a sequential prediction problem.
For those sequence prediction problems, the prediction at each step can still be made by a gated graph neural network: either a per-node prediction, a node selection prediction, or a graph-level prediction. But one thing in particular we have to take care of is that we need to keep track of where we are in the prediction process in order to make a sequence of predictions. For example, for the shortest path, the node annotations used in the initialization should be different at different prediction steps; otherwise at every step we just make the same prediction over and over again, which doesn't make sense. We have to somehow keep track of that. For the graph structure parsing problem, we need to keep track of which parts of the graph have already been parsed and predicted and which parts haven't; otherwise we'll just predict the same structure over and over again, which is undesirable.

The solution we propose is to chain multiple prediction steps together using node annotations. The idea is to use the node annotations as a kind of working memory that keeps track of the progress of the prediction and carries it over from step to step. At every prediction step, we not only produce an output but also predict the node annotations that will be used for the next step. So here's the architecture for the gated graph sequence neural network model. On the leftmost side, you have the problem-specific node annotations, which are fed as input to the model to do the initialization. At every step, you produce an output using the output gated graph neural network and then use another model to predict the node annotations that will be used in the next step. Those node annotations act as memories that are kept from step to step. One thing to note is that in the first step, the node annotations are problem-specific and usually interpretable, but from the second step on, the model can choose whatever it wants to put in those node annotations, as long as it fits well with the problem. So the model can choose what to put there, and it doesn't have to be interpretable at all. Those things act as memories carried over from one step to the next.

>>: I'm sorry, just to understand. So in this case you get [inaudible] each [inaudible] step rather than [inaudible] and then [inaudible].

>> Yujia Li: Oh. For each of those prediction steps, you still run a propagation process. And you run an output model to get an output.

>>: I see. So when you say one, you mean running T steps and then you take some output.

>> Yujia Li: Yes.

>>: But then there is another T steps which will be [inaudible].

>> Yujia Li: Yes.

>>: I see.

>> Yujia Li: So, for example, in the shortest path example, to predict the first node on the path, you run the propagation process and pass that through a node selection output to get a node. And then to predict the second node on the path, you run the propagation and so on again to get an output.

For this, we started with bAbI task 19, which is path finding. Given a graph, you're asked to find the path from one node to another on the graph. The data is created in a way that guarantees there is only one path connecting the two nodes. This is the most challenging task of all 20 bAbI tasks, because almost all the previously proposed approaches failed pretty badly on it.
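Going back to the architecture just described, here is a rough sketch, under my own naming, of how the sequential variant chains prediction steps: each step runs a full gated graph neural network to produce an output, and a second network writes new per-node annotations that act as working memory for the next step. The output_net and annotation_net objects and their methods are hypothetical placeholders for the two networks; in the actual model the number of steps is not a fixed max_steps but is decided by the separate continue/stop prediction described below.

```python
def ggs_nn_predict(graph, init_annotations, output_net, annotation_net,
                   max_steps, num_prop_steps):
    """Sketch of the sequential prediction loop of a gated graph sequence network."""
    annotations = init_annotations
    outputs = []
    for _ in range(max_steps):
        # one full gated graph neural network pass for this prediction step
        h = output_net.propagate(graph, annotations, num_prop_steps)
        outputs.append(output_net.predict(h, annotations))   # e.g. a node selection
        # a second gated graph neural network predicts the next annotations,
        # which carry the progress of the prediction from step to step
        h2 = annotation_net.propagate(graph, annotations, num_prop_steps)
        annotations = annotation_net.predict_annotations(h2)
    return outputs
```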
In addition to this task, we also created two other bAbI-like but slightly more challenging tasks. One is the shortest path task: here it is not guaranteed that there is only one path connecting two nodes. There might be multiple paths, and the model is asked to select the shortest path among all the different possible paths. And also a Eulerian circuit variant task, which is even more challenging than the shortest path tasks.

There's one more thing we need for making predictions on sequences: in sequence prediction problems, the output sequence might have a different length for each input, so we cannot just fix the number of predictions for a particular input beforehand. We have to learn how long the output sequence should be. The way we did that was to learn a model to predict whether we should continue predicting or stop at a particular point. At each prediction step, in addition to the other prediction models we had before, we add another separate gated graph neural network output model that makes a graph-level binary classification prediction on whether to continue or stop. If this model predicts to continue, then we move on to the next prediction step; if it says not to continue, we stop right there. Using this mechanism, we can learn to predict variable-length output sequences. For the RNNs and LSTMs, we just add a special end token to the data and keep predicting tokens until an end token is reached.

Here are the results for those three different tasks. The gated graph sequence neural networks were able to solve all those tasks to perfect accuracy with a small number of examples. But the baselines, those RNNs/LSTMs, weren't able to solve them; they struggled pretty much on all three problems. Okay.

>>: Have you done any kind of analysis of what kind of advantages you might get in solving shortest path using this versus [inaudible] techniques? So I realize this is the learning task, but is there any particular advantage in solving shortest path or any of these graph-based methods using this?

>> Yujia Li: By traditional techniques, you mean like --

>>: You know, like Dijkstra.

>> Yujia Li: Dijkstra's algorithm. Yeah. Okay. Yes. So maybe in terms of efficiency it won't have any particular advantage over something like Dijkstra's algorithm. But one particular advantage it has is that this is a general framework. It's not limited to solving the shortest path problem. It can also be used to solve other types of problems where you might have some idea about what the task should be, but you don't have to specify an algorithm to compute it. You can just give it a bunch of data and ask the model to learn it. And you can use the same framework not only for this problem but for other problems as well. So I think it's more flexible --

>>: That's the framework. But in terms of just the shortest path task.

>> Yujia Li: For just the shortest path task, no, it doesn't have any particular advantage over Dijkstra's algorithm. Yes?

>>: But do you think that this could learn second shortest path? That seems much more difficult. And yet that would be straightforward with some other algorithms [inaudible].

>>: [inaudible].

>>: Seems very difficult.

>>: Use this for [inaudible].

>>: You can do K shortest paths.

>>: No, I mean -- but can you learn it.

>>: Yeah.

>> Yujia Li: That's just interesting. I don't know. I mean, we haven't tried.
>> Yujia Li: But -- >>: But shortest and longest I think I understand how it would learn it, but the others seem [inaudible]. >> Yujia Li: Maybe. Yeah. >>: So do you -- because presumably to do shortest path, before you put the first thing in the sequence, it already has to know the entire shortest path, right? So do you unroll it for more steps for the first one and then very few for the remaining ones? >> Yujia Li: In this particular task, we unrolled it for the same number of steps for all the prediction steps. >>: But before the output of the first one, it would have to know the entire solution, right, otherwise it can't really make an accurate -- >> Yujia Li: Presumably, yes. >>: So really it's kind of like you could just unroll it a bunch of steps and then just have it output every -- you know, have all its [inaudible] like without any unrolling after that, right? >> Yujia Li: Presumably, yes. We can even imagine a solution that just runs one huge propagation network but spits out outputs at different time steps. That might also be possible. >>: But generally for any of these things, it can't really start outputting things until it knows the entire answer for any of the problems that we actually care about, right, which might be these real problems, right? It can't start making decisions until it knows the entire correct answer. So you could imagine doing the propagation and then feeding it into just an LSTM and then running a sequence-to-sequence style model, where it somehow figures out all the answers that it needs and then it can just easily spit them out one by one, right? >> Yujia Li: That might be a solution. But for some problems, the solution does not necessarily have to be a global solution. You can make decisions locally at every step. Then you don't have to -- >>: Are there any problems that it actually [inaudible] for? >> Yujia Li: For example, if you want to traverse all nodes on the graph, you just do it step by step and go to the next node at every step. If your graph is just a cycle, which is a very simple toy example, even for this simple toy example some of the models like RNNs/LSTMs won't be able to solve it. >>: Let me ask you, have you tried solving TSP using this? >> Yujia Li: We haven't. Those aren't -- >>: But that would be interesting, right? That's NP-hard. And if you'd show any kind of improvement, any kind of -- so that would be pretty interesting to see. >> Yujia Li: I can imagine these models may be able to find some kind of approximate solutions without any guarantees. >>: So some sequence models have been done for TSP. We had -- not like great success, but it was [inaudible] better probability. >> Yujia Li: I hope so. Yeah. >>: Given the encoding structure. >> Yujia Li: Yeah. Okay. >>: I have a question on the slide you showed [inaudible], so can you go back to the -- yes. So you have this node annotation block. What's this one for? So basically there are two steps. What's this [inaudible]? >> Yujia Li: This is just another gated graph -- no, not -- it computes what will be needed in the next step as the [inaudible]. >>: Just like that, into the whole [inaudible]. >> Yujia Li: Yes. It's nothing fancy. The only thing that's particular about this one is that it's using a per-node prediction, because you have an annotation for each of the nodes.
But it's just a standard gated graph neural network block, the same as the other blocks. Okay. So I guess I have to wrap up soon. In the end, let me just briefly talk about the motivating application, which is program verification. This is actually a very interesting problem. Before I did this project, I didn't know anything about program verification. So what is program verification? Program verification is about verifying the correctness of a program: given inputs that satisfy some preconditions, we want to tell whether running the program is guaranteed to produce outputs that satisfy the postconditions. If the program guarantees that for any input satisfying the precondition, the postcondition holds after the program executes, then we say that this program is correct. Okay? And we need to formally describe what happens during the execution of the program. The challenging part is to analyze the heap memory state, which is a heap graph, and we need formal descriptions of the heap memory using some sort of logic formulas. Here we use separation logic formulas, which is a popular type of logic used to describe those heap memory states. We need those formal descriptions to be able to reason about the correctness and strictly prove whether the program is correct or not. >>: This problem has a standard solution already, huh? >> Yujia Li: Huh -- >>: [inaudible] I think it's the logic, is the one. So people always use logic to solve this verification problem [inaudible]. >> Yujia Li: So a little bit of history about this. Program verification has always used logic to represent programs and tried to go from the precondition to the postcondition. But in the past, people developed heuristics to describe what's happening in the heap memory states, tried to match the heap memory state to one of a set of predefined patterns, turned that into logical formulas, and stuck that into the verification pipeline. More recently, people have been developing machine learning algorithms to map those heap memory states to logical formulas and use that to verify more complicated programs, because the heuristics only work for very simple programs. >>: Okay, so this is a new -- kind of new trend in this field? >> Yujia Li: Yes. >>: [inaudible] okay. >> Yujia Li: Yes. >>: Good to know that. So what community of [inaudible] not machine learning [inaudible] paper like that? >> Yujia Li: Yeah. This is a totally new application domain for machine learning. But people in the software engineering and program verification community, they care about this. So one of my -- >>: Maybe you should talk to Sumit Gulwani. Sumit Gulwani has been working on this. >> Yujia Li: Okay. >>: So maybe you get a chance to catch up with them. >> Yujia Li: Yeah. So I worked on this because one of my mentors at Cambridge works on this, and my other mentor at Cambridge works on machine learning. So they decided to do a joint project. That's how I started working on this. Okay. So the solution is: once we come up with the separation logic formula to describe the graph, we can pass that logical formula to a theorem prover, which will verify whether this formula is indeed consistent with the program and whether it is indeed strong enough to complete the proof. Okay. The theorem prover will help us do that. And the whole verification pipeline looks like this.
So we have some input program, and we run it multiple times, potentially with different inputs, and stop at different times to get snapshots of the heap memory, which gives us a bunch of graphs. Then we have this machine learning pipeline that maps those heap memory graphs to a logical formula that is consistent with all those graphs and describes all the substructures in those graphs. This is where the gated graph sequence neural network comes in. Then those logical formulas are fed into a theorem prover, which tells us whether this is good enough or not. Okay. The key step here is to map heap graphs to separation logic formulas. And separation logic has a very particular, very specific grammar. When we make predictions, we just follow the grammar and do the prediction step by step, and every step is either a graph-level classification or a node selection. In the grammar, every step expands a nonterminal node into something else, and this something else can have many different branches. If you're choosing among those branches, then it's like a graph-level classification problem. In other cases, you'll expand a nonterminal node into one of the variables, which will be a node selection output. So that's how we predict those separation logic formulas. For the results here, we compared the gated graph sequence neural network model with an earlier approach that used heavily hand-engineered features, built from domain knowledge by domain experts, combined with standard classifiers. The dataset has 160,000 heap graphs generated from 327 separation logic formulas. And the gated graph sequence neural network model is able to achieve close to 90 percent accuracy without any hand-engineered features, which is even slightly better than the previous approach. So here -- >>: What's the supervised signal? >> Yujia Li: The supervised signal is, given a graph, you know what the correct formula is. >>: And so it's actually trying to produce the formula. It's not trying to produce, like, whether this crashes or not. It's just saying what the logical formula is. >> Yujia Li: We just predict the formula. >>: Okay. >> Yujia Li: And the formula is fed into the theorem prover that completes the proof. Okay. And here the accuracy is actually a very high accuracy, because we count it as correct only when it predicts the whole sequence correctly. The whole sequence might have many different steps, and to be counted as correct, you have to make all those steps correct. >>: Where did this label come from? Is it human created? >> Yujia Li: So this is the program verification problem. We can start with some simple programs for which we know the corresponding formulas. >>: But there's no -- there's no such general system to do this for a general program? >> Yujia Li: For a general program, no. >>: Okay. >> Yujia Li: That's the problem we want to solve. And people don't know how to solve those problems for complicated programs. But we start from a set of programs that we think are interesting, and we can generate data from them because we know the answers to them, and then we train this algorithm. Then we can use this algorithm to predict formulas for some other programs, and that's how we generalize to other programs. And here are some more qualitative results.
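To make the grammar-following decoding above concrete, here is a minimal sketch: at each step, the model either picks a grammar branch via a graph-level classification or picks a program variable via a node selection, following the grammar until no nonterminals remain. The tiny two-production grammar, the sum pooling, and all weight names are illustrative assumptions, not the actual separation logic grammar or the speaker's model.

```python
# Minimal sketch of grammar-guided formula decoding (toy grammar, illustrative only).
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def decode_formula(h, var_names, params, max_steps=10):
    """h: node states from propagation; var_names: one program variable per node."""
    tokens, stack = [], ['EXPR']              # start symbol of the toy grammar
    while stack and len(tokens) < max_steps:
        symbol = stack.pop()
        if symbol == 'EXPR':
            # graph-level classification: which production to expand EXPR with
            graph_vec = h.sum(axis=0)
            branch = int(np.argmax(softmax(graph_vec @ params['W_branch'])))
            if branch == 0:                   # EXPR -> ls(VAR, VAR)  (toy "list segment")
                tokens.append('ls')
                stack += ['VAR', 'VAR']
            else:                             # EXPR -> tree(VAR)
                tokens.append('tree')
                stack += ['VAR']
        elif symbol == 'VAR':
            # node selection: score every node and pick the associated variable
            scores = (h @ params['w_node']).ravel()
            tokens.append(var_names[int(np.argmax(scores))])
    return tokens

# Toy usage with hypothetical random weights and three heap-graph nodes.
rng = np.random.default_rng(2)
h = rng.normal(size=(3, 8))
params = {'W_branch': rng.normal(size=(8, 2)) * 0.1,
          'w_node': rng.normal(size=(8, 1)) * 0.1}
print(decode_formula(h, ['x', 'y', 't'], params))
```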
In addition to applying it to this dataset, we have also successfully integrated this system into a program verification pipeline, and it can successfully verify a test suite of list-manipulating programs in a benchmark set, which just contains some very simple list-manipulating programs like traversing a list, concatenating two lists, copying a list into another, and things like that. And those are the kinds of logical formulas found by our model. Those logical formulas don't look very impressive, but our model is actually able to predict more complicated formulas than shown here. The reason we didn't do it on more complicated programs is that at the time I did this, the other parts of the program verification pipeline weren't ready to do program verification for more complicated programs. >>: Is the input a bunch of snapshots of the heap of the same program and the output a logical [inaudible], or is it, for a given snapshot of the heap, produce a logical formula for that snapshot? >> Yujia Li: Yeah. So that's one thing that I didn't talk about. Actually, when we applied it, we ran the same program and got a bunch of snapshots for that program at different times with different inputs. And then we mapped that bunch of snapshots into one single formula that's consistent -- >>: Okay. So you map your program to a particular logical [inaudible] generated with a bunch of snapshots of that program. >> Yujia Li: Yes. Yes. So it's kind of looking at the execution traces of this program and trying to see what it's doing by analyzing the heap graphs. Kind of like an induction process. But it's not guaranteed to always be correct, because if you only have a bunch of snapshots that don't cover some cases, this will not be a correct formula. But the theorem prover will tell you whether this is correct or not. If it's incorrect, it will give you a counterexample that's added to the set of snapshots, and you make a prediction again until it reaches a satisfactory status. Yes. >>: So are you going to talk about how you applied your model to solve the problem, or was this -- I mean, you already showed the results. I was just curious, how did you take a set of snapshots and output one formula? >> Yujia Li: Yeah. I didn't plan to talk about that, but I can give you some intuition. For example, if you want to choose a node, you know that in all those graphs there are some nodes associated with a particular variable, right? Because when you define programs, you have some variables associated with some kind of pointers. And when you predict a node, it chooses one of those variables. For each variable, there is one node in each of those graphs. Then you accumulate the scores for all those nodes to be the score for that variable, and you choose one of the variables from there. It's a bit -- >>: You did some aggregation over all the snapshots as you're constructing the formula. >> Yujia Li: Yes. Yes. >>: Okay. >> Yujia Li: And here is a more complicated example with nested data structures, just showing that the model is able to predict something more sophisticated. So it has a list of lists and a tree and a special kind of list where it ends in a cycle, and the model is able to predict all those nested structures correctly. To wrap up, there are a few future directions. We would like to explore the model space a little bit more and further understand what's going on in this model.
And we think this is a very general and promising framework, and we want to apply it to some other applications. Another interesting direction is to learn to construct the graph. With this, I would like to conclude, and I'll be happy to answer any other questions that you have. [applause] >> Yujia Li: Yes. >>: So I have a question regarding other applications. I would assume that it's pretty easy to turn this into a technique that works on weighted graphs -- not graphs like this, but probabilistic graphs where you have a relationship between A and B which is probabilistic, like in 90 percent of cases A fears B. And I think that's pretty easy to incorporate in this. And then with that, really a lot of different applications might open up. >> Yujia Li: Yes. >>: So I'm kind of curious why you went for these particular ones and not more traditional machine learning applications. What are the other applications you have in mind? >> Yujia Li: One example I have in mind is to use this as a kind of prediction model for any problems that have graph structures. Traditionally people use graphical models for a lot of prediction problems on structured data, where the edges or factors connect different variables in the graph. And usually, once you learn the model, you need to run an inference procedure to make predictions. But the inference procedure, if it's based on message passing or whatever, is just another iterative propagation process on the graph. And we can totally replace that with this graph neural network and use it directly to make predictions that have complicated output structures. The benefit of using graph neural nets is that it doesn't have the intractability problem of a lot of graphical models, because as long as you have moderately complicated potential functions defined on the graph, the inference is intractable. But for this, you can add arbitrary connections between those and still be able to get a good model that's trainable with backpropagation. >>: So you're talking about MRFs? >> Yujia Li: Yes. >>: Like segmentation problems? >> Yujia Li: Yes. >>: Multiplication problems? >> Yujia Li: That's -- that would be an interesting application for this. >>: So even this is not guaranteed to [inaudible] approximations, even in message passing for [inaudible] graphs. But can you guarantee anything about the prediction you make, the predicted solution you make here? >> Yujia Li: I'm not thinking about guaranteeing anything, which is kind of common for any neural network models. It's just learning input/output mappings. But it doesn't have some of the restrictions of the graphical models. >>: But do you have a [inaudible] model as well? >> Yujia Li: I do. But the propagation model doesn't have to follow the probabilistic formulation as in a lot of probability [inaudible]. >>: [inaudible] some sort of parameters for which you might also like and you don't know that? >> Yujia Li: It's possible. Yes. But I think this is still a more flexible and powerful propagation process, because you can use any arbitrary nonlinearity, you can parameterize it in any arbitrary way, and it's still able to train. >>: [inaudible] propagation and you can just keep running it, right? And, of course, no guarantee, but it's still going to give you something useful. And that's all you're saying, right? >> Yujia Li: [inaudible]. I think it's more powerful than belief propagation.
It can -- >>: [inaudible] propagation or not the entire prediction pipeline, but just the propagation piece I think probably -- you know, initially it will have some of the same issues as graphical networks, you know, and it's part of the -- I mean, I don't see how this framework alleviates those problems. >> Yujia Li: Uh -- >>: [inaudible]. >> Yujia Li: Yeah, it might still have some of the problems, but I still think -- >>: A related question is, mathematically it's similar to belief propagation, because instead of having factors and computing messages based on those factors, you directly learn the messages. >> Yujia Li: Yes. >>: So have you thought about that? Have you compared things? What other sort of things -- why do you -- like, what's your intuition on the differences there? >> Yujia Li: We haven't tried this, first of all. The thing that's pretty appealing to me is that this kind of approach can directly learn the propagation behavior, instead of deriving the propagation behavior from optimizing some objective function. That is very appealing to me, but I don't have a good intuition about what this approach -- I mean, I know this approach can do whatever belief propagation or whatever approximate inference algorithm can do, by simulating it with a particular graph structure and particular nonlinearity functions, but it can also do much more. It's not restricted to be within the probabilistic framework. But other than that, I don't have a good understanding of what this cannot do, what kinds of problems this will face if we apply it directly in that way. >>: And then regarding your results where you showed that you can beat RNNs by using many fewer data points, because you're using some structure -- >> Yujia Li: Yes. >>: -- that's previously known. But I'm wondering if you could also show that even if -- even using the same data -- well, actually you are using the same data in this case. But even if you're working on a dataset which is not structured like this but pure text, for example, and trying to generate text, then you could first go into this text and use traditional techniques to find some sort of potentials, like this word appears before that word, things like that, and initialize your graph, and use that as initialization for your RNN. Would you then end up getting better results in that case too? What's your intuition, based on all these experiments, about the local minima [inaudible]? >> Yujia Li: So, first of all, I think it should work if you have some kind of knowledge graph. If you want to generate things from that -- >>: What I meant was you generate a knowledge graph by some simple statistical procedures from pure data, and then you use that to initialize your RNN. >> Yujia Li: Yes. I mean, that's kind of providing you with some extra knowledge, and I think utilizing that won't make your performance worse, at the least. With regard to the local optima issue, I think this kind of model is very structured, so it kind of provides you with a very good direction to search in, compared with a standard RNN that's not that restricted. So this gives you a better search space, basically, and it's easier to get a good solution in this more structured space. But in the unstructured space, there are many different ways for it to get trapped. So that's my intuition behind this, but I don't have any theory to support my claims.
And usually, if we have the right structure for some problems, then training these gated graph neural networks will usually find a good solution in very few iterations. You would see the error drop from a hundred percent to zero percent almost immediately, like in a few iterations. So that's the kind of behavior I was observing. >>: We have time for one more question. >>: [inaudible]. >> Yujia Li: You may be able to use a kind of 19-by-19 grid graph and do propagation on that to give you candidates, maybe. Yeah. >>: Let's thank Yujia. [applause] >> Yujia Li: Thank you.