>>: It's my pleasure to introduce Yujia. So... at the University of Toronto working with Rich Zemel in...

advertisement
>>: It's my pleasure to introduce Yujia. So he's finishing his Ph.D. work
at the University of Toronto working with Rich Zemel in the machine learning
group. He is not new to Microsoft. He did internship in 2014 in the speech
and dialect group and in 2015 in MSR Cambridge in the machine learning group.
His work in the speech group in 2014 got wide media coverage. And his Ph.D.
work is published in like top-tier machine learning and vision conferences.
He's focused more in unstructured learning and learning structure models.
So please welcome Yujia Li.
>> Yujia Li:
Thank you.
[applause]
>> Yujia Li: Thank you, Gupta [phonetic] for the kind introduction. So my
name is Yujia. Before I start, a little bit of introduction to myself.
So I did my undergrad in Tsinghua University in China studying computer
science. So when I started my exploration in machine learning and artificial
intelligence, during my undergrad I did an internship at Baidu, which is the
lead search engine in China where I applied the machine learning techniques
I've learned in real-world machine learning-related problems.
And then I went on to do a master's degree in University of Toronto, and then
continued to do a Ph.D., as [inaudible] introduced. I did two internships in
the past two years at Microsoft Research. One in 2014 in the speech group
and another one in 2015 in MSR Cambridge, which lists here the project that
I'm presenting here right now. And I'm expected to graduate in this fall.
So during my Ph.D. I've worked on a few different things. There are two key
focuses of my work. The first is developing structured models and learning
methods for structured props. I started with graphical models and studied
the use of high-order potentials and developed more expressive potentials as
well as efficient inference and learning algorithms for them.
More recently, I moved on to study structured neural network models and tried
to combine and connect graphical models with this kind of structured neural
network models for structured problems.
The other focus of my research is unlabeled data and unsupervised learning.
I've worked on generative models of data developing new training algorithms
for learning generative models and learning representations of data. Now I'm
working on analyzing and relating different learning algorithms for
generative modeling.
I've also worked a little bit on semisupervised learning as well to make use
of unlabeled data to improve prediction performance and doing semisupervised
learning for structured prediction problems as well.
So in this talk, I'm going to talk about the gated graph sequence neural
networks for making predictions and sequences of predictions on graphs. This
is something I did last summer at MSR Cambridge, and I've been working on
this ever since.
So there are many different forms of data and problems that have graph
structures. The molecules studied in computational biology and chemistry are
graphs of atoms and chemical bounds. And people are interested in finding
whether a certain molecule has a certain property or not.
And knowledge bases are graphs of concepts and their relations. And we may
be interested in learning representations for the facts such that they can be
used in other places more easily.
And even logical reasoning can sometimes be formulated as graph prediction
problems. For example, the pairwise predictions. The pairwise predicates
relates to variables and therefore forming a graph. And doing reasoning is
therefore can be formulated as making prediction on graphs.
The motivating application for this project, however, is on analyzing dynamic
data structures created in the heap memory. For example, like linked lists
and trees. Where the pointers link memory nodes together and naturally forms
heap memory graphs. And analyzing the heap memory states can tell us how the
program behaves and therefore can be very useful in verifying the correctness
of the program and analyzing properties of the programs.
So that's the motivating application for the project. But there are also
many other different applications possible with graph-structured data and
problems.
So to make predictions on graphs, it's necessary to have a good
representation for the graphs and the different components in the graphs. So
the most straightforward way to come up with representations for
graph-structured data is to use handcrafted features, like graph fingerprints
and others. And usually people design some simple graph properties as
features. And those features will be used in different tasks. But the
problem is whenever you want to solve a new task, you usually need to design
new features adapted to that task. And this is very time-consuming.
And then the other side, the features handcrafted are usual very simple and
therefore not able -- not powerful enough to solve more complicated tasks.
There are also approaches that uses -- defines graph kernels instead of
defining the features themselves. This is a little bit better than designing
features, but still they're quite limited in what kind of problems they might
being able to solve.
Another interesting approach is to use random walks on graphs. So the idea
is actually very interesting. So the approach is to run a bunch of random
walks on the graphs. And each random walk gives you a sequence of nodes on a
graph. And then you can use the word representation learning algorithms
that's being used to learning word representations from sentences here to
learn graph node representations from the sequences of random walks.
And, however, this -- even though it's an interesting idea, but still quite
limited. For this work, we based our research on this powerful graph neural
networks model that I'll explain in more details that's based on the powerful
neural network formulations. And more recently there are also other neural
net-based graph representation learning algorithms like the neural graph
fingerprints and conv nets on graphs, which are also related.
So graph neural networks, which is a powerful model to learn representations
for graphs, and then making predictions based on those learned
representations. So it has two components. First one is a propagation
model, which uses a propagation process on the graph to learn node
representations. And then a seconds part is an output model that takes those
learned node representations and make predictions from those representations
for the graph.
First let me dive into more details of the propagation model. So propagation
model learns representations for nodes using a propagation process. Here I'm
showing an example graph that has two types of edges and is a directed graph.
So the two types of edges are denoted by different colors. And edge type is
an important information that tells us different relations between those
entities in the graph.
And in the propagation model, it learns node representations. And each note
representation for each node has a representation at each propagation step,
which is represented by a vector. You can think of it as hidden units in a
neural network.
And then the propagation model propagates those representations along the
edges. At each propagation step, a node -- the representation for one
particular node is propagated and transformed along an edge and passed to its
neighbor on that edge. And the transformation behavior is determined by both
the edge type and the direction of the edge. So here we consider both
propagation and both directions to facilitate the flow of information on the
graph.
And more concretely, this is the propagation step that's being used in this
graph neural network paper. Here those F functions are transformation
functions that transforms the representation for a node into another form
based on its edge type and direction.
And then after the representations are transformed, each node collects all
the incoming representations and summed them all up to use that as the
representation for that node for the next iteration.
The simplest example for the transformation function is a linear function
parameterized by the transformation matrix A and the bias B. And those two
parameters depends on the edge type and direction. This is the simplest
possible transformation function. But usually we not -- we not only -- we
not directly use the linear transformation but instead use the linear
transformation, instead apply some nonlinearity.
>>:
What are [inaudible]?
>> Yujia Li: [inaudible] are -- as this arrow said, it's edge type and
direction. So it's kind of saying you have a set of parameters for each edge
type and direction.
>>:
Oh, I see.
>> Yujia Li:
>>:
So it's like a clustering mechanism.
It's not doing clustering directly.
It's --
Oh, how many different values can L take?
>> Yujia Li: If, for example, here, for this graph, you have two edge types,
black and red, and you have two directions, forward and reverse. So then you
have four different Ls.
>>:
[inaudible] different on the nodes, right?
>> Yujia Li:
>>:
Yes.
[inaudible] is a node.
All depends on the edge.
I mean, the edge between two nodes.
>> Yujia Li:
Yes.
>>: So you actually have four nodes. So you'll have like four different
arrows [inaudible] direction you actually have eight different arrows.
>> Yujia Li:
>>:
So you have -- each edge has a type and a direction.
So this --
>> Yujia Li: No matter which node it connects. So as long as it has the
same edge type and the same direction, then you have the same set of
parameters. So this is not used as an input in the standard way for a neural
network. It just denotes which set of parameters is used for this edge type
and direction.
>>:
But the parameter is a type.
>> Yujia Li:
>>:
Right?
Yes, yes.
Okay.
So, for instance, between edges --
>> Yujia Li: So, for example, this is edge 1 to 2 has the same parameter as
this edge, 4 to 3. Okay?
>>:
[inaudible] this type [inaudible] predicate.
>> Yujia Li:
>>:
Yes.
[inaudible].
>> Yujia Li: Yes, yes, yes. Since learning the behavior depending on the
edge type and direction. Okay.
So that's the propagation model. And it's an iterative process. And this is
a recurrence equation. So usually this is run for a number of steps, and
then we stop there and get the representation for each of the nodes. That's
your node representation, from this graph.
>>:
[inaudible] which this is kind of stabilized, because you have loops?
>> Yujia Li: Yeah, I'll talk about that later. But it's -- in general it's
not guaranteed to even converge. If you don't restrict the parameters in any
way. But ->>:
Learning it with a fixed number of steps, anyhow.
>> Yujia Li: Yeah, we're learning it with a fixed number of steps. But I'll
go into that later. So that's the propagation model. And the -- then once
we get the node representations, there's an output model that maps each of
the node representations to an output. And this mapping function, G, is -can also be a neural network. And this model is able to make predictions for
each of the nodes. So that is the output model.
So both the output model and the propagation has a number of parameters. And
we need to learn those parameters from the data. In this model, the
propagation model is actually -- can be unrolled as a recurrence network,
actually, because it's just running the iterative recurrence equations over
and over again and can therefore be unrolled into a recurrent neural network.
And then the whole thing can actually be trained with back propagation
through time. However, if you want to use back propagation through time,
then you have to keep track of all the intermediate states. And it might
take a long time and many iterations for this propagation process to
converge.
So the -- as proposed in the original paper by the others, back propagation
through time seems expensive. But there is a way to apply as a restriction
to the propagation models such that the propagation function becomes a
contraction map. What that means is that the propagation model, one step of
propagation will map node representations that are close to each other to be
even closer or at least as close in the next step. So once it satisfies that
condition, the propagation model becomes a contraction map.
And if the propagation model has this property, then it has an interesting
behave area that it has -- it will -- it's guaranteed to converge, first of
all, and then it has a unique fixed point that has nothing to do with
initialization. And those are interesting properties that can be exploit to
develop a more efficient training algorithm. And the authors propose to,
based on this restriction, run the propagation until convergence and then
just do training around the fixed point using the classic Almeida–Pineda
algorithm. You don't have to back propagate all the way through the
recurrence. And you just need to train in the vicinity of the fixed point.
So that's the proposed approach to train this.
Our modification [inaudible] model, which is the for modification that's
modeled, which we called gated graph neural networks, which is actually not a
very good name because we not only added gating mechanism to the network but
most importantly we unrolled the recurrence for fixed number of steps and
just used back propagation through time with some modern optimization
methods.
Doing this actually has a number of benefits. First of all, we don't
restrict the propagation model to be a contraction map. And therefore the
model has a lot more power and capacity to solve more complicated problems.
And we have actually shown and proved that if we restrict the propagation
model to be a contraction map, then the model will have trouble modeling some
large independencies, which will be important in many different cases.
Second of all, as I mentioned before, if the propagation model is a
contraction map, then the initialization actually doesn't matter to the fix
point. But now we don't -- we don't have this restriction anymore. So
the -- the way it I understand up with after a fixed number of steps actually
depends on where initialized, how do you initialize the propagation process.
And that can be exploited to fit problem-specific information to this model
and make the model produce problem-specific node representations, which will
be helpful in a lot of different cases.
And, third, we are learning to compute representation within a fixed time
budget. And during training we unroll this propagation process for a fixed
number of steps. And at testing time we're also unrolling for a fixed number
of steps. So that creates a better alignment of the training process and
tests process, which will be beneficial.
And in the end we've also empirically found that adding a gating mechanism
does help making the propagation model a little bit better and training a bit
more stable.
Next I'm going to talk about some of the details in this gated graph neural
network model. The first thing is the initialization. As I mentioned, here
initialization actually matters, and we can use this to fit in
problem-specific information into the problem, which we call those node
annotations. Just to give you an example about what those node annotations
are, think about simple reachability problem on a graph. So the problem is
given a graph and given two special nodes -- one's a source node, the other
is the target node -- what other target node can be reached from the source
node.
So here in this problem, there are two certain nodes, the source and the
target. We have to somehow encode this information and give that to the
model. Otherwise, there's no hope of solving this problem. And probably the
most straightforward way to encode this information is to use a
two-dimensional vector for the source node A, which is encoded as 1, 0, and
the target node encoded as 0, 1, and all the other nodes are 0 and 0. So
that seems to be a minimal -- seems to be a very straightforward way to
encode that information and to distinguish those two special nodes from
everything else.
So those are the kind of node annotations that we are talking about. And if
we initialize the node representations in this way, then it's easy to imagine
a propagation model can learn to, for example, copy the first bit of every
node to its neighbor. And then after the propagation, the node
representations will look something like this. And if we want to check
whether this target can be reached from the source, we can just try to match
if the target node has a particular pattern with 1 and 1. And an output
model classifier can easily learn that pattern.
So this is something that we can imagine. If we initialize the nodes in this
way, then we can imagine the propagation model to learn this behavior.
And here's another example where the target is not reachable from the source
because of the directions of the edges. And the point here is we're using
the same node annotation mechanism, same propagation model and same output
model, but we're able to solve different instances of the same problem for
different graph instance -- graph prediction instances.
So that's the particular kind of node annotations. And it's very simple.
Just encodes all the necessary information and fit that into the propagation
model.
>>: So this doesn't mention [inaudible] this 0, 1, where does it go?
it go into your matrix A, matrix B?
>> Yujia Li:
>>:
Does
This goes into the node representations.
[inaudible] is two dimensional, or is it a dimensional?
>> Yujia Li: So for each node here, for each node, if we just initialize
each of the node representations to be this two-dimensional vector, it will
be two dimensional.
>>:
So each of the nodes has a separate HP?
>> Yujia Li:
Yes.
>>: And -- and in the neural network, you'll -- you have this A dimensional
vector going to it with the output model?
>> Yujia Li:
>>:
Um --
Or does your output model take --
>> Yujia Li: I'll talk about the output model later. But in the simplest
case, an output model can just -- you know, for each nodes, it takes a node
representation for that node and make a prediction for that node.
>>:
Okay.
I see.
So output model is per node.
>> Yujia Li: This is like the output model used in the previous work. But
we've expanded the only output tasks by adding more output models. I'll talk
about that later. Okay?
So this is node annotations. But in practice, actually what we did was we
padded those problems, specific node annotations, with some extra 0s to just
increase the dimensionality of the node representations which adds more
capacity to the node representations, which is useful to contain more
information. Yes?
>>: You could also have had a learned representation for each the three
types raised. Either this is the reachable one, this is the not reachable
one. Or this is -- this is green, red, or white, right, basically, right?
And that had like the three embeddings for ->> Yujia Li:
>>:
Right?
>> Yujia Li:
>>:
Yes.
Uh-huh.
Were there any projects you tried something like that for?
>> Yujia Li: We haven't tried that. But, I mean, we've found this simple,
straightforward node annotations to work pretty well. But I guess in more
complicated tasks, it might be beneficial to learn them.
>>: Understanding what you were saying the previous slide.
by the reachability?
What do you mean
>> Yujia Li: Reachability is saying is there exists a path from a source
node to a target mode. If there is a path connecting them, then the target
is reachable from the source.
>>:
it?
Yeah, so then you -- then you encode that in addition?
Want to encode
>> Yujia Li: You want to encode which one is a source node, which one is a
target node. We have to know that. Otherwise there's no hope of solving
this problem.
>>: So the input is this thing, and then the training time of the loss
function says in this instance I want green -- the first dimension of green
to be zero, whereas in the previous slide, the loss functions says I want the
first dimension of green to be 1, right? That's basically how you're
encoding it, right?
>> Yujia Li: So it has two processes. The first is propagation process.
After the propagation, you get some final node representation. And those
node representations are fit into, in this case, the output model, which is a
classifier. That's like a binary classifier that tells you whether it's
reachable or not. And then you have the loss coming in at the end of the
classifier. And then that's -- all the way back propagates through the
propagation model.
>>:
Right.
>>: Maybe I'll understand later when you show the examples. Because where
is the input to the model? Is the green node the input to the model?
>> Yujia Li: Oh, the input of the model is -- so the model has to know the
graph structure. It basically takes a list of edges and construct the graph.
And now you have the graph. And then you have to have some problem-specific
information that tells you which node is the source and which node is the
target. And then the expected output is whether the source node ->>:
Is the graph random?
>> Yujia Li:
The graph is given to you.
>>: Okay. So I guess I'll understand this once you get complications.
Right now I can't map it to a problem in my head.
>>: Well, but in this problem, the -- everything -- can you go back a slide.
This is a little -- go back one -- wait. Yeah. So this -- this is the
input, right? This at training and test time, this is the input. But then
at training time the additional information you get is that in this
particular instance green is reachable.
So that's what the loss function comes in, it's telling you in this target
you want to now make the first dimension of green to be 1 after doing the
whole propagation. Right? And that's what -- that's what you learn at
training time that you ->>: So I think this is, you know, you don't need to come to the training
stages, like propagation stage. You also graph [inaudible] you need to
follow ->>: If there's no propagation, it doesn't -- it doesn't know what function
to learn. The only reason why it's learning this propagation function is
because a loss function tells it this is the function I want you to learn.
You're learning a different function, it wouldn't learn how to propagate.
>>: But here is actually just check. So if you allow several propagations
and I initially initialize them 1, 0, and then you check after several steps.
If your target node has the first beat as one, then it's reachable.
>>: But that particular function? No, there is [inaudible] that particular
function is not intrinsic to the thing.
>> Yujia Li: So this is -- this is -- this simple example is a simple
supervised learning task where the input is a graph and two nodes. One is a
source, one is target. And the output is you want to predict whether on this
graph the target can be reached from the source or not.
>>:
So you have a binary label for each [inaudible].
>> Yujia Li:
>>:
Yes.
Which is yes or no for each green -- for the green node.
>> Yujia Li: Yes, yes, yes. And this is a particular way to encode this
source and target information into this graph propagation framework. And
that's what we call node annotations, which is very straightforward
translation of the ->>:
So in this graph you're going to be learning those A and B parameters?
>> Yujia Li:
>>:
Yes.
Yes.
And then, again, how are you going to know if it's reachable or not?
>> Yujia Li: It's a supervised learning task.
be given those correct outputs.
>>:
So during training you will
But there is only one set of outputs right now.
>> Yujia Li:
The output -- so for each graph and a pair of nodes --
>>:
You only have one [inaudible] there, so there is no data yet in my mind.
>> Yujia Li: Oh. Oh, during training you will have a bunch of graphs, a
bunch of different nodes. For each of them you have a desired output,
whether one is reachable from the other.
>>:
Okay.
>> Yujia Li: So it's like -- so this is a very general framework. You don't
have to specify what is the property you want to have. You just give the
model of bunch of data and the model will learn what's the correct concept
corresponding to this task. Okay?
So this is the initialization. And then the propagation model, we also made
some changes. The -- we added gating mechanisms and some other minor
difference to the original graph neural network propagation model. So if we
unroll one step of the propagation, then we get something like this
feed-forward layer here.
On the maps, the nodes representations in one step to the next. But this
feed-forward layer has a very special connection structure. It has a very
sparse connection structure. And each of those connections might also have
shared weights and shared parameters, depending on which edge type it belongs
to and which direction it belongs to.
So if we concatenate all the node representations into the big vector, then
we can actually write this more compactly using matrix operations. And here
the transformation matrix A takes a -- also takes a very special block
structure with a lot of sparsity in there. And each block can share
parameters with some other blocks depending on the graph structure. Okay.
>>: You could have [inaudible] right? That basically is standard recurrent
neural network would just be a single self-loop, right?
>> Yujia Li: Yes. Yes. Yeah. So you can also think about this as a
recurrent neural network with a structured transformation -- transition
matrix instead of a fully connected one.
And this is how we formulated it. It looks exactly like a recurrent neural
network. So this is more like a vanilla neural network kind of propagation
model. But instead, we use something more like a gated recurrent neural
networks which is -- which uses some gating mechanism, added the reset gate
and the update gate to modulate the information propagation process on the
graph, which we found to be helpful.
We can actually also use more complicated ones like LSTM, long short-term
memory, recurrent neural networks, but we found this to be a bit simpler and
it works as well or even better in many cases. So that's the changed
propagation model.
And for the output model, as before, we still have the per-node output, which
is the same at the graph neural network model, but we also added two other
output models to make our model be able to solve more complicated tasks. The
second one is the node selection output, which instead of making a prediction
for each node, it selects one of the node for each graph. And the way it
does that is it uses the output model to output the score for each node in
that graph and then pass the scores through a softmax to select a node for
that graph.
This is very useful in a lot of different problems. And this provides a way
to making predictions where the output is not in a fixed number of classes.
So this, here, the output depends on the number of nodes in the graph. And
this is node selection output is able to handle that.
The third one is the graph-level output. So it makes a prediction at the
graph level, either not classified, what this graph category is or doing some
other kind of regression problems like that. And the way we did that was to
first come up with a graph representation vector, which is a weighted sum of
all the node representations in the graph. And the weighting of each node is
given by another neural network which is learned and takes node
representations and initial annotations as input to compute the weight.
And then the graph representation vector can then be used in any standard
machine learning tasks like classification or regression in the standard way.
So those are the three kind of different output models that we consider and
developed for a bunch of different tasks that we experimented in this
project.
The whole network, then, looks like this. So on the left side there's
initialization where we fit in problem specific node annotations and then pad
with 0s to initialize the node representations. And then we pass those into
the propagation model, which is run for a fixed number of T steps. And then
once that finished, we get node representations fed into the output model to
compute an output for each of those graph prediction problem instance. And
the whole thing is trainable with back propagation, and that's exactly what
we did.
So to test whether this [inaudible] graph neural networks makes sense and to
test whether it can learn something useful, we started by testing it on some
toy graph property tasks, like the reachability task that I mentioned before,
and also sharing task, whether two nodes can reach the same node or not. The
cyclicity task to check whether a node is in a cycle or not, and the reaching
cyclicity, whether a node can reach a cycle or not. So those four are very
simple graph theoretical properties.
And for all of those -- for tasks, we generated some random graphs and picked
some random nodes and try to generate data to simulate tasks like that. And
for all those four tasks, our models is able to learn from only a few tens of
training examples. And using a very small model that has like 100 to 200
parameters to do perfectly in all those four tasks.
After that, we moved on to some more complicated toy task, still toy tasks.
And some tasks comes from the bAbI experiment -- bAbI synthetic dataset
proposed by Jason Weston and others, which is a bunch of natural language
reasoning tasks that requires reading the natural language input and do some
reasoning and answer a question.
Here we used the symbolic format of the data. So we're kind of exclusively
focusing on the reasoning part of the problem and ignoring the parsing part
of the problem. So the results are not directly comparable with other
people's published results but still provides some interesting insight to how
this model works.
So one example of the bAbI task here is about basic deduction. So this is
how the input data looks like. So the input data is composed of a bunch of
different instances. So for each instance, the input is given to you as a
list of facts. And here for this task in particular, each fact is one edge
that encodes one relation between two entities. So here it is saying D is A,
B is E, and A has fear of F, G is F, E has fear of edge.
And once all those facts are listed, there's a question near the end that
asks a related question that you have to do some reasoning based on those
facts. Here it is asking which entity does B has fear of.
And then if the data is given in this format, it's very easy to transform
that into a graph and it's very straightforward conversion process. Here is
the graph converted from this input data ->>: Here when you talk about the symbolic format of the [inaudible] are you
saying that you just assume that you have encoded its logical form
[inaudible] logical form [inaudible]? I'm just curious. Because originally
[inaudible] so you -- I mean, whatever is in the natural language in the
original data versus what you use the single to represent those entities,
how -- how is the [inaudible]? Maybe you say more words about how to
translate original questions into the representation you're talking about,
how to get the label here and the ->> Yujia Li:
>>:
Yeah, yeah.
What kind of assumption you are making here.
>> Yujia Li: Yeah. So here the original format of this bAbI dataset is each
fact is represented by a simple sentence. So example, D is A ->>:
What would be original sentence that correspond to [inaudible].
>> Yujia Li: So here it might be something like -- so, for example, D is A
might be something like J is a ship.
>>:
Okay.
>> Yujia Li: And here B is E, A has fear F maybe like ship is fear of
wolves, something like that.
>>: Okay. And you make sure that all these entities are correct in terms of
[inaudible].
>> Yujia Li:
>>:
So we -- actually generating the symbolic data --
Modify that, of course, you know, you have different task.
>> Yujia Li: Yeah, yeah. So actually we -- here what we -- so here we just
remove the parsing part and assume that we've already got that. And it's
actually not very hard to get that. Because the dataset is kind of
synthetically generated dataset. So it starts -- so to generate those bAbI
data, what the others did was they started with those kind of symbolic format
and then rendered it into natural language using some very simple rules.
So I can just add a switch to the data generation process which give me this
kind of symbolic data.
>>:
Okay.
>>: So in this case you're showing here one data point, or you're showing a
representation of the entire dataset? What is going to be the data you're
training on? Are you going to have some examples of D has fear of G that
you're going to try to fit to?
>> Yujia Li: Oh, okay. So this is just one instance, one training example.
So here this list of facts is the input, and this question at the end with
the answer is the output.
>>:
Oh.
So I didn't see that.
So the bottom --
>>: When you do this graph, you know, -- so sort of one by one or do you
pool entire, you know, task 5 they had together to come up with a fixed set
of [inaudible] and then create one single graph?
>> Yujia Li:
>>:
So we create different graphs for different examples.
[inaudible] I see.
>> Yujia Li: Is things we learn is the thing that can -- can be generalized
across different graphs. For each of those graphs, we're solving the same
problem. It's just the graph is different and the input is different. But
the knowledge that will be applied to those graphs is the same.
>>:
Is the same.
>> Yujia Li: And that's what we learn.
propagation model and the output model.
>>:
That's the parameters of the
Okay.
>>: But you don't mind anything -- like A, all the letters are totally
arbitrary and you don't learn those at all [inaudible] models. Right?
>> Yujia Li:
>>:
No, no.
Okay.
>>: So in this particular case you would label B as 0, 1, and H as 1, 0 if
you know for this example that B and D has fear of H. And then you would run
your -- you would set every example like that and then train. So this case,
this is labeled, right? So you know if B has fear of H or not.
>> Yujia Li:
Mm-hmm.
>>: And if it does, then you would label it as 1, 0, 0, 1. If they don't,
you label in some other way and then train. Is that what you are ->> Yujia Li: Here the input of this -- so here is how we encoded this. So
we have the graph and constructed the graph using the set of edges and edge
types. And here there's only one special node, which is B, and we label it
as some special annotation, say 1, 0.
>>:
Does it depend --
>>:
The answer is --
>>:
-- on the real answer?
>> Yujia Li: The answer is the output that we want to predict.
encode this doesn't depend on the output.
So how we
>>: So you're saying what the person is going to tell you, that what
question they care about before the propagation step, but you don't know what
the actually answer is except at training time?
>> Yujia Li: You have to know what questions they asked, but you don't know
what the output is when you start the propagation model.
>>: But during the learning, is there -- is this unsupervised in some way or
it's supervised?
>> Yujia Li:
This is truly supervised.
>>: So supervised. So you had an example.
the training, if this is a training sample?
>> Yujia Li:
>>:
Yes.
Do you know if the sentence correct or not?
>> Yujia Li:
>>:
Do you know if that sentence in
Yes.
And based on that you're going to label the nodes?
>> Yujia Li:
Yes.
>>: If it's incorrect, you're going to label it differently and then you're
going to train on these examples of graphs [inaudible] is that what you're
doing, or something else?
>> Yujia Li: You don't label the nodes based on the output.
nodes based on the question.
>>:
That's -- that's the output.
>> Yujia Li:
>>:
You label the
The question will be --
So based on the question you're going to label the nodes.
>> Yujia Li:
You -- based on the question --
>>: Whether the question is correct -- the answer to the question is correct
or not, you're going to label differently.
>> Yujia Li: No, you just label the nodes based on the question but not
based on the answer to the question. So the answer is the desired target
output.
>>: Well, but it's a binary classifier where sometimes where the question is
B has fear of H and sometimes the and is yes, sometimes the answer is no, or
is it what is the node for which B has fear of?
>>:
Yes.
This is more like a node selection problem.
>> Yujia Li:
Okay.
>> Yujia Li:
Yes.
>>:
So you know that B is going to have fear of one node.
And it's just saying which one is it.
>> Yujia Li:
Yes.
>>: During training time you know the -- you give the supervised target,
right?
>> Yujia Li:
>>:
You give that answer.
>> Yujia Li:
>>:
Yes.
Yes.
Is there any incorrect answers that --
>> Yujia Li:
>>:
Yes.
If it is the true answer.
>> Yujia Li:
>>:
Yes, yes.
No, there's no incorrect answers.
Okay.
>> Yujia Li:
It's purely supervised learning task.
>>: Yeah, so, for example, I forgot how many training instances there are
for task 5. I mean, a hundred or something like that? For this task.
>> Yujia Li:
Usually people use like a thousand examples.
>>: A thousand. Okay. So a thousand of similar examples, you extract the
same kind of has fear or, you know, these kind of relations out of those. So
for what -- suppose you have 1,000 training instances. Do you train 1,000
graph or just one single graph that simulate ->> Yujia Li:
>>:
1,000 graphs.
1,000 graphs.
>> Yujia Li:
Yes.
>>: Oh, that's [inaudible] certain case the test set doesn't have -- it's
missing [inaudible].
>> Yujia Li:
>>:
[inaudible] 1,000, you have to generalize that.
>> Yujia Li:
>>:
Uh --
Oh, okay.
It learns and learns --
>> Yujia Li:
It learns the transformation -- it learns the propagation.
>>:
One for each class.
No --
>>: There's only two classes, which is is and has fear.
just randomly initialized.
Everything else is
>>:
Well, basically the vector is [inaudible] never be updated.
>>:
Oh, I see.
>>:
The only [inaudible].
>>:
[inaudible] C, D, and E are.
>>:
Right.
>>:
[inaudible].
>> Yujia Li:
Yes.
>>:
Basically is fixed [inaudible].
>>:
He's updating the hidden representation as you go.
>>:
You're going to also learn the vector representation of [inaudible]?
>> Yujia Li: You learn all the representation for each graph. But those
node representations doesn't have to be the same across different ones.
>>: [inaudible] is that a vector, or is it just [inaudible] representation?
So when you have the graph B there. I mean, [inaudible] is the weights,
right? [inaudible] is a weight?
>>: So D is the recurrent state. But do you initialize it to be all 0s, or
do you initialize it to be random?
>> Yujia Li:
>>:
Okay.
We initialize to all 0s.
So it's just recurrent state that's initialized to all 0s.
>>: So this can be considered as a recurrent network [inaudible] this is a
recurrent network. So what you need to learn is the edge-specific transition
weights [inaudible] has fear with another parameter. So as would in
learning. And then the representation of each node are actually either
initialized as knowledge-specific things or just 0.
>>:
Okay, you don't [inaudible] vector.
>> Yujia Li:
>>:
No, no, that's not [inaudible].
If this was a test point, then, what would you do exactly?
>> Yujia Li: If this was a test point, then we will be given everything here
but not the final answer edge. Okay. So we have all those facts which we
use to build the graph, and then we have B which is what has been asked to
predict, what is a special node that's related to question, and we just
initialize the node's representation for B to be something special like 1,
put 1 in one or two bits. Now, for all the other nodes ->>:
That's what you do in training as well for [inaudible].
>> Yujia Li:
>>:
Right?
Yes.
Okay.
>> Yujia Li:
>>:
Yes.
Or not B, but whatever is the first --
Yes, yes, yes.
And then?
>> Yujia Li: And then -- and then we run the propagation process. And then
for here I think it's a node selection output. So we fit those node
representations into final node representations into this node selection
output model that will give a score for each of the nodes, and then we choose
the node that has the highest score as the output.
>>: So for these examples [inaudible] B has fear [inaudible] the last
sentence, B has fear of whatever it knows. So, for example, the output you
have [inaudible] nodes or something.
>> Yujia Li:
Mm-hmm.
>>: Like [inaudible] and then you need to -- in the training you need to
have the right prediction on the H.
>> Yujia Li:
>>:
Okay.
>> Yujia Li:
>>:
Yeah, yeah, yeah.
Yeah.
So, for example, here, if you have like B has fear of B and E is H?
>> Yujia Li:
Okay.
>>: Okay? So in this case your output would be like 2, like E and H, and
you would expect that E and H both will have I [inaudible] assume that you
have one output node.
>> Yujia Li: Yes. So this problem, in general, will have some ambiguous
answers if you have those kind of --
>>:
Multiple [inaudible] answers.
>> Yujia Li: Yeah, yeah. But in this dataset, I believe, there is only one
unique answer for each of those instances. So that's guaranteed by this data
generation process.
>>: So then you are not really coding H with a special symbol like you do B.
B is encoded with a special thing, 1, 0, for instance, and so on.
>> Yujia Li:
>>:
Right.
And H is not.
>> Yujia Li:
H is not.
H is something you will predict.
>>: So in training, then, how do you -- if this is a training point now, how
do you encode this information that B does fear H?
>> Yujia Li: So H will be used as the correct output target for this model.
And the output is a node selection output. We can think of it as like a
softmax over all the nodes, and that's your correct node. So you can think
about it as like kind of like a classification, multiclass classification
thing.
>>: But you could have also just used special value for H. In all these
cases, you could have just used a special value of H and then in the end you
just check which values in your graph have that, which nodes in your graph
have that particular vector that encodes fear of B.
>> Yujia Li: That's kind of like the node selection, right, you select a
node that matches that kind of pattern.
>>:
Yeah.
>> Yujia Li:
>>:
So it's saying [inaudible].
Kind of.
Yeah.
Okay.
>>: And one other problem that I see is that your positive instances will be
way less than the negative instances because negative examples are all
possible pairs almost, right, or N square, whereas the positive examples is
only one. So you have this skewed training data.
>>: What's your training classifier? If you have 10,000 labels, you have
one that's positive and 900 that are negative.
>> Yujia Li: It's the same as in the multiclass classification problem. For
each instance you just have one of, say, a thousand classes as your target.
>>: But the case here, you have -- you have balanced hopefully the -- I
mean, I'm just asking in your training [inaudible].
>>: It doesn't have any negative examples. The only negativity here comes
from the fact that the output is H and not [inaudible].
>> Yujia Li:
>>:
Yes.
[inaudible].
>> Yujia Li: Yes. Okay. So this is just a setup for this toy tasks. And
we did try to get the graph neural nets on four of the bAbI tasks. Three of
them are node selection tasks, and one is the graph level prediction
classification task.
And this model is able to solve all of them to a hundred percent accuracy
with only 50 training examples. And each -- for each of them, the model has
less than 600 parameters.
>>:
[inaudible] the T?
>> Yujia Li:
>>:
T I think for --
The time step that's [inaudible]?
>> Yujia Li:
I think.
Yes.
Yes.
I think for this we use something like five or ten,
>>: And this was fixes were all the tasks, or this was like tasks like
knowing the task you select T?
>> Yujia Li: You have to know roughly what size of the graphs you're working
on. But for all of those tasks, the graphs are fairly small and we -- I
think we used, say, ten for all of those different tasks and didn't tune out
much.
>>:
[inaudible] is 17 and 19.
>> Yujia Li:
>>:
I missed that.
You try four tasks out of 20.
>> Yujia Li:
>>:
Sorry?
Have you tried that?
Right.
Yes.
So but difficult how is 17 and 19 [inaudible].
>> Yujia Li:
It's not there.
17?
>>: So basically you tried it, it didn't succeed or you were not able to
[inaudible] the graph.
>> Yujia Li: Oh, we tried -- I think we tried 19. I'll talk about it later.
That's like making a sequence of predictions. Which is the most challenging
task.
>>:
Okay.
>> Yujia Li: But there are some other tasks. We didn't claim to solve all
the bAbI tasks because there are some tasks that are not naturally formulated
in this graph prediction framework. So we didn't try those. Question?
>>: So I guess I was going to ask a similar like what prompted you to choose
these four? Did you look at these four, decide they could probably be done
with graphs, and then you tried them?
>> Yujia Li:
Yes.
>>: And the other ones you don't think could be or you tried it and they
didn't work?
>> Yujia Li: They -- we -- we looked at those tasks as well. But some of
them -- some of them are not naturally -- does not nationally fit into this
framework.
>>:
Can you give an example?
>> Yujia Li: One example is the temporal reasoning. So it's a scenario
where you have a bunch of different people, like each person is an entity,
and you have facts like A moved to somewhere and picked up something and then
moved to somewhere else and then dropped that something there. And then at
the end you want to know where that something is.
So you have to reason about this temporal thing. And we thought about that
for a while and didn't figure out a way to encode this temporal thing in a
static graph. But, I mean, there are maybe some other things we could try to
make this framework suitable for those tasks as well, but we didn't push that
too hard.
>>: So are you saying that you don't see how to set the problem up in the
graph or you don't see how the graph could learn to solve the problem? Is it
that ->> Yujia Li:
the graphs.
It's more about we don't know how to set up the problems to use
>>: But maybe -- maybe it would work because you do have some notion of time
in your whole -- in your whole learning procedure. So it might be biased
anyways to take time into account.
>> Yujia Li: It could. Yes. We did thought about that and had some ideas,
but we didn't try too hard. So for this project, our -- when I did this, our
main goal was to solve this program verification problem. And we did all
this after we solved that problem to just demonstrate this model can solve
some other problems as well and it has great potential. So we didn't
claim -- we didn't try to even solve all those bAbI tasks.
Okay. So that's the gated graph neural nets model. And then we want to see
whether those results are really significant or not. So we decided to use a
standard RNNs/LSTMs as reference baselines. And those RNNs/LSTMs are trained
on token streams. The input is a sequence of tokens and the output is
another token for that example.
And the RNNs/LSTMs has like 5,000 parameters and 30,000 parameters each. The
training setup is you have a thousand examples for training and validation
and a thousand for testing. And the [inaudible] we start with only 50
training examples and then keep using more training examples until the test
accuracy reaches 95 percent or above. That's the protocol.
And here is the results. In the brackets we're showing the number of
examples needed to reach that level of accuracy. So for all those four
tasks, the gated graph neural nets model is able to reach a perfect accuracy
with only 50 training examples. But for RNNs/LSTMs, in some of the tasks
they really struggle and not able to get good performance even using all the
training examples.
>>: But if you look at this memory network paper for these all four tasks,
they got about 95, 98 or something. Maybe they're using some early ->> Yujia Li:
>>:
I don't think for all --
Did you get a chance to look at the result for memory network?
>> Yujia Li: I did. But I don't think they got like 90 percent for all
this. Maybe, I mean, there are different setups for this. Like some people
use a thousand training examples.
>>:
Oh, yeah, yeah.
>> Yujia Li: Some people use 10,000 training examples.
the difficulty of the problem a lot.
So that will change
And here I'm just showing the reference results from the paper that proposed
this dataset which used LSTM on a text input, which is not directly
comparable to ours, just as a reference point. They used like a thousand -all the thousand training examples and get this level of performance.
>>: How did you represent the nodes in the LSTM?
the symbols, because aren't they arbitrary?
Do you lay embeddings for
>> Yujia Li: Okay. So the nodes are not completely arbitrary. So across
the graphs, the entities still has the same name across all the graphs. So
if you use like J or A as a name for an entity, then in the next graph the J
is still the name for that entity. It's just the relation between those
entities are different.
>>:
Okay.
>>: Oh, so this means that you're initializing different examples from
previous instances from the same nodes using different previous examples,
right?
>> Yujia Li:
Uh --
>>: So in this case, like you have J that showed up in some example a couple
graphs ago. You would keep the same finite representation and use it when
you're initializing for the next example?
>> Yujia Li: For the RNNs/LSTMs, those information are encoded in the word
embeddings. But for our [inaudible] don't use any of those information at
all.
>>:
So you always start from 0.
>> Yujia Li:
Always start from 0.
Yes.
>>: Can you give a sense of how different the setup had to be for the four
tasks? I mean, the -- the tasks -- if you wanted to like just like write
code to solve the tasks, they are pretty simple to solve also, right? So
generally you want like a model that could solve the task without too much
change probably.
>> Yujia Li:
Right.
>>: So did you have to -- these four, to do these four tasks, is it like
really different, the way you set it up?
>> Yujia Li: The -- the only difference is -- let me see. The only
difference is when you want to use the graph, the gated graph neural nets to
solve those problems, the particular thing for each of those task is the
relations might be encoded using different words. So here use is as one of
the relations, but in the text task you might use some other words to encode
that relation. So you have to figure that out, which is fairly simple, and
that's just a transformation to transform that into a graph, which is quite
straightforward to do.
And then the next thing you need to do is to figure out how do you encode the
correct output. So for some tasks, you need a node selection output; for
some tasks you need the graph level classification output. So that's another
thing you need to do.
do.
>>:
And other than these two, there's not much you need to
Okay.
>> Yujia Li: So you just need to figure out the task-specific things, and
the model will learn to do that task by itself. Okay? So that's this. And
the few conclusions about the results. So we use symbolic format data which
does make the task easier, but still they're nontrivial. And we don't claim
that gated graph neural nets can beat RNNs/LSTMs because we use -- did use
more structures in the problems. But this shows that if there are structures
in the problems and if we can explore those structures, then this problem can
be made a lot easier to solve.
I guess I only used -- I already used one hour. I'll just speed up. So next
I'm going to talk about the sequential variant of this model which we call
gated graph sequence neural networks. Which comes from the background of
solving a sequence of predictions on graphs. So many problems require such
kind of sequence of predictions. Here I'm showing two examples. The first
is the shortest path example. The definition of path is a sequence of nodes
on the graph. So this is clearly a sequence prediction problem.
And the second one is the kind of problems that we see -- we would see in
this program verification problem. We want to analyze heap -- the heap
states and we want to know what are the components substructures in those
heap and we want to analyze, parse this whole graph into all those subgraphs.
And we will have a list of substructures in there, which is, again -- can be
formulated, the sequential prediction problem.
And for those sequence prediction problems, the prediction in each step can
still be made by a gated graph neural network, either a per node prediction
or a node selection prediction or a graph level prediction. But one thing in
particular that we have to take care of is we need to keep track of where we
are in the prediction process to make a sequence of predictions.
For example, for the shortest path, the node annotations we used in
initialization should be different for different prediction steps. Otherwise
at every step we just make the same prediction over and over again, which
doesn't make sense. And we have to somehow keep track of that.
For the graph structure parsing problem, we need to keep track of which part
of the graph have already been parsed and predicted and which part hasn't.
Otherwise, we'll just predict the same structure over and over again, which
is undesirable. And a solution we propose is to chain multiple prediction
steps up using node annotations. So the idea is to use the node annotations
as some kind of working memory that keeps track of the progress of the
prediction and carry that over from step to step.
And at every prediction step, we not only produce an output but also predict
the node annotations that will be used for the next step.
So here's the architecture for this gated graph sequence neural network
model. On the left-hand side -- on the leftmost side, you have the problem
specific node annotations, which is fit as the input to the model to do
initialization. And at every step, you produce an output using the opposite
gate graph neural network and then use another model to predict node
annotations which will be used in the next step. And those node annotations
act as memories that can be kept from step to step.
And one thing to note is that in the first step, the node annotations are
problem specific and usually interpretable. But from the second step, the
model can choose whatever to be used to put -- whatever to be put in those
node annotations. As long as it fits well with the problem. So the model
can choose what to put here and it doesn't have to interpret it at all. So
those things will act as memories that carried over from one step to the
next.
>>: I'm sorry, just to understand. So in this case you get [inaudible] each
[inaudible] step rather than [inaudible] and then [inaudible].
>> Yujia Li: Oh. For each of those prediction steps, you still run a
propagation process. And you run an output model to get an output.
>>: I see. So when you say one, you mean like running T steps and then you
take some output.
>> Yujia Li:
>>:
But then there are another T step which will be [inaudible].
>> Yujia Li:
>>:
Yes.
Yes.
I see.
>> Yujia Li: So it's like, for example, in the shortest path example, to
predict the first node in the path, you have to run the propagation process
and pass that through a node selection output to get a node. And then you
predict a second node on the path and you run the propagation and whatever
again to get an output.
So for this, we started with the bAbI task 19, which is path finding. What
it does is given a graph, you're asked to find the path from one node to
another on a graph. And the data is created in a way that there is guarantee
to be only one path connecting these two nodes. And this is the most
challenging tasks in all of the 20 bAbI tasks because almost all the
previously proposed approaches failed pretty badly on this task.
In addition to this task, we also created two other bAbI-like but a little
bit more challenging tasks. One is the shortest path task. So here it is
not guaranteed that there's only one path connecting two nodes. But we might
have multiple nodes, and the model is asked to select the shortest path among
all those different possible paths. And also a Eulerian circuit variant
task, which is even more challenging than those shortest path tasks.
And there's one thing missing to make a prediction on sequences because in
sequence prediction problems, usually for each input, the output sequence
might have different lengths. So we cannot just fix a number of predictions
for a particular input beforehand. So we have to learn to know how long the
output sequence should be.
And the way we did that was to learn a model to predict whether we should
continue a prediction or stop at a particular point. And the way we did that
was at each prediction step, in addition to the other prediction models we
have in the past, we add another separate output gated graph neural network
model to make a graph level binary classification prediction on whether to
continue or stop. So if this model predicts to continue, then we move on to
the next prediction step. If it says not to continue, we stop right there.
Then using this mechanism, we can learn to predict variable length output
sequences. And for the RNNs and LSTMs, we just keep -- add a special end
token to the data and just keep predicting tokens until an end token is
reached.
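A hedged sketch of this stopping mechanism, extending the loop sketched earlier; predict_continue stands in for the extra graph-level binary classifier and is a hypothetical name, not the actual API.

    # Sketch of variable-length sequence prediction with a learned stop signal.
    def ggs_nn_predict_sequence(graph, annotations, max_steps):
        outputs = []
        for _ in range(max_steps):
            node_states = propagate(graph, annotations)
            outputs.append(output_model(graph, node_states))
            # separate output network making a graph-level continue/stop decision
            if not predict_continue(graph, node_states):
                break
            annotations = annotation_model(graph, node_states)
        return outputs
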
Here are the results for those three different tasks. For all those tasks,
the gated graph sequence neural network was able to solve them to pretty
perfect accuracy with a small number of examples. But for those RNNs/LSTMs
baselines, they weren't able -- they struggled pretty much to solve all those
three problems. Okay.
>>: Yes.
>> Yujia Li: Yes.
>>: Have you done any kind of analysis, what kind of advantages you might
get in solving shortest path using this as a [inaudible] techniques. So I
realize it is a learning task, but is there any particular advantage in
solving shortest path or any of these graph based methods using this?
>> Yujia Li: By traditional techniques, you mean like --
>>: You know, like Dijkstra.
>> Yujia Li: Dijkstra's algorithm.
>>: Yeah.
>> Yujia Li: Okay. Yes. So maybe in terms of efficiency it won't have any
particular advantage over those like Dijkstra's algorithm. But one
particular advantage it has is that this is a general framework. It's not limited
to solving shortest path problem. It can also be used to solve other types
of problems that you might have some idea about what the task should be and
you don't have to specify an algorithm to compute that. You can just give it
a bunch of data and ask the model to learn it. And you can use it not only
to this problem but also other problems as well by using the same framework.
So I think it's a more flexible --
>>: That's the framework. But in terms of just the shortest path task.
>> Yujia Li: Just shortest path task, no, it doesn't have any particular
advantage over Dijkstra's algorithm. Yes?
>>: But do you think that this could learn second shortest path? That seems
much more difficult. And yet that would be straightforward with some other
algorithms [inaudible].
>>: [inaudible].
>>: Seems very difficult.
>>: Use this for [inaudible].
>>: You can do K shortest paths.
>>: No, I mean -- but can you learn it.
>>: Yeah.
>> Yujia Li: That's just interesting. I don't know. I mean, we haven't tried. But --
>>: But shortest and longest I think I understand how it would learn it, but
the others seem [inaudible].
>> Yujia Li: Maybe. Yeah.
>>: So do you do -- because presumably to do shortest path, before you put
the first thing in the sequence, it already has to know the entire shortest
path, right? So do you unroll it for more for the first one and then very
short for other remaining ones?
>> Yujia Li: In this particular task, we unrolled it for the same number of
steps for all the prediction steps.
>>: But you have to measure before the output of the first one, or would
have to know the entire solution, right, otherwise it can't really make an
accurate --
>> Yujia Li: Presumably, yes.
>>: So really it's just kind of like you could just -- you could just unroll
it a bunch of steps and then just have it be outputted every -- you know,
have every -- have all its [inaudible] like without any unrolling after that,
right?
>> Yujia Li: Presumably, yes. We can even imagine a solution that just runs
one huge propagation network but spits out output at different time steps.
That might also be possible.
>>: But generally for any of these things it can't really -- it can't really
start outputting things until it knows the entire answer for any of the
problems that we actually care about, right, which it might be these are real
problems, right? It can't start making decisions until it knows the entire
correct answer. So you could imagine like doing the propagation and then
feeding it into just an LSTM and then just running like a sequence to
sequence style model where you basically take -- take -- where it somehow
figured out all the answers that it needs and then it can just, you know,
easily spit it out one by one, right?
>> Yujia Li: That might be a solution. But for some problems, the -- the
solution does not necessarily have to be kind of a global solution. You can
make solutions locally at every step. Then you don't have to --
>>: Are there any problems that it actually [inaudible] for?
>> Yujia Li: For example, if you want to traverse all nodes on the graph,
right, you just do -- do it step by step and go to the next node at every
step. If your graph is just a cycle, which is very simple toy example, but
even this simple toy example, some of the models like RNNs/LSTMs won't be
able to solve it.
>>: Let me ask you, have you tried solving TSP using this?
>> Yujia Li: We haven't. Those aren't --
>>: But that would be interesting, right? That's NP-hard. And if you'd
show any kind of improvement, any kind of -- so that would be pretty
interesting to see.
>> Yujia Li: I can imagine these models may be able to find some kind of
approximate solutions without any guarantees.
>>: So some sequence models have been done for TSP. We had -- not like great
success, but it was [inaudible] better probability.
>> Yujia Li: I hope so. Yeah.
>>: Given the encoding structure.
>> Yujia Li: Yeah. Okay.
>>: I have a question on the slide you showed [inaudible], so can you go back
to the -- yes. So you have this node annotation block. So what is this one
doing? What's the -- so basically there are two steps. What's this
[inaudible]?
>> Yujia Li: This is just another gated graph. No, not -- it computes
the -- computes what will be needed to -- what will be needed in the next
step as the [inaudible].
>>: Just like that into this whole [inaudible].
>> Yujia Li: Yes. It's nothing fancy. What it -- the only thing that's
particular about this is that it uses a per-node prediction, because you have
an annotation for each of the nodes. But it's just a standard gated graph
neural network block like the other blocks. Okay.
So I guess I have to wrap up soon. So in the end let me just briefly talk
about the motivating application which is for program verification. Which is
actually a very interesting problem. Before I did this project, I didn't know
anything about program verification.
So what is program verification? So program verification is about verifying
the correctness of a program, given inputs that satisfy some preconditions,
and we want to tell whether running the program is guaranteed to produce some
outputs that satisfy the post conditions. If the program guarantees that, for
any input that satisfies the precondition, it will reach the post condition
after it executes, then we say that this program is correct. Okay?
And we need to formally describe what happens during the execution of the
program. And the challenging part is to analyze the heap memory state, which
is a heap graph. And we need formal descriptions of the heap memory, using
some sort of logic formulas; here we use separation logic formulas, which is a
popular type of logic used to describe those heap memory
states. And we need those formal descriptions to be able to reason about the
correctness and strictly prove whether the program is correct or not.
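For concreteness, a typical separation logic description of a heap holding a null-terminated singly linked list reachable from a program variable x might look like the following. This is a standard textbook-style formula, not one taken from this talk's dataset.

    x |-> y * ls(y, NULL)

Here x |-> y asserts a single heap cell at x whose next pointer is y, ls(y, NULL) asserts a list segment of cells from y to NULL, and the separating conjunction * asserts that the two parts occupy disjoint regions of the heap.
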
>>: This problem has a standard solution already, huh?
>> Yujia Li: Huh --
>>: [inaudible] I think it's the logic, is the one.
>> Yujia Li: So people always use logic to solve this verification problem --
>>: [inaudible].
>> Yujia Li: So a little bit of history about this. So program
verification, it always uses logic to represent programs and try to go from
the precondition to the post condition. But in the past, people were
developing some heuristics to describe what's happening in the heap memory
states and try to match one of those kind of predefined heap memory state
patterns to the logical formulas and stick that into the verification
pipeline.
But more recently people are developing machine learning algorithms to map
those heap memory states to logical formulas and use that to be able to
verify more complicated programs, because heuristics only work for very
simple programs.
>>: Okay, so this is a new -- kind of new trend in this field?
>> Yujia Li: Yes.
>>: [inaudible] okay.
>> Yujia Li: Yes.
>>: Good to know that. So what community of [inaudible] not machine
learning [inaudible] paper like that?
>> Yujia Li: Yeah. This is a totally new application domain for machine
learning. But people in the software engineering and program verification
community, they care about this. So one of my --
>>: Maybe you should talk to Sumit Gulwani. Sumit Gulwani has been working
on this.
>> Yujia Li: Okay.
>>: So maybe you get a chance to catch up with them.
>> Yujia Li: Yeah. So I worked on this because one of my mentors at
Cambridge works on this. And my other mentor at Cambridge works on machine
learning. So they decided to do a joint project. That's how I started
working on this.
Okay. So the solution is once we come up with the separation logic formula
to describe the graph, we can put that logical formula to a theorem prover
which will verify if this formula is indeed consistent with the program and
if it is indeed strong enough to complete the proof.
Okay. And the theorem prover will help us to do that. And the whole
verification pipeline looks like this. So we have some input program and we
run it multiple times, potentially with different inputs, and stop at
different times to get snapshots of the heap memory, which will give us a
bunch of graphs. And then we have this machine learning pipeline that maps
those heap memory graphs to logical formulas that are consistent with all
those graphs and describe all the substructures in those graphs.
And this is where the gated graph sequence neural network comes in. And then
those logical formulas are fed into a theorem prover, which tells us whether
this is good enough or not. Okay.
And the key step here is to map heap graphs to separation logic formula. And
separation logic has a very particular, a very specific grammar. And when we
make predictions, we just follow the grammar and do the prediction step by
step. And every step is either a graph level classification or a node
selection. Because in the grammar, every step expands a nonterminal node
into something else. And this something else can have many different
branches. And choosing among those branches is like a graph
level classification problem.
And in other cases, you'll expand [inaudible] node into one of the variables,
which will be a node selection output. So that's how we predict those
separation logic formulas.
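A rough sketch of how such grammar-driven decoding can be organized follows. The grammar object and the two prediction functions (choose_branch for graph-level classification, select_node for node selection) are illustrative names under assumed interfaces, not the actual system.

    # Expand the separation logic grammar one nonterminal at a time; each
    # expansion is either a graph-level classification over the possible
    # branches or a node selection that picks a program variable.
    def decode_formula(graph, annotations, grammar):
        stack = [grammar.start_symbol]
        tokens = []
        while stack:
            symbol = stack.pop()
            if grammar.is_terminal(symbol):
                tokens.append(symbol)
            elif grammar.is_variable_slot(symbol):
                # node selection output binds a program variable
                tokens.append(select_node(graph, annotations))
            else:
                # graph-level classification chooses which branch to expand
                branch = choose_branch(graph, annotations, symbol)
                stack.extend(reversed(grammar.expansion(symbol, branch)))
        return tokens
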
And the results here, we compared the gated graph sequence neural network
model with an earlier approach using heavily hand-engineered features using
domain knowledge by domain experts combined with standard classifiers. The
dataset has 160,000 heap graphs generated from 327 separation logic formulas.
And the gated graph sequence neural network model is able to achieve close to
90 percent accuracy without any hand-engineered features, which is even
slightly better than the previous approach. So here --
>>: What's the supervised signal?
>> Yujia Li: The supervised signal is, given a graph, you know what the
correct formula is.
>>: And so it's actually trying to produce the formula. It's not trying to
produce like whether this crashes or not. It's just saying what is the
logical formula.
>> Yujia Li: We just predict the formula.
>>: Okay.
>> Yujia Li: And the formula is fed into the theorem prover that completes
the proof. Okay. And here the accuracy is actually a very high accuracy
because we only count it as correct when it predicts the whole sequence
correctly. So the whole sequence might have many different steps. And to be
counted as correct, you have to make all those steps correct.
>>: Where did this label come from? Is it human created?
>> Yujia Li: So this is the program verification problem. So we can start
with some simple programs for which we know the corresponding formulas.
>>: But there's no -- there's no such general system to create a general
program?
>> Yujia Li: For a general program, no.
>>: Okay.
>> Yujia Li: That's the problem we want to solve. And people don't know how
to solve those problems for complicated programs. But we start from a set of
programs that we think are interesting and we can generate data from them
because we know the answers to them, and then we train this algorithm. And
then we can use this algorithm to predict formulas for some other programs.
And that's how we generalize to other programs.
And here are some more qualitative results. In addition to applying it
to this dataset, we have also successfully integrated this system into a
program verification pipeline, and it can successfully verify a test suite of
list manipulating programs in a benchmark set which just contains some very
simple list-manipulating programs like traversing a list, concatenating two
lists, copying a list into another, and things like that. And those are the kind of
logical formulas found by our model.
So those logical formulas don't look very impressive. But our model is
actually able to predict more complicated formulas than shown here. And the
reason we didn't do it on more complicated programs is because at the time I
did that, the other parts of the program verification pipeline wasn't ready
to make -- to do program verification for more complicated programs.
>>: Is the input a bunch of snapshots of the heap of the same program and
the output is a logical [inaudible] or is it, for a given snapshot of the heap,
produce a logical form for that snapshot?
>> Yujia Li: Yeah. So that's -- that's one thing that I didn't talk about. So
actually when we applied it, we run the same program and get a bunch of
snapshots for that program at different times with different inputs. And
then we mapped a bunch of snapshots into one single formula that's
consistent --
>>: Okay. So you take your program as a particular logical [inaudible]
generated with a bunch of snapshots of that program.
>> Yujia Li: Yes. Yes. So it's kind of looking at the execution traces of
this program and try to see what it's doing by analyzing the heap graph.
Kind of like an induction process. But it's not guaranteed to always be
correct because if you only have a bunch of snapshots that doesn't cover some
cases, this will not be a correct formula. But the theorem prover will tell
you whether this is correct or not. If it's incorrect, it will give you a
counterexample that's added to the set of snapshots and you will make a
prediction again until it reaches a satisfactory status. Yes.
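Roughly, the refinement loop just described could be sketched like this; predict_formula stands in for the gated graph sequence neural network model and theorem_prover for the prover, and both names are illustrative assumptions rather than the real interfaces.

    # Counterexample-driven refinement: predict a formula from the current
    # snapshots, check it with the theorem prover, and if it fails, add the
    # returned counterexample heap to the snapshots and predict again.
    def verify_with_refinement(program, snapshots):
        while True:
            formula = predict_formula(snapshots)
            ok, counterexample = theorem_prover.check(program, formula)
            if ok:
                return formula  # consistent and strong enough to complete the proof
            snapshots.append(counterexample)
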
>>: So are you going to talk about how you applied your model to solve the
problem, or was this -- I mean, you already showed the results. I was just
curious, how did you -- how did you take a set of snapshots and output one
formula?
>> Yujia Li: Yeah. I didn't plan to talk about that. But one simple -- I
can give you some intuition. For example, if you want to choose a node, and
then you know that in all those graphs, there are some of the nodes
associated with a particular variable. Right? Because when you define
programs, you have some variables that are associated with some kind of
pointers. And then when you predict a node, it chooses one of those
variables. And for each variable, it has a -- it has one node in each of
those graphs. And then you accumulate all the scores for all those nodes to
be the score for that variable. And then you choose one of the variables
from there.
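A small sketch of that aggregation, with hypothetical names: node_score gives the node-selection score of one node in one snapshot, and snapshot.node_of maps a program variable to its node in that snapshot; both are assumptions for illustration.

    # Accumulate each variable's node-selection score across all heap snapshots,
    # then pick the variable with the highest total.
    def select_variable(snapshots, variables):
        totals = {v: 0.0 for v in variables}
        for snapshot in snapshots:
            for v in variables:
                totals[v] += node_score(snapshot, snapshot.node_of(v))
        return max(totals, key=totals.get)
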
>>: Okay.
>> Yujia Li: It's a bit --
>>: You did some aggregation over all the snapshots as you're constructing
the formula.
>> Yujia Li: Yes. Yes.
>>: Okay.
>> Yujia Li: And here is a more complicated example with nested data
structures, just showing that the model is able to predict something more
sophisticated. So it has a list of lists and a tree of a special kind of
list where it ends in a cycle, and the model is able to predict all those
nested structures correctly.
To wrap up, there are a few future directions. We would like to explore the
model space a little bit more and further understand what's going on in this
model. And we think this is a very general and promising framework, and we
want to apply it to some other applications. And another interesting
direction is to learn to construct the graph.
With this, I would like to conclude, and I'll be happy to answer any other
questions that you have.
[applause].
>> Yujia Li: Yes.
>>: So I have questions regarding other applications. I would assume that
it's pretty easy to turn this into a technique that works on weighted graphs,
not -- not graphs like this, but probabilistic graphs where you have a
relationship between A and B which is probabilistic, like 90 percent of cases
A does fear B. And I think that's pretty easy to incorporate in this. And
then with that, really a lot of different applications might open up.
>> Yujia Li: Yes.
>>: So I'm kind of curious why you went for these particular ones and not
more traditional machine learning applications. So what are the other
applications you have in mind?
>> Yujia Li: One example I have in mind is to use this as kind of a
prediction model for any problems that have graph structures. Like
traditionally people use graphical models for a lot of prediction problems on
structured data where the kind of edges or factors connect different
variables in the graph. But -- and usually you want -- once you learn the
model, you need to run an inference procedure to make predictions. But the
inference procedure, if it's based on like message passing or whatever, then
it's just another iterative propagation process on the graph. And we can
totally replace that with this graph neural network and use that directly to
make predictions which have complicated output structures.
And the benefit of using graph neural nets is that it doesn't have the
intractable -- intractability problem for a lot of graphical models, because
as long as you have moderately complicated potential functions defined on the
graph, then the inference is intractable. But for this, you can add
arbitrary connections between those and still be able to get a good model
that's trainable with back propagation.
>>: So you're talking about MRFs?
>> Yujia Li: Yes.
>>: Like segmentation problems?
>> Yujia Li: Yes.
>>: Multiplication problems?
>> Yujia Li: That's -- that will be an interesting application for this.
>>: So like even this is not guaranteed to [inaudible] approximations even
in message passing for [inaudible] graphs. But can you guarantee anything
about the prediction you make, the predicted solution you make here?
>> Yujia Li: I'm not thinking about guaranteeing anything, which is kind of
common for any neural network models. It's just learning input/output
mappings. But it doesn't have some of the restrictions of the graphical
models.
>>: But do you have a [inaudible] model as well?
>> Yujia Li: I do. But the propagation model doesn't have to follow the
probabilistic formulation as in a lot of probability [inaudible].
>>: [inaudible] some sort of parameters for which you might also like and
you don't know that?
>> Yujia Li: It's possible. Yes. But I think this is still a more flexible
and powerful propagation process because you can use any arbitrary
nonlinearity, you can parameterize in any arbitrary way but it's still able
to train.
>>: [inaudible] propagation and you can just keep running it, right? And,
of course, no guarantee, but still going to give you something useful. And
that's all you're saying, right?
>> Yujia Li: I think it's more powerful than belief propagation. It can
[inaudible].
>>: [inaudible] propagation or not the entire prediction pipeline, but just
the propagation piece I think probably -- you know, or initially it will have
some other issue as graphical networks, you know, and it's part of the -- I
mean, I don't see how you -- how this -- this framework alleviates those
problems.
>> Yujia Li: Uh --
>>: [inaudible].
>> Yujia Li: Yeah, it might still have some of the problems, but I still
think --
>>: The related question is mathematically it's similar to belief
propagation, because it's just that instead of having factors and computing
messages based on these factors, you are directly learning the messages.
>> Yujia Li: Yes.
>>: So have you thought about that? Have you compared things? What other
sort of things -- why do you -- like what's your intuition on that, the
differences there?
>> Yujia Li: We haven't tried this, first of all. The thing that's pretty
appealing to me is that this kind of approach can directly learn the
propagation behavior, instead of deriving the propagation behavior from
optimizing some objective function. This is very appealing to me, but I
don't have a good intuition about what this approach -- I mean, I know
this -- this approach can do whatever a belief propagation or whatever
approximate inference algorithm can do by simulating that, defining a
particular graph structure and particular nonlinearity functions, but it can
also do much more. It's not restricted to be within the probabilistic
framework. But other than that, I don't have a good understanding about what
this cannot do, what kind of problems this will face if we apply directly in
that way.
>>: And then regarding your results where it showed that you can beat RNNs
by using many fewer data points, because you're using some structure --
>> Yujia Li: Yes.
>>: -- that's previously known. But I'm wondering if there is -- if you
could also show that even if you -- even if using the same data,
regardless -- well, actually you are using the same data in this case. But
even if you're working on a dataset which is not structured like this but
pure text, for example, and trying to generate text, then you could first go
into this text and use traditional techniques to find some sort of
potentials, like this word appears before that word, things like that, and
initialize your graph and use that as initialization for your RNN, would you
then end up getting better results in that case too? What's your intuition
based on all these experiments about giving the local minima [inaudible]?
>> Yujia Li: So, first of all, I think it should work if you have some kind
of knowledge graph. If you want to generate things from that --
>>: What I meant was you generate a knowledge graph by some simple
statistical procedures from pure data. And then you use that to initialize
your RNN.
>> Yujia Li: Yes. I mean, that's kind of providing you with some extra
knowledge. And I think utilizing that won't make your performance worse at
the least.
With regard to the local optima issue, I think this kind of model is very
structured. So it kind of provides you a very good direction to look for.
Compared with a standard RNN that's not that restricted.
So this is -- gives you a better search space basically, and it's easier to
get you a good solution in this more structured space. But in the
unstructured space, there are many different ways for it to get trapped. So
that's my intuition behind this, but I don't have any theory to support my
claims.
And usually if we have the right structure for this, for some problems, then
training this graph, gated graph neural networks would usually find a good
solution in very few iterations. And you would see the error drop from a
hundred percent to 0 percent immediately -- like, in a few
iterations. So that's the kind of behavior I was observing.
>>: We have time for one more question.
>>: [inaudible].
>> Yujia Li: We may be able to use a kind of 19-by-19 grid graph and do
propagation on that to, like, give you candidates maybe. Yeah.
>>: Let's thank Yujia.
[applause]
>> Yujia Li: Thank you.