>> John Platt: So I'm very pleased to present Pedro Domingos, who is a professor at the University of Washington. He's here with his student Robert Gens, also from the University of Washington, and we've had a long relationship with them. Many of his former students work here at MSR and he gave a really intriguing talk at the International Conference on Learning Representations, so I invited him to come and tell us updates on that work.
>> Pedro Domingos: All right. Thank you all for being here. Thanks to John for bringing me. I will try to make this worth your while. What I thought I would talk about here is a highly speculative new research direction that we are pursuing. It's called symmetry-based
learning and this is work that I have been doing with Rob Gens who is sitting right here and is
going to be starting an internship here next week, and Chloe Kiddon. Fair warning, this is work
in progress, so a lot of it is not very mature yet. This is the fanciest slide in the whole talk.
[laughter]. We do have some experimental results that we will show, but I think that some of
this is definitely very much still in the idea stage. Here's what I'm going to try to do. I'll start
with a little bit of motivation. Symmetry-based learning is learning that takes advantage of
symmetry group theory, so I'll first try to say why that might be an interesting thing to do. Then
I will give a very brief background on symmetry group theory for those of you not familiar with
it. Then I will talk about two main things, two applications of symmetry group ideas. The first
one is deep symmetry networks, so this is stuff that Rob has been working on that we just
submitted to NIPS. This is an application of symmetry group theory to generalizing things like
convolutional neural networks to allow for more symmetry than just translation symmetry, which
is what a ConvNet has. Even more speculatively, I will talk about symmetry-based semantic parsing, which is what Chloe Kiddon has been working on, where what we try to do is solve some of the long-standing problems in semantic parsing, which is the problem of going from sentences to their meaning, using ideas from symmetry group theory. Then if time allows I
will conclude with a little bit of discussion, but please feel free to ask me questions at any point.
Here's one way to motivate this. You could argue that the central problem in machine learning is learning representations. We know how to do a lot of things very well, but what is limiting what we do is that the representations essentially have to be predesigned. If they aren't, we don't really know how to learn them very well, and this is what is really limiting the power of
our learners. If we could solve the problem of learning representations, then we would have
really powerful learners. Imagine what those could do given what the ones that we have today
can do. This, I think, a lot of people would agree with, at least that this is an important problem to look at, wherever exactly it falls in the ranking of important problems in machine learning. What I would like to suggest here is that if what we want to do is learn representations, the very natural foundation for that is symmetry group theory. A number of people over the years have applied symmetry group theory in various ways in machine learning, and also in vision a lot, which is where some of our applications are, but I think we could actually take that a lot further than we have so far. As a very simple introductory example, let's think about
symmetry in geometry because usually when people have intuitions about symmetry they come
from geometry. What is symmetry in geometry? The symmetry of an object is a transformation
that maps the object to itself. For example, a square has eight symmetries, rotations of 0, 90, 180
and 270 degrees and reflections on these four axes that we see here. One interesting property of
these symmetries is that we can compose them and whenever we compose them we just get more
of the same symmetries. For example, I can compose a rotation of 90 and a rotation of 180 to get one of 270. I can compose two reflections and what I get is a rotation, and so on and so forth.
Symmetries can be composed, right? Another interesting thing that happens is that there is an identity symmetry. The identity is the transformation that doesn't do anything, namely leaves the square the same; in this case the identity in this set of symmetries is going to be the rotation of 0 degrees. The next interesting thing that happens is each
symmetry has an inverse, so if I rotate by 90 and then I rotate by 270, which is the same as -90, I
get back my identity. I can compose symmetries and the composition is associative.
Generalizing from that, a symmetry group is a group and the group follows the group axioms
which are -- in general, we have a set, whatever it might be and then a binary operation on that
set, let's call it product, but it could be anything else. The group has four properties. The first
one is closure: when I do x · y, if x and y are in the set the result is also in the set. Identity means there is an element e such that when I dot it with anything, before and after, I get that same thing, so it's an element that doesn't change anything. Inverses means that for every x there's an x⁻¹ such that when I combine them I get the identity element, and associativity is the obvious thing. A lot of things are groups; a lot of things that we deal with every day are groups.
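Stated compactly (just restating the four properties above, for a set G with operation ·):

```latex
\begin{aligned}
\text{Closure: }       & \forall x, y \in G: \; x \cdot y \in G \\
\text{Identity: }      & \exists e \in G \text{ such that } \forall x \in G: \; e \cdot x = x \cdot e = x \\
\text{Inverses: }      & \forall x \in G \ \exists x^{-1} \in G: \; x \cdot x^{-1} = x^{-1} \cdot x = e \\
\text{Associativity: } & \forall x, y, z \in G: \; (x \cdot y) \cdot z = x \cdot (y \cdot z)
\end{aligned}
```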
For example, the real numbers with addition form a group. And a lot of things follow from just
these group axioms. Symmetry groups are a particular type of group, but they are a particularly
powerful type of group. What happens in a symmetry group is that the group elements are
functions. The group elements are no longer, in some sense, the objects; they are functions that we are going to use to operate on objects. The group operation is function composition, so the operation is composing functions. Symmetry groups can also be continuous. The symmetry
group that I just showed you was a discrete one. There were eight symmetries of a square, but in
general, I could have continuous symmetry groups. Those are usually called Lie groups in honor
of the Norwegian mathematician Sophus Lie who introduced them. Again, the standard example
is the symmetries of a circle. A circle can be rotated through any angle and reflected on any diameter, so now there's an infinite number of symmetries and they form a two-dimensional space. Getting on to more interesting things, Euclidean space itself has symmetries, for example, translations, rotations and reflections. If I take, for example, the Euclidean plane and translate it, what I get at the end of the day is still the Euclidean plane, or rotate it or reflect it, et cetera, et cetera. These symmetries have the property that they preserve the distance between any pair of points. Euclidean transformations preserve distances between points, because if they don't, then they are not part of the Euclidean group anymore. Let's try to see
more generally what symmetry is and why it would be interesting to us in machine learning.
More generally, a symmetry of a function -- let's think of symmetries of not objects like squares
or circles or even Euclidean planes. Let's think of symmetries of a function. Symmetry of a
function is a change in the input that doesn't change the output. Here's my function. It takes x, you know, to f of x, and what I'm saying is, let's say that I first transform x by s, so s is some function of x. If for all x, f of s of x is the same as f of x, then s is a symmetry of this function.
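In symbols, s is a symmetry of f exactly when

```latex
\forall x:\; f(s(x)) = f(x).
```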
Sometimes people talk about symmetries of objects and symmetries of functions, and the two are really related, because a symmetry of a function is a symmetry of the objects that you apply it to, and the symmetry of an object really has to do with a function that is being preserved: what is the property that is being preserved? But for our purposes I think it's useful to think of it in terms of a symmetry being a property of a function. If this is the case, then it almost jumps out at
you that this is going to be relevant to us in machine learning because we are interested in
functions and transformations of the examples that go in them and so forth. In this general guise,
the symmetry… Yeah?
>>: You mean in your example it was just [indiscernible] speaking, for example the square? Is it for any x or for a single x?
>> Pedro Domingos: It's for a particular function.
>>: And then any x, right?
>> Pedro Domingos: Yes. One way to think about this is that the function is your object. You
can have symmetries of any object. The object could be a square. It could be a function. For
example, in physics it can be the Lagrangian. Now, physicists do everything with the Lagrangian, so now we're interested in symmetries of the Lagrangian, and it turns out that the symmetries of the Lagrangian are conservation laws, things like conservation of energy, et cetera. In fact, symmetry group theory pervades physics: relativity, quantum mechanics, the Standard Model. Just about everything in physics these days, you know, down to string theory and
supersymmetry, these are all theories based on or can be formulated very elegantly in terms of
symmetry. So symmetry is very pervasive in physics. It's also very pervasive in mathematics, so
symmetry group theory is one of the most intensively studied areas in mathematics and part of
the reason is that it shows up everywhere. You go to almost every branch of mathematics and
you will see proofs of theorems that are based on symmetry. Just to take a random example, think
of game theory and the Minimax theorem. The most elegant proof of the Minimax theorem is
using symmetry. It also appears in things like optimization and search, so like getting closer to
machine learning. When we're searching, if we notice symmetries in the space that reduces the
size of the search space. For example, if I'm trying to figure out how to play tic-tac-toe and I realize that all symmetric board positions are the same, then that actually reduces my branching factor and makes things a lot more efficient. In optimization there are ways to take advantage of symmetry. It's used, for example, in model checking. A lot of people do software verification, hardware verification; they have these very large systems to check and, again, if you can detect symmetries -- they don't necessarily call them symmetries, but that's what they are --
things can become much more efficient. It has made, of course, a lot of appearances in vision, because in vision there are a lot of obvious symmetries, from texture to things like the rotational symmetries of rigid objects and so on and so forth. It has also of late made a lot of
inroads into probabilistic inference which is actually how I first got interested in it. If you have
very large probabilistic models, inference in them is very hard, but if those models have a lot of symmetries, meaning, for example, you have a factor graph and a lot of factors are copies of each other, as you have when the factor graph comes from a Markov logic network, then you can actually use the symmetries to have a cost of inference that is only proportional to the number of these aggregated objects that you have, as opposed to all of the individual variables. A lot of ideas here, it turns out, can be made much simpler and more general by casting them as the problem of finding and exploiting symmetries in your graphical models and so on. However, in machine
learning symmetry has made some appearances, but not that many. Some of the appearances I think are actually quite remarkable, but when people think of important topics in machine learning, symmetry group theory rarely comes up. However, if you think about the definition of symmetry, it kind of jumps out, at least for me, that this is going to be relevant to us in the following way, or at least this is one way. So remember, a symmetry of a function is a change in the input that doesn't change the output. If the function is a classifier and the change in the input is a representation change of the examples, now we have the representation learning problem cast as a symmetry problem. Now your function is a
classifier; f of x is the class of x. x is your example. It could be an image. It could be whatever you want, the description of a customer or something, and your symmetry is a representation
change. It's taking your examples and changing them to some other representation such that the
function applied to the new representation still has the same value as the function applied to the
original one. The problem of learning with representation change seems to map very nicely to
this definition of symmetry. To make the point more generally, what is really the goal in
representation learning? The problem that we are always faced with in machine learning and in
particular in vision and speech and language and whatnot, is that there are a lot of variations in
the data that obscure the thing that we are trying to get at, variations in pose, variations in
lighting, variations in all sorts of things, camera parameters and so on and so on. If we could
somehow get rid of all those variations, then the thing that we really wanted to get at, like, for example, the class of the object -- is this your grandmother or is this not your grandmother? -- would become much easier to learn, where probably at that point a very simple classifier would do the job quite successfully. If you think of this as the goal, we want to separate out the important variations from the unimportant ones. Then your important variation is the target
function. It's like what you are trying to predict. Is this a chair or is this a table? Is this a cat or
is this a duck? If the important variation is the target function, it's this f that we are trying to
learn here, then the unimportant variations are the symmetries of the target function. They are the things that I should be able to change in my input without changing the output. For example, I should be able to take my object and rotate it, and if it was a chair before, it would still be a chair after. And if I bring it in closer or distort the
image in various ways, I should still be okay. Things like pose invariance, lighting invariance
are symmetries of that which we are trying to recognize. The same thing that I'm doing here
with the vision example, you can probably take any domain and think of what the variations are
and then you can think of them as what you want to get rid of and then you can think of them as
symmetries. Now if we do this, if we exploit symmetry in order to do representation change in
order to learn classifiers, I think there's a whole series of benefits that we get out of it. One very
important one and in some ways the most obvious one is it reduces our sample complexity in the
same way that in search and in optimization, recognizing symmetries reduces the size of the
search space. One way to think about it is the curse of dimensionality can be overcome by the
blessings of symmetry. Symmetry reduces the size of your search space. The same thing
happens in machine learning: symmetry reduces the size of your instance space. You can have a very high dimensional space, but if you can fold it all using symmetry back down to a low dimensional space, then in that low dimensional space the same number of samples goes a much longer way. The sample complexity that you need, the size of sample that you need to ensure generalization and so forth, goes down, potentially a lot. This is one benefit of exploiting symmetry. Another one, which we are going to see here shortly, is that we can take a
lot of our learning algorithms that we know and love and come up with more powerful, more
general versions of them in a very straightforward way. This is a way to take learners and make
them more powerful. We can probably get new formal results out of this as well and maybe
even a new kind of formal results. This is still a speculative notion, but this I think is another
one. Here's another interesting one. Deep learning is very successful and very popular these days. Symmetry-based learning suggests another way to do deep learning, which is to do deep learning by composing symmetries. If you think about what the successive layers of a deep network are doing,
for example in a ConvNet what you are doing is getting successively higher degrees of
translation invariance. What is that doing? It's like composing small translations into larger
ones. We can take the same idea and make it more general. Yeah?
>>: Do you think of symmetry as something that is global, or is it quantifiable? For example, what I mean is, say you are trying to recognize figures, so with rotations, some small degree of rotation is a symmetry, but if I were to rotate by 180 degrees it might break.
>> Pedro Domingos: Precisely.
>>: [indiscernible] this direction?
>> Pedro Domingos: No. Of course. I think there is both a direct answer to your question and a more general one. The direct answer is of course. This is part of what we're trying to learn. For example, a 2 or a 6 is invariant under a 3-degree rotation. It's not invariant under a 180-degree rotation. On the other hand, a zero is invariant under a 180-degree rotation, so this is what we want to learn. We're not just going to assume that certain things are globally invariant, because they almost never are. There might be some, and those are actually going to be easy, right? But what we want to do is we want to figure out where in symmetry space our objects can be.
>>: [indiscernible] in group theory, in proof theory, there is no notion of [indiscernible]…
>> Pedro Domingos: Of course. Yes. Which brings me to the larger answer which is, which
I'm going to elaborate on a little bit later, which is there's I think a reason why symmetry group
theory even though it's potentially relevant to machine learning hasn't had a big impact yet,
which is that symmetry group theory by itself actually doesn't solve any problems. It has to be
combined with other things like, for example, distributions over symmetry space and
approximate symmetries and a whole lot of stuff. But we already know how to do that stuff, and there's a lot of people who know how to do symmetry group theory very well. What we need to do is combine the two; symmetry group theory by itself is not going to solve the problem for us.
>>: So why is something with symmetry [indiscernible] invariance which [indiscernible] let's
say [indiscernible] and those are interesting problems, so where does symmetry, so you put a
restriction by saying it [indiscernible] symmetry. What does [indiscernible]
>> Pedro Domingos: Yeah. Fair point, but as usual what happens is that if you don't make any
restrictions you can't do anything. You have to make restrictions meaning like some
assumptions. You have to make some assumptions. Another question that you have to ask
yourself is, if I make these assumptions, how much does it buy me? You lose something by
making these assumptions and the argument that you should always make if you are going to do
it is well, I don't lose a lot and I gain a lot or I gain a lot more than I lose. I will definitely not
say that symmetries will do everything for you. What I'm saying is there is a lot of mileage that
we can get out of them and a lot of machinery that actually we can exploit. I think this is
definitely not going to be the end of the story. People in vision, for example, have things like symmetric components and antisymmetric components and whatnot, but I think,
you know, this is part of the road that we have to travel. Yep?
>>: So to think the old-school way of the thing which is like [indiscernible] examples
[indiscernible] class [indiscernible]
>> Pedro Domingos: Yeah. I'm glad you asked that question because that was actually one of
the motivations for what we're doing is like seeing all of this, you know, I have a data set of
100,000 examples and then I have a data set of 100 million examples, because I need those distortions in order to be able to then learn something that is invariant to those distortions. Those distortions shouldn't have to be in the data, with the corresponding cost of processing. It should be your network that figures out what distortions you should be invariant to. So you're definitely right: what we want to do, at least in one instance of this, is to get the same effect you get by putting in those distortions, but better and with much less computational cost.
>>: I think it's very dependent on the way that you do it; it's like sampling. [indiscernible] in this case you sample the distortions and I get that. [indiscernible] all the way. There shouldn't be any point in your model where you say, but now I cannot do it in closed form and I will sample it, because with sampling you are going to decide your distortions, I mean, instead of like…
>> Pedro Domingos: But the sampling has a high cost and it's imperfect because samples are
only a sample of reality. Anyway, let's press on and certainly I think we can discuss more of
these issues as we go along. The final point is this: the use of symmetry is actually agnostic with respect to which learner I'm using. You can apply symmetry ideas to connectionist models, Bayesian ones, kernel-based ones, symbolic ones, and we'll see a couple of examples here. It's actually compatible with all of these different things. All right. That's the
general idea. Let's look at a concrete example. What can we do with symmetry group theory?
As I said, there have been a number of applications of symmetry group theory in machine learning.
The most notable one these days is called convolutional neural networks. What is a
convolutional neural network? A convolutional neural network is basically a series of layers
each of which contains a set of feature maps and then interspersed with these feature layers I
have pooling layers. At the end I have a classifier that takes the features from the top layer and,
you know, that could just be a multilayer perceptron or something like that. Where do these
feature maps come from? A feature is just something like a sigmoid or a rectifier applied to some input, but a feature map is a feature function, like a sigmoid applied to a dot product, combined with translation. What a feature map in a ConvNet really is, is the same feature applied at every possible translation in the image. I'm going to apply this feature here, here, here, here, and here, and the thing that the feature is detecting might be
present here and here but not there, there, and there. This is what a convolutional network is. If you look at it this way, the generalization immediately suggests itself: why should the feature map just be a feature function applied at all translations? Translations are just one very limited symmetry group, the translation group. We should be able to build a feature map out of a feature and any symmetry group that we want.
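In symbols (my notation, not from the talk's slides): a ConvNet feature map records the response of a feature f at every translation t, and the generalization records it at every element g of an arbitrary symmetry group G,

```latex
M_f(t) \;=\; f(T_t\,x) \qquad\longrightarrow\qquad M_f(g) \;=\; f(g(x)), \quad g \in G .
```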
We should, in fact, have some sort of deep network that is not even committed a priori to what the symmetry group is that you are going to be exploiting. You can plug that in later. This is what I'm going to do: the
first step is I'm going to define an architecture that is valid for any symmetry group. The
architecture just uses symmetry group properties. It doesn't say that this is going to be the
translation group or something else. And then step two I'm going to instantiate that with one
particular symmetry group and the obvious one to do that, of course, is the affine group. The
affine group is a superset of the translation group, but it also includes things like rotations,
reflections, scalings and whatnot. And the 2-D affine group is defined by this. If these are my
original coordinates on the plane my new coordinates are just a linear function of the old ones.
These two things obviously will give you translation and these give you the rest, the rotation, the
scaling and whatnot. So the affine group is six dimensional because it has these six parameters.
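Written out, the standard 2-D affine map with its six parameters (my notation; presumably this is what the slide's formula shows):

```latex
\begin{pmatrix} x' \\ y' \end{pmatrix}
= \begin{pmatrix} a & b \\ c & d \end{pmatrix}
  \begin{pmatrix} x \\ y \end{pmatrix}
+ \begin{pmatrix} t_x \\ t_y \end{pmatrix}
```

Here t_x and t_y give the translations, and a, b, c, d give the rotations, scalings, shears and reflections.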
This is what we are going to do. I'm going to call the first one SymNets, for deep symmetry networks, and then there's going to be a particular instance that is deep affine networks. And ConvNets are just deep translation networks in this setting. There's a couple
of notions that we need from symmetry group theory. Let me just bring them over. One important notion for us is going to be the notion of a generating set. So what is a generating set? Let's say that G is a group and S is a subset of that group. We say that S is a generating set of G if every element of G can be expressed as a combination of elements of S and their inverses. So S is some subset that is sufficient to generate the entire group by composition and inversion. For example, in the case of translations, adding an infinitesimal epsilon to x and adding an infinitesimal epsilon to y are a generating set, because I can generate any translation just by combining epsilon translations in x and y or inverting them, which would be x minus epsilon, you know, et cetera, et cetera. One notion that
we need to define based on this is the notion of the k-neighborhood of a function in the group, given a generating set. That's a bit of a mouthful, but bear with me; the notion I think is quite intuitive. This is going to be the subset of elements of the group that can be expressed as f -- so remember, this is the neighborhood of f, of a particular function -- composed with elements of the generating set at most k times. For example, in the case of the translation group, this would actually look like a diamond, because you could do k this way and then you could do k that way, and then you could do k over 2 like this and k over 2 like that, and so on. If you wanted to have a rectangle you would just take a slightly extended definition where you could have k_i applications of element i, but there's no need for that complication here. The notion here is that I want to not just look at all possible compositions, but at limited numbers of compositions. And you can probably see why this is the case, because what happens in a ConvNet as you go up the layers is that you are allowing, in this view, compositions of more and more elements of your generating set of small translations, which are going to be translations of one pixel, effectively, in that case.
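Putting the two definitions just given into symbols (notation mine, following the spoken description rather than the paper):

```latex
\begin{aligned}
& S \subseteq G \text{ is a generating set of } G \text{ if every } g \in G \text{ can be written as } g = s_1 \circ s_2 \circ \cdots \circ s_m, \quad s_i \in S \cup S^{-1}; \\
& N_k(f) \;=\; \{\, f \circ s_1 \circ \cdots \circ s_j \;:\; s_i \in S \cup S^{-1},\ 0 \le j \le k \,\} \quad \text{(the } k\text{-neighborhood of } f\text{)}.
\end{aligned}
```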
>>: [indiscernible] a sliding window?
>> Pedro Domingos: Yeah. Exactly, so one good way to think of this is that a k-neighborhood could be a sliding window, but it's going to be a sliding window in symmetry space, not just in translation. You're going to be sliding not just around translations but around rotations and scalings and then combinations of them. Think of a little window centered at the origin; ConvNets will put that window everywhere. What we are going to do is, in addition to putting it everywhere, we're going to look at all possible scales and all possible rotations and shears and whatnot. We're going to look at all linear transformations of that window and see what happens.
>>: The semantic parsing example you are going to talk about later, how do you make these
kinds of examples?
>> Pedro Domingos: You'll see. Exactly. The semantic parsing work is not as far along, but I
think it's very interesting because it's a completely different application of symmetry group
theory and yet with some deep similarities. We'll get to that in a second.
>>: [indiscernible] ConvNet, like the two operations are the same function on different translated versions of the image and then they get read off [indiscernible] organization of those, and then in order to like distinguish them, right?
>> Pedro Domingos: Right. So I haven't told you how we are going to do that exactly. There are really two things in ConvNets: one is parameter tying and the other is pooling, and we haven't yet seen how we are going to do them, but we will shortly. Here's the general architecture. It's very simple. One layer is obtained as follows: you take the input and you apply every symmetry in the group to it, and then you compute features on the transformed input. This gives you a feature map, a feature map over the entire symmetry space. Then we pool over neighborhoods; we pool over the k-neighborhoods that we just defined. It can be max pooling or sum pooling or any of the various things in between that people have tried, but the thing is, in order to pool you need to know what the neighborhood is -- where are the values that you are pooling over -- and the answer is it's going to be the k-neighborhood as we just defined it. Everything we are doing here has ConvNets as a very clean special case. Yeah?
>>: [indiscernible] tangent distance. How does it relate?
>> Pedro Domingos: Indeed. Interesting question. I don't know if people here are familiar with tangent distance, but maybe my answer will also clarify that. Tangent distance is applying symmetry group theory to nearest neighbor, and this is applying symmetry group theory to deep networks. In a way, these deep symmetry networks are to tangent distance as ConvNets are to nearest neighbor. Many of the things that tangent distance can do we can do here as well, and we can do some others besides. As we'll see, there's a point at which we need to solve a problem in our deep network that could be solved by tangent distance. I think it was John that put it this way: what we are going to do here is, in a way, tangent distance inside the ConvNet. That's one way to look at it. Yeah?
>>: You said you apply all the symmetries in the group to the input and pool over all the neighborhoods? Why did you do that instead of, say, finding just the symmetries in the k-neighborhood?
>> Pedro Domingos: Notice again, you can answer your own question I think if you look at
what happens in ConvNets, right? I do want to compute the feature at every place in the image
and then I pool over neighborhoods, because I want to find out if the feature is present over there
and over there and over there.
>>: So say I assume that there is just local preservation of [indiscernible] on the rotation. But
you will apply all rotations?
>> Pedro Domingos: Yeah. Just like I apply all translations. And remember, in the ConvNet, ideally, if the learning permits, if something is completely invariant with respect to translation you're only going to find that out at the top, once you have pooled over the entire image. Same thing here: if something is completely invariant I will find that out at the top. If something is only partly rotationally invariant, I will learn that at the appropriate layer but not beyond. Again, back prop permitting and the data allowing that to happen, but we will see that this actually works. As I already implied, we can train this by back prop, in the same way that ConvNets are trained by back prop, provided that this is a Lie group. If it's not a Lie group we will have to do other things that we'll talk about later, but if it is a Lie group we can train by back prop because the symmetries are differentiable.
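To make the layer just described concrete, here is a minimal sketch in Python/NumPy. This is my own illustration, not the authors' code: the finite list of sampled group elements, the `apply_transform` and feature callables, and max pooling are all assumptions, and a real implementation would interpolate over the continuous group rather than enumerate it.

```python
import numpy as np

def symnet_layer(x, group_elements, apply_transform, feature_fns, neighborhoods):
    """One conceptual deep-symmetry-network layer (illustrative sketch only).

    x               : input, e.g. an image or a lower-layer feature map
    group_elements  : a finite sample of symmetries g from the group
    apply_transform : function (g, x) -> x transformed by g
    feature_fns     : list of feature functions (e.g. sigmoid of a dot product)
    neighborhoods   : list of index lists, each a k-neighborhood in the sampled group
    """
    # Steps 1-2: apply every sampled symmetry and compute each feature on the
    # transformed input, giving a feature map over symmetry space.
    feature_maps = np.array([[f(apply_transform(g, x)) for g in group_elements]
                             for f in feature_fns])            # shape: (n_features, n_elements)

    # Step 3: pool each feature over k-neighborhoods in symmetry space
    # (max pooling here; sum or average pooling would work the same way).
    pooled = np.array([[fm[nb].max() for nb in neighborhoods]
                       for fm in feature_maps])                # shape: (n_features, n_neighborhoods)
    return pooled
```

With the translation group and pixel-sized generators this reduces to the familiar ConvNet convolve-then-pool layer, which is the "clean special case" mentioned above.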
>>: [indiscernible] backprop for this type of network is comparable for the [indiscernible]
>> Pedro Domingos: Oh, great question. A difficulty, yes. There is an obvious difficulty, which is, well, translation was in a 2-D space, right? And now you're going to who knows what dimensional space? What are you going to do? You can't have a grid. So we're going to have to deal with that problem. Those of you who know tangent distance might already have some thoughts about how we might deal with this problem, but indeed, this is actually key. If we don't deal with it, you know, this isn't really going to be practical.
Okay. That was the general case. Now let's make this concrete, right, with the affine group.
This leads to what we call deep affine networks. This is just the architecture that we saw before
with the affine group as the symmetry group, so in one layer we apply every affine transformation to the image plane -- stretch it, shear it, rotate it, translate it, you know, the whole works -- and then we compute features on the transformed plane. What this means is that instead of just trying to look for a particular feature at the different translations, I'm also going to look for it at different scales, different rotations, different, you know, affine transformations. And then we pool over the neighborhoods in affine space as we saw before, which means that we are going to do things like, for example -- pooling gives you translation invariance in a ConvNet; it says, I don't care whether this feature is here or here or here, it's all about the same to me. We're going to say something similar, which is, well, not only do I not care if it is here or here or here, I don't care whether it's this size or that size within some range, and I don't care whether it's rotated by some number of degrees. And now this is one layer, right? The first layer has the image as its input. The second layer, you know, has these feature maps, and I can just stack as many of these layers as I want and train them by back prop. Okay? What is the generating set, or what is a generating set, that we can
use for affine transformations? Well, the obvious one is to just start with the identity matrix and the origin -- this is basically not translating and not rotating, you know, just leaving things alone -- and add epsilon to each parameter. So if you look here, I have 1 0 0 1 0 0, and adding epsilon to each one of these gives me, for example, an element of x translation or y translation as before, but then these guys right here, if I have a matrix that's 1 plus epsilon 0 0 1, and one that's 1 epsilon, et cetera, et cetera. I can compose everything out of those, so this is what we use. All right, so, uh-huh?
>>: If you go back there, on the step one, the affine…
>> Pedro Domingos: Pedro here?
>>: Yeah, back one more. Apply every symmetry; you are taking the generating set and you are
going to sort of apply over and over and over the generating set, or are you -- I don't quite
understand you.
>> Pedro Domingos: For this part you don't really need it -- to understand pooling over neighborhoods you need the concept of the generating set, but here we actually don't. Just think of the entire symmetry group. It's a 6-D group.
>>: So you are basically going to generate all the symmetries and apply all of them?
>> Pedro Domingos: Yeah. Which sounds crazy, right?
>>: That's like real simple, right, real simple?
>> Pedro Domingos: Exactly, so.
>>: That's all feasible.
>> Pedro Domingos: For a moment, let's ignore computational issues. I'm just going to take
every last one and apply them all in their glorious infinity.
>>: Yeah, yeah sure.
>> Pedro Domingos: Right? So conceptually this is what I'm going to do, right? I mean, in some cases we may be able to do these things in closed form, but that's not what we're going to do here, and it's also not what ConvNets do. So of course, another problem is I can't do this, right? This is not going to be feasible. In ConvNets it was just about feasible, but it's not going to be here. What can we do? Well, we're machine learning people. We know how to solve this problem. We don't need to sample the space at every point. Remember, what has to happen is we need our feature at every point in symmetry space, but there are too many points. We can't do that. What can we do instead? We can compute our features only at some points in symmetry space, only for some rotations and shears and scalings and translations, only for some. And then we interpolate for the rest. Instead of sampling this room exhaustively on a complete grid, I pick a number of points. I know my feature there, and if I have a point here and a point here, then the feature here can be computed by interpolating between them. Come on. We know how to do this.
>>: [indiscernible] create this a few types of symmetries and then pretty much [indiscernible] so
this is more complete setups?
>> Pedro Domingos: Yeah. So again, for those of you not familiar with Stéphane Mallat's work, and also there is [indiscernible] in here, there's a number of sort of explanations and generalizations of deep networks based on ideas from symmetry group theory. Again, this has many things in common with those, but it also has some important differences. One is that he doesn't really learn anything. We do. We're going to do back prop and learn all these things. I think what he has is an interesting concept in terms of how to explain deep learning, but he actually doesn't have an algorithm that would learn something, and we do.
>>: [indiscernible] translation [indiscernible]
>> Pedro Domingos: Yeah. And again, what he has is not for an arbitrary group or for the
affine group, even. What we have is much more general.
>>: The point is that you understand what you want in terms of symmetry. You don't need to learn it, because when you decide the wavelet [indiscernible]
>> Pedro Domingos: And I agree with that.
>>: It's a concept, but it doesn't work. [laughter].
>> Pedro Domingos: It has some conceptual value, but the thing is, if you know what you want you should be able to encode it, but if you don't know what you want you should be able to learn it. There's no reason why we can't learn the symmetries, and this is what the ConvNet is doing and Mallat's work isn't, but we are. At the end of the day we have here a practical, full-blown algorithm, and we'll see it does give the kind of benefits that we were speculating we'd really get. From a machine learning point of view, this is a very natural solution. We have a very high dimensional space -- actually, here is the ironic thing. The affine space in 2-D has six dimensions, which five minutes ago looked very scary: oh, you are going to need a grid in six dimensions. And now it's like, six dimensions, that's nothing. We know how to deal with thousands of dimensions. Six dimensions isn't even that big of a problem.
Now what we need to do, right, if we’re going to do this by having some control points and
interpolating between them, we need some way to choose the control points. Now it is important to realize -- for example, one thing that John actually suggested right after my talk was why don't you just let them be a random sample and leave it at that. Actually, that might work. In many cases that might actually be just fine, but here is something important to realize: I don't
actually care equally about approximating the feature everywhere in affine space. What I am
really looking for is where the feature is high, because that's where it's present. Say this is a feature that detects a nose. Where the feature is close to zero I really don't want any samples, because they're not going to matter. What I really want to nail is the places where the nose detector is going off full steam: at this scale and this position in the image, this rotation, I seem to have a nose. For our control points, ideally -- now, there's more than one way to do this, but here's one consideration -- you are probably better off letting them be local maxima of the
feature instead of just being random. How would we find them? We have our feature map. Our
feature map is the value of the feature going up and down over all points of the affine space.
How do we find those points? We can just start with n random starting points, and this n is going to be an interesting parameter, but notice that essentially you are going to be independent of the dimensionality given an n, because everything is just going to depend on this n, not the dimensionality of the space that you are in. Then what I can do is, from those random starting points, find maxima by gradient descent. Uh-huh?
>>: [indiscernible] maxima [indiscernible]?
>> Pedro Domingos: It's of the feature, right? Suppose you have a feature; let's take an ordinary neuron. It's a dot product followed by a sigmoid or by a Gaussian or something. You can apply that to an image patch of size 16 x 16. It's one neuron, so the feature is one neuron. Now what happens is we want the value of that feature as we sweep around in our affine space. If there is a nose at a certain scale, it should have a high value when I apply the feature at that scale, but not when I apply it at a different scale.
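A minimal sketch of the search just described, finding local maxima of one feature over the 6-D affine parameters by hill climbing from n random starts (my own illustration; the finite-difference gradient, step size, and the `feature_value` callable are assumptions, and the actual SymNet procedure may differ):

```python
import numpy as np

def find_feature_maxima(feature_value, n_starts=8, n_steps=50, lr=0.05, eps=1e-3, dim=6):
    """Return n_starts points in affine-parameter space reached by gradient ascent.

    feature_value : maps a (dim,) vector of affine parameters to a scalar, e.g. the
                    response of one neuron on the correspondingly warped image patch.
    """
    rng = np.random.default_rng(0)
    points = rng.normal(scale=0.1, size=(n_starts, dim))    # random starting points
    for i in range(n_starts):
        theta = points[i]
        for _ in range(n_steps):
            # Finite-difference estimate of the gradient w.r.t. the affine parameters.
            grad = np.array([(feature_value(theta + eps * e) - feature_value(theta - eps * e))
                             / (2 * eps) for e in np.eye(dim)])
            theta = theta + lr * grad                        # step uphill toward a local maximum
        points[i] = theta
    return points
```

Note that the cost depends on n and the number of steps, not on the dimensionality of the symmetry space, which is the point made above.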
>>: Do you have any idea -- the affine space is big, so can I take any patch and transform it in some way that [indiscernible] looks like, maybe it's a very convoluted…
>> Pedro Domingos: Yeah, that's a very interesting question, but that will be a problem when we go where we want to go, which is to be very flexible in the symmetries that we can learn. Then it's an overfitting problem. Here we don't really have that overfitting problem yet, because you can't distort a mouth into a nose, I don't think, just by affine transformations, yeah.
>>: This assumption that maximum is good seems odd to me. Are you assuming the weights are
positive or -- I mean, a feature could be really good when it's strongly negative, the things that
listen to [indiscernible]
>> Pedro Domingos: There's a certain interplay between this and what the feature is doing, right, but if the feature is something like a sigmoid, it's 0 when something is not present. And you are saying, in general, I might be interested in this 0 case, but in practice in vision that's not what happens.
>>: Just because it's maximum.
>>: It goes 0 to positive instead of -1 to positive 1.
>> Pedro Domingos: Yeah. For example, if it wasn't the case that you are looking for local maxima, then use some other method, maybe random samples, maybe some other knowledge that you have. In fact, in one of the applications that we have, what we actually did was we didn't do the gradient ascent process here; we actually just picked where we put the control points based on our presupposition of what the invariances would look like. Yeah?
>>: Why do you care about the maxima as opposed to characterizing the entire space? Wouldn't
you want the information about the entire group?
>> Pedro Domingos: This is what I was saying. Yes, but I don't have the cycles to look at the
entire space so I have to pick some -- if I'm going to solve this problem by interpolation, I need
to pick the control points.
>>: With this point over here the maximum, what about the, I mean if you missed the point…
>> Pedro Domingos: This is what I'm saying that like because those are what I care about, those
are what I'm going to look for. If those weren't what I cared about I could do things accordingly.
>>: So [indiscernible]
>> Pedro Domingos: This is a choice that…
>>: [indiscernible] absence.
>> Pedro Domingos: Yes, exactly. The use of local maxima presupposes that you are using a certain kind of feature, which is what you almost always have in vision, but if you are using a different kind of feature, like Matt was saying, where, for example, the -1 is very informative, then -- I think the general theme here is that you may want to do random sampling, but you also want to make it sort of utility-guided sampling, sampling more at the points where you have the most relevant information.
>>: Let me ask the converse of what Ashish did. It seems like you are choosing an interpolation scheme. If the interpolation scheme is convex, like, for example, if the basis functions are [indiscernible] unity, then when you are taking a max you don't have to interpolate at all, because it's convex, so you know that the max of the interpolation will always be at the control points you pick. There's no integration, there's no nothing, right?
>> Pedro Domingos: Good point. That's partly true, but here's the other thing. This is actually
going to do several jobs for us at once. We actually -- I'm wondering when I should talk about
this. I'll just talk about it right now. We are going to do this with kernels, so we're going to put a
kernel on each of these points. This kernel is actually going to do several jobs at once. You
could actually have separate kernels with each of these things and some of them might even be
unnecessary, but here's one of the most important things that these kernels are going to do.
They're going to do the pooling. For example, I'm going to use the width of these kernels. It's one thing to think, oh, I am looking for a very accurate interpolation of the feature map. If that's what you were looking for, this is not what you would do. It's perfectly legitimate to put that in here, but actually what we're trying to do here is not an exact interpolation, necessarily.
>>: You want the aggregate of the feature.
>> Pedro Domingos: Because we want to pool, and pooling means being invariant. We actually want to be insensitive with respect to a lot of things, so what's going to happen is, if I widen my kernels I'm going to be pooling over a wider… Oh [laughter] special effects. So widening the kernel is going to have the effect of pooling over a larger area. Suppose I have a maximally wide kernel; that basically means that I have washed out all information about where I am in feature space. If my kernel is a delta function then basically I am not doing any pooling whatsoever.
>>: So you don't actually apply another pooling function?
>> Pedro Domingos: Exactly, exactly. This thing does the whole job, so it's a lot simpler. This
one kernel is actually going to do double duty. It's doing the duty of approximating the feature
map. It's also doing the duty -- think about, ignore ConvNets for just a second and think pooling
is like low pass filtering. It's a kind of low pass filtering. What I'm going to do with these
kernels is like, you know, low pass filter the feature map. And if I want to be very invariant I
will have a low pass filter with a high-bandwidth.
>>: Okay.
>> Pedro Domingos: For those of you familiar with vision, this probably looks a lot like Lucas-Kanade, and it is. Lucas-Kanade was originally for things like optical flow and whatnot and then it was generalized to affine transformations, but, in essence, it's what we're going to do here. Once we have found these points -- and again, these points could be random points, they could be maxima found in this way, and they could be points that you put down because of something you know about the symmetry of the problem -- however you put down those points, what happens is, instead of having computed the feature everywhere in your affine space, you have computed it at certain points. But now, in order to do my back prop, I need to be able to compute the value of the feature map everywhere. The way I do it is by using Gaussian kernels. Again, you could use many other things besides Gaussian kernels, but Gaussian kernels have all of the usual convenient properties and they also work well in practice. What I'm going to do is put a Gaussian kernel on every one of these points, and then my linear combination of those Gaussian kernels is going to be my approximation of the feature map. Yeah?
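As a concrete illustration of that last point (my own sketch, not the SymNet code; the single shared bandwidth and the plain weighted combination are assumptions), the feature map is approximated anywhere in symmetry space from its values at the control points, and widening the kernels pools over a larger region:

```python
import numpy as np

def kernel_feature_map(query, control_points, control_values, bandwidth):
    """Approximate the feature map at an arbitrary point in symmetry space.

    query          : (d,) point in (e.g. 6-D affine) symmetry space
    control_points : (n, d) locations where the feature was actually computed
    control_values : (n,) feature values at those control points
    bandwidth      : Gaussian kernel width; a wider kernel pools over more of
                     symmetry space (more invariance), a narrow one interpolates tightly
    """
    sq_dists = np.sum((control_points - query) ** 2, axis=1)   # squared distances to controls
    weights = np.exp(-sq_dists / (2.0 * bandwidth ** 2))       # Gaussian kernel weights
    # Linear combination of the control values weighted by the kernels
    # (normalizing the weights would be another reasonable choice).
    return float(np.dot(weights, control_values))
```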
>>: [indiscernible] between what you do at forward and backward. It seems like what you are
suggesting is maximum over neighborhood of my feature every symmetry over the same
[indiscernible]. Which means that basically the backprop just sounds like a subgradient at the single point you found.
>> Pedro Domingos: Again, what is going to happen depends on the pooling function that you have. Suppose, for example, that you are doing average pooling.
>>: Yeah. No. But you just said that you were doing the pooling and the estimation in one go with this maximum, so if you are [indiscernible] is maximum there is no need for…
>> Pedro Domingos: No. You have to remember that we are going to have successive layers of this, so one of these kernels is sitting above the layer and another one is sitting below the previous one, and those are the two that we combine. There's even more than this, but why don't we get the whole picture first and then discuss. I've already said most of this: if you make your kernel wider you get more pooling. And then you might also worry about how to implement this efficiently, because as I'm computing my gradient I am repeatedly going to have to look up, you know, the nearest points to the point that I am currently at. I'm going to be somewhere in affine space, and you can use things like ball trees or whatnot to do that. You could use KD-trees, but KD-trees are very dependent on the dimensionality; ball trees less so. Making this all efficient, besides this idea of using interpolation in 6-D space, also involves some careful attention to the data structures, but again, we know how to do that.
Experiments. The first thing that we tried this on was the obvious one, which was the rotated MNIST data set. This is -- somebody went and took, I think somebody in Montréal, they took the MNIST digits and rotated them in all possible ways, and now you have a data set like that. People have used this for a lot of things. It's a nice simple example. So what happens there? We compared a ConvNet with one -- the way we implemented the SymNet was by basically taking the ConvNet code, using the [indiscernible] implementation, replacing the convolutional layers with affine layers, and then on top of it we still have a multilayer perceptron. This is one affine layer in our case and one convolutional layer in the ConvNet, and on top of that there's the two usual multilayer perceptron layers, so this is really four layers; it's one, you know, symmetry layer.
>>: [indiscernible] dropping error using [indiscernible] stabilize and you get 50,000 altogether?
What is the trend?
>>: That's the [indiscernible]
>>: Oh. [indiscernible] 10,000 [indiscernible]
>> Pedro Domingos: Okay. But notice this curve is number of training examples versus test error. And what we see here is that if you have very few training examples the SymNet vastly dominates the ConvNet, because the ConvNet is trying to approximate rotations with small translations. If you give it enough training data that it actually basically sees all of the rotations in the data, well, at that point they do the same. The point, though, is that when you don't have the space densely sampled in your training data, as you won't in larger, richer data sets, the ConvNet has a very hard time actually generalizing correctly, whereas the SymNet with 100 examples is doing as well as the ConvNet does with close to 1,000 examples.
>>: It looks like if you give it more data it's going to repair it.
>> Pedro Domingos: In this case, again, I think what's going to happen in general is that if you give the ConvNet enough data and enough parameters, you know, it will eventually learn everything. But the problem is you are not going to have that much data, or even the cycles to process that much data, so I don't think it's necessarily the case that as we get more data we are going to get better. It's that we are going to, I think, see a curve like this also at a larger scale on more difficult data sets. And what's going to happen for those data sets is that, for example, the size of your data set might be here as opposed to there, so there the ConvNet running on everything is still going to have this kind of error and we're going to have this kind of error.
Yeah?
>>: What's the training time between [indiscernible]
>> Pedro Domingos: The training times were somewhat slower, not hugely; I think by a factor of maybe up to 10. You have to remember that our training time is really dependent on the n that you use. We can use more; we haven't played with this a lot. If you choose a larger number, basically our training time goes up with the number of control points.
>>: In this case if you take 1,000 examples and you run them through there, and the ConvNet at the same time, would you get the same result?
>> Pedro Domingos: Our gain here is not in time; it's in samples. One way to look at it is we can use a much smaller sample to get the same results, yeah?
>>: Yeah, but they can make samples by sampling the symmetry. I think time is the relevant
comparison because like the opening of the talk was we are going to do the old-school way of
doing things and depending on the [indiscernible].
>> Pedro Domingos: You think that time is the relevant comparison?
>>: Because you can always fake out.
>>: Yeah, you can fake out.
>> Pedro Domingos: Oh, no, no, no. Oh, no. I see your question. No, no, no. Okay. Good. What happens is, here you can fake out those transformations. In general, you can't. That's the problem: you don't know how to generate them -- actually, let's look at the next set of experiments that we have and then come back to this question. This is a good point and I see your question, but there's a very definite answer to it. You had a question?
>>: A question similar to his, here you already know the symmetry [indiscernible] so you
generate new samples [indiscernible]
>> Pedro Domingos: No. Precisely. This data set is interesting for the reason that the data set is exactly the digits with those rotations. We know exactly what generated them. This is a good way to test whether we can do what we want to do, but in a way it's not a very realistic test, because the realistic test is one where there's all sorts of transformations happening that we have no control over and we don't even know how to generate. In fact, to jump to the next set of experiments, which was on NORB: in NORB we actually have 3-D objects that are undergoing 3-D affine transformations and changes in background and distractor objects and whatnot. At this point, the symmetries that the data really has are something that we can't directly capture, nor can the ConvNet, but we still do a lot better. This also gets back to your question. Because what happens here, and this is a subtle point but in some ways it's the key point, is that the ConvNet can do okay by approximating everything with translations provided the data is sampled densely enough. What we're buying ourselves by having the affine group instead of the translation group is that we need less data to get the same degree of approximation of something that is actually not completely captured by those transformations. In particular, here the real data has 3-D transformations and we're approximating them with 2-D ones, but we're better off approximating them…
>>: I'm still not happy because I can fake out the affine transformations.
>> Pedro Domingos: No, no you can't.
>>: Yes, we can.
>>: Yes, we can.
>> Pedro Domingos: No. Not in 3-D.
>>: No, no no.
>>: You only have the 2-D affine, which I can fake out.
>> Pedro Domingos: No, no no.
>>: This is not a global affine transformation.
>> Pedro Domingos: No, no, no. You are missing several things here. One of them is that here, what we had -- and this is why we only needed one layer here -- is that I just had a global rotation of the image. That's it. What I have here, and in the real world, is nothing of the kind. There are all sorts of transformations all over the map, so I can't do that anymore. There's an infinite number of them. It's not just a matter of applying all of them; that's not what we're going to do here. That's not what we need to exploit. That actually becomes impossible here, but the remarkable point is -- so these two graphs show different things. We still get a gain over ConvNets when we use affine transformations, because we can better approximate the real transformations that we can't generate. Now I can say, well, maybe we could do an experiment that would consist of generating all affine transformations and then putting them through the ConvNet. I would bet that at that point the ConvNet would be a lot slower. It would be slower because it would be working on a much bigger data set, right? Let's say that we are an order of magnitude slower than ConvNets; that doesn't matter if you have to increase the amount of data in the ConvNet by a trillion.
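To make the six-parameter space concrete, here is a small illustrative sketch in Python (not the authors' code; the function and values are made up) of applying a 2-D affine map, the group a SymNet layer pools over, versus the two parameters of pure translation:

```python
# Illustrative sketch (not the authors' code): the 2-D affine group has six
# parameters, versus two for pure translation.
import numpy as np

def affine_transform(points, a, b, c, d, tx, ty):
    """Apply a 6-parameter 2-D affine map to an array of (x, y) points."""
    A = np.array([[a, b], [c, d]])      # linear part: rotation, scale, shear
    t = np.array([tx, ty])              # translation part
    return points @ A.T + t

patch_corners = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
# A small rotation plus translation; a ConvNet can only emulate the translation part.
theta = np.deg2rad(15)
print(affine_transform(patch_corners,
                       np.cos(theta), -np.sin(theta),
                       np.sin(theta),  np.cos(theta), 0.5, -0.25))
```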
>>: If [indiscernible] were here, what would they do, right? They would probably generate some k neighbors for each example, which wouldn't be like a millionfold explosion. They would use the group you are using and that would be the baseline.
>> Pedro Domingos: No, but we have more than that. Remember, we have more than one layer.
>>: The point is that you are saying there is a sampled [indiscernible]
>>: I'll generate from the 2-D.
>> Pedro Domingos: I mean, I think in terms of [indiscernible] complexity, which I think is the important thing, this is clearly a win. In terms of time it might or might not be a win, but I would still argue that it is going to be a win.
>>: You don't know because you didn't fake…
>> Pedro Domingos: I know, because we haven't done that, so fair point.
>>: It wasn't clear: your control points, are they per example or for the whole network?
>> Pedro Domingos: In the case where we optimize them, we start out at n random locations, but then they are optimized on the fly.
>>: They are on the fly, but then for each instance you have to search for a transformation that maximizes the feature map for that example. Do you do that search for every example?
>> Pedro Domingos: I do, yes; it's an inference, right? I do an inference per example. By the way, in NORB we actually didn't do that at all. In NORB what we actually did is we set down control points at specific places and didn't adapt them, so it's even faster in that regard.
>>: So in your case, for every forward pass you basically have to do this search?
>> Pedro Domingos: Yeah. It's like doing a MAP inference, right?
>>: Looks like a big thing, right? In one case you are like [indiscernible] forward with a [indiscernible] 6-D for every neuron, for every feature map, right? It looks like a costly process.
>> Pedro Domingos: It is a costly process, but again, let's say it increases the cost by about an
order of magnitude. It depends on the number of control points that you use. When you do this
efficiently the increase in cost is not that huge. It's certainly better than actually dealing with
orders of magnitude more data.
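As a rough picture of what that per-example inference could look like, here is a minimal sketch under the assumption of a simple hill-climbing search over the six affine parameters; it is not the paper's implementation, and the toy feature function is invented for illustration:

```python
# Minimal sketch (an assumption, not the paper's code): per-example inference that
# locally searches the affine parameters for the pose maximizing a feature's response.
import numpy as np

def local_search_max(feature_fn, theta0, step=0.1, iters=50, rng=None):
    """Hill-climb over transformation parameters theta to maximize feature_fn."""
    rng = rng or np.random.default_rng(0)
    theta, best = np.asarray(theta0, dtype=float), feature_fn(theta0)
    for _ in range(iters):
        candidate = theta + rng.normal(scale=step, size=theta.shape)
        value = feature_fn(candidate)
        if value > best:                 # accept only improving moves: the search stays local
            theta, best = candidate, value
    return theta, best

# Toy feature: peaks when the 6-D affine parameters match a "true" pose.
true_pose = np.array([1.0, 0.1, -0.1, 1.0, 2.0, -1.0])
feature = lambda th: -np.sum((np.asarray(th) - true_pose) ** 2)
print(local_search_max(feature, np.zeros(6)))
```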
>>: But you don't have 20 years of optimization.
>> Pedro Domingos: Yeah, but again -- and this is also a good point -- we don't have here all the engineering tricks that ConvNets do, but because this is such a clean generalization of ConvNets there's no reason you can't apply every single one of those tricks here as well. In that way we are comparing a very state-of-the-art technology with something that isn't, which is this thing that we just developed. Yeah?
>>: What I would personally find very convincing is if you somehow parameterized a very large space of symmetry groups and then learned your way into them, because then the ConvNet guys won't know exactly how to generate the data from that.
>> Pedro Domingos: I mean that's definitely where we want to go. We just haven't gone there
yet.
>>: I actually have a related question about your data, which is that [indiscernible] when you fake the data it's for 2-D, where you know exactly that if you just [indiscernible] a little bit it's the same digit, but I wonder whether [indiscernible] in 3-D, because of the lighting change and [indiscernible] so far that wouldn't be the case. You can't…
>>: I understand the symmetry group that you used also did not handle lighting [indiscernible]
>> Pedro Domingos: No, but the point is, again -- and I go back to this because it is an important point -- what we're always going to be trying to do is approximate the symmetries of the real world, which are more than we can really capture. What this illustrates, or at least makes us suspect, is that having a wider symmetry group is a bit like variational approximation. If you have a wider family from which to approximate your distributions, you will get a better approximation. That's my interpretation of why we're doing better here. Yep?
>>: I want to take this in a different direction. So far we've talked about the training process; in terms of inference, is this still feedforward, or is there something that you have to do especially in the inference phase?
>> Pedro Domingos: No, I mean yeah, and the inference here is fairly straightforward.
>>: But you still have to iterate for it. You still have to do work.
>> Pedro Domingos: Yeah. Again, depending on whether we do that maximization or not, we may have to do that maximization at inference time. But you have to remember that…
>>: So what do you [indiscernible] this going to [indiscernible]
>> Pedro Domingos: Remember, what we're doing is, we don't have the value of the feature everywhere. What we need to do is find its maximum. Then we use those maxima to compute the features at the next layer. A feature at one layer is using features at the previous layer, right?
>>: So the maximum is computed over, then interpolated?
>> Pedro Domingos: Yes, exactly, but this is all in one forward sweep. And by the way,
typically it doesn't take that many iterations for this to converge because the search is fairly local.
This is really not that bad.
>>: [indiscernible] between the fixed points and doing this maximization?
>> Pedro Domingos: We haven't done it. I mean, we did the fixed points for NORB and [indiscernible] -- I don't know. Did you try either on the other one?
>>: We did the maximization on MNIST and it works better because you have the larger [indiscernible] and it's just more well behaved; NORB is scaled down and so the objects are small.
>> Pedro Domingos: This brings up another important point, which is patch size. We tried different patch sizes for the ConvNets and the SymNets, and the ConvNets work best with small patches. This is what you would expect, because if you are approximating, for example, a rotation or even a 3-D rotation or something else with a translation, you are only going to get a good approximation with small patches. In our case, because we have more flexibility, we can have larger patches and still get a good approximation, because now we're not just doing translations of the patch; we're doing transformations of the patch. Because we can have larger patches, we can actually capture more of the actual structure of what's going on there, which I think is what lets us generalize better.
>>: And you have fewer strides, so even though each stride costs more you have fewer.
>> Pedro Domingos: Precisely. You can look at this as each one of our layers doing the job of several ConvNet layers. You need enough ConvNet layers to build up to the patch size that we actually learn in one step, which again has a payoff in ease of learning and learning time.
>>: In the ConvNet the features, like 2-D features, this is what you have, right?
>>: Uh-huh. In 6D.
>>: The part…
>> Pedro Domingos: 2-D, right? In the ConvNet I'm going over the 2-D translations by [indiscernible] and here we're going over a 6-D space of…
>>: No, but he is saying they are still parameterized by grid and…
>>: [indiscernible] of 2-D. Your parameters are not 2-D features?
>> Pedro Domingos: No. I'm not sure I understand your question, but in the ConvNet there are the parameters of the feature, right? Each feature has its weights.
>>: That's what I'm talking about.
>> Pedro Domingos: And we still have that.
>>: Yeah, so the…
>> Pedro Domingos: Again, there's more than one way to do that, right, because if that becomes an infinite vector we can again approximate it with kernels, which is actually a third job that the kernels can do on top of the two that I mentioned before.
>>: My question was, you use the fixed grid, right? And now you have these continuous transformations that you need to discretize to [indiscernible]
>> Pedro Domingos: No. That's what I'm saying. We don't have to discretize if we use kernels.
>>: [indiscernible] so you have a discrete [indiscernible] which then as a [indiscernible]
operation you have a continuous feature and then you can have [indiscernible]
>> Pedro Domingos: Yeah. Think of, for example, the sigmoid. Instead of being applied to a -- think of the dot product, right, for each feature. Instead of that dot product we can have a kernel application, which is continuous, and then this feature map uses the feature maps one, two and three. We have a kernel for each one of them, but then we have weights for those, so we still have weights. Oh, this is really important. The nose must have a point, a bright little spot with high weight. I have the detector for the spot. And then I have detectors for the nostrils that have some slack with respect to where they are and maybe a lower weight. I still have weights, but the weights are on the kernels; they are not on every point.
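Here is a hedged sketch of that idea: estimating a feature's value at an arbitrary point of the continuous transformation space by Gaussian-kernel interpolation of values stored at a few control points. The names and numbers are illustrative, not the actual system:

```python
# Illustrative sketch: evaluate a feature at a continuous transformation by
# kernel interpolation of values sampled at a few control points, rather than
# discretizing the transformation space.
import numpy as np

def kernel_interpolate(query, control_points, control_values, bandwidth=1.0):
    """Smoothly estimate the feature value at `query` in transformation space."""
    d2 = np.sum((control_points - query) ** 2, axis=1)
    w = np.exp(-d2 / (2 * bandwidth ** 2))      # kernel width controls the blurring
    return np.dot(w, control_values) / np.sum(w)

rng = np.random.default_rng(0)
controls = rng.uniform(-1, 1, size=(8, 6))      # 8 samples in the 6-D affine space
values = rng.normal(size=8)                     # feature responses at those samples
print(kernel_interpolate(np.zeros(6), controls, values, bandwidth=0.5))
```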
>>: I'm surprised that you need larger patches, because this introduces some level of blurring like the [indiscernible]
>> Pedro Domingos: No. It's not that we need…
>>: [indiscernible] more patches.
>> Pedro Domingos: It's not that we need more patches. It's that we can learn larger patches better. We can still learn the small patches as well as before. And remember, the amount of blurriness is controlled by us, and there's also an interesting relationship between n and the width of the kernel. If I want to pool a lot I actually need a small n, because having a larger n would basically be wasted information. More questions?
>>: I assume two layers will do better than one layer.
>> Pedro Domingos: Yeah.
>>: The reason why [indiscernible] isn't perfect [indiscernible]
>> Pedro Domingos: Even in this case two layers didn't buy you anything, because all you had was a global transformation, but here you don't have a global transformation, so the two layers pay off over one. We're actually running over time and I haven't even talked about -- it's good that we are having this discussion, but, I don't know.
>>: The neighborhood in the bigger area [indiscernible] is pooling more. So do you have this notion that you increase your k or your epsilon as the layers go up?
>> Pedro Domingos: It depends on which space you are thinking of. If you are thinking in terms of the original space, yes, the kernels keep getting wider. If you think about the reduced space, you can think of it as a kernel of the same size in a space that keeps shrinking. In fact it's equivalent, so you can think about it either way. Next steps: more layers, and distortions of things like ImageNet and whatnot. We were originally thinking of doing this in the context of sum-product networks, which is something that we have been working on. I actually think that you really only get the payoff of this when you combine it with something like sum-product networks, because you need [indiscernible] composition. There is actually something that ConvNets can't do, which, for example, deformable part models do and we can do, which is to say you have this object here. You have a face somewhere and then you have the location of the nose relative to the face. You can have an image where the background is moving and there is a bird flying and the bird is flapping its wings. What I'm interested in is the trajectory of the wings relative to the bird. I need to decompose the image into the bird and the background, and then the wings within the bird, and then decompose the transformations as they go through. Doing this in a ConvNet is very difficult. Nobody has done it and it's not clear how you do it, but a sum-product network of course naturally does that: it decomposes the image into parts and subparts and has distributions over them. Of course, we would also like to bring in other symmetries. The first obvious one is lighting, but then in 3-D, for example, let's suppose you have depth images -- or suppose you don't have depth images but you infer the depth. And then, you know, ultimately what we want to do is what you were suggesting: we don't even know what the symmetries are going to be, so we just allow a very large symmetry space and let the learning dictate what happens. If people have other ideas of things to do, we would love to hear them.
>>: Maybe we should get going.
>>: I know you are out of time, but can you give us a hint as to the symmetries you will be exploiting in the semantic [indiscernible]? Let's get to that.
>> Pedro Domingos: There's actually less detail in that part, because that research is not as far along, but I think it makes for a very interesting contrast with what we just talked about. So let me briefly say what the problem is and what our solution is. The goal in semantic parsing is to map sentences to logical formulas. I want to take a sentence in English and translate it into a logical formula that I can then, you know, query and do reasoning with and whatnot. This is useful for question answering, natural language interfaces, you know, controlling a robot by giving verbal commands and whatnot. You can think of it as a structured prediction task. The input is the sentence and the output is a structure; in syntactic parsing the structure is a parse tree, but here the structure is going to be the actual logical formula representing what the sentence says. This is a problem that's actually become quite popular in the last few years and [indiscernible] from here has actually done a lot of the work on this. But there are at least two very big problems in semantic parsing. The first one is that no one agrees on what that formal representation should be. Even for syntactic parsing there is already a lot of disagreement, but no one can agree on what the formal representation of what you read is; everybody has their own formalism. There have been some attempts.
>>: [indiscernible] particular application you can derive that.
>> Pedro Domingos: For certain applications, like querying databases, that language is given to you, but if your goal is to, say, read the web and form a knowledge graph -- something you might be interested in -- then you are not given that formal representation. That's just a fact of life, right? The other problem, which is as big or bigger, is that even if everybody had agreed on a formal representation, where would the labeled data come from? You need example pairs. You can't go and ask Mechanical Turkers to translate sentences into first-order logic, because they don't know how to do that. Labeled data is very hard to come by. You can try to do it with no labeled data; that is actually what [indiscernible] did. You can also try to use various kinds of partial or indirect supervision, but what we are proposing here is actually something that's quite different from all of this, which is to change the definition of semantics to not require an explicit formal representation, taking advantage of symmetry group notions. Here's the basic idea. Remember what a symmetry is: the symmetry of an object is a change in the object that leaves its fundamental properties unchanged. There's a natural mapping here. Our objects are sentences and the symmetries are going to be syntactic transformations that preserve the meaning. For example, synonyms, paraphrases, active versus passive voice, you know, composition like A does B, C does B, A and C do B. Bill wore shades. William donned sunglasses. Sunglasses were donned by William. These are all syntactic transformations that don't change the meaning. The semantics is the invariant here. Then you can answer a question, and if you do this correctly, you can get -- this is the same process that we were looking at before, except with a very different kind of noise variable. The thing that we're trying to get rid of here is the syntactic form. If we can do that, then we can answer questions like, were shades worn by Bill? Nobody ever actually said the sentence "Bill donned shades," but if we figure out that shades and sunglasses mean the same and donned and wore mean the same, et cetera, then we can actually answer that question, correct? So this is our notion of symmetry: a symmetry of a sentence is a syntactic transformation that leaves the meaning unchanged. The invariant we are looking for is now the meaning of the sentence.
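As a toy illustration of that notion of symmetry (not the actual system; the word lists and the passive-voice rule are invented and grossly simplified), composing a few meaning-preserving substitutions takes "Bill wore shades" to "sunglasses were donned by William":

```python
# Toy sketch of the idea (not the actual system): symmetries as meaning-preserving
# syntactic transformations that can be composed.  The word lists are illustrative.
SYNONYMS = {"Bill": "William", "wore": "donned", "shades": "sunglasses"}

def substitute(sentence, word, replacement):
    return [replacement if w == word else w for w in sentence]

def passivize(sentence):
    # Toy rule for 3-word "subject verb object" sentences only.
    subj, verb, obj = sentence
    return [obj, "were" if obj.endswith("s") else "was", verb, "by", subj]

s = ["Bill", "wore", "shades"]
for w, r in SYNONYMS.items():
    s = substitute(s, w, r)             # compose synonym symmetries
print(" ".join(passivize(s)))           # -> "sunglasses were donned by William"
```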
>>: [indiscernible] words change, phrases change?
>> Pedro Domingos: Yeah. Everything.
>>: So it's small changes.
>> Pedro Domingos: Yeah. And notice, by the way -- so, one step at a time. Synonyms, paraphrases, all syntactic changes that don't change the meaning are symmetries, and you can compose them, right? I can change Bill to William and wore to donned and, you know, passive to active voice. I can do things one on top of another. What is the meaning of a sentence if we have this notion of symmetry? Well, there is one more notion from group theory, which is the notion of the orbit of an object under a symmetry group. The orbit of X under the symmetry group G is the set of all objects that X is mapped to by symmetries in G. Take my object X, apply every single symmetry to it, and I get a set of objects: for example, a face at all translations, rotations and scalings, or a certain sentence in all the different ways that it can be said. This is the orbit of an object, and the orbits of the objects in a set partition the set, because if I can map A to B and B to C, then I can also map A to C. In particular, my language, the English language, is going to be partitioned into orbits by the syntactic transformations, and one orbit is going to be all of the sentences with the same meaning, all the different ways of saying the same thing. So there is a one-to-one correspondence between orbits of sentences and meanings, which means that I don't need to explicitly construct the meaning of a sentence in first-order logic to figure out what the meaning is. The meaning is implicitly represented by the orbit that the sentence belongs to under the symmetry group of the language. There's going to be a symmetry group of English, a symmetry group of French, a symmetry group of Chinese, et cetera. Now, the orbits in this case -- and in many other cases too, but here very prominently -- are going to have compositional structure. The orbit for a sentence is going to be composed of the orbit for the subject and the orbit for the event and the orbit for each of the arguments of the event, including the agent and the object and so on and so forth, the time, the place. Yes?
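A minimal sketch of the orbit idea, under the simplifying assumption that the generators are a handful of word-level swaps: the orbit is just the closure of a sentence under composition of the generators, computed here by breadth-first search:

```python
# Sketch (under simplifying assumptions): the orbit of a sentence under a set of
# generator transformations, computed as a closure by breadth-first search.
from collections import deque

def orbit(sentence, generators):
    """Return the set of sentences reachable from `sentence` by composing generators."""
    seen, queue = {sentence}, deque([sentence])
    while queue:
        current = queue.popleft()
        for g in generators:
            image = g(current)
            if image not in seen:
                seen.add(image)
                queue.append(image)
    return seen

# Generators: illustrative word-level synonym swaps (each is its own inverse).
def swap(a, b):
    return lambda s: tuple(b if w == a else a if w == b else w for w in s)

gens = [swap("Bill", "William"), swap("wore", "donned"), swap("shades", "sunglasses")]
print(orbit(("Bill", "wore", "shades"), gens))   # all 8 ways of saying the same thing
```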
>>: When you say the meaning may not be explicitly represented -- doesn't it depend on what you want to do with those orbits? If you just want to match one sentence to see if it was expressed in your knowledge base, that's true, but if you need to chain two sentences together in reasoning, then you need some representation.
>> Pedro Domingos: Funny you should ask that question. Obviously, a very good question. I have more on this, but let me give you a partial answer. If you are trying to do extended question answering, where the answer to your question is in your knowledge base except in a different form, then this will work, because you just need to figure out that these two guys are in the same orbit and you're done. If you need to do chaining -- let's suppose that I want to chain, since that's the interesting case and that's what logic is for -- what do you do in that case? Well, wait 30 seconds, okay?
>>: I come back to the question of [indiscernible] transform. There seems to be more of a damaging thing, right? If you have [indiscernible] the knowledge base which supports more than your question, you still want to say that my question can be projected onto this fact. [indiscernible] transform.
>> Pedro Domingos: To be very precise, what we're looking at here is a group of permutations, where you permute -- we probably don't have time to get into this, but you know how every symmetry group is, you know, structurally equivalent to a permutation group. What we are doing here is -- think, for example, of synonymy. What we have is a permutation group among the different words, right?
>>: [indiscernible] this notion that we can [indiscernible] and come back to the structured points
compared to like a…
>> Pedro Domingos: It depends on whether you want to lose information or you don't want to lose information. If you don't want to lose information, then what you have to do is keep track of the permutation that you just did: this guy migrated to that guy, that guy migrated over here, and this guy became this guy. How do we do parsing? This actually follows on to some of the questions that you were asking before. The goal of semantic parsing is now to find the most probable orbit of a sentence. If I find the most probable orbit of the sentence, that actually allows me, for example, to answer the question of whether something is true according to the knowledge base or not. What this is going to be in practice is mapping my query to each of the sentences in the knowledge base, or to each of a set of sentences that are equivalent to them by composing symmetries that I have previously found. If I cannot do this, then I need to create a new orbit, because I have a sentence with a new meaning that was not present before. Again, this can be done efficiently by taking advantage of the compositional structure of the orbits. The inference machine is going to be a sum-product computation at the end of the day, where a union of orbits becomes a sum and a composition of orbits becomes a product, but I'll just leave it at that very high
level of detail. How do we learn? What is the goal of learning? The goal of learning is to discover the symmetries of the language. The natural data for this is pairs of sentences with the same meaning, something that now anybody can provide. I can ask Turkers to tell me whether two sentences mean the same, or maybe even better, to restate a sentence in their own words. Now nobody needs to worry about logic anymore. I can get a corpus like this maybe also from things like multiple news stories about the same event, or corpora of paraphrases that I already have, so there's now lots of different data that I can use. What I am trying to get out of this is the transformations that leave the meaning unchanged. What I would like to have at the end of the day is not a huge pile of transformations; it's some sort of minimal generating set, such that with these transformations you can produce all variations of the same meaning. I want to learn, for example, that sunglasses means the same thing as shades and donned means the same thing as wore. I don't want to have to separately learn that wore glasses means the same thing as donned shades and whatnot, and then compose that with the active-passive and so on. So the structure is going to be the generating set that we find. The parameters are going to be the probabilities of the orbits: what is the probability that, given this meaning, I'm going to say it this way, with this word, with the active or passive voice, et cetera? My search -- this is kind of structure learning, like learning the structure of a graphical model. My initial state could be empty, right -- I don't know any symmetries -- or I could actually initialize this with any symmetries that I know. I could initialize it with WordNet. I could initialize it with a list of known paraphrases. I could initialize it by saying, hey, look, there's this thing called active-passive and you can transform sentences in this way. And then what are our search operators? They are things like filling gaps in the derivations: I know that these two sentences are the same, and I can make this sentence be this and this sentence be this, but now I have a gap. Let me postulate that you can transform this one into this one, so that I can transform this sentence into that sentence by composing these things. And because this at the end of the day is a probability distribution over meanings and the sentences that you say given the meaning that you want, we can use all the usual probabilistic learning machinery, like having penalized likelihood as a score function and whatnot.
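To give a flavor of that learning signal, here is a deliberately crude sketch: from pairs of same-meaning sentences, propose single-word substitutions as candidate symmetries and count how often each occurs. The real search (gap filling, composition, penalized likelihood) is much richer than this, and everything here is an assumption made for illustration:

```python
# Minimal sketch of the learning signal, under strong simplifying assumptions:
# from pairs of same-meaning sentences, propose word-level substitutions as
# candidate symmetries and score them by how often they occur.
from collections import Counter

def candidate_symmetries(paraphrase_pairs):
    counts = Counter()
    for s1, s2 in paraphrase_pairs:
        if len(s1) != len(s2):
            continue                      # toy alignment: positions must match
        diffs = [(a, b) for a, b in zip(s1, s2) if a != b]
        if len(diffs) == 1:               # a single difference is a clean generator
            counts[frozenset(diffs[0])] += 1
    return counts

pairs = [(("Bill", "wore", "shades"), ("Bill", "wore", "sunglasses")),
         (("he", "wore", "shades"), ("he", "donned", "shades")),
         (("she", "wore", "glasses"), ("she", "donned", "glasses"))]
print(candidate_symmetries(pairs).most_common())
```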
>>: I just want to ask you [indiscernible]
>> Pedro Domingos: What I just described is kind of supervised learning, because it's supervised by pairs of sentences with the same meaning. We've also looked at doing unsupervised learning, more in the style of [indiscernible].
>>: But of course there are a lot of words. [laughter].
>> Pedro Domingos: Exactly.
>>: How do you parameterize all of the orbit probabilities?
>> Pedro Domingos: We don't represent them all explicitly. This is actually what is interesting about this problem: it is like the difference between structured prediction and classification. In classification you have two classes or five. In structured prediction every parse tree is a class; same thing here. I have this vast number of things. How do we handle it? By factorizing that space into pieces. I'm going to have a compact representation of that space of orbits. I'm not going to list all the orbits. What I'm going to say is, well, the orbit for a sentence is composed of three things, to simplify: the orbit for the subject, the orbit for the event and the orbit for the object.
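A small sketch of that factorization, with made-up component orbits: instead of enumerating whole-sentence orbits, test whether two sentences have the same meaning component by component:

```python
# Hedged sketch of the factorization idea: represent the orbit of a sentence as a
# product of component orbits (subject, event, object) and compare componentwise.
SUBJECT_ORBITS = [{"Bill", "William"}]
EVENT_ORBITS   = [{"wore", "donned"}]
OBJECT_ORBITS  = [{"shades", "sunglasses"}]

def orbit_id(word, orbits):
    """Index of the orbit containing `word` (an unseen word is its own orbit)."""
    for i, o in enumerate(orbits):
        if word in o:
            return i
    return ("singleton", word)

def same_meaning(s1, s2):
    parts = (SUBJECT_ORBITS, EVENT_ORBITS, OBJECT_ORBITS)
    return all(orbit_id(a, o) == orbit_id(b, o) for a, b, o in zip(s1, s2, parts))

print(same_meaning(("Bill", "wore", "shades"), ("William", "donned", "sunglasses")))  # True
```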
>>: Now you do have representation for [indiscernible]
>> Pedro Domingos: No. Again -- well, I do, but it's not -- in the representation, whenever I find a pair, I'm not saying I postulate this. I mean, I could. What I'm saying is this is what I'm going to end up with. I'm going to end up with a structure of, well, a sentence can be active or passive, and if it's active, right -- but again, the active-passive is induced, right? I'm calling it active-passive, but hopefully it discovers this by itself. The point is that we know how to do this, right? This is what we do for a living. We are representing a very large space very compactly by using factorization.
>>: You are talking about a high-level grammar [indiscernible]
>> Pedro Domingos: Interesting point. In a way this is a grammar, but it's more of a transformational grammar as opposed to a context-free grammar. Again, you can imagine having any transformation in this guy, but you also need to keep the learning and the inference tractable, so you might want to make some compromises there.
>>: [indiscernible] the first part of the thing you talked about, of no sort of definition, no sort of representation for the logical form.
>> Pedro Domingos: There's a couple of ways to look at this. One of them is that I never defined the formal representation language here. What did I learn? I learned a bunch of transformations, and now those transformations together induce a set of orbits, and that's a representation. I didn't say anything at the outset. This can start with absolutely no…
>>: There's no logical form anymore.
>> Pedro Domingos: Yeah, there's no logical form anywhere. You could now look at this orbit structure and say, well, if I tell you which orbit you are in, that's a kind of [indiscernible] representation, but it was induced from the data, in the same way that [indiscernible] those representations from there. We've done some brief experiments on this. Let me just mention -- I think this is an answer to Matt's question -- a lot of people do traditional language semantics, like Montague semantics and whatnot, based on first-order logic. The point is to then be able to do inference, meaning to chain things, and so far we haven't said how you would do that. But let me just speculate on how that might happen. If you think about it, logical inference rules like modus ponens are symmetries of knowledge bases, because they are syntactic transformations of a knowledge base that don't alter its meaning. This is precisely what logical inference means: to change the form of the knowledge base without changing its meaning. These things are all very well defined in the case of logic. For example, suppose I have the knowledge base "Socrates is a human" and "all humans are mortal," and now I add the sentence "Socrates is mortal." This is syntactically different but it's semantically the same. In fact, even this type of inference still falls under our general heading of symmetry-based semantics.
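The Socrates example can be phrased as a tiny, speculative sketch: take a knowledge base's meaning to be its deductive closure, so adding an entailed fact is a symmetry because the closure does not change. The rule encoding here is an assumption made purely for illustration:

```python
# Speculative sketch matching the Socrates example: modus ponens as a knowledge-base
# symmetry, i.e. a syntactic change that leaves the KB's deductive closure unchanged.
def closure(facts, rules):
    """Forward-chain ground Horn rules (body_atom, head_atom) to a fixed point."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for body, head in rules:
            if body in facts and head not in facts:
                facts.add(head)
                changed = True
    return facts

rules = [("human(Socrates)", "mortal(Socrates)")]   # ground instance of "humans are mortal"
kb1 = {"human(Socrates)"}
kb2 = {"human(Socrates)", "mortal(Socrates)"}       # syntactically different KB

# Same closure, hence semantically the same: adding the entailed fact is a symmetry.
print(closure(kb1, rules) == closure(kb2, rules))   # True
```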
>>: So in some sense the axioms are like generating [indiscernible] and generating
[indiscernible] and the neighbor functions.
>> Pedro Domingos: I think that's a good intuition, but there's more than that here, because before, we were -- you need some notion of, oh, I can have a knowledge base with multiple sentences. The thing that we were doing before was applying symmetries to a single sentence. Now we are applying symmetries to sets of sentences, and you need something additional -- you need a parser to generate a set of sentences from a single sentence. But that's okay, right?
>>: So the direction that you are going in is that if you have sentence orbits, you want to be able to have operations that take two sentence orbits and produce a third sentence orbit that sort of has the same meaning as those two sentence orbits.
>> Pedro Domingos: In particular, exactly. I can think of an orbit over triples of sentences, for example, where the third sentence might be empty. What is the orbit going to be here, right? I want to have an orbit that includes this pair of sentences but also this triple of sentences, okay? Now how did I discover that? Because hopefully in my training data I have a bunch of things like this that allowed me to figure out that, hey, "Socrates is mortal" is really a redundant statement, because sets of sentences with this additional sentence, with something more general than Socrates and something more general than mortal, get aggregated into the previous orbit as one. So we have this union of orbits and then decomposition, right? We have a union between orbits with these three sentences and orbits with these two sentences. Then each of those orbits in turn is composed of the product of the orbit for this guy and the orbit for this guy: all the different ways of saying "Socrates is human," all the different ways of saying "humans are mortal," composed together. Then you have all the different ways of saying these three things. Now, this recursive structure of the orbits is very powerful, right? I'm going to have all the different ways of saying this, and I'm going to have, well, if you have these guys, you could also have this guy at this junction. I have the union of the orbits of this form and the orbits of all three, but if I take this guy out, then no, that's a different orbit. And again, hopefully -- this is speculative -- we can actually do the same job that logical inference does, except that we never actually have a formal meaning representation.
>>: How -- those two are in the database, and then you do all this other stuff. How does it execute? Does it mechanically search over the sentences?
>> Pedro Domingos: Modus ponens -- the rule that I applied here was modus ponens. The first question is how you induce modus ponens. If you give me a bunch of [indiscernible] in logic, a bunch of knowledge bases, I would need training examples of that form: for example, this knowledge base is equivalent to this knowledge base, and the knowledge base that says that Aristotle is a human, blah blah blah, is also, right? From all those examples I would induce a symmetry, and that symmetry would be modus ponens. Now that I have that symmetry, I can apply it to transform one into the other.
>>: Okay. So [indiscernible] paraphrases.
>> Pedro Domingos: Yeah, exactly. If you think of this in terms of paraphrasing, paraphrasing is too incomplete a process, right? I mean, logical inference is a kind of paraphrase inference.
>>: You probably have noticed this in your modeling, but I'm just wondering, because language is nasty, so for [indiscernible] synonyms or paraphrases there is [indiscernible] context when you say that, so sometimes [indiscernible] and sometimes not. It's not as clean as you actually wrote [indiscernible]
>> Pedro Domingos: A very good question. There are two parts to the answer. The first one is that we are making some simplifying assumptions here, in the same way that naive Bayes makes simplifying assumptions and works very well. So I'm pretending so far that things are not context dependent. We can probably get some mileage out of that -- we've gotten some already -- but of course that's not going to be the end. So what is the end? It's going to be one of two things, or maybe a combination. One of them is to say we need to condition this on features. In the same way that you can do discriminative parsing, you can say I'm going to do this whole process not generatively but conditioned on, for example, the bag of words in the neighborhood of the transformation that I'm applying. So you condition it on the context. This is one way to do it, which, for example, for syntactic parsing has worked very well, so it's reasonable to expect it would work here as well. The other answer, and in some sense the deeper answer, is to say that if what you have is a symmetry that depends on context, the symmetry is not just between those two items; it's between the items and their contexts. It's saying that A in context C can be transformed to B in context C, so it's no longer A can go to B; it's A,C can go to B,C. Now you pay a price for that in terms of computational complexity and, you know, sample size, blah blah, but certainly at the level of theory there's no reason why context can't be taken into account here. Sorry, you were first.
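A toy sketch of that deeper answer, with invented rules: the symmetry is keyed on context, so a substitution only fires when its context predicate holds:

```python
# Sketch of the "deeper answer": a symmetry keyed on context, so the rule is
# (A in context C) -> (B in context C) rather than just A -> B.  The rules and
# the context test are illustrative only.
CONTEXTUAL_RULES = [
    # (source, target, context predicate over the rest of the sentence)
    ("wore", "donned", lambda ctx: any(w in ctx for w in ("shades", "sunglasses", "hat"))),
]

def apply_in_context(sentence):
    out = list(sentence)
    for i, word in enumerate(out):
        context = out[:i] + out[i + 1:]
        for src, tgt, applies in CONTEXTUAL_RULES:
            if word == src and applies(context):
                out[i] = tgt
    return out

print(apply_in_context(["Bill", "wore", "shades"]))          # substitution fires
print(apply_in_context(["Bill", "wore", "himself", "out"]))  # context blocks it
```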
>>: Isn't this kind of inference being proposed by [indiscernible] where you don't really know
formal [indiscernible] know if the first [indiscernible] and the second sentence, so what are the
comments you can picture [indiscernible] and why didn't you apply that technique of
[indiscernible]
>> Pedro Domingos: Great. Exactly. That's a good observation. This actually has a lot in common with what people do for textual entailment, but there are a couple of important differences. One is that the schemes people have come up with for textual entailment are actually very ad hoc: it's like, map your sentence to a graph and then you try to, right? Part of what we would like to do here is take a lot of these things that people have done in areas like semantic parsing and textual entailment and have a nice clean theory that encompasses them, which is one aspect that didn't exist before. In a way, what we're trying to do is generalize and formalize things that people did before, for example, textual entailment. There are even things like, you know, transformation-based learning that Eric Brill had, and others, so there's actually a lot of stuff there, right? But the other thing is that the textual entailment problem is a little bit different from the one that we're solving, in that what the textual entailment problem is trying to do is not transform something into something equivalent; it is to derive one sentence from another small set of sentences. Textual entailment is like, I give you a paragraph and I want to know whether the sentence follows from the paragraph. What happens in textual entailment is that the thing that limits your performance is lack of world knowledge. To do textual entailment properly, in addition to your paragraph you actually need to have a knowledge base. I don't think you can solve textual entailment the way people have been going about solving it, saying I am just going to map the paragraph, you know, to the sentence. You need to do it in the presence of a large knowledge base, and then the techniques that people use for textual entailment are not going to work very well, but this potentially might. You had a question?
>>: What I still don't understand is the question about composition and [indiscernible] and coupling this with the context. I think it's more complicated than conditioning on context. If you have something like Obama visited London and the President visited London, you can infer some kind of transformation from Obama to President. But here [indiscernible] the president's visit is going to be, like, 1945; then you can't do this transformation in that case. But in the [indiscernible] there is no context [indiscernible] on. It's just like some…
>> Pedro Domingos: Good question. The problem that you are pointing toward is actually a different one that we have thought about, which is that there are multiple levels of abstraction at which you could talk about things. You could talk about Obama. You could talk about the president of the United States. You can talk about presidents in general. You can talk about Obama today. What happens is that you can say something at one of these levels that implies different things at different levels, but they are not equivalent. But again, we have a very nice way to represent this. Think of a class of objects versus an object: the class of U.S. presidents versus Obama in particular. What happens is there are additional symmetries. As you add symmetries you get larger classes. For example, the name Obama picks out one president, but the U.S. president in general is irrespective of the name. It could be Obama. It could be Bush. It could be something else.
>>: [indiscernible] transformation. It's not a symmetry.
>> Pedro Domingos: Yeah, but that's okay. We just need to be able to do this reasoning at whichever level you choose. Again, if you think about it in terms of the whole knowledge base, it is a symmetry of the whole knowledge base. If I say Obama visited London, that implies, right, because president subsumes Obama, that the president of the United States visited London, and so forth.
>>: Okay. I like it.
>> Pedro Domingos: All right.
>>: Thank you so much Pedro.
>> Pedro Domingos: That was good. [applause].