>> John Platt: So I'm very pleased to present Pedro Domingos, who is a professor at the University of Washington. He's here with his student Robert Gens, also from the University of Washington, and we've had a long relationship with them. Many of his former students work here at MSR, and he gave a really intriguing talk at the International Conference on Learning Representations, so I invited him to come and tell us updates on that work. >> Pedro Domingos: All right. Thank you all for being here. Thanks to John for bringing me. I will try to make this worth your while. What I thought I would talk about here is a highly speculative new research direction that we are pursuing. It's called symmetry-based learning, and this is work that I have been doing with Rob Gens, who is sitting right here and is going to be starting an internship here next week, and Chloe Kiddon. Fair warning, this is work in progress, so a lot of it is not very mature yet. This is the fanciest slide in the whole talk. [laughter]. We do have some experimental results that we will show, but I think that some of this is definitely very much still in the idea stage. Here's what I'm going to try to do. I'll start with a little bit of motivation. Symmetry-based learning is learning that takes advantage of symmetry group theory, so I'll first try to say why that might be an interesting thing to do. Then I will give a very brief background on symmetry group theory for those of you not familiar with it. Then I will talk about two main things, two applications of symmetry group ideas. The first one is deep symmetry networks, so this is stuff that Rob has been working on that we just submitted to NIPS. This is an application of symmetry group theory to generalizing things like convolutional neural networks to allow for more symmetry than just translation symmetry, which is what a ConvNet has. Even more speculatively, I will talk about symmetry-based semantic parsing, which is what Chloe Kiddon has been working on, where what we try to do is solve some of the long-standing problems in semantic parsing, which is the problem of going from sentences to their meaning, using ideas from symmetry group theory. Then if time allows I will conclude with a little bit of discussion, but please feel free to ask me questions at any point. Here's one way to motivate this. You could argue that the central problem in machine learning is learning representations. We know how to do a lot of things very well, but what is limiting what we do is that the representations, in essence, have to be predesigned. If they aren't, we don't really know how to learn them that well, and this is what is really limiting the power of our learners. If we could solve the problem of learning representations, then we would have really powerful learners. Imagine what those could do given what the ones that we have today can do. This, I think, a lot of people would agree with, at least that this is an important problem to look at, wherever exactly it falls on the ranking of important problems in machine learning. What I would like to suggest here is that if what we want to do is learn representations, a very natural foundation for that is symmetry group theory. A number of people over the years have applied symmetry group theory in various ways in machine learning, and also in vision a lot, which is where some of our applications are, but I think we could actually take that a lot further than we have so far.
As a very simple introductory example, let's think about symmetry in geometry, because usually when people have intuitions about symmetry they come from geometry. What is symmetry in geometry? A symmetry of an object is a transformation that maps the object to itself. For example, a square has eight symmetries: rotations of 0, 90, 180 and 270 degrees and reflections on the four axes that we see here. One interesting property of these symmetries is that we can compose them, and whenever we compose them we just get more of the same symmetries. For example, I can compose a rotation of 90 and one of 180 to get one of 270. I can compose two reflections and what I get is a rotation, and so on and so forth. Symmetries can be composed, right? Another interesting thing that happens is that there is an identity symmetry. There's an identity transformation that doesn't do anything, namely it leaves the square the same, which in this set of symmetries is going to be the rotation of 0 degrees. The next interesting thing that happens is that each symmetry has an inverse, so if I rotate by 90 and then I rotate by 270, which is the same as -90, I get back my identity. I can compose symmetries and the composition is associative. Generalizing from that, a symmetry group is a group, and a group follows the group axioms, which are -- in general, we have a set, whatever it might be, and then a binary operation on that set; let's call it product, but it could be anything else. The group has four properties. The first one is closure: when I take x·y, if x and y are in the set, the result is also in the set. Identity means there is an element e such that when I combine it with anything, before or after, I get that same thing, so it's an element that doesn't change anything. Inverses means that for every x there's an x⁻¹ such that when I combine them I get the identity element, and associativity is the obvious thing. A lot of things are groups; a lot of things that we deal with every day are groups. For example, the real numbers with addition form a group. And a lot of things follow from just these group axioms. Symmetry groups are a particular type of group, but they are a particularly powerful type of group. What happens in a symmetry group is that the group elements are functions. The group elements are no longer, in some sense, the objects; they're functions that we are going to use to operate on objects. The group operation is function composition, so the operation is composing functions. Symmetry groups can also be continuous. The symmetry group that I just showed you was a discrete one. There were eight symmetries of the square, but in general I could have continuous symmetry groups. Those are usually called Lie groups in honor of the Norwegian mathematician Sophus Lie who introduced them. Again, the standard example is the symmetries of a circle. A circle can be rotated through any angle and it can be reflected on any diameter, so now there's an infinite number of symmetries and they form a two-dimensional space. Getting on to more interesting things, Euclidean space itself has symmetries, for example, translations, rotations and reflections. If I take, for example, the Euclidean plane and translate it, what I get at the end of the day is still the Euclidean plane, or rotate it or reflect it, et cetera, et cetera. These symmetries have the property that they preserve the distance between any pair of points.
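As a concrete illustration of the group axioms just described, here is a minimal sketch (my own, not code from the talk) that represents the eight symmetries of the square as 2x2 matrices and checks closure, identity, and inverses:

```python
import numpy as np
from itertools import product

def rot(k):
    # rotation by k * 90 degrees as a 2x2 matrix
    c, s = [(1, 0), (0, 1), (-1, 0), (0, -1)][k % 4]
    return np.array([[c, -s], [s, c]])

reflect_x = np.array([[1, 0], [0, -1]])           # reflection about the x axis
rotations = [rot(k) for k in range(4)]
reflections = [r @ reflect_x for r in rotations]  # the four reflection axes
group = rotations + reflections                   # the 8 symmetries of the square

in_group = lambda m: any(np.array_equal(m, g) for g in group)

# Closure: composing any two symmetries gives another symmetry in the set.
assert all(in_group(a @ b) for a, b in product(group, repeat=2))
# Identity: rotation by 0 degrees leaves everything unchanged.
assert np.array_equal(rot(0), np.eye(2))
# Inverses: every symmetry composed with some symmetry gives the identity.
assert all(any(np.array_equal(a @ b, np.eye(2)) for b in group) for a in group)
```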
Euclidean transformations preserve distances between points, because if they don't, then they are not part of the Euclidean group anymore. Let's try to see more generally what symmetry is and why it would be interesting to us in machine learning. More generally, a symmetry of a function -- let's think of symmetries not of objects like squares or circles or even Euclidean planes; let's think of symmetries of a function. A symmetry of a function is a change in the input that doesn't change the output. Here's my function. It takes x, you know, to f of x, and what I'm saying is, let's say that I first transform x by s, so s is some function of x. If for all x's, f of s of x is the same as f of x, then s is a symmetry of this function. Sometimes people talk about symmetries of objects and symmetries of functions, and the two are really related, because a symmetry of a function is a symmetry of the objects that you apply it to, and a symmetry of an object really has to do with a function that is being preserved: what is the property that is being preserved? But for our purposes I think it's useful to think of it in terms of a symmetry being a property of a function. If this is the case, then it almost jumps out at you that this is going to be relevant to us in machine learning, because we are interested in functions and transformations of the examples that go into them and so forth. In this general guise, the symmetry… Yeah? >>: You mean in your example it was just [indiscernible] speaking for example the square? Is it for any x or for a single x? >> Pedro Domingos: It's for a particular function. >>: And then any x, right? >> Pedro Domingos: Yes. One way to think about this is that the function is your object. You can have symmetries of any object. The object could be a square. It could be a function. For example, in physics it can be a Lagrangian. Physicists do everything with Lagrangians, so now we're interested in symmetries of the Lagrangian, and it turns out that the symmetries of the Lagrangian are conservation laws, things like conservation of energy, et cetera. In fact, symmetry group theory pervades physics: relativity, quantum mechanics, the standard model. Just about everything in physics these days, you know, down to string theory and supersymmetry, these are all theories based on, or can be formulated very elegantly in terms of, symmetry. So symmetry is very pervasive in physics. It's also very pervasive in mathematics, so symmetry group theory is one of the most intensively studied areas in mathematics, and part of the reason is that it shows up everywhere. You go to almost every branch of mathematics and you will see proofs of theorems that are based on symmetry. Just to take a random example, think of game theory and the Minimax theorem. The most elegant proof of the Minimax theorem uses symmetry. It also appears in things like optimization and search, so we're getting closer to machine learning. When we're searching, if we notice symmetries in the space, that reduces the size of the search space. For example, if I'm trying to figure out how to play tic-tac-toe, and I realize that all symmetric board positions are the same, then that actually reduces my branching factor and makes things a lot more efficient. In optimization there are ways to take advantage of symmetry. It's used, for example, in model checking.
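To make the tic-tac-toe example concrete, here is a minimal sketch (my own illustration, not from the talk) that folds board positions together under the eight symmetries of the square, so symmetric positions are counted only once and the search space shrinks:

```python
def rotate90(board):
    # 90-degree clockwise rotation of a 3x3 board stored row-major: new[r][c] = old[2-c][r]
    return tuple(board[(2 - i % 3) * 3 + i // 3] for i in range(9))

def reflect(board):
    # reflection about the vertical axis: new[r][c] = old[r][2-c]
    return tuple(board[(i // 3) * 3 + (2 - i % 3)] for i in range(9))

def canonical(board):
    # all eight images of the board under the square's symmetries; keep the smallest
    images, b = [], board
    for _ in range(4):
        images += [b, reflect(b)]
        b = rotate90(b)
    return min(images)

# Two symmetric opening moves collapse to the same canonical position,
# so a game-tree search only has to consider one of them.
corner_a = tuple('X' if i == 0 else '.' for i in range(9))   # X in the top-left corner
corner_b = tuple('X' if i == 8 else '.' for i in range(9))   # X in the bottom-right corner
assert canonical(corner_a) == canonical(corner_b)
```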
A lot of people who do software verification, hardware verification, they have these very large systems to check and, again, if you can detect symmetries -- they don't necessarily call them symmetries, but that's what they are -- things can become much more efficient. It's made, of course, a lot of appearances in vision, because in vision there are a lot of obvious symmetries, from texture to things like, of course, rotational symmetries of rigid objects and so on and so forth. It's also of late made a lot of inroads into probabilistic inference, which is actually how I first got interested in it. If you have very large probabilistic models, inference in them is very hard, but if those models have a lot of symmetries, meaning, for example, you have a factor graph and a lot of factors are copies of each other, as you do when the factor graph comes from a Markov logic network, then you can actually use the symmetries to have a cost of inference that is only proportional to the number of these aggregated objects that you have as opposed to all of the individual variables. A lot of ideas here, it turns out, can be made much simpler and more general by casting the whole problem as finding and exploiting symmetries in your graphical models and so on. However, in machine learning symmetry has made some appearances but not that many. Some of the appearances I think are actually quite remarkable, but when people think of important topics in machine learning, symmetry group theory doesn't usually come up. However, if you think about the definition of symmetry, it kind of jumps out, at least for me, that this is going to be relevant to us in the following way, or at least this is one way. Remember, a symmetry of a function is a change in the input that doesn't change the output. If the function is a classifier and the change in the input is a representation change of the examples, now we have the representation learning problem cast as a symmetry problem. Now your function is a classifier; f of x is the class of x. x is your example. It could be an image. It could be whatever you want, the description of a customer or something, and your symmetry is a representation change. It's taking your examples and changing them to some other representation such that the function applied to the new representation still has the same value as the function applied to the original one. The problem of learning with representation change seems to map very nicely to this definition of symmetry. To make the point more generally, what is really the goal in representation learning? The problem that we are always faced with in machine learning, and in particular in vision and speech and language and whatnot, is that there are a lot of variations in the data that obscure the thing that we are trying to get at: variations in pose, variations in lighting, variations in all sorts of things, camera parameters and so on and so on. If we could somehow get rid of all those variations, then the thing that we really wanted to get at, like, for example, the class of the object -- is this your grandmother or is this not your grandmother? -- would become much easier to learn, and probably at that point a very simple classifier would do the job quite successfully. If you think of this as the goal, we want to separate out the important variations from the unimportant ones. Then your important variation is the target function. It's what you are trying to predict.
Is this a chair or is this a table? Is this a cat or is this a duck? If the important variation is the target function, this f that we are trying to learn here, then the unimportant variations are the symmetries of the target function. They're the things that I should be able to change in my input without changing the output. For example, I should be able to take my object and rotate it, and if it was a chair before, it should still be a chair after. And if I bring it in closer or I distort the image in various ways, I should still be okay. Things like pose invariance and lighting invariance are symmetries of that which we are trying to recognize. The same thing that I'm doing here with the vision example, you can probably take any domain and think of what the variations are, and then you can think of them as what you want to get rid of, and then you can think of them as symmetries. Now if we do this, if we exploit symmetry in order to do representation change in order to learn classifiers, I think there's a whole series of benefits that we get out of it. One very important one, and in some ways the most obvious one, is that it reduces our sample complexity, in the same way that in search and in optimization recognizing symmetries reduces the size of the search space. One way to think about it is that the curse of dimensionality can be overcome by the blessings of symmetry. Symmetry reduces the size of your search space. The same thing happens in machine learning: symmetry reduces the size of your instance space. You can have a very high dimensional space, but if you can fold it all using symmetry back down to a low dimensional space, then in that low dimensional space the same number of samples goes a much longer way. The sample complexity that you need, the size of sample that you need to ensure generalization and so forth, goes down, potentially a lot. This is one benefit of exploiting symmetry. Another one, which we are going to see here shortly, is that we can take a lot of our learning algorithms that we know and love and come up with more powerful, more general versions of them in a very straightforward way. This is a way to take learners and make them more powerful. We can probably get new formal results out of this as well, and maybe even a new kind of formal results. This is still a speculative notion, but this I think is another one. Here's another interesting one. Deep learning is very successful and very popular these days. Symmetry learning suggests another way to do deep learning, which is to do deep learning by composing symmetries. If you think about what the successive layers of a deep network are doing, for example in a ConvNet, what you are doing is getting successively higher degrees of translation invariance. What is that doing? It's composing small translations into larger ones. We can take the same idea and make it more general. Yeah? >>: Do you think of symmetry as something that is global, or is it quantifiable? For example, to say what I mean: you are trying to recognize figures, so we say invariance to rotations, and up to some degree of rotation there is symmetry, but if I were to take 180 degrees it might break. >> Pedro Domingos: Precisely. >>: [indiscernible] this direction? >> Pedro Domingos: No. Of course. I think of both a direct answer to your question and a more general one. The answer is of course. This is part of what we're trying to learn, for example.
Whether a 2 or a 6 is invariant under a 30-degree rotation; it's not invariant under a 180-degree rotation. On the other hand, a zero is invariant under a 180-degree rotation, so this is what we want to learn. We're not just going to say that certain things are globally invariant, because they almost never are. There might be some, and those are actually going to be easy, right? But what we want to do is figure out where in symmetry space our objects can be. >>: [indiscernible] proof theory, in proof theory there is no notion of [indiscernible] >> Pedro Domingos: Of course. Yes. Which brings me to the larger answer, which I'm going to elaborate on a little bit later, which is that there's, I think, a reason why symmetry group theory, even though it's potentially relevant to machine learning, hasn't had a big impact yet, which is that symmetry group theory by itself actually doesn't solve any problems. It has to be combined with other things like, for example, distributions over symmetry space and approximate symmetries and a whole lot of stuff. But we already know how to do that stuff. There are a lot of people who know how to do symmetry group theory very well. What we need to do is combine the two; symmetry group theory by itself is not going to solve the problem for us. >>: So why is something with symmetry [indiscernible] invariance which [indiscernible] let's say [indiscernible] and those are interesting problems, so where does symmetry -- so you put a restriction by saying it [indiscernible] symmetry. What does [indiscernible] >> Pedro Domingos: Yeah. Fair point, but as usual what happens is that if you don't make any restrictions you can't do anything. You have to make restrictions, meaning some assumptions. You have to make some assumptions. Another question that you have to ask yourself is, if I make these assumptions, how much does it buy me? You lose something by making these assumptions, and the argument that you should always make, if you are going to do it, is: well, I don't lose a lot and I gain a lot, or I gain a lot more than I lose. I will definitely not say that symmetries will do everything for you. What I'm saying is there is a lot of mileage that we can get out of them and a lot of machinery that we can exploit. I think this is definitely not going to be the end of the story. People, for example, in vision have things like symmetric components and antisymmetric components and whatnot, but I think, you know, this is part of the road that we have to travel. Yep? >>: So compared to the old-school way of doing things, which is like [indiscernible] examples [indiscernible] class [indiscernible] >> Pedro Domingos: Yeah. I'm glad you asked that question, because that was actually one of the motivations for what we're doing. Seeing all of this: you know, I have a data set of 100,000 examples and then I have a data set of 100 million examples, because I need those distortions in order to be able to then learn something that is invariant to those distortions. Those distortions shouldn't have to be in the data, with the corresponding cost of processing. It should be your network that figures out what distortions you are invariant to. So you're definitely right; what we want to do, to some degree, or at least in one instance of this, is to get the same effect you get by putting in those distortions, but better and with much less computational cost. >>: I think it's very dependent on the way that you can, it's like sampling.
[indiscernible] in this case you sample the distortions and I get that. [indiscernible] all the way. There shouldn't be any point in your model where you say, but now I cannot do it in closed form and I will sample it, because with sampling you are going to decide your distortions, I mean, instead of like… >> Pedro Domingos: But the sampling has a high cost and it's imperfect, because samples are only a sample of reality. Anyway, let's press on, and certainly I think we can discuss more of these issues as we go along. The final point is this: the use of symmetry is actually agnostic with respect to which learner I'm using. You can apply symmetry ideas to connectionist models, vision ones, kernel-based ones, symbolic ones, and we'll see a couple of examples here. It's actually compatible with all of these different things. All right. That's the general idea. Let's look at a concrete example. What can we do with symmetry group theory? As I said, there have been a number of applications of symmetry group theory in machine learning. The most notable one these days is convolutional neural networks. What is a convolutional neural network? A convolutional neural network is basically a series of layers, each of which contains a set of feature maps, and then interspersed with these feature layers I have pooling layers. At the end I have a classifier that takes the features from the top layer and, you know, that could just be a multilayer perceptron or something like that. Where do these feature maps come from? Feature maps are just features, like a sigmoid or a rectifier or something like that, applied to some input, but a feature map is a feature function, like a sigmoid applied to a dot product, combined with translation. What a feature map in a ConvNet really is, is the same feature applied at every possible translation in the image. I'm going to apply this feature here, here, here, here, here, and the thing that the feature is detecting might be present here and here but not there, there, and there. This is what a convolutional network is. If you look at it this way, the generalization immediately suggests itself: why should the feature map just be a feature function applied at all translations? Translations are just one very limited symmetry group, the translation group. We should be able to build a feature map out of a feature in any symmetry group that we want. We should, in fact, have some sort of deep network that isn't even committed a priori to what the symmetry group is that you are going to be exploiting. You can plug that in later. This is what I'm going to do: the first step is I'm going to define an architecture that is valid for any symmetry group. The architecture just uses symmetry group properties. It doesn't say that this is going to be the translation group or something else. And then in step two I'm going to instantiate that with one particular symmetry group, and the obvious one to do that with, of course, is the affine group. The affine group is a superset of the translation group, but it also includes things like rotations, reflections, scalings and whatnot. And the 2-D affine group is defined by this: if these are my original coordinates on the plane, my new coordinates are just a linear function of the old ones. These two parameters obviously give you translation and these give you the rest, the rotation, the scaling and whatnot. So the affine group is six dimensional because it has these six parameters.
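To pin down that notation, here is a minimal sketch, under my own choice of parameter names, of the six-parameter 2-D affine group: x' = a·x + b·y + e, y' = c·x + d·y + f, with translation as the special case a = d = 1, b = c = 0, and with composition staying inside the group:

```python
import numpy as np

def affine(a, b, c, d, e, f):
    # homogeneous 3x3 form, so composing two affine maps is just matrix multiplication
    return np.array([[a, b, e],
                     [c, d, f],
                     [0, 0, 1.0]])

def apply(T, points):
    # points: an (n, 2) array of (x, y) coordinates
    pts = np.c_[points, np.ones(len(points))]
    return (pts @ T.T)[:, :2]

translate = affine(1, 0, 0, 1, 2.0, -1.0)                 # pure translation: a=d=1, b=c=0
theta = np.pi / 6
rotate = affine(np.cos(theta), -np.sin(theta),
                np.sin(theta),  np.cos(theta), 0, 0)      # pure rotation about the origin
composed = rotate @ translate                             # composition is still affine (closure)
print(apply(composed, np.array([[0.0, 0.0], [1.0, 1.0]])))
```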
This is what we are going to do. I'm going to call the first one SymNets, for deep symmetry networks, and then there's going to be a particular instance that is deep affine networks. And ConvNets are just deep translation networks in this setting. There's a couple of notions that we need from symmetry group theory. Let me just bring them over. One important notion for us is going to be the notion of a generating set. So what is a generating set? Let's say that G is a group and S is a subset of that group. We say that S is a generating set of G if every element of G can be expressed as a combination of elements of S and their inverses. So S is some subset that is sufficient to generate the entire group by composition and inversion. For example, in the case of translation, adding an infinitesimal epsilon to x and adding an infinitesimal epsilon to y are a generating set, because you can generate any translation just by combining epsilon translations in x and y or inverting them, which would be x minus epsilon, you know, et cetera, et cetera. One notion that we need to define based on this is the notion of a k-neighborhood of a function in the group under a generating set. That's a bit of a mouthful, but bear with me; the notion I think is quite intuitive. This is going to be the subset of elements of the group that can be expressed as f -- so remember, this is the neighborhood of f, of a particular function -- composed with elements of the generating set at most k times. For example, in the case of the translation group, this would actually look like a diamond, because you could do k this way and then you could do k that way and then you could do like k over 2 like this and k over 2 like that and so on. If you wanted to have a rectangle you would just take a slightly extended definition where you could have k_i applications of element i, but there's no need for that complication here. The notion here is that, you know, I don't want to just look at all possible compositions; I want to look at limited numbers of compositions. And you can probably see why this is the case, because what happens in a ConvNet as you go up the layers is that you are allowing, in this view, compositions of more and more elements of your generating set of small translations, which are going to be translations of one pixel, effectively, in that case. >>: [indiscernible] a sliding window? >> Pedro Domingos: Yeah. Exactly, so one good way to think of this is that a k-neighborhood could be a sliding window, but it's going to be a sliding window in symmetry space, not just in translation. You're going to be sliding not just around translations but around rotations and scalings and then combinations of them. You can think of all this as: I have a little window centered at the origin, and now ConvNets will put that window everywhere. What we are going to do is that in addition to putting it everywhere, we're going to look at all possible scales and all possible rotations and shears and whatnot. We're going to look at all linear transformations of that window and see what happens. >>: The semantic parsing example you are going to talk about later, how do you make these kinds of examples? >> Pedro Domingos: You'll see. Exactly.
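To make the generating-set and k-neighborhood idea concrete, here is a minimal sketch (my own, not from the talk) using the discrete translation group with one-pixel steps as the generating set; the same construction with the six epsilon-perturbed affine parameters would give a k-neighborhood in affine space.

```python
def k_neighborhood(start, generators, k):
    # all group elements reachable from `start` by composing at most k generators
    frontier, seen = {start}, {start}
    for _ in range(k):
        frontier = {(x + dx, y + dy) for (x, y) in frontier for (dx, dy) in generators}
        seen |= frontier
    return seen

gens = [(1, 0), (-1, 0), (0, 1), (0, -1)]             # unit translations and their inverses
nbhd = k_neighborhood((0, 0), gens, k=3)
assert all(abs(x) + abs(y) <= 3 for (x, y) in nbhd)   # an L1 ball: the "diamond" shape
```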
The semantic parsing work is not as far along, but I think it's very interesting because it's a completely different application of symmetry group theory and yet with some deep similarities. We'll get to that in a second. >>: [indiscernible] ConvNet, like the two operations that are the same function on different translated versions of the image, and then they get read off [indiscernible] organization of those and then in order to, like, distinguish them, right? >> Pedro Domingos: Right. So I haven't told you how we are going to do this exactly. There's really two things that a ConvNet does. One is parameter tying and the other is pooling, and we haven't yet seen how we are going to do them, but we are going to see shortly. Here's the general architecture. It's very simple. One layer is obtained by: you take the input and you apply every symmetry in the group to it. Then you compute features on the transformed input. You take your input, you apply every symmetry from your group, and then you compute the feature on the transformed input. This gives you a feature map, a feature map over the entire symmetry space. Then we pool over neighborhoods; we pool over the k-neighborhoods that we just defined. It can be max pooling or sum pooling or any of the various things between them that people have tried, but the thing is, to pool you need to know what the neighborhood is, where the values are that you're pooling, and the answer is that it's going to be the k-neighborhood as we just defined it. Everything we are doing here has ConvNets as a very clean special case. Yeah? >>: [indiscernible] tangent distance. How does it relate? >> Pedro Domingos: Indeed. Interesting question. I don't know if people here are familiar with tangent distance, but maybe my answer will actually also clarify that. Tangent distance is applying symmetry group theory to nearest neighbor, and this is applying symmetry group theory to deep networks. In a way, what I'm going to talk about here, these deep symmetry networks, they are to tangent distance as ConvNets are to nearest neighbor. Many of the things that tangent distance can do we can do here as well. We can do some others besides. As we'll see, there's a point at which we need to solve a problem in our deep network that could be solved by tangent distance. I think it was John that put it this way: one way to look at what we are going to do here is as tangent distance inside the ConvNet. Yeah? >>: You said you apply all the symmetry group to the input and pool over the neighborhood? Why did you do that instead of, say, finding just the symmetries in the k-neighborhood? >> Pedro Domingos: Notice again, you can answer your own question, I think, if you look at what happens in ConvNets, right? I do want to compute the feature at every place in the image and then I pool over neighborhoods, because I want to find out if the feature is present over there and over there and over there. >>: So say I assume that there is just local preservation of [indiscernible] on the rotation. But you will apply all rotations? >> Pedro Domingos: Yeah. Just like I apply all translations. And remember, in the ConvNet, ideally, if the learning permits, what will happen is that if something is completely invariant with respect to translation, you're only going to find that out at the top once you have pooled over the entire image. Same thing here. If something is completely invariant, I will find that out at the top.
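Here is one way to write that layer down as a minimal sketch; it is my own simplification over a discrete sample of group elements, not the authors' implementation, with `transform`, `feature`, and `neighbors` standing in for the group action, the feature function, and the k-neighborhoods:

```python
import numpy as np

def sym_layer(x, group_sample, transform, feature, weights, neighbors):
    # 1. apply every (sampled) symmetry to the input and evaluate the feature there
    fmap = {g: feature(weights, transform(g, x)) for g in group_sample}
    # 2. pool each point's value over its k-neighborhood in symmetry space
    return {g: max(fmap[h] for h in neighbors[g]) for g in group_sample}

# Toy usage: the translation group acting on a 1-D signal, i.e. a 1-D ConvNet.
signal = np.array([0., 0., 1., 0., 0., 0.])
shifts = list(range(-2, 3))
transform = lambda g, x: np.roll(x, g)
feature = lambda w, x: float(w @ x)                     # a linear feature detector
w = np.array([0., 0., 1., 0., 0., 0.])
neighbors = {g: [h for h in shifts if abs(h - g) <= 1] for g in shifts}
print(sym_layer(signal, shifts, transform, feature, w, neighbors))
```

The output of one such layer becomes the input representation for the next, which is where the successive composition of small symmetries comes from.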
If something is only partly rotationally invariant, I will learn that at the appropriate layer but not beyond. Again, backprop permitting and the data allowing that to happen, but we will see that this actually works. As I already implied, we can train this by backprop, again in the same way that ConvNets are trained by backprop, provided that this is a Lie group. If this is not a Lie group we will have to do other things that we'll talk about later, but if this is a Lie group we can train by backprop because the symmetries are differentiable. >>: [indiscernible] backprop for this type of network is comparable to the [indiscernible] >> Pedro Domingos: Oh, great question. A difficulty, yes. Good question, right. I think there is an obvious difficulty, which is: well, translation was in a 2-D space, right? And now you're going to who knows what dimensional space? What are you going to do? You can't have a grid. So we're going to have to deal with that problem, right? Those of you who know tangent distance might already have some thoughts about how we might deal with this problem, but indeed, that is actually key. If we don't do that, you know, this isn't really going to be practical. Okay. That was the general case. Now let's make this concrete, right, with the affine group. This leads to what we call deep affine networks. This is just the architecture that we saw before with the affine group as the symmetry group. So one layer is: we apply every affine transformation to the image plane. Stretch it, shear it, rotate it, translate it, you know, the whole works, and then we compute features on the transformed plane. What this means is that instead of just trying to look for a particular feature at the different translations, I'm also going to look for it at different scales, different rotations, different, you know, affine transformations. And then we pool over the neighborhoods in affine space as we saw before. Pooling introduces translation invariance in a ConvNet; it says, I don't care whether this feature is here or here or here, it's all about the same to me. We're going to say something similar, which is: well, not only do I not care if it is here or here or here, I don't care whether it's this size or this size or this size within some range, and I don't care whether it's rotated by some number of degrees. And that is one layer, right? The first layer has the image as its input. The second layer, you know, has these feature maps, and I can just stack as many of these layers as I want and train them by backprop. Okay? What is the generating set, or what is a generating set that we can use for affine transformations? Well, the obvious one is to just start with the identity matrix and the origin. This is basically not translating and not rotating, you know, just leaving things alone, and adding epsilon to each parameter. So if you look here, I have the identity, and adding epsilon to each one of these parameters gives me, for example, an element of x translation or y translation as before, but then these guys right here: if I have a matrix that's 1 plus epsilon, 0, 0, 1, and one that's 1, epsilon, et cetera, et cetera, I can compose everything out of those, so this is what we use. All right, so, uh-huh? >>: If you go back there, on step one, the affine… >> Pedro Domingos: Here? >>: Yeah, back one more.
Apply every symmetry; you are taking the generating set and you are going to sort of apply it over and over and over, or are you -- I don't quite understand. >> Pedro Domingos: For this part you don't really -- to understand pooling over neighborhoods you need the concept of the generating set, but here we actually don't. Just think of the entire symmetry group. It's a 6-D group. >>: So you are basically going to generate all the symmetries and apply all of them? >> Pedro Domingos: Yeah. Which sounds crazy, right? >>: That's like real simple, right, real simple? >> Pedro Domingos: Exactly, so. >>: That's all feasible. >> Pedro Domingos: For a moment, let's ignore computational issues. I'm just going to take every last one and apply them all in their glorious infinity. >>: Yeah, yeah, sure. >> Pedro Domingos: Right? So conceptually this is what I'm going to do, right? I mean, in some cases we may be able to do these things in closed form, but that's not what we're going to do here and it's also not what ConvNets do. So of course, another problem is I can't do this, right? This is not going to be feasible. In a ConvNet it was just about feasible, but it's not going to be here. What can we do? If you think about it, what we can do is -- we're machine learning people. We know how to solve this problem. We don't need to sample the space at every point. Remember, what has to happen is that we need the feature at every point in symmetry space, but there are too many; we can't do that. What can we do instead? We can compute our features only at some points in symmetry space, only for some rotations and shears and scalings and translations, only for some. And then we interpolate for the rest. Instead of sampling this room exhaustively in a complete grid, I pick a number of points. I know my feature there, and if I have a point here and a point here, then the feature here can be computed by interpolating between them. Come on. We know how to do this. >>: [indiscernible] uses a few types of symmetries and then pretty much [indiscernible] so this is more complete setups? >> Pedro Domingos: Yeah. So again, for those of you not familiar with Stéphane Mallat's work -- and also there is like [indiscernible] in here. There's a number of sort of explanations, generalizations of deep networks based on ideas from symmetry group theory. Again, this has many things in common with those, but it also has some important differences. One is that he doesn't really learn anything. We do. We're going to do backprop and learn all these things. I think what he has is an interesting concept in terms of how to explain deep learning, but he actually doesn't have an algorithm that would learn something, and we do. >>: [indiscernible] translation [indiscernible] >> Pedro Domingos: Yeah. And again, what he has is not for an arbitrary group or for the affine group, even. What we have is much more general. >>: The point is you understand what you want in terms of symmetry. You don't need to learn it, because when you decide the wavelet [indiscernible] >> Pedro Domingos: And I agree with that. >>: It's a concept; it doesn't work. [laughter]. >> Pedro Domingos: It has some conceptual value, but the thing is, if you know what you want you should be able to encode it, but if you don't know what you want you should be able to learn it.
There's no reason why we can't learn the symmetries, and this is what the ConvNet is doing and Mallat's work isn't, but we are. At the end of the day we have here an algorithm that actually is a practical full-blown algorithm, and we'll see it does give the kind of benefits that we were speculating we would get. From a machine learning point of view, this is a very natural solution. We have a very high dimensional -- actually, here is the ironic thing. The affine space in 2-D has six dimensions, which five minutes ago looked very scary: oh, you are going to need a grid in six dimensions. And now it's like, six dimensions, that's nothing. We know how to deal with thousands of dimensions. Six dimensions isn't even that big of a problem. Now, if we're going to do this by having some control points and interpolating between them, we need some way to choose the control points. Now it's important to realize -- for example, one thing that actually John suggested right after my talk was, why don't you just let them be a random sample and leave it at that. Actually, that might work. In many cases that might actually be just fine, but here is something important to realize: I don't actually care equally about approximating the feature everywhere in affine space. What I am really looking for is where the feature is high, because that's where it's present. Say this is a feature that detects a nose. The places where the feature is close to zero, I really don't want any samples there, because they're not going to matter. What I really want to nail is the places where the nose detector is going off full steam: at this scale and this position of the image, this rotation, I seem to have a nose. So ideally -- now there's more than one way to do this, but here are some considerations -- you are probably better off letting the control points be local maxima of the feature instead of just being random. How would we find them? We have our feature map. Our feature map is the value of the feature going up and down over all points of affine space. How do we find those points? We can just start with n random starting points, and this n is going to be an interesting parameter, but notice that essentially you are going to be independent of the dimensionality given an n, because everything is just going to depend on this n, not the dimensionality of the space that you are in. Then what I can do is, from those random starting points, find maxima by gradient ascent. Uh-huh? >>: [indiscernible] maxima [indiscernible]? >> Pedro Domingos: It's of the feature, right? Suppose you have a feature; let's take an ordinary neuron. It's a dot product followed by a sigmoid or by a Gaussian or something. You can apply that to an image patch of size 16 x 16. It's one neuron, so the feature is one neuron. Now what happens is we want the value of that feature as we sweep around in our affine space. If there is a nose at a certain scale, it should have a high value when I apply the feature at that scale, but not when I apply it at a different scale. >>: Do you have any idea -- the affine space is big, so can I take any patch and transform it in some way that [indiscernible] looks like, maybe it's a very contrived… >> Pedro Domingos: Yeah, that's a very interesting question, but that will be a problem when we go where we want to go, which is to be very flexible in the symmetries that we can learn. Then it's an overfitting problem.
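As a rough sketch of that control-point search (my own simplification, assuming a feature that is differentiable in the transformation parameters), the idea is to start from n random points in symmetry space and climb the feature:

```python
import numpy as np

def find_control_points(feature_grad, dim, n=8, steps=200, lr=0.5, seed=0):
    # feature_grad(theta): gradient of the feature with respect to the
    # transformation parameters theta (e.g. the six affine parameters)
    rng = np.random.default_rng(seed)
    points = rng.normal(size=(n, dim))                   # n random starting points
    for _ in range(steps):
        points += lr * np.array([feature_grad(p) for p in points])  # climb the feature
    return points

# Toy usage: a synthetic "feature" that is a Gaussian bump in a 6-D parameter space.
center = np.array([0.5, 0.0, 0.0, 0.5, 1.0, -1.0])
value = lambda th: np.exp(-0.25 * np.sum((th - center) ** 2))
grad = lambda th: -0.5 * (th - center) * value(th)
ctrl = find_control_points(grad, dim=6)
print(np.round(ctrl - center, 2))    # the random starts have moved toward the bump
```

Note that the cost here scales with n rather than with the dimensionality of the symmetry space, which is the point made above.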
Here we don't really have that overfitting problem yet, because you can't distort a mouth into a nose, I don't think, just by affine transformations. >>: This assumption that the maximum is good seems odd to me. Are you assuming the weights are positive, or -- I mean, a feature could be really good when it's strongly negative, the things that listen to [indiscernible] >> Pedro Domingos: There's a certain interplay between this and what the feature is doing, right, but if the feature is something like a sigmoid, it's 0 when something is not present. And you are saying, in general, I might be interested in the 0 case, but in practice in vision that's not what happens. >>: Just because it's maximum. >>: It goes 0 to positive instead of -1 to positive 1. >> Pedro Domingos: Yeah. And, I mean, for example, if it wasn't the case that you are looking for local maxima, then use some other method, maybe random samples, maybe some other knowledge that you have. In fact, in one of the applications that we have, what we actually did was we didn't do the gradient ascent process here; we actually just picked where we put the control points based on our presupposition of what the invariances would look like. Yeah? >>: Why do you care about the maxima as opposed to characterizing the entire space? Wouldn't you want the information about the entire group? >> Pedro Domingos: This is what I was saying. Yes, but I don't have the cycles to look at the entire space, so I have to pick some -- if I'm going to solve this problem by interpolation, I need to pick the control points. >>: With this point over here, the maximum -- what about, I mean, if you missed the point… >> Pedro Domingos: This is what I'm saying: because those are what I care about, those are what I'm going to look for. If those weren't what I cared about, I could do things accordingly. >>: So [indiscernible] >> Pedro Domingos: This is a choice that… >>: [indiscernible] absence. >> Pedro Domingos: Yes, exactly. The use of local maxima presupposes that you are using a certain kind of feature, which is what you almost always do in vision, but if you are using a different kind of feature, like Matt was saying, where, for example, having the -1 is very informative -- I think the general theme here is that you may want to do random sampling, but you also want to make it a sort of utility-guided sampling, and sample more at the points where you have the most relevant information. >>: Let me ask the converse of what Ashish did. It seems like you are choosing an interpolation scheme, and interpolation schemes are convex, like, for example, if the basis functions are [indiscernible] unity; then if you are taking a max over, you don't have to interpolate at all, because it's convex, so you know that the max of the interpolation will always be at the control points you pick. There's no integration, there's no nothing, right? >> Pedro Domingos: Good point. That's partly true, but here's the other thing. This is actually going to do several jobs for us at once. We actually -- I'm wondering when I should talk about this. I'll just talk about it right now. We are going to do this with kernels, so we're going to put a kernel on each of these points. This kernel is actually going to do several jobs at once. You could actually have separate kernels for each of these things, and some of them might even be unnecessary, but here's one of the most important things that these kernels are going to do. They're going to do the pooling.
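Here is a minimal sketch of that kernel idea (my own illustration with assumed details; the talk describes a linear combination of Gaussian kernels, and this uses a normalized variant for simplicity): the feature is evaluated only at control points in symmetry space, a Gaussian kernel sits on each control point, and the kernel width controls how much pooling the interpolated feature map does:

```python
import numpy as np

def interp_feature_map(query, control_points, control_values, width):
    # query: a point in symmetry space (e.g. the six affine parameters)
    # control_points: (n, d) locations where the feature was actually computed
    # control_values: (n,) feature values at those control points
    d2 = np.sum((control_points - query) ** 2, axis=1)
    k = np.exp(-d2 / (2 * width ** 2))            # Gaussian kernel weights
    return np.sum(k * control_values) / (np.sum(k) + 1e-12)

# Toy usage: the feature fires at one control point and not at another.
ctrl = np.array([[0.0, 0.0], [3.0, 0.0]])
vals = np.array([1.0, 0.0])
q = np.array([0.8, 0.0])
print(interp_feature_map(q, ctrl, vals, width=0.5))   # ~1.0: narrow kernel, little pooling
print(interp_feature_map(q, ctrl, vals, width=3.0))   # ~0.56: wide kernel pools the points together
```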
I'm actually, for example, going to use the width of these -- it's one thing to think about, oh, I am looking for a very accurate interpolation of the feature map. If that's what you were looking for, this is not what you would do. It's perfectly legitimate to put that in here, but actually what we're trying to do here is not an accurate interpolation, necessarily. >>: You want the aggregate of the feature. >> Pedro Domingos: Because we want to pool, and pooling means being invariant. We actually want to be insensitive with respect to a lot of things, so what's going to happen is, if I widen my kernels I'm going to be pooling over a wider… Oh [laughter] special effects. So widening the kernel is going to have the effect of pooling over a larger area. Suppose I have a maximally wide kernel; that basically means that I have washed out all information about where I am in feature space. If my kernel is a Dirac function, then basically I am not doing any pooling whatsoever. >>: So you don't actually apply another pooling function? >> Pedro Domingos: Exactly, exactly. This thing does the whole job, so it's a lot simpler. This one kernel is actually going to do double duty. It's doing the duty of approximating the feature map. It's also doing the duty -- think about, ignore ConvNets for just a second and think of pooling as low pass filtering. It's a kind of low pass filtering. What I'm going to do with these kernels is, you know, low pass filter the feature map. And if I want to be very invariant I will have a low pass filter with a high bandwidth. >>: Okay. >> Pedro Domingos: For those of you familiar with vision, this probably looks a lot like Lucas-Kanade, and it is. Lucas-Kanade was originally for things like optical flow and whatnot and then it was generalized to affine transformations, but, in essence, that's what we're going to do here. Once we have found these points -- and again, these points could be random points, they could be maxima found in this way, and they could be points that you put down because of something you know about the symmetry of the problem. However you put down those points, what happens is that instead of having computed the feature everywhere in your affine space, you have computed it at certain points, but now in order to do my backprop I need to be able to compute the value of the feature map everywhere. The way I do it is by using Gaussian kernels. Again, you could use many other things besides Gaussian kernels, but Gaussian kernels have all of the usual convenient properties and they also work well in practice. What I'm going to do is put a Gaussian kernel at every one of these points, and then my linear combination of those Gaussian kernels is going to be my approximation of the feature map. Yeah? >>: [indiscernible] between what you do at forward and backward. It seems like what you are suggesting is a maximum over the neighborhood of my feature over the symmetries [indiscernible]. Which means that basically the backprop just sounds like a subgradient on the single point you found. >> Pedro Domingos: Again, what is going to happen is going to depend on the pooling function that you have. Suppose, for example, that you are doing average pooling. >>: Yeah. No. But what you just said, that you were doing kind of in one go pooling and estimation with this maximum, so if you are [indiscernible] is maximum there is no need for… >> Pedro Domingos: No.
You have to remember that we are going to have successive layers of this, so one of these kernels is sitting above one layer and another one is sitting below the previous one, and those are the two that we combine. There's even more to this, but why don't we get the whole picture first. I've already said most of this. If you make your kernel wider you get more pooling. And then you might also worry about how to implement this efficiently, because as I'm computing my gradient I am repeatedly going to have to look up, you know, the nearest points to the point that I am currently at. I'm going to be somewhere in affine space, and you can use things like ball trees or whatnot to do that. You could use k-d trees, but k-d trees are very dimensionality dependent; ball trees are less so. Making this all efficient, besides this idea of using interpolation in 6-D space, also involves some careful attention to the data structures, but again, we know how to do that. Experiments. The first thing that we tried this on was the obvious one, which was the rotated MNIST data set. Somebody, I think somebody in Montréal, took the MNIST digits and rotated them in all possible ways, and now you have a data set like that. People have used this for a lot of things. It's a nice simple example. So what happens there? We compared a ConvNet with -- the way we implemented the SymNet was by basically taking ConvNet code, using the [indiscernible] implementation, replacing the convolutional layers with affine layers, and then on top of it we still have a multilayer perceptron. This is one affine layer in our case and one convolutional layer in the ConvNet, and on top of that there's the two usual multilayer perceptron layers, so this is really four layers; it's one, you know, symmetry layer. >>: [indiscernible] dropping error using [indiscernible] stabilize and you get 50,000 altogether? What is the trend? >>: That's the [indiscernible] >>: Oh. [indiscernible] 10,000 [indiscernible]
And what's going to happen for those data sets is that, for example, the size of your data set might be here as opposed to there, so there the ConvNet running on everything is still going to have this kind of error and we're going to have this kind of error. Yeah? >>: What's the training time between [indiscernible] >> Pedro Domingos: The training times were some white somewhat slower, not hugely. I think by a factor of maybe up to 10. You have to remember that our training time is really dependent on the end that you use. We can use more; we haven't played with this a lot. If you choose a larger number, basically our training time goes up with the number of control points. >>: In this case if you take a 1000 example and you run them through there, and the ConvNet at the same time would you get the same result? >> Pedro Domingos: Our game here is not in time; it's in sample. We can use a much smaller sample is one way to look at it is we can use a much smaller sample to get the same results, yeah? >>: Yeah, but they can make samples by sampling the symmetry. I think time is the relevant comparison because like the opening of the talk was we are going to do the old-school way of doing things and depending on the [indiscernible]. >> Pedro Domingos: You think that time is the relevant comparison? >>: Because you can always fake out. >>: Yeah, you can fake out. >> Pedro Domingos: Oh, no, no no. Oh, no. I see your question. No, no no. Okay. Good. What happens is here you can fake out those transformations. In general, you can't. That's the problem is that you don't know how you generate -- actually, let's look at the next set of experiments that we have and then come back to this question. This is a good point and I see your question, but there's a very definite answer to it. You had a question? >>: A question similar to his, here you already know the symmetry [indiscernible] so you generate new samples [indiscernible] >> Pedro Domingos: No. Precisely. So this data set is interesting for the reason that, the data set is exactly the digits with those rotations. We know exactly what generated them. This is a good way to test whether we can do what we want to do, but in a way it's not a very realistic test, because the realistic test is a test where there's all sorts of transformations happening that we have no control over and we don't even know how to generate them. In fact, to jump to the next set of experiments which was on Norb. In Norb we actually have 3-D objects that are undergoing 3-D affine transformations and changes in background and distractor objects and whatnot. At this point, the symmetries that the data it really has are actually something that we can't directly capture or the ConvNet, but we still do a lot better. This also gets back to your question. Because what happens here, and this is a subtle point but in some ways it's the key point, is that ConvNet can do okay by approximating everything with translation provided the data is sampled densely enough. What we're buying ourselves by having affine, the affine group instead of a transition group is that we need less data to get the same degree of approximation of something that is actually not completely captured by those transformations. In particular, here the real data has 3-D transformations and we're approximating them with 2, but we're better off approximating them… >>: I'm still not happy because I can fake out the affine transformations. >> Pedro Domingos: No, no you can't. >>: Yes, we can. >>: Yes, we can. 
>> Pedro Domingos: No. Not in 3-D. >>: No, no no. >>: You only having the 2-D affine which I can fake out. >> Pedro Domingos: No, no no. >>: This is not a global affine transformation. >> Pedro Domingos: No, no no. You are missing several things here. One of them is that we are -- here, what we had and this is why we only needed one layer here, is that I just had a global rotation of the image. That's it. What I have here and in the real world is nothing of the kind. There's all sorts of transformations all over the map, so I can't do that anymore. There's an infinite number of them. It's not just a matter of applying all the 1D; that's not what we're going to do here. That's not what we need to exploit. That actually becomes impossible here, but the remarkable point is so these two graphs show different things. We still get a gain over ConvNets when we use affine transformations, because we can better approximate the real transformations that we can generate. Now I can say well, maybe we could do an experiment that would consist of generating all affine transformations and then putting them through the ConvNet. I would bet that at that point the ConvNet would be a lot slower. It would be slower because it would be working on a data set, right? Let's say that we are order of magnitude slower than ConvNets. If you have to increase the amount of them in the ConvNets by a trillion. >>: If [indiscernible] were here what would they do, right? They would probably generate some k neighborhoods for each example, which wouldn't be like a millionfold explosion. They would use this proof you are using and that would be the baseline. >> Pedro Domingos: No, but we have more than that. Remember we have more than one layer. We have more than one layer. >>: The point is that you are saying there is a sampled [indiscernible] >>: I'll generate from the 2-D. >> Pedro Domingos: I mean, I think in terms of [indiscernible] complexity, which is I think the important thing, clearly this is a win. In terms of time it might or might not be a win but I would still argue that this is going to be a win. >>: You don't know because you didn't fake… >> Pedro Domingos: I know, because we haven't done that, so fair point. >>: It wasn't clear your control points, they are per example or from old network? >> Pedro Domingos: In the case where we optimize them we start out that n random locations but then they are optimized on-the-fly. >>: They are on-the-fly but they are not to search for each instance you find a transformation maximize the feature map for that example. You can do a search for all examples that way? >> Pedro Domingos: I do this, it's an inference, right? I do an inference per example. By the way, in NORB we actually didn't do that at all. In NORB what we actually did is we set down control points at specific places and actually didn't adapt them, so it's even faster in that regard. >>: So in your case it is to search for every forward and basically you have to do this search? >> Pedro Domingos: Yeah. It's like doing an MAP inference, right? >>: Looks like a big thing, right? In one case you are like [indiscernible] forward with a [indiscernible] 6D for every neural, for every feature map, right? It looks like a costly process. >> Pedro Domingos: It is a costly process, but again, let's say it increases the cost by about an order of magnitude. It depends on the number of control points that you use. When you do this efficiently the increase in cost is not that huge. 
It's certainly better than actually dealing with orders of magnitude more data. >>: We don't have 20 years of optimization. >> Pedro Domingos: Yeah, but again, the other thing, and this is also a good point, is that we don't have here all the engineering tricks that ConvNets have, but because this is such a clean generalization of ConvNets there's no reason you can't apply every single one of those tricks here as well. In that way, I mean, we are comparing a very state-of-the-art technology with something that isn't, which is this thing that we just developed. Yeah? >>: What I would personally find very convincing is if you somehow parameterized a very large space of symmetry groups and then learned your way into them, because then the ConvNet guys won't know exactly how to generate the data from that. >> Pedro Domingos: I mean, that's definitely where we want to go. We just haven't gone there yet. >>: I actually have a related question with that, which is that [indiscernible] when you fake the data it's for 2-D, where you know exactly that if you just [indiscernible] a little bit it's the same digit, but I wonder whether [indiscernible] in 3-D because of the lighting change and [indiscernible] so far that wouldn't be the case. You can't… >>: I understand the symmetry group that you used also did not handle lighting [indiscernible] >> Pedro Domingos: No, but the point is, again, and I go back to this and it is an important point, is that what we're always going to be trying to do is approximate the symmetries of the real world, which are more than we can really capture. What this illustrates, or at least makes us suspect, is that having a wider symmetry group is a bit like variational approximation. If you have a wider family from which to approximate your distributions, you will get a better approximation. That's my interpretation of why we're doing better here. Yep? >>: I want to take this in a different direction. So far we've talked about the training process; in terms of inference, is this still feedforward, or is there something that you have to do especially in the inference phase? >> Pedro Domingos: No, I mean yeah, the inference here is fairly straightforward. >>: But you still have to do it per image. You still have to do the work. >> Pedro Domingos: Yeah. Again, depending on whether we do that maximization or not, we have to do that maximization at inference time. But you have to remember that… >>: So what do you [indiscernible] this going to [indiscernible] >> Pedro Domingos: Remember, what we're doing is we don't have the value of the feature everywhere. What we need to do is find its maximum. Then we use that to compute the features at the next layer. A feature at one layer is using features at the previous layer, right? >>: So the maximum is computed over the control points and then interpolated? >> Pedro Domingos: Yes, exactly, but this is all in one forward sweep. And by the way, typically it doesn't take that many iterations for this to converge because the search is fairly local. This is really not that bad. >>: [indiscernible] between the fixed point in doing this maximization? >> Pedro Domingos: We haven't done it, I mean we did the fixed point for NORB and we cannot afford -- I don't know. Did you try it on the other one? >>: We did the maximization on MNIST and it works better because you have the larger [indiscernible] and it's just more well behaved, and NORB is scaled down so the objects are small.
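To make the maximization just discussed concrete, here is a minimal sketch (my own illustration under stated assumptions, not the actual SymNet code): a feature's response is only sampled at a few points over its transformation parameters, so it is interpolated and a local search finds its maximum before it feeds the next layer. The 9x9 grid, the random responses, and the two-parameter transformation space are hypothetical stand-ins; the talk's space is 6-D affine.

```python
# Minimal sketch (not the authors' implementation) of the per-example
# maximization: interpolate sparsely sampled feature responses over the
# transformation parameters and locally search for the maximum.
import numpy as np
from scipy.interpolate import RegularGridInterpolator
from scipy.optimize import minimize

# Hypothetical setup: responses sampled on a coarse grid over two
# transformation parameters (say, x-translation and rotation angle).
tx_grid = np.linspace(-4.0, 4.0, 9)
rot_grid = np.linspace(-0.5, 0.5, 9)
responses = np.random.rand(9, 9)          # stand-in for real feature values

interp = RegularGridInterpolator((tx_grid, rot_grid), responses,
                                 bounds_error=False, fill_value=0.0)

def neg_feature(params):
    """Negated interpolated feature response at a transformation point."""
    return -interp(params).item()

# Local search from an initial point; in the network this happens per
# example and per feature map, all within one forward sweep.
result = minimize(neg_feature, x0=np.zeros(2), method="Nelder-Mead")
best_transform, best_value = result.x, -result.fun
print(best_transform, best_value)
```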
>> Pedro Domingos: This brings up another important point, which is patch size. You know, we tried different patch sizes for the ConvNets and the SymNets, and the ConvNets work best with small patches. This is what you would expect, because if you are approximating, for example, a rotation or even a 3-D rotation or something else with a translation, you are only going to get a good approximation with small patches. In our case, because we have more flexibility, we can have larger patches and still get a good approximation, because now we're not just doing translations of the patch. We're doing transformations of the patch. Because we can have larger patches, we can actually capture more of the actual structure of what's going on there, which I think is what lets us generalize better. >>: And you have fewer strides, so even though each stride costs more you have fewer. >> Pedro Domingos: Precisely. You can look at this as each one of our layers actually doing the job of several ConvNet layers. You need enough ConvNet layers to build up to the patch size that we actually learn in one step, which again has a payoff in ease of learning and learning time. >>: In the ConvNet the features are like 2-D features, this is what you have, right? >>: Uh-huh. In 6-D. >>: The part… >> Pedro Domingos: 2-D, right? In the ConvNet I'm going over the 2-D translations by [indiscernible] and here we're going over a 6-D space of… >>: No, but he is saying they are still parameterized by a grid and… >>: [indiscernible] of 2-D. Your parameters are not 2-D features? >> Pedro Domingos: No. I'm not sure I understand your question, but in the ConvNet right there are the parameters of the feature, right? Each feature has its weights. >>: That's what I'm talking about. >> Pedro Domingos: And we still have that. >>: Yeah, so the… >> Pedro Domingos: Again, there's more than one way to do that, right, because we can also, you know, if that becomes an infinite vector, we can again approximate it with kernels, which is actually the third job that the kernels can do on top of the two that I've mentioned before. >>: My question was, you use a fixed grid, right? And now you have these continuous transformations that you need to discretize to [indiscernible] >> Pedro Domingos: No. That's what I'm saying. We don't have to discretize if we use kernels. >>: [indiscernible] so you have a discrete [indiscernible] which then as a [indiscernible] operation you have a continuous feature and then you can have [indiscernible] >> Pedro Domingos: Yeah. Think of, for example, the sigmoid. Instead of being applied to a -- think of the dot product, right, for each feature. Instead of that dot product we can have a kernel application, which is continuous, and then this feature map has pieces -- points one, two and three. We have a kernel for each one of them, but then we have weights for those, so we still have weights. This is really important. The nose might have a point, a bright little spot, with high weight; I have a detector for that spot. And then I have detectors for the nostrils that have some slack with respect to where they are and maybe a lower weight. I still have weights, but the weights are on the kernels; they are not on every point. >>: I'm surprised that you need larger patches, because this introduces some level of blurring like the [indiscernible] >> Pedro Domingos: No. It's not that we need… >>: [indiscernible] more patches.
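A minimal sketch of the kernel-weighted feature described a moment ago (my reading of it, not the actual implementation): instead of a dense dot product over every pixel, the feature is a weighted sum of kernels centred on a few control points, so the weights live on the kernels. The Gaussian kernel, templates, and weights below are hypothetical stand-ins.

```python
# Toy kernel-weighted feature: weights sit on a few control points (kernels),
# not on every pixel. Kernel choice, templates and weights are hypothetical.
import numpy as np

def gaussian_kernel(x, template, width):
    """Radial basis kernel comparing an observed patch value to a template."""
    return np.exp(-np.sum((x - template) ** 2) / (2.0 * width ** 2))

def kernel_feature(observed, templates, weights, width=0.3):
    """Feature response = sum_i w_i * k(observed_i, template_i)."""
    return sum(w * gaussian_kernel(o, t, width)
               for o, t, w in zip(observed, templates, weights))

# E.g. a high-weight "bright spot" detector plus two looser, lower-weight
# detectors, in the spirit of the nose-and-nostrils example above.
observed  = [np.array([0.9]), np.array([0.4]), np.array([0.5])]
templates = [np.array([1.0]), np.array([0.5]), np.array([0.5])]
weights   = [2.0, 0.5, 0.5]
print(kernel_feature(observed, templates, weights))
```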
>> Pedro Domingos: It's not that we need more patches. It's that we can learn larger patches better. We can still learn the small patches as well as before. And remember, the amount of blurriness is controlled by us, and also there's an interesting relationship between the n and the width of the kernel. If I want to pool a lot, I actually need a small n, because having a larger n would basically be wasted information. More questions? >>: I assume two layers will do better than one layer. >> Pedro Domingos: Yeah. >>: The reason why [indiscernible] isn't perfect [indiscernible] >> Pedro Domingos: Even in this case two layers didn't buy you anything, because all you had was a global transformation, but here you don't have a global transformation, so the two layers pay off over one. We're actually running over time and I haven't even talked about -- it's good that we are having this discussion, but, so, I don't know. >>: The neighbor that is in the bigger area [indiscernible] is putting in more, giving more. So do you have this notion that you increase your k or your epsilon as the layers go up? >> Pedro Domingos: It depends on which space you are thinking of. If you are thinking in terms of the original space, yes. The kernels keep getting wider. If you think about the reduced space, you can think of it as a kernel of the same size in a space that keeps shrinking. In fact it's equivalent, so you can think about it either way. Next steps: more layers, and harder data sets with distortions, things like ImageNet and whatnot. We were originally thinking of doing this in the context of sum-product networks, which is something that we have been working on. I actually think that you really only get the payoff of this when you combine it with something like sum-product networks, because you need [indiscernible] composition. There is actually something that ConvNets don't do, which, for example, deformable part models do and we can do, which is to say you have this object here. You have a face somewhere and then you have the location of the nose relative to the face. You can have an image where the background is moving and there is a bird flying and then the bird is flapping its wings. What I'm interested in is the trajectory of the wings relative to the bird. I need to decompose the image into the bird and the background and then the wings and the bird, and then decompose the transformations as they go through. And doing this in a ConvNet is very difficult. Nobody has done it and it's not clear how you do it, but in a sum-product network that's what they naturally do: decompose the image into parts and subparts and have distributions over them. Of course, we would like to bring in other symmetries. The first obvious one is lighting, but then in 3-D, for example, let's suppose you have depth images. Or suppose you don't have depth images but you infer the depth. And then, you know, ultimately what we want to do is what you were suggesting: we don't even know what the symmetries are going to be. We just allow a very large symmetry space and let the learning dictate what happens. If people have other ideas of things to do then we would love to hear them. >>: Maybe we should get going. >>: I know you are out of time, but can you give us a hint as to the symmetries you will be exploiting in the semantic [indiscernible] let's get to that. >> Pedro Domingos: There's actually less detail in that part because that research is not as far along, but I think it makes for a very interesting contrast with what we just talked about.
So let me briefly say what the problem is and what our solution is. The goal in semantic parsing is to map sentences to logical formulas. I want to take a sentence in English and translate it into a logical formula that I can then query and do reasoning with and whatnot. This is useful for question answering, natural language interfaces, you know, controlling a robot by giving verbal commands, and whatnot. And you can think of it as a structured prediction task. The input is the sentence and the output is the structure; in parsing the structure is a parse tree, but here the structure is going to be the actual logical formula representing what the sentence says. This is a problem that's actually become quite popular in the last few years and [indiscernible] from here has actually done a lot of the work on this. But there are at least two very big problems in semantic parsing. The first one is that no one agrees on what that formal representation should be. Even for parsing there is already a lot of disagreement, but no one can agree on what the formal representation of what you read is; everybody has their own formalism. There have been some attempts. >>: [indiscernible] particular application you can derive that. >> Pedro Domingos: For certain applications, like for example querying databases, that language is given to you, but if your goal is to read the web and form a knowledge graph, something you might be interested in, then you are not given that formal representation. That's just a fact of life, right? The other problem, which is as big or bigger, is that even if everybody had agreed on a formal representation, where would the labeled data come from? You need example pairs. You can't ask Mechanical Turkers to translate sentences into first-order logic, because they don't know how to do that. Labeled data is very hard to come by. You can try to do it with no labeled data. That is actually what [indiscernible] did. You can also try to use various kinds of partial or indirect supervision, but what we are proposing here is actually something that's quite different from all of this, which is to change the definition of semantics to not require an explicit formal representation, taking advantage of symmetry group notions. Here's the basic idea. Remember what a symmetry is: the symmetry of an object is a change in the object that leaves its fundamental properties unchanged. There is a natural mapping here. Our objects are sentences and the symmetries are going to be syntactic transformations that preserve the meaning. For example, synonyms, paraphrases, active versus passive voice, you know, composition like A does B, C does B, A and C do B. Bill wore shades. William donned sunglasses. Sunglasses were donned by William. These are all syntactic transformations that don't change the meaning. The semantics is the invariant here. Then you can answer a question, and if you do this correctly, you can get -- this is the same process that we were looking at before, except with a very different kind of noise variables. The thing that we're trying to get rid of here is the syntactic form. If we can do that, then we can answer questions like: were shades donned by Bill? Nobody ever actually said the sentence Bill donned shades, but if we figure out that shades and sunglasses mean the same and donned and wore mean the same, et cetera, then we can actually answer that question, correct?
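To make the idea above concrete, here is a minimal sketch (my own illustration, not the system described in the talk) of meaning-preserving transformations as composable functions on sentences; the tiny synonym table and the naive passive-voice rule are hypothetical stand-ins sized for the Bill/William example.

```python
# Toy illustration of sentence symmetries as composable functions.
# The synonym table and the passive-voice rule are hypothetical stand-ins.
SYNONYMS = {"bill": "william", "wore": "donned", "shades": "sunglasses"}

def synonym_sub(sentence):
    """Replace each word by its synonym where one is known."""
    return " ".join(SYNONYMS.get(w, w) for w in sentence.lower().split())

def to_passive(sentence):
    """Very naive subject-verb-object -> passive voice rewrite."""
    subj, verb, obj = sentence.lower().split()
    return f"{obj} were {verb} by {subj}"

def compose(*transforms):
    """Composition of symmetries is again a symmetry (closure)."""
    def composed(sentence):
        for t in transforms:
            sentence = t(sentence)
        return sentence
    return composed

sym = compose(synonym_sub, to_passive)
print(sym("Bill wore shades"))   # -> "sunglasses were donned by william"
```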
So this is our notion of symmetry: a symmetry of a sentence is a syntactic transformation that leaves the meaning unchanged. The invariant we are looking for is now the meaning of the sentence. >>: [indiscernible] words change, phrases change? >> Pedro Domingos: Yeah. Everything. >>: So it's small changes. >> Pedro Domingos: Yeah. And notice, by the way, so one step at a time: synonyms, paraphrases, all syntactic changes that don't change the meaning are symmetries and you can compose them, right? I can change Bill to William and wore to donned and, you know, passive to active voice. I can do things one on top of another. What is the meaning of a sentence if we have this notion of symmetry? Well, here is one more notion from group theory, which is the notion of the orbit of an object under a symmetry group. The orbit of X under the symmetry group G is the set of all objects that X is mapped to by symmetries in G. Take my object X, apply every single symmetry to it, and I get a set of objects. For example, a face at all translations, rotations and scalings, or a certain sentence in all the different ways that it can be said. This is the orbit of an object, and the orbits of the objects in a set partition the set, because if I can map A to B and B to C, then basically I can also map A to C. In particular, my language, the English language, is going to be partitioned into orbits by the syntactic transformations, and one orbit is going to be all of the sentences with the same meaning, all the different ways of saying the same thing. So there is a one-to-one correspondence between orbits of sentences and meanings, which means that I don't need to explicitly construct the meaning of a sentence in first-order logic to actually figure out what the meaning is. The meaning is implicitly represented by the orbit that the sentence belongs to under the symmetry group of the language. There's going to be a symmetry group of English, a symmetry group of French, a symmetry group of Chinese, et cetera. Now the orbits, in this case, and in many other cases too, but here very prominently, are going to have compositional structure. The orbit for a sentence is going to be composed of the orbit for the subject and the orbit for the event and the orbit for each of the arguments of the event, including the agent and the object and so on and so forth, the time, the place. Yes? >>: When you say the meaning may not be explicitly represented, I think, doesn't it depend on what you want to do with those orbits? So if you just want to match one sentence to see if it was expressed in your knowledge base, that's true, but if you need to chain two sentences together in reasoning, then you need some representation. >> Pedro Domingos: Funny you should ask that question. Two steps. Obviously, a very good question. I have more on this, but let me -- the answer is we can do two -- let me give you a partial answer. Obviously, if you are trying to do extended question answering, where the answer to your question is in your knowledge base except in a different form, then this will work, because you just need to figure out that these two guys are in the same orbit and you're done. If you need to do chaining -- let's suppose that I want to chain, since that's an interesting case and that's what logic is for -- what do you do in that case? Well, wait 30 seconds, okay?
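A minimal sketch of the orbit idea just described (a toy illustration, not the actual system): closing a small set of sentences under a handful of meaning-preserving word substitutions partitions it into orbits, and two sentences mean the same thing exactly when they land in the same orbit. The synonym pairs here are hypothetical.

```python
# Toy orbit computation: partition sentences into orbits under a few
# word-level symmetries. The symmetry pairs are hypothetical examples.
from itertools import combinations

SYMS = [("bill", "william"), ("wore", "donned"), ("shades", "sunglasses")]

def related(a, b):
    """True if sentence b is reachable from a by one word substitution."""
    ta, tb = a.lower().split(), b.lower().split()
    if len(ta) != len(tb):
        return False
    diffs = [(x, y) for x, y in zip(ta, tb) if x != y]
    return len(diffs) == 1 and (diffs[0] in SYMS or diffs[0][::-1] in SYMS)

def orbits(sentences):
    """Union-find style closure: merge sentences connected by symmetries."""
    parent = {s: s for s in sentences}
    def find(s):
        while parent[s] != s:
            s = parent[s]
        return s
    for a, b in combinations(sentences, 2):
        if related(a, b):
            parent[find(a)] = find(b)
    groups = {}
    for s in sentences:
        groups.setdefault(find(s), []).append(s)
    return list(groups.values())

sents = ["Bill wore shades", "William wore shades",
         "William donned shades", "Bill ate lunch"]
print(orbits(sents))
# The first three land in one orbit (same meaning); the last is its own orbit.
```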
If you are [indiscernible] the knowledge base, which supports more than your question, you still want to say that my question can be projected onto this fact. [indiscernible] transform. >> Pedro Domingos: To be very precise, what we're looking at here is a group of permutations. We probably don't have time to get into this, but you know how every symmetry group is, you know, structurally equivalent to a permutation group. What we are doing here is, think, for example, of synonymy. What we have is a permutation group among the different words, right? >>: [indiscernible] this notion that we can [indiscernible] and come back to the structured points compared to like a… >> Pedro Domingos: It depends on whether you want to lose information or you don't want to lose information. If you don't want to lose information, then what you have to do is keep track of the permutation that you just did. This guy migrated to that guy, that guy migrated over here, and this guy became this guy. How do we do parsing? This actually gets at some of the questions that you were asking before. The goal of semantic parsing is now to find the most probable orbit of a sentence. If I find the most probable orbit of the sentence, that actually allows me, for example, to answer the question whether this thing is true according to the knowledge base or not. What this is going to be in practice is matching my query to each of the sentences in the knowledge base, or to each of a set of sentences that are equivalent to them, by composing symmetries that I have previously found. If I cannot do this, then I need to create a new orbit, because I have a sentence with a new meaning that was not present before. Again, this can be done efficiently by taking advantage of the compositional structure of the orbits. The inference machine is going to be a sum-product network at the end of the day, where a union of orbits is going to become a sum and a decomposition into a product of orbits is going to become a product, but I'll just leave it at that very high level of detail. How do we learn? What is the goal of learning? The goal of learning is to discover the symmetries of the language. The natural data for this is pairs of sentences with the same meaning, something that now anybody can provide. I can ask Turkers to tell me if two sentences mean the same or, maybe even better, to restate this sentence in their own words. Nobody needs to worry about logic anymore. I can get a corpus like this maybe also from things like multiple news stories about the same event, or corpora of paraphrases that I already have, so there's now lots of different data that I can use. What I am trying to get out of this is the transformations that leave the meaning unchanged. What I would like to have at the end of the day is not a huge pile of transformations; it's some sort of minimal generating set. With these transformations you can produce all variations of the same meaning. I want to learn, for example, that sunglasses means the same thing as shades and donned means the same thing as wore. I don't want to have to learn separately that wore glasses means the same thing as donned shades and whatnot, composed with active-passive and so on. So the structure is going to be the generating set that we find. The parameters are going to be the probabilities of the orbits.
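As a toy illustration of the learning setup just described (not the actual learner), here is a sketch under the simplifying assumption that paraphrase pairs differing in exactly one word yield candidate generating symmetries, so a substitution like shades/sunglasses is learned once rather than once per sentence it appears in.

```python
# Toy harvesting of candidate generating symmetries from paraphrase pairs.
# Assumes (purely for illustration) that useful pairs differ in one word.
def candidate_symmetries(paraphrase_pairs):
    """Return word-level substitutions seen in meaning-preserving pairs."""
    subs = set()
    for a, b in paraphrase_pairs:
        ta, tb = a.lower().split(), b.lower().split()
        if len(ta) == len(tb):
            diffs = [(x, y) for x, y in zip(ta, tb) if x != y]
            if len(diffs) == 1:              # exactly one word changed
                x, y = diffs[0]
                subs.add(frozenset((x, y)))  # unordered: a symmetry has an inverse
    return subs

pairs = [("Bill wore shades", "Bill wore sunglasses"),
         ("Bill wore sunglasses", "Bill donned sunglasses")]
print(candidate_symmetries(pairs))
# -> the substitutions shades<->sunglasses and wore<->donned
```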
What is the probability that, given this meaning, I'm going to say it this way, with this word, with the active or passive voice, et cetera? My search -- this is a kind of structure learning, like learning the structure of a graphical model. My initial state could be empty, right, I don't know any symmetries, or I could actually initialize this with any symmetries that I know. I could initialize it with WordNet. I could initialize it with a list of known paraphrases. I could initialize it by saying, look, there's this thing called active-passive and you can transform sentences in this way. And then what are our search operators? They are things like filling gaps in the derivations. I know that these two sentences are the same and I can make this sentence be this and this sentence be this, but now I have a gap. Let me postulate that you can transform this one into this one, and then I can transform this sentence into that sentence by composing these things. And because at the end of the day this is a probability distribution over meanings and the sentences that you say given the meaning that you want, you know, we can use all the usual probabilistic learning things, like having the penalized likelihood as a score function and whatnot. >>: I just want to ask you [indiscernible] >> Pedro Domingos: What I described is kind of supervised learning, because it's supervised by pairs of sentences with the same meaning. We've also looked at doing unsupervised learning more in the style of [indiscernible]. >>: But of course there are a lot of orbits. [laughter]. >> Pedro Domingos: Exactly. >>: How do you parameterize all of the orbit probabilities? >> Pedro Domingos: We don't represent them all explicitly. This is actually what is interesting about this problem: it's like the difference between structured prediction and classification. In classification you have two classes or five. In structured prediction every parse tree is a class; same thing here. I have this vast multitude of things. How do we handle it? By factorizing that space into pieces. I'm going to have a compact representation of that space of orbits. I'm not going to list all the orbits. What I'm going to say is, well, the orbit for a sentence is composed of three things, to simplify: the orbit for the subject, the orbit for the event and the orbit for the object. >>: Now you do have representation for [indiscernible] >> Pedro Domingos: No. Again, I have, well I do, but it's not -- in the representation, whenever I find a pair, I'm not saying I postulate this. I mean, I could. What I'm saying is this is what I'm going to end up with. I'm going to end up with a structure of, well, a sentence can be active or passive. If it's active, right -- but again, the active-passive is induced, right? I'm calling it active-passive, but hopefully it discovers this by itself. The point is that we know how to do this, right? This is what we do for a living. We are representing a very large space very compactly by using factorization. >>: You are talking about a high-level grammar [indiscernible] >> Pedro Domingos: Interesting point. In a way this is a grammar, but it's more of a transformational grammar as opposed to a context-free grammar. Again, you can imagine having any transformation in this guy, but you also need to keep the learning and the inference tractable, so you might want to make some compromises there.
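As a toy sketch of the factorization just described (my own illustration, with a hypothetical canonicalization table standing in for the learned word-level orbits), the orbit of a simple subject-verb-object sentence can be represented as the tuple of its parts' orbits, so the full space of sentence orbits is never enumerated.

```python
# Toy factorized orbit representation: the orbit of a (subject, verb, object)
# sentence is the product of the orbits of its parts. CANON is a hypothetical
# stand-in for learned word-level orbits.
CANON = {"bill": "bill", "william": "bill",
         "wore": "wore", "donned": "wore",
         "shades": "shades", "sunglasses": "shades"}

def sentence_orbit(subject, verb, obj):
    """Orbit id of a simple sentence = tuple of the orbits of its parts."""
    return (CANON[subject], CANON[verb], CANON[obj])

# Two surface forms land in the same orbit, i.e. have the same meaning,
# without listing the sentence orbits explicitly.
assert sentence_orbit("bill", "wore", "shades") == \
       sentence_orbit("william", "donned", "sunglasses")
```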
>>: [indiscernible] the first part of the thing you talked about, no sort of definition, no sort of representation for the logical form. >> Pedro Domingos: There's a couple of ways to look at this. One of them is that I never defined the formal representation language here. What did I learn? I learned a bunch of transformations, and now those transformations together induce a set of orbits and that's a representation. I didn't say anything at the outset. This can start with absolutely no… >>: There's no logical form anymore. >> Pedro Domingos: Yeah, there's no logical form anywhere. You could now look at this orbit structure and say, well, if I say which orbit you are in, that's a kind of [indiscernible] representation, but it was induced from the data in the same way that [indiscernible] those representations from there. We've done some brief experiments on this. Let me just mention -- I think this is an answer to Matt's question -- a lot of people do traditional language semantics, like Montague semantics and whatnot, based on first-order logic. The point is to then be able to do inference, meaning to chain things, and so far we haven't said how you would do that. But let me just speculate on how that might happen. If you think about it, logical inference rules like modus ponens are symmetries of knowledge bases, because they are syntactic transformations of a knowledge base that don't alter its meaning. This is precisely what logical inference means. It means to change the form of the knowledge base without changing its meaning. These things are all very well defined in the case of logic. For example, suppose I have the knowledge base: Socrates is a human, and all humans are mortal. Now I add the sentence Socrates is mortal. This is syntactically different, but it's semantically the same. In fact, even this type of inference still falls under our general heading of symmetry-based semantics. >>: So in some sense the axioms are like generating [indiscernible] and generating [indiscernible] and the neighbor functions. >> Pedro Domingos: I think that's a good intuition, but there's more than that here, because what happens is that before we were -- you need some notion of, oh, I can have a knowledge base with multiple sentences. The thing that we were doing before was applying symmetries to a single sentence. Now we are applying symmetries to sets of sentences, and you need some additional parser to say, like, oh, you know, I have a sentence -- you need a parser to generate a set of sentences from a single sentence. But that's okay, right? >>: So the direction that you are going in is that you want to be able to have operations that take two sentence orbits and produce a third sentence orbit that sort of has the same meaning as those two sentence orbits. >> Pedro Domingos: Well, in particular, exactly. I can think of an orbit over triples of sentences, for example. The third sentence might be empty. What is the orbit going to be here, right? I want to have an orbit that includes this pair of sentences but also this triple of sentences, okay? Now how did I discover that? Because hopefully in my training data I have a bunch of things like this that allowed me to figure out that, hey, Socrates is mortal is really a redundant statement, because I see the same sets of sentences with this additional sentence for things more general than Socrates and more general than mortal.
They will get aggregated into the previous orbit as one, so we have this union of orbits and then decomposition, right. We have a union between orbits with these three sentences and orbits with these two sentences. Then each of those orbits in turn is composed of the product of the orbit for this guy and the orbit for this guy: all the different ways of saying Socrates is human and all the different ways of saying humans are mortal, composed. Then you have all the different ways of saying these three things. Now, because this sentence is also, right -- this recursive structure of the orbits is very powerful, right? Now I'm going to have all the different ways of saying this, and I'm going to say, well, if you have these guys, you could also have this guy. I have a disjunction: the union of the orbits of this form and the orbits of all three, but if I take this guy out, then no, that's a different orbit. And again, hopefully, right -- this is speculative -- we can actually do the same job that logical inference does, except that we never actually have a formal meaning representation. >>: How -- those two are in the database, and then you do all this other stuff. How does it execute? Does it mechanically search for the sentences? >> Pedro Domingos: The rule that I applied here was modus ponens. The first question is how do you induce modus ponens, but if you give me a bunch of [indiscernible] in logic, a bunch of knowledge bases, I would need training examples of the form: this knowledge base is equivalent to this knowledge base, and the knowledge base that says that Aristotle is a human, blah blah blah, is also, right -- from all those examples I would induce a symmetry, and that symmetry would be modus ponens. Now that I have that symmetry, I can apply it to transform one into the other. >>: Okay. So [indiscernible] paraphrases. >> Pedro Domingos: Yeah, exactly. If you think of this in terms of paraphrasing, paraphrasing is itself a kind of inference process, right? I mean, the logical inferences are a kind of paraphrase inference. >>: So you probably have noticed this in your modeling, but I'm just wondering, because language is nasty, so [indiscernible] synonyms or paraphrases there are [indiscernible] context when you say that, so sometimes [indiscernible] and sometimes not. It's not as clean as you actually wrote [indiscernible] >> Pedro Domingos: A very good question. There are two parts to the answer to that. The first one is that we are making some simplifying assumptions here, in the same way that Naive Bayes makes simplifying assumptions and it works very well. So I'm pretending so far that things are not context dependent. You know, we can probably get some mileage out of that. We've gotten some already, but, of course, that's not going to be the end. So what is the end? It's going to be one of two things. Maybe they're the same, or maybe a combination. One of them is to say we need to condition this on features. In the same way that you can do discriminative parsing, you can say I'm going to do this whole process not generatively, but conditioned on, for example, the bag of words in the neighborhood of the transformation that I'm applying. So you condition it on the context. This is one way to do this, which, you know, for example, for syntactic parsing has worked very well, so it's reasonable to expect that it would work here as well.
The other answer, and in some sense the deeper answer, is to say that if what you have is a symmetry that depends on context, then the symmetry is not just between those two items. It's between the items and their contexts. It's saying A in context C can be transformed to B in context C, so it's no longer A can go to B; it's AC can go to BC. Now you pay a price for that in terms of computational complexity and whatnot and, you know, sample size, blah blah, but certainly at the level of theory there's no reason why context can't be taken into account here. Sorry, you were first. >>: Isn't this kind of inference being proposed by [indiscernible] where you don't really know formal [indiscernible] know if the first [indiscernible] and the second sentence, so what are the comments you can picture [indiscernible] and why didn't you apply that technique of [indiscernible] >> Pedro Domingos: Great. Exactly. That's actually a good observation. This actually has a lot in common with what people do for textual entailment. But now there are a couple of important differences. One is that the schemes that people have come up with for textual entailment are actually very ad hoc. It's like, map your sentence to a graph and then you try to, like, right? In a way, part of what we would like to do here is take a lot of these things that people have done in areas like semantic parsing and textual entailment and have a nice clean theory that encompasses them. This is one aspect which didn't exist before. In a way, what we're trying to do is generalize and formalize things that people did before, for example, textual entailment. There are even things like, you know, transformation-based learning that Eric Brill had, and others, so there's actually a lot of stuff there, right? But the other thing is that the textual entailment problem is a little bit different from the one that we're solving, in that what the textual entailment problem is trying to do is not transform something into something equivalent; it's to derive one sentence from another small set of sentences. Textual entailment is like, I give you a paragraph and I want to know whether the sentence follows from the paragraph. What happens in textual entailment is that the thing that limits your performance is lack of world knowledge. In order to do textual entailment properly, in addition to your paragraph you need to have a knowledge base. I don't think you can solve textual entailment the way people have been going about solving it. They are saying, I am just going to map the paragraph, you know, to the sentence. You need to do it in the presence of a large knowledge base, and then the techniques that people use for textual entailment are not going to work very well, but this potentially might. You had a question? >>: I still don't understand -- the question is about composition and [indiscernible] and coupling this with the context. I think it's more complicated than conditioning on context. If you have something like Obama visited London and President visited London, you can infer some kind of transformation from Obama to President. And here [indiscernible] the president's visit is going to be, like, 1945, then it can do this transformation in that case. But in the [indiscernible] there is no context [indiscernible] on. It's just like some… >> Pedro Domingos: Good question.
The problem that you are pointing toward I think is actually a different one, one that we have actually thought about, which is that there are multiple levels of abstraction at which you could talk about things. You could talk about Obama. You could talk about the president of the United States. You can talk about presidents in general. You can talk about Obama today. What happens is that you can say something at one of these levels that implies different things at different levels, but they are not equivalent. But again, we have a very nice way to represent this, which is, if you think about it, think of a class of objects versus the object: the class of U.S. presidents versus Obama in particular. What happens is there are additional symmetries. As you add symmetries you get larger classes. For example, the name Obama picks out one president, but the U.S. president in general is irrespective of the name. It could be Obama. It could be Bush. It could be someone else. >>: [indiscernible] transformation. It's not a symmetry. >> Pedro Domingos: Yeah. But that's okay. We just need to be able to do this reasoning at any level that you choose. Again, if you think about it in terms of the whole knowledge base, it is a symmetry of the whole knowledge base. If I say Obama visited London, that implies, right, because president subsumes Obama, that the president of the United States visited London, and so forth. >>: Okay. I like it. >> Pedro Domingos: All right. >>: Thank you so much, Pedro. >> Pedro Domingos: That was good. [applause].