>>: Hi, everyone. It's my pleasure to reintroduce Paul Smolensky to continue with his talk series where we discuss making neural networks understandable. And I just wanted to preview that we are going to do part 3. Paul has agreed to do a part 3, which will be most oriented towards linguistics, on this coming Friday. And thank you for listening. We really appreciate your taking the time to give this extended talk series.

>> Paul Smolensky: Well, I appreciate the very valuable introduction I have been getting. And although I had said that part 2 of the talk would be addressing linguistics, I really should have said language; it is not so linguistic-y.

Okay. So last time I mentioned that there were two goals that I had when I came here. One has to do with reverse engineering, trying to understand networks that exist, and I'll come back to that towards the end. But what I'll talk most about is the goal of engineering networks which are built to be understandable and which hopefully, in virtue of their greater interpretability, allow us to program into them useful constraints that will allow them to perform at a higher level of linguistic competence, in virtue of being able to process discrete computational structure, which is not really inherent in the neural medium.

Okay. So this is, I think, pretty much the same list I had last time. I think we got to item number 2 or something. We'll see, maybe we'll get farther this time. The challenge, you may remember, concerns trying to unify two approaches to artificial intelligence or cognitive science. And those are stated here in terms of hypotheses: one, H, about the value of symbolic computation as a means of description of cognition and intelligent behavior, and the other one, N, about the value of neural computation for the same types of purposes. And the approach that this system called gradient symbolic computation takes is to posit a cross-level mapping: to view H as describing a more macroscopic level of the same system that N is describing at a more micro level, and to link the two by an embedding mapping, called psi here, which maps symbolic structures into a discrete subset of the vectors in the neural vector space.

So here is the picture I put up last time to remind you about analogies to other fields, where we have macro structure emerging from micro structure. The mapping psi is what links this kind of description to that kind of description. And you remember the idea is to try to design networks with the capability of computing functions which, on one level of description, are taking structured symbolic inputs, like perhaps trees, and mapping them into other structured objects, perhaps a propositional representation of the meaning of the sentence, and to compute that function not by operating on the symbols in an explicit way but by taking advantage of this embedding: to put the input into the vector space of neural network states, apply neural network computation to produce an output, which then can be reinterpreted symbolically if desired. And so I'll say some words about the kinds of functions F that have been shown to be computable in this way. And those theorems are the formal arguments that I can offer that this computational system I'm describing really does provide a kind of unification of these two quite different approaches to modeling intelligent cognition. So I also mentioned that the result of this particular way of doing the embedding is a kind of intermediate level in which tensors figure prominently.
So we call that the tensorial level, and so we can add a kind of intermediate hypothesis about the value of gradient symbolic computation described at this level for characterizing the knowledge and processing in intelligent systems.

Okay. So now, there's an important step which I would love to do right now; this is where it belongs. But given my experience last time, I fear that we might not get beyond it. So I've decided to postpone it, and I hope we will get to it at the end instead of at the beginning. But part of it has to do with characterizing neural computation as involving not just activations and neurons but representations that are in some sense sub-symbolic, where the conceptually interpretable entities (and this talk is all about interpretation) are distributed as activation over extended parts of the network. So that's a very important part of the whole story, but I propose to postpone most of it. I will mention a few of the reasons for the centrality of distributed representations, but leave the most important one, in my way of thinking about it, to the end.

So for engineering purposes, distributed representations provide important similarity relations. If you have one unit dedicated to one concept and another unit dedicated to another concept, and all the concepts have single units dedicated to them, then all concepts have zero similarity to each other. That's how local representations are. On the contrary, if an activation pattern is what encodes a concept, then two different concepts will have more similar or less similar activation patterns, and that similarity will play an important role in determining how what is learned about one of these concepts will generalize to the other. There is also the fact that you can get many more distributed representations than you can get local representations in a set of N neurons: there's only one representation per neuron in the local case, whereas distributed representations afford the opportunity to pack the N-dimensional space with more than N conceptual entities, using distributed patterns to encode them.

Okay. Then for reverse engineering, it just seems to be a fact that the representations learned by networks in their hidden layers, and the representations that we find in the brain, have a very significant distributed component. Rarely do you find individual neurons that can be given a conceptual interpretation, but that falls far short of providing enough understanding of the system to explain how it functions and why it succeeds and why it fails when it does. So understanding neural computation as trafficking in distributed representations is important for understanding --

>> Audience: In biology there are such terms as local [indiscernible], so how do you reconcile that with the possibility that there is maybe a small number of neurons that do have local representations?

>> Paul Smolensky: Well, the principle is that the design of networks must be such that distributed representations are possible. But local representations will also be possible, and in some senses, perhaps for some purposes, preferable. But local representations are a special case of distributed representations. If you are set up to cope with distributed representations, then you can specialize to local, but the reverse is just not true. Okay. And so that is a fundamental aspect of neural computation, in my view.
What I'm postponing is a very general symmetry argument which leads to the conclusion that neural computation must always allow for distributed representations, as I just said. And I propose to skip that. Let's see if I can skip it. Okay. That didn't work. Let's back up here. Hmm, hmm, hmm. Okay. So I'll move on to the next topic now and just review the actual proposal for how to embed symbolic structure in distributed activation patterns in neural networks, what this psi function looks like. So, just a lightning review of what we went through rather slowly last time.

So we will be using patterns of activation in which the individual activity levels can be thought of as being the elements of tensors. An Nth order tensor is characterized by an N-dimensional array of real numbers; there will be N subscripts or indices to distinguish these elements from each other. The two basic operations that we talked about are, first, the outer or tensor product, which increases the order of tensors: if you multiply A and B together using the outer product, you get a tensor of order N plus M, the sum of their orders. Whereas contraction decreases the order. A particular contraction operation is tagged to a particular pair of indices in a tensor. To contract over the specific pair I, J in a tensor that has order at least 2 is to take the Ith and Jth indices, replace them both by Q, and then sum over all the possible values of Q. And that's what is written out in a painful way here. So it takes two indices, sets their values equal, and sums over all the possibilities.

Special cases of this that are very familiar are the dot product, when the two tensors are just first order, and the matrix product, when the two tensors are second order. That involves taking two indices, the second index of this and the first index of that, which means the second index and the third index of the outer product, setting those two equal to each other, and summing over all the possible values; that gives us the matrix product. But whenever I write two symbols like this next to each other (and I'll try to be consistent about using this open-face font for the tensors), you should not interpret that as a matrix product unless it's explicitly described as a matrix product. Otherwise, everything will be outer products because, as I told you last time, I'm experimenting with omitting the outer product sign, which otherwise would be covering the page.

Okay. So the general concept of inner product with respect to two indices involves basically taking two tensors, A and B, taking their outer product and then contracting over two indices, one falling within the indices of A and the other falling within the indices of B. The matrix product is one case of that. But we have use for inner products more generally, because the symbolic operations that these tensors are embodying will use the outer product to bind information together and will use inner products to extract, to unbind, symbolic elements from one another.

So some of the examples that we looked at last time are shown here. A simple case is a kind of slot with a name, filled by some element. The representation of that binding together of this role and this filler is achieved by the tensor product.
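To make these two operations concrete, here is a minimal numpy sketch added as an illustration (the shapes and values are toy assumptions, not anything from the slides): the outer product adds the orders of its arguments, contraction over a chosen pair of indices lowers the order by two, and the dot product and matrix product fall out as special cases.

```python
import numpy as np

# Outer (tensor) product: orders add.  A has order 1, B has order 2,
# so their outer product has order 1 + 2 = 3.
A = np.array([1.0, 2.0])                      # order-1 tensor (vector)
B = np.arange(6.0).reshape(2, 3)              # order-2 tensor (matrix)
T = np.multiply.outer(A, B)                   # order-3 tensor, shape (2, 2, 3)

# Contraction over the index pair (0, 1): set the two indices equal
# and sum over their common value; np.trace over those axes does exactly this.
C = np.trace(T, axis1=0, axis2=1)             # order 3 - 2 = 1, shape (3,)

# Dot product = outer product of two vectors, then contraction.
u, v = np.array([1.0, 2.0]), np.array([3.0, 4.0])
assert np.isclose(np.trace(np.multiply.outer(u, v)), u @ v)

# Matrix product = outer product of two matrices, then contraction of the
# second index of the first factor with the first index of the second
# (i.e., indices 1 and 2 of the order-4 outer product).
M, N = np.random.rand(2, 3), np.random.rand(3, 4)
assert np.allclose(np.trace(np.multiply.outer(M, N), axis1=1, axis2=2), M @ N)
```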
So if the agent role corresponds to the tensor A and the individual J corresponds to the tensor J, then it's their outer product that represents this pairing, this ordered pair essentially. In the case of a graph, we have two nodes linked by some labeled arc, let's say relation R. And again, we just take the outer product of the tensors that encode the individual labels on the nodes, A and B, as well as the label on the arc that joins them. So, writing these tensors here in the simpler way, we just take the outer product of those three tensors to encode that one triple, or one edge, of the graph. And very analogously, if we have a propositional type of representation, we have a relation expressed in terms of two arguments for a binary relation. Then we would take the outer product of three tensors, one for the relation and one for each of the arguments, and for the --

>> Audience: Is it that one is undirected and the other is directed?

>> Paul Smolensky: They are both directed in the sense that RXY and RYX are different tensors.

>> Audience: [speaker away from microphone.] the first one, not finding also to be interpreted as very [indiscernible] obviously there.

>> Paul Smolensky: Yes. These are just two ways of writing the same thing.

>> Audience: Okay, so you could swap those? When you actually do the real computation at the end, sometimes you don't know which order it is, because once you do all the neural computation, you just get the tensors [indiscernible] sticking to one definition there in order to [indiscernible.]

>> Paul Smolensky: Right, that's right. So a tensor product representation brings with it a discipline for interpreting the different dimensions of the tensor, how they are arrayed in the activation vector or activation pattern.

>> Audience: In the examples you mentioned, the linked-node one versus the propositional one [indiscernible], it is interesting why there is a need for two different ways.

>> Paul Smolensky: There is no reason to have two different ways. These are a bunch of examples, things that are familiar from symbolic descriptions in AI.

>> Audience: I see.

>> Paul Smolensky: And how we encode them, that's all.

>> Audience: I see, okay.

>> Paul Smolensky: In the project that we are -- there are two projects that I will describe very briefly if I get there, and one of them is using this notation and one of them is using that notation. So I wrote them both out. Otherwise it's kind of redundant.

Okay. And here is the important point that came out in discussion after the lecture with Li, in fact, so I wanted to emphasize it in this review: another type of tensor product representation uses something you can think of as absolute positions within a structure to individuate the symbols that comprise the structure. So in this case what we are talking about is a string, AXB. And the representation of the X in that string: one way of doing that is by having a vector that is associated with the role that it plays, the position, which is the second position in the string. So R2 is the variable whose value is whatever occupies the second position in the string, and in this case its value is X. So the outer product of this vector R2 and X is the tensor that encodes the single constituent X within this string. With trees, you have a similar story, but now the set of positions is recursive.
So in a structure like this, where we have this tree, if we want to talk about the X in the middle of this one, then we can talk in the same way in terms of tensors that encode the positions within the tree. In this case I tend to label them with bit strings. So R01 is the position which is the left child (0) of the right child (1) of the root. So the path to the root is indicated by these bits in the name of the role. And it doesn't matter whether you do the R before the X or after the X, as long as you are consistent in carrying it out.

>> Audience: May I ask a question?

>> Paul Smolensky: Yes.

>> Audience: In the tree you have [indiscernible] with A and B. Do you mean that later on you substitute the exact tensors for A and B? This is just the representation of the X in the tree, but later on, if you have real values for the tensors A and B, you need to put A and B somewhere to use together with this? I am not sure what the meaning of that tree is, the gray letters A and B?

>> Paul Smolensky: Hold on until the next slide, which is just one line away.

And the fact that the positions in trees are recursively related to each other can be captured, and has to be if you want things to work out for recursive function computation, by taking a role like this one, the position 01, left child of right child of root, and expressing it in turn as an outer product of a vector that represents left child and a vector that represents right child, in the particular order associated with the path to the root. In these I neglected to mention that I have been dropping the explicit indication of the order of the tensors, because we were just talking about first order tensors all the way down here. But now, in order for us to be able to talk like this, we have to say that we can start with first order tensors, vectors for the primitive roles of left child and right child, and then generate an open-ended, limitless set of additional tensors that are used to encode all the positions below. But --

[sneezing.]

>> Paul Smolensky: Gesundheit. Their order increases as you go down the tree. So this second level has second order.

>> Audience: But [indiscernible] because it is --

>> Paul Smolensky: Yes, you can infer that the two --

>> Audience: So this still doesn't ... [speaker away from microphone.]

>> Paul Smolensky: It is not necessary, and I may not carry it through; I'm not sure I remember. I at least wanted to point it out here because it is the first time that higher than first order tensors are being used in the examples.

Okay. So binding is done by the outer product. I will -- oh, yes. The point I wanted to emphasize, and just skipped yet again, was that in order to encode symbolic structures like strings and trees in vectors, we make explicit the different roles that symbols can occupy. Those roles are normally implicit in the way we draw the diagrams, in the way we string the symbols together in a string. So it takes a little getting used to sometimes to be making completely explicit the notion of role within a structure, because it is so often the job of a notation to hide that, to make it implicit. But we make it explicit in order to carry this forward.

All right. So the outer product is used to bind symbols to the roles that they play in a structure, or symbols to one another, if they are bound together in the structure.
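As an added illustration of the binding step (not from the talk; the filler and role vectors below are made-up, low-dimensional choices), a filler is bound to a role by the outer product, and a recursive tree role like R01 is itself the outer product of the primitive left-child and right-child role vectors, so the order of a constituent grows with its depth:

```python
import numpy as np

# Hypothetical filler vectors for the symbols A, X, B (3-dimensional here).
f = {"A": np.array([1.0, 0.0, 0.0]),
     "X": np.array([0.0, 1.0, 0.0]),
     "B": np.array([0.0, 0.0, 1.0])}

# Primitive role vectors: string positions r1, r2, r3, and tree roles
# r0 = left child, r1 = right child (2-dimensional here).
r_pos = {1: np.array([1.0, 0.0, 0.0]),
         2: np.array([0.0, 1.0, 0.0]),
         3: np.array([0.0, 0.0, 1.0])}
r0 = np.array([1.0, 0.0])   # left child
r1 = np.array([0.0, 1.0])   # right child

# Binding a filler to a role: the outer product (filler-then-role order,
# kept consistent throughout these sketches).
x_in_position_2 = np.multiply.outer(f["X"], r_pos[2])     # order-2 constituent

# Recursive tree roles: the role "left child of right child of the root" (R01)
# is the outer product of the primitive roles, so binding X to it gives an
# order-3 tensor.
r01 = np.multiply.outer(r0, r1)                           # order-2 role
x_at_r01 = np.multiply.outer(f["X"], r01)                 # order-3 constituent
print(x_in_position_2.shape, r01.shape, x_at_r01.shape)   # (3, 3) (2, 2) (3, 2, 2)
```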
And the remaining operation that we need is the means for putting together multiple constituents to give the vector for the whole, which is what you were asking about with the gray letters in the previous tree example. And it couldn't be simpler: it's just done by addition. So if we have a two-variable structure like this, then each one is represented the way I just said, but we add the two together to get the representation of the structure as a whole. If we have two arcs, we add together the triples that we talked about a moment ago. If we have three letters in a string, then we add together three tensors, each one encoding a single symbol in its location. And for the tree, it's the same story: here we have three constituents to add together. We talked about the X one, but here are the other two; they get added in.

And by using the recursive property of the definition of these embedded roles, that this role here is in fact the outer product of two primitive roles for left and right child, we can observe a nice recursive property of this representational scheme. So once we expand this out into its primitive parts, we can factor out the R1 that these two have in common. That is to say, these constituents are both hanging off the right child of the root in this position. We factor out the R1, and then we get this expression here for the tree. And that is nothing but the embedding of the symbol A bound to the left child position, plus the embedding of the tree structure (now not an atomic symbol but the embedding of a tree structure) bound together with the right child role. So for any binary tree that consists of a left child P, however complex that sub-tree might be, and a right child Q, the encoding will be expressible in terms of the embedding of P itself, the embedding of Q itself, and then the two roles that each gets bound to when they are combined. Okay. So I call that a recursive representation because it obeys this equation here. Yes?

>> Audience: Is the conjunction approximate, or do you take care of trying to find all these tensors and [indiscernible]?

>> Paul Smolensky: In the process of constructing our representation from a symbolic structure, it's exact. In the process of reverse engineering, we are going to have to make do with approximate summation, but --

>> Audience: So how do you ensure that, in the first example, you get the vector that in truth [indiscernible] rather than some kind of posited elimination when you add that [indiscernible.]

>> Paul Smolensky: So what is encoded, in my way of thinking about it, is the conjunction of two propositions, one asserting that the agent is J and the other asserting that the patient is K.

>> Audience: How do you enforce the conjunction? I don't understand. I mean, this is -- this is an assumption of the model, but how do you know that what you are going to get is the conjunction? It doesn't follow the exact -- I mean, it is not clear to me that you can do it with an addition. I mean, I can imagine how I would make it happen by using proper vectors for agent and J and so on, but if they are not orthogonal, if they don't have certain properties running through the data, it's going to be some kind of fuzzy conjunction.

>> Paul Smolensky: Well, yes, if we are learning these from the data, then all bets are off as to what extent these summations and the outer products themselves are going to prove useful as a description of what has been learned in an unconstrained system, I mean.
We can build systems to be constrained so that they will always use addition and outer product operations, and then we know that they are exact. But for reverse engineering a generic network that has not been built to specifically use tensor product representations, all of this will at best be an approximation to what we will find in there, I'm sure.

>> Audience: How about --

[overlapping speech.]

>> Paul Smolensky: I did want to point out that -- let me use the term proper to describe a tensor product representation in which all the vectors encoding symbols and roles are linearly independent of one another. So at the very least, whether or not you're convinced that this summation would encode conjunction as opposed to something else, we can at least say that the result of this is unambiguous as to what is bound to agent and what is bound to patient. And this is the world in which the theoretical work that I have done takes place, the world of proper tensor product representations. But the world of Microsoft is different, because we can't afford to have as many units as it takes to have linearly independent vectors for all of the symbols.

>> Audience: Right, and that is the other question that I was kind of meaning to ask, maybe later. Maybe now is the time. A lot of this, I'm sure that theoretically it works out in your analysis, but [indiscernible] it is an inefficient way of representing symbolic representations if you do it naively. Do you revert back later to not expanding and then projecting, but actually computing your inner products directly? [indiscernible]

>> Paul Smolensky: So the context that I've worked in, what defines an acceptable kind of computation, is: implementable in a neural net using neural operations. The notion of efficiency is rather different there, because you have parallel computation of the multiplication and addition operations available to you. And I have not explored what it takes to do efficient emulation of these computations using a digital architecture instead. But that certainly is also a critical thing now. Yes?

>> Audience: In the recursive tree one, you're adding tensors of different dimensionality, right? Is that a problem?

>> Paul Smolensky: Well, what you have to do is have a big vector space in which these are sub-vector spaces: the subspace of order 2 tensors, the subspace of order 3 tensors, all together in one big happy vector space. So in that sense it's addition within that bigger space that is well defined. Sometimes --

>> Audience: [speaker away from microphone] -- dimensionality sort of boosted up to this high dimensionality before doing the addition?

>> Paul Smolensky: In some sense, yes, that's right.

>> Audience: But how do you decide what slice of the -- you just kind of arbitrarily pick, like, a corner of the space that things slide into at their own dimension?

>> Paul Smolensky: Yes, I guess that's a way of saying it. Yeah. So the basis for the space as a whole is all built up by multiplying together these R0s and R1s. And so if we just embed R0 and R1 themselves in the large space, then that picks out a two-dimensional corner where the depth-one trees will live, and so on after that. It's all ripped off from particle physics without change: multi-particle systems consist of a direct sum of spaces for three particles, four particles, five particles. So in the equation here -- let's see. Oh. Now, this equation here, right.
So when we add together these tensors of different order, in some references the symbol here could also be the direct sum symbol instead of the regular sum symbol, because we are talking about adding elements of two different subspaces in the bigger space. So if you are happier thinking about it that way, that's also a perfectly legitimate way of thinking about it. But I didn't think it was worth going into that -- but I guess I just did. Okay. Further questions before I go on?

So we use outer products to bind symbols to each other and to their roles in the structures. And to extract information from a tensor that encodes multiple constituents, to unbind roles and figure out, say, what the symbol in the second position of the string is, we use the inner product. And just for convenience, I'm going to assume that the vectors encoding symbols are not only linearly independent but orthonormal: they are orthogonal to each other and normalized to length 1. And we'll make good on that asterisk at the bottom of the slide, to back off from that a little bit.

So if we take the example of strings, if we want to extract the symbol in position K within a string, what we want to do is unbind the role vector RK, which was used to bind that symbol into that position originally. So here is our representation of the string AXB again, called S. If we want to find out what is in the second position, we need to take the inner product of S with the second role vector; that's the role we are trying to unbind. We take the inner product of these two, and the inner product is a linear operation, so the inner product with S is just the sum of the inner products with all of its constituent tensors. And because of the orthonormality assumption, R2 and R3 are orthogonal, so their dot product is zero; similarly here for R2 and R1. The only one that doesn't get wiped out in the inner product is exactly the role we want: R2.R2 is just 1, since the vectors have been normalized. So we end up pulling out exactly the filler of that role in the structure as a whole, as indicated here. And exactly the same thing works with the trees.

Exactly the same thing works if we are trying to extract information from the graph kind of representation I showed earlier, but it plays out a little bit differently, because we don't have positions in the graph; we are not using that notion of role. We're identifying the nodes in terms of the positions they play within these local three-way relationships instead. So the kind of question that we want to ask in this case is: What is at the tail of an R edge from A? That is, what is related by R to A according to the graph? Here is the representation of our graph; it is representing the two triples. And if we take the product of the two bits of information that we're using as our retrieval cue, the relation R and the atom A, take their product here, that is what we want to use to probe the tensor for the graph as a whole. That inner product will, for just the same kind of reasons as we saw above, wipe out all of the constituents except those that in fact have AR in them, and that will pull out just B as the result. Okay?

>> Audience: [speaker away from microphone.] the other check relational finding that is [indiscernible.] This is R here?

>> Paul Smolensky: Yes. So we just need to keep our discipline that R is in the second position, that the second dimension of our third order tensor is where the relation goes. We could have chosen it to be the first one instead.
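Here is a small numpy sketch of the unbinding just described, added as an illustration (the filler values and the use of the standard basis as role vectors are my own toy assumptions): with orthonormal roles, contracting the summed string tensor with R2 returns exactly the filler X, and probing a graph tensor with the A, R pair returns B.

```python
import numpy as np

# Orthonormal role vectors for string positions 1, 2, 3 (rows of R) and
# toy filler vectors for the symbols A, X, B.
R = np.eye(3)
F = {"A": np.array([1.0, 2.0, 0.0]),
     "X": np.array([0.0, 1.0, 3.0]),
     "B": np.array([2.0, 0.0, 1.0])}

# S = A(x)r1 + X(x)r2 + B(x)r3   (filler-then-role order, kept consistent)
S = (np.multiply.outer(F["A"], R[0]) +
     np.multiply.outer(F["X"], R[1]) +
     np.multiply.outer(F["B"], R[2]))

# Unbinding position 2: contract the role index of S with r2.
# Cross terms vanish (r2.r1 = r2.r3 = 0) and r2.r2 = 1.
assert np.allclose(S @ R[1], F["X"])

# The graph case works the same way: the triple A R B is A(x)R(x)B, and
# probing the graph tensor with A(x)R pulls out B (unit-norm a and r assumed,
# and no other triples sharing that A, R pair).
a, r, b = (np.random.rand(4) for _ in range(3))
a, r = a / np.linalg.norm(a), r / np.linalg.norm(r)
G = np.multiply.outer(np.multiply.outer(a, r), b)
retrieved = np.tensordot(G, np.multiply.outer(a, r), axes=([0, 1], [0, 1]))
assert np.allclose(retrieved, b)
```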
And then we would have RA here instead of AR as the probe. Okay. So as far as this asterisk is concerned: in a proper tensor product representation, what we have is that the vectors are linearly independent; they aren't necessarily orthonormal. But the mere fact that they are linearly independent means that there exists another set of vectors which we can use in exactly this way. So instead of R2, we use R2+ in that case; instead of R01, we use R01+. These dual vectors have exactly the property needed to make this calculation go through, namely, their dot product with all the other roles is 0 and their dot product with their own corresponding role is 1: RK dual dot RK is 1, RK dual dot any other RL is 0. So all the calculations go through as before.

>> Audience: So what do you -- [speaker away from microphone.] So this is some notation for the ... [speaker away from microphone.]

>> Paul Smolensky: Well, we can't do this -- we can't unbind by taking the dot product with the vectors that we used to do the binding, if those vectors that we are using to do the binding are not orthonormal. If they aren't orthonormal, then the vectors that we want to use to do unbinding are not identical to the vectors that we use to do binding.

>> Audience: [speaker away from microphone.] you don't want, you want tensor everything to be zero so you can ... [speaker away from microphone.]

>> Paul Smolensky: Either you use the exact unbinding vectors here, which will make the calculations go through exactly the same, and you'll pull out exactly the right filler for the role you're unbinding; or, if the vectors are approximately orthogonal in some sense and you don't use the exact unbinding vectors, which are the dual vectors, but instead stick with using the original role vectors to do the unbinding, then you will get some noisy approximation.

>> Audience: [speaker away from microphone] normal to begin with? Because you can always orthogonalize everything.

>> Paul Smolensky: Well, one of the big advantages of distributed representations is that we have similarity relations among the elements. So we don't want to lose the ability to use that; we want to take advantage of that. Yes?

>> Audience: [speaker away from microphone] -- conjunction, but what about disjunction? This makes sense ... I mean, suppose ARB or ERB.

>> Paul Smolensky: Yeah. Well, I would need to go up a level and put the logical structure of these expressions in a form in which we are actually conjoining clauses together, but the clauses include things like disjunction signs. Then the operations have to be the ones that appropriately interpret those signs. So it won't come for free in the same way this low-level conjunction comes for free. So it may be a little misleading to even use the term conjunction for this, but certainly it is true that when you write down a graph you are saying that this link is there and this link is there and this link is there. That's what the semantics of the picture is, right?

>> Audience: [speaker away from microphone.] They are not things that are true/false. They are things that -- [speaker away from microphone.]

>> Paul Smolensky: Yes, right. So operations on them could produce 0s and 1s, but they themselves have some kind of richer content, is the idea anyway.

So what does gradient symbolic computation get from tensor product representations?
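And a sketch of the dual-vector trick just described (toy numbers of my own choosing): when the role vectors are linearly independent but not orthonormal, the unbinding vectors are the rows of the inverse-transpose of the role matrix, and they recover the filler exactly, whereas unbinding with the original role vectors gives a noisy answer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Linearly independent but non-orthonormal role vectors (rows).
R = np.array([[1.0, 0.2, 0.0],
              [0.3, 1.0, 0.1],
              [0.0, 0.4, 1.0]])

# Dual (unbinding) vectors: rows of (R^-1)^T, so that R_dual[k] . R[l] = delta_kl.
R_dual = np.linalg.inv(R).T
assert np.allclose(R_dual @ R.T, np.eye(3))

F = {s: rng.standard_normal(3) for s in "AXB"}
S = sum(np.multiply.outer(F[s], R[k]) for k, s in enumerate("AXB"))

exact = S @ R_dual[1]    # unbind position 2 with the dual role vector
noisy = S @ R[1]         # unbind with the original (non-orthonormal) role vector
print(np.allclose(exact, F["X"]))        # True: exact retrieval
print(np.linalg.norm(noisy - F["X"]))    # nonzero: cross-talk from the other roles
```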
It gets a new level of description, where we can talk about the data as being these tensors, as opposed to just a bunch of numbers which are activity values, or a bunch of symbols in some structure; we have tensors as a kind of intermediate level. And we can derive techniques that apply to arbitrary distributed representations which are interpretable, in the sense that we can understand exactly what the data are representing. We can build knowledge into networks that process these representations, because we know how the data are represented. We can construct programs to do particular calculations that we want; I'm going to give you some illustrations of that. We can write grammars that will allow networks to pick out the tensors which are the embedding of, let's say, the trees that are generated by the grammar or evaluated by the grammar. And of course, you also get massively parallel deployment of this knowledge, as I mentioned a moment ago; this is all intended to be construed as neural parallel computation. For cognitive scientists, we get a set of models that are really more about unconscious, automatic, rapid processes than they are about deliberative ones. So reasoning about disjunction is something that we do deliberatively; it would be handled within a larger architecture in which there are inference processes that are built into the network machinery. Okay, so --

>> Audience: The assumption that unbinding automatically [indiscernible.] It's all neural computation [indiscernible] resultant factor. Somehow [indiscernible.]

>> Paul Smolensky: Well, the main thing about unbinding is I don't think the brain does it.

>> Audience: [speaker away from microphone.]

>> Paul Smolensky: We do it when we try to interpret the states of the brain. So that was a gross exaggeration, but the point is that whereas straightforward implementations of sequential processing and so on would have us unbinding all the time before we do anything, I think it's the exception rather than the rule that unbinding would be done --

>> Audience: [speaker away from microphone.] -- a delay or something, in the previous slide?

>> Paul Smolensky: It's just a very crude characterization of sort of the threshold of consciousness and what kinds of processes are not accessible to it.

Okay. So I want to give you some examples of programming with these tensor product representations and what kinds of functions can be computed, as a way of arguing that this isn't just any old way of taking vectors and giving them names like this tree, that tree, the other tree, but rather it's a form of vector encoding in which neural operations can do what we want in terms of computing the functions for which those symbolic structures were posited in the first place. Next time I'll talk about grammars; I won't talk about that today. I'll talk about something more straightforward, and we are going into the dark land of super-nerd slides for just a little while. Hopefully it will give you a feel for what I mean when I say that these networks can be programmed.

So suppose we have a function that takes a sentence of a certain structure and maps it into an interpretation in the form of some logical form like this. I want to construe that really as a mapping between binary trees. Here is a toy binary tree for something that we can pretend is a passive structure, and it gets mapped into a binary tree encoding of this proposition here, one of many ways of doing it.
But having chosen this particular way allows us to write this function down in a Lisp-like notation: extracting the right child of a node, extracting the left child of a node, putting two children together to form a mother node from them. Those are the primitive operations for tree manipulation here, and in terms of them we can write down an expression for this function. And no matter how big P is as a sub-tree, no matter how big the agent sub-tree here is, this function will do the right thing and put all of A here, however big it may be, and all of P there, and so on.

So here is a network that computes this function, in the sense that this group of units is the input pool, and the activation pattern here is the embedding, using the tensor product isomorphism, of this tree, with some selection having been made for what activation patterns correspond to R0 and R1 and the symbol A and the symbol aux and so on. Actually we don't need A; we just need aux. Having made some choices about what numerical patterns will be used to embed the primitive bits of this, it's all assembled using the tensor product scheme I just laid out. Here is our input to the network, and this output pattern stands in the same relation: it's the embedding of what we hope will turn out to be this. And the operation from the input to the output is nothing but a matrix multiplication. This is a linear network, the simplest kind of connectionist network you can have. And so the implementation of this function is just multiplication of the input by this matrix of weights here.

Now, we can actually write down what the weight matrix needs to be, and here it is. So, having this expression for the function in terms of the primitives of binary tree computation, we can write down an exact expression for what this matrix looks like. This is really the upper left corner of an infinite matrix, or an unboundedly big matrix, because each of these things is actually a very, very simple but unbounded matrix that implements extracting the right child, or extracting the left child, or stringing together two children. So despite the fact that we have a distributed representation in which the members of this tree are mooshed together, we know exactly what weights it takes to produce exactly the right moosh to embed the output that we want.

>> Audience: The W here -- [speaker away from microphone.] based upon the principle you talked about earlier and [indiscernible]. The unbinding operation, you get that from --

>> Paul Smolensky: Yes, essentially, essentially, yes. So essentially what this does is take an inner product with the dual of the role vector for binding the right child, that's right. Yes?

>> Audience: [speaker away from microphone.] Unique, given the input and output?

>> Paul Smolensky: This W, the whole W here?

>> Audience: The previous slide.

>> Paul Smolensky: Hmm, well, I guess if the -- I don't know. I think so.

>> Audience: [speaker away from microphone.]

>> Paul Smolensky: What is learning?

>> Audience: [speaker away from microphone.]

[laughter.]

>> Audience: This one doesn't require learning. [speaker away from microphone.] You have many, many examples. You won't be able to see that directly. At least if you do the least squares, you get the optimal [indiscernible] that gives the error. Is that the way you think about it? [speaker away from microphone.]
>> Paul Smolensky: Well, it is an interesting question whether there is a learning algorithm such that, if you give it a whole bunch of pairs that are actually instances of this mapping, it will learn a matrix that does it. I will tell you one thing which will help a lot. There is a theorem that says that recursive functions -- well, at least the ones in the class that includes this one -- all have a certain form in which they are the product of some very simple matrix (it's really an identity matrix, but infinite, for all of the depths of the trees) times some smallish matrix which characterizes the particular function that you're implementing. So a learning system that was built to incorporate that structure would be able to generalize from what it observes at shallow depths immediately to what should happen at deep depths. So what that really is is building in a kind of translation invariance within the geometry of the tree. So like a vision system that has translation invariance built in somehow, you get generalization in the same sort of way here. But let me --

>> Audience: [speaker away from microphone.]

>> Paul Smolensky: Let me just --

>> Audience: [speaker away from microphone.] So you can figure out what the W error is, given the constraint on the dimensionality of the --

>> Paul Smolensky: I mean, I haven't worked on questions like: if we don't have linearly independent vectors for all of our roles in the tree, what is the best matrix, and how would you possibly be able to learn it?

>> Audience: That example. [speaker away from microphone.] What are the dimensions of it?

>> Paul Smolensky: So if you mean what order tensor is it? I mean, it's operating on, it's multiplying tensors to produce other tensors. So it's --

[overlapping speech.]

>> Paul Smolensky: But we can write -- this expression itself characterizes essentially an infinite matrix, which can deal with trees of any depth.

>> Audience: [speaker away from microphone] what is W cons actually, is that just the W sub-[indiscernible.]

>> Paul Smolensky: It constructs a tree by putting together the left child and the right child to create a binary tree with those constituents. So you take W cons 0 and multiply it by the tensor that encodes what you want the left child of the new tree to be, and add to that W cons 1 times what you want the right child of the new tree to be. Then you get the new tree with the right children. But it has sort of two parts, cons 0 and cons 1, because it has two arguments essentially.

>> Audience: [speaker away from microphone.]

>> Paul Smolensky: Okay. Yes, thank you. Thank you. These are actually matrix products here. Yeah. It should say that. Yes?

>> Audience: So if I have the sentence "Few leaders are truly admired by George Bush," is it the same W? Because you kind of hard-coded that you need this aux, B, and by in the -- [speaker away from microphone.] If I insert something, it's a different tree.

>> Paul Smolensky: Yes, so the function I started with here will not generalize to that kind of case. If we could write a function using these operations that did, then we could correspondingly build a network that would compute it. Yeah?

>> Audience: So sorry, I think I've totally gotten confused here. Before, the representation of the tree was a tensor with dimensionality kind of like the depth of the tree or something, but now you're saying there are 2D functions that are infinite in size.
Was there some sort of transformation I missed?

>> Paul Smolensky: Okay. So here is what is potentially quite confusing about the picture. It is totally because of the two-dimensionality of the page that I have these units arrayed in two dimensions here. These are the elements of a tensor that has high order, the order, as you say, determined by the depth beyond which everything is 0. So you could array it faithfully in many dimensions, or you could just linearize it to one long string. And maybe doing the latter would have been less misleading, because there's nothing 2-ish about it actually. But anyway, this is the tensor that we have been talking about all along.

>> Audience: Uh-huh.

>> Paul Smolensky: That you get by binding these symbols to their roles in the tree. That's what this is.

>> Audience: Okay.

>> Paul Smolensky: Then this is an operation that maps one of those to another one of those.

>> Audience: But the operation is that 2D matrix multiplication? Maybe I misheard.

>> Paul Smolensky: If you imagine flattening this tensor out to one long string of numbers, and similarly over here, then we just have a two-dimensional matrix for taking the one string to the other string.

>> Audience: Okay. So then it sort of flattens the tensor into one or two dimensions? In order to ... not just for the diagram, but for --

>> Paul Smolensky: Well, just for the diagram, really. What this is supposed to be trying to depict is really linear transformations from the space of tensors to the space of tensors. So, linear transformations from a space which may have high dimension to another. We can talk about them without any confusion, but we can't necessarily draw them in such a nice way on the page.

>> Audience: Okay.

>> Paul Smolensky: I mean, if you were to see the actual equation that this is depicting, what you would see is that W would have huge numbers of indices all over the place. It would have two huge sets of indices, one for the inputs and one for the outputs. So there would be no 2-ness about that either. Well, except that it is mapping from input to output, so that is two elements that are being related: it only has one input and one output, and it has two sets of indices, one for the inputs, one for the outputs.

>> Audience: [speaker away from microphone.] once you have the space which has tensor structure inside, how do you unbind, for example, once you have the output? How do you know that that is a tensor representation of the knowledge of the [indiscernible] versus some other representation of the knowledge?

>> Paul Smolensky: So in setting this up, some choice was made about what vectors to use to encode left child and right child. And in theory you could make one choice for this embedding and a different choice for that embedding; I don't think I did. I used the same choice for both. But relative to the choices of the embedding of the primitive roles, left child and right child, you can unambiguously take inner products galore with this vector and R0, R01, R0001, and each one will pull out what fills that position in the output tree. And so if you were to take the inner product of this thing with R0, you would get the activation pattern for the symbol. If you did the inner product with the R1 vector, what you would get is a pattern that encodes this tree here, the right child. Then if you took that and took an inner product with R0, you would get a pattern for this tree.
If you kept doing that, you would eventually get down to the atoms, like few and George and so on, that are embedded in those constituents. You guys can duke it out.

>> Audience: [speaker away from microphone] -- the Rs from the original tensors go through the matrix to get the Rs in the output?

>> Paul Smolensky: Rs?

>> Audience: If you apply R0?

>> Paul Smolensky: Yeah.

>> Audience: Or R10, you are going to get B, right?

>> Paul Smolensky: Yeah.

>> Audience: If you multiply by the matrix you should presumably be getting R0.

>> Paul Smolensky: Yes, yes. That's what all these guys are in the business of doing.

>> Audience: So the input is a tensor and the [indiscernible.] [speaker away from microphone.] But there are lots of ways of contracting an N tensor and [indiscernible]. Is it just that you set things up and you take the first N coordinates, and that's the contraction that you do to correspond to the indicated output?

>> Paul Smolensky: I think so. Why would you do anything else?

>> Audience: I mean, is it unique? I can contract and get an N tensor out. There must be something that is happening inside about which ones contract [indiscernible.]

>> Paul Smolensky: Yes, I guess it's true that -- let me back up to something I said before. So the theorem would say there is a linear transformation from the space of tensors to the space of tensors, okay? And there is no unique way of writing down a linear transformation. So you need to make choices if you are going to do that. But as far as I know, it's no different from the choices that you would make in doing that in other contexts. Yes?

>> Audience: Does the mapping still work for any choice of role vectors, as long as they are orthogonal, or do you have to carefully choose them in order to make this work?

>> Paul Smolensky: As long as they are linearly independent we can do this, because we can define these in terms of the dual vectors, which will compensate for any lack of orthonormality that there might be.

So this function here is linearly neurally computable; here is a linear neural network that computes it. So there are results characterizing bigger and bigger sets of functions that are linearly neurally computable. As far as I've gotten so far, it's a class of primitive recursive functions which can be defined in terms of one another by recursions like this. And if anybody's curiosity is piqued, we can come back to that.

Let me try, at a higher level, to characterize what's going on in these results here. So first of all, there is a single-step operation, that is to say a massively parallel operation, from the distributed encoding of the input to the distributed encoding of the output. That's what it means to be computing a function here, and we are implementing them in the simplest kind of neural network, linear transformations. Primitive tree-constructing and tree-constituent-accessing functions are implementable as linear transformations, so we can realize them as matrix multiplication once we make the right choices about bases and all of that. And an arbitrarily complex composition of these operations, tree constructing and tree constituent accessing, is still just a single linear transformation. So linear networks can compute this set of recursive functions, that is, the closure under composition of the primitive tree operations, including the one example we looked at.
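To illustrate the closure-under-composition point, here is a toy numpy sketch added as an illustration (restricted to depth-1 trees, with made-up orthonormal role vectors; this is not the actual W from the slides): extracting a child and constructing a tree are each linear maps on the flattened tensor, and a composition of them, for example swapping the two children, is still a single matrix.

```python
import numpy as np

d = 3                                   # filler dimension
r0, r1 = np.array([1.0, 0.0]), np.array([0.0, 1.0])   # orthonormal child roles

# A depth-1 tree with atomic children p, q is  T = p (x) r0 + q (x) r1,
# an order-2 tensor of shape (d, 2); flatten it to a vector of length 2d.
def embed(p, q):
    return (np.multiply.outer(p, r0) + np.multiply.outer(q, r1)).ravel()

# Primitive tree operations as matrices acting on the flattened embedding:
W_ex0   = np.kron(np.eye(d), r0)               # (d, 2d): extract left child
W_ex1   = np.kron(np.eye(d), r1)               # (d, 2d): extract right child
W_cons0 = np.kron(np.eye(d), r0[:, None])      # (2d, d): put a filler in the left slot
W_cons1 = np.kron(np.eye(d), r1[:, None])      # (2d, d): put a filler in the right slot

# Composition is still one linear map: a function that swaps the two children,
# f(T) = cons(ex1(T), ex0(T)), is the single matrix below.
W_swap = W_cons0 @ W_ex1 + W_cons1 @ W_ex0     # (2d, 2d)

p, q = np.random.rand(d), np.random.rand(d)
assert np.allclose(W_ex0 @ embed(p, q), p)
assert np.allclose(W_swap @ embed(p, q), embed(q, p))
```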
And to implement a recursive function that is defined by a primitive recursion, a recursive equation in this class of primitive recursions that I showed briefly, what we need to do is take the corresponding recursion equation for matrices relating the input vector to the output vector and solve those recursion equations for the matrix, and then we have a way of computing the recursive function. So this is a slide about that, but I'm going to skip it.

>> Audience: So I have a question. From everything I understand so far, the operations are all linear.

>> Paul Smolensky: So far, yes.

>> Audience: In this model. Is it going to stay that way? Curious.

>> Paul Smolensky: No.

>> Audience: [speaker away from microphone.]

>> Paul Smolensky: We will get to this point, we'll get to this point right here, and then we will be talking about multilinear rather than linear operations. But everything is stated in the multilinear -- they're sort of like polynomial functions. Too bad Ronnie is not -- what?

>> Audience: If you [indiscernible] you can multiply these matrices?

>> Paul Smolensky: If you bind politicians --

>> Audience: The outer product between whatever you are binding, the politician and the leader? I don't mean as words, but categories. Like leader might be a word and politician might be a category.

>> Paul Smolensky: Okay.

>> Audience: Then you bind that, and then you can multiply again, basically the same matrix.

>> Paul Smolensky: Yes.

>> Audience: So then you end up with something that -- that doesn't replace the leaders with politicians, but it actually has all that information in the output. It has that relation, plus it has the information --

>> Paul Smolensky: Yeah, that's interesting. I think that should be doable, yup. Yes, that's interesting, yes. So the example that I looked at involved copying of symbols, but the very first statement about the class of linearly computable functions is about transformations that leave symbols in place but change one symbol to another. And what you're interested in is changing one kind of symbol to another kind of symbol, I guess, one that is somehow a category with multiple members and such. Yeah? Chris?

>> Audience: [speaker away from microphone.]

>> Paul Smolensky: All right. So let me see. So the next item here is an example from cognitive science about what happens when you take advantage of the similarity structure of the distributed representations for the roles of symbols in a structure, specifically the roles of phonemes in syllables. But we are again running very short on time. So I wonder -- shall I keep going? Or shall we switch into a more open discussion mode? What is your pleasure?

>> Audience: Well, I have an open discussion question. So far it seems as though each kind of [indiscernible] concept has had its own [indiscernible.] Is that going to hold throughout the format? You're going to have these giant virtual spaces where everything is sort of multilinear?

>> Paul Smolensky: That is what things have looked like prior to coming here. Right. So maybe I'll jump down to the slides about projects going on here. Why am I not getting my slides here? So let's see. Okay, so one of the projects involves taking tensor product representations for syntactic trees, dependency parses, and mapping them into tensor product representations for graphs in the formalism of AMR, for the meaning of the sentences. And we are interested in learning the transformation between them.
And I'm interested in learning the vectors that are used to encode the symbols and their roles in the structures. But so far we haven't done any of the learning work yet; the algorithms are waiting, but they are not tried yet. And we use the same representational scheme as I indicated before for representing these two types of graphs: by representing all the triples in them and superimposing them, adding them together. Sorry, microphone, I guess I shouldn't touch my shirt.

And to have some kind of neural network that does the mapping from one tensor product representation to another is what we're shooting for, but my boss would only allow me 20 dimensions, and I have 15,000 symbols. So after I stopped cursing him, I started looking into the question of: well, if you can only afford 20 dimensions, what can you do in generating 15,000 vectors, representing the 15,000 words which appear in the corpus of inputs and outputs for this task? And here is a picture of what happens when you try to minimize a function which penalizes large dot products between vectors. So what we have here is a plot of the resulting vectors from an example run, and what you see is that there are 15,000 vectors, each 20 numbers long, and the biggest dot product between two different ones is .72. So that means that the dot product -- you remember, when we tried to do unbinding, we really want the vector that is binding a particular position we're interested in to have the property that when you take its dot product with itself you get 1, and when you take its dot product with all of the others you get 0, so you wipe out all the other constituents that you are not going after. What we get here instead: we get a 1 for the dot product of the role vector with itself, just because we choose them to be normalized, but we get up to about .7-something for the other roles that we are actually trying to exclude. That constitutes the noise that we have in the representation, because we don't have linearly independent role vectors.

>> Audience: [speaker away from microphone.] and beta here?

>> Paul Smolensky: What is that?

>> Audience: You are solving for the vJ and vK?

>> Paul Smolensky: No, I'm sorry. I just fix beta.

>> Audience: That's strange, because I can make the vJ.vK negative.

>> Paul Smolensky: So they are all normalized vectors. So you can try to make them negative, but you can't make them huge.

>> Audience: And there's no notion of similarity, semantic similarity, between the vectors?

>> Paul Smolensky: Right. So this is a case where we are trying to make things as orthogonal as we can, but in other situations we would rather want to impose some. So one of the interesting things about learning is, once the network mapping syntax to semantics can influence the choice of these things, will it prove helpful to make the vectors for certain pairs of symbols similar, or for certain pairs of roles, for that matter? So that's one of the projects.

>> Audience: [speaker away from microphone.] coordinates as you go along? Right? Just keep the [indiscernible] small, but when you need to discriminate among things at a high level, you add more features, more like principal values, almost. Your vector is actually never fully represented [indiscernible] vector.

>> Paul Smolensky: Right. I haven't pursued that line of inquiry yet, but it's interesting. I don't think what I have pursued is what you have in mind, but it certainly is close if not the same.
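Here is a rough sketch of the kind of packing objective just described, added as an illustration (small numbers instead of 15,000-by-20, and a simple penalty and update rule of my own choosing, not necessarily what was actually used in the project): many unit vectors are squeezed into a low-dimensional space by penalizing their pairwise dot products, and some residual cross-talk is unavoidable once there are more vectors than dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 300, 20                     # toy stand-in for 15,000 symbols in 20 dims

V = rng.standard_normal((n, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)

def max_offdiag_dot(V):
    G = V @ V.T
    np.fill_diagonal(G, 0.0)
    return np.abs(G).max()

# Gradient descent on the sum of squared pairwise dot products,
# re-normalizing to unit length after each step (an assumed recipe).
lr = 0.05
for _ in range(500):
    G = V @ V.T
    np.fill_diagonal(G, 0.0)
    grad = 4.0 * G @ V            # d/dV of sum_{j != k} (v_j . v_k)^2
    V -= lr * grad
    V /= np.linalg.norm(V, axis=1, keepdims=True)

print(max_offdiag_dot(V))          # residual cross-talk; cannot reach 0 when n > d
```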
And that is, we really want the -- so the idea is to have a purely hypothetical space of dimension 15,000 here for each of the orders of the tensor, and then to do something like principal components analysis over the set of vectors that we are really trying to process, to pull out a lower dimensional subspace that is more manageable. And that can be done incrementally, as examples are processed. We haven't started on carrying that out; I have an algorithm but don't have any experience with it. Yeah?

>> Audience: [speaker away from microphone.]

>> Paul Smolensky: Yeah.

>> Audience: [speaker away from microphone] analysis?

>> Paul Smolensky: Well, they are dependency trees.

>> Audience: They are not [indiscernible]?

>> Paul Smolensky: They are just dependency parses.

>> Audience: [indiscernible] coming from the Stanford parser?

>> Paul Smolensky: Yeah, either the Stanford parser or the Illinois parser.

>> Audience: Okay.

>> Paul Smolensky: But the idea is that we are helping ourselves to standard symbolic means for taking an input string and producing some kind of parse of it. And for our purposes, of course, the more the structure in the syntax mirrors the structure in the semantics, the better off we are; the better our chances are of being able to readily implement the mapping between them. So some people have used CCGs for this purpose. Yeah?

>> Audience: So the dimension of 20, what does that represent? Is that like a constraint on real-time hardware? Single view?

>> Paul Smolensky: Usually bigger than that. We need 20 dimensions for A, another 20 for B, so we're up to 400. Then we have ten for R. That's 4,000. So the overall dimensionality that we get is 4,000.

>> Audience: Oh.

>> Paul Smolensky: Part of the reason that we have to be very skimpy here is that it gets squared along the way.

>> Audience: Okay.

>> Paul Smolensky: But another fact is that we don't have the worldwide Web as our database; we have 10,000 example pairs to work with. So the number of parameters in the network that's mapping from one to the other -- because, right, we have 4,000 inputs and 4,000 outputs -- 16 million connections? And 10,000 examples to train them on. And so there are various considerations, aside from sheer computational cost, that also mean that if we are too profligate in our representations we won't expect to generalize well beyond the training set.

>> Audience: [indiscernible] is that the input side or the output side?

>> Paul Smolensky: Well, the number actually went up to 15,500 after I did this. That was the union of the two.

>> Audience: [speaker away from microphone] -- on the input side that never show up on the output side?

>> Paul Smolensky: Well, this number is coming down as we speak. The intern working on this is doing named entity recognition, replacing all the proper names with just a single symbol. So that number is going to go down. That will reduce the amount of noise that we have to cope with a lot.

And let's see. So another task, and one idea for it -- this isn't the original idea for the intern, but it's a more straightforward way of using tensor product mechanisms to do it -- is a task that Facebook put together, and it has gotten a certain amount of attention for reasons that are somewhat mysterious to me, but it has. And the idea is to code things in a relational format and use this exact mechanism of just binding together, by the outer product, the relation and its arguments. So here is what the task looks like.
Let me see if I can do anything useful about being able to read these things. So there is some sort of old-fashioned adventure-game world for generating sequences of happenings that get expressed in simple sentences. So you get sentences like "John promised a baseball to K. J gave a baseball to L." You get a bunch of sentences in some sequence, and at various points during that sequence you have to answer questions like: What did J give to K? Or: Where is the football? Things like that. So there is a supposedly increasingly difficult series of question types for your program to answer. And so here the example is: What did J give K? The approach here is to translate these things in a pretty straightforward way into a logical form that is called neo-Davidsonian by some people in linguistics, where we have these event variables and properties of events -- like, this is a promising type of event that is being described here, and that event has as its agent J. Very, very straightforward, simple things. But the point is, as soon as we write it this way, then we can use the mechanism I just described and have tensor product representations up the wazoo for all these predications, and we add together the information in multiple sentences, and then we query -- oops. There they are! Now they're back. We query; so the question was: What did J give K? So we have one tensor which has all the information from these four sentences, and we do some querying to figure out what event is a giving event in which the agent is J and the recipient is K, and having identified that event, then: What is the theme in that event? And that's the thing that was given. So anyway, we use outer products -- I haven't rewritten this slide to get rid of all the tensor product signs -- but that's in the works. We don't have results. >> Audience: [speaker away from microphone.] >> Paul Smolensky: What is that? >> Audience: How do you figure out the [indiscernible.] >> Paul Smolensky: How do you figure out where it is? >> Audience: Yeah. >> Paul Smolensky: Well, you have to find some kind of event that is in a class of events that localize something, and query for events of that type. And what this simple scheme does not do, which needs to be done for that type of question really, is to tag events with times, so that you can keep track of something being in multiple places as the story progresses. And then I don't know how we are going to be able to simply determine the most recent event that identifies the location of this object. I'm not sure how to do that. But -- >> Audience: [speaker away from microphone.] what did J give K? How are you going to solve that? >> Paul Smolensky: What did J give K? >> Audience: Unbind the ... because you've now got, there's a giving. There's a relation, giving, between J and -- I mean, give is a three-place predicate. >> Paul Smolensky: Well, in neo-Davidsonian terms we only have these triples. So there isn't a proposition that has both J and K in it. There are two propositions that say J is the agent of event one and K is the recipient in event one. So you have to be able to cope with that, which makes certain things harder, not easier. >> Audience: If you want to unbind such that you recover the theme? >> Paul Smolensky: Yeah. So you need some unbinding to determine which event is the relevant event: the giving event that has this agent and that recipient.
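A toy, end-to-end version of exactly this two-step query is sketched below. The story fragment, the symbol inventory, and the one-hot (hence exactly orthogonal) symbol vectors are illustrative assumptions only; the system being described uses learned, lower-dimensional vectors that are merely approximately orthogonal, which is where the noise discussed earlier comes back in:

```python
import numpy as np

symbols = ['e1', 'e2', 'promise', 'give', 'type', 'agent', 'recipient', 'theme',
           'J', 'K', 'football', 'baseball']
# One-hot symbol vectors, purely for clarity of the example.
vec = {s: np.eye(len(symbols))[i] for i, s in enumerate(symbols)}

def bind(event, role, filler):
    """Third-order tensor-product binding of one neo-Davidsonian triple."""
    return np.einsum('i,j,k->ijk', vec[event], vec[role], vec[filler])

# Hypothetical story: "J promised a baseball to K.  J gave a football to K."
story = (bind('e1', 'type', 'promise') + bind('e1', 'agent', 'J')
         + bind('e1', 'recipient', 'K') + bind('e1', 'theme', 'baseball')
         + bind('e2', 'type', 'give') + bind('e2', 'agent', 'J')
         + bind('e2', 'recipient', 'K') + bind('e2', 'theme', 'football'))

def unbind(T, event, role):
    """Recover the filler bound to (event, role) by contracting with their vectors."""
    return np.einsum('ijk,i,j->k', T, vec[event], vec[role])

# Query "What did J give K?": find the event that is a giving whose agent is J and
# whose recipient is K, then unbind that event's theme.
def match(e):
    return (unbind(story, e, 'type') @ vec['give']
            + unbind(story, e, 'agent') @ vec['J']
            + unbind(story, e, 'recipient') @ vec['K'])

best = max(['e1', 'e2'], key=match)
answer = unbind(story, best, 'theme')
print([s for s in symbols if answer @ vec[s] > 0.5])   # -> ['football']
```

With only approximately orthogonal symbol vectors the same contractions go through, but the match scores and the recovered filler become graded rather than exact.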
And once you've found the event, then you probe the tensor representing the story to find out what element in that event is bound to the theme role, which is what is given -- that's the name I chose for what gets given. So you unbind the theme for that event. >> Audience: And then where in the representation of "what" do you know that that's how you are going to be unbinding? >> Paul Smolensky: Well, I guess you have to translate this question into the form: return the theme of the event which is of type giving and has agent J and recipient K. You have to translate it into that somehow. Somehow. Yeah. Right. So anyway, again we have relatively high-order tensors here, so the dimensionality of the vectors that encode these is going to be the product of the dimensionalities of these three types. And so we are also going to have the challenge of putting a lot of triples of symbols, many different types of triples of symbols, into some manageable size of vector space. So let me see if there's any kind of sensible conclusion here -- oh, let me just mention this, what we have for reverse engineering, because I was really hoping to get some of this done here. I'm just going to show one slide and then I promise I'll stop. So we have a machine learning approach using a generative model to address the following problem, and I would like to try to use it on networks that have been trained to produce English output, to analyze the representations in the hidden layers of these networks with this tool. So the problem that this tool addresses: you're given a set of vectors which are, let's say, the states of the hidden layer in a deep neural network, although we're also going to try to do it for activation patterns in neural recordings. Different of these vectors correspond to different inputs; you know which input corresponds to each of these hidden-layer states, let's say. You want to interpret them in order to be able to say how they are encoding the domain data, to the point where you can actually explain how the network succeeds in finding the output for the input using these intermediate representations. And the hypothesis that this pursues is that the vectors that you're given here at the top are actually tensor product representations, unbeknownst to anybody. So you have a generative model which produces tensor product representations using choices for what the roles and fillers are for each of the inputs and what vectors encode each type of role and each type of filler. And those are what have to be fit in the learning process, in order to maximize the likelihood of the set of vectors that you received as your starting point. So it's a standard generative model, and standard techniques have been implemented and tried on some synthetic data with partial success. But I'd like to see whether anything useful can come out of applying it to actual hidden layers of networks that do the things in language which we need to understand, and about which we have some information from linguistics about what might be in the representations to make it possible for some system to correctly produce those kinds of sentences. And I think I will stop right there. Thank you for your patience. [applause.] >> Paul Smolensky: Two, two minutes left before noon. No parting questions? >> Audience: [speaker away from microphone.] the Facebook presentation, what are the results? Do you have some results? >> Paul Smolensky: We have not implemented what I showed you yet.
The intern who is working on it implemented a much more special-purpose approach, which had a lot of the similar ideas buried in it, but not in such a generalizable form. And he claims to be able to answer 18 of the 20 types of questions at this point, at last report. >> Audience: [speaker away from microphone.] >> Paul Smolensky: I believe that's the claim, yeah. Yeah. But how hard it is to do that, making all of the assumptions that are being made, I don't really know. >> Audience: But what's the data set, right? Synthetic data? >> Paul Smolensky: Yup, yup. >> Audience: In case you're curious, we have a data set that is not synthetic, that has questions we can point you to, if you care -- >> Paul Smolensky: Oh, yes, yes. These are not synthetic fictional stories? >> Audience: No, it's [speaker away from microphone.] 500 short stories, short fictional stories with multiple-choice questions. >> Paul Smolensky: Right, multiple-choice questions; I'm about to learn more about those. Chris has sent me some stuff. Yeah. All right, well, Friday I'll try to say something more interesting from a linguistics/language point of view. Thank you.
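The reverse-engineering tool described just before the close can be sketched, in a much-simplified form, as a least-squares fit. Here the role/filler analysis of each input is assumed known in advance and only the embedding vectors are learned, whereas the actual generative model treats those analyses as latent choices to be fit as well:

```python
import numpy as np

def fit_tpr(states, sequences, n_fillers, d_f, d_r, steps=2000, lr=0.05, seed=0):
    """Fit filler vectors F and positional role vectors R so that each observed
    state is approximated by sum_t outer(F[seq[t]], R[t]), flattened.
    states:    (N, d_f * d_r) array of hidden-layer vectors
    sequences: list of equal-length lists of filler indices, one per state"""
    rng = np.random.default_rng(seed)
    L = len(sequences[0])
    F = 0.1 * rng.standard_normal((n_fillers, d_f))
    R = 0.1 * rng.standard_normal((L, d_r))
    X = states.reshape(len(states), d_f, d_r)
    for _ in range(steps):
        gF, gR = np.zeros_like(F), np.zeros_like(R)
        for x, seq in zip(X, sequences):
            recon = sum(np.outer(F[s], R[t]) for t, s in enumerate(seq))
            err = recon - x                    # (d_f, d_r) residual
            for t, s in enumerate(seq):
                gF[s] += err @ R[t]            # gradient w.r.t. F[s] (up to a constant)
                gR[t] += err.T @ F[s]          # gradient w.r.t. R[t] (up to a constant)
        F -= lr * gF / len(X)
        R -= lr * gR / len(X)
    return F, R
```

Even on synthetic states built from known vectors, a fit like this can recover them only up to a change of basis; whether anything interpretable emerges from real hidden layers is exactly the open question raised in the talk.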