>> Li Deng: Okay, thank you, everybody, for coming to this final part of the lecture series by Professor Paul Smolensky, and we thank Lucy for co-organizing this series for us. We have decided to open up this whole series to the public, and today is the very last one. Paul will stay here until mid-December, so if you have more questions, you can come and approach him directly. And because we are going to open up this whole series to the public, we decided not to discuss any internal projects we are working on here, but if you do have any questions within Microsoft, come and talk with Paul and myself, and maybe Lucy as well. So thank you very much, Paul, for spending a few months with us, for the collaborative work with us, and also for giving this very insightful series of lectures. We appreciate it very much, so it's over to you for today. Thank you.
>> Paul Smolensky: Well, it's such a treat to have a dedicated group of people who really want
to understand, and so that's been very gratifying, and I've learned a lot in the process. I think I'm
supposed to maybe put this out of the way. Okay. So today, I am hoping to make four points.
First, in practice, standard tensor product representations are not, as universally believed, larger
than alternatives. Second, all known proposals for vectorial encoding of structure are cases of
generalized tensor product representations, which I have yet to define. There is a little evidence
for tensor product representations in the brain that I will tell you about. There is a topic that's
been held over many times from previous lectures, which I want to get to if I can, showing what
kind of serious symbol processing can be done with tensor product representations to try to really
lay to rest any questions there might be about whether you can do real symbol processing with
these networks. Okay, actually, if you'll bear with me a second, I'm going to just restart this.
Okay. So I want to talk about what other people have said about the size of tensor product
representations to give some context for the comparison I want to make between sizes of tensor
product representations and others. And Chris Eliasmith is an important figure in the field. If
you're not aware of him, you should perhaps check out this article in Science, which is quite an
amazing accomplishment of training networks of neuron-like units, which are more seriously
committed to biological reality than most neural networks are, including my own. And so he has
tried to do much of the same sort of thing that I have tried to do, but with more emphasis on
biological validity, to try to tie together the neural level and higher cognitive levels. But he
believes that tensor product representations are too large, so in one of his papers from this year,
he quotes his book as showing that in coding a two-level sentence, such as Bill believes that Max
is larger than Eve, where lexical items may have hierarchical relations of depth two or more, this will require approximately 625 square centimeters of the cortex, which is about a quarter of
the total cortical area, which he finds implausible, and I have to totally agree, that if that were the
truth, then it would be bad news. He elaborates on why he believes that this is the right figure in
the footnote there. Conservatively, let's assume that only eight dimensions are needed to
distinguish the lower-level concepts like mammal. Then, a representation of an individual like
Eve, who is somewhere in a hierarchy of being, Eve is a person and a person is a mammal, and
so Eve ends up three levels down. If we have eight units involved in the vectors at each of these
levels of multiplication, then we end up with 512 dimensional vectors for the individuals. And
then, putting them in the sentence, if you put individuals with 512 unit vectors into depth two,
then you end up with 512 times 512 times 512, which is 12.5 times 10 to the 7th dimensions, or
12.5 times 10 to the 9th neurons, because elsewhere he's argued that you need 100 neurons per
node of a simple connection network in order to get the right signal to noise ratio for these
networks to work properly. So now, the thing is that this is not the right way to calculate the
number of units needed in a tensor product representation for a tree involving Eve and the other
figures in this sentence. What was said here would be correct if -- did I get that laser pointer? It would be correct --
>> Li Deng: Magic.
>> Paul Smolensky: Thank you. The human brain is a wonderful thing. Okay, so it would be
correct -- I finally remembered to bring my own special pointer, which those of you who know optimality theory will appreciate -- the importance of the pointing finger. This is too high for
optimality theory. So it would be correct, if the way that we represent something like ABC is A
times B times C. In other words, if what we did to represent a pair, left child, right child, was to
multiply the two of them together, then as we went up to higher levels of embedding, we would
in fact pile up multiplications of filler vectors, each of which would be the size of an individual.
So this calculation would be correct if that were the way we represented trees, but we don't
represent trees that way. What we actually use is PQ is represented as P times the role of left
child plus Q times the role of right child, so P and Q are not multiplied together. They're added
together, and they're multiplied by something which is in fact a very small vector, a vector of
size two in most work that I've done, because we just need two vectors in this little space to be
linearly independent, R0 and R1, so a two-dimensional space suffices. So those are tiny vectors.
And when we do go to depth two, then the representation we get looks like this. We do end up
multiplying together to get a third-order tensor, the way it was claimed here, but it's not three
individuals times each other. It's one individual times one role vector times another role vector,
and each of these only is of size two. So the dimension grows as the dimension of the filler
vectors A times the dimension of the role vectors raised to the D power, where D is the depth. It
does not grow as the dimension of A to the depth, which is what the Eliasmith calculation here
was assuming. And it certainly doesn't grow as the dimension of A times the dimension of R,
raised to the D power, which is what Gary Marcus assumes in a calculation we'll look at in a few
slides from now.
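To make the dimension arithmetic concrete, here is a minimal NumPy sketch of the point just made, using hypothetical sizes (512-dimensional fillers, 2-dimensional roles, depth 2); it contrasts the actual growth of a tensor product constituent, dim(filler) times dim(role) raised to the depth, with the dim(filler)-to-the-depth growth assumed in the quoted calculation.

```python
import numpy as np

# Hypothetical sizes mirroring the discussion: 512-dimensional filler vectors
# for individuals and 2-dimensional role vectors r0, r1 for left/right child.
filler_dim, role_dim, depth = 512, 2, 2

# One constituent bound at depth D is  filler (x) r_i1 (x) ... (x) r_iD,
# so its dimension is filler_dim * role_dim**depth -- linear in the filler size.
tpr_dim = filler_dim * role_dim ** depth            # 512 * 4 = 2,048

# The criticized calculation instead multiplies whole filler vectors together,
# A (x) B (x) C, which gives filler_dim**(depth + 1).
claimed_dim = filler_dim ** (depth + 1)             # 512**3 = 134,217,728

print(tpr_dim, claimed_dim)

# The same contrast with an explicit tensor product of concrete vectors:
A = np.random.randn(filler_dim)
r0, r1 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
constituent = np.einsum('i,j,k->ijk', A, r0, r1)    # A bound to a depth-2 role
assert constituent.size == tpr_dim
```

The full tree representation is the sum of such constituents, one per filler, so with 2-dimensional roles the total stays far below the figures quoted above.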
>> Li Deng: So what makes people get this wrong calculation?
>> Paul Smolensky: Well, it is a mystery. In another paper, in 2014, Eliasmith clearly uses
positional role tree representations. They use representations of exactly this sort in another
paper, the previous year. So why it was not assumed that that was the way to think about tensor
product representations, but rather that this was, is a mystery. It is true that in the way they
represent structure, the operation that they use here keeps the dimensionality of the vectors the
same, so they have the option of doing this in a way that the tensor product scheme does not
really have a viable option. And it is true that they like to do it that way with their operation, but
it doesn't make sense to do it that way with our operation. Yes.
>>: Let me ask you a clarifying question. So the addition, the vector addition, is the direct sum or
plain addition?
>> Paul Smolensky: Direct sum.
>>: Direct sum, okay.
>> Paul Smolensky: Yes. Well, yes, this one and this one are direct sums. This one can be an
ordinary sum. All right, so I'm not sure why they made this calculational error, but it has to do, I
think, with applying their procedure for encoding structure and assuming that the tensor product
scheme would do it the same way, and that's not the case. Okay. In this paper, this is the way
they talk about representing this tree. Okay, exactly the same way that I do it. So it's not something that was not understood; it's even something that was used by them, but in the context of their operator for combining information here, their form of multiplication rather than
tensor product. Okay. So another person who's claimed that tensor product representations are
too large and who has a prominent place in this part of the research field, because like me, he
believes that it's important to find ways of bridging the connectionist and symbolic levels, he
believes that tensor product representations are too large. So what he says is that suppose each
filler can be encoded by a vector of five binary nodes. Encoding a tree with five levels of
embedding winds up taking 10 times three -- oh, I should have said each role can be encoded
with three nodes, three-dimensional vector. That's where the three comes from, 10 for the filler,
three for the role. Five levels of embedding, he takes this to the fifth power and claims we need
24 million nodes. Well, the truth of the matter is we need 7,000 nodes -- 7,280 nodes, because as
I said on the previous slide, we don't raise this number 10 to any power at all. It's linear in the size of the filler vectors. And you only get exponentiation of the role vectors -- he was
assuming that they were of size three, and so that's what leads to this number.
>> Li Deng: That number requires that the role vector has dimension two or three smaller?
Because R vector that you used earlier could be larger dimension or it could be smaller
dimension.
>> Paul Smolensky: Yes.
>> Li Deng: But to reach that number --
>> Paul Smolensky: This here? This assumes, along with Gary Marcus here, that the R vectors
are of length three, dimension three. So when we go to depth five in the tree, we're not taking
filler, which is of size 10, times filler, which is size 10. We're not doing that five times. We're
just taking filler times role, one filler, and the role has multiple factors of the primitive roles from
which all the others are recursively defined. So you get the role number raised to powers, and
that number is just three, according to Gary's assumption here. Now, in fact, I use two, and so
we could even say that rather than 24 million, what you need is 630. So you see, tensor product
representations are getting a bad rap.
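As an aside on the arithmetic, the gap just described can be reproduced in a few lines; the per-depth summation convention below is an assumption made for illustration (exact totals depend on how depths are counted), but it recovers the 630 figure for 2-dimensional roles and the 24-million figure for the criticized calculation.

```python
# Binary trees with fillers of dimension 10, embedded to depth 5.
filler_dim, depth = 10, 5

def tpr_tree_units(filler_dim, role_dim, depth):
    # A constituent at tree depth d occupies a block of size filler_dim * role_dim**d;
    # sum the blocks over depths 0..depth (an assumed counting convention).
    return sum(filler_dim * role_dim ** d for d in range(depth + 1))

print(tpr_tree_units(filler_dim, role_dim=2, depth=depth))   # 630, with 2-dim roles

# The criticized estimate instead raises (filler_dim * role_dim) to the depth:
print((10 * 3) ** 5)                                          # 24,300,000 -- the "24 million nodes"
```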
>> Li Deng: But when you have very, very small dimensions for R, what kind of things do you
lose?
>> Paul Smolensky: Nothing.
>> Li Deng: What if you have noisy encoding, something gets corrupted.
>> Paul Smolensky: So as long as the distributed representation of roles is what we've been
calling a proper one, so that they're linearly independent, the two role vectors will have two
corresponding dual vectors, which you can use to unbind with perfect accuracy. It's only when
you go beyond the ability to have linearly independent vectors that you start getting errors and
noise.
>> Li Deng: So in that case, R, increasing the dimensionality of R, is it going to help? The
solution is somewhere else, other than increasing the dimension of R.
>> Paul Smolensky: If you had some inherent noise in the system, if you had a noisy
computational system, then it might help to have a somewhat more commodious space to put
your vectors in, but I'm not even sure that that's true, so it's not clear to me that it can help.
>> Li Deng: So in this case, you're saying that in practice just use dimension level two.
>> Paul Smolensky: I haven't seen a reason to use anything but two, myself.
>> Li Deng: Okay, that's good. Thank you.
>> Paul Smolensky: Okay, now, so why does Gary do this business of 10 times 10 times 10 five times, when in the previous paragraph, just before this one, he clearly describes positional role tree representations correctly? So if you just look back to the previous paragraph, it lays it out exactly right. You take the left subtree and multiply it by a vector that represents the word, and the right subtree vector and multiply that, and add them together -- it's a perfect description of the way that I represent trees. But then when it goes on to how big it is, there's this big deviation from that description to something else.
>>: But I guess you can interpret that paragraph both ways. That paragraph could be interpreted
both ways, as if he's only quoting the role, or he's keeping the whole left subtree in his
calculations.
>> Paul Smolensky: Yes, I considered that possibility, too. And I believe I found evidence that
wasn't -- couldn't really be what the intention was, but I'm not sure that I could tell you what it is,
and it's possible that it was truly interpreted in a -- in such a fashion that the representation of a
subtree is multiplied by the representation of its siblings, not added to the representation of its
siblings. Okay, so what approach then do they favor? Well, in the talk that Gary gave here, he
had a nice table of all the different kind of computations that he argued we need to be able to do
in order for cognition to get off the ground, and he proposed some algorithms for carrying out
these computations in a way such that they could be neurally implemented, a very nice piece of
work. Now, here is where he talks about how to represent variables and binding of variables,
which I call roles, to fillers. So what he says about them is that they should be done with
holographic reduced representations, which is based on a multiplication operation, circular
convolution. So I'll tell you what that scheme, holographic reduced representations, looks like.
It's based on this multiplication operator that functions to bind together vectors the same way that
the tensor product does in TPRs, but it's a different operation. And it looks like this. So if we
take the vector X -- I'll pretend it has three elements here, and Y -- suppose it has three elements
here. We want them to be the same dimensionality. Then, if we take their tensor product, we get
all of these combinations of products of one element from X and one element from Y. And if we
cycle through them, so that we just repeat this upwards cycling through, pretending that this is
mounted on some kind of circle, then if we form summations this way, we add up these three
elements of the tensor product and call it Z2. These three, we call it Z1. These become Z0.
Then you've gone from a nine-element tensor down to a three-element vector, which is what you
started with, so Z has the same dimensionality as X and Y, and this is called circular convolution.
It can be written out this way. The lambda component of Z can be gotten from products of
multiplying X-alpha times Y-beta, then weighting each one by a number, which is either one, if
you're adding it into the sum, or zero, if you're not adding it into the sum. And the ones that you
add into the sum turn out to be exactly those for which this coordinate is equal to the sum of these two, modulo 3 -- and 3 is the dimensionality; it should say mod D, actually, if D is the dimensionality. This can be recognized as a contraction of a tensor product, so this is clearly a
tensor product where certain subscripts have been set equal to each other and added, so you have
alpha repeated here and beta repeated here, so in this three-way tensor product, what we have is T-lambda-alpha-beta, X-gamma, Y-delta. That's the five-subscript configuration
for this order five tensor product. But then if we contract by setting indices 2 and 4 to be the
same and adding up over all of their values and set indices 3 and 5 to be the same and add up
over their values, then what we get is this sum here, and that is an order 1 tensor. It's just a
vector again.
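As a concrete check of the definition just given, here is a small NumPy sketch that computes circular convolution three ways: by the modular-sum formula, as the double contraction of the order-five tensor product just described (with T chosen as the mod-D addition tensor), and via the FFT, which is the efficient route raised in the next exchange.

```python
import numpy as np

D = 3
x, y = np.random.randn(D), np.random.randn(D)

# (1) Direct definition: z[lam] = sum of x[a] * y[b] over all a, b with a + b = lam (mod D).
z_direct = np.zeros(D)
for lam in range(D):
    for a in range(D):
        for b in range(D):
            if (a + b) % D == lam:
                z_direct[lam] += x[a] * y[b]

# (2) As a contracted tensor product: T[lam, a, b] = 1 iff a + b = lam (mod D);
#     contract the order-5 tensor T (x) x (x) y on indices (2,4) and (3,5).
T = np.zeros((D, D, D))
for a in range(D):
    for b in range(D):
        T[(a + b) % D, a, b] = 1.0
z_contracted = np.einsum('lab,a,b->l', T, x, y)

# (3) The usual fast route: circular convolution is elementwise multiplication
#     in the Fourier domain.
z_fft = np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(y)))

assert np.allclose(z_direct, z_contracted) and np.allclose(z_direct, z_fft)
```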
>> Li Deng: So where does the convolution come into that contraction --
>> Paul Smolensky: The convolution is hidden in the definition of T. Yes. So modular arithmetic has a kind of cyclical, circular character to it, and that's why circular convolution --
>>: Or computationally, you could compute these things in various more efficient ways. You
can use a [indiscernible] scheme of sums. These things can be computed in much more efficient
ways than this. Is that the point that he is trying to make with using these representations?
>> Paul Smolensky: No. It is something that he takes advantage of, but it's not something that is
a designed feature of it, that I know of.
>> Li Deng: So the design feature is to have the same dimensionality, so it never grows.
>> Paul Smolensky: Yes, that's really --
>> Li Deng: But the problem is you said that it's not able to do hard binding.
>> Paul Smolensky: Well, we haven't gotten to that. We'll talk about unbinding very shortly.
>>: So to answer your question, the purpose of the binding operation in any Plate model is to create a new representation, which looks entirely different from the components. So he artificially created that operation, but the problem is, if you'd like to unbind, you are going to lose a lot of information. So unbinding accuracy is pretty bad, at least based on my implementation.
>> Paul Smolensky: Okay, well, we will be talking about unbinding before we finish this slide,
but that's the key point to keep in mind, that unbinding is --
>>: There is sort of a philosophical reason, biological reason, why you would imagine that
additional operations actually matter, and that's basically that everything we sense is sensitive to
these shifts in the world, that everything is shifted a little bit. Convolutionally, it's kind of a
natural thing to have in processing the information, because we have temporal shifts, we have spatial shifts, even in the most primitive sensing techniques, and if you assume --
>> Paul Smolensky: Well, only the most primitive sensing techniques, I would say.
>>: But if you assume that the whole thing is kind of fractal like, that it's kind of layering on top
of each other, that the similar pieces are just using output of the previous sensing processing
algorithm as an input to the next one and so on, if you think of a very simple biological entity that
grows, then you would imagine that tendency to model convolutions is there everywhere in the
brain.
>> Paul Smolensky: Well, I think that there is a qualitative change in the nature of the symmetry
operations going from the level of signals to the level of symbolic encodings. And so for
example, there is a certain kind of symmetry inherent in the structure of trees that are generated
in a context-free language, which is that no matter where in the tree you plant some subtree, it
will retain its grammaticality or status. If it's a good tree here, then it's a good tree here. So
there's a kind of a shift translation. I call it embedding invariance, but that's really quite
different. The equations that describe it are quite different from the equations that describe
translation invariance, for example. So I'm inclined to believe that once you make this transition
to the macro level, the nature of the invariances changes sufficiently that it's not a foregone
conclusion that convolution is the way to capture them anymore. But it could also be that once
convolution is understood in the appropriately abstract sense that what I've been working with
could be thought of that way, too. It's possible. Okay, so how is this operation here, the circular
convolution operation, how is it actually used to build representations in holographic reduced
representations? Well, the representation for the pair AB is A times B. The representation for
AB embedded as a sister to C is C times AB. So this will remind you of what I showed you a
few slides ago. If tensor product representations of trees were done this way, then the
calculations we saw before would be correct. And as I said there, the option of doing it this way
is open to people using these representations because this multiplication maintains the size of the
vectors, in a way that it's not really open for the tensor product operation. And to make the connection
to tensor product representations a little bit -- as tight as I can, let me define a relation -- a
function T, which takes two arguments and produces a binary tree, with this as its left and this as
its right child. Then, the holographic reduced representation of this is, as we saw over here,
gotten by taking the product of A and B with the circular convolution operation, which is a
contraction of this tensor product. So TAB is represented as T, tensor A, tensor B, just like it
would be in a tensor product scheme that used those kind of contextual role representations for
arguments of a function. And the difference is, so we have this T, which is chosen cleverly to
have some nice properties -- it's not just any old T that figures in this calculation, but just the
same, it is an outer product, just like you would expect in a tensor product scheme, but we
contract it twice. So instead of having order five, it ends up having order one. And so we can
say that what that says on the bottom here is that this is a contracted tensor product
representation. What you're looking at here is a contracted tensor product representation. It's a
tensor product representation that has been contracted. Okay. I did want to mention that if the
vectors that you choose to use are binary vectors -- sometimes, people use 0, 1, sometimes they
use 1, -1, and if you use operators that are Boolean operators for summation and multiplication,
then what you get is a system called binary spatter codes -- that's one version of it. And some of the same mathematics that applies when you're using normal multiplication of normal numbers here, which is the standard holographic reduced representation scheme -- some of those mathematical
properties carry over to the binary world, so this is a notion that has various manifestations in the
literature. But as several people have said, as [Munte] said, as Li said, unbinding is noisy. You
can't get something for nothing. You've taken two vectors of size D, and you've somehow
smooshed them together into another vector of size D. You've lost information. Something's
going to cause you to pay for that, and so when it comes to unbinding, it's noisy. What you can do
is, you can unbind by taking the circular convolution with the pseudo-inverse of the vector that
you want to unbind. So I'm not going to go into the pseudo-inverse business, but it's a little bit
like the dual vectors that I have when I have non-orthogonal role vectors. But it's quite different
formally, because it has to be an inverse with respect to this kind of multiplication operation. In
certain cases, vectors are inverses of themselves. You'll notice that it's the same operation that's
used for unbinding as is used for binding, which is different from the tensor product scheme,
where we have outer and inner. But because it's noisy, what you have in these kind of
representational schemes, people are constantly talking about cleanup. So when you try to
unbind and retrieve something from a holographic reduced representation, what you get, instead
of a clean version of the symbol A, is some noisy version of the symbol A, with interference
from the other symbols that were present with it in the structure that you took it out of. And so
oftentimes, it's essential to take what's been retrieved and clean it up, essentially replace the
retrieved vector with the actual one, recognize this is a noisy, messed-up version of A, so let's
replace it with the real A before we continue to use it in further computations. Of course, that's
not necessary with tensor product representations at all. Now, this noise business leads to
problems when you try to do something with HRRs that we've been doing with TPRs for some
time. We talked about harmonic grammar and how you can use it to do parsing earlier in the
lectures, and in this interesting paper that came out this year, the same overall enterprise was
investigated using HRRs instead of TPRs, and the noise that gets added to the energy -- which, you may remember, is what we call harmony, or the negative of it -- makes the computation of the harmony of structures, which is what the harmonic grammar is trying to optimize, problematic. And so to avoid this problem in the
simulations, what they had to do was break up the network into a whole bunch of sub-networks,
one for each locally well-formed tree in the grammar, break up the vector, the state vector, into
parts that correspond to these sub-networks, and compute the energy of each of the subnets, adding
them all together to get the total energy. So they have to go to great extremes to cope with the
cost of sticking with the same-dimension representations for trees as for symbols -- the noise that arises when you try to use those representations for things like computing harmony values. And
what that says is, all of this is unnecessary with tensor product representations. We don't have to
do any of this stuff for the harmonic grammar work that we do.
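To illustrate the noise-and-cleanup point in the terms just used, here is a rough sketch with made-up dimensions: HRR binding by circular convolution, approximate unbinding with the involution that serves as Plate's pseudo-inverse, a nearest-neighbor cleanup memory, and, for contrast, exact TPR unbinding. It is an illustration of the idea, not a reconstruction of any of the cited simulations.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 512                                    # hypothetical HRR dimension

def cconv(x, y):                           # circular convolution (HRR binding)
    return np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(y)))

def involution(y):                         # approximate inverse: y*[k] = y[-k mod D]
    return np.concatenate(([y[0]], y[:0:-1]))

# A small vocabulary of random vectors with elements ~ N(0, 1/D), as Plate assumes.
vocab = {name: rng.normal(0, 1 / np.sqrt(D), D) for name in ["A", "B", "r1", "r2"]}

# Bind two role-filler pairs and superimpose them.
memory = cconv(vocab["r1"], vocab["A"]) + cconv(vocab["r2"], vocab["B"])

# Unbinding r1 gives only a noisy version of A ...
noisy_A = cconv(involution(vocab["r1"]), memory)

# ... so a cleanup memory compares it against every stored symbol and keeps the best match.
def cleanup(v, vocab):
    return max(vocab, key=lambda n: np.dot(v, vocab[n]) /
               (np.linalg.norm(v) * np.linalg.norm(vocab[n])))

print(cleanup(noisy_A, vocab))             # "A", with high probability at this dimension

# The TPR version: outer-product binding, exact unbinding with the dual role vectors.
fA, fB = rng.normal(size=10), rng.normal(size=10)
r1, r2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
tpr = np.outer(fA, r1) + np.outer(fB, r2)
assert np.allclose(tpr @ r1, fA)           # no noise, no cleanup needed
```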
>> Li Deng: So the cleaning up, the way to do cleaning up -- was that done by the author himself? Or was it done only afterwards, when you realized how difficult it is?
>> Paul Smolensky: Tony talked about --
>> Li Deng: So the same method of cleaning up, or different ways?
>> Paul Smolensky: There are different ways that have been proposed, and I can't remember
whether these two papers, whether Tony's work and this work use the same method, but what
you have to do is have some sort of cleanup unit that has a stored version of every symbol in it,
which it then can use to compare to the noisy element, find a match and then replace the noisy
version with the clean version.
>> Li Deng: I see. Thank you.
>> Paul Smolensky: Okay, so this is the kind of cost that you pay from noise. Then the question
is, how much benefit do you get in getting smaller representations this way? How much do we
decrease the representational size because we're willing to put up with noise? Okay, so here's a
plot of some comparisons between research projects that have been done using holographic
reduced representations, so these are taken from the literature. These are all cognitive models
that have been done using HRRs, and what I'm comparing them to is the size that would have
been needed had they used tensor product representations instead, so we can see how much of a
savings in size there is for all of the headaches that noise gives you. Okay, so here's Tony Plate's
dissertation in 1994. So I'm showing you, first, what would be needed with a tensor product
representation to do what was done there. You would need 10 units in your network. He had
1,000. Plate 00, you would need 420 in the tensor product representation. He had 2048.
Eliasmith and Thagard, we would need 506. They used 512. That's pretty close. That's pretty
close.
>> Li Deng: So that’s under the same noise condition?
>> Paul Smolensky: Well, no. They have noise, and tensor product representations don't.
>> Li Deng: So why do they need to have so many, that 2000?
>> Paul Smolensky: Because they have noise.
>> Li Deng: Oh, so in order to reach certain performance, they have to have lots of --
>> Paul Smolensky: Yes.
>>: So you count the cleanup structure as part of the representation?
>> Paul Smolensky: No. I'm just counting the number of units in the actual HRR, not in any
other part of the network that deals with cleaning up the noise, and so I'm looking at the
symbolic structures that can be represented and asking, if I was to do a faithful representation of
them with tensor product representations, linearly independent vectors for the fillers and the
roles, how many units would I end up needing? And so this is a noiseless system, and this is a
noisy system. Okay. Next example here, well, the numbers were too big to fit on this, and so I
chose to divide them by five, so what would have been -- what would have required five times
that many units in a tensor product representation, well, they used 10,000 units. So the idea is
that you can only cope with noise if there's a lot of averaging out, so you have to have lots of
opportunities for cancellation of the noisy contributions, and that leads to big representations. So
in this work by Hannagan et al., we would have needed 64 units to encode their structures. They
used 1000. Final one, Blouw and Eliasmith is the harmonic grammar parsing example that I
mentioned a moment ago, using HRRs, and they used -- well, they used 128 and 256, and they
tried various sizes, and it's not entirely clear what the right comparison is, because they all have
different error levels associated with them, so which is the one to pick is not clear, because
there's no errors with this one at all. So, anyway, moral of the story is, it's awfully hard to find
an example of a study that's been done with HRRs that wouldn't have been -- wouldn't have used
smaller representations had TPRs been used instead.
>>: And is there any tradeoff there? Do you lose anything by using TPR versus HRR? Is there
a reason to even consider doing HRR?
>> Paul Smolensky: One of the advantages that has been identified in having fixed size vectors
for structures of different sizes is that if you want to ask how similar is this tree of depth two to
this tree of depth four, if you use the simplest version of the TPR encoding of trees, you'll end up
with basically zero similarity in a situation like that, because the part of the network that encodes
depth four and the part that encodes depth two are separate. There is a fully distributed version
of representation of trees using tensor product representations, in which, in fact, it's no longer the
case that different depths are separated in the network. They're all superimposed together, and so
you can recover that advantage by going to something that's a little bit less simple than the one I
have talked about. It basically amounts to saying, well, remember that there are these bit vectors
of one, zero, one, right child of left child of right child. Well, that's if you're three levels down.
There would be another bit if you were four levels down, but what you can do is pad out these,
so they're all the same length, and associate a vector for that padded symbol, and you have one
more role vector for that padding symbol, in addition to R0 and R1, so it's a relatively small price
to pay. But then you get the vectors for symbols at different levels of the tree, all consuming the
whole network and not separate anymore, so you can have that advantage if you want to. I don't
know whether what you get is desirable.
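A minimal sketch of the padding idea just described, with dimensions and vectors chosen only for illustration: every constituent's role string is padded out to a fixed length with a third role vector, so symbols at all depths live in one common space. The roles below are linearly independent but deliberately overlapping; with orthonormal roles the positions would still be mutually orthogonal, so the cross-depth overlap shown here depends on that choice.

```python
import numpy as np

r0 = np.array([1.0, 0.0, 0.0])                    # left child
r1 = np.array([0.0, 1.0, 0.0])                    # right child
rp = np.array([0.5, 0.5, 1.0])                    # padding role (hypothetical choice)

MAX_DEPTH = 3
ROLES = {"0": r0, "1": r1, "_": rp}

def role_vector(bits):
    """Role for a tree position written as a 0/1 string, padded to MAX_DEPTH."""
    padded = bits + "_" * (MAX_DEPTH - len(bits))
    v = np.array([1.0])
    for b in padded:                              # r_b1 (x) r_b2 (x) ..., flattened
        v = np.kron(v, ROLES[b])
    return v

A = np.random.randn(5)                            # hypothetical 5-dimensional filler

shallow = np.kron(A, role_vector("0"))            # A as left child (depth 1)
deep = np.kron(A, role_vector("01"))              # A two levels down (depth 2)

# Every constituent now has the same dimension (5 * 3**MAX_DEPTH = 135), so trees
# of different depths are superimposed in one space and can be compared directly.
cos = shallow @ deep / (np.linalg.norm(shallow) * np.linalg.norm(deep))
print(shallow.size, cos)                          # 135, and a nonzero similarity
```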
>>: But then you would get all the advantages of HRR without the noise? You would still get
good similarity measurements.
>> Li Deng: But you don't know how good it is, right? Similarity? Because if there's a noise
there.
>> Paul Smolensky: I don't know how good the similarity measurements in what I propose
would be, but these are published papers, because the similarity structure that they got was
sufficiently close to what people have that it constitutes a decent model.
>> Li Deng: So they do have this kind of similarity comparison using noisy HRR, and then they
found that in the next algorithm.
>> Paul Smolensky: That's right.
>>: In fact, the cleanup depends on it.
>> Paul Smolensky: The cleanup depends on?
>>: Of the similarity measurements?
>> Paul Smolensky: Well, yes, that's certainly true.
>> Li Deng: So that's after cleaning up, they do the comparison, but before cleaning up, it may not be so meaningful, or is it?
>> Paul Smolensky: I'm not so sure. They may do the comparison, the similarity evaluation,
without cleanup involved. And what they're used for in most of these examples is you have
analogies. You have a little story with elements bearing relations to each other and another one,
and people judge how analogous is this situation to this situation, and they construct the different
HRRs and they ask how similar are they, and they use that as the model for judgments. Yes.
>>: So [Cole] mentioned that losing the ability to compare might be a hurdle for TPR, but that ended up not true, because think about it: if we are going to encode either a tree or a graph by using these TPR or HRR schemes, then at the level of computational theory, it's going to be an extremely hard problem to compare whether or not two trees are isomorphic to each other. So if we faithfully map those into the vector space, comparing those two vectors must be as hard as before, if the mapping is transparent. But if somebody just simply compared the vectors represented by HRR, for example, by measuring cosine similarity or Euclidean distance, that is a very simple comparison, which is not -- which must not be right, because comparing two graphs or trees must be hard.
>> Paul Smolensky: That's an interesting perspective, actually.
>>: So it's actually not a loss for TPR, in that sense. Comparing two different structures is an
intrinsically hard problem, so even if we map that into the vector space, that must be still a hard
problem.
>> Paul Smolensky: Yes, that's definitely worth more thought. That's a very good point. Okay,
so this is just a graph that brings home the magnitude of difference between the TPR and the
HRR that was shown in the previous graph. So this is from Tony Plate's dissertation, actually, so
he generated 1000-dimensional -- he generated 1000 vectors, I believe, 1000 vectors randomly, of different dimensionalities, plotted along this axis here. So the case I'm interested in is the 1000
case, 1000 dimensional vectors. And he took a bunch of pairs of them and tried to ask, if we try
to put into our short-term memory a bunch of pairings in which X1 and Y1 are paired, X2 and
Y2 are paired, and we want to hold those in memory, we superimpose the patterns by adding
them together, the patterns being generated by the circular convolution operation, how many of
these pairs can we put into that short-term memory and be able to retrieve out of that accurate
answers to the question like is X1 paired with Y3? No. Is X1 paired with Y1? Yes. Can we get
accurate answers like that? And what he showed is plotted on this graph, if the size of the -- if
the dimension of the vectors is indicated here, then this shows how many vectors can be
superimposed in that memory trace, that short-term memory trace, how many pairs can be
superimposed and have acceptable readout, where the noise level is acceptably small, about
what's actually been stored in that superposition. So what you see is if the vectors are of length
1000, we end up somewhere between nine and 10 on the graph, so fewer than 10, more than
nine, pairs can be stored all at once, without exceeding some tolerance level of noise. But in a
TPR network, if we choose to have vectors for these elements X and Y, whose dimension is the
square root of 1,000 and we make them ortho-normal, which is what we like to do -- we could
make them linearly independent, actually. I'll change that. Linearly independent vectors of this
dimension, then the number of the dimensionality of the pairs, since we use the tensor product
here, will be the square root of 1000 times the square root of 1000, so it'll be 1000, so we'll end
up with the same number of units as we have in this representation plotted here. However, we
can superimpose 1000 pairs and have, that is to say, every possible -- let's see. We can
superimpose this many Xs paired with that many Ys. Namely, 1000 pairs can be stored all at
once, all of them being now linearly independent from one another, any one of them being
retrievable with complete accuracy. So we would have zero error in determining whether a
particular pair is or is not in that superposition state. So instead of less than 10, we get 1000.
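Here is a rough numerical version of that comparison, with arbitrary sizes and a loose criterion rather than a reconstruction of Plate's actual experiment: superimposing K circular-convolution pairs makes the stored/unstored probe gap shrink into the noise as K grows, while outer products of 32-dimensional orthonormal vectors let about a thousand pairs be stored and queried exactly in roughly the same 1,000 units.

```python
import numpy as np

rng = np.random.default_rng(1)
D = 1000

def cconv(x, y):
    return np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(y)))

def hrr_probe(K):
    """Superimpose K random HRR pairs; return the mean probe score for stored
    pairs and the score for one pair that was never stored."""
    X = rng.normal(0, 1 / np.sqrt(D), (K + 1, D))
    Y = rng.normal(0, 1 / np.sqrt(D), (K + 1, D))
    memory = sum(cconv(X[i], Y[i]) for i in range(K))
    stored = np.mean([memory @ cconv(X[i], Y[i]) for i in range(K)])
    unstored = memory @ cconv(X[K], Y[0])
    return stored, unstored

for K in (5, 20, 200):
    print(K, hrr_probe(K))        # the stored/unstored gap degrades as K grows

# TPR version: 32-dimensional orthonormal X's and Y's, outer products of size 1,024.
d = 32
I = np.eye(d)
all_pairs = [(i, j) for i in range(d) for j in range(d)]
keep = {all_pairs[k] for k in rng.permutation(len(all_pairs))[:1000]}    # store 1,000 pairs
memory = sum(np.outer(I[i], I[j]) for i, j in keep)
# Membership queries are exact: the readout is 1 or 0 with no noise at all.
assert all(memory[i, j] == (1.0 if (i, j) in keep else 0.0) for i, j in all_pairs)
```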
>> Li Deng: So in addition, in HRR, they don't really define a role vector and [indiscernible] the separate ones. It's just encoding a vector rather than encoding the structure?
>> Paul Smolensky: When you have a symbolic structure that you're trying to represent, like an
analogy, then you have to make decisions that amount to deciding what the roles are going to be
and what the fillers are going to be. So just like with the tensor product representations, those
decisions have to be made for each model. The only thing that's made for you is the decision to
use the circular convolution operation and not the tensor product operation for combining the
vectors, but the same decisions basically have to be made, and the equations look very similar.
They just have a different thing inside the circle.
>> Li Deng: So HRR can also be used to encode a tree or graph in the same way that --
>> Paul Smolensky: Yes, that's right. Yes. All right, so that was my first point. In practice,
TPR does not actually involve larger representations, typically much smaller representations than
others have found they needed to use to cope with the noise in HRRs, although I have to say that
HRRs have been received with the kind of warm affection by the community that has been
noticeably lacking for tensor product representations, so they do have something. And maybe
you can help me figure out what it is. Okay, so all known proposals for vectorial encoding of
structures are cases of generalized tensor product representations. What I really mean to say here
is that even though there's lots of work out there, there's basically just one idea about how to
combine elements together in combinatorial structures. There's only one idea, really, and that's
tensor products. You can soup them up a little bit, but the core that's actually doing the binding
is the tensor product operation. And in some cases, these are examples that don't look anything
like tensor product representations but secretly are. So I'm going to claim that generalized tensor
product representations is a class that includes HRRs, another system we're about to look at
called RAAM, and temporal synchrony schemes for representing structure in neurons that fire.
So here's RAAM, Recursive Auto-Associative Memory, from Jordan Pollack in the late 1980s, who used learning to decide how to take an encoding of one symbol, an encoding of a sister
symbol and join them together into an encoding of the local tree that has this as left child and this
as right child. And he chose the dimensionality of this layer to be the same as this and the same
as this, so it's like holographic reduced representations, in that the result of combining these two
symbols together is a vector of the same dimensionality as before. So this is the net that does the
encoding. These are the weights that are used to multiply these activations to feed into these
units. And then there's a decoding network, so it's called an auto-associative memory, because
it's trained by taking pairs X0, X1 here, copying them up here, X0, X1, and training the network
to have weights down here and weights up here, such that these weights undo what's done down
here. So they take the encoding of a pair and unbind the right child and unbind the left child by
the matrix multiplications involved in these connections up there. So you have binding at the
bottom or encoding. You have unbinding at the top or decoding. Now, the RAAM encoding of
AB, which is this, can be written this way. So the units in this layer are logistic sigmoid
units. They're not linear units, so there's a nonlinear step in the process of constructing this
representation that we haven't seen before. So this capital F boldface symbol means apply to all
of the elements of this vector this logistic transformation, so it's point wise nonlinear
transformation of all the elements in this vector here, and the vector here is the input to that layer
of units, which is this vector times that matrix plus this vector times this matrix. They just add
together, so here's the adding together, and here is this vector times that matrix and this vector
times that matrix, so the R's are the matrices here, and you'll recall that matrix multiplication is a
kind of contracted tensor product, so if we take this second rank tensor matrix, R0, and take a
tensor product with this, which is of order one, we get something of order three. It's all three-way products from the matrix and elements of the vector, and then we do the contraction in
which we require the second and the third indices to be the same. That summation gives us the
matrix product of this matrix times that vector, so what we end up here is exactly the activation
values that these units here are computing. So what we have is --
>> Li Deng: But now, the --
>> Paul Smolensky: Hold on a second. This is a squashed contracted tensor product
representation, so inside here, we have the contracted tensor product representation, like HRRs,
but now we've squashed it by applying this squashing function or this logistic sigmoid. Yes.
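For concreteness, a bare-bones sketch of the RAAM step being described, with untrained random weights just to show the shape of the computation (in Pollack's model the weights are learned by backpropagation on the auto-association task):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10                                    # symbol and tree vectors all share this dimension

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

# Encoder matrices, one per child position (these play the part of the role
# matrices R0, R1 in the contracted-TPR reading), and decoder matrices.
R0, R1 = rng.normal(size=(d, d)), rng.normal(size=(d, d))
U0, U1 = rng.normal(size=(d, d)), rng.normal(size=(d, d))

def encode(x0, x1):
    # Squashed contracted tensor product: F(R0 x0 + R1 x1).
    return sigmoid(R0 @ x0 + R1 @ x1)

def decode(h):
    # Training is supposed to make these approximately undo encode();
    # with random weights here they of course do not.
    return sigmoid(U0 @ h), sigmoid(U1 @ h)

A, B, C = (rng.normal(size=d) for _ in range(3))
AB = encode(A, B)                         # encoding of the pair (A, B)
C_AB = encode(C, AB)                      # one more level of embedding, same size
print(C_AB.shape)                         # (10,) -- the dimension does not grow with depth
```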
>> Li Deng: So now with the squashed contraction, adding this nonlinear function, which is not
standard product of TPR, then you won't be able to do this -- the same kind of unbinding without
loss.
>> Paul Smolensky: That's right.
>> Li Deng: But that's crucial for this kind of network.
>> Paul Smolensky: It's crucial that there be squashing? Is that what you're saying?
>> Li Deng: Yes, yes. If there's no squashing, everything is linear, then you don't get much, so
I'm thinking about whether that F can be made part of the TPR. I think it would be much more powerful.
>> Paul Smolensky: So in the last part of the lecture today, I'm going to talk about programming
with TPRs, and how you need to use nonlinearities, but the kind of nonlinearity that I propose is
actually a case of multilinearity, where you multiply vectors together. But you don't squash them
point by point, because we know this is wrong. From our symmetry argument, this is not
invariant under change of coordinates. That's why coordinate-wise operations are not part of
physical systems that have invariances.
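A quick numerical illustration of that symmetry point, with an arbitrary random rotation standing in for a change of coordinates: a pointwise squashing function does not commute with the rotation, whereas the tensor product transforms covariantly under it.

```python
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.normal(size=3), rng.normal(size=3)

# A random orthogonal change of coordinates.
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))

sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))

# Pointwise nonlinearity: rotating then squashing != squashing then rotating.
print(np.allclose(sigmoid(Q @ x), Q @ sigmoid(x)))        # False

# Multilinear operation: (Qx) (x) (Qy) is just Q applied on both sides of x (x) y.
lhs = np.outer(Q @ x, Q @ y)
rhs = Q @ np.outer(x, y) @ Q.T
print(np.allclose(lhs, rhs))                              # True
```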
>> Li Deng: Interesting, because this whole branch of deep learning now, about the autoencoder -- they're all based upon this architecture, and it can do a lot of very, very interesting things, so I was curious exactly what role that nonlinearity is playing here.
>> Paul Smolensky: Yes, it's an important question, and I'll show you some nonlinearities, but it
does not answer the question of what this type of nonlinearity is.
>> Li Deng: Because the problem is the deep learning, right?
>> Paul Smolensky: And if you have rectified linear units, then what you have is a piecewise
linear operation there, right? And so it's as close to linear as it could be while still being
nonlinear, and so it could be that by looking at those kinds of nonlinearities, which have some
nice advantages for gradient computation and all, we could take advantage of the linear
properties that you get for both sides of the -- the function. So a generalized tensor product
representation has this form here. It looks like a regular tensor product inside. Fillers, tensor
product roles, added together over all the constituents. And then it has an optional contraction,
and then it has an optional squashing function. That's what I'm calling a generalized TPR, and
RAAM is exactly that, so you have both the F and the C for RAAMs. For HRRs, you saw you
just needed to see there was no F, and now I'll tell you about the last case of tensor product
representations, which actually is a true normal tensor product representation, which involves
neither C nor F, but it doesn't look much like a tensor product representation to most people.
The idea for it comes out of theoretical neuroscience, where the idea has been around for a long
time that neurons in the brain that are encoding properties of one and the same object will tend to
issue their spikes in a synchronized fashion, so there'll be a high correlation between the firing of
two units, one of which might be representing color and one of which might be representing
position or something, that they'll be firing with high correlation if they describe the same object.
So if there are multiple objects in the field, then these neurons will be firing in synchrony, these
will be firing in synchrony. These will be describing properties common to one object. These
will be describing properties common to another object. Okay, and here's a version of it that was
proposed for artificial intelligence-type purposes. And there, the representation of this
proposition, give John -- John give book to Mary -- is indicated in the following way. You have
one unit for each of these elements here. This is a single unit. This is a local representation, but
we're going to look at the activation of these units over time, so here's what the activation looks
like for these two units. They fire synchronously, so imagine each of these is a spike for the
neuron, so these are in phase. That's telling you that the give object is the book. They pertain to
the same object, same thing. And the giver is John, because those two are synchronized, and the
recipient is Mary, because those two are synchronized, so that's binding by temporal synchrony.
You bind together the role and the filler in our terminology by having them fire in synchrony.
So the way to turn this into a tensor product representation is to think about it as a network that's
laid out in time. So we take this set of units. This is the network. And we unfold it in time, so
we just have a copy of this set of units for each time, and then we know which units are active at
which times, so we indicate their activation, so this unit is active and then inactive for two steps,
then active again, out of sync with this unit. So this is the activation pattern which is the tensor
product representation of John gave Mary a book. It is the tensor product representation in the
following sense. Here's one of the constituents in it, the one that says Mary is the one who was
the object of giving, the recipient, I guess -- no. The book, sorry, is the object that was given.
So that you'll notice is a tensor product. This is the constituent corresponding to the give object
is book. This is a tensor product. Here is one of the vectors. Here's the other vector. You take
the tensor product of these two vectors, then you get exactly that pattern. Okay. Now what we
have over here on the filler side are the two units that we're joining together. So this is the one
for the book and this -- this is the one for the book, and this is the one for its role, the object. So
the filler vector is book plus give object, and this plays the role of the role vector, but it's a more
abstract notion of role. We'll call it formal role. It's the role of being in the first cycle of the
system's oscillation, so if we think about this constituent as the tensor product of this telling us
what role in the oscillation pattern it plays and this telling us what material fills that role, then
we have one constituent in the tensor product representation, and then we just superimpose by
addition the corresponding green and blue versions of the same thing for the other two
constituents.
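A small sketch of the space-time reading just described; the unit labels, number of time steps, and one-hot phase patterns are illustrative choices, not taken from the paper. Each constituent is (filler unit + formal-role unit) tensor the phase pattern of its group, and the whole proposition is their sum, a units-by-time activation pattern.

```python
import numpy as np

units = ["john", "mary", "book", "giver", "recipient", "give-object"]
U = {name: np.eye(len(units))[i] for i, name in enumerate(units)}   # local, one-unit codes

T = 6                                              # time steps in one oscillation cycle
phase = [np.eye(T)[t] for t in range(3)]           # three out-of-phase firing patterns

# Each constituent: (filler unit + formal-role unit) (x) its phase pattern.
groups = [(U["john"] + U["giver"],       phase[0]),
          (U["book"] + U["give-object"], phase[1]),
          (U["mary"] + U["recipient"],   phase[2])]

# The whole proposition: the sum of the tensor products -- the network unrolled in time.
S = sum(np.outer(f, p) for f, p in groups)
print(S.shape)                                     # (6, 6): units by time

# "What is bound to the recipient role?" -- read off what fires in the same phase.
recipient_phase = S[units.index("recipient")]      # the recipient unit's firing pattern
who = S @ recipient_phase
print([units[i] for i in np.flatnonzero(who)])     # ['mary', 'recipient']
```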
>> Li Deng: So role looks more like the quantized time intervals.
>> Paul Smolensky: Yes. Well, there's this fixed repetitive activation, which is the firing phase,
so these are firing out of phase with each other, but they have the same firing frequency.
>> Li Deng: So presumably the first spatial dimension, you can apply the same kind of role
there.
>> Paul Smolensky: Say that again.
>> Li Deng: So this is time.
>> Paul Smolensky: That's the time.
>> Li Deng: Different time interval, not for the space for the image, for example, rather than
time series.
>> Paul Smolensky: Yes.
>> Li Deng: But does the same kind of formal role apply if you quantize spatially, so that it may be able to be used to represent an image or something?
>> Paul Smolensky: I think if you had a third dimension, then you could have something like
the region of the image, the label of that region and then time. Then I think you could. Yes,
Lucy?
>> Lucy Vanderwende: So this is an encoding of the sentence, the book was given to Mary by
John? No. The book was given by John to Mary.
>> Paul Smolensky: It's an encoding of the --
>> Lucy Vanderwende: Because book happens first.
>> Paul Smolensky: It's an encoding of this proposition, and there isn't any intention that book
has some sort of -- that it precedes John in any sense. You could imagine that this goes on for
some time, and I just happened to start drawing the picture here. There isn't a significance to the
fact that it's the magenta that happens first, because I could have arbitrarily started to draw the
picture here. This is intended to be an ongoing pattern. So it doesn't reflect any sequence
information, just the binding information of what thematic role goes with what.
>>: There is no tree.
>> Paul Smolensky: There is no tree.
>>: There is just a set of facts.
>> Paul Smolensky: It's a slot filler kind of structure. These are the slots and these are the
fillers.
>> Li Deng: So there is no special advantage to using TPR for this kind of a structure. Any
other ways of representing, by the raw data, it will be just as good.
>> Paul Smolensky: Are you asking the question of what's the advantage of seeing this as a
tensor product representation?
>> Li Deng: Yes, exactly.
>> Paul Smolensky: Well, I can give you at least two. The first one is to substantiate my point
that any idea that anyone has ever had has used tensor products to bind information together, and
to take something which prima facie looks like a counterexample to that claim, people would not
think of this as -- they would think of it as a counterexample, but actually, it's an example. But
let's see, here. That's odd. But there is actually quite a distinct advantage, and you can probably
guess what it is. This is a fully local encoding, but we can repeat this whole construction with
the tensor products with distributed patterns, so we don't have to have a single unit for John.
John could be a pattern, and everything would go through just fine. And until you recognize it as
a tensor product, you have no idea how to take this idea and flesh it out with distributed
representations.
>>: But also this example kind of gives you an idea of why the brain might have very large
capacity, because it could be using time to encode things, as well.
>> Paul Smolensky: Yes, yes. In this article, actually, they make somewhat of the flip
argument, that this explains why short-term memory has such small capacity, because they are
using this as a model of what we can hold in our short-term memories at once, and they do some
sort of back-of-the-envelope calculations to figure out how many slots are there in the actual
cycling of actual neurons that would give you how many slots that you could fill with
information like this, and they come up with the number seven, which is the classic number. I
don't know that anybody regards it as the correct number anymore, but it's the classic number.
>>: Seven of what?
>> Paul Smolensky: You can put seven facts, like the book was given, in short-term memory.
>> Li Deng: That's pretty similar to short-term memory that people have. Telephone number
would be 10.
>> Paul Smolensky: Yes. I think seven plus or minus two is the famous paper by George
Miller. I think most people would say it's closer to three or four, actually, but in any event --
>>: On a distributed representation of the roles, which seem to be wanting to be in the exact
time, that's not a problem when they overlap or anything?
>> Paul Smolensky: I think that's right, yes. Yes. As long as the patterns for down here are
linearly independent, we should be fine. So they don't have to be firing at distinct times. They
could be -- you could have some pattern in which you had a different amount of firing at each
time, not just one and zero, and then a different such pattern for the second slot, and it should
work just fine, as far as the linear algebra properties of the representation are concerned. Now,
what you want to do with this in your network might change, might be different. That I wouldn't
swear to. But the representation of the entire structure is the sum of these tensor products. Just
as you have in a standard tensor product representation, there's no squashing, there's no
contraction, but there is some innovation here. What's new is the idea of using a space-time
network and not just a spatial network, and independently, it's really a separate idea to use these
formal roles instead of meaningful roles, so we don't consider give object to be a role when we
look at it this way. We consider it to be a filler that gets bound to the same formal role that this
one does, and in that sense, they end up functioning as a unit, which is actually reminiscent of
how the neo-Davidsonian move to say instead of having the agent and the patient be bound
together, we have a formal thing called the event, and we bind the agent to the event, and we
bind the formal -- the patient to the event, and by virtue of being bound to the same event, they
have a relationship to each other, but the relationship isn't directly encoded in neo-Davidsonian
formalism.
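Picking up the earlier question about overlapping distributed role patterns, here is a small check, with arbitrary graded patterns, that linear independence really is enough: the dual vectors obtained from the pseudo-inverse of the role matrix unbind each filler exactly, even though the firing patterns overlap heavily in time.

```python
import numpy as np

rng = np.random.default_rng(0)
T, n_fillers, n_roles = 6, 5, 3

# Graded, overlapping firing patterns over time for the formal roles:
# linearly independent, but nothing like one-hot and far from orthogonal.
R = np.abs(rng.normal(size=(n_roles, T)))          # each row: one role's pattern over time
F = rng.normal(size=(n_roles, n_fillers))          # the filler pattern bound to each role

S = sum(np.outer(F[k], R[k]) for k in range(n_roles))   # bound structure: fillers x time

# Dual role vectors from the pseudo-inverse of R. Because the role patterns
# are linearly independent, unbinding with them is exact.
R_dual = np.linalg.pinv(R).T                       # shape (n_roles, T)
for k in range(n_roles):
    assert np.allclose(S @ R_dual[k], F[k])        # recovered filler == stored filler
```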
>>: So actually, if you think about it, that's a product over there, and then you don't have to have
these roles to be localized. They can be distributed, but have people actually observed that in
practice, too, because you said that there's a theory. I don't know if it was observed, that you
have these neurons all firing at the same time to represent this object, these neurons all firing
at the same time to represent that object, but if it's actually distributed, we'll actually have to do a
little bit more math to figure this out.
>> Paul Smolensky: Yes, exactly.
>>: Which is doable, though.
>> Paul Smolensky: It's doable. I do not know whether it has been done, whether there are
cases that illustrate this very behavior with distributed as opposed to local formal role
representations.
>>: I guess the only reason why locality -- what would be the reason for locality rather than
distribution? Distribution here, maybe sparsity, a system of sparsity with these policies?
>> Paul Smolensky: That might be advantageous. Yes, yes.
>>: Is the role here really acting more like one of four possible event time-multiplex slots or
something, rather than a -- it's not a semantic role. It seems more like a temporal role.
>> Paul Smolensky: Well, we call it a formal role, because it really isn't about being over time.
We could reinterpret this as entirely spatial network that has no time in it at all, and everything
would be the same, so it's not really about time. That's the most crucial thing that shocks people,
that whatever this idea is about, it's not really about time, actually. Because we can have a
formally identical system that has no time in it. It's about having some identifier, some unique
identifier, that other things get stuck to as the means of bringing them together, rather than
sticking them to each other.
>>: But just like in telecommunications, you're sending signals, but time is there just for you to
encode the signal over time, but you're getting it as a code word at the end. It's not like there is a
timing to the content, just using time.
>> Paul Smolensky: Right. And so it may be that formal roles of this sort are used over time in
the brain and not otherwise, but there's no reason why it would have to be that way from the
formal structure of the representations point of view.
>>: So there is a notion that has been longstanding in linguistics, that the object of a verb is
much closer to the verb than the subject.
>> Paul Smolensky: Yes.
>>: And is that something that is capturable or captured with the notion that you were doing this
over time, the neural firings are taking place synchronously, so the word book is bound to the
word give, earlier than the --
>> Paul Smolensky: It could be that -- if the activation pattern down here for the slot that the
verb goes into is more similar to the activation pattern for the slot that the object goes into than it
is to the activation pattern for the slot that the subject goes into, then you would expect to see just
what you said, that there would be more correlated activity in the encoding of the verb object
pair than in the verb subject pair. So it could be used I think in that sort of way. Lucy?
>> Lucy Vanderwende: In this way, where you have the filler is give object, you now don't have
a more abstract role of object more generally, not linked to the specific -- here, you were linking
the object to each specific verb, so give object to each object.
>> Paul Smolensky: Oh, you're talking about the fact that there's a bundling of the role object in
the verb give here.
>> Lucy Vanderwende: So do you get any generalization anymore on how objects on average
behave?
>> Paul Smolensky: Right, right. So when I talked about trying to capture similarities of that
sort, last time I think it might have been, it was important that we didn't do this, that we had a
representation of give and a representation of object that had their own independent character. They could
be bound together, or not, and so there is the question of how well you can recurse this kind of
formalism and say, okay, well, I want to use the same idea for binding together object and give,
rather than just plunking them together as a label for a unit. I have to think about that. I'm not
sure that it would recurse very gracefully.
>>: How about uncertainty? You could imagine having a representation where either John is
the recipient or he is the giver, with different probabilities: maybe John most likely is the giver,
but it's possible that he is actually the recipient and Mary is the giver. So you could imagine a
situation there where the intensities of the pulses overlap, so that in Paul's case, the
recipient and the giver are both synchronized with both Mary and John, but to different
amplitudes. But that's a representation where you have the same thing. You didn't talk much
about uncertainty in representation. I don't know if it's just a very linear thing or not. It could
just be based on the amplitudes of things.
>> Lucy Vanderwende: Would a good example of that be the start of a sentence, John gave
Mary? Because it could be followed by in marriage, in which case he really is kind of giving
Mary, or John gave Mary a book. Until you hear what comes after, John gave Mary is uncertain.
>>: I more meant just mental uncertainty, like I don't know what I've heard. I know there was
something about the book. Somebody gave the book to somebody. I'm just not sure. I think
John gave it to Mary, but I'm not sure. It might be the other way around.
>> Paul Smolensky: So what has been talked about very little in these lectures, maybe just one
slide about French liaison, is the current focus of work on having partially active symbols in
representations, and on distinguishing that notion, a blend of partially active symbols, from a
probabilistic mixture of fully active symbols: how, when you're in the middle of processing
a sentence and you have uncertainty about the rest of the sentence, being in a blend of
partially analyzed parses is different from having a probability distribution over ultimate parses,
which is the more standard view. So we have been developing simulation models of that, with
grammars in networks using tensor product representations and such. But that's been hardly
mentioned here. Still, that is where we would talk about what you just raised, I think: what
happens when we don't have John, we have 0.6 John. And so in the French liaison example, I said
we had 0.4 T at the end of petit, maybe 0.5.
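A minimal numeric sketch of that distinction, with hypothetical vectors and illustrative 0.6/0.4 weights (not from the lecture): a blend is one single activation state containing partially active fillers, whereas the more standard view keeps two fully formed parse states plus a probability distribution over them.

import numpy as np

rng = np.random.default_rng(2)
unit = lambda v: v / np.linalg.norm(v)
john, mary = unit(rng.standard_normal(16)), unit(rng.standard_normal(16))
giver, recipient = np.eye(2)            # two orthonormal role vectors, so unbinding is exact

parse1 = np.outer(john, giver) + np.outer(mary, recipient)   # John as giver, Mary as recipient
parse2 = np.outer(mary, giver) + np.outer(john, recipient)   # the other way around

# Blend of partially analyzed parses: one single activation state.
blend = 0.6 * parse1 + 0.4 * parse2

# Unbinding the giver role from the blend yields a partially active filler,
# roughly 0.6*John + 0.4*Mary, rather than a sample drawn from {John, Mary}.
giver_filler = blend @ giver
print(round(float(giver_filler @ john), 2), round(float(giver_filler @ mary), 2))
# Values near 0.6 and 0.4, up to crosstalk between the two random filler vectors.

# The probabilistic view instead keeps {parse1: 0.6, parse2: 0.4}: two fully active
# states plus probabilities, a different kind of object from the single blended state.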
Okay, serious symbol processing. But there's one more thing here first. I'm going to just blast
through it and not explain it, because I think it's cute and interesting but pretty much off topic:
it's some evidence that neurons are
functioning the way tensor product neurons should function. In the parietal cortex,
representations of the locations in space of visual stimuli have to take into consideration the
combination of the position of the eye and the position of a dot on the retina; the same retinal
position means different spatial positions if the eye moves, and conversely. So what you find is
that the activity level of a neuron in this part of the visual system has a profile like this, where
this axis is the position of a dot along the retina, and this axis is, let's say, the horizontal position
of the eye. And the activity of a single neuron looks like this as a function of these two relevant
variables that it has to combine together in order to identify a place in space for that dot. And the
point is that this is in fact a tensor product of two functions. This function is a distributed
representation over the eye position variable; this one is a distributed representation over the
retinal position. The retinal position one is roughly bell shaped, the eye position one is roughly
logistic shaped, and the formula that's given for the receptive field by the authors who have
done this work is exactly the tensor product of these two functions. So there's a case where the
two relevant bits of information, where my eye is pointed and where on the retina a given image
is cast, are bound together using a tensor product in this part of the visual system.
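A sketch of the kind of "gain field" response profile being described; the functional forms and parameters below are illustrative reconstructions, not values taken from the cited work. A single unit's response is modeled as the product of a bell-shaped retinal-position tuning curve and a roughly logistic eye-position gain, which is exactly one slice of the outer (tensor) product of the two distributed representations.

import numpy as np

retinal_pos = np.linspace(-40, 40, 81)   # degrees on the retina
eye_pos     = np.linspace(-20, 20, 41)   # horizontal eye position, degrees

# Distributed representation over retinal position: roughly bell shaped.
f_retina = np.exp(-0.5 * ((retinal_pos - 10.0) / 8.0) ** 2)

# Distributed representation over eye position: roughly logistic shaped.
f_eye = 1.0 / (1.0 + np.exp(-eye_pos / 5.0))

# The unit's receptive-field profile over both variables is the outer (tensor) product.
receptive_field = np.outer(f_retina, f_eye)
print(receptive_field.shape)   # (81, 41): one response value per (retinal, eye) combination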
Okay, but I wanted to move on to the last topic here, and that is serious symbol processing with
tensor product representations, which involves nonlinearity, but not pointwise nonlinearity. I'm
going to talk first about the basic operation of the lambda calculus, which is function application.
This is called beta reduction. You have a lambda expression, which is an
expression which has a variable, X, identified by this quantifier lambda, and you have some
expression, some function, that is stated in terms of X, whose inner structure is not indicated
here, so B stands for some formula involving X. And so what's in parentheses is the function,
and what's outside is the argument, and this is supposed to be the value of the function on that
argument. That's what the process of beta reduction computes, and if we go to our tree world,
we can think of the lambda expression as being built this way if we want. That's not the only
way. And what we need our function to do is this: given this L as the first argument and some
A as the second argument, output this expression B, but with all the Xs replaced by A. That's
what applying the function to a value means. You replace the variable of
that function with the value you're evaluating it at. And here's how we do that using tensor
product unbinding and binding. So one step is we unbind the right child of the left child. That's
here. So L is the tensor product representation of this, and if we unbind the right child of the left
child, what we get out is in fact what the symbol is. That is the variable in the expression. It
could be X could be whatever. This will tell us what it is. And so that extracts X. This operates
on this tensor product representation for that extracts the right child, which is B. So that extracts
B. Here's the function that does the whole thing. There should be no D there; I don't know how
that typo got created, but this is the full function that does the job. What this is, is the identity
operator, so just pretend that D's not there, please. This multiplies by B, just giving B back, so this
reproduces B. What this does here, this inner product, is return the locations of all of the Xs:
when you take the inner product with X, you get out all of the roles that X fills. This deletes
all the Xs in those very locations, and this inserts A in those very locations. So the net effect is,
you have replaced all the Xs by A with this combination of operations: inner products here, and
outer products here. I haven't used the tensor product symbol, consistent with previous
lectures, so this is the outer product of this tensor with that tensor. All right, so in this case, this
formula encodes a tree. There are atoms at the terminal nodes. The atom X here is replaced by an
entire tree; A is in general an entire expression itself, not just an atom, so we've managed to
replace a symbol with an entire expression.
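Here is a minimal executable sketch of the substitution step just described, under simplifying assumptions: orthonormal filler vectors, flat role vectors for positions rather than recursively composed tree roles, X and the body B taken as already extracted by unbinding, and A treated as an atomic filler. It is a reconstruction of the idea, not the exact function on the slide. Unbinding is a contraction (inner product) with a filler or role vector, binding is an outer product, and substituting A for every X in B comes out as roughly B + (A - X) bound to the roles that X fills.

import numpy as np

rng = np.random.default_rng(3)

def orthonormal_fillers(names, dim):
    # Orthonormal filler vectors, so unbinding by inner product is exact.
    q, _ = np.linalg.qr(rng.standard_normal((dim, len(names))))
    return {name: q[:, i] for i, name in enumerate(names)}

fillers = orthonormal_fillers(["x", "f", "c", "a"], dim=8)

# Five positions in the body B, encoded with orthonormal role vectors.
roles = np.eye(5)

# B encodes a body whose symbols, position by position, are: f x c x c
body_symbols = ["f", "x", "c", "x", "c"]
B = sum(np.outer(fillers[s], roles[i]) for i, s in enumerate(body_symbols))

def substitute(B, a_vec, x_vec):
    # Roles filled by X: contract B with the filler vector x (an unbinding).
    roles_of_x = x_vec @ B
    # Delete X at those roles and insert A there (bindings), keeping the rest of B.
    return B + np.outer(a_vec - x_vec, roles_of_x)

result = substitute(B, fillers["a"], fillers["x"])

# Read the result back symbolically to check: every "x" has become "a".
decoded = [max(fillers, key=lambda s: abs(fillers[s] @ (result @ roles[i]))) for i in range(5)]
print(decoded)   # ['f', 'a', 'c', 'a', 'c']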
The next thing I'm going to show you, which is tree adjoining, takes an atom at an internal node
and replaces it by an entire structure. This is to
remind me that in Gary Marcus's book, which he talked about in his lecture here, he laid out
these seven lessons for what we need our brains to do in order to produce cognition: we need
symbols, we need variables, operations, types, tokens. I claim that being able to do things
like function evaluation, beta reduction, means that there's no question that we can do all of these
things. All of them are doable. That is a solved problem, I claim. Tree adjoining. So here is the
initial tree. It has somewhere buried inside it an A constituent. This is the auxiliary tree. It has
A as its root symbol. Little A stands for the whole thing. It has a foot symbol, alpha, and what
we need to do is insert this into that. It's a kind of adjoining. We insert the green tree inside
here, so that the red one now hangs from the green instead of the blue, and the green hangs from
the blue the way that the red used to do, so that's the tree adjoining. And I will just go through
this very fast, because I think having seen the lambda expression, you'll get it quickly as much as
you're going to get it, and it's 2:00. So here's an inner product that tells us what symbol we're
looking to replace. It's the symbol at the root of A. That extracts the root symbol, so here's just a
recording of that fact. What this does is it finds all the locations -- the location, I should say, of
this symbol A in this original tree. What this does is find the subtree here hanging from that
position in the original tree, color coded to match it. What this does is find the location of this
node alpha inside this tree, what role in this tree alpha fills. And once we have all of these things
in place here, once we have all those in place, we can write the function down for tree adjoining,
which takes this as its first argument, this as its second argument and produces that as its result.
So first, this here retains all of T that's unaffected by adjoining: it removes the subtree A
from T, and once that's removed, what's left is all the part that's unaffected by adjoining.
This repositions that removed subtree by moving it down to where alpha is. This embeds the whole big tree here in
the place where that atom was before, and this removes alpha from the final structure, because
it's just a placeholder, and voila, you're done. This says down here, this is a bunch of outer
products that are used to construct this, and this is a bunch of inner products that are used to
unbind these to pull out the relevant bits, so that they're ready to be put back together in this way.
So the net result of all this is a single function written here by means of these auxiliary variables
here, that does tree adjoining. But it's a high-order, nonlinear function in the following sense. If
we look at this term, for example, we have one R times another R, so this R involves taking an
inner product with A. This involves inner product with T, so there's an A and a T buried in here.
There's an A in here, too, so we have A times A in there, and elsewhere, we have T times T, so
here we have -- let's see. Do we have A times R somewhere? Well, it's my belief that
somewhere buried in here is a third-order term in T. So when you cash all of these abbreviations
out for what they stand for, you'll see that T enters multiple times and A enters multiple times.
Those are the two arguments here, and they get multiplied together with themselves and with
each other, so you end up with something that's not linear in T and it's not linear in A. But it's
multilinear in the sense that it's just multiplications of them. It's not something like a point-wise
squashing by a sigmoid function. Okay, so I've already done that. I don't know why that came
back. So in a single-step, massively parallel operation, we take the distributed encoding of the
input arguments into the distributed encoding of the output. It's a third-order rather than a
first-order function of the input. A single application of this whole function simultaneously
performs multiple inner products and multiple outer products all at once, and that achieves the
effect of extracting all the roles that contain a given filler and inserting a given filler in all of
those places. That's
substituting a value for a variable in a very rich structural sense. Voila.
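To restate the "multilinear but not pointwise nonlinear" point as a formula, under the same simplifying assumptions as the substitution sketch above (orthonormal fillers; this is a reconstruction, not the slide's notation), the substitution map built from contractions and outer products is

\[
\beta(B, a, x) \;=\; B \;+\; (a - x) \otimes \big(x^{\mathsf T} B\big).
\]

Since the variable x and the body B are themselves obtained from the input expression by linear unbindings, the whole map is third order in that input, matching the "third rather than first order" remark; all the nonlinearity comes from products of the inputs, never from a pointwise squashing such as a sigmoid.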
>> Li Deng: Thank you very much.
>> Paul Smolensky: Congratulations. We finished a lecture.
>> Lucy Vanderwende: On time.
>> Paul Smolensky: Never happened before.
>> Li Deng: Thank you very much. Any more questions?
>> Paul Smolensky: Yes.
>>: So do you think of these operations, like for the tree adjoining, as being -- existing in the
language themselves, so that they can be represented and you can make new operations based on --
>> Paul Smolensky: Are they part of the toolkit that you can use to build other things?
>>: Well, in the brain. I'm thinking, if this were the right model, would these operations be
represented the way other knowledge is, or would they be something that's just fixed, that was
somehow learned and sits in the neuron weights? Well, I guess it's kind of the same thing. I'm
just wondering if you can take simple operations and make new operations from them using
these same sort of operations. Can they be used on themselves? Is there a language of
operations here that can be constructed?
>> Paul Smolensky: There is in a sequential sense, for sure, where you could apply one of these
operations and take the output and then apply another one to that. There's no question that that
exists. When I do these programs, I figure out what bits I want to multiply together and combine
to create a function that in one step does lambda evaluation, but it's not clear whether, internal to
the brain, the capacity to do that kind of combination is plausible.
>>: I guess I'm wondering if these operations are pre-wired and they don't tend to grow, if you
just stick with that set of operations, or if it's something that's learned and they grow over time.
Do you have any guess or intuition on that?
>> Paul Smolensky: I think that there needs -- my best guess, and it is a guess, is that there needs
to be some sort of organization to the cortex such that these kinds of tensor operations can go
on, so that these kinds of operations are implementable in the cortex. Whether the
implementation of these operations in the cortex is somehow hardwired, or whether it's
something that could be learned, I do not know, but I'm guessing that the fundamental ability to
do tensor product -- tensor calculations is probably hardwired. At that level, I feel that my best
guess is probably secure, but whether the brain can freely combine all of these things the way I
do when I write a program I think is a good question to ponder. It could very well be that -- one
thing to imagine is that, given that there is the ability to do sequential combination of operations
that have already been acquired one way or another, feed the output of one as an input to another,
then that gives you training data for the combined function. So if I have F and I have G, and I
feed the output of F to G, then I get training data for G composed with F. And so you could
imagine then over time learning that combined function and then it functioning as a unit instead
of as a sequential set of operations.
>>: Almost like a chunking kind of thing.
>> Paul Smolensky: Absolutely like a chunking kind of thing. That's right. Yes.
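A small sketch of the composition-as-training-data idea just discussed, using hypothetical linear operations F and G and randomly generated inputs (nothing here is from the lecture): applying F and then G to stored inputs yields input-output pairs for the composite, from which a single one-step map can be learned, here by least squares since the toy operations are linear.

import numpy as np

rng = np.random.default_rng(4)

d = 10
F = rng.standard_normal((d, d))   # one already-acquired operation (toy: a linear map)
G = rng.standard_normal((d, d))   # another already-acquired operation

# Sequential use of F then G generates training data for the composite G o F.
X = rng.standard_normal((200, d))           # inputs encountered over time
Y = X @ F.T @ G.T                            # targets: G(F(x)) for each input x

# Learn a single one-step operation H from the (input, output) pairs.
H, *_ = np.linalg.lstsq(X, Y, rcond=None)
H = H.T

# H now performs the chunked operation in one step.
print(np.allclose(H, G @ F))                 # True, up to numerical precision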
>>: Cool. So this whole representation, even though you've been using language a lot as
examples, the whole representation isn't just about language. It's about storage and information
and knowledge and building knowledge in a neural system. But language is a special case. Are
there some examples of what these sorts of operations can do that language doesn't, something
not easily expressed through language, some kind of reasoning that's not language bound? It's hard
for humans to think, to discuss things without language. They think that they think in terms of
language, but that's probably not the case. It's probably that there's a lot of thinking that's not
really language driven.
>> Paul Smolensky: In the original paper on tensor product representations, I gave an
illustration of using tensor product representations to encode something like a speech
signal, where the roles were points of time and the role vectors were sort of like Gaussian bumps
centered at the point of time that they most principally control; that's the role axis. On the filler
axis, you had some sort of detectors for energy at different bands of frequencies or whatever, so
you could build a spectrogram this way, as a tensor product, but with a continuum of roles,
really, and a continuum of fillers, and there's maybe some sensible geometry to the pattern of
activity for these types of role vectors and filler vectors both, such that it makes sense for them
to be filter shaped or something. So these kinds of operations can apply to the kind of continuous
domain of signals that people aren't very facile at talking about, at least verbally, so I do think
it's a much more general mechanism than language, or even than higher cognition, really.
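A brief sketch of that spectrogram example, with purely illustrative parameters: roles are points in time with Gaussian-bump role vectors, fillers are energies in frequency bands, and the whole time-frequency pattern is a sum of filler-role outer products, a tensor product representation over (discretized) continua of roles and fillers.

import numpy as np

time_axis = np.linspace(0.0, 1.0, 100)        # seconds
n_bands = 12                                   # frequency-band "filler" dimensions

def role_vector(t_center, width=0.05):
    # Gaussian bump over the time axis, centered on the moment it principally controls.
    return np.exp(-0.5 * ((time_axis - t_center) / width) ** 2)

# A few (filler, role) events: band energies observed around particular times.
rng = np.random.default_rng(5)
events = [(rng.random(n_bands), t) for t in (0.2, 0.5, 0.8)]

# Spectrogram-like pattern as a sum of outer products: band energies (filler) x time bump (role).
spectrogram = sum(np.outer(energies, role_vector(t)) for energies, t in events)
print(spectrogram.shape)   # (12, 100): frequency bands by time samples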
Whenever there's structured information, you need to encode what role in the structure is being
played: at what time does this frequency band have a lot of energy? So I think it's just pervasive,
and my guess then would be that the
capacity that we use for higher cognition evolved from a capacity to encode signals in this sort of
way. And an interesting thing about the use of tensor product representations for abstract
knowledge and encoding grammar and all of that kind of higher-level stuff is that it's the same
notions of role filler combination to form combinatorial structures that you have in things like
scenes. So in a scene, you have a lot of objects you've identified. They have roles in the scene,
which involve positions but relations to each other and affordances they provide and all of that
stuff. So a scene is something that lower animals have to deal with all the time. They must have
the capability of encoding combinatorial representations, and so the apparatus that I'm talking
about doesn't seem like one that would be exotic. And the ability to encode abstract, in the sense
of far from sensory, information in the fillers and the roles could naturally be a result of
extracting features at higher and higher levels, and so on. But the same fundamental structuring
operations can be there from the beginning. And a nice thing about the brain is this, that you
might wonder, well, how did mankind make the leap from scenes being represented as
combinatorial structures with tensor products to parse trees being represented that way? And
the answer is that to the brain, a scene is an activation pattern, and so is a sentence. So despite
the fact that they have very different semantics to us, to the brain it's all the same. It's
identifying repeating substructures and seeing that they combine in certain ways, and that has to
be done to deal with scenes, and once the information that's available to you includes
things like language, then the same kind of operation should go a long way towards providing
the kind of higher-level capabilities that we're talking about in these lectures. Voila, neural
solipsism pays off. Okay, thanks again for enduring this. I'm very impressed.