>> Li Deng: Thank you very much for coming to Paul’s presentation today. He’s going to give a series of presentations; today’s is the very first one. Let me give a very brief introduction to Paul. You all know him; I don’t know how many of you have actually read the chapter that he wrote in the PDP book. It’s an eighties book, a very famous little chapter on the Restricted Boltzmann Machine from the mid eighties. I think at that time it made use of dynamical systems; now people call it the Restricted Boltzmann Machine. Paul has been spending almost all his lifetime developing these two theories. One is the dynamical system theory related to the Boltzmann Machine. The other one is the TPR theory, it’s fair to say. Then there’s a combination of those that gives rise to, you know, symbolic and neural computation, and the connection between the two; those are extremely influential. Then he got a series of awards, including the most famous one, the Rumelhart Prize, which is very prestigious. Everybody really respects all of this work. We took the opportunity to have Paul visit for about four months, until I think the first week of December, so he will have plenty of time to spend here. We have been engaging Paul to do research with us about the application of a number of theories that he has to our practical problems here. Without further ado, I will give the floor to Paul to give the very first lecture in the whole series about some of the cognitive science and neural network research that he’s been working on. Okay, thank you. >> Paul Smolensky: It’s been a great time here already. I’m really appreciative of the opportunity. Thank you very much for your hospitality. Let me start off with an exchange that sort of sets the main agenda for this series of talks. This was an exchange between two giants in the field at a plenary lecture. A pioneer in computational linguistics asked a founder of the deep learning field the following question: shouldn’t your DNNs for language processing have structured representations, like tensor product representations or something? I’ll be talking about those below. The answer from the DNN researcher was: well, do you want a pretty theory or do you want a system that works? For me, where I sit, the answer is completely obvious. We want a pretty theory. [laughter] We already have seven point four billion systems that work. What we want to know is: how do they work? A focus of the research that I’ll be talking about is trying to understand, and not just get performance out of, neural networks. >>: What is that seven point four? Is that humans, the brain? >> Paul Smolensky: That’s the number of humans on the earth as of October. [laughter] >>: Paul, what did the deep learning expert mean? Why did they view a system that works and a pretty theory as competing aims? >> Paul Smolensky: I wasn’t there, so I hesitate to elaborate too much, or speculate too much. But I think that there’s a pretty obvious asymmetry in the neural network community in terms of the amount of effort that goes into trying to get good performance versus trying to understand how the networks work, how they manage to get that performance. Okay, so the stage for that kind of interaction, that question, was set by a shocking development of the last few years, which is that there are some DNNs that actually produce rather impressive English. There are lots and lots of examples one could give. Here’s a little example from local work: a nice figure caption, “a little girl brushing her teeth with a toothbrush.”
Or real live online translation into English and other languages from Skype Translate, just quite dazzling. The thing is, we do not know how the networks do it. We could go a couple of different ways. Following long-standing AI tradition, we could decide to ignore an entire academic discipline that’s been devoted to understanding what it takes to produce great English. Or we could ask: how can linguistics help us understand what these networks are doing? In my view, linguistics defines the state of the art in understanding what great English is and how it can be produced. It’s not good enough to merely understand how a network minimizes some error function defined over neural activities. That doesn’t qualify, doesn’t meet the state of the art by any stretch of the imagination, in terms of understanding how English is produced. The success of these networks in language does not mean the end of theoretical linguistics, despite what some people may have hoped, because we need linguistics to understand these networks. If we ask linguistics and traditional approaches to natural language processing, and for that matter computational linguistics as well, we’ll be told that there’s a lot of belief in a hypothesis that says producing great English requires abstract structural knowledge, ASK. According to this hypothesis, if we want to understand a network that’s producing English we should look inside and see how it encodes this abstract structural knowledge. Now there are several questions to ask here. One is whether this hypothesis is correct, whether we should believe it. Another is how on earth we could look inside a network to see if the abstract structural knowledge that linguists say we need is in there. Finally, you might wonder what I actually mean by abstract structural knowledge. Let me take a little diversion on that, starting with a very, very simple example, deliberately avoiding all the kinds of syntactic subtleties that linguists love to use in their examples. Here is a very simple noun phrase: troubled adolescent hacking expert. It took me a long time to come up with that; it has lots of readings. Okay, the meanings include things like: an expert on hacking by troubled adolescents; an expert on hacking by adolescents, who is themselves troubled; an expert on hacking into adolescents’ accounts, who is troubled; a hacking expert who is herself a troubled adolescent. Lots of other readings, and the point is that if neural networks are going to be able to process even relatively simple kinds of expressions like this, they need to be able to make all these distinctions somehow. Obviously, a bag of words is not enough. Obviously, a sequence of words is not enough, because all of these readings come from the same sequence of words. There are other things that need to be invoked in order to make these distinctions. In the symbolic linguistic tradition you might say that, well, the difference between readings one and two could be related to different implicit groupings of the elements in the phrase. In the first case “adolescent” groups with “troubled”; in the second case, with “hacking”; so maybe some kind of grouping structure could be helpful in making these distinctions. The second and third readings have pretty much the same grouping. They differ, however, in the implicit relations holding between the elements in the group. In the first case the adolescent is the agent of hacking; in the second case, the patient of hacking.
Just to repeat: neural networks somehow have to have representations that can make all these distinctions, and of course many, many more. Maybe they do it by implementing symbolic representations like the ones I’ve drawn here. Or maybe they’ve discovered entirely different ways of making those distinctions. That’s what I would like to understand. That was a little digression on the kind of thing that makes linguists believe in a hypothesis like this: that we have to have some kind of abstract structural knowledge in order to even get off the ground producing English. The second question here, how to look inside, I’ll be talking about in some detail for the rest of the talk. We’ll put that aside for the moment and pass on to the question of whether this hypothesis is to be believed: that in order to produce great English we need to have abstract structural knowledge. I see that there are several possibilities here, and I’m curious to know what the actual state of affairs is. Do these networks actually produce great English? Here I’m reminding you that I think it’s linguistics that defines the state of the art in understanding what great English is. In other words, linguistic analysis assumes that certain sentences of English are part of the language. Is it the case that these networks are actually producing the kinds of sentences that cause linguists to believe in the kinds of structures that they have put in their theories? Or in fact do the networks fail to display that kind of competence? If indeed they don’t display the kind of competence that causes linguists to believe in abstract structural knowledge, then they’re not going to help us decide whether or not this hypothesis is true. But if they are producing many, at least, of the kinds of structures that linguists believe require abstract structural knowledge to cope with, then there are two possibilities. Does the network have this knowledge in it or not? If it doesn’t, then we can conclude this hypothesis is wrong; that we’ve been somehow misled all along into believing that structure is important for language, and somehow it’s not. On the other hand, if the answer is that, looking specifically for the kind of knowledge that linguists propose, we actually find it, then of course that vindicates the hypothesis quite strongly. My first goal in coming here actually was to try to find out which of these possibilities is actually the case. That involves studying the competence of these networks, to see whether or not they have a command of the relevant structure-demanding, according to the hypothesis, constructions and so on in English. If they do, then can we see evidence that they have somehow found a way of acquiring, storing, and using that knowledge? Yes? >>: By great English do you mean human-like English? >> Paul Smolensky: Yes, yes, so I was being a little bit glib, just because there are specific reasons why specific kinds of structures are believed in. If the networks aren’t producing those kinds of sentences then of course no linguist would expect to find that structure in there. Yes? >>: What do you mean by [indiscernible]? >> Paul Smolensky: Well, that by taking it as a working hypothesis that there is a kind of knowledge in these networks that linguistics says is necessary, we actually find out something about the networks we wouldn’t have found out otherwise. It’s actually led us to achieve a greater degree of understanding than we could have achieved without it. >>: I was a bit confused.
Maybe I didn’t have enough coffee so my logic isn’t up to speed. Is a good indication the same as retaining a hypothesis? >> Paul Smolensky: Is it the same as… >>: Retaining a hypothesis? >> Paul Smolensky: Retaining it, well… >>: Or proving the hypothesis? >> Paul Smolensky: Well, you can’t really prove a hypothesis that something is required, I don’t think. >>: It is [inaudible]? >> Paul Smolensky: Yeah. Okay, alright, so as I said, this is one of the two goals that I had in coming here. I would really be delighted if anybody else was interested in pursuing this question. >>: I’m, just a comment here. That really doesn’t mean DNN, right, because for the image captioning problem many of those sentences are actually coming from the training set, so they don’t really count [indiscernible]… >> Paul Smolensky: Yeah, yeah, well… >>: I think the real network you’re alluding to is the recurrent neural network. That by itself will produce reasonably good English, injecting kind of a [indiscernible], using a recurrent neural network to generate character strings. They’re very often [indiscernible]. >> Paul Smolensky: Right. >>: I think that’s probably better… >> Paul Smolensky: Yeah, I agree, that’s really the kind of thing I had in mind actually. Yes? >>: You mentioned the [indiscernible] structure being necessary. Did you mean sufficient? >> Paul Smolensky: I actually meant necessary. >>: Okay, right. >> Paul Smolensky: Sufficient I think is less controversial. Okay, so… >>: But then the hypothesis is very strong, right? I mean it cannot be proven that without the representation [indiscernible]? >> Paul Smolensky: I don’t think that can be demonstrated. All we can ask is whether the existing systems that we have in front of us are counterexamples or not. Because the conventional view is that there’s a very strong segregation between knowledge in neural networks and knowledge of the sort that symbolic linguistic theories involve. Not only does the hypothesis have to be that, but that knowledge isn’t there and maybe even couldn’t be. But that needs to be pursued. The whole point of the research I’m going to tell you about is how to bring these two kinds of knowledge schemes together. >>: In, like, the vision community there has been a lot of work. People really do try to look at different parts of the network and see what is being encoded in the network. >> Paul Smolensky: Yes. >>: I’m not familiar with the NLP community that much. But does the community really feel that there is no, I mean, your [indiscernible] communities think that these [indiscernible] separate. >> Paul Smolensky: My knowledge of the vision side is definitely insufficient here. But it does appear that the notion of some kind of receptive field is a useful notion in vision. It has yet to be shown to be useful, I think, in the kind of networks that we’re talking about for producing, let’s say, sequences of a lot of characters or words. There seems to be a more ready access to trying to interpret the computation in networks that are doing visual processing than what we see in these linguistic networks, I believe. If anybody disagrees I’d be happy to be corrected. Okay, so how are we going to look inside to see if it encodes abstract structural knowledge? Well, I propose to rely on a theory of integration of neural and symbolic computation that I’ve been developing. It’s called gradient symbolic computation, GSC. This first goal for the work is a kind of reverse-engineering application of GSC.
But there are also at least as important engineering applications that we are pursuing: to try to see whether we can build networks using the principles of GSC to increase their capabilities in domains like language. The goal of this kind of engineering is to unify the learning power of neural networks and the generalization power of abstract structural knowledge. Now, connecting neural networks and symbolic computation has been a kind of imperative. That is to say, if you believe that symbolic computation has an important role to play in understanding intelligence, that has been an imperative for a long time, or maybe always, in cognitive science, since we do believe that underlying our intelligence is a neural network. But of course now, since the rise of DNNs in AI, it has I think become an imperative there also. That’s a fortuitous confluence, I think. We’ll be much better off as a result of having those two communities working on this problem, hopefully to some degree together. If the attempts to engineer networks that integrate symbolic and connectionist, or neural network, processing are successful, then what they ought to lead to is increasing the capacity of neural network computation to be interpreted and to be programmed; increasing neural networks’ capacity to represent and process discrete combinatorial structure, which is what structural processing is all about, of course. On the other side, it should enrich symbolic computation by infusing the power to represent, process, and especially learn continuously varying, gradient dimensions of content. I’ll try to emphasize at a couple of points in this talk gradient dimensions of structure, which is a pretty foreign concept to most views of symbolic structure. Now this theory, GSC, unifies neural and symbolic computation in general. We could pursue variants of this hypothesis to understand AI systems that are based in neural networks that do reasoning or planning, or higher vision. The same belief is out there, that successful performance approaching human levels requires some kind of symbolic computation, in many fields outside of language. GSC is not itself really restricted to use in language, but that’s where I’ve pursued it. Now, today I’ll talk about GSC in general. In the talk on October twelfth I will focus on its applications within language. Okay, so the outline for the talk is to zero in a bit more concretely on the problem that GSC is supposed to be a solution to; talk about the important role that distributed representations have to play; give a proposal for how to do representation in neural networks that will unify with symbolic computation in an appropriate sense; talk about how to program these networks, and the effects that you get from the similarities resulting from distributed representation; and eventually identify a couple of applications that have been pursued in cognitive modeling and in AI, and a few words about the reverse-engineering prospects. The problem that gradient symbolic computation addresses is a kind of grand unification problem: unifying the symbolic and subsymbolic approaches to artificial intelligence. It attempts to unify the following hypotheses. First of all, the one we’ve already seen: that an insightful, powerful description of cognition is possible when it’s viewed as symbolic computation, but also when it’s viewed as neural computation. These are two hypotheses that have not worked together too well in the past. Most people have cast their lot in with one or the other.
As someone who’s been trying to put them together for thirty years, I can tell you that there’s not a whole lot of people interested in the prospect. But… >>: Except… >> Paul Smolensky: Maybe it will change. Maybe it is changing. >>: Except this new workshop and [indiscernible] that. >> Paul Smolensky: Yes, yes, there are quite a few things… >>: [indiscernible] example about that… >> Paul Smolensky: Yeah, two days, yep. >>: Okay, and do you… >> Paul Smolensky: The Office of Naval Research is interested. >>: Survival for this… >> Paul Smolensky: All of a sudden things are changing, so that’s gratifying. Okay, so as I was just suggesting, nearly everyone is pretty skeptical that some kind of grand unification of this sort is possible. Others might say that it’s not even desirable. But as a result of the skepticism, I’ve focused really mostly on trying to build formal arguments that the proposal I’m making in gradient symbolic computation really does solve the problem. Here I’m trying to move on to engineering and reverse-engineering models built on these ideas. But that’s new; most of the work has been entirely about developing formal results on its adequacy as a solution to this unification problem. If these applications go well, then maybe some people who have been skeptical of the desirability of this unification may be persuaded. Okay, so the, yeah? >>: Do you think that it is possible this symbolic computation just arises simply from the need of intelligent networks to exchange information? Like through a noisy channel with limited capacity? >> Paul Smolensky: Well, there are many places in AI and cognitive science where symbolic theories are applied to problems that are not themselves implicated in communication. It could be that the capacity is there because of communication and, being there, is applied to other things. There is certainly a lot of speculation about the relation between the evolution of our ability to do abstract symbolic computation and the evolution of language. I don’t have any opinions about the believability of those speculations. But did you have a particular…? >>: I was just wondering, you know, what you’re saying: is there a way to make an experiment like that, where you would see, if I’m developing agents that have to do something in concert. They have to communicate, but you’re limiting their communication. They can’t transfer the entire brain to each other. >> Paul Smolensky: Right. >>: Would then something that had a symbolic meaning emerge in the… >> Paul Smolensky: People have claimed to have produced just such results in just such computational experiments. There’s some reason to believe that in principle things could have evolved that way. Yeah? >>: Do you still believe the individual hypotheses, that cognition can be purely symbolic or purely neural? Or do you think that it requires both? >> Paul Smolensky: Well, the approach that I’ve been taking is that these are both extremely important computational models, but they apply at different levels. Certain questions should be analyzed at the level at which symbolic computation is the powerful model. Others should be examined at the level at which neural computation is a powerful model. Others need both. >>: You would say that the individual ones are not powerful enough, and that both are required? >> Paul Smolensky: Right. >>: Right. >> Paul Smolensky: Yeah, yeah, that’s what I’ve been trying to do.
Put them together, because of perceived inadequacies that they have individually. Yeah? >>: In a way, can it be related to this concept of fast and slow thinking in psychology? >> Paul Smolensky: Yes, so the work that I’ve done really focuses on something like the automatic processes that go on largely in parallel, largely unconsciously, and which run for half a second or something like that. Ultimately that needs to be part of a much bigger architecture that has much more serial control and so on. I think the kind of dichotomy that you’re referring to is part of that picture. I don’t have a whole lot to say about the bigger architecture. I’ve just been focusing on what is something like a primitive unit of parallel computation within it. Okay, so, the inspiration for this kind of unification. Unification across levels is easy to see in computer science, where the macrostructure of computation and the microstructure of computation are very familiar levels whose relationship we understand, because we’ve built them. We understand the notion that there’s all sorts of potential virtual structure at higher levels of organization than what is physically built. In physics the same kind of picture is ubiquitous as well: emergent properties of large systems can involve properties that have no appearance at the micro level. Here in cognitive science the corresponding picture I think is the most interesting, because whereas in the computer science case both higher and lower levels are essentially discrete, in physics they’re essentially continuous. What happens in cognitive science on this approach is that you actually have a transition from a fundamentally continuous to a fundamentally discrete system, as you go from the microstructure to the macrostructure. The tools for doing that are actually stolen from both of the other two on the left. >>: I thought that this division is quite different, where I thought in physics we need to sample everything. You can do it. You can keep, you know, moving from microstructure… >> Paul Smolensky: Yes. >>: You can do microstructures [indiscernible] over here. We have theory that simply says that we could… >> Paul Smolensky: Here? >>: We could potentially, that it would be the equivalent to each structure, it completes the structure. Through neural condition you automatically accomplish whatever things you want to do [indiscernible]. I think these are different. >> Paul Smolensky: Yes, yes. >>: Do you agree? >> Paul Smolensky: I agree; that’s considerably further down the line from where we are in this talk right now, but yes. >>: Okay. >> Paul Smolensky: I would agree with that. What the work that I will tell you about proposes is that there’s actually a valuable level in between, in which the objects are tensors and computation involves tensor operations. This is somewhat of a kind of interlingua between the language of activation vectors and the language of symbol structures. It’s called gradient symbolic computation because the basic element of computation is the gradient symbol structure. Here’s an example of a gradient symbol structure. This is a syllable in which the final consonant is a blend of d and t. It is part of a structure, not just part of a heap. It’s part of a structure, but it involves continuous degrees of activity of different kinds of symbolically interpretable elements.
That’s an example which is actually relevant to one of the applications that we have, which I will get to eventually, today or next time. This picture is a symbolic kind of drawing of something which could also be written algebraically, like this. This is identical to that in its reference. This refers to a particular tensor in which a symbol b is encoded as a vector and bound to a vector which assigns to it the position of onset of the syllable; similarly for the nuclear vowel, and similarly for the coda consonant, which in this case happens to be a linear combination of vectors that are interpretable as d and t. >>: Yes, just one question, [indiscernible] very specific question related to the structure that is under different d of t. The [indiscernible] over here where it [indiscernible] propagates like from under b in the lower end of t [indiscernible]. >> Paul Smolensky: Yes. >>: [indiscernible] issue you actually can’t explain that [indiscernible] would be very different [indiscernible] application. >> Paul Smolensky: That is… >>: Separate? >> Paul Smolensky: No, it’s woven into the main questions here. This is intended as a representation within the system of phonology. >>: It’s phonology. >> Paul Smolensky: Distinct from a system of phonetics. >>: I want to talk about phonetics. >> Paul Smolensky: You were talking about the phonetic differences between the realization of this vowel before d and before t. The question for this is whether there is a difference at the level of the phonology, whether there are phonological principles that are at work shaping that vowel, or whether it’s only the phonetic realization that has contextual effects in it. There’s reason to think that the phonological grammar needs to be involved here. I don’t know if there’s reason to believe it there, but I think that some people have said so. That might be. >>: But that would be a generalization [indiscernible] would have two versions. >> Paul Smolensky: Yep. >>: It could just depend on the b, but depending on t or b [indiscernible] sort of a different level. >> Paul Smolensky: Right, well, the proper phonological representation might in fact be one where there’s a blend in the vowel as well as in the coda consonant. Okay, and just because it’s a bit of a red herring: all of this is orthogonal to the issue of probabilistic modeling. A computational state in this kind of computation is a probability distribution over structures like this. These are not probabilities. But probabilities are part and parcel of the global state of the computational system. This particular representation will have a probability, as will other representations. Okay, so looking down the line: if this kind of confluence of network-driven and symbolic-computation-driven approaches to intelligence, if the convergence, really is achieved, then I think that the representation of knowledge and data for cognition in our century will be about understanding these kinds of representations: what kinds of functions are computed over them, what kinds of grammars evaluate them, and so on. This will be important for understanding the existing DNNs, I think, and the brain. Then the hope is that it would also be helpful for engineering better DNNs that are built to process these kinds of structures, which have interpretations and are not just meaningless activation patterns.
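To make the algebra concrete, here is a minimal sketch of that gradient syllable as a tensor, in NumPy. The particular vectors, dimensions, and blend weights are illustrative choices for this sketch, not taken from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative orthonormal filler vectors for the symbols b, a, d, t
# (rows of an orthogonal matrix). Orthonormality makes the symbolic
# interpretation exactly recoverable.
fillers = dict(zip("badt", np.linalg.qr(rng.normal(size=(4, 4)))[0]))

# Illustrative orthonormal role vectors for the three syllable positions.
roles = dict(zip(["onset", "nucleus", "coda"],
                 np.linalg.qr(rng.normal(size=(3, 3)))[0]))

# Gradient symbol structure: a blend of t and d fills the coda role.
coda_blend = 0.9 * fillers["t"] + 0.3 * fillers["d"]
S = (np.outer(fillers["b"], roles["onset"])
     + np.outer(fillers["a"], roles["nucleus"])
     + np.outer(coda_blend, roles["coda"]))      # one 4x3 tensor

# Read the coda back off: contract with the role vector, then measure
# how much of each symbol is present in that position.
coda = S @ roles["coda"]
for s, v in fillers.items():
    print(s, round(float(v @ coda), 2))          # t: 0.9, d: 0.3, b and a: 0
```

The blend weights here are continuous degrees of activity of symbolically interpretable elements, which is what makes this a gradient symbol structure rather than a discrete one.

Okay, so at the foundation of this architecture is a cross-level mapping.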
Here it’s represented with the letter Psi, which takes symbol structures and maps them into a discrete subset of a vector space. We have a space of discrete symbolic inputs to some computation, let’s say, like some syntactic tree for example. We have a discrete symbolic space of outputs, say logical forms of some sort. We have a function we’re interested in that maps from one to the other. In the GSC approach this function is actually not computed at this level, or at least needn’t be; psychologically speaking, isn’t. Rather, these are high descriptive levels that are characterizing states of a neural network. The mapping Psi tells us which vector in the space of inputs, that is to say the space of activation states of a set of input units in a neural network, which of those states, is the realization of that syntactic tree; Psi tells us that. It also tells us what the realization in the output space is, the vector that realizes that logical form. Then GSC provides a way of going from this to this, which closes the loop. It uses natural neural network operations to get the output from the input; it doesn’t use symbolic operations, or it doesn’t use conventional symbolic operations, to be sure. The theorems that prove these assertions, about the sets of functions that can be computed in this way, are the strongest arguments in my opinion that GSC really does provide a grand unified theory for AI. I spent a fair amount of effort trying to develop them. But whether they have use in applications is something I’m here to try to understand. >>: Can I ask a question? >> Paul Smolensky: Yeah. >>: Is there something that limits the input to be symbolic? Can you just start straight at the kind of vector space input? Simply, could the input be numeric instead of symbolic or… >> Paul Smolensky: The idea is that the input is numeric and so is the output. The machine lives down here. This is our theoretical understanding of the machine here. The mapping Psi is really between our interpretation and the state of the machine, as opposed to part of the computational path. >>: Well, I thought that what you were diagramming is sort of a diagram of how such a machine might solve the task of mapping a parse tree to a sort of syntactic or semantic representation? Am I totally off on that? >> Paul Smolensky: No, you’re totally on, on that. >>: Okay, so then if you wanted instead to map, like, you know, auditory inputs to words, can it still do that kind of task? Or is it mostly to do symbolic-to-symbolic types of tasks? >> Paul Smolensky: Well, the idea is to deploy natural neural network operations here. They can be applied to any kind of neural network input that makes sense, whether it is something that had a symbolic interpretation or not. The answer is there’s no need for the input space to have a symbolic interpretation. But to the extent that we’re using GSC to understand what’s going on in the system, then that’s the role that it will play. >>: Okay. >> Paul Smolensky: Yes. >>: Could you please comment on the relation between this framework and, say, the standard kernel machine method in machine learning? Whereas up there you would map symbolic structures potentially to a Hilbert space, and then perform operations in that Hilbert space, and can map it back to a symbolic space. There seems to be some resemblance here. I’m curious to hear that.
>> Paul Smolensky: Right, well, the mappings that I’m talking about could, I think, be instantiated in the way that they are done in some of the proposals you’re talking about. I would say that the relation is that some of them at least form a case of this. Yes? >>: So this doesn’t necessarily contrast with any existing state-of-the-art neural network architecture, right? There could be, like, this looks similar to what they’ve talked about on the [indiscernible], right. That does fall within this framework, right, like where you map two things to representations, then take the dot product between them; maybe it’s like an image and a caption of that image, and you map them both to a shared embedding space. Is that a special case of this? >> Paul Smolensky: Whether the mapping from a sequence… we might be taking an input image and an input sequence of words… >>: Yeah. >> Paul Smolensky: Scoring whether that’s a good caption for that image, or something like that. >>: Yeah. >> Paul Smolensky: Then indeed the process of taking the sequence of words and producing a vector to encode it does have some of the character of this path. The way that GSC does it is a parallel approach and not a sequential one. It might very well prove wise to try to expand that aspect of the program to at least incorporate that degree of sequential processing. But it’s really been focused on parallel processing so far. The mapping from the words to the vector is a parallel one and not a sequential one. >>: Okay, so you’re not defining GSC as anything that has these properties. This is just the high level [indiscernible], and GSC has, you defined, specific functions for these vectors… >> Paul Smolensky: Yes, yes, I’m going to instantiate these things specifically. But it’s certainly a very worthwhile question whether there’s a general picture that would also encompass the cases you’re talking about. >>: Yeah, so we did get some questions earlier. Just want to know if, you know, your comments are [indiscernible] comments. We expect the very first question as to whether the input, you know, for this mapping has to be structured enough [indiscernible]. Because [indiscernible] but there’s a big difference, you know, [indiscernible] project so far. If we map directly from a continuous vector, on the left corner up there, to the symbolic structure that we wanted, then you miss this. You have lost this advantage of isomorphism between structure and structure. Therefore it’s more difficult, compared with if you take advantage of the input structure; you might actually get to hide things. It would be more efficient… >> Paul Smolensky: Yeah, so I think that that’s a good instance of this general hypothesis. Linguists talk about the mapping between syntactic representations and semantic representations as embodying some kind of isomorphism. That’s one reason to believe in the syntactic structures. To the extent that that underwrites, you know, sound competence in language, then you might think a machine would benefit from taking advantage of it. >>: I see, so the theory actually requires that you need to have an isomorphism in order to have this [indiscernible], or is it alternative? >> Paul Smolensky: Well, the isomorphism, to the extent that it exists, is between this and this, rather than between this and this. >>: [inaudible], okay. >> Paul Smolensky: Yes, yes? >>: One more follow-up on the isomorphism.
Originally you showed us the mapping little f, which maps purely in symbolic space. Then you said we want to go to this mapping big F, which is done on a vector space. You’re saying these two things can be isomorphic. But in general are you saying they need to be exactly isomorphic? The reason I’m asking about this is maybe you want big F to be more flexible, or maybe you say, no, no, really I want to do these operations directly in vector space, but they should be exactly equivalent to little f. Like in machine translation, for instance, there’s this famous pyramid where you go all the way up to an interlingua and back down; turns out that doesn’t work very well. We went down to syntax, a slightly less restrictive symbolic representation. Every time we had to go down the pyramid toward more concrete representations we seemed to get through this. Do you think we want to be doing exactly symbolic representations, or do we need flexibility? >> Paul Smolensky: The philosophy of the research program is that if it’s possible to do an exact instantiation of these maps, then you’ve demonstrated that the apparatus down here can do symbolic computation, okay. Now, how you want to use it is a second question. But what you know is that it has that kind of capability. The idea for giving a more useful description of cognition than you have already from the symbolic description lies precisely in the way in which this mapping is going to be richer, and not slavishly implementing this mapping here. For example, the gradient representation of the syllable I showed a moment ago: that doesn’t have a discrete counterpart up here. It can’t be part of a theory of the task up here. But it certainly can be down here. To say that the vector down here is some sort of blend of t and d is to be taking advantage of this mapping, but doing it in a way that goes beyond what the discrete symbolic representations themselves can accommodate. >>: Seems like [indiscernible] is in as a subset then. >> Paul Smolensky: Yeah, yeah, right. Rich. >>: When you go from left to right you’re going to use natural neural net operations. Do you allow for intermediate states which might have no semantic, no symbolic interpretation? >> Paul Smolensky: Absolutely. Yeah? >>: By vector space of the states you mean actually all the open states, is that correct? >> Paul Smolensky: Right, that’s what this box refers to, yeah. >>: [indiscernible]… >> Paul Smolensky: Some discrete subsets of them are the images of these guys up here. Yeah? >>: Is it true that if the [indiscernible] are bijections, then there exists a big F that will, you know, just map them perfectly, right, just map those points to some other points? >> Paul Smolensky: Well… >>: If we allow a neural network to be rich enough it will be able to implement that mapping. >> Paul Smolensky: Right, right. So the idea is to actually concretely exhibit natural instances of this function here that compute interesting and useful functions here, rather than to just have an abstract proof that there is something that will do it. We want to actually exhibit it and determine what kind of neural network capabilities are required to do the computation. Yeah? >>: You mentioned that the mapping between the space of discrete symbolic inputs and the vector space happens in parallel. But I would think that if you’re preserving something about symbolic linguistic structure it would happen, like, hierarchically.
Are you going to speak more to what that process is, maybe in later slides? >> Paul Smolensky: Well, the idea is that it applies in parallel to the different levels of the hierarchy here, if that answers the question. >>: It is hierarchical? Different levels of the hierarchy done in parallel? >> Paul Smolensky: Right. Okay, so let me go on a little bit further. >>: [indiscernible]. [laughter] >> Paul Smolensky: Funny how that works. >>: Yeah, so I was waiting for people. This is for text only, right? But think about images, right. So if you, an image, I mean, normally we get a continuous input. But something that you can [indiscernible] you really can get a silhouette. You want to extract what is [indiscernible], right. >> Paul Smolensky: Okay. >>: Then this Psi is now, sorry, I’m trying to map this to visual inputs. So are there Psi features? >> Paul Smolensky: Well, I think… >>: Are there Psi feature functions? >> Paul Smolensky: The thing that makes most contact with this in my limited thinking about vision is that if you had something like a description of an object in terms of its parts and the relations of the parts, that would be here. >>: Yeah. >> Paul Smolensky: Then you could have something much closer to an image down here, which this is going to be mapped to, and it would serve as a means of interpreting the image state down here. That’s the closest I would get. However, it might well be that the vector space here should be not pixels but some other features of the image space that are much more suitable for instantiating the abstract object representation. That would hardly be surprising. Okay, so let’s see, maybe I will skip this, because I just wanted to make sure that this didn’t get gone over too fast. [laughter] It is not the problem. Okay, although this does make a point: when I say natural neural network computation, the first group of results I have applies to linear operations from here to here. Just multiplying this vector by a matrix to get that vector; there couldn’t be a much more straightforward, simple neural network operation than that. But you can compute a very interesting set of symbolic functions that way, as it happens.
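As a minimal sketch of that claim, here is one symbolic function, reversing a two-symbol string, computed by a single matrix multiply over tensor product representations. This is NumPy, and the fillers, roles, and choice of function are illustrative, not the talk’s own example.

```python
import numpy as np

rng = np.random.default_rng(1)

# Orthonormal role vectors for positions 1 and 2 of a two-symbol string.
r = np.linalg.qr(rng.normal(size=(2, 2)))[0]      # rows r[0], r[1]
# Arbitrary filler vectors for two symbols X and Y.
x, y = rng.normal(size=(2, 5))

# TPR of the string "X Y": a sum of filler-role outer products.
S = np.outer(x, r[0]) + np.outer(y, r[1])

# One weight matrix acting on the role index swaps the two positions,
# i.e. computes the symbolic function "reverse the string".
W = np.outer(r[0], r[1]) + np.outer(r[1], r[0])
S_reversed = S @ W

expected = np.outer(y, r[0]) + np.outer(x, r[1])  # TPR of "Y X"
print(np.allclose(S_reversed, expected))          # True
```

Notice that W mentions only the role vectors, so the same weights reverse the string for any fillers; that is the sense in which one linear neural network operation computes a structure-sensitive symbolic function.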
Okay, so now this is getting back to the issue of the role of the approximation aspect of the picture, and how slavish it’s supposed to be to the cruel discrete master on top. The idea is that the vectors that are in fact the image, under this realization function Psi, of the discrete structures form a discrete set of points in the vector space; that’s called D here. Then of course there’s the whole rest of the space. Those encode properly gradient symbol structures. By properly gradient what I mean is not discrete: really involving linear combinations of symbols in positions in the structures. There are several kinds of uses that those states get put to. The weakest sense of their usefulness is as transient states in dynamic versions of the computational system. Not like the one I showed you, which is a simple linear mapping, but in recurrent networks that have a dynamics to them, in which those are intermediate states in the processing of the final output, which is nonetheless a fully discrete state in this set D. Somewhat more interesting are cases where we want to say that the final state, not just the transient intermediate states, but the final state of processing, is off this discrete set but near to it. These can be used to model variation within a category of outputs, which can be made distinct in ways that their symbolic counterparts can’t. What I showed you in the gradient structure for the syllable was in fact a case of this, where we have a network with a grammar that says that final stops need to be voiceless, which is true in a language like Dutch, where indeed a final D is pronounced as a T. Except it turns out that the T is not pronounced identically to an actual T; it’s a little bit more D-like. That’s modeled as a point a little bit off the set of discrete structures, with a mild blend of D mixing into T. Then… >>: [indiscernible], are there examples in syntax about that probability you just mentioned in [indiscernible]? >> Paul Smolensky: Then the most radical use of gradient symbolic structure is when the final states are not even close to discrete ones. For example, some people believe in some kind of shallow parsing, or incomplete parsing, or good-enough parsing: when people get a sentence they don’t actually drive all the way through to some nice clean discrete parse of it. They leave unresolved a number of ambiguities whose resolution would be needed in order to get a full discrete state. What they have is actually a blend of partial parses. That’s an example. Yeah, I’ll mention another one later if I ever get there. Alright… >>: [indiscernible] this probably [indiscernible]… >> Paul Smolensky: It’s the newest part. It’s the newest part. It’s newer than the book The Harmonic Mind, which has only the discrete structure aspect of it. Okay, so it’s now eleven thirty. This calls for some recomputing. [laughter] When should I stop talking? It’s eleven thirty-one according to the clock. >> Li Deng: I think we can up for [indiscernible]. >> Paul Smolensky: But we had a lot of discussion already, I guess. You guys have had the floor more than I have, right? I get to have some of it back. [laughter] Okay, so… >> Li Deng: I guess finish that session for today… >>: Maybe working on the clock part two. [laughter] >>: We already have talk part two. I can bring out talk part three. [laughter] >>: But one [indiscernible]… [laughter] >> Paul Smolensky: Okay, so let me see here. >>: Probably [indiscernible]… >> Paul Smolensky: Okay, so I think I’m going to skip that point with the hope of touching on it some other time. I was really interested to get your take on this argument, which I haven’t presented before because it always gets left out, for the same reason it’s being left out right now, which is that it’s a bit of a tangent. Okay, so let’s talk about the actual proposal for what the mapping Psi is. That’s here, I guess. Okay, so part three: the proposal for the representational scheme. It’s based in tensor calculus. The part I just skipped includes an explanation of: why tensors? But you’ll just have to believe that if you were willing to sit for long enough you would get the answer to that question. But both of us know you’re not. An nth-order tensor is, very simply, thought of as an array with n dimensions, so that its elements have n subscripts or indices which identify the individual numbers in this array; and apologies to mathematicians for that definition. There are two basic operations. There’s the outer product, which increases the order of tensors. There’s the inner product; I’m sorry, here’s the outer product. If we have two tensors, one which has n indices and one which has m indices, then the outer product, also really called the tensor product, is something which has n plus m indices.
This is the symbol for the tensor product. But over thirty years I’ve come to the conclusion that people freak out when they see that symbol too much. I’ve purged it. Everything is written in a new notation. I apologize in advance for mistakes that may have arisen as a result. You will not see the tensor product symbol again, I think. I’m just going to write them next to each other, like that. I don’t know why people freak out about that symbol. I don’t know whether there’s something about being both circular and angular that kind of… [laughter] That people don’t know how to relate to it or something, I don’t know. I’m not sure what it is. But we can dispense with it. The definition is pretty simple. The n-plus-m-order tensor is just the numerical product of the elements of the first and the second tensor. There are n plus m indices required to take into account all possible products of the elements: n indices of the first and m of the second. The other basic operation is contraction, which decreases the order of the tensor. Here is an awful-looking formula which is trying to say something very simple; I should have just written it in words. This operation here is the contraction over the ith index and the jth index of this tensor, which has order n plus two. The outcome is a tensor of order n, because the ith index and the jth index of this tensor are both eliminated. They’re eliminated in the following way. Here’s where the ith index goes; here’s where the jth index goes. We just set them both equal to one, and then, for each setting of the remaining indices, we have a number. Then we just add that to the number that we get by replacing them both by two, and then by three, and by four. We sum, over all possible values of q, the value of the tensor with a q in the i and j positions. If we only had two indices to begin with, so that T was a matrix, we would be taking the sum of the diagonal elements; we would be performing the trace. It’s a generalization of the notion of the trace of a matrix to higher-order objects. There are many important special cases. The dot product is the simplest one. If the tensors we are talking about have only one index, think of them as vectors A and B. Then, that’s interesting, how did I get two and three? Curious, it should be one and two. There are two indices here. If we set them both equal to q and sum over all possible values of q, then we get a simple number. That’s the dot product of these two vectors. The matrix product is another special case. In this case we have A and B both having order two: matrices. The outcome has order two because we add the two orders together, four, and subtract two, which is what this does. There are four indices here, the two of A and then the two of B. We take the second index of A and the third index of the whole thing, which is the first index of B, set equal to each other and summed. Then we get the matrix product of A and B. That’s just this equation, if these are interpreted as matrices. In general, if we have two tensors there are many inner products of the two tensors, depending on which indices we contract over. But the idea is that if we have an n- and an m-order tensor and we take their outer product, then contract over one of the indices of A and one of the indices of B, we’ve performed something called the inner product over i and j, which has decreased by two the order of the product of these two guys. That’s the inner product. The other thing was the outer product.
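A minimal sketch of these two operations in NumPy (the array shapes here are illustrative), showing orders adding under the outer product, and both the matrix product and the dot product falling out as contractions:

```python
import numpy as np

A = np.arange(6.0).reshape(2, 3)      # an order-2 tensor (2 indices)
B = np.arange(12.0).reshape(3, 4)     # another order-2 tensor

# Outer (tensor) product: orders add, 2 + 2 = 4 indices.
T = np.einsum('ij,kl->ijkl', A, B)    # T[i,j,k,l] = A[i,j] * B[k,l]

# Contraction over the 2nd and 3rd indices: set them equal and sum.
# Orders subtract, 4 - 2 = 2, and this recovers the matrix product.
C = np.einsum('iqql->il', T)
print(np.allclose(C, A @ B))          # True

# The dot product is the same recipe one order down: the outer product
# of two vectors, contracted over its two indices (the trace).
a, b = np.ones(3), np.arange(3.0)
outer = np.einsum('i,j->ij', a, b)
print(np.einsum('qq->', outer) == a @ b)   # True
```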
Those are basically the only two things that we need, because the symbolic operations in this approach are implemented as binding by the outer product, and unbinding by the inner product. I’ll just give you a bunch of examples on the next slide of what I mean by binding and unbinding. The idea is that when we perform this product here we are in some sense sticking together whatever it is that A represents and whatever B represents. We’re binding them together; we’re referring to something that is conjoint between them. Then later, if we want to know what it was that was bound with B, we want to unbind its partner and pull out A. That’s… >>: The inner product is only one example of the [indiscernible]. There are other ways of binding using different types of [indiscernible] up to half percent [indiscernible]? >> Paul Smolensky: Not that I’m aware of. >>: It’s probably the other one actually. >> Paul Smolensky: Not that I’m aware of. Okay, so, binding by the outer product. Here’s a bunch of examples. There’s a nice, elegant, general approach to this. People don’t seem to understand it, so I’m going to avoid it. This is a bunch of simple examples instead. But maybe you can believe me that they’re all instances of one proposal. Attribute-value binding: if we have an attribute-value structure, it’s a whole bunch of attributes which are bound to values. Here we have some big structure, one little part of which says that the agent of this event is J. The proposal is that the vector that represents that attribute-value binding is just the outer product of the vector that represents agent and the vector that represents J. Another way to write that, a little bit more perspicuously, is to just write it like this. It’s just the outer product of the A tensor and the J tensor that encodes the binding together of agenthood and the individual J. In a graph we have links which may have, so we have an edge, which I’m assuming is labeled; it goes from A to B. We represent this object as the vector which is the outer product of the three vectors that encode individually the symbol on A, the symbol for the relation, and the symbol for B; again, more perspicuously written that way. Relations: this says that X and Y stand in the relation R. The same notion, just as above: an outer product of the vectors encoding the individual symbols forms the binding together into a single relation. The next one is a little bit different in character. Now we’re talking about binding a symbol to a position within a string. This is a description of the binding of one of the symbols, X, to one of the positions, namely the second position. The vector for the X part of this is the outer product of a vector which represents the second role, the second position, and the vector that represents the symbol that fills that role, X. That can be extended to trees. The way I found convenient to do that is to think about positions in a binary tree as labeled by binary sequences. The position of X is the left child of the right child of the root of the tree. Left child of the right child of the root: that is the position that X fills. We take the outer product of these two. It doesn’t matter which order you pick; everything will be isomorphic across the two orders. You just have to be consistent in your use of all the bindings of a particular type, to have a particular ordering chosen. It doesn’t matter whether we put R in the middle or here, as long as we’re consistent. It doesn’t matter whether we put R on the right or the left.
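A minimal sketch of two of these bindings and their unbindings in NumPy; the helper and all vector dimensions are illustrative choices for this sketch, not the talk’s.

```python
import numpy as np

rng = np.random.default_rng(2)

def rand_orthonormal(n):
    # Illustrative helper: returns n orthonormal vectors as rows.
    # Orthonormality makes unbinding exact; linear independence is the
    # weaker requirement the talk returns to below.
    return np.linalg.qr(rng.normal(size=(n, n)))[0]

# Attribute-value bindings: "the agent is J, the patient is M".
agent, patient = rand_orthonormal(4)[:2]      # attribute vectors
J, M = rand_orthonormal(4)[:2]                # value vectors
event = np.outer(agent, J) + np.outer(patient, M)

# String-position bindings for "X Y": symbols bound to position roles.
r1, r2 = rand_orthonormal(3)[:2]
X, Y = rand_orthonormal(4)[:2]
string = np.outer(X, r1) + np.outer(Y, r2)

# Unbinding by the inner product: contract with the role's dual vector
# (here the role itself, since the roles are orthonormal).
print(np.allclose(agent @ event, J))          # True: the agent is J
print(np.allclose(string @ r2, Y))            # True: position 2 holds Y
```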
But for the formula I’m about to show you it’s more convenient for it to be on the right. This is the more abstract notion: that the position which is the left child of the right child of the root is itself a kind of binding. It’s recursive. The idea is that this particular position is gotten by binding the left-child sub-position to the right-child position. This product is a way of expressing the relationship between this position in the tree, the left child of the right child of the root, and the basic vectors. There are only two of them for binary trees, and they can be used to generate vectors for all of the positions by the same recursive procedure. That captures the recursive structure of trees, in the sense that recursive functions defined over binary trees can be straightforwardly computed by linear operations when the recursive character of the data structure is encoded in this way, in the vectors that encode the positions in the tree. Okay, and in a proper tensor product representation, and this is intended as a technical term here, all the vectors that encode the symbols are linearly independent of each other. I think one of the things that got skipped in the second part is a very important point which I only recently really appreciated. All my life I’ve been kind of living in the cognitive context, where the number of neurons is large compared to the number of concepts. There is a one-to-many relationship that’s usually talked about in the neural realization of concepts: many different neural patterns could instantiate a given conceptual one. But it turns out that, as fantastic as the computers in this building are, they cannot actually cope with as many processors as we have in the brain. Now I find I’m constantly being told I have to have fewer neurons than I have concepts, not many more. In order for the vectors encoding symbols to be linearly independent, you have to have at least as many neurons as there are symbols; basic property of linear independence, you can’t do it otherwise. The work that we’re doing now involves improper tensor product representations, which do a kind of compression, in which the pattern over neurons that encodes a symbol is not independent of the other patterns that encode other symbols, because there’s just not enough space when you compress the code down to a number of neurons that’s small compared to the number of symbols being encoded. >>: You mean compression or contraction? Or [indiscernible] is the way [indiscernible] contraction? If a… >> Paul Smolensky: I mean compression, and in this case it’s not necessarily achieved by contraction. It’s just a statement that you can’t have linearly independent vectors for all your symbols, because your network isn’t big enough for that. The mathematics that I’ve done for showing what functions can be computed and so on has always been in the context of proper tensor product representations, which I think is sensible for the brain, but not necessarily for an application where the number of concepts equals the number of words of English, or something like that. Yes? >>: If you require the vectors to be linearly independent, does that take away any advantage of your distributed [indiscernible] of [indiscernible]? When people say you [indiscernible] they won’t take advantage of that. >> Paul Smolensky: Yes. >>: [indiscernible] >> Paul Smolensky: Okay, so I know that you guys are keen about the advantage of distributed representations.
That you can have more than n patterns with n units. That is true, it is an advantage. But it’s not the only one. The one that has been studied is that it’s only distributed representations that have non-trivial similarity structure to one another. Two vectors can be more or less similar. But if you have a local representation, where each concept is associated with a single unit, then all concepts have zero similarity to all other concepts. Similarity is such an important factor in cognitive generalizations and so on. That’s an important feature of distributed representations even when they are not compressed relative to the set of symbols. >>: Do you require the linear independence only for the symbols, or also for each role, like a tree position or a recursive tree position? >> Paul Smolensky: It’s required, so if these two are linearly independent of each other then all of these guys will be as well. It follows; it doesn’t have to be stipulated separately. But it is a property of the atomic elements that are being combined that they be linearly independent; not necessarily of the composites that you build by binding them together. For the outer product, though, you’ll preserve the independence. Okay, so that’s one part of the mapping. Another part is the use of addition to encode conjunction. My suggestion is that maybe I should stop here and on some other occasion pick up from here. I will use this as a means to just remind people about the outer product part of the binding: how, when combined with addition for joining together bindings, we can encode whole structures and not just individual constituents.
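As a closing sketch tying these pieces together, here is a tiny binary tree encoded with recursive role vectors, conjunction of bindings by addition, and unbinding by the inner product; again NumPy, with the symbols, tree shape, and dimensions as illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)

# The two basic role vectors for binary trees: left child r0, right child r1.
r0, r1 = np.linalg.qr(rng.normal(size=(2, 2)))[0]
# Illustrative orthonormal filler vectors for two symbols X and Y.
X, Y = np.linalg.qr(rng.normal(size=(2, 2)))[0]

# Recursive positions are built from the two basic vectors by the outer
# product, and the conjunction of the bindings is encoded by addition.
tree = (np.einsum('f,a,b->fab', Y, r0, r0)     # Y: left child of left child
        + np.einsum('f,a,b->fab', X, r0, r1))  # X: left child of right child

# Unbind "left child of the right child of the root" by contracting with
# the role's dual vectors (the roles themselves, being orthonormal).
leaf = np.einsum('fab,a,b->f', tree, r0, r1)
print(np.allclose(leaf, X))                    # True
```

Okay, thank you very much for your questions and attention. [applause]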