>> Paul Smolensky: So a very quick review. So we are still dealing with these four points that are in the review box here. We are trying to do understandable neural nets by describing what's going on in them at an abstract level where we can use symbolic descriptions. We have a way of representing and manipulating combinatorial structures in neural networks: tensor product representations. Last time we talked about the fact that the computation over these structures, when it comes to grammatical computation, is a kind of optimization. So we talked about Harmonic Grammar and Optimality Theory as two new theories of grammar that result from that perspective. Then today the last point is that with this gradient symbolic computation system that uses optimization to compute over these combinatorial activation patterns we get both discrete and gradient optimal outputs, depending on how we handle the part of the process that is forcing the output to be discrete. And we talked about the discrete part yesterday. I will review that very briefly and then we will talk about the gradient part after that. Then I propose to backtrack and deal with a number of topics from lectures 1 and 2, which got skipped. So do you want to ask your question now? >>: So my question was, yesterday in the talk you gave, the points on the grid were "rains" for representing the concept rains. You had "it rains", "it rains" at 4 points, but there are many other things that presumably are getting considered, like "rains a tree", or "rains a drop", or "rains a bucket". How is it that we are only looking at these 4 points? >> Paul Smolensky: Okay, well so we talked about faithfulness constraints in the context of phonology, but they are also important in the context of syntax as well. So there are different approaches to doing syntax within Optimality Theory. Actually I should have emphasized that Optimality Theory and Harmonic Grammar are grammar formalisms; they are not theories of any particular sort. So you can use Optimality Theory to do lexical functional syntax or you can use it to do government binding syntax or minimalist syntax. So it is not a theory of any linguistic component, it's just a formalism for stating theories. So one thing that syntax theories differ on in OT is how they handle the notion of underlying form or input. So in the more minimalist oriented ones the input is considered to be something like the numeration in minimalism, where you have a set of words that are available to you. And in that context the generator of expressions that produces these things, which is one of the components of an Optimality Theory grammar, would only produce things using the words in that numeration. And if "Bill" is not in it then "Bill" won't show up in any of the candidates. The approach that I have pursued with Geraldine Legendre, which is not so far from the one that Jane Grimshaw pursued also, is OT syntax of a more government binding sort. You start with a logical form as your underlying form, what you want to express, and then these are surface expressions as well. They are surface forms in the sense that they have tree structure associated with them. And in our point of view there are also important faithfulness constraints at work in syntax that are requiring some kind of match between the output expression and the input logical form.
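As a rough sketch of the kind of weighted-constraint evaluation being discussed here, a toy Harmonic Grammar scoring could look like this in Python; the candidate set, constraint names, weights, and violation counts are all illustrative assumptions, not the actual grammar under discussion.

```python
# Toy Harmonic Grammar evaluation.  Harmony = -(weighted sum of constraint
# violations); the grammar's output is the candidate with maximal harmony.
# All names, weights, and violation counts below are illustrative only.

candidates = {
    # candidate: {constraint: number of violations}
    "rains":      {"SUBJECT": 1, "FAITH": 0},  # no subject expressed
    "it rains":   {"SUBJECT": 0, "FAITH": 1},  # expletive "it" adds minimal unfaithful content
    "Bill rains": {"SUBJECT": 0, "FAITH": 2},  # "Bill" adds content absent from the logical form
}

weights = {"SUBJECT": 3.0, "FAITH": 1.0}       # assumed constraint weights

def harmony(violations):
    """Weighted (negative) sum of constraint violations."""
    return -sum(weights[c] * v for c, v in violations.items())

for cand, viol in candidates.items():
    print(f"{cand:12s} harmony = {harmony(viol):5.1f}")
print("optimal:", max(candidates, key=lambda c: harmony(candidates[c])))  # "it rains"
```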
So if "Bill" is not in the logical form of what you are trying to express and "Bill" is in the output, then it will violate a faithfulness constraint, unless that violation allows it to avoid some other violation, and just sticking "Bill" in won't do that, unless you stick it in as the subject, in which case it would help you satisfy the subject constraint. So "Bill rains" would at least be motivated by providing a subject. >>: But that would be wrong. According to OT you would be having a higher score. >> Paul Smolensky: You would have a higher score than just "rains" in English, potentially, depending on how you handle the faithfulness constraint that is violated whenever you have material in the output that's not present in the input. So the idea of "it" is that, of the words that can go in subject position, it least violates the constraint that says there shouldn't be content in the expression that's not part of the logical form. So a full NP like "Bill" carries such content and is more of a faithfulness violation than "it" is. So the idea is that "do" is in some sense the minimal violator of this when it comes to inserting a verb and "it" is kind of a minimal violator of it when it comes to inserting an NP, something like that. >>: So maybe at some point, not today, we could look at input that is not a logical form, but a continuous representation of that logical form, such as we have when we do machine translation from a source language to a target language. >> Paul Smolensky: So the embedding vector is –. >>: [inaudible]. >> Paul Smolensky: The input to the generation half of the translation plus this? Yeah. >>: So just so I understand, if I say, "Bill rains," it does not violate the subject constraint so you get a higher score than "it rains"? >> Paul Smolensky: No, no, "it rains" doesn't violate subject either. "It rains" has a subject. >>: "Bill rains" also has a subject? >> Paul Smolensky: Yes, they both have subjects so they both satisfy the subject constraint equally. >>: Correct. Okay, "Bill rains" should not be [indiscernible]. >>: It has semantic content. >>: Oh, so this is only for syntax? >> Paul Smolensky: So the faithfulness constraints that are involved in showing how "Bill" is a worse subject than "it" here are not in the picture, but they are in the grammar. There are faithfulness constraints in the grammar. The grammar has more than these two constraints. >>: Oh, I see. >> Paul Smolensky: I'm sorry to tell you that, but it's true. So those were omitted. If you wanted to include in the candidate set a more complete list and you had "Bill rains", then you definitely would want to make sure that you also have the faithfulness constraints, which would prefer "it rains". >>: A semantic match. >> Paul Smolensky: A semantic match between the expressions and the underlying logical form. Yes? >>: Also it takes us into territory of things like farmed rain fishing, fish rain, [inaudible], for example. We have these [inaudible]. It's not that they can't be a subject of rain. So I think that's one of the points. Bill is [indiscernible]. >> Paul Smolensky: So I have met quite a few Chrises here. I am inclined to say "rains Chris" in this audience and not "rains Bill". A logical form that is expressed by that expression is not the one that we are trying to express here. So it is true that "rains" is [indiscernible] in having as one of its readings a subjectless version, a logically subjectless one, and others which have bona fide subjects. So how does this look at the neural level?
So we saw how we use the space here of activation patterns in a network. So each point here is a pattern of activation, and how we take our symbolic combinatorial expressions here from the higher level picture of our network and we embed them in the grid of discrete states in the continuous R^N space of states of the neural network. And how the harmony function picks out the right form as having the highest harmony, but that we don't do search over the discrete points. We talked about that as a possibility yesterday, but the neural networks that we consider don't do that kind of search. They are moving continuously in their state space. So the challenge is that the optimal, highest-harmony point in the continuous space is not going to be on the grid except in unusual circumstances. So what we did was we talked about how we carved up the continuous space into attractor basins and we put in a force field that pushed the states to these grid points. And this is a gradient force field that derives from a potential function which is gotten by combining the harmony function that comes from the constraints provided by the grammar with a harmony function that penalizes non-discrete states. So that contributes its part of the gradient dynamics also, and it is an optimization dynamics which we implement through some kind of noisy or stochastic gradient ascent. So the dynamics has these two parts: the optimization dynamics is completely oblivious to whether states are discrete or not, and the quantization dynamics is completely oblivious to whether states are good or not as far as the grammar is concerned. So the hope is that by simply linearly combining them we create a system which will find the discrete point that is best according to the grammar, and that involves weighting the discretization more and more strongly in that linear combination as the process of computing the output proceeds, so that we end up pushing the point, which is never on a lattice point, until finally the quantization component of this is completely dominant, corresponding to a limit of infinite coefficient on the part that contributes the quantization force. So that was about how this kind of computation goes and produces discrete combinatorial outputs. Now I will talk briefly about the gradient ones. So the question that is driving the gradient symbolic computation work nowadays, which involves quite a number of people at this point, mostly at Johns Hopkins, but not exclusively, is this: we are trying to determine to what extent we can get use out of all of the non-discrete points in the state space of the network, for modeling in psycholinguistic processing the intermediate states along the way to computing the parse of the sentence that you have heard so far, for example, or, for generating an output from a logical form, the process of working your way to having a motor plan to produce it. But also we are looking into a somewhat more radical idea, which is that gradient representations might be suitable for doing linguistic theory, so competence theory not performance theory. So this is in addition to being used for performance theory. So this is the idea that we can take seriously representations in which symbols have continuous activation values, not just as a transient state along the way to a final proper discrete representation, but as entities in their own right that may be valued by grammatical considerations.
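A minimal sketch of the two-part dynamics just reviewed, assuming simple illustrative forms for both harmony terms (a quadratic grammar harmony and a quartic quantization harmony) and an assumed annealing schedule; none of these specific choices come from the talk.

```python
# Sketch of the combined dynamics: noisy gradient ascent on
#   H(x, t) = H_grammar(x) + q(t) * H_quant(x),
# where q(t) grows over time so quantization eventually dominates.
import numpy as np

rng = np.random.default_rng(0)

def grad_H_grammar(x, W, b):
    # H_grammar(x) = 0.5 x^T W x + b^T x  (assumed form)  =>  gradient = Wx + b
    return W @ x + b

def grad_H_quant(x):
    # H_quant(x) = -sum x_i^2 (1 - x_i)^2 rewards units for being near 0 or 1
    # (assumed form)  =>  gradient_i = -2 x_i (1 - x_i)(1 - 2 x_i)
    return -2 * x * (1 - x) * (1 - 2 * x)

def compute_output(W, b, steps=2000, lr=0.01, noise=0.05):
    x = rng.uniform(0, 1, size=len(b))          # start somewhere off the grid
    for t in range(steps):
        q = 10.0 * t / steps                    # quantization strength ramps up
        g = grad_H_grammar(x, W, b) + q * grad_H_quant(x)
        x += lr * g + noise * (1 - t / steps) * rng.normal(size=x.shape)
    return x                                    # ends near a corner of [0, 1]^n

W = np.array([[0.0, -1.0], [-1.0, 0.0]])        # assumed: the two units compete
b = np.array([1.0, 0.6])
print(compute_output(W, b))
```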
So this is an example of a gradient symbol structure. It is one way of depicting it anyway. You can view this in 2 ways; they are just different views of the same structure. You can say that this is a tree where at the left child position we have a blend of 2 symbols, mostly A but some B as well, or you can look at it and say that this is a structure in which the symbol A is mostly in the left position, but partly in the right position as well. So the latter is a perspective that is more natural for syntax, where things occupy multiple locations implicitly if not explicitly, and the former is more suitable, we find, for phonology, where things tend not to move around, but they do change the content of their phonetic material. So both of those have been used in this work and I will tell you about an example in which we actually just look at discrete outputs, but we consider inputs that are gradient. And this is the example I am going to talk about on Friday at Stanford, since you were interested in that Lee. It is the phenomenon of French liaison, and I want to acknowledge that the idea of applying these kinds of gradient methods to this phenomenon is due to Jenny Culbertson. So the way this phenomenon looks is that there are words like the masculine version of small, which is written this way, petit, with a t at the end. It is written with a t at the end, but the t at the end is not always pronounced. So when you combine it with the noun ami you pronounce it and you get petit ami. But when you combine it with a consonant-initial noun you don't pronounce the final t; you just say petit copain. So you see I have pronunciations here. So the t disappears, and that's not true of all final t's. So here is another form of that adjective, the feminine form, and the t at the end of the feminine form appears all the time. It is not like the one at the end of the masculine form. So the masculine ends in what's called a liaison consonant, but the feminine form ends in a full consonant. It is pronounced all the time, even before a consonant. So for the female version of what's literally "little friend" you have petite copine, which has got that final t of petite pronounced. So this is a very well studied phenomenon in French, studied to death you might say. So we are rolling over in its grave an already well trampled phenomenon. However, we have new things to say about it and those –. >>: [inaudible]. >> Paul Smolensky: So explaining what I have shown here is easy to do. The reason that this remains something that people work on, and is not considered just a solved problem, is that although this is the core of the phenomenon and this is what they teach you in French school, not just foreign language instruction of French, even though that's the case there are lots of phenomena that don't actually fit the picture here. So because of those other phenomena some linguists have proposed that we shouldn't actually think of this t as part of the first word at all. Some perspectives have it as an independent entity of its own or an independent part of a schema in which we stick the adjective and the noun together and the t is already sitting there. And the view I am going to take into consideration today is the one where the consonant is viewed as being part of the second word. So what you see here is that the syllabification is pe ti ta mi. So the story is that the t surfaces when it can be an onset of a syllable because it is followed by a vowel.
And as you know from [indiscernible] of syllable theory yesterday, syllables should start with a consonant, that's a good thing, whereas they shouldn't end with a consonant. So in the case of a consonant-initial noun, if we did pronounce the t we would have a coda in petit copain; codas are dispreferred, and we do not actually pronounce the t when it would be in coda position. That is a characterization of the liaison consonants that makes them different from regular normal consonants, which always appear, such as this one here, even when it's in coda position. >>: But the last one actually has got a coda also, right? >> Paul Smolensky: Right. >>: Why don't you penalize it? >> Paul Smolensky: It is penalized, but it is still the optimal output. So there is something different about the t at the end of petit versus the one at the end of petite; there is something different about those two t's. >>: So there are other constraints there at work to insert –. >> Paul Smolensky: Basically the other constraints are some kind of faithfulness constraints. So you can think about the basic ranking in French as being that faithfulness to the fact that there is a t in the underlying form is stronger than the no-coda constraint, except that there are some consonants, the liaison ones, where it seems as though their presence in the underlying form is kind of weak. So they don't provide as strong a faithfulness force, and the force they provide is not sufficiently strong to overcome the no-coda constraint; that's why these weak consonants don't appear in coda position. So you have to somehow store in the mental lexicon of French a difference between the t at the end of the masculine form and the t at the end of the feminine form. Somehow you have to mark a difference, and there are adjectives that have a final t that's always pronounced, even when a consonant follows, but aren't the feminine form of anything necessarily. So it's not all about the gender difference. The main competing analysis, though, to the one that says the t is stored at the end along with the rest of petit, but somehow as a weak t, is a view that says that actually this syllabification here reflects the underlying form. So the underlying form is actually petit, nothing at the end, and the other, following word is tami. So the idea is that the lexicon contains multiple forms of this word. So ami is one form but tami is another form. And when preceded by petit you have to select the version of this word that starts with the t; you have to select tami. So that's the view that says these consonants are part of the second word, not the first word. And what we have in our proposal here is a blend of these two analyses. So in the proposal that we are providing with gradient symbolic representations we have a weak t at the end of petit. So .5 is the activity of t. So remember these consonants can have a strength different from 1, which is the standard strength. So it's a weak t here, but there is also an even weaker t at the beginning of ami. And t isn't the only liaison consonant. There are generally considered to be 3 productive liaison consonants in French. So just like t we have to consider n and z as weakly present at the beginning of ami, because they are needed for words, not petit, but other words. So the idea then is that when this input is presented to the harmonic grammar, in this case the optimal output will be the form in which the t is pronounced at the beginning of a syllable.
Whereas when this petit with a weak t is followed by a consonant-initial word and we do the optimization, we will find that the optimal form does not have that t. So .5 is not a strong enough faithfulness force to overcome the no-coda constraint, whereas the effect of combining the activity at the end of petit with the activity for t at the beginning of ami gives you .8, and with the strengths of the constraints, the faithfulness constraints and the markedness constraints, as given in the grammar we proposed, once you are up to an activity level of .8 you are now over a threshold such that it is optimal to pronounce the t. Now it has got enough activity to overcome the no-coda constraint. So that's the picture for liaison consonants; for regular consonants, like the t at the end of the feminine form of petit, that's a full-fledged normal t, full strength. So it doesn't require any additional activity from the second word in order to be above the threshold at which faithfulness to it exceeds the cost of having a coda. So you get petite copine there. So that's a summary of a pretty long talk actually. Yes? >>: [inaudible]. >> Paul Smolensky: Uh yes, that a –. >>: So that would be like a child trying to learn this [inaudible]. >> Paul Smolensky: Right, right, that's a good point. The child is fortunate in that French provides an environment where you can actually see the underlying contrast without any complications from any following word. The acquisition story that I have doesn't rely on that fact, but it is actually a good point. So anyway, the idea is that when children hear petit ami they use a constraint which is very widely observed in the world's languages that says that [indiscernible] begin at the beginning of a syllable. So when you parse a stream you segment it into words. Doing that at syllable boundaries is a good thing to do. So when you do that to this form here you end up with petit as the first part and tami as the second part. And children use tami as if it were an independent word. They say "a tami" instead of "an ami". So children use the forms that are proposed by this second theory here quite productively, but they stop after a while. >>: So how about if you make audio recordings of these and analyze the spectrum? How many consonants do you actually see there? Is it possible that both of these are there in petit tami, because the alternative would be an oral thing, movement of the tongue in your mouth. You are trying to say petit, but you are supposed to keep it silent and you don't quite say it, but when you are supposed to go from e to ah there is a transition. You want to insert a consonant so then you shift that t, but then the t almost kind of reverberates. Then there is a little bit at the end that starts with the next one. So you have both rather than just the optimal version. >> Paul Smolensky: Well, after leaving Microsoft I am going to spend a couple of months at a lab in France where they do this kind of study, and I will be very curious to see what differences can be observed between –. So the feminine form of this is petite amie, so superficially it's the same as the masculine. But we know that the t's involved are actually different. And whether there is a subtle acoustic difference that reflects that, which is the question you ask, is another interesting one. >>: Well it's a simple question, because those are different t's, because they are being pronounced differently and then we just connect the words.
You can get all kinds of things that are a different pronunciation of these, or maybe even a reverberation where you get multiple t's, but one is not quite heard. But you can detect it in your recording. >> Paul Smolensky: Yeah, I will be shocked if that turns out to be true, but it needs to be looked at. >>: [indiscernible]. >> Paul Smolensky: Right. >>: [indiscernible]. >> Paul Smolensky: Well it is –. >>: [indiscernible]. >> Paul Smolensky: Well, you are not alone in that. So the divide between phonetics and phonology has traditionally been that phonology is all discrete and anything continuous has to be in phonetics. And this is proposing that there are continuous dimensions of phonology as well. But we still believe that there is a phonological grammar that has the constraints that we talked about, which has an existence independent of all of the phonetic machinery. So we believe that grammar can show signs of representations that are not fully discrete, even though all of the outputs of the phonology, no matter what, go through this process of becoming continuous in the phonetics. Yeah? >>: So what might be a syntactic [indiscernible]? >> Paul Smolensky: Of having a split? >>: An example. >> Paul Smolensky: Well, we have worked a little bit with the example of wh questions, where our analysis says that the wh phrase, in a language like English which fronts the wh (not in languages that leave it in [indiscernible], but in languages that front it like English), is in a blend of 2 positions. So there is a weak version of it in what would be the position that it initiates movement from in a movement theory and a strong version of it in the place where it is pronounced. And you can imagine all sorts of dependencies being treated in a way like that in addition to wh movement. >>: So I was also thinking, if you have noun compounds then often for the pre-modifying noun you really can't tell whether it's an adjective or a noun. >> Paul Smolensky: Yes. >>: And there's really no need to tell whether it's an adjective or a noun, even though the Penn Tree Bank will declare it either an adjective or a noun. But there is often no reason to make that decision. So I thought that might be a really [inaudible]. >> Paul Smolensky: That's a nice, a different kind of example, which is more like the phonology-style mixture here, where in one position, namely where this word is, you have a blend of 2 category representations, noun and adjective. And there is no need to force a decision between the two in certain cases, as you say. And states of processing sentences, before the whole sentence is completely processed, we model with all sorts of syntactic blends, which reflect in part not yet having carried out the computation far enough to be at the final, potentially discrete state, but also the fact that there is uncertainty when you are processing the sentence and it hasn't completed yet. So you only have partial information about the beginning of the sentence, which leads to other kinds of gradient representations from a source of uncertainty about what's missing. Yeah? >>: That input analysis seems to make a prediction that there might be a difference between, say, high frequency processing; for example, if you get petit plus something with a very unfamiliar word that began with a vowel, you would expect possibly a delay in reaction times associated with the application of that rule [indiscernible].
>> Paul Smolensky: I hope to learn more about the performance aspects of liaison in France, but there is a documented set of putative facts about the probability of liaison appearing as a function of the frequency of the collocations: that you get more liaison from more frequent collocations. And our analysis has an account of that, but you are absolutely right that this is a good start for that kind of phenomenon. >>: So would it be fair to say that introducing this gradient, probabilistic way of dealing with it makes the constraints a bit less, easier, like you don't need to have so many constraints to explain a specific language [inaudible], otherwise if you don't have that gradient [indiscernible]. >> Paul Smolensky: I know of one version of that for which that's correct. I don't know how general it is, but in Optimality Theory there are some phenomena where you have different levels of violation of what's conceptually a single constraint, but you have to fake that by having multiple versions of the constraint in the hierarchy that get stronger. So you don't need to do that when you have numerical elements. So there are at least some places where I am pretty sure that what you said is true. But I do want to say that we are adamant that these are not probabilities. So the representations are like a pattern in a neural network, and the network is a stochastic network, so every state has a certain probability associated with it. So a state like this has a probability as well as having the gradient internal structure. >>: I see, I see. >> Paul Smolensky: So the probability is laid on top of representations like this, for one thing, but you will notice that these do not add up to 1, nor do these. And with the wh question we are not saying that there is some probability that you will treat the wh phrase as if it were at the beginning and some small probability that you will treat it at the end. It is in both places at once. It is a conjunctive combination. It is not a disjunctive combination the way probabilistic blends are. >>: Okay, so what would be the neural correlates of these [inaudible]? >> Paul Smolensky: It is absolutely straightforward. So you just have a certain set of roles and fillers for the discrete form petit, and in constructing the [indiscernible] representation of a fully discrete form like petit you have "p" in one role, "uh" in one role and you have a second t in a certain role. >>: Oh I see, I see. >> Paul Smolensky: And the filler t times the final role has a coefficient of .5 on it. >>: [inaudible]. >> Paul Smolensky: Yes, that's right. >>: So all the theories will carry through. >> Paul Smolensky: Right, that's right. So that's why I say that what we are doing is looking for the use of all those states that are not on the grid, and in general they can be interpreted as states like that. >>: You said the probability is basically e to the minus energy or e to the harmony. >> Paul Smolensky: Right. >>: So in that case these numbers would actually end up being like log-odds of on and off states. >> Paul Smolensky: You mean in terms of the probability of appearing versus not appearing? >>: Yeah. >> Paul Smolensky: Uh-huh. >>: So they don't have to be, like probabilities, normalized between 0 and 1. They are just more like log-odds. They can be any positive or negative number. >> Paul Smolensky: It's true they are more like log-odds, but what we do is –. So this is a description of the input.
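A toy sketch of the gradient input just described and of the harmony comparison that decides whether the liaison t surfaces. The filler and role vectors, the 0.3 activity assigned to the latent initial t of ami (the talk gives only the 0.5 and the combined 0.8), and the constraint weights are all illustrative assumptions.

```python
# Toy gradient-symbolic input for "petit (+ ami)" and the harmony comparison
# that decides whether the liaison t is pronounced.
import numpy as np

rng = np.random.default_rng(1)
fillers = {s: rng.normal(size=4) for s in ["p", "e", "t", "i"]}
role_names = ["r1", "r2", "r3", "r4", "r_final", "r_onset_next"]
roles = dict(zip(role_names, np.eye(len(role_names))))   # local role vectors here

def tpr(bindings):
    """Sum over bindings of activity * (filler outer-product role)."""
    return sum(a * np.outer(fillers[f], roles[r]) for f, r, a in bindings)

# "petit" with a weak (0.5) final t, plus a weaker (0.3, assumed) latent t
# contributed by the vowel-initial word "ami":
input_rep = tpr([("p", "r1", 1.0), ("e", "r2", 1.0), ("t", "r3", 1.0),
                 ("i", "r4", 1.0), ("t", "r_final", 0.5),
                 ("t", "r_onset_next", 0.3)])
print(input_rep.shape)          # a gradient point in the same tensor space

# Decision sketch: pronounce the t iff the faithfulness reward
# (weight * total underlying t activity) beats the markedness cost of
# pronouncing it in that context (a single assumed threshold stands in for
# NOCODA and the rest).
w_faith, w_markedness = 2.0, 1.5                 # assumed weights

def t_surfaces(total_t_activity):
    return w_faith * total_t_activity > w_markedness

print(t_surfaces(0.5))        # before a consonant: 0.5 -> False, t not pronounced
print(t_surfaces(0.5 + 0.3))  # before a vowel: 0.8 -> True, t pronounced
```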
The output is one in which, as it says here, we are looking at discrete outputs only in this project. So a discrete output has a t in this intermediate position which derives from both of these two. So it is analyzed as a coalescence of two underlying segments into one. So this number and this number add together, but there isn't really any sense in which I can see that addition, which is the crucial point for deciding whether that t surfaces or not, as a combination of probabilities, but maybe I am just not creative enough. >>: Well, in the probability domain it is more like a product and here it is a sum. >> Paul Smolensky: It's not like the probability with which you would actually get the t surfacing in this word like ami on its own, once it gets multiplied by the probability you would have t alone. >>: But it wouldn't be that number. It would be a number that is [indiscernible] plus e to that type of number divided by the temperature or whatever it is. >> Paul Smolensky: Yeah, I will look into that. It is certainly true that these coefficients have the status of energy. >>: Well ideally what I think about is that when you build a structure there, put an indent, you [inaudible] and that process is more or less the same. >> Paul Smolensky: Whether it is [inaudible]. >>: With or without the gradient [inaudible]. >> Paul Smolensky: Absolutely. >>: Then once you do that [indiscernible] is to say that becomes the space in the projected [indiscernible]. Then you run dynamics and that gives rise to "h". >> Paul Smolensky: The dynamics dictates the "h". >>: Now I have the question of how does that term "h" relate to the weights [indiscernible]? [inaudible]. >> Paul Smolensky: So the harmony contributed by this consonant in the input –. So for an output in which there is a t that is pronounced, there is a certain faithfulness reward from having put in the output a consonant which is also in the input. But the magnitude of that reward is equal to the weight associated with the faithfulness constraint itself times the activation value here. So you multiply the weights in the harmony function times the activation values of the elements that are in correspondence in the input and the output. >>: Okay I see. >>: [indiscernible]. But if you integrate all the variables out you are still going to get a distribution that has that [indiscernible]. So that is what I meant, maybe that is similar here. Then in that case, on one side it is increasing the probability of seeing the t, and that t on the other side is increasing the probability of seeing the t. [indiscernible]. >> Paul Smolensky: But that is crucial for resolving the issue of how different or not those numbers are from probabilities. Yes? >>: I just wanted to kind of make a comment that, for me, the output of the syntax, I wouldn't assume it is discrete. >> Paul Smolensky: Right, we don't actually assume that about phonology either, really. >>: Okay, but just from a non-traditional syntax point of view I would find it natural that they would be gradient, because there is no such thing as a discrete syntax. >> Paul Smolensky: And I guess even conventional wisdom might say something similar to that about comprehension and whether the comprehender goes all the way to producing the disambiguated discrete analysis, given that typically it is not necessary or maybe not even possible. But in production you have to have a well enough –.
The syntactic expression that you are using to drive production is going to have to be discrete enough to do the work, but I don't know how discrete that will turn out to be. Okay, so what I propose to do is go back to the list of points we had in the first lecture and that carried into the second. We covered the light grey ones here, but didn't cover the darker ones here. So I have slides that are versions of what I would have used then, and these are the topics, and you can decide which ones you want me to talk about in the remaining time. So the first one here is a very general symmetry argument for why distributed representations are really a key defining property of neural computation and need to be possible in all legitimate neural network models. So that's an argument that's somewhat extended. Then distributed representations themselves are important because they give rise to generalizations based on similarity. So I have 2 examples of similarity based generalization: one at a symbolic level and one at a sub-symbolic level. This is just a relatively short section of how to use tensor product representations to do basic lambda calculus and tree-adjoining grammar. And this is actually probably 12 words together because we have already talked about half of the stuff on that slide. >>: [indiscernible]. >>: We have to give the admins a little bit more time this time, because it was really hard for her to get it out. >>: [indiscernible]. >> Paul Smolensky: Which her are you referring to, Tracy? >>: Stacy. >> Paul Smolensky: Stacy. >>: She is doing a great job, but [indiscernible]. >> Paul Smolensky: I will send her a thank you note. >>: I already did, but that would be nice Paul. >> Paul Smolensky: I will add to it. So I am game for that. I will be out of town for 3 days so it can't be as short notice as today or yesterday. >>: So you go on Friday, Monday or Tuesday? >> Paul Smolensky: Thursday, Friday and Monday. >>: But I will be away for most of the week next week, so let's do it the following. >> Paul Smolensky: So back to basically the first slide, although most of it is grayed out here. So what I was trying to argue was that there was a challenge, kind of a very fundamental challenge, of bringing together all of the reasons that we have from traditional cognitive theory, and linguistic theory, and AI, all the reasons we have to believe that symbolic computation is a very powerful basis for cognitive functioning, and on the other hand, especially recently in the AI world, but for longer in the cognitive world, reasons to believe that neural computation provides that important power as well. And they don't seem very easy to reconcile with one another. Most people think that they can't or shouldn't be. So what is being proposed here in GSC is a kind of integration; gradient symbolic computation most explicitly so in what we just looked at, where we talk about representations with gradient symbols in them. But that is one part of the whole package that pulls these two together. So this was intended to be very early in the talk and is a characterization of what neural computation means, and that lays the groundwork for what constitutes the challenge of trying to unify it with symbolic computation. And the most concrete function of this section of the talk is to try to explain why tensors appear and why we should be using tensors to encode information in this integrated picture. >>: So I am a little confused. In the first lecture you talked about exactly how you [indiscernible].
But in this lecture you are talking about gradient symbolic computation and that seems to be separate from tensors. >> Paul Smolensky: Well, if you look at it from the symbolic point of view then you have these representations that don't have any neural networks in them. That's the characteristic of this higher, macro level view of things here. But my claim is that these things are the bridge between those two, the tensors that we write down, which might have .5 times t tensor final position. Those things are squarely in the middle, because when you cash out all the units contained in those tensors you are looking at this, and if instead you look at the tensors themselves as wholes then you find yourself looking at something like that. >>: Oh I see, okay, okay, yeah. So the gradient symbolic computation you talked about earlier today, I wouldn't think that to be as important as having tensors as an intermediate stage. >> Paul Smolensky: Well, I mean I take your point. So, I will reflect on the fact that I have used about 5 different names over the years for this package of ideas. So in the book I called it the integrated connectionist/symbolic architecture. >>: So the gradient symbolic you talked about today to me is more trivial compared to the tensor representation that unifies the two. Am I wrong? >> Paul Smolensky: My way of making sense out of those gradient symbolic representations is to say that they function in a harmonic grammar the way that the tensor product representations of them function in the harmony function at the neural level. So in order to make sense out of them from a harmony point of view it seems important to me to know that they stand for these patterns of activation where we can evaluate what the harmony is. You might be able to finesse it if you wanted to though. >>: Then when you have this sort of representation do you no longer need the quantization dynamics? Is everything based now on the projections for which state you have? >> Paul Smolensky: So you don't need quantization for interpretation of anything. So the tensor representations are about interpretation, so you don't need quantization for that. If you use this isomorphism to build, for example, a linear network that does this job here, then the situation is, if you put in a point on the grid of discrete states, if that's your input, if it happens to be one of those special points, then so will the output be. It's feed-forward, it's guaranteed to give you the discrete output that would come from doing it this way. But if you want to do harmony maximization and have a grammar with constraints that you are trying to satisfy, then in general you can't compile that into something like a specification of a feed-forward network. So we do the actual constraint satisfaction with recurrent activation spreading. And in that situation, that all takes place in the continuous context, and you have to put quantization in if you want, at the end of the day, to get something on that grid of discrete states. >>: But in the brain there is no such thing as a discrete state, presumably. How do you optimize, if it is a neural network in your brain, how do you optimize this? How do you isolate discrete states? >> Paul Smolensky: So the quantization dynamics has a combinatorially structured set of attractors. So we are operating under the assumption that there is something like that in our brains. >>: Oh, you would say they are just pockets where things are being drawn, the basins.
And then during learning things fall into these pockets and it becomes a discrete state? >> Paul Smolensky: Right, yeah, though not necessarily that learning is restricted to the situation where things do fall into a discrete state, because you might need learning to get to the point where things are falling into discrete states in the first place. >>: So maybe this is where my confusion is coming from. So I thought that to go from [indiscernible]. >> Paul Smolensky: If we can write down a symbolic expression for the function in a certain class –. >>: So what is the role of harmonic grammar as well as dynamics, how do they fit into this? So I think I missed this high-level [indiscernible]. Do you know this [indiscernible]? >> Paul Smolensky: Well, until we started talking about it, you and I, your idea last week or so, it didn't occur to me that you could take a kind of gigantic list approach to thinking about the function that maps from, let's say, a syntactic tree to an AMR representation. >>: Yeah. >> Paul Smolensky: And that you can just identify thousands of cases and for each one write down a symbolic function that does the right thing. Under that approach there isn't a need for a grammar. >>: Right, that's what I mean. >> Paul Smolensky: As long as you are satisfied with the generalizations that occur when you subject unfamiliar inputs to the list. >>: I see, for the new instances. >>: Yeah. >>: [indiscernible]. >>: But then you can never handle new input. >>: [indiscernible]. >>: But I can give you a sentence that's not nearest to anything and it should still get an AMR representation. >>: [indiscernible]. >>: Under no circumstances, and I can give you complete possible states that you have never even dreamed of and your mind will attach an interpretation to them. >>: I see, I see, okay. >> Paul Smolensky: So we believe that the generalizations that are involved in language processing are captured by these grammars and they need to be implemented by continuous constraint satisfaction. >>: [inaudible]. >> Paul Smolensky: All right. So very rudimentarily speaking, and we will do somewhat better I think maybe, tensors are n-dimensional arrays of real numbers. So why are they something that we should be so concerned with in this picture? That's the question here: why tensors? And to answer that, as I was starting to say a moment ago, let me elaborate on how we define neural computation so we see what exactly the specifications are for unification. So the most obvious contrast between symbolic and neural computation is in terms of the syntax of the formalism. I don't mean natural language syntax; I mean the syntax of computational formalisms here. So, at the macro level here, our data are symbols that are connected by relations, and the operations are to compare whether 2 tokens are symbols of the same type, to concatenate or embed constituents in one another. Those are the kinds of operations and data that are associated with the symbolic computational architecture. As opposed to having as the data numerical vectors, and operations which are element-wise addition, multiplication by constants, combining together to give you matrix multiplication and stuff like that. So this is, in terms of what the computational operations work on and do, a fairly clear contrast. But what's more important to understand about tensors is what I would call the semantic contrast, about the semantics of these two computational architectures.
So in the case of this symbolic world, generically speaking, the symbols that appear in the data structures correspond to concepts. So a concept is locally encoded in individual symbols, which allows us to write programs that deal with the symbols in the way that we think they should be dealt with, because we know what concepts they correspond to and we have conceptual understanding of what the tasks are and what they require. On the other hand, generically speaking, in what I take to be true neural computation the individual units are meaningless. You can't assign any conceptual interpretation to activity in a single unit. Concepts are distributed over many neurons and a given set of neurons has many concepts distributed over them, which makes for difficult programming, difficult interpretation. So that's what you might call sub-symbolic representation as opposed to symbolic, although I originally called it sub-conceptual and that is a better name. >>: [indiscernible]. >> Paul Smolensky: Yes. >>: [indiscernible]. >> Paul Smolensky: Well, remember the idea is that both of these are true of the same system. You can look at it this way or you can look at it that way. That is to say, well, I have explained what that means. Okay, so why tensors? Tensors are very natural for dealing with distributed representations. Once you accept that distributed representations are really the essence of what defines neural computation and what distinguishes it from symbolic computation, with respect to the semantics of these formalisms, then you can see that something that is natural for distributed representations is important for us to know about and use. Okay, so why do I think distributed representations are a necessary component of characterizing micro-computation? The sense of "necessary" I mean here is a little bit convoluted. So I am not arguing that every neural net model needs to use distributed representations. What I am arguing is that an adequate neural net architecture must allow distributed representations. An architecture's design cannot require that the representations be local in order for it to be a true neural computational system. This is relevant, as we will see, to typical neural network systems doing NLP these days. They violate this principle because they do require local representations in a certain sense. Okay, and for the time being I am going to assume that when we are talking about distributed representations what we are talking about is what I defined earlier as proper distributed representations, in which the patterns that encode the concepts are linearly independent of each other so that they don't get confused when combined. And with respect to that there is this big difference. I mentioned it once before, but it is important enough that I will mention it again: in cognitive science the situation is that the number of units available to us in our networks, namely the number of neurons in the brain, is considered to be much bigger than the number of concepts that are encoded in them. So people have always assumed, and I don't see reason to doubt it yet, that the vocabulary at the conceptual level that we use when we consciously think about domains is much smaller and more restricted compared to all of the neurons that are involved in processing those domains. So in that situation you can just assume that you have proper distributed representations, because if the number of neurons is bigger than the number of concepts there is no reason the concept vectors can't be linearly independent.
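A quick numerical illustration of that last point, with randomly chosen concept patterns standing in for whatever the brain actually uses: linear independence (a proper distributed representation) is achievable whenever the number of neurons is at least the number of concepts, and impossible otherwise.

```python
# Proper distributed representations require linearly independent concept
# patterns: possible whenever #neurons >= #concepts, impossible otherwise.
import numpy as np

rng = np.random.default_rng(0)

def can_be_proper(n_neurons, n_concepts):
    patterns = rng.normal(size=(n_concepts, n_neurons))   # one pattern per concept
    return np.linalg.matrix_rank(patterns) == n_concepts  # linear independence

print(can_be_proper(n_neurons=1000, n_concepts=50))   # True: brain-like regime
print(can_be_proper(n_neurons=50, n_concepts=1000))   # False: rank is at most 50
```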
But coming here drove home that in engineering we can't necessarily afford as many neurons as we have concepts or words to deal with. So now we have the number of neurons much less than the number of concepts, in which case we don't have proper distributed representations anymore. So put the improper ones aside for a moment and just think in the context of proper distributed representations. Okay, so the basis of this argument here for the centrality of distributed representations is something that is here called the symmetry metaprinciple, which says that if some system is left unchanged or invariant under a group of transformations, which are called symmetries if they leave the system unchanged, then the fundamental laws of the system must be invariant under those symmetries. It's hard to believe that anybody could disagree with that. So there is a very simple example of this. If we consider all of the rectangles in the plane which are axis-aligned, then that's a set where you can rotate by 90 degrees or 180 degrees any one of these rectangles and get another rectangle in the same set, and the area of the rectangle won't change when you perform those rotations. So if somebody tried to sell you this formula for the area of a rectangle you could immediately be suspicious and say, "This can't be right because it violates the metaprinciple here." So this formula says that you take the width of the rectangle in the x dimension squared times the width in the y direction. But when you rotate by 90 degrees the widths in the x and y dimensions flip: what was x becomes y and vice versa, and this formula doesn't stay the same when you exchange x and y. So it can't possibly be right. Okay, so that's a very straightforward example of this principle, but obviously I want to apply it now to representations in vector spaces. So the representational medium for neural networks is the vector space of activation patterns over n dimensions if you have n units. And this vector space is invariant under a change of basis or a change of coordinates, which is the general linear group of invertible linear transformations of the space R^n. So what do we mean by a coordinate change or change of basis here? So when we write down that we are interested in the vector (4, 17, 5), what we mean is that this vector is the linear combination, with coefficients 4, 17, 5, of 3 basis vectors which point in the directions of the coordinate axes. So if we have Cartesian x, y, z coordinates then these vectors point along the coordinate directions and have unit length; e1, for example, points along the x direction. And this is the linear combination that defines what we mean when we write down this formula. Now if we change the coordinates, so that we go to some different set of axes for laying out in this case 3 dimensional space, then the same vector will have a different description with respect to the coordinate system, and in the new coordinate system we have e1 prime, e2 prime, e3 prime pointing along the x prime, y prime and z prime dimensions of the new coordinate system. And we will need different weights, v1 prime, v2 prime, v3 prime, in order to get the original vector back again. So we change the basis we use for describing it, like changing the coordinate system here. Then we need to change the numbers that describe any particular vector in that coordinate system. And it is just a matter of multiplying the original set of coefficients by a matrix to get the new ones.
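Spelled out, the change-of-basis relation being described is the following, writing M for the matrix whose columns are the new basis vectors expressed in the old coordinates:

$$
\mathbf{v} \;=\; \sum_i v_i\,\mathbf{e}_i \;=\; \sum_j v'_j\,\mathbf{e}'_j,
\qquad
\mathbf{e}'_j \;=\; \sum_i M_{ij}\,\mathbf{e}_i
\quad\Longrightarrow\quad
\mathbf{v} = M\,\mathbf{v}', \qquad \mathbf{v}' = M^{-1}\,\mathbf{v} .
$$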
So this matrix is a member of the general linear group, and in this case the columns of this matrix are the coordinates of the new basis with respect to the old one. So e1 prime is itself a vector which can be written in this way as a mixture of the original e1, e2, e3, and those mixing coefficients are a column of this M matrix, which is the change-of-basis matrix. Okay, now this can effect a change between distributed and local representations, that's the point. Swapping between distributed and local representations is a kind of change of basis, an invertible linear transformation, given that the distributed representation is a proper one. So here is a concrete example. It might not be the best type of example to imagine, but it has concreteness in its favor. So we have these phonetic segments, or phonemes or whatever you want to think of them as: v, s, n, and this is a distributed representation of these in terms of some kind of phonological features: lab, cor, vel, nasal, etc., and we put a 1 there whenever the corresponding feature or property is present in that sound. So we can now switch our coordinate system so that the axes actually point along the directions of the phonemes. So we have a v axis, an s axis and an n axis, instead of having a labial axis, a coronal axis and a velar axis. So in that case the new e1 prime is, let's say, v, and similarly for e2 and e3. So the change-of-basis matrix here just involves taking this vector and making it the first column, then taking this vector and making it the second column, and so on: a matrix which, when multiplied by the coefficients that you use in the original distributed representation, gives you a new description in terms of what is now a local representation, because each coordinate now has a conceptual interpretation as a v or an s. And the sounds themselves become fully localized now, whereas v was this combination of features in the original coordinate system. In the new coordinate system v is by definition the first axis. So it has got a coefficient of 1 in the first direction and 0 in all the other directions. So now we have a local representation because we have switched to axes that point in the conceptual directions instead of in what were the neural directions. So, this is where the example is sub-optimal, because we don't necessarily want to think of a labial neuron and a coronal neuron, but if you will just grant that for convenience: each of these numbers in the distributed description of some concept is the activity of a neuron. So, each of the original basis vectors here points in the direction of the activation of a single neuron. And we switch now to basis vectors that point in the direction of single concepts. It's a linear transformation; we apply a member of this general linear group of transformations to the first description and we get the second one. Now if we have a network which has the usual form in which –. Let's see, maybe I will go down here. The point of this is that the function computed in a neural network is invariant under the change of basis. So conceptually, if you have a distributed representation over a set of units that form one layer in some network, we can redescribe the states of the vector space not in terms of the activities of the units, but in terms of the weights of the concepts whose distributed patterns they are, by doing the change of basis that we just talked about.
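Here is a toy numerical version of that phoneme example; the 0/1 feature values are made up for the demonstration, and the similarity comparison at the end previews the point made later about distributed versus 1-hot vectors.

```python
# Toy phoneme example: rows are distributed feature patterns for v, s, n over
# three "feature neurons" (values made up; the point is only that they are
# linearly independent).
import numpy as np

D = np.array([[1.0, 1.0, 0.0],    # v
              [0.0, 1.0, 1.0],    # s
              [1.0, 0.0, 1.0]])   # n

M = D.T                           # columns of M = phoneme vectors in the old, feature coordinates
M_inv = np.linalg.inv(M)

# Re-described in the new phoneme-aligned coordinates, the pattern for "v"
# is one-hot: the representation has become local.
print(M_inv @ D[0])                       # ~ [1, 0, 0]

# But the two descriptions differ in similarity structure:
def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cos(D[0], D[1]))                    # distributed v and s overlap: 0.5
print(cos(np.eye(3)[0], np.eye(3)[1]))    # one-hot v and s: similarity 0.0
```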
And if you do that you can change the outgoing weights so that they now are appropriate for the conceptual coordinates instead of what they were before, which was appropriate for the neural coordinates. And once you have changed the weights to match what you did to change the description of the states, what that network layer feeds to the next one is exactly unchanged. So here is the activation in layer l; suppose it starts off by having distributed representations in it, and this is the weight matrix from layer l to layer l plus 1. We now change our description so that what was the vector a becomes the vector a prime in the new system, which is a local representation, which is equivalent to multiplying by this M inverse matrix. What we need to do is change the weight matrix W correspondingly, which involves right-multiplying it by M. So this is our new weight matrix, and if we take the new weight matrix times the new activation vector, W prime times a prime, what we get is just the same thing as we got before, W a. So any change you make by changing the basis for describing the states of a layer can be completely compensated for by changing the weights, so that what the next layer sees is exactly the same in both cases. So it can't possibly change how the network behaves, or what the network can or can't do, what it does; nothing can change if you switch from a distributed to a local description of what's going on in a layer of the network. So that's an invariance, it's a symmetry. So the architecture, the laws that govern this system, the architecture of the network, must accommodate both local and distributed representations. That's a symmetry that goes between them. It can't be restricted to either one or the other, but the relevant thing is it can't be restricted to local representations alone. And existing neural networks applied to NLP are often problematic in this way because they implicitly are using local representations of structure. So here is an example of a kind of network, which may be out of favor with its first author here as I have been told, but which nonetheless makes the point, in which in order to get the embedding vector for a phrase we take the embedding vectors for the words and we combine first these 2 together with some combination to get this, and then we combine this with the next word and get that. So this is a description of the process, but it's also implicitly a description of the encoding of the tree itself. But what you see is that there is an isomorphism between the tree itself and the network's architecture. So what's built into this is that the role of being the fourth word is localized to these units. The role of being the third word is localized to these units. So if we think of it as a tensor product representation, then in each of these we have a distributed pattern for a filler, but they are multiplied by vectors for roles which are completely local. >>: No, no, I just thought this particular model, this x1, x2, x3, x4, each of those are embeddings. >> Paul Smolensky: Yeah. >>: It's not a 1-hot representation for [indiscernible]. >> Paul Smolensky: Right, yeah, I am saying that the roles are 1-hot vectors, not the filler vectors. >>: Okay, okay, the role vector, yeah, yeah. >> Paul Smolensky: But as a consequence nobody thinks about role vectors. But implicitly what's going on in all these kinds of systems is that the architecture is dependent on the encoding of symbolic roles being local.
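A minimal numerical check of the invariance described a moment ago, with random (assumed) activations, weights, and change-of-basis matrix:

```python
# Numerical check of the invariance: change the basis for layer l's state
# (a -> a' = M^{-1} a) and compensate in the outgoing weights (W -> W' = W M);
# the next layer receives exactly the same input.  All values are random
# placeholders.
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=5)                   # activation of layer l (one description)
W = rng.normal(size=(3, 5))              # weights from layer l to layer l+1
M = rng.normal(size=(5, 5))              # a change of basis (almost surely invertible)

a_prime = np.linalg.inv(M) @ a           # new description of the same state
W_prime = W @ M                          # compensating change of weights

print(np.allclose(W @ a, W_prime @ a_prime))   # True
```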
And our architecture is built so that the roles can be local or they can be distributed in any way that you want, in other words –. >>: My point was that [indiscernible] doesn't change much from local to distributed. [indiscernible]. >> Paul Smolensky: I know, I know, you have said that 4 times every day since I came to Microsoft, maybe 5 sometimes, and every time you tell me what my answer is, and now you have learned what my answer is. The distributed representations have similarity structure. So they are different, not because they are smaller. >>: I see, okay, okay. So maybe my argument was mainly focused on the filler representation. That can be small, for example for x1 [indiscernible]. >> Paul Smolensky: Yes, so if the role vectors are linearly independent of each other they don't have to be localized to 1-hot vectors. They don't have to be localized to particular groups of units, but –. >>: [indiscernible]. >> Paul Smolensky: What I said about the neural situation applies to fillers as well as to roles. So it wasn't specifically about [inaudible]. >>: So in that case [indiscernible] the practice of reducing dimensionality by [indiscernible] is not something that you think is cognitively sensible? >> Paul Smolensky: Yeah. >>: But in engineering it's wonderful. You can solve so many problems, and then we do similarity measures, all that [indiscernible] that you talk about is there, and yet we still have dimensional [indiscernible]. Then what is the discrepancy in this case? >> Paul Smolensky: Why doesn't the brain do what your computers do? >>: [indiscernible]? >> Paul Smolensky: If it works so great? That's a good question and I think the answer is –. >>: [inaudible]. >> Paul Smolensky: Pardon? >>: For engineering [indiscernible]. >> Paul Smolensky: Right, well, so on the engineering side what you always hear is that if your networks are too big then you will overfit your training data or you won't be able to generalize well to new examples, same thing, and therefore power to generalize has to be forced by having a small network. So even if your machines are big enough to allow computationally for a large network, you won't necessarily get good generalization. And I think we, on the cognitive side of things, and I imagine that Jeff Hinton would agree with this, have always thought that can't be the right answer for why the brain generalizes well from a small amount of data. It can't be the right answer. >>: So you are saying that not even for the filler vectors it will work, for example the units for distribution probably should be [indiscernible]? >> Paul Smolensky: Yeah, or in the millions more likely. >>: I see, okay. >> Paul Smolensky: I mean in terms of neurons. >>: So in machine learning, can I comment about the symmetries? If you have a system that naturally has symmetries, but then you are insisting on a solution with a particular configuration that doesn't allow for all the symmetric solutions, then you usually have very bad local minima when you are training. One example would be when you are training, say, a simple mixture of Gaussians model and then somebody tells you, "Well I know what the variances are for these Gaussians.
So use this one for the first one, this one for the second one, this one for the third one and so on." Then you have violated the symmetry: you are insisting that the first one is narrow and the second one is broad, but the algorithm actually wants to do it the other way around early on, because it just locks onto different parts of the data. Then you usually get really bad results. If on the other hand you say, "No, I am going to learn the variances and the means together, and then later identify which ones should be the smaller ones and which the bigger ones and then reverse them," then you get a good result. And the reason is the violation of symmetries. If the model wants to have multiple answers and you try to force it early on in one direction like here, then you have a local minimum. You can maybe get a good result, but you do have more commitment. So is that something that you were going to talk about, or is that your argument for why this shouldn't be done? >> Paul Smolensky: Well, no, it's better than my argument. Yeah, so my argument is an argument from a physicist who was enamored with relativity theory and how you could derive a lot of constraints on what the laws of nature must be by observing the symmetries; a law just can't be right if it doesn't respect the symmetries. So that is what I am saying here: the neural architecture just can't be right if it doesn't respect them. But from an engineering perspective it is much nicer to be able to say, "Well, if you choose to break the symmetry then you will pay for it." >>: The engineering decides to minimize the pain. >> Paul Smolensky: Minimize the pain. >>: Yes, minimize how much penalty you get by violating symmetry, for example. >> Paul Smolensky: Yeah, very good to know. I am not sure I really realized that. Yeah, it is very interesting how adding knowledge and constraints to the system hurts you. I will have to ponder the significance of that. So the bottom line of all of this linear algebra stuff is that if you have some given layer of neural units, any computation that you can do by assuming that the representation is local could also be done with a proper distributed representation. All you need to do is view the local version that you have in front of you as just a description, in a convenient coordinate system, of a system that can be described in other ways, giving rise to distributed representations. You correct for that change in the matrix of connections and whatever you computed before you still compute. But, and this is the answer I have to repeat five times a day to Lee, they are not equivalent, in that distributed representations exhibit similarity effects automatically, which you have to wire in artificially if you want them in the case of local representations. And that's because two one-hot vectors have zero similarity, and every pair of one-hot vectors is just as dissimilar from one another as every other pair. But phonemes aren't like that, and most concepts aren't like that. So if the distributed representations of those concepts reflect the similarity structure underlying them in whatever task they are being used for, that should be an advantage. >>: So that's really interesting, but it seems like if you are just doing this in some sort of non-constructive way, just randomly taking this basis and so on, then you are at the whim of chance as to how much similarity you are going to get embedded. Is there a way to maximize the amount of similarity you are embedding, sort of the opposite of ICA in a way? 
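A small illustration of the similarity point above, with toy phoneme feature values chosen only for the example: every pair of one-hot vectors has cosine similarity zero, whereas distributed feature vectors carry graded similarity automatically.

```python
# Sketch (toy feature values, not from the talk): one-hot codes make all pairs
# equally dissimilar; feature-based distributed codes give graded similarity.
import numpy as np

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# One-hot encodings of three phonemes: every pair has similarity 0.
one_hot = {"b": np.array([1.0, 0.0, 0.0]),
           "p": np.array([0.0, 1.0, 0.0]),
           "a": np.array([0.0, 0.0, 1.0])}
print(cos(one_hot["b"], one_hot["p"]), cos(one_hot["b"], one_hot["a"]))  # 0.0 0.0

# Toy distributed encodings over features [labial, voiced, vowel]:
feat = {"b": np.array([1.0, 1.0, 0.0]),
        "p": np.array([1.0, 0.0, 0.0]),
        "a": np.array([0.0, 1.0, 1.0])}
print(cos(feat["b"], feat["p"]))  # higher: /b/ and /p/ share labial
print(cos(feat["b"], feat["a"]))  # lower: /b/ and /a/ share only voicing
```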
>> Paul Smolensky: Well I don't necessarily advocate maximizing similarity for its own sake, even if it is the distinguishing feature from local representations. What I would say is simply that the architecture should admit distributed representations. So a learning algorithm using the architecture will be able to decide what the similarity structure ought to be for the domain. >>: Then my follow-up question is what sort of distributed representation are you using rather than the one-hot vector? It should be something that doesn't confuse you, but still maintains a lot of similarity, and is done in an automatic way to get a good representation, which is distributed in a way that is going to help you with similarities. Clearly it is very close to local and it is going to have a little bit of similarity, but not much. So you want something that is further away from local. >> Paul Smolensky: Maybe, or maybe that's the right answer. I have no clue. I mean, you guys have worked with networks where you learn an embedding such that the similarity of pairs meets some sort of externally given criterion. >>: That's in the small dimension; we think about how we can do that. So I know this is the fifth time now. >> Paul Smolensky: You can –. >>: So for a small dimensionality, like coming from 100,000 to maybe 4,000 or 3,000, whatever, in practice we typically use 400 or 300; that already gives enough –. >> Paul Smolensky: That's good that you are using distributed representations, and this is an argument that the architecture should allow it, and your architecture does allow it and you are taking advantage of it. >>: Yeah, yeah, I know, but I think –. So I am really much inspired by your earlier [indiscernible] in phonology, that for 3 phonemes you get what, 8 phonetic features. That means that [indiscernible]. >> Paul Smolensky: Well, I intended the opposite. In reality it's the opposite. I just didn't list all the phonemes; I just listed 3 of them. >>: In practice indeed the number of phonemes is [indiscernible]. So that actually follows your philosophy that the neurons have more than that. >> Paul Smolensky: More than you need. >>: Then of course that has a lot of redundant representation, because I can [indiscernible]. Now for the word representations that everybody is doing now it is just completely the opposite. It is just so much lower [indiscernible]. It gives you enough similarity. So why does the number of neurons have to be bigger than the number of concepts? I think if the number of neurons is smaller than the number of concepts it can still represent everything and do [indiscernible] things. [indiscernible]. >>: So maybe I have something that will help with words. The way that the engineering community treats words is as if they are –. >>: One or two dimensions. >>: Right, but as if they don't themselves have features. But for example we know whether a word is abstract or concrete. We know whether a word denotes a human or not. So that type of feature, which to my knowledge has never been used in the NLP versions of these neural nets, is more like the columns that you just showed for the phones that you are familiar with: labial and voiced. >>: Or rather some linear combination of those features. >>: Exactly. So I have yet to see anybody unpack the meaning of a word along these different axes, the fact that something is human or not [indiscernible] the grammar. >>: I see, oh okay, okay. >>: So I am curious why we are not seeing that in NLP. >>: It's because we don't have the inventory of these features for words. 
>>: Is it only that you are lacking that inventory? >>: [indiscernible]. >>: There are some well-known things: number, gender, whether it's human or not, lots of [indiscernible]. >>: [indiscernible]. >>: It's linear algebra: you take the word embeddings and take vectors describing every one of these, and just unwrap them and see whether you get a good fit. >>: Right, and so I was [inaudible]. >>: [inaudible]. >>: In the phonetic world there is a huge redundancy because [indiscernible]. >>: So what I don't understand is why you accept it in phonology, and these abstract ways of describing them. >> Paul Smolensky: Because Lee has studied phonology. That's why, because nobody besides Lee accepts that, you realize. He's the only one. >>: But I understand why it's useful, because [indiscernible]. >>: But is it only because of these features, or do you have like a vocabulary? >>: We actually do, they are from a dictionary. >>: So let's just fit it. >>: And the dictionary itself is going to be recursive. >>: But it doesn't have that for [indiscernible] every word in English. What are the fundamental features around that? >>: It does, it does. >>: So you see this [inaudible]. >>: [inaudible]. Chris has been favoring [inaudible]. >>: [indiscernible]. >> Paul Smolensky: But even if it's not complete it will probably add some value, because what you have now is just the sound base. >>: You just have your word embeddings for all the words; you take half of them, you find the linear transformation that maps them to these features, and then on the other half you test whether you are right or not, whether you can say the word is an adjective or not, and so on. >>: But again, at worst it's nicely recursive, because a teacher is a person who teaches. So teaching is part of the meaning of teacher, but a professor is a person who teaches at a university, so now you have got a mixture of the meaning of teaching and university. >>: So your idea is to write down the same thing as the matrix that [indiscernible]. >>: And those are dictionary definitions. >>: So you do that, but then [indiscernible]. >>: So now [indiscernible] because I don't know the math of it. >>: Okay, okay. >>: [indiscernible]. >> Paul Smolensky: Okay, so we are past twelve. I have one slide left for this little section. Should I do it? >>: Yeah, let's finish that. >> Paul Smolensky: Okay, so the bottom line was that whatever you can do with local representations you can do with proper distributed representations, and you can program networks that have distributed representations just like you can program the ones that have local representations, once you understand how to write the equations and so on in such a way that they work in whatever coordinate system; you can then go from a local to a distributed one and back and forth. So there is a section here that just shows you what I have in mind when I say we can program nets with distributed representations, but that's going to be for another day. So this is the last slide of this section. So the thing is that the tensor notation is invariant under the general linear group of invertible transformations. If you write a local-representation equation in tensor notation, then the same equation will be valid for arbitrary proper distributed representations, and maybe the same equation is approximately valid for improper, compressed representations. That's what we are trying to explore in some of our ongoing work here. 
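A sketch of the probing recipe suggested in the discussion above, using entirely synthetic stand-ins for the word embeddings and the feature inventory: fit a linear map from embeddings to interpretable features on half the vocabulary, then test it on the held-out half. The sizes and feature names are hypothetical.

```python
# Sketch (all data synthetic): fit a linear probe from word embeddings to
# interpretable features on half the words, evaluate on the other half.
import numpy as np

rng = np.random.default_rng(0)
n_words, dim, n_feats = 1000, 300, 4       # hypothetical features: human, concrete, plural, adjective

# Stand-ins for real embeddings and a real feature inventory (e.g. from a dictionary).
true_map = rng.normal(size=(dim, n_feats))
E = rng.normal(size=(n_words, dim))                                   # "word embeddings"
F = (E @ true_map + 0.1 * rng.normal(size=(n_words, n_feats)) > 0).astype(float)

half = n_words // 2
W, *_ = np.linalg.lstsq(E[:half], F[:half], rcond=None)               # fit on first half
pred = (E[half:] @ W > 0.5).astype(float)                             # predict held-out features
print(f"held-out feature accuracy: {(pred == F[half:]).mean():.2f}")
```

With real embeddings and a real feature inventory, the held-out accuracy would indicate how much of that feature structure the embeddings already encode linearly.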
Local representations are lossless and so are proper distributed representations, but improper, compressed distributed representations are inherently lossy in some ways and typically require some type of extra noise-elimination process that is unnecessary with proper distributed representations. So that was all to say that the invariance properties of computation in neural networks lead to the conclusion that the natural and right architecture, as far as I am concerned, should be one in which the principles are invariant under change of coordinate description, and in order for equations to have that property you should use tensor notation, and that will enable you to use distributed representations not just for fillers but also for roles, and therefore get generalization across roles in the way that we are more familiar with generalization across fillers. So just as you generalize from one symbol to a similar symbol, you get from one position to a similar position, or something like that. Okay. >>: Okay, that's wonderful. So the last point you are making here, you would probably refer to this other common type of distributed representation, like how to eliminate the actual noise? >> Paul Smolensky: Right, yeah. >>: [indiscernible]. >> Paul Smolensky: Typically, yeah, I would say. I don't know that there are really principled ways to do that; maybe Tony would say that his method is principled. Then I can see an argument for that. >>: I see, okay, thank you very much.
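A minimal sketch of the lossless-versus-lossy contrast in that last slide, with arbitrary illustrative dimensions: when the role vectors are linearly independent ("proper"), fillers can be unbound from a tensor product representation exactly; when the roles are compressed into fewer dimensions than there are roles, unbinding is only approximate and needs some extra clean-up step.

```python
# Sketch (illustrative dimensions): tensor product binding and unbinding with
# proper (linearly independent) roles versus compressed roles.
import numpy as np

rng = np.random.default_rng(0)
d_f = 8                                    # filler dimension (arbitrary)

def bind(fillers, roles):
    # T = sum_i f_i (outer product) r_i
    return sum(np.outer(f, r) for f, r in zip(fillers, roles))

def unbind(T, roles):
    R = np.stack(roles)                    # one role vector per row
    U = np.linalg.pinv(R)                  # columns of U are the unbinding (dual) vectors
    return [T @ U[:, j] for j in range(U.shape[1])]

fillers = [rng.normal(size=d_f) for _ in range(3)]

# Proper case: 3 linearly independent roles in 3 dimensions -> exact recovery.
roles = [rng.normal(size=3) for _ in range(3)]
print(np.allclose(unbind(bind(fillers, roles), roles)[0], fillers[0]))     # True

# Compressed case: 3 roles squeezed into 2 dimensions -> crosstalk, needs clean-up.
roles_c = [rng.normal(size=2) for _ in range(3)]
print(np.allclose(unbind(bind(fillers, roles_c), roles_c)[0], fillers[0])) # False
```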