>> Paul Smolensky: So I will try to finish the points that were started last
time here and, if there's time, go back and take up some points from the
first pair of talks where a lot of important material was skipped.
So to remind you, we're just trying to see how to use ideas from linguistic
theory in order to better understand neural networks that are processing
language and that there's a theory of how vectors in neural networks might
represent linguistic structures, tensor product representations.
And last time we started talking about this point No. 2 here, which says that
grammar is optimization and that knowledge of grammar doesn't consist in
knowledge of procedures for constructing grammatical sentences, it consists
in knowledge of what desired properties grammatical structures have. And
that is a theory of grammar which has two variants, harmonic grammar and
optimality theory.
So the networks that underlie those theories are in the business of
maximizing a function I call harmony, which is like negative energy, if
you're familiar with the energy notion. It is a measure of how well the
state of the network satisfies the constraints that are implicit in all of
the connections between the units in the network.
So we talked earlier about how each of those connections constitutes a
kind of micro-level desideratum that two units should or shouldn't be active
at the same time depending on whether they're connected with a positive or
negative connection and that circulating activation in the way that many
networks do causes this well-formedness measure to increase.
And in deterministic networks, that takes you to a local harmony maximum and
in stochastic networks you can asymptotically go to a global harmony maximum
using a kind of stochastic differential equation instead of a deterministic
one to define the dynamics.
And this equation has as its equilibrium distribution this Boltzmann
distribution. It's an exponential of this well-formedness measure harmony
divided by a scaling parameter T. And this means states with higher harmony
which better satisfy the desiderata of the network are exponentially more
likely in the equilibrium distribution of this dynamic.
And often we run in the manner of simulated annealing, where
we drop the temperature so that as we approach the end of the computation and
temperature approaches zero, then the probability for states that are not
globally optimal goes down to zero. So all of that is what I refer to as the
optimization dynamics, and it has a name because there's a second dynamic
which interacts with it strongly, and that's called quantization, which we'll
get to shortly.
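To make the optimization dynamics concrete, here is a minimal sketch in Python of noisy gradient ascent on a quadratic harmony with the temperature annealed toward zero. The two-unit weight matrix, the [0, 1] activation bound, and the annealing schedule are illustrative assumptions, not details from the talk.

```python
# Minimal sketch: stochastic ascent on a two-unit quadratic harmony, with the
# temperature annealed toward zero. Weights, bounds, and schedules are
# assumptions made for illustration.
import numpy as np

rng = np.random.default_rng(0)
W = np.array([[0.0, 1.0],
              [1.0, 0.0]])                 # a positive connection: the units "want" to co-activate

def harmony(a):
    return 0.5 * a @ W @ a                 # the well-formedness measure H

a = rng.uniform(0.0, 1.0, size=2)          # random initial activation state
dt = 0.01
for step in range(5000):
    T = 1.0 / (1.0 + 0.05 * step)          # temperature drops as the computation proceeds
    grad = W @ a                           # dH/da for this quadratic harmony
    noise = np.sqrt(2.0 * T * dt) * rng.normal(size=2)
    a = np.clip(a + dt * grad + noise, 0.0, 1.0)   # keep the toy state bounded in [0, 1]

print(a, harmony(a))                       # ends near (1, 1), the harmony maximum in the box
```

At a fixed temperature the visited states approach the Boltzmann distribution described above; letting T go to zero concentrates the probability on the harmony maxima.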
And I will show you some examples of networks running in this fashion, or at
least the results of them, when we get down to the next point here. So you'll have
to wait just a little bit before we see some examples of this running in
practice.
Now, I wanted to remind you about how harmonic grammar works. So this is one
way to understand it. This is a different way of presenting it than I did
before. So maybe it will make sense if the other one didn't.
So we take as a given that we have networks that are maximizing harmony. And
we take it as a hypothesis that the states of interest are vectors that
encode symbol structures using tensor product representation. So then we
just ask what is the harmony of the tensor product representation and what
would it mean to be trying to maximize harmony with respect to such states.
So here is a tree for a syllable, cat, and the harmony for that symbol
structure, which is a macro-level concept, is defined by mapping the symbol
structure down into an activation pattern using the tensor product
isomorphism psi. So there's the pattern of activation that we get.
And the harmony at the micro level is defined algebraically by an equation
that has this as its central term. So this is maximized when units that are
connected by a positive weight are both active at the same time and units
that are connected by a negative weight are not both active at the same time.
So using this method, we can assign harmony to the symbol structure there,
and this is what leads us to --
>>: [inaudible] each position in the tree or not?
>> Paul Smolensky: These are different neurons. At the micro level we have
neurons; at the macro level we'll have positions in the tree.
>>:
Okay.
>> Paul Smolensky: So, in fact, what I sketched last time was that the
simple observation that if you actually substitute in for these, for the
activation vector here, the tensor product representation of some structure,
like a tree, then what you get for this harmony is something which can be
calculated in terms of the symbolic constituents alone.
So for every pair of constituents that might appear in a structure, there's a
number, which is their mutual harmony. If the harmony function, which is now
encoding the grammar, regards that combination as felicitous, as well
formed, then this number will be positive. If it's
something that's ill formed according to the grammar, that contribution will
be negative. But the set of these contributions then defines a way of
calculating the harmony of symbolic representations.
Underlyingly, we have this network at the bottom, which is passing activation
around and building this structure, but we can talk about what structures
will be built, the ones that have highest harmony, by using calculations at
the symbolic level to determine what states have highest harmony.
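The decomposition being described can be checked numerically. The sketch below, with made-up filler vectors, role vectors, and weights, builds a tensor product representation and confirms that the micro-level harmony of the whole pattern equals the sum of pairwise constituent harmonies.

```python
# Numerical check that the harmony of a tensor product representation
# decomposes into pairwise constituent terms. Fillers, roles, and weights are
# random placeholders.
import numpy as np

rng = np.random.default_rng(1)
n_filler, n_role = 3, 2
fillers = rng.normal(size=(2, n_filler))        # e.g. a consonant filler and a vowel filler
roles = rng.normal(size=(2, n_role))            # e.g. an onset role and a nucleus role

# Constituent embeddings under the tensor product mapping psi
v = [np.kron(f, r) for f, r in zip(fillers, roles)]
a = sum(v)                                      # activation pattern for the whole structure

W = rng.normal(size=(n_filler * n_role, n_filler * n_role))
W = 0.5 * (W + W.T)                             # symmetric weight matrix

micro = a @ W @ a                               # network-level harmony of the pattern
macro = sum(vi @ W @ vj for vi in v for vj in v)   # sum of pairwise constituent harmonies
print(np.isclose(micro, macro))                 # True: the two views agree
```

Those pairwise numbers are the macro-level harmony contributions; specifying one number per pair of possible constituents is what defines the symbolic harmony function.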
>>: So here the W, W is the neural network?
>> Paul Smolensky: Yes.
>>: It's the weight.
>> Paul Smolensky: Those are the weights that are --
>>: [inaudible] you assume the weights are predefined or prerun?
>> Paul Smolensky: Well, I'm talking about the weights that encode a
grammar. So they can get there by learning or they can get there by
programming, which is the way we do it. We program them.
>>: Yeah. So you don't need to define [inaudible]. So as long as you define
the harmony of the whole tree to be the same as the harmony of [inaudible].
So why there is a need to decompose harmony of the structure into the
constituent [inaudible]? This is a redundant definition, is it?
>> Paul Smolensky: In a way. It just means that you can take two views.
You can say we don't need this because we like the network, or you can be a
linguist and you say we don't need this because we don't like networks.
>>:
Oh, I see.
Okay.
>> Paul Smolensky:
We can just operate with the symbolic grammar.
>>: Okay. So the function H is different in this case. I mean, H to me is a
function that [inaudible].
>> Paul Smolensky: Yeah.
>>: So is the full function of a yellow H versus a white H?
>> Paul Smolensky: You know, these are the individual terms which, when
added together, give you the white one.
>>: I see. So in that case, what is the analytical form for H, if you don't
define that? What is the symbolic level [inaudible] define the neural
network.
>> Paul Smolensky: This is just a set of numerical values, one for each pair
of possible constituents. And those numbers define the harmony function as a
symbolic level.
>>:
I see.
Okay.
>> Paul Smolensky: So there really isn't any functional form to the terms.
They're just numbers.
>>:
Okay.
And then the way to complete that is through neuron [inaudible].
>> Paul Smolensky: If you wanted to understand where those numbers came from --
>>: All right. So this is conceptual level --
>> Paul Smolensky: Yes.
>>: -- of the function. The other one is [inaudible].
>> Paul Smolensky: That's right.
>>: Okay.
>> Paul Smolensky: And I'll show you two examples of terms in this sum which
will be relevant to a paper we'll talk about in a little while.
So here's an example. This is the harmony resulting from having a consonant in
this position here, which is the right child of the right child of the root,
which is a syllable. The combination of those two is actually negatively
weighted in the grammars of the world, because it is a property of the
world's syllable structures that it's dispreferred to have consonants at the
end.
So a consonant in a coda position lowers the harmony of a syllable. And in
some languages you never see them; in other languages they're just less
common or more restricted in when they can appear.
On the other hand, this combination of a consonant in the left child
position, the onset position of the syllable, is positively weighted in real
languages. Syllables that have a consonant at the beginning are preferred,
and there are languages in which all syllables must have consonants at the
beginning. So --
>>: [inaudible] higher weights, but the theory can tell you how much higher?
>> Paul Smolensky: Right. And that varies from one language to another. So
by looking at the different possible values that might be assigned to all of
these terms, we can map out the typology of possible, in this case, syllable
patterns across languages, when the constraints we're talking about, or the
harmony terms, refer to syllables, as these examples do.
So what the grammar is doing in circulating activation and maximizing harmony
is building a representation which at the symbolic level ought to be the
structure that maximizes this function, that best satisfies the requirements
that are imposed by all of these grammatical principles which are now encoded
in numbers.
So a language can have a stronger or weaker version of this onset
constraint. All languages will assign positive value to having a consonant
in the onset, but they could differ in how much they weight that.
Another way of saying the same thing is that the grammar generates the
representation that minimizes ill-formedness. Useful to point out just
because in linguistics the term markedness is used for ill-formedness in this
sense. And so you hear the phrase markedness a lot. And, as a matter of
fact, in the theories we're discussing, these constraints are called
markedness constraints.
The idea is that an element which is marked is something that is branded as
having some constraint violation which gives it a certain degree of
ill-formedness. So being marked is a bad thing, and the grammar is trying to
construct the structure which is in some sense least marked.
>>: So this term literally comes from marking text, markedness?
>> Paul Smolensky: No.
>>: No?
>>: This is the phonology part.
>> Paul Smolensky: Mm-hmm.
>>: Is it the same term as used in phonology?
>> Paul Smolensky: Yeah. Yeah.
>>: Chomsky's phonology?
>> Paul Smolensky: Yes. Yes. Yeah. Well, it was -- it was developed in
the '40s, primarily, by the Prague Circle linguists, for example, [inaudible]
from Russia. And the idea was that unmarked is like default. You don't have
to write it because it's the default, so it's not marked. You don't have to
mark what it is. So the default, if it's interpreted as the default one,
doesn't need to be marked. Whereas the thing which is not the default has to
be marked. That's the origin of the term.
>>: In this case everything is marked, right, like the [inaudible]?
>> Paul Smolensky: Yeah, they're --
>>: [inaudible].
>> Paul Smolensky: So in this theory -- sorry, everything is marked [inaudible] --
>>: [inaudible] the whole set is marked coda.
>> Paul Smolensky: So the -- I'm [inaudible] unmarked in the sense that you
get positive harmony for having it.
>>: Oh. Okay. So [inaudible] means that you get positive [inaudible].
>> Paul Smolensky: Right. It's preferred.
>>: Okay.
>> Paul Smolensky: So sometimes that takes the form of having a positive one.
Or, more often, it just takes the form of not having a negative one.
>>:
[inaudible]
>>:
But in this theory high unmarkedness can overcome high markedness?
>> Paul Smolensky: That's right. You can trade one off for the other.
Which is not true in optimality theory, which is what we're coming to
shortly.
So okay. So optimality theory is a variant of what we've just seen in which
the constraints are -- first of all, we insist that the constraints be the
same in all languages. And, second, the constraints are ranked rather than
weighted.
So a language takes these universal constraints, which include things like
codas are marked and onsets are unmarked. Those are examples of universal
principles of syllable structure. And rather than weighting them in
optimality theory, a language will rank them in some priority ordering. And
so onset might be high ranked or low ranked in a given language.
And we saw an example of the conflict between the subject constraint and the
full interpretation constraint when you have a sentence like it rains that
doesn't have a logical subject. And we saw how using numbers like three and
two we could describe two patterns depending on which of those had a greater
number, the Italian pattern or the English pattern.
But in optimality theory, it's simpler, we just say one is stronger than the
other, and that's it. Now, the hierarchy of constraints is a priority
hierarchy in which each constraint has absolute priority over all of the
weaker ones. So there's no way that you can overcome a disadvantage from a
higher ranked constraint by doing well on lots of lower ones. So you can't
do the kind of thing you were just pointing out, which can be done with a numerical
harmonic grammar version of the theory.
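As a concrete contrast of the two evaluation modes, here is a small sketch using the it rains example and the illustrative weights three and two mentioned earlier; the candidate set and violation counts are schematic.

```python
# Sketch contrasting weighted (harmonic grammar) and ranked (optimality theory)
# evaluation on the it rains example. Violation counts and the weights 3 and 2
# are the illustrative numbers from the earlier discussion, nothing more.
candidates = {
    "it rains": {"SUBJECT": 0, "FULL-INT": 1},   # expletive 'it' violates full interpretation
    "rains":    {"SUBJECT": 1, "FULL-INT": 0},   # missing subject violates the subject constraint
}

def hg_winner(weights):
    # Harmonic grammar: minimize the weighted sum of violations (maximize harmony).
    return min(candidates, key=lambda c: sum(weights[k] * v for k, v in candidates[c].items()))

def ot_winner(ranking):
    # Optimality theory: strict domination via lexicographic comparison.
    return min(candidates, key=lambda c: tuple(candidates[c][k] for k in ranking))

print(hg_winner({"SUBJECT": 3, "FULL-INT": 2}))   # English-like weighting -> 'it rains'
print(hg_winner({"SUBJECT": 2, "FULL-INT": 3}))   # Italian-like weighting -> 'rains'
print(ot_winner(["SUBJECT", "FULL-INT"]))         # English-like ranking   -> 'it rains'
print(ot_winner(["FULL-INT", "SUBJECT"]))         # Italian-like ranking   -> 'rains'
```

With weights, enough lower-ranked advantages can in principle outweigh a higher-ranked violation; with the lexicographic comparison of optimality theory, they never can.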
>>: So the real language is [inaudible] theory or you can weight the
constraints?
>> Paul Smolensky: Well, that's a big theoretical issue in the field, which
is the better description of natural languages. So I would say that the
majority opinion still favors optimality theory, but there's been more and
more work put into the numerical theory, and certain advantages are pointed
out from that.
The idea in optimality theory is that all languages have the same constraints
and that what defines a possible language is just something that can arise by
some ordering of the constraints. And if we're using weights, then we say
what defines a possible language is some pattern of optimal structures that
can be defined by some set of numbers.
And that's a harder space to explore, but it can be done. And there are
certain pathologies, unlinguistic-like predictions, that can come out of that,
but people try to find ways of avoiding them.
Okay. So just to emphasize: what optimality theory really brought to the
table was a way to formally compute the typology of possible grammars that
follow from some theory of what the constraints of universal grammar are,
what the universal constraints are. You just plug it into your program and it
tells you what the possible languages are. We didn't have that kind of
capability before. Okay. So it's
been applied at all sorts of levels and all sorts of ways.
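The typology computation being described can be sketched directly: enumerate every ranking of a small constraint set, find each ranking's winners, and collect the distinct output patterns as the predicted possible languages. The constraints, inputs, and violation marks below are toy placeholders, not an analysis from the talk.

```python
# Sketch of a factorial typology computation: every ranking of the constraint
# set defines a grammar, and the set of distinct winner patterns is the
# predicted typology. Inputs, candidates, and violation marks are toy values.
from itertools import permutations

CONSTRAINTS = ["ONSET", "NOCODA", "DONT-DELETE"]

# For each toy input, the candidate outputs and their violation counts.
tableaux = {
    "pat": {
        "pat": {"ONSET": 0, "NOCODA": 1, "DONT-DELETE": 0},
        "pa":  {"ONSET": 0, "NOCODA": 0, "DONT-DELETE": 1},
    },
    "apa": {
        "apa": {"ONSET": 1, "NOCODA": 0, "DONT-DELETE": 0},
        "pa":  {"ONSET": 0, "NOCODA": 0, "DONT-DELETE": 1},
    },
}

def winner(tableau, ranking):
    # Strict domination: compare violation vectors lexicographically by rank.
    return min(tableau, key=lambda cand: tuple(tableau[cand][c] for c in ranking))

typology = set()
for ranking in permutations(CONSTRAINTS):
    pattern = tuple(sorted((inp, winner(tab, ranking)) for inp, tab in tableaux.items()))
    typology.add(pattern)

for language in sorted(typology):
    print(language)   # each distinct pattern is one predicted possible language
```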
But let me just go to the bottom line, which is that its value for
linguistics is that it gives two -- it gives an answer to the fundamental
questions: What is it exactly that human languages share, what human
grammars share, and how can they differ. So what they share is the
constraints.
These are desiderata for what makes a good combinatorial structure, and
they're shared by all the world's languages. Even though Italian doesn't use
expletive subjects, meaningless subjects, like English does, it's not
because English likes those things and Italian doesn't like them; both
languages don't like them.
But because there's another constraint at work, the subject constraint that's
[inaudible] subjects, it's a matter of how much the language doesn't like it.
So that's what varies across languages. They differ in terms of the
weighting or the ranking of these constraints.
All right. So now I have a whole bunch of skipping points in this
presentation here. I guess that's what I was up during the night mostly
doing is finding ways to skip material.
So here is something we could look at: how the advent of optimality theory and
harmonic grammar made a really major shift in how phonology is done, and how
certain people practice other fields like the ones I mentioned, syntax and
semantics included, from using sequences of symbolic rewriting operations or
manipulations of some sort, which was the practice before, to something very
different, which is a kind of optimization calculation.
So I can show you how that works out in a real example in phonology or we can
skip it. Six is the number of slides.
>>: Let's not skip it [inaudible].
>> Paul Smolensky: Okay.
>>: [inaudible].
>> Paul Smolensky: You don't want to know how many more there would be if we
continue at the rate we're going.
>>: That's okay [inaudible].
>> Paul Smolensky: That's fine. Six is the answer.
>>: That's fine.
>> Paul Smolensky: Okay. So this real example comes from the optimality
theory book that Alan Prince and I distributed in 1991. It's a language,
Lardil, which is spoken on Mornington Island in the Gulf of Carpentaria in
the northern part of Australia. So it's an indigenous language of Australia.
And how do you say wooden axe in this language, as an example question to
which the language needs to provide an answer. Okay, well, the answer
depends on whether you're using it in the subject or object position of a
sentence.
So in the object position it has the accusative form, which comes out as
munkumunkun, which, when you look at all the forms, you see is a stem,
munkumunku, followed by the accusative or object marker N at the end, a
stem-plus-suffix form. And if you use it as a subject, in the nominative form,
it's just munkumu.
And so the phonology of Lardil, the knowledge of the language that the
speakers have is such that they can figure out that these two different forms
are the right ones for this word knowing what the stem is and what the
grammar -- the phonological grammar is.
Okay. So what kind of general knowledge would allow the speaker to compute
this as well as all the other forms that we see in this table here. And
here's a selection of data from one paper about Lardil, and this is the form
that we were just looking at.
Here's the accusative ending N for sentences that are not in the future
tense, and those which are in the future tense get an R instead. And this
shows lots of different patterns that different lexical items display. We
just looked at one so far. The simplest one is at the top here. There are
some words like kentapal, which means dugong, if you know what that is.
Manatee I think is the same thing.
So that's a simple case where you just add the endings to the stem and leave
the stem just the way it was, just like we add Ss and Ds to stems in English
and leave the stems the way they are. But most types of -- patterns are
different from that.
So just focusing on the nominative form here, forgetting about the accusative
forms, the object forms, just focusing on this part of the data table we see
that lots of different things happen across the lexicon. So in some cases,
like naluk, you lose the final consonant. In other ones, like yiliyili, you
lose the final vowel. So you get yiliyil as the nominative form subject
form.
You get the consonant and vowel at the end of the word deleted sometimes.
Yukarpa becomes yukar. Then we saw how a consonant-consonant vowel sequence,
[inaudible], can be deleted in some cases.
Now, when words are short, you see a different kind of pattern. Even though
yiliyili loses its last vowel, wite does not lose its last vowel. It stays
there unaffected. So there are some cases, and these happen in short words,
words that would in some sense intuitively become too short if you were to
delete the vowel at the end.
On the other hand, when the stem is really short, what you see is that the language
actually adds material rather than subtracting it as in the case of yak which
becomes yaka, fish. Sometimes a consonant and vowel are added in two
different ways.
So what we set out to do is to explain this pattern of behavior in going from
the stem to the subject form. And so these are the examples that we'll see in
the optimality theory analysis specifically, these three, which behave
differently: a lot of truncation, nothing, and then augmentation here.
But before that, let's look at what kind of analysis was done before
optimality theory came along. And it looked like this. There were sequences
of operations. You started with the stem. The first rule deleted the final
vowel. The second rule deleted the final consonant and then this consonant
went away and then this consonant is no longer connected to the syllable and
then it goes away.
And so each of these is a rule that manipulates the form of the word until no
more rules could apply, and then what you get is the actual form that you
pronounce. You have to make sure that you require that the rules can't be
reused, because if you're allowed to reuse them, the whole word would actually
disappear under this set of rules.
Okay. So that's the sequential serial operation of symbolic rewriting or
symbol manipulation that was dominant before optimality theory. And now it
looks completely different, as you'll see. So here is what happens when we
go to harmony maximization instead. We have a bunch of constraints which
are -- some of which are lined up along the top of this table that we'll look
at.
And these constraints include certain kinds of constraints called
faithfulness constraints which will appear in the application I'll talk about
shortly, faithfulness constraints, which say that the stored form of the word
and the pronounced form of the word should be as identical as possible. So
star insert V says don't insert a vowel in going from the stored stem form to
the nominative. The star insert C says don't insert a consonant. The last
one says don't delete anything.
So those are faithfulness constraints. They're always satisfied by making the
pronounced form identical to the underlying stored form. But then there are
other kinds of constraints, markedness constraints, of which we saw examples,
the onset and coda constraints, and there are a couple of other ones here, a
few of which we can go into in just a bit.
I just want to show you how the practice of doing phonology operates under
optimality theory. So we have a stem, munkumunku, and we want to know what
is the right pronunciation of it. So we do what we did with the [inaudible]
case. We think of the possible expressions that you might use to express
that. In this case it's not a logical form, it's a stored lexical entry that
you want to pronounce.
So part of the theory generates a bunch of candidate possible pronunciations,
and here are some of them, and for each possible pronunciation you look to
see how they -- how each one fares with respect to all of these constraints.
And so what you can see -- I don't know that I got my pointer this time --
the first form is one in which you pronounce it exactly as the stem is
stored. You don't change it at all. So all the faithfulness
constraints are happy. There's one constraint, though, that's special, and
it says that nominative forms in Lardil should not end in a vowel. And this
one does.
>>: How do you know the underlying form has a [inaudible]?
>> Paul Smolensky: Because of the other forms that we saw which added an R
or an N in the other forms of the word. So if we go back to the table of
data, you'll see that --
>>: I see. Okay.
>>: So that's like in the [inaudible] but that's --
>> Paul Smolensky: Sorry. One more here.
>>: I see.
>> Paul Smolensky: I guess I can't interrupt this. So what you see here is
that all of the forms end up putting either an N, if what we see is a vowel,
or a vowel plus N, if what we see is a consonant, in the nonfuture; and in
the future they all add R or a vowel plus R.
>>: Okay. So but you can't [inaudible] how you know which form should be
[inaudible].
>> Paul Smolensky: That's right.
>>: Okay.
>> Paul Smolensky: That's right. Yeah, these are very regular. You just
add an affix. And so they're the best way of seeing what the underlying form
of the stem is. But in the nominative form, what you see is a lot of
deviation from the underlying form that you don't see in the accusative
object forms.
Okay. So let's see here. All right. So we were going through this table
here. Right. Faithfulness constraints are all satisfied by this first
option. That's the fully faithful pronunciation. You don't change anything.
So all the faithfulness constraints are happy.
The nominative constraint says the nominative form should not end in a vowel,
and it's violated by this pronunciation. And that's why there's a star
there. That says that this candidate pronunciation violates that constraint.
And the exclamation mark means it does so in fatal ways that prevents it from
being optimal, which you can only tell by looking at the others, which we
haven't seen yet. Here they are. These are a sample of all the
possibilities.
And there are high-ranked constraints in the language, which are actually
never violated in the forms of the language against having a complex coda in
a syllable, which means having two consonants at the end of a syllable like
in the second munk possibility. Just dropping the U and the K is the third
possibility.
And in both cases we're deleting a vowel, so we get a violation of the don't
delete constraint in both cases. We get two violations of it in the
munkumunku -- munkumun possibility, the third one.
And there's another strong constraint. These are very highly ranked
constraints in this language which say that you can't have a nasal consonant
at the end, at the end of a syllable unless it's followed by a stop consonant
at the same place of articulation. So [inaudible] is okay if it's followed
by K but not otherwise.
So that constraint is violated by the third possibility and so on.
And this is the ranking of the constraints in the language. The strongest
constraints are on the left. And so the top ranked constraints, which I
haven't made separate columns for but indicated here what the violations are,
knock out options 2 and 3. This constraint then knocks out option 4.
Because the constraints are strictly dominating, the fact that these fail
with respect to these high-ranked constraints means they're out of the
running. No matter how well they do on all the other constraints, it doesn't
matter. They lost because there are better options available that don't
violate those constraints. And similarly here.
Then we get down to this point where the faithful candidate loses. And now
we're at a point where only these two are left. Now we look at the
constraints that are left. And here's a case where both of the remaining
options violate the constraint. And we can't eliminate them because
something has to remain. So they just survive. They're equally good as far
as this constraint is concerned, or equally bad, depending on how you want to
put it.
And so neither one is preferred, neither one is selected at that point.
Similarly, these two candidates are also equally evaluated by that
constraint, so nothing is eliminated. And only when you get down here to a
constraint that says don't delete do we prefer this candidate here to this
one, because it has fewer deletions. So it violates that constraint
less. And that's the winner. It's the correct pronunciation. And so this
way of rendering the calculation has produced the right result.
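The elimination procedure just walked through can be written down as a short routine: go down the ranking and, at each constraint, keep only the candidates with the fewest violations among those still in the running. The violation counts below are reconstructed loosely from the description above rather than copied from the published tableau.

```python
# Sketch of optimality-theoretic evaluation by successive elimination. The
# candidate forms echo the Lardil example, but the constraint names and
# violation counts are schematic stand-ins, not the published analysis.
RANKING = ["*COMPLEX-CODA", "CODA-COND", "FREE-V", "DONT-DELETE"]

tableau = {
    "munkumunku": {"*COMPLEX-CODA": 0, "CODA-COND": 0, "FREE-V": 1, "DONT-DELETE": 0},
    "munkumunk":  {"*COMPLEX-CODA": 1, "CODA-COND": 0, "FREE-V": 0, "DONT-DELETE": 1},
    "munkumun":   {"*COMPLEX-CODA": 0, "CODA-COND": 1, "FREE-V": 0, "DONT-DELETE": 2},
    "munkumu":    {"*COMPLEX-CODA": 0, "CODA-COND": 0, "FREE-V": 0, "DONT-DELETE": 3},
    "munku":      {"*COMPLEX-CODA": 0, "CODA-COND": 0, "FREE-V": 0, "DONT-DELETE": 5},
}

def ot_evaluate(tableau, ranking):
    """Work down the ranking, keeping only the least-offending survivors."""
    survivors = set(tableau)
    for constraint in ranking:
        best = min(tableau[c][constraint] for c in survivors)
        survivors = {c for c in survivors if tableau[c][constraint] == best}
        if len(survivors) == 1:
            break
    return survivors

print(ot_evaluate(tableau, RANKING))   # {'munkumu'}, the attested nominative
```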
And if we look at different stems, other things come into play. So the
really short stems are evaluated -- there's a constraint that says that no
word of the language can have less than two units of weight, which this does,
without going into details of weight. So even though the no-final-vowel
constraint here in the nominative doesn't like this, eliminating that vowel
drops the word to subminimal length, and that violates a higher ranked
constraint, and so it's not a better option. It's a worse option.
And similarly since these stems are already subminimal, they already violate
the minimum word, you have to add something to get them up to the point where
they no longer violate that constraint, and you end up through interaction of
these constraints -- I won't go into it -- adding a consonant and a vowel to
do so, at least in this case. Not in all cases.
But so the point is that if you look at this pattern, you'll see that there's
an interesting thing about constraints and optimality theory, and that's
usually described by saying the constraints in the theory are violable. So
there were constraints in grammars before, but they were always inviolable.
You violate a constraint, you're out. That's always how it was.
But these constraints are called violable. And what you see is that here
what we have is the constraint operating in the normal way. It rules one of
these possibilities out because it violates that constraint. And here,
though, what we see is that the form that's actually pronounced, the optimal
structure, the pronounced form violates that same constraint. And the
difference is that, okay, so that constraint is the constraint that says
don't insert a vowel. The violation is fatal in the munkumunku stem, but
it's not fatal, it's tolerated, in the real stem.
And the reason is simple. There is a better option here. There are options
that don't violate it, and they're preferred, and they win out, causing that
one to lose. Here there are no options left that are better than this.
They all violate the constraint, and equally. So none of them are eliminated
from the running at this point. Elimination has to come later when they're
not all evaluated the same.
So it's about whether there's a better option. So if there's no better
option, constraints are violated in the optimal forms, the highest harmony
forms, the grammatical forms.
>>: Do you have a good example of this type for English phonology where you
show that this grammar is much simpler [inaudible]?
>> Paul Smolensky: Well, the goal isn't to make the grammar simpler with
respect to a single language. The goal really is to make it capable of
explaining all the patterns that you see across languages. So it's a
grammatical approach that's really aimed at universal grammar. It really
focuses on a single phenomenon across all the languages rather than all the
phenomena across and within a single language.
>>: So you're saying that these constraints that you prepared are not only
for Lardil but also for English, for Chinese [inaudible]?
>> Paul Smolensky: Yeah, that's right. With one exception, which is
unfortunate: the nominative constraint is an idiosyncratic property of this
language, that nominative forms shouldn't end in vowels. I don't think
that that's a universal constraint, and so that's an undesirable aspect of
this analysis but one that I don't know how to eliminate.
Okay. So contributions. As I said, the formal theory of typology is what I
consider the main one; that we now can take a proposal for a theory of
syllables or whatever and look at the implications across all the world's
languages by re-ranking the constraints that are being proposed to see what
possible languages are predicted.
And there are lots of universals that need to be explained. So this online
archive, at the time I checked, which is now a while ago -- I should check
again -- had over 2,000 universals indicated. So there's a lot of stuff that
needs to be explained, and optimality theory is well positioned to do that.
Another aspect of the restrictiveness of analysis is that because the
constraints of the ideal analysis in optimality theory are universal, you
can't just make up constraints because they are convenient for the particular
phenomenon in the language you're looking at. They need to be responsible
for the implications that they have when you put them in the grammars of all
the other world's languages as well.
So that makes it harder to analyze rather than easier to analyze a phenomenon
in a given language. And that's usually regarded as a good thing in
linguistics, even if not in other fields. It turns out that there's been
quite a lot of productive work on learning theory, which primarily asks the
question: How can the aspect of a grammar, which is not universal -- namely,
the ranking -- be learned from data that you might be exposed to as a
learner.
And there are other contributions.
15 of them are described in this book.
So I want to take the opportunity to move on to a particular application,
which is in machine translation, which I just became aware of the day before
I created this talk. So my knowledge of this paper is definitely not
very deep, but I will tell you about it.
It uses optimality theory, it says. That's actually not the way I would say
it. But it uses it for improving machine translation in virtue of handling
[inaudible] vocabulary words in a very nice way.
So it turns out that a lot of words in any -- in most languages are borrowed
from other languages. And you might think, well, everybody is borrowing
words from English. English is everywhere. All languages are borrowing from
us. But we certainly did our share of borrowing already.
So you look at this picture: English is supposed to be a Germanic language;
well, that's 26 percent of our words. Another 29 percent from French, Latin, and
other languages. So lots and lots of borrowing in the history of our
language.
And that's not atypical. But what's tricky is that when a word is borrowed
into another language, the resulting thing, a loanword, is modified to fit the
phonology and the morphology of the recipient language. It's not just taken
in whole hog; it's modified to fit the requirements of being a word in that
language.
So here's a picture from that paper that shows some borrowings of the word,
well, you can probably guess what falafel means in the Hebrew form, but you
can see that quite a few changes are made in the course of borrowing. Ps and
Fs change. L gets somehow from the end of the word into the first syllable
of the word. Long-distance liquid metathesis of sorts. And vowels change.
So lots of things which are understood in terms of principles of phonology
and the constraints in optimality theory can help us understand what's going
on when a word is borrowed into another language.
The simple view is you have a ranking of constraints, which is the one that
applies to your language, another word comes in from another language, and
you submit that to the constraint hierarchy and it doesn't come out as
optimal on its own. What comes out as optimal is a modification of it.
That's the simple picture of how words get adapted from optimality theory
perspective.
And the useful thing for the application [inaudible] machine translation is
that resource-poor languages often have a lot of words that are borrowed from
resource-rich ones.
And so an example that they look at is doing translation between Swahili and
English. This is a low-resource situation which benefits from the fact that
a high-resource pair, Arabic-English, provides lots of opportunities for
analyzing out-of-vocabulary items in Swahili which are not known in advance as
to what their counterparts in English are.
But knowing -- being able to deduce that these words are borrowed from
Arabic, and since we have a lot of information about the relation of these
words to English, in some sense we know what they mean, then we can take
those as plausible candidates for what these words in Swahili mean.
>>: Some pieces at the level of lexicon [inaudible] phonology.
>> Paul Smolensky: Sorry? Lexicon?
>>: Yeah. So the new word is defined in terms of spelling [inaudible] to
phonemes.
>> Paul Smolensky: Well, there are some situations when it -- when
borrowings are affected by spelling. I'm not sure that that's true for
Swahili and Arabic.
>>: Okay. So mostly what you're talking about is the foreign words, putting
into a new word based upon [inaudible].
>> Paul Smolensky: Yeah. Right. That's the case that this method will work with.
>>: So the form of the word in the end depends on the order in which the
borrowing happens? Or is it pretty much independent of it?
>> Paul Smolensky: I think it does depend on the ordering. You can well
imagine, for example, just to take one trivial example, is if there's a
certain phoneme that's absent from the first language that borrows it, it
might just disappear. And then even if that phoneme is present in the second
language, it doesn't have a chance to come back.
>>: After phoneme is determined according to optimality theory, it still has
to create the spelling of the phoneme.
>> Paul Smolensky:
We're not concerned about spelling I don't think.
>>: Oh. Oh. So this for [inaudible] for the new word between the formal
spelling of it [inaudible] to give you a new word.
>> Paul Smolensky: Right. It's true that we have to have a way to access
the pronunciation from the resources that we have. And if the resources are
written, then we do need to know how to translate the -- yeah, put it back
into a phonetic form.
>>:
And the paper doesn't focus on [inaudible]?
>> Paul Smolensky: I don't remember that they talked about that, but they
might have and --
>>: Okay.
>>: So they try to infer the history of the changes in the ordering of the
universal constraints in order for a word to become [inaudible]?
>> Paul Smolensky: So they just considered a pair of languages. So a
borrowing from Arabic into Swahili without trying to reconstruct the history
or anything like that, which linguists certainly love to do, but that's not
what their program does yet anyway.
>>: [inaudible] particular languages, they could just be going through the
constraints, the same -- the ordering changed this way. If you change the
ordering of the constraints in this way, and then change again the ordering
of the constraints in this way, the minimum number of constraint changes, the
order -- the ranking changes that are necessary to go from here to there, you
could be kind of integrating over languages, you don't have to actually go
through the specific ones.
>> Paul Smolensky: Yeah, that's very interesting, and much more interesting
than what they are doing here actually. Because they're not looking at
changes in the constraints from one language to another. They're just trying
to figure out how should the constraints of Swahili be weighted so that when
Arabic forms are stuck into that constraint hierarchy what comes out best
approximates the words that we actually see in Swahili --
>>: [inaudible] the word [inaudible] from one language to another.
>> Paul Smolensky:
Yes, that's right.
That's right.
>>: So this process is quite intuitive [inaudible] if I want to borrow a new
language, say English, for example, first of all, [inaudible].
>> Paul Smolensky: Right.
>>: And that implies that you minimize violation of [inaudible].
>> Paul Smolensky: Right.
>>: I see. Okay.
>> Paul Smolensky: Right.
>>: But I thought that alternative phonology can give you the same thing.
Will it?
>> Paul Smolensky: There are theories of loanwords in generative phonology
too. Yep.
>>: So optimality theory is just simpler than generative? With generative
you're also told the [inaudible] you don't put that [inaudible].
>> Paul Smolensky: The --
>>: It's just simplicity-wise [inaudible] or...
It's just simplicity-wise [inaudible] or...
>> Paul Smolensky: Well, the method that they used has a straightforward
means of doing the learning required. I don't know how easy it would be to
do the learning required if we were dealing with generative phonology.
So here's what they say about the results, and then I'll say a bit more about
what you just pointed out. So the features are based on universal
constraints for optimality theory. So what they're using really is a Maxent
model. They're using something that's like harmonic grammar where the
constraints have weights. They're not really using optimality theory. But
they're taking all the constraints from optimality theory, and that's what's
doing the work for them, is knowing what the constraints are.
So they take the constraints from optimality theory, but they treat them as
having weights, but then they use Maxent learning procedures to learn those
weights in order to optimize the performance on some very small training set
of parallel text or whatever parallel language form they have.
So they say that their model outperforms baselines that don't have linguistic
input in them in the same sort of way, and that with only a few dozen
training examples of borrowings, they can get good estimates of the weights
of all of these constraints and then make good predictions about what words
were the source in borrowing in Arabic for the [inaudible] vocabulary items
in Swahili. So they show that this actually does give rise to a measurable
improvement.
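As a rough sketch of the kind of model being described: a MaxEnt (log-linear) model whose features are constraint-violation counts, with the weights fit to a handful of attested borrowings. The constraint names, feature values, and training loop are illustrative assumptions, not the paper's implementation.

```python
# Sketch of a MaxEnt model over candidate adaptations, with OT-style
# constraint-violation counts as features. All names and numbers here are
# placeholders for illustration.
import numpy as np

CONSTRAINTS = ["IDENT-PLACE", "IDENT-SONORITY", "NOCODA", "ONSET"]
# One training item: candidate adaptations of a single source word, each with a
# vector of violation counts, plus the index of the attested borrowing.
training = [
    {
        "violations": np.array([[0, 0, 2, 0],    # faithful candidate, keeps both codas
                                [1, 0, 0, 0],    # attested form, repairs codas but changes a place feature
                                [0, 1, 1, 0]]),  # another conceivable repair
        "attested": 1,
    },
]

w = np.zeros(len(CONSTRAINTS))      # constraint weights to be learned
lr = 0.5
for epoch in range(200):
    for item in training:
        V = item["violations"]
        harmony = -V @ w            # weighted violations lower a candidate's harmony
        p = np.exp(harmony - harmony.max())
        p /= p.sum()                # MaxEnt probability over the candidate set
        target = np.zeros(len(p))
        target[item["attested"]] = 1.0
        w += lr * ((p - target) @ V)   # gradient ascent on the log-likelihood
        w = np.maximum(w, 0.0)         # keep weights nonnegative, as in harmonic grammar

print(dict(zip(CONSTRAINTS, np.round(w, 2))))
```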
>>:
[inaudible] first talked about the harmonium language is just numbers.
>> Paul Smolensky: Yes. That's what they're doing. But they're rightfully
pointing out that the power of the method doesn't come -- the new power
doesn't come from the fact that it's a Maxent model. Everybody uses those.
What's new about it is the input that comes from optimality theory. So I
think it's legit in that sense.
So universal constraints straight from OT phonology. So here are the
constraints they actually use. I told you about faithfulness constraints and
markedness constraints partly because of this.
Here's a list of very standard kinds of faithfulness constraints that refer
to features of phonemes, like don't change the place of articulation of a
phoneme, don't change the sonority of -- that means like go from a stop to a
liquid, from ta to la. And so all of these things are just very much bread
and butter kind of faithfulness constraints in optimality theory.
And the markedness constraints will look familiar. Syllables must not have a
coda. Syllables must have onsets. The basis for syllable theory. This is
about the sequencing of consonants within syllables, which is a very
well-known --
>>: [inaudible] just partly not true for English [inaudible].
>> Paul Smolensky: All the constraints can be violated. All the constraints
in optimality theory can be violated. But no coda is a constraint that is
operative in English. It's just low enough rank that we have lots of other
constraints that outrank it and force codas to come into existence.
>>: Okay.
>> Paul Smolensky:
So the -- so I think you see how the game is played here.
>>: So then the question arises why do we have different languages? How
come? How come we can't understand each other when speaking different
languages? If these constraints are universal, why are we not aware of them
and why are we not using them optimally while talking to each other?
>>:
[inaudible] the second language.
>> Paul Smolensky: Well, I mean, the -- the difference in grammars is
re-ranking, but of course the difference in vocabularies is another matter
altogether. So if you knew all the words in another language as a start,
that would help. There are -- there is some reason to believe that the
process of learning a second language can be modeled by taking the ranking of
constraints that you have from your first language and modifying it in order
to produce the forms that you need to be able to now produce.
And so you get some sort of transfer benefit from having the constraints
installed in your original grammar, but the reason that there is not just one
language is there seems to be no grounds for ranking the constraints one way
as opposed to another, at least in a comprehensive way.
>>: I guess it seems like maybe there are different levels of things that
are actually happening in the brain. This seems like a very good way to
represent knowledge. But then the fact that you have the representation of
knowledge doesn't mean that you can easily access it and use it. And so
there is almost --
>> Paul Smolensky: But we can easily access and use it but not consciously.
>>: Well, but it's very difficult to see these things between the words even
in related languages. Like if you -- if you speak a Slavic language, you can
probably read most other Slavic languages, but you'll take a lot of time.
You really have to focus. So it doesn't come easily.
So it seems like there is another level of processing where maybe this
universal knowledge is kind of realized in a particular form and that
particular form is accessed, not universal knowledge and so on, that it seems
like the knowledge is represented in a way which is not easily accessible.
At high speed, at least.
>> Paul Smolensky: It could be that we have something like a kind of
compiled version of a constraint hierarchy that builds our own language's
rankings in and therefore is not in a form that's readily modifiable to
constitute the compilation of some other ranking. That's certainly possible.
And there's some reason I think that's true.
But everything is relative. I mean, you say it's hard, but I think it's
absolutely astonishing that you can take another Slavic language and with any
degree of effort read it. I mean, so --
>>: Yeah, but the thing is this difference in performance is interesting
from sort of a practical point of view. It reminds me of Boltzmann machines
and harmonium. You can between these models with lots of -- and they're hard
to train, but once you train them, you can sample from them. But it's hard
to sample from them. So there is this dichotomy with these representations
[inaudible] --
>> Paul Smolensky: It's hard to sample from them, you say?
>>: Yeah.
>> Paul Smolensky: Why do you say that?
>>: It's hard to general -- it's hard to generate a -- well, if you train
the model, an RBM, for example, on the amount of data -- on a set of data,
and now you want to generate the sample that's similar to that data, you
actually have to go through MCMC process for a long time, whereas the model
itself is very simple in terms of inference, procedure, and so on.
So there's a dichotomy there that it's very difficult to kind of unroll, the
knowledge is kind of wrapped, but to actually extract examples becomes hard.
And so then in language, for example, it's nice to represent languages in
terms of constraints, but to actually satisfy these constraints while we're
talking is difficult. And it almost seems like it isn't that you only have
the constraints or only have grammars, it seems like maybe grammar or
something akin to that in sequential processing is necessary for you to
express, to generate from the model, whereas just satisfying constraints on
the fly is hard.
>> Paul Smolensky: Well, it sounds reasonable, but the more the form of the
mental grammar is sort of compiled in a language-specific form, the more it's
difficult to explain why you can explain the difference in possible and
impossible languages in terms of whether there is a ranking or not. So, I
mean, you have to believe that somehow we have these language-specific
compiled forms, but they all originate from some sort of more universal form
and then get specialized ->>: [inaudible] everything you said. I'm just wondering beyond that once you
start thinking about how the real systems work you could have these like
versions of compiled and exploded versions of things. Like instead of having
your constraints, they have this constraint in many different conditions
exploded in the memory somewhere so you can just pull that off instead of
trying to satisfy the constraint and so on.
>> Paul Smolensky: Right. Right. I mean, it's a perennial issue in
psycholinguistics to do with phonology about how much grammatical knowledge
is used on the fly anyway, given that there's a limited set of words compared
to the unlimited set of sentences. You can imagine storing preprocessed,
pre-grammatically processed forms in abundance.
>>: Right. So maybe even grammar doesn't [inaudible].
>> Paul Smolensky: Well, it does in the --
>>: [inaudible] it doesn't work well.
>> Paul Smolensky: It does in the sense that we -- well, I mean, you can
also see that people have the general knowledge of the constraints as they
work in their language as well, because it's not -- they can take nonword
forms and tell you what the right thing to do with them phonologically is.
So they have a way of getting general predictive value out of what they know.
>>: So the other question about the theory, the [inaudible] theory or even
the weight version of it with using these constraints, universal constraints
and so on, I assume there must have been work on the evolution of languages
and the analysis of what happens when you change the rankings slightly and then
everything changes in the language dramatically and these sort of analyses.
>> Paul Smolensky: There have been analyses about historical change as
re-ranking of constraints indeed. Yes. Once you have a closed theory of all
the possible languages, then you know that the historical path has to somehow
be in that space.
>>: Right. So has there been an example where the paths have changed using
this theory as opposed to the previous work on [inaudible] languages?
>> Paul Smolensky: Oh. Good question. I don't really know enough about the
historical literature to know. That's a very good question, interesting
question. I'd like to find the answer.
There was more work of that sort that I saw earlier on in the theory than
there has been later. So it could be that those kind of questions have not
been as much [inaudible] as other questions have been.
All right. So let's not go through this here. Let me move on to the dynamic
simulations in networks finally. So the question is what does this kind of
[inaudible] business that we saw just now for munkumunku or the one that we
saw earlier for it rains look like at the neural level. So here's our answer
to that at this point in the development of our theory.
So in thinking about this case, where we have candidates like it rains to
decide among, we have these symbol structures at the higher level, the
macroscopic level of description, and they get mapped down to activation
patterns at the lower level by our isomorphism here.
There are -- here are four such possibilities for how you might want to
pronounce what we say is it's raining and it rains. And each of these is
mapped onto a particular point in the space of activation patterns, which is
an R-to-the-N continuum here. And so we often refer to these points
collectively as the grid. They are the points -- they are the activation
patterns that are the realization of exact symbol structures like the ones up
here, the grid of discrete states, D.
Now, here's the story. Here's our harmony map. So this is the activation
space, the particular points that correspond to these particular discrete
structures that are the embedding of those structures. The harmony values
for these structures in a harmonic grammar look like -- where am I? Here's
the cursor -- look like this. This is a ranking of the constraints that
corresponds to the English form. So the highest harmony option of these is
the one for it rains. The pattern of heights would be different for the
Italian grammar. That's the optimal one --
>>: How is this computed [inaudible] exponential maximum entropy formula
[inaudible]?
>> Paul Smolensky: Here are the harmony values that we want. And so what we
have to do is we have to figure out how to put connections in the network
whose states are described here.
>>:
Okay.
>> Paul Smolensky: What weights to put in the network so that when we look
at them at the higher level we get numbers like this.
>>:
I see.
Okay.
So that's exponential of the number [inaudible].
>> Paul Smolensky: The exponential only comes in when we go to probability.
So right now we're just talking about the raw formulas for harmony measure.
>>: Okay.
>> Paul Smolensky: So there's an exponential there.
>>: But it's very arbitrary.
>> Paul Smolensky: Hmm?
>>: It's quite arbitrary.
>> Paul Smolensky: Yeah.
>>: But don't you want to quantify, saying that [inaudible] three, sometimes
four, sometimes five [inaudible]?
>> Paul Smolensky: [inaudible] implicitly saying is definitely apropos here,
which is that if we had some probabilistic data available to us about the
likelihood of suboptimal states, then we could quantify what these numerical
values really ought to be. Because exponentiating these things should tell
us about the relative probabilities.
And so if we had data like that, we could make these nonarbitrary. So if you were
working with a corpus, then it wouldn't be arbitrary in the same way that it
is here. All right. So these are the harmony values on the grid of discrete
states. But there's the harmony surface over the whole continuum of states
in the space. It's just a quadratic because the harmony function is a
quadratic form.
And here's our system which we can think of as a kind of drunken ant climbing
up this harmony hill. So it's wandering around on average going up. But
because there's a stochastic component to the dynamics, it's only going up on
average. And if the temperature is going down as it proceeds, then it will
be --
>>: What do you say is harmonic? To me the harmony number is arbitrarily
defined. You said minus 3 rather than minus 2. As long as the order is
right [inaudible] H. So why do you say that it's quadratic? I don't see
[inaudible].
>> Paul Smolensky: The -- this is the formula for the harmony of all of the
states here. These are not states in the [inaudible]. The harmony of all
these states here.
>>:
Oh, I see.
Okay.
>> Paul Smolensky: So you take an arbitrary activation vector, its harmony
is A transpose WA, where W is the weight matrix of the network.
>>: Okay. That's a neural network you define.
>> Paul Smolensky: Yeah.
>>: Okay. Okay.
>> Paul Smolensky: Yeah. Yep. So I'm freely going back and forth between
neural harmony, which is what this is picturing, and symbolic harmony. They
are the same when it comes to the points on the grid. Numerically the same,
that is. Because we program the network to make that true.
Okay. So I was drawing a picture of the optimization dynamics here. It's
stochastic gradient ascent in harmony. However, what you'll notice is that
the optimum in the continuous space is actually here. So this is the highest
harmony grid state, but it's not the highest harmony state in the entire
continuous RN. So it's almost always the case that in RN the best states
blend together discrete states and aren't a single one.
So what I mean by a blend is that this state here is a linear combination of
these guys with some weighting. And so in order to actually compute what we
want, which is this guy here, not that, assuming that what we want is the
real discrete optimum from the higher-level grammatical picture, then we have
to do something else in addition, which is the quantization dynamics, which
looks like that.
So the quantization dynamics is constantly pushing the system towards the
grid. And it has an attractor at every grid point. These are the boundaries
of the attractor basin. So any point in this area gets pushed to that by the
dynamics that we call quantization.
So we actually have quantization and optimization working at the same time
because there are two things we want. One is to end up with a discrete state
and, two, to end up with the highest harmony discrete state.
And so a schematic picture of what's going on during the processing is we
turn the quantization up and we turn the temperature down. But the picture
here will show you about the quantization getting stronger.
So quantization is off in terms of the harmony function here. But once
quantization starts getting turned on during the computation, you see that
the landscape shifts underneath our ant so that the points that are not
discrete grid points have their harmony systematically pushed down the more
so the stronger quantization is. So there's a quantization parameter Q
that's being raised as the computation goes on and the landscape is changing
this way.
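A one-dimensional sketch of the combined dynamics might look like the following: the total harmony is the grammar's quadratic harmony plus q times a quartic quantization term with peaks at the grid values (here 0 and 1), and the run raises q while lowering T. The functional forms and schedules are assumptions for illustration, not the model's exact equations.

```python
# One-dimensional sketch of optimization plus quantization: total harmony is
# the grammar harmony plus q times a quartic quantization harmony peaking at
# the grid values 0 and 1. q is ramped up while the temperature T is annealed
# down. All forms and schedules here are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(2)

def grammar_harmony(a):
    return -(a - 0.7) ** 2                  # toy quadratic harmony, maximized at 0.7

def quantization_harmony(a):
    return -(a ** 2) * (1 - a) ** 2         # quartic with maxima at the grid points 0 and 1

def total_grad(a, q, eps=1e-4):
    h = lambda x: grammar_harmony(x) + q * quantization_harmony(x)
    return (h(a + eps) - h(a - eps)) / (2 * eps)   # numerical gradient, for brevity

a, dt, steps = 0.5, 0.01, 20000
for step in range(steps):
    t = step / steps
    q = 10.0 * t                            # quantization strength ramps up
    T = 0.1 * (1.0 - t)                     # temperature anneals down
    noise = np.sqrt(2.0 * T * dt) * rng.normal()
    a += dt * total_grad(a, q) + noise

print(round(a, 3))                          # ends near grid point 1, the higher-harmony discrete state
```

Because the toy grammar harmony peaks at 0.7, raising q pulls the state toward the nearby grid point 1, the higher-harmony discrete state, which is the behavior described in what follows.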
And what we want is for our ant, which started off on that nice hill, doing
stochastic gradient ascent as the cardboard is falling out below him. We
want the ant to end up right through there at the peak which is the tallest
peak.
And you can see that this is a miserable landscape to try to optimize. If
you were right there, your chances of finding the global optimum would be
pretty slim. And so the idea is that by gradually turning the landscape into
this [inaudible] while the network is going after the highest harmony area
that we will be able to have our cake and eat it, too, end up in a discrete
place and end up in the best discrete place.
And what we want really is that in the finite temperature case, if we want to
have a distribution of responses to have a model of what probabilities are
assigned to ungrammatical forms as well as just identifying what the grammatical
ones are, then what we would like is that the probability that the ant ends
up at a particular peak should be an exponential function of the harmony of
that peak for some small but finite temperature. That's what we want. And
I'll show you a theorem in a moment.
The picture here has been a schematic picture. What we have here is an
anatomically correct picture of the harmony landscape associated with
quantization. The quantization dynamics is a gradient dynamics on this
function. And if you want to see what it looks like more algebraically and
in one dimension, it looks like that, which I'm sure you'll recognize
immediately as the Higgs potential from Higgs boson theory, upside down.
And so now we're getting into harmony functions that are fourth order, not
second order, so that we can get shapes like this. We can't do anything
better than a single isolated optimum if we stick to quadratic harmony
functions. So the quantization harmony is a higher-order function that allows
for these multiple maxima at the different grid points.
So here's what it looks like. We can come back if somebody is fascinated by
such things.
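For reference, one quartic with exactly the properties being described -- peaks at the grid values and an inverted double-well shape in between -- is, in one dimension (the exact form used in the speaker's own work may differ):

    % One quartic quantization harmony with the stated properties (an assumed
    % form, not copied from the slide): peaks at the grid values a = 0 and
    % a = 1, with an inverted double-well shape in between.
    \[
      H_Q(a) \;=\; -\,a^{2}\,(1-a)^{2},
      \qquad
      \frac{dH_Q}{da} \;=\; -\,2a\,(1-a)\,(1-2a),
    \]
    \[
      H_{\text{total}}(a) \;=\; H_G(a) \;+\; q\,H_Q(a),
      \qquad q \to \infty \text{ during the computation.}
    \]

Multiplying H_Q by the growing parameter q and adding it to the grammar harmony is what progressively pushes the off-grid states down in the pictures above.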
But here's the anatomically correct version of the picture I showed you
schematically a moment ago as quantization increases. There's the original
harmony surface at the top, just the quadratic --
>>: [inaudible].
>> Paul Smolensky: -- [inaudible] quantization is off -- yes.
>>: [inaudible] I think the whole purpose is to show that "it rains" is
meaningful [inaudible] so they are not feasible. But you don't need to do
this [inaudible]; just by your original, you know, constraint ranking, you
can really get the solution.
>> Paul Smolensky: We know what the solution is.
>>: [inaudible].
>> Paul Smolensky: That doesn't mean we can generate it.
>>: Then why do you have to do all [inaudible]? What is the purpose of doing
these dynamics? You know the solution already based upon OT --
>> Paul Smolensky: You are a hard man to please. So like four slides ago you
said, why do we need this higher-level stuff? We've got the network. Why
wouldn't you use the network? Right? Now you're saying, what the hell do we
need the network for?
>>: I see. So I'm just -- even one of them [inaudible]; you don't need both.
>> Paul Smolensky: You need both for the following reason. The high-level
picture tells you what you want the output to be.
>>: Okay.
>> Paul Smolensky: It tells you what it should be. But it doesn't tell you
how to get it. And the network's job is to get it.
>>: Right. Okay. I see.
>> Paul Smolensky: Right. If you want to be a little bit more algorithmic
about it --
>>: Sorry? Now you're talking about neural implementation of [inaudible]. I
mean, [inaudible] only symbol. So that's not linguistic anymore, right,
you're not -- so one is to explain how that number -- [inaudible].
>> Paul Smolensky: For the moment that's so, yes.
>>: Okay.
>> Paul Smolensky: Yes. I mean, what I mean by "for the moment" is that, in
the talk, that's so. Yeah. So a more algorithmic way of saying it is that if
you had the luxury of examining every lattice point, computing its harmony,
and finding the one that has the highest, then that would be a way of getting
the answer. But there are exponentially many of those to examine, and our
network is just doing gradient ascent and ending up at the global maximum
peak.
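The "luxury" baseline just mentioned -- enumerate every lattice point, compute its harmony, keep the best -- would look something like the following sketch. The binary grid and the random symmetric weights are invented; the point is only that the loop runs over 2^n states.

    # Brute-force lattice search over an invented binary grid {0,1}^n:
    # score every grid point by its harmony and keep the argmax.
    import itertools
    import numpy as np

    def harmony(a, W):
        return float(a @ W @ a)

    n = 12
    rng = np.random.default_rng(1)
    W = rng.standard_normal((n, n))
    W = (W + W.T) / 2.0                     # symmetric weights

    best_state, best_h = None, -np.inf
    for bits in itertools.product([0.0, 1.0], repeat=n):   # 2**n grid points
        a = np.array(bits)
        h = harmony(a, W)
        if h > best_h:
            best_state, best_h = a, h

    print(best_h, best_state)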
>>: [inaudible] once you get all this stuff set up, you don't need to go
through the ranking of the violations of the constraints; you can just run
the dynamics, and eventually you come up with the same number.
>> Paul Smolensky: I think that's right.
>>: Yeah.
>> Paul Smolensky: Right.
>>: Yeah. So -- [inaudible].
>> Paul Smolensky: But the dynamics is doing gradient ascent in a surface
which is determined by the strengths of the constraints, by the weights of
the constraints. So it's not by any means independent of the weights. It's
not like you don't need the weights.
>>: I see. Okay.
>> Paul Smolensky: They're giving you the shape that you're crawling on top
of.
>>: I see. Okay. So the purpose is that after you run the dynamics, you can
throw away the constraints and then you come up with [inaudible]. Is that the
purpose of doing this? On the linguistic level, you already got the
constraints and you know that as a --
>> Paul Smolensky: Okay. So --
>>: Now you throw away that, this neural simulation, and then you try to do
[inaudible] gradient ascent, and the hope is you [inaudible].
>> Paul Smolensky: Yes.
>>: Whatever number you get.
>> Paul Smolensky: I think that -- think that's right. So when we go from
linguistics to cognitive science more generally, to psycholinguistics, then
what we want is a model of the process that the brain is going through, or
that the mind is going through, while you're forming an utterance in
phonology or while you're comprehending a sentence in syntax. And so this is
intended as a model of what's going on --
>>: The process.
>> Paul Smolensky: The process that's -- [inaudible]. Yeah.
>>: Okay. Okay. I see.
>> Paul Smolensky: Right. Which you don't get from the symbolic description.
>>: Okay. I see. Okay.
>> Paul Smolensky: That tells you what the end state should be, but it
doesn't tell you how you get there and --
>>: [inaudible].
>>: So the optimal solution here would be to have "it" before "rains"
[inaudible] 54 percent of "it" before "rains" and 20 percent after "rains".
That would be the optimal solution. But since that's not acceptable, we have
to go to grid points. So you're trying to steer the gradient ascent towards a
point. And you can't -- I suppose when it's complex enough, you can't just
simply find the solution and quantize it at that point.
>> Paul Smolensky: That's right.
>>: Because you're not going to get [inaudible].
>> Paul Smolensky: Unfortunately.
>>: [inaudible].
>> Paul Smolensky: Right. So this is not necessarily in the right basin of
attraction --
>>: Yeah.
>> Paul Smolensky: -- of the global maximum. Or yeah. So --
>>: So that's going to the question I was asking before. Is there some more
efficient way of using the structure of these constraints to more quickly get
to the answer [inaudible]? Like is it possible to learn how to jump through
these modes in some better way? Is that actually what the machine learning
algorithms are doing these days?
>> Paul Smolensky: I don't know.
>>: Are these --
>> Paul Smolensky: You can tell me.
>>: [inaudible] the exponential harmony function is a [inaudible] function.
So there should be a much better way.
>> Paul Smolensky: Well --
>>: But the question is not --
>> Paul Smolensky: This surface -- this surface is not a simple kind of
convex optimization problem, is it?
>>: No, it's not.
>> Paul Smolensky: Is it?
>>: [inaudible] I thought that you wrote it down to be e to the minus H
divided by T, and H is quadratic. That makes the whole thing [inaudible].
>> Paul Smolensky: H is quadratic for the grammatical harmony, but it's not
quadratic for the quantization harmony. The quantization harmony is fourth
order and it's designed specifically to have peaks at all the discrete --
>>: Okay. Okay.
>> Paul Smolensky: -- at the exponentially many combinations of
constituents. So it's constructed to have an attractor, a maximum at every
grid point. And there are lots of them.
Okay. So I don't know the answer to your question, but -- and I don't have
the training to think about it very intelligently. So if you do, I would
love to hear about it, talk about it. Yes.
>>: Just want to make sure I'm not too lost, but maybe I am. So the weights
here are precalculated, learned or hard coded or whatever.
>> Paul Smolensky: They're held fixed while all of this is going on. This
is --
>>: So this refers to [inaudible] the inference, not the learning.
>> Paul Smolensky: Precisely. That's right. That's right.
>>: Thank you. That's good. Thank you.
>> Paul Smolensky: But because the grammatical constraints are part of a kind
of maximum entropy model, methods for learning their strengths that are used
all the time can be applied here. That's what was done in that paper on
[inaudible] vocabulary that I mentioned.
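For concreteness, here is a hedged sketch of the standard maximum-entropy recipe being alluded to (not necessarily the exact method of the paper mentioned): constraint strengths are fit by gradient ascent on the log-likelihood of a log-linear distribution over candidates, where the gradient is observed minus expected constraint scores. The toy candidates, violation counts, and frequencies are invented.

    # Maximum-entropy fitting of constraint strengths: candidates are scored by
    # harmony H(x) = sum_k w_k * f_k(x), with f_k(x) = -(violations of k);
    # p(x) is proportional to exp(H(x)); the log-likelihood gradient is
    # (observed feature values) - (expected feature values).
    import numpy as np

    # Invented toy data: 3 candidates x 2 constraints.
    F = np.array([[ 0.0, -1.0],    # candidate 0 violates constraint 2 once
                  [-1.0,  0.0],    # candidate 1 violates constraint 1 once
                  [-1.0, -1.0]])   # candidate 2 violates both
    observed = np.array([0.6, 0.3, 0.1])   # hypothetical winner frequencies

    w = np.zeros(2)
    for _ in range(2000):
        p = np.exp(F @ w)
        p /= p.sum()                         # maxent distribution over candidates
        grad = F.T @ observed - F.T @ p      # observed minus expected features
        w += 0.1 * grad

    print(w)   # learned strengths; for this toy data they come out near 1.8, 1.1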
Okay. So I haven't shown you how to set the weights in the network so that
the harmony values follow the grammar that you want. And I don't have any
slides on that, but we can talk about it offline if you want.
So this is the picture of the dynamics as it goes on over time from left to
right. And each of these is the activity value of the -- how strong the
distributed representation in the network is for each of these constituents
that are possible.
So the one that is most rapidly settled on as highly active is the pattern
for the verb being filled by "rains". There's no -- the faithfulness
constraints say that that should be there, and there's nothing that conflicts
with that.
The second most rapidly settled on constituent is this one which means that
there's nothing following the verb. So this is the role, post verbal
position. This is the filler, nothing. And so nothing following the verb is
rapidly settled on. There's no -- none of the constraints want that to be
filled with anything.
And then the last one that gets decided is that in the subject position,
preverbal position, you should have the word it as opposed to the word rains
or nothing. And it's interesting that this one is the slowest to be decided
because this is the one that violates the subject constraint. But that's
overruled by the -- sorry. It violate -- it satisfies the subject constraint
but violates the full interpretation constraint. So there's conflict here,
and that slows down the decision process.
And this guy here hasn't even really turned itself all the way off the way it
should be in the end. You have to process longer to get that. This is the
one that says that there should be nothing in the subject position. So this
is favored by the full interpretation constraint. This is favored by the
subject constraint. Subject is stronger, so this one wins. But you very
much see that there's a conflict going on here.
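The constraint interaction just described can be written down as a tiny harmonic-grammar calculation. The constraint names come from the talk; the numeric weights and the candidate encodings are invented, chosen only so that the subject constraint outweighs full interpretation, as stated.

    # Toy harmonic-grammar scoring of the "it rains" example.
    weights = {"SUBJECT": 3.0,    # subject (preverbal) position must be filled
               "FULLINT": 2.0}    # full interpretation: no contentless fillers

    # violations[candidate][constraint]
    violations = {
        "it rains": {"SUBJECT": 0, "FULLINT": 1},  # expletive "it" is contentless
        "rains":    {"SUBJECT": 1, "FULLINT": 0},  # no subject at all
    }

    def harmony(candidate):
        # Harmony = -(weighted violation count); higher is better.
        return -sum(weights[c] * v for c, v in violations[candidate].items())

    for cand in violations:
        print(cand, harmony(cand))
    # "it rains" gets harmony -2, "rains" gets -3, so the expletive-subject
    # form wins -- the conflict that the activation trace above is slow to
    # resolve.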
Okay. And what I said was that what we pray for is that the probability that
an ant ends up on a particular peak, after the quantization has gotten really
strong -- the probability of ending up here -- is exponential in the harmony
of this grid point. Because we can reason about that.
So here is a theorem to that effect. Okay. So the theorem says: if you
consider our network operating at a fixed temperature, then there is a
distance such that, for all smaller distances r, for every one of the
discrete points in the grid, as the quantization strength goes to infinity,
the probability that the network state is within that distance of the grid
state -- the probability that the distance is less than r -- is exponential
in the harmony of that grid state. And, furthermore, the probability that the
state is not near any grid point at all goes to zero.
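A compact restatement of that theorem, in notation assumed here rather than copied from the slide (a is the network state, g ranges over grid states, H(g) is the grid state's harmony, T the fixed temperature, q the quantization strength, r the distance):

    % Compact restatement of the theorem as described above; the notation is
    % assumed, not taken from the slide.
    \[
      \exists\, r_0 > 0 \;\; \forall\, r \in (0, r_0):\qquad
      \lim_{q \to \infty}
      \Pr\bigl[\, \lVert \mathbf{a} - g \rVert < r \,\bigr]
      \;\propto\; e^{\,H(g)/T}
      \quad \text{for every grid state } g,
    \]
    \[
      \text{and}\qquad
      \lim_{q \to \infty}
      \Pr\bigl[\, \lVert \mathbf{a} - g \rVert \ge r \ \text{ for all } g \,\bigr] \;=\; 0 .
    \]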
So we have this nice result. The simulations don't obey the result very well,
and so we're trying to understand what the route from theory to practice is
going to be here. I mean, all these kinds of theorems are very asymptotic --
Q goes to infinity, and we need to have the network in thermal equilibrium,
and as you were pointing out, that can take a long time -- so just how to
pull it off, to make good on this theorem, is what we're working on nowadays.
So that's the story about the dynamics underlying harmony, harmonic grammar
and optimality theory.
And there's one more part of the story, which I think is brief. It's only one
slide, I think, actually. So I think there's time before three o'clock. And
that -- we've talked about how the -- we talked about continuous neural
computation that leads to discrete optimal outputs. And now I'm going to
talk about gradient optimal outputs as something that we might actually want
to use in our theory, not just get rid of. Quantization was a way of getting
rid of all of these nondiscrete states. But now we're focusing on what
actually those nondiscrete states are good for.
>>: Just one question about the previous topic. So you could try instead of
doing [inaudible] space to actually go through all possible combinations but
not -- not brute force but using samples of some sort, [inaudible] sampling
[inaudible] or something.
>> Paul Smolensky: To visit all discrete things?
>>: Yeah.
>>: [inaudible] or maximum harmony. Have you tried that? So you're only
skipping through discrete states but you're doing it in the MCMC [inaudible].
>> Paul Smolensky: If we did, which I think we probably did, it's long -- so
long ago I don't remember about it. But that is sort of a baseline that we
really need to be clear on. That's an excellent point.
>>: Then it could just be like one feed-forward calculation, right, and
winner-take-all sort of?
>>: Well, yeah, if you basically fix some variables and generate the other
ones in a way that you know asymptotically is going to lead to a state from
the network, from the energy profile, from [inaudible] to the energy profile.
So it depends on what -- for harmonium, we know how we would do it. We'd
generate the hiddens given the observeds, then the observeds given the
hiddens [inaudible], and do that for a while; then presumably that would be a
better way to generate from those constraints than simulated annealing, if
your target is discrete output. But then if you have this more general thing,
then maybe it's harder to find the right way to sample through discrete
states.
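The alternating scheme being described for harmonium is ordinary block Gibbs sampling. A minimal sketch, with invented sizes and weights rather than anything from the project under discussion:

    # Block Gibbs sampling for a harmonium (RBM) with binary units: sample the
    # hiddens given the visibles, then the visibles given the hiddens, repeat.
    import numpy as np

    rng = np.random.default_rng(0)
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

    n_visible, n_hidden = 8, 4
    W = 0.1 * rng.standard_normal((n_visible, n_hidden))  # visible-hidden weights
    b_v = np.zeros(n_visible)                              # visible biases
    b_h = np.zeros(n_hidden)                               # hidden biases

    v = rng.integers(0, 2, size=n_visible).astype(float)   # random start
    for _ in range(1000):
        p_h = sigmoid(v @ W + b_h)                # P(h = 1 | v)
        h = (rng.random(n_hidden) < p_h).astype(float)
        p_v = sigmoid(W @ h + b_v)                # P(v = 1 | h)
        v = (rng.random(n_visible) < p_v).astype(float)

    print(v, h)   # a sample that only ever visits discrete states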
>> Paul Smolensky: Yeah, I mean, since we're cognitive scientists and not
computer scientists, we have certain, you know, priorities that might not
totally make sense from a computer science perspective. So we don't think
that the brain jumps from discrete state to discrete state to discrete state
trying to find the highest harmony one. And so in a certain sense, we don't
care how good that algorithm is.
>>: Right.
>> Paul Smolensky: You could certainly argue the point. But --
>>: -- all these time steps.
>> Paul Smolensky: Well --
>>: In terms of --
>> Paul Smolensky: But we should know how good it is nonetheless.
>>: What you've got now isn't biologically plausible, is it? Right?
>> Paul Smolensky: Yeah. How many time steps it takes and so on, you mean?
>>: Especially that. Yeah.
>> Paul Smolensky: Well, there's sort of quantitative biological plausibility
and qualitative biological plausibility. And I've always been most attentive
to the qualitative part of it, believing that the longer you work to polish
it, the better the quantitative side gets. But if the qualitative side is
wrong, it doesn't matter.
>>: But is there any sort of neural net that can compute this dual
optimization we're discussing, with discrete goals? I mean --
>> Paul Smolensky: Is there --
>>: Is there a neural net model that computes this?
>> Paul Smolensky: Other than ours, you mean?
>>: Well, you're -- okay. So yours does. And it's just a neural net, it's
not --
>> Paul Smolensky: It's doing continuous -- it's following a dynamical
differential equation, which is a polynomial function, but not as low-order a
polynomial function as what you'd typically find in neural networks. That's
why I mentioned that the function that we're doing the gradient ascent on is
quartic and not just quadratic. So there's a question about the --
>>: So time steps are just like recurrent --
>> Paul Smolensky: Yeah.
>>: -- and then it converges to some answer [inaudible]?
>> Paul Smolensky: Yeah. Yeah.
>>: But the number of time steps you've seen is on the order of hundreds or
thousands?
>> Paul Smolensky: More or less. Yeah. Well, maybe I won't do this since
it's now three o'clock. That's okay. Am I giving a talk tomorrow?
>> Li Deng: Yes.
>> Paul Smolensky: Okay. Li wanted to know what I'm going to talk about at
Stanford, and what I'm going to talk about at Stanford is the next slide.
>> Li Deng: Okay.
>> Paul Smolensky: But I can do that tomorrow.
>> Li Deng: So tomorrow, Wednesday, 1:30 to 3:00. And the announcement should
go out. I'm surprised it hasn't gone out. I'm not --
>> Paul Smolensky: You were an optimist if you were surprised.
>> Li Deng: Yeah. I can't send it out myself.
>> Paul Smolensky: So 1:30 to 3:00 tomorrow afternoon.
>> Li Deng: Yes.
>> Paul Smolensky: Okay.
>> Li Deng: Same place.
>> Paul Smolensky: Yeah, I see. Okay.
[applause]
>> Paul Smolensky: Thanks. All righty.