>> Paul Smolensky: So I will try to finish the points that were started last
time here and, if there's time, go back and take up some points from the
first pair of talks where a lot of important material was skipped.
So to remind you, we're just trying to see how to use ideas from linguistic
theory in order to better understand neural networks that are processing
language and that there's a theory of how vectors in neural networks might
represent linguistic structures, tensor product representations.
And last time we started talking about this point No. 2 here, which says that
grammar is optimization and that knowledge of grammar doesn't consist in
knowledge of procedures for constructing grammatical sentences, it consists
in knowledge of what desired properties grammatical structures have. And
that is a theory of grammar which has two variants, harmonic grammar and
optimality theory.
So the networks that underlie those theories are in the business of
maximizing a function I call harmony, which is like negative energy, if
you're familiar with the energy notion. It is a measure of how well the
state of the network satisfies the constraints that are implicit in all of
the connections between the units in the network.
So we talked earlier about how each of those connections constitutes a
kind of micro-level desideratum that two units should or shouldn't be active
at the same time depending on whether they're connected with a positive or
negative connection and that circulating activation in the way that many
networks do causes this well-formedness measure to increase.
And in deterministic networks, that takes you to a local harmony maximum and
in stochastic networks you can asymptotically go to a global harmony maximum
using a kind of stochastic differential equation instead of a deterministic
one to define the dynamics.
And this equation has as its equilibrium distribution this Boltzmann
distribution. It's an exponential of this well-formedness measure harmony
divided by a scaling parameter T. And this means states with higher harmony
which better satisfy the desiderata of the network are exponentially more
likely in the equilibrium distribution of this dynamic.
And often we run in the manner of simulated annealing, where
we drop the temperature so that as we approach the end of the computation and
temperature approaches zero, then the probability for states that are not
globally optimal goes down to zero. So all of that is what I refer to as the
optimization dynamics, and it has a name because there's a second dynamic
which interacts with it strongly, and that's called quantization, which we'll
get to shortly.
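To make the optimization dynamics concrete, here is a minimal sketch in Python of noisy gradient ascent on a quadratic harmony with the temperature annealed toward zero. The two-unit weight matrix, the [0, 1] activation bound, and the annealing schedule are illustrative assumptions, not details from the talk.

```python
# Minimal sketch: stochastic ascent on a two-unit quadratic harmony, with the
# temperature annealed toward zero. Weights, bounds, and schedules are
# assumptions made for illustration.
import numpy as np

rng = np.random.default_rng(0)
W = np.array([[0.0, 1.0],
              [1.0, 0.0]])                 # a positive connection: the units "want" to co-activate

def harmony(a):
    return 0.5 * a @ W @ a                 # the well-formedness measure H

a = rng.uniform(0.0, 1.0, size=2)          # random initial activation state
dt = 0.01
for step in range(5000):
    T = 1.0 / (1.0 + 0.05 * step)          # temperature drops as the computation proceeds
    grad = W @ a                           # dH/da for this quadratic harmony
    noise = np.sqrt(2.0 * T * dt) * rng.normal(size=2)
    a = np.clip(a + dt * grad + noise, 0.0, 1.0)   # keep the toy state bounded in [0, 1]

print(a, harmony(a))                       # ends near (1, 1), the harmony maximum in the box
```

At a fixed temperature the visited states approach the Boltzmann distribution described above; letting T go to zero concentrates the probability on the harmony maxima.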
And I will show you some examples of networks running in this fashion, or at
least the results of them, when we get down to the next point here. So you'll have
to wait just a little bit before we see some examples of this running in
practice.
Now, I wanted to remind you about how harmonic grammar works. So this is one
way to understand it. This is a different way of presenting it than I did
before. So maybe it will make sense if the other one didn't.
So we take as a given that we have networks that are maximizing harmony. And
we take it as a hypothesis that the states of interest are vectors that
encode symbol structures using tensor product representation. So then we
just ask what is the harmony of the tensor product representation and what
would it mean to be trying to maximize harmony with respect to such states.
So here is a tree for a syllable, cat, and the harmony for that symbol
structure, which is a macro-level concept, is defined by mapping the symbol
structure down into an activation pattern using the tensor product
isomorphism psi. So there's the pattern of activation that we get.
And the harmony at the micro level is defined algebraically by an equation
that has this as its central term. So this is maximized when units that are
connected by a positive weight are both active at the same time and units
that are connected by a negative weight are not both active at the same time.
So using this method, we can assign harmony to the symbol structure there,
and this is what leads us to --
>>: [inaudible] each position in the tree or not?
>> Paul Smolensky: These are different neurons. At the micro level we have
neurons; at the macro level we'll have positions in the tree.
>>:
Okay.
>> Paul Smolensky: So, in fact, what I sketched last time was that the
simple observation that if you actually substitute in for these, for the
activation vector here, the tensor product representation of some structure,
like a tree, then what you get for this harmony is something which can be
calculated in terms of the symbolic constituents alone.
So for every pair of constituents that might appear in a structure, there's a
number, which is their mutual harmony. If the harmony function, which is now
encoding the grammar, regards that combination as felicitous, as well
formed, then this number will be positive. If it's
something that's ill formed according to the grammar, that contribution will
be negative. But the set of these contributions then defines a way of
calculating the harmony of symbolic representations.
Underlyingly, we have this network at the bottom, which is passing activation
around and building this structure, but we can talk about what structures
will be built, the ones that have highest harmony, by using calculations at
the symbolic level to determine what states have highest harmony.
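The decomposition being described can be checked numerically. The sketch below, with made-up filler vectors, role vectors, and weights, builds a tensor product representation and confirms that the micro-level harmony of the whole pattern equals the sum of pairwise constituent harmonies.

```python
# Numerical check that the harmony of a tensor product representation
# decomposes into pairwise constituent terms. Fillers, roles, and weights are
# random placeholders.
import numpy as np

rng = np.random.default_rng(1)
n_filler, n_role = 3, 2
fillers = rng.normal(size=(2, n_filler))        # e.g. a consonant filler and a vowel filler
roles = rng.normal(size=(2, n_role))            # e.g. an onset role and a nucleus role

# Constituent embeddings under the tensor product mapping psi
v = [np.kron(f, r) for f, r in zip(fillers, roles)]
a = sum(v)                                      # activation pattern for the whole structure

W = rng.normal(size=(n_filler * n_role, n_filler * n_role))
W = 0.5 * (W + W.T)                             # symmetric weight matrix

micro = a @ W @ a                               # network-level harmony of the pattern
macro = sum(vi @ W @ vj for vi in v for vj in v)   # sum of pairwise constituent harmonies
print(np.isclose(micro, macro))                 # True: the two views agree
```

Those pairwise numbers are the macro-level harmony contributions; specifying one number per pair of possible constituents is what defines the symbolic harmony function.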
>>: So here the W, W is the neural network?
>> Paul Smolensky: Yes.
>>: It's the weight.
>> Paul Smolensky: Those are the weights that are --
>>: [inaudible] you assume the weights are predefined or prerun?
>> Paul Smolensky: Well, I'm talking about the weights that encode a
grammar. So they can get there by learning or they can get there by
programming, which is the way we do it. We program them.
>>: Yeah. So you don't need to define [inaudible]. So as long as you define
the harmony of the whole tree to be the same as the harmony of [inaudible].
So why there is a need to decompose harmony of the structure into the
constituent [inaudible]? This is a redundant definition, is it?
>> Paul Smolensky: In a way. It just means that you can take two views.
You can say we don't need this because we like the network, or you can be a
linguist and you say we don't need this because we don't like networks.
>>:
Oh, I see.
Okay.
>> Paul Smolensky:
We can just operate with the symbolic grammar.
>>: Okay. So the function H is different in this case. I mean, H to me is a
function that [inaudible].
>> Paul Smolensky: Yeah.
>>: So is the full function of a yellow H versus a white H?
>> Paul Smolensky: You know, these are the individual terms which, when
added together, give you the white one.
>>: I see. So in that case, what is the analytical form for H, if you don't
define that? What is the symbolic level [inaudible] define the neural
network.
>> Paul Smolensky: This is just a set of numerical values, one for each pair
of possible constituents. And those numbers define the harmony function as a
symbolic level.
>>:
I see.
Okay.
>> Paul Smolensky: So there really isn't any functional form to the terms.
They're just numbers.
>>:
Okay.
And then the way to complete that is through neuron [inaudible].
>> Paul Smolensky: If you wanted to understand where those numbers came from --
>>: All right. So this is conceptual level --
>> Paul Smolensky: Yes.
>>: -- of the function. The other one is [inaudible].
>> Paul Smolensky: That's right.
>>: Okay.
>> Paul Smolensky: And I'll show you two examples of terms in this sum which
will be relevant to a paper we'll talk about in a little while.
So here's an example. This is the harmony resulting from having a consonant in
this position here, which is the right child of the right child of the root,
which is a syllable. The combination of those two is actually negatively
weighted in the grammars of the world, because it is a property of the
world's syllable structures that it's dispreferred to have consonants at the
end.
So a consonant in a coda position lowers the harmony of a syllable. And in
some languages you never see them; in other languages they're just less
common or more restricted in when they can appear.
On the other hand, this combination of a consonant in the left child
position, the onset position of the syllable, is positively weighted in real
languages. Syllables that have a consonant at the beginning are preferred,
and there are languages in which all syllables must have consonants at the
beginning. So --
>>: [inaudible] higher weights, but the theory can tell you how much higher?
>> Paul Smolensky: Right. And that varies from one language to another. So
by looking at the different possible values that might be assigned to all of
these terms, we can map out the typology of possible, in this case, syllable
patterns across languages, when the constraints we're talking about, or the
harmony terms, refer to syllables, as these examples do.
So what the grammar is doing in circulating activation and maximizing harmony
is building a representation which at the symbolic level ought to be the
structure that maximizes this function, that best satisfies the requirements
that are imposed by all of these grammatical principles which are now encoded
in numbers.
So a language can have a stronger or weaker version of this onset
constraint. All languages will assign positive value to having a consonant
in the onset, but they could differ in how much they weight that.
Another way of saying the same thing is that the grammar generates the
representation that minimizes ill-formedness. Useful to point out just
because in linguistics the term markedness is used for ill-formedness in this
sense. And so you hear the phrase markedness a lot. And, as a matter of
fact, in the theories we're discussing, these constraints are called
markedness constraints.
The idea is that an element which is marked is something that is branded as
having some constraint violation which gives it a certain degree of
ill-formedness. So being marked is a bad thing, and the grammar is trying to
construct the structure which is in some sense least marked.
>>: So this term literally comes from marking text, markedness?
>> Paul Smolensky: No.
>>: No?
>>: This is the phonology part.
>> Paul Smolensky: Mm-hmm.
>>: Is it the same term as used in phonology?
>> Paul Smolensky: Yeah. Yeah.
>>: Chomsky's phonology?
>> Paul Smolensky: Yes. Yes. Yeah. Well, it was -- it was developed in
the '40s, primarily, by the Prague Circle linguists, for example, [inaudible]
from Russia. And the idea was that unmarked is like default. You don't have
to write it because it's the default, so it's not marked. You don't have to
mark what it is. So the default, if it's interpreted as the default one,
doesn't need to be marked. Whereas the thing which is not the default has to
be marked. That's the origin of the term.
>>: In this case everything is marked, right, like the [inaudible]?
>> Paul Smolensky: Yeah, they're --
>>: [inaudible].
>> Paul Smolensky: So in this theory -- sorry, everything is marked [inaudible] --
>>: [inaudible] the whole set is marked coda.
>> Paul Smolensky: So the -- I'm [inaudible] unmarked in the sense that you
get positive harmony for having it.
>>: Oh. Okay. So [inaudible] means that you get positive [inaudible].
>> Paul Smolensky: Right. It's preferred.
>>: Okay.
>> Paul Smolensky: So sometimes that takes the form of having a positive one.
Or, more often, it just takes the form of not having a negative one.
>>:
[inaudible]
>>:
But in this theory high unmarkedness can overcome high markedness?
>> Paul Smolensky: That's right. You can trade one off for the other.
Which is not true in optimality theory, which is what we're coming to
shortly.
So okay. So optimality theory is a variant of what we've just seen in which
the constraints are -- first of all, we insist that the constraints be the
same in all languages. And, second, the constraints are ranked rather than
weighted.
So a language takes these universal constraints, which include things like
codas are marked and onsets are unmarked. Those are examples of universal
principles of syllable structure. And rather than weighting them in
optimality theory, a language will rank them in some priority ordering. And
so onset might be high ranked or low ranked in a given language.
And we saw an example of the conflict between the subject constraint and the
full interpretation constraint when you have a sentence like it rains that
doesn't have a logical subject. And we saw how using numbers like three and
two we could describe two patterns depending on which of those had a greater
number, the Italian pattern or the English pattern.
But in optimality theory, it's simpler, we just say one is stronger than the
other, and that's it. Now, the hierarchy of constraints is a priority
hierarchy in which each constraint has absolute priority over all of the
weaker ones. So there's no way that you can overcome a disadvantage from a
higher ranked constraint by doing well on lots of lower ones. So you can't
do the kind of thing you were just pointing out, which can be done with a numerical
harmonic grammar version of the theory.
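As a concrete contrast of the two evaluation modes, here is a small sketch using the it rains example and the illustrative weights three and two mentioned earlier; the candidate set and violation counts are schematic.

```python
# Sketch contrasting weighted (harmonic grammar) and ranked (optimality theory)
# evaluation on the it rains example. Violation counts and the weights 3 and 2
# are the illustrative numbers from the earlier discussion, nothing more.
candidates = {
    "it rains": {"SUBJECT": 0, "FULL-INT": 1},   # expletive 'it' violates full interpretation
    "rains":    {"SUBJECT": 1, "FULL-INT": 0},   # missing subject violates the subject constraint
}

def hg_winner(weights):
    # Harmonic grammar: minimize the weighted sum of violations (maximize harmony).
    return min(candidates, key=lambda c: sum(weights[k] * v for k, v in candidates[c].items()))

def ot_winner(ranking):
    # Optimality theory: strict domination via lexicographic comparison.
    return min(candidates, key=lambda c: tuple(candidates[c][k] for k in ranking))

print(hg_winner({"SUBJECT": 3, "FULL-INT": 2}))   # English-like weighting -> 'it rains'
print(hg_winner({"SUBJECT": 2, "FULL-INT": 3}))   # Italian-like weighting -> 'rains'
print(ot_winner(["SUBJECT", "FULL-INT"]))         # English-like ranking   -> 'it rains'
print(ot_winner(["FULL-INT", "SUBJECT"]))         # Italian-like ranking   -> 'rains'
```

With weights, enough lower-ranked advantages can in principle outweigh a higher-ranked violation; with the lexicographic comparison of optimality theory, they never can.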
>>: So the real language is [inaudible] theory or you can weight the
constraints?
>> Paul Smolensky: Well, that's a big theoretical issue in the field, which
is the better description of natural languages. So I would say that the
majority opinion still favors optimality theory, but there's been more and
more work put into the numerical theory, and certain advantages are pointed
out from that.
The idea in optimality theory is that all languages have the same constraints
and that what defines a possible language is just something that can arise by
some ordering of the constraints. And if we're using weights, then we say
what defines a possible language is some pattern of optimal structures that
can be defined by some set of numbers.
And that's a harder space to explore, but it can be done. And there are
certain pathologies, unlinguistic-like predictions, that can come out of that,
but people try to find ways of avoiding them.
Okay. So just to emphasize: what optimality theory really brought to the
table was a way to formally compute the typology of possible grammars that
follow from some theory of what the constraints of universal grammar are,
what the universal constraints are. You just plug it into your program and it
tells you what the possible languages are. We didn't have that kind of
capability before. Okay. So it's
been applied at all sorts of levels and all sorts of ways.
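The typology computation being described can be sketched directly: enumerate every ranking of a small constraint set, find each ranking's winners, and collect the distinct output patterns as the predicted possible languages. The constraints, inputs, and violation marks below are toy placeholders, not an analysis from the talk.

```python
# Sketch of a factorial typology computation: every ranking of the constraint
# set defines a grammar, and the set of distinct winner patterns is the
# predicted typology. Inputs, candidates, and violation marks are toy values.
from itertools import permutations

CONSTRAINTS = ["ONSET", "NOCODA", "DONT-DELETE"]

# For each toy input, the candidate outputs and their violation counts.
tableaux = {
    "pat": {
        "pat": {"ONSET": 0, "NOCODA": 1, "DONT-DELETE": 0},
        "pa":  {"ONSET": 0, "NOCODA": 0, "DONT-DELETE": 1},
    },
    "apa": {
        "apa": {"ONSET": 1, "NOCODA": 0, "DONT-DELETE": 0},
        "pa":  {"ONSET": 0, "NOCODA": 0, "DONT-DELETE": 1},
    },
}

def winner(tableau, ranking):
    # Strict domination: compare violation vectors lexicographically by rank.
    return min(tableau, key=lambda cand: tuple(tableau[cand][c] for c in ranking))

typology = set()
for ranking in permutations(CONSTRAINTS):
    pattern = tuple(sorted((inp, winner(tab, ranking)) for inp, tab in tableaux.items()))
    typology.add(pattern)

for language in sorted(typology):
    print(language)   # each distinct pattern is one predicted possible language
```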
But let me just go to the bottom line, which is that its value for
linguistics is that it gives two -- it gives an answer to the fundamental
questions: What is it exactly that human languages share, what human
grammars share, and how can they differ. So what they share is the
constraints.
These are desiderata for what makes a good combinatorial structure, and
they're shared by all the world's languages. Even though Italian doesn't use
expletive subjects, meaningless subjects, like English does, it's not
because English likes those things and Italian doesn't like them; both
languages don't like them.
But because there's another constraint at work, the subject constraint that's
[inaudible] subjects, it's a matter of how much the language doesn't like it.
So that's what varies across languages. They differ in terms of the
weighting or the ranking of these constraints.
All right. So now I have a whole bunch of skipping points in this
presentation here. I guess that's what I was up during the night mostly
doing is finding ways to skip material.
So here is something we could look at: how the advent of optimality theory and
harmonic grammar made a really major shift in how phonology is done, and how
certain people practice other fields like the ones I mentioned, syntax and
semantics included, from using sequences of symbolic rewriting operations or
manipulations of some sort, which was the practice before, to something very
different, which is a kind of optimization calculation.
So I can show you how that works out in a real example in phonology or we can
skip it. Six is the number of slides.
>>: Let's not skip it [inaudible].
>> Paul Smolensky: Okay.
>>: [inaudible].
>> Paul Smolensky: You don't want to know how many more there would be if we
continue at the rate we're going.
>>: That's okay [inaudible].
>> Paul Smolensky: That's fine. Six is the answer.
>>: That's fine.
>> Paul Smolensky: Okay. So this real example comes from the optimality
theory book that Alan Prince and I distributed in 1991. It's a language,
Lardil, which is spoken on Mornington Island in the Gulf of Carpentaria in
the northern part of Australia. So it's an indigenous language of Australia.
And how do you say wooden axe in this language, as an example question to
which the language needs to provide an answer. Okay, well, the answer
depends on whether you're using it in the subject or object position of a
sentence.
So in the object position it has the accusative form, which comes out as
munkumunkun, which, when you look at all the forms, you see is a stem,
munkumunku, followed by the accusative or object marker N at the end, a
stem-plus-suffix form. And if you use it as a subject, in the nominative form,
it's just munkumu.
And so the phonology of Lardil, the knowledge of the language that the
speakers have is such that they can figure out that these two different forms
are the right ones for this word knowing what the stem is and what the
grammar -- the phonological grammar is.
Okay. So what kind of general knowledge would allow the speaker to compute
this as well as all the other forms that we see in this table here. And
here's a selection of data from one paper about Lardil, and this is the form
that we were just looking at.
Here's the accusative ending N for sentences that are not in the future
tense, and those which are in the future tense get an R instead. And this
shows lots of different patterns that different lexical items display. We
just looked at one so far. The simplest one is at the top here. There are
some words like kentapal, which means dugong, if you know what that is.
Manatee I think is the same thing.
So that's a simple case where you just add the endings to the stem and leave
the stem just the way it was, just like we add Ss and Ds to stems in English
and leave the stems the way they are. But most types of -- patterns are
different from that.
So just focusing on the nominative form here, forgetting about the accusative
forms, the object forms, just focusing on this part of the data table we see
that lots of different things happen across the lexicon. So in some cases,
like naluk, you lose the final consonant. In other ones, like yiliyili, you
lose the final vowel. So you get yiliyil as the nominative form subject
form.
You get the consonant and vowel at the end of the word deleted sometimes.
Yukarpa becomes yukar. Then we saw how a consonant-consonant vowel sequence,
[inaudible], can be deleted in some cases.
Now, when words are short, you see a different kind of pattern. Even though
yiliyili loses its last vowel, wite does not lose its last vowel. It stays
there unaffected. So there are some cases, and these happen in short words,
words that would in some sense intuitively become too short if you were to
delete the vowel at the end.
On the other hand, when the stem is really short, what you see is that the language
actually adds material rather than subtracting it as in the case of yak which
becomes yaka, fish. Sometimes a consonant and vowel are added in two
different ways.
So what we set out to do is to explain this pattern of behavior in going from
the stem to the subject form. And so these are the examples that we'll see in
the optimality theory analysis specifically, these three, which behave
differently: a lot of truncation, nothing, and then augmentation here.
But before that, let's look at what kind of analysis was done before
optimality theory came along. And it looked like this. There were sequences
of operations. You started with the stem. The first rule deleted the final
vowel. The second rule deleted the final consonant and then this consonant
went away and then this consonant is no longer connected to the syllable and
then it goes away.
And so each of these is a rule that manipulates the form of the word until no
more rules could apply, and then what you get is the actual form that you
pronounce. You have to make sure that you require that the rules can't be
reused, because if you're allowed to reuse them, the whole word would actually
disappear under this set of rules.
Okay. So that's the sequential serial operation of symbolic rewriting or
symbol manipulation that was dominant before optimality theory. And now it
looks completely different, as you'll see. So here is what happens when we
go to harmony maximization instead. We have a bunch of constraints which
are -- some of which are lined up along the top of this table that we'll look
at.
And these constraints include certain kinds of constraints called
faithfulness constraints which will appear in the application I'll talk about
shortly, faithfulness constraints, which say that the stored form of the word
and the pronounced form of the word should be as identical as possible. So
star insert V says don't insert a vowel in going from the stored stem form to
the nominative. The star insert C says don't insert a consonant. The last
one says don't delete anything.
So those are faithfulness constraints. They're always satisfied by making the
pronounced form identical to the underlying stored form. But then there are
other kinds of constraints, markedness constraints, of which we saw examples,
the onset and coda constraints, and there are a couple of other ones here, a
few of which we can go into in just a bit.
I just want to show you how the practice of doing phonology operates under
optimality theory. So we have a stem, munkumunku, and we want to know what
is the right pronunciation of it. So we do what we did with the [inaudible]
case. We think of the possible expressions that you might use to express
that. In this case it's not a logical form, it's a stored lexical entry that
you want to pronounce.
So part of the theory generates a bunch of candidate possible pronunciations,
and here are some of them, and for each possible pronunciation you look to
see how they -- how each one fares with respect to all of these constraints.
And so what you can see -- I don't know that I got my pointer this time --
the first form is one in which you pronounce it exactly as the stem is
stored. You don't change it at all. So all the faithfulness
constraints are happy. There's one constraint, though, that's special, and
it says that nominative forms in Lardil should not end in a vowel. And this
one does.
>>: How do you know the underlying form has a [inaudible]?
>> Paul Smolensky: Because of the other forms that we saw which added an R
or an N in the other forms of the word. So if we go back to the table of
data, you'll see that --
>>: I see. Okay.
>>: So that's like in the [inaudible] but that's --
>> Paul Smolensky: Sorry. One more here.
>>: I see.
>> Paul Smolensky: I guess I can't interrupt this. So what you see here is
that all of the forms end up putting either an N, if what we see is a vowel,
or a vowel plus N, if what we see is a consonant, in the nonfuture; and in
the future they all add R or a vowel plus R.
>>: Okay. So but you can't [inaudible] how you know which form should be
[inaudible].
>> Paul Smolensky: That's right.
>>: Okay.
>> Paul Smolensky: That's right. Yeah, these are very regular. You just
add an affix. And so they're the best way of seeing what the underlying form
of the stem is. But in the nominative form, what you see is a lot of
deviation from the underlying form that you don't see in the accusative
object forms.
Okay. So let's see here. All right. So we were going through this table
here. Right. Faithfulness constraints are all satisfied by this first
option. That's the fully faithful pronunciation. You don't change anything.
So all the faithfulness constraints are happy.
The nominative constraint says the nominative form should not end in a vowel,
and it's violated by this pronunciation. And that's why there's a star
there. That says that this candidate pronunciation violates that constraint.
And the exclamation mark means it does so in fatal ways that prevents it from
being optimal, which you can only tell by looking at the others, which we
haven't seen yet. Here they are. These are a sample of all the
possibilities.
And there are high-ranked constraints in the language, which are actually
never violated in the forms of the language against having a complex coda in
a syllable, which means having two consonants at the end of a syllable like
in the second munk possibility. Just dropping the U and the K is the third
possibility.
And in both cases we're deleting a vowel, so we get a violation of the don't
delete constraint in both cases. We get two violations of it in the
munkumunku -- munkumun possibility, the third one.
And there's another strong constraint. These are very highly ranked
constraints in this language which say that you can't have a nasal consonant
at the end, at the end of a syllable unless it's followed by a stop consonant
at the same place of articulation. So [inaudible] is okay if it's followed
by K but not otherwise.
So that constraint is violated by the third possibility and so on.
And this is the ranking of the constraints in the language. The strongest
constraints are on the left. And so the top ranked constraints, which I
haven't made separate columns for but indicated here what the violations are,
knock out options 2 and 3. This constraint then knocks out option 4.
Because the constraints are strictly dominating, the fact that these fail
with respect to these high-ranked constraints means they're out of the
running. No matter how well they do on all the other constraints, it doesn't
matter. They lost because there are better options available that don't
violate those constraints. And similarly here.
Then we get down to this point where the faithful candidate loses. And now
we're at a point where only these two are left. Now we look at the
constraints that are left. And here's a case where both of the remaining
options violate the constraint. And we can't eliminate them because
something has to remain. So they just survive. They're equally good as far
as this constraint is concerned, or equally bad, depending on how you want to
put it.
And so neither one is preferred, neither one is selected at that point.
Similarly, these two candidates are also equally evaluated by that
constraint, so nothing is eliminated. And only when you get down here to a
constraint that says don't delete do we prefer this candidate here to this
one, because it has fewer deletions. So it violates that constraint
less. And that's the winner. It's the correct pronunciation. And so this
way of rendering the calculation has produced the right result.
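The elimination procedure just walked through can be written down as a short routine: go down the ranking and, at each constraint, keep only the candidates with the fewest violations among those still in the running. The violation counts below are reconstructed loosely from the description above rather than copied from the published tableau.

```python
# Sketch of optimality-theoretic evaluation by successive elimination. The
# candidate forms echo the Lardil example, but the constraint names and
# violation counts are schematic stand-ins, not the published analysis.
RANKING = ["*COMPLEX-CODA", "CODA-COND", "FREE-V", "DONT-DELETE"]

tableau = {
    "munkumunku": {"*COMPLEX-CODA": 0, "CODA-COND": 0, "FREE-V": 1, "DONT-DELETE": 0},
    "munkumunk":  {"*COMPLEX-CODA": 1, "CODA-COND": 0, "FREE-V": 0, "DONT-DELETE": 1},
    "munkumun":   {"*COMPLEX-CODA": 0, "CODA-COND": 1, "FREE-V": 0, "DONT-DELETE": 2},
    "munkumu":    {"*COMPLEX-CODA": 0, "CODA-COND": 0, "FREE-V": 0, "DONT-DELETE": 3},
    "munku":      {"*COMPLEX-CODA": 0, "CODA-COND": 0, "FREE-V": 0, "DONT-DELETE": 5},
}

def ot_evaluate(tableau, ranking):
    """Work down the ranking, keeping only the least-offending survivors."""
    survivors = set(tableau)
    for constraint in ranking:
        best = min(tableau[c][constraint] for c in survivors)
        survivors = {c for c in survivors if tableau[c][constraint] == best}
        if len(survivors) == 1:
            break
    return survivors

print(ot_evaluate(tableau, RANKING))   # {'munkumu'}, the attested nominative
```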
And if we look at different stems, other things come into play. So the
really short stems are evaluated -- there's a constraint that says that no
word of the language can have less than two units of weight, which this does,
without going into details of weight. So even though the no-final-vowel
constraint here in the nominative doesn't like this, eliminating that vowel
drops the word to subminimal length, and that violates a higher ranked
constraint, and so it's not a better option. It's a worse option.
And similarly since these stems are already subminimal, they already violate
the minimum word, you have to add something to get them up to the point where
they no longer violate that constraint, and you end up through interaction of
these constraints -- I won't go into it -- adding a consonant and a vowel to
do so, at least in this case. Not in all cases.
But so the point is that if you look at this pattern, you'll see that there's
an interesting thing about constraints and optimality theory, and that's
usually described by saying the constraints in the theory are violable. So
there were constraints in grammars before, but they were always inviolable.
You violate a constraint, you're out. That's always how it was.
But these constraints are called violable. And what you see is that here
what we have is the constraint operating in the normal way. It rules one of
these possibilities out because it violates that constraint. And here,
though, what we see is that the form that's actually pronounced, the optimal
structure, the pronounced form violates that same constraint. And the
difference is that, okay, so that constraint is the constraint that says
don't insert a vowel. The violation is fatal in the munkumunku stem, but
it's not fatal, it's tolerated, in the real stem.
And the reason is simple. There is a better option here. There are options
that don't violate it, and they're preferred, and they win out, causing that
one to lose. Here there are no options left that are better than this.
They all violate the constraint, and equally. So none of them are eliminated
from the running at this point. Elimination has to come later when they're
not all evaluated the same.
So it's about whether there's a better option. So if there's no better
option, constraints are violated in the optimal forms, the highest harmony
forms, the grammatical forms.
>>: Do you have a good example of this type for English phonology where you
show that this grammar is much simpler [inaudible]?
>> Paul Smolensky: Well, the goal isn't to make the grammar simpler with
respect to a single language. The goal really is to make it capable of
explaining all the patterns that you see across languages. So it's a
grammatical approach that's really aimed at universal grammar. It really
focuses on a single phenomenon across all the languages rather than all the
phenomena across and within a single language.
>>: So you're saying that these constraints that you prepared are not only
for Lardil but also for English, for Chinese [inaudible]?
>> Paul Smolensky: Yeah, that's right. With one exception, which is
unfortunate: the nominative constraint is an idiosyncratic property of this
language, that nominative forms shouldn't end in vowels. I don't think
that that's a universal constraint, and so that's an undesirable aspect of
this analysis but one that I don't know how to eliminate.
Okay. So contributions. As I said, the formal theory of typology is what I
consider the main one; that we now can take a proposal for a theory of
syllables or whatever and look at the implications across all the world's
languages by re-ranking the constraints that are being proposed to see what
possible languages are predicted.
And there are lots of universals that need to be explained. So this online
archive, at the time I checked, which is now a while ago -- I should check
again -- had over 2,000 universals indicated. So there's a lot of stuff that
needs to be explained, and optimality theory is well positioned to do that.
Another aspect of the restrictiveness of analysis is that because the
constraints of the ideal analysis in optimality theory are universal, you
can't just make up constraints because they are convenient for the particular
phenomenon in the language you're looking at. They need to be responsible
for the implications that they have when you put them in the grammars of all
the other world's languages as well.
So that makes it harder to analyze rather than easier to analyze a phenomenon
in a given language. And that's usually regarded as a good thing in
linguistics, even if not in other fields. It turns out that there's been
quite a lot of productive work on learning theory, which primarily asks the
question: How can the aspect of a grammar, which is not universal -- namely,
the ranking -- be learned from data that you might be exposed to as a
learner.
And there are other contributions.
15 of them are described in this book.
So I want to take the opportunity to move on to a particular application,
which is in machine translation, which I just became aware of the day before
I created this talk. So my knowledge of this paper is definitely not
very deep, but I will tell you about it.
It uses optimality theory, it says. That's actually not the way I would say
it. But it uses it for improving machine translation in virtue of handling
[inaudible] vocabulary words in a very nice way.
So it turns out that a lot of words in any -- in most languages are borrowed
from other languages. And you might think, well, everybody is borrowing
words from English. English is everywhere. All languages are borrowing from
us. But we certainly did our share of borrowing already.
So you look at this picture: English is supposed to be a Germanic language;
well, that's 26 percent of our words. Another 29 percent from French, Latin, and
other languages. So lots and lots of borrowing in the history of our
language.
And that's not atypical. But what's tricky is that when a word is borrowed
into another language, the resulting thing, a loanword, is modified to fit the
phonology and the morphology of the recipient language. It's not just taken
in whole hog; it's modified to fit the requirements of being a word in that
language.
So here's a picture from that paper that shows some borrowings of the word,
well, you can probably guess what falafel means in the Hebrew form, but you
can see that quite a few changes are made in the course of borrowing. Ps and
Fs change. L gets somehow from the end of the word into the first syllable
of the word. Long-distance liquid metathesis of sorts. And vowels change.
So lots of things which are understood in terms of principles of phonology
and the constraints in optimality theory can help us understand what's going
on when a word is borrowed into another language.
The simple view is you have a ranking of constraints, which is the one that
applies to your language, another word comes in from another language, and
you submit that to the constraint hierarchy and it doesn't come out as
optimal on its own. What comes out as optimal is a modification of it.
That's the simple picture of how words get adapted from optimality theory
perspective.
And the useful thing for the application [inaudible] machine translation is
that resource-poor languages often have a lot of words that are borrowed from
resource-rich ones.
And so an example that they look at is doing translation between Swahili and
English. This is a low-resource situation which benefits from the fact that
a high-resource pair, Arabic-English, provides lots of opportunities for
analyzing out-of-vocabulary items in Swahili which are not known in advance as
to what their counterparts in English are.
But knowing -- being able to deduce that these words are borrowed from
Arabic, and since we have a lot of information about the relation of these
words to English, in some sense we know what they mean, then we can take
those as plausible candidates for what these words in Swahili mean.
>>: Some pieces at the level of lexicon [inaudible] phonology.
>> Paul Smolensky: Sorry? Lexicon?
>>: Yeah. So the new word is defined in terms of spelling [inaudible] to
phonemes.
>> Paul Smolensky: Well, there are some situations when it -- when
borrowings are affected by spelling. I'm not sure that that's true for
Swahili and Arabic.
>>: Okay. So mostly what you're talking about is the foreign words, putting
into a new word based upon [inaudible].
>> Paul Smolensky: Yeah. Right. That's the case that this method will work with.
>>: So the form of the word in the end depends on the order in which the
borrowing happens? Or is it pretty much independent of it?
>> Paul Smolensky: I think it does depend on the ordering. You can well
imagine, for example, just to take one trivial example, is if there's a
certain phoneme that's absent from the first language that borrows it, it
might just disappear. And then even if that phoneme is present in the second
language, it doesn't have a chance to come back.
>>: After phoneme is determined according to optimality theory, it still has
to create the spelling of the phoneme.
>> Paul Smolensky:
We're not concerned about spelling I don't think.
>>: Oh. Oh. So this for [inaudible] for the new word between the formal
spelling of it [inaudible] to give you a new word.
>> Paul Smolensky: Right. It's true that we have to have a way to access
the pronunciation from the resources that we have. And if the resources are
written, then we do need to know how to translate the -- yeah, put it back
into a phonetic form.
>>:
And the paper doesn't focus on [inaudible]?
>> Paul Smolensky: I don't remember that they talked about that, but they
might have and --
>>: Okay.
>>: So they try to infer the history of the changes in the ordering of the
universal constraints in order for a word to become [inaudible]?
>> Paul Smolensky: So they just considered a pair of languages. So a
borrowing from Arabic into Swahili without trying to reconstruct the history
or anything like that, which linguists certainly love to do, but that's not
what their program does yet anyway.
>>: [inaudible] particular languages, they could just be going through the
constraints, the same -- the ordering changed this way. If you change the
ordering of the constraints in this way, and then change again the ordering
of the constraints in this way, the minimum number of constraint changes, the
order -- the ranking changes that are necessary to go from here to there, you
could be kind of integrating over languages, you don't have to actually go
through the specific ones.
>> Paul Smolensky: Yeah, that's very interesting, and much more interesting
than what they are doing here actually. Because they're not looking at
changes in the constraints from one language to another. They're just trying
to figure out how should the constraints of Swahili be weighted so that when
Arabic forms are stuck into that constraint hierarchy what comes out best
approximates the words that we actually see in Swahili --
>>: [inaudible] the word [inaudible] from one language to another.
>> Paul Smolensky:
Yes, that's right.
That's right.
>>: So this process is quite intuitive [inaudible] if I want to borrow a new
language, say English, for example, first of all, [inaudible].
>> Paul Smolensky: Right.
>>: And that implies that you minimize violation of [inaudible].
>> Paul Smolensky: Right.
>>: I see. Okay.
>> Paul Smolensky: Right.
>>: But I thought that alternative phonology can give you the same thing.
Will it?
>> Paul Smolensky: There are theories of loanwords in generative phonology
too. Yep.
>>: So optimality theory is just simpler than generative? With generative
you're also told the [inaudible] you don't put that [inaudible].
>> Paul Smolensky: The --
>>: It's just simplicity-wise [inaudible] or...
It's just simplicity-wise [inaudible] or...
>> Paul Smolensky: Well, the method that they used has a straightforward
means of doing the learning required. I don't know how easy it would be to
do the learning required if we were dealing with generative phonology.
So here's what they say about the results, and then I'll say a bit more about
what you just pointed out. So the features are based on universal
constraints for optimality theory. So what they're using really is a Maxent
model. They're using something that's like harmonic grammar where the
constraints have weights. They're not really using optimality theory. But
they're taking all the constraints from optimality theory, and that's what's
doing the work for them, is knowing what the constraints are.
So they take the constraints from optimality theory, but they treat them as
having weights, but then they use Maxent learning procedures to learn those
weights in order to optimize the performance on some very small training set
of parallel text or whatever parallel language form they have.
So they say that their model outperforms baselines that don't have linguistic
input in them in the same sort of way, and that with only a few dozen
training examples of borrowings, they can get good estimates of the weights
of all of these constraints and then make good predictions about what words
were the source in borrowing in Arabic for the [inaudible] vocabulary items
in Swahili. So they show that this actually does give rise to a measurable
improvement.
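As a rough sketch of the kind of model being described: a MaxEnt (log-linear) model whose features are constraint-violation counts, with the weights fit to a handful of attested borrowings. The constraint names, feature values, and training loop are illustrative assumptions, not the paper's implementation.

```python
# Sketch of a MaxEnt model over candidate adaptations, with OT-style
# constraint-violation counts as features. All names and numbers here are
# placeholders for illustration.
import numpy as np

CONSTRAINTS = ["IDENT-PLACE", "IDENT-SONORITY", "NOCODA", "ONSET"]
# One training item: candidate adaptations of a single source word, each with a
# vector of violation counts, plus the index of the attested borrowing.
training = [
    {
        "violations": np.array([[0, 0, 2, 0],    # faithful candidate, keeps both codas
                                [1, 0, 0, 0],    # attested form, repairs codas but changes a place feature
                                [0, 1, 1, 0]]),  # another conceivable repair
        "attested": 1,
    },
]

w = np.zeros(len(CONSTRAINTS))      # constraint weights to be learned
lr = 0.5
for epoch in range(200):
    for item in training:
        V = item["violations"]
        harmony = -V @ w            # weighted violations lower a candidate's harmony
        p = np.exp(harmony - harmony.max())
        p /= p.sum()                # MaxEnt probability over the candidate set
        target = np.zeros(len(p))
        target[item["attested"]] = 1.0
        w += lr * ((p - target) @ V)   # gradient ascent on the log-likelihood
        w = np.maximum(w, 0.0)         # keep weights nonnegative, as in harmonic grammar

print(dict(zip(CONSTRAINTS, np.round(w, 2))))
```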
>>:
[inaudible] first talked about the harmonium language is just numbers.
>> Paul Smolensky: Yes. That's what they're doing. But they're rightfully
pointing out that the power of the method doesn't come -- the new power
doesn't come from the fact that it's a Maxent model. Everybody uses those.
What's new about it is the input that comes from optimality theory. So I
think it's legit in that sense.
So universal constraints straight from OT phonology. So here are the
constraints they actually use. I told you about faithfulness constraints and
markedness constraints partly because of this.
Here's a list of very standard kinds of faithfulness constraints that refer
to features of phonemes, like don't change the place of articulation of a
phoneme, don't change the sonority of -- that means like go from a stop to a
liquid, from ta to la. And so all of these things are just very much bread
and butter kind of faithfulness constraints in optimality theory.
And the markedness constraints will look familiar. Syllables must not have a
coda. Syllables must have onsets. The basis for syllable theory. This is
about the sequencing of consonants within syllables, which is a very
well-known --
>>: [inaudible] just partly not true for English [inaudible].
>> Paul Smolensky: All the constraints can be violated. All the constraints
in optimality theory can be violated. But no coda is a constraint that is
operative in English. It's just low enough rank that we have lots of other
constraints that outrank it and force codas to come into existence.
>>: Okay.
>> Paul Smolensky:
So the -- so I think you see how the game is played here.
>>: So then the question arises why do we have different languages? How
come? How come we can't understand each other when speaking different
languages? If these constraints are universal, why are we not aware of them
and why are we not using them optimally while talking to each other?
>>:
[inaudible] the second language.
>> Paul Smolensky: Well, I mean, the -- the difference in grammars is
re-ranking, but of course the difference in vocabularies is another matter
altogether. So if you knew all the words in another language as a start,
that would help. There are -- there is some reason to believe that the
process of learning a second language can be modeled by taking the ranking of
constraints that you have from your first language and modifying it in order
to produce the forms that you need to be able to now produce.
And so you get some sort of transfer benefit from having the constraints
installed in your original grammar, but the reason that there is not just one
language is there seems to be no grounds for ranking the constraints one way
as opposed to another, at least in a comprehensive way.
>>: I guess it seems like maybe there are different levels of things that
are actually happening in the brain. This seems like a very good way to
represent knowledge. But then the fact that you have the representation of
knowledge doesn't mean that you can easily access it and use it. And so
there is almost --
>> Paul Smolensky: But we can easily access and use it but not consciously.
>>: Well, but it's very difficult to see these things between the words even
in related languages. Like if you -- if you speak a Slavic language, you can
probably read most other Slavic languages, but you'll take a lot of time.
You really have to focus. So it doesn't come easily.
So it seems like there is another level of processing where maybe this
universal knowledge is kind of realized in a particular form and that
particular form is accessed, not universal knowledge and so on, that it seems
like the knowledge is represented in a way which is not easily accessible.
At high speed, at least.
>> Paul Smolensky: It could be that we have something like a kind of
compiled version of a constraint hierarchy that builds our own language's
rankings in and therefore is not in a form that's readily modifiable to
constitute the compilation of some other ranking. That's certainly possible.
And there's some reason I think that's true.
But everything is relative. I mean, you say it's hard, but I think it's
absolutely astonishing that you can take another Slavic language and with any
degree of effort read it. I mean, so --
>>: Yeah, but the thing is this difference in performance is interesting
from sort of a practical point of view. It reminds me of Boltzmann machines
and harmonium. You can between these models with lots of -- and they're hard
to train, but once you train them, you can sample from them. But it's hard
to sample from them. So there is this dichotomy with these representations
[inaudible] --
>> Paul Smolensky: It's hard to sample from them, you say?
>>: Yeah.
>> Paul Smolensky: Why do you say that?
>>: It's hard to general -- it's hard to generate a -- well, if you train
the model, an RBM, for example, on the amount of data -- on a set of data,
and now you want to generate the sample that's similar to that data, you
actually have to go through MCMC process for a long time, whereas the model
itself is very simple in terms of inference, procedure, and so on.
So there's a dichotomy there that it's very difficult to kind of unroll, the
knowledge is kind of wrapped, but to actually extract examples becomes hard.
And so then in language, for example, it's nice to represent languages in
terms of constraints, but to actually satisfy these constraints while we're
talking is difficult. And it almost seems like it isn't that you only have
the constraints or only have grammars, it seems like maybe grammar or
something akin to that in sequential processing is necessary for you to
express, to generate from the model, whereas just satisfying constraints on
the fly is hard.
>> Paul Smolensky: Well, it sounds reasonable, but the more the form of the
mental grammar is sort of compiled in a language-specific form, the more it's
difficult to explain why you can explain the difference in possible and
impossible languages in terms of whether there is a ranking or not. So, I
mean, you have to believe that somehow we have these language-specific
compiled forms, but they all originate from some sort of more universal form
and then get specialized ->>: [inaudible] everything you said. I'm just wondering beyond that once you
start thinking about how the real systems work you could have these like
versions of compiled and exploded versions of things. Like instead of having
your constraints, they have this constraint in many different conditions
exploded in the memory somewhere so you can just pull that off instead of
trying to satisfy the constraint and so on.
>> Paul Smolensky: Right. Right. I mean, it's a perennial issue in
psycholinguistics to do with phonology about how much grammatical knowledge
is used on the fly anyway, given that there's a limited set of words compared
to the unlimited set of sentences. You can imagine storing preprocessed,
pre-grammatically processed forms in abundance.
>>: Right. So maybe even grammar doesn't [inaudible].
>> Paul Smolensky: Well, it does in the --
>>: [inaudible] it doesn't work well.
>> Paul Smolensky: It does in the sense that we -- well, I mean, you can
also see that people have the general knowledge of the constraints as they
work in their language as well, because it's not -- they can take nonword
forms and tell you what the right thing to do with them phonologically is.
So they have a way of getting general predictive value out of what they know.
>>: So the other question about the theory, the [inaudible] theory or even
the weight version of it with using these constraints, universal constraints
and so on, I assume there must have been work on the evolution of languages
and the analysis of what happens when you change the rankings slightly and then
everything changes in the language dramatically and these sort of analyses.
>> Paul Smolensky: There have been analyses about historical change as
re-ranking of constraints indeed. Yes. Once you have a closed theory of all
the possible languages, then you know that the historical path has to somehow
be in that space.
>>: Right. So has there been an example where the paths have changed using
this theory as opposed to the previous work on [inaudible] languages?
>> Paul Smolensky: Oh. Good question. I don't really know enough about the
historical literature to know. That's a very good question, interesting
question. I'd like to find the answer.
There was more work of that sort that I saw earlier on in the theory than
there has been later. So it could be that those kind of questions have not
been as much [inaudible] as other questions have been.
All right. So let's not go through this here. Let me move on to the dynamic
simulations in networks finally. So the question is what does this kind of
[inaudible] business that we saw just now for munkumunku or the one that we
saw earlier for it rains look like at the neural level. So here's our answer
to that at this point in the development of our theory.
So in thinking about this case, where we have candidates like it rains to
decide among, we have these symbol structures at the higher level, the
macroscopic level of description, and they get mapped down to activation
patterns at the lower level by our isomorphism here.
There are -- here are four such possibilities for how you might want to
pronounce what we say is it's raining and it rains. And each of these is
mapped onto a particular point in the space of activation patterns, which is
an R-to-the-N continuum here. And so we often refer to these points
collectively as the grid. They are the points -- they are the activation
patterns that are the realization of exact symbol structures like the ones up
here, the grid of discrete states, D.
Now, here's the story. Here's our harmony map. So this is the activation
space, the particular points that correspond to these particular discrete
structures that are the embedding of those structures. The harmony values
for these structures in a harmonic grammar look like -- where am I? Here's
the cursor -- look like this. This is a ranking of the constraints that
corresponds to the English form. So the highest harmony option of these is
the one for it rains. The pattern of heights would be different for the
Italian grammar. That's the optimal one --
>>: How is this computed [inaudible] exponential maximum entropy formula
[inaudible]?
>> Paul Smolensky: Here are the harmony values that we want. And so what we
have to do is we have to figure out how to put connections in the network
whose states are described here.
>>:
Okay.
>> Paul Smolensky: What weights to put in the network so that when we look
at them at the higher level we get numbers like this.
>>:
I see.
Okay.
So that's exponential of the number [inaudible].
>> Paul Smolensky: The exponential only comes in when we go to probability.
So right now we're just talking about the raw formulas for harmony measure.
>>: Okay.
>> Paul Smolensky: So there's an exponential there.
>>: But it's very arbitrary.
>> Paul Smolensky: Hmm?
>>: It's quite arbitrary.
>> Paul Smolensky: Yeah.
>>: But don't you want to quantify, saying that [inaudible] three, sometimes
four, sometimes five [inaudible]?
>> Paul Smolensky: [inaudible] implicitly saying is definitely apropos here,
which is that if we had some probabilistic data available to us about the
likelihood of suboptimal states, then we could quantify what these numerical
values really ought to be. Because exponentiating these things should tell
us about the relative probabilities.
And so if we had data like that, we could make these nonarbitrary. So if you were
working with a corpus, then it wouldn't be arbitrary in the same way that it
is here. All right. So these are the harmony values on the grid of discrete
states. But there's the harmony surface over the whole continuum of states
in the space. It's just a quadratic because the harmony function is a
quadratic form.
And here's our system which we can think of as a kind of drunken ant climbing
up this harmony hill. So it's wandering around on average going up. But
because there's a stochastic component to the dynamics, it's only going up on
average. And if the temperature is going down as it proceeds, then it will
be --
>>: What do you say is harmonic? To me the harmony number is arbitrarily
defined. You said minus 3 rather than minus 2. As long as the order is
right [inaudible] H. So why do you say that it's quadratic? I don't see
[inaudible].
>> Paul Smolensky: The -- this is the formula for the harmony of all of the
states here. These are not states in the [inaudible]. The harmony of all
these states here.
>>:
Oh, I see.
Okay.
>> Paul Smolensky: So you take an arbitrary activation vector, its harmony
is A transpose WA, where W is the weight matrix of the network.
>>: Okay. That's a neural network you define.
>> Paul Smolensky: Yeah.
>>: Okay. Okay.
>> Paul Smolensky: Yeah. Yep. So I'm freely going back and forth between
neural harmony, which is what this is picturing, and symbolic harmony. They
are the same when it comes to the points on the grid. Numerically the same,
that is. Because we program the network to make that true.
Okay. So I was drawing a picture of the optimization dynamics here. It's
stochastic gradient ascent in harmony. However, what you'll notice is that
the optimum in the continuous space is actually here. So this is the highest
harmony grid state, but it's not the highest harmony state in the entire
continuous RN. So it's almost always the case that in RN the best states
blend together discrete states and aren't a single one.
So what I mean by a blend is that this state here is a linear combination of
these guys with some weighting. And so in order to actually compute what we
want, which is this guy here, not that, assuming that what we want is the
real discrete optimum from the higher-level grammatical picture, then we have
to do something else in addition, which is the quantization dynamics, which
looks like that.
So the quantization dynamics is constantly pushing the system towards the
grid. And it has an attractor at every grid point. These are the boundaries
of the attractor basin. So any point in this area gets pushed to that by the
dynamics that we call quantization.
So we actually have quantization and optimization working at the same time
because there are two things we want. One is to end up with a discrete state
and, two, to end up with the highest harmony discrete state.
And so a schematic picture of what's going on during the processing is we
turn the quantization up and we turn the temperature down. But the picture
here will show you about the quantization getting stronger.
So quantization is off in terms of the harmony function here. But once
quantization starts getting turned on during the computation, you see that
the landscape shifts underneath our ant so that the points that are not
discrete grid points have their harmony systematically pushed down the more
so the stronger quantization is. So there's a quantization parameter Q
that's being raised as the computation goes on and the landscape is changing
this way.
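A one-dimensional sketch of the combined dynamics might look like the following: the total harmony is the grammar's quadratic harmony plus q times a quartic quantization term with peaks at the grid values (here 0 and 1), and the run raises q while lowering T. The functional forms and schedules are assumptions for illustration, not the model's exact equations.

```python
# One-dimensional sketch of optimization plus quantization: total harmony is
# the grammar harmony plus q times a quartic quantization harmony peaking at
# the grid values 0 and 1. q is ramped up while the temperature T is annealed
# down. All forms and schedules here are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(2)

def grammar_harmony(a):
    return -(a - 0.7) ** 2                  # toy quadratic harmony, maximized at 0.7

def quantization_harmony(a):
    return -(a ** 2) * (1 - a) ** 2         # quartic with maxima at the grid points 0 and 1

def total_grad(a, q, eps=1e-4):
    h = lambda x: grammar_harmony(x) + q * quantization_harmony(x)
    return (h(a + eps) - h(a - eps)) / (2 * eps)   # numerical gradient, for brevity

a, dt, steps = 0.5, 0.01, 20000
for step in range(steps):
    t = step / steps
    q = 10.0 * t                            # quantization strength ramps up
    T = 0.1 * (1.0 - t)                     # temperature anneals down
    noise = np.sqrt(2.0 * T * dt) * rng.normal()
    a += dt * total_grad(a, q) + noise

print(round(a, 3))                          # ends near grid point 1, the higher-harmony discrete state
```

Because the toy grammar harmony peaks at 0.7, raising q pulls the state toward the nearby grid point 1, the higher-harmony discrete state, which is the behavior described in what follows.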
And what we want is for our ant, which started off on that nice hill, doing
stochastic gradient ascent as the cardboard is falling out below him. We
want the ant to end up right through there at the peak which is the tallest
peak.
And you can see that this is a miserable landscape to try to optimize. If
you were right there, your chances of finding the global optimum would be
pretty slim. And so the idea is that by gradually turning the landscape into
this [inaudible] while the network is going after the highest harmony area
that we will be able to have our cake and eat it, too, end up in a discrete
place and end up in the best discrete place.
And what we want really is that in the finite temperature case, if we want to
have a distribution of responses to have a model of what probabilities are
assigned to ungrammatical forms as well as just identifying what the grammatical
ones are, then what we would like is that the probability that the ant ends
up at a particular peak should be an exponential function of the harmony of
that peak for some small but finite temperature. That's what we want. And
I'll show you a theorem in a moment.
The picture here has been a schematic picture. What we have here is an
anatomically correct picture of the harmony landscape associated with
quantization. The quantization dynamics is a gradient dynamics on this
function. And if you want to see what it looks like more algebraically and
in one dimension, it looks like that, which I'm sure you'll recognize
immediately as the Higgs potential from Higgs boson theory, upside down.
And so now we're getting into harmony functions that are fourth order, not
second order, so that we can get shapes like this. We can't do anything
better than a single isolated optimum if we stick to quadratic harmony
functions. So the quantization harmony is a higher-order function that allows
for these multiple maxima at the different grid points.
So here's what it looks like. We can come back if somebody is fascinated by
such things.
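For reference, one quartic with exactly the properties being described -- peaks at the grid values and an inverted double-well shape in between -- is, in one dimension (the exact form used in the speaker's own work may differ):

    % One quartic quantization harmony with the stated properties (an assumed
    % form, not copied from the slide): peaks at the grid values a = 0 and
    % a = 1, with an inverted double-well shape in between.
    \[
      H_Q(a) \;=\; -\,a^{2}\,(1-a)^{2},
      \qquad
      \frac{dH_Q}{da} \;=\; -\,2a\,(1-a)\,(1-2a),
    \]
    \[
      H_{\text{total}}(a) \;=\; H_G(a) \;+\; q\,H_Q(a),
      \qquad q \to \infty \text{ during the computation.}
    \]

Multiplying H_Q by the growing parameter q and adding it to the grammar harmony is what progressively pushes the off-grid states down in the pictures above.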
But here's the anatomically correct version of the picture I showed you
schematically a moment ago as quantization increases. There's the original
harmony surface at the top, just the quadratic --
>>: [inaudible].
>> Paul Smolensky: -- [inaudible] quantization is off -- yes.
>>: [inaudible] I think the whole purpose is to show that "it rains" is
meaningful [inaudible] so they are not feasible. But you don't need to do
this [inaudible]; just by your original, you know, constraint ranking, you
can really get the solution.
>> Paul Smolensky: We know what the solution is.
>>: [inaudible].
>> Paul Smolensky: That doesn't mean we can generate it.
>>: Then why do you have to do all [inaudible]? What is the purpose of doing
these dynamics? You know the solution already based upon OT --
>> Paul Smolensky: You are a hard man to please. So like four slides ago you
said, why do we need this higher-level stuff? We've got the network. Why
wouldn't you use the network? Right? Now you're saying, what the hell do we
need the network for?
>>: I see. So I'm just -- even one of them [inaudible]; you don't need both.
>> Paul Smolensky: You need both for the following reason. The high-level
picture tells you what you want the output to be.
>>: Okay.
>> Paul Smolensky: It tells you what it should be. But it doesn't tell you
how to get it. And the network's job is to get it.
>>: Right. Okay. I see.
>> Paul Smolensky: Right. If you want to be a little bit more algorithmic
about it --
>>: Sorry? Now you're talking about neural implementation of [inaudible]. I
mean, [inaudible] only symbol. So that's not linguistic anymore, right,
you're not -- so one is to explain how that number -- [inaudible].
>> Paul Smolensky: For the moment that's so, yes.
>>: Okay.
>> Paul Smolensky: Yes. I mean, what I mean by "for the moment" is that, in
the talk, that's so. Yeah. So a more algorithmic way of saying it is that if
you had the luxury of examining every lattice point, computing its harmony,
and finding the one that has the highest, then that would be a way of getting
the answer. But there are exponentially many of those to examine, and our
network is just doing gradient ascent and ending up at the global maximum
peak.
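The "luxury" baseline just mentioned -- enumerate every lattice point, compute its harmony, keep the best -- would look something like the following sketch. The binary grid and the random symmetric weights are invented; the point is only that the loop runs over 2^n states.

    # Brute-force lattice search over an invented binary grid {0,1}^n:
    # score every grid point by its harmony and keep the argmax.
    import itertools
    import numpy as np

    def harmony(a, W):
        return float(a @ W @ a)

    n = 12
    rng = np.random.default_rng(1)
    W = rng.standard_normal((n, n))
    W = (W + W.T) / 2.0                     # symmetric weights

    best_state, best_h = None, -np.inf
    for bits in itertools.product([0.0, 1.0], repeat=n):   # 2**n grid points
        a = np.array(bits)
        h = harmony(a, W)
        if h > best_h:
            best_state, best_h = a, h

    print(best_h, best_state)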
>>: [inaudible] once you get all this stuff set up, you don't need to go
through the ranking of the violations of the constraints; you can just run
the dynamics, and eventually you come up with the same number.
>> Paul Smolensky: I think that's right.
>>: Yeah.
>> Paul Smolensky: Right.
>>: Yeah. So -- [inaudible].
>> Paul Smolensky: But the dynamics is doing gradient ascent in a surface
which is determined by the strengths of the constraints, by the weights of
the constraints. So it's not by any means independent of the weights. It's
not like you don't need the weights.
>>: I see. Okay.
>> Paul Smolensky: They're giving you the shape that you're crawling on top
of.
>>: I see. Okay. So the purpose is that after you run the dynamics, you can
throw away the constraints and then you come up with [inaudible]. Is that the
purpose of doing this? On the linguistic level, you already got the
constraints and you know that as a --
>> Paul Smolensky: Okay. So --
>>: Now you throw away that, this neural simulation, and then you try to do
[inaudible] gradient ascent, and the hope is you [inaudible].
>> Paul Smolensky: Yes.
>>: Whatever number you get.
>> Paul Smolensky: I think that -- think that's right. So when we go from
linguistics to cognitive science more generally, to psycholinguistics, then
what we want is a model of the process that the brain is going through, or
that the mind is going through, while you're forming an utterance in
phonology or while you're comprehending a sentence in syntax. And so this is
intended as a model of what's going on --
>>: The process.
>> Paul Smolensky: The process that's -- [inaudible]. Yeah.
>>: Okay. Okay. I see.
>> Paul Smolensky: Right. Which you don't get from the symbolic description.
>>: Okay. I see. Okay.
>> Paul Smolensky: That tells you what the end state should be, but it
doesn't tell you how you get there and --
>>: [inaudible].
>>: So the optimal solution here would be to have "it" before "rains"
[inaudible] 54 percent of "it" before "rains" and 20 percent after "rains".
That would be the optimal solution. But since that's not acceptable, we have
to go to grid points. So you're trying to steer the gradient ascent towards a
point. And you can't -- I suppose when it's complex enough, you can't just
simply find the solution and quantize it at that point.
>> Paul Smolensky: That's right.
>>: Because you're not going to get [inaudible].
>> Paul Smolensky: Unfortunately.
>>: [inaudible].
>> Paul Smolensky: Right. So this is not necessarily in the right basin of
attraction --
>>: Yeah.
>> Paul Smolensky: -- of the global maximum. Or yeah. So --
>>: So that's going to the question I was asking before. Is there some more
efficient way of using the structure of these constraints to more quickly get
to the answer [inaudible]? Like is it possible to learn how to jump through
these modes in some better way? Is that actually what the machine learning
algorithms are doing these days?
>> Paul Smolensky: I don't know.
>>: Are these --
>> Paul Smolensky: You can tell me.
>>: [inaudible] the exponential harmony function is a [inaudible] function.
So there should be a much better way.
>> Paul Smolensky: Well --
>>: But the question is not --
>> Paul Smolensky: This surface -- this surface is not a simple kind of
convex optimization problem, is it?
>>: No, it's not.
>> Paul Smolensky: Is it?
>>: [inaudible] I thought that you wrote it down to be e to the minus H
divided by T, and H is quadratic. That makes the whole thing [inaudible].
>> Paul Smolensky: H is quadratic for the grammatical harmony, but it's not
quadratic for the quantization harmony. The quantization harmony is fourth
order and it's designed specifically to have peaks at all the discrete --
>>: Okay. Okay.
>> Paul Smolensky: -- at the exponentially many combinations of
constituents. So it's constructed to have an attractor, a maximum at every
grid point. And there are lots of them.
Okay. So I don't know the answer to your question, but -- and I don't have
the training to think about it very intelligently. So if you do, I would
love to hear about it, talk about it. Yes.
>>: Just want to make sure I'm not too lost, but maybe I am. So the weights
here are precalculated, learned or hard coded or whatever.
>> Paul Smolensky: They're held fixed while all of this is going on. This
is --
>>: So this refers to [inaudible] the inference, not the learning.
>> Paul Smolensky: Precisely. That's right. That's right.
>>: Thank you. That's good. Thank you.
>> Paul Smolensky: But because the grammatical constraints are part of a kind
of maximum entropy model, methods for learning their strengths that are used
all the time can be applied here. That's what was done in that paper on
[inaudible] vocabulary that I mentioned.
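For concreteness, here is a hedged sketch of the standard maximum-entropy recipe being alluded to (not necessarily the exact method of the paper mentioned): constraint strengths are fit by gradient ascent on the log-likelihood of a log-linear distribution over candidates, where the gradient is observed minus expected constraint scores. The toy candidates, violation counts, and frequencies are invented.

    # Maximum-entropy fitting of constraint strengths: candidates are scored by
    # harmony H(x) = sum_k w_k * f_k(x), with f_k(x) = -(violations of k);
    # p(x) is proportional to exp(H(x)); the log-likelihood gradient is
    # (observed feature values) - (expected feature values).
    import numpy as np

    # Invented toy data: 3 candidates x 2 constraints.
    F = np.array([[ 0.0, -1.0],    # candidate 0 violates constraint 2 once
                  [-1.0,  0.0],    # candidate 1 violates constraint 1 once
                  [-1.0, -1.0]])   # candidate 2 violates both
    observed = np.array([0.6, 0.3, 0.1])   # hypothetical winner frequencies

    w = np.zeros(2)
    for _ in range(2000):
        p = np.exp(F @ w)
        p /= p.sum()                         # maxent distribution over candidates
        grad = F.T @ observed - F.T @ p      # observed minus expected features
        w += 0.1 * grad

    print(w)   # learned strengths; for this toy data they come out near 1.8, 1.1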
Okay. So I haven't shown you how to set the weights in the network so that
the harmony values follow the grammar that you want. And I don't have any
slides on that, but we can talk about it offline if you want.
So this is the picture of the dynamics as it goes on over time from left to
right. And each of these is the activity value of the -- how strong the
distributed representation in the network is for each of these constituents
that are possible.
So the one that is most rapidly settled on as highly active is the pattern
for the verb being filled by "rains". There's no -- the faithfulness
constraints say that that should be there, and there's nothing that conflicts
with that.
The second most rapidly settled on constituent is this one which means that
there's nothing following the verb. So this is the role, post verbal
position. This is the filler, nothing. And so nothing following the verb is
rapidly settled on. There's no -- none of the constraints want that to be
filled with anything.
And then the last one that gets decided is that in the subject position,
preverbal position, you should have the word it as opposed to the word rains
or nothing. And it's interesting that this one is the slowest to be decided
because this is the one that violates the subject constraint. But that's
overruled by the -- sorry. It violate -- it satisfies the subject constraint
but violates the full interpretation constraint. So there's conflict here,
and that slows down the decision process.
And this guy here hasn't even really turned itself all the way off the way it
should be in the end. You have to process longer to get that. This is the
one that says that there should be nothing in the subject position. So this
is favored by the full interpretation constraint. This is favored by the
subject constraint. Subject is stronger, so this one wins. But you very
much see that there's a conflict going on here.
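The constraint interaction just described can be written down as a tiny harmonic-grammar calculation. The constraint names come from the talk; the numeric weights and the candidate encodings are invented, chosen only so that the subject constraint outweighs full interpretation, as stated.

    # Toy harmonic-grammar scoring of the "it rains" example.
    weights = {"SUBJECT": 3.0,    # subject (preverbal) position must be filled
               "FULLINT": 2.0}    # full interpretation: no contentless fillers

    # violations[candidate][constraint]
    violations = {
        "it rains": {"SUBJECT": 0, "FULLINT": 1},  # expletive "it" is contentless
        "rains":    {"SUBJECT": 1, "FULLINT": 0},  # no subject at all
    }

    def harmony(candidate):
        # Harmony = -(weighted violation count); higher is better.
        return -sum(weights[c] * v for c, v in violations[candidate].items())

    for cand in violations:
        print(cand, harmony(cand))
    # "it rains" gets harmony -2, "rains" gets -3, so the expletive-subject
    # form wins -- the conflict that the activation trace above is slow to
    # resolve.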
Okay. And what I said was that what we pray for is that the probability that
an ant ends up on a particular peak, after the quantization has gotten really
strong -- the probability of ending up here -- is exponential in the harmony
of this grid point. Because we can reason about that.
So here is a theorem to that effect. Okay. So the theorem says: if you
consider our network operating at a fixed temperature, then there is a
distance such that, for all smaller distances r, for every one of the
discrete points in the grid, as the quantization strength goes to infinity,
the probability that the network state is within that distance of the grid
state -- the probability that the distance is less than r -- is exponential
in the harmony of that grid state. And, furthermore, the probability that the
state is not near any grid point at all goes to zero.
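A compact restatement of that theorem, in notation assumed here rather than copied from the slide (a is the network state, g ranges over grid states, H(g) is the grid state's harmony, T the fixed temperature, q the quantization strength, r the distance):

    % Compact restatement of the theorem as described above; the notation is
    % assumed, not taken from the slide.
    \[
      \exists\, r_0 > 0 \;\; \forall\, r \in (0, r_0):\qquad
      \lim_{q \to \infty}
      \Pr\bigl[\, \lVert \mathbf{a} - g \rVert < r \,\bigr]
      \;\propto\; e^{\,H(g)/T}
      \quad \text{for every grid state } g,
    \]
    \[
      \text{and}\qquad
      \lim_{q \to \infty}
      \Pr\bigl[\, \lVert \mathbf{a} - g \rVert \ge r \ \text{ for all } g \,\bigr] \;=\; 0 .
    \]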
So we have this nice result. The simulations don't obey the result very well,
and so we're trying to understand what the route from theory to practice is
going to be here. I mean, all these kinds of theorems are very asymptotic --
Q goes to infinity, and we need to have the network in thermal equilibrium,
and as you were pointing out, that can take a long time -- so just how to
pull it off, to make good on this theorem, is what we're working on nowadays.
So that's the story about the dynamics underlying harmony, harmonic grammar
and optimality theory.
And there's one more part of the story, which I think is brief. It's only one
slide, I think, actually. So I think there's time before three o'clock. And
that -- we've talked about how the -- we talked about continuous neural
computation that leads to discrete optimal outputs. And now I'm going to
talk about gradient optimal outputs as something that we might actually want
to use in our theory, not just get rid of. Quantization was a way of getting
rid of all of these nondiscrete states. But now we're focusing on what
actually those nondiscrete states are good for.
>>: Just one question about the previous topic. So you could try instead of
doing [inaudible] space to actually go through all possible combinations but
not -- not brute force but using samples of some sort, [inaudible] sampling
[inaudible] or something.
>> Paul Smolensky: To visit all discrete things?
>>: Yeah.
>>: [inaudible] or maximum harmony. Have you tried that? So you're only
skipping through discrete states but you're doing it in the MCMC [inaudible].
>> Paul Smolensky: If we did, which I think we probably did, it's long -- so
long ago I don't remember about it. But that is sort of a baseline that we
really need to be clear on. That's an excellent point.
>>: Then it could just be like one feed-forward calculation, right, and
winner-take-all sort of?
>>: Well, yeah, if you basically fix some variables and generate the other
ones in a way that you know asymptotically is going to lead to a state from
the network, from the energy profile, from [inaudible] to the energy profile.
So it depends on what -- for harmonium, we know how we would do it. We'd
generate the hiddens given the observeds, then the observeds given the
hiddens [inaudible], and do that for a while; then presumably that would be a
better way to generate from those constraints than simulated annealing, if
your target is discrete output. But then if you have this more general thing,
then maybe it's harder to find the right way to sample through discrete
states.
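The alternating scheme being described for harmonium is ordinary block Gibbs sampling. A minimal sketch, with invented sizes and weights rather than anything from the project under discussion:

    # Block Gibbs sampling for a harmonium (RBM) with binary units: sample the
    # hiddens given the visibles, then the visibles given the hiddens, repeat.
    import numpy as np

    rng = np.random.default_rng(0)
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

    n_visible, n_hidden = 8, 4
    W = 0.1 * rng.standard_normal((n_visible, n_hidden))  # visible-hidden weights
    b_v = np.zeros(n_visible)                              # visible biases
    b_h = np.zeros(n_hidden)                               # hidden biases

    v = rng.integers(0, 2, size=n_visible).astype(float)   # random start
    for _ in range(1000):
        p_h = sigmoid(v @ W + b_h)                # P(h = 1 | v)
        h = (rng.random(n_hidden) < p_h).astype(float)
        p_v = sigmoid(W @ h + b_v)                # P(v = 1 | h)
        v = (rng.random(n_visible) < p_v).astype(float)

    print(v, h)   # a sample that only ever visits discrete states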
>> Paul Smolensky: Yeah, I mean, since we're cognitive scientists and not
computer scientists, we have certain, you know, priorities that might not
totally make sense from a computer science perspective. So we don't think
that the brain jumps from discrete state to discrete state to discrete state
trying to find the highest harmony one. And so in a certain sense, we don't
care how good that algorithm is.
>>: Right.
>> Paul Smolensky: You could certainly argue the point. But --
>>: -- all these time steps.
>> Paul Smolensky: Well --
>>: In terms of --
>> Paul Smolensky: But we should know how good it is nonetheless.
>>: What you've got now isn't biologically plausible, is it? Right?
>> Paul Smolensky: Yeah. How many time steps it takes and so on, you mean?
>>: Especially that. Yeah.
>> Paul Smolensky: Well, there's sort of quantitative biological plausibility
and qualitative biological plausibility. And I've always been most attentive
to the qualitative part of it, believing that the longer you work to polish
it, the better the quantitative side gets. But if the qualitative side is
wrong, it doesn't matter.
>>: But is there any sort of neural net that can compute this dual
optimization we're discussing, with discrete goals? I mean --
>> Paul Smolensky: Is there --
>>: Is there a neural net model that computes this?
>> Paul Smolensky: Other than ours, you mean?
>>: Well, you're -- okay. So yours does. And it's just a neural net, it's
not --
>> Paul Smolensky: It's doing continuous -- it's following a dynamical
differential equation, which is a polynomial function, but not as low-order a
polynomial function as what you'd typically find in neural networks. That's
why I mentioned that the function that we're doing the gradient ascent on is
quartic and not just quadratic. So there's a question about the --
>>: So time steps are just like recurrent --
>> Paul Smolensky: Yeah.
>>: -- and then it converges to some answer [inaudible]?
>> Paul Smolensky: Yeah. Yeah.
>>: But the number of time steps you've seen is on the order of hundreds or
thousands?
>> Paul Smolensky: More or less. Yeah. Well, maybe I won't do this since
it's now three o'clock. That's okay. Am I giving a talk tomorrow?
>> Li Deng: Yes.
>> Paul Smolensky: Okay. Li wanted to know what I'm going to talk about at
Stanford, and what I'm going to talk about at Stanford is the next slide.
>> Li Deng: Okay.
>> Paul Smolensky: But I can do that tomorrow.
>> Li Deng: So tomorrow, Wednesday, 1:30 to 3:00. And the announcement should
go out. I'm surprised it hasn't gone out. I'm not --
>> Paul Smolensky: You were an optimist if you were surprised.
>> Li Deng: Yeah. I can't send it out myself.
>> Paul Smolensky: So 1:30 to 3:00 tomorrow afternoon.
>> Li Deng: Yes.
>> Paul Smolensky: Okay.
>> Li Deng: Same place.
>> Paul Smolensky: Yeah, I see. Okay.
[applause]
>> Paul Smolensky: Thanks. All righty.