>> Paul Smolensky: So a very quick review. So we are still dealing with these four
points that are in the review box here. We are trying to do understandable neural nets by
describing what’s going on in them at an abstract level where we can use symbolic
descriptions. We have a way of representing and manipulating combinatorial structures
in neural networks: tensor product representations. Last time we talked about the fact that
the computation over these structures when it comes to grammatical computation is a
kind of optimization. So we talked about Harmonic Grammar and Optimality Theory as
two new theories of grammar that result from that perspective.
Then today the last point is that with this gradient symbolic computation system that uses
optimization to compute over these combinatorial activation patterns we get both discrete
and gradient optimal outputs depending on how we handle the part of the process that is
forcing the output to be discrete. And we talked about the discrete part yesterday. I will
review that very briefly and then we will talk about the gradient part after that. Then I
propose to backtrack and deal with a number of topics from lectures 1 and 2 which got skipped.
So do you want to ask your question now?
>>: So my question was, yesterday in the talk you gave, the points on the grid were for representing the concept rains. You had "it rains"; there were 4 points, but there are many other things that presumably are getting considered, like "rains a tree", or "rains a drop", or "rains a bucket". How is it that we only are looking at these 4 points?
>> Paul Smolensky: Okay, well so we talked about faithfulness constraints in the context of phonology, but they are also important in the context of syntax as well. So there are different approaches to doing syntax within Optimality Theory. Actually I should have emphasized that Optimality Theory and Harmonic Grammar are grammar formalisms; they are not theories of any particular sort. So you can use Optimality Theory to do lexical functional syntax or you can use it to do government-binding syntax or minimalist syntax.
So it is not a theory of any linguistic component, it’s just a formalism for stating theories.
So one thing that syntax theories differ on in OT is how they handle the notion of
underlying form, or input. So in the more minimalist oriented ones the input is
considered to be something like the numeration in minimalism where you have a set of
words that are available to you. And in that context the generator of expressions that
produces these things, which is one of the components of an Optimality Theory grammar
would only produce things using the words in that numeration. And if “Bill” is not in it
then “Bill” won’t show up in any of the candidates.
The approach that I have pursued with Geraldine Legendre, which is not so far from the one that Jane Grimshaw pursued, is OT syntax of a more government-binding sort.
You start with a logical form as your underlying form, what you want to express and then
these are surface expressions as well. They are surface forms in the sense that they have
tree structure associated with them. And in our point of view there are also important
faithfulness constraints at work in syntax that are requiring some kind of match between
the output expression and the input logical form.
So if “Bill” is not in the logical form of what you are trying to express and “Bill” is in the
output then it will violate a faithfulness constraint, unless that violation allows it to avoid some other violation, and sticking "Bill" in generally won't do that, unless you stick it in as the subject, in which case it would help you satisfy the subject constraint. So
“Bill rains” would at least be motivated by providing a subject.
>>: But that would be wrong. According to OT you would have a higher score.
>> Paul Smolensky: You would have a higher score than just rains in English, potentially
depending on how you handle the faithfulness constraint that penalizes material in the output that's not present in the input. So the idea of "it" is that, of the
words that can go in subject position it least violates the constraint that says there
shouldn’t be content in the expression that’s not part of the logical form. So a full NP
like “Bill” has such information and is more of a faithfulness violation than “it” is. So
the idea is that “do” is in some sense the minimal violator of this when it comes to
inserting a verb and it is kind of a minimal violator of it when it comes to inserting an
NP, something like that.
>>: So maybe at some point, not today, we could look at input that is not a logical form,
but a continuous representation of that logical form, such as we have when we do
machine translation from a source language to a target language.
>> Paul Smolensky: So the embedding vector is –.
>>: [inaudible].
>> Paul Smolensky: The input to the generation half of the translation process is this? Yeah.
>>: So just so I understand, if I say, “Bill rains,” it does not violate the subject so you get
a higher score than “it rains”?
>> Paul Smolensky: No, no “it rains” doesn’t violate subject either. It rains has a subject.
>>: “Bill rains” also has a subject?
>> Paul Smolensky: Yes, they both have subject so they both satisfy the subject
constraint equally.
>>: Correct. Okay, “Bill rains” should not be [indiscernible].
>>: It has semantic content.
>>: Oh, so this is only for syntax?
>> Paul Smolensky: So the faithfulness constraints that are involved in showing how
“Bill” is a worse subject than “it” here are not in the picture, but they are in the grammar.
There are faithfulness constraints in the grammar. The grammar has more than these two
constraints.
>>: Oh, I see.
>> Paul Smolensky: I’m sorry to tell you that, but it’s true. So those were omitted. If
you wanted to include in the candidate set a more complete list and you had “Bill rains”
then you definitely would want to make sure that you also have the faithfulness constraints, which prefer –.
>>: A semantic match.
>> Paul Smolensky: A semantic match between the expressions and the underlying
logical form. Yes?
>>: Also it takes us into territory of things like farmed rain fishing, fish rain, [inaudible],
for example. We have these [inaudible]. It’s not that they can’t be a subject of rain. So I
think that’s one of the points. Bill is [indiscernible].
>> Paul Smolensky: So I have met quite a few Chrises here. I am inclined to say "Chris rains" in this audience and not "Bill rains". A logical form that is expressed by that expression is not the one that we are trying to express here. So it is true that rains is [indiscernible] in having one of its readings a subjectless version, logically subjectless, and others which have bona fide subjects.
So how does this look at the neural level? So we saw how we use the space here of
activation patterns in a network. So each point here is a pattern of activation and how we
take our symbolic combinatorial expressions here from the higher level picture of our
network and we embed them in the grid of discrete states in the continuous R^n space of
states of the neural network. And how the harmony function picks out the right form as
having the highest harmony, but that we don’t do search over the discrete points. We
talked about that as a possibility yesterday, but the neural networks that we consider
don’t do that kind of search. They are moving continuously in their state space.
So the challenge is that the optimal highest harmony point in the continuous space is not
going to be on the grid except in unusual circumstances. So what we did was we talked
about how we carved up the continuous space into attractor basins and we put in a force field that pushed the states to these grid points. And this is a gradient force field that derives from a potential function which is gotten by combining the harmony function that comes from the constraints provided by the grammar with a harmony function that penalizes non-discrete states.
So that contributes its part of the gradient dynamics also and that’s the optimization
dynamics which we implement through some kind of noisy or stochastic gradient ascent.
So the dynamics has these two parts: so the optimization dynamics is completely
oblivious to whether states are discrete or not. And the quantization dynamics is
completely oblivious to whether states are good or not as far as the grammar is
concerned.
So the hope is that by simply linearly combining them we create a system which will find
the discrete point that is best according to the grammar and that involves weighting the
discretization more and more strongly in that linear combination as the process of
computing the output proceeds so that we end up pushing the point which is never on a
lattice point until finally the quantization component of this is completely dominant and
corresponds to a limit of infinite coefficient on the part that contributes the quantization
force.
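A minimal sketch of that two-part dynamics, in code, might look like the following. The gradient functions, the step size, the noise level and the schedule for the quantization strength q are all stand-ins for illustration, not the actual model's settings.

```python
import numpy as np

def gsc_step(x, grad_h_grammar, grad_h_quant, q, step=0.01, noise=0.01,
             rng=np.random.default_rng()):
    """One step of noisy gradient ascent on the combined harmony surface."""
    grad = grad_h_grammar(x) + q * grad_h_quant(x)   # linear combination of the two forces
    return x + step * grad + noise * rng.standard_normal(x.shape)

def compute_output(x0, grad_h_grammar, grad_h_quant, n_steps=5000):
    """Run the dynamics while the quantization strength q grows toward dominance."""
    x = x0
    for t in range(n_steps):
        q = 0.01 * t   # increasing schedule (illustrative); in the limit quantization dominates
        x = gsc_step(x, grad_h_grammar, grad_h_quant, q)
    return x
```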
So that was about how this kind of computation goes and produces discrete combinatorial
outputs. Now I will talk briefly about the gradient ones. So the question that is driving the gradient symbolic computation work nowadays, which involves quite a number of people at this point, mostly at Johns Hopkins, but not exclusively, is this: to what extent can we get use out of all of the non-discrete points in the state space of the network for modeling psycholinguistic processing, that is, the intermediate states along the way to computing the parse of the sentence that you have heard so far, or, in generating from a logical form, the process of working your way to having a motor plan to produce it?
But, also we are looking into a somewhat more radical idea, which is that gradient
representations might be suitable for doing linguistic theory, so competence theory not
performance theory. So this is in addition to being used for performance theory. So this
is the idea that we can take seriously representations in which symbols have continuous
activation values, not just as a transient state along the way to a final proper discrete
representation, but as entities in their own right that may be valued by grammatical
considerations.
So this is an example of a gradient symbol structure. It is one way of depicting it
anyway. You can view this in 2 ways; it’s just different views of the same structure.
You can say that this is a tree where at the left child position we have a blend of 2
symbols, mostly A but some B as well, or you can look at it and say that this is a structure in which the symbol A is mostly in the left position, but partly in another position as well. The latter is a perspective that is more natural for syntax, where things occupy multiple locations implicitly if not explicitly, and the former is more suitable, we find, for phonology, where things tend not to move around, but they do change the content of their phonetic material.
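As a small illustration of those two views, with made-up activity values and made-up filler and role vectors (not the ones in the figure being described), the two kinds of gradient structure can be written as tensor products like this:

```python
import numpy as np

# Hypothetical filler vectors (symbols A, B) and role vectors (left, right child).
A, B = np.array([1.0, 0.0]), np.array([0.0, 1.0])
left, right = np.array([1.0, 0.0]), np.array([0.0, 1.0])

# View 1: a blend of symbols in a single position (mostly A, some B, at the left child).
blend_of_symbols = np.outer(0.8 * A + 0.2 * B, left)

# View 2: a single symbol occupying a blend of positions (A mostly left, partly elsewhere).
blend_of_positions = np.outer(A, 0.8 * left + 0.2 * right)
```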
So both of those have been used in this work and I will tell you about an example in
which we actually just look at discrete outputs, but we consider inputs that are gradient.
And this is the example I am going to talk about on Friday at Stanford, since you were
interested in that Lee. It is the phenomenon of French liaison and I want to acknowledge
that the idea of applying these kinds of gradient methods to this phenomenon is due to
Jenny Culbertson. So the way this phenomenon looks is that there are words like the
masculine version of small, which is written this way petit with a t at the end. It is
written with a t at the end, but the t at the end is not always pronounced.
So when you combine it with the noun ami you pronounce it and you get petit ami. But
when you combine it with a consonant-initial noun you don't pronounce the final t, you just
say petit copain. So you see I have pronunciations here. So the t disappears and that’s
not true of all final t's. So here is another form of that adjective, the feminine form, and the t at the end of the feminine form appears all the time. It is not like the one at the end
of the masculine form. So the masculine ends in what’s called a liaison consonant, but
the feminine form ends in a full consonant. It is pronounced all the time, even before a
consonant. So for the feminine version of what's literally "little friend" you have petite copine, which has got that final t of petite pronounced.
So this is a very well studied phenomenon in French, studied to death you might say. So
we are rolling over in its grave an already well trampled phenomenon. However, we
have new things to say about it and those –.
>>: [inaudible].
>> Paul Smolensky: So explaining what I have shown here is easy to do. The reason that this remains something that people work on, and is not considered just a solved problem, is that although this is the core of the phenomenon, and this is what they teach you in French school, not just in foreign language instruction of French, there are lots of phenomena that don't actually fit the picture here. So because of those other phenomena some linguists have proposed that we shouldn't actually think of this t as part of the first word at all. Some perspectives have it as an independent entity of its own
or an independent part of a schema in which we stick the adjective and the noun together
and the t is already sitting there.
And the view I am going to take into consideration today is the one where the consonant
is viewed as being part of the second word. So what you see here is the syllabification is
pe ti ta mi. So the story is that the t surfaces when it can be an onset of a syllable because
it is followed by a vowel. And as you know from [indiscernible] of syllables theory
yesterday, syllables should start with a consonant, that’s a good thing, whereas they
shouldn't end with a consonant. So in the case of a consonant-initial noun, if we did pronounce the t we would have a coda in petit copain; codas are dispreferred and we do not actually pronounce it when it would be in coda position. That is a characterization of the liaison consonants that makes them different from regular normal consonants, which always appear, such as this one here, even when it's in coda position.
>>: But the last one actually has got a coda also, right?
>> Paul Smolensky: Right.
>>: Why don't you penalize it?
>> Paul Smolensky: It is penalized, but it is still the optimal output. So there is
something different about the t at the end of petit versus the end of petite, there is
something different about those two t’s.
>>: So there are other constraints there at work to insert –.
>> Paul Smolensky: Basically the other constraints are some kind of faithfulness
constraints. So you can think about the basic ranking in French as being that faithfulness to the fact that there is a t in the underlying form is stronger than the no-coda constraint, except that there are some consonants, the liaison ones, where it seems as though their presence in the underlying form is kind of weak. So they don't provide as strong a faithfulness force, and the force they provide is not sufficiently strong to overcome the no-coda constraint; that's why these weak consonants don't appear in coda position.
So you have to somehow store in the mental lexicon of French a difference between the t
at the end of the masculine form and the t at the end of the feminine form. Somehow you
have to mark a difference, and there are words, adjectives, that have a final t that's always pronounced even when a consonant follows, but that aren't the feminine form of anything necessarily. So it's not all about the gender difference.
There is, though, a main competing analysis to the one that says the t is stored along with the rest of petit at the end, but is somehow a weak t. The alternative is a view that says that this syllabification here actually reflects the underlying form. So the underlying form is actually petit, nothing at the end, and the following word is tami. So the
idea is that the lexicon contains multiple forms of this word. So ami is one form but tami
is another form. And when preceded by petit you have to select the version of this word
that starts with the t and you have to select tami. So that’s the view that says these
consonants are part of the second word not the first word.
And what we have in our proposal here is a blend of these two analyses. So in the
proposal that we are providing with gradient symbolic representations we have a weak t
at the end of petit. So .5 is the activity of t. So remember these consonants can have a
strength different from 1, which is the standard strength. So it’s a weak t here, but there
is also an even weaker t at the beginning of ami. And t isn’t the only liaison consonant.
There are generally considered to be 3 productive liaison consonants in French. So just
like t we have to consider n and z as weakly present at the beginning of ami, because they
are needed for words, not petit, but other words.
So the idea then is that when this input is presented to the harmonic grammar in this case
the optimal output will be the form in which the t is pronounced at the beginning of a
syllable. Whereas when this petit with a weak t is followed by a consonant-initial word and we do the optimization, we will find out that the optimal form does not have that t. So .5 is not a strong enough faithfulness force to overcome the no-coda constraint, whereas the effect of combining the activity at the end of petit with the activity for t at the beginning of ami gives you .8, and with the strengths of the constraints, the faithfulness constraints and the markedness constraints, as given in the grammar we proposed, once you are up to an activity level of .8 you are over a threshold such that it is optimal to pronounce the t. Now it has got enough activity to overcome the no-coda constraint.
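A schematic way to see the arithmetic is the sketch below. The constraint names follow the discussion above, but the numerical weights are invented for illustration; they are not the weights of the grammar actually proposed, and the .3 activity for the ami-initial t is inferred from the .8 total mentioned.

```python
W_MAX, W_DEP, W_NOCODA = 2.0, 1.5, 1.8   # assumed weights, for illustration only

def pronounce_margin(activity, coda):
    """Harmony advantage of pronouncing the t over leaving it unpronounced."""
    margin = W_MAX * activity              # reward for realizing input material, scaled by its activity
    margin -= W_DEP * (1.0 - activity)     # cost of output material not fully supported by the input
    if coda:
        margin -= W_NOCODA                 # cost of putting the consonant in coda position
    return margin

# petit + ami: 0.5 (end of petit) + 0.3 (start of ami) = 0.8 total activity, t as onset
print(pronounce_margin(0.8, coda=False) > 0)   # True: pe.ti.ta.mi
# petit + copain: only the 0.5 t of petit, and it would be a coda
print(pronounce_margin(0.5, coda=True) > 0)    # False: pe.ti copain
# petite + copine: full-strength t (activity 1.0), pronounced even as a coda
print(pronounce_margin(1.0, coda=True) > 0)    # True: petite copine
```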
So that's the picture for liaison consonants. For regular consonants like the t at the end of the feminine form of petit, that's a full-fledged normal t, full strength. So it doesn't require any additional activity from the second word in order to be above the threshold at which faithfulness to it exceeds the cost of having a coda. So you get petite copine there. So that's a summary of a pretty long talk actually. Yes?
>>: [inaudible].
>> Paul Smolensky: Uh yes, that a –.
>>: So that would be like a child trying to learn this [inaudible].
>> Paul Smolensky: Right, right that’s a good point. The child is fortunate in that French
provides an environment where you can actually see the underlying contrast without any
complications from any following word. The acquisition story that I have doesn’t rely on
that fact, but it is actually a good point. So anyway, the idea is that when children hear
petit ami they use a constraint which is very widely observed in the world's languages that
says that [indiscernible] begin at the beginning of a syllable. So when you parse a stream
you segment it into words. Doing that at syllable boundaries is a good thing to do. So
when you do that to this form here you end up with petit as the first part and tami as the
second part. And children use tami as if it were an independent word. They say "a tami" instead of "an ami". So children use the forms that are proposed by this second theory here quite productively, but they stop after a while.
>>: So how about if you make audio recordings of these and analyze the spectrum. How
many consonants do you actually see there? Is it possible that both of these are there in petit tami? Because the alternative would be an oral thing, movement of the tongue in
your mouth. You are trying to say petit, but you are supposed to keep it silent and you
don’t quite say it, but when you are supposed to go from e to ah there is a transition. You
want to insert a consonant so then you shift that t, but then the t almost kind of
reverberates. Then there is a little bit at the end that starts with the next one. So you
have both rather than just the optimal version.
>> Paul Smolensky: Well after leaving Microsoft I am going to spend a couple of months
at a lab in France where they do this kind of study and I will be very curious to see what differences can be observed between –. So the feminine form of this is petite amie, so superficially it's the same as the masculine. But we know that the t's involved are actually different. And whether there is a subtle acoustic difference that reflects that, which is the question you ask, is another interesting one.
>>: Well it’s a simple question because those are different t’s because they are being
pronounced differently and then we just connect the words. You can get all kinds of
things that are a different pronunciation of these or maybe even a reverberation where
you get multiple t’s, but one is not heard quite. But you can detect it in your recording.
>> Paul Smolensky: Yeah I will be shocked if that turns out to be true, but it needs to be
looked at.
>>: [indiscernible].
>> Paul Smolensky: Right.
>>: [indiscernible].
>> Paul Smolensky: Well it is –.
>>: [indiscernible].
>> Paul Smolensky: Well you are not alone in that. So the divide between phonetics and
phonology has traditionally been that phonology is all discrete and anything continuous has to be in phonetics. And this is proposing that there are continuous dimensions of phonology as well. But we still believe that there is a phonological grammar that has the constraints that we talked about, which has an existence independent of all of the phonetic machinery. So we believe that grammar can show signs of representations that are not fully discrete, even though all of the outputs of the phonology no matter what go
through this process of becoming continuous in the phonetics. Yeah?
>>: So what might be a syntactic [indiscernible]?
>> Paul Smolensky: Of having a split?
>>: An example.
>> Paul Smolensky: Well we have worked a little bit with the example of wh questions
where our analysis says that in a language like English, which fronts the wh (not in languages that leave it in [indiscernible], but in languages that front it like English), the wh phrase is in a blend of 2 positions. So there is a weak version of
it in what would be the position that it initiates movement from in a movement theory and
a strong version of it in the place where it is pronounced. And you can imagine all sorts
of dependencies being treated in a way like that in addition to wh movement.
>>: So I was also thinking, if you have noun compounds then often, for the pre-modifying noun, you really can't tell whether it's an adjective or a noun.
>> Paul Smolensky: Yes.
>>: And there’s really no need to tell whether it’s an adjective or a noun even though the
Penn Tree Bank will declare it either an adjective or a noun. But there is often no reason
to make that decision. So I thought that might be a really [inaudible].
>> Paul Smolensky: That's a nice, different kind of example, which is more like the phonology-style mixture here, where in one position, namely where this word is, you have a blend of 2 category representations, noun and adjective. And there is no need to force a decision between the two in certain cases, as you say. And states of processing sentences
before the whole sentence is completely processed we model with all sorts of syntactic
blends, which reflect a mixture of not yet having carried out the computation far enough
to be at the final potentially discrete state, but also because there is uncertainty when you
are processing the sentence and it hasn’t completed yet. So you only have partial
information about the beginning of the sentence which leads to other kinds of gradient
representations from a source of uncertainty about what’s missing. Yeah?
>>: That input line seems to make a prediction that there might be a difference between, say, high and low frequency processing; for example if you get petit plus something, a very unfamiliar word that began with a vowel, you would expect possibly a delay in reaction times associated with the application of that rule [indiscernible].
>> Paul Smolensky: I hope to learn more about the performance aspects of liaison in
France, but there is a documented set of putative facts about the probability of liaison
appearing as a function of the frequency of the collocations. That you get more liaison
from more frequent collocations. And our analysis has an account of that, but you are
absolutely right that this is a good start for that kind of phenomenon.
>>: So would it be fair to say that introducing this gradient probabilistic way of dealing
with labels makes the constraints a bit easier, like you don't need to have so many
constraints to explain a specific language [inaudible], otherwise if you don’t have that
gradient [indiscernible].
>> Paul Smolensky: I know of one version of that for which that’s correct. I don’t know
how general it is, but in Optimality Theory there are some phenomena where you have different levels of violation of what's conceptually a single constraint, but you have to fake that by having multiple versions of the constraint in the hierarchy that get stronger and stronger.
So you don’t need to do that when you have numerical elements. So there are at least
some places where I am pretty sure that what you said is true. But I do want to say that
we are adamant that these are not probabilities. So the representations are like a pattern
in a neural network and the network is a stochastic network so every state has a certain
probability associated with it. So a state like this has a probability as well as having the
gradient internal structure.
>>: I see, I see.
>> Paul Smolensky: So the probability is laid on top of representations like this for one
thing, but you will notice that these do not add up to 1 nor do these. And with the wh
question we are not saying that there is some probability that you will treat the wh phrase
as if it were at the beginning and some small probability that you will treat it at the end.
It is in both places at once. It is a conjunctive combination. It is not a disjunctive
combination the way probabilistic blends are.
>>: Okay so what would be the neural correlates of these [inaudible]?
>> Paul Smolensky: It is absolutely straightforward. So you just have a certain set of
roles and fillers for the discrete form petit and in constructing the [indiscernible]
representation of a fully discrete form like petit you have "p" in one role, "uh" in another role and you have a second t in a certain role.
>>: Oh I see, I see.
>> Paul Smolensky: And the filler for the final t times its role has a coefficient of .5 inside.
>>: [inaudible].
>> Paul Smolensky: Yes, that’s right.
>>: So all the theories will carry through.
>> Paul Smolensky: Right, that’s right. So that’s why I say that what we are doing is
looking at, for the use of all those states that are not on the grid and in general they can be
interpreted as states like that.
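For concreteness, here is a sketch of that construction with hypothetical filler and role vectors (random stand-ins, not the model's actual vectors):

```python
import numpy as np
rng = np.random.default_rng(0)

# Hypothetical filler vectors for segments and role vectors for string positions.
segments = {s: rng.standard_normal(8) for s in ["p", "uh", "t", "i"]}
roles = {r: rng.standard_normal(5) for r in ["pos1", "pos2", "pos3", "pos4", "final"]}

def tpr(bindings):
    """Sum of activity * (filler outer-product role) over the bindings."""
    return sum(a * np.outer(segments[f], roles[r]) for a, f, r in bindings)

# Gradient input for petit: full-strength segments, plus the final t at activity .5.
petit = tpr([(1.0, "p", "pos1"), (1.0, "uh", "pos2"), (1.0, "t", "pos3"),
             (1.0, "i", "pos4"), (0.5, "t", "final")])
```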
>>: You said the probability is basically e to the minus energy or e to the harmony.
>> Paul Smolensky: Right.
>>: So in that case these numbers would actually end up being like log-odds of on and off states.
>> Paul Smolensky: You mean in terms of probability appearing verses not appearing?
>>: Yeah.
>> Paul Smolensky: Uh-huh.
>>: So they don't have to be representations that are normalized between 0 and 1. They are just more like log-odds. They can be any number, positive or negative.
>> Paul Smolensky: It's true they are more like log-odds, but what we do is –. So this is a
description of the input. The output is one in which, as it says here, we are looking at
discrete outputs only in this project. So a discrete output has a t in this intermediate
position which derives from both of these two. So it is analyzed as a coalescence of two underlying segments into one. So this number and this number add together, but there isn't really any sense in which I can see that addition, which is the crucial point for deciding whether that t surfaces or not, as a combination of probabilities, but maybe I am
just not creative enough.
>>: Well in the probability domain it is more like a product and here it is a sum.
>> Paul Smolensky: It's not like the probability with which you would actually get the t surfacing in this word, like ami on its own, gets multiplied by the probability you would have for t alone.
>>: But it wouldn't be that number. It would be a number that is [indiscernible] plus e to that number divided by the temperature or whatever it is.
>> Paul Smolensky: Yeah, I will look into that. It is certainly true that these coefficients
have the status of energy.
>>: Well ideally what I think about is that when you build a structure there, put an indent,
you [inaudible] and that process is more or less the same.
>> Paul Smolensky: Whether it is [inaudible].
>>: With or without the gradient [inaudible].
>> Paul Smolensky: Absolutely.
>>: Then once you do that [indiscernible] is to say that becomes the space in the
projected [indiscernible]. Then you run dynamics and that gives rise to “h”.
>> Paul Smolensky: The dynamics dictates the “h”.
>>: Now there is the question of how does that term "h" relate to the weights [indiscernible]? [inaudible].
>> Paul Smolensky: So the harmony contributed by this consonant in the input –. So for
an output in which there is a t that is pronounced then there is a certain faithfulness
reward from having put in the output a consonant which is also in the input. But the magnitude of that reward is equal to the weight associated with the faithfulness constraint itself times the activation value here. So you multiply the weights in the harmony function times the activation values of the elements that are in correspondence in the input and the output.
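Written schematically (the symbols here are mine, following the verbal description): the faithfulness reward for an output segment whose input correspondent has activity a is

```latex
H_{\text{Faith}} \;=\; w_{\text{Faith}} \cdot a ,
```

so an output t whose input correspondents have activities summing to .8 earns a reward of 0.8 times w_Faith.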
>>: Okay I see.
>>: [indiscernible]. But if you integrate all the variables out you are still going to get a distribution that has that [indiscernible]. So that is what I meant, maybe that is similar here. Then in that case the t on one side is increasing the probability of seeing the t, and the t on the other side is increasing the probability of seeing the t. [indiscernible].
>> Paul Smolensky: But it is crucial for resolving the issue of how different or not those numbers are from probabilities. Yes?
>>: I just wanted to kind of make a comment that for me the output of the syntax I
wouldn’t assume it is discrete.
>> Paul Smolensky: Right, we don't actually assume that about phonology either, really.
>>: Okay, but just from a non-traditional syntax point of view I would find it natural that
they would be gradient because there is no such thing as a discrete syntax.
>> Paul Smolensky: And I guess even conventional wisdom might say something similar
to that about comprehension and whether the comprehender goes all the way to
producing the disambiguated discrete analysis given that typically it is not necessary or
maybe not even possible. But in production you have to have a well enough –. The
syntactic expression that you are using to drive production is going to have to be discrete
enough to do the work, but I don’t know how discrete that will turn out to be.
Okay, so what I propose to do is go back to the list of points we had in the first lecture
and carry it into the second. We covered the light grey ones here, but didn’t cover the
darker ones here. So I have slides that are versions of what I would have used then and
these are the topics and you can decide which ones you want me to talk about in the
remaining time.
So the first one here is a very general symmetry argument for why distributed
representations are really a key defining property of neural computation and need to be
possible in all legitimate neural network models. So that’s an argument that’s somewhat
extended. Then distributed representations themselves are important because they give
rise to generalizations based on similarity. So I have 2 examples of similarity based
generalization: one at a symbolic level and one at a sub-symbolic level. This is just a
relatively short section of how to use tensor product representations to do basic lambda calculus and tree-adjoining grammar. And this one actually probably goes quickly because we have already talked about half of the stuff on that slide.
>>: [indiscernible].
>>: We have to give the admins a little bit more time this time, because it was really hard
for her to get it out.
>>: [indiscernible].
>> Paul Smolensky: Which her are you referring to, Tracy?
>>: Stacy.
>> Paul Smolensky: Stacy.
>>: She is doing a great job, but [indiscernible].
>> Paul Smolensky: I will send her a thank you note.
>>: I already did, but that would be nice Paul.
>> Paul Smolensky: I will add to it. So I am game for that. I will be out of town for 3
days so it can’t be as short notice as today or yesterday.
>>: So you go on Friday, Monday or Tuesday?
>> Paul Smolensky: Thursday, Friday and Monday.
>>: But I will be away for most of the week next week, so let’s do it the following.
>> Paul Smolensky: So back to basically the first slide although most of it is grayed out
here. So what I was trying to argue was that there was a challenge, kind of a very
fundamental challenge of bringing together all of the reasons that we have from traditional
cognitive theory, and linguistic theory, and AI. All the reasons we have to believe that
symbolic computation is a very powerful basis for cognitive functioning and on the other
hand, especially recently in the AI world, but for longer in the cognitive world, reasons to
believe that neural computation provides that important power as well. And they don’t
seem very easy to reconcile with one another. Most people think that they can’t or
shouldn’t be.
So what is being proposed here in GSC is a kind of integration. The gradient symbolic computation we just looked at, where we talk about representations with gradient symbols in them, is the most explicit part of that. But that is one part of the whole package
that pulls these two together. So this was intended to be very early in the talk and is a
characterization of what neural computation means and that lays the groundwork for what
constitutes the challenge of trying to unify it with symbolic computation. And the most
concrete function of this section of the talk is to try to explain why tensors appear and
why we should be using tensors to encode information in this integrated picture.
>>: So I am a little confused. In the first lecture you talk about exactly how you
[indiscernible]. But in this lecture you are talking about gradient symbolic computation
and that seems to be separate from tensors.
>> Paul Smolensky: Well if you look at it from the symbolic point of view then you have
these representations that don’t have any neural networks in them. That’s the
characteristic of this higher macro level view of things here. But my claim is that these
things are the bridge between those two, the tensors that we write down which might
have .5 times t tensor final position. Those things are squarely in the middle because
when you cash out all the units contained in those tensors you are looking at this, and
if instead you look at the tensors themselves as wholes then you find yourself looking at
something like that.
>>: Oh I see, okay, okay, yeah. So the gradient symbolic computation you talked about
earlier today, I wouldn’t think that to be as important as having tensor as intermediate
stage.
>> Paul Smolensky: Well I mean I take your point. So, I will reflect on the fact that I
have used about 5 different names over the years for this package of ideas. So in the
book I called it the Integrated Connectionist/Symbolic architecture.
>>: So the gradient symbolic you talked about today to me is more trivial compared to
the tensor representation that unify the two. Am I wrong?
>> Paul Smolensky: My way of making sense out of those gradient symbolic
representations is to say that they function in a harmonic grammar the way that the tensor
product representations of them function in the harmony function at the neural level. So
in order to make sense out of them from a harmony point of view it seems important to me to know that they stand for these patterns of activation where we can evaluate what the harmony is. You might be able to finesse it if you wanted to though.
>>: Then when you have this sort of representation do you no longer need the
quantization dynamics? Is everything based now on the projections for which state you
have?
>> Paul Smolensky: So you don’t need quantization for interpretation of anything. So
the tensor representations are about interpretation so you don’t need quantization for that.
If you use this isomorphism to build, for example, a linear network that does this job here, then the situation is that if you put in a point on the grid of discrete states, if that's your input, if it happens to be one of those special points, then so will the output be. It's feed forward, it's guaranteed to give you the discrete output that would come from doing it this way. But if you want to do harmony maximization and have a grammar with constraints that you are trying to satisfy, then in general you can't compile that into something like a specification of a feed forward network. So we do the actual constraint satisfaction with
recurrent activation spreading. And in that situation, that all takes place in the continuous
context and you have to put quantization in if you want at the end of the day to get
something on that grid of discrete states.
>>: But in the brain there is no such thing as discrete state, presumably. How do you
optimize, if in neural network in your brain, how do you optimize this? How do you
isolate discrete states?
>> Paul Smolensky: So the quantization dynamics has a combinatorially structured set of
attractors. So we are operating under the assumption that there is something like that in
our brains.
>>: Oh, you would say there are just pockets where things are being drawn, the basins. And then during learning things fall into these pockets and it becomes a discrete state?
>> Paul Smolensky: Right, yeah, not necessarily that learning is restricted to the situation
where things do fall into a discrete state, because you might need learning to get the point
where things are falling into discrete states in the first place.
>>: So maybe this is where my confusion is coming from. So I thought that to go from
[indiscernible].
>> Paul Smolensky: If we can write down a symbolic expression for the function in a
certain class –.
>>: So what is the role of harmonic grammar as well as dynamics, how do they fit into
this? So I think I missed this high-level [indiscernible]. Do you know this
[indiscernible]?
>> Paul Smolensky: Well until we started talking about it, you and I, your idea last week
or so, it didn’t occur to me that you could take a kind of gigantic list approach to thinking
about the function that maps from let’s say a syntactic tree to an AMR representation.
>>: Yeah.
>> Paul Smolensky: And that you can just identify thousands of cases and for each one
write down a symbolic function that does the right thing. Under that approach there isn’t
a need for a grammar.
>>: Right, that’s what I mean.
>> Paul Smolensky: As long as you are satisfied with the generalizations that occur when
you subject unfamiliar inputs to the list.
>>: I see, for the new instances.
>>: Yeah.
>>: [indiscernible].
>>: But then you can never handle new input.
>>: [indiscernible].
>>: But I can give you a sentence that's not near to anything and it should still get an AMR representation.
>>: [indiscernible].
>>: Under no circumstances. And I can give you completely possible sentences that you have never even dreamed of and your mind will attach an interpretation to them.
>>: I see, I see, okay.
>> Paul Smolensky: So we believe that the generalizations that are involved in language processing are captured by these grammars and they need to be implemented by continuous constraint satisfaction.
>>: [inaudible].
>> Paul Smolensky: All right. So speaking very rudimentarily, and we will do somewhat better I think maybe, tensors are n-dimensional arrays of real numbers. So why are they
something that we should be so concerned with in this picture? That’s the question here;
why tensors? And to answer that, as I was starting to say a moment ago, I need to elaborate on how we define neural computation so we see what exactly the specifications are for
unification. So the most obvious contrast between symbolic and neural computation is in
terms of the syntax of the formalism. I don’t mean natural language syntax; I mean the
syntax of computational formalisms here.
So, at the macro level here, our data are symbols that are connected by relations, and the operations are to compare whether 2 tokens are symbols of the same type, to concatenate or embed constituents in one another. Those are the kinds of operations and data that are associated with the symbolic computational architecture. As opposed to having as the data numerical vectors, and operations which are element-wise addition, multiplication by constants, combining together to give you matrix multiplication and stuff like that.
So this is, in terms of what the computational operations work on and do, a fairly clear
contrast. But what’s more important to understanding about tensors is what I would call
the semantic contrast about the semantics of these two computational architectures. So in
the case of this symbolic world, generically speaking, the symbols that appear in the data
structures correspond to concepts. So a concept is locally encoded in individual symbols
which allow us to write programs that deal with the symbols in the way that we think
they should be dealt with because we know what concepts they correspond to and we
have a conceptual understanding of what the tasks are and what they require.
On the other hand generically speaking in what I take to be true neural computation the
individual units are meaningless. You can’t assign any conceptual interpretation to
activity in a single unit. Concepts are distributed over many neurons and a given set of
neurons has many concepts distributed over them, which makes for difficult
programming, difficult interpretation. So that’s what you might call sub-symbolic
representation as opposed to symbolic, although I originally called it sub-conceptual and
that is a better name.
>>: [indiscernible].
>> Paul Smolensky: Yes.
>>: [indiscernible].
>> Paul Smolensky: Well remember the idea is that both of these are true of the same
system. You can look at it this way or you can look at this way. That is to say, well I
have explained what that means.
Okay, so why tensors? Tensors are very natural for dealing with distributed
representations. Once you accept that distributed representations are really the essence of
what defines neural computation and what distinguishes it from symbolic computation,
with respect to the semantics of these formalisms, then you can see that something that is
natural for distributed representations is important for us to know about and use.
Okay, so why do I think distributed representations are a necessary component of
characterizing micro-computation? The sense of “necessary” I mean here is a little bit
convoluted. So I am not arguing that every neural net model needs to use distributed
representations. What I am arguing is that an adequate neural net architecture must allow
distributed representations. An architecture's design cannot require that the representations be local in order for it to be a true neural computational system. This is relevant, as we will see, to typical neural network systems doing NLP these days. They
violate this principle because they do require local representations in a certain sense.
Okay and for the time being I am going to assume that when we are talking about
distributed representations what we are talking about is what I defined earlier as proper
distributed representations in which the patterns that encode the concepts are linearly
independent of each other so that they don’t get confused when combined. And with
respect to that there is this big difference. I mentioned it once before, but it is important
enough that I will mention it again that in cognitive science the situation is that the
number of units available to us in our networks, namely the number of neurons in the brain, is considered to be much bigger than the number of concepts that are encoded in
them.
So people always have assumed, and I don’t see reason to doubt it yet, that the
vocabulary at the conceptual level that we use when we consciously think about domains
is much smaller and restricted compared to all of the neurons that are involved in
processing those domains. So in that situation you can just assume that you have proper
distributed representations because if the number of neurons is bigger than the number of
concepts there is no reason the concept vectors can't be linearly independent. But coming here drove home that in engineering we can't necessarily afford as many neurons as we have concepts or words to deal with. So now we have the number of neurons
much less than the number of concepts and in which case we don’t have proper
distributed representations anymore. So put the improper ones aside for a moment and
just think in the context of the proper distributed representations.
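The point about proper versus improper distributed representations can be checked numerically; this little sketch (random patterns, illustrative sizes) just verifies the linear-independence claim, it is not a model of anything.

```python
import numpy as np
rng = np.random.default_rng(0)

# Cognitive setting: far more units than concepts -> random concept patterns are
# (almost surely) linearly independent, i.e. a proper distributed representation.
patterns_big = rng.standard_normal((50, 1000))      # 50 concepts over 1000 units
print(np.linalg.matrix_rank(patterns_big))          # 50: full rank

# Engineering setting: far fewer units than concepts -> the patterns cannot all be
# linearly independent, so the representation is necessarily improper.
patterns_small = rng.standard_normal((5000, 300))   # 5000 words over 300 units
print(np.linalg.matrix_rank(patterns_small))        # 300, far less than 5000
```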
Okay so the basis of this argument here for the centrality of distributed representations is
something that is here called the symmetry metaprinciple, which says that if some system
is left unchanged, or invariant, under a group of transformations, which are called symmetries if they leave the system unchanged, then the fundamental laws of the system must be invariant under those symmetries. It's hard to believe that anybody could disagree with that. So there is a very simple example of this. If we consider all of the rectangles in the plane which are axis-aligned then that's a set where you can rotate
by 90 degrees or 180 degrees any one of these rectangles and get another rectangle in the
same set and the area of the rectangle won’t change when you perform those rotations.
So if somebody tried to sell you this formula for the area of a rectangle you could
immediately be suspicious and say, “This can’t be right because it violates the
metaprinciple here." So this formula says that you take the width of the rectangle in the x dimension, squared, times the width in the y direction. But when you rotate by 90 degrees the widths in the x and y dimensions flip: what was the x width becomes the y width and vice versa, and this formula doesn't stay the same when you exchange x and y. So it can't possibly be right.
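In symbols, writing w_x and w_y for the two widths (the notation is mine), the complaint is:

```latex
A_{\text{proposed}}(w_x, w_y) = w_x^{2} w_y
\;\neq\; w_y^{2} w_x = A_{\text{proposed}}(w_y, w_x),
\qquad\text{whereas}\qquad
A_{\text{true}}(w_x, w_y) = w_x w_y = A_{\text{true}}(w_y, w_x).
```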
Okay so that’s a very straightforward example of this principle, but I want to obviously
apply it now to representations in vector spaces. So the representational medium for neural networks is the vector space of activation patterns over n dimensions, if you have n units. And this vector space is invariant under a change of basis or a change of coordinates, which is given by the general linear group of invertible linear transformations of the space R^n. So what do we mean by a coordinate change, a change of basis, here? So when we write down that we are interested in the vector (4, 17, 5), what we mean is that this vector is the linear combination, with coefficients 4, 17, 5, of 3 basis vectors which point in the directions of the coordinate axes. So if we have Cartesian x, y, z coordinates then the vector e1 points along the x direction and has unit length. And this is the linear combination that defines what we mean when we write down this formula.
Now if we change the coordinates so that we go to some different set of axes for laying out, in this case, 3-dimensional space, then the same vector will have a different description with respect to the coordinate system, and in the new coordinate system we have e1 prime, e2 prime, e3 prime pointing along the x prime, y prime and z prime dimensions of the new coordinate system. And we will need different weights, v1 prime, v2 prime, v3 prime, in order to get the original vector back again.
So we change the basis we use for describing it like changing the coordinate system here.
Then we need to change the numbers that describe any particular vector in that coordinate
system. And it is just a matter of multiplying the original set of coefficients by a matrix
to get the new ones. So this matrix is a member of the general linear group and in this
case the columns of this matrix are the coordinates of the new basis with respect to the
old one. So e1 prime is itself a vector which can be written in this way as a mixture, a linear combination, of the original e1, e2, e3, and those mixing coefficients are the first column of this M matrix, which is the change of basis matrix.
Okay, now this can effect a change between distributed and local representations, that's the point. Swapping between distributed and local representations is a kind of change of basis, an invertible linear transformation, given that the distributed representation is a proper one. So here is a concrete example. It might not be the best type of example to imagine, but it has concreteness in its favor. So we have these phonetic segments, or phonemes or whatever you want to think of them as: v, s, n, and this is a distributed representation of these in terms of some kind of phonological features: lab, cor, vel, nasal, etc., and we put a 1 there whenever the corresponding feature or property is present in that sound.
So we can now switch our coordinate system so that the axes actually point along the directions of the phonemes. So we have a v axis, an s axis and an n axis, instead of having a labial axis, a coronal axis and a velar axis. So in that case the new e1 prime is, let's say, v, and similarly for e2 and e3. So the change of basis matrix here just involves taking this vector and making it the first column, then taking this vector and making it the second column, and so on. That gives a matrix which, when multiplied by the coefficients that you use in the original distributed representation, gives you a new description in terms of what is now a local representation, because each coordinate now has a conceptual interpretation as a v or an s.
And the sounds themselves become fully localized now, whereas v was this combination
of features in the original coordinate system. In the new coordinate system v is by
definition the first axis. So it has got a coefficient of 1 in the first direction and 0 in all
the other directions. So now we have a local representation, because we have switched to axes that point in the conceptual directions instead of in what were the neural directions.
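Here is that change of basis carried out concretely. The feature values are made up (three features for three segments, chosen to be linearly independent), so this is only an illustration of the mechanics:

```python
import numpy as np

# Made-up distributed feature vectors for three segments (unit = feature coordinates).
v = np.array([1.0, 0.0, 1.0])
s = np.array([0.0, 1.0, 0.0])
n = np.array([0.0, 1.0, 1.0])

M = np.column_stack([v, s, n])   # columns: the new (conceptual) basis written in the old basis
M_inv = np.linalg.inv(M)

# The distributed pattern for v, re-described in the conceptual coordinates, is local:
print(M_inv @ v)                 # [1. 0. 0.]: v is by definition the first conceptual axis
```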
So, this is where the example is sub-optimal, because we don’t necessarily want to think
of a labial neuron and a coronal neuron, but if you will just grant that for convenience.
Each of these numbers in the distributed description of some concept is the activity of a
neuron. So, each of the original basis vectors here points in the direction of the activation
of a single neuron. And we switch now to basis vectors that point in the direction of
single concepts. It’s a linear transformation, we apply a member of this general linear
group of transformations to the first description and we get the second one.
Now if we have a network which has the usual form in which –. Let’s see maybe I will
go down here. The point of this is that the function computed in a neural network is
invariant under the change of basis. So conceptually, if you have a distributed representation over a set of units that form one layer in some network, we can re-describe the states of the vector space not in terms of the activities of the units, but in terms of the weights of the concepts whose distributed patterns live in that network, doing the change of basis that we just talked about. And if you do that you can change the
outgoing weights so that they now are appropriate for the conceptual coordinates instead
of what they were before, which was appropriate for the neural coordinates. And once
you have changed the weights to match what you did to change the description of the
states what that network layer feeds to the next one is exactly unchanged.
So here is the activation in layer l, suppose it starts off by having distributed
representations in it and this is the weight matrix from layer l to layer l plus 1. If we now
change our description so that what was the vector a becomes the vector a prime in the new system, which is a local representation, that is equivalent to multiplying by this M inverse matrix. What we need to do is change the weight matrix W correspondingly, which involves right-multiplying it by M. So this is our new weight matrix, and if we take the new weight matrix times the new activation vector, W prime times a prime, what we get is just the same thing as we got before, W a. So any change you make by changing the basis for
describing the states of a layer can be completely compensated for by changing the
weights so that what the next layer sees is exactly the same in both cases.
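That compensation is easy to verify numerically; a minimal check with random matrices (purely illustrative) is:

```python
import numpy as np
rng = np.random.default_rng(1)

n = 4
a = rng.standard_normal(n)         # activations of layer l in the original (unit) coordinates
W = rng.standard_normal((n, n))    # weights from layer l to layer l + 1
M = rng.standard_normal((n, n))    # a change-of-basis matrix (random, so invertible in practice)

a_prime = np.linalg.inv(M) @ a     # re-described activations (e.g. conceptual coordinates)
W_prime = W @ M                    # compensating change to the outgoing weights

print(np.allclose(W_prime @ a_prime, W @ a))   # True: the next layer sees exactly the same input
```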
So it can’t possibly change how the network behaves or what the network can or can’t do,
what it does, nothing can change if you switch from a distributed to a local description of
what’s going on in a layer of the network. So that’s an invariance, it’s a symmetry. So
the architecture, the laws that govern this system, the architecture of the network must
accommodate both local and distributed representations. That’s a symmetry that goes
between them. It can’t be restricted to either one or the other, but the relevant thing is it
can’t be restricted to local representations alone.
And existing neural networks applied to NLP are often problematic in this way, because they implicitly are using local representations of structure. So here is an example of a kind of network, which may be out of favor with its first author here as I have been told, but which nonetheless makes the point: in order to get the embedding vector for a phrase we take the embedding vectors for the words, and we combine first these 2 together with some combination function to get this, and then we combine this with the next word and get that. So this is a description of the process, but it's also implicitly a description of the encoding of the tree itself. And what you see is that there is an isomorphism between the tree itself and the network's architecture.
So what's built into this is that the role of being the fourth word is localized to these units.
The role of being the third word is localized to these units. So if we think of it as a tensor
product representation then in each of these we have a distributed pattern for a filler, but
they are multiplied by vectors for roles which are completely local.
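A minimal sketch of that point, with hypothetical embedding vectors: laying each word's embedding into its own block of units is exactly a tensor product representation whose role vectors are one-hot.

```python
import numpy as np
rng = np.random.default_rng(0)

d, n_words = 3, 4
fillers = rng.standard_normal((n_words, d))   # hypothetical word embeddings x1..x4
one_hot_roles = np.eye(n_words)               # "i-th word" roles, completely local

# Tensor product representation with local roles:
tpr = sum(np.kron(one_hot_roles[i], fillers[i]) for i in range(n_words))

# ...which is just the word vectors concatenated into dedicated slots of the layer.
print(np.allclose(tpr, fillers.reshape(-1)))  # True
```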
>>: No, no I just thought this particular model, this x1, x2, x3, x4, each of those are
embedded.
>> Paul Smolensky: Yea.
>>: It’s not 1 hot representation for [indiscernible].
>> Paul Smolensky: Right, yeah I am saying that the roles are 1 hot vectors, not the filler
vectors.
>>: Okay, okay, the role vector, yeah, yeah.
>> Paul Smolensky: But as a consequence nobody thinks about role vectors. But implicitly what's going on in all these kinds of systems is that the architecture is dependent on the encoding of symbolic roles being local. And our architecture is built so that the roles can be local or they can be distributed in any way that you want, in other words –.
>>: My point was that [indiscernible] doesn’t change much from local to distributed.
[indiscernible].
>> Paul Smolensky: I know, I know, you have said that 4 times every day since I came to
Microsoft, maybe 5 sometimes and every time you tell me what my answer is and now
you have learned what my answer is. The distributed representations have similarity
structure. So they are different not because they are smaller.
>>: I see okay, okay. So maybe my argument was mainly focused on filler
representation. That can be small, for example for x1 [indiscernible].
>> Paul Smolensky: Yes, so if the role vectors are linearly independent of each other they
don’t have to be localized to 1 hot vectors. They don’t have to be localized to particular
groups of units, but –.
>>: [indiscernible].
>> Paul Smolensky: What I said about the neural situation applies to fillers as well as to
roles. So it wasn’t specifically about [inaudible].
>>: So in that case [indiscernible] the practice of reducing dimensionality by [indiscernible] is not something that you think is cognitively sensible?
>> Paul Smolensky: Yeah.
>>: But in engineering it’s wonderful. You can solve so many problems and then we do
similarity measure, all that [indiscernible] that you talk about is there and yet we still
have dimensional [indiscernible]. Then what is the discrepancy in this case?
>> Paul Smolensky: Why doesn’t the brain do what your computers do?
>>: [indiscernible]?
>> Paul Smolensky: If it works so great? That’s a good question and I think the answer
is –.
>>: [inaudible].
>> Paul Smolensky: Pardon?
>>: For engineering [indiscernible].
>> Paul Smolensky: Right, well so on the engineering side what you always hear is that if
your networks are too big then you will overfit your training data, or you won’t be able to generalize well to new examples, which is the same thing, and therefore the power to generalize has to be forced by having a small network. So even if your machines are big enough to allow for
computationally a large network you won’t get good generalization necessarily. And I
think we, on the cognitive side of things and I imagine that Jeff Hinton would agree with
this, have always thought that can’t be the right answer for why the brain generalizes well
from a small amount of data. It can’t be the right answer.
>>: So you are saying that not even for the filler vectors it will work, for example the
units for distribution probably should be [indiscernible]?
>> Paul Smolensky: Yeah or in the millions more likely.
>>: I see, okay.
>> Paul Smolensky: I mean in terms of neurons.
>>: So in machine learning, can I comment about the symmetries? If you have a system
that naturally has symmetries, but then you are insisting on a solution that’s going to
insist on a particular configuration that doesn’t allow for all the symmetric solutions, then you usually have very bad local minima when you are training. One example would be if you are training a simple mixture of Gaussians model and then somebody tells you, “Well I know what the variances are for these Gaussians. So use this one for the first one, this one for the second one, this one for the third one and so on.” Then you have violated the symmetry and you are insisting that the first one is narrow and the second one is broad, but the algorithm actually wants to do it the other way around early on, because it just locks onto the different parts of the data. Then you usually get really bad
results.
If on the other hand you say, “No I am going to learn the variances and the means
together and then later identify which ones should be the smaller ones and which ones the
bigger ones and then reverse them,” then you get a good result. And the reason is the
violation of symmetries. If the model wants to have multiple answers and you are trying
to force it early on in one direction like here then you have a local minimum. You can
get a good result maybe, but you do have more commitment. So is that something that you were going to talk about, or is that your argument for why this shouldn’t be done?
>> Paul Smolensky: Well no it’s better than my argument. Yeah, so my argument is an
argument from a physicist who was enamored with relativity theory and how you could derive a lot of constraints on what the laws of nature must be by observing the symmetries: a law just can’t be right if it doesn’t respect the symmetries. So that is what I am saying here: the neural architecture just can’t be right if it doesn’t respect them. But from an
engineering perspective it is much nicer to be able to say, “Well if you choose to break
the symmetry then you will pay for it.”
>>: The engineering decides to minimize the pain.
>> Paul Smolensky: Minimize the pain.
>>: Yes, minimize the penalty you get by violating symmetry, for example.
>> Paul Smolensky: Yeah very good to know. I am not sure I really realized that. Yeah
it is very interesting how adding knowledge and constraints to the system hurts you. I
will have to ponder the significance of that.
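For what it is worth, here is a small NumPy sketch of that mixture-of-Gaussians scenario: the same EM loop run twice on the same 1-D data, once with the two variances clamped in the “wrong” order relative to the initial means, once with means and variances learned jointly. The data, initial means, and variance values are toy choices; with them the clamped run typically ends with a noticeably worse final log-likelihood.

import numpy as np

rng = np.random.default_rng(3)
x = np.concatenate([rng.normal(0.0, 0.5, 500),     # narrow cluster
                    rng.normal(10.0, 3.0, 500)])   # broad cluster

def em(x, mu, var, learn_var, n_iter=200):
    # EM for a two-component 1-D Gaussian mixture with equal, fixed mixing weights
    mu, var = np.array(mu, float), np.array(var, float)
    for _ in range(n_iter):
        dens = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        resp = dens / dens.sum(axis=1, keepdims=True)            # E-step: responsibilities
        nk = resp.sum(axis=0)
        mu = (resp * x[:, None]).sum(axis=0) / nk                # M-step: means
        if learn_var:
            var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk   # M-step: variances
    dens = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    return mu, var, np.log(dens.mean(axis=1)).sum()              # final log-likelihood

# Variances clamped in the "wrong" order: component 0 forced broad, component 1 forced narrow.
print(em(x, mu=[2.0, 8.0], var=[9.0, 0.25], learn_var=False))
# Means and variances learned jointly; components can be relabeled afterwards if needed.
print(em(x, mu=[2.0, 8.0], var=[1.0, 1.0], learn_var=True))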
So the bottom line of all of this linear algebra stuff is that if you have some given layer of
neural units any computation that you can do, by assuming that the representation is
local, could also be done with the proper distributed representation. All you need to do is
view the local version that you have in front of you as just a description in a convenient
coordinate system of a system that can be described in other ways giving rise to
distributed representations. You correct for that change in the matrix of connections and
whatever you computed before you still compute. But, and this is the answer I have to repeat 5 times a day to Lee, they are not equivalent in that distributed representations
exhibit similarity effects automatically, which you have to wire in artificially if you want
them in the case of local representations.
And that’s because two 1 hot vectors have zero similarity and every pair of 1 hot vectors
is just as dissimilar from one another as every other pair. But phonemes aren’t like that
and most concepts aren’t like that. So if the distributed representations of those concepts
reflect the similarity structure underlying them in whatever task they are being used for
that should be an advantage.
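A tiny numerical illustration of that contrast (the three feature vectors below are made-up toy values, not a real feature system):

import numpy as np

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Local (1 hot) codes for /p/, /b/, /a/: every pair is equally, maximally dissimilar.
p_loc, b_loc, a_loc = np.eye(3)
print(cos(p_loc, b_loc), cos(p_loc, a_loc))      # 0.0 and 0.0

# Distributed codes built from toy features [labial, voiced, vowel]:
p_dist = np.array([1.0, 0.0, 0.0])
b_dist = np.array([1.0, 1.0, 0.0])
a_dist = np.array([0.0, 1.0, 1.0])
print(cos(p_dist, b_dist), cos(p_dist, a_dist))  # /p/ comes out closer to /b/ than to /a/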
>>: So that’s really interesting, but it seems like if you are just doing this in some sort of
non-constructive way, just randomly taking this basis and so on, then you are at the whim of chance as to how much similarity you are going to get embedded. Is there a way to
maximize the amount of similarity you are embedding, sort of the opposite of the ICA in
a way?
>> Paul Smolensky: Well I don’t necessarily advocate maximizing similarity for its own
sake even if it is the distinguishing feature from local representations. What I would say
simply is that the architecture should admit distributed representations. So a learning
algorithm, using the architecture, will be able to decide what the similarity structure
ought to be for the demo.
>>: Then my follow up question is what sort of distributed representation are you using
rather than the 1 hot vector? It should be something that doesn’t confuse you, but still
maintains a lot of similarity and is done in an automatic way to get a good representation,
which is distributed in a way that is going to help you with similarities. Clearly if it is very close to local it is going to have a little bit of similarity, but not much. So you want
something that is further away from local.
>> Paul Smolensky: Maybe or maybe that’s the right answer. I have no clue. I mean you
guys have worked with networks where you learn an embedding such that similarity of
pairs meets some sort of externally given criteria.
>>: That’s in the small dimension we think about how we can do that. So I know this is
the fifth time now.
>> Paul Smolensky: You can –.
>>: So for a small dimensionality like coming from 100,000 to maybe 4,000, 3,000
whatever, in practice we typically use 300 to 400; that already gave enough –.
>> Paul Smolensky: That’s good that you are using distributed representations, and this is
an argument that the architecture should allow it, and your architecture does allow it and
you are taking advantage of it.
>>: Yeah, yeah, I know, but I think –. So I am really much inspired by your earlier
[indiscernible] in phonology that for 3 phonemes you get, what, 8 phonetic features. That
means that [indiscernible].
>> Paul Smolensky: Well I intended the opposite. In reality it’s the opposite. I just
didn’t list all the phonemes; I just listed 3 of them.
>>: In practice indeed the number of phonemes were [indiscernible]. So that actually
follows your philosophy that neurons have more than that.
>> Paul Smolensky: More than you need.
>>: Then of course that has a lot of redundant representation, because I can
[indiscernible]. Now for the word representation that everybody is doing now it is just
completely the opposite. It is just so much lower [indiscernible]. It gives you enough
similarity. So why does the number of neurons have to be bigger than the number of concepts? I think if the number of neurons is smaller than the number of concepts it can still represent everything and do [indiscernible] things. [indiscernible].
>>: So maybe I have something that will help with words. The way that the engineering
community treats words is as if they are –.
>>: One or two dimension.
>>: Right, but as if they don’t themselves have features. But for example we know whether a word is abstract or concrete. We know whether a word denotes a human or not. So that type of feature, which has never been used in the NLP versions of these neural nets to my knowledge, is more like the columns that you just showed for the phones that you are
familiar with: labial and voice.
>>: Or rather some linear combination of those features.
>>: Exactly. So I have yet to see anybody unpack the meaning of a word along these
different axes, the fact that something is human or not [indiscernible] the grammar.
>>: I see, oh okay, okay.
>>: So I am curious why we are not seeing that in NLP.
>>: It’s because we don’t have the inventory of these features for words.
>>: Is it only that you are lacking that inventory?
>>: [indiscernible].
>>: There are some well known things: number, gender, whether it’s human or not, lots
of [indiscernible].
>>: [indiscernible].
>>: It’s linear algebra, like if you take the word embeddings and take vectors describing every one of these and just unwrap them and see whether you get a good fit.
>>: Right and so I was [inaudible].
>>: [inaudible].
>>: In the phonetic world there is a huge redundancy because [indiscernible].
>>: So what I don’t understand is why you accept it in phonology and these abstract ways of describing them.
>> Paul Smolensky: Because Lee has studied phonology. That’s why, because nobody besides Lee accepts that, you realize. He’s the only one.
>>: But I understand why it’s useful because [indiscernible].
>>: But is it only because of these features or do you have like a vocabulary.
>>: We actually do, they are from a dictionary.
>>: So let’s just fit it.
>>: And the dictionary itself is going to be recursive.
>>: But it doesn’t have that for [indiscernible] every word in English. What are the
fundamental features around that?
>>: It does, it does.
>>: So you see this [inaudible].
>>: [inaudible]. Chris has been favoring [inaudible].
>>: [indiscernible].
>> Paul Smolensky: But even if it’s not complete it will probably add some value,
because what you have now is just the sound base.
>>: You just have your word embeddings for all the words, you take half, you find the linear transformation that maps them to these features and then on the other half test whether you are right or not, whether you can say the word is an adjective or not and so on.
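A sketch of that proposed experiment in NumPy; the embedding matrix and the feature labels below are random placeholders standing in for a real embedding table and a dictionary-derived feature inventory (human, abstract, adjective, and so on):

import numpy as np

rng = np.random.default_rng(2)
n_words, d_emb, n_feats = 1000, 300, 10

E = rng.standard_normal((n_words, d_emb))                        # word embeddings (placeholder)
F = rng.integers(0, 2, size=(n_words, n_feats)).astype(float)    # binary feature labels (placeholder)

half = n_words // 2
# Least-squares fit of a linear map A such that E[:half] @ A approximates F[:half]
A, *_ = np.linalg.lstsq(E[:half], F[:half], rcond=None)

# Test on the held-out half: does thresholding the prediction match the labels?
pred = (E[half:] @ A) > 0.5
print("held-out feature accuracy:", (pred == F[half:].astype(bool)).mean())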
>>: But again, in the worst case it’s nicely recursive, because a teacher is a person who
teaches. So teaching is part of the meaning of teacher, but like a professor is a person
who teaches at a university, so now you have got a mixture of the meaning of teaching
and university.
>>: So your idea is to write down the same thing as the matrix that [indiscernible].
>>: And those are dictionary definitions.
>>: So you do that, but then [indiscernible].
>>: So now [indiscernible] because I don’t know the math of it.
>>: Okay, okay.
>>: [indiscernible].
>> Paul Smolensky: Okay so we are past twelve. I have one slide left for this little
section. Should I do it?
>>: Yeah let’s finish that.
>> Paul Smolensky: Okay, so the bottom line was that whatever you can do with local representations you can do with proper distributed representations, and you can program networks that have distributed representations just like you can program the ones that have local representations, once you understand how to write the equations in such a way that they work for whatever coordinate system you choose; then you can go from a local description to a distributed one and back and forth. So there is a section here that just shows you what I have in mind when I say we can program nets with distributed representations, but that’s going to be for another day.
So this is the last slide of this section. So the thing is that the tensor notation is invariant
under the general linear group of invertible transformations. If you write a local
representation equation in tensor notation then the same equation will be valid for
arbitrary proper distributed representations, and maybe the same equation is approximately valid for improper, compressed representations. That’s what we are trying to explore in some of our ongoing work here. Local representations are lossless and so are proper distributed representations, but improper, compressed distributed representations are inherently lossy in some ways and typically require some type of extra noise elimination process that is unnecessary with proper distributed
representations.
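A small NumPy sketch of that contrast, with toy dimensions: with linearly independent (proper) role vectors unbinding a filler is exact, while with more roles than role dimensions (improper, compressed) it is only approximate and needs a clean-up step such as snapping to the nearest known filler:

import numpy as np

rng = np.random.default_rng(4)
d_filler, n_roles = 20, 6
fillers = rng.standard_normal((n_roles, d_filler))

def bind(fillers, roles):
    # tensor product representation: sum_i filler_i (x) role_i
    return sum(np.outer(f, r) for f, r in zip(fillers, roles))

def unbind(T, roles, i):
    # contract with the (pseudo-)dual of role i
    return T @ np.linalg.pinv(roles)[:, i]

# Proper roles: 6 linearly independent vectors in a 6-dimensional role space -> exact recovery.
roles_proper = rng.standard_normal((n_roles, n_roles))
T = bind(fillers, roles_proper)
print(np.allclose(unbind(T, roles_proper, 2), fillers[2]))    # True

# Improper, compressed roles: 6 roles squeezed into 3 dimensions -> noisy recovery.
roles_comp = rng.standard_normal((n_roles, 3))
T = bind(fillers, roles_comp)
noisy = unbind(T, roles_comp, 2)
print(np.allclose(noisy, fillers[2]))                         # False: there is crosstalk
# Clean-up: snap the noisy result to the nearest known filler (usually recovers index 2).
print(np.argmin(np.linalg.norm(fillers - noisy, axis=1)))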
So that was all to say that the invariance properties of computation in neural networks lead to the conclusion that the natural and right architecture, as far as I am concerned, should be one in which the principles are invariant under change of coordinate description. In order for equations to have that property you should use tensor notation, and that will enable you to use distributed representations, not just for fillers but also for roles, and therefore get generalization across roles in the way that we are more familiar with generalization across fillers. So just as you generalize from one symbol to a similar symbol, you generalize from one position to a similar position, or something like that. Okay.
>>: Okay, that’s wonderful. So for the last point you are making here, you would probably refer to this other common type of distributed representation, like how to eliminate the actual noise?
>> Paul Smolensky: Right, yeah.
>>: [indiscernible].
>> Paul Smolensky: Typically, yeah, I would say. I don’t know that there are really principled ways to do that; maybe Tony would say that his method is principled. Then I can see an argument for that.
>>: I see, okay thank you very much.