>> Li Deng: Thank you very much for coming for Paul’s presentation today. He’s expected to give a series of presentations; today’s the very first one. Let me give a very brief introduction to Paul. Everybody knows him; I don’t know how many have actually read the chapter that he wrote in the PDP book, an eighties book. It’s a very famous little chapter from the mid eighties on what he treated at that time as a kind of dynamical system; now people call it the Restricted Boltzmann Machine.
Paul has spent almost all his lifetime developing these two theories. One is the dynamic system theory related to the Boltzmann Machine. The other one, it’s fair to say, is the TPR theory. Then there’s a combination of those that gives rise to a link between symbolic and neural computation. The connection between the two has been extremely influential.
He has received a series of awards, including the most famous and very prestigious one, the Rumelhart Prize. Everybody really respects him for these. We took the opportunity to have Paul visit for about four months, until, I think, the first week of December. He will have plenty of time to spend here.
We have been engaging Paul to do research with us on applying a number of his theories to our practical problems here. Without further ado, I will give the floor to Paul to give the very first lecture in the whole series on some of the cognitive science and neural network research that he’s been working on. Okay, thank you.
>> Paul Smolensky: It’s been a great time here already. I’m really appreciative of the opportunity.
Thank you very much for your hospitality.
Let me start off with an exchange that sort of sets the main agenda for this series of talks. This was an exchange between two giants in the field at a plenary lecture. A pioneer in computational linguistics asked a founder of the deep learning field the following question: shouldn’t your DNNs for language processing have structured representations, like tensor product representations or something? I’ll be talking about those below.
The answer from the DNN researcher was, well do you want a pretty theory or do you want a system
that works? For me where I sit the answer is completely obvious. We want a pretty theory.
[laughter]
We already have seven point four billion systems that work. What we want to know is how do they
work? A focus of the research that I’ll be talking about is trying to understand and not just get
performance out of neural networks.
>>: What is that seven point four? Is that human, the brain?
>> Paul Smolensky: That’s the number of humans on the earth as of October.
[laughter]
>>: Paul, what did the deep learning expert mean? Why did they view a pretty theory and a system that works as competing attempts?
>> Paul Smolensky: I wasn’t there. I hesitate to elaborate too much, or speculate too much. But I think that there’s a pretty obvious asymmetry in the neural network community in terms of the amount of effort that goes into trying to get good performance versus trying to understand how the networks work, how they manage to get that performance.
Okay, so the stage for that kind of interaction, that question, comes from a shocking development in the last few years, which is that there are some DNNs that actually produce rather impressive English. There are lots and lots of examples one could give. Here’s a little example from local work: a nice figure caption, "a little girl brushing her teeth with a toothbrush." Or real live online translation into English and other languages from Skype Translate; just quite dazzling.
The thing is, we do not know how the networks do it. We could go a couple of different ways. Following long-standing AI tradition we could decide to ignore an entire academic discipline that’s been devoted to understanding what it takes to produce great English. Or we could ask: how can linguistics help us understand what these networks are doing?
In my view linguistics defines the state of the art in understanding what great English is and how it can be produced. It’s not good enough to merely understand how a network minimizes some error function defined over neural activities. That doesn’t qualify, doesn’t meet the state of the art by any stretch of the imagination in terms of understanding how English is produced. The success of these networks on language does not mean the end of theoretical linguistics, despite what some people may have hoped, because we need linguistics to understand these networks.
If we ask linguistics and traditional approaches to natural language processing, for that matter computational linguistics as well, we’ll be told that there’s a lot of belief in a hypothesis that says producing great English requires abstract structural knowledge, ASK. According to this hypothesis, if we want to understand a network that’s producing English we should look inside and see how it encodes this abstract structural knowledge.
Now there are several questions to ask here. One is whether this hypothesis is correct, whether we should believe it. Another is how on earth we could look inside a network to see if the abstract structural knowledge that linguists say we need is in there. Finally, you might wonder what I actually mean by abstract structural knowledge.
Let me take a little diversion on that point, starting with a very, very simple example, deliberately avoiding all the kinds of syntactic subtleties that linguists love to use in their examples. Here is a very simple noun phrase: troubled adolescent hacking expert. It took me a long time to come up with that; it has lots of readings.
Okay, the meanings include things like: an expert on hacking by troubled adolescents; an expert on hacking by adolescents, who is themselves troubled; an expert on hacking into adolescents’ accounts, who is troubled; a hacking expert who is herself a troubled adolescent.
There are lots of other readings, and the point is that if neural networks are going to be able to process even relatively simple kinds of expressions like this, they need to be able to make all these distinctions somehow. Obviously, a bag of words is not enough. Obviously, a sequence of words is not enough, because all of these readings come from the same sequence of words.
There are other things that need to be invoked in order to make these distinctions. In the symbolic linguistic type of tradition you might say that the difference between readings one and two could be related to different implicit groupings of the elements in the phrase. In the first case adolescent is grouping with troubled, in the second case with hacking, so maybe some kind of grouping structure could be helpful in making these distinctions.
The second and third readings have pretty much the same grouping. They differ, however, in the implicit relations holding between the elements in the group: in the first case the adolescent is the agent of hacking, in the second case the patient of hacking.
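To make the grouping point concrete, here is a minimal sketch in Python; the encoding is my own illustration, not from the talk, with hypothetical bracketings for two of the readings. It shows that both readings flatten to the same word sequence while the structures remain distinct.

```python
words = ["troubled", "adolescent", "hacking", "expert"]

# Reading 1: adolescent groups with troubled -- [[troubled adolescent] hacking] expert
reading_1 = ((("troubled", "adolescent"), "hacking"), "expert")
# Reading 2: adolescent groups with hacking -- [troubled [adolescent hacking]] expert
reading_2 = (("troubled", ("adolescent", "hacking")), "expert")

def leaves(t):
    """Flatten a grouping structure back to its word sequence."""
    return [t] if isinstance(t, str) else [w for child in t for w in leaves(child)]

print(leaves(reading_1) == words)   # True: structure 1 yields the original sequence
print(leaves(reading_1) == leaves(reading_2))  # True: same sequence of words
print(reading_1 == reading_2)                  # False: the structures differ
```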
Just to repeat: neural networks somehow have to have representations that can make all these distinctions, and of course many, many more. Maybe they can do it by implementing symbolic representations like the ones I’ve drawn here. Or maybe they’ve discovered entirely different ways of making those distinctions. That’s what I would like to understand. That was a little digression on the kind of thing that makes linguists believe in hypotheses like this, that we have to have some kind of abstract structural knowledge in order to even get off the ground producing English.
The second question here, how to look inside, I’ll be talking about in some detail for the rest of the talk. We’ll put that aside for the moment and pass on to the question of whether this hypothesis is to be believed: that in order to produce great English we need to have abstract structural knowledge.
I see several possibilities here. I’m curious to know what the actual state of affairs is. Do these networks actually produce great English? Here I’m reminding you that I think it’s linguistics that defines the state of the art in understanding what great English is. In other words, linguistic analysis assumes that certain sentences of English are part of the language. Is it the case that these networks are actually producing the kinds of sentences that cause linguists to believe in the kinds of structures that they have put in their theories? Or in fact do the networks fail to display that kind of competence?
If indeed they don’t display the kind of competence that causes linguists to believe in abstract structural knowledge, then they’re not going to help us decide whether or not this hypothesis is true. But if they are producing at least many of the kinds of structures that linguists believe require abstract structural knowledge to cope with, then there are two possibilities.
Does the network have this knowledge in it or not? If it doesn’t, then we can conclude this hypothesis is wrong; that we’ve been somehow misled all along into believing that structure is important for language, when somehow it’s not. On the other hand, if the answer is that looking specifically for the kind of knowledge that linguists propose we actually find it, then of course that vindicates the hypothesis quite strongly.
My first goal in coming here actually was to try to find out which of these possibilities is actually the case. That involves studying the competence of these networks to see whether they have a command of the relevant structure-demanding (according to the hypothesis) constructions in English or not. If they do, then can we see evidence that they have somehow found a way of acquiring, storing, and using that knowledge?
Yes?
>>: By great English do you mean human like English?
>> Paul Smolensky: Yes, yes, so I was being a little bit glib, just because there are specific reasons why specific kinds of structures are believed in. If the networks aren’t producing those kinds of sentences then of course no linguist would expect to find that structure in there.
Yes?
>>: What do you mean by [indiscernible]?
>> Paul Smolensky: Well, that by taking it as a working hypothesis that there is a kind of knowledge in these networks that linguistics says is necessary, we actually find out something about the networks we wouldn’t have found out otherwise. It’s actually led us to achieve a greater degree of understanding than we could have achieved without it.
>>: I was a bit confused. Maybe I didn’t have enough coffee so my logic isn’t up to speed. Is a vindication the same as retaining a hypothesis?
>> Paul Smolensky: Is it the same as…
>>: Retaining a hypothesis?
>> Paul Smolensky: Retaining it, well…
>>: Or proving the hypothesis?
>> Paul Smolensky: Well, you can’t really prove a hypothesis that something is required, I don’t think.
>>: It is [inaudible]?
>> Paul Smolensky: Yeah. Okay, alright, so as I said this is one of the two goals that I had in coming
here. I would really be delighted if anybody else was interested in pursuing this question.
>>: I’m, just a comment here. That really doesn’t mean DNN, right, because for the image captioning problem many of those sentences are actually coming from the training set, so they don’t really count [indiscernible]…
>> Paul Smolensky: Yeah, yeah, well…
>>: I think the real network you’re linking to is the recurrent neural network. That by itself will produce reasonably good English, injecting kind of a [indiscernible] using a recurrent neural network to generate character strings. They’re very often [indiscernible].
>> Paul Smolensky: Right.
>>: I think that’s probably better…
>> Paul Smolensky: Yeah I agree that’s really the kind of thing I had in mind actually. Yes?
>>: You mentioned the [indiscernible] structure being necessary. Did you mean sufficient?
>> Paul Smolensky: I actually meant necessary.
>>: Okay, right.
>> Paul Smolensky: Sufficient I think is less controversial. Okay, so…
>>: But then the hypothesis is very strong, right? I mean, it cannot be proven that without the representation [indiscernible]?
>> Paul Smolensky: I don’t think that can be demonstrated. All we can ask is whether the existing systems that we have in front of us are counterexamples or not. Because the conventional view is that there’s a very strong segregation between knowledge in neural networks and knowledge of the sort that symbolic linguistic theories involve. Not only does the hypothesis have to be that, but that knowledge isn’t there and maybe even couldn’t be.
But that needs to be pursued. The whole point of the research I’m going to tell you about is how to
bring these two kinds of knowledge schemes together.
>>: In like the vision community there has been a lot of work. People do really try to sort of look at
different parts of the network and see what is being encoded in the network.
>> Paul Smolensky: Yes.
>>: I’m not familiar with the NLP community that much. But does the community really feel that there is no, I mean your [indiscernible] communities think that these [indiscernible] separate.
>> Paul Smolensky: My knowledge of the vision side is definitely insufficient here. But it does appear that the notion of some kind of receptive field is a useful notion in vision. It has yet to be shown to be useful, I think, in the kind of networks that we’re talking about for producing, let’s say, long sequences of characters or words. There seems to be more ready access to interpreting the computation in networks that are doing visual processing than what we see in these linguistic networks, I believe. If anybody disagrees I’d be happy to be corrected.
Okay, so how are we going to look inside to see if it encodes abstract structural knowledge? Well, I
propose to rely on a theory of integration of neural and symbolic computation that I’ve been
developing. It’s called gradient symbolic computation, GSC.
This first goal for the work is a kind of reverse-engineering application of GSC. But there are also at least as important engineering applications that we are pursuing: to try to see whether we can build networks using the principles of GSC to increase their capabilities in domains like language. The goal of this kind of engineering is to unify the learning power of neural networks and the generalization power of abstract structural knowledge.
Now, connecting neural networks and symbolic computation has been a kind of imperative. That is to say, if you believe that symbolic computation has an important role to play in understanding intelligence, that has been imperative for a long time, or maybe always, in cognitive science, since we do believe that underlying our intelligence is a neural network.
But of course now, since the rise of DNNs, it has I think become an imperative in AI also. That’s a fortuitous confluence, I think. We’ll be much better off as a result of having those two communities working on this problem, hopefully to some degree together.
If the attempts to engineer networks that integrate symbolic and connectionist, or neural network, processing are successful, then what they ought to lead to is increasing the capacity of neural network computation to be interpreted and to be programmed, and increasing neural networks’ capacity to represent and process discrete combinatorial structure, which is what structural processing is all about of course.
On the other side, it should enrich symbolic computation by infusing the power to represent, process, and especially learn continuously varying gradient dimensions of content. I’ll try to emphasize at a couple of points in this talk gradient dimensions of structure, which is a pretty foreign concept in most views of symbolic structure.
Now, this theory GSC unifies neural and symbolic computation in general. We could pursue variants of this hypothesis to understand AI systems based in neural networks that do reasoning or planning, or higher vision. The same belief is out there in many fields outside of language: that successful performance approaching human levels requires some kind of symbolic computation. GSC is not itself really restricted to use in language, but that’s where I’ve pursued it.
Now, today I’ll talk about GSC in general. In the talk on October twelfth I will focus on its applications within language.
Okay, so the outline for the talk is to zero in a bit more concretely on the problem that GSC is supposed to be a solution to; talk about the important role that distributed representations have to play; give a proposal for how to do representation in neural networks that will unify with symbolic computation in an appropriate sense; talk about how to program these networks and the effects that you get from the similarities resulting from distributed representations; and eventually identify a couple of applications that have been pursued in cognitive modeling and in AI, with a few words about the reverse-engineering prospects.
The problem that gradient symbolic computation addresses is a kind of grand unification problem: unifying the symbolic and subsymbolic approaches to artificial intelligence. It attempts to unify the following hypotheses. First of all, the one we’ve already seen: that an insightful, powerful description of cognition is possible when it’s viewed as symbolic computation, but also when it’s viewed as neural computation.
These are two hypotheses that have not worked together too well in the past. Most people have cast their lot in with one or the other. As someone who’s been trying to put them together for thirty years I can tell you that there are not a whole lot of people interested in the prospect. But…
>>: Except…
>> Paul Smolensky: Maybe it will change. Maybe it is changing.
>>: Except this new workshop and [indiscernible] that.
>> Paul Smolensky: Yes, yes there are quite a few things…
>>: [indiscernible] example about that…
>> Paul Smolensky: Yeah, two days, yep.
>>: Okay, and do you…
>> Paul Smolensky: Office of Naval Research is interested.
>>: Survival for this…
>> Paul Smolensky: All of a sudden things are changing, so that’s gratifying. Okay, so as I was just suggesting, nearly everyone is pretty skeptical that some kind of grand unification of this sort is possible. Others might say that it’s not even desirable. But, as a result of the skepticism, I’ve focused mostly on trying to build formal arguments that the proposal I’m making in gradient symbolic computation really does solve the problem.
Here I’m trying to move on to engineering and reverse-engineering models built on these ideas. But that’s new; most of the work has been entirely about developing formal results on its adequacy as a solution to this unification problem. If these applications go well, then maybe some people who have been skeptical of the desirability of this unification may be persuaded.
Okay, so the, yeah?
>>: Do you think that it is possible this symbolic computation just arises simply from the needs of intelligent networks to exchange information? Like through a noisy channel with limited capacity?
>> Paul Smolensky: Well, there are many places in AI and cognitive science where symbolic theories are applied to problems that are not themselves implicated in communication. It could be that the capacity is there because of communication and, being there, is applied to other things. There is certainly a lot of speculation about the relation between the evolution of our ability to do abstract symbolic computation and the evolution of language. I don’t have any opinions about the believability of those speculations. But did you have a particular case in mind?
>>: I was just wondering, given what you’re saying: is there a way to make an experiment like that, where you would see, if you’re developing agents that have to do something in concert; they have to communicate, but you’re limiting their communication. They can’t transfer the entire brain to each other.
>> Paul Smolensky: Right.
>>: Would then something that had a symbolic meaning emerge in the…
>> Paul Smolensky: People have claimed to have produced just such results in just such computational experiments. There’s some reason to believe that in principle things could have evolved that way.
Yeah?
>>: Do you still believe the individual hypothesis that cognition can be purely symbolic or purely neural?
Or do you think that it requires both?
>> Paul Smolensky: Well, the approach that I’ve been taking is that these are both extremely important computational models, but they apply at different levels. Certain questions should be analyzed at the level at which symbolic computation is the powerful model. Others should be examined at the level at which neural computation is a powerful model. Others need both.
>>: You would say that the individual ones are not sort of powerful enough, then, and both are sort of required?
>> Paul Smolensky: Right.
>>: Right.
>> Paul Smolensky: Yeah, yeah that’s what I’ve been trying to do. Put them together because of
perceived inadequacies that they have individually. Yeah?
>>: In a way can it be related to this concept of fast and slow thinking in psychology?
>> Paul Smolensky: Yes, so the work that I’ve done really kind of focuses on something like the
automatic processes that go on largely in parallel, largely unconscious, and which run for a half a second
or something like that. Ultimately that needs to be part of a much bigger architecture that has much
more serial control and so on. I think the kind of dichotomy that you’re referring to is part of that
picture. I don’t have a whole lot to say about the bigger architecture. I’ve just been focusing on what is
something like a primitive unit of parallel computation within it.
Okay, so the inspiration for this kind of unification: unification across levels is easy to see in computer science, where the macrostructure of computation and the microstructure of computation are very familiar levels whose relationship we understand because we’ve built them. We understand the notion that there’s all sorts of potential virtual structure at higher levels of organization than what is physically built.
In physics the same kind of picture is ubiquitous as well: emergent properties of large systems can involve properties that have no appearance at the micro level. Here in cognitive science the corresponding picture I think is the most interesting, because whereas in the computer science case both higher and lower levels are essentially discrete, in physics they’re essentially continuous.
What happens in cognitive science on this approach is that you actually have a transition from a fundamentally continuous to a fundamentally discrete system as you go from the microstructure to the macrostructure. The tools for doing that are actually stolen from both of the other two on the left.
>>: I thought that this division is quite different. I thought in physics, where we need to sample everything, you can do it. You can, you know, move from microstructure…
>> Paul Smolensky: Yes.
>>: You can do microstructures [indiscernible] over here. We have theory that simply says that we could…
>> Paul Smolensky: Here?
>>: We could potentially show that it would be equivalent to each structure, that it completes the structure. Through neural computation you automatically accomplish whatever things you want to do [indiscernible]. I think these are different.
>> Paul Smolensky: Yes, yes.
>>: Do you agree?
>> Paul Smolensky: I agree that’s sort of considerably further down the line as to where we are in this
talk right now, but yes.
>>: Okay.
>> Paul Smolensky: I would agree with that. What the work that I will tell you about proposes is that there’s actually a valuable level in between, in which the objects are tensors and computation involves tensor operations. This is somewhat of a kind of interlingua between the language of activation vectors and the language of symbol structures.
It’s called gradient symbolic computation because the basic element of computation is the gradient symbol structure. Here’s an example of a gradient symbol structure. This is a syllable in which the final consonant is a blend of d and t. It is part of a structure, not just part of a heap; it’s part of a structure, but one involving continuous degrees of activity of different kinds of symbolically interpretable elements.
That’s an example which is actually relevant to one of the applications that we have, which I will get to eventually, today or next time. This picture is a symbolic kind of drawing for something which could also be written algebraically like this; this is identical to that in its reference. It refers to a particular tensor in which a symbol b is encoded as a vector and bound to a vector which assigns to it the position of onset of the syllable; similarly for the nuclear vowel, and similarly for the coda consonant, which in this case happens to be a linear combination of vectors that are interpretable as d and t.
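A minimal NumPy sketch of such a gradient symbol structure; the dimensions, the basis vectors, the vowel symbol e, and the 50/50 blend weights are all illustrative choices of mine, not fixed by the talk:

```python
import numpy as np

rng = np.random.default_rng(0)
fillers = {s: rng.normal(size=3) for s in ["b", "e", "d", "t"]}        # symbol vectors
roles = {r: rng.normal(size=3) for r in ["onset", "nucleus", "coda"]}  # position vectors

# A gradient symbol structure: a sum of filler (x) role outer products,
# where the coda filler is a continuous blend of d and t.
coda_blend = 0.5 * fillers["d"] + 0.5 * fillers["t"]   # blend weights are hypothetical
syllable = (np.outer(fillers["b"], roles["onset"])
            + np.outer(fillers["e"], roles["nucleus"])
            + np.outer(coda_blend, roles["coda"]))
print(syllable.shape)  # (3, 3): one second-order tensor encodes the whole syllable
```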
>>: Yes, just one question [indiscernible] very specific question related to the structure that is under
different d of t. The [indiscernible] over here where it [indiscernible] propagates like from under b in
the lower end of t [indiscernible].
>> Paul Smolensky: Yes.
>>: [indiscernible] issue you actually can’t explain that [indiscernible] would be very different
[indiscernible] application.
>> Paul Smolensky: That is…
>>: Separate?
>> Paul Smolensky: No, it’s woven into the main questions here. This is intended as a representation within the system of phonology.
>>: It’s phonology.
>> Paul Smolensky: Distinct from a system of phonetics.
>>: I want to talk about phonetics.
>> Paul Smolensky: You were talking about the phonetic differences between the realization of this vowel before d and before t. The question for this is whether there is a difference at the level of the phonology, whether there are phonological principles at work shaping that vowel, or whether it’s only the phonetic realization that has contextual effects in it.
There’s reason to think that the phonological grammar needs to be involved here. I don’t know if there’s reason to believe it there. But I think that some people have said so. That might be.
>>: But that would be generalization [indiscernible] would have two versions.
>> Paul Smolensky: Yep.
>>: It could just depend on the b, but depending on t or b [indiscernible] sort of a different level.
>> Paul Smolensky: Right, well, the proper phonological representation might in fact be one where there’s a blend in the vowel as well as in the coda consonant. Okay, and just because it’s a bit of a red herring: all of this is orthogonal to the issue of probabilistic modeling. A computational state in this kind of computation is a probability distribution over structures like this. These are not probabilities. But probabilities are part and parcel of the global state of the computational system. This particular representation will have a probability, as will other representations.
Okay, so looking down the line at this kind of confluence of network-driven and symbolic-computation-driven approaches to intelligence: if the convergence really is achieved, then I think that the representation of knowledge and data for cognition in our century will be about understanding these kinds of representations, what kinds of functions are computed over them, what kinds of grammars evaluate them, and so on.
This will be important for understanding the existing DNNs, I think, and the brain. Then the hope is that it would also be helpful for engineering better DNNs that are built to process these kinds of structures that have interpretations, and not just meaningless activation patterns.
Okay, so at the foundation of this architecture is a cross-level mapping. Here it’s represented with the letter Psi, which takes symbol structures and maps them into a discrete subset of a vector space. We have a space of discrete symbolic inputs to some computation, let’s say, like some syntactic tree for example. We have a discrete symbolic space of outputs, say logical forms of some sort. We have a function we’re interested in that maps from one to the other.
In the GSC approach this function is actually not computed at this level, or at least needn’t be; psychologically speaking, isn’t. Rather, these are high descriptive levels that are characterizing states of a neural network. The mapping Psi tells us which vector in the space of inputs, that is to say the space of activation states of a set of input units in a neural network, is the realization of that syntactic tree; Psi tells us that.
It also tells us what the realization in the output space is, the vector that realizes that logical form. Then GSC provides a way of going from this to this, which closes the loop. It uses natural neural network operations to get the output from the input; it doesn’t use symbolic operations, or it doesn’t use conventional symbolic operations, to be sure.
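Here is a minimal sketch of that closed loop, under simplifying assumptions that are mine rather than the talk’s: orthonormal role vectors, and a toy symbolic function f that swaps the fillers of two positions, realized by a single linear operation:

```python
import numpy as np

# Psi realizes a two-position structure as  sum_i filler_i (x) role_i.
rng = np.random.default_rng(1)
r0, r1 = np.eye(2)                     # orthonormal role vectors (an assumption)
A, B = rng.normal(size=(2, 4))         # filler vectors for symbols A and B

psi_in = np.outer(A, r0) + np.outer(B, r1)    # Psi of the input structure [A, B]

# The neural realization of f: one matrix multiplication, no symbolic machinery.
W = np.outer(r0, r1) + np.outer(r1, r0)       # exchanges the two role bindings
psi_out = psi_in @ W

# The loop closes: the output equals Psi of the symbolic result f([A, B]) = [B, A].
assert np.allclose(psi_out, np.outer(B, r0) + np.outer(A, r1))
```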
The theorems that prove these assertions, about the sets of functions that can be computed in this way, are in my opinion the strongest arguments that GSC really does provide a grand unified theory for AI. I spent a fair amount of effort trying to develop them. But whether they have use in applications is something I’m here to try to understand.
>>: Can I ask a question?
>> Paul Smolensky: Yeah.
>>: Is there something that limits the input to be symbolic? Can you just start straight in the vector space with the input? Simply, could the input be numeric instead of symbolic, or…
>> Paul Smolensky: The idea is that the input is numeric and so is the output. The machine lives down here. This is our theoretical understanding of the machine here. The mapping Psi is really between our interpretation and the state of the machine, as opposed to part of the computational path.
>>: Well, I thought that what you were diagramming is sort of a diagram of how such a machine might solve the task of mapping a parse tree to sort of a syntactic or semantic representation? Am I totally off on that?
>> Paul Smolensky: No you’re totally on, on that.
>>: Okay, so then, if you wanted instead to map, like, you know, auditory inputs to words, can it still do that kind of task? Or is it mostly for symbolic-to-symbolic types of tasks?
>> Paul Smolensky: Well, the idea is to deploy natural neural network operations here. They can be applied to any kind of neural network input that makes sense, whether it is something that had a symbolic interpretation or not. The answer is there’s no need for the input space to have a symbolic interpretation. But to the extent that we’re using GSC to understand what’s going on in the system, then that’s the role that it will play.
>>: Okay.
>> Paul Smolensky: Yes.
>>: Could you please comment on the relation between this framework and, say, the standard kernel machine methods in machine learning, where you would map symbolic structures potentially to a Hilbert space, then perform operations in that Hilbert space, and can map back to a symbolic space. There seems to be some resemblance here. I’m curious to hear about that.
>> Paul Smolensky: Right, well, the mappings that I’m talking about could, I think, be instantiated in the way that they are done in some of the proposals you’re talking about. I would say that the relation is that some of them at least form a case of this. Yes?
>>: This doesn’t necessarily contrast with any existing state-of-the-art neural network architecture, right? Like this looks similar to, like, what they’ve talked about on the [indiscernible], right. That does fall within this framework, right, where you map two things to representations, then take the dot product between them; maybe it’s like an image and a caption for that image, and you map them both to a shared embedding space. Is that a special case of this?
>> Paul Smolensky: So, the mapping from a sequence: we might be taking an input image and an input sequence of words…
>>: Yeah.
>> Paul Smolensky: Scoring whether that’s a good caption for that image, or something like that.
>>: Yeah.
>> Paul Smolensky: Then indeed the process of taking the sequence of words and producing a vector to encode it does have some of the character of this path. The way that GSC does it is a parallel approach and not a sequential one. It might very well prove wise to try to expand that aspect of the program to at least incorporate that degree of sequential processing. But it’s really been focused on parallel processing so far. The mapping from the words to the vector is a parallel one and not a sequential one.
>>: Okay, so you’re not defining GSC as anything that has these properties. This is just the high-level [indiscernible], and in GSC you have defined specific functions for these vectors…
>> Paul Smolensky: Yes, yes, yeah I’m going to instantiate these things specifically. But it’s certainly a
very worthwhile question whether there’s a general picture that would also encompass the cases you’re
talking about.
>>: Yeah, so we did get some questions earlier. Just want to know if your, you know, your comments are [indiscernible] comments. We did expect the very first question as to whether the input, you know, for this mapping has to be structured enough [indiscernible]. Because [indiscernible] but there’s a big difference, you know, [indiscernible] project so far. If we map directly from a continuous vector on the left corner up there to the symbolic structure that we wanted, then you miss this; you have lost the advantage of isomorphism between structure and structure. Therefore it’s more difficult compared with if you take advantage of the input structure; you might actually get to hide things. It would be more efficient…
>> Paul Smolensky: Yeah, so I think that that’s a good instance of this general hypothesis: that linguists talk about a mapping between syntactic representations and semantic representations as embodying some kind of isomorphism. That’s one reason to believe in the syntactic structures. To the extent that that underwrites, you know, sound competence in language, then you might think a machine would benefit from taking advantage of it.
>>: I see, so the theory actually requires that you need to have an isomorphism in order to have this [indiscernible], or is there an alternative?
>> Paul Smolensky: Well, the isomorphism to the extent that it is, is between this and this rather than
between this and this.
>>: [inaudible], okay.
>> Paul Smolensky: Yes, yes?
>>: One more follow-up on the isomorphism. Originally you showed us the mapping little f, which maps purely in symbolic space. Then you said we want to go to this mapping big F, which is done on a vector space. You’re saying these two things can be isomorphic. But in general, are you saying they need to be exactly isomorphic? The reason I’m asking is maybe you want big F to be more flexible, or maybe you say no, no, really I want to do these operations directly in vector space but they should be exactly equivalent to little f.
Like in machine translation, for instance, there’s this famous pyramid where you go all the way up to an interlingua and back down; turns out that doesn’t work very well. We went down to syntax, a slightly less restrictive symbolic representation. Every time we had to go down the pyramid toward more concrete representations we seemed to get through this. Do you think we want to be doing exactly symbolic representations, or do we need flexibility?
>> Paul Smolensky: The philosophy of the research program is that if it’s possible to do an exact instantiation of these maps, then you’ve demonstrated that the apparatus down here can do symbolic computation, okay. Now, how you want to use it is a second question. But what you know is that it has that kind of capability.
The idea for giving a more useful description of cognition than you already have from the symbolic description lies precisely in the way in which this mapping is going to be richer, and not slavishly implementing this mapping up here. For example, the gradient representation of the syllable I showed a moment ago doesn’t have a discrete counterpart up here. It can’t be part of a theory of a task up here. But it certainly can be down here.
To say that the vector down here is some sort of blend of t and d is to be taking advantage of this mapping, but doing it in a way that goes beyond what the symbolic discrete representations themselves can accommodate.
>>: Seems like [indiscernible] is in as a subset then.
>> Paul Smolensky: Yeah, yeah, right. Rich.
>>: When you go from left to right you’re going to use natural neural net operations. Do you allow for
intermediate states which might have no semantic, no symbolic interpretation?
>> Paul Smolensky: Absolutely, I mean, yeah?
>>: By the vector space of the states you mean actually all of the states, is that correct?
>> Paul Smolensky: Right, that’s what this box refers to, yeah.
>>: [indiscernible]…
>> Paul Smolensky: Some discrete subsets of them are the images of these guys up here. Yeah?
>>: Is it true that if the [indiscernible] are bijections, then there exists a big F that will, you know, just map them perfectly, right, just map those points to some other points?
>> Paul Smolensky: Well…
>>: If we allow a neural network to be rich enough it will be able to implement that mapping.
>> Paul Smolensky: Right, right, so the idea is to actually concretely exhibit natural instances of this function here that compute interesting and useful functions here, rather than just, say, have a constructive proof that there is something that will do it. We want to actually exhibit it and determine what kind of neural network capabilities are required to do the computation. Yeah?
>>: You mentioned that the mapping between the space of discrete symbolic inputs and the vector space happens in parallel. But I would think that if you’re preserving something about symbolic linguistic structure it would happen, like, hierarchically. Are you going to speak more to what that process is, maybe in later slides?
>> Paul Smolensky: Well, the idea is that it applies in parallel to the different levels of the hierarchy
here, if that answers the question.
>>: It is hierarchical? Different levels of the hierarchy done in parallel?
>> Paul Smolensky: Right. Okay, so let me go on a little bit further.
>>: [indiscernible].
[laughter]
>> Paul Smolensky: Funny how that works.
>>: Yeah, so I was waiting for people. This is for text only, right. But think about images, right. An image, I mean, normally we get a continuous input. But something that you can [indiscernible] you really can get a silhouette. You want to extract what is [indiscernible], right.
>> Paul Smolensky: Okay.
>>: Then this Psi is now, sorry, I’m trying to sort of map this to sort of visual inputs. Is Psi features?
>> Paul Smolensky: Well I think…
>>: Is Psi a feature function?
>> Paul Smolensky: The thing that makes most contact with this in my limited thinking about vision is that if you had something like a description of an object in terms of its parts and the relations of the parts, that would be here.
>>: Yeah.
>> Paul Smolensky: Then you could have something much closer to an image down here, which this is going to be mapped to, and it would serve as a means of interpreting the image state down here. That’s the closest I would get.
However, it might well be that the vector space here should be not pixels but some other features of the
image space that are much more suitable for instantiating the abstract object representation. That
would hardly be surprising.
Okay, so let’s see, maybe I will skip this because I just wanted to make sure that this didn’t get gone over
too fast.
[laughter]
It is not a problem. Okay, although this does make a point: when I say natural neural network computation, the first group of results I have applies to linear operations from here to here. Just multiplying this vector by a matrix to get that vector; there couldn’t be a much more straightforward, simple neural network operation than that. But you can compute a very interesting set of functions, symbolic functions, that way, as it happens.
Okay, so now this is getting back to the issue of the role of the approximation aspect of the picture, and how slavish it’s supposed to be to the cruel discrete master on top. The idea is that the vectors that are in fact the image, under this realization function Psi, of the discrete structures form a discrete set of points in the vector space. That’s called D here.
Then of course there’s the whole rest of the space. Those points encode proper gradient symbol structures. By proper gradient what I mean is not discrete: really involving linear combinations of symbols in positions in the structures.
There are several kinds of uses that those states get put to. The weakest sense of their usefulness is as transient states in dynamic versions of the computational system: not like the one I showed you, which is a simple linear mapping, but in recurrent networks that have a dynamics to them, in which those are intermediate states in the processing toward the final output, which is nonetheless a fully discrete state in this set D.
Somewhat more interesting are cases where we want to say that the final state of processing, not just the transient intermediate states, is off this discrete set but near to it. These can be used to model variation within a category of outputs, variation that can be made distinct in ways that the symbolic counterparts can’t.
What I showed you in the gradient structure for the syllable was in fact a case of this, where we have a network with a grammar that says that final stops need to be voiceless, which is true in a language like Dutch, where indeed a final d is pronounced as a t. Except it turns out that the t is not pronounced identically to an actual t; it’s a little bit more d-like. That’s modeled as a point a little bit off the set of discrete structures, with a mild blend of d mixing into the t.
Then…
>>: [indiscernible], are there examples in syntax about that probability you just mentioned in
[indiscernible]?
>> Paul Smolensky: Then the most radical use of gradient symbol structures is when the final states are not even close to discrete ones. For example, some people believe in some kind of shallow parsing, or incomplete parsing, or good-enough parsing: when people get a sentence they don’t actually drive all the way through to some nice clean discrete parse of it. They leave unresolved a number of ambiguities that would need to be resolved in order to get a fully discrete state. What they have is actually a blend of partial parses. That’s an example. Yeah, I’ll mention another one later if I ever get there.
Alright…
>>: [indiscernible] this probably [indiscernible]…
>> Paul Smolensky: It’s the newest part. It’s the newest part. It’s newer than the book The Harmonic
Mind, which has only the discrete structure aspect of it.
Okay, so now eleven thirty. This calls for some recomputing.
[laughter]
When should I stop talking? It’s eleven thirty one according to the clock.
>> Li Deng: I think we can up for [indiscernible].
>> Paul Smolensky: But we had a lot of discussion already I guess. You guys have had the floor more
than I have, right? I get to have some of it back.
[laughter]
Okay, so…
>> Li Deng: I guess finish that session for today…
>>: Maybe working on the clock part two.
[laughter]
>>: We already have talk part two. I can bring out talk part three.
[laughter]
>>: But one [indiscernible]…
[laughter]
>> Paul Smolensky: Okay, so let me see here.
>>: Probably [indiscernible]…
>> Paul Smolensky: Okay, so I think I’m going to skip that point, with the hope of touching on it some other time. I was really interested to get your take on this argument, which I haven’t presented before because it always gets left out, for the same reason it’s being left out right now, which is that it’s a bit of a tangent.
Okay, so let’s talk about the actual proposal for what the mapping Psi is. That’s here, I guess. Okay, so part three: the proposal for the representational scheme. It’s based in tensor calculus. The part I just skipped includes an explanation of why tensors. But you’ll just have to believe that if you were willing to sit for long enough you would get the answer to that question. But both of us know you’re not.
An nth-order tensor is, very simply, thought of as an array with n dimensions, so that its elements have n subscripts or indices which identify the individual numbers in the array; apologies to mathematicians for that definition. There are two basic operations: the outer product, which increases the order of tensors, and the inner product. I’m sorry, here’s the outer product first. If we have two tensors, one which has n indices and one which has m indices, then the outer product, also called the tensor product, is something which has n plus m indices.
This is the symbol for the tensor product. But over thirty years I’ve come to the conclusion that people freak out when they see that symbol too much. I’ve purged it; everything is written in a new notation. I apologize in advance for mistakes that may have arisen as a result. You will not see the tensor product symbol again, I think.
I’m just going to write the tensors next to each other, like that. I don’t know why people freak out about that symbol. I don’t know whether there’s something about being both circular and angular that kind of…
[laughter]
That people don’t know how to relate to it or something, I don’t know. I’m not sure what it is. But we can dispense with it. The definition is pretty simple: the n-plus-m-order tensor is just the numerical product of the elements of the first and the second tensor. There are n plus m indices required to take into account all possible products of the elements: the n indices of the first and the m indices of the second.
The other basic operation is contraction, which decreases the order of the tensor. Here is an awful looking formula which is trying to say something very simple; I should have just written it in words. This operation here is the contraction over the ith index and the jth index of this tensor, which has order n plus two. The outcome is a tensor of order n, because the ith index and the jth index of this tensor are both eliminated. They’re eliminated in the following way. Here’s where the ith index goes; here’s where the jth index goes.
We just set them both equal to one. Then for each setting of the remaining indices we have a number. Then we just add that to the number that we get by replacing them both by two, and then by three, and by four. We sum, over all possible values of q, the value of the tensor with q in the i and j positions. If we only had two indices to begin with, so that T was a matrix, we would be taking the sum of the diagonal elements; we would be performing the trace. It’s a generalization of the notion of the trace of a matrix to higher-order objects.
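A minimal sketch of contraction as a generalized trace, using NumPy’s trace over a chosen pair of axes (the shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
T = rng.normal(size=(3, 4, 3))                # order-3 tensor

# Contract the first and third indices: set them equal to q and sum over q.
contracted = np.trace(T, axis1=0, axis2=2)    # order drops from 3 to 1
assert np.allclose(contracted, sum(T[q, :, q] for q in range(3)))

M = rng.normal(size=(5, 5))
assert np.isclose(np.trace(M), sum(M[q, q] for q in range(5)))  # matrix trace is the base case
```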
There are many important special cases. The dot product is the simplest one. If the tensors we are talking about have only one index, think of them as vectors A and B. Then, that’s interesting, how did I get two and three? Curious, it should be one and two. There are two indices here. If we set them both equal to q and sum over all possible values of q then we get a simple number. That’s the dot product of these two vectors.
The matrix product is another special case. In this case we have A and B both having order two; they’re matrices. The outcome has order two because we add the two orders together, four, and subtract two, which is what this does. There are four indices here, the two of A and then the two of B. We take the second index of A and the third index of the whole thing, which is the first index of B, set them equal to each other, and sum. Then we get the matrix product of A and B. That’s just this equation, if these are interpreted as matrices.
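Both special cases can be checked directly; a minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(3)

# Dot product: contract the only two indices of the outer product of two vectors.
a, b = rng.normal(size=(2, 4))
assert np.isclose(np.trace(np.tensordot(a, b, axes=0)), a @ b)

# Matrix product: contract the middle pair of indices of A (x) B.
A, B = rng.normal(size=(2, 3)), rng.normal(size=(3, 5))
outer = np.tensordot(A, B, axes=0)                   # order 4, indices (i, j, k, l)
assert np.allclose(np.trace(outer, axis1=1, axis2=2), A @ B)
```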
In general, if we have two tensors there are many inner products of the two, depending on which indices we contract over. But the idea is that if we have an n-order and an m-order tensor, we take their outer product, and then we contract over one of the indices of A and one of the indices of B, then we’ve performed something called the inner product over i and j, which has decreased by two the order of the product of these two guys.
That’s the inner product. The other thing was the outer product. Those are basically the only two things that we need, because the symbolic operations in this approach are implemented as binding by the outer product and unbinding by the inner product.
I’ll just give you a bunch of examples on the next slide of what I mean by binding and unbinding. The idea is that when we perform this product here we are in some sense sticking together whatever it is that A represents and whatever B represents. We’re binding them together; we’re referring to something that is conjoint between them. Then later, if we want to know what it was that was bound with B, we want to unbind its partner and pull out A.
That’s…
>>: The outer product is only one example of the [indiscernible]. There are other ways of binding using different types of [indiscernible]?
>> Paul Smolensky: Not that I’m aware of.
>>: It’s probably the other one actually.
>> Paul Smolensky: Not that I’m aware of. Okay, so binding by the outer product. Here’s a bunch of examples. There’s a nice elegant general approach to this, but people don’t seem to understand it, so I’m going to avoid it; this is a bunch of simple examples instead. But maybe you can believe me that they’re all instances of one proposal.
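Before the examples, a minimal sketch of one bind/unbind cycle, assuming, as an illustrative simplification, a unit-length role vector so that unbinding is exact:

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.normal(size=5)                 # vector for whatever A represents
B = np.array([1.0, 0.0, 0.0])          # unit role vector (an assumption)

binding = np.tensordot(A, B, axes=0)   # binding: the outer product, order 2

# Unbinding: the inner product with B contracts away B's index and pulls out A.
recovered = binding @ B
assert np.allclose(recovered, A)
```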
Attribute-value binding: if we have an attribute-value structure, it’s a whole bunch of attributes which are bound to values. Here we have some big structure, one little part of which says that the agent of this event is J. The proposal is that the vector that represents that attribute-value binding is just the outer product of the vector that represents agent and the vector that represents J.
Another way to write that, a little bit more perspicuously, is just like this. It’s just the outer product of the A tensor and the J tensor that encodes the binding together of agenthood and the individual J.
In a graph we have links which may be labeled, so we have an edge, which I’m assuming is labeled; it joins, it goes from A to B. We represent this object as the vector which is the outer product of the three vectors that individually encode the symbol on A, the symbol for the relation, and the symbol on B; again, more perspicuously written that way.
Relations: this says that X and Y stand in the relation R. Then the same notion, just as above: an outer product of the vectors encoding the individual symbols forms the binding together into a single relation.
The next one is a little bit different in character. Now we’re talking about binding a symbol to a position within a string. This is a description of the binding of one of the symbols, X, to one of the positions, namely the second position. The vector for the X part of this is the outer product of a vector which represents the second role, the second position, and the vector that represents the symbol that fills that role, X.
That can be extended to trees. The way I found convenient to do that is to think about positions in a binary tree labeled by binary sequences. The position of X here is the left child of the right child of the root of the tree; that is the position that X fills. We take the outer product of these two.
It doesn’t matter which order you pick; everything will be isomorphic across the two orders. You just have to be consistent, in your use of all the bindings of a particular type, to have a particular ordering chosen. It doesn’t matter whether we put R in the middle or here, as long as we’re consistent.
It doesn’t matter whether we put R on the right or the left. But for the formula I’m about to show you it’s more convenient for it to be on the right. This is the more abstract notion that the position which is the left child of the right child of the root is itself a kind of binding. It’s recursive. The idea is that this particular position is gotten by binding the left child sub-position to the right child position.
This product is a way of expressing the relationship between this position in the tree, the left child of the right child of the root, and the basic vectors; there are only two of them for binary trees, and they can then be used to generate vectors for all of the positions by the same recursive procedure. That captures the recursive structure of trees, in the sense that recursive functions defined over binary trees can be straightforwardly computed by linear operations when the recursive character of the data structure is encoded in this way, in the vectors that encode the positions in the tree.
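A minimal sketch of these recursive tree roles, with the two position vectors chosen orthonormal for illustration so that unbinding is exact (the talk requires only linear independence), and one consistent ordering convention assumed:

```python
import numpy as np

# Two basic position vectors generate roles for every node of a binary tree.
r0 = np.array([1.0, 0.0])              # left child
r1 = np.array([0.0, 1.0])              # right child

# Role for "left child of the right child of the root": a recursive binding
# of the two basic vectors.
r01 = np.tensordot(r0, r1, axes=0)     # order-2 role tensor

X = np.array([2.0, -1.0, 0.5])         # an illustrative filler vector for X
binding = np.tensordot(X, r01, axes=0) # X bound to that tree position, order 3

# Unbinding peels off one level of the role at a time with inner products.
level1 = np.tensordot(binding, r1, axes=([2], [0]))   # remove the right-child index
assert np.allclose(level1 @ r0, X)                    # then the left-child index
```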
Okay, and in a proper tensor product representation, and this is intended to be a technical term here, all the vectors that encode the symbols are linearly independent of each other.
I think one of the things that got skipped in the second part is a very important point which I only recently really appreciated. All my life I’ve been kind of living in the cognitive context, where the number of neurons is large compared to the number of concepts. There is a one-to-many relationship that’s usually talked about in the neural realization of concepts: many different neural patterns could instantiate a given conceptual one.
But it turns out that, as fantastic as the computers in this building are, they cannot actually cope with as many processing units as we have in the brain. Now I find I’m constantly being told I have to have fewer neurons than I have concepts, not many more. In order for the vectors encoding symbols to be linearly independent you have to have at least as many neurons as there are symbols; basic property of linear independence, you can’t do it otherwise.
The work that we’re doing now involves improper tensor product representations, which do a kind of compression, in which the pattern over neurons that encodes a symbol is not independent of the other patterns that encode other symbols, because there’s just not enough space when you compress the code down to a number of neurons that’s small compared to the number of symbols being encoded.
>>: You mean compression or contraction? Or [indiscernible] is the way [indiscernible] contraction? If
a…
>> Paul Smolensky: I mean compression, and in this case it’s not necessarily achieved by contraction. It’s just a statement that you can’t have linearly independent vectors for all your symbols, because your network isn’t big enough for that.
The mathematics that I’ve done for showing what functions can be computed, and so on, has always been in the context of proper tensor product representations, which I think is sensible for the brain but not necessarily for applications where the number of concepts equals the number of words of English, or something like that. Yes?
>>: If you require the vectors to be linearly independent does that take away any advantage of your distributed [indiscernible] of [indiscernible]? When people say you [indiscernible] they won’t take advantage of that.
>> Paul Smolensky: Yes.
>>: [indiscernible]
>> Paul Smolensky: Okay, so I know that you guys are keen about the advantage of distributed representations, that you can have more than n patterns with n units. That is true; it is an advantage. But it’s not the only one. Another one that has been studied is that only distributed representations have non-trivial similarity structure to one another; two vectors can be more or less similar.
But if you have a local representation, where each concept is associated with a single unit, then all concepts have zero similarity to all other concepts. Similarity is such an important factor in cognitive generalizations and so on. That’s an important feature of distributed representations even when they are not compressed relative to the set of symbols.
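A minimal sketch of the similarity point, with made-up illustrative patterns:

```python
import numpy as np

# Local (one-hot) codes: every pair of distinct concepts has zero similarity.
cat_local, dog_local, car_local = np.eye(3)
print(cat_local @ dog_local)                 # 0.0

# Distributed codes (hypothetical patterns): similarity is graded and non-trivial.
cat = np.array([0.9, 0.8, 0.1])
dog = np.array([0.8, 0.9, 0.2])
car = np.array([0.1, 0.1, 0.9])
cos = lambda u, v: u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
print(cos(cat, dog) > cos(cat, car))         # True: cat is more similar to dog
```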
>>: Do you require the linear independence only for the symbols, or also for each role, like a tree position or a recursive tree position?
>> Paul Smolensky: It’s required for the atoms, so if these two are linearly independent of each other then all of these guys will be as well. It follows; it doesn’t have to be stipulated separately. But it is a property of the atomic elements that are being combined that they be linearly independent, not necessarily of the composites that you build by binding them together. The outer product, though, will preserve the independence.
Okay, so that’s one part of the mapping. Another part is the use of addition to encode conjunction. My suggestion is that maybe I should stop here and on some other occasion pick up from here. I will use this as a means to just remind people about the outer product part of the binding: how, when combined with addition for joining bindings together, we can encode whole structures and not just individual constituents.
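As a pointer for next time, a minimal sketch of that last step, superposing bindings by addition, again assuming orthonormal roles so that each constituent can still be unbound exactly from the sum:

```python
import numpy as np

rng = np.random.default_rng(6)
roles = np.eye(3)                       # orthonormal role vectors (an assumption)
X, Y, Z = rng.normal(size=(3, 4))       # filler vectors for three constituents

# Addition encodes conjunction: superimpose all the bindings into one tensor.
structure = sum(np.outer(f, roles[i]) for i, f in enumerate([X, Y, Z]))

# Each constituent can still be unbound exactly from the whole sum.
assert np.allclose(structure @ roles[1], Y)
```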
Okay, thank you very much for your questions and attention.
[applause]