>> Li Deng: Okay, thank you, everybody, for coming to this final part of the lecture series by Professor Paul Smolensky, and we thank Lucy for co-organizing this series for us. We have decided to open up this whole series to the public, and today is the very last one. Paul will stay here until mid-December, so if you have more questions, you can come and approach him directly. And because we are going to open up this whole series to the public, we decided not to discuss any internal projects we are working on here, but if you do have any questions within Microsoft, come and talk with Paul and myself, and maybe Lucy as well. So thank you very much, Paul, for spending a few months with us, for the collaborative work with us, and also for giving this very insightful series of lectures. We appreciate it very much, so it's over to you for today. Thank you.
>> Paul Smolensky: Well, it's such a treat to have a dedicated group of people who really want
to understand, and so that's been very gratifying, and I've learned a lot in the process. I think I'm
supposed to maybe put this out of the way. Okay. So today, I am hoping to make four points.
First, in practice, standard tensor product representations are not, as universally believed, larger
than alternatives. Second, all known proposals for vectorial encoding of structure are cases of
generalized tensor product representations, which I have yet to define. There is a little evidence
for tensor product representations in the brain that I will tell you about. There is a topic that's
been held over many times from previous lectures, which I want to get to if I can, showing what
kind of serious symbol processing can be done with tensor product representations to try to really
lay to rest any questions there might be about whether you can do real symbol processing with
these networks. Okay, actually, if you'll bear with me a second, I'm going to just restart this.
Okay. So I want to talk about what other people have said about the size of tensor product
representations to give some context for the comparison I want to make between sizes of tensor
product representations and others. And Chris Eliasmith is an important figure in the field. If
you're not aware of him, you should perhaps check out this article in Science, which is quite an
amazing accomplishment of training networks of neuron-like units, which are more seriously
committed to biological reality than most neural networks are, including my own. And so he has
tried to do much of the same sort of thing that I have tried to do, but with more emphasis on
biological validity, to try to tie together the neural level and higher cognitive levels. But he
believes that tensor product representations are too large, so in one of his papers from this year,
he quotes his book as showing that in coding a two-level sentence, such as Bill believes that Max
is larger than Eve, where lexical items may have hierarchical relations of depth two or more, this will require approximately 625 square centimeters of the cortex, which is about a quarter of
the total cortical area, which he finds implausible, and I have to totally agree, that if that were the
truth, then it would be bad news. He elaborates on why he believes that this is the right figure in
the footnote there. Conservatively, let's assume that only eight dimensions are needed to
distinguish the lower-level concepts like mammal. Then, a representation of an individual like
Eve, who is somewhere in a hierarchy of being, Eve is a person and a person is a mammal, and
so Eve ends up three levels down. If we have eight units involved in the vectors at each of these
levels of multiplication, then we end up with 512 dimensional vectors for the individuals. And
then, putting them in the sentence, if you put individuals with 512 unit vectors into depth two,
then you end up with 512 times 512 times 512, which is 12.5 times 10 to the 7th dimensions, or
12.5 times 10 to the 9th neurons, because elsewhere he's argued that you need 100 neurons per
node of a simple connection network in order to get the right signal to noise ratio for these
networks to work properly. So now, the thing is that this is not the right way to calculate the
number of units needed in a tensor product representation for a tree involving Eve and the other
figures in this sentence. What was said here would be correct if -- did I get that laser pointer? It would be correct --
>> Li Deng: Magic.
>> Paul Smolensky: Thank you. The human brain is a wonderful thing. Okay, so it would be
correct -- I finally remembered to bring my own special pointer, which those of you who know optimality theory will appreciate -- the importance of the pointing finger. This is too high for
optimality theory. So it would be correct, if the way that we represent something like ABC is A
times B times C. In other words, if what we did to represent a pair, left child, right child, was to
multiply the two of them together, then as we went up to higher levels of embedding, we would
in fact pile up multiplications of filler vectors, each of which would be the size of an individual.
So this calculation would be correct if that were the way we represented trees, but we don't
represent trees that way. What we actually use is PQ is represented as P times the role of left
child plus Q times the role of right child, so P and Q are not multiplied together. They're added
together, and they're multiplied by something which is in fact a very small vector, a vector of
size two in most work that I've done, because we just need two vectors in this little space to be
linearly independent, R0 and R1, so a two-dimensional space suffices. So those are tiny vectors.
And when we do go to depth two, then the representation we get looks like this. We do end up
multiplying together to get a third-order tensor, the way it was claimed here, but it's not three
individuals times each other. It's one individual times one role vector times another role vector,
and each of these only is of size two. So the dimension grows as the dimension of the filler
vectors A times the dimension of the role vectors raised to the D power, where D is the depth. It
does not grow as the dimension of A to the depth, which is what the Eliasmith calculation here
was assuming. And it certainly doesn't grow as the dimension of A times the dimension of R,
raised to the D power, which is what Gary Marcus assumes in a calculation we'll look at in a few
slides from now.
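To make the dimension arithmetic concrete, here is a minimal NumPy sketch of the point just made, using hypothetical sizes (512-dimensional fillers, 2-dimensional roles, depth 2); it contrasts the actual growth of a tensor product constituent, dim(filler) times dim(role) raised to the depth, with the dim(filler)-to-the-depth growth assumed in the quoted calculation.

```python
import numpy as np

# Hypothetical sizes mirroring the discussion: 512-dimensional filler vectors
# for individuals and 2-dimensional role vectors r0, r1 for left/right child.
filler_dim, role_dim, depth = 512, 2, 2

# One constituent bound at depth D is  filler (x) r_i1 (x) ... (x) r_iD,
# so its dimension is filler_dim * role_dim**depth -- linear in the filler size.
tpr_dim = filler_dim * role_dim ** depth            # 512 * 4 = 2,048

# The criticized calculation instead multiplies whole filler vectors together,
# A (x) B (x) C, which gives filler_dim**(depth + 1).
claimed_dim = filler_dim ** (depth + 1)             # 512**3 = 134,217,728

print(tpr_dim, claimed_dim)

# The same contrast with an explicit tensor product of concrete vectors:
A = np.random.randn(filler_dim)
r0, r1 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
constituent = np.einsum('i,j,k->ijk', A, r0, r1)    # A bound to a depth-2 role
assert constituent.size == tpr_dim
```

The full tree representation is the sum of such constituents, one per filler, so with 2-dimensional roles the total stays far below the figures quoted above.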
>> Li Deng: So what makes people get this wrong calculation?
>> Paul Smolensky: Well, it is a mystery. In another paper, in 2014, Eliasmith clearly uses
positional role tree representations. They use representations of exactly this sort in another
paper, the previous year. So why it was not assumed that that was the way to think about tensor
product representations, but rather that this was, is a mystery. It is true that in the way they
represent structure, the operation that they use here keeps the dimensionality of the vectors the
same, so they have the option of doing this in a way that the tensor product scheme does not
really have a viable option. And it is true that they like to do it that way with their operation, but
it doesn't make sense to do it that way with our operation. Yes.
>>: Let me ask you a clarifying question. So the addition, the vector addition, is the direct sum or
plain addition?
>> Paul Smolensky: Direct sum.
>>: Direct sum, okay.
>> Paul Smolensky: Yes. Well, yes, this one and this one are direct sums. This one can be an
ordinary sum. All right, so I'm not sure why they made this calculational error, but it has to do, I
think, with applying their procedure for encoding structure and assuming that the tensor product
scheme would do it the same way, and that's not the case. Okay. In this paper, this is the way
they talk about representing this tree. Okay, exactly the same way that I do it. So it's not something that was not understood; it's even something that was used by them, but in the context of their operator for combining information here, their form of multiplication rather than
tensor product. Okay. So another person who's claimed that tensor product representations are
too large and who has a prominent place in this part of the research field, because like me, he
believes that it's important to find ways of bridging the connectionist and symbolic levels, he
believes that tensor product representations are too large. So what he says is that suppose each
filler can be encoded by a vector of five binary nodes. Encoding a tree with five levels of
embedding winds up taking 10 times three -- oh, I should have said each role can be encoded
with three nodes, three-dimensional vector. That's where the three comes from, 10 for the filler,
three for the role. Five levels of embedding, he takes this to the fifth power and claims we need
24 million nodes. Well, the truth of the matter is we need 7,000 nodes -- 7,280 nodes, because as
I said on the previous slide, we don't raise this number 10 to any power at all. It's linear in the size of the filler vectors. And you only get exponentiation of the role vectors -- he was
assuming that they were of size three, and so that's what leads to this number.
>> Li Deng: That number requires that the role vector has dimension two or three smaller?
Because R vector that you used earlier could be larger dimension or it could be smaller
dimension.
>> Paul Smolensky: Yes.
>> Li Deng: But to reach that number --
>> Paul Smolensky: This here? This assumes, along with Gary Marcus here, that the R vectors
are of length three, dimension three. So when we go to depth five in the tree, we're not taking
filler, which is of size 10, times filler, which is size 10. We're not doing that five times. We're
just taking filler times role, one filler, and the role has multiple factors of the primitive roles from
which all the others are recursively defined. So you get the role number raised to powers, and
that number is just three, according to Gary's assumption here. Now, in fact, I use two, and so
we could even say that rather than 24 million, what you need is 630. So you see, tensor product
representations are getting a bad rap.
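As an aside on the arithmetic, the gap just described can be reproduced in a few lines; the per-depth summation convention below is an assumption made for illustration (exact totals depend on how depths are counted), but it recovers the 630 figure for 2-dimensional roles and the 24-million figure for the criticized calculation.

```python
# Binary trees with fillers of dimension 10, embedded to depth 5.
filler_dim, depth = 10, 5

def tpr_tree_units(filler_dim, role_dim, depth):
    # A constituent at tree depth d occupies a block of size filler_dim * role_dim**d;
    # sum the blocks over depths 0..depth (an assumed counting convention).
    return sum(filler_dim * role_dim ** d for d in range(depth + 1))

print(tpr_tree_units(filler_dim, role_dim=2, depth=depth))   # 630, with 2-dim roles

# The criticized estimate instead raises (filler_dim * role_dim) to the depth:
print((10 * 3) ** 5)                                          # 24,300,000 -- the "24 million nodes"
```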
>> Li Deng: But when you have very, very small dimensions for R, what kind of things do you
lose?
>> Paul Smolensky: Nothing.
>> Li Deng: What if you have noisy encoding, something gets corrupted.
>> Paul Smolensky: So as long as the distributed representation of roles is what we've been
calling a proper one, so that they're linearly independent, the two role vectors will have two
corresponding dual vectors, which you can use to unbind with perfect accuracy. It's only when
you go beyond the ability to have linearly independent vectors that you start getting errors and
noise.
>> Li Deng: So in that case, R, increasing the dimensionality of R, is it going to help? The
solution is somewhere else, other than increasing the dimension of R.
>> Paul Smolensky: If you had some inherent noise in the system, if you had a noisy
computational system, then it might help to have a somewhat more commodious space to put
your vectors in, but I'm not even sure that that's true, so it's not clear to me that it can help.
>> Li Deng: So in this case, you're saying that in practice just use dimension level two.
>> Paul Smolensky: I haven't seen a reason to use anything but two, myself.
>> Li Deng: Okay, that's good. Thank you.
>> Paul Smolensky: Okay, now, so why does Gary do this business of 10 times 10 times 10 five times, when in the previous paragraph, just before this one, he clearly describes positional role tree representations correctly? So if you just look back to the previous paragraph, it lays it out exactly right. You take the left subtree and multiply it by a vector that represents the word, and the right subtree vector and multiply that, and add them together -- it's a perfect description of the way that I represent trees. But then when it goes on to how big it is, there's this big deviation from that description to something else.
>>: But I guess you can interpret that paragraph both ways. That paragraph could be interpreted
both ways, as if he's only quoting the role, or he's keeping the whole left subtree in his
calculations.
>> Paul Smolensky: Yes, I considered that possibility, too. And I believe I found evidence that
wasn't -- couldn't really be what the intention was, but I'm not sure that I could tell you what it is,
and it's possible that it was truly interpreted in a -- in such a fashion that the representation of a
subtree is multiplied by the representation of its siblings, not added to the representation of its
siblings. Okay, so what approach then do they favor? Well, in the talk that Gary gave here, he
had a nice table of all the different kind of computations that he argued we need to be able to do
in order for cognition to get off the ground, and he proposed some algorithms for carrying out
these computations in a way such that they could be neurally implemented, a very nice piece of
work. Now, here is where he talks about how to represent variables and binding of variables,
which I call roles, to fillers. So what he says about them is that they should be done with
holographic reduced representations, which is based on a multiplication operation, circular
convolution. So I'll tell you what that scheme, holographic reduced representations, looks like.
It's based on this multiplication operator that functions to bind together vectors the same way that
the tensor product does in TPRs, but it's a different operation. And it looks like this. So if we
take the vector X -- I'll pretend it has three elements here, and Y -- suppose it has three elements
here. We want them to be the same dimensionality. Then, if we take their tensor product, we get
all of these combinations of products of one element from X and one element from Y. And if we
cycle through them, so that we just repeat this upwards cycling through, pretending that this is
mounted on some kind of circle, then if we form summations this way, we add up these three
elements of the tensor product and call it Z2. These three, we call it Z1. These become Z0.
Then you've gone from a nine-element tensor down to a three-element vector, which is what you
started with, so Z has the same dimensionality as X and Y, and this is called circular convolution.
It can be written out this way. The lambda component of Z can be gotten from products of
multiplying X-alpha times Y-beta, then weighting each one by a number, which is either one, if
you're adding it into the sum, or zero, if you're not adding it into the sum. And the ones that you
add into the sum turn out to be exactly those for which this coordinate is equal to the sum of these two, modulo 3 -- and 3 is the dimensionality; it should say mod D, actually, if D is the dimensionality. This can be recognized as a contraction of a tensor product, so this is clearly a
tensor product where certain subscripts have been set equal to each other and added, so you have
alpha repeated here and beta repeated here, so in this three-way tensor product, what we have is T-lambda-alpha-beta, X-gamma, Y-delta. That's the five-subscript configuration
for this order five tensor product. But then if we contract by setting indices 2 and 4 to be the
same and adding up over all of their values and set indices 3 and 5 to be the same and add up
over their values, then what we get is this sum here, and that is an order 1 tensor. It's just a
vector again.
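As a concrete check of the definition just given, here is a small NumPy sketch that computes circular convolution three ways: by the modular-sum formula, as the double contraction of the order-five tensor product just described (with T chosen as the mod-D addition tensor), and via the FFT, which is the efficient route raised in the next exchange.

```python
import numpy as np

D = 3
x, y = np.random.randn(D), np.random.randn(D)

# (1) Direct definition: z[lam] = sum of x[a] * y[b] over all a, b with a + b = lam (mod D).
z_direct = np.zeros(D)
for lam in range(D):
    for a in range(D):
        for b in range(D):
            if (a + b) % D == lam:
                z_direct[lam] += x[a] * y[b]

# (2) As a contracted tensor product: T[lam, a, b] = 1 iff a + b = lam (mod D);
#     contract the order-5 tensor T (x) x (x) y on indices (2,4) and (3,5).
T = np.zeros((D, D, D))
for a in range(D):
    for b in range(D):
        T[(a + b) % D, a, b] = 1.0
z_contracted = np.einsum('lab,a,b->l', T, x, y)

# (3) The usual fast route: circular convolution is elementwise multiplication
#     in the Fourier domain.
z_fft = np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(y)))

assert np.allclose(z_direct, z_contracted) and np.allclose(z_direct, z_fft)
```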
>> Li Deng: So where does the convolution come into that contraction --
>> Paul Smolensky: The convolution is hidden in the definition of T. Yes. So modular arithmetic has a kind of cyclical, circular character to it, and that's why circular convolution --
>>: Or computationally, you could compute these things in various more efficient ways. You
can use a [indiscernible] scheme of sums. These things can be computed in much more efficient
ways than this. Is that the point that he is trying to make with using these representations?
>> Paul Smolensky: No. It is something that he takes advantage of, but it's not something that is
a designed feature of it, that I know of.
>> Li Deng: So the design feature is to have the same dimensionality, so it never grows.
>> Paul Smolensky: Yes, that's really --
>> Li Deng: But the problem is you said that it's not able to do hard binding.
>> Paul Smolensky: Well, we haven't gotten to that. We'll talk about unbinding very shortly.
>>: So to answer your question, the purpose of the binding operation in any Plate model is to create a new representation, which looks entirely different from the components. So he artificially created that operation, but the problem is, if you'd like to unbind, you are going to lose a lot of information. So unbinding accuracy is pretty bad, at least based on my implementation.
>> Paul Smolensky: Okay, well, we will be talking about unbinding before we finish this slide,
but that's the key point to keep in mind, that unbinding is --
>>: There is sort of a philosophical reason, biological reason, why you would imagine that
additional operations actually matter, and that's basically that everything we sense is sensitive to
these shifts in the world, that everything is shifted a little bit. Convolutionally, it's kind of a
natural thing to have in processing the information, because we have temporal shifts, we have spatial shifts, even in the most primitive sensing techniques, and if you assume --
>> Paul Smolensky: Well, only the most primitive sensing techniques, I would say.
>>: But if you assume that the whole thing is kind of fractal like, that it's kind of layering on top
of each other, that the similar pieces are just using output of the previous sensing processing
algorithm as an input to the next one and so on, if you think of a very simple biological entity that
grows, then you would imagine that tendency to model convolutions is there everywhere in the
brain.
>> Paul Smolensky: Well, I think that there is a qualitative change in the nature of the symmetry
operations going from the level of signals to the level of symbolic encodings. And so for
example, there is a certain kind of symmetry inherent in the structure of trees that are generated
in a context-free language, which is that no matter where in the tree you plant some subtree, it
will retain its grammaticality or status. If it's a good tree here, then it's a good tree here. So
there's a kind of a shift translation. I call it embedding invariance, but that's really quite
different. The equations that describe it are quite different from the equations that describe
translation invariance, for example. So I'm inclined to believe that once you make this transition
to the macro level, the nature of the invariances changes sufficiently that it's not a foregone
conclusion that convolution is the way to capture them anymore. But it could also be that once
convolution is understood in the appropriately abstract sense that what I've been working with
could be thought of that way, too. It's possible. Okay, so how is this operation here, the circular
convolution operation, how is it actually used to build representations in holographic reduced
representations? Well, the representation for the pair AB is A times B. The representation for
AB embedded as a sister to C is C times AB. So this will remind you of what I showed you a
few slides ago. If tensor product representations of trees were done this way, then the
calculations we saw before would be correct. And as I said there, the option of doing it this way
is open to people using these representations because this multiplication maintains the size of the
vectors, in a way that it's not really open for the tensor product operation. And to make the connection
to tensor product representations a little bit -- as tight as I can, let me define a relation -- a
function T, which takes two arguments and produces a binary tree, with this as its left and this as
its right child. Then, the holographic reduced representation of this is, as we saw over here,
gotten by taking the product of A and B with the circular convolution operation, which is a
contraction of this tensor product. So TAB is represented as T, tensor A, tensor B, just like it
would be in a tensor product scheme that used those kind of contextual role representations for
arguments of a function. And the difference is, so we have this T, which is chosen cleverly to
have some nice properties -- it's not just any old T that figures in this calculation, but just the
same, it is an outer product, just like you would expect in a tensor product scheme, but we
contract it twice. So instead of having order five, it ends up having order one. And so we can
say that what that says on the bottom here is that this is a contracted tensor product
representation. What you're looking at here is a contracted tensor product representation. It's a
tensor product representation that has been contracted. Okay. I did want to mention that if the
vectors that you choose to use are binary vectors -- sometimes, people use 0, 1, sometimes they
use 1, -1, and if you use operators that are Boolean operators for summation and multiplication,
then what you get is a system called binary spatter codes -- that's one version of it. And some of the same mathematics that applies when you're using normal multiplication of normal numbers here, which is the standard holographic reduced representation scheme -- some of those mathematical
properties carry over to the binary world, so this is a notion that has various manifestations in the
literature. But as several people have said, as [Munte] said, as Li said, unbinding is noisy. You
can't get something for nothing. You've taken two vectors of size D, and you've somehow
smooshed them together into another vector of size D. You've lost information. Something's
going to cause you to pay for that, and so when it comes to unbinding, it's noisy. What you can do
is, you can unbind by taking the circular convolution with the pseudo-inverse of the vector that
you want to unbind. So I'm not going to go into the pseudo-inverse business, but it's a little bit
like the dual vectors that I have when I have non-orthogonal role vectors. But it's quite different
formally, because it has to be an inverse with respect to this kind of multiplication operation. In
certain cases, vectors are inverses of themselves. You'll notice that it's the same operation that's
used for unbinding as is used for binding, which is different from the tensor product scheme,
where we have outer and inner. But because it's noisy, what you have in these kind of
representational schemes, people are constantly talking about cleanup. So when you try to
unbind and retrieve something from a holographic reduced representation, what you get, instead
of a clean version of the symbol A, is some noisy version of the symbol A, with interference
from the other symbols that were present with it in the structure that you took it out of. And so
oftentimes, it's essential to take what's been retrieved and clean it up, essentially replace the
retrieved vector with the actual one, recognize this is a noisy, messed-up version of A, so let's
replace it with the real A before we continue to use it in further computations. Of course, that's
not necessary with tensor product representations at all. Now, this noise business leads to
problems when you try to do something with HRRs that we've been doing with TPRs for some
time. We talked about harmonic grammar and how you can use it to do parsing earlier in the
lectures, and in this interesting paper that came out this year, the same overall enterprise was
investigated using HRRs instead of TPRs, and the noise that gets added to the energy -- which, you may remember, is what we call harmony, or the negative of it -- makes the computation of the harmony of structures, which is what the harmonic grammar is trying to optimize, problematic. And so to avoid this problem in the
simulations, what they had to do was break up the network into a whole bunch of sub-networks,
one for each locally well-formed tree in the grammar, break up the vector, the state vector, into
parts that correspond to these sub-networks, and compute the energy of each of the subnets, adding
them all together to get the total energy. So they have to go to great extremes to cope with the
cost of sticking with the same-dimension representations for trees as for symbols -- the noise that arises when you try to use those representations for things like computing harmony values. And
what that says is, all of this is unnecessary with tensor product representations. We don't have to
do any of this stuff for the harmonic grammar work that we do.
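To illustrate the noise-and-cleanup point in the terms just used, here is a rough sketch with made-up dimensions: HRR binding by circular convolution, approximate unbinding with the involution that serves as Plate's pseudo-inverse, a nearest-neighbor cleanup memory, and, for contrast, exact TPR unbinding. It is an illustration of the idea, not a reconstruction of any of the cited simulations.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 512                                    # hypothetical HRR dimension

def cconv(x, y):                           # circular convolution (HRR binding)
    return np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(y)))

def involution(y):                         # approximate inverse: y*[k] = y[-k mod D]
    return np.concatenate(([y[0]], y[:0:-1]))

# A small vocabulary of random vectors with elements ~ N(0, 1/D), as Plate assumes.
vocab = {name: rng.normal(0, 1 / np.sqrt(D), D) for name in ["A", "B", "r1", "r2"]}

# Bind two role-filler pairs and superimpose them.
memory = cconv(vocab["r1"], vocab["A"]) + cconv(vocab["r2"], vocab["B"])

# Unbinding r1 gives only a noisy version of A ...
noisy_A = cconv(involution(vocab["r1"]), memory)

# ... so a cleanup memory compares it against every stored symbol and keeps the best match.
def cleanup(v, vocab):
    return max(vocab, key=lambda n: np.dot(v, vocab[n]) /
               (np.linalg.norm(v) * np.linalg.norm(vocab[n])))

print(cleanup(noisy_A, vocab))             # "A", with high probability at this dimension

# The TPR version: outer-product binding, exact unbinding with the dual role vectors.
fA, fB = rng.normal(size=10), rng.normal(size=10)
r1, r2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
tpr = np.outer(fA, r1) + np.outer(fB, r2)
assert np.allclose(tpr @ r1, fA)           # no noise, no cleanup needed
```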
>> Li Deng: So the cleaning up, the way to do cleaning up -- was that done by the author himself? Or was it done only afterwards, when you realized how difficult it is?
>> Paul Smolensky: Tony talked about --
>> Li Deng: So the same method of cleaning up, or different ways?
>> Paul Smolensky: There are different ways that have been proposed, and I can't remember
whether these two papers, whether Tony's work and this work use the same method, but what
you have to do is have some sort of cleanup unit that has a stored version of every symbol in it,
which it then can use to compare to the noisy element, find a match and then replace the noisy
version with the clean version.
>> Li Deng: I see. Thank you.
>> Paul Smolensky: Okay, so this is the kind of cost that you pay from noise. Then the question
is, how much benefit do you get in getting smaller representations this way? How much do we
decrease the representational size because we're willing to put up with noise? Okay, so here's a
plot of some comparisons between research projects that have been done using holographic
reduced representations, so these are taken from the literature. These are all cognitive models
that have been done using HRRs, and what I'm comparing them to is the size that would have
been needed had they used tensor product representations instead, so we can see how much of a
savings in size there is for all of the headaches that noise gives you. Okay, so here's Tony Plate's
dissertation in 1994. So I'm showing you, first, what would be needed with a tensor product
representation to do what was done there. You would need 10 units in your network. He had
1,000. Plate 00, you would need 420 in the tensor product representation. He had 2048.
Eliasmith and Thagard, we would need 506. They used 512. That's pretty close. That's pretty
close.
>> Li Deng: So that’s under the same noise condition?
>> Paul Smolensky: Well, no. They have noise, and tensor product representations don't.
>> Li Deng: So why do they need to have so many, that 2000?
>> Paul Smolensky: Because they have noise.
>> Li Deng: Oh, so in order to reach certain performance, they have to have lots of --
>> Paul Smolensky: Yes.
>>: So you count the cleanup structure as part of the representation?
>> Paul Smolensky: No. I'm just counting the number of units in the actual HRR, not in any
other part of the network that deals with cleaning up the noise, and so I'm looking at the
symbolic structures that can be represented and asking, if I was to do a faithful representation of
them with tensor product representations, linearly independent vectors for the fillers and the
roles, how many units would I end up needing? And so this is a noiseless system, and this is a
noisy system. Okay. Next example here, well, the numbers were too big to fit on this, and so I
chose to divide them by five, so what would have been -- what would have required five times
that many units in a tensor product representation, well, they used 10,000 units. So the idea is
that you can only cope with noise if there's a lot of averaging out, so you have to have lots of
opportunities for cancellation of the noisy contributions, and that leads to big representations. So
in this work by Hannagan et al., we would have needed 64 units to encode their structures. They
used 1000. Final one, Blouw and Eliasmith is the harmonic grammar parsing example that I
mentioned a moment ago, using HRRs, and they used -- well, they used 128 and 256, and they
tried various sizes, and it's not entirely clear what the right comparison is, because they all have
different error levels associated with them, so which is the one to pick is not clear, because
there's no errors with this one at all. So, anyway, moral of the story is, it's awfully hard to find
an example of a study that's been done with HRRs that wouldn't have been -- wouldn't have used
smaller representations had TPRs been used instead.
>>: And is there any tradeoff there? Do you lose anything by using TPR versus HRR? Is there
a reason to even consider doing HRR?
>> Paul Smolensky: One of the advantages that has been identified in having fixed size vectors
for structures of different sizes is that if you want to ask how similar is this tree of depth two to
this tree of depth four, if you use the simplest version of the TPR encoding of trees, you'll end up
with basically zero similarity in a situation like that, because the part of the network that encodes
depth four and the part that encodes depth two are separate. There is a fully distributed version
of representation of trees using tensor product representations, in which, in fact, it's no longer the
case that different depths are separated in the network. They're all superimposed together, and so
you can recover that advantage by going to something that's a little bit less simple than the one I
have talked about. It basically amounts to saying, well, remember that there are these bit vectors
of one, zero, one, right child of left child of right child. Well, that's if you're three levels down.
There would be another bit if you were four levels down, but what you can do is pad out these,
so they're all the same length, and associate a vector for that padded symbol, and you have one
more role vector for that padding symbol, in addition to R0 and R1, so it's a relatively small price
to pay. But then you get the vectors for symbols at different levels of the tree, all consuming the
whole network and not separate anymore, so you can have that advantage if you want to. I don't
know whether what you get is desirable.
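A minimal sketch of the padding idea just described, with dimensions and vectors chosen only for illustration: every constituent's role string is padded out to a fixed length with a third role vector, so symbols at all depths live in one common space. The roles below are linearly independent but deliberately overlapping; with orthonormal roles the positions would still be mutually orthogonal, so the cross-depth overlap shown here depends on that choice.

```python
import numpy as np

r0 = np.array([1.0, 0.0, 0.0])                    # left child
r1 = np.array([0.0, 1.0, 0.0])                    # right child
rp = np.array([0.5, 0.5, 1.0])                    # padding role (hypothetical choice)

MAX_DEPTH = 3
ROLES = {"0": r0, "1": r1, "_": rp}

def role_vector(bits):
    """Role for a tree position written as a 0/1 string, padded to MAX_DEPTH."""
    padded = bits + "_" * (MAX_DEPTH - len(bits))
    v = np.array([1.0])
    for b in padded:                              # r_b1 (x) r_b2 (x) ..., flattened
        v = np.kron(v, ROLES[b])
    return v

A = np.random.randn(5)                            # hypothetical 5-dimensional filler

shallow = np.kron(A, role_vector("0"))            # A as left child (depth 1)
deep = np.kron(A, role_vector("01"))              # A two levels down (depth 2)

# Every constituent now has the same dimension (5 * 3**MAX_DEPTH = 135), so trees
# of different depths are superimposed in one space and can be compared directly.
cos = shallow @ deep / (np.linalg.norm(shallow) * np.linalg.norm(deep))
print(shallow.size, cos)                          # 135, and a nonzero similarity
```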
>>: But then you would get all the advantages of HRR without the noise? You would still get
good similarity measurements.
>> Li Deng: But you don't know how good it is, right? Similarity? Because if there's a noise
there.
>> Paul Smolensky: I don't know how good the similarity measurements in what I propose
would be, but these are published papers, because the similarity structure that they got was
sufficiently close to what people have that it constitutes a decent model.
>> Li Deng: So they do have this kind of similarity comparison using noisy HRR, and then they
found that in the next algorithm.
>> Paul Smolensky: That's right.
>>: In fact, the cleanup depends on it.
>> Paul Smolensky: The cleanup depends on?
>>: Of the similarity measurements?
>> Paul Smolensky: Well, yes, that's certainly true.
>> Li Deng: So that's after cleaning up, they do the comparison, but before cleaning up, it may not be so meaningful, or is it?
>> Paul Smolensky: I'm not so sure. They may do the comparison, the similarity evaluation,
without cleanup involved. And what they're used for in most of these examples is you have
analogies. You have a little story with elements bearing relations to each other and another one,
and people judge how analogous is this situation to this situation, and they construct the different
HRRs and they ask how similar are they, and they use that as the model for judgments. Yes.
>>: So [Cole] mentioned that losing the ability to compare might be a hurdle for TPR, but that ended up not true, because think about it: if we are going to encode either a tree or a graph by using these TPR or HRR schemes, then at the level of computational theory, it's going to be an extremely hard problem to compare whether or not two trees are isomorphic to each other. So if we faithfully map those into the vector space, comparing those two vectors must be as hard as before, if the mapping is transparent. But if somebody just simply compared the vectors represented by HRR, for example, by measuring cosine similarity or Euclidean distance, that is a very simple comparison, which is not -- which must not be right, because comparing two graphs or trees must be hard.
>> Paul Smolensky: That's an interesting perspective, actually.
>>: So it's actually not a loss for TPR, in that sense. Comparing two different structures is an
intrinsically hard problem, so even if we map that into the vector space, that must be still a hard
problem.
>> Paul Smolensky: Yes, that's definitely worth more thought. That's a very good point. Okay,
so this is just a graph that brings home the magnitude of difference between the TPR and the
HRR that was shown in the previous graph. So this is from Tony Plate's dissertation, actually, so
he generated 1000-dimensional -- he generated 1000 vectors, I believe, 1000 vectors randomly, of different dimensionalities, plotted along this axis here. So the case I'm interested in is the 1000
case, 1000 dimensional vectors. And he took a bunch of pairs of them and tried to ask, if we try
to put into our short-term memory a bunch of pairings in which X1 and Y1 are paired, X2 and
Y2 are paired, and we want to hold those in memory, we superimpose the patterns by adding
them together, the patterns being generated by the circular convolution operation, how many of
these pairs can we put into that short-term memory and be able to retrieve out of that accurate
answers to the question like is X1 paired with Y3? No. Is X1 paired with Y1? Yes. Can we get
accurate answers like that? And what he showed is plotted on this graph, if the size of the -- if
the dimension of the vectors is indicated here, then this shows how many vectors can be
superimposed in that memory trace, that short-term memory trace, how many pairs can be
superimposed and have acceptable readout, where the noise level is acceptably small, about
what's actually been stored in that superposition. So what you see is if the vectors are of length
1000, we end up somewhere between nine and 10 on the graph, so fewer than 10, more than
nine, pairs can be stored all at once, without exceeding some tolerance level of noise. But in a
TPR network, if we choose to have vectors for these elements X and Y, whose dimension is the
square root of 1,000 and we make them ortho-normal, which is what we like to do -- we could
make them linearly independent, actually. I'll change that. Linearly independent vectors of this
dimension, then the number of the dimensionality of the pairs, since we use the tensor product
here, will be the square root of 1000 times the square root of 1000, so it'll be 1000, so we'll end
up with the same number of units as we have in this representation plotted here. However, we
can superimpose 1000 pairs and have, that is to say, every possible -- let's see. We can
superimpose this many Xs paired with that many Ys. Namely, 1000 pairs can be stored all at
once, all of them being now linearly independent from one another, any one of them being
retrievable with complete accuracy. So we would have zero error in determining whether a
particular pair is or is not in that superposition state. So instead of less than 10, we get 1000.
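Here is a rough numerical version of that comparison, with arbitrary sizes and a loose criterion rather than a reconstruction of Plate's actual experiment: superimposing K circular-convolution pairs makes the stored/unstored probe gap shrink into the noise as K grows, while outer products of 32-dimensional orthonormal vectors let about a thousand pairs be stored and queried exactly in roughly the same 1,000 units.

```python
import numpy as np

rng = np.random.default_rng(1)
D = 1000

def cconv(x, y):
    return np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(y)))

def hrr_probe(K):
    """Superimpose K random HRR pairs; return the mean probe score for stored
    pairs and the score for one pair that was never stored."""
    X = rng.normal(0, 1 / np.sqrt(D), (K + 1, D))
    Y = rng.normal(0, 1 / np.sqrt(D), (K + 1, D))
    memory = sum(cconv(X[i], Y[i]) for i in range(K))
    stored = np.mean([memory @ cconv(X[i], Y[i]) for i in range(K)])
    unstored = memory @ cconv(X[K], Y[0])
    return stored, unstored

for K in (5, 20, 200):
    print(K, hrr_probe(K))        # the stored/unstored gap degrades as K grows

# TPR version: 32-dimensional orthonormal X's and Y's, outer products of size 1,024.
d = 32
I = np.eye(d)
all_pairs = [(i, j) for i in range(d) for j in range(d)]
keep = {all_pairs[k] for k in rng.permutation(len(all_pairs))[:1000]}    # store 1,000 pairs
memory = sum(np.outer(I[i], I[j]) for i, j in keep)
# Membership queries are exact: the readout is 1 or 0 with no noise at all.
assert all(memory[i, j] == (1.0 if (i, j) in keep else 0.0) for i, j in all_pairs)
```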
>> Li Deng: So in addition, in HRR, they don't really define a role vector and [indiscernible] the separate ones. It's just encoding a vector rather than encoding the structure?
>> Paul Smolensky: When you have a symbolic structure that you're trying to represent, like an
analogy, then you have to make decisions that amount to deciding what the roles are going to be
and what the fillers are going to be. So just like with the tensor product representations, those
decisions have to be made for each model. The only thing that's made for you is the decision to
use the circular convolution operation and not the tensor product operation for combining the
vectors, but the same decisions basically have to be made, and the equations look very similar.
They just have a different thing inside the circle.
>> Li Deng: So HRR can also be used to encode a tree or graph in the same way that --
>> Paul Smolensky: Yes, that's right. Yes. All right, so that was my first point. In practice,
TPR does not actually involve larger representations, typically much smaller representations than
others have found they needed to use to cope with the noise in HRRs, although I have to say that
HRRs have been received with the kind of warm affection by the community that has been
noticeably lacking for tensor product representations, so they do have something. And maybe
you can help me figure out what it is. Okay, so all known proposals for vectorial encoding of
structures are cases of generalized tensor product representations. What I really mean to say here
is that even though there's lots of work out there, there's basically just one idea about how to
combine elements together in combinatorial structures. There's only one idea, really, and that's
tensor products. You can soup them up a little bit, but the core that's actually doing the binding
is the tensor product operation. And in some cases, these are examples that don't look anything
like tensor product representations but secretly are. So I'm going to claim that generalized tensor
product representations is a class that includes HRRs, another system we're about to look at
called RAAM, and temporal synchrony schemes for representing structure in neurons that fire.
So here's RAAM, Recursive Auto-Associative Memory, from Jordan Pollack in the late 1980s, who used learning to decide how to take an encoding of one symbol, an encoding of a sister
symbol and join them together into an encoding of the local tree that has this as left child and this
as right child. And he chose the dimensionality of this layer to be the same as this and the same
as this, so it's like holographic reduced representations, in that the result of combining these two
symbols together is a vector of the same dimensionality as before. So this is the net that does the
encoding. These are the weights that are used to multiply these activations to feed into these
units. And then there's a decoding network, so it's called an auto-associative memory, because
it's trained by taking pairs X0, X1 here, copying them up here, X0, X1, and training the network
to have weights down here and weights up here, such that these weights undo what's done down
here. So they take the encoding of a pair and unbind the right child and unbind the left child by
the matrix multiplications involved in these connections up there. So you have binding at the
bottom or encoding. You have unbinding at the top or decoding. Now, the RAAM encoding of
AB, which is this, can be written this way. So the units in this layer are logistic sigmoid
units. They're not linear units, so there's a nonlinear step in the process of constructing this
representation that we haven't seen before. So this capital F boldface symbol means apply to all
of the elements of this vector this logistic transformation, so it's point wise nonlinear
transformation of all the elements in this vector here, and the vector here is the input to that layer
of units, which is this vector times that matrix plus this vector times this matrix. They just add
together, so here's the adding together, and here is this vector times that matrix and this vector
times that matrix, so the R's are the matrices here, and you'll recall that matrix multiplication is a
kind of contracted tensor product, so if we take this second rank tensor matrix, R0, and take a
tensor product with this, which is of order one, we get something of order three. It's all three-way products from the matrix and elements of the vector, and then we do the contraction in
which we require the second and the third indices to be the same. That summation gives us the
matrix product of this matrix times that vector, so what we end up here is exactly the activation
values that these units here are computing. So what we have is --
>> Li Deng: But now, the --
>> Paul Smolensky: Hold on a second. This is a squashed contracted tensor product
representation, so inside here, we have the contracted tensor product representation, like HRRs,
but now we've squashed it by applying this squashing function or this logistic sigmoid. Yes.
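For concreteness, a bare-bones sketch of the RAAM step being described, with untrained random weights just to show the shape of the computation (in Pollack's model the weights are learned by backpropagation on the auto-association task):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10                                    # symbol and tree vectors all share this dimension

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

# Encoder matrices, one per child position (these play the part of the role
# matrices R0, R1 in the contracted-TPR reading), and decoder matrices.
R0, R1 = rng.normal(size=(d, d)), rng.normal(size=(d, d))
U0, U1 = rng.normal(size=(d, d)), rng.normal(size=(d, d))

def encode(x0, x1):
    # Squashed contracted tensor product: F(R0 x0 + R1 x1).
    return sigmoid(R0 @ x0 + R1 @ x1)

def decode(h):
    # Training is supposed to make these approximately undo encode();
    # with random weights here they of course do not.
    return sigmoid(U0 @ h), sigmoid(U1 @ h)

A, B, C = (rng.normal(size=d) for _ in range(3))
AB = encode(A, B)                         # encoding of the pair (A, B)
C_AB = encode(C, AB)                      # one more level of embedding, same size
print(C_AB.shape)                         # (10,) -- the dimension does not grow with depth
```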
>> Li Deng: So now with the squashed contraction, adding this nonlinear function, which is not
standard product of TPR, then you won't be able to do this -- the same kind of unbinding without
loss.
>> Paul Smolensky: That's right.
>> Li Deng: But that's crucial for this kind of network.
>> Paul Smolensky: It's crucial that there be squashing? Is that what you're saying?
>> Li Deng: Yes, yes. If there's no squashing, everything is linear, then you don't get much, so
I'm thinking about whether that F can be made part of the TPR. I think it would be much more powerful.
>> Paul Smolensky: So in the last part of the lecture today, I'm going to talk about programming
with TPRs, and how you need to use nonlinearities, but the kind of nonlinearity that I propose is
actually a case of multilinearity, where you multiply vectors together. But you don't squash them
point by point, because we know this is wrong. From our symmetry argument, this is not
invariant under change of coordinates. That's why coordinate-wise operations are not part of
physical systems that have invariances.
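A quick numerical illustration of that symmetry point, with an arbitrary random rotation standing in for a change of coordinates: a pointwise squashing function does not commute with the rotation, whereas the tensor product transforms covariantly under it.

```python
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.normal(size=3), rng.normal(size=3)

# A random orthogonal change of coordinates.
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))

sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))

# Pointwise nonlinearity: rotating then squashing != squashing then rotating.
print(np.allclose(sigmoid(Q @ x), Q @ sigmoid(x)))        # False

# Multilinear operation: (Qx) (x) (Qy) is just Q applied on both sides of x (x) y.
lhs = np.outer(Q @ x, Q @ y)
rhs = Q @ np.outer(x, y) @ Q.T
print(np.allclose(lhs, rhs))                              # True
```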
>> Li Deng: Interesting, because this whole branch of deep learning now, about the autoencoder -- they're all based upon this architecture, and it can do a lot of very, very interesting things, so I was curious exactly what role that nonlinearity is playing here.
>> Paul Smolensky: Yes, it's an important question, and I'll show you some nonlinearities, but it
does not answer the question of what this type of nonlinearity is.
>> Li Deng: Because the problem is the deep learning, right?
>> Paul Smolensky: And if you have rectified linear units, then what you have is a piecewise
linear operation there, right? And so it's as close to linear as it could be while still being
nonlinear, and so it could be that by looking at those kinds of nonlinearities, which have some
nice advantages for gradient computation and all, we could take advantage of the linear
properties that you get for both sides of the -- the function. So a generalized tensor product
representation has this form here. It looks like a regular tensor product inside. Fillers, tensor
product roles, added together over all the constituents. And then it has an optional contraction,
and then it has an optional squashing function. That's what I'm calling a generalized TPR, and
RAAM is exactly that, so you have both the F and the C for RAAMs. For HRRs, you saw you
just needed to see there was no F, and now I'll tell you about the last case of tensor product
representations, which actually is a true normal tensor product representation, which involves
neither C nor F, but it doesn't look much like a tensor product representation to most people.
The idea for it comes out of theoretical neuroscience, where the idea has been around for a long
time that neurons in the brain that are encoding properties of one and the same object will tend to
issue their spikes in a synchronized fashion, so there'll be a high correlation between the firing of
two units, one of which might be representing color and one of which might be representing
position or something, that they'll be firing with high correlation if they describe the same object.
So if there are multiple objects in the field, then these neurons will be firing in synchrony, these
will be firing in synchrony. These will be describing properties common to one object. These
will be describing properties common to another object. Okay, and here's a version of it that was
proposed for artificial intelligence-type purposes. And there, the representation of this
proposition, give John -- John give book to Mary -- is indicated in the following way. You have
one unit for each of these elements here. This is a single unit. This is a local representation, but
we're going to look at the activation of these units over time, so here's what the activation looks
like for these two units. They fire synchronously, so imagine each of these is a spike for the
neuron, so these are in phase. That's telling you that the give object is the book. They pertain to
the same object, same thing. And the giver is John, because those two are synchronized, and the
recipient is Mary, because those two are synchronized, so that's binding by temporal synchrony.
You bind together the role and the filler in our terminology by having them fire in synchrony.
So the way to turn this into a tensor product representation is to think about it as a network that's
laid out in time. So we take this set of units. This is the network. And we unfold it in time, so
we just have a copy of this set of units for each time, and then we know which units are active at
which times, so we indicate their activation, so this unit is active and then inactive for two steps,
then active again, out of sync with this unit. So this is the activation pattern which is the tensor
product representation of John gave Mary a book. It is the tensor product representation in the
following sense. Here's one of the constituents in it, the one that says Mary is the one who was
the object of giving, the recipient, I guess -- no. The book, sorry, is the object that was given.
So that you'll notice is a tensor product. This is the constituent corresponding to the give object
is book. This is a tensor product. Here is one of the vectors. Here's the other vector. You take
the tensor product of these two vectors, then you get exactly that pattern. Okay. Now what we
have over here on the filler side are the two units that we're joining together. So this is the one
for the book and this -- this is the one for the book, and this is the one for its role, the object. So
the filler vector is book plus give object, and this plays the role of the role vector, but it's a more
abstract notion of role. We'll call it formal role. It's the role of being in the first cycle of the
system's oscillation, so if we think about this constituent as the tensor product of this telling us
what role in the oscillation pattern it plays and this telling us what material fills that role, then
we have one constituent in the tensor product representation, and then we just superimpose by
addition the corresponding green and blue versions of the same thing for the other two
constituents.
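A small sketch of the space-time reading just described; the unit labels, number of time steps, and one-hot phase patterns are illustrative choices, not taken from the paper. Each constituent is (filler unit + formal-role unit) tensor the phase pattern of its group, and the whole proposition is their sum, a units-by-time activation pattern.

```python
import numpy as np

units = ["john", "mary", "book", "giver", "recipient", "give-object"]
U = {name: np.eye(len(units))[i] for i, name in enumerate(units)}   # local, one-unit codes

T = 6                                              # time steps in one oscillation cycle
phase = [np.eye(T)[t] for t in range(3)]           # three out-of-phase firing patterns

# Each constituent: (filler unit + formal-role unit) (x) its phase pattern.
groups = [(U["john"] + U["giver"],       phase[0]),
          (U["book"] + U["give-object"], phase[1]),
          (U["mary"] + U["recipient"],   phase[2])]

# The whole proposition: the sum of the tensor products -- the network unrolled in time.
S = sum(np.outer(f, p) for f, p in groups)
print(S.shape)                                     # (6, 6): units by time

# "What is bound to the recipient role?" -- read off what fires in the same phase.
recipient_phase = S[units.index("recipient")]      # the recipient unit's firing pattern
who = S @ recipient_phase
print([units[i] for i in np.flatnonzero(who)])     # ['mary', 'recipient']
```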
>> Li Deng: So role looks more like the quantized time intervals.
>> Paul Smolensky: Yes. Well, there's this fixed repetitive activation, which is the firing phase,
so these are firing out of phase with each other, but they have the same firing frequency.
>> Li Deng: So presumably the first spatial dimension, you can apply the same kind of role
there.
>> Paul Smolensky: Say that again.
>> Li Deng: So this is time.
>> Paul Smolensky: That's the time.
>> Li Deng: Different time interval, not for the space for the image, for example, rather than
time series.
>> Paul Smolensky: Yes.
>> Li Deng: But does the same kind of formal role apply if you quantize spatially, so that it may be able to be used to represent an image or something?
>> Paul Smolensky: I think if you had a third dimension, then you could have something like
the region of the image, the label of that region and then time. Then I think you could. Yes,
Lucy?
>> Lucy Vanderwende: So this is an encoding of the sentence, the book was given to Mary by
John? No. The book was given by John to Mary.
>> Paul Smolensky: It's an encoding of the --
>> Lucy Vanderwende: Because book happens first.
>> Paul Smolensky: It's an encoding of this proposition, and there isn't any intention that book
has some sort of -- that it precedes John in any sense. You could imagine that this goes on for
some time, and I just happened to start drawing the picture here. There isn't a significance to the
fact that it's the magenta that happens first, because I could have arbitrarily started to draw the
picture here. This is intended to be an ongoing pattern. So it doesn't reflect any sequence
information, just the binding information of what thematic role goes with what.
>>: There is no tree.
>> Paul Smolensky: There is no tree.
>>: There is just a set of facts.
>> Paul Smolensky: It's a slot filler kind of structure. These are the slots and these are the
fillers.
>> Li Deng: So there is no special advantage to using TPR for this kind of a structure. Any
other ways of representing, by the raw data, it will be just as good.
>> Paul Smolensky: Are you asking the question of what's the advantage of seeing this as a
tensor product representation?
>> Li Deng: Yes, exactly.
>> Paul Smolensky: Well, I can give you at least two. The first one is to substantiate my point
that any idea that anyone has ever had has used tensor products to bind information together, and
to take something which prima facie looks like a counterexample to that claim, people would not
think of this as -- they would think of it as a counterexample, but actually, it's an example. But
let's see, here. That's odd. But there is actually quite a distinct advantage, and you can probably
guess what it is. This is a fully local encoding, but we can repeat this whole construction with
the tensor products with distributed patterns, so we don't have to have a single unit for John.
John could be a pattern, and everything would go through just fine. And until you recognize it as
a tensor product, you have no idea how to take this idea and flesh it out with distributed
representations.
>>: But also this example kind of gives you an idea of why the brain might have very large
capacity, because it could be using time to encode things, as well.
>> Paul Smolensky: Yes, yes. In this article, actually, they make somewhat of the flip
argument, that this explains why short-term memory has such small capacity, because they are
using this as a model of what we can hold in our short-term memories at once, and they do some
sort of back-of-the-envelope calculations to figure out how many slots are there in the actual
cycling of actual neurons that would give you how many slots that you could fill with
information like this, and they come up with the number seven, which is the classic number. I
don't know that anybody regards it as the correct number anymore, but it's the classic number.
>>: Seven of what?
>> Paul Smolensky: You can put seven facts, like the book was given, in short-term memory.
>> Li Deng: That's pretty similar to short-term memory that people have. Telephone number
would be 10.
>> Paul Smolensky: Yes. I think seven plus or minus two is the famous paper by George
Miller. I think most people would say it's closer to three or four, actually, but in any event --
>>: On a distributed representation of the roles, which seem to be wanting to be in the exact
time, that's not a problem when they overlap or anything?
>> Paul Smolensky: I think that's right, yes. Yes. As long as the patterns for down here are
linearly independent, we should be fine. So they don't have to be firing at distinct times. They
could be -- you could have some pattern in which you had a different amount of firing at each
time, not just one and zero, and then a different such pattern for the second slot, and it should
work just fine, as far as the linear algebra properties of the representation are concerned. Now,
what you want to do with this in your network might change, might be different. That I wouldn't
swear to. But the representation of the entire structure is the sum of these tensor products. Just
as you have in a standard tensor product representation, there's no squashing, there's no
contraction, but there is some innovation here. What's new is the idea of using a space-time
network and not just a spatial network, and independently, it's really a separate idea to use these
formal roles instead of meaningful roles, so we don't consider give object to be a role when we
look at it this way. We consider it to be a filler that gets bound to the same formal role that this
one does, and in that sense, they end up functioning as a unit, which is actually reminiscent of
how the neo-Davidsonian move to say instead of having the agent and the patient be bound
together, we have a formal thing called the event, and we bind the agent to the event, and we
bind the formal -- the patient to the event, and by virtue of being bound to the same event, they
have a relationship to each other, but the relationship isn't directly encoded in neo-Davidsonian
formalism.
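Picking up the earlier question about overlapping distributed role patterns, here is a small check, with arbitrary graded patterns, that linear independence really is enough: the dual vectors obtained from the pseudo-inverse of the role matrix unbind each filler exactly, even though the firing patterns overlap heavily in time.

```python
import numpy as np

rng = np.random.default_rng(0)
T, n_fillers, n_roles = 6, 5, 3

# Graded, overlapping firing patterns over time for the formal roles:
# linearly independent, but nothing like one-hot and far from orthogonal.
R = np.abs(rng.normal(size=(n_roles, T)))          # each row: one role's pattern over time
F = rng.normal(size=(n_roles, n_fillers))          # the filler pattern bound to each role

S = sum(np.outer(F[k], R[k]) for k in range(n_roles))   # bound structure: fillers x time

# Dual role vectors from the pseudo-inverse of R. Because the role patterns
# are linearly independent, unbinding with them is exact.
R_dual = np.linalg.pinv(R).T                       # shape (n_roles, T)
for k in range(n_roles):
    assert np.allclose(S @ R_dual[k], F[k])        # recovered filler == stored filler
```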
>>: So actually, if you think about it, that's a product over there, and then you don't have to have
these roles to be localized. They can be distributed, but have people actually observed that in
practice, too, because you said that there's a theory. I don't know if it was observed, that you
have these neurons all firing at the same time to represent this object, these neurons all firing
at the same time to represent that object, but if it's actually distributed, we'll actually have to do a
little bit more math to figure this out.
>> Paul Smolensky: Yes, exactly.
>>: Which is doable, though.
>> Paul Smolensky: It's doable. I do not know whether it has been done, whether there are
cases that illustrate this very behavior with distributed as opposed to local formal role
representations.
>>: I guess the only reason why locality -- what would be the reason for locality rather than
distribution? Distribution here, maybe sparsity, a system of sparsity with these policies?
>> Paul Smolensky: That might be advantageous. Yes, yes.
>>: Is the role here really acting more like one of four possible event time-multiplex slots or
something, rather than a -- it's not a semantic role. It seems more like a temporal role.
>> Paul Smolensky: Well, we call it a formal role, because it really isn't about being over time.
We could reinterpret this as entirely spatial network that has no time in it at all, and everything
would be the same, so it's not really about time. That's the most crucial thing that shocks people,
that whatever this idea is about, it's not really about time, actually. Because we can have a
formally identical system that has no time in it. It's about having some identifier, some unique
identifier, that other things get stuck to as the means of bringing them together, rather than
sticking them to each other.
>>: But just like in telecommunications, you're sending signals, but time is there just for you to
encode the signal over time, but you're getting it as a code word at the end. It's not like there is a
timing to the content, just using time.
>> Paul Smolensky: Right. And so it may be that formal roles of this sort are used over time in
the brain and not otherwise, but there's no reason why it would have to be that way from the
formal structure of the representations point of view.
>>: So there is a notion that has been longstanding in linguistics, that the object of a verb is
much closer to the verb than the subject.
>> Paul Smolensky: Yes.
>>: And is that something that is capturable or captured with the notion that you were doing this
over time, the neural firings are taking place synchronously, so the word book is bound to the
word give, earlier than the --
>> Paul Smolensky: It could be that -- if the activation pattern down here for the slot that the
verb goes into is more similar to the activation pattern for the slot that the object goes into than it
is to the activation pattern for the slot that the subject goes into, then you would expect to see just
what you said, that there would be more correlated activity in the encoding of the verb object
pair than in the verb subject pair. So it could be used I think in that sort of way. Lucy?
>> Lucy Vanderwende: In this way, where you have the filler is give object, you now don't have
a more abstract role of object more generally, not linked to the specific -- here, you were linking
the object to each specific verb, so give object to each object.
>> Paul Smolensky: Oh, you're talking about the fact that there's a bundling of the role object in
the verb give here.
>> Lucy Vanderwende: So do you get any generalization anymore on how objects on average
behave?
>> Paul Smolensky: Right, right. So when I talked about trying to capture similarities of that
sort, last time I think it might have been, it was important that we didn't do this, that we had a
representation of give and a representation of object that had their own independent character. They could
be bound together, or not, and so there is the question of how well you can recurse this kind of
formalism and say, okay, well, I want to use the same idea for binding together object and give,
rather than just plunking them together as a label for a unit. I have to think about that. I'm not
sure that it would recurse very gracefully.
>>: How about uncertainty? You could imagine having a representation where either John is
the recipient or he is the giver, with different probabilities: maybe John most likely is the giver,
but it's possible that he is actually the recipient and Mary is the giver. So you could imagine a
situation there where the intensities of the pulses overlap, so that in Paul's case, the
recipient and the giver are both synchronized with both Mary and John, but to different
amplitudes. But that's a representation where you have the same thing. You didn't talk much
about uncertainty in representation. I don't know if it's just a very linear thing or not. It could
just be based on the amplitudes of things.
>> Lucy Vanderwende: Would a good example of that be the start of a sentence, John gave
Mary? Because it could be followed by in marriage, in which case he really is kind of giving
Mary, or John gave Mary a book. Until you hear what comes after, John gave Mary is uncertain.
>>: I more meant just mental uncertainty, like I don't know what I've heard. I know there was
something about the book. Somebody gave the book to somebody. I'm just not sure. I think
John gave it to Mary, but I'm not sure. It might be the other way around.
>> Paul Smolensky: So what has been talked about very little in these lectures, maybe just one
slide about French liaison, is the current focus of work on having partially active symbols in
representations, and on distinguishing that notion, a blend of partially active symbols, from a
probabilistic mixture of fully active symbols: how, when you're in the middle of processing
a sentence and you have uncertainty about the rest of the sentence, being in a blend of
partially analyzed parses is different from having a probability distribution over ultimate parses,
which is the more standard view. So we have been developing simulation models of that, with
grammars in networks using tensor product representations and such. But that's been hardly
mentioned here. Still, that is where we would talk about what you just raised, I think: what
happens when we don't have John, we have 0.6 John. And so in the French liaison example, I said
we had 0.4 T at the end of petit, maybe 0.5.
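A minimal numeric sketch of that distinction, with hypothetical vectors and illustrative 0.6/0.4 weights (not from the lecture): a blend is one single activation state containing partially active fillers, whereas the more standard view keeps two fully formed parse states plus a probability distribution over them.

import numpy as np

rng = np.random.default_rng(2)
unit = lambda v: v / np.linalg.norm(v)
john, mary = unit(rng.standard_normal(16)), unit(rng.standard_normal(16))
giver, recipient = np.eye(2)            # two orthonormal role vectors, so unbinding is exact

parse1 = np.outer(john, giver) + np.outer(mary, recipient)   # John as giver, Mary as recipient
parse2 = np.outer(mary, giver) + np.outer(john, recipient)   # the other way around

# Blend of partially analyzed parses: one single activation state.
blend = 0.6 * parse1 + 0.4 * parse2

# Unbinding the giver role from the blend yields a partially active filler,
# roughly 0.6*John + 0.4*Mary, rather than a sample drawn from {John, Mary}.
giver_filler = blend @ giver
print(round(float(giver_filler @ john), 2), round(float(giver_filler @ mary), 2))
# Values near 0.6 and 0.4, up to crosstalk between the two random filler vectors.

# The probabilistic view instead keeps {parse1: 0.6, parse2: 0.4}: two fully active
# states plus probabilities, a different kind of object from the single blended state.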
Okay, serious symbol processing. But there's one more thing here first. I'm going to just blast
through it and not explain it, because I think it's cute and interesting but pretty much off topic:
it's some evidence that neurons are
functioning the way tensor product neurons should function. In the parietal cortex,
representations of the locations in space of visual stimuli have to take into consideration the
combination of the position of the eye and the position of a dot on the retina; the same retinal
position means different spatial positions if the eye moves, and conversely. So what you find is
that the activity level of a neuron in this part of the visual system has a profile like this, where
this axis is the position of a dot along the retina, and this axis is, let's say, the horizontal position
of the eye. And the activity of a single neuron looks like this as a function of these two relevant
variables that it has to combine together in order to identify a place in space for that dot. And the
point is that this is in fact a tensor product of two functions. This function is a distributed
representation over the eye position variable; this one is a distributed representation over the
retinal position. The retinal position one is roughly bell shaped, the eye position one is roughly
logistic shaped, and the formula that's given for the receptive field by the authors who have
done this work is exactly the tensor product of these two functions. So there's a case where the
two relevant bits of information, where my eye is pointed and where on the retina a given image
is cast, are bound together using a tensor product in this part of the visual system.
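A sketch of the kind of "gain field" response profile being described; the functional forms and parameters below are illustrative reconstructions, not values taken from the cited work. A single unit's response is modeled as the product of a bell-shaped retinal-position tuning curve and a roughly logistic eye-position gain, which is exactly one slice of the outer (tensor) product of the two distributed representations.

import numpy as np

retinal_pos = np.linspace(-40, 40, 81)   # degrees on the retina
eye_pos     = np.linspace(-20, 20, 41)   # horizontal eye position, degrees

# Distributed representation over retinal position: roughly bell shaped.
f_retina = np.exp(-0.5 * ((retinal_pos - 10.0) / 8.0) ** 2)

# Distributed representation over eye position: roughly logistic shaped.
f_eye = 1.0 / (1.0 + np.exp(-eye_pos / 5.0))

# The unit's receptive-field profile over both variables is the outer (tensor) product.
receptive_field = np.outer(f_retina, f_eye)
print(receptive_field.shape)   # (81, 41): one response value per (retinal, eye) combination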
Okay, but I wanted to move on to the last topic here, and that is serious symbol processing with
tensor product representations, which involves nonlinearity, but not pointwise nonlinearity. I'm
going to talk first about the basic operation of the lambda calculus, which is function application.
This is called beta reduction. You have a lambda expression, which is an
expression which has a variable, X, identified by this quantifier lambda, and you have some
expression, some function, that is stated in terms of X, whose inner structure is not indicated
here, so B stands for some formula involving X. And so what's in parentheses is the function,
and what's outside is the argument, and this is supposed to be the value of the function on that
argument. That's what the process of beta reduction computes, and if we go to our tree world,
we can think of the lambda expression as being built this way if we want. That's not the only
way. And what we need our function to do is this: given this L as the first argument and some
A as the second argument, output this expression B, but with all the Xs replaced by A. That's
what applying the function to a value means. You replace the variable of
that function with the value you're evaluating it at. And here's how we do that using tensor
product unbinding and binding. So one step is we unbind the right child of the left child. That's
here. So L is the tensor product representation of this, and if we unbind the right child of the left
child, what we get out is in fact what the symbol is. That is the variable in the expression. It
could be X could be whatever. This will tell us what it is. And so that extracts X. This operates
on this tensor product representation for that extracts the right child, which is B. So that extracts
B. Here's the function that does the whole thing. There should be no D there; I don't know how
that typo got created, but this is the full function that does the job. What this is, is the identity
operator, so just pretend that D's not there, please. This multiplies by B, just giving B back, so this
reproduces B. What this does here, this inner product, is return the locations of all of the Xs:
when you take the inner product with X, you get out all of the roles that X fills. This deletes
all the Xs in those very locations, and this inserts A in those very locations. So the net effect is,
you have replaced all the Xs by A with this combination of operations: inner products here, and
outer products here. I haven't used the tensor product symbol, consistent with previous
lectures, so this is the outer product of this tensor with that tensor. All right, so in this case, this
formula encodes a tree. There are atoms at the terminal nodes. The atom X here is replaced by an
entire tree; A is in general an entire expression itself, not just an atom, so we've managed to
replace a symbol with an entire expression.
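Here is a minimal executable sketch of the substitution step just described, under simplifying assumptions: orthonormal filler vectors, flat role vectors for positions rather than recursively composed tree roles, X and the body B taken as already extracted by unbinding, and A treated as an atomic filler. It is a reconstruction of the idea, not the exact function on the slide. Unbinding is a contraction (inner product) with a filler or role vector, binding is an outer product, and substituting A for every X in B comes out as roughly B + (A - X) bound to the roles that X fills.

import numpy as np

rng = np.random.default_rng(3)

def orthonormal_fillers(names, dim):
    # Orthonormal filler vectors, so unbinding by inner product is exact.
    q, _ = np.linalg.qr(rng.standard_normal((dim, len(names))))
    return {name: q[:, i] for i, name in enumerate(names)}

fillers = orthonormal_fillers(["x", "f", "c", "a"], dim=8)

# Five positions in the body B, encoded with orthonormal role vectors.
roles = np.eye(5)

# B encodes a body whose symbols, position by position, are: f x c x c
body_symbols = ["f", "x", "c", "x", "c"]
B = sum(np.outer(fillers[s], roles[i]) for i, s in enumerate(body_symbols))

def substitute(B, a_vec, x_vec):
    # Roles filled by X: contract B with the filler vector x (an unbinding).
    roles_of_x = x_vec @ B
    # Delete X at those roles and insert A there (bindings), keeping the rest of B.
    return B + np.outer(a_vec - x_vec, roles_of_x)

result = substitute(B, fillers["a"], fillers["x"])

# Read the result back symbolically to check: every "x" has become "a".
decoded = [max(fillers, key=lambda s: abs(fillers[s] @ (result @ roles[i]))) for i in range(5)]
print(decoded)   # ['f', 'a', 'c', 'a', 'c']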
The next thing I'm going to show you, which is tree adjoining, takes an atom at an internal node
and replaces it by an entire structure. This is to
remind me that in Gary Marcus's book, which he talked about in his lecture here, he laid out
these seven lessons for what we need our brains to do in order to produce cognition: we need
symbols, we need variables, operations, types, tokens. I claim that being able to do things
like function evaluation, beta reduction, means that there's no question that we can do all of these
things. All of them are doable. That is a solved problem, I claim. Tree adjoining. So here is the
initial tree. It has somewhere buried inside it an A constituent. This is the auxiliary tree. It has
A as its root symbol. Little A stands for the whole thing. It has a foot symbol, alpha, and what
we need to do is insert this into that. It's a kind of adjoining. We insert the green tree inside
here, so that the red one now hangs from the green instead of the blue, and the green hangs from
the blue the way that the red used to do, so that's the tree adjoining. And I will just go through
this very fast, because I think having seen the lambda expression, you'll get it quickly as much as
you're going to get it, and it's 2:00. So here's an inner product that tells us what symbol we're
looking to replace. It's the symbol at the root of A. That extracts the root symbol, so here's just a
recording of that fact. What this does is it finds all the locations -- the location, I should say, of
this symbol A in this original tree. What this does is find the subtree here hanging from that
position in the original tree, color coded to match it. What this does is find the location of this
node alpha inside this tree, what role in this tree alpha fills. And once we have all of these things
in place here, once we have all those in place, we can write the function down for tree adjoining,
which takes this as its first argument, this as its second argument and produces that as its result.
So first, this here retains all of T that's unaffected by adjoining: it removes the subtree A
from T, and once that's removed, what's left is all the part that's unaffected by adjoining.
This repositions that removed subtree by moving it down to where alpha is. This embeds the whole big tree here in
the place where that atom was before, and this removes alpha from the final structure, because
it's just a placeholder, and voila, you're done. This says down here, this is a bunch of outer
products that are used to construct this, and this is a bunch of inner products that are used to
unbind these to pull out the relevant bits, so that they're ready to be put back together in this way.
So the net result of all this is a single function written here by means of these auxiliary variables
here, that does tree adjoining. But it's a high-order, nonlinear function in the following sense. If
we look at this term, for example, we have one R times another R, so this R involves taking an
inner product with A. This involves inner product with T, so there's an A and a T buried in here.
There's an A in here, too, so we have A times A in there, and elsewhere, we have T times T, so
here we have -- let's see. Do we have A times R somewhere? Well, it's my belief that
somewhere buried in here is a third-order term in T. So when you cash all of these abbreviations
out for what they stand for, you'll see that T enters multiple times and A enters multiple times.
Those are the two arguments here, and they get multiplied together with themselves and with
each other, so you end up with something that's not linear in T and it's not linear in A. But it's
multilinear in the sense that it's just multiplications of them. It's not something like a point-wise
squashing by a sigmoid function. Okay, so I've already done that. I don't know why that came
back. So in a single-step, massively parallel operation, we take the distributed encoding of the
input arguments into the distributed encoding of the output. It's a third-order rather than a
first-order function of the input. A single application of this whole function simultaneously
performs multiple inner products and multiple outer products all at once, and that achieves the
effect of extracting all the roles that contain a given filler and inserting a given filler in all of
those places. That's
substituting a value for a variable in a very rich structural sense. Voila.
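To restate the "multilinear but not pointwise nonlinear" point as a formula, under the same simplifying assumptions as the substitution sketch above (orthonormal fillers; this is a reconstruction, not the slide's notation), the substitution map built from contractions and outer products is

\[
\beta(B, a, x) \;=\; B \;+\; (a - x) \otimes \big(x^{\mathsf T} B\big).
\]

Since the variable x and the body B are themselves obtained from the input expression by linear unbindings, the whole map is third order in that input, matching the "third rather than first order" remark; all the nonlinearity comes from products of the inputs, never from a pointwise squashing such as a sigmoid.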
>> Li Deng: Thank you very much.
>> Paul Smolensky: Congratulations. We finished a lecture.
>> Lucy Vanderwende: On time.
>> Paul Smolensky: Never happened before.
>> Li Deng: Thank you very much. Any more questions?
>> Paul Smolensky: Yes.
>>: So do you think of these operations, like for the tree adjoining, as being -- existing in the
language themselves, so that they can be represented and you can make new operations based on --
>> Paul Smolensky: Are they part of the toolkit that you can use to build other things?
>>: Well, in the brain. I'm thinking, if this were the right model, would these operations be
represented the way other knowledge is, or would they be something that's just fixed, that was
somehow learned and sits in the neuron weights? Well, I guess it's kind of the same thing. I'm
just wondering if you can take simple operations and make new operations from them using
these same sort of operations. Can they be used on themselves? Is there a language of
operations here that can be constructed?
>> Paul Smolensky: There is in a sequential sense, for sure, where you could apply one of these
operations and take the output and then apply another one to that. There's no question that that
exists. When I do these programs, I figure out what bits I want to multiply together and combine
to create a function that in one step does lambda evaluation, but it's not clear whether, internal to
the brain, the capacity to do that kind of combination is plausible.
>>: I guess I'm wondering if these operations are pre-wired and they don't tend to grow, if you
just stick with that set of operations, or if it's something that's learned and they grow over time.
Do you have any guess or intuition on that?
>> Paul Smolensky: I think that there needs -- my best guess, and it is a guess, is that there needs
to be some sort of organization to the cortex such that these kinds of tensor operations can go
on, so that these kinds of operations are implementable in the cortex. Whether the
implementation of these operations in the cortex is somehow hardwired, or whether it's
something that could be learned, I do not know, but I'm guessing that the fundamental ability to
do tensor product -- tensor calculations is probably hardwired. At that level, I feel that my best
guess is probably secure, but whether the brain can freely combine all of these things the way I
do when I write a program I think is a good question to ponder. It could very well be that -- one
thing to imagine is that, given that there is the ability to do sequential combination of operations
that have already been acquired one way or another, feed the output of one as an input to another,
then that gives you training data for the combined function. So if I have F and I have G, and I
feed the output of F to G, then I get training data for G composed with F. And so you could
imagine then over time learning that combined function and then it functioning as a unit instead
of as a sequential set of operations.
>>: Almost like a chunking kind of thing.
>> Paul Smolensky: Absolutely like a chunking kind of thing. That's right. Yes.
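A small sketch of the composition-as-training-data idea just discussed, using hypothetical linear operations F and G and randomly generated inputs (nothing here is from the lecture): applying F and then G to stored inputs yields input-output pairs for the composite, from which a single one-step map can be learned, here by least squares since the toy operations are linear.

import numpy as np

rng = np.random.default_rng(4)

d = 10
F = rng.standard_normal((d, d))   # one already-acquired operation (toy: a linear map)
G = rng.standard_normal((d, d))   # another already-acquired operation

# Sequential use of F then G generates training data for the composite G o F.
X = rng.standard_normal((200, d))           # inputs encountered over time
Y = X @ F.T @ G.T                            # targets: G(F(x)) for each input x

# Learn a single one-step operation H from the (input, output) pairs.
H, *_ = np.linalg.lstsq(X, Y, rcond=None)
H = H.T

# H now performs the chunked operation in one step.
print(np.allclose(H, G @ F))                 # True, up to numerical precision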
>>: Cool. So this whole representation, even though you've been using language a lot as
examples, the whole representation isn't just about language. It's about storage and information
and knowledge and building knowledge in a neural system. But language is a special case. Are
there some examples of what these sorts of operations can do that language doesn't, something
not easily expressed through language, some kind of reasoning that's not language bound? It's hard
for humans to think, to discuss things without language. They think that they think in terms of
language, but that's probably not the case. It's probably that there's a lot of thinking that's not
really language driven.
>> Paul Smolensky: In the original paper on tensor product representations, I gave an
illustration of using tensor product representations to encode something like a speech
signal, where the roles were points of time and the role vectors were sort of like Gaussian bumps
centered at the point of time that they most principally control; that's the role axis. On the filler
axis, you had some sort of detectors for energy at different bands of frequencies or whatever, so
you could build a spectrogram this way, as a tensor product, but with a continuum of roles,
really, and a continuum of fillers, and there's maybe some sensible geometry to the pattern of
activity for these types of role vectors and filler vectors both, such that it makes sense for them
to be filter shaped or something. So these kinds of operations can apply to the kind of continuous
domain of signals that people aren't very facile at talking about, at least verbally, so I do think
it's a much more general mechanism than language, or even than higher cognition, really.
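A brief sketch of that spectrogram example, with purely illustrative parameters: roles are points in time with Gaussian-bump role vectors, fillers are energies in frequency bands, and the whole time-frequency pattern is a sum of filler-role outer products, a tensor product representation over (discretized) continua of roles and fillers.

import numpy as np

time_axis = np.linspace(0.0, 1.0, 100)        # seconds
n_bands = 12                                   # frequency-band "filler" dimensions

def role_vector(t_center, width=0.05):
    # Gaussian bump over the time axis, centered on the moment it principally controls.
    return np.exp(-0.5 * ((time_axis - t_center) / width) ** 2)

# A few (filler, role) events: band energies observed around particular times.
rng = np.random.default_rng(5)
events = [(rng.random(n_bands), t) for t in (0.2, 0.5, 0.8)]

# Spectrogram-like pattern as a sum of outer products: band energies (filler) x time bump (role).
spectrogram = sum(np.outer(energies, role_vector(t)) for energies, t in events)
print(spectrogram.shape)   # (12, 100): frequency bands by time samples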
Whenever there's structured information, you need to encode what role in the structure is being
played: at what time does this frequency band have a lot of energy? So I think it's just pervasive,
and my guess then would be that the
capacity that we use for higher cognition evolved from a capacity to encode signals in this sort of
way. And an interesting thing about the use of tensor product representations for abstract
knowledge and encoding grammar and all of that kind of higher-level stuff is that it's the same
notions of role filler combination to form combinatorial structures that you have in things like
scenes. So in a scene, you have a lot of objects you've identified. They have roles in the scene,
which involve positions but relations to each other and affordances they provide and all of that
stuff. So a scene is something that lower animals have to deal with all the time. They must have
the capability of encoding combinatorial representations, and so the apparatus that I'm talking
about doesn't seem like one that would be exotic. And the ability to encode abstract, in the sense
of far from sensory, information in the fillers and the roles could naturally be a result of
extracting features at higher and higher levels, and so on. But the same fundamental structuring
operations can be there from the beginning. And a nice thing about the brain is this, that you
might wonder, well, how did mankind make the leap from scenes being represented as
combinatorial structures with tensor products to parse trees being represented that way? And
the answer is that to the brain, a scene is an activation pattern, and so is a sentence. So despite
the fact that they have very different semantics to us, to the brain it's all the same. It's
identifying repeating substructures and seeing that they combine in certain ways, and that has to
be done to deal with scenes, and once the information that's available to you includes
things like language, then the same kind of operation should go a long way towards providing
the kind of higher-level capabilities that we're talking about in these lectures. Voila, neural
solipsism pays off. Okay, thanks again for enduring this. I'm very impressed.