On-Line Learning of Predictive
Compositional Hierarchies by Hebbian Chunking
Karl Pfleger∗
Computer Science Department
Stanford University
kpfleger@cs.stanford.edu
Abstract
I have investigated systems for on-line, cumulative
learning of compositional hierarchies embedded within
predictive probabilistic models. The hierarchies are
learned unsupervised from unsegmented data streams.
Such learning is critical for long-lived intelligent agents
in complex worlds. Learned patterns enable prediction
of unseen data and serve as building blocks for higher-level knowledge representation. These systems are examples of a rare combination: unsupervised, on-line
structure learning (specifically structure growth). The
system described here embeds a compositional hierarchy within an undirected graphical model based directly
on Boltzmann machines, extended to handle categorical variables. A novel on-line chunking rule creates
new nodes corresponding to frequently occurring patterns that are combinations of existing known patterns.
This work can be viewed as a direct (and long overdue) attempt to explain how the hierarchical compositional structure of classic models such as McClelland
and Rumelhart’s Interactive Activation model of context
effects in letter perception can be learned automatically.
1 Introduction
Though compositional hierarchies have a long history within
AI and connectionism, including now-classic models such
as blackboard systems and the Interactive Activation model
of context effects in letter perception [1] (henceforth IAM),
and although much work has been done on important
compositional representation issues, far too little work has
investigated how to learn which of the exponentially many
compositions to bother representing. Within learning fields,
much more effort is devoted to taxonomic relationships (e.g.,
clustering). Learning which compositions to represent based
on statistics of perceptions requires unsupervised, on-line
structure learning that is also cumulative (builds directly on
past learning) and long-term. Such research is already more
difficult experimentally than investigations of batch training on datasets small enough to iterate over in order to induce relationships at a single level of abstraction or granularity. Thus, I use localist representations and the simplest of the three techniques described by Hinton [2] for representing compositional structure in connectionist networks.
∗ This research was performed at Stanford. The author is now at Google.
Figure 1: Hypothetical partial network with 4- and 2-letter
chunks, applied to 4-letter input. All “th”-to-‘t’ weights
are identical. All “th”-to-‘h’ weights are identical.
This paper does not directly address the important work on
more complex compositional representations such as recursive composition or any of the valuable work in the overlapping areas known as connectionist symbol processing
and sparse distributed representations, but is instead complementary to such work. The paper focuses on how the
structural parts of compositional models, such as IAM, can
be created algorithmically. Such techniques will likely hold
lessons for learning in the context of the more complex representational methods of compositional connectionism.
IAM (and blackboard systems in AI) demonstrated the
ability of compositional hierarchies to make predictive inferences that smoothly integrate bottom-up and top-down
influences at multiple granularities. Prediction of missing
information based on context is a specific goal of my work.
Note that much of the more complicated representational
work has ignored this type of prediction for the moment,
but it will clearly be important in the future. Unlike simple
recurrent networks (Elman-style nets), IAM and the models
presented here use undirected links and can predict equally
well in any direction.
IAM structurally encodes a compositional hierarchy of individual letters and four-letter words into a symmetric-recurrent (relaxation-style) network.
Figure 2: A small Sequitur-IAM network after settling with
ambiguous input i/j, h/n, g/q. “ing” is most consistent with
the chunks known to the network and dominates the interpretation. Underscores represent spaces, which were not
distinguished as special. Each large rectangle collects the
nodes for alternative chunks at a location. Dark shading represents activation of the individual nodes.
For each letter position
there are 26 nodes, one for each letter, and there are nodes
for common four-letter words. Excitatory links connect each
word node to its four constituent letters. Information flows
bottom-up and top-down along these links. This produced
a landmark model that was remarkably adept at prediction from context and at matching human performance data.
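To make this structure concrete, the following Python sketch builds such a topology as an adjacency list; the node naming, the unit weight magnitudes, and the restriction to within-level inhibition are illustrative assumptions rather than the actual construction of [1].

```python
# Minimal sketch of an IAM-like topology (illustrative assumptions only:
# the adjacency-list encoding, node naming, and unit weight magnitudes
# are not taken from [1]).
from collections import defaultdict

EXCITATORY, INHIBITORY = +1.0, -1.0

def build_iam_topology(words, alphabet="abcdefghijklmnopqrstuvwxyz", length=4):
    """Return a dict mapping each node to a list of (neighbor, weight) pairs.

    Nodes are ('letter', position, char) or ('word', word).  Word nodes are
    linked symmetrically to their constituent letter nodes; alternatives that
    compete for the same position (and the word nodes) inhibit each other.
    """
    links = defaultdict(list)

    def connect(a, b, w):
        links[a].append((b, w))
        links[b].append((a, w))  # undirected: information flows both ways

    # Excitatory word-letter links carry both bottom-up and top-down influence.
    for word in words:
        assert len(word) == length
        for pos, ch in enumerate(word):
            connect(('word', word), ('letter', pos, ch), EXCITATORY)

    # Mutual inhibition among alternative letters at each position.
    for pos in range(length):
        letters = [('letter', pos, c) for c in alphabet]
        for i, a in enumerate(letters):
            for b in letters[i + 1:]:
                connect(a, b, INHIBITORY)

    # Mutual inhibition among word nodes.
    word_nodes = [('word', w) for w in words]
    for i, a in enumerate(word_nodes):
        for b in word_nodes[i + 1:]:
            connect(a, b, INHIBITORY)

    return links
```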
2 Greedy Composition
One of IAM’s most severe limitations was its complete lack
of learning. Since IAM, there has been analogous work for
other sensory modalities, specifically the TRACE model of
speech perception [3, 4], and there has been ample work
on parameter tuning techniques for symmetric-recurrent networks, but there has been very little work on learning any
structure in such networks and virtually no work on learning
to encode compositional patterns in the network structure.
I first demonstrated that some of IAM’s performance
properties could be replicated with automatically created
structure. Sequitur [5] is a simple greedy algorithm for discovering compositional structure in discrete sequences. Essentially, each time it sees a substring of any length repeated
exactly for a second time, it will create a node in the hierarchy for that substring. Sequitur incorporates no mechanism
for making predictions from the compositional hierarchies
it creates. I created a system that uses Sequitur to build
a compositional hierarchy from unsegmented data, then incorporates the learned hierarchy into an IAM-style network, generalized to an arbitrary number of levels.
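As a rough illustration of this greedy chunk-creation idea, the following sketch repeatedly replaces any digram (adjacent pair of symbols or chunks) that recurs without overlap by a new chunk symbol; it is a simplified approximation of the flavor of the algorithm, not Nevill-Manning and Witten's linear-time Sequitur [5].

```python
# Simplified sketch of greedy chunk creation in the spirit of Sequitur [5]:
# whenever a digram (adjacent pair of symbols or chunks) recurs without
# overlap, replace all of its occurrences with a new chunk symbol.

def greedy_chunk(stream):
    """Return (sequence, rules), where rules maps chunk ids to symbol pairs."""
    seq, rules, next_id = list(stream), {}, 0
    changed = True
    while changed:
        changed = False
        seen = {}
        for i in range(len(seq) - 1):
            digram = (seq[i], seq[i + 1])
            if digram in seen and seen[digram] + 1 < i:  # non-overlapping repeat
                chunk = ('chunk', next_id)
                next_id += 1
                rules[chunk] = digram
                # Rewrite every occurrence of the digram with the new chunk.
                out, j = [], 0
                while j < len(seq):
                    if j + 1 < len(seq) and (seq[j], seq[j + 1]) == digram:
                        out.append(chunk)
                        j += 2
                    else:
                        out.append(seq[j])
                        j += 1
                seq, changed = out, True
                break
            seen.setdefault(digram, i)
    return seq, rules
```

On a stream such as "thethethe", for example, this creates chunks for "th" and then "the", the kind of layered hierarchy that the Sequitur-IAM hybrid unrolls into a network.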
Figure 3: Plots of node activations vs. cycle during settling.
Ordered top to bottom at the rightmost point: (a) Activations
of ‘k’, ‘r’, and ‘d’ in position 4 after presentation of w,
o, r, k/r. The network favors ‘k’ over ‘r’ due to feedback
from the “work” node. (b) Activations of ‘e’ (position 2),
“ea” (pos. 2–3), “re” (pos. 1–2), and ‘f’ (pos. 2) after
presentation of r, e/f, a, t. The network favors ‘e’ due to the
influences of the digrams “re” and “ea”.
The resulting network corresponds to an “unrolled” version of the hierarchy generated by Sequitur. Sequitur’s hierarchy can be thought of as a grammar, for which the IAM-style network encodes all possible parses, with the appropriate use of weight sharing to handle the necessary “hardware
duplication”, as shown in the example network of Figure 1.
By settling, the network determines its belief for each
chunk and each atomic symbol in each position. Thus, the
network parses the input into multiple interpretations simultaneously. Figure 2 shows an example network that has just
finished settling after presentation of weak ambiguous inputs: ‘i’ and ‘j’ at position 1, ‘h’ and ‘n’ at 2, and ‘g’ and
‘q’ at 3. At the 3-chunk level, the network settled on “ing”
as strongly dominant, and the corresponding letters at the
atom level each became more active than the alternative at
the same position.
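A minimal sketch of such a settling loop over the adjacency-list topology sketched earlier is shown below; the synchronous sigmoid update and the way evidence is injected are simplifying assumptions, not the exact interactive-activation dynamics of [1] or of the networks in [6].

```python
# Minimal sketch of a settling (relaxation) loop over the undirected topology
# built above.  The synchronous sigmoid update, the clamping scheme, and the
# constants are simplifying assumptions.
import math

def settle(links, evidence, steps=50):
    """Iteratively update node activations toward consistency with the evidence.

    `evidence` maps input nodes to external support (e.g., weak support for two
    alternative letters at a position); other nodes receive only network input.
    """
    act = {node: 0.0 for node in links}
    for _ in range(steps):
        new_act = {}
        for node, neighbors in links.items():
            net = sum(w * act[nb] for nb, w in neighbors)
            net += evidence.get(node, 0.0)                # external input, if any
            new_act[node] = 1.0 / (1.0 + math.exp(-net))  # squash to (0, 1)
        act = new_act
    return act
```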
Figure 3 shows examples of the temporal evolution of the
network’s settling. With no weight tuning (all weights were
set to +1 or -1), the model qualitatively reproduces many
of the major perceptual phenomena demonstrated by IAM,
including the word-superiority effect and the pseudo-word-superiority effect. IAM explains context effects with pronounceable non-words by activating many different 4-letter
word nodes all of which share several letters with the input.
The Sequitur-IAM hybrid adds an explanatory mechanism
in which common (sub-word) letter sequences play a role,
as shown in Figure 3(b). Further details of this model are
described in [6, Ch.4].
3 Connectionist Compositional Chunking
To grow compositional structure within a network directly, I
created a constructive network that uses a new node creation
rule, Hebbian chunking. The network used is a direct categorical generalization of a Boltzmann machine, using softmax activation functions for the categorical variables (pools
of binary units) [6, Ch.5], with the same use of unrolled weight sharing described above.
Figure 4: An example of chunking “th”. Large circles
represent pools of alternative units. Small circles represent
nodes in the categorical Boltzmann machine. The thick line
represents the weight w_th (left) that is transformed into a node
(right). ∅ nodes represent no known chunk at that position.
The resulting model learns
both structure and parameters. It can grow to arbitrary size,
decreasing both bias error and variance error [7], by growing
new parameters to capture wider interactions and by tuning
its existing parameters, respectively.
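Concretely, the categorical generalization replaces each independent stochastic binary unit with a pool in which exactly one unit is active at a time, sampled with softmax probabilities. The sketch below shows one such pool update; the names and the Gibbs-style usage are assumptions for illustration, not the exact procedure of [6, Ch.5].

```python
# Sketch of a softmax-pool update, the categorical generalization of a
# stochastic binary Boltzmann unit: exactly one unit per pool is active,
# sampled with probability proportional to exp(net input).
import math
import random

def sample_pool(pool_units, net_input):
    """Pick one unit from `pool_units` with softmax probabilities.

    `net_input[u]` is the summed weighted input (plus bias) reaching unit u
    from the rest of the network.
    """
    exps = [math.exp(net_input[u]) for u in pool_units]
    r = random.random() * sum(exps)
    for u, e in zip(pool_units, exps):
        r -= e
        if r <= 0:
            return u
    return pool_units[-1]  # numerical fallback
```

Sweeping this update over all pools in clamped and free-running phases yields the co-occurrence statistics that drive the Hebbian weight updates discussed next.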
Node creation depends specifically on the nature of Boltzmann weight updates, which are based on the Hebb rule: the strength of a weight goes up if both of its nodes fire together
frequently. Hebbian weight updates lead to a large weight
if the weight connects two nodes representing concepts that
co-occur often. Subsequently, when presented with inputs
that activate either, the network will reproduce the other. I
introduce a new chunking rule, Hebbian chunking, which
works as follows. When the weight w_ij between two nodes, node_i and node_j, rises above a threshold parameter, remove w_ij and create a new node node_ij with connections to both node_i and node_j, both with large positive weights.
After the modification, activation must flow indirectly between the two original nodes through node_ij, but activation
of either will still tend to activate the other, and in addition
the new node will become active in proportion to the evidence for the compositional pattern consisting of both node_i and node_j (see Figure 4).
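In code, the rule can be sketched roughly as follows; the threshold, the weight value for the new links, and the dictionary representation of weights are assumptions made for illustration (the actual model in [6, Ch.5] operates on the pooled, weight-shared structure described next).

```python
# Rough sketch of the Hebbian chunking rule described above.

CHUNK_THRESHOLD = 2.0   # assumed threshold on a lateral weight
CHUNK_WEIGHT = 5.0      # assumed large positive weight for the new links

def hebbian_chunk_step(weights, nodes, chunkable):
    """weights: dict mapping frozenset({i, j}) -> current weight value.
    nodes: set of existing node ids.
    chunkable(i, j): predicate saying whether the lateral weight between i
    and j is a candidate for chunking (compatible positions/levels).
    Returns the list of newly created chunk nodes."""
    created = []
    for pair in list(weights):
        i, j = tuple(pair)
        if weights[pair] > CHUNK_THRESHOLD and chunkable(i, j):
            del weights[pair]                  # remove the direct weight w_ij
            node_ij = ('chunk', i, j)          # new node for the pattern (i, j)
            nodes.add(node_ij)
            weights[frozenset({node_ij, i})] = CHUNK_WEIGHT
            weights[frozenset({node_ij, j})] = CHUNK_WEIGHT
            created.append(node_ij)
    return created
```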
The newly created node_ij can also be connected to other
nodes, enabling hierarchical composition. (Only lateral
weights between compatible nodes are considered for Hebbian chunking.) Figure 5 shows the structure of the networks. The implicit compositional hierarchy can always be
unrolled to any given width, but a minimum full width necessary to make use of the largest chunks is dictated by the
height of the hierarchy. Hebbian chunking automatically
promotes second-order correlations to become themselves
first-class entities, allowing subsequent detection of higher
and higher order relationships, in this case corresponding
specifically to compositional patterns.
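A rough sketch of this unrolling with shared weights (cf. Figures 1 and 5) might look as follows; the binary-rule representation and the width computation are assumptions for illustration.

```python
# Sketch of unrolling a chunk hierarchy across input positions with weight
# sharing: every positional copy of a chunk reuses the same weight set.

def unroll(rules, input_width):
    """rules: chunk -> (left_child, right_child); atomic symbols have no rule.
    Returns (chunk, position, shared_weight_key) triples."""
    def width(sym):
        if sym not in rules:
            return 1                       # an atomic symbol spans one position
        left, right = rules[sym]
        return width(left) + width(right)

    instances = []
    for chunk in rules:
        w = width(chunk)
        for pos in range(input_width - w + 1):
            # All copies of `chunk` share the same weight key.
            instances.append((chunk, pos, ('weights', chunk)))
    return instances
```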
Though there is not an active thread of work on constructive algorithms for symmetric-recurrent networks, Saul and
Jordan [8, 9] present a technique called decimation for going
in the opposite direction, removing nodes. It is not meant to
actually shrink networks, but to perform efficient inference
atoms
Figure 5: The structure of unrolled networks (left) corresponding to the core model of layered sets of chunks and
weights (right). Circles here represent pools of alternative
nodes. Identical (shared) weight sets are shown with the
same number and style of lines.
by removing nodes and updating the remaining weights to
ensure similar dynamics. Hebbian chunking in some
ways represents a type of reverse decimation, though the
subsequent extra connections that enable further compositional chunking remove the nice efficiencies of decimatable
topologies.
Experiments with strings of English letters from
punctuation-removed text showed that the networks learn
chunks corresponding to a hierarchy of compositionally related common letter sequences. Prediction accuracy generally improves over time as training proceeds, with chunking
allowing the networks to break out of plateaus to higher levels of accuracy. More details are omitted here for space; see
[6, Sec.6.3.4]. Thus, these networks successfully discover a
hierarchy of frequent patterns while demonstrating the ability to make reasonable predictions with arbitrary inference
patterns.
Nonetheless, the model is far from perfect. The predictive
accuracies are poorer than they could be with further work
to optimize the model and its parameters, including possibly
adding general-purpose hidden units as well as the specifically compositional units. Also, the speed of training suffers
from both the general intractability of undirected graphical
models with complicated structure and the specifically slow
speed of Boltzmann machine training. (The goal of the research was to focus on compositional learning, rather than
to push the frontier of research on the core dynamics of any
particular model class. Thus, rather than optimizing these
networks further, I have also investigated compositional
learning in non-connectionist models. Specifically, I introduced a novel hierarchical sparse n-gram model that learns
on-line by stochastically selecting new n-tuple patterns biased
towards compositions of existing known-frequent patterns
[10], essentially using a probabilistic, on-line analog of the
pruning used in association rule mining.)
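As a flavor of that mechanism, the sketch below proposes a candidate new pattern by concatenating two patterns sampled in proportion to their observed counts; the specific sampling scheme and temperature parameter are assumptions for illustration, not the actual model of [10].

```python
# Sketch of the pattern-growth flavor of the hierarchical sparse n-gram
# model [10]: candidate larger patterns are proposed stochastically, biased
# toward concatenations of patterns already known to be frequent.
import random

def propose_pattern(counts, temperature=1.0):
    """Sample two known-frequent patterns (strings or tuples) in proportion to
    their counts and return their concatenation as a candidate new pattern."""
    patterns = list(counts)
    weights = [counts[p] ** (1.0 / temperature) for p in patterns]
    left, right = random.choices(patterns, weights=weights, k=2)
    return left + right
```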
The most closely related work to Hebbian chunking is the
relatively unknown thesis of Ring [11], which is presented
as specifically reinforcement learning research but could be
separated from the issues of that paradigm and treated purely
as sequence learning (though he did not do this). Ring presented two connectionist systems, Behavior Hierarchies and
Temporal Transition Hierarchies, both for forward prediction.
In Temporal Transition Hierarchies new units are created
whenever a weight is “pulled” strongly in both directions.1
There is no particular compositional bias though, so the similarity to the present work ends with the node creation from
weights.
Behavior Hierarchies learn compositional hierarchies of
actions. The network is a fully-connected single-layer
autoassociative net. Compositionality enters in when a
large weight triggers the creation of a new node, added to
both the input and output layers, which corresponds to the
macro-action of the two previous actions connected by the
weight. Removed from the reinforcement learning context,
the system operates essentially as a first-order connectionist Markov model, but one that dynamically adds new states
with multisymbol outputs. The weight-based triggering for
new node creation is similar to my independently conceived
chunking mechanism based on Hebbian weight dynamics,
but Ring’s system can only predict in one direction and the
lack of network unrolling and weight sharing means that
his networks cannot really recognize higher-level concepts
based on accumulated evidence from a wide receptive field.
Also, the deterministic prediction of all the elements of a
macro will often lead to poorer predictions, even in the forward direction. For a more detailed comparison with Ring’s
work and extensive comparisons to other systems, see [6,
Sec.6.4 and Ch.9].
4 Conclusion
This paper presents research into on-line learning of compositional hierarchies of frequent patterns by data-driven
bottom-up mechanisms, a natural and overdue extension to
add learning abilities to early non-learning systems such as
IAM. The systems presented here are some of the first capable of on-line learning of frequent patterns of ever-growing
complexity using storage space that grows significantly sublinearly in the input data (Ring’s second system is the only
other of which I am aware).
This research has concentrated on prediction as the primary way to utilize compositional hierarchies of frequent
patterns, but frequent chunks can be very valuable in many
other ways within a larger computational system. Learned
chunks can act as aids to increase working memory capacity, based on substitution recoding, which improves information processing capacity quite broadly [13]. Frequent patterns are important for developing communication or shared
language. Frequent chunks can serve as important features
1 This seems very similar to Hanson’s meiosis networks [12],
though Ring seemed to be unaware of that earlier work.
for other types of learning and can enable the automatic formation of associations that would otherwise be impossible to
induce. For more on each of these topics, see [6, Sec.10.3].
The point is that a single general learning mechanism can
build compositional representations that both enable useful
predictions and also serve as foundations for improving several aspects of an agent’s cognitive behavior.
Hierarchical compositional chunking is one of the keys
for bridging the gap between low-level, fine-grained representations and high-level concepts. For more extensive
treatment of all material, including motivations, derivations,
a formal description of the learning problem, experimental results, related work comparisons, and explanations of
why the various learning research communities should devote more attention to the learning of compositional structure, see [6].
References
[1] James L. McClelland and David E. Rumelhart. An interactive
activation model of context effects in letter perception: Part 1.
an account of basic findings. Psychological Review, 88:375–
407, 1981.
[2] Geoffrey E. Hinton. Mapping part-whole hierarchies into
connectionist networks. Artificial Intelligence, 46(1–2):47–
75, 1990.
[3] James L. McClelland and Jeffrey L. Elman. The TRACE
model of speech perception. Cognitive Psychology, 18:1–86,
1986.
[4] James L. McClelland. Stochastic interactive processes and
the effect of context on perception. Cognitive Psychology,
23:1–44, 1991.
[5] Craig G. Nevill-Manning and Ian H. Witten. Identifying hierarchical structure in sequences: a linear-time algorithm. Journal of Artificial Intelligence Research, 7:67–82, 1997.
[6] Karl Pfleger. On-Line Learning of Predictive Compositional
Hierarchies. PhD thesis, Stanford University, June 2002. See
http://ksl.stanford.edu/~kpfleger/thesis/.
[7] Stuart Geman, Elie Bienenstock, and Rene Doursat. Neural
networks and the bias/variance dilemma. Neural Computation, 4:1–58, 1992.
[8] Lawrence K. Saul and Michael I. Jordan. Boltzmann chains
and hidden Markov models. In Advances in Neural Information Processing Systems 7. MIT Press, 1994.
[9] Lawrence K. Saul and Michael I. Jordan. Learning in Boltzmann trees. Neural Computation, 6(6):1174–84, 1994.
[10] Karl Pfleger. On-line cumulative learning of hierarchical
sparse n-grams. In Proceedings of the International Conference on Development and Learning, 2004. To appear. See
ksl.stanford.edu/~kpfleger/pubs/.
[11] Mark Ring. Continual Learning in Reinforcement Environments. R. Oldenbourg Verlag, München, Wien, 1995.
[12] Stephen J. Hanson. Meiosis networks. In Advances in Neural
Information Processing Systems 2, pages 533–541, 1990.
[13] George A. Miller. The magical number seven, plus or minus two: Some limits on our capacity for processing information.
Psychological Review, 63, 1956.