On-Line Learning of Predictive Compositional Hierarchies by Hebbian Chunking

Karl Pfleger*
Computer Science Department, Stanford University
kpfleger@cs.stanford.edu

* This research was performed at Stanford. The author is now at Google.

Abstract

I have investigated systems for on-line, cumulative learning of compositional hierarchies embedded within predictive probabilistic models. The hierarchies are learned unsupervised from unsegmented data streams. Such learning is critical for long-lived intelligent agents in complex worlds. Learned patterns enable prediction of unseen data and serve as building blocks for higher-level knowledge representation. These systems are examples of a rare combination: unsupervised, on-line structure learning (specifically structure growth). The system described here embeds a compositional hierarchy within an undirected graphical model based directly on Boltzmann machines, extended to handle categorical variables. A novel on-line chunking rule creates new nodes corresponding to frequently occurring patterns that are combinations of existing known patterns. This work can be viewed as a direct (and long overdue) attempt to explain how the hierarchical compositional structure of classic models, such as McClelland and Rumelhart's Interactive Activation model of context effects in letter perception, can be learned automatically.

1 Introduction

Though compositional hierarchies have a long history within AI and connectionism, including now-classic models such as blackboard systems and the Interactive Activation model of context effects in letter perception [1] (henceforth IAM), and despite much work on important compositional representation issues, far too little work has investigated how to learn which of the exponentially many compositions are worth representing. Within the learning fields, much more effort is devoted to taxonomic relationships (e.g., clustering). Learning which compositions to represent based on the statistics of perceptions requires unsupervised, on-line structure learning that is also cumulative (builds directly on past learning) and long-term. Such research is already more difficult experimentally than investigations of batch training on datasets small enough to iterate over in order to induce relationships at a single level of abstraction or granularity. Thus, I use localist representations and the simplest of the three techniques described by Hinton [2] for representing compositional structure in connectionist networks.

Figure 1: Hypothetical partial network with 4- and 2-letter chunks, applied to 4-letter input. All "th"-to-'t' weights are identical. All "th"-to-'h' weights are identical.

This paper does not directly address the important work on more complex compositional representations, such as recursive composition, or any of the valuable work in the overlapping areas known as connectionist symbol processing and sparse distributed representations, but is instead complementary to such work. The paper focuses on how the structural parts of compositional models such as IAM can be created algorithmically. Such techniques will likely hold lessons for learning in the context of the more complex representational methods of compositional connectionism.
IAM (and blackboard systems in AI) demonstrated the ability of compositional hierarchies to make predictive inferences that smoothly integrate bottom-up and top-down influences at multiple granularities. Prediction of missing information based on context is a specific goal of my work. Note that much of the more complicated representational work has ignored this type of prediction for the moment, but it will clearly be important in the future. Unlike simple recurrent networks (Elman-style nets), IAM and the models presented here use undirected links and can predict equally well in any direction. IAM structurally encodes a compositional hierarchy of individual letters and four-letter words into a symmetric-recurrent (relaxation-style) network. For each letter position there are 26 nodes, one for each letter, and there are nodes for common four-letter words. Excitatory links connect each word node to its four constituent letters. Information flows bottom-up and top-down along these links. This produced a landmark model that was remarkably adept at prediction from context and at matching human performance data.
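For concreteness, the following is a minimal sketch of the kind of symmetric word/letter relaxation network just described: one pool of letter nodes per position, a node per known four-letter word, excitatory word-letter links, and inhibition among alternatives within a pool. The tiny word list, the node names, and the particular update rule are illustrative assumptions, not the original IAM implementation or its parameters.

# Minimal sketch (not the original IAM code) of a symmetric word/letter
# relaxation network: letter pools per position, word nodes, +1 excitatory
# word-letter links, -1 inhibition among alternatives within each pool.
import numpy as np

WORDS = ["work", "word", "wear"]              # hypothetical tiny lexicon
LETTERS = "abcdefghijklmnopqrstuvwxyz"
POSITIONS = 4

nodes = [f"{c}@{p}" for p in range(POSITIONS) for c in LETTERS] + WORDS
index = {n: i for i, n in enumerate(nodes)}
W = np.zeros((len(nodes), len(nodes)))

def link(a, b, w):
    W[index[a], index[b]] = W[index[b], index[a]] = w

for word in WORDS:                            # excitatory word <-> letter links
    for p, c in enumerate(word):
        link(word, f"{c}@{p}", +1.0)
for p in range(POSITIONS):                    # competition within each pool
    for i, a in enumerate(LETTERS):
        for b in LETTERS[i + 1:]:
            link(f"{a}@{p}", f"{b}@{p}", -1.0)

def settle(evidence, steps=50, rate=0.2):
    """Relax activations given external evidence (node name -> strength)."""
    act = np.zeros(len(nodes))
    ext = np.zeros(len(nodes))
    for n, v in evidence.items():
        ext[index[n]] = v
    for _ in range(steps):
        net = W @ act + ext
        act += rate * (np.tanh(net) - act)    # move toward a consistent state
    return {n: act[index[n]] for n in nodes}

# Ambiguous fourth letter (k vs. r): top-down feedback from the word level
# should favor 'k' because "work" is a known word and "worr" is not.
a = settle({"w@0": 1.0, "o@1": 1.0, "r@2": 1.0, "k@3": 0.5, "r@3": 0.5})
print(round(a["k@3"], 3), round(a["r@3"], 3), max(WORDS, key=lambda w: a[w]))

Clamping partial or ambiguous evidence and letting the network settle sends activation bottom-up to word nodes and back down to disambiguate letters; this is the behavior that the learned models below aim to reproduce.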
2 Greedy Composition

One of IAM's most severe limitations was its complete lack of learning. Since IAM, there has been analogous work for other sensory modalities, specifically the TRACE model of speech perception [3, 4], and there has been ample work on parameter-tuning techniques for symmetric-recurrent networks, but there has been very little work on learning any structure in such networks and virtually no work on learning to encode compositional patterns in the network structure.

I first demonstrated that some of IAM's performance properties could be replicated with automatically created structure. Sequitur [5] is a simple greedy algorithm for discovering compositional structure in discrete sequences. Essentially, each time it sees a substring of any length repeated exactly for a second time, it creates a node in the hierarchy for that substring. Sequitur incorporates no mechanism for making predictions from the compositional hierarchies it creates. I created a system that uses Sequitur to build a compositional hierarchy from unsegmented data and then incorporates the learned hierarchy into an IAM-style network, generalized to an arbitrary number of levels. The resulting network corresponds to an "unrolled" version of the hierarchy generated by Sequitur. Sequitur's hierarchy can be thought of as a grammar, for which the IAM-style network encodes all possible parses, with the appropriate use of weight sharing to handle the necessary "hardware duplication", as shown in the example network of Figure 1. By settling, the network determines its belief for each chunk and each atomic symbol in each position. Thus, the network parses the input into multiple interpretations simultaneously.

Figure 2: A small Sequitur-IAM network after settling with ambiguous input i/j, h/n, g/q. "ing" is most consistent with the chunks known to the network and dominates the interpretation. Underscores represent spaces, which were not distinguished as special. Each large rectangle collects the nodes for alternative chunks at a location. Dark shading represents activation of the individual nodes.

Figure 2 shows an example network that has just finished settling after presentation of weak, ambiguous inputs: 'i' and 'j' at position 1, 'h' and 'n' at position 2, and 'g' and 'q' at position 3. At the 3-chunk level, the network settled on "ing" as strongly dominant, and the corresponding letters at the atom level each became more active than the alternative at the same position.

Figure 3: Plots of node activations vs. cycle during settling. Ordered top to bottom at the rightmost point: (a) Activations of 'k', 'r', and 'd' in position 4 after presentation of w, o, r, k/r. The network favors 'k' over 'r' due to feedback from the "work" node. (b) Activations of 'e' (position 2), "ea" (pos. 2–3), "re" (pos. 1–2), and 'f' (pos. 2) after presentation of r, e/f, a, t. The network favors 'e' due to the influences of the digrams "re" and "ea".

Figure 3 shows examples of the temporal evolution of the network's settling. With no weight tuning (all weights were set to +1 or -1), the model qualitatively reproduces many of the major perceptual phenomena demonstrated by IAM, including the word-superiority effect and the pseudo-word-superiority effect. IAM explains context effects with pronounceable non-words by activating many different 4-letter word nodes, all of which share several letters with the input. The Sequitur-IAM hybrid adds an explanatory mechanism in which common (sub-word) letter sequences play a role, as shown in Figure 3(b). Further details of this model are described in [6, Ch.4].
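To make the greedy chunk-creation idea concrete, the sketch below implements a simplified, digram-based chunker in the spirit of Sequitur: the second time an adjacent pair of symbols is observed, a chunk is created for that pair and substituted in place, so chunks of chunks emerge for longer repeats. It omits Sequitur's rewriting of the earlier occurrence and its rule-utility constraint, so it is an illustration of the idea rather than the published algorithm; all names here are hypothetical.

# Simplified digram-based chunker in the spirit of Sequitur: the second time
# an adjacent pair of symbols is observed, a chunk (rule) is created for it
# and substituted, so chunks of chunks emerge for longer repeats.
from itertools import count

_ids = count(1)
rules = {}        # (left, right) pair -> chunk name, e.g. ('t', 'h') -> 'C1'
expansions = {}   # chunk name -> (left, right): the learned hierarchy

def feed(stream):
    """Consume a symbol stream on-line, greedily chunking repeated digrams."""
    seen = set()
    seq = []
    for sym in stream:
        seq.append(sym)
        while len(seq) >= 2:
            pair = (seq[-2], seq[-1])
            if pair in rules:                 # known chunk: substitute it
                seq[-2:] = [rules[pair]]
            elif pair in seen:                # second occurrence: new chunk
                name = f"C{next(_ids)}"
                rules[pair] = name
                expansions[name] = pair
                seq[-2:] = [name]
            else:                             # first occurrence: just remember
                seen.add(pair)
                break
    return seq

print(feed("the_cat_then_the_hat_then_"))
print(expansions)   # chunks are defined over atoms and over earlier chunks

A hierarchy of this form, a grammar whose rules expand into atoms and earlier chunks, is what the Sequitur-IAM system unrolls into an IAM-style network with shared weights.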
3 Connectionist Compositional Chunking

To grow compositional structure within a network directly, I created a constructive network that uses a new node-creation rule, Hebbian chunking.

The network used is a direct categorical generalization of a Boltzmann machine, using softmax activation functions for the categorical variables (pools of binary units) [6, Ch.5], with the same use of unrolling and weight sharing described above. The resulting model learns both structure and parameters. It can grow to arbitrary size, decreasing both bias error and variance error [7], by growing new parameters to capture wider interactions and by tuning its existing parameters, respectively.

Node creation depends specifically on the nature of Boltzmann weight updates, which are based on the Hebb rule: the strength of a weight goes up if both nodes fire together frequently. Hebbian weight updates lead to a large weight if the weight connects two nodes representing concepts that co-occur often. Subsequently, when presented with inputs that activate either, the network will reproduce the other. I introduce a new chunking rule, Hebbian chunking, which works as follows. When the weight w_ij between two nodes, node_i and node_j, rises above a threshold parameter, remove the weight w_ij and create a new node node_ij with connections to both node_i and node_j, both with large positive weights. After the modification, activation must flow indirectly between the two original nodes through node_ij, but activation of either will still tend to activate the other, and in addition the new node will become active in proportion to the evidence for the compositional pattern consisting of both node_i and node_j (see Figure 4). The newly created node_ij can also be connected to other nodes, enabling hierarchical composition. (Only lateral weights between compatible nodes are considered for Hebbian chunking.)

Figure 4: An example of chunking "th". Large circles represent pools of alternative units. Small circles represent nodes in the categorical Boltzmann machine. The thick line represents the weight w_th (left) that is transformed into a node (right). ∅ nodes represent no known chunk at that position.

Figure 5 shows the structure of the networks. The implicit compositional hierarchy can always be unrolled to any given width, but the minimum full width necessary to make use of the largest chunks is dictated by the height of the hierarchy. Hebbian chunking automatically promotes second-order correlations to become first-class entities themselves, allowing subsequent detection of higher and higher order relationships, in this case corresponding specifically to compositional patterns.

Figure 5: The structure of unrolled networks (left) corresponding to the core model of layered sets of chunks and weights (right). Circles here represent pools of alternative nodes. Identical (shared) weight sets are shown with the same number and style of lines.

Though there is not an active thread of work on constructive algorithms for symmetric-recurrent networks, Saul and Jordan [8, 9] present a technique called decimation for going in the opposite direction, removing nodes. It is not meant to actually shrink networks, but to perform efficient inference by removing nodes and updating the remaining weights to ensure similar dynamics. Hebbian chunking in some ways represents a type of reverse decimation, though the subsequent extra connections that enable further compositional chunking remove the nice efficiencies of decimatable topologies.
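The following is a minimal sketch of the Hebbian chunking rule just described, phrased over a dictionary of symmetric weights. The weight update follows the standard Boltzmann/Hebbian form (clamped-phase co-activation minus free-phase co-activation), and any lateral weight between compatible nodes that crosses the threshold is removed and replaced by a new chunk node with large positive weights to both original nodes. The threshold, weight constants, and node names are illustrative assumptions rather than values from the thesis.

# Minimal sketch of the Hebbian chunking rule: Boltzmann/Hebbian weight
# updates (clamped-phase minus free-phase co-activation) plus promotion of
# any lateral weight that crosses a threshold into a new chunk node.
THRESHOLD = 1.5        # illustrative values, not those used in the thesis
CHUNK_WEIGHT = 4.0
LEARNING_RATE = 0.05

weights = {}           # frozenset({a, b}) -> symmetric weight
lateral = set()        # pairs eligible for chunking (compatible lateral nodes)
parents = {}           # chunk node -> (node_i, node_j)

def hebbian_update(clamped_coact, free_coact):
    """One sweep of weight updates; returns any chunk nodes created."""
    created = []
    for pair in list(weights):
        weights[pair] += LEARNING_RATE * (
            clamped_coact.get(pair, 0.0) - free_coact.get(pair, 0.0))
        if pair in lateral and weights[pair] > THRESHOLD:
            node_i, node_j = sorted(pair)
            chunk = f"[{node_i}+{node_j}]"
            del weights[pair]                                    # remove w_ij ...
            lateral.discard(pair)
            weights[frozenset({chunk, node_i})] = CHUNK_WEIGHT   # ... and reroute
            weights[frozenset({chunk, node_j})] = CHUNK_WEIGHT   # via the chunk
            parents[chunk] = (node_i, node_j)
            created.append(chunk)
    return created

# 't' in position 1 and 'h' in position 2 co-occur far more often in the data
# (clamped phase) than the current model predicts (free phase), so the weight
# between them grows until it is promoted to a "th" chunk node.
pair = frozenset({"t@1", "h@2"})
weights[pair] = 0.0
lateral.add(pair)
for step in range(50):
    new_nodes = hebbian_update({pair: 0.9}, {pair: 0.1})
    if new_nodes:
        print(f"step {step}: created {new_nodes[0]} over {parents[new_nodes[0]]}")
        break

Because the new chunk-parent weights are excluded from the lateral set, only co-occurrence between compatible pattern nodes can trigger further chunking, which is what lets the hierarchy grow level by level.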
Experiments with strings of English letters from punctuation-removed text showed that the networks learn chunks corresponding to a hierarchy of compositionally related common letter sequences. Prediction accuracy generally improves over time as training proceeds, with chunking allowing the networks to break out of plateaus to higher levels of accuracy. More details are omitted here for space, but see [6, Sec.6.3.4]. Thus, these networks successfully discover a hierarchy of frequent patterns while demonstrating the ability to make reasonable predictions with arbitrary inference patterns.

Nonetheless, the model is far from perfect. The predictive accuracies are poorer than they could be with further work to optimize the model and its parameters, including possibly adding general-purpose hidden units alongside the specifically compositional units. Also, the speed of training suffers from both the general intractability of undirected graphical models with complicated structure and the specifically slow speed of Boltzmann machine training. (The goal of the research was to focus on compositional learning rather than to push the frontier of research on the core dynamics of any particular model class. Thus, rather than optimizing these networks further, I have also investigated compositional learning in non-connectionist models. Specifically, I introduced a novel hierarchical sparse n-gram model that learns on-line by stochastically selecting new n-tuple patterns biased towards compositions of existing known-frequent patterns [10], essentially using a probabilistic, on-line analog of the pruning used in association rule mining.)

The work most closely related to Hebbian chunking is the relatively unknown thesis of Ring [11], which is presented specifically as reinforcement learning research but could be separated from the concerns of that paradigm and treated purely as sequence learning (though Ring did not do this). Ring presented two connectionist systems, Behavior Hierarchies and Temporal Transition Hierarchies, both for forward prediction. In Temporal Transition Hierarchies, new units are created whenever a weight is "pulled" strongly in both directions (a mechanism that seems very similar to Hanson's meiosis networks [12], though Ring seemed to be unaware of that earlier work). There is no particular compositional bias, though, so the similarity to the present work ends with the creation of nodes from weights. Behavior Hierarchies learn compositional hierarchies of actions. The network is a fully connected single-layer autoassociative net. Compositionality enters when a large weight triggers the creation of a new node, added to both the input and output layers, which corresponds to the macro-action of the two previous actions connected by the weight. Removed from the reinforcement learning context, the system operates essentially as a first-order connectionist Markov model, but one that dynamically adds new states with multi-symbol outputs. The weight-based triggering of new node creation is similar to my independently conceived chunking mechanism based on Hebbian weight dynamics, but Ring's system can only predict in one direction, and the lack of network unrolling and weight sharing means that his networks cannot really recognize higher-level concepts based on accumulated evidence from a wide receptive field. Also, the deterministic prediction of all the elements of a macro will often lead to poorer predictions, even in the forward direction. For a more detailed comparison with Ring's work and extensive comparisons to other systems, see [6, Sec.6.4 and Ch.9].
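Returning to the inference used in the experiments above: prediction with an arbitrary inference pattern amounts to clamping the observed pools and letting the unobserved pools settle. The sketch below shows generic clamp-and-settle inference over softmax pools, with unobserved pools repeatedly resampled from a softmax over their net inputs and empirical state frequencies read off as predictions. The toy weight set, pool names, and temperature are illustrative assumptions, not the configuration used in these experiments.

# Generic sketch of clamp-and-settle inference over softmax pools: observed
# pools are clamped, unobserved pools are repeatedly resampled from a softmax
# over their net inputs, and empirical frequencies serve as predictions.
import math
import random
from collections import Counter

# Toy weights: a "th" chunk node supports 't' in pool 0 and 'h' in pool 1.
# weights[(pool, state)] maps to that node's symmetric connections.
weights = {
    ("chunk", "th"): {("pos0", "t"): 2.0, ("pos1", "h"): 2.0},
    ("pos0", "t"): {("chunk", "th"): 2.0},
    ("pos1", "h"): {("chunk", "th"): 2.0},
    ("chunk", "none"): {("pos1", "e"): 0.5},
    ("pos1", "e"): {("chunk", "none"): 0.5},
}
STATES = {"pos0": ["t", "a"], "pos1": ["h", "e"], "chunk": ["th", "none"]}

def settle(clamped, steps=3000, temperature=1.0):
    """Gibbs-style settling with the pools in `clamped` held fixed."""
    state = {pool: random.choice(alts) for pool, alts in STATES.items()}
    state.update(clamped)
    counts = {pool: Counter() for pool in STATES}
    for _ in range(steps):
        for pool, alts in STATES.items():
            if pool in clamped:
                continue
            nets = [sum(w for (op, os), w in weights.get((pool, s), {}).items()
                        if state[op] == os) / temperature
                    for s in alts]
            z = sum(math.exp(n) for n in nets)
            state[pool] = random.choices(alts, [math.exp(n) / z for n in nets])[0]
            counts[pool][state[pool]] += 1
    return counts

# Clamp position 0 to 't'; the "th" chunk then pulls position 1 towards 'h'.
print(settle({"pos0": "t"})["pos1"].most_common())

Clamping position 0 drives the "th" chunk node, which in turn pushes position 1 towards 'h', the same bottom-up and top-down flow seen in the IAM-style models.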
4 Conclusion

This paper presents research into on-line learning of compositional hierarchies of frequent patterns by data-driven, bottom-up mechanisms, a natural and overdue extension that adds learning abilities to early non-learning systems such as IAM. The systems presented here are among the first capable of on-line learning of frequent patterns of ever-growing complexity using storage space that grows significantly sublinearly in the input data (Ring's second system is the only other of which I am aware).

This research has concentrated on prediction as the primary way to utilize compositional hierarchies of frequent patterns, but frequent chunks can be very valuable in many other ways within a larger computational system. Learned chunks can act as aids to increase working memory capacity, based on substitution recoding, which improves information-processing capacity quite broadly [13]. Frequent patterns are important for developing communication or a shared language. Frequent chunks can serve as important features for other types of learning and can enable the automatic formation of associations that would otherwise be impossible to induce. For more on each of these topics, see [6, Sec.10.3]. The point is that a single general learning mechanism can build compositional representations that both enable useful predictions and serve as foundations for improving several aspects of an agent's cognitive behavior. Hierarchical compositional chunking is one of the keys to bridging the gap between low-level, fine-grained representations and high-level concepts.

For more extensive treatment of all material, including motivations, derivations, a formal description of the learning problem, experimental results, related-work comparisons, and explanations of why the various learning research communities should devote more attention to the learning of compositional structure, see [6].

References

[1] James L. McClelland and David E. Rumelhart. An interactive activation model of context effects in letter perception: Part 1. An account of basic findings. Psychological Review, 88:375–407, 1981.
[2] Geoffrey E. Hinton. Mapping part-whole hierarchies into connectionist networks. Artificial Intelligence, 46(1–2):47–75, 1990.
[3] James L. McClelland and Jeffrey L. Elman. The TRACE model of speech perception. Cognitive Psychology, 18:1–86, 1986.
[4] James L. McClelland. Stochastic interactive processes and the effect of context on perception. Cognitive Psychology, 23:1–44, 1991.
[5] Craig G. Nevill-Manning and Ian H. Witten. Identifying hierarchical structure in sequences: A linear-time algorithm. Journal of Artificial Intelligence Research, 7:67–82, 1997.
[6] Karl Pfleger. On-Line Learning of Predictive Compositional Hierarchies. PhD thesis, Stanford University, June 2002. See http://ksl.stanford.edu/~kpfleger/thesis/.
[7] Stuart Geman, Elie Bienenstock, and René Doursat. Neural networks and the bias/variance dilemma. Neural Computation, 4:1–58, 1992.
[8] Lawrence K. Saul and Michael I. Jordan. Boltzmann chains and hidden Markov models. In Advances in Neural Information Processing Systems 7. MIT Press, 1994.
[9] Lawrence K. Saul and Michael I. Jordan. Learning in Boltzmann trees. Neural Computation, 6(6):1174–1184, 1994.
[10] Karl Pfleger. On-line cumulative learning of hierarchical sparse n-grams. In Proceedings of the International Conference on Development and Learning, 2004. To appear.
See ksl.stanford.edu/~kpfleger/pubs/.
[11] Mark Ring. Continual Learning in Reinforcement Environments. R. Oldenbourg Verlag, München/Wien, 1995.
[12] Stephen J. Hanson. Meiosis networks. In Advances in Neural Information Processing Systems 2, pages 533–541, 1990.
[13] George A. Miller. The magical number seven, plus or minus two: Some limits on our capacity for processing information. Psychological Review, 63:81–97, 1956.