Proceedings of the European Cognitive Science Conference, Delphi, Greece, May 2007, pp. 312–317

A Generalization of Hebbian Learning in Perceptual and Conceptual Categorization

Harry E. Foundalis (harry@cogsci.indiana.edu)
Center for Research on Concepts and Cognition, Indiana University
Bloomington, IN 47408, USA

Maricarmen Martínez (m.martinez97@uniandes.edu.co)
Department of Mathematics, Universidad de los Andes
Bogotá, Colombia

Abstract

An algorithm for unsupervised competitive learning is presented that, at first sight, appears to be a straightforward implementation of Hebbian learning. The algorithm is then generalized in a way that preserves its basic properties but facilitates its use in completely different domains, thus rendering Hebbian learning a special case of its range of applications. The algorithm is not a neural network application: it works not at the neural but at the conceptual level, although it borrows ideas from neural networks. Its performance and effectiveness are briefly examined.

Introduction

Traditionally, learning has been considered one of the foundational pillars of cognition. In 1949, Donald Hebb expressed the basic idea of associative learning as follows: when two neurons are physically close and are repeatedly activated together, some chemical changes must occur in their structures that signify the fact that the two neurons fired together (Hebb, 1949). Neuroscientists and cognitive science researchers dubbed this idea “Hebbian learning”, and took it to mean that whenever two percepts appear together, we (or animals in general) learn an association between them. The “how” is usually left to algorithms in artificial neural networks (ANNs). James McClelland points out in a recent study that the potential of Hebbian learning has been largely underestimated (McClelland, 2006).
McClelland focuses on the neuronal level, but also refers to work at the conceptual level (the focus of the present article), specifically in categorization (e.g., Schyns, 1991) and in self-organizing topological maps (Kohonen, 1982). In what follows, an algorithm that uses Hebbian learning at the conceptual level is introduced through an example drawn from human cognition. It is then generalized in a way that, while retaining its basic properties, makes it applicable to problems that involve the detection (and hence the learning) of sets of entities that are most closely related to each other.

The Basic Algorithm

Suppose we are given input consisting of a pair of elements: a drawing depicting some familiar objects, and a phrase with identifiable words, the meaning of which is unknown (Figure 1). The phrase is supposed to be “about” the objects, their properties, and/or their relations. The problem is to find the correct associations between words and percepts; or, stated otherwise, to learn the meaning of the words (in some rudimentary sense of the word “meaning”1).

Figure 1: A drawing, and its associated phrase (“nae triogon aems es nae cycol”)

The drawing in Figure 1 shows two familiar geometric objects, and the phrase is given underneath the objects in italics. The learner is not expected to know any words for these objects or their properties in some other language; indeed, the phrase might originate from what will turn out to be the learner’s native language.2 The learner will receive a number of such examples, pairing visual and linguistic data, and the algorithm described below will discover the correct associations between words and visual percepts. This problem has also been examined by Deb Roy (Roy, 2002), but Roy’s learning succeeds by batch processing, whereas the learning described in the present article is incremental.
Suppose the learner is capable of perceiving some objects, features, and/or relations (collectively called percepts) by looking at the visual input; these percepts activate corresponding concepts in long-term memory (LTM). Also, the learner can identify (i.e., separate from each other) the words in the given phrase.3 In reality, an infant learning a language will not identify all the words in a phrase, but assuming so does not harm the generality of our algorithm. As a first, perhaps naïve, step in our effort to discover which word corresponds to which percept, let us adopt the following simplistic strategy: associate every concept with every word. Figure 2 depicts this idea. LTM concepts that were activated by percepts in the input are listed on the top row, and the words of the given phrase are added to LTM and shown on the bottom row. Notice that each word is shown only once (for instance, “nae”, which is repeated in the input, is not repeated in Figure 2).

Figure 2: First uninformed step in the algorithm

The top row of nodes in Figure 2 shows only some of the LTM concepts that could be activated by the input; specifically, “circle”, “triangle”, “in”, “three-ness”, “right angle”, and “one-ness”. A look by a different agent (or by the same agent at a different time) might activate different concepts, such as “round”, “surrounds”, “slanted line”, etc. Those shown in Figure 2 are the ones that happened to be activated most strongly in the given learning session.

1 Contrary to our simplifying assumption, in reality words and percepts are not associated according to a 1–1 correspondence.
2 However, infants do not become native speakers by hearing phrases that refer to geometric objects. What is described here is an abstraction of a real-world situation.
3 Thus, suppose the “word-segmentation problem” is solved.
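To make the bookkeeping concrete, the uninformed all-pairs step can be sketched as below. This is a minimal illustration under our own naming (`associate_all`, a plain dictionary of pair strengths), not the authors' actual data structures:

```python
# Minimal sketch of the "associate everything with everything" step.
# Hypothetical names and data structures; not Phaeaco's actual code.

def associate_all(concepts, words, associations):
    """Add an association (initially at strength 0) for every
    concept-word pair activated by the current input."""
    for c in concepts:
        for w in set(words):          # each word counted once, like "nae"
            associations.setdefault((c, w), 0.0)
    return associations

# Input 1: percepts activated by the drawing, and the words of the phrase.
concepts = ["circle", "triangle", "in", "three", "right-angle", "one"]
words = ["nae", "triogon", "aems", "es", "nae", "cycol"]

table = associate_all(concepts, words, {})
print(len(table))   # 30 candidate pairs (6 concepts x 5 distinct words)
```

Calling `associate_all` again with a second input would simply add any new pairs while leaving the strengths of existing ones untouched; reinforcement and fading of those strengths are handled separately, as described next.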
Next, the learning session continues with a second pair of visual and linguistic input.

Figure 3: A second pair of visual and linguistic input

The image in Figure 3 again activates some concepts in LTM, some of which were also activated by the input of Figure 1 (e.g., “triangle”). Linguistically, there are some old and some new words. Also, one word looks very much like an old one (“triagon” vs. “triogon”). A suitable word-identification algorithm would not only notice the similarity but also analyze the morphological difference, which would later allow the agent to associate this morphological change with the visual percept responsible for it. At this stage, however, no handling of morphology is required by the algorithm; it suffices to assume that the two words, “triagon” and “triogon”, are treated as “the same”.

The algorithm now adds the new concepts to the list at the top (that is, it forms the union of the sets of old and new concepts) and the new words to the list at the bottom.

Figure 4: Some associations are reinforced

For lack of horizontal space, only a few of the new words and concepts are shown in Figure 4. The important development in Figure 4 is that the concept–word pairs that are repeated in inputs 1 and 2 (such as the pairs formed with “triogon” and with “nae”) have their associations reinforced (shown in Figure 4 with bold lines), whereas the associations of those pairs of input 1 that did not co-occur in input 2 have faded slightly (shown in Figure 4 with dotted lines). At the same time, as before, the newly added words and concepts form all possible associations (lines of normal thickness). The expectation is now clear: given more instances of paired visual and linguistic input (not too many: at most a few dozen), the “correct” associations (i.e., those intended by the input provider) should prevail, while those that are “noise” (unintended) should be eliminated, having faded beyond some detection threshold.

Figure 5: Desired (final) set of associations

Figure 5 shows an approximation of the desired result of this procedure. (There will be many more concepts and words introduced on the two rows of nodes, and most, but not all, associations will be “correct”: there will be spurious associations that did not fade to the point of elimination.)

Activations of Associations

For the above-outlined algorithm to converge to the desired associations, a precise mechanism of association fading and reinforcement must be established. If the associations fade too fast, for example, they will all be eliminated before additional evidence arrives to make them strong enough to survive to the end of the process; if they fade too slowly, all associations (including “noise”) will eventually survive; finally, if even a minute degree of fading is allowed while the system receives no further input (waiting, doing nothing), then, given enough time, all associations will drop back to zero and the system will become amnesic. To counter these problems, the “strength” of each association, which we call its activation, is expected to have the following properties.

Figure 6: How activation changes over time (y-axis: activation α(x), from 0 to 1; x-axis: x, moving in discrete steps between 0 and 1)

The activation of an association is the quantity α(x) on the y-axis of Figure 6. Suppose that, at any point in time, there is a quantity x on the x-axis, ranging between 0 and 1 but able to take on values only at one of the discrete points shown on the x-axis. (The total number of these points is an important parameter, to be discussed soon.) Thus, x makes only discrete steps along the x-axis. The quantity α(x) is then given by the sigmoid-like function α, which also ranges between 0 and 1. Below, we explain how and when x moves along the x-axis, and why α must be sigmoid-like.

Activations fade automatically as time goes by. This means that if a sufficient amount of time elapses, x makes one discrete step backwards in its range. Consequently, α(x) decreases, since α is monotonic. “Time” is usually simulated in computational implementations, but in a real-time learning system it is assumed to be the familiar temporal dimension (of physics).

Activations are reinforced by the learning system by letting x make a discrete step forward in its range. Again, the monotonicity of α implies that α(x) increases. However, when x exceeds a threshold just before the maximum value 1, the number of discrete steps along the x-axis is increased somewhat (the resolution of the segmentation of the x-axis grows). This implies that the subsequent fading of the activation will be slower, because x will have more backward steps to traverse along the x-axis. The meaning of this change is that associations that are well established should become progressively harder to fade, after repeated confirmations of their correctness. The amount by which the number of steps is increased is a parameter of the system.

Finally, an explanation must be given for why α must be sigmoid-like. First, observe that α must be strictly monotonically increasing; otherwise the motion of x along the x-axis would not move α(x) in the proper direction. Now, of all the monotonically increasing curves that connect the lower-left and upper-right corners of the square in Figure 6, the sigmoid is the most appropriate shape, for the following reasons. The curve must initially increase slowly, so that a small number of reinforcements starting from x = 0 does not result in an abrupt increase in α(x). This is necessary because, if a wrong association is made, we do not want a few initial reinforcements to yield a significant α(x): we do not wish “noise” to be taken seriously. Conversely, once x has approached 1, and thus α(x) is also close to 1, we do not want α(x) to drop suddenly to lower values; α must be conservative, meaning that once a significant α(x) has been established it should not be too easy to “forget” it. This explains the slow increase in the final part of the curve in Figure 6. Having established that the initial and final parts must increase slowly, there are only a few possibilities left for the middle part of a monotonic curve; hence the sigmoid shape of α.

Implementation

The above-described principles were implemented as part of Phaeaco, a visual pattern-recognition system (Foundalis, 2006), in which the visual input consisted of a 200 × 200 rectangle of black-and-white dots. A “training set” consisted of 50 pairs of the form [image, phrase], where the images were similar to those shown earlier in Figure 1 and Figure 3 (they contained geometric figures), and the phrases (in English) were likewise relevant to the content of their paired images. Phaeaco is capable of perceiving the geometric structure of such figures, building an internal representation of that structure in its “working memory” according to its architectural principles, and letting the parts of this representation activate the corresponding concepts in its long-term memory (LTM). Thus, if a square is drawn in the input, the following concepts might be activated in Phaeaco’s LTM: square, four, four lines, parallel lines, equal lengths, interior, four vertices, right angle, equal angles, etc. (listed in no particular order). Some of these concepts will be activated more strongly than others, owing to the principles of Phaeaco’s architecture (e.g., square is more “important” than vertex, because a vertex is only a part of the representation of a square). Normally, Phaeaco starts with some “primitives”, or a priori known concepts (such as point, line, angle, interior, etc.), and is capable of learning other concepts (such as square) based on the primitives.4 However, for the purposes of the described experiment we suppressed Phaeaco’s mechanism for learning new concepts, so as to avoid interference with the learning of associations between words and concepts. Thus, we worked with an LTM that already included composite concepts such as square, triangle, etc.: that is, anything that might appear in the visual input of a training set. An additional simplifying assumption was that the rudimentary morphology of English was ignored. Thus, words such as “triangle” and “triangles” were treated as identical; so were all forms of the verb “to be”, etc. The entire training set can be found in Martínez (2006).

Performance

The following graph (Figure 7) presents the progress of the learning of correct associations over time.

Figure 7: Progress of learning over time (y-axis: number of correct associations, 0–12; x-axis: time, i.e., input pairs presented, 0–50)

The graph in Figure 7 shows that the average number of correctly learned associations (y-axis) is a generally increasing function of the number of input pairs presented (x-axis). Since the input pairs were presented at regular intervals in our implementation, the x-axis can also be seen as representing time. The y-values are averages over a large number (100) of random presentation orderings of the training set. The gradient of the slope of the curve (the “speed of learning”) depends on a variety of factors, including the settings of the activation parameters and the content of the visual and linguistic input. Under most reasonable settings, however, the learning curve increases slowly but steadily with respect to time, and this was the main objective of our implementation.
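As a concrete illustration, the activation mechanics described above (discrete steps of x, a sigmoid-like α, and a step count that grows once x nears 1) can be sketched as follows. The class name, the particular logistic curve, and all parameter values are illustrative assumptions, not Phaeaco's actual settings:

```python
import math

class Activation:
    """Sketch of one association's activation (cf. Figure 6).
    Hypothetical parameters; not Phaeaco's actual settings."""

    def __init__(self, n_steps=10, growth=2, threshold=0.9):
        self.n = n_steps          # discrete positions of x in [0, 1]
        self.i = 0                # current position; x = i / n
        self.growth = growth      # extra steps added once x exceeds threshold
        self.threshold = threshold

    def alpha(self):
        """Sigmoid-like alpha(x): near 0 and 1 at the ends, steep in the middle."""
        x = self.i / self.n
        return 1.0 / (1.0 + math.exp(-10.0 * (x - 0.5)))

    def reinforce(self):
        """Step x forward; refine the resolution near the top, so that
        well-established associations fade more slowly afterwards."""
        self.i = min(self.i + 1, self.n)
        if self.i / self.n > self.threshold:
            self.i = round(self.i * (self.n + self.growth) / self.n)
            self.n += self.growth

    def fade(self):
        """Step x backward when enough time passes without reinforcement."""
        self.i = max(self.i - 1, 0)

a = Activation()
initial = a.alpha()
for _ in range(5):
    a.reinforce()
print(a.alpha() > initial)   # True: repeated reinforcement raises alpha(x)
```

Note how each confirmation past the threshold adds backward steps for x to traverse, so a well-confirmed association never fades to zero as quickly as a fresh one.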
In addition, notice that our algorithm requires only one presentation of the training set, as opposed to the multiple epochs typically required in ANNs.

4 This learning ability is completely unrelated to the algorithm for learning associations discussed in the present text.

A Seemingly Different Application

The above-described process can be generalized in an obvious, and rather trivial, way to any case in which data from two different sets can be paired and associations between their members can be discerned and learned. For example, consider discovering the cause of an allergy. One set is “the set of all possible pathogens that could cause me this allergy”, and the second set has a single member, the event (or fact) “I have this allergy”. What is needed is an association of the form “pathogen X causes this allergy in me”. We might observe the candidate causes over a long period of time, and only subconsciously, without actively trying to discover the cause. Over time, some candidates are reinforced through repetition, whereas others fade. Given the right conditions (a sufficient number of repetitions, and an observation interval not so long that all associations fade back to zero), we might reach an “Aha!” moment, in which one of the associations becomes strong enough to be noticed consciously.5 Similarly, some examples of scientific discovery (“what could be the cause of phenomenon X?”) can be implemented algorithmically by means of the same process. However, the generalization discussed in what follows goes beyond the pairing of elements of two sets. Suppose that the input is in the visual modality only, and consists of the shapes shown in Figure 8 (a).
Figure 8 (a): Sample visual input; (b): a few pixels seen

Suppose also that a visual processing system examines the black pixels of this input not in some systematic, top-to-bottom, left-to-right fashion, but randomly. Indeed, this is the way in which Phaeaco examines its input (Foundalis, 2006). After some fast initial processing, Phaeaco has seen a few pixels that belong to the central (“median”) region of the parallelogram and/or the circle, as shown in Figure 8 (b). Shortly afterwards, a few more pixels become known, as shown in Figure 9 (a), where the outlines of the figures are barely discernible (to the human eye). Within a similarly short time, enough pixels have been seen, as in Figure 9 (b), for the outlines to become quite clear, at least to the human eye. What is needed is an algorithm that, when employed by the visual processing system, will make it as capable and as fast at discerning shapes from a few pixels as the human visual system. To achieve this, Phaeaco employs the following algorithm. As soon as the first pixels become known (Figure 8 (b)), Phaeaco starts forming hypotheses about the line segments on which the so-far known pixels might lie (Figure 10).

Figure 10: First attempts to “connect the dots” with lines (legend: weaker vs. stronger activation)

Most of these hypotheses will turn out to be wrong (“false positives”). But this does not matter. Any of the initial hypotheses that are mere “noise” will not be reinforced, so their activations will fade over time, whereas the correct hypotheses for line segments will endure. Thus, Phaeaco entertains line-segment detectors: line segments equipped with an activation value. The method of least squares is used to fit line segments to the points, and the activation value of each detector depends both on how many points (pixels) participate in it and on how well those points fit.
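The least-squares detector just described can be sketched as follows; `fit_line`, `detector_activation`, and the way support and fit quality are combined are our own illustrative choices (and the sketch assumes non-vertical lines), not Phaeaco's actual formulas:

```python
# Sketch of a line-segment detector: a least-squares fit plus an
# activation that rewards many well-fitting points. Illustrative only.

def fit_line(points):
    """Least-squares fit y = a*x + b; returns (a, b, mean squared error).
    Assumes the points are not all on a vertical line."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    mse = sum((y - (a * x + b)) ** 2 for x, y in points) / n
    return a, b, mse

def detector_activation(points, sigma=1.0):
    """More supporting points and a tighter fit yield a stronger detector."""
    _, _, mse = fit_line(points)
    support = 1.0 - 1.0 / len(points)          # grows with the point count
    quality = 1.0 / (1.0 + mse / sigma ** 2)   # shrinks with the fit error
    return support * quality

collinear = [(0, 0), (1, 1), (2, 2), (3, 3)]
noisy = [(0, 0), (1, 3), (2, 1), (3, 5)]
print(detector_activation(collinear) > detector_activation(noisy))  # True
```

A detector fed spurious points thus gains little activation, while one lying along a true edge keeps accumulating well-fitting points and grows stronger, mirroring the reinforce-and-fade dynamics of the word-percept associations.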
Note that only one of the detectors in Figure 10 is a “real” one, i.e., one that will become stronger and survive until the end (the nearly horizontal one at the top); but Phaeaco does not know this yet. Subsequently, more points arrive, as in Figure 9 (a). Some of the early detectors of Figure 10 will receive one or two more spurious points, but their activations will fade more than they are reinforced. Also, a few more spurious detectors might form. But, simultaneously, the “real” ones will start emerging (Figure 11).

Figure 9 (a) – (b): More pixels seen

5 It might be incorrect to make this association, thus concluding the wrong cause, but it is the process that concerns us here.

Figure 11: More points keep the detectors evolving (legend: weak, medium, strong activation)

Note that several of the early detectors of Figure 10 have disappeared in Figure 11, because they were not fed with more data points and “died”. Thus, the fittest detectors survive at the expense of other detectors that do not explain the data sufficiently well. The last (but not final) stage in this series is shown in Figure 12, below.

Figure 12: Most of the surviving detectors are “real” (legend: weak, medium, strong activation)

In Figure 12, the desired detectors that form the sides of the parallelogram are among the survivors. Also, the tiny detectors that will later be used to identify the circle have started appearing. In the end, all “true” detectors will be reinforced with more points and deemed the line segments of the image, whereas all false detectors will fade and die. The particular reasons why Phaeaco employs this rather unconventional (randomized) processing of its visual input are explained in Foundalis (2006). For the purposes of the present article it is important to point out that the above procedure is extendible to any case where the detection of the most salient among a group of entities (objects, features, etc.) is required. The reason why this process is similar to Hebbian association-building will be discussed soon.

Generalizing to Categorization

The process described in the previous section is a specific application of the more general and fundamental process of categorization. An example will suffice to make this clear. Suppose we are visitors at a new location on Earth, where we observe the inhabitants’ faces. Initially all faces appear unfamiliar and, having seen only a few of them, we can do no better than place them all in one large category, “inhabitant of this new place”, as abstracted in Figure 13.

Figure 13: Categorization in an abstract space of facial types

Note that the face-space on the left side of Figure 13 includes only two facial dimensions, x and y, and each dot represents a face. In reality the space is multidimensional, and we become capable of perceiving more dimensions as our experience is enriched with examples. An initial stage of categorization is shown on the right side of Figure 13. The dashed outlined regions are detectors of categories, almost identical in nature to the detectors of lines of the previous section. As before, new examples are assigned to the group in which they fit best, statistically. Over time, as more examples become available, the categories that are “noise” fade, and we end up with a clearer view of the correct categories in this space, as shown in Figure 14. (The dimensionality of the space has been kept constant, equal to 2, for purposes of illustration.)

Figure 14: More points help to better discern the categories

The algorithm is described in detail in Foundalis (2006, pp. 228–235). In Phaeaco, categories formed in this way (see Figure 15) are defined by the barycenter (“centroid”) of the set, its standard deviation, and a few more statistics. They also include a few of the initially encountered data points. Thus, they are an amalgam of the prototype (Rosch, 1975; Rosch and Mervis, 1975) and exemplar (Medin and Schaffer, 1978; Nosofsky, 1992) theories of category representation, following the principles of the FARG family of cognitive architectures (Hofstadter, 1995).

Figure 15: The categories finally become clear

The connection with the earlier discussion of line-segment detection should now be clear: the same mechanism can be employed in both cases for the gradual emergence and discerning of the categories, or concepts, in the given space. The two cases, perceptual detection of lines and conceptual detection of categories, are very nearly isomorphic in nature. In general, the discerning of concepts of any kind, whether concrete objects or abstract categories, might be thought to rely on a process similar to the one described here. For concrete objects, this process has been completely automated and hardwired by evolution, so that very few of its parameters need be learned; for other, more abstract and gradually discerned categories, it might look more like the mechanism outlined above.

Discussion

Two seemingly different case studies were discussed in the previous sections: a Hebbian-like process of association-building by co-occurrence, and one of gradual discerning of categories in a multidimensional space. What is the deeper commonality between the two? Both are about discerning entities. In the case of Hebbian learning the entities are the associations, whereas in the other examples mentioned the entities are categories that appear as clusters of objects. The discerning is gradual, and initially there is a lot of “noise” in the form of wrongly detected entities. But over time the “noise” fades away and the correct entities emerge highly activated. This is an unsupervised learning algorithm, the success of which depends on appropriate settings of the parameters of the activations of entities.
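As a concrete illustration of the category detectors discussed above (a centroid, a measure of spread, and a few retained exemplars), consider the following sketch; the class, the fixed assignment radius, and the exemplar limit are hypothetical simplifications, not Phaeaco's actual representation:

```python
import math

class CategoryDetector:
    """Sketch of a category as centroid + spread + a few exemplars.
    Hypothetical parameters; not Phaeaco's actual representation."""

    def __init__(self, first_point, max_exemplars=3):
        self.n = 1
        self.centroid = list(first_point)
        self.sumsq = [c * c for c in first_point]   # for per-dimension variance
        self.exemplars = [first_point]
        self.max_exemplars = max_exemplars

    def add(self, point):
        self.n += 1
        for d, v in enumerate(point):
            self.centroid[d] += (v - self.centroid[d]) / self.n  # running mean
            self.sumsq[d] += v * v
        if len(self.exemplars) < self.max_exemplars:
            self.exemplars.append(point)   # keep the first few points seen

    def spread(self):
        """Per-dimension (population) standard deviation."""
        return [math.sqrt(max(self.sumsq[d] / self.n - self.centroid[d] ** 2, 0.0))
                for d in range(len(self.centroid))]

    def distance(self, point):
        return math.dist(self.centroid, point)

def assign(point, categories, radius=2.0):
    """Assign the point to the statistically best-fitting category,
    or start a new one if none is close enough."""
    best = min(categories, key=lambda c: c.distance(point), default=None)
    if best is not None and best.distance(point) <= radius:
        best.add(point)
    else:
        categories.append(CategoryDetector(point))
    return categories

cats = []
for p in [(0.0, 0.0), (0.5, 0.5), (10.0, 10.0), (10.5, 9.5)]:
    assign(p, cats)
print(len(cats))   # 2: two clusters of "faces" emerge
```

Keeping both the running statistics and a few exemplars is what makes the detector an amalgam of the prototype and exemplar views: the centroid and spread summarize the cluster, while the stored points preserve concrete instances.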
Activations (illustrated in Figure 6) must have the following properties:

• The initial increase in activation must be conservative, to avoid an early commitment to “noise”.
• Activation strength must fade automatically as time goes by, to avoid having all activations of associations or categories eventually reach the value 1.
• Associations or categories that consistently stand above others in strength must curb the fading speed of their activations, so that even the most important among them are not forgotten over time.

It might be thought that our choice of setting the starting value of an activation to 0 results in a strictly deterministic process. This criticism, however, is superficial. Although our model is indeed deterministic, in a real-world situation the order in which the input is encountered is practically never predetermined: indeterminism arises from the real world. Thus, what is required is that any (random) order of input allows our algorithm to run; indeed, in our measurements we varied the presentation order randomly and observed no dependence of the algorithm on any particular input order.

Although ideas similar to the above have traditionally been viewed as belonging to the neural level that inspired ANNs, our work supports the idea that the same principles may have been utilized by human cognition at a higher, conceptual level. This is not without precedent in the evolution of matter, and has been noted also in other scientific disciplines (e.g., physics, chemistry, biology). For example:

• The structure of a nucleus with surrounding material is found in atoms at the quantum level, and in planetary systems and galaxies at the macroscopic level. It is also found in biology, in eukaryotic cells, and in animal societies organized around a leading group, with the “distances” of individuals from the leader, or leaders, varying according to their social status.
• The notion of “force”: in the quantum world, forces are interactions of fermions through the exchange of bosons (e.g., Ford, 2004). Chemically, forces are responsible for molecular structure. In biology, a force is usually exerted by a muscular structure. By analogy, a “force” can also be of a psychological or social nature.
• The notion of “wave”: in the microworld there are waves of matter, or waves of probability; in the macroworld there are waves of sound, of fluids, of gravity, etc. More abstractly, there are “waves” of fashion, of cultural ideas, of economic crises, etc.

In a similar manner, we suggest that, through evolutionary mechanisms, cognition abstracted what initially served as a simple association-building mechanism in creatures that appeared early in evolutionary history into a conceptual categorization method, one that finds its most versatile expression and application in human cognition.

References

Ford, Kenneth W. (2004). The Quantum World. Harvard University Press.

Foundalis, Harry E. (2006). “Phaeaco: A Cognitive Architecture Inspired by Bongard’s Problems”. Ph.D. dissertation, Computer Science and Cognitive Science, Indiana University, Bloomington, Indiana.

Hebb, Donald O. (1949). The Organization of Behavior. New York: Wiley.

Hofstadter, Douglas R. (1995). Fluid Concepts and Creative Analogies: Computer Models of the Fundamental Mechanisms of Thought. New York: Basic Books.

Kohonen, Teuvo (1982). “Self-organized formation of topologically correct feature maps”. Biological Cybernetics, no. 43, pp. 59–69.

Martínez, Maricarmen (2006). “Implementation and performance results of the learning algorithm presented at the 2007 European Cognitive Science Conference, Delphi, Greece”: http://pentagono.uniandes.edu.co/~mmartinez97/EuroCogSci07

McClelland, James L. (2006). “How Far Can You Go with Hebbian Learning, and When Does it Lead You Astray?”. In Y. Munakata and M. H. Johnson (eds.), Processes of Change in Brain and Cognitive Development: Attention and Performance XXI, pp. 33–69. Oxford: Oxford University Press.

Medin, D. L. and M. M. Schaffer (1978). “Context theory of classification learning”. Psychological Review, no. 85, pp. 207–238.

Nosofsky, Robert M. (1992). “Exemplars, prototypes, and similarity rules”. In A. Healy, S. Kosslyn and R. Shiffrin (eds.), From Learning Theory to Connectionist Theory: Essays in Honor of W. K. Estes, vol. 1, pp. 149–168. Hillsdale, NJ: Erlbaum.

Rosch, Eleanor (1975). “Cognitive representations of semantic categories”. Journal of Experimental Psychology: General, no. 104, pp. 192–233.

Rosch, Eleanor and C. B. Mervis (1975). “Family resemblance: Studies in the internal structure of categories”. Cognitive Psychology, no. 7, pp. 573–605.

Roy, Deb K. (2002). “Learning Words and Syntax for a Visual Description Task”. Computer Speech and Language, vol. 16, no. 3.

Schyns, Philippe G. (1991). “A Modular Neural Network Model of Concept Acquisition”. Cognitive Science, vol. 15, no. 4, pp. 461–508.