RBCC Networks: A Model for Analogical Learning

Jean-Philippe Thivierge (jpthiv@ego.psych.mcgill.ca)
Department of Psychology, McGill University, 1205 Dr. Penfield Avenue, Montreal, QC, H3A 1B1 Canada

Thomas R. Shultz (thomas.shultz@mcgill.ca)
Department of Psychology, McGill University, 1205 Dr. Penfield Avenue, Montreal, QC, H3A 1B1 Canada

Abstract

This paper introduces Rule-based Cascade-correlation (RBCC), a constructive neural network able to perform analogical learning by transferring formal domain rules. The ability of the model to reproduce human behavioural data is assessed on a task of visual pattern categorization. Results show that networks, like participants, classify more accurately after pre-exposure to a formal domain rule.

Analogical Learning

A common strategy in problem solving is to apply pre-acquired knowledge (the source) to solve a novel problem (the target). When this process takes place in a situation that requires learning (i.e., an adaptation in behavior), finding a solution may involve two distinct processes. The first, known as analogical transfer (Melis & Veloso, 1997; Forbus & Gentner, 1989; Hall, 1989), consists of retrieving a relevant source and establishing a correspondence between that source and the task. The second, known as induction, is concerned specifically with searching for a solution to the target task. Combined, these two processes can be described by the term “analogical learning”. With respect to the type of source knowledge that can be transferred, many learning situations involve the application of pre-acquired rules. Learning to categorize odd and even numbers, for instance, involves a specific rule defining even numbers as those divisible by two with no remainder. In the current paper, we introduce a novel neural network model of analogical learning that can transfer formal domain rules.
The performance of this network is compared to that of participants instructed to use a visual and verbal rule to solve a pattern categorization task.

Rule-based Cascade-correlation

To model analogical learning, the current paper employs a new type of rule-based neural network. Neural networks have been criticized for focusing on low-level cognitive tasks such as categorization, recognition, and simple learning, while having little to do with higher-level cognitive functions such as analogical reasoning (Eliasmith & Thagard, 2001). In fact, few neural network algorithms can capture a combination of rule-based and inductive learning. One such model is KBANN (Shavlik, 1994), a method for converting a set of symbolic domain rules into a feed-forward network, with the final rule conclusions at the output and intermediate rule conclusions at the hidden-unit level. This technique involves designing subnetworks that represent rules; these subnetworks are then combined to form the main network used to learn the task. The knowledge contained in the subnetworks does not have to be complete or perfectly correct, because the main network can adjust its connection weights to compensate for missing knowledge. Other models address some of the issues involved in knowledge-based problem solving (e.g., KRES, Rehder & Murphy, 2001; discriminability-based transfer, Pratt, 1993; MTL, Silver & Mercer, 1996; mixture-of-experts, Jacobs, Jordan, Nowlan, & Hinton, 1991; explanation-based neural networks, Mitchell & Thrun, 1993; DRAMA, Eliasmith & Thagard, 2001; ACME, Holyoak & Thagard, 1989; LISA, Hummel & Holyoak, 1997; ATRIUM, Erickson & Kruschke, 1998; for a review of psychological models see French, 2002; a review of computational transfer is available from Pratt & Jennings, 1996). Finally, Pulvermüller (1998) proposed a modular network to capture a combination of rule-based and case-based performance on past-tense acquisition.
However, none of these models captures the full phenomenon of analogical learning. As will be argued, our new model, termed Rule-based Cascade-correlation (RBCC), can fulfill this role. The goal in designing the RBCC algorithm was to combine analogical and inductive learning in a seamless and effective fashion. RBCC belongs to a family of neural networks that includes Cascade-correlation (CC; Fahlman & Lebiere, 1989) and Knowledge-based Cascade-correlation (KBCC; Shultz & Rivest, 2001). These algorithms share many features in the way they are trained and structured. All of these networks learn both by adjusting their connection weights and by adding new layers of hidden units. They are initially composed of a number of input and output nodes linked by connections of varying strengths (see Figure 1a). The goal of learning is to adjust the weights in order to reduce the error obtained by comparing the output of the network to an expected response. Only the weights that link directly to the output nodes are adjusted for this purpose, during a part of learning called the “output phase”. In addition to adjusting their weights, networks of the CC family can also expand their topology while learning (Figure 1b). CC, RBCC, and KBCC differ from one another in the type of neural components made available to them in order to grow. CC networks are the most basic, being limited to adding single hidden units. These units are placed between the input and output layers of the network. The goal of recruiting new units is to increase computational power according to the demands of a given task. The weights feeding the hidden units are trained to maximize the covariance between the output of the hidden units and the error of the network. A number of new units can be added, each in a new layer connected to all layers below it and to the output layer.
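The candidate-training objective just described (maximizing the covariance between a candidate unit's output and the residual error of the network) can be sketched as follows. This is a minimal illustration of the scoring step only; the function name, variable names, and toy data are our own, not part of the original algorithm.

```python
def candidate_score(candidate_out, net_error):
    """Cascade-correlation candidate score: the summed magnitude of
    covariance between the candidate's activations and the network's
    residual error, S = sum_o | sum_p (V_p - mean V)(E_po - mean E_o) |.
    candidate_out: one activation V_p per training pattern
    net_error: one error vector E_p per training pattern"""
    n = len(candidate_out)
    v_mean = sum(candidate_out) / n
    score = 0.0
    for o in range(len(net_error[0])):
        e_mean = sum(row[o] for row in net_error) / n
        cov = sum((candidate_out[p] - v_mean) * (net_error[p][o] - e_mean)
                  for p in range(n))
        score += abs(cov)
    return score

# A candidate whose output tracks the residual error outscores one
# that ignores it, so it is the unit that gets recruited.
err = [[1.0], [-1.0], [1.0], [-1.0]]
good = [0.9, 0.1, 0.8, 0.2]   # covaries with the error
flat = [0.5, 0.5, 0.5, 0.5]   # constant output, zero covariance
assert candidate_score(good, err) > candidate_score(flat, err)
```

In the full algorithm, a pool of candidates is trained in parallel on this score, and the best-scoring candidate is installed as a new hidden unit.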
RBCC networks constitute a special form of KBCC in which the network is able to recruit both single units and formal rules encoded in a distributed system. Figure 1 shows the architecture of an RBCC network before and after recruiting a rule.

Figure 1: Architecture of RBCC networks. (a) Initial architecture with 3 input units (I1 to I3) and a single output unit (O1). These nodes are connected by 3 weights (W1 to W3). (b) Architecture after recruitment of a rule (R1).

RBCC is able to account for both inductive and analogical processes, respectively by lowering its error rate and by recruiting pre-trained networks. RBCC is also able to perform knowledge selection, by incorporating into its architecture those elements that best help solve the target task. In addition, RBCC has the advantage of being able to incorporate source knowledge that represents only a partial solution to the target problem (e.g., Thivierge & Shultz, 2002). Finally, because new source elements are incorporated by stacking them on top of previously incorporated elements, RBCC can potentially perform knowledge combination.

Rules are encoded in a fashion similar to KBANN (Towell & Shavlik, 1991). KBANN proposes a method of generating networks that represent if...then rules. However, in their original paper, the authors of KBANN propose a framework only for creating disjunctive (“OR”) and conjunctive (“AND”) rules. In the current paper, we propose a generalization that allows for any n-of-m rule, where n is the number of features to select among m. This generalization has the advantage of simplifying the notation by replacing long chains of conjunctive and disjunctive rules with a simpler n-of-m chain. This strategy also enables us to produce a generalized rule for creating any n-of-m network. In a rule network, all weights are set to a constant W = 4, except the bias, which is set to

x = -W(2n - 1)/2        (1)

where x is the value of the bias weight (Towell & Shavlik, 1991). The network itself has two layers: one composed of m units for the input, and one composed of a single unit for the output. This way of encoding rules creates a network whose output fires only if at least n of the m features are present. The rule networks created in this way are presented to RBCC for selection. Thus, the proposed model is a combination of rules generated by KBANN and learning performed by RBCC. It seems reasonable from a cognitive point of view to inject rules into networks. In fact, many learning experiments provide subjects with a rule and require them to use it. RBCC performs essentially the same task: it is provided with a rule without having to learn it, and must then use it to solve a given task. This type of algorithm argues against dual systems, because rules and connectionist induction are both implemented in a homogeneous system.

Simulations

An experiment was designed to determine whether participants could benefit from a pre-acquired rule when solving a related problem. The target task consisted of categorizing visual images as either belonging to category A or not. Images not belonging to this category were composed exclusively of random features. Category A images contained some random features, but also features occurring with a high probability (p = 0.9) within the category and a low probability (p = 0.1) outside it. These features are termed “diagnostic”, because they convey critical information about category membership. The domain rule consisted of presenting participants with an image containing all of the diagnostic features that could be found in members of category A. This rule image is referred to as the “prototype” image of category A. This image was never actually seen in training, because each instance of category A contained only some of the diagnostic features.
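As a concrete illustration of the rule subnetworks that RBCC can recruit, the following minimal sketch implements an n-of-m rule as a single logistic unit: every input weight is set to the constant W = 4, and the bias is placed midway between the net input produced by n-1 and by n active features. The function names and the use of a logistic activation are our own; this is a sketch of the encoding scheme, not the exact implementation used in the simulations.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def n_of_m_unit(features, n, W=4.0):
    """Rule subnetwork in the KBANN style: m binary inputs feed one
    output unit through weights of size W. The bias -(n - 0.5) * W sits
    midway between the net input for n-1 and for n active features, so
    the unit fires (output > 0.5) only when at least n features are on."""
    bias = -(n - 0.5) * W
    return sigmoid(W * sum(features) + bias)

# A hypothetical 2-of-3 rule:
assert n_of_m_unit([1, 1, 0], n=2) > 0.5   # two features present: fires
assert n_of_m_unit([1, 0, 0], n=2) < 0.5   # one feature present: silent
```

Because the unit is logistic rather than a hard threshold, its output falls off gradually as conditions are removed instead of failing outright.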
After training, participants were tested on their ability to recognize images in which either the random or the diagnostic features had been occluded.

Participants

A total of 40 participants took part in this study and were rewarded with either course credits or a chance to win a $50 prize. All participants were undergraduates at McGill University.

Stimuli

The stimuli were based on Goldstone (1996), and consisted of dots linked by horizontal, vertical, and diagonal bars. Images varied according to the configuration of their bars. The same dots were present in all images and were simply designed to guide the participants’ visual perception. All images were presented in black on a white background, using a 15” monitor. Some images had ten possible features in total (Figure 2a), while others had twenty features (Figure 2b). To generate the stimuli, two sets of ten images each were initially created, all of which were 5” x 2.5” in size, with six dots and up to ten bars. The first set of images was designed by first randomly generating a single pattern meant to represent the prototype of category A (see Figure 3a). From this prototype, a set of ten patterns termed the “diagnostic source” was created by either adding or removing one bar. For instance, Figure 3b illustrates a pattern where one diagonal bar was added to the initial prototype shown in Figure 3a. For the second set of images, termed the “random source”, the number of bars and their locations were determined randomly, with the constraint that each pattern varied by at least two bars from any of the diagnostic source patterns.

Figure 2: (a) All possible lines in images with a maximum of ten features. (b) All possible lines in images with a maximum of twenty features. Numbers were not shown on the actual stimuli.

Figure 3: Patterns generated for the diagnostic source data set.
The image in (a) depicts the prototype; the image in (b) depicts a diagnostic source pattern created by adding a diagonal bar to the prototype.

The target training set was generated from the diagnostic and random source images. Stimuli in this set were obtained by stacking the diagnostic source images on top of the random source images, resulting in images that were 5” x 5” in size, with ten dots and a possibility of twenty bars in total. An example of such an image is presented in Figure 4.

Figure 4: Example of a target pattern.

Procedure

The experimental sessions were fully automated by a custom-designed program running on a Dell Latitude C800. Participants were randomly assigned to one of two experimental conditions, in which they were either exposed to rule-based prior knowledge (rule condition) or not (control condition). For the rule condition, the experiment was divided into three steps: (1) a practice task; (2) the presentation of the category A prototype as a rule; and (3) the target task. For the control condition, only steps 1 and 3 were involved. In the practice and target tasks, participants had to classify images as either belonging to category A or not. Category A of the practice task had no correspondence to category A of the actual target task. Feedback was provided after each image, indicating its correct category. A test phase followed each task, in which feedback was not provided. Participants were asked to respond as quickly as possible throughout, but without sacrificing accuracy. They were also instructed to use their intuition to classify the images. The practice task consisted of a relatively simple problem with patterns of nine dots each and up to twenty bars (e.g., Figure 2b). Prior to every image, a 0.25” by 0.25” cross was presented in the middle of the screen for 1000 ms to help participants focus their attention. Images were presented for 4000 ms, regardless of participants’ responses.
Twenty images were presented in training, and ten in testing. In the test phase, no feedback was provided on the correct category of the images shown. The goal of the practice task was to expose participants to a procedure similar to the actual experiment, without conveying any information about how the actual target task could be solved. Following the practice task, participants in the rule condition were presented with the prototype used to create the diagnostic features of the target task. Participants were given the following verbal instructions: “Please take as much time as necessary to memorize the position and orientation of the bars within the image. This could be very useful for you to learn to categorize patterns of category A in the task to follow.” The target task consisted of three trials of twenty images each. The general procedure for this task was the same as for the practice task. The target task consisted of discriminating between patterns containing the diagnostic features and random patterns of the same size (nine dots and twenty features). In a subsequent test phase, participants were exposed to twenty images in which certain features were occluded; sometimes the diagnostic features were occluded, and at other times the random features were occluded.

Networks

RBCC networks were run in an experiment closely matching the one described above. Forty networks were assigned to one of two conditions according to whether or not they received access to rule-based prior knowledge. In the rule condition, networks were provided with a subnetwork encoding the rule. No candidate was available in the control condition. Networks were limited to eight epochs in the output phase in both the rule and control conditions. This was intended to mimic the fact that participants were not given a chance to fully learn the patterns before being tested.
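To make the simulation setup concrete, the sketch below generates the kind of binary input vectors such networks might receive: each image becomes a vector of bar features, with diagnostic bars present with probability 0.9 inside category A and 0.1 outside it (the probabilities reported above). The feature count, the choice of which indices are diagnostic, and the 0.5 rate for the random bars are our own assumptions for illustration, not the exact stimuli used.

```python
import random

N_FEATURES = 20          # possible bars per target image
DIAGNOSTIC = range(10)   # assumed indices of the diagnostic bars

def make_pattern(in_category_a, rng=random):
    """Binary bar-feature vector. Diagnostic bars appear with p=0.9
    inside category A and p=0.1 outside it; the remaining (random)
    bars appear with an assumed p=0.5 in either case."""
    p_diag = 0.9 if in_category_a else 0.1
    return [1 if rng.random() < (p_diag if i in DIAGNOSTIC else 0.5) else 0
            for i in range(N_FEATURES)]

random.seed(0)  # deterministic for the checks below
a_patterns = [make_pattern(True) for _ in range(100)]
others = [make_pattern(False) for _ in range(100)]

# Diagnostic bars are far more frequent inside category A.
a_rate = sum(p[0] for p in a_patterns) / 100
other_rate = sum(p[0] for p in others) / 100
assert a_rate > other_rate
```

Under this encoding, no single category A pattern contains all of the diagnostic bars, yet the prototype (all diagnostic bars on) is the vector to which the rule subnetwork responds most strongly.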
The rule encoded in our simulation is meant to mimic as closely as possible the representation of the rule acquired by the participants. First, the model encodes the same image as was presented to participants. Second, the model associates this image with category A, because it is encoded in such a way as to fire the same response for this image as it would for patterns of category A.

Results

Performance on the target task was analyzed using analysis of variance (ANOVA) and one-sample t-tests, both with a minimum level of significance of p<0.05.

Training accuracy

The accuracy of participants across the learning trials is presented in Figure 5.

Figure 5: Average training error on the target task of participants.

Differences in training accuracy between the two groups of participants were assessed using a 2-way mixed-model ANOVA with trial (1st, 2nd, and 3rd) as a within-subject factor, and prior knowledge (rule versus none) as a between-subject factor. The main effect of source task was significant (F(2, 1506) = 4.56, p<0.01), confirming that the accuracy of the rule group was reliably higher than that of the control group throughout. The main effect of trial (F(3, 1506) = 3.82, p<0.01) was also significant, demonstrating that accuracy improved across the learning trials for both groups. The interaction of trial and source task was not significant (F(6, 1506) = 0.12, p>0.99), meaning that prior knowledge did not speed up the gains in accuracy through the learning trials.

One-sample t-tests were used to determine, for each trial of each group, whether performance differed significantly from chance (50% correct). Results are presented in Table 1. According to these results, the control group’s performance was not significantly different from chance until the third trial. By comparison, the rule group differed significantly from random performance from the first learning trial.

Table 1: One-sample t-tests for learning accuracy.

Condition   Trial 1                 Trial 2                 Trial 3
Control     t(19) = -0.48, p>0.64   t(17) = 1.84, p>0.08    t(17) = 2.95, p<0.01
Rule        t(20) = 5.85, p<0.01    t(20) = 4.22, p<0.01    t(20) = 6.72, p<0.01

In order to assess the fit of RBCC networks to the performance of participants on the learning trials, we compared the accuracy (SSE) of the networks to the percentage of correct responses achieved by participants (see Figure 6). The results of RBCC replicated those of participants. Specifically, we found a significant difference between the rule and control conditions (F(1, 39) = 93.24, p<0.01), with higher accuracy obtained in the rule condition.

Figure 6: Average training SSE of neural networks.

Testing accuracy

The testing accuracies of participants and RBCC networks are depicted in Figure 7. An effect of prior knowledge was found, indicating that the rule condition attained reliably better performance than the control condition, both for participants (F(3, 204) = 5.5, p<0.01) and for networks (F(1, 38) = 1807.96, p<0.01). T-tests demonstrated that both the rule (t(19) = 4.77, p<0.01) and control (t(19) = 12.7, p<0.01) participants tested above chance level, suggesting that both groups were able to learn the task to some extent through the training trials. Further analyses detailing participants’ and networks’ performance on the testing data will be included in a more extensive publication.

Figure 7: Average testing accuracy of rule and control groups.

Discussion

A vast body of evidence from the psychological literature suggests an ability to acquire rules and employ them in the resolution of novel problems (Nosofsky, Clark, & Shin, 1989; Erickson & Kruschke, 1998). To account for this phenomenon, the current paper introduces RBCC, a special case of the KBCC algorithm that can perform analogical transfer using domain rules.
The model was compared to human performance on a visual pattern categorization task. Results demonstrated that both participants and networks showed improved performance on the target task when they received prior exposure to the rule. Despite a number of studies addressing the transfer of formal rules (Pazzani, 1991; Nakamura, 1985; Heit & Bott, 1999), a large portion of the available literature is actually concerned with analogical reasoning. The current study, by contrast, deals with analogical learning. The addition of a strong learning component extends the use of rules in a novel task to selecting an appropriate rule and performing a correct mapping between that rule and the task to be solved. The RBCC model proposed here offers several advantages. First, RBCC has the ability to capture both crisp and fuzzy concepts in a single unified network. Many authors have argued that capturing rule-based as well as fuzzy performance requires two separate systems (e.g., Erickson & Kruschke, 1998). From a modeling perspective, it is highly desirable to have a model that can account for both rule-based and fuzzy performance using a simpler, homogeneous system. Such a model not only offers a more parsimonious account of the phenomenon, but is also biologically more plausible, because information in the brain is encoded by neurons and not by symbolic rules. The capacity of the network to capture rule-like performance generates an interesting hypothesis regarding the capacity of real neural systems to capture crisp, as opposed to fuzzy, representations, despite neurons being sluggish units. This paradox has puzzled researchers for years, and can now be addressed with RBCC. With the current network, we are able to account for the combination of inductive and analogical learning.
This is an advantage that is of interest for at least two reasons: (1) we can deal with imperfect rules in a given domain, and (2) we can compensate for poor induction through the use of rules. Poor induction may be due to short training times or degraded stimuli. Imperfect rules are common in many learning domains, including language, mathematics, and the acquisition of specialized skills for activities involving strategic planning. The RBCC model can also potentially account for knowledge selection in situations where multiple rules are available, although this has yet to be tested. With a sequential selection of rules, the system can potentially combine rules by stacking them on top of each other. The use of multiple rules in KBCC networks has been demonstrated in categorizing DNA sequences (Thivierge & Shultz, 2002). Rules can also potentially be selected in parallel, using a special recruitment scheme called sibling-descendant Cascade-correlation (Baluja & Fahlman, 1994). Finally, the selection of multiple rules can lead to possible conceptual combinations. These possibilities, however, are left to future research. At the current stage of development, one remaining problem is that RBCC does not actually model the acquisition of crisp rules. In the current methodology, rules are coded according to a specific formula. The actual acquisition of these rules is another topic of investigation, not discussed here. Some simulations have already led to the conclusion that rule acquisition can be modeled by CC networks (Shultz & Bale, 2001). New research could be conducted to link these findings to analogical transfer. RBCC offers a strong alternative to symbolic models of analogical transfer. These models, despite providing flexible representations of structure, are often criticized for being semantically brittle (Hinton, 1986; Clark & Toribio, 1994).
This “brittleness” is due to the fact that if only some of the conditions of a given rule are satisfied, the rule will not be activated. This has severe consequences for the ability of rule systems to generalize to novel data. The formalism of a rule leaves it unable to reach beyond the necessary and sufficient features that define it (e.g., Pulvermüller, 1998). By comparison, connectionist networks exhibit graceful degradation, whereby the probability that a rule will fire is gradually reduced as fewer and fewer of its conditions are met. Rule-based models are also neurologically unrealistic (Eliasmith & Thagard, 2001). Neural network models of analogical transfer are interesting from a cognitive as well as an engineering point of view. From an engineering perspective, the ability to perform rule-based transfer in neural networks opens the possibility of integrating new findings about a classification as they become available through science. Models of analogical transfer in neural networks can capture crucial principles underlying human performance, useful both for understanding cognition and for designing efficient intelligent systems.

Acknowledgments

This research was supported by a scholarship to J.P.T. from the Fonds pour la Formation de Chercheurs et l’Aide à la Recherche (FCAR), as well as a grant to T.R.S. from NSERC. J.P.T. would like to thank François Rivest, Frédéric Dandurand, and Vanessa Taler for comments on the manuscript.

References

Baluja, S., & Fahlman, S.E. (1994). Reducing network depth in the Cascade-correlation architecture. Technical Report, Carnegie Mellon University.
Clark, A., & Toribio, J. (1994). Doing without representing? Synthese, 101, 401-431.
Eliasmith, C., & Thagard, P. (2001). Integrating structure and meaning: A distributed model of analogical mapping. Cognitive Science, 25, 245-286.
Erickson, M.A., & Kruschke, J.K. (1998). Rules and exemplars in category learning.
Journal of Experimental Psychology: General, 127, 107-140.
Fahlman, S.E., & Lebiere, C. (1989). The cascade-correlation learning architecture. Advances in Neural Information Processing Systems 2, 525-532.
Forbus, K., & Gentner, D. (1989). Structural evaluation of analogies: What counts? Proceedings of the Cognitive Science Society. Hillsdale, NJ: Lawrence Erlbaum Associates.
French, R.M. (2002). The computational modeling of analogy-making. Trends in Cognitive Sciences, 6, 200-205.
Goldstone, R.L. (1996). Isolated and interrelated concepts. Memory and Cognition, 24, 608-628.
Hall, R. (1989). Computational approaches to analogical reasoning: A comparative analysis. Artificial Intelligence, 39, 39-120.
Heit, E., & Bott, L. (1999). Selecting prior knowledge for category learning. In D.L. Medin (Ed.), Psychology of Learning and Motivation (Vol. 39). San Diego: Academic Press.
Hinton, G.E. (1986). Learning distributed representations of concepts. Eighth Conference of the Cognitive Science Society (pp. 1-12). Lawrence Erlbaum Associates.
Holyoak, K.J., & Thagard, P. (1989). Analogical mapping by constraint satisfaction. Cognitive Science, 13, 295-355.
Hummel, J.E., & Holyoak, K.J. (1997). Distributed representations of structure: A theory of analogical access and mapping. Psychological Review, 104, 427-466.
Jacobs, R.A., Jordan, M.I., Nowlan, S.J., & Hinton, G.E. (1991). Adaptive mixtures of local experts. Neural Computation, 3, 79-87.
Melis, E., & Veloso, M.M. (1997). Analogy in problem solving. In L.F. del Cerro, D. Gabbay, & H.J. Ohlbach (Eds.), Handbook of Practical Reasoning: Computational and Theoretical Aspects. Oxford University Press.
Mitchell, T.M., & Thrun, S.B. (1993). Explanation-based neural network learning for robot control. Advances in Neural Information Processing Systems 5 (pp. 287-294). San Mateo, CA: Morgan Kaufmann.
Nakamura, G. (1985). Knowledge-based classification of ill-defined categories. Memory and Cognition, 13, 377-384.
Nosofsky, R.M., Clark, S.E., & Shin, H.J. (1989). Rules and exemplars in categorization, identification, and recognition. Journal of Experimental Psychology: Learning, Memory, and Cognition, 15, 282-304.
Pazzani, M.J. (1991). Influence of prior knowledge on concept acquisition: Experimental and computational results. Journal of Experimental Psychology: Learning, Memory, and Cognition, 17, 416-432.
Pratt, L.Y. (1993). Discriminability-based transfer between neural networks. Advances in Neural Information Processing Systems 5 (pp. 204-211). Morgan Kaufmann.
Pratt, L., & Jennings, B. (1996). A survey of transfer between connectionist networks. Connection Science, 8, 163-184.
Pulvermüller, F. (1998). On the matter of rules: Past-tense formation and its significance for cognitive neuroscience. Network: Computation in Neural Systems, 9, R1-R52.
Rehder, B., & Murphy, G.L. (2001). A knowledge-resonance (KRES) model of category learning. Proceedings of the Twenty-third Annual Conference of the Cognitive Science Society (pp. 821-826). Mahwah, NJ: Lawrence Erlbaum Associates.
Shavlik, J.W. (1994). A framework for combining symbolic and neural learning. Machine Learning, 14, 321-331.
Silver, D., & Mercer, R. (1996). The parallel transfer of task knowledge using dynamic learning rates based on a measure of relatedness. Connection Science Special Issue: Transfer in Inductive Systems (pp. 277-294). Carfax Publishing Company.
Shultz, T.R., & Bale, A.C. (2001). Neural network simulation of infant familiarization to artificial sentences: Rule-like behavior without explicit rules and variables. Infancy, 2, 501-536.
Shultz, T.R., & Rivest, F. (2001). Knowledge-based cascade-correlation: Using knowledge to speed learning. Connection Science, 13, 43-72.
Thivierge, J.P., & Shultz, T.R. (2002). Finding relevant knowledge: KBCC applied to DNA splice-junction determination. Proceedings of the IEEE International Joint Conference on Neural Networks (pp. 1401-1405).
Towell, G.G., & Shavlik, J.W. (1991). The extraction of refined rules from knowledge-based neural networks. Machine Learning, 13, 71-101.