RBCC Networks: A Model for Analogical Learning
Jean-Philippe Thivierge (jpthiv@ego.psych.mcgill.ca)
Department of Psychology, McGill University, 1205 Dr. Penfield Avenue
Montreal, QC, H3A 1B1 Canada
Thomas R. Shultz (thomas.shultz@mcgill.ca)
Department of Psychology, McGill University, 1205 Dr. Penfield Avenue
Montreal, QC, H3A 1B1 Canada
Abstract
This paper introduces Rule-based Cascade-correlation
(RBCC), a constructive neural network able to perform
analogical learning by transferring formal domain rules. The
ability of the model to reproduce human behavioural data is
assessed on a task of visual pattern categorization. Results
show that networks as well as participants benefit from pre-exposure to a formal domain rule with respect to their
classification performance.
Analogical Learning
A common strategy in problem solving is to apply pre-acquired knowledge (source) to solve a novel problem
(target). If this process is performed within a situation that
necessitates learning (i.e., an adaptation in behavior),
finding a solution may require two distinct processes. The
first, known as analogical transfer (Melis & Veloso, 1997;
Forbus & Gentner, 1989; Hall, 1989), consists of retrieving
a relevant source and establishing a correspondence between
that source and the task. The second, known as induction, is
concerned specifically with searching for a solution to the
target task.
Combined, these two processes can be
described using the term “analogical learning”, which
involves both an analogical and an inductive process.
With respect to the type of source knowledge that can be
transferred, many learning situations involve the application
of pre-acquired rules. Learning to categorize odd and even
numbers, for instance, involves the use of a specific rule
defining even numbers as those divisible by two with no
remainder. In the current paper, we introduce a novel neural
network model of analogical learning that has the ability to
transfer formal domain rules. The performance of this
network is compared to that of participants instructed to use
a visual and verbal rule to solve a task of pattern
categorization.
Rule-based Cascade-correlation
In order to model analogical learning, the current paper will
employ a new type of rule-based neural network. Neural
networks have been criticized for focusing on low-level
cognitive tasks such as categorization, recognition, and
simple learning, while having little to do with higher-level
cognitive functions such as analogical reasoning (Eliasmith
& Thagard, 2001). In fact, few neural network algorithms
are able to capture a combination of rule-based and
inductive learning. One such model is KBANN (Shavlik,
1994), a method for converting a set of symbolic domain
rules into a feed-forward network with the final rule
conclusions at the output, and intermediate rule conclusions
at the hidden unit level. This technique involves designing
subnetworks representing rules. These subnetworks are
combined to form the main network used to learn the task.
The knowledge contained in the subnetworks does not have
to be complete or perfectly correct, because the main
network can adjust its connecting weights to adaptively
make up for missing knowledge.
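To make the idea concrete, the following minimal sketch (an illustration in Python, not Shavlik's implementation; the rules and feature names are invented) hand-wires a two-level rule set into a small feed-forward network, with the intermediate rule conclusion as a hidden unit and the final conclusion at the output. The weight magnitude of 4 and the half-step bias offsets follow the usual KBANN convention.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    W = 4.0  # weight magnitude given to rule-derived links

    # Illustrative rule set:  intermediate <- f1 AND f2
    #                         conclusion   <- intermediate OR f3
    def rule_network(f1, f2, f3):
        # Hidden unit encodes the conjunctive rule; the bias puts the
        # threshold just below the summed activation of both antecedents.
        intermediate = sigmoid(W * f1 + W * f2 - (2 - 0.5) * W)
        # Output unit encodes the disjunctive rule; the threshold sits
        # just below the activation contributed by a single antecedent.
        return sigmoid(W * intermediate + W * f3 - 0.5 * W)

    print(rule_network(1, 1, 0))  # high: the conjunctive rule fires
    print(rule_network(1, 0, 0))  # low: neither rule is satisfied
    print(rule_network(0, 0, 1))  # high: f3 alone satisfies the disjunction

Because the rule-derived weights are only a starting point, subsequent training can revise them when the injected rules are incomplete or partly wrong.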
Other models address some of the issues involved in
knowledge-based problem solving (e.g., KRES, Rehder &
Murphy, 2001; discriminability-based transfer, Pratt, 1993;
MTL, Silver & Mercer, 1996; mixture-of-experts, Jacobs,
Jordan, Nowlan, & Hinton, 1991; explanation-based neural
networks, Mitchell & Thrun, 1993; DRAMA, Eliasmith &
Thagard, 2001; ACME, Holyoak & Thagard, 1989; LISA,
Hummel & Holyoak, 1997; ATRIUM, Erickson &
Kruschke, 1998; for a review of psychological models see
French, 2002; a review of computational transfer is
available from Pratt & Jennings, 1996).
Finally, Pulvermüller (1998) proposed a modular network to capture a combination of rule-based and case-based performance on past tense acquisition. However, none of these models captures the full phenomenon of analogical learning. As will be argued, our new model, termed Rule-based Cascade-correlation (RBCC), can fulfill this role.
The goal of designing the RBCC algorithm was to
combine analogical and inductive learning in a seamless and
effective fashion. RBCC is part of a family of neural
networks that include Cascade correlation (CC; Fahlman &
Lebiere, 1989) and Knowledge-based Cascade correlation
(KBCC; Shultz & Rivest, 2001). These algorithms share
many common features with regard to the way they are
trained and structured. All these networks learn both by
adjusting their connection weights and by adding new layers
of hidden units. These networks are initially composed of a
number of input and output nodes linked together by
connections of varying strengths (see Figure 1a). The goal
of learning is to adjust the weights in order to reduce the
error obtained by comparing the output of the network to an
expected response. Only the weights that link directly to the
output nodes are adjusted for this purpose.
These
adjustments are performed in a part of learning called the
“output phase”. In addition to adjusting their weights,
networks of the CC family can also expand their topology
while learning (Figure 1b). CC, RBCC, and KBCC differ
from one another in terms of the type of neural components
made available to them in order to grow. CC networks are
at the most basic level, and are limited to adding simple
units consisting of a single input and a single output. These
nodes get placed between the input and the output layers of
the network. The goal of recruiting new units is to increase
the computational power according to the demands of a
given task. The weights feeding the hidden nodes are
trained to maximize the covariance between the output of
the hidden nodes and the error of the network. A number of
new units can be added, each in a new layer connected to all
layers below it and to the output layer.
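The recruitment criterion can be illustrated with a short sketch of the standard cascade-correlation objective (a paraphrase of Fahlman and Lebiere's criterion, not the simulation code used here): a candidate unit is scored by the magnitude of the covariance between its activation and the network's residual error, summed over output units.

    import numpy as np

    def candidate_score(candidate_act, residual_error):
        # candidate_act : (n_patterns,) activations of one candidate unit
        # residual_error: (n_patterns, n_outputs) target minus network output
        v = candidate_act - candidate_act.mean()
        e = residual_error - residual_error.mean(axis=0)
        # Covariance with the error at each output unit, magnitudes summed
        return np.abs(v @ e).sum()

    # Toy check: a candidate whose activation tracks the error scores higher
    rng = np.random.default_rng(0)
    error = rng.normal(size=(20, 1))
    tracking = 0.9 * error[:, 0] + 0.1 * rng.normal(size=20)
    unrelated = rng.normal(size=20)
    print(candidate_score(tracking, error) > candidate_score(unrelated, error))  # True

During candidate training, the weights feeding each candidate are adjusted to increase this score, and the best-scoring candidate is installed as a new hidden unit.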
RBCC networks constitute a special form of KBCC where
the network is able to recruit both single units and formal
rules encoded in a distributed system. Figure 1 shows the
architecture of an RBCC network before and after recruiting
a rule.
Figure 1: Architecture of RBCC networks. (a) Initial
architecture with 3 input units (I1 to I3) and a single output
unit (O1). These nodes are connected by 3 weights (W1 to
W3). (b) Architecture after recruitment of a rule (R1).
RBCC is able to account for both inductive and analogical
processes, respectively by lowering its error rate and by
recruiting pre-trained networks. RBCC is also able to
perform knowledge selection by incorporating in its
architecture those elements that best help solve the target
task. In addition, RBCC has the advantage of being able to
incorporate source knowledge that only represents a partial
solution to the target problem (e.g., Thivierge & Shultz,
2002).
Finally, because new source elements are
incorporated by stacking them on top of previously
incorporated elements, RBCC can potentially perform
knowledge combination.
Rules are encoded in a similar fashion to KBANN (Towell
& Shavlik, 1991).
KBANN proposes a method of
generating networks that represent if...then rules. However,
in their original paper, the authors of KBANN only propose
a framework for creating disjunctive (“OR”) and
conjunctive (“AND”) rules. In the current paper, we
propose a generalization that allows for any n-of-m rule,
where n is the number of features to select among m. This generalization has the advantage of simplifying the notation by replacing long chains of conjunctive and disjunctive rules with a simpler n-of-m rule. This strategy also enables us to produce a generalized rule for creating any n-of-m network. In a rule network, all weights are equal to W = 4, except the bias:

x = -W(n - 1/2)    (1)

where x is the value of the bias weight and W is the common connection weight (Towell & Shavlik, 1991). The rule network itself has two layers, one composed of the m input units and one composed of a single output unit. Because the bias places the threshold midway between the summed input of n - 1 and n active features, the output unit fires only if at least n of the m features are present. The rule networks created in this way are presented to RBCC for selection. Thus, the proposed model is a combination of rules generated by KBANN and learning performed by RBCC. It seems reasonable from a cognitive point of view to inject rules into networks; many learning experiments provide subjects with a rule and require them to use it. RBCC performs essentially the same task: it is provided with a rule without having to learn it, and must then use it to solve a given task. This type of algorithm argues against dual-system accounts, because rules and connectionist induction are both implemented in a homogeneous system.
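As a minimal sketch (assuming binary 0/1 input features and the bias setting in Equation (1); illustrative code rather than the implementation used in the simulations), an n-of-m rule subnetwork can be built as follows.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def n_of_m_rule(n, m, W=4.0):
        # Build an n-of-m rule subnetwork: m inputs, one output.
        # All input-to-output weights equal W; the bias places the threshold
        # midway between the summed input of n-1 and n active features, so
        # the output fires only when at least n of the m features are present.
        weights = np.full(m, W)
        bias = -W * (n - 0.5)
        def rule(features):  # features: length-m sequence of 0/1 values
            return sigmoid(weights @ np.asarray(features, dtype=float) + bias)
        return rule

    rule = n_of_m_rule(n=2, m=3)
    print(rule([1, 1, 0]))  # high (~0.88): two of three features present
    print(rule([1, 0, 0]))  # low  (~0.12): only one feature present

A subnetwork built this way can then be offered to RBCC as a recruitment candidate alongside ordinary single units.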
Simulations
An experiment was designed to determine if participants
could benefit from a pre-acquired rule when solving a
related problem. The target task consisted of categorizing
visual images as either belonging to category A or not.
Images not belonging to this category were composed
exclusively of random features.
Category A images
contained some random features, but also features occurring
with a high probability (p=0.9) within the category, and
with a low probability outside it (p=0.1). These features are
termed “diagnostic”, because they convey some critical
information about category membership. The domain rule
consisted of presenting participants with an image
containing all of the diagnostic features that could be found
in members of category A. This rule image is referred to as
the “prototype” image of category A. This image was never
actually seen in training, because only some of the
diagnostic features were contained in each instance of
category A. After training, participants were tested on their
ability to recognize images where either the random or
diagnostic features had been occluded.
Participants
A total of 40 participants took part in this study and were
rewarded with either course credits or a chance to win a $50
prize. All participants were undergraduates at McGill
University.
Stimuli
The stimuli were based on Goldstone (1996), and consisted
of dots linked by horizontal, vertical, and diagonal bars.
Images varied according to the configuration of their bars.
The same dots were present in all images and were simply
designed to guide the participants’ visual perception. All
images were presented in black on a white background,
using a 15” monitor. Some images had ten possible features
in total (Figure 2a), while others had twenty features (Figure
2b).
In order to generate the stimuli, two sets of ten images
each were initially created, all of which were 5” x 2.5” in
size, with six dots and up to ten bars. The first set of images
was designed by first randomly generating a single pattern
meant to represent the prototype of category A (see Figure
3a). From this prototype, a set of ten patterns termed
“diagnostic source” were created by either adding or
removing one bar. For instance, Figure 3b illustrates a
pattern where one diagonal bar was added to the initial
prototype found in Figure 3a. For the second set of images, termed
“random source”, the number of bars and their location was
determined randomly, with the constraint that each pattern
varied by at least two bars from any of the diagnostic source
patterns.
Figure 2: (a) All possible lines in images with a maximum
of ten features. (b) All possible lines in images with a
maximum of twenty features. Numbers were not shown on
the actual stimuli.
Figure 3: Patterns generated for the diagnostic source data
set. The image in (a) depicts the prototype; the image in (b)
depicts a diagnostic source pattern created by adding a
diagonal bar to the prototype.
The target training set was generated based on the
diagnostic and random source images. Stimuli in this set
were obtained by stacking the diagnostic source images on
top of the random source images, resulting in images that
were 5” x 5” in size, with ten dots, and a possibility of
twenty bars in total. An example of such an image is
presented in Figure 4.
Figure 4: Example of a target pattern.
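The stimulus-generation logic can be summarized as follows, treating each possible bar as one binary feature; this is a reconstruction for illustration (with an arbitrary random seed), not the program used to build the actual images.

    import numpy as np

    rng = np.random.default_rng(42)
    N_BARS = 10  # possible bars in each source image

    def differs_by(a, b):
        return int(np.sum(a != b))  # number of bars on which two patterns differ

    # Prototype of category A: one randomly generated bar configuration
    prototype = rng.integers(0, 2, N_BARS)

    # Diagnostic source: ten patterns, each differing from the prototype by one bar
    diagnostic = []
    for _ in range(10):
        pattern = prototype.copy()
        pattern[rng.integers(N_BARS)] ^= 1  # add or remove a single bar
        diagnostic.append(pattern)

    # Random source: random patterns that differ by at least two bars
    # from every diagnostic source pattern
    random_source = []
    while len(random_source) < 10:
        pattern = rng.integers(0, 2, N_BARS)
        if all(differs_by(pattern, d) >= 2 for d in diagnostic):
            random_source.append(pattern)

    # Target patterns: a diagnostic source image stacked on a random source
    # image, giving twenty possible bars in total
    targets = [np.concatenate([d, r]) for d, r in zip(diagnostic, random_source)]
    print(targets[0].shape)  # (20,)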
Procedure
The experimental sessions were fully automated by a
custom-designed program running on a Dell Latitude C800.
Participants were randomly assigned to one of two
experimental conditions, where they were either exposed to
some rule-based prior knowledge (rule condition) or not
(control condition).
For the rule condition, the experiment was divided into
three steps: (1) a practice task; (2) the presentation of the
category A prototype as a rule; and (3) the target task. For
the control condition, only steps 1 and 3 were involved. In
the practice and target tasks, participants had to classify
images as either belonging to category A or not. Category
A, in this case, had no correspondence to category A of the
actual target task. Feedback was provided after each image,
indicating its correct category. A test phase followed each
task, where feedback was not provided. Participants were
asked to respond as quickly as possible throughout, but
without sacrificing accuracy. They were also instructed to
use their intuition to classify the images.
The practice consisted of a relatively simple problem with
patterns of nine dots each and up to twenty bars (e.g., Figure
2b). Prior to every image, a 0.25” by 0.25” cross was
presented in the middle of the screen for a duration of 1000
ms to assist participants in focusing their attention. Images
were presented for a duration of 4000 ms, regardless of
participants’ response. Twenty images were presented in
training, and ten in testing. In the test phase, no feedback
was provided on the correct category of the images shown.
The goal of the practice task was to expose participants to a
procedure similar to the actual experiment, without
conveying any information about how the actual target task
could be solved.
Following the practice trial, participants in the rule
condition were presented with the prototype used to create
the diagnostic features of the target task. Participants were
provided with the following verbal instructions:
“Please take as much time as necessary to memorize the
position and orientation of the bars within the image. This
could be very useful for you to learn to categorize patterns
of category A in the task to follow.”
The target task consisted of three trials of twenty images
each. The general procedure for this task was the same as
for the practice task.
The target task consisted of
discriminating between patterns containing the diagnostic
features, and random patterns of same size (nine dots and
twenty features). In a subsequent test phase, participants
were exposed to twenty images where certain features were
occluded; sometimes the diagnostic features were occluded,
and at other times the random features were occluded.
Networks
RBCC networks were run in an experiment closely
matching the one described above. Forty networks were
assigned to one of two conditions according to whether or
not they received access to rule-based prior knowledge. In
the rule condition, networks were provided with a
subnetwork encoding the rule. No candidate was available
in the control condition. Networks were limited to eight
epochs in output phase in both the rule and control
condition. This was intended to mimic the fact that
participants were not given a chance to fully learn the
patterns before being tested.
The rule encoded in our simulation is meant to mimic as
closely as possible the representation of the rule that was
acquired by the participants. First, the model encodes the
same image as was presented to participants. Second, the
model associates this image with category A because it is
encoded in such a way as to fire the same response for
this image as it would for patterns of category A.
Results
Performance on the target task was analyzed using analysis of variance (ANOVA) and one-sample t-tests, both with a minimum level of significance of p<0.05.
Table 1: One-sample t-tests for learning accuracy.
Condition   Trial 1                  Trial 2                 Trial 3
Control     t(19) = -0.48, p>0.64    t(17) = 1.84, p>0.08    t(17) = 2.95, p<0.01
Rule        t(20) = 5.85, p<0.01     t(20) = 4.22, p<0.01    t(20) = 6.72, p<0.01
Figure 5: Average training error on the target task of participants.
Training accuracy The accuracy of participants across the
learning trials is presented in Figure 5. Differences in
training accuracy between the two groups of participants
were assessed using a 2-way mixed model ANOVA with
trial (1st, 2nd, and 3rd) as a within-subject factor, and prior
knowledge (rule versus none) as a between-subject factor.
The main effect of prior knowledge was significant (F(2, 1506) = 4.56, p<0.01), confirming that the accuracy of the rule group was reliably higher than that of the control group throughout. The main effect of trial (F(3, 1506) = 3.82, p<0.01) was also significant, demonstrating that accuracy improved throughout the learning trials for both groups. The interaction of trial and prior knowledge was not significant
(F(6, 1506) = 0.12, p>0.99), meaning that prior knowledge
did not speed up the gains in accuracy through the learning
trials.
One-sample t-tests were used to determine, for each trial
of each group, whether performance differed significantly
from chance (50% correct). Results are presented in Table
1.
According to these results, the control group’s
performance was not significantly different from chance
until the third trial. By comparison, the rule group differed
significantly from a random performance from the first
learning trial.
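The chance-level comparisons in Table 1 are ordinary one-sample t-tests against 50% correct; a minimal illustration with invented accuracy scores (not the experimental data) is:

    import numpy as np
    from scipy import stats

    # Hypothetical per-participant proportions correct for one learning trial
    accuracy = np.array([0.55, 0.60, 0.45, 0.70, 0.65, 0.50, 0.75, 0.60])

    # One-sample t-test against chance performance (0.5)
    t, p = stats.ttest_1samp(accuracy, popmean=0.5)
    print(f"t({len(accuracy) - 1}) = {t:.2f}, p = {p:.3f}")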
In order to assess the fit of RBCC networks to the performance of participants on the learning trials, we compared the accuracy (SSE) of the networks to the percentage of correct responses achieved by participants (see Figure 6). The results of RBCC replicated those of participants. Specifically, we found a significant group difference between the rule and control conditions (F(1, 39) = 93.24, p<0.01), with higher accuracy in the rule condition.
Figure 6: Average training SSE of neural networks.
Testing accuracy The testing accuracies of participants
and RBCC are depicted in Figure 7. An effect of prior
knowledge was found, suggesting that the rule condition
attained reliably better performances than the control
condition, both for participants (F(3, 204) = 5.5, p<0.01)
and networks (F(1, 38) = 1807.96, p<0.01).
T-tests demonstrated that both the rule (t(19) = 4.77,
p<0.01) and control (t(19) = 12.7, p<0.01) participants
tested above chance level, suggesting that both were able to
learn the task to some extent through the training trials.
Further analysis describing details of participants’ and
networks’ performance on the testing data will be included
in a more extensive publication.
Figure 7: Average testing accuracy of rule and control
groups.
Discussion
A vast body of evidence from the psychological literature
suggests an ability to acquire rules and employ them in the
resolution of novel problems (Nosofsky, Clark, & Shin,
1989; Erickson & Kruschke, 1998). In order to account for
this phenomenon, the current paper introduces RBCC, a
special case of the KBCC algorithm that can perform
analogical transfer using domain rules. The model was
compared to human performance in a visual pattern
categorization task.
Results demonstrated that both
participants and networks manifested improved performance
on the target task when they received prior exposure to the
rule.
Despite a number of studies addressing the transfer of
formal rules (Pazzani, 1991; Nakamura, 1985; Heit & Bott,
1999), a large portion of the available literature is actually
concerned with analogical reasoning. The current study, on
the other hand, deals with analogical learning. The addition
of a strong learning component to the use of rules in a novel
task extends to selecting an appropriate rule, and performing
a correct mapping between this rule and the task to be
solved.
The RBCC model proposed here offers several
advantages. First, RBCC has the ability to capture both
crisp and fuzzy concepts in a single unified network. Many
authors have argued that capturing rule-based as well as
fuzzy performance requires two separate systems (e.g.,
Erickson & Kruschke, 1998). From a modeling perspective,
it is highly desirable to have a model that can account for
both rule-based and fuzzy performance using a simpler,
homogeneous system. Such a model not only offers a more
parsimonious account of the phenomenon, but is also
biologically more plausible, because information in the
brain is encoded by neurons and not symbolic rules. The
capacity of the network to capture rule-like performance
generates an interesting hypothesis regarding the capacity of
real neural systems to capture crisp, as opposed to fuzzy,
representations, despite neurons being sluggish units. This
is a paradox that has puzzled researchers for years, and can
now be addressed by RBCC.
With the current network, we are able to account for the
combination of inductive and analogical learning. This is an
advantage that is of interest for at least two reasons: (1) we
can deal with imperfect rules in a given domain, and (2) we
can compensate for poor induction through the use of rules.
Poor induction may be due to short training times or
degraded stimuli. Imperfect rules are common in many
learning domains, including language, mathematics, and the
acquisition of specialized skills for various activities
involving strategic planning.
The RBCC model can also potentially account for
knowledge selection in situations where multiple rules are
available, although testing of this has yet to be carried out.
With a sequential selection of rules, the system can
potentially combine rules by stacking them on top of each
other. The use of multiple rules in KBCC networks has
been demonstrated in categorizing DNA sequences
(Thivierge & Shultz, 2002). Rules can also potentially be
selected in parallel, using a special recruitment scheme
called Sibling-descendent Cascade-correlation (Baluja &
Fahlman, 1994). Finally, the selection of multiple rules can
lead to possible conceptual combinations. However, these
possibilities are left to future research.
At the current stage of development, there still exists the
problem that RBCC does not actually model the acquisition
of crisp rules. In the current methodology, rules are coded
according to a specific formula. The actual acquisition of
these rules is another topic of investigation which is not
discussed here. Some simulations have already led to the
conclusion that rule acquisition can be modeled by CC
networks (Shultz & Bale, 2001). New research could be
conducted to link these findings to analogical transfer.
RBCC offers a strong alternative to symbolic models of
analogical transfer.
These models, despite providing
flexible representations of structure, are often criticized for
being semantically brittle (Hinton, 1986; Clark & Toribio,
1994). This “brittleness” is due to the fact that if only some
of the conditions of a given rule are satisfied, the rule will
not be activated. This has severe consequences with regard
to the ability of rule systems to generalize to novel data.
The formalism of a rule leaves it unable to reach beyond the
necessary and sufficient features that define it (e.g.,
Pulvermüller, 1998). By comparison, in connectionist
networks, a phenomenon of graceful degradation occurs,
whereby the probability that a rule will fire is slowly
reduced as fewer and fewer of the conditions are met. Rule-based models are also neurologically unrealistic (Eliasmith
& Thagard, 2001). Neural network models of analogical
transfer are interesting from a cognitive as well as an
engineering point of view.
From an engineering
perspective, the ability to perform rule-based transfer in
neural networks opens the possibility of integrating new findings about a classification as they become available
through science. Models of analogical transfer in neural
networks can capture crucial principles underlying human
performance that can be useful both in understanding
cognition and in designing efficient intelligent systems.
Acknowledgments
This research was supported by a scholarship to J.P.T.
from the Fonds pour la Formation de Chercheurs et l’Aide à
la Recherche (FCAR), as well as a grant to T.R.S. from
NSERC. J.P.T. would like to thank François Rivest,
Frédéric Dandurand, and Vanessa Taler for comments on
the manuscript.
References
Baluja, S., & Fahlman, S.E. (1994). Reducing network
depth in the Cascade-correlation architecture. Technical
Report, Carnegie Mellon University.
Clark, A., & Toribio, J. (1994). Doing without representing? Synthese, 101, 401–431.
Eliasmith, C., & Thagard, P. (2001). Integrating
structure and meaning: A distributed model of analogical
mapping. Cognitive Science, 25, 245-286.
Erickson, M.A., & Kruschke, J.K. (1998). Rules and
exemplars in category learning. Journal of Experimental
Psychology: General, 127, 107-140.
Fahlman, S.E., & Lebiere, C. (1989). The cascade-correlation
learning architecture. Advances in Neural Information
Processing Systems 2, 525-532.
Forbus, K., & Gentner, D. (1989). Structural evaluation of
analogies: What counts? Proceedings of the Cognitive
Science Society.
Hillsdale, NJ: Lawrence Erlbaum
Associates.
French, R.M. (2002). The computational modeling of analogy-making. Trends in Cognitive Sciences, 6, 200–205.
Goldstone, R.L. (1996). Isolated and interrelated concepts.
Memory and Cognition, 24, 608-628.
Hall, R. (1989). Computational approaches to analogical
reasoning: a comparative analysis. Artificial Intelligence,
39, 39–120.
Heit, E., & Bott, L. (1999). Selecting prior knowledge for
category learning. Medin (Ed.), Psychology of Learning
and Motivation (Vol. 39). San Diego: Academic Press.
Hinton, G. E. (1986). Learning distributed representations
of concepts. Eighth Conference of the Cognitive Science
Society. (pp. 1-12). Lawrence Erlbaum Associates.
Holyoak, K.J. & Thagard, P. (1989). Analogical mapping by
constraint satisfaction. Cognitive Science, 13, 295–355.
Hummel, J.E., & Holyoak, K.J. (1997). Distributed representations of structure: A theory of analogical access and mapping. Psychological Review, 104, 427–466.
Jacobs, R.A., Jordan, M.I., Nowlan, S.J., & Hinton, G.E.
(1991). Adaptive mixtures of local experts. Neural
Computation, 3, 79–87.
Melis, E., & Veloso, M.M. (1997). Analogy in problem
solving. In L.F. del Cerro, D. Gabbay, and H.J. Ohlbach
(Eds.), Handbook of Practical Reasoning: Computational and Theoretical Aspects. Oxford University Press.
Mitchell, T.M., & Thrun, S.B. (1993). Explanation-based
neural network learning for robot control. Advances in
Neural Information Processing Systems 5. (pp. 287-294).
San Mateo, CA: Morgan Kaufmann.
Nakamura, G. (1985). Knowledge-based classification of ill-defined categories. Memory and Cognition, 13, 377-384.
Nosofsky, R.M., Clark, S.E., & Shin, H.J. (1989). Rules
and exemplars in categorization, identification, and
recognition.
Journal of Experimental Psychology:
Learning, Memory, and Cognition, 15, 282-304.
Pazzani, M. J. (1991). Influence of prior knowledge on
concept acquisition: Experimental and computational
results. Journal of Experimental Psychology: Learning,
Memory, and Cognition, 17, 416-432.
Pratt, L.Y. (1993). Discriminability-based transfer between
neural networks. Advances in Neural Information
Processing Systems 5. (pp. 204-211). Morgan Kaufmann.
Pratt, L., & Jennings, B. (1996). A survey of transfer
between connectionist networks, Connection Science, 8,
163-184.
Pulvermüller, F. (1998). On the matter of rules: Past-tense formation and its significance for cognitive neuroscience. Network: Computation in Neural Systems, 9, R1–R52.
Rehder, B., & Murphy, G.L. (2001). A knowledge-resonance (KRES) model of category learning.
Proceedings of the Twenty-third Annual Conference of
the Cognitive Science Society. (pp.821-826). Mahwah,
NJ: Lawrence Erlbaum Associates.
Shavlik, J.W. (1994). A Framework for Combining
Symbolic and Neural Learning, Machine Learning, 14,
321-331.
Silver, D. & Mercer, R. (1996). The parallel transfer of task
knowledge using dynamic learning rates based on a
measure of relatedness. Connection Science Special Issue:
Transfer in Inductive Systems. (pp. 277-294). Carfax
Publishing Company.
Shultz, T. R., & Bale, A. C. (2001). Neural network
simulation of infant familiarization to artificial sentences:
Rule-like behavior without explicit rules and variables.
Infancy, 2, 501-536.
Shultz, T. R., & Rivest, F. (2001). Knowledge-based
cascade-correlation: Using knowledge to speed learning.
Connection Science, 13, 43-72.
Thivierge, J.P., & Shultz, T.R. (2002). Finding relevant
knowledge: KBCC applied to DNA splice-junction
determination. Proceedings of the IEEE International
Joint Conference on Neural Network. (pp. 1401-1405).
Towell, G.G., & Shavlik, J.W. (1991). The extraction of
refined rules from knowledge-based neural networks.
Machine Learning, 13, 71-101.