
A probabilistic model of eye movements
in concept formation
Jonathan D. Nelson
Cognitive Science Dept., UCSD
La Jolla, CA 92093
jnelson@cogsci.ucsd.edu
Garrison W. Cottrell
Computer Science and Engineering, UCSD
La Jolla, CA 92093
gary@cs.ucsd.edu
Abstract
Earlier probabilistic concept learning models (Tenenbaum, 1999;
Tenenbaum & Griffiths, 2001; Nelson, Tenenbaum, & Movellan,
2001) have the property that a single inconsistent observation
eliminates a concept by setting its posterior probability to zero.
This property poses difficulties for modeling Shepard’s (Shepard,
Hovland, & Jenkins, 1961) classic concept learning task. We
introduce a new probabilistic model that accounts for these classic
experimental concept formation results. We extend this model to
an eye movement version of the task introduced by Rehder and
Hoffman, (2003, in press), in which the learner decides what
stimulus dimensions to attend to at each point in time. Results
show that the same rational sampling function can predict eye
movements early in learning, when uncertainty is high, as well as
late in learning when the learner is certain of the true category.
1 A classic task and results with human participants
A classic concept learning task, introduced by Shepard, Hovland, and Jenkins
(1961), involves eight objects, formed by taking every combination of three binary
stimulus dimensions, such as size (large or small), shape (square or circle), and
color (white or black). Each concept is a subset of the eight objects. There are 256 (2^8) possible concepts, although behavioral experiments have mainly considered only the 70 (8 choose 4) concepts of size 4.¹
Shepard et al. (1961) defined a
taxonomy of six types of concepts; Table 1 gives illustrative examples. Type I
concepts are defined by a single stimulus dimension, for example “large”, or
“circle”. Type II concepts are defined by the combination of two stimulus
dimensions; an example is “large square or small circle.” Types III, IV, and V
concepts require attention to all three stimulus dimensions for correct classification,
but above-chance performance can be achieved by attending a single dimension.
One Type IV concept is “large, but also the small white circle and excluding the
large black square.” Type VI concepts require attention to all three stimulus dimensions; they are usually learned by memorizing the specific set of objects that is consistent with them.

¹ An equivalent alternative is to describe concepts as separating two types of objects. Here it is easier to discuss whether an object is consistent with a concept or not.
In an individual trial, an object is drawn at
random from the set of eight objects, the
learner judges it as consistent or inconsistent
with the true, hidden concept, and the learner
is given feedback. The objects are usually
drawn without replacement to form a block of
eight trials. A typical experimental result on
speed of learning, with Type I most rapid, is
I<II<III≈IV≈V<VI.
Table 1. Example concepts.

Type   Err.   Dim.
I       8.2   1
II     31.2   2
IV     36.9   2.5
VI     70.6   3

Note. The “Err.” column gives the average number of errors in Rehder & Hoffman’s (in press) study; the “Dim.” column, the average number of stimulus dimensions needed to classify. (The original table also pictured the objects consistent with an example concept of each type; those images are not reproduced here.)
Early models of Bayesian reasoning, including conservative models (Edwards, 1968), had the property that if an observed datum was incompatible with a particular hypothesis, that hypothesis was eliminated from consideration. Recent Bayesian concept learning models (Tenenbaum, 1999; Tenenbaum & Griffiths, 2001; Nelson, Tenenbaum & Movellan, 2001) also have this property. Under these models, one exposure to every possible object uniquely identifies the true concept. If applied to the concept learning task, these types of models would predict that after a single block of trials, in which each object is viewed once with feedback, the learner would know the concept with complete certainty. This result holds irrespective of the type of concept and the priors. In the classic concept learning task, however, empirical results consistently show that multiple exposures to individual objects are required before a category is learned.
2 The probability model
We use a probability model to describe the learner’s prior beliefs, and how those
beliefs change in response to new information. We use standard probabilistic
notation, in which random variables begin with capital letters, and specific values of
random variables begin with lowercase letters. For example, C represents the
learner’s beliefs about the a priori probability of each of the concepts, represented in
a prior probability distribution over all concepts ci. C=ci represents the event that
C takes the value ci, for instance that the true concept is “small”. The learner
typically wishes to know P(C=ci |D=d), the posterior probability of a particular
concept given a particular observed datum. For instance, the learner may wish to
know the probability that “small” is the true concept, given that in their first trial of
learning, they observe that a small black circle is consistent with the true concept.
The posterior probability of the concept “small” is not observed directly. Rather,
P(C=ci |D=d) is calculated according to Bayes’ theorem, with reference to both the
prior probability of the concept “small”, P(C=ci), and the a priori likelihood of
observing the small black circle as a positive example, if the true concept were
“small”, P(D=d|C=ci). By Bayes’ theorem, the posterior probability of a particular
concept is calculated as follows:
P(C=ci | D=d) = P(D=d | C=ci) * P(C=ci) / P(D=d).

The simulated model learner obtains the posterior probability of every concept in this manner. P(D=d), the probability of the observed data, is a normalizing constant that results in the sum of the posterior probabilities of all concepts equaling 1:

P(D=d) = Σci P(D=d | C=ci) * P(C=ci).
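To make the update concrete, here is a minimal sketch of one such Bayes update in Python; the array names (priors, likelihoods) and the toy numbers are ours, not from the original implementation.

```python
import numpy as np

def bayes_update(priors, likelihoods):
    """Posterior over concepts after one observation.

    priors:      P(C=ci) for each concept ci.
    likelihoods: P(D=d | C=ci) for each concept ci.
    """
    joint = likelihoods * priors        # P(D=d | C=ci) * P(C=ci)
    return joint / joint.sum()          # dividing by P(D=d) normalizes

# Toy usage: three concepts, equal priors; the first concept makes the
# observed datum twice as likely as the other two.
priors = np.array([1/3, 1/3, 1/3])
likelihoods = np.array([0.2, 0.1, 0.1])
print(bayes_update(priors, likelihoods))   # [0.5  0.25 0.25]
```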
2.1 Capturing the learner’s intuitions with prior probabilities
Most research on Shepard’s concept learning task has considered the learning of concepts that are consistent with exactly 4 of the 8 possible objects. We therefore assign nonzero prior probability to the 70 concepts that include exactly 4 of 8 objects. If a function that described the concepts’ relative a priori complexity, or learnability, were readily available, the prior probability of each concept could be assigned according to that function, with less complex concepts receiving higher prior probability. Shepard, Hovland and Jenkins (1961) noted a property of the concepts that is relevant to this discussion. This property is the average number of stimulus dimensions required to classify a random object, if the true concept were known with certainty. Type I concepts require a single dimension; Type VI concepts require three dimensions; and other types of concepts require intermediate numbers of dimensions. Preliminary modeling work shows that this criterion has promise to serve as the basis for assignment of prior probabilities. Present modeling results, reported in this paper, assign priors according to the average number of errors committed by Rehder and Hoffman’s study participants (“Err.” column in Table 1) on each type of concept.
We set the prior of each concept of a particular type to be inversely proportional to the average number of errors committed by Rehder and Hoffman’s study participants when learning that type of concept, εi, raised to a power, λ, a free parameter in the model:

P(C=ci) = κ εi^(−λ),

where κ is a normalizing constant. Rehder and Hoffman did not study Type III and V concepts. Because Type III, IV, and V concepts are not reliably different from each other in the speed at which they are learned (Shepard et al., 1961), Type III, IV, and V concepts are given the same priors in our model. Each concept of a particular type, for example the Type I concepts “small” and “circle”, receives the same prior probability. We note here that setting the priors according to how many stimulus dimensions must be observed to classify an object (a measure of concept complexity) leads to results similar to those found in this paper.
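A minimal sketch of this prior assignment in Python, assuming the 70 size-4 concepts have already been enumerated and typed; the enumeration itself, and the dictionary name errors_by_type, are ours.

```python
import numpy as np

# Average errors per concept type ("Err." column of Table 1); Types III
# and V are assigned Type IV's value, as described in the text.
errors_by_type = {'I': 8.2, 'II': 31.2, 'III': 36.9, 'IV': 36.9,
                  'V': 36.9, 'VI': 70.6}

def concept_priors(concept_types, lam=5.0):
    """P(C=ci) = kappa * (eps_i)**(-lam), identical within a type.

    concept_types: length-70 sequence giving each concept's Shepard type.
    """
    unnorm = np.array([errors_by_type[t] ** (-lam) for t in concept_types])
    return unnorm / unnorm.sum()   # kappa is absorbed by the normalization
```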
2.2 Belief updating: the likelihood function
The final component of the probability model is a likelihood function, to describe
the learner’s beliefs about how the example data, including their labels (in the form
of the feedback given after each trial), are generated. The likelihood functions of
classic Bayesian reasoning models (Edwards, 1968), and recent Bayesian concept
learning models (Tenenbaum, 1999; Tenenbaum & Griffiths, 2001; Nelson,
Tenenbaum & Movellan, 2001) would appear to be applicable in this task.
However, simulated model learners based on those likelihood functions learn the
task too quickly, and fail to differentiate the relative difficulty of the different types
of concepts, and thus are unable to serve as a basis for modeling human
performance on Shepard’s concept learning task. If the classic Bayesian reasoning
models’ likelihood functions are used, then once every possible object has been
encountered a single time, the true concept is known with complete certainty,
irrespective of the prior distribution. Empirical results strongly contradict this
claim, however: no concept is learned after a single block of trials, and there are
clear distinctions wherein some types of concept are learned more quickly than
others. The central problem in the classic likelihood functions’ failure to account
for this task is that a single observation is enough to eliminate incompatible
concepts.
To account for the empirical finding that a single example observation does not
eliminate incompatible concepts, we assume that the subjects do not encode the
examples perfectly in memory. To model this, we include two generators of data.
One data generator is completely noisy, and labels objects as consistent or
inconsistent with the true concept randomly with equal probability. The other data
generator always correctly labels example objects. We use a parameter μ, between
zero and one, for the probability that a particular observation was generated by the
informative data generator. When μ=1, the model behaves in a manner analogous to
earlier probabilistic concept learning models (Tenenbaum, 1999; Tenenbaum &
Griffiths, 2001; Nelson, Tenenbaum & Movellan, 2001), and learns the task in a
single block of trials, irrespective of the type of concept and prior distribution.
When μ=0, the likelihood function is uninformative and no learning takes place. If
the veridical data generator is selected in a particular trial, then there are 8 equally
probable observations (4 positive and 4 negative); if the noise generator is selected,
then there are 16 equally probable observations (8 positive and 8 negative). The
likelihood of a particular observation (or datum, d), given a particular concept (ci),
where g indexes the two data generators, either veridical or noise, is given by
P(D=d | C=ci) = Σg P(D=d | C=ci, G=g) * P(G=g),

by the law of total probability, with P(G=1)=μ for the veridical generator and P(G=0)=1−μ for the noise generator. This reduces to (1/8)*μ + (1/16)*(1−μ) if d is consistent with the concept ci, and (1/16)*(1−μ) otherwise. To approximate the learner’s beliefs, the likelihood function assumes that the observations are conditionally independent of each other, given the true concept. The psychological premise of this assumption is that the learner does not know that the objects are drawn without replacement in each block of trials, and believes that objects are drawn randomly throughout learning. Bayes’ theorem is used to update beliefs.
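In code, the likelihood of a single observation reduces to a two-branch function; a sketch (the argument names are ours):

```python
def likelihood(d_consistent_with_ci, mu):
    """P(D=d | C=ci) under the mixture of the two data generators.

    d_consistent_with_ci: True if datum d (object plus label) is
        consistent with concept ci.
    mu: probability that the veridical generator produced the datum.
    """
    if d_consistent_with_ci:
        # Either generator can produce a consistent observation.
        return (1/8) * mu + (1/16) * (1 - mu)
    # Only the noise generator can produce an inconsistent observation.
    return (1/16) * (1 - mu)
```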
2.3 Belief updating illustrated with an example
The likelihood function is illustrated with a simplified example, in which only two concepts, “square” and “large”, have nonzero prior probability. Let the prior probability of each concept be 0.50. (Except for the simplified prior probability distribution, this example is identical to the full concept learning model.) Let μ=0.20. Recall that the likelihood of a particular observation is 1/8 given a particular concept and the veridical generator, and 1/16 given a particular concept and the noise generator, as in the full-scale model.
The learner, as an example, sees the large white circle as a positive example of the
concept. The learner believes that with probability 0.20, the observation is
veridical; and with probability 0.80, the observation is uninformative. In the
equations below we use D=1 to mean that the observed object is the large white
circle, and that it is labeled as a positive example of the unknown true concept. (In
Shepard’s concept learning task there are 16 possible observations, corresponding to
each of the 8 objects as positive and negative examples of the true concept.) We use
G=1 to denote that the veridical generator of data is selected, and G=0 to denote that
the noise generator is selected. We use C=1 to denote that the true concept is
“square,” and C=2 to denote that the true concept is “large.” If the observation is
from the veridical data generator, then the posterior probabilities of [“square” ”large”] would be [0 1]:
P(C=1 | D=1, G=1) = P(D=1 | C=1, G=1) * P(C=1 | G=1) / P(D=1 | G=1),

but because P(C=1) and P(D=1) do not depend on G, we obtain

P(C=1 | D=1, G=1) = P(D=1 | C=1, G=1) * P(C=1) / P(D=1)
                  = (0 * (1/2)) / (1/16)
                  = 0.
Similarly,
P(C=2 | D=1, G=1) = P(D=1 | C=2, G=1) * P(C=2) / P(D=1)
                  = ((1/8) * (1/2)) / (1/16)
                  = 1.
If the observation is uninformative, then posterior probabilities of [ “square” ”large”]
would be [0.5 0.5], unchanged from the prior probabilities. Consider the posterior
probability of “large”:
P(C=2 | D=1, G=0) = P(D=1 | C=2, G=0) * P(C=2) / P(D=1)
                  = ((1/16) * (1/2)) / (1/16)
                  = 0.5.
The same line of reasoning shows that P(C=1 | D=1, G=0)=0.5.
Suppose two trials of learning had been completed, and the observations consisted of the large white circle as a positive example, as above, but also the small black square as a negative example of the concept. In this case, the model learner considers four possibilities:

1. both the large white circle and small black square were generated by the veridical generator, with probability μ*μ=0.04;
2. only the large white circle was generated by the veridical generator, with probability μ*(1−μ)=0.16;
3. only the small black square was generated by the veridical generator, with probability (1−μ)*μ=0.16; or
4. neither example was generated by the veridical generator, with probability (1−μ)*(1−μ)=0.64.
It follows that in the first three cases, where one or both of the examples were
generated by the veridical generator, the posterior probabilities of [ “square”
“large”] would be [0 1]; in the final case, where both examples were generated by
the noise generator, the posterior probabilities of [“square” ”large”] would be
[0.5 0.5]. As before, the learner does not know which of these possibilities are
correct, and therefore must average over all possibilities to form posterior
probabilities of [“square” ”large”]:
(0.04 + 0.16 + 0.16) [0 1] + 0.64 [0.5 0.5] = [0.32 0.68].
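This two-trial calculation can be checked mechanically; here is a sketch in Python that enumerates the four generator assignments above (the observation labels 'lwc+' and 'sbs-' are our own shorthand):

```python
from itertools import product

mu = 0.20
prior = {'square': 0.5, 'large': 0.5}
obs = ['lwc+', 'sbs-']   # large white circle positive, small black square negative

# Consistency of each observation with each concept.
consistent = {('lwc+', 'square'): False, ('lwc+', 'large'): True,
              ('sbs-', 'square'): False, ('sbs-', 'large'): True}

posterior = {c: 0.0 for c in prior}
for gens in product([1, 0], repeat=len(obs)):      # 1 = veridical, 0 = noise
    weight = 1.0
    unnorm = dict(prior)
    for o, g in zip(obs, gens):
        weight *= mu if g else (1 - mu)            # probability of this assignment
        for c in prior:
            unnorm[c] *= (1/8 if consistent[(o, c)] else 0.0) if g else 1/16
    z = sum(unnorm.values())
    for c in prior:                                # average the case posteriors
        posterior[c] += weight * unnorm[c] / z
print(posterior)   # {'square': 0.32, 'large': 0.68}
```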
2.4 Belief updating
In this paper we approximate optimal inference using Monte Carlo sampling. In particular, we sample the presented examples many (e.g., 1000) times. In each sample, specific observations are selected independently, each with probability μ, from the set of observations. Posterior probabilities are calculated relative to the veridical data generator, given the selected observations, and those probabilities are averaged. Random jitter in the figures in this paper results from using this stochastic approximation.
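A sketch of this Monte Carlo approximation, under our own representation of observations and the consistency predicate:

```python
import numpy as np

def mc_posterior(prior, observations, consistent, mu, n_samples=1000, rng=None):
    """Approximate posterior over concepts by resampling the data.

    prior: array, P(C=ci) for each concept.
    observations: the observations (object plus feedback) seen so far.
    consistent: consistent(i, d) -> True if observation d is consistent
        with concept i.
    """
    rng = rng or np.random.default_rng(0)
    total = np.zeros_like(prior)
    used = 0
    for _ in range(n_samples):
        # Keep each observation independently with probability mu ...
        kept = [d for d in observations if rng.random() < mu]
        # ... and score the kept data as if they were veridical.
        post = prior.copy()
        for d in kept:
            post = post * np.array([consistent(i, d)
                                    for i in range(len(prior))], dtype=float)
        if post.sum() > 0:   # skip samples whose kept data rule out every concept
            total += post / post.sum()
            used += 1
    return total / used
```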
2.5 Simulation of the probability model’s learning
Does our model account for results in the classic non-eye movement concept
learning task, in which all 3 stimulus dimensions are presented in every trial? We
set λ=5 and μ=0.25 in this paper. (The qualitative pattern of results does not
strongly depend on the exact parameter values used.) Figure 1 shows how the
model’s classification performance improves as each type of concept is learned.
Qualitatively, the most important test is whether the ordering observed in empirical investigations (e.g. Shepard et al., 1961; Rehder and Hoffman, in press) is preserved, with high performance on Type I concepts developing most quickly, followed, in order, by Types II, IV, and VI. Because the concept learning model
explicitly represents beliefs about the probability of each concept, at each point in
learning, it will prove possible to query the model learner in ways that go beyond
available empirical data on human learning, as well. Note that in the equations
below, as well as throughout the paper, we use x to refer to all the learning that has
taken place to date, including each exposure to individual objects, and feedback on
whether an object is consistent with the unknown true concept, in each trial. (This
notation is more concise than explicitly denoting all the objects that have been seen,
together with the feedback given about whether each object is consistent with the
unknown true concept, at each point throughout learning.)
2.5.1 The model learner’s probability of correct response
Although the goal of studying concept learning is presumably mainly to know how
people’s beliefs change through time, evidence for belief change has largely been in
the form of error rates during learning. To test whether the model learner provides
insight into human learning, it is important to test the probability of correct response
during training, rather than beliefs per se. Evidence suggests that subjects do not
use winner-take-all response functions, but rather probability matching response
functions, in which probability of picking response A is proportional to one’s
estimated probability that response A is correct (Shanks, Tunney, & McCarthy,
2002). A winner-take-all response function could be used here, but it would start
responding correctly too quickly in learning to serve as a credible model of human
performance. We use a probability matching response function in Figure 1.
Intuitively, the idea is that if, based on summing over all concepts that contain the object in the trial, the learner believes there is 90% probability that the object is consistent with the unknown true concept, the learner has a 90% probability of responding that the object is consistent with the true unknown concept. Note that in this case, the learner will be correct in 90% of the trials in which they responded yes (81% of total trials), and in an additional 10% of trials in which they responded no (1% of total trials), or 82% of total trials.
Formally, we define χci(obj) as an indicator function equal to 1 if this particular trial’s random object is consistent with the concept ci. The overall probability that the random object ‘obj’ in this trial is consistent with the unknown true concept, P(χC(obj)=1), is as follows:

P(χC(obj)=1) = Σci χci(obj) * P(C=ci | x).

The probability of correct response, given this quantity, which is plotted in Figure 1, is therefore

P(χC(obj)=1)^2 + (1 − P(χC(obj)=1))^2.
How does classification performance change during the course of learning? First, note that Type I concepts show the fastest improvement in classification performance, and that Type VI concepts show the slowest improvement. Interestingly, however, performance on Type IV concepts initially develops more rapidly than performance on Type II concepts. This phenomenon may result from the high similarity between Type IV and Type I concepts. Greater than chance performance on Type IV concepts can be obtained by responding in accordance with the appropriate Type I concept.
Figure 1. Probability of correct classification (vertical axis), through several training blocks (horizontal axis). Key: Type I, triangle; II, open circle; IV, filled circle; VI, asterisk.

2.5.2 The model learner’s beliefs about the true concept
Intuitively, although it is needed to assess the model’s fidelity to human behavioral performance on the task, a probability matching response function seems a convoluted way to measure the development of beliefs. An advantage of testing our model learner is that beliefs, as measured by the probability distribution P(C|x), can be queried per se at each moment in learning, in addition to the probability of correct response. Figure 2 therefore gives the posterior probability of the actual concept. In other words, Figure 2 gives the model concept learner’s posterior beliefs, given all learning to date x, about the probability of the true concept. For instance, for the Type I concept “square,” this figure plots the model learner’s belief P(C=1|x), at each point in learning. Before any learning has taken place, almost all of the model learner’s belief is concentrated in the six Type I concepts (filled triangles at left of Figure 2); the other types of concepts have extremely low prior probability. During the course of learning, the probability of the actual concept increases, irrespective of its type.
This figure corroborates the ordering Type I > II > IV > VI.
2.5.3 The model learner’s uncertainty throughout learning

Figure 2. The model learner’s posterior probability of the correct concept (vertical axis), through several training blocks (horizontal axis). Key: Type I, triangle; II, open circle; IV, filled circle; VI, asterisk.

Finally, while it is difficult to chart the model’s beliefs about the probability of each of the 70 concepts at each point in learning, it is desirable to understand the model learner’s subjective level of uncertainty over all the concepts, at each point. Shannon entropy (Shannon, 1948; Cover & Thomas, 1991) is ideally suited to quantifying this overall level of uncertainty, which is plotted in Figure 3:

H(C|x) = Σci P(C=ci | x) log2 ( 1 / P(C=ci | x) ).
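In code, this is a one-line computation over the posterior vector:

```python
import numpy as np

def entropy_bits(posterior):
    """H(C|x) = sum_i P(ci|x) * log2(1 / P(ci|x)), in bits."""
    p = np.asarray(posterior, dtype=float)
    p = p[p > 0]                    # terms with P = 0 contribute nothing
    return float(np.sum(p * np.log2(1.0 / p)))
```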
A surprising result is that for non-Type I concepts, the model learner’s subjective uncertainty (Shannon entropy) actually increases in early learning. Interestingly, this increase in uncertainty corresponds to the increase in probability of non-Type I concepts, which together begin with probability 0.6%, increasing after two blocks of learning to 39.7%, 16.7%, or 63.4% when the true, hidden concept is Type II, IV, or VI, respectively. It would be interesting, and potentially feasible, to test the model’s claim that uncertainty actually increases during early learning of non-Type I concepts. With human subjects, this prediction could potentially be tested either via subjective report or through physiological measures such as galvanic skin response to feedback on each trial.

Figure 3. Uncertainty about the true concept, in bits (vertical axis), through several blocks of training (horizontal axis). Key: Type I, triangle; II, open circle; IV, filled circle; VI, asterisk.
3 The active sampling task and behavioral results
Numerous models of category learning describe how selective attention to different stimulus dimensions evolves, yet little direct evidence is available. Rehder and Hoffman (in press) devised a novel experimental design in which eye movements provided direct evidence for how selective attention was deployed at each moment in time. This was accomplished by representing the value of each stimulus dimension with abstract characters (such as & and ¢ in Figure 4) at a particular vertex of a large triangle on a computer screen, and recording eye movements throughout learning.

Figure 4. Example stimulus from Rehder & Hoffman’s experiment. Reprinted from Rehder, B., & Hoffman, A. (in press), Eyetracking and selective attention in category learning. Cognitive Psychology. Permission to reprint is pending.

Among 72 subjects, a variety of patterns were observed. For the vast majority of participants, the following findings held:

(1) Early in learning, all 3 stimulus dimensions were viewed (fixated).

(2) Improvement in categorization performance occurred before reduction in the number of stimulus dimensions viewed to just those necessary for correct classification (“efficient” eye movements).
For instance, the modal pattern for Type I learners was for error to drop from 50%
(chance) to almost 0% after about 12 trials (1.5 blocks), but for the number of
dimensions fixated to remain high until about 16 trials (2.0 blocks). For Type II
learners, errors decreased more gradually, beginning approximately on trial 40 (5
blocks), and decreasing gradually through approximately trial 72 (9 blocks). (This
information on modal patterns for
Type I and II learners is from figures
10 and 11, respectively, in Rehder &
Hoffman, in press).
(3) Eventually, efficient eye movements (to only the necessary stimulus dimensions) were used.
For Type I concepts, efficient eye movements are to a single dimension, such as the size query for the concept “small”. Rehder & Hoffman’s Type I subjects fixated approximately 1 dimension in terminal performance. For Type II concepts, eye movements are to the relevant two dimensions, such as the size-shape query for the concept “large square or small circle”; this was approximately observed in the subjects. Type VI concepts always require all three stimulus dimensions; the vast majority of subjects always considered all three dimensions. Type IV concepts, in the general case, require all three dimensions. In some cases, however, depending on what is learned from the first dimension, Type IV concept objects can be classified with two dimensions. Type IV subjects tended to fixate all three dimensions, although the average number of dimensions fixated for Type IV subjects was slightly less than the number of dimensions fixated by Type VI subjects. The finding that Type IV subjects tended to fixate all three dimensions, even late in learning, could suggest that subjects found it more expeditious to plan a sequence of eye movements in advance (e.g. a visual routine), rather than to fixate a single dimension and then consider what dimension to fixate next.

Figure 5. Number of stimulus dimensions fixated in each type of problem, as a function of number of learning blocks. (Each block contains 8 trials.) Reprinted from Figure 3 in Rehder, B., & Hoffman, A. (in press), Eyetracking and selective attention in category learning. Cognitive Psychology. Permission to reprint is pending.
We will focus on these empirical findings when evaluating our sampling model.
3.1 Quantifying eye movements’ usefulness
We model early learning with the assumption that the learner fixates all stimulus
dimensions, using the concept learning model described above. Our eye movement
model works in conjunction with the concept learning model. Technically, the eye
movement model is a subjective utility function for evidence acquisition, in the
sense of Savage (1954). When we refer to the usefulness of each possible eye
movement, we mean the subjective expected utility that the learner expects to obtain as a result of making that eye movement. In the equations below we refer to queries, rather than eye movements per se, to reflect that a task could be designed so that information is obtained by other means, for instance by asking a question, clicking a mouse, etc. We define the efficient query as fixating only the dimension(s) necessary to categorize objects once the true concept is known, such as the size query for the concept “small”. At each point in learning, the learner picks between 8 possible queries: all dimensions, size-shape, size-color, color-shape, size, color, shape, or null. The task for modeling is to see whether a
principled utility function can offer insight into subjects’ eye movements to
different stimulus dimensions observed in Rehder & Hoffman’s (in press) data.
Ideally, this utility function would be rational in the sense of Anderson (1990;
Chater & Oaksford, 1999), analogous to the ideal observers sometimes used in
analyses of perceptual tasks (Knill & Richards, 1996; Movellan & McClelland,
2001). There are three central ideas to the usefulness function we propose:
1. The learner wishes to correctly classify the randomly selected object that is presented in a particular trial;
2. The learner wishes to learn the true concept;
3. The learner wishes to avoid making extraneous eye movements.
Each of these ideas is captured by a term in the usefulness (utility) function. We propose that a particular query’s usefulness is:

usefulness(Q|x) = pGain(Object, Q|x) + I(Q, C|x) − cost(Q).

(Recall that x is shorthand for all the queries that have taken place to date, including the feedback on whether or not the object in each trial is consistent with the unknown true concept. For example, usefulness(Qsize|x) refers to the usefulness of querying the size dimension, given everything that has been learned to date.)
Each term will be explained in turn.
3.1.1 Classifying a random object: pGain(Object, Q|x)
The learner’s immediate task is to correctly classify the random object presented in
a particular trial. Suppose that the true concept were known to be “large”. In this
case, queries of size, size-shape, size-color, and size-shape-color are each equally
useful, and the other queries are each useless. The eye movement model’s
usefulness function captures this idea by quantifying the amount that a query is
expected to improve the probability of correctly identifying the random object
presented in a particular trial. Note that irrespective of learning, the task is such
that the null query results in a 50% probability of correctly identifying the object.
For example, if the true concept were “large”, then the size query would result in a 100% probability of correctly classifying the object, an improvement of 50%; the shape query would result in a 50% probability of correctly classifying the object, no improvement over the null query. The random variable Obj represents the
random object that appears in a particular trial. The term pGain(Obj,Q|x) refers to
the expected improvement in probability of correctly classifying a random object,
averaging over all concepts, expected from making the query:
pGain(Obj, Q | x) = pCorr(Obj | Q, x) − pCorr(Obj | x),

where pCorr(Obj|Q,x), the expected probability of correctly classifying a random object given learning x, averaging over possible results of the query Q, is

pCorr(Obj | Q, x) = Σobj P(obj | x) Σq P(q | x) pCorr(obj, q | x), and

pCorr(obj, q | x) = max( Σc P(c | q, x) χc(obj), Σc P(c | q, x) χ¬c(obj) ).

χci(obj) is an indicator function equal to 1 if obj is consistent with the concept ci, 0 otherwise. χ¬ci(obj) equals 1 if obj is not consistent with the concept ci, 0 otherwise. P(obj)=1/8 for all objects, and does not depend on x. Note that because objects are drawn randomly with equal probability, and 4 of the 8 objects are consistent with each concept, pCorr(Obj|x) = pCorr(Obj) = 0.50, the probability of correctly classifying a random object if no query is made.
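A sketch of these quantities, under our own operationalization that a query’s outcome is simply the object’s values on the fixated dimensions, so the learner’s judged probability that the object is consistent averages over the objects matching that outcome:

```python
from itertools import product

OBJECTS = list(product([0, 1], repeat=3))    # 8 objects, 3 binary dimensions

def p_correct_given_query(posterior, concepts, dims):
    """Expected probability of correctly classifying the trial's random
    object after fixating the dimensions in `dims`.

    posterior: P(C=ci | x) for each concept.
    concepts:  each concept is the set of objects consistent with it.
    dims:      dimension indices revealed by the query (may be empty).
    """
    expected = 0.0
    for obj in OBJECTS:                       # P(obj) = 1/8 for every object
        q = tuple(obj[d] for d in dims)       # what the query would reveal
        matches = [o for o in OBJECTS if tuple(o[d] for d in dims) == q]
        # Judged probability that the object is consistent with the
        # unknown true concept, given only the revealed dimensions.
        p_cons = sum(p * sum(o in c for o in matches) / len(matches)
                     for p, c in zip(posterior, concepts))
        expected += (1 / 8) * max(p_cons, 1 - p_cons)
    return expected

def p_gain(posterior, concepts, dims):
    """pGain(Obj, Q | x): improvement over the 0.50 no-query baseline."""
    return p_correct_given_query(posterior, concepts, dims) - 0.5
```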
3.1.2 Learning the true concept: I(Q,C|x)
The second term in the usefulness function, I(Q,C|x), quantifies a query Q’s value for reducing uncertainty about the true concept C. Technically, the mutual information between Q and C, given learning to date x,

I(Q, C | x) = H(C | x) − Σq P(q | x) Σc P(c | q, x) log2 ( 1 / P(c | q, x) ),
is the expected reduction in Shannon entropy in C from making the query Q. This
term can be thought of as quantifying the anticipated future usefulness of what is
learned in the present trial. Sometimes called ‘information gain,’ the use of mutual
information to identify useful experiments was proposed by Lindley (1956). Lee
and Yu (1999) suggested using mutual information to predict eye movements;
Renninger, Coughlan, Verghese, and Malik (in press) use information gain in a
model of a shape learning task. Mutual information has also been used to model
several cognitive information acquisition tasks (reviewed by Nelson, in press).
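A generic sketch of this expected entropy reduction, given the distributions over query outcomes (enumerating the outcomes of a specific eye movement query is omitted here):

```python
import numpy as np

def entropy(p):
    """Shannon entropy of a discrete distribution, in bits."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(np.sum(p * np.log2(1.0 / p)))

def mutual_information(p_c, p_q, p_c_given_q):
    """I(Q, C | x) = H(C|x) - sum_q P(q|x) * H(C | q, x), in bits.

    p_c: current posterior P(c|x); p_q: P(q|x) for each outcome q;
    p_c_given_q[j]: the posterior P(c | q_j, x) after outcome q_j.
    """
    return entropy(p_c) - sum(pq * entropy(pc)
                              for pq, pc in zip(p_q, p_c_given_q))
```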
3.1.3 Avoiding extraneous eye movements: cost(Q)
Finally, we include a cost term, cost(Q), to represent the subjective cost of fixating
one or more stimulus dimensions; we set cost(Q) to 0.02 per dimension fixated.
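Putting the three terms together, query selection can be sketched as follows; the dimension coding and the callables p_gain_of and info_of, standing in for the computations above, are ours:

```python
# Dimension indices: 0 = size, 1 = shape, 2 = color (an arbitrary coding).
QUERIES = {'all': (0, 1, 2), 'size-shape': (0, 1), 'size-color': (0, 2),
           'color-shape': (1, 2), 'size': (0,), 'shape': (1,),
           'color': (2,), 'null': ()}

def best_query(p_gain_of, info_of, cost_per_dim=0.02):
    """Return the query maximizing usefulness = pGain + I - cost."""
    def usefulness(dims):
        return p_gain_of(dims) + info_of(dims) - cost_per_dim * len(dims)
    return max(QUERIES, key=lambda name: usefulness(QUERIES[name]))
```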
3.2 Questions about the usefulness function
A number of points of confusion about why we formulated the usefulness function in this way may arise. Are all the terms in it necessary? What would happen if some of them were eliminated? Could some of the terms be formulated differently? This section, formulated as a set of questions and answers, discusses those issues.
What if the pGain(Obj,Q|x) term were eliminated? Intuitively, it may seem that the
learner’s goal is to learn the true concept, and that this goal is served by the I(C,Q|x)
term. Suppose the pGain(Obj,Q|x) term were eliminated. Early in learning, the
model concept learner would behave similarly, because the main driver for the
query to all stimulus dimensions is its ability to provide information about the
unknown true concept. Late in learning, however, once the true concept is known
with high certainty, the model concept learner, rather than relying on the efficient
query relative to the true concept, would stop making eye movements altogether
(switching to the null query), because the cost of any query would more than
outweigh the usefulness of its miniscule expected reduction in uncertainty about the
true concept. The paradoxical result would be that just as the learner comes to
know the true concept with high probability, as attested by the probabilities of each
concept in the model, performance on the categorization task would decrease to
chance.
What if the I(C,Q|x) term were eliminated? In the first trial of learning it is
impossible to classify the object with greater than chance (50%) accuracy,
irrespective of the number of stimulus dimensions fixated. Thus, because there is a
small cost to every query (except the null query), every query except the null query
would have negative usefulness, and the learner would not fixate any of the stimulus
dimensions. In the second learning trial, because no learning took place on the
previous trial, the same situation would arise. The problem would iterate
indefinitely, and at no point in time would the learner fixate any stimulus
dimensions, or achieve greater than chance performance on the task.
What if the cost(Q) term were eliminated? Suppose there were truly zero cost to make a
query. In this case, even very late in the learning process, when uncertainty about
the true concept is negligible, e.g. when the true concept is known with nearly 100%
certainty, the learner would fixate all stimulus dimensions, rather than the subset of
stimulus dimensions that is required given a certain kind of concept, such as size
and shape for the concept “large square or small circle.” As an approximation of
human visual perception, it is unreasonable to consider eye movements free. Every
eye movement query has a biomechanical cost, and an opportunity cost in terms of
the other queries that must be foregone during its execution.
Why use a change in probability measure for the first term of the usefulness
function, pGain(C,Q|x), and an information theoretic measure for the second term,
I(C,Q|x)? Is that the only way to make the model work? Presumably, there is more
than one mathematically adequate way to capture the learner’s goals to both (1)
correctly classify the immediate random object, and (2) to learn the correct concept,
so as to be able to reliably and correctly classify the objects in future trials. In
particular, it is possible to reformulate the second term as pGain(C,Q|x), rather than
I(C,Q|x); it seems unlikely that this would impair performance of the concept
learning model. Baron (1985), Klayman & Ha (1987), Nickerson (1996), and
Nelson (in press) each discuss several usefulness functions that might be viably
employed for either of the first two terms of our usefulness function.
Is a term for intrinsic visual salience needed? A term for intrinsic visual salience
(Yamada & Cottrell, 1995; Itti, Koch, & Niebur, 1998) could be added to the
usefulness function, but is not necessary to model these data.
What likelihood does the model learner think they will use when evaluating each possible query’s usefulness? To simplify calculations of the reward function, the current model assumes that when evaluating possible queries’ (eye movements’) anticipated usefulness, μ=1. The implicit psychological claim behind this choice is that when considering what eye movements to make, the learner believes he or she will have complete confidence in the results. In future work we intend to explore the implications of using the same likelihood function as in the concept learning model, where μ<1, when calculating possible queries’ usefulness.
3.3 Results from the eye movement model
What eye movements does the model predict, at particular stages of learning? In
each trial, there are 8 possible queries: size-shape-color, size-shape, size-color,
color-shape, size, color, shape, and null. Before learning, the query to all stimulus
dimensions (usefulness=0.94) is more useful than any query of two
(usefulness=0.50) or one (usefulness=0.19) stimulus dimensions, consistent with
Rehder and Hoffman’s (in press) empirical finding that early in learning, learners
tended to fixate all stimulus dimensions. This is due to the queries’ different
I(C,Q), their expected information gain with respect to the learner’s beliefs about
the true concept. (Because no learning has taken place, this is true irrespective of
the true concept. Before learning, pGain(Q,C) is zero for every query.) Figure 6
illustrates how I(C,Q), the filled diamonds, and pGain(Obj, Q), the open diamonds,
change during learning of the concept “small”, comparing the query to all stimulus
dimensions (left panel), and the size query (right panel). The I(C,Q) term always
favors the query to all stimulus dimensions, relative to the query to the size
dimension alone. During learning, however, the I(C,Q) terms tend toward zero, and
eventually the size query, which has the lower cost, becomes more valuable than the
query to all dimensions. A similar pattern holds for Type II concepts. (For Type IV
and VI concepts the model always prefers the query to all stimulus dimensions.)
Figure 6. Components of the eye movement function for the concept “small.” I(C,Q), filled diamonds; pGain(Object, Q), open diamonds. Vertical axis: usefulness. Horizontal axis: blocks of learning. Left panel, the query to all stimulus dimensions; right, the size query.
Figure 7 shows that classification performance (solid line) increases well before the efficient query (open circles) becomes more valuable than the query to all dimensions (filled circles), as Rehder and Hoffman (in press) observed in their data. At left, the Type I concept “small.” Filled circles denote the query to all stimulus dimensions; open circles the efficient query, to only the size dimension. The right panel is for the Type II concept “large square or small circle”: the filled circles denote the query to all dimensions; the open circles denote the query to only the size and shape dimensions. As the I(C,Q) term (shown above, in Figure 6) decreases for both the query to all dimensions and the efficient query, the higher cost of the query to all dimensions eventually causes its usefulness to be lower than that of the efficient query. For Type IV and VI concepts, the model rates the query to all dimensions as having the highest usefulness throughout learning.
Figure 7. Classification accuracy and selected queries’ usefulness. See text. The vertical axis gives a query’s usefulness, or the probability of correct response; the horizontal axis shows the number of blocks of learning to date.
4 Importance, implications, and future work
Rehder & Hoffman (in press) noted that their eye movement data include some patterns characteristic of RULEX, a rule-based category learning model (Nosofsky, Palmieri, & McKinley, 1994), and ALCOVE, a similarity-based model (Kruschke, 1992). Rehder and Hoffman (in press) theorized that multiple rule- and similarity-based systems combine and make different demands on selective attention. Our results show that a single probabilistic model, with just two parameters (comparing favorably with the larger number of parameters in RULEX and ALCOVE), can provide a parsimonious account of both classic concept learning results, and the central findings in Rehder and Hoffman’s compelling eye movement data.
Some authors have viewed “conservatism” in belief updating as an error. Edwards
(1968) summarized research on conservatism as showing that “it takes anywhere
from two to five observations to do one observation’s worth of work in inducing a
subject to change his opinions,” (p. 18), and considered whether “discrepancies in
human performance [from] that specified by Bayes’ theorem” (p. 21) were caused
by human misperception of individual observations or by the misaggregation of
multiple observations. Oaksford and Chater (1998) modeled belief updating by
assuming “that participants revise their degree of belief in a hypothesis by only a
half of what Bayes’ theorem would recommend” (p. 386). (Oaksford & Chater,
1998, pp. 386, 395, used the same likelihood function, on a model of a different
task.) Our likelihood function—in which the learner believes that some information
presented is noisy—shows that imperfect memory is a possible cause of seemingly
conservative belief updating.
One critique of probabilistic models of learning has been that prior probabilities have sometimes been set largely according to intuition. On the concept learning task, empirical results can be used to infer a reasonable prior distribution, as we did in our model. It would be more theoretically satisfying to obtain priors from intrinsic
properties of the concepts. One such property is the average number of stimulus
dimensions required to classify an object once the true concept is known (“Dim.”
column in Table 1). This method obtains the same ordering of prior probabilities as
our model, and similar results. Minimum description length (Fass & Feldman,
2002) could also be explored as providing an explicit basis for the prior distribution.
In life, virtually every interesting learning or inference scenario involves some
active contribution of the learner in deciding where to look, what questions to ask or
experiments to conduct, or what data to attend to. The present model, to our
knowledge, is the first unified account where both early learning and asymptotic
performance on a concept formation task are produced by the same principled
sampling function. This model contributes to a growing body of work using
information- and decision-theoretic concepts to model perceptual tasks (Lee & Yu,
1999; Sprague & Ballard, 2003; Trommershäuser, Maloney & Landy, 2003). We
hope it may also spur work to understand the relationship between active information acquisition in perceptual tasks, such as eye movements, and other cognitive learning and inference tasks.
Author note
Bob Rehder and Aaron Hoffman kindly corresponded about their experiment and
findings, and provided their data to us. Bob Rehder and Javier Movellan provided
feedback on a draft of this manuscript. JDN was funded by NIMH grant
5T32MH020002-05 to T. Sejnowski and by NSF grant DGE 0333451 to GWC.
GWC is also supported by NIH grant MH57075.
References
Anderson, J. R. (1990). The Adaptive Character of Thought. Hillsdale NJ: Erlbaum.
Baron, J. (1985). Rationality and Intelligence. Cambridge: Cambridge University Press
Brunswik, E. (1952). The conceptual framework of psychology. Chicago: University of
Chicago Press.
Chater N, Oaksford M (1999). Ten years of the rational analysis of cognition. Trends in
Cognitive Sciences, 3(2), 57-65.
Cover, T. M., & Thomas, J. A. (1991). Elements of Information Theory. New York: Wiley
Edwards, W. (1968). Conservatism in human information processing. In B. Kleinmuntz
(Ed.), Formal Representation of Human Judgment, (pp. 17-52). New York: John Wiley
Fass, D., & Feldman, J. (2002). Categorization Under Complexity: A Unified MDL Account
of Human Learning of Regular and Irregular Categories. NIPS 15
Itti, L., Koch, C., & Niebur, E. (1998). A Model of Saliency-Based Visual Attention for
Rapid Scene Analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(11), 1254-1259.
Klayman, J. & Ha, Y.-W. (1987). Confirmation, disconfirmation, and information.
Psychological Review, 94, 211-228.
Knill, D. C., & Richards, W. (Eds.), (1996). Perception as Bayesian inference. Cambridge,
U.K.: Cambridge University Press.
Kruschke, J. K. (1992). ALCOVE: An exemplar-based connectionist model of category
learning. Psychological Review. 99, 22-44.
Lee, T. S., & Yu, S. X. (1999). An information-theoretic framework for understanding
saccadic eye movements. Advances in Neural Information Processing Systems, 12, 834-840.
Lindley, D. V. (1956). On a measure of the information provided by an experiment. Annals
of Mathematical Statistics, 27, 986-1005.
Movellan, J. R., & McClelland, J. L. (2001). The Morton-Massaro law of information
integration: Implications for models of perception. Psychological Review, 108(1), 113–148,
2001.
Nelson, J. D. (in press). Finding useful questions: on Bayesian diagnosticity, probability,
impact, and information gain. Psychological Review.
Nelson, J. D., Tenenbaum, J. B., & Movellan, J. R. (2001). Active inference in concept
learning. In J. D. Moore & K. Stenning (Eds.), Proceedings of the 23rd Conference of the
Cognitive Science Society, 692-697.
Nickerson, R. S. (1996). Hempel’s paradox and Wason’s selection task: Logical and
psychological puzzles of confirmation. Thinking and Reasoning, 2, 1-32.
Nosofsky, R. M., Palmieri, T. J., & McKinley, S. C. (1994). Rule-plus-exception model of
classification learning. Psychological Review, 101, 53-79.
Oaksford, M., & Chater, N. (1998). A revised rational analysis of the selection task:
exceptions and sequential sampling. In M. Oaksford, & N. Chater (Eds.), Rational Models of
Cognition (pp. 372-393). Oxford: Oxford University Press.
Rehder, B., & Hoffman, A. B. (2003). Eyetracking and selective attention in category
learning. In R. Alterman & D. Kirsh (Eds.), Proceedings of the 25th Annual Conference of
the Cognitive Science Society. Boston, MA: Cognitive Science Society.
Rehder, B., & Hoffman, A. B. (in press). Eyetracking and selective attention in category
learning. Cognitive Psychology.
Renninger, L. W., Coughlan, J., Verghese, P. & Malik, J. (in press). An information
maximization model of eye movements. Advances in Neural Information Processing
Systems.
Savage, L. J. (1954). The Foundations of Statistics. New York: Wiley.
Shanks, D. R., Tunney, R. J., & McCarthy, J. D. (2002). A re-examination of probability
matching and rational choice. Journal of Behavioral Decision Making, 15, 233-250.
Shannon, C. E. (1948). A mathematical theory of communication. The Bell System Technical
Journal, 27, 379–423, 623–656
Shepard, R. N., Hovland, C. I., & Jenkins, H. M. (1961). Learning and memorization of
classifications. Psychological Monographs: General and Applied, 75(13), 1-42.
Sprague, N., & Ballard, D. (2003). Eye Movements for Reward Maximization. NIPS 16
Steyvers, M., Tenenbaum, J. B., Wagenmakers, E.-J. & Blum, B. (2003). Inferring causal
networks from observations and interventions. Cognitive Science, 27, 453-489.
Tenenbaum, J. B. (1999). A Bayesian Framework for Concept Learning. Ph.D. Thesis, MIT
Tenenbaum, J. B. (2000). Rules and similarity in concept learning. NIPS 12, 59–65.
Tenenbaum, J. B., & Griffiths, T. L. (2001). Generalization, similarity, and Bayesian
inference. Behavioral and Brain Sciences, 24(4), 629-640.
Trommershäuser, J., Maloney, L. T., & Landy, M. S. (2003). Statistical decision theory and
the selection of rapid, goal-directed movements. Journal of the Optical Society of America
A, 20(7), 1419-1433.
Yamada, K., & Cottrell, G. W. (1995). A model of scan paths applied to face recognition. In
Proceedings of the Seventeenth Annual Cognitive Science Conference, Pittsburgh, PA, pp.
55-60, Mahwah: Lawrence Erlbaum.