A probabilistic model of eye movements in concept formation Jonathan D. Nelson Cognitive Science Dept., UCSD La Jolla, CA 92093 jnelson@cogsci.ucsd.edu Garrison W. Cottrell Computer Science and Engineering, UCSD La Jolla, CA 92093 gary@cs.ucsd.edu Abstract Earlier probabilistic concept learning models (Tenenbaum, 1999; Tenenbaum & Griffiths, 2001; Nelson, Tenenbaum, & Movellan, 2001) have the property that a single inconsistent observation eliminates a concept by setting its posterior probability to zero. This property poses difficulties for modeling Shepard’s (Shepard, Hovland, & Jenkins, 1961) classic concept learning task. We introduce a new probabilistic model that accounts for these classic experimental concept formation results. We extend this model to an eye movement version of the task introduced by Rehder and Hoffman, (2003, in press), in which the learner decides what stimulus dimensions to attend to at each point in time. Results show that the same rational sampling function can predict eye movements early in learning, when uncertainty is high, as well as late in learning when the learner is certain of the true category. 1 A cl assi c task an d resu l ts w i th h u man p arti ci p an ts A classic concept learning task, introduced by Shepard, Hovland, and Jenkins (1961), involves eight objects, formed by taking every combination of three binary stimulus dimensions, such as size (large or small), shape (square or circle), and color (white or black). Concepts are a subset of the eight objects. The re are 256 (2 8) possible concepts, although behavioral experiments have mainly considered only the 70 (8 choose 4) concepts of size 4. 1 Shepard et al. (1961) defined a taxonomy of six types of concepts; Table 1 gives illustrative examples. Type I concepts are defined by a single stimulus dimension, for example “large”, or “circle”. Type II concepts are defined by the combination of two stimulus dimensions; an example is “large square or small circle.” Types III, IV, and V concepts require attention to all three stimulus dimensions for correct classification, but above-chance performance can be achieved by attending a single dimension. One Type IV concept is “large, but also the small white circle and excluding the large black square.” Type VI concepts require attention to all three stimulus 1 An equivalent alternative is to describe concepts as separating two types of objects. Here it is easier to discuss whether an object is consistent with a concept or no t. dimensions; they are usually learned by memorizing the specific set of objects that is consistent with them. In an individual trial, an object is drawn at random from the set of eight objects, the learner judges it as consistent or inconsistent with the true, hidden concept, and the learner is given feedback. The objects are usually drawn without replacement to form a block of eight trials. A typical experimental result on speed of learning, with Type I most rapid, is I<II<III≈IV≈V<VI. Table 1. Example concepts. Type Objects Err. Dim. I 8.2 1 II 31.2 2 IV 36.9 2.5 VI 70.6 3 Early models of Bayesian reasoning, including conservative models (Edwards, 1968), had the property that if an observed datum was Note. The third column gives the incompatible with a particular hypothesis, that average number of errors in Render & Hoffman’s (in press) study; the fourth hypothesis was eliminated from consideration. column, the average number of Recent Bayesian concept learning models stimulus dimensions needed to (Tenenbaum, 1999; Tenenbaum & Griffiths, classify. 2001; Nelson, Tenenbaum & Movellan, 2001) also have this property. Under these models, one exposure to every possible object uniquely identifies the true concept. If applied to the concept learning task, these types of models would predict that after a single block of trials, in which each object is viewed once with feedback, the learner would know the concept with complete certainty. This result hold s irrespective of the type of concept and the priors. In the classic concept learning task, however, empirical results consistently show that multiple exposures to individual objects are required before a category is learned. 2 T h e p rob ab i li ty mo d el We use a probability model to describe the learner’s prior beliefs, and how those beliefs change in response to new information. We use standard probabilistic notation, in which random variables begin with capital letters, and specific values of random variables begin with lowercase letters. For example, C represents the learner’s beliefs about the a priori probability of each of the concepts, represented in a prior probability distribution over all concepts ci. C=ci represents the event that C takes the value ci, for instance that the true concept is “small”. The learner typically wishes to know P(C=ci |D=d), the posterior probability of a particular concept given a particular observed datum. For instance, the learner may wish to know the probability that “small” is the true concept, given that in their first trial of learning, they observe that a small black circle is consistent with the true concept. The posterior probability of the concept “small” is not observed directly. Rather, P(C=ci |D=d) is calculated according to Bayes’ theorem, with reference to both the prior probability of the concept “small”, P(C=ci), and the a priori likelihood of observing the small black circle as a positive example, if the true concept were “small”, P(D=d|C=ci). By Bayes’ theorem, the posterior probability of a particular concept is calculated as follows: P(C ci | D d ) P( D d | C ci ) * P(C ci ) . P( D d ) The simulated model learner obtains the posterior probability of every concept in this manner. P( D d ) P( D d | C ci ) * P(C ci ) . ci P(D=d), the probability of the observed data, is a normalizing constant that results in the sum of the posterior probabilities of all concepts equaling 1. 2.1 Capturing the learner’s intuitions with prior probabilities Most research on Shepard’s concept learning task has considered the le arning of concepts that are consistent with exactly 4 of the 8 possible objects. We therefore assign nonzero prior probability to the 70 concepts that include exactly 4 of 8 objects. If a function that described the concepts’ relative a priori complexity , or learnability, were readily available, prior probability of each concept could be assigned according to that function, with less complex concepts receiving higher prior probability. Shepard, Hovland and Jenkins (1951) noted a property of the concepts that is relevant to this discussion. This property is the average number of stimulus dimensions required to classify a random object, if the true concept were known with certainty. Type I concepts require a single dimension; Type VI concepts require three dimensions, and other types of concepts require intermediate numbers or dimensions. Preliminary modeling work shows that this criterion has promise to serve as the basis for assignment of prior probabilities. Present modeling results, reported in this paper, assign priors according to the average number of errors committed by Rehder and Hoffman’s study participants (“Err.” column in Table 1) on each type of concept. We set the prior of each concept of a particular type to be inversely proportional to the average number of errors committed by Rehder and Hoffman’s study participants when learning that type of concept, εi, raised to a power, λ, a free parameter in the model: P(C=ci )= κ (εi)-λ, where κ is a normalizing constant. Rehder and Hoffman did not study Type III and V concepts. Because Type III, IV, and V concepts are not reliably different from each other in the speed at which they are learned (Shepard et al., 1961), Type III, IV, and V concepts are given the same priors in our model. Each concept of a particular type, for example the Type I concepts “small” and “circle”, receive the same prior probability. We note here that setting the priors according to how many stimulus dimensions are required to be observed to classify an object (a measure of concept complexity), leads to results similar to those found in this paper. 2.2 Belief updating: the likelihood functi on The final component of the probability model is a likelihood function, to describe the learner’s beliefs about how the example data, including their labels (in the form of the feedback given after each trial), are generated. The likelihood functions of classic Bayesian reasoning models (Edwards, 1968), and recent Bayesian concept learning models (Tenenbaum, 1999; Tenenbaum & Griffiths, 2001; Nelson, Tenenbaum & Movellan, 2001) would appear to be applicable in this task. However, simulated model learners based on those likelihood functions learn the task too quickly, and fail to differentiate the relative difficulty of the different types of concepts, and thus are unable to serve as a basis for modeling human performance on Shepard’s concept learning task. If the classic Bayesian reasoning models’ likelihood functions are used, then once every possible object has been encountered a single time, the true concept is known with complete certainty, irrespective of the prior distribution. Empirical results strongly contradict this claim, however: no concept is learned after a single block of trials, and there are clear distinctions wherein some types of concept are learned more quickly than others. The central problem in the classic likelihood functions’ failure to account for this task is that a single observation is enough to eliminate incompatible concepts. To account for the empirical finding that a single example observation does not eliminate incompatible concepts, we assume that the subjects do not encode the examples perfectly in memory. To model this, we include two generators of data. One data generator is completely noisy, and labels objects as consistent or inconsistent with the true concept randomly with equal probability. The other data generator always correctly labels example objects. We use a parameter μ, between zero and one, for the probability that a particular observation was generated by the informative data generator. When μ=1, the model behaves in a manner analogous to earlier probabilistic concept learning models (Tenenbaum, 1999; Tenenbaum & Griffiths, 2001; Nelson, Tenenbaum & Movellan, 2001), and learns the task in a single block of trials, irrespective of the type of concept and prior distribution. When μ=0, the likelihood function is uninformative and no learning takes place. If the veridical data generator is selected in a particular trial, then there are 8 equally probable observations (4 positive and 4 negative); if the noise generator is selected, then there are 16 equally probable observations (8 positive and 8 negative). The likelihood of a particular observation (or datum, d), given a particular concept (ci), where g indexes the two data generators, either veridical or noise, is given by 1 PD d | C ci PD d | C ci , G g PG g , g 0 by the law of total probability, with P(G=1)=μ, for the veridical generator and P(G=0)=1-μ, for the noise generator. This reduces to (1/8)*μ + (1/16)*(1-μ), if d is consistent with the concept ci, and (1/16)*(1-μ) otherwise. To approximate the learner’s beliefs the likelihood assumes that the observations are conditionally independent of each other, given the true concept. The psychological premise of this assumption is that the learner does not know that the objects are drawn without replacement in each block of trials, and believes that objects are drawn randomly, throughout learning. Bayes’ theorem is used to update beliefs. 2.3 Belief updating illustrated with an example The likelihood function is illustrated with a simplified example, where only two concepts, “square” and “large”, have nonzero prior probability. Let the prior probability of each concept be 0.50. (Except for the simplified prior pro bability distribution, this example is identical to the full concept learning model.) Let μ=0.20. Recall that the likelihood of a particular observation is 1/8 given a particular concept and the veridical generator, and 1/16 given a particular concept and the noise generator, as in full-scale model. The learner, as an example, sees the large white circle as a positive example of the concept. The learner believes that with probability 0.20, the observation is veridical; and with probability 0.80, the observation is uninformative. In the equations below we use D=1 to mean that the observed object is the large white circle, and that it is labeled as a positive example of the unknown true concept. (In Shepard’s concept learning task there are 16 possible observations, corresponding to each of the 8 objects as positive and negative examples of the true concept.) We use G=1 to denote that the veridical generator of data is selected, and G=0 to denote that the noise generator is selected. We use C=1 to denote that the true concept is “square,” and C=2 to denote that the true concept is “large.” If the observation is from the veridical data generator, [“square” ”large”] would be [0 1]: then the posterior probabilities of P(C 1 | D 1, G 1) P(D 1 | C 1, G 1) * P(C 1 | G 1) , P(D 1 | G 1) but because P(C=1) and P(D=1) do not depend on G, we obtain P(D 1 | C 1, G 1) * P(C 1) , P(D 1) = ( 0*(1/2) ) / (1/16) = 0. Similarly, P(C 2 | D 1, G 1) P(D 1 | C 2, G 1) * P(C 2) P(D 1) = ( (1/8)*(1/2) ) / (1/16) =1. If the observation is uninformative, then posterior probabilities of [ “square” ”large”] would be [0.5 0.5], unchanged from the prior probabilities. Consider the posterior probability of “large”: P(C 2 | D 1, G 0) P(D 1 | C 2, G 0) * P(C 2) P(D 1) = ( (1/16)*(1/2) ) / (1/16) =0.5. The same line of reasoning shows that P(C=1 | D=1, G=0)=0.5. Suppose two trials of learning had been completed, and the observations consisted of the large white circle as a positive example, as above, but also the small black square as a negative example of the concept. In this case, the mo del learner considers four possitibilities: 1. both the large white circle and small black square were generated by the veridical generator, with probability μ*μ=0.04; 2. only the large white circle is generated by the veridical generator, with probability μ*(1-μ)=0.16; 3. only the small black square is generated by the veridical generator, with probability (1-μ)*μ=0.16; or 4. that neither examples were generated by the veridical generator, with probability (1-μ)*(1-μ)=0.64. It follows that in the first three cases, where one or both of the examples were generated by the veridical generator, the posterior probabilities of [ “square” “large”] would be [0 1]; in the final case, where both examples were generated by the noise generator, the posterior probabilities of [“square” ”large”] would be [0.5 0.5]. As before, the learner does not know which of these possibilities are correct, and therefore must average over all possibilities to form posterior probabilities of [“square” ”large”]: (0.04 + 0.16 + 0.16) [0 1] + 0.64 [0.5 0.5] = [0.32 0.68]. 2.4 Belief updating In this paper we approximate optimal inference using Monte -Carlo sampling. In particular we sample the presented examples many (e.g., 1000) times. In each sample, specific observations are selected independently, each with probability μ, from the set of observations. Posterior probabilities are calculated relative to the veridical data generator, given the selected observations, and those probabilities are averaged. Random jitter in the figures in this paper results from using this stochastic approximation. 2.5 S i m u l a t i o n o f t h e p r o b a b i l i t y mo d e l ’ s l e a r n i n g Does our model account for results in the classic non-eye movement concept learning task, in which all 3 stimulus dimensions are presented in every trial? We set λ=5 and μ=0.25 in this paper. (The qualitative pattern of results does not strongly depend on the exact parameter values used.) Figure 1 shows how the model’s classification performance improves as each type of concept is learned. Qualitatively, the most important finding would be that the ordering observed in empirical investigations (e.g. Shepard et al., 1961; Rehder and Hoffman, in press) be preserved, with high performance on Type I concepts developing most quickly, followed, in order, by Types II, IV, and VI. Because the concept learning model explicitly represents beliefs about the probability of each concept, at each point in learning, it will prove possible to query the model learner in ways that go beyond available empirical data on human learning, as well. Note that in the equations below, as well as throughout the paper, we use x to refer to all the learning that has taken place to date, including each exposure to individual objects, and feedback on whether an object is consistent with the unknown true concept, in each trial. (This notation is more concise than explicitly denoting all the objects that have been seen, together with the feedback given about whether each object is consistent with the unknown true concept, at each point throughout learning.) 2.5.1 The model learner’s probability of correct response Although the goal of studying concept learning is presumably mainly to know how people’s beliefs change through time, evidence for belief change has largely been in the form of error rates during learning. To test whether the model learner provides insight into human learning, it is important to test the probability of correct response during training, rather than beliefs per se. Evidence suggests that subjects do not use winner-take-all response functions, but rather probability matching response functions, in which probability of picking response A is proportional to one’s estimated probability that response A is correct (Shanks, Tunney, & McCarthy, 2002). A winner-take-all response function could be used here, but it would start responding correctly too quickly in learning to serve as a credible model of human performance. We use a probability matching response function in Figure 1. Intuitively, the idea is that if, based on summing over all concepts that co ntain the object in the trial, the learner believes there is 90% probability that the object is consistent with the unknown true concept, the learner has a 90% probability of responding that the object is consistent with the true unknown concept. Note tha t in this case, the learner will be correct in 90% of the trials in which they responded yes (81% of total trials), and an additional 10% of trials in which they responded no (1% of total trials), or 82% of total trials. Formally, we define ci obj as an indicator function equal to 1 if this particular trial’s random object is consistent with the concept ci. The overall probability that the random object ‘obj’ in this trial is consistent with the unknown true concept, P( C obj 1) , is as follows: P( C obj 1) c obj P(C ci | x) . ci i The probability of correct response, given this quantity, which is plotted in Figure 1, is therefore P(C obj 1)2 1 P(C obj 1)2 . How does classification performance change during the course of learning? First, note that Type I concepts show the fastest improvement in classification performance, and that Type VI concepts show the slowest improvement. Interestingly, however, performance on Type IV concepts initially develops more rapidly than performance on Type II concepts. This phenomenon may result from the high similarity between Type IV and Type I concepts. Greater than chance performance on Type IV concepts can be obtained by responding in accordance with the appropriate Type I concept. 2.5.2 1 0.75 0.5 0 2 4 6 8 10 Figure 1. Probability of correct classification (vertical axis), through several training blocks (horizontal axis). Key: Type I, triangle; II, open circle; IV, filled circle; VI, asterisk. The model learner’s beliefs about the true concept Intuitively, although it is needed to assess the model’s fidelity to behavioral human performance on the task, a probability matching response function seems a convoluted way to measure the development of beliefs. An advantage of testing our model learner is that beliefs, as measured by the probability distribution P(C|x), per se can be queried at each moment in learning, in addition to probability of correct response. Figure 2 therefore gives the posterior probability of the actual concept. In other words, Figure 2 gives the model concept learner’s posterior beliefs, given all learning to date x, about the probability of the true conce pt. In other words, for the Type I concept “square,” this figure plots the model learner’s belief that P(C=1|x), at each point in learning. Before any learning has taken place, almost all of the model learner’s belief is concentrated in the six Type I co ncepts (filled triangles at left of Figure 2); the other types of concepts have extremely low prior probability. During the course of learning, the probability of the actual concept increases, irrespective of its type. This figure corroborates the ordering of Type I>II>VI>VI. 2.5.3 The model learner’s uncertainty, throughout learning 1 0.5 Finally, while it is difficult to chart the model’s beliefs about the 0 probability of each of the 70 0 2 4 6 8 10 concepts at each point in learning, it is desirable to understand the Figure 2. The model learner’s posterior model learner’s subjective level of probability of the correct concept (vertical uncertainty over all the concepts, at axis), through several training blocks each point. Shannon entropy (horizontal axis). Key: Type I, triangle; II, (Shannon, 1948; Cover & Thomas, open circle; IV, filled circle; VI, asterisk. 1991) is ideally suited to quantifying this overall level of uncertainty, which is plotted in Figure 3: H (C | x) P(C ci | x) log 2 ci 1 P(C ci | x) A surprising result is that for nonType I concepts, the model 5 learner’s subjective uncertainty (Shannon entropy) actually 4 increases in early learning. Interestingly, this increase in 3 uncertainty corresponds to the 2 increase in probability of non-Type I concepts, which together begin 1 with probability 0.6%, increasing after two blocks of learning to 0 39.7%, 16.7%, or 63.4% when the 0 2 4 6 8 10 true, hidden concept is Type II, IV, or VI, respectively. It would be Figure 3. Uncertainty about the true concept, in interesting, and potentially feasible, bits (vertical axis), through several blocks of to test the model’s claim, that training (horizontal axis). Key: Type I, triangle; uncertainty actually increases II, open circle; IV, filled circle; VI, asterisk during early learning of non-Type I concepts. With human subjects, this prediction could potentially be tested either via subjective report or through physiological measures such as galvanic skin response to feedback on each trial. 3 T h e acti ve sa mp l i n g task an d b eh avi oral resu l ts Numerous models of category learning describe how selective attention to different stimulus dimensions evolves, yet little direct evidence is available. Rehder and Hoffman (in press) devised a novel experimental design in which eye movements provided direct evidence for how selective attention was deployed at each moment in time. This was accomplished by representing the value of each stimulus dimension with abstract characters (such as & and ¢ in Figure 4) at a particular vertex of a large triangle on a computer screen, and recording eye movements throughout learning. Among 72 subjects, a variety of patterns were observed. For the vast majority of participants, the following findings held: (1). Early in learning, all 3 stimulus dimensions were viewed (fixated). Figure 4. Example stimulus from Rehder & Hoffman’s experiment. Reprinted from Rehder, B., & Hoffman, A. (in press), Eyetracking and selective attention in category learning. Cognitive Psychology. Permission to reprint is pending. (2). Improvement in categorization performance occurred before reduction in number of stimulus dimensions viewed to just those necessary for correct classification (“efficient” eye movements). For instance, the modal pattern for Type I learners was for error to drop from 50% (chance) to almost 0% after about 12 trials (1.5 blocks), but for the number of dimensions fixated to remain high until about 16 trials (2.0 blocks). For Type II learners, errors decreased more gradually, beginning approximately on trial 40 (5 blocks), and decreasing gradually through approximately trial 72 (9 blocks). (This information on modal patterns for Type I and II learners is from figures 10 and 11, respectively, in Rehder & Hoffman, in press). (3). Eventually, efficient eye movements (to only the necessary stimulus dimensions) were used. For Type I concepts, efficient eye movements are to a single dimension, such as the size query for the concept “small”. Rehder & Hoffman’s Type I subjects fixated Figure 5. Number of stimulus dimensions approximately 1 dimension in fixated in each type of problem, as a function of terminal performance. For Type II number of learning blocks. (Each block concepts, eye movements are to the contains 8 trials.) Reprinted from Figure 3 in relevant two dimensions, such as the Rehder, B., & Hoffman, A. (in press), size-shape query for the concept Eyetracking and selective attention in category “large square or small circle”; this learning. Cognitive Psychology. Permission to was approximately observed in the reprint is pending. subjects. Type VI concepts always require all three stimulus dimensions; the vast majority of subjects always considered all three dimensions. Type IV concepts, in the general case, require all three dimensions. In some cases, however, depending on what is learned from the first dimension, Type IV concept objects can be classified with two dimensions. Type IV subjects tended to fixate all three dimensions, although the average number of dimensions fixated for Type IV subjects was slightly less than the number of dimensions fixated by Type VI subjects. The finding that Type IV subjects tended to fixate all three dimensions, even late in leaning, could suggest that subjects found it more expeditious to plan a sequence of eye movements in advance (e.g. a visual routine), rather than to fixate a single dimension and then consider what dimension to fixate next. We will focus on these empirical findings when evaluating our sampling model. Q u a n t i f y i n g e y e mo v e m e n t s ’ u s e f u l n e s s 3.1 We model early learning with the assumption that the learner fixates all stimulus dimensions, using the concept learning model described above. Our eye movement model works in conjunction with the concept learning model. Technically, the eye movement model is a subjective utility function for evidence acquisition, in the sense of Savage (1954). When we refer to the usefulness of each possible eye movement, we mean the subjective expected utility that the learner expects, obtain as a result of making that eye movement. In the equations below we refer to queries, rather than eye movements per se, to reflect that a task could be designed so that information is obtained by other means, for instance by as king a question, clicking a mouse, etc. We define the efficient query as fixating only the dimensions(s) necessary to categorize objects once the true concept is known, such as the size query for the concept “small”. At each point in learning, the learner picks between 8 possible queries: all dimensions, size-shape, size-color, colorshape, size, color, shape, or null. The task for modeling is to see whether a principled utility function can offer insight into subjects’ eye movements to different stimulus dimensions observed in Rehder & Hoffman’s (in press) data. Ideally, this utility function would be rational in the sense of Anderson (1990; Chater & Oaksford, 1999), analogous to the ideal observers sometimes used in analyses of perceptual tasks (Knill & Richards, 1996; Movellan & McClelland, 2001). There are three central ideas to the usefulness function we propose: 1. The learner wishes to correctly classify the randomly selected object that is presented in a particular trial; 2. The learner wishes to learn the true concept; 3. The learner wishes to avoid making extraneous eye movements. Each of these ideas is captured by a term in the usefulness (utility) function. We propose that a particular query’s usefulness: usefulness(Q|x)= pGain(Object, Q|x) + I(Q, C|x) – cost(Q). (Recall that x is shorthand for all the queries that have taken place to date, including the feedback on whether or not the object in each trial is consistent with the unknown true concept. For example, usefulness(Q size|x) refers to the usefulness of querying the size dimension, given everything that has been learned to date.) Each term will be explained in turn. 3.1.1 Classifying a random object: pGain(Object, Q|x) The learner’s immediate task is to correctly classify the random object presented in a particular trial. Suppose that the true concept were known to be “large”. In this case, queries of size, size-shape, size-color, and size-shape-color are each equally useful, and the other queries are each useless. The eye movement model’s usefulness function captures this idea by quantifying the amount that a query is expected to improve the probability of correctly identifying the random object presented in a particular trial. Note that irrespective of learning, the task is such that the null query results in a 50% probability of correctly identifying the object. For example, if the true concept were “large”, then the size query would result in 100% probability of correctly identifying the concept, an improvement of 50%; the shape query would result in a 50% probability of correctly identifying the concept, no improvement over the null query. The random variable Obj represents the random object that appears in a particular trial. The term pGain(Obj,Q|x) refers to the expected improvement in probability of correctly classifying a random object, averaging over all concepts, expected from making the query: pGainObj , Q | x pCorr (Obj | Q, x) pCorr (Obj | x) , where pCorr(Obj|Q,x), the expected probability of correctly classifying a random object given learning x, averaging over possible results of the query Q, pCorr Obj | Q, x Pobj | x Pq | x pCorr obj , q | x , and obj q pCorr obj , q | x max Pc | q, x c obj , c Pc | q, x c obj . c χc_i(obj) is an indicator function equal to 1 if obj is consistent with the concept ci, 0 otherwise. χ¬c_i (obj) equals 1 if obj is not consistent with the concept ci, 0 otherwise. P(obj)=1/8 for all objects, and does not depend on x. Note that because objects are drawn randomly with equal probability, and 4 of the 8 objects are consistent with each concept, pCorr(Obj|x)= pCorr(Obj)=0.50, the probability of correctly classifying a random object if no query is made. 3.1.2 Learning the true concept: I(Q,C|x) The second term in the usefulness function, I(Q,C|x), quantifies a query Q’s value for reducing uncertainty about the true concept C. Technically, the mutual information between Q and C, given learning to date x, I Q, C | x Pq | x Pc | q, x log 2 q c 1 , Pc | q, x is the expected reduction in Shannon entropy in C from making the query Q. This term can be thought of as quantifying the anticipated future usefulness of what is learned in the present trial. Sometimes called ‘information gain,’ the use of mutual information to identify useful experiments was proposed by Lindley (1956). Lee and Yu (1999) suggested using mutual information to predict eye movements; Renninger, Coughlan, Verghese, and Malik (in press) use information gain in a model of a shape learning task. Mutual information has also been used to model several cognitive information acquisition tasks (reviewed by Nelson, in press). 3.1.3 A v o i d i n g e x t r a n e o u s e y e mo v e m e n t s : c o s t ( Q ) Finally, we include a cost term, cost(Q), to represent the subjective cost of fixating one or more stimulus dimensions; we set cost(Q) to 0.02 per dimension fixated. 3.2 Questions about the usefulness function A number of points of confusion about why we formulated the usefulnes s function in this way may arise. Are all the terms in it necessary? What would happen if some of them were eliminated? Could some of the terms be formulated differently? This section, formulated as a set of questions and answers, discusses those issue s. What if the pGain(Obj,Q|x) term were eliminated? Intuitively, it may seem that the learner’s goal is to learn the true concept, and that this goal is served by the I(C,Q|x) term. Suppose the pGain(Obj,Q|x) term were eliminated. Early in learning, the model concept learner would behave similarly, because the main driver for the query to all stimulus dimensions is its ability to provide information about the unknown true concept. Late in learning, however, once the true concept is known with high certainty, the model concept learner, rather than relying on the efficient query relative to the true concept, would stop making eye movements altogether (switching to the null query), because the cost of any query would more than outweigh the usefulness of its miniscule expected reduction in uncertainty about the true concept. The paradoxical result would be that just as the learner comes to know the true concept with high probability, as attested by the probabilities of each concept in the model, performance on the categorization task would decrease to chance. What if the I(C,Q|x) term were eliminated? In the first trial of learning it is impossible to classify the object with greater than chance (50%) accuracy, irrespective of the number of stimulus dimensions fixated. Thus, because there is a small cost to every query (except the null query), every query except the null query would have negative usefulness, and the learner would not fixate any of the stimulus dimensions. In the second learning trial, because no learning took place on the previous trial, the same situation would arise. The problem would iterate indefinitely, and at no point in time would the learner fixate any stimulus dimensions, or achieve greater than chance performance on the task. What if cost(Q) term were eliminated? Suppose there were truly zero cost to make a query. In this case, even very late in the learning process, when uncertainty about the true concept is negligible, e.g. when the true concept is known with nearly 100% certainty, the learner would fixate all stimulus dimensions, rather than the subset of stimulus dimensions that is required given a certain kind of concept, such as size and shape for the concept “large square or small circle.” As an approximation of human visual perception, it is unreasonable to consider eye movements free. Every eye movement query has a biomechanical cost, and an opportunity cost in terms of the other queries that must be foregone during its execution. Why use a change in probability measure for the first term of the usefulness function, pGain(C,Q|x), and an information theoretic measure for the second term, I(C,Q|x)? Is that the only way to make the model work? Presumably, there is more than one mathematically adequate way to capture the learner’s goals to both (1) correctly classify the immediate random object, and (2) to learn the correct concept, so as to be able to reliably and correctly classify the objects in future trials. In particular, it is possible to reformulate the second term as pGain(C,Q|x), rather than I(C,Q|x); it seems unlikely that this would impair performance of the concept learning model. Baron (1985), Klayman & Ha (1987), Nickerson (1996), and Nelson (in press) each discuss several usefulness functions that might be viably employed for either of the first two terms of our usefulness function. Is a term for intrinsic visual salience needed? A term for intrinsic visual salience (Yamada & Cottrell, 1995; Itti, Koch, & Niebur, 1998) could be added to the usefulness function, but is not necessary to model these data. What likelihood does the model learner think they will use when evaluating each possible query’s usefulness? To simplify calculations of the reward function, the current model assumes that when evaluating possible queries’ (eye movements ’) anticipated usefulness, μ=1. The implicit psychological claim behind this choice is that when considering what eye movements to make, the learner believes he or she will have complete confidence in the results. In future work we intend to explore the implications of using the same likelihood function as in the concept learning model, where μ<1, when calculating possible queries’ usefuless. 3.3 R e s u l t s f r o m t h e e y e m o v e m e n t mo d e l What eye movements does the model predict, at particular stages of learning? In each trial, there are 8 possible queries: size-shape-color, size-shape, size-color, color-shape, size, color, shape, and null. Before learning, the query to all stimulus dimensions (usefulness=0.94) is more useful than any query of two (usefulness=0.50) or one (usefulness=0.19) stimulus dimensions, consistent with Rehder and Hoffman’s (in press) empirical finding that early in learning, learners tended to fixate all stimulus dimensions. This is due to the queries’ different I(C,Q), their expected information gain with respect to the learner’s beliefs about the true concept. (Because no learning has taken place, this is true irrespective of the true concept. Before learning, pGain(Q,C) is zero for every query.) Figure 6 illustrates how I(C,Q), the filled diamonds, and pGain(Obj, Q), the open diamonds, change during learning of the concept “small”, comparing the query to all stimulus dimensions (left panel), and the size query (right panel). The I(C,Q) term always favors the query to all stimulus dimensions, relative to the query to the size dimension alone. During learning, however, the I(C,Q) terms tend toward zero, and eventually the size query, which has the lower cost, becomes more valuable than the query to all dimensions. A similar pattern holds for Type II concepts. (For Type IV and VI concepts the model always prefers the query to all stimulus dimensions.) 1 1 0.75 0.75 0.5 0.5 0.25 0.25 0 0 0 2 4 6 8 10 0 2 4 6 8 10 Figure 6. Components of the eye movement function for the concept “small.”. I(C,Q), filled diamonds; pGain(Object, Q), open diamonds Vertical axis: usefulness. Horizontal axis, blocks of learning. Left panel, the query to all stimulus dimensions; Right, the size query. Figure 7 shows that classification performance (solid line) increases well before the efficient query (open circles) becomes more valuable than the query to all dimensions l (filled circles), as Rehder and Hoffman (in press) observed in their data. At left, the Type I concept “small.” Filled circles denote the query to all stimulus dimensions; open circles the efficient query, to only the size dimension. The right panel is for the Type II concept “large square or small circle”: the filled circles denote the query to all dimensions; the open circles denote the query to only the sise and shape dimensions. As the I(C,Q) term (shown above, in Figure 6) decreases for both the query to all dimensions and the efficient query, the higher cost of the query to all dimensions eventually causes its usefulness to be lower than that of the efficient query. For Type IV and VI concepts, the model rates the query to all dimensions as having the highest usefulness throughout learning. 1 1 0.75 0.75 0.5 0.5 0.25 0.25 0 0 0 2 4 6 8 10 0 2 4 6 8 10 Figure 7. Classification accuracy and selected queries’ usefulness. See text. The vertical axis gives a query’s usefulness, or the probability of correct response; the horizontal axis shows the number of blocks of learning to date. 4 Imp or tan ce, i mp l i c ati on s, an d f u tu re w ork Rehder & Hoffman (in press) noted that that their eye movement data include some patterns characteristic of RULEX, a rule-based category learning model (Nosofsky, Palmieri, & McKinley, 1994), and ALCOVE, a similarity-based model (Kruschke, 1992). Rehder and Hoffman (in press) theorized that multiple rule - and similaritybased systems combine and make different demands on selective attention. Our results show that a single probabilistic model, with just two parameters (comparing favorably with the larger number of parameters in RULEX and ALCOVE), can provide a parsimonious account of both classic concept learning results, and the central findings in Rehder and Hoffman’s compelling eye movement data. Some authors have viewed “conservatism” in belief updating as an error. Edwards (1968) summarized research on conservatism as showing that “it takes anywhere from two to five observations to do one observation’s worth of work in inducing a subject to change his opinions,” (p. 18), and considered whether “discrepancies in human performance [from] that specified by Bayes’ theorem” (p. 21) were caused by human misperception of individual observations or by the misaggregation of multiple observations. Oaksford and Chater (1998) modeled belief updating by assuming “that participants revise their degree of belief in a hypothesis by only a half of what Bayes’ theorem would recommend” (p. 386). (Oaksford & Chater, 1998, pp. 386, 395, used the same likelihood function, on a model of a different task.) Our likelihood function—in which the learner believes that some information presented is noisy—shows that imperfect memory is a possible cause of seemingly conservative belief updating. One critique of probabilistic models of leaning has been that prior probabilities have sometimes been set largely according to intuition. On the concept learning task empirical results can be used to infer a reasonable prior distribution, as we did in our model. It would be more theoretically satisfying to obtain priors from intrinsic properties of the concepts. One such property is the average number of stimulus dimensions required to classify an object once the true concept is known (“Dim.” column in Table 1). This method obtains the same ordering of prior probabilities as our model, and similar results. Minimum description length (Fass & Feldman, 2002) could also be explored as providing an explicit basis for the prior distribution. In life, virtually every interesting learning or inference scenario involves some active contribution of the learner in deciding where to look, what questions to ask or experiments to conduct, or what data to attend to. The present model, to our knowledge, is the first unified account where both early learning and asymptotic performance on a concept formation task are produced by the same principled sampling function. This model contributes to a growing body of work using information- and decision-theoretic concepts to model perceptual tasks (Lee & Yu, 1999; Sprague & Ballard, 2003; Trommershäuser, Maloney & Landy, 2003). We hope it may also spur work to understand the relationship between active information acquisition in perceptual, such as eye movement, and other cognitive learning and inference tasks. Author note Bob Rehder and Aaron Hoffman kindly corresponded about their experiment and findings, and provided their data to us. Bob Rehder and Javier Movellan provided feedback on a draft of this manuscript. JDN was funded by NIMH grant 5T32MH020002-05 to T. Sejnowski and by NSF grant DGE 0333451 to GWC. GWC is also supported by NIH grant MH57075. References Anderson, J. R. (1990). The Adaptive Character of Thought. Hillsdale NJ: Erlbaum. Baron, J. (1985). Rationality and Intelligence. Cambridge: Cambridge University Press Brunswik, E. (1952). The conceptual framework of psychology. Chicago: University of Chicago Press. Chater N, Oaksford M (1999). Ten years of the rational analysis of cognition. Trends in Cognitive Sciences, 3(2), 57-65. Cover, T. M., & Thomas, J. A. (1991). Elements of Information Theory. New York: Wiley Edwards, W. (1968). Conservatism in human information processing. In B. Kleinmuntz (Ed.), Formal Representation of Human Judgment, (pp. 17-52). New York: John Wiley Fass, D., & Feldman, J. (2002). Categorization Under Complexity: A Unified MDL Account of Human Learning of Regular and Irregular Categories. NIPS 15 Itti, L., Koch, C., & Niebur, E. (1998). A Model of Saliency-Based Visual Attention for Rapid Scene Analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20( 11), 1254-1259. Klayman, J. & Ha, Y.-W. (1987). Confirmation, disconfirmation, and information. Psychological Review, 94, 211-228. Knill, D. C., & Richards, W. (Eds.), (1996). Perception as Bayesian inference. Cambridge, U.K.: Cambridge University Press. Kruschke, J. K. (1992). ALCOVE: An exemplar-based connectionist model of category learning. Psychological Review. 99, 22-44. Lee, T. S., & Yu, S. X. (1999). An information-theoretic framework for understanding saccadic eye movements. Advances in Neural Information Processing Systems, 12, 834-840. Lindley, D. V. (1956). On a measure of the information provided by an experiment. Annals of Mathematical Statistics, 27, 986-1005. Movellan, J. R., & McClelland, J. L. (2001). The Morton-Massaro law of information integration: Implications for models of perception. Psychological Review, 108(1), 113–148, 2001. Nelson, J. D. (in press). Finding useful questions: on Bayesian diagnosticity, probability, impact, and information gain. Psychological Review. Nelson, J. D., Tenenbaum, J. B., & Movellan, J. R. (2001). Active inference in conc ept learning. In J. D. Moore & K. Stenning (Eds.), Proceedings of the 23rd Conference of the Cognitive Science Society, 692-697. Nickerson, R. S. (1996). Hempel’s paradox and Wason’s selection task: Logical and psychological puzzles of confirmation. Thinking and Reasoning, 2, 1-32. Nosofsky, R. M., Palmieri, T. J., & McKinley, S. C. (1994). Rule-plus-exception model of classification learning. Psychological Review, 101, 53-79. Oaksford, M., & Chater, N. (1998). A revised rational analysis of the selection task: exceptions and sequential sampling. In M. Oaksford, & N. Chater (Eds.), Rational Models of Cognition (pp. 372-393). Oxford: Oxford University Press. Rehder, B., & Hoffman, A. B. (2003). Eyetracking and selective attention in category learning. In R. Alternman & D. Kirsh (Eds.), Proceedings of the 25th Annual Conference of the Cognitive Science Society. Boston, MA: Cognitive Science Society. Rehder, B., & Hoffman, A. B. (in press). Eyetracking and selective attention in category learning. Cognitive Psychology. Renninger, L. W., Coughlan, J., Verghese, P. & Malik, J. (in press). An information maximization model of eye movements. Advances in Neural Information Processing Systems. Savage, L. J. (1954). The Foundations of Statistics. New York: Wiley. Shanks, D. R., Tunney, R. J., & McCarthy, J. D. (2002). A re-examination of probability matching and rational choice. Journal of Behavioral Decision Making, 15, 233-250. Shannon, C. E. (1948). A mathematical theory of communication. The Bell System Technical Journal, 27, 379–423, 623–656 Shepard, R. N., Hovland, C. I., & Jenkins, H. M. (1961). Learning and memorization of classifications. Psychological Monographs: General and Applied, 75(13), 1-42. Sprague, N., & Ballard, D. (2003). Eye Movements for Reward Maximization. NIPS 16 Steyvers, M., Tenenbaum, J. B., Wagenmakers, E.-J. & Blum, B. (2003). Inferring causal networks from observations and interventions. Cognitive Science, 27, 453-489. Tenenbaum, J. B. (1999). A Bayesian Framework for Concept Learning. Ph.D. Thesis, MIT Tenenbaum, J. B. (2000). Rules and similarity in concept learning. NIPS 12, 59–65. Tenenbaum, J. B., & Griffiths, T. L. (2001). Generalization, similarity, and Bayesian inference. Behavioral and Brain Sciences, 24(4), 629-640. Trommershäuser, J., Maloney, L. T., & Landy, M. S. (2003). Statistical decision theory and the selection of rapid, goal-directed movements. Journal of the Optical Society of America A, 20(7), 1419-1433. Yamada, K., & Cottrell, G. W. (1995). A model of scan paths applied to face recognition. In Proceedings of the Seventeenth Annual Cognitive Science Conference, Pittsburgh, PA, pp. 55-60, Mahwah: Lawrence Erlbaum.