Stages of conceptual change in humans and robots

Jonathan Scholz

15 December 2010

Abstract

There is strong empirical and theoretical evidence in psychology that human development follows a stage-like progression throughout the lifespan. Development is marked by relative stability, punctuated by periods of rapid and profound conceptual change. However, the driving forces behind these bursts of change are poorly understood. In this paper I examine a growing body of evidence that developmental change can emerge through a process of rational, statistical inference as a learner interacts with his or her environment. I briefly introduce the main idea behind statistical (Bayesian) inference over structured hypothesis spaces, and then explore the specific domain of number learning, which has unique implications for the emergence of stages in human development.

1 Introduction

Dating back at least to the work of Piaget [17], psychologists have distinguished between the ideas of learning and development. Whereas learning was conceived as a basic process of association and reaction, development was viewed as a fundamental reorganization of the machinery of cognition. Development was held to proceed in a series of discrete stages, between which humans made profound conceptual leaps. The critical need for such a process in any artificial system approaching human intelligence has been appreciated for some time, but until recently the basic requirements for a developmental learner were poorly understood at the computational level. Recently, however, progress has been made towards formalizing conceptual change in children using the tools of Bayesian statistics. This paper considers how a recent model of one particular developmental process, numerical concept learning, has provided the first computational account of a stage-like learning phenomenon.

This paper is organized as follows. First I review the findings in the psychology literature that suggest general stage-like processes of development. I then describe the number learning problem in greater detail, including both the basic empirical results and theoretical accounts in the developmental literature. In Section 4, I introduce the key ideas behind the use of Bayesian statistics in psychology, and describe several results with particular relevance to developmental learning in both humans and robots. Finally, I explain a Bayesian model of the bootstrapping of number concepts designed to address a gap in the developmental literature. I conclude by discussing the implications of this work and some directions for future research.

2 Stages of development

Since at least the work of Piaget in the 1950s, most psychologists have considered human development a progression through a series of discrete stages. Far from being an arbitrary characterization of child development, Piaget's "accommodation" hypothesis commits to a strong empiricist view that children modify their mental machinery to understand novel stimuli [19, 18]. While contrary to the views of some of his leading contemporaries such as Chomsky and Fodor, Piaget's view has featured prominently in modern accounts of development such as the "Theory-Theory" [11]. Although Piaget's constructivist ideas were initially inspired by observing his own children, they have received a great deal of empirical support over the past several decades.
Piaget was the first to propose object permanence as a key developmental milestone, in which children come to understand objects as existing beyond their immediate sensory environment [17]. Object permanence was initially thought to emerge between 3 and 4 years of age, but more recent studies using preferential-looking-time methods have found evidence for it at 2 years and even at 5 months [2]. In either case, researchers agree that it is an important conceptual accomplishment for young children. The identification of object permanence at younger and younger ages was made possible by removing the dependence on language, itself another well-studied and stage-like developmental process: Carol Chomsky identified five specific stages in the development of the language faculty [5]. Such qualitative shifts in children's abilities have also been observed in Theory of Mind reasoning [13], multimodal perception [1], and number comprehension [24]. It is the last of these, reasoning about number, that will be of particular interest as an instance of conceptual bootstrapping.

3 Learning the concept of number

One of the most striking examples of sudden conceptual change can be found in children's apprehension of number words. Children proceed through a series of stages in their understanding of the relationship between numerosity and the number words. Wynn identified four discrete stages of number competence, as well as a general window in which these phases progress [24]. Notably, researchers and parents alike observe the progression to the final phase as a "quantum leap" in understanding, in which children switch from a rote to a systematic understanding of number [20]. It is this leap that is of deepest interest from a modeling point of view.

3.1 The subset-knower phenomenon

At around 2 years of age, as children begin to learn basic words in their native language, they first gain the ability to reason about number in a limited way. If asked to retrieve one fish, a 2-year-old can successfully remove a single fish from a bowl containing many toy fish. If asked to retrieve two or more, however, children at the one-knower stage will simply provide a handful of fish [24]. This is striking, since the same children exhibit proficiency in counting up to 10 or more, and will spontaneously utter number words in the correct order. They later proceed through a two-knower and even a short three-knower period in which their number comprehension tops out at two or three. Eventually, though, children display a sudden shift in understanding during which they discover the underlying logic relating number words to numerosity. That is, rather than progressing on through four-knower and five-knower stages, they transition to being cardinal-principle (CP) knowers, able to grasp the abstract rules of number.

In addition to the seismic shift in competence, the manner in which children provide answers to number questions is revealing. Theories of CP development posit a sudden jump from an association-based ability to a logical one. Consistent with this idea, children before the CP stage virtually never use counting to give large numbers, whereas children shortly after the CP-transition almost always do [24]. Younger children instead seem to have directly mapped particular small numerosities onto the correct number words, thereby succeeding at small numerosities with little effort, but lacking any general method for other numbers.
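To make the contrast concrete, the following minimal sketch (my own illustration; the function names and the "handful" behavior are simplified stand-ins, not taken from [24]) contrasts a rote three-knower with a CP-knower on a give-N style request:

    # Illustrative sketch: a rote "subset-knower" vs. a "CP-knower".
    # The count list is recited by rote in both cases; only the
    # CP-knower relates position on the list to set size.
    COUNT_LIST = ["one", "two", "three", "four", "five", "six"]

    def three_knower(word):
        """Direct word-to-numerosity mappings, with no general rule."""
        rote = {"one": 1, "two": 2, "three": 3}
        # Beyond the memorized range, grab an arbitrary handful,
        # as subset-knowers do in give-N tasks.
        return rote.get(word, "a handful")

    def cp_knower(word):
        """One step along the count list means one more object."""
        return COUNT_LIST.index(word) + 1

    for w in ["two", "five"]:
        print(w, "->", three_knower(w), "vs.", cp_knower(w))
    # two -> 2 vs. 2
    # five -> a handful vs. 5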
3.2 Conceptual bootstrapping

Carey argues that the development of number competence can be explained by a process of conceptual bootstrapping [4]. In contrast with associationist theories, bootstrapping argues for a representation of number using mental models of small sets. For instance, one-knowers might have a model of "one" as {X}, and "two" as {X, X}. These representations serve as the input to a domain-general capacity referred to as enriched parallel individuation. Le Corre and Carey argue that such a capacity is fundamental for working with sets and individuating objects, and can explain subset-knowers' ability to put a set labeled "two" into 1-1 correspondence with their mental model of two, {X, X} [16]. In the bootstrapping account, the transition to CP-knower happens when children observe a pattern between their memorized list of count words and the successively larger sets these words map to. By discovering the rule that moving one step along the count list corresponds to increasing the set size by one, children have an abstract definition for bootstrapping arbitrarily large sequences of number words onto their appropriate meanings.

This account explains the dramatic CP-transition, but says little about how children discover the rule in the first place. Indeed, critics of the bootstrapping explanation cite vagueness and informality as key weaknesses [8]. Perhaps more interestingly, Rips et al. argue that bootstrapping is fundamentally a circular idea, which presupposes some form of innate successor function to support reasoning with sets [21]. A successor function is one that, for any set of size k in a sequence of sets, returns next(k), the next largest set. Such a dependence would be a problem for Carey's account because it moves the work of relating counting words with sets from a learnable representation to an innate one. As will be discussed in Section 5, Piantadosi et al. present a model which demonstrates that all three criticisms are unfounded. First, however, we must provide some background on the general methods they employ.

4 Bayesian models of cognition

One of the exciting trends in cognitive science over the past decade has been the unification of symbolic and statistical theories of learning under a common framework. The approach is broadly referred to as "Bayesian modeling", and technically includes any method which employs Bayes' rule to formalize inference under uncertainty. In the developmental literature, however, it has taken a more specific form, as a normative account of learning from data. Over just the past four years, Bayesian models have been proposed to explain children's patterns in areas as diverse as causal reasoning [10], social goal inference [3], word learning [7], sensorimotor integration [15], and structured representations in general [14]. The common thread through each of these research programs is the formulation of learning as a process of scoring hypotheses from a hypothesis space based on how well they explain observed data. For example, a Bayesian model for learning to recognize animals might select the hypothesis that "horses are the ones with hooves" over "horses are the ones that are brown" when trained on a collection of images of horses and bears. The insight which the model makes formal is that the first hypothesis, "hooves", is more useful for identifying horses than "brownness" (at least when horses and bears are involved).
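As a toy numerical illustration of this kind of hypothesis scoring (the data, feature probabilities, and noise levels below are invented for the example, not drawn from any of the cited models), consider scoring the two rules against a handful of labeled animals:

    # Toy Bayesian scoring of "hooves" vs. "brown" as rules for "horse".
    # Data: (label, has_hooves, is_brown); values invented for illustration.
    data = [("horse", True, True), ("horse", True, False),
            ("horse", True, True), ("bear", False, True)]

    def likelihood(hypothesis, example):
        """P(label | features, hypothesis): each rule predicts 'horse'
        from one feature, and is assumed right 90% of the time."""
        label, hooves, brown = example
        feature = hooves if hypothesis == "hooves" else brown
        predicted = "horse" if feature else "bear"
        return 0.9 if predicted == label else 0.1

    priors = {"hooves": 0.5, "brown": 0.5}   # uniform prior P(h)
    scores = dict(priors)
    for h in scores:
        for ex in data:
            scores[h] *= likelihood(h, ex)   # accumulate P(x|h) P(h)

    total = sum(scores.values())             # normalizing constant
    for h, s in scores.items():
        print(h, round(s / total, 3))
    # hooves 0.988, brown 0.012: "hooves" explains the data far better.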
Bayes' rule is a powerfully general tool which can be applied to any situation in which a learner must select among competing explanations for some event or observation. The Bayesian approach initially became popular in the artificial intelligence community for handling measurement uncertainty, but has come to play a more critical role in psychology as a mechanism for formalizing the "inductive" learning of new concepts. Whereas the learning of concepts such as "horse" or "sunrise" from a small number of examples is considered an inductive leap, selecting among hypotheses is fundamentally deductive. Thus, Bayesian inference stands to solve the "riddle of induction" by transforming an ill-defined inductive problem into a well-defined deductive one [9]. Formally, Bayes' rule can be stated as

    P(h|x) = P(x|h) P(h) / Σ_{h'∈H} P(x|h') P(h')    (1)

The term P(x|h) is referred to as the likelihood, and indicates how likely some observation x is given a hypothesis h. P(h) is the prior probability of the hypothesis h, before any data are observed. Finally, P(h|x) is the term of interest, and gives the probability of hypothesis h given that data x were observed. Each of these terms in fact refers to a full probability distribution, and Bayes' rule gives a way of updating our estimate P(h|x) in light of some observation x. The denominator Σ_{h'∈H} P(x|h') P(h') is simply a normalizing term that sums the probability of the data over all hypotheses, weighted by their priors, thus ensuring that P(h|x) is a proper distribution. From this general rule we can derive arbitrarily complicated models of data, from visual features to social behavior. The important property of a Bayesian model is that it describes the statistically optimal way to update one's beliefs in light of new data.

5 A computational model of number concept bootstrapping

The debate over the bootstrapping theory of number learning rested on arguments about what computations the core systems of number are capable of. Carey et al. would have us believe that knowledge of sets alone is sufficient to develop a full system for numerical reasoning, which spontaneously emerges as children gain experience with number words. Rips et al. argued instead that the core system must include a successor function to make cardinal-principle reasoning possible. Without a clear alternative to fill the role of next(k), the bootstrapping theory fails to address Rips' concern. However, the Bayesian approach offers a solution by casting number learning as search over a suitable hypothesis space. Combining this idea with the machinery described above for inductive inference, Piantadosi et al. illustrate not only that bootstrapping for number words is possible, but that the developmental progression observed in children is consistent with an optimal Bayesian learner.

The core ingredient of Piantadosi's model is a hypothesis space for language, or what Carey might refer to as the core conceptual primitives that we can assume humans are born with. Carey's bootstrapping theory is not alone in presuming sophisticated innate capacities in infants; Spelke has also argued for the existence of core systems of number to explain young infants' apparent facility in number tasks [6]. Thus, Piantadosi et al. are not straying far from the status quo in employing the lambda calculus as a core representational system for language. The choice of lambda calculus is not unique to this work, and is in fact quite common as a formal language for compositional semantics [12, 22].
Lambda calculus is a rich yet convenient language for expressing a wide range of computational problems. It enjoys a long history within linguistics and computer science, and was the basis for Lisp, one of the oldest and most successful programming languages. The main activity in lambda calculus is specifying how to build larger expressions out of smaller parts. For example,

    λx. (not (singleton? x))    (2)

describes a function for determining whether a set contains more than a single element. The syntax of a lambda expression is as follows. To the left of the period is "λx", which denotes that the argument to the function is the variable named x. To the right of the period, the expression specifies how the function evaluates its argument. Expression 2 returns the value of not applied to (singleton? x); in turn, (singleton? x) is the function singleton? applied to the argument x. Since this lambda expression represents a function, it can be applied to arguments (sets) to yield return values. For instance, Expression 2 applied to {Bob, Joan} yields TRUE, but applied to {Carolyn} it yields FALSE, since only the former is not a singleton set [20].

Beyond simple expressions like the one above, the lambda calculus is capable of defining rich, arbitrarily complicated functions. By using the primitives of lambda calculus to define a hypothesis space, a learner has a rich universe of concepts in which to search for a model that best fits the data. For such a learner, the data consist of a set of events, much as a child might experience, in which a group of some number of objects is presented along with a word indicating the numerosity. One example of a more sophisticated concept which the learner might select is shown below:

    λS. (if (singleton? S)
            "one"
            (next (L (select S))))    (3)

In this example, next returns the next word on the counting list (NOT the next largest set, as required by Rips et al.), select returns a set containing a single element of S (a singleton), and L denotes the expression itself, making the definition recursive. Overall, then, the expression immediately returns "one" if the set contains one element; if not, it selects a single element from the set and calls itself on this singleton. The recursive call returns "one", since the set now contains a single element, and the outer call then applies next to this result, so the expression as a whole returns "two" for any larger set. This is an example of a somewhat complicated expression, in that it involves recursion, and one which is not obviously of use. However, it helps underscore the large space of potential functions which can be represented with a few simple predicates. For additional examples of the predicates used in the model, and of functions for working with numbers, see [20].

To evaluate hypotheses in the form of lambda expressions given data as described above, Piantadosi et al. introduce a Bayesian model which defines a probability distribution over expressions L given observed words W, containing a target type T (e.g. "cats") in a context C (e.g. cat, horse, dog, dog, cat, dog). By Bayes' rule, we have:

    P(L|W, T, C) ∝ P(W|T, C, L) P(L) = [ ∏_i P(w_i | t_i, c_i, L) ] P(L)    (4)

The authors define a likelihood function for a word given an expression and a context which, intuitively, assigns higher probability to words that the expression can generate, and lower probability to words that it cannot. Finally, a prior term P(L) assigns greater probability to shorter expressions and to expressions without recursion, which biases the learner towards compact and simple concepts.
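To give a feel for how these two terms trade off, here is a highly simplified sketch (my own, with invented constants; the actual model [20] defines the prior via a probabilistic grammar over lambda expressions rather than a hand-assigned size) in which each hypothesis is an ordinary function paired with a description length:

    import math

    # Simplified scoring in the spirit of Expression 4:
    # log P(L | data) ~ log P(data | L) + log P(L), with a length-based
    # prior and a noisy-match likelihood. All constants are invented.
    COUNT_LIST = ["one", "two", "three", "four", "five",
                  "six", "seven", "eight", "nine", "ten"]

    def two_knower(n):
        # Rote mappings for "one" and "two"; silent beyond that.
        return {1: "one", 2: "two"}.get(n)

    def cp_knower(n):
        # The recursive insight: one step on the list per object.
        return COUNT_LIST[n - 1]

    # (hypothesis, size): a longer expression gets a lower prior.
    hypotheses = [(two_knower, 4), (cp_knower, 7)]

    def log_prior(size):
        return -size                      # log P(L)

    def log_likelihood(hyp, data):
        # High probability when the hypothesis generates the observed word.
        return sum(math.log(0.99 if hyp(n) == w else 0.01)
                   for n, w in data)

    data = [(1, "one"), (2, "two"), (1, "one"),
            (3, "three"), (5, "five")]    # toy observations

    for hyp, size in hypotheses:
        score = log_prior(size) + log_likelihood(hyp, data)
        print(hyp.__name__, round(score, 2))
    # On "one"/"two" data alone the shorter two_knower wins; once larger
    # numbers appear, cp_knower's likelihood advantage overwhelms its
    # prior penalty, mirroring the CP-transition.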
This form of Bayesian "Occam's razor" has been shown to be an important inductive bias for learning concepts that fit observed data well but also generalize to new situations. One of the great advantages of the Bayesian approach is that these well-known but abstract scientific principles are straightforward to codify in a model. After specifying the model, all that remains is to provide data and "turn the crank". The question then becomes: which hypotheses will the learner prefer as a function of the data it has been given? In this experiment, the data were obtained from the CHILDES database, which contains examples whose word frequencies approximate the naturalistic word probabilities of child-directed speech.

As can be seen in Figure 1, the model progressed through four distinct phases of development as the amount of experience increased. Like children, the model initially preferred simple direct mappings for one, two, and even three objects. At four objects, however, an interesting effect appears: rather than continuing to discover concepts for higher words on the count list, the model makes a sudden transition to a representation equivalent to CP-knower status in children. Thus, the learner provides a computational verification of Carey's initial suspicion: that at some point it is more advantageous to learn a computational system than to memorize the correspondence between words and numerosities by rote.

Figure 1: Learning results on the CHILDES dataset. The x-axis indicates the amount of data provided to the model, and the y-axis scores the probability of each hypothesis given the data.

Figure 2: Comparison of learners. Grey traces depict the large number of low-probability hypotheses as the learner gains experience.

However, unlike in Rips' account, Piantadosi et al. presented a hypothesis space in which the CP-knower was but one latent concept in an infinite space. While simple to codify, the lambda calculus presented here is a highly expressive "language of thought", and contains sufficient richness to model the developmental progression observed in humans. Furthermore, the Bayesian model for searching this space correctly chose the CP-knower hypothesis over other possibilities equally consistent with the data (such as a MOD-10 system [20]).

6 Conclusion

While the model described above may be of only passing interest outside the scope of the bootstrapping debate, it has profound implications for theories of development in artificial agents and robots. Over the past decade, a significant contingent of the robotics community has sought inspiration for humanoid robots in the developmental psychology literature. The intuition behind this shift is perhaps obvious: after decades of watching robots repeatedly fail in non-laboratory environments, a natural question is what makes human infants so adept at coping with novel situations. For these communities of developmental and epigenetic robotics, the question changed from "how can we design behavior X into our robot" to "how can we design learner L into our robot such that it will learn behavior X (and Y, Z) as it gains experience". Unfortunately, the theoretical foundation for learner L does not really exist, and certainly not at a level commensurate with human achievement. Not to be discouraged, robotics researchers such as Stoytchev [23] identified a series of principles of development through rigorous analysis, and went on to impose these stages on a humanoid robot.
In the case of Stoytchev, these principles included "gradual exploration" and "verifiability", abstract ideas that governed his approach to robot control architecture. This approach is clearly insightful and forward-thinking, but it doesn't truly break from the tradition of designing behaviors for robots manually. To the extent that this is true, however, such a shortcoming can be justified by the lack of any real method for providing domain-general developmental constraints.

This is what makes the work of Piantadosi et al. so compelling from a robotics perspective. The important thing about their model is not what it seeks to explain, but the fact that the driving force behind its human-like performance is a general-purpose constraint imposed by the Bayesian machinery. Contrary to much of the work in developmental robotics, in which a stage-like progression is imposed explicitly or subconsciously by the programmer, the model described above came upon its stages organically. The observed transition from three-knower to CP-knower was driven by an interaction between a likelihood term that sustained the CP hypothesis as new data came in, and a prior term that preferred it over competitors, which had to become increasingly complicated in order to explain all the observed data.

In the end, models like these will probably gain and lose favor as new results emerge in the literature, but the current climate marks a turning point towards a more principled approach to computational learning theory. As Kemp observes,

    We have other interesting computational (Bayesian) models of cognitive processes, but they often assume significant in-place representational and reasoning abilities.

The direction of the field is clearly towards an overarching theory of how all of these problems are learnable from a core set of domain-general and domain-specific faculties. Piantadosi's model can be seen as a step in this direction, and with any luck, robots will soon begin taking these steps as well.

References

[1] L.E. Bahrick. Infants' perception of substance and temporal synchrony in multimodal events. Infant Behavior and Development, 6(4):429–451, 1983.

[2] R. Baillargeon, E.S. Spelke, and S. Wasserman. Object permanence in five-month-old infants. Cognition, 20(3):191–208, 1985.

[3] C.L. Baker, R. Saxe, and J.B. Tenenbaum. Action understanding as inverse planning. Cognition, 2009.

[4] S. Carey. The Origin of Concepts. Oxford University Press, 2009.

[5] C. Chomsky. Stages in language development and reading exposure. Harvard Educational Review, 42(1):1–33, 1972.

[6] L. Feigenson, S. Dehaene, and E. Spelke. Core systems of number. Trends in Cognitive Sciences, 8(7):307–314, 2004.

[7] M.C. Frank, N.D. Goodman, and J.B. Tenenbaum. Using speakers' referential intentions to model early cross-situational word learning. Psychological Science, 2009.

[8] C.R. Gallistel. Commentary on Le Corre & Carey. Cognition, 105(2):439–445, 2007.

[9] N. Goodman. The new riddle of induction. In Philosophy of Science: An Historical Anthology, page 424, 2009.

[10] A. Gopnik, C. Glymour, D.M. Sobel, L.E. Schulz, T. Kushnir, and D. Danks. A theory of causal learning in children: Causal maps and Bayes nets. Psychological Review, 111(1):3–31, 2004.

[11] A. Gopnik and H.M. Wellman. The theory theory. In Mapping the Mind: Domain Specificity in Cognition and Culture. Cambridge University Press, 1994.

[12] I. Heim and A. Kratzer. Semantics in Generative Grammar. Wiley-Blackwell, 1998.

[13] H. Wimmer and J. Perner.
Beliefs about beliefs: Representation and constraining function of wrong beliefs in young children's understanding of deception. Cognition, 13(1):103–128, 1983.

[14] C. Kemp and J.B. Tenenbaum. The discovery of structural form. Proceedings of the National Academy of Sciences, 2008.

[15] K. Kording and J.B. Tenenbaum. Causal inference in sensorimotor integration. Advances in Neural Information Processing Systems 19, 2007.

[16] M. Le Corre and S. Carey. One, two, three, four, nothing more: An investigation of the conceptual sources of the verbal counting principles. Cognition, 105(2):395–438, 2007.

[17] J. Piaget and E. Duckworth. Genetic epistemology. American Behavioral Scientist, 13(3):459, 1970.

[18] J. Piaget and B. Inhelder. The Psychology of the Child. New York, 1969.

[19] J. Piaget. Part I: Cognitive development in children: Piaget development and learning. Journal of Research in Science Teaching, 2(3):176–186, 1964.

[20] S.T. Piantadosi, J.B. Tenenbaum, and N.D. Goodman. Bootstrapping in a language of thought: A formal model of numerical concept learning. Cognition, 2012.

[21] L.J. Rips, J. Asmuth, and A. Bloomfield. Giving the boot to the bootstrap: How not to learn the natural numbers. Cognition, 101(3):B51–B60, 2006.

[22] M. Steedman. The Syntactic Process, volume 131. MIT Press, 2000.

[23] A. Stoytchev. Robot Tool Behavior: A Developmental Approach to Autonomous Tool Use. PhD thesis, Georgia Institute of Technology, College of Computing, 2007.

[24] K. Wynn. Children's acquisition of the number words and the counting system. Cognitive Psychology, 24(2):220–251, 1992.