Understanding the Embodied Teacher: Nonverbal Cues for Sociable Robot Learning

by Matthew Roberts Berlin

S.M., Media Arts and Sciences, Massachusetts Institute of Technology, 2003
A.B., Computer Science, Harvard College, 2001

Submitted to the Program in Media Arts and Sciences, School of Architecture and Planning, in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Media Arts and Sciences at the Massachusetts Institute of Technology, February 2008.

© Massachusetts Institute of Technology 2008. All rights reserved.

Author: Program in Media Arts and Sciences, January 11, 2008

Certified by: Cynthia Breazeal, Associate Professor of Media Arts and Sciences, Program in Media Arts and Sciences, Thesis Supervisor

Accepted by: Deb Roy, Chair, Departmental Committee on Graduate Students, Program in Media Arts and Sciences

Understanding the Embodied Teacher: Nonverbal Cues for Sociable Robot Learning

by Matthew Roberts Berlin

Submitted to the Program in Media Arts and Sciences, School of Architecture and Planning, on January 11, 2008, in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Media Arts and Sciences.

Abstract

As robots enter the social environments of our workplaces and homes, it will be important for them to be able to learn from natural human teaching behavior. My research seeks to identify simple, non-verbal cues that human teachers naturally provide that are useful for directing the attention of robot learners. I conducted two novel studies that examined the use of embodied cues in human task learning and teaching behavior. These studies motivated the creation of a novel data-gathering system for capturing teaching and learning interactions at very high spatial and temporal resolutions. Through the studies, I observed a number of salient attention-direction cues, the most promising of which were visual perspective, action timing, and spatial scaffolding. In particular, this thesis argues that spatial scaffolding, in which teachers use their bodies to spatially structure the learning environment to direct the attention of the learner, is a highly valuable cue for robotic learning systems. I constructed a number of learning algorithms to evaluate the utility of the identified cues. I situated these learning algorithms within a large architecture for robot cognition, augmented with novel mechanisms for social attention and visual perspective taking. Finally, I evaluated the performance of these learning algorithms in comparison to human learning data, providing quantitative evidence for the utility of the identified cues. As a secondary contribution, this evaluation process supported the construction of a number of demonstrations of the humanoid robot Leonardo learning in novel ways from natural human teaching behavior.

Thesis Supervisor: Cynthia Breazeal
Title: Associate Professor of Media Arts and Sciences, Program in Media Arts and Sciences

Understanding the Embodied Teacher: Nonverbal Cues for Sociable Robot Learning

by Matthew Roberts Berlin

The following people served as readers for this thesis:

Thesis Reader: Deb Roy, Associate Professor of Media Arts and Sciences, Program in Media Arts and Sciences

Thesis Reader: Linda Smith, Professor of Psychology, Indiana University

Acknowledgments

I owe thanks to many, many people, without whom this thesis would not have been possible.
First and foremost, I want to thank my committee - Cynthia Breazeal, Deb Roy, and Linda Smith - for their patience with me and for the steady, illuminating guidance that they gave to this work. I am particularly grateful to Cynthia, who has given me a fantastic home here at the Media Lab for the past three-and-a-half years.

I have had the great fortune to be a member of two extraordinary research groups during my time at MIT. It was a true pleasure working with the Synthetic Characters: Bruce Blumberg, Daphna Buchsbaum, Rob Burke, Jennie Cochran, Marc Downie, Scott Eaton, Matt Grimes, Damian Isla, Yuri Ivanov, Mike Johnson, Aileen Kawabe, Derek Lyons, Ben Resner, and Bill Tomlinson (plus Ari Benbasat and Josh Lifton). I am forever indebted to Bruce, Bill, and Marc, who were my first mentors at the lab and who got me started down this crazy path. Likewise, it has been a tremendous pleasure working with the Personal Robots: Sigurður Örn Aðalgeirsson, Polly Guggenheim, Matt Hancher, Cory Kidd, Heather Knight, Jun Ki Lee, Jeff Lieberman, John McBean, Philipp Robbel, Mikey Siegel, Dan Stiehl, and Rob Tuscano. In particular, I want to send an emphatic shout-out to my co-conspirators on the Leo software team: Andrew 'Zoz' Brooks, Jesse Gray, Guy Hoffman, and Andrea Thomaz. To the Characters and the Robots: thank you so much for riding into battle with me time and time again, and for sharing your ferocious creativity with me.

Some special thanks are due to a few people who provided crucial assistance in the final stages of this work. Jeff Lieberman helped with the design and assembly of the motorized puzzle boxes. Mikey Siegel and Aileen Kawabe assisted with the construction of the illuminated table. And my hotshot UROP Crystal Chao contributed some essential pieces of software down the stretch.

I am, as always, particularly indebted to Jesse Gray, who has been my great friend and collaborator for many years now. This thesis would have been much less interesting without his help, and it certainly wouldn't have been nearly as much fun to put together.

This thesis is dedicated to my family. Mom, Dad, Alex - I love you all so much.

Contents

Abstract
1 Introduction
  1.1 Human Studies and Benchmark Tasks
  1.2 Robotic Learning Algorithms and Cognitive Architecture
  1.3 Interactive Learning Demonstrations
2 Embodied Emphasis Cues
  2.1 Background and Related Literature
  2.2 Perspective Taking Study
    2.2.1 Task Design and Protocol
    2.2.2 Results
  2.3 Emphasis Cues Study
    2.3.1 Task Design and Protocol
    2.3.2 Data-Gathering Overview
    2.3.3 Motion Capture: Tracking Heads and Hands
    2.3.4 The Light Table: Tracking Object Movement
    2.3.5 The Puzzle Boxes: Tracking Object Manipulation
    2.3.6 Video Recording
    2.3.7 Data Stream Management and Synchronization
    2.3.8 Study Execution and Discussion
    2.3.9 Data Analysis Tools and Pipeline
  2.4 Summary: Cues That Robotic Learners Need to Understand
3 Learning Algorithms and Architecture
  3.1 Self-as-Simulator Cognitive Architecture
  3.2 Perspective Taking Mechanisms
    3.2.1 Belief Modeling
    3.2.2 Perspective Taking and Belief Inference
  3.3 Social Attention Mechanisms
  3.4 Unified Social Activity Framework
    3.4.1 Action Representation: Process Models
    3.4.2 Exploration Representation: Interpretation Modes
  3.5 Learning Algorithms
    3.5.1 Task and Goal Learning
    3.5.2 Perspective Taking and Task Learning
    3.5.3 Constraint Learning from Embodied Cues
4 Evaluations and Demonstrations
  4.1 Visual Perspective Taking Demonstration and Benchmarks
  4.2 Emphasis Cues Benchmarks
  4.3 Emphasis Cues Demonstration
5 Conclusion
  5.1 Contributions
  5.2 Future Work
    5.2.1 Additional Embodied Emphasis Cues
    5.2.2 Cues for Regulating the Learning Interaction
    5.2.3 Robots as Teachers
    5.2.4 Dexterity and Mobility
Bibliography

List of Figures

2-1 The four tasks demonstrated to participants in the study.
2-2 Input domains consistent with the perspective taking (PT) vs. non-perspective taking (NPT) hypotheses.
2-3 Task instruction cards given to learners in Task 2.
2-4 One of the motorized puzzle boxes.
2-5 Graphical visualization within the object tracking toolkit.
2-6 Hats, gloves, and rings outfitted with trackable markers.
2-7 Initial assembly of the light table, with detail of the frosted acrylic top.
2-8 Light table component details.
2-9 Final setup of the light table.
2-10 Color segmentation process.
2-11 The block shape recognition system.
2-12 The block tracking system.
2-13 Puzzle boxes in the study environment, on top of the light table.
2-14 Two camcorders were mounted around the study area.
2-15 Data visualization environment.
2-16 Blocks were mapped into the coordinate frame of the motion capture system.
2-17 Change in distance to the body of the learner for block movements initiated by the teacher.
2-18 Predictive power of movements towards and away from the body of the learner.
2-19 Predictive power of teacher's movements following learner's movements.
3-1 The Leonardo robot and graphical simulator.
3-2 System architecture overview.
3-3 Architecture for modeling the human's beliefs.
3-4 Timeline following the progress of the robot's beliefs for one button.
3-5 Saliency of objects is computed from several environmental and social factors.
3-6 Leo in his workspace with a human partner.
3-7 Action selection algorithm for process models.
4-1 Leo was presented with similar learning tasks in a simulated environment.
4-2 The robot was presented with the recorded study data.
4-3 The robot learned live by interacting with a human teacher.
4-4 The gestural interface and figure planning algorithm.
4-5 Interaction sequence between the robot and a human teacher.
List of Tables

2.1 Differential rule acquisition for study participants in social vs. nonsocial conditions.
2.2 Embodied cues of positive emphasis.
2.3 Embodied cues of negative emphasis.
4.1 High-likelihood hypotheses entertained by the robot at the conclusion of benchmark task demonstrations.
4.2 Hypotheses selected by study participants following task demonstrations.
4.3 Performance of the human learners on study tasks 2a and 2b.
4.4 Learning performance of the robot observing benchmark task interactions.

Chapter 1

Introduction

How can we design robots that are competent, sensible learners? Learning will be an important part of bringing robots into the social, cooperative environments of our workplaces and homes. But social environments present a host of novel challenges for learning machines. Learning from real people means no carefully labeled data sets, no clear-cut reward signals, and no obvious indications of when to start learning and when to stop. Social environments are typically cluttered and dynamic, with other agents altering the world in potentially confusing or unpredictable ways. People working alongside robots may be largely unfamiliar with robotic technology and machine learning and, to make matters worse, will expect a robot to adapt and learn as quickly and "effortlessly" as a real human teammate.

To address these issues, and inspired by the way people and animals learn from others, researchers have begun to investigate various forms of social learning and interactive training techniques, such as imitation-based learning [Schaal, 1999], clicker training [Blumberg et al., 2002], learning by demonstration [Nicolescu and Matarić, 2003], and tutelage [Breazeal et al., 2004]. Most existing approaches to socially guided learning take an all-or-nothing approach to interpreting the human's behavior. The human's behavior or directives are either duplicated exactly, as in imitation or verbal instruction, or else used simply as feedback indicating the success or failure of the robot's actions, as in reinforcement learning and clicker training. I seek a more flexible approach spanning this range of interpretation. This thesis advances a model wherein the human's behavior is viewed as a communicative, dynamic constraint upon the robot's exploration of its environment.

This thesis is focused on the question of how the embodied presence of the teacher directs and constrains the learner's attention and bodily exploration. How does the body pose and activity of the teacher help to identify what matters in the interaction? By designing a robotic system with the right internal representations and processes, coupled in the right ways to the complexity of the human system, I aim to enable a human teacher to effectively guide the robot. My research seeks to identify simple, non-verbal cues that human teachers naturally provide that are useful for directing the attention of robot learners. The structure of social behavior and interaction engenders what I term "social filters": dynamic, embodied cues through which the teacher can guide the behavior of the robot by emphasizing and de-emphasizing objects in the environment. This thesis describes two novel studies that I conducted to examine the use of social filters in human task learning and teaching behavior.
Through these studies, I observed a number of salient attention-direction cues, the most promising of which were visual perspective, action timing, and spatial scaffolding. In particular, I argue that spatial scaffolding, in which teachers use their bodies to spatially structure the learning environment to direct the attention of the learner, is a highly valuable cue for robotic learning systems. In order to directly evaluate the utility of the identified cues, I constructed a number of learning algorithms. I situated these learning algorithms within a large architecture for robot cognition, augmented with novel mechanisms for social attention and visual perspective taking. I evaluated the performance of these learning algorithms in comparison to human learning data on benchmark tasks drawn from the studies, providing quantitative evidence for the utility of the identified cues. As a secondary contribution, this evaluation process supported the construction of a number of demonstrations of the humanoid robot Leonardo learning in novel ways from natural human teaching behavior. 1.1 Human Studies and Benchmark Tasks I conducted two studies that examined the use of embodied cues in human task learning and teaching behavior. The studies focused on embodied, non-verbal cues through which human teachers emphasize and de-emphasize objects in the learning environment. The first study examined the role of visual perspective taking in human learning. The second study was more open-ended, and was designed to capture observations of a number of dynamic, embodied cues including visual attention, hand gestures, direct object manipulations, and spatial/environmental scaffolding. This study motivated the creation of a novel data-gathering system for capturing teaching and learning interactions at very high spatial and temporal resolutions. While a broad set of attention-direction cues were observed in the second study, a surprising result was the prevalent and consistent use of spatial scaffolding by the human teachers. The term spatial scaffolding refers to the ways in which teachers use their bodies to spatially structure the learning environment to direct the attention of the learner. One of the main contributions of my work is the empirical demonstration of the utility of spatial scaffolding for robotic learning systems. In particular, this thesis focuses on a simple, reliable, component of spatial scaffolding: attention direction through object movements towards and away from the body of the learner. Both of the human studies involved learning tasks that were designed to be closely matched to the Leonardo robot's existing perceptual and inferential capabilities. This served two purposes. First, it meant that the recorded observations would be directly applicable to the robot's cognitive architecture. Second, it allowed for the creation of a benchmark suite, whereby the robot's performance on the benchmark learning tasks could be directly compared to human learning performance on similar or identical tasks. This comparison provided direct, quantitative evidence for the utility of the attention-direction cues identified through the studies. 1.2 Robotic Learning Algorithms and Cognitive Architecture In order to directly evaluate the utility of the cues identified in the studies, including visual perspective, action timing, and spatial scaffolding, I constructed a number of learning algorithms. 
I situated these learning algorithms within a large architecture for robot cognition, augmented with novel mechanisms for social attention and visual perspective taking. Designing an integrated learning system in this way supported not only the direct evaluation of the robot's performance on the study benchmark tasks, but also a number of demonstrations of interactive social learning. I believe that socially situated robots will need to be designed as socially cognitive learners that can infer the intention behind human instruction, even if the teacher's demonstrations are insufficient or ambiguous from a strict machine learning perspective. My approach to endowing machines with socially-cognitive learning abilities is inspired by leading psychological theories and recent neuroscientific evidence for how human brains might infer the mental states of others. Specifically, Simulation Theory holds that certain parts of the brain have dual use; they are used to not only generate behavior and mental states, but also to predict and infer the same in others [Davies and Stone, 1995, Barsalou et al., 2003, Sebanz et al., 2006]. My research introduces a set of novel software technologies that build upon a large architecture for robotic behavior generation and control developed by the Personal Robots Group and based on [Blumberg et al., 2002]. My research extends the mechanisms of attention, belief, goals, and action selection within this architecture. Taken together, these changes result in an integrated architecture we call the "self-as-simulator" behavior system, wherein the robot's cognitive functionality is organized around an assumption of shared social embodiment - the assumption that the teacher has a body and mind "like mine." This design allows the robot to simulate the cognitive processes of a human interaction partner using its own generative mechanisms. The architecture was designed for and evaluated on the 65 degree of freedom humanoid robot Leonardo and its graphical simulator. This thesis introduces and describes the technologies behind the self-as-simulator behavior system. These include: mechanisms for understanding the environment from the visual perspective of the teacher, social mechanisms of attention direction and emphasis, and a unified framework for social action recognition and behavior generation. The details of these mechanisms are presented in chapter 3. 1.3 Interactive Learning Demonstrations In addition to comparing the learning performance of our cognitive architecture against that of human learners on benchmark tasks drawn from the studies, this thesis presents a pair of demonstrations of the Leonardo robot making use of embodied cues to learn in novel ways from natural human teaching behavior. In the first demonstration, the Leonardo robot takes advantage of perspective taking to learn from ambiguous task demonstrations involving colorful foam blocks. The second demonstration features Leo making use of action timing and spatial scaffolding to learn secret constraints associated with a number of construction tasks, again involving foam blocks. Leonardo is the first robot to make use of visual perspective, action timing, and spatial scaffolding to learn from human teachers. This document proceeds as follows. Chapter 2 describes the two studies of human teaching and learning behavior, and the novel data-gathering and analysis system that was designed to support this work. 
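To make the perspective-taking idea concrete, the following sketch shows one way a learner could restrict its attention to the objects a teacher can plausibly see, given the teacher's head position, gaze direction, and a set of known occluders. It is illustrative only: the names, flat 2-D geometry, field-of-view angle, and occlusion test are my own simplifications, not the architecture's actual API. Chapter 3 describes how the real system implements perspective filtering within the robot's belief and attention mechanisms.

```python
import math
from dataclasses import dataclass

@dataclass
class Obj:
    name: str
    x: float
    y: float

def _angle_deg(v1, v2):
    """Angle between two 2-D vectors, in degrees."""
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    cos = dot / (math.hypot(*v1) * math.hypot(*v2))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos))))

def _segment_hits_circle(p, q, center, r):
    """True if the segment p-q passes within r of center (a crude occlusion test)."""
    px, py = p; qx, qy = q; cx, cy = center
    dx, dy = qx - px, qy - py
    if dx == 0 and dy == 0:
        return math.hypot(cx - px, cy - py) <= r
    t = max(0.0, min(1.0, ((cx - px) * dx + (cy - py) * dy) / (dx * dx + dy * dy)))
    return math.hypot(cx - (px + t * dx), cy - (py + t * dy)) <= r

def visible_to_teacher(obj, head, gaze, occluders, fov_deg=90.0):
    """An object counts as teacher-visible if it lies inside the teacher's view cone
    and no occluder blocks the line of sight from the teacher's head to the object."""
    if _angle_deg(gaze, (obj.x - head[0], obj.y - head[1])) > fov_deg / 2:
        return False
    return not any(_segment_hits_circle(head, (obj.x, obj.y), (ox, oy), r)
                   for ox, oy, r in occluders)

# Toy scene echoing Task 1 of the perspective taking study: a blue block sits behind a
# barrier that hides it from the teacher, so the learner excludes it before inducing a rule.
blocks = [Obj("red square", 0.2, 0.5), Obj("blue square", -0.4, 0.6), Obj("green square", 0.1, 0.9)]
barrier = [(-0.4, 0.45, 0.1)]                      # (x, y, radius) of the occluder
teacher_head, teacher_gaze = (0.0, 0.0), (0.0, 1.0)
teacher_visible = [b for b in blocks if visible_to_teacher(b, teacher_head, teacher_gaze, barrier)]
# -> the red and green squares, but not the hidden blue one
```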
I discuss the embodied attention-direction cues observed through the studies, and argue that visual perspective, action timing, and spatial scaffolding are particularly promising cues for robot learners. Chapter 3 introduces the learning algorithms that were developed to empirically evaluate these embodied cues. I also provide a detailed description of the robot's cognitive architecture and its novel mechanisms of social attention and visual perspective taking. Chapter 4 presents the robot's learning performance on the study benchmark tasks, providing quantitative evidence for the utility of the identified cues. In addition, I describe the interactive demonstrations of the Leonardo robot making use of dynamic, embodied cues to learn from natural teaching behavior. Finally, chapter 5 provides some concluding thoughts and discusses some plans and possibilities for future research. Chapter 2 Embodied Emphasis Cues In this chapter, I describe two studies that I conducted to examine the use of embodied cues in human task learning and teaching behavior. The studies focused on embodied, non-verbal cues through which human teachers emphasize and de-emphasize objects in the learning environment. The first study examined the role of visual perspective taking in human learning. The second study was more open-ended, and was designed to capture observations of a number of dynamic, embodied cues including visual attention, hand gestures, direct object manipulations, and spatial/environmental scaffolding. This study motivated the creation of a novel data-gathering system for capturing teaching and learning interactions at very high spatial and temporal resolutions. While a broad set of attention-direction cues were observed in the second study, a surprising result was the prevalent and consistent use of spatial scaffolding by the human teachers. In this chapter, I present quantitative results highlighting a simple, reliable, spatial cue: attention direction through object movements towards and away from the body of the learner. Both studies involved learning tasks that were designed to be closely matched to the Leonardo robot's existing perceptual and inferential capabilities. This served two purposes. First, it meant that the recorded observations could be used to inform and revise the robot's cognitive architecture. Second, it allowed me to create a benchmark suite so that the robot's performance on the benchmark learning tasks could be directly compared to human learning performance on similar or identical tasks. This comparison, described in the following chapters, provided direct, quantitative evidence for the utility of the promising attention-direction cues identified through the studies: visual perspective, action timing, and spatial scaffolding. 2.1 Background and Related Literature There has been a large, interesting body of work focusing on human gesture, especially communicative gestures closely related to speech [Cassell, 2000, Kendon, 1997, McNeill, 1992]. A number of gestural classification systems have been proposed [Kipp, 2004, McNeill, 2005, Nehaniv et al., 2005]. Others have focused on the role of gesture in language acquisition [Iverson and Goldin-Meadow, 2005], and on the use of gesture in teaching interactions between caregivers and children [Zukow-Goldring, 2004, Singer and GoldinMeadow, 2005]. 
In the computer vision community, there has been significant prior work on technical methods for tracking head pose [Morency et al., 2002] and for recognizing hand gestures such as pointing [Wilson and Bobick, 1999, Kahn et al., 1996]. Others have contributed work on using these cues as inputs to multi-modal interfaces [Bolt, 1980,Oviatt et al., 1997]. Such interfaces often specify fixed sets of gestures for controlling systems such as graphical expert systems [Kobsa et al., 1986], natural language systems [Neal et al., 1998], and even directable robotic assistants [Ghidary et al., 2002, Severinson-Eklundh et al., 2003, Fransen et al., 2007]. However, despite a large body of work on understanding eye gaze [Perrett and Emery, 1994, Langton, 2000], much less work has been done on using other embodied cues to infer a human's emphasis and de-emphasis in behaviorally realistic scenarios. It is important to stress that tracking the human's head pose, which is a directly observable feature, is quite a different thing from extrapolating from this feature and other related features to infer the human's object of emphasis, which we might think of as an unobservable, mental state. There has been a small amount of related work on mapping from head pose to a specific attentional focus, such as a driver's point of interest within a car [Pappu and Beardsley, 1998] or the listener that a particular speaker is addressing in a meeting scenario [Stiefelhagen, 2002]. One of the important contributions of my work is the analysis of spatial scaffolding cues in a human teaching and learning interaction, and the empirical demonstration of the utility of spatial scaffolding for robotic learning systems. Spatial scaffolding refers to the ways in which teachers use their bodies to spatially structure the learning environment to direct the attention of the learner. In particular, my work identifies a simple, reliable, component of spatial scaffolding: attention direction through object movements towards and away from the body of the learner. It is important to note that spatial scaffolding may be effective through social mechanisms or through nonsocial mechanisms implicit in the layout of the learning space. The learner may be directly interpreting the bodily gestures of the teacher, or it may be that the teacher is simply creating organization for the spatial direction and extent of the learner's attention. Space itself, as mediated by the bodily orientation of attention, has been shown to be a powerful factor in conceptual linkage and word learning in children [Smith et al., 2007]. It may well be the case that social and nonsocial mechanisms play a combined role in learning through spatial scaffolding. The first study presented in this chapter examined the role of visual perspective taking in human task learning. A number of related studies have examined aspects of human perspective taking, most notably in false-belief reasoning [Wimmer and Perner, 1983]. Researchers have examined the role of visual perspective taking in language-mediated collaboration between people working face-to-face [Keysar et al., 2000, Hanna et al., 2003] as well as in distributed teams [Jones and Hinds, 2002]. Others have studied the role of spatial perspective taking in demonstrations of physical assembly tasks [Martin et al., 2005]. 
The study described in the following section is unique in its focus on strictly non-verbal interaction and on the potentially critical role of the teacher's visual perspective in disambiguating task demonstrations.

2.2 Perspective Taking Study

Figure 2-1: The four tasks demonstrated to participants in the study (photos taken from the participant's perspective). Tasks 1 and 2 were demonstrated twice with blocks in different configurations. Tasks 3 and 4 were demonstrated only once.

The first study examined the role of visual perspective taking in human learning, and enabled the creation of benchmark tasks with which to evaluate the robot's perspective taking abilities and analyze the utility of visual perspective as an information channel for automated learning systems.

2.2.1 Task Design and Protocol

Study participants were asked to engage in four different learning tasks involving foam building blocks. I gathered data from 41 participants, divided into two groups. 20 participants observed demonstrations provided by a human teacher sitting opposite them (the social condition), while 21 participants were shown static images of the same demonstrations, with the teacher absent from the scene (the nonsocial condition). Participants were asked to show their understanding of the presented skill either by re-performing the skill on a novel set of blocks (in the social context) or by selecting the best matching image from a set of possible images (in the nonsocial context). Figure 2-1 illustrates sample demonstrations of each of the four tasks. The tasks were designed to be highly ambiguous, providing the opportunity to investigate how different types of perspective taking might be used to resolve these ambiguities. The subjects' demonstrated rules can be divided into three categories: perspective taking (PT) rules, non-perspective taking (NPT) rules, and rules that did not clearly support either hypothesis (Other).

Figure 2-2: Input domains consistent with the perspective taking (PT) vs. non-perspective taking (NPT) hypotheses. In visual perspective taking (left image), the student's attention is focused on just the blocks that the teacher can see, excluding the occluded block. In resource perspective taking (right image), attention is focused on just the blocks that are considered to be "the teacher's," excluding the other blocks.

Task 1 focused on visual perspective taking during the demonstration. Participants were shown two demonstrations with blocks in different configurations. In both demonstrations, the teacher attempted to fill all of the holes in the square blocks with the available pegs. Critically, in both demonstrations, a blue block lay within clear view of the participant but was occluded from the view of the teacher by a barrier. The hole of this blue block was never filled by the teacher. Thus, an appropriate (NPT) rule might be "fill all but blue," or "fill all but this one," but if the teacher's perspective is taken into account, a more parsimonious (PT) rule might be "fill all of the holes" (see Fig. 2-2).

Task 2 focused on resource perspective taking during the demonstration. Again, participants were shown two demonstrations with blocks in different configurations. Various manipulations were performed to encourage the idea that some of the blocks "belonged" to the teacher, whereas the others "belonged" to the participant, including spatial separation in the arrangement of the two sets of blocks.
In both demonstrations, the teacher placed markers on only "his" red and green blocks, ignoring his blue blocks and all of the participant's blocks. Because of the way that the blocks were arranged, however, the teacher's markers were only ever placed on triangular blocks, long, skinny, rectangular blocks, and bridge-shaped blocks, and marked all such blocks in the workspace. Thus, if the blocks' "ownership" is taken into account, a simple (PT) rule might be "mark only red and green blocks," but a more complicated (NPT) rule involving shape preference could account for the marking and non-marking of all of the blocks in the workspace (see Fig. 2-2).

Tasks 3 and 4 investigated whether or not visual perspective is factored into the understanding of task goals. In both tasks, participants were shown a single construction demonstration, and then were asked to construct "the same thing" using a similar set of blocks. Figure 2-1 shows the examples that were constructed by the teacher. In both tasks, the teacher assembled the examples from left to right. In Task 4, the teacher assembled the word "LiT" so that it read correctly from their own perspective. The question was, would the participants rotate the demonstration (the PT rule) so that it read correctly for themselves, or would they mirror the figure (the NPT rule) so that it looked exactly the same as the demonstration (and thus read backwards from their perspective)? Task 3, in which the teacher assembled a sequence of building-like forms, was essentially included as a control, to see if people would perform any such perspective flipping in a non-linguistic scenario.

2.2.2 Results

The results of the study are summarized in Table 2.1, where participant behavior was recorded and classified according to the exhibited rule. For every task, differences in rule choice between the social and nonsocial conditions were highly significant (chi-square, p < 0.001). In every task, the PT rule was the most popular rule in the social condition, while the NPT rule was the most popular rule in the nonsocial condition (note that, while many participants fell into the "Other" category for Task 1, there was very little rule agreement between these participants).

Table 2.1: Differential rule acquisition for study participants in social vs. nonsocial conditions. ***: p < 0.001

Task     Condition    PT Rule   NPT Rule   Other     p
Task 1   social          6         1        13      ***
         nonsocial       1        12         8
Task 2   social         16         0         4      ***
         nonsocial       7        12         2
Task 3   social         12         8         0      ***
         nonsocial       0        21         0
Task 4   social         14         6         0      ***
         nonsocial       0        21         0

These results strongly support the intuition that perspective taking plays an important role in human learning in socially situated contexts. However, a critical question remains: can a robot, using a simple learning algorithm and paying attention to the visual perspective of the teacher, exhibit the differences in rule choice observed in human learners? Is visual perspective a sufficiently informative cue to support automated learning from natural teaching behavior? In the next chapter, I describe a learning algorithm that I constructed to answer this question, along with the visual perspective taking mechanisms that I incorporated into the Leonardo robot's cognitive architecture. In chapter 4, I present the learning performance of the robot on benchmark tasks drawn from this study, providing evidence of the utility of visual perspective as an information channel for automated learning systems.

2.3 Emphasis Cues Study

The first study examined how teachers implicitly emphasize and de-emphasize objects through their limited visual perspectives.
In this section, I describe a second study that was more open-ended. This study was designed to capture a range of dynamic, embodied cues through which emphasis and de-emphasis are communicated by human teachers. These cues included visual attention, hand gestures, direct object manipulations, and spatial/environmental scaffolding. The study employed a novel sensory apparatus to capture embodied teaching and learning behavior with very high spatial and temporal resolution. This apparatus combined machine vision technologies including motion capture and object tracking with other technical interventions such as instrumented mechanical objects with which study participants interacted. The study had a number of goals. First, to produce a high-resolution observational data set that would be interesting to myself and to other researchers in its own right. Second, to provide data sufficient for identifying and implementing heuristics for tracking a number of important dynamic emphasis and de-emphasis cues that teachers provide in a realistic teaching domain. Finally, to produce a set of benchmark tasks with which to automatically evaluate the identified cues - by demonstrating an automated learning system, attending only to a handful of simple cues, learning "alongside of" the real human learners. This section proceeds as follows. I first introduce the teaching/learning tasks that study participants were asked to engage in. I then describe the design and construction of the sensory apparatus and data collection tools that were created to collect high-quality observations of the study tasks. Next, I describe the execution of the study itself, and discuss some qualitative observations of the participants' teaching and learning behavior. Finally, I describe the automated data analysis tools that were created to detect and measure the teachers' emphasis cues, and present the quantitative results generated by these tools. While a broad set of attention-direction cues were observed in this study, a surprising result was the prevalent and consistent use of spatial scaffolding by the human teachers. In particular, the quantitative results highlight a simple, reliable, spatial cue: attention direction through object movements towards and away from the body of the learner. 2.3.1 Task Design and Protocol A set of tasks was designed to examine how teachers emphasize and de-emphasize objects in a learning environment with their bodies, and how this emphasis and de-emphasis guides the exploration of a learner and ultimately the learning that occurs. The study centered around three types of tasks. The first two types of tasks involved colorful foam building blocks similar to the blocks used in the perspective taking study. The third type of task involved motorized "puzzle" boxes, shown in figure 2-4. Each study session included two of the first type of task, three of the second type, and three of the third type, for a total of eight tasks per study session. I gathered data from 72 individual participants, combined into 36 pairs. Each study session lasted approximately 45 minutes. Participants were combined into pairs, and were asked to engage in all eight of the study tasks. For each pair, one participant was randomly assigned to play the role of teacher and the other participant assigned the role of learner for the duration of the study. For all of the tasks, participants were asked not to talk, but were told that they could communicate in any way that they wanted other than speech. 
Tasks were presented in a randomized order. For all of the tasks, the teacher and learner stood on opposite sides of a tall table, with either the foam blocks or the puzzle boxes laid out between them on the tabletop. Participants were given general verbal instructions for each type of task by the experimenter, and then were handed printed instructions on note cards for each specific task.

The first type of task was a non-interactive demonstration task, designed to examine how teachers draw attention to a subset of objects in the environment. The teacher was asked to construct three successive demonstrations of a particular rule that involved placing markers on some of the foam blocks and not on others. The first rule (Task 1a) was to "put markers on all of the red and green blocks that aren't triangles." The second rule (Task 1b) was to "put markers on all the rectangular (and square) blocks that aren't red." After each of the teacher's three successive demonstrations, the learner was asked to write down their current best guess as to what the rule might be. To make the teacher's job more difficult, the teacher was only given six markers, which was not enough to successfully demonstrate the rule over all of the blocks on the table: the rule in Task 1a matched ten of the blocks on the table, while the rule in Task 1b matched nine. Thus, one of the interesting questions for this task was how the teacher would overcome this resource limitation to successfully communicate the rule. How would teachers differentiate demonstration blocks from distractor blocks (blocks that matched the rule but that necessarily would go unmarked)? Other questions of interest at the outset were how the teachers would arrange their demonstrations spatially and how they would transition from one demonstration in the sequence to the next.

The same 24 foam blocks were used for all of the tasks involving blocks. These 24 blocks were made up of four different colors - red, green, blue, and yellow - with six different shapes in each color: triangle, square, small circle, short rectangle, long rectangle, and a large, arch-shaped block. At the beginning of each task, the experimenter arranged the blocks in a default, initial configuration, shown in figure 2-9. This initial configuration was designed to seem random, and featured a fairly even distribution of the different colors and shapes into the four quadrants of the tabletop. The blocks were reset into this initial configuration between every task, with the exception of Tasks 1a and 1b, where the teachers were allowed to continue on from their demonstrations of the first rule into their demonstrations of the second rule.

The second type of task also involved teaching with foam blocks, but in a more interactive setting. These were secret constraint tasks, in which one person (the learner) knows what the task is but does not know the secret constraint, while the other person (the teacher) does not know what the task is but does know the constraint. Both people must therefore work together to successfully complete the task. For each of the tasks, the learner received instructions, shown in figure 2-3, for a figure to construct using the blocks. In Task 2a, the learner was instructed to construct a sailboat figure using at least 7 blocks; in Task 2b, a truck/train figure using at least 8 blocks; and in Task 2c, a smiley face figure using at least 6 blocks.
In Task 2a, the learner was instructed to construct a sailboat figure using at least 7 blocks; in Task 2b, a truck/train figure using at least 8 blocks; and in Task 2c, a smiley face figure using at least Construct using at least 7 blocks: Construct using at least 8 blocks: O 0 0 0 Construct using at least 6 blocks: Figure 2-3: Task instruction cards given to learners in Task 2. 6 blocks. The block number requirements were intended to prevent minimalist interpretations of the figures (and thus very quick solutions to the tasks). When put together with the secret constraints, the number requirements turned tasks 2a and 2b into modestly difficult Tangram-style spatial puzzles. The secret constraint handed to the teacher for Task 2a was that "the figure must be constructed using only blue and red blocks, and no other blocks." The secret constraint for Task 2b was that "the figure must include all of the triangular blocks, and none of the square blocks," and for Task 2c, that "the figure must be constructed only on the left half of the table (from your perspective)." At the end of each task, the learner was asked to write down what they thought the secret constraint might have been. These tasks were designed to examine how the teacher would guide the actions of the learner. Would they interrupt only when the learner made a mistake, or would they provide guidance preemptively? Would they remove mistakenly placed blocks or possibly provide correct replacements? Would they direct the learner towards specific blocks gesturally (by pointing or tapping), or by handing them specific blocks to use, or might they provide assistance by organizing the blocks spatially for the learner? The interactive setting of these tasks provokes a rich range of questions for analysis, which I will return to Figure 2-4: One of the motorized puzzle boxes. Left: box closed with blue status light illuminated. Right: box open with orange status light illuminated. later in the section. The third type of task followed a similar secret constraint task setup, but applied to the domain of sequence learning. Participants interacted with a pair of mechanical puzzle boxes, one with blue controls and one with red controls (as in figure 2-4). The puzzle boxes were designed as follows. Each box has three squishy, silicone controls: a button, a left-toright slider, and a left-to-right switch. The controls were designed to be easy to manipulate for the Leonardo robot as well as for a human. In addition to the controls, each box has five colored status lights, and a motorized lid that opens and closes based on changes in box state. Additionally, changes in box state can cause songs and other noises to play through a nearby speaker. Since the boxes are controlled by a computer, the experimenter can design arbitrary mappings between control manipulations, status light changes, box openings and closings, and auditory events (a similar experimental setup, used for studying causal learning in children, is described in [Gopnik et al., 2001] and [Schulz and Gopnik, 2004]). In the study setup, both boxes were arranged on the tabletop so that the controls faced the learner (see figure 2-13). For these tasks, the learner was instructed that their job was to discover the specific sequence of manipulations of the box controls that would cause a song to play (in this case, a highly enthusiastic, game show-style song). 
The teacher was instructed to help the learner accomplish this task using the secret hints that they would be provided with - additional information that they could use to guide the actions of the learner. For Task 3a, the hint provided partial sequence information: "When the orange light goes on: first the red slider and then the red switch must be flipped. - When both lights turn green: the slider on the blue box must be flipped." For Task 3b, the hint provided some information about the box controls: "The blue slider and the blue button and the red switch are all bad. - The other controls are good." Finally, for Task 3c, the hint provided information about the status lights: "The lights on the blue box are distractors. The lights on the red box are helpful." For these box puzzle tasks, I was looking for: how the teacher might direct the learner towards particular controls gesturally or through direct manipulation, how they might direct the learner's attention towards important state changes, how they might guide the learner away from unhelpful controls and states, and so on. Now that I have described the tasks that study participants were asked to perform, I move on to a discussion of the sensory apparatus and data-gathering tools that were created to record their behavior. 2.3.2 Data-Gathering Overview In order to record high-resolution data about the study interactions, I developed a datagathering system which incorporated multiple, synchronized streams of information about the study participants and their environment. For all of the tasks, I tracked the positions and orientations of the heads and hands of both participants, recorded video of both participants, and tracked all of the objects with which the participants interacted. For Tasks 1 and 2, I tracked the positions and orientations of all of the foam blocks. For Task 3, I recorded the states of the lights and controls of both puzzle boxes. The data gathering system was designed to satisfy a number of goals. First, it was important to minimize the burden that the sensory system might impose on study participants, and thus the effect that it might have on the naturalness of their behavior. Second, it was important for the recording system to be able to be controlled by a single experimenter. Finally, it was important for the recording system to automatically digitize and synchronize all of the sensory streams on the fly, to minimize any time-consuming manual processing of the data that might be required after the conclusion of the study. Synchronization of the various sensory streams enabled the simultaneous playback of all of the recorded data for the purposes of visualization and analysis at the end of the study. The data gathering system consisted of a collection of specialized modules which managed the different sensory streams, and a central module which kept track of the study state and which was responsible for the synchronization of the other modules. In order to track the heads and hands of the participants, a motion capture system was used in combination with customized tracking and recording software. To track the foam blocks in Tasks 1 and 2, a machine vision system was created and embedded within a special illuminated table that was constructed for the study. Video digitization and compression software was used to record from two camcorders mounted in the study environment. Finally, an additional module recorded the states of the two puzzle boxes used in Task 3. 
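To illustrate the kind of alignment this synchronization requires, here is a minimal, hypothetical sketch of time-stamp-based stream synchronization. The class and stream names are placeholders of my own, not the actual recording modules' interfaces; the point is simply that each module records (timestamp, sample) pairs at its own rate, and playback or analysis queries every stream for its most recent sample at a common clock time.

```python
import bisect

class RecordedStream:
    """One time-stamped stream from a recording module (motion capture, table camera,
    puzzle-box state, video frame indices, ...)."""
    def __init__(self, name):
        self.name = name
        self.times = []      # monotonically increasing timestamps, in seconds
        self.samples = []

    def append(self, t, sample):
        self.times.append(t)
        self.samples.append(sample)

    def latest_at(self, t):
        """Most recent sample recorded at or before playback time t (None if none yet)."""
        i = bisect.bisect_right(self.times, t)
        return self.samples[i - 1] if i > 0 else None

def synchronized_frame(streams, t):
    """One aligned 'frame' across all streams, for playback, visualization, or analysis."""
    return {s.name: s.latest_at(t) for s in streams}

# Toy usage: a 100 Hz motion-capture stream and a roughly 17 Hz block-tracking stream.
mocap, blocks = RecordedStream("mocap"), RecordedStream("blocks")
for k in range(200):
    mocap.append(k * 0.01, {"teacher_hand": (0.1 * k, 0.0, 1.1)})
for k in range(30):
    blocks.append(k * 0.06, {"red_square": (0.5, 0.2)})
frame = synchronized_frame([mocap, blocks], t=1.0)
```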
In total, the system managed data from 13 cameras and 6 different streams of information. The study took place in the space in front of the Leonardo robot, and made use of a number of the robot's cameras as well as the robot's networking infrastructure. Communication between the different recording modules was accomplished using a part of the robot's software architecture called IRCP (the Intra-Robot Communication Protocol). The modules communicated with each other over a dedicated gigabit network, allowing for high-bandwidth data transfer. This reuse of the robot's environment and infrastructure was for more than just the sake of convenience. It made possible a rapid transition from having the robot learn from sensory data recorded during the study interactions to having the robot learn live from highly similar sensory data generated by people interacting directly with the robot. In the following sections, I describe the sensory modules and physical materials which were the essential parts of the data gathering system.

2.3.3 Motion Capture: Tracking Heads and Hands

In order to accurately track and record the head and hand movements of the study participants, I employed a Vicon motion capture system along with customized tracking and recording software. Our Vicon system uses ten cameras with rapidly-strobing red light emitting diodes to track small retroreflective spheres that can be attached to clothing and other materials. The positions of these trackable markers can be determined with very high accuracy within the volume defined by the cameras - often to within a few millimeters of error at a tracking rate of 100-120 Hz. The most common use for such systems is to record human motion data for 3D animation used in films and video games. However, the Vicon system is designed to track a single human wearing a full-body motion capture suit, and thus its software is not well suited for tracking multiple independently moving objects. The software is very sensitive to the presence of extraneous markers and to object occlusion. All tracked objects are required to remain in the environment at all times; if objects are removed, the software's performance becomes slow and unreliable. These limitations can lead to frequent crashes, which can jeopardize data gathering from human subjects.

To address these issues, I developed a software toolkit for tracking rigid objects using the raw data provided by the Vicon system about individual marker positions. The toolkit identifies objects in the scene by searching for matches to tracking templates which specify the pairwise distances between markers mounted to the objects. The markers can be attached to the objects in any arrangement, with more distinguishable configurations leading to greater tracking reliability and lesser sensitivity to sensor noise and occlusion. The toolkit supports the 3D, graphical visualization of live or recorded tracking data (see Figure 2-5) and allows for rapid, on-the-fly calibration of tracking templates as well as object bounding boxes and forward vectors. The toolkit is robust to missing objects, partially- and fully-occluded objects, intermittent tracking of individual markers, and the presence of extraneous markers and objects.

Figure 2-5: Graphical visualization within the object tracking toolkit, with two toy objects and a tracked head and hand with calibrated forward vectors.
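The template-matching idea can be sketched as follows. This is a deliberately brute-force illustration assuming small templates; the function name and tolerance are mine, and the real toolkit presumably searches far more efficiently. The useful property is that extraneous markers are simply skipped, and a fully occluded object yields no match for that frame rather than an error.

```python
import itertools
import math

def find_template(template, raw_markers, tol=0.005):
    """Search for an assignment of unlabeled raw markers to the template's markers such
    that every pairwise distance matches the template within tol (meters). Returns the
    tuple of raw-marker indices, or None if the object is not visible in this frame."""
    n = len(template)
    ref = {(i, j): math.dist(template[i], template[j])
           for i in range(n) for j in range(i + 1, n)}
    for combo in itertools.permutations(range(len(raw_markers)), n):
        if all(abs(math.dist(raw_markers[combo[i]], raw_markers[combo[j]]) - d) <= tol
               for (i, j), d in ref.items()):
            return combo
    return None

# Toy usage: a three-marker "glove" template located among the raw markers of one frame,
# despite an extraneous marker belonging to some other object.
glove_template = [(0.0, 0.0, 0.0), (0.05, 0.0, 0.0), (0.0, 0.08, 0.0)]
raw_markers = [(1.0, 1.0, 0.5), (1.0, 1.08, 0.5), (0.7, 0.2, 0.1), (1.05, 1.0, 0.5)]
assignment = find_template(glove_template, raw_markers)   # -> (0, 3, 1)
```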
For the data-gathering setup, I needed to attach reflective markers in rigid configurations to the heads and hands of the study participants. To this end, I constructed a set of large, wooden "buttons" that could be sewn onto various articles of clothing. The buttons were constructed by cutting 4-inch square pieces from a thin sheet of plywood, with four small holes per square. For the heads, I purchased two baseball caps with elastic headbands. The caps featured two different MIT logos, which were selected as a neutral affiliation that most of the study participants had in common (as opposed to, say, the logo for a professional sports team, which could be quite divisive). A single wooden button was sewn onto each hat. Then, the buttons were covered with adhesive-backed Velcro, and the reflective Vicon markers were attached to the Velcro in unique configurations (see figure 2-6). For the hands, I purchased two sets of spandex bicycle-racing gloves with adjustable wrist straps. The gloves were chosen because they were very lightweight, and left the fingers of the study participants free. Buttons were sewn onto the backs of each glove, and again the buttons were covered in Velcro and outfitted with unique arrangements of reflective markers. Figure 2-6: Hats, gloves, and rings outfitted with trackable markers. Additionally, to track the index fingers of the participants, I constructed four small rings. Each ring consisted of a very small reflective marker sewn onto a thin strip of double-sided velcro. By rolling the strips of velcro more or less tightly, the rings could be sized to fit any finger. The baseball caps were worn backwards by the study participants so that their faces would not be occluded from each other or from the cameras. The index finger rings were worn between the first and second knuckle from the tip of the finger. The sizing of the trackable clothing seemed to work out very well: most study participants reported a comfortable fit for the apparel. While the motion capture system was very well suited for tracking the positions and orientations of the participants' heads and hands, it was ruled out as a method for tracking the foam blocks in Tasks 1 and 2. The blocks were too small, and too numerous, to attach unique marker configurations to each block. Further, the density and proximity of the blocks on the tabletop would have caused problems for the template finding algorithm, as well as for the underlying dot finding algorithm used by the Vicon system. Finally, the markers on the blocks would have been frequently occluded by the hands of the participants, and often at the most critical of moments: when the participant was interacting with the block! In the following section, I discuss a better solution for tracking the blocks: a specially constructed, illuminated table with an embedded machine vision system. 2.3.4 The Light Table: Tracking Object Movement In order to accurately measure the positions and orientations of the colorful foam blocks, a special illuminated table was constructed which tracked blocks on the table's surface. A camera mounted underneath the table looked up at the blocks through the transparent tabletop. The blocks were tracked using a machine vision system which combined color segmentation, shape recognition, and object tracking, as described later in this section. Figure 2-7: Initial assembly of the light table, with detail of the frosted acrylic top. 
The basic table consisted of four cylindrical metal legs and a clear acrylic tabletop (see figure 2-7). The tabletop measured two feet by four feet, and was one-half of an inch thick. Acrylic was chosen instead of glass because it is much lighter and harder to break, and thus required a much less extensive support structure under the table, resulting in a clearer view for the camera. The legs were adjustable in length, and were set to make the tabletop 38 inches high, a good height for people standing on either side of the table to interact with objects on the tabletop. The legs were attached to the four corners of the table using machine screws set into the bottom surface of the tabletop, leaving the top surface completely smooth.

Figure 2-8: Light table component details. Top left: lights and camera mounted underneath the table. Top right: screens for diffusing the lights and preventing reflective glare. Bottom left: attachment of skirt and screen to table. Bottom right: clamp for holding the lights in place.

After the table was constructed, the acrylic top was frosted using a rotary sander and fine-grained sandpaper. Frosting was both for aesthetic purposes as well as for the benefit of the machine vision system. Frosting turned the tabletop into a thin, diffusive surface, meaning that objects pressed against the tabletop could be seen very clearly from below, but would become more and more diffuse as their distance up from the table surface increased. Thus, the blocks could be seen in sharp detail by the camera underneath the table, but the distracting effects of other objects such as hands, clothing, and external light sources were somewhat mitigated.

A skirt was constructed for the table using a long piece of white canvas cloth. The skirt was attached to the table perimeter using adhesive-backed Velcro, allowing for easy access to the electronics and other materials underneath the table, as shown in figure 2-8. White canvas was selected to support the even diffusion of light underneath the table.

Significant care was taken to ensure a consistent, optimized lighting environment for the camera, and to ensure a consistent geometrical relationship between the camera, the lights, and the table surface. Four lights were installed underneath the table, each attached to a different table leg using a strong clamp and zip tie. The bulbs used were 75-watt-equivalent compact fluorescent bulbs. To optimize the color separation of the foam blocks, it was determined that the best setup was to use a mix of two "warm white" (2700K color temperature) and two "cool white" (4100K) bulbs, with similarly colored bulbs positioned diagonally across from each other. Since the tabletop was somewhat reflective, the bright lights introduced four "hot" spots where blocks placed directly above the lights were invisible to the camera. To solve this problem, two large screens were constructed out of vellum paper to diffuse the lights. The screens were mounted at an angle, sloping down towards the camera from high up on the table legs, and attached to the inner side of the skirt. With the screens in place, the table surface was illuminated brightly and evenly. Other light sources that impinged upon the camera, such as the overhead lights directly above the table, were turned off or removed. Care was also taken in the positioning of the Vicon cameras so that the LEDs mounted on those cameras did not shine directly into the camera under the table.
Images of the table surface were provided by a Videre Design DCAM firewire camera. While this is a stereo camera system, images from only one of the cameras were used to track the blocks. The camera was selected because it provided good image quality, and had a field-of-view which was well matched to the dimensions of the table. Camera images were captured by a Linux machine running frame grabbing software, and streamed uncompressed across the gigabit network. This setup supported both live processing as well as recording of the camera images. Images were provided at a resolution of 320 by 240 pixels, at approximately 17Hz. During the study, camera images were time stamped and saved directly to disk as sequences of minimally-compressed JPEG images.

Figure 2-9: Final setup of the light table, with overhead view of the initial blocks configuration and detail of the rope bumpers on the foam blocks.

The final setup of the table and blocks is shown in figure 2-9. A few additional details should be noted. While the camera could see most of the table surface, it could not see blocks placed directly above the table legs or on the far edges of the table. So, thin borders were drawn on either side of the table using black electrical tape to delineate the preferred workspace for the blocks. When participants were first introduced to the blocks, they were told that they could do whatever they wanted with the blocks - pick them up, pass them around, put them on their head - but that when they put the blocks down, they were asked to place them flat (as opposed to stacked up) and between the two black lines. The participants were given no other instructions or constraints about the blocks.

As can be seen in figure 2-9, rope bumpers were constructed and placed around the perimeter of each block, midway down the sides. The bumpers were constructed using clothesline, and attached to the blocks using dressmaker's pins. The bumpers assisted the machine vision system in two ways. First, they ensured a minimum spacing between adjacent blocks, reducing the likelihood that two blocks of the same color, placed next to each other, would be perceived as one large block of that color. Second, the bumpers introduced a subtle affordance cue about which block face should lie on the tabletop, since the blocks would stick up at an odd angle if placed down on one of the faces to which the bumper was attached. This simplified the job of the block shape recognition algorithm. Additionally, reflective markers were mounted in each corner of the table so that the position and orientation of the table itself could be tracked by the motion capture system.

Figure 2-10: Color segmentation converted each incoming image into four separate color probability images, one for each color of block. These probability images were thresholded to create binary color masks.

I now turn to the machine vision system that was created for tracking the blocks. Camera images of the table surface were processed in a number of stages. First, color segmentation was used to identify pixels that were associated with the red, green, blue, and yellow blocks. Next, a blob finding algorithm identified the locations of possible blocks within the segmented images. Then, a shape recognition system classified each blob as one of the six possible block shapes. Finally, an object tracking algorithm updated the positions and orientations of each block using these new observations in conjunction with historical information about each block.
Color models were constructed for each of the four block colors in the study. An interface was developed so that color models could be built interactively by the user, by clicking directly on live or recorded images from the camera. Pixels selected by the user for a given color were collected into a sample set and projected into the hue-saturation-value (HSV) color space. The value component for each sample was discarded, resulting in a sample set of hue-saturation pairs. Next, a Gaussian mixture model was fitted to these samples using expectation maximization (EM). To increase computational efficiency, these mixture models were then discretized into two-dimensional histograms. Thus, the color calibration process resulted in the creation of four hue-saturation histograms, one for each color.

Figure 2-11: The block shape recognition system produced an estimate of the shape, orientation, and position of each blob of color in the image.

During the color segmentation process, each incoming image was converted to the HSV color space. Then, histogram back-projection was performed to create a color probability image for each of the four color models. These color probability images were then thresholded to create binary images (color masks), where each pixel in the image indicated whether or not the corresponding pixel in the original image matched the given color model.

Each color mask (thresholded color probability image) was then handed to a blob finding algorithm. The blob finder performed a depth-first search over the color mask to identify every contiguous region of the given color in the image. All blobs smaller than a given minimum size threshold were discarded, and the remaining blobs were handed to the shape recognition system.

The shape recognition system was responsible for classifying each blob as one of the six possible block shapes, and also for estimating the position and orientation of each blob given its shape classification. The system first calculated a number of important, differentiating features for each blob, such as its mass, length-to-width ratio, symmetry, and so on. The system then used a classification tree operating on these geometrical features to assign a shape classification to each blob, or to reject the blob as not being a good match for any of the possible shapes.

Figure 2-12: Block tracking. In the middle image, the shape recognition system is confused because two of the green blocks have been blobbed together, producing an incorrect classification. The block tracking system (right image) corrects this mistake using historical information.

Finally, an object tracking algorithm used the classified blobs to update the tracked positions and orientations of each block using the new observations in conjunction with historical information about the blocks. Since the shape recognition system typically produced very high-fidelity results, the object tracking algorithm could be implemented relatively simply. The algorithm made use of the fact that, for the most part, all 24 of the blocks would be on the table at all times, except for brief periods of movement when blocks might disappear from one part of the table and reappear on another part of the table. The tracking algorithm used a three-tier system whereby each tracked block could lay "claim" to one of the newly classified blobs. Conflicts were resolved based on which block was closest to the claimed blob.
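As a rough illustration of the segmentation and blob-finding stages just described, the sketch below substitutes OpenCV's standard histogram back-projection and connected-components routines for the custom GMM-derived color models and depth-first-search blob finder used in the study; the histogram size, threshold value, and minimum blob area shown here are illustrative assumptions, not the parameters that were actually used.

import cv2
import numpy as np

def build_hs_histogram(sample_pixels_bgr):
    """Build a 2D hue-saturation histogram from user-selected sample pixels."""
    samples = np.asarray(sample_pixels_bgr, dtype=np.uint8).reshape(-1, 1, 3)
    hsv = cv2.cvtColor(samples, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0, 1], None, [30, 32], [0, 180, 0, 256])
    cv2.normalize(hist, hist, 0, 255, cv2.NORM_MINMAX)
    return hist

def find_color_blobs(frame_bgr, hs_hist, threshold=50, min_area=30):
    """Back-project one color model onto a frame and return candidate block blobs."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    # Color probability image for this color model.
    prob = cv2.calcBackProject([hsv], [0, 1], hs_hist, [0, 180, 0, 256], scale=1)
    # Threshold into a binary color mask.
    _, mask = cv2.threshold(prob, threshold, 255, cv2.THRESH_BINARY)
    # Contiguous regions of the mask are candidate blocks; small blobs are discarded.
    n, labels, stats, centroids = cv2.connectedComponentsWithStats(mask)
    blobs = []
    for i in range(1, n):  # label 0 is the background
        if stats[i, cv2.CC_STAT_AREA] >= min_area:
            blobs.append({"centroid": tuple(centroids[i]),
                          "area": int(stats[i, cv2.CC_STAT_AREA]),
                          "mask": (labels == i)})
    return blobs

The resulting blobs would then be handed on to shape classification and tracking, as described above and in the remainder of this section.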
While blocks could temporarily lay claim to blobs whose shape did not match their own, their positions and orientations would only be updated when they were matched to blobs with the same color and shape. In the first tier, blocks could lay claim to "strong matches," nearby blobs whose color and shape matched their own. Next, unmatched blocks could lay claim to "loose matches," nearby blobs whose color matched their own but whose shape did not. These loose matches were only allowed to be claimed for a short period of time. Finally, unmatched blocks became "free agents" after a short period of time, and were allowed to claim any unclaimed blob on the table whose color and shape matched their own.

The object tracking algorithm worked very well, and was successful at cleaning up many of the mistakes that would occasionally arise from sensor noise, adjacent blocks of the same color being temporarily blobbed together, and other factors. Overall, I was very pleasantly surprised at the accuracy of the vision system in measuring the positions and orientations of the blocks over the course of the study.

2.3.5 The Puzzle Boxes: Tracking Object Manipulation

Figure 2-13: Puzzle boxes in the study environment, on top of the light table.

During study Task 3, the states of the lights and controls of both puzzle boxes were recorded. Since the boxes were computer-controlled, this recording module was quite easy to implement. No sensors were required beyond those already present in the box switches and buttons. On each update tick, the control software simply recorded a time stamp, along with the current states of all of the box controls, lights, and auditory events. Recording took place at approximately 60Hz. In order to track the positions and orientations of the two boxes relative to the participants' hands and heads, reflective markers were mounted on each box, so that they could be tracked by the motion capture system. The full setup of the puzzle boxes in the study environment is shown in figure 2-13.

2.3.6 Video Recording

Figure 2-14: Two camcorders were mounted around the study area.

In order to record video of the two participants, two Sony camcorders were mounted at the periphery of the study area (see figure 2-14). Both cameras were mounted high so as to have a relatively unobstructed view of the tabletop, with one camera angled towards the hands and face of the teacher and the other camera angled towards the learner. Additionally, the cameras were aimed so as to avoid looking directly at any of the Vicon cameras, whose strobing red LEDs caused undesirable streaks and artifacts in the camcorder image.

Instead of being recorded to tape, the camcorder video was digitized and compressed live during the study interactions, saving a considerable amount of work and tedium at the end of the study. This had the important additional benefit that the video could be automatically time stamped and synchronized with the other sensory streams. To process the camcorder video, I used the BTV Pro shareware application created by Ben Bird. The software was set up to record the camera data as MPEG-4 compressed video streams with a resolution of 640 by 480 pixels. In order to automatically start and stop this recording process, a wrapper module was created using our software codebase. The wrapper module received start and stop commands over the network from the central study module, which were then relayed to the BTV Pro application via the AppleScript scripting language.
A small but noticeable delay (about 1 second) was associated with starting and stopping the recording process. In order to accommodate this, the wrapper module sent receipts back to the central module with the actual recording start and stop times, enabling accurate synchronization of the video data.

2.3.7 Data Stream Management and Synchronization

A central recording module kept track of the study state and managed the synchronization of the other recording modules. The central module was designed to be controlled via a simple graphical user interface as well as via a wireless presentation remote, giving a single experimenter the ability to control the complete recording infrastructure. A simple display presented the status of the various recording modules: "idle," "recording," or "off-line." Using the remote, the experimenter could advance to the next task in the random task sequence, and also start and stop recording for each task. The central recording module generated a log of task start and stop times for each study session. The module was responsible for sending start and stop commands over the network to the other recording modules, along with information about which file names to use for the various data streams.

The video recording processes which controlled the two camcorders ran on two separate Macintosh tower computers - a G5 Power Mac and an Intel Mac Pro. The other recording modules, which recorded the motion capture data, the puzzle box data, and the stream of images from the light table camera, were all run on the same computer as the central module - a 4-processor, 3GHz Intel Mac Pro. Thus, with the exception of the video recording modules, which were synchronized as described in the previous section, the other modules could all be synchronized using the main computer's system clock, which was used to time stamp all of the remaining data streams.

2.3.8 Study Execution and Discussion

I now turn to a discussion of the execution of the study itself, and present some qualitative observations about the teaching and learning behavior exhibited by participants on the various study tasks. As mentioned above, data was gathered from 72 study participants, grouped into 36 pairs. In total, over 130 gigabytes of raw data was recorded, representing approximately 10 hours of observed task interaction. Over the course of the study, the data gathering framework performed very well and with minimal loss of data, meaning that none of the study interactions were dropped or discarded from the final data set. On two occasions, the data from one of the two camcorders was interrupted, but in both cases the connection was restored in under a minute. These were the only problems encountered with the recording systems.

My analysis focused on two of the study tasks: Tasks 2a and 2b, two of the secret-constraint tasks involving the foam blocks. These tasks were selected because they were the most vigorously interactive tasks in the study (and also, interestingly, the tasks that study participants seemed to enjoy the most). Since neither participant had enough information to complete the task on their own, these tasks required the direct engagement and cooperation of both participants. Correspondingly, I observed a rich range of dynamic, interactive behaviors during these tasks.
To identify the emphasis and de-emphasis cues provided by the teachers in these tasks, an important piece of "ground-truth" information was exploited: for these tasks, some of the blocks were "good," and others of the blocks were "bad." In order to successfully complete the task, the teacher needed to encourage the learner to use some of the blocks in the construction of the figure, and to steer clear of some of the other blocks. In Task 2a, the blue and red blocks were "good," while the green and yellow blocks were "bad." In Task 2b, the triangular blocks were good, the square blocks were bad, and the remaining blocks fell into a neutral "other" category.

To set the stage, I will first describe two pairs of study interactions before diving into a more detailed analysis of the observed cues. The first pair of interactions were for Task 2a, where the goal was to construct a sailboat figure using only red and blue blocks. In one recorded interaction (session 27), the teacher is very proactive, organizing the blocks almost completely before the learner begins to assemble the figure. The teacher clusters the yellow and green blocks on one side of the table and somewhat away from the learner. The learner initially reaches for a yellow triangle. The teacher shakes her head and reaches to take the yellow block back away from the learner, before continuing to organize the blocks. The learner proceeds to complete the task successfully. In another recorded interaction (session 7), the teacher's style is very different. Instead of arranging the blocks ahead of time, he waits for the learner to make a mistake, and then "fixes" the mistake by replacing the learner's block with one that fits the constraint. When the learner positions a green rectangle as part of the mast of the sailboat figure, the teacher quickly reaches in, pulls the block away, and replaces it with a red rectangle. Later, the teacher fixes a triangular part of the sail in a similar way, after which the learner completes the task successfully.

The second pair of interactions were for Task 2b, where the goal was to construct a truck/train figure using all of the triangular blocks, and none of the square blocks. In one interaction (session 2), the teacher provides some very direct structuring of the space, pulling the square blocks away from the learner and placing the triangular blocks in front of her. In contrast, in another interaction (session 21), the teacher almost entirely refrains from moving the blocks. She instead provides gestural feedback, tapping blocks and shaking her hand "no" when the learner moves an inadmissible block, and nodding her head when the learner moves an acceptable block.

As these descriptions suggest, I observed a wide range of embodied cues provided by the teachers in the interactions for these two tasks, as well as a range of different teaching styles.

Table 2.2: Cues of positive emphasis. Embodied cues provided by the teacher and directed toward good blocks.
Simple Hand and Head Cues
* tapping with the index finger
* touching with the index finger
* pointing
* framing with both hands of clustered good blocks
* targeting by gaze

Block Movement Cues
* block movement towards learner's body or hands
* block movement towards center of table
* addition of block to figure (often, via replacement of a bad block)
* placement of blocks along edge of table closest to learner
* clustering with other good blocks

Compound Cues
* head nodding accompanying pointing or hand contact with block
* head nodding following learner's pointing or hand contact with block
* shrugging gesture following learner's block movement - "I don't know / seems OK"
* "thumbs up" gesture following pointing or sequence of pointing gestures
* pointing back and forth between clustered good blocks and the learner

Emphasis Through Inaction
* observation of learner's actions, accompanied by lack of intervention
* passing over block in process of providing negative emphasis

Table 2.2 enumerates some of the cues of positive emphasis that were observed. These were cues provided by the teachers and directed towards good blocks, in the process of guiding the learners to interact with these blocks. Table 2.3 enumerates some of the negative cues that were observed, which tended to steer the learners away from the targeted blocks.

Positive cues included simple hand gestures such as tapping blocks, touching blocks, and pointing at blocks with the index finger. Teachers sometimes used both hands to frame the space occupied by single good blocks or collections of blocks to use. These cues were often accompanied by gaze targeting, or looking back and forth between the learner and the target blocks.

Table 2.3: Cues of negative emphasis. Embodied cues provided by the teacher and directed toward bad blocks.

Simple Hand and Head Cues
* covering blocks with the hands
* holding blocks fast with the fingers
* prolonged contact with blocks despite proximity of learner's hands
* interrupting learner's reaching action by blocking learner's hand
* interrupting learner's reaching action by touching or lightly slapping learner's hand

Block Movement Cues
* block movement away from learner's body or hands
* block movement away from center of table
* removal of block from figure (sometimes followed by replacement with a good block)
* placement of blocks along edge of table closest to teacher
* clustering with other bad blocks
* interrupting the learner's reaching action by grabbing away the target block

Compound Cues
* head shaking accompanying pointing or hand contact with block
* head shaking following learner's pointing or hand contact with block
* "thumbs down" gesture following pointing or sequence of pointing gestures
* vigorous horizontal "chop" gesture
* hand "wagging" gesture (with index finger or all fingers extended)
* large "X" symbol formed using both forearms

Emphasis Through Inaction
* passing over block in process of providing positive emphasis

Simple negative cues included covering up blocks with the hands, preventing visibility of and physical access to the blocks. Blocks were occasionally held fast by the teachers, so that they could not be used by the learners, or were kept in prolonged contact by the teachers despite the proximity of the learner's hands. Teachers would occasionally interrupt reaching motions directly by blocking the trajectory of the motion or even by touching or (rarely) lightly slapping the learner's hand.
A number of compound cues were observed, often involving the specification of a particular block via pointing or tapping, accompanied by an additional gestural cue suggesting the valence of the block. Such gestures included head nodding, the "thumbs up" gesture, and even shrugging, which often seemed to be interpreted as meaning "I'm not sure, but it seems OK." Teachers nodded in accompaniment to their own pointing gestures, and also in response to actions taken by the learners, including actions that seemed to be direct queries for information from the teacher. Correspondingly, a number of compound, negative cues were observed. These included head shaking, the "thumbs down" gesture, and side-to-side finger or hand wagging gestures. A number of teachers used a vigorous horizontal "chop" gesture to identify bad blocks (with the palm down, the hand starts out near the center of the body and then moves rapidly outward and down). Another negative gesture was a large "X" symbol formed using both forearms.

Inaction, in some contexts, was another important emphasis cue. When the teacher watched the learner's actions and did not intervene, this often seemed to be interpreted as positive feedback. In a similar vein, when the teacher's gestures passed over particular blocks on the way to provide negative or positive feedback about other blocks, the passed-over blocks could be seen as implicitly receiving some of the opposite feedback. For example, when the teacher left some blocks in place, while selecting other blocks and moving them away from the center of the table, the blocks left behind seemed to attain some positive status.

Another important set of cues related to block movement and the use of space. To positively emphasize blocks, teachers would move them towards the learner's body or hands, towards the center of the table, or align them along the edge of the table closest to the learner. Conversely, to negatively emphasize blocks, teachers would move them away from the learner, away from the center of the table, or line them up along the edge of the table closest to themselves. Teachers often devoted significant attention to clustering the blocks on the table, spatially grouping the bad blocks with other bad blocks and the good blocks with other good blocks. The learner's attention could then be directed towards or away from the resulting clusters using many of the gestural cues previously discussed. These spatial scaffolding cues were some of the most prevalent cues in the observed interactions. In particular, the teachers in the study, with very few exceptions, consistently used movements towards and away from the body of the learner to encourage the use of some of the blocks on the table, and discourage the use of other blocks.

Having observed this collection of embodied emphasis cues, my next step was to establish how reliable and consistent these cues were in the recorded data set, and most importantly, how useful these cues were for robotic learners. In the next section, I describe the tools I developed for automatically analyzing some of these cues and generating quantitative data about their reliability. In later chapters, I describe the development of an even more powerful tool: a robotic learning architecture for directly assessing the utility of these cues for automated learning systems.
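As a simple illustration of how a learner might read this spatial scaffolding cue, the sketch below classifies a single teacher-initiated block movement as positive or negative emphasis based on how much it changes the block's distance to the learner. The function names, the 3D point representation, and the 10 cm dead-band are illustrative assumptions only; the learning algorithms that actually consume these cues are described in chapter 3.

import math

def distance(p, q):
    """Euclidean distance between two 3D points given as (x, y, z) tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def interpret_block_movement(block_start, block_end, learner_position, dead_band=10.0):
    """Classify a teacher-initiated block movement as an emphasis cue.

    Movement toward the learner's body suggests a "good" block, movement away
    suggests a "bad" block; small changes are treated as uninformative.
    All distances are in centimeters.
    """
    change = distance(block_end, learner_position) - distance(block_start, learner_position)
    if change < -dead_band:
        return "positive emphasis"   # moved substantially toward the learner
    if change > dead_band:
        return "negative emphasis"   # moved substantially away from the learner
    return "no cue"

# Example: a block moved 30 cm closer to the learner reads as positive emphasis.
print(interpret_block_movement((50.0, 80.0, 0.0), (50.0, 50.0, 0.0), (50.0, 0.0, 0.0)))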
2.3.9 Data Analysis Tools and Pipeline

In this section, I describe the automated data analysis tools that were created to detect and measure the emphasis and de-emphasis cues that teachers provided in the study. I then present the quantitative results generated by these tools.

For all of the quantitative and evaluative data presented in this thesis, a cross-validation methodology was followed. All of the data analysis tools and learning algorithms were implemented and tested using a small set of study interactions, pulled from just 6 of the 36 study sessions. Reported data were generated by running these same tools and algorithms over the remaining 30 study sessions. While these sessions did not represent completely "blind" data, since I was present when their data was initially collected, I believe that this methodology was nevertheless valuable for minimizing the risk of overfitting by the systems presented in this thesis.

I developed a set of data visualization tools to assist with the playback and analysis of the study data. My data visualization environment is shown in figure 2-15. Using a simple graphical interface, the user could select a study session by number and choose a particular task to load in for analysis. A time scrubber allowed the user to control the position and speed of playback of the synchronized data streams. The two windows at the bottom of the screen displayed camcorder footage from the two different camera recording angles. In both shots, the teacher is on the left and the learner is on the right.

Figure 2-15: Data visualization environment.

The big blue window presented the motion capture data, with tracking information about the position and orientation of the table, heads, and hands. The window on the right of the screen showed the view from the under-table camera. In this view, the teacher is at the top of the screen and the learner is at the bottom. This environment could also be run in "batch mode" for automated traversal and analysis of the data set.

As mentioned previously, my quantitative analysis focused on two of the study tasks: Tasks 2a and 2b, two of the secret-constraint tasks involving the foam blocks. To analyze the behavioral data for these tasks, I developed an automated pipeline for extracting high-level events from the raw, recorded sensor data. This pipeline proceeded in a number of stages. In the first stage, the recorded data about the positions of the reflective markers was run through the rigid object tracking system, producing a trace of the positions and orientations of the heads, hands, table, and boxes (for more information on this system, see section 2.3.3, above). Also in this stage, the images from the under-table camera were run through the block tracking system, producing a trace of the positions and orientations of all of the foam blocks (see section 2.3.4, above). By recording the output of these tracking systems, I produced compact streams of information about the objects in the study environment, information that could then be randomly traversed and queried in later stages of processing.

Figure 2-16: The blocks were mapped into the coordinate frame of the motion capture system, allowing for 3D visualization and analysis of all of the tracked objects. On the right, the scene is shown from a point of view over the learner's shoulder. The learner's right hand, shown as orange dots, is hovering over the red triangular block.
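The compact, randomly traversable streams described above might look something like the following sketch: time-stamped samples stored in order and queried by timestamp with a binary search. The class name and sample format are illustrative assumptions rather than the actual data structures used in the pipeline.

import bisect

class Trace:
    """A time-indexed stream of samples (e.g., object poses) supporting random access."""

    def __init__(self):
        self._times = []    # sorted timestamps, in seconds
        self._samples = []  # sample recorded at each timestamp

    def append(self, timestamp, sample):
        """Append a sample; timestamps are assumed to arrive in increasing order."""
        self._times.append(timestamp)
        self._samples.append(sample)

    def query(self, timestamp):
        """Return the most recent sample at or before the given timestamp."""
        i = bisect.bisect_right(self._times, timestamp) - 1
        if i < 0:
            return None
        return self._samples[i]

# Example: a trace of 2D block positions sampled at roughly 17 Hz.
trace = Trace()
trace.append(0.00, (10.0, 20.0))
trace.append(0.06, (10.5, 20.0))
print(trace.query(0.03))  # -> (10.0, 20.0)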
In the next stage, the tracking information about the foam blocks was mapped into the coordinate system of the motion capture system, so that all of the tracked study objects could be analyzed in the same, three-dimensional frame of reference. To accomplish this, a correspondence was established between four pixel locations in the under-table camera image and four positions that were specified relative to the rigid arrangement of reflective markers attached to the table. The four locations that were selected were the endpoints of the black tape lines that delineated the block tracking region, since these spots could be easily identified from both below the table as well as from above. These four correspondence points were used to calculate a linear transformation for mapping between image coordinates and the spatial coordinates of the motion capture system. A simple interface was created which allowed the user to click on the correspondence points in the camera image stream. Luckily, the mounting of the under-table camera proved to be steady enough that the camera did not move perceptibly relative to the table over the course of the study, so this correspondence only needed to be specified once for the entire set of collected data.

With all of the study objects now in the same frame of reference, the next stage of processing used spatial and temporal relationships between the blocks and the bodies of the participants to extract a stream of potentially salient events that occurred during the interactions. These events included, among other things, block movements and hand-to-block contact events, which were important focal points for my analysis. My processing system recognized these events, and attempted to ascribe agency to each one (i.e., which agent - learner or teacher - was responsible for this event?). Finally, statistics were compiled looking at different features of these events, and assessing their relative utility at differentiating the "good" blocks from the "bad" blocks.

In order to recognize block contact events, a "grab" location for each hand was estimated by aggregating a number of examples of grasping by the given hand, and looking at the position of the grasped block relative to the tracked location of the hand (grasping typically occurred in the palm, whereas the reflective markers were mounted to the glove on the back of the hand). Subsequent block contact events were recognized by simply thresholding the distance between this "grab" location and the centroid of potential target blocks.

A detector for raw block movements was constructed on top of the block tracking system, and primarily looked at frame-to-frame changes in position for each block to classify motion. A small distance threshold was used to classify block motion rather robustly. To recognize temporally-extended block movements, this raw motion detector was combined with a simple temporal filter which filtered out very brief movements as well as brief moments of stillness within extended movements, thus smoothing out block motion trajectories. Agency was ascribed to each recognized block movement event by determining whose hand was closest to the given block throughout the duration of the movement. By running the interaction data through these event recognizers, I produced a stream of sequential event information corresponding to some of the high-level actions taken by the study participants during the tasks.
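The following sketch captures the spirit of this movement detector and agency assignment: per-frame motion flags are smoothed by bridging brief pauses and discarding very short movements, and each resulting movement event is attributed to whichever hand stayed closest to the block. The thresholds and function names are illustrative assumptions, not the values used in the actual pipeline, and positions are simplified to 2D table coordinates.

def detect_movements(positions, move_threshold=1.0, min_frames=3, max_gap=2):
    """Turn a per-frame list of (x, y) block positions into (start, end) movement events.

    A frame is "moving" if the block travelled more than move_threshold since the
    previous frame. Pauses of up to max_gap frames are bridged, and movements
    shorter than min_frames frames are discarded.
    """
    moving = [False] + [
        ((b[0] - a[0]) ** 2 + (b[1] - a[1]) ** 2) ** 0.5 > move_threshold
        for a, b in zip(positions, positions[1:])
    ]
    events, start, gap = [], None, 0
    for frame, is_moving in enumerate(moving):
        if is_moving:
            start = frame if start is None else start
            gap = 0
        elif start is not None:
            gap += 1
            if gap > max_gap:
                if frame - gap - start + 1 >= min_frames:
                    events.append((start, frame - gap))
                start, gap = None, 0
    if start is not None and len(moving) - gap - start >= min_frames:
        events.append((start, len(moving) - 1 - gap))
    return events

def ascribe_agency(event, block_positions, teacher_hand, learner_hand):
    """Assign a movement event to whichever hand was closer to the block on average.

    All three position lists are assumed to share the same frame indexing.
    """
    start, end = event
    def mean_dist(hand):
        return sum(
            ((block_positions[t][0] - hand[t][0]) ** 2 +
             (block_positions[t][1] - hand[t][1]) ** 2) ** 0.5
            for t in range(start, end + 1)
        ) / (end - start + 1)
    return "teacher" if mean_dist(teacher_hand) < mean_dist(learner_hand) else "learner"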
I then analyzed some interesting features of these events to assess their relative ability to differentiate good blocks from bad.

Figure 2-17: Change in distance to the body of the learner for block movements initiated by the teacher. Negative values represent movement towards the learner, while positive values represent movement away from the learner.

One of the most interesting features that I analyzed was movement towards and away from the bodies of the participants. The results of my analysis are summarized in figures 2-17 and 2-18. As can be seen in figure 2-17, the aggregate movement of good blocks by teachers is biased very substantially in the direction of the learners, while the aggregate movement of bad blocks by teachers is biased away from the learners. In fact, over the course of all of the 72 analyzed interactions, teachers differentiated the good and bad blocks by more than the length of a football field in terms of their movements relative to the bodies of the learners. Looking at individual movements, travel towards and away from the body of the learner was strongly correlated with whether or not the given block was good or bad.

Figure 2-18: Movements towards the body of the learner initiated by the teacher were predictive of good blocks. Movements away from the body of the learner were predictive of bad blocks. The differentiating power of these movements increased for more substantial changes in distance towards and away.

Movements towards the body of the learner were applied to good blocks 42% of the time, bad blocks 32% of the time, and other blocks 26% of the time. For movements that changed the distance towards the learner by more than 10cm, these differences were more pronounced: 70% of such movements were applied to good blocks, 25% to other blocks, and just 5% to bad blocks. For changes in distance of 20cm or greater, fully 83% of such movements were applied to good blocks versus 13% for other blocks and 5% for bad blocks. A similar pattern was seen for block movements away from the body of the learner, with larger changes in distance being strongly correlated with a block being bad, as shown in figure 2-18.

These results are very exciting for a number of reasons. First, I have used an entirely automated data processing system to produce quantitative results that match up well with the qualitative observations of what happened in the interactions between the study participants. This is an encouraging validation of my data-gathering technology and methodology. Second, I have identified an embodied cue which might be of significant value to a robotic system learning in this task domain.

Figure 2-19: Learners moving a block, followed by teachers moving the same block within 5 seconds, was highly indicative of the given block being bad.
The results suggest that such a robot, observing a block movement performed by a teacher, might be able to make a highly reliable guess as to whether the target block should or should not be used by measuring the direction and distance of the movement. Such a cue, which can be interpreted simply and reliably even within the context of a chaotic and fast-paced interaction, is exactly what I was looking for.

Another feature that I analyzed was the timing of the actions taken by the teacher relative to the actions of the learner. The results are summarized in figure 2-19. As can be seen, when the student moves a block, and then the teacher moves the same block within 5 seconds, the given block is good only 23% of the time, bad 68% of the time, and other 8% of the time. Thus, I have identified another highly reliable cue that a robot might be able to use to discover which blocks to steer clear of in this task domain.

2.4 Summary: Cues That Robotic Learners Need to Understand

In this chapter, I have presented the results of two studies highlighting the usefulness of embodied cues including visual perspective, action timing, and spatial emphasis to humans engaging in realistic teaching and learning tasks. These are clear, useful cues that real teachers provide and that real learners take advantage of. Robots should also pay attention to these cues in order to learn more naturally and effectively from humans in social environments. In the next chapter, I describe a pair of learning algorithms that I constructed to directly evaluate the utility of these cues, and describe the mechanisms of social attention and visual perspective taking that I incorporated into the Leonardo robot's cognitive architecture. In chapter 4, I present the learning performance of the robot on benchmark tasks drawn from the two studies, providing quantitative evidence for the utility of visual perspective, action timing, and spatial scaffolding as attention-direction cues for robotic learning systems.

Chapter 3

Learning Algorithms and Architecture

In this chapter, I describe a pair of learning algorithms that I constructed to evaluate the utility of the cues identified in the studies: visual perspective, action timing, and spatial scaffolding. I situated these learning algorithms within a large architecture for robot cognition, augmented with novel mechanisms for social attention and visual perspective taking. Designing an integrated learning system in this way supported not only the direct evaluation of the robot's performance on benchmark tasks drawn from the studies, but also a number of demonstrations of interactive social learning. In this chapter, I first provide an overview of the self-as-simulator cognitive architecture. I then describe the key components of this architecture, focusing on mechanisms for understanding the environment from the visual perspective of the teacher, social mechanisms of attention direction and enhancement, and a unified framework for social action recognition and behavior generation. Finally, I describe the algorithms that I developed within this architecture for learning tasks from human teaching behavior, and describe how these algorithms take advantage of visual perspective, action timing, and spatial scaffolding cues. The architecture, along with all of the demonstrations and evaluations described in this thesis, was designed to run on the 65-degree-of-freedom humanoid robot Leonardo and its graphical simulator, shown in figure 3-1.
Figure 3-1: The Leonardo robot and graphical simulator.

3.1 Self-as-Simulator Cognitive Architecture

My approach to endowing machines with socially-cognitive learning abilities is inspired by leading psychological theories and recent neuroscientific evidence for how human brains might infer the mental states of others and the role of imitation as a critical precursor. Specifically, Simulation Theory holds that certain parts of the brain have dual use; they are used to not only generate our own behavior and mental states, but also to predict and infer the same in others. To understand another person's mental process, we use our own similar brain structure to simulate the introceptive states of the other person [Davies and Stone, 1995, Gallese and Goldman, 1998, Barsalou et al., 2003]. For instance, Gallese and Goldman [Gallese and Goldman, 1998] propose that a class of neurons discovered in monkeys, labeled mirror neurons, are a possible neurological mechanism underlying both imitative abilities and Simulation Theory-type prediction of the behavior of others and their mental states. Further, Meltzoff and Decety [Meltzoff and Decety, 2003] posit that imitation is the critical link in the story that connects the function of mirror neurons to the development of mindreading. In addition, Barsalou [Barsalou et al., 2003] presents additional evidence from various social embodiment phenomena that when observing an action, people activate some part of their own representation of that action as well as other cognitive states that relate to that action.

Inspired by this theory, my simulation-theoretic approach and implementation enables a humanoid robot to monitor an adjacent human teacher by simulating his or her behavior within the robot's own generative mechanisms on the motor, goal-directed action, and perceptual-belief levels. This grounds the robot's information about the teacher in the robot's own systems, allowing it to make inferences about the human's likely beliefs in order to better understand the intention behind the teacher's demonstrations.

Figure 3-2: System architecture overview.

An overview of the robot's cognitive architecture, based on [Blumberg et al., 2002] and [Gray et al., 2005], is shown in Figure 3-2. The system diagram features two concentric loops highlighting the primary flow of information within the architecture. The outer loop is the robot's main behavior generation pipeline. Information from the environment is provided by the robot's sensors. Incoming sensory events are processed by the Perception System, which maintains a tree of classification functions known as the percept tree. Perceptual information extracted by the percept tree is then clustered into discrete object representations known as object beliefs by the robot's Belief System. Object beliefs encode the tracked perceptual histories of the discrete objects that the robot perceives to exist in its environment. The robot's object beliefs, in conjunction with internal state information, affect the decisions made by the robot's planning and action selection mechanisms. These action decisions result in commands that are processed by the robot's Motor System. The Motor System is a posegraph-based motor control system that combines procedural animation techniques with human-generated animation content to generate expressive movements of the robot's body.
The Motor System ultimately controls the motors which actuate the physical body of the robot, and also controls the virtual joints of the animated robot in the robot's 3D graphical simulation environment.

The inner loop in Figure 3-2 is not a loop at all, but rather two separate streams of information coming in from the robot's sensory-motor periphery and converging towards the central cognitive mechanisms of action selection and learning. These incoming streams highlight the dual use of the robot's cognitive mechanisms to not only generate the behavior of the robot, but also to recognize the behavior of human interaction partners and infer their mental states.

My implementation computationally models simulation-theoretic mechanisms throughout several systems within the robot's overall cognitive architecture. Within the motor system, mirror-neuron inspired mechanisms are used to map and represent perceived body positions of another into the robot's own joint space to conduct action recognition. Leo reuses his belief-construction systems, and adopts the visual perspective of the human, to predict the beliefs the human is likely to hold to be true given what he or she can visually observe. Finally, within the goal-directed behavior system, where schemas relate preconditions and actions with desired outcomes and are organized to represent hierarchical tasks, motor information is used along with perceptual and other contextual clues (i.e., task knowledge) to infer the human's goals and how he or she might be trying to achieve them (i.e., plan recognition).

In the following sections, I describe the most salient components of this architecture: the mechanisms of visual perspective taking and belief inference, the mechanisms of social attention, and the unified framework for social action recognition and behavior generation.

3.2 Perspective Taking Mechanisms

I turn now to the robot's visual perspective taking mechanisms. Others have identified that visual perspective taking coupled with spatial reasoning is critical for effective action recognition [Johnson and Demiris, 2005], understanding spatial semantics in speech [Roy et al., 2004], and human-robot collaboration on a shared task within a physical space [Trafton et al., 2005], and collaborative dialog systems have investigated the role of plan recognition in identifying and resolving misconceptions (see [Carberry, 2001] for a review). However, this is the first work to examine the role of perspective taking for introceptive states (e.g., beliefs and goals) in human-robot learning tasks.

3.2.1 Belief Modeling

In order to convey how the robot interprets the environment from the teacher's perspective, I must first describe how the robot understands the world from its own perspective. This section presents a technical description of two important components of the cognitive architecture: the Perception System and the Belief System. The Perception System is responsible for extracting perceptual features from raw sensory information, while the Belief System is responsible for integrating this information into discrete object representations. The Belief System represents an integrated approach to sensor fusion, object tracking and persistence, and short-term memory.

On every time step, the robot receives a set of sensory observations O = {o_1, o_2, ..., o_N} from its various sensory processes.
As an example, imagine that the robot receives information about buttons and their locations from an eye-mounted camera, and information about the button indicator lights from an overhead camera. On a particular time step, the robot might receive the observations O = {(red button at position (10,0,0)), (green button at (0,0,0)), (blue button at (-10,0,0)), (light at (10,0,0)), (light at (-10,0,0))}.

Information is extracted from these observations by the Perception System. The Perception System consists of a set of percepts P = {p_1, p_2, ..., p_K}, where each p ∈ P is a classification function defined such that

$p(o) = (m, c, d)$,    (3.1)

where m, c ∈ [0, 1] are match and confidence values and d is an optional derived feature value. For each observation o_i ∈ O, the Perception System produces a percept snapshot

$s_i = \{(p, m, c, d) \mid p \in P,\; p(o_i) = (m, c, d),\; m \cdot c > k\}$,    (3.2)

where k ∈ [0, 1] is a threshold value, typically 0.5.

Returning to the example, the robot might have four percepts relevant to the buttons and their states: a location percept which extracts the position information contained in the observations, a color percept, a button shape recognition percept, and a button light recognition percept. The Perception System would produce five percept snapshots corresponding to the five sensory observations, containing entries for relevant matching percepts.

These snapshots are then clustered into discrete object representations called beliefs by the Belief System. This clustering is typically based on the spatial relationships between the various observations, in conjunction with other metrics of similarity. The Belief System maintains a set of beliefs B, where each belief b ∈ B is a set mapping percepts to history functions: b = {(p_x, h_x), (p_y, h_y), ...}. For each (p, h) ∈ b, h is a history function defined such that

$h(t) = (m', c', d')$    (3.3)

represents the "remembered" evaluation for percept p at time t. History functions may be lossless, but they are often implemented using compression schemes such as low-pass filtering or logarithmic timescale memory structures.

A Belief System is fully described by the tuple (B, G, M, d, q, w, c), where

* B is the current set of beliefs,
* G is a generator function map, G : P → G, where each g ∈ G is a history generator function where g(m, c, d) = h is a history function as above,
* M is the belief merge function, where M(b_1, b_2) = b' represents the "merge" of the history information contained within b_1 and b_2,
* d = d_1, d_2, ..., d_L is a vector of belief distance functions, d_i : B × B → ℝ,
* q = q_1, q_2, ..., q_L is a vector of indicator functions where each element q_i denotes the applicability of d_i, q_i : B × B → {0, 1},
* w = w_1, w_2, ..., w_L is a vector of weights, w_i ∈ ℝ, and
* c = c_1, c_2, ..., c_J is a vector of culling functions, c_j : B → {0, 1}.

Using the above, I define the Belief Distance Function, D, and the Belief Culling Function, C:

$D(b_1, b_2) = \sum_{i=1}^{L} w_i\, q_i(b_1, b_2)\, d_i(b_1, b_2)$    (3.4)

$C(b) = \prod_{j=1}^{J} c_j(b)$    (3.5)

The Belief System manages three key processes: creating new beliefs from incoming percept snapshots, merging these new beliefs into existing beliefs, and culling stale beliefs. For the first of these processes, I define the function N, which creates a new belief b_i from a percept snapshot s_i:

$b_i = N(s_i) = \{(p, h) \mid (p, m, c, d) \in s_i,\; g = G(p),\; h = g(m, c, d)\}$    (3.6)

For the second process, the Belief System merges new beliefs into existing ones by clustering proximal beliefs, assumed to represent different observations of the same object.
This is accomplished via bottom-up, agglomerative clustering as follows. For a set of beliefs B:

1: while ∃ b_x, b_y ∈ B such that D(b_x, b_y) < thresh do
2:   find b_1, b_2 ∈ B such that D(b_1, b_2) is minimal
3:   B ← B ∪ {M(b_1, b_2)} \ {b_1, b_2}
4: end while

I label this process merge(B). Finally, the Belief System culls stale beliefs by removing all beliefs from the current set for which C(b) = 1. In summation, then, a complete Belief System update cycle proceeds as follows:

1: begin with current belief set B
2: receive percept snapshot set S from the Perception System
3: create incoming belief set B_I = {N(s_i) | s_i ∈ S}
4: merge: B ← merge(B ∪ B_I)
5: cull: B ← B \ {b | b ∈ B, C(b) = 1}

Returning again to the example, the Belief System might specify a number of relevant distance metrics, including a measure of Euclidean spatial distance along with a number of metrics based on symbolic feature similarity. For example, a symbolic metric might judge observations that are hand-shaped as distant from observations that are button-shaped, thus separating these observations into distinct beliefs even if they are collocated. For the example, the merge process would produce three beliefs from the original five observations: a red button in the ON state, a green button in the OFF state, and a blue button in the ON state.

The Belief System framework supports the implementation of a wide range of object tracking methods, including advanced tracking techniques such as Kalman filters [Kalman, 1960] and particle filters [Carpenter et al., 1999, Arulampalam et al., 2002]. The ability to specify multiple distance metrics allows sophisticated, general-purpose tracking methods such as these to operate side-by-side with hand-crafted rules which encode prior domain knowledge about object categories, dynamics and persistence.

3.2.2 Perspective Taking and Belief Inference

I now describe the integration of perspective taking with the mechanisms for belief modeling discussed above. Inferring the beliefs of the teacher allows the robot to build task models which capture the intent behind human demonstrations.

Figure 3-3: Architecture for modeling the human's beliefs re-uses the robot's own architecture for belief maintenance.

When demonstrating a task to be learned, it is important that the context within which that demonstration is performed be the same for the teacher as it is for the learner. However, in complex and dynamic environments, it is possible for the instructor's beliefs about the context surrounding the demonstration to diverge from those of the learner. For example, a visual occlusion could block the teacher's viewpoint of a region of a shared workspace (but not that of the learner) and consequently lead to ambiguous demonstrations where the teacher does not realize that the visual information of the scene differs between them. To address this issue, the robot must establish and maintain mutual beliefs with the human instructor about the shared context surrounding demonstrations.

The robot keeps track of its own beliefs about object state using its Belief System, described above. In order to model the beliefs of the human instructor as separate and potentially different from its own, the robot re-uses the mechanism of its own Belief System. These beliefs that represent the robot's model of the human's beliefs are in the same format as its own, but are maintained separately so the robot can compare differences between its beliefs and the human's beliefs.
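The sketch below illustrates this re-use in miniature: the same belief-update routine maintains both the robot's own belief set and its model of the human's beliefs, with observations filtered by a visibility test before they reach the human model. The observation format, the view-cone visibility test, and the function names are illustrative assumptions rather than the architecture's actual interfaces.

import math

def visible_to_human(obj_position, human_position, human_gaze_dir, fov_degrees=90.0):
    """Rough visibility test: is the object inside the human's view cone?"""
    to_obj = [o - h for o, h in zip(obj_position, human_position)]
    norm = math.sqrt(sum(c * c for c in to_obj)) or 1e-9
    gaze_norm = math.sqrt(sum(c * c for c in human_gaze_dir)) or 1e-9
    cos_angle = sum(a * b for a, b in zip(to_obj, human_gaze_dir)) / (norm * gaze_norm)
    return cos_angle >= math.cos(math.radians(fov_degrees / 2.0))

def update_beliefs(beliefs, observations):
    """Minimal belief update: each observation overwrites the belief about its object."""
    for obj_id, properties in observations:
        beliefs[obj_id] = dict(beliefs.get(obj_id, {}), **properties)
    return beliefs

def update_self_and_human_models(robot_beliefs, human_model, observations,
                                 human_position, human_gaze_dir):
    """Update the robot's own beliefs with everything it senses, but update the
    model of the human's beliefs only with observations the human could see."""
    update_beliefs(robot_beliefs, observations)
    visible = [(obj_id, props) for obj_id, props in observations
               if visible_to_human(props["position"], human_position, human_gaze_dir)]
    update_beliefs(human_model, visible)
    return robot_beliefs, human_model

# Example: a button behind the human stays out of the human-belief model.
robot_b, human_b = {}, {}
obs = [("button-1", {"position": (1.0, 0.0, 0.0), "state": "on"}),
       ("button-2", {"position": (-1.0, 0.0, 0.0), "state": "off"})]
update_self_and_human_models(robot_b, human_b, obs, (0.0, 0.0, 0.0), (1.0, 0.0, 0.0))
print(sorted(robot_b), sorted(human_b))  # ['button-1', 'button-2'] ['button-1']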
As described above, belief maintenance consists of incorporating new sensor data into existing knowledge of the world. The robot's sensors are all in its reference frame, so objects in the world are perceived relative to the robot's position and orientation. In order to model the beliefs of the human, the robot re-uses the same mechanisms used for its own belief modeling, but first transforms the data into the reference frame of the human (see Fig. 3-3).

Figure 3-4: Timeline following the progress of the robot's beliefs for one button (events along the timeline: human arrives; Leo moves button behind occlusion; button turned OFF; human removes occlusion). The robot updates its belief about the button with any sensor data available - however, the robot only integrates new data into its model of the human's belief if the data is available when the human is able to perceive it.

The robot can also filter out incoming data that it believes is not perceivable to the human, thereby preventing that new data from updating the model of the human's beliefs. This incoming data might come from an object that resides outside the visual cone of the human (determined by the human's position and head orientation). Additionally, if the robot perceives an occlusion, data coming from objects on the opposite side of that occlusion from the human can be filtered before updating the robot's model of the human's beliefs.

Maintaining this parallel set of beliefs is different from simply adding metadata to the robot's original beliefs because it reuses the entire architecture which has mechanisms for object permanence, history of properties, etc. This allows for a more sophisticated model of the human's beliefs. For instance, Fig. 3-4 shows an example where this approach keeps track of the human's incorrect beliefs about objects that have changed state while out of the human's view. This is important for establishing and maintaining mutual beliefs in time-varying situations where beliefs of individuals can diverge over time.

3.3 Social Attention Mechanisms

In this section, I introduce another important system integrated into the robot's perceptual pipeline: the mechanisms of social attention that help to guide the robot's gaze behavior, action selection, and learning. These mechanisms also help the robot to determine which objects in the environment the teacher's communicative behaviors are about.

Shared attention is a critical component for human-robot interaction. Gaze direction in general is an important, persistent communication device, verifying for the human partner what the robot is attending to. Additionally, the ability to share attention with a partner is a key component of social attention [Scassellati, 2001].

Figure 3-5: Saliency of objects and people are computed from several environmental and social factors.

Referential looking is essentially "looking where someone else is looking". Shared attention, on the other hand, involves representing mental states of self and other [Baron-Cohen, 1991]. To implement shared attention, the system models both the attentional focus (what is being looked at right now) and the referential focus (the shared focus that activity is about). The system tracks the robot's attentional focus, the human's attentional focus, and the referential focus shared by the two.

Figure 3-6: Leo in his workspace with a human partner. The human's attention on the toy influences the robot's attention as well as the referential focus.
Leo's attentional system computes the saliency (a measure of interest) for objects in the perceivable space. Overall saliency is a weighted sum of perceptual properties (proximity, color, motion, etc.), the internal state of the robot (i.e., novelty, a search target, or other goals), and social cues (if something is pointed to, looked at, talked about, or is the referential focus, its saliency increases). The item with the highest saliency becomes the current attentional focus of the robot, and determines the robot's gaze direction [Breazeal, 2002].

The human's attentional focus is determined by what he or she is currently looking at. Assuming that the person's head orientation is a good estimate of their gaze direction, the robot follows this gaze direction to determine which (if any) object is the attentional focus.

The mechanism by which infants track the referential focus of communication is still an open question, but a number of sources indicate that looking time is a key factor. This is discussed in studies of word learning [Baldwin and Moses, 1994, Bloom, 2002]. For example, when a child is playing with one object and they hear an adult say "It's a modi", they do not attach the label to the object they happen to be looking at, but rather redirect their attention to look at what the adult is looking at, and attach the label to this object. For the referential focus, the system tracks a relative-looking-time for each of the objects in the robot's environment (the relative time the object has been the attentional focus of either the human or the robot). The object with the most relative-looking-time is identified as the referent of the communication between the human and the robot.

Fig. 3-6 shows the robot and human sharing joint visual attention. The robot has tracked the human's head pose and pointing gesture to determine the human's attentional focus, which in turn made this object more salient and thus the robot's own attentional focus, thereby casting it as the referential focus of the communication.

While these attentional saliency mechanisms do not play an important role in the learning algorithms described at the end of this chapter, they are a critical part of the interactive demonstrations discussed in chapter 4. The saliency mechanisms, in conjunction with the learning algorithms, directly influence the robot's gaze and choice of action as it responds to the cues provided by the human teachers. In particular, the robot's gaze is a crucial feedback channel in these interactions, making the state of the robot's cognition and learning more transparent and guidable for the human teacher.

3.4 Unified Social Activity Framework

The final important element of the self-as-simulator cognitive architecture is its unified framework for social action recognition and behavior generation. In the activity framework, the actions of the other are interpreted as learning-directed communications which constrain the exploratory behavior of the robot. As an example, imagine that a person in the environment points to a box. They might be initiating any of a number of different interactions. Do they:

* want to play an imitative game? (imitation)
* want me to engage with the box or its contents? (teaching)
* want the object in the box for themselves? (collaboration)

How I interpret the person's actions will have a profound effect upon my subsequent behavior.
3.4 Unified Social Activity Framework

The final important element of the self-as-simulator cognitive architecture is its unified framework for social action recognition and behavior generation. In the activity framework, the actions of the other are interpreted as learning-directed communications which constrain the exploratory behavior of the robot.

As an example, imagine that a person in the environment points to a box. They might be initiating any of a number of different interactions. Do they:

* want to play an imitative game? (imitation)
* want me to engage with the box or its contents? (teaching)
* want the object in the box for themselves? (collaboration)

How I interpret the person's actions will have a profound effect upon my subsequent behavior. Gauging their intent correctly is important for the success of the interaction. I seek to build robots that can flexibly and adaptively engage in this process of interpretation, making good initial guesses and recovering gracefully from mistakes. By responding sensibly to the actions of the human, the robot will be well situated to take advantage of the natural guidance that the human can provide.

3.4.1 Action Representation: Process Models

In this section I introduce a new action representation for the robot, motivated by the need for a probabilistic, integrated framework for action recognition, production, and expectation. I see such a framework as a key piece of the simulation-theoretic approach to social interaction, collaboration, and learning, where it is critical to establish a common ground between the behavior of the self and the behavior of the other.

At the heart of the action representation is the action tuple, a learning-friendly representation based on [Blumberg et al., 2002]. An action tuple represents a basic, temporally extended action undertaken by the robot or another agent. An action tuple A is fully described by the tuple (t, a, o, d, g), where

* t is the triggering context, an expectation about the state of the world when the action begins,
* a is the physical action,
* o is the object in the world that the physical action relates to,
* d is the do-until context, an expectation about the state of the world during the action and when the action finishes, and
* g indicates the agent who performs the action, important in a collaborative scenario.

In order to create a flexible framework which encompasses action recognition, production, and expectation, I embed the action tuple within a more sophisticated representation called a process model. A process model M is described by the tuple (S, n), where

* S = {A1, A2, ..., AK} is a set of action tuples, and
* n is a next-state expectation function.

The function n is defined such that n(·) = (Fn, pn), where

* Fn = {M1, M2, ..., Ml} is a set of possible future process models, and
* pn : Fn → [0, 1] is a probability distribution function, with the values pn(Mj) summing to 1 over Fn.

I denote the simplest process model, which specifies a single action tuple A' and which is agnostic about future states, as MA' = ({A'}, *).

The process model representation is strongly related to other action representations for learning which support temporal abstraction, such as options [Sutton et al., 1999, Singh et al., 2005], and indeed parts of the process model framework have been implemented using the options formalism. The representational framework is also inspired by temporally sophisticated action recognition techniques such as past-now-future networks [Pinhanez, 1999] and event logic [Siskind, 2001], and is clearly related to representations from the planning literature [Allen, 1994, Weld, 1994]. The process model framework is a relatively simple implementation based on these established techniques; the technical details and terminology are introduced here to support the discussion of my novel approach to exploration and social guidance.

I now provide a sketch of how process models can be used for action generation, which will be useful for my subsequent discussion of how recognized actions can be used to constrain the robot's exploration of its environment. Fig. 3-7 contains a description of the core action selection algorithm for process models.
The action selection mechanism maintains a set of core process models C, along with a working set of process models W. The core process models represent the robot's built-in, learned, or remembered models of its own behavior and the behavior of others in its environment. Each core process model is a seed for generating an extended interaction with the world. The working set represents the process models currently under consideration for describing the state of the environment and the present interaction. Action selection proceeds as follows:

1: initialize the working set to the core set: W ← C
2: compute the probability distribution function r : W → [0, 1] based on the applicability of each M ∈ W in the current sensorimotor context
3: for each time step do
4:   compute the biased probability distribution function b : W → [0, 1] based on r and the system's goals and other biases
5:   execute process model M' by selecting probabilistically over b
6:   collect the selection set Q = {the l models M ∈ W that are most consistent with M'}, where l is a constant that can vary with processing constraints
7:   initialize the set of expected future models E = ∅ and the probability distribution function s : E → [0, 1]
8:   for each M ∈ Q do
9:     compute n(·) = (Fn, pn)
10:    E ← E ∪ Fn
11:    s(f) ← r(M) · pn(f) for each f ∈ Fn
12:   end for
13:   renormalize s over E
14:   update the working set: W ← E ∪ C
15:   receive new perceptual information from the world
16:   update r based on s and the applicability of each M ∈ W in the current sensorimotor context
17: end for

Figure 3-7: Action selection algorithm for process models.

On every time step, a probability distribution function r is computed over the process models in the working set which encodes the applicability of each model given the current sensorimotor context. This distribution is then biased by the system's current goals, and the resulting biased distribution is sampled to select a process model M' for execution. The system then identifies the set of models, called the selection set, that are still applicable given the choice of action.

As an example, imagine that the robot has decided to open a box. Multiple extended processes might be consistent with this behavior: opening the box might facilitate the retrieval of a tool from the box, or it might enable the human to store a toy in the box, and so on. All of these process models would be included in the selection set, as long as they are not explicitly ruled out by the current context.

Once the selection set is assembled, it is used to generate an expectation about the next state of the world. This expectation consists of a set of possible future process models, along with a probability distribution over these models. This expectation set becomes the working set for the next time step, after any missing core models are added back in. The action selection mechanism makes sure that the working set always contains each core process model, including those that are highly unlikely given the current scenario. This guarantees that while the system devotes most of its attention to the most likely models, it can still respond to unexpected events that may necessitate a change of strategy.
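The sketch below illustrates how the action tuple and process model representations, together with a heavily simplified version of the selection loop in Fig. 3-7, might be realized in code. The class names, the applicability measure, and the expectation machinery are illustrative assumptions rather than the architecture's actual interfaces; in particular, the selection-set and renormalization steps are collapsed into a single update.

```python
import random
from dataclasses import dataclass, field

@dataclass(frozen=True)
class ActionTuple:
    trigger: str    # triggering context
    action: str     # physical action
    obj: str        # object the action relates to
    do_until: str   # expected end state
    agent: str      # who performs the action

@dataclass
class ProcessModel:
    name: str
    tuples: tuple                                    # the set S of action tuples
    successors: dict = field(default_factory=dict)   # next-state expectation: {model name: probability}

def applicability(model, context):
    """Crude applicability score: how many of the model's triggers hold in the context."""
    return sum(1.0 for a in model.tuples if a.trigger in context) + 0.1

def select_action(core, context, steps=3, seed=0):
    """Simplified selection loop: sample a model, then fold its expected successors
    into the applicability distribution for the next step (core models always remain)."""
    rng = random.Random(seed)
    working = list(core)                                          # W <- C
    r = {m.name: applicability(m, context) for m in working}
    for _ in range(steps):
        total = sum(r.values())
        probs = [r[m.name] / total for m in working]              # biased distribution (goal biases omitted)
        chosen = rng.choices(working, weights=probs, k=1)[0]
        yield chosen.name
        expected = {name: r[chosen.name] * p for name, p in chosen.successors.items()}
        working = list(core)                                      # working set always contains the core set
        r = {m.name: expected.get(m.name, 0.0) + applicability(m, context) for m in working}

# Example: two hypothetical core models that share an "open box" step.
open_for_tool = ProcessModel("open-box-to-get-tool",
                             (ActionTuple("box-closed", "open", "box", "box-open", "robot"),),
                             successors={"open-box-to-get-tool": 1.0})
open_for_toy = ProcessModel("open-box-to-store-toy",
                            (ActionTuple("box-closed", "open", "box", "box-open", "robot"),),
                            successors={"open-box-to-store-toy": 1.0})
print(list(select_action([open_for_tool, open_for_toy], context={"box-closed"})))
```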
3.4.2 Exploration Representation: Interpretation Modes

In this section, I introduce my approach to using the perceived actions of the other as dynamic constraints on exploration. As described in the previous section, the process model representation supports an integrated framework for action production and recognition. Using the same representation for both production and recognition greatly simplifies the problem of shaping the robot's behavior through perceived social action.

The robot's exploration mechanism maintains a set of interpretation modes D, where each I ∈ D corresponds to a particular exploration strategy, generating possible behavioral responses to the perceived action of the other. Specifically, each I ∈ D is a mapping such that I(M) = V, where V is a set of process model variants on M.

The robot's exploration system operates by observing the actions of the human and building process models of their behavior. When a model M' is recognized, the system selects an interpretation mode I' ∈ D, evaluates I'(M') = V', and incorporates V' into the robot's current set of working models W. If pursuing one of these new process models results in a favorable outcome, the new model may be added to the robot's set of core process models for later use, exploration, and refinement.

Each interpretation mode is an implementation of a different learning question. In selecting a particular mode, the system takes a stance about how it should focus its exploration of the environment at the current time. In my implementation, each interpretation mode produces variants of the perceived action that are focused around different, specific components of the action tuple representation. This implementation induces a hierarchy of learning questions roughly as follows:

* trigger: why?
* action: how?
* object: where?
* do-until: to what end?
* agent: who?

A related idea is the taxonomy of learning strategies that often appears in the imitation literature. Placed into a similar framework, it might look something like this:

* action: mimicry
* object: stimulus enhancement
* do-until: goal emulation

These strategies, however, are based on a direct mapping of the behavior of the other onto the behavior of the self. My productive, exploration-driven framework, while consistent with these strategies, aims to be more flexible. The strategies undertaken through my system might be better described as follows:

* trigger: context-space exploration
* action: motor-space exploration
* object: object-space exploration
* do-until: goal-space exploration
* agent: agent-space exploration

Thus, each interpretation mode focuses the robot's exploration around a particular aspect of the human's behavior. In this framework, the human's actions are viewed as intentional communications aimed at drawing the robot's attention to important features of the interactive environment. The human is not a teacher or collaborator per se, but rather a dynamic guide for the robot's behavior.
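A minimal sketch of how interpretation modes could map a recognized action onto candidate behaviors for the robot is shown below; the mode names and variant-generation rules are hypothetical, and action tuples are reduced to plain dictionaries for brevity. The prose example that follows illustrates the same idea with the thesis's own pointing scenario.

```python
# Hypothetical interpretation modes operating on a simplified action-tuple dict.
# Each mode maps a recognized action onto variant behaviors for the robot to try.

def i_action(observed):
    """Motor-space exploration: vary the physical action itself."""
    act = observed["action"]
    return [
        {**observed, "agent": "robot"},                           # reproduce the action directly
        {**observed, "agent": "robot", "action": act + "-mirrored"},
    ]

def i_object(observed):
    """Object-space exploration: try different interactions with the same object."""
    return [{**observed, "agent": "robot", "action": verb}
            for verb in ("open", "lift", "slide")]

INTERPRETATION_MODES = {"action": i_action, "object": i_object}

def interpret(observed, mode_name):
    """Evaluate I'(M') = V': generate variant behaviors under the chosen mode."""
    return INTERPRETATION_MODES[mode_name](observed)

observed = {"trigger": None, "action": "point", "obj": "box",
            "do_until": None, "agent": "human-A"}
for variant in interpret(observed, "object"):
    print(variant["action"], variant["obj"])   # open box / lift box / slide box
```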
As an example, imagine that the robot has two interpretation modes, Iaction and Iobject. The first focuses the robot's exploration around the human's physical activity, while the second focuses exploration around the object implied by that activity. Imagine that the human in the environment starts to point at a box. The system, perceiving this activity, builds a partial action tuple representation A' = (*, pointing, box, *, agent-A) and embeds this action tuple within process model MA' = human-points-at-box. Then possible results for applying the two interpretation modes might be:

Iaction(MA') = {leo-points-same, leo-points-mirrored, leo-points-at-box}
Iobject(MA') = {leo-opens-box, leo-lifts-box, leo-slides-box}

Thus the Iaction interpretation mode produces process models exploring different aspects of the pointing behavior, while the Iobject interpretation mode produces models exploring different interactions with the target of the pointing behavior.

3.5 Learning Algorithms

In this section, I describe the two algorithms that I developed within the self-as-simulator architecture for learning tasks from human teaching behavior. The first algorithm was a schema-based goal learning algorithm that was augmented to take advantage of visual perspective cues to learn tasks both from the perspective of the robot as well as from the perspective of the human teacher. The second algorithm was a simple, Bayesian, constraint learning algorithm that was designed to take advantage of the action timing and spatial scaffolding cues that were identified through the second study. These algorithms supported both the quantitative evaluation of the utility of the identified cues, as well as a number of demonstrations of interactive learning from human teachers, as described in the following chapter.

3.5.1 Task and Goal Learning

I believe that flexible, goal-oriented, hierarchical task learning is imperative for learning in a collaborative setting from a human partner, due to the human's propensity to communicate in goal-oriented and intentional terms. Hence, I employed a hierarchical, goal-oriented task representation, wherein a task is represented by a set, S, of schema hypotheses: one primary hypothesis and n others. A schema hypothesis has x executables, E (each either a primitive action a or another schema), a goal, G, and a tally, c, of how many seen examples have been consistent with this hypothesis. Goals for actions and schemas are a set of y goal beliefs about what must hold true in order to consider this schema or action achieved. A goal belief represents a desired change during the action or schema by grouping a belief's percepts into i criteria percepts (indicating features that hold constant over the action or schema) and j expectation percepts (indicating an expected feature change). This yields straightforward goal evaluation during execution: for each goal belief, all objects with the criteria features must match the expectation features.

Schema Representation:
S = {[(E1 ... Ex), G, c]p, [(E1 ... Ex), G, c]1...n}
E = a | S
G = {B1 ... By}
B = {PC1 ... PCi} ∪ {PE1 ... PEj}

For the purpose of task learning, the robot can take a snapshot of the world (i.e., the state of the Belief System) at time t, Snp(t), in order to later reason about world state changes. Learning is mixed-initiative, such that the robot pays attention to both its own and its partner's actions during a learning episode. When the learning process begins, the robot creates a new schema representation, S, and saves belief snapshot Snp(t0). From time t0 until the human indicates that the task is finished at time tend, if either the robot or the human completes an action, act, the robot makes an action representation, a = [act, G], for S:

1: for action act at time tb, given last action at time ta
2:   G = belief changes from Snp(ta) to Snp(tb)
3:   append [act, G] to the executables of S
4:   ta ← tb

At time tend, this same process works to infer the goal for the schema, S, making the goal inference from the differences in Snp(t0) and Snp(tend).
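The sketch below illustrates the snapshot-diffing idea in simplified form: belief snapshots are reduced to dictionaries of object properties, and the inferred goal is simply the set of property changes between two snapshots. The function and variable names are illustrative, not the Belief System's actual API.

```python
def snapshot_diff(snp_before, snp_after):
    """Infer goal beliefs as the property changes between two belief snapshots.
    Snapshots are {object_id: {property: value}} dictionaries; removed objects are ignored."""
    goal = {}
    for obj, after_props in snp_after.items():
        before_props = snp_before.get(obj, {})
        changes = {prop: (before_props.get(prop), value)
                   for prop, value in after_props.items()
                   if before_props.get(prop) != value}
        if changes:
            goal[obj] = changes
    return goal

# Example: one block is moved onto a target region during the demonstrated task.
snp_t0 = {"block-1": {"color": "red", "on": "table"},
          "block-2": {"color": "blue", "on": "table"}}
snp_tend = {"block-1": {"color": "red", "on": "green-region"},
            "block-2": {"color": "blue", "on": "table"}}
print(snapshot_diff(snp_t0, snp_tend))
# {'block-1': {'on': ('table', 'green-region')}}
```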
The goal inference mechanism notes all changes that occurred over the task; however, there may still be ambiguity around which aspects of the state change are the goal (the change to an object, a class of objects, the whole world state, etc.). My approach uses hypothesis testing coupled with human interaction to disambiguate the overall task goal over a few examples. Once the human indicates that the current task is done, S contains the representation of the seen example ([(E1 ... Ex), G, 1]). The system uses S to expand other hypotheses about the desired goal state, yielding hypotheses for all goal representations, G, consistent with the current demonstration (for details of this expansion process see [Lockerd and Breazeal, 2004]; to accommodate the tasks described here, the system additionally expands hypotheses whose goal is a state change across a simple disjunction of object classes).

The current best schema candidate (the primary hypothesis) is chosen through a Bayesian likelihood method: P(h|D) ∝ P(D|h) P(h). The data, D, is the set of all examples seen for this task. P(D|h) is the percentage of the examples in which the state change seen in the example is consistent with the goal representation in h. For priors, P(h), hypotheses whose goal states apply to the broadest object classes with the most specific class descriptions are preferred (determined by the number of classes and of criteria/expectation features, respectively). Thus, when a task is first learned, every hypothesis schema is equally represented in the data, and the algorithm chooses the most specific schema for the next execution.

3.5.2 Perspective Taking and Task Learning

In a similar fashion, in order to model the task from the demonstrator's perspective, the robot runs a parallel copy of its task learning engine that operates on its simulated representation of the human's beliefs. In essence, this focuses the hypothesis generation mechanism on the subset of the input space that matters to the human teacher.

At the beginning of a learning episode, the robot can take a snapshot of the world in order to later reason about world state changes. The integration of perspective taking means that this snapshot can be taken either from the robot's (R) or the human's (H) belief perspective. Thus, when the learning process begins, the robot creates two distinct schema representations, SRobot and SHum, and saves belief snapshots Snp(t0, R) and Snp(t0, H). Learning proceeds as before, but operating on these two parallel schemas. Once the human indicates that the current task is done, SRobot and SHum both contain the representation of the seen example. Having been created from the same demonstration, the executables will be equivalent, but the goals may not be equal, since they are from differing perspectives.

Maintaining parallel schema representations gives the robot three options when faced with inconsistent goal hypotheses: assume that the human's schema is correct, assume that its own schema is correct, or attempt to resolve the conflicts between the schemas. The evaluation discussed in the following chapter focuses on the simplest approach: take the perspective of the teacher, and assume that their schema is correct.

3.5.3 Constraint Learning from Embodied Cues

In order to give the robot the ability to learn from the embodied cues identified in the second study, I developed a simple, Bayesian learning algorithm. This algorithm was essentially a fuzzy, probability-based variant of the schema learning mechanism described above.
The algorithm was designed to learn rules pertaining to the color and shape of the foam blocks used in study tasks 1 and 2. The learning algorithm maintained a set of classification functions which tracked the relative odds that the various block attributes were good or bad according to the teacher's secret constraints. In total, ten separate classification functions were used, one for each of the four possible block colors and six possible block shapes. Each time the robot observed a salient teaching cue, these classification functions were updated using the posterior probabilities presented in the previous chapter - the odds of the target block being good or bad given the observed cue. For example, if the teacher moved a green triangle away from the student, the relative odds of green and triangular being good block attributes would decrease. Similarly, if the teacher then moved a red triangle towards the student, the odds of red and triangular being good would increase.

At the end of each interaction, the robot would attempt to identify the secret constraint being taught by the human teacher. First, the robot identified the single block attribute with the most significant good/bad probability disparity. If this attribute was a color attribute, the constraint was classified as a color constraint. If it was a shape attribute, the constraint was classified as a shape constraint. Next, all of the block attributes associated with the classified constraint type were ranked from "most good" to "most bad." Thus, if red was identified as the attribute with the most pronounced probability disparity, then all of the color attributes - red, green, blue, and yellow - would be ranked according to their relative probabilities of being good vs. bad.

It should be noted that this learning algorithm imposed significant structural constraints on the types of rules that the robot could learn from the interactions. For example, the robot could learn rules about colors or rules about shapes, but not rules combining both color and shape information. The space of rules that the robot considered was thus much smaller than the space of rules entertained by the human learners. However, this space was still large enough to present a significant learning challenge for the robot, with low chance performance levels. Most importantly, this learning problem was hard enough to represent an interesting evaluation of the usefulness of the teaching cues identified in the previous chapter. The core question was: would these teaching cues be sufficient to support successful learning? The following chapter presents the evaluation that I conducted to answer this question, along with a number of demonstrations of the Leonardo robot learning interactively from natural human teaching behavior.
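A minimal sketch of this kind of attribute-odds learner is shown below. The likelihood values and parts of the attribute list are illustrative placeholders; the actual posterior probabilities used by the robot were those estimated from the study data in the previous chapter.

```python
import math

COLORS = ["red", "green", "blue", "yellow"]
# Illustrative shape list; the study blocks had six shapes, only some of which are named in the text.
SHAPES = ["square", "triangle", "rectangle", "small-rectangle", "circle", "semicircle"]

# Illustrative likelihood ratios: how much each cue shifts the odds that an attribute is "good".
CUE_ODDS = {"moved_toward_student": 2.0, "moved_away_from_student": 0.5}

class ConstraintLearner:
    def __init__(self):
        # One odds (good : bad) tracker per block attribute, starting at even odds.
        self.odds = {attr: 1.0 for attr in COLORS + SHAPES}

    def observe_cue(self, cue, block_attrs):
        """Update the odds of each attribute of the block that the cue was applied to."""
        factor = CUE_ODDS[cue]
        for attr in block_attrs:
            self.odds[attr] *= factor

    def classify(self):
        """Pick the attribute with the largest disparity from even odds, classify the
        constraint as color- or shape-based, and rank that type's attributes."""
        most_salient = max(self.odds, key=lambda a: abs(math.log(self.odds[a])))
        attr_type = COLORS if most_salient in COLORS else SHAPES
        ranking = sorted(attr_type, key=lambda a: self.odds[a], reverse=True)
        return most_salient, ranking

learner = ConstraintLearner()
learner.observe_cue("moved_away_from_student", ["green", "triangle"])
learner.observe_cue("moved_toward_student", ["red", "triangle"])
learner.observe_cue("moved_toward_student", ["red", "square"])
print(learner.classify())   # "red" has the largest disparity, so the colors are ranked
```

As in the description above, such a learner can only express rules over a single attribute type at a time, which is exactly the structural restriction noted in the text.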
Chapter 4

Evaluations and Demonstrations

In this chapter, I compare the learning performance of Leonardo's cognitive architecture against that of human learners on benchmark tasks drawn from the two studies. This comparison provides quantitative evidence for the utility of visual perspective, action timing, and spatial scaffolding as attention-direction cues for robotic learning systems. As a secondary contribution, this evaluation process supported the construction of a pair of demonstrations of the Leonardo robot making use of embodied cues to learn in novel ways from natural human teaching behavior.

In the first demonstration, the Leonardo robot takes advantage of perspective taking to learn from ambiguous task demonstrations involving colorful foam blocks. The second demonstration features Leo making use of action timing and spatial scaffolding to learn secret constraints associated with a number of construction tasks, again involving foam blocks. Leonardo is the first robot to make use of visual perspective, action timing, and spatial scaffolding to learn from human teachers.

4.1 Visual Perspective Taking Demonstration and Benchmarks

The first evaluation centers around attention direction via visual perspective taking. In the demonstration, the Leonardo robot takes advantage of perspective taking to learn from ambiguous task demonstrations involving colorful foam blocks. As an evaluation, I compared the robot's task performance against the performance of study participants on identical learning tasks. The tasks from the first study were used to create a benchmark suite for the robot. In the robot's graphical simulation environment, the robot was presented with the same task demonstrations as were provided to the study participants (Fig. 4-1). The learning performance of the robot was analyzed in two conditions: with the perspective taking mechanisms intact, and with them disabled.

Figure 4-1: The robot was presented with similar learning tasks in a simulated environment.

The robot was instructed in real-time by a human teacher. The teacher delineated task demonstrations using verbal commands: "Leo, I can teach you to do task 1," "Task 1 is done," etc. The teacher could select and move graphical building blocks within the robot's 3D workspace via a mouse interface. This interface allowed the teacher to demonstrate a wide range of tasks involving complex block arrangements and dynamics. For the benchmark suite, the teacher followed the same task protocol that was used in the study, featuring identical block configurations and movements. For the purposes of perspective taking, the teacher's visual perspective was assumed to be that of the virtual camera through which the scene was rendered.

Table 4.1: High-likelihood hypotheses entertained by the robot at the conclusion of benchmark task demonstrations. The highest-likelihood (winning) hypotheses are highlighted in bold.

Task         Condition     High-Likelihood Hypotheses
Task 1       with PT       all; all but blue
             without PT    all but blue
Task 2       with PT       all red and green; shape preference
             without PT    shape preference
Tasks 3 & 4  with PT       rotate figure; mirror figure
             without PT    mirror figure

Table 4.2: Hypotheses selected by study participants following task demonstrations. The most popular rules are highlighted in bold.

As the teacher manipulated the blocks, the robot attended to the teacher's movements. The robot's task learning mechanisms parsed these movements into discrete actions and assembled a schema representation for the task at hand, as detailed in previous sections. At the conclusion of each demonstration, the robot expanded and revised a set of hypotheses about the intended goal of the task. After the final demonstration, the robot was instructed to perform the task using a novel set of blocks arranged in accordance with the human study protocol. The robot's behavior was recorded, along with all of the task hypotheses considered to be valid by the robot's learning mechanism. Table 4.1 shows the highest-likelihood hypotheses entertained by the robot in the various task conditions at the conclusion of the demonstrations.
In the perspective taking condition, likely hypotheses included both those constructed from the teacher's perspective as well as those constructed from the robot's own perspective; however, as described above, the robot preferred hypotheses constructed from the teacher's perspective. The hypotheses favored by the learning mechanism (and thus executed by the robot) are highlighted in bold. For comparison, Table 4.2 displays the rules selected by study participants, with the most popular rules for each task highlighted in bold.

For every task and condition, the rule learned by the robot matches the most popular rule selected by the humans. This strongly suggests that the robot's perspective taking mechanisms focus its attention on a region of the input space similar to that attended to by study participants in the presence of a human teacher. It should also be noted, as evident in the tables, that participants generally seemed to entertain a more varied set of hypotheses than the robot. In particular, participants often demonstrated rules based on spatial or numeric relationships between the objects - relationships which are not yet represented by the robot. Thus, the differences in behavior between the humans and the robot can largely be understood as a difference in the scope of the relationships considered between the objects in the example space, rather than as a difference in this underlying space. The robot's perspective taking mechanisms are successful at bringing the agent's focus of attention into alignment with the humans' in the presence of a social teacher.

4.2 Emphasis Cues Benchmarks

Tasks 2a and 2b from the second study were used to evaluate the ability of Leo's cognitive architecture to learn from cues that human teachers naturally provide. The robot was presented with the recorded behavioral data from these tasks, and its learning performance was measured. Care was taken to make sure that the robot could only use the cues provided by the teacher and not those of the learner (and thus possibly "piggy-back" on the human learner's success).

The robot processed the recorded study data using the same analysis pipeline as described in section 2.3.9. The foam blocks and the heads and hands of the study participants were tracked and mapped into the same three-dimensional coordinate system. Salient events such as block movement and hand contact were identified, and agency was assigned for each event to either the teacher or the learner.

Figure 4-2: The robot was presented with the recorded study data, from which the teachers' cues were extracted.

From this stream of event information, two types of teaching cues were extracted: movements by the teacher towards and away from the body of the student, and movements by the teacher following movements by the student.
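The sketch below illustrates how these two cue types might be extracted from a time-ordered event log of tracked block movements. The event format and thresholds are illustrative assumptions; the actual pipeline operated on the motion-capture and light-table data streams described in chapter 2.

```python
from dataclasses import dataclass

@dataclass
class MoveEvent:
    time: float            # seconds since the start of the interaction
    agent: str             # "teacher" or "student"
    block: str
    start_dist: float      # distance of the block from the student's body (meters)
    end_dist: float

def extract_cues(events, dist_threshold=0.05, follow_window=2.0):
    """Return (spatial scaffolding cues, action timing cues) from an event stream."""
    scaffolding, timing = [], []
    last_student_move_time = None
    for e in sorted(events, key=lambda e: e.time):
        if e.agent == "student":
            last_student_move_time = e.time
            continue
        # Spatial scaffolding: the teacher moves a block toward or away from the student.
        delta = e.end_dist - e.start_dist
        if delta < -dist_threshold:
            scaffolding.append((e.block, "toward_student"))
        elif delta > dist_threshold:
            scaffolding.append((e.block, "away_from_student"))
        # Action timing: a teacher movement shortly after a student movement.
        if last_student_move_time is not None and e.time - last_student_move_time <= follow_window:
            timing.append((e.block, "teacher_followed_student"))
    return scaffolding, timing

events = [
    MoveEvent(1.0, "student", "blue-triangle", 0.3, 0.4),
    MoveEvent(2.1, "teacher", "green-square", 0.2, 0.6),   # pushed away, right after the student's move
    MoveEvent(9.0, "teacher", "red-triangle", 0.7, 0.2),   # pulled toward the student
]
print(extract_cues(events))
```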
The robot employed a Bayesian constraint learning algorithm to learn from these cues, as described in section 3.5.3. Each time the robot observed a salient teaching cue, the algorithm updated the classification functions which tracked the relative odds of each block attribute being good or bad. At the end of each interaction, the robot identified the single block attribute with the most significant good/bad probability disparity. Based on this attribute, the secret constraint was classified as either a color-based constraint or a shape-based constraint. Next, all of the block attributes associated with the classified constraint type were ranked from "most good" to "most bad." The resulting ranking was recorded for each observed task interaction.

After each observed task, the robot was simulated "re-performing" the given task in a non-interactive setting. The robot followed the rules extracted by its learning algorithm, and its performance was gauged as correct or incorrect according to the teacher's secret constraint. Thus, in addition to measuring whether or not the robot extracted the correct rules, I assessed whether or not the robot would have completed the task successfully given the observed teaching behavior.

A cross-validation methodology was followed for both of the benchmark tasks. The robot's learning algorithm was developed and tested using 6 of the 36 study sessions. The robot's learning performance was then evaluated on the remaining 30 study sessions, with 30 recorded interactions for Task 2a and 30 recorded interactions for Task 2b.

The performance of the human learners and the robot on the benchmark tasks is presented in tables 4.3 and 4.4, respectively. Human performance was gauged based on the guesses that the learners wrote down at the end of each task about the secret constraint. For both tasks, the secret constraint involved two rules. In Task 2a (building a sailboat), the rules were that only red blocks and blue blocks could be used in the construction of the figure (or equivalently, that green and yellow blocks could not be used). In Task 2b (building a truck), the rules were that the square blocks could not be used in the construction of the figure, while the triangular blocks had to be used.

The performance of the human learners was gauged using three metrics: whether or not they correctly identified the rules as being color-based or shape-based (Rule Type Correct), whether or not they correctly specified either of the two rules (One Rule Correct), and finally, whether or not they correctly specified both rules (Both Rules Correct).

Table 4.3: Performance of the human learners on study tasks 2a (Sailboat) and 2b (Truck).

Task      Rule Type Correct (color / shape)   One Rule Correct   Both Rules Correct
Sailboat  30 (100%)                           27 (90%)           26 (87%)
Truck     29 (97%)                            26 (87%)           4 (13%)

As can be seen in table 4.3, the performance of the human learners was quite high for both tasks. The rule type was identified correctly nearly 100% of the time, with both rules specified correctly 87% of the time for the Sailboat task, and at least one rule specified correctly 87% of the time for the Truck task. Interestingly, for the Truck task, both rules were specified correctly in only 4 instances (13%). In this task, the two rules were partially redundant: not being able to use the square blocks required the use of some, but not all, of the triangular blocks. Similarly, using the triangular blocks was most easily accomplished by covering some, but not all, of the square-shaped regions of the figure. Thus, the teachers could guide the learners most of the way towards the correct completion of the figure by either encouraging the use of the triangular blocks or discouraging the use of the square blocks. This may explain some of the disparity between the success rate for specifying both rules and the success rate for specifying either one of the rules.

The performance of the robot is presented in table 4.4.
As described in section 3.5.3, after observing the task interaction, the robot tried to identify the most salient block attribute - the attribute with the most significant good/bad disparity. This attribute was then used to classify the constraint type as either color or shape. Then, the robot ranked all of the attributes associated with the classified constraint type, ordering them from most good to most bad.

The robot's performance was assessed via a number of measures. Rule Type Correct assessed whether or not the robot classified the constraint type correctly. One Rule Correct assessed whether or not the attribute that the robot identified as the most salient was indeed one of the task rules (and whether or not that attribute was correctly identified as good vs. bad). Both Rules Correct assessed whether or not both rules were ranked correctly in the ranking of salient block attributes. For the Sailboat task, this required red and blue to be the highest-ranked attributes, and yellow and green to be the lowest-ranked attributes. For the Truck task, this required triangular to be the highest-ranked attribute, and square to be the lowest-ranked attribute. Finally, the table presents how often the robot completed the task successfully, when it was simulated re-performing the task following its observation of the interaction.

Table 4.4: Learning performance of the robot observing benchmark task interactions. Performance is measured against rules that participants were instructed to teach. Truck* adjusts for a number of instances of rule misunderstanding by the human teachers (see accompanying text).

Task      Rule Type Correct (color / shape)   One Rule Correct   Both Rules Correct   Correct Performance
Sailboat  24 (80%)                            22 (73%)           21 (70%)             21 (70%)
Truck     28 (93%)                            20 (67%)           10 (33%)             23 (77%)
Truck*    28 (93%)                            23 (77%)           14 (47%)             23 (77%)

The robot's performance results are very exciting. The robot was able to identify the constraint type 80% of the time in the Sailboat task and 93% of the time in the Truck task. The attribute that the robot identified as most salient correctly matched one of the task rules 73% of the time in the Sailboat task and 67% of the time in the Truck task. The robot's attribute ranking correctly matched both task rules 70% of the time for the Sailboat task and 33% of the time for the Truck task. The robot's re-performance of the task was successful 70% of the time for the Sailboat task and 77% of the time for the Truck task.

Additionally, in the Truck task, the robot was able to correctly match the rules identified by the human learners in a number of instances where the human teacher misunderstood the specified rule. In four cases, the teacher taught the rule "all rectangles are bad" instead of "all squares are bad." This rule was identified by both the human learner and the robot in all four instances. The Truck* row in the table credits the robot with success in these cases, with correspondingly higher performance numbers. Essentially, this compares the robot's performance to the rules that the learner identified, rather than the rules that the teacher was instructed to teach.

Taken together, the results suggest that the robot was able to learn quite successfully by paying attention to a few simple cues extracted from the teacher's observed behavior. This is an exciting validation both of the robot's learning mechanisms as well as of the usefulness of the cues themselves. These dynamic, embodied cues are not just reliable at predicting whether blocks are good and bad in isolation.
They are prevalent enough and consistent enough throughout the observed interactions to support successful learning.

4.3 Emphasis Cues Demonstration

Finally, I created a demonstration which featured Leo making use of action timing and spatial scaffolding to learn from live interactions with human teachers, in a similar, secret-constraint task domain. A mixed-reality workspace was created so that the robot and the human teacher could both interact gesturally with animated foam blocks on a virtual tabletop.

Figure 4-3: The robot took advantage of action timing and spatial scaffolding cues to learn through a live interaction with a human teacher.

The human and the robot interacted face-to-face, with a large plasma display positioned in between them with the screen facing upwards, as shown in figure 4-3. Twenty-four animated foam blocks were displayed on the screen, matching the colors and shapes of the blocks used in the study. The dimensions of the screen and the virtual blocks also closely matched the dimensions of the tabletop and blocks in the study.

The robot could manipulate the virtual blocks directly, by procedurally sliding and rotating them on the screen. To provide feedback to the human teacher, the robot gesturally tracked these procedural movements with its hand outstretched, giving the impression that the robot was levitating the blocks with its hand. The robot also followed its own movements and those of the human with its gaze, and attended to the teacher's hands and head. Additionally, the robot provided gestures of confusion and excitement at appropriate moments during the interaction, such as when the robot was interrupted by the human or when the figure was completed successfully.

The human teacher wore the same gloves and baseball cap as were worn by the participants in the study. The teacher's head and hands were tracked using the same motion-capture-based object tracking pipeline. The human manipulated the virtual blocks via a custom-built gestural interface, which essentially turned the plasma display into a very large, augmented touch screen (see figure 4-4). The interface allowed the teacher to use both hands to pick up, slide, and rotate the blocks on the screen. The teacher could select a particular block by touching the screen with their fingertips, at which point the selected block would become "stuck" to their hand. In the 3D graphics environment, the selected block would follow the translation and rotation of the teacher's hand, and would continue to do so until the teacher put down the block by again touching the screen. In this manner, the teacher could manipulate any block on the screen, including blocks being moved by the robot (an interruption which would often be followed by a gesture of confusion by the robot).

The robot could be instructed to build any of a number of different figures. The silhouettes of these target figures were specified ahead of time via a simple graphical interface. During the interaction, the robot selected which blocks to use to complete the figures, incorporating the guidance of the human teacher. An example of a successfully completed figure is shown in figure 4-4, which should be recognizable as the truck figure from study Task 2b. As is suggested in the figure, the teacher has moved the square blocks away from the robot (towards the bottom of the screen), causing the robot to use the triangular and small rectangular blocks to complete the figure.
Figure 4-4: A gestural interface (left) allowed the human and the robot to interact with virtual foam blocks. A figure planning algorithm allowed the robot to use a simple spatial grammar to construct figures (right) in different ways, incorporating the guidance of the teacher.

A figure planning algorithm allowed the robot to use a simple spatial grammar to construct the target figures in different ways. This allowed for flexibility in the shapes as well as the colors of the blocks used in the figures. The spatial grammar was essentially a spatially-augmented context-free grammar. Each rule in the grammar specified how a particular figure region could be constructed using different arrangements of one or more blocks. For example, one rule in the grammar specified three alternatives for constructing square figure regions: (1) using one square block, (2) using two triangular blocks, or (3) using two small rectangular blocks. Similar rules specified the alternatives for constructing rectangular figure regions, circular regions, triangular regions, and so on. The use of such a grammar imposed some strong restrictions on how the target figures could be constructed, and more open-ended planning techniques are certainly imaginable. However, this approach allowed the robot to be quite flexibly guided by the teacher's behavior, and allowed it to clearly demonstrate its understanding of the constraints it learned through the interaction.

For each rule in the grammar, a preference distribution specified an initial bias about which alternatives the robot should prefer. For the example of square figure regions, this distribution specified that the robot should slightly prefer the solution involving just one square block over the solutions involving two triangular or rectangular blocks. During the interaction, the figure planning algorithm multiplied these distributions by the estimated probability of each available block being a good block, as inferred from the teacher's embodied cues. The resulting biased probability distribution governed the robot's choice of which block to use at each step in constructing the figure. The preference distributions in the grammar were typically skewed only slightly, and thus could be easily overridden by the teacher's guidance. For example, in the absence of any instruction, the robot would use single square blocks to complete square regions in the figures. However, if the robot estimated that triangular blocks were, for example, twice as likely to be good as square blocks given the teacher's behavior, the triangular blocks would be used instead.

Figure 4-5: Interaction sequence between the robot and a human teacher. (1) The robot starts to construct a sailboat figure as the teacher watches. (2) The teacher interrupts an incorrect addition to the figure, and then (3) moves preferable blocks closer to the robot. (4) The teacher watches as the robot continues to construct the figure, and finally (5) completes the figure successfully, using only blue and red blocks. (6) After the teacher leaves, the robot constructs a new, smiley-face figure following the learned constraints.

The robot employed the same learning algorithm as was used to learn from the recorded study data, and employed the same tracking algorithms to identify the teacher's embodied cues. As the interaction progressed, the robot updated its estimates of whether or not each block attribute was good or bad, and used these estimates to bias its action selection, as discussed above. Thus, if the teacher slid a green triangle closer to the robot, the odds of the robot attending to and using green blocks and triangular blocks would increase. If the teacher slid the block away from the robot, the odds of the robot ignoring such blocks would increase.
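The sketch below illustrates the flavor of this biased, grammar-driven block selection: each figure region has a few construction alternatives with mild initial preferences, which are multiplied by the learner's current estimates of how good each block type is. The grammar rules, preference values, and goodness estimates are illustrative placeholders rather than the actual ones used in the demonstration.

```python
import random

# One grammar rule per figure region type: construction alternatives with mild initial preferences.
GRAMMAR = {
    "square-region": [
        (["square"], 0.4),
        (["triangle", "triangle"], 0.3),
        (["small-rectangle", "small-rectangle"], 0.3),
    ],
}

def choose_construction(region, block_goodness, rng=None):
    """Pick an alternative for a region, biasing the grammar's preferences by the
    estimated probability that each required block type is 'good'."""
    rng = rng or random.Random(0)
    alternatives = GRAMMAR[region]
    scores = []
    for blocks, preference in alternatives:
        goodness = 1.0
        for b in blocks:
            goodness *= block_goodness.get(b, 0.5)
        scores.append(preference * goodness)
    total = sum(scores)
    probs = [s / total for s in scores]
    blocks, _ = rng.choices(alternatives, weights=probs, k=1)[0]
    return blocks

# Before any teaching, the single square block is slightly preferred on average.
neutral = {"square": 0.5, "triangle": 0.5, "small-rectangle": 0.5}
print(choose_construction("square-region", neutral))

# After the teacher pushes the squares away, the two-triangle alternative dominates the distribution.
taught = {"square": 0.1, "triangle": 0.8, "small-rectangle": 0.4}
print(choose_construction("square-region", taught))
```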
Figure 4-5 shows an interaction sequence between the robot and a human teacher. The robot, instructed to build a sailboat figure, starts to construct the figure as the teacher watches. The teacher's goal is to guide the robot into using only blue and red blocks to construct the figure. As the interaction proceeds, the robot tries to add a green rectangle to the figure. The teacher interrupts, pulling the block away from the robot. As the robot continues to build the figure, the teacher tries to help by sliding a blue block and a red block close to the robot's side of the screen. The teacher then watches as the robot completes the figure successfully. To demonstrate that the robot has indeed learned the constraints, the teacher walks away, and instructs the robot to build a new figure. Without any intervention from the teacher, the robot successfully constructs the figure, a smiley-face, using only red and blue blocks.

Chapter 5

Conclusion

In this chapter, I present a brief summary of the contributions of my work. I then present some concluding thoughts and a discussion of plans and possibilities for future research.

5.1 Contributions

My thesis research resulted in a number of specific contributions:

* I conducted two novel studies that examined the use of embodied cues in human task learning and teaching behavior. To carry out these studies, I created a novel data-gathering system for capturing teaching and learning interactions at very high spatial and temporal resolutions.
* Through the studies, I observed a number of salient attention-direction cues, the most promising of which were visual perspective, action timing, and spatial scaffolding. In particular, spatial scaffolding, in which teachers use their bodies to spatially structure the learning environment to direct the attention of the learner, was identified as a highly valuable cue for robotic learning systems.
* I constructed a number of learning algorithms to evaluate the utility of the identified cues. I situated these learning algorithms within a large architecture for robot cognition, augmented with novel mechanisms for social attention and visual perspective taking.
* I evaluated the performance of these learning algorithms in comparison to human learning data, providing quantitative evidence for the utility of the identified cues. As a secondary contribution, this evaluation process allowed me to construct a number of demonstrations of the Leonardo robot taking advantage of embodied cues to learn from natural human teaching behavior. Leonardo is the first robot to make use of visual perspective, action timing, and spatial scaffolding to learn from human teachers.

5.2 Future Work

Finally, I present some thoughts on plans and possibilities for future research stemming from this work.

5.2.1 Additional Embodied Emphasis Cues

My analysis of the second study focused on just one of the three study tasks, and on only a handful of the embodied emphasis cues observed during the task. I am interested in using the automated analysis tools that I created to generate quantitative data about some of the other cues enumerated in chapter 2, as well as about the cues observed during tasks 1 and 3.
This would be interesting both to better understand the use of these cues in the learning interactions, as well as to improve the robot's ability to learn from the recorded data. I am interested in looking more closely at how the teachers spatially cluster the blocks in tasks 1 and 2, drawing the learner's attention to the salient block features. I am also interested in looking at head shaking and nodding cues, hand gestures such as tapping and finger-wagging, and cues communicated through gaze and eye contact between the teacher and the learner. Additionally, the recorded data for task 3 contains a great deal of information about how teachers and learners communicate in a sequence learning domain. Cues of interest here include how teachers guide the actions taken by the learners, and how they convey constraints in the space of available actions and identify salient perceptual events.

5.2.2 Cues for Regulating the Learning Interaction

The recorded data set also seems to contain some very interesting observations of how learners and teachers regulate the pacing and turn taking in learning interactions. A fluid continuum seems to exist between teacher-guided demonstration and learner-guided exploration, and it might be quite valuable to better understand the cues that can establish and modify where a particular interaction lies in this continuum. Some simple cues in the data set seemed to regulate the pacing of the learner's actions and how much attention was paid by the learner to the teacher. In task 3, for example, an important cue seemed to be the proximity of the teacher's hands to the box controls. If the teachers kept their hands close to their own bodies, the learners would often proceed quite quickly with their actions. As the teachers raised their hands and moved them towards the controls, the learners' actions would tend to slow, and their attention to the teacher would seem to increase. More generally, an important cue across many of the task interactions seemed to be how much the teacher was intruding into the learner's "space" versus how much they were just observing and leaving the learner alone.

5.2.3 Robots as Teachers

I used the data about the types of useful, embodied cues that human teachers naturally provide to build a robotic system that could learn in novel ways from human teaching behavior. One could also imagine using this very same data to build a better robot teacher, one that could provide appropriate, non-verbal cues and that could structure its demonstrations in a way that was highly understandable to human learners. One could imagine a compelling demonstration of such teaching based on a "telephone"-style series of interactions. The robot could be taught something new by a human expert, and then could proceed to teach that same skill or task to human novices. There is quite a bit of interesting work that could be done to frame this interaction and develop methods for evaluating its success.

5.2.4 Dexterity and Mobility

Finally, I am excited to extend the demonstrations of robot social learning presented in this thesis to a robotic platform featuring both dexterity and mobility. The ability to move around and grasp objects would allow the interaction to be fully grounded in the physical world, supporting the exploration of a wider range of embodied teaching cues. I am fascinated by the challenges presented in designing a robotic system that can learn from human teachers through interactions in real, physical space.
As robots enter the social environments of our workplaces and homes, it will be important for them to be able to learn from natural human teaching behavior. This thesis has presented some concrete steps towards this goal, by identifying a handful of simple, information-rich cues that humans naturally provide through their visual perspective, action timing, and use of space, and by demonstrating a robotic learning system that can take advantage of these cues to learn tasks through natural interactions with human teachers. By continuing to explore the information contained within our nonverbal teaching cues, we will not only make progress towards a better understanding of ourselves as social actors and learners, but also enable the creation of robots that fit more seamlessly into our lives.

Bibliography

[Allen, 1994] Allen, J. (1994). Readings in Planning. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.

[Arulampalam et al., 2002] Arulampalam, M., Maskell, S., Gordon, N., and Clapp, T. (2002). A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking. IEEE Transactions on Signal Processing, 50(2):174-188.

[Baldwin and Moses, 1994] Baldwin, D. and Moses, J. (1994). Early understanding of referential intent and attentional focus: Evidence from language and emotion. In Lewis, C. and Mitchell, P., editors, Children's Early Understanding of Mind, pages 133-156. Lawrence Erlbaum Assoc., New York.

[Baron-Cohen, 1991] Baron-Cohen, S. (1991). Precursors to a theory of mind: Understanding attention in others. In Whiten, A., editor, Natural Theories of Mind, pages 233-250. Blackwell Press, Oxford, UK.

[Barsalou et al., 2003] Barsalou, L. W., Niedenthal, P. M., Barbey, A., and Ruppert, J. (2003). Social embodiment. The Psychology of Learning and Motivation, 43.

[Bloom, 2002] Bloom, P. (2002). Mindreading, communication and the learning of names for things. Mind and Language, 17(1-2):37-54.

[Blumberg et al., 2002] Blumberg, B., Downie, M., Ivanov, Y., Berlin, M., Johnson, M., and Tomlinson, B. (2002). Integrated learning for interactive synthetic characters. In Proceedings of ACM SIGGRAPH.

[Bolt, 1980] Bolt, R. (1980). "Put-that-there": Voice and gesture at the graphics interface. In Proceedings of the 7th Annual Conference on Computer Graphics and Interactive Techniques, pages 262-270.

[Breazeal, 2002] Breazeal, C. (2002). Designing Sociable Robots. MIT Press, Cambridge, MA.

[Breazeal et al., 2004] Breazeal, C., Hoffman, G., and Lockerd, A. (2004). Teaching and working with robots as collaboration. In Proceedings of the AAMAS.

[Carberry, 2001] Carberry, S. (2001). Techniques for plan recognition. User Modeling and User-Adapted Interaction, 11(1-2):31-48.

[Carpenter et al., 1999] Carpenter, J., Clifford, P., and Fearnhead, P. (1999). Improved particle filter for nonlinear problems. Radar, Sonar and Navigation, IEE Proceedings, 146(1):2-7.

[Cassell, 2000] Cassell, J. (2000). Embodied Conversational Agents. MIT Press.

[Davies and Stone, 1995] Davies, M. and Stone, T. (1995). Introduction. In Davies, M. and Stone, T., editors, Folk Psychology: The Theory of Mind Debate. Blackwell, Cambridge.

[Fransen et al., 2007] Fransen, B., Morariu, V., Martinson, E., Blisard, S., Marge, M., Thomas, S., Schultz, A., and Perzanowski, D. (2007). Using vision, acoustics, and natural language for disambiguation. In Proceedings of the 2007 ACM Conference on Human-Robot Interaction (HRI '07).
[Gallese and Goldman, 1998] Gallese, V. and Goldman, A. (1998). Mirror neurons and the simulation theory of mind-reading. Trends in Cognitive Sciences, 2(12):493-501.

[Ghidary et al., 2002] Ghidary, S., Nakata, Y., Saito, H., Hattori, M., and Takamori, T. (2002). Multi-Modal Interaction of Human and Home Robot in the Context of Room Map Generation. Autonomous Robots, 13(2):169-184.

[Gopnik et al., 2001] Gopnik, A., Sobel, D., Schulz, L., and Glymour, C. (2001). Causal learning mechanisms in very young children: Two-, three-, and four-year-olds infer causal relations from patterns of variation and covariation. Developmental Psychology, 37(5):620-629.

[Gray et al., 2005] Gray, J., Breazeal, C., Berlin, M., Brooks, A., and Lieberman, J. (2005). Action parsing and goal inference using self as simulator. In 14th IEEE International Workshop on Robot and Human Interactive Communication (ROMAN), Nashville, Tennessee. IEEE.

[Hanna et al., 2003] Hanna, J., Tanenhaus, M., and Trueswell, J. (2003). The effects of common ground and perspective on domains of referential interpretation. Journal of Memory and Language, 49(1):43-61.

[Iverson and Goldin-Meadow, 2005] Iverson, J. and Goldin-Meadow, S. (2005). Gesture Paves the Way for Language Development. Psychological Science, 16(5):367.

[Johnson and Demiris, 2005] Johnson, M. and Demiris, Y. (2005). Perceptual perspective taking and action recognition. International Journal of Advanced Robotic Systems, 2(4):301-308.

[Jones and Hinds, 2002] Jones, H. and Hinds, P. (2002). Extreme work teams: using swat teams as a model for coordinating distributed robots. In Proceedings of the 2002 ACM Conference on Computer Supported Cooperative Work, pages 372-381. ACM Press.

[Kahn et al., 1996] Kahn, R. E., Swain, M. J., Prokopowicz, P. N., and Firby, R. J. (1996). Gesture recognition using the perseus architecture. In Proceedings of the 1996 Conference on Computer Vision and Pattern Recognition (CVPR '96), page 734, Washington, DC, USA. IEEE Computer Society.

[Kalman, 1960] Kalman, R. (1960). A new approach to linear filtering and prediction problems. Journal of Basic Engineering, 82(1):35-45.

[Kendon, 1997] Kendon, A. (1997). Gesture. Annual Review of Anthropology, 26:109-128.

[Keysar et al., 2000] Keysar, B., Barr, D., Balin, J., and Brauner, J. (2000). Taking perspective in conversation: the role of mutual knowledge in comprehension. Psychological Science, 11(1):32-38.

[Kipp, 2004] Kipp, M. (2004). Gesture Generation by Imitation: From Human Behavior to Computer Character Animation. PhD thesis, Boca Raton, Florida: Dissertation.com.

[Kobsa et al., 1986] Kobsa, A., Allgayer, J., Reddig, C., Reithinger, N., Schmauks, D., Harbusch, K., and Wahlster, W. (1986). Combining deictic gestures and natural language for referent identification. In Proceedings of the 11th International Conference on Computational Linguistics.

[Langton, 2000] Langton, S. (2000). The mutual influence of gaze and head orientation in the analysis of social attention direction. The Quarterly Journal of Experimental Psychology: Section A, 53(3):825-845.

[Lockerd and Breazeal, 2004] Lockerd, A. and Breazeal, C. (2004). Tutelage and socially guided robot learning. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

[Martin et al., 2005] Martin, B., Lozano, S., and Tversky, B. (2005). Taking the Actor's Perspective Enhances Action Understanding and Learning. In Proceedings of the 27th Annual Meeting of the Cognitive Science Society, Stresa, Italy.
[McNeill, 1992] McNeill, D. (1992). Hand and Mind: What Gestures Reveal about Thought. University of Chicago Press.

[McNeill, 2005] McNeill, D. (2005). Gesture and Thought. University of Chicago Press.

[Meltzoff and Decety, 2003] Meltzoff, A. N. and Decety, J. (2003). What imitation tells us about social cognition: a rapprochement between developmental psychology and cognitive neuroscience. Philosophical Transactions of the Royal Society: Biological Sciences, 358:491-500.

[Morency et al., 2002] Morency, L.-P., Rahimi, A., Checka, N., and Darrell, T. (2002). Fast stereo-based head tracking for interactive environment. In Int. Conference on Automatic Face and Gesture Recognition.

[Neal et al., 1998] Neal, J., Thielman, C., Dobes, Z., Haller, S., and Shapiro, S. (1998). Natural language with integrated deictic and graphic gestures. Readings in Intelligent User Interfaces, pages 38-51.

[Nehaniv et al., 2005] Nehaniv, C., Dautenhahn, K., Kubacki, J., Haegele, M., Parlitz, C., and Alami, R. (2005). A methodological approach relating the classification of gesture to identification of human intent in the context of human-robot interaction. IEEE International Workshop on Robot and Human Interactive Communication (ROMAN 2005), pages 371-377.

[Nicolescu and Matarić, 2003] Nicolescu, M. N. and Matarić, M. J. (2003). Natural methods for robot task learning: Instructive demonstrations, generalization and practice. In Proceedings of the 2nd Intl. Conf. AAMAS, Melbourne, Australia.

[Oviatt et al., 1997] Oviatt, S., DeAngeli, A., and Kuhn, K. (1997). Integration and synchronization of input modes during multimodal human-computer interaction. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 415-422.

[Pappu and Beardsley, 1998] Pappu, R. and Beardsley, P. A. (1998). A qualitative approach to classifying gaze direction. In Proceedings of the Third IEEE International Conference on Automatic Face and Gesture Recognition.

[Perrett and Emery, 1994] Perrett, D. and Emery, N. (1994). Understanding the intentions of others from visual signals: neurophysiological evidence. Cahiers de Psychologie Cognitive, 13(5):683-694.

[Pinhanez, 1999] Pinhanez, C. (1999). Representation and Recognition of Action in Interactive Spaces. PhD thesis, Massachusetts Institute of Technology.

[Roy et al., 2004] Roy, D., Hsiao, K., and Mavridis, N. (2004). Mental Imagery for a Conversational Robot. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 34(3).

[Scassellati, 2001] Scassellati, B. (2001). Foundations for a theory of mind for a humanoid robot. PhD thesis, Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science.

[Schaal, 1999] Schaal, S. (1999). Is imitation learning the route to humanoid robots? Trends in Cognitive Sciences, 3:233-242.

[Schulz and Gopnik, 2004] Schulz, L. and Gopnik, A. (2004). Causal learning across domains. Developmental Psychology, 40(2):162-176.

[Sebanz et al., 2006] Sebanz, N., Bekkering, H., and Knoblich, G. (2006). Joint action: Bodies and minds moving together. Trends in Cognitive Sciences, 10(2):70-76.

[Severinson-Eklundh et al., 2003] Severinson-Eklundh, K., Green, A., and Hüttenrauch, H. (2003). Social and collaborative aspects of interaction with a service robot. Robotics and Autonomous Systems, 42(3-4):223-234.

[Singer and Goldin-Meadow, 2005] Singer, M. and Goldin-Meadow, S. (2005). Children Learn When Their Teacher's Gestures and Speech Differ. Psychological Science, 16(2):85.
[Singh et al., 2005] Singh, S., Barto, A. G., and Chentanez, N. (2005). Intrinsically motivated reinforcement learning. In Proceedings of Advances in Neural Information Processing Systems 17 (NIPS).

[Siskind, 2001] Siskind, J. (2001). Grounding the lexical semantics of verbs in visual perception using force dynamics and event logic. Journal of Artificial Intelligence Research, 15(1):31-90.

[Smith et al., 2007] Smith, L., Maouene, J., and Hidaka, S. (2007). The body and children's word learning. In Plumert, J. M. and Spencer, J., editors, Emerging landscapes of mind: Mapping the nature of change in spatial cognitive development. Oxford University Press, New York.

[Stiefelhagen, 2002] Stiefelhagen, R. (2002). Tracking focus of attention in meetings. In Proceedings of the Fourth IEEE International Conference on Multimodal Interfaces (ICMI '02).

[Sutton et al., 1999] Sutton, R., Precup, D., and Singh, S. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1-2):181-211.

[Trafton et al., 2005] Trafton, J. G., Cassimatis, N. L., Bugajska, M. D., Brock, D. P., Mintz, F. E., and Schultz, A. C. (2005). Enabling effective human-robot interaction using perspective-taking in robots. IEEE Transactions on Systems, Man, and Cybernetics, 35(4):460-470.

[Weld, 1994] Weld, D. (1994). An Introduction to Least Commitment Planning. AI Magazine, 15(4):27-61.

[Wilson and Bobick, 1999] Wilson, A. D. and Bobick, A. F. (1999). Parametric hidden Markov models for gesture recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(9).

[Wimmer and Perner, 1983] Wimmer, H. and Perner, J. (1983). Beliefs about beliefs: Representation and constraining function of wrong beliefs in young children's understanding of deception. Cognition, 13:103-128.

[Zukow-Goldring, 2004] Zukow-Goldring, P. (2004). Caregivers and the education of the mirror system. In Proceedings of the International Conference on Development and Learning.