IBM Research

Extensible Language Interface for Robot Manipulation
Jonathan Connell (Exploratory Computer Vision Group), Etienne Marcheret (Speech Algorithms & Engines Group), Sharath Pankanti (IBM Yorktown), Michiharu Kudoh (IBM Tokyo), Risa Nishiyama (IBM Tokyo)

Much of "Intelligence" Based on Two Illusions

Animal part = mobility, perception, and reaction
• People flock around robots and readily anthropomorphize them
• Real-world action seems to convey a feeling of "aliveness"
• Responsiveness to changes in the environment conveys a sense of "mind"
• Key point in the embodied / situated agents viewpoint

Human part = learning by being told
• Bulk of human knowledge is contained in culture, largely passed on verbally
• No one discovers how to cook macaroni and cheese – someone explains it
• Lack of communication makes even people (e.g. foreigners) seem less "human"

Goal is to "fuse" these two parts into a harmonious whole
Analogy to a Turing machine:
• Core is a simple finite state machine controller (= language interpreter)
• Addition of the tape vastly increases computational power (= learning from language)

Required Innate Mechanisms

Segmentation
• Division of the world into spatial regions (partial segmentation okay)
• Positive space regions are objects, people, and surfaces
• Negative space regions are places and passages

Comparison
• Objects have properties, like color and size, that differ
• Objects have relations to other objects, such as position

Actions
• Operators can be indexed to operate on certain objects
• Most have expected continuations and / or end conditions

Time
• Physical motions have expected durations
• Actions can be sequenced based on completion
• More complex actions can be built from simpler ones

Language interpretation ties into all these pre-existing (animal) abilities
• Nouns, adjectives, prepositions, verbs, adverbs, conjunctions

ELI: A Fetch-and-Carry Robot

Use speech, language, and vision to learn objects & actions
• But not from the lowest level, like "what is a word" or "what visual properties signal an object"
• Build in as much as is practical

Save learning for terms not knowable a priori
• Names for particular items or rooms in a house
• How to perform special tasks like "clean up"

Example dialog (illustrating command following, verb learning, noun learning, and advice taking):
  User:  "Round up my mug."
  Robot: "I don't know how to 'round up' your mug."
  User:  "Walk around the house and look for it. When you find it, bring it back to me."
  Robot: "I don't know what your 'mug' looks like."
  User:  "It is like this <shows another mug> but sort of orange-ish."
  Robot: "OK … I could not find your mug."
  User:  "Try looking on the table in the living room."
  Robot: "OK … Here it is!"

Potential use in an eldercare scenario – a service dog with less slobber
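To make the dialog above concrete, here is a minimal, hypothetical Python sketch of the "learning by being told" loop: when a command contains an unknown verb or noun, the robot asks the user and stores the answer in its vocabulary. None of the class or function names below come from ELI; they are illustrative stand-ins only.

# Hypothetical sketch: these names are NOT from ELI's actual code.

class Vocabulary:
    """Maps taught words to either visual models (nouns) or action sequences (verbs)."""
    def __init__(self):
        self.nouns = {}   # word -> visual model (e.g. size + shape + color features)
        self.verbs = {}   # word -> list of primitive actions

    def knows(self, word):
        return word in self.nouns or word in self.verbs


def handle_command(terms, vocab, ask_user):
    """Ground every term of a parsed command, asking the user about unknown ones."""
    for word, role in terms:                 # e.g. [("round up", "verb"), ("mug", "noun")]
        if vocab.knows(word):
            continue
        if role == "verb":
            # "I don't know how to 'round up' your mug."
            vocab.verbs[word] = ask_user(f"I don't know how to \"{word}\". What should I do?")
        else:
            # "I don't know what your 'mug' looks like."
            vocab.nouns[word] = ask_user(f"I don't know what your \"{word}\" looks like. Can you show me?")
    # ... once every term is grounded, hand the command to the action sequencer ...


if __name__ == "__main__":
    vocab = Vocabulary()
    answers = iter([["walk around", "look for it", "bring it back"],   # verb definition
                    {"color": "orange-ish", "shape": "mug"}])          # noun model
    handle_command([("round up", "verb"), ("mug", "noun")],
                   vocab, ask_user=lambda prompt: next(answers))
    print(sorted(vocab.verbs), sorted(vocab.nouns))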
Capabilities Illustrated Through a 4-Part Video
• Arm and camera removed from the robot and mounted on a table
• Simplifies the problem by reducing the degrees of freedom
(Photo callouts: camera, arm, OTC medications – Advil & Gaviscon)

Multi-Modal Interaction (video part 1)
Features:
• Automatically finds objects
• Selects by position, size, color
• Grabs the selected object
• Understands pronoun reference
• Can ask clarifying questions
• Handles user pointing
• Robot points for emphasis

Noun Learning Scenario (video part 2)
Features:
• Builds visual models
• Adds new nouns to the grammar
• Identifies objects from models
• Passes objects to / from the user
Model = size + shape + colors
Matching = nearest neighbor: dist = Σ w[i] * |v[i] – m[i]|
(A minimal sketch of this matching rule appears at the end of this deck.)

Once objects have names, more properties are available
• Oversee operation of the physical robot to provide more intelligent action
(Architecture diagram: the Eli Robot at Watson – vision, ASR, objects, parser, vocabulary, visual models, kinematics, sequencer, talk – connects over the network to the Brainy Response System at Tokyo – reasoning, semantic memory, action models, lifelog, archive – exchanging context updates, retrievals, vetoes, and recommendations.)
• Could envision a similar extension using the RoboEarth online resource

Manipulation with Intelligent Backend (video part 3)
Features:
• Vetoes actions based on the DB
• Picks alternates using an ontology
• Checks for a valid dose interval
• Real-time cloud connection
(A sketch of the veto and dose-interval check appears at the end of this deck.)
(Diagram labels: "Alice", aspirin, lifelog history, DB → NO; antacid: Rolaids (requested), Tums (present); lifelog: 7:14 AM xxxxx, 8:39 AM zzzzz, 9:01 AM took Tylenol; note: Gavagai problem.)

Verb Learning Scenario (video part 4)
Features:
• Learns action sequences
• Handles relative motion commands
• Responds to incremental positioning
• Applies new actions to other objects
Example: "poke" = point 1.0, out 1.0, out -1.0
(A sketch of replaying a taught sequence appears at the end of this deck.)

ELI Arm Demos Video
Also available on YouTube: http://www.youtube.com/watch?v=M2RXDI3QYNU

Summary of Abilities

Perception
• Automatically detects and counts visual objects
• Understands colors, sizes, and overall positions

Action
• Can successfully reach for seen objects
• Can grasp and deposit objects in the real world

Language
• Parses and responds appropriately to speech commands
• Understands pointing and uses pointing itself
• Properly interprets object-passing interactions

Reasoning
• Knows its limitations about what it can see, reach, and grab
• Asks clarifying questions when there are ambiguities
• Can alter actions based on known facts, histories, and ontologies

Learning
• Acquires new visual object models and the corresponding words
• Can acquire and name a verbally trained sequence of indexical actions

Differences from some AGI work
• Complete approach attacking the core problem (language as tape)
• Concrete, physical, and implemented system (all integrated)

Extensions

What is still missing?
• Acquiring new data by observation & interaction
• Filling in holes in learned representations & procedures
• Fixing inaccuracies in taught knowledge

Free the robot from top-down imperatives!
• Add initiative – a smart assistant will look for answers itself
• Improvisation – if something does not match perfectly, try a variation
• Experiential learning – better to pick up a cup by the rim instead of the base
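As a concrete illustration of the matching rule on the noun-learning slide (dist = Σ w[i] * |v[i] – m[i]|, nearest stored model wins), here is a minimal Python sketch. The feature layout, weights, and threshold are illustrative assumptions, not ELI's actual values.

# Sketch of weighted L1 nearest-neighbor matching against stored visual models.

def weighted_l1(v, m, w):
    """Weighted L1 distance between a feature vector v and a stored model m."""
    return sum(wi * abs(vi - mi) for wi, vi, mi in zip(w, v, m))

def match_object(features, models, weights, threshold=1.0):
    """Return the name of the nearest stored model, or None if nothing is close enough."""
    best_name, best_dist = None, float("inf")
    for name, model in models.items():
        d = weighted_l1(features, model, weights)
        if d < best_dist:
            best_name, best_dist = name, d
    return best_name if best_dist <= threshold else None

if __name__ == "__main__":
    # Toy models: [size, elongation, hue] as stand-ins for size + shape + colors.
    models = {"mug":   [0.4, 0.5, 0.08],
              "Advil": [0.2, 0.3, 0.95]}
    weights = [1.0, 1.0, 2.0]          # e.g. trust color more than size
    print(match_object([0.38, 0.55, 0.10], models, weights))   # -> "mug"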
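The backend checks on the video-part-3 slide (veto a request against the lifelog, pick an alternate from the ontology) could look roughly like the following sketch. The minimum interval, ontology entries, and lifelog record are made up for illustration; this is not the Brainy Response System's actual logic.

# Illustrative sketch of the veto / dose-interval / ontology-substitution checks.

from datetime import datetime, timedelta

LIFELOG = {"Alice": [(datetime(2012, 5, 1, 9, 1), "Tylenol")]}   # made-up lifelog entry
ONTOLOGY = {"antacid": ["Rolaids", "Tums"]}                      # category -> members
MIN_INTERVAL = timedelta(hours=4)                                # assumed dose interval

def vetoed(person, drug, now):
    """True if the lifelog shows the same drug taken too recently."""
    return any(drug == d and now - t < MIN_INTERVAL
               for t, d in LIFELOG.get(person, []))

def resolve_request(person, requested, present, now):
    """Veto or substitute a requested item; 'present' is what the robot can see."""
    if vetoed(person, requested, now):
        return None                                  # backend says NO
    if requested in present:
        return requested
    for category, members in ONTOLOGY.items():       # pick an alternate sibling
        if requested in members:
            for alt in members:
                if alt in present and not vetoed(person, alt, now):
                    return alt
    return None

if __name__ == "__main__":
    now = datetime(2012, 5, 1, 10, 0)
    print(resolve_request("Alice", "Tylenol", ["Tums"], now))    # None: dose too recent
    print(resolve_request("Alice", "Rolaids", ["Tums"], now))    # "Tums" via the ontology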
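Finally, the verb-learning slide stores "poke" as the sequence point 1.0, out 1.0, out -1.0 and then applies it to other objects. Here is a minimal sketch of that replay idea; the primitive names and units are hypothetical stand-ins for ELI's real kinematics and sequencer layer.

# Sketch: a taught verb is a named list of relative-motion primitives that can be
# replayed on any target object.

TAUGHT_VERBS = {"poke": [("point", 1.0), ("out", 1.0), ("out", -1.0)]}

def run_primitive(primitive, amount, target):
    """Stand-in for the real kinematics / sequencer layer."""
    print(f"{primitive} {amount:+.1f} toward {target}")

def perform(verb, target):
    """Replay a taught action sequence on a (possibly new) object."""
    for primitive, amount in TAUGHT_VERBS[verb]:
        run_primitive(primitive, amount, target)

if __name__ == "__main__":
    perform("poke", "the Advil bottle")   # applies the learned sequence to another object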