
Modeling Changing Perspectives — Reconceptualizing Sensorimotor Experiences: Papers from the 2014 AAAI Fall Symposium
From Visuo-Motor to Language
Deepali Semwal, Sunakshi Gupta, Amitabha Mukerjee
Department of Computer Science and Engineering
Indian Institute of Technology, Kanpur
Abstract
We propose a learning agent that first learns concepts in
an integrated, cross-modal manner, and then uses these
as the semantics model to map language. We consider
an abstract model for the action of throwing, modeling the entire trajectory. From a large set of throws, we
take the trajectory images and the throwing parameters. These are mapped jointly onto a low-dimensional
non-linear manifold. Such models improve with practice, and can be used as the starting point for real-life
tasks such as aiming darts or recognizing throws by others.
How can such models be used in learning language?
We consider a set of videos involving throwing and
rolling actions. These actions are analyzed into a set of
contrastive semantic classes based on agent, action, and
the thrown object (trajector). We obtain crowdsourced
commentaries for these videos (raw text) from a number of adults. The learner attempts to associate labels
using contrastive probabilities for the semantic class.
Only a handful of high-confidence words are found, but
the agent starts off with this partial knowledge. These
are used to learn incrementally larger syntactic patterns, initially for the trajector, and eventually for full
agent-trajector-action sentences. We demonstrate how
this may work for two completely different languages
- English and Hindi, and also show how rudiments of
agreement, synonymy and polysemy are detected.
Figure 1: The system first learns a model of the throw action from a set of trajectory images and the associated motor parameters (throw angle, velocity). The correlation space
is learned as a joint manifold (right) shown with trajectories superimposed. The colour bands correspond to angles of
throw (yellow = lowest). This serves as a unified model for both planning actions and recognizing them. It is used later to distinguish throw actions and associated concepts such as agent, object thrown (trajector), etc., for acquiring language.
(section 3)
Introduction
Mapping concepts multi-modally. Consider the following situations:
(a) The agent recognizes [Sam throwing a ball to Sita]
(b) The agent understands the sentence “Sam threw the ball to Sita”
(c) The agent executes [throwing the ball to Sita].
We claim two main contributions in this work: (a) A unified abstract representation for actions. Action is a core objective in robotics, a core recognition task in vision, and what controls verb arguments in sentences (both syntax and semantics). We initially learn this abstraction as a manifold
in visuo-motor space. (b) Subsequently, using this acquired
motor model as the core semantics, we consider the agent
observing videos involving this action, and another contrastive action. We now have a group of adults describe these
videos in several languages, and find that we are able to learn
both the lexicon and syntax (including morphology) for languages as diverse as Hindi and English, from a widely varying set of descriptions of these actions.
In classical AI, (a) would be modeled by a function whose
input is visual and which classifies it as an action of class
throw(); it would be trained on a large set of tagged videos.
For the language input (b), one would parse the sentence
into constituent phrases, and a semantic lexicon might map
these to formal structures such as throw(agent: Sam, object: Ball, path: $p) ∧ goal($p, Sita). It
would be trained on a huge set of POS-, parse-, and
semantically-annotated sentences (Kwiatkowski et al.
2011), though the last may be avoided in some situations (Liang, Jordan, and Klein 2013). For (c), the robotic
agent would command a sequence of motor torques at its
joints. To learn to project the ball so that it goes to a target,
it would train by throwing the ball repeatedly until it is able
to achieve a desired path.
Copyright © 2014, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
This process makes it hard to correlate knowledge between these modalities - e.g. looking at another person
throwing short, it is hard to tell her to throw it higher, say.
It is also inefficient in terms of requiring thrice the circuitry compared to a unified system. Indeed, primate brains
seem to be operating in a more integrated manner. Motor
behaviour invokes visual simulation to enable rapid detection of anomalies, while visual recognition invokes motor
responses to check if they match. Linguistic meaning activates this wide range of modalities (Binder and Desai 2011).
Such a cross-modal model also permits affordance and intentionality judgments, a core capability for learning language (Rizzolatti and Arbib 1998).
Figure 2: Variations in the manifold according to a) angle of
projection, and b) range. (low values in yellow). For constant
velocity, the range is high towards the left, corresponding to
the “bends” in the coloured bands. This is also the region
around the 45° angle of throw (see fig. 1b).
Unified action models as manifolds. The first part of this
work (section 2) is inspired by the observation that each time the robot throws the ball (for motor learning), it also generates a visual trajectory. The motor parameters - torque history, time of release, etc. - are correlated with the visual feedback - trajectory, velocities, etc. In fact, both map to the same
low-dimensional non-linear manifold which can be discovered jointly. This is the core idea in the first part of this work.
We note that this approach is more abstract than models that actually attempt to learn throwing actions, e.g. (da
Silva, Konidaris, and Barto 2012; Neumann et al. 2014). Our
model may be thought of as a high-level summary of the
entire space of throws - given any desired trajectory, it can
come up with the motor parameters; given a motor parameter, it can “imagine” the throw. The model can be applied
to either visual recognition or to motor tasks. It can also be
used to assess other throws, to catch a ball, say, or to guess
the intention of a person throwing something.
The semantic models learned here are completely unsupervised, whereas other models with similar motivation often use hand-coded intermediate structures to model the semantics (Roy 2005). Another distinguishing feature is that language learning is incremental and does not come to a closure at any point. At every stage, the learner has only a “best guess” as to how a phrase or sentence may be used. Thus, another name for this approach
may be dynamic NLP. Also, after an initial bootstrapping
phase driven by this multi-modal corpus, learning can continue to be informed by text alone, a process well-known
from the rapid vocabulary growth after the first phase of language acquisition in children (Bloom 2000).
Visuo-motor Pattern Discovery
Learning a few visuo-motor tasks is among our agent’s
very first achievements. Let us consider the act of throwing
a ball. Our learner knows the motor parameters of the throw
as it is being thrown - here we focus not on the sequence of
motor torques, but just the angle and velocity at the point of
release.
Each trajectory gives us an image (fig. 1). We are given a dataset of images X = x1 . . . xn, drawn from the space I of all possible throw-trajectory images. Each image has 100×100 pixels, i.e. I ⊂ R^(10^4). Though the set of possible images is enormous, if we assign pixels randomly, the probability that the resulting image will be a [throw] trajectory is practically zero. Thus, the subspace I of [throw] images is a vanishingly small part of all possible images.
Next we ask: what types of changes can we make to an image while remaining within this subspace? In
fact, since each throw varies only on the parameters (θ, v),
there are only two ways in which we can modify the images while remaining locally within the subspace I of throw
trajectory images. This is the dimensionality of the local
tangent space at any point, and by stitching together these tangent spaces we can model the entire subspace as a non-linear manifold of the same intrinsic dimensionality. The structure of this image manifold exactly mimics the structure of the motor parameters (the motor manifold). The two can be mapped to a single joint manifold (Davenport et al. 2010), which can be discovered using standard non-linear dimensionality reduction.
Bootstrapping a lexicon and syntax. For the language
task (section 3), we consider the system, which already has
a rudimentary model for [throw], being exposed to a set of
narratives while observing different instances of throwing.
The narratives are crowdsourced from speakers for a set of
synthetic videos which have different actions (throw or roll),
agents (“dome” or “daisy” - male/female), thrown object or
trajector (ball or square), colour of trajector (red or blue) and
path (on target, or short, or long). We work with transcribed
text, and not with direct speech, so we are assuming that our
agent is able to isolate words in the input.
An important aspect of working with crowdsourced data
is that descriptions of the same action vary widely. For the
very beginning learner, we need a more coherent input, and
we first identify a small coherent subset (called the family
lect) on which initial learning is done. This is then extended
to the remaining narratives (the multi-lect).
The approach learns fragments of the linguistic discourse
incrementally. Starting with a few high-confidence words
(fig. 4), it uses these to learn larger phrases, interleaving the
partial syntax with the semantic knowledge to build larger
syntactic patterns.
This partial-analysis based approach is substantially
different from other attempts at grounded modeling in
NLP (Madden, Hoen, and Dominey 2010; Kwiatkowski et al. 2011; Nayak and Mukerjee 2012). It is also different from attempts that learn only nouns and their referents (Steels and Loetzsch 2012).
Table 1: Sum Absolute Error (Velocity) falls as N increases
Figure 3: Throwing darts. a) Two trajectories that hit the
board near the middle. b) A band for “good throws” in the
2D manifold of projectile images, with a tighter curve for
the “best” throws. This model can now be used to make suggestions such as “throw it higher”.
Figure 4: High-confidence lexeme discovery: contrastive scores and dominance ratios for top lexemes. Highlighted squares show high-confidence lexemes (ratio more than twice the next best).
The joint manifold is discovered using ISOMAP, a standard non-linear dimensionality reduction algorithm. In fig. 2, we show the resulting manifold obtained using a Hausdorff distance metric (Huttenlocher, Klanderman, and Rucklidge 1993): h(A, B) = max_{a∈A} min_{b∈B} ‖a − b‖, symmetrized. The same idea can be used to find correlations in any system involving a motion or a change of initial conditions (Algorithm 1).
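As an illustration of this construction (a minimal sketch, not the authors' Algorithm 1), the joint manifold might be computed from binary trajectory images and (angle, velocity) release parameters as follows; the array names, the equal-weight mixing of the two distance matrices, and the use of scikit-learn's Isomap with a precomputed metric are all assumptions:

```python
# Sketch: joint visuo-motor manifold from trajectory images + motor parameters.
# Assumes `images` is an (n, 100, 100) array of binary trajectory images with
# non-empty trajectories, and `motor` is an (n, 2) array of (angle, velocity).
import numpy as np
from scipy.spatial.distance import directed_hausdorff
from sklearn.manifold import Isomap

def hausdorff(img_a, img_b):
    """Symmetrized Hausdorff distance between the 'on' pixels of two images."""
    pts_a, pts_b = np.argwhere(img_a > 0), np.argwhere(img_b > 0)
    return max(directed_hausdorff(pts_a, pts_b)[0],
               directed_hausdorff(pts_b, pts_a)[0])

def joint_manifold(images, motor, dim=2, k=8, alpha=0.5):
    n = len(images)
    d_img = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d_img[i, j] = d_img[j, i] = hausdorff(images[i], images[j])
    # Euclidean distances between (angle, velocity) pairs; both matrices are
    # rescaled to [0, 1] before mixing (the weighting alpha is an assumption).
    d_mot = np.linalg.norm(motor[:, None, :] - motor[None, :, :], axis=-1)
    d_joint = alpha * d_img / d_img.max() + (1 - alpha) * d_mot / d_mot.max()
    iso = Isomap(n_neighbors=k, n_components=dim, metric="precomputed")
    return iso.fit_transform(d_joint)   # (n, dim) latent coordinates
```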
As in those studies (Müller and Sternad 2009; da Silva, Konidaris, and Barto 2012), our model also ignores lateral deviations, and we consider successful trajectories as those
that intersect the dartboard near its center (fig. 3a). In order
to throw the dart, the system focuses on a “good zone” in the latent space (central band in fig. 3b), preferring throws that are more nearly orthogonal to the board at contact (short curved axis). Of course, actual task performance will improve with experience (more data points in the vicinity of the goal). Such an abstract model only provides a starting point.
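A minimal sketch of how such a “good zone” might be picked out, assuming standard projectile kinematics and an idealized dart setup (the board distance, bullseye height and tolerance below are illustrative values, not the authors' parameters):

```python
# Sketch: mark the latent-space points whose (angle, velocity) parameters make
# the projectile pass close to the board centre (cf. fig. 3b). Angles in radians.
import numpy as np

G = 9.81                                  # gravity, m/s^2
BOARD_X, BOARD_Y, TOL = 2.37, 1.73, 0.05  # assumed dart geometry, metres

def height_at_board(theta, v, x=BOARD_X):
    """Projectile height when it reaches the board plane (release at origin)."""
    return x * np.tan(theta) - G * x**2 / (2.0 * (v * np.cos(theta))**2)

def good_zone(latent, motor):
    """Latent coordinates of the throws landing within TOL of the bullseye."""
    theta, v = motor[:, 0], motor[:, 1]
    hits = np.abs(height_at_board(theta, v) - BOARD_Y) < TOL
    return latent[hits]
```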
Bootstrapping Language
In order to learn language, we create a set of 15 situations
involving the actions [throw] and [roll] (one of the agents is
shown in fig. 1a). The videos of these actions are then put
up on our website; commentaries were contributed largely by students on the campus network. We obtain 18 Hindi and 11 English transcribed narratives.
The compiled commentaries vary widely in lexical and
constructional choices - e.g. one subject may say "daisy
throws a blue square", whereas another describes the same
video as "now daisy threw the blue box which fell right
on the mark". Differences in linguistic focus cause wide
changes in lexical units and constructions. As described earlier, we first select a coherent subset of narratives - those that
have a more consistent vocabulary, and identify these as the
family lect.
The abstract motor model already learned for [throw] is used as a classifier manifold: if a new action falls in this class (i.e. it can be modelled in terms of a local neighbourhood on the manifold), it is marked as a throw. The other class, [roll], is not modelled explicitly, but is taken to be the negation. In this case, the two classes can be separated quite easily.
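One plausible way to realize this neighbourhood test (a sketch under assumptions, not the authors' implementation) is to accept a new trajectory as [throw] when its nearest known throw trajectories are close enough; the distance function, k and threshold below are placeholders:

```python
# Sketch: classify a new trajectory as [throw] if it is well explained by a
# local neighbourhood of known throw trajectories; otherwise treat it as the
# contrast class ([roll]). `dist` could be the symmetrized Hausdorff distance
# from the earlier sketch; k and threshold are assumed, data-dependent values.
import numpy as np

def is_throw(query_img, throw_imgs, dist, k=5, threshold=12.0):
    d = sorted(dist(query_img, img) for img in throw_imgs)
    return float(np.mean(d[:k])) < threshold
```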
This is a generic algorithm for discovering patterns in visuo-motor activity. Initially, its estimates of how to achieve a throw are very poor, but they improve with experience. We model this process in Table 1: as the number of inputs N increases, the error in predicting a throw decreases - at N=100, the error is nearly twice that at N=1000.
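An estimate of the kind summarized in Table 1 could be produced, for instance, by predicting the release velocity of a held-out throw from its nearest training trajectories and accumulating the absolute error; this is only an illustrative sketch, with the nearest-neighbour rule, the choice of k, and the distance function all assumed:

```python
# Sketch: sum of absolute velocity-prediction errors over a held-out set,
# using k-nearest-neighbour regression on trajectory-image distances.
import numpy as np

def predict_velocity(query_img, train_imgs, train_vel, dist, k=5):
    """Mean release velocity of the k training throws nearest to the query."""
    d = np.array([dist(query_img, img) for img in train_imgs])
    return np.asarray(train_vel)[np.argsort(d)[:k]].mean()

def sum_abs_error(test_imgs, test_vel, train_imgs, train_vel, dist, k=5):
    preds = [predict_velocity(img, train_imgs, train_vel, dist, k)
             for img in test_imgs]
    return float(np.sum(np.abs(np.asarray(preds) - np.asarray(test_vel))))
```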
Using the model to throw darts
The visuo-motor manifold model developed here is not task-specific, but can be applied to many tasks involving projectile motion: catching a ball, throwing at a basket, darts, tennis, etc. As an example we consider darts, a problem often addressed in the psychology and robotics literature (Müller and Sternad 2009; da Silva, Konidaris, and Barto 2012).
Table 2: Initial constructions learned from the trajector-delimited strings
Table 3: Syntax for trajector - iteration 1, family lect
In addition to the action being recognized, we also assume that the system can identify the agent who throws, the object thrown, and its attributes - shape (square vs circle) and colour (red vs blue). These
contrasts - e.g. the two agents, [Dome] with a moustache (fig. 1, male) and [Daisy] with a ponytail - are crucial to word learning. (Note: we use square brackets to indicate a concept, the semantic pole of a linguistic symbol.) This contrastive approach has been suggested as a possible strategy applied by child learners (Markman 1990). Given two
contrasting concepts c1, c2, we compute the empirical joint probabilities of a word (or later, n-gram) ωi with each concept, and take their ratio as the contrast score (given below).
These preliminary word-order patterns can be used to induce broader regularities. We compare two available unsupervised grammar-induction systems - fast unsupervised incremental parsing (Seginer 2007) and ADIOS (Solan et al. 2005); the results shown here adopt the latter because of its more explicit handling of discovered lexical classes.
The initial patterns learned for the trajector in this manner
are shown in Table 2. These are generalized using the phrase-substitution process to yield the new lexemes [ball] in Hindi and “blue” in English (Table 3). This is done based on the family lect (Figure
4), and the filtered sub-corpus is used to learn patterns and
equivalence classes.
Sωi,c1 = P(ωi, c1) / P(ωi, c2)
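A minimal sketch of this contrastive scoring, assuming the narratives are available as (token list, concept label) pairs and using per-narrative word presence with a small smoothing constant (both assumptions on our part):

```python
# Sketch: contrast scores S(w, c1) for every word, from narratives labelled
# with one of two contrasting concepts c1, c2 (e.g. [throw] vs [roll]).
from collections import Counter

def contrast_scores(narratives, c1, c2, eps=1e-3):
    count = {c1: Counter(), c2: Counter()}
    total = {c1: 0, c2: 0}
    for tokens, concept in narratives:
        if concept in count:
            count[concept].update(set(tokens))  # presence per narrative
            total[concept] += 1
    vocab = set(count[c1]) | set(count[c2])
    return {w: (count[c1][w] / max(total[c1], 1) + eps) /
               (count[c2][w] / max(total[c2], 1) + eps)
            for w in vocab}
```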
The system now knows patterns for [red square], say, and
it now pays attention to situations where the pattern is almost present, except for a single substitution. It can look
into the other semantic classes as well (e.g. [blue square],
[red ball]). In most of these instances (Fig. 4) we already
have partial evidence for these units from their contrastive
scores. Now if we discover new substitution phrases p1 in
the position of p0, referring to a concept in the same semantic class (e.g. [ball] for [square]), and if p1 is already
partially acceptable for [ball] based on contrastive probability, then p1 becomes an acceptable label for this semantic
concept. This process iterates - new lexemes are used to induce new patterns, and then further new lexemes, until the
patterns stabilize. This is then extended to the entire corpus
beyond the small family lect; results are shown in Table 4.
The table captures a reasonable diversity of Noun Phrase
patterns describing coloured objects. Note that words like
“red” and “ball” have also become conventionalized in
Hindi. We also observe that the token niley appears in the
4-word pattern niley rang kaa chaukor and is highly confident even from a single occurrence; this reflects the fast
mapping process observed in child language acquisition after the initial grounding phase (Bloom 2000). As the iteration progresses, these patterns are used for further enhancing
the learner’s inventory of partial grammar.
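One way this substitution step could be realized (a sketch under assumptions, not the authors' code): strings that match a known pattern everywhere except at a single position propose the differing token as a new label, provided it already carries some contrastive evidence; the threshold is illustrative:

```python
# Sketch: one iteration of substitution-based lexeme learning.
def substitution_lexemes(pattern, strings, partial_scores, min_score=1.2):
    """pattern, strings: lists of tokens; partial_scores: word -> contrast score."""
    new_labels = set()
    for tokens in strings:
        if len(tokens) != len(pattern):
            continue
        diffs = [i for i, (p, t) in enumerate(zip(pattern, tokens)) if p != t]
        if len(diffs) == 1 and partial_scores.get(tokens[diffs[0]], 0.0) >= min_score:
            new_labels.add(tokens[diffs[0]])
    return new_labels
```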
We also compare the corpus of narratives here with that
from a larger unannotated corpus (Brown Corpus for English
and the CIIL/IITB Corpus for Hindi). We look for words that
are more frequent in the domain input than in a general situation. This rules out many frequent words like a, and, the for English, and their counterparts [is, of] for Hindi. The small set of high-confidence words - whose contrastive probability is more than
twice the next best match - are highlighted in (Fig. 4).
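A small sketch of these two filters, assuming simple token lists for the domain and background corpora and the score tables from the earlier sketch; the factor of two comes from the text, everything else is an assumption:

```python
# Sketch: (i) drop words that are not more frequent in the domain input than in
# a large background corpus; (ii) keep, per concept, only words whose contrast
# score is at least twice the next best (the "dominance ratio" of fig. 4).
from collections import Counter

def domain_salient(domain_tokens, background_tokens, ratio=1.0):
    dom, bg = Counter(domain_tokens), Counter(background_tokens)
    n_dom, n_bg = sum(dom.values()), sum(bg.values())
    return {w for w in dom
            if dom[w] / n_dom > ratio * (bg[w] + 1) / (n_bg + 1)}

def high_confidence(scores_by_concept, margin=2.0):
    """scores_by_concept: {concept: {word: score}} -> {concept: best word}."""
    picks = {}
    for concept, scores in scores_by_concept.items():
        ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
        if len(ranked) > 1 and ranked[0][1] >= margin * ranked[1][1]:
            picks[concept] = ranked[0][0]
    return picks
```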
Interleaving of Word / Syntax learning
Once the system has a few grounded lexemes, we proceed
to discovering syntactic constructions. At the start, we try
to learn the structure of small contiguous elements. One assumption we use here is that concepts that are very tightly
bound (e.g. object and its colour) are also likely to appear
in close proximity in the text (Markman 1990). Another assumption (often called syntactic bootstrapping), is used for
mapping new phrases and creating equivalence classes or
contextual synonyms. This says that given a syntactic pattern, if phrase p1 appears in place of a known phrase p0, and
if this substitution is otherwise improbable (e.g. the phrase is
quite long), then p0, p1 are synonyms (if in the same semantic class) or they are in the same syntactic lexical category.
We find that one type of trajector (e.g. “ball”) and its
colour attribute (e.g. “lAl”, [red]) have been recognized.
So the agent pays more attention to situations where these
words appear. Computationally this is modelled by trying
to find patterns among the strings starting and ending with
(delimited by) one of these high-confidence labels (e.g. “red
coloured ball”). This delimited corpus consists of strings related to a known trajector-attribute complex. Within these tight fragments, we show that standard grammar-induction procedures are able to discover preliminary word-order patterns.
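As a sketch of how this delimited sub-corpus could be constructed (the names and the first-to-last span choice are our assumptions): for each narrative sentence, keep the span bounded by the first and last occurrence of a known high-confidence label, and hand these spans to the grammar-induction step:

```python
# Sketch: extract spans delimited by known high-confidence labels,
# e.g. "red ... ball", as input strings for grammar induction (ADIOS).
def delimited_spans(sentences, known_labels):
    spans = []
    for tokens in sentences:
        idx = [i for i, t in enumerate(tokens) if t in known_labels]
        if len(idx) >= 2:
            spans.append(tokens[idx[0]: idx[-1] + 1])
    return spans
```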
Verb phrases and sentence syntax
Having learned the syntax for a trajector, this part of the input is now known with some confidence, and the learner can
venture out to relate the agent to the action and path. In this
study, we failed to find any high-confidence lexemes related
to path, hence we were not able to bootstrap that aspect. In
the following we restrict ourselves to patterns for the semantic classes [agent], [action], [trajector].
At this stage, the agent notes that many of the words seem rather similar - e.g. “sarkAyA” and “sarkAyI” in Hindi, or “throwing” and “thrown” in English.
Figure 5: Verb learning. Recomputed contrast scores after
morphological clustering. Note that “threw” now appears as
a high-confidence label.
Table 4: Learned constructions pertaining to trajectors: the coloured units are the initially grounded lexemes
Table 6: Sentence syntax discovered - iteration 1 - family lect
These are used to obtain a filtered corpus. To simplify the task of discovering verb-phrase syntax, the known trajector syntax (Table 4) is treated as a unit (denoted [TRJ]), and the concept of agent ([daisy] or [dome]) is denoted [AGT]. Results of initial patterns, obtained from the two known action lexemes, are shown in Table 5. Note that pattern matching with ADIOS now discovers the class {a, the} as equivalence class E23 (replaced hereon).
Next, we interleave this syntactic discovery with lexical discovery, permitting also bigram substitutions. Thus
“threw” and “rolled” are found to be substitutable by “has thrown” / “is throwing”, and by “has slid”, respectively. This gives us
the more general results of Table 6. Again the system iterates
over the corpus till the discovery of new patterns converges (Table 7). An interesting observation is that the Hindi data finds a phrase in the TRJ position, nilaa chaukor [blue square]. This had not been learned in the trajector iteration, since nilaa was less frequent.
Thus we see that with this approach we are able to acquire
several significant patterns. These patterns apply to only a
single action input, and for a very limited set of other participants. But it would be reasonable to say that the agent may observe similar structures elsewhere - e.g. in a context involving hitting: if we have the sentence “Daisy hit Dome”, then the agent may use the syntax [AGT] [verb] [TRJ] to extend to this context and guess that “hit” may be a verb and “Dome” the object of this action (which it knows from the semantics).
Table 5: Initially acquired sentence constructions
A text-based morphological similarity analysis reveals several clusters with alternations at the end of words (Fig. 5). To quantify this aspect, we compute a normalised Levenshtein distance between word types. Since our input is text, we limit ourselves to analysis based on alphabetic patterns as opposed to phonemic maps. Similar words are clustered using a normalized similarity index. Thus we have twelve type instances of the (-aa) ~ (-ii) alternation, and seven types for (-aa) ~ (-e). We find that these variants - e.g. “sarkaayaa”, “sarkaayii”
- appear in the same syntactic and semantic context. These
clusters are now used to further strengthen the lexeme and
action association.
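A minimal sketch of such a clustering, assuming a simple normalized edit distance and greedy single-link grouping (both the threshold and the grouping strategy are our assumptions, not the authors' exact procedure):

```python
# Sketch: cluster word types whose normalized Levenshtein distance is small,
# so that variants like "sarkaayaa"/"sarkaayii" or "throwing"/"thrown"
# end up in the same cluster.
def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def morph_clusters(words, max_dist=0.3):
    clusters = []
    for w in sorted(set(words)):
        for cl in clusters:
            if any(levenshtein(w, v) / max(len(w), len(v)) <= max_dist for v in cl):
                cl.add(w)
                break
        else:
            clusters.append({w})
    return clusters
```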
Again, we use an iterative process, starting with grounded
unigrams, moving a level up to learn simple word-order
patterns, learning alternations and lexical classes through
phrase substitution, and so on to acquire a richer lexicon and
syntax. The learner has the concept of agent and has associated the words “daisy” and “dome”. One action word is
known for both Hindi and English (Fig. 5), and these are used to obtain a filtered corpus for discovering verb-phrase syntax.
Will the approach scale to other actions? The first task is
to demonstrate scalability, by including actions other than
[throw]. One of our main strengths is that neither the visuo-motor nor the linguistic components had any kind of annotation, so the training dataset needed for this should be relatively easy to generate compared to treebanks and semantically annotated data.
Another problem would be to use the motor model to
guess intentions and formulate sentences such as “try to
throw a little higher”. In the present work, the intention can
plausibly be captured, but we are not able to handle the linguistic part yet. This is because the last part of the sentences in these narratives - the part that encodes the quality of the throw (“overshot the mark”, “hit the target exactly”) - exhibits
too much variation to be learned from this small corpus, but
it should be possible to do this with a larger dataset.
A third area would be to extend this work into metaphor,
e.g. sentences such as (d) “Sam threw the searchlight beam
at the roof.”, or (e) "Sam threw her a searching look." These
can be related to the motor model of [throw], as a motion.
However, these are typically never encountered in a grounded manner, but quite often in language-only input. Linking this
language input to the existing grounded models would open
up new vistas for concept and language discovery.
References
Binder, J. R., and Desai, R. H. 2011. The neurobiology of
semantic memory. Trends in cognitive sciences 15(11):527–
536.
Bloom, P. 2000. How Children Learn the Meaning of Words.
MIT Press.
da Silva, B. C.; Konidaris, G.; and Barto, A. G. 2012. Learning parameterized skills. In Proc. ICML.
Davenport, M. A.; Hegde, C.; Duarte, M. F.; and Baraniuk,
R. G. 2010. High-dimensional data fusion via joint manifold learning. In AAAI Fall 2010 Symposium on Manifold
Learning, Arlington, VA.
Huttenlocher, D. P.; Klanderman, G. A.; and Rucklidge,
W. J. 1993. Comparing images using the Hausdorff distance.
IEEE PAMI 15(9):850–863.
Kwiatkowski, T.; Zettlemoyer, L.; Goldwater, S.; and Steedman, M. 2011. Lexical generalization in CCG grammar induction for semantic parsing. In Proc. EMNLP, 1512–1523.
Liang, P.; Jordan, M. I.; and Klein, D. 2013. Learning
dependency-based compositional semantics. Computational
Linguistics 39(2):389–446.
Madden, C.; Hoen, M.; and Dominey, P. 2010. A cognitive
neuroscience perspective on embodied language for humanrobot cooperation. Brain and Language 112(3):180–188.
Markman, E. M. 1990. Constraints children place on word
meanings. Cognitive Science 14(1):57–77.
Müller, H., and Sternad, D. 2009. Motor learning: changes
in the structure of variability in a redundant task. In Progress
in motor control. 439–456.
Nayak, S., and Mukerjee, A. 2012. Grounded language acquisition: A minimal commitment approach. In Proc. COLING 2012, 2059–76.
Table 7: Learned constructions over trajector phrases. The coloured units are the initially grounded lexemes
Thus, once a few patterns are known, it becomes easier to learn more and more patterns; this is the fast-mapping stage we have commented upon earlier.
Note that our lexical categories are not necessarily the same as those of a linguist. But those categories are often a matter of fierce contention, and most human language users are blissfully unaware of them; perhaps our system can also do quite well without quite mastering them. At the same time, most of the structures - e.g. Adj-N (red ball, laal chaukor), Art-N (E23 ball), etc. - are indeed discovered. It also discovers agreement between the unit “chaukor” [square, M] and verbs ending in -aa. However, following
the usage-based approach (Tomasello and Tomasello 2009),
we would be inclined to view these constructions as reducing the description length needed to code for the strings arising in this context.
Conclusion
The above analysis provides a proof of concept that a system starting with very few priors can (a) combine motor, visual (and possibly other) modalities into an integrated model, and (b) use this rudimentary concept knowledge to
bootstrap language. We also note that such a system can
then learn further refinements to this concept space using
language alone.
We believe that this novel idea of linking modalities will
open up a number of new areas in AI. We mention three
possible directions.
Neumann, G.; Daniel, C.; Paraschos, A.; Kupcsik, A.; and
Peters, J. 2014. Learning modular policies for robotics.
Frontiers in Computational Neuroscience 8:62.
Rizzolatti, G., and Arbib, M. A. 1998. Language within our
grasp. Trends in neurosciences 21(5):188–194.
Roy, D. 2005. Semiotic schemas: A framework for grounding language in action and perception. Artificial Intelligence
167(1-2):170–205.
Seginer, Y. 2007. Fast unsupervised incremental parsing. In
ACL Ann. Meet, volume 45, 384.
Solan, Z.; Horn, D.; Ruppin, E.; and Edelman, S. 2005. Unsupervised learning of natural languages. PNAS 102(33):11629–34.
Steels, L., and Loetzsch, M. 2012. The grounded naming
game. Experiments in Cultural Language Evolution. Amsterdam: John Benjamins.
Tomasello, M. 2009. Constructing a Language: A Usage-Based Theory of Language Acquisition.
Harvard University Press.