Linking Vision, Brain and Language
Jinyong Lee
Computer Science Department, University of Southern California
4/13/2015

Part I: Hearing Gazes – eye movements and visual scene recognition
An Unexpected Visitor (I.E. Repin, 1884-1888) [2]

SemRep (Semantic Representation) is a hierarchically organized, graph-like representation for encoding semantics and concepts that are possibly anchored on a visual scene.
(Figure: a SemRep graph over the scene – WOMAN, HITTING (action) and MAN as nodes; objects or actions serve as nodes, and edges specify relations between nodes; a BLUE CLOTHES sub-SemRep is embedded in the WOMAN node, i.e. another whole SemRep structure embedded in a node.)
Original image from: "Invisible Man Jangsu Choi", Korean Broadcasting System (KBS) [3]

SemRep is proposed as a 'bridge' between the vision system and the language system.
"A woman in blue hits a man" / "A woman hits a man and she wears a blue dress" / "A man is hit by a woman who wears a blue dress" / "A pretty woman is hitting a man"
• Dynamically changing (even for static scenes)
• Only cognitively important entities are encoded (differing case by case)
• Temporarily stored in working memory (WM) [4]

What's happening?
http://icanhascheezburger.files.wordpress.com/2008/10/funny-pictures-fashion-sense-cat-springs-into-action.jpg [5]

A single SemRep structure usually encodes a cognitive event that corresponds to a 'minimal subscene' (Itti & Arbib, 2006), which can later be extended into an 'anchored subscene'.
1. First, the man's face captures attention (human faces are cognitively salient): a minimal subscene
2. Then the high contrast of the cat's paw captures attention; action recognition is performed, which biases the attention system onto the agent (the cat)
3. The cat on the roof is found as the agent of the paw action: incremental building
4. This completes a bigger subscene, building the Agent-Action-Patient propositional structure: an anchored subscene [6]

The entities encoded in SemRep are NOT logical but perceptually grounded symbolic representations.
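As a concrete illustration, a SemRep-style graph can be sketched in code. This is a hypothetical Python sketch (the class and field names are mine, not from the actual SemRep implementation): objects and actions are nodes, labeled edges express relations, and a node may embed a whole sub-SemRep, which gives the hierarchy.

```python
# Hypothetical sketch of a SemRep-style graph; names are illustrative only.
from dataclasses import dataclass, field

@dataclass
class SemRepNode:
    concept: str                      # e.g. "WOMAN", "HIT"
    kind: str                         # "object" or "action"
    subgraph: "SemRep | None" = None  # an embedded sub-SemRep (hierarchy)

@dataclass
class SemRep:
    nodes: dict = field(default_factory=dict)  # name -> SemRepNode
    edges: list = field(default_factory=list)  # (src, relation, dst) triples

    def add(self, name, concept, kind, subgraph=None):
        self.nodes[name] = SemRepNode(concept, kind, subgraph)

    def relate(self, src, relation, dst):
        self.edges.append((src, relation, dst))

# "A woman in blue hits a man": the WOMAN node embeds its own sub-SemRep
# for the wearing-blue-clothes detail, as on the slide.
detail = SemRep()
detail.add("wear", "WEAR", "action")
detail.add("clothes", "BLUE_CLOTHES", "object")
detail.relate("wear", "patient", "clothes")

scene = SemRep()
scene.add("woman", "WOMAN", "object", subgraph=detail)
scene.add("hit", "HIT", "action")
scene.add("man", "MAN", "object")
scene.relate("hit", "agent", "woman")
scene.relate("hit", "patient", "man")
```

Note that the same structure supports all four verbalizations on the slide: which edges and sub-SemReps get expressed, and in what order, is left to the language system.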
Theories and findings on conceptual deficits (Damasio, 1989; Warrington & Shallice, 1984; Caramazza & Shelton, 1998; Tyler & Moss, 2001) suggest modality-specific representations distributed categorically over the brain.
Perceptual Symbol System (PSS) (Barsalou, 1999): stimulus inputs are captured in modality-specific feature maps, then integrated in an association area; re-enactment (simulation) of the learned concept CAR.
Figure 2. Barsalou et al., 2003 – learning (encoding) the concept CAR; representing the encoded concepts as perceptual symbols (SemRep) [7]

The schematic structure (relations, hierarchies, etc.) of SemRep can be viewed as a cross-modal association between very high-level sensori-motor integration areas in the brain.
Conceptual Topography Theory (CTT) (Simmons & Barsalou, 2003): hierarchically organized convergence zones – the higher, the more symbolic and amodal.
Figure 3. Simmons & Barsalou, 2003 [8]

Since SemRep is still at a hypothetical stage, an eye-tracking study (which, however, only captures 'overt' attention) has been designed and conducted to evaluate its validity.
Capturing human eye movements and verbal descriptions (via microphone) simultaneously while showing images/videos, in order to investigate the temporal relationship between eye movements and the verbal description [9]

Four main hypotheses are proposed, and two types of tasks are suggested in order to verify each hypothesis.
Hypothesis I – Mental Preparation
Hypothesis II – Structural Message
Hypothesis III – Dynamic Interaction
Hypothesis IV – Spatial Anchor
Task I – "describe what you are seeing as quickly as possible"
Task II – "describe what you have seen during the scene display"
Table 1. Possible experiment configurations between hypotheses and tasks/visual scenes [10]

'Task I' is to investigate the eye movements in order to estimate the possible SemRep structures built and their connection to the spoken description.
Instruction: "Describe what you are seeing as quickly as possible" – real-time description (with speech recording) during the visual scene display.
Expectations:
1) There would be rich dynamic interaction between the eye movements and the utterance
2) Having the subjects speak as quickly as possible would reveal the internal structure they are using for speaking
NOTE: Only experiments for this task, with static images, have been conducted so far [11]

'Task II' is to investigate the eye movements after the scene display, in order to evaluate the validity of the spatial component of SemRep.
Instruction: "Describe what you have seen during the previous scene display" – post-event description (with speech recording) after the visual scene display.
Expectations:
1) The eye movements will faithfully reflect the remembered locations of the objects being described
2) The sentence structure of the description would be more well-formed, and the order of description and the order of fixations would be less congruent [12]

The Mental Preparation Hypothesis (Hypothesis I) asserts that gazes are not merely meaningless gestures but actually reflect attention and mental effort in producing the verbal description.
1) Speakers plan (build an internal representation of) what to say while gazing, and only after it is relatively complete do they begin speaking (Griffin, 2005)
2) There is a tight correlation between gazing and description: gazing usually comes 100 ms ~ 300 ms earlier than the actual description (Griffin, 2001)
An overlapping utterance pattern for two consecutive nouns A and B – Figure 2. Griffin, 2001 [13]

The experiment results suggest that there is a very strong correlation between the eye movements and the verbal description, such that gazes come before the utterances, indicating what is to be spoken next.
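The lag measurement behind this kind of claim can be sketched as follows. This is a hypothetical Python sketch (the event format and the matching rule are illustrative, not the actual analysis pipeline): for each spoken mention, find the most recent earlier fixation on the same referent and take the difference between the onsets.

```python
# Hypothetical sketch of the gaze-to-speech lag measurement.
def gaze_speech_lags(fixations, mentions):
    """fixations and mentions are lists of (onset_sec, referent) pairs.
    For each mention, find the latest fixation on the same referent
    that started no later than the mention, and return the lags."""
    lags = []
    for m_onset, referent in mentions:
        prior = [f_onset for f_onset, r in fixations
                 if r == referent and f_onset <= m_onset]
        if prior:
            lags.append(m_onset - max(prior))  # gaze leads speech by this much
    return lags

# Toy data in the spirit of the transcripts above (times in seconds).
fixations = [(3.5, "shoes"), (6.8, "shoes"), (2.3, "ball")]
mentions = [(7.33, "shoes"), (2.82, "ball")]
print(gaze_speech_lags(fixations, mentions))  # lags of roughly half a second
```

A positive lag means the gaze preceded the utterance; the word-level studies cited above report roughly 0.1–0.3 s, while the clause-level data here show longer leads.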
Eye gazes 'right before' verbal descriptions:
01_soldiers [fairman] 4.08~5.2:: "one soldier smiles at him" – gaze on a soldier's face; 7.33~8.79:: "he's got sandals and shoes on" – gaze on a person's shoes
05_basketball [fairman] 2.82~3.52:: "he has the ball" – gaze on the ball
06_conversation [fairman] 4.08~5.01:: "reading books" – gazes on books; 14.86~16.62:: "the woman on the right wears boots" – gazes on boots and a woman
…and many more!
Other observations:
1) The delay between gazes and descriptions seems to be a bit longer than suggested by Griffin (2005; 2001) – about 500 ms ~ 1,500 ms; this might be because Griffin's measurements are at the word level while ours are at the sentential/clausal level
2) There seems to be parallelism between the vision system (gaze) and the language system (description), since subjects had already moved their gazes to other locations while speaking about an object/event [14]

The Structural Message Hypothesis (Hypothesis II) asserts that the internal representation used by speakers is an organized structure that can be directly translated into a proposition or clause.
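If the structural message is an organized, SemRep-like structure, then the same structure can be read off in more than one clause order. A hypothetical Python sketch (the proposition format and word mappings are mine, for illustration only, not part of TCG): one agent-action-patient proposition plus one attached detail, linearized two ways.

```python
# One propositional structure, two read-off orders (hypothetical sketch).
main = ("WOMAN", "HIT", "MAN")      # agent, action, patient
detail = ("WOMAN", "WEAR", "HAT")   # extra property attached to the agent

words = {"WOMAN": "a woman", "MAN": "a man", "HIT": "hits",
         "WEAR": "wears", "HAT": "a hat"}
participles = {"WEAR": "wearing"}

def as_two_clauses(main, detail):
    """Conjoin the detail as a second, separate clause."""
    agent, action, patient = main
    _, d_action, d_patient = detail
    return (f"{words[agent]} {words[action]} {words[patient]} "
            f"and she {words[d_action]} {words[d_patient]}")

def as_embedded_clause(main, detail):
    """Embed the detail as a participial modifier of the agent."""
    agent, action, patient = main
    _, d_action, d_patient = detail
    return (f"{words[agent]} {participles[d_action]} {words[d_patient]} "
            f"{words[action]} {words[patient]}")

print(as_two_clauses(main, detail))      # a woman hits a man and she wears a hat
print(as_embedded_clause(main, detail))  # a woman wearing a hat hits a man
```

The hypothesis predicts that these two linearizations of the same message should go with measurably different eye movement patterns.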
1) van der Meulen (2003): speakers outline messages (a semantic representation similar to SemRep, used for producing utterances) approximately one proposition or clause at a time
2) Itti & Arbib (2006): for whatever reason, an agent, action or object may attract the viewer's attention; attention is then directed to seek to place this 'anchor' in context within the scene, completing the 'minimal subscene' in an incremental manner; a minimal subscene can be extended by adding details to its elements, resulting in an 'anchored subscene'; each subscene corresponds to a propositional structure that in turn corresponds to a clause (or sentence)
3) There must be a difference between the eye movement patterns for "a woman hits a man and she wears a hat" and "a woman wearing a hat hits a man", although the lexical items used are about the same [15]

The experiment results indicate that relatively complex eye movements are followed by utterances with either more complex sentence structures or more thematic information involved.
Eye gazes with a complex movement pattern:
03_athletic [fairman] 2.81~3.93:: "one team just lost" – after inspecting both groups; 3.93~5.52:: "they are looking at the team that won" – this involves inspecting both of the groups (whereas [karl] "they're really skinny… got a baton" only focuses on the left group)
11_pool [fairman] 6.26~7.57:: "sort of talking to some girls…" – after inspecting the guy and the girls at the bar; 7.57~9.62:: "but he's actually watching these two guys play pool.." – but right afterwards the gaze switched between the guy and the pool table (hole)
06_conversation [fairman] 9.26~11.07:: "...looks like he's wondering what they shared" – looking at both sides of the groups and at the guy's face (probably to figure out his gaze direction) and body
Eye gazes with a simple pattern:
Conversely, eye movements involving simple sentence structures (e.g. "there is a man...", "He's wearing sandals…", "…got something", etc.) are quite simple, more like a serial movement from one item to another, whereas complex ones involve visiting several locations [16]

The Dynamic Interaction Hypothesis (Hypothesis III) asserts that the vision and language systems are constantly interacting, so that not only does visual attention guide verbal expression, but verbal expression also guides attention.
1) Fixations that precede the production of complex nouns are longer than those that precede simple nouns (Meyer, 2005) – a relatively complex structure lengthens the fixation duration
2) Both elemental (low-level attention attraction) and structural (higher-level/top-down structure) considerations are made when producing sentences about a given scene (Bock et al., 2005)
Passive sentences: simple causal schemas might have been involved
Active sentences: occur more frequently and are easier to produce
Figure 1. Meyer, 2005; Figure 1. Bock et al., 2005 [17]

The tendency of longer fixations for complex expressions was found in the results; also, there were several cases where the description being produced seemed to bias the gaze.
Constructions requiring more information:
01_soldiers [fairman] 5.21~5.72 then 5.72~7.33:: "His hat…" then "he's wearing a military hat…"; 10.18~11.45 then 11.45~12.70:: "…and he's got a bag" then "it's like a crumpled up bag" – longer gazes that follow right after short gazes; probably the subject thought the construction was too simple, or was missing crucial information, to be produced as a sentence
More dynamic interaction between gaze and utterance:
08_disabled [fairman] 4.5~5.68 then 5.68~7.42:: "...someone who just fell..." then "..black woman just fell...number seven" – probably the construction required more detailed information to fill in, instead of just 'someone'
05_basketball [fairman] 10.35~12.28 then 13.41~14.39:: "...being guarded by another guy in the black who..." – the sentence produced first seemed to require more information (more than the gender and the color of the uniform) on the guy who blocks the player; "...looks kinda out of it..." – later this turned out to be irrelevant [18]

The Spatial Anchor Hypothesis (Hypothesis IV) asserts that the attentional focus serves the function of associating spatial pointers with the elements perceived; accessing them later in working memory (WM) requires covert/overt attention shifts.
1) SemRep (Arbib & Lee, 2008): an internal structure that encodes semantic information associated with the elements of the scene via coarsely coded spatial coordinates
2) The Hollywood Squares experiment (Richardson & Spivey, 2000) – subjects are more likely to look at the (now empty) location where the information had appeared; Figure 2. Richardson & Spivey, 2000
3) The tropical fish experiment (Laeng & Teodorescu, 2002)
NOTE: This experiment has not been conducted yet [19]

The current study can be extended by including: (1) an account of gist (covert attention), and (2) a more thorough account of the language bias on attentional shifts.
1) In almost all experiment results, 'gist' (without particular gazes involved) plays a role, at least in the initial phase of scene recognition; the current study focuses only on overt attention, but covert attention would be important as well
05_basketball [fairman] 1.04~1.79:: "basketball" – after a short gaze on only one player
06_conversation [fairman] 1.48~2.53:: "home situation" – no specific gaze found
This usually happens at the beginning of the scene display; the description later moves on to more specific objects/events in the scene
2) van der Meulen et al. (2001): even if an object is repeatedly mentioned, subjects still look back at the object, although less frequently – a good example of the influence of the language system on the vision system (the Dynamic Interaction Hypothesis)
First mention: 82%; second mention: 67% (with noun) / 50% (with pronoun)
A more thorough way of accounting for the hypothesis needs to be devised [20]

Part II: A New Approach – Construction Grammar and Template Construction Grammar (TCG)
Sidney Harris's Science Cartoons (S. Harris, 2003) [21]

Constructions are basically 'form-meaning pairings' that serve as basic building blocks for grammatical structure, thus blurring the distinction between semantics and syntax.
Goldberg (2003): constructions are stored pairings of form and function, including morphemes, words, idioms, and partially lexically filled and fully general linguistic patterns
Table 1. Examples of constructions, varying in size and complexity; form and function are specified if not readily transparent (Goldberg, 2003)
They are not fixed grammatical components and might differ among individuals [22]

Seriously, there is virtually NO difference between semantic and grammatical components – semantic constraints and grammatical constraints are interchangeable.
Examples where the distinction between semantic and syntactic constraints blurs:
(a) Bill hit Bob's arm. (b) Bill broke Bob's arm.
(a) John sent the package to Jane. (b) John sent Jane the package.
(a) Bill hit Bob on the arm. (b) *Bill broke Bob on the arm.
(a) John sent the package to the dormitory. (b) *John sent the dormitory the package.
[X V Y on Z]: V has to be a 'contact verb' (e.g. caress, kiss, lick, pat, stroke), not a 'break verb' (e.g. break, rip, smash, splinter, tear) (Kemmerer, 2005)
[X transfer-verb Y Z]: X and (especially) Y have to be 'animate' objects
The semantics of the verb, not its syntactic category, is what decides the acceptable syntactic frame of the sentence [23]

The construction grammar paradigm is also supported (1) from the neurophysiological perspective and (2) by the ontogenetic account.
1) Bergen & Wheeler (2005): not only the content words but also (or even more so) the function words contribute to the mental interpretation of a sentence
(a) John is closing the drawer. – motor activation (strong ACE found)
(b) John has closed the drawer. – visual activation (less or no ACE found)
ACE: Action-sentence Compatibility Effect; although the same content words are used, the mental interpretations vary
2) Each individual's vocabulary expands/shrinks over time; grammatical categories are a theoretical, not necessarily a cognitive, convenience
"want milk" (holophrase) becomes "want X" (general construction) (Hill, 1983)
"I'm-sorry" (holophrase) becomes "I[NP] am[VP] sorry[ADJ]" (general construction) [24]

Constructions defined in Template Construction Grammar (TCG) represent the 'form-meaning' pairings as 'templates'.
Each construction has a name and a class; the class defines its syntactic hierarchy level as well as its 'competition and cooperation' style.
The template pairs a SemRep as the semantic meaning with a lexical sequence (with slots) as the syntactic form [25]

There are broadly two types of constructions: one encodes rather simple lexical information while the other encodes a more abstract syntactic structure.
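The two construction types can be sketched as data. This is a hypothetical Python sketch (the class names, fields and the matching test are illustrative, not the actual TCG code): a lexical construction pairs one SemRep node with one word, while a complex construction pairs a relational SemRep fragment with a sequence of slots.

```python
# Hypothetical sketch of TCG-style constructions as templates.
from dataclasses import dataclass

@dataclass
class Construction:
    name: str
    cls: str    # syntactic class: defines hierarchy level and C&C style
    sem: dict   # SemRep fragment (meaning) this template must match
    form: list  # syntactic form: words and slot names, in expression order

# A simple (lexical) construction: covers a single SemRep node,
# and its form is a single word.
woman_cxn = Construction("WOMAN_cxn", "N",
                         sem={"node": "WOMAN"}, form=["woman"])

# A complex construction: covers an agent-action-patient structure,
# and its form is a sequence of slots filled by other constructions.
svo_cxn = Construction("SVO_cxn", "S",
                       sem={"edges": ["agent", "patient"]},
                       form=["[AGENT]", "[ACTION]", "[PATIENT]"])

def covers_node(cxn, concept):
    """Similarity-based matching, reduced here to exact concept equality."""
    return cxn.sem.get("node") == concept

print(covers_node(woman_cxn, "WOMAN"))  # True
print(covers_node(woman_cxn, "MAN"))    # False
```

In production, many such templates would be matched against the current SemRep at once; which attachments survive is decided by the competition and cooperation dynamics described next.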
A complex construction encodes a higher-level abstract structure: slots define syntactic connections, while red lines indicate the expression order.
A simple construction encodes only lexical information: it corresponds to a word [26]

In the production mode of TCG, constructions try to 'cover' the SemRep generated by the vision system; the matching is done based on similarity.
More than one construction may be attached at a given moment; this is sorted out later [27]

Constructions interact with each other under the 'competition and cooperation' paradigm; they compete when there are spatial conflicts (among constructions of the same class). [28]

Constructions interact with each other under the 'competition and cooperation' paradigm; they cooperate (conjoin) when a grammatical connection is possible.
Conjoined constructions form a hierarchical structure that resembles a parse tree; cooperation happens all over the place [29]

The proposed system basically consists of three main parts – the vision system, the language system and the visuo-linguistic working memory (VLWM) – with attention on top of them.
The vision system builds SemRep from visual scenes; TCG works in the language system; the three parts communicate via attention [30]

Concurrency is a central virtue of the system: the system is designed as a real-time multi-schema network, which allows it a great deal of flexibility.
The vision system provides parts of the SemRep produced as the interpretation of the scene
The VLWM holds the SemRep (with constructions attached), which changes dynamically
The language system attaches constructions and reads off the (partially) complete sentence at each moment
1) The results of the eye-tracking experiment strongly suggest parallelism
2) Constructions are basically schemas that run in parallel (the C&C paradigm) [31]

Attention is proposed as a computational resource, rather than a procedural unit, which affects the whole system – thus it can act as a communication medium.
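The resource view of attention can be sketched as a toy update rule in which working-memory elements starved of attention are eventually discarded. This is a hypothetical Python sketch (the budget, decay constant and update rule are mine, chosen only to illustrate the idea, not taken from the model):

```python
# Hypothetical sketch: attention as a shared resource over working memory.
def step(wm, focus, budget=1.0, decay=0.25):
    """One time unit: each WM element gains its share of the attention
    budget and pays a fixed maintenance cost; depleted elements are
    discarded (the forgetting effect)."""
    for elem in list(wm):
        wm[elem] += budget * focus.get(elem, 0.0) - decay
        if wm[elem] <= 0.0:
            del wm[elem]  # forgetting: computational resource ran out
    return wm

wm = {"WOMAN": 1.0, "HIT": 1.0, "MAN": 1.0}
for _ in range(4):           # attention stays entirely on the agent
    step(wm, {"WOMAN": 1.0})
print(sorted(wm))            # only the attended node survives
```

Under this rule the total resource spent on an element is its per-time-unit attention share summed over the duration, in line with the resource-times-duration relation stated for the model.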
There is an attentional map that covers the regions of the visual input via SemRep; this affects the operation of the vision system and the VLWM.
• Vision system: attention can request more details on a certain part of SemRep by biasing the attention distribution
• More attention is required to retrieve more concrete details – this corresponds to the PSS (Barsalou, 1999) account
• VLWM: there is a forgetting effect which discards SemRep components or constructions; retaining them requires a certain amount of computational resource over a certain time duration: total computational resource required = attention at each time unit × time duration
Zoom Lens model of attention (Eriksen & Yeh, 1985): attention is a kind of multi-resolution focus whose resolution is inversely proportional to the size of the attended region.
Low resolution is used for a large region, encompassing more objects with fewer details: perceiving groups of entities as a coherent whole.
High resolution is used for a small region, with fewer objects and more details: perceiving individual entities [32]

Constructions are distributed, centered around the perisylvian regions, which include the classical Broca's (BA 45/44) and Wernicke's (BA 22) areas and the left superior temporal lobe.
A 'functional web' linking phonological information related to the articulatory and acoustic pattern of a word form develops there (Pulvermuller, 2001).
Kaan & Swaab, 2002; Just et al., 1996; Keller et al., 2001; Stowe et al., 2002
A phonological rehearsal device as well as a working memory circuit for complex syntactic verbal processes (Aboitiz & Garcia, 1997) [33]

Concrete words, i.e. the lexical constructions, are topographically distributed across brain areas associated with processing the corresponding categorical properties.
Concepts of tools or actions are correlated with the motor and parietal areas (action and tool use); concepts of animals are mostly associated with the temporal areas (visual properties).
This is also supported by the category-specific deficit accounts (Warrington & McCarthy, 1987; Warrington & Shallice, 1984; Caramazza & Shelton, 1998; Tyler & Moss, 2001). [34]

The higher-level constructions that encode grammatical information might be distributed across the areas stretching from the inferior to the medial part of the left temporal cortex.
The left inferior temporal and fusiform gyri are activated during processing of pragmatic, semantic and syntactic linguistic information (Kuperberg et al., 2000).
The perirhinal cortex is involved in object identification and the formation of object representations by integrating multi-modal attributes (Murray & Richmond, 2001). [35]

The syntactic manipulation of constructions (competition and cooperation) mainly happens in Broca's area, which is believed to be activated when handling complex symbolic structures.
Broca's area is activated more when handling sentences of complex syntactic structure than sentences of simple structure (Stromswold et al., 1996) – for handling the arrangement of lexical items and for retrieving semantic or linguistic components during the matching and ordering process.
BA 45 is activated by both speech and signing during the production of language narratives by bilingual subjects, whereas BA 44 is activated by the generation of complex articulatory movements of the oral/laryngeal or limb musculature (Horwitz et al., 2003). [36]

VLWM is built around the left dorsolateral prefrontal cortex (DLPFC; BA 9/46), stretching to the left posterior parietal cortex (BA 40).
The DLPFC serves as the executive component for processing the contents of working memory, while the posterior parietal cortex serves as the storage buffer for the verbal working memory circuit (Smith et al., 1998).
The monkey prefrontal cortex is found to be involved in sustaining memory for object identity and location (Rainer et al., 1998) [37]

References
Aboitiz, F; Garcia, R (1997) The evolutionary origin of the language areas in the human brain: a neuroanatomical perspective, BRAIN RESEARCH REVIEWS 25 (3):381-396
Arbib, MA; Lee, J (2008) Describing visual scenes: towards a neurolinguistics based on Construction Grammar, BRAIN RESEARCH 1225:146-162
Barsalou, LW (1999) Perceptual symbol systems, BEHAVIORAL AND BRAIN SCIENCES 22 (4):577-+
Barsalou, LW; Simmons, WK; Barbey, AK; Wilson, CD (2003) Grounding conceptual knowledge in modality-specific systems, TRENDS IN COGNITIVE SCIENCES 7 (2):84-91
Bergen, BK; Wheeler, KB (2005) Sentence understanding engages motor processes, PROCEEDINGS OF THE TWENTY-SEVENTH ANNUAL CONFERENCE OF THE COGNITIVE SCIENCE SOCIETY
Bock, JK; Irwin, DE; Davidson, DJ (2005) Putting first things first, in The Interface of Language, Vision, and Action (Henderson, JM; Ferreira, F eds), Psychology Press:249-278 (Chapter 8)
Caramazza, A; Shelton, JR (1998) Domain-specific knowledge systems in the brain: the animate-inanimate distinction, JOURNAL OF COGNITIVE NEUROSCIENCE 10 (1):1-34
Damasio, AR (1989) Time-locked multiregional retroactivation: a systems-level proposal for the neural substrates of recall and recognition, COGNITION 33 (1-2):25-62
Eriksen, CW; Yeh, YY (1985) Allocation of attention in the visual field, JOURNAL OF EXPERIMENTAL PSYCHOLOGY: HUMAN PERCEPTION AND PERFORMANCE 11 (5):583-597
Goldberg, AE (2003) Constructions: a new theoretical approach to language, TRENDS IN COGNITIVE SCIENCES 7 (5):219-224
Griffin, ZM (2001) Gaze durations during speech reflect word selection and phonological encoding, COGNITION 82 (1):B1-B14
Griffin, ZM (2005) Why look? Reasons for eye movements related to language production, in The Interface of Language, Vision, and Action (Henderson, JM; Ferreira, F eds), Psychology Press:213-248 (Chapter 7)
Horowitz, TS; Wolfe, JM (1998) Visual search has no memory, NATURE 394 (6):575-577
Itti, L; Arbib, MA (2006) Attention and the minimal subscene, in Action to Language: via the Mirror Neuron System (Arbib, MA ed), Cambridge University Press:289-346 (Chapter 9)
Just, MA; Carpenter, PA; Keller, TA; Eddy, WF; Thulborn, KR (1996) Brain activation modulated by sentence comprehension, SCIENCE 274 (5284):114-116
Kaan, E; Swaab, TY (2002) The brain circuitry of syntactic comprehension, TRENDS IN COGNITIVE SCIENCES 6 (8):350-356
Keller, TA; Carpenter, PA; Just, MA (2001) The neural bases of sentence comprehension: an fMRI examination of syntactic and lexical processing, CEREBRAL CORTEX 11 (3):223-237
Kuperberg, GR; McGuire, PK; Bullmore, ET; Brammer, MJ; Rabe-Hesketh, S; Wright, IC; Lythgoe, DJ; Williams, SCR; David, AS (2000) Common and distinct neural substrates for pragmatic, semantic, and syntactic processing of spoken sentences: an fMRI study, JOURNAL OF COGNITIVE NEUROSCIENCE 12 (2):321-341
Laeng, B; Teodorescu, D (2002) Eye scanpaths during visual imagery reenact those of perception of the same visual scene, COGNITIVE SCIENCE 26 (2):207-231
Meyer, AS (2005) The use of eye tracking in studies of sentence generation, in The Interface of Language, Vision, and Action (Henderson, JM; Ferreira, F eds), Psychology Press:191-211 (Chapter 6)
Murray, EA; Richmond, BJ (2001) Role of perirhinal cortex in object perception, memory, and associations, CURRENT OPINION IN NEUROBIOLOGY 11 (2):188-193
Price, C; Thierry, G; Griffiths, T (2005) Speech-specific auditory processing: where is it?, TRENDS IN COGNITIVE SCIENCES 9 (6):271-276
Pulvermuller, F (2001) Brain reflections of words and their meaning, TRENDS IN COGNITIVE SCIENCES 5 (12):517-524
Rainer, G; Asaad, WF; Miller, EK (1998) Memory fields of neurons in the primate prefrontal cortex, PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA 95 (25):15008-15013
Richardson, DC; Spivey, MJ (2000) Representation, space and Hollywood Squares: looking at things that aren't there anymore, COGNITION 76 (3):269-295
Simmons, WK; Barsalou, LW (2003) The similarity-in-topography principle: reconciling theories of conceptual deficits, COGNITIVE NEUROPSYCHOLOGY 20 (3-6):451-486
Smith, EE; Jonides, J; Marshuetz, C; Koeppe, RA (1998) Components of verbal working memory: evidence from neuroimaging, PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA 95 (3):876-882
Stowe, L; Withaar, R; Wijers, A; Broere, C; Paans, A (2002) Encoding and storage in working memory during sentence comprehension, in Sentence Processing and the Lexicon: Formal, Computational and Experimental Perspectives (Merlo, P; Stevenson, S eds), John Benjamins Publishing Company:181-206
Stromswold, K; Caplan, D; Alpert, N; Rauch, S (1996) Localization of syntactic comprehension by positron emission tomography, BRAIN AND LANGUAGE 52 (3):452-473
Tyler, LK; Moss, HE (2001) Towards a distributed account of conceptual knowledge, TRENDS IN COGNITIVE SCIENCES 5 (6):244-252
van der Meulen, FF (2003) Coordination of eye gaze and speech in sentence production, Mediating
van der Meulen, FF; Meyer, AS; Levelt, WJ (2001) Eye movements during the production of nouns and pronouns, MEMORY AND COGNITION 29 (3):512-521
Warrington, EK; McCarthy, RA (1987) Categories of knowledge: further fractionations and an attempted integration, BRAIN 110 (5):1273-1296
Warrington, EK; Shallice, T (1984) Category specific semantic impairments, BRAIN 107 (SEP):829-854 [42]