An Analysis-By-Synthesis Approach to Multisensory Object Shape Perception

Ilker Yildirim
Brain & Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA
[email protected]

Goker Erdogan
Brain & Cognitive Sciences, University of Rochester, Rochester, NY
[email protected]

Robert A. Jacobs
Brain & Cognitive Sciences, University of Rochester, Rochester, NY
[email protected]

1 Introduction

The world is multimodal.¹ We sense our environments using inputs from multiple sensory modalities. Similarly, digital information is increasingly available through multiple media. In this extended abstract, we present a general computational framework for understanding multimodal learning and perception that builds on the analysis-by-synthesis approach [2, 3]. The analysis-by-synthesis approach makes use of generative models describing how causes in the environment give rise to percepts. It achieves an analysis of the input as a synthesis of these causes by inverting this generative model. This approach has been successfully applied in a wide range of fields, from vision [2] to language [3]. Our framework is a natural extension of this approach to multimodal perception. From the perspective of Marr’s tri-level hypothesis,² our framework can be understood as a computational-level analysis of multimodal perception. It offers a conceptual analysis of multimodal perception that enables a unified treatment of a diverse set of questions. As a proof of concept of our framework, we apply it to object shape perception via visual and haptic modalities, and show that it captures people’s behavior in this task with high accuracy.

We argue that any cognitive agent capable of multimodal perception can be analyzed in terms of three crucial components [1]. The first component is a representational language for characterizing modality-independent representations.
For example, in the case of multimodal object recognition, this representational language characterizes the object shape information, extracted from the sensory inputs, that is necessary for recognition. These modality-independent representations need to be mapped to sensory inputs for the purposes of learning and inference. This is where the second component, sensory-specific forward models [5], comes into play. For example, a vision-specific forward model maps object shape representations to visual inputs, i.e., images. However, perception goes in the other direction, from sensory inputs to multimodal representations. Therefore, the last component is an inference procedure for inverting forward models. Taking a probabilistic perspective, we think of forward models as specifying a probability distribution over sensory inputs D given a multimodal representation H, i.e., P(D|H). One can use the calculus of probability to invert this forward model via Bayes’ rule. A similar framework has been successfully applied to problems in biological and computer vision, and our framework is a natural application of this analysis-by-synthesis [2] approach to multimodal perception.

¹ This extended abstract is a short version of our manuscript titled “From Sensory Signals to Modality-Independent Conceptual Representations: A Probabilistic Language of Thought Approach” published in PLOS Computational Biology [1].
² Marr [4] distinguished between three different levels at which one can understand a cognitive capability. The computational level characterizes the computational problem that needs to be solved. The algorithm/representation level specifies the representations and the algorithms used to solve the problem. Lastly, the implementation level is concerned with how the solution is implemented in a physical medium.

We believe that our framework offers a promising direction for understanding multimodal perception because it provides a unified treatment of various questions.
For example, how is knowledge from one modality transferred to another? In our framework, this simply becomes estimating the posterior probability of observing a sensory input in one modality given a sensory input from another modality. Or, how is information from multiple modalities combined [6]? Again, this simply becomes estimating the posterior over multimodal representations given inputs from multiple modalities.

2 Our framework applied to multimodal perception of object shape

In the rest of the abstract, we focus on the application of our framework to multimodal shape perception in visual and haptic settings. We present a specific instantiation of our model for this problem and show that it provides a highly accurate account of human subjects’ shape similarity judgments.

We used a set of 16 objects in our experimental and computational work (see Figure 1). Each object consists of five parts at five fixed locations. One part is common to all objects. At each of the remaining four locations, one of two possible parts is chosen (hence 2^4 = 16 objects). We characterize these objects using a shape grammar that captures the identity and spatial configuration of the parts that make up an object. Production rules for this probabilistic context-free grammar can be seen in Figure 2. Note that this grammar characterizes only the coarse spatial structure of an object, and it needs to be combined with a spatial model, which will be explained in more detail shortly. There are two non-terminal symbols in our shape grammar. The non-terminal symbol S corresponds to spatial nodes in the grammar that determine the positions of parts in 3D space. The spatial model associates a position in 3D space with each of these S nodes. P nodes are placeholder nodes for parts, which are denoted by terminal nodes P0 through P8 (Figure 3). The grammatical derivation for an object can be represented using a parse tree.
However, as mentioned above, this parse tree does not characterize the spatial structure of an object completely. Hence, a parse tree is extended to a spatial tree by combining it with a spatial model as follows (see Figure 4). Parts are positioned in a 3D space using a multiresolution representation. At the coarsest resolution in this representation, a “voxel” corresponds to the entire space. The center location of this voxel is the origin of the space, denoted (0, 0, 0). The root S node of a grammatical derivation of an object is always associated with this origin. At a finer resolution, this voxel is divided into 27 equal-sized subvoxels arranged to form a 3 × 3 × 3 grid. Using a Cartesian coordinate system, the coordinate of a subvoxel’s location along each axis is either -1, 0, or 1. Even finer resolutions are created by dividing subvoxels into subsubvoxels (again, using a 3 × 3 × 3 grid), subsubvoxels into subsubsubvoxels, and so on.

S nodes in an object’s grammatical derivation (other than the root S node) are associated with these finer-resolution voxels, with the constraint that the position of an S node must lie within its parent S node’s voxel (see Figure 4). In other words, each level in the parse tree is associated with a specific level of resolution. For example, for the child S nodes of the root S node, the whole space is split into 27 voxels, 3 along each axis, and each S node is assigned one of these voxels. Similarly, for the next level in the tree, each voxel is further split into 27 subvoxels, and each S node at this level can be associated with any of the 27 subvoxels of its parent voxel (hence, there are 27 × 27 voxels in total at this level). The position of a part node P is simply its parent S node’s position, i.e., the center of the voxel associated with its parent. This assignment of spatial positions to S nodes extends a parse tree to a spatial tree by specifying the positions of the parts that make up an object.
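The multiresolution positioning scheme can be illustrated with a short sketch. It is not the authors' implementation: the function names are hypothetical, and the choice to shrink the spacing between voxel centers by a factor of 3 at each level is an assumption made here so that a child voxel's center always lies inside its parent's voxel, as the text requires.

```python
from fractions import Fraction
from itertools import product

# The 27 subvoxel offsets of a 3 x 3 x 3 grid: each coordinate is -1, 0, or 1.
OFFSETS = frozenset(product((-1, 0, 1), repeat=3))


def voxel_center(path):
    """Return the center of the voxel reached by following `path` from the root.

    `path` is a sequence of (x, y, z) offsets, one per tree level below the
    root S node. An empty path is the root voxel, centered at the origin.
    """
    center = (Fraction(0), Fraction(0), Fraction(0))
    scale = Fraction(1)  # assumed spacing between voxel centers at this level
    for offset in path:
        offset = tuple(offset)
        if offset not in OFFSETS:
            raise ValueError(f"invalid subvoxel offset: {offset}")
        center = tuple(c + scale * o for c, o in zip(center, offset))
        scale /= 3  # assumption: each finer level splits a voxel into thirds per axis
    return center


def voxels_at_level(level):
    """Number of distinct voxels available at a given depth below the root."""
    return 27 ** level
```

Under these assumptions, an S node with offset (-1, 1, 0) whose parent has offset (1, 0, 0) sits at (2/3, 1/3, 0), and `voxels_at_level(2)` gives the 27 × 27 voxels mentioned in the text. Exact fractions are used only to keep the arithmetic easy to check.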
Let T denote the parse tree and S denote the spatial model, i.e., the spatial assignments for each S node, for an object. The prior probability of a shape representation (T, S) is defined as:

$$P(T, S \mid G) = P(T \mid G)\, P(S \mid T) \tag{1}$$

where G denotes the shape grammar. The probability of a parse tree is defined based on the grammatical derivation associated with it:

$$P(T \mid G, \rho) = \prod_{n \in N_{nt}} P(n \to \mathrm{ch}(n) \mid G, \rho) \tag{2}$$

where $N_{nt}$ is the set of non-terminal nodes in the tree, $\mathrm{ch}(n)$ is the set of node n’s children, and $P(n \to \mathrm{ch}(n) \mid G, \rho)$ is the probability of the production rule $n \to \mathrm{ch}(n)$. In this equation, ρ denotes the set of probability assignments to production rules and can be integrated out to get P(T | G) (see [9] for details). Letting V denote the set of voxels, the probability of the spatial model S can simply be defined as follows, since each S node in the parse tree is assigned one voxel:

$$P(S \mid T) = \prod_{n \in N_S} \frac{1}{|V|} = \frac{1}{|V|^{|N_S|}} \tag{3}$$

Here $N_S$ is the set of S nodes in the parse tree. Note that the set V simply consists of the voxels that can be assigned to an S node; this set always contains 27 elements, since each S node is assigned one of the 27 subvoxels of its parent’s voxel.

Figure 1: Stimuli used in the experiment and by the computational model.

S → S | SS | SSS | SSSS | P | PS | PSS | PSSS
P → P0 | P1 | P2 | P3 | P4 | P5 | P6 | P7 | P8

Figure 2: Production rules of the shape grammar in Backus-Naur form.

To map these modality-independent shape representations to sensory features, we make use of two forward models, one for each modality. For the vision-specific forward model, we use the Visualization Toolkit (VTK; www.vtk.org) graphics library to render images of objects. Given an object’s parse and spatial trees (T, S), VTK places the object’s parts at their specified locations and renders the object from three orthogonal viewpoints. These images are concatenated to form a single vector of visual inputs. For the haptic-specific forward model, we use a grasp simulator known as “GraspIt!” [7] (see Figure 5). GraspIt!
enables us to measure the joint angles of a simulated human hand grasping a given object. We perform multiple grasps on an object as we rotate it around the x, y, and z axes in increments of 45° (24 grasps in total), and use the joint angles (again, concatenating the joint angles for all grasps into a single haptic input vector) as our representation of sensory input for the haptic modality. The use of multiple grasps can be regarded as an approximation to active haptic exploration. Our haptic forward model is no doubt a crude approximation. We use joint angles as our sole haptic input and assume that an object can be grasped in exactly the same way across multiple trials. We believe these are necessary simplifications, made partly to make the problem less computationally demanding and partly because realistic haptic forward models are lacking in the literature.

We define a probability distribution over sensory inputs given modality-independent shape representations by assuming Gaussian noise on sensory inputs. Hence, P(D | T, S) is defined as follows:

$$P(D \mid T, S) \propto \exp\left(-\frac{\lVert D - F(T, S) \rVert_2^2}{\sigma^2}\right) \tag{4}$$

where F denotes the forward model and $\sigma^2$ is a variance parameter.

Figure 3: Library of possible object parts (P0 through P8).

Figure 4: (a) Image of an object. (b) Spatial tree representing the parts and spatial relations among parts for the object in (a). (c) Illustration of how the spatial tree uses a multi-resolution representation to represent the locations of object parts.

To estimate an object’s shape from visual and/or haptic inputs, we invert the forward models via Bayes’ rule to compute a posterior distribution over multimodal representations:

$$P(T, S \mid D, G) \propto P(T \mid G)\, P(S \mid T)\, P(D \mid T, S) \tag{5}$$

where the first term on the right-hand side is given by Equation 4 in [1], and the rest are given here by Equations 3 and 4, respectively.
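As a rough illustration of how the terms of the posterior (Equations 3 to 5) combine, the sketch below evaluates the unnormalized log posterior for one candidate representation. It is a minimal sketch under stated assumptions, not the authors' code: all function names are hypothetical, the tree prior log P(T | G) is taken as a precomputed number (the source defers its derivation to [1] and [9]), and the forward-model prediction F(T, S) is passed in as a plain list of numbers rather than produced by VTK or GraspIt!.

```python
import math

NUM_VOXELS = 27  # |V|: candidate subvoxels for each S node, as stated in the text


def log_spatial_prior(num_s_nodes):
    """log P(S | T) from Equation 3: each S node picks one of |V| voxels uniformly."""
    return -num_s_nodes * math.log(NUM_VOXELS)


def log_likelihood(observed, predicted, sigma_sq):
    """Unnormalized log P(D | T, S) from Equation 4 (Gaussian noise on inputs)."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted inputs must have the same length")
    squared_error = sum((d - f) ** 2 for d, f in zip(observed, predicted))
    return -squared_error / sigma_sq


def log_posterior_unnormalized(log_tree_prior, num_s_nodes, observed, predicted, sigma_sq):
    """Sum of the three log terms on the right-hand side of Equation 5.

    `log_tree_prior` stands in for log P(T | G), assumed to be computed elsewhere.
    """
    return (
        log_tree_prior
        + log_spatial_prior(num_s_nodes)
        + log_likelihood(observed, predicted, sigma_sq)
    )
```

Working in log space avoids numerical underflow when the input vectors are long, and only differences between such values are needed when comparing two candidate representations.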
Given the intractability of this distribution, we use a Metropolis-Hastings algorithm, a popular Markov chain Monte Carlo (MCMC) technique, to collect samples from the posterior. We make use of two different proposal distributions, one used on even-numbered and the other on odd-numbered iterations [8]. Our first proposal distribution, originally developed by [9], picks one of the nodes in the parse tree randomly, deletes all its descendants, and regenerates a random subtree from that node according to the rules of the grammar. The spatial model is also updated accordingly by removing the deleted nodes and picking random positions for newly added S nodes (see [1] for the acceptance ratio for this proposal). Our simulations showed that this “subtree regeneration” proposal by itself is not efficient, largely because it takes quite large steps in the representation space. Our second proposal distribution is a more local one that adds/removes a single part to/from an object. The acceptance probability for an add move³ is:

$$A(T', S'; T, S) = \min\left\{1,\; \frac{P(D \mid T', S')}{P(D \mid T, S)}\, \frac{P(T' \mid G)}{P(T \mid G)}\, \frac{|\mathcal{A}|}{|\mathcal{R}'|}\, |G_t| \right\} \tag{6}$$

where $\mathcal{R}'$ is the set of S nodes in tree $T'$ that can be removed, $\mathcal{A}$ is the set of S nodes in tree $T$ to which a new child S node can be added, and $G_t$ is the set of terminal symbols in the grammar. Subtree regeneration and add/remove part proposals, when combined, enabled fast convergence to the posterior distribution. In our simulations, we ran each MCMC chain for 10,000 iterations (burn-in: 6,000 iterations).

Figure 5: GraspIt! simulates a human hand. Here the hand is grasping an object at three different orientations.

Samples from a representative run of the model are shown in Figure 6. Figure 6a shows the visual input to the model for this particular run. Samples from the posterior distribution, ordered according to posterior probability, are shown in Figure 6b.
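The add-move acceptance probability (Equation 6) is the product of a likelihood ratio, a tree-prior ratio, the ratio of the number of S nodes that can receive a child to the number of removable S nodes in the proposed tree, and the number of terminal symbols, capped at 1. A minimal sketch of that computation in log space follows. The function and parameter names are hypothetical, and the likelihood and prior values are assumed to be supplied by the caller; this is not the authors' code.

```python
import math
import random


def add_move_log_acceptance(log_lik_new, log_lik_old,
                            log_prior_new, log_prior_old,
                            num_addable, num_removable_new, num_terminals):
    """Log of the add-move acceptance probability in Equation 6.

    num_addable:       S nodes in the current tree that can take a new child S node
    num_removable_new: S nodes in the proposed tree that can be removed
    num_terminals:     number of terminal symbols in the grammar
    """
    log_ratio = (
        (log_lik_new - log_lik_old)
        + (log_prior_new - log_prior_old)
        + math.log(num_addable)
        - math.log(num_removable_new)
        + math.log(num_terminals)
    )
    return min(0.0, log_ratio)  # min(1, ratio) expressed in log space


def accept(log_acceptance, rng=random):
    """Metropolis-Hastings accept/reject step for a log acceptance probability."""
    return rng.random() < math.exp(log_acceptance)
```

In a full sampler, this step would be applied on the iterations that use the add/remove-part proposal, with the subtree-regeneration proposal and its own acceptance ratio (given in [1]) used on the alternating iterations.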
The MAP representation (the leftmost sample in Figure 6b) recovers the part structure and the spatial configuration of the object perfectly. This result is general: our model always recovers the “true” representation for each object regardless of the modality through which the object is sensed. This demonstrates that the model shows modality invariance, an important type of perceptual constancy.

3 Behavioral evaluation of the model

In our behavioral experiment, we investigated how people transfer shape knowledge across modalities. This can also be understood as a problem of crossmodal shape retrieval, where one needs to measure how likely it is to observe some sensory input from one modality given sensory observations from another modality. To understand how well our model captures this aspect of human multimodal perception, we asked subjects to rate the similarity of shape between objects under two settings: crossmodal and multimodal.⁴ In the crossmodal condition, on each trial, subjects first viewed the image of one of the objects on a computer screen, and then haptically explored the second object (physical copies of the objects were created via 3-D printing) by reaching into a compartment where the second object was placed. The compartment prevented subjects from seeing the object. Subjects were asked to judge the shape similarity between the two objects on a scale of 1 to 7. In the multimodal condition, the procedure was the same except that subjects received both visual and haptic input for both of the objects. The experiment consisted of 4 blocks; in each block, all possible pairwise comparisons of the 16 objects were presented. We collected data from 7 subjects.

³ The acceptance probability for the remove move can be derived analogously and is given in [1].
⁴ Here we focus only on two conditions in our experiment; please refer to [1] for the full experiment, details on the experimental procedure, and a detailed analysis of our experimental results.

Figure 6: Samples from a representative run of our model. (a) Input to the model. (b) Samples from the posterior distribution over shape representations. Samples are ordered according to posterior probability from left to right.

We evaluated how well our model captures subjects’ similarity judgments by comparing its similarity judgments to behavioral data. To obtain similarity ratings from our model for the crossmodal condition, we proceeded as follows. For each trial, we ran two MCMC chains, one for each object. We call these the visual and haptic chains, referring to the modality through which each object is presented. For the visual chain, the input is images of the object from three orthogonal viewpoints. For the haptic chain, we use the joint angles (for 24 grasps) obtained from GraspIt! as the sensory input. For the multimodal condition, we again ran two chains, one for each object, but provided both visual and haptic input. Here, we focus only on the MAP samples from these chains, and take the similarity between the MAP samples from the visual and haptic chains to be the crossmodal similarity judgment of our model. Similarly, the similarity between the MAP samples from the visual-haptic chains for the multimodal condition forms our model’s multimodal similarity judgments. Because our multimodal shape representations are trees, we used tree-edit distance [10] to measure the similarity of two MAP representations. We compare our model’s judgments to subjects’ judgments as follows.
After making sure that all 7 subjects provided highly consistent similarity ratings (the average correlation between subjects’ similarity ratings is 0.8 ± 0.097 and 0.86 ± 0.065 for the crossmodal and multimodal conditions, respectively), we average all 7 subjects’ similarity ratings to get a single similarity matrix for each condition. We measure the correlation between the average subject similarity matrices and the similarity matrices we get from our model. These matrices are highly correlated (crossmodal: r = 0.978, p < 0.001; multimodal: r = 0.980, p < 0.001), i.e., our model captures human performance in this task extremely well.

4 Conclusion

We have presented a computational framework for understanding multimodal perception and showed that an instantiation of our model for object shape recognition captures human judgments on a shape similarity task very well. We believe this work constitutes a promising first step, and we intend to evaluate our framework in more detail and more extensively in future work.

References

[1] Goker Erdogan, Ilker Yildirim, and Robert A. Jacobs. From Sensory Signals to Modality-Independent Conceptual Representations: A Probabilistic Language of Thought Approach. PLoS Computational Biology, 11(11):e1004610, November 2015.
[2] Alan Yuille and Daniel Kersten. Vision as Bayesian inference: analysis by synthesis? Trends in Cognitive Sciences, 10(7):301–8, 2006.
[3] Thomas G. Bever and David Poeppel. Analysis by Synthesis: A (Re-)Emerging Program of Research for Language and Vision. Biolinguistics, 4(2-3):174–200, 2010.
[4] David Marr. Vision: A Computational Investigation into the Human Representation and Processing of Visual Information. MIT Press, Cambridge, Massachusetts, 1982.
[5] Daniel M. Wolpert and J. Randall Flanagan. Forward Models. In T. Bayne, A. Cleeremans, and P. Wilken, editors, The Oxford Companion to Consciousness, pages 295–296. Oxford University Press, New York, 2009.
[6] Marc O. Ernst and Martin S. Banks.
Humans integrate visual and haptic information in a statistically optimal fashion. Nature, 415(6870):429–33, January 2002.
[7] Andrew T. Miller and Peter K. Allen. GraspIt!: A versatile simulator for robotic grasping. IEEE Robotics & Automation Magazine, 11(4):110–122, December 2004.
[8] L. Tierney. Markov Chains for Exploring Posterior Distributions. The Annals of Statistics, 22(4):1701–1728, 1994.
[9] Noah D. Goodman, Joshua B. Tenenbaum, Jacob Feldman, and Thomas L. Griffiths. A rational analysis of rule-based concept learning. Cognitive Science, 32(1):108–54, January 2008.
[10] Kaizhong Zhang and Dennis Shasha. Simple Fast Algorithms for the Editing Distance between Trees and Related Problems. SIAM Journal on Computing, 18(6):1245–1262, December 1989.
