An Analysis-By-Synthesis Approach to Multisensory
Object Shape Perception
Ilker Yildirim
Brain & Cognitive Sciences
Massachusetts Institute of Technology
Cambridge, MA
[email protected]
Goker Erdogan
Brain & Cognitive Sciences
University of Rochester
Rochester, NY
[email protected]
Robert A. Jacobs
Brain & Cognitive Sciences
University of Rochester
Rochester, NY
[email protected]
1 Introduction
The world is multimodal. We sense our environments using inputs from multiple sensory modalities. Similarly, digital information is increasingly available through multiple media. In this extended abstract,^1 we present a general computational framework for understanding multimodal learning and perception that builds on the analysis-by-synthesis approach [2, 3]. The analysis-by-synthesis approach makes use of generative models describing how causes in the environment give rise to percepts; it achieves an analysis of the input as a synthesis of these causes by inverting the generative model. This approach has been successfully applied in a wide range of fields, from vision [2] to language [3]. Our framework is a natural extension of this approach to multimodal perception. From the perspective of Marr’s tri-level hypothesis,^2 our framework can be understood as a computational-level analysis of multimodal perception. It offers a conceptual analysis of multimodal perception that enables a unified treatment of a diverse set of questions. As a proof of concept, we apply our framework to object shape perception via the visual and haptic modalities, and show that it captures people’s behavior in this task with high accuracy.
We argue that any cognitive agent capable of multimodal perception can be analyzed in terms of
three crucial components [1]. The first component is a representational language for characterizing
modality-independent representations. For example, in the case of multimodal object recognition,
this representational language characterizes the object shape information necessary for recognition
extracted from the sensory inputs. These modality-independent representations need to be mapped
to sensory inputs for the purposes of learning and inference. This is where the second component,
sensory-specific forward models [5], comes into play. For example, a vision-specific forward model
maps object shape representations to visual inputs, i.e., images. However, perception goes in the
other direction, from sensory inputs to multimodal representations. Therefore, the last component is
an inference procedure for inverting forward models. Taking a probabilistic perspective, we think of
forward models as specifying a probability distribution over sensory inputs D given a multimodal representation H, i.e., P(D|H). One can use the calculus of probability to invert this forward model via Bayes’ rule. A similar framework has been successfully applied to problems in biological and computer vision, and our framework is a natural application of this analysis-by-synthesis approach [2] to multimodal perception. We believe that our framework presents a promising direction for understanding multimodal perception because it offers a unified treatment of various questions. For example, how is knowledge from one modality transferred to another? In our framework, this simply becomes estimating the posterior probability of observing sensory input from one modality given sensory input from another modality. Or, how is information from multiple modalities combined [6]? Again, this simply becomes estimating the posterior over multimodal representations given inputs from multiple modalities.

^1 This extended abstract is a short version of our manuscript titled “From Sensory Signals to Modality-Independent Conceptual Representations: A Probabilistic Language of Thought Approach,” published in PLOS Computational Biology [1].

^2 Marr [4] distinguished between three levels at which one can understand a cognitive capability. The computational level characterizes what computational problem needs to be solved; the algorithm/representation level specifies the representations and algorithms used to solve it; and the implementation level is concerned with how the solution is implemented in a physical medium.
2 Our framework applied to multimodal perception of object shape
In the rest of the abstract, we focus on the application of our framework to multimodal shape perception in visual and haptic settings. We present a specific instantiation of our model for this problem
and show that it provides a highly accurate account of human subjects’ shape similarity judgments.
We used a set of 16 objects in our experimental and computational work (see Figure 1). Each
object consists of five parts at five fixed locations. One part is common to all objects. At each of
the remaining four locations, one of two possible parts is chosen. We characterize these objects
using a shape grammar that captures the identity and spatial configuration of the parts that make
up an object. Production rules for this probabilistic context-free grammar can be seen in Figure 2.
Note that this grammar characterizes only the coarse spatial structure of an object, and it needs to
be combined with a spatial model, which will be explained in more detail shortly. There are two
non-terminal symbols in our shape grammar. Non-terminal symbol S corresponds to spatial nodes
in the grammar that determine the positions of parts in 3D space. The spatial model associates
positions in 3D space with each one of these S nodes. P nodes are placeholder nodes for parts
which are denoted by terminal symbols P0 through P8 (Figure 3). The grammatical derivation for an
object can be represented using a parse tree. However, as mentioned above, this parse tree does not
characterize the spatial structure of an object completely. Hence, a parse tree is extended to a spatial
tree by combining it with a spatial model as follows (see Figure 4).
Parts are positioned in a 3D space using a multiresolution representation. At the coarsest resolution
in this representation, a “voxel” corresponds to the entire space. The center location of this voxel is
the origin of the space, denoted (0, 0, 0). The root S node of a grammatical derivation of an object
is always associated with this origin. At a finer resolution, this voxel is divided into 27 equal sized
subvoxels arranged to form a 3 × 3 × 3 grid. Using a Cartesian coordinate system, a coordinate of a
subvoxel’s location along an axis is either -1, 0, or 1. Even finer resolutions are created by dividing
subvoxels into subsubvoxels (again, using a 3 × 3 × 3 grid), subsubvoxels into subsubsubvoxels,
and so on. S nodes in an object’s grammatical derivation (other than the root S node) are associated
with these finer resolution voxels with the constraint that the position of an S node must lie within
its parent S node’s voxel (see Figure 4). In other words, each level in the parse tree is associated
with a specific level of resolution. For example, for the child S nodes of the root S node, the whole
space is split into 27 voxels, 3 along each axis, and each S node is assigned one of these voxels.
Similarly, for the next level in the tree, each voxel is further split into 27 subvoxels, and each S
node at this level can be associated with any of the 27 subvoxels of its parent voxel (hence, there are 27 × 27 voxels in total at this level). The position of a part node P is simply its parent S node’s
position, i.e., the center of the voxel associated with its parent. This assignment of spatial positions
to S nodes extends a parse tree to a spatial tree by specifying the positions of the parts that make up
an object.
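To make the multiresolution scheme concrete, the center of any voxel can be computed from the sequence of {-1, 0, 1} offset triples that leads to it. The sketch below assumes the whole space is the cube [-1, 1]^3 centered at the origin; the paper fixes only the recursive 3 × 3 × 3 subdivision, not the absolute scale, so that bounding cube is our assumption:

```python
def voxel_center(path):
    """Center of the voxel reached by following `path`, a list of
    (x, y, z) offset triples with each coordinate in {-1, 0, 1}.

    The whole space is assumed to be the cube [-1, 1]^3; an empty path
    returns the origin, which is where the root S node always sits."""
    center = [0.0, 0.0, 0.0]
    size = 2.0                  # side length of the current voxel
    for offsets in path:
        size /= 3.0             # each subdivision shrinks the voxel 3x per axis
        for axis, off in enumerate(offsets):
            center[axis] += off * size
    return tuple(center)
```

For example, the child voxel labeled (1, 0, 0) of the root has its center at (2/3, 0, 0), and descending to its own (1, 0, 0) subvoxel lands at (8/9, 0, 0), illustrating the constraint that a child's position stays inside its parent's voxel.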
Let T denote the parse tree and S denote the spatial model, i.e., spatial assignments for each S node,
for an object. The prior probability of a shape representation (T , S) is defined as:
    P(T, S \mid G) = P(T \mid G)\, P(S \mid T)    (1)
where G denotes the shape grammar. The probability for a parse tree is defined based on the grammatical derivation associated with it:
    P(T \mid G, \rho) = \prod_{n \in N_{nt}} P(n \to ch(n) \mid G, \rho)    (2)
where N_{nt} is the set of non-terminal nodes in the tree, ch(n) is the set of node n’s children, and P(n \to ch(n) \mid G, \rho) is the probability of production rule n \to ch(n). In this equation, \rho denotes the set of probability assignments to production rules and can be integrated out to obtain P(T \mid G) (see [9] for details). Letting V denote the set of voxels, the probability of the spatial model S can be defined simply, since each S node in the parse tree is assigned one voxel:
    P(S \mid T) = \prod_{n \in N_S} \frac{1}{|V|} = \frac{1}{|V|^{|N_S|}}    (3)
Note that the set V simply consists of the voxels that can be assigned to an S node; this set always
contains 27 elements since each S node is assigned one of 27 subvoxels of its parent’s voxel.
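A minimal sketch of this prior: sample a parse tree from the production rules in Figure 2 and score the spatial assignments via Equation 3. The uniform rule probabilities and the depth cap are our own illustrative additions (the paper treats the rule probabilities \rho as unknown and integrates them out, and derivations terminate naturally):

```python
import math
import random

# Production rules of the shape grammar (Figure 2).
RULES = {
    "S": [["S"], ["S", "S"], ["S", "S", "S"], ["S", "S", "S", "S"],
          ["P"], ["P", "S"], ["P", "S", "S"], ["P", "S", "S", "S"]],
    "P": [[f"P{i}"] for i in range(9)],
}

def sample_tree(symbol="S", max_depth=4, depth=0):
    """Sample a parse tree as nested (symbol, children) tuples.
    `max_depth` is a hypothetical cap that forces S -> P near the bottom."""
    if symbol not in RULES:                     # terminal part symbol P0..P8
        return (symbol, [])
    if depth >= max_depth and symbol == "S":
        rhs = ["P"]                             # force termination
    else:
        rhs = random.choice(RULES[symbol])      # uniform rho, for illustration
    return (symbol, [sample_tree(s, max_depth, depth + 1) for s in rhs])

def count_s_nodes(tree):
    symbol, children = tree
    return (symbol == "S") + sum(count_s_nodes(c) for c in children)

def spatial_log_prior(tree, n_voxels=27):
    """log P(S | T) from Equation 3: each S node picks one of the 27
    subvoxels of its parent's voxel. (The root's position is fixed at the
    origin in the paper; counting it here is a small simplification.)"""
    return -count_s_nodes(tree) * math.log(n_voxels)
```

Pairing a sampled parse tree with random voxel assignments for its S nodes yields a draw from the prior P(T, S | G) of Equation 1.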
Figure 1: Stimuli used in the experiment and by the computational model.
S → S | SS | SSS | SSSS | P | PS | PSS | PSSS
P → P0 | P1 | P2 | P3 | P4 | P5 | P6 | P7 | P8

Figure 2: Production rules of the shape grammar in Backus-Naur form.
To map these modality-independent shape representations to sensory features, we make use of two
forward models, one for each modality. For the vision-specific forward model, we use the Visualization Toolkit (VTK; www.vtk.org) graphics library to render images of objects. Given an
object’s parse and spatial trees (T , S), VTK places the object’s parts at their specified locations and
renders the object from three orthogonal viewpoints. These images are concatenated to form a single
vector of visual inputs. For the haptic-specific forward model, we use a grasp simulator known as
“GraspIt!” [7] (see Figure 5). GraspIt! enables us to measure the joint angles of a simulated human
hand grasping a given object. We perform multiple grasps on an object as we rotate it around the x, y, and z axes in increments of 45° (24 grasps in total), and use the joint angles (again, concatenating the joint angles for all grasps into a single haptic input vector) as our representation of sensory
input for the haptic modality. The use of multiple grasps can be regarded as an approximation to
active haptic exploration. Our haptic forward model is no doubt a crude approximation. We use
joint angles as our sole haptic input and assume that an object can be grasped exactly in the same
way across multiple trials. We believe these are necessary simplifications made partly in order to
make the problem less computationally demanding, and partly due to lack of realistic haptic forward
models in the literature.
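The sensory input vectors themselves are just concatenations of the forward models' outputs. A sketch of that assembly step (the array shapes are illustrative; the actual renderings come from VTK and the joint angles from GraspIt!):

```python
import numpy as np

def visual_input(images):
    """Concatenate renderings from three orthogonal viewpoints into a
    single visual feature vector."""
    return np.concatenate([np.asarray(im, dtype=float).ravel() for im in images])

def haptic_input(grasp_joint_angles):
    """Concatenate joint-angle vectors from 24 grasps (one per 45-degree
    rotation about the x, y, and z axes) into a single haptic feature
    vector."""
    return np.concatenate([np.asarray(g, dtype=float).ravel()
                           for g in grasp_joint_angles])
```

Either vector (or their concatenation, in the multimodal case) plays the role of D in the likelihood below.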
We define a probability distribution over sensory inputs given modality-independent shape representations by assuming Gaussian noise on the sensory inputs. Hence, P(D \mid T, S) is defined as follows:

    P(D \mid T, S) \propto \exp\left(-\frac{\|D - F(T, S)\|_2^2}{\sigma^2}\right)    (4)
Figure 3: Library of possible object parts (P0 through P8).
Figure 4: (a) Image of an object. (b) Spatial tree representing the parts and spatial relations among parts for the object in (a). (c) Illustration of how the spatial tree uses a multi-resolution representation to represent the locations of object parts.
where F denotes the forward model and σ² is a variance parameter.
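In code, Equation 4 amounts to a squared-error score between the observed feature vector and the forward model's prediction. A sketch, where the precomputed `predicted` vector stands in for F(T, S) and the value of σ is an arbitrary assumption:

```python
import numpy as np

def log_likelihood(observed, predicted, sigma=1.0):
    """Unnormalized log P(D | T, S) under Equation 4: isotropic Gaussian
    noise on the concatenated sensory input vector. `predicted` plays the
    role of F(T, S), the forward model's rendering of the hypothesis."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return -np.sum((observed - predicted) ** 2) / sigma ** 2
```

Working in log space keeps the later Metropolis-Hastings acceptance test numerically stable, since only likelihood ratios are ever needed.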
To estimate an object’s shape from visual and/or haptic inputs, we invert the forward models via
Bayes’ rule to compute a posterior distribution over multimodal representations:
    P(T, S \mid D, G) \propto P(T \mid G)\, P(S \mid T)\, P(D \mid T, S)    (5)

where the first term on the right-hand side is given by Equation 4 in [1], and the rest are given here in Equations 3 and 4, respectively.

Figure 5: GraspIt! simulates a human hand. Here the hand is grasping an object at three different orientations.

Given the intractability of this distribution, we use a Metropolis-Hastings algorithm, a popular Markov chain Monte Carlo (MCMC) technique, to collect samples
from the posterior. We make use of two different proposal distributions, one used on even-numbered
and the other on odd-numbered iterations [8]. Our first proposal distribution, originally developed
by [9], picks one of the nodes in the parse tree randomly, deletes all its descendants, and regenerates
a random subtree from that node according to the rules of the grammar. The spatial model is also
updated accordingly by removing the deleted nodes and picking random positions for newly added S
nodes (see [1] for the acceptance ratio for this proposal). Our simulations showed that this “subtree
regeneration” proposal by itself is not efficient, largely because it takes quite large steps in the
representation space. Our second proposal distribution is a more local one that adds/removes a
single part to/from an object. The acceptance probability for an add move^3 is:

    A(T', S'; T, S) = \min\left(1,\; \frac{P(D \mid T', S')\, P(T' \mid G)}{P(D \mid T, S)\, P(T \mid G)} \cdot \frac{|A|\, |G_t|}{|R'|}\right)    (6)

where R' is the set of S nodes in tree T' that can be removed, A is the set of S nodes in tree T to which a new child S node can be added, and G_t is the set of terminal symbols in the grammar. Subtree regeneration and add/remove part proposals, when combined, enabled fast convergence to the
posterior distribution. In our simulations, we ran each MCMC chain for 10,000 iterations (burn-in:
6,000 iterations). Samples from a representative run of the model are shown in Figure 6. Figure 6a
shows the visual input to the model for this particular run. Samples from the posterior distribution, ordered according to posterior probability, are shown in Figure 6b. The MAP representation
(leftmost sample) recovers the part structure and the spatial configuration of the object perfectly.
This result is general—that is, our model always recovers the “true” representation for each object
regardless of the modality through which the object is sensed. This demonstrates that the model
shows modality invariance, an important type of perceptual constancy.
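The inference scheme can be sketched as a generic Metropolis-Hastings loop that alternates between the two proposal kernels. Everything model-specific here (the shape-state type, the log posterior, and the two proposals with their forward/reverse proposal probabilities) is abstracted behind function arguments; this is a structural sketch, not the paper's implementation:

```python
import math
import random

def metropolis_hastings(init, log_post, proposals, n_iter=10_000, burn_in=6_000):
    """Alternate between two proposal kernels on even/odd iterations.
    Each proposal maps a state to (candidate, log_q_ratio), where
    log_q_ratio = log q(state | candidate) - log q(candidate | state);
    a symmetric proposal simply returns 0.0 for it."""
    state = init
    log_p = log_post(state)
    samples = []
    for i in range(n_iter):
        propose = proposals[i % 2]    # e.g. subtree regeneration / add-remove part
        candidate, log_q_ratio = propose(state)
        log_p_cand = log_post(candidate)
        # Metropolis-Hastings acceptance test, in log space
        if math.log(random.random()) < log_p_cand - log_p + log_q_ratio:
            state, log_p = candidate, log_p_cand
        if i >= burn_in:
            samples.append(state)
    return samples
```

With the shape posterior of Equation 5 supplied as `log_post`, the retained samples approximate P(T, S | D, G), and the MAP estimate can be read off as the retained state with the highest `log_post` value.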
3 Behavioral evaluation of the model
In our behavioral experiment, we investigated how people transfer shape knowledge across modalities. This can also be understood as a problem of crossmodal shape retrieval where one needs to
measure how likely it is to observe some sensory input from one modality given sensory observations from another modality. In order to understand how well our model captures this aspect of
human multimodal perception, we carried out the following behavioral experiment. We asked subjects to rate the shape similarity between objects under two settings: crossmodal and multimodal.^4
In the crossmodal condition, on each trial, subjects first viewed the image of one of the objects on
a computer screen, and then haptically explored the second object (physical copies of the objects
were created via 3-D printing) by reaching into a compartment where the second object was placed.
The compartment prevented subjects from seeing the object. Subjects were asked to judge the shape
similarity between the two objects on a scale of 1 to 7. In the multimodal condition, the procedure was the same except that subjects received both visual and haptic input for both of the objects.
The experiment consisted of 4 blocks; in each block, all possible pairwise comparisons of the 16 objects were presented. We collected data from 7 subjects.
^3 The acceptance probability for the remove move can be derived analogously and is given in [1].

^4 Here we focus only on two conditions in our experiment; please refer to [1] for the full experiment, details on the experimental procedure, and a detailed analysis of our experimental results.
Figure 6: Samples from a representative run of our model. (a) Input to the model. (b) Samples from the posterior distribution over shape representations, ordered according to posterior probability from left to right.
We evaluated how well our model captures subjects’ similarity judgments by comparing its judgments to the behavioral data. To obtain similarity ratings from our model for the crossmodal condition, we proceeded as follows. For each trial, we ran two MCMC chains, one for each object.
We call these the visual and haptic chains referring to the modality through which each object is
presented. For the visual chain, the input is images of the object from three orthogonal viewpoints.
For the haptic chain, we use the joint angles (for 24 grasps) obtained from GraspIt! as the sensory
input. For the multimodal condition, we again ran two chains, one for each object, but provided both
visual and haptic input. Here, we focus only on the MAP samples from these chains, and take the
similarity between the MAP samples from the visual and haptic chains to be the crossmodal similarity judgment of our model. Similarly, the similarity between samples from the visual-haptic chains
for the multimodal condition forms our model’s multimodal similarity judgments. Because our multimodal shape representations are trees, we used tree-edit distance [10] to measure the similarity of
two MAP representations. We compare our model’s judgments to subjects’ judgments as follows.
After verifying that all 7 subjects provided highly consistent similarity ratings (the average correlation between subjects’ ratings was 0.80 ± 0.097 for the crossmodal condition and 0.86 ± 0.065 for the multimodal condition), we averaged all 7 subjects’ ratings to obtain a single similarity matrix for each condition. We then measured the correlation between the average subject similarity matrices and the similarity matrices obtained from our model. These matrices are highly correlated (crossmodal: r = 0.978, p < 0.001; multimodal: r = 0.980, p < 0.001), i.e., our model captures humans’ performance in this task extremely well.
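The final comparison reduces to correlating two 16 × 16 similarity matrices. A sketch of that step (using, say, negative tree-edit distance as the model's similarity score; the helper name is ours):

```python
import numpy as np

def matrix_correlation(model_sim, human_sim):
    """Pearson correlation between two symmetric similarity matrices,
    computed over the upper triangle (excluding the diagonal) so each
    object pair is counted exactly once."""
    m = np.asarray(model_sim, dtype=float)
    h = np.asarray(human_sim, dtype=float)
    iu = np.triu_indices_from(m, k=1)
    return np.corrcoef(m[iu], h[iu])[0, 1]
```

Restricting to the upper triangle avoids inflating the correlation by double-counting the symmetric entries or including the trivial diagonal.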
4 Conclusion
We have presented a computational framework for understanding multimodal perception and showed
that an instantiation of our model for object shape recognition captures humans’ judgments on a
shape similarity task very well. We believe this work constitutes a promising first step, and intend
to evaluate our framework in more detail and more extensively in future work.
References
[1] Goker Erdogan, Ilker Yildirim, and Robert A. Jacobs. From Sensory Signals to Modality-Independent Conceptual Representations: A Probabilistic Language of Thought Approach. PLoS Computational Biology, 11(11):e1004610, November 2015.
[2] Alan Yuille and Daniel Kersten. Vision as Bayesian inference: analysis by synthesis? Trends
in cognitive sciences, 10(7):301–8, 2006.
[3] Thomas G. Bever and David Poeppel. Analysis by Synthesis: A (Re-)Emerging Program of
Research for Language and Vision. BIOLINGUISTICS, 4(2-3):174–200, 2010.
[4] David Marr. Vision: A Computational Investigation into the Human Representation and Processing of Visual Information. MIT Press, Cambridge, Massachusetts, 1982.
[5] Daniel M. Wolpert and J. Randall Flanagan. Forward Models. In T. Bayne, A. Cleeremans, and P. Wilken, editors, The Oxford Companion to Consciousness, pages 295–296. Oxford University Press, New York, 2009.
[6] Marc O Ernst and Martin S Banks. Humans integrate visual and haptic information in a statistically optimal fashion. Nature, 415(6870):429–33, January 2002.
[7] Andrew T. Miller and Peter K. Allen. GraspIt! A versatile simulator for robotic grasping. IEEE Robotics & Automation Magazine, 11(4):110–122, December 2004.
[8] L Tierney. Markov Chains for Exploring Posterior Distributions. The Annals of Statistics,
22(4):1701–1728, 1994.
[9] Noah D Goodman, Joshua B Tenenbaum, Jacob Feldman, and Thomas L Griffiths. A rational
analysis of rule-based concept learning. Cognitive science, 32(1):108–54, January 2008.
[10] Kaizhong Zhang and Dennis Shasha. Simple Fast Algorithms for the Editing Distance between
Trees and Related Problems. SIAM Journal on Computing, 18(6):1245–1262, December 1989.