An Information Maximization Model of
Eye Movements
Laura Walker Renninger, James Coughlan, Preeti Verghese
Smith-Kettlewell Eye Research Institute
{laura, coughlan, preeti}@ski.org
Jitendra Malik
University of California, Berkeley
malik@eecs.berkeley.edu
Abstract
We propose a sequential information maximization model as a
general strategy for programming eye movements. The model
reconstructs high-resolution visual information from a sequence of
fixations, taking into account the fall-off in resolution from the
fovea to the periphery. From this framework we get a simple rule
for predicting fixation sequences: after each fixation, fixate next at
the location that minimizes uncertainty (maximizes information)
about the stimulus. By comparing our model's performance to human eye movement data and to predictions from a saliency model and a random model, we demonstrate that our model is best at predicting fixation
locations. Modeling additional biological constraints will improve
the prediction of fixation sequences. Our results suggest that
information maximization is a useful principle for programming
eye movements.
1 Introduction
Since the earliest recordings [1, 2], vision researchers have sought to understand the
non-random yet idiosyncratic behavior of volitional eye movements. To do so, we
must not only unravel the bottom-up visual processing involved in selecting a
fixation location, but we must also disentangle the effects of top-down cognitive
factors such as task and prior knowledge. Our ability to predict volitional eye
movements provides a clear measure of our understanding of biological vision.
One approach to predicting fixation locations is to propose that the eyes move to
points that are “salient”. Salient regions can be found by looking for center-surround contrast in visual channels such as color, contrast and orientation, among
others [3, 4]. Saliency has been shown to correlate with human fixation locations
when observers “look around” an image [5, 6] but it is not clear if saliency alone
can explain why some locations are chosen over others and in what order. Task as
well as scene or object knowledge will play a role in constraining the fixation
locations chosen [7]. Observations such as this led to the scanpath theory, which
proposed that eye movement sequences are tightly linked to both the encoding and
retrieval of specific object memories [8].
1.1 Our Approach
We propose that during natural, active vision, we center our fixation on the most
informative points in an image in order to reduce our overall uncertainty about what
we are looking at. This approach is intuitive and may be biologically plausible, as
outlined by Lee & Yu [9]. The most informative point will depend on both the
observer’s current knowledge of the stimulus and the task. The quality of the
information gathered with each fixation will depend greatly on human visual
resolution limits. These limits are the reason we must move our eyes in the first place, yet they are often ignored. A sequence of eye movements may then be understood within a
framework of sequential information maximization.
2 Human eye movements
We investigated how observers examine a novel shape when they must rely heavily
on bottom-up stimulus information. Because eye movements will be affected by the
task of the observer, we constructed a learn-discriminate paradigm. Observers are
asked to carefully study a shape and then discriminate it from a highly similar one.
2.1 Stimuli and Design
We use novel silhouettes to reduce the influence of object familiarity on the pattern
of eye movements and to facilitate our computations of information in the model.
Each silhouette subtends 12.5º to ensure that its entire shape cannot be characterized
with a single fixation.
During the learning phase, subjects first fixated a marker and then pressed a button
to cue the appearance of the shape, which was presented 10º to the left or right of fixation.
Subjects maintained fixation for 300ms, allowing for a peripheral preview of the
object. When the fixation marker disappeared, subjects were allowed to study the
object for 1.2 seconds while their eye movements were recorded. During the
discrimination phase, subjects were asked to select the shape they had just studied
from a highly similar shape pair (Figure 1). Performance was near 75% correct,
indicating that the task was challenging yet feasible. Subjects saw 140 shapes and were given auditory feedback.
Figure 1. Temporal layout of a trial during the learning phase (left): fixate and initiate trial; maintain fixation (300 ms); release fixation and view the object freely (1200 ms). Discrimination of the learned shape from a highly similar one (right): “Which shape is a match?”
2.2 Apparatus
Right eye position was measured with an SRI Dual Purkinje Image eye tracker while
subjects viewed the stimulus binocularly. Head position was fixed with a bite bar. A 25-dot grid that covered the extent of the presentation field was used for calibration.
The points were measured one at a time with each dot being displayed for 500ms.
The stimuli were presented using the Psychtoolbox software [10].
3 Model
We wish to create a model that builds a representation of a shape silhouette given
imperfect visual information, and which updates its representation as new visual
information is acquired. The model will be defined statistically so as to explicitly
encode uncertainty about the current knowledge of the shape silhouette. We will use
this model to generate a simple rule for predicting fixation sequences: after each
fixation, fixate next at the location that will decrease the model’s uncertainty as
much as possible. Similar approaches have been described in an ideal observer
model for reading [11], an information maximization algorithm for tracking
contours in cluttered images [12] and predicting fixation locations during object
learning [13].
3.1 Representing information
The information in a silhouette clearly resides in its contour, which we represent with a collection of points and associated tangent orientations. These points and their associated orientations are called edgelets, denoted $e_1, e_2, \ldots, e_N$, where $N$ is the total number of edgelets along the boundary. Each edgelet $e_i$ is defined as a triple $e_i = (x_i, y_i, z_i)$, where $(x_i, y_i)$ is the 2D location of the edgelet and $z_i$ is the orientation of the tangent to the boundary contour at that point. $z_i$ can assume any of $Q$ possible values $1, 2, \ldots, Q$, representing a discretization of $Q$ possible orientations ranging from 0 to $\pi$; we have chosen $Q = 8$ in our experiments. The goal of the model is to infer the most likely orientation values given the visual information provided by one or more fixations.
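As an illustration, the edgelet representation might be computed from an ordered list of boundary points as in the following sketch (the function name make_edgelets and the finite-difference tangent estimate are our assumptions for illustration, not a specification of the implementation used here):

    import numpy as np

    Q = 8  # number of discrete orientation values, as chosen in our experiments

    def make_edgelets(boundary):
        """boundary: (N, 2) array of ordered (x, y) contour points.
        Returns an (N, 3) array of edgelets (x_i, y_i, z_i), where z_i in
        {0, ..., Q-1} discretizes the tangent orientation over [0, pi)."""
        nxt = np.roll(boundary, -1, axis=0)         # next point along the closed contour
        tangent = nxt - boundary                    # finite-difference tangent vector
        theta = np.arctan2(tangent[:, 1], tangent[:, 0]) % np.pi  # orientation in [0, pi)
        z = np.floor(theta / np.pi * Q).astype(int) % Q           # quantize into Q bins
        return np.column_stack([boundary, z])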
3.2 Updating knowledge
The visual information is based on indirect measurements of the true edgelet values $e_1, e_2, \ldots, e_N$. Although our model assumes complete knowledge of the number $N$ and locations $(x_i, y_i)$ of the edgelets, it does not have direct access to the orientations $z_i$.¹ Orientation information is instead derived from measurements that summarize the local frequency of occurrence of edgelet orientations, averaged locally over a coarse scale (corresponding to the spatial scale at which resolution is limited by the human visual system). These coarse measurements provide indirect information about individual edgelet orientations, which may not uniquely determine the orientations. We will use a simple statistical model to estimate the distribution of individual orientation values conditioned on this information.
Our measurements are defined to model the resolution limitations of the human visual system, with highest resolution at the fovea and lower resolution in the periphery. Distance to the fovea is measured as eccentricity $E$, the visual angle between any point and the fovea. If $\vec{x} = (x, y)$ is the location of a point in an image and $\vec{f} = (f_x, f_y)$ is the fixation (i.e. foveal) location in the image, then the eccentricity is $E = \|\vec{x} - \vec{f}\|$, measured in units of visual degrees. The effective resolution of orientation discrimination falls with increasing eccentricity as $r(E) = F_{PH}(E + E_2)$, where $r(E)$ is an effective radius over which the visual system spatially pools information, $F_{PH} = 0.1$ and $E_2 = 0.8$ [14].

¹ Although the visual system does not have precise knowledge of location coordinates, the model is greatly simplified by assuming this knowledge. It is reasonable to expect that location uncertainty will be highly correlated with orientation uncertainty, so that the inclusion of location should not greatly affect the model's decisions of where to fixate next.
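A minimal sketch of this pooling rule, assuming edgelet locations are expressed in degrees of visual angle (constants as quoted above from [14]):

    import numpy as np

    F_PH, E2 = 0.1, 0.8  # pooling constants from [14]

    def pooling_radius(points, fixation):
        """points: (N, 2) edgelet locations (deg); fixation: (2,) foveal location.
        Returns the effective pooling radius r_i = F_PH * (E_i + E2) per edgelet."""
        E = np.linalg.norm(points - fixation, axis=1)  # eccentricity in visual degrees
        return F_PH * (E + E2)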
Our model represents pooled information as a histogram of edge orientations within the effective radius. For each edgelet $e_i$ we define the histogram of all edgelet orientations $z_j$ within radius $r_i = r(E)$ of $e_i$, where $E$ is the eccentricity of $\vec{x}_i = (x_i, y_i)$ relative to the current fixation $\vec{f}$, i.e. $E = \|\vec{x}_i - \vec{f}\|$. To define the histogram more precisely we will introduce the neighborhood set $N_i$ of all indices $j$ corresponding to edgelets within radius $r_i$ of $e_i$: $N_i = \{\, j \;\mathrm{s.t.}\; \|\vec{x}_i - \vec{x}_j\| \le r_i \,\}$, with number of neighborhood edgelets $|N_i|$. The (normalized) histogram centered at edgelet $e_i$ is then defined as

$$h_{iz} = \frac{1}{|N_i|} \sum_{j \in N_i} \delta_{z, z_j},$$

which is the proportion of edgelet orientations that assume value $z$ in the (eccentricity-dependent) neighborhood of edgelet $e_i$.²
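The histograms for a single fixation can then be sketched as follows (building on the hypothetical helpers above; an O(N²) pairwise-distance computation is used for clarity, not efficiency):

    import numpy as np

    def orientation_histograms(edgelets, fixation, Q=8):
        """edgelets: (N, 3) array of (x_i, y_i, z_i); fixation: (2,) location.
        Returns an (N, Q) array whose row i is the normalized histogram h_iz
        of orientations z_j over the neighborhood N_i = {j : ||x_i - x_j|| <= r_i}."""
        xy, z = edgelets[:, :2], edgelets[:, 2].astype(int)
        r = pooling_radius(xy, fixation)  # eccentricity-dependent radius r_i
        dist = np.linalg.norm(xy[:, None, :] - xy[None, :, :], axis=2)  # ||x_i - x_j||
        h = np.zeros((len(edgelets), Q))
        for i in range(len(edgelets)):
            nbrs = z[dist[i] <= r[i]]  # orientations within N_i (always includes e_i itself)
            h[i] = np.bincount(nbrs, minlength=Q) / len(nbrs)  # normalized counts
        return h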
Figure 2. Relation between eccentricity $E$ and radius $r(E)$ of the neighborhood (disk) which defines the local orientation histogram $h_{iz}$. Left and right panels show two fixations for the same object.
Up to this point we have restricted ourselves to the case of a single fixation. To designate a sequence of multiple fixations we will index them by $k = 1, 2, \ldots, K$ (for $K$ total fixations). The $k$th fixation location is denoted by $\vec{f}^{(k)} = (f_x^k, f_y^k)$. The quantities $r_i$, $N_i$ and $h_{iz}$ depend on fixation location, and so to make this dependence explicit we will augment them with superscripts as $r_i^{(k)}$, $N_i^{(k)}$, and $h_{iz}^{(k)}$.
² $\delta_{x,y}$ is the Kronecker delta function, defined to equal 1 if $x = y$ and 0 if $x \ne y$.
Now we describe the statistical model of edgelet orientations given information obtained from multiple fixations. Ideally we would like to model the exact distribution of orientations conditioned on the histogram data: $P(z_1, z_2, \ldots, z_N \mid \{h_{iz}^{(1)}\}, \{h_{iz}^{(2)}\}, \ldots, \{h_{iz}^{(K)}\})$, where $\{h_{iz}^{(k)}\}$ represents all histogram components $z$ at every edgelet $e_i$ for fixation $\vec{f}^{(k)}$. This exact distribution is intractable, so we will use a simple approximation. We assume the distribution factors over individual edgelets:

$$P(z_1, z_2, \ldots, z_N \mid \{h_{iz}^{(1)}\}, \{h_{iz}^{(2)}\}, \ldots, \{h_{iz}^{(K)}\}) = \prod_{i=1}^{N} g_i(z_i),$$

where $g_i(z_i)$ is the marginal distribution of orientation $z_i$. Determining these marginal distributions is still difficult even with the factorization assumption, so we will make an additional approximation: $g_i(z_i) = \frac{1}{Z_i} \prod_{k=1}^{K} h_{iz_i}^{(k)}$, where $Z_i$ is a suitable normalization factor. This approximation corresponds to treating $h_{iz}^{(k)}$ as a likelihood function over $z$, with independent likelihoods for each fixation $k$. While the approximation has some undesirable properties (such as making the marginal distribution $g_i(z_i)$ more peaked if the same fixation is made repeatedly), it provides a simple mechanism for combining histogram evidence from multiple, distinct fixations.
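Under these assumptions, evidence combination is just an elementwise product of the per-fixation histograms followed by normalization. A sketch (the small epsilon is our addition, to keep a single zero count from permanently ruling out an orientation):

    import numpy as np

    def combine_fixations(histograms, eps=1e-9):
        """histograms: list of (N, Q) arrays h^(k), one per fixation k = 1..K.
        Returns the (N, Q) array of marginals g_i(z), normalized over z."""
        g = np.ones_like(histograms[0])
        for h in histograms:
            g *= h + eps  # treat each fixation's histogram as an independent likelihood
        return g / g.sum(axis=1, keepdims=True)  # the 1/Z_i normalization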
3.3 Selecting the next fixation
Given the past $K$ fixations, the next fixation $\vec{f}^{(K+1)}$ is chosen to minimize the model entropy of the edgelet orientations. In other words, $\vec{f}^{(K+1)}$ is chosen to minimize

$$H(\vec{f}^{(K+1)}) = \mathrm{entropy}\left[ P(z_1, z_2, \ldots, z_N \mid \{h_{iz}^{(1)}\}, \{h_{iz}^{(2)}\}, \ldots, \{h_{iz}^{(K+1)}\}) \right],$$

where the entropy of a distribution $P(x)$ is defined as $-\sum_x P(x) \log P(x)$. Because the distribution factors over individual edgelets, this joint entropy reduces to the sum of the entropies of the marginals $g_i$. In practice, we minimize the entropy by evaluating it across a set of candidate locations $\vec{f}^{(K+1)}$ which forms a regularly sampled grid across the image.³ We note that this selection rule makes decisions that depend, in general, on the full history of the previous $K$ fixations.

³ This rule evaluates the entropy resulting from every possible next fixation before making a decision. Although this rule is suitable for our modeling purposes, it would be inefficient to implement in a biological or machine vision system. A practical decision rule would use current knowledge to estimate the expected (rather than actual) entropy.
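Because the factored entropy is just a sum over edgelets, the grid search can be sketched directly (reusing the hypothetical helpers above; grid is a list of candidate (f_x, f_y) locations):

    import numpy as np

    def entropy(g):
        """Total entropy of the factored model: sum_i -sum_z g_i(z) log g_i(z)."""
        return -(g * np.log(g + 1e-12)).sum()  # small epsilon guards log(0)

    def next_fixation(edgelets, past_histograms, grid):
        """Pick the candidate fixation that minimizes the updated model entropy."""
        best, best_H = None, np.inf
        for f in grid:
            h_new = orientation_histograms(edgelets, np.asarray(f))  # candidate h^(K+1)
            g = combine_fixations(past_histograms + [h_new])         # updated marginals
            H = entropy(g)
            if H < best_H:
                best, best_H = f, H
        return best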
4 Results
Figure 3 shows an example of one observer’s eye movements superimposed over the
shape (top row), the prediction from a saliency model (middle row) [3] and the
prediction from the information maximization model (bottom row). The information
maximization model updates its prediction after each fixation.
An ideal sequence of fixations can be generated by both models. The saliency model selects fixations in order of decreasing salience. The information maximization model selects the maximally informative point after incorporating information from the previous fixations. To provide an additional benchmark, we also implemented a model that selects fixations at random.

Figure 3. Example eye movement pattern, superimposed over the stimulus (top row), saliency map (middle row) and information maximization map (bottom row).

One way to quantify the performance is to
map a subject’s fixations onto the closest model-predicted fixation locations,
ignoring the sequence in which they were made. In this analysis, both the saliency
and information maximization models are significantly better than random at
predicting candidate locations (p < 0.05; t-test) for three observers (Figure 4, left).
The information maximization model performs slightly but significantly better than
the saliency model for two observers (lm, kr). If we match fixation locations while
retaining the sequence, errors become quite large, indicating that the models cannot
account for the observed behavior (Figure 4, right).
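For concreteness, the sequence-ignoring location error can be sketched as a nearest-neighbor match (an illustrative reading of the analysis; the exact matching procedure used for Figure 4 may differ in detail):

    import numpy as np

    def location_error(human_fix, model_fix):
        """Both arguments: (n, 2) arrays of fixation locations in degrees.
        Returns the mean distance from each human fixation to the closest
        model-predicted fixation, ignoring the order of fixations."""
        d = np.linalg.norm(human_fix[:, None, :] - model_fix[None, :, :], axis=2)
        return d.min(axis=1).mean()  # nearest prediction per human fixation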
Figure 4. Prediction error (in degrees of visual angle) of three models: random (R), saliency (S) and information maximization (I) for three observers (pv, lm, kr). The left panel (location error) shows the error in predicting fixation locations, ignoring sequence. The right panel (sequence error) shows the error when sequence is retained before mapping. Error bars are 95% confidence intervals.
The information maximization model incorporates resolution limitations, but there
are further biological constraints that must be considered if we are to build a model
that can fully explain human eye movement patterns. First, saccade amplitudes are
typically around 2-4º and rarely exceed 15º [15]. When we move our eyes, the
image of the visual world is smeared across the retina and our perception of it is
actively suppressed [16]. Shorter saccade lengths may be a mechanism to reduce
this cost. This biological constraint would cause a fixation to fall short of the
prediction if it is distant from the current fixation (Figure 5).
Figure 5. Cost of moving the eyes. Successive fixations may fall short of the maximally
salient or informative point if it is very distant from the current fixation.
Second, the biological system may increase its sampling efficiency by planning a
series of saccades concurrently [17, 18]. Several fixations may therefore be made
before sampled information begins to influence target selection. The information
maximization model currently updates after each fixation. This would create a
discrepancy in the prediction of the eye movement sequence (Figure 6).
Figure 6. Three fixations are made to a location that is initially highly informative
according to the information maximization model. By the fourth fixation, the subject
finally moves to the next most informative point.
5 Discussion
Our model and the saliency model use the same image information to determine fixation locations, so it is not surprising that their performance at predicting human fixation locations is roughly similar. The main difference is
how we decide to “shift attention” or program the sequence of eye movements to
these locations. The saliency model uses a winner-take-all and inhibition-of-return
mechanism to shift among the salient regions. We take a completely different
approach by saying that observers adopt a strategy of sequential information
maximization. In effect, the history of where we have been matters because our
model is continually collecting information from the stimulus. We have an implicit
“inhibition-of-return” because there is little to be gained by revisiting a point.
A second difference is that we attempt to take biological resolution limits into account when determining the quality of information gained with each fixation. By including
additional biological constraints such as the cost of making large saccades and the
natural time course of information update, we may be able to improve our prediction
of eye movement sequences.
We have shown that the programming of eye movements can be understood within a
framework of sequential information maximization. This framework is portable to
any image or task. A remaining challenge is to understand how different tasks
constrain the representation of information and to what degree observers are able to
utilize the information.
Acknowledgments
Smith-Kettlewell Eye Research Institute, NIH Ruth L. Kirschstein NRSA, ONR #N00014-01-1-0890, NSF #IIS-0415310, NIDRR #H133G030080, NASA #NAG 9-1461.
References
[1] Buswell (1935). How people look at pictures. Chicago: The University of Chicago Press.
[2] Yarbus (1967). Eye movements and vision. New York: Plenum Press.
[3] Itti & Koch (2000). A saliency-based search mechanism for overt and covert shifts of
visual attention. Vision Research, 40, 1489-1506.
[4] Kadir & Brady (2001). Scale, saliency and image description. International Journal of
Computer Vision, 45(2), 83-105.
[5] Parkhurst, Law, and Niebur (2002). Modeling the role of salience in the allocation of
overt visual attention. Vision Research, 42(1), 107-123.
[6] Nothdurft (2002). Attention shifts to salient targets. Vision Research, 42, 1287-1306.
[7] Oliva, Torralba, Castelhano & Henderson (2003). Top-down control of visual attention in
object detection. Proceedings of the IEEE International Conference on Image Processing,
Barcelona, Spain.
[8] Noton & Stark (1971). Scanpaths in eye movements during pattern perception. Science,
171, 308-311.
[9] Lee & Yu (2000). An information-theoretic framework for understanding saccadic
behaviors. Advanced in Neural Processing Systems, 12, 834-840.
[10] Brainard (1997). The psychophysics toolbox. Spatial Vision, 10 (4), 433-436.
[11] Legge, Hooven, Klitz, Mansfield & Tjan (2002). Mr. Chips 2002: new insights from an
ideal-observer model of reading. Vision Research, 42, 2219-2234.
[12] Geman & Jedynak (1996). An active testing model for tracking roads in satellite images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(1), 1-14.
[13] Renninger & Malik (2004). Sequential information maximization can explain eye
movements in an object learning task. Journal of Vision, 4(8), 744a.
[14] Levi, Klein & Aitesbaomo (1985). Vernier acuity, crowding and cortical magnification.
Vision Research, 25(7), 963-977.
[15] Bahill, Adler & Stark (1975). Most naturally occurring human saccades have
magnitudes of 15 degrees or less. Investigative Ophthalmology, 14, 468-469.
[16] Burr, Morrone & Ross (1994). Selective suppression of the magnocellular visual
pathway during saccadic eye movements. Nature, 371, 511-513.
[17] Caspi, Beutter & Eckstein (2004). The time course of visual information accrual guiding eye movement decisions. Proceedings of the National Academy of Sciences, 101(35), 13086-13090.
[18] McPeek, Skavenski & Nakayama (2000). Concurrent processing of saccades in visual
search. Vision Research, 40, 2499-2516.