Institut Jean Nicod, Oct 28, 2005
What is focal attention for?
The What and Why of perceptual selection
The central function of focal attention is to select
We must select because our capacity to process information is limited
We must select because we need to be able to mark certain aspects of a scene and to refer to the marked tokens individually
That’s what this talk is principally about: but first some background
The functions of focal attention
A central notion in vision science is that of “picking out” or selecting (also referring, tracking). The usual mechanism for perceptual selection is called selective attention or focal attention.
Why must we select at all? Overview
We must select because we can’t process all the information available.
This is the resource-limitation reason.
○ But in what way (along what dimensions) is it limited? What happens to what is not selected? The “filter theory” has many problems.
We need to select because certain patterns cannot be computed without first marking certain special elements (e.g. in counting)
We need to select in order to track the identity of individual things, e.g., to solve the correspondence problem by identifying tokens in order to establish the equivalence of this (at t=i) and this (at t=i+ε)
We need to select because of the way relevant information in the world is packaged. This leads to the Binding Problem. That’s an important part of what I will discuss in this talk.
(Diagram illustrating the resource-limited account of selection: sensory inputs converge on a Limited Capacity Channel, flanked by a rehearsal loop, a store of conditional probabilities of past events in LTM, a motor planner, and effectors.)
Broadbent, D. E. (1958). Perception and Communication. London: Pergamon Press.
The question of what is the basis for selection has been at the bottom of a lot of controversy in vision science. Some options that have been proposed include:
We select what can be described physically (i.e., by “channels”) – we select transducer outputs, e.g., by frequency, color, shape, or location
We select according to what is important to us (e.g., affordances – Gibson), or according to phenomenal salience (William James)
We select what we need to treat as special or what we need to refer to
selecting as “marking”
Consider the options for what is the basis of visual selection
The most obvious answer to what we select is places or locations . We can select most other properties by their location – e.g., we can move our eyes so our gaze lands on different places
Must we always move our eyes to change what we attend to?
Studies of Covert Attention-Movement : Posner (1980)
Other empirical questions about place selection…
• When places are selected, are they selected automatically or can they be selected voluntarily?
• How does the visual system specify where to move attention to?
• Are there restrictions on what places we can select?
• Are selected places punctate or can they be regions?
• Must selected places be filled or can they be empty places?
• Can places be specified in relation to landmark objects (e.g., select the place halfway between X and Y )?
(Figure: trial sequence — fixation frame, cue, target–cue interval, detection target — with cued and uncued locations.)
Example of an experiment using a cue-validity paradigm, showing that the locus of attention moves without eye movements and allowing an estimate of its speed.
Posner, M. I. (1980). Orienting of Attention. Quarterly Journal of Experimental Psychology, 32 , 3-25.
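The trial logic of such a cue-validity experiment can be sketched as follows. This is only an illustration: the validity proportion, base RT, and benefit/cost values are assumptions for the sketch, not Posner’s figures.

```python
import random

def make_trials(n_trials=100, p_valid=0.8, seed=0):
    """Generate a trial list for a Posner-style cue-validity paradigm.

    Each trial cues one of two locations; on 'valid' trials the
    detection target appears at the cued location, on 'invalid'
    trials at the uncued one. Proportions are illustrative.
    """
    rng = random.Random(seed)
    trials = []
    for _ in range(n_trials):
        cue = rng.choice(["left", "right"])
        valid = rng.random() < p_valid
        target = cue if valid else ("right" if cue == "left" else "left")
        trials.append({"cue": cue, "target": target, "valid": valid})
    return trials

def predicted_rt(trial, base=300, benefit=30, cost=40):
    """Hypothetical mean RTs (ms): attention at the cued location
    speeds detection there and slows it at the uncued location."""
    return base - benefit if trial["valid"] else base + cost
```

The cue-validity logic is what matters here: the benefit at cued locations and cost at uncued ones is the signature of covert attention movement.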
Extension of Posner’s demonstration of attention switch
Does the improved detection in intermediate locations entail that the “spotlight of attention” moves continuously through empty space?
Sperling & Weichselgartner argued that this apparently analog movement is best explained by a quantal mechanism
The theory assumes a quantal jump in attention in which the spotlight pointed at location -2 is extinguished and, simultaneously, the spotlight at location +2 is turned on. Because extinction and onset take a measurable amount of time, there is a brief period when the spotlights partially illuminate both locations simultaneously.
An independently motivated alternative is that selection occurs when token perceptual objects are individuated
Individuation involves distinguishing something from all the things it is not. In general, individuation involves appealing to properties of the thing in question (cf. Strawson).
○ But a more primitive type of individuation or perceptual parsing may be computed in early vision
Primitive Individuation ( PI ) may be automatic
○ PI is associated with transients or the appearance of a new object
○ PI is sometimes accompanied by assignment of a deictic reference or FINST that keeps individuals distinct without encoding their properties (nonconceptual individuation). This indexing process is, however, numerically limited (to about 4 objects) [* More later]
○ Individuation is often accompanied by the creation of an Object File (OF) for that individual, though the OF may remain empty
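A minimal sketch of this indexing scheme, with an assumed capacity of four and a hypothetical method name (`grab`), might look like the following. The key point the code is meant to capture is that an index points at an object token without encoding any of its properties, and that the associated object file starts out empty.

```python
class FinstPool:
    """Sketch of a pool of FINST visual indexes (capacity ~4).

    An index is a bare pointer to an object token: it keeps the
    individual distinct and accessible without encoding any of
    its properties. Names and capacity are illustrative.
    """

    def __init__(self, capacity=4):
        self.capacity = capacity
        self.bindings = {}       # index id -> object token
        self.object_files = {}   # index id -> encoded properties (may stay empty)

    def grab(self, token):
        """Assign a free index to a newly individuated token (e.g., on
        an onset transient); returns None when the pool is exhausted."""
        if len(self.bindings) >= self.capacity:
            return None
        idx = min(set(range(self.capacity)) - set(self.bindings))
        self.bindings[idx] = token
        self.object_files[idx] = {}   # object file created, but empty
        return idx
```

The numerical limit falls out of the fixed pool size: a fifth simultaneous individual simply cannot be indexed.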
General empirical considerations
Individuals and patterns – the need for argument-binding
Examples: subitizing, collinearity and other relational judgments
Experimental demonstrations
Single-object advantage in joint judgments
Evidence that whole enduring objects are selected
Multiple-Object tracking
Clinical/neuroscience findings
Individuals and patterns
Vision does not recognize patterns by applying templates but by parsing the pattern into parts – recognition-by-components (Biederman)
A pattern is encoded over time (and over eye movements), so the visual system must keep track of the individual parts and recognize them as the same objects at different times and stages of encoding
Individuating is a prerequisite for recognition of configurational properties (patterns) defined among several individual parts
Subitizing provides an example of how easily we can detect patterns when they are defined over a small enough number of parts
In order to recognize a pattern, the visual system must pick out individual parts and bind them to the representation being constructed
Examples include what Ullman called “visual routines”
Another area where the concept of an individual has become important is in cognitive development, where it is clear that babies are sensitive to the numerosity of individual things in a way that is independent of their perceptual properties
Are there collinear items (n>3)?
Several objects must be picked out at once in making relational judgments
The same is true for other relational judgments like inside or on-the-same-contour
… etc. We must pick out the relevant individual objects first. Respond: Inside-same contour? On-same contour?
Another example: Subitizing vs Counting.
How many squares are there?
Subitizing is fast, accurate and only slightly dependent on how many items there are. Only the squares on the right can be subitized.
Concentric squares cannot be subitized because individuating them requires a curve tracing operation that is not automatic.
Signature subitizing phenomena only appear when objects are automatically individuated and indexed
Trick, L. M., & Pylyshyn, Z. W. (1994). Why are small and large numbers enumerated differently? A limited capacity preattentive stage in vision. Psychological Review, 101 (1), 80-102.
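The signature RT profile can be sketched as a piecewise function of the number of items: nearly flat within the span of indexed objects, steep once serial counting takes over. The slope, intercept, and span values below are illustrative assumptions, not the Trick & Pylyshyn data.

```python
def enumeration_rt(n, base=400, subitize_slope=50, count_slope=300, span=4):
    """Illustrative RT (ms) for enumerating n items.

    Within the subitizing span (~4 automatically indexed objects)
    the cost per item is small; beyond it, each additional item must
    be counted serially at a much larger per-item cost.
    All parameter values are assumptions for the sketch.
    """
    if n <= span:
        return base + subitize_slope * n
    return base + subitize_slope * span + count_slope * (n - span)
```

On this sketch the per-item cost jumps sharply at the span boundary, which is the empirical signature that distinguishes subitizing from counting.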
General empirical considerations
Individuals and patterns – the need for argument-binding
Examples: subitizing, collinearity and other relational judgments
Some experimental demonstrations
Single-object advantage in joint judgments
Evidence that whole enduring objects are selected
Multiple-Object tracking
Clinical/neuroscience findings
Instruction: Attend to the Red objects
Which vertex is higher, left or right?
(Note: There are now many control studies that eliminate most obvious confounds)
(Figure: the priming effect spreads to B and not C, or to C and not B, depending on which parts belong to the same object.)
Using a priming method, Egly, Driver & Rafal (1994) showed that the effect of a prime spreads to other parts of the same visual object more than to equally distant parts of different objects.
We can select a shape even when it is intertwined among other similar shapes
Are the green items the same? On a surprise test at the end, subjects were not able to recall shapes that had been present but had not been attended in the task
(Rock & Gutman, 1981; DeSchepper & Treisman, 1996)
Further evidence that attention is object-based comes from the finding that various attention phenomena move with moving objects
Once an object is selected, the selection appears to remain with the object as it moves
Inhibition of return appears to be object-based
Inhibition-of-return (IOR) is the phenomenon whereby attention is slow to go back to an object that had been attended about 0.7 – 1.0 secs before
It is thought to help in visual search since it prevents previously visited objects from being revisited
Tipper, Driver & Weaver (1991) showed that IOR moves with the inhibited object
IOR appears to be object-based (it travels with the object that was attended)
Objects endure despite changes in location; and they carry their history with them!
Object File Theory of Kahneman & Treisman
(Figure: letters appear briefly in two boxes, the boxes move to new locations, and a letter then appears in one of them.)
Letters are faster to read if they appear in the same box where they appeared initially: priming travels with the object. According to the theory, when an object first appears a file is created for it, and the properties of the object are encoded and subsequently accessed through this object file.
General empirical considerations
Individuals and patterns – the need for argument-binding
Examples: subitizing, collinearity and other relational judgments
Experimental demonstrations
Single-object advantage in joint judgments
Evidence that whole enduring objects are selected
Multiple-Object tracking studies (later)
Clinical/neuroscience findings
Visual neglect
Balint syndrome & simultanagnosia
Visual neglect syndrome is object-based
When a right-neglect patient is shown a dumbbell that rotates, the patient continues to neglect the object that had been on the right, even though it is now on the left (Behrmann & Tipper, 1999).
Simultanagnosic (Balint Syndrome) patients attend to only one object at a time
Simultanagnosic patients cannot judge the relative length of two lines, but they can tell that a figure made by connecting the ends of the lines is not a rectangle but a trapezoid (Holmes & Horax, 1919).
Balint patients attend to only one object at a time even if they are overlapping!
Luria, 1959
Some general empirical considerations
Individuals and patterns – the need for argument-binding
Examples: subitizing, collinearity and other relational judgments
Some direct experimental demonstrations
Single-object advantage in joint judgments
Evidence that whole enduring objects are selected
Multiple-Object tracking studies
Clinical/neuroscience findings
One of the clearest cases illustrating object-based selection is Multiple Object Tracking
Keeping track of individual objects in a scene requires a mechanism for individuating, selecting, accessing and tracking the identity of individuals over time
These are the functions we have proposed are carried out by the mechanism of visual indexes (FINSTs)
We have been using a variety of methods for studying visual indexing , including subitizing, subset selection for search, and Multiple Object Tracking (MOT).
In a typical experiment, 8 simple identical objects are presented on a screen and 4 of them are briefly distinguished in some visual manner – usually by flashing them on and off.
After these 4 “targets” have been briefly identified, all objects resume their identical appearance and move randomly. The subjects’ task is to keep track of which ones had earlier been designated as targets.
After a period of 5-10 seconds the motion stops and subjects must indicate, using a mouse, which objects were the targets.
People are very good at this task (80%-98% correct).
The question is: How do they do it?
Explaining Multiple Object Tracking
Basic finding: People (even 5-year-old children) can track 4 to 5 individual objects that have no unique visual properties. How is it done?
Can it be done by keeping track of the only distinctive property of objects – their location?
○ Based on the assumption of finite attention movement speed, our modeling suggests that this cannot be done by encoding and updating locations (because of the speed at which the objects move and the distances between them)
○ If tracking is not done by using the only uniquely distinguishing property of objects, then it must be done by tracking their historical continuity as the same individual object
Our independently motivated hypothesis is that a small number of objects (e.g., 4-5) are individuated and reference tokens or indexes are assigned to them
An index keeps referring to the object as the object changes its properties and its location (that makes it the same object!)
An object is not selected or tracked by using an encoding of any of its properties. It is picked out nonconceptually, just the way a demonstrative does in language (i.e., this, that)
Although some physical properties must be responsible for the individuation and indexing of an object, we have data showing that these properties are not encoded, and the properties that are encoded need not be used in tracking
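For contrast, the location-updating strategy that the modeling argues against can be sketched as a nearest-neighbour tracker (names and coordinates hypothetical). The sketch makes the failure mode concrete: with fast motion and close spacing, two indexes can land on the same object, losing a target.

```python
def track_step(indexed, positions_next):
    """One frame of a nearest-neighbour location-updating tracker.

    `indexed` maps an index id to that object's position in the
    current frame; each index is moved to whichever next-frame
    position is closest. This is the strategy being argued against:
    when objects move far between updates relative to their spacing,
    two indexes can collapse onto a single object.
    """
    def dist2(a, b):
        return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

    return {idx: min(positions_next, key=lambda q: dist2(pos, q))
            for idx, pos in indexed.items()}
```

When displacements are small the update succeeds; when an object jumps far enough that a distractor is now the nearest candidate, pure location updating mis-assigns the index.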
First I will introduce the binding problem as it appears in psychology
The role of selection in encoding conjunctions of properties (the binding problem)
The binding problem was initially described by Anne Treisman, who showed conditions under which vision may fail to correctly bind conjunctions of properties (resulting in conjunction illusions)
Feature binding requires focal attention (i.e., selection )
The problem has been of interest to philosophers because it places constraints on how information may be encoded in early vision (or, as Clark would put it, ‘at the sensory level’ or nonconceptually)
I introduce the binding problem to show how the object-based view is essential for its solution
Introduction to the Binding Problem:
Encoding conjunctions of properties
Experiments show the special difficulty that vision has in detecting conjunctions of several properties
It seems that items have to be attended (i.e., individuated and selected) in order for their property-conjunction to be encoded
When a display is not attended, conjunction errors are frequent
Read the vertical line of digits in this display
What were the letters and their colors?
This is what you saw briefly …
Under these conditions Conjunction Errors are very frequent
Encoding conjunctions requires selection
One source of evidence is from search experiments:
Single-feature search is fast and appears to be independent of the number of items searched through (suggesting it is automatic and ‘pre-attentive’)
Conjunction search is slower and the time increases with the number of items searched through (suggesting it requires serial scanning of attention)
(Treisman)
Find the following simple figure in the next slide:
This case is easy – and the time is independent of how many nontargets there are – because there is only one red item. This is called a ‘popout’ search
This case is also easy – and the time is independent of how many nontargets there are – because there is only one right-leaning item. This is also a ‘popout’ search.
(conjunction)
Find the following simple figure in the next slide:
Feature Integration Theory and Feature Binding
Treisman’s attention-as-glue hypothesis: focal attention (selection) is needed in order to bind properties together
We can recognize not only the presence of “squareness” and “redness”, but we can also distinguish between different ways they may be conjoined:
• Red square and green circle vs. green square and red circle
The evidence suggests that conjoined properties are encoded only if they are attended or selected
Notice that properties are considered to be conjoined if and only if they are properties of the same object, so it is objects that must be selected!
Constraints on nonconceptual representation of visual information (and the binding problem)
Because early (nonconceptual) vision must preserve the conjunctive grouping of properties, visual properties can’t just be represented as being present in the scene – for then the binding problem could not be solved!
What else is required?
The most common answer is that each property must be represented as being at a particular location
According to Peter Strawson and Austin Clark, the basic unit of sensory representation is Feature F at location L. This is the global map or feature-placing proposal.
This proposal fails for interesting empirical reasons
But if feature placing is not the answer, what is?
The role of attention to location in Treisman’s Feature Integration Theory
(Diagram: separate feature maps for color, shape, and orientation feed a master location map; an attention “beam” directed at a location on the master map conjoins the features of the original input at that location.)
But in encoding properties, early vision can’t just bind them together according to their spatial co-occurrence – even their co-occurrence within the same region. That’s because the relevant region depends on the object. So the selection and binding must be according to the objects that have those properties
If location of properties will not give us a way of solving the binding problem, what will?
This is why we need object-based selection and why the object-based attention literature is relevant …
If we assume that only properties of indexed objects (of which there are about 4-5) are encoded and that these are stored in object files associated with each object, then properties that belong to the same object are stored in the same object file , which is why they get bound together
This automatically solves the binding problem!
This is the view exemplified by both FINST Theory (1989) and Object File Theory (1992)
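The contrast between feature placing and object-file binding can be made concrete in a small sketch (the scene encoding and names are illustrative, not from either theory's formal statement). With two objects sharing a region, the feature-placing representation cannot distinguish which features go together, while per-object files preserve the conjunctions.

```python
def feature_placing(scene):
    """Represent the scene as a set of (feature, region) pairs — the
    feature-placing proposal. When two objects share a region, this
    representation loses which features belong together."""
    return {(f, obj["region"])
            for obj in scene
            for f in (obj["color"], obj["shape"])}

def object_files(scene):
    """Represent the scene as one file per indexed object: features
    stored in the same file are thereby bound together."""
    return [frozenset((obj["color"], obj["shape"])) for obj in scene]

# Two scenes with the same features in the same region, conjoined differently:
scene_a = [{"color": "red", "shape": "square", "region": "R1"},
           {"color": "green", "shape": "circle", "region": "R1"}]
scene_b = [{"color": "green", "shape": "square", "region": "R1"},
           {"color": "red", "shape": "circle", "region": "R1"}]
```

Feature placing assigns these two scenes identical representations; object files keep them distinct, which is exactly what solving the binding problem requires.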
The assumption that only properties of indexed objects are encoded raises the question of what happens to properties of the other (unindexed) objects or properties in a display
The logical answer is that they are not encoded and therefore not available to conceptualization and cognition
But this is counter-intuitive!
Maybe we see far less than we think we do!
This possibility has received a great deal of recent attention with the discovery of various ‘blindnesses’ such as change-blindness and inattentional blindness
The assumption that no properties other than properties of indexed objects can be encoded is in conflict with strong intuitions – namely that we see much more than we conceptualize and are aware of. So what do we do about the things we “see” but do not conceptualize?
Some philosophers say they are represented nonconceptually
But what makes this a nonconceptual representation , as opposed to just a causal reaction?
○ At the very minimum postulating that something is a representation must allow generalizations to be captured over their content , which would otherwise not be available
○ Traditionally representations are explanatory because they account for the possibility of misrepresentation and they also enter into conceptualizations and inferences. But unselected objects and unencoded properties don’t seem to fit this requirement (or do they?)
Maybe information about non-indexed objects is not represented at all!!
A possible view (which I am not prepared to fully endorse yet) is that certain topographical or biological reactions (e.g., retinal activity) are not representations – because they have no truth values and so cannot misrepresent
One must distinguish between causal and represented properties
Properties that cause objects to be indexed and tracked and result in object files being created need not be encoded and made available to cognition
Is this just terminological imperialism?
If we call all forms of patterned reactions representations then we will need to have a further distinction among types within this broader class of representation
We may need to distinguish between personal and subpersonal types of ‘representation’ with only the former being representations for our purposes
We may also need to distinguish between patterned states within an encapsulated module that are not available to the rest of the mind/brain and those that are available
○ Certain patterned causal properties may be available to motor control – but does that make them representations?
An essential diagnostic is whether reference to content – to what is represented – allows generalizations that would otherwise be missed; that, in turn, suggests that there is no representation without misrepresentation
○ We don’t want to count retinal images as representations because they can’t misrepresent, though they can be misinterpreted later
This picture leaves many unanswered questions, but it does provide a mechanism for solving the binding problem and also explaining how mental representations could have a nonconceptual connection with objects in the world (something required if mental representations are to connect with actions)
… except for a few loose ends …
Can objects be individuated but not indexed? A new twist to this story
We have recently obtained evidence that objects that are not tracked in MOT are nonetheless being inhibited and the inhibition moves with them
It is harder to detect a probe dot on an untracked object than on either a tracked object or empty space!
But how can inhibition move with a nontarget when the space through which they move is not inhibited?
Doesn’t this require the nontargets to be tracked?
The beginnings of the puzzle of clustering prior to indexing, and what that might mean!
If moving objects are inhibited then inhibition moves along with the objects. How can this be unless they are being tracked? And if they are being tracked there must be at least 8 FINSTs!
This puzzle may signal the need for a kind of individuation that is weaker than the individuation we have discussed so far – a mere clustering, circumscribing, figure-ground distinction without a pointer or access mechanism – i.e. without reference!
It turns out that such a circumscribing-clustering process is needed to fulfill many different functions in early vision. It is needed whenever the correspondence problem arises – whenever visual elements need to be placed in correspondence or paired with other elements. This occurs in computing stereo, apparent motion, and other grouping situations in which the number of elements does not affect ease of pairing (or even results in faster pairing when there are more elements). Correspondence is not computed over continuous visual manifolds but only over some pre-clustered elements.
Example of the correspondence problem for apparent motion
The grey disks correspond to the first flash and the black ones to the second flash. Which of the 24 possible matches will the visual system select as the solution to this correspondence problem? What principle does it use?
Curved matches Linear matches
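A brute-force sketch of this correspondence computation, assuming a shortest-total-displacement heuristic (one candidate principle among several; which principle vision actually uses is the empirical question):

```python
from itertools import permutations

def best_matching(first, second):
    """Solve the apparent-motion correspondence problem by brute force:
    among all pairings of first-flash and second-flash elements, pick
    the one minimizing total displacement. With 4 elements per flash
    there are 4! = 24 candidate matches, as in the demonstration.
    The minimization criterion here is an assumed heuristic."""
    def dist(a, b):
        return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5

    best = min(permutations(range(len(second))),
               key=lambda perm: sum(dist(first[i], second[j])
                                    for i, j in enumerate(perm)))
    return [(first[i], second[best[i]]) for i in range(len(first))]
```

Note that the matching is computed over discrete, pre-clustered elements, not over a continuous visual manifold — which is the point made above about correspondence requiring prior individuation.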
Here is how it actually looks
Views of a dome
Structure from Motion Demo
Cylinder Kinetic Depth Effect
The correspondence problem for biological motion
FINST Theory postulates a limited number of pointers in early vision that are elicited by causal events in the visual field and that enable vision to refer to things without doing so under a concept or a description