A neglected problem in the Representational Theory of Mind: Object Tracking and the Mind-World Connection

Before I begin I would like you to see a 'video game' to which I will refer later. The demonstration shows a task called "Multiple Object Tracking". Track the initially-distinct (flashing) items through the trial (here 10 secs) and indicate at the end which items are the "targets". After each example I'd like you to ask yourself, "How do I do it?" If you are like most of our subjects you will have no idea, or a false idea…

Keep track of the objects that flash.

How did you do it? What properties of individual objects did you use in order to track them? Did you use some grouping or chunking heuristic? Does your introspection reveal how you tracked the targets? Does your introspection ever reveal what processes go on in your mind?

Going behind occluding surfaces does not disrupt tracking. [Scholl, B. J., & Pylyshyn, Z. W. (1999). Tracking multiple items through occlusion: Clues to visual objecthood. Cognitive Psychology, 38(2), 259-290.]

Not all well-defined features can be tracked: Track the endpoints of these lines. The endpoints move exactly as the squares did!

The basic problem of cognitive science: What determines our behavior is not how the world is, but how we represent it as being. As Chomsky pointed out in his review of Skinner, if we describe behavior in relation to the objective properties of the world, we would have to conclude that behavior is essentially stimulus-independent. Every naturally-occurring behavioral regularity is cognitively penetrable: any information that changes beliefs can systematically and rationally change behavior.

Representation and Mind: Why representations are essential. Do representations only come into play in "higher level" mental activities, such as reasoning? Even at early stages of perception many of the states that must be postulated are representations (i.e., their content, or what they are about, plays a role in explanations). Examples from vision (1): Intrapercept constraints. [Epstein, W. (1982). Percept-percept couplings. Perception, 11, 75-83.]

Another example of a classical representation; other forms of representation…
Lines FG, BC are parallel and equal. Lines EH, AD are parallel and equal. Lines FB, GC are parallel and equal. Lines EA, HD are parallel and equal. Vertices EF, HG, DC and AB are joined…
Part-Of{Cube, Top-Face(EFGH), Bottom-Face(ABCD), Front-Face(FGCB), Back-Face(EHDA)}
Part-Of{Top-Face, Front-Edge(FG), Back-Edge(EH), Left-Edge(EF), Right-Edge(HG)}, …

What's wrong with these representations? What's wrong is that the CTM is incomplete: it does not address a number of fundamental questions. It fails to specify how representations connect with what they represent. It's not enough to use English words in the representation (that's been a common confusion in AI) or to draw pictures (a common confusion in theories of mental imagery). English labels and pictures may help the theorist recall which objects are being referred to… but what makes it the case that a particular mental symbol refers to one thing rather than another? Or, how are concepts grounded? (the Symbol Grounding Problem)
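To make the point about such classical representations concrete, here is a minimal sketch (Python; all names such as part_of and the dictionary layout are my own illustrative choices, not anything from the slides) of the cube description above, written as the kind of symbolic structure the CTM posits. Notice that nothing in the structure connects the symbol "FG" to any particular edge in the world; the labels only help us, the theorists.

```python
# A classical symbolic representation of the cube EFGH-ABCD, as on the slide.
# Hypothetical data structure, not a real API.

cube = {
    "type": "Cube",
    "part_of": {
        "Top-Face":    {"edges": ["FG", "EH", "EF", "HG"]},
        "Bottom-Face": {"edges": ["BC", "AD", "AB", "DC"]},
        "Front-Face":  {"edges": ["FG", "BC", "FB", "GC"]},
        "Back-Face":   {"edges": ["EH", "AD", "EA", "HD"]},
    },
    "constraints": [
        ("parallel_and_equal", "FG", "BC"),
        ("parallel_and_equal", "EH", "AD"),
        ("parallel_and_equal", "FB", "GC"),
        ("parallel_and_equal", "EA", "HD"),
    ],
}

# The grounding problem in one line: the string "FG" occurs in the structure,
# but nothing here makes it refer to one edge of one cube rather than another.
print(cube["part_of"]["Top-Face"]["edges"])   # ['FG', 'EH', 'EF', 'HG']
```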
Another way to look at what the Computational Theory of Mind lacks: The missing function in the CTM is a mechanism that allows perception to refer to individual things in the visual field directly, without appealing to their properties, i.e., nonconceptually: not as "whatever has properties P1, P2, P3, ...", but as a singular term that refers directly to an individual and does not appeal to a representation of the individual's properties. Such a reference is like a proper name, or like a demonstrative term (like this or that) in natural language, or like a pointer in a computer data structure. There is more to come on the mechanism of visual indexing.

An example from personal history: Why we need to pick out individual things without referring to their properties. We wanted to develop a computer system that would reason about geometry by actually drawing a diagram and noticing adventitious properties of the diagram, from which it would conjecture lemmas to prove. We wanted the system to be as psychologically realistic as possible, so we assumed that it had a narrow field of view and noticed only limited, spatially restricted information as it examined the drawing. This immediately raised the problem of coordinating noticings, and led us to the idea of visual indexes to keep track of previously encoded parts of the diagram.

Begin by drawing a line… [L1] Now draw a second line… [L2] And draw a third line… [L3]

What do we have so far? We know there are three lines, but we don't know the spatial relations between them. That requires: 1. Seeing several of them together (at least in pairs). 2. Knowing which object seen at time t+1 corresponds to a particular object that was seen at time t. Establishing (2) requires solving one form of the correspondence problem. This problem is ubiquitous in perception. Solving it over time is called tracking.

For example, suppose you recall noticing two intersecting lines such as these [L1, L2]. You know that there is an intersection of two lines… but which of the two lines you drew earlier are they? There is no way to indicate which individual things are seen again without a way to refer to individual token things. Look around some more to see what is there… [L5, L2, V12] Here is another intersection of two lines… Is it the same intersection as the one seen earlier? Without a special way to keep track of individuals, the only way to tell would be to encode unique properties of each of the lines. Which properties should you encode? In examining a geometrical figure one only gets to see a sequence of local glimpses.

A note about the use of labels in this example: There are two purposes for figure labels. One is to specify what type of individual it is (line, vertex, …). The other is to specify which individual it is, in order to keep track of it and in order to bind it to the argument of a predicate. The second of these is what I am concerned with, because indicating which individual it is is essential in vision. Many people (e.g., Marr, Yantis) have suggested that individuals may be marked by tags, but that won't do, since one cannot literally place a tag on an object; and even if we could, it would not obviate the need to individuate and index, just as labels don't help. Labeling things in the world is not enough, because to refer to the line labeled L1 you would have to be able to think "this is line L1", and you could not think that unless you had a way of first picking out the referent of this.
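Since the slides compare an index to a pointer in a computer data structure, here is a minimal sketch (Python; the data and function names are hypothetical, not the actual geometry system) of the difference between descriptive lookup and pointer-like reference in the drawing example:

```python
# Two ways a diagram-examining system might refer to a line it saw earlier.

# 1. Descriptive reference: find the line again by its encoded properties.
#    This fails whenever two lines share the stored properties.
lines_seen = [
    {"id": 1, "orientation": 45, "length": 10},
    {"id": 2, "orientation": 45, "length": 10},   # an indistinguishable twin
]

def find_by_description(orientation, length):
    return [ln for ln in lines_seen
            if ln["orientation"] == orientation and ln["length"] == length]

print(len(find_by_description(45, 10)))  # 2 -- the description cannot say WHICH one

# 2. Pointer-like (demonstrative) reference: keep a direct handle on the token.
#    No properties are consulted, and the reference survives property changes.
index = lines_seen[0]          # a bare pointer to THAT line
index["orientation"] = 90      # the line changes...
print(index["id"])             # 1 -- ...but the index still picks out the same token
```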
The Correspondence Problem: A frequent task in perception is to establish a correspondence between proximal tokens that arise from the same distal token.
Apparent motion: tokens at different times may correspond to the same object that has moved.
Constructing a representation over time (and over eye fixations) requires determining the correspondence between tokens at different stages in constructing the representation.
Tracking token individuals over time/space: to distinguish "here it is again" from "here is another one", and so to maintain the identity of objects.
Stereo vision requires establishing a correspondence between two proximal (retinal) tokens, one in each eye.

Apparent motion solves a correspondence problem: the Dawson Configuration (Dawson & Pylyshyn, 1988). Linear trajectory? Curved trajectory? Which criterion does the visual module prefer? [Dawson Configuration, animated]

Nearest mean distance? Nearest vector distance? Nearest configural distance? Which criterion does the visual module prefer? Colors and shapes are ignored: in the Dawson Configuration, different properties are ignored.

Yantis's use of the "Ternus Configuration" to demonstrate the early visual effect of objecthood: Short time delays result in "element motion" (the middle object persists as the "same object", so it does not appear to move). Long time delays result in "group motion", because the middle object does not persist but is perceived as a new object each time it reappears.

Relevance to the present theme: These different examples illustrate the need to keep track of objects' numerical identity (their sameness as individuals) in a primitive, nonconceptual way, and to put their token representations in correspondence. In each case the correspondence is computed, without any conscious awareness, by the early vision module. The examples (apparent motion, stereo vision, incremental construction of representations, and keeping track of individuality over time/space) are on different time scales, so it is an empirical matter whether they involve the same mechanism, but they do address the same problem: tracking individuals without using their unique properties.

The difference between a direct (demonstrative) and a descriptive way of picking something out has produced many "You are here" cartoons. It is also illustrated in a recent New Yorker cartoon by Sipress.

'Picking out': Picking out entails individuating, in the sense of separating something from a background (what Gestalt psychologists called a figure-ground distinction). This sort of picking out has been studied in psychology under the heading of focal or selective attention. Focal attention appears to pick out and adhere to objects rather than places. In addition to a unitary focal attention there is also evidence for a mechanism of multiple references (about 4 or 5), which I have called a visual index or a FINST. Indexes are different from focal attention in many ways that we have studied in our laboratory (I will mention a few later). A visual index is like a pointer in a computer data structure: it allows access but does not itself tell you anything about what is being pointed to. Note that the English word pointer is misleading because it suggests that vision picks out objects by pointing to their location.
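To see what "computing a correspondence" amounts to, here is a minimal sketch (Python; the coordinates and labels are made up for illustration) of matching tokens across two frames by a pure proximity criterion while ignoring their features, in the spirit of the Dawson results:

```python
from itertools import permutations
import math

# Frame 1 and frame 2 token positions (x, y) with feature labels.
frame1 = [((0, 0), "red-square"), ((4, 0), "green-circle")]
frame2 = [((4, 1), "red-square"), ((1, 1), "green-circle")]

def total_displacement(assignment):
    return sum(math.dist(p1, p2) for p1, p2 in assignment)

# Consider every way of pairing old tokens with new ones, and pick the
# pairing with the smallest total movement -- features play no role.
pairings = [list(zip([p for p, _ in frame1], perm))
            for perm in permutations([p for p, _ in frame2])]
best = min(pairings, key=total_displacement)
print(best)  # pairs (0,0)->(1,1) and (4,0)->(4,1): nearest, not feature-matched
```

Note the design point: the red square's token gets matched to the nearer position even though that position carries the "wrong" features, which is just what the module's indifference to color and shape predicts.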
The requirements for picking out and keeping track of several individual things reminded me of an early comic book character called Plastic Man. Imagine being able to place several of your fingers on things in the world without recognizing their properties while doing so. You could then refer to those things (e.g., 'what finger #2 is touching') and could move your attention to them. You would then be said to possess FINgers of INSTantiation (FINSTs).

FINST Theory postulates a limited number of pointers in early vision that are elicited by certain events in the visual field and that index the objects associated with the event. These enable vision to refer to those objects without doing so under a concept/description.

This idea is intriguing, but it is missing one or two details, as well as some distinctions. We need to distinguish the mechanisms of early vision (inside the vision module) from those of general cognition. We need to distinguish different types of information in different parts of vision (e.g., representations vs physical states, conceptual vs nonconceptual, as well as personal vs subpersonal). Closely related to these, we need to distinguish the processes of vision from those of belief fixation. Finally, we need to provide a motivated proposal for what the modular (subpersonal?) part of vision hands off to the rest of the cognitive mind. This is a difficult problem and will occupy some of our time in the rest of this class.

Returning to the FINST Theory. First approximation: FINSTs and Object Files, and the link between the world and its conceptualization. [Diagram: Object File contents are conceptual; the only nonconceptual contents in this picture are the FINST indexes. Links shown: information (causal) link; FINST demonstrative reference link.]

Summarizing the theory so far: A FINST index is a primitive mechanism of reference that refers to individual visible objects in the world. There are a small number (~4-5) of indexes available at any one time. Indexes refer to individual objects without referring to them under conceptual categories, so they provide nonconceptual reference. Q: Is this a case of seeing without seeing as? Indexing objects is prior to encoding any of their properties, so objects are picked out and referred to without using any encoding of their properties. This does not mean that object properties are irrelevant to the grabbing of indexes or to the subsequent process of tracking. The claim that we initially refer to objects without having encoded their location is surprising to many people (why?). What may be even more surprising is that we can index and refer to objects without knowing what they are!

Summarizing the theory so far: An important function of these indexes is to bind the arguments of visual predicates to the things in the world to which they refer. Only predicates with bound arguments can be evaluated. Since predicates are quintessential concepts, an index serves as a bridge from objects to conceptual representations. Indexes can also bind the arguments of motor commands, including the command to move focal attention or gaze to the indexed object: e.g., MoveGaze(x).

Some hard problems that Fodor and I will discuss at a later lecture: Getting information about a particular object into its Object File. How and when does this happen? Who can use the information in an object file? Can it be used to track objects by checking whether a candidate object has the same properties as a particular previous object?
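As a first pass at the machinery just summarized, here is a minimal sketch (Python; the class names, the use of (x, y) points as "objects", and the stub motor routine are all my own illustrative choices) of indexes binding objects to initially empty object files and to the arguments of predicates and motor commands:

```python
# A toy rendering of the theory's first approximation, not the actual model.

MAX_INDEXES = 4                      # roughly 4-5 indexes at a time

class FINST:
    def __init__(self, referent):
        self.referent = referent     # direct, nonconceptual link to the object
        self.object_file = {}        # initially empty; properties may come later

active_indexes = []

def grab_index(world_object):
    """An object in the world grabs a free index, if any are left."""
    if len(active_indexes) >= MAX_INDEXES:
        return None                  # no free index; the object goes unindexed
    finst = FINST(world_object)
    active_indexes.append(finst)
    return finst

# Predicates are evaluated only over bound arguments. Here "objects" are
# just (x, y) points so that the predicate is actually computable.
def collinear(a, b, c):
    (x1, y1), (x2, y2), (x3, y3) = a.referent, b.referent, c.referent
    return (x2 - x1) * (y3 - y1) == (x3 - x1) * (y2 - y1)

# Motor commands can also take an index as argument, e.g. MoveGaze(x):
def move_gaze(x):
    print("foveating", x.referent)   # stand-in for an actual motor routine

i, j, k = (grab_index(p) for p in [(0, 0), (1, 1), (2, 2)])
print(collinear(i, j, k))            # True: arguments bound via indexes
move_gaze(i)                         # foveating (0, 0)
```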
Some hard problems and some open empirical questions, to be discussed at various later lectures:
● How and when does information about a particular object get into its Object File?
● Who can use the information in an object file?
● Can the information in the file be used to determine the correspondence between objects by checking whether they have the same properties? Is this how tracking is accomplished?
● Is the Object File inside the vision module or outside?
● Is information in the Object File used to solve the many-properties binding problem? Is this done during tracking?

Part 2: Some notes on how indexes might be implemented

A thought experiment: How might one implement an indexing system? The attempt might clarify how it is possible to index an object without having explicit access to the coordinates or other properties of objects. I will sketch a network model, but will only describe how it looks functionally to a user who pushes buttons and notices which lights come on. The model takes as input an activation map (on the proximal stimulus) with a set of sensors at each point (each pixel). Based on the relative activity at each point it indexes a number of active objects and illuminates a light for each. The user chooses one of the illuminated objects (by name; nobody knows where they are) by pressing a button beside one of the lights.

The person then presses a button on a property-detection panel marked with a property name. If the light beside the button illuminates, then we know that the object indexed in panel 2 has the property indicated by panel 3.

The way this model is wired up is simple. The first panel feeds a winner-take-all network which inhibits every input unit but the most active one (a classical Darwinian, or capitalist, world). This enables a circuit from the button next to the illuminated light to the input unit which led to the light being on (that's the 'index'). Pressing the button sends a unit of activity to that input unit, which now has a property transducer and an activity selector on (two out of the required three before it sends out a general tremor of activity). Now you press the button for a property inquiry (panel 3), which activates all P detectors. If the selected input unit, the property transducer, and the property-inquiry signal are all on, that input fires.

Moral: It is trivial to design a circuit that allows one to check whether a particular place on a proximal stimulus that has grabbed an index has a particular property. All it takes are some threshold units and some AND and OR units. Although the simple black box I showed you can only detect one static input place at a time, it can inquire about several properties. Extending this to moving objects is easy using the same ideas: you partially activate regions near each selected input unit, thus increasing the likelihood that it will be selected at the next cycle, and so on.
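The functional description above is easy to realize. The following sketch (Python; the unit names, activity values, and property sets are invented for illustration, not the actual circuit) implements the winner-take-all selection and the three-way AND that answers a property inquiry:

```python
# A toy winner-take-all indexing circuit, following the thought experiment.
# Units are just numbers; the "panels" are function calls.

activation_map = {          # per-pixel activity on the proximal stimulus
    "unit_A": 0.9,
    "unit_B": 0.4,
    "unit_C": 0.7,
}

properties_at = {           # what each unit's property transducers detect
    "unit_A": {"red", "moving"},
    "unit_B": {"green"},
    "unit_C": {"red"},
}

def winner_take_all(acts):
    """Inhibit every input unit but the most active one; the survivor
    is the 'index' (a light goes on beside it)."""
    return max(acts, key=acts.get)

def inquire(unit, prop):
    """Panel 3: the unit fires only if (1) it is the selected unit,
    (2) its property transducer for prop is on, and (3) the inquiry
    signal is on -- a three-way AND."""
    selected = (unit == winner_take_all(activation_map))   # condition 1
    transducer_on = prop in properties_at[unit]            # condition 2
    inquiry_signal = True                                  # condition 3
    return selected and transducer_on and inquiry_signal

index = winner_take_all(activation_map)
print(index)                    # unit_A: the most active place grabbed the index
print(inquire(index, "red"))    # True: the indexed thing is red
print(inquire(index, "green"))  # False
```

Extending this to motion, as the slide says, would amount to adding a little activation to units near the current winner so that the same object tends to win again on the next cycle.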
Some evidence for indexes and Object Files:
● The correspondence problem
● The binding problem
● Evaluating multi-place visual predicates
● Operating over several visual elements at once without having to search for them first
● Recognizing shapes by their part-whole relations
● Subitizing
● Subset search
● Multiple-Object Tracking
● Imagining space without requiring a spatial display in the head {This is a large topic beyond the scope of this class, but see Things and Places, Chapter 5}

A quick tour of some evidence for FINSTs:
● The correspondence problem (mentioned earlier)
● The binding problem
● Evaluating multi-place visual predicates (recognizing multi-element patterns)
● Operating over several visual elements at once without having to search for them first
● Subitizing
● Subset selection
● Multiple-Object Tracking
● Imagining space without requiring a spatial display in the head

Pandemonium: An early architecture for vision, called Pandemonium, was proposed by Oliver Selfridge in 1959. This idea continues to be at the heart of many psychological models, including ones implemented in contemporary connectionist or neural net models. It is also the basic idea in what are called Blackboard Architectures in AI (e.g., the Hearsay speech recognition systems). These architectures have no way to represent that some of the features detected actually belong with other features detected. [Diagram: image demons feed feature demons (vertical lines, horizontal lines, oblique lines, right angles, acute angles, discontinuous curves, continuous curves), which feed cognitive demons and a decision demon; cortical signal processing.]

Introduction to the Binding Problem: Encoding conjunctions of properties. Experiments show the special difficulty that vision has in detecting conjunctions of several properties. It seems that items have to be attended (i.e., individuated and selected) in order for their property-conjunction to be encoded. When a display is not attended, conjunction errors are frequent.

Read the vertical line of digits in this display. What were the letters and their colors? This is what you saw briefly… Under these conditions conjunction errors are very frequent.

Encoding conjunctions requires attention. One source of evidence is from search experiments: Single-feature search is fast and appears to be independent of the number of items searched through (suggesting it is automatic and 'pre-attentive'). Conjunction search is slower, and the time increases with the number of items searched through (suggesting it requires serial scanning of attention).

Rapid visual search (Treisman): Find the following simple figure in the next slide. This case is easy, and the time is independent of how many nontargets there are, because there is only one red item. This is called a 'popout' search. This case is also easy, and the time is independent of how many nontargets there are, because there is only one right-leaning item. This is also a 'popout' search.

Rapid visual search (conjunction): Find the following simple figure in the next slide.

Constraints on nonconceptual representation of visual information (and the binding problem): Because early (nonconceptual) vision must not lose the conjunctive grouping of properties, visual properties can't just be represented as being present in the scene, because then the binding problem could not be solved! What else is required?
The most common answer is that each property must be represented as being at a particular location. According to Peter Strawson and Austin Clark, the basic unit of sensory representation is Feature-F-at-location-L. This is the so-called feature-placing proposal. This proposal fails for interesting empirical reasons. But if feature placing is not the answer, what is?

The role of attention to location in Treisman's Feature Integration Theory: [Diagram: the original input feeds color maps (R, Y, G), shape maps, and orientation maps, all linked to a master location map; an attention "beam" selects a location and the conjunction is detected.]

Individual objects and the binding problem: We can distinguish scenes that differ by conjunctions of properties, so early vision must somehow keep track of how properties co-occur; conjunction must not be obscured. How to do this is called the binding problem. The most common proposal is that vision keeps track of properties according to their location and binds together co-located properties. The proposal of binding conjunctions by the location of conjuncts does not work when feature location is not punctate, and it becomes even more problematic if the features are co-located, e.g., if their relation is "inside".

Binding as object-based: The proposal that properties are conjoined by virtue of their common location has many problems. In order to assign a location to a property you need to know its boundaries, which requires distinguishing the object that has those properties from its background (figure-ground individuation). Properties are properties of objects, not of locations, which is why properties move when objects move. Empty locations have no causal properties. The alternative to conjoining-by-location is conjoining by object. According to this view, solving the binding problem requires first selecting individual objects and then keeping track of each object's properties (in its object file, or OF). If only properties of selected objects are encoded, and if those properties are recorded in each object's OF, then all conjoined properties will be recorded in the same object file, thus solving the binding problem (a sketch follows below).

A quick tour of some evidence for FINSTs:
● The correspondence problem (mentioned earlier)
● The binding problem
● Evaluating multi-place visual predicates (recognizing multi-element patterns)
● Operating over several visual elements at once without having to search for them first
● Subitizing
● Subset selection
● Multiple-Object Tracking
● Cognizing space without requiring a spatial display in the head

Being able to refer to individual objects or object-parts is essential for recognizing patterns. Encoding relational predicates, e.g., Collinear(x,y,z,…), Inside(x,C), Above(x,y), Square(w,x,y,z), requires simultaneously binding the arguments of n-place predicates to n elements* in the visual scene. Evaluating such visual predicates requires individuating and referring to the objects over which the predicate is evaluated: i.e., the arguments in the predicate must be bound to individual elements in the scene. *Note: "elements" is used to refer to objects that serve as parts of other objects.

Several objects must be picked out at once in making relational judgments. When we judge that certain objects are collinear, we must first pick out the relevant objects while ignoring their properties. The same is true for other relational judgments like inside or on-the-same-contour, etc. We must pick out the relevant individual objects first. Are the dots inside the same contour? Are they on the same contour?
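Picking up the object-based binding proposal above, here is a minimal sketch (Python; the scene, property names, and query function are illustrative only) of how recording properties in per-object files keeps conjunctions straight where a bare list of scene features would not:

```python
# Scene: a red square and a green circle. Two representations compared.

# 1. A feature list without binding: the scene is just "what is present".
features_present = {"red", "green", "square", "circle"}
# A red-CIRCLE + green-SQUARE scene yields the SAME set -- conjunctions lost.

# 2. Object files: properties are recorded per selected object.
object_files = [
    {"color": "red",   "shape": "square"},   # file bound to object 1
    {"color": "green", "shape": "circle"},   # file bound to object 2
]

def scene_contains(color, shape):
    """A conjunction query is now answerable: look within one file."""
    return any(f["color"] == color and f["shape"] == shape
               for f in object_files)

print(scene_contains("red", "square"))  # True
print(scene_contains("red", "circle"))  # False -- the swapped scene is detectable
```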
*Note: Ullman (1984) has shown that some patterns cannot be recognized without doing so in a serial manner, where the serial elements must be indexed first. And that is yet another reason why connectionist architectures cannot work!

A quick tour of some evidence for FINSTs:
• The correspondence problem
• The binding problem
• Evaluating multi-place visual predicates (recognizing multi-element patterns)
• Operating over several visual elements at once without first having to search for them
• Subitizing
• Subset selection
• Multiple-Object Tracking
• Cognizing space without requiring a spatial display in the head

More functions of FINSTs; further experimental explorations:
Recognizing the cardinality of small sets of things: subitizing vs counting (Trick, 1994).
Searching through subsets: selecting items to search through (Burkell, 1997).
Selecting subsets and maintaining the selection during a saccade (Currie, 2002).
Application of FINST index theory to infant cardinality studies (Carey, Spelke, Leslie, Uller, etc.).
Indexes may explain how children are able to acquire words for objects by ostension without suffering Quine's Gavagai problem.

Signature subitizing phenomena only appear when objects are automatically individuated and indexed. [Trick, L. M., & Pylyshyn, Z. W. (1994). Why are small and large numbers enumerated differently? A limited capacity preattentive stage in vision. Psychological Review, 101(1), 80-102.]

Subitizing results: There is evidence that a different mechanism is involved in enumerating small (n < 4) and large (n > 4) numbers of items (even different brain mechanisms: Dehaene & Cohen, 1994). Rapid small-number enumeration (subitizing) only occurs when items are first (automatically) individuated*. Unlike counting, subitizing is not enhanced by precuing location*. Subitizing is insensitive to the distance among items*. Our account of what is special about subitizing is that once FINST indexes are assigned to n < 4 individual objects, the objects can be enumerated without first searching for them. In fact they might be enumerated simply by counting active indexes, which is fast and accurate because it does not require visual scanning. [* Trick, L. M., & Pylyshyn, Z. W. (1994), op. cit.]

Subset selection for search: [Figure: search displays of + items; Target = +; single-feature search vs conjunction-feature search conditions.] [Burkell, J., & Pylyshyn, Z. W. (1997). Searching through subsets: A test of the visual indexing hypothesis. Spatial Vision, 11(2), 225-258.]
Subset search results: Only properties of the subset matter, but note that properties of the entire subset must be taken into account simultaneously (since that is what distinguishes a feature search from a conjunction search). If the subset constitutes a single-feature search, it is fast and the slope (RT vs number of items) is shallow. If the subset constitutes a conjunction search set, it takes longer, is more error prone, and is more sensitive to the set size. As with subitizing, the distance between targets does not matter, so observers don't seem to be scanning the display looking for the target.

The stability of the visual world entails the capacity to track some individuals after a saccade. There is no problem about how the tactile sense can provide a stable world when you move around while keeping your fingers on the same objects, because in that case retaining individual identity is automatic. But with FINSTs the same can be true in vision, at least for a small number of visual objects. This is compatible with the fact that one appears to retain the relative location of only about 4 elements during saccadic eye movements (Irwin, 1996). [Irwin, D. E. (1996). Integrating information across saccadic eye movements. Current Directions in Psychological Science, 5(3), 94-100.]

The selective search experiment, with a saccade induced between the late-onset cues and the start of search: the onset of new objects grabs indexes. [Figure: search displays; a saccade occurs here; Target = +; single-feature vs conjunction-feature search.] Even with a saccade between selection and access, items can be accessed efficiently.

A quick tour of some evidence for FINSTs:
● The correspondence problem (mentioned earlier)
● The binding problem
● Evaluating multi-place visual predicates (recognizing multi-element patterns)
● Operating over several visual elements at once without having to search for them first
● Subitizing
● Subset selection
● Multiple-Object Tracking
● Imagining space without requiring a spatial display in the head

Demonstrating the function of FINSTs with Multiple Object Tracking (MOT): In a typical experiment, 8 simple identical objects are presented on a screen and 4 of them are briefly distinguished in some visual manner, usually by flashing them on and off. After these 4 targets are briefly identified, all objects resume their identical appearance and move randomly. The observers' task is to keep track of the ones that had been designated as targets at the start. After a period of 5-10 seconds the motion stops and observers must indicate, using a mouse, which objects are the targets.

Another example of MOT: with self-occlusion. [Display parameters: 5 × 5; 1.75 × 1.75.] Self-occlusion does not seriously impair tracking.

Some findings with Multiple Object Tracking:
● Basic finding: Most people can track at least 4 targets that move randomly among identical non-target objects (even some 5-year-old children can track 3 objects).
● Object properties do not appear to be recorded during tracking, and tracking is not improved if no two objects have the same color, shape or size (asynchronous vs synchronous changes).
● How is tracking done? We showed that it is unlikely that tracking is done by keeping a record of the targets' locations and updating them by serially visiting the objects (Pylyshyn & Storm, 1988). Other strategies may be employed (e.g., tracking a single deforming pattern), but they do not explain tracking.
Hypothesis: FINST indexes are grabbed by the blinking targets. At the end of the trial these indexes can be used to move attention to the targets and hence to select them in making the response.
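To make the experimental logic concrete, here is a minimal simulation sketch of an MOT trial (Python; the field size, speeds, and frame count are arbitrary illustrative parameters, not those of any actual experiment). The "observer" keeps track of targets by a bare nearest-neighbor rule: no colors, shapes, or labels are available, since all items are identical.

```python
import math, random

# A toy MOT trial: 8 identical dots, 4 targets, random motion.
random.seed(1)
N, N_TARGETS, N_FRAMES, STEP = 8, 4, 200, 0.15

dots = [[random.uniform(0, 10), random.uniform(0, 10)] for _ in range(N)]
true_targets = set(range(N_TARGETS))              # flashed at trial start
tracked = {i: dots[i][:] for i in true_targets}   # index -> last seen position

for _ in range(N_FRAMES):
    for d in dots:                                # everything moves a little
        d[0] += random.uniform(-STEP, STEP)
        d[1] += random.uniform(-STEP, STEP)
    for idx, last in tracked.items():             # each index jumps to the
        nearest = min(range(N), key=lambda j: math.dist(last, dots[j]))
        tracked[idx] = dots[nearest][:]           # nearest current token

# Response phase: which dots do the indexes now sit on?
final = {min(range(N), key=lambda j: math.dist(pos, dots[j]))
         for pos in tracked.values()}
print(final == true_targets)  # tends to be True while spacing >> speed per frame
```

The design point matches the findings below: nothing about the targets is stored except a continually overwritten position, and performance degrades exactly when items get close enough for the nearest-neighbor choice to become ambiguous.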
What role do visual properties play in MOT? Certain properties must be present in order for an index to be grabbed, and certain properties (probably different properties) must be present in order for the index to keep track of the object, but this does not mean that such properties are encoded, stored, or used in tracking.

Is there something special about location? Do we record and track properties-at-locations? Location in time and space may be essential for individuating or clustering objects, but metrical coordinates need not be encoded or made cognitively available. The fact that an object is actually at some location or other does not mean that it is represented as such. Representing property 'P' (where P happens to be at location L) ≠ representing property 'P-at-L'.

A way of viewing what goes on in MOT: An object file may contain information about the object to which it is bound. But according to FINST Theory, keeping track of the object's identity does not require the use of this information. The evidence suggests that in MOT little or nothing is stored in the object file. Occasionally some information may get encoded and entered in the Object File (e.g., when an object appears or disappears), but this is not used in the tracking process itself.* [* We will see later that this has to be stated with care, since location may be stored in the object file and used in a certain sense when the usual continuous tracking does not work.]

Another way of viewing MOT: What makes something the same object over time is that it remains connected to the same object file by the same index. Thus, for something to be the same enduring object, no appeal to properties or concepts is needed. The only requirement is that it be trackable. Another view of tracking is that it is the basis of objecthood: an object is something that can be perceptually tracked (Fodor). There seems to be growing evidence that tracking is a reflex: it proceeds without interference from other attentive tasks.* Franconeri et al.** showed that the apparent sensitivity of tracking performance to such properties as speed is due to a confound of speed with object density. Distance between objects is critical to MOT performance, which is predicted by parallel tracking models. [* Although tracking feels effortful, many secondary tasks (e.g., search) do not interfere with tracking. ** Franconeri, S., Lin, J., Pylyshyn, Z., Fisher, B., & Enns, J. (2008). Evidence against a speed limit in multiple-object tracking. Psychonomic Bulletin & Review, 15(4), 802-808.]

Why is this relevant to foundational questions in the philosophy of mind?
● According to Quine, Strawson, and most philosophers, you cannot pick out or track individuals without concepts (sortals).
● But you also cannot pick out individuals with only concepts. Sooner or later you have to pick out individuals using nonconceptual causal connections between things and thoughts.
● The present proposal is that FINSTs provide the needed nonconceptual mechanism for individuating objects and for tracking their (numerical) identity, which works most of the time in our kind of world. It relies on some natural constraints (Marr).
● FINST indexes provide the right sort of connection to allow the arguments of predicates to be bound to objects prior to the predicates being evaluated. They may also be the basis for learning nouns by ostension.
But there must be some properties that cause indexes to be grabbed! Of course there are properties that are causally responsible for indexes being grabbed, and also properties (probably different ones) that make it possible for objects to be tracked. But these properties need not be represented (encoded) and used in tracking. The distinction between properties that cause indexes to be grabbed and those that are represented (in Object Files) is similar to Kripke's distinction between the properties needed to name an object (by baptism) and those that constitute the name's meaning.

Effect of target properties on MOT: Changes of object properties are not noticed during MOT. Keeping all targets distinct in color, size, or shape does not improve tracking. Observers do not use target speed or direction in tracking (e.g., they do not track by anticipating where the targets will be when they reappear after occlusion). Targets can go behind an opaque screen and come out the other side transformed in color, shape, speed or direction of motion (up to 60° from the pre-occlusion direction) without affecting tracking, but also without observers noticing the change! What affects tracking is the distance travelled while behind the occluding screen: the closer the reappearance to the point of disappearance, the better the tracking, even if the closer location happens to be in the middle of the occluding screen!

Some open questions: We have arrived at the view that only properties of selected (indexed) objects enter into subsequent conceptualization and perception-based thought (i.e., only information in object files is made available to cognition). So what happens to the rest of the visual information? Visual information seems rich and fine-grained, while this theory says that properties of only 4 or 5 objects are encoded! The present view also leaves no room for representations whose content corresponds to the content of conscious experience. According to the present view, the only content that modular nonconceptual representations have is the demonstrative content of indexes that refer to perceptual objects. Question: Why do we need any more than that?

An intriguing possibility: Maybe the theoretically relevant information we take in is less than (or at least different from) what we experience. This possibility has received attention recently with the discovery of various "blindnesses" (e.g., change blindness, inattentional blindness, blindsight…) as well as the discovery of independent vision systems (e.g., for recognition and for motor control). The qualitative content of conscious experience may not play a role in explanations of cognitive processes. Even if detailed quantitative information enters into causal processes (e.g., motor control), it may not be represented, not even as a nonconceptual representation. For something to be a representation, its content must figure in explanations; it must capture generalizations. It must have truth conditions and therefore allow for misrepresentation. It is an empirical question whether current proposals (e.g., the primal sketch, scenarios) meet these conditions. (cf. Devitt: Pylyshyn's Razor)

An alternative view of reference by indexes:
● This provisional revised theory responds to Fodor's argument that there is no seeing without seeing-as.
● According to Fodor, the visual module must do more than the current theory assumes, because its output must provide the basis for induction over what something is seen as.
This is not the traditional argument that percepts have a finer grain than most theories provide for, especially theories, like this one, that assume a symbolic output. That argument relies too much on our phenomenology, which more often than not leads us astray.
● So the vision module must contain more than object files. It must be able to classify objects by their visual properties alone, or to compute for each object a particular appearance class to which it belongs (see the black swan example).

An alternative view of reference by indexes:
● Since the vision module is encapsulated, it must have a mechanism for assigning each object x to an equivalence class based solely on what x looks like. It must do this for a large number of such classes, based both on its innate mechanisms and its visual experience [Look of x = L(x)]. L(x) is thus an equivalence class induced by the sensorium which includes the current token x.
● The L(x) associated with each token x must be sufficiently distinctive to allow the cognitive system to recognize x unambiguously as a token of something it knows about (e.g., L(x) => looks like a cow, & this is a farm => x is likely a cow). The sequence from x to recognition must be correct most of the time in our kind of world (so it must embody a natural constraint).

● This idea of an appearance class L(x) has been explored in computational vision, where a number of different functions have been proposed, many of them based on mathematical compression or encoding functions. An early idea which has implications for the present discussion is a proposal by David Marr called the Multiple-View proposal. He wrote: "The Multiple View representation is based on the insight that if one chooses one's primitives correctly, the number of qualitatively different views of an object may be quite small", and Marr cites Minsky as speculating that the representation of a 3D shape might consist of a catalog of different appearances of that shape, and that the catalog may not need to be very large (Marr & Nishihara, 1976). The search for the most general form of representation has yielded many proposals, many of which have been tested in psychology labs, e.g., generalized cylinders and part-decomposition: Biederman, I. (1987). Recognition-by-components: A theory of human image understanding. Psychological Review, 94, 115-148.

Seeing without seeing as?
● It's true that instances of visual encounters deliver an equivalence class to which the object belongs by virtue of its appearance, as mapped by the function L(x). It is an appearance class because it can only use information from the sensorium and the "natural constraints" built into the modular vision system. So in that respect one might say that seeing is always a seeing-as, where the relevant category is L(x).
● But this is unlikely to be the category under which the object enters into thought. So the kind of seeing-as category L(x) is not the same category as the one under which the object is contemplated in thought, where its category would depend on background knowledge and personal history. The appearance L(x) is now replaced by familiar categories of thought (e.g., card table, Ford car, Coca-Cola bottle, Warhol Brillo Box, and so on, categories rich in their interconnections).
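One way to picture the job L(x) is being asked to do is as a pure appearance classifier sitting inside the module. Here is a minimal sketch (Python; the feature sets, class names, and the farm/cow rule are borrowed from the slide's own example but the code is entirely illustrative) in which L maps sensory feature bundles to appearance classes, and categorization in thought then combines L(x) with background knowledge that the module itself never sees:

```python
# Inside the module: L maps what a token LOOKS like to an appearance class,
# using only sensory features -- no general knowledge is allowed in here.
def L(token):
    looks = token["features"]
    if {"large", "four-legged", "black-and-white"} <= looks:
        return "cow-like-appearance"
    if {"small", "winged"} <= looks:
        return "bird-like-appearance"
    return "unclassified-appearance"

# Outside the module: thought combines L(x) with background beliefs to fix
# the category under which the object is contemplated.
def categorize(token, context_beliefs):
    if L(token) == "cow-like-appearance" and "this is a farm" in context_beliefs:
        return "cow"
    return "unknown"

x = {"features": {"large", "four-legged", "black-and-white"}}
print(L(x))                               # 'cow-like-appearance'
print(categorize(x, {"this is a farm"}))  # 'cow'
```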
More on the structure of the Visual Module: In order to compute L(x), the vision module must possess enough machinery to map a token object x onto an equivalence class designated by L(x), using only sensory information and module-specific processes and representations, without appealing to general knowledge. The module must also have some 4-5 Object Files, because it needs those to solve the binding problem as well as to bind predicate arguments to objects (and also to use the proposed recognition-by-parts process for recognizing complex objects).

An alternative view of what's in the module: The alternative view of what goes on inside the visual module would furnish it with more processes, to catalog and look up object shape-types L(x). Our assumptions would seem to require that this augmented machinery also be barred from accessing cognitive memories and general inference capacity. Does this conflict with Fodor's requirement that the output be right for belief fixation?

Which functions are in the visual module? [Diagram: a modular vision computer whose input is sensory information and whose output is a standard form for the appearance of objects, L(x). Three options: Minimal (just indexes); Original (indexes and object files); Maximal (also computing L(x)).]

Summary of the current FINST model: Up to 5 indexes can be grabbed, based on local properties. Active indexes bind objects to object files (initially empty). Bound objects can then be queried* and salient properties encoded in their Object File. [* Does this require voluntary attention?] Indexes stay bound to the objects that grabbed them even as the objects change any of their properties, including briefly disappearing behind an occluding screen. When the objects change their location, the result is tracking, which is automatic/reflexive. We also have evidence that objects can be tracked through other continuously changing properties (Blaser, Pylyshyn & Holcombe, 2000). The only factor that impairs tracking performance is spacing: too close yields item-ambiguity and tracking errors.

Tracking and spatial proximity: Many experiments show that the only factor that affects tracking performance is inter-item spacing: when items are too close there is item-ambiguity, resulting in tracking errors. Other factors that allegedly impair tracking (e.g., speed) do so only because they affect average spacing. The very process of tracking, which requires something like smooth continuous movement, makes use of proximity. So does the process of Gestalt individuation, which must collect nearby pixels and features (regardless of type). We have many results showing that when objects disappear, their only recalled property is where they were at the time, and the only thing that determines how well they continue to be tracked when they reappear is how far away they have moved. [Franconeri, S., Pylyshyn, Z. W., & Scholl, B. J. (2012). A simple proximity heuristic allows tracking of multiple objects through occlusion. Attention, Perception, & Psychophysics, 72(4).]

How is location stored and used? It is possible that location is stored in object files, since it is one of the more important properties of moving objects. Object location is a property that must be used in tracking, since to track smoothly moving objects just is to solve the correspondence problem by taking the nearest object. Many experiments show that the correspondence problem in this case does not involve choosing the most similar object, or the one moving with the same speed or in the same direction, but the closest one to the locus of disappearance.
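A minimal sketch of this nearest-to-the-locus rule (Python; the positions, colors, and directions are invented for illustration). The point is that re-acquisition after occlusion goes by proximity to the disappearance point, not by property match:

```python
import math

# Correspondence after occlusion: the tracker re-acquires the object nearest
# to where its target disappeared, ignoring appearance, speed, and direction.

disappeared_at = (5.0, 5.0)

candidates = [   # objects emerging from behind the occluder
    {"pos": (5.5, 5.2), "color": "green", "direction": 60},  # near, dissimilar
    {"pos": (9.0, 5.0), "color": "red",   "direction":  0},  # far, similar
]

reacquired = min(candidates, key=lambda c: math.dist(disappeared_at, c["pos"]))
print(reacquired["color"])   # 'green': proximity wins over property match
```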
Does this mean that object location is stored and used in tracking, contrary to my earlier claim? Maybe, but that depends on whether location is in this case a conceptual property, and on whether tracking is a process involving conceptual representations; and there is evidence that it is not.

Is location a conceptual property? Is location in this case a conceptual property, and is tracking a process involving conceptual representations? Computing correspondence and tracking are prototypical automatic and cognitively impenetrable processes, likely computed by local parallel processes, which suggests that tracking is subpersonal, modular and nonconceptual, since most automatic processes are nonconceptual. Location plays a critical part in all motor control, and there is reason to believe that it plays this role in a different way than conceptual information does. It typically involves a different visual system, the dorsal pathway. A great deal of evidence is now available showing that only the ventral pathway contributes to object recognition, while the dorsal pathway is specialized for motor control (Milner & Goodale, 1995; 2004). All in all, it seems more likely that location is used in MOT and other visual processes, but that its use is not a conceptual process at all. If you accept that location is conceptual, you pay a high price: you lose the goal of finding a nonconceptual link between cognition and the world!

Summary of the augmented FINST model: So far, the only visual information that is available to the mind is contained in the Object Files in the visual module. The index mechanism discussed so far also makes it possible to use additional currently perceived information (see Things & Places, Chapter 5). Information in the module is in a symbolic form very similar to the subsequent conceptual representation, except: It is encoded in the vocabulary of modular (subpersonal) categories (that many would call nonconceptual), not in person-level conceptual vocabulary. Construction of the intramodular representation cannot use general knowledge, so all relevant representations must reside in the module. The intramodular representation uses information in Object Files and preserves its bindings. The Object Files are the only mechanisms for dealing with the general binding problem, as well as the problem of binding predicate arguments to objects in the world.

Open questions about the augmented FINST model: The modular processes must somehow recover the relations between objects, and these may or may not be encoded in OFs. Since information in the module may serve a number of subsequent functions, including visual-motor coordination and multimodal perceptual integration, it will have to represent metrical information, very likely in a nonconceptual form. The question of representing metrical information is one we leave for the future, since little is known about how analogue representation might function in cognition. We now arrive at a central question of considerable importance to the view we are promoting: What form is the visual representation in when it is handed on to cognition?
END

For a copy of these slides see: http://ruccs.rutgers.edu/faculty/pylyshyn/SelectionReference.ppt
Or see the MIT Press paperback (Things and Places).

[Cartoon: "You are now here (X)… but you are also here."]

Additional examples of MOT:
MOT with occlusion
MOT with virtual occluders
MOT with matched non-occluding disappearance
Track endpoints of lines
Track rubber-band linked boxes
Track and remember ID by location
Track and remember ID by name (number)
Track while everything briefly disappears (½ sec) and goes on moving while invisible
Track while everything briefly disappears and reappears where they were when they disappeared