These notes are based on the following sources:
Shimon Edelman and Nathan Intrator: Visual Processing of Object Structure, preliminary draft of an article to appear in The Handbook of Brain Theory and Neural Networks (2nd ed.), M. A. Arbib, ed., MIT Press, 2002.
Guy Wallis and Heinrich Bülthoff: Object recognition, neurophysiology, preliminary draft of an article to appear in The Handbook of Brain Theory and Neural Networks (2nd ed.), M. A. Arbib, ed., MIT Press, 2002.
Simon Thorpe and Michèle Fabre-Thorpe: Fast Visual Processing and its implications, preliminary draft of an article to appear in The Handbook of Brain Theory and Neural Networks (2nd ed.), M. A. Arbib, ed., MIT Press, 2002.

Introduction

Everyday experience tells us that our visual systems are very fast. In the 1970s, experiments using Rapid Serial Visual Presentation (RSVP) techniques showed that humans are remarkably good at following sequences of unrelated images presented at rates of up to 10 frames a second (Intraub 1999, Potter 1999), an ability frequently exploited by producers of video clips. But the fact that we can handle a new image every 100 ms or so does not necessarily mean that visual processing can be completed in this time. As computer chip designers know, processing rates can be improved by using pipelining, in which several computational steps operate simultaneously, one after the other. So, how can we determine the time it takes for the visual system to process a scene? And how can we use this information to help constrain our models of how the brain computes? These are some of the issues that we will address in this chapter. Interestingly, temporal constraints were one of the prime motivations for the development of connectionist and PDP modeling in the early 1980s. Around this time, Jerry Feldman proposed the so-called 100-step limit. He argued that since many high-level cognitive tasks can be performed in about half a second, and since the interspike interval for cortical neurons is seldom shorter than 5 ms, the underlying algorithms should involve no more than about 100 sequential, though massively parallel, steps. Note, however, that the values used by Feldman were only rough estimates unrelated to any particular processing task. In this chapter, we will review more specific experimental data on processing speed before looking at how these temporal constraints can be used to refine models of neural computation.

Behavioral measures of processing speed

The ultimate test for processing speed lies in behavior. If animals can reliably make appropriate behavioral responses to a given category of visual stimulus with a particular reaction time, there can be no argument about whether the processing has been done. Thus, if a fly can react to displacements of the visual world by a change in wing torque 30 ms later, it is clear that 30 ms is enough for both visual processing and motor execution. Fast behavioral reactions are not limited to insects. For example, tracking eye movements are initiated within 70-80 ms in humans and in around 50 ms in monkeys (Kawano 1999), and the vergence eye movements required to keep objects within the fixation plane have latencies of around 85 ms in humans and under 60 ms in monkeys (Miles 1997). Such low values probably reflect the relatively simple visual processing needed to detect stimulus movement and the short path lengths seen in the oculomotor system. How fast could behavioral responses be in tasks that require more sophisticated visual processing?
Figure 1: A. Reaction time distributions for targets and distractors in a go/no-go scene categorization task. Statistically significant differences between the responses to targets and distractors start at the minimum response time of approximately 250 ms. B. Differential ERP responses to targets and non-targets in the same task (animal minus non-animal difference in µV; mean of 15 subjects). The ERP difference starts at about 150 ms (Thorpe et al 1996).

In 1996, we reported results using a task that is a major challenge to the visual system (Thorpe et al 1996). Subjects were presented with color photographs flashed for only 20 ms, and asked to respond as quickly as possible if the image contained an animal. The images were extremely varied, with targets that included mammals, birds, fish and insects in their natural environments, and the distractors were also very diverse. Furthermore, no image was shown more than once, forcing subjects to process each image from scratch with minimal contextual help. Despite all these constraints, accuracy was high (around 94%) with mean reaction times (RTs) typically around 400 ms. While mean RT might be the obvious candidate for measuring processing speed, another useful value is the minimal time needed to complete the task. Figure 1A plots separately the RT distributions for correct responses to targets and for incorrect responses to distractors in the animal/non-animal task. Since targets and distractors were equally probable, the first time bin at which correct responses start to significantly outnumber incorrect ones defines the minimal response time. Responses at earlier times with no bias towards targets are presumably anticipations triggered before stimulus categorization was completed. Remarkably, in the animal/non-animal categorization task, these minimal response times can be under 250 ms. It might be thought that the images that trigger particularly short reaction times constitute a sub-population of particularly easy images. However, we found no obvious features that characterized rapidly categorized images (Fabre-Thorpe et al 2001). In other words, even with highly varied and unpredictable images, the human visual system is capable of completing the processing sequence that stretches from the activation of the retinal photoreceptors to moving the hand in under 250 ms. Humans can perform this challenging visual task quickly, but intriguingly, rhesus monkeys are even faster. In monkeys, minimal RTs to previously unseen animal targets are as low as 170-180 ms (Fabre-Thorpe et al 1998). As in the tracking and vergence eye movement studies mentioned earlier, it appears that humans take nearly 50% longer than their monkey cousins to perform a given task. Such data clearly impose an upper limit on the time needed for visual processing. However, they do not directly reveal how long visual processing takes, because the times obviously also include response execution. How much time should we allow for the motor part of the task? Although behavioral methods alone are unable to answer such questions, electrophysiological data from single unit recording and ERP or MEG studies can be used to track information processing between stimulus and response.
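To make the minimal-response-time measure just described concrete, here is a small sketch of the bin-by-bin comparison of target and distractor responses. It is an illustration only: the 10 ms bins, the 1 s analysis window, the significance level and the simple binomial test are assumptions, not the analysis used in the published studies.

```python
import numpy as np
from math import comb

def minimal_response_time(target_rts, distractor_rts, bin_ms=10, alpha=0.05):
    """Estimate the minimal response time as the first bin in which correct
    'go' responses to targets significantly outnumber (incorrect) responses
    to distractors, given that targets and distractors are equally probable."""
    edges = np.arange(0, 1000 + bin_ms, bin_ms)        # 0-1000 ms analysis window
    t_counts, _ = np.histogram(target_rts, bins=edges)
    d_counts, _ = np.histogram(distractor_rts, bins=edges)
    for lo, k_t, k_d in zip(edges[:-1], t_counts, d_counts):
        n = k_t + k_d
        if n == 0:
            continue
        # one-sided binomial test: P(X >= k_t) under chance (p = 0.5)
        p = sum(comb(n, k) for k in range(k_t, n + 1)) / 2 ** n
        if k_t > k_d and p < alpha:
            return lo          # lower edge of the earliest bin with a reliable target bias
    return None
```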
Single cell recordings and processing speed

Single unit activity is perhaps the easiest method to use, since individual spikes are rather like behavioral responses and the same technique of searching for the minimal latency at which differential responses occur can be applied. If a neuron in monkey inferotemporal cortex responds selectively (i.e. differentially) to faces at a latency of 80-100 ms post-stimulus, then it follows that at least some forms of face processing can be completed by this time. By examining the sorts of information that can be derived from differential spiking activity at different times and in different visual structures, one can follow how processing develops over time. Surprisingly, the use of response latency to track the time course of visual processing is a relatively recent technique in experimental neuroscience. Nevertheless, by 1989 it was clear that the onset latencies of selective visual responses in brain structures along the visual pathway were a major constraint on models (Thorpe & Imbert 1989). Face-selective neurons had been described in monkey inferotemporal cortex with typical onset latencies around 100 ms and, beyond the visual system as such, it was known that neurons in the lateral hypothalamus could respond selectively to food with a latency of 150 ms. Although these earlier studies suggested that visual processing could be very fast, they did not specifically determine at which point the neuronal response was fully selective. This issue was addressed in 1992, when it was shown that even the first 5 ms of the response of neurons in monkey inferotemporal cortex could be highly selective to faces (Oram & Perrett 1992). Thus, by determining the earliest point at which a particular form of stimulus specificity can be seen in the visual pathway, it is possible to place firm limits on the processing time required to reach a certain level of analysis. Before leaving our discussion of single unit responses, we should mention another approach to measuring processing speed, directly inspired by the behavioral RSVP studies mentioned earlier. Keysers et al recently looked at how face-selective neurons in the monkey temporal lobe respond to sequences of images presented at high rates. By varying the effective frame rate at which the images were presented, they found that although the strength of the response decreased as the frame rate was increased, the neurons were still clearly being driven in a stimulus-specific way when the image was changed every 14 ms, i.e. at a frame rate of 72 Hz (Keysers et al 2001). This very impressive ability to follow rapidly changing inputs is one of the hallmarks of feed-forward pipeline processing, a point we will return to later.

ERP or MEG data and processing speed

Event-Related Potentials and Magnetoencephalography can also be very informative, although it is less easy to be sure about the precise start of the neuronal response than with single unit data. Furthermore, signals recorded from a particular site on the scalp can be influenced by activity from a very large number of neurons, making it difficult to localize their source with precision. However, by looking for the earliest times at which the response varies in a systematic way according to a given characteristic of the input, we can determine the minimal time it takes to process it.
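Whether the data are single-unit firing-rate histograms or averaged ERP traces, the underlying analysis is the same: find the earliest time at which the responses to two conditions differ reliably. The sketch below illustrates the idea; the per-timepoint t-test, the threshold and the requirement of a sustained run of significant samples are illustrative choices rather than the procedures used in the studies cited here.

```python
import numpy as np
from scipy.stats import ttest_ind

def differential_onset(cond_a, cond_b, times, alpha=0.01, min_consecutive=15):
    """Return the earliest time at which two sets of response traces
    (trials x timepoints; e.g. target vs distractor ERPs, or responses to
    faces vs non-faces) differ reliably and stay different for at least
    `min_consecutive` samples, to guard against isolated chance crossings."""
    _, p = ttest_ind(cond_a, cond_b, axis=0)      # one p-value per timepoint
    significant = p < alpha
    run = 0
    for i, sig in enumerate(significant):
        run = run + 1 if sig else 0
        if run >= min_consecutive:
            return times[i - min_consecutive + 1]  # start of the sustained run
    return None
```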
For example, in subjects performing the animal/non-animal categorization task described earlier, simultaneously recorded ERPs showed that if one averages together the responses for all correct target trials and compares the traces with the average response to all correctly categorized distractors, the two curves coincide almost perfectly until about 150 ms post-stimulus, at which point they diverge dramatically - see figure 1B (Thorpe et al 1996). This differential ERP response, which appears to be specifically related to target detection, is remarkably robust and occurs well in advance of even the fastest behavioral responses. A value of 150 ms for this initial rapid form of visual processing leaves no more than 100 ms for motor execution when behavioral reactions occur at around 250 ms. Some more recent studies have reported differential category-specific activation at even earlier latencies. For example, differential activity specific to gender has been reported to start as early as 45-85 ms post-stimulus (Mouchetant-Rostaing et al 2000). However, it might be that such early differential activity should be interpreted more in terms of low-level statistical differences between different categories of stimuli, rather than as marking the decision that a particular category is present. This point is made clear in a study that used two different categorization tasks with the same sets of images. The images were either animals, means of transport, or other varied distractor images, but the target category varied from block to block. By averaging ERP activity appropriately, it was possible to demonstrate that the early ERP differences (between 75 and 120 ms) could be explained by systematic differences between the two types of image themselves, irrespective of which category was the target. In contrast, the differential activity starting at around 150 ms was clearly related to the processing of the image as a target and not to its physical characteristics (VanRullen & Thorpe 2001). This rapid, and very incomplete, review has hopefully shown how behavioral and electrophysiological data can be used to define temporal constraints that can be applied to particular sensory processing tasks. In the remainder of this chapter we will discuss how such data can be used to constrain the underlying processing algorithms.

Implications for computational models

The ability to determine the minimal time required to perform a particular task or computation is not, by itself, enough to constrain models of the underlying mechanisms. For example, we know that neurons in primate inferotemporal cortex can respond selectively to faces with latencies of 80-100 ms. But the computational implications of this fact only become clear when one takes into account the number of processing steps involved and some details of the underlying physiology. As pointed out in the late 1980s (Thorpe & Imbert 1989), information reaching Anterior Inferior Temporal cortex (AIT) in 80-100 ms presumably has to go through the retina and the lateral geniculate nucleus as well as cortical areas V1, V2 and V4 and the Posterior Inferior Temporal cortex (PIT). While only one synaptic relay is required to pass through the geniculate, it is unlikely that afferents reaching cortical areas will make significant direct connections onto output neurons, meaning that at least two synaptic relays are involved at each cortical stage.
This means that the minimal path length from retina to AIT probably involves at least 10 successive steps, implying that at each stage processing must be done within about 10 ms. Given that firing rates of cortical neurons only rarely exceed 100 spikes/s, very few neurons will have time to fire more than one spike in this 10 ms processing window. Such constraints severely limit the possibilities for using iterative processing loops, and they also call into question the feasibility of using conventional firing-rate based coding strategies. While one can always raise doubts concerning the functional significance of face-selective neuronal responses at 100 ms, there can be little ambiguity when one takes into account the fact that monkeys can produce reliable manual responses from as early as 170 ms post-stimulus onset in a challenging high-level visual categorization task. If we suppose that high-order visual areas such as AIT are indeed involved in such tasks, we need to propose a route by which activation could pass from the temporal lobe to the hand. This is not a trivial problem, since the temporal lobe does not have direct connections to motor outputs. Figure 2 shows one possible route, via prefrontal and premotor cortex. It also stresses the point that, at least in the case of the earliest behavioral responses, the time available for processing at each stage is so limited that there is very little time for anything other than a feed-forward pass.

Figure 2: A possible input-output pathway for performing go/no-go visual categorization tasks in the monkey. Information passes from the retina to the lateral geniculate nucleus (LGN) before arriving in cortical area V1. Further processing occurs in areas V2 and V4 and the posterior and anterior inferotemporal cortex (PIT and AIT) before being relayed to the prefrontal cortex (PFC), premotor (PMC) and motor cortices (MC). Finally, activation of motoneurons in the spinal cord triggers movement of the hand. For each area, the two numbers provide approximate values for (i) the latency of the earliest responses, and (ii) a more typical average response latency.

Some of the other data mentioned earlier also point towards the notion of rapid visual processing being largely feed-forward. We noted that following three weeks of training, there was no evidence that very familiar images could be processed faster than previously unseen ones. Such a "floor effect" for visual processing speed is one of the hallmarks of feed-forward processing. Other arguments in favor of feed-forward processing come from studies on the ability of neurons in IT to follow very rapidly changing inputs. While it would not be that surprising to learn that neurons close to the sensory input can follow rapid input changes, the fact that neurons so far into the brain can still modulate their responses to inputs that last only 14 ms is strong evidence that their selectivity does not depend on lengthy recurrent processing at previous stages. Finally, all the studies that show the full selectivity of the very initial part of the neuronal response should be considered as arguing in favor of feed-forward mechanisms.

Distinguishing feed-forward and recurrent processing

We have seen that behavioral and electrophysiological data provide strong evidence in favor of the idea that at least some forms of visual processing can be achieved on the basis of a feed-forward pass.
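To put rough numbers on the timing argument developed above, the sketch below divides the roughly 100 ms needed to reach AIT across the roughly 10 synaptic steps listed in the text; the even split across steps is a simplifying assumption.

```python
# Rough latency budget for the retina-to-AIT pathway discussed above.
# The ~100 ms arrival time in AIT and the ~10 synaptic steps come from the
# text; splitting the time evenly across steps is a simplifying assumption.
n_steps = 10               # retina, LGN, and two relays in each of V1, V2, V4, PIT
latency_to_ait_ms = 100    # typical onset of face-selective responses in AIT
max_rate_hz = 100          # cortical firing rates rarely exceed 100 spikes/s

per_step_ms = latency_to_ait_ms / n_steps
spikes_available = max_rate_hz * per_step_ms / 1000.0
print(f"~{per_step_ms:.0f} ms per step, at most ~{spikes_available:.0f} spike per neuron per step")
# With roughly one spike per neuron per stage there is little scope for
# iterative loops, or for estimating a firing *rate* within a single stage.
```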
Indeed, the behavioral data on Ultra-Rapid Visual Categorization imply that even tasks involving high-level superordinate categorization of complex natural scenes can be realized in this way. Nevertheless, it needs to be stressed that this should not be taken to imply that vision is "done" in 150 ms. Many important aspects of vision must rely on extensive recurrent processing. So how can temporal analysis be used to determine which aspects of vision can be done using a feed-forward pass, and which require more time-consuming iterative mechanisms? Let us return for a moment to the question of the dynamics of single cell selectivity, raised earlier. We argued that many forms of selectivity are present right from the very beginning of the neuronal response (for example, orientation selectivity, and selectivity to faces). However, this does not mean that neuronal selectivity is fixed. For example, the orientation tuning of neurons in V1 can fluctuate considerably during the course of the response (Ringach et al 1997), a result consistent with the idea that neuronal properties are dynamic and constantly under the influence of recurrent connections. However, even more interesting are reports that certain neuronal properties appear only later during the time course of the response. For example, a number of recent studies have looked at the time course of processing related to perceptual processes such as texture segmentation and filling-in at the level of primary visual cortex and have reported that such phenomena can often take several tens of milliseconds to develop (Lamme et al 1998). Another example is inferotemporal cortex, where it has been reported that while certain attributes of the face were encoded at the very start of the response (around 100 ms post-stimulus), other attributes such as face identity or expression were only encoded from about 150 ms post-stimulus (Sugase et al 1999). Detailed temporal analysis of the time course of neuronal selectivity can thus allow processing delays to be attached to the processing of particular visual characteristics. It looks likely that while some aspects of the visual input are available from the very beginning of the neural response, suggesting that they can be computed by mainly feed-forward processing, other aspects take time to analyse and presumably involve recurrent computation.

Final Comments

Artificial Neural Networks are typically divided into two main types: feed-forward networks (which include Multi-layer Perceptrons and most Back-propagation networks), and recurrent networks in which processing loops can occur. In biological systems, it is relatively rare to find sensory pathways that are anatomically purely feed-forward. For example, in the primate visual system, only the retina receives no feedback connections from later stages. All the other levels (LGN, V1, V2, V4 etc) have extensive feedback connections, and it is widely believed that all visual processing is a complex interaction of bottom-up propagation and top-down interactive feedback. However, even in a sensory pathway in which top-down connections greatly outnumber bottom-up ones, a distinction between feed-forward and recurrent processing can still be made if one examines the temporal dynamics of the response. In this chapter we have discussed both behavioral and electrophysiological data indicating that at least some forms of sophisticated visual processing can be achieved on the basis of the initial feed-forward pass.
But of course this should not be considered the end of visual processing. Indeed, the results of this rapid first pass can be used to improve the efficiency of demanding processes such as image segmentation, an extremely difficult task using purely image-based techniques. Suppose, for example, that the initial feed-forward pass was able to locate the presence of an eye at a particular point in the retinal image. This information could be sufficient to trigger a behavioral response in a task for which an eye is a diagnostic feature. But the same information could be extremely useful in guiding segmentation processes occurring at lower levels, since knowing that an eye is present implies that a face is probably also present. Note how the present formulation contrasts with a more classic hierarchical processing model in which visual processing is divided into several steps and where each step needs to be completed before the next step can be initiated. For example, in many current image-processing models scene segmentation has to occur before the mechanisms of object recognition can start. The alternative proposed here is that despite the multi-layered nature of the visual system, a very fast feed-forward pass can act as a seeding process that allows subsequent processing to be performed in an intelligent top-down way. Finally, it should be stressed that the existence of a very fast feed-forward mode of processing has major consequences for our understanding of brain computation. Although space does not allow us to discuss this point in detail, the fact that computation can be done under conditions where most neurons only get to fire one spike causes serious problems for the conventional view that most if not all information is transmitted in the form of a rate code. In particular, it forces us to look for alternative coding schemes that are compatible with a single-spike processing mode.

Figure WB1: Principal divisions of neocortex, including the main areas of the temporal lobe. Light arrows indicate information flow up along the dorsal stream. Dark arrows indicate flow along the ventral stream.

The functional divisions of neocortex

From the occipital lobe, information flows down into the temporal lobe, forming the lower (ventral) stream, and up into the parietal lobe, forming the upper (dorsal) stream (young92b - see figure WB1). The classical symptoms of patients with parietal lobe lesions are a good ability to recognize and name objects, but a poor ability to integrate them into a scene. Patients often suffer from neglect in specific areas of the visual field, being unaware of certain elements of the scene before them. In extreme cases an entire hemifield can be largely ignored, leading to curious phenomena such as a failure to eat food from one half of a plate. Visuo-spatial neglect and scene understanding problems are attributed to an inability to motivate shifts in attention and direction of gaze throughout a scene, leading to speculation that the parietal lobe is involved in guiding attention and eye movements (farah90). Damage to the temporal lobe, on the other hand, is associated with specific types of recognition agnosias, including problems in naming and categorizing objects such as people's faces (farah90). The dorsal stream has been likened to the task of deciding "where" an object is, and the ventral stream "what" an object is.
This distinction has been borne out by many more recent studies which have looked at the selectivity of individual neurons in cortex (ungerleider94). The "what" stream is seen as the center of object recognition, but an integrated model of scene perception will almost certainly require a wider-reaching approach. If we look at an airplane it may appear on one level to be a single entity or object, but if our task is to analyze details of the plane it becomes a scene containing a fuselage, wings, a tail-plane, engines and wheels. It may therefore be inappropriate to regard scene and object perception as being subserved by totally separate mechanisms of analysis, and an understanding of how we represent objects may in some part guide models of scene representation. At an abstract level, scenes can be analyzed globally and extremely rapidly, providing observers with (what has been referred to in the literature as) the "gist" of what they are looking at (rensink00b). The object recognition system may well perform analysis of a scene at this level. Indeed, the rapid processing of scenes at a global/abstract level has been shown to influence the speed and accuracy with which objects within the scene are recognized. But how, then, are scenes represented at the level of segregated objects, including relative location information? Recording and modeling results all point to this being achieved through the response properties of parietal lobe neurons, and hence a full understanding of how scenes are represented must include information stored in the parietal lobe. Indeed, since gist can affect the basic processes going on in scene analysis, such as target selection, information about the gist of the scene must be relayed to the parietal lobe from the temporal lobe. There are plenty of routes which gist information could take between the temporal and parietal lobes - including directly, via the occipital lobe, or via the frontal lobe. Later stages of IT (AIT/CIT) connect to the frontal lobe, whereas earlier ones (CIT/PIT) connect to the parietal lobe. This functional distinction may well be important in forming a complete picture of inter-lobe interaction. Whilst many functions have been ascribed to the parietal and temporal lobes, relatively little is known about the function of the frontal lobe. What we do know is that it acts [MAA: in part] as a temporary or working memory store (desimone95). It may well turn out that the frontal lobe acts as a running store of objects currently being represented within a scene (rensink00b, logothetis98), providing the final link in the chain.

The ventral stream

The path from primary visual cortex to the temporal lobe is a long one, passing through as many as ten neural areas before reaching the last wholly visual areas just beyond CIT. Neurons' receptive fields grow larger the further along the processing chain one looks. Neurons in the occipital lobe might have receptive fields of up to 4° in V4, but the receptive fields of neurons in CIT can be as large as 150°. Recordings from the temporal lobe showed neurons selective for hands and faces. Cells selective for faces could not be excited by simple visual stimuli or by complex non-face objects, and indeed did not respond to every face tested. Many neurons in the temporal lobe are tolerant to shifts in stimulus position, changes in viewing angle, image contrast, size/depth, illumination or the spatial frequencies present in the image (desimone91, logothetis96).
[MAA: Some people (e.g., Biederman) would argue that the recognition of faces (or other familiar object families, such as motor cars, for which we recognize parametric variation) involves very different mechanisms from recognition of objects for which 3D relations are crucial (as in Biederman's geons, but other models are possible). You need to be more analytic about the general challenge of object recognition, and the way in which face recognition is paradigmatic only for some aspects of object recognition considered more generally.]

Work on how the cellular response properties of temporal lobe neurons change over time has mainly concerned slow, long-term learning effects in which continuous exposure to an object class resulted in changes in the number of neurons selective for that stimulus (kobatake98, miyashita93b). However, some studies have focused on almost instantaneous changes which reflect the speed of behavioral changes measured in human responses. In one study, two-tone images of strongly lit faces were shown to monkeys. Some temporal lobe neurons which did not respond to any of the two-tone faces did so once they had been exposed to the standard picture of the face. This accords with findings in humans, who often struggle to interpret two-tone images on first viewing, but after seeing the standard picture have no difficulty interpreting the two-tone image, even weeks later. [MAA: How might this be modeled? I think it's a strong argument for a top-down influence that goes beyond the sort of feedforward model you insist on later.]

A processing hierarchy

Figure WB2: Schematic of convergence in the ventral processing stream. The steady growth in receptive field size suggests that neurons in one layer of the hierarchy receive input from a select group of neurons in the preceding layer.

The time taken for a new visual stimulus to affect neuronal responses increases systematically through the hierarchy, supporting the notion of a strictly layer-by-layer structure. Neurons in the latter regions of the temporal lobe can be thought of as sitting at the top of a processing pyramid - see figure WB2. Receptive field size grows steadily larger the further up this pyramid one looks, and this can be seen as a direct consequence of the convergent, hierarchical nature of the processing stream. Moreover, delays grow in a consistent manner the further along the processing chain one looks. However, there are as many connections running back as there are forward in the ventral stream, and this is important when one comes to devise models. Some theorists have argued that they are used in recall, and it is true that the act of remembering visual events causes activity to spread into primary visual areas. Alternatively they may control visual attention. The very act of attending to specific regions of our visual environment has been shown to facilitate the processing of signals in that region, which may well be due to selectively raising the activity of neurons along the processing hierarchy which correspond to that visual region. Still other theorists have proposed an integral role for backward connections in visual processing. Whilst such models may be required to deal with confusing or low-quality images, there is good evidence that timing constraints prohibit such a model from acting during normal recognition, due to the sheer number of neural delays which have to be traversed before information can travel from the eye to the temporal lobe.
For that reason, several theorists have proposed a feedforward architecture (wallis97a, wallis99c). One of the possible roles of the processing hierarchy is to construct ever more complex combinations of features, as a natural continuation of the simple and complex cells of primary visual cortex. In a simplistic sense one can imagine a hierarchy in which edges are combined into junctions, junctions into closed contours, contours into volumes, volumes into shapes, and shapes into objects. [MAA: Thorpe and Fabre-Thorpe have an article on Fast Visual Processing. I think feedforward models are fine for recognizing variations of iconic scenes ("Oh, there's the Eiffel Tower") but not for more complex scenes which combine contextual cues with the unexpected.] Wallis argues that the visual system extracts invariances from the environment so as to build representations which facilitate data abstraction and extrapolation, and which also facilitate the building of compact, detailed memories. [MAA: (a) Recognition of 3D spatial relationships must surely be crucial to object recognition generally, even if not for face recognition. (b) How could there not be data abstraction? But what are useful hypotheses on what is being abstracted? What are the best models of the process? How well do they relate to neurophysiological data (e.g., that of Tanaka)? And what forms of extrapolation do they support, etc.?] The highest level of abstraction appears to take place in the superior temporal areas where, it has been suggested, view-invariant cells in STPa pool the outputs of view-selective AIT cells. The only problem with such an approach is how disparate views can be associated by a neuron seeking to build an invariant representation of an object feature.

Encoding objects in the temporal lobe

Hypothesis: Object encoding is achieved via a distributed scheme in which many hundreds or thousands of neurons - each selective for its specific feature - act together to represent an object. Although many of these features represent only small regions of an object, others appear to represent an object's outline, or some other global but general property. In addition, the neural representation of these features may exhibit invariance to scale and size, something typical of temporal lobe neurons (hinton86b). [MAA: How could one explain the recognition of a human figure in a complex scene, despite huge variations in size, pose, clothing, activity, etc., etc.]

Network models of object recognition

The "feature binding problem": imagine a cell trained to respond to the appearance of a set of features in, say, a translation-invariant way. What is to stop this neuron from responding to novel rearrangements of these features? For example, a simple feature-based recognition scheme would treat a jumble of two eyes, a mouth and a nose as the same as a normal face. Real neurons responsive to faces are generally not impressed by such stimuli. Successfully determining the features present in an invariant manner, whilst still retaining spatial configuration, leads to an instance of the binding problem. Models which throw away spatial information so as to achieve translation invariance will run into the problem of "recognizing" rearrangements of the features triggering recognition. Our current opinion on this is that binding does not become an issue if the features supporting recognition are built up gradually in spatially localized regions, as they are in the successive stages of the ventral stream.
By responding to local combinations of neurons coactive in the previous layer, a single neuron in the next layer should not learn to respond to some arbitrary spatial arrangement of the same features (wallis96d). [MAA: cf Neocognitron in 1e.]

Temporal order as a key to recognition learning

The question still remains as to how neurons learn to treat their preferred feature as the same, irrespective of its size or location. Indeed, ultimately, one would like to understand how neurons learn to recognize objects as they undergo non-trivial transformations due to changes in, say, viewing direction or lighting. Hypothesis: under normal viewing conditions, by approaching an object, watching it move, or rotating it in our hand, we will receive a consistent associative signal capable of bringing all of the views of the object together. The discovery that cells in the temporal lobe become selective to stimuli on the basis of the order in which they appear, rather than spatial similarity, provides important preliminary support for this idea (miyashita93b, wallis99a); a minimal sketch of a learning rule that exploits this kind of temporal association is given below.

A functional characterization of structure processing

Any object recognition task, such as identification or categorization, has at its core a common operation, namely, the matching of the stimulus against a stored memory trace (Ullman, 1989). [MAA: Not necessarily. If one is, e.g., counting the number of legs on a creature as part of recognizing it, then one might be analyzing the stimulus using a stored memory trace, but not matching it.] For structure-processing tasks, a key characteristic is the restriction of the spatial scope of at least some of the operations involved to some fraction of the visual extent of the object or scene under consideration. Here are a few examples of structural tasks:

Given two objects, or an object and a class prototype, identify their corresponding regions. The correspondence here may be based on local shape similarity (find the eyes of a face in a Cubist painting), or on the similar role played by the regions in the global structure (find the eyes in a smiley icon).

Given an object and an action, identify a region in the object towards which the action can be directed. Similarities between objects vis-à-vis this task are defined functionally (as in the parallel that can be drawn between the handle of a pan and a door handle: both afford grasping). [MAA: (a) What are the best references on this? (b) One issue you seem to neglect is "recognition for what": E.g., when do I recognize a hand versus a monkey hand versus a monkey hand of a particular size grasping in a particular way, and how are these "levels of recognition" related to one another?]

Given an object, describe its structure. This explicitly structural task arises in the context of trying to make sense of an unfamiliar object (as in perceiving a hot-air balloon, upon seeing it for the first time, as a pear-like shape over a box-like one).

Appearance-based computational approaches to recognition and categorization, according to which objects are represented by collections of entire, spatially unanalyzed views (Murase and Nayar, 1995; Duvdevani-Bar and Edelman, 1999), therefore leave object structure largely unrepresented. "Classical" mereological structural decomposition approaches (Biederman, 1987; Bienenstock et al., 1997) have the opposite tendency: the recursive symbolic structure they impose on objects seems too rigid and too elaborate.
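As flagged in the section on temporal order above, here is a minimal sketch of a trace-style Hebbian update, in the spirit of the learning rule used in the Wallis and Rolls (1997) model: views of an object that follow one another in time strengthen the same weights and so become associated. The single rectified-linear unit, the learning rate and the trace constant are illustrative assumptions, not the published model.

```python
import numpy as np

def trace_rule_update(w, inputs, eta=0.8, alpha=0.05):
    """One pass of a trace-style Hebbian rule over a temporal sequence of
    input vectors (e.g. successive views of the same object). The unit's
    activity trace y_bar decays slowly, so views that follow one another in
    time strengthen the same weights and become associated."""
    y_bar = 0.0
    for x in inputs:                        # inputs: sequence of feature vectors
        x = np.asarray(x, dtype=float)
        y = max(0.0, float(w @ x))          # rectified linear response
        y_bar = eta * y_bar + (1 - eta) * y # slowly decaying activity trace
        w = w + alpha * y_bar * x           # Hebbian update gated by the trace
        w = w / (np.linalg.norm(w) + 1e-12) # keep the weight vector bounded
    return w
```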
Object form processing in computer vision

The specific notion that structural descriptions are to be expressed in terms of volumetric parts (Binford, 1971) was subsequently adopted by Biederman (1987), who developed it into a (psychological) theory of Recognition By Components (RBC): the generic parts, called geons (generalized cylinders), are bound together by categorical relations (Bienenstock et al., 1997). By virtue of their compositionality, the classical structural descriptions meet the two main challenges in the processing of structure: the visual system is (i) productive - it can deal effectively with a potentially infinite set of objects; and (ii) systematic - a well-defined change in the spatial configuration of the object (e.g., swapping top and bottom parts) causes a principled change in the representation (Hadley, 1997; Hummel, 2000b). However, the requirement that object parts be "crisp" and relations syntactically compositional is difficult to adhere to in practice. The two implemented systems designed to derive structural descriptions from raw images (Dickinson et al., 1992; Du and Munck-Fairwood, 1995) had difficulty with extracting sufficiently good line drawings, and with the idealized nature of the geon representation. Both these problems can be effectively neutralized by giving up the classical compositional representation of shape by a fixed alphabet of crisp primitives (geons) in favor of a superpositional coarse coding by an open-ended set of image fragments. The system described by Nelson and Selinger (1998) starts by detecting contour segments, then determines whether their relative arrangement approximates that of a model object. Because none of the individual segment shapes or locations is critical to the successful description of the entire shape, this method does not suffer from the brittleness associated with the classical structural description models of recognition. Moreover, the tolerance to moderate variation in the segment shape and location data allows it to categorize novel members of familiar object classes. [MAA: How does one go recursive? I.e., when do loose groupings of features define salient subparts to then be loosely related in defining larger objects. Conversely, what about top-down approaches where a crude low-spatial frequency analysis of a scene can drive generic pattern recognition within which one can focus attention to find features which direct finer object characterization?] Burl et al. (1998) combine "local photometry" (shape primitives that are approximate templates for small snippets of images) with "global geometry" (the probabilistic quantification of spatial relations between pairs or triplets of primitives). Camps et al. (1998) represent objects in terms of appearance-based parts (defined as projections of image fragments onto principal components of stacks of such fragments) and their approximate relations. Sali and Ullman (1999) use snippets of images taken from objects to be recognized to represent these objects; recognition is declared if a sufficient proportion of the fragments are detected, and if the spatial relations among these conform to the stored description of the target. In all these methods, the interplay of loosely defined local shape ("what") and approximate location ("where") information leads to robust algorithms supporting both recognition and categorization.
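As a concrete illustration of this style of fragment-based matching (loosely in the spirit of Sali and Ullman, 1999), the sketch below declares recognition when enough stored fragments are detected and their pairwise offsets roughly agree with a stored configuration. The fragment names, the thresholds and the toy "face" model are hypothetical.

```python
import numpy as np

def matches_model(detections, model, min_fraction=0.7, tol=20.0):
    """Loose fragment-based matching. `detections`: dict mapping fragment
    name -> (x, y) location found in the image. `model`: dict mapping
    fragment name -> (x, y) expected location in a reference frame.
    Recognition is declared if enough fragments are present and their
    pairwise offsets roughly agree with the stored model."""
    found = [f for f in model if f in detections]
    if len(found) / len(model) < min_fraction:
        return False
    for i, a in enumerate(found):
        for b in found[i + 1:]:
            observed = np.subtract(detections[b], detections[a])
            expected = np.subtract(model[b], model[a])
            if np.linalg.norm(observed - expected) > tol:   # loose, not exact
                return False
    return True

# A toy "face" model: eye, nose and mouth fragments in a rough configuration.
face = {"left_eye": (0, 0), "right_eye": (60, 0), "nose": (30, 40), "mouth": (30, 80)}
```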
We contend that these same methods also provide an effective alternative to the classical structural description approach to the processing of object form.

Mechanisms implicated in structure processing in primate vision

Although no evidence seems to exist for the neural embodiment of geons as such, cells in the inferotemporal (IT) cortex were reported to exhibit a higher sensitivity to "non-accidental" visual features such as those that define geons than to "metric" properties of the stimuli (Vogels et al., 2000). Also, an interpretation along the classical structural lines has been offered (Biederman, 2000) for the role of the cells in the inferotemporal (IT) cortex that are tuned to very specific shapes, as determined by the stimulus reduction technique (Tanaka et al., 1991). It has been argued that symbols representing the parts are bound into the proper structure dynamically, by the synchronous firing of the neurons that code each symbol (von der Malsburg, 1999). Thus, a mechanism capable of supporting dynamic binding must be available; it is possible that this function is fulfilled by the synchronous or phase-locked firing of cortical neurons (Gray et al., 1989; Singer and Gray, 1995), although the status of this phenomenon in primates has been disputed (Young et al., 1992; Kirschfeld, 1995). An alternative theory proposed by Edelman and Intrator (2000b) calls for an open-ended set of fragments instead of geons, and posits binding by retinotopy in addition to activity synchronization. [MAA: Is activity synchronization necessary for the model, or can it work just by exploiting binding by retinotopy? Does activity synchronization provide enough distinct phases to keep separate the different regions we recognize in a scene?] The role of fragment detectors may be fulfilled by those neurons in the IT cortex that respond selectively to some particular views of an object or to a specific shape irrespective of view (Logothetis and Sheinberg, 1996; Rolls, 1996; Tanaka, 1996). This very kind of shape-selective response may also constitute the neural basis of binding by retinotopy, in which the visual field itself can serve as the frame encoding the relative positions of object fragments, simply because each such fragment is already localized within that frame when it is detected (Didday and Arbib, 1975). Cf. the notion of the visual world serving as an "external memory" (O'Regan, 1992) or its own representation (Edelman, 1998). Binding by retinotopy is possible if the receptive field of each cell is confined to some relatively limited portion of the entire visual field. Such response properties have been found in the IT cortex (Kobatake and Tanaka, 1994), and the term what+where response type has been coined to describe the joint tuning of cells in the prefrontal (PF) cortex (Rao et al., 1997; Rainer et al., 1998). An imaging study of the representation of structure in IT (Tsunoda et al., 1998) revealed ensembles of neurons responding to overlapping sets of "moderately complex" geometrical features, spatially bound into distributed codes for entire objects; see also (Perrett and Oram, 1998; Tsunoda et al., 1999; Tsunoda and Tanifuji, 2000).

Neuromorphic models of visual structure processing

JIM.3

The JIM.3 model (Hummel, 2000a), which has evolved from JIM (Hummel and Biederman, 1992), combines the classical and the alternative approaches. It is structured as an 8-layer network (Figure 1). The first three layers extract local features: contours, vertices and axes of symmetry, and surface properties.
Surfaces are represented in terms of five categorical properties: (1) elliptical or not; (2) possessing parallel, expanding, convex or concave axes of symmetry; (3) possessing a curved or straight major axis; (4) truncated or pointed; (5) planar or curved in 3D. Units coding these local features group themselves into representations of geons by synchrony of firing. These representations are then routed by the units of layer 4 to two distinct destinations in layer 5. The first of these is a population of units coding for geons and spatial relations that are independent or "disembodied" in the sense that each of them may have originated from any location within the image. Within this population, the emergence of a representation of the object's structure requires dynamic binding, which the model stipulates to be carried out under attentional guidance and to take a relatively long time (a few hundred milliseconds). The second destination of the outgoing connections of layer 4 is a population of geon units arranged in the form of a retinotopic map. Here, the relations between the geons are coded implicitly, by virtue of each representation unit residing in the proper location within the map, which reflects the location of the corresponding geon in the image. In contrast to the attention-controlled stream, this one can operate much faster, and is postulated to be able to form a structural representation in a few tens of milliseconds. This speed and automaticity have a price: because of the fixed spatial structure imposed by the retinotopic map, the representation this stream supports is more sensitive to object transformations such as rotation in depth and reflection (Hummel, 2000a).

Figure 1: The architecture of the JIM.3 model (Hummel, 2000a).

The model had been trained on a single view (actually, a line drawing) of each of 20 objects: hammer, scissors, etc., as well as some "nonsense" objects. It was then tested on translated, scaled, reflected and rotated (in the image plane) versions of the same images. The model exhibited a pattern of results consistent with a range of psychophysical data obtained from human subjects (Stankiewicz et al., 1998; Hummel, 2000a). Specifically, the categorization performance was invariant with respect to translation and scaling, and was reduced by rotation. Moreover, due to the dual nature of the binding process in JIM.3 (dynamic and static/retinotopic), the model behaved differently given attended and unattended objects: reflected images primed each other in the former, but not in the latter case. (Figure courtesy of J. Hummel.)

Chorus of Fragments (CoF)

This model exemplifies the coarse-coded, fragment-based approach to the representation of structure (Edelman and Intrator, 2000a; Edelman and Intrator, 2001). It simulates cells with what+where receptive fields (above) to represent object fragments, and uses attentional gain fields, such as those found in area V4 (Connor et al., 1997), to decouple the representation of object structure from its location in the visual field. #Explain gain fields.# Unlike JIM.3, the CoF system operates directly on gray-level images, pre-processed by a front end that simulates the primary visual cortex (Heeger et al., 1996), with complex-cell responses modified to use the MAX operation suggested in (Riesenhuber and Poggio, 1999).
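In response to the note above: a gain field can be pictured as a multiplicative spatial envelope applied to the front-end responses, so that a unit connected through a gain field centered above (or below) fixation effectively sees only that part of the image. The sketch below is a minimal illustration under assumed image dimensions and an arbitrary Gaussian width; it is not the CoF implementation itself.

```python
import numpy as np

def gain_field(image, center, sigma=8.0):
    """Multiplicative Gaussian gain field: responses are scaled by how close
    each pixel is to the attended location `center`, so a unit fed through an
    'above center' gain field effectively sees only the upper fragment."""
    ys, xs = np.mgrid[0:image.shape[0], 0:image.shape[1]]
    g = np.exp(-((ys - center[0]) ** 2 + (xs - center[1]) ** 2) / (2 * sigma ** 2))
    return image * g    # same image, attenuated away from the attended region

img = np.random.rand(32, 32)                 # stand-in for the front-end output
top = gain_field(img, center=(8, 16))        # routed to the "above center" unit
bottom = gain_field(img, center=(24, 16))    # routed to the "below center" unit
```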
The system illustrated in Figure 2 contains two what+where units, one (labeled "above center") responsible for the top fragment of the object (as extracted by an appropriately configured Gaussian gain field), and the other (labeled "below center") responsible for the bottom fragment. The units are trained jointly for three-way discrimination, for translation tolerance, and for autoassociation. [MAA: cf. Fukushima's Neocognitron (see article in 1e).]

Figure 2: The CoF model, trained on three composite objects (numerals 4 over 5, 3 over 6, and 2 over 7). The model consists of two what+where units, responsible for the top and the bottom fragments of the stimulus, respectively. Gain fields (boxes labeled "below center" and "above center") steer each input fragment to the appropriate unit. The learning mechanism (R/C, for Reconstruction and Classification) can be implemented either as a multilayer perceptron, or as a radial basis function network. The reconstruction error modulates the classification outputs and helps the system learn binding (a co-activation pattern over units of the preceding stage will have a small reconstruction error only if both its what and where aspects are correct).

Figure 3 shows the performance of a CoF system charged with learning to reuse fragments of the members of the training set (three bipartite objects composed of numeral shapes) in interpreting novel composite objects. The gain field mechanism allowed it to respond largely systematically to the learned fragments shown in novel locations, both absolute and relative. The CoF model offers a unified framework, which can be linked to the Minimum Description Length principle (Edelman and Intrator, 2001), for the understanding of the functional significance of what+where receptive fields and of attentional gain modulation. It extends the previous use of gain fields in the modeling of translation invariance (Salinas and Abbott, 1997), and highlights a parallel between what+where cells and probabilistic fragment-based approaches to structure representation in computer vision, such as that of Burl et al. (1998). The representational framework it embodies is both productive and effectively systematic. It is capable, as a matter of principle, of recognizing as such objects that are related through a rearrangement of "middle-scale" parts, without being taught those parts individually, and without the need for dynamic binding. Further testing is needed to determine whether or not the CoF model can be scaled up to learn larger collections of objects, and to represent finer structure, under realistic transformations such as rotation in depth.

Figure 3: The response of the CoF model to a familiar composite object at a novel location (test I), and to novel compositions of fragments of familiar objects (tests II and III). In the test scenario, each unit ("above" and "below") must be fed each of the two input fragments (above and below), hence the 12 bars in the plots of the model's output. [MAA: The example is rather artificial. Can you give some sense as to how the method might extend to, e.g., recognition of a human form where the drape of clothing can greatly change the shape of any component or the appearance where they come together.]

Conclusions

Neither recognition nor categorization requires a prior derivation of a classical structural description. Moreover, making structure explicit may not be a good idea, either from a philosophical viewpoint or from a practical one.
On the philosophical level, it embodies a gratuitous ontological commitment (Schyns and Murphy, 1994) to the existence of object parts; on the practical level, reliable detection of such parts has proved to be an elusive goal. Moreover, a system can be productive and systematic without relying on representations that are compositional in the classical sense (Chrisley, 1998; Edelman and Intrator, 2000b). Structure can be represented by a coarse code based on image fragments, bound together by retinotopy. This notion is supported by the success of computer vision methods (such as "local photometry, global geometry"), by data from neurophysiological studies in primates (such as the discovery of what+where cells), as well as by psychological findings and by meta-theoretical considerations not mentioned here (Edelman and Intrator, 2000b). [MAA: Complementary issue: Having recognized the object, how do we extract relevant parameters, e.g., the pose of a hand?]

References

Biederman, I. (1987). Recognition by components: a theory of human image understanding. Psychol. Review, 94:115-147.
Biederman, I. (2000). Recognizing depth-rotated objects: a review of recent research and theory. Spatial Vision, in press.
Bienenstock, E. (1996). Composition. In Aertsen, A. and Braitenberg, V., editors, Brain Theory - Biological Basis and Computational Theory of Vision, pages 269-300. Elsevier.
Bienenstock, E. and Geman, S. (1995). Compositionality in neural systems. In Arbib, M. A., editor, The Handbook of Brain Theory and Neural Networks, pages 223-226. MIT Press.
Bienenstock, E., Geman, S., and Potter, D. (1997). Compositionality, MDL priors, and object recognition. In Mozer, M. C., Jordan, M. I., and Petsche, T., editors, Neural Information Processing Systems, volume 9. MIT Press.
Binford, T. (1971). Visual perception by a computer. IEEE Conference on Systems and Controls, Miami, Florida.
Burl, M. C., Weber, M., and Perona, P. (1998). A probabilistic approach to object recognition using local photometry and global geometry. In Proc. 4th Europ. Conf. Comput. Vision, H. Burkhardt and B. Neumann (Eds.), LNCS Vol. 1406-1407, Springer-Verlag, pages 628-.
Camps, O. I., Huang, C.-Y., and Kanungo, T. (1998). Hierarchical organization of appearance-based parts and relations for object recognition. In Proc. ICCV, pages 685-691. IEEE.
Chrisley, R. (1998). Non-compositional representation in connectionist networks. In Niklasson, L., Boden, M., and Ziemke, T., editors, ICANN 98: Proceedings of the 8th International Conference on Artificial Neural Networks, pages 393-398, Berlin. Springer.
Connor, C. E., Preddie, D. C., Gallant, J. L., and Van Essen, D. C. (1997). Spatial attention effects in macaque area V4. J. of Neuroscience, 17:3201-3214.
Desimone, R. (1991). Face-selective cells in the temporal cortex of monkeys. Journal of Cognitive Neuroscience, 3:1-8.
Desimone, R., Miller, E. K., Chelazzi, L., and Lueschow, A. (1995). Multiple memory systems in visual cortex. In Gazzaniga, M. S., editor, The Cognitive Neurosciences, chapter 30, pages 475-486. MIT Press.
Dickinson, S., Bergevin, R., Biederman, I., Eklundh, J., Munck-Fairwood, R., Jain, A., and Pentland, A. (1997). Panel report: the potential of geons for generic 3-D object recognition. Image and Vision Computing, 15:277-292.
Dickinson, S. J., Pentland, A. P., and Rosenfeld, A. (1992). 3-D shape recovery using distributed aspect matching. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14:174-198.
Didday, R. L. and Arbib, M. A. (1975). Eye movements and visual perception: a 'two visual systems' model. Int. J. Man-Machine Studies, 7:547-569.
Du, L. and Munck-Fairwood, R. (1995). Geon recognition through robust feature grouping. In Scandinavian Conference on Image Analysis, pages 715-722.
Duvdevani-Bar, S. and Edelman, S. (1999). Visual recognition and categorization on the basis of similarities to multiple class prototypes. International Journal of Computer Vision, 33:201-.
Edelman, S. (1994). Biological constraints and the representation of structure in vision and language. Psycoloquy, 5(57). FTP host: ftp.princeton.edu; FTP directory: /pub/harnad/Psycoloquy/1994.volume.5/; file name: psyc.94.5.57.language-network.3.edelman.
Edelman, S. (1998). Representation is representation of similarity. Behavioral and Brain Sciences, 21:449-498.
Edelman, S. and Intrator, N. (2000a). (Coarse coding of shape fragments) + (retinotopy) ≈ representation of structure. Spatial Vision, in press.
Edelman, S. and Intrator, N. (2000b). A framework for object representation that is shallowly structural, recursively compositional, and effectively systematic. In preparation.
Edelman, S. and Intrator, N. (2001). A productive, systematic framework for the representation of visual structure. In Leen, T., editor, Advances in Neural Information Processing Systems, volume 13, Cambridge, MA. MIT Press.
Fabre-Thorpe, M., Delorme, A., Marlot, C., and Thorpe, S. (2001). A limit to the speed of processing in ultra-rapid visual categorization of novel natural scenes. J Cogn Neurosci, 13:171-180.
Fabre-Thorpe, M., Richard, G., and Thorpe, S. J. (1998). Rapid categorization of natural images by rhesus monkeys. Neuroreport, 9:303-308.
Farah, M. J. (1990). Visual Agnosia: Disorders of Object Recognition and What They Can Tell Us About Normal Vision. Cambridge, Massachusetts: MIT Press.
Fodor, J. A. (1998). Concepts: Where Cognitive Science Went Wrong. Clarendon Press, Oxford.
Fodor, J. and McLaughlin, B. (1990). Connectionism and the problem of systematicity: why Smolensky's solution doesn't work. Cognition, 35:183-204.
Gray, C. M., König, P., Engel, A. K., and Singer, W. (1989). Oscillatory responses in cat visual cortex exhibit inter-columnar synchronization which reflects global stimulus properties. Nature, 338:334-337.
Hadley, R. F. (1997). Cognition, systematicity, and nomic necessity. Mind and Language, 12:137-153.
Heeger, D. J., Simoncelli, E. P., and Movshon, J. A. (1996). Computational models of cortical visual processing. Proceedings of the National Academy of Science, 93:623-627.
Hinton, G. E., McClelland, J. L., and Rumelhart, D. E. (1986). Distributed representations. In Rumelhart, D. E. and McClelland, J. L., editors, Parallel Distributed Processing, volume 1: Foundations, chapter 3. Cambridge, Massachusetts: MIT Press.
Hummel, J. E. (2000a). Complementary solutions to the binding problem in vision. Submitted.
Hummel, J. E. (2000b). Where view-based theories of human object recognition break down: the role of structure in human shape perception. In Dietrich, E. and Markman, A., editors, Cognitive Dynamics: Conceptual Change in Humans and Machines, chapter 7. Erlbaum, Hillsdale, NJ.
Hummel, J. E. and Biederman, I. (1992). Dynamic binding in a neural network for shape recognition. Psychological Review, 99:480-517.
Intraub, H. (1999). Understanding and remembering briefly glimpsed pictures: implications for visual scanning and memory. In Fleeting Memories: Cognition of Brief Visual Stimuli, ed. V. Coltheart. Cambridge, Mass.: MIT Press.
Kawano, K. (1999).
Keysers, C., Xiao, D. K., Foldiak, P., and Perrett, D. I. (2001). The speed of sight. Journal of Cognitive Neuroscience, 13:90-101.
Kirschfeld, K. (1995). Neuronal oscillations and synchronized activity in the central nervous system: functional aspects. Psycoloquy, 6(36). Available electronically as ftp://ftp.princeton.edu/pub/harnad/Psycoloquy/1995.volume.6/psyc.95.6.36.brainrhythms.11.kirschfeld.
Kobatake, E. and Tanaka, K. (1994). Neuronal selectivities to complex object features in the ventral visual pathway of the macaque cerebral cortex. Journal of Neurophysiology, 71:856-867.
Kobatake, E., Tanaka, K., and Wang, G. (1998). Effects of shape discrimination learning on the stimulus selectivity of inferotemporal cells in adult monkeys. Journal of Neurophysiology, 80:324-330.
Lamme, V. A., Super, H., and Spekreijse, H. (1998). Feedforward, horizontal, and feedback processing in the visual cortex. Current Opinion in Neurobiology, 8:529-535.
Logothetis, N. K. (1998). Object vision and visual awareness. Current Opinion in Neurobiology, 8(4):536-544.
Logothetis, N. K. and Sheinberg, D. L. (1996). Visual object recognition. Annual Review of Neuroscience, 19:577-621.
Marr, D. (1982). Vision. W. H. Freeman, San Francisco, CA.
Marr, D. and Poggio, T. (1977). From understanding computation to understanding neural circuitry. Neurosciences Research Program Bulletin, 15:470-488.
Miles, F. A. (1997). Visual stabilization of the eyes in primates. Current Opinion in Neurobiology, 7:867-871.
Miyashita, Y. (1993). Inferior temporal cortex: where visual perception meets memory. Annual Review of Neuroscience, 16:245-263.
Mouchetant-Rostaing, Y., Giard, M. H., Delpuech, C., Echallier, J. F., and Pernier, J. (2000). Early signs of visual categorization for biological and non-biological stimuli in humans. Neuroreport, 11:2521-2525.
Murase, H. and Nayar, S. (1995). Visual learning and recognition of 3D objects from appearance. International Journal of Computer Vision, 14:5-24.
Nelson, R. C. and Selinger, A. (1998). Large-scale tests of a keyed, appearance-based 3-D object recognition system. Vision Research, 38:2469-2488.
Oram, M. W. and Perrett, D. I. (1992). Time course of neural responses discriminating different views of the face and head. Journal of Neurophysiology, 68:70-84.
O'Regan, J. K. (1992). Solving the real mysteries of visual perception: the world as an outside memory. Canadian Journal of Psychology, 46:461-488.
Perrett, D. I. and Oram, M. W. (1998). Visual recognition based on temporal cortex cells: viewer-centred processing of pattern configuration. Zeitschrift für Naturforschung C, 53:518-541.
Potter, M. C. (1999). Understanding sentences and scenes: the role of conceptual short-term memory. In Coltheart, V., editor, Fleeting Memories: Cognition of Brief Visual Stimuli. MIT Press, Cambridge, MA.
Rainer, G., Asaad, W., and Miller, E. K. (1998). Memory fields of neurons in the primate prefrontal cortex. Proceedings of the National Academy of Sciences, 95:15008-15013.
Rao, S. C., Rainer, G., and Miller, E. K. (1997). Integration of what and where in the primate prefrontal cortex. Science, 276:821-824.
Rensink, R. A. (2000). The dynamic representation of scenes. Visual Cognition, 7:17-42.
Riesenhuber, M. and Poggio, T. (1999). Hierarchical models of object recognition in cortex. Nature Neuroscience, 2:1019-1025.
Ringach, D. L., Hawken, M. J., and Shapley, R. (1997). Dynamics of orientation tuning in macaque primary visual cortex. Nature, 387:281-284.
Rolls, E. T. (1996). Visual processing in the temporal lobe for invariant object recognition. In Torre, V. and Conti, T., editors, Neurobiology, pages 325-353. Plenum Press, New York.
Sali, E. and Ullman, S. (1999). Detecting object classes by the detection of overlapping 2-D fragments. In Proceedings of the 10th British Machine Vision Conference, volume 1, pages 203-213.
Salinas, E. and Abbott, L. F. (1997). Invariant visual responses from attentional gain fields. Journal of Neurophysiology, 77:3267-3272.
Schyns, P. G. and Murphy, G. L. (1994). The ontogeny of part representation in object concepts. In Medin, D., editor, The Psychology of Learning and Motivation, volume 31, pages 305-354. Academic Press, San Diego, CA.
Singer, W. and Gray, C. M. (1995). Visual feature integration and the temporal correlation hypothesis. Annual Review of Neuroscience, 18:555-586.
Stankiewicz, B., Hummel, J. E., and Cooper, E. E. (1998). The role of attention in priming for left-right reflections of object images: evidence for a dual representation of object shape. Journal of Experimental Psychology: Human Perception and Performance, 24:732-744.
Sugase, Y., Yamane, S., Ueno, S., and Kawano, K. (1999). Global and fine information coded by single neurons in the temporal visual cortex. Nature, 400:869-873.
Tanaka, K. (1996). Inferotemporal cortex and object vision. Annual Review of Neuroscience, 19:109-139.
Tanaka, K., Saito, H., Fukada, Y., and Moriya, M. (1991). Coding visual images of objects in the inferotemporal cortex of the macaque monkey. Journal of Neurophysiology, 66:170-189.
Thorpe, S., Fize, D., and Marlot, C. (1996). Speed of processing in the human visual system. Nature, 381:520-522.
Thorpe, S. J. and Imbert, M. (1989). Biological constraints on connectionist models. In Pfeifer, R., Schreter, Z., Fogelman-Soulié, F., and Steels, L., editors, Connectionism in Perspective, pages 63-92. Elsevier, Amsterdam.
Tsunoda, K. and Tanifuji, M. (2000). Direct evidence for feature-based representation of visually presented objects in monkey inferotemporal cortex revealed by optical imaging. Submitted.
Tsunoda, K., Fukuda, M., and Tanifuji, M. (1999). Feature-based representation of objects in macaque area TE revealed by intrinsic signal imaging. Society for Neuroscience Abstracts, 25:918.
Tsunoda, K., Nishizaki, M., Rajagopalan, U., and Tanifuji, M. (1998). Optical imaging of functional structure evoked by complex and simplified objects in macaque area TE. Society for Neuroscience Abstracts, 24:897.
Ullman, S. (1979). The Interpretation of Visual Motion. MIT Press, Cambridge, MA.
Ullman, S. (1989). Aligning pictorial descriptions: an approach to object recognition. Cognition, 32:193-254.
Ungerleider, L. G. and Haxby, J. V. (1994). 'What' and 'where' in the human brain. Current Opinion in Neurobiology, 4:157-165.
VanRullen, R. and Thorpe, S. J. (2001). The time course of visual processing: from early perception to decision-making. Journal of Cognitive Neuroscience, 13:454-461.
Vogels, R., Biederman, I., Bar, M., and Lorincz, A. (2000). Inferior temporal neurons show greater sensitivity to nonaccidental than metric differences. Journal of Cognitive Neuroscience, in press.
von der Malsburg, C. (1999). The what and why of binding: the modeler's perspective. Neuron, 24:95-104.
Wallis, G. (1998). Spatio-temporal influences at the neural level of object recognition. Network: Computation in Neural Systems, 9(2):265-278.
Wallis, G. (1999). Temporal association in a feed-forward framework. Network: Computation in Neural Systems, 10(3):281-284.
Wallis, G. and Rolls, E. T. (1997). A model of invariant object recognition in the visual system. Progress in Neurobiology, 51:167-194.
Wallis, G. and Bülthoff, H. H. (1999). Learning to recognize objects. Trends in Cognitive Sciences, 3:22-31.
Young, M. P. (1992). Objective analysis of the topological organization of the primate cortical visual system. Nature, 358:152-155.
Young, M., Tanaka, K., and Yamane, S. (1992). On oscillating neuronal responses in the visual cortex of the monkey. Journal of Neurophysiology, 67:1464-1474.