What is Vision? Why is it Hard? Alan L. Yuille. UCLA.

The Purpose of Vision. "To Know What is Where by Looking." Aristotle (384-322 BC). Information processing: receive a signal carried by light rays and decode its information. Vision appears deceptively simple, but this appearance is highly misleading.

What can humans see? A rich description: fox, tree trunk, grass, and background twigs. Also: the shape of the fox's legs and head, its fur, what it is doing, whether it is old or young, and more. Yet the local regions (B, C, D) are highly ambiguous.

Local regions are ambiguous. The same local patch could plausibly come from an airplane, a car, a boat, a sign, or a building.

Why can humans see so well? Humans appear to understand images effortlessly, but only because of the enormous amount of our brains devoted to this task. It is estimated that 40-50% of the neurons in the cortex are involved in vision. Humans are very visual: we get much more information from our eyes than other animals do.

Vision and the Brain. Half the cortex does vision.

But human vision is not perfect. Human perception is often incorrect. Visual illusions suggest that perception is a reconstruction, or even a controlled hallucination. Some illusions: the Ames Room (perspective cues), "Are these lines horizontal?", "Which square is brighter?"

Visual Illusions. The perceived brightness of a surface, or the perceived length of a line, depends on context, not on basic measurements such as the number of photons that reach the eye or the length of the line in the image.

Perception is Inference. Helmholtz (1821-1894): "Perception as Unconscious Inference."

So why is vision hard? Vision is an inverse inference problem. A highly complex process generates the image. This process is studied in computer graphics, which models objects, light sources, and how they interact. Vision must invert this image-formation process to estimate the "causal factors": objects, lighting, and so on.

Vision as Inverse Inference: inverting image formation. Inverse problems are often hard. There are an infinite number of ways an image can be formed. Which ways are most plausible?

Vision and Bayes. Bayes' theorem gives a procedure for solving inverse inference problems. It states that we can infer the state S of the world from the observed image I by P(S|I) = P(I|S)P(S)/P(I). But it does not tell us how to specify these distributions. (A toy numerical sketch of this inversion is given at the end of this part.)

Complexity: The Fundamental Problem? The set of all images is almost infinite (Kersten 1987). Only a tiny fraction of all possible images has ever been seen by humans: even tiny 10x10 images with just two gray levels per pixel already number 2^100, roughly 10^30. The number of objects is large (30,000?), and the number of scene types is large (1,000?). Hence both I and S are extremely complex.

Variability and Ambiguity. Variability: images of similar objects (a bicycle) can be very different (left, center). Ambiguity: images of different objects can be identical (right).

Complexity, Variability, Ambiguity. What is the space of all possible images? It is only now becoming possible to explore this question, thanks to technological advances. Large datasets: 20,000 images (Pascal), 1,000,000 (ImageNet). But are these datasets big enough for results on them to apply to other images?

Datasets: Promise and Peril. Vision evaluates the performance of algorithms on benchmark datasets, but results on these datasets may fail to generalize to other images. Datasets need to be representative of the complexity of the real world. The tyranny of datasets: unrepresentative datasets, limited visual tasks.

Dataset Examples. Pascal (top): 20,000 images, 20 object classes. ImageNet (bottom): 1,000,000 images, 1,000 object classes.
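To make the Bayesian formulation concrete, here is a minimal sketch of inverse inference on a toy scene. The scene states, prior, and likelihood values are invented for illustration and are not part of the lecture.

```python
# Toy illustration of vision as inverse (Bayesian) inference.
# The scene states, prior P(S), and likelihood P(I|S) below are invented
# purely for illustration.

# Hypothetical scene states S and their prior probabilities P(S).
prior = {"fox": 0.05, "dog": 0.15, "empty grass": 0.80}

# Hypothetical likelihoods P(I|S): how probable the observed image evidence
# (say, a reddish four-legged shape) is under each scene state.
likelihood = {"fox": 0.70, "dog": 0.40, "empty grass": 0.01}

# Bayes' theorem: P(S|I) = P(I|S) P(S) / P(I),
# where P(I) = sum over S of P(I|S) P(S) normalizes the posterior.
evidence = sum(likelihood[s] * prior[s] for s in prior)
posterior = {s: likelihood[s] * prior[s] / evidence for s in prior}

for s, p in sorted(posterior.items(), key=lambda kv: -kv[1]):
    print(f"P({s} | image) = {p:.3f}")
```

The inversion itself is a one-line application of Bayes' theorem; the hard part, as noted above, is specifying P(I|S) and P(S) when the image space and the scene space are as complex as the slides describe.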
Brief History of Vision. The difficulty of vision became clear in the 1960s and 1970s, when Artificial Intelligence researchers tried to get computer programs to interpret images. Initially AI researchers thought vision was easy ("solve vision in a summer"). They gradually realized that vision is extremely hard, much harder than chess.

Big Picture Theories of Vision. Work on vision in the 1980s was largely influenced by big-picture theories, e.g., David Marr's Vision (1982). The theories were interesting, but practical results were limited. Progress started being made by breaking vision up into subproblems and by inventing or borrowing mathematical tools for these problems.

Techniques used in Vision. A multi-disciplinary set of tools: linear and non-linear filtering, variational methods, probabilities on graphs, perspective geometry, differential geometry, and many more. We will introduce many of these at this summer school.

Vision as a Bag of Tricks? Dividing vision into sub-problems enabled progress, but it led to a situation where vision was seen as an enormous bag of tricks. "What papers should I read in computer vision? There are so many, and they are so different." (A Chinese graduate student.) Attempts are being made, through summer schools and the Wikivision project, to give vision a core and a foundation.

The Mathematical Richness of Vision. The complexity and ambiguity of vision is a challenge. Vision problems (broadly defined) have attracted the attention of leading mathematicians, including Fields medalists: David Mumford, Shing-Tung Yau, Maxim Kontsevich, Stephen Smale, Terence Tao.

Relationships to Other Disciplines. Computer vision relates closely to image processing, machine learning (partially covered in this school), artificial intelligence, robotics, and the study of biological vision systems in psychology and neuroscience.

Imaging Devices. In this school we will be dealing with "natural images" of everyday scenes, but the methods can be applied to other imaging modalities: medical images (fMRI, EEG), astronomical images, and an ever-increasing number of imaging devices. We need a science of visual images.

Structure of the School. Vision can be broken down, very roughly, into low-, mid-, and high-level vision. Low-level vision: local image operations with limited knowledge of the world. Mid-level vision: non-local operations that know about surfaces and geometry. High-level vision: operations that know about objects and scene structures.

Low-Level Vision. Image processing: filtering, denoising, enhancement. Edge detection. Image segmentation. Right: ideal segmentation (followed by labeling). (A minimal filtering sketch is given at the end of this part.)

Mid-Level Vision. Estimation of 3D surfaces: binocular stereo, structure from motion, and other depth cues (e.g., perspective, shading, texture, focus). Fig: image, depth (blue to red), segmentation.

Mid-Level Vision: Grouping (Kanizsa). Humans have a strong tendency to group image structures into surfaces.

High-Level Vision. Object detection and scene understanding. Example: detect objects in Pascal.

High-Level Vision: Parts. Detect the parts of objects.

High-Level Vision and AI. Rich explicit representations enable: understanding of objects, scenes, and events; reasoning about the functions and roles of objects and the goals and intentions of agents; predicting the outcomes of events. (S.-C. Zhu, MURI.)
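As a companion to the low-level vision slide above, here is a minimal NumPy sketch of linear filtering followed by simple gradient-based edge detection. The toy image, the box filter, and the choice of Sobel kernels are illustrative assumptions, not the specific methods the later lectures will cover.

```python
import numpy as np

def convolve2d(image, kernel):
    """Naive 2-D convolution with zero padding (written for clarity, not speed)."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(image, ((ph, ph), (pw, pw)), mode="constant")
    out = np.zeros_like(image, dtype=float)
    flipped = kernel[::-1, ::-1]          # true convolution flips the kernel
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * flipped)
    return out

# Toy image: a bright square on a dark background (values in [0, 1]).
image = np.zeros((32, 32))
image[8:24, 8:24] = 1.0

# Linear filtering: a 3x3 box filter smooths (denoises) the image.
box = np.ones((3, 3)) / 9.0
smoothed = convolve2d(image, box)

# Edge detection: Sobel kernels approximate the horizontal and vertical
# image gradients; their magnitude is a simple edge-strength map.
sobel_x = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
sobel_y = sobel_x.T
gx = convolve2d(smoothed, sobel_x)
gy = convolve2d(smoothed, sobel_y)
edge_strength = np.hypot(gx, gy)

print("strongest edge response:", edge_strength.max())
```

The explicit loop is written for readability; in practice one would use an optimized routine such as scipy.signal.convolve2d or an FFT-based convolution.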
Visual Turing Tests. Vision algorithms that have similar properties to those of humans: (i) flexible, adaptive, and robust; (ii) capable of learning from limited data and able to transfer what they learn; (iii) able to perform multiple tasks; (iv) closely coupled to reasoning, language, and other cognitive abilities.

Grammars, Compositionality, and Pattern Theory. How do we put everything together? How do we address the complexity problem of vision? Pattern theory offers suggestions (Mumford and Desolneux 2010): describe the class of patterns that can occur in images using grammars.

Grammars and Pattern Theory. Parse images by decomposing them into elementary image patterns: inverse inference on a grammar for generating images. (A toy sketch of this generate-and-parse idea follows the conclusion below.)

The Next Few Lectures. The next few lectures will concentrate on basic low-level vision tasks: filtering (linear and non-linear), segmentation, and variational and probabilistic methods.

Conclusion. Vision is difficult and challenging; humans devote half of their cortex to it. It is difficult because it is inverse inference in an extremely complex domain. It involves a great variety of mathematical and computational techniques. The low-, mid-, and high-level taxonomy. Pattern theory and grammars.
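As promised above, here is a toy sketch of the generate-and-parse idea. One-dimensional "images" are generated by concatenating elementary patterns drawn from a tiny grammar, and parsing is brute-force inverse inference that recovers the most probable decomposition. The patterns and probabilities are invented for illustration and are far simpler than the image grammars of Mumford and Desolneux.

```python
# Toy pattern-theory sketch: generate 1-D "images" from a tiny grammar of
# elementary patterns, then parse by inverse inference (most probable
# decomposition). All patterns and probabilities here are invented.
import itertools

# Elementary patterns (terminals) and their prior probabilities.
patterns = {
    "edge": ((0, 1), 0.3),
    "bar":  ((1, 1, 0), 0.2),
    "flat": ((0, 0, 0), 0.5),
}

def generate(parse_seq):
    """Concatenate elementary patterns into a 1-D 'image'."""
    image = []
    for name in parse_seq:
        image.extend(patterns[name][0])
    return tuple(image)

def parse(image):
    """Brute-force inverse inference: the most probable decomposition of the
    image into elementary patterns, assuming independent pattern choices."""
    best = (None, 0.0)
    for n in range(1, len(image) + 1):
        for seq in itertools.product(patterns, repeat=n):
            if generate(seq) == image:
                prob = 1.0
                for name in seq:
                    prob *= patterns[name][1]
                if prob > best[1]:
                    best = (list(seq), prob)
    return best

image = generate(["flat", "edge", "bar"])
decomposition, prob = parse(image)
print("image:", image)
print("most probable parse:", decomposition, "with probability", prob)
```

Real image grammars are two-dimensional, hierarchical, and compositional, and their parsing requires far more sophisticated inference than this brute-force search, but the structure of the problem, inverting a generative model to recover its parse, is the same.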