Hierarchical Models of Vision: Machine Learning/Computer Vision
Alan Yuille
UCLA: Dept. of Statistics; joint appointments in Computer Science, Psychiatry, and Psychology.
Dept. of Brain and Cognitive Engineering, Korea University.

Structure of Talk
• Comments on the relations between Cognitive Science and Machine Learning.
• Comments about Cog. Sci., ML, and Neuroscience.
• Three related Hierarchical Machine Learning Models:
• (I) Convolutional Networks.
• (II) Structured Discriminative Models.
• (III) Grammars and Compositional Models.
• The examples will be on vision, but the techniques are generally applicable.

Cognitive Science helps Machine Learning
• Cognitive Science is useful to ML because the human visual system has many desirable properties (not present in most ML systems):
• (i) flexible, adaptive, robust;
• (ii) capable of learning from limited data, with the ability to transfer;
• (iii) able to perform multiple tasks;
• (iv) closely coupled to reasoning, language, and other cognitive abilities.
• Cognitive Scientists search for fundamental theories, not incremental pragmatic solutions.

Cognitive Science and Machine Learning
• Machine Learning is useful to Cog. Sci. because it has experience dealing with complex tasks on huge datasets (e.g., the fundamental problem of vision).
• Machine Learning – and Computer Vision – has developed a very large number of mathematical and computational techniques, which seem necessary to deal with the complexities of the world.
• Data drives the modeling tools: simple data requires only simple tools, and simple tasks also require only simple tools (a point neglected by CV).

Combining Cognitive Science and ML
• Augmented Reality – we need computer systems that can interact with humans.
• How can a visually impaired person best be helped by a ML/CV system? They want to be able to ask the computer questions – who was that person? – i.e., to interact with it as if it were human. Turing tests for vision (S. Geman and D. Geman).
• Image Analysts (Medicine, Military) want a ML system that can reason about images, make analogies to other images, and so on.

Data Set Dilemmas
• Too complicated a dataset: requires a lot of engineering to perform well ("neural network tricks"; N students testing 100 x N parameter settings).
• Too simple a dataset: results may not generalize to the real world, and work may focus on side issues.
• Tyranny of Datasets: you can only evaluate performance on a limited set of tasks (e.g., a system can do "object classification" but not "object segmentation" or "cat part detection", or answer "what is the cat doing?").

Datasets and Generalization
• Machine Learning methods are tested on large benchmarked datasets.
• Two of the applications below involve 20,000 and 1,000,000 images.
• Critical issues of Machine Learning:
• (I) Learnability: will the results generalize to new datasets?
• (II) Inference: can we compute properties fast enough?
• Theoretical results: Probably Approximately Correct (PAC) theorems.

Vision: The Data and the Problem
• Complexity, variability, and ambiguity of images.
• Enormous range of visual tasks that can be performed. The set of all images is practically infinite.
• 30,000 objects, 1,000 scenes.
• How can humans interpret images in 150 msec?
• Fundamental problem: complexity.

Neuroscience: Bio-Inspired
• Theoretical models of the visual cortex (e.g., T. Poggio) are hierarchical and closely related to convolutional nets.
• Generative models (later in this talk) may help explain the increasing evidence of top-down mechanisms.
• Behavior-to-Brain: propose models of the visual cortex that can be tested by fMRI, multi-electrodes, and related techniques (multi-electrodes: T.S. Lee; fMRI: D.K. Kersten).
• Caveat: real neurons don't behave like neurons in textbooks…
• Conjecture: the structure of the brain and of ML systems is driven by the statistical structure of the environment. The Pattern Theory manifesto.

Hierarchical Models of Vision
• Why hierarchies?
• Bio-inspired: mimics the structure of the human/macaque visual system.
• Computer vision architectures: low-, middle-, and high-level. From ambiguous low-level to unambiguous high-level.
• Optimal design for representing, learning, and retrieving image patterns?

Three Types of Hierarchies
• (I) Convolutional Neural Networks: ImageNet dataset. Krizhevsky, Sutskever, and Hinton (2012). LeCun; Salakhutdinov.
• (II) Discriminative Part-Based Models (McAllester, Ramanan, Felzenszwalb 2008; L. Zhu et al. 2010). PASCAL dataset.
• (III) Generative Models: Grammars and Compositional Models (Geman, Mumford, S.C. Zhu, L. Zhu, …).

Example I: Convolutional Nets
• Krizhevsky, Sutskever, and Hinton (2012).
• Dataset: ImageNet (Fei-Fei Li). 1,000,000 images, 1,000 objects.
• Task: detect and localize objects.

Example I: Neural Network
• Architecture: neural network.
• Convolutional: each hidden unit applies the same localized linear filter to the input (see the code sketch below, after the Example II overview).

Example I: Neurons

Example I: The Hierarchy

Example I: Model Details
• New model. Learning.

Example I: Learnt Filters
• Image features learnt – the usual suspects.

Example I: Dropout

Example I: Results

Example I: Conclusion
• This convolutional net was the most successful algorithm on the ImageNet Challenge 2012.
• It requires very large amounts of data to train.
• The devil is in the details ("tricks for neural networks").
• The algorithm is implemented on Graphics Processing Units (GPUs) to deal with the complexity of inference and learning.

Example II: Structured Discriminative Models
• Star models: McAllester, Felzenszwalb, Ramanan, 2008.
• Objects are made from "parts" (not semantic parts).
• Discriminative models.
• Hierarchical variant: L. Zhu, Y. Chen, et al. 2010.
• Learning: latent support vector machines.
• Inference: window search plus dynamic programming.
• Application: PASCAL object detection challenge. 20,000 images, 20 objects.
• Task: identify and localize (bounding box).

Example II: Mixture Models
• Each object is represented by six models – to allow for different viewpoints.
• An energy function/probabilistic model is defined on a hierarchical graph.
• Nodes represent parts which can move relative to each other, enabling spatial deformations.
• Constraints on deformations are imposed by potentials on the graph structure.
• Parent-child spatial constraints. Parts: blue (1), yellow (9), purple (36).
• Deformations of Car. Deformations of Horse.

Example II: Mixture Models
• Each object is represented by 6 hierarchical models (a mixture of models).
• These mixture components account for pose/viewpoint changes.

Example II: Features and Potentials
• Edge-like cues: Histograms of Oriented Gradients (HOGs).
• Appearance cues: bag-of-words models (dictionary obtained by clustering SIFT or HOG features).
• Learning: (i) weights for the importance of features, (ii) weights for the spatial relations between parts.

Example II: Learning by Latent SVM
• The graph structure is known.
• The training data is partly supervised: it gives image regions labeled object/non-object.
• But you do not know the mixture (viewpoint) component or the positions of the parts. These are hidden variables.
• Learning: latent support vector machine (latent SVM).
• Learn the weights while simultaneously estimating the hidden variables (part positions, viewpoint).
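To make the weight sharing in Example I concrete before turning to the details of Example II, here is a minimal sketch of a single convolutional unit in Python/NumPy. This is an illustration, not the authors' implementation: `conv2d_layer`, the filter, and the toy image are invented for the example; the ReLU nonlinearity follows Krizhevsky et al.

```python
import numpy as np

def conv2d_layer(image, kernel, bias=0.0):
    """One shared linear filter applied at every image position ("valid"
    padding), followed by a ReLU. Real networks stack many such filters
    over multiple channels, with stride, pooling, and dropout."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel) + bias  # same weights everywhere
    return np.maximum(out, 0.0)  # ReLU nonlinearity

# Usage: a hand-coded 3x3 vertical-edge filter on a random 8x8 "image".
# In a learnt network the filter weights come from gradient descent.
edge_filter = np.array([[1, 0, -1], [2, 0, -2], [1, 0, -1]], dtype=float)
response = conv2d_layer(np.random.rand(8, 8), edge_filter)
```

Weight sharing is what makes the layer convolutional: the same few filter weights are reused at every location, which is why the learnt first-layer filters ("the usual suspects") resemble classical oriented-edge and blob features.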
Example II: Details (1)
• Each hierarchy is a 3-layer tree.
• Each node represents a part.
• Total of 46 nodes: (1 + 9 + 4 × 9).
• Each node has a spatial position (parts can "move", or are "active").
• Graph edges from parent to child impose spatial constraints.

Example II: Details (2)
• The object model has variables:
1. p – the positions of the parts.
2. V – the mixture component (e.g., pose).
3. y – whether the object is present or not.
4. w – the model parameters (to be learnt).
• Note: during learning, the part positions p and the pose V are unknown, so they are latent variables, written h = (V, p).

Example II: Details (3)
• The "score" of the model is defined to be $w \cdot \Phi(x, y, h)$, where x is the image in the region.
• The object is detected by solving $(y^*, h^*) = \arg\max_{y,h} \, w \cdot \Phi(x, y, h)$.
• If y* = 1 then we have detected the object.
• If so, h* = (p*, V*) specifies the mixture component and the positions of the parts.

Example II: Details (4)
• There are three types of potential terms in $\Phi(x, y, h)$:
(1) Spatial terms $\Phi_{\text{shape}}(y, h)$, which specify the distribution on the positions of the parts.
(2) Data terms for the edges of the object, $\Phi_{\text{HOG}}(x, y, h)$, defined using HOG features.
(3) Regional appearance data terms $\Phi_{\text{HOW}}(x, y, h)$, defined by histograms of words (HOWs – using grey SIFT features and K-means).

Example II: Details (5)
• Edge-like: Histograms of Oriented Gradients (HOGs) (upper row).
• Regional: Histograms of Words (bottom row).
• Dense sampling: 13,950 HOGs + 27,600 HOWs.

Example II: Details (6)
• Detecting an object requires solving $(y^*, h^*) = \arg\max_{y,h} \, w \cdot \Phi(x, y, h)$ for each image region.
• We solve this by scanning over the subwindows of the image, using dynamic programming to estimate the part positions p, and exhaustive search over y and V. (The dynamic-programming step is sketched in code after this section.)

Example II: Details (7)
• The input to learning is a set of labeled image regions $\{(x_i, y_i) : i = 1, \dots, N\}$.
• Learning requires us to estimate the parameters w while simultaneously estimating the hidden variables h = (p, V).

Example II: Details (8)
• We use Yu and Joachims' (2009) formulation of latent SVM.
• This specifies a non-convex criterion to be minimized, which can be re-expressed as a convex part plus a concave part:

$$\min_w \; \frac{1}{2}\|w\|^2 \;+\; C\sum_{i=1}^{N} \max_{y,h}\big[\, w \cdot \Phi(x_i, y, h) + L(y_i, y, h) \,\big] \;-\; C\sum_{i=1}^{N} \max_{h}\big[\, w \cdot \Phi(x_i, y_i, h) \,\big]$$

• The first two terms are convex in w; the last term is concave (the negative of a maximum of linear functions).

Example II: Details (9)
• Yu and Joachims (2009) propose the CCCP algorithm (Yuille and Rangarajan 2001) to minimize this criterion.
• It iterates between estimating the hidden variables and the parameters (like the EM algorithm).
• We propose a variant – incremental CCCP – which is faster.
• Result: our method learns the parameters well without complex initialization.

Example II: Details (10)
• Iterative algorithm (sketched in code after this section):
– Step 1: fill in the latent positions with the best score (DP).
– Step 2: solve the structural SVM problem using a partial negative training set (incrementally enlarged).
• Initialization:
– No pretraining (no clustering).
– No displacement of any nodes (no deformation).
– Pose assignment: maximum overlap.
• Simultaneous multi-layer learning.
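The dynamic-programming step of Details (6) can be sketched as follows. This is a toy single-level (star) version with quadratic "spring" deformation costs; the 3-layer tree applies the same max-over-displacement computation recursively from leaves to root, and practical systems accelerate it with distance transforms. `app_score`, `anchor`, and `sigma` are illustrative stand-ins, not the system's actual quantities.

```python
import numpy as np

def best_part_placement(app_score, anchor, sigma=2.0):
    """For one part: maximize appearance score minus a quadratic deformation
    penalty around the anchor position predicted by the parent."""
    best, best_pos = -np.inf, None
    for i in range(app_score.shape[0]):
        for j in range(app_score.shape[1]):
            deform = ((i - anchor[0])**2 + (j - anchor[1])**2) / (2 * sigma**2)
            s = app_score[i, j] - deform
            if s > best:
                best, best_pos = s, (i, j)
    return best, best_pos

def score_window(root_score, part_scores, part_anchors):
    """Score one subwindow: root appearance score plus the best placement of
    each part. Detection then scans all subwindows and mixture components V,
    keeping the argmax over (y, h)."""
    total, placements = root_score, []
    for app, anchor in zip(part_scores, part_anchors):
        s, pos = best_part_placement(app, anchor)
        total += s
        placements.append(pos)
    return total, placements
```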
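And a schematic of the iterative learning loop in Details (8)-(10): CCCP alternates between imputing the latent variables h = (p, V) with the current weights and solving the resulting convex structural-SVM problem. `best_hidden` and `solve_structural_svm` are hypothetical placeholders for the two sub-solvers (dynamic programming and a standard structural-SVM optimizer, respectively); the incremental variant additionally grows the negative training set between iterations.

```python
def latent_svm_cccp(examples, w0, best_hidden, solve_structural_svm, iters=10):
    """Schematic CCCP loop for latent SVM (after Yu and Joachims, 2009).
    `examples` is a list of (x_i, y_i) pairs; `best_hidden(w, x, y)` returns
    argmax_h of w . Phi(x, y, h), computed by DP; `solve_structural_svm`
    solves the convex problem obtained once the latent variables are fixed."""
    w = w0
    for _ in range(iters):
        # Step 1: fill in the latent variables with their best-scoring values.
        imputed = [(x, y, best_hidden(w, x, y)) for (x, y) in examples]
        # Step 2: with h fixed the objective is convex; re-estimate w.
        w = solve_structural_svm(imputed)
    return w
```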
Detection Results on PASCAL 2010: Cat

Example II: Cat Results

Example II: Horse Results

Example II: Car Results

Example II: Conclusion
• All current methods that perform well on the PASCAL object detection challenge use these types of models.
• Performance is fairly good for medium to large objects. Errors are understandable – cat versus dog, car versus train.
• But it seems highly unlikely that this is how humans perform these tasks – humans can probably learn from much less data.
• The devil is in the details. Small "engineering" changes can yield big improvements.
• Improved results come from combining these "top-down" object models with "bottom-up" edge cues: Fidler, Mottaghi, Yuille, Urtasun. CVPR 2013.

Example III: Grammars/Compositional Models
• Generative models of objects and scenes.
• These models have explicit representations of parts – e.g., they can "parse" objects instead of just detecting them.
• Explicit representations give the ability to perform multiple tasks (arguably closer to human cognition).
• Part sharing: efficiency of inference and learning.
• Adaptive and flexible. Can learn from little data.
• Tyranny of Datasets: "will they work on PASCAL?"

Example III: Generative Models
• Basic grammars (Grenander, Fu, Mjolsness, Biederman).
• Images are generated from dictionaries of elementary components, with stochastic rules for spatial and structural relations.

Example III: Analysis by Synthesis
• Analyze an image by inverting image formation.
• Inverse problem: determine how the data was generated – how was it caused?
• Inverse computer graphics.

Example III: Real Images
• Image parsing (Z. Tu, X. Chen, A.L. Yuille, and S.C. Zhu 2003).
• Learn probabilistic models for the visual patterns that can appear in images.
• Interpret/understand an image by decomposing it into its constituent parts.
• Inference algorithm: bottom-up and top-down.

Example III: Advantages
• Rich explicit representations enable:
• Understanding of objects, scenes, and events.
• Reasoning about the functions and roles of objects and the goals and intentions of agents; predicting the outcomes of events. S.C. Zhu – MURI.

Example III: Advantages
• Ability to transfer between contexts and to generalize or extrapolate (e.g., from Cow to Yak). Reduces the hypothesis space – PAC theory.
• Ability to reason about the system, intervene, and do diagnostics.
• Allows the system to answer many different questions based on the same underlying knowledge structure.
• Scales up to multiple objects by part sharing.

Example III: Car Detection
• Kokkinos and Yuille 2010. A 3-layer model.
• The object is made from parts: Car = Red-Part AND Blue-Part AND Green-Part.
• Parts are made by AND-ing contours: Red-Part = Con-1 AND Con-2 …
• These contours correspond to AND-ing tokens extracted from the image.
• The model has flexible geometry to deal with different types of cars: an SUV looks different than a Prius. Parts move relative to the object; contours move relative to the parts. This spatial variation is quantified by a probability distribution learnt from data. (A sketch of such an AND-node score appears after this section.)

Example III: Generative Models

Example III: Analogy – Building a Puzzle
• Bottom-up solution: combine pieces until you build the car. Does not exploit the box's cover.
• Top-down solution: try fitting each piece to the box's cover. Most pieces are uniform/irrelevant.
• Bottom-up/top-down solution: form car-like structures, but use the cover to suggest combinations. Uses A* (McAllester and Felzenszwalb).
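As a rough illustration of the compositional scoring idea in the car model above (a sketch under assumed Gaussian spatial terms, not the actual Kokkinos-Yuille potentials): an AND node scores as the sum of its children's scores plus a learnt spatial term for each child's displacement from its parent, applied recursively over the object-parts-contours hierarchy.

```python
import numpy as np

def offset_logprob(offset, mean, sigma):
    """Log-probability (up to a constant) of a child's displacement from its
    parent under a Gaussian spatial model learnt from data."""
    d = np.asarray(offset, dtype=float) - np.asarray(mean, dtype=float)
    return -np.dot(d, d) / (2 * sigma**2)

def score_and_node(child_scores, child_offsets, offset_means, sigma=3.0):
    """An AND node (e.g. Car = part1 AND part2 AND part3): sum the children's
    scores and add a spatial term for where each child sits relative to the
    parent. Recursing over the 3-layer hierarchy yields a full parse score."""
    total = 0.0
    for s, off, mu in zip(child_scores, child_offsets, offset_means):
        total += s + offset_logprob(off, mu, sigma)
    return total

# Usage: three parts with scores and observed vs. expected offsets from the root.
print(score_and_node([1.2, 0.8, 1.0],
                     [(0, -5), (2, 4), (6, 0)],
                     [(0, -4), (0, 5), (5, 0)]))
```

In the puzzle analogy, bottom-up/top-down search then amounts to composing high-scoring children bottom-up while the model structure (the "box's cover") prunes which combinations are worth forming.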
Example III: Localize and Parse

Example III: Summary
• The car/object is represented as a hierarchical graphical model.
• Inference algorithms: message passing / dynamic programming / A*.
• Learning algorithms: parameter estimation. Multiple-instance learning (latent SVM is a special case).

Example III: Part Sharing
• Exploit part sharing to deal with multiple objects.
• More efficient inference and representation – exponential gains, quantified in Yuille and Mottaghi, ICML 2013.
• Learning requires less data: a part learnt for a Cow can be reused for a Yak.

Example III: AND/OR Graphs for Baseball
• Part sharing enables the model to deal with objects with multiple poses and viewpoints (~100).
• Inference and learning by bottom-up and top-down processing.

Example III: Results on Baseball Players
• Performed well on benchmarked datasets.
• Zhu, Chen, Lin, Lin, Yuille. CVPR 2008, 2010.

Example III: Structure Learning
• Task: given 10 training images – no labeling, no alignment, highly ambiguous features:
– Estimate the graph structure (nodes and edges).
– Estimate the parameters.
• Correspondence is unknown: a combinatorial explosion problem.

Example III: Unsupervised Learning
• Structure induction.
• Bridges the gap between low-, mid-, and high-level vision.
• Between Chomsky and Hinton?

Example III: Learning Multiple Objects
• Unsupervised learning algorithm to learn parts shared between different objects.
• Zhu, Chen, Freeman, Torralba, Yuille 2010.
• Structure induction – learning the graph structures and the parameters.

Example III: Many Objects/Viewpoints
• 120 templates: 5 viewpoints and 26 classes.

Example III: Learn a Hierarchical Dictionary
• Low-level to mid-level to high-level.
• Automatically shares parts and stops.

Example III: Part Sharing Decreases with Levels

Example III: Summary
• These generative models with rich explicit representations offer potential advantages: flexibility, adaptability, transfer.
• They enable reasoning about the functions and roles of objects and the goals and intentions of agents, and predicting the outcomes of events.
• Access to semantic descriptions. Making analogies between images.
• Augmented Reality – e.g., a computer vision system communicating with a visually impaired human.
• "In the long term models will be generative." G. Hinton, 2013.

Conclusions
• Three examples of Hierarchical Models of Vision: Convolutional Networks, Structured Discriminative Models, and Generative Grammars/Compositional Models.
• Relations to Neuroscience.
• Machine Learning and Cognitive Science.
• Augmented Reality: humans and computers.
• Importance of data and tasks.

Theoretical Frameworks
• All three models are formulated in terms of probability distributions/energy functions defined over graphs or grammars.
• Discriminative versus generative models: P(W|I) versus P(I|W)P(W).
• Representation – are properties represented explicitly? (A requirement for performing tasks.)
• Inference algorithms and learning algorithms.
• Generalization (PAC theorems).

A Probabilistic Model is Defined by Four Elements
• (i) Graph structure – nodes/edges – representation.
• (ii) State variables – W – and input I – representation.
• (iii) Potentials – Φ – probability.
• (iv) Parameters/weights – λ – probability.
• The state variables are defined at the graph nodes.
• The potentials and parameters are defined over the graph edges – and relate the model to the image I. (A toy instantiation is sketched below.)
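A toy instantiation of these four elements on a 3-node chain, purely to fix notation (this is none of the talk's actual models; the potential and weights are invented for illustration):

```python
# (i) Graph structure: nodes and edges.
nodes = [0, 1, 2]
edges = [(0, 1), (1, 2)]

# (ii) State variables W, one per node, each taking values in `states`;
# the image evidence I is likewise one value per node here.
states = [0, 1]

# (iii) Potentials: defined over the edges, relating neighbouring states
# to each other and to the image evidence at those nodes.
def phi(wa, wb, Ia, Ib):
    return float(wa == wb) + float(wa == Ia) + float(wb == Ib)

# (iv) Parameters/weights: one per edge.
lam = {e: 1.0 for e in edges}

def unnorm_log_prob(W, I):
    """log P(W | I) up to an additive constant: the weighted sum of edge
    potentials. Exponentiating and normalizing over all W in states^3
    gives the discriminative distribution P(W | I)."""
    return sum(lam[(a, b)] * phi(W[a], W[b], I[a], I[b]) for (a, b) in edges)

# Usage: compare two labelings of the chain given evidence I.
I = [1, 1, 0]
print(unnorm_log_prob([1, 1, 0], I), unnorm_log_prob([0, 0, 0], I))
```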