Hierarchical Models of Vision:
Machine Learning/Computer Vision
Alan Yuille
UCLA: Dept. Statistics
Joint Appt. Computer Science, Psychiatry,
Psychology
Dept. Brain and Cognitive Engineering, Korea
University
Structure of Talk
• Comments on the relations between Cognitive Science
and Machine Learning.
• Comments about Cog. Sci., ML, and Neuroscience.
• Three related Hierarchical Machine Learning Models.
• (I) Convolutional Networks.
• (II) Structured Discriminative Models.
• (III) Grammars and Compositional Models.
• The examples will be on vision, but the techniques are generally applicable.
Cognitive Science helps Machine Learning
• Cognitive Science is useful to ML because the
human visual system has many desirable
properties (not present in most ML systems):
• (i) flexible, adaptive, robust
• (ii) capable of learning from limited data, ability
to transfer,
• (iii) able to perform multiple tasks,
• (iv) closely coupled to reasoning, language, and
other cognitive abilities.
• Cognitive Scientists search for fundamental
theories and not incremental pragmatic solutions.
Cognitive Science and Machine Learning
• Machine Learning is useful to Cog. Sci. because it
has experience dealing with complex tasks on
huge datasets (e.g., the fundamental problem of
vision).
• Machine Learning – and Computer Vision – has developed a very large number of mathematical and computational techniques, which seem necessary to deal with the complexities of the world.
• Data drives the modeling tools. Simple data
requires only simple tools. But simple tasks also
require simple tools (a point neglected by CV).
Combining Cognitive and ML
• Augmented Reality – we need computer
systems that can interact with humans.
• How can a visually impaired person best be helped by a ML/CV system? They want to be able to ask the computer questions – who was that person? – i.e., interact with it as if it were human. Turing tests for vision (S. Geman and D. Geman).
• Image Analyst (Medicine, Military) – wants a ML
system that can reason about images, make
analogies to other images, and so on.
Data Set Dilemmas
• Too complicated a dataset: requires a lot of engineering to perform well (“neural network tricks”, N students testing 100×N parameter settings).
• Too simple a dataset: results may not generalize to the real world; it may focus on side issues.
• Tyranny of Datasets: You can only evaluate
performance on a limited set of tasks (e.g., can do
“object classification” and not “object
segmentation” or “cat part detection”, or ask
“what is the cat doing?”)
Datasets and Generalization
• Machine Learning methods are tested on large
benchmarked datasets.
• Two of the applications involve 20,000 and
1,000,000 images.
• Critical Issues of Machine Learning:
• (I) Learnability: will the results generalize to new
datasets?
• (II) Inference: can we compute properties fast
enough?
• Theoretical Results: Probably Approximately Correct (PAC) Theorems. A standard example is shown below.
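
As a concrete instance of issue (I), a standard PAC result (textbook material, not a claim of this talk) bounds the number of examples needed to generalize, for a finite hypothesis class H in the realizable case:

m \geq \frac{1}{\epsilon}\left(\ln|H| + \ln\frac{1}{\delta}\right)

With probability at least 1 − δ, any hypothesis consistent with m training examples has true error at most ε; richer hypothesis classes (larger |H|, or higher VC dimension in the infinite case) need more data.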
Vision: The Data and the Problem
• Complexity, Variability, and Ambiguity of Images.
• Enormous range of visual tasks that can be performed. Set
of all images is practically infinite.
• 30,000 objects, 1,000 scenes.
• How can humans interpret images in 150 msec?
• Fundamental Problem: complexity.
Neuroscience: Bio-Inspired
• Theoretical Models of the Visual Cortex (e.g., T. Poggio)
are hierarchical and closely related to convolutional nets.
• Generative models (later in this talk) may help explain
the increasing evidence of top-down mechanisms.
• Behavior-to-Brain: propose models for the visual cortex
that can be tested by fMRI, multi-electrodes, and related
techniques.
• (multi-electrodes T.S. Lee, fMRI D.K. Kersten).
• Caveat: real neurons don’t behave like neurons in
textbooks…
• Conjecture: Structure of the Brain and ML systems is
driven by the statistical structure of the environment. The
Pattern Theory Manifesto.
Hierarchical Models of Vision
• Why Hierarchies?
• Bio-inspired: Mimics the structure of the
human/macaque visual system.
• Computer Vision Architectures: low-, middle-,
high-level. From ambiguous low-level to
unambiguous high level.
• Optimal design: for representing, learning,
and retrieving image patterns?
Three Types of Hierarchies:
• (I) Convolutional Neural Networks: ImageNet
Dataset.
• Krizhevsky, Sutskever, and Hinton (2012).
• LeCun, Salakhutdinov.
• (II) Discriminative Part-Based Models
(McAllester, Ramanan, Felzenszwalb 2008; L. Zhu
et al. 2010). PASCAL dataset.
• (III) Generative Models. Grammars and
Compositional Models. (Geman, Mumford, SC
Zhu, L. Zhu,…).
Example I: Convolutional Nets
• Krizhevsky, Sutskever, and Hinton (2012).
• Dataset: ImageNet (Fei-Fei Li).
• 1,000,000 images.
• 1,000 objects.
• Task: detect and localize objects.
Example I: Neural Network
• Architecture: Neural Network.
• Convolutional: each hidden unit applies the same localized linear filter to the input. A minimal sketch is shown below.
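
A minimal sketch of this weight-sharing idea (illustrative Python/NumPy, not the actual Krizhevsky et al. implementation; the real network uses many filters, multiple channels, and GPU kernels):

# Minimal sketch of a convolutional hidden layer: every hidden unit applies
# the SAME localized linear filter to its input patch, followed by a ReLU.
import numpy as np

def conv_layer(image, filt, bias=0.0):
    """Valid 2-D convolution with a single shared filter, then ReLU."""
    H, W = image.shape
    fh, fw = filt.shape
    out = np.zeros((H - fh + 1, W - fw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + fh, j:j + fw]
            out[i, j] = np.sum(patch * filt) + bias  # same weights everywhere
    return np.maximum(out, 0.0)  # ReLU nonlinearity

image = np.random.rand(8, 8)
edge_filter = np.array([[1., 0., -1.]] * 3)  # crude vertical-edge filter
print(conv_layer(image, edge_filter).shape)  # (6, 6)

The key point: one small set of filter weights is reused at every image location, which drastically cuts the number of parameters compared to a fully connected layer.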
Example I: Neurons
Example I: The Hierarchy.
Example I: Model Details
• New model.
Learning
Example I: Learnt Filters
• Image features learnt – the usual suspects.
Example I: Dropout
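
This slide refers to the dropout regularizer used in training the network. A minimal sketch of “inverted” dropout, a common formulation (the original network instead rescales activations at test time):

# Minimal sketch of (inverted) dropout: randomly zero hidden units during
# training and rescale so that expected activations are unchanged.
import numpy as np

def dropout(activations, p_drop=0.5, train=True, rng=np.random):
    if not train:
        return activations            # no dropout at test time
    mask = (rng.rand(*activations.shape) > p_drop)
    return activations * mask / (1.0 - p_drop)

h = np.random.rand(4, 5)
print(dropout(h, p_drop=0.5))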
Example I: Results
Example I: Conclusion
• This convolutional net was the most successful
algorithm on the ImageNet Challenge 2012.
• It requires a very large amount of data to train.
• Devil is in the details (“tricks for neural
networks”).
• Algorithm implemented on Graphics
Processing Units (GPUs) to deal with
complexity of inference and learning.
Example II: Structured Discriminative
Models.
• Star Models: McAllester, Felzenszwalb, Ramanan 2008.
• Objects are made from “parts” (not semantic parts).
• Discriminative Models:
• Hierarchical variant: L. Zhu, Y. Chen, et al. 2010.
• Learning: latent support-vector machines.
• Inference: window search plus dynamic programming.
• Application: Pascal object detection challenge. 20,000
images, 20 objects.
• Task: identify and localize (bounding box).
Example II: Mixture Models
• Each Object is represented by six models – to allow for different
viewpoints.
• Energy function/Probabilistic Model defined on hierarchical graph.
• Nodes represent parts which can move relative to each other enabling
spatial deformations.
• Constraints on deformations are imposed by potentials on the graph structure.
Figure: parent-child spatial constraints. Parts: blue (1), yellow (9), purple (36).
Deformations of Car
Deformations of Horse
Example II: Mixture Models:
• Each object is represented by 6 hierarchical
models (mixture of models).
• These mixture components account for
pose/viewpoint changes.
Example II: Features and Potentials
• Edge-Like Cues: Histogram of Gradients
(HOGs)
• Appearance-Cues: Bag of Words Models
(dictionary obtained by clustering SIFT or HOG
features).
• Learning:
(I) weights for the importance of features,
(ii) weights for the spatial relations between
parts.
Example II: Learning by Latent SVM
• The Graph Structure is known.
• The training data is partly supervised. It gives
image regions labeled by object/non-object.
• But you do not know which mixture (viewpoint)
component or the positions of the parts. These are
hidden variables.
• Learning: Latent Support Vector Machine (latent SVM).
• Learn the weights while simultaneously estimating
the hidden variables (part positions, viewpoint).
Example II: Details (1)
• Each hierarchy is a 3-layer tree.
• Each node represents a part.
• Total of 46 nodes: (1 + 9 + 4 × 9).
• Each node has a spatial position
(parts can “move” or are “active”)
• Graph edges from parent to child impose spatial constraints.
Example II: Details (2)
• The object model has variables:
1. p – represents the position of the parts.
2. V – specifies which mixture component (e.g. pose).
3. y – specifies whether the object is present or not.
4. λ – model parameter (to be learnt).
• Note: during learning the part positions p and the pose V are unknown – so they are latent variables and will be expressed as h = (V, p).
Example II: Details (3)
• The “energy” of the model is defined to be:
λ · Φ(x, y, h), where x is the image in the region.
• The object is detected by solving:
(y*, h*) = argmax_{y,h} λ · Φ(x, y, h).
• If y* = 1 then we have detected the object.
• If so, h* = (p*, V*) specifies the mixture component and the positions of the parts.
Example II: Details (4)
• There are three types of potential terms Φ(x, y, h):
(1) Spatial terms Φ_shape(y, h), which specify the distribution on the positions of the parts.
(2) Data terms for the edges of the object, Φ_HOG(x, y, h), defined using HOG features.
(3) Regional appearance data terms, Φ_HOW(x, y, h), defined by histograms of words (HOWs – using grey SIFT features and K-means).
Example II: Details (5)
• Edge-like: Histograms of Oriented Gradients (HOGs; upper row of figure).
• Regional: Histograms Of Words (HOWs; bottom row of figure).
• Dense sampling: 13,950 HOGs + 27,600 HOWs.
• A toy sketch of the HOG idea is shown below.
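
A toy sketch of the HOG idea – gradient orientations pooled into a normalized histogram over a cell. The bin count and cell size here are illustrative choices, not the settings used in the talk:

# Toy HOG-style descriptor: histogram of gradient orientations over a cell,
# weighted by gradient magnitude and L2-normalized.
import numpy as np

def hog_cell(cell, n_bins=9):
    gy, gx = np.gradient(cell.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)        # unsigned orientation
    bins = np.minimum((ang / np.pi * n_bins).astype(int), n_bins - 1)
    hist = np.bincount(bins.ravel(), weights=mag.ravel(), minlength=n_bins)
    return hist / (np.linalg.norm(hist) + 1e-8)    # normalize

cell = np.random.rand(8, 8)
print(hog_cell(cell))  # 9-bin orientation histogram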
Example II: Details (6)
• Detecting an object requires solving:
(y*, h*) = argmax_{y,h} λ · Φ(x, y, h)
for each image region.
• We solve this by scanning over the subwindows of the image, using dynamic programming to estimate the part positions p, and doing exhaustive search over y and V. A toy sketch of this scheme follows.
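
A toy version of this inference under strong simplifications (1-D part positions, a one-level star model rather than the full 3-layer tree, made-up scores):

# Toy sketch: scan root positions (the "window search") and, for each, let
# every part independently pick its best placement (the DP step for a star
# model, where parts decouple given the root).
import numpy as np

def detect(part_scores, anchors, deform_cost, root_positions):
    best_score, best = -np.inf, None
    for r in root_positions:                      # exhaustive window scan
        total, parts = 0.0, []
        for k, scores in enumerate(part_scores):  # parts decouple given root
            positions = np.arange(len(scores))
            local = scores - deform_cost * (positions - (r + anchors[k])) ** 2
            p = int(np.argmax(local))             # best position for part k
            total += float(local[p])
            parts.append(p)
        if total > best_score:
            best_score, best = total, (r, parts)
    return best_score, best

part_scores = [np.random.rand(20) for _ in range(3)]  # toy appearance scores
score, (root, parts) = detect(part_scores, anchors=[2, 5, 8],
                              deform_cost=0.1, root_positions=range(10))
print(score, root, parts)

In the full 3-layer tree model the same decoupling idea becomes dynamic programming from the leaves up to the root.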
Example II: Details (7)
• The input to learning is a set of labeled image regions {(x_i, y_i) : i = 1, ..., N}.
• Learning requires us to estimate the parameters λ,
• while simultaneously estimating the hidden variables h = (p, V).
Example II: Details (8)
• We use Yu and Joachims’ (2009) formulation of latent SVM.
• This specifies a non-convex criterion to be
minimized. This can be re-expressed in terms
of a convex plus a concave part.
min_w (1/2)||w||^2 + C Σ_{i=1}^N { max_{y,h} [ w · Φ(x_i, y, h) + L(y_i, y, h) ] − max_h [ w · Φ(x_i, y_i, h) ] }

• The objective decomposes into a convex part,
{ (1/2)||w||^2 + C Σ_{i=1}^N max_{y,h} [ w · Φ(x_i, y, h) + L(y_i, y, h) ] },
plus a concave part,
− { C Σ_{i=1}^N max_h [ w · Φ(x_i, y_i, h) ] }.
Example II: Details (9)
• Yu and Joachims (2009) propose using the CCCP algorithm (Yuille and Rangarajan 2001) to minimize this criterion.
• This iterates between estimating the hidden variables
and the parameters (like the EM algorithm).
• We propose a variant – incremental CCCP – which is
faster.
• Result: our method works well for learning the
parameters without complex initialization.
Example II: Details (10)
• Iterative Algorithm:
– Step 1: fill in the latent positions with the best score (DP).
– Step 2: solve the structural SVM problem using a partial negative training set (incrementally enlarged).
• Initialization:
– No pretraining (no clustering).
– No displacement of any nodes (no deformation).
– Pose assignment: maximum overlap.
• Simultaneous multi-layer learning.
• A toy sketch of the alternation is shown below.
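
A toy, runnable sketch of this alternation (not the authors' code: a small discrete latent set stands in for (viewpoint, part positions), all examples get latent fill-in for simplicity, and a few hinge-loss subgradient steps stand in for the structural-SVM solver):

# Toy latent-SVM / CCCP-style alternation: fill in latents given weights,
# then update weights given latents.
import numpy as np

rng = np.random.RandomState(0)
LATENTS = [0, 1, 2]                      # toy stand-in for (viewpoint, parts)

def phi(x, h):                           # feature map depends on latent h
    f = np.zeros(6)
    f[2 * h: 2 * h + 2] = x
    return f

def fill_latent(w, x):                   # Step 1: best latent under current w
    return max(LATENTS, key=lambda h: w @ phi(x, h))

def svm_step(w, X, Y, H, lr=0.1, C=1.0): # Step 2: one hinge subgradient step
    g = w.copy()                         # gradient of (1/2)||w||^2
    for x, y, h in zip(X, Y, H):
        if y * (w @ phi(x, h)) < 1:      # margin violation
            g -= C * y * phi(x, h)
    return w - lr * g

X = [rng.rand(2) for _ in range(20)]
Y = [1 if x.sum() > 1 else -1 for x in X]
w = np.zeros(6)
for _ in range(20):                      # CCCP-style alternation
    H = [fill_latent(w, x) for x in X]   # latent variables given weights
    w = svm_step(w, X, Y, H)             # weights given latent variables
print("learned weights:", w)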
Detection Results on PASCAL 2010: Cat
Example II: Cat Results
Example II: Horse Results
Example II: Car Results
Example II: Conclusion
• All current methods that perform well on the Pascal Object
Detection Challenge use these types of models.
• Performance is fairly good for medium to large objects.
Errors are understandable – cat versus dog, car versus train.
• But it seems highly unlikely that this is how humans perform these tasks – humans can probably learn from much less data.
• The devil is in the details. Small “engineering” changes can
yield big improvements.
• Improved results by combining these “top-down” object
models with “bottom-up” edge cues: Fidler, Mottaghi, Yuille,
Urtasun. CVPR 2013.
Example III: Grammars/Compositional
Models
• Generative models of objects and scenes.
• These models have explicit representation of parts
– e.g., can “parse” objects instead of just detect
them.
• Explicit Representations – gives the ability to
perform multiple tasks (arguably closer to human
cognition).
• Part sharing – efficiency of inference and learning.
• Adaptive and Flexible. Can learn from little data.
• Tyranny of Datasets: “will they work on Pascal?”.
Example III: Generative Models
• Basic Grammars (Grenander, Fu, Mjolsness,
Biederman).
• Images are generated from dictionaries of
elementary components – with stochastic rules for
spatial and structural relations.
Example III: Analysis by Synthesis
• Analyze an image by inverting image formation.
• Inverse problem: determine how the data was generated – what caused it?
• Inverse computer graphics.
Example III: Real Images
• Image Parsing: (Z. Tu, X. Chen, A.L. Yuille, and S.C. Zhu 2003).
• Learn probabilistic models for the visual patterns that can appear in
images.
• Interpret/understand an image by decomposing it into its
constituent parts.
• Inference algorithm: bottom-up and top-down.
Example III: Advantages
• Rich Explicit Representations enable:
• Understanding of objects, scenes, and events.
• Reasoning about functions and roles of objects, goals
and intentions of agents, predicting the outcomes of
events. SC Zhu – MURI.
Example III: Advantages
• Ability to transfer between contexts and
generalize or extrapolate (e.g., from Cow to
Yak). Reduces hypothesis space – PAC Theory.
• Ability to reason about the system, intervene,
do diagnostics.
• Allows the system to answer many different
questions based on the same underlying
knowledge structure.
• Scale up to multiple objects by part-sharing.
Example III: Car Detection
• Kokkinos and Yuille 2010. A 3-layer model.
• Object made from parts – Car = Red-Part AND Blue-Part AND Green-Part.
• Parts are made by AND-ing contours. Red-Part=Con-1 AND Con-2…
• These contours correspond to AND-ing tokens extracted from the
image.
• The model has flexible geometry to deal with different types of cars: an SUV looks different than a Prius.
• Parts move relative to the object.
• Contours can move relative to the parts.
• Quantify this spatial variation by a probability distribution which is learnt from data. An illustrative sketch is shown below.
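One simple way to realize such a learnt distribution (an illustrative assumption; the actual form used by Kokkinos and Yuille may differ) is a Gaussian over each part's displacement from its anchor, fit to training offsets:

# Illustrative sketch: fit a Gaussian to a part's training displacements
# and score a candidate placement by its log-probability (up to a constant).
import numpy as np

offsets = np.random.randn(50, 2) * [3.0, 1.5]   # toy training displacements
mu = offsets.mean(axis=0)
cov = np.cov(offsets.T)
cov_inv = np.linalg.inv(cov)

def spatial_log_score(displacement):
    d = np.asarray(displacement) - mu
    return -0.5 * d @ cov_inv @ d

print(spatial_log_score([1.0, -0.5]))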
Example III: Generative Models.
Example III: Analogy -- Building a puzzle
• Bottom-Up solution: combine pieces until you build the car.
– Does not exploit the box’s cover.
• Top-Down solution: try fitting each piece to the box’s cover.
– Most pieces are uniform/irrelevant.
• Bottom-Up/Top-Down solution:
– Form car-like structures, but use the cover to suggest combinations.
– Uses A* (McAllester and Felzenszwalb).
Example III: Localize and Parse
Example III
• Summary.
• Car/Object is represented as a hierarchical graphical model.
• Inference algorithm: message
passing/dynamic programming/A*.
• Learning algorithms: parameter estimation.
Multi-instance learning (Latent SVM is a
special case).
Example III: Part Sharing.
• Exploit part-sharing to deal with multiple objects.
• More efficient inference and representation – exponential gains, quantified in Yuille and Mottaghi, ICML 2013.
• Learning requires less data: a part learnt for a Cow can be used for a Yak.
• A back-of-the-envelope illustration is shown below.
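
A back-of-the-envelope illustration of the gain (toy numbers borrowed from elsewhere in these slides, not the analysis of the ICML paper):

# Compare part evaluations with and without a shared dictionary.
n_objects, parts_per_object = 26, 46      # 26 classes, 46-node models (toy)
shared_dictionary_size = 200              # assume heavy reuse across objects

without_sharing = n_objects * parts_per_object
with_sharing = shared_dictionary_size
print(without_sharing, with_sharing)      # 1196 vs 200 part evaluations

The paper's stronger result is that with hierarchical sharing the gains can grow exponentially with the number of levels.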
Example III: AND/OR Graphs for Baseball
• Part sharing enables the model to deal with
objects with multiple poses and viewpoints
(~100).
• Inference and Learning by bottom-up and top-down processing.
Example III: Results on Baseball
Players:
• Performed well on benchmarked datasets.
• Zhu, Chen, Lin, Lin, Yuille CVPR 2008, 2010.
Example III: Structure Learning
• Task: given 10 training images, no labeling, no
alignment, highly ambiguous features.
– Estimate Graph structure (nodes and edges)
– Estimate the parameters.
Figure: correspondence between image features and model parts is unknown – a combinatorial explosion problem.
Example III: Unsupervised Learning
• Structure Induction.
• Bridges the gap between low-, mid-, and high-level
vision.
• Between Chomsky and Hinton?
Example III: Learning Multiple Objects
• Unsupervised learning algorithm to learn
parts shared between different objects.
• Zhu, Chen, Freeman, Torralba, Yuille 2010.
• Structure Induction – learning the graph
structures and learning the parameters.
Example III: Many Objects/Viewpoints
• 120 templates: 5 viewpoints & 26 classes
Example III: Learn Hierarchical
Dictionary.
• Low-level to Mid-level to High-level.
• Automatically shares parts and stops.
Example III: Part Sharing decreases with
Levels
Example III: Summary
• These generative models with explicit rich
representations offer potential advantages:
flexibility, adaptability, transfer.
• Enable reasoning about functions and roles of
objects, goals and intentions of agents, predicting
the outcomes of events.
• Access to semantic descriptions. Making analogies
between images.
• Augmented Reality – e.g. computer vision system
communicating with a visually impaired human.
• “In the long term, models will be generative.” (G. Hinton, 2013).
Conclusions
• Three examples of Hierarchical Models of
Vision.
• Convolutional Networks, Structured
Discriminative Models, Generative
Grammars/Compositional Models.
• Relations to Neuroscience.
• Machine Learning and Cognitive Science.
• Augmented Reality: Humans and Computers
• Importance of Data and Tasks.
Theoretical Frameworks
• All three models formulated in terms of
probability distributions/energy functions
defined over graphs or grammars.
• Discriminative versus Generative models: P(W|I) versus P(I|W)·P(W). A minimal sketch of the distinction is shown below.
• Representation – are properties represented
explicitly? (Requirement for performing tasks).
• Inference algorithms and learning algorithm.
• Generalization (PAC theorems).
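
A minimal sketch of the discriminative/generative distinction for a toy problem with two world states W and a 1-D image measurement I (all numbers illustrative):

# Generative route: model P(I|W) and P(W), recover P(W|I) by Bayes' rule.
# A discriminative model would instead fit P(W|I) directly.
import numpy as np

def gaussian(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

prior = {"object": 0.3, "background": 0.7}                     # P(W)
likelihood = {"object": (2.0, 1.0), "background": (0.0, 1.0)}  # P(I|W)

def generative_posterior(i):
    joint = {w: gaussian(i, *likelihood[w]) * prior[w] for w in prior}
    z = sum(joint.values())
    return {w: p / z for w, p in joint.items()}                # P(W|I)

print(generative_posterior(1.5))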
A Probabilistic Model is defined by Four Elements
• (i) Graph Structure – Nodes/Edges – Representation.
• (ii) State Variables – W – and input I – Representation.
• (iii) Potentials – Φ – Probability.
• (iv) Parameters/Weights – λ – Probability.
• The state variables are defined at the graph nodes.
• The potentials and parameters are defined over the graph edges – and relate the model to the image I.
• A minimal sketch putting the four elements together is shown below.
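
A minimal sketch putting the four elements together on a tiny chain graph (all choices illustrative):

# (i) graph: nodes 0-1-2 with edges (0,1),(1,2); (ii) states w_i in {0,1}
# plus input I; (iii) potentials: unary fit-to-image and pairwise smoothness;
# (iv) parameters lambda weighting the potentials.
import itertools
import numpy as np

edges = [(0, 1), (1, 2)]
I = np.array([0.2, 0.9, 0.8])                 # toy "image" measurements
lam_data, lam_smooth = 1.0, 0.5               # parameters (weights)

def energy(w):
    data = sum((w[i] - I[i]) ** 2 for i in range(3))     # unary potentials
    smooth = sum((w[i] - w[j]) ** 2 for i, j in edges)   # edge potentials
    return lam_data * data + lam_smooth * smooth

best = min(itertools.product([0, 1], repeat=3), key=energy)
print(best, energy(best))   # MAP state by exhaustive search (3 nodes only)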