Graphical Models: Recent Trend in Machine Learning

Tomasz Malisiewicz tomasz@cmu.edu
Advanced Machine Perception
February 2006

Papers covered:
L. Fei-Fei and P. Perona. CVPR 2005
J. Sivic, B. Russell, A. Efros, A. Zisserman and B. Freeman. ICCV 2005
E. Sudderth, A. Torralba, W. Freeman, and A. Willsky. Describing Visual Scenes using Transformed Dirichlet Processes. NIPS, Dec. 2005.
Outline
Goals of both vision papers
Techniques from statistical text modeling
- pLSA vs LDA
Scene Classification via LDA
Object Discovery via pLSA
Goal: Learn and Recognize Natural
Scene Categories
Classify a scene without first extracting objects
Other techniques we know of:
- Global frequency (Oliva and Torralba)
- Texton histogram (Renninger, Malik et al.)
Goal: Discover Object Categories
Discover what objects are present in a collection of images in an unsupervised way
Find those same objects in novel images
Determine what local image features correspond to what objects; segmenting the image
Enter the world of Statistical Text Modeling
D. Blei, A. Ng, and M. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3:993-1022, January 2003.
Bag-of-words approaches: the order of words in a document can be neglected
Graphical Model Fun
Bag-of-words
A document is a collection of M words
A corpus (collection of documents) is summarized in a term-document matrix
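To make the bag-of-words representation concrete, here is a minimal sketch (my own illustration, not from either paper) that builds a term-document count matrix N from a few toy documents:

```python
# Build a term-document count matrix N, where N[i, j] counts
# occurrences of word j in document i. Word order is ignored.
import numpy as np

docs = [
    "the cat sat on the mat",          # toy documents for illustration
    "the dog chased the cat",
    "a scene with mountains and beach",
]

tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})
word_index = {w: j for j, w in enumerate(vocab)}

N = np.zeros((len(docs), len(vocab)), dtype=int)
for i, doc in enumerate(tokenized):
    for w in doc:
        N[i, word_index[w]] += 1       # bag-of-words: only counts matter
```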
1990: Latent Semantic Analysis (LSA)
Goal: map high-dimensional count vectors to a lower dimensional representation to reveal semantic relations between words
The lower dimensional space is called the latent semantic space
Dim(latent space) = K
1990: Latent Semantic Analysis (LSA)
D = {d_1, ..., d_N}: N documents
W = {w_1, ..., w_M}: M words
N_ij = #(d_i, w_j): the NxM term-document co-occurrence matrix
Singular Value Decomposition:
N (NxM, documents x words) ≈ U (NxK, documents-to-topics) · Σ (KxK, singular values) · V^T (KxM, topics-to-words)
LSA summary
SVD of the term-document matrix N
Approximate N by thresholding all but the largest K singular values in Σ to zero
Produces the optimal rank-K approximation to N in the L2-matrix (Frobenius) norm sense
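As a rough illustration of the LSA recipe above, the following sketch (toy random counts, my own code) computes the truncated SVD and the resulting latent-space coordinates:

```python
# LSA sketch: rank-K truncated SVD of a term-document count matrix.
import numpy as np

rng = np.random.default_rng(0)
N = rng.integers(0, 5, size=(8, 12)).astype(float)   # 8 documents x 12 words (toy data)

K = 2                                                 # dimensionality of the latent space
U, s, Vt = np.linalg.svd(N, full_matrices=False)

# Threshold all but the largest K singular values to zero.
N_hat = U[:, :K] @ np.diag(s[:K]) @ Vt[:K, :]         # optimal rank-K approximation of N

doc_coords = U[:, :K] * s[:K]                         # documents in the latent space
word_coords = Vt[:K, :].T * s[:K]                     # words in the latent space
```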
LSA and Polysemy
Polysemy: the ambiguity of an individual word (used in different contexts) to express two or more different meanings
Under the LSA model, the coordinates of a word in latent space can be written as a linear superposition of the coordinates of the documents that contain the word
According to this superposition, LSA cannot tell the different senses of a word apart: every occurrence is mapped to the same point in latent space
Problems with LSA
LSA does not define a properly normalized probability distribution
No obvious interpretation of the directions in the latent space
From statistics, the use of the L2 norm in LSA corresponds to a Gaussian error assumption, which is hard to justify in the context of count variables
Polysemy problem
pLSA to the rescue
Probabilistic Latent Semantic Analysis
pLSA relies on the likelihood function of multinomial sampling and aims at an explicit maximization of the predictive power of the model
P(w_i | d_j) = Σ_{k=1..K} P(w_i | z_k) P(z_k | d_j)

P(w_i | d_j): observed word distributions; P(w_i | z_k): word distributions per topic; P(z_k | d_j): topic distributions per document
Slide credit: Josef Sivic
Learning the pLSA parameters
Observed counts of word i in document j
Unlike LSA, pLSA does not minimize any type of ‘squared deviation.’
The parameters are estimated in a probabilistically sound way.
Maximize likelihood of data using EM.
Minimize KL divergence between empirical distribution and model
Slide credit: Josef Sivic
EM for pLSA (training on a corpus)
E-step: compute posterior probabilities for the latent variables
M-step: maximize the expected complete data log-likelihood
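A compact numpy sketch of these EM updates is given below; it is my own illustration of the standard pLSA updates, not code from Hofmann or the papers, and the function name plsa_em is made up:

```python
# pLSA training by EM on a document-word count matrix N (n_docs x n_words).
import numpy as np

def plsa_em(N, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    n_docs, n_words = N.shape
    p_w_given_z = rng.random((K, n_words))                 # P(w | z)
    p_w_given_z /= p_w_given_z.sum(axis=1, keepdims=True)
    p_z_given_d = rng.random((n_docs, K))                  # P(z | d)
    p_z_given_d /= p_z_given_d.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # E-step: posterior P(z | d, w) ∝ P(w | z) P(z | d)
        post = p_z_given_d[:, :, None] * p_w_given_z[None, :, :]   # (docs, K, words)
        post /= post.sum(axis=1, keepdims=True) + 1e-12

        # M-step: re-estimate both conditionals from expected counts n(d, w) P(z | d, w)
        weighted = N[:, None, :] * post
        p_w_given_z = weighted.sum(axis=0)
        p_w_given_z /= p_w_given_z.sum(axis=1, keepdims=True)
        p_z_given_d = weighted.sum(axis=2)
        p_z_given_d /= p_z_given_d.sum(axis=1, keepdims=True)
    return p_w_given_z, p_z_given_d
```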
Graphical View of pLSA
pLSA is a generative model
[Graphical model: document node d and word node w are observed, topic node z is latent; plates denote repetition over documents and words]
1. Select a document d_i with probability P(d_i)
2. Pick a latent class z_k with probability P(z_k | d_i)
3. Generate a word w_j with probability P(w_j | z_k)
How does pLSA deal with previously unseen documents?
“Folding-in” Heuristic
First train on the corpus to obtain P(w|z)
Now re-run the same training EM algorithm, but don't re-estimate P(w|z), and let D = {d_unseen}, so only P(z|d_unseen) is updated (see the sketch below)
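A sketch of folding-in under these assumptions (my own code; fold_in and its arguments are illustrative names): the topic-word distributions P(w|z) learned on the corpus are held fixed, and EM updates only the mixing weights of the unseen document:

```python
# Folding-in: estimate P(z | d_unseen) with P(w | z) fixed.
import numpy as np

def fold_in(n_new, p_w_given_z, n_iter=50, seed=0):
    """n_new: word counts of the unseen document; p_w_given_z: fixed (K x words)."""
    K = p_w_given_z.shape[0]
    rng = np.random.default_rng(seed)
    p_z = rng.random(K)
    p_z /= p_z.sum()                                # P(z | d_unseen), random init
    for _ in range(n_iter):
        # E-step: P(z | d_unseen, w) ∝ P(w | z) P(z | d_unseen)
        post = p_z[:, None] * p_w_given_z           # (K, words)
        post /= post.sum(axis=0, keepdims=True) + 1e-12
        # M-step: update only the document-specific mixing weights
        p_z = (n_new[None, :] * post).sum(axis=1)
        p_z /= p_z.sum()
    return p_z                                      # topic proportions for the unseen document
```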
Problems with pLSA
Not a well-defined generative model of documents; d is a dummy index into the list of documents in the training set (as many values as documents)
No natural way to assign probability to a previously unseen document
Number of parameters to be estimated grows with size of training set
LDA to the rescue
Latent Dirichlet Allocation treats the topic mixture weights as a k-parameter hidden random variable and places a Dirichlet prior on the multinomial mixing weights
The Dirichlet distribution is conjugate to the multinomial distribution (the most natural prior to choose: the posterior distribution is also a Dirichlet!)
LDA vs. pLSA
Corpus-Level parameters in LDA
Alpha and beta are corpus-level parameters, sampled once in the process of generating the corpus (they sit outside of the plates!)
Alpha and beta must be estimated before we can find the topic mixing proportions belonging to a previously unseen document
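For intuition, here is a toy sketch of LDA's generative process with made-up sizes and hyperparameters (my own illustration, not the papers' code): sample per-topic word distributions beta, then for each document draw a topic mixture theta from the Dirichlet prior and generate words:

```python
# Toy LDA generative process: alpha, beta are corpus-level; theta is per-document.
import numpy as np

rng = np.random.default_rng(0)
K, V, doc_len = 3, 10, 20                  # topics, vocabulary size, words per document
alpha = np.full(K, 0.5)                    # Dirichlet prior on topic proportions
beta = rng.dirichlet(np.ones(V), size=K)   # per-topic word distributions P(w | z)

theta = rng.dirichlet(alpha)               # hidden per-document topic mixture
doc = []
for _ in range(doc_len):
    z = rng.choice(K, p=theta)             # choose a topic for this word
    w = rng.choice(V, p=beta[z])           # draw a word from that topic
    doc.append(w)
```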
LDA
Getting rid of plates
[Figure: the LDA graphical model unrolled without plates; each document gets its own topic nodes z_1, z_2, z_3, ..., z_N and word nodes w_1, w_2, w_3, ..., w_N, and all documents share the corpus-level topic parameters β_1, ..., β_K]
Thanks to Jonathan Huang for the un-plated LDA graphic
Inference in LDA
Inference = estimation of document-level parameters
Intractable to compute exactly, so we must employ approximate inference
Approximate Inference in LDA
Variational methods are one way of doing this: use Jensen's inequality to obtain a lower bound on the log likelihood, indexed by a set of variational parameters
The optimal (document-specific) variational parameters are obtained by minimizing the KL divergence between the variational distribution and the true posterior
[Figure: the factorized variational distribution q(θ, z | γ, φ)]
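As a practical aside (not the code used in the papers): scikit-learn's LatentDirichletAllocation implements this kind of variational inference, so a few lines suffice to fit topics to a count matrix. The toy random counts below are my own illustration:

```python
# Variational LDA via scikit-learn on a toy document-word count matrix.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

N = np.random.default_rng(0).integers(0, 5, size=(20, 50))  # 20 toy documents, 50 words

lda = LatentDirichletAllocation(n_components=5, max_iter=50, random_state=0)
doc_topic = lda.fit_transform(N)   # variational posterior over topics, one row per document
topic_word = lda.components_       # topic-word weights, one row per topic
```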
Look at some P(w|z) produced by LDA
Show some pLSI and LDA results applied to text
An LDA project by Tomasz Malisiewicz and Jonathan Huang
Search for the word ‘drive’
pLSA and LDA applied to Images
How can one apply these techniques to the images?
Hierarchical Bayesian text models
Probabilistic Latent Semantic Analysis (pLSA)
[pLSA plate diagram: d → z → w; inner plate over the N words in a document, outer plate over the D documents in the corpus] (Hofmann, 2001)
Latent Dirichlet Allocation (LDA)
[LDA plate diagram: class/mixture node c → z → w; inner plate over N words, outer plate over D documents] (Blei et al., 2001)
Hierarchical Bayesian text models
Probabilistic Latent Semantic Analysis (pLSA)
[The same pLSA plate diagram, now applied to images: documents are images and a topic z corresponds to an object category such as "face"]
Sivic et al. ICCV 2005
Hierarchical Bayesian text models
Latent Dirichlet Allocation (LDA)
[The same LDA plate diagram, now applied to images: a class c corresponds to a scene category such as "beach"]
Fei-Fei et al. ICCV 2005
A Bayesian Hierarchical Model for Learning Natural
Scene Categories
Flow Chart: Quick Overview
How to Generate an Image?
Choose a scene (mountain, beach, …)
Given the scene, generate an intermediate probability vector over ‘themes’
Determine current theme from mixture of themes
Draw a codeword from that theme
How to Generate an Image?
Inference
How to make decision on a novel image
Integrate over the latent variables to get the likelihood of the image under each scene category
Exact computation is hard, so use approximate variational inference (not easy, but Gibbs sampling is supposed to be easier)
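Since the slide mentions Gibbs sampling as the easier alternative, here is a minimal collapsed Gibbs sampler for an LDA-style topic model (my own sketch with assumed symmetric priors alpha and eta; not the authors' implementation):

```python
# Collapsed Gibbs sampling for an LDA-style model.
import numpy as np

def gibbs_lda(docs, K, V, alpha=0.5, eta=0.01, n_iter=200, seed=0):
    """docs: list of lists of word indices. Returns per-document topic counts."""
    rng = np.random.default_rng(seed)
    z = [rng.integers(K, size=len(doc)) for doc in docs]   # random initial topic assignments
    ndk = np.zeros((len(docs), K))                          # document-topic counts
    nkw = np.zeros((K, V))                                  # topic-word counts
    nk = np.zeros(K)                                        # topic totals
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            ndk[d, z[d][i]] += 1
            nkw[z[d][i], w] += 1
            nk[z[d][i]] += 1

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]                                 # remove current assignment
                ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
                # conditional p(z_i = k | everything else) under symmetric Dirichlet priors
                p = (ndk[d] + alpha) * (nkw[:, w] + eta) / (nk + V * eta)
                k = rng.choice(K, p=p / p.sum())
                z[d][i] = k
                ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    return ndk
```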
Codebook
A codebook of 174 codewords, each a local image patch
Detection:
Evenly Sampled Grid
Random Sampling
Saliency Detector
Lowe’s DoG Detector
Representation:
Normalized 11x11 gray values
128-dim SIFT
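A small sketch of the simplest detection/representation pair above, an evenly sampled grid with normalized 11x11 gray-value patches (the grid step and function name are my assumptions):

```python
# Dense-grid patch extraction with normalized 11x11 gray values.
import numpy as np

def grid_patches(image, patch=11, step=10):
    """image: 2-D grayscale array. Returns one 121-dim descriptor per grid point."""
    half = patch // 2
    descriptors = []
    for y in range(half, image.shape[0] - half, step):
        for x in range(half, image.shape[1] - half, step):
            p = image[y - half:y + half + 1, x - half:x + half + 1].astype(float).ravel()
            p = (p - p.mean()) / (p.std() + 1e-8)   # normalize the gray values
            descriptors.append(p)
    return np.array(descriptors)
```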
Results: Average performance 64%
Confusion Matrix
100 training examples and 50 test examples per category
Rank statistic test: the probability that a test scene correctly belongs to one of the top N most probable categories
Results: The Distributions
Theme distribution
Codeword distribution
The peak at 174
Summary of detection and representation choices
SIFT outperforms pixel gray values
Sliding grid, which creates the largest number of patches, does best
Discovering objects and their location in images
Visual Words
Vector Quantized SIFT descriptors computed in regions
Regions come from elliptical shape adaptation around interest points, and from the maximally stable regions of Matas et al.
Both are elliptical regions at twice their detected scale
Building a Vocabulary
…
Building a Vocabulary
K-means clustering of 300K regions to get about 1K clusters for each of the Shape Adapted and Maximally Stable region types
…
Vector quantization
Slide credit: Josef Sivic
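A sketch of the vocabulary-building and vector-quantization step; scikit-learn's KMeans stands in for whatever clustering code was actually used, and the ~1K-word setting follows the slide:

```python
# Build a visual vocabulary by k-means, then quantize descriptors to visual words.
from sklearn.cluster import KMeans

def build_vocabulary(descriptors, n_words=1000, seed=0):
    """descriptors: (num_regions, dim) array, e.g. 128-dim SIFT."""
    km = KMeans(n_clusters=n_words, random_state=seed, n_init=10)
    km.fit(descriptors)
    return km

def quantize(descriptors, km):
    return km.predict(descriptors)      # visual-word index for each region
```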
pLSA Training
Sanity Check: Remember what quantities must be estimated?
Results #1: Topic Discovery
This is just the training stage: 4 object categories plus background
Obtain P(z_k | d_j) for each image, then classify the image as containing object k according to the max of P(z_k | d_j) over k
Results #1: Topic Discovery
Results #2: Classifying New Images
Object categories are learned on a corpus, then those categories are found in new images
How does pLSA deal with previously unseen documents?
“Folding-in” Heuristic
First train on the corpus to obtain P(w|z)
Now re-run the same training EM algorithm, but don't re-estimate P(w|z), and let D = {d_unseen}
Results #2: Classifying New Images
Train on one set and test on another
Results #3: Segmentation
Localization and Segmentation of Objects
For a word occurrence in a particular document we can examine the probability of different topics
Find words with P(z_k | d_j, w_i) > 0.8 (see the sketch below)
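A sketch of this segmentation rule (my own illustration): compute the word-level posterior P(z | d, w) ∝ P(w | z) P(z | d) from the fitted pLSA parameters and keep the visual-word occurrences whose posterior for the chosen topic exceeds 0.8:

```python
# Keep visual-word occurrences strongly explained by topic k in image d.
import numpy as np

def segment_words(word_ids, d, p_w_given_z, p_z_given_d, topic_k, thresh=0.8):
    """word_ids: visual-word indices occurring in image d."""
    post = p_z_given_d[d][:, None] * p_w_given_z[:, word_ids]   # (K, num_occurrences)
    post /= post.sum(axis=0, keepdims=True) + 1e-12             # P(z | d, w) per occurrence
    return [w for w, p in zip(word_ids, post[topic_k]) if p > thresh]
```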
Results #3: Segmentation
Note: words shown are not the most probable words for a topic, but instead they are words that have a high probability of occurring in a topic AND high probability of occurring in the image
Results #3: Segmentation and Doublets
Two class image dataset consisting of half the faces
(218 images) and backgrounds (217 images)
A 4 topic pLSA model is learned for all training faces and training backgrounds with 3 fixed background topics, i.e. one (face) topic is learned in addition to the three fixed background topics
A doublet vocabulary is then formed from the top 100 visual words of the face topic. A second 4 topic pLSA model is then learned for the combined vocabulary of singlets and doublets with the background topics fixed.
Doublets
Efros: didn’t work as much as you’d think
Face segmentation scores: singletons 0.49, doublets 0.61
Conclusions
Showed how both papers use bag-of-words approaches
We’re now ready to become experts on generative models like pLSA and LDA
Graphical Model Fun! (Carlos Guestrin teaches Graphical Models)
Are you really into Graphical Models?
Describing Visual Scenes using Transformed Dirichlet Processes. E. Sudderth, A. Torralba, W. Freeman, and A. Willsky. NIPS, Dec. 2005.
References
L. Fei-Fei and P. Perona. A Bayesian Hierarchical Model for Learning Natural Scene Categories. CVPR 2005.
E. Sudderth, A. Torralba, W. Freeman, and A. Willsky. Describing Visual Scenes using Transformed Dirichlet Processes. NIPS 2005.
J. Sivic, B. Russell, A. Efros, A. Zisserman, and W. Freeman. Discovering Objects and their Location in Images. ICCV 2005.
D. Blei, A. Ng, and M. Jordan. Latent Dirichlet Allocation. JMLR 2003.
T. Hofmann. Unsupervised Learning by Probabilistic Latent Semantic Analysis. Machine Learning 2001.