Efficient Image Search and Retrieval using Compact Binary Codes

Rob Fergus (NYU)
Antonio Torralba (MIT)
Yair Weiss (Hebrew U.)
Large scale image search
• Internet contains many billions of images
• How can we search them, based on visual content?
• The Challenge:
– Need a way of measuring similarity between images
– Needs to scale to the Internet
Existing approaches to Content-Based Image Retrieval
• Focus on scaling rather than understanding the image
• Variety of simple/hand-designed cues:
– Color and/or Texture histograms, Shape, PCA, etc.
• Various distance metrics
– Earth Mover's Distance (Rubner et al. '98)
• Most recognition approaches are slow (~1 sec/image)
Our Approach
• Learn the metric from training data
• Use compact binary codes for speed
DO BOTH TOGETHER
Large scale image/video search
• Representation must fit in memory (disk too slow)
• Facebook has ~10 billion images (10^10)
• PC has ~10 Gbytes of memory (10^11 bits)
→ Budget of 10^1 bits/image
• YouTube has ~ a trillion video frames (10^12)
• Big cluster of PCs has ~10 Tbytes (10^14 bits)
→ Budget of 10^2 bits/frame
Binary codes for images
• Want images with similar content to have similar binary codes
• Use Hamming distance between codes
– Number of bit flips
– E.g.: Ham_Dist(10001010, 10001110) = 1; Ham_Dist(10001010, 11101110) = 3 (see the sketch below)
• Semantic Hashing [Salakhutdinov & Hinton, 2007]
– Text documents
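A minimal sketch of the Hamming-distance computation used above (plain Python; the function names are just for illustration):

    def hamming_distance(a, b):
        # Number of bit positions at which two equal-length bit strings differ.
        assert len(a) == len(b)
        return sum(x != y for x, y in zip(a, b))

    # The examples from the slide:
    print(hamming_distance("10001010", "10001110"))  # 1
    print(hamming_distance("10001010", "11101110"))  # 3

    # For codes packed into integers, XOR plus popcount does the same thing:
    def hamming_distance_int(a, b):
        return bin(a ^ b).count("1")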
Semantic Hashing
[Salakhutdinov & Hinton, 2007] for text documents
[Diagram: a query image is passed through the semantic hash function to produce a binary code, used directly as an address; semantically similar images in the database map to nearby addresses in the address space. Quite different to a (conventional) randomizing hash.]
Semantic Hashing
• Each image code is a memory address
• Find neighbors by exploring the Hamming ball around the query address (see the sketch below)
• Lookup time is independent of # of data points
• Depends on radius of ball & length of code: roughly (code length choose radius) lookups
[Diagram: address space showing images in the database and the Hamming ball around the query address.]
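A minimal sketch of the lookup idea, assuming codes are stored as integer addresses in a Python dict mapping address → list of image ids (the helper names are hypothetical):

    from itertools import combinations

    def hamming_ball(address, code_length, radius):
        # Yield every address within the given Hamming radius of `address`.
        yield address
        for r in range(1, radius + 1):
            for bits in combinations(range(code_length), r):
                flipped = address
                for b in bits:
                    flipped ^= (1 << b)
                yield flipped

    def semantic_hash_lookup(query_code, table, code_length, radius=2):
        # Visit every address in the Hamming ball and collect the images stored there.
        results = []
        for addr in hamming_ball(query_code, code_length, radius):
            results.extend(table.get(addr, []))
        return results

The number of addresses visited is the sum over r of (code length choose r), so the cost depends only on the code length and radius, not on the number of images in the database.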
Code requirements
• Similar images → Similar codes
• Very compact (<10^2 bits/image)
• Fast to compute
• Does NOT have to reconstruct image
Three approaches:
1. Locality Sensitive Hashing (LSH)
2. Boosting
3. Restricted Boltzmann Machines (RBM’s)
Input Image representation: Gist vectors
• Pixels not a convenient representation
• Use Gist descriptor instead (Oliva & Torralba, 2001)
• 512 dimensions/image (real-valued → 16,384 bits)
• L2 distance btw. Gist vectors not a bad substitute for human perceptual distance
• NO COLOR INFORMATION
Oliva & Torralba, IJCV 2001
1. Locality Sensitive Hashing
• Gionis, A. & Indyk, P. & Motwani, R. (1999)
• Take random projections of data
• Quantize each projection with few bits
[Diagram: the Gist descriptor is projected onto several random directions and each projection is quantized to a bit, e.g. 1 0 1 … 0; see the sketch below.]
No learning involved
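A rough sketch of LSH code generation in this spirit (NumPy; one bit per random projection for simplicity, thresholding at zero on mean-centered data, which is an assumption rather than the exact quantization used in the talk):

    import numpy as np

    def make_lsh_hash(dim, n_bits, seed=0):
        # Draw the random projection directions once; there is no learning.
        directions = np.random.default_rng(seed).standard_normal((n_bits, dim))
        def lsh_code(x):
            # One bit per projection: 1 if the projection is positive, else 0.
            return (directions @ x > 0).astype(np.uint8)
        return lsh_code

    # Usage: hash a 512-dimensional Gist descriptor to 30 bits.
    lsh_code = make_lsh_hash(dim=512, n_bits=30)
    gist = np.random.rand(512)            # stand-in for a real Gist vector
    code = lsh_code(gist - gist.mean())   # 30-element 0/1 array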
2. Boosting
• Modified form of BoostSSC
[Shaknarovich, Viola & Darrell, 2003]
• Positive examples are pairs of similar images
• Negative examples are pairs of unrelated images
[Diagram: each bit of the code is the output of one weak classifier.]
• Learn threshold & dimension for each bit (weak classifier); see the sketch below
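A simplified sketch of how one bit's weak classifier could be selected, in the spirit of the description above (not the exact BoostSSC procedure): pick the (dimension, threshold) pair whose single-bit output most often agrees, under the current pair weights, for similar pairs and disagrees for dissimilar pairs. NumPy; the data layout is an assumption.

    import numpy as np

    def best_stump(x1, x2, labels, weights, thresholds):
        # x1, x2: (n_pairs, dim) descriptors of each pair; labels: +1 similar, -1 dissimilar.
        best_dim, best_thr, best_score = None, None, -np.inf
        for d in range(x1.shape[1]):
            for t in thresholds:
                b1 = x1[:, d] > t
                b2 = x2[:, d] > t
                agree = np.where(b1 == b2, 1.0, -1.0)      # +1 if the pair gets the same bit
                score = np.sum(weights * labels * agree)   # weighted agreement with labels
                if score > best_score:
                    best_dim, best_thr, best_score = d, t, score
        return best_dim, best_thr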
3. Restricted Boltzmann Machine (RBM)
• Type of Deep Belief Network
• Hinton & Salakhutdinov, Science 2006
[Diagram: a single RBM layer, with visible units connected to hidden units by symmetric weights W.]
• Attempts to reconstruct input at visible layer from activation of hidden layer (see the sketch below)
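A bare-bones sketch of a single binary RBM layer trained with one step of contrastive divergence (CD-1), in the spirit of Hinton & Salakhutdinov 2006; NumPy, with the learning rate and sampling details simplified.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def cd1_step(v0, W, b_vis, b_hid, lr=0.01, rng=np.random.default_rng(0)):
        # One contrastive-divergence update on a batch of visible vectors v0 (batch x n_vis).
        h0_prob = sigmoid(v0 @ W + b_hid)                         # up: hidden probabilities
        h0 = (rng.random(h0_prob.shape) < h0_prob).astype(float)  # binary hidden sample
        v1 = sigmoid(h0 @ W.T + b_vis)                            # down: reconstruct visible layer
        h1_prob = sigmoid(v1 @ W + b_hid)                         # up again
        # Update from the difference between data and reconstruction statistics.
        W += lr * (v0.T @ h0_prob - v1.T @ h1_prob) / len(v0)
        b_vis += lr * (v0 - v1).mean(axis=0)
        b_hid += lr * (h0_prob - h1_prob).mean(axis=0)
        return W, b_vis, b_hid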
Multi-Layer RBM: non-linear dimensionality reduction
[Architecture, bottom to top: input Gist vector (512 dimensions, linear units at the first layer) → Layer 1 (w1, 512 units) → Layer 2 (w2, 256 units) → Layer 3 (w3, N units) → output binary code (N dimensions).]
Training RBM models
1st Phase: Pre-training
– Unsupervised
– Can use unlabeled data (unlimited quantity)
– Learn parameters greedily per layer
– Gets them to right ballpark
2nd Phase: Fine-tuning
– Supervised
– Requires labeled data (limited quantity)
– Back propagate gradients of chosen error function
– Moves parameters to local minimum
Greedy pre-training (Unsupervised)
• Layer 1 (w1): trained on the input Gist vector (512 real dimensions)
• Layer 2 (w2): trained on activations of hidden units from layer 1 (512 binary dimensions)
• Layer 3 (w3): trained on activations of hidden units from layer 2 (256 binary dimensions); see the sketch below
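The greedy scheme above, sketched as a loop. This is a minimal sketch: `train_rbm` is a stripped-down mean-field CD-1 trainer (biases and sampling omitted for brevity), and the data and code length are stand-ins rather than the talk's actual setup.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def train_rbm(data, n_hid, epochs=10, lr=0.01, seed=0):
        # Minimal unsupervised CD-1 training of one RBM layer (biases omitted).
        rng = np.random.default_rng(seed)
        W = 0.01 * rng.standard_normal((data.shape[1], n_hid))
        for _ in range(epochs):
            h0 = sigmoid(data @ W)      # up
            v1 = sigmoid(h0 @ W.T)      # down: reconstruction
            h1 = sigmoid(v1 @ W)        # up again
            W += lr * (data.T @ h0 - v1.T @ h1) / len(data)
        return W

    # Layer sizes from the slides: 512 (Gist) -> 512 -> 256 -> N (here N = 30 as an example).
    layer_sizes = [512, 512, 256, 30]
    data = np.random.rand(1000, 512)    # stand-in for real training Gist vectors
    weights = []
    for n_hid in layer_sizes[1:]:
        W = train_rbm(data, n_hid)      # train one layer greedily, then freeze it
        weights.append(W)
        data = sigmoid(data @ W)        # its activations become the next layer's input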
Fine-tuning: back-propagation of Neighborhood Components Analysis objective
[Diagram: the same 512 → 512 → 256 → N stack; each weight matrix is adjusted (w1 + Δw1, w2 + Δw2, w3 + Δw3) by back-propagating the NCA objective from the output binary code (N dimensions) down to the input Gist vector (512 real dimensions).]
Neighborhood Components Analysis
• Goldberger, Roweis, Salakhutdinov & Hinton, NIPS 2004
• Tries to preserve neighborhood structure of input space
– Assumes this structure is given (will explain later)
[Toy example with 2 classes & N=2 units at top of network: points plotted in output space, where each coordinate is the activation probability of one unit.]
Neighborhood Components Analysis
• Adjust network parameters (weights and biases)
to move:
– Points of SAME class closer
– Points of DIFFERENT class away
Points close in input space (Gist) will be close in output code space
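For reference, the NCA soft-neighbor objective from Goldberger et al. 2004, written for the network mapping f(·) (a minimal statement; the talk's exact weighting of neighbors may differ):

    p_{ij} = \frac{\exp\left(-\lVert f(x_i) - f(x_j)\rVert^2\right)}
                  {\sum_{k \neq i} \exp\left(-\lVert f(x_i) - f(x_k)\rVert^2\right)},
    \qquad p_{ii} = 0

    \text{maximize} \quad \sum_i \sum_{j:\, c_j = c_i} p_{ij}

Back-propagating the gradient of this objective through the network pulls points of the same class together in code space and pushes points of different classes apart, which is the behaviour described in the bullets above.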
Simple Binarization Strategy
• Set a threshold for each output unit, e.g. use the median
• Activations above the threshold become 1, below become 0 (see the sketch below)
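A minimal sketch of this binarization step (NumPy; `activations` stands for the (n_images, N) matrix of top-layer unit activations and is an assumed name, filled with random values here):

    import numpy as np

    activations = np.random.rand(1000, 30)        # stand-in for real top-layer activations

    thresholds = np.median(activations, axis=0)   # per-unit median over the training set
    binary_codes = (activations > thresholds).astype(np.uint8)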
Overall Query Scheme
Query image → Compute Gist descriptor (~1ms in Matlab) → RBM → binary code (<10μs) → semantic hash lookup → retrieved images (<1ms)
Retrieval Experiments
Test set 1: LabelMe
• 22,000 images (20,000 train | 2,000 test)
• Ground truth segmentations for all
• Can define ground truth distance btw. images
using these segmentations
Defining ground truth
• Boosting and NCA back-propagation require
ground truth distance between images
• Define this using labeled images from LabelMe
Defining ground truth
• Pyramid Match (Lazebnik et al. 2006, Grauman & Darrell 2005)
[Diagram: two images represented by their object-label maps (sky, tree, building, car, road), compared with label histograms at multiple spatial resolutions.]
Varying spatial resolution to capture approximate spatial correspondence
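A rough sketch of a pyramid-match style similarity, under the assumption that each image is summarised by per-cell object-label histograms at several grid resolutions; the weights and exact form are illustrative, not the precise ground-truth measure used in the talk.

    import numpy as np

    def pyramid_match(hists_a, hists_b, levels=(1, 2, 4)):
        # hists_x[l]: (l*l, n_labels) label histograms over an l x l grid of image cells.
        score = 0.0
        for i, l in enumerate(levels):
            weight = 2.0 ** (i - (len(levels) - 1))              # finer grids weighted more heavily
            overlap = np.minimum(hists_a[l], hists_b[l]).sum()   # histogram intersection
            score += weight * overlap
        return score

    # Usage with random stand-in histograms over 5 labels (sky, tree, building, car, road):
    make = lambda: {l: np.random.rand(l * l, 5) for l in (1, 2, 4)}
    print(pyramid_match(make(), make()))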
Examples of LabelMe retrieval
• 12 closest neighbors under different distance metrics
[Plots: LabelMe retrieval. % of 50 true neighbors in retrieval set vs. size of retrieval set (0 to 20,000), and % of 50 true neighbors in first 500 retrieved vs. number of bits.]
Test set 2: Web images
• 12.9 million images
• Collected from Internet
• No labels, so use Euclidean distance between
Gist vectors as ground truth distance
[Plots: Web images retrieval. % of 50 true neighbors in retrieval set vs. size of retrieval set.]
Examples of Web retrieval
• 12 neighbors using different distance metrics
Retrieval Timings
Summary
• Explored various approaches to learning binary
codes for hashing-based retrieval
– Very quick with performance comparable to complex
descriptors
• More recent work on binarization
– Spectral Hashing (Weiss, Torralba, Fergus NIPS 2009)