Efficient Image Search and
Retrieval using Compact Binary
Codes
Rob Fergus (NYU)
Antonio Torralba (MIT)
Yair Weiss (Hebrew U.)
Large scale image search
Internet contains many billions of images
How can we search them, based on visual content?
The Challenge:
– Need way of measuring similarity between images
– Needs to scale to Internet
Existing approaches to
Content-Based Image Retrieval
• Focus on scaling rather than understanding the image
• Variety of simple/hand-designed cues:
– Color and/or Texture histograms, Shape, PCA, etc.
• Various distance metrics
– Earth Mover's Distance (Rubner et al. '98)
• Most recognition approaches slow (~1sec/image)
Our Approach
• Learn the metric from training data
• Use compact binary codes for speed
DO BOTH TOGETHER
Large scale image/video search
• Representation must fit in memory (disk too slow)
• Facebook has ~10 billion images (10^10)
• A PC has ~10 GBytes of memory (10^11 bits)
→ Budget of 10^1 bits/image
• YouTube has ~1 trillion video frames (10^12)
• A big cluster of PCs has ~10 TBytes (10^14 bits)
→ Budget of 10^2 bits/frame
Binary codes for images
• Want images with similar content to have similar binary codes
• Use Hamming distance between codes
– Number of bit flips
– E.g.: Ham_Dist(10001010, 10001110) = 1
Ham_Dist(10001010, 11101110) = 3
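The bit-flip count can be computed with an XOR and a popcount; a minimal Python sketch (illustrative, not the authors' code) reproducing the slide's examples:

```python
def ham_dist(a: int, b: int) -> int:
    """Hamming distance = number of differing bits, via XOR + popcount."""
    return bin(a ^ b).count("1")

# The two examples from the slide:
assert ham_dist(0b10001010, 0b10001110) == 1
assert ham_dist(0b10001010, 0b11101110) == 3
```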
• Semantic Hashing [Salakhutdinov & Hinton, 2007]
– Text documents
Semantic Hashing
[Salakhutdinov & Hinton, 2007] for text documents
[Figure: a semantic hash function maps an image to a binary code, used directly as an address. Semantically similar images in the database land near the query address. Quite different from a (conventional) randomizing hash.]
Semantic Hashing
• Each image code is a memory address
• Find neighbors by exploring Hamming ball around query address
• Lookup time is independent of # of data points
• Depends on radius of ball & length of code:
[Figure: address space holding the images in the database; a Hamming ball of chosen radius is explored around the query address. Choose radius vs. code length.]
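A sketch of the lookup, assuming a hypothetical in-memory table keyed by binary code; the number of probes depends only on code length and radius (sum of C(n_bits, r) for r up to the radius), not on the number of images stored:

```python
from itertools import combinations

def hamming_ball(address: int, n_bits: int, radius: int):
    """Yield every address within the given Hamming radius of `address`."""
    yield address
    for r in range(1, radius + 1):
        for bits in combinations(range(n_bits), r):
            flipped = address
            for b in bits:
                flipped ^= 1 << b  # flip one chosen bit
            yield flipped

# Hypothetical tiny database: code -> list of image ids.
table = {0b1011: ["img_7"], 0b1010: ["img_3"]}
hits = [img for a in hamming_ball(0b1010, n_bits=4, radius=1)
        for img in table.get(a, [])]
```

With a 4-bit code and radius 1 this probes 5 addresses regardless of how many images the table holds.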
Code requirements
• Similar images → Similar codes
• Very compact (<10^2 bits/image)
• Fast to compute
• Does NOT have to reconstruct image
Three approaches:
1. Locality Sensitive Hashing (LSH)
2. Boosting
3. Restricted Boltzmann Machines (RBMs)
Input Image representation:
Gist vectors
• Pixels not a convenient representation
• Use Gist descriptor instead (Oliva & Torralba, 2001)
• 512 dimensions/image (real-valued: 512 × 32 bits = 16,384 bits)
• L2 distance btw. Gist vectors not bad substitute for human perceptual distance
NO COLOR
INFORMATION
Oliva & Torralba, IJCV 2001
1. Locality Sensitive Hashing
• Gionis, A. & Indyk, P. & Motwani, R. (1999)
• Take random projections of data
• Quantize each projection with few bits
[Figure: the Gist descriptor is projected onto random directions; each projection is quantized with a few bits, e.g. 0 1 0 1]
No learning involved
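A minimal sketch of LSH codes, assuming a hypothetical 512-D Gist vector and the simplest quantization (one sign bit per random projection):

```python
import numpy as np

rng = np.random.default_rng(0)

def lsh_code(gist: np.ndarray, projections: np.ndarray) -> np.ndarray:
    """One bit per random projection: sign of the dot product."""
    return (gist @ projections > 0).astype(np.uint8)

# Random projections are fixed once, with no learning involved.
projections = rng.standard_normal((512, 32))   # 512-D input -> 32-bit code
gist = rng.standard_normal(512)                # stand-in for a real Gist vector
code = lsh_code(gist, projections)             # 32 bits in {0, 1}
```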
2. Boosting
• Modified form of BoostSSC
[Shakhnarovich, Viola & Darrell, 2003]
• Positive examples are pairs of similar images
• Negative examples are pairs of unrelated images
Learn threshold & dimension for each bit (weak classifier)
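Each weak classifier reduces to a single threshold test on one Gist dimension; the `dims` and `thresholds` below are made-up stand-ins for values boosting would learn:

```python
import numpy as np

def boosted_code(gist: np.ndarray, dims: np.ndarray,
                 thresholds: np.ndarray) -> np.ndarray:
    """Each learned weak classifier contributes one bit:
    compare a single Gist dimension against its learned threshold."""
    return (gist[dims] > thresholds).astype(np.uint8)

# Hypothetical learned parameters for a 4-bit code:
dims = np.array([17, 5, 301, 42])              # dimension each bit inspects
thresholds = np.array([0.1, -0.3, 0.0, 0.25])  # per-bit thresholds
gist = np.random.default_rng(0).standard_normal(512)
code = boosted_code(gist, dims, thresholds)
```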
3. Restricted Boltzmann Machine (RBM)
• Type of Deep Belief Network
• Hinton & Salakhutdinov, Science 2006
[Figure: a single RBM layer, visible units connected to hidden units by symmetric weights W]
• Attempts to reconstruct input at visible layer from activation of hidden layer
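A shape-level sketch of one RBM layer's forward and reconstruction passes; the weights are random placeholders where trained values would be used:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
W = rng.standard_normal((512, 256)) * 0.01  # symmetric weights, visible x hidden
b_h = np.zeros(256)                         # hidden biases
b_v = np.zeros(512)                         # visible biases

v = rng.standard_normal(512)                # visible units (e.g. a Gist vector)
h = sigmoid(v @ W + b_h)                    # hidden activation probabilities
v_recon = sigmoid(h @ W.T + b_v)            # reconstruction at the visible layer
```

The same weight matrix `W` is used in both directions, which is what "symmetric weights" means on the slide.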
Multi-Layer RBM: non-linear dimensionality reduction
Input Gist vector (512 dimensions), linear units at first layer
Layer 1: 512 → 512 (weights w1)
Layer 2: 512 → 256 (weights w2)
Layer 3: 256 → N (weights w3)
Output binary code (N dimensions)
Training RBM models
1st Phase: Pre-training
– Unsupervised
– Can use unlabeled data (unlimited quantity)
– Learn parameters greedily per layer
– Gets them to the right ballpark

2nd Phase: Fine-tuning
– Supervised
– Requires labeled data (limited quantity)
– Back-propagate gradients of chosen error function
– Moves parameters to a local minimum
Greedy pre-training (Unsupervised)
Layer 1: Input Gist vector (512 real dimensions) → 512 hidden units (weights w1)
Layer 2: activations of hidden units from layer 1 (512 binary dimensions) → 256 hidden units (weights w2)
Layer 3: activations of hidden units from layer 2 (256 binary dimensions) → N hidden units (weights w3)
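The three greedily pre-trained layers compose into a single encoder matching the slide's 512 → 512 → 256 → N architecture; weights are random here purely to illustrate the shapes (real values come from per-layer contrastive-divergence training):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
N = 32                                      # hypothetical code length
w1 = rng.standard_normal((512, 512)) * 0.01
w2 = rng.standard_normal((512, 256)) * 0.01
w3 = rng.standard_normal((256, N)) * 0.01

gist = rng.standard_normal(512)  # real-valued input (linear units at layer 1)
h1 = sigmoid(gist @ w1)          # layer 1: 512 hidden units
h2 = sigmoid(h1 @ w2)            # layer 2: 256 hidden units
code = sigmoid(h2 @ w3)          # layer 3: N output units, binarized later
```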
Fine-tuning: back-propagation of
Neighborhood Components Analysis objective
Input Gist vector (512 real dimensions) → Layer 1 (w1 + Δw1) → Layer 2 (w2 + Δw2) → Layer 3 (w3 + Δw3) → Output binary code (N dimensions)
Neighborhood Components Analysis
• Goldberger, Roweis, Salakhutdinov & Hinton, NIPS 2004
• Tries to preserve neighborhood structure of input space
– Assumes this structure is given (will explain later)
Toy example with 2 classes & N=2 units at top of network:
Points in output space (coordinate is activation probability of unit)
Neighborhood Components Analysis
• Adjust network parameters (weights and biases) to move:
– Points of SAME class closer
– Points of DIFFERENT class away
Points close in input space (Gist) will be close in output code space
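The NCA objective can be sketched as the expected number of points whose stochastic nearest neighbour (softmax over negative squared distances in code space) shares their class; this toy implementation is illustrative, not the authors' code:

```python
import numpy as np

def nca_objective(Y: np.ndarray, labels: np.ndarray) -> float:
    """Expected count of points whose soft nearest neighbour has the same class."""
    d2 = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(-1)  # pairwise sq. distances
    np.fill_diagonal(d2, np.inf)       # a point is never its own neighbour
    p = np.exp(-d2)
    p /= p.sum(axis=1, keepdims=True)  # stochastic neighbour probabilities
    same = labels[:, None] == labels[None, :]
    return float((p * same).sum())     # maximised during fine-tuning

# Toy example with 2 classes and N=2 output units:
Y = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0]])
labels = np.array([0, 0, 1, 1])
score = nca_objective(Y, labels)  # approaches 4 as the classes separate
```

Back-propagating the gradient of this score through the network is what pulls same-class points together and pushes different-class points apart.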
Simple Binarization Strategy
• Set a threshold per output unit, e.g. use the median
[Figure: activations above the threshold map to 1, below to 0]
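Median thresholding, sketched on random stand-in activations; using the per-unit median over the database makes each bit fire for exactly half the images:

```python
import numpy as np

rng = np.random.default_rng(0)
activations = rng.random((10_000, 32))       # hypothetical N=32 unit outputs
thresholds = np.median(activations, axis=0)  # one threshold per bit
codes = (activations > thresholds).astype(np.uint8)
```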
Overall Query Scheme
Query image → Compute Gist descriptor (~1 ms, in Matlab) → RBM → binary code (<1 ms) → Semantic hash lookup (<10 μs) → Retrieved images
Test set 1: LabelMe
• 22,000 images (20,000 train | 2,000 test)
• Ground truth segmentations for all
• Can define ground truth distance btw. images using these segmentations
Defining ground truth
• Boosting and NCA back-propagation require ground truth distance between images
• Define this using labeled images from LabelMe
Defining ground truth
• Pyramid Match (Lazebnik et al. 2006, Grauman & Darrell 2005)
[Figure: segmentations of matched scenes labeled with Building, Car, Sky, Tree, Road, compared at multiple spatial resolutions]
Varying spatial resolution to capture approximate spatial correspondence
Examples of LabelMe retrieval
• 12 closest neighbors under different distance metrics
LabelMe Retrieval
[Plots: retrieval performance vs. size of retrieval set (0 to 20,000) and vs. number of bits]
Test set 2: Web images
• 12.9 million images
• Collected from Internet
• No labels, so use Euclidean distance between
Gist vectors as ground truth distance
Web images retrieval
[Plots: retrieval performance vs. size of retrieval set]
Examples of Web retrieval
• 12 neighbors using different distance metrics
Retrieval Timings
Summary
• Explored various approaches to learning binary codes for hashing-based retrieval
– Very quick with performance comparable to complex descriptors
• More recent work on binarization
– Spectral Hashing (Weiss, Torralba, Fergus NIPS 2009)