Efficient Image Search and Retrieval using Compact Binary Codes


Rob Fergus (NYU)

Antonio Torralba (MIT)

Yair Weiss (Hebrew U.)

Large scale image search

Internet contains many billions of images

How can we search them, based on visual content?

The Challenge:

– Need a way of measuring similarity between images

– Needs to scale to the Internet

Existing approaches to Content-Based Image Retrieval

• Focus on scaling rather than on understanding the image

• Variety of simple/hand-designed cues:

– Color and/or Texture histograms, Shape, PCA, etc.

• Various distance metrics

– Earth Mover's Distance (Rubner et al. '98)

• Most recognition approaches are slow (~1 sec/image)

Our Approach

• Learn the metric from training data

• Use compact binary codes for speed

DO BOTH TOGETHER

Large scale image/video search

• Representation must fit in memory (disk too slow)

• Facebook has ~10 billion images (10^10)

• A PC has ~10 Gbytes of memory (10^11 bits)

⇒ Budget of 10^1 bits/image

• YouTube has ~1 trillion video frames (10^12)

• A big cluster of PCs has ~10 Tbytes (10^14 bits)

⇒ Budget of 10^2 bits/frame

Binary codes for images

• Want images with similar content to have similar binary codes

• Use Hamming distance between codes

– Number of bit flips

– E.g.: Ham_Dist(10001010, 10001110) = 1

  Ham_Dist(10001010, 11101110) = 3

  (see the code sketch after this slide)

• Semantic Hashing [Salakhutdinov & Hinton, 2007]

– Text documents
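A minimal sketch of the Hamming-distance computation above (not the authors' code), using XOR and a popcount on integer-packed codes:

```python
def hamming_distance(code_a: int, code_b: int) -> int:
    """Number of bit positions in which the two codes differ."""
    # int.bit_count() needs Python 3.10+; use bin(code_a ^ code_b).count("1") on older versions
    return (code_a ^ code_b).bit_count()

# The examples from the slide:
assert hamming_distance(0b10001010, 0b10001110) == 1
assert hamming_distance(0b10001010, 0b11101110) == 3
```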

Semantic Hashing

[Salakhutdinov & Hinton, 2007] for text documents

[Figure: query image → semantic hash function → binary code, used as an address; semantically similar database images map to nearby addresses in the address space. Quite different to a (conventional) randomizing hash.]

Semantic Hashing

• Each image code is a memory address
• Find neighbors by exploring the Hamming ball around the query address
• Lookup time is independent of the number of data points
• Depends on the radius of the ball & the length of the code (choose the radius for a given code length)

[Figure: address space of database images, with a Hamming ball of the chosen radius around the query address; see the lookup sketch below]
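A minimal sketch of this lookup, assuming the database is stored as a Python dict keyed by binary code (that layout is an assumption, not the authors' implementation). The number of probed addresses depends only on the code length and the ball radius, not on the number of images:

```python
from itertools import combinations

def hamming_ball(code: int, n_bits: int, radius: int):
    """Yield every code within the given Hamming radius of `code`."""
    for r in range(radius + 1):
        for positions in combinations(range(n_bits), r):
            flipped = code
            for p in positions:
                flipped ^= 1 << p          # flip bit p
            yield flipped

def lookup(table: dict, query_code: int, n_bits: int, radius: int = 2):
    """Collect database images stored at any address inside the Hamming ball."""
    results = []
    for address in hamming_ball(query_code, n_bits, radius):
        results.extend(table.get(address, []))
    return results
```

The number of probes is the sum of C(n_bits, r) for r up to the radius, which is why both the code length and the radius must stay small.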

Code requirements

• Similar images ⇒ similar codes

• Very compact (<10^2 bits/image)

• Fast to compute

• Does NOT have to reconstruct image

Three approaches:

1. Locality Sensitive Hashing (LSH)

2. Boosting

3. Restricted Boltzmann Machines (RBMs)

Input image representation: Gist vectors

• Pixels not a convenient representation

• Use Gist descriptor instead (Oliva & Torralba, 2001)

• 512 dimensions/image (real-valued ⇒ 16,384 bits)

• L2 distance between Gist vectors is not a bad substitute for human perceptual distance

• No color information is used

Oliva & Torralba, IJCV 2001

1. Locality Sensitive Hashing

• Gionis, A. & Indyk, P. & Motwani, R. (1999)

• Take random projections of data

• Quantize each projection with a few bits

[Figure: the Gist descriptor is projected onto random directions and each projection is quantized to a bit, giving a short binary code such as 0101…]

No learning involved
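A minimal sketch of such a code (illustrative parameters, not from the paper), using one bit per random projection, thresholded at zero:

```python
import numpy as np

def lsh_code(gist: np.ndarray, projections: np.ndarray) -> np.ndarray:
    """One bit per random projection: the sign of the projected value."""
    return (gist @ projections > 0).astype(np.uint8)

rng = np.random.default_rng(0)
n_bits = 32                                         # illustrative code length
projections = rng.standard_normal((512, n_bits))    # 512-D Gist -> 32-bit code
code = lsh_code(rng.standard_normal(512), projections)
```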

2. Boosting

• Modified form of BoostSSC

[Shakhnarovich, Viola & Darrell, 2003]

• Positive examples are pairs of similar images

• Negative examples are pairs of unrelated images

Learn a threshold & dimension for each bit (weak classifier)

[Figure: each weak classifier thresholds a single Gist dimension, producing a 0/1 bit]
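A minimal sketch of how such a learned code could be applied; the (dimension, threshold) pairs shown are made up, and the BoostSSC training loop that selects them is omitted:

```python
import numpy as np

def boosted_code(gist: np.ndarray, dims: np.ndarray, thresholds: np.ndarray) -> np.ndarray:
    """Each bit is one weak classifier: does the chosen Gist dimension exceed its threshold?"""
    return (gist[dims] > thresholds).astype(np.uint8)

# e.g. a 4-bit code from 4 learned (dimension, threshold) pairs -- illustrative values only
dims = np.array([17, 203, 64, 410])
thresholds = np.array([0.12, -0.30, 0.05, 0.40])
code = boosted_code(np.random.randn(512), dims, thresholds)
```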

3. Restricted Boltzmann Machine (RBM)

• Type of Deep Belief Network

• Hinton & Salakhutdinov, Science 2006

[Figure: a single RBM layer, with visible units connected to hidden units through symmetric weights W]

• Attempts to reconstruct input at visible layer from activation of hidden layer

Multi-Layer RBM: non-linear dimensionality reduction

Input Gist vector (512 dimensions; linear units at the first layer)

→ Layer 1: 512 units (weights w1)

→ Layer 2: 256 units (weights w2)

→ Layer 3: N units (weights w3)

→ Output binary code (N dimensions)
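A minimal sketch of a forward pass through this stack; logistic units are assumed for every layer here, and the trained weights are replaced by random placeholders:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def encode(gist, weights, biases):
    """Deterministic pass 512 -> 512 -> 256 -> N, returning activation probabilities."""
    h = gist
    for W, b in zip(weights, biases):
        h = sigmoid(h @ W + b)
    return h                            # binarized afterwards (see the binarization slide)

N = 30                                  # illustrative code length
rng = np.random.default_rng(0)
shapes = [(512, 512), (512, 256), (256, N)]
weights = [0.01 * rng.standard_normal(s) for s in shapes]   # placeholders for trained w1, w2, w3
biases = [np.zeros(s[1]) for s in shapes]
code_probs = encode(rng.standard_normal(512), weights, biases)
```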

Training RBM models

1st Phase: Pre-training

– Unsupervised
– Can use unlabeled data (unlimited quantity)
– Learn parameters greedily, one layer at a time
– Gets them into the right ballpark

2nd Phase: Fine-tuning

– Supervised
– Requires labeled data (limited quantity)
– Back-propagate gradients of a chosen error function
– Moves parameters to a local minimum

Greedy pre-training (Unsupervised)

• Layer 1: input Gist vector (512 real dimensions) → 512 hidden units, weights w1
• Layer 2: activations of hidden units from layer 1 (512 binary dimensions) → 256 hidden units, weights w2
• Layer 3: activations of hidden units from layer 2 (256 binary dimensions) → N hidden units, weights w3
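A minimal sketch of one contrastive-divergence (CD-1) update for a single binary-binary RBM layer, the kind of step used in greedy layer-wise pre-training; note the first layer in this work has real-valued (linear) visible units, which this sketch does not model:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, b_vis, b_hid, lr=0.01, rng=np.random.default_rng(0)):
    """One CD-1 parameter update on a batch of visible vectors v0 (shape: batch x n_vis)."""
    # Up: hidden probabilities and samples given the data
    p_h0 = sigmoid(v0 @ W + b_hid)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
    # Down-up: one step of alternating Gibbs sampling
    p_v1 = sigmoid(h0 @ W.T + b_vis)
    p_h1 = sigmoid(p_v1 @ W + b_hid)
    # Approximate gradient: data statistics minus reconstruction statistics
    n = v0.shape[0]
    W += lr * (v0.T @ p_h0 - p_v1.T @ p_h1) / n
    b_vis += lr * (v0 - p_v1).mean(axis=0)
    b_hid += lr * (p_h0 - p_h1).mean(axis=0)
    return W, b_vis, b_hid
```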

Fine-tuning: back-propagation of Neighborhood Components Analysis objective

Input Gist vector (512 real dimensions)

→ Layer 1: 512 units (w1 + Δw1)

→ Layer 2: 256 units (w2 + Δw2)

→ Layer 3: N units (w3 + Δw3)

→ Output binary code (N dimensions)

All layers' weights are updated during fine-tuning.

Neighborhood Components Analysis

• Goldberger, Roweis, Salakhutdinov & Hinton, NIPS 2004

• Tries to preserve neighborhood structure of input space

– Assumes this structure is given (will explain later)

[Figure: toy example with 2 classes & N=2 units at the top of the network, showing points in output space, where each coordinate is the activation probability of a unit]

Neighborhood Components Analysis

• Adjust network parameters (weights and biases) to move:

– Points of SAME class closer

– Points of DIFFERENT class away


Points close in input space (Gist) will be close in output code space
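For reference, the standard NCA objective from Goldberger et al. (2004), written here for the network output f(x); this is the general form, and the exact variant used for fine-tuning may differ in detail:

```latex
% Probability that i picks j as its neighbor, and the objective to maximize
p_{ij} = \frac{\exp\left(-\lVert f(x_i) - f(x_j) \rVert^2\right)}
              {\sum_{k \neq i} \exp\left(-\lVert f(x_i) - f(x_k) \rVert^2\right)},
\qquad
O = \sum_i \sum_{j:\, c_j = c_i} p_{ij}
```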

Simple Binarization Strategy

• Set a threshold per output unit (e.g. use the median activation)
• Activations below the threshold become 0; those above become 1
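A minimal sketch of this binarization step, thresholding each output unit at its median activation over the database so that each bit fires for half the images:

```python
import numpy as np

def binarize(activations: np.ndarray) -> np.ndarray:
    """activations: (num_images, N) activation probabilities -> (num_images, N) bits."""
    thresholds = np.median(activations, axis=0)      # one threshold per output unit
    return (activations > thresholds).astype(np.uint8)
```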

Overall Query Scheme

Query image

→ Compute Gist descriptor (~1 ms in Matlab)

→ RBM → binary code (<1 ms)

→ Semantic hash lookup (<10 μs)

→ Retrieved images
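A minimal end-to-end sketch of this pipeline, composing the helpers from the earlier sketches (`encode`, the per-unit `thresholds`, `lookup`); `compute_gist` is a hypothetical Gist routine, and none of these names come from the authors' code:

```python
def query(image, compute_gist, weights, biases, thresholds, table, radius=2):
    """Query image -> Gist -> network -> binary code -> semantic-hash lookup."""
    gist = compute_gist(image)                    # ~1 ms   (hypothetical Gist routine)
    probs = encode(gist, weights, biases)         # <1 ms   forward pass (sketch above)
    bits = (probs > thresholds).astype(int)       #         per-unit binarization
    address = int("".join(map(str, bits)), 2)     #         the code doubles as a memory address
    return lookup(table, address, n_bits=len(bits), radius=radius)   # <10 microseconds
```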

Retrieval Experiments

Test set 1: LabelMe

• 22,000 images (20,000 train | 2,000 test)

• Ground truth segmentations for all

• Can define a ground-truth distance between images using these segmentations

Defining ground truth

• Boosting and NCA back-propagation require ground truth distance between images

• Define this using labeled images from LabelMe

Defining ground truth

• Pyramid Match (Lazebnik et al. 2006, Grauman & Darrell 2005)


[Figure: each image represented by histograms of its object labels (building, car, sky, tree, road, …) over grids of varying spatial resolution, to capture approximate spatial correspondence]
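A rough sketch in the spirit of this ground-truth measure (histogram intersection of object-label maps over increasingly fine grids); the label set, number of levels, and level weighting used in the paper may differ:

```python
import numpy as np

def cell_histograms(label_map, cells, n_labels):
    """Concatenate label histograms over a cells x cells grid of the label map."""
    H, W = label_map.shape
    hists = []
    for i in range(cells):
        for j in range(cells):
            block = label_map[i * H // cells:(i + 1) * H // cells,
                              j * W // cells:(j + 1) * W // cells]
            hists.append(np.bincount(block.ravel(), minlength=n_labels))
    return np.concatenate(hists).astype(float)

def pyramid_similarity(labels_a, labels_b, n_labels, levels=3):
    """Histogram intersection of two integer label maps over 1x1, 2x2, 4x4, ... grids."""
    total = 0.0
    for level in range(levels):
        ha = cell_histograms(labels_a, 2 ** level, n_labels)
        hb = cell_histograms(labels_b, 2 ** level, n_labels)
        weight = 2.0 ** (level - levels + 1)        # finer grids weighted more
        total += weight * np.minimum(ha, hb).sum()
    return total
```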

Examples of LabelMe retrieval

• 12 closest neighbors under different distance metrics

LabelMe Retrieval

[Plots: retrieval performance as a function of the size of the retrieval set (0–20,000 images) and of the number of bits]

Test set 2: Web images

• 12.9 million images

• Collected from Internet

• No labels, so use Euclidean distance between Gist vectors as the ground-truth distance
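A minimal sketch of this ground truth (the array layout is an assumption): the k nearest database images under Euclidean distance between Gist vectors:

```python
import numpy as np

def gist_neighbors(query_gist: np.ndarray, gists: np.ndarray, k: int = 50) -> np.ndarray:
    """Indices of the k database images closest to the query in Gist (L2) space."""
    dists = np.linalg.norm(gists - query_gist, axis=1)
    return np.argsort(dists)[:k]
```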

Web images retrieval

[Plots: retrieval performance as a function of the size of the retrieval set]

Examples of Web retrieval

• 12 neighbors using different distance metrics

Retrieval Timings

Summary

• Explored various approaches to learning binary codes for hashing-based retrieval

– Very quick with performance comparable to complex descriptors

• More recent work on binarization

– Spectral Hashing (Weiss, Torralba, Fergus NIPS 2009)
