ACCOUNTING FOR THE RELATIVE IMPORTANCE OF OBJECTS IN IMAGE RETRIEVAL

Sung Ju Hwang and Kristen Grauman
University of Texas at Austin
Image retrieval
Content-based retrieval from an image database
[Figure: a query image is matched against images 1…k in an image database.]
Relative importance of objects
Which image is more relevant to the query?
[Figure: a query image of a cow is compared against two database images, one dominated by the cow and one dominated by water, sky, birds, fence, and mud.]
Relative importance of objects
An image can contain many different objects, but some are more “important” than others.
Some objects are background.
Some objects are less salient.
Some objects are more prominent or perceptually define the scene.
[Figure: the same scene labeled with architecture, sky, mountain, bird, water, and cow, with different objects highlighted on each build.]
Our goal
Goal: Retrieve those images that share important objects with the query image.
[Figure: two candidate retrievals compared (“versus”) for the same query.]
How to learn a representation that accounts for this?
Idea: image tags as importance cue
The order in which a person assigns tags provides implicit cues about object importance to the scene.
TAGS: Cow, Birds, Architecture, Water, Sky
Learn this connection to improve cross-modal retrieval and CBIR.
Related work
Previous work using tagged images focuses on
the noun ↔ object correspondence.
Duygulu et al. 02, Berg et al. 04, Fergus et al. 05, Li et al. 09
Lavrenko et al. 2003, Monay & Gatica-Perez 2003, Barnard et al. 2004, Schroff et al. 2007, Gupta & Davis 2008, …
Related work building richer image representations
from “two-view” text+image data:
[Example text view paired with an image: “height: 6-11, weight: 235 lbs, position: forward, croatia, college: …”]
Gupta et al. 08, Hardoon et al. 04, Blaschko & Lampert 08
Bekkerman & Jeon 07, Qi et al. 09, Quack et al. 08, Quattoni et al. 07, Yakhnenko & Honavar 09, …
Approach overview:
Building the image database
[Figure: tagged training images, e.g. {Cow, Grass}, {Horse, Grass}, …, {Car, House, Grass, Sky}.]
• Tagged training images
• Extract visual and tag-based features
• Learn projections from each feature space into a common “semantic space”
Approach overview:
Retrieval from the database
[Figure: an untagged query image or a tag-list query (“Cow, Tree, Grass”) is projected into the semantic space and matched against the image database, returning retrieved images or a retrieved tag-list (“Cow, Tree”).]
• Image-to-image retrieval
• Image-to-tag auto annotation
• Tag-to-image retrieval
Dual-view semantic space
Visual features and tag-lists are two views generated by the same underlying concept.
[Figure: both views map into a shared semantic space.]
Learning mappings to semantic space
Canonical Correlation Analysis (CCA): choose projection directions that maximize the correlation of the two views projected from the same instance.
[Figure: View 1 and View 2 are projected into the semantic space, a new common feature space.]
Kernel Canonical Correlation Analysis
[Akaho 2001, Fyfe et al. 2001, Hardoon et al. 2004]

Linear CCA
Given paired data $\{(x_i, y_i)\}_{i=1}^N$, select directions $w_x, w_y$ so as to maximize the correlation of the projected views:
$$\rho = \max_{w_x, w_y} \frac{w_x^\top C_{xy}\, w_y}{\sqrt{(w_x^\top C_{xx}\, w_x)(w_y^\top C_{yy}\, w_y)}}$$
where $C_{xy}$ is the between-view covariance and $C_{xx}, C_{yy}$ are the within-view covariances.

Kernel CCA
Given a pair of kernel functions $k_x(\cdot,\cdot)$, $k_y(\cdot,\cdot)$ with kernel matrices $K_x, K_y$: same objective, but the projections live in kernel space, $w_x = \sum_i \alpha_i \phi_x(x_i)$, $w_y = \sum_i \beta_i \phi_y(y_i)$:
$$\rho = \max_{\alpha, \beta} \frac{\alpha^\top K_x K_y\, \beta}{\sqrt{(\alpha^\top K_x^2\, \alpha)(\beta^\top K_y^2\, \beta)}}$$
(regularized in practice to avoid degenerate solutions).
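To make the projection step concrete, here is a minimal NumPy sketch of regularized kernel CCA in the style of Hardoon et al. 2004. The function name, the regularizer kappa, and the choice of 10 components are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def kcca(Kx, Ky, n_components=10, kappa=0.1):
    """Regularized kernel CCA (Hardoon et al. 2004 style) -- a sketch.

    Kx, Ky : (N, N) centered kernel matrices for the two views.
    Returns coefficient matrices A, B and canonical correlations rho, so that
    Kx @ A and Ky @ B are the semantic-space embeddings of the N training items."""
    N = Kx.shape[0]
    I = np.eye(N)
    Rx = np.linalg.inv(Kx + kappa * I)   # regularized inverses
    Ry = np.linalg.inv(Ky + kappa * I)
    # Eigenvectors of Rx Ky Ry Kx give the alpha coefficients;
    # eigenvalues are the squared canonical correlations.
    vals, vecs = np.linalg.eig(Rx @ Ky @ Ry @ Kx)
    order = np.argsort(-vals.real)[:n_components]
    rho = np.sqrt(np.clip(vals[order].real, 0.0, 1.0))
    A = vecs[:, order].real
    B = Ry @ Kx @ A / np.maximum(rho, 1e-12)
    return A, B, rho
```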
Building the kernels for each view
[Figure: visual kernels and word frequency/rank kernels are each mapped into the semantic space.]
Visual features
• Gist: captures the global scene structure [Torralba et al.]
• Color histogram: captures the HSV color distribution
• Visual words: capture local appearance (k-means on DoG+SIFT)
Average the component χ² kernels to build a single visual kernel.
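A sketch of how the per-feature χ² kernels might be formed and averaged into one visual kernel. The exponentiated-χ² variant, the bandwidth heuristic, and the feature array names are assumptions; the slides only say the component kernels are χ².

```python
import numpy as np

def chi2_kernel(X, Y=None, gamma=None):
    """Exponentiated chi-square kernel between rows of histograms
    (gist, HSV color, or visual-word counts) -- assumed variant."""
    Y = X if Y is None else Y
    diff = X[:, None, :] - Y[None, :, :]
    summ = X[:, None, :] + Y[None, :, :] + 1e-10
    dist = 0.5 * (diff ** 2 / summ).sum(axis=-1)          # chi-square distances
    gamma = 1.0 / dist.mean() if gamma is None else gamma  # common bandwidth heuristic
    return np.exp(-gamma * dist)

def average_kernel(feature_sets):
    """Average one chi-square kernel per feature type into a single kernel."""
    return sum(chi2_kernel(F) for F in feature_sets) / len(feature_sets)

# Usage sketch:  K_visual = average_kernel([gist_hists, color_hists, bow_hists])
```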
Tag features
Word Frequency
Traditional bag-of-(text)words, for the example tag-list “Cow, Bird, Water, Architecture, Mountain, Sky”:

tag           count
Cow           1
Bird          1
Water         1
Architecture  1
Mountain      1
Sky           1
Car           0
Person        0
Tag features
Absolute Rank
Score derived from the word's absolute rank in this image's tag-list:

tag           value
Cow           1
Bird          0.63
Water         0.50
Architecture  0.43
Mountain      0.39
Sky           0.36
Car           0
Person        0
Tag features
Relative Rank
Percentile rank obtained from the rank distribution of that word in all tag-lists:

tag           value
Cow           0.9
Bird          0.6
Water         0.8
Architecture  0.5
Mountain      0.8
Sky           0.8
Car           0
Person        0

Average the component χ² kernels to build a single tag kernel.
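A sketch of the three tag-feature encodings. The vocabulary handling, the 1/log2(1+rank) transform for absolute rank (which reproduces the example values 1, 0.63, 0.50, …, so it is an inference, not a quote from the slides), and the way the percentile is estimated are illustrative assumptions.

```python
import numpy as np

def word_frequency(tag_list, vocab):
    """Bag-of-words count vector over the tag vocabulary."""
    return np.array([tag_list.count(w) for w in vocab], dtype=float)

def absolute_rank(tag_list, vocab):
    """Monotone-decreasing score of each word's position in this tag-list.
    1/log2(1+rank) matches the example values on the slide (assumed transform)."""
    feat = np.zeros(len(vocab))
    for rank, w in enumerate(tag_list, start=1):
        if w in vocab:
            feat[vocab.index(w)] = 1.0 / np.log2(1 + rank)
    return feat

def relative_rank(tag_list, vocab, rank_history):
    """Percentile of this word's rank relative to its ranks across all tag-lists.
    rank_history[w] lists the ranks word w received in the training tag-lists."""
    feat = np.zeros(len(vocab))
    for rank, w in enumerate(tag_list, start=1):
        if w in vocab and rank_history.get(w):
            ranks = np.array(rank_history[w])
            feat[vocab.index(w)] = np.mean(ranks >= rank)  # fraction ranked no higher
    return feat
```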
Recap: Building the image database
[Figure: the visual feature space and the tag feature space are both projected into the learned semantic space.]
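Tying the pieces together, a sketch of the database-building recap, reusing the average_kernel and kcca sketches above; all variable names are illustrative.

```python
def build_database(visual_feature_sets, tag_feature_sets, n_components=10):
    """Build the semantic-space database from tagged training images (sketch)."""
    K_visual = average_kernel(visual_feature_sets)    # single visual kernel
    K_tag = average_kernel(tag_feature_sets)          # single tag kernel
    A, B, rho = kcca(K_visual, K_tag, n_components)   # learn the semantic space
    db_visual_emb = K_visual @ A                      # training images, visual view
    db_tag_emb = K_tag @ B                            # training images, tag view
    return A, B, db_visual_emb, db_tag_emb
```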
Experiments
We compare the retrieval performance of our method with two baselines:
• Visual-only baseline
• Words+Visual baseline: KCCA semantic space built from word-frequency tag features and visual features [Hardoon et al. 2004, Yakhnenko et al. 2009]
[Figure: query image and 1st retrieved image shown for each method.]
Evaluation
We use Normalized Discounted Cumulative Gain at top K (NDCG@K) to evaluate retrieval performance [Kekäläinen & Järvelin, 2002]:
$$\mathrm{NDCG@}K = \frac{1}{Z}\sum_{p=1}^{K} \frac{s(p)}{\log_2(p+1)}$$
where $s(p)$ is the reward term (the score for the $p$th-ranked example), the $1/\log_2(p+1)$ discount means doing well in the top ranks is more important, and $Z$ is the sum of the discounted scores for the perfect ranking (normalization).
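A small sketch of NDCG@K under this standard definition; the reward functions themselves (object+scale similarity and ordered tag-list similarity, described on the next slide) are not reproduced here.

```python
import numpy as np

def ndcg_at_k(ranked_scores, k):
    """NDCG@K for one query.

    ranked_scores[p-1] is the reward s(p) of the example the system placed at
    rank p. For the true ideal ranking, pass scores covering the whole database;
    here the same list is re-sorted as a simplification."""
    s = np.asarray(ranked_scores, dtype=float)
    discount = 1.0 / np.log2(np.arange(2, k + 2))     # ranks 1..k
    dcg = np.sum(s[:k] * discount)
    idcg = np.sum(np.sort(s)[::-1][:k] * discount)    # perfect ranking
    return dcg / idcg if idcg > 0 else 0.0
```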
Evaluation
We present the NDCG@K score using two different reward terms:
• Object presence/scale: rewards similarity of the query's objects and scales and those in the retrieved image(s).
• Ordered tag similarity: rewards similarity of the query's ground-truth tag ranks (absolute and relative) and those in the retrieved image(s).
[Figure: example image with its tag-list (Person, Cow, Tree, Fence, Grass) annotated with presence/scale and absolute/relative rank.]
Dataset

LabelMe
• 6352 images (database: 3799, query: 2553)
• Scene-oriented
• Contains ordered tag lists via the order in which labels were added
• 56 unique taggers, ~23 tags/image

PASCAL
• 9963 images (database: 5011, query: 4952)
• Object-centric
• Tag lists obtained on Mechanical Turk
• 758 unique taggers, ~5.5 tags/image
Image-to-image retrieval
We want to retrieve the images most similar to the given query image in terms of object importance.
[Figure: the untagged query image is projected via the visual kernel space into the semantic space and matched against the image database; retrieved images are returned.]
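A sketch of retrieval in the learned semantic space, assuming the kcca projections from the earlier sketch; the cosine scoring and the variable names are illustrative choices, not necessarily the authors' exact similarity measure.

```python
import numpy as np

def project_visual(K_query_train, A):
    """Embed untagged queries into the semantic space via the visual view.
    K_query_train[i, j] = visual kernel between query i and training image j;
    A holds the visual-view coefficients returned by kcca()."""
    return K_query_train @ A

def retrieve(query_emb, db_emb, top_k=5):
    """Rank database images by cosine similarity in the semantic space."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    d = db_emb / np.linalg.norm(db_emb, axis=1, keepdims=True)
    sims = q @ d.T
    return np.argsort(-sims, axis=1)[:, :top_k]

# Usage sketch:
#   db_emb    = K_visual_train @ A
#   query_emb = project_visual(K_query_train, A)
#   top       = retrieve(query_emb, db_emb)
```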
Image-to-image retrieval results
[Figure: example query images with the top retrieval from the Visual-only baseline, the Words+Visual baseline, and our method.]
Image-to-image retrieval results
Our method better retrieves images that share the query's important objects, by both measures (39% improvement).
[Figure: retrieval accuracy measured by object+scale similarity and by ordered tag-list similarity.]
Tag-to-image retrieval
We want to retrieve the images that are best described by the given tag-list.
[Figure: the query tag-list (Cow, Person, Tree, Grass) is projected via the tag-list kernel space into the semantic space and matched against the image database; retrieved images are returned.]
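The same machinery handles keyword queries, entering through the tag view instead; a short sketch reusing retrieve() from the image-to-image example, with assumed variable names.

```python
def tag_to_image(K_query_tag_train, B, db_visual_emb, top_k=5):
    """Tag-to-image retrieval sketch: K_query_tag_train is the tag kernel between
    the query tag-list and the training tag-lists; B comes from kcca()."""
    query_emb = K_query_tag_train @ B      # embed the tag-list query
    return retrieve(query_emb, db_visual_emb, top_k)
```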
Tag-to-image retrieval results
Our method better respects the importance cues implied by the user's keyword query.
[Figure: retrieval accuracy; 31% improvement.]
Image-to-tag auto annotation
We want to annotate the query image with ordered tags that best describe the scene.
[Figure: the untagged query image is projected via the visual kernel space into the semantic space and matched against the image database to produce ordered output tag-lists, e.g. “Cow, Tree, Grass” and “Field, Cow, Grass, Fence”.]
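One plausible way to produce an ordered tag-list, suggested by the “k = number of nearest neighbors used” note in the results table: aggregate the tag features of the query's k nearest database neighbors in the semantic space. The averaging rule below is an assumption for illustration.

```python
import numpy as np

def annotate(query_emb, db_emb, db_tag_features, vocab, k=5, n_tags=5):
    """Order tags for a query by averaging the tag-rank features of its k nearest
    semantic-space neighbors, then sorting the vocabulary by that average."""
    d = db_emb / np.linalg.norm(db_emb, axis=1, keepdims=True)
    q = query_emb / np.linalg.norm(query_emb)
    neighbors = np.argsort(-(d @ q))[:k]            # k nearest database images
    avg = db_tag_features[neighbors].mean(axis=0)   # e.g. relative-rank features
    order = np.argsort(-avg)[:n_tags]
    return [vocab[i] for i in order]
```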
Image-to-tag auto annotation results
[Figure: example query images with predicted ordered tag-lists, e.g. “Tree, Boat, Grass, Water, Person”, “Boat, Person, Water, Sky, Rock”, “Person, Tree, Car, Chair, Window”, “Bottle, Knife, Napkin, Light, Fork”.]

Method        k=1      k=3      k=5      k=10
Visual-only   0.0826   0.1765   0.2022   0.2095
Word+Visual   0.0818   0.1712   0.1992   0.2097
Ours          0.0901   0.1936   0.2230   0.2335

k = number of nearest neighbors used
Implicit tag cues as localization prior
[Hwang & Grauman, CVPR 2010]
Training: Learn an object-specific connection between localization parameters and implicit tag features: P(location, scale | tags).
[Figure: training images with tag-lists such as “Computer, Poster, Desk, Screen, Mug”, “Poster, Desk, Mug, Office”, “Mug, Eiffel, Woman, Table”, “Mug, Ladder”, “Mug, Coffee”, each paired with implicit tag features.]
Testing: Given a novel image, localize objects based on both tags and appearance.
[Figure: an object detector and implicit tag features (“Mug, Key, Keyboard, Toothbrush, Pen, Photo, Post-it”) are combined to localize the mug.]
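For flavor only, a toy illustration of combining a tag-based prior with detector scores; this is not the CVPR 2010 model, which learns an object-specific P(location, scale | tags), and the prior callable here stands in for whatever model is learned on tagged training images.

```python
def rescore_detections(detections, prior):
    """Toy combination of appearance and tag cues: multiply each detector score
    by an (assumed) prior probability of the detection's (x, y, scale) given the
    image's implicit tag features, then re-rank.

    detections: list of (x, y, scale, score) tuples."""
    return sorted(
        ((x, y, s, score * prior(x, y, s)) for x, y, s, score in detections),
        key=lambda d: -d[3])
```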
Conclusion
• We want to learn what is implied (beyond the objects present) by how a human provides tags for an image.
• Our approach requires minimal supervision to learn the connection between the importance conveyed by tags and visual features.
• Consistent gains over:
  • content-based visual search
  • a tag+visual approach that disregards importance
THANK YOU