CVPR 2010

READING BETWEEN THE LINES:
OBJECT LOCALIZATION USING
IMPLICIT CUES FROM IMAGE TAGS
Sung Ju Hwang and Kristen Grauman
University of Texas at Austin
Detecting tagged objects
Images tagged with keywords clearly tell us which objects to search for.
Example image tags: Dog, Black lab, Jasper, Sofa, Self, Living room, Fedora, Explore, #24
Detecting tagged objects
Images tagged with keywords clearly tell us which objects to search for.
Duygulu et al. 2002
Berg et al. 2004
Fergus et al. 2005
Vijayanarasimhan & Grauman 2008
Previous work using tagged images focuses on the
noun ↔ object correspondence.
Main Idea
The list of tags on an image may give useful information beyond just which objects are present.
Image 1 tags: Mug, Key, Keyboard, Toothbrush, Pen, Photo, Post-it
Image 2 tags: Computer, Poster, Desk, Bookshelf, Screen, Keyboard, Screen, Mug, Poster, Computer
Can you guess where, and at what size, the mug will appear in each image?
Main Idea
Tag as context
Image 1 tags: Mug, Key, Keyboard, Toothbrush, Pen, Photo, Post-it
→ absence of larger objects; mug is named first
Image 2 tags: Computer, Poster, Desk, Bookshelf, Screen, Keyboard, Screen, Mug, Poster, Computer
→ presence of larger objects; mug is named later in the list
Feature: word presence/absence
The presence or absence of other objects, and how many of them there are, affects the scene layout.
Image 1 tags: Mug, Key, Keyboard, Toothbrush, Pen, Photo, Post-it
→ the presence of smaller objects such as the key, and the absence of larger objects, hints that this might be a close-up scene
Image 2 tags: Computer, Poster, Desk, Bookshelf, Screen, Keyboard, Screen, Mug, Poster, Computer
→ the presence of larger objects such as the desk and bookshelf hints that the image depicts a typical office scene
Feature: word presence/absence
A plain bag-of-words feature describing word frequency: W = [w1, …, wN], where wi is the count of tag word i.
Image 1 tags: Mug, Key, Keyboard, Toothbrush, Pen, Photo, Post-it
Image 2 tags: Computer, Poster, Desk, Bookshelf, Screen, Keyboard, Screen, Mug, Poster, Computer

Word | Mug | Computer | Screen | Keyboard | Desk | Bookshelf | Poster | Photo | Pen | Post-it | Toothbrush | Key
W1   |  1  |    0     |   0    |    1     |  0   |     0     |   0    |   1   |  1  |    1    |     1      |  1
W2   |  1  |    2     |   2    |    1     |  1   |     1     |   2    |   0   |  0  |    0    |     0      |  0
(In the original figure, larger objects are shown in blue and smaller objects in red.)
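As a concrete illustration (our sketch, not the authors' code), the following Python reproduces the W rows of the table above from the two tag lists; the vocabulary ordering is taken from the table.

```python
# Hypothetical sketch of the word presence/absence feature: a plain
# bag-of-words vector of tag counts over a fixed vocabulary. The vocabulary
# order follows the table above; this is illustrative, not the authors' code.
from collections import Counter

VOCAB = ["mug", "computer", "screen", "keyboard", "desk", "bookshelf",
         "poster", "photo", "pen", "post-it", "toothbrush", "key"]

def word_count_feature(tags, vocab=VOCAB):
    """W[i] = number of times vocab[i] occurs in the image's tag list."""
    counts = Counter(t.lower() for t in tags)
    return [counts[w] for w in vocab]

img1 = ["Mug", "Key", "Keyboard", "Toothbrush", "Pen", "Photo", "Post-it"]
img2 = ["Computer", "Poster", "Desk", "Bookshelf", "Screen",
        "Keyboard", "Screen", "Mug", "Poster", "Computer"]

print(word_count_feature(img1))  # [1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1]
print(word_count_feature(img2))  # [1, 2, 2, 1, 1, 1, 2, 0, 0, 0, 0, 0]
```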
Feature: tag rank
People tag the ‘important’ objects earlier
Image 1 tags: Mug, Key, Keyboard, Toothbrush, Pen, Photo, Post-it
→ if the object is tagged first, there is a high chance that it is the main object: large and centered
Image 2 tags: Computer, Poster, Desk, Bookshelf, Screen, Keyboard, Screen, Mug, Poster, Computer
→ if the object is tagged later, it might not be salient: it may be far from the center or small in scale
Feature: tag rank
Percentile of the tag's absolute rank in this image, compared against its typical rank: ri = rank percentile for tag i.
Image 1 tags: Mug, Key, Keyboard, Toothbrush, Pen, Photo, Post-it
Image 2 tags: Computer, Poster, Desk, Bookshelf, Screen, Keyboard, Screen, Mug, Poster, Computer

Word | Mug  | Computer | Screen | Keyboard | Desk | Bookshelf | Poster | Photo | Pen  | Post-it | Toothbrush | Key
W1   | 0.80 |    0     |   0    |   0.51   |  0   |     0     |   0    | 0.28  | 0.72 |  0.82   |     0      | 0.90
W2   | 0.23 |   0.62   |  0.21  |   0.13   | 0.48 |   0.61    |  0.41  |   0   |  0   |    0    |     0      |  0
(Colors in the original figure: blue = high relative rank (>0.6), green = medium relative rank (0.4~0.6), red = low relative rank (<0.4).)
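A sketch of one plausible implementation (our reading of "percentile against its typical rank"; the paper's exact normalization may differ):

```python
# Hypothetical sketch of the tag-rank feature. For each vocabulary word
# present in the image, we compute where its rank in this tag list falls
# relative to the ranks the word typically receives in training images.
# The exact normalization is our guess at the slide's definition.
import bisect

def rank_percentile_feature(tags, vocab, typical_ranks):
    """typical_ranks[w]: sorted list of ranks word w had in training tag lists.
    Returns R with R[i] near 1 when vocab[i] appears unusually early here,
    and 0 when vocab[i] is absent. Only the first occurrence is used."""
    r = [0.0] * len(vocab)
    for rank, tag in enumerate(tags, start=1):
        w = tag.lower()
        if w in typical_ranks and w in vocab and r[vocab.index(w)] == 0.0:
            hist = typical_ranks[w]
            # fraction of training occurrences at this rank or later:
            r[vocab.index(w)] = 1.0 - bisect.bisect_left(hist, rank) / len(hist)
    return r

# Toy usage: "mug" usually shows up late (ranks 3-8), so rank 1 here scores high.
typical = {"mug": [3, 4, 5, 6, 8], "key": [1, 2, 2, 4, 9]}
print(rank_percentile_feature(["Mug", "Key"], ["mug", "key"], typical))  # [1.0, 0.8]
```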
Feature: proximity
People tend to move their eyes to nearby objects while tagging.
Image 1 tag order: 1) Mug, 2) Key, 3) Keyboard, 4) Toothbrush, 5) Pen, 6) Photo, 7) Post-it
Image 2 tag order: 1) Computer, 2) Poster, 3) Desk, 4) Bookshelf, 5) Screen, 6) Keyboard, 7) Screen, 8) Mug, 9) Poster, 10) Computer
Objects that are close to each other in the tag list are likely to be close in the image.
Feature: proximity
Pi,j = rank difference between tags i and j, encoded as the inverse of the average rank difference between the tag words (1 on the diagonal; 0 if either word is absent).

Image 1 tag order: 1) Mug, 2) Key, 3) Keyboard, 4) Toothbrush, 5) Pen, 6) Photo, 7) Post-it

Word      | Mug | Screen | Keyboard | Desk | Bookshelf
Mug       |  1  |        |          |      |
Screen    |  0  |   0    |          |      |
Keyboard  | 0.5 |   0    |    1     |      |
Desk      |  0  |   0    |    0     |  0   |
Bookshelf |  0  |   0    |    0     |  0   |     0

Image 2 tag order: 1) Computer, 2) Poster, 3) Desk, 4) Bookshelf, 5) Screen, 6) Keyboard, 7) Screen, 8) Mug, 9) Poster, 10) Computer

Word      | Mug  | Screen | Keyboard | Desk | Bookshelf
Mug       |  1   |        |          |      |
Screen    |  1   |   1    |          |      |
Keyboard  | 0.5  |   1    |    1     |      |
Desk      | 0.2  |  0.33  |   0.33   |  1   |
Bookshelf | 0.25 |  0.5   |   0.5    |  1   |     1

(In the original figure, blue marks objects that are close to each other.)
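A sketch matching the matrices above, assuming the average is taken over all occurrence pairs of repeated tags (the slide's Mug/Screen entry for image 2 suggests a slightly different tie-handling, so this is our reading, not the authors' code):

```python
# Hypothetical sketch of the proximity feature: P[i][j] is the inverse of the
# average rank difference between the occurrences of vocabulary words i and j
# (1 on the diagonal, 0 when either word is absent). How repeated tags are
# averaged is our assumption; the slide's numbers differ in one entry.
def proximity_feature(tags, vocab):
    tags = [t.lower() for t in tags]
    # all (1-based) ranks at which each vocabulary word occurs
    occ = {w: [k + 1 for k, t in enumerate(tags) if t == w] for w in vocab}
    n = len(vocab)
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            oi, oj = occ[vocab[i]], occ[vocab[j]]
            if not oi or not oj:
                continue  # absent word: proximity stays 0
            if i == j:
                P[i][j] = 1.0
                continue
            diffs = [abs(a - b) for a in oi for b in oj]
            P[i][j] = len(diffs) / sum(diffs)  # inverse of the average difference
    return P

img2 = ["computer", "poster", "desk", "bookshelf", "screen",
        "keyboard", "screen", "mug", "poster", "computer"]
P = proximity_feature(img2, ["mug", "screen", "keyboard", "desk", "bookshelf"])
print(P[3][0])  # desk-mug: ranks 3 vs 8, difference 5 -> 0.2
```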
Overview of the approach
Image → appearance model → appearance-based prediction P(X|A)  [What?]
Tags (Mug, Key, Keyboard, Toothbrush, Pen, Photo, Post-it)
→ implicit tag features: W = {1, 0, 2, …, 3}, R = {0.9, 0.5, …, 0.2}, P = {0.25, 0.33, …, 0.1}
→ modeling P(X|T): P(X|W), P(X|R), P(X|P)  [Where?]
P(X|T) primes the sliding-window detector → localization result
Overview of the approach
Image → appearance model → appearance-based prediction P(X|A)  [What?]
Tags (Mug, Key, Keyboard, Toothbrush, Pen, Photo, Post-it)
→ implicit tag features: W = {1, 0, 2, …, 3}, R = {0.9, 0.5, …, 0.2}, P = {0.25, 0.33, …, 0.1}
→ modeling P(X|T): P(X|W), P(X|R), P(X|P)
P(X|T) modulates the sliding-window detector's confidence scores → localization result
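In code, the whole pipeline could look roughly like the toy paraphrase below; every scoring function is a stand-in (the real system uses a HOG or LSVM detector for P(X|A), an MDN for P(X|T), and a learned combiner), so treat this as a sketch of the diagram, not the paper's implementation.

```python
# Toy end-to-end sketch of the two uses of P(X|T) around a sliding-window
# detector: priming (scan only the windows the tags make probable) and
# modulating (fuse appearance and tag-based scores).
import random

def localize(image, tags, windows, detector_score, tag_density, fuse, budget):
    # 1) Priming: keep only the `budget` windows most probable under P(X|T).
    primed = sorted(windows, key=lambda w: -tag_density(w, tags))[:budget]
    # 2) Modulating: fuse the appearance score with the tag-based prediction.
    scored = [(w, fuse(detector_score(image, w), tag_density(w, tags)))
              for w in primed]
    return max(scored, key=lambda pair: pair[1])

random.seed(0)
windows = [(s, x, y) for s in (64, 128) for x in range(0, 320, 64)
           for y in range(0, 240, 48)]
best = localize(
    image=None, tags=["mug", "key", "pen"], windows=windows,
    detector_score=lambda img, w: random.random(),           # stand-in P(X|A)
    tag_density=lambda w, tags: 1.0 / (1 + abs(w[0] - 64)),  # stand-in P(X|T)
    fuse=lambda a, t: a * t,                                 # stand-in combiner
    budget=10)
print("best window and fused score:", best)
```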
Approach: modeling P(X|T)
We wish to know the conditional PDF of the location and scale of the target object, given the tag features: P(X|T), where X = (s, x, y) and T is the tag feature.
[Example street-scene images with tags such as Lamp, Car, Wheel, Light, Window, House, Road, Lightpole.]
We model this conditional PDF P(X|T) directly, without computing the joint distribution P(X, T), using a mixture density network (MDN).
[Figure: top 30 most likely positions for class car; bounding boxes sampled according to P(X|T). Example tags: Car, Boulder, Windows, Building, Man, Barrel, Truck.]
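Concretely, an MDN maps the tag feature to the parameters of a mixture over X. A minimal sketch in PyTorch (a generic MDN after Bishop 1994; layer sizes, the number of components, and the isotropic-covariance choice are our assumptions, not the paper's):

```python
# Minimal MDN sketch: maps a tag feature vector T to the parameters of a
# Gaussian mixture over X = (scale, x, y), so P(X|T) is modeled directly
# without the joint P(X, T). Architecture details are illustrative guesses.
import math
import torch
import torch.nn as nn

class MDN(nn.Module):
    def __init__(self, tag_dim, out_dim=3, n_components=8, hidden=64):
        super().__init__()
        self.out_dim, self.K = out_dim, n_components
        self.trunk = nn.Sequential(nn.Linear(tag_dim, hidden), nn.Tanh())
        self.pi = nn.Linear(hidden, n_components)            # mixture weights
        self.mu = nn.Linear(hidden, n_components * out_dim)  # component means
        self.log_sigma = nn.Linear(hidden, n_components)     # isotropic stds

    def forward(self, t):
        h = self.trunk(t)
        log_pi = torch.log_softmax(self.pi(h), dim=-1)
        mu = self.mu(h).view(-1, self.K, self.out_dim)
        sigma = self.log_sigma(h).exp()
        return log_pi, mu, sigma

    def log_prob(self, t, x):
        """log P(x|t) under the predicted isotropic Gaussian mixture."""
        log_pi, mu, sigma = self.forward(t)
        d = x.unsqueeze(1) - mu                              # (B, K, D)
        comp = (-0.5 * (d ** 2).sum(-1) / sigma ** 2
                - self.out_dim * torch.log(sigma)
                - 0.5 * self.out_dim * math.log(2 * math.pi))
        return torch.logsumexp(log_pi + comp, dim=-1)

# Training minimizes the negative log-likelihood of ground-truth (s, x, y):
mdn = MDN(tag_dim=12)
t = torch.randn(4, 12)   # e.g. the W (or concatenated W, R, P) features
x = torch.rand(4, 3)     # normalized ground-truth scale and position
loss = -mdn.log_prob(t, x).mean()
loss.backward()
```

Candidate boxes can then be ranked, or sampled, according to `mdn.log_prob`, which is how the top-30 boxes in the figure above would be produced.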
Approach: Priming the detector
Then how can we make use of this learned distribution P(X|T)?
1) Use it to speed the detection process
2) Use it to modulate the detection confidence score
[Figure: windows at the most probable scale and region are evaluated first; unlikely scales and regions are ignored (38,600 and 33,000 windows skipped in the two examples).]
1) Rank the candidate windows based on the learned P(X|T)
2) Search only the probable regions and scales, following that rank
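A sketch of this priming step, reusing the MDN sketch above; `detect` stands in for the expensive appearance-model evaluation, and none of the names are the authors':

```python
# Hypothetical priming sketch: rank every candidate window by the learned
# P(X|T) and run the costly appearance detector only on the top-ranked ones,
# so unlikely regions and scales are never scanned.
import numpy as np

def primed_search(windows, tag_feature, mdn_log_prob, detect, budget=500):
    """windows: list of (s, x, y) candidates; mdn_log_prob(w, T) -> log P(w|T);
    detect(w) -> appearance-based confidence (the expensive call)."""
    logp = np.array([mdn_log_prob(w, tag_feature) for w in windows])
    order = np.argsort(-logp)  # most probable windows first
    return [(windows[i], detect(windows[i])) for i in order[:budget]]
```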
Approach: Modulating the detector
Then how can we make use of this learned distribution P(X|T)?
1) Use it to speed the detection process
2) Use it to modulate the detection confidence score
Detector → P(X|A)
Image tags (Lamp, Car, Wheel, Wheel, Light) → P(X|W), P(X|R), P(X|P)
All four predictions feed a logistic regression classifier:
we learn a weight for each prediction, P(X|A), P(X|W), P(X|R), and P(X|P).
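A minimal sketch of that combiner using scikit-learn; the feature layout and the training labels are illustrative assumptions, not the paper's setup:

```python
# Hypothetical sketch of learning the modulation weights: a logistic
# regression over the four predictions P(X|A), P(X|W), P(X|R), P(X|P) for
# each candidate window, trained on whether the window was a correct
# localization.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row: [P(X|A), P(X|W), P(X|R), P(X|P)] for one training window.
train_preds = np.array([[0.9, 0.8, 0.7, 0.6],
                        [0.8, 0.2, 0.3, 0.1],
                        [0.3, 0.7, 0.6, 0.8],
                        [0.2, 0.1, 0.2, 0.3]])
train_labels = np.array([1, 0, 1, 0])  # did the window hit the object?

combiner = LogisticRegression().fit(train_preds, train_labels)

# At test time the fused probability reranks the detector's hypotheses.
test_preds = np.array([[0.7, 0.9, 0.9, 0.9],   # weak detector, strong tags
                       [0.9, 0.2, 0.3, 0.2]])  # strong detector, weak tags
print(combiner.predict_proba(test_preds)[:, 1])
```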
Approach: Modulating the detector
Then how can we make use of this learned distribution P(X|T)?
1) Use it to speed the detection process
2) Use it to modulate the detection confidence score
Worked example: three windows score 0.7, 0.8, and 0.9 under the original detector, while the tag features predict 0.9, 0.3, and 0.2 for the same windows. Combining the two predictions (here, their product) gives 0.63, 0.24, and 0.18, so the window supported by the tags (detector score 0.7) is now ranked first.
Experiments
We compare two criteria:
- Detection speed: the number of windows that must be searched
- Detection accuracy: AUROC and AP
across three methods:
- Appearance only
- Appearance + Gist
- Appearance + tag features (ours)
Experiments: Dataset
Dataset                        | LabelMe   | PASCAL VOC 2007
Number of training/test images | 3799/2553 | 5011/4953
Number of classes              | 5         | 20
Number of keywords             | 209       | 399
Number of taggers              | 56        | 758
Avg. number of tags / image    | 23        | 5.5

LabelMe:
- contains the ordered tag list
- used Dalal & Triggs's HOG detector
PASCAL VOC 2007:
- contains images with high variance in composition
- tag lists obtained from anonymous workers on Mechanical Turk
- used Felzenszwalb's LSVM detector
LabelMe: Performance Evaluation
Modified version of the HOG detector by Dalal and Triggs.
Faster detection, because we know where to look first.
More accurate detection, because we know which hypotheses to trust most.
Results: LabelMe
[Example street scene with tags Sky, Buildings, Person, Sidewalk, Car, Road, Window, Wheel, Sign; detections shown for HOG, HOG+Gist, and HOG+Tags.]
Gist and tags are likely to predict the same position but different scales; most of the accuracy gain from the tag features comes from accurate scale prediction.
Results: LabelMe
[Example office scenes with tags Desk, Keyboard, Screen, Bookshelf, Mug, CD; detections shown for HOG, HOG+Gist, and HOG+Tags.]
PASCAL VOC 2007: Performance Evaluation
Modified Felzenszwalb's LSVM detector.
[Plot: detection rate vs. number of windows searched; the figures 65%, 77%, 70%, and 25% appear in the original.]
Fewer windows need to be tested to achieve the same detection rate.
9.2% improvement in accuracy over all classes (Average Precision).
Per-class localization accuracy: significant improvement on
- Bird
- Boat
- Cat
- Dog
- Potted plant
PASCAL VOC 2007 (examples)
[Detection examples, ours vs. the LSVM baseline. Tags in the example images include Building, Aeroplane, Smoke; Bottle, Person, Table, Chair, Mirror, Tablecloth, Bowl, Shelf, Painting, Food, Lamp; and Person, Bottle, Dog, Sofa, Painting, Table.]
PASCAL VOC 2007 (examples)
[More detection examples. Tags in the example images include Dog, Floor, Hairclip, Person; Person, Ground, Bench, Scarf, Dog; Dog, Person, Horse, Microphone, Light; and Person, Tree, House, Building, Ground, Hurdle, Fence.]
PASCAL VOC 2007 (Failure case)
[Failure cases. Tags in the example images include Aeroplane, Bottle, Sky, Building, Shadow; Glass, Wine, Table; Dog, Person, Clothes, Rope, Plant, Ground, Shadow, String; and Wall, Pole, Building, Sidewalk, Grass, Road.]
Some Observations
- We find that the implicit features often predict scale better for indoor objects and position better for outdoor objects.
- We find Gist is usually better for the y position, while tags are generally stronger for scale, which agrees with previous experiments using Gist.
- In general, the approach needs to have learned the target objects from a variety of examples with different contexts.
Conclusion
We showed how to exploit the implicit information present in human tagging behavior to improve object localization in both speed and accuracy.
Future Work
- Joint multi-object detection
- From tags to natural language sentences
- Image retrieval
- Using WordNet to group words with similar meanings