Reading Between the Lines: Object Localization Using Implicit Cues from Image Tags

Weakly labeled images
[Figure: example weakly labeled images with user tags, e.g., "Baby, Infant, Kid, Child, Headphones, Red, Cute, Laughing, Boy"; "Dog, Grass, Blue, Sky, Puppy, River, Stream, Sun, Colorado, Nikon"; and an indoor scene tagged "Bicycle, Person, Lamp, Chair, Painting, Baby, Table, Chair, Table, Lamp, Chair".]
Object detection approaches
• Sliding window object detector
• Appearance-based detector
• Need to reduce the number of windows scanned
• Prioritize search windows within the image, based on the learned distribution of tags, for speed.
• Combine models based on both tags and images for accuracy.
Motivation
Idea: what can be predicted about an image, before even looking at it, from its tags alone?
Both sets of tags suggest that a mug appears in the image, but since a tag list reflects what "catches the eye" first, the area the object detector has to search can be narrowed.
Implicit Tag Feature Definitions
• What implicit features can be obtained from tags?
• Relative prominence of each object, based on its order in the list.
• Scale cues implied by unnamed objects.
• The rough layout and proximity between objects, based on the sequence in which tags are given.
Implicit Tag Feature Definitions
• Word presence and absence: bag-of-words representation (sketch below)
  W = [w_1, ..., w_N]
• w_i denotes the number of times that tag-word i occurs in the image's associated list of keywords, for a vocabulary of N total possible words.
• For most tag lists, this vector consists of binary entries indicating whether each tag has been named or not.
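
As a rough illustration (not the authors' code), W can be computed from a tag list given a fixed vocabulary; `vocab` and the example tags below are hypothetical:

```python
# Minimal sketch of the bag-of-words tag feature W.
def word_presence(tags, vocab):
    """W = [w_1, ..., w_N]: w_i counts occurrences of vocabulary
    word i in this image's tag list."""
    return [tags.count(word) for word in vocab]

vocab = ["baby", "chair", "dog", "mug", "table"]       # toy vocabulary
print(word_presence(["mug", "table", "chair"], vocab))  # [0, 1, 0, 1, 1]
```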
Implicit Tag Feature Definitions
• Tag rank: the prominence of each object; certain things will be named before others (sketch below)
  R = [r_1, ..., r_N]
• r_i denotes the percentile rank of word i relative to the ranks observed for that word in the training data (for the entire vocabulary).
• Some objects have context-independent "noticeability", such as a baby or a fire truck, and are often named first regardless of their scale or position.
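
A hedged sketch of r_i under one plausible reading of the definition above: `train_ranks` (a hypothetical name) maps each word to the sorted positions at which it appeared in training tag lists, and the percentile is taken over those positions:

```python
# Sketch of the tag-rank feature R (assumed data layout, not the
# authors' code): map a word's position in this image's tag list to
# a percentile over the positions seen for that word in training.
from bisect import bisect_right

def tag_rank(tags, vocab, train_ranks):
    R = []
    for word in vocab:
        if word in tags and train_ranks.get(word):
            rank = tags.index(word) + 1      # 1-based position in this list
            seen = train_ranks[word]         # sorted training positions
            R.append(bisect_right(seen, rank) / len(seen))  # percentile
        else:
            R.append(0.0)                    # word not named in this image
    return R

# e.g., "baby" usually appears first in training tag lists:
print(tag_rank(["baby", "chair"], ["baby", "chair", "dog"],
               {"baby": [1, 1, 1, 2], "chair": [2, 3, 4, 5]}))
# [0.75, 0.25, 0.0]
```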

Implicit Tag Feature Definitions
• Mutual tag proximity: a tagger names prominent objects first, then moves his/her eyes to other objects nearby (sketch below)
  P = [1/p_{1,2}, 1/p_{1,3}, ..., 1/p_{1,N}, 1/p_{2,3}, ..., 1/p_{N-1,N}]
• p_{i,j} denotes the (signed) rank difference between tag words i and j for the given image.
• The entry is 0 when the pair is not present.
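
A minimal sketch directly following the definition above; both words of a pair must be tagged for the entry 1/p_{i,j} to be nonzero (helper name hypothetical):

```python
# Sketch of the mutual-proximity feature P: reciprocal signed rank
# differences for every vocabulary pair (i < j).
def tag_proximity(tags, vocab):
    pos = {w: tags.index(w) + 1 for w in vocab if w in tags}
    P = []
    for i, wi in enumerate(vocab):
        for wj in vocab[i + 1:]:
            if wi in pos and wj in pos:
                P.append(1.0 / (pos[wi] - pos[wj]))  # 1 / p_{i,j}, signed
            else:
                P.append(0.0)                        # pair not present
    return P

print(tag_proximity(["mug", "table"], ["mug", "table", "chair"]))
# [-1.0, 0.0, 0.0]  (mug ranked 1, table ranked 2: p_{1,2} = 1 - 2 = -1)
```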

Modeling the localization distributions
• Relate the defined tag-based features T ∈ {W, R, P} (or a combination) to object detection.
• Model the conditional probability density that a window contains the object of interest, given only these image tags: P_O(X | T)
• O denotes the target object category.

Modeling the localization distributions
• Use a mixture of Gaussians model:
  P_O(X | T) = Σ_{i=1}^{m} α_i N(X; μ_i, Σ_i)
• α_i, μ_i, Σ_i are the parameters of the mixture model, obtained from a trained Mixture Density Network (MDN).
• Training: tagged images with ground-truth bounding boxes.
• Classification: a novel image with tags only, no BBoxes.
[Figure: a novel image tagged "Computer, Bicycle, Chair"; the MDN provides the mixture model representing the most likely locations for the target object. The top 30 most likely places for a car, sampled according to the distribution modeled from the image tags alone.]
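
To make the sampling step concrete, here is a toy sketch with invented mixture parameters standing in for a trained MDN's output (the α_i, μ_i, Σ_i values below are not real trained values):

```python
# Toy sketch: sample candidate windows X = (x, y, scale) from a
# mixture P_O(X | T); alphas/mus/sigmas are invented stand-ins for
# what a trained MDN would output.
import numpy as np

rng = np.random.default_rng(0)
alphas = np.array([0.7, 0.3])                       # mixing weights
mus = np.array([[0.4, 0.6, 0.2], [0.8, 0.5, 0.1]])  # means over (x, y, s)
sigmas = np.array([[0.05, 0.05, 0.02],
                   [0.03, 0.04, 0.01]])             # diagonal std devs

def sample_windows(n):
    comps = rng.choice(len(alphas), size=n, p=alphas)  # pick components
    return rng.normal(mus[comps], sigmas[comps])       # draw (x, y, s)

candidates = sample_windows(30)  # e.g., 30 likely windows to score
```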

Modulating or Priming the detector
• Use P_O(X | T) from the previous step, and either:
• Combine it with the predictions of an appearance-based object detector P_O(X | A), where A denotes appearance cues ("modulating"):
  • HOG
  • Part-based detector (deformable part model)
• Or use the model to rank sub-windows and run the detector on the most probable locations only ("priming").
• The decision value d(x, y, s) of a detector is mapped to a probability:
  P_O(X = (x, y, s) | A) = 1 / (1 + exp(-d(x, y, s)))
Modulating the detector
• Balance appearance and tag-based predictions:
• Use all tag cues.
• Learn the weights w using detection scores for true detections and a number of randomly sampled windows from the background (see the sketch after this list).
• Can add a Gist descriptor to compare against global scene visual context.
• Goal: improve accuracy.
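
The slides do not spell out the exact combination rule, so the following is only a plausible sketch: a logistic model over per-window component scores, trained on true detections versus background windows (all numbers fabricated for illustration):

```python
# Hedged sketch: learn weights w that fuse appearance- and tag-based
# per-window scores; not the authors' exact scheme.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Rows: [P(X|A), P(X|W), P(X|R), P(X|P)] for one window each.
scores = np.array([[0.9, 0.8, 0.7, 0.6],   # true detection
                   [0.2, 0.3, 0.1, 0.2],   # background window
                   [0.8, 0.6, 0.9, 0.7],   # true detection
                   [0.1, 0.2, 0.3, 0.1]])  # background window
labels = np.array([1, 0, 1, 0])

combiner = LogisticRegression().fit(scores, labels)
fused = combiner.predict_proba(scores)[:, 1]  # combined detection scores
```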
Priming the detector
• Prioritize the search windows according to P_O(X | T).
• Assume the object is present; only the localization parameters (x, y, s) have to be estimated.
• Stop the search when a confident detection is found (confidence > 0.5); see the sketch after this list.
• Goal: improve efficiency.
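
A compact sketch of the priming loop; `tag_prob` and `detector_score` are hypothetical callables standing in for P_O(X | T) and the decision value d(x, y, s):

```python
# Sketch of "priming": visit candidate windows in order of P_O(X | T)
# and stop at the first confident appearance-based detection.
import math

def primed_search(windows, tag_prob, detector_score, threshold=0.5):
    for w in sorted(windows, key=tag_prob, reverse=True):
        p = 1.0 / (1.0 + math.exp(-detector_score(w)))  # sigmoid of d
        if p > threshold:
            return w, p          # confident detection: stop early
    return None, 0.0             # no confident detection found
```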
Results
• Datasets:
  • LabelMe: uses the HOG detector
  • PASCAL: uses the part-based detector
[Table: per-class dataset statistics. The last three columns show the ranges of positions/scales present in the images, averaged per class, as a percentage of image size.]
LabelMe Dataset
• Priming Object Search: Increasing Speed
  For a detection rate of 0.6, the proposed method scans only about one third of the windows considered by the sliding window approach.
• Modulating the Detector: Increasing Accuracy
  The proposed features yield noticeable accuracy improvements over the raw detector.
Example detections on LabelMe
• Each image shows the best detection found.
• Scores denote the overlap ratio with the ground truth.
• Detectors modulated by the visual or tag-based context are more accurate.
PASCAL Dataset
• Priming Object Search: Increasing Speed
  Adopts the latent SVM (LSVM) part-based windowed detector, which is faster here than the HOG detector was on LabelMe.
• Modulating the Detector: Increasing Accuracy
  Augmenting the LSVM detector with the tag features noticeably improves accuracy, increasing the average precision by 9.2% overall.
Example detections on PASCAL VOC
• Red dotted boxes denote the most confident detections according to the raw detector (LSVM).
• Green solid boxes denote the most confident detections when modulated by the proposed method (LSVM + tags).
• The first two rows show good results; the third row shows failure cases.
Conclusions
• A novel approach to using the information "between the lines" of image tags.
• Utilizing this implicit tag information makes the search faster and more accurate.
• The method complements, and can even exceed, the performance of methods using visual cues.
• Shows potential for learning the tendencies of real taggers.
Thank you!