READING BETWEEN THE LINES: OBJECT LOCALIZATION USING IMPLICIT CUES FROM IMAGE TAGS
Sung Ju Hwang and Kristen Grauman
University of Texas at Austin

Detecting tagged objects

Images tagged with keywords clearly tell us which objects to search for.
Example tag list: Dog, Black lab, Jasper, Sofa, Self, Living room, Fedora, Explore #24

Previous work using tagged images focuses on the noun ↔ object correspondence (Duygulu et al. 2002, Berg et al. 2004, Fergus et al. 2005, Vijayanarasimhan & Grauman 2008).

Main Idea

The list of tags on an image may give useful information beyond just what objects are present.

Image 1 tags: Mug, Key, Keyboard, Toothbrush, Pen, Photo, Post-it
Image 2 tags: Computer, Poster, Desk, Bookshelf, Screen, Keyboard, Screen, Mug, Poster, Computer

Can you guess where and at what size the mug will appear in each image? The tags act as context:

Image 1: larger objects are absent, and Mug is named first.
Image 2: larger objects are present, and Mug is named later in the list.

Feature: word presence/absence

The presence or absence of other objects, and the number of those objects, affects the scene layout. In Image 1, the presence of smaller objects such as Key, and the absence of larger objects, hints that it might be a close-up scene. In Image 2, the presence of larger objects such as Desk and Bookshelf hints that the image depicts a typical office scene.

The feature is a plain bag-of-words vector describing word frequency, with W_i = the number of times word i appears in the tag list:

Word         W (Image 1)   W (Image 2)
Mug               1             1
Computer          0             2
Screen            0             2
Keyboard          1             1
Desk              0             1
Bookshelf         0             1
Poster            0             2
Photo             1             0
Pen               1             0
Post-it           1             0
Toothbrush        1             0
Key               1             0

(In the original slide, larger objects are shown in blue and smaller objects in red.)

Feature: tag rank

People tag the 'important' objects earlier. If an object is tagged first, there is a high chance that it is the main object: large and centered. If it is tagged later, the object might not be salient: it may be far from the center or small in scale.

The feature is the percentile of the absolute rank of the tag compared against that tag's typical rank, with r_i = the rank percentile for tag i:

Word         R (Image 1)   R (Image 2)
Mug              0.80          0.23
Computer         0             0.62
Screen           0             0.21
Keyboard         0.51          0.13
Desk             0             0.48
Bookshelf        0             0.61
Poster           0             0.41
Photo            0.28          0
Pen              0.72          0
Post-it          0.82          0
Toothbrush       0             0
Key              0.90          0

(In the original slide: blue = high relative rank (>0.6), green = medium relative rank (0.4~0.6), red = low relative rank (<0.4).)

Feature: proximity

People tend to move their eyes to nearby objects, so objects that are close to each other in the tag list are likely to be close in the image.

Image 1 tag order: 1) Mug 2) Key 3) Keyboard 4) Toothbrush 5) Pen 6) Photo 7) Post-it
Image 2 tag order: 1) Computer 2) Poster 3) Desk 4) Bookshelf 5) Screen 6) Keyboard 7) Screen 8) Mug 9) Poster 10) Computer

P_{i,j} is the rank difference between tags i and j, encoded as the inverse of the average rank difference between the tag words:

Image 1      Mug   Screen  Keyboard  Desk  Bookshelf
Mug           1
Screen        0      0
Keyboard     0.5     0        1
Desk          0      0        0        0
Bookshelf     0      0        0        0       0

Image 2      Mug   Screen  Keyboard  Desk  Bookshelf
Mug           1
Screen        1      1
Keyboard     0.5     1        1
Desk         0.2    0.33     0.33      1
Bookshelf    0.25   0.5      0.5       1       1

(In the original slide, blue marks pairs of objects that are close to each other.)
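To make the three features concrete, the following is a minimal sketch of how they could be computed from an ordered tag list. It is an illustration under stated assumptions, not the authors' code: the `vocab` word-to-index mapping and the `typical_rank_cdf` lookup (which converts an absolute rank into a percentile against a word's typical rank across training images) are hypothetical names.

```python
import numpy as np

def word_feature(tags, vocab):
    """W: bag-of-words counts, i.e. presence/absence and multiplicity of each word."""
    w = np.zeros(len(vocab))
    for t in tags:
        if t in vocab:
            w[vocab[t]] += 1
    return w

def rank_feature(tags, vocab, typical_rank_cdf):
    """R: percentile of each tag's rank relative to that word's typical rank.

    `typical_rank_cdf[word](rank)` is a hypothetical lookup returning how
    unusually early `word` appears at position `rank` in this tag list.
    """
    r = np.zeros(len(vocab))
    for rank, t in enumerate(tags, start=1):
        if t in vocab:
            r[vocab[t]] = typical_rank_cdf[t](rank)
    return r

def proximity_feature(tags, vocab):
    """P: inverse of the average rank difference between each pair of tag words."""
    n = len(vocab)
    p = np.zeros((n, n))
    ranks = {}
    for rank, t in enumerate(tags, start=1):
        ranks.setdefault(t, []).append(rank)
    words = [t for t in ranks if t in vocab]
    for i, a in enumerate(words):
        for b in words[i:]:
            diffs = [abs(ra - rb) for ra in ranks[a] for rb in ranks[b]]
            avg = np.mean(diffs)
            val = 1.0 if avg == 0 else 1.0 / avg
            p[vocab[a], vocab[b]] = p[vocab[b], vocab[a]] = val
    return p
```

For the office image above, word_feature recovers the W column of the table (e.g., a count of 2 for Computer), and proximity_feature assigns the Mug/Desk pair the value 1/|8-3| = 0.2, matching the table.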
Overview of the approach

What? An appearance model (a sliding-window detector) gives an appearance-based prediction P(X|A).

Where? From the tags (e.g., Mug, Key, Keyboard, Toothbrush, Pen, Photo, Post-it) we compute the implicit tag features, e.g. W = {1, 0, 2, … , 3}, R = {0.9, 0.5, … , 0.2}, P = {0.25, 0.33, … , 0.1}, and model P(X|W), P(X|R), and P(X|P).

These tag-based predictions P(X|T) are then used either to prime the detector (deciding which hypotheses to evaluate first) or to modulate the detector's confidence scores (e.g., rescoring two hypotheses to 0.24 and 0.81), yielding the localization result.

Approach: modeling P(X|T)

We wish to know the conditional PDF of the location and scale of the target object given the tag features: P(X|T), where X = (s, x, y) and T is the tag feature. We model this conditional PDF directly, without calculating the joint distribution P(X,T), using a mixture density network (MDN).

[Figure: training tag lists for the class "car" (e.g., Lamp, Car, Wheel, Wheel, Light, Window, House, House, Car, Car, Road, House, Lightpole; Car, Boulder, Windows, Building, Man, Barrel, Car, Truck, Car, Car), and the top 30 most likely positions for the class "car", with bounding boxes sampled according to P(X|T).]

Approach: Priming the detector

How can we make use of this learned distribution P(X|T)?
1) Use it to speed up the detection process.
2) Use it to modulate the detection confidence score.

For 1), we rank the detection windows based on the learned P(X|T), then search only the probable regions and scales, following that rank. Windows at unlikely scales or outside the probable region are ignored (in the slide's example, 38,600 and 33,000 candidate windows are skipped this way).

Approach: Modulating the detector

For 2), the sliding-window detector supplies P(X|A), and the image tags supply P(X|W), P(X|R), and P(X|P). A logistic regression classifier learns the weights for each prediction: P(X|A), P(X|W), P(X|R), and P(X|P).

Example: three candidate windows with original detector scores of 0.7, 0.8, and 0.9 receive tag-based predictions of 0.9, 0.3, and 0.2; the combined scores become 0.63, 0.24, and 0.18, so the hypothesis that the tags support now ranks first despite its lower appearance score.
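As a rough sketch of the modeling step: the MDN maps the tag feature T to the parameters of a Gaussian mixture over X = (s, x, y) and is trained by minimizing the negative log-likelihood of the ground-truth boxes. The sketch below assumes PyTorch; the layer sizes, the number of mixture components, and the diagonal-Gaussian components are illustrative choices, not the paper's.

```python
import torch
import torch.nn as nn

class MDN(nn.Module):
    """Maps a tag feature vector T to a Gaussian mixture over X = (s, x, y)."""
    def __init__(self, tag_dim, n_components=8, x_dim=3, hidden=64):
        super().__init__()
        self.k, self.d = n_components, x_dim
        self.body = nn.Sequential(nn.Linear(tag_dim, hidden), nn.Tanh())
        self.pi = nn.Linear(hidden, n_components)                 # mixture weights
        self.mu = nn.Linear(hidden, n_components * x_dim)         # component means
        self.log_sigma = nn.Linear(hidden, n_components * x_dim)  # log stddevs

    def forward(self, t):
        h = self.body(t)
        log_pi = torch.log_softmax(self.pi(h), dim=-1)
        mu = self.mu(h).view(-1, self.k, self.d)
        sigma = self.log_sigma(h).view(-1, self.k, self.d).exp()
        return log_pi, mu, sigma

def mdn_nll(log_pi, mu, sigma, x):
    """Negative log-likelihood of ground-truth X under the predicted mixture."""
    comp = torch.distributions.Normal(mu, sigma)
    log_prob = comp.log_prob(x.unsqueeze(1)).sum(-1)   # (batch, k)
    return -torch.logsumexp(log_pi + log_prob, dim=-1).mean()
```

Evaluating the learned mixture at a candidate window's (s, x, y) then gives the P(X|T) value used in the two steps below.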
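Both uses of the learned distribution are then simple to express. In this sketch, `density(w)` is assumed to evaluate P(X|T) at a candidate window w = (s, x, y) (e.g., via the mixture above), and the modulation step follows the slide's description of a logistic regression over the four predictions; the exact feature layout and training labels are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def prime_detector(windows, density, budget):
    """Priming: sort sliding-window hypotheses by P(X|T) and run the
    expensive appearance detector only on the top `budget` of them."""
    order = np.argsort([-density(w) for w in windows])
    return [windows[i] for i in order[:budget]]

def train_modulator(features, labels):
    """Modulating: learn weights over [P(X|A), P(X|W), P(X|R), P(X|P)].
    `features` is an (n, 4) array of the four predictions per hypothesis;
    `labels` marks whether each hypothesis was a correct localization."""
    clf = LogisticRegression()
    clf.fit(features, labels)
    return clf

def modulated_score(clf, p_a, p_w, p_r, p_p):
    """Combined confidence for a single window hypothesis."""
    return clf.predict_proba([[p_a, p_w, p_r, p_p]])[0, 1]
```

Priming changes only how many windows the appearance detector must evaluate (the source of the speedup); modulating changes only the final ranking of hypotheses (the source of the accuracy gain).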
Experiments

We compare detection speed (the number of windows that must be searched) and detection accuracy (AUROC and AP) across three methods: appearance only, appearance + Gist, and appearance + tag features (ours).

Experiments: Dataset

Dataset                          LabelMe     PASCAL VOC 2007
Number of training/test images   3799/2553   5011/4953
Number of classes                5           20
Number of keywords               209         399
Number of taggers                56          758
Avg. number of tags per image    23          5.5

LabelMe contains ordered tag lists; we use a modified version of the HOG detector of Dalal & Triggs. PASCAL VOC 2007 contains images with high variance in composition; its tag lists were obtained from anonymous workers on Mechanical Turk, and we use Felzenszwalb's LSVM detector.

LabelMe: Performance Evaluation

With the modified Dalal & Triggs HOG detector we obtain faster detection, because we know where to look first, and more accurate detection, because we know which hypotheses to trust most.

Results: LabelMe

[Figures: example localizations comparing HOG, HOG+Gist, and HOG+Tags on street scenes (tags such as Sky, Buildings, Person, Sidewalk, Car, Road, Window, Wheel, Sign) and office scenes (tags such as Desk, Keyboard, Screen, Bookshelf, Mug, CD).]

Gist and tags are likely to predict the same position but different scales; most of the accuracy gain from the tag features comes from accurate scale prediction.

PASCAL VOC 2007: Performance Evaluation

With the modified Felzenszwalb LSVM detector, we need to test fewer windows to achieve the same detection rate, and we obtain a 9.2% improvement in accuracy (Average Precision) over all classes. The per-class localization accuracy shows significant improvement on bird, boat, cat, dog, and potted plant.

PASCAL VOC 2007 (examples)

[Figures: example localizations from our method versus the LSVM baseline, e.g. aeroplanes (tags such as Building, Aeroplane, Smoke), cluttered indoor bottle scenes (tags such as Bottle, Person, Table, Chair, Mirror, Tablecloth, Bowl, Shelf, Painting, Food, Lamp, Dog, Sofa), and dogs, people, and horses (tags such as Dog, Floor, Hairclip, Person, Ground, Bench, Scarf, Horse, Microphone, Light, Tree, House, Building, Hurdle, Fence).]

PASCAL VOC 2007 (Failure case)

[Figures: failure cases, e.g. an aeroplane scene (tags such as Aeroplane, Sky, Building, Shadow), a bottle scene (tags such as Bottle, Glass, Wine, Table), and a dog scene (tags such as Dog, Person, Clothes, Rope, Plant, Ground, Shadow, String, Wall, Pole, Building, Sidewalk, Grass, Road).]

Some Observations

We find that the implicit features often predict scale better for indoor objects and position better for outdoor objects. Gist is usually better for the y position, while tags are generally stronger for scale, which agrees with previous experiments using Gist. In general, the target objects need to have been seen in a variety of training examples with different contexts.

Conclusion

We showed how to exploit the implicit information present in human tagging behavior to improve object localization performance in both speed and accuracy.

Future Work

Joint multi-object detection
From tags to natural language sentences
Image retrieval
Using WordNet to group words with similar meanings