[Figure: example photos with user-supplied tags: Baby Infant Kid Child Headphones Red Cute Laughing Boy Dog Grass Blue Sky Puppy River Stream Sun Colorado Nikon]

Weakly labeled images
[Figure: example images with their object labels: Bicycle Person Lamp Chair Painting Baby Table Chair Table Lamp Chair]

Object detection approaches
• Sliding window object detector
• Appearance-based detector
• Need to reduce the number of windows scanned
• Prioritize search windows within the image, based on a distribution learned from the tags, for speed.
• Combine models based on both tags and images, for accuracy.

Motivation
Idea: what can be predicted about the image before even looking at it, given only its tags?
Both sets of tags suggest that a mug appears in the image, but because a tag list reflects what “catches the eye” first, the area that the object detector has to search can be narrowed.

Implicit Tag Feature Definitions
What implicit features can be obtained from tags?
• Relative prominence of each object, based on its order in the list.
• Scale cues implied by unnamed objects.
• The rough layout and proximity between objects, based on the sequence in which tags are given.

Implicit Tag Feature Definitions
• Word presence and absence: bag-of-words representation $W = [w_1, \ldots, w_N]$
• $w_i$ denotes the number of times that tag-word $i$ occurs in the image’s associated list of keywords, for a vocabulary of $N$ total possible words.
• For most tag lists, this vector consists of only binary entries indicating whether each tag has been named or not.

Implicit Tag Feature Definitions
• Tag rank: prominence of each object; certain things will be named before others. $R = [r_1, \ldots, r_N]$
• $r_i$ denotes the percentile rank of word $i$, relative to the ranks observed for that word in the training data (defined for the entire vocabulary).
• Some objects have context-independent “noticeability” (such as baby or fire truck) and are often named first regardless of their scale or position.
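The word-presence vector W and the tag-rank vector R can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, and the percentile convention used for R (the fraction of a word's training ranks that are at or after its rank in the current list, with 0 for absent words) is one plausible reading of "percentile rank", not necessarily the exact definition behind the slides.

```python
from bisect import bisect_left


def word_presence(tags, vocabulary):
    """W: number of times each vocabulary word occurs in the tag list."""
    return [tags.count(word) for word in vocabulary]


def training_ranks(training_tag_lists, vocabulary):
    """Collect, per word, the sorted 1-based positions it held in training lists."""
    ranks = {word: [] for word in vocabulary}
    for tag_list in training_tag_lists:
        for position, word in enumerate(tag_list, start=1):
            if word in ranks:
                ranks[word].append(position)
    return {word: sorted(positions) for word, positions in ranks.items()}


def tag_rank(tags, vocabulary, ranks):
    """R: percentile of each word's rank among its training ranks.

    Assumed convention: fraction of training ranks that are >= the rank
    in this list, so a word named earlier than usual scores close to 1.
    Absent words (or words never seen in training) get 0.
    """
    features = []
    for word in vocabulary:
        observed = ranks.get(word, [])
        if word not in tags or not observed:
            features.append(0.0)
            continue
        position = tags.index(word) + 1
        at_or_after = len(observed) - bisect_left(observed, position)
        features.append(at_or_after / len(observed))
    return features


# Toy data, invented for illustration only.
vocabulary = ["baby", "chair", "table"]
training = [["baby", "chair"], ["chair", "baby"], ["baby", "table"]]
ranks = training_ranks(training, vocabulary)

tags = ["chair", "baby"]
W = word_presence(tags, vocabulary)
R = tag_rank(tags, vocabulary, ranks)
print(W, R)
```

In this toy example "baby" is named second, which is later than in most training lists, so its entry in R is low, while "chair" is named first and receives the maximum value.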
Implicit Tag Feature Definitions
• Mutual tag proximity: the tagger will name prominent objects first, then move his/her eyes to other objects nearby. $P = \left[\frac{1}{p_{1,2}}, \frac{1}{p_{1,3}}, \ldots, \frac{1}{p_{1,N}}, \ldots, \frac{1}{p_{2,3}}, \ldots, \frac{1}{p_{N-1,N}}\right]$
• $p_{i,j}$ denotes the (signed) rank difference between tag words $i$ and $j$ for the given image.
• The entry is 0 when the pair is not present.

Modeling the localization distributions
• Relate the tag-based features to object detection: $T = W$, $R$, $P$ (or a combination).
• Model the conditional probability density that a window contains the object of interest, given only the image tags: $P_O(X \mid T)$
• $O$ is the target object category.

Modeling the localization distributions
• Use a mixture of Gaussians model: $P_O(X \mid T) = \sum_{i=1}^{m} \alpha_i \, \mathcal{N}(X; \mu_i, \Sigma_i)$
• $\alpha_i$, $\mu_i$, $\Sigma_i$ are the parameters of the mixture model, obtained from a trained Mixture Density Network (MDN).
[Figure: MDN training and classification; the classification input is a novel image with no bounding boxes, tagged Computer, Bicycle, Chair]
The MDN provides the mixture model representing the most likely locations for the target object.
[Figure: the top 30 most likely places for a car, sampled according to the modeled distribution based only on the tags of the images]

Modulating or Priming the detector
Use $P_O(X \mid T)$ from the previous step and either:
• Combine it with the predictions of an appearance-based object detector, $P_O(X \mid A)$, where $A$ denotes appearance cues: HOG, or a part-based detector (deformable part model) (“modulating”).
• Use the model to rank sub-windows and run the detector on the most probable locations only (“priming”).
• The decision value of the detector, $d(x, y, s)$, is mapped to a probability: $P_O(X = (x, y, s) \mid A) = \frac{1}{1 + \exp(-d(x, y, s))}$

Modulating the detector
• Balance appearance-based and tag-based predictions.
• Use all tag cues.
• Learn the weights $w$ using detection scores for true detections and for a number of randomly sampled background windows.
• A Gist descriptor can be added to compare against global scene visual context.
• Goal: improve accuracy.
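The pieces described above (evaluating the mixture density, mapping a detector score to a probability, and ranking windows for priming) can be sketched with NumPy as follows. This is a minimal sketch, not the authors' implementation: the function names, the toy mixture parameters, and the stand-in `detector` callable are all hypothetical, and in the actual method the mixture parameters would come from a trained MDN rather than being set by hand. The default stopping threshold of 0.5 is likewise a parameter of this sketch.

```python
import numpy as np


def mixture_density(X, alphas, mus, sigmas):
    """Evaluate sum_i alpha_i * N(X; mu_i, Sigma_i) for each row X = (x, y, s)."""
    X = np.atleast_2d(np.asarray(X, dtype=float))
    total = np.zeros(len(X))
    for alpha, mu, sigma in zip(alphas, mus, sigmas):
        diff = X - mu
        inv = np.linalg.inv(sigma)
        norm = np.sqrt((2.0 * np.pi) ** X.shape[1] * np.linalg.det(sigma))
        # Squared Mahalanobis distance of every row to the component mean.
        mahal = np.einsum("ni,ij,nj->n", diff, inv, diff)
        total += alpha * np.exp(-0.5 * mahal) / norm
    return total


def appearance_probability(d):
    """Map a detector decision value d(x, y, s) to a probability (sigmoid)."""
    return 1.0 / (1.0 + np.exp(-d))


def primed_search(windows, tag_density, detector, threshold=0.5):
    """Run the detector on windows in order of decreasing tag-based density.

    Stops at the first window whose appearance probability exceeds the
    threshold. Returns (window index, probability, windows scanned).
    """
    order = np.argsort(-np.asarray(tag_density))
    for n_scanned, idx in enumerate(order, start=1):
        p = appearance_probability(detector(windows[idx]))
        if p > threshold:
            return int(idx), float(p), n_scanned
    return None, 0.0, len(order)


# Toy example: three candidate windows (x, y, s) in normalized coordinates.
windows = np.array([[0.2, 0.2, 0.1], [0.5, 0.5, 0.3], [0.8, 0.8, 0.2]])
# Hand-set single-component mixture (a trained MDN would supply these).
alphas = [1.0]
mus = [np.array([0.5, 0.5, 0.3])]
sigmas = [0.01 * np.eye(3)]


def detector(window):
    # Hypothetical stand-in for an appearance-based detector score.
    return 2.0 if np.allclose(window, [0.5, 0.5, 0.3]) else -2.0


density = mixture_density(windows, alphas, mus, sigmas)
print(primed_search(windows, density, detector))
```

Because the tag-based density peaks at the window where the stand-in detector fires, the search stops after scanning a single window, which is the efficiency gain that priming is after.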
Priming the detector
• Prioritize the search windows according to $P_O(X \mid T)$.
• Assume that the object is present, so only the localization parameters $(x, y, s)$ have to be estimated.
• Stop the search when a confident detection is found (confidence > 0.5).
• Goal: improve efficiency.

Results
Datasets
• LabelMe: use the HOG detector
• PASCAL: use the part-based detector
[Table: dataset summary with rows L and P]
Note: The last three columns show the ranges of positions/scales present in the images, averaged per class, as a percentage of image size.

LabelMe Dataset
• Priming Object Search: Increasing Speed
For a detection rate of 0.6, the proposed method considers only 1/3 of the windows scanned by the sliding window approach.
• Modulating the Detector: Increasing Accuracy
The proposed features make noticeable improvements in accuracy over the raw detector.

Example detections on LabelMe
• Each image shows the best detection found.
• Scores denote the overlap ratio with the ground truth.
• The detectors modulated according to the visual or tag-based context are more accurate.

PASCAL Dataset
• Priming Object Search: Increasing Speed
Adopt the Latent SVM (LSVM) part-based windowed detector, which is faster here than the HOG detector was on LabelMe.
• Modulating the Detector: Increasing Accuracy
Augmenting the LSVM detector with the tag features noticeably improves accuracy, increasing the average precision by 9.2% overall.

Example detections on PASCAL VOC
• Red dotted boxes denote the most confident detections according to the raw detector (LSVM).
• Green solid boxes denote the most confident detections when modulated by our method (LSVM + tags).
• The first two rows show good results, and the third row shows failure cases.

Conclusions
• Novel approach that uses the information “between the lines” of tags.
• Utilizing this implicit tag information helps to make search faster and more accurate.
• The method complements, and even exceeds the performance of, methods using visual cues.
• Shows potential for learning the tendencies of real taggers.

Thank you!