ACCOUNTING FOR THE RELATIVE IMPORTANCE OF OBJECTS IN IMAGE RETRIEVAL
Sung Ju Hwang and Kristen Grauman
University of Texas at Austin

Image retrieval
Content-based retrieval from an image database: given a query image, rank the database images (Image 1, Image 2, ..., Image k) by relevance.

Relative importance of objects
Which image is more relevant to the query?
[Figure: a query image tagged {cow, fence, mud} compared against database images tagged {water, sky, bird} and {bird, cow, water}]

Relative importance of objects
An image can contain many different objects (e.g., architecture, sky, mountain, bird, water, cow), but some are more "important" than others:
• Some objects are background.
• Some objects are less salient.
• Some objects are more prominent, or perceptually define the scene.

Our goal
Retrieve those images that share important objects with the query image.
[Figure: two contrasting retrieval sets for the same query]
How can we learn a representation that accounts for this?

Idea: image tags as importance cue
The order in which a person assigns tags provides implicit cues about the importance of objects to the scene.
TAGS: Cow, Birds, Architecture, Water, Sky
We learn this connection to improve cross-modal retrieval and CBIR.

Related work
Previous work using tagged images focuses on the noun ↔ object correspondence: Duygulu et al. 2002; Berg et al. 2004; Fergus et al. 2005; Li et al. 2009; Lavrenko et al. 2003; Monay & Gatica-Perez 2003; Barnard et al. 2004; Schroff et al. 2007; Gupta & Davis 2008; among others.
Related work also builds richer image representations from "two-view" text+image data: Gupta et al. 2008; Hardoon et al. 2004; Blaschko & Lampert 2008; Bekkerman & Jeon 2007; Qi et al. 2009; Quack et al. 2008; Quattoni et al. 2007; Yakhnenko & Honavar 2009; among others.

Approach overview: Building the image database
Tagged training images (e.g., "Cow, Grass"; "Horse, Grass"; "Car, House, Grass, Sky") → extract visual and tag-based features → learn projections from each feature space into a common "semantic space".

Approach overview: Retrieval from the database
An untagged query image or a tag-list query (e.g., "Cow, Tree, Grass") is projected into the semantic space and matched against the image database, returning retrieved images or a retrieved tag-list. This supports:
• Image-to-image retrieval
• Image-to-tag auto annotation
• Tag-to-image retrieval

Dual-view semantic space
Visual features and tag-lists are two views generated by the same underlying concept; the semantic space is the space the two views share.

Learning mappings to semantic space
Canonical Correlation Analysis (CCA): choose projection directions that maximize the correlation of the two views (View 1, View 2) projected from the same instance. The semantic space is the resulting new common feature space. (A code sketch follows the feature slides below.)

Kernel Canonical Correlation Analysis [Akaho 2001; Fyfe et al. 2001; Hardoon et al. 2004]
Linear CCA: given paired data $\{(x_i, y_i)\}_{i=1}^{N}$, select directions $w_x, w_y$ so as to maximize

$$\rho = \max_{w_x, w_y} \frac{w_x^\top C_{xy} w_y}{\sqrt{(w_x^\top C_{xx} w_x)\,(w_y^\top C_{yy} w_y)}},$$

where $C_{xx}$, $C_{yy}$, $C_{xy}$ are the within- and between-view covariance matrices.
Kernel CCA: given a pair of kernel functions $k_x$ and $k_y$, optimize the same objective, but with projections in kernel space, $f(x) = \sum_i \alpha_i k_x(x_i, x)$ and $g(y) = \sum_i \beta_i k_y(y_i, y)$, which in terms of the training kernel matrices becomes

$$\rho = \max_{\alpha, \beta} \frac{\alpha^\top K_x K_y \beta}{\sqrt{(\alpha^\top K_x^2 \alpha)\,(\beta^\top K_y^2 \beta)}}.$$

Building the kernels for each view
The visual kernels on one side and the word frequency/rank kernels on the other each feed a projection into the semantic space.

Visual features
• Gist: captures the total scene structure [Torralba et al.]
• Color histogram: captures the HSV color distribution
• Visual words: captures local appearance (k-means on DoG+SIFT)
Average the component χ2 kernels to build a single visual kernel.

Tag features: Word Frequency
Traditional bag-of-(text)words.

tag     Cow   Bird   Water   Architecture   Mountain   Sky    Car   Person
count   1     1      1       1              1          1      0     0

Tag features: Absolute Rank
Absolute rank of the word in this image's tag-list, mapped to a score that decays with rank r (the values below match 1/log2(1+r)).

tag     Cow   Bird   Water   Architecture   Mountain   Sky    Car   Person
value   1     0.63   0.50    0.43           0.39       0.36   0     0

Tag features: Relative Rank
Percentile rank obtained from the rank distribution of that word in all tag-lists.

tag     Cow   Bird   Water   Architecture   Mountain   Sky    Car   Person
value   0.9   0.6    0.8     0.5            0.8        0.8    0     0

Average the component χ2 kernels to build a single tag kernel.
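To make the three tag features concrete, here is a minimal sketch (our own illustration, not the authors' code) of computing them for one image. The 1/log2(1+r) decay for the absolute-rank feature is inferred from the example values above, and the precomputed rank_percentiles statistic is a hypothetical interface for the relative-rank feature.

```python
import numpy as np

def tag_features(tag_list, vocabulary, rank_percentiles):
    """Compute the three tag-based feature vectors for one image.

    tag_list         : tags in the order the annotator gave them.
    vocabulary       : list of all words in the tag vocabulary.
    rank_percentiles : dict mapping (word, rank) -> percentile of that rank
                       in the word's rank distribution over all training
                       tag-lists (a precomputed corpus statistic).
    """
    V = len(vocabulary)
    word_freq = np.zeros(V)   # bag-of-words counts
    abs_rank = np.zeros(V)    # decays with position in this tag-list
    rel_rank = np.zeros(V)    # percentile of this rank for this word
    for r, tag in enumerate(tag_list, start=1):
        if tag not in vocabulary:
            continue
        j = vocabulary.index(tag)
        word_freq[j] += 1
        if abs_rank[j] == 0.0:                  # keep the first (top) rank
            abs_rank[j] = 1.0 / np.log2(1 + r)  # rank 1 -> 1.0, rank 3 -> 0.5
            rel_rank[j] = rank_percentiles.get((tag, r), 0.0)
    return word_freq, abs_rank, rel_rank

# Reproduces the example rows above:
vocab = ["Cow", "Bird", "Water", "Architecture", "Mountain", "Sky", "Car", "Person"]
wf, ar, rr = tag_features(vocab[:6], vocab, {})
```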
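Each view is then summarized by averaging one χ2 kernel per feature type. A minimal sketch, assuming the exponential χ2 kernel as implemented by scikit-learn, equal component weights, and nonnegative (histogram-like) features:

```python
import numpy as np
from sklearn.metrics.pairwise import chi2_kernel

def averaged_chi2_kernel(feature_sets, gamma=1.0):
    """Average one exponential chi^2 kernel per feature type into a
    single kernel matrix for the view.

    feature_sets : list of (n_images, dim) nonnegative feature arrays,
                   one per feature type (e.g., Gist, color histogram,
                   visual-word histogram for the visual view).
    """
    kernels = [chi2_kernel(F, gamma=gamma) for F in feature_sets]
    return np.mean(kernels, axis=0)

# e.g. K_visual = averaged_chi2_kernel([gist, color_hist, visual_words])
#      K_tag    = averaged_chi2_kernel([word_freq, abs_rank, rel_rank])
```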
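Finally, the KCCA objective sketched earlier can be solved as the usual regularized generalized eigenproblem (following Hardoon et al. 2004). This is an illustrative implementation under assumed conventions (the regularization constant and the eigensolver are our choices), not the authors' code:

```python
import numpy as np
from scipy.linalg import eigh

def kcca(Kx, Ky, reg=1e-3, n_components=30):
    """Regularized kernel CCA on two n x n training kernel matrices.

    Solves  [0, KxKy; KyKx, 0] v = rho * diag((Kx+regI)^2, (Ky+regI)^2) v
    and returns the top dual weight vectors (alphas, betas).
    """
    n = Kx.shape[0]
    I = np.eye(n)
    A = np.zeros((2 * n, 2 * n))
    A[:n, n:] = Kx @ Ky
    A[n:, :n] = Ky @ Kx
    B = np.zeros((2 * n, 2 * n))
    B[:n, :n] = (Kx + reg * I) @ (Kx + reg * I)
    B[n:, n:] = (Ky + reg * I) @ (Ky + reg * I)
    vals, vecs = eigh(A, B)              # generalized symmetric eigenproblem
    top = np.argsort(vals)[::-1][:n_components]
    return vecs[:n, top], vecs[n:, top]  # alphas (visual side), betas (tag side)

def project(K_new_train, weights):
    """Map new examples into the semantic space.

    K_new_train : (n_new, n_train) kernel values between new examples and
                  the training set; weights: alphas or betas from kcca().
    """
    return K_new_train @ weights
```

Retrieval then reduces to nearest-neighbor search in the shared space: project an untagged query image with the visual-side weights, project database images or tag-lists likewise, and rank by distance. The same machinery serves the image-to-image, tag-to-image, and image-to-tag tasks evaluated below.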
Recap: Building the image database
The visual feature space and the tag feature space are both projected into the common semantic space.

Experiments
We compare the retrieval performance of our method with two baselines:
• Visual-only baseline: ranks the database by visual kernel similarity alone.
• Words+Visual baseline: a KCCA semantic space built from word frequency features only, disregarding importance cues [Hardoon et al. 2004; Yakhnenko & Honavar 2009].
[Figure: query images with the first retrieved image for each method]

Evaluation
We use Normalized Discounted Cumulative Gain at top K (NDCG@K) to evaluate retrieval performance [Kekäläinen & Järvelin, 2002]:

$$\mathrm{NDCG@K} = \frac{1}{\mathrm{IDEAL}} \sum_{p=1}^{K} \frac{s(p)}{\log_2(p+1)},$$

where s(p) is the reward term scoring the pth-ranked example, and IDEAL is the sum of all the scores for the perfect ranking (the normalization). Doing well in the top ranks is more important.

Evaluation
We present the NDCG@K score using two different reward terms:
• Object presence/scale: rewards similarity between the query's objects and scales and those in the retrieved image(s) (e.g., presence and scale of Cow, Tree, Grass).
• Ordered tag similarity: rewards similarity between the query's ground-truth tag ranks (absolute and relative) and those in the retrieved image(s) (e.g., Person, Cow, Tree, Fence, Grass).
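A minimal sketch of NDCG@K under this definition (our own illustration; the reward function s(p) is supplied by the caller, e.g. object-presence/scale or ordered-tag similarity):

```python
import numpy as np

def ndcg_at_k(scores, k):
    """NDCG@K for one query.

    scores : reward s(p) of every database example in the order the
             system ranked them; higher is better.
    """
    scores = np.asarray(scores, dtype=float)
    k = min(k, len(scores))
    discounts = 1.0 / np.log2(np.arange(2, k + 2))   # 1/log2(p+1), p = 1..k
    dcg = float(np.sum(scores[:k] * discounts))
    # Perfect ranking: the same rewards sorted in decreasing order.
    ideal = float(np.sum(np.sort(scores)[::-1][:k] * discounts))
    return dcg / ideal if ideal > 0 else 0.0
```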
Dataset
• LabelMe: 6,352 images (database: 3,799; query: 2,553). Scene-oriented. Ordered tag lists come from the order in which labels were added; 56 unique taggers, ~23 tags/image.
• PASCAL: 9,963 images (database: 5,011; query: 4,952). Object-central. Ordered tag lists obtained on Mechanical Turk; 758 unique taggers, ~5.5 tags/image.

Image-to-image retrieval
We want to retrieve the images most similar to the given query image in terms of object importance: the untagged query is mapped from the visual kernel space into the semantic space and matched against the image database there.

Image-to-image retrieval results
[Figure: example queries with the top retrievals of the visual-only baseline, the Words+Visual baseline, and our method]
Our method better retrieves images that share the query's important objects, by both measures, with a 39% improvement.
[Plots: retrieval accuracy measured by object+scale similarity and by ordered tag-list similarity]

Tag-to-image retrieval
We want to retrieve the images that are best described by a given tag list (e.g., query tags: Cow, Person, Tree, Grass): the query is mapped from the tag-list kernel space into the semantic space and matched against the image database.

Tag-to-image retrieval results
Our method better respects the importance cues implied by the user's keyword query, with a 31% improvement.

Image-to-tag auto annotation
We want to annotate a query image with ordered tags that best describe the scene: the untagged query is mapped from the visual kernel space into the semantic space, and the tag-lists of its nearest database neighbors (e.g., "Cow, Field"; "Cow, Tree"; "Cow, Grass"; "Grass, Fence") are output.

Image-to-tag auto annotation results
[Figure: example images with predicted ordered tag-lists, e.g. "Tree, Boat, Grass, Water, Person"; "Boat, Person, Water, Sky, Rock"; "Person, Tree, Car, Chair, Window"; "Bottle, Knife, Napkin, Light, Fork"]

Method        k=1      k=3      k=5      k=10
Visual-only   0.0826   0.1765   0.2022   0.2095
Word+Visual   0.0818   0.1712   0.1992   0.2097
Ours          0.0901   0.1936   0.2230   0.2335
(k = number of nearest neighbors used)

Implicit tag cues as localization prior [Hwang & Grauman, CVPR 2010]
Training: learn an object-specific connection between localization parameters and implicit tag features, P(location, scale | tags), from tagged training images (e.g., tag-lists such as "Computer, Poster, Desk, Screen, Mug"; "Desk, Mug, Office"; "Table, Mug, Ladder"; "Mug, Coffee").
Testing: given a novel image, localize objects based on both tags and appearance, combining an object detector with the implicit tag features (e.g., tags "Mug, Key, Keyboard, Toothbrush, Pen, Photo, Post-it").

Conclusion
• We want to learn what is implied, beyond which objects are present, by how a human provides tags for an image.
• The approach requires minimal supervision to learn the connection between the importance conveyed by tags and visual features.
• It yields consistent gains over:
  • content-based visual search
  • a tag+visual approach that disregards importance

THANK YOU