Capturing Human Insight for Visual Learning
Kristen Grauman
Department of Computer Science, University of Texas at Austin
Frontiers in Computer Vision Workshop, MIT, August 22, 2011
Work with Sudheendra Vijayanarasimhan, Adriana Kovashka, Devi Parikh, Prateek Jain, Sung Ju Hwang, and Jeff Donahue

Problem: how to capture human insight about the visual world?
• The point-and-label "mold" is restrictive
• Human effort is expensive
[Figure: an annotator facing the complex space of visual objects, activities, and scenes; tiny image montage by Torralba et al.]

Our approach:
• Ask: actively learn what to request from the annotator
• Listen: exploit explanations, comparisons, implied cues, …

Deepening human communication to the system
Example queries: What is this? How do you know? What's worth mentioning? Which is more 'open'? Do you find him attractive? Why? Is it 'furry'? What property is changing here?
[Donahue & Grauman ICCV 2011; Hwang & Grauman BMVC 2010; Parikh & Grauman ICCV 2011, CVPR 2011; Kovashka et al. ICCV 2011]

Soliciting rationales
• We propose to ask the annotator not just what, but also why.
• Examples: Is the team winning? Is it a safe route? Is her form perfect? In each case, we also ask: How can you tell?

Soliciting rationales
Annotation task: Is her form perfect? How can you tell?
• Spatial rationale: the annotator marks the image regions that support the answer.
• Attribute rationale: the annotator names the attributes that support the answer (e.g., pointed toes, balanced, falling, knee angled).
• Each rationale yields a synthetic contrast example, which in turn influences the classifier (a minimal code sketch appears at the end of this part).
[Zaidan et al. HLT 2007] [Donahue & Grauman, ICCV 2011]

Rationale results
• Scene Categories: How can you tell the scene category?
• Hot or Not: What makes them hot (or not)?
• Public Figures: What attributes make them (un)attractive?
Rationales were collected from hundreds of MTurk workers. [Donahue & Grauman, ICCV 2011]

Rationale results [Donahue & Grauman, ICCV 2011]

Scenes (mean AP)    Originals   +Rationales
Kitchen             0.1196      0.1395
Living Rm           0.1142      0.1238
Inside City         0.1299      0.1487
Coast               0.4243      0.4513
Highway             0.2240      0.2379
Bedroom             0.3011      0.3167
Street              0.0778      0.0790
Country             0.0926      0.0950
Mountain            0.1154      0.1158
Office              0.1051      0.1052
Tall Building       0.0688      0.0689
Store               0.0866      0.0867
Forest              0.3956      0.4006

Hot or Not          Originals   +Rationales
Male                54.86%      60.01%
Female              55.99%      57.07%

PubFig              Originals   +Rationales
Male                64.60%      68.14%
Female              51.74%      55.65%

Learning what to mention
• Issue: presence of objects ≠ significance.
• Our idea: learn a cross-modal representation that accounts for "what to mention."
Training data: human-given descriptions (example image tags: Architecture, Water, Cow, Birds, Sky, Tiles)
• Visual cues: texture, scene, color, …
• Textual cues: word frequency, relative order, mutual proximity

Learning what to mention
The visual view x and the textual view y are mapped into a common importance-aware semantic space. [Hwang & Grauman, BMVC 2010]

Learning what to mention: results
[Figure: retrieval comparison for query images: visual-only baseline vs. words + visual vs. our method.] [Hwang & Grauman, BMVC 2010]
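To make the Soliciting rationales part above concrete, here is a minimal, hedged sketch of the contrast-example construction in the spirit of Zaidan et al. (HLT 2007), which Donahue & Grauman adapt to visual features: each rationale produces a synthetic contrast example with the rationale features removed, and the requirement that an original out-scores its contrast by a margin is folded into an ordinary linear SVM as weighted pseudo-examples. The mask format, the weight value, and the function names are assumptions for illustration, not the papers' exact formulation.

```python
import numpy as np
from sklearn.svm import LinearSVC

def add_contrast_pseudo_examples(X, y, rationale_masks, mu=1.0, contrast_weight=0.5):
    """Fold rationale-based contrast constraints into a standard SVM training set.

    For each labeled example x whose annotator gave a rationale, a synthetic
    contrast example v is built by zeroing the features the rationale marked.
    Requiring the original to out-score its contrast by a margin mu, i.e.
    w.(x - v) >= mu for positives (and <= -mu for negatives), is equivalent to
    adding (x - v) / mu as a pseudo-example with the same label (Zaidan et al.).
    `rationale_masks[i]` is a boolean array over features, or None when no
    rationale was given (an assumed input format for this sketch).
    """
    X_extra, y_extra = [], []
    for x, label, mask in zip(X, y, rationale_masks):
        if mask is None or not mask.any():
            continue
        v = x.copy()
        v[mask] = 0.0                      # synthetic contrast example
        X_extra.append((x - v) / mu)       # pseudo-example enforcing the margin
        y_extra.append(label)
    if not X_extra:
        return X, y, np.ones(len(y))
    X_aug = np.vstack([X, np.array(X_extra)])
    y_aug = np.concatenate([y, y_extra])
    # Pseudo-examples get a smaller weight, playing the role of a separate
    # slack penalty on the rationale constraints.
    weights = np.concatenate([np.ones(len(y)), contrast_weight * np.ones(len(y_extra))])
    return X_aug, y_aug, weights

# Usage sketch: y in {+1, -1}, X is an (n, d) feature matrix.
# (For simplicity the SVM intercept is left on; a more faithful version would
#  exclude it for the difference vectors.)
# X_aug, y_aug, w = add_contrast_pseudo_examples(X, y, masks)
# clf = LinearSVC(C=1.0).fit(X_aug, y_aug, sample_weight=w)
```

Dividing the difference vector by the margin mu turns each rationale constraint into a standard SVM constraint, which is why no custom solver is needed in this simplified version.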
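Similarly, the importance-aware semantic space of Hwang & Grauman (BMVC 2010) is learned with kernel CCA between the visual view and importance-aware tag features. As a rough stand-in, the sketch below uses plain linear CCA from scikit-learn and omits both the kernelization and the exact tag encodings; function names and input shapes are hypothetical.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

def fit_semantic_space(X_vis, X_tag, n_components=10):
    """Learn a shared space where correlated visual and tag dimensions align.

    X_vis: (n_images, d_vis) visual descriptors (e.g., texture/scene/color).
    X_tag: (n_images, d_tag) tag descriptors encoding importance cues such as
    word frequency, relative order, and mutual proximity (assumed encodings).
    """
    cca = CCA(n_components=n_components)
    cca.fit(X_vis, X_tag)
    return cca

def retrieve(cca, X_vis_query, X_vis_db, k=5):
    """Rank database images for each query by cosine similarity in the shared space."""
    q = cca.transform(X_vis_query)        # project the visual view only
    db = cca.transform(X_vis_db)
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    db = db / np.linalg.norm(db, axis=1, keepdims=True)
    return np.argsort(-(q @ db.T), axis=1)[:, :k]
```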
Problem: how to capture human insight about the visual world?
Our approach, revisited: Ask (actively learn) and Listen (explanations, comparisons, implied cues).

Traditional active learning
At each cycle, obtain a label for the most informative or uncertain example.
[Mackay 1992, Freund et al. 1997, Tong & Koller 2001, Lindenbaum et al. 2004, Kapoor et al. 2007, …]
[Figure: loop between the current model, active selection over the unlabeled data, the annotator, and the labeled data.]

Challenges in active visual learning
• Annotation tasks vary in cost and information content
• Multiple annotators work in parallel
• Massive unlabeled pools of data
[Figure: the same active learning loop, now with varying annotation costs.]
[Vijayanarasimhan & Grauman NIPS 2008, CVPR 2009; Vijayanarasimhan et al. CVPR 2010, CVPR 2011; Kovashka et al. ICCV 2011]

Sub-linear time active selection
We propose a novel hashing approach to identify the most uncertain examples in sub-linear time: the current classifier is hashed against a hash table over the unlabeled data to retrieve actively selected examples. For 4.5 million unlabeled instances, this takes about 10 minutes of machine time per iteration, vs. 60 hours for a naïve scan. (A simplified code sketch appears after the summary.)
[Jain, Vijayanarasimhan, Grauman, NIPS 2010]

Live active learning results
On a Flickr test set, live active learning outperforms the status quo data collection approach. [Vijayanarasimhan & Grauman, CVPR 2011]

Summary
• Humans are not simply "label machines"
• Widen access to visual knowledge
  – New forms of input, often requiring associated new learning algorithms
• Manage large-scale annotation efficiently
  – Cost-sensitive active question asking
• Live learning: moving beyond canned datasets
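As noted in the sub-linear selection part above, here is a simplified sketch of hashing-based active selection in the spirit of Jain, Vijayanarasimhan & Grauman (NIPS 2010). It is a sketch under assumptions (a single hash table, an arbitrary bit count, invented class and method names), not the paper's implementation, which uses multiple tables and tuned parameters.

```python
import numpy as np

class HyperplaneHash:
    """Hash an unlabeled pool so the current SVM hyperplane can fetch
    near-boundary points without a linear scan.

    Loosely follows the two-bit hyperplane hash: a point x gets bits
    [sign(u.x), sign(v.x)] and a query hyperplane with normal w gets bits
    [sign(u.w), sign(-v.w)], so points nearly perpendicular to w (i.e., close
    to the decision boundary) collide with the query more often.
    """

    def __init__(self, dim, n_bits=8, seed=0):
        rng = np.random.default_rng(seed)
        self.U = rng.standard_normal((n_bits, dim))
        self.V = rng.standard_normal((n_bits, dim))
        self.table = {}

    def _point_key(self, x):
        return tuple(np.concatenate([(self.U @ x > 0), (self.V @ x > 0)]).astype(int))

    def _query_key(self, w):
        return tuple(np.concatenate([(self.U @ w > 0), (self.V @ -w > 0)]).astype(int))

    def index(self, X):
        """Index the unlabeled pool once (X is an (n, d) array)."""
        for i, x in enumerate(X):
            self.table.setdefault(self._point_key(x), []).append(i)

    def most_uncertain(self, w, X):
        """Among points colliding with the hyperplane query, return the index
        of the one closest to the decision boundary (or None on a miss)."""
        hits = self.table.get(self._query_key(w), [])
        if not hits:
            return None
        return min(hits, key=lambda i: abs(float(X[i] @ w)))

# Usage sketch: index once, then each active-learning round hash the current
# classifier's weight vector w to get a candidate to send to the annotator.
# hh = HyperplaneHash(dim=X_pool.shape[1]); hh.index(X_pool)
# i = hh.most_uncertain(w, X_pool)
```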