Using geometry and related things

[Roadmap figure, from qualitative to explicit, increasingly quantitative and precise: region labels; + boundaries and objects; stronger geometric constraints from domain knowledge; reasoning on aspects and poses; 3D point clouds]

[D. Hoiem, A. A. Efros, and M. Hebert. Recovering surface layout from an image. IJCV, 75(1):151–172, 2007]
• What level of representation? How qualitative?
• What type of training information is available?
• Assumptions (camera geometry, etc.)?

Learning from image features to depth + MRF: A. Saxena et al. 3-D depth reconstruction from a single still image. IJCV, 76, 2007.
Make3D: A. Saxena et al. Learning 3D Scene Structure from a Single Still Image. TPAMI, 2010.
Stage classes: V. Nedovic, A. Smeulders, A. Redert, J. Geusebroek. Stages as models of scene geometry. PAMI, 2010.

Next: Can coarse surface labels be used for improving object recognition and scene analysis performance through better geometric reasoning?

[Figure labels: Object Detection, Surface Estimates, Viewpoint Prior, Local Car Detector, Local Ped Detector]
• From geometry to objects and back?
• Distributions versus decisions?
D. Hoiem, A. Efros, M. Hebert. Putting objects in perspective. IJCV 2009.
S. Y. Bao, M. Sun, S. Savarese. Toward Coherent Object Detection And Scene Layout Understanding. CVPR 2010.
B. Leibe, N. Cornelis, K. Cornelis, and L. Van Gool. Dynamic 3D Scene Analysis from a Moving Vehicle. CVPR 2007.

• Is a more precise representation possible?
• Boundaries, interposition, relative depth ordering…
• Still low-level: can we combine reasoning about semantic labels?

[Roadmap figure, repeated, with one addition: + more geometric relations and constraints from semantic labels]

[Figure: iterative refinement, Iter 1, Iter 2]

Toward true integration of geometric and semantic cues: global optimization
• How to beat the intractable nature of the problem?
• What features/cues can be used?

[Equation: F(D | I, L) written as a sum of three terms, one over single depths (d), one over triples (d, d, d) and one over pairs (d, d); potential symbols and indices lost in extraction]
[Equation: F(· | I, L, S), also a sum of three terms; arguments lost in extraction]

D. Hoiem, A. A. Efros, and M. Hebert. Closing the loop on scene interpretation. CVPR, 2008. [Figure panel: Final]
B. Liu, S. Gould, D. Koller. Single image depth estimation from predicted semantic labels. CVPR 2010.
S. Gould, R. Fulton, D. Koller. Decomposing a scene into geometric and semantically consistent regions. ICCV 2009.

[Roadmap figure, repeated]
• Still mostly bottom-up classification approach
• No use of domain constraints or constraints governing the physical world

Score
• How to generate and search through hypotheses (in a tractable manner)?
• How to evaluate the score?
• How to avoid early decisions?
• How to represent constraints in a more general way?

D. Lee, T. Kanade, M. Hebert. Geometric Reasoning for Single Image Structure Recovery. CVPR 2009. [Figure labels: Lines, Faces]
V. Hedau, D. Hoiem, D. Forsyth. Recovering the Spatial Layout of Cluttered Rooms. IEEE International Conference on Computer Vision (ICCV), 2009.
H. Wang, S. Gould, D. Koller. Discriminative learning with latent variables for cluttered indoor scene understanding. ECCV 2010.

Classifiers: $f(x, y, w) = w^T \psi(x, y)$
• $\psi(x, y)$: feature vector measuring agreement between lines, faces and labels
• $w$: learned weight vector

[Roadmap figure, repeated, with one addition: + more constraints]

• Finite volume
• Spatial exclusion
• Containment
• Stability
• Contact
• Proximity
• …
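To make the linear scoring function f(x, y, w) = w^T psi(x, y) concrete, here is a minimal sketch of ranking a handful of layout hypotheses by that score. It is only an illustration: the feature map is written `psi`, and the weights, the three-dimensional feature vectors and the hypothesis names (`layout_a`, etc.) are made-up values, not taken from any of the cited papers, which learn w from data and compute the features from lines, faces and labels.

```python
def score(w, psi):
    """Linear score f(x, y, w) = w^T psi(x, y) for one hypothesis.

    `w` is the weight vector and `psi` the feature vector measuring how well
    the hypothesis y agrees with the image evidence x (both plain lists).
    """
    if len(w) != len(psi):
        raise ValueError("weight and feature vectors must have equal length")
    return sum(wi * fi for wi, fi in zip(w, psi))


def best_hypothesis(w, candidates):
    """Return (name, score) of the highest-scoring candidate.

    `candidates` maps a hypothesis name to its feature vector psi(x, y).
    """
    scored = ((name, score(w, psi)) for name, psi in candidates.items())
    return max(scored, key=lambda pair: pair[1])


# Hypothetical weights and features, chosen only for illustration.
w = [1.0, 0.5, 2.0]
candidates = {
    "layout_a": [0.8, 0.4, 0.5],
    "layout_b": [0.6, 0.9, 0.7],
    "layout_c": [0.2, 0.1, 0.9],
}

best_name, best_score = best_hypothesis(w, candidates)
print(best_name, round(best_score, 2))
```

Exhaustive scoring like this is only feasible when the hypothesis set is small, which is why the questions above about generating and searching hypotheses tractably matter.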
• Search through hypothesis space

[Figure labels: input image; line segments and vanishing points; room hypotheses; geometric context; orientation map; object hypotheses; reject invalid configurations]

$f(x, y) = w^T \psi(x, y) + w_\phi^T \phi(y)$
• $x$: features from image (surface labels, vanishing points, etc.)
• $y$: hypothesis (scene layout + object hypothesis)
• $w^T \psi(x, y)$: compatibility of image data with geometric configuration
• $w_\phi^T \phi(y)$: penalty term for incompatible configurations
D. Lee, A. Gupta, M. Hebert, and T. Kanade. Estimating Spatial Layout of Rooms using Volumetric Reasoning about Objects and Surfaces. Advances in Neural Information Processing Systems (NIPS), Vol. 24, 2010.

[Figure: original image, surface layout, density map, bag of segments, block catalogue (frontal, front-right, front-left, left-right, left-occluded, right-occluded, porous, solid) and 3D parse graph with relations such as above, in front, supported and point-supported, and density levels medium and high]
A. Gupta, A. Efros, and M. Hebert. Blocks World Revisited: Image Understanding Using Qualitative Geometry and Mechanics. ECCV 2010. http://www.cs.cmu.edu/~abhinavg/blocksworld

• Direct search through hypothesis space
• Sampling
L. Del Pero, J. Guan, E. Brau, J. Schlecht, K. Barnard. Sampling Bedrooms. CVPR 2011.
Diffusion moves:
• Sample room boundary: change r = (x, y, z, w, h, l, [one symbol lost in extraction])
• Sample camera: change c (parameter list garbled; only f survives)
• Sample object parameters: change o = (x, y, z, w, h, l)
• Sample over a block edge only
Jump moves: (details not recoverable)

• Direct search through hypothesis space
• Sampling
• (Constrained) object detection
$P(O_1, \ldots, O_N, L, H \mid I) = P(H)\, P(L \mid H, I) \prod_i P(O_i \mid L, H, I)$
V. Hedau, D. Hoiem, D. Forsyth. Thinking Inside the Box: Using Appearance Models and Context Based on Room Geometry. European Conference on Computer Vision (ECCV), 2010.

• Direct search through hypothesis space
• Sampling
• Object detection
• Grammars
[Figure labels: Truth, Max; Window, Wall, Balcony, Door, Roof]
L. Simon, O. Teboul, P. Koutsourakis and N. Paragios. Random Exploration of the Procedural Space for Single-View 3D Modeling of Buildings. International Journal of Computer Vision (IJCV), 2010.
O. Teboul, I. Kokkinos, P. Koutsourakis, L. Simon and N. Paragios. Shape Grammar Parsing via Reinforcement Learning. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011.

Direct search through hypothesis space, sampling, object detection, grammars:
• What constraints?
• How to combine low-level classifiers with "higher-level" reasoning?
• What are the right tools to search through and score hypotheses?
• How to represent partial interpretations without early commitment?

G. Tsai et al. Realtime Indoor Scene Understanding using Bayesian Filtering with Motion Cues. ICCV 2011.

[Roadmap figure, repeated, with additions: + sparse/partial 3D data; + more constraints; + other constraints; + (large) prior data]

• What level of representation? Do we need an explicit parse of the input, or how far can we go with associations?
  – Hierarchical, region labels, associations, ...
• How to incorporate knowledge/context external to the input image?
  – Task, geometry, contextual info (scene type, location), text, ...
• Should we use global models or sequences of simpler models? Is the problem too hard as posed, i.e., intractable?
• What are the right definitions of actions, activities, behaviors?
• How to combine temporal (actions) and spatial (scenes, objects) information effectively?
• What should be evaluated, and at what stage?
  – Bounding boxes, pixelwise labels, 3D models, actions/behaviors, predictions?
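As one concrete reading of the "generate hypotheses, reject invalid configurations, score the rest" pipeline from the volumetric-reasoning slides, the sketch below applies two of the listed constraints, containment and spatial exclusion, to axis-aligned 3D boxes. Everything in it is an assumption made for illustration: the box encoding, the helper names (`contains`, `overlaps`, `is_valid`, `best_valid`), the room and furniture coordinates, and the `image_score` numbers, which merely stand in for a learned image-compatibility score. The cited systems use richer geometry and learned scoring.

```python
# A box is (xmin, ymin, zmin, xmax, ymax, zmax); this encoding is an assumption.

def contains(room, box):
    """Containment: the box lies entirely inside the room volume."""
    return (all(room[i] <= box[i] for i in range(3))
            and all(box[i + 3] <= room[i + 3] for i in range(3)))


def overlaps(a, b):
    """Spatial exclusion test: True if two boxes share interior volume.

    Strict inequalities mean boxes that merely touch do not count as overlapping.
    """
    return all(a[i] < b[i + 3] and b[i] < a[i + 3] for i in range(3))


def is_valid(room, objects):
    """A configuration is valid if all objects are inside the room and disjoint."""
    if not all(contains(room, obj) for obj in objects):
        return False
    for i in range(len(objects)):
        for j in range(i + 1, len(objects)):
            if overlaps(objects[i], objects[j]):
                return False
    return True


def best_valid(hypotheses):
    """Reject invalid hypotheses, then return the best-scoring survivor (or None)."""
    valid = [h for h in hypotheses if is_valid(h["room"], h["objects"])]
    return max(valid, key=lambda h: h["image_score"]) if valid else None


# Hypothetical room and furniture boxes, in arbitrary units.
room = (0, 0, 0, 4, 3, 5)
bed = (0, 0, 0, 2, 1, 3)
desk_ok = (2.5, 0, 0, 4, 1, 1)
desk_clash = (1, 0, 1, 3, 1, 2)      # intersects the bed
desk_outside = (3, 0, 4, 5, 1, 5)    # sticks out of the room

hypotheses = [
    {"name": "A", "room": room, "objects": [bed, desk_clash], "image_score": 0.9},
    {"name": "B", "room": room, "objects": [bed, desk_ok], "image_score": 0.7},
    {"name": "C", "room": room, "objects": [bed, desk_outside], "image_score": 0.8},
    {"name": "D", "room": room, "objects": [bed], "image_score": 0.6},
]

best = best_valid(hypotheses)
print(best["name"])
```

Hard rejection like this is the simplest way to use a constraint; a penalty term for incompatible configurations is the softer alternative, where a violation lowers the score instead of discarding the hypothesis outright.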