Learning Local Affine Representations for Texture and Object Recognition
Svetlana Lazebnik, Beckman Institute, University of Illinois at Urbana-Champaign
(joint work with Cordelia Schmid and Jean Ponce)

Overview
• Goal:
  – Recognition of 3D textured surfaces and object classes
• Our contribution:
  – Texture and object representations based on local affine regions
• Advantages of the proposed approach:
  – Distinctive, repeatable primitives
  – Robustness to clutter and occlusion
  – Ability to approximate 3D geometric transformations

The Scope
1. Recognition of single-texture images (CVPR 2003)
2. Recognition of individual texture regions in multi-texture images (ICCV 2003)
3. Recognition of object classes (BMVC 2004, work in progress)

1. Recognition of Single-Texture Images

Affine Region Detectors
• Harris detector (H)
• Laplacian detector (L)
• References: Mikolajczyk & Schmid (2002), Gårding & Lindeberg (1996)

Affine Rectification Process
[Figure: Patch 1 and Patch 2 mapped to rectified patches; a rotational ambiguity remains]

Rotation-Invariant Descriptors 1: Spin Images
• Based on range spin images (Johnson & Hebert 1998)
• Two-dimensional histogram: distance from center × intensity value
• (A minimal code sketch of this descriptor appears at the end of this section.)

Rotation-Invariant Descriptors 2: RIFT
• Based on SIFT (Lowe 1999)
• Two-dimensional histogram: distance from center × gradient orientation
• Gradient orientation is measured w.r.t. the direction pointing outward from the center of the patch

Signatures and EMD
• Signature: S = {(m1, w1), …, (mk, wk)}, where mi is a cluster center and wi its relative weight
• Earth Mover's Distance (Rubner et al. 1998)
  – Computed from ground distances d(mi, m'j)
  – Can compare signatures of different sizes
  – Insensitive to the number of clusters

Database: Textured Surfaces
• 25 textures, 40 sample images each (640×480)

Evaluation
• Channels: HS, HR, LS, LR
  – Combined through addition of EMD matrices
• Classification results:
  – 10 training images per class, rates averaged over 200 random training subsets

Comparative Evaluation (our method vs. Varma & Zisserman 2003)
• Spatial selection: Harris and Laplacian detectors vs. none (every pixel location is used)
• Neighborhood shape selection: affine adaptation vs. none (the support of the descriptors is fixed)
• Descriptors: spin images and RIFT vs. raw pixel values
• Textons: a separate set of textons for each image vs. a universal texton dictionary
• Representing/comparing texton distributions: signatures with EMD vs. histograms with the chi-squared distance

Results of Evaluation
[Plot: classification rate vs. number of training samples for (H+L)(S+R), VZ-Joint, and VZ-MRF]
• Conclusion: an intrinsically invariant representation is necessary to deal with intra-class variations that are not adequately represented in the training set

Summary
• A sparse texture representation based on local affine regions
• Two novel descriptors (spin images, RIFT)
• Successful recognition in the presence of viewpoint changes, non-rigidity, and non-homogeneity
• A flexible approach to invariance
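As referenced on the spin-image slide, the following is a minimal sketch of the descriptor computation: a 2D histogram over (distance from the patch center, intensity value) for an affinely rectified patch. It assumes a square patch with intensities normalized to [0, 1]; the bin counts are illustrative and hard binning is used for brevity, so this is not the authors' implementation.

```python
import numpy as np

def spin_image(patch, n_dist_bins=10, n_int_bins=10):
    """Spin-image descriptor for a rectified patch.

    A 2D histogram over (distance from the patch center, intensity value),
    which is invariant to in-plane rotation of the patch.
    Assumes `patch` is a square 2D float array with values in [0, 1].
    """
    h, w = patch.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    ys, xs = np.mgrid[0:h, 0:w]
    # Normalized distance of every pixel from the patch center.
    dist = np.sqrt((ys - cy) ** 2 + (xs - cx) ** 2)
    dist /= dist.max()
    # 2D histogram: rows index distance bins, columns index intensity bins.
    hist, _, _ = np.histogram2d(
        dist.ravel(), patch.ravel(),
        bins=[n_dist_bins, n_int_bins],
        range=[[0.0, 1.0], [0.0, 1.0]])
    # Normalize so descriptors from patches of different sizes are comparable.
    return hist / hist.sum()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    desc = spin_image(rng.random((32, 32)))
    print(desc.shape)  # (10, 10)
```

Because each pixel is indexed only by its distance from the center and its intensity, the histogram is unchanged by in-plane rotations of the patch, which is exactly the ambiguity left over after affine rectification.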
2. Recognition of Individual Regions in Multi-Texture Images

• A two-layer architecture:
  – Local appearance + neighborhood relations
• Learning:
  – Represent the local appearance of each texture class using a mixture-of-Gaussians model
  – Compute co-occurrence statistics of sub-class labels over affinely adapted neighborhoods
• Recognition:
  – Obtain initial class membership probabilities from the generative model
  – Use relaxation to refine these probabilities

Two Learning Scenarios
• Fully supervised: every region in the training image is labeled with its texture class (e.g., brick)
• Weakly supervised: each training image is labeled only with the classes occurring in it (e.g., brick, marble, carpet)

Neighborhood Statistics
• Estimate the co-occurrence probability p(c, c') and the correlation r(c, c') of sub-class labels
• [Figure: definition of the affinely adapted neighborhood]

Relaxation (Rosenfeld et al. 1976)
• Iterative process:
  – Initialized with the posterior probabilities p(c | xi) obtained from the generative model
  – For each region i and each sub-class label c, update the probability pi(c) based on the neighbor probabilities pj(c') and the correlations r(c, c')
• Shortcomings:
  – No formal guarantee of convergence
  – After initialization, the updates to the probability values do not depend on the image data

Experiment 1: 3D Textured Surfaces
• Single-texture images: T1 (brick), T2 (carpet), T3 (chair), T4 (floor 1), T5 (floor 2), T6 (marble), T7 (wood)
• Multi-texture images
• 10 single-texture training images per class, 13 two-texture training images, 45 multi-texture test images

Effect of Relaxation on Labeling
[Figure: original image; top row: labeling before relaxation, bottom row: after relaxation]

Retrieval (single-texture training images)
[Retrieval plots for T1 (brick), T2 (carpet), T3 (chair), T4 (floor 1), T5 (floor 2), T6 (marble), T7 (wood)]

Successful Segmentation Examples

Unsuccessful Segmentation Examples

Experiment 2: Animals
• Classes: cheetah vs. background, zebra vs. background, giraffe vs. background
• No manual segmentation
• Training data: 10 sample images per class
• Test data: 20 samples per class + 20 negative images

Cheetah Results

Zebra Results

Giraffe Results

Summary
• A two-level representation (local appearance + neighborhood relations)
• Weakly supervised learning of texture models

Future Work
• Design an improved representation using a random field framework, e.g., conditional random fields (Lafferty et al. 2001; Kumar & Hebert 2003)
• Develop a procedure for weakly supervised learning of random field parameters
• Apply the method to recognition of natural texture categories

3. Recognition of Object Classes

The approach:
• Represent objects using multiple composite semi-local affine parts
  – More expressive than individual regions
  – Not globally rigid
• Correspondence search is key to learning and detection

Correspondence Search
• Basic operation: a two-image matching procedure for finding collections of affine regions that can be mapped onto each other using a single affine transformation A
• Implementation: greedy search based on geometric and photometric consistency constraints (the geometric part is sketched in code below)
  – Returns multiple correspondence hypotheses
  – Automatically determines the number of regions in correspondence
  – Works on unsegmented, cluttered images (weakly supervised learning)

Matching: 3D Objects
[Figures: matched region sets on two views of a 3D object, with closeups]

Matching: Faces
[Figure: matched regions between two face images; note one spurious match]
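As noted on the Correspondence Search slide, the following is a minimal sketch of the geometric-consistency core of such a greedy search, not the authors' implementation: starting from a small seed of matched region centers, it repeatedly fits an affine transformation and absorbs the most consistent remaining candidate match. The photometric-consistency checks, the use of full ellipse shapes rather than centers, and the exploration of multiple hypotheses are omitted; the function names and the tolerance are illustrative.

```python
import numpy as np

def fit_affine(src, dst):
    """Least-squares affine transform mapping 2D points src -> dst.
    Returns a 2x3 matrix A such that dst ~= A @ [x, y, 1]^T."""
    n = len(src)
    X = np.hstack([np.asarray(src, float), np.ones((n, 1))])   # n x 3
    B, *_ = np.linalg.lstsq(X, np.asarray(dst, float), rcond=None)  # 3 x 2
    return B.T                                                  # 2 x 3

def grow_correspondence(seed_matches, candidates, tol=5.0):
    """Greedy expansion of an affine-consistent correspondence set.

    seed_matches, candidates: lists of (p, q) pairs of 2D region centers in
    image 1 and image 2.  The seed should contain at least three non-collinear
    matches so the affine fit is well determined.  At each step, fit an affine
    transform to the current matches and add the unused candidate with the
    smallest reprojection error, as long as that error is below `tol` pixels.
    """
    matches = list(seed_matches)
    remaining = list(candidates)
    while remaining:
        src = [p for p, _ in matches]
        dst = [q for _, q in matches]
        A = fit_affine(src, dst)
        # Reprojection error of every remaining candidate under the current A.
        errs = [np.linalg.norm(A @ np.array([p[0], p[1], 1.0]) - np.asarray(q, float))
                for p, q in remaining]
        best = int(np.argmin(errs))
        if errs[best] > tol:
            break  # no remaining candidate is geometrically consistent
        matches.append(remaining.pop(best))
    return matches
```

The stopping criterion (no candidate within tolerance) is what lets the procedure determine the number of regions in correspondence automatically, as described on the slide.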
Finding Symmetries

Finding Repeated Patterns and Symmetries

Learning Object Models for Recognition
• Match multiple pairs of training images to produce a set of candidate parts
• Use additional validation images to evaluate the repeatability of parts and individual regions
• Retain a fixed number of parts having the best repeatability score

Recognition Experiment: Butterflies
• Classes: Admiral, Swallowtail, Machaon, Monarch 1, Monarch 2, Peacock, Zebra
• 16 training images (8 pairs) per class
• 10 validation images per class
• 437 test images
• 619 images total

Butterfly Parts

Recognition
• Top 10 parts per class used for recognition
• Relative repeatability score: total number of regions detected / total part size (a toy version of this computation is sketched at the end of these notes)
• Classification results: [table of classification rates and total part size (smallest/largest) per class]

Classification Rate vs. Number of Parts
[Plot]

Detection Results (ROC Curves)
[Plot; circles mark the reference relative repeatability rates, red squares the ROC equal error rates (given in parentheses)]

Successful Detection Examples
[Figures: training images; test images with occluded regions shown in blue; all ellipses found in the test images]

Unsuccessful Detection Examples
[Figures: training images; test images with occluded regions shown in blue; all ellipses found in the test image]

Summary
• Semi-local affine parts for describing the structure of 3D objects
• Finding a part vocabulary:
  – Correspondence search between pairs of images
  – Validation
• Additional application:
  – Finding symmetry and repetition

Future Work
• Find a better affine region detector
• Represent and learn inter-part relations
• Evaluation: Caltech database, harder classes, etc.

Birds
• Classes: Egret, Snowy Owl, Mandarin Duck, Puffin, Wood Duck

Birds: Candidate Parts
[Figures: candidate parts for the Mandarin Duck and Puffin classes]

Objects without Characteristic Texture (LeCun '04)

Summary of Talk
1. Recognition of single-texture images
  • Distribution of local appearance descriptors
2. Recognition of individual regions in multi-texture images
  • Local appearance + loose statistical neighborhood relations
3. Recognition of object categories
  • Local appearance + strong geometric relations
For more information: http://www-cvr.ai.uiuc.edu/ponce_grp

Issues, Extensions
• Weakly supervised learning
  – Evaluation methods?
  – Learning from contaminated data?
• Probabilistic vs. geometric approaches to invariance
• EM vs. direct correspondence search
• Training set size
• Background modeling
• Strengthening the representation
  – Heterogeneous local features
  – Automatic feature selection
  – Inter-part relations
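To make the part-selection step concrete, here is a toy sketch of one plausible reading of the relative repeatability score from the Recognition slide (total number of a part's regions re-detected in the validation images, divided by the total part size), using hypothetical data structures; the exact normalization used in the actual experiments may differ.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Part:
    name: str
    size: int                   # number of affine regions in the candidate part
    detected_counts: List[int]  # regions of this part re-detected in each validation image

def relative_repeatability(part: Part) -> float:
    """Total number of regions detected across validation images, divided by
    the total part size (part size times the number of validation images)."""
    total_detected = sum(part.detected_counts)
    total_size = part.size * len(part.detected_counts)
    return total_detected / total_size if total_size else 0.0

def select_top_parts(parts: List[Part], k: int = 10) -> List[Part]:
    """Retain the k candidate parts with the best repeatability scores."""
    return sorted(parts, key=relative_repeatability, reverse=True)[:k]

# Usage with toy candidate parts:
parts = [
    Part("wing-spot", size=20, detected_counts=[15, 18, 12]),
    Part("body-stripe", size=8, detected_counts=[2, 1, 3]),
]
for p in select_top_parts(parts, k=1):
    print(p.name, round(relative_repeatability(p), 2))  # wing-spot 0.75
```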