1 Low-Level Features and Descriptors in Video
Jason J. Corso, SUNY at Buffalo
jcorso@buffalo.edu
http://www.cse.buffalo.edu/~jcorso
UCLA IPAM GSS 2013: Computer Vision, July-August 2013

2 [Point-light walker videos]
Sources: Maas 1971 with Johansson; downloaded from Youtube.

3 Human Activity Perception and Cognition
• Summary of findings (to be presented)
  – Humans can perceive biological motion, such as walking and dancing, from as few as 10-12 point-light stimuli.
  – Biological motion activates the superior temporal sulcus (STS).
  – Mammalian recognition of biological motion is viewpoint specific.
  – The STS has cells tuned to specific hand gestures, such as tearing and pointing.
  – The STS has specific cells tuned to interactions of objects and mammalian agents in the environment, e.g., hands and food.

4 Johansson: Perception of Biological Motion
Source: Johansson, G. "Visual perception of biological motion and a model for its analysis." Perception & Psychophysics, 14(2):201-211, 1973. Videos were made by J. B. Maas in 1971 (released via Houghton-Mifflin and now available on Youtube).

5 Mammalian Activity Recognition is Viewpoint Specific
Source: M. A. Giese. "Neural model for the recognition of biological motion." In Dynamische Perzeption, 105-110. Infix Verlag, 2000.
• Explores the basic question of whether the STS mechanisms that recognize complex biological motion are viewpoint invariant.
• Not a human study: builds a simple model of the known visual pathway.
  – Two parts: form and motion.
  – The motion pathway encodes
    • optical flow,
    • max-pooling (for invariance),
    • non-linear templates on the max-pooled outputs.
• Results
  – Biological motion recognition can be achieved independently by each pathway.
  – Shift and scale invariance (up to 1.5 octaves) is observed.
  – The model predicts view variance, with recognition performance degrading as a function of viewpoint angle.
  – The motion pathway is less tolerant of temporal distortion than the form pathway.
  – Individual motions form a basis for predicted smooth motion fields (linear combinations of models look biologically plausible).

6 Pelphrey Study: STS Prefers Biological Motion
Source: Pelphrey et al. "Brain Activity Evoked by the Perception of Human Walking." Journal of Neuroscience, 23(17):6819-6825, 2003.
• Explores whether the superior temporal sulcus responds specifically to biological motion.
• Key findings
  – The STS is sensitive specifically to biological motion.
  – The STS responds more strongly to biological motion than to nonmeaningful but complex nonbiological motion.
  – The STS responds more strongly to biological motion than to complex and meaningful nonbiological motion.
[Figures: experiment setup and peak results. Image source: Wikimedia Commons.]

7 Select STS Cells Tuned to Object-Agent Interactions
Source: Perrett et al. "Frameworks of analysis for the neural representation of animate objects and actions." J. Experimental Biology, 146:87-113, 1989.
• Studied a population of cells tuned to hands (in macaque monkeys).
  – The cells studied were unresponsive to simple bars/gratings as well as to complex body movements, faces, or food.
• Findings
  – Different actions of the hand activated different subpopulations of cells.
  – Isolated cells were found for reach for, retrieve, manipulate, pick, tear, present to the monkey, and hold.
  – Selectivity is clearly detected for hand-object interactions over object-object interactions.
8 Activity Recognition Overview

9 Activity Recognition Has Many Faces
[Diagram: the problem space spans input modalities (single image, single video, multiview images, multiview video, RGBD), tasks (classification, detection, segmentation, localization, summarization, description), and semantic levels (gestures, actions, interactions, group actions, events).]
Videos on this slide are sourced from my work, Youtube, and other standard data sets.

10 Applications of Activity Recognition
• Automated surveillance systems.
• Real-time monitoring of patients and elderly persons.
• Gesture- and action-based interfaces (e.g., Kinect).
• Anomaly detection in video.
• Sports-video analysis.
• Semantic video retrieval.
• Video-to-text.

11 Applications of Activity Recognition (same list, with an example video)
Source: http://www.youtube.com/

12 Applications of Activity Recognition (same list, with an example video)

13 Inherent Complexity in Activity
Source: http://www.youtube.com/watch?v=6H0D8VaIli0

14 Humans are Highly Articulated
Source: http://www.youtube.com/

15 Motion of the Camera and/or of the Scene
Source: Goodfellas (copyright Columbia Pictures), used under fair use; video trimmed from the GaTech Video Segmentation data set.

16 Action is View Dependent
Source: UCF 50 data set from UCF.

17 Action is Subject Dependent
Source: UCF 50 data set from UCF.

18 Occlusion
Source: GaTech Video Segmentation Data Set.

19 Johansson: Perception of Biological Motion
Source: Johansson, G. "Visual perception of biological motion and a model for its analysis." Perception & Psychophysics, 14(2):201-211, 1973. Videos were made by J. B. Maas in 1971 (released via Houghton-Mifflin and now available on Youtube).

20 High-level Representations are a Challenge
Pose computed using the state-of-the-art Yang and Ramanan method independently on each frame. Notice the jittery character of the pose due to local variation.

21 Current Performance: Activity Recognition
[Plot: accuracy of published methods (2004-2012) versus the number of classes in the data set, for KTH (6 classes), UCF Sports (9), UCF50 (50), and HMDB (51); annotated points of 76% and 71% on UCF Sports, the latter for pose-based activity recognition.]
Findings:
• These results depict performance circa 2012.
• Performance decreases with the number of classes.
• Performance increases with time.
• All methods are based on low-level features.
• The state-of-the-art pose method performs worse.

22 The (Very Common) Bag-of-Features Pipeline
Source: materials adapted from Laptev's CVPR 2008 slides.
Space-time features -> space-time patch descriptors -> histogram of visual words -> multi-channel classifier.
• Examples include Schüldt et al. ICPR 2004, Niebles et al. IJCV 2008, and many works building on this basic idea.

23 Typical Elements of the Bag-of-Features Pipeline
• Feature quantization:
  – Feature descriptors are pooled over various spatio-temporal domains.
  – Histograms of visual words are computed.
• The most typical classifier is a support vector machine.
• E.g., Laptev et al. 2008 use a multi-channel chi-square kernel:
    K(H_i, H_j) = exp( - sum_{c in C} (1/A_c) D_c(H_i, H_j) ),
  where C is the set of channels (chosen by greedy selection), A_c is the mean of the training-sample distances for channel c, and D_c is the chi-square distance between the channel-c histograms; a small computational sketch follows.
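To make the slide 23 kernel concrete, below is a minimal Python sketch of the multi-channel chi-square kernel applied to bag-of-features histograms. The function names and the dict-of-channels data layout are illustrative assumptions, not code from Laptev et al.

```python
import numpy as np

def chi2_distance(h1, h2, eps=1e-10):
    """Chi-square distance between two L1-normalized histograms."""
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

def chi2_matrix(A, B):
    """Pairwise chi-square distances between rows of A and rows of B."""
    return np.array([[chi2_distance(a, b) for b in B] for a in A])

def multichannel_chi2_kernel(test_hists, train_hists):
    """K(H_i, H_j) = exp(-sum_c D_c(H_i, H_j) / A_c).

    test_hists / train_hists: dicts mapping a channel name (e.g. 'hog',
    'hof') to an (n_samples, n_words) array of visual-word histograms.
    A_c is the mean chi-square distance over the training set for
    channel c, as described on the slide.
    """
    n_test = next(iter(test_hists.values())).shape[0]
    n_train = next(iter(train_hists.values())).shape[0]
    exponent = np.zeros((n_test, n_train))
    for c in train_hists:
        D = chi2_matrix(test_hists[c], train_hists[c])
        A = chi2_matrix(train_hists[c], train_hists[c]).mean()
        exponent += D / A
    return np.exp(-exponent)
```

The resulting Gram matrix can then be fed to an SVM that accepts precomputed kernels, e.g., sklearn.svm.SVC(kernel='precomputed').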
24 Outline
• Features
  – Local spatiotemporal features
    • STIP
    • Cuboids
    • Dense sampling
  – Trajectories
    • Keypoint tracking
    • Dense
• Descriptors
  – Local
    • HOG/HOF
    • MBH: Motion Boundary Histograms
    • HOG3D
    • MIP: Motion Interchange Patterns
  – Global space-time energy

26 Detectors / Features

27 STIP: Space-Time Interest Points
Source: Laptev. "On Space-Time Interest Points." Intl Journal of Computer Vision, 64(2/3):107-123, 2005.
• The basic idea is to detect points in the video that have significant local variation in both space and time.
• Extends the Harris corner detector to space-time and incorporates a scale parameter.
• The original work incorporates a scale-selection term; most subsequent works densely sample scale.

28 STIP: Space-Time Interest Points (continued)

29 STIP: Space-Time Interest Points (continued)
Video from Laptev's CVPR 2008 slides.

30 Dollár's Cuboids
Source: Dollár et al. "Behavior Recognition via Sparse Spatio-Temporal Features." ICCV VS-PETS Workshop, 2005.
• The detector fires when the local image intensities contain periodic frequency components.
• It fires more frequently than STIP.
• Based on a temporal Gabor quadrature pair of filters, with response function
    R = (I * g * h_ev)^2 + (I * g * h_od)^2,
  where g is a 2-D spatial Gaussian smoothing kernel and (h_ev, h_od) are the even/odd 1-D temporal Gabor filters.

31 Dense Sampling of Locations
• Motivated by successes in object recognition, where densely sampled features outperformed sparse ones, it has become common to sample densely for activity recognition as well.
• The example videos below use:
  – 7x7x7 non-overlapping samples,
  – a simple temporal-derivative descriptor (much simpler than HOF or HOG3D),
  – k-means quantization into 128 visual words.

32 Discussion: Local Spatiotemporal Features
• Benefits of local feature methods:
  – Robustness to viewpoint changes and occlusion.
  – Relatively computationally inexpensive.
  – Do not need to detect and track the agent.
  – Implicitly incorporate motion, form, and context.
• But they may be too limited for comprehensive activity recognition:
  – Temporal structure is diminished or lost.
  – Human performance suggests a broader spatial and temporal range may be needed for good activity recognition.
  – They typically do not incorporate any inter-relationships among the extracted features or points.

33 Trajectories by Local Keypoint Tracking
Source: Messing et al. "Activity recognition using the velocity histories of tracked keypoints." ICCV 2009.
• Detects corners in the image and tracks them with a KLT tracker.
  – 500 points at a time, with replacement.
  – Mean track duration is 150 frames.
• Represents trajectories by their quantized velocities.
• Learns a mixture model over velocity-history Markov chains.
• Each action has a distribution over the mixture components.
• A joint model over action and observations is learned via EM.

34 Trajectories by Local Keypoint Tracking (example)
Source: Messing et al., ICCV 2009.

35 Dense Trajectories
Source: Wang et al. "Action Recognition by Dense Trajectories." CVPR 2011.
• Dense sampling improves object recognition and action recognition; why not use it for trajectories?
• Matching features independently across frames is very expensive.
• Proposes a method to track points densely using a single dense optical-flow field per frame pair.
  – Global smoothness is enforced by the flow.
• Descriptors (HOG/HOF/MBH) are computed in volumes aligned with the trajectories; a simplified tracking sketch follows.
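As a rough illustration of the dense-trajectory idea (one dense flow field per frame pair instead of per-point matching), here is a simplified Python/OpenCV sketch. The grid step, trajectory length, and Farnebäck flow parameters are illustrative assumptions; the published method additionally filters the flow, prunes static and erratic tracks, and samples at multiple scales.

```python
import numpy as np
import cv2

def track_dense_points(frames, grid_step=5, traj_len=15):
    """Track densely sampled points through a list of grayscale frames
    using one dense optical-flow field (Farneback) per frame pair."""
    h, w = frames[0].shape
    ys, xs = np.mgrid[0:h:grid_step, 0:w:grid_step]
    pts = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(np.float32)
    trajs = [[tuple(p)] for p in pts]
    for t in range(min(traj_len, len(frames) - 1)):
        flow = cv2.calcOpticalFlowFarneback(
            frames[t], frames[t + 1], None,
            pyr_scale=0.5, levels=3, winsize=15,
            iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
        for i, (x, y) in enumerate(pts):
            xi, yi = int(round(x)), int(round(y))
            if 0 <= xi < w and 0 <= yi < h:
                dx, dy = flow[yi, xi]          # displacement at the point
                pts[i] = (x + dx, y + dy)      # advect the point by the flow
                trajs[i].append((float(pts[i][0]), float(pts[i][1])))
    return trajs  # HOG/HOF/MBH descriptors are then computed along each track
```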
36 Dense Trajectories: Convincing Improvements
Source: Wang et al. "Action Recognition by Dense Trajectories." CVPR 2011.

37 Descriptors

38 Local Descriptors: HOG/HOF
Source: materials adapted from Laptev's CVPR 2008 slides.
• Description (sparse or dense) of space-time patches.
• HOG descriptor: histogram of oriented spatial gradients over a 3x3x2 space-time grid with 4 orientation bins per cell.
• HOF descriptor: histogram of optical flow over a 3x3x2 space-time grid with 5 bins per cell.

39 Motion Boundary Histograms
Source: Dalal et al. "Human Detection Using Oriented Histograms of Flow and Appearance." ECCV 2006.
• Rather than histogramming the optical flow directly (HOF), MBH builds histograms of differential optical flow.
  – Descriptive of motion articulation but resistant to background and camera motion.
• Compute the optical flow and take spatial derivatives separately of its x and y components; build separate orientation histograms over the resulting dx and dy images.

40 Local Descriptors: HOG3D
Source: Kläser et al. "A Spatio-Temporal Descriptor Based on 3D-Gradients." BMVC 2008, and the provided poster.

41 Local Descriptors: HOG3D (continued)

42 Local Descriptors: Motion Interchange Patterns
Source: Kliper-Gross et al. "Motion Interchange Patterns for Action Recognition in Unconstrained Videos." ECCV 2012, and the provided slides (O. Kliper-Gross).
• A local-binary-pattern-style video descriptor.
  – Dense characterization of motion changes.
  – Captures the shape of moving edges.
  – The methodology incorporates a camera-motion stabilization mechanism.
• Incorporates a per-pixel encoding using binary/trinary digits.
• The descriptor is the frequency of the binary/trinary strings.

43 Local Descriptors: Motion Interchange Patterns (continued)
[Diagram: for a pixel x in frame t, SSD is computed between a patch at offset i in frame t-1 and the patch at x, and between the patch at x and a patch at offset j in frame t+1; α is the angle between the two offsets, and each comparison is encoded as -1, 0, or 1.]

44 Local Descriptors: Motion Interchange Patterns (continued)
[Diagram: the α = 0 case, where the offsets into frames t-1 and t+1 point in the same direction.]

45 Local Descriptors: Motion Interchange Patterns (continued)
• Different α values define different channels (the "diagonals"); over all offset pairs this yields a 64-digit trinary code per pixel.

46 Local Descriptors: Motion Interchange Patterns (continued)
• Each α defines a channel; there are 8 channels.
• Per pixel, a 64-digit trinary code.
• Per channel, the 8 trinary digits are stored as 2 integers per pixel, each in the range 0-255.

47 Local Descriptors: Motion Interchange Patterns (continued)
An example of the basic coding for one channel:
• Code +1: vote for the next frame.
• Code -1: vote for the previous frame.
• Code 0: static edges.
• MIP captures motion, motion changes, and shape (a small coding sketch follows).
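A minimal sketch of the per-pixel, per-channel trinary comparison sketched on slides 43-47. The patch size, the SSD threshold, the particular offsets, the sign convention, and the omission of border handling are all illustrative assumptions rather than the released MIP implementation.

```python
import numpy as np

def ssd(a, b):
    """Sum of squared differences between two equal-size patches."""
    d = a.astype(np.float32) - b.astype(np.float32)
    return float(np.sum(d * d))

def mip_channel_digits(prev, curr, nxt, x, y, offset_pairs,
                       patch=3, theta=1000.0):
    """Trinary digits for one pixel (x, y) and one channel (one alpha).

    offset_pairs: list of ((di_x, di_y), (dj_x, dj_y)) tuples, where the
    first offset locates a patch in the previous frame and the second a
    patch in the next frame.  For each pair, SSD(prev patch, curr patch)
    is compared with SSD(curr patch, next patch):
      +1 -> the pixel is better explained by the next frame ("vote next")
      -1 -> better explained by the previous frame ("vote prev")
       0 -> both comparisons are similar (static edge)
    """
    r = patch // 2

    def crop(img, cx, cy):
        return img[cy - r:cy + r + 1, cx - r:cx + r + 1]

    center = crop(curr, x, y)
    digits = []
    for (di, dj) in offset_pairs:
        s_prev = ssd(crop(prev, x + di[0], y + di[1]), center)
        s_next = ssd(center, crop(nxt, x + dj[0], y + dj[1]))
        if s_prev - s_next > theta:
            digits.append(+1)
        elif s_next - s_prev > theta:
            digits.append(-1)
        else:
            digits.append(0)
    return digits  # 8 digits per channel; 8 channels give the 64-digit code
```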
48 Local Descriptors: Motion Interchange Patterns (continued)
Source: Kliper-Gross et al., ECCV 2012, and the provided slides.
• Suppression of background structure and noise:
  – The original coding gives +1; with the patch locations switched, the coding gives -1.
  – Two ways to look at this: there is no motion, or the motion voting is contradicted (the original coding is voted down while the switched patches are voted up).
  – In either case, suppress the code.

49 Local Descriptors: Motion Interchange Patterns (continued)
[An example of MIP suppression: the original video, the coding without suppression, and the coding with suppression.]

50 Global Descriptors: Templates
Source: Derpanis et al. "Efficient action spotting based on a spacetime oriented structure representation." CVPR 2010, and supporting videos.
[Example action templates (squat, spin-left, jump-right) and the corresponding spotted actions.]

51 Action Spotting: Space-Time Oriented Energy
Source: Derpanis et al., CVPR 2010.
• Templates are decomposed into space-time oriented energy by broadly tuned third-order Gaussian derivative filters (a classical steerable basis).
• This sets up a basis from which the energy along any spatiotemporal orientation can be computed.
• The directions are specified according to the application: leftward, rightward, upward, downward, flicker, and so on.

52 Action Spotting: Space-Time Oriented Energy (continued)
[Input video decomposed into space-time oriented energy channels: left, right, up, down, static, flicker.]

53 Pure Motion Energy

54 Pure Motion Energy (continued)
• The raw space-time oriented energies (left, right, up, down, static, flicker) capture structure as well as motion, so the static/structure component is subtracted to obtain pure motion energies in the same channels.

55 Spotting via Template Matching
• Standard Bhattacharyya matching is used to correlate the templates against a query video.
  – The space-time oriented energy vector-videos are normalized.
  – Correlation is computed in each energy channel separately and then summed, producing an output correlation video:
      m(x) = sum_c sum_u sqrt( T_c(u) * Q_c(x + u) ),
    where T is the template, Q is the query video, c indexes the energy channels, and u ranges over the support of the template.

57 Basic Idea in Action Bank
[Sadanand and Corso, CVPR 2012]
[Diagram: the input video is run through a bank of action detectors (e.g., biking, javelin, jump rope, fencing), each with several views; the resulting detection volumes are volumetrically max-pooled into the action bank representation, which feeds SVM classifiers. Semantics transfer via the high-level representation, e.g., positive: jumping, throwing, running, ...; negative: biking, fencing, drumming, ...]

58 Action Bank Representation
• Each correlation video is max-pooled to form the action bank vector.
  – Hierarchical (3 layers).
  – Volumetric / space-time pooling, as sketched below.
• We have used a standard SVM as the final classifier on this representation.
  – Tried L1 regularization (no significant change).
  – Tried random forests (no significant change).
• Depending on the data set, we use one or two scales of the detectors.
• All action bank vectors are available on our website for the major data sets, in Python and Matlab formats.
[Sadanand and Corso, CVPR 2012]
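A minimal sketch of the hierarchical volumetric max pooling behind the action bank vector, assuming a 3-level pyramid with 2^l bins per axis (1 + 8 + 64 = 73 values per correlation video); the exact binning in the released action bank code is an assumption.

```python
import numpy as np

def volumetric_max_pool(corr, levels=3):
    """Hierarchical space-time max pooling of one correlation volume.

    corr: 3-D array (frames x height x width) of detector correlations.
    At level l the volume is split into 2**l bins along each axis and
    the maximum is taken in each bin, giving 1 + 8 + 64 = 73 values for
    levels=3.  Concatenating these vectors over all bank detectors
    yields the action bank feature vector fed to the SVM.
    """
    feats = []
    T, H, W = corr.shape
    for l in range(levels):
        n = 2 ** l
        t_bins = np.array_split(np.arange(T), n)
        h_bins = np.array_split(np.arange(H), n)
        w_bins = np.array_split(np.arange(W), n)
        for ti in t_bins:
            for hi in h_bins:
                for wi in w_bins:
                    feats.append(corr[np.ix_(ti, hi, wi)].max())
    return np.array(feats)
```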
59 Building the Bank
• The current bank has 205 templates:
  – examples of all 50 UCF50 actions and all 6 KTH actions, as well as the digging actions from visint.org;
  – 3-6 examples of each action, covering various viewpoints and styles of the particular action, with tempo variation when possible;
  – average spatial resolution of 50x120 pixels and lengths of 40-50 frames.
• Template preparation
  – A template was selected whenever a plausible new viewpoint/tempo/style was found.
  – Each template was manually cropped to the full space-time extent of the human within it.
  – No optimization of the bank was done at any point.

60 Performance Results of Action Bank
[Result tables: KTH and UCF Sports.]
[Sadanand and Corso, CVPR 2012]

61 Large-Scale Results of Action Bank
• HMDB51-V: video-wise 10-fold cross-validation.
• HMDB51-B: the three benchmark splits.
• UCF50-V: video-wise 10-fold cross-validation.
• UCF50-G: group-wise 5-fold cross-validation.
[Sadanand and Corso, CVPR 2012]

62 [Worst and best examples.]

63 Discussion: Semantics Transfer
• Can action bank permit a high-level transfer of semantics from the bank templates through to the final classifier? The KTH data set is used for this study.
[Sadanand and Corso, CVPR 2012]

64 Discussion: Does Size Matter?
• The UCF Sports data set is used for this study.
[Sadanand and Corso, CVPR 2012]

65 Closing Remarks
• Summary
  – Overview of motion perception and activity recognition.
  – Low-level features
    • STIP
    • Cuboids
    • Trajectories
  – Low-level descriptors
    • HOG/HOF
    • HOG3D
    • MBH
    • Motion Interchange Patterns
• Challenges
  – Space-time or "space and time"?
  – Building intuition and understanding for the low-level features and descriptors. Where are they good? Where do they fail?