Lecture1_Human Activity Analysis1

Human Activity Analysis
吴心筱
wuxinxiao@bit.edu.cn
Based on the CVPR 2011 Tutorial: Human Activity Analysis.
http://cvrc.ece.utexas.edu/mryoo/cvpr2011tutorial/

Introduction

Understanding People in Video
Goal
• Find the people
• Infer their poses
• Recognize what they do

Levels of People Understanding
Object-Level Understanding
• Locations of persons and objects
Tracking-Level Understanding
• Object trajectories (correspondence)
Pose-Level Understanding
• Human body parts
Activity-Level Understanding
• Recognition of human activities and events

Object Detection
Pedestrian (i.e., human) detection
• Detect all the persons in the video

Object Tracking
Person tracking
• Track the person in every frame

Pose Estimation
Human pose
• Joint locations or angles of a person, measured per frame
• Video as a sequence of poses

Human Activity Recognition
Human activity
• A collection of human/object movements with a particular semantic meaning
Activity recognition
• Finding the video segments that contain such movements

Levels of Human Activities
Categorized based on their complexity (hierarchy, # of participants)
• Gestures
• Actions
• Interactions
• Group activities

Gestures
• Atomic components
• Single body-part movements

Actions
• Single-actor movement
• Composed of multiple gestures organized temporally

Interactions
• Human-human interaction
• Human-object interaction

Group Activities
• Multiple persons or objects

Applications

Surveillance
• Monitor suspicious activities for real-time reactions (e.g., 'fighting', 'stealing').
• Currently, surveillance systems are mainly used for recording.
• Activity recognition is essential for surveillance and other monitoring systems in public places.

Intelligent Environments (HCI)
• Intelligent home, office, and workspace
• Monitoring of elderly people and children
• Recognition of ongoing activities and understanding of the current context are essential.
Sports Play Analysis

Web-based video retrieval
• YouTube: 20 hours of video uploaded every minute
• Content-based search: search based on the contents of the video instead of user-attached keywords
• Example: search for 'kiss' in long movies

Challenges

Robustness
• Environment variations
• Background
• Moving backgrounds
• Pedestrians
• Occlusions
• Viewpoints: moving camera

Motion Style
• Each person has his or her own style of executing an activity.
• Who stretches out a hand first?
• How long does each person keep the hand stretched?

Various Activities
• There are many types of activities.
• The ultimate goal is to make computers recognize all of them reliably.

Learning
• Insufficient amount of training videos
• Traditional setting: supervised learning
  • Human effort is expensive!
• Unsupervised learning
• Interactive learning

Overview

Activity Classification
• Simple task of identifying videos: classify given videos into their types.
• Known, limited number of classes
• Assumes that each video contains a single activity

Activity Detection
• Search for the particular time interval <starting time, ending time>
• The video segment containing the activity

Activity Detection by Classification
• Binary classifier (e.g., "Yes, punch!")
• Sliding window technique: classify all possible time intervals

Recognition Process
• Represent videos in terms of features
  • Captures properties of activity videos
• Classify activities by comparing video representations
  • Decision boundary

Approaches
Approach-based taxonomy
Single-layered vs.
hierarchical Single-layered approaches Hierarchical approaches Single-layered approaches Two different types Space-time approaches (data-orientated) •Activities as video observations •3D space-time volume (3D XYT volume) •A set of features extracted from the volume Two different types Sequential approaches (semantic-oriented) • Activities as human movements •A sequence of particular observations(feature vectors) Space-time approaches Space-time volumes Space-time local features Space-time trajectories Space-time volumes Problem: matching between two volumes Motion history images Bobick and J. Davis, 2001 Motion history images (MHIs) Weighted projection of a XYT foreground volume Template matching Bobick and J. Davis, The recognition of human movement using temporal templates,IEEE T PAMI 23(3),2001 Segments Ke, Suktankar, Herbert 2007 Volume matching based on its segments. Segment matching scores are combined. [ Ke, Y., Sukthankar, R., and Hebert, M., Spatio-temporal shape and flow correlation for action recognition. CVPR 2007] Space-time local features Local descriptors / interest points From 2D to 3D ; Sparse • Low-level: which local features to be extracted •Mid-level: How to represent the activity using local features •High-level: What method to classify activities Cuboid Dollar et al., Cuboid, VS-PETS 2005 2D Gaussian smoothing kernel: 1D Gabor filters applied temporally: [Dollar, P., Rabaud, V., Cottrell, G., and Belongie, S., Behavior recognition via sparse spatio-temporal features, VS-PETS 2005] Cuboid Dollar et al., Cuboid, VS-PETS 2005 Appearances of local 3-D XYT volumes Raw appearance Gradients Optical flows Captures salient periodic motion. 
STIP interest points (Laptev and Lindeberg 2003)
• Introduced the KTH dataset
• Local features based on a space-time extension of the Harris corner detector
• Simple periodic actions
[Schuldt, C., Laptev, I., and Caputo, B., Recognizing human actions: A local SVM approach, ICPR 2004]

Bag-of-words Representation
SVM classification
• SVM classifier (e.g., "Shake hands!")
• Multiple kernel learning
• ……
[Schuldt, C., Laptev, I., and Caputo, B., Recognizing human actions: A local SVM approach, ICPR 2004]

pLSA models
• pLSA: probabilistic latent semantic analysis, adopted from text analysis
• Reasons about the probability that features originated from a particular action video.
[Niebles, J. C., Wang, H., and Fei-Fei, L., Unsupervised learning of human action categories using spatial-temporal words, BMVC 2006]

Space-time trajectories
Trajectory patterns (Yilmaz and Shah, 2005, UCF)
• Joint trajectories in 3-D XYT space.
• Compared trajectory shapes to classify human actions.
[Yilmaz, A.
and Shah, M., Recognizing human actions in videos acquired by uncalibrated moving cameras, ICCV 2005]

Space-time approaches summary
Space-time volumes
• A straightforward solution
• Difficulty handling speed and motion variations
Space-time local features
• Robust to noise and illumination changes
• Recognize multiple activities without background subtraction or body-part modeling
• Difficulty modeling more complex activities
Space-time trajectories
• Perform detailed-level analysis
• View-invariant in most cases
• Difficult to extract the trajectories

Two different types
Sequential approaches (semantic-oriented)
• Activities as human movements
• A sequence of particular observations (feature vectors)

Sequential approaches
• Exemplar-based approaches
• State model-based approaches

Exemplar-based approaches
• Matching between the input sequence of feature vectors and the template sequences
• Problem: how to match two sequences performed in different styles or at different rates

Dynamic time warping
• Match two sequences with variations
• Find an optimal nonlinear match
[Yilmaz, A. and Shah, M., Recognizing human actions in videos acquired by uncalibrated moving cameras, ICCV 2005]

State model-based approaches
• A human activity is represented as a model composed of a set of states.
• Each class has a corresponding model.
• Measure the likelihood between the model and the input image sequence.

HMMs
• Given observations V (a sequence of poses), find the HMM M_i that maximizes P(V|M_i).
• Transition probabilities a_ij and observation probabilities b_ik are pre-trained using training data.

HMMs for Actions
• Each hidden state is trained to generate a particular body posture.
• Each HMM produces a pose sequence: an action.

HMMs for Hand Gestures
• HMMs for gesture recognition
• American Sign Language (ASL)
• Sequential HMMs
• Shapes and positions of hands
[Starner, T.
and Pentland, A., Real-time American Sign Language recognition from video using hidden Markov models, International Symposium on Computer Vision, 1995]

Sequential approaches summary
• Designed for modeling sequential dynamics
  • Markov process
• Motion features are extracted per frame
Limitations
• Feature extraction: assumes good observation models
• Complex human activities?
• Large amount of training data

Exemplar-based vs. state model-based
Exemplar-based
• Provide more flexibility for the recognition system: multiple sample sequences
• Less training data
State model-based
• Allow a probabilistic analysis of the activity
• Can model more complex activities

Hierarchical approaches
Hierarchy
• Hierarchy implies decomposition into subparts

Hierarchical approaches
• Statistical approaches
• Syntactic approaches
• Description-based approaches

Statistical approaches
• Statistical state-based models for recognition
• Multiple layers of state-based models
  • Low-level: a sequence of feature vectors
  • Mid-level: a sequence of atomic action labels
  • High-level: label of the activity

Layered hidden Markov models (LHMMs) (Oliver et al. 2002)
• Bottom-layer HMMs recognize atomic actions of a single person
• Upper-layer HMMs treat recognized atomic actions as observations
• ……
[Oliver, N., Horvitz, E., and Garg, A. 2002. Layered representations for human activity recognition. In Proceedings of the IEEE International Conference on Multimodal Interfaces (ICMI). IEEE, Los Alamitos, CA, 3-8.]
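
The HMM-based recognition step used by these models (score an observation sequence V under each model M_i and keep the model with the highest P(V|M_i)) can be sketched with the forward algorithm. This is a minimal illustration under assumed conditions, not code from the cited papers. Poses are assumed to be already quantized into discrete symbols 0, 1, 2, and the two toy models "wave" and "bend" with their probabilities are made up for the example.

```python
def forward_likelihood(obs, pi, A, B):
    """Return P(obs | model) for a discrete HMM using the forward algorithm.

    pi[i]   : initial probability of state i
    A[i][j] : transition probability a_ij from state i to state j
    B[i][k] : probability b_ik that state i emits symbol k
    """
    n = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    for symbol in obs[1:]:
        alpha = [
            sum(alpha[i] * A[i][j] for i in range(n)) * B[j][symbol]
            for j in range(n)
        ]
    return sum(alpha)


def classify(obs, models):
    """Pick the model name whose HMM gives obs the highest likelihood."""
    return max(models, key=lambda name: forward_likelihood(obs, *models[name]))


# Hypothetical pre-trained models: (pi, A, B) per activity class.
models = {
    # Alternates between pose 0 and pose 1.
    "wave": ([1.0, 0.0],
             [[0.5, 0.5], [0.5, 0.5]],
             [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0]]),
    # Starts near pose 0, then moves to pose 2 and stays there.
    "bend": ([1.0, 0.0],
             [[0.7, 0.3], [0.0, 1.0]],
             [[0.8, 0.1, 0.1], [0.05, 0.05, 0.9]]),
}

print(classify([0, 1, 0, 1], models))
print(classify([0, 0, 2, 2], models))
```

For long sequences the raw products underflow, so a practical version would rescale alpha at each step or work in log space. In a layered model, the labels returned by bottom-layer classifiers like this one would become the observation symbols of the upper-layer HMMs.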
Statistical approaches summary
• Reliably recognize activities in the case of noisy inputs, given enough training data
• Inability to recognize activities with complex temporal structures (e.g., concurrent subevents)
[Figure: timelines of subevent A and subevent B]

Syntactic approaches
• Model an activity as a string of symbols, where each symbol corresponds to an atomic-level action
• Parsing techniques from the field of programming languages
• Context-free grammars (CFGs) and stochastic context-free grammars (SCFGs)

CFG for human activities
• Given the recognized atomic actions
• Production rules
• Parse tree

Gesture analysis with CFGs
• Primitive recognition with HMMs
• Parse tree

Syntactic approaches summary
• Difficulty recognizing concurrent activities
• Require a set of production rules for all possible events
• Directions: how to learn grammar rules from observations automatically?

Description-based approaches
• Represent a high-level human activity in terms of simpler activities
• Describe various relationships between subevents (atomic actions)
  • Temporal relationship
  • Spatial relationship
  • Logical relationship
• Recognize activities using semantic matching.
• Handshake = "two persons do a shake-action (stretch, stay stretched, withdraw) simultaneously, while touching".
• Recognition by finding observations that satisfy the definition.

Hierarchical approaches summary
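
The handshake definition from the description-based approaches can be written down as a small set of interval checks. The sketch below is one possible reading of that definition, not a published algorithm. It assumes that the atomic actions have already been detected as time intervals, interprets "simultaneously" as overlapping "stay stretched" intervals, and interprets "while touching" as a touching interval that overlaps both. All names (`Interval`, `is_shake_action`, `is_handshake`) are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    start: float
    end: float


def before_or_meets(a, b):
    """Temporal relationship: a ends no later than b starts."""
    return a.end <= b.start


def overlaps(a, b):
    """Temporal relationship: a and b share some stretch of time."""
    return a.start < b.end and b.start < a.end


def is_shake_action(stretch, stay, withdraw):
    """One person's shake-action: stretch, then stay stretched, then withdraw."""
    return before_or_meets(stretch, stay) and before_or_meets(stay, withdraw)


def is_handshake(p1, p2, touching):
    """Check detected subevents of two persons against the definition.

    p1 and p2 map 'stretch', 'stay', 'withdraw' to Intervals; touching is
    the Interval during which the hands touch, or None if never detected.
    """
    if touching is None:
        return False
    return (
        is_shake_action(**p1)
        and is_shake_action(**p2)
        and overlaps(p1["stay"], p2["stay"])
        and overlaps(touching, p1["stay"])
        and overlaps(touching, p2["stay"])
    )


# Hypothetical detections, in seconds.
person1 = {"stretch": Interval(0.0, 1.0), "stay": Interval(1.0, 3.0),
           "withdraw": Interval(3.0, 4.0)}
person2 = {"stretch": Interval(0.5, 1.5), "stay": Interval(1.5, 3.2),
           "withdraw": Interval(3.2, 4.0)}
print(is_handshake(person1, person2, Interval(1.6, 3.0)))
```

Because the definition is stated in terms of relationships rather than a fixed symbol order, concurrent subevents of the two persons are handled directly, which is the case the syntactic approaches find difficult.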
