Mubarak Shah
Computer Vision Lab, University of Central Florida
shah@cs.ucf.edu

Human Action

"Human action is purposeful behavior." (Ludwig von Mises)
"All human actions have one or more of these seven causes: chance, nature, compulsion, habit, reason, passion and desire." (Aristotle)
"Change does not come from the sky. It comes from human action." (Tenzin Gyatso, 14th Dalai Lama)
"Human beings must have action; and they will make it if they cannot find it." (Albert Einstein)

Related terms: Action, Event, Movement, Activity, Interaction, Verb.

Datasets
• 10 actions, 9 actors per action: Bend, Sidestep, Jack, Hop, Wave 2 Hands, Walk, Wave 1 Hand, Skip, Jump in Place, Run.
• Six categories (Boxing, Hand Clapping, Jogging, Running, Hand Waving, Walking), 25 actors, 4 instances, nearly 600 clips in total.
• 9 actions, 142 videos: Kick, Swing, Dive, Swing Bench, Lift, Ride, Golf Swing, Run, Skate.
• 13 action categories (e.g. Check Watch, Get Up), 4 camera views, 10 actors, 3 instances.
• YouTube actions: Cycling, Juggling, Volleyball Spiking, Diving, Golf Swinging, Riding, Basketball Shooting, Swinging, Tennis Swinging, Trampoline Jumping, Walking Dog.

Outline
• Feature Tree
• Visual Vocabulary Using Diffusion Maps
• What Is Missing?

Chaotic Invariants
Saad Ali, Arslan Basharat and Mubarak Shah, ICCV 2007

Treat the trajectories of human joints as the representation of the nonlinear dynamical system that is generating the particular action. An underlying rule f evolves the true state-space variables (x_1(t), x_2(t), ..., x_N(t)) step by step:

(x_1(t), ..., x_N(t)) --f--> (x_1(t+1), ..., x_N(t+1)) --f--> (x_1(t+2), ..., x_N(t+2)) --f--> (x_1(t+3), ..., x_N(t+3)) --> ...

We have access to the data generated by the dynamical system controlling this action: the experimental data are the trajectories of the body joints. From this data, construct the phase space. That is: let the data speak to you and tell you what mechanisms are generating the chaotic behavior.
Approach
Action video → extract joint trajectories → convert the X, Y coordinates of each joint into time series (the feature vector) → for each time series, perform phase-space embedding → reconstructed phase space → compute chaotic invariants.

(Figure: for each reconstructed phase space, the correlation sum ln C(r) vs. ln r, the scaling region of the correlation dimension D2, and the prediction-error curve used for the maximal Lyapunov exponent.)
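The phase-space embedding step above can be sketched in a few lines of NumPy. This is a hedged illustration: the sine-wave series below is a placeholder for one joint coordinate, and the m and tau values are illustrative, not the ones used in the paper.

```python
import numpy as np

def delay_embed(series, m, tau):
    """Delay-coordinate embedding: row i of the result is the point
    (z_i, z_{i+tau}, ..., z_{i+(m-1)tau}) of the reconstructed
    m-dimensional phase space."""
    series = np.asarray(series, dtype=float)
    n_points = len(series) - (m - 1) * tau
    if n_points <= 0:
        raise ValueError("series too short for the chosen m and tau")
    return np.column_stack(
        [series[k * tau : k * tau + n_points] for k in range(m)]
    )

# A sine wave as a stand-in for one coordinate of a joint trajectory.
t = np.linspace(0, 20 * np.pi, 1000)
X = delay_embed(np.sin(t), m=3, tau=8)   # 984 points in a 3-D phase space
```

In practice m and tau are chosen from the data (e.g. by false-nearest-neighbors and mutual-information criteria), not fixed by hand as here.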
The invariants computed per time series (correlation integral, correlation dimension, maximal Lyapunov exponent) are collected into a vector of chaotic invariants.

• Six body joints: two hands, two feet, head, belly.
• Trajectories are normalized with respect to the belly point, which results in 5 trajectories per action.
• Each dimension of a trajectory is considered as a univariate time series.

Embedding idea: all the variables of the underlying dynamical system influence each other, so every point of the observed series z_1, z_2, z_3, ..., z_t results from the intricate combination of influences of all the true state variables. Therefore the delayed samples z_{i+τ}, z_{i+2τ}, ..., z_{i+(m-1)τ} can be considered substitute variables that carry the influence of all the system's variables over the corresponding time intervals. Using this reasoning, introduce a series of substitute variables and obtain the whole m-dimensional space, e.g. for m = 3:

X = [ z_1  z_{1+τ}  z_{1+2τ} ]
    [ z_2  z_{2+τ}  z_{2+2τ} ]
    [ z_3  z_{3+τ}  z_{3+2τ} ]
    [ ...  ...      ...      ]

Each row is a point in the m-dimensional reconstructed phase space.

(Figure: reconstructed phase spaces for the head, right hand, left hand, right foot and left foot.)

Chaotic invariants
• Maximal Lyapunov exponent: measures the exponential divergence of nearby trajectories in phase space. A maximal Lyapunov exponent > 0 implies that the dynamics of the underlying system are chaotic.
• Correlation integral: measures the number of points within a neighborhood of some radius, averaged over the entire attractor.
• Correlation dimension: measures the change in the density of phase space with respect to the neighborhood radius.
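The correlation integral and correlation dimension defined above can be estimated directly from a reconstructed phase space. A minimal NumPy sketch follows; the point set and radii are synthetic stand-ins (uniform points in the plane, whose correlation dimension should come out near 2), and the Lyapunov-exponent estimate is omitted.

```python
import numpy as np

def correlation_sum(points, r):
    """Correlation integral C(r): the fraction of point pairs of the
    attractor that lie within distance r of each other."""
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    iu = np.triu_indices(len(points), k=1)
    return np.mean(dists[iu] < r)

def correlation_dimension(points, radii):
    """Estimate D2 as the slope of ln C(r) versus ln r over the
    supplied scaling region."""
    c = np.array([correlation_sum(points, r) for r in radii])
    mask = c > 0
    slope, _ = np.polyfit(np.log(radii[mask]), np.log(c[mask]), 1)
    return slope

rng = np.random.default_rng(0)
plane = rng.random((400, 2))               # points filling a 2-D region
radii = np.logspace(-1.2, -0.4, 8)         # scaling region for the fit
d2 = correlation_dimension(plane, radii)   # expected to be close to 2
```

Boundary effects pull the estimate slightly below the true dimension at the larger radii; choosing the scaling region is the delicate part of the method.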
Experiments: Motion-Capture Data
• Dataset sizes: Dance 19, Run 26, Walk 46, Sit 14, Jump 33.
• Leave-one-out cross-validation using a K-means classifier.
(Figure: sample walking, running, jumping, ballet and sitting sequences; confusion table over the five classes.)
• Mean accuracy: 89.7%.

Experiments: Weizmann Action Dataset
• Nine actions performed by nine different actors: Bend, Jumping Jack, Jump Forward, Jump in Place, Side Gallop, Run, Walk, Wave One Hand, Wave Two Hands; 81 videos.
• Confusion table over A1–A9 (A1: Bend, A2: Jumping Jack, A3: Jump in Place, A4: Run, A5: Side Gallop, A6: Walk, A7: Wave1, A8: Wave2): most classes are recognized in 9 of 9 cases; A3 is correct in 5 of 9, A5 and A6 in 8 of 9.
• Mean accuracy: 92.6%.

Experiments: UCF Sports Dataset
• Moving camera; 118 videos: 14 Diving, 18 Golf-Swings, 20 Kicking, 12 Horse-Riding, 12 Running, 13 Skateboarding, 12 Swing-Bench, 13 Swing-Side.
(Figure: sample frames for Diving, Golf-Swing-Back, Golf-Swing-Side, Kicking, Riding-Horse, Run, Skateboarding, Swing-Bench and Swing-Side; confusion table.)
• Mean accuracy: 83%.

Limitations
• Assumes the availability of joint trajectories.
• Requires robust detection and robust tracking.

Outline
• Feature Tree
• Visual Vocabulary Using Diffusion Maps
• What Is Missing?

Bag of Video-Words
• Consider an action as a bag of video-words.
• Represent the action as a histogram of video-words.
• Perform recognition using a classifier (SVM, KNN).
• Advantages: no object detection, no object tracking.
(Figure: interest-point detector responses grouped into video-words A, B and C.)

What is the (vocabulary) codebook? The vocabulary should be compact and discriminative.
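The bag-of-video-words representation can be sketched as follows. The toy codebook and the clip's local descriptors are hypothetical; a real system would learn the codebook with k-means over descriptors of detected interest-point cuboids.

```python
import numpy as np

def build_histogram(descriptors, codebook):
    """Quantize each local descriptor to its nearest video-word and
    return the normalized word histogram representing the clip."""
    dists = np.linalg.norm(
        descriptors[:, None, :] - codebook[None, :, :], axis=-1
    )
    words = dists.argmin(axis=1)
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / hist.sum()

# Toy codebook of three 2-D video-words and one clip's five descriptors.
codebook = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
clip = np.array([[0.1, 0.0], [0.9, 0.1], [0.0, 0.9], [0.1, 0.1], [1.1, 0.0]])
h = build_histogram(clip, codebook)   # word counts 2, 2, 1 -> [0.4, 0.4, 0.2]
```

The histogram h is what the SVM or KNN classifier consumes; the clip's temporal order is discarded by design.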
Video-words may not be semantically meaningful.

Visual Vocabulary Using Diffusion Maps
Jingen Liu, Yang Yang and Mubarak Shah, CVPR 2009

• Make use of the co-occurrence of the visual words.
• Embed video-words into a meaningful low-dimensional space.
• Use video-word clusters (high-level features) instead of video-words.

K-means vs. manifold approaches* (Figure: K-means clustering, spectral clustering, and the embedded points.)
*A. Y. Ng, et al., "On spectral clustering: analysis and an algorithm".

Pipeline
1. Raw feature extraction (low-level features: Dollár interest points).
2. Feature quantization (k-means; mid-level features, i.e. video-words).
3. Mid-level feature embedding (diffusion maps).
4. Vocabulary construction (k-means; high-level features).

Advantages of diffusion maps
• Provide an explicit metric that reflects the geometric structure of the feature space.
• Discover the semantic relationships between the feature points.
• Allow the data to be analyzed at multiple scales using different diffusion times; the analysis can be performed at different levels, e.g. Sports > (Football, Baseball), Baseball > (Pitching, Score).

Pointwise mutual information (PMI). Represent each video-word by its co-occurrence with the Nc clips,

x_i = [m_{1,i}, m_{2,i}, ..., m_{Nc,i}],    m_{c,w} = log( f_{cw} / (f_c f_w) ),

where f_{cw} = n_{cw}/N_w and n_{cw} is the number of times word w appears in clip c.

Construct a weighted graph W = [w_{ij}] over the video-words:

w_{ij}(x_i, x_j) = exp( -||x_i - x_j||^2 / (2 σ^2) ),

where σ is one of the scale parameters; W is symmetric (w_{ij} = w_{ji}) and positive (w_{ij} > 0).

Markov transition matrix P = [p_{ij}]:

p_{ij}(x_i, x_j) = w_{ij}(x_i, x_j) / d_i,    d_i = rowsum_i = Σ_{j=1}^{Nw} w_{ij},

so p(x_i, x_j) is the transition probability in one time step. P^t, where t is the second scale parameter, gives the probability of transition from x_i to x_j in t steps.

Goal: relate the spectral properties of the Markov chain to the geometry of the data.
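The graph weights and Markov transition matrix above translate directly into NumPy. The PMI vectors here are random placeholders standing in for the word/clip co-occurrence representation.

```python
import numpy as np

def transition_matrix(X, sigma=1.0):
    """Build w_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)) and row-normalize
    by d_i = sum_j w_ij to get the Markov matrix p_ij = w_ij / d_i."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq / (2.0 * sigma ** 2))
    return W / W.sum(axis=1, keepdims=True)

rng = np.random.default_rng(1)
X = rng.random((6, 4))              # six placeholder PMI vectors
P = transition_matrix(X, sigma=0.5)
P2 = P @ P                          # P^t with t = 2: two-step transition probabilities
```

Each row of P (and of any power P^t) sums to one, as a transition matrix must; sigma and t are the two scale parameters of the method.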
Diffusion distance:

[D^{(t)}(x_i, x_j)]^2 = Σ_q ( p^{(t)}_{iq} - p^{(t)}_{jq} )^2 / φ_0(x_q),    φ_0(x_q) = lim_{t→∞} p^{(t)}_{·q} = d_q / Σ_j d_j.

Here φ_0 is the stationary probability distribution: the probability of landing at location q after taking infinitely many steps of the random walk.

Diffusion distances can be computed using the eigenvectors and eigenvalues of P. Writing the eigendecomposition P = Σ_{l≥0} λ_l ψ_l φ_l^T, with λ_0 = 1 and ψ_0 the trivial constant eigenvector, the distance may be approximated with the first α eigenvalues:

[D^{(t)}(x_i, x_j)]^2 ≈ Σ_{l=1}^{α} λ_l^{2t} ( ψ_l(x_i) - ψ_l(x_j) )^2.

The diffusion-map embedding is

Ψ_t(x) = ( λ_1^t ψ_1(x), λ_2^t ψ_2(x), ..., λ_α^t ψ_α(x) ),

and the Euclidean distance between embedded points is equal to the diffusion distance:

[D^{(t)}(x_i, x_j)]^2 = || Ψ_t(x_i) - Ψ_t(x_j) ||^2.

(Figure: spiral points and KTH features, compared under the diffusion distance and the geodesic distance.)

Pipeline (recap)
1. Raw feature extraction (low-level features: Dollár).
2. Feature quantization (k-means; mid-level features, video-words).
3. Mid-level feature embedding (DM; embedded mid-level features).
4. Semantic vocabulary construction (k-means; high-level features).

Experiments: KTH Dataset (Boxing, Clapping, Walking, Jogging, Waving, Running)
• Recognition rate with and without embedding (bar chart: the embedded vocabulary outperforms the original video-words).
• Comparison of recognition rates between different manifold-learning schemes.
• Average accuracy: 92.3%.
• PMI vs. frequency weighting m_{c,w} (bar chart across vocabulary sizes 50–250).

Experiments: YouTube Dataset
• Average accuracy: 76.1%.
• Comparison of recognition rates using different manifold-learning schemes on the YouTube dataset.
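The diffusion-map embedding described earlier (eigendecomposition of P, keeping the top α nontrivial eigenpairs scaled by λ^t) can be sketched as follows, on synthetic points rather than real video-word vectors.

```python
import numpy as np

def diffusion_map(P, t=1, alpha=2):
    """Embed each point as (lambda_1^t psi_1(x), ..., lambda_alpha^t
    psi_alpha(x)), skipping the trivial eigenpair lambda_0 = 1 whose
    eigenvector is constant."""
    evals, evecs = np.linalg.eig(P)
    order = np.argsort(-evals.real)       # sort eigenvalues descending
    evals = evals.real[order]
    evecs = evecs.real[:, order]
    lam = evals[1 : alpha + 1] ** t
    return evecs[:, 1 : alpha + 1] * lam

rng = np.random.default_rng(2)
X = rng.random((8, 3))                                 # synthetic feature vectors
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
W = np.exp(-sq / 0.5)                                  # Gaussian affinities
P = W / W.sum(axis=1, keepdims=True)                   # Markov matrix
Y = diffusion_map(P, t=2, alpha=2)                     # 2-D diffusion coordinates
```

Because P is built from a symmetric affinity matrix, its eigenvalues are real; larger t contracts the higher-order coordinates, which is the multi-scale behavior the slides describe. Clustering Y with k-means then yields the high-level vocabulary.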
Experiments: Fifteen Scene Dataset
Bedroom (216), Kitchen (210), Forest (328), Suburb (241), Coast (360), Highway (260), Industry (311), Inside of City (308), Store (315), Living Room (289), Mountain (374), Street (292), Open Country (410), Office (215), Tall Building (356).

(Figure: three visual words M1–M3 (mid-level features) with their corresponding real image patches, and six high-level features H1–H6 with their patches, e.g. part of the building, part of the foliage, part of the window.)

• Comparison of recognition rates using different manifold-learning schemes (DM, ISOMAP, PCA, EigenMap) on the fifteen-scene dataset (bar chart of average accuracies).

Outline
• Feature Tree
• Visual Vocabulary Using Diffusion Maps
• What Is Missing?

Feature Tree
K. Reddy, J. Liu, M. Shah, ICCV 2009

Limitations of the bag-of-words pipeline (feature detection → k-means → histogram → SVM)
• Intensive training stage is needed to obtain good performance.
• Sensitive to the vocabulary size.
• Unable to cope with incremental recognition problems.
• Simultaneous multiple actions cannot be recognized.
• Cannot perform recognition frame by frame.

Feature-tree approach: detect local features (spatiotemporal interest points) and index them using a tree.
Advantages:
• Effective integration of indexing and recognition.
• No vocabulary construction and no category-model training.
• Incremental action recognition.
• The tree provides a disk-based data structure, scalable to large-scale datasets.
• The recognition can be performed nearly in real time.

Tree construction
Input video dataset → feature detection and extraction → label feature points with the video class → construct the tree with the labeled feature points.
• Training videos V = {v_1, v_2, ..., v_M} with corresponding class labels l_i ∈ {1, 2, ..., C}.
• Extract n spatiotemporal features d_{ij} (1 ≤ j ≤ n) from each video v_i.
• Associate each feature with its class label to get a two-element feature tuple x_{ij} = [d_{ij}, l_i].
• The collection of labeled features T = {x_{ij}} is used to construct the tree.

The SR-tree is a multidimensional indexing method (data partitioning using predefined shapes). It organizes the data into hierarchical regions, where each region is defined by the intersection of a bounding sphere and a bounding rectangle. (Figure: root node, node level and leaf level; a query descends to the leaf level to search the labeled data.)

Querying
Query video → feature detection and extraction → retrieve the top K features for each feature query using the tree → feature voting → label the query video.

Objective: given a query video Q, assign a class label to it.
Feature extraction:
1. For the given video Q, detect the spatiotemporal interest points using the Dollár detector and extract the cuboids around them.
2. Compute the gradient for each cuboid, use PCA to reduce the dimension, and represent the cuboid by a descriptor d_q.
3. Represent Q as a set of features {d_q}.
Action recognition:
1. For each query feature d_q in Q, retrieve its nearest neighbors from the tree.
2. The class label of Q is decided by summing the votes of the labels assigned to the features {d_q}.
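The retrieve-and-vote recognition step can be sketched with a brute-force nearest-neighbor search standing in for the SR-tree. All the data below is synthetic; the class labels, descriptor dimension and cluster centers are placeholders, not values from the paper.

```python
import numpy as np

def classify_by_voting(query_feats, tree_feats, tree_labels, k=3, n_classes=3):
    """For every query feature, retrieve the k nearest labeled features
    (brute force here, standing in for the SR-tree search) and label
    the clip by summing the votes of the retrieved labels."""
    votes = np.zeros(n_classes)
    for q in query_feats:
        dists = np.linalg.norm(tree_feats - q, axis=1)
        for idx in np.argsort(dists)[:k]:
            votes[tree_labels[idx]] += 1
    return int(votes.argmax())

rng = np.random.default_rng(3)
# Labeled descriptors from three well-separated synthetic action classes.
tree_feats = np.vstack([rng.normal(c, 0.1, (20, 5)) for c in (0.0, 1.0, 2.0)])
tree_labels = np.repeat([0, 1, 2], 20)
query = rng.normal(1.0, 0.1, (10, 5))     # features from a class-1 clip
pred = classify_by_voting(query, tree_feats, tree_labels)   # -> 1
```

Because votes accumulate one query feature at a time, the same loop supports frame-by-frame recognition: classify after however many features have arrived so far.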
Experiments: KTH
• Tree constructed from 5 persons and 6 actions: SR-tree average performance 87.77%, vs. 84.4% for the vocabulary-based (standard bag-of-words) approach.
• Tree constructed from 10 persons and 6 actions: SR-tree performance increases to 90.78%.

Experiments: Multi-View Data
• Training on four views and testing on one view: average accuracy 66.4% (per-camera bar chart for Camera-1 through Camera-4).
• Training on four views and testing on four views: 72%.
(Figure: accuracy bars for "training on 4 views, testing on one view" vs. "training on 3 views, testing on the 4th view"; confusion table with four views used for training and testing.)

Incremental Recognition: KTH
• The initial tree is constructed using Boxing, Clapping, Waving and Jogging videos performed by 5 persons.
• The tree is then expanded with videos of a new action, Running, incrementally adding videos from one person at a time.
• Testing is done on the 5 actions using videos of 20 people not used in constructing the tree.
(Figures: classification accuracy against ground truth as the tree grows; accuracy vs. number of frames for frame-by-frame recognition; results in a cluttered environment.)

Acknowledgments
Jingen Liu, Saad Ali, Arslan Basharat, Kishore Reddy, Yang Yang
http://www.cs.ucf.edu/~vision

What Is Missing?
• Complex actions, e.g. a basketball going through the hoop.
• Explanation / understanding.
• Unsupervised learning.
• Other sources of information, e.g. text (semantic similarity).