A general survey of previous works on action recognition
Sobhan Naderi Parizi
September 2009

Papers covered:
▪ Statistical Analysis of Dynamic Actions
▪ On Space-Time Interest Points
▪ Unsupervised Learning of Human Action Categories Using Spatial-Temporal Words
▪ What, where and who? Classifying events by scene and object recognition
▪ Recognizing Actions at a Distance
▪ Recognizing Human Actions: A Local SVM Approach
▪ Retrieving Actions in Movies
▪ Learning Realistic Human Actions from Movies
▪ Actions in Context
▪ Selection and Context for Action Recognition

Paper info:
Title:
▪ Statistical Analysis of Dynamic Actions
Authors:
▪ Lihi Zelnik-Manor
▪ Michal Irani
TPAMI 2006; a preliminary version appeared in CVPR 2001 as "Event-Based Video Analysis"

Overview:
▪ Introduces a non-parametric distance measure between video sequences
▪ Video matching (no action model): given a reference video, similar sequences are found
▪ Dense features at multiple temporal scales (only corresponding scales are compared)
▪ The temporal extent of videos within a category must match (fast and slow dancing count as different actions)
▪ A new database is introduced:
  ▪ Periodic activities (walk)
  ▪ Non-periodic activities (punch, kick, duck, tennis)
  ▪ Temporal textures (water)
  ▪ www.wisdom.weizmann.ac.il/~vision/EventDetection.html

Feature description:
▪ Space-time gradient at each pixel
▪ Gradient magnitudes are thresholded
▪ Normalization (ignores appearance)
▪ Absolute values (invariant to dark/light transitions), hence direction invariant:
  $(N_x^l, N_y^l, N_t^l) = \dfrac{(|S_x^l|,\ |S_y^l|,\ |S_t^l|)}{\sqrt{(S_x^l)^2 + (S_y^l)^2 + (S_t^l)^2}}$

Comments:
▪ Actions are represented by 3L independent 1D distributions (L being the number of temporal scales)
▪ Frames are blurred first, which makes the measure robust to appearance changes, e.g. highly textured clothing
▪ Action recognition/localization:
  ▪ For a test sequence S and a reference sequence of T frames, each consecutive sub-sequence of length T is compared to the reference
  ▪ With multiple reference videos, the Mahalanobis distance is used

Paper info:
Title:
▪ On Space-Time Interest Points
Authors:
▪ Ivan Laptev: INRIA / IRISA
IJCV 2005

▪ Extends the Harris detector to 3D (space-time)
▪ Detects local space-time points with non-constant motion; accelerated motion corresponds to physical forces
▪ Independent spatial and temporal scales
▪ Automatic scale selection procedure:
  ▪ Detect interest points
  ▪ Move in the direction of the optimal scale
  ▪ Repeat until a locally optimal scale is reached (iterative)
▪ The procedure cannot run in real time, since future frames are needed; estimation approaches exist to address this

Paper info:
Title:
▪ Unsupervised Learning of Human Action Categories Using Spatial-Temporal Words
Authors:
▪ Juan Carlos Niebles: University of Illinois
▪ Hongcheng Wang: University of Illinois
▪ Li Fei-Fei: University of Illinois
BMVC 2006

▪ Generative graphical model (pLSA)
▪ The space-time interest point detector of Piotr Dollár et al. is used; Laptev's STIP detector is too sparse
▪ A dictionary of video words is created
▪ The method is unsupervised
▪ Simultaneous action recognition and localization
▪ Evaluations on:
  ▪ KTH action database
  ▪ Skating actions database (4 action classes)

Overview of the method:
▪ w: video word, d: video sequence, z: latent topic (action category)
▪ $P(w_i, d_j) = \sum_{k=1}^{K} P(w_i, d_j, z_k) = P(d_j) \sum_{k=1}^{K} P(z_k \mid d_j)\, P(w_i \mid z_k)$
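To make the decomposition above concrete, here is a minimal NumPy sketch of pLSA fitted with EM over a video-word count matrix. This is an illustrative reimplementation under the notation above, not the authors' code; the count matrix `counts`, the number of topics `K`, and the random initialization are assumptions.

```python
import numpy as np

def plsa_em(counts, K, n_iter=100, seed=0):
    """Fit pLSA by EM. counts: (n_words, n_docs) matrix of video-word counts."""
    rng = np.random.default_rng(seed)
    n_words, n_docs = counts.shape
    # Random initialization of P(w|z) and P(z|d), each normalized over the proper axis.
    p_w_z = rng.random((n_words, K))
    p_w_z /= p_w_z.sum(axis=0, keepdims=True)
    p_z_d = rng.random((K, n_docs))
    p_z_d /= p_z_d.sum(axis=0, keepdims=True)

    for _ in range(n_iter):
        # E-step: posterior P(z|w,d) for every (word, doc) pair -> shape (n_words, K, n_docs).
        joint = p_w_z[:, :, None] * p_z_d[None, :, :]
        p_z_wd = joint / (joint.sum(axis=1, keepdims=True) + 1e-12)

        # M-step: re-estimate P(w|z) and P(z|d) from the expected counts n(w,d) * P(z|w,d).
        expected = counts[:, None, :] * p_z_wd
        p_w_z = expected.sum(axis=2)                       # sum over documents
        p_w_z /= p_w_z.sum(axis=0, keepdims=True) + 1e-12
        p_z_d = expected.sum(axis=0)                       # sum over words
        p_z_d /= p_z_d.sum(axis=0, keepdims=True) + 1e-12

    return p_w_z, p_z_d
```

At test time a new sequence can be "folded in" by keeping P(w|z) fixed and re-running only the P(z|d) update for that sequence's word histogram; the topic with the highest P(z|d) then gives the predicted action category.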
Feature descriptor:
▪ Brightness gradients + PCA
▪ Brightness gradients were found equivalent to optical flow for capturing motion
▪ Multiple actions can be localized in the video via the posterior
  $P(z_k \mid w_i, d_j) = \dfrac{P(w_i \mid z_k)\, P(z_k \mid d_j)}{\sum_{l=1}^{K} P(w_i \mid z_l)\, P(z_l \mid d_j)}$

Average classification accuracy:
▪ KTH action database: 81.5%
▪ Skating dataset: 80.67%

Paper info:
Title:
▪ What, where and who? Classifying events by scene and object recognition
Authors:
▪ Li-Jia Li: University of Illinois
▪ Li Fei-Fei: Princeton University
ICCV 2007

Goal of the paper:
▪ Event classification in still images
▪ Scene labeling
▪ Object labeling

Approach:
▪ Generative graphical model
▪ Assumes that objects and scenes are independent given the event category
▪ Ignores spatial relationships between objects

Information channels:
▪ Scene context (holistic representation)
▪ Object appearance
▪ Geometric layout (sky at infinity / vertical structure / ground plane)

Feature extraction:
▪ 12x12 patches obtained by grid sampling (10x10 grid)
▪ For each patch:
  ▪ SIFT feature (used for both the scene and object models)
  ▪ Layout label (used only for the object model)

The graphical model:
▪ E: event, S: scene, O: object, X: scene feature, A: appearance feature, G: geometry layout

A new database is compiled:
▪ 8 sport event categories (downloaded from the web): bocce, croquet, polo, rowing, snowboarding, badminton, sailing, rock climbing
▪ Average classification accuracy over all 8 event classes = 74.3%

Paper info:
Title:
▪ Recognizing Actions at a Distance
Authors:
▪ Alexei A. Efros: UC Berkeley
▪ Alexander C. Berg: UC Berkeley
▪ Greg Mori: UC Berkeley
▪ Jitendra Malik: UC Berkeley
ICCV 2003

Overall review:
▪ Actions at medium resolution (people about 30 pixels tall)
▪ Proposes a new motion descriptor
▪ kNN for classification
▪ A consistent tracking bounding box of the actor is required; action recognition is done only inside that box
▪ Motion is described as the relative movement of body parts; the tracker itself provides no information about the movements

Motion feature:
▪ For each frame, a local temporal neighborhood is considered
▪ Optical flow is extracted (alternatives: raw pixel values, temporal gradients)
▪ Optical flow is noisy, so it is half-wave rectified and blurred
▪ To preserve motion information, the flow vector is decomposed into its horizontal and vertical components

Similarity measure (a code sketch follows below):
▪ i, j: frame indices; T: temporal extent; I: spatial extent (the bounding box)
▪ Frame i of sequence A is represented by the four motion channels $\{a_1^i, a_2^i, a_3^i, a_4^i\}$, frame j of sequence B by $\{b_1^j, b_2^j, b_3^j, b_4^j\}$
▪ $S(i, j) = \sum_{t \in T} \sum_{c=1}^{4} \sum_{(x, y) \in I} a_c^{i+t}(x, y)\, b_c^{j+t}(x, y)$
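Below is a minimal sketch of the motion-channel construction and the frame-to-frame similarity described above. It assumes the per-frame optical flow fields `u`, `v` are already available from some flow estimator, and the blur width is an illustrative choice rather than the paper's setting.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def motion_channels(u, v, sigma=1.5):
    """Half-wave rectify the flow into 4 channels and blur each one."""
    channels = [
        np.maximum(u, 0),   # positive horizontal flow
        np.maximum(-u, 0),  # negative horizontal flow
        np.maximum(v, 0),   # positive vertical flow
        np.maximum(-v, 0),  # negative vertical flow
    ]
    return [gaussian_filter(c, sigma) for c in channels]

def frame_similarity(seq_a, seq_b, i, j, T):
    """S(i, j): correlate the 4 channels over a temporal window of +/- T frames.

    seq_a, seq_b: lists of per-frame channel lists (outputs of motion_channels).
    Assumes i and j are far enough from the sequence ends (boundary handling omitted).
    """
    s = 0.0
    for t in range(-T, T + 1):
        for c in range(4):
            s += np.sum(seq_a[i + t][c] * seq_b[j + t][c])
    return s
```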
New dataset:
▪ Ballet (stationary camera):
  ▪ 16 action classes
  ▪ 2 men + 2 women
  ▪ Easy dataset (controlled environment)
▪ Tennis (real actions, stationary camera):
  ▪ 6 action classes (stand, swing, move-left, ...)
  ▪ Different days / locations / camera positions
  ▪ 2 players (man + woman)
▪ Football (real actions, moving camera):
  ▪ 8 action classes (run-left 45°, run-left, walk-left, ...)
  ▪ Zoom in/out

Average classification accuracy:
▪ Ballet: 87.44% (5-NN)
▪ Tennis: 64.33% (5-NN)
▪ Football: 65.38% (1-NN)

What can be done? Applications:
▪ Do as I Do: replace actors in videos
▪ Do as I Say: generate real-world motions in computer games; 2D/3D skeleton transfer
▪ Figure correction: remove occlusion/clutter in movies

Paper info:
Title:
▪ Recognizing Human Actions: A Local SVM Approach
Authors:
▪ Christian Schuldt: KTH
▪ Ivan Laptev: KTH
ICPR 2004

New dataset (KTH action database):
▪ 2391 video sequences
▪ 6 action classes (walking, jogging, running, hand-clapping, boxing, hand-waving)
▪ 25 persons
▪ Static camera
▪ 4 scenarios:
  ▪ Outdoors (s1)
  ▪ Outdoors + scale variation (s2): the hardest scenario
  ▪ Outdoors + clothing variation (s3)
  ▪ Indoors (s4)

Features:
▪ Sparse (STIP detector)
▪ Spatio-temporal jets of order 4

Different feature representations:
▪ Raw jet feature descriptors
▪ χ² exponential kernel on histograms of jets
▪ Spatial histograms of gradients with a temporal pyramid

Different classifiers:
▪ SVM
▪ NN

Experimental results:
▪ Local features (jets) + SVM performs best
▪ SVM outperforms NN
▪ HistLF (histogram of jets) is slightly better than HistSTG (histogram of spatio-temporal gradients)
▪ Average classification accuracy over all scenarios = 71.72%

Paper info:
Title:
▪ Retrieving Actions in Movies
Authors:
▪ Ivan Laptev: INRIA / IRISA
▪ Patrick Pérez: INRIA / IRISA
ICCV 2007

▪ A new action database collected from real movies
▪ Experiments only on the Drinking action (vs. random sequences and Smoking)

Main contributions:
▪ Recognizing unrestricted real actions
▪ Key-frame priming

Configuration of experiments:
▪ Action recognition (on pre-segmented sequences)
▪ Comparison of different features
▪ Action detection (using key-frame priming)

Real movie action database:
▪ 105 drinking actions
▪ 141 smoking actions
▪ Different scenes / people / views
▪ www.irisa.fr/vista/Equipe/People/Laptev/actiondetection.html

Action representation:
▪ R = (P, ΔP)
▪ P = (X, Y, T): space-time coordinates
▪ ΔP = (ΔX, ΔY, ΔT): space-time extent
  ▪ ΔX = 1.6 × the width of the head bounding box
  ▪ ΔY = 1.3 × the height of the head bounding box

Learning scheme (a sketch follows below):
▪ Discrete AdaBoost with FLD (Fisher Linear Discriminant) weak learners
▪ All action cuboids are normalized to 14x14x8 cells of 5x5x5 pixels (needed for boosting)
▪ Slightly temporally-perturbed copies of the sequences are added to the training set
▪ HoG (4 bins) / OF (5 bins) features are used
▪ Local features: Θ = (x, y, t, δx, δy, δt, β, Ψ), with β ∈ {plain, temp-2, spat-4} and Ψ ∈ {OF5, Grad4}
▪ HoG captures shape, OF captures motion
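For reference, here is a compact sketch of discrete AdaBoost, the outer loop of the learning scheme above. Decision stumps stand in for the paper's FLD weak learners, and the cuboid features are assumed to be already extracted into a matrix X with labels y in {-1, +1}; this is an illustration, not the authors' implementation.

```python
import numpy as np

def train_stump(X, y, w):
    """Best single-feature threshold classifier under sample weights w."""
    best = (0, 0.0, 1, np.inf)                  # (feature, threshold, polarity, weighted error)
    for f in range(X.shape[1]):
        for thr in np.unique(X[:, f]):
            for pol in (1, -1):
                pred = np.where(pol * (X[:, f] - thr) >= 0, 1, -1)
                err = np.sum(w[pred != y])
                if err < best[3]:
                    best = (f, thr, pol, err)
    return best

def adaboost(X, y, n_rounds=50):
    """Discrete AdaBoost: returns a list of ((feature, threshold, polarity), alpha) pairs."""
    w = np.full(len(y), 1.0 / len(y))
    ensemble = []
    for _ in range(n_rounds):
        f, thr, pol, err = train_stump(X, y, w)
        err = max(err, 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)   # weak-learner weight
        pred = np.where(pol * (X[:, f] - thr) >= 0, 1, -1)
        w *= np.exp(-alpha * y * pred)          # re-weight misclassified samples
        w /= w.sum()
        ensemble.append(((f, thr, pol), alpha))
    return ensemble

def predict(ensemble, X):
    score = np.zeros(len(X))
    for (f, thr, pol), alpha in ensemble:
        score += alpha * np.where(pol * (X[:, f] - thr) >= 0, 1, -1)
    return np.where(score >= 0, 1, -1)
```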
▪ Informative motions: the start and end of the action
▪ Key-frame: the moment when the hand reaches the head
  ▪ Classified with boosted histograms on HoG
  ▪ There is little motion information around the key-frame, so integrating motion and key-frame cues should help

Experiments:
▪ Compared setups: OF, OF+HoG, STIP+NN, key-frame only
▪ OF and OF+HoG work best on the hard test (drinking vs. smoking)
▪ Extending OF5 to OFGrad9 does not help
▪ Key-frame priming: the number of false positives decreases significantly (motion and key-frame appearance are different information channels)
▪ Overall accuracy improves significantly: it is better to model motion and appearance separately
▪ Speed of the key-primed version: 3 seconds per frame

Possible extensions:
▪ Extend the experiments to more action classes
▪ Make it real-time

Paper info:
Title:
▪ Learning Realistic Human Actions from Movies
Authors:
▪ Ivan Laptev: INRIA / IRISA
▪ Marcin Marszalek: INRIA / LEAR
▪ Cordelia Schmid: INRIA / LEAR
▪ Benjamin Rozenfeld: Bar-Ilan University
CVPR 2008

Overview:
▪ Automatic movie annotation:
  ▪ Alignment of movie scripts
  ▪ Text classification
▪ Classification of real actions
▪ Provides a new dataset
▪ Beats state-of-the-art results on the KTH dataset
▪ Extends the spatial pyramid to a space-time pyramid

Movie scripts:
▪ Publicly available textual descriptions of:
  ▪ Scene descriptions
  ▪ Characters
  ▪ Transcribed dialogs
  ▪ Actions (descriptive)
▪ Limitations:
  ▪ No exact timing alignment
  ▪ No guarantee of correspondence with the real actions
  ▪ Actions are expressed literally (diverse descriptions)
  ▪ Actions may be missed due to lack of conversation

Automatic annotation:
▪ Subtitles include exact time alignment
▪ Script timing is matched to the video via the subtitles
▪ Action descriptions in the script text are identified with a text classifier

New dataset:
▪ 8 action classes (AnswerPhone, GetOutCar, SitUp, ...)
▪ Two training sets (automatically / manually annotated)
▪ 60% of the automatic training set is correctly annotated
▪ http://www.irisa.fr/vista/actions

Action classification approach:
▪ BoF framework (k = 4000)
▪ Space-time pyramids:
  ▪ 6 spatial grids: {1x1, 2x2, 3x3, 1x3, 3x1, o2x2}
  ▪ 4 temporal grids: {t1, t2, t3, ot2}
▪ STIP at multiple scales
▪ HoG and HoF descriptors

Feature extraction:
▪ A volume of (2kσ × 2kσ × 2kτ) is taken around each STIP, where σ/τ are the spatial/temporal extents (k = 9)
▪ The volume is divided into an $n_x \times n_y \times n_t = 3 \times 3 \times 2$ grid
▪ HoG and HoF are computed for each grid cell and concatenated
▪ These concatenated features are concatenated once more according to the pattern of the spatio-temporal pyramid

Different channels (a code sketch of this kernel follows at the end of this section):
▪ Each spatio-temporal template is one channel
▪ Greedy search to find the best channel combination C
▪ Kernel function: $K(H_i, H_j) = \exp\bigl(-\sum_{c \in C} \tfrac{1}{A_c} D_c(H_i, H_j)\bigr)$, where $D_c$ is the χ² distance for channel c and $A_c$ is a channel normalization constant

Observations:
▪ HoG performs better than HoF
▪ No temporal subdivision is preferred (temporal grid = t1)
▪ Combining channels improves classification in the real-movie scenario
▪ Mean AP on the KTH action database = 91.8%
▪ Mean AP on the real-movie database:
  ▪ Trained on the manually annotated set: 39.5%
  ▪ Trained on the automatically annotated set: 22.9%
  ▪ Random classifier (chance): 12.5%

Future works:
▪ Increase robustness to annotation noise
▪ Improve script-to-video alignment
▪ Learn on a larger database of automatic annotations
▪ Experiment with more low-level features
▪ Move from BoF to detector-based methods

The table shows:
▪ The effect of temporal division when combining channels (HMM-based methods should work)
▪ The pattern of the spatio-temporal pyramid changes so that context is best captured when the action is scene-dependent
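A minimal sketch of the multi-channel χ² kernel defined above (the same form reappears in Actions in Context below). The per-channel constants A_c are assumed to be mean χ² distances precomputed on the training set; this is an illustration, not the authors' implementation.

```python
import numpy as np

def chi2_distance(h1, h2, eps=1e-10):
    """Chi-square distance between two (L1-normalized) histograms."""
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

def multichannel_kernel(hists_i, hists_j, mean_dists):
    """K(H_i, H_j) = exp(-sum_c D_c(H_i, H_j) / A_c).

    hists_i, hists_j: dicts mapping channel name -> histogram for the two videos.
    mean_dists: dict mapping channel name -> A_c (mean chi-square distance of that channel).
    """
    total = sum(chi2_distance(hists_i[c], hists_j[c]) / mean_dists[c] for c in hists_i)
    return np.exp(-total)

def gram_matrix(all_hists, mean_dists):
    """Kernel matrix over a list of multi-channel histograms."""
    n = len(all_hists)
    K = np.zeros((n, n))
    for i in range(n):
        for j in range(i, n):
            K[i, j] = K[j, i] = multichannel_kernel(all_hists[i], all_hists[j], mean_dists)
    return K
```

The resulting Gram matrix can then be passed to any SVM implementation that accepts precomputed kernels.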
Paper info:
Title:
▪ Actions in Context
Authors:
▪ Marcin Marszalek: INRIA / LEAR
▪ Ivan Laptev: INRIA / IRISA
▪ Cordelia Schmid: INRIA / LEAR
CVPR 2009

Contributions:
▪ Automatic learning of scene classes from video
▪ Improved action recognition using scene context, and vice versa
▪ Movie scripts are used for automatic training
▪ For both actions and scenes: BoF + SVM

New large database:
▪ 12 action classes
▪ 69 movies involved
▪ 10 scene classes
▪ www.irisa.fr/vista/actions/hollywood2
▪ For automatic annotation, scenes are identified from text only

Features:
▪ SIFT, extracted with the 2D Harris detector:
  ▪ Captures static appearance
  ▪ Used for modeling scene context
  ▪ Computed on single frames (every 2 seconds)
▪ HoG/HoF, extracted with the 3D Harris (STIP) detector:
  ▪ HoG captures dynamic appearance
  ▪ HoF captures motion patterns
▪ One video dictionary is created per channel
▪ A histogram of video words is created for each channel

Classifier:
▪ SVM with the χ² distance
▪ Exponential (RBF-like) kernel, summed over multiple channels:
  $K(x_i, x_j) = \exp\bigl(-\sum_{c \in \text{channels}} \tfrac{1}{A_c} D_c(x_i, x_j)\bigr)$

Evaluations:
▪ SIFT: better for context
▪ HoG/HoF: better for actions
▪ Context alone already classifies actions fairly well
▪ The combination of the 3 channels works best

Observations:
▪ Context is not always helpful
  ▪ Idea: the model should control the contribution of context for each action class individually
▪ Overall, the accuracy gain from context is not large
  ▪ Idea: other types of context may work better

Paper info:
Title:
▪ Selection and Context for Action Recognition
Authors:
▪ Dong Han: University of Bonn
▪ Liefeng Bo: TTI-Chicago
▪ Cristian Sminchisescu: University of Bonn
ICCV 2009

Main contributions:
▪ Contextual scene descriptors based on:
  ▪ Presence/absence of objects (bag-of-detectors)
  ▪ Structural relations between objects and their parts
▪ Automatic learning over multiple features:
  ▪ Multiple Kernel Gaussian Process Classifier (MKGPC)

Experimental results on:
▪ KTH action dataset
▪ Hollywood-1 and Hollywood-2 Human Action databases (INRIA)

Main message:
▪ Detecting a car with a person in its proximity increases the probability of the GetOutCar action
▪ Provides a framework to train a classifier on a combination of multiple features (not all of which need to be relevant), e.g. HoG + HoF + histogram intersection, ...
▪ Similar to MKL, but here the parameters (kernel weights + hyper-parameters) are learnt automatically:
  $k_m(x_i, x_j; \beta, \theta) = \sum_{t=1}^{T} e^{\beta_t}\, k_t(x_i^t, x_j^t; \theta_t)$
▪ A Gaussian Process scheme is used to learn the parameters (a kernel-combination sketch follows at the end of this section)

Descriptors:
▪ Bag of detectors:
  ▪ Deformable part models are used (Felzenszwalb et al.)
  ▪ Once object bounding boxes are detected, 3 descriptors are built:
    ▪ ObjPres (4D)
    ▪ ObjCount (4D)
    ▪ ObjDist (21D): pairwise distances between the 7 parts of the person detector
▪ HoG (4D) + HoF (5D) from the STIP detector (Laptev):
  ▪ Spatial grids: 1x1, 2x1, 3x1, 4x1, 2x2, 3x3
  ▪ Temporal grids: t1, t2, t3
▪ 3D gradient features

Experimental results:
▪ KTH dataset:
  ▪ 94.1% mean AP vs. 91.8% reported by Laptev
  ▪ Superior to the state of the art in all classes but Running
▪ HOHA-1 dataset:
  ▪ Trained on the clean set only
  ▪ The optimal subset of features is found greedily (addition/removal) based on test error
  ▪ 47.5% mean AP vs. 38.4% reported by Laptev
▪ HOHA-2 dataset:
  ▪ 43.12% mean AP vs. 35.1% reported by Marszalek
▪ Best feature combination
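Finally, a small sketch of the exponentiated-weight kernel combination written above for MKGPC. Only the combination step is shown; the Gaussian Process machinery that learns β and θ is omitted, and the RBF base kernels are placeholders for the actual per-feature kernels.

```python
import numpy as np

def rbf_kernel(Xa, Xb, gamma):
    """Base RBF kernel; stands in for any of the per-feature kernels k_t."""
    sq = np.sum(Xa**2, axis=1)[:, None] + np.sum(Xb**2, axis=1)[None, :] - 2 * Xa @ Xb.T
    return np.exp(-gamma * sq)

def combined_kernel(feature_sets_a, feature_sets_b, betas, gammas):
    """k_m(x_i, x_j; beta, theta) = sum_t exp(beta_t) * k_t(x_i^t, x_j^t; theta_t).

    feature_sets_a / feature_sets_b: lists of arrays, one (n_samples, d_t) block per feature channel t.
    betas: log-weights (exponentiated so the channel weights stay positive).
    gammas: per-channel kernel hyper-parameters (here, RBF widths).
    """
    K = np.zeros((feature_sets_a[0].shape[0], feature_sets_b[0].shape[0]))
    for Xa, Xb, beta, gamma in zip(feature_sets_a, feature_sets_b, betas, gammas):
        K += np.exp(beta) * rbf_kernel(Xa, Xb, gamma)
    return K
```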