virat_presentation_jim

UCF VIRAT EFFORTS
Bag of Video-Words Video Representation
Outline

• Bag of video-words approach for video representation
  - Feature detection
  - Feature quantization
  - Histogram-based video descriptor generation
• Preliminary experimental results on aerial videos
• Discussion on ways to improve the performance
Bag of video-words approach (I)
Interest Point Detector
Motion Feature Detection
Bag of video-words approach (II)
Video-word A
Video-word B
Video-word C
Feature Quantization: Codebook Generation
Histogram-based Video Descriptor
Bag of video-words approach (III)
Histogram-based video descriptor generation
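The three steps above (detect features, quantize them against a codebook, count word occurrences) can be sketched as follows. This is a minimal illustration, not the implementation used in these experiments: the function names `nearest_word` and `bovw_histogram`, the toy 2-D descriptors, and the three-word codebook are all assumptions made for the example.

```python
from math import dist

def nearest_word(descriptor, codebook):
    """Return the index of the codebook entry (video-word) closest to the descriptor."""
    return min(range(len(codebook)), key=lambda k: dist(descriptor, codebook[k]))

def bovw_histogram(descriptors, codebook):
    """Count how often each video-word occurs, then normalize the counts to sum to 1."""
    counts = [0] * len(codebook)
    for d in descriptors:
        counts[nearest_word(d, codebook)] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(codebook)

# Hypothetical codebook of three video-words and four detected features.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
descriptors = [(0.1, 0.0), (0.9, 0.1), (1.1, -0.1), (0.0, 0.8)]
hist = bovw_histogram(descriptors, codebook)
```

In practice the codebook would come from clustering (for example k-means) over training descriptors rather than being written by hand.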
Similarity Metrics

Histogram Intersection
$$\mathrm{Sim}(H_a, H_b) = \sum_i \min\bigl(H_a(i), H_b(i)\bigr)$$

Chi-square distance
$$\mathrm{Sim}(H_a, H_b) = \exp\bigl(-\chi^2(H_a, H_b)\bigr) = \exp\left(-\sum_i \frac{\bigl(H_a(i) - H_b(i)\bigr)^2}{H_a(i) + H_b(i)}\right)$$
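The two similarity measures on this slide can be written in a few lines. This is a sketch under stated assumptions: the chi-square distance is taken as the plain sum of (a - b)^2 / (a + b) with no extra scaling factor, and bins where both histograms are zero are skipped to avoid division by zero. The function names are illustrative.

```python
from math import exp

def histogram_intersection(ha, hb):
    """Sum of bin-wise minima; equals 1.0 for identical L1-normalized histograms."""
    return sum(min(a, b) for a, b in zip(ha, hb))

def chi_square_similarity(ha, hb):
    """exp(-chi2), where chi2 sums (a - b)^2 / (a + b) over the non-empty bins."""
    chi2 = sum((a - b) ** 2 / (a + b) for a, b in zip(ha, hb) if a + b > 0)
    return exp(-chi2)

h1 = [0.5, 0.5, 0.0]
h2 = [0.25, 0.25, 0.5]
sim_hi = histogram_intersection(h1, h2)
sim_chi = chi_square_similarity(h1, h2)
```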
Classifiers

• Bayesian Classifier
• K-Nearest Neighbors (KNN)
• Support Vector Machines (SVM)
  - Histogram Intersection Kernel
  - Chi-square Kernel
  - RBF (Radial Basis Function) Kernel
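As one concrete instance of the classifiers listed, a K-Nearest Neighbors vote over video histograms might look like the sketch below. It assumes histogram intersection as the similarity, and the training histograms and labels are made-up toy values; `knn_predict` is a hypothetical helper name.

```python
from collections import Counter

def histogram_intersection(ha, hb):
    """Similarity between two histograms: sum of bin-wise minima."""
    return sum(min(a, b) for a, b in zip(ha, hb))

def knn_predict(query, train_hists, train_labels, k=3):
    """Label the query by majority vote among its k most similar training histograms."""
    ranked = sorted(range(len(train_hists)),
                    key=lambda i: histogram_intersection(query, train_hists[i]),
                    reverse=True)
    votes = Counter(train_labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy training set: two "kicking" clips and two "walking" clips.
train_hists = [[0.8, 0.1, 0.1], [0.7, 0.2, 0.1], [0.1, 0.1, 0.8], [0.2, 0.1, 0.7]]
train_labels = ["kicking", "kicking", "walking", "walking"]
predicted = knn_predict([0.75, 0.15, 0.1], train_hists, train_labels, k=3)
```

The same similarity function can serve as a precomputed kernel for an SVM, which is the configuration the slide lists under SVM kernels.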
Experiments on Aerial videos

Dataset
• Blimp with an HD camera on a gimbal
• 11 actions: digging, gesturing, picking up, throwing, kicking, carrying object, walking, standing, running, entering vehicle, exiting vehicle
Clipping & Cropping Actions
[Figure: start frame and end frame of a clip]
- An optimal box is created so that the object of interest stays in view in every frame, from the start frame to the end frame
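One simple way to obtain such a box is to take the union of the per-frame object boxes and clamp it to the frame. The sketch below assumes per-frame boxes are already available (for example from annotation or tracking), which the slides do not specify; `clip_crop_box` and the `margin` parameter are hypothetical.

```python
def clip_crop_box(boxes, frame_width, frame_height, margin=0):
    """Smallest axis-aligned box containing every per-frame box, clamped to the frame.

    boxes: list of (x_min, y_min, x_max, y_max), one per frame from start to end.
    """
    x0 = min(b[0] for b in boxes) - margin
    y0 = min(b[1] for b in boxes) - margin
    x1 = max(b[2] for b in boxes) + margin
    y1 = max(b[3] for b in boxes) + margin
    return (max(0, x0), max(0, y0), min(frame_width, x1), min(frame_height, y1))

# Toy object boxes over three frames of a 1920x1080 clip.
boxes = [(10, 20, 50, 80), (15, 18, 60, 85), (5, 25, 55, 90)]
crop = clip_crop_box(boxes, 1920, 1080)
```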
Feature Detection for Video Clips
[Figure: 200 detected features per clip, shown for digging, kicking, throwing, and walking]
Classification Results (I)

“Kicking” (22 clips) vs. “non-kicking” (22 clips)
Number of features per video
50
100
200
Codebook 50
Codebook 100
Codebook 200
65.91%
79.55%
75.00%
79.55%
77.27%
77.27%
77.27%
79.55%
81.82%
Classification Results (II)
Classification Results (III)
“Digging”, “Kicking”, “Walking”, “Throwing” (25 clips × 4)

[Figure: similarity matrix (histogram intersection); axis labels: walking, throwing, kicking, digging]
Classification Results (V)

Average accuracy with different codebook sizes (200 features per video)

Codebook size      100      200      300
Average accuracy   84.6%    85.0%    86.7%

Confusion table for a codebook size of 300
Misclassified examples (I)

“Walking” misclassified as “kicking”
Misclassified examples (II)

“Digging” misclassified as “walking”
Misclassified examples (III)

“Walking” misclassified as “throwing”
How to improve the performance?

Low-Level Features
• Stable motion features
• Different motion features
• Different motion feature sampling
• Hybrid of motion and static features
Video-words generation
• Unsupervised method
  - Hierarchical K-Means (David Nister et al., CVPR 2006)
• Supervised method
  - Random Forest (Bill Triggs et al., NIPS 2007)
  - “Visual Bits” (Rong Jin et al., CVPR 2008)
Classifiers
• SVM kernels: histogram intersection vs. chi-square distance
• Multiple kernels
Stable motion features
• Motion compensation
• Video clipping and cropping
[Figure: start frame and end frame of a clip]
Different Low-level Features
• Flattened gradient vector (magnitude)
• Histogram of Gradient (direction)
• Histogram of Optical Flow
• Combination of all types of features
Feature sampling
• Feature detection: Gabor filter or 3D Harris corner detection
• Random sampling
• Grid-based sampling
• Bill Triggs et al., Sampling Strategies for Bag-of-Features Image Classification, ECCV 2006
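The two detector-free strategies, random and grid-based sampling, can be sketched as generators of spatiotemporal sample positions (x, y, t). The function names, step sizes, and clip dimensions below are illustrative assumptions, not settings from the experiments.

```python
import random

def grid_sample(width, height, n_frames, step_xy, step_t):
    """Regular grid of (x, y, t) positions with fixed spatial and temporal steps."""
    return [(x, y, t)
            for t in range(0, n_frames, step_t)
            for y in range(0, height, step_xy)
            for x in range(0, width, step_xy)]

def random_sample(width, height, n_frames, n_points, seed=0):
    """Uniformly random (x, y, t) positions; the seed makes the draw repeatable."""
    rng = random.Random(seed)
    return [(rng.randrange(width), rng.randrange(height), rng.randrange(n_frames))
            for _ in range(n_points)]

grid_points = grid_sample(64, 48, 10, step_xy=16, step_t=5)
random_points = random_sample(64, 48, 10, n_points=200)
```

A local descriptor would then be extracted around each position, replacing the interest-point detector step.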
Hybrid of Motion and Static Features (I)

Multiple-frame features (spatiotemporal, motion)
• 3D Harris
• Capture the local spatiotemporal information around the interest points

Single-frame features (spatial, static)
• 2D Harris detector
• MSER (Maximally Stable Extremal Regions) detector
• Perform action recognition from a sequence of instantaneous postures or poses
• Overcome the shortcoming of multiple-frame features, which require relatively stable camera motion

Hybrid of motion and static features
• Represent a video by the combination of multiple-frame and single-frame features
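The slides do not say how the two feature types are combined. One common option, assumed here purely for illustration, is to normalize the motion and static histograms separately and concatenate them with a mixing weight; `hybrid_descriptor` and `motion_weight` are hypothetical names.

```python
def l1_normalize(hist):
    """Scale a histogram so that its bins sum to 1 (an all-zero histogram is returned unchanged)."""
    total = sum(hist)
    return [h / total for h in hist] if total else list(hist)

def hybrid_descriptor(motion_hist, static_hist, motion_weight=0.5):
    """Concatenate weighted motion and static histograms into one video descriptor."""
    m = [motion_weight * v for v in l1_normalize(motion_hist)]
    s = [(1.0 - motion_weight) * v for v in l1_normalize(static_hist)]
    return m + s

# Toy word counts from a motion codebook and a static codebook.
descriptor = hybrid_descriptor([2, 2], [1, 3])
```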
Hybrid of Motion and Static Features (II)
[Figure: examples of MSER and 2D Harris features]
Hybrid of Motion and Static Features (III)

Experiments on three action datasets
• KTH: 6 action categories, 600 videos
• UCF Sports: 10 action categories, about 200 videos
• YouTube videos: 11 action categories, about 1,100 videos
KTH dataset
Boxing
Clapping
Waving
Walking
Jogging
Running
Experimental results on KTH dataset

Recognition using motion (left), static (middle), and hybrid (right) features
Average Accuracy 87.65%
Average Accuracy 82.96%
Average Accuracy 92.66%
Results on UCF sports dataset
The average accuracies for the static, motion, and static+motion strategies are 74.5%, 79.6%, and 84.5%, respectively.
YouTube Video Dataset (I)
Golf Swinging
Diving
Cycling
Riding
Juggling
YouTube Video Dataset (II)
Basketball Shooting
Volleyball Spiking
Swinging
Tennis Swinging
Trampoline Jumping
Results on YouTube dataset

The average accuracies for motion, static, and hybrid features are 65.4%, 63.1%, and 71.2%, respectively.
Hierarchical K-Means (I)


Traditional k-means
• Slow when generating a large codebook
• Less discriminative when the codebook is large
Hierarchical k-means
• Builds a tree on the training features
• The child nodes are the clusters obtained by applying k-means to the parent node
• Each node is treated as a “word”, so the tree is a hierarchical codebook
• D. Nister, Scalable Recognition with a Vocabulary Tree, CVPR 2006
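The tree construction described above can be sketched as recursive k-means. This is a toy version under stated assumptions: a plain Lloyd's k-means with random initialization, a branching factor of 2, and depth 2. The names `kmeans`, `build_tree`, and `quantize` are illustrative and not taken from the cited paper.

```python
import random
from math import dist

def kmeans(points, k, iters=10, seed=0):
    """Plain Lloyd's k-means; returns (centers, clusters) after a fixed number of iterations."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: dist(p, centers[c]))
            clusters[j].append(p)
        # New center = mean of the cluster; an empty cluster keeps its old center.
        centers = [tuple(sum(x) / len(c) for x in zip(*c)) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

def build_tree(points, branch=2, depth=2, seed=0):
    """Each node stores the mean of its points; children are k-means clusters of the parent."""
    node = {"center": tuple(sum(x) / len(points) for x in zip(*points)), "children": []}
    if depth == 0 or len(points) < branch:
        return node
    _, clusters = kmeans(points, branch, seed=seed)
    node["children"] = [build_tree(c, branch, depth - 1, seed) for c in clusters if c]
    return node

def quantize(tree, descriptor):
    """Descend to the nearest child at each level; the path of child indices identifies the word."""
    path, node = [], tree
    while node["children"]:
        j = min(range(len(node["children"])),
                key=lambda c: dist(descriptor, node["children"][c]["center"]))
        path.append(j)
        node = node["children"][j]
    return tuple(path)

# Toy 2-D descriptors grouped around four locations.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10),
          (0, 10), (1, 10), (0, 11), (10, 0), (11, 0), (10, 1)]
tree = build_tree(points, branch=2, depth=2)
word = quantize(tree, (0.2, 0.4))
```

Quantizing a descriptor costs only `branch` distance computations per level, which is why the tree scales to codebooks far larger than a flat k-means search would allow.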
Hierarchical K-Means (II)

Advantages
• The tree also defines the quantization of features, so it integrates indexing and quantization in one tree
• It is much more efficient when generating a large codebook
• The word (node) frequency can be weighted by the inverse document frequency
• It can generate more discriminative words than flat k-means
• It supports a large codebook, which generally yields better performance
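The inverse-document-frequency weighting mentioned above can be illustrated as follows. The sketch assumes the standard form idf_k = log(N / n_k), where N is the number of videos and n_k the number of videos containing word k; the slides do not give the exact formula, and the helper names are hypothetical.

```python
from math import log

def idf_weights(histograms):
    """idf_k = log(N / n_k); words that appear in no video get weight 0."""
    n_videos = len(histograms)
    n_words = len(histograms[0])
    weights = []
    for k in range(n_words):
        n_k = sum(1 for h in histograms if h[k] > 0)
        weights.append(log(n_videos / n_k) if n_k else 0.0)
    return weights

def apply_idf(hist, weights):
    """Scale each word count by its IDF weight."""
    return [h * w for h, w in zip(hist, weights)]

# Toy word counts for four videos over a three-word codebook.
histograms = [[2, 0, 1], [1, 1, 0], [3, 0, 0], [1, 0, 0]]
weights = idf_weights(histograms)
weighted = apply_idf(histograms[0], weights)
```

Word 0 occurs in every video, so its weight is zero: it carries no information for telling the videos apart.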
Random Forests (I)

K-means based quantization methods
• Unsupervised
• Suffer from the high dimensionality of the features

Single-tree based methods
• Each path through the tree typically accesses only a few of the feature dimensions
• Fail to deal with the variance of the feature dimensions
• Fast, but performance is not even as good as k-means

Random Forests
• Build an ensemble of trees
• Each tree node is split by checking a randomly selected subset of feature dimensions
• All the trees are built using video or image labels (supervised method)
• Instead of using the trees as an ensemble classifier, we treat all the leaves of all the trees as “words”
• The generated “words” are more meaningful and discriminative, since they contain class category information
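A toy version of the leaves-as-words idea is sketched below. The details are assumptions chosen to keep the example short: each node tries a random subset of dimensions with a random threshold and keeps the split with the lowest weighted Gini impurity, and leaf ids are numbered across the whole forest. None of the names come from the cited work.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def new_leaf(leaves):
    """Register a leaf and give it an id that is unique across the forest."""
    leaves.append(len(leaves))
    return {"leaf": leaves[-1]}

def grow(X, y, rng, n_dims_tried, depth, leaves):
    """Grow one tree; inner nodes test x[dim] < thr on a randomly chosen dimension."""
    if depth == 0 or len(set(y)) == 1:
        return new_leaf(leaves)
    best = None
    for dim in rng.sample(range(len(X[0])), n_dims_tried):
        thr = rng.uniform(min(x[dim] for x in X), max(x[dim] for x in X))
        lo = [i for i, x in enumerate(X) if x[dim] < thr]
        hi = [i for i, x in enumerate(X) if x[dim] >= thr]
        if not lo or not hi:
            continue
        score = (len(lo) * gini([y[i] for i in lo]) +
                 len(hi) * gini([y[i] for i in hi]))
        if best is None or score < best[0]:
            best = (score, dim, thr, lo, hi)
    if best is None:
        return new_leaf(leaves)
    _, dim, thr, lo, hi = best
    return {"dim": dim, "thr": thr,
            "lo": grow([X[i] for i in lo], [y[i] for i in lo], rng, n_dims_tried, depth - 1, leaves),
            "hi": grow([X[i] for i in hi], [y[i] for i in hi], rng, n_dims_tried, depth - 1, leaves)}

def build_forest(X, y, n_trees=3, n_dims_tried=2, max_depth=3, seed=0):
    """Build several supervised random trees; returns (forest, total number of leaves)."""
    rng = random.Random(seed)
    leaves = []  # shared across trees so that leaf ids are unique in the forest
    forest = [grow(X, y, rng, n_dims_tried, max_depth, leaves) for _ in range(n_trees)]
    return forest, len(leaves)

def leaf_of(node, x):
    """Follow the split tests down to a leaf and return its id (the "word")."""
    while "leaf" not in node:
        node = node["lo"] if x[node["dim"]] < node["thr"] else node["hi"]
    return node["leaf"]

def forest_histogram(forest, n_leaves, descriptors):
    """Each descriptor votes for one leaf per tree; the leaf counts form the video histogram."""
    hist = [0] * n_leaves
    for x in descriptors:
        for tree in forest:
            hist[leaf_of(tree, x)] += 1
    return hist

# Toy labeled descriptors (3 dimensions, 2 classes).
X = [(0.0, 0.1, 5.0), (0.2, 0.0, 4.0), (1.0, 0.9, 5.5),
     (0.9, 1.0, 4.5), (0.1, 0.2, 4.8), (1.1, 0.8, 5.2)]
y = ["walking", "walking", "kicking", "kicking", "walking", "kicking"]
forest, n_leaves = build_forest(X, y, n_trees=3)
hist = forest_histogram(forest, n_leaves, X[:4])
```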
Random Forests (II)
“Visual Bits” (I)


Both k-means and random forests
• Treat all the features equally when generating the codebooks
• Use hard assignment (each feature can be assigned to only one “word”)

“Visual Bits”
• Rong Jin et al., Unifying Discriminative Visual Codebook Generation with Classifier Training for Object Category Recognition, CVPR 2008
• Trains a visual codebook for each category, so it can overcome the shortcomings of “hard assignment” of the features
• Integrates classification and codebook generation, so it is able to select the relevant features by weighting them
“Visual Bits” (II)
Classifiers

Kernel SVM
• Histogram intersection
• Chi-square distance

Multiple kernels
• Fuse different types of features
• Fuse different distance metrics
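Fusing distance metrics through multiple kernels can be as simple as a weighted sum of kernel matrices. The sketch below assumes fixed, hand-chosen weights, whereas multiple kernel learning methods would learn them; the function names are illustrative.

```python
from math import exp

def intersection_kernel(ha, hb):
    """Histogram intersection kernel: sum of bin-wise minima."""
    return sum(min(a, b) for a, b in zip(ha, hb))

def chi_square_kernel(ha, hb):
    """exp(-chi2), with chi2 summing (a - b)^2 / (a + b) over the non-empty bins."""
    chi2 = sum((a - b) ** 2 / (a + b) for a, b in zip(ha, hb) if a + b > 0)
    return exp(-chi2)

def fused_kernel_matrix(hists, kernels, weights):
    """Gram matrix of a weighted sum of kernels over all pairs of histograms."""
    n = len(hists)
    return [[sum(w * k(hists[i], hists[j]) for k, w in zip(kernels, weights))
             for j in range(n)]
            for i in range(n)]

hists = [[0.5, 0.5], [1.0, 0.0]]
K = fused_kernel_matrix(hists, [intersection_kernel, chi_square_kernel], [0.5, 0.5])
```

The resulting matrix can be passed to any SVM implementation that accepts a precomputed kernel. Fusing feature types works the same way, with one kernel per feature histogram.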
The end…

Thank you!