Automatic annotation of human actions

Learning realistic human actions from movies
By
Abhinandh Palicherla
Divya Akuthota
Samish Chandra Kolli
Introduction
Addresses the recognition of natural human actions in diverse and realistic video settings.
Addresses the limitations of existing work (lack of realistic, annotated video datasets).
Visual recognition has progressed from classifying toy objects towards recognizing classes of objects and scenes in natural images.
Existing datasets for human action recognition provide samples for only a few action classes.
To address these limitations we implement:
• Automatic annotation of human actions (manual annotation is difficult and costly)
• Video classification for action recognition
Automatic annotation of human actions
Alignment of actions in scripts and videos
Text Retrieval of human actions
Video datasets for human actions
Alignment of actions in scripts and videos
• Scripts are available for >500 movies (no time synchronization)
www.dailyscript.com, www.movie-page.com, www.weeklyscript.com …
• Subtitles (with time info.) are available for most movies
• Time can be transferred to scripts by text alignment
Movie script (times 01:20:17 → 01:20:23 transferred from subtitles):
Rick sits down with Ilsa.
RICK: Why weren't you honest with me? Why'd you keep your marriage a secret?
ILSA: Oh, it wasn't my secret, Richard. Victor wanted it that way. Not even our closest friends knew about our marriage.

Subtitles:
1172
01:20:17,240 --> 01:20:20,437
Why weren't you honest with me? Why did you keep your marriage a secret?
1173
01:20:20,640 --> 01:20:23,598
It wasn't my secret, Richard. Victor wanted it that way.
1174
01:20:23,800 --> 01:20:26,189
Not even our closest friends knew about our marriage.
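A minimal sketch of the time-transfer idea, assuming subtitles have already been parsed into (start, end, text) triples. The function names, the difflib-based similarity measure, and the 0.8 threshold are our own choices for illustration; any fuzzy text matcher would do.

```python
import difflib
import re

def similarity(a, b):
    """Ratio of matching characters between two normalized strings."""
    norm = lambda s: re.sub(r"[^a-z0-9 ]", "", s.lower())
    return difflib.SequenceMatcher(None, norm(a), norm(b)).ratio()

def transfer_times(script_lines, subtitles, threshold=0.8):
    """Assign (start, end) times to script dialogue lines by matching
    them against time-stamped subtitle entries.

    script_lines: list of dialogue strings from the movie script
    subtitles:    list of (start, end, text) triples parsed from .srt
    Returns (start, end, script_line) triples for confident matches.
    """
    aligned = []
    for line in script_lines:
        best = max(subtitles, key=lambda s: similarity(line, s[2]))
        if similarity(line, best[2]) >= threshold:
            aligned.append((best[0], best[1], line))
    return aligned

# Hypothetical usage with the Casablanca excerpt above:
subs = [("01:20:17,240", "01:20:20,437",
         "Why weren't you honest with me? Why did you keep your marriage a secret?"),
        ("01:20:20,640", "01:20:23,598",
         "It wasn't my secret, Richard. Victor wanted it that way.")]
script = ["Why weren't you honest with me? Why'd you keep your marriage a secret?"]
print(transfer_times(script, subs))
```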
Script alignment: Evaluation
• Annotate action samples in text
• Do automatic script-to-video alignment
• Check the correspondence of actions in scripts and movies

Figure: example of a "visual false positive" ("A black car pulls up, two army officers get out."); a denotes the quality of subtitle-script matching
Text Retrieval of human actions
• Large variation of action expressions in text. For the GetOutCar action:
"… Will gets out of the Chevrolet. …"
"… Erin exits her new truck…"
Potential false positives: "…About to sit down, he freezes…"
• => Supervised text classification approach (see the sketch below)
Figure: training data drawn from 12 movies; test data from 20 different movies
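A minimal sketch of the supervised text classification step, assuming scene descriptions are labeled per action class (here GetOutCar vs. rest). The scikit-learn pipeline and the tiny toy corpus are our own stand-ins for the paper's text classifier, not its actual implementation.

```python
# Bag-of-words text classifier for retrieving action sentences.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

train_texts = [
    "Will gets out of the Chevrolet.",   # positive (GetOutCar)
    "Erin exits her new truck.",         # positive (GetOutCar)
    "About to sit down, he freezes.",    # negative (false-positive style)
    "Rick sits down with Ilsa.",         # negative
]
train_labels = [1, 1, 0, 0]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
clf.fit(train_texts, train_labels)
print(clf.predict(["He climbs out of the car."]))
```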
Video Datasets for Human Actions
• Learn a vision-based classifier from the automatic training set
• Compare its performance to the manual training set
Video Classification for action recognition
SPACE-TIME FEATURES
• Good performance for action recognition
• Compact, and tolerant to background clutter, occlusions and scale changes
INTEREST POINT DETECTION
• Harris operator with a space-time extension
• We use multiple levels of spatio-temporal scales:
σ = 2^((1+i)/2), i = 1, …, 6
τ = 2^(j/2), j = 1, 2
I. Laptev. On space-time interest points. IJCV, 64(2/3):107–123, 2005.
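A minimal sketch of the space-time Harris measure, following the cornerness H = det(μ) − k·trace³(μ) of Laptev (2005). The k value, the integration-scale factor, and the local-maxima threshold are our assumptions, not the authors' code.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter

def spacetime_harris(video, sigma=2.0, tau=1.4, k=0.005):
    """video: float array of shape (t, y, x). Returns a cornerness volume."""
    # Derivatives of the scale-smoothed volume (axis order: t, y, x).
    smoothed = gaussian_filter(video, sigma=(tau, sigma, sigma))
    Lt, Ly, Lx = np.gradient(smoothed)
    # Second-moment (structure) matrix, integrated at a larger scale.
    s = (2 * tau, 2 * sigma, 2 * sigma)        # integration scale (assumed)
    g = lambda img: gaussian_filter(img, sigma=s)
    M = np.array([[g(Lx*Lx), g(Lx*Ly), g(Lx*Lt)],
                  [g(Lx*Ly), g(Ly*Ly), g(Ly*Lt)],
                  [g(Lx*Lt), g(Ly*Lt), g(Lt*Lt)]])
    trace = M[0, 0] + M[1, 1] + M[2, 2]
    det = (M[0,0]*(M[1,1]*M[2,2] - M[1,2]*M[2,1])
         - M[0,1]*(M[1,0]*M[2,2] - M[1,2]*M[2,0])
         + M[0,2]*(M[1,0]*M[2,1] - M[1,1]*M[2,0]))
    return det - k * trace**3

def local_maxima(H, size=5, thresh=1e-9):
    """Interest points = local maxima of the cornerness above a threshold."""
    peaks = (H == maximum_filter(H, size=size)) & (H > thresh)
    return np.argwhere(peaks)   # (t, y, x) coordinates
```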
DESCRIPTORS
• Compute histogram descriptors of the volume around each interest point.
• The volume size (∆x, ∆y, ∆t) is related to the detection scales by ∆x, ∆y = 2kσ and ∆t = 2kτ.
• Each volume is divided into an (nx, ny, nt) grid of cuboids.
• We use k = 9, nx = ny = 3, nt = 2.
DESCRIPTORS (CONTD.)
• For each cuboid, we compute HoG (histograms of oriented gradients) and HoF (histograms of optic flow) descriptors
• Very similar to SIFT descriptors, adapted to the third (temporal) dimension
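A minimal sketch of the per-cuboid histogramming for HoG, assuming the patch has already been cropped around an interest point. The bin count, axis order (t, y, x), and normalization are our assumptions; HoF would bin optic-flow directions over the same cuboid grid.

```python
import numpy as np

def hog3d(patch, grid=(3, 3, 2), bins=8):
    """patch: float array (t, y, x). Returns the concatenation of
    per-cuboid orientation histograms of the spatial gradient."""
    gt, gy, gx = np.gradient(patch)
    mag = np.hypot(gx, gy)
    ang = np.arctan2(gy, gx) % (2 * np.pi)       # orientation in [0, 2*pi)
    T, Y, X = patch.shape
    nx, ny, nt = grid                            # (nx, ny, nt) = (3, 3, 2)
    hists = []
    for ti in range(nt):
        for yi in range(ny):
            for xi in range(nx):
                sl = (slice(ti*T//nt, (ti+1)*T//nt),
                      slice(yi*Y//ny, (yi+1)*Y//ny),
                      slice(xi*X//nx, (xi+1)*X//nx))
                h, _ = np.histogram(ang[sl], bins=bins,
                                    range=(0, 2*np.pi), weights=mag[sl])
                hists.append(h)
    v = np.concatenate(hists).astype(float)
    return v / (np.linalg.norm(v) + 1e-8)        # L2-normalize
```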
SPATIO-TEMPORAL BOF
• Construct a visual vocabulary using k-means, with k = 4000 (just like in hw3)
• Assign each feature to one visual word
• Compute a frequency histogram for the entire video, or for a subsequence defined by a spatio-temporal grid
• If divided into grids, concatenate and normalize the histograms (a sketch follows below)
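A minimal sketch of the bag-of-features step, assuming `descriptors` stacks HoG/HoF vectors from all training videos. MiniBatchKMeans is our substitute for plain k-means, which is slow at k = 4000.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def build_vocabulary(descriptors, k=4000, seed=0):
    """Cluster descriptors into k visual words."""
    return MiniBatchKMeans(n_clusters=k, random_state=seed).fit(descriptors)

def bof_histogram(vocab, video_descriptors):
    """Visual-word frequency histogram for one video (or for one cell
    of a spatio-temporal grid)."""
    words = vocab.predict(video_descriptors)
    h = np.bincount(words, minlength=vocab.n_clusters).astype(float)
    return h / (h.sum() + 1e-8)                  # L1-normalize
```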
GRIDS
We divide both spatial and temporal dimensions:
• Spatial: 1x1, 2x2, 3x3, v1x3, h3x1, o2x2
• Temporal: t1, t2, t3, ot2
• 6 × 4 = 24 possible grid combinations!
• Descriptor + grid = channel
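A minimal sketch of channel enumeration: every (descriptor, grid) pair is one channel, and a channel's feature vector is the concatenation of per-cell BoF histograms (bof_histogram above). The grid identifiers are taken from the slide; pairing them with the two descriptors is our reading.

```python
from itertools import product

SPATIAL = ["1x1", "2x2", "3x3", "v1x3", "h3x1", "o2x2"]
TEMPORAL = ["t1", "t2", "t3", "ot2"]
DESCRIPTORS = ["hog", "hof"]

# One channel per (descriptor, spatial grid, temporal grid) triple.
CHANNELS = list(product(DESCRIPTORS, SPATIAL, TEMPORAL))
print(len(CHANNELS))   # 2 descriptors * 24 grid combinations = 48 channels
```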

NON-LINEAR SVM
• Classification using a non-linear SVM with a multi-channel Gaussian kernel:
K(H_i, H_j) = exp(−Σ_c D_c(H_i, H_j) / A_c)
where D_c(H_i, H_j) = (1/2) Σ_{n=1..V} (h_in − h_jn)² / (h_in + h_jn) is the χ² distance between channel-c histograms, V is the vocabulary size, and A_c is the mean distance between all training samples for channel c
• The best set of channels for a training set is found by a greedy approach
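A minimal sketch of the multi-channel χ² kernel with a precomputed-kernel SVM. Here `train` and `test` are assumed to be lists of per-channel histogram arrays (rows = samples); A_c follows the mean-distance definition above.

```python
import numpy as np
from sklearn.svm import SVC

def chi2_dist(P, Q):
    """Pairwise chi-square distances between rows of P and rows of Q."""
    diff = P[:, None, :] - Q[None, :, :]
    summ = P[:, None, :] + Q[None, :, :] + 1e-10
    return 0.5 * np.sum(diff * diff / summ, axis=2)

def multichannel_kernel(X_channels, train_channels):
    """Kernel matrix between X (rows) and the training set (columns)."""
    K = np.zeros((X_channels[0].shape[0], train_channels[0].shape[0]))
    for Xc, Tc in zip(X_channels, train_channels):
        A = chi2_dist(Tc, Tc).mean()   # A_c: mean train-train distance
        K += chi2_dist(Xc, Tc) / A
    return np.exp(-K)

# Usage with hypothetical per-channel histogram arrays:
# svm = SVC(kernel="precomputed").fit(multichannel_kernel(train, train), y)
# pred = svm.predict(multichannel_kernel(test, train))
```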
WHAT CHANNELS TO USE?
• Channels may complement each other
• Greedy approach to pick the best combination (see the sketch below)
• Combining channels is more advantageous

Table: classification performance of different channels and their combinations
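A minimal sketch of greedy channel selection, assuming an `evaluate(channel_subset)` callback that returns cross-validation accuracy for a subset (the scoring harness is not shown). This forward-only variant is our simplification of the paper's greedy search.

```python
def greedy_channel_selection(channels, evaluate):
    """Grow the channel set one channel at a time, keeping each
    addition only while it improves the evaluation score."""
    chosen, best_score = [], float("-inf")
    remaining = list(channels)
    while remaining:
        # Try adding each remaining channel; keep the best candidate.
        cand, score = max(((c, evaluate(chosen + [c])) for c in remaining),
                          key=lambda cs: cs[1])
        if score <= best_score:
            break          # no remaining channel improves the combination
        chosen.append(cand)
        remaining.remove(cand)
        best_score = score
    return chosen, best_score
```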
EVALUATION OF SPATIO-TEMPORAL GRIDS
Figure: Number of occurrences for each channel component within the optimized
channel combinations for the KTH action dataset and our manually labeled movie
dataset
RESULTS WITH THE KTH DATASET
Figure: Sample frames from the KTH action sequences; all six classes (columns) and scenarios (rows) are presented
RESULTS WITH THE KTH DATASET
• 2391 sequences divided into a training/validation set (8+8 people) and a test set (9 people)
• 10-fold cross-validation

Table: Confusion matrix for the KTH actions
ROBUSTNESS TO NOISE IN THE TRAINING DATA
• Up to p = 0.2 the performance decreases insignificantly
• At p = 0.4 the performance decreases by around 10%

Figure: Performance of our video classification approach in the presence of wrong labels
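A minimal sketch of the label-corruption setup behind this experiment: each training label is flipped to a random other class with probability p before training. The class count and RNG seeding are our assumptions.

```python
import numpy as np

def corrupt_labels(y, p, n_classes, seed=0):
    """Return a copy of y where each label is replaced, with
    probability p, by a uniformly drawn wrong label."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y).copy()
    flip = rng.random(len(y)) < p
    # Offset in [1, n_classes-1] guarantees the new label differs.
    offsets = rng.integers(1, n_classes, size=flip.sum())
    y[flip] = (y[flip] + offsets) % n_classes
    return y
```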
ACTION RECOGNITION IN REAL-WORLD VIDEOS
Table: Average precision (AP) for each action class of our test set; comparison of clean (annotated) and automatic training data, as well as a random classifier (chance)
ACTION RECOGNITION IN REAL-WORLD VIDEOS
Figure: Example results for action classification trained on the automatically annotated data. We show key frames from test movies with the highest confidence values for true/false positives and negatives
• the rapid getting up is typical for “GetOutCar”
• the false negatives are very difficult to recognize
• occluded handshake
• hardly visible person getting out of the car
CONCLUSIONS
Summary
• Automatic generation of realistic action samples
• Transfer of recent bag-of-features experience to videos
• Improved performance on the KTH benchmark
• Decent results for actions in real-world videos

Future directions
• Improving the script-video alignment
• Experimenting with space-time low-level features
• Internet-scale video search
THANK YOU