Spatio-temporal Embedding for Statistical Face Recognition from Video
Wei Liu†, Zhifeng Li†, and Xiaoou Tang‡
† MMLAB, Department of Information Engineering, The Chinese University of Hong Kong
‡ Visual Computing Group, Microsoft Research Asia

Introduction
This paper addresses the problem of how to learn an appropriate representation from video to benefit video-based face recognition. We pose it as learning a spatio-temporal embedding (STE) from raw video.
• The STE of a video sequence is defined as a condensed version of the sequence that captures the essence of its space-time characteristics.
• Relying on co-occurrence statistics of the training videos, Bayesian keyframe learning yields the temporal embedding of each video, namely its keyframes.
• Given supervised signatures of the face videos, a nonparametric discriminant embedding (NDE) learned from the keyframes completes the STE.
• A statistical formulation in terms of STEs is developed for the video-based recognition problem.
Framework
Fig. 1. The framework of our video-based face recognition approach. (a) Training stage: learn keyframes from the video sequences and arrange them into K groups of homogeneous frames, whose low-dimensional spatial embeddings, obtained via NDE, constitute the STEs; (b) testing stage: construct a statistical classifier in terms of the learned STEs.
Temporal Embedding
• Using only image information, we exploit the co-occurrence statistics of the video sequences.
• Based on the spatio-temporal correlation between frames, a synchronized frame clustering method applies K-means to incrementally cluster frames across all videos.
• Given the K clusters Ck, with sub-clusters Ck(i) in each training video sequence V(i), the optimal keyframe e in Ck(i) is selected by maximizing a Bayesian (posterior) criterion; we call this Bayesian keyframe learning.
• We collect the top-m keyframes within each Ck(i) to span the temporal embedding T(i) of sequence V(i), and denote its k-th component, drawn from cluster Ck(i), by Tk(i). A minimal clustering-and-selection sketch follows this list.
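Below is a minimal sketch of the clustering-and-selection step, under stated assumptions: it is not the authors' exact procedure; the elided Bayesian rule, which has the flavor of e* = argmax_{e in Ck(i)} p(Ck | e), is approximated here by ranking frames by distance to the global cluster center; and the function learn_keyframes, its arguments, and the use of scikit-learn's KMeans are illustrative choices.

import numpy as np
from sklearn.cluster import KMeans

def learn_keyframes(videos, K, top_m=1):
    """videos: list of (n_i, d) frame matrices; returns the temporal
    embedding T(i) of each video as a dict {k: top-m keyframes Tk(i)}."""
    pooled = np.vstack(videos)                   # pool frames from all training videos
    km = KMeans(n_clusters=K, n_init=10).fit(pooled)
    embeddings = []
    for V in videos:
        labels = km.predict(V)                   # sub-cluster Ck(i): frames of V falling in cluster k
        T = {}
        for k in range(K):
            idx = np.flatnonzero(labels == k)
            if idx.size == 0:                    # this video contributes no frame to cluster k
                continue
            # proxy for the Bayesian keyframe rule: keep the frames of Ck(i)
            # closest to the global center of Ck
            dist = np.linalg.norm(V[idx] - km.cluster_centers_[k], axis=1)
            T[k] = V[idx[np.argsort(dist)[:top_m]]]
        embeddings.append(T)
    return embeddings, km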
Fig. 3. Keyframes learned from one video sequence in XM2VTS. (a) Top-1 keyframes, each of which stands for a cluster in the sequence; (b) 10 keyframes shown along the temporal axis, compared with the speech signal.

Spatial Embedding: NDE

Fig. 2. (a) The two half-sphere toy data sets with the two labels "*" and "o"; (b) PCA embedding; (c) LDA embedding; (d) NDE.
• NDE is the multi-class nonparametric variant of conventional LDA, built on improved within-class and between-class scatter matrices (a hedged reconstruction is sketched after this list).
• NDE is a generalized version of LDA and usually provides more than c-1 projections.
• NDE effectively captures the boundary structures between different classes.
• Fig. 2 demonstrates that NDE can find a better subspace than LDA or PCA when training data are abundant, so we bring the merits of NDE into video-based classification.
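The scatter-matrix equation referenced above did not survive extraction. As a hedged reconstruction, assuming NDE follows the standard nonparametric discriminant analysis of Fukunaga and Mantock (the poster's exact weighting may differ), the improved scatter matrices take the form

\tilde{S}_b = \sum_{i=1}^{c} \sum_{j \neq i} \sum_{x_l \in C_i} w(i,j,l)\,\bigl(x_l - m_j(x_l)\bigr)\bigl(x_l - m_j(x_l)\bigr)^{\top}

\tilde{S}_w = \sum_{i=1}^{c} \sum_{x_l \in C_i} \bigl(x_l - m_i(x_l)\bigr)\bigl(x_l - m_i(x_l)\bigr)^{\top}

where m_j(x_l) is the mean of the k nearest neighbors of x_l drawn from class j, and w(i,j,l) is a weight emphasizing samples near class boundaries. Because \tilde{S}_b accumulates local neighbor differences rather than only c class-mean differences, its rank is not capped at c-1, which is why NDE can supply more than c-1 projections and capture boundary structure.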
Statistical Recognition
• To run NDE on the K slices slice_k = {Tk(i)}_i, we merge the keyframes carrying the same identity label into one class within each slice. Thus K NDE projections Wk are acquired along with the STEs.
• For any test video frame x, we compute its statistical correlation to the learned STE L(i) of each gallery video sequence V(i), modeling the correlation as the posterior probability p(L(i)|x).
• Using a probabilistic fusion scheme, we construct a MAP classifier in terms of the learned STEs to realize image-to-video recognition, which may also be developed into video-to-video recognition; a minimal sketch follows this list.
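The poster does not specify the probabilistic model, so the sketch below is an assumed form, not the paper's exact classifier: the posterior p(L(i)|x) is modeled by a softmax over nearest-keyframe distances in each NDE slice, and the K per-slice posteriors are fused by summing log-probabilities before the MAP decision. The function map_classify and its softmax model are illustrative stand-ins.

import numpy as np

def map_classify(x, W, ste):
    """x: (d,) probe frame; W: list of K (d, p) NDE projections Wk;
    ste[k][i]: (m, p) projected keyframes Tk(i) of gallery video V(i)."""
    n_gallery = len(ste[0])
    log_post = np.zeros(n_gallery)
    for k, Wk in enumerate(W):
        y = Wk.T @ x                                   # project the probe into slice k
        dist = np.array([np.linalg.norm(ste[k][i] - y, axis=1).min()
                         for i in range(n_gallery)])   # nearest-keyframe distance per gallery video
        post = np.exp(-dist)
        post /= post.sum()                             # softmax model of p(L(i)|x)
        log_post += np.log(post + 1e-12)               # probabilistic fusion across the K slices
    return int(np.argmax(log_post))                    # MAP decision

One natural video-to-video extension is to apply the same rule to every frame of a probe sequence and accumulate the log-posteriors before taking the argmax.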
Experimental results

Fig. 4. NDE versus LDA: the cumulative matching score is used as the performance measure.

• Keyframes learned by our method closely approximate those obtained by audio-video frame synchronization [13].
• NDE outperforms conventional LDA in most cases.
• Our recognition framework, integrating STEs and statistical classification, achieves the best recognition accuracy.

Recognition results

Approaches                                    | Recognition Rate | Approaches                                          | Recognition Rate
Nearest frame using unified subspace analysis | 79.3 %           | Mutual subspace                                     | 93.2 %
Nearest frame                                 | 81.7 %           | Temporal embeddings + multi-level subspace analysis | 98.0 %
Nearest frame using LDA                       | 90.9 %           | Spatio-temporal embeddings + statistical classifier | 99.3 %

If you have any question about this paper, please feel free to contact me (wliu5@ie.cuhk.edu.hk). Visit http://mmlab.ie.cuhk.edu.hk/~face/ for more information about my other works.