A Robust Technique for Motion-Based
Video Sequences Temporal Alignment
Cheng Lu, Member, IEEE, and Mrinal Mandal, Senior Member, IEEE
Abstract—In this paper, we propose a robust technique for temporal alignment of video sequences with similar planar motions acquired using uncalibrated cameras. In this technique, we model the
motion-based video temporal alignment problem as a spatio-temporal discrete trajectory point sets alignment problem. First, the
trajectory of the object of interest is tracked throughout the videos.
A probabilistic method is then developed to calculate the ‘soft’
spatial correspondence between the trajectory point sets. Next, a
dynamic time warping (DTW) technique is applied to the spatial
correspondence information to compute the temporal alignment of
the videos. The experimental results show that the proposed technique provides a superior performance over existing techniques for
videos with similar trajectory patterns.
Index Terms—Video alignment, video synchronization, temporal
registration, dynamic time warping, point sets alignment.
I. INTRODUCTION
The temporal alignment of video sequences is very important in computer vision applications such as human action
recognition [12], video synthesis, video mosaicing [25], superresolution imaging [4], 3D visualization [14], robust multi-view
surveillance [9], and human action temporal segmentation [24].
Most existing techniques focus on alignment of videos captured
from different cameras of the same scene or similar scenes with
overlapped background. However, there are some applications
that require synchronizing videos captured under significantly different scenes. For example, consider a computer-aided sport self-training application in which a learner wants to learn TaiChiQuan. The learner follows the instructions and captures a video sequence while practicing. After a training session, he can compare his postures with those of an expert by comparing the recorded videos. TaiChiQuan is typically performed in a relatively narrow space and involves complex
motions (example videos are shown in Section IV) which may
require a robust video alignment technique to bring two videos
into a common time axis. Note that in such scenario, the sport
videos are captured under widely different scenes.
In the literature, the video temporal alignment or temporal
registration can be broadly classified into two categories: video
sequences of the same scene and video sequences of similar
scenes. We present a brief review of the literature for these two
categories in the following.
In the case of video sequences captured from the same scene,
the videos record the identical motion/event from the same
environment. Lee et al. [9] proposed a video synchronization
technique, in which they estimated a time shift between two
sequences based on the alignment of centroids of moving objects in a planar scene. Dai et al. [17] used 3D phase correlation between
video sequences to calculate the synchronization. Tuytelaars et
al. [21] computed alignment by checking the rigidity of a set
of five (or more) points. Tresadern et al. [20] also followed a
similar approach of computing a rank-constraint based rigidity
measure between four pairs of non-rigidly moving feature
points. Caspi et al. [4], [5] proposed two kinds of techniques:
feature-based sequence alignment and direct-based sequence
alignment. In the feature-based sequence alignment technique,
they calculated the spatial and temporal relationship between
two sequences by minimizing the sum of square differences
(SSD) error over extracted trajectories that are visible in both
sequences. In the direct-based sequence alignment technique,
they computed the frame-to-frame correspondences by minimizing the SSD error over all frames in the video sequences.
Padua et al. [11] extended [5] to align sequences based on scene
points that need to be visible only in two consecutive frames.
Shrestha et al. [28] proposed several video synchronization
methods based on the audio and visual features. These audio
and visual features include flash light, audio fingerprints, and
audio onsets. All these techniques solve the alignment problem
where videos are captured at the same scene using uncalibrated
cameras.
When aligning video sequences of different scenes, albeit
sequences correlated via motion, one has to factor in the dynamic temporal scale of activities in the video sequences. Giese
and Poggio [18] approached alignment of activities of different
people by computing a dynamic time warp between the feature
trajectories. They did not consider the situation where the activity sequences are from varying viewpoints. Perperidis et al.
[19] attempted to locally warp cardiac MRI sequences, by extending Caspi’s work to incorporate spline based local alignment. Rao et al. [13] used rank constraint as the distance measure in the dynamic time warping (DTW) technique to align
human activity-based videos. This is the first work reported in
the literature that can deal with video sequences of correlated
activities in similar scenes. Singh et al. [15] proposed a video
synchronization technique to generate a high resolution MRI sequence by combining several low resolution MRI sequences.
This technique formulates a symmetric transfer error (STE) as
a functional of regularized temporal warp and selects the time
warp that has the smallest STE as the alignment result. Lu et
al. [8], [30] extended Singh’s technique, using unbiased bidirectional dynamic time warping (UBD), to calculate the optimal
warp for the alignment. These techniques were used to synchronize the MRI sequences captured from the same patient (doing a
certain activity), resulting in a super-resolution MRI sequence.
However, these techniques may not work well in cases where
the MRI sequences need to be synchronized between different
patients. Chudova et al. [29] developed a probabilistic model
for clustering and time-warping of multi-dimensional curves.
This model is able to learn the clusters of curves (with local
and global deformations in the underlying curve) using finite mixture models. However, the number of mixture components needs to be known in advance for the learning procedure, and the model focuses more on clustering different types of distorted curves. Li et al. [26] proposed
the alignment manifold and solved the spatio-temporal alignment problem in a pre-specified non-linear space. Zhou et al.
[27] proposed the canonical time warping (CTW) technique,
which combines the canonical correlation analysis and DTW
technique, for the human motion alignment.
In this paper, we propose a novel technique for motion-based
temporal alignment of video sequences which is able to deal
with the video captured from the same scene and different
scenes. The proposed technique formulates the motion-based
video temporal alignment problem as a spatio-temporal discrete
trajectory point sets alignment problem. First, the trajectory of the object of interest is tracked throughout the videos. Since imperfections in the feature extraction module introduce noise into the extracted trajectory, a probabilistic method is developed to calculate the ‘soft’ spatial correspondence between the trajectory point sets. Next, a dynamic time warping (DTW) technique is applied to the ‘soft’ spatial correspondence information to compute the temporal alignment of
the videos.
The advantages of the proposed technique are: (1) it does
not require overlapping views between videos to select corresponding feature points, i.e., it can be applied in videos
containing different scenes; (2) it is able to deal with situations
where videos contain complex dynamic object motion (e.g.,
long trajectory with several intersections) or noisy feature
trajectory with consistent performance.
The rest of this paper is organized as follows. Section II
presents the background information. The proposed technique is
presented in Section III. Performance evaluation of the proposed
technique is presented in Section IV, followed by the conclusions.
II. BACKGROUND
A. Video Synchronization Between Similar Scenes
Most video alignment techniques deal with video sequences of the same scene and hence assume a linear temporal relationship between the videos [4], [5], [11], such as $t_2 = \alpha t_1 + \Delta t$, where $\alpha$ is the ratio of frame rates and $\Delta t$ is a fixed translational offset. However, for applications such as video search, video comparison and human activity recognition [12], we need to align video sequences from two different scenes.
Fig. 1. Illustration of two different scenes acquired using two distinct cameras.
Fig. 2. Typical schematic of similar scene videos synchronization.
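As a minimal illustration (our own sketch; the frame-rate ratio and offset values are hypothetical, not taken from the paper), the linear model maps every frame index of one video to the other with a single pair of parameters, which is why it cannot represent a motion that slows down or speeds up within the sequence:

# Minimal sketch of the linear temporal model t2 = alpha * t1 + delta_t.
# alpha (frame-rate ratio) and delta_t (offset) are hypothetical values.
alpha, delta_t = 30.0 / 25.0, 12.0

def linear_map(t1: int) -> float:
    """Map a frame index of the first video to a frame index of the second."""
    return alpha * t1 + delta_t

print([round(linear_map(t), 1) for t in range(5)])  # 12.0, 13.2, 14.4, 15.6, 16.8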
Assume that two cameras C1 and C2 view two independent scenes of similar activities, as shown in Fig. 1. In Fig. 1, C1 (Camera 1) views 3D scene S1 in View 1 and acquires video V1. Similarly, C2 (Camera 2) views another 3D scene S2 in View 2 and acquires video V2. Note that the motions in these two scenes are similar but have a dynamic time shift, i.e., the linear time shift constraint assumed in most existing techniques (e.g., $t_2 = \alpha t_1 + \Delta t$) would not hold anymore.
A typical schematic for calculating temporal alignment of
similar scene videos is shown in Fig. 2. The RBC, STE, UBD,
and CTW techniques fall within this schematic. In order to correlate two video sequences and represent the motion between them, features are extracted and tracked in both video sequences. A robust view-invariant tracker is used to generate the feature trajectories. On their own, the feature trajectories are discrete representations of the motion in the scene.
The difference between the RBC, STE, UBD and CTW techniques lies in how the alignment is computed using DTW. The RBC technique uses the rank constraint of corresponding points in two videos to measure the similarity. This similarity measurement is then used as the distance measure in DTW to calculate the dynamic time alignment function. Though the RBC technique does
lead to good alignment of dynamic time-varying videos, the authors of [13] note that if feature points are close to each other,
their rank constraint will result in erroneous matching. The proposed technique does not suffer from this limitation. Furthermore, if the object of interest moves over a planar surface, and
if the fundamental matrix is singular, the RBC technique does
not work very well.
The STE technique projects the trajectory in one view into the
other view, and then calculates the symmetric alignment using
Euclidean distance based DTW. The technique determines the
time warp that has the smallest STE as the final alignment.
The UBD technique utilizes the symmetric alignment as the
global constraint, and calculates the optimal warp using DTW.
The CTW technique combines the canonical correlation analysis and DTW technique, and is able to capture the local spatial
deformations between two trajectories. It is noted that due to
the imperfection of the feature extraction module and the natural spatial and temporal variations of the motions within the
different videos, the extracted feature trajectories are usually
noisy and complex. When computing the temporal alignment
using the DTW, most existing techniques use Euclidean distance as the spatial deformation measurement. The unavoidable
noise within the extracted trajectories generally degrades the performance of the above-mentioned techniques. In addition, the
RBC, STE and UBD techniques suffer from a common drawback that they require overlapping views between the first pair
of frames in videos in order to specify enough corresponding
feature points. In the experimental section, i.e., Section IV, we
compare the STE, UBD, CTW techniques with the proposed
technique. It is shown that the proposed technique is able to provide robust performance compared to the existing techniques.
Fig. 3. The schematic of the proposed technique.
B. Point Sets Alignment
Point sets alignment is an important technique for solving
computer vision problems. Besl and McKay [2] proposed the iterative closest point (ICP) technique for 3D shape registration.
The ICP technique computes the correspondence between two
point sets based on a distance criterion. Other techniques were developed using soft-assignment of correspondences between two point sets instead of the binary assignment in the ICP technique [6], [10]. Such probabilistic techniques perform better than the ICP technique in cases where noise and outliers are present. Point set alignment techniques have been used successfully in stereo matching, shape recognition and image registration. Although they have not previously been used for video synchronization, they have the potential to achieve good synchronization performance.
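To make the distinction concrete, the following sketch (an illustrative Python example with hypothetical 2-D point sets, not code from any of the cited techniques) contrasts the binary nearest-neighbour assignment used by ICP with a Gaussian 'soft' assignment in which every pairing receives a probability:

import numpy as np

# Two hypothetical 2-D point sets (rows are points).
X = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.1]])   # model points
Y = np.array([[0.1, 0.0], [1.9, 0.0]])                # data points

# Pairwise squared distances, shape (len(Y), len(X)).
d2 = ((Y[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)

# Hard (ICP-style) assignment: each data point matches its nearest model point.
hard = d2.argmin(axis=1)

# Soft assignment: Gaussian posterior over model points for each data point.
sigma2 = 0.5
w = np.exp(-d2 / (2.0 * sigma2))
soft = w / w.sum(axis=1, keepdims=True)   # rows sum to one

print(hard)          # e.g. [0 2]
print(soft.round(2)) # probabilistic correspondences

With the soft assignment, an ambiguous data point spreads its weight over several nearby model points instead of committing to a single, possibly wrong, match.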
III. PROPOSED TECHNIQUE
In order to overcome the limitations of existing techniques,
we propose a robust technique for motion-based video temporal alignment. The proposed technique does not require correspondence points from an overlapped background for the spatial relationship estimation, which allows it to handle temporal alignment of video sequences of totally different scenes. The assumption of the proposed technique is that the patterns of the extracted trajectories in the videos are similar. This is a general assumption in practice. For example, Fig. 4(a) and (b) show the trajectory of the motion in videos V1 and V2, respectively. Note that, in this example, the motions in these two videos are similar but differ in dynamic time shift and camera viewpoint. The numbers in Fig. 4 represent the frame number, while the stars and circles represent the location of the object of interest at a certain frame number. The total numbers of frames in videos V1 and V2 are 30 and 40, respectively. Note that the patterns of these trajectories are similar even though they are captured from different views and with a dynamic time shift. For more examples of real videos with similar trajectory patterns, please refer to Section IV.B and the supplemental material [23].
Fig. 4. An example of the trajectory of the motion (object of interest) in two video sequences: (a) the motion trajectory in video V1; (b) the motion trajectory in video V2.
The schematic of the proposed technique is shown in Fig. 3.
The technique consists of two main modules: extraction of
feature trajectories and temporal alignment estimation. The
temporal alignment estimation module has two steps: computation of the spatial correspondences of the trajectory points and
computation of the temporal correspondences of the trajectory
points. In the computation of the spatial correspondences, we
model the spatial correspondence via affine transformation.
Note that the two similar trajectory patterns may not be fully
modeled by the affine transformation since the motions captured in the video are dynamically changing and the view
points of the cameras are different. In addition, noise produced by imperfections of the tracking algorithm and local discrepancies may exist between the trajectories. Examples
are shown in Fig. 7(b) and (c), and in Fig. 8(a) and (b). In
order to tackle such problems, we recover the spatial relationship by computing ‘soft’ correspondence via a probabilistic
model. The potentially inaccurate spatial correspondences are then rectified in the temporal correspondence computation by imposing a time constraint via the DTW technique. Note that the final spatio-temporal correspondence information is actually the temporal alignment information between the videos. The proposed temporal alignment estimation technique is presented in detail in the following sections. Note that for simplicity, we
introduce the proposed technique by assuming that the number
of videos to be synchronized is two.
A. Extraction of Feature Trajectories

This module calculates the trajectory of an object of interest throughout the video. Let T1 and T2 denote the trajectories obtained from videos V1 and V2, respectively. Techniques such as the mean shift tracker [3] can be used to generate the trajectory. Fig. 4 shows an example of the object of interest tracked in two videos. Let X and Y denote the discrete trajectory point sets for trajectories T1 and T2, respectively. X and Y are defined as follows:

$$\mathbf{X} = \{\mathbf{x}_n\}_{n=1}^{N}, \qquad \mathbf{x}_n = [\,a_n \;\; b_n \;\; 1\,]^{T} \qquad (1)$$

$$\mathbf{Y} = \{\mathbf{y}_m\}_{m=1}^{M}, \qquad \mathbf{y}_m = [\,u_m \;\; v_m \;\; 1\,]^{T} \qquad (2)$$

Note that N and M in (1) and (2) are the total numbers of points in trajectories T1 and T2, respectively. The variables (a_n, b_n) and (u_m, v_m) represent the coordinates of the nth and mth trajectory points in trajectories T1 and T2, respectively. Each discrete point in a trajectory is written in homogeneous form. The indices of the points also indicate the temporal information, i.e., the frame index in the two videos.

B. Compute Spatial Correspondence of Trajectory Points

In this section, we aim to calculate the spatial correspondence of the trajectory point sets. The estimated spatial point correspondence will then be used for the computation of the temporal correspondence in the next section.

The spatial correspondences of the two trajectory point sets are modeled by an affine transformation. The affine transformation consists of a linear transformation followed by a translation [7], and a pair of corresponding trajectory points satisfies the following equation:

$$\mathbf{y}_m = \mathcal{T}(\mathbf{x}_n) = \mathbf{B}\,\mathbf{x}_n + \mathbf{t} \qquad (3)$$

where B denotes the linear transformation, t the translation, and T(.) the resulting affine mapping applied to the point coordinates.

In order to compute the soft spatial correspondences between the two trajectory point sets, we treat trajectory point set X as the centroids of a Gaussian mixture model (GMM) [1], and assume the other trajectory point set Y as the data points generated by the GMM independently given the knowledge of X. We then re-parameterize the GMM centroids to the data points by maximizing the log-likelihood function. The probability of the nth Gaussian component generating data point y_m, i.e., the posterior probability P(n | y_m), is the soft correspondence we are looking for. The conditional Gaussian probability density function is defined as follows:

$$p(\mathbf{y}_m \mid n) = \frac{1}{2\pi\sigma^{2}}\exp\!\left(-\frac{\lVert \mathbf{y}_m - \mathcal{T}(\mathbf{x}_n)\rVert^{2}}{2\sigma^{2}}\right) \qquad (4)$$

where θ = {B, t, σ²} represents the parameter set. In this paper we assume that the GMM components have identical isotropic covariance σ²I and that the prior term P(n) is equal to 1/N for all n. The GMM probability density function can then be expressed as follows:

$$p(\mathbf{y}_m) = \sum_{n=1}^{N} P(n)\,p(\mathbf{y}_m \mid n) = \frac{1}{N}\sum_{n=1}^{N} p(\mathbf{y}_m \mid n) \qquad (5)$$

The log-likelihood function of the GMM probability density function can be calculated using the following equation:

$$L(\theta) = \sum_{m=1}^{M}\log\sum_{n=1}^{N} P(n)\,p(\mathbf{y}_m \mid n) \qquad (6)$$

We now model the problem as seeking the parameters for which the log-likelihood reaches its maximum. Note that finding the maximum likelihood with respect to the parameters using (6) is difficult, since we cannot find a closed-form solution for it. The Expectation Maximization (EM) algorithm [1] is therefore used to estimate the parameters. The EM algorithm focuses on finding maximum log-likelihood solutions posed in terms of missing data. In our case, the missing data is actually the point-to-centroid correspondence P(n | y_m), i.e., the posterior of the GMM centroid x_n given data point y_m. By introducing the posterior P(n | y_m), the EM algorithm estimates the parameters in an iterative framework. The revised log-likelihood function of the GMM probability density function in the EM algorithm is defined as follows [1]:

$$Q(\theta) = \sum_{m=1}^{M}\sum_{n=1}^{N} P^{\mathrm{old}}(n \mid \mathbf{y}_m)\,\log\!\big(P(n)\,p(\mathbf{y}_m \mid n)\big) \qquad (7)$$

The EM algorithm estimates θ and P(n | y_m) iteratively in a two-step manner. In the first step (E-step), the posterior is estimated using Bayes rule as follows:

$$P(n \mid \mathbf{y}_m) = \frac{\exp\!\big(-\lVert \mathbf{y}_m - \mathcal{T}^{(k)}(\mathbf{x}_n)\rVert^{2}/2\sigma^{2(k)}\big)}{\sum_{j=1}^{N}\exp\!\big(-\lVert \mathbf{y}_m - \mathcal{T}^{(k)}(\mathbf{x}_j)\rVert^{2}/2\sigma^{2(k)}\big)} \qquad (8)$$

where θ^(k) = {B^(k), t^(k), σ^2(k)} is the parameter set estimated at the kth iteration. Note that the posterior P(n | y_m) is proportional to the likelihood function p(y_m | n). To avoid ambiguity, we denote a variable with a superscript within parentheses as the kth version of that variable. For example, B^(k) represents the kth version of B, and σ^2(k) represents the kth version of σ² in the iterative framework.

In the second step (M-step), we estimate θ by maximizing Q(θ) in (7). The EM algorithm iterates these two steps (i.e., E-step and M-step) until Q(θ) in (7) converges. The final version of the estimated posterior probability P(n | y_m) is the spatial correspondence between the two point sets that we are interested in. Substituting (4) into (7) and ignoring the constant terms, we have the log-likelihood function in the M-step as follows:

$$Q(\mathbf{B},\mathbf{t},\sigma^{2}) = -\frac{1}{2\sigma^{2}}\sum_{m=1}^{M}\sum_{n=1}^{N} P(n \mid \mathbf{y}_m)\,\lVert \mathbf{y}_m - \mathbf{B}\mathbf{x}_n - \mathbf{t}\rVert^{2} - N_P\log\sigma^{2} \qquad (9)$$

where N_P is the sum of all posteriors P(n | y_m). It can be shown that the solution for maximizing (9) at the kth M-step with respect to B, t and σ² is as shown in (10) and (11):

$$\mathbf{B} = \big(\hat{\mathbf{Y}}^{T}\mathbf{P}^{T}\hat{\mathbf{X}}\big)\big(\hat{\mathbf{X}}^{T}d(\mathbf{P}\mathbf{1})\hat{\mathbf{X}}\big)^{-1}, \qquad \mathbf{t} = \boldsymbol{\mu}_{y} - \mathbf{B}\boldsymbol{\mu}_{x} \qquad (10)$$

$$\sigma^{2} = \frac{1}{2N_P}\Big(\mathrm{Tr}\big(\hat{\mathbf{Y}}^{T}d(\mathbf{P}^{T}\mathbf{1})\hat{\mathbf{Y}}\big) - \mathrm{Tr}\big(\hat{\mathbf{Y}}^{T}\mathbf{P}^{T}\hat{\mathbf{X}}\mathbf{B}^{T}\big)\Big) \qquad (11)$$

where 1 is the column vector of all ones, d(v) is the diagonal matrix formed from the vector v, Tr is the trace operation for a matrix, X and Y are here treated as the N x 2 and M x 2 matrices of point coordinates, μ_x = (1/N_P) X^T P 1 and μ_y = (1/N_P) Y^T P^T 1 are their weighted means, and X̂ and Ŷ are the point sets centered at μ_x and μ_y, respectively. The spatial point correspondence matrix P is given by (12):

$$\mathbf{P} = \begin{bmatrix} P(1 \mid \mathbf{y}_1) & \cdots & P(1 \mid \mathbf{y}_M)\\ \vdots & \ddots & \vdots\\ P(N \mid \mathbf{y}_1) & \cdots & P(N \mid \mathbf{y}_M) \end{bmatrix} \qquad (12)$$

Note that each element in P is computed using (8). The EM algorithm for calculating the point-to-centroid correspondence in our problem is summarized in Table I.

TABLE I
PSEUDO CODE FOR CALCULATION OF SPATIAL POINT-TO-CENTROID CORRESPONDENCE USING EM ALGORITHM
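As an illustration of this EM procedure, the following Python sketch (our own example under the simplifying assumptions stated above: uniform priors, identical isotropic covariance, 2-D coordinates and an identity initialization; it is not the authors' implementation or the pseudo code of Table I) alternates the E-step of (8) with a weighted least-squares refit of the affine parameters, which is equivalent to the M-step updates (10) and (11):

import numpy as np

def soft_correspondence(X, Y, n_iter=50):
    """EM sketch: X (N, 2) are the GMM centroids, Y (M, 2) the data points.
    Returns the posterior matrix P (N, M) and the affine parameters (B, t)."""
    N, M = len(X), len(Y)
    B, t = np.eye(2), np.zeros(2)                       # identity initialization
    sigma2 = ((Y[:, None, :] - X[None, :, :]) ** 2).sum() / (2.0 * N * M)
    for _ in range(n_iter):
        # E-step (8): posterior of centroid n given data point m, uniform priors 1/N.
        TX = X @ B.T + t                                # transformed centroids, (N, 2)
        d2 = ((Y[:, None, :] - TX[None, :, :]) ** 2).sum(axis=2)   # (M, N)
        W = np.exp(-d2 / (2.0 * sigma2)) + 1e-12
        P = (W / W.sum(axis=1, keepdims=True)).T        # (N, M), columns sum to one
        # M-step: weighted least-squares refit of B, t and sigma2.
        Np = P.sum()
        mu_x = P.sum(axis=1) @ X / Np                   # weighted mean of the centroids
        mu_y = P.sum(axis=0) @ Y / Np                   # weighted mean of the data
        Xc, Yc = X - mu_x, Y - mu_y
        B = (Yc.T @ P.T @ Xc) @ np.linalg.inv(Xc.T @ np.diag(P.sum(axis=1)) @ Xc)
        t = mu_y - B @ mu_x
        resid = ((Y[:, None, :] - (X @ B.T + t)[None, :, :]) ** 2).sum(axis=2)
        sigma2 = max((P.T * resid).sum() / (2.0 * Np), 1e-8)
    return P, B, t

The returned matrix P corresponds to (12); for clean trajectories its columns peak sharply at the true spatial correspondences, while for noisy trajectories the probability mass spreads over neighbouring points.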
C. Compute Temporal Correspondence of Trajectory Points

In the previous section, the spatial correspondence module estimated the spatial point correspondence matrix P. In this section, we present the procedure to calculate the exact correspondence using P.

In a general point set alignment problem, for each point y_m in Y, the exact corresponding point can be determined by choosing the nth point in X such that the posterior is the maximum, i.e.,

$$n^{*} = \arg\max_{n} P(n \mid \mathbf{y}_m) \qquad (13)$$

However, in our video synchronization problem, this method may incur errors in matching if the feature trajectory itself has intersections at different time instances or if the trajectory is noisy. This is illustrated in Fig. 5. Note that in Fig. 5(a) and (b), the horizontal and vertical axes represent the index of points in the point set X and point set Y, respectively. In this example, we generated a sequence of 100 frames in one view. In the second view, we generated the same sequence but with a time warp, i.e., slowing down the motion by a factor of five in the range of frame 55 to frame 70. This results in 160 frames for the second sequence. The solid dots indicate the correspondences determined by (13). However, note that the ground truth correspondences are shown as the dotted line in Fig. 5(b). Fig. 5(a) shows that many of the correspondences are not correct. For example, the correspondences encircled by the dotted-line circle are not correct. This correspondence shows that the 7th to 13th points in the X point set (horizontal axis) should match the 131st to 143rd points in the Y point set (vertical axis). However, as per the ground truth, the correspondence should lie between
the 5th to 15th points in the Y point set (vertical axis). We can observe other incorrect correspondences by comparing with the ground truth correspondences in Fig. 5(a).

Fig. 5. Alignment obtained (a) only by spatial correspondence, and (b) after using the DTW. The dotted line represents the ground truth. Note that the result obtained by the proposed technique and the ground truth almost overlap.

In order to solve the problem stated above, we propose to utilize the temporal constraint on the trajectory points and compute the actual correspondence using the DTW technique based on the obtained spatial point correspondence matrix P. Denote the temporal alignment between two trajectories as the warp. We construct the warp W as follows:

$$W = \{w_1, w_2, \ldots, w_K\}, \qquad w_k = (i_k, j_k), \qquad \max(I, J) \le K \le I + J - 1 \qquad (14)$$

where I and J are the lengths of trajectories T1 and T2 obtained from videos V1 and V2, respectively. The kth element of the warp is w_k = (i_k, j_k). The warp satisfies the boundary conditions, continuity conditions and monotonicity conditions which are explained in [16]. The traditional DTW technique computes the warping based on the Euclidean distance between two sequences and chooses the warp which has the minimum accumulated distance as the optimal warp. The proposed technique computes the warp based on the probabilistic values obtained from the spatial point correspondence matrix P, and chooses the warp which has the maximum accumulated probability as the optimal warp. The accumulated probability of a warp is defined as follows:

$$\Phi(W) = \sum_{k=1}^{K} P(i_k \mid \mathbf{y}_{j_k}) \qquad (15)$$

where P(i_k | y_{j_k}) is the posterior value of the given indices (i_k, j_k) in the kth element of the warp and can be obtained from the pre-computed spatial point correspondence matrix P.

The optimal warp is calculated in a two-step procedure which is explained below:
i) In order to find the optimal warp, an I x J accumulated probability matrix D is created. The element in the accumulated probability matrix is calculated as follows:

$$D(i, j) = P(i \mid \mathbf{y}_j) + \max\{D(i-1, j),\; D(i, j-1),\; D(i-1, j-1)\} \qquad (16)$$

ii) A greedy search technique is then employed in the accumulated probability matrix to find the optimal warp W* such that Φ(W*) is maximum. The above method employs the dynamic programming algorithm to obtain an optimal warp [16].

Fig. 5(b) shows the alignment obtained using the DTW on the spatial point correspondence matrix P. Note that with the imposed continuity conditions and monotonicity conditions, we obtain an accurate alignment compared to Fig. 5(a). Note that the results obtained by the proposed technique and the ground truth are almost identical.
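As an illustration of this dynamic-programming step, the sketch below (our own example, not the authors' code) builds the accumulated probability matrix of (16) from a pre-computed correspondence matrix P and backtracks the warp with the maximum accumulated probability, which by construction satisfies the boundary, continuity and monotonicity conditions of [16]:

import numpy as np

def dtw_max_probability(P):
    """P[n, m]: soft correspondence between point n of one trajectory and
    point m of the other. Returns the warp as a list of (n, m) index pairs."""
    N, M = P.shape
    D = np.full((N, M), -np.inf)           # accumulated probability matrix
    D[0, 0] = P[0, 0]
    for n in range(N):
        for m in range(M):
            if n == 0 and m == 0:
                continue
            prev = [D[n - 1, m] if n > 0 else -np.inf,                     # step in one sequence
                    D[n, m - 1] if m > 0 else -np.inf,                     # step in the other
                    D[n - 1, m - 1] if n > 0 and m > 0 else -np.inf]       # diagonal step
            D[n, m] = P[n, m] + max(prev)
    # Backtracking from (N-1, M-1) to (0, 0) recovers the optimal warp.
    n, m, warp = N - 1, M - 1, [(N - 1, M - 1)]
    while (n, m) != (0, 0):
        candidates = [(n - 1, m), (n, m - 1), (n - 1, m - 1)]
        n, m = max((c for c in candidates if c[0] >= 0 and c[1] >= 0),
                   key=lambda c: D[c])
        warp.append((n, m))
    return warp[::-1]

Here P would be the N x M matrix estimated in Section III-B; the recovered index pairs directly give the frame correspondences between the two videos.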
IV. PERFORMANCE EVALUATIONS
In this section, we evaluate the proposed technique using
both synthetic trajectories and real image sequences. To compare the proposed technique with existing techniques, we have
also implemented the STE technique [15], UBD technique [8]
and CTW technique [27] that deal with aligning videos of similar motions. Our test cases and results are presented below.
A. Synthetic Data Evaluation
For the synthetic data evaluation, we generate a 100-frame-long complex planar trajectory, using a pseudo-random number generator to simulate the motion in videos. The trajectory is then projected onto two image planes using user-defined camera projection matrices. In order to simulate the dynamic time shift, 5 trajectory points in one of the trajectories were interpolated to obtain 65 trajectory points. We refer to this trajectory as the time-warped trajectory; its length is 160 frames, which is different from that of the planar trajectory. The STE, UBD, CTW and the proposed techniques are then applied to the synthetic trajectories to compute the alignment between them. This process is repeated on 50 different synthetic trajectories.
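The synthetic protocol can be approximated in a few lines; the sketch below (our own reconstruction with hypothetical camera projection matrices, and using the warp of the Fig. 5 example, i.e., slowing frames 55 to 70 down by a factor of five to obtain 160 frames) generates a planar trajectory, projects it onto two image planes, time-warps one projection, and optionally adds Gaussian noise:

import numpy as np

rng = np.random.default_rng(0)

# Complex planar trajectory of 100 frames (random walk in the plane, z = 0).
traj = np.cumsum(rng.normal(scale=0.5, size=(100, 2)), axis=0)
traj_h = np.hstack([traj, np.zeros((100, 1)), np.ones((100, 1))])   # homogeneous 3-D points

# Two hypothetical 3x4 camera projection matrices.
P1 = np.array([[800., 0., 320., 10.], [0., 800., 240., 0.], [0., 0., 1., 50.]])
P2 = np.array([[780., 30., 300., -5.], [-20., 790., 250., 8.], [0., 0., 1., 60.]])

def project(P, pts_h):
    p = pts_h @ P.T
    return p[:, :2] / p[:, 2:3]             # perspective division

view1 = project(P1, traj_h)                  # first image-plane trajectory
view2 = project(P2, traj_h)

# Simulate a dynamic time shift: slow frames 55-70 of view2 down by a factor
# of five via linear interpolation, giving a 160-frame time-warped trajectory.
idx = np.concatenate([np.arange(0, 55), np.linspace(55, 69, 75), np.arange(70, 100)])
view2_warped = np.stack([np.interp(idx, np.arange(100), view2[:, k]) for k in range(2)], axis=1)

# Optional zero-mean measurement noise (variance 0.1) on both trajectories.
view1_noisy = view1 + rng.normal(scale=np.sqrt(0.1), size=view1.shape)
view2_noisy = view2_warped + rng.normal(scale=np.sqrt(0.1), size=view2_warped.shape)

The STE, UBD, CTW and proposed techniques would then be run on view1 (or view1_noisy) and view2_warped (or view2_noisy) to compute and score the alignment.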
The results of alignment of smooth and noisy trajectories
with the STE, UBD, CTW techniques and the proposed technique are shown in Table II, where the mean absolute difference between the actual and computed frame correspondences
is reported as the alignment error. The percentage numbers in
the last column show the improvement with respect to the CTW
technique. Note that in the case of noise free trajectory, the CTW
and the proposed technique provide comparable performance,
whereas the STE and the UBD techniques have relatively larger
alignment error. It is noted that with the introduction of noise, the performance of the existing techniques decreases significantly, while the proposed technique provides consistent performance and outperforms the other existing techniques.

TABLE II
ALIGNMENT ERRORS (IN TERMS OF FRAMES) OF STE, UBD, CTW AND PROPOSED TECHNIQUE FOR SYNTHETIC DATA. THE LAST COLUMN SHOWS THE IMPROVEMENT OBTAINED BY THE PROPOSED TECHNIQUE OVER THE CTW TECHNIQUE

Fig. 6. An example of alignments for a noise-free synthetic trajectory using the proposed, STE, UBD, and CTW techniques. (a) illustrates the 3D camera and trajectory configuration, where two rectangles A and B represent the positions of the two cameras which record the motions. (b) and (c) show the projected trajectories on two image planes, respectively. Note that the interpolated area in (c) is the warped part which simulates the dynamic time shift between two similar motions. (d) shows the alignment results obtained by the STE, UBD, and CTW techniques, where CTW provides the best performance. (e) shows the alignment results obtained by the proposed and the CTW techniques. A zoomed-in version of the alignment result is presented in a small window. The ground truth alignment is shown as a solid line.

Fig. 7. An example of alignments for a noisy synthetic trajectory using the proposed, STE, UBD, and CTW techniques. (a) is the 3D configuration of the cameras and the noise-free trajectory. (b) and (c) are the projected noisy trajectory and the projected noisy time-warped trajectory on two different image planes, respectively. Note that the interpolated area in (c) is the warped part which simulates the dynamic time shift between two similar motions. (d) shows the alignment results obtained by the STE, UBD, and CTW techniques, where CTW provides the best performance. (e) shows the alignment results obtained by the proposed and the CTW techniques. A zoomed-in version of the alignment result is presented in a small window.
Fig. 6 shows a synthetic data example where the noise level is
zero. Note that the performance closer to the ground truth indicates higher accuracy for alignment. For better illustration, we
compare the existing techniques in Fig. 6(d). We then chose the
best existing technique, i.e., CTW technique in this case, and
compare it with the proposed technique in Fig. 6(e). Note that
a zoomed version of the alignment result is shown in a small
window. The ground truth alignment is shown as solid line.
In this case, the alignment errors are 2.34, 2.77, 0.44 and 0.38
frames for STE, UBD, CTW and the proposed technique, respectively. It is observed that CTW and the proposed technique
provide the best and comparable performance in this noise-free
case.
In a practical situation, the feature trajectory is usually imperfect and contains noise. We therefore also evaluated the effect of noisy trajectories on the proposed technique. Normally distributed, zero-mean noise with various values of variance σ² was added to the synthetic feature trajectories. A synthetic noisy trajectory example is shown in Fig. 7. Normally distributed noise, with mean equal to 0 and variance equal to 0.1, is added to both the x and y coordinates, and the resulting trajectories in the two image planes are shown in Fig. 7(b) and (c).
Fig. 8. The motion trajectory in UCF videos and the computed alignment results with ground truth. (a) and (b) show the two feature trajectories obtained
from the two videos, respectively. (c) is the alignment result comparison.
From the comparisons shown in Fig. 7(d), it is clear that the
CTW technique provides the best alignment compared to other
existing techniques. In Fig. 7(e), it is shown that the proposed
technique provides better performance than that of the CTW
technique. In this case, the alignment errors are 3.03, 2.60, 0.93,
and 0.51 frames for STE, UBD, CTW and the proposed technique, respectively.
Fig. 9. Visual comparison of the alignments computed using the STE, the UBD, the CTW and the proposed technique. The first row shows the 59th, 63rd, 67th, and 71st frames of video UCF2. The second to the fifth rows show the aligned frames obtained using the STE, the UBD, the CTW and the proposed technique, respectively.
B. Real Data Evaluation
In this section, we evaluate the proposed technique on ten
pairs of real video data. The video data can be divided into two
categories: videos captured under the same scene and videos
captured under different scenes. Note that the ground truth is
not available for all pairs of videos. In order to evaluate the performance of the proposed technique and compare it with other
techniques, we manually chose several correspondences of key
frames as the ground truth. The key frame correspondences are
selected based on the distinct motion connection points. For example, in the action of “open the drawer”, the key frames are the
frames where the hand is put on the handle of the drawer, or the
drawer is completely closed. In the action of “cup lifting”, the
key frames are the frames where the cup is lifted on the top, or
the cup is put on another cabinet. In the action of “ball playing”,
the key frames are the frames where the ball is on the left/right
hand, or the ball is at its highest point. In order to reduce bias, the ground truth is carefully selected by three people, and the average of the three selections is taken as the final ground truth. The average alignment error is then calculated as the mean absolute difference between the computed frame correspondences and the ground truth.
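As a small numerical illustration of this protocol (the key-frame indices below are hypothetical, not values from the experiments), the ground truth for each key frame is the average of the three annotators' selections, and the reported error is the mean absolute difference between the computed and ground-truth correspondences:

import numpy as np

# Hypothetical key-frame correspondences chosen by three annotators
# (frame index in the second video for four key frames of the first video).
annotators = np.array([[12, 45, 88, 130],
                       [14, 44, 90, 131],
                       [13, 46, 89, 132]])
ground_truth = annotators.mean(axis=0)          # averaged ground truth

# Hypothetical correspondences produced by an alignment technique.
computed = np.array([15, 47, 86, 135])

alignment_error = np.abs(computed - ground_truth).mean()
print(alignment_error)   # mean absolute difference in frames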
1) Videos Captured Under the Same Scene: For evaluation
with real video sequences which were captured in the same
scene, we first used video sequences provided by Rao et al. [22]
(we refer to these videos as the UCF videos). Feature (location of hand)
trajectories are available for the UCF video files [22]. Note that
in this case, there is overlapped background which enables the
previous techniques, i.e., the STE, and the UBD techniques, to
estimate the spatial relationship by using a sufficient number of corresponding points. To reduce the error induced by automatic corresponding point selection, the corresponding points
are selected manually. Note that the proposed technique and the
CTW technique do not need to use the corresponding points
from the overlapped background. The proposed technique, STE,
UBD and the CTW techniques were applied to this pair of real
videos.
The two UCF test videos cab2-4.1.mpg (UCF1) and
cab2-4.4.mpg (UCF2) recorded the action of opening a
drawer. These videos are 84 and 174 frames long, respectively.
The trajectories of these two actions are shown in Fig. 8(a) and
(b). Fig. 8(c) shows the frame alignment obtained using the
STE, UBD, CTW and the proposed techniques, as well as the
ground truth. The horizontal axis and vertical axis represent the
indices of the frames of videos UCF1 and UCF2, respectively.
It is observed that the proposed technique provides the best
alignment results (i.e., closest to the ground truth) compared to
other techniques. Note that the two trajectories in the videos have similar patterns but do not exactly follow an affine transformation. Even though we model the spatial relationship with an affine transformation (see (3)), with the help of the soft correspondence and the time constraint used in DTW, the proposed technique can still provide robust and satisfactory performance. For the other existing techniques, since hard correspondence is used for the spatial relationship estimation and the Euclidean distance is used in the DTW, it is difficult to handle the noise and spatial trajectory distortions effectively, which leads to higher alignment errors.
We now present subjective evaluation of the alignment result.
Fig. 9 shows four representative frames (frame# 59, 63, 67 and
71) of video UCF2 in the first row. The correspondence frames
computed for these four frames using the STE, UBD, CTW and
the proposed techniques are shown in the second, third, fourth
and the fifth rows of Fig. 9, respectively. Because the object of
interest is the hand in the video, we can examine the relative position of the hand along the entire trajectory. If the computed temporally aligned frame is matched, i.e., the relative hand position in the computed temporally aligned frame is the same as that in the reference frame of the other video, we mark “matched” under the frame; otherwise, “mismatched”. Among the existing techniques, the results obtained by the CTW technique exhibit a degenerate behavior, i.e., many frames in one video are matched to a single frame in the other video. The other existing techniques provide better results, but still contain mismatches. It is shown that the proposed technique provides superior performance compared to the other techniques in this real video case.
Besides the UCF videos, we have used three additional pairs
of videos capturing the coffee cup lifting motion under the
same scene, named Cuplifting1-3. The alignment comparisons
are shown in Fig. 10. It is observed that the proposed technique
provides better alignment performance than other techniques.
The average alignment errors with respect to the ground truth
of the STE, UBD, CTW and the proposed techniques are
summarized in Table III. Note that the proposed technique has
outperformed other existing techniques for the test videos.
2) Videos Captured Under Different Scenes: In this section
we present the performance of the proposed technique on videos
with significantly different scenes. Note that the videos were
captured from different views, and the similar actions were performed by different people in different scenes. In these cases,
the STE and UBD techniques cannot be applied as there are no
overlapping backgrounds between the two videos. Therefore,
we compare the performance of the proposed technique with
the CTW technique.
We first evaluate the CTW and the proposed technique on
the videos where the coffee cup lifting action is captured. Four pairs of videos with four different people at two different scenes are evaluated. These pairs of videos are named DCupLifting1-4 in this paper. The performance obtained by the CTW and the proposed technique is shown in Table IV. We use DCupLifting1 for visual evaluation, as shown in Fig. 11.

Fig. 10. Performance comparisons on the Cuplifting videos. (a) Cuplifting1; (b) Cuplifting2; (c) Cuplifting3. The horizontal axis and vertical axis represent the indices of the frames of the first and second video, respectively.

TABLE III
ALIGNMENT ERROR COMPUTED WITH RESPECT TO THE GROUND TRUTH ON VIDEO PAIRS CAPTURED UNDER THE SAME SCENE

Fig. 11. The alignment results of the real video pair (DCupLifting1) obtained by the proposed technique. (a) shows the alignment obtained by the proposed technique. (b) presents the visual evaluation of the alignments. The first row shows the 43rd, 107th, 189th, and 218th frames of the first video of DCupLifting1. The second row and the third row show the computed corresponding frames using the CTW and the proposed technique, respectively.

TABLE IV
ALIGNMENT ERRORS COMPUTED WITH RESPECT TO THE GROUND TRUTH ON VIDEOS CAPTURED UNDER DIFFERENT SCENES
In this example (DCupLifting1), the first video contains 221
frames while the second video contains 207 frames. Note that
the similar action, i.e., lifting the coffee cup, is performed by
two different people under different scenes with different motion speed. The alignments computed using the CTW and the
proposed technique are shown in Fig. 11(a). The ground truths
of the key frame correspondences are indicated as the circle
bars. In Fig. 11(b), the first row shows the 43rd, 107th, 189th,
and 218th frame of the first video of DCupLifting1. The second
and the third rows show the corresponding frames computed
using the CTW and the proposed technique, respectively. The
description, e.g., matched, below the frames indicates if the two
computed corresponding frames are matched or not. It is clear
that the proposed technique provides a better performance.
Fig. 12. The alignment results of the real video pair (TaiChiQuan) obtained by the proposed technique. (a) shows the alignment obtained by the proposed technique. (b) The first row shows the 102nd, 160th, 249th, and 335th frames of the first video of the TaiChiQuan pair. The second row and the third row show the computed corresponding frames using the CTW and the proposed technique, respectively.
The second evaluation is on three pairs of videos with ball throwing motion (videos BallPlaying1-3). The performance obtained by the proposed technique is shown in Table IV. The third evaluation is on one pair of videos (TaiChiQuan) capturing the complex motion of the TaiChiQuan sport. The first
video contains 414 frames while the second video contains 247
frames. The left hand of a player is being tracked throughout
the two videos. Fig. 12 shows the alignment results as well as
the representative frames correspondences computed by the
CTW and the proposed technique. It is observed in Fig. 12(b)
that the representative frame correspondences obtained using
the proposed technique are more accurate compared to those
obtained using the CTW technique. Table IV summarizes the
performance of all test videos. It is clear that the proposed
technique provides better performance compared to the CTW
technique in terms of the key frame correspondence computation. It can be inferred that the proposed technique is able to
provide robust synchronization of the videos captured under
different scenes where similar approximated planar motions
are present.
C. Execution Time
The experiments were run on a computer with a 2.4 GHz Intel Core 2 Duo CPU and 3 GB RAM, and the techniques were implemented using MATLAB 7.04. Excluding the preprocessing time (i.e., the object of interest tracking time), the execution times of the STE, UBD, CTW and proposed techniques for the synthetic trajectories are summarized in Table V.

TABLE V
EXECUTION TIME (IN SECONDS) OF THE STE, UBD, CTW AND PROPOSED TECHNIQUE FOR THE ENTIRE SEQUENCE

When the lengths of the two trajectories are 100 and 160 frames, the execution time of the proposed technique is shorter than that of the STE technique, comparable to that of the UBD technique, and longer than that of the CTW technique.
Note that the execution time is expected to be proportional to
the length of the video being processed.
D. Discussion
For the STE and the UBD techniques, after the mutual projections of the trajectories onto different views, they rely on the Euclidean distance between the two trajectories. The imperfection or jagging effect of object tracking introduces noise into the trajectory, and the spatial relationship estimation based on imperfect correspondence points from the overlapped background leads to incorrect synchronization results. As for the CTW technique, it incorporates canonical correlation analysis into the DTW estimation and recovers the spatial relationship by using a set of linear transformations and selecting common features between the two trajectories. This technique provides a good performance when the noise and local distortion are small. If there exist considerable noise and distortions in the extracted feature trajectories, the performance degrades since the computation of the DTW is based on the projected Euclidean distances between the two trajectories. The novelty of the proposed technique is to establish the ‘soft’ correspondence between the two trajectories constrained by the spatial relationship using GMMs, and then to search for the optimal synchronization result based on the ‘soft’ correspondence. The proposed technique is robust to fluctuation of the motion trajectory with the help of the ‘soft’ correspondence. The introduction of the ‘soft’ correspondences greatly alleviates the influence of noise and of imperfections in the spatial relationship estimation; that is, we do not need a highly accurate estimate of the spatial relationship, a rough estimate is good enough. The EM algorithm iteratively maximizes the likelihood while estimating the posterior (8) (i.e., the soft correspondences) and the spatial transformation. It has been observed that if the estimated spatial relationship is perfect, the two trajectories from the two videos align exactly. In practice, even with a roughly estimated spatial transformation, we can bring the two trajectories close to each other, which helps us compute the ‘soft’ correspondence via the GMM. This is primarily the reason the proposed technique leads to a more accurate correspondence calculation.
V. CONCLUSION
In this paper, we have proposed a novel technique for motion-based video synchronization. The proposed technique is
able to synchronize videos containing complex motions with
dynamic time shift in an efficient way. Comparative analysis
with the existing techniques, i.e., the STE, UBD and CTW techniques, demonstrated that an improvement of 6% to 36% in video temporal alignment can be achieved using the proposed technique. It has been observed that the proposed technique can also synchronize videos with significantly different scenes and is robust to noise and underlying local deformations. Although the number of videos to be synchronized is assumed to be two in this paper, the proposed technique can easily be extended to applications with more than two videos by setting one video as the reference and computing the temporal alignments of the other videos with respect to the reference.
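A sketch of this extension is given below (our illustration; align_pair stands for the pairwise spatial-correspondence-plus-DTW procedure of Section III and is assumed rather than defined here): each remaining video is simply aligned to the chosen reference video.

def align_to_reference(reference_traj, other_trajs, align_pair):
    """Align several videos to one reference video.
    align_pair(traj_ref, traj) is assumed to return the warp (list of frame
    index pairs) between two trajectories, e.g. the soft-correspondence plus
    DTW procedure sketched in Section III."""
    return [align_pair(reference_traj, traj) for traj in other_trajs]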
REFERENCES
[1] C. M. Bishop, Neural Networks for Pattern Recognition. Oxford,
U.K.: Oxford Univ. Press, 1996.
[2] P. J. Besl and N. D. McKay, “A method for registration of 3-D shapes,”
IEEE Trans. Pattern Anal. Mach. Intell., vol. 14, pp. 239–256, 1992.
[3] D. Comaniciu et al., “Real-time tracking of non-rigid objects using
mean shift,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2000, vol. II, pp. 142–149.
[4] Y. Caspi and M. Irani, “Spatio-temporal alignment of sequences,”
IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 11, pp.
1409–1424, Nov. 2002.
[5] Y. Caspi, D. Simakov, and M. Irani, “Feature based sequence to sequence matching,” Int. J. Comput. Vision, vol. 68, no. 1, pp. 53–64,
2006.
[6] S. Gold, C. P. Lu, A. Rangarajan, S. Pappu, and E. Mjolsness, “New
algorithms for 2D and 3D point matching: Pose estimation and correspondence,” in Proc. Advances in Neural Information Processing Systems, 1994, vol. 7, pp. 957–964.
[7] R. I. Hartley and A. Zisserman, Multiple View Geometry in Computer
Vision, 2nd ed. London, U.K.: Cambridge Univ. Press, 2004.
[8] C. Lu and M. Mandal, “Efficient temporal alignment of video sequences using unbiased bidirectional dynamic time warping,” J.
Electron. Imag., vol. 19, no. 4, pp. 0501–0504, Aug. 2010.
[9] L. Lee, R. Romano, and G. Stein, “Monitoring activities from multiple
video streams: Establishing a common coordinate frame,” IEEE Trans.
Pattern Anal. Mach. Intell., vol. 22, no. 8, pp. 758–767, Aug. 2000.
[10] A. Myronenko and X. B. Song, “Point Set Registration: Coherent Point
Drift,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 12, pp.
2262–2275, Dec. 2010.
[11] F. L. C. Padua et al., “Linear sequence-to-sequence alignment,” IEEE
Trans. Pattern Anal. Mach. Intell., vol. 32, no. 2, pp. 304–320, Feb.
2010.
[12] C. Rao, A. Yilmaz, and M. Shah, “View-invariant representation and
recognition of actions,” Int. J. Comput. Vision, vol. 50, no. 2, pp.
203–226, Nov. 2002.
[13] C. Rao, A. Gritai, M. Shah, and T. F. Syeda-Mahmood, “View-invariant
alignment and matching of video sequences,” in Proc. ICCV03, 2003,
pp. 939–945.
[14] M. Singh, A. Basu, and M. Mandal, “Event dynamics based temporal
registration,” IEEE Trans. Multimedia, vol. 9, no. 5, pp. 1004–1015,
Aug. 2007.
[15] M. Singh et al., “Optimization of symmetric transfer error for sub-frame video synchronization,” in Proc. Computer Vision - ECCV 2008, Part II, 2008, vol. 5303, pp. 554–567.
[16] H. Sakoe and S. Chiba, “Dynamic programming algorithm optimization for spoken word recognition,” IEEE Trans. Acoust., Speech, Signal
Process., vol. 26, no. 1, pp. 43–49, 1978.
[17] C. Dai, Y. Zheng, and X. Li, “Subframe video synchronization via
3D phase correlation,” in Proc. Int. Conf. Image Processing, 2006, vol.
1, pp. 501–504.
[18] M. A. Giese and T. Poggio, “Morphable models for the analysis and
synthesis of complex motion patterns,” Int. J. Comput. Vision, vol. 38,
no. 1, pp. 59–73, Jun. 2000.
[19] D. Perperidis, R. H. Mohiaddin, and D. Rueckert, “Spatio-temporal
free-form registration of cardiac MR image sequences,” Med. Image
Comput. Comput.-Assist. Intervent., vol. 3216, pp. 441–456, 2004.
[20] P. Tresadern and I. Reid, “Synchronizing image sequences of non-rigid
objects,” in Proc. 14th British Machine Vision Conf., Norwich, U.K.,
Sep. 9-11, 2003, vol. 2, pp. 629–638.
[21] T. Tuytelaars and L. J. VanGool, “Synchronizing video sequences,”
in Proc. IEEE Computer Society Conf. Computer Vision and Pattern
Recognition, Jul. 2004, vol. 1, pp. 762–768.
[22] C. Rao, View-Invariant Representation for Human Activity. [Online]. Available: http://server.cs.ucf.edu/vision/projects/ViewInvariance/ViewInvariance.html.
[23] C. Lu, Experimental Results for the Robust Video Alignment Technique. [Online]. Available: http://www.ece.ualberta.ca/lcheng4/
VideoAlignment/VA.htm.
[24] F. Zhou, F. De la Torre, and J. K. Hodgins, “Aligned cluster analysis for temporal segmentation of human motion,” in Proc. 8th IEEE Int. Conf. Automatic Face & Gesture Recognition, 2008, pp. 1–7.
[25] R. Hess and A. Fern, “Improved video registration using non-distinctive local image features,” in Proc. IEEE Conf. Computer Vision and
Pattern Recognition, 2007, pp. 1–8.
[26] R. Li and R. Chellappa, “Aligning spatio-temporal signals on a special
manifold,” in Proc. Eur. Conf. Computer Vision, 2010, pp. 547–560.
[27] F. Zhou and F. De la Torre, “Canonical time warping for alignment of
human behavior,” in Advances in Neural Information Processing
Systems, 2009.
[28] P. Shrestha, M. Barbieri, H. Weda, and D. Sekulovski, “Synchronization of multiple camera videos using audio-visual features,” IEEE
Trans. Multimedia, pp. 79–92, 2010.
[29] D. Chudova, S. Gaffney, and P. Smyth, “Probabilistic models for joint
clustering and time-warping of multidimensional curves,” in Proc. 19th
Conf. Uncertainty in Artificial Intelligence, 2003.
[30] C. Lu, M. Singh, I. Cheng, A. Basu, and M. Mandal, “Efficient
video sequences alignment using unbiased bidirectional dynamic time
warping,” J. Vision Commun. Image Represent., vol. 22, no. 7, pp.
606–614, Oct. 2011.
Cheng Lu (M’11) received the B.Sc. and M.Sc.
degrees in computer engineering in China, in 2006
and 2008. He is currently working toward the Ph.D. degree in electrical engineering at the University of Alberta. He is the recipient of a China Scholarship Council scholarship for his Ph.D. studies and of the Graduate Student Interdisciplinary Research Award 2012, University of Alberta. His research interests include computer vision, pattern recognition, super-resolution imaging, and medical imaging. He is an author or coauthor of more than
ten papers in leading international journals and conferences.
Mrinal Mandal (M’99–SM’03) is a Full Professor
and Associate Chair in the Department of Electrical
and Computer Engineering and is the Director of
the Multimedia Computing and Communications
Laboratory at the University of Alberta, Edmonton,
AB, Canada. He has authored the book Multimedia
Signals and Systems (Kluwer Academic), and
co-authored the book Continuous and Discrete Time
Signals and Systems (Cambridge University Press).
His current research interests include Multimedia,
Image and Video Processing, Multimedia Communications, and Medical Image Analysis. He has published over 140 papers
in refereed journals and conferences, and has a US patent on lifting wavelet
transform architecture. He has been the Principal Investigator of projects
funded by Canadian Networks of Centers of Excellence such as CITR and
MICRONET, and is currently the Principal Investigator of a project funded
by the NSERC. He was a recipient of the Canadian Commonwealth Fellowship from 1993 to 1998, and of the Humboldt Research Fellowship from 2005 to 2006 at the Technical University of Berlin.