70 IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 15, NO. 1, JANUARY 2013 A Robust Technique for Motion-Based Video Sequences Temporal Alignment Cheng Lu, Member, IEEE, and Mrinal Mandal, Senior Member, IEEE Abstract—In this paper, we propose a robust technique for temporal alignment of video sequences with similar planar motions acquired using uncalibrated cameras. In this technique, we model the motion-based video temporal alignment problem as a spatio-temporal discrete trajectory point sets alignment problem. First, the trajectory of the object of interest is tracked throughout the videos. A probabilistic method is then developed to calculate the ‘soft’ spatial correspondence between the trajectory point sets. Next, a dynamic time warping technique (DTW) is applied to the spatial correspondence information to compute the temporal alignment of the videos. The experimental results show that the proposed technique provides a superior performance over existing techniques for videos with similar trajectory patterns. Index Terms—Video alignment, video synchronization, temporal registration, dynamic time warping, point sets alignment. I. INTRODUCTION T HE temporal alignment of video sequences is very important in computer vision applications such as human action recognition [12], video synthesis, video mosaicing [25], superresolution imaging [4], 3D visualization [14], robust multi-view surveillance [9], and human action temporal segmentation [24]. Most existing techniques focus on alignment of videos captured from different cameras of the same scene or similar scenes with overlapped background. However, there are some applications that require synchronizing the videos captured under significantly different scenes. For example, consider computer-aided sport self-training application where a sport learner wants to learn the TaiChiQuan sport. The learner follows the instructions and captures a video sequence while playing. After a training session, he can compare his postures with that of an expert by comparing the recorded videos. The TaiChiQuan sport is typically performed in a relatively narrow space and has complex motions (example videos are shown in Section IV) which may require a robust video alignment technique to bring two videos into a common time axis. Note that in such scenario, the sport videos are captured under widely different scenes. In the literature, the video temporal alignment or temporal registration can be broadly classified into two categories: video Manuscript received October 03, 2011; revised March 02, 2012 and June 25, 2012; accepted June 26, 2012. Date of publication October 16, 2012; date of current version December 12, 2012. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Mihaela van der Schaar. The authors are with the Department of Electrical and Computer Engineering, University of Alberta, Edmonton, AB T6G 2V4, Canada (e-mail: lcheng4@ualberta.ca; mmandal@ualberta.ca). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TMM.2012.2225036 sequences of the same scene and video sequences of similar scenes. We present a brief review of the literature for these two categories in the following. In the case of video sequences captured from the same scene, the videos record the identical motion/event from the same environment. Lee et al. 
[9] proposed a video synchronization technique, in which they estimated a time shift between two sequences based on alignment of centroids of moving objects in planar scene. Dai et al. [17] use 3D phase correlation between video sequences to calculate the synchronization. Tuytelaars et al. [21] computed alignment by checking the rigidity of a set of five (or more) points. Tresadern et al. [20] also followed a similar approach of computing a rank-constraint based rigidity measure between four pairs of non-rigidly moving feature points. Caspi et al. [4], [5] proposed two kinds of techniques: feature-based sequence alignment and direct-based sequence alignment. In the feature-based sequence alignment technique, they calculated the spatial and temporal relationship between two sequences by minimizing the sum of square differences (SSD) error over extracted trajectories that are visible in both sequences. In the direct-based sequence alignment technique, they computed the frame-to-frame correspondences by minimizing the SSD error over each frames in the video sequences. Padua et al. [11] extended [5] to align sequences based on scene points that need to be visible only in two consecutive frames. Shrestha et al. [28] proposed several video synchronization methods based on the audio and visual features. These audio and visual features include flash light, audio fingerprints, and audio onsets. All these techniques solve the alignment problem where videos are captured at the same scene using uncalibrated cameras. When aligning video sequences of different scenes, albeit sequences correlated via motion, one has to factor in the dynamic temporal scale of activities in the video sequences. Giese and Poggio [18] approached alignment of activities of different people by computing a dynamic time warp between the feature trajectories. They did not consider the situation where the activity sequences are from varying viewpoints. Perperidis et al. [19] attempted to locally warp cardiac MRI sequences, by extending Caspi’s work to incorporate spline based local alignment. Rao et al. [13] used rank constraint as the distance measure in the dynamic time warping (DTW) technique to align human activity-based videos. This is the first work reported in the literature that can deal with video sequences of correlated activities in similar scenes. Singh et al. [15] proposed a video synchronization technique to generate a high resolution MRI sequence by combining several low resolution MRI sequences. This technique formulates a symmetric transfer error (STE) as 1520-9210/$31.00 © 2012 IEEE LU AND MANDAL: A ROBUST TECHNIQUE FOR MOTION-BASED VIDEO SEQUENCES TEMPORAL ALIGNMENT a functional of regularized temporal warp and selects the time warp that has the smallest STE as the alignment result. Lu et al. [8], [30] extended Singh’s technique, using unbiased bidirectional dynamic time warping (UBD), to calculate the optimal warp for the alignment. These techniques were used to synchronize the MRI sequences captured from the same patient (doing a certain activity), resulting in a super-resolution MRI sequences. However, these techniques may not work well in cases where the MRI sequences need to be synchronized between different patients. Chudova et al. [29] developed a probabilistic model for clustering and time-warping of multi-dimensional curves. This model is able to learn the clusters of curves (with local and global deformations in the underlying curve) using a finite mixture models. 
However, the number of the mixture components is needed to be known in advance for the learning procedure, and the model is more focused on the clustering of different type of curves with distortions. Li et al. [26] proposed the alignment manifold and solved the spatio-temporal alignment problem in a pre-specified non-linear space. Zhou et al. [27] proposed the canonical time warping (CTW) technique, which combines the canonical correlation analysis and DTW technique, for the human motion alignment. In this paper, we propose a novel technique for motion-based temporal alignment of video sequences which is able to deal with the video captured from the same scene and different scenes. The proposed technique formulates the motion-based video temporal alignment problem as a spatio-temporal discrete trajectory point sets alignment problem. First, the trajectory of the interested object is tracked throughout the videos. Considering the imperfection of the features extraction module which will introduce noise in the extracted trajectory, a probabilistic method is developed to calculate the ‘soft’ spatial correspondence between the trajectory point sets. Next, a dynamic time warping technique (DTW) is applied to the ‘soft’ spatial correspondence information to compute the temporal alignment of the videos. The advantages of the proposed technique are: (1) it does not require overlapping views between videos to select corresponding feature points, i.e., it can be applied in videos containing different scenes; (2) it is able to deal with situations where videos contain complex dynamic object motion (e.g., long trajectory with several intersections) or noisy feature trajectory with consistent performance. The rest of this paper is organized as follows. Section II presents the background information. The proposed technique is presented in Section III. Performance evaluation of the proposed technique is presented in Section IV, followed by the conclusions. II. BACKGROUND A. Video Synchronization Between Similar Scenes Most video alignment techniques deal with video sequences of the same scene and hence assume that the temporal relationship between the videos is considered to be a linear relationship [4], [5], [11], such as , where is the ratio of frame rates and is a fixed translational offset. However, for applications such as video search, video comparison and human 71 Fig. 1. Illustration of two different scenes acquired using two distinct cameras. Fig. 2. Typical schematic of similar scene videos synchronization. activity recognition [12], we need to align video sequences from two different scenes. Assume that two cameras and view two independent scenes of similar activities, as shown in Fig. 1. In Fig. 1, (Camera 1) views 3D scene in View 1 and acquires video . Similarly, (Camera 2) views another 3D scene in View 2 and acquires video . Note that the motions in these two scenes are similar but have dynamic time shift, i.e., the linear time shift constraint assumed in most of existing techniques (e.g., ) would not hold anymore. A typical schematic for calculating temporal alignment of similar scene videos is shown in Fig. 2. The RBC, STE, UBD, and CTW techniques fall within this schematic. Note that for the sake of correlating two video sequences and representing the motion between them, features are extracted and tracked from two video sequences. Robust view-invariance tracker is used to generate feature trajectory. 
On their own, the feature trajectories are discrete representations of the motion in the scene. The difference of RBC, STE, UBD and CTW techniques lies in how to compute the alignment using DTW. The RBC technique uses the rank constraint of corresponding points in two videos to measure the similarity. A similarity measurement is then used as the distance measure in DTW to calculate the dynamic time alignment function. Though the RBC technique does lead to good alignment of dynamic time-varying videos, the authors of [13] note that if feature points are close to each other, their rank constraint will result in erroneous matching. The proposed technique does not suffer from this limitation. Furthermore, if the object of interest moves over a planar surface, and if the fundamental matrix is singular, the RBC technique does not work very well. The STE technique projects the trajectory in one view into the other view, and then calculates the symmetric alignment using Euclidean distance based DTW. The technique determines the time warp that has the smallest STE as the final alignment. 72 IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 15, NO. 1, JANUARY 2013 The UBD technique utilizes the symmetric alignment as the global constraint, and calculates the optimal warp using DTW. The CTW technique combines the canonical correlation analysis and DTW technique, and is able to capture the local spatial deformations between two trajectories. It is noted that due to the imperfection of the feature extraction module and the natural spatial and temporal variations of the motions within the different videos, the extracted feature trajectories are usually noisy and complex. When computing the temporal alignment using the DTW, most existing techniques use Euclidean distance as the spatial deformation measurement. The unavoidable noises within the extracted trajectories generally degrade the performance of the above mentioned techniques. In addition, the RBC, STE and UBD techniques suffer from a common drawback that they require overlapping views between the first pair of frames in videos in order to specify enough corresponding feature points. In the experimental section, i.e., Section IV, we compare the STE, UBD, CTW techniques with the proposed technique. It is shown that the proposed technique is able to provide robust performance compared to the existing techniques. Fig. 3. The schematic of the proposed technique. B. Point Sets Alignment Point sets alignment is an important technique for solving computer vision problems. Beal and McKay [2] proposed the iterative closest point (ICP) technique for 3D shape registration. The ICP technique computes the correspondence between two point sets based on distance criterion. Other techniques were developed using soft-assignment of correspondences between two point sets instead of the binary assignment in ICP technique [6], [10]. Such probabilistic techniques perform better than the ICP technique in cases where noise and outliers are present. Point sets alignment technique has been used successfully in stereo matching, shape recognition and image registration. Although it has never been used in video synchronization technique, it has the potential to achieve good synchronization performance. III. PROPOSED TECHNIQUE In order to overcome the limitations of existing techniques, we propose a robust technique for motion-based videos temporal alignment. 
The proposed technique does not require correspondence points from an overlapped background for the spatial relationship estimation, which allows it to handle temporal alignment of video sequences of totally different scenes. The assumption of the proposed technique is that the patterns of the extracted trajectories in the two videos are similar. This is a general assumption in practice. For example, Fig. 4(a) and (b) show the trajectory of the motion in videos $V_1$ and $V_2$, respectively. Note that, in this example, the motions in the two videos are similar but differ by a dynamic time shift and by the viewpoint of the camera. The numbers in Fig. 4 represent frame numbers, while the stars and circles represent the location of the object of interest at the corresponding frames. The total numbers of frames in videos $V_1$ and $V_2$ are 30 and 40, respectively. Note that the patterns of these trajectories are similar even though they are captured from different views and with a dynamic time shift. For more examples of real videos with similar trajectory patterns, please refer to Section IV-B and the supplemental material [23].

Fig. 4. An example of the trajectory of the motion (object of interest) in two video sequences: (a) the motion trajectory in video $V_1$; (b) the motion trajectory in video $V_2$.

The schematic of the proposed technique is shown in Fig. 3. The technique consists of two main modules: extraction of feature trajectories and temporal alignment estimation. The temporal alignment estimation module has two steps: computation of the spatial correspondences of the trajectory points and computation of the temporal correspondences of the trajectory points. In the computation of the spatial correspondences, we model the spatial correspondence via an affine transformation. Note that two similar trajectory patterns may not be fully modeled by the affine transformation, since the motions captured in the videos are dynamically changing and the viewpoints of the cameras are different. In addition, noise produced by the imperfection of the tracking algorithm and local discrepancies may exist between the trajectories. Examples are shown in Fig. 7(b) and (c), and in Fig. 8(a) and (b). In order to tackle such problems, we recover the spatial relationship by computing a 'soft' correspondence via a probabilistic model. The potentially inaccurate spatial correspondences are then rectified in the temporal correspondence computation by imposing a time constraint via the DTW technique. Note that the final spatio-temporal correspondence information is actually the temporal alignment information between the videos. The proposed temporal alignment estimation technique is presented in detail in the following sections; a high-level sketch of the pipeline is given below. Note that for simplicity, we introduce the proposed technique by assuming that the number of videos to be synchronized is two.
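To make the two-module structure of Fig. 3 concrete, the following is a minimal sketch of the overall pipeline (Python/NumPy; the paper's experiments were implemented in MATLAB). The function names `track_object`, `compute_soft_correspondence`, and `dtw_max_probability` are illustrative, not from the paper; the latter two are sketched after Sections III-B and III-C below.

```python
import numpy as np

def align_videos(frames_v1, frames_v2, track_object):
    """Sketch of Fig. 3: returns a warp [(m, n), ...] pairing frame m of video V1
    with frame n of video V2."""
    # Module 1: extraction of feature trajectories (e.g., with a mean-shift tracker).
    X = np.array([track_object(f) for f in frames_v1])   # (M, 2) trajectory of V1
    Y = np.array([track_object(f) for f in frames_v2])   # (N, 2) trajectory of V2

    # Module 2, step 1: 'soft' spatial correspondence matrix P (M x N), Sec. III-B.
    P = compute_soft_correspondence(X, Y)

    # Module 2, step 2: temporal correspondence by DTW on P, Sec. III-C.
    return dtw_max_probability(P)
```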
A. Extraction of Feature Trajectories

This module calculates the trajectory of an object of interest throughout the video. Let $T_1$ and $T_2$ denote the trajectories obtained from videos $V_1$ and $V_2$, respectively. Techniques such as the mean shift tracker [3] can be used to generate the trajectory. Fig. 4 shows an example of the object of interest tracked in two videos. Let $X$ and $Y$ denote the discrete trajectory point sets for trajectories $T_1$ and $T_2$, respectively. The sets $X$ and $Y$ are defined as follows:

$X = \{\mathbf{x}_m\}_{m=1}^{M}, \qquad \mathbf{x}_m = (x_m,\, y_m,\, 1)^{T}$   (1)

$Y = \{\mathbf{y}_n\}_{n=1}^{N}, \qquad \mathbf{y}_n = (x'_n,\, y'_n,\, 1)^{T}$   (2)

Note that $M$ and $N$ in (1) and (2) are the total numbers of points in trajectories $T_1$ and $T_2$, respectively. The variables $(x_m, y_m)$ and $(x'_n, y'_n)$ represent the coordinates of the $m$-th and $n$-th trajectory points in trajectories $T_1$ and $T_2$, respectively. Each discrete trajectory point is written in homogeneous form. The index of a point also carries the temporal information, i.e., the frame index in the corresponding video.

B. Compute Spatial Correspondence of Trajectory Points

In this section, we aim to calculate the spatial correspondence of the trajectory point sets. The estimated spatial point correspondence will then be used for the computation of the temporal correspondence in the next section. The spatial correspondence of the two trajectory point sets is modeled by an affine transformation. The affine transformation consists of a linear transformation followed by a translation [7], and a pair of corresponding trajectory points satisfies the following equation:

$\mathbf{y}_n = \mathbf{H}\,\mathbf{x}_m, \qquad \mathbf{H} = \begin{bmatrix}\mathbf{A} & \mathbf{t}\\ \mathbf{0}^{T} & 1\end{bmatrix}$   (3)

where $\mathbf{A}$ is the $2\times2$ linear part and $\mathbf{t}$ is the translation vector.

In order to compute the soft spatial correspondences between the two trajectory point sets, we treat the trajectory point set $X$ as the centroids of a Gaussian mixture model (GMM) [1], and assume the other trajectory point set $Y$ to be the data points generated independently by the GMM given the knowledge of $X$. We then re-parameterize the GMM centroids to the data points by maximizing the log-likelihood function. The probability of the $m$-th Gaussian component generating the data point $\mathbf{y}_n$, i.e., the posterior probability $p(m \mid \mathbf{y}_n)$, is the soft correspondence we are looking for. The conditional Gaussian probability density function is defined as follows:

$p(\mathbf{y}_n \mid m;\, \theta) = \dfrac{1}{2\pi\sigma^{2}}\exp\!\left(-\dfrac{\lVert \mathbf{y}_n - \mathbf{H}\mathbf{x}_m\rVert^{2}}{2\sigma^{2}}\right)$   (4)

where $\theta = \{\mathbf{A}, \mathbf{t}, \sigma^{2}\}$ represents the parameter set. In this paper we assume that the GMM components have identical covariance $\sigma^{2}\mathbf{I}$ and that the prior term is equal to $1/M$ for all components. The GMM probability density function can then be expressed as follows:

$p(\mathbf{y}_n;\, \theta) = \sum_{m=1}^{M}\frac{1}{M}\, p(\mathbf{y}_n \mid m;\, \theta)$   (5)

The log-likelihood function of the GMM probability density function can be calculated using the following equation:

$L(\theta) = \sum_{n=1}^{N}\log\sum_{m=1}^{M}\frac{1}{M}\, p(\mathbf{y}_n \mid m;\, \theta)$   (6)

We now model the problem as seeking the parameters $\theta$ for which the log-likelihood reaches its maximum. Note that finding the maximum likelihood with respect to the parameters using (6) is difficult, since no closed-form solution exists for it. The Expectation-Maximization (EM) algorithm [1] is therefore used to estimate the parameters. The EM algorithm finds maximum-likelihood solutions for problems posed in terms of missing data. In our case, the missing data is the point-to-centroid correspondence, i.e., the posterior of the GMM centroid given the data point. By introducing the posterior, the EM algorithm estimates the parameters in an iterative framework. The expected complete-data log-likelihood used in the EM algorithm is defined as follows [1]:

$Q(\theta, \theta^{(k)}) = \sum_{n=1}^{N}\sum_{m=1}^{M} p^{(k)}(m \mid \mathbf{y}_n)\,\log\!\left(\frac{1}{M}\, p(\mathbf{y}_n \mid m;\, \theta)\right)$   (7)

The EM algorithm estimates the posterior and the parameters iteratively in a two-step manner. In the first step (E-step), the posterior is estimated using Bayes' rule as follows:

$p^{(k)}(m \mid \mathbf{y}_n) = \dfrac{p(\mathbf{y}_n \mid m;\, \theta^{(k)})}{\sum_{m'=1}^{M} p(\mathbf{y}_n \mid m';\, \theta^{(k)})}$   (8)

where $\theta^{(k)}$ is the parameter set estimated at the $k$-th iteration. Note that the posterior is proportional to the likelihood function (4). To avoid ambiguity, we denote a variable with a superscript within parentheses as the $k$-th version of that variable; for example, $\theta^{(k)}$ and $p^{(k)}(m \mid \mathbf{y}_n)$ are the $k$-th versions of the parameters and the posterior in the iterative framework. In the second step (M-step), we estimate $\theta$ by maximizing $Q(\theta, \theta^{(k)})$ in (7). The EM algorithm iterates these two steps (i.e., E-step and M-step) until $Q$ in (7) converges. The final version of the estimated posterior probability is the spatial correspondence between the two point sets we are interested in.
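For illustration, the E-step (8) can be computed as in the following sketch (Python/NumPy; the interface and the underflow guard are ours, not from the paper). Because the uniform prior $1/M$ and the Gaussian normalization constant are common to all components, they cancel in (8) and are omitted.

```python
import numpy as np

def estep_posterior(X, Y, A, t, sigma2):
    """E-step of (8): posterior P[m, n] = p(m | y_n) for GMM centroids A @ x_m + t
    with shared isotropic variance sigma2 and uniform priors 1/M.
    X: (M, 2) trajectory points (centroids), Y: (N, 2) trajectory points (data)."""
    TX = X @ A.T + t                                           # transformed centroids, (M, 2)
    # Squared distance between every transformed centroid and every data point.
    d2 = ((TX[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)   # (M, N)
    G = np.exp(-d2 / (2.0 * sigma2))
    denom = G.sum(axis=0, keepdims=True)                       # sum over centroids, per data point
    denom[denom < 1e-300] = 1e-300                             # guard against numerical underflow
    return G / denom                                           # each column sums to 1
```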
Substituting (4) into (7) and ignoring the constant terms, we obtain the objective function of the M-step as follows:

$Q(\theta, \theta^{(k)}) = -\dfrac{1}{2\sigma^{2}}\sum_{n=1}^{N}\sum_{m=1}^{M} p^{(k)}(m \mid \mathbf{y}_n)\,\lVert \mathbf{y}_n - \mathbf{H}\mathbf{x}_m\rVert^{2} \;-\; N_{P}\log\sigma^{2}$   (9)

where $N_{P} = \sum_{n=1}^{N}\sum_{m=1}^{M} p^{(k)}(m \mid \mathbf{y}_n)$. It can be shown that the solution maximizing (9) at the $k$-th M-step with respect to $\mathbf{A}$ and $\mathbf{t}$ is

$\mathbf{A}^{(k+1)} = \Bigl(\sum_{n,m} p^{(k)}(m \mid \mathbf{y}_n)\,(\tilde{\mathbf{y}}_n - \boldsymbol{\mu}_y)(\tilde{\mathbf{x}}_m - \boldsymbol{\mu}_x)^{T}\Bigr)\Bigl(\sum_{n,m} p^{(k)}(m \mid \mathbf{y}_n)\,(\tilde{\mathbf{x}}_m - \boldsymbol{\mu}_x)(\tilde{\mathbf{x}}_m - \boldsymbol{\mu}_x)^{T}\Bigr)^{-1}$   (10)

$\mathbf{t}^{(k+1)} = \boldsymbol{\mu}_y - \mathbf{A}^{(k+1)}\boldsymbol{\mu}_x$   (11)

where $\tilde{\mathbf{x}}_m = (x_m, y_m)^{T}$ and $\tilde{\mathbf{y}}_n = (x'_n, y'_n)^{T}$ are the image coordinates of the trajectory points, and $\boldsymbol{\mu}_x = \frac{1}{N_P}\sum_{n,m} p^{(k)}(m \mid \mathbf{y}_n)\,\tilde{\mathbf{x}}_m$ and $\boldsymbol{\mu}_y = \frac{1}{N_P}\sum_{n,m} p^{(k)}(m \mid \mathbf{y}_n)\,\tilde{\mathbf{y}}_n$ are the posterior-weighted means. The spatial point correspondence matrix is given by

$\mathbf{P} = \bigl[\,p_{mn}\,\bigr]_{M\times N}, \qquad p_{mn} = p(m \mid \mathbf{y}_n)$   (12)

Note that each element of $\mathbf{P}$ is computed using (8). The EM algorithm for calculating the point-to-centroid correspondence in our problem is summarized in Table I.

TABLE I. Pseudo code for calculation of the spatial point-to-centroid correspondence using the EM algorithm.
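A compact sketch in the spirit of the EM procedure summarized in Table I is given below (Python/NumPy). The summation form of (10)-(11) is used; the initialization, the residual-based update of the shared variance, and the iteration cap are our assumptions, and `estep_posterior` is the function sketched after (8).

```python
import numpy as np

def compute_soft_correspondence(X, Y, n_iter=50, tol=1e-6):
    """EM estimation of the soft correspondence matrix P (M x N) of (12).
    X, Y: (M, 2) and (N, 2) trajectory point arrays (image coordinates)."""
    M, N = len(X), len(Y)
    A, t = np.eye(2), np.zeros(2)
    # Broad initial variance so that every pairing starts with non-negligible weight.
    sigma2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum() / (2.0 * M * N)
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step (8): posteriors given the current affine parameters.
        P = estep_posterior(X, Y, A, t, sigma2)
        Np = P.sum()
        # M-step (10)-(11): posterior-weighted affine fit of the centroids to the data.
        mu_x = (P.sum(axis=1) @ X) / Np
        mu_y = (P.sum(axis=0) @ Y) / Np
        Xc, Yc = X - mu_x, Y - mu_y
        Sxy = Yc.T @ P.T @ Xc                       # sum_{m,n} P[m,n] (y_n-mu_y)(x_m-mu_x)^T
        Sxx = Xc.T @ (P.sum(axis=1)[:, None] * Xc)  # sum_m (sum_n P[m,n]) (x_m-mu_x)(x_m-mu_x)^T
        A = Sxy @ np.linalg.inv(Sxx)
        t = mu_y - A @ mu_x
        # Shared variance from the weighted residuals (a standard choice, assumed here).
        d2 = (((X @ A.T + t)[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
        sigma2 = max((P * d2).sum() / (2.0 * Np), 1e-12)
        # Monitor the log-likelihood (6) for convergence.
        ll = np.log(np.exp(-d2 / (2.0 * sigma2)).sum(axis=0)
                    / (2.0 * np.pi * sigma2 * M) + 1e-300).sum()
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return P
```

The returned matrix is then used as the input of the temporal step of Section III-C.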
C. Compute Temporal Correspondence of Trajectory Points

In the previous section, the spatial correspondence module estimated the spatial point correspondence matrix $\mathbf{P}$. In this section, we present the procedure to calculate the exact correspondence using $\mathbf{P}$. In a general point set alignment problem, for each point $\mathbf{y}_n$ in $Y$, the exact corresponding point in $X$ can be determined by choosing the $m$-th point such that the posterior is the maximum, i.e.,

$m^{*}(n) = \arg\max_{m}\; p_{mn}$   (13)

However, in our video synchronization problem, this method may incur matching errors if the feature trajectory itself has intersections at different time instances or if the trajectory is noisy. This is illustrated in Fig. 5. Note that in Fig. 5(a) and (b), the horizontal and vertical axes represent the indices of the points in point set $X$ and point set $Y$, respectively. In this example, we generated a sequence of 100 frames in one view. In the second view, we generated the same sequence but with a time warp, i.e., the motion was slowed down by a factor of five in the range of frame 55 to frame 70. This results in 160 frames in the second sequence. The solid dots indicate the correspondences determined by (13), whereas the ground truth correspondences are shown as the dotted line in Fig. 5(b). Fig. 5(a) shows that many of the correspondences are not correct. For example, the correspondences encircled by the dotted-line circle are incorrect: they indicate that the 7th to 13th points of point set $X$ (horizontal axis) should match the 131st to 143rd points of point set $Y$ (vertical axis), whereas, as per the ground truth, the correspondence should lie between the 5th and 15th points of point set $Y$ (vertical axis). Other incorrect correspondences can be observed by comparing with the ground truth correspondences in Fig. 5(a).

Fig. 5. Alignment obtained (a) only by the spatial correspondence, and (b) after using the DTW. The dotted line represents the ground truth. Note that the result obtained by the proposed technique and the ground truth almost overlap.

In order to solve the problem stated above, we propose to utilize the temporal constraint on the trajectory points and compute the actual correspondence using the DTW technique based on the obtained spatial point correspondence matrix $\mathbf{P}$. Denote the temporal alignment between the two trajectories as a warp $W$, constructed as follows:

$W = \{w_1, w_2, \ldots, w_K\}, \qquad w_k = (m_k, n_k)$   (14)

where $m_k \in \{1, \ldots, M\}$, $n_k \in \{1, \ldots, N\}$, and $M$ and $N$ are the lengths of trajectories $T_1$ and $T_2$ obtained from videos $V_1$ and $V_2$, respectively. The $k$-th element of the warp is $w_k = (m_k, n_k)$. The warp satisfies the boundary conditions, continuity conditions and monotonicity conditions explained in [16]. The traditional DTW technique computes the warping based on the Euclidean distance between two sequences and chooses the warp with the minimum accumulated distance as the optimal warp. The proposed technique instead computes the warp based on the probabilistic values obtained from the spatial point correspondence matrix $\mathbf{P}$, and chooses the warp with the maximum accumulated probability as the optimal warp. The accumulated probability of a warp is defined as follows:

$P(W) = \sum_{k=1}^{K} p_{m_k n_k}$   (15)

where $p_{m_k n_k}$ is the posterior value at the indices $(m_k, n_k)$ of the $k$-th element of the warp and can be obtained from the pre-computed spatial point correspondence matrix $\mathbf{P}$. The optimal warp is calculated in a two-step procedure:

i) In order to find the optimal warp, an accumulated probability matrix $D$ is created. The element $D(m, n)$ of the accumulated probability matrix is calculated as follows:

$D(m, n) = p_{mn} + \max\{D(m-1, n),\; D(m, n-1),\; D(m-1, n-1)\}$   (16)

ii) A greedy search is then employed in the accumulated probability matrix to find the optimal warp $W^{*}$ such that $P(W^{*})$ is maximum.

The above method employs a dynamic programming algorithm to obtain the optimal warp [16]. Fig. 5(b) shows the alignment obtained using the DTW on the spatial point correspondence matrix $\mathbf{P}$. Note that, with the imposed continuity and monotonicity conditions, we obtain an accurate alignment compared to Fig. 5(a); the result obtained by the proposed technique and the ground truth are almost identical.
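The following sketch implements the two-step procedure above: the forward pass of (16) followed by backtracking to recover the warp with maximum accumulated probability (15). The interface and the boundary handling are our assumptions, not the paper's code.

```python
import numpy as np

def dtw_max_probability(P):
    """DTW over the spatial correspondence matrix P (M x N): builds the accumulated
    probability matrix of (16) and backtracks the warp maximizing (15), subject to
    the boundary, continuity and monotonicity conditions."""
    M, N = P.shape
    D = np.full((M, N), -np.inf)
    D[0, 0] = P[0, 0]
    # Forward pass: D[m, n] = P[m, n] + max(D[m-1, n], D[m, n-1], D[m-1, n-1]).
    for m in range(M):
        for n in range(N):
            if m == 0 and n == 0:
                continue
            best = -np.inf
            if m > 0:
                best = max(best, D[m - 1, n])
            if n > 0:
                best = max(best, D[m, n - 1])
            if m > 0 and n > 0:
                best = max(best, D[m - 1, n - 1])
            D[m, n] = P[m, n] + best
    # Backtracking from (M-1, N-1) to (0, 0) recovers the optimal warp.
    warp, m, n = [(M - 1, N - 1)], M - 1, N - 1
    while m > 0 or n > 0:
        candidates = []
        if m > 0 and n > 0:
            candidates.append((D[m - 1, n - 1], m - 1, n - 1))
        if m > 0:
            candidates.append((D[m - 1, n], m - 1, n))
        if n > 0:
            candidates.append((D[m, n - 1], m, n - 1))
        _, m, n = max(candidates)
        warp.append((m, n))
    return warp[::-1]
```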
IV. PERFORMANCE EVALUATIONS

In this section, we evaluate the proposed technique using both synthetic trajectories and real image sequences. To compare the proposed technique with existing techniques, we have also implemented the STE technique [15], the UBD technique [8] and the CTW technique [27], which deal with aligning videos of similar motions. Our test cases and results are presented below.

A. Synthetic Data Evaluation

For the synthetic data evaluation, we generate a 100-frame-long complex planar trajectory using a pseudo-random number generator to simulate the motion in videos. The trajectory is then projected onto two image planes using user-defined camera projection matrices. In order to simulate the dynamic time shift, the segment from frame 55 to frame 70 of one of the trajectories was slowed down by a factor of five through interpolation. We refer to this trajectory as the time-warped trajectory; its length is 160 frames, which differs from that of the original planar trajectory. (A minimal sketch of this generation protocol is given later in this subsection.) The STE, UBD, CTW and the proposed techniques are then applied to the synthetic trajectories to compute the alignment between them. This process is repeated on 50 different synthetic trajectories. The results of the alignment of smooth and noisy trajectories with the STE, UBD, CTW and the proposed techniques are shown in Table II, where the mean absolute difference between the actual and computed frame correspondences is reported as the alignment error. The percentage numbers in the last column show the improvement with respect to the CTW technique.

TABLE II. Alignment errors (in terms of frames) of the STE, UBD, CTW and proposed techniques for synthetic data. The last column shows the improvement obtained by the proposed technique over the CTW technique.

Note that in the case of a noise-free trajectory, the CTW and the proposed technique provide comparable performance, whereas the STE and the UBD techniques have relatively larger alignment errors. It is noted that with the introduction of noise, the performance of the existing techniques decreases significantly, while the proposed technique provides consistent performance and outperforms the other existing techniques.

Fig. 6. An example of alignments for a noise-free synthetic trajectory using the proposed, STE, UBD, and CTW techniques. (a) illustrates the 3D camera and trajectory configuration, where the two rectangles A and B represent the positions of the two cameras recording the motions. (b) and (c) show the projected trajectories on the two image planes, respectively; the interpolated area in (c) is the warped part, which simulates the dynamic time shift between two similar motions. (d) shows the alignment results obtained by the STE, UBD, and CTW techniques, where CTW provides the best performance. (e) shows the alignment results obtained by the proposed and the CTW techniques; a zoomed-in version of the alignment result is presented in a small window, and the ground truth alignment is shown as a solid line.

Fig. 6 shows a synthetic data example where the noise level is zero. Note that a result closer to the ground truth indicates a more accurate alignment. For better illustration, we first compare the existing techniques in Fig. 6(d); we then choose the best existing technique, i.e., the CTW technique in this case, and compare it with the proposed technique in Fig. 6(e). A zoomed version of the alignment result is shown in a small window, and the ground truth alignment is shown as a solid line. In this case, the alignment errors are 2.34, 2.77, 0.44 and 0.38 frames for the STE, UBD, CTW and the proposed technique, respectively. It is observed that CTW and the proposed technique provide the best and comparable performance in this noise-free case.

In a practical situation, the feature trajectory is usually imperfect and contains noise. We therefore also evaluated the effect of noisy trajectories on the proposed technique. Normally distributed, zero-mean noise with various values of variance was added to the synthetic feature trajectories. A synthetic noisy trajectory example is shown in Fig. 7: normally distributed noise with mean 0 and variance 0.1 is added to both the x and y coordinates, and the resulting trajectories in the two image planes are shown in Fig. 7(b) and (c).

Fig. 7. An example of alignments for a noisy synthetic trajectory using the proposed, STE, UBD, and CTW techniques. (a) is the 3D configuration of the cameras and the noise-free trajectory. (b) and (c) are the projected noisy trajectory and the projected noisy time-warped trajectory on the two different image planes, respectively; the interpolated area in (c) is the warped part, which simulates the dynamic time shift between two similar motions. (d) shows the alignment results obtained by the STE, UBD, and CTW techniques, where CTW provides the best performance. (e) shows the alignment results obtained by the proposed and the CTW techniques; a zoomed-in version of the alignment result is presented in a small window.

Fig. 8. The motion trajectories in the UCF videos and the computed alignment results with ground truth. (a) and (b) show the two feature trajectories obtained from the two videos, respectively. (c) is the alignment result comparison.

From the comparisons shown in Fig. 7(d), it is clear that the CTW technique provides the best alignment among the existing techniques. In Fig. 7(e), it is shown that the proposed technique provides better performance than the CTW technique.
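A minimal sketch of the synthetic-data protocol described above is given here (Python/NumPy): a 100-frame pseudo-random planar trajectory, two projections, the factor-of-five slow-down of frames 55 to 70, and additive zero-mean Gaussian noise. The random-walk smoothing and the particular (affine) projection matrices are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def synth_pair(n_frames=100, slow_start=55, slow_end=70, factor=5, noise_var=0.1):
    """Returns two noisy 2D trajectories (100 and 160 frames) and the ground-truth
    frame correspondence used to score the alignment error."""
    # Smooth pseudo-random planar motion: a cumulative random walk, lightly smoothed.
    world = np.cumsum(rng.normal(size=(n_frames, 3)), axis=0)
    kernel = np.ones(7) / 7.0
    world = np.column_stack([np.convolve(world[:, k], kernel, mode='same') for k in range(3)])

    # Two user-defined camera projections (2 x 4 affine cameras, illustrative values).
    Pa = np.array([[1.0, 0.2, 0.0, 50.0], [0.0, 1.0, 0.3, 40.0]])
    Pb = np.array([[0.8, -0.3, 0.1, 60.0], [0.2, 0.9, 0.0, 30.0]])
    traj_a = np.column_stack([world, np.ones(n_frames)]) @ Pa.T          # view 1, 100 frames

    # Time warp: slow frames [slow_start, slow_end) down by 'factor' via interpolation.
    src = np.arange(n_frames, dtype=float)
    warped_idx = np.concatenate([src[:slow_start],
                                 np.linspace(slow_start, slow_end,
                                             (slow_end - slow_start) * factor, endpoint=False),
                                 src[slow_end:]])
    world_b = np.column_stack([np.interp(warped_idx, src, world[:, k]) for k in range(3)])
    traj_b = np.column_stack([world_b, np.ones(len(warped_idx))]) @ Pb.T  # view 2, 160 frames

    # Additive zero-mean Gaussian noise on both image coordinates.
    traj_a += rng.normal(scale=np.sqrt(noise_var), size=traj_a.shape)
    traj_b += rng.normal(scale=np.sqrt(noise_var), size=traj_b.shape)
    return traj_a, traj_b, warped_idx   # warped_idx[n] = view-1 frame matching view-2 frame n
```

The alignment error of Table II would then be the mean absolute difference between `warped_idx` and the frame correspondence read off the warp returned by `dtw_max_probability(compute_soft_correspondence(traj_a, traj_b))`.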
In this case, the alignment errors are 3.03, 2.60, 0.93, and 0.51 frames for STE, UBD, CTW and the proposed technique, respectively. LU AND MANDAL: A ROBUST TECHNIQUE FOR MOTION-BASED VIDEO SEQUENCES TEMPORAL ALIGNMENT 77 Fig. 9. Visual comparison of the alignments computed using the STE, the UBD, the CTW and the proposed technique. The first row shows the 59th, 63rd, 67th, 71th frames of video UCF2. The second to the fifth rows shows the aligned frame obtained using the STE, the UBD, the CTW and the proposed technique, respectively. B. Real Data Evaluation In this section, we evaluate the proposed technique on ten pairs of real video data. The video data can be divided into two categories: videos captured under the same scene and videos captured under different scenes. Note that the ground truth is not available for all pairs of videos. In order to evaluate the performance of the proposed technique and compare it with other techniques, we manually chose several correspondences of key frames as the ground truth. The key frame correspondences are selected based on the distinct motion connection points. For example, in the action of “open the drawer”, the key frames are the frames where the hand is put on the handle of the drawer, or the drawer is completely closed. In the action of “cup lifting”, the key frames are the frames where the cup is lifted on the top, or the cup is put on another cabinet. In the action of “ball playing”, the key frames are the frames where the ball is on the left/right hand, or the ball is on the highest point. In order to reduce the bias, we select the ground truth carefully by three people and take the average of the three as the final ground truth. The average alignment error is then calculated as the mean absolute difference between the computed correspondence of frames and the ground truth. 1) Videos Captured Under the Same Scene: For evaluation with real video sequences which were captured in the same scene, we first used video sequences provided by Rao et al. [22] (we refer these videos as UCF video). Feature (location of hand) trajectories are available for the UCF video files [22]. Note that in this case, there is overlapped background which enables the previous techniques, i.e., the STE, and the UBD techniques, to estimate the spatial relationship by using sufficient number of corresponding points. To reduce the error induced by the automatic corresponding points selection, the corresponding points are selected manually. Note that the proposed technique and the CTW technique do not need to use the corresponding points from the overlapped background. The proposed technique, STE, UBD and the CTW techniques were applied to this pair of real videos. 78 The two UCF test videos cab2-4.1.mpg (UCF1) and cab2-4.4.mpg (UCF2) recorded the action of opening a drawer. These videos are 84 and 174 frames long, respectively. The trajectories of these two actions are shown in Fig.8 (a) and (b). Fig. 8(c) shows the frame alignment obtained using the STE, UBD, CTW and the proposed techniques, as well as the ground truth. The horizontal axis and vertical axis represent the indices of the frames of videos UCF1 and UCF2, respectively. It is observed that the proposed technique provides the best alignment results (i.e., closest to the ground truth) compared to other techniques. Note that the two trajectories in the videos have similar patterns and did not follow the affine transformation. 
Even if we have modeled the spatial relationship under the affine transformation (see (3)), with the help of the soft corresponding and the time constraint used in DTW, the proposed technique can still provide robust and satisfactory performance. For the other existing techniques, since hard corresponding is used for the spatial relationship estimation and the Euclidean distance is used in the DTW, it is difficult to handle the noise and spatial trajectory distortions effectively which leads to higher alignment errors. We now present subjective evaluation of the alignment result. Fig. 9 shows four representative frames (frame# 59, 63, 67 and 71) of video UCF2 in the first row. The correspondence frames computed for these four frames using the STE, UBD, CTW and the proposed techniques are shown in the second, third, fourth and the fifth rows of Fig. 9, respectively. Because the object of interest is the hand in the video, we can examine the relative position of the hand in the entire trajectory. If the computed temporal aligned frame is matched, i.e., the relative hand position in the computed temporal aligned frame is the same as that in the reference frame in another video, we mark “matched” under the frame, otherwise “mismatched”. Among the existing techniques, the results obtained by CTW technique lead to the singularity, i.e., many frames in one video matched to one frame in another video. Other existing techniques can provide better result, but still have mismatched results. It is shown that the proposed technique provides superior performance compared to other techniques in this real video case. Besides the UCF videos, we have used three additional pairs of videos capturing the coffee cup lifting motion under the same scene, named Cuplifting1-3. The alignment comparisons are shown in Fig. 10. It is observed that the proposed technique provides better alignment performance than other techniques. The average alignment errors with respect to the ground truth of the STE, UBD, CTW and the proposed techniques are summarized in Table III. Note that the proposed technique has outperformed other existing techniques for the test videos. 2) Videos Captured Under Different Scenes: In this section we present the performance of the proposed technique on videos with significantly different scenes. Note that the videos were captured from different views, and the similar actions were performed by different people in different scenes. In these cases, the STE and UBD techniques cannot be applied as there are no overlapping backgrounds between the two videos. Therefore, we compare the performance of the proposed technique with the CTW technique. We first evaluate the CTW and the proposed technique on the videos where the coffee cup lifting action is captured. Four IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 15, NO. 1, JANUARY 2013 Fig. 10. Performances comparisons on Cuplifting videos. (a) Cuplifting1; (b) Cuplifting2; (c) Cuplifting3. The horizontal axis and vertical axis represent the indices of the frames of the first and second video, respectively. TABLE III ALIGNMENT ERROR COMPUTED WITH RESPECT TO THE GROUND TRUTH ON VIDEO PAIRS CAPTURED UNDER THE SAME SCENE pairs of videos with four different people at two different scenes are evaluated. These pairs of videos are named DCupLifting1-4 in this paper. The performance obtained by the CTW and the LU AND MANDAL: A ROBUST TECHNIQUE FOR MOTION-BASED VIDEO SEQUENCES TEMPORAL ALIGNMENT 79 Fig. 11. 
The alignment results of real video pair (DCupLifting1) obtained by the proposed technique. (a) shows the alignment obtained by the proposed technique. (b) presents the visual evaluation of the alignments. The first row shows the 43rd, 107th, 189th, and 218th frame of the first video of DCupLifting1. The second row and the third row show the computed corresponding frames using the CTW, and the proposed technique, respectively. TABLE IV ALIGNMENT ERRORS COMPUTED WITH RESPECT TO THE GROUND TRUTH ON VIDEOS CAPTURED UNDER DIFFERENT SCENE proposed technique is shown in Table IV. We take the DCupLifting1 for the visual evaluation, which is shown in Fig. 11. In this example (DCupLifting1), the first video contains 221 frames while the second video contains 207 frames. Note that the similar action, i.e., lifting the coffee cup, is performed by two different people under different scenes with different motion speed. The alignments computed using the CTW and the proposed technique are shown in Fig. 11(a). The ground truths of the key frame correspondences are indicated as the circle bars. In Fig. 11(b), the first row shows the 43rd, 107th, 189th, and 218th frame of the first video of DCupLifting1. The second and the third rows show the corresponding frames computed using the CTW and the proposed technique, respectively. The description, e.g., matched, below the frames indicates if the two computed corresponding frames are matched or not. It is clear that the proposed technique provides a better performance. 80 IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 15, NO. 1, JANUARY 2013 Fig. 12. The alignment results of real video pair (TaChiQuan) obtained by the proposed technique. (a) shows the alignment obtained by the proposed technique. (b) The first row shows the 102th, 160th, 249th, and 335th frame of the first video of DCupLifting1. The second row and the third row show the computed corresponding frames using the CTW, and the proposed technique, respectively. The second evaluation is on three pairs of video with ball throwing motion (using videos BallPlaying1-3). The performance obtained by the proposed technique is shown in Table IV. The third evaluation is on one pair of video TaiChiQuan capturing the complex motion of the TaiChiQuan sport. The first video contains 414 frames while the second video contains 247 frames. The left hand of a player is being tracked throughout the two videos. Fig. 12 shows the alignment results as well as the representative frames correspondences computed by the CTW and the proposed technique. It is observed in Fig. 12(b) that the representative frame correspondences obtained using the proposed technique are more accurate compared to those obtained using the CTW technique. Table IV summarizes the performance of all test videos. It is clear that the proposed technique provides better performance compared to the CTW technique in terms of the key frames correspondences computation. It can be inferred that the proposed technique is able to provide robust synchronization of the videos captured under different scenes where similar approximated planar motions are present. C. Execution Time The experiments were run on a 2.4 GHz Intel Core II Duo CPU with 3GB RAM computer and the techniques were implemented using MATLAB 7.04. Excluding the preprocessing time (i.e., the object of interest tracking time), the execution time of the STE, UBD, CTW and proposed technique for synthetic trajectories are briefly summarized in Table V. 
TABLE V. Execution time (in seconds) of the STE, UBD, CTW and proposed techniques for the entire sequence.

When the lengths of the two trajectories are 100 and 160 frames, the execution time of the proposed technique is lower than that of the STE technique, comparable to that of the UBD technique, and higher than that of the CTW technique. Note that the execution time is expected to be proportional to the length of the videos being processed.

D. Discussion

For the STE and the UBD techniques, after the mutual projections of the trajectories onto the different views, the alignment relies on the Euclidean distance between the two trajectories. The imperfection or jagging effect of the object tracking introduces noise into the trajectories, and a spatial relationship estimated from imperfect correspondence points in the overlapped background leads to incorrect synchronization results. As for the CTW technique, it incorporates canonical correlation analysis into the DTW estimation and recovers the spatial relationship by applying a set of linear transformations and selecting common features between the two trajectories. This technique provides good performance when the noise and local distortions are small. If there is considerable noise and distortion in the extracted feature trajectories, the performance degrades, since the computation of the DTW is based on the projected Euclidean distances between the two trajectories.

The novelty of the proposed technique is to establish a 'soft' correspondence between the two trajectories, constrained by the spatial relationship, using a GMM, and then to search for the optimal synchronization result based on this 'soft' correspondence. The proposed technique is robust to fluctuations of the motion trajectory with the help of the 'soft' correspondence. The introduction of the 'soft' correspondences greatly alleviates the influence of noise and of imperfections in the spatial relationship estimation; in other words, we do not need to seek a highly accurate spatial relationship estimate, as a rough estimate of the spatial relationship is good enough. The EM algorithm iteratively seeks the maximum of the likelihood in (6) while estimating the posterior in (8) (i.e., the alignment correspondences) and the spatial transformation. It has been observed that if the estimated spatial relationship is perfect, the two trajectories from the two videos align exactly. In practice, even with a roughly estimated spatial transformation, we can bring the two trajectories close to each other, which helps us compute the 'soft' correspondence with the GMM. This is primarily the reason why the proposed technique leads to a more accurate correspondence calculation.

V. CONCLUSION

In this paper, we have proposed a novel technique for motion-based video synchronization. The proposed technique is able to synchronize videos containing complex motions with dynamic time shift in an efficient way. Comparative analysis with the existing techniques, i.e., the STE, UBD and CTW techniques, demonstrated that an improvement of 6% to 36% in video temporal alignment can be achieved using the proposed technique. It has also been observed that the proposed technique can synchronize videos with significantly different scenes and is robust to noise and underlying local deformations.
Although the number of videos to be synchronized is assumed to be two in this paper, the proposed technique can easily be extended to applications where the number of videos is greater than two by setting one video as the reference and compute the temporal alignments of other videos with respect to the reference. REFERENCES [1] C. M. Bishop, Neural Networks for Pattern Recognition. Oxford, U.K.: Oxford Univ. Press, 1996. [2] P. J. Besl and N. D. McKay, “A method for registration of 3-D shapes,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 14, pp. 239–256, 1992. [3] D. Comaniciu et al., “Real-time tracking of non-rigid objects using mean shift,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2000, vol. Ii, pp. 142–149. [4] Y. Caspi and M. Irani, “Spatio-temporal alignment of sequences,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 11, pp. 1409–1424, Nov. 2002. [5] Y. Caspi, D. Simakov, and M. Irani, “Feature based sequence to sequence matching,” Int. J. Comput. Vision, vol. 68, no. 1, pp. 53–64, 2006. [6] S. Gold, C. P. Lu, A. Rangarajan, S. Pappu, and E. Mjolsness, “New algorithms for 2D and 3D point matching: Pose estimation and correspondence,” in Proc. Advances in Neural Information Processing Systems, 1994, vol. 7, pp. 957–964. [7] R. I. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision, 2nd ed. London, U.K.: Cambridge Univ. Press, 2004. [8] C. Lu and M. Mandal, “Efficient temporal alignment of video sequences using unbiased bidirectional dynamic time warping,” J. Electron. Imag., vol. 19, no. 4, pp. 0501–0504, Aug. 2010. [9] L. Lee, R. Romano, and G. Stein, “Monitoring activities from multiple video streams: Establishing a common coordinate frame,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 8, pp. 758–767, Aug. 2000. [10] A. Myronenko and X. B. Song, “Point Set Registration: Coherent Point Drift,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 12, pp. 2262–2275, Dec. 2010. [11] F. L. C. Padua et al., “Linear sequence-to-sequence alignment,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 2, pp. 304–320, Feb. 2010. [12] C. Rao, A. Yilmaz, and M. Shah, “View-invariant representation and recognition of actions,” Int. J. Comput. Vision, vol. 50, no. 2, pp. 203–226, Nov. 2002. [13] C. Rao, A. Gritai, M. Shah, and T. F. S. Mahmood, “View-invariant alignment and matching of video sequences,” in Proc. ICCV03, 2003, pp. 939–945. [14] M. Singh, A. Basu, and M. Mandal, “Event dynamics based temporal registration,” IEEE Trans. Multimedia, vol. 9, no. 5, pp. 1004–1015, Aug. 2007. [15] M. Singh et al., D. Forsyth, Ed. et al., “Optimization of symmetric transfer error for sub-frame video synchronization,” in Proc. Computer Vision - ECCV 2008, Pt Ii, 2008, vol. 5303, pp. 554–567. [16] H. Sakoe and S. Chiba, “Dynamic programming algorithm optimization for spoken word recognition,” IEEE Trans. Acoust., Speech, Signal Process., vol. 26, no. 1, pp. 43–49, 1978. [17] C. X. Dai, Y. F. , and X. Zheng, “Subframe video synchronization via 3D phase correlation,” in Proc. Int. Conf. Image Processing, 2006, vol. 1, pp. 501–504. [18] M. A. Giese and T. Poggio, “Morphable models for the analysis and synthesis of complex motion patterns,” Int. J. Comput. Vision, vol. 38, no. 1, pp. 59–73, Jun. 2000. [19] D. Perperidis, R. H. Mohiaddin, and D. Rueckert, “Spatio-temporal free-form registration of cardiac MR image sequences,” Med. Image Comput. Comput.-Assist. Intervent., vol. 3216, pp. 441–456, 2004. [20] P. Tresadern and I. 
Reid, “Synchronizing image sequences of non-rigid objects,” in Proc. 14th British Machine Vision Conf., Norwich, U.K., Sep. 9-11, 2003, vol. 2, pp. 629–638. 82 IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 15, NO. 1, JANUARY 2013 [21] T. Tuytelaars and L. J. VanGool, “Synchronizing video sequences,” in Proc. IEEE Computer Society Conf. Computer Vision and Pattern Recognition, Jul. 2004, vol. 1, pp. 762–768. [22] C. Rao, View-Invariant Representation for Human Activity. [Online]. Available: http://server.cs.ucf.edu/vision/projects/ViewInvariance/ViewInvariance.html. [23] C. Lu, Experimental Results for the Robust Video Alignment Technique. [Online]. Available: http://www.ece.ualberta.ca/lcheng4/ VideoAlignment/VA.htm. [24] F. Zhou, F. Torre, and J. K. Hodgins, “Aligned cluster analysis for temporal segmentation of human motion,” in Proc. 8th IEEE Int. Conf. Automatic Face & Gesture Recognition, 2008, pp. 1–7. [25] R. Hess and A. Fern, “Improved video registration using non-distinctive local image features,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2007, pp. 1–8. [26] R. Li and R. Chellappa, “Aligning spatio-temporal signals on a special manifold,” in Proc. Eur. Conf. Computer Vision, 2010, pp. 547–560. [27] F. Zhou and F. Torre, “Canonical time warping for alignment of human behavior,” in Advances in Neural Information Processing Systems, 2009. [28] P. Shrestha, M. Barbieri, H. Weda, and D. Sekulovski, “Synchronization of multiple camera videos using audio-visual features,” IEEE Trans. Multimedia, pp. 79–92, 2010. [29] D. Chudova, S. Gaffney, and P. Smyth, “Probabilistic models for joint clustering and time-warping of multidimensional curves,” in Proc. 19th Conf. Uncertainty in Artificial Intelligence, 2003. [30] C. Lu, M. Singh, I. Cheng, A. Basu, and M. Mandal, “Efficient video sequences alignment using unbiased bidirectional dynamic time warping,” J. Vision Commun. Image Represent., vol. 22, no. 7, pp. 606–614, Oct. 2011. Cheng Lu (M’11) received the B.Sc. and M.Sc. degrees in computer engineering in China, in 2006 and 2008. He is currently working toward the Ph.D. degree in electrical engineering in the University of Alberta. He is the recipient of the Chinese Scholarship Council for his Ph.D. studies. He is the recipient of the Graduate Student Interdisciplinary Research Award 2012, University of Alberta. His research interest includes computer vision, pattern recognition, super resolution image, and medical imaging. He is an author or coauthor of more than ten papers in leading international journals and conferences. Mrinal Mandal (M’99–SM’03) is a Full Professor and Associate Chair in the Department of Electrical and Computer Engineering and is the Director of the Multimedia Computing and Communications Laboratory at the University of Alberta, Edmonton, AB, Canada. He has authored the book Multimedia Signals and Systems (Kluwer Academic), and co-authored the book Continuous and Discrete Time Signals and Systems (Cambridge University Press). His current research interests include Multimedia, Image and Video Processing, Multimedia Communications, and Medical Image Analysis. He has published over 140 papers in refereed journals and conferences, and has a US patent on lifting wavelet transform architecture. He has been the Principal Investigator of projects funded by Canadian Networks of Centers of Excellence such as CITR and MICRONET, and is currently the Principal Investigator of a project funded by the NSERC. 
He was a recipient of Canadian Commonwealth Fellowship from 1993 to 1998, and Humboldt Research Fellowship from 2005-2006 at the Technical University of Berlin.