J. Vis. Commun. Image R. 22 (2011) 606–614

Efficient video sequences alignment using unbiased bidirectional dynamic time warping

Cheng Lu a, Meghna Singh a, Irene Cheng b, Anup Basu b, Mrinal Mandal a,*

a Department of Electrical and Computer Engineering, University of Alberta, Edmonton, Alberta, Canada T6G 2V4
b Department of Computing Science, University of Alberta, Edmonton, Alberta, Canada T6G 2V4

Article history: Received 5 January 2011; Accepted 16 June 2011; Available online 30 June 2011.

Keywords: Video alignment; Video synchronization; Temporal registration; Symmetric alignment; Temporal offset; Dynamic time warping; Symmetric transfer error; MRI

Abstract: In this paper, we propose an efficient technique to synchronize video sequences of events that are acquired via uncalibrated cameras at unknown and dynamically varying temporal offsets. Unlike existing techniques that take only unidirectional alignment into consideration, the proposed technique considers symmetric alignments and computes the optimal alignment. We also establish video alignment with sub-frame accuracy. The advantages of our approach are validated by tests conducted on several pairs of real and synthetic sequences. We present qualitative and quantitative comparisons with other existing techniques. A unique application of this work in generating high-resolution 4D MRI data from multiple low-resolution MRI scans is described.

© 2011 Elsevier Inc. All rights reserved. doi:10.1016/j.jvcir.2011.06.003

* Corresponding author. Address: Dept. of Electrical and Computer Engineering, 2nd Floor, ECERF Building, University of Alberta, Edmonton, Alberta, Canada T6G 2V4. Fax: +1 (780) 492 1811. E-mail addresses: lcheng4@ualberta.ca (C. Lu), meghna@ece.ualberta.ca (M. Singh), locheng@ualberta.ca (I. Cheng), basu@ualberta.ca (A. Basu), mandal@ece.ualberta.ca (M. Mandal).

1. Introduction

Alignment of video sequences plays a crucial role in applications such as super-resolution imaging [1], 3D visualization [2], robust multi-view surveillance [3] and mosaicing [4]. Most video alignment techniques deal with video sequences of the same scene and hence assume that the temporal offset between the video sequences does not change over time, i.e., a simple temporal translation is assumed. However, for applications such as video search, video comparison and enhanced video generation, we need to align video sequences from two different (but similar) scenes. In such scenarios, the temporal offset between the video sequences is dynamically changing and cannot be estimated by a translational offset.

Consider the problem of generating a high-resolution visualization of MRI (magnetic resonance imaging) data from multiple low-resolution videos. Due to the tradeoff between acquisition speed and image resolution in MRI, we can only acquire a few frames of functional events such as swallowing (4–7 fps). We can, however, acquire multiple swallowing videos that differ from each other spatially and temporally. The objective then is to compute a robust spatial and temporal alignment between the MRI videos in order to fuse them together for better visualization. Since the video sequences are of repeated swallows, the temporal offset between the video sequences does not remain constant and cannot be estimated by a translational offset.
In fact, the temporal offset varies dynamically with time and depends on the stage of the swallowing process.

Several techniques have been proposed to synchronize video sequences of the same or similar scenes. Giese and Poggio [5] proposed a dynamic time warping based technique that aligns motions performed by different people by warping between their feature trajectories. Perperidis et al. [6] proposed a spline-based technique to locally warp cardiac MRI sequences for accurate alignment. Rao et al. [7] proposed to use a rank constraint as the distance measure in a dynamic time warping technique to align multiple sequences (we refer to this technique as the RCB technique in this paper). Their technique is limited to integer-frame alignment of video sequences; however, with low acquisition rates, as in our case, sub-frame alignment is crucial. Singh et al. [8] proposed a video alignment technique based on minimization of the symmetric transfer error (STE) between two commutative alignments. The STE technique provides good performance with sub-frame accuracy, but the accuracy can be improved further, since the technique does not completely eliminate the bias between the two sequences.

The main contribution of this work lies in formulating the alignment problem as unbiased bidirectional dynamic time warping (UBDTW), in which the optimal alignment is computed in the combined symmetric accumulated distance matrix. UBDTW eliminates any bias that is introduced by choosing one sequence as the reference sequence, to which target sequences are warped. In addition, the proposed technique allows us to compute sub-frame accurate and view-invariant alignment of video sequences that have a dynamically varying temporal offset between them, and also helps us reduce the occurrence of singularities¹ in the alignment.

¹ Singularities [7] occur when multiple frames in the target sequence map to the same frame in the reference sequence. Such a situation can occur when the temporal speed of an event in one sequence is significantly slower than in the other sequence. However, in most cases this slowing down of the event should lead to a sub-frame mapping and not singularities.

The rest of this paper is organized as follows. Section 2 presents related work in video alignment and justifies the necessity for this work. In Section 3, we propose the UBDTW technique. The experimental setup and comparative results are presented in Section 4. The application of our work to 4D MRI visualization is presented in Section 5, which is followed by a summary and conclusions in Section 6.

2. Review of previous work

Past literature in video alignment and temporal registration can be broadly classified into two categories, video sequences of the same scene and video sequences of similar scenes, differing primarily in the assumptions made with respect to the temporal offset between sequences. It can be seen from Table 1 that our work addresses all three scenarios (view-invariance, dynamic time shifts and sub-frame accuracy), while previous works have addressed only a subset of these scenarios.

2.1. Video alignment of same scene

In synchronizing videos of the same scene, the temporal offset is considered to be an affine transform [1], such as $t' = s\,t + \Delta t$, where $s$ is the ratio of frame rates and $\Delta t$ is a fixed translational offset. Dai et al. [9] use 3D phase correlation between video sequences, whereas Tuytelaars and VanGool [10] compute alignment by checking the rigidity of a set of five (or more) points. Tresadern and Reid [11] also follow a similar approach of computing a rank-constraint based rigidity measure between four non-rigidly moving feature points. Caspi and Irani [1] recover the spatial and temporal relation between two sequences by minimizing the SSD error over extracted trajectories that are visible in both sequences.
Carceroni et al. [12] extend [1] to align sequences based on scene points that need to be visible only in two consecutive frames. Singh et al. [2] also build on [1] to develop parameterized event models of discrete trajectories and align the event models based on an SSD measure for sub-frame alignment. However, they do not address the problem of view-invariance and dynamic temporal offsets.

Table 1. Review of related work. Legend: D.S. – dynamic time shift, V.I. – view invariant, S.A. – sub-frame accuracy.

Author                        Scene       D.S.  V.I.  S.A.
Caspi and Irani [1]           Same        No    Yes   Yes
Tresadern and Reid [11]       Same        No    Yes   Yes
Lei and Yang [13]             Same        No    Yes   Yes
Wolf and Zomet [14]           Same        No    Yes   Yes
Dai et al. [9]                Same        No    Yes   Yes
Singh et al. [2]              Same        No    No    Yes
Carceroni et al. [12]         Same        No    Yes   Yes
Pooley et al. [15]            Same        No    Yes   Yes
Tuytelaars and VanGool [10]   Same        No    Yes   No
Giese and Poggio [5]          Different   Yes   No    No
Perperidis et al. [6]         Different   Yes   No    No
Rao et al. [7]                Different   Yes   Yes   No
Singh et al. [8]              Different   Yes   Yes   Yes
Proposed                      Different   Yes   Yes   Yes

2.2. Video alignment of different scenes

When aligning video sequences of different scenes, albeit sequences correlated via motion, one has to factor in the dynamic temporal scale of the activities in the video sequences. Giese and Poggio [5] approach the alignment of activities of different people by computing a dynamic time warp between the feature trajectories. They do not address the case where the activity sequences are captured from varying viewpoints, and their approach yields a one-to-one frame correspondence. Perperidis et al. [6] attempt to locally warp cardiac MRI sequences by extending Caspi's work [1] to incorporate spline-based local alignment. Though their approach does lead to good alignment of time-varying sequences, it has two main drawbacks: (1) the computation space for spline-based registration is quite large, and the authors need to compute points of inflexion in the cardiac volume change; and (2) the alignment is still a one-to-one frame correspondence and not sub-frame accurate. The RCB technique proposed by Rao et al. [7] will not work for planar motion activity, since the fundamental matrix is singular in this case.

Singh et al. [8] proposed an STE-based video alignment technique. In a two-video framework, the STE is calculated in two steps. In the first step, one video is considered as the reference, the second video is aligned to it using a DTW technique, and a warping function is obtained. In the second step, the second video is considered as the reference, the first video is aligned to it using the DTW technique, and a second warping function is obtained. The STE is calculated based on the difference between the two warping functions; two videos that are temporally well aligned are expected to have a small STE. The STE technique provides better performance than the RCB technique. However, the STE technique selects one of the two symmetric warps as the final alignment, which may not be accurate. In addition, due to its iterative STE calculation, the computational complexity of this technique is high compared to existing techniques.

3. Proposed technique

Suppose that two cameras C1 and C2 view two independent scenes of similar activities, as shown in Fig. 1. Camera 1 views 3D scene X(X1, Y1, Z1, t1) in View 1 and acquires video I1(x1, y1, t1). Camera 2 views another 3D scene X(X2, Y2, Z2, t2) in View 2 and acquires video I2(x2, y2, t2). Note that the motions
in these two scenes are similar but have a dynamic time shift. In this paper we assume that the scene is planar and that the homography matrix H can be used to represent the spatial relationship between the two views. These assumptions are not limiting, since the proposed technique can be adapted to account for epipolar geometry for non-planar scenes, and robust correspondence techniques can be used for wide-baseline cameras [16]. Additionally, we assume that a single feature trajectory of interest is available to us², such that the beginning and end points of the activity are marked in the trajectory, similar to the assumption made in [7,8]. The schematic of the proposed alignment technique is shown in Fig. 2. The technique consists of five modules, which are explained in the following sections.

² In typical videos, multiple object trajectories will be generated, and an additional task of the synchronization technique will be to find corresponding feature trajectories in the multiple video sequences. This is an open problem in vision research and one that we do not address in this paper.

Fig. 1. Illustration of two different scenes acquired using two distinct cameras. The projections of the scenes onto the reciprocal cameras are also shown.

3.1. Feature trajectories extraction

To correlate two video sequences and represent the motion between them, features are extracted and tracked from both sequences. A robust view-invariant tracker is used to generate feature trajectories F1(x1, y1, t1) and F2(x2, y2, t2) from videos I1(x1, y1, t1) and I2(x2, y2, t2), respectively. The feature trajectories are illustrated in Fig. 1. On their own, these feature trajectories are discrete representations of the event in the scene. However, to align video sequences to sub-frame accuracy, we need to interpolate between the discrete representations. Most techniques use linear interpolation between the discrete points; we instead choose the technique proposed by Singh et al. [2] to generate continuous models from the discrete points.

3.2. Event models calculation

We derive event models from the discrete features as proposed by Singh et al. [2]. An event model is a global representation of the activity based on all the extracted feature points, and can be tailored to address varying sub-frame alignment needs. Using event models instead of discrete feature points allows the proposed technique to adapt to increasing complexity of event dynamics, which is demonstrated in our test cases. Given a discrete trajectory $F(x, y, t_n)$, we compute linear model parameters $\beta$ such that a continuous estimate $\bar{F}(x, y, t)$ can be derived in the least-squares sense, as shown in Eqs. (1) and (2), where the hat symbolizes estimation and $e$ represents the error term:

$\bar{F} = F\beta + e$   (1)

$\hat{\bar{F}} = F\hat{\beta}$   (2)

In a weighted least-squares sense, with a weight matrix $W$, the best linear estimate of $\beta$ is given by:

$\hat{\beta} = (F^{T} W F)^{-1} F^{T} W \bar{F}$   (3)

The weights are computed iteratively, and the effect of outliers on the event model estimation is minimized; for details see [2]. Using event models has two additional advantages. Firstly, trajectory information for non-integer frames is no longer based on linearly interpolated values from adjacent frames, and hence videos with low acquisition rates can also be accurately synchronized to non-integer values, as shown in [2]. Secondly, computation of the event models inherently smoothes the trajectory and removes outliers due to noise or poor tracking. The continuous feature trajectories are represented as $\bar{F}_1$ and $\bar{F}_2$, and their resolutions are higher than those of the discrete trajectories F1 and F2, respectively.
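To make the event-model fit concrete, the following is a minimal sketch of the iteratively reweighted least-squares estimate of Eq. (3). The polynomial design matrix, the Huber-style reweighting, and all function names are our illustrative assumptions, not the authors' exact formulation (see [2] for that).

```python
import numpy as np

def fit_event_model(t, xy, degree=5, n_iter=5, delta=1.0):
    """Weighted least-squares event model (Eq. (3)):
    beta_hat = (F^T W F)^{-1} F^T W y, with weights re-estimated
    iteratively to damp outliers (Huber-style, for illustration).
    For long sequences, normalize t (e.g., to [0, 1]) first so the
    polynomial basis stays well conditioned."""
    F = np.vander(t, degree + 1)                  # polynomial design matrix
    beta = np.linalg.lstsq(F, xy, rcond=None)[0]  # ordinary LS start
    for _ in range(n_iter):
        r = np.linalg.norm(xy - F @ beta, axis=1)          # per-frame residuals
        w = np.where(r <= delta, 1.0, delta / np.maximum(r, 1e-12))
        # diag(w) applied via broadcasting: (F^T W F) beta = F^T W y
        beta = np.linalg.solve(F.T @ (w[:, None] * F), F.T @ (w[:, None] * xy))
    return beta

def eval_event_model(beta, t_query, degree=5):
    """Evaluate the continuous trajectory at (possibly sub-frame) times,
    e.g. eval_event_model(beta, [12.5]) for frame 12.5."""
    t_query = np.atleast_1d(np.asarray(t_query, dtype=float))
    return np.vander(t_query, degree + 1) @ beta
```

Evaluating the fitted model at non-integer times is what later allows the warp to report sub-frame correspondences.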
Fig. 2. Schematic of the proposed technique: misaligned video sequences pass through Feature Trajectories Extraction, Event Models Calculation, Bidirectional Projections Calculation, Computation of Symmetric Alignments, and Computation of Optimal Alignment to produce aligned video sequences.

3.3. Bidirectional projections calculation

In our problem formulation, Cameras 1 and 2 view two independent scenes, as shown in Fig. 1. However, if Camera 2 had witnessed the first scene X(x1, y1, z1, t1), it would have recorded the sequence $I_2(x'_1, y'_1, t_1)$. We call $I_2(x'_1, y'_1, t_1)$ the projection of $I_1(x_1, y_1, t_1)$ (Eq. (4)). The spatial relationship from Scene 1 to Scene 2, $H_{1\to 2}$, and vice versa, $H_{2\to 1}$, is independent of the scene structure and can be computed from $X(x_1, y_1, z_1, t_1) \cap X(x_2, y_2, z_2, t_2)$. (The homography H is used in this paper; the fundamental matrix can also be used in our technique to deal with more general cases.) We use the Direct Linear Transform (DLT) technique to initialize the non-linear least-squares (Levenberg–Marquardt) computation of the homography [17]. Similarly, the projection of the second scene in Camera 1 can be expressed as $I_1(x'_2, y'_2, t_2)$ (Eq. (5)).

$I_2(x'_1, y'_1, t_1) = H_{1\to 2}\, I_1(x_1, y_1, t_1)$   (4)

$I_1(x'_2, y'_2, t_2) = H_{2\to 1}\, I_2(x_2, y_2, t_2)$   (5)

To generalize the alignment method to multi-modal alignment and to allow for sub-frame alignment, we compute the projections of the event models $\bar{F}$ derived from the two scenes, instead of the whole frames I1 and I2, as follows:

$\bar{F}^{p}_2(x'_1, y'_1, t_1) = H_{1\to 2}\, \bar{F}_1(x_1, y_1, t_1)$   (6)

$\bar{F}^{p}_1(x'_2, y'_2, t_2) = H_{2\to 1}\, \bar{F}_2(x_2, y_2, t_2)$   (7)
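As a rough illustration of the projection step in Eqs. (6) and (7), the sketch below maps sampled trajectory points through an already-estimated 3×3 homography in homogeneous coordinates. Estimating H itself (DLT initialization refined by Levenberg–Marquardt, as in [17]) is assumed done elsewhere; the function name is ours.

```python
import numpy as np

def project_trajectory(H, traj_xy):
    """Map each (x, y) trajectory sample through a 3x3 homography H in
    homogeneous coordinates and de-homogenize (the operation behind
    Eqs. (6)-(7))."""
    pts_h = np.hstack([traj_xy, np.ones((traj_xy.shape[0], 1))])  # (x, y, 1)
    proj = pts_h @ H.T                  # row-wise application of H
    return proj[:, :2] / proj[:, 2:3]   # divide by the homogeneous scale

# Example (names hypothetical): F2_in_view1 = project_trajectory(H_2to1, F2_samples)
# would play the role of a projected event model in a common view.
```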
3.4. Computation of symmetric alignments

Once we obtain two pairs of feature trajectories, $(\bar{F}_1, \bar{F}^{p}_2)$ and $(\bar{F}^{p}_1, \bar{F}_2)$, which are in the same projection space, we compute the symmetric warps $W_{1,2p}$ and $W_{1p,2}$ for these two pairs of trajectories using regularized dynamic time warping (RDTW). We construct the warp as follows:

$W = w_1, w_2, w_3, \ldots, w_L, \quad \max(L_1, L_2) \le L < L_1 + L_2$

where $L_1$ and $L_2$ are the lengths of the trajectories $\bar{F}_1$ and $\bar{F}_2$, respectively. The kth element of the warp $W$ is $w_k = (i, j)$, where $i$ and $j$ are the time indices of $\bar{F}_1$ and $\bar{F}_2$, respectively. The warp satisfies the boundary, continuity and monotonicity conditions explained in [18]. Note that there are many possible warps between $\bar{F}_1$ and $\bar{F}_2$ which can be computed. The optimal warp is the minimum-distance warp, where the distance of a warp is defined as follows:

$\mathrm{Dist}(W) = \sum_{k=1}^{L} \mathrm{Dist}[\bar{F}(i_k), \bar{F}^{p}(j_k)]$   (8)

where $\mathrm{Dist}[\bar{F}(i_k), \bar{F}^{p}(j_k)]$ is the distance between the two trajectory values at the time indices $(i, j)$ of the kth element of the warp. A regularized distance metric function is proposed here to compute the distance of the warp:

$\mathrm{Dist}[\bar{F}(i), \bar{F}^{p}(j)] = \|\bar{F}(i) - \bar{F}^{p}(j)\|^2 + w\,\mathrm{Reg}$   (9)

$\mathrm{Reg} = \|\partial\bar{F}(i) - \partial\bar{F}^{p}(j)\|^2 + \|\partial^2\bar{F}(i) - \partial^2\bar{F}^{p}(j)\|^2$   (10)

where $w$ is the weight (normally, $w = 50$), and $\partial\bar{F}$ and $\partial^2\bar{F}$ are the first and second derivatives of $\bar{F}$, computed numerically as follows:

$\partial\bar{F}(i) = \dfrac{[\bar{F}(i) - \bar{F}(i-1)] + [\bar{F}(i+1) - \bar{F}(i-1)]/2}{2}$

$\partial^2\bar{F}(i) = \dfrac{[\partial\bar{F}(i) - \partial\bar{F}(i-1)] + [\partial\bar{F}(i+1) - \partial\bar{F}(i-1)]/2}{2}$

The purpose of the regularization term is twofold: (i) it allows us to factor in a smoothness constraint on the warping, and (ii) it reduces the occurrence of singularities in the alignment.

In order to find the optimal warp, an $L \times L_p$ accumulated distance matrix is created, where $L$ and $L_p$ are the lengths of $\bar{F}$ and $\bar{F}^{p}$. The value of each element of the accumulated distance matrix is:

$D(i, j) = \mathrm{Dist}[\bar{F}(i), \bar{F}^{p}(j)] + \min(\phi)$   (11)

where $\phi = [D(i-1, j),\ D(i-1, j-1),\ D(i, j-1)]$. A greedy search technique is then employed in the accumulated distance matrix to find the optimal warp $W$ such that $\mathrm{Dist}(W)$ is minimum. Using the RDTW described above, we are able to obtain the symmetric warps (alignments) $W_{1,2p}$ and $W_{1p,2}$ for $(\bar{F}_1, \bar{F}^{p}_2)$ and $(\bar{F}^{p}_1, \bar{F}_2)$, respectively. However, these two warps do have bias: $W_{1,2p}$ is biased toward $\bar{F}_1$, whereas $W_{1p,2}$ is biased toward $\bar{F}_2$.

3.5. Computation of optimal alignment

Note that we calculated the symmetric warps $W_{1,2p}$, $W_{1p,2}$ and the corresponding distance matrices $D_{1,2p}$ and $D_{1p,2}$ in the last step. However, the warps do have bias. In order to eliminate the bias, we first combine the distance matrices $D_{1,2p}$ and $D_{1p,2}$ into a new distance matrix, named the combined symmetric accumulated distance matrix $D_c$, as follows:

$D_c(i, j) = D_{1,2p}(i, j) + D_{1p,2}(i, j)$   (12)

Once $D_c$ is obtained, a global constraint based on $W_{1,2p}$ and $W_{1p,2}$ is added onto this matrix. Denoting by $W_c$ the warps under the global constraint, the constraint is formulated as follows:

$\min(W_{1,2p}, W_{1p,2}) \le W_c \le \max(W_{1,2p}, W_{1p,2})$

$0 < L_c < \min(L_{1,2p}, L_{1p,2})$

where $L_c$, $L_{1,2p}$ and $L_{1p,2}$ are the lengths of the warps $W_c$, $W_{1,2p}$ and $W_{1p,2}$, respectively. Finally, the warp $W_c$ that satisfies the following equation is chosen as the final optimal warp (alignment) for the video sequences I1 and I2:

$W_c^{opt} = \arg\min_{W_c} [\mathrm{Dist}(W_c)]$   (13)

Fig. 3 shows an intuitive explanation of the proposed optimal warp calculation. The grid represents the distance matrix $D_c$ computed by adding $D_{1,2p}$ and $D_{1p,2}$. The horizontal and vertical axes represent the indices of trajectories $\bar{F}_2$ and $\bar{F}_1$, respectively. The two bold lines represent the symmetric warps $W_{1,2p}$ and $W_{1p,2}$. The gray area represents the search area constrained by $W_{1,2p}$ and $W_{1p,2}$. The warp $W_c^{opt}$ inside the global constraint is considered the final unbiased warp.

Fig. 3. Illustration of the global constraint on the combined distance matrix $D_c$ and the optimal alignment $W_c^{opt}$ obtained within the constrained area.

A pseudocode of the proposed technique is given below:

Pseudocode of the proposed technique
1. Extract feature trajectories F1 and F2.
2. Compute event models $\bar{F}_1$ and $\bar{F}_2$.
3. Compute the spatial relationship (homography H is used here).
4. Project the event models to derive the projections $\bar{F}^{p}_1$ and $\bar{F}^{p}_2$.
5. Set $w = 50$.
6. Compute the regularized warp $W_{1,2p}$ from $(\bar{F}_1, \bar{F}^{p}_2)$.
7. Compute the regularized warp $W_{1p,2}$ from $(\bar{F}^{p}_1, \bar{F}_2)$.
8. Compute the combined symmetric accumulated distance matrix $D_c$.
9. Compute the optimal warp $W_c^{opt}$ under the global constraint.
10. Report $W_c^{opt}$ as the final alignment.
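The following sketch ties Sections 3.4 and 3.5 together under simplifying assumptions: it implements the regularized local cost of Eqs. (9) and (10) with the padded derivative estimates above, the accumulated matrix of Eq. (11), the combined matrix of Eq. (12), and a plain minimum-cost backtrack. The corridor constraint of Eq. (13) is omitted for brevity, and all names are ours rather than the authors'.

```python
import numpy as np

def derivatives(F):
    """First derivative per the paper's estimate, with clamped ends:
    dF(i) = ([F(i)-F(i-1)] + [F(i+1)-F(i-1)]/2) / 2."""
    Fp = np.pad(F, ((1, 1), (0, 0)), mode="edge")
    return ((Fp[1:-1] - Fp[:-2]) + (Fp[2:] - Fp[:-2]) / 2.0) / 2.0

def rdtw_matrix(A, B, w=50.0):
    """Accumulated distance matrix of regularized DTW (Eqs. (9)-(11))."""
    dA, dB = derivatives(A), derivatives(B)
    d2A, d2B = derivatives(dA), derivatives(dB)
    cost = (np.sum((A[:, None] - B[None, :]) ** 2, axis=2)          # Eq. (9)
            + w * (np.sum((dA[:, None] - dB[None, :]) ** 2, axis=2)  # Eq. (10)
                   + np.sum((d2A[:, None] - d2B[None, :]) ** 2, axis=2)))
    L1, L2 = len(A), len(B)
    D = np.full((L1, L2), np.inf)
    D[0, 0] = cost[0, 0]
    for i in range(L1):
        for j in range(L2):
            if i == 0 and j == 0:
                continue
            prev = min(D[i - 1, j] if i > 0 else np.inf,
                       D[i - 1, j - 1] if i > 0 and j > 0 else np.inf,
                       D[i, j - 1] if j > 0 else np.inf)
            D[i, j] = cost[i, j] + prev                              # Eq. (11)
    return D

def backtrack(D):
    """Minimum-cost monotone warp path through an accumulated matrix."""
    i, j = D.shape[0] - 1, D.shape[1] - 1
    path = [(i, j)]
    while i > 0 or j > 0:
        candidates = [(i - 1, j), (i - 1, j - 1), (i, j - 1)]
        i, j = min((c for c in candidates if c[0] >= 0 and c[1] >= 0),
                   key=lambda c: D[c])
        path.append((i, j))
    return path[::-1]

def ubdtw(F1, F2_proj, F1_proj, F2, w=50.0):
    """Unbiased bidirectional DTW sketch: sum the two symmetric accumulated
    matrices (Eq. (12); each pair lives in a common view, Section 3.3) and
    backtrack one warp in D_c, approximating W_c^opt without the corridor."""
    Dc = rdtw_matrix(F1, F2_proj, w) + rdtw_matrix(F1_proj, F2, w)
    return backtrack(Dc)
```

Because a single path is extracted from the combined matrix, neither sequence plays the role of a privileged reference, which is the point of the formulation.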
4. Experiment and comparative analysis

We tested our method on both synthetic and real image sequences. We also implemented the RCB and STE techniques, which deal with aligning videos of similar events. While the RCB technique cannot compute sequence alignment to sub-frame accuracy, we still compare our method with it for integer alignment. Our test cases and results are presented next.

All experiments were run on a 2.4 GHz Intel Core 2 Duo CPU with 3 GB RAM using MATLAB 7.04. Excluding the preprocessing time, the execution times of the RCB, STE and the proposed technique are summarized in Table 2. The proposed technique took 0.22 s to compute the optimal alignment between two synthetic sequences of lengths 100 and 160, whereas the RCB technique took 0.20 s and the STE technique took 1.16 s. The processing times for real sequences of lengths 54 and 81 were 0.21 s, 0.19 s and 1.03 s for the proposed, RCB and STE techniques, respectively. In other words, the complexity of the proposed technique is at a similar level to that of the RCB technique, and is much smaller than that of the STE technique.

Table 2. Execution time (in seconds) of the RCB, STE and the proposed technique.

Length                              RCB     STE     Proposed technique
Synthetic data: 100 and 160 frames  0.20 s  1.16 s  0.22 s
Real data: 54 and 81 frames         0.19 s  1.03 s  0.21 s

4.1. Synthetic data evaluation

In the synthetic data evaluation, we generate planar trajectories, 100 frames long, using a pseudo-random number generator. These trajectories are then projected onto two image planes using user-defined camera projection matrices. The camera matrices are designed so that the acquisition emulates a homography, and are used only for generation purposes and not thereafter. A 60-frame-long time warp is then applied to a section of one of the trajectory projections. The RCB, STE and the proposed techniques are then applied to the synthetic trajectories to compute the alignment between them. This process is repeated on 50 different synthetic trajectories.

Figs. 4 and 5 show alignment results with simple and complex synthetic trajectories, compared against the ground truth.

Fig. 4. Performance comparison of RCB, STE, and the proposed technique for a simple synthetic trajectory. (a) Simple trajectory. (b) Alignment results.

Fig. 5. Performance comparison of RCB, STE, and the proposed technique for a complex synthetic trajectory. (a) Complex trajectory. (b) Alignment results.

The performance of the proposed technique and the RCB technique are comparable in the case of simple trajectories. However, as the complexity of the trajectory increased, the proposed technique outperformed the RCB technique.
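The sketch below mimics the synthetic protocol just described, with assumed knot positions for the 60-frame warp; it reuses the hypothetical ubdtw helper from the earlier sketch and includes the noise sweep whose measured errors appear in Table 3 below (Section 4.1 continues after the listing).

```python
import numpy as np

rng = np.random.default_rng(0)

def synthetic_pair(n1=100, n2=160):
    """Smooth random planar trajectory plus a time-warped resampling of it.
    g maps each frame j of sequence 2 to a (sub-frame) time g(j) of
    sequence 1; the piecewise-linear knots are illustrative choices."""
    traj = np.cumsum(rng.standard_normal((n1, 2)), axis=0)
    k = np.ones(5) / 5.0                         # light smoothing
    traj = np.column_stack([np.convolve(traj[:, d], k, "same") for d in range(2)])
    knots_j = np.array([0.0, 40.0, 100.0, n2 - 1.0])  # 60-frame slowed section
    knots_t = np.array([0.0, 40.0, 70.0, n1 - 1.0])
    g = np.interp(np.arange(n2), knots_j, knots_t)
    seq2 = np.column_stack([np.interp(g, np.arange(n1), traj[:, d]) for d in range(2)])
    return traj, seq2, g

def alignment_error(warp, g):
    """Mean absolute difference between computed and ground-truth frame
    correspondence (the error measure reported in Table 3)."""
    return float(np.mean([abs(g[j] - i) for i, j in warp]))

# Noise sweep as in Table 3: zero-mean Gaussian noise of variance `var` is
# added to one trajectory before alignment (identity-homography case, so
# each trajectory stands in for its own projection).
for var in [0.0, 1e-4, 1e-3, 1e-2, 1e-1]:
    errs = []
    for _ in range(50):                          # 50 random trajectories
        F1, F2, g = synthetic_pair()
        F2n = F2 + rng.normal(0.0, np.sqrt(var), F2.shape)
        errs.append(alignment_error(ubdtw(F1, F2n, F1, F2n), g))
    print(f"variance {var:g}: mean alignment error {np.mean(errs):.2f} frames")
```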
We also evaluated the effect of noisy trajectories on the RCB, STE and the proposed technique. Normally distributed, zero-mean noise with various values of variance ($\sigma^2$) was added to the synthetic feature trajectories. The results of aligning smooth and noisy trajectories with the RCB, STE and the proposed techniques are shown in Table 3, where the mean absolute difference between the actual and computed frame correspondence is reported as the alignment error. It should be noted that the STE and the proposed technique are only marginally affected by the addition of noise, whereas the performance of the RCB method degrades significantly. An example of a noisy trajectory is shown in Fig. 6.

Table 3. Alignment errors for RCB, STE and the proposed technique.

Noise ($\sigma^2$)  RCB    STE   Proposed technique
0                   5.14   3.12  1.97
0.0001              7.53   3.17  2.08
0.001               8.02   3.29  2.36
0.01                9.32   3.37  2.74
0.1                 12.45  3.45  2.91

Fig. 6. Performance evaluation using a noisy trajectory. (a) The noisy trajectory (right) is generated by adding noise with $\sigma^2 = 0.1$ to the clean trajectory (left). (b) The alignment performance obtained using the RCB, STE, and the proposed technique.

4.2. Real data evaluation

For tests on real video sequences, we used two video sequences recording the activity of lifting a coffee cup, performed by different people and captured from different views (we refer to these two videos as the LCC videos). The first video is 54 frames long, and the second video, captured from another viewpoint, is 81 frames long. We tracked the coffee cup, which represents the activity in the video sequences, to generate feature trajectories. The feature trajectories of videos LCC1 and LCC2 are shown in Fig. 7(a) and (b), respectively.

Fig. 7. Activity trajectories from videos LCC1 and LCC2. (a) Activity trajectory from video LCC1. (b) Activity trajectory from video LCC2. (c) Alignment results.

The RCB, STE and the proposed techniques were then applied to the real video data. Note that ground truth information is not available; however, we can use visual judgment to check whether the alignment is correct. Fig. 7(c) shows the results computed using the RCB, STE and the proposed technique. The rectangle in Fig. 7(c) indicates the frame range for the frames shown in Fig. 8(a)–(c). Fig. 8(a)–(c) shows some representative alignment results computed using the RCB, STE and the proposed technique, respectively. We chose the 3rd, 6th, 9th, 13th and 16th frames from video LCC2 and their corresponding frames in video LCC1. If the feature of interest, i.e. the coffee cup, is at the same position in two frames, we mark the pair as matched; otherwise, as mismatched. It is observed that there are two pairs of mismatched frames for the RCB technique (see Fig. 8(a)), whereas there is only one pair of mismatched frames for the STE technique (see Fig. 8(b)). This indicates that the existing techniques introduce erroneous alignments, whereas the alignment computed using the proposed technique (Fig. 8(c)) provides a more accurate temporal matching.

Fig. 8. Visual comparison of the alignments computed using the RCB, STE, and the proposed technique on the lifting coffee cup video sequences.
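As a small illustrative helper (ours, not the paper's), frame pairs like those displayed in Fig. 8 can be read off a computed warp path by collecting, for a query frame of LCC2, the LCC1 indices matched to it:

```python
import numpy as np

def corresponding_frame(warp, j_query):
    """LCC1 frame(s) matched to frame j_query of LCC2 by a warp path;
    averaging one-to-many matches gives a sub-frame estimate."""
    hits = [i for i, j in warp if j == j_query]
    return float(np.mean(hits)) if hits else None

# e.g., [corresponding_frame(warp, j) for j in (3, 6, 9, 13, 16)]
```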
4.3. Unidirectional vs. bidirectional alignment

In the previous sections we compared the proposed unbiased bidirectional technique against the unidirectional alignment technique proposed in [7]. The proposed technique and the RCB technique differ not only in the sub-frame alignment capability and the reduction in singularities, but also in the capability of eliminating bias. In order to highlight this bias-eliminating capability, we validate the proposed technique against unidirectional alignment. The warps computed using both the bidirectional and unidirectional techniques for two trajectories are shown in Fig. 9. The two unidirectional warps are shown as black dot-dash lines, whereas the blue dashed line is the bidirectional warp computed using the proposed technique. Note that the STE technique and the proposed technique both compute bidirectional alignments. The STE technique computes the symmetric error between two unidirectional warps by tuning the regularization factor iteratively until the error reaches its minimum, and then selects one of the warps as the final warp. In the example shown in Fig. 9, it is clear that the two unidirectional warps $W_{1,2p}$ and $W_{1p,2}$ are close to the ground truth, but not accurate enough, due to the bias toward the respective reference sequence. In the STE technique, either $W_{1,2p}$ or $W_{1p,2}$ will be selected as the final output. The proposed technique computes the warp using Eq. (12), and provides more accurate results.

Fig. 9. An illustration of unidirectional vs. bidirectional alignment.

5. Application: 4D MRI registration

One of our motivations behind this work is to build a 4D (volume + time) representation of functional events in the body using 2D planar acquisitions, specifically for the study of swallowing disorders. In our experiments, a subject lies prone inside an MRI scanner and is fed small measured quantities of water (bolus) via a system of tubing; the water is displayed as white in the MRI images because of its high hydrogen content. For each swallow, a time series of 2D images is acquired on a fixed plane. The acquisition plane is then changed, and another series of 2D images is acquired. We acquire three such video sequences, corresponding to the left, right and center MRI slice planes, as illustrated in Fig. 10(a). The MRI video sequences are subject to dynamic temporal offsets in the motion of the bolus. Also, since the acquisitions are at very low frame rates, limited by the technology to 4–7 fps, it is crucial to align the sequences to sub-frame accuracy. The proposed technique shows promising results in aligning the MRI sequences to generate a 4D representation. Implementation details of this application are discussed next.

The trailing and leading edges of the bolus are extracted from the MRI sequences using standard background separation techniques. The center of the trailing bolus is extracted using horizontal and vertical profiles, and is used to generate feature trajectories in the three sequences. After suitable event models have been computed for the trajectories, the proposed technique is applied to the video sequences to compute the alignment. The result of the alignment with the proposed technique is shown in Fig. 11, which presents a few frames from the alignment computed between the center and left MRI slices and demonstrates sub-frame alignment: frame 5 of the left MRI sequence is mapped to frame 5.5 of the center MRI sequence. Once the alignment has been computed, 4D visualization of the MRI data is carried out, with a 4D model shown in Fig. 10(b).

Fig. 10. (a) Illustration of the acquisition slices. (b) Static visualization (one time instant) of 4D synchronized MRI data.

Fig. 11. Alignment computed between the Center–Left MRI sequences by the proposed technique.
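A minimal sketch of the bolus feature extraction described above, assuming a static background frame and a fixed intensity threshold; the actual background separation and profile analysis may differ.

```python
import numpy as np

def bolus_center(frame, background, thresh=0.5):
    """Subtract a static background, threshold the bright (water) bolus,
    and take the centroid of the horizontal/vertical profiles of the
    resulting mask. Differencing and threshold are illustrative choices."""
    fg = np.clip(frame.astype(float) - background.astype(float), 0.0, None)
    mask = fg > thresh * fg.max()                 # water appears white/bright
    col_profile = mask.sum(axis=0)                # horizontal profile
    row_profile = mask.sum(axis=1)                # vertical profile
    cx = (np.arange(mask.shape[1]) * col_profile).sum() / max(col_profile.sum(), 1)
    cy = (np.arange(mask.shape[0]) * row_profile).sum() / max(row_profile.sum(), 1)
    return cx, cy

# Applying bolus_center to each frame of a slice sequence yields the discrete
# feature trajectory to which the event model of Section 3.2 is then fitted.
```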
6. Conclusions

We proposed and successfully evaluated an efficient technique to align video sequences that are related by varying temporal offsets. Our formulation of alignment as unbiased bidirectional dynamic time warping resulted in an alignment that is not biased by the choice of the reference sequence. The regularized dynamic time warping successfully reduced the occurrence of false singularities and resulted in sub-frame accuracy synchronization. Comparative analysis with a rank-constraint based technique and a symmetric transfer error based technique demonstrated a significant improvement in video alignment with our proposed technique. The proposed technique has a lower or similar computational complexity compared to existing techniques. An application of the proposed method to 4D MRI visualization was also presented. Although the proposed technique has been presented for a two-camera set-up, and its performance has been evaluated on two videos, it can easily be extended to align more than two videos.

References

[1] Y. Caspi, M. Irani, Spatio-temporal alignment of sequences, IEEE Transactions on Pattern Analysis and Machine Intelligence 24 (11) (2002) 1409–1424.
[2] M. Singh, A. Basu, M. Mandal, Event dynamics based temporal registration, IEEE Transactions on Multimedia 9 (5) (2007) 1004–1015.
[3] L. Lee, R. Romano, G. Stein, Monitoring activities from multiple video streams: establishing a common coordinate frame, IEEE Transactions on Pattern Analysis and Machine Intelligence 22 (8) (2000) 758–767.
[4] R. Hess, A. Fern, Improved video registration using non-distinctive local image features, in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition 2007, vol. 1, 2007, pp. 154–161.
[5] M.A. Giese, T. Poggio, Morphable models for the analysis and synthesis of complex motion patterns, International Journal of Computer Vision 38 (1) (2000) 59–73.
[6] D. Perperidis, R.H. Mohiaddin, D. Rueckert, Spatio-temporal free-form registration of cardiac MR image sequences, Medical Image Computing and Computer-Assisted Intervention 3216 (2004) 441–456.
[7] C. Rao, A. Gritai, M. Shah, T.F.S. Mahmood, View-invariant alignment and matching of video sequences, in: Proceedings of the International Conference on Computer Vision 2003, vol. 1, 2003, pp. 939–945.
[8] M. Singh, I. Cheng, M. Mandal, A. Basu, Optimization of symmetric transfer error for sub-frame video synchronization, in: Proceedings of European Conference on Computer Vision 2008, 2008, pp. 554–567.
[9] C. Dai, Y. Zheng, X. Li, Subframe video synchronization via 3D phase correlation, in: Proceedings of the International Conference on Image Processing 2006, vol. 1, 2006, pp. 501–504.
[10] T. Tuytelaars, L.J. VanGool, Synchronizing video sequences, in: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 1, 2004, pp. 762–768.
[11] P. Tresadern, I. Reid, Synchronizing image sequences of non-rigid objects, in: Proceedings of the 14th British Machine Vision Conference, Norwich, 9–11 September, vol. 2, 2003, pp. 629–638.
[12] R.L. Carceroni, F.L.C. Padua, G.A.M.R. Santos, K.N. Kutulakos, Linear sequence-to-sequence alignment, in: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 1, 2004, pp. 746–753.
[13] C. Lei, Y.H. Yang, Tri-focal tensor-based multiple video synchronization with subframe optimization, IEEE Transactions on Image Processing 15 (2006) 2473–2480.
[14] L. Wolf, A. Zomet, Wide baseline matching between unsynchronized video sequences, International Journal of Computer Vision 68 (1) (2006) 43–52.
[15] D.W. Pooley, M.J. Brooks, A. van den Hengel, W.
Chojnacki, A voting scheme for estimating the synchrony of moving-camera videos, in: Proceedings of the IEEE International Conference on Image Processing 2003, vol. 1, September 2003, pp. 413–416.
[16] J. Matas, O. Chum, M. Urban, T. Pajdla, Robust wide baseline stereo from maximally stable extremal regions, in: Proceedings of British Machine Vision Conference 2002, vol. 1, September 2002, pp. 384–393.
[17] R.I. Hartley, A. Zisserman, Multiple View Geometry in Computer Vision, second ed., Cambridge University Press, London, 2004, pp. 87–127.
[18] H. Sakoe, S. Chiba, Dynamic programming algorithm optimization for spoken word recognition, IEEE Transactions on Acoustics, Speech and Signal Processing 26 (1) (1978) 43–49.