Efficient video sequences alignment using unbiased bidirectional dynamic time warping

J. Vis. Commun. Image R. 22 (2011) 606–614, doi:10.1016/j.jvcir.2011.06.003
Cheng Lu a, Meghna Singh a, Irene Cheng b, Anup Basu b, Mrinal Mandal a,*

a Department of Electrical and Computer Engineering, University of Alberta, Edmonton, Alberta, Canada T6G 2V4
b Department of Computing Science, University of Alberta, Edmonton, Alberta, Canada T6G 2V4
* Corresponding author. E-mail addresses: lcheng4@ualberta.ca (C. Lu), meghna@ece.ualberta.ca (M. Singh), locheng@ualberta.ca (I. Cheng), basu@ualberta.ca (A. Basu), mandal@ece.ualberta.ca (M. Mandal).
Article history: Received 5 January 2011; Accepted 16 June 2011; Available online 30 June 2011.

Keywords: Video alignment; Video synchronization; Temporal registration; Symmetric alignment; Temporal offset; Dynamic time warping; Symmetric transfer error; MRI
Abstract

In this paper, we propose an efficient technique to synchronize video sequences of events that are acquired via uncalibrated cameras at unknown and dynamically varying temporal offsets. Unlike existing techniques that take only unidirectional alignment into consideration, the proposed technique considers symmetric alignments and computes the optimal alignment. We also achieve sub-frame accurate video alignment. The advantages of our approach are validated by tests conducted on several pairs of real and synthetic sequences, and we present qualitative and quantitative comparisons with other existing techniques. A unique application of this work in generating high-resolution 4D MRI data from multiple low-resolution MRI scans is described.

© 2011 Elsevier Inc. All rights reserved.
1. Introduction
Alignment of video sequences plays a crucial role in applications
such as super-resolution imaging [1], 3D visualization [2], robust
multi-view surveillance [3] and mosaicing [4]. Most video alignment techniques deal with video sequences of the same scene and hence assume that the temporal offset between the sequences does not change over time, i.e., a simple temporal translation. However, for applications such as video search, video comparison and enhanced video generation, we need to align video sequences from two different (but similar) scenes. In such scenarios, the temporal offset between the video sequences changes dynamically and cannot be estimated by a translational offset. Consider the problem of generating a high-resolution visualization of MRI (magnetic resonance imaging) data from multiple low-resolution videos. Due to the tradeoff between acquisition speed and image resolution in MRI, we can acquire functional events such as swallowing only at low frame rates (4–7 fps). We can, however, acquire multiple swallowing videos that differ from each other spatially and temporally. The objective then is to compute a robust spatial and
temporal alignment between the MRI videos in order to fuse them
together for better visualization. Since the video sequences are of repeated swallows, the temporal offset between the video sequences
does not remain constant, and cannot be estimated by a translational offset. In fact, the temporal offset varies dynamically with
time and depends on the stage of the swallowing process.
Several techniques have been proposed to synchronize video sequences of the same or similar scenes. Giese and Poggio [5] proposed a dynamic time warping based technique that aligns motions performed by different people via their feature trajectories. Perperidis et al. [6] proposed spline-based local warping for accurate alignment of cardiac MRI sequences. Rao et al. [7] proposed to use a rank constraint as the distance measure in a dynamic time warping technique to align multiple sequences (we refer to this technique as the RCB technique in this paper). Their technique is limited to integer-frame alignment of video sequences. However, with low acquisition rates, as in our case, sub-frame alignment is crucial. Singh et al. [8] proposed a video alignment technique based on minimization of the symmetric transfer error (STE) between two commutative alignments. The STE technique provides good performance with sub-frame accuracy, but the accuracy can be improved further since the technique does not completely eliminate the bias between the two sequences.
The main contribution of this work lies in formulating the alignment problem as unbiased bidirectional dynamic time warping (UBDTW), in which the optimal alignment is computed in the combined symmetric accumulated distance matrix. UBDTW eliminates any bias that is introduced by choosing one sequence as the reference sequence to which target sequences are warped. In addition, the proposed technique allows us to compute sub-frame accurate and view-invariant alignment of video sequences that have a dynamically varying temporal offset between them, and it also helps to reduce the occurrence of singularities¹ in the alignment.

¹ Singularities [7] occur when multiple frames in the target sequence map to the same frame in the reference sequence. Such a situation can occur when the temporal speed of an event in one sequence is significantly slower than in the other sequence. However, in most cases this slowing down of the event should lead to a sub-frame mapping and not to singularities.
The rest of this paper is organized as follows. Section 2 presents related work in video alignment and justifies the necessity for this work. In Section 3, we propose the UBDTW technique. The experimental setup and comparative results are presented in Section 4. An application of our work in 4D MRI visualization is presented in Section 5, followed by a summary and conclusions in Section 6.
2. Review of previous work
Past literature in video alignment and temporal registration can be broadly classified into two categories, video sequences of the same scene and video sequences of similar scenes, differing primarily in the assumptions made with respect to the temporal offset between sequences. It can be seen from Table 1 that our work addresses all three scenarios (view invariance, dynamic time shifts and sub-frame accuracy), while previous works have only addressed a subset of them.
2.1. Video alignment of same scene
In synchronizing videos of the same scene, the temporal offset is considered to be an affine transform [1], $t' = s\,t + \Delta t$, where $s$ is the ratio of frame rates and $\Delta t$ is a fixed translational offset. Dai et al. [9] use 3D phase correlation between video sequences, whereas Tuytelaars and VanGool [10] compute alignment by checking the rigidity of a set of five (or more) points. Tresadern and Reid [11] follow a similar approach, computing a rank-constraint based rigidity measure between four non-rigidly moving feature points. Caspi and Irani [1] recover the spatial and temporal relation between two sequences by minimizing the SSD error
over extracted trajectories that are visible in both sequences.
Carceroni et al. [12] extend [1] to align sequences based on scene
points that need to be visible only in two consecutive frames. Singh
et al. [2] also build on [1] to develop parameterized event models
of discrete trajectories and align the event models based on an SSD
measure for sub-frame alignment. However, they do not address
the problem of view-invariance and dynamic temporal offsets.
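To make the same-scene temporal model concrete, the following one-line Python sketch evaluates the affine mapping $t' = s\,t + \Delta t$; the values of s and dt are hypothetical, chosen only for illustration:

```python
def affine_time(t, s=0.5, dt=12.0):
    """Affine temporal model t' = s*t + dt for same-scene alignment.

    Hypothetical example: sequence 2 runs at twice the frame rate of
    sequence 1 (s = 0.5) with a fixed 12-frame translational offset."""
    return s * t + dt

print(affine_time(40.0))  # frame 40 of sequence 1 -> frame 32.0 of sequence 2
```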
Table 1
Review of related work. Legend: D.S. – dynamic time shift, V.I. – view invariant, S.A. – sub-frame accuracy.

Author                        Scene      D.S.  V.I.  S.A.
Caspi and Irani [1]           Same       No    Yes   Yes
Tresadern and Reid [11]       Same       No    Yes   Yes
Lei and Yang [13]             Same       No    Yes   No
Wolf and Zomet [14]           Same       No    Yes   No
Dai et al. [9]                Same       No    Yes   Yes
Singh et al. [2]              Same       No    No    Yes
Carceroni et al. [12]         Same       No    Yes   Yes
Pooley et al. [15]            Same       No    Yes   Yes
Tuytelaars and VanGool [10]   Same       No    Yes   No
Giese and Poggio [5]          Different  Yes   No    No
Perperidis et al. [6]         Different  Yes   No    No
Rao et al. [7]                Different  Yes   Yes   No
Singh et al. [8]              Different  Yes   Yes   Yes
Proposed                      Different  Yes   Yes   Yes

2.2. Video alignment of different scenes
When aligning video sequences of different scenes, albeit sequences correlated via motion, one has to factor in the dynamic temporal scale of the activities in the video sequences. Giese and Poggio [5] approach alignment of activities of different people by computing a dynamic time warp between the feature trajectories. They did not address the problem of activity sequences taken from varying viewpoints, and their approach yields a one-to-one frame correspondence. Perperidis et al. [6] attempt to locally warp cardiac MRI sequences by extending Caspi's work [1] to incorporate spline-based local alignment. Though their approach does lead to good alignment of time-varying sequences, it has two main drawbacks: (1) the computation space for spline-based registration is quite large, and the authors need to compute points of inflexion in the cardiac volume change; and (2) the alignment is still a one-to-one frame correspondence and not sub-frame accurate. The RCB technique proposed by Rao et al. [7] will not work for planar motion activity, since the fundamental matrix is singular in this case.

Singh et al. [8] proposed an STE-based video alignment technique. In a two-video framework, the STE is calculated in two steps. In the first step, one video is considered as the reference, the second video is aligned to it using a DTW technique, and a warping function is obtained. In the second step, the roles are swapped: the second video is considered as the reference, the first video is aligned to it using the DTW technique, and a second warping function is obtained. The STE is calculated based on the difference between the two warping functions; two well-aligned videos are expected to have a small STE. The STE technique provides better performance than the RCB technique. However, it selects one of the two symmetric warps as the final alignment, which may not be accurate. In addition, due to its iterative STE calculation, its computational complexity is high compared to existing techniques.
3. Proposed technique
Suppose that two cameras C1 and C2 view two independent
scenes of similar activities, as shown in Fig. 1.
Camera 1 views 3D scene X(X1, Y1, Z1, t1) in View 1 and acquires video I1(x1, y1, t1). Camera 2 views another 3D scene X(X2, Y2, Z2, t2) in View 2 and acquires video I2(x2, y2, t2). Note that the motions in these two scenes are similar but have a dynamic time shift. In this paper we assume that the scene is planar, and the homography matrix H is used to represent the spatial relationship between the two views. These assumptions are not limiting, since the proposed technique can be adapted to account for epipolar geometry for non-planar scenes, and robust correspondence techniques can be used for wide-baseline cameras [16]. Additionally, we assume that a single feature trajectory of interest is available to us² such that the beginning and end points of the activity are marked in the trajectory, similar to the assumption made in [7,8].

² In typical videos, multiple object trajectories will be generated, and an additional task of the synchronization technique will be to find corresponding feature trajectories in the multiple video sequences. This is an open problem in vision research and one that we do not address in this paper.

Fig. 1. Illustration of two different scenes acquired using two distinct cameras. The projections of the scenes onto the reciprocal cameras are also shown.
The schematic of the proposed alignment technique is shown in
Fig. 2. The technique consists of five modules which are explained
in the following sections.
3.1. Feature trajectories extraction
To correlate two video sequences and represent the motions between them, features are extracted and tracked from the two video sequences. A robust view-invariant tracker is used to generate feature trajectories F1(x1, y1, t1) and F2(x2, y2, t2) from videos I1(x1, y1, t1) and I2(x2, y2, t2), respectively. The feature trajectories are illustrated in Fig. 1. On their own, these feature trajectories are discrete representations of the event in the scene. However, to align video sequences to sub-frame accuracy, we need to interpolate between the discrete representations. Most techniques use linear interpolation between the discrete points; we instead choose the technique proposed by Singh et al. [2] to generate continuous models from the discrete points.
3.2. Event models calculation
We derive event models from the discrete features as proposed
by Singh et al. [2]. An event model is a global representation of the
activity based on all the extracted feature points and can be tailored to address varying sub-frame alignment needs. Using event
models instead of discrete feature points allows the proposed technique to adapt to increasing complexity of event dynamics, which
is demonstrated in our test cases.
Given a discrete trajectory F(x, y, t_n), we compute linear model parameters b such that a continuous estimate $\hat{F}(x, y, t)$ can be derived in the least-squares sense, as shown in Eqs. (1) and (2), where the hat symbolizes estimation and e represents the error term.

$F = Fb + e$  (1)

$\hat{F} = F\hat{b}$  (2)

In a weighted least-squares sense, with a weight matrix W, the best linear estimate of b is given by:

$\hat{b} = (F^{T}WF)^{-1}F^{T}WF$  (3)
The weights are computed iteratively, which minimizes the effect of outliers on the event model estimation; for details see [2]. Using event models has two additional advantages. Firstly, trajectory information for non-integer frames is no longer based on linearly interpolated values from adjacent frames, and hence videos with low acquisition rates can also be accurately synchronized to non-integer values, as shown in [2]. Secondly, computation of the event models inherently smooths the trajectory to remove outliers due to noise or poor tracking. The continuous feature trajectories are represented as $\bar{F}_1$ and $\bar{F}_2$, and their temporal resolutions are higher than those of the discrete trajectories F1 and F2, respectively.
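As a rough illustration of the weighted estimate in Eq. (3), the sketch below fits a continuous event model to one coordinate of a discrete trajectory by iteratively reweighted least squares. The polynomial basis, the inverse-residual weighting, and the function name are assumptions made for illustration; the actual basis and weighting scheme are those of [2]:

```python
import numpy as np

def fit_event_model(t, y, degree=5, iters=10, eps=1e-6):
    """Fit one trajectory coordinate y(t) with a polynomial event model,
    down-weighting outliers by iteratively reweighted least squares."""
    X = np.vander(t, degree + 1)             # polynomial design matrix
    w = np.ones_like(y)                      # initial weights (W = I)
    for _ in range(iters):
        W = np.diag(w)
        # Weighted least-squares estimate, cf. Eq. (3).
        b = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
        r = np.abs(y - X @ b)                # residuals
        w = 1.0 / (r + eps)                  # outliers get small weights
    return b

# The continuous model can then be evaluated at sub-frame times.
t = np.arange(20, dtype=float)
y = np.sin(t / 3.0) + 0.05 * np.random.randn(20)
b = fit_event_model(t, y)
y_sub_frame = np.polyval(b, 5.5)             # value between frames 5 and 6
```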
Fig. 2. Schematic of the proposed technique: the misaligned video sequences pass through feature trajectories extraction, event models calculation, bidirectional projections calculation, computation of symmetric alignments, and computation of the optimal alignment, yielding the aligned video sequences.
3.3. Bidirectional projections calculation
In our problem formulation, Cameras 1 and 2 view two independent scenes, as shown in Fig. 1. However, if Camera 2 had witnessed the first scene X(x1, y1, z1, t1), it would have recorded the sequence $I_2(x_1', y_1', t_1)$. We call $I_2(x_1', y_1', t_1)$ the projection of $I_1(x_1, y_1, t_1)$ (Eq. (4)). The spatial relationship from Scene 1 to Scene 2, $H_{1\to2}$ (and $H_{2\to1}$ vice versa), is independent of the scene structure and can be computed from X(x1, y1, z1, t1) ∩ X(x2, y2, z2, t2); the homography H is used in this paper, but the fundamental matrix can also be used in our technique to handle the more general case. We use the Direct Linear Transform (DLT) technique to initialize the non-linear least-squares (Levenberg–Marquardt) computation of the homography [17]. Similarly, the projection of the second scene in Camera 1 can be expressed as $I_1(x_2', y_2', t_2)$ (Eq. (5)).

$I_2(x_1', y_1', t_1) = H_{1\to2}\, I_1(x_1, y_1, t_1)$  (4)

$I_1(x_2', y_2', t_2) = H_{2\to1}\, I_2(x_2, y_2, t_2)$  (5)

To generalize the alignment method to multi-modal alignment and to allow for sub-frame alignment, we compute the projections of the event models $\bar{F}$ derived from the two scenes, instead of the whole frames $I_1$ and $I_2$, as follows:

$\bar{F}_2^p(x_1', y_1', t_1) = H_{1\to2}\, \bar{F}_1(x_1, y_1, t_1)$  (6)

$\bar{F}_1^p(x_2', y_2', t_2) = H_{2\to1}\, \bar{F}_2(x_2, y_2, t_2)$  (7)
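As an illustration of this module, the sketch below estimates a homography from point correspondences and projects a trajectory from View 1 into View 2, as in Eq. (6). It uses OpenCV, whose findHomography performs a DLT-style estimate with non-linear refinement; the correspondence coordinates and the synthetic trajectory are hypothetical:

```python
import numpy as np
import cv2

# Hypothetical point correspondences between the two views.
pts1 = np.float32([[10, 20], [200, 30], [220, 240], [15, 250]])
pts2 = np.float32([[12, 25], [195, 28], [210, 235], [20, 248]])

# Estimate H(1->2) from the correspondences (least squares over all points).
H_12, _ = cv2.findHomography(pts1, pts2, method=0)

# A synthetic continuous trajectory F1 sampled in View 1 coordinates.
xs = np.linspace(0, 200, 50)
traj1 = np.float32(np.stack([xs, 100 + 0.5 * xs], axis=1)).reshape(-1, 1, 2)

# Project F1 into View 2 to obtain F2^p, cf. Eq. (6).
traj2_p = cv2.perspectiveTransform(traj1, H_12)
```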
3.4. Computation of symmetric alignments
Once we obtain the two pairs of feature trajectories, $(\bar{F}_1, \bar{F}_2^p)$ and $(\bar{F}_1^p, \bar{F}_2)$, which lie in the same projection space, we compute the symmetric warps $W_{1,2p}$ and $W_{1p,2}$ for these two pairs of trajectories using regularized dynamic time warping (RDTW). We construct a warp as follows:

$W = w_1, w_2, w_3, \ldots, w_L, \quad \max(L_1, L_2) \le L < L_1 + L_2$

where $L_1$ and $L_2$ are the lengths of the trajectories $\bar{F}_1$ and $\bar{F}_2$, respectively. The $k$th element of the warp $W$ is $w_k = (i_k, j_k)$, where $i_k$ and $j_k$ are the time indices of $\bar{F}_1$ and $\bar{F}_2$, respectively. The warp satisfies the boundary, continuity and monotonicity conditions explained in [18]. Note that many warps between $\bar{F}_1$ and $\bar{F}_2$ can be computed. The optimal warp is the minimum-distance warp, where the distance of a warp is defined as follows:
$\mathrm{Dist}(W) = \sum_{k=1}^{L} \mathrm{Dist}[\bar{F}(i_k), \bar{F}^p(j_k)]$  (8)

where $\mathrm{Dist}[\bar{F}(i_k), \bar{F}^p(j_k)]$ is the distance between the two trajectory values at the time indices $(i_k, j_k)$ given by the $k$th element of the warp. The following regularized distance metric is proposed here to compute the distance of the warp:

$\mathrm{Dist}[\bar{F}(i), \bar{F}^p(j)] = \|\bar{F}(i) - \bar{F}^p(j)\|^2 + w \cdot \mathrm{Reg}$  (9)

$\mathrm{Reg} = \|\partial\bar{F}(i) - \partial\bar{F}^p(j)\|^2 + \|\partial^2\bar{F}(i) - \partial^2\bar{F}^p(j)\|^2$  (10)
where $w$ is the weight (normally, $w = 50$), and $\partial\bar{F}$ and $\partial^2\bar{F}$ are the first and second derivatives of $\bar{F}$, computed numerically as follows:

$\partial\bar{F}(i) = \dfrac{[\bar{F}(i) - \bar{F}(i-1)] + [\bar{F}(i+1) - \bar{F}(i-1)]/2}{2}$

$\partial^2\bar{F}(i) = \dfrac{[\partial\bar{F}(i) - \partial\bar{F}(i-1)] + [\partial\bar{F}(i+1) - \partial\bar{F}(i-1)]/2}{2}$
The purpose of the regularization term is twofold: (i) it allows
us to factor in a smoothness constraint on the warping and (ii) it
also reduces the occurrence of singularities in the alignment.
In order to find the optimal warp, an $L \times L^p$ accumulated distance matrix is created, whose elements are given by:

$D(i, j) = \mathrm{Dist}[\bar{F}(i), \bar{F}^p(j)] + \min(\phi)$  (11)

where

$\phi = [D(i-1, j),\; D(i-1, j-1),\; D(i, j-1)]$

A greedy search is then employed in the accumulated distance matrix to find the optimal warp $W$ such that $\mathrm{Dist}(W)$ is minimum. Using the RDTW described above, we obtain the symmetric warps (alignments) $W_{1,2p}$ and $W_{1p,2}$ for $(\bar{F}_1, \bar{F}_2^p)$ and $(\bar{F}_1^p, \bar{F}_2)$, respectively. However, these two warps do have bias: $W_{1,2p}$ is biased toward $\bar{F}_1$, whereas $W_{1p,2}$ is biased toward $\bar{F}_2$.
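The following is a minimal NumPy sketch of the RDTW step above, assuming the trajectories are arrays of 2D points; the function name, the derivative boundary handling, and the backtracking details are illustrative rather than the authors' MATLAB implementation:

```python
import numpy as np

def rdtw(F, Fp, w=50.0):
    """Sketch of regularized DTW between trajectories F (L x d) and
    Fp (Lp x d): squared-distance cost plus derivative regularization
    (Eqs. (9)-(10)), accumulated as in Eq. (11), then backtracked."""
    def deriv(X):
        d = np.zeros_like(X)
        d[1:-1] = ((X[1:-1] - X[:-2]) + (X[2:] - X[:-2]) / 2.0) / 2.0
        d[0], d[-1] = d[1], d[-2]            # replicate at the boundaries
        return d

    dF, dFp = deriv(F), deriv(Fp)
    d2F, d2Fp = deriv(dF), deriv(dFp)
    L, Lp = len(F), len(Fp)

    # Pairwise regularized cost, Eqs. (9)-(10).
    cost = np.zeros((L, Lp))
    for i in range(L):
        reg = (np.sum((dF[i] - dFp) ** 2, axis=1)
               + np.sum((d2F[i] - d2Fp) ** 2, axis=1))
        cost[i] = np.sum((F[i] - Fp) ** 2, axis=1) + w * reg

    # Accumulated distance matrix, Eq. (11).
    D = np.full((L, Lp), np.inf)
    D[0, 0] = cost[0, 0]
    for i in range(L):
        for j in range(Lp):
            if i == 0 and j == 0:
                continue
            phi = [D[i - 1, j] if i > 0 else np.inf,
                   D[i - 1, j - 1] if i > 0 and j > 0 else np.inf,
                   D[i, j - 1] if j > 0 else np.inf]
            D[i, j] = cost[i, j] + min(phi)

    # Backtrack from (L-1, Lp-1) to (0, 0) to recover the warp.
    i, j, warp = L - 1, Lp - 1, [(L - 1, Lp - 1)]
    while i > 0 or j > 0:
        steps = [(a, b) for a, b in [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
                 if a >= 0 and b >= 0]
        i, j = min(steps, key=lambda s: D[s])
        warp.append((i, j))
    return warp[::-1], D
```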
3.5. Computation of optimal alignment
Note that we calculated the symmetric warps $W_{1,2p}$ and $W_{1p,2}$, and the corresponding distance matrices $D_{1,2p}$ and $D_{1p,2}$, in the previous step. However, these warps do have bias. In order to eliminate the bias, we first combine the distance matrices $D_{1,2p}$ and $D_{1p,2}$ into a new distance matrix, named the combined symmetric accumulated distance matrix $D_c$, as follows:

$D_c(i, j) = D_{1,2p}(i, j) + D_{1p,2}(i, j)$  (12)

Once $D_c$ is obtained, global constraints based on $W_{1,2p}$ and $W_{1p,2}$ are added onto this matrix. Denote by $W_c$ the warps under the global constraint; the constraints are formulated as follows:

$\min(W_{1,2p}, W_{1p,2}) \le W_c \le \max(W_{1,2p}, W_{1p,2})$

$0 < L_c < \min(L_{1,2p}, L_{1p,2})$

where $L_c$, $L_{1,2p}$ and $L_{1p,2}$ are the lengths of the warps $W_c$, $W_{1,2p}$ and $W_{1p,2}$, respectively. Finally, the warp $W_c$ which satisfies the following equation is chosen as the final optimal warp (alignment) for the video sequences $I_1$ and $I_2$:

$W_{c,opt} = \arg\min_{W_c} \mathrm{Dist}(W_c)$  (13)

Fig. 3 gives an intuitive explanation of the proposed optimal warp calculation. The grid represents the distance matrix $D_c$ computed by adding $D_{1,2p}$ and $D_{1p,2}$. The horizontal and vertical axes represent the indices of the trajectories $\bar{F}_2$ and $\bar{F}_1$, respectively. The two bold lines represent the symmetric warps $W_{1,2p}$ and $W_{1p,2}$, and the gray area represents the search area they constrain. The warp $W_{c,opt}$ that lies inside the global constraint is taken as the final unbiased warp.

Fig. 3. Illustration of the global constraint on the combined distance matrix $D_c$ and the optimal alignment $W_{c,opt}$ obtained within the constraint area.

A pseudocode of the proposed technique is given below (a runnable sketch of the final step follows):

Pseudocode of the proposed technique
Extract feature trajectories F1 and F2
Compute event models F̄1 and F̄2
Compute the spatial relationship (homography H is used here)
Project the event models to derive the projections F̄1p and F̄2p
Set w = 50
Compute regularized warp W1,2p from (F̄1, F̄2p)
Compute regularized warp W1p,2 from (F̄1p, F̄2)
Compute the combined symmetric accumulated distance matrix Dc
Compute the optimal warp Wc,opt under the global constraint
Report Wc,opt as the final alignment
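Below is a simplified Python sketch of this unbiased bidirectional step, assuming the warps are lists of (i, j) index pairs such as those returned by the rdtw sketch above. For brevity it backtracks greedily through the masked combined matrix rather than re-running the full accumulated-distance search; it is an approximation of the constrained minimum-distance search, not the authors' implementation:

```python
import numpy as np

def ubdtw_optimal_warp(D12p, D1p2, W12p, W1p2):
    """Add the two accumulated distance matrices (Eq. (12)), keep only
    the band between the two symmetric warps (the gray area of Fig. 3),
    and backtrack a minimum-cost monotone path through it (cf. Eq. (13))."""
    Dc = D12p + D1p2                         # combined symmetric matrix
    L, Lp = Dc.shape

    # Per-row column bounds of the constraint band spanned by both warps.
    lo = np.full(L, Lp, dtype=int)
    hi = np.full(L, -1, dtype=int)
    for i, j in list(W12p) + list(W1p2):
        lo[i] = min(lo[i], j)
        hi[i] = max(hi[i], j)

    cols = np.arange(Lp)[None, :]
    banded = np.where((cols >= lo[:, None]) & (cols <= hi[:, None]),
                      Dc, np.inf)

    # Greedy monotone backtracking inside the constraint band.
    i, j, warp = L - 1, Lp - 1, [(L - 1, Lp - 1)]
    while i > 0 or j > 0:
        steps = [(a, b) for a, b in [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
                 if a >= 0 and b >= 0]
        i, j = min(steps, key=lambda s: banded[s])
        warp.append((i, j))
    return warp[::-1]
```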
4. Experiment and comparative analysis
We tested our method on both synthetic and real image sequences. We also implemented the RCB and STE techniques, which deal with aligning videos of similar events. While the RCB technique cannot compute sequence alignment to sub-frame accuracy, we still compare our method with it for integer alignment. Our test cases and results are presented next. All experiments were run on a 2.4 GHz Intel Core 2 Duo CPU with 3 GB RAM using MATLAB 7.04.

Excluding the preprocessing time, the execution times of the RCB, STE and the proposed technique are summarized in Table 2. The proposed technique took 0.22 s to compute the optimal alignment between two synthetic sequences of lengths 100 and 160, whereas the RCB technique took 0.20 s and the STE technique took 1.16 s. The processing times for real sequences of lengths 54 and 81 were 0.21 s, 0.19 s, and 1.03 s for the proposed, RCB, and STE techniques, respectively. In other words, the complexity of the proposed technique is at a similar level to that of the RCB technique, and is much smaller than that of the STE technique.

Table 2
Execution time (in seconds) of the RCB, STE and the proposed technique.

Length                               RCB     STE     Proposed technique
Synthetic data: 100 and 160 frames   0.20 s  1.16 s  0.22 s
Real data: 54 and 81 frames          0.19 s  1.03 s  0.21 s
4.1. Synthetic data evaluation
In the synthetic data evaluation, we generate planar trajectories, 100 frames long, using a pseudo-random number generator. These trajectories are then projected onto two image planes using user-defined camera projection matrices. The camera matrices are designed so that the acquisition emulates a homography, and are used only for generation purposes and not thereafter. A 60-frame-long time warp is then applied to a section of one of the trajectory projections. The RCB, STE and the proposed techniques are then applied to the synthetic trajectories to compute the alignment between them. This process is repeated on 50 different synthetic trajectories (a rough sketch of the generation protocol is given below). Figs. 4 and 5 show alignment results with simple and complex synthetic trajectories. The performance of the proposed technique and the RCB technique are comparable in the case of simple trajectories. However, as the complexity of the trajectory increases, the proposed technique outperforms the RCB technique.

Fig. 4. Performance comparison of the RCB, STE, and the proposed technique for a simple synthetic trajectory: (a) simple trajectory; (b) alignment results (legend: ground truth, proposed technique, STE technique, RCB technique).

Fig. 5. Performance comparison of the RCB, STE, and the proposed technique for a complex synthetic trajectory: (a) complex trajectory; (b) alignment results.
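The following Python sketch illustrates the synthetic protocol; the homographies, the warp shape, and the noise level are hypothetical stand-ins for the generator actually used:

```python
import numpy as np

rng = np.random.default_rng(0)

# A planar trajectory, 100 frames long, from a pseudo-random generator.
t = np.arange(100)
traj = np.cumsum(rng.standard_normal((100, 2)), axis=0)

# Project onto two image planes with user-defined (hypothetical) matrices
# designed so that the acquisition emulates a homography.
H1 = np.array([[1.0, 0.02, 5.0], [0.01, 1.0, -3.0], [1e-4, 0.0, 1.0]])
H2 = np.array([[0.98, -0.03, 2.0], [0.02, 1.05, 4.0], [0.0, 1e-4, 1.0]])

def project(H, pts):
    h = np.hstack([pts, np.ones((len(pts), 1))]) @ H.T
    return h[:, :2] / h[:, 2:3]

view1, view2 = project(H1, traj), project(H2, traj)

# Apply a 60-frame non-linear time warp to a section of one projection
# by resampling it at warped time instants (yielding a 160-frame sequence).
warped_t = np.concatenate([t[:20],
                           20 + 60 * np.sin(np.linspace(0, np.pi / 2, 140))])
view2_warped = np.stack([np.interp(warped_t, t, view2[:, k])
                         for k in range(2)], axis=1)

# Zero-mean Gaussian noise of variance sigma^2 (swept in Table 3).
view2_noisy = view2_warped + rng.normal(0.0, np.sqrt(0.01), view2_warped.shape)
```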
We also evaluated the effect of noisy trajectories on the RCB, STE, and the proposed technique. Normally distributed, zero-mean noise with various values of variance (σ²) was added to the synthetic feature trajectories. The alignment results for smooth and noisy trajectories with the RCB, STE, and the proposed techniques are shown in Table 3, where the mean absolute difference between the actual and computed frame correspondences is reported as the alignment error. It should be noted that the STE and the proposed technique are only marginally affected by the addition of noise, whereas the performance of the RCB method degrades significantly. An example of a noisy trajectory is shown in Fig. 6.

Table 3
Alignment errors for the RCB, STE and the proposed technique.

Noise (σ²)   RCB     STE    Proposed technique
0            5.14    3.12   1.97
0.0001       7.53    3.17   2.08
0.001        8.02    3.29   2.36
0.01         9.32    3.37   2.74
0.1          12.45   3.45   2.91

Fig. 6. Performance evaluation using a noisy trajectory. (a) The noisy trajectory (right) is generated by adding noise with σ² = 0.1 to the clean trajectory (left). (b) Alignment performance obtained using the RCB, STE, and the proposed technique.
4.2. Real data evaluation
For tests on real video sequences, we used two video sequences recording the activity of lifting a coffee cup, performed by different people and captured from different views (we refer to these as the LCC videos). The first video is 54 frames long; the second video is 81 frames long and captured from another viewpoint. We tracked the coffee cup, which represents the activity in the video sequences, to generate the feature trajectories. The feature trajectories of videos LCC1 and LCC2 are shown in Fig. 7(a) and (b), respectively. The RCB, STE, and the proposed techniques were then applied to the real video data. Note that ground truth information is not available; however, visual judgment can be used to check whether the alignment is correct. Fig. 7(c) shows the results computed using the RCB, STE, and the proposed technique. The rectangle in Fig. 7(c) indicates the range of the frames shown in Fig. 8(a)–(c). Fig. 8(a)–(c) shows some representative alignment results computed using the RCB, STE, and the proposed technique, respectively. We chose the 3rd, 6th, 9th, 13th and 16th frames from video LCC2 and their corresponding frames in video LCC1. If the feature of interest, i.e., the coffee cup, is at the same position in two frames, we mark the pair as matched; otherwise, as mismatched. It is observed that there are two pairs of mismatched frames for the RCB technique (see Fig. 8(a)), whereas there is only one pair of mismatched frames for the STE technique (see Fig. 8(b)). This indicates that the existing techniques introduce erroneous alignments, whereas the alignment computed using the proposed technique (Fig. 8(c)) provides a more accurate temporal matching.

Fig. 7. Activity trajectories from videos LCC1 and LCC2: (a) activity trajectory from video LCC1; (b) activity trajectory from video LCC2; (c) alignment results.

Fig. 8. Visual comparison of the alignments computed using the RCB, STE, and the proposed technique on the lifting coffee cup video sequences.
4.3. Unidirectional vs. bidirectional alignment
In the previous sections we compared the proposed unbiased bidirectional technique against the unidirectional alignment technique proposed in [7]. The proposed technique and the RCB technique differ not only in sub-frame alignment capability and in the reduction of singularities, but also in the ability to eliminate bias. In order to highlight this advantage, we validate the proposed technique against unidirectional alignment. The warps computed using both the bidirectional and the unidirectional techniques for two trajectories are shown in Fig. 9; the two unidirectional warps are shown as black dot-dash lines, whereas the blue dashed line is the bidirectional warp computed using the proposed technique.

Note that the STE technique and the proposed technique both compute bidirectional alignments. The STE technique computes the symmetric error between the two unidirectional warps by tuning the regularization factor iteratively until the error reaches its minimum, and then selects one of the warps as the final warp. In the example shown in Fig. 9, the two unidirectional warps $W_{1,2p}$ and $W_{1p,2}$ are close to the ground truth, but are not accurate enough due to the bias towards the respective reference sequence. In the STE technique, either $W_{1,2p}$ or $W_{1p,2}$ would be selected as the final output. The proposed technique instead computes the warp using Eq. (12) and provides more accurate results.

Fig. 9. An illustration of unidirectional vs. bidirectional alignment (legend: ground truth, W1,2p, W1p,2, proposed technique).

5. Application: 4D MRI registration
One of our motivations behind this work is to build a 4D (volume + time) representation of functional events in the body using 2D planar acquisitions, specifically for the study of swallowing disorders. In our experiments, a subject lies prone inside an MRI scanner and is fed small measured quantities of water (bolus) via a system of tubing (water appears white in the MRI images because of its high hydrogen content). For each swallow, a time series of 2D images is acquired on a fixed plane. The acquisition plane is then changed and another series of 2D images is acquired. We acquire three such video sequences corresponding to the left, right and center MRI slice planes, as illustrated in Fig. 10(a). The MRI video sequences are subject to dynamic temporal offsets in the motion of the bolus. Also, since the acquisitions are at very low frame rates, limited by the technology to 4–7 fps, it is crucial to align the sequences to sub-frame accuracy. The proposed technique shows promising results in aligning the MRI sequences to generate a 4D representation. Implementation details of this application are discussed next.

Fig. 10. (a) Illustration of the acquisition slices. (b) Static visualization (one time instant) of the 4D synchronized MRI data.
The trailing and leading edges of the bolus are extracted from the MRI sequences using standard background separation techniques. The center of the trailing bolus is extracted using horizontal and vertical profiles, and is used to generate feature trajectories in the three sequences (see the sketch below). After suitable event models have been computed for the trajectories, the proposed technique is applied to the video sequences to compute the alignment. The alignment results obtained with the proposed technique are shown in Fig. 11, which presents a few frames from the alignment computed between the center and left MRI slices and demonstrates sub-frame alignment: frame 5 of the left MRI sequence is mapped to frame 5.5 of the center MRI sequence.

Fig. 11. Alignment computed between the center and left MRI sequences by the proposed technique.

Once the alignment has been computed, 4D visualization of the MRI data is carried out, with a 4D model shown in Fig. 10(b).
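A minimal sketch of the bolus-center feature extraction described above is given below; the threshold value and the function name are illustrative, and the paper's implementation details may differ:

```python
import numpy as np

def bolus_center(frame, background, thresh=50):
    """Locate the bolus center in one MRI frame: background separation
    (the water bolus appears bright), then centroids of the horizontal
    and vertical profiles of the foreground mask."""
    fg = np.clip(frame.astype(int) - background.astype(int), 0, 255)
    mask = fg > thresh
    if not mask.any():
        return None                           # no bolus visible yet
    col_profile = mask.sum(axis=0)            # per-column foreground counts
    row_profile = mask.sum(axis=1)            # per-row foreground counts
    x = np.average(np.arange(mask.shape[1]), weights=col_profile)
    y = np.average(np.arange(mask.shape[0]), weights=row_profile)
    return x, y
```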
6. Conclusions
We proposed and successfully evaluated an efficient technique to align video sequences that are related by varying temporal offsets. Our formulation of alignment as unbiased bidirectional dynamic time warping resulted in an alignment that is not biased by the choice of the reference sequence. The regularized dynamic time warping successfully reduced the occurrence of false singularities and resulted in sub-frame accurate synchronization. Comparative analysis with a rank-constraint based technique and a symmetric transfer error based technique demonstrated a significant improvement in video alignment with our proposed technique, which also has lower or similar computational complexity compared to the existing techniques. An application of the proposed method in 4D MRI visualization was also presented. Although the proposed technique has been presented for a two-camera set-up and evaluated on pairs of videos, it can easily be extended to align more than two videos.
References
[1] Y. Caspi, M. Irani, Spatio-temporal alignment of sequences, IEEE Transactions
on Pattern Analysis and Machine Intelligence 24 (11) (2002) 1409–1424.
[2] M. Singh, A. Basu, M. Mandal, Event dynamics based temporal registration,
IEEE Transactions on Multimedia 9 (5) (2007) 1004–1015.
[3] L. Lee, R. Romano, G. Stein, Monitoring activities from multiple video streams:
establishing a common coordinate frame, IEEE Transactions on Pattern
Analysis and Machine Intelligence 22 (8) (2000) 758–767.
[4] R. Hess, A. Fern, Improved video registration using non-distinctive local image features, in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition 2007, vol. 1, 2007, pp. 154–161.
[5] M.A. Giese, T. Poggio, Morphable models for the analysis and synthesis of
complex motion patterns, International Journal of Computer Vision 38 (1)
(2000) 59–73.
[6] D. Perperidis, R.H. Mohiaddin, D. Rueckert, Spatio-temporal free-form
registration of cardiac MR image sequences, Medical Image Computing and
Computer-Assisted Intervention 3216 (2004) 441–456.
[7] C. Rao, A. Gritai, M. Shah, T.F.S. Mahmood, View-invariant alignment and
matching of video sequences, in: Proceedings of the International Conference
on Computer Vision 2003, vol. 1, pp. 939–945, 2003.
[8] M. Singh, I. Cheng, M. Mandal, A. Basu, Optimization of symmetric transfer
error for sub-frame video synchronization, in: Proceedings of European
Conference on Computer Vision 2008, pp. 554–567, 2008.
[9] C. Dai, Y. Zheng, X. Li, Subframe video synchronization via 3D phase
correlation, in: Proceedings of the International Conference on Image
Processing 2006, vol. 1, pp. 501–504, 2006.
[10] T. Tuytelaars, L.J. VanGool, Synchronizing video sequences, in: IEEE Computer
Society Conference on Computer Vision and Pattern Recognition, vol. 1, 2004,
pp. 762–768.
[11] P. Tresadern, I. Reid, Synchronizing image sequences of non-rigid objects, in:
Proceedings of the 14th British Machine Vision Conference, Norwich, 9–11
September, vol. 2, pp. 629–638, 2003.
[12] R.L. Carceroni, F.L.C. Padua, G.A.M.R. Santos, K.N. Kutulakos, Linear sequence-to-sequence alignment, in: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 1, 2004, pp. 746–753.
[13] C. Lei, Y.H. Yang, Tri-focal tensor-based multiple video synchronization with subframe optimization, IEEE Transactions on Image Processing 15 (2006) 2473–2480.
[14] L. Wolf, A. Zomet, Wide baseline matching between unsynchronized video
sequences, International Journal of Computer Vision 1 (2006) 43–52.
[15] D.W. Pooley, M.J. Brooks, A. van den Hengel, W. Chojnacki, A voting scheme for
estimating the synchrony of moving-camera videos, in: Proceedings of the
IEEE International Conference on Image Processing 2003, vol. 1, September
2003, pp. 413–416.
[16] J. Matas, O. Chum, M. Urban, T. Pajdla, Robust wide baseline stereo from maximally stable extremal regions, in: Proceedings of British Machine Vision Conference 2002, vol. 1, September 2002, pp. 384–393.
[17] R.I. Hartley, A. Zisserman, Multiple View Geometry in Computer Vision, second
ed., Cambridge University Press, London, 2004. pp. 87–127.
[18] H. Sakoe, S. Chiba, Dynamic programming algorithm optimization for spoken
word recognition, IEEE Transactions on Acoustics, Speech and Signal
Processing 26 (1) (1978) 43–49.